1.1. What is Machine Learning?
Before we dive into deeper ML/AI topics, we need to understand first what machine learning is and what it encompasses as a field.
Over the past five years or so, we’ve seen companies and individuals jump on the AI trend and adopt it to transform how they work. You may think it’s a bandwagon, but I think it’s also a way for businesses to harness computational power to achieve their goals.
Here at Chemolytics, our goal is to bring machine learning into chemistry and pave the way for a new kind of science. ML matters in science because it lets us analyze data far faster than manual methods allow, and often with high accuracy. We can feed algorithms data and let them find patterns that help advance the field.
In addition, we can train algorithms to predict how a reaction will proceed, estimate the properties of new materials, and propose routes to new compounds, all with far less trial and error. That’s the power of machine learning in the context of chemistry and experimental science.
Traditional Programming vs Machine Learning
Some people assume that traditional programming and machine learning are the same thing. They work with the same ingredients (data, rules, and outputs), but they differ sharply in where the rules come from.
In traditional programming, the process usually starts with a developer explicitly writing rules or logic that the computer must follow. You will need to provide both the data and the human-defined instructions (algorithms), and the system will output the results. For instance, if you are building a spam detection system, you will need to manually define the rules like “if the subject line contains the word ‘free’, mark it as spam”.
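The hand-written rule from the spam example can be sketched in a few lines of Python. The function name `is_spam_by_rule` and the sample subject lines are illustrative assumptions, not part of any real filter.

```python
def is_spam_by_rule(subject: str) -> bool:
    """Hand-written rule: flag any subject line containing the word 'free'."""
    words = subject.lower().split()
    # Strip surrounding punctuation so that "FREE!!!" still matches.
    return any(word.strip(".,!?:;\"'") == "free" for word in words)


print(is_spam_by_rule("Get your FREE trial now!"))             # True
print(is_spam_by_rule("Minutes from Monday's group meeting"))  # False
print(is_spam_by_rule("Are you free for lunch?"))              # True (a false positive)
```

The last call shows the weakness of explicit rules: the logic does exactly what it was told, even when the word "free" appears in a harmless message.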
In contrast, machine learning flips this paradigm. In ML, you will provide the data and the outcomes (labels), and the system will learn the rules or patterns on its own. Using the same spam detection system, you might feed the algorithm thousands of labeled emails, whether they’re spam or not. It would then learn what characteristics are statistically associated with spam without any explicit instructions.
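As a rough illustration of learning from labeled examples, here is a minimal sketch of a naive Bayes text classifier written from scratch. Naive Bayes is one common choice for this task and is an assumption here, as are the six invented training messages and the helper names `train_naive_bayes` and `predict`. A real system would train on thousands of emails.

```python
import math
from collections import Counter

# Tiny invented training set: (subject line, label).
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("limited offer win cash now", "spam"),
    ("group meeting moved to friday", "ham"),
    ("draft of the nmr paper attached", "ham"),
    ("are you free for lunch on friday", "ham"),
]


def train_naive_bayes(examples):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the higher log-probability (Laplace-smoothed)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen during training
                score += math.log(
                    (word_counts[label][word] + 1) / (n_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict("claim your free cash prize", model))  # spam
print(predict("are you free for lunch", model))      # ham
```

No rule about the word "free" was written anywhere. The classifier weighs it against the other words in the message, which is why the lunch invitation is not flagged.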
| Traditional Programming | Machine Learning |
| --- | --- |
| Rules + Data → Output | Data + Output → Rules (model) |
| Requires explicit logic | Learns patterns from examples |
| Good for fixed, well-defined tasks | Suited for complex, uncertain environments |
| Deterministic behavior | Probabilistic and adaptive behavior |
This distinction is crucial. Machine learning allows systems to generalize from experience, which makes it well suited to tasks where writing rules by hand is impractical, such as image recognition, natural language understanding, or predicting chemical properties.
Core Concepts of Machine Learning
At its core, machine learning is a subset of artificial intelligence (AI) focused on enabling computers to learn patterns from data and use those patterns to make predictions or decisions without being explicitly programmed for each task.
A typical machine learning workflow involves three key components:
- Model: A mathematical structure or algorithm that maps inputs to outputs. Examples include decision trees, neural networks, and support vector machines.
- Training: The process of feeding data into the model so it can adjust its internal parameters to capture meaningful patterns. This data usually includes both the input features and the correct answers (labels) in supervised learning.
- Generalization: The model’s ability to apply what it has learned from the training data to new, unseen data. A well-generalized model performs accurately on both past examples and future scenarios.
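The three components above can be seen together in a small sketch. It assumes synthetic data drawn from y = 2x + 1 with a little noise, a straight-line model, and plain gradient descent as the training procedure. The names `predict`, `mse`, and `fit` are hypothetical helpers for illustration only.

```python
import random

random.seed(0)

# Synthetic data (an assumption): y = 2x + 1 plus a little Gaussian noise.
xs = [random.uniform(0.0, 1.0) for _ in range(100)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.05)) for x in xs]
train, test = data[:80], data[80:]  # hold out 20 points the model never sees


def predict(x, w, b):
    """Model: a straight line mapping input x to output y."""
    return w * x + b


def mse(points, w, b):
    """Mean squared error of the model on a set of (x, y) points."""
    return sum((predict(x, w, b) - y) ** 2 for x, y in points) / len(points)


def fit(points, lr=0.3, epochs=2000):
    """Training: adjust w and b by gradient descent to reduce the training error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = sum(2 * (predict(x, w, b) - y) * x for x, y in points) / n
        grad_b = sum(2 * (predict(x, w, b) - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit(train)
print(f"learned w={w:.2f}, b={b:.2f}")
# Generalization: compare the error on seen data with the error on held-out data.
print(f"train MSE={mse(train, w, b):.4f}, test MSE={mse(test, w, b):.4f}")
```

If the test error is close to the training error, the model has captured the underlying relationship and has not simply memorized the training points.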
The ultimate goal of machine learning is to extract underlying relationships that allow models to perform reliably when exposed to new inputs. For example, we can train a model to recognize handwritten digits and expect it to correctly classify handwriting styles that it has not seen before.
In short, machine learning transforms historical data into predictive insight, bridging the gap between raw information and actionable intelligence.
Types of Machine Learning
Machine learning is generally categorized into three main types, based on how the model learns from data.
Supervised Learning
In supervised learning, the algorithm is trained on a labeled dataset, which means each input is paired with the correct output. The goal is to learn a mapping from inputs to outputs that can be used to predict labels for new, unseen data.
- Use Cases: Spam detection, fraud detection, medical diagnosis, regression forecasting
- Sample Algorithms: Linear regression, logistic regression, support vector machines, random forests
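To make the idea of learning a mapping from labeled examples concrete, here is a minimal sketch of logistic regression trained by gradient descent. The eight two-feature data points, their labels, and the helper names `sigmoid`, `predict_proba`, and `fit_logistic` are invented for illustration. In practice you would normally reach for an established library.

```python
import math

# Invented labeled dataset: (feature_1, feature_2) -> class label 0 or 1.
X = [(1.0, 1.2), (1.5, 0.8), (0.8, 1.0), (1.2, 1.5),
     (3.0, 3.2), (3.5, 2.8), (2.8, 3.0), (3.2, 3.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(point, weights, bias):
    """Estimated probability that a point belongs to class 1."""
    z = sum(w * x for w, x in zip(weights, point)) + bias
    return sigmoid(z)


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights and bias by gradient descent on the log loss."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for point, label in zip(X, y):
            error = predict_proba(point, weights, bias) - label
            for j, x in enumerate(point):
                grad_w[j] += error * x / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


weights, bias = fit_logistic(X, y)
# Predict labels for two points that were not in the training set.
for new_point in [(1.1, 0.9), (3.1, 3.3)]:
    p = predict_proba(new_point, weights, bias)
    print(new_point, "-> class", int(p >= 0.5))
```

The labels in `y` are what make this supervised: every update compares the model's prediction with a known correct answer.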
Unsupervised Learning
Unsupervised learning deals with unlabeled data. The algorithm tries to discover hidden patterns or structures in the data without guidance from known outputs.
- Use Cases: Customer segmentation, anomaly detection, dimensionality reduction
- Sample Algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA), autoencoders
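As an illustration of finding structure without labels, the sketch below implements plain k-means (Lloyd's algorithm) on eight invented 2D points. Starting the centroids at the first k points is a deliberately naive assumption to keep the example deterministic, since practical implementations use smarter initialization. The names `squared_distance` and `k_means` are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iters=100):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Invented, unlabeled data: two loose groups of points.
points = [(1.0, 1.2), (1.5, 0.8), (0.8, 1.0), (1.2, 1.5),
          (5.0, 5.2), (5.5, 4.8), (4.8, 5.0), (5.2, 5.5)]

labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
print([tuple(round(v, 3) for v in c) for c in centroids])
```

No labels were supplied at any point. The two groups emerge purely from the distances between points.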
Reinforcement Learning
In reinforcement learning, an agent learns by interacting with an environment, receiving rewards or penalties based on its actions. The goal is to learn a policy (a rule for choosing an action in each state) that maximizes cumulative reward over time.
- Use Cases: Robotics, game playing (e.g., AlphaGo), autonomous vehicles
- Sample Algorithms: Q-learning, Deep Q-Networks (DQNs), policy gradient methods
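The agent, environment, and reward loop can be sketched with tabular Q-learning on a toy problem. The five-cell corridor environment, the reward of 1 at the right-hand end, and the hyperparameter values are all assumptions chosen to keep the example small. They are not taken from any benchmark.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (terminal)
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Environment: move along the corridor; reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q_row, rng):
    """Epsilon-greedy: usually exploit, sometimes explore; ties are broken randomly."""
    if rng.random() < EPSILON or q_row[0] == q_row[1]:
        return rng.choice(ACTIONS)
    return 0 if q_row[0] > q_row[1] else 1


def train(episodes=500, max_steps=100, seed=0):
    """Run Q-learning and return the learned table of action values."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best action in the next state.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy: the highest-valued action in each non-terminal cell.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy should choose "step right" (action 1) in every cell, because that is the shortest path to the only reward. The agent was never told this. It inferred it from the rewards it received.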
Each type of learning is suited for different problems. Choosing the right one depends on the nature of your data and the task you want to solve.
Why Chemists Should Care
Machine learning is increasingly reshaping how chemists approach research, discovery, and analysis. Rather than replacing chemical intuition, ML augments it by enabling faster hypothesis testing, automated data analysis, and discovery pipelines that were previously unfeasible. Here are some examples of where machine learning can be applied within chemistry.
Catalyst Discovery
In recent years, machine learning has been used to predict catalytic activity based on molecular structure, electronic descriptors, or reaction conditions. This reduces the number of experimental trials needed to find a suitable catalyst for a given reaction.
For example, in a paper published in American Chemical Society’s ACS Catalysis, 300 quaternary solid catalysts were randomly sampled from a materials space consisting of 36,540 catalysts. The researchers used decision trees to facilitate efficient sampling of quaternary catalysts toward better performance in oxidative coupling of methane.
Spectral Prediction
Another interesting application of machine learning and AI is the accurate prediction of IR/NMR/UV-Vis spectra of new molecules. Researchers train models on large spectral databases to accurately predict the spectra of materials, polymers, or even catalysts.
In a paper published in The Journal of Physical Chemistry, researchers introduced ShiftML, a machine-learning model of chemical shifts in molecular solids. ShiftML was trained on minimum-energy geometries of materials composed of C, H, N, O, and S, and it provides rapid chemical shift predictions with density functional theory (DFT) accuracy.
Materials Screening
In materials science, ML helps screen thousands of candidate compounds to identify those with desired properties (e.g., conductivity, stability, bandgap).
One example is GraphINVENT, a platform developed for graph-based molecular design using graph neural networks. GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules, a single bond at a time. Instead of relying on manually coded chemical rules, GraphINVENT learns directly from the training data.
Reaction Outcome Prediction
Machine learning can also forecast reaction yields, selectivities, and possible side-products based on reaction conditions and reactant properties.
A closely related example is AiZynthFinder, a fast and robust open-source software package for retrosynthetic planning. Rather than predicting the outcome of a single reaction, it works backward from a target molecule. The AiZynthFinder algorithm uses a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by an artificial neural network policy that suggests possible precursors by utilizing a library of known reaction templates.
In short, chemists equipped with machine learning can reduce experimental workload, accelerate discovery, and gain deeper insights—transforming how modern chemistry is practiced.
Common Misconceptions
Despite its growing popularity, machine learning is often misunderstood, especially in the scientific community. Clarifying these misconceptions is essential for using ML responsibly and effectively.
“Machine Learning is AI Magic”
Machine learning is not a black box that automatically solves problems without context. It relies on mathematics, algorithms, and optimization, not magic. It can uncover patterns, but it doesn’t understand chemistry the way a human does. Effective use still requires domain expertise to ask the right questions and interpret results meaningfully.
“More Data = Better Model”
While ML thrives on data, data quality matters far more than sheer volume. Inaccurate, noisy, or biased datasets can mislead models and yield unreliable predictions. In chemistry, poorly curated experimental data or unrepresentative training sets can lead to dangerous conclusions or failed generalization.
“ML Replaces Chemical Theory”
Machine learning complements theory—it does not replace it. Theoretical frameworks guide data selection, feature engineering, and model interpretation. ML can detect correlations, but it cannot explain causation without theory-driven context. Robust scientific discovery still relies on mechanistic understanding, not just statistical pattern recognition.
Chemists who understand these nuances are better positioned to integrate ML thoughtfully—leveraging its strengths without falling into hype-driven misuse.
Closing Thoughts
Machine learning is a powerful tool grounded in mathematics, data, and domain knowledge. For chemists, it offers a new lens to accelerate discovery, automate analysis, and uncover patterns too complex for manual inspection.
However, to use it effectively, one must understand its foundations and limitations. Learning ML equips you not just to use existing models, but to critically evaluate, improve, and innovate within your own field of research.
In the next section, we’ll explore the key components of the machine learning pipeline—starting with how data is prepared and models are trained. If you’re ready to go from understanding what ML is to learning how it works, continue reading.