1. Components of Computational Autonomous Molecular Design Workflow
The workflow for computational autonomous molecular design (CAMD) must be an integrated and closed-loop system (Figure 1) with: (i) efficient data generation and extraction tools, (ii) robust data representation techniques, (iii) physics-informed predictive machine learning models, and (iv) tools to generate new molecules using the knowledge learned from steps i–iii. Ideally, an autonomous computational workflow for molecule discovery would learn from its own experience and adjust its functionality as the chemical environment or the targeted functionality changes through active learning. This can be achieved when all the components work in collaboration with each other, providing feedback while improving model performance as we move from one step to another.
Figure 1. Closed-loop workflow for computational autonomous molecular design (CAMD) for medical therapeutics. Individual components of the workflow are labeled. It consists of data generation, feature extraction, predictive machine learning, and an inverse molecular design engine.
For data generation in CAMD, high-throughput density functional theory (DFT) [1][2] is a common choice, mainly because of its reasonable accuracy and efficiency [3][4]. In DFT, we typically feed in 3D structures to predict the properties of interest. Data generated from DFT simulations are processed to extract the most relevant structural and property data, which are then used either as input for learning the representation [5][6] or as targets for the ML models [7][8][9]. The generated data can be used in two different ways: to predict the properties of new molecules using a direct supervised ML approach, and to generate new molecules with the desired properties using inverse design. CAMD can be coupled with supplementary components, such as databases, to store and visualize the data. The AI-assisted CAMD workflow presented here is the first step in developing automated workflows for molecular design. Such an automated pipeline will not only accelerate hit identification and lead optimization for the desired therapeutic candidates but can also be actively used for machine reasoning to develop transparent and interpretable ML models. These workflows, in principle, can be combined intelligently with experimental setups for computer-aided synthesis or screening planning, including synthesis and characterization tools. Because such experiments are expensive to run across the desired chemical space, experimental measurements and characterization should be performed only for the AI-designed lead compounds obtained from CAMD.
The molecules generated by inverse design should, in principle, be validated within the closed-loop system, either by using an integrated DFT method for the desired properties or by high-throughput docking against a target protein to determine their binding affinity; the rest of the CAMD workflow is then updated accordingly. These steps are repeated in a closed loop, thus improving and optimizing the data representation, property prediction, and new data generation components. Once we have confidence that the workflow generates valid new molecules, the DFT validation step can be bypassed or replaced with an ML predictive tool to make the workflow computationally more efficient. In the following, we briefly discuss the main components of CAMD while reviewing the recent breakthroughs achieved.
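The closed loop described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: candidates are single numbers standing in for molecules, `oracle` stands in for the expensive DFT or docking validation, `fit_surrogate` is a trivial nearest-neighbour model standing in for the predictive ML component, and `propose` stands in for the inverse design engine. None of these names come from the workflow in Figure 1.

```python
import random

def oracle(x):
    """Stand-in for an expensive validation step (e.g. DFT or docking)."""
    return -(x - 0.7) ** 2  # hypothetical property, maximal at x = 0.7

def fit_surrogate(data):
    """Build a trivial nearest-neighbour predictor from (candidate, value) pairs."""
    def predict(x):
        nearest = min(data, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def propose(best_x, rng, n=20, width=0.2):
    """Toy 'inverse design' step: sample candidates around the current best."""
    return [min(1.0, max(0.0, best_x + rng.uniform(-width, width)))
            for _ in range(n)]

def closed_loop(n_rounds=10, seed=0):
    rng = random.Random(seed)
    data = [(x, oracle(x)) for x in (0.0, 1.0)]    # initial data generation
    for _ in range(n_rounds):
        surrogate = fit_surrogate(data)             # predictive model
        best_x = max(data, key=lambda p: p[1])[0]   # best validated candidate
        candidates = propose(best_x, rng)           # generate new candidates
        pick = max(candidates, key=surrogate)       # rank with the cheap model
        data.append((pick, oracle(pick)))           # validate and feed back
    return data

history = closed_loop()
best = max(history, key=lambda p: p[1])
print(f"{len(history)} validated candidates, best x = {best[0]:.3f}")
```

Each round validates only one candidate with the expensive step, which mirrors the idea of reserving costly validation for the most promising designs. A realistic workflow would replace every placeholder with the corresponding component.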
2. Data Generation and Molecular Representation
ML models are data-centric: in general, the more high-quality data available, the better the model performance. A lack of accurate, ethically sourced, well-curated data is the major bottleneck limiting their use in many domains of physical and biological science. For some sub-domains, a limited amount of data exists that comes mainly from physics-based simulations in databases [10][11] or from experimental databases, such as NIST [12]. For other fields, such as biochemical reactions [13], databases with the free energies of reactions exist, but the values are obtained with empirical methods, which are not considered ideal as ground truth for machine learning models. For many domains, accurate and curated data do not exist. In these scenarios, slightly unconventional yet very effective approaches that create ML data from published scientific literature and patents have recently gained adoption [14][15][16][17]. These approaches use natural language processing (NLP) to extract chemistry and biology data from open-source published literature. Developing a cutting-edge NLP-based tool to extract, learn from, and reason over the extracted data would reduce the timeline for high-throughput experimental design in the lab. This would significantly expedite decision-making based on the existing literature when setting up future experiments in a semi-automated way. The resulting tools, based on human-machine teaming, are much needed for scientific discovery.
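As a rough illustration of literature-based data extraction, the sketch below pulls (compound, property, value, unit) records out of free text with a single regular expression. This is a rule-based stand-in, not one of the NLP tools cited above; the pattern, the function name `extract_records`, and the example sentences (which describe made-up compounds) are all assumptions for illustration.

```python
import re

# Hypothetical pattern: "<property> of <compound> is/was <number> <unit>"
PATTERN = re.compile(
    r"(?P<property>melting point|boiling point)\s+of\s+"
    r"(?P<compound>[\w\- ]+?)\s+(?:is|was)\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>°C|K)",
    flags=re.IGNORECASE,
)

def extract_records(text):
    """Return one dictionary per property statement found in the text."""
    return [
        {
            "compound": m["compound"],
            "property": m["property"].lower(),
            "value": float(m["value"]),
            "unit": m["unit"],
        }
        for m in PATTERN.finditer(text)
    ]

# Made-up sentences, used only to exercise the pattern.
text = ("The melting point of compound A was 131.5 °C. "
        "The boiling point of compound B is 350 K.")
records = extract_records(text)
for record in records:
    print(record)
```

A fixed pattern like this breaks as soon as the phrasing varies, which is one reason learned NLP models are preferred for real literature mining.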
3. Molecular Representation in Automated Pipelines
A robust representation of molecules is required for the accurate functioning of the ML models [18]. An ideal molecular representation should be unique, invariant with respect to different symmetry operations, invertible, and efficient to obtain, and it should capture the physics, stereochemistry, and structural motifs. Some of these requirements can be met by using physical, chemical, and structural properties [19], but these are rarely well documented all together, so obtaining this information is considered a cumbersome task. Over time, this has been tackled by using several alternative approaches that work well for specific problems [20][21][22][23][24][25], as shown in Figure 2. However, developing universal representations of molecules for diverse ML problems is still a challenging task, and a gold-standard method that works consistently for all kinds of problems is yet to be discovered. Molecular representations primarily used in the literature fall into two broad categories: (a) 1D and/or 2D representations designed by experts using domain-specific knowledge, including properties from simulations and experiments, and (b) molecular representations learned iteratively, directly from the 3D nuclear coordinates/properties, within ML frameworks.
Figure 2. Molecular representation with all possible formulations used in the literature for predictive and generative modeling.
Expert-engineered molecular representations have been used extensively for predictive modeling in the last decade; they include properties of the molecules [26][27], structured text sequences [28][29][30] (SMILES, InChI), and molecular fingerprints [31], among others. Such representations are carefully selected for each specific problem using domain expertise and considerable resources and time. The SMILES representation of molecules is the main workhorse as a starting point both for representation learning and for generating expert-engineered molecular descriptors. For the latter, SMILES strings can be used directly as one-hot encoded vectors, or to calculate fingerprints or a range of empirical properties using different open-source platforms, such as RDKit [32] or ChemAxon [33], thereby bypassing expensive feature generation from quantum chemistry or experiments and providing fast access to diverse properties, including 3D coordinates, for molecular representations. Moreover, SMILES can be easily converted into 2D graphs, which are the preferred choice to date for generative modeling, where molecules are treated as graphs with nodes and edges. Although significant progress has been made in molecular generative modeling using mainly SMILES strings [28], such models often generate molecules that are syntactically invalid or synthetically unexplored. In addition, SMILES strings are known to violate fundamental physics- and chemistry-based constraints [34][35]. Case-specific solutions to circumvent some of these problems exist, but a universal solution is still unknown. Extensions of SMILES have been attempted that encode rings and branches of molecules more robustly, to obtain more concrete representations with high semantic and syntactic validity, using canonical SMILES [36][37], InChI [29][30], SMARTS [38], DeepSMILES [39], DESMILES [40], etc. More recently, Krenn et al. proposed a 100% syntactically correct and robust string-based representation of molecules known as SELFIES [34], which has been increasingly adopted for predictive and generative modeling [41].
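To make the one-hot encoding of SMILES strings and the notion of syntactic validity concrete, here is a minimal sketch in plain Python. It makes several simplifying assumptions: tokens are single characters (so a two-letter element such as Cl would be split), the vocabulary is built from the example strings only, and the validity check covers only balanced branch parentheses and paired single-digit ring closures. It ignores bracket atoms, `%nn` ring closures, and valence rules, so it is not a substitute for a cheminformatics toolkit such as RDKit.

```python
def one_hot_encode(smiles, vocab, max_len):
    """Return a max_len x len(vocab) matrix of 0/1 rows, padded with zero rows."""
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

def looks_syntactically_valid(smiles):
    """Partial check: balanced branches and paired single-digit ring closures."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing a branch that was never opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}     # first occurrence opens, second closes
    return depth == 0 and not open_rings

# Ethanol, acetic acid, and benzene as example SMILES strings.
molecules = ["CCO", "CC(=O)O", "c1ccccc1"]
vocab = sorted(set("".join(molecules)))
encoded = one_hot_encode("CCO", vocab, max_len=8)

for s in molecules + ["c1ccccc", "CC(=O"]:
    print(s, looks_syntactically_valid(s))
```

The last two strings, with an unclosed ring and an unclosed branch, are the kind of output a character-level generative model can produce, which is the problem that representations such as SELFIES are designed to avoid.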
Recently, molecular representations that are learned iteratively, directly from molecules, have been increasingly adopted, mainly for predictive molecular modeling, achieving chemical accuracy for a range of properties [19][42][43]. Such representations, as shown in Figure 3, are more robust and outperform expert-designed representations in drug design and discovery [44]. For representation learning, different variants of graph neural networks are a popular choice [22][45]. The process starts by generating atom (node) and bond (edge) features for all the atoms and bonds within a molecule; these are iteratively updated using graph traversal algorithms that take the chemical environment into account to learn a robust molecular representation. The starting atom and bond features of the molecule may be just one-hot encoded vectors that include only atom type and bond type, or a list of atom and bond properties derived from SMILES strings. Yang et al. achieved chemical accuracy for predicting a number of properties with their ML models by combining the atom and bond features of molecules with global state features before they are updated during the iterative process [46].
Figure 3. The iterative update process is used for learning a robust molecular representation either based on 2D SMILES or 3D optimized geometrical coordinates from physics-based simulations. The molecular graph is usually represented by features at the atomic level, bond level, and global state, which represent the key properties. Each of these features is iteratively updated during the representation learning phase, which is subsequently used for the predictive part of the model.
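The iterative update of atom features can be sketched without any ML library. The example below is a simplified assumption, not the architecture of any cited model: each atom starts from a one-hot atom-type vector, and at every step it adds the sum of its neighbours' current features to its own. Learned weights, nonlinearities, bond features, and global state features, which real graph neural networks use, are all omitted.

```python
def message_passing(features, bonds, n_steps):
    """Update each atom's features by adding the sum of its neighbours' features."""
    neighbors = {i: [] for i in range(len(features))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    h = [list(f) for f in features]
    for _ in range(n_steps):
        new_h = []
        for v, h_v in enumerate(h):
            message = [0.0] * len(h_v)
            for u in neighbors[v]:
                message = [m + x for m, x in zip(message, h[u])]
            new_h.append([own + m for own, m in zip(h_v, message)])
        h = new_h                      # all atoms are updated simultaneously
    return h

def readout(h):
    """Sum atom features into one molecule-level vector (atom-order independent)."""
    return [sum(column) for column in zip(*h)]

# Ethanol heavy atoms C-C-O with one-hot features [is_C, is_O].
features = [[1, 0], [1, 0], [0, 1]]
bonds = [(0, 1), (1, 2)]

h1 = message_passing(features, bonds, n_steps=1)
print(h1)  # the two carbons now differ because their environments differ
print(readout(message_passing(features, bonds, n_steps=2)))
```

After one step the two carbon atoms, which started with identical features, carry different vectors because only one of them is bonded to oxygen. This is the sense in which the update encodes the chemical environment.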
Molecules are 3D multiconformational entities, and hence it is natural to assume that they can be well represented by their nuclear coordinates, as is the case in physics-based molecular simulations [47]. However, a representation of molecules based on raw coordinates is non-invariant, non-invertible, and non-unique [20] and hence is not commonly used in conventional machine learning. In addition, the coordinates by themselves do not carry information about key attributes of a molecule, such as bond types, symmetry, spin states, and charge. Approaches/architectures have been proposed to create robust, unique, and invariant representations from nuclear coordinates using atom-centered Gaussian functions, tensor field networks, and, more robustly, representation learning techniques [19][43][48][49][50][51], as shown in Figure 3.
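The invariance problem with raw coordinates, and how atom-centered Gaussian functions address it, can be shown with a short sketch. The descriptor below is a simplified assumption in the spirit of radial symmetry functions: for each atom it sums Gaussians of the interatomic distances at a few hypothetical centers (`centers`, in Å) with a hypothetical width parameter `eta`, and the per-atom vectors are sorted so that atom ordering does not matter. It ignores element types and angular information, and it is not invertible. The water geometry is approximate.

```python
import math

def radial_descriptor(coords, centers=(0.5, 1.0, 1.5, 2.0), eta=4.0):
    """Per-atom Gaussian radial features, sorted so atom order does not matter."""
    per_atom = []
    for i, r_i in enumerate(coords):
        feats = []
        for mu in centers:
            total = 0.0
            for j, r_j in enumerate(coords):
                if i != j:
                    d = math.dist(r_i, r_j)   # depends only on distances
                    total += math.exp(-eta * (d - mu) ** 2)
            feats.append(total)
        per_atom.append(tuple(feats))
    return sorted(per_atom)

def rotate_z_and_shift(coords, angle, shift):
    """Rotate about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

def same_descriptor(d_a, d_b, tol=1e-9):
    return all(math.isclose(a, b, abs_tol=tol)
               for row_a, row_b in zip(d_a, d_b)
               for a, b in zip(row_a, row_b))

# Approximate water geometry in angstroms: O, H, H.
water = [(0.0, 0.0, 0.1173), (0.0, 0.7572, -0.4692), (0.0, -0.7572, -0.4692)]
moved = rotate_z_and_shift(water, angle=1.0, shift=(3.0, -2.0, 5.0))
reordered = [water[2], water[0], water[1]]

print(water == moved)  # raw coordinates change under rotation and translation
print(same_descriptor(radial_descriptor(water), radial_descriptor(moved)))
print(same_descriptor(radial_descriptor(water), radial_descriptor(reordered)))
```

Because the features depend only on interatomic distances, they are unchanged by rotation, translation, and atom reordering, while the raw coordinate lists differ in all three cases.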