Machine-Learning-Based Chemoinformatics: Comparison
Please note this is a comparison between Version 1 by Sarfaraz K. Niazi and Version 2 by Rita Xu.

In modern drug discovery, the combination of chemoinformatics and quantitative structure–activity relationship (QSAR) modeling has emerged as a formidable alliance, enabling researchers to harness the vast potential of machine learning (ML) techniques for predictive molecular design and analysis.


  • QSAR
  • QSPR
  • chemoinformatics
  • small molecules

1. Introduction

The term “chemoinformatics” was coined by Frank K. Brown in 1998 with the aim of hastening drug discovery and development; today, chemoinformatics is crucial in biology, chemistry, and biochemistry. In 1998, the general process of drug discovery took 12 to 15 years and involved investments of around $500 million. New developments in machine learning (ML) and artificial intelligence (AI) have since transformed chemoinformatics and drug discovery. Market revenue for small-molecule drug discovery was $75.96 billion in 2022 and is projected to reach around $163.76 billion by 2032 [1,2].
In contrast to previously well-established statistics, mathematics, and physics-based stand-alone models, ML has introduced a paradigm shift, allowing computers to analyze data and draw conclusions and predictions without relying solely on explicit rules or predefined mathematical equations. These algorithms can discover complex patterns and relations in 3D chemical structures and biological activity data, adaptively adjust their models based on feedback, and generalize from training examples to make accurate predictions on unseen data. This data-driven approach has opened new avenues for optimizing drug–target interactions; empowering target-based drug discovery, chemical library screening, molecular modeling, mechanics, and dynamics; prioritizing potential drug candidates; and predicting possible toxicological responses of biologics with improved accuracy and efficiency.

2. Exploration of Chemoinformatics

At the intersection of chemistry and informatics, chemoinformatics has emerged as a potent field in drug discovery, employing inductive learning to predict chemical phenomena [3,4]. With the exponentially increasing accessibility of chemical data, the application of ML in chemoinformatics has changed the way researchers explore, analyze, and predict the properties and activities of molecules, and it has expedited the process many-fold compared to a few decades ago. The field focuses on molecular engineering, molecular manipulation, library design, compound database searching, chemical space exploration, molecular graph mining, and pharmacophore and scaffold analysis [5,6,7,8,9].

3. Fundamentals of Chemoinformatics

ML models perform prediction tasks based on chemical training data provided in the form of mathematical equations or a numerical representation. Transforming compound structures into machine-learning-ready chemical data involves a complex, multilayer computational process that encompasses descriptor generation, molecular graphs, fingerprint construction, similarity analysis, chemical space searching, and molecular dynamics simulations, among others. Each layer builds on the preceding ones and significantly influences how the machine learning models interpret the chemical data and how well they predict.
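As a rough illustration of turning a structure into a numerical representation, the sketch below folds character n-grams of a SMILES string into a fixed-length bit vector. This is a toy stand-in, not a chemically meaningful fingerprint such as a Morgan fingerprint: the function name `hashed_ngram_fingerprint`, the n-gram length, and the vector size are all assumptions made for the example.

```python
import zlib


def hashed_ngram_fingerprint(smiles: str, n: int = 3, n_bits: int = 64) -> list[int]:
    """Fold character n-grams of a SMILES string into a fixed-length bit vector.

    Toy illustration only: real fingerprints are built from atom
    environments in the molecular graph, not from raw characters.
    """
    bits = [0] * n_bits
    # Strings shorter than n still contribute one (short) fragment.
    for start in range(max(len(smiles) - n + 1, 1)):
        fragment = smiles[start:start + n]
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(fragment.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


ethanol = hashed_ngram_fingerprint("CCO")    # one fragment: "CCO"
propanol = hashed_ngram_fingerprint("CCCO")  # fragments: "CCC", "CCO"
shared = sum(a & b for a, b in zip(ethanol, propanol))
print(f"on-bits: ethanol={sum(ethanol)}, propanol={sum(propanol)}, shared={shared}")
```

Because both strings contain the fragment "CCO", every bit set for ethanol is also set for propanol, which is the kind of overlap that similarity measures later exploit.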

3.1. Data Mining and Chemical Databases

Training ML models requires chemical data, and chemoinformatics relies on chemical databases to store and retrieve chemical information. These databases enable searching for specific molecules and analyzing large chemical datasets. They store vast amounts of chemical information, including compound structures, biological activities, and other relevant physicochemical properties, and they facilitate data mining, knowledge discovery, and information retrieval for target prediction. Specialized databases of naturally existing compounds, including LOTUS [10], COCONUT [11], SuperNatural-II [12], NPASS [13], SymMap [14], TCMSP [15], and TCMID [16], provide valuable resources. These databases contain comprehensive information on compound structures, molecular physicochemical properties, and molecular descriptors. Utilizing the known structures of these compounds, abductive techniques based on structural similarities can be leveraged to convey knowledge regarding the mechanism. Various similarity scores can be computed, considering the similarity of 1D structures (e.g., SMILES- or SELFIES-based similarity [17]), 2D structures (e.g., 2D fingerprints or topological similarity), and even 3D structures (e.g., 3D geometric shape-based similarity). Previous studies have identified several metrics suitable for molecular similarity calculations, including the Tanimoto index, Manhattan distance, Dice index, overlap coefficient, cosine coefficient, and Soergel distance [18,19,20]. Furthermore, chemical bioactivity and structural data can be acquired from drug databases like ChEMBL [21], BindingDB [22], DrugBank [23], Inxight [24], and Protein Data Bank [25].
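The similarity metrics named above have simple closed forms when fingerprints are binary. The following is a minimal sketch, assuming each fingerprint is given as the set of its on-bit positions; the function name `binary_similarities` and the example bit sets are hypothetical. With a and b the on-bit counts and c the number of shared on-bits, Tanimoto is c / (a + b - c), and for binary data the Soergel distance is its complement.

```python
from math import sqrt


def binary_similarities(fp_a: set[int], fp_b: set[int]) -> dict[str, float]:
    """Similarity and distance scores for two binary fingerprints,
    each represented as the set of its on-bit positions."""
    if not fp_a or not fp_b:
        raise ValueError("both fingerprints need at least one on-bit")
    a, b = len(fp_a), len(fp_b)   # on-bits in each fingerprint
    c = len(fp_a & fp_b)          # on-bits shared by both
    tanimoto = c / (a + b - c)
    return {
        "tanimoto": tanimoto,
        "dice": 2 * c / (a + b),
        "cosine": c / sqrt(a * b),
        "overlap": c / min(a, b),
        # For binary vectors the Soergel distance equals 1 - Tanimoto.
        "soergel_distance": 1.0 - tanimoto,
        # For binary vectors the Manhattan distance counts differing bits.
        "manhattan_distance": float(a + b - 2 * c),
    }


# Hypothetical on-bit positions for two molecules.
scores = binary_similarities({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Here a = 4, b = 6, and c = 2, so the Tanimoto index is 2 / 8 = 0.25 and the Dice index is 4 / 10 = 0.4. The metrics rank pairs differently when fingerprints differ greatly in size, which is one reason studies compare several of them.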
Despite the availability of extensive databases, machine learning and deep learning techniques offer significant potential to enhance the creation of molecules and focused libraries, enabling the discovery of potent bioactive compounds through targeted design and generation strategies in QSAR studies. Generative models like recurrent neural networks (RNNs) have been employed to generate novel chemical structures predicted to have desirable properties, such as high potency or low toxicity. RNN models such as DeepMGM have previously been used to generate focused molecule libraries and have implicitly learned chemical knowledge to create molecules that combine characteristics of bioactive natural products and synthetic compounds. Generative models have also been used for inverse QSAR/QSPR, which involves generating molecules that meet specific target properties. The DeepMGM model was trained on drug-like molecules and produced a general model (g-DeepMGM) capable of generating scaffold-focused libraries. A target-specific model (t-DeepMGM) for the cannabinoid receptor 2 (CB2) was also developed using transfer learning, and a discriminator was incorporated into DeepMGM for in silico molecular design and testing. The generated molecule XIE9137 was identified as a potential CB2 allosteric modulator, highlighting the effectiveness of deep learning in de novo molecular design and chemical library generation [26,27].

3.2. Chemical Data Representation

Advancements in ML modeling and the availability of a vast pool of chemical and biological data have created a pressing need to translate data into a computer-understandable form before models are trained on them. Chemical data representation can cover empirical, molecular, and structural data represented as molecular graphs, fingerprints, descriptors, etc. [28,29]. In one study, a multivariate random forest model for genomic characterization was trained on genomic sequencing data given in numerical representation [30]. In another, a Naïve Bayesian (NB) model was developed on numeric activity data representing antagonists’ binding to estrogen receptors [31]. An ML-based model was trained on 31 numerical chemical datasets obtained from Merck to predict ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of small compounds [32]. Similarly, molecular fingerprint data have been used to train models for ADMET property prediction. Integrated NB and QSAR models trained on descriptors including extended-connectivity fingerprint data have been used to predict active compounds against human immunodeficiency virus type 1 [33]. Furthermore, graph neural networks (GNNs) operate on graph-structured data of 3D molecules and have been used to identify potential drug molecules [34]. Besides the choice of representation, data augmentation, and pre-processing, the twin curse of dimensionality and collinearity must be tackled; in these data representations and modeling approaches, it is addressed through principal components analysis (PCA), partial least squares (PLS), and other available techniques. In genomic characterization and small-molecule property prediction, the data often involve many genomic or chemical descriptors.
This high-dimensional feature space can lead to overfitting, decreased model interpretability, and increased computational complexity. In studies involving activity data, binding assays, or molecular fingerprints, collinearity can arise from strong correlations or dependencies among the input variables. Highly correlated variables introduce redundancy and multicollinearity, leading to unstable model estimates and difficulties in interpreting the contributions of individual variables. Feature selection, feature extraction, regularization or penalization, and genetic algorithms can help mitigate these issues by reducing dimensionality, imposing constraints, and encouraging sparsity. The principal components analysis (PCA) and partial least squares (PLS) methods generally transform large sets of correlated variables into smaller sets of uncorrelated ones. PCA has been used to explore complex datasets in QSAR and for dimensionality reduction. One study investigated different applications of PCA in QSAR using a dataset that included CCR5 inhibitors. PCA has also been used to detect outliers in datasets. In a different investigation, PCA was used to examine the original data matrix, in which molecules are represented by several predictor variables (molecular descriptors). PCA has also been used to design features for estrogen receptor binding prediction. Furthermore, kernel principal components analysis (kernel-PCA), a nonlinear variant of PCA, showed enhanced performance in therapeutic activity predictions against a diverse range of pharmacological protein targets, surpassing the predictive capabilities of LASSO regression. Similarly, the partial least squares (PLS) method has been employed to discern significant structural patterns that contribute to the biological activity of a molecule.
The efficiency and accuracy of PLS in combination with unsupervised dimensionality reduction techniques surpass the approach of explicitly combining unsupervised dimensionality reduction with multivariate regression. PLS is also widely utilized in the field of 3D-QSAR modeling [6,35,36].
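To make the collinearity point concrete, the sketch below runs the first step of PCA on a small, made-up descriptor matrix in which the second column is exactly twice the first. It is a minimal pure-Python illustration under stated assumptions: the data and all function names are hypothetical, and power iteration is used here only to keep the example dependency-free (library PCA implementations typically use a singular value decomposition instead).

```python
from math import sqrt


def center(rows):
    """Subtract each column mean so every descriptor has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration.

    Assumes the uniform start vector is not orthogonal to the leading
    eigenvector and that cov is not the zero matrix.
    """
    p = len(cov)
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return v, eigenvalue


# Hypothetical descriptor matrix: rows are molecules, columns are descriptors.
# Column 2 is exactly 2 x column 1, so the two are perfectly collinear.
descriptors = [
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, 0.4],
    [4.0, 8.0, 0.2],
    [5.0, 10.0, 0.3],
]

centered = center(descriptors)
cov = covariance(centered)
pc1, eigenvalue = leading_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
# Project each molecule onto the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
print(f"PC1 explains {explained:.1%} of the variance")
```

Because two of the three columns carry the same information, a single component captures nearly all of the variance, and the three correlated descriptors can be replaced by one score per molecule.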

3.3. Molecular Descriptors

Molecular descriptors are quantifiable representations that capture chemical compounds’ structural, physicochemical, and biological properties. These descriptors are quantitative measures used for similarity analysis, virtual screening, and predictive modeling. Chemical molecular descriptors are categorized as 0D, 1D, 2D, 3D, and 4D (Table 1) [37,38,39,40].
Table 1. The most common 0D to 4D chemical descriptors for QSAR/QSPR analysis.
| Descriptor Dimension | Descriptor Type | Example |
|---|---|---|
| 0D | Counts of the molecule’s atoms, bonds, and functional groups | Molecular weight, LogP (partition coefficient) |
| 1D | Molecular properties in a linear manner | Molecular formula, SMILES and SELFIES |
| 2D | Topological polar surface area (TPSA) | Molecular fingerprint (e.g., Morgan fingerprint), constitutional descriptors (e.g., atom, bond, and ring counts) |
| 3D | Spatial properties of a molecule | Molecular shape descriptors (e.g., volume, surface area), pharmacophore features |
| 4D | Electrostatic potential descriptors with spatiotemporal aspects | Molecular dynamics descriptors, solvent accessible surface area (SASA), radius of gyration (Rg), time-dependent properties (e.g., dynamic polar surface area (dPSA), time-dependent dipole moment) |
  • 0D Descriptors: These are constitutional or count descriptors, scalar values that describe the number of atoms, bonds, or functional groups in the molecule, e.g., molecular weight.
  • 1D Descriptors: These descriptors capture molecular properties in one dimension along a linear sequence or chain of atoms, e.g., structural fragments or fingerprints.
  • 2D Descriptors: These descriptors provide information about the structure on a molecular level and its properties within a 2D plane, e.g., topological polar surface area (TPSA) and graph invariants.
  • 3D Descriptors: These descriptors describe the molecular properties in 3D space, considering the spatial arrangement of atoms, e.g., autocorrelation descriptors, substituent constants, surface:volume descriptors, quantum-chemical descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, and size, steric, surface, and volume descriptors.
  • 4D Descriptors: These descriptors encompass properties that change over time or involve spatiotemporal aspects, e.g., drug dissolution rate, Volsurf, and GRID or CoMFA methods.
These molecular descriptors have been used to select the most relevant properties. MoDeSus is an ML-based tool used to determine the most informative molecular descriptors for QSAR studies. Molecular descriptors allow for ligand-based scaffold hopping for hit and lead optimization, which speeds up the early stages of drug development, and they have been used to compare QSAR and QSPR models. Although each type of descriptor plays a vital role, 3D and 4D descriptors have shown the most significant contribution to identifying active molecules and potential drug targets. Furthermore, 4D descriptors like CoMFA and GRID have been used to identify active sites of receptors and characterize interactions, providing insight into the functional properties of small molecules [41,42,43].
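The 0D (count) descriptors described above are the simplest to compute. The sketch below derives atom counts and molecular weight from a plain molecular formula. It is a minimal illustration, assuming formulas without parentheses, charges, or isotopes; the function name `zero_d_descriptors` and the small atomic-mass table are assumptions made for the example and cover only a handful of elements.

```python
import re

# Standard atomic masses (rounded) for a few common elements only.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "P": 30.974, "S": 32.06, "Cl": 35.45,
}


def zero_d_descriptors(formula: str) -> dict:
    """Compute simple count descriptors from a formula such as 'C2H6O'.

    Raises KeyError for elements missing from ATOMIC_MASS.
    """
    counts: dict[str, int] = {}
    # An element symbol is one uppercase letter plus an optional lowercase
    # letter, followed by an optional count (a missing count means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    weight = sum(ATOMIC_MASS[symbol] * n for symbol, n in counts.items())
    return {
        "atom_counts": counts,
        "total_atoms": sum(counts.values()),
        "heavy_atoms": sum(n for s, n in counts.items() if s != "H"),
        "molecular_weight": round(weight, 3),
    }


ethanol = zero_d_descriptors("C2H6O")
print(ethanol["total_atoms"], ethanol["heavy_atoms"], ethanol["molecular_weight"])
```

For ethanol this gives 9 atoms, 3 heavy atoms, and a molecular weight of about 46.069. Descriptors of higher dimension need the molecular graph or 3D coordinates and are normally computed with a dedicated chemoinformatics toolkit.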