Machine-Learning-Based Chemoinformatics: Comparison
Please note this is a comparison between Version 2 by Rita Xu and Version 1 by Sarfaraz K. Niazi.

In modern drug discovery, the combination of chemoinformatics and quantitative structure–activity relationship (QSAR) modeling has emerged as a formidable alliance, enabling researchers to harness the vast potential of machine learning (ML) techniques for predictive molecular design and analysis.


  • QSAR
  • QSPR
  • chemoinformatics
  • small molecules

1. Introduction

The term “chemoinformatics” was coined by Frank K. Brown in 1998 with the aim of hastening drug discovery and development; today, chemoinformatics is crucial in biology, chemistry, and biochemistry. In 1998, the general process of drug discovery took 12 to 15 years and involved investments of around $500 million. New developments in machine learning (ML) and artificial intelligence (AI) have since revolutionized chemoinformatics and drug discovery. Market revenue for small-molecule drug discovery was $75.96 billion in 2022 and is projected to reach around $163.76 billion by 2032 [1,2].
In contrast to previously well-established statistics, mathematics, and physics-based stand-alone models, ML has introduced a paradigm shift, allowing computers to analyze data and draw conclusions and predictions without relying solely on explicit rules or predefined mathematical equations. These algorithms can discover complex patterns and relations in 3D chemical structures and biological activity data, adaptively adjust their models based on feedback, and generalize from training examples to make accurate predictions on unseen data. This data-driven approach has opened new avenues for optimizing drug–target interactions; empowering target-based drug discovery, chemical library screening, molecular modeling, mechanics, and dynamics; prioritizing potential drug candidates; and predicting possible toxicological responses of biologics with improved accuracy and efficiency.

2. Exploration of Chemoinformatics

At the intersection of chemistry and informatics, chemoinformatics has emerged as a potent field in drug discovery, employing inductive learning to predict chemical phenomena [3,4]. With the exponentially increasing accessibility of chemical data, the application of ML in chemoinformatics has revolutionized the way researchers explore, analyze, and predict the properties and activities of molecules, expediting the process manyfold compared with a few decades ago. The field focuses on molecular engineering, molecular manipulation, library design, compound database searching, chemical space exploration, molecular graph mining, and pharmacophore and scaffold analysis [5,6,7,8,9].

3. Fundamentals of Chemoinformatics

ML models perform prediction tasks based on chemical training data provided in the form of mathematical equations or a numerical representation. Transforming compound structures into machine-learning-ready chemical data involves a complex, multilayer computational process that encompasses descriptor generation, molecular graphs, fingerprint construction, similarity analysis, chemical space searching, and molecular dynamics simulations, among other steps. Each layer is interwoven with the preceding layers, significantly influencing how the machine learning models interpret the chemical data and enhancing their predictive capabilities.
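As a minimal illustration of turning a structure into a numerical representation, the sketch below hashes overlapping substrings of a SMILES string into a fixed-length bit vector. This is a deliberately simplified toy, not a Morgan or extended-connectivity fingerprint; the function name `toy_fingerprint`, the 64-bit length, and the 3-character substring window are assumptions chosen only for the example.

```python
import hashlib
from typing import List


def toy_fingerprint(smiles: str, n_bits: int = 64, ngram: int = 3) -> List[int]:
    """Hash overlapping SMILES substrings into a fixed-length bit vector.

    Hypothetical helper for illustration only: real fingerprints are built
    from the molecular graph, not from the raw SMILES text.
    """
    bits = [0] * n_bits
    # Pad very short strings so that at least one substring exists.
    padded = smiles if len(smiles) >= ngram else smiles.ljust(ngram, "_")
    for i in range(len(padded) - ngram + 1):
        fragment = padded[i:i + ngram]
        # hashlib gives a hash that is stable across interpreter runs,
        # unlike the built-in hash() for strings.
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits[index] = 1
    return bits


ethanol = toy_fingerprint("CCO")     # one substring: "CCO"
propanol = toy_fingerprint("CCCO")   # two substrings: "CCC", "CCO"
print(sum(ethanol), sum(propanol))   # number of bits set in each vector
```

Because both strings contain the substring "CCO", every bit set for ethanol is also set for 1-propanol, which is the kind of shared-feature signal that downstream similarity analysis relies on.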

3.1. Data Mining and Chemical Databases

Training ML models requires chemical data, and chemoinformatics relies on chemical databases to store and retrieve chemical information. These databases enable searching for specific molecules and analyzing large chemical datasets. Model training depends heavily on managing and utilizing databases that store vast amounts of chemical information, including compound structures, biological activities, and other relevant physicochemical properties. These databases facilitate data mining, knowledge discovery, and information retrieval for target prediction. Specialized databases of naturally occurring compounds, including LOTUS [10], COCONUT [11], SuperNatural-II [12], NPASS [13], SymMap [14], TCMSP [15], and TCMID [16], provide valuable resources. They contain comprehensive information on compound structures, molecular physicochemical properties, and molecular descriptors. Utilizing the known structures of these compounds, abductive techniques based on structural similarities can be leveraged to infer knowledge regarding the mechanism. Various similarity scores can be computed, considering the similarity of 1D structures (e.g., SMILES- or SELFIES-based similarity [17]), 2D structures (e.g., 2D fingerprints or topological similarity), and even 3D structures (e.g., 3D geometric shape-based similarity). Previous studies have identified several metrics suitable for molecular similarity calculations, including the Tanimoto index, Manhattan distance, Dice index, overlap coefficient, cosine coefficient, and Soergel distance [18,19,20]. Furthermore, chemical bioactivity and structural data can be acquired from drug databases like ChEMBL [21], BindingDB [22], DrugBank [23], Inxight [24], and the Protein Data Bank [25].
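The similarity metrics listed above can be sketched for binary fingerprints, here represented as Python sets holding the indices of the bits that are switched on. The two example fingerprints are made-up values, and the function names are illustrative; for binary data the Soergel distance reduces to one minus the Tanimoto index, which is the form used here.

```python
import math
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by on-bits present in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a: Set[int], b: Set[int]) -> float:
    """Twice the shared on-bits divided by the total on-bit count."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by the geometric mean of the on-bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def overlap(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by the smaller on-bit count."""
    smallest = min(len(a), len(b))
    return len(a & b) / smallest if smallest else 0.0


def manhattan(a: Set[int], b: Set[int]) -> int:
    """For binary vectors this is the number of positions that differ."""
    return len(a ^ b)


def soergel(a: Set[int], b: Set[int]) -> float:
    """Binary Soergel distance, the complement of the Tanimoto index."""
    return 1.0 - tanimoto(a, b)


# Hypothetical fingerprints: indices of the bits that are set.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}

print(tanimoto(fp_a, fp_b))   # 3 shared bits / 6 bits in the union = 0.5
print(manhattan(fp_a, fp_b))  # bits 7, 8, and 12 differ, so 3
print(round(dice(fp_a, fp_b), 3), round(cosine(fp_a, fp_b), 3),
      overlap(fp_a, fp_b), soergel(fp_a, fp_b))
```

Index-type measures (Tanimoto, Dice, cosine, overlap) grow with similarity, whereas the Manhattan and Soergel distances shrink, so the two families should not be mixed when ranking compounds.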
Despite the availability of extensive databases, machine learning and deep learning techniques offer significant potential to enhance the creation of molecules and focused libraries, enabling the discovery of potent bioactive compounds through targeted design and generation strategies in QSAR studies. Generative models such as recurrent neural networks (RNNs) have been employed to generate novel chemical structures predicted to have desirable properties, such as high potency or low toxicity. RNN models, such as DeepMGM, have previously been used to generate focused molecule libraries and have implicitly learned chemical knowledge to create molecules that combine characteristics of bioactive natural products and synthetic compounds. In addition, generative models have been used for inverse QSAR/QSPR, which involves generating molecules that meet specific target properties. DeepMGM was trained on drug-like molecules to produce a general model (g-DeepMGM) capable of generating scaffold-focused libraries. A target-specific model (t-DeepMGM) for the cannabinoid receptor 2 (CB2) was also developed using transfer learning. A discriminator was incorporated into DeepMGM for in silico molecular design and testing. The generated molecule XIE9137 was identified as a potential CB2 allosteric modulator, highlighting the effectiveness of deep learning in de novo molecular design and chemical library generation [26,27].

3.2. Chemical Data Representation

Advancements in ML modeling and the availability of a vast pool of chemical and biological data have created a pressing need to translate data into a computer-readable form before models are trained on them. Chemical data, whether empirical, molecular, or structural, can be represented as molecular graphs, fingerprints, descriptors, etc. [28,29]. In one study, a multivariate random forest model for genomic characterization was trained on genomic sequencing data given in numerical representation [30]. In another, a Naïve Bayesian (NB) model was developed on numeric activity data representing the binding of antagonists to estrogen receptors [31]. An ML-based model was trained on 31 numerical chemical datasets obtained from Merck to predict the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of small compounds [32]. Similarly, molecular fingerprint data have been used to train such models for ADMET property prediction. NB and QSAR integrated models trained on descriptors, including extended-connectivity fingerprint data, have been used to predict active compounds against human immunodeficiency virus type-1 [33]. Furthermore, graph neural networks (GNNs) operate on the graph-structured data of 3D molecules and have been used to identify potential drug molecules [34].
Besides the choice of representation, data augmentation, and pre-processing, the twin curse of dimensionality and collinearity must be tackled; it is typically addressed through principal components analysis (PCA), partial least squares (PLS), and other available techniques. In genomic characterization and small-molecule property prediction, the data often involve many genomic or chemical descriptors. This high-dimensional feature space can lead to overfitting, decreased model interpretability, and increased computational complexity. In studies involving activity data, binding assays, or molecular fingerprints, collinearity can arise from strong correlations or dependencies among the input variables. Highly correlated variables introduce redundancy and multicollinearity, leading to unstable model estimates and difficulties in interpreting the contributions of individual variables. Techniques such as feature selection, feature extraction, regularization, penalization, and genetic algorithms can help mitigate these issues by imposing constraints and encouraging sparsity.
PCA and PLS generally transform large datasets of correlated variables into a smaller set of uncorrelated ones. PCA has been used for dimensionality reduction and to explore complex datasets in QSAR; one study investigating its different applications in QSAR used a dataset of CCR5 inhibitors. PCA has also been used to detect outliers in datasets. In a different investigation, the original data matrix, in which molecules are represented by several predictor variables (molecular descriptors), was examined using PCA. PCA has also been used to design features for estrogen receptor binding prediction. Furthermore, kernel principal components analysis (kernel-PCA), a nonlinear variant of PCA, has been reported to enhance therapeutic activity predictions against a diverse range of pharmacological protein targets, surpassing the predictive capabilities of LASSO regression. Similarly, PLS has been employed to discern significant structural patterns that contribute to the biological activity of a molecule. The efficiency and accuracy of PLS in combination with unsupervised dimensionality reduction techniques surpass the approach of explicitly combining unsupervised dimensionality reduction with multivariate regression. PLS is also widely utilized in 3D-QSAR modeling [6,35,36].
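To make the collinearity problem and the PCA remedy concrete, the following sketch uses a small, invented descriptor matrix in which the second descriptor is roughly twice the first. It computes the Pearson correlation between descriptors and then finds the first principal component by power iteration on the covariance matrix. The data, the function names, and the use of power iteration (rather than a full eigendecomposition from a numerical library) are all assumptions made to keep the example dependency-free; PLS is not shown.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def center_columns(X: Matrix) -> Matrix:
    """Subtract each descriptor's mean so every column has mean zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def covariance(Xc: Matrix) -> Matrix:
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]


def correlation(cov: Matrix) -> Matrix:
    """Pearson correlation matrix derived from a covariance matrix."""
    p = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]


def leading_component(cov: Matrix, iterations: int = 500) -> Tuple[float, List[float]]:
    """Largest eigenvalue and unit eigenvector of cov via power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


# Invented data: 5 molecules x 3 descriptors; column 2 is about 2 x column 1.
X = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 1.5],
    [3.0, 6.2, 0.7],
    [4.0, 7.8, 1.2],
    [5.0, 10.1, 0.9],
]

cov = covariance(center_columns(X))
corr = correlation(cov)
eigenvalue, pc1 = leading_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print("correlation of descriptors 1 and 2:", round(corr[0][1], 3))
print("variance explained by PC1:", round(explained, 3))
```

In this toy case the two collinear descriptors collapse onto a single component that carries almost all of the variance, which is why replacing the raw descriptors with a few principal components can stabilize a downstream regression. Real descriptor sets would normally be scaled before PCA, a step omitted here for brevity.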

3.3. Molecular Descriptors

Molecular descriptors are quantifiable representations that capture the structural, physicochemical, and biological properties of chemical compounds. They serve as quantitative measures for similarity analysis, virtual screening, and predictive modeling. Chemical molecular descriptors are categorized as 0D, 1D, 2D, 3D, and 4D (Table 1) [37,38,39,40].
Table 1. The most common 0D to 4D chemical descriptors for QSAR/QSPR analysis.
Descriptor Dimension | Descriptor Type | Example
0D | Counts of the molecule’s atoms, bonds, and functional groups | Molecular weight, LogP (partition coefficient)
1D | Molecular properties in a linear manner | Molecular formula, SMILES, and SELFIES
2D | Topological polar surface area (TPSA) | Molecular fingerprints (e.g., Morgan fingerprint), constitutional descriptors (e.g., atom, bond, and ring counts)
3D | Spatial properties of a molecule | Molecular shape descriptors (e.g., volume, surface area), pharmacophore features
4D | Electrostatic potential descriptors with spatiotemporal aspects | Molecular dynamics descriptors, solvent accessible surface area (SASA), radius of gyration (Rg), time-dependent properties (e.g., dynamic polar surface area (dPSA), time-dependent dipole moment)
  • 0D Descriptors: These are constitutional or count descriptors, scalar values that describe the number of atoms, bonds, or functional groups in the molecule, e.g., molecular weight.
  • 1D Descriptors: These descriptors capture molecular properties in one dimension along a linear sequence or chain of atoms, e.g., structural fragments or fingerprints.
  • 2D Descriptors: These descriptors provide information about the structure on a molecular level and its properties within a 2D plane, e.g., topological polar surface area (TPSA) and graph invariants.
  • 3D Descriptors: These descriptors describe the molecular properties in 3D space, considering the spatial arrangement of atoms, e.g., autocorrelation descriptors, substituent constants, surface:volume descriptors, quantum-chemical descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, and size, steric, surface, and volume descriptors.
  • 4D Descriptors: These descriptors encompass properties that change over time or involve spatiotemporal aspects, e.g., drug dissolution rate, Volsurf, and GRID or CoMFA methods.
These molecular descriptors have been used to select the most relevant properties. MoDeSus is an ML-based tool used to determine the most informative molecular descriptors for QSAR studies. Molecular descriptors allow for ligand-based scaffold hopping for hit and lead optimization, which speeds up the early stages of drug development, and they have been used to compare QSAR and QSPR models. Although each type of descriptor plays a vital role, 3D and 4D descriptors have shown the most significant contribution to identifying active molecules and potential drug targets. Furthermore, 4D descriptors like CoMFA and GRID have been used to identify active sites of receptors and characterize interactions, providing insight into the functional properties of small molecules [41,42,43].
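The simplest descriptors in this scheme, the count-based 0D descriptors, can be computed directly from a molecular formula. The sketch below is a hedged illustration: the helper names are hypothetical, the mass table covers only a handful of elements (rounded standard atomic weights), and formulas with parentheses, charges, or isotopes are not handled.

```python
import re
from typing import Dict

# Rounded standard atomic weights for a few common elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007,
               "O": 15.999, "S": 32.06, "Cl": 35.45}


def atom_counts(formula: str) -> Dict[str, int]:
    """Count atoms per element in a flat formula such as 'C2H6O'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Dict[str, int] = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"element not in the example table: {symbol}")
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum of atomic masses weighted by atom counts."""
    return sum(ATOMIC_MASS[s] * n for s, n in atom_counts(formula).items())


def heavy_atom_count(formula: str) -> int:
    """Number of non-hydrogen atoms."""
    return sum(n for s, n in atom_counts(formula).items() if s != "H")


ethanol = "C2H6O"
print(atom_counts(ethanol))                 # per-element atom counts
print(round(molecular_weight(ethanol), 3))  # molecular weight in g/mol
print(heavy_atom_count(ethanol))            # carbons plus oxygen
```

Higher-dimensional descriptors need more than the formula: 2D descriptors require the bond graph, and 3D or 4D descriptors require conformers or trajectories, which is why they are usually computed with dedicated chemoinformatics toolkits.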



  1. Small Molecule Drug Discovery Market Size, Report by 2032. Available online: (accessed on 24 May 2023).
  2. Brown, F.K. Chapter 35—Chemoinformatics: What is it and How does it Impact Drug Discovery. In Annual Reports in Medicinal Chemistry; Bristol, J.A., Ed.; Academic Press: New York, NY, USA, 1998; Volume 33, pp. 375–384.
  3. Polanski, J. 4.26-Chemoinformatics. In Comprehensive Chemometrics, 2nd ed.; Elsevier: Amsterdam, The Netherlands, 2020; pp. 635–676.
  4. Gasteiger, J. Chemoinformatics: Achievements and Challenges, a Personal View. Molecules 2016, 21, 151.
  5. Polanski, J. 4.14-Chemoinformatics. In Comprehensive Chemometrics; Elsevier: Amsterdam, The Netherlands, 2009; pp. 459–506.
  6. Gasteiger, J. Handbook of Chemoinformatics; Wiley: New York, NY, USA, 2003.
  7. Varnek, A.; Baskin, I.I. Chemoinformatics as a Theoretical Chemistry Discipline. Mol. Inform. 2011, 30, 20–32.
  8. Bajorath, J. (Ed.) Chemoinformatics and Computational Chemical Biology. In Methods in Molecular Biology; Springer Science+Business Media: Humana Totowa, NJ, USA, 2011.
  9. Kapetanovic, I.M. Computer-aided drug discovery and development (CADDD): In silico-chemico-biological approach. Chem.-Biol. Interact. 2008, 171, 165–176.
  10. Rutz, A.; Sorokina, M.; Galgonek, J.; Mietchen, D.; Willighagen, E.; Gaudry, A.; Graham, J.G.; Stephan, R.; Page, R.; Vondrášek, J.; et al. The LOTUS initiative for open natural products research: Knowledge management through Wikidata. bioRxiv 2021.
  11. Sorokina, M.; Steinbeck, C. Review on natural products databases: Where to find data in 2020. J. Cheminform. 2020, 12, 20.
  12. Banerjee, P.; Erehman, J.; Gohlke, B.O.; Wilhelm, T.; Preissner, R.; Dunkel, M. Super Natural II—A database of natural products. Nucleic Acids Res. 2015, 43, D935–D939.
  13. Zeng, X.; Zhang, P.; He, W.; Qin, C.; Chen, S.; Tao, L.; Wang, Y.; Tan, Y.; Gao, D.; Wang, B.; et al. NPASS: Natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 2018, 46, D1217–D1222.
  14. Wu, Y.; Zhang, F.; Yang, K.; Fang, S.; Bu, D.; Li, H.; Sun, L.; Hu, H.; Gao, K.; Wang, W.; et al. SymMap: An integrative database of traditional Chinese medicine enhanced by symptom mapping. Nucleic Acids Res. 2019, 47, D1110–D1117.
  15. Ru, J.; Li, P.; Wang, J.; Zhou, W.; Li, B.; Huang, C.; Li, P.; Guo, Z.; Tao, W.; Yang, Y.; et al. TCMSP: A database of systems pharmacology for drug discovery from herbal medicines. J. Cheminform. 2014, 6, 13.
  16. Xue, R.; Fang, Z.; Zhang, M.; Yi, Z.; Wen, C.; Shi, T. TCMID: Traditional Chinese medicine integrative database for herb molecular mechanism analysis. Nucleic Acids Res. 2012, 41, D1089–D1095.
  17. Krenn, M.; Aspuru-Guzik, A.; Nigam, A.; Friederich, P. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn. Sci. Technol. 2020, 1, 045024.
  18. Engel, T.; Gasteiger, J. (Eds.) Chemoinformatics: Basic Concepts and Methods; Wiley: New York, NY, USA, 2018; Available online: (accessed on 7 May 2023).
  19. Xue, H.; Stanley-Baker, M.; Kong, A.W.K.; Li, H.; Goh, W.W.B. Data considerations for predictive modeling applied to the discovery of bioactive natural products. Drug Discov. Today 2022, 27, 2235–2243.
  20. Nikolova, N.; Jaworska, J. Approaches to Measure Chemical Similarity—A Review. QSAR Comb. Sci. 2003, 22, 1006–1026.
  21. Mendez, D.; Gaulton, A.; Bento, A.P.; Chambers, J.; De Veij, M.; Félix, E.; Magariños, M.P.; Mosquera, J.F.; Mutowo, P.; Nowotka, M.; et al. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 2019, 47, D930–D940.
  22. Gilson, M.K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016, 44, D1045–D1053.
  23. Wishart, D.S.; Feunang, Y.D.; Guo, A.C.; Lo, E.J.; Marcu, A.; Grant, J.R.; Sajed, T.; Johnson, D.; Li, C.; Sayeeda, Z.; et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082.
  24. Siramshetty, V.B.; Grishagin, I.; Nguyễn, Ð.T.; Peryea, T.; Skovpen, Y.; Stroganov, O.; Katzel, D.; Sheils, T.; Jadhav, A.; Mathé, E.A.; et al. NCATS Inxight Drugs: A comprehensive and curated portal for translational research. Nucleic Acids Res. 2022, 50, D1307–D1316.
  25. Sussman, J.L.; Lin, D.; Jiang, J.; Manning, N.O.; Prilusky, J.; Ritter, O.; Abola, E.E. Protein Data Bank (PDB): Database of three-dimensional structural information of biological macromolecules. Acta Crystallogr. Sect. D Biol. Crystallogr. 1998, 54, 1078–1084.
  26. Moret, M.; Friedrich, L.; Grisoni, F.; Merk, D.; Schneider, G. Generative molecular design in low data regimes. Nat. Mach. Intell. 2020, 2, 171–180.
  27. Segler, M.H.S.; Kogej, T.; Tyrchan, C.; Waller, M.P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci. 2018, 4, 120–131.
  28. Haghighatlari, M.; Li, J.; Heidar-Zadeh, F.; Liu, Y.; Guan, X.; Head-Gordon, T. Learning to Make Chemical Predictions: The Interplay of Feature Representation, Data, and Machine Learning Methods. Chem 2020, 6, 1527–1542.
  29. David, L.; Thakkar, A.; Mercado, R.; Engkvist, O. Molecular representations in AI-driven drug discovery: A review and practical guide. J. Cheminform. 2020, 12, 56.
  30. Rahman, R.; Dhruba, S.R.; Ghosh, S.; Pal, R. Functional random forest with applications in dose-response predictions. Sci. Rep. 2019, 9, 1628.
  31. Pang, X.; Fu, W.; Wang, J.; Kang, D.; Xu, L.; Zhao, Y.; Liu, A.L.; Du, G.H. Identification of Estrogen Receptor α Antagonists from Natural Products via In Vitro and In Silico Approaches. Oxid. Med. Cell. Longev. 2018, 2018, 6040149.
  32. Feinberg, E.N.; Joshi, E.; Pande, V.S.; Cheng, A. Improvement in ADMET Prediction with Multitask Deep Featurization. J. Med. Chem. 2020, 63, 8835–8848.
  33. Wei, Y.; Li, W.; Du, T.; Hong, Z.; Lin, J. Targeting HIV/HCV Coinfection Using a Machine Learning-Based Multiple Quantitative Structure-Activity Relationships (Multiple QSAR) Method. Int. J. Mol. Sci. 2019, 20, 3572.
  34. Xiong, J.; Xiong, Z.; Chen, K.; Jiang, H.; Zheng, M. Graph neural networks for automated de novo drug design. Drug Discov. Today 2021, 26, 1382–1393.
  35. Kubinyi, H. Evolutionary variable selection in regression and PLS analyses. J. Chemom. 1996, 10, 119–133.
  36. Eriksson, L.; Jaworska, J.; Worth, A.; Cronin, M.T.D.; McDowell, R.; Gramatica, P. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ. Health Perspect. 2003, 111, 1361–1375.
  37. Dehmer, M.; Varmuza, K.; Bonchev, D. Statistical Modelling of Molecular Descriptors in QSAR/QSPR; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany, 2012.
  38. Lo, Y.; Rensi, S.E.; Torng, W.; Altman, R.B. Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 2018, 23, 1538–1546.
  39. Chandrasekaran, B.; Abed, S.N.; Al-Attraqchi, O.; Kuche, K.; Tekade, R.K. Computer-Aided Prediction of Pharmacokinetic (ADMET) Properties; Elsevier: Amsterdam, The Netherlands, 2018; pp. 731–755.
  40. Engel, T. Basic Overview of Chemoinformatics. J. Chem. Inf. Model. 2006, 46, 2267–2277.
  41. Ash, J.; Fourches, D. Characterizing the Chemical Space of ERK2 Kinase Inhibitors Using Descriptors Computed from Molecular Dynamics Trajectories. J. Chem. Inf. Model. 2017, 57, 1286–1299.
  42. Concepts and Experimental Protocols of Modelling and Informatics in Drug Design. ScienceDirect. Available online: (accessed on 24 May 2023).
  43. Machine Learning Descriptors for Molecules. ChemIntelligence. 5 January 2021. Available online: (accessed on 14 May 2023).