1. Natural Product Databases
Between 2000 and 2019, 123 commercial and open access natural product (NP) databases were published. Of these, 98 are still accessible in some form, 92 are open access, and only 50 contain molecular structures that can be retrieved for chemoinformatic analysis [1].
Table 1 summarizes examples of the most representative NP databases. Among the largest commercial databases is the Dictionary of Natural Products [2]. It contains more than 230,000 compounds and provides names and synonyms, physicochemical properties, spectroscopic data, molecular structures, and biological source and use. Another commercial database is Scifinder [3], assembled and maintained by the American Chemical Society (ACS). It contains arguably the most extensive collection of NPs, with over 300,000 entries. This is because, since 1957, the Chemical Abstracts Service (CAS), a division of the ACS, has assigned a unique registry number to every new chemical substance reported in the scientific literature. Another large commercial database is Reaxys [4], collected and maintained by Elsevier. It contains approximately 10^7 molecules, including over 200,000 NPs. The Collection of Open Natural Products (COCONUT) [5] is a major open access database of NPs, containing more than 411,000 NPs collected from 50 open access NP databases. The Universal Natural Product Database [6] is a compilation that attempted to gather all known NPs; it has more than 229,000 NPs. It provides 3D structures with stereochemical information and calculated molecular descriptors. It is no longer accessible through the link in the original publication. Instead, it is hosted and maintained on the ISDB website [7]. The SuperNatural Ⅱ [8] database contains over 325,000 NPs and includes information about 2D structures, physicochemical properties, predicted toxicity class, and potential vendors. Nevertheless, it does not provide a bulk download.
ZINC [9] is another open access database with over 80,000 NPs, of which approximately 48,000 are purchasable. It includes information on known and predicted biological targets. Downloading the entire subset of NPs in 1D or 3D notation is straightforward. Some NP databases are no longer accessible through the link provided in the original publication. Fortunately, their structures are in ZINC. Such is the case with the Herbal Ingredient Targets [10] and Herbal Ingredients in vivo Metabolism [11] databases, which contain NPs mostly from Chinese plants. Specs [12] has an industrial catalog of purchasable NPs, although the website no longer allows compounds to be downloaded. Nonetheless, the structures are available via ZINC. Although the Universal Natural Product Database, SuperNatural Ⅱ, and ZINC are among the largest databases of NPs in the public domain, they do not offer information regarding the taxonomic and geographic origin of the organisms that produce the NPs, and they lack literature references [1].
Traditional Chinese medicine (TCM) is part of the public health system in China [13]. Therefore, the Chinese government encourages research in the area of NPs, and as a consequence, a large number of NP databases have been published [14][15][16][17][18][19][20]. TCM@Taiwan is the most extensive database of NPs used in TCM [21], containing approximately 58,000 molecules. Regarding traditional medicine in India (Indian Ayurveda), there are two open access databases available: IMPPAT [22], which contains more than 10,000 phytochemicals extracted from 1700 medicinal plants; and MedPServer [23], containing 1124 NPs from North-East India. Moreover, there are several databases containing compounds from African traditional medicine [24][25][26][27][28][29]. AfroDB [30] is the most comprehensive, comprising around 1000 NPs, and it is accessible via ZINC.
Table 1. Most representative natural products databases.
Database Name | Number of Compounds | Accessibility | Reference |
Collection of Open Natural Products (COCONUT) | 411,621 | Open access | [5] |
Universal Natural Product Database | ∼229,000 | Open access | [6] |
SuperNatural Ⅱ | 325,508 | Open access | [8] |
ZINC | ∼80,000 | Open access | [9] |
Dictionary of Natural Products | ∼230,000 | Commercial | [2] |
Scifinder | ∼300,000 | Commercial | [3] |
Reaxys | ∼200,000 | Commercial | [4] |
TCM@Taiwan | ∼58,000 | Open access | [21] |
IMPPAT | ∼10,000 | Open access | [22] |
AfroDB | ∼1000 | Open access | [30] |
2. Latin American Natural Product Databases
2.1. NuBBEDB
The database is the result of a collaboration between the Nuclei of Bioassays, Biosynthesis and Ecophysiology of Natural Products (NuBBE) research group of the São Paulo State University and the Laboratory of Computational and Medicinal Chemistry of the University of São Paulo. It was published in 2013 as the first NP library of Brazilian biodiversity, containing 640 compounds [31]; in 2017, an updated version came out with more than 2000 NPs [32]. Currently, the database contains 2223 compounds. The available information for each compound includes the International Union of Pure and Applied Chemistry (IUPAC) name, linear notations (SMILES, InChI, and InChIKey strings), rule of five (Ro5) and Veber descriptors, predicted nuclear magnetic resonance (NMR) spectroscopic data, source, therapeutic effect, and reference. It is possible to download the whole database in .mol2 format. Additionally, the database can be found in Chemspider and ZINC, and it is part of the COCONUT database.
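The Ro5 and Veber descriptors mentioned above are commonly used as simple drug-likeness filters. The following is a minimal sketch of how such filters could be applied to precomputed descriptors, using the usual published thresholds (molecular weight ≤ 500 Da, logP ≤ 5, at most 5 hydrogen-bond donors and 10 acceptors for Ro5; at most 10 rotatable bonds and a polar surface area ≤ 140 Å² for Veber). The `Descriptors` class, the function names, and the two example records are hypothetical and are not taken from NuBBEDB.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors
    rot_bonds: int      # rotatable bonds
    tpsa: float         # topological polar surface area in square angstroms


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_veber(d: Descriptors) -> bool:
    """Check the two Veber criteria for oral bioavailability."""
    return d.rot_bonds <= 10 and d.tpsa <= 140


# Illustrative values only; they do not describe real database entries.
drug_like = Descriptors(302.2, 1.5, 5, 7, 1, 131.0)
large = Descriptors(812.0, 6.2, 7, 14, 15, 210.0)

for label, d in [("drug_like", drug_like), ("large", large)]:
    print(label, ro5_violations(d), passes_veber(d))
```

In practice the descriptor values would come from the database export or from a cheminformatics toolkit; only the threshold logic is shown here.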
The website allows users to search compounds by selecting specific criteria: metabolic class (alkaloids, flavonoids, lignoids, etc.), name and location of the species that contain the NP, source (marine, plant, etc.), and drug-like physicochemical properties. Furthermore, one can draw a structure and retrieve the compounds that contain it or search compounds that contain a specific NMR signal.
An absorption, distribution, metabolism, excretion and toxicity (ADMET) profile of the database revealed that 91% of the compounds can permeate through the human intestinal barrier, and 93% of the molecules can efficiently move in systemic circulation and reach their desired site of action. Moreover, it is predicted that most of the compounds do not inhibit five isoforms of CYP450 (CYP 3A4, 2D6, 1A2, 2C9, and 2C19). CYP450 enzymes are responsible for detoxifying more than 80% of drugs in liver first-pass metabolism; therefore, any compound that inhibits them may cause toxicity. The clearance prediction revealed that 94% of the compounds are readily excreted from the human body after exerting their therapeutic function. Finally, 87% of compounds were shown to have no mutagenic, tumorigenic, reproductive, or irritant effects [33].
Another study characterized the chemical space and diversity of the database. It found that NuBBEDB has a focused chemical space within the space of drug-like physicochemical properties. The study also revealed that the main source of diversity is the side chains. Another finding was that diversity and complexity vary according to the origin of the compounds when comparing NuBBEDB to other NP databases. One conclusion of the study is that NuBBEDB is a promising source of molecules for drug discovery [34].
NuBBEDB was employed in a virtual screening (VS) study aimed at finding compounds against Trypanosoma cruzi. The researchers looked for inhibitors of trypanothione reductase, an enzyme that is a validated target for the discovery of new antiprotozoal compounds. Ten compounds were identified as potential inhibitors of the enzyme [35]. In another study, 13 compounds against Mycobacterium tuberculosis were identified from NuBBEDB [36]. The molecules are inhibitors of the serine/threonine protein kinase, which is essential for the growth and survival of the pathogen [37].
2.2. SistematX
The database was developed at the Laboratory of Cheminformatics of the Federal University of Paraiba, Brazil. The first version came out in 2018 containing 2150 secondary metabolites [38], and a second version was published in 2021 with a total of 9514 unique secondary metabolites [39]. The information for every compound includes the IUPAC name, SMILES, InChI and InChIKey strings, CAS registry number, physicochemical drug-like descriptors, predicted NMR spectra, predicted biological activities, and the bibliographic reference. A unique feature is the information regarding the taxonomic rank, from family to species, and the global positioning system (GPS) coordinates of the plant from which the compound was isolated. On the website (Table 2), compounds can be searched by drawing the 2D structure or by SMILES string, compound name, taxonomic rank, or physicochemical properties. It is possible to download the entire database in .csv or .sdf format.
Table 2. Latin American natural products databases.
SistematX has been employed in five VS studies. In the first study, compounds with potential antichagasic activity were identified from the 1306 sesquiterpene lactones in the database. (Chagas disease is an endemic disease caused by Trypanosoma cruzi.) The study employed two approaches, ligand-based VS (LBVS) and structure-based VS (SBVS). From LBVS, the most prominent compound showed an inhibition probability of 0.82. From SBVS, 13 potential inhibitors were identified [46]. In another VS study, aimed at identifying compounds against the intracellular parasitic protozoan Leishmania donovani, which causes leishmaniasis, 13 promising, enzyme-targeting, antileishmanial compounds were identified from the sesquiterpene lactones in SistematX [47]. In the third VS study, the researchers looked for compounds against Schistosoma mansoni, which causes the chronic parasitic disease schistosomiasis. From the 1000 alkaloids in SistematX, five compounds were identified with potential multitarget schistosomicidal activity [48]. In the fourth VS study, 1955 diterpenes in SistematX were employed to search for compounds against SARS-CoV-2. Nineteen compounds were identified as potential SARS-CoV-2 inhibitors [49]. In the most recent VS campaign, the researchers sought acetylcholinesterase (AChE) inhibitors, whose use is one approach to the treatment of Alzheimer’s disease. They employed a combined approach in which machine learning classification models and molecular docking calculations were used to identify two promising AChE inhibitors [50]. Other applications of SistematX include chemotaxonomic studies using self-organizing map algorithms [51] and the profiling of over 2000 metabolites from the Asteraceae family while screening for inhibitors of Leishmania major dihydroorotate dehydrogenase [52].
2.3. UEFS
The NP database of the State University of Feira de Santana [40] was developed and is maintained by the State University of Feira de Santana in Bahia, Brazil (UEFS, for its acronym in Portuguese: Universidade Estadual de Feira de Santana). The database contains NPs that have been published separately, but there is no common publication or dedicated public database for it. Nevertheless, it is accessible via ZINC. There are 503 NPs in the database. It is possible to download the whole database in .mol2 or .sdf format, and a bulk download of the SMILES strings is also provided. The available information on the NPs includes calculated physicochemical properties, biological targets, and binding affinity, together with the bibliographic reference. The biological targets are cross-referenced to Reactome, an open source, open access, manually curated and peer-reviewed pathway database [53]. Finally, it is possible to find information about the vendors of individual compounds.
2.4. CIFPMA
The NP database of CIFLORPAN from the University of Panama, Republic of Panama (CIFPMA) was developed by the Center for Pharmacognostic Research on Panamanian Flora (CIFLORPAN, for its acronym in Spanish: Centro de Investigaciones Farmacognósticas de la Flora Panameña), College of Pharmacy of the University of Panama. The first version was published in 2017 [41], containing 354 molecules; in 2019, the database was updated to 454 compounds [42]. The compounds have been tested in over 25 in vitro and in vivo bioassays for different therapeutic targets, including anti-HIV (human immunodeficiency virus), antioxidant, and anticancer activities. The compound structures are available upon request.
A chemoinformatic analysis of the database suggested that, in general, the compounds have drug-like properties. The database was compared to the TCM@Taiwan and UEFS databases. It was found that CIFPMA has greater scaffold diversity than these two databases. Moreover, unique scaffolds were found in the CIFPMA database. Finally, the study established which scaffolds are present in compounds with experimental cytotoxic, anti-HIV-1, antimalarial, anti-trypanosomatid, and antifungal activities [41].
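Scaffold diversity of the kind compared above is often summarized with simple counts. The sketch below assumes that a scaffold string (for example, a Bemis-Murcko scaffold SMILES computed elsewhere) is already available for each compound; the function name `scaffold_summary`, the chosen metrics, and the toy input are illustrative assumptions and are not the procedure reported in [41].

```python
from collections import Counter


def scaffold_summary(scaffolds):
    """Summarize scaffold diversity for a list with one scaffold per compound.

    Returns the number of compounds, the number of distinct scaffolds, the
    ratio of scaffolds to compounds, and the fraction of scaffolds that occur
    in exactly one compound (singletons).
    """
    if not scaffolds:
        raise ValueError("at least one scaffold is required")
    counts = Counter(scaffolds)
    n_compounds = len(scaffolds)
    n_scaffolds = len(counts)
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "compounds": n_compounds,
        "scaffolds": n_scaffolds,
        "scaffolds_per_compound": n_scaffolds / n_compounds,
        "singleton_fraction": n_singletons / n_scaffolds,
    }


# Toy input: four compounds sharing three scaffolds (hypothetical data).
toy = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
summary = scaffold_summary(toy)
print(summary)
```

A higher scaffolds-per-compound ratio and a higher singleton fraction both indicate a more diverse collection, which makes these values convenient for comparing databases of different sizes.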
The database was part of another chemoinformatics study, which compared several NP databases against databases of compounds of synthetic origin. The study revealed that many of the NPs and synthetic compounds share the same chemical space. Moreover, the NPs present a larger fingerprint-based diversity than the synthetic compounds. Furthermore, the study revealed that NPs have a higher proportion of chiral carbons and atoms with sp3 hybridization and greater complexity, while synthetic products contain a greater proportion of aromatic atoms. Lastly, cyclicity, relative shape, and flexibility are very similar in NPs and synthetic compounds [42].
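Fingerprint-based diversity is frequently quantified from pairwise Tanimoto similarities. As a minimal sketch, assuming each fingerprint is represented as a Python set of "on" bit positions (the fingerprint type and exact diversity measure used in [42] are not stated here), the mean pairwise diversity of a collection can be computed as one minus the mean Tanimoto similarity. The function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    # Two empty fingerprints share no features; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def mean_pairwise_diversity(fingerprints):
    """Return 1 minus the mean Tanimoto similarity over all distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Toy fingerprints (hypothetical bit positions).
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
diversity = mean_pairwise_diversity(fps)
print(round(diversity, 3))
```

Values close to 1 indicate that the compounds share few fingerprint features, whereas values close to 0 indicate a collection of near-identical structures.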
2.5. UNIIQUIM
The database was created at the National Autonomous University of Mexico (UNAM, for its acronym in Spanish: Universidad Nacional Autónoma de México) by the Informatics Unit of the Institute of Chemistry (UNIIQUIM, for its acronym in Spanish: Unidad de Informática del Instituto de Química). The database [43] is composed of NPs from Mexico, mainly NPs isolated and characterized by the Department of Natural Products of the Institute of Chemistry, UNAM. The number of NPs in the database is not clear, and the website is only in Spanish. The information on the NPs includes the IUPAC name, CAS registry number, physicochemical properties, the species that synthesizes the NP, the spectroscopic techniques employed to characterize the compound, experimental biological activity, and a reference to either the article where the NP is reported or the articles that report the biological activities. In the current version, it is not possible to make a bulk download. The content can be browsed in a table that displays either the chemical structures or the producing organisms. Furthermore, the content can be browsed in a table that contains the bibliographic references.
In one study, the chemical and toxicological profile of molecules with analgesic activity was described. The results showed that most of the compounds probably interact with the opioid receptor. Moreover, the predicted acute toxicity is low, and none of the compounds is predicted to be mutagenic. The study concludes that the structural diversity, the common nociception activity, and the predicted safety profile as nonmutagenic agents highlight the importance of these molecules for further studies in the search for analgesic and nociception effects [54].
2.6. BIOFACQUIM
The database was curated and constructed in Mexico by the Computer-Aided Drug Design at the School of Chemistry (DIFACQUIM, for its acronym in Spanish: Diseño de Fármacos Asistido por Computadora) research group, UNAM. The first version came out in 2019 [44] and contained 423 NPs isolated and characterized in Mexico at the School of Chemistry, UNAM, between 2000 and 2018. Later, in 2020, a second version came out [45], updated with NPs isolated and characterized by research groups at other Mexican institutions, reaching a total of 531 molecules. The database currently contains 553 NPs. It is composed mainly of NPs from plants, followed by fungi and, to a lesser extent, propolis and marine animals. There is a website for the first version of the database that allows the user to search the compounds by name. Moreover, it is possible to retrieve compounds by kingdom (plant, fungus, propolis). The entire database can be downloaded in .csv format. The latest version of the database is available on a different website [45], and it is possible to download the whole database in .sdf format. Information on the NPs includes the compound name, SMILES strings, bibliographic reference, taxonomic rank (kingdom, genus, species), place where it is found, the source from which the NP was isolated, biological activity, and IC50 value. The database is also available at ZINC, and it is part of the COCONUT database.
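Databases distributed in .sdf format store each compound as a record that ends with a `$$$$` line, with annotations given as `> <TAG>` data items after the structure block. The following standard-library sketch shows one way such annotations could be read into dictionaries. The tag names `NAME` and `SMILES` and the two sample records are hypothetical; the tags actually used in the BIOFACQUIM file may differ, and a cheminformatics toolkit would be needed to interpret the structures themselves.

```python
import re

# A data header line looks like "> <TAG>"; the tag name sits inside <...>.
_FIELD = re.compile(r"^>.*<([^>]+)>")


def parse_sdf_fields(text):
    """Return one dict of data items per SDF record, ignoring the structure."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the last delimiter
        fields, current = {}, None
        for line in block.splitlines():
            match = _FIELD.match(line)
            if match:
                current = match.group(1)
                fields[current] = []
            elif current is not None:
                if line.strip():
                    fields[current].append(line.strip())
                else:
                    current = None  # a blank line ends the data item
        records.append({tag: "\n".join(v) for tag, v in fields.items()})
    return records


# Two minimal records with empty structure blocks (hypothetical content).
SAMPLE = "\n".join([
    "ethanol", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <NAME>", "ethanol", "", "> <SMILES>", "CCO", "", "$$$$",
    "benzene", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <NAME>", "benzene", "", "> <SMILES>", "c1ccccc1", "", "$$$$", "",
])

records = parse_sdf_fields(SAMPLE)
for record in records:
    print(record["NAME"], record["SMILES"])
```

For a downloaded file, the same function could be applied to the text returned by reading the file with `open(path, encoding="utf-8").read()`.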
A chemoinformatics analysis of the first version of the database concluded that the compounds have a broad coverage in the chemical space and overlap regions in the drug-like space. Furthermore, compounds very similar to drugs approved for clinical use were identified [44]. In another study, a structural content analysis of the second version was performed. BIOFACQUIM was compared to ChEMBL 25 (1,667,509 molecules) and a database with 169,839 NPs. The researchers concluded that 44.3% of the unique compounds contained in BIOFACQUIM are focused on drug-like space in terms of physicochemical properties. Additionally, a significant number of compounds and scaffolds (79 and 29, respectively) were identified that were not present in the two large reference sets [45]. Finally, an in silico absorption, distribution, metabolism, excretion and toxicological (ADMET) profile of the second version of BIOFACQUIM was performed. The study concluded that the absorption and distribution profiles of the compounds in BIOFACQUIM are similar to those of approved drugs, while the metabolism profile is comparable to that in other NP databases. The excretion profile of the compounds is different from that of the approved drugs, but their predicted toxicity profile is comparable [55].
An independent VS study looked for beta-glucosidase inhibitors. The pharmacological applications of these compounds include obesity, diabetes, hyperlipoproteinemia, cancer, HIV, and hepatitis B and C. Employing classification models (two-variable artificial network), eight compounds were identified from BIOFACQUIM as active [56]. In addition, in an independent study, Barrera-Vázquez et al. looked for senolytic compounds, which selectively eliminate senescent cells. Cellular senescence is a cellular condition that involves significant changes in gene expression and the arrest of cell proliferation. The elimination of senescent cells delays, prevents, and improves multiple adverse outcomes related to age. Through the use of chemoinformatics tools (fingerprinting and network pharmacology), and employing two NP databases, InflamNat and BIOFACQUIM, three senolytic compounds were identified [57].
Table 3 summarizes the main applications of databases of representative Latin American natural products to identify bioactive compounds.
Table 3. Practical applications of the databases of Latin American natural products.
Database Name | Disease or Symptom | Causative Agent | Number of Identified Compounds | Reference |
NuBBEDB | Chagas disease | Trypanosoma cruzi | 10 | [35] |
NuBBEDB | Tuberculosis | Mycobacterium tuberculosis | 13 | [36] |
SistematX | Chagas disease | Trypanosoma cruzi | 13 | [46] |
SistematX | Leishmaniasis | Leishmania donovani | 13 | [47] |
SistematX | Schistosomiasis | Schistosoma mansoni | 5 | [48] |
SistematX | Coronavirus disease 2019 | SARS-CoV-2 | 19 | [49] |
SistematX | Alzheimer’s disease | | 2 | [50] |
UNIIQUIM | Pain | | 6 | [54] |
BIOFACQUIM | Obesity; Diabetes; Hyperlipoproteinemia; Cancer; HIV/AIDS *; Hepatitis B and C | | 8 | [56] |
BIOFACQUIM | Age-related diseases | | 3 | [57] |
This entry is adapted from the peer-reviewed paper 10.3390/biom12091202