Databases in Metabolomics


Subjects: Computer Science, Hardware & Architecture | Biochemistry & Molecular Biology

Contributors: Bayan Hassan Banimfreg, Abdulrahim Shamayleh, Hussam Alshraideh

Metabolomics has grown out of innovations in functional genomics tools and is now a foundation of the big-data-led precision medicine era. It holds promise for both the pharmaceutical field and clinical research.

metabolomics
pathway analysis
computer
databases

1. Introduction

Efforts to understand the molecular systems of living organisms have driven advances in technologies for measuring the function of critical biomolecules: RNA, DNA, proteins, and small molecules of diverse natures. The analysis of these molecules led to the growth of the research field known as Omics [1,2], which has become a defining theme of molecular biology. In recent years, Omics technologies such as genomics, proteomics, and metabolomics [2] have delivered new insights into health and well-being.

Metabolomics enhances the monitoring of disease progression, dietary interventions, and drug toxicities by revealing the triggers of several diseases and detecting promising links between apparently unrelated conditions [3]. In addition, metabolomics seeks to capture the whole set of biomolecules contained in a biological sample, creating big data that is explored with biostatistics and bioinformatics methods [4].

Two main challenges in Omics data analysis are the dimensionality problem, which arises when there are more variables than samples, and the development of algorithms that successfully integrate and analyze biological data while incorporating present and future knowledge. Pathway Analysis (PA) has become an established and reliable way of managing these issues and is one of the most commonly used tools in Omics research. PA tools analyze data obtained from high-throughput technologies, identifying potentially perturbed genes in diseased samples compared to a control. In this sense, PA methods aim to overcome the difficulty of interpreting large lists of important genes, the main output of most basic high-throughput data analyses. In addition, PA methods give meaning to experimental high-throughput biological data, thereby enabling interpretation and subsequent hypothesis generation. PA achieves these goals by combining the biological knowledge held in databases with statistical testing, mathematical analyses, and computational algorithms.
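One common family of PA methods, over-representation analysis, gives a concrete picture of how database knowledge and statistical testing are combined. The following is a minimal sketch under stated assumptions, not the method of any particular tool: the gene identifiers, the pathway membership, and the function name `ora_p_value` are all invented for illustration. It computes the hypergeometric upper-tail probability of observing at least the given number of pathway members among the perturbed genes.

```python
from math import comb


def ora_p_value(background_size, pathway_size, hits_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    background_size: number of genes measured in the experiment
    pathway_size:    number of background genes annotated to the pathway
    hits_size:       number of genes flagged as perturbed
    overlap:         number of perturbed genes that belong to the pathway
    """
    total = comb(background_size, hits_size)
    upper = min(pathway_size, hits_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, hits_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy data (invented): 20 measured genes, a 5-gene pathway taken from a
# database, and 4 perturbed genes, 3 of which fall in the pathway.
background = {f"g{i}" for i in range(1, 21)}
pathway = {"g1", "g2", "g3", "g4", "g5"}
perturbed = {"g1", "g2", "g3", "g10"}

p = ora_p_value(len(background), len(pathway), len(perturbed), len(pathway & perturbed))
print(f"p = {p:.4f}")
```

In practice, a p-value of this kind would be computed for every pathway in the chosen database and then corrected for multiple testing.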

Advances in analytical techniques and extraction methods help detect a wide range of metabolites. The most widely used analytical techniques are mass spectrometry (MS) and nuclear magnetic resonance (NMR).

The conventional methodological pipeline of a metabolomics experiment combines several steps. It starts with biological sample acquisition and ends with the production of metabolic information. The pre-processed metabolomics data, from both MS and NMR, are typically organized into a feature quantification matrix (FQM). In this matrix, rows typically correspond to the samples, while columns correspond to the metabolomic features obtained. A metabolomic feature is usually characterized by the concentration of a metabolite. Data analysis techniques can then be applied using these metabolomic features as input.
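As a small illustration of the layout described above, the sketch below assembles a feature quantification matrix from per-sample measurements, with samples as rows and metabolite features as columns. The sample names and concentration values are invented, and the helper `build_fqm` is hypothetical; a feature that was not quantified in a sample is stored as `None`.

```python
def build_fqm(measurements):
    """Arrange per-sample measurements as a matrix (rows: samples, columns: features).

    measurements: dict mapping sample ID -> dict of {feature name: concentration}.
    Returns (sample_ids, feature_names, matrix); a feature that was not
    quantified in a sample is stored as None.
    """
    sample_ids = sorted(measurements)
    feature_names = sorted({f for values in measurements.values() for f in values})
    matrix = [
        [measurements[s].get(f) for f in feature_names]
        for s in sample_ids
    ]
    return sample_ids, feature_names, matrix


# Invented example values (arbitrary concentration units).
raw = {
    "sample_A": {"glucose": 5.1, "lactate": 1.2},
    "sample_B": {"glucose": 4.8, "lactate": 1.9, "alanine": 0.4},
}

samples, features, fqm = build_fqm(raw)
print(features)
for sample, row in zip(samples, fqm):
    print(sample, row)
```

Keeping the sample and feature labels alongside the matrix makes it straightforward to hand the same data to statistical or pathway analysis steps later on.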

Medical data is produced in massive volumes and requires very efficient tools to manage, store, and analyze it. High-throughput biological and clinical data is generated cost-effectively from various sources, such as mobile phones, sensor devices, electronic health records (EHR), patients, hospitals and clinics, researchers, and other organizations.

Big data tools in modern software systems enable remarkable research opportunities and innovation in the healthcare domain. New, interrelated paradigms such as Informatics & Data-Driven Medicine [5], eHealth [6] and mHealth [7], and Digital Health [8] are growing rapidly and gaining recognition among healthcare specialists and patients.

2. Metabolomics Databases

The vast and ever-growing quantity of experimental and computational chemical data needs to be stored, made accessible, and manipulated. Today, hundreds of database projects create and annotate biological knowledge, each with a dedicated context.

As a result, the current catalog of databases is large and diverse, differing in organism focus, curation approach, and the types of pathways and interactions covered, among other aspects. In addition, many databases are available to researchers for data mining and for sharing consistent chemical data for various purposes. For example, all pathway search tools depend on a database from which biochemical reactions and molecules can be retrieved to build the pathway of interest. This section discusses the databases related to metabolite annotation, metabolism, and metabolomics workflows.

The Reactome Knowledgebase [11] (Reactome.org, accessed on 1 April 2022) is a curated database of pathways and reactions in human biology, cross-referenced with several resources, such as the primary literature and other pathway-related databases. It focuses its manual annotation effort on a single species, Homo sapiens, and applies a single consistent data model across all domains of biology. Reactome describes a reaction as an event in biology that changes the state of a biological molecule. Degradation, activation, binding, translocation, and typical biochemical events involving a catalyst are all reactions. It presents molecular details of signal transduction, transport, metabolism, DNA replication, and other cellular activities. It contains 2546 human pathways and 1940 small molecules [11].

BioCyc (Biocyc.org, accessed on 1 April 2022) [12] is a collection of 19,494 Pathway/Genome Databases for model eukaryotes and thousands of microbes, together with software tools for exploring them. In addition, BioCyc comprises curated data from 130,000 publications. The MetaCyc and EcoCyc databases are freely available via BioCyc. However, access to the remaining BioCyc databases, such as HumanCyc (HumanCyc.org, accessed on 1 April 2022) [57], requires a paid subscription.

MetaCyc (MetaCyc.org, accessed on 1 April 2022) [13] is a broad database of metabolic pathways and enzymes from all domains of life. It includes 2937 pathways obtained from 3295 different organisms, making it the most extensive curated collection of metabolic pathways [13].

EcoCyc (EcoCyc.org, accessed on 1 April 2022) [14] is a database for Escherichia coli K-12 MG1655. EcoCyc presents a literature-based curation of the organism’s genome, transporters, metabolic pathways, and transcriptional regulation. New and improved data analysis and visualization tools include a circular genome viewer, an interactive metabolic network explorer, and several upgrades to the usability and speed of existing tools [14]. It mainly focuses on metabolic pathways and signaling.

The Metabolite Network of Depression Database (MENDA) [30] (http://menda.cqmu.edu.cn:8080/index.php, accessed on 1 April 2022) is a broad metabolite-disease association database that integrates existing knowledge and datasets on metabolic characterization in depression. For each study, it provides the study type, tissue type, organism, category of depression, sample size, platform (MS-based, MRS, NMR), and differential metabolites.

BiGG Models (BIGG.ucsd.edu, accessed on 1 April 2022) [15] is a biochemical, genetic, and genomic knowledge base of genome-scale metabolic network reconstructions. It includes more than 75 high-quality, manually curated genome-scale metabolic models. It also provides a comprehensive application programming interface for accessing the models with modeling and analysis tools. In addition, BiGG Models standardizes reaction and metabolite identifiers and supports pathway visualization.

Kyoto Encyclopedia of Genes and Genomes (KEGG) (www.kegg.jp/, accessed on 1 April 2022) [16] is an extensive and widely used database. It is a manually curated source incorporating 18 databases classified into genomic, systems, health, and chemical data.

The Braunschweig Enzyme Database (BRENDA) (www.brenda-enzymes.org, accessed on 1 April 2022) [17] contains comprehensive functional enzyme and metabolism data, such as measured kinetic parameters. Its core contains more than 5 million data points for almost 90,000 enzymes. In addition, BRENDA makes enzyme information accessible through quick and advanced searches, text- and structure-based queries for enzyme-ligand interactions, word maps, and enzyme data visualization.

PubChem (pubchem.ncbi.nlm.nih.gov, accessed on 1 April 2022) [18] is the world’s most extensive set of open and accessible chemical information from more than 750 data sources. It stores information in three primary categories: compounds, substances, and bioactivities. In addition, several research areas use PubChem as a big data resource, including machine learning and data science for drug repurposing, virtual screening, drug side effect prediction, metabolite identification, and chemical toxicity prediction. Furthermore, PubChem provides physical and chemical properties, safety and toxicity information, biological activities, literature citations, patents, and more.

Chemical Entities of Biological Interest (ChEBI) (www.ebi.ac.uk/chebi, accessed on 1 April 2022) [19] is an open-access dictionary of molecular entities focused on small chemical compounds.

The Human Metabolome Database (HMDB) (https://hmdb.ca, accessed on 1 April 2022) [20] is a broad resource providing information about human metabolites and their associated physiological, chemical, and biological properties. To date, HMDB contains 220,945 metabolite entries.

ChemSpider (chemspider.com, accessed on 1 April 2022) [21] is a freely accessible chemical structure database offering fast structure and text search over more than one hundred million structures from hundreds of data sources.

MetaboLights (https://www.ebi.ac.uk/metabolights, accessed on 1 April 2022) [22] is a database of metabolomics studies, their raw experimental data, and related metadata. MetaboLights is cross-technique and cross-species and includes metabolite structures and their related biological roles, reference spectra, concentrations and locations, and data from metabolic experiments. Users can upload their research datasets to the MetaboLights Repository and are then automatically given a unique and stable identifier to reference in publications.

The Metabolomics Workbench (metabolomicsworkbench.org, accessed on 1 April 2022) [23] is a public repository for experimental metabolomics data and metadata covering several species and experimental platforms, along with metabolite structures, metabolite standards, tutorials, protocols, training material, and other educational resources. It can combine, examine, deposit, track, and distribute large heterogeneous datasets from many MS- and NMR-based metabolomics studies. It covers over twenty diverse species, including humans and other mammals, insects, invertebrates, plants, and microorganisms.

The Small Molecule Pathway Database (SMPDB) (https://smpdb.ca, accessed on 1 April 2022) [24] is a comprehensive, interactive, visual database that includes over 48,000 pathways, most of which are not found in other pathway databases. SMPDB helps with pathway discovery and interpretation in metabolomics, proteomics, transcriptomics, and systems biology.

MetSigDis [25] (http://www.bio-annotation.cn/MetSigDis/, accessed on 1 April 2022) is a free web-based tool that offers a comprehensive resource of metabolite alterations in various diseases. The database contains 6849 curated associations between 2420 metabolites and 129 diseases across eight species, including humans and model organisms.

Virtual Metabolic Human [26] (VMH, www.vmh.life, accessed on 1 April 2022) is a web-based database capturing knowledge of human metabolism within five interlinked resources: Human metabolism, Disease, Gut microbiome, ReconMaps, and Nutrition. The VMH’s distinguishing features are (i) the inclusion of metabolic reconstructions of humans and gut microbes for metabolic modeling; (ii) seven human metabolic maps for data visualization; (iii) a nutrition designer; (iv) an accessible webpage and application programming interface to access the content; (v) a feedback option for community interaction; and (vi) links from its entities to 57 web resources.

WikiPathways [28] (wikipathways.org, accessed on 1 April 2022) is a reliable and rich pathway database that captures the collective knowledge of biological pathways. By providing a curated database in a machine-readable format, it supports visualization and omics data analysis.

The Relational database of Metabolomics Pathways (RaMP) [29] is a public database that combines biological pathways from WikiPathways, KEGG, Reactome, and the HMDB. RaMP maps metabolites and genes to biochemical and disease pathways. It can be used as a stand-alone resource (https://github.com/mathelab/RaMP-DB/, accessed on 1 April 2022) or incorporated into other tools (https://github.com/mathelab/RaMP-DB/inst/extdata/, accessed on 1 April 2022).
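The general idea of combining pathway annotations from several sources can be sketched as follows. This is a toy illustration only and does not reproduce RaMP’s actual schema or code: the source tables, the assumption of a shared metabolite identifier scheme, and the function `merge_pathway_sources` are all hypothetical, and the pathway names are placeholders.

```python
from collections import defaultdict


def merge_pathway_sources(sources):
    """Merge metabolite-to-pathway annotations from several databases.

    sources: dict mapping source name -> dict of {metabolite ID: set of pathway names}.
    Returns a dict of {metabolite ID: set of (source name, pathway name)} so the
    origin of every annotation is preserved.
    """
    merged = defaultdict(set)
    for source_name, annotations in sources.items():
        for metabolite_id, pathways in annotations.items():
            for pathway in pathways:
                merged[metabolite_id].add((source_name, pathway))
    return dict(merged)


# Placeholder annotations; a shared identifier scheme is assumed.
sources = {
    "source_1": {"met_001": {"pathway_a"}, "met_002": {"pathway_b"}},
    "source_2": {"met_001": {"pathway_a", "pathway_c"}},
}

merged = merge_pathway_sources(sources)
for metabolite_id in sorted(merged):
    print(metabolite_id, sorted(merged[metabolite_id]))
```

Recording the source next to each pathway keeps overlapping annotations distinguishable, which matters when the same pathway is described differently in different databases.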

Pathway Commons [27] (https://www.pathwaycommons.org, accessed on 1 April 2022) is one of the most extensive composite databases. It is an integrated resource of openly accessible information about biological pathways, covering biochemical reactions, transport and catalysis events, assembly of biomolecular complexes, and physical interactions involving DNA, RNA, proteins, and small molecules such as drug compounds and metabolites.

A variety of databases serve as metabolomics dataset repositories. For example, BioMagResBank (BMRB) (http://www.bmrb.wisc.edu, accessed on 1 April 2022) [58] is a public repository for NMR spectroscopy data from peptides, proteins, nucleic acids, and other biomolecules. In addition, the Golm Metabolome Database (GMD) (http://gmd.mpimp-golm.mpg.de/, accessed on 1 April 2022) [59] provides datasets of quantified biologically active metabolites and text search capabilities for GC-MS data. Moreover, the NIST Mass Spectral Library (https://www.NIST.gov/srd/NIST-standard-referencedatabase-1a, accessed on 1 April 2022) [60] is an extensive collection of EI MS, MS/MS, replicate spectra, and retention index datasets. Finally, the Spectral Database System (SDBS) (https://sdbs.db.aist.go.jp/, accessed on 1 April 2022) [61] is a spectral database for organic compounds that holds various MS, NMR, IR, Raman, and ESR datasets.

Taken together, Pathguide [62] is a useful starting point for surveying the landscape of pathway databases. Pathguide is a meta-database that contains information about 702 biological pathway-related and molecular interaction-related databases. The Pathguide categories include signaling pathways, metabolic pathways, pathway diagrams, gene regulatory networks, transcription factor targets, genetic interaction networks, protein sequence-focused resources, protein-protein interactions, and protein-compound interactions, among others.

Despite the growing number of chemical databases, a significant challenge is the difficulty of using metabolite and reaction information from databases such as KEGG, BRENDA, and MetaCyc because of representation inconsistencies, duplications, and errors. In addition, the same metabolite appears under several names across models and databases, which slows down the assembly of information from different data sources. Therefore, researchers designed the MetRxn database [63], Rhea [64], and RefMet [65] to standardize reaction and metabolite names. Additions and modifications are made to databases regularly to increase the quality and coverage of their biological knowledge. Some databases update their information frequently to keep pace with discoveries. For instance, the KEGG database [16] revises its data weekly, whereas other databases do so less often.

The choice of database should consider relative size, degree of overlap, and scope. For instance, KEGG comprises considerably more compounds than MetaCyc, but MetaCyc includes more pathways and reactions than KEGG. Pathway sets may also vary between databases in several ways, including the number of pathways present, the size of pathways, how pathways are curated (manually, automatically, or a combination of both), the organisms supported, and the pathway boundaries [66].

Interpreting metabolomics data has been challenging because understanding the relationships among dozens of altered metabolites has often relied on researchers’ biochemical assumptions and knowledge. However, recent biochemical databases provide information about the interrelations of metabolism that metabolomics analysis tools, i.e., mathematical and computational tools, can query automatically.
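The naming problem mentioned in this section, where one metabolite appears under several names, can be illustrated with a minimal synonym lookup. This is a sketch under stated assumptions and not how MetRxn, Rhea, or RefMet work internally: the identifiers `CPD_0001` and `CPD_0002` and both function names are hypothetical, and the synonym lists are short examples of names that are often used interchangeably.

```python
def build_synonym_index(synonym_table):
    """Map every known name (case-insensitive) to one canonical identifier.

    synonym_table: dict of {canonical ID: list of names used across databases}.
    Raises ValueError if the same name is claimed by two different identifiers.
    """
    index = {}
    for canonical_id, names in synonym_table.items():
        for name in names:
            key = name.strip().lower()
            if index.get(key, canonical_id) != canonical_id:
                raise ValueError(f"ambiguous name: {name!r}")
            index[key] = canonical_id
    return index


def standardize(names, index):
    """Return the canonical ID for each name, or None when the name is unknown."""
    return [index.get(name.strip().lower()) for name in names]


# Hypothetical identifiers with example synonyms.
synonyms = {
    "CPD_0001": ["glucose", "dextrose", "D-glucose"],
    "CPD_0002": ["lactate", "lactic acid", "2-hydroxypropanoic acid"],
}

index = build_synonym_index(synonyms)
print(standardize(["Dextrose", "Lactic acid", "unknown compound"], index))
```

Rejecting a name that maps to two identifiers, rather than silently picking one, surfaces exactly the kind of inconsistency that makes merging databases error-prone.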

This entry is adapted from the peer-reviewed paper 10.3390/metabo12101002

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.