Multi-Omics Approaches and Resources in the Plant Kingdom
Subjects: Plant Sciences

In higher plants, the complexity of a system and the components within and among species are rapidly being dissected by omics technologies. Multi-omics datasets are integrated to enable a comprehensive understanding of the life processes of organisms of interest. Further, growing open-source datasets, coupled with the emergence of high-performance computing and the development of computational tools for the biological sciences, have assisted the in silico functional prediction of uncharacterized genes, proteins and metabolites.

  • computational approaches
  • functional genomics
  • metabolomics
  • systems biology
  • transcriptomics

1. Introduction

The plant kingdom comprises photosynthetic eukaryotes, mainly green plants. The enormous variation among and within plant populations spans physical form, reproductive mechanisms, carbon assimilation strategies (photosynthetic metabolism), growth and development, and other factors such as responses to pests and pathogens, stressful environments and productivity [1]. Plants are constantly subjected to changes that are invisible to the human eye and therefore remain largely unknown.
The phenotype accounts for highly flexible differences which result from the genetics (G), environment (E), and genetics by environment interaction (GXE). The deoxyribonucleic acid (DNA) molecule is the central hereditary unit, as the genetic material is passed from one generation to the other. Composed of four different nucleotides (adenine, thymine, cytosine and guanine), DNA carries gene fragments that encode protein molecules, of which protein-encoding genes contribute to a relatively minor portion (2%) of the total genetic material (genome). The major fraction (98%) of the genome is represented by non-coding sequences, which may indirectly participate in the protein-coding gene expression mechanisms and actions. The central dogma of molecular biology maintains genetic integrity at each life cycle via replication (DNA–DNA), reverse-transcription (RNA–DNA), transcription (DNA–RNA) and translation (RNA–protein) [2]. On the other hand, gene regulatory elements (enhancers and silencers), non-coding RNAs such as microRNA (miRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), long non-coding RNAs (lncRNAs), and Piwi-interacting RNA (piRNA) are explicitly reported to affect gene expression levels, DNA methylation, alternative splicing events, and epigenetics [3,4].
While the study of the entire genetic material of an organism is known as genomics, the landscape of all the genes expressed (transcripts) at a given time/condition is referred to as the transcriptome. Transcripts are translated into protein molecules which may undergo further modifications to form small molecules of <15,000 Da (known as metabolites). These catalogues of proteins and metabolites synthesized at a given time/condition are studied in proteomics and metabolomics, respectively. Thus, transcripts, proteins and metabolites are central components driving the complexity of a biological organism. The growing application of various omics technologies has marked a burst of scientific and technological omics-based approaches offering a wealth of plant science information. “Omics” data are either interpreted independently or integrated via multi-omics analysis to answer critical questions in plant-based research [5].
Systems biology approaches (SBA) offer a plethora of virtual modelling systems equipped with in silico designs for gene function prediction [6]. Revolutionized by high-throughput omics technologies, SBA offers a vast amount of big data generated at the molecular level [7,8]. In parallel, computational biology has gained importance alongside SBA for dissecting and further improving the biological information of the target organisms [9,10]. Moving forward, conventional approaches that depend on sequence information to predict the putative biological functions (Gene Ontology classification) of a target gene have expanded robustly to accommodate organizational level-function annotations: the structural features of a given sequence, the interaction between the gene product and the cellular entity, and the phenotypic diversity of a population. In recent years, machine learning approaches and deep learning architectures such as feature-based and artificial neural networks (convolutional neural networks (CNNs) and recurrent neural networks) have been massively deployed in plant research [11,12]. Deep learning architectures have proven particularly advantageous. For example, in cis-regulatory element (CRE) prediction, a CNN requires no a priori knowledge of the target location and outperforms conventional k-mer enrichment, expectation maximization and Gibbs sampling methods, with a lower false positive rate [13,14,15].
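As a point of reference for the conventional k-mer enrichment baseline mentioned above, the following is a minimal sketch, not the method of any cited study. It compares k-mer frequencies in a foreground set (for example, promoters of co-regulated genes) against a background set. The function names, the pseudocount and the toy sequences are illustrative assumptions.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmer_enrichment(foreground, background, k, pseudocount=1.0):
    """Return the foreground/background frequency ratio for each foreground k-mer.

    A pseudocount (an assumption of this sketch) avoids division by zero
    for k-mers that never occur in the background.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    n_possible = 4 ** k  # number of distinct DNA k-mers
    fg_total = sum(fg.values()) + pseudocount * n_possible
    bg_total = sum(bg.values()) + pseudocount * n_possible
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / fg_total
        bg_freq = (bg[kmer] + pseudocount) / bg_total
        scores[kmer] = fg_freq / bg_freq
    return scores

# Toy data (hypothetical): each "promoter" carries the hexamer CACGTG.
promoters = ["TTCACGTGAA", "GGCACGTGCC", "AACACGTGTT"]
controls = ["TTAGCTAGAA", "GGATTCAGCC", "AACTGATCTT"]
enrichment = kmer_enrichment(promoters, controls, k=6)
top_kmer = max(enrichment, key=enrichment.get)
print(top_kmer, round(enrichment[top_kmer], 2))
```

Unlike a CNN, this baseline treats each k-mer independently and needs a fixed k chosen in advance, which is one reason learned models can do better on degenerate or spaced motifs.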

2. The Omics-Platform

2.1. Genomics

The development and application of next-generation sequencing (NGS) technologies have revolutionized crop improvement strategies, primarily through genome exploration and gene discovery [16,17]. Genomic studies infer the function and evolutionary history of plants, and with growing NGS technologies from providers such as Illumina, Pacific Biosciences, Beijing Genomics Institute (BGI), Twist Bioscience, 10x Genomics and Oxford Nanopore, the research output (scientific publications) has increased significantly over the last decade (2012 to 2022) (Figure 1). NGS technologies are robust tools for genome characterization (genome size and ploidy level) and for identifying genetic variation at the genome and/or population level. Genomic datasets are established by comprehensive methods that involve DNA isolation from the target species, sequencing, and sequence annotation using bioinformatics tools. Whole-genome sequencing (WGS) requires the entire DNA content of a single organism, while exome sequencing examines the coding DNA sequences (exons) of a genome. Another technique, genotyping-by-sequencing (GBS), employs restriction enzymes to reduce genome complexity and identify single nucleotide polymorphisms (SNPs) within a population. Epigenomics targets gene-regulating components such as DNA methylation [18,19].
Figure 1. Scholarly omics-related articles published under the plant sciences category from 2012 to 2022. The literature search was performed using the Web of Science search engine (https://www.webofknowledge.com), accessed on 18 September 2022, with the Boolean operator ‘or’ and the following keywords: genomic, genome, transcriptomic, transcriptome, proteomic, proteome, metabolomic and metabolome.
The decreasing cost of genome sequencing has led to a deluge of plant genome sequences, particularly for agricultural crops [20,21]. Sequencing price varies with the experimental design, and each design considers a myriad of technical features, such as the number of reads, read length, methodology and technology. The most widely used Illumina platforms for generating paired-end reads are HiSeq (100–250 bp) and MiSeq (up to 300 bp). The latter has a low throughput and is thus recommended for small genomes (<20 Mb). Next, PacBio emerged as a third-generation technology for sequencing complex genomes, with reads of about 2.5–80 kb. The detection principle is based on the nucleotide excitation of a single molecule, and the technology is subject to high error rates. The MinION by Oxford Nanopore yields up to 20 Gb and offers low cost and portability, but has an error rate considerably higher than that of PacBio. Another affordable NGS platform is BGISEQ, an emerging technology gaining a foothold in Asia, which generates single-end and paired-end reads of about 50–100 bp [22,23]. To date, Illumina remains the technology producing the highest-quality reads. The quality of read profiles generated by Illumina can be evaluated in real time, and poor-quality reads are assessed and filtered out using user-friendly applications such as FastQC [24], Cutadapt [25], AdapterRemoval [26], Skewer [27], and Trimmomatic [28].
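To illustrate the kind of read filtering described above, here is a minimal sketch, assuming Phred+33 quality encoding as used in current Illumina FASTQ files. It trims low-quality bases from the 3′ end and discards reads that become too short. It is a simplified stand-in and does not reproduce the algorithm of any of the listed tools; the function names, thresholds and reads are hypothetical.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_trailing(seq, qual, min_quality=20):
    """Remove bases from the 3' end while their quality is below min_quality."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], qual[:end]

def filter_reads(records, min_quality=20, min_length=5):
    """Trim each (name, sequence, quality) record and drop reads that end up too short."""
    kept = []
    for name, seq, qual in records:
        trimmed_seq, trimmed_qual = trim_trailing(seq, qual, min_quality)
        if len(trimmed_seq) >= min_length:
            kept.append((name, trimmed_seq, trimmed_qual))
    return kept

# Toy reads (hypothetical): 'I' encodes Q40 and '#' encodes Q2.
reads = [
    ("read1", "ACGTACGTAC", "IIIIIIII##"),
    ("read2", "ACGTAC", "II####"),
]
clean_reads = filter_reads(reads)
print(clean_reads)
```

In this toy input, read1 loses its last two bases and read2 is discarded because only two bases survive trimming.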
Plant genome assembly is challenged by genome size, the repetitive nature of the sequence and ploidy level (autopolyploid and allopolyploid). For example, the wheat genome of about 17 Gb features three independent sub-genomes [29]. Genome assembly is easiest when there is a single allele per locus, although that is not the usual case in most plant genomes. In a systematic comparison between plant and vertebrate genomes using an unbiased k-mer-based approach, plant genomes showed higher repeat contents [30].
Upon genome assembly, genome annotation is required to identify the functional elements present along the genome sequence [31]. Structural annotation, or gene prediction, adds biological meaning to the raw sequences and offers fundamental insights into the biology of the target species. However, even for high-quality genome assemblies, annotation is often challenged by the gene density and the abundance of introns in a genome. Three distinct computational approaches have been developed for detecting coding regions: ab initio (intrinsic), evidence-based (extrinsic) and genomic sequence comparison. Ab initio gene-finding software relies on models such as hidden Markov models (HMMs), conditional random fields, support vector machines and neural networks, which integrate information from both a content sensor and a signal sensor [31,32,33,34,35]. The content sensor classifies the DNA sequence as coding or non-coding, whilst the signal sensor identifies specific functional sites (such as donor or acceptor splice sites) throughout the genome [30]. Ab initio gene predictors, for instance, GenePRIMP [36], SnowyOwl [37], CodingQuarry [38], BRAKER1 [39], MAKER2 [40], MAKER-P [41] and Seqping [42], can thus be used in a pipeline to produce a reliable annotation of newly sequenced genomes.
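The content-sensor idea can be sketched with a two-state HMM decoded by the Viterbi algorithm. The example below is a toy under strong assumptions: the two states, all probabilities and the premise that coding regions are simply GC-rich are invented for illustration. Real gene finders use far richer models (codon structure, splice-site signal sensors) trained on annotated genes.

```python
import math

# Hypothetical toy parameters: 'n' = non-coding state, 'c' = coding state.
STATES = ("n", "c")
START = {"n": 0.5, "c": 0.5}
TRANS = {"n": {"n": 0.9, "c": 0.1}, "c": {"n": 0.1, "c": 0.9}}
EMIT = {
    "n": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},  # assumed AT-rich
    "c": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},  # assumed GC-rich
}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA sequence."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

toy_sequence = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
state_path = viterbi(toy_sequence)
print(state_path)
```

Working in log space avoids numerical underflow on long sequences, which matters once the same recursion is applied to whole chromosomes.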
The evidence-based method offers a cost-effective approach that exploits transcriptional evidence in the form of expressed sequence tags (ESTs) or complementary DNA (cDNA) [43]. The genomic sequence comparison method evaluates the content sensor relative to the genomic DNA sequences of other organisms [44]. Among the notable comparative gene-finding predictors, CONTRAST [44] achieves higher exon- and gene-level sensitivity and specificity than earlier predictors such as N-SCAN, TWINSCAN [45] and GENSCAN [46]. The ab initio and genomic sequence comparison methods are generally less reliable than the evidence-based method, because their automatic predictions depend on training datasets and on algorithms whose limitations often result in errors.
Genome sequence data facilitate comparative genomic studies that infer the functions of unknown genes [47,48], enable the reconstruction of metabolic pathways [49,50] and advance the understanding of evolutionary relationships between and among species [51]. Genome annotation is generally performed using sequence similarity searches, whereby predicted protein-coding genes are matched against known proteins available in open repositories [48,52]. To date, plant genomic information can be retrieved from public databases such as NCBI [52] and Ensembl Plants [53], as well as PlantGDB [54], PLAZA [55], Gramene [56] and Phytozome [57].

2.1.1. Genomic-Assisted Gene Discovery for Crop Improvement

Genomics is the key enabler of the five Gs of crop improvement: (i) genome assembly, (ii) germplasm characterization, (iii) gene function identification, (iv) genomic breeding and (v) gene editing [58]. Crops with established genome assemblies are research-friendly, as computational analyses become far more feasible. Plant genetic resources play a fundamental role in leveraging maximum genetic gain in a breeding program. Genetic variation under natural settings offers breeders the basis for selection and further exploitation for crop improvement. Highly valuable agronomic traits such as yield, yield-related traits, and resistance to biotic and abiotic stresses are amongst the most widely exploited traits for further modification [59]. Generally, mining desirable genetic variants for subsequent improvement is the underlying principle of crop genetic improvement. Population-level characterization of genetic variation includes the identification of deletions, insertions, transversions, copy number variations and single nucleotide polymorphisms (SNPs). A germplasm collection holds broad genetic diversity; thus, the accurate characterization of large-scale germplasm remains challenging. Nevertheless, advances in genotyping and phenotyping technologies have revolutionized genomic breeding (GB) approaches.
Early GB methods were developed using markers specifically associated with genes and with quantitative trait loci (QTL) that have major effects on a trait. These methods, extensively applied in early GB programs, include marker-assisted selection (MAS), marker-assisted backcrossing (MABC) and marker-assisted recurrent selection (MARS) [60]. Later, in the quest for genetic gain and enhanced breeding efficiency, improved methods emerged: genome-wide association study (GWAS), expression QTL (eQTL), haplotype-based breeding, forward breeding (FB), genomic selection (GS) and speed breeding (SB) [60,61].

2.1.2. Single Cell Sequencing

A single cell is the basic structural and functional unit of living organisms. The formation and function of higher-level tissues and organs are influenced by various genetic mechanisms along with stimuli from the cellular environment. Cell heterogeneity refers to the diverse cell states formed throughout cell growth (genetic and molecular biological changes). Despite their highly specialized structures and functions, the cells of a multicellular organism share an identical genome and the same set of genetic instructions for building a functional organism. Single-cell genomics offers cell-specific information on an organism’s genetics, capturing the dynamics of cell physiology [62].
The discovery of cell-specific transcription, tissue-specific spatial gene expression, the role of cell localization, the binding and activity of transcription factors, and the chromatin and cis-regulatory signatures of a system of interest is now feasible with the growing number of commercial and specialized systems designed to resolve cell-specific activities. Chromatin accessibility profiling methods such as DNase I hypersensitive site sequencing and the assay for transposase-accessible chromatin with sequencing (ATAC-seq) measure the chromatin accessibility of plant regulatory DNA at the population level [63]. A disadvantage of these methods is their tendency to mask the cell-specific and rare events of a target tissue. Alternatively, improved but costly systems such as single-cell ATAC-seq assays (integrated co-encapsulation or barcoding of individual cells) perform sequencing at the single-cell level [64]. In transcriptional profiling using the scRNA-seq method, the following strategies are most frequently employed: (i) fluorescence-activated cell sorting (FACS), (ii) isolation of nuclei tagged in specific cell types (INTACT) and (iii) laser capture microdissection (LCM). Both FACS and INTACT are restricted to selected plant species, whereas LCM is applicable to a far broader range of plant species. In general, these methods lack markers corresponding to the different differentiation states of the cell types [65].
The establishment of the Plant Cell Atlas in 2019 marked the trajectory of single-cell studies in the plant research community. Comprehensive high-resolution plant cell information (nucleic acids, proteins and metabolites) is built and shared among the scientific community [66]. Single-cell RNA sequencing (scRNA-seq) resolves cell-to-cell heterogeneity using high-throughput technologies such as Drop-seq, Chromium, Seq-Well, SMART-seq3 and ICELL8 [67]. These methods differ in the following features: (1) the target mRNA region (5′, 3′ or full length), (2) the number of cells, (3) the cell preparation technique (droplets, cell sorting or nanowells), (4) the use of unique molecular identifiers (UMIs), which label individual mRNA molecules, (5) cell size, and (6) method availability. In previous studies, scRNA-seq applied to tissues of several species (Arabidopsis, rice, peanut, maize) revealed high heterogeneity, highlighting the expression signatures of cell types and developmental trajectories [68]. The conventional RNA-seq method yields bulk information (the average gene expression of the sample), whereas scRNA-seq yields pools of information, each corresponding to a different type of cell present in the sample. Cell preparation is the greatest challenge in obtaining good results with accurate interpretations. Optimizing protoplast isolation is vital and should consider the following factors for a typical plant cell: cell density, cell wall thickness, digestion efficacy (influenced by cuticle, lignin, suberin and other depositions), enzyme type and requirement, and enzyme digestion time [67,69].
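The role of UMIs in separating scRNA-seq from bulk averaging can be sketched as follows. This is a simplified illustration that assumes reads have already been assigned to a cell barcode, a UMI and a gene. It counts distinct UMIs per cell and gene, so PCR duplicates of the same molecule are counted once. The barcodes, UMIs and gene names are hypothetical, and real pipelines additionally correct sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict

def umi_count_matrix(reads):
    """Build {cell: {gene: molecule_count}} from (cell, umi, gene) tuples.

    Reads sharing the same cell, gene and UMI are treated as copies of one
    original mRNA molecule and are therefore counted only once.
    """
    seen_umis = defaultdict(set)
    for cell, umi, gene in reads:
        seen_umis[(cell, gene)].add(umi)
    counts = defaultdict(dict)
    for (cell, gene), umis in seen_umis.items():
        counts[cell][gene] = len(umis)
    return dict(counts)

# Toy reads (hypothetical identifiers); the first two are PCR duplicates.
reads = [
    ("cell1", "AAAC", "geneA"),
    ("cell1", "AAAC", "geneA"),
    ("cell1", "GGTT", "geneA"),
    ("cell1", "AAAC", "geneB"),
    ("cell2", "CCGA", "geneA"),
]
matrix = umi_count_matrix(reads)
print(matrix)
```

Summing the resulting counts over all cells would approximate the bulk profile, which shows how bulk RNA-seq loses the per-cell structure that this matrix retains.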

2.1.3. Genome-Wide Association Study (GWAS)

Amongst the genomic breeding methods, genomic selection (GS) is the most preferred tool for breeding programs, as the method does not rely entirely on diagnostic markers; instead, breeding lines are selected according to genomic-estimated breeding values (GEBVs) generated from genome-wide marker datasets. GS gathers the additive effects of all the genes governing the genetic variance of a given trait. With each gene imparting a relatively small effect, the number of genes controlling a single trait may range from hundreds to thousands [60,70]. Using genome-wide marker and phenotype information, the GS method establishes the association between markers and phenotypes in an observed population. GS analysis was first performed following Fisher’s infinitesimal model and was soon extended to the genomic best linear unbiased prediction (GBLUP) model. The latter accommodates GXE interactions and thus offers a more accurate prediction [61,71]. Later, Markov chain Monte Carlo and Bayesian modelling methods were developed to include non-additive effects, such as those arising under adverse environmental conditions. In the GS method, a training/reference population of individuals with the information of interest (genotype and phenotype) is used to train machine learning prediction models, which are then applied to the test population or selection candidates. The prediction accuracy is affected by the size of the training population, the density/number of genome-wide markers and the heritability of the trait of interest [72].
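The train-then-predict workflow of GS can be sketched with ridge regression on marker genotypes (often called RR-BLUP), used here as a simple stand-in for the models named above and not as the implementation of any cited study. The sketch assumes markers coded as -1, 0 and 1, phenotypes that are already centered, and no fixed effects. The penalty value, genotypes and "true" effects are invented for illustration.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def ridge_marker_effects(genotypes, phenotypes, penalty):
    """Estimate marker effects u from (Z'Z + penalty * I) u = Z'y."""
    columns = transpose(genotypes)
    ztz = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    for i in range(len(ztz)):
        ztz[i][i] += penalty
    zty = [sum(z * y for z, y in zip(col, phenotypes)) for col in columns]
    return solve(ztz, zty)

def gebv(genotypes, effects):
    """Genomic-estimated breeding value: sum of marker genotype times marker effect."""
    return [sum(z * u for z, u in zip(row, effects)) for row in genotypes]

# Toy training population (hypothetical): 5 lines, 3 markers, noise-free phenotypes.
train = [[1, 0, -1], [0, 1, 1], [-1, 1, 0], [1, 1, 1], [0, -1, 1]]
true_effects = [2.0, -1.0, 0.5]
phenotypes = gebv(train, true_effects)

effects = ridge_marker_effects(train, phenotypes, penalty=1e-9)
candidates = [[1, -1, 1], [-1, 1, -1]]  # genotyped but not phenotyped
predictions = gebv(candidates, effects)
shrunk = ridge_marker_effects(train, phenotypes, penalty=10.0)
print(predictions)
```

With a near-zero penalty and noise-free data, the toy effects are recovered almost exactly. In practice the penalty reflects the ratio of residual to marker variance, and larger values shrink all marker effects toward zero, which is what makes the approach usable when markers far outnumber phenotyped lines.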
Genomics, together with advanced genomic tools, open-source genome resources and powerful technologies, has accelerated crop breeding through rapid trait discovery techniques. Proposed 15 years ago, genomics-assisted breeding (GAB) has now expedited a broad range of breeding programs for enhancing disease resistance and improving tolerance to abiotic stresses such as submergence, salinity and drought. In rice, “Improved Samba Mahsuri”, a GAB product, carries the Xa21, xa13, xa5 and xa38 genes for resistance to bacterial blight (BB) disease (causal pathogen, Xanthomonas oryzae), along with the blast disease (causal pathogen, Magnaporthe oryzae) resistance genes Pi-2 and Pi-54 [73,74,75].

2.1.4. Pan-Genomics

There are about 390 thousand land plant species, and their genomes are highly complicated (highly repetitive DNA content, polyploidy and heterozygosity) and diverse (genome size varying from 60 Mb to 150 Gb). Plant genome changes arise from the evolutionary forces that shaped plant speciation and evolution. Pan-genomics, a subset of plant genomic research, is highly suitable for plant species with extensive genetic diversity at the population level. Pan-genomes have been developed for important agricultural crops and model plants such as rice, Arabidopsis, barley, soybean, maize, wheat and tomato [76]. The key principle of pan-genomics is the comparison of high-quality genomes to provide insights into the collection of core and dispensable genes in a species population. Generally, a single genome or a small number of genomes does not make a good sample for pan-genome construction. Integrating many high-quality genomes is important to obtain comprehensive genetic information on the target population [77]. Genes are designated as the basic units defining a pan-genome. Pan-genome studies are most useful for understanding plants with a wide spectrum of genetic diversity and gene pools. In brief, the pan-genome strategy first establishes a target population of highly diverse individuals. A good selection of representative individuals is reflected by phenotypic diversity, as determined by the phylogenetic relationships among the individuals of the population. Next, high-quality genomes are assembled from long reads and annotated using automatic annotation pipelines. The construction approaches available for pan-genome analyses include de novo assembly (detects variant types and classifies genes into core and dispensable), iterative assembly (based on a single reference genome), and the graph-based assembly strategy (utilizes graphs built from a reference genome to represent diversity and variation).
Comprehensive tools and pipelines popularly employed in pan-genome analyses were exhaustively described by Li et al., 2022 [78].
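The core versus dispensable classification described above can be sketched with set operations. The example assumes that gene families have already been annotated and clustered across accessions, so each accession is represented by a set of gene family identifiers. The accession and gene names are hypothetical.

```python
def classify_pan_genome(presence):
    """Split gene families into pan, core and dispensable sets.

    presence maps each accession name to the set of gene families it carries.
    Core genes occur in every accession; dispensable genes are missing from
    at least one accession.
    """
    gene_sets = [set(genes) for genes in presence.values()]
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    dispensable = pan - core
    return pan, core, dispensable

# Toy presence/absence data (hypothetical accessions and gene families).
presence = {
    "accession1": {"g1", "g2", "g3"},
    "accession2": {"g1", "g2", "g4"},
    "accession3": {"g1", "g2"},
}
pan, core, dispensable = classify_pan_genome(presence)
print(sorted(core), sorted(dispensable))
```

Because the core set can only shrink and the pan set can only grow as accessions are added, a small sample tends to overestimate the core, in line with the point above that a few genomes do not make a good sample.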

2.2. Transcriptomics

A transcriptome is an atlas of the RNA transcripts of a tissue or cell under a defined condition [79]. Using the genome information, a transcriptome is “read” to obtain a comprehensive description of the genes expressed at a given time point. The mapping and quantification of transcriptional activity are central to transcriptome studies. In the modern era, transcriptomes are produced by either microarray [80] or RNA-sequencing (RNA-seq) technology [81]. The latter is preferred by the plant research community due to its higher precision in capturing lowly expressed RNAs and isoforms [81]. RNA-seq also detects a greater percentage of novel transcripts than the microarray [82,83]. In most transcriptome data analyses, the raw count data are subjected to differentially expressed gene (DEG) analysis, co-expression network construction and other analyses such as alternative splicing and isoform analysis [84,85]. Both DEG and network analyses are used extensively to discover genes underpinning various biological processes, such as plant defense response [86], regulation [87], water stress (such as JAZ1) in G. arboreum [88], desiccation tolerance and drought (such as LEA) in A. thaliana seeds [89], cellulose synthase in secondary cell wall synthesis [90] and cell wall-related genes in A. thaliana [91].
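The first step of a DEG analysis on raw counts can be sketched as library-size normalization followed by a log2 fold change. This is only an illustration of the arithmetic: it assumes one sample per condition and uses counts per million (CPM) with a pseudocount, whereas real DEG tools model replicates and test for statistical significance. The gene names and counts are hypothetical.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts by library size so that samples are comparable."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) on CPM values; the pseudocount avoids log of zero."""
    control_cpm = counts_per_million(control)
    treated_cpm = counts_per_million(treated)
    return {
        gene: math.log2((treated_cpm[gene] + pseudocount) / (control_cpm[gene] + pseudocount))
        for gene in control
    }

# Toy raw counts (hypothetical): both libraries contain 1000 reads.
control_counts = {"gene1": 100, "gene2": 100, "gene3": 800}
treated_counts = {"gene1": 400, "gene2": 100, "gene3": 500}
fold_changes = log2_fold_change(control_counts, treated_counts)
for gene, value in fold_changes.items():
    print(gene, round(value, 2))
```

Here gene1 is about fourfold up (log2 fold change close to 2), gene2 is unchanged and gene3 is down. Whether such differences are significant cannot be judged without replicates.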
In 2002, the Gene Expression Omnibus (GEO) was established as an open repository for gene expression data obtained from various platforms such as microarrays, serial analysis of gene expression (SAGE) and other sequence-based methods [92]. Since then, the number of open-source, species- and condition-specific gene expression repositories for plants has been on the rise: The Arabidopsis Information Resource (TAIR) [93], TRAVA [94], RiceXPro [95], Transcriptome Encyclopedia of Rice (TENOR) [96], Barley Gene Expression Database (Bex-db) [97], and Plant Stress RNA-Seq Nexus (PSRN) [98] (Table 1).
Table 1. Plant omics databases, as accessed on 24 August 2022.
Transcriptome data have supported genome-scale reconstructions in previous studies: the starch biosynthesis of Manihot esculenta [99], light and temperature acclimation in Arabidopsis thaliana [100], and the biosynthesis of biotic stress-regulated pathways (i.e., tryptophan, auxin and serotonin) in Oryza sativa [101]. Profiling high and low levels of mRNA transcription has improved the understanding of how the genome responds, especially the mechanistic associations between cellular trade-offs and epistatic gene interactions [102,103].

Transcriptome-Wide Association Studies: Prediction of Genes Governing Complex Traits

Global transcriptional activity measured by transcriptome-wide association studies (TWAS) offers a fundamental understanding of the spatiotemporal regulation of transcription events in plants [148]. Transcription causes variation, often observed as a collection of events resulting from altered coding sequences. Both mRNA and protein expression are spatial and temporal targets for selecting variations caused by the coding sequences. TWAS unravels endophenotypes, that is, variation predominantly caused by genetic factors. This feature is highly valuable for prioritizing candidate genes governing complex agronomic traits. TWAS was recently proposed as a powerful tool to identify trait-associated gene expression based on GWAS summary data [149]. In combination with GWAS, TWAS increases the power to detect unknown genes and offers a selection of prioritized causal genes [150,151].

This entry is adapted from the peer-reviewed paper 10.3390/plants11192614
