Developing Genomic Resources for Crop Improvement: History

Emerging sequencing technologies aim to generate more data with fewer inputs and at lower cost. This has translated into an increase in the number and type of genomics applications, alongside enhanced computational capacities (both hardware and software). As the DNA sequencing landscape has evolved, bioinformatics research teams have also evolved to accommodate the increasingly demanding techniques used to combine and interpret data, leading many researchers to move from the lab to the computer.

  • sequencing technologies
  • assemblies
  • crop
  • plant genomic

1. Introduction

After more than 40 years of remarkable improvements in DNA sequencing, the development of cheaper, higher-throughput sequencing technologies, along with relevant bioinformatics tools, has made it possible to produce high-quality genome assemblies on a much shorter timeline. This has in turn enabled the mapping of genetic variation in thousands of individuals, providing genetic insights into population histories and domestication events. The Earth BioGenome Project (EBP), a multinational and multi-institutional consortium, aims to unify the phylogenetic networks across all eukaryotic life derived from their complete de novo genomes [1,2]. This illustrates how far the standardization of genome data generation, assembly, storage, retrieval, and analysis has advanced; further progress is expected and required as massive genomic datasets are generated from species that bridge the phylogenetic gaps between currently sequenced genomes.
Complete reference genome assemblies of the entire plant kingdom will open new scientific views on the evolution and speciation events on earth and genetic control of plant traits, both at intra- and inter-species levels. They will also enhance the understanding of how plants function in ecosystems, lead to the discovery of natural botanical compounds for human medicine, and will aid an increase in food production to curb global hunger while respecting planetary boundaries and adapting to climate change.

2. Plant Genomic Resources (Big Data Generation)

Sequencing technologies, mainly high-throughput NGS sequencers, generate significant amounts of data. For example, the recent Illumina sequencer (NovaSeq 6000) has a higher output than earlier generations of sequencing machines, producing 1300–20,000 million reads (65 Gb to 3 Tb). PacBio long reads reach a maximum of 300 Kb, and the data generated with the Sequel I, II (CLR), and II (HiFi) range from 0.5 million to 400 million reads (15 Gb to 100 Gb), while nanopore sequencing (MinION and PromethION) yields 2.5–12 million reads (40 Gb to 180 Gb).
With this capacity, land plants spanning a wide range of genome sizes can, in theory, be sequenced to good coverage of the entire genome. For example, the corkscrew plant Genlisea margaretae with a 1C value of 0.07 pg (65 Mb) and the canopy plant Paris japonica with a 1C value of 152.2 pg (148.9 Gb) are equally accessible in terms of raw sequence generation and coverage [53] (https://cvalues.science.kew.org/). Generating several-fold coverage produces potentially massive datasets, ranging from Gb to Tb of sequence information. Depending on the scope of the project, handling such large datasets is a major concern for small (or even big) research labs. Decades ago, geneticists were mostly involved in lab work; now, the most limiting factor is the computational analysis needed to derive meaning from the data. Understanding the algorithms and processing the data are crucial parts of genetics and genomics data analysis when searching for biological meaning.
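The relationship between a 1C value in picograms and a genome size in base pairs can be sketched as follows. This is a minimal illustration that assumes the commonly used approximation of 978 Mbp per picogram of DNA, a factor that is not stated in the text above; the function name is hypothetical. With the rounded 1C value of 0.07 pg the sketch gives roughly 68 Mbp rather than the 65 Mb quoted, which is presumably a rounding effect in the C-value.

```python
# Assumption: 1 pg of double-stranded DNA corresponds to about 978 Mbp.
# This conversion factor is a widely used approximation, not a value from the text.
MBP_PER_PG = 978.0


def c_value_to_mbp(c_value_pg: float) -> float:
    """Convert a 1C value (picograms) to an approximate genome size in Mbp."""
    return c_value_pg * MBP_PER_PG


# 1C values quoted in the text for the two extremes among land plants.
paris_mbp = c_value_to_mbp(152.2)     # Paris japonica
genlisea_mbp = c_value_to_mbp(0.07)   # Genlisea margaretae (rounded 1C value)

print(f"Paris japonica: {paris_mbp / 1000:.1f} Gbp")
print(f"Genlisea margaretae: {genlisea_mbp:.1f} Mbp")
```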
Genomic sequencing is a field where handling big data and its processing requires a suitable storage and data transfer platform, such as is present in cloud technologies. These are extensively applied to enhance the availability of the data to all researchers in a project and indeed researchers worldwide. The genome sequence data generated for a crop genome project are immense; for example, a single Sorghum genome sequence contains over 50 gigabytes of raw data (depending on the data format generated), and processing the data for large population-wide studies, such as finding deeper scientific insights, marker–trait association, analyzing diversity, domestication, and assessing data from gene-editing technologies, requires robust storage and computing capacities.
To maintain the uniformity of the data in the global databases, the member databases (GenBank, EMBL, DDBJ, CNGBdb, IBDC) of the International Nucleotide Sequence Database Collaboration (INSDC) [54] share and update genomic data periodically.
The recent GenBank statistics release reports 16.7 trillion nucleotide bases for 1.7 million whole genome sequences (as of June 2022) (GenBank and WGS Statistics, ncbi.nlm.nih.gov). Of these, green plant data (Viridiplantae) alone account for 93.8 million sequences from 2324 genomes (including variants of the same plant species genome), comprising 33.4 million genomic DNA/RNA sequences, 41.5 million mRNA sequences, and 80,709 rRNA sequences.
With the increasing complexity of genomic data themselves, the major databases also integrate other genomic features and provide tools to search and retrieve these datasets. The Entrez system of NCBI is one such tool allowing users to search, view, and download the sequences from GenBank. Other modes of data accessibility allow for downloading from the FTP site (ftp.ncbi.nlm.nih.gov) or downloading data programmatically with the provided public API to the Entrez system (https://eutils.ncbi.nlm.nih.gov).
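Programmatic access through the Entrez public API can be sketched as below. This is a minimal, hedged example that only builds request URLs for the `esearch` and `efetch` endpoints and does not contact the server; the helper names, the search term, and the placeholder IDs are illustrative assumptions, not values from the text.

```python
from urllib.parse import urlencode

# Base path of the NCBI E-utilities service under the host named in the text.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that returns matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db: str, ids: list[str]) -> str:
    """Build an EFetch URL that returns the given records in FASTA format."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical query: nucleotide records for sorghum.
search = esearch_url("nucleotide", "Sorghum bicolor[Organism]", retmax=5)
# Placeholder IDs; real IDs would come from the ESearch response.
fetch = efetch_fasta_url("nucleotide", ["1", "2"])
print(search)
print(fetch)
```

The resulting URLs can be passed to any HTTP client; for large batches it is worth consulting the NCBI usage guidelines on request rates before running such queries in a loop.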
Numerous databases have been developed for genomic data to suit a variety of purposes. Based on their data catchment, databases range from global repositories holding the sequences of all species, like Ensembl Plants, the National Center for Biotechnology Information (NCBI), PlantGDB, and the Plant Genome Database Japan (PGDBj), through medium-sized databases hosting only plant genome assemblies/annotations, like Phytozome and the Legume Information System (LIS) (https://www.legumeinfo.org), to smaller databases containing crop/plant-specific information, such as the chickpea SSR database (https://cegresources.icrisat.org/CicArMiSatDB/index.html) [55] and the chickpea SNP and indel database (https://cegresources.icrisat.org/cicarvardb/) [56]. However, the medium to smaller databases are limited to the scope of species-level data, like the LIS and the proposed angiosperms database [57], and may not need powerful bioinformatics tools and computational resources to explore terabytes of genomic data; many such databases were discussed earlier in [58].

3. Plant Genome Assemblies

Genome assembly refers to aligning and merging small fragments of DNA sequence to reconstruct the genome sequence in its original order and orientation. High-throughput sequencing with first- and second-generation sequencers has enabled the assembly of many plant genomes. The highly fragmented genome assemblies generated with short reads have been improved with long-read assemblies, which simplify the generation of chromosome-level assemblies and reduce the reliance on dedicated research experts.
Thanks to the NGS technology and increased computational power, the standard of the genome assemblies available has improved significantly. Genomics has accelerated its growth in the past decade from draft-level genome assemblies to reference-level genome assemblies [78,79,80].
The plant genomes assembled in the first-generation sequencing (FGS) era faced significant throughput issues and were limited by a read length of around 1 Kb. This necessitated approaches such as BAC-end reads and BAC barcoding to allow contigs to be linked and positioned through genetic mapping. Far fewer plant genomes were assembled in the FGS era than in the second-generation (SGS) and third-generation (TGS) sequencing eras, primarily due to the lower throughput and high cost of FGS. The situation changed sharply with SGS, as the volume of sequence (although not the read length) increased significantly. Long-read sequencing technologies play a crucial role in genome assembly projects because they help in scaffolding the contig sequences, and thus many genome projects were initiated with combined SGS and TGS technologies. With the advent of advanced sequencing technologies such as PacBio HiFi sequencing, which produces 10 to 30 Kb circular consensus sequences (CCS) and thus reduces error rates [11], Oxford Nanopore long-read protocols [81], Hi-C scaffolding [32], and optical mapping technologies, such as Bionano [82], it is possible to assemble complex genomes. The emerging third-generation sequence data have boosted genome assembly quality, enabling chromosome-level assemblies by overcoming the limitations of short-read assembly, particularly in plants, where islands of repeat sequences need to be bridged between the gene-rich regions of the chromosomes. With low-cost, high-throughput sequence data generation, at least 1143 plant reference assemblies have been published (www.plabipd.de). Depending on the availability of funds and the feasibility of high-volume sequence data generation, multiple individuals of the same species have been assembled de novo, e.g., potato [83], or the genome assembly of the same variety has been improved, such as for chickpea [84,85] and sesame [86].
The development of long-read technologies as part of TGS allowed for a relatively simple assembly of smaller genomes. With optical and chromatin-based methods, such as Bionano and Hi-C, far larger and more comprehensive genome assemblies are now possible; these draw on a range of techniques, including the integration of scaffolds into chromosomes through genetic mapping.
In recent years, gold-standard and platinum-standard chromosome-level genome assemblies have been achieved in prominent model crop plants [87,88,89,90,91,92]. Here, gold-standard assembly refers to cases where the number of superscaffolds matches the number of haploid chromosomes, yielding a chromosome-level assembly; a platinum-standard assembly refers to a telomere-to-telomere (T2T) assembly with the final scaffolds matching the number of haploid chromosomes. This era has led to gold- or platinum-standard assemblies in crop plants, and publications meeting these standards continue to appear [93]. The importance of platinum-standard reference genome assemblies, and of comparing cultivated species with their wild relatives, has been documented for rice [94].
Chromosome-level genome assemblies were initiated with Arabidopsis in 2000 [95] and later with rice in 2005 [96]. These assemblies were generated with the traditional, expensive, and low-throughput Sanger sequencing method. With current third-generation approaches (such as PacBio HiFi, Hi-C, and optical mapping methods), it is possible to generate chromosome-level pseudomolecules [97]. With PacBio sequence data, a chromosome-level assembly was first achieved for Arabidopsis [98], followed by Oropetium [99]. Similar to PacBio long reads, ONT generates reads of around 200 Kb, which are highly suitable for bacterial genome assembly [100]. Synthetic long reads (SLR) are long reads reconstructed from Illumina short-read data [101]. In total, 113 plant species have published chromosome-level genome assemblies (as of the end of 2022) (www.plabipd.de) of the total assembly number of 1143 flowering plants, and 125 are non-flowering plants. Most of these near-complete plant genomes were produced with sequence data generated from multiple technologies. The long-read 10× Genomics data were combined with short-read Illumina data to assemble the blueberry genome [102]. PacBio and Hi-C sequencing technologies were used for assembling the octoploid sugarcane genome [103], allotetraploid peanut [104], and teff [105].
Several novel technologies have emerged (such as optical mapping [106], the Irys system by BioNano Genomics (www.bionanogenomics.com), and chromosome conformation capture sequencing (Hi-C) [32]) to improve scaffolding without depending on genetic mapping. These advances in genome assembly have recently been extended further to generate telomere-to-telomere (T2T) assemblies, as first implemented in 2020 for the X chromosome of the human genome [107] and later adapted to plants, such as Arabidopsis [108,109], rice [110], and banana [111]. The combination of PacBio sequencing with the modified Hi-C protocol from Dovetail Genomics has improved the assembly contiguity for A. alpina [112]. The high-resolution gap-free T2T genome assemblies ensure the capture of all the repetitive sequences and genomic variants without any misassemblies.
The greatest bioinformatics challenge for sequencing plant genomes was repetitive sequences, leading to sequencing errors and unrecognizable assembling errors at earlier stages of assembly computation. As the plant genome size and ploidy or repeat content increases, the complexity of assembly of the sequence reads correctly also increases, and thus the assembly programs used in these genome projects needed increasingly sophisticated strategies (such as chromosome flow sorting methods used in wheat) to handle such challenges. Additionally, handling the terabytes of sequence data and storage and managing the computing clusters and complexity of the algorithms also need to be addressed.
In addition to improving the quality of reference genomes to platinum-standard, present-day technologies paved the way for the transformational shift from the representative single genotype’s genome sequence to the pan-genome sequence as a reference for a better understanding of the variability present within a species [113]. The advantages of the pan-genome reference are being realized in generating novel insights and the identification of the genes or genomic regions underlying the important agronomical traits and domestication process [86,114,115,116,117,118].

4. Genome Assemblers

As sequencing technology evolved, assembly approaches also had to evolve. The Celera Assembler and Arachne were developed to handle the genomes of the fruit fly (Drosophila melanogaster) and human in 2000–2003; later, AMOS was launched under an open-source framework. These assemblers were based on overlap–layout–consensus on an overlap graph [120], in which the nodes are the reads and the edges represent the sequence shared between reads. This type of assembler is suitable for assembling FGS reads produced by the dideoxy termination method (Sanger sequencing). As massively parallel high-throughput sequencing technology was developed to produce millions of bases (in SGS), reads became shorter and more error-prone but came at higher genome coverage. The leading Illumina technology of SGS/NGS sequencing yields 35–150 bp paired-end reads from fragments with a 200–300 bp insert size. Such high-throughput data required a new approach, and thus de Bruijn graph-based assembly was developed [121,122], where the nodes represent fixed-length strings drawn from a larger set of strings, and the edges represent perfect shared sequences. However, de Bruijn graph-based assemblers have difficulties handling sequencing errors and need high computational power (100+ GB of memory). The challenges of uneven genome coverage and of reads too short to span repeated regions can be addressed by combining many short reads with fewer longer reads or mate-pair reads (Sanger, 454, and Illumina sequencing methods). Multiplex de Bruijn graphs automate the assembly of long HiFi reads [123], and the recently updated Minimap2 version can be used for long-read assembly [124]. Newbler, released in 2004, was the first assembler for 454 sequence data, followed by a hybrid version of the MIRA assembler for 454 reads mixed with Sanger reads.
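The de Bruijn graph idea described above can be sketched on a toy example. This is a simplified illustration, not the method of any named assembler: nodes are k-mers, an edge links two k-mers that are adjacent in a read (and so share k-1 bases), and the reads are assumed to be error-free and from a single strand. The reads, the value of k, and all function names are hypothetical.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Build a graph whose nodes are k-mers and whose edges join k-mers
    that follow each other in a read (a perfect k-1 base overlap)."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph.setdefault(km, set())      # register every k-mer as a node
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def find_start(graph: dict[str, set[str]]) -> str:
    """Pick a node with no incoming edge (assumes at least one exists)."""
    targets = {t for successors in graph.values() for t in successors}
    return sorted(node for node in graph if node not in targets)[0]


def walk_unambiguous(graph: dict[str, set[str]], start: str) -> str:
    """Extend a contig while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:                      # stop on a cycle (a repeat)
            break
        contig += nxt[-1]                    # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free reads sampled from the toy sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, find_start(graph))
print(contig)
```

Real assemblers additionally handle reverse complements, sequencing errors, and branching caused by repeats, which is where the memory and error-handling difficulties noted above arise.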
After the Illumina technology was upgraded from the initial 36-base reads to reads over 100 bases in length, the sequences produced became suitable for de novo assembly. After the release of the SHARCGS assembler for Solexa reads, other assemblers were released and became the most popular assembly tools.
Plant genome assembly was initiated with Arabidopsis thaliana in December 2000 [95], where the approach relied on overlapping bacterial artificial chromosome (BAC) clones that were end sequenced; the same approach was applied to the crop plant rice [125,126]. Later, the emerging whole genome shotgun (WGS) strategy was applied to black cottonwood [127]. Here, assembling the short sequence reads proved more difficult and resulted in a more fragmented genome sequence. This was followed by two versions of the grapevine genome sequence in 2007 [128,129]. A hybrid approach combining Illumina and Sanger sequencing was adopted for cucumber, indicating the feasibility of this approach for plant genome sequencing [130]. With the change in technology, 454 combined with Sanger sequencing was applied to the genomes of apple [131], cocoa [132], and muskmelon [133]. In 2011, strawberry became the first plant genome to be sequenced using SGS technology combining the 454, Illumina, and SOLiD platforms [134], followed by Chinese cabbage [135], potato [136], chickpea [137], pigeonpea [138], and watermelon [139].
The advances in sequencing technology (SGS and TGS) and assembly approaches have not only removed the limitations on sequencing crops with small genomes but also enabled the sequencing and assembly of large-genome crops, like wheat (~17 Gbp) [87,140,141], barley (5.1 Gbp) [142], rye (~7–8 Gbp) [143], and tea (~3.8–4.0 Gbp) [144], which are important for animal feed and human nutrition.
Genome assembly quality has improved as sequencing technologies and assembly tools have improved, especially when multiple TGS technologies are combined.
For example, the initial version of the sorghum genome assembly, released in 2009 [145] using shotgun sequencing and BAC library data, captured 738.5 Mb of sequence in 12,873 contigs (scaffolded into 3304 sequences). This is far more fragmented than the chromosome-scale sorghum assembly built from nanopore sequencing and optical mapping data, a hybrid assembly of 29 scaffolds capturing 661.16 Mb [146].
For rye (Secale cereale), which has a large genome (~8 Gb), a virtual linear gene order model (22,426 genes) was first established with high-throughput transcript mapping and chromosome survey sequencing [147]. A subsequent shotgun de novo reference assembly produced 1.29 million scaffolds capturing 2.8 Gbp of sequence [148], and a later chromosome-scale assembly built with 10×, Hi-C, Bionano optical genome mapping, and chromosome-specific shotgun (CSS) reads captured 6.74 Gb (of an estimated 7.9 Gb) [149].
In addition to the chromosome-scale assemblies, TGS has enabled the assembly of polyploid genomes, such as bread wheat [87], potato [150], and peanut [151].

5. Advancements in Plant Genomics

With the emerging sequence technology and bioinformatics tools, it is possible to assemble a nearly complete genome sequence. With cytogenetic advances to measure the genome size (such as flow cytometry), a genome size estimation is a useful first step in a complete genome sequencing project. The amount of sequencing data required to produce a given level of coverage depends on the 1C amount of DNA per cell (including ploidy level), and for most species, this can be found in the Kew Plant Genome Database. Most plant genome assemblies are smaller than the cytogenetic genome estimation size; this may be because of assembly errors or difficult-to-approach genomic regions, like centromeric and repetitive regions in the plant genome, where assemblers struggle (physical maps, such as Bionano, resolve such issues). Some of the assembled plant genome sizes are quite close to the cytogenetic estimated size, indicating the assembler has captured the majority of the genome content. Assemblies above the estimated size, however, may need refinement to reduce contaminants or alter the assembly parameters.
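The dependence of sequencing effort on genome size, and the comparison between assembled and estimated genome size, can be sketched with simple arithmetic. This is a minimal illustration assuming the usual definition of average coverage as total sequenced bases divided by genome size; the 750 Mb genome, 30× target, and 150 bp read length are illustrative assumptions, while the rye figures (6.74 Gb assembled of an estimated 7.9 Gb) come from the text. Function names are hypothetical.

```python
import math


def reads_required(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Number of reads needed so that reads * read_length / genome_size
    reaches the requested average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def assembled_fraction(assembly_size_bp: float, estimated_size_bp: float) -> float:
    """Fraction of the estimated genome size captured by an assembly."""
    return assembly_size_bp / estimated_size_bp


# Illustrative assumption: a 750 Mb genome sequenced to 30x with 150 bp reads.
n_reads = reads_required(750_000_000, coverage=30, read_length_bp=150)

# Rye chromosome-scale assembly versus its estimated genome size (from the text).
rye_fraction = assembled_fraction(6.74e9, 7.9e9)

print(f"Reads required: {n_reads:,}")
print(f"Rye assembly captures {rye_fraction:.1%} of the estimated genome")
```

A fraction well below 1 points to unassembled repetitive or centromeric regions, while a value above 1 may indicate contamination or uncollapsed haplotypes, in line with the interpretation given above.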
The genome assembly provides the coordinate system for the gene models and other genomic features, like SNPs, indels, SSRs, etc. Predicting gene models with ab initio gene finding, supported by evidence in the form of RNA data, increases accuracy. However, a single annotated reference may not contain the complete gene complement of the species; resequencing a wide range of diverse accessions reveals additional genotype-specific genes. For example, the resequencing of >1000 wild and cultivated rice accessions identified thousands of genes with lower sequence diversity in cultivated rice, indicating a genetic bottleneck during rice domestication [114,152]. Moreover, genetic diversity is often reduced during domestication, and resequencing a single individual may not capture the species-wide gene content. Thus, the concept of the pan-genome was developed and adapted to plant genomes to identify the species-wide gene content. A pan-genome comprises the core genome, usually defined as the housekeeping genes (which must be present for the organism to survive and reproduce), and the variable/dispensable genes (which are present or absent in a particular cultivar/accession of a species) that exhibit the gene diversity or variability in a species. The first plant pan-genomes appeared in 2007, describing the variable genes in rice and maize genomes, and the approach was later adapted to a wide range of plant genomes [153], including banana [154], white lupin [155], barley [156], wheat [156], wheat panache [157], and sorghum [158].
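The split of a pan-genome into core and variable/dispensable genes can be sketched with set operations on a gene presence table. This is a minimal illustration in which the core is taken to be the genes present in every accession; the accession names, gene identifiers, and function name are hypothetical.

```python
def partition_pan_genome(
    presence: dict[str, set[str]],
) -> tuple[set[str], set[str], set[str]]:
    """Return (pan, core, dispensable) gene sets.

    pan         = genes found in at least one accession
    core        = genes found in every accession
    dispensable = genes missing from at least one accession
    """
    gene_sets = list(presence.values())
    if not gene_sets:
        return set(), set(), set()
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return pan, core, pan - core


# Hypothetical presence/absence calls for three accessions.
presence = {
    "accession_A": {"gene1", "gene2", "gene3"},
    "accession_B": {"gene1", "gene2", "gene4"},
    "accession_C": {"gene1", "gene2"},
}

pan, core, dispensable = partition_pan_genome(presence)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("dispensable:", sorted(dispensable))
```

In practice the presence calls themselves come from aligning or assembling each accession, and some studies relax the core definition to genes present in nearly all accessions rather than strictly all.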
The most commonly used downstream analysis with pan-genome assemblies is to identify genetic variation in any DNA segment of a genome or gene (including gene fragments) that can be used as a marker for genotyping. Bioinformatics resources enhancing crop genomics for downstream analysis include copy number variations (CNV), identification of sequence and length variations (SNPs, SSRs, indels), sets of SNPs used as a unit in the form of a haplotype to increase the resolution of GWAS, k-mer analysis, linkage disequilibrium (LD), presence–absence variations, pan-genome-wide association studies (PWAS), genotyping-by-sequencing, reduced representation sequencing, domestication, and diversity analysis. With these bioinformatics tools, genomic data also support plant phylogenomic research with useful information, such as genome diversity and speciation events. Therefore, bioinformatics has become an essential part of plant genomics research.
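One of the analyses listed above, linkage disequilibrium between two biallelic markers, can be sketched with the standard r² statistic. This is a minimal illustration assuming phased haplotypes with alleles coded 0/1; the example haplotypes and the function name are hypothetical.

```python
def ld_r2(haplotypes: list[tuple[int, int]]) -> float:
    """Compute r^2 between two biallelic loci from phased haplotypes.

    Each haplotype is (allele_at_locus_A, allele_at_locus_B), coded 0 or 1.
    D = p_AB - p_A * p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


# Hypothetical data: alleles always travel together (complete LD).
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Hypothetical data: all four combinations equally frequent (no LD).
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r2(linked))
print(ld_r2(unlinked))
```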
High-throughput genotyping enables the genotyping of thousands of targeted loci (genetic markers) on thousands of samples. Depending on the number of markers and the sample size, different genotyping techniques can call genotypes in different ranges. Some of the technologies include Illumina GoldenGate, Affymetrix SNP, reduced-representation genome sequencing, exome-seq, Fluidigm (https://investors.fluidigm.com/node/13686/pdf), IntelliQube (https://www.myebpl.com/intelliqube.html), MassARRAY [185], MassEXTEND, GeneChip [186], APEX-Seq [187], BeadArray (https://www.illumina.com/science/technology/microarray.html), TaqMan [188], and DArT (https://www.diversityarrays.com/). Genotyping by sequencing (GBS) is a highly multiplexed system for constructing reduced representation libraries for the sequencing platform, with low cost, reduced sample handling, and no need for a reference genome. GBS approaches (including single-digest RAD, double-digest RAD, and skim-sequencing) are tools for genomics-assisted breeding in a range of plant species through applications such as SNP identification, gene/QTL mapping, molecular diversity analysis, GWAS, construction of high-density genome maps, haplotype maps, phylogenetics, identification of candidate genes, genetic linkage analysis, molecular marker discovery, and genome sequencing and selection. Such genetic resources assist in predicting the genetic value of selection candidates based on genomic estimated breeding values (GEBV) from high-density, high-quality markers. Genomic selection (GS) is an approach that exploits genetic markers to build marker-based models that increase the genetic gain of complex traits in breeding programs. High-throughput marker technologies have changed the entire scenario of marker applications and enabled the routine use of GS for crop improvement.
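How marker data turn into a GEBV can be sketched with a simple additive model, in which a candidate's GEBV is the sum of its marker genotypes weighted by previously estimated marker effects. This is a hedged illustration only: the marker effects are made-up numbers standing in for the output of a trained GS model, genotypes are assumed to be coded as 0/1/2 allele dosages, and all names are hypothetical.

```python
def gebv(genotype: list[int], marker_effects: list[float]) -> float:
    """Additive GEBV: sum over markers of allele dosage times marker effect."""
    if len(genotype) != len(marker_effects):
        raise ValueError("genotype and marker_effects must have the same length")
    return sum(dosage * effect for dosage, effect in zip(genotype, marker_effects))


# Hypothetical marker effects, as if estimated from a training population.
marker_effects = [0.5, -0.2, 0.1]

# Hypothetical selection candidates with 0/1/2 dosages at three markers.
candidates = {
    "line_A": [2, 0, 1],
    "line_B": [0, 2, 2],
    "line_C": [1, 1, 1],
}

scores = {name: gebv(g, marker_effects) for name, g in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Estimating the marker effects themselves is the statistically demanding step and is typically done with dedicated mixed-model or regularized-regression software rather than by hand.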
Plant phenotyping through conventional methods relies on manual measurements, which are laborious, error-prone, and time-consuming. Similar to genotyping, high-throughput phenotyping (HTP) (“phenomics”) has unique advantages in facilitating accurate, automated, high-quality data collection, using techniques including visible light imaging, X-ray computed tomography, visible and near-infrared spectroscopy, multispectral imaging, chlorophyll fluorescence, fluorescence imaging, and nuclear magnetic resonance (NMR) [189] (Xiao et al., 2022). These tools are generally used to obtain high-resolution images of samples from which features are extracted with image processing algorithms. Machine learning algorithms are mostly used to make the data processing robust and to produce accurate, time-efficient plant phenotypes [190]. Highly accurate genotypic and phenotypic data need appropriate statistical methods to identify true associations between genetic and phenotypic variation. Plant phenotyping systems, imaging techniques, challenges, and their applications have been reviewed elsewhere, including imaging systems, data collection methods, and analysis techniques and problems [191,192,193]. GWAS has high efficiency and high resolution and is conducted on a genome-wide scale with statistical programs. Some of the R packages developed for association analysis are GAPIT [194], qqman [195], gwasrapidd [196], eQTpLot [197], Postgwas [198], GWASTools [199], and IntAssoPlot [200].

This entry is adapted from the peer-reviewed paper 10.3390/life13081668
