Plant Genome Sequencing Technology: History

The development of technology that can capture large volumes of sequence data at low costs and with high accuracy has driven the acceleration of plant genome sequencing advancements. The sequencing of all plant species is a long-term goal that may become key to effectively supporting life on Earth through the improved management of plants in wild populations and their selection and genetic enhancement for use in agriculture and food production.

  • DNA sequencing
  • plant genome
  • long read sequencing
  • chromosome assembly

1. DNA Isolation

The starting point for the sequencing of plant genomes is obtaining a sample of DNA to sequence. The ease of obtaining a DNA sample of suitable quality for sequencing varies greatly between species. Plants contain many secondary metabolites, proteins and polysaccharides that may interfere with DNA extraction and become a source of contaminants that often reduce the efficiency of DNA extraction. Current technologies require a minimum amount of DNA and the DNA must be pure (free from contaminants that may inhibit sequencing) to facilitate efficient sequencing and the generation of large volumes of data. The amount of DNA required for long read sequencing has been greater (usually μg quantities) than that needed for short read sequencing (https://dnatech.genomecenter.ucdavis.edu/pacbio-library-prep-sequencing; accessed on 6 June 2022). For long read sequencing, the DNA must be intact (not degraded) so that long molecules are present in the sample and can be sequenced to produce long reads. Simple methods that were suitable for DNA extraction in the past for purposes such as PCR analysis [20] are no longer adequate, so the development of species- [21] or tissue-specific methods that can support next generation sequencing is often required [22]. Some species are especially difficult and require the isolation of nuclei [23] first as a source of DNA that may be free from contaminants from other parts of the plant cell. The isolation of nuclei in plants is challenging for some of the same reasons that DNA isolation is difficult. The disruption of the plant cell wall requires forces that may damage organelles such as nuclei and shear the DNA.

2. Short Read Sequences

The first set of next generation sequencing technologies provided larger volumes of short DNA sequences. The accuracy of these short sequences and the volume of data have since increased dramatically. The length of these sequences started at around 30 bp and has rapidly advanced to 100–150 bp. Paired-end sequencing has extended this technology to allow for the production of sequences of around 400 bp, but most applications currently deliver sequences of around 150 bp. Illumina sequencing platforms are the dominant technology used for short read sequencing. This technology conducts sequencing by synthesis in a very large number of parallel reactions. The incorporation of nucleotides is monitored as the DNA is copied. Other techniques (e.g., SOLiD [24] and 454 sequencing [25]) have been replaced by new technologies because they generally offered lower accuracy or data volumes, resulting in relatively higher costs. Ion Torrent sequencing is used for the rapid sequencing of large numbers of small targets, such as in amplicon sequencing and 16S metagenomic sequencing. In plants, this has been used for chloroplast sequencing [26].

3. Long Read Sequences

The assembly of plant genomes with large numbers of repetitive sequences is not possible with only short read sequences. Therefore, technology that allows much longer sequences to be generated has been key to simplifying genome assembly. The length of these sequences and their accuracy have improved greatly since the technology was first introduced.

3.1. PacBio

Pacific Biosciences (PacBio) has developed a long-read sequencing platform that provides accurate long read sequencing. The single-molecule real-time (SMRT) sequencing involves monitoring the incorporation of fluorescent-labeled nucleotides [27]. Recently, single long reads (also known as continuous long reads (CLR)) have largely been replaced by HiFi reads, which provide a consensus sequence based on sequencing a long fragment of DNA (approximately 15,000 bp) multiple times by first circularizing the DNA and reading around the circle many times [28]. The repeated sequencing of the same molecule allows a highly accurate sequence to be generated as the circular consensus sequence (CCS) is read. The quality of the genomes that are generated by the assembly of these reads into contigs has been improved by the application of optimized assembly tools, such as those provided by hifiasm.
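The idea of building a consensus from repeated passes over the same molecule can be illustrated with a minimal sketch. It assumes the passes have already been aligned to equal length and simply takes a majority vote per position; the function name and the example passes are hypothetical, and this is a simplification for illustration rather than the method used by PacBio software.

```python
from collections import Counter


def consensus_from_passes(passes):
    """Column-wise majority vote over pre-aligned, equal-length passes.

    Illustrative only: real circular consensus calling also has to align
    the passes and handle insertions and deletions.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be pre-aligned to equal length")

    consensus = []
    for column in zip(*passes):
        # Most frequent base at this position across all passes.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical passes over the same molecule, each with at most one error.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACCTTGCA", "ACGTTGCA"]
consensus = consensus_from_passes(passes)
print(consensus)
```

Because the errors in individual passes fall at different positions, each is outvoted by the other passes, which is why reading around the circle more times tends to give a more accurate consensus.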

3.2. ONT

Oxford Nanopore Technologies (ONT) provides a long read sequencing technology that delivers accurate sequence data quickly. The sequence is determined by measuring the changes in electrical current as the DNA passes through a pore. The ONT platform generates very long reads and has the advantage of very low instrumentation costs. This platform has continued to improve and deliver very long read sequences with increasing accuracy [29]. Because the instruments are portable, ONT sequencing has been widely applied where very rapid sequencing is required, such as in diagnostics [30]. Chromosome-level assemblies of plant genomes can be achieved by combining ONT sequencing with methods such as optical mapping [31].

3.3. Other

Several technology providers have developed pseudo-long reads that are created by linking short reads. These techniques may produce long reads at lower costs, but the long reads that are generated often do not match the accuracy of the current long read methods [19]. These technologies have been developed by Universal Sequencing Technology [32], MGI [33] and 10x Genomics [34]. Despite the great contribution that long read sequencing technology has made to the efficient production of high-quality plant genomes, further improvement of long read sequencing technologies remains one of the key areas that may contribute to future advances.

3.4. Advances

Genome sequencing and assembly require an adequate depth of sequencing. The size of the contigs that can be assembled has been shown to increase in an almost linear way as the amount of long read sequence data increases [8]. The size of the assembled genomes reduces slightly with more contiguous assemblies, probably due to the joining of homologous contig ends, as does the completeness [8]. Improved software has also enabled improvements in the assembly of long read sequences [35]. The use of hifiasm has been shown to allow the haplotype-resolved assembly of the large (30 Gb) genome of the Californian redwood (Sequoia sempervirens) [36]. These advances are illustrated by the quality of the early plant genomes relative to those that are being generated by the latest technology. The first rice genome, which was reported in 2002, was highly fragmented, while current technology delivers sequence contigs that are often full-length chromosomes [35].
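As a rough illustration of the two quantities discussed above, the sketch below computes the average sequencing depth (total sequenced bases divided by genome size) and N50, a commonly used summary of contig size that the text above does not name explicitly. The function names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def n50(contig_lengths):
    """Length of the contig at which the running total of contig
    lengths, sorted longest first, reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs supplied")


# Hypothetical example: 12 Gb of reads for a 400 Mb genome.
depth = sequencing_depth(12_000_000_000, 400_000_000)
# Hypothetical contig lengths (arbitrary units).
contig_n50 = n50([50, 40, 30, 20, 10])
print(depth, contig_n50)
```

Tracking a contig size summary such as N50 while the depth is increased is one simple way to see whether adding more long read data is still improving the contiguity of an assembly.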

4. Chromosome-Level Assembly

The ultimate aim of genome sequencing is to obtain a complete genome sequence of each chromosome, from telomere to telomere. This has generally relied on evidence from beyond the DNA sequence data. Physical and genetic mapping methods have been used to achieve the chromosome-level assembly of contigs that were generated from sequencing data [37]. Recently, the advances in sequencing technology have made it possible to generate many full-length chromosomes from the sequence data alone [35]. The complete assembly of sequence contigs into whole chromosomes has been widely achieved using genetic mapping data, chromatin mapping (Hi-C) or optical mapping. Hi-C [38] involves the mapping of chromatin by crosslinking the DNA in the intact chromatin, digesting the DNA and then sequencing (with short reads) the DNA fragments at the ends of the crosslinks. These are used to position the sequence contigs along the chromosome. Optical mapping (Bionano) can also be used to locate sequences along the DNA molecule and to scaffold the sequence contigs [39]. Many projects have combined these technologies to support the generation of high-quality genomes. Recent advances in long read sequencing have enabled the generation of long contigs of highly accurate sequences, reducing reliance on these techniques for high-level assembly. However, they remain essential for the de novo assembly of most chromosome-level genomes. High-quality sequence contigs in combination with genetic mapping data, Bionano optical data or Hi-C chromatin mapping have generally succeeded in achieving chromosome-level assemblies of plant genomes. A report on more than 100 chromosome-level assemblies in 2021 [40] found only 73% coverage of the pseudomolecules that represent the chromosomes. The combination of long read sequencing and the use of these tools has resulted in the recent reporting of many high-quality chromosome-level genome sequences (Table 1), with the quality improving greatly with the most recent technology.
Table 1. Some recent chromosome-level assemblies of plant genomes.

5. Haplotype-Resolved Genomes

Most published plant genomes are collapsed representations of the diploid genome as a single sequence, with a random inclusion of one of the two alleles at each heterozygous position. Only recently has it become possible to assemble each haplotype separately [63]. This has been the result of advances in both sequencing technology and sequence assembly tools. Current technology suggests that most genomes can now be sequenced at the haplotype level, thereby replacing the reporting of collapsed genomes with the sequences of the two haplotypes.

6. Pan-Genomes

The sequencing of plant genomes has shown that significant differences may be found within a plant species, which means that more than one reference genome is required to represent the species. It has also demonstrated that many genes are variably present in different individuals within a species. These presence/absence differences have led to the construction of pan-genomes: a pan-genome includes all of the variation within a group of plants and so represents the complete set of genes found within a population. The pan-genome concept is a powerful tool for plant breeders for the analysis of gene pools [64]. Pan-genomes can be generated at different levels to represent the diversity that is found within, for example, domesticated gene pools, species or genera.
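The presence/absence view of a pan-genome can be sketched with simple set operations. In this minimal example, the accession names and gene identifiers are hypothetical, and the labels "core" (genes present in every individual) and "variable" (genes present in only some) are naming choices made here for illustration.

```python
def summarize_pan_genome(gene_sets):
    """Return (pan, core, variable) gene sets from per-individual genes.

    pan      : every gene seen in at least one individual
    core     : genes present in all individuals
    variable : genes present in some but not all individuals
    """
    if not gene_sets:
        raise ValueError("need at least one individual")
    sets = [set(genes) for genes in gene_sets.values()]
    pan = set().union(*sets)
    core = set.intersection(*sets)
    variable = pan - core
    return pan, core, variable


# Hypothetical gene content of three accessions of one species.
gene_sets = {
    "accession_A": {"g1", "g2", "g3"},
    "accession_B": {"g1", "g2", "g4"},
    "accession_C": {"g1", "g3", "g5"},
}
pan, core, variable = summarize_pan_genome(gene_sets)
print(sorted(pan), sorted(core), sorted(variable))
```

The same function could be applied to gene sets grouped at a different level, such as all accessions within a domesticated gene pool or all species within a genus.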

7. Transcriptomes

Transcriptome sequencing is an important tool for the analysis of the expressed regions of a genome. This is key to understanding gene functions and the determination of the genetic basis of important plant traits [65,66,67,68,69,70]. Transcriptome sequencing complements genome sequencing in genome characterization. Transcripts provide physical evidence that a sequence is expressed and complement the predictions of genes that are based on the sequence alone. The comparison of the transcriptomes of different genotypes from different tissues or cell types at different stages of development and under different environmental conditions allows for the discovery of the genes that control plant traits and has become a key approach in plant biology and the discovery of genes for selection in plant breeding. Single-cell transcriptomics has become a powerful tool for understanding gene expression at the cell and tissue level but has had limited application in plants [71], partly due to the difficulty in isolating specific plant cells without disrupting expression.

7.1. RNAseq

The levels of expression of genes in any specific cell, tissue, organ, genotype or development stage are now widely quantified by RNA sequencing (RNAseq) [72,73]. RNAseq has largely replaced earlier array-based or gene-by-gene analysis tools as it provides a more unbiased analysis of the whole transcriptome.
An analysis of the gene expression in the highly polyploid sugarcane genome revealed that while the different alleles of most genes are expressed in direct proportion to their abundance in the genome, some genes show highly biased patterns of expression [74]. In hexaploid wheat, subgenome-specific responses to diseases have also been reported [75].
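The comparison between allele abundance in the genome and allele expression can be sketched as a simple ratio. The example below assumes that the copy number (dosage) of each allele and the RNAseq read counts assigned to each allele are already known; the function name, allele names and counts are hypothetical and do not come from the cited studies.

```python
def allele_expression_bias(dosage, read_counts):
    """Ratio of each allele's share of reads to its share of gene copies.

    A ratio near 1 means expression in proportion to abundance in the
    genome; ratios well above or below 1 indicate biased expression.
    """
    total_dosage = sum(dosage.values())
    total_reads = sum(read_counts.values())
    if total_dosage <= 0 or total_reads <= 0:
        raise ValueError("dosage and read counts must sum to a positive value")

    bias = {}
    for allele, copies in dosage.items():
        if copies <= 0:
            raise ValueError("each allele needs a positive copy number")
        read_share = read_counts.get(allele, 0) / total_reads
        copy_share = copies / total_dosage
        bias[allele] = read_share / copy_share
    return bias


# Hypothetical gene with three alleles present in 6, 4 and 2 copies.
dosage = {"allele_1": 6, "allele_2": 4, "allele_3": 2}
proportional = allele_expression_bias(
    dosage, {"allele_1": 600, "allele_2": 400, "allele_3": 200}
)
biased = allele_expression_bias(
    dosage, {"allele_1": 900, "allele_2": 200, "allele_3": 100}
)
print(proportional, biased)
```

In the first set of counts every ratio is close to 1, while in the second the most abundant allele is over-represented among the reads relative to its copy number.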

7.2. Long Read Transcriptomes

Long read sequencing has been applied to the analysis of plant transcriptomes, revealing the diversity of full-length transcripts and defining the variations in splicing and intron retention in gene expression [76]. The long read sequencing of transcriptomes avoids the challenge of assembling many closely related transcripts from short reads. Unique 3′ and 5′ sequences may be separated by common intervening sequences, which creates the risk of incorrectly combining the ends of different transcripts when using short reads.
Some examples of the application of long read sequencing to the analysis of plant transcriptomes of increasing complexity can be found for polyploid species in Table 2.
Table 2. The long read sequencing of polyploid transcriptomes.

8. Organelle Genome Sequencing

Plant cells usually contain a single nucleus and many organelles, probably hundreds of mitochondria and thousands of chloroplasts. Sequencing the organelle genomes is complicated by the transfer of genes between these genomes. The nuclear genome often contains many insertions of large and small sequences of organellar genomes. Many early methods struggled to distinguish organellar gene sequences from those of copies that were inserted into the nuclear genome because they relied on PCR amplification or organelle separation [80]. Nuclear inserts may represent versions of organellar genomes that were transferred in the past and that have diverged since insertion.

8.1. Chloroplast Genomes

The chloroplast genomes of plants are highly conserved sequences of 100–150 kb, containing around 100 genes [81]. The structure of most chloroplast genomes is similar, with four components: two inverted repeats that separate a large and a small single-copy region. Chloroplast sequences have been widely used in plant identification due to their presence in all green plants and their high copy numbers in the cell, which simplify their detection. Early approaches that relied on chloroplast isolation or PCR amplification were plagued by confusion due to the copies of chloroplast sequences in the nuclear and mitochondrial genomes. Recent approaches [82] rely on the higher abundance of chloroplast genome sequence reads in short read sequence data to clearly distinguish the correct sequence of the relevant chloroplast [83]. The development of software tools now allows for the efficient extraction of accurate whole chloroplast genome sequences from even low (nuclear) coverage sequencing datasets. The annotation of chloroplast genomes that were generated in this way has resulted in the identification of around 100 genes with increasingly well-defined functions [84].
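The abundance-based principle can be illustrated with a minimal sketch that flags contigs whose read depth is far above the typical (nuclear) depth. It assumes that per-contig depths have already been computed; the function name, the contig names, the depths and the 10-fold threshold are all hypothetical, and this is not how any particular chloroplast assembly tool works.

```python
from statistics import median


def flag_high_depth_contigs(depths, fold=10.0):
    """Return the names of contigs whose depth exceeds `fold` times the
    median depth, which is taken here as a stand-in for nuclear depth.

    `fold` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not depths:
        raise ValueError("need at least one contig")
    baseline = median(depths.values())
    return sorted(name for name, d in depths.items() if d > fold * baseline)


# Hypothetical per-contig read depths from a short read dataset.
depths = {
    "contig_1": 28,
    "contig_2": 31,
    "contig_3": 30,
    "contig_4": 2900,  # far more abundant than the others
    "contig_5": 29,
}
candidates = flag_high_depth_contigs(depths)
print(candidates)
```

Depth alone would also flag other high-copy sequences, such as those from the mitochondrial genome, so a flagged contig is only a candidate that still needs to be confirmed as chloroplast by its sequence content.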
The sequencing of the maternally inherited (e.g., chloroplast) and nuclear genomes of plants has frequently revealed discordant phylogenies [85,86,87], suggesting widespread reticulate evolution in plant populations (Table 3). The transfer of chloroplasts between species during rare events results in “chloroplast capture” by closely related species.
Table 3. Discordant phylogenies for chloroplast and nuclear genome sequences.

8.2. Plant Mitochondrial Genomes

The mitochondrial genomes of plants [91] are much larger and less conserved than the chloroplast genomes and, as a result, they have been much less studied. The mitochondrial genome, as with the nuclear genome, may include sequences that were derived from the chloroplast and inserted into the genome at various times throughout its evolutionary history. Because mitochondrial genomes are present in relatively high numbers in cells, these inserted sequences are even more likely to be confused with chloroplast genome sequences than the chloroplast sequences that were inserted into nuclear genomes.

This entry is adapted from the peer-reviewed paper 10.3390/applbiosci1020008
