Read
Subjects: Cell Biology

In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

  • base pair
  • fragmentation
  • DNA sequencing

1. Read Length

Sequencing technologies vary in the length of the reads they produce. Reads of 20-40 base pairs (bp) are referred to as ultra-short.[1] Typical sequencers produce reads in the range of 100-500 bp.[2] Pacific Biosciences platforms, however, produce reads of approximately 1500 bp.[3] Read length can affect the results of biological studies.[4] For example, longer reads improve the resolution of de novo genome assembly and the detection of structural variants. It is estimated that read lengths greater than 100 kilobases (kb) will be required for routine de novo human genome assembly.[5] Bioinformatic pipelines for analyzing sequencing data usually take read length into account.[6]

2. Generations of Sequencing and Read Lengths

A genome is the complete genetic information of an organism or a cell. Single- or double-stranded nucleic acids store this information in a linear or circular sequence. To determine this sequence precisely, increasingly efficient technologies with greater accuracy, throughput and sequencing speed have been developed over time. The Sanger and Maxam-Gilbert methods, both published in 1977, initiated the field of DNA sequencing and are classified as First Generation Sequencing technologies.[7] First Generation Sequencing typically has read lengths of 400 to 900 base pairs.

In 2005, Roche’s 454 technology introduced a new sequencing approach capable of high throughput at low cost.[8] This and similar technologies came to be known as Second Generation Sequencing or Next Generation Sequencing (NGS). One of the hallmarks of NGS is short sequence reads. NGS methods can sequence millions to billions of reads in a single run, and generating gigabases of read data takes only a few hours to a few days, a much higher throughput than first-generation techniques such as Sanger sequencing. All NGS techniques produce short reads, i.e. 80–200 bases, as opposed to the longer reads produced by Sanger sequencing.[9]

Beginning in the 2010s, new technologies ushered in the era of Third-Generation Sequencing (TGS). TGS is a term used to describe methods that are capable of sequencing single DNA molecules without amplification. While Sanger and short-read sequencing (SRS) techniques produce reads of at most about one kilobase pair, third-generation sequencing technologies can produce reads of 5 to 30 kilobase pairs. The longest read reported for a third-generation sequencing technology is about 2 megabase pairs.[10]

3. NGS and Read Mapping

Historically, only one individual per species was sequenced because of time and expense constraints, and its sequence served as the species' "reference" genome. A reference genome can guide resequencing efforts in the same species by serving as a template for read mapping. Read mapping is the process of aligning NGS reads to a reference genome.[11] Almost any NGS application, such as genome variant calling, transcriptome analysis, transcription factor binding site calling, epigenetic mark calling, and metagenomics, requires read mapping. The performance of these applications depends on accurate alignment. Furthermore, because the number of reads is so large, the mapping process must be efficient. Different methods are used to align reads to a reference genome, depending on how many mismatches and indels are allowed. Roughly speaking, the methods fall into two categories: the seed-and-extend approach and the filtering approach. Many short-read aligners use the seed-and-extend strategy, such as BWA-SW, Bowtie 2, BatAlign, LAST, Cushaw2, and BWA-MEM. A filter-based approach is used by a number of methods, such as SeqAlto, GEM, and MASAI.[12]
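
The seed-and-extend idea can be sketched in a few lines of Python. This is a toy illustration only, not the algorithm of any of the aligners named above: it assumes an exact k-mer hash index as the seeding step and an ungapped mismatch count as the extension step, so indels are not handled. The function names `build_kmer_index` and `map_read`, the seed length `k`, and the `max_mismatches` threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for the best ungapped placement, or None.

    Seed: look up each k-mer of the read in the index.
    Extend: compare the whole read against the reference at the implied start.
    """
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, ()):
            start = hit - offset
            # Skip placements that run off the reference or were already scored.
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


# Toy reference and reads (made up for illustration).
reference = "ACGTTGCATGCCGATAGCTTAGGCATCGA"
index = build_kmer_index(reference, k=4)
exact = map_read("GCCGATAGCT", reference, index, k=4)   # identical to the reference
noisy = map_read("GCCGTTAGCT", reference, index, k=4)   # one substituted base
```

Seeding restricts the expensive comparison to the few reference positions that share an exact k-mer with the read, which is why this family of methods scales to very large read sets. Real aligners typically use compressed indexes and gapped extension instead of the plain dictionary and mismatch count assumed here.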

4. Genome Assembly and Sequence Reads

In genomics, reconstructing a genome from DNA sequencing reads is a significant challenge. Because fragments are sampled at random, the reads are expected to cover the entire genome roughly uniformly. The reads are stitched together computationally to reconstruct the genome. This process is known as de novo genome assembly.
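
The stitching step can be illustrated with a minimal greedy overlap merge in Python. This is a sketch under strong assumptions: the reads are error-free, come from one strand, and the genome has no repeats longer than the minimum overlap. It is a generic textbook heuristic, not the method of any particular assembler, and the names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]


# Toy genome and three overlapping reads (made up for illustration).
genome = "ATGGCGTGCAATCCGTA"
reads = [genome[0:8], genome[4:12], genome[8:17]]
contigs = greedy_assemble(reads)
```

With these three reads the merge recovers the toy genome as a single contig. On real data, repeats longer than the reads make the overlaps ambiguous, which is one reason short-read assemblies tend to be fragmented.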

Sanger sequencing has a longer read length than NGS. Two assemblers were developed for assembling Sanger sequencing reads: the OLC assembler Celera and the de Bruijn graph assembler Euler. These two methods were used to assemble the human reference genome. However, since Sanger sequencing is low throughput and expensive, only a few genomes have been assembled with it.
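
To make the de Bruijn graph idea concrete, the following Python sketch builds such a graph from reads and spells out a sequence along an unbranched path. It is a simplified illustration, not the Euler assembler: it assumes error-free reads from one strand, treats each distinct k-mer as one edge, and stops at the first branch or cycle. The names `de_bruijn_graph` and `walk_unbranched` and the choice of `k` are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge from its prefix to its suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Follow edges from start while the current node has exactly one successor."""
    sequence = start
    node = start
    seen = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        sequence += node[-1]  # each step contributes one new base
    return sequence


# Toy reads covering the made-up sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unbranched(graph, "ATG")
```

Unlike the overlap-based approach, this construction never compares reads pairwise; overlaps are implicit in shared (k-1)-mers. A repeat longer than k-1 bases creates a branching node, at which this simple walk stops.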

Second-generation sequencing reads are short, but these techniques can sequence hundreds of millions of reads efficiently and cost-effectively. Custom genome assemblers have been built to reconstruct genomes from such short sequences, and their success spawned several de novo genome assembly projects. Although this approach is cost-effective, the reads are short relative to the repeat regions of the genome, which results in fragmented assemblies.

Third-generation sequencing now provides very long reads (on the order of 10,000 bp). Long reads can resolve the ordering of repeat regions, although they have a high error rate (15–18%). A number of computational methods have been devised to correct errors in third-generation sequencing reads.

Assembling with short reads and assembling with long reads have different advantages and disadvantages owing to error rates and ease of assembly. Sometimes a hybrid method is preferred, in which short reads and long reads are combined to obtain a better result. There are two approaches. The first uses mate-pair reads and long reads to improve the assembly built from the short reads. The second uses short reads to correct the errors in the long reads.

5. Advantages and Disadvantages of Short Reads

Second-generation sequencing generates short reads (of length < 300 bp) that are highly accurate (sequencing error rate of about 1%). Short-read sequencing technologies have made sequencing much easier, faster and cheaper than Sanger sequencing. The August 2019 report from the National Human Genome Research Institute put the cost of sequencing a complete human genome at US$942.[13][14]

The inability to sequence long stretches of DNA is a drawback shared by all second-generation sequencing technologies. To sequence a large genome such as the human genome with NGS, the DNA must be fragmented and amplified in clones ranging from 75 to 400 base pairs, which is why NGS is also known as "short-read sequencing" (SRS). Once the short reads have been sequenced, reconstruction becomes a computational problem, and many computer programs and techniques have been developed to assemble the random clones into a contiguous sequence.[15]

A necessary step in SRS is the polymerase chain reaction (PCR), which causes preferential amplification of repetitive DNA. SRS also fails to generate sufficient overlapping sequence from the DNA fragments. Together these constitute a major challenge for de novo sequencing of a highly complex and repetitive genome such as the human genome.[16] Another challenge with SRS is the detection of large sequence changes, which is a major roadblock to studying structural variation.[17]

6. Advantages and Disadvantages of Long Reads

Third-generation sequencing produces long reads and is often referred to as long-read sequencing (LRS). LRS technologies are capable of sequencing single DNA molecules without amplification. The availability of long reads is a great advantage: with NGS it is often difficult to generate a long continuous consensus sequence, because overlaps between short reads are hard to detect, and this lowers the overall quality of the assembly. LRS has been shown to considerably improve the quality of genome assemblies in several studies.[18][19] Another advantage of LRS over NGS is that it can characterize a variety of epigenetic marks at the same time as the DNA sequence.[20][21]

The major challenges of LRS are accuracy and cost, although LRS is improving quickly in both areas.

The content is sourced from:


  1. Chaisson, Mark J. (2009). "De novo fragment assembly with short mate-paired reads: Does the read length matter?". Genome Research 19 (2): 336–346. doi:10.1101/gr.079053.108. PMID 19056694. PMC 2652199. Retrieved 23 July 2017. 
  2. Junemann, Sebastian (2013). "Updating benchtop sequencing performance comparison". Nature Biotechnology 31 (4): 294–296. doi:10.1038/nbt.2522. PMID 23563421.
  3. Quail, Michael A. (2012). "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers". BMC Genomics 13 (1): 341. doi:10.1186/1471-2164-13-341. PMID 22827831.
  4. Chhangawala, Sagar; Rudy, Gabe; Mason, Christopher E.; Rosenfeld, Jeffrey A. (23 June 2015). "The impact of read length on quantification of differentially expressed genes and splice junction detection". Genome Biology 16 (1): 131. doi:10.1186/s13059-015-0697-y. PMID 26100517.
  5. Chaisson, Mark J.P. (2015). "Genetic variation and the de novo assembly of human genomes". Nature Reviews Genetics 16 (11): 627–640. doi:10.1038/nrg3933. PMID 26442640.
  6. Conesa, Ana; Madrigal, Pedro; Tarazona, Sonia; Gomez-Cabrero, David; Cervera, Alejandra; McPherson, Andrew; Szcześniak, Michał Wojciech; Gaffney, Daniel J. et al. (26 January 2016). "A survey of best practices for RNA-seq data analysis". Genome Biology 17 (1): 13. doi:10.1186/s13059-016-0881-8. PMID 26813401.
  7. Giani, Alice Maria; Gallo, Guido Roberto; Gianfranceschi, Luca; Formenti, Giulio (2020). "Long walk to genomics: History and current approaches to genome sequencing and assembly". Computational and Structural Biotechnology Journal 18: 9–19. doi:10.1016/j.csbj.2019.11.002. PMID 31890139.
  8. Qiang-long, Zhu; Shi, Liu; Peng, Gao; Fei-shi, Luan (1 September 2014). "High-throughput Sequencing Technology and Its Application". Journal of Northeast Agricultural University (English Edition) 21 (3): 84–96. doi:10.1016/S1006-8104(14)60073-8.
  9. Chaisson, M.; Pevzner, P.; Tang, H. (1 September 2004). "Fragment assembly with short reads". Bioinformatics 20 (13): 2067–2074. doi:10.1093/bioinformatics/bth205.
  10. Kraft, Florian; Kurth, Ingo (16 July 2019). "Long-read sequencing in human genetics". Medizinische Genetik 31 (2): 198–204. doi:10.1007/s11825-019-0249-z.
  11. Sung, Wing-Kin (2017). Algorithms for next-generation sequencing. Boca Raton. ISBN 978-1466565500. 
  12. Sung, Wing-Kin (2017). Algorithms for next-generation sequencing. Boca Raton. ISBN 978-1466565500. 
  13. Adewale, Boluwatife A. (26 November 2020). "Will long-read sequencing technologies replace short-read sequencing technologies in the next 10 years?". African Journal of Laboratory Medicine 9 (1): 5. doi:10.4102/ajlm.v9i1.1340. PMID 33354530.
  14. "DNA Sequencing Costs: Data".
  15. Mardis, Elaine R (February 2017). "DNA sequencing technologies: 2006–2016". Nature Protocols 12 (2): 213–218. doi:10.1038/nprot.2016.182. PMID 28055035.
  16. Mardis, Elaine R (February 2017). "DNA sequencing technologies: 2006–2016". Nature Protocols 12 (2): 213–218. doi:10.1038/nprot.2016.182. PMID 28055035.
  17. Ho, Steve S.; Urban, Alexander E.; Mills, Ryan E. (March 2020). "Structural variation in the sequencing era". Nature Reviews Genetics 21 (3): 171–189. doi:10.1038/s41576-019-0180-9. PMID 31729472.
  18. Rhoads, Anthony; Au, Kin Fai (October 2015). "PacBio Sequencing and Its Applications". Genomics, Proteomics & Bioinformatics 13 (5): 278–289. doi:10.1016/j.gpb.2015.08.002. PMID 26542840.
  19. Wenger, Aaron M.; Peluso, Paul; Rowell, William J.; Chang, Pi-Chuan; Hall, Richard J.; Concepcion, Gregory T.; Ebler, Jana; Fungtammasan, Arkarachai et al. (October 2019). "Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome". Nature Biotechnology 37 (10): 1155–1162. doi:10.1038/s41587-019-0217-9. PMID 31406327.
  20. Flusberg, Benjamin A; Webster, Dale R; Lee, Jessica H; Travers, Kevin J; Olivares, Eric C; Clark, Tyson A; Korlach, Jonas; Turner, Stephen W (June 2010). "Direct detection of DNA methylation during single-molecule, real-time sequencing". Nature Methods 7 (6): 461–465. doi:10.1038/nmeth.1459. PMID 20453866.
  21. Simpson, Jared T; Workman, Rachael E; Zuzarte, P C; David, Matei; Dursi, L J; Timp, Winston (April 2017). "Detecting DNA cytosine methylation using nanopore sequencing". Nature Methods 14 (4): 407–410. doi:10.1038/nmeth.4184. PMID 28218898.