Next-generation sequencing (NGS) is widely used to study microorganisms. It allows the characterization of the bacteria and viruses that inhabit different body systems and the identification of new pathogens.
Along with classical diagnostic methods such as virus isolation, serology, and PCR, NGS plays an important role in virus identification, especially in outbreaks of known and/or new diseases [1].
NGS technology is currently revolutionizing the field of genomics, and clinical virology is no exception. High-throughput sequencing techniques have made significant contributions to many areas of virology, including virus discovery and metagenomics, molecular epidemiology, pathogenesis, and research into how viruses evade the host’s immune response. Previously unknown viruses have been identified using NGS techniques, including a new rhabdovirus associated with acute hemorrhagic fever identified in Central Africa [2] and a new cyclovirus found in the cerebrospinal fluid of patients with infections of the central nervous system [3].
Metagenomics refers to the study of the complete genomic composition of a complex mixture of microorganisms [69]. Unlike bacteria, viruses have no gene that is common to all families, so no single marker gene can be used to survey them, and the study of the virome therefore relies on more complex analytical methods. In addition to detecting viruses, NGS can provide information on virulence markers, epidemiology, genotyping, and the evolution of pathogens, and it allows the copy number to be estimated from the number of DNA/RNA reads [4][5].
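The idea of estimating abundance from read counts can be illustrated with a minimal sketch. It assumes an RPKM-style normalization (reads per kilobase of genome per million sequenced reads), which is one common convention and not necessarily the method used in the cited studies; the function name, the virus labels, and all numbers are hypothetical.

```python
from typing import Dict


def reads_per_kb_per_million(read_counts: Dict[str, int],
                             genome_lengths_bp: Dict[str, int],
                             total_reads: int) -> Dict[str, float]:
    """Normalize read counts by genome length and by sequencing depth.

    read_counts: number of reads assigned to each genome (hypothetical input).
    genome_lengths_bp: genome length in base pairs for the same keys.
    total_reads: total number of reads in the sequencing run.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    per_million = total_reads / 1_000_000.0
    normalized = {}
    for name, count in read_counts.items():
        length_kb = genome_lengths_bp[name] / 1000.0
        normalized[name] = count / (length_kb * per_million)
    return normalized


# Illustrative values only: two genomes with equal read counts
# but different genome lengths.
example = reads_per_kb_per_million(
    read_counts={"virus_A": 500, "virus_B": 500},
    genome_lengths_bp={"virus_A": 10_000, "virus_B": 5_000},
    total_reads=2_000_000,
)
```

With equal raw counts, the shorter genome receives the higher normalized value, which is why raw read numbers alone are a poor proxy for relative abundance.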
NGS methods are also used to study the genetic and phenotypic heterogeneity of viruses during replication in host cells, which is missed in conventional studies based on consensus sequencing data. One example is the study by Ikuyo Takayama (2021), who used NGS to examine the genetic diversity of the A(H1N1)pdm09 virus in upper and lower respiratory tract samples from nine patients. Significant genetic heterogeneity was found in the samples, including 47 amino acid substitutions, and a D222G/N substitution in the hemagglutinin that was common to several patients. The authors note the need for such studies so that potentially important mutations arising during viral replication in the host are not missed, especially in patients with severe disease [6].
The NGS process typically consists of two parts. The first is experimental work in the laboratory (wet lab), which includes sample preparation, DNA/RNA extraction, library preparation, and sequencing. The second is bioinformatic analysis (dry lab), which includes data quality control, removal of non-target DNA, and analysis of nucleotide sequences [7].
The application of NGS to viral studies has certain experimental and analytical features, in contrast to the study of microbial communities.
This applies to sample preparation and sequencing. For example, it must be considered that viral genomes (especially RNA genomes) are rather fragile and easily degraded, and that the ratio of viral material to the host genome is very low (less than 1%). Therefore, an important step is the amplification of target viral nucleic acids or the enrichment of viral particles [8][4].
In some cases, it is necessary to reverse-transcribe the viral RNA before PCR and sequencing. PCR amplification introduces errors that are difficult to distinguish from real mutations. In addition to PCR errors, all NGS platforms introduce sequencing errors at a rate similar to the mutation rate in RNA viruses [9]. Additionally, the depth of sequencing, that is, the number of unique reads that include a given nucleotide in the sequence, can vary depending on the genome [8].
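The definition of sequencing depth given above can be made concrete with a short sketch. It assumes that reads have already been aligned to a reference genome and deduplicated, and that each alignment is described by a hypothetical pair of 0-based coordinates (start inclusive, end exclusive); real pipelines usually obtain these values from alignment files with dedicated tools.

```python
from typing import List, Tuple


def depth_per_position(alignments: List[Tuple[int, int]],
                       genome_length: int) -> List[int]:
    """Count how many reads cover each position of the reference.

    alignments: (start, end) pairs, 0-based, end-exclusive (assumed format).
    genome_length: length of the reference sequence in nucleotides.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, genome_length)
        if start >= end:
            continue  # read lies outside the reference or is empty
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


# Three illustrative reads on a 10-nucleotide reference.
example_depth = depth_per_position([(0, 5), (3, 8), (4, 10)], genome_length=10)
```

Positions with low depth are the ones where a sequencing or PCR error is hardest to tell apart from a true low-frequency variant, so depth profiles of this kind are typically inspected before variants are called.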
Modern viral NGS protocols have already been optimized for detecting both RNA and DNA viruses [10][11][12]. In addition, viral particle enrichment techniques are often used to increase the relative concentration of viral particles and/or their nucleotide sequences, as are methods for depleting host genomic DNA and ribosomal RNA or mRNA. These methods are laborious and difficult to automate, which limits their routine use in large-scale clinical diagnostics [4].
Difficulties in detecting viruses by NGS in clinical specimens, especially respiratory ones, arise because the samples contain extremely small amounts of viruses and their nucleic acids compared with the high content of host genomic material and the bacterial component. This makes the preliminary steps of NGS especially important. Proper sampling and preparation of the nucleic acid sequences of interest are critical for obtaining the desired results (Figure 1).
Initially, metagenomics was actively developed in the study of bacterial genomes and achieved tremendous success. More recently, a new direction in metagenomics, the study of the virome, has been actively developed. Virus studies using NGS methods are now at the peak of their development, and the technological approaches are continually being improved. Examples of applications include pathogen detection (including the detection of novel pathogens), species identification and typing, and the detection of antibiotic resistance, virulence, and more [1][13][14].
With the development of NGS, its practical application is constantly expanding, especially in clinical virology for the diagnosis of new or previously undetected pathogens of infectious diseases [8][15][16]. It has been shown that the sensitivity of NGS becomes comparable to that of PCR as sequencing depth increases [17][1].
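The dependence of sensitivity on sequencing depth can be illustrated with a deliberately simple model. The model is an assumption of this sketch and is not taken from the cited studies: each read is treated as independent and as viral with probability f, so the chance of obtaining at least one viral read among N reads is 1 - (1 - f)^N. The function name and the example values are hypothetical.

```python
import math


def detection_probability(viral_fraction: float, total_reads: int) -> float:
    """Probability of sampling at least one viral read.

    Assumes independent reads, each viral with probability viral_fraction
    (a simplifying assumption, not a property of any real protocol).
    """
    if not 0.0 <= viral_fraction <= 1.0:
        raise ValueError("viral_fraction must be between 0 and 1")
    if total_reads < 0:
        raise ValueError("total_reads must be non-negative")
    if total_reads == 0 or viral_fraction == 0.0:
        return 0.0
    if viral_fraction == 1.0:
        return 1.0
    # log1p keeps the computation stable for very small viral fractions.
    prob_no_viral_read = math.exp(total_reads * math.log1p(-viral_fraction))
    return 1.0 - prob_no_viral_read


# Illustrative values: one viral read per million, at two sequencing depths.
p_shallow = detection_probability(1e-6, 1_000_000)
p_deep = detection_probability(1e-6, 10_000_000)
```

Under this model, a tenfold increase in depth raises the detection probability from about 0.63 to nearly 1 for a virus present at one read per million, which is consistent with the qualitative statement that deeper sequencing narrows the gap to PCR.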
For example, W.I. Lipkin (2013) used bioinformatic analysis to identify new viruses, such as a rhabdovirus associated with acute hemorrhagic fever and a cyclovirus found in the cerebrospinal fluid of patients [18][19]. Using NGS, T. Kustin detected human parainfluenza virus 1, human parainfluenza virus 4, and influenza C virus among 54 patients [20].
In 91 NFS samples, NGS identified human rhinoviruses, enteroviruses, influenza A virus, coronavirus OC43, and respiratory syncytial virus (RSV) A, as well as rotavirus, torque teno virus, human papillomavirus, human betaherpesvirus 7, cyclovirus, vientovirus, gemycircularvirus, and statovirus [21]. In an examination of NFS from 48 children, NGS revealed 11 RNA viruses, 4 DNA viruses, 4 bacterial species, and one fungus [22].
H. Mostafa (2020), when detecting SARS-CoV-2 by NGS in 500 patients, showed that the method also makes it possible to diagnose other infections and to analyze the respiratory microbiome [15].
Yi-Yi Qian (2021) showed that the sensitivity of NGS was higher than that of the traditional culture method but lower than that of PCR [16]. Thorburn (2015) studied 89 nasopharyngeal swabs; the sensitivity and specificity of NGS compared with real-time PCR were 78% and 80%, respectively [23]. Thus, NGS as a diagnostic tool is still under development, and approaches to its application are being improved every year.
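Sensitivity and specificity figures of this kind are obtained by comparing NGS calls with a reference method sample by sample. The following sketch shows the calculation; the input format and the example data are hypothetical and do not reproduce the data of any cited study.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute sensitivity and specificity against a reference method.

    results: one (ngs_positive, reference_positive) pair per sample,
    where the reference could be, for example, real-time PCR.
    """
    tp = fn = tn = fp = 0
    for ngs_positive, reference_positive in results:
        if reference_positive:
            if ngs_positive:
                tp += 1  # true positive
            else:
                fn += 1  # false negative
        else:
            if ngs_positive:
                fp += 1  # false positive
            else:
                tn += 1  # true negative
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference-positive and one reference-negative sample")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical panel: 10 reference-positive and 10 reference-negative samples.
panel = ([(True, True)] * 8 + [(False, True)] * 2
         + [(False, False)] * 9 + [(True, False)] * 1)
example_sens, example_spec = sensitivity_specificity(panel)
```

Both values depend on which method is taken as the reference, so figures reported against culture and against PCR are not directly comparable.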
Technically, NGS is run on various platforms, which can be divided by read length into short-read and long-read platforms. The short-read sequencing approaches fall into two categories: sequencing by ligation (SBL) and sequencing by synthesis (SBS).
In most SBL and SBS approaches, DNA is clonally amplified on a solid surface. The presence of many thousands of identical copies of a DNA fragment in a defined area ensures that the signal can be distinguished from background noise. Massive parallelization is also facilitated by the creation of many millions of individual SBL or SBS reaction centers, each with its own clonal DNA template. The sequencing platform can simultaneously collect information from many millions of reaction sites, thereby sequencing many millions of DNA molecules in parallel.
SBL technologies include Applied Biosystems/SOLiD and MGI/BGI/Complete Genomics. Sequencing by synthesis (SBS) is performed on the Illumina and Qiagen platforms.
Illumina offers a popular series of sequencing platforms: iSeq, MiSeq, MiniSeq, NextSeq, HiSeq, and NovaSeq. High throughput and a low error rate (less than 1%) are the main reasons why this technology currently dominates the field of virology and beyond.
The very first NGS platform used to study viral metagenomics was the 454 Life Sciences/Roche platform, which is based on pyrosequencing. 454 sequencing has been widely used to identify several new viruses and virome profiles from human and animal samples [24], including arboviruses [25], orbiviruses [26], arenaviruses [27], Lujo virus [28], astrovirus [29], gyroviruses [30], porcine bocaviruses [31], picornaviruses [32], rhabdoviruses [2], coronaviruses [33], gamma papillomavirus [34], and seadornavirus [35]. Most of these viruses were identified in serum, respiratory, and fecal samples.
Although this technology offered a higher yield than Sanger sequencing at a lower cost, it has been supplanted by other NGS technologies because of its comparatively high cost, errors in homopolymer regions, and low throughput.
Ion Torrent semiconductor sequencing technology, implemented in the Ion Proton and Ion S5 series sequencers, benefits from fast sequencing. This makes these sequencers particularly useful for the targeted detection of viruses in clinical specimens, such as HIV [36], hepatitis B virus [37], and HCV [38], and for rapid genome sequencing of several viruses, including Tuscany virus [39], polyomavirus [40], porcine reproductive and respiratory syndrome virus [41], orthoreovirus [42], bluetongue virus [43], rotavirus [44], and influenza virus [45]. This technology has also been used to study the virome of skin [46], ticks [47], and the intestines of piglets [48] and seals [49].
Over the past few years, sequencing technologies have advanced rapidly with the introduction of third-generation sequencing (TGS) technologies, such as the Oxford Nanopore and PacBio platforms. These platforms sequence single molecules in real time (termed SMRT sequencing on the PacBio platform), which reduces amplification bias and the problems associated with short reads. The reduction in cost and time offered by these sequencing methods is a valuable benefit.
TGS is considered the next revolution in sequencing technology. Fast sequencing of long reads without PCR amplification allows uniform coverage of the entire genome. This technology has also been used for virus sequencing [50][51][52]. Future developments in TGS should focus on improving sequencing accuracy and throughput.
An important stage in metagenomics is computational analysis, or bioinformatics. Its task is to process a large body of NGS data, which may contain genome sequences of viruses, bacteria, humans, animals, and other organisms.
With early sequencing methods, sequences were usually classified using NCBI BLAST [53] against the NCBI nucleotide (nt) database [54]. With NGS data, however, it is necessary to process a much larger number of short reads (up to 300 bp), for which homologous regions are not always available in databases, and possible errors introduced by the sequencer must be taken into account.
Therefore, NGS needs specialized methods of analysis. Many bioinformaticians have developed computational workflows for the analysis of viral metagenomes, and their publications describe many software tools for taxonomic classification. While these tools can be helpful, choosing the right workflow can be difficult, especially for less experienced users [55][56].
Bioinformatic processing of sequencing data involves checking the quality of reads, filtering sequences, and identifying them. Some metagenomics workflows have been tested and described in review articles [57][58][59][60][61].
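The read quality check and filtering step can be sketched as follows. This is a minimal illustration and not one of the cited workflows: it assumes four-line FASTQ records with Phred+33 quality encoding, and the thresholds and function names are hypothetical. Production pipelines normally rely on dedicated, well-tested tools for this step.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(handle: TextIO,
                 min_mean_quality: float = 20.0,
                 min_length: int = 50) -> Iterator[Tuple[str, str, str]]:
    """Keep reads that are long enough and have sufficient mean quality."""
    for name, sequence, quality in read_fastq(handle):
        if len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality:
            yield name, sequence, quality


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
toy_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
kept = list(filter_reads(toy_fastq, min_mean_quality=20.0, min_length=4))
```

In this toy input only the first read passes the filter; the second is discarded because its mean quality is zero.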
There are specialized programs and online services for virus analysis, such as the Viral MetaGenome Annotation Pipeline (VMGAP) [62][63], the Viral Informatics Resource for Metagenomic Exploration (VIROME) [64], Metavir 2 [65], and DisCVR [66].
Additionally, there are cloud-deployed clinical metagenomic computing workflows such as SURPI (sequence-based ultra-rapid pathogen identification) [67] and CZ ID (IDseq) [68], for the detection and identification of pathogens.
The CosmosID program has been used to analyze the microbiome of various groups and quantify microorganisms [69][70][71].
Programs for visualizing annotated data include MEGAN, Pavian, Krona, PanViz, MetaViz, and Anvi’o. MEGAN and Pavian perform broad analyses but require specific inputs, which makes them less suitable for different workflows. PanViz, MetaViz, and Anvi’o are tailored to the analysis of bacteria and are of little use for viruses. Geneious and CLC bio are commercial programs that require an expensive license [72][73][74][75][76].
To separate viral and non-viral sequences, vFams is used [77]. The VIP program is also used to identify viruses [35]. Other tools include Virus-TAP, VirusSeeker, which performs BLAST-based virus identification with two modules (VS-VIROME and VS-DISCOVERY), and SHIVER for de novo assembly [78][79][80].
In the age of NGS and bioinformatics, open, easily accessible, free, and globally distributed platforms for data analysis can significantly change the accessibility and quality of biomedical research. Baker et al. (2020) demonstrated the feasibility and importance of data sharing using SARS-CoV-2 as an example [81]. All of the viral genomic data were analyzed on the Galaxy platform (https://usegalaxy.eu/ accessed on 18 November 2022), and the analyses can be replicated with open-source tools by any researcher with an Internet connection.
Such open access helps raise community awareness of the lack of primary data needed to respond effectively to global emergencies such as the COVID-19 outbreak, and it allows all analyses to be performed transparently, reproducibly, and on an equal footing.
Additionally, the publication emphasizes the problem of non-reproducibility of published results, which cannot be repeated because the data are not shared or are deliberately withheld. Thus, any researcher should be able to apply the same analytical procedures to their own data and should have access to all data analysis tools, including computing power and infrastructure [81].