In the 21st century, three highly pathogenic betacoronaviruses have emerged, with an alarming rate of human morbidity and case fatality. Genomic information has been widely used to understand the pathogenesis, animal origin and mode of transmission of coronaviruses in the aftermath of the 2002–2003 severe acute respiratory syndrome (SARS) and 2012 Middle East respiratory syndrome (MERS) outbreaks. Furthermore, genome sequencing and bioinformatic analysis have had an unprecedented relevance in the battle against the 2019–2020 coronavirus disease 2019 (COVID-19) pandemic, the newest and most devastating outbreak caused by a coronavirus in the history of mankind. Here, we briefly review the application of genomics and bioinformatics in the molecular epidemiology of pathogenic betacoronaviruses.
Molecular epidemiology focuses on the contribution of genomic, genetic and other molecular factors to etiology, distribution and prevention of diseases. Central to molecular epidemiology of coronaviruses is their circulation among different animal hosts, as well as the evolutionary forces that facilitate these cross-species jumps. Here, we discuss how genomic information has been used to better understand the rate of evolution of betacoronaviruses and their transmission in human populations, as well as the evolutionary changes associated with host and tissue tropism.
Coronaviruses (CoV) are important pathogens of vertebrates with the ability to cause respiratory, enteric and systemic diseases in humans and animals. They are enveloped, single-stranded, positive-sense RNA viruses belonging to subfamily Orthocoronavirinae of family Coronaviridae, order Nidovirales. The subfamily is further divided into four genera, namely, Alphacoronavirus, Betacoronavirus, Gammacoronavirus and Deltacoronavirus. The majority of clinically relevant coronaviruses belong to the Alphacoronavirus and Betacoronavirus genera. Genus Alphacoronavirus comprises species infecting a diverse group of mammals, including two (229E and NL63) of the seven known species of human coronaviruses. In the case of genus Betacoronavirus, the International Committee on Taxonomy of Viruses (ICTV) currently divides it into five subgenera, Embecovirus, Sarbecovirus, Merbecovirus, Nobecovirus and Hibecovirus, established based on phylogenetic analysis of conserved protein domains. The first four of these subgenera were formerly known as lineages or subgroups A, B, C and D, respectively.
Subgenus Embecovirus includes two human coronaviruses (HKU1 and OC43), as well as several animal coronaviruses of veterinary relevance such as bovine, canine, equine, porcine and murine coronaviruses. Sarbecovirus comprises the severe acute respiratory syndrome (SARS)-related coronaviruses such as SARS-CoV and SARS-CoV-2 (2019-nCoV), respectively responsible for the 2002–2003 SARS outbreak and the 2019–2020 coronavirus disease 2019 (COVID-19) pandemic. Several SARS-related bat coronaviruses, mainly isolated from Chinese horseshoe bats (Rhinolophus spp.), also belong to this subgenus. Subgenus Merbecovirus comprises the Middle East respiratory syndrome (MERS)-related coronaviruses, including the MERS-CoV responsible for the 2012 MERS outbreak, as well as two additional species of bat coronaviruses isolated from Tylonycteris and Pipistrellus bats. Subgenera Nobecovirus and Hibecovirus comprise only bat coronaviruses, mainly isolated from Rousettus and Hipposideros bats, respectively.
Since the 2002–2003 SARS outbreak, genomic information has become ever-increasingly significant to address outbreaks caused by pathogenic coronaviruses. Before the 2019–2020 COVID-19 pandemic, there were ~1200 complete genomes of betacoronaviruses deposited in the GenBank database. The number of available genomes has increased dramatically during the pandemic, with more than 6000 complete genomes available in Genbank as of June 2019, and almost 50,000 genomic sequences in other public repositories. A variety of information including phylogenetic relationships, mode of transmission, evolutionary rates and the role of mutations in infection and disease severity can be deduced from comparing multiple genomes.
Estimation of evolutionary rates is an important step to characterize the genetic diversity among viral lineages and to place a timescale in phylogenetic hypotheses explaining their origin and divergence. The rate of evolution of viruses is often assessed through the number of errors occurring during replication of the viral genome (the mutation rate) and the frequency at which such mutations become fixed in the population (the substitution rate) . The substitution rate depends on several factors, including the underlying mutation rate and the presence of selective forces that influence fixation of mutations in association with their fitness. Mutation rates of RNA viruses are generally higher than those of DNA viruses, due to the lack of a proofreading activity and consequent low fidelity of their RdRp . However, due to the proofreading activity of Nsp14, members of the order Nidovirales have relatively lower mutation rates .
The substitution rate is often expressed in substitutions per nucleotide site per year (s/n/y) and can be estimated from phylogenetic reconstructions when divergence time is known for particular lineages. Although several methods have been traditionally used to estimate substitution rates, including linear regression and maximum likelihood (ML), the most popular method nowadays is the Bayesian Markov chain Monte Carlo (MCMC) approach, such as that implemented in the BEAST package . Globally, substitution rates of coronaviruses have been estimated to be in the order of 10−3–10−4 s/n/y . Studies conducted in SARS-CoV and MERS-CoV have estimated whole genome substitution rates to be between 0.80–2.38 × 10−3 and 0.88–1.37 × 10−3 s/n/y, respectively . However, variation in the estimates for particular genes have been observed for both SARS-CoV  and SARS-CoV-2 [14,15], suggesting that the S gene and the region between ORF7b and ORF8 may be subjected to positive selective pressure in some lineages. In contrast, more conserved regions of the genome such as ORF1a and ORF1b appear to be under strong negative or purifying selection.
An important application of this type of analysis is the estimation of the time to most recent common ancestor (TMRCA) between two lineages, as an approximate measure of their time since divergence. Several studies have estimated that the SARS-CoV lineage within the SARS-related coronaviruses most probably emerged between 1961–1985, while the civet SARS-CoV strains may have originated around 1986–1995 . TMRCA estimates for the SARS-CoV and MERS-CoV strains involved in previous outbreaks have been roughly consistent with the dates when the first cases were reported . A preliminary study has estimated that the group containing SARS-CoV-2 and its closest bat coronavirus, RaTG13, may have diverged between 40–70 years ago .
Recombination events are often inferred by comparing phylogenetic trees built from different genes or genomic regions, since occurrence of recombination often leads to inconsistent topologies in such trees. One of the most popular methods for detecting recombination in viral genomes is bootscan analysis . To use this method, sequences of target genomes are aligned against reference sequences thought to be involved in the recombination events. The alignments are divided into short sequential segments and phylogenetic trees are then built from these segments. Recombination is suspected in segments for which the trees exhibit an alternative topology, involving different reference sequences. Bootscan and other complementary methods, such as sequence similarity plots for the putative recombinant regions, are implemented in packages such as SimPlot  or the Recombination Detection Program (RDP) . Studies relying on these methods have documented the occurrence of recombination in several coronavirus genera and have also provided ample evidence supporting the important role of this process in coronavirus cross-species transmission .
The first reported example of natural recombination in human coronaviruses was that occurring between two different HKU1 genotypes . Putative recombination events between genotypes have also been documented for human coronaviruses NL63  and OC43 . In the case of SARS-related and MERS-related coronaviruses, recombination appears to occur among strains infecting several animal hosts, including bats, intermediary hosts and humans (reviewed by Su et al. ). Studies considering several bat SARS-related coronaviruses have suggested the occurrence of recombination in lineages leading to human and/or civet strains of SARS-CoV, with breakpoints often located close or within the S and ORF8 genes . These findings suggest that recombination between existing strains can result in new strains or species, with possible differences in host and tissue tropism. For instance, bat coronavirus strains Rs3367 (WIV1) and WIV16 have been reported to have high sequence similarity to human/civet SARS-CoV at the S gene, allowing them to use ACE2 as a receptor for cell entry . Recombination analysis suggested that at least one civet SARS-CoV strain (SZ3) may have originated by recombination between WIV16 and another bat SARS-CoV strain (Rf4092) .
In the case of SARS-CoV-2, comparison of its genome to those of other SARS-related coronaviruses did not provide enough evidence supporting recent recombination as a possible explanation for its origin . However, two putative breakpoints, possibly derived from a past recombination event, were identified within the S gene and flanking its receptor-binding domain (RBD) . Globally, the SARS-CoV-2 genome is more similar to those of bat coronaviruses SL-CoVZXC21 and SL-CoVZC45 however, the region between the two breakpoints was found to be more similar to human/civet SARS-CoV and WIV1. Putative recombination signals have also been reported between SARS-CoV-2, RaTG13 and the pangolin-related coronaviruses . Although SARS-CoV-2 is more similar to RaTG13 than to the pangolin coronaviruses, some of the latter have higher sequence similarity to SARS-CoV-2 in the RBD. It has also been suggested that these similarities at the amino acid level may be due to convergent evolution, arising from positive selection instead of recombination .
The fact that different evolutionary events often involve the RBD is likely to be associated with the role of this domain and its RBM in receptor recognition and adaptation to different animal hosts. In SARS-CoV-like viruses, six RBD amino acids have been found to be essential for binding to ACE2, five of which differ between SARS-CoV and SARS-CoV-2 . Particular sets of RBD mutations appear to be associated with a specific host range for each coronavirus species, as is the case of humans and civets in SARS-CoV or humans and camels in MERS-CoV. Although it has been suggested that SARS-CoV-2 is optimized for binding to human ACE2, it may also infect other animals with highly similar ACE2 homologs such as pigs, ferrets, cats and primates . In fact, the S gene of a SARS-CoV-2 strain recently isolated from a tiger (GenBank accession number MT365033) is identical to those of human isolates, both clustering into the same branch in phylogenetic trees.
Comparative sequence analysis has suggested that positive selection may have a role in shaping the evolution of the S protein and the RBD of SARS-CoV and MERS-CoV . The role of natural selection in the evolution of particular genes is typically inferred by computing the ratio of the rates of nonsynonymous to synonymous changes (Ka/Ks) between groups or lineages, with a value greater than 1 indicating an overall positive selective pressure. Although evolution of the SARS-CoV genome during the 2002–2003 SARS outbreak was found to be largely neutral or nearly neutral, at least six mutations occurred in the S protein during the early, middle and late phases of the outbreak, all of which were present in the epidemic strain (Urbani) . However, the average Ka/Ks values for the early phase were found to be significantly higher than those for the middle and late phases, suggestive of initial positive selection in the S gene, followed by purifying selection and stabilization. Likewise, a recent study has suggested limited episodes of positive selection during divergence of SARS-CoV-2 from RaTG13, although there is still insufficient evidence to associate these changes with its adaptation to humans .
As pathogenic viruses replicate and spread during outbreaks, their genomes accumulate random mutations that can be used to track the spread of the disease, reconstruct their transmission routes and detect lineages with different levels of virulence and transmissibility. There are several methods for tracking mutations and inferring the mode of transmission from genomic data, most of which require the alignment of sequences from new isolates to reference genomes and the subsequent identification of genetic variants such as single nucleotide variants (SNV) and insertion/deletions (indels). The simplest and fastest methods are based on pairwise distances among samples computed from these variants. However, these methods do not consider evolutionary models and can be highly inaccurate when there is substantial divergence between donor and recipients in transmission chains . More advanced methods are based on ML or MCMC approaches, often within a Bayesian framework, applying an explicit model of evolution to phylogenetic estimation. When these methods are combined with sampling dates, estimations of the presence of significant molecular evolution over a sampling period is possible . Such analyses are implemented in packages like TransPhylo , Phyloscanner , Outbreaker2  or Phybreak .
The main output of these methods is a transmission tree indicating which individuals infected others. Although a transmission tree cannot be directly inferred from a phylogenetic tree, it must be consistent with the underlying phylogeny. Together, phylogenetic and transmission trees help us trace back the origin of an outbreak, detect multiple introductions of a pathogen into a given territory, identify mutations that define specific lineages and predict the potential existence of unsampled individuals that may have acted as missing transmission links. Due to their relatively recent development, transmission trees based on genomic data were not widely used to study the transmission routes of previous SARS and MERS outbreaks; however, a recent study analyzing data from the 2003 SARS outbreak has provided new insights into its early stages .
Progress in genome sequencing technologies has resulted in an exceptionally high number of SARS-CoV-2 genomes sequenced during the 2019–2020 COVID-19 pandemic. In fact, the COVID-19 pandemic is the second one in history, after the 2009 H1N1 influenza pandemic , for which genomic data have been generated almost in a real-time fashion, allowing a detailed reconstruction of transmission trees. Genome sequences have been made publicly accessible through several repositories, including a data sharing service hosted by the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/) . In addition to public genome repositories, open-source platforms for real-time data visualization and analysis of genomic data are also available, including NextStrain (https://nextstrain.org) and CoV-GLUE (http://cov-glue.cvr.gla.ac.uk). NextStrain is fed with sequences from the GISAID repository and uses the Augur bioinformatics toolkit (https://github.com/nextstrain/augur) for tracking molecular evolution and the Auspice software (https://nextstrain.github.io/auspice/) for interactive visualization of phylogenomic data. Conversely, CoV-GLUE is a web application based on an integrated software environment called GLUE (Genes Linked by Underlying Evolution), designed to create bioinformatic resources based on viral genome sequences . CoV-GLUE is also based on GISAID data and contains a database of replacements and indels that have been found in previously sampled SARS-CoV-2 sequences. New SARS-CoV-2 genome sequences can be loaded into the platform to identify novel or known mutations, assign them to potential lineages and visualize them in a phylogenetic context.
Several attempts have been made to classify circulating strains of SARS-CoV-2 into lineages or genotypes with potential differences in transmissibility and disease severity. Among these, a study comparing 103 SARS-CoV-2 genomes suggested the existence of two lineages, termed L and S, with potential differences in prevalence . Another recent study compared 160 genomes from different countries and suggested the existence of three subtypes, A, B and C, with differences in geographic distribution and prevalence . However, such studies have been criticized for possible sampling biases and misinterpretation of results , stressing that caution should be taken when drawing conclusions from genomic analyses. Limited or inappropriate sampling can bias the inference of transmission networks, potentially hiding introduction events and intermediate states and resulting in inaccurate mutation rate estimates . When describing new lineages based on SNVs and other genetic variants, fixation of the corresponding mutations should be first demonstrated in local populations.
Since the 2002–2003 SARS outbreak, genomic information has been crucial to tackle epidemics caused by betacoronaviruses. During the 2019–2020 COVID-19 pandemic, quick availability of genomic data has allowed a very rapid, detailed and accurate follow-up of disease progression worldwide and has tremendously supported the development of diagnostic systems, drug candidates and vaccines. Full viral genome analysis has swiftly changed the way scientists deal with epidemic viruses in two main ways: first, the speed that allows the description and classification of the responsible pathogen in a record timeframe, and second, the ability to generate massive amounts of viral genome data, contributing to establishing sound hypotheses on evolution and transmission. It is remarkable that genome analyses at such a scale are now increasingly feasible, without having to culture the viruses, many of which are classified as Biosafety Level 3 agents. However, it is important to stress that genomic information must be used carefully when drawing conclusions related to human and animal health. Sampling bias, selection of inadequate bioinformatic tools and misinterpretation of results can all lead to unreliable conclusions. Furthermore, the quality of genomic sequences used in comparative analyses is also crucial to establish sound conclusions. Currently, there are several platforms used for genome sequencing, each of them with their own patterns of systematic sequencing errors. Data curation and normalization is extremely important before conducting further analyses, particularly when comparing data from different sequencing platforms. If analyzed properly, genomic data will indisputably serve as strong basis for addressing future outbreaks caused by highly pathogenic emerging viruses such as SARS-CoV-2.