Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Despite a new set of graph theory-derived sequence/structural descriptors that have been gaining relevance in the detection of remote homology, they have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success for identifying remote homologs.
We compile the most popular alignment-free methods applied to the detection of homologous sequences within the twilight zone of alignment algorithms. Such homologous proteins placed at this zone or beyond are known as remote homologs.
-
Word Frequency-Based Methods
The most popular AF approaches are based on word frequency counting, known as word-based methods. They estimate how many times a letter from the DNA or protein alphabets appears along the query sequence, or alternatively they can also count the occurrences of certain subsequence of length k, where k size must be smaller than the query sequence length. Thus, they encompass those AF methods based on nucleotide [1], amino acid [56] and pseudo compositions [3] and others related to subsequence frequencies like k-mers or k-words [4], spaced k-words [5], and k-tuples [55]. They all have been applied up to a certain extent in database searching, gene annotation, comparative genomics, and phylogenetics by using AF similarity measures to cope with the previously mentioned alignment’s handicaps. For example, amino acid composition (ACC) was implemented in a webserver named composition-based protein identification (COPid) to perform protein searches and phylogenetic analysis by means of AF distances (https://webs.iiitd.edu.in/raghava/COPid/[7]) but also has been applied to detect remote homology in the G-protein coupled receptor superfamily (GPCR) [57]. The GPCR family has represented a challenging target, due to its high sequence diversity, for studying the prediction performance of several AF tools [58] including the pseudo-amino acid composition (PseAAC) protein feature [10].
Chou’s PseAAC concept was first applied to predict protein cellular attributes related to the biological function regardless of alignment information [11]. This AF approach incorporated the sequence order effect to the ACC to improve the quality of predictions. It was implemented in a webserver hosted at http://www.csbio.sjtu.edu.cn/bioinf/PseAA/ . The performance of PseACC has been evaluated in the twilight zone by (1) identifying enzymatic signatures and delimiting their subclasses in a nonredundant subset of enzymes and nonenzymes sharing sequence similarities lower than 40% of identity [13], (2) classifying structurally characterized proteins sharing < 30% of similarity into the four Structural Classification of Proteins extended (SCOPe)’s classes (α, β, α/β, α + β) just having sequence primary information, [14] and (3) detecting remote homologous proteins using benchmark datasets [15]. In addition to the proven utility of other compositional AF features like k-mers in assembling reads from Next Generation Sequencing (NGS) technologies into contigs [16], identification of species in metagenomic samples [17], [18], and improving heterologous gene expression [19], they have been applied to overcome several handicaps found in the twilight zone such as (1) the annotation of protein families within the metagenome’s diversity [20], (2) the classification of structural protein classes in designed datasets sharing low sequence similarities just by using k-word frequencies or AF distances [21], [22] or by k-mers incorporation into the general scheme of PseACC [23], and (3) the phylogeny reconstruction for constantly evolving viral genomes by the estimation of alignment-free distances [4], [24].
Popular AF methods based on compositional features have been also applied to genome- or proteome-based phylogeny reconstructions [25], [5] because they circumvent some well-known problems arising when intending the alignment of large genomic sequences, finding orthologs to build species trees, and dealing with low homology genes/proteins [26],[27]. Instead they can estimate directly AF distances from unassembled NGS reads for phylogenetic tree building [28].
Last but not least, many of the previously mentioned word-based methods have been also exploited to detect, analyze, and compare the less conserved blocks of the genomes made up by regulatory regions including promoters, transcription factors, and enhancers [29]. In this sense, the D2z AF measure derived from k-words frequencies highlights as one of the first reports in detecting functional and/or evolutionary similarities among cis-regulatory modules (CRMs) from several tissues of human’s and Drosophila’s genomic sequences [30]. One year later, k-words distributions were added directly to Markov models to define new AF similarity measures to discriminate functionally related CRMs from the unrelated ones [31]. In 2010, the concept of regulatory region scoring (RRS), based on the potential distribution of the transcription factors in CRMs, was introduced as an AF prediction model for the detection of related functional signals in non-alignable enhancers found in the CRMs, but could also be extended to other regulatory sequences like promoters [32]. More details about the definition and application of the most popular AF methods and measures were addressed by Vinga and Almeida in several outstanding reviews [33], [34], [35].
-
Information Theory-Based Methods
The runners-up of most popular AF methods are those based on the information theory which measure the information contained in the organization of DNA and protein strings using different approaches. For example, the Kolmogorov complexity of a sequence is measured through the shortest description of its string. However, such abbreviated description of the string is really expressed as a “compression” measure like the “.zip files”. As longer and more complex is the sequence, a larger description would be needed and, therefore, less compression of its string would be possible to apply [36]. Another type of complexity information measure is the Lempel-Ziv complexity that calculates the number of different substrings (occurrence rates) found along the sequence. The number of iterations needed to find such substring occurrences is related with the complexity of the sequence [37]. Once the Kolmogorov’s and Lempel-Ziv’s complexities are determined for the sequences, the estimation of similarity or distance metrics can be easily computed [36], [38], [39]. In this sense, compression-based distance measures from Lempel-Ziv’s and Kolmogorov’s complexities were used to detect distant protein similarities in a subset of the SCOP protein structure database [40], and to classify nonhomologous domains into the CATH levels (class, architecture, and topology) [41], respectively.
The so-called universal similarity metric introduced by Li et al. in 2001 [42] lying over the Kolmogorov complexity concept showed success to cluster protein structures sharing low sequence similarity within structural families and subfamilies [43].
Another theory-based measure is the Shannon entropy defined as the uncertainty of finding a given symbol (nucleotide or amino acid) or word (L-tuples) in the analyzed sequence [44]. The Shannon entropy concept has been used to estimate Kullback–Leibler (KL) divergence measure that allowed the comparison of two sequences , [45]. The Shannon entropy has been recently applied to relieve the perturbation caused by several biological processes such as mutations, recombinations, insertions and deletions, and fast-evolving genomes on pairwise effective genome comparisons [46].
AF methods based on the information theory have been also applied to characterize/compare regulatory sequences [47], [48] and to identify/compare transcription factor binding sites , . For further details about the application of information theory-based AF methods to noncoded sequence analysis, one may go through a comprehensive review published by Vinga [48]. At last, Table 1 shows a summary of the most popular AF methods applied to datasets of low sequence similarity for remote homology detection and the clustering of similar protein structures under such conditions.
Table 1. Summary of the most popular AF features applied to detect remote homology
Word-frequency methods
|
AF feature
|
Low-similarity dataset
|
Web-Implementation
|
Ref.
|
|
Amino Acid Composition (ACC)
|
G-protein coupled receptor superfamily
|
COPid https://webs.iiitd.edu.in/raghava/COPid/
|
[1]
|
|
Pseudo Amino Acid (PseACC)
|
G-protein coupled receptor superfamily
|
http://www.csbio.sjtu.edu.cn/bioinf/PseAA/
|
[49], [50]
|
|
PseACC
|
Designed dataset identity from ENZYME SwissPro database in [13]
|
http://chou.med.harvard.edu/bioinf/EzyPred/
|
[13]
|
|
PseACC
|
Chou’s designed dataset from SCOP structural classes
|
http://www.csbio.sjtu.edu.cn/bioinf/PseAA/
|
[14]
|
|
k-mers
|
Benchmark Structural data designed based on [51], [52]
|
No publicly available for proteins
|
[21]
|
|
k-mers
|
Benchmark Structural data designed in [53] and also used by [54]
|
No publicly available for proteins
|
[22]
|
|
Information theory-based methods
|
|
Lempel–Ziv complexity
|
Subset of SCOP designed by [53]
|
No publicly available
|
[40]
|
|
Kolmogorov complexity
|
Subset of SCOP designed by [53]
|
No publicly available
|
[40]
|
|
Kolmogorov complexity (Universal Similarity Metric)
|
Benchmark Structural data <25% designed based on [51], [52]
|
No publicly available
|
[41]
|
|
Kolmogorov complexity (Universal Similarity Metric)
|
Clustering protein structures using at low sequence similarity
Benchmark Structural data [52]
|
http://www.cs.nott.ac.uk/~nxk/USM/protocol.html
|
[43]
|
|
This entry is adapted from the peer-reviewed paper 10.3390/biom10010026