Most Popular Alignment-Free Approaches in the Twilight Zone

Most Popular Alignment-Free Approaches in the Twilight Zone: History

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Biochemistry & Molecular Biology

Contributor: Guillermin Agüero-Chapin , Deborah Galpert , Reinaldo Molina-Ruiz , Evys Ancede-Gallardo ,

Gisselle Pérez-Machado

Gustavo A. De la Riva

, Agostinho Antunes

Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Despite a new set of graph theory-derived sequence/structural descriptors that have been gaining relevance in the detection of remote homology, they have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success for identifying remote homologs.

alignment-free
popular
remote homolgs

We compile the most popular alignment-free methods applied to the detection of homologous sequences within the twilight zone of alignment algorithms. Such homologous proteins placed at this zone or beyond are known as remote homologs.

Word Frequency-Based Methods
The most popular AF approaches are based on word frequency counting, known as word-based methods. They estimate how many times a letter from the DNA or protein alphabets appears along the query sequence, or alternatively they can also count the occurrences of certain subsequence of length k, where k size must be smaller than the query sequence length. Thus, they encompass those AF methods based on nucleotide ^[1], amino acid ^[56] and pseudo compositions ^[3] and others related to subsequence frequencies like k-mers or k-words ^[4], spaced k-words ^[5], and k-tuples ^[55]. They all have been applied up to a certain extent in database searching, gene annotation, comparative genomics, and phylogenetics by using AF similarity measures to cope with the previously mentioned alignment’s handicaps. For example, amino acid composition (ACC) was implemented in a webserver named composition-based protein identification (COPid) to perform protein searches and phylogenetic analysis by means of AF distances (https://webs.iiitd.edu.in/raghava/COPid/^[7]) but also has been applied to detect remote homology in the G-protein coupled receptor superfamily (GPCR) ^[57]. The GPCR family has represented a challenging target, due to its high sequence diversity, for studying the prediction performance of several AF tools ^[58] including the pseudo-amino acid composition (PseAAC) protein feature ^[10]_.
Chou’s PseAAC concept was first applied to predict protein cellular attributes related to the biological function regardless of alignment information ^[11]. This AF approach incorporated the sequence order effect to the ACC to improve the quality of predictions. It was implemented in a webserver hosted at http://www.csbio.sjtu.edu.cn/bioinf/PseAA/ . The performance of PseACC has been evaluated in the twilight zone by (1) identifying enzymatic signatures and delimiting their subclasses in a nonredundant subset of enzymes and nonenzymes sharing sequence similarities lower than 40% of identity ^[13], (2) classifying structurally characterized proteins sharing < 30% of similarity into the four Structural Classification of Proteins extended (SCOPe)’s classes (α, β, α/β, α + β) just having sequence primary information, ^[14] and (3) detecting remote homologous proteins using benchmark datasets ^[15]. In addition to the proven utility of other compositional AF features like k-mers in assembling reads from Next Generation Sequencing (NGS) technologies into contigs ^[16], identification of species in metagenomic samples ^[17], ^[18], and improving heterologous gene expression ^[19], they have been applied to overcome several handicaps found in the twilight zone such as (1) the annotation of protein families within the metagenome’s diversity ^[20], (2) the classification of structural protein classes in designed datasets sharing low sequence similarities just by using k-word frequencies or AF distances ^[21], ^[22] or by k-mers incorporation into the general scheme of PseACC ^[23], and (3) the phylogeny reconstruction for constantly evolving viral genomes by the estimation of alignment-free distances ^[4], ^[24].
Popular AF methods based on compositional features have been also applied to genome- or proteome-based phylogeny reconstructions ^[25], ^[5] because they circumvent some well-known problems arising when intending the alignment of large genomic sequences, finding orthologs to build species trees, and dealing with low homology genes/proteins ^[26],^[27]. Instead they can estimate directly AF distances from unassembled NGS reads for phylogenetic tree building ^[28].
Last but not least, many of the previously mentioned word-based methods have been also exploited to detect, analyze, and compare the less conserved blocks of the genomes made up by regulatory regions including promoters, transcription factors, and enhancers ^[29]. In this sense, the D2z AF measure derived from k-words frequencies highlights as one of the first reports in detecting functional and/or evolutionary similarities among cis-regulatory modules (CRMs) from several tissues of human’s and Drosophila’s genomic sequences ^[30]. One year later, k-words distributions were added directly to Markov models to define new AF similarity measures to discriminate functionally related CRMs from the unrelated ones ^[31]. In 2010, the concept of regulatory region scoring (RRS), based on the potential distribution of the transcription factors in CRMs, was introduced as an AF prediction model for the detection of related functional signals in non-alignable enhancers found in the CRMs, but could also be extended to other regulatory sequences like promoters ^[32]. More details about the definition and application of the most popular AF methods and measures were addressed by Vinga and Almeida in several outstanding reviews ^[33], ^[34], ^[35].
Information Theory-Based Methods
The runners-up of most popular AF methods are those based on the information theory which measure the information contained in the organization of DNA and protein strings using different approaches. For example, the Kolmogorov complexity of a sequence is measured through the shortest description of its string. However, such abbreviated description of the string is really expressed as a “compression” measure like the “.zip files”. As longer and more complex is the sequence, a larger description would be needed and, therefore, less compression of its string would be possible to apply ^[36]. Another type of complexity information measure is the Lempel-Ziv complexity that calculates the number of different substrings (occurrence rates) found along the sequence. The number of iterations needed to find such substring occurrences is related with the complexity of the sequence ^[37]. Once the Kolmogorov’s and Lempel-Ziv’s complexities are determined for the sequences, the estimation of similarity or distance metrics can be easily computed ^[36], ^[38], ^[39]. In this sense, compression-based distance measures from Lempel-Ziv’s and Kolmogorov’s complexities were used to detect distant protein similarities in a subset of the SCOP protein structure database ^[40], and to classify nonhomologous domains into the CATH levels (class, architecture, and topology) ^[41], respectively.
The so-called universal similarity metric introduced by Li et al. in 2001 ^[42] lying over the Kolmogorov complexity concept showed success to cluster protein structures sharing low sequence similarity within structural families and subfamilies ^[43].
Another theory-based measure is the Shannon entropy defined as the uncertainty of finding a given symbol (nucleotide or amino acid) or word (L-tuples) in the analyzed sequence ^[44]. The Shannon entropy concept has been used to estimate Kullback–Leibler (KL) divergence measure that allowed the comparison of two sequences , ^[45]. The Shannon entropy has been recently applied to relieve the perturbation caused by several biological processes such as mutations, recombinations, insertions and deletions, and fast-evolving genomes on pairwise effective genome comparisons ^[46].
AF methods based on the information theory have been also applied to characterize/compare regulatory sequences ^[47], ^[48] and to identify/compare transcription factor binding sites , . For further details about the application of information theory-based AF methods to noncoded sequence analysis, one may go through a comprehensive review published by Vinga ^[48]. At last, Table 1 shows a summary of the most popular AF methods applied to datasets of low sequence similarity for remote homology detection and the clustering of similar protein structures under such conditions.

Table 1. Summary of the most popular AF features applied to detect remote homology

Word-frequency methods
AF feature	Low-similarity dataset	Web-Implementation	Ref.
Amino Acid Composition (ACC)	G-protein coupled receptor superfamily	COPid https://webs.iiitd.edu.in/raghava/COPid/	^[1]
Pseudo Amino Acid (PseACC)	G-protein coupled receptor superfamily	http://www.csbio.sjtu.edu.cn/bioinf/PseAA/	^[49], ^[50]
PseACC	Designed dataset identity from ENZYME SwissPro database in ^[13]	http://chou.med.harvard.edu/bioinf/EzyPred/	^[13]
PseACC	Chou’s designed dataset from SCOP structural classes	http://www.csbio.sjtu.edu.cn/bioinf/PseAA/	^[14]
k-mers	Benchmark Structural data designed based on ^[51], ^[52]	No publicly available for proteins	^[21]
k-mers	Benchmark Structural data designed in ^[53] and also used by ^[54]	No publicly available for proteins	^[22]
Information theory-based methods
Lempel–Ziv complexity	Subset of SCOP designed by ^[53]	No publicly available	^[40]
Kolmogorov complexity	Subset of SCOP designed by ^[53]	No publicly available	^[40]
Kolmogorov complexity (Universal Similarity Metric)	Benchmark Structural data <25% designed based on ^[51], ^[52]	No publicly available	^[41]
Kolmogorov complexity (Universal Similarity Metric)	Clustering protein structures using at low sequence similarity Benchmark Structural data ^[52]	http://www.cs.nott.ac.uk/~nxk/USM/protocol.html	^[43]

This entry is adapted from the peer-reviewed paper 10.3390/biom10010026

References

Feng-Biao Biao; Chuan Dong; Hong-Li Hua; Shuo Liu; Hao Luo; Hong-Wan Zhang; Yan-Ting Jin; Kai-Yue Zhang; Feng-Biao Guo; Accurate prediction of human essential genes using only nucleotide composition and association information. Bioinformatics 2017, 33, 1758-1764, 10.1101/084129.
Manish Kumar; Varun Thakur; Gajendra P S Raghava; COPid: composition based protein identification.. In Silico Biology 2008, 8, 121-128, PMID: 18928200 .
Kuo-Chen Chou; Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology 2011, 273, 236-247, 10.1016/j.jtbi.2010.12.024.
Yongyong Kang; Xiaofei Yang; Jiadong Lin; Kai Ye; PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction. Genes 2019, 10, 73, 10.3390/genes10020073.
Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S. Waterman; Fengzhu Sun; New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.. Briefings In Bioinformatics 2013, 15, 343-53, 10.1093/bib/bbt067.
Wei Chen; Tian-Yu Lei; Dian-Chuan Jin; Hao Lin; Kuo-Chen Chou; PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition. Analytical Biochemistry 2014, 456, 53-60, 10.1016/j.ab.2014.04.001.
Manish Kumar; Varun Thakur; Gajendra P S Raghava; COPid: composition based protein identification.. In Silico Biology 2008, 8, 121-128, PMID: 18928200 .
David W. Elrod; Kuo-Chen Chou; A study on the correlation of G-protein-coupled receptor types with amino acid composition.. Protein engineering 2002, 15, 713-715, 10.1093/protein/15.9.713.
Makiko Suwa; Bioinformatics Tools for Predicting GPCR Gene Functions. Retinal Degenerative Diseases 2013, 796, 205-224, 10.1007/978-94-007-7423-0_10.
Q Gu; Y S Ding; T L Zhang; Prediction of G-protein-coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns.. Protein Pept Lett 2010, 17, 559-567, 10.2174/092986610791112693.
K C Chou; Prediction of protein cellular attributes using pseudo-amino acid composition.. Proteins: Structure, Function, and Bioinformatics 2001, 43, 246-255, 10.1002/prot.1035.
Phillip E. C. Compeau; Pavel A. Pevzner; Glenn Tesler; How to apply de Bruijn graphs to genome assembly.. Nature Biotechnology 2011, 29, 987-91, 10.1038/nbt.2023.
Sasha K. Ames; David A. Hysom; Shea N. Gardner; G. Scott Lloyd; Maya B. Gokhale; Jonathan E. Allen; Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 2013, 29, 2253-2260, 10.1093/bioinformatics/btt389.
Rachid Ounit; Steve Wanamaker; Timothy J Close; Stefano Lonardi; CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.. BMC Genomics 2015, 16, 236, 10.1186/s12864-015-1419-2.
Claes Gustafsson; Sridhar Govindarajan; Jeremy Minshull; Codon bias and heterologous protein expression. Trends in Biotechnology 2004, 22, 346-353, 10.1016/j.tibtech.2004.04.006.
Robert A. Edwards; Robert Olson; Terry Disz; Gordon D. Pusch; Veronika Vonstein; Rick Stevens; Ross Overbeek; Real time metagenomics: using k-mers to annotate metagenomes.. Bioinformatics 2012, 28, 3316-7, 10.1093/bioinformatics/bts599.
Qi Dai; Tianming Wang; Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'. BMC Bioinformatics 2008, 9, 394-394, 10.1186/1471-2105-9-394.
Thomas Lingner; Peter Meinicke; Remote homology detection based on oligomer distances. Bioinformatics 2006, 22, 2224-2231, 10.1093/bioinformatics/btl376.
Yu-Fang Qin; Chun-Hua Wang; Xiao-Qing Yu; Jie Zhu; Tai-Gang Liu; Xiao-Qi Zheng; Predicting protein structural class by incorporating patterns of over-represented k-mers into the general form of Chou's PseAAC.. Protein & Peptide Letters 2012, 19, 388-397, 10.2174/092986612799789350.
Mirjana Domazet-Lošo; Bernhard Haubold; Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics 2011, 27, 1466-1472, 10.1093/bioinformatics/btr176.
Michael Höhl; Mark A. Ragan; Is Multiple-Sequence Alignment Required for Accurate Inference of Phylogeny?. Systematic Biology 2007, 56, 206-221, 10.1080/10635150701294741.
Cheong Xin Chan; Mark A Ragan; Next-generation phylogenomics. Biology Direct 2013, 8, 3-3, 10.1186/1745-6150-8-3.
Ji Qi; Hong Luo; Bailin Hao; CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Research 2004, 32, W45-W47, 10.1093/nar/gkh362.
Kai Song; Jie Ren; Zhiyuan Zhai; Xuemei Liu; Minghua Deng; Fengzhu Sun; Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads. Journal of Computational Biology 2013, 20, 64-79, 10.1089/cmb.2012.0228.
Upuli Gunasinghe; Damminda Alahakoon; Susan Bedingfield; Extraction of high quality k-words for alignment-free sequence comparison. Journal of Theoretical Biology 2014, 358, 31-51, 10.1016/j.jtbi.2014.05.016.
Chris-Andre Leimeister; Marcus Boden; Sebastian Horwege; Sebastian Lindner; Burkhard Morgenstern; Fast alignment-free sequence comparison using spaced-word frequencies.. Bioinformatics 2014, 30, 1991-9, 10.1093/bioinformatics/btu177.
Miriam R. Kantorovitz; Gene E. Robinson; Saurabh Sinha; A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics 2007, 23, 249-255, 10.1093/bioinformatics/btm211.
Qi Dai; Yanchun Yang; Tianming Wang; Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison. Bioinformatics 2008, 24, 2296-2302, 10.1093/bioinformatics/btn436.
Hashem Koohy; Nigel P. Dyer; Georgy Koentges; John E. Reid; Sascha Ott; An alignment-free model for comparison of regulatory sequences. Bioinformatics 2010, 26, 2391-2397, 10.1093/bioinformatics/btq453.
Susana Vinga; Jonas Almeida; Alignment-free sequence comparison-a review.. Bioinformatics 2003, 19, 513-523, 10.1093/bioinformatics/btg005.
Andrzej Zielezinski; Susana Vinga; Jonas Almeida; Wojciech M. Karlowski; Alignment-free sequence comparison: benefits, applications, and tools. Genome Biology 2017, 18, 186, 10.1186/s13059-017-1319-7.
Susana Vinga; Editorial: Alignment-free methods in computational biology. Briefings In Bioinformatics 2014, 15, 341-342, 10.1093/bib/bbu005.
Li, M.; Vitányi, P.M.B. An introduction to Kolmogorov complexity and its applications; Springer: New York, 2008; pp. 790.
A. Lempel; J. Ziv; On the Complexity of Finite Sequences. IEEE Transactions on Information Theory 1976, 22, 75-81, 10.1109/tit.1976.1055501.
Hasan H. Otu; Khalid Sayood; A new sequence distance measure for phylogenetic tree construction.. Bioinformatics 2003, 19, 2122-2130, 10.1093/bioinformatics/btg295.
Ming Li; Xin Chen; Xin Li; Bin Ma; P.M.B. Vitanyi; The Similarity Metric. IEEE Transactions on Information Theory 2004, 50, 3250-3264, 10.1109/tit.2004.838101.
András Kocsor; Attila Kertész-Farkas; László Kaján; Sándor Pongor; Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics 2005, 22, 407-412, 10.1093/bioinformatics/bti806.
Paolo Ferragina; Raffaele Giancarlo; Valentina Greco; Giovanni Manzini; Gabriel Valiente; Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinformatics 2007, 8, 252-252, 10.1186/1471-2105-8-252.
Ming Li; Jonathan H. Badger; Xin Chen; Sam Kwong; Paul Kearney; Haoyong Zhang; An information-based sequence distance and its application to whole mitochondrial genome phylogeny.. Bioinformatics 2001, 17, 149-154, 10.1093/bioinformatics/17.2.149.
N. Krasnogor; D. A. Pelta; Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 2004, 20, 1015-1021, 10.1093/bioinformatics/bth031.
B.J. Strait; T.G. Dewey; The Shannon information entropy of protein sequences.. Biophysical Journal 1996, 71, 148-155, 10.1016/s0006-3495(96)79210-x.
S. Kullback; R. A. Leibler; On Information and Sufficiency. The Annals of Mathematical Statistics 1951, 22, 79-86, 10.1214/aoms/1177729694.
Fei Nan; Nald Adjeroh; On complexity measures for biological sequences. Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004. 2004, CSB 2004, 501-505, 10.1109/csb.2004.1332483.
Dianhui Wang; Sarwar Tapan; MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Systems Biology 2012, 6, S4-S4, 10.1186/1752-0509-6-S2-S4.
Matteo Comin; Morris Antonelli; Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles. International Conference on Bioinformatics Models, Methods and Algorithms 2015, 1, 171-177, 10.5220/0005251001710177.
Dianhui Wang; Sarwar Tapan; MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Systems Biology 2012, 6, S4-S4, 10.1186/1752-0509-6-S2-S4.
Dianhui Wang; Sarwar Tapan; MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Systems Biology 2012, 6, S4-S4, 10.1186/1752-0509-6-S2-S4.
Dianhui Wang; Sarwar Tapan; MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Systems Biology 2012, 6, S4-S4, 10.1186/1752-0509-6-S2-S4.
Hong-Bin Shen; Kuo-Chen Chou; EzyPred: A top–down approach for predicting enzyme functional classes and subclasses. Biochemical and Biophysical Research Communications 2007, 364, 53-59, 10.1016/j.bbrc.2007.09.098.
Yong-Sheng Ding; Tong-Liang Zhang; Kuo-Chen Chou; Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network.. Protein & Peptide Letters 2007, 14, 811-815, 10.2174/092986607781483778.
Mehul Jani; Rajeev K. Azad; Information entropy based methods for genome comparison. ACM SIGBioinformatics Record 2013, 3, 1-4, 10.1145/2500124.2500126.
Ivan Erill; Michael C O'neill; A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinformatics 2009, 10, 57-57, 10.1186/1471-2105-10-57.
Minli Xu; Zhengchang Su; A Novel Alignment-Free Method for Comparing Transcription Factor Binding Site Motifs. PLOS ONE 2010, 5, e8797, 10.1371/journal.pone.0008797.
Susana Vinga; Information theory applications for biological sequence analysis. Briefings In Bioinformatics 2013, 15, 376-389, 10.1093/bib/bbt068.
Manish Kumar; Varun Thakur; Gajendra P S Raghava; COPid: composition based protein identification.. In Silico Biology 2008, 8, 121-128, PMID: 18928200 .
David W. Elrod; Kuo-Chen Chou; A study on the correlation of G-protein-coupled receptor types with amino acid composition.. Protein engineering 2002, 15, 713-715, 10.1093/protein/15.9.713.
Makiko Suwa; Bioinformatics Tools for Predicting GPCR Gene Functions. Retinal Degenerative Diseases 2013, 796, 205-224, 10.1007/978-94-007-7423-0_10.
Q Gu; Y S Ding; T L Zhang; Prediction of G-protein-coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns.. Protein Pept Lett 2010, 17, 559-567, 10.2174/092986610791112693.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Word Frequency-Based Methods

Information Theory-Based Methods

References