Enhanced Genetic Method for Optimizing Multiple Sequence Alignment

Enhanced Genetic Method for Optimizing Multiple Sequence Alignment: Comparison

Please note this is a comparison between Version 2 by Sirius Huang and Version 1 by Mohammed Khaleel Ibrahim.

In the realm of bioinformatics, Multiple Sequence Alignment (MSA) is a pivotal technique used to optimize the alignment of multiple biological sequences, guided by specific scoring criteria. Existing approaches addressing the MSA challenge tend to specialize in distinct biological features, leading to variability in alignment outcomes for the same set of sequences.

Multiple Sequence Alignment
evolutionary algorithm
genetic algorithm
bioinformatics
optimization

1. Introduction

Sequence alignment (SA) is one of the popular approaches in bioinformatics that is used to arrange the primary sequences of DNA/RNA to identify regions of similarity that may have evolutionary or structural relationships among the sequences ^[1]. It helps to locate portions with a common evolutionary history by arranging multiple sequences such that a maximum number of similar or identical residues are aligned or matched in a column ^[2]. This can be achieved by aligning the unknown sequences with known sequences from a database [3,4]^[3][4]. SA can broadly be divided into Multiple Sequence Alignment (MSA), where multiple sequences are aligned simultaneously, and pairwise sequence alignment (PSA), where only two sequences are involved in the alignment process. Generally, MSA is the most commonly used tool that is capable of precisely identifying a sequence’s functional and structural information, as it can deal with several sequences of a family at a time [5,6]^[5][6].

MSA can be achieved globally ^[7], where the similarity over the entire sequence length is generally considered, or locally ^[8], in which the local best-scoring parts of similar characters are considered. An alignment generally considers scoring functions to measure the alignment quality [9,10]^[9][10]. However, in MSA, it is a challenging task to identify an optimum scoring function since the statistically optimized functions are not biologically optimal ^[11]. Moreover, in MSA, computational complexity requires many resources ^[12]. Recently, to improve the MSA optimization process, a dynamic programming approach was applied ^[13]. However, the dynamic programming-based approaches in MSA generally experience high dimensionality due to the increasing number of sequences, which results in exponential growth of the time requirement ^[14]. Essentially, the MSA process is NP-complete ^[15], and thus, all real-world MSA techniques consider heuristic methods that are approximate in real-world situations.

Two different heuristic methods are popularly used for MSA solutions, namely, the iterative and progressive approaches. In the progressive alignment ^[16], the PSA method addresses an MSA process in which all of the possible sequence pairs are first aligned, and a guide tree based on the pairwise distance values is then developed. Then, eventually, the MSA is generated stepwise via the gradual arrangement of all of the sequences based on the guide tree, in which, mainly, the best alignment pair is considered ^[17]. A major shortcoming of the progressive approach is that the initial sequence pair alignment usually affects the resultant alignments. Thus, changing the position of the gap in the later stages is practically impossible ^[18]. To mitigate these issues, among others, iterative approaches are used in the literature [19,20,21,22,23]^{[19][20][21][22][23]} for the MSA problem. Iterative methods are used to iteratively change the building of the guide tree by adjusting the alignment pairs. One of the popular iterative approaches is the Genetic Algorithm (GA) ^[24], which is inspired by natural genetics [25,26,27,28]^{[25][26][27][28]}.

Several approaches have been introduced to utilize GA to solve MSA problems ^[27]. For example, SAGA, introduced in ^[25], uses 22 various GA operators for the MSA. Naznin et al. ^[26] used a GA-based approach to improve an MSA solution by vertically demarcating the sequences into several sub-sequences. In reference ^[27], the GA was also applied to identify the best guide tree by iteratively altering the guide trees. In similar methods, GA was integrated with other optimization methods like ant colony optimization (ACO) ^[29] and the rubber band technique (RBT) to optimize sequence alignment. Currently, the MSA problem is considered a multi-objective process, where each condition can represent a distinct objective function. However, in multi-objective functions, there is a tendency for the accuracy of one objective function to be affected by the optimization of one or more objective functions. Thus, in a real-world situation, a set of non-dominated solutions is generally considered, known as Pareto optimal solutions ^[30]. Apart from the non-dominated set, no other means is feasible to improve any one of the objective functions without affecting the others ^[31].

For better MSA optimization, recently, some multi-objective GA-based methods were proposed ^[12]. One method was introduced in ^[32], which considered three objective functions: Totally Conserved Columns (TCCs), STRIKE score, and non-gaps percentage. One shortcoming of this method is the inadequate availability of structures. In a similar method ^[33], three objectives were introduced to derive the non-dominated Pareto alignment solutions: similarity maximization, affine gap penalty minimization, and support maximization. In references [34^[34][35],35], a shuffled frog-leaping optimization method ^[36] and an artificial bee colony method ^[37] were applied, respectively. Both approaches applied two commonly used fitness functions, the sum of pairs (SOP) and TCC, to obtain a Pareto optimal set. These methods utilized another effective Kalign ^[38] as a local search strategy to improve the solutions’ quality. However, in the multi-objective Pareto optimal method, one needs to specify the dominant and non-dominant solutions to obtain the set of non-dominated solutions. This is hard to conduct in real-world situations ^[39].

2. RMelated Workthods for Multiple Sequence Alignment

This section reviews previous works related to MSA that use multi-objective and metaheuristic methods. Handl et al. introduced one of the earlier approaches to study the multi-objective method in MSA ^[40]. Since then, several other approaches have been introduced ^[1]. Seluangsawat et al. ^[41] introduced an evolutionary method to solve the MSA problem based on outputs obtained from the Clustal X method by utilizing multiple objective functions, which include the gap penalty and the sum of pairs. This method uses three mutation operators and a two-point crossover. Ortuño et al. ^[32] proposed a multi-objective optimization-based approach that uses the classical metaheuristic NSGA based on structural evaluations to solve MSA problems. The proposed approach optimizes multiple objective functions: non-gap percentage, structural information, and TCC. This method applies the hyper-volume ^[30] as the quality evaluation measure. Kaya et al. ^[33] introduced another approach based on the NSGA-II algorithm, which considers three objective functions: similarity, affine gap penalty minimization, and support maximization. This approach uses three mutations and two crossover genetic strategies. Soto and Becera ^[42] proposed a multi-objective approach based on the genetic technique to optimize pre-aligned sequences. Their suggested model uses three operators: random insertion, two-point crossover, and shift mutation.

Silva et al. ^[43] proposed Parallel Niche Pareto by using two different objective functions, including the number of totally identical columns and the sum of pairs. Six mutation operators and three crossover strategies were used in this method. Abbasi et al. ^[44] proposed different local search methods for MSA solutions to minimize the number of indels and maximize the substitution score. The suggested method uses several neighborhood definitions and perturbations. A multi-objective-based approach based on decomposition applied to solve MSA was introduced by Zhu et al. ^[45]. This model applies a gap insertion operation to generate the initial population. Several existing evolutionary alignment algorithms were compared with the tool to evaluate the model performances.

For better MSA optimization, recently, some multi-objective GA-based methods were proposed ^[12]. One method was introduced in ^[32], which considers three objective functions: Totally Conserved Columns (TCCs), the STRIKE score, and non-gaps percentage. One shortcoming of this method is the inadequate availability of structures. In a similar method ^[33], three objectives were introduced to derive the non-dominated Pareto alignment solutions: similarity maximization, affine gap penalty minimization, and support maximization. Recently, Rubio-Largo et al. [35,46]^[35][46] introduced two different methods to improve the MSA solution: the hybrid multi-objective Memetic Metaheuristic approach ^[46] and the hybrid multi-objective artificial bee colony method ^[35]. These two approaches use the conserved columns and weighted sum-of-pairs function (WSP) with affine gap penalties integrated with the Kalign method ^[47]. Finally, Rani et al. ^[48] introduced two approaches, the Bacterial Foraging Optimization method and the Hybrid GA with Artificial Bee Colony Algorithm. The authors notably utilized four objective functions: the maximization of similarity, conserved blocks, non-gap percentage, and minimization of gap penalty. However, in the multi-objective Pareto optimal approach, one needs to specify the dominant and non-dominant solutions to obtain the set of non-dominated solutions. This is hard to conduct in real-world situations.

Despite the good performances of the above GA-based methods, they experience several shortcomings. Firstly, some existing algorithms use only one criterion in their objective function, and improving one objective may deteriorate one or more other objectives. It is impossible to optimize a single objective to achieve all objectives simultaneously. Secondly, the conventional GA generally represents a solution or a chromosome with a binary string. However, the binary coding in MSA increases the chromosome/string length, computational complexity, and memory space. Table 1 summarizes some popular state-of-the-art metaheuristic methods for the MSA.

Table 1.

Some popular methods for MSA.

Model	Brief Description/Advantages	Drawbacks/Disadvantages
^[32]	Uses TCC, STRIKE score, and non-gaps percentage as the objective function	Inadequate availability of structures
^[34]	Uses frog-leaping optimization method; considers SOP and TCC as fitness functions	It is hard to conduct in real-world situations
^[33]	NSGA-II, which considers similarity, support maximization, and affine gap as the fitness functions	Computationally intensive
^[45]	A multi-objective-based approach utilizing a decomposition strategy	Computationally intensive
^[49]	A hybrid method combining the RBT and the GA	Single objective/function, low performance
^[50]	Uses a recursive-based GA to find the optimal fragmentation of the sequences	Single objective function/computationally intensive
^[51]	MSA of protein sequences using GA; SOP used as the fitness function	Single objective/function, low performance
^[52]	A decomposition-based multi-objective method that uses artificial bee colony	Inadequate availability of structures

References

Paruchuri, T.; Rao, G.; Suresh, K.; Rohit, D.; Yadav, K.; Singh, S. Nature Inspired Algorithms for Solving Multiple Sequence Alignment Problem: A Review. Arch. Comput. Methods Eng. 2022, 29, 5237–5258.
Altwaijry, N.; Almasoud, M.; Almalki, A.; Al-Turaiki, I. Multiple Sequence Alignment Using a Multiobjective Artificial Bee Colony Algorithm. In Proceedings of the 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 19–21 March 2020; pp. 1–6.
Thompson, J.D.; Linard, B.; Lecompte, O.; Poch, O. A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives. PLoS ONE 2011, 6, e18093.
Shen, C.; Park, M.; Warnow, T. WITCH: Improved Multiple Sequence Alignment through Weighted Consensus Hidden Markov Model Alignment. J. Comput. Biol. 2022, 29, 782–801.
Do, C.B.; Mahabhashyam, M.S.P.; Brudno, M.; Batzoglou, S. ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment. Genome Res. 2005, 15, 330–340.
Paruchuri, T.; Kancharla, G.R.; Dara, S. Solving Multiple Sequence Alignment Problems by Using a Swarm Intelligent Optimization Based Approach. Int. J. Electr. Comput. Eng. 2023, 13, 1097.
Needleman, S.B.; Wunsch, C.D. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol. 1970, 48, 443–453.
Smith, T.F.; Waterman, M.S. Identification of Common Molecular Subsequences. J. Mol. Biol. 1981, 147, 195–197.
Goswami, A.; Dubey, K.K. A Novel Population-Based Optimization for Multiple Sequence Alignment in Protein Sequencing. Eng. Sci. 2022, 21, 786.
Soam, S.S. A Genetic Algorithm Based Approach for the Optimization of Multiple Sequence Alignment. In Proceedings of the 2020 International Conference on Computational Performance Evaluation (ComPE), Shillong, India, 2–4 July 2020.
Notredame, C. Recent Progress in Multiple Sequence Alignment: A Survey. Pharmacogenomics 2002, 3, 131–144.
Amorim, A.R.; Zafalon, G.F.D.; de Godoi Contessoto, A.; Valêncio, C.R.; Sato, L.M. Metaheuristics for Multiple Sequence Alignment: A Systematic Review. Comput. Biol. Chem. 2021, 94, 107563.
Shyu, C.; Sheneman, L.; Foster, J.A. Multiple Sequence Alignment with Evolutionary Computation. Genet. Program. Evolvable Mach. 2004, 5, 121–144.
Lipman, D.J.; Altschul, S.F.; Kececioglu, J.D. A Tool for Multiple Sequence Alignment. Proc. Natl. Acad. Sci. USA 1989, 86, 4412–4415.
Wang, L.; Jiang, T. On the Complexity of Multiple Sequence Alignment. J. Comput. Biol. 1994, 1, 337–348.
Hogeweg, P.; Hesper, B. The Alignment of Sets of Sequences and the Construction of Phyletic Trees: An Integrated Method. J. Mol. Evol. 1984, 20, 175–186.
Thompson, J.D.; Higgins, D.G.; Gibson, T.J. CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice. Nucleic Acids Res. 1994, 22, 4673–4680.
Thompson, J.D.; Koehl, P.; Ripp, R.; Poch, O. BAliBASE 3.0: Latest Developments of the Multiple Sequence Alignment Benchmark. Proteins Struct. Funct. Bioinform. 2005, 61, 127–136.
Yamada, S.; Gotoh, O.; Yamana, H. Improvement in Accuracy of Multiple Sequence Alignment Using Novel Group-to-Group Sequence Alignment Algorithm with Piecewise Linear Gap Cost. BMC Bioinform. 2006, 7, 524.
Katoh, K.; Kuma, K.; Toh, H.; Miyata, T. MAFFT Version 5: Improvement in Accuracy of Multiple Sequence Alignment. Nucleic Acids Res. 2005, 33, 511–518.
Edgar, R.C. MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput. Nucleic Acids Res. 2004, 32, 1792–1797.
Gotoh, O. Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments. J. Mol. Biol. 1996, 264, 823–838.
Pei, J.; Grishin, N. V PROMALS: Towards Accurate Multiple Sequence Alignments of Distantly Related Proteins. Bioinformatics 2007, 23, 802–808.
Golberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. Addion Wesley 1989, 1989, 36.
Notredame, C.; Higgins, D.G. SAGA: Sequence Alignment by Genetic Algorithm. Nucleic Acids Res. 1996, 24, 1515–1524.
Naznin, F.; Sarker, R.; Essam, D. Vertical Decomposition with Genetic Algorithm for Multiple Sequence Alignment. BMC Bioinform. 2011, 12, 353.
Naznin, F.; Sarker, R.; Essam, D. Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment. IEEE Trans. Evol. Comput. 2012, 16, 615–631.
Gondro, C.; Kinghorn, B.P. A Simple Genetic Algorithm for Multiple Sequence Alignment. Genet. Mol. Res. 2007, 6, 964–982.
Lee, Z.-J.; Su, S.-F.; Chuang, C.-C.; Liu, K.-H. Genetic Algorithm with Ant Colony Optimization (GA-ACO) for Multiple Sequence Alignment. Appl. Soft Comput. 2008, 8, 55–78.
Zhou, A.; Qu, B.-Y.; Li, H.; Zhao, S.-Z.; Suganthan, P.N.; Zhang, Q. Multiobjective Evolutionary Algorithms: A Survey of the State of the Art. Swarm Evol. Comput. 2011, 1, 32–49.
Ehrgott, M. Multicriteria Optimization; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005; Volume 491, ISBN 3540213988.
Valenzuela, O.; Rojas, F.; Pomares, H.; Ortun, F.M.; Florido, J.P.; Urquiza, J.M.; Rojas, I. Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: Structural information, non-gaps percentage and totally conserved columns. Bioinformatics 2013, 29, 2112–2121.
Kaya, M.; Sarhan, A.; Alhajj, R. Multiple Sequence Alignment with Affine Gap by Using Multi-Objective Genetic Algorithm. Comput. Methods Programs Biomed. 2014, 114, 38–49.
Rubio-Largo, Á.; Vanneschi, L.; Castelli, M.; Vega-Rodríguez, M.A. A Characteristic-Based Framework for Multiple Sequence Aligners. IEEE Trans. Cybern. 2016, 48, 41–51.
Rubio-Largo, Á.; Vega-Rodríguez, M.A.; González-Álvarez, D.L. Hybrid Multiobjective Artificial Bee Colony for Multiple Sequence Alignment. Appl. Soft Comput. 2016, 41, 157–168.
Eusuff, M.; Lansey, K.; Pasha, F. Shuffled Frog-Leaping Algorithm: A Memetic Meta-Heuristic for Discrete Optimization. Eng. Optim. 2006, 38, 129–154.
Karaboga, D. An Idea Based on Honey Bee Swarm for Numerical Optimization; Technical Report—tr06; Erciyes University: Kayseri, Turkey, 2005.
Lassmann, T.; Frings, O.; Sonnhammer, E.L.L. Kalign2: High-Performance Multiple Alignment of Protein and Nucleotide Sequences Allowing External Features. Nucleic Acids Res. 2009, 37, 858–865.
DeRonne, K.W.; Karypis, G. Pareto Optimal Pairwise Sequence Alignment. IEEE/ACM Trans. Comput. Biol. Bioinform. 2013, 10, 481–493.
Handl, J.; Kell, D.B.; Knowles, J. Multiobjective Optimization in Bioinformatics and Computational Biology. IEEE/ACM Trans. Comput. Biol. Bioinform. 2007, 4, 279–292.
Seeluangsawat, P.; Chongstitvatana, P. A Multiple Objective Evolutionary Algorithm for Multiple Sequence Alignment. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, Washington, DC, USA, 25–29 June 2005; pp. 477–478.
Soto, W.; Becerra, D. A multi-objective evolutionary algorithm for improving multiple sequence alignments. In Advances in Bioinformatics and Computational Biology, Volume 8826 of Lecture Notes in Computer Science; Springer: Berlin, Germany, 2014; pp. 73–82.
da Silva, F.J.M.; Pérez, J.M.S.; Pulido, J.A.G.; Rodriguez, M.A.V. Parallel Niche Pareto AlineaGA—An Evolutionary Multiobjective Approach on Multiple Sequence Alignment. J. Integr. Bioinform. 2011, 8, 57–72.
Abbasi, M.; Paquete, L.; Pereira, F.B. Local Search for Multiobjective Multiple Sequence Alignment. In Proceedings of the Third International Conference on Bioinformatics and Biomedical Engineering, Granada, Spain, 15–17 April 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 175–182.
Zhu, H.; He, Z.; Jia, Y. A Novel Approach to Multiple Sequence Alignment Using Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE J. Biomed. Health Inform. 2015, 20, 717–727.
Rubio-Largo, Á.; Vega-Rodríguez, M.A.; González-Álvarez, D.L. A Hybrid Multiobjective Memetic Metaheuristic for Multiple Sequence Alignment. IEEE Trans. Evol. Comput. 2015, 20, 499–514.
Lassmann, T.; Sonnhammer, E.L.L. Kalign–an Accurate and Fast Multiple Sequence Alignment Algorithm. BMC Bioinform. 2005, 6, 298.
Rani, R.R.; Ramyachitra, D. Multiple Sequence Alignment Using Multi-Objective Based Bacterial Foraging Optimization Algorithm. Biosystems 2016, 150, 177–189.
Taheri, J.; Zomaya, A.Y. RBT-GA: A Novel Metaheuristic for Solving the Multiple Sequence Alignment Problem. BMC Genom. 2009, 10, S10.
Belattar, K.; Zemali, E.-A.; Baouni, S.; Dehni, S. Parallel Multiple DNA Sequence Alignment Using Genetic Algorithm and Asynchronous Advantage Actor Critic Model. Int. J. Bioinform. Res. Appl. 2022, 18, 460–478.
Kumar, M. An Enhanced Algorithm for Multiple Sequence Alignment of Protein Sequences using genetic algorithm. EXCLI J. 2015, 14, 1232–1255.
Ye, L. A Decomposition and Dominance-Based Multiobjective Artificial Bee Colony Algorithm for Multiple Sequence Alignment. Mob. Inf. Syst. 2022, 2022, 5444055.