Enhanced Genetic Method for Optimizing Multiple Sequence Alignment

Subjects: Computer Science, Artificial Intelligence

Contributors: Mohammed K. Ibrahim, Umi Kalsom Yusof, Taiseer Abdalla Elfadil Eisa, Maged Nasser

In bioinformatics, Multiple Sequence Alignment (MSA) is a pivotal technique for optimizing the alignment of multiple biological sequences according to specific scoring criteria. Existing approaches to the MSA problem tend to specialize in distinct biological features, so different methods can produce different alignments for the same set of sequences.

Multiple Sequence Alignment
evolutionary algorithm
genetic algorithm
bioinformatics
optimization

1. Introduction

Sequence alignment (SA) is a widely used approach in bioinformatics that arranges the primary sequences of DNA, RNA, or protein to identify regions of similarity that may reflect evolutionary or structural relationships among the sequences [1]. It helps to locate portions with a common evolutionary history by arranging multiple sequences so that the maximum number of similar or identical residues are aligned in the same column [2]. This can be achieved by aligning unknown sequences with known sequences from a database [3,4]. SA can broadly be divided into pairwise sequence alignment (PSA), in which only two sequences are aligned, and Multiple Sequence Alignment (MSA), in which more than two sequences are aligned simultaneously. MSA is the more commonly used of the two because it can handle several sequences of a family at a time and can therefore identify a sequence’s functional and structural information more precisely [5,6].

MSA can be performed globally [7], where similarity over the entire sequence length is considered, or locally [8], where only the best-scoring local regions of similar characters are considered. An alignment is generally evaluated with scoring functions that measure its quality [9,10]. However, identifying an optimal scoring function for MSA is challenging, since a statistically optimal alignment is not necessarily biologically optimal [11]. MSA is also computationally expensive [12]. Dynamic programming has been applied to improve the MSA optimization process [13]. However, dynamic programming-based approaches suffer from high dimensionality as the number of sequences increases, which results in exponential growth of the time requirement [14]. The MSA problem is NP-complete [15], so practical MSA techniques rely on heuristic methods that yield approximate solutions.
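The dimensionality problem described above can be illustrated with a minimal sketch of the classic Needleman-Wunsch dynamic program for two sequences. The scoring values (`match`, `mismatch`, `gap`) and both function names are assumptions chosen for illustration; they are not taken from the cited works. The helper `dp_table_cells` simply counts the cells a full dynamic programming table would need if the same idea were extended to more sequences of equal length.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                dp[i - 1][j] + gap,      # gap in b
                dp[i][j - 1] + gap,      # gap in a
            )
    return dp[-1][-1]


def dp_table_cells(length, num_sequences):
    """Cells in a full DP table for `num_sequences` sequences of equal length."""
    return (length + 1) ** num_sequences


print(global_alignment_score("ACGT", "AGT"))  # 1
print(dp_table_cells(100, 2))                 # 10201
print(dp_table_cells(100, 3))                 # 1030301
```

Each added sequence multiplies the table size by roughly the sequence length, which is why exact dynamic programming quickly becomes impractical beyond a handful of sequences.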

Two heuristic strategies are commonly used for MSA: progressive and iterative approaches. In progressive alignment [16], all possible sequence pairs are first aligned with a PSA method, and a guide tree is then built from the pairwise distance values. The MSA is then generated stepwise by gradually adding sequences in the order given by the guide tree, starting from the best-aligned pair [17]. A major shortcoming of the progressive approach is that the initial pairwise alignments strongly affect the final alignment, because the position of a gap can practically not be changed in later stages [18]. To mitigate these issues, among others, iterative approaches have been used in the literature [19,20,21,22,23] for the MSA problem. Iterative methods repeatedly modify the guide tree by adjusting the alignment pairs. One of the popular iterative approaches is the Genetic Algorithm (GA) [24], which is inspired by natural genetics [25,26,27,28].
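The guide-tree step of progressive alignment can be sketched as follows. This is a minimal illustration that assumes average-linkage (UPGMA-style) clustering over a precomputed distance table; the text does not state which tree-building method the cited works use, and the function name, sequence names, and distance values are all hypothetical.

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return the order in which clusters are merged (average linkage).

    `dist` maps frozenset({name_a, name_b}) to a pairwise distance.
    Both the clustering rule and the data layout are assumptions.
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: avg_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Hypothetical pairwise distances for four sequences.
distances = {
    frozenset(("s1", "s2")): 0.1,
    frozenset(("s3", "s4")): 0.2,
    frozenset(("s1", "s3")): 0.6,
    frozenset(("s1", "s4")): 0.7,
    frozenset(("s2", "s3")): 0.5,
    frozenset(("s2", "s4")): 0.8,
}
for left, right in guide_tree_order(["s1", "s2", "s3", "s4"], distances):
    print(left, "+", right)
```

A progressive aligner would align sequences and partial alignments in this merge order, which is why an early poor pairing propagates into the final result.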

Several approaches have been introduced that use GA to solve MSA problems [27]. For example, SAGA, introduced in [25], uses 22 different GA operators for MSA. Naznin et al. [26] used a GA-based approach to improve an MSA solution by vertically dividing the sequences into several sub-sequences. In reference [27], the GA was also applied to identify the best guide tree by iteratively altering the guide trees. In similar methods, GA was integrated with other optimization methods, such as ant colony optimization (ACO) [29] and the rubber band technique (RBT), to optimize sequence alignment. Currently, the MSA problem is treated as a multi-objective problem, where each condition can be represented by a distinct objective function. However, in multi-objective problems, optimizing one objective function tends to degrade one or more of the others. Thus, in a real-world situation, a set of non-dominated solutions, known as Pareto optimal solutions, is generally considered [30]. For a solution in the non-dominated set, no objective function can be improved without worsening at least one of the others [31].
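The notion of a non-dominated (Pareto optimal) set can be made concrete with a short sketch. It assumes that every objective is to be maximized and that each candidate alignment has already been reduced to a tuple of objective values; the function names and the sample scores are hypothetical.

```python
def dominates(u, v):
    """True if objective vector u Pareto-dominates v (all objectives maximized)."""
    no_worse = all(a >= b for a, b in zip(u, v))
    strictly_better = any(a > b for a, b in zip(u, v))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) scores for five candidate alignments.
scores = [(10, 2), (8, 5), (7, 4), (3, 9), (3, 3)]
print(non_dominated(scores))  # [(10, 2), (8, 5), (3, 9)]
```

Here (7, 4) and (3, 3) are discarded because (8, 5) is at least as good on both objectives and strictly better on one, while the three remaining points each trade one objective against the other.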

To improve MSA optimization, several multi-objective GA-based methods have recently been proposed [12]. The method introduced in [32] considers three objective functions: Totally Conserved Columns (TCCs), the STRIKE score, and the non-gaps percentage. One shortcoming of this method is the inadequate availability of structures. In a similar method [33], three objectives were used to derive the non-dominated Pareto alignment solutions: similarity maximization, affine gap penalty minimization, and support maximization. In references [34,35], a shuffled frog-leaping optimization method [36] and an artificial bee colony method [37] were applied, respectively. Both approaches used two common fitness functions, the sum of pairs (SOP) and TCC, to obtain a Pareto optimal set, and both used the Kalign aligner [38] as a local search strategy to improve the quality of the solutions. However, in the multi-objective Pareto optimal method, one needs to distinguish the dominated from the non-dominated solutions to obtain the set of non-dominated solutions. This is hard to do in real-world situations [39].
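The two fitness functions named above, SOP and TCC, can be sketched for a small alignment as follows. The match, mismatch, and gap values are illustrative assumptions (published methods typically use substitution matrices and their own gap handling), gap-gap pairs are assumed to score zero, and the function names are hypothetical.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: score every pair of symbols in every column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # assumption: gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def totally_conserved_columns(alignment):
    """Count columns in which every row holds the same non-gap residue."""
    return sum(
        1 for col in zip(*alignment)
        if col[0] != "-" and len(set(col)) == 1
    )


alignment = ["AC-GT", "ACAGT", "AC-GA"]
print(sum_of_pairs(alignment))               # 4
print(totally_conserved_columns(alignment))  # 3
```

For this toy alignment, the three fully conserved columns each add 3 to SOP, the gapped column subtracts 4, and the last column subtracts 1, giving a total of 4.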

2. Related Work

This section reviews previous works on MSA that use multi-objective and metaheuristic methods. Handl et al. [40] introduced one of the earlier studies of the multi-objective method in MSA. Since then, several other approaches have been introduced [1]. Seluangsawat et al. [41] introduced an evolutionary method that starts from outputs of the Clustal X method and optimizes multiple objective functions, including the gap penalty and the sum of pairs. This method uses three mutation operators and a two-point crossover. Ortuño et al. [32] proposed a multi-objective optimization approach that uses the classical metaheuristic NSGA together with structural evaluations to solve MSA problems. It optimizes three objective functions: non-gap percentage, structural information, and TCC, and uses the hypervolume [30] as the quality measure. Kaya et al. [33] introduced another approach based on the NSGA-II algorithm, which considers three objective functions: similarity, affine gap penalty minimization, and support maximization. This approach uses three mutation and two crossover operators. Soto and Becerra [42] proposed a multi-objective approach based on the genetic technique to optimize pre-aligned sequences. Their model uses three operators: random insertion, two-point crossover, and shift mutation.
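To show what operators of this kind might look like, the sketch below implements a random gap insertion and a gap shift mutation on an alignment stored as a list of equal-length strings. The exact definitions used in the cited works are not given in this text, so the behavior here (padding the other rows, swapping a gap with a neighboring residue, dropping all-gap columns) is an assumption, as are all function names. Crossover is omitted.

```python
import random

GAP = "-"


def drop_all_gap_columns(alignment):
    """Remove columns that contain only gaps."""
    keep = [col for col in zip(*alignment) if any(ch != GAP for ch in col)]
    return ["".join(chars) for chars in zip(*keep)]


def insert_gap(alignment, rng):
    """Random insertion: add one gap to a random row and pad the other rows."""
    row = rng.randrange(len(alignment))
    pos = rng.randrange(len(alignment[0]) + 1)
    mutated = [
        seq[:pos] + GAP + seq[pos:] if i == row else seq + GAP
        for i, seq in enumerate(alignment)
    ]
    return drop_all_gap_columns(mutated)


def shift_gap(alignment, rng):
    """Shift mutation: swap one gap with a neighboring residue in the same row."""
    candidates = [
        (r, c)
        for r, seq in enumerate(alignment)
        for c in range(len(seq) - 1)
        if (seq[c] == GAP) != (seq[c + 1] == GAP)
    ]
    if not candidates:
        return list(alignment)  # no gap next to a residue: nothing to shift
    r, c = rng.choice(candidates)
    seq = alignment[r]
    swapped = seq[:c] + seq[c + 1] + seq[c] + seq[c + 2:]
    mutated = [swapped if i == r else s for i, s in enumerate(alignment)]
    return drop_all_gap_columns(mutated)


rng = random.Random(42)  # fixed seed so the run is repeatable
child = shift_gap(insert_gap(["AC-GT", "ACAGT", "AC-GA"], rng), rng)
print(child)
```

Both operators only move or add gap symbols, so the underlying residues of every sequence are left unchanged, which is the invariant any MSA mutation operator has to respect.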

Silva et al. [43] proposed a Parallel Niche Pareto method with two objective functions: the number of totally identical columns and the sum of pairs. Six mutation operators and three crossover strategies were used in this method. Abbasi et al. [44] proposed different local search methods for MSA that minimize the number of indels and maximize the substitution score. The suggested method uses several neighborhood definitions and perturbations. Zhu et al. [45] introduced a decomposition-based multi-objective approach to MSA. This model applies a gap insertion operation to generate the initial population, and its performance was evaluated by comparing it with several existing evolutionary alignment algorithms.

As noted in the Introduction, the multi-objective GA-based methods in [32] and [33] each optimize three objective functions, and the method in [32] is limited by the inadequate availability of structures. More recently, Rubio-Largo et al. [35,46] introduced two methods to improve the MSA solution: the hybrid multi-objective Memetic Metaheuristic approach [46] and the hybrid multi-objective artificial bee colony method [35]. Both approaches use the conserved columns and the weighted sum-of-pairs function (WSP) with affine gap penalties, integrated with the Kalign method [47]. Finally, Rani et al. [48] introduced two approaches, the Bacterial Foraging Optimization method and the Hybrid GA with Artificial Bee Colony Algorithm. The authors used four objective functions: the maximization of similarity, conserved blocks, and non-gap percentage, and the minimization of the gap penalty. However, in the multi-objective Pareto optimal approach, one needs to distinguish the dominated from the non-dominated solutions to obtain the set of non-dominated solutions, which is hard to do in real-world situations.
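Two of the objectives mentioned above, the affine gap penalty and the non-gap percentage, can be sketched as follows. The convention that a gap run of length n costs `gap_open + gap_extend * (n - 1)`, the default parameter values, and the function names are assumptions for illustration; the cited methods may define these quantities differently.

```python
import re


def affine_gap_penalty(alignment, gap_open=10.0, gap_extend=0.5):
    """Total affine gap penalty over all rows.

    Assumed convention: a run of n gaps costs gap_open + gap_extend * (n - 1).
    """
    penalty = 0.0
    for row in alignment:
        for run in re.findall(r"-+", row):
            penalty += gap_open + gap_extend * (len(run) - 1)
    return penalty


def non_gap_percentage(alignment):
    """Percentage of alignment cells that hold a residue rather than a gap."""
    cells = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return 100.0 * (cells - gaps) / cells


alignment = ["AC--GT", "ACA-GT", "A---GA"]
print(affine_gap_penalty(alignment))  # 31.5
print(round(non_gap_percentage(alignment), 2))
```

Under the affine model, one long gap is cheaper than several short ones of the same total length, which is why it is often preferred over a linear penalty when gaps are expected to occur in contiguous runs.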

Despite their good performance, the above GA-based methods have several shortcomings. First, some existing algorithms use only one criterion in their objective function, yet improving one objective may degrade one or more of the others, so optimizing a single objective cannot satisfy all objectives simultaneously. Second, the conventional GA generally represents a solution, or chromosome, as a binary string. However, binary coding in MSA increases the chromosome length, the computational complexity, and the memory requirement. Table 1 summarizes some popular state-of-the-art metaheuristic methods for MSA.

Table 1. Some popular methods for MSA.

Model	Brief Description/Advantages	Drawbacks/Disadvantages
[32]	Uses TCC, STRIKE score, and non-gaps percentage as the objective functions	Inadequate availability of structures
[34]	Uses frog-leaping optimization method; considers SOP and TCC as fitness functions	Hard to apply in real-world situations
[33]	NSGA-II, which considers similarity, support maximization, and affine gap as the fitness functions	Computationally intensive
[45]	A multi-objective-based approach utilizing a decomposition strategy	Computationally intensive
[49]	A hybrid method combining the RBT and the GA	Single objective function; low performance
[50]	Uses a recursive-based GA to find the optimal fragmentation of the sequences	Single objective function; computationally intensive
[51]	MSA of protein sequences using GA; SOP used as the fitness function	Single objective function; low performance
[52]	A decomposition-based multi-objective method that uses artificial bee colony	Inadequate availability of structures

This entry is adapted from the peer-reviewed paper 10.3390/math11224578

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.