Accelerating Multiple Sequence Alignments Using Parallel Computing

Accelerating Multiple Sequence Alignments Using Parallel Computing: Comparison

Please note this is a comparison between Version 1 by Qanita Bani Baker and Version 2 by Catherine Yang.

Multiple sequence alignment (MSA) stands as a critical tool for understanding the evolutionary and functional relationships among biological sequences. Obtaining an exact solution for MSA, termed exact-MSA, is a significant challenge due to the combinatorial nature of the problem. Using the dynamic programming technique to solve MSA is recognized as a highly computationally complex algorithm. To cope with the computational demands of MSA, parallel computing offers the potential for significant speedup in MSA.

multiple sequence alignment
dynamic programming
parallel computing

1. Introduction

Sequence alignment (SA) refers to the process of arranging and comparing biological sequences, such as DNA, RNA, and proteins, with the ability to reveal meaningful information about the similarities and differences among them. SA is one of the fundamental steps in most genomic analyses [1]. The classification of sequence alignment techniques encompasses two fundamental distinctions: global versus local alignment and pairwise versus multiple alignment. Global alignment algorithms, such as the Needleman–Wunsch algorithm, align sequences from the beginning to the end [2]. By contrast, the local sequence alignment is used to compare specific regions and to find contiguous regions of high similarity, such as that performed by the the Smith–Waterman algorithm [3]. The main two types of sequence alignment are also performed in two ways: the pairwise sequence alignment (PSA) [4] and the multiple sequence alignment (MSA) [5]. PSA involves the comparison of two sequences to identify regions of similarity and dissimilarity, while MSA extends the comparison to more than two sequences, identifying conserved regions and variations across a set of sequences.

The dynamic programming technique [6] provides an efficient computational approach for optimizing alignment scores by breaking down complex problems into smaller overlapping subproblems [7]. Common dynamic programming algorithms for pairwise sequence alignment include the Needleman–Wunsch algorithm, employed for global alignment, and the Smith–Waterman algorithm, utilized for local alignment. Dynamic programming extends its utility to multiple sequence alignment algorithms, such as the progressive and iterative methods ^[8][9][8,9]. Aligning N sequences using dynamic programming is an NP-Hard problem [10] that stems from the complexity of considering all possible combinations and alignments among the N sequences. To address complexity challenges in MSA, heuristic methods [11] and approximation algorithms [12] are employed in practice for the MSA of a large number of sequences. In addition to these algorithms, applying parallel computing techniques offers a promising avenue to mitigate the computational demands associated with MSA [13].

2. Pairwise Sequence Alignment (PSA)

Pairwise sequence alignment (PSA) is considered an important tool for aligning biological sequences such as DNA and protein sequences [14]. Haque et al. [15] presented a comprehensive overview of both local and global pairwise sequence alignment algorithms. They also included an identification of the techniques utilized in these algorithms and discussed their respective advantages and limitations. In [16], Edgar et al. distinguished between the main three methods used to align sequences: sequence–sequence methods (like BLAST), profile–sequence methods (like PSI-BLAST), and profile–profile methods (like CLUSTALW). The survey in [17] reviewed the wide range of aligning algorithms and tools developed to assess the quality of the aligned sequences. In [18], bacterial DNA sequences were aligned using pairwise alignment and dynamic programming. Table 1 shows an overview of the most well-known approaches utilized for PSA.

Table 1.

Pairwise sequence alignment techniques.

#	Technique	Approach	Reference
1	Needleman–Wunsch	Dynamic Programming	[2]
2	Smith–Waterman	Dynamic Programming	[3]
3	Gotoh’s Algorithm	Dynamic Programming	[19]
4	FASTA Algorithm	Heuristic	[20]
5	BLAST Algorithm	Heuristic	[21]
6	EMBOSS Software	Toolkit	[22]
7	Parasail	Toolkit/Library	[23]
7	Minimap2	Toolkit/Program	[24]
9	ASCA-PSO	Heuristic	[25]
8	WFA-GPU	Toolkit	[26]

Several studies have aimed to accelerate the performance and the accuracy of the tools used in sequence alignment by using several parallelization techniques [27]. For example, Fakirah et al. [28] utilized a diagonal traversing approach to enhance the Needleman–Wunsch algorithm by utilizing the iterations used to fill the scoring matrix. Balhaf et al. [29] enhanced the Levenshtein edit distance algorithm’s performance by using the diagonal traversing approach, and the performance was enhanced using both CPU and GPU. Jararweh et al. [30] accelerated the Levenshtein and Damerau algorithms by using parallel implementation on a GPU. Jararweh et al. showed that using unified memory resulted in the best performance. Shehab et al. [31] enhanced the performance of multiple pairwise alignments in protein sequences by utilizing a hybrid CPU-GPU implementation. In [26], Puig et al. utilized a GPU (graphics processing unit) to compute exact gap-affine alignments based on the wavefront alignment (WFA) algorithm. They showed that the proposed tool is up to 29× faster than other GPU implementations.

3. Multiple Sequence Alignment

Numerous studies have employed various techniques to address the challenge of multiple sequence alignments (MSAs) [9]. One widely adopted technique is progressive alignment [32]. The progressive alignment method initially starts with pairwise alignments and progressively builds an MSA alignment through a series of pairwise alignments, producing accurate results for moderately sized sequence sets [33]. In addition to the progressive methods, iterative approaches have also played a crucial role in improving the accuracy of MSA [34]. Iterative methods generally refine alignments applying successive cycles of alignment improvement. Iterative refinement involves realigning sequences based on the initial solution and gradually converging toward a more accurate alignment. Iterative techniques often outperform progressive methods in terms of alignment accuracy, especially in cases where sequences are more distant [35]. Lupyan et al. proposed a hybrid algorithm that combined the progressive and iterative algorithms for MSA. The hybrid approach provided a significant advancement compared to earlier methods involving a notable decrease in computational cost. In addition to progressive and iterative methods, several studies focus on the utilization of metaheuristics techniques for performing MSA [11]. Ali et al. [36] reviewed the landscape of metaheuristics in bioinformatics highlighting various metaheuristic approaches, including tabu search [37], simulated annealing [38], and particle swarm optimization [39], showcasing their applications in computational biology problems and MSA. Hatzou et al. [9] provided valuable insights centered on the heuristic-based progress of MSA methods. Similarly, Chowdhury et al. [40] offered an overview of MSA methods with a focus on the multi-objective approach. In contrast, Vega et al. [41] provided a comparative analysis of different formulations of multi-objective metaheuristics for MSA. In Table 2, the researchers present some well-known tools for MSA and show the general techniques used for each.

Table 2.

Multiple sequence alignment tools with techniques.

#	Technique	Approach	Heuristics	Ref.
1	Recursive MAGUS	Divide-and-Conquer Alignment	Guide Tree	[42]
2	ClipKIT	Trimming Strategies	IQ-TREE Hill-Climbing	[43]
3	Kalign	Progressive Alignment	Guide Tree	[44]
4	ProbCons	Probabilistic Consistency	Probabilistic Modeling	[45]
5	MUSCLE	Progressive Alignment	Guide Tree	[46]
6	MAFFT	Progressive Alignment	Guide Tree	[47]
7	T-Coffee	Various	Various	[48]
8	DIALIGN	Local Multiple Alignment	Pairwise Alignments	[49]
9	CLUSTAL W	Progressive Alignment	Guide Tree	[50]

Limited studies have been directed towards seeking exact solutions for multiple sequence alignment due to the time complexity associated with obtaining the optimal results. Mojbak et al. [51] proposed an exact-MSA approach using forward dynamic programming. Also, in the comprehensive exploration of exact solutions, Hosseininasab et al. [52] proposed a framework employing a dynamic programming approach to construct a multivalued decision diagram, representing all PSAs. The synchronization of PSAs with the proposed decision diagram effectively incorporates modeling the MSA problem within polynomial space complexity. Moreover, Domínguez [53] delves into statistical and biological concepts employed in the MSAProbs-MPI tool to complete the alignments where high-performance computing techniques are employed for alignment acceleration. Additionally, Ju et al. [54] introduced an end-to-end deep neural network and called it CopulaNe, designed to directly estimate residue co-evolution from MSA, representing a cutting-edge approach in the finding of exact solutions for MSA. In addition to the previously mentioned approaches, several parallelization strategies have been employed to tackle the challenges associated with MSA [55]. Some of these strategies focus on the parallelization of dynamic programming algorithms, such as in [56]. Other strategies aim to parallelize the progressive alignment [57]. Several studies focus on the parallelization of heuristic algorithms, such as [58]. Recently, many studies utilized GPU acceleration for MSA [59]. The optimization of parallel MSA is characterized by continuous innovation in algorithmic design and adaptation to emerging hardware architectures [55].