Protein Tertiary Structure Prediction


Please note this is a comparison between Version 1 by Wei Zheng and Version 2 by Lindsay Dong.

The prediction of three-dimensional (3D) protein structure from amino acid sequences has stood as a significant challenge in computational and structural bioinformatics for decades. Recently, the widespread integration of artificial intelligence (AI) algorithms has substantially accelerated progress in protein structure prediction, yielding numerous significant milestones. In particular, the end-to-end deep learning method AlphaFold2 raised structure prediction performance to new heights, producing models regularly competitive with experimental structures in the 14th Critical Assessment of Protein Structure Prediction (CASP14).

AlphaFold2
contact map
deep learning
distance map
end-to-end methods
multi-domain proteins
protein language model

1. Introduction

Proteins are macromolecules that carry out the functions essential for sustaining life. Their roles span a diverse array: providing structural support to cells, safeguarding the immune system, catalyzing enzymatic reactions, transmitting cellular signals, regulating transcription and translation, and synthesizing and breaking down biomolecules. Moreover, they contribute significantly to the regulation of developmental processes and biological pathways, and to the constitution of protein complexes and subcellular structures. These diverse functions originate from their distinct three-dimensional (3D) structures, which vary across different protein molecules. Since Anfinsen showed in 1973 that the tertiary structure of a protein is determined by its amino acid sequence [1], understanding the protein sequence–structure–function paradigm has become a cornerstone of modern biomedical studies. Owing to significant efforts in genome sequencing over the last few decades [2,3,4], the number of known amino acid sequences deposited in UniProt [5] has grown to over 250 million. Despite this impressive volume of data, amino acid sequences alone offer only limited insight into the biological functions of individual proteins, as these functions are primarily determined by their three-dimensional structures.

Some of the most widely used experimental techniques for determining protein structures include X-ray crystallography [6], NMR spectroscopy [7], and cryo-electron microscopy [8]. Despite their accuracy, the considerable human effort and expense involved in experimentally resolving a protein’s structure have limited growth in the number of solved protein structures. Consequently, the expansion in solved protein structures has considerably trailed the accumulation of protein sequences. At present, the Protein Data Bank [9] (PDB) contains structures for approximately 0.21 million proteins, accounting for less than 0.1% of the total sequences cataloged in the UniProt database [10]. This disparity highlights the ever-widening gap between known protein sequences and experimentally solved protein structures. Nevertheless, owing to substantial collective efforts within the scientific community in recent decades [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25], computational approaches have made remarkable progress, through which an increasing fraction of sequences in various organisms have had their tertiary structures reliably modeled [26,27,28,29,30,31,32,33,34,35,36,37,38,39]. For example, the first version of AlphaFold demonstrated exceptional predictive capabilities during the 13th Critical Assessment of Protein Structure Prediction (CASP13) by employing deep learning-based distance map prediction. Furthermore, by adopting an end-to-end deep learning approach, AlphaFold2 raised structure prediction performance to new heights, producing models regularly competitive with experimental structures in CASP14. 
These methodologies have significantly contributed to diverse biomedical investigations, including structure-based protein function annotation [40,41,42,43,44], mutation analysis [45,46,47,48,49,50,51,52], ligand screening [53,54,55,56,57,58,59], and drug discovery [60,61,62,63,64,65].

2. Protein Structure Prediction

2.1. Template-Based Modeling (TBM) Methods

Template-based modeling (TBM) methods have emerged as pivotal approaches in computational biology for predicting protein structures. TBM leverages known protein structures from the PDB, referred to as templates, to predict the structure of an unknown protein (the target), assuming that the target shares a significant degree of sequence similarity with the template. As shown in Figure 1, TBM methods usually consist of the following four steps: (i) identifying templates related to the protein of interest, (ii) aligning the query protein with the templates, (iii) building the initial structural framework by replicating the aligned regions, and (iv) constructing the unaligned regions and refining the structure. TBM can be divided into homology modeling (comparative modeling), which is often employed when there is substantial sequence identity (typically 30% or greater) between the template and the protein of interest, and threading (fold recognition), which is used when the sequence identity drops below the 30% threshold [66].

Figure 1. Illustration of template-based modeling (TBM) methods. Starting from a query sequence, templates are identified from the Protein Data Bank (PDB) and subsequently aligned with the query protein sequence. Then, the final structural model is constructed by replicating the aligned regions and refining the unaligned regions.

In homology modeling, high-quality templates are detected and aligned using straightforward sequence–sequence alignment algorithms, such as the dynamic programming-based Needleman–Wunsch [67] algorithm for global alignment and the Smith–Waterman [68] algorithm for local alignment. BLAST [69] is another widely used tool for identifying templates and generating alignments; it first identifies short matches between the query and the template, and then extends these matches to generate alignments.
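The dynamic programming idea behind global alignment can be illustrated with a minimal sketch of the Needleman–Wunsch recurrence. The scoring values (match, mismatch, and a linear gap penalty) are illustrative assumptions; practical tools use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns the two sequences by opening a single gap opposite the unmatched `C`.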

In threading, since the sequence identity between the best available template and the query protein falls below 30%, it is hard to identify templates using straightforward sequence–sequence alignment algorithms alone. Hence, 1D profiles of local structural features are used to represent a template’s 3D structure, because these features are often more conserved than the amino acid identities themselves and can thus be used to identify and align proteins with similar structures but more distant sequence homology. A commonly used sequence profile is the position-specific scoring matrix (PSSM), which captures the amino acid tendencies at each position within the multiple sequence alignment (MSA). The PSSM is used iteratively to search a template database, aiming to identify distantly homologous templates for a specific protein sequence. One popular profile-based threading algorithm is MUSTER [70], which combines the following sequence and structural information into single-body terms in a dynamic programming search: (i) sequence profiles; (ii) secondary structures; (iii) structure fragment profiles; (iv) solvent accessibility; (v) dihedral torsion angles; and (vi) a hydrophobic scoring matrix. In addition to PSSMs, profile hidden Markov models (HMMs) are another type of sequence profile.
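To make the notion of a PSSM concrete, the following minimal sketch derives log-odds scores from the columns of an MSA. The uniform background frequency and the simple additive pseudocount are simplifying assumptions; production tools such as PSI-BLAST use sequence weighting and empirical background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {amino_acid: log2-odds score} dict per MSA column.

    Assumptions for illustration: all aligned sequences have equal length,
    the background distribution is uniform, and gaps ('-') are ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    length = len(msa[0])
    pssm = []
    for col in range(length):
        # Count the standard amino acids observed in this column.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive scores mark residues enriched relative to background.
            column_scores[aa] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm
```

In a toy alignment such as `["ACD", "ACE", "AC-", "GCD"]`, the fully conserved second column receives a positive score for `C` and negative scores for unobserved residues.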

Given the recent substantial improvements in contact and distance map prediction using deep learning, which will be discussed later, threading methods guided by these maps represent the cutting edge in fold recognition, achieving superior accuracy compared to general profile or profile HMM-based threading methods. Among these approaches, EigenTHREADER [71] utilized the eigen decomposition of contact maps to derive the primary eigenvectors, which were used for aligning the template and query contact maps. CEthreader [72], employing a similar eigen decomposition strategy, outperformed pure contact map-based threading methods by integrating data from local structural feature prediction and sequence-based profiles. map_align [21], on the other hand, introduced an iterative dual dynamic programming technique to align contact maps, while DeepThreader [73] leveraged predicted distance maps to establish alignments. Most recently, DisCovER [74] integrated deep learning-predicted distance and orientation into the threading method by generating alignments through an iterative double dynamic programming framework.

Furthermore, deep learning-based methods have been directly applied to recognize distant homology templates. Cutting-edge methods, such as ThreaderAI [75] and SAdLSA [76], frame the task of aligning a query sequence with a template as the classical pixel classification problem in computer vision, which allows for the integration of a deep residual neural network [77] into fold recognition. More recently, the application of language models, originally developed for text classification and generative tasks, to protein sequences marks a significant advancement in the bioinformatics field. Protein language models (PLMs) are a type of neural network trained in a self-supervised manner on an extensive number of protein sequences [78,79]. Once trained, PLMs can be used to rapidly generate high-dimensional embeddings on a per-residue level, which can be viewed as the “semantic meaning” of each amino acid within the context of the full protein sequence. Such representations have proven invaluable in identifying distant homologous relationships between proteins.

Once the templates are identified and aligned with the query protein, the subsequent step involves building a model by replicating and refining the structure of the template. The most widely used method is MODELLER [16], which constructs tertiary structure models by optimally satisfying spatial constraints extracted from the template alignments, along with other general structural constraints, such as ideal bond lengths, bond angles, and dihedral angles.

With the development of computational techniques, methods have been proposed to convert alignments directly into 3D models. A notable example is I-TASSER [80,81,82], an extension of TASSER [28]. In this method, continuous fragments were extracted from the aligned regions of multiple threading templates identified by LOMETS and then reassembled during structure assembly simulations. I-TASSER combined constraints derived from template alignments with a set of knowledge-based energy terms, including hydrogen bonding, secondary structure formation, and side-chain contact formation, to guide the Replica Exchange Monte Carlo (REMC) simulation. After clustering low-energy decoys and selecting the centroid of the most favorable cluster, the centroid was compared against the PDB to identify additional templates. The constraints from these new templates, combined with those from the initial cluster model and threading templates, as well as the intrinsic knowledge-based potentials, were employed to direct a subsequent round of structure assembly simulations. The lowest energy structure was then selected and subjected to full-atom refinement. Since its debut in CASP7, I-TASSER has consistently achieved top rankings among automated protein structure prediction servers in subsequent CASP experiments [66].
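The replica exchange step mentioned above can be illustrated with the generic textbook swap criterion between two replicas running at different temperatures. This is a minimal sketch of the standard acceptance rule, not I-TASSER’s actual implementation; the function name and the reduced units (Boltzmann constant set to 1) are assumptions for illustration.

```python
import math


def swap_probability(energy_i, temp_i, energy_j, temp_j, k_b=1.0):
    """Probability of exchanging the conformations of replicas i and j.

    Standard replica-exchange rule:
        P = min(1, exp((beta_i - beta_j) * (E_i - E_j))), beta = 1 / (k_b * T).
    Reduced units (k_b = 1) are assumed by default.
    """
    beta_i = 1.0 / (k_b * temp_i)
    beta_j = 1.0 / (k_b * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    if delta >= 0:
        # The swap moves the lower-energy conformation to the colder replica.
        return 1.0
    return math.exp(delta)
```

A swap is always accepted when it hands the lower-energy conformation to the colder replica. Otherwise it is accepted with an exponentially decreasing probability, which lets conformations trapped at low temperature escape through the hotter replicas.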

2.2. Fragment Assembly Simulation Methods for Free Modeling (FM)

Theoretically, all-atom molecular dynamics (MD) simulations are able to predict protein structures if the computer is powerful enough. However, modern MD simulations can only deal with proteins of fewer than ~100 amino acids in size. Thus, 90% of natural proteins cannot be predicted because of the required computational complexity [83]. Hence, an alternative approach, namely free modeling (FM), was proposed to model protein structures. Compared to MD simulations, FM methods employ coarse-grained protein elements and physics- or knowledge-based energy functions, together with extensive sampling procedures, to construct protein structure models from scratch. In contrast to TBM methods, they do not depend on global templates. Hence, they are commonly referred to as ab initio or de novo modeling approaches [17,19]. State-of-the-art FM methods have evolved to assemble protein fragments [84]. These fragment assembly techniques assume that protein fragments extracted from the PDB cover most of the local conformations encountered in protein folding, which sharply narrows the sampling space. Their implementation involves generating a set of fixed-length (9 residues) and variable-length (15–25 residues) fragments from a repository of known 3D structures (as shown in Figure 2). These fragments are subsequently linked, rotated, and scored to find the global minimum state. Fragment assembly thus reduces the conformational space to be explored while ensuring that local structures within the assembled fragments are well formed.

Figure 2. Illustration of free modeling (FM) methods. Starting from a query sequence, local fragments are identified from databases of solved protein structures, using profile-based threading methods. These fragments are subsequently utilized to construct full-length structural models, guided by physics- or knowledge-based energy potentials.

Rosetta, developed by David Baker’s group and first released in 1997, is one of the best-known FM methods [17]. Rosetta utilized a three- and nine-residue fragment database for assembly. Specifically, the fragments were selected by quantifying the profile–profile and secondary structure similarity between the query sequence and the fragment database within a defined window size. The fragments were simplified to backbone atoms and side-chain centers and then assembled by simulated annealing Monte Carlo simulations, in which the backbone torsion angles were exchanged with those of one of the highly scored fragments in the database.

2.3. Contact-Based Protein Structure Prediction

A contact map for a protein of length L is defined as a symmetric, binary L × L matrix. Each element (i, j) of the matrix is a binary value indicating whether residues i and j form a contact (Cβ–Cβ distance, or Cα for glycine, < 8 Å) or not. Since the concept of contact was first introduced, many attempts have been made to predict contacts based on correlated mutations in MSAs [85,86,87]. The hypothesis behind these approaches was that residue pairs that are in contact in 3D space would exhibit correlated mutation patterns, also known as co-evolution, because there is evolutionary pressure to conserve the structures of proteins.
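The definition above translates directly into code. The following minimal sketch builds a binary contact map from a list of representative-atom coordinates (Cβ, or Cα for glycine), using the 8 Å threshold given in the text. Leaving the diagonal at zero is a convention assumed here, since self-contacts carry no information.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric binary L x L contact map.

    `coords` holds one (x, y, z) tuple per residue, in angstroms, for the
    representative atom (C-beta, or C-alpha for glycine). Diagonal entries
    are left at 0 by convention (an assumption of this sketch).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = 1  # enforce symmetry
    return cmap
```

For four residues spaced 3.8 Å apart along a line, neighbors one and two positions away are in contact (3.8 Å and 7.6 Å), while the first and last residues (11.4 Å) are not.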

In the early 2010s, an increasing number of predictors began integrating deep learning architectures into their prediction methods. A breakthrough occurred in 2017, when Xu’s group introduced RaptorX-Contact [22], which revolutionized contact prediction by integrating deep residual convolutional neural networks (ResNets [77]). A residual neural network adds an identity mapping of the input to the output of the convolutional layer, facilitating smoother gradient flow from deeper to shallower layers and enabling the training of deep networks with numerous layers. The introduction of deep ResNets, consisting of approximately 60 hidden layers, enabled RaptorX-Contact to significantly outperform other methods [66].

Owing to the latest advances in residue–residue contact prediction, contact-guided protein structure prediction methods have been developed and are becoming increasingly successful. The idea of contact-based protein structure prediction methods is described in Figure 3. Starting from a query sequence, an MSA is first generated by searching through databases. The MSA is then used as the input for deep learning methods to predict a contact map. Finally, the contact potential derived from the predicted contact map is used in a folding simulation to predict the final model.

Figure 3. Illustration of contact-based protein structure prediction methods. Starting from a query sequence, an MSA is first generated by searching through databases. The MSA is then used as the input of deep learning methods to predict a contact map. Finally, the contact potential derived from the predicted contact map is used in a folding simulation to predict the final model.

2.4. Distance-Based Protein Structure Prediction

Distance map prediction is a more detailed extension of contact map prediction. The distinction is that contact map prediction entails binary classification, whereas distance map prediction generally estimates the likelihood of the distance between residues falling within various bins (although attempts have been made to directly predict real-valued distances [88]). Distance map prediction gained significant prominence in the field during CASP13 in 2018, when RaptorX-Contact [22], DMPfold [89], and AlphaFold [90] extended the application of deep ResNets from contact prediction to distance prediction. Among these predictors, AlphaFold, created by Google DeepMind, exhibited superior performance in tertiary structure modeling, ranking first among all groups in CASP13. Leveraging co-evolutionary coupling information extracted from an MSA, AlphaFold employed a deep residual neural network, comprising 220 residual blocks, to predict the distance map for a target sequence, which was subsequently used to assemble protein models. Figure 4 shows the basic steps of distance-based protein structure prediction methods.
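The relationship between binned distance prediction and binary contact prediction can be sketched as follows. The binning scheme (0.5 Å bins from 2 to 20 Å plus one overflow bin) is a hypothetical choice for illustration; individual predictors define their own bin edges.

```python
def distance_to_bin(distance, d_min=2.0, d_max=20.0, width=0.5):
    """Map a distance in angstroms to a bin index.

    Hypothetical scheme: regular bins of `width` covering [d_min, d_max),
    followed by one overflow bin for distances >= d_max. Distances below
    d_min fall into bin 0.
    """
    n_regular = round((d_max - d_min) / width)
    if distance >= d_max:
        return n_regular  # overflow bin
    if distance <= d_min:
        return 0
    return min(int((distance - d_min) / width), n_regular - 1)


def contact_probability(bin_probs, cutoff=8.0, d_min=2.0, width=0.5):
    """Collapse a predicted distance distribution into a contact probability.

    Sums the probability mass of all bins lying entirely below `cutoff`,
    under the same hypothetical binning as `distance_to_bin`.
    """
    n_contact_bins = int((cutoff - d_min) / width)
    return sum(bin_probs[:n_contact_bins])
```

Under this scheme, a predicted distribution over 37 bins reduces to a contact probability by summing its first 12 bins, which shows why a distance map carries strictly more information than a contact map.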

Figure 4. Illustration of distance-based protein structure prediction methods. Starting from a query sequence, an MSA is first generated by searching through databases. Then, the MSA is fed into deep neural networks to predict spatial restraints, such as distance maps, inter-residue orientations, and hydrogen bond networks. Finally, the final structural model is constructed by employing the potentials extracted from the predicted spatial restraints in a folding simulation to identify the lowest energy structure.

2.5. End-to-End Protein Structure Prediction

AlphaFold2 achieved remarkable modeling accuracy and substantially addressed the challenge of predicting the structures of single-domain proteins in CASP14 [91]. The success of AlphaFold2 can be attributed, in part, to its unique “end-to-end” learning approach. This end-to-end learning approach eliminates the need for complex folding simulations, allowing deep neural networks, such as 3D equivariant transformers in AlphaFold2, to predict structural models directly.

AlphaFold2 adopted a novel architecture that is quite different from those of previous methods, including the first version of AlphaFold, to accomplish end-to-end structure prediction. The architecture of AlphaFold2 includes the following two primary components: the Trunk Module, which utilizes self-attention transformers to process input data consisting of the query sequence, templates, and MSA; and the Structure (or Head) Module, which employs 3D rigid body frames to directly generate 3D structures from the representations produced by the trunk [92].

Despite its breakthrough in accuracy and performance, AlphaFold2 has notable limitations, such as run times that increase with protein length. To address these challenges, several faster artificial intelligence-driven protein folding tools, based on AlphaFold2, have been developed [93,94,95]. For example, ColabFold [93] improved the speed of protein structure prediction by integrating the efficient homology search of MMseqs2 (Many-against-Many sequence searching) [96] with AlphaFold2 [92]. OpenFold [94], a trainable and open-source implementation of AlphaFold2 using PyTorch [97], achieved enhanced computational efficiency with reduced memory usage, thereby facilitating the prediction of exceedingly long proteins on a single GPU. Similarly, Uni-Fold [95] redeveloped AlphaFold2 within the PyTorch framework and reproduced its original training process on a larger set of training data, achieving comparable or superior accuracy and faster speed. Collectively, these developments represent significant strides in enabling rapid and accurate predictions of protein structures.

2.6. Protein Language Model-Based Protein Structure Prediction

Since CASP14, AlphaFold2 has raised structure prediction performance to new heights, nearly comparable to the accuracy of experimental determination methods. Standard protein structure prediction pipelines rely heavily on co-evolution information from MSAs. However, this dependence on MSAs often acts as a bottleneck in various protein-related problems. While model inference in the structure prediction pipeline typically takes a few seconds, the MSA construction step is time-intensive, consuming tens of minutes per protein. This time-consuming process significantly hampers tasks requiring high-throughput requests, like protein design [98]. A large-scale protein language model (PLM) presents an alternative avenue to MSAs for acquiring co-evolutionary knowledge, facilitating MSA-free predictions. In contrast to MSA-based methods, wherein information retrieval techniques explicitly capture co-evolutionary details from protein sequence databases, PLM-based methods embed co-evolutionary information into the large-scale model parameters during training, and allow for implicit retrieval through model inference, wherein the PLM is viewed as a repository of protein information. Furthermore, MSA-based approaches have lower efficiency in information retrieval, relying on manually designed retrieval schemes. Inspired by the progress of PLMs and AlphaFold2, many protein structure prediction methods have been proposed. For example, ESMFold [79], developed by Meta AI, used the information and representations learned by a PLM called ESM-2 to perform end-to-end 3D structure prediction using only a single sequence as input. ESMFold demonstrated accuracy comparable to AlphaFold2 and RoseTTAFold for sequences with low perplexity, that is, sequences that are well understood by the PLM. 
Notably, ESMFold’s inference speed was ten times faster than that of AlphaFold2, thereby facilitating efficient exploration of the structural landscape of proteins within practical time frames. OmegaFold [99] predicted high-resolution protein structures from a single primary sequence alone, using a combination of a PLM and a geometry-inspired transformer model trained on protein structures. It does not rely on MSAs or on known structures as templates. Similar to ESMFold, OmegaFold also runs roughly ten times faster than MSA-based methods, such as AlphaFold2 and RoseTTAFold. HelixFold-Single [98] was an end-to-end MSA-free protein structure prediction pipeline that combined a large-scale PLM with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trained a large-scale PLM with thousands of millions of primary structures, utilizing the self-supervised learning paradigm, and then obtained an end-to-end differentiable model to predict 3D structures by combining the pre-trained PLM and the essential components of AlphaFold2. EMBER3D [100] predicted 3D structure directly from single sequences by computing both 2D (distance maps) and 3D structure (backbone coordinates) from sequences alone, based on embeddings from the pre-trained PLM called ProtT5. EMBER3D was orders of magnitude faster than its counterparts, enabling the prediction of average-length structures in mere milliseconds, even on consumer-grade machines.

2.7. Multi-Domain Protein Structure Prediction

Since the advent of AlphaFold2 in CASP14, great progress has been made in protein structure prediction. However, AlphaFold2 and most of the subsequent state-of-the-art methods have mainly focused on the modeling of single-domain proteins; domains are the minimal units of proteins that fold and function independently. Notably, several of the CASP14 targets, especially large multi-domain targets, were not predicted with high accuracy, suggesting that further improvements are needed for multi-domain prediction [101]. A common approach to multi-domain protein structure modeling is to split the query sequence into domains and generate models for each individual domain separately. The individual domain models are subsequently assembled into full-length models, usually under the guidance of other homologous multi-domain proteins from the PDB. Such domain assembly methods can be divided into the following two categories: linker-based domain assembly and inter-domain rigid body docking. Linker-based methods, such as Rosetta [102] and AIDA [103], primarily focus on the construction of linker models by exploring the conformational space, with domain orientations loosely constrained by physical potentials from generic hydrophobic interactions. Docking-based methods, such as DEMO [104,105] and SADA [106], assemble the single-domain structures via rigid body docking, which is essentially a template-based approach that guides domain assembly by detecting available templates.

2.8. CASP and Most Recent CASP Results

The Critical Assessment of Protein Structure Prediction (CASP) was established in 1994 by Professor John Moult and others from the University of Maryland, and has taken place every other year since then [107]. Its purpose is to provide an objective evaluation of protein structure prediction technologies. Employing a rigorous double-blind prediction mechanism, it is viewed as the gold standard for assessing protein structure prediction techniques and is regarded in the field as the “Olympics of protein structure prediction”. In order to fairly evaluate protein structure prediction methods, CASP assessors have designed and incorporated multiple measures. Two widely used evaluation measures in CASP are the TM-score and the global distance test score (GDT score). The TM-score between the model and the experimental structure is usually used to assess the global quality of a structural model [108]. The TM-score ranges between 0 and 1, with TM-scores > 0.5 indicating that the model and the experimental structure share the same fold as defined in SCOP/CATH [109]. The GDT score is calculated as GDT = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8)/4, where GDT_Pn indicates the percentage of residues that lie within a distance cut-off of n Å [110]. The GDT score primarily focuses on assessing the backbone modeling quality of a protein. With the substantial enhancement in prediction accuracy witnessed since the advent of AlphaFold2 in CASP14, an increasing number of measures for assessing side-chain modeling quality have been introduced. According to the rules of CASP, all participating methods are categorized into the following two groups: server-based and human-based. Participants in the server-based group have a limited window of 72 h for structure prediction, while those in the human-based group are allotted 3 weeks, allowing for manual intervention. 
Because the server-based group relies solely on automated computer predictions, the competitive difficulty in this category is often higher than in the human-based group. Starting from CASP7, the proteins modeled during CASP have been classified as TBM, TBM-easy, TBM-hard, FM/TBM, or FM, depending on the availability and quality of PDB templates for each target, where TBM-easy targets have readily identifiable, high-quality templates, and FM targets typically lack homologous templates in the PDB. Starting from CASP12, protein complex prediction has been included in CASP as an independent assessment category, called the protein assembly category. Protein complex modeling is distinguished from classical protein–protein docking, where two protein subunits, named the ligand and the receptor, are in contact through a single interface. In the CASP protein assembly assessment, predictions of full-length protein complexes involve predictions of both individual protein–protein interfaces and overall complex topology. Starting from CASP13, deep learning techniques have achieved significant breakthroughs, markedly enhancing the accuracy of protein tertiary structure prediction. In CASP13, distance map prediction began to play a pivotal role in guiding protein structure prediction. Notable examples include RaptorX-Contact [22], DMPfold [89], and AlphaFold [90], which extended deep residual networks (ResNets) from contact prediction to distance prediction, significantly boosting predictive modeling performance. In particular, AlphaFold, developed by Google DeepMind, was ranked as the top method in tertiary structure modeling among all groups in CASP13. However, the majority of other groups continued to rely on contact prediction information for guiding protein structure prediction. 
Owing to the remarkable accuracy of deep learning-based contact map predictions, even contact-based protein structure prediction methods achieved excellent performance. The effectiveness of distance prediction, as demonstrated in CASP13, has led to its widespread application in various structure prediction methodologies. A promising example is trRosetta [25,111], which employed a deep residual neural network to predict both pairwise residue distances and inter-residue orientations for guiding protein structure prediction. Inspired by trRosetta, numerous groups in CASP14 incorporated orientation and distance constraints predicted by deep residual neural networks into their protein structure prediction processes. Among these methods, D-I-TASSER [112] and D-QUARK [112] were two top CASP14 servers from Yang Zhang’s group. D-I-TASSER, in particular, leveraged deep learning-based hydrogen bond network prediction to guide protein structure prediction, significantly improving modeling accuracy for CASP14 targets, especially those lacking homologous templates [112].
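The GDT formula given above can be illustrated with a minimal sketch. It assumes that the model has already been superimposed on the experimental structure and that the per-residue deviations are supplied as input. The official GDT procedure additionally searches over many superpositions to maximize the residue count at each cut-off, which this simplified version does not attempt.

```python
def gdt_score(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cut-off.

    `distances` are per-residue deviations (angstroms) between model and
    experimental structure under ONE fixed superposition (a simplifying
    assumption; the real GDT optimizes the superposition per cut-off).
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    n = len(distances)
    # GDT_Pn: percent of residues with deviation <= n angstroms.
    percents = [100.0 * sum(d <= c for d in distances) / n for c in cutoffs]
    return sum(percents) / len(cutoffs)
```

For deviations of 0.5, 1.5, 3.0, 6.0, and 10.0 Å, the four percentages are 20, 40, 60, and 80, so the score is their mean.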

2.9. AlphaFold Protein Structure Database (AlphaFold DB)

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk, accessed on 10 December 2023), created in partnership between DeepMind and the EMBL-European Bioinformatics Institute (EMBL-EBI), is a freely accessible database of high-accuracy protein structure predictions for the scientific community [113]. Powered by AlphaFold2 of Google DeepMind, AlphaFold DB provides highly accurate protein structure predictions, competitive with experimental structures. The latest AlphaFold DB release contains over 200 million entries, providing broad coverage of UniProt [114], which is the standard repository of protein sequences and annotations. AlphaFold DB provides individual downloads for the human proteome and for the proteomes of 47 other key organisms important in research and global health. AlphaFold DB also provides a download for the manually curated subset of UniProt. The prediction results of AlphaFold DB can be accessed through several mechanisms, as follows: (i) bulk downloads (up to 23 TB) via FTP; (ii) programmatic access via an application programming interface (API); and (iii) download and interactive visualization of individual predictions on protein-specific web pages keyed on UniProt accessions. The AlphaFold DB’s release of a multitude of novel protein structures has provided bioinformaticians across the globe with a rich repository of data. Developers specializing in protein structure analysis tools are leveraging this influx of accurate models, leading to numerous significant breakthroughs in protein-related fields. For example, the AlphaFold DB, through its accurate prediction of protein structures, offers a robust foundation for understanding how different ligands might interact with various proteins, which is pivotal in identifying potential drug targets, aiding in the design of novel pharmaceuticals, and contributing to a broader understanding of biological functions. 
In this context, several methods have been developed. AlphaFill, for instance, was developed to enrich the models in the AlphaFold DB by “transplanting” ligands, co-factors, and ions, based on sequence and structure similarity [115]. While the AlphaFold DB has significantly expanded the application and scalability of tools and algorithms for protein-related analyses, effectively analyzing more than a couple of hundred thousand protein structures or models poses a challenge. There is a pressing need to develop novel approaches capable of managing the unanticipated and rapid growth of available models. Notably, state-of-the-art tools such as FoldSeek [116] and 3D-AF-Surfer [117] have already been developed, aiding researchers in searching through extensive repositories of protein structures to identify hits with structural similarity to a provided input structure. Leveraging high-throughput structural similarity searches facilitates classification problems, such as assigning structural CATH domains to AlphaFold models [118].
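Programmatic access to AlphaFold DB keyed on a UniProt accession can be sketched as follows. The endpoint path (`/api/prediction/<accession>`) and the assumption that the service returns JSON reflect the public API as understood at the time of writing; both are assumptions that should be checked against the current AlphaFold DB documentation, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumed API root; verify against the current AlphaFold DB documentation.
API_ROOT = "https://alphafold.ebi.ac.uk/api"


def prediction_url(accession):
    """Build the (assumed) prediction endpoint URL for a UniProt accession."""
    accession = accession.strip().upper()
    if not accession.isalnum():
        raise ValueError(f"unexpected characters in accession: {accession!r}")
    return f"{API_ROOT}/prediction/{accession}"


def fetch_prediction(accession, timeout=30):
    """Download and parse the prediction metadata (requires network access).

    The structure of the returned JSON is not assumed here; inspect it
    before relying on specific field names.
    """
    with urllib.request.urlopen(prediction_url(accession), timeout=timeout) as response:
        return json.load(response)
```

Separating URL construction from the network call keeps the accession validation easy to check offline, for example with `prediction_url("P69905")`.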