Protein Tertiary Structure Prediction

Protein Tertiary Structure Prediction: Comparison

Please note this is a comparison between Version 2 by Lindsay Dong and Version 1 by Wei Zheng.

The prediction of three-dimensional (3D) protein structure from amino acid sequences has stood as a significant challenge in computational and structural bioinformatics for decades. Recently, tThe widespread integration of artificial intelligence (AI) algorithms has substantially expedited advancements in protein structure prediction, yielding numerous significant milestones. In particular, the end-to-end deep learning method AlphaFold2 has facilitated the rise of structure prediction performance to new heights, regularly competitive with experimental structures in the 14th Critical Assessment of Protein Structure Prediction (CASP14).

AlphaFold2
contact map
deep learning
distance map
end-to-end methods
multi-domain proteins
protein language model

1. Introduction

Proteins are macromolecules that play important roles in facilitating the essential functions vital for life’s sustenance. Their pivotal involvement spans a diverse array—providing structural support to cells, safeguarding the immune system, catalyzing crucial enzymatic reactions, orchestrating cellular signal transmission, regulating the intricate processes of transcription and translation, and encompassing the synthesis and breakdown of biomolecules. Moreover, they contribute significantly to the regulation of developmental processes, biological pathways, and the constitution of protein complexes and subcellular structures. These diverse and remarkable functions originate from their distinct three-dimensional (3D) structures, which vary across different protein molecules. Since Anfinsen showed that the tertiary structure of a protein is determined by its amino acid sequence in 1973 ^[1], understanding the protein sequence–structure–function paradigm has emerged as a fundamental cornerstone within modern biomedical studies. Due to significant efforts in genome sequencing over the last few decades [2^[2][3][4],3,4], the number of known amino acid sequences deposited in UniProt ^[5] has grown to over 250 million. Despite the impressive number of data, the amino acid sequences themselves only offer limited insights into the biological functions of individual proteins, as these functions are primarily determined by their three-dimensional structures.

Some of the most widely used experimental techniques for determining protein structures include X-ray crystallography ^[6], NMR spectroscopy ^[7], and cryo-electron microscopy ^[8]. Despite their accuracy, the considerable human involvement and substantial expenses involved in experimentally resolving a protein’s structure have hindered advancement in the number of solved protein structures. Consequently, the expansion in solved protein structures has considerably trailed the accumulation of protein sequences. At present, the Protein Data Bank ^[9] (PDB) contains structures for approximately 0.21 million proteins, accounting for less than 0.1% of the total sequences cataloged in the UniProt database ^[10]. This disparity highlights the ever-widening gap between known protein sequences and experimentally solved protein structures. Nevertheless, owing to substantial collective efforts within the scientific community in recent decades [11^{[11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]},12,13,14,15,16,17,18,19,20,21,22,23,24,25], computational approaches have made remarkable progress, through which an increasing fraction of sequences in various organisms have had their tertiary structures reliably modeled [26,27,28,29,30,31,32,33,34,35,36,37,38,39]^{[26][27][28][29][30][31][32][33][34][35][36][37][38][39]}. For example, the first version of AlphaFold demonstrated exceptional predictive capabilities in protein structure prediction by employing the deep learning-based distance map prediction during the 13th Critical Assessment of Protein Structure Prediction (CASP13). Furthermore, with the utilization of the end-to-end deep learning approach, the AlphaFold2 has facilitated the rise of structure prediction performance to new heights, regularly competitive with experimental structures in CASP14. These methodologies have significantly contributed to diverse biomedical investigations, including structure-based protein function annotation [40,41,42^{[40][41][42][43][44]},43,44], mutation analysis [45,46,47^{[45][46][47][48][49][50][51][52]},48,49,50,51,52], ligand screening [53^{[53][54][55][56][57][58][59]},54,55,56,57,58,59], and drug discovery [60,61,62,63,64,65]^{[60][61][62][63][64][65]}.

2. Protein Structure Prediction

2.1. Template-Based Modeling (TBM) Methods

Template-based modeling (TBM) methods have emerged as pivotal approaches in the realm of computational biology for predicting protein structures. TBM leverages known protein structures, referred to as templates, from the PDB to predict the structure of an unknown protein (target), assuming that the target shares a significant degree of sequence similarity with the template. As shown in Figure 21, TBM methods usually consist of the following four steps: (i) identifying templates related to the protein of interest, (ii) aligning the query protein with the templates, (iii) building the initial structural framework by replicating the aligned regions, and (iv) constructing the unaligned regions and refining the structure. TBM can be classified as homology modeling (comparative modeling), which is often employed when there is substantial sequence identity—typically 30% or greater—between the template and the protein of interest, and threading (fold recognition), which is used when the sequence identity drops below the 30% threshold ^[66].

Figure 21. Illustration of template-based modeling (TBM) methods. Starting from a query sequence, templates are identified from Protein Data Bank (PDB) and subsequently aligned with the query protein sequence. Then, the final structural model is constructed by replicating the aligned regions and refining the unaligned regions.

In homology modeling, high-quality templates are detected and aligned using straightforward sequence–sequence alignment algorithms, such as dynamic programming-based techniques like the Needleman–Wunsch ^[67] algorithm for global alignment and the Smith–Waterman ^[68] algorithm for local alignment. BLAST ^[69] is another widely used tool to identify templates and generate alignments, which initially identified short matches between the query and template, and then extended these matches to generate alignments.

In threading, since the sequence identity between the best available template and the query protein falls below 30%, it is hard to identify templates simply based on straightforward sequence–sequence alignment algorithms. Hence, the 1D profile of local structural features is used to represent a template’s 3D structure, because they are often more conserved than the amino acid identities themselves and, thus, can be used to identify and align proteins with similar structures but more distant sequence homology. A commonly used sequence profile is the Position-specific Scoring Matrix (PSSM), which captures the amino acid tendencies at each position within the multiple sequence alignment (MSA). The PSSM is iteratively employed to search through a template database, aiming to identify distantly homologous templates for a specific protein sequence. One popularly used profile-based threading algorithm is MUSTER ^[70], which combines various sequence and structural information into single-body terms in a dynamic programming search, as follows: (i) sequence profiles; (ii) secondary structures; (iii) structure fragment profiles; (iv) solvent accessibility; (v) dihedral torsion angles; and (vi) hydrophobic scoring matrix. In addition to PSSMs, profile hidden Markov models (HMMs) are another type of sequence profile.

Given the recent substantial improvements in contact and distance map prediction using deep learning, which will be discussed later, threading methods guided by these maps represent the cutting edge in fold recognition, achieving superior accuracy compared to general profile or profile HMM-based threading methods. Among these approaches, EigenTHREADER [73]^[71] utilized the eigen decomposition of contact maps to derive the primary eigenvectors, which were used for aligning the template and query contact maps. CEthreader [74]^[72], employing a similar eigen decomposition strategy, outperformed pure contact map-based threading methods by integrating data from local structural feature prediction and sequence-based profiles. map_align ^[21], on the other hand, introduced an iterative dual dynamic programming technique to align contact maps, while DeepThreader [75]^[73] leveraged predicted distance maps to establish alignments. Most recently, DisCovER [76]^[74] integrated deep learning-predicted distance and orientation into the threading method by generating alignments through an iterative double dynamic programming framework.

Furthermore, deep learning-based methods have been directly applied to recognize distant homology templates. The cutting-edge methods, such as ThreaderAI [81]^[75] and SAdLSA [82]^[76], conceptualize the task of aligning query sequence with template as the classical pixel classification problem in computer vision, which allows for the integration of a deep residual neural network [83]^[77] into fold recognition. More recently, the application of language models, originally developed for text classification and generative tasks, to protein sequences marks a significant advancement in the bioinformatics field. Protein language models (PLMs) are a type of neural network with self-supervised training on an extensive number of protein sequences [84,85]^[78][79]. Once trained, PLMs can be used to rapidly generate high-dimensional embeddings on a per-residue level, which can be viewed as a “semantic meaning” of each amino acid within the context of the full protein sequence. Such representations have proven invaluable in identifying distant homologous relationships between proteins.

Once the templates are identified and aligned with the query proteins, the subsequent step involves building a model by replicating and refining the structure of the template. The most widely used method was MODELLER ^[16], which constructed tertiary structure models by optimally satisfying spatial constraints extracted from the template alignments, along with other general structural constraints, such as ideal bond lengths, bond angles, and dihedral angles.

With the development of computational techniques, some methods are proposed to convert alignments directly into 3D models. A notable example is I-TASSER [91^[80][81][82],92,93], an extension of TASSER ^[28]. This method utilized a process wherein continuous fragments were extracted from the aligned regions of multiple threading templates identified by LOMETS. These fragments were reassembled during structure assembly simulations. I-TASSER incorporated constraints derived from template alignments and a set of knowledge-based energy terms. These energy terms included hydrogen bonding, secondary structure formation, and side-chain contact formation. The integration of these components was used to guide the Replica Exchange Monte Carlo (REMC) simulation. After clustering low-energy decoys and selecting the centroid of the most favorable cluster, the centroid was compared against the PDB to identify additional templates. The constraints from these new templates, combined with those from the initial cluster model and threading templates, as well as the intrinsic knowledge-based potentials, were employed to direct a subsequent round of structure assembly simulations. The lowest energy structure was selected, which was then subjected to full-atom refinement. Since its first emergence in the CASP7, I-TASSER has consistently achieved top rankings among automated protein structure prediction servers in subsequent CASP experiments ^[66].

2.2. Fragment Assembly Simulation Methods for Free Modeling (FM)

Theoretically, all-atom molecular dynamics (MD) simulations are able to predict protein structures if the computer is powerful enough. However, modern MD simulations can only deal with proteins of less than ~100 amino acids in size. Thus, 90% of the natural proteins cannot be predicted because of the required computational complexity [95]^[83]. Hence, an alternative method, namely free modeling (FM), was proposed to model protein structures. Compared to MD simulations, FM methods employ the coarse-grained protein elements and physics- or knowledge-based energy functions, together with extensive sampling procedures, to construct protein structure models from scratch. In contrast to TBM methods, they do not depend on global templates. Hence, they are commonly referred to as ab initio or de novo modeling approaches [17,19]^[17][19]. State-of-the-art FM methods have evolved to assemble protein fragments [96]^[84]. These fragment assembly techniques assume that protein fragments extracted from the PDB covered most of the conformation of protein folding. Thus, the sampling space was sharply narrowed down. Their implementation involves generating a set of fixed-length (9 residues) and variable-length (15–25 residues) fragments from a repository of known 3D structures (as shown in Figure 32). These fragments are subsequently linked, rotated, and scored to find the global minimum state. This methodology of fragment assembly serves to reduce the exploration of conformational space while ensuring the coherent formation of local structures within the assembled fragments.

Figure 32. Illustration of free modeling (FM) methods. Starting from a query sequence, local fragments are identified from databases of solved protein structures, using profile-based threading methods. These fragments are subsequently utilized to construct full-length structural models, guided by physics- or knowledge-based energy potentials.

The first version of Rosetta modeling software, released in 1997, is one of the most well-known FM methods developed by David Baker’s group ^[17]. Rosetta utilized a three- and nine-residue fragment database for assembly. Particularly, the fragments were selected by quantifying the profile–profile and secondary structure similarity between the query sequence and fragment database within a defined window size. The fragments were simplified to backbone atoms and side-chain centers, and subsequently conducted by simulated annealing Monte Carlo simulations, which exchanged the backbone torsion angles with those of one of the highly scored fragments in the database.

2.3. Contact-Based Protein Structure Prediction

A contact map for a protein of length L is defined as a symmetric, binary L × L matrix. Each element in the matrix represents a binary value, signifying whether the residues form a contact (Cβ-Cβ distance (Cα for glycine) < 8 Å) or not. Since the concept of contact was first brought up, many attempts were made to predict contacts based on correlated mutations in MSAs [97,98,99]^[85][86][87]. The hypothesis behind these approaches was that residue pairs that are in contact in 3D space would exhibit correlated mutation patterns, also known as co-evolution, because there is evolutionary pressure to conserve the structures of proteins.

n the early 2010s, an increasing number of predictors began integrating deep learning architectures into their prediction methods. A breakthrough occurred in 2017, when Xu’s group introduced RaptorX-Contact ^[22], which revolutionized contact prediction by integrating deep residual convolutional neural networks (ResNets [83]^[77]). A Residual Neural Network incorporates an identity map of the input to the output of the convolutional layer, facilitating smoother gradient flow from deeper to shallower layers and enabling training of deep networks with numerous layers. RaptorX-Contact’s utilization of deep ResNets, featuring approximately 60 hidden layers, led to a significant performance leap, outstripping other methods ^[66]. The introduction of deep ResNets, consisting of approximately 60 hidden layers, enabled RaptorX-Contact to significantly outperform other methods ^[66].

Due to the latest advances in residue–residue contact prediction, contact-guided protein structure prediction methods have been developed and are becoming more and more successful. The idea of contact-based protein structure prediction methods is described in Figure 43. Starting from a query sequence, an MSA is first generated by searching through databases. The MSA is then used as the input for deep learning methods to predict a contact map. Finally, the contact potential derived from the predicted contact map is used in a folding simulation to predict the final model.

Figure 43. Illustration of contact-based protein structure prediction methods. Starting from a query sequence, an MSA is first generated by searching through databases. The MSA is then used as the input of deep learning methods to predict a contact map. Finally, the contact potential derived from the predicted contact map is used in a folding simulation to predict the final model.

2.4. Distance-Based Protein Structure Prediction

From the definition of contact map prediction, a more detailed extension is distance map prediction. The distinction lies in contact map prediction entailing binary classification, whereas distance map prediction generally estimates the likelihood of the distance between residues falling within various bins (despite attempts made to directly predict real-value distances [104]^[88]). Distance map prediction gained significant prominence in the field during CASP13 in 2018, when RaptorX-Contact ^[22], DMPfold [105]^[89], and AlphaFold [106]^[90] extended the application of deep ResNets from contact prediction to distance prediction. Among these predictors, AlphaFold, created by Google DeepMind, exhibited superior performance in tertiary structure modeling, as it was ranked as the top one among all groups in CASP13. Leveraging co-evolutionary coupling information extracted from an MSA, AlphaFold employed a deep residual neural network, comprising 220 residual blocks, to predict the distance map for a target sequence, which was subsequently used to assemble protein models. Figure 54 shows the basic steps of distance-based protein structure prediction methods.

Figure 54. Illustration of distance-based protein structure prediction methods. Starting from a query sequence, an MSA is first generated by searching through databases. Then, the MSA is fed into deep neural networks to predict spatial restraints, such as distance maps, inter-residue orientations, and hydrogen bond networks. Finally, the final structural model is constructed by employing the potentials extracted from the predicted spatial restraints in a folding simulation to identify the lowest energy structure.

2.5. End-to-End Protein Structure Prediction

AlphaFold2 achieved remarkable modeling accuracy and substantially addressed the challenge of predicting the structures of single-domain proteins in CASP14 [109]^[91]. The success of AlphaFold2 can be attributed, in part, to its unique “end-to-end” learning approach. This end-to-end learning approach eliminates the need for complex folding simulations, allowing deep neural networks, such as 3D equivariant transformers in AlphaFold2, to predict structural models directly.

AlphaFold2 adopted a novel architecture that is quite different from those of previous methods, including the first version of AlphaFold, to accomplish end-to-end structure prediction. The architecture of AlphaFold2 includes the following two primary components: the Trunk Module, which utilizes self-attention transformers to process input data consisting of the query sequence, templates, and MSA; and the Structure (or Head) Module, which employs 3D rigid body frames to directly generate 3D structures from the training components [110]^[92].

Despite its breakthrough in accuracy and performance, AlphaFold2 has notable limitations, such as increased time consumption with longer protein lengths. To address these challenges, several faster artificial intelligence-driven protein folding tools, based on AlphaFold2, have been developed [111,112,113]^[93][94][95]. For example, ColabFold [111]^[93] improved the speed of protein structure prediction by integrating MMseqs2′s efficient homology search (Many-against-Many sequence searching) [114]^[96] with AlphaFold2 [110]^[92]. OpenFold [112]^[94], a trainable and open-source implementation of AlphaFold2 using PyTorch [115]^[97], achieved enhanced computational efficiency with reduced memory usage, thereby facilitating the prediction of exceedingly long proteins on a single GPU. Similarly, Uni-Fold [113]^[95] redeveloped AlphaFold2 within the PyTorch framework and reproduced its original training process on a larger set of training data, achieving comparable or superior accuracy and faster speed. Collectively, these developments represent significant strides in enabling rapid and accurate predictions of protein structures.

2.6. Protein Language Model-Based Protein Structure Prediction

AlphaFold2 has facilitated the rise of structure prediction performance to new heights, nearly comparable to the accuracy of experimental determination methods since CASP14. Standard protein structure prediction pipelines heavily rely on co-evolution information from MSAs. However, the excessive dependence on MSAs often acts as a bottleneck in various protein-related problems. While model inference in the structure prediction pipeline typically takes a few seconds, the MSA construction step is time-intensive, consuming tens of minutes per protein. This time-consuming process significantly hampers tasks requiring high-throughput requests, like protein design [119]^[98] A large-scale protein language model (PLM) presents an alternative avenue to MSAs for acquiring co-evolutionary knowledge, facilitating MSA-free predictions. In contrast to MSA-based methods, wherein information retrieval techniques explicitly capture co-evolutionary details from protein sequence databases, PLM-based methods embed co-evolutionary information into the large-scale model parameters during training, and allow for implicit retrieval through model inference, wherein the PLM is viewed as a repository of protein information. Furthermore, MSA-based approaches have lower efficiency in information retrieval, relying on manually designed retrieval schemes. Inspired by the progress of PLMs and AlphaFold2, many protein structure prediction methods have been proposed. For example, ESMFold [85]^[79], developed by Meta AI, used the information and representations learned by a PLM called ESM-2 to perform end-to-end 3D structure prediction using only a single sequence as input. ESMFold demonstrated comparable accuracy to AlphaFold2 and RoseTTAFold for sequences exhibiting low perplexity and thorough comprehension by PLM. Notably, ESMFold’s inference speed was ten times faster than that of AlphaFold2, thereby facilitating efficient exploration of the structural landscape of proteins within practical time frames. OmegaFold [121]^[99] predicted the high-resolution protein structure from a single primary sequence alone, using a combination of a PLM and a geometry-inspired transformer model, trained on protein structures. OmegaFold requires only a single amino acid sequence for protein structure prediction and does not rely on MSAs or known structures as templates. Similar to ESMFold, OmegaFold can also scale roughly ten times faster than MSA-based methods, such as AlphaFold2 and RoseTTAFold. HelixFold-Single [119]^[98] was an end-to-end MSA-free protein structure prediction pipeline that combined a large-scale PLM with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trained a large-scale PLM with thousands of millions of primary structures, utilizing the self-supervised learning paradigm, and then obtained an end-to-end differentiable model to predict 3D structures by combining the pre-trained PLM and the essential components of AlphaFold2. EMBER3D [122]^[100] predicted 3D structure directly from single sequences by computing both 2D (distance maps) and 3D structure (backbone coordinates) from sequences alone, based on embeddings from the pre-trained PLM called ProtT5. EMBER3D exhibited a speed that was orders of magnitude faster than its counterparts, enabling the prediction of average-length structures in mere milliseconds, even on consumer-grade machines.

2.7. Multi-Domain Protein Structure Prediction

Since the advent of AlphaFold2 in the recent CASP14, great progress has been made in protein structure prediction. However, AlphaFold2 and most of the subsequent state-of-the-art methods have mainly focused on the modeling of single-domain proteins, which are the minimum folding units of proteins that fold and function independently. Nonetheless, it is worth noting that several of the CASP14 targets, especially large multi-domain targets, were not predicted with high accuracy, suggesting that further improvements are needed for multi-domain prediction [123]^[101]. A common approach to multi-domain protein structure modeling is to split the query sequence into domains and generate models for each individual domain separately. The individual domain models are subsequently assembled into full-length models, usually under the guidance of other homologous multi-domain proteins from the PDB. Such domain assembling methods can be divided into the following two categories: linker-based domain assembly and inter-domain rigid body docking. Linker-based methods, such as Rosetta [125]^[102] and AIDA [126]^[103], primarily focus on the construction of linker models by exploring the conformational space, with domain orientations loosely constrained by physical potential from generic hydrophobic interactions. Docking-based methods, such as DEMO [127,128]^[104][105] and SADA [129]^[106], assemble the single domain structure via rigid body docking, which is essentially a template-based method that guides domain assembly by detecting available templates.

2.8. CASP and Most Recent CASP Results

The Critical Assessment of Protein Structure Prediction (CASP) was established in 1994, by Professor John Moult and others from the University of Maryland, and has taken place every other year since then [137]^[107]. Its purpose is to provide an objective evaluation of protein structure prediction technologies within the field of protein structure prediction. Employing a rigorous double-blind prediction mechanism, it is viewed as the gold standard for assessing protein structure prediction techniques and is regarded in the industry as the “Olympics of protein structure prediction”. In order to fairly evaluate protein structure prediction methods, CASP assessors have incorporated and designed multiple measures. Two widely used evaluation measures by CASP are the TM-score and the global distance test score (GDT score). The TM-score between the model and the experimental structure is usually used to assess the global quality of a structural model [138]^[108]. The TM-score ranges between 0 and 1, with TM-scores > 0.5 indicating that the structure models have the same fold defined in SCOP/CATH [117]^[109]. The GDT score is calculated by GDT = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8)/4, where GDT_Pn indicates the percent of residues under the distance cut-off ≤ n Å [139]^[110]. The GDT score primarily focuses on assessing the backbone modeling quality of a protein. With the substantial enhancement in prediction accuracy witnessed since the advent of AlphaFold2 in CASP14, more and more measures for assessing side-chain modeling quality have been introduced. According to the rules of CASP, all participating methods are categorized into the following two groups: server-based and human-based. Participants in the server-based group have a limited window of 72 h for structure prediction, while those in the human-based group are allotted 3 weeks, allowing for manual intervention. This signifies that the server-based group relies solely on computer predictions; hence, the competitive difficulty in this category is often higher than in the human-based groups. Starting from CASP7, the proteins modeled during CASP have been classified as TBM, TBM-easy, TBM-hard, FM/TBM, or FM, depending on the availability and quality of PDB templates for each target, where TBM-easy targets have readily identifiable, high-quality templates, and FM targets typically lack homologous templates in the PDB. Starting from CASP12, protein complex prediction has been included in CASP as an independent assessment category, called the protein assembly category. Protein complex modeling is distinguished from the classical protein–protein docking, where two protein subunits, named the ligand and the receptor, are in contact through a single interface. In the CASP protein assembly assessment, predictions of full-length protein complexes involve predictions of both individual protein–protein interfaces and overall complex topology. Starting from CASP13, deep learning techniques have achieved significant breakthroughs, markedly enhancing the accuracy of protein tertiary structure prediction. In CASP13, the adoption of distance map prediction began to play a pivotal role in guiding protein structure prediction. Notable examples include RaptorX-Contact ^[22], DMPfold [105]^[89], and AlphaFold [106]^[90], which employed deep Residual Networks (ResNets) from contact prediction to distance prediction, significantly boosting predictive modeling performance. In particular, AlphaFold, developed by Google DeepMind, was ranked as the top method in tertiary structure modeling among all groups in CASP13. However, the majority of other groups continued to rely on contact prediction information for guiding protein structure prediction. Due to the remarkable accuracy of deep learning-based contact map predictions, even contact-based protein structure prediction methods also achieved excellent performance. The effectiveness of distance prediction, as demonstrated in CASP13, has led to its widespread applications in various structure prediction methodologies. A promising example is trRosetta [25^[25][111],107], which employed a deep residual neural network to predict both pairwise residue distances and inter-residue orientations for guiding protein structure prediction. Following the inspiration from trRosetta, numerous groups in CASP14 incorporated orientation and distance constraints predicted by deep residual neural networks into their protein structure prediction processes. Among these methods, D-I-TASSER [108]^[112] and D-QUARK [108]^[112] were two top CASP14 servers from Yang Zhang’s group. D-I-TASSER, in particular, leveraged deep learning-based hydrogen bond network prediction to guide protein structure prediction, significantly improving modeling accuracy for CASP14 targets, especially those lacking homologous templates [108]^[112].

2.9. AlphaFold Protein Structure Database (AlphaFold DB)

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk, accessed on 10 December 2023), created in partnership between DeepMind and the EMBL-European Bioinformatics Institute (EMBL-EBI), is a freely accessible database of high-accuracy protein structure predictions by the scientific community [148]^[113]. Powered by AlphaFold2 of Google DeepMind, AlphaFold DB provides highly accurate protein structure predictions, competitive with experimental structures. The latest AlphaFold DB release contains over 200 million entries, providing broad coverage of UniProt [149]^[114], which is the standard repository of protein sequences and annotations. AlphaFold DB provides individual downloads for the human proteome and for the proteomes of 47 other key organisms important in research and global health. AlphaFold DB also provides a download for the manually curated subset of UniProt. The prediction results of AlphaFold DB can be accessed through several mechanisms, as follows: (i) bulk downloads (up to 23 TB) via FTP; (ii) programmatic access via an application programming interface (API); and (iii) download and interactive visualization of individual predictions on protein-specific web pages keyed on UniProt accessions. The AlphaFold DB’s release of a multitude of novel protein structures has provided bioinformaticians across the globe with a rich repository of data. Developers specializing in protein structure analysis tools are leveraging this influx of accurate models, leading to numerous significant breakthroughs in protein-related fields. For example, the AlphaFold DB, through its accurate prediction of protein structures, offers a robust foundation for understanding how different ligands might interact with various proteins, which is pivotal in identifying potential drug targets, aiding in the design of novel pharmaceuticals, and contributing to a broader understanding of biological functions. In this context, several methods have been developed. AlphaFill, for instance, was developed to enrich the models in the AlphaFold DB by “transplanting” ligands, co-factors, and ions, based on sequence and structure similarity [150]^[115]. While the AlphaFold DB has significantly expanded the application and scalability of tools and algorithms for protein-related analyses, effectively analyzing more than a couple of hundred thousand protein structures or models poses a challenge. There is a pressing need to develop novel approaches capable of managing the unanticipated and rapid growth of available models. Notably, state-of-the-art tools such as FoldSeek [154]^[116] and 3D-AF-Surfer [155]^[117] have already been developed, aiding researchers in searching through extensive repositories of protein structures to identify hits with structural similarity to a provided input structure. Leveraging high-throughput structural similarity searches facilitates classification problems, such as assigning structural CATH domains to AlphaFold models [156]^[118].

References

Anfinsen, C.B. Principles that Govern the Folding of Protein Chains. Science 1973, 181, 223–230.
Sanger, F.; Nicklen, S.; Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 1977, 74, 5463–5467.
Venter, J.C.; Adams, M.D.; Myers, E.W.; Li, P.W.; Mural, R.J.; Sutton, G.G.; Smith, H.O.; Yandell, M.; Evans, C.A.; Holt, R.A.; et al. The Sequence of the Human Genome. Science 2001, 291, 1304–1351.
Metzker, M.L. Sequencing technologies—The next generation. Nat. Rev. Genet. 2010, 11, 31–46.
Bairoch, A.; Apweiler, R.; Wu, C.H.; Barker, W.C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, 33 (Suppl. 1), D154–D159.
Glusker, J.P. X-ray crystallography of proteins. Methods Biochem. Anal. 1994, 37, 1–72.
Cavanagh, J. Protein NMR Spectroscopy: Principles and Practice; Academic Press: Cambridge, MA, USA, 1996.
Cheng, Y. Single-Particle Cryo-EM at Crystallographic Resolution. Cell 2015, 161, 450–457.
Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2008, 36 (Suppl. 1), D190–D195.
Levitt, M.; Warshel, A. Computer simulation of protein folding. Nature 1975, 253, 694–698.
Lewis, P.N.; Momany, F.A.; Scheraga, H.A. Folding of Polypeptide Chains in Proteins: A Proposed Mechanism for Folding. Proc. Natl. Acad. Sci. USA 1971, 68, 2293–2297.
McCammon, J.A.; Gelin, B.R.; Karplus, M. Dynamics of folded proteins. Nature 1977, 267, 585–590.
Bowie, J.U.; Lüthy, R.; Eisenberg, D. A Method to Identify Protein Sequences That Fold into a Known Three-Dimensional Structure. Science 1991, 253, 164–170.
Skolnick, J.; Kolinski, A. Simulations of the Folding of a Globular Protein. Science 1990, 250, 1121–1125.
Šali, A.; Blundell, T.L. Comparative Protein Modelling by Satisfaction of Spatial Restraints. J. Mol. Biol. 1993, 234, 779–815.
Simons, K.T.; Kooperberg, C.; Huang, E.; Baker, D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J. Mol. Biol. 1997, 268, 209–225.
Roy, A.; Kucukural, A.; Zhang, Y. I-TASSER: A unified platform for automated protein structure and function prediction. Nat. Protoc. 2010, 5, 725–738.
Xu, D.; Zhang, Y. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins Struct. Funct. Bioinform. 2012, 80, 1715–1735.
Yang, J.; Yan, R.; Roy, A.; Xu, D.; Poisson, J.; Zhang, Y. The I-TASSER Suite: Protein structure and function prediction. Nat. Methods 2015, 12, 7–8.
Ovchinnikov, S.; Park, H.; Varghese, N.; Huang, P.-S.; Pavlopoulos, G.A.; Kim, D.E.; Kamisetty, H.; Kyrpides, N.C.; Baker, D. Protein structure determination using metagenome sequence data. Science 2017, 355, 294–298.
Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Comput. Biol. 2017, 13, e1005324.
Zheng, W.; Li, Y.; Zhang, C.; Pearce, R.; Mortuza, S.M.; Zhang, Y. Deep-learning contact-map guided protein structure prediction in CASP13. Proteins Struct. Funct. Bioinform. 2019, 87, 1149–1164.
Senior, A.W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A.W.R.; Bridgland, A.; et al. Improved protein structure prediction using potentials from deep learning. Nature 2020, 577, 706–710.
Yang, J.; Anishchenko, I.; Park, H.; Peng, Z.; Ovchinnikov, S.; Baker, D. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. USA 2020, 117, 1496–1503.
Fischer, D.; Eisenberg, D. Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc. Natl. Acad. Sci. USA 1997, 94, 11929–11934.
Sánchez, R.; Šali, A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins Struct. Funct. Bioinform. 1997, 29, 50–58.
Zhang, Y.; Skolnick, J. Automated structure prediction of weakly homologous proteins on a genomic scale. Proc. Natl. Acad. Sci. USA 2004, 101, 7594–7599.
Malmström, L.; Riffle, M.; Strauss, C.E.M.; Chivian, D.; Davis, T.N.; Bonneau, R.; Baker, D. Superfamily Assignments for the Yeast Proteome through Integration of Structure Prediction with the Gene Ontology. PLoS Biol. 2007, 5, e76.
Mukherjee, S.; Szilagyi, A.; Roy, A.; Zhang, Y. Genome-Wide Protein Structure Prediction. In Multiscale Approaches to Protein Modeling: Structure Prediction, Dynamics, Thermodynamics and Macromolecular Assemblies; Kolinski, A., Ed.; Springer: New York, NY, USA, 2011; pp. 255–279.
Xu, D.; Zhang, Y. Ab Initio structure prediction for Escherichia coli: Towards genome-wide protein structure modeling and fold assignment. Sci. Rep. 2013, 3, 1895.
Zhang, C.; Zheng, W.; Cheng, M.; Omenn, G.S.; Freddolino, P.L.; Zhang, Y. Functions of Essential Genes and a Scale-Free Protein Interaction Network Revealed by Structure-Based Function and Interaction Prediction for a Minimal Genome. J. Proteome Res. 2021, 20, 1178–1189.
Kim, D.E.; Chivian, D.; Baker, D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 2004, 32 (Suppl. 2), W526–W531.
Kelley, L.A.; Sternberg, M.J.E. Protein structure prediction on the Web: A case study using the Phyre server. Nat. Protoc. 2009, 4, 363–371.
Schwede, T.; Kopp, J.R.; Guex, N.; Peitsch, M.C. SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res. 2003, 31, 3381–3385.
Söding, J.; Biegert, A.; Lupas, A.N. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 2005, 33 (Suppl. 2), W244–W248.
Wang, Z.; Eickholt, J.; Cheng, J. MULTICOM: A multi-level combination approach to protein structure prediction and its assessments in CASP8. Bioinformatics 2010, 26, 882–888.
Källberg, M.; Wang, H.; Wang, S.; Peng, J.; Wang, Z.; Lu, H.; Xu, J. Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 2012, 7, 1511–1522.
Xu, J. Distance-based protein folding powered by deep learning. Proc. Natl. Acad. Sci. USA 2019, 116, 16856–16865.
Vaidehi, N.; Floriano, W.B.; Trabanino, R.; Hall, S.E.; Freddolino, P.; Choi, E.J.; Zamanakos, G.; Goddard, W.A. Prediction of structure and function of G protein-coupled receptors. Proc. Natl. Acad. Sci. USA 2002, 99, 12622–12627.
Zhang, Y.; Thiele, I.; Weekes, D.; Li, Z.; Jaroszewski, L.; Ginalski, K.; Deacon, A.M.; Wooley, J.; Lesley, S.A.; Wilson, I.A.; et al. Three-Dimensional Structural View of the Central Metabolic Network of Thermotoga maritima. Science 2009, 325, 1544–1549.
Loewenstein, Y.; Raimondo, D.; Redfern, O.C.; Watson, J.; Frishman, D.; Linial, M.; Orengo, C.; Thornton, J.; Tramontano, A. Protein function annotation by homology-based inference. Genome Biol. 2009, 10, 207.
Radivojac, P.; Clark, W.T.; Oron, T.R.; Schnoes, A.M.; Wittkop, T.; Sokolov, A.; Graim, K.; Funk, C.; Verspoor, K.; Ben-Hur, A.; et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 2013, 10, 221–227.
Zhang, C.; Zheng, W.; Huang, X.; Bell, E.W.; Zhou, X.; Zhang, Y. Protein Structure and Sequence Reanalysis of 2019-nCoV Genome Refutes Snakes as Its Intermediate Host and the Unique Similarity between Its Spike Protein Insertions and HIV-1. J. Proteome Res. 2020, 19, 1351–1360.
Capriotti, E.; Fariselli, P.; Casadio, R. I-Mutant2.0: Predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. 2005, 33 (Suppl. 2), W306–W310.
Tokuriki, N.; Tawfik, D.S. Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol. 2009, 19, 596–604.
Quan, L.; Lv, Q.; Zhang, Y. STRUM: Structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics 2016, 32, 2936–2946.
Porta-Pardo, E.; Hrabe, T.; Godzik, A. Cancer3D: Understanding cancer mutations through protein structures. Nucleic Acids Res. 2015, 43, D968–D973.
Pires, D.E.V.; Ascher, D.B.; Blundell, T.L. mCSM: Predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 2014, 30, 335–342.
Porta-Pardo, E.; Godzik, A. Mutation Drivers of Immunological Responses to Cancer. Cancer Immunol. Res. 2016, 4, 789–798.
Sundaram, L.; Gao, H.; Padigepati, S.R.; McRae, J.F.; Li, Y.; Kosmicki, J.A.; Fritzilas, N.; Hakenberg, J.; Dutta, A.; Shon, J.; et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 2018, 50, 1161–1170.
Woodard, J.; Zhang, C.; Zhang, Y. ADDRESS: A Database of Disease-associated Human Variants Incorporating Protein Structure and Folding Stabilities. J. Mol. Biol. 2021, 433, 166840.
Evers, A.; Klebe, G. Successful Virtual Screening for a Submicromolar Antagonist of the Neurokinin-1 Receptor Based on a Ligand-Supported Homology Model. J. Med. Chem. 2004, 47, 5381–5392.
Klebe, G. Virtual ligand screening: Strategies, perspectives and limitations. Drug Discov. Today 2006, 11, 580–594.
Zhou, H.; Skolnick, J. FINDSITEX: A Structure-Based, Small Molecule Virtual Screening Approach with Application to All Identified Human GPCRs. Mol. Pharm. 2012, 9, 1775–1784.
Roy, A.; Zhang, Y. Recognizing Protein-Ligand Binding Sites by Global Structural Alignment and Local Geometry Refinement. Structure 2012, 20, 987–997.
Vajda, S.; Guarnieri, F. Characterization of protein-ligand interaction sites using experimental and computational methods. Curr. Opin. Drug Discov. Dev. 2006, 9, 354–362.
Choudhary, S.; Malik, Y.S.; Tomar, S. Identification of SARS-CoV-2 Cell Entry Inhibitors by Drug Repurposing Using in silico Structure-Based Virtual Screening Approach. Front. Immunol. 2020, 11, 1664.
Chan, W.K.B.; Zhang, Y. Virtual Screening of Human Class-A GPCRs Using Ligand Profiles Built on Multiple Ligand–Receptor Interactions. J. Mol. Biol. 2020, 432, 4872–4890.
Kuntz, I.D. Structure-Based Strategies for Drug Design and Discovery. Science 1992, 257, 1078–1082.
Drews, J. Drug Discovery: A Historical Perspective. Science 2000, 287, 1960–1964.
Evers, A.; Klabunde, T. Structure-based Drug Discovery Using GPCR Homology Modeling: Successful Virtual Screening for Antagonists of the Alpha1A Adrenergic Receptor. J. Med. Chem. 2005, 48, 1088–1097.
Ekins, S.; Mestres, J.; Testa, B. In silico pharmacology for drug discovery: Applications to targets and beyond. Br. J. Pharmacol. 2007, 152, 21–37.
Shan, Y.; Kim, E.T.; Eastwood, M.P.; Dror, R.O.; Seeliger, M.A.; Shaw, D.E. How Does a Drug Molecule Find Its Target Binding Site? J. Am. Chem. Soc. 2011, 133, 9181–9183.
Han, X.; Wang, C.; Qin, C.; Xiang, W.; Fernandez-Salas, E.; Yang, C.-Y.; Wang, M.; Zhao, L.; Xu, T.; Chinnaswamy, K.; et al. Discovery of ARD-69 as a Highly Potent Proteolysis Targeting Chimera (PROTAC) Degrader of Androgen Receptor (AR) for the Treatment of Prostate Cancer. J. Med. Chem. 2019, 62, 941–964.
Pearce, R.; Zhang, Y. Toward the solution of the protein structure prediction problem. J. Biol. Chem. 2021, 297, 100870.
Needleman, S.B.; Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
Smith, T.F.; Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195–197.
Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
Wu, S.; Zhang, Y. MUSTER: Improving protein sequence profile–profile alignments by using multiple sources of structure information. Proteins Struct. Funct. Bioinform. 2008, 72, 547–556.
Buchan, D.W.A.; Jones, D.T. EigenTHREADER: Analogous protein fold recognition by efficient contact map threading. Bioinformatics 2017, 33, 2684–2690.
Zheng, W.; Wuyun, Q.; Li, Y.; Mortuza, S.M.; Zhang, C.; Pearce, R.; Ruan, J.; Zhang, Y. Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLoS Comput. Biol. 2019, 15, e1007411.
Zhu, J.; Wang, S.; Bu, D.; Xu, J. Protein threading using residue co-variation and deep learning. Bioinformatics 2018, 34, i263–i273.
Bhattacharya, S.; Roche, R.; Moussad, B.; Bhattacharya, D. DisCovER: Distance- and orientation-based covariational threading for weakly homologous proteins. Proteins Struct. Funct. Bioinform. 2022, 90, 579–588.
Zhang, H.; Shen, Y. Template-based prediction of protein structure with deep learning. BMC Genom. 2020, 21, 878.
Gao, M.; Skolnick, J. A novel sequence alignment algorithm based on deep learning of the protein folding code. Bioinformatics 2021, 37, 490–496.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7112–7127.
Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130.
Zheng, W.; Zhang, C.; Bell, E.W.; Zhang, Y. I-TASSER gateway: A protein structure and function prediction server powered by XSEDE. Future Gener. Comput. Syst. 2019, 99, 73–85.
Yang, J.; Zhang, Y. I-TASSER server: New development for protein structure and function predictions. Nucleic Acids Res. 2015, 43, W174–W181.
Zhang, Y. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins Struct. Funct. Bioinform. 2007, 69, 108–117.
Piana, S.; Klepeis, J.L.; Shaw, D.E. Assessing the accuracy of physical models used in protein-folding simulations: Quantitative evidence from long molecular dynamics simulations. Curr. Opin. Struct. Biol. 2014, 24, 98–105.
Bowie, J.U.; Eisenberg, D. An evolutionary approach to folding small alpha-helical proteins that uses sequence information and an empirical guiding fitness function. Proc. Natl. Acad. Sci. USA 1994, 91, 4436–4440.
Göbel, U.; Sander, C.; Schneider, R.; Valencia, A. Correlated mutations and residue contacts in proteins. Proteins Struct. Funct. Bioinform. 1994, 18, 309–317.
Thomas, D.J.; Casari, G.; Sander, C. The prediction of protein contacts from multiple sequence alignments. Protein Eng. Des. Sel. 1996, 9, 941–948.
Chiu, D.K.Y.; Kolodziejczak, T. Inferring consensus structure from nucleic acid sequences. Bioinformatics 1991, 7, 347–352.
Ding, W.; Gong, H. Predicting the Real-Valued Inter-Residue Distances for Proteins. Adv. Sci. 2020, 7, 2001314.
Greener, J.G.; Kandathil, S.M.; Jones, D.T. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat. Commun. 2019, 10, 3977.
Senior, A.W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A.W.R.; Bridgland, A.; et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins Struct. Funct. Bioinform. 2019, 87, 1141–1148.
Callaway, E. ‘It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures. Nature 2020, 588, 203–204.
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
Mirdita, M.; Schütze, K.; Moriwaki, Y.; Heo, L.; Ovchinnikov, S.; Steinegger, M. ColabFold: Making protein folding accessible to all. Nat. Methods 2022, 19, 679–682.
Gustaf, A.; Nazim, B.; Christina, F.; Sachin, K.; Qinghui, X.; William, G.; Timothy, J.O.D.; Daniel, B.; Ian, F.; Niccolò, Z.; et al. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv 2023.
Ziyao, L.; Xuyang, L.; Weijie, C.; Fan, S.; Hangrui, B.; Guolin, K.; Linfeng, Z. Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold. bioRxiv 2022.
Steinegger, M.; Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017, 35, 1026–1028.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
Fang, X.; Wang, F.; Liu, L.; He, J.; Lin, D.; Xiang, Y.; Zhu, K.; Zhang, X.; Wu, H.; Li, H.; et al. A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. Nat. Mach. Intell. 2023, 5, 1087–1096.
Ruidong, W.; Fan, D.; Rui, W.; Rui, S.; Xiwen, Z.; Shitong, L.; Chenpeng, S.; Zuofan, W.; Qi, X.; Bonnie, B.; et al. High-resolution de novo structure prediction from primary sequence. bioRxiv 2022.
Konstantin, W.; Michael, H.; Martin, S.; Burkhard, R. Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies. bioRxiv 2022.
Schauperl, M.; Denny, R.A. AI-Based Protein Structure Prediction in Drug Discovery: Impacts and Challenges. J. Chem. Inf. Model. 2022, 62, 3142–3156.
Wollacott, A.M.; Zanghellini, A.; Murphy, P.; Baker, D. Prediction of structures of multidomain proteins from structures of the individual domains. Protein Sci. 2007, 16, 165–175.
Xu, D.; Jaroszewski, L.; Li, Z.; Godzik, A. AIDA: Ab initio domain assembly for automated multi-domain protein structure prediction and domain–domain interaction prediction. Bioinformatics 2015, 31, 2098–2105.
Zhou, X.; Peng, C.; Zheng, W.; Li, Y.; Zhang, G.; Zhang, Y. DEMO2: Assemble multi-domain protein structures by coupling analogous template alignments with deep-learning inter-domain restraint prediction. Nucleic Acids Res. 2022, 50, W235–W245.
Zhou, X.; Hu, J.; Zhang, C.; Zhang, G.; Zhang, Y. Assembling multidomain protein structures through analogous global structural alignments. Proc. Natl. Acad. Sci. USA 2019, 116, 15930–15938.
Peng, C.-X.; Zhou, X.-G.; Xia, Y.-H.; Liu, J.; Hou, M.-H.; Zhang, G.-J. Structural analogue-based protein structure domain assembly assisted by deep learning. Bioinformatics 2022, 38, 4513–4521.
Moult, J.; Pedersen, J.T.; Judson, R.; Fidelis, K. A large-scale experiment to assess protein structure prediction methods. Proteins Struct. Funct. Bioinform. 1995, 23, ii-iv.
Zhang, Y.; Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinform. 2004, 57, 702–710.
Xu, J.; Zhang, Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 2010, 26, 889–895.
Simpkin, A.J.; Mesdaghi, S.; Sánchez Rodríguez, F.; Elliott, L.; Murphy, D.L.; Kryshtafovych, A.; Keegan, R.M.; Rigden, D.J. Tertiary structure assessment at CASP15. Proteins 2023, 91, 1616–1635.
Du, Z.; Su, H.; Wang, W.; Ye, L.; Wei, H.; Peng, Z.; Anishchenko, I.; Baker, D.; Yang, J. The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 2021, 16, 5634–5651.
Zheng, W.; Li, Y.; Zhang, C.; Zhou, X.; Pearce, R.; Bell, E.W.; Huang, X.; Zhang, Y. Protein structure prediction using deep learning distance and hydrogen-bonding restraints in CASP14. Proteins Struct. Funct. Bioinform. 2021, 89, 1734–1751.
Varadi, M.; Anyango, S.; Deshpande, M.; Nair, S.; Natassia, C.; Yordanova, G.; Yuan, D.; Stroe, O.; Wood, G.; Laydon, A.; et al. AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022, 50, D439–D444.
The UniProt, C. UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Res. 2021, 49, D480–D489.
Hekkelman, M.L.; de Vries, I.; Joosten, R.P.; Perrakis, A. AlphaFill: Enriching AlphaFold models with ligands and cofactors. Nat. Methods 2023, 20, 205–213.
van Kempen, M.; Kim, S.S.; Tumescheit, C.; Mirdita, M.; Lee, J.; Gilchrist, C.L.M.; Söding, J.; Steinegger, M. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 2023.
Aderinwale, T.; Bharadwaj, V.; Christoffer, C.; Terashi, G.; Zhang, Z.; Jahandideh, R.; Kagaya, Y.; Kihara, D. Real-time structure search and structure classification for AlphaFold protein models. Commun. Biol. 2022, 5, 316.
Bordin, N.; Sillitoe, I.; Nallapareddy, V.; Rauer, C.; Lam, S.D.; Waman, V.P.; Sen, N.; Heinzinger, M.; Littmann, M.; Kim, S.; et al. AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun. Biol. 2023, 6, 160.