Despite its breakthrough in accuracy and performance, AlphaFold2 has notable limitations, such as increased time consumption with longer protein lengths. To address these challenges, several faster artificial intelligence-driven protein folding tools, based on AlphaFold2, have been developed
[93][94][95][111,112,113]. For example, ColabFold
[93][111] improved the speed of protein structure prediction by integrating MMseqs2′s efficient homology search (Many-against-Many sequence searching)
[96][114] with AlphaFold2
[92][110]. OpenFold
[94][112], a trainable and open-source implementation of AlphaFold2 using PyTorch
[97][115], achieved enhanced computational efficiency with reduced memory usage, thereby facilitating the prediction of exceedingly long proteins on a single GPU. Similarly, Uni-Fold
[95][113] redeveloped AlphaFold2 within the PyTorch framework and reproduced its original training process on a larger set of training data, achieving comparable or superior accuracy and faster speed. Collectively, these developments represent significant strides in enabling rapid and accurate predictions of protein structures.
2.6. Protein Language Model-Based Protein Structure Prediction
AlphaFold2 has facilitated the rise of structure prediction performance to new heights, nearly comparable to the accuracy of experimental determination methods since CASP14. Standard protein structure prediction pipelines heavily rely on co-evolution information from MSAs. However, the excessive dependence on MSAs often acts as a bottleneck in various protein-related problems. While model inference in the structure prediction pipeline typically takes a few seconds, the MSA construction step is time-intensive, consuming tens of minutes per protein. This time-consuming process significantly hampers tasks requiring high-throughput requests, like protein design
[98][119]
A large-scale protein language model (PLM) presents an alternative avenue to MSAs for acquiring co-evolutionary knowledge, facilitating MSA-free predictions. In contrast to MSA-based methods, wherein information retrieval techniques explicitly capture co-evolutionary details from protein sequence databases, PLM-based methods embed co-evolutionary information into the large-scale model parameters during training, and allow for implicit retrieval through model inference, wherein the PLM is viewed as a repository of protein information. Furthermore, MSA-based approaches have lower efficiency in information retrieval, relying on manually designed retrieval schemes.
Inspired by the progress of PLMs and AlphaFold2, many protein structure prediction methods have been proposed. For example, ESMFold
[79][85], developed by Meta AI, used the information and representations learned by a PLM called ESM-2 to perform end-to-end 3D structure prediction using only a single sequence as input. ESMFold demonstrated comparable accuracy to AlphaFold2 and RoseTTAFold for sequences exhibiting low perplexity and thorough comprehension by PLM. Notably, ESMFold’s inference speed was ten times faster than that of AlphaFold2, thereby facilitating efficient exploration of the structural landscape of proteins within practical time frames. OmegaFold
[99][121] predicted the high-resolution protein structure from a single primary sequence alone, using a combination of a PLM and a geometry-inspired transformer model, trained on protein structures. OmegaFold requires only a single amino acid sequence for protein structure prediction and does not rely on MSAs or known structures as templates. Similar to ESMFold, OmegaFold can also scale roughly ten times faster than MSA-based methods, such as AlphaFold2 and RoseTTAFold. HelixFold-Single
[98][119] was an end-to-end MSA-free protein structure prediction pipeline that combined a large-scale PLM with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trained a large-scale PLM with thousands of millions of primary structures, utilizing the self-supervised learning paradigm, and then obtained an end-to-end differentiable model to predict 3D structures by combining the pre-trained PLM and the essential components of AlphaFold2. EMBER3D
[100][122] predicted 3D structure directly from single sequences by computing both 2D (distance maps) and 3D structure (backbone coordinates) from sequences alone, based on embeddings from the pre-trained PLM called ProtT5. EMBER3D exhibited a speed that was orders of magnitude faster than its counterparts, enabling the prediction of average-length structures in mere milliseconds, even on consumer-grade machines.
2.7. Multi-Domain Protein Structure Prediction
Since the advent of AlphaFold2 in the recent CASP14, great progress has been made in protein structure prediction. However, AlphaFold2 and most of the subsequent state-of-the-art methods have mainly focused on the modeling of single-domain proteins, which are the minimum folding units of proteins that fold and function independently. Nonetheless, it is worth noting that several of the CASP14 targets, especially large multi-domain targets, were not predicted with high accuracy, suggesting that further improvements are needed for multi-domain prediction
[101][123].
A common approach to multi-domain protein structure modeling is to split the query sequence into domains and generate models for each individual domain separately. The individual domain models are subsequently assembled into full-length models, usually under the guidance of other homologous multi-domain proteins from the PDB. Such domain assembling methods can be divided into the following two categories: linker-based domain assembly and inter-domain rigid body docking. Linker-based methods, such as Rosetta
[102][125] and AIDA
[103][126], primarily focus on the construction of linker models by exploring the conformational space, with domain orientations loosely constrained by physical potential from generic hydrophobic interactions. Docking-based methods, such as DEMO
[104][105][127,128] and SADA
[106][129], assemble the single domain structure via rigid body docking, which is essentially a template-based method that guides domain assembly by detecting available templates.
2.8. CASP and Most Recent CASP Results
The Critical Assessment of Protein Structure Prediction (CASP) was established in 1994, by Professor John Moult and others from the University of Maryland, and has taken place every other year since then
[107][137]. Its purpose is to provide an objective evaluation of protein structure prediction technologies within the field of protein structure prediction. Employing a rigorous double-blind prediction mechanism, it is viewed as the gold standard for assessing protein structure prediction techniques and is regarded in the industry as the “Olympics of protein structure prediction”.
In order to fairly evaluate protein structure prediction methods, CASP assessors have incorporated and designed multiple measures. Two widely used evaluation measures by CASP are the TM-score and the global distance test score (GDT score). The TM-score between the model and the experimental structure is usually used to assess the global quality of a structural model
[108][138]. The TM-score ranges between 0 and 1, with TM-scores > 0.5 indicating that the structure models have the same fold defined in SCOP/CATH
[109][117]. The GDT score is calculated by GDT = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8)/4, where GDT_Pn indicates the percent of residues under the distance cut-off ≤ n Å
[110][139]. The GDT score primarily focuses on assessing the backbone modeling quality of a protein. With the substantial enhancement in prediction accuracy witnessed since the advent of AlphaFold2 in CASP14, more and more measures for assessing side-chain modeling quality have been introduced.
According to the rules of CASP, all participating methods are categorized into the following two groups: server-based and human-based. Participants in the server-based group have a limited window of 72 h for structure prediction, while those in the human-based group are allotted 3 weeks, allowing for manual intervention. This signifies that the server-based group relies solely on computer predictions; hence, the competitive difficulty in this category is often higher than in the human-based groups.
Starting from CASP7, the proteins modeled during CASP have been classified as TBM, TBM-easy, TBM-hard, FM/TBM, or FM, depending on the availability and quality of PDB templates for each target, where TBM-easy targets have readily identifiable, high-quality templates, and FM targets typically lack homologous templates in the PDB.
Starting from CASP12, protein complex prediction has been included in CASP as an independent assessment category, called the protein assembly category. Protein complex modeling is distinguished from the classical protein–protein docking, where two protein subunits, named the ligand and the receptor, are in contact through a single interface. In the CASP protein assembly assessment, predictions of full-length protein complexes involve predictions of both individual protein–protein interfaces and overall complex topology.
Starting from CASP13, deep learning techniques have achieved significant breakthroughs, markedly enhancing the accuracy of protein tertiary structure prediction.
In CASP13, the adoption of distance map prediction began to play a pivotal role in guiding protein structure prediction. Notable examples include RaptorX-Contact
[22], DMPfold
[89][105], and AlphaFold
[90][106], which employed deep Residual Networks (ResNets) from contact prediction to distance prediction, significantly boosting predictive modeling performance. In particular, AlphaFold, developed by Google DeepMind, was ranked as the top method in tertiary structure modeling among all groups in CASP13. However, the majority of other groups continued to rely on contact prediction information for guiding protein structure prediction. Due to the remarkable accuracy of deep learning-based contact map predictions, even contact-based protein structure prediction methods also achieved excellent performance.
The effectiveness of distance prediction, as demonstrated in CASP13, has led to its widespread applications in various structure prediction methodologies. A promising example is trRosetta
[25][111][25,107], which employed a deep residual neural network to predict both pairwise residue distances and inter-residue orientations for guiding protein structure prediction. Following the inspiration from trRosetta, numerous groups in CASP14 incorporated orientation and distance constraints predicted by deep residual neural networks into their protein structure prediction processes. Among these methods, D-I-TASSER
[112][108] and D-QUARK
[112][108] were two top CASP14 servers from Yang Zhang’s group. D-I-TASSER, in particular, leveraged deep learning-based hydrogen bond network prediction to guide protein structure prediction, significantly improving modeling accuracy for CASP14 targets, especially those lacking homologous templates
[112][108].
2.9. AlphaFold Protein Structure Database (AlphaFold DB)
The AlphaFold Protein Structure Database (AlphaFold DB,
https://alphafold.ebi.ac.uk, accessed on 10 December 2023), created in partnership between DeepMind and the EMBL-European Bioinformatics Institute (EMBL-EBI), is a freely accessible database of high-accuracy protein structure predictions by the scientific community
[113][148]. Powered by AlphaFold2 of Google DeepMind, AlphaFold DB provides highly accurate protein structure predictions, competitive with experimental structures. The latest AlphaFold DB release contains over 200 million entries, providing broad coverage of UniProt
[114][149], which is the standard repository of protein sequences and annotations. AlphaFold DB provides individual downloads for the human proteome and for the proteomes of 47 other key organisms important in research and global health. AlphaFold DB also provides a download for the manually curated subset of UniProt. The prediction results of AlphaFold DB can be accessed through several mechanisms, as follows: (i) bulk downloads (up to 23 TB) via FTP; (ii) programmatic access via an application programming interface (API); and (iii) download and interactive visualization of individual predictions on protein-specific web pages keyed on UniProt accessions.
The AlphaFold DB’s release of a multitude of novel protein structures has provided bioinformaticians across the globe with a rich repository of data. Developers specializing in protein structure analysis tools are leveraging this influx of accurate models, leading to numerous significant breakthroughs in protein-related fields.
For example, the AlphaFold DB, through its accurate prediction of protein structures, offers a robust foundation for understanding how different ligands might interact with various proteins, which is pivotal in identifying potential drug targets, aiding in the design of novel pharmaceuticals, and contributing to a broader understanding of biological functions. In this context, several methods have been developed. AlphaFill, for instance, was developed to enrich the models in the AlphaFold DB by “transplanting” ligands, co-factors, and ions, based on sequence and structure similarity
[115][150].
While the AlphaFold DB has significantly expanded the application and scalability of tools and algorithms for protein-related analyses, effectively analyzing more than a couple of hundred thousand protein structures or models poses a challenge. There is a pressing need to develop novel approaches capable of managing the unanticipated and rapid growth of available models. Notably, state-of-the-art tools such as FoldSeek
[116][154] and 3D-AF-Surfer
[117][155] have already been developed, aiding researchers in searching through extensive repositories of protein structures to identify hits with structural similarity to a provided input structure. Leveraging high-throughput structural similarity searches facilitates classification problems, such as assigning structural CATH domains to AlphaFold models
[118][156].