Proteins and their functions are distinguished by their structures in many respects, but protein structures have been determined far more slowly than sequences have been identified, owing to the cost and complexity of structure determination. Protein structure prediction has therefore become one of the most efficient, high-throughput tools in bioinformatics for handling the flood of known sequence data, with methodologies developing from statistical to ML and DL methods. The feature used in the prediction process is known as PSA; it contains simplified information that eases computation and serves as an intermediate step toward estimating the full protein structure. One-dimensional (1D) and two-dimensional (2D) PSAs have received a great deal of attention: secondary structure, solvent accessibility, and intrinsic disorder are the main 1D-PSAs, while the CM and its more detailed versions (multi-class CM or distance map) are 2D-PSAs. Several DL applications have been developed for 1D- and 2D-PSA prediction, and they have become more accurate owing to the expanding availability of sequence and structure data.
3.1. 1D Prediction
The most fruitful feature among 1D-PSAs is the secondary structure, the very first step in predicting the full protein structure from the sequence. Two main classifications are available: a three-state categorization into α-helix, β-strand, and coil region, or an eight-state fine-grained categorization, which further subdivides the three states (vide supra). Earlier methods used sequence data alone as input, but later methods incorporated evolutionary information and physicochemical properties to enhance the prediction accuracy
[41]. The accuracy can be expressed simply as the three-state percentage accuracy (Q3 score) or the eight-state percentage accuracy (Q8 score), defined as the percentage of residues whose secondary structure is correctly predicted.
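The Q3 and Q8 scores defined above can be illustrated with a minimal sketch. The eight-to-three state mapping shown here is one common reduction and is an assumption, as are the function names and the toy state strings; the text does not prescribe a specific mapping.

```python
# Assumed reduction from eight DSSP-style states to three states.
# Other conventions exist; this mapping is illustrative only.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helical states -> H
    "E": "E", "B": "E",             # strand and bridge -> E
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil -> C
}


def q_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def to_three_state(states: str) -> str:
    """Collapse an eight-state string into the three-state alphabet."""
    return "".join(EIGHT_TO_THREE[s] for s in states)


# Toy (hypothetical) eight-state assignments for a 12-residue fragment.
observed_q8 = "CHHHHGGTEEEC"
predicted_q8 = "CHHHHHHTEEBC"

q8 = q_score(predicted_q8, observed_q8)
q3 = q_score(to_three_state(predicted_q8), to_three_state(observed_q8))
print(f"Q8 = {q8:.1f}%, Q3 = {q3:.1f}%")
```

In this toy case the three mismatches all fall within the same three-state class, so the Q3 score is higher than the Q8 score for the same prediction.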
One of the earliest servers available for secondary structure prediction is JPred, developed by Cuff et al.
[42]. The server adopts six different secondary structure prediction algorithms: DSC using linear discrimination, PHD using jury decision neural networks, NNSSP using nearest neighbors, PREDATOR using hydrogen bonding propensities, ZPRED using conservation number weighted prediction, and MULTIPRED using consensus single sequence method combination
[43]. Another secondary structure prediction server, PSIPRED, combines two FFNNs trained on evolutionary conservation information derived from PSI-BLAST
[44][45]. Another method, SSpro, applied a more advanced architecture, BRNN–CNN
[46]. The method utilizes a mixture of estimators that leverages evolutionary information, indicated in multiple alignments, both at input and output levels of BRNN. Porter, Porter+, and PaleAle among the Distill series are also based on ensembles of BRNN–CNN, each used to predict different 1D-PSAs (Porter for secondary structure prediction, Porter+ for local motif prediction, and PaleAle for residue solvent accessibility prediction)
[47]. In these Distill methods, the sequence is processed by the first BRNN–CNN stage and then pooled into a set of averages, which are processed by the second BRNN–CNN stage. Porter achieved better performance by using both PSI-BLAST and HHblits to harness evolutionary information
[48][49]. Likewise, Porter+ considers local structural motifs for predicting torsional angles
[50]. PaleAle, which predicts relative solvent accessibility (RSA), is built on double BRNN–CNN stacks in its most recent version, 5.0, and surpasses other methods on RSA prediction benchmarks
[51]. NetSurfP-2.0, concatenating CNNs and BRNNs, was developed in 2019. This method predicts secondary structures, solvent accessibility, torsion angles, and intrinsic disorder, all at once
[52].
Taking other 1D-PSAs into account along with secondary structure and considering physicochemical properties, as well as evolutionary information, helped to enhance the overall accuracy. DESTRUCT, proposed by Wood and Hirst, iteratively used cascade-correlation neural networks upon both secondary structure and torsional angles
[53]. Each iteration consists of a first FFNN trained to predict the secondary structure and the φ dihedral angle, followed by filtering FFNNs that successively transform the predictions into new values. The Hirst group upgraded DESTRUCT into DISSPred, which relies on a support vector machine (SVM) and obtained better performance
[54]. SPINE-X by Faraggi et al. in 2012, later replaced by SPOT-1D from the same group, enhanced the accuracy by incorporating physicochemical properties such as hydrophobicity, polarizability, and isoelectric point, among others. This method could also be used for residue solvent accessibility and torsion angle predictions
[55][56]. SPIDER2 predicted multiple 1D-PSAs, namely secondary structure, solvent accessible surface area (SASA), and torsion angles, all at once with three iterations of deep neural networks
[57]. Its successor, SPIDER3, improved the overall performance and predicts four PSAs at once, including contact number, using four iterations
[58]. ProteinUnet, published in 2020, yields accuracy similar to that of SPIDER3-single for secondary structure prediction, but uses half as many parameters and trains 11 times faster
[59][60]. Most of the servers and methods discussed here now exceed a Q3 score of 84% in their latest versions, thanks to deeper neural networks and better algorithms. Considering the rapid improvement in Q3 scores achieved with DL methods, the theoretical limit of 88–90% may be attained before long.
One special kind of 1D-PSA targets disordered regions of proteins. Many proteins contain intrinsically disordered regions (IDRs) that are highly flexible. Because they can adopt multiple structures, IDRs are involved in assembly, signaling, and many genetic diseases
[61]. Therefore, this PSA is of particular interest in its own right, in addition to being a component of full protein structure prediction. IDRs have been predicted using statistical potentials, SVMs, or artificial neural networks. IUPred employs a statistical pairwise potential, a 20 × 20 matrix that expresses the general preference of each amino acid pair to be in contact
[62]. The pairwise energy profile is calculated, and the disorder probability is estimated accordingly. DISOPRED3 is built on an SVM, a supervised machine learning model, to discriminate between ordered and disordered regions
[63]. DISOPRED3 is trained on PSI-BLAST profiles because such models outperform those trained on single sequences, which shows the improvement gained from evolutionary information. SPOT-Disorder2 offers per-residue disorder prediction based on a deep neural network utilizing LSTM cells
[64]. Higher accuracy was obtained by upgrading its architecture from a single LSTM topology used in the previous version, SPOT-Disorder, to an ensemble set of hybrid models consisting of residual CNNs with inception paths followed by LSTM layers
[65].
3.2. 2D Prediction
With the information gained from 1D-PSAs in hand, one might need 2D-PSAs to fully construct the three-dimensional protein structure. Recent work on 2D-PSAs has focused on the CM and the multi-class CM, both of which express the closeness between residue pairs in a protein. A CM is a binary N × N matrix, where N is the length of the protein sequence; each element is 1 (contact present) or 0 (contact absent) for the corresponding residue pair, based on a user-defined Euclidean distance threshold (a typical value is ~8 Å between Cα atoms). A multi-class CM is also a 2D matrix, but its elements are quantized more finely, into more than two states. The importance of the CM for protein structure prediction is shown directly by two estimates: an early study estimated that one could assemble a structure model within 5 Å RMSD of the native structure if N/4 long-range protein contacts are known, and another study estimated that one contact per twelve residues allows for robust and accurate protein fold modeling
[66][67].
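The binary CM described above can be sketched as follows, assuming Cα coordinates are already available as (x, y, z) tuples in Å. The toy coordinates, the strict "less than" comparison at the threshold, and the sequence-separation cutoff used to call a contact long-range are all assumptions made for illustration; they are not taken from the cited studies.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary N x N contact map: 1 if two residues are closer than `threshold` (Å)."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def long_range_contacts(cm, min_separation=24):
    """Contacting pairs (i, j) with j - i >= min_separation (assumed cutoff)."""
    n = len(cm)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if cm[i][j]
    ]


# Hypothetical U-shaped toy chain of six residues, 3.8 Å apart along the chain.
toy_coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]

cm = contact_map(toy_coords)
for row in cm:
    print(row)

# A small separation is used only because the toy chain is very short.
print(long_range_contacts(cm, min_separation=4))
```

The diagonal is trivially 1 under this definition, and the matrix is symmetric, so in practice only the upper triangle carries information.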
The CM itself provides useful information on the spatial organization of a given protein, but one should note that CMs often contain transitive noise arising from “indirect” correlations between residues. Methods for direct correlation analysis, such as mutual information (MI), direct coupling analysis (DCA), and protein sparse inverse covariance estimation (PSICOV), are used to remove this noise
[68][69][70]. DCA infers direct co-evolutionary couplings among residue pairs in an MSA to uncover native intra-domain and inter-domain residue–residue contacts in protein families
[71][72].
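As a concrete illustration of the simplest of these quantities, the following sketch computes plain mutual information between two columns of a toy MSA. It scores each column pair independently and omits the corrections and global inference that the cited methods apply, so it should be read as a starting point rather than as any published algorithm. The function name and the four-sequence alignment are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)             # single-column counts
    f_j = Counter(seq[j] for seq in msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 1 covary, column 2 is conserved.
toy_msa = ["AKL", "AKL", "GEL", "GEL"]

mi_01 = column_mutual_information(toy_msa, 0, 1)
mi_02 = column_mutual_information(toy_msa, 0, 2)
print(f"MI(0,1) = {mi_01:.3f}, MI(0,2) = {mi_02:.3f}")
```

Perfectly covarying columns with two equally frequent states give ln 2, while a fully conserved column carries no covariation signal and gives zero.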
Many groups have developed CM predictors utilizing multi-stage deep neural networks. The previously introduced Distill server also provides the CM predictor named XX-Stout
[47]. The developers included a contact density profile as an intermediate step, using another Distill module named BrownAle
[73]. Calculating this contact density profile from the principal eigenvector of the CM significantly increased the overall performance. DNCON by Eickholt and Cheng took advantage of rapid GPU development to train large boosted ensembles of residue–residue contact predictors
[74]. MetaPSICOV is another CM predictor, known as the first method to utilize co-evolution signals from 1D-PSAs extracted with three different algorithms
[75]. A two-layer neural network was then used to deduce the CM. Its successors, MetaPSICOV2 and DeepMetaPSICOV, employ deeper network architectures and ReLU units. RaptorX-Contact, from the RaptorX series, utilized co-evolution signals to improve the accuracy
[76]. RaptorX-Contact predicts local structure properties, contact and distance matrices, inter-residue orientation, and the tertiary structure of a protein from the primary sequence or a multiple sequence alignment, using an ultra-deep convolutional residual neural network. DNCON2 is implemented with six CNNs and applies co-evolution signals from 1D-PSAs. This method predicts CMs at distance thresholds of 6, 7.5, 8, 8.5, and 10 Å and then refines them into a single 8 Å CM with an improved prediction rate
[77]. TripletRes starts by collecting MSAs from whole-genome and metagenome sequence databases and then constructs three complementary co-evolutionary feature matrices (covariance matrix, precision matrix, and pseudolikelihood maximization) to create contact-map models through deep residual convolutional neural network training
[78]. DeepContact is also a CNN-based approach that discovers co-evolutionary motifs and leverages these patterns to enable accurate inference of contact probabilities
[79]. The authors argue that the program is useful, particularly when few related sequences are available. DeepCov uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation
[80]. In contrast to other software that depends on third-party programs, PconsC4 is a hassle-free contact prediction tool that does not use any external programs
[81].
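The covariance features computed directly from sequence alignments, as mentioned for DeepCov and TripletRes, can be sketched as the difference between the observed pair frequency and the product of the single-column frequencies. This is a minimal sketch under simplifying assumptions: no sequence weighting, no pseudocounts, a hypothetical toy alignment, and a reduced alphabet. The Frobenius-norm summary per column pair is added here only to make the output easy to inspect; a network of the kind described above would consume the full per-pair block rather than this summary.

```python
import math
from collections import Counter
from itertools import product


def pair_covariance(msa, i, j, alphabet):
    """cov[(a, b)] = f_ij(a, b) - f_i(a) * f_j(b) for alignment columns i and j."""
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)
    f_j = Counter(seq[j] for seq in msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)
    # Counter returns 0 for unseen keys, so absent residues contribute zeros.
    return {
        (a, b): f_ij[(a, b)] / n - (f_i[a] / n) * (f_j[b] / n)
        for a, b in product(alphabet, repeat=2)
    }


def covariance_norm_map(msa, alphabet):
    """L x L map holding the Frobenius norm of each per-pair covariance block."""
    length = len(msa[0])
    out = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1, length):
            block = pair_covariance(msa, i, j, alphabet)
            norm = math.sqrt(sum(v * v for v in block.values()))
            out[i][j] = out[j][i] = norm
    return out


# Hypothetical toy alignment and reduced alphabet, for illustration only.
toy_alignment = ["AKL", "AKL", "GEL", "GEL"]
toy_alphabet = "AGKEL"

norm_map = covariance_norm_map(toy_alignment, toy_alphabet)
for row in norm_map:
    print([round(v, 3) for v in row])
```

Column pairs that covary produce a non-zero block, whereas pairs involving a fully conserved column produce a block of zeros.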
In 2019, DeepCDPred was developed; it includes a multi-class CM predictor that exploits distance constraint terms
[82]. The authors used four FFNN-based models to distinguish four classes of contact ranges: 0–8, 8–13, 13–18, and 18–23 Å. AlphaFold, from the same year, generates the most fine-grained multi-class CM, a distogram (distance histogram) of 64 equal bins spanning 2–22 Å, and became the state of the art in the field
[83]. A deep 2D dilated convolutional residual network with 220 residual blocks was employed for the distance map prediction in AlphaFold (discussed in more detail in the next section). These 2D-PSA developments have benefitted from the growth of affiliated fields, including algorithmic development and technological advancement, which in turn directly benefits precise 3D structure prediction.
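The two quantization schemes mentioned above can be illustrated by mapping an inter-residue distance to a class index. This is a minimal sketch: the bin edges follow the ranges quoted in the text, but the treatment of values exactly on an edge and of distances outside the covered range (returned as None in the first function, clamped to the end bins in the second) is an assumption, and the function names are hypothetical.

```python
import bisect

# Upper edges (Å) of the four contact ranges quoted for DeepCDPred.
FOUR_CLASS_UPPER_EDGES = [8.0, 13.0, 18.0, 23.0]


def range_class(distance, upper_edges=FOUR_CLASS_UPPER_EDGES):
    """Index of the range containing `distance`, or None beyond the last edge.

    Assumption: each range includes its upper edge, e.g. 8.0 Å falls in class 0.
    """
    idx = bisect.bisect_left(upper_edges, distance)
    return idx if idx < len(upper_edges) else None


def equal_width_bin(distance, lo=2.0, hi=22.0, n_bins=64):
    """Index of an equal-width bin between `lo` and `hi` (Å).

    Assumption: out-of-range distances are clamped into the first or last bin.
    """
    width = (hi - lo) / n_bins
    idx = int((distance - lo) // width)
    return min(max(idx, 0), n_bins - 1)


for d in (5.0, 8.0, 10.0, 25.0):
    print(d, range_class(d), equal_width_bin(d))
```

With 64 bins over 2–22 Å each bin is 0.3125 Å wide, so a multi-class CM of this kind encodes far more geometric detail per residue pair than a binary 8 Å contact call.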