Currently, the term "protein folding problem" has two meanings, one emphasizing the result, the other the process. The former is understood as what structure will be attained by the protein chain of a certain amino acid sequence, while the latter implies how the protein chain chooses, in minutes, its unique structure among a giant number of others. For a long time, these two problems were considered as one, assuming that once "how" were solved, "what" would be solved right away. However, now it is clear that these are two different problems, because they have been solved by two quite different methods. The problem of "what" has been very recently solved by bioinformatics with the aid of neural networks and artificial intelligence [Senior et al., 2019, 2020; Yang et al., 2020]. This topic needs to be described in a separate Entry of Encyclopedia. The problem of "how" has been solved by physics. The aim of the article below is to outline the principal moments of this solution.
The ability of proteins to fold spontaneously puzzled protein science for a long time (see, e.g., [Anfinsen & Scheraga, 1975; Jackson, 1998; Fersht, 2000; Grantcharova et al., 2001; Robson & Vaithilingam, 2008; Dill & MacCallum, 2012; Wang et al., 2012; Wolynes, 2015; Finkelstein & Ptitsyn, 2016]).
As known, in living cells, gene-encoded protein chains are synthesized by special molecular machines, called ribosomes. To perform its unique biological function, the protein chain has to obtain its unique ("native") three-dimensional (3D) structure.
This phenomenon is called "protein folding".
Its importance for protein functioning was recognized in the 1950s [Anfinsen, 1959], followed by the finding that protein folding can occur not only in vivo but also in vitro [Anfinsen et al., 1961].
Since it is rather difficult to follow a change in the structure of a nascent protein chain against the background of many other molecules in a living cell, the investigation of protein folding started with in vitro experiments on the folding of water-soluble molecules of globular proteins.
However, it makes sense to begin this paper with a short overview of comparatively recent results on the folding that occurs in the course of protein biosynthesis on ribosomes.
The first studies were carried out on large proteins. They showed that these start to fold before their biosynthesis has been completed: the first synthetized (N-terminal) immunoglobulin domain folds when the whole chain has not been synthesized yet [Isenman et al., 1979]; the luciferase protein starts to work immediately upon completion of the chain biosynthesis [Kolb et al., 1994]; and the globin chain can bind to the heme when a bit more than its half has been synthesized by the ribosome [Komar et al., 1997] (though it is hard to say whether structuring of this half-made chain occurred before the heme-binding or resulted from it). Anyway, these data suggest that the in vivo protein chain folding starts just on the ribosome and that this co-translational process differs from the discussed below in vitro folding ("renaturation") of the entire protein chain.
However, more up-to-date experiments on co-translational structure acquisition by small nascent proteins (monitored by 15N, 13C NMR, and FRET) have shown that “polypeptides [at a ribosome] remain unstructured during elongation but fold into a compact, native-like structure when the entire sequence is available” [Eichmann et al., 2010]; “… folding [occurs] immediately after the emergence of the full domain sequence” [Han et al., 2012]; “… co-translational folding … proceeds through a compact, non-native conformation [i.e., something molten globule-like] … [and] rearranges into a native-like structure immediately after the full domain sequence has emerged from the ribosome” [Holtkamp et al., 2015].
Thus, there is no fundamental difference between the in vivo and in vitro folding, at least for small proteins, though some details of the on-ribosome and in vitro folding pathways can differ. In both cases, native structures emerge only when the entire sequence of a protein domain has been synthesized (it should be noted that truncated protein chains do not refold and remain compact but disordered in vitro [Flanagan et al., 1992]).
The discovery of chaperones, the cell’s troubleshooters [Ellis & Hartl, 1999], again aroused numerous suggestions that the in vivo and in vitro protein folding may be quite different, because chaperones may have a foldase/unfoldase activity (see, e.g., [Libich et al., 2015] and references therein). However, the analysis of data presented in [Libich et al., 2015] reveals that the most studied chaperone (GroEL) does not speed up the overall folding process [Marchenko et al., 2015]. This corroborates the conclusion that GroEL serves as an auxiliary transient trap that binds excess unfolded protein chains, thus preventing them from irreversible aggregation [Marchenkov et al., 2004; Marchenko et al., 2009].
Thus, the self-organization of protein structures (which in the case of in vitro folding of water-soluble globular proteins is unassisted by other biomolecules) captures the main peculiarities of the protein folding phenomenon. This means that all the information necessary to build up the 3D structure of a protein is inscribed in its amino acid sequence (this was Anfinsen's "thermodynamic hypothesis").
The study of self-organization has shown that an unfolded protein chain can spontaneously, "by itself", fold into its unique native 3D structure [Anfinsen et al., 1961; Anfinsen, 1973]. In Anfinsen's experiment, the enzyme ribonuclease A unfolded in the presence of urea and a thiol reagent, and with these agents removed, the enzyme spontaneously refolded, recovering its structure (as shown by correct restoration of all four S-S bonds) and function. However, as it has been recently found by David Eisenberg [2018], "essentially the same experiment had been performed earlier by a medical student [Lisa A. Steiner] at Yale, but neither [she nor] her research supervisor nor her department chair thought it particularly significant, and her work was not published". "Why did this transformative result lay hidden in her thesis?" asked Eisenberg, and answered: "She had the answer to a hugely important question, but that question had not yet been posed" because then (in the mid-1950s) it had not yet been elucidated "how biological information passes from the genome to proteins".
In the course of self-organization, the protein chain has to find its native (and seemingly the most stable) fold among zillions of others (Fig. 1) within only minutes given by a cell for its folding.
Indeed, the number of alternatives is vast [Levinthal, 1968, 1969]: it is at least 2100 but more likely 3100 or even 10100 (or 100100) for a 100-residue chain, because at least 2 ("right" and "wrong") but more likely 3 (a, b, "coil") or »10 (Privalov's [1979] experimental estimate), or even 100 [Levinthal, 1969] conformations are possible for each residue.
Since the chain cannot pass from one conformation to another faster than within a picosecond (the time of a thermal vibration), the exhaustive search would take at least ~2100 picoseconds (but more likely 3100, or even 10100, or 100100), that is, ~1010 (or 1025, or even 1080, or 10180) years. And it looks like the sampling has to be exhaustive because the protein “feels” that it has come to the stable structure only when it hits it precisely, while even a 1Å deviation can strongly increase the chain energy in the closely packed globule.
Figure 1. The Levinthal's choice problem. The choice of the native structure can be determined either by the folding process (Levinthal's "kinetic hypothesis") or by the enhanced native fold stability (Anfinsen's "thermodynamic hypothesis").
The main protein folding puzzle is why the native protein structure is found within minutes rather than within "Levinthal's" ~1010 or more years (that is, within ~1016 or more minutes)! This reduction of the folding process by 10 000 000 000 000 000 (!) times (compared to iterating over all structures) must be always kept in mind without distracting to dead-end considerations that promise, say, 1 000- or 1 000 000-fold acceleration of the process.
How can the protein chain choose, in minutes, its native structure among a giant number of others, asked Levinthal [1968; 1969] (who first noticed this paradox), and answered: It seems that the protein folding follows some specific fast pathway, and the native fold is simply the end of this pathway, no matter if it is the most stable chain fold or not (this was Levinthal's "kinetic hypothesis"). In other words, Levinthal suggested that the native protein structure is determined by kinetics rather than stability and corresponds to the easily accessible local free energy minimum rather than the global one.
However, computer experiments with lattice models of protein chains strongly suggest that the chains fold to their most stable structure, i.e., that the "native protein structure" is the lowest-energy one, and protein folding is under thermodynamic rather than kinetic control [Šali et al., 1994; Abkevich et al., 1994].
Nevertheless, most of the proposed and widely discussed hypotheses on protein folding were based on the "kinetic" (rather than "thermodynamic") "control assumption".
In particular, ahead of Levinthal, Phillips [1966] proposed that the protein folding nucleus is formed near the N-end of the nascent protein chain, and the remaining chain wraps around it. Meanwhile, the successful in vitro folding of many single-domain proteins and protein domains does not begin from the N-end [Goldenberg & Creighton, 1983; Grantcharova et al., 1998; Lappalainen et al., 2008].
Wetlaufer [1973] hypothesized the formation of the folding nucleus by adjacent residues of the protein chain but in vitro experiments have shown that this is not always so [Fulton et al., 1999; Wensley et al., 2009].
Ptitsyn [1973] proposed a model of hierarchical folding, i.e., a stepwise involvement of different interactions and the formation of different folding intermediate states.
More recently, various “folding funnel” models [Leopold et al., 1992; Wolynes et al., 1995; Dill & Chan, 1997; Bicout & Szabo, 2000; Wang et al., 2012] have become popular for illustrating and describing the reason for the speedy folding processes.
The difficulty of the "kinetics vs stability" problem is that it hardly can be solved by a direct experiment. Indeed, suppose that a protein has some structure that is more stable than the native one. How can we find it if the protein does not do so itself? Shall we wait for ~1010 (or even ~10180) years?
On the other hand, the question as to whether the protein structure is controlled by kinetics or stability arises again and again in solving practical problems of protein physics, engineering, and design. For example, when predicting the protein structure from its sequence, what should we look for? The most stable or the most rapidly folding structure? When designing a de novo protein, should we maximize the stability of the desired fold or create a rapid pathway to this fold?
However, is there a contradiction between “the most stable” and the “rapidly folding” structure? Maybe, the stable structure automatically forms a focus for the “rapid” folding pathways, and therefore it is automatically capable of fast-folding?
Before considering these questions, i.e., before considering the kinetic aspects of protein folding, let us recall some basic experimental facts concerning protein thermodynamics (as usual, we will consider single-domain water-soluble globular proteins only, i.e., chains of ~100 residues). These facts will help us understand what chains and folding conditions we have to consider. The facts are as follows:
(For the below theoretical analysis, it is essential to note that (i) as is customary in the literature on this subject, the term “entropy” as applied to protein folding means conformational entropy of the chain without solvent entropy; (ii) accordingly, the term “energy” actually implies “free energy of interactions” (often called the “mean force potential”), so that hydrophobic and other solvent-mediated forces, with all their solvent entropy [Tanford, 1968], come within “energy”. This terminology is commonly used to concentrate on the main problem of sampling the protein chain conformations.)
The above-mentioned “all-or-none” transition means that the native (N) and denatured (U) states are separated by a high free-energy barrier. It is the height of this barrier that limits the rate of this transition, and just this height is to be estimated to solve Levinthal's paradox.
The “kinetic control” hypothesis initiated very intensive studies of protein folding intermediates.
It was clear almost from the very beginning that the stable intermediates are not obligatory for folding, since the protein can also fold near the mid-point of equilibrium between the native and denatured states (Fig. 2) [Segava & Sugihara, 1984], where the transition is of the “all-or-none” type [Privalov, 1979], which excludes any stable intermediates.
The obtained basic experimental facts on protein folding kinetics are as follows:
Figure 2. Rates (k) of lysozyme re- and denaturation vs temperature. The mid-transition point is where the rates of renaturation (U®N) and denaturation (N®U transition) are equal, i.e., where the curves intersect. The plot is adapted from [Segava & Sugihara, 1984]. Note that folding at physiological temperatures of »40oC is only ~10-fold faster than folding at the mid-transition point. The similar in value but the opposite slopes of the U®N and N®U lines indicate that the transition state is intermediate in properties between the native and denatured states.
To begin with, it is not out of place considering whether the “Levinthal’s paradox” is a paradox indeed. Bryngelson & Wolynes [1989] mentioned that this “paradox” is based on the absolutely flat (and therefore unrealistic) “golf course” model of the protein potential energy surface (Fig. 3a), and somewhat later Leopold et al. [1992], following the line of Go & Abe [1981], considered more realistic (tilted and biased to the protein native structure) energy surfaces and introduced the “folding funnels” (Fig. 3b), which seemingly eliminate the “paradox”.
Figure 3. Schematic illustration of basic models of the energy landscapes of protein chains. (a) The “Levinthal's golf course model". (b) The “funnel” model; the funnel is centered in the lowest-energy (“native”) structure. (c) The potential energy landscape of a protein chain in more detail with bumps and wells, the deepest of which ("native") is by many kBTmelt (where kB is Boltzmann's constant and Tmelt is protein melting temperature) deeper than the others: this energy GAP between the global and other energy minima is necessary to provide the “all-or-none” type of decay of the stable protein structure [Shakhnovich & Gutin, 1990]. Only two coordinates (q1 and q2) can be shown in the figures, while the protein chain conformation is determined by hundreds of coordinates.
Various “folding funnel” models became popular for explaining and illustrating protein folding [Wolynes et al., 1995; Karplus, 1997; Nölting, 2010; Wolynes, 2015]. In the funnel, the lowest-energy structure (formed by a set of the most powerful interactions) is the center surrounded by higher-energy structures containing only a part of these powerful interactions. The “energy funnels” may appear not perfectly smooth due to some "frustrations" [Bryngelson & Wolynes, 1987, i.e., contradictions between optimal interactions for different links of a heteropolymer forming the protein globule, but a stable protein structure is distinguished by minimal frustrations (that is, most of its elements enhance the native fold stability) [Bryngelson & Wolynes, 1987, 1989; Bryngelson et al., 1995; Finkelstein et al., 1995]. The “energy funnel” can channel the protein chain towards the single lowest-energy structure, thereby apparently preventing “Levinthal’s” sampling of all conformations.
This would be so provided there were only energy and no entropy, which (if temperature >0oK) opposes the chain movement towards the single structure, even though corresponding to the global energy minimum.
But protein folding occurs in liquid water, at temperatures >273oK, where the entropy term is big, and, moreover, the folding proximity to the mid-transition point (Fig. 2) makes the entropy term compensate for the folding energy.
The mid-transition point fits the best to analyze Levinthal’s paradox (though in the "strongly folding conditions" the folding can be, say, 1 000-fold faster – but this is infinitely less than the puzzling 10 000 000 000 000 000-fold acceleration of the folding process compared to iterating over all structures).
At the mid-transition point, the protein chain has two equally stable low-free-energy states (denatured, often to the random coil, and natively folded) which are separated by a free energy barrier providing the all-or-none transition between them [Privalov, 1979], and the free energy landscape is "volcano-shaped" [Rollins & Dill, 2014] (Fig. 4).
Figure 4. This purely illustrative drawing shows how entropy converts the energy funnel (Fig. 3b) into a "volcano-shaped" free-energy folding landscape with a barrier on any pathway leading from unfolded conformations to the native fold. The smooth free energy landscape corresponds to compact semi-folded intermediate structures; the rocks (denoted by dotted lines) present the high-energy non-compact semi-folded intermediate structures and intermediate structures with high-energy bumps (see Fig. 3c). More accurate but less beautiful scheme of the free-energy landscape is shown in Fig. 2 in [Galzitskaya & Finkelstein, 1999].
Thus, any pathway from the unfolded state to the native one first goes uphill in free energy, and only then, in the vicinity of the native state, after passing the free energy barrier (i.e., the crater edge), the "free-energy funnel" starts working and pulls the chain downhill to the native state.
To have a rapid transition from the coil to the native state, the free energy barrier created by the volcano must be not high: according to the conventional transition state theory [Eyring 1935; Pauling, 1970; Emanuel & Knorre, 1984], the time of overcoming the barrier is estimated as
TIME » τ × exp(+∆F#/kBT) (1)
where τ is the time of a step from the barrier onwards, and ∆F# is the height of the free energy barrier (that is, the free energy of the folding nucleus).
It should be noted that protein folding is a multistep process [Finkelstein & Ptitsyn, 2002], and the traditional steady state theory is not very accurate when applied to multistep processes, including protein folding [Djikaev & Ruckenstein, 2016] and phase transitions in general [Ruckenstein & Berim, 2016]. However, this relatively small error mainly concerns [Finkelstein, 2015; Ruckenstein & Berim, 2016] to the estimate of the pre-exponential factor (τ in equation (1)), which is of secondary importance compared to the main term (exponent in equation (1) that accounts for the transition state free energy).
Because not any type of energy funnel provides a low volcano height, not any energy funnel per se can resolve Levinthal’s paradox. Strict analysis [Bogatyreva & Finkelstein, 2001] of the straightforwardly presented funnel models [Zwanzig et al, 1992; Bicout & Szabo, 2000] corresponding to the uniform condensation of the chain (previously considered by Shakhnovich & Finkelstein [1989]) shows that close to the mid-transition point such funnels cannot simultaneously explain both major features observed in protein folding: (i) its non-astronomical time and (ii) the "all-or-none" type of transition. By the way, the stepwise folding mechanism [Ptitsyn, 1973] also cannot [Finkelstein, 2002] simultaneously explain both of these, and hence, cannot resolve Levinthal's paradox.
The basic solution of the paradox is provided by funnels of a special type providing separation of folded and unfolded phases [Finkelstein & Badretdinov, 1997a, b] within the folding chain. (Which, as it was later mentioned in a review by Wolynes [1997], resembles the "capillarity in the nucleation" in the first-order phase transitions. The separation of the folded and unfolded phases in protein folding is seen in later computer simulations by Shaw et al. [2010]).
The optimal pathway of folding can be outlined using the optimal pathway of unfolding (which can be found much easier) because, according to the well-known detailed balance law [Landau & Lifshitz, 1980], the direct and reverse reactions follow the same pathway and have equal rates when both end-states have equal stability (otherwise, i.e., if the pathways for and reactions were different, the result would be a permanent circular flow (generating a perpetual motion machine of the second kind), which contradicts to the second law of thermodynamics).
As for a good unfolding pathway, one can easily figure out that this can be a sequential unfolding pathway passing through the least unstable semi-unfolded states, i.e., those where the compact globular phase is separated from the unfolded one by a relatively small boundary (Fig. 5) [Finkelstein & Badretdinov, 1997a, b; Galzitskaya & Finkelstein, 1999; Garbuzynskiy et al., 2013]. (To resolve Levinthal's paradox, it is not necessary to prove that this is the best possible pathway; it is enough to prove that this pathway resolves the paradox because any additional pathway will only accelerate the process. Imagine two pools, full of water and empty, with water leaking from one to the other through cracks in the wall between them; when the cracks cannot absorb all the water – which is prohibited by the all-or-none transition – each additional crack accelerates filling of the empty pool.)
Figure 5. Schematic illustration of the sequential folding/unfolding pathway of a globule with compact semi-folded intermediates. At each step of sequential unfolding, one residue leaves the native-like part of the globule (shaded) and turns into a coil (shown by a dashed line). The highest-free-energy intermediate (the folding nucleus corresponding to the transition state, denoted as #) has the largest (on the pathway) interface of the globular and unfolded phases. Its globular part covers about half of the chain. Adapted from [Finkelstein & Badretdinov, 1997a, b].
In a simplified form (for details, see [Finkelstein & Badretdinov, 1997a, b; 1998; Garbuzynskiy et al., 2013]), the resulting free energy barrier is estimated as follows.
When the free energies of the folded and unfolded phases are equal (i.e., in the mid-transition ambient conditions), the free energy of a semi-folded structure depends only on the interface between the two phases. The largest unavoidable interface corresponds to the intermediate that looks like a half of the native globule (Fig. 5) and has »L2/3 residues at the interface (assuming the most compact spherical shape of the native globule; for an oblate or oblong globule, the largest unavoidable interface can be a little less).
Thus, the transition state free energy is proportional not to the number L of the chain residues (as Levinthal’s estimate implies), but to L2/3 only.
The energy constituent DE# of the barrier free energy results from interactions lost by the interface residues; it is about , where » 1.3 kcal/mol » 2kBTmel is the average latent heat of protein melting per residue [Privalov, 1979] (this is the first empirical parameter used by the theory), and »1/4 is, roughly, the fraction of interactions lost by an interface residue. Thus,
DE#/kBTmelt » 0.5 . (2)
The entropy constituent DS# of the barrier free energy is caused by entropy lost by closed loops protruding from the globular into the unfolded phase (note that the second folding intermediate, denoted as # in Fig. 5, contains two closed loops, and the first folding intermediate in Fig. 5 contains no closed loops).
When the shape of the native protein fold and especially the shape of the chain fold in the transition state are not known, the closed-loops-connected DS# value (which is £0) can only be estimated from above and below. The upper limit of DS# is zero (when the interface contains no closed loops). The lower limit of DS# is about
(DS#)lower » . (3)
Here, is the maximal expected number of loops protruding from the maximal (containing » residues) unavoidable interface. Actually, is the average number of loops protruding from the interface of residues. The multiplier results from the fact that the chain can have, roughly, 6 directions in each interface residue (4 along the interface, 1 inside the folded part, and only 1 looking outside, thereby initiating a loop). Among many possible cross-section interfaces dividing the globule into two halves, the lowest-free-energy interface should serve for the transition state on the folding/unfolding pathway. Therefore, this "optimal" interface should be covered by no more than , but possibly smaller, number of loops.
The value is the average number of residues in a closed loop in the transition state ( being the number of unfolded residues in the folding nucleus, and the number of loops there). The value is the entropy lost by a -residue closed loop at the interface (such a loop cannot cross the interface plane; this restriction changes , the conventional Flory's coefficient for the entropy of an unrestricted closed loop, to [Finkelstein & Badretdinov, 1997a, b]). Having L ~ 100 (actually, this approximation is good for the whole range of L = 10 - 1000), one obtains
(DS#)lower » (3a)
In the mid-transition ambient conditions, the transition state free energy equals to DE#-TmeltDS#. The value is not less than DE#/ - 0 (when DS#=0) and not larger than DE# - Tmelt(DS#)lower, that is,
DE# £ £ DE# - Tmelt(DS#)lower (4)
Thus, when the free energy difference ∆F between the native (most stable) and unfolded state is equal to zero, the time of both folding and unfolding of the L-residue protein chain is estimated as
TIME∆F=0 » τ × exp[+ ] ∼ τ × exp[+(0.5 ¸ 1.5)L2/3] (5)
where τ » 10 ns is the time of structure growth by one residue [Zana, 1975] (this τ is the second and the last empirical parameter used in the theory [Finkelstein & Badretdinov, 1997a, b]).
Here, one thing should be added: A search over folds with different chain knottings can, in principle, be a rate-limiting “quasi-Levinthal” factor, since the knotting cannot be changed without globule decay. However, since the computer experiments show that one chain knot involves many tens of residues, the search for correct knotting can only be rate-limiting for extremely long (L>>1000) chains [Finkelstein & Badretdinov, 1998] that cannot fold within a reasonable time (according to eq. (5)) in any case.
The above equation shows that in the mid-transition conditions (where ∆F=0), a 100-residue protein chain should attain its most stable fold within milliseconds or days, but not years.
If the native fold is more stable than the unfolded state (i.e., if ∆F<0), the folding is faster. Because the folding nucleus covers about half of the chain (more detailed calculations [Garbuzynskiy et al., 2013] give »40%), its free energy decreases from (that was at ∆F=0) to approximately + 0.4∆F at ∆F<0, so that
TIME∆F<0 ∼ TIME∆F=0 × exp[+0.4∆F T], (6)
which can be approximately presented as
TIME∆F<0 ∼ 10 ns × exp[+(0.5 ¸ 1.5)×(L2/3+0.4∆F T)] (6a)
[Garbuzynskiy et al., 2013]. Because the value ∆F » 40 kJ/mol for a ~100-residue protein under physiological conditions [Privalov, 1979], the folding time of such a protein decreases about 100-fold, and now ranges from a fraction of millisecond to a few hours.
It should be noted that all the above considerations are focused on the case of the moderate stability of the native fold, which corresponds to the available data on protein folding (occurring near the mid-transition point, see Fig. 2). For the opposite case of a very high native fold stability (‑DF >> kBT), another but similar to eq. (5) scaling law (ln(TIME) ∼ L1/2) was obtained by Thirumalai [1995].
Concluding: one can see that although the protein folding problem is the so-called “NP-hard” problem [Ngo & Marks, 1992; Unger, & Moult, 1993] (which loosely speaking implies an exponentially-long time to be spent to solve it by a folding chain or by a computer), and indeed the time is a stretched-exponential function of the chain length L (see eqs. (5), (6a), and the later rigorous mathematical papers [Fu & Wang, 2004; Steinhofel et al., 2006]), this does not mean that this time is unreasonably long for a normal protein domain.
The observed protein folding times (see Fig. 6) span over 11 orders of magnitude (which is akin to the difference between the life span of a mosquito and the age of the universe).
Figure 6 shows the region theoretically allowed for the folding times by equations (5) - (6a) (and obtained with only two empirical and no adjustable parameters) and describes the observed folding times of all studied single-domain globular proteins of any size and stability of their native state.
Figure 6 also shows that a chain of L≲80-90 residues will find its most stable fold within minutes (or faster) even under "non-biological" mid-transition conditions, where folding is known [Creighton, 1978; Fersht, 1999] to be the slowest (see also Fig. 2). Thus, native structures of such relatively small proteins are under complete thermodynamic control: they are the most stable among all structures of these chains.
Native structures of larger proteins (of »100-400 residues) are, in addition, under a kinetic "control of complexity", in a sense that too entangled folds of their long chains cannot be achieved within days or weeks even if they are thermodynamically stable; and indeed, globular domains with greatly entangled folds of long protein chains have never been observed [Garbuzynskiy et al., 2013]: they seem to be excluded from the repertoire of existing protein structures. Besides, the native fold of at least one protein (serpin) of »400 residues is not the most stable but a long-living metastable fold [Tsutsui et al., 2012].
The kinetic control also explains why even larger (with L≳450) proteins should have far from spherical shape or consist (according to the “divide and rule” principle) of separately folding domains: otherwise, chains of more than 450 residues would fold too slowly. This is a kinetic "size restriction" for domains. In a sense, this effect resembles Levinthal's "kinetic control", though at another level and only for very large proteins. The above estimates (»100 and »400 residues) are somewhat (by 30-50%) elevated when the native fold free energy DF is substantially lower than that of the unfolded chain, but essentially they remain nearly the same [Garbuzynskiy et al., 2013].
Figure 6. The folding rates and times. Experimental in vitro measurements have been made "in water" (under approximately "biological" conditions) and at mid-transition for 107 single-domain proteins (or separate domains) without SS bonds and covalently bound ligands (though the rates for proteins with and without SS bonds are principally the same [Galzitskaya et al., 2001]). The golden-and-white triangle: the region theoretically allowed by physics at the mid-transition. Its golden part corresponds to biologically-reasonable folding times (≤10 min), the bronze belt is the additional area allowed in "biological" conditions. The white zone: the larger folding times (i.e., the smaller folding rates) are observed (for some proteins) only under the mid-transition (i.e., "non-biological") conditions. The yellow dashed line limits the additional area allowed for oblate (1:2) and oblong (2:1) globules at mid-transition; the bronze dashed line means the same for “biological” conditions. L is the number of amino acid residues in the protein chain. DF is the free energy difference between the native and unfolded states of the chain under the experimental conditions and temperature T. Adapted from [Garbuzynskiy et al., 2013].
Equations (5) - (6a) only outline the range of folding times depending on the protein size and stability of its native structure under given ambient conditions. To predict the protein folding time more accurately, the shape of its folding nucleus or, for lack of such information, of its native fold should be taken into account; this has been done by Plaxco et al. [1998], who introduced a "contact order" (CO, that equals to the average chain separation of the residues that are in contact in the native protein fold, divided by the chain length) as a phenomenological measure of complexity of the native fold (though, only for small proteins that fold without folding intermediates). Then this CO was adjusted for the already developed [Finkelstein & Badretdinov, 1997a, b] chain length dependence, and the resulting method [Ivankov et al., 2003] has shown quite good results, now for all proteins, and the later extension of this method [Finkelstein et al., 2013] gave even more accurate results.
It should be added that no attention was paid in these works to specific structures of folding nuclei; the attention was only paid to their overall features like size, instability and complexity. The reason: there is ample evidence that in some cases, folding nuclei are well-organized and possess specific structural features (see [Fersht, 1999, 2000; Garbuzynskiy & Kondratova, 2008; Shaw et al., 2010]), while in others, they are poorly organized ("diffused nuclei") (see [Grantcharova et al., 2001; Finkelstein et al., 2007, 2014] and references therein). The latter, together with the observed sensitivity of positions and shapes of the folding nuclei to mutations, led to the conclusion that a "nucleus" is an ensemble of structures rather than a single structure, and that that the folding nucleus and folding pathway are much less resistant to amino acid sequence mutations and change of ambient conditions than the native protein structure.
Also, it should be noted that all the above considerations were focused on stability (or rather, instability) of transition states (folding nuclei) and paid virtually no attention to folding intermediates, because these - in contrast to transition states - do not determine the rate of folding of native structures [Fersht, 1999, 2000].
The total ("Levinthal's") volume of the protein conformation space estimated at the level of amino acid residues is huge: ≳3100 conformations for a 100-residue chain.
However, should the chain sample all these conformations in search for its most stable fold? No: a vast majority of them are non-compact (that is, high-energy ones); but the conformation space is covered by local energy minima, each surrounded by a local energy funnel (Fig. 7) providing fast downhill decent to this local minimum. And, actually, the folding protein chain has to sample only various ways of packing the chain into the compact protein globule.
Figure 7. Comparison of a huge search among all, for the most part disordered, conformations and a much less voluminous search only among compact and well-structured globules, thus corresponding to the deep energy minima surrounded by energy funnels. Adapted from [Finkelstein, 2017].
To estimate the actual volume of this sampling, one has to estimate the number of local energy minima. This is similar to the idea of enumerating possible "topomers" that a protein chain can form [Debe et al. 1999; Makarov & Plaxco, 2003; Wallin & Chan, 2005].
An overview of protein structures shows that interactions occurring in the chains are mainly connected with secondary structures [Levitt & Chothia, 1976; Chothia & Finkelstein, 1990; Finkelstein & Ptitsyn, 2002]. Thus, a question arises as to how large the total number of energy minima is if considered at the level of formation and assembly of secondary structures into a globule, that is, at the level considered by Ptitsyn [1973] in his model of stepwise protein folding.
We will be interested mostly in proteins that fold under thermodynamic control, that is, ones having chains of L~100 or less amino acid residues (see above). Such proteins have no more than 10 α- and b-structural elements [Ptitsyn & Finkelstein, 1980; Rollins & Dill, 2014].
The number of compact globular packings of the chain is by many orders of magnitude smaller than that of conformations of amino acid residues [Finkelstein & Garbuzynskiy, 2015]: the latter, according to Levinthal’s estimate, scales up as something like 100L or 10L or 3L with the number L of residues in the chain, while the former scales up not faster (see below) than ~LN with the chain length L and the number N of the secondary structure elements. N is much less than L (N < L/10, according to Rollins & Dill [2014]), and this drastic decrease of the power N as compared to L is the main reason for the drastic decrease of the conformation space.
The number of compact globular packings of the chain with given secondary structures can be presented [Finkelstein & Garbuzynskiy, 2015] as a product of the following multipliers (Fig. 8).
Figure 8. A scheme of estimate of the conformation space volume at the level of secondary structure assembly and packing. Explanations are given in the text. Adapted from Supplement to [Finkelstein & Garbuzynskiy, 2015].
MA, the number of architectures, i.e., types of dense stacks of given secondary structures. This number is small (cf. [Levitt & Chothia, 1976; Murzin & Finkelstein, 1988; Chothia & Finkelstein, 1990]). It is usually ~10 or less (Fig. 8a) for a given set of secondary structures, since the architectures are packings of a few secondary structure layers (each containing several secondary structures), and therefore the combinatorics of the layers is very small as compared to that of much more numerous secondary structure elements, which is described below.
MP, the number of all possible combinations of positions of N structural elements within the given protein architecture that cannot exceed N! º N ´ (N -1) ´ . ´ 2 ´ 1 (Fig. 8b).
MT, the number of all possible topologies, i.e., all combinations of directions of these structural elements that cannot exceed 2N (Fig. 8c).
MS*T, the number of possible shifts and turns (Fig. 8d) of structural elements within the dense globule. Here, transverse shifts and tilts are prohibited by the dense packing, while longitudinal shifts and rotations of structural elements are coupled (this is shown using a b-sheet as the best illustrative example, but this is also true for a-helices – remember “knobs in the holes” close packings by Crick [1953]); as a result, each a- or b-element can have about L/N (that is, about the element's mean length) possible shifts/turns in the globule formed by N secondary structures in the L-residue chain.
As a result, the number of energy minima in the conformational space is MA ´ MP ´ MT ´ MS*T ~ 10´(L/N)N´2N´N! conformations; this (using Stirling's approximation N! » (N/e)N) gives
NUMBER of energy minima to be sampled ~ MA ´ MP ´ MT ´ MS*T ~ LN (7)
in the main term (if L >> N >> 1) [Finkelstein & Garbuzynskiy, 2015].
This number can be somewhat reduced by the symmetry of the globule, by shortness of some loops, by the impossibility to have a-helices inside b-sheets, etc., but this is not important in estimating the upper limit of the number of conformations [Finkelstein & Garbuzynskiy, 2015].
As to the question of how the chain knows where and what secondary structures to form, the answer is that most of the secondary structures are determined by local amino acid sequences [Ptitsyn & Finkel'shtein, 1970; Ptitsyn, 1973; Lim, 1974a, b; Chou & Fasman, 1974; Schulz et al., 1974; Ptitsyn &Finkelstein, 1983; Finkelstein et al., 1990; Jones, 1999; etc.].
Because in a chain of L»20 residues one (N=1) α-helix forms within »0.2 ms [Mukherjee et al., 2008], and a b-hairpin of N=2 b-strands forms within »6 ms [Muñoz et al., 1997], the time necessary for iterating ~ LN of possible assemblies of the secondary structures can be estimated (cf. eq. (6a)) as
TIME ~10 ns ´ LN (7a)
In a compact globule, the length of a secondary structure element should be proportional to the globule's diameter, i.e., to ~L1/3. More specifically, a diameter of a globule of L residues is »5L1/3Å, and thus, on the average, a helix consists of »3L1/3 residues, while a b-strand, as well as a loop, comprises »1.5L1/3 residues. Thus, (an α-glodule (consisting of α-helices connected by loops) contains »L/[L1/3(3+1.5)] = L2/3/4.5 helices, and a b-structural globule (consisting of b-strands connected by loops) contains of »L/[L1/3(1.5+1.5)] = L2/3/3 b-strands [Finkelstein & Garbuzynskiy, 2015]. This means that
NUMBER of secondary elements N » L2/3/4.5 [for α-proteins] — L2/3/3 [for b-proteins]. (8)
Thus, the value LN of possible secondary structure assemblies is expected to come within the range
º exp([ln(L)/4.5]´L2/3]) [for α-proteins] — º exp([ln(L)/3]´L2/3]) [for b-proteins] (9)
Since ln(L=50¸150) = 4¸5, the outlined range of possible secondary structure assemblies is close to
» exp(L2/3) — » exp(1.5L2/3) (9a)
for normal domains of L~100 residues (see explanations after eq. (3)), and the number of the secondary structure assemblies scales with the chain length L approximately as the upper boundary of the range of folding times outlined by equation (6a) [Finkelstein & Badretdinov, 1997a, b].
It is not out of place mentioning that the scaling of LN given by equation (9) looks like those obtained by Fu & Wang [2004] and Steinhofel et al. [2006] from mathematical consideration of the problem complexity rather than from physical reasons.
This Entry for Encyclopedia has been written after my recent reviews [Finkelstein et al., 2017, 2018; Finkelstein, 2017, 2018; Ivankov & Finkelstein, 2020] and lectures [Finkelstein & Ptitsyn, 2016; Finkelstein, 2020].