Food allergies are adverse immune responses to foods. The symptoms of a food allergy range from mild hives and itching to life-threatening anaphylaxis. In the US, it was estimated that up to 26 million adults
[1] and 6 million children
[2] have food allergies. Depending on the study method, the prevalence of food allergies in Europe has been estimated at between 0.8% (by positive food challenge) and 19.9% (by survey of self-reported food allergies)
[3], and the overall food allergy prevalence in Asia is comparable to that in the West
[4]. Food allergies are among the top causes of anaphylaxis that lead to children’s visits to emergency departments in the United States
[5]. In addition, the situation is worsening as food allergy prevalence has increased in the past few decades
[2][6][7][8]. Most food allergy reactions are of the immediate type, occurring within hours of food intake. They are reactions to proteins (except for a small number of cases, see below) mediated by immunoglobulin E (IgE) antibodies. These reactions happen because the immune system has previously, for unknown reasons, mistaken a food protein for a dangerous invader, switched the class of T helper cells that determines whether the B cells produce IgG or IgE, and developed IgE antibodies against the protein in a so-called sensitization stage. Sensitization to a food can happen when a person consumes it for the first time. It can also occur in people who have been eating the food safely for years or even decades. In the sensitization stage, abnormal immune responses promote the class switching of B cells to produce IgE antibodies specific to a food protein and the clonal expansion of naive and IgE+ memory B-cell populations
[9]. IgE molecules bind to the surface of mast cells and basophils through the high-affinity IgE receptor FcεRI. Subsequent consumption of the food leads to cross-linking of the receptor-bound IgE antibodies by the allergen, which in turn initiates allergic reactions via signaling through FcεRI
[10][11]. Extensive research on food allergies has been conducted in recent years. Most of these efforts involved studying the genetic, environmental, and other factors that cause a sub-population to develop a food allergy
[12]. In comparison, less research has been devoted to studies on the offending allergens.
2. Food Allergens
Allergens are given names by the Allergen Nomenclature Sub-committee, which operates under the auspices of the World Health Organization (WHO) and the International Union of Immunological Societies (IUIS)
[13]. The approved name has three parts separated by spaces: the first three letters of the genus name, the first letter of the species name, and an Arabic number that indicates the order in which the allergens of that species were identified
[13]. The fourth letter of the genus and/or the second letter of the species is included when necessary to remove ambiguity. The WHO/IUIS Allergen Nomenclature Sub-committee also maintains a database of allergens with designated names. This database currently contains 248 food allergens of plant sources from 76 species.
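As an illustration of the naming rule described above, the short Python sketch below assembles a name from a genus, a species, and an identification number. The function name `allergen_name` and its parameters are hypothetical and chosen for illustration only. Names are assigned by the Sub-committee, not by a program, and the decision to use a fourth genus letter or a second species letter is simply passed in by hand here.

```python
def allergen_name(genus, species, number, genus_letters=3, species_letters=1):
    """Assemble a WHO/IUIS-style allergen name such as 'Ara h 9'.

    genus_letters may be 3 or 4 and species_letters 1 or 2, following the
    rule that an extra letter is added only when needed to remove ambiguity.
    """
    if genus_letters not in (3, 4):
        raise ValueError("genus_letters must be 3 or 4")
    if species_letters not in (1, 2):
        raise ValueError("species_letters must be 1 or 2")
    if not isinstance(number, int) or number < 1:
        raise ValueError("number must be a positive integer")
    genus_part = genus[:genus_letters].capitalize()
    species_part = species[:species_letters].lower()
    return f"{genus_part} {species_part} {number}"


if __name__ == "__main__":
    print(allergen_name("Arachis", "hypogaea", 9))
    print(allergen_name("Prunus", "dulcis", 3, species_letters=2))
```

With these inputs the sketch reproduces two names used in this text, Ara h 9 for peanut and Pru du 3 for almond, the latter being a case where the second species letter is included.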
In the early days of plant protein research, proteins were classified by their solubility and extractability in a series of solvents into four groups (albumins, globulins, prolamins, and glutelins)
[14]. With advances in knowledge of the functional, biochemical, and molecular properties of proteins, plant proteins can also be classified into three groups based on their functions: structural and metabolic proteins, protective proteins, and storage proteins. Metabolic proteins can be named by their biochemical activity, while storage proteins are generally those without a known cellular activity other than the storage of nitrogen, carbon, and sulfur for the development of the next generation of the plant. Protective proteins are those that defend the plant against pests, microbial pathogens, or environmental stresses. In modern protein research, one way to obtain valuable information on proteins from their sequence, structure, and function is the Pfam classification of protein families, which is based on hidden Markov model profiles
[15]. At present, proteins are classified into about 19 thousand Pfam signatures
[16].
The number of protein families that contain proteins from plant sources known to be capable of eliciting allergic responses in atopic individuals is several orders of magnitude lower than the total number of Pfam signatures. There are thousands of proteins in a mature plant seed
[17][18], but 79% of plant food allergens belong to only 12 protein families.
Table 1 lists the protein families that are known to contain more than two food allergens, along with the number of known allergens in each family. Some of the allergens (e.g., chitinases) include two or more domains that belong to different protein families. In addition to those shown in
Table 1, six protein families contain two allergens. Twenty-four allergens of plant sources belong to protein families that contain only one known allergen. Nine allergens do not belong to any of the classified Pfam families, i.e., searching the Pfam database with the allergen sequences using the online search tool on the Pfam website returned no hits. Note that only minimal sequence information was available for four of the allergens in the database.
Table 1. Protein families with the most known food allergens from plant sources.
The protein family with the most food allergens from plant sources is PF00234 (the protease inhibitor/seed storage/LTP family), which has 74 known allergens. This family includes the plant nonspecific lipid transfer proteins (NsLTPs), such as NsLTP1 and NsLTP2, the 2S albumin seed storage proteins, trypsin/alpha-amylase inhibitors, and other proteins. The next largest is the Cupin-1 family, which includes 39 allergens in the WHO/IUIS allergen database. These are also seed storage proteins, with close to half of the allergens belonging to the 11S legumins and about half to the 7S vicilins. Three of these allergens have since been renamed as isoforms of other allergens, but their entries remain in the database for historical reasons and because of literature references. With 27 food allergens, the profilin family ranks third in the number of food allergens from plant sources. Thus, the top three protein families contain more than half of the known plant food allergens. In addition, numerous other profilins are known to be pollen allergens. This indicates that the biological activities, physical–chemical properties, and conserved structures of the allergens may play a role in determining or promoting their allergenicity. The following describes the leading plant food allergen families:
Nonspecific lipid transfer proteins. Nonspecific lipid transfer proteins (NsLTPs) are found in all land plants
[19]. They are small proteins with molecular masses of around 10 kDa. They were demonstrated in vitro to bind and transport various phospholipids to chloroplasts or mitochondria without specificity
[20]. NsLTPs are plant pathogenesis-related proteins known as PR-14, and a number of them have been demonstrated to have antimicrobial activities including NsLTPs in wheat (
Triticum aestivum L.)
[21][22] and mung bean (
Vigna radiata (L.) R. Wilczek)
[23]. NsLTPs can be identified by an eight-cysteine residue motif (8CM). Based on the number of residues separating one cysteine from the next and the conservation of residue types at specific positions of the flanking sequences, the NsLTPs can be divided into two types. The 8CM motif for NsLTP1 is CX_{2}VX_{5–7}C[VLI]XY[LAV]X_{8–13}CCXGX_{12}DX[QKR]X_{2}CXCX_{16–21}PX_{2}CX_{13–15}C, and that for NsLTP2 is CX_{4}LX_{2}CX_{9–11}P[ST]X_{2}CCX_{5}QX_{2–4}C[LF]CX_{2}[ALI]X[DN]PX_{10–12}[KR]X_{4–5}CX_{3–4}PX_{0–2}C
[24], where X followed by a subscript denotes that number of non-conserved amino acid residues, and the residues allowed at a single position are placed in square brackets. Thus, C1X_{7–10}C2X_{12–17}C3C4X_{8–19}C5XC6X_{19–24}C7X_{4–15}C8 can be used to describe the 8CM of the plant NsLTPs, where the Cys residues are numbered from 1 to 8. The functions of NsLTPs are not well understood, but their expression levels are known to be high in most tissues, indicating that they may be essential for the reproduction and survival of plants. Four NsLTP2s and 38 NsLTP1s are known to be food allergens. Known NsLTP food allergens from the major allergen sources recognized by the US Food and Drug Administration (FDA) include peanut (
Arachis hypogaea L.) allergens Ara h 9
[25] and Ara h 17
[26], almond (
Prunus dulcis (Mill.) D.A.Webb) allergen Pru du 3
[26], chestnut (
Castanea sativa Mill.) allergen Cas s 8
[27], hazelnut (
Corylus avellana L.) allergen Cor a 8
[28], walnut (
Juglans regia L.) allergen Jug r 3
[29], and wheat allergen Tri a 14
[30]. Furthermore, the NsLTPs from many plants not used for food are known to be pollen allergens.
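The general 8CM pattern given above translates directly into a regular expression, which can serve as a first-pass screen for candidate NsLTP sequences. The following Python sketch is illustrative only: the names `EIGHT_CM`, `find_8cm`, and `make_synthetic` are hypothetical, the test sequence is artificial (cysteines separated by alanine spacers) and not a real allergen, and the wildcard used for X also matches cysteine, so the screen is more permissive than a curated family profile.

```python
import re

# General 8CM from the text:
# C1 X(7-10) C2 X(12-17) C3 C4 X(8-19) C5 X C6 X(19-24) C7 X(4-15) C8
EIGHT_CM = re.compile(
    r"(C).{7,10}(C).{12,17}(C)(C).{8,19}(C).(C).{19,24}(C).{4,15}(C)"
)


def find_8cm(sequence):
    """Return the 1-based positions of C1..C8 for the first 8CM match, or None."""
    match = EIGHT_CM.search(sequence.upper())
    if match is None:
        return None
    return [match.start(i) + 1 for i in range(1, 9)]


def make_synthetic(spacers=(8, 13, 0, 10, 1, 20, 6)):
    """Build an artificial sequence: eight cysteines separated by alanine runs."""
    return "C" + "".join("A" * n + "C" for n in spacers)


if __name__ == "__main__":
    print(find_8cm(make_synthetic()))
```

A sequence that passes this screen is only a candidate; assigning it to NsLTP1 or NsLTP2 would still require checking the type-specific motifs and, ideally, the Pfam profile.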
The crystal structure of Cor a 8
[31] was the first structure reported for an NsLTP food allergen from the major allergen sources, although the crystal structure of the wheat allergen Tri a 14
[32] and the solution structure of Tri a 14
[33] had been reported many years before it was identified as a food allergen. As shown in
Figure 1A, the cysteines in the 8CM of Cor a 8 form four disulfide bonds. The structure images were generated with the CCP4MG program
[34]. The structures of many other NsLTP1 food and pollen allergens are also available. The conservation of these disulfide bond connectivities (between C1–C6, C2–C3, C4–C7, and C5–C8) in NsLTP1s maintains the tertiary contacts of the secondary structural elements and ensures a stable hydrophobic cavity for lipid binding
[35]. Moreover, the structure of rice (
Oryza sativa L.) NsLTP2 was determined by NMR, which showed an overall structure similar to that of an NsLTP1. The disulfide bond connectivities in NsLTP2 (C1–C5, C2–C3, C4–C7, and C6–C8) are different from those in NsLTP1
[36].
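The difference between the two disulfide connectivities described above is easiest to see when the bonds are written as sets of cysteine pairs. The Python sketch below is a minimal illustration using only the pairings stated in the text; the names `NSLTP1_BONDS`, `NSLTP2_BONDS`, and `compare_bonds` are hypothetical.

```python
# Disulfide pairings as (Ci, Cj) using the 8CM cysteine numbering from the text.
NSLTP1_BONDS = {(1, 6), (2, 3), (4, 7), (5, 8)}
NSLTP2_BONDS = {(1, 5), (2, 3), (4, 7), (6, 8)}


def compare_bonds(first, second):
    """Split two disulfide-pair sets into shared and type-specific bonds."""
    return {
        "shared": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


if __name__ == "__main__":
    for key, pairs in compare_bonds(NSLTP1_BONDS, NSLTP2_BONDS).items():
        print(key, ", ".join(f"C{a}-C{b}" for a, b in pairs))
```

The comparison shows that C2–C3 and C4–C7 are common to both types, while the partners of C5 and C6 are exchanged: C1–C6 and C5–C8 in NsLTP1 versus C1–C5 and C6–C8 in NsLTP2.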
Figure 1. Structures of representative members of protein families that contain the majority of the known food allergens of plant origin. The name of the allergen or protein and the protein family/subfamily are indicated below every structure, following the (A–L) sequence label of the individual panels. The coordinates of the structures were downloaded from the Worldwide Protein Data Bank, and the graphics were generated using the CCP4MG program. The PDB codes for the structures are included in the figure labels along with the names of the allergens. Each structure is shown as a ribbon diagram with a blend-through coloring scheme, with the N-terminus in blue and the C-terminus in red, except for the multimeric allergens Ara h 1 and Ara h 3, where the monomers are blended through different color ranges. Two panels of Ara h 1 are presented, with the right panel being the left panel rotated about a horizontal axis parallel to the paper pointing to the right. The side chains of cysteines that are involved in disulfide bonds are shown in ball-and-stick representation. Cysteines that are conserved in well-defined sequence motifs are labeled with their numbering in the motifs (see text).
2S albumins. Plant proteins coagulable by heat and soluble in water were called albumins in the early 20th century for their properties that resembled hen egg albumin
[14]. The 2S albumins migrated with a 2S sedimentation coefficient during sucrose gradient centrifugation
[37]. The 2S albumins also contain an 8CM similar to that of the NsLTPs but with longer sequences separating C2 from C3 and C6 from C7. Seed storage proteins are believed to accumulate in developing seeds to act as a nitrogen reserve for germination
[38][39]. The 2S albumins were considered to be a major group of storage proteins in many dicotyledonous plant species
[40] that also play a role in providing sulfur reserve in the seed
[37]. The 2S albumins have also been suggested to have antimicrobial activities
[41][42][43]. Known 2S albumin food allergens from the major allergen sources recognized by the FDA include peanut allergen Ara h 2
[44], soybean (
Glycine max) allergen Gly m 8
[45], Brazil nut (
Bertholletia excelsa Silva Manso) allergen Ber e 1
[46], cashew (
Anacardium occidentale L.) allergen Ana o 3
[47], hazelnut allergen Cor a 14
[48], pecan (
Carya illinoinensis (Wangenh.) K.Koch) allergen Car i 1
[49], pistachio (
Pistacia vera L.) allergen Pis v 1
[50], stone pine (
Pinus pinea L.) allergen Pin p 1
[51], sesame (
Sesamum indicum L.) allergens Ses i 1
[52] and Ses i 2
[53], black walnut (
Juglans nigra L.) allergen Jug n 1, and walnut allergen Jug r 1
[54].
The structures of many 2S albumins including food allergens in castor beans (
Ricinus communis L.) (Ric c 1)
[55], rapeseed (
Brassica napus L.) (Bra n 1)
[56], and Brazil nuts (Ber e 1)
[57] have been reported. The first structure reported for a 2S albumin allergen from the major allergen sources is that of Ara h 6, which was determined by NMR using recombinantly expressed Ara h 6 with uniform ^{15}N and ^{13}C labeling
[58]. Three peanut 2S albumins have been identified as food allergens. They are Ara h 2, Ara h 6, and Ara h 7. The structure of Ara h 2 was also reported (
Figure 1B). It was determined by X-ray crystallography using recombinantly expressed Ara h 2 with a maltose-binding protein fused to the N-terminus to enhance its solubility and aid its crystallization
[59]. These 2S albumins have the same disulfide bond connections as NsLTP2. However, Ara h 6 has an additional disulfide bond, which is formed by an extra cysteine between C6 and C7 (C6′) and another cysteine residue after C8 (C8′), as shown in one of the models of its NMR structure (
Figure 1C).
11S legumins. Both the 11S and the 7S seed storage proteins belong to the cupin superfamily, which was initially recognized based on a 50% sequence identity between the wheat protein germin and a slime mold (
Physarum polycephalum) protein spherulin
[60]. Germin is an unusually thermostable protein produced during the early phase of germination in wheat embryos. The sequence similarity was then extended to a group of germin-like proteins and globulin storage proteins. Globulins were classified as those soluble in dilute salt solution but insoluble in water
[14]. After structural information on canavalin
[61] and phaseolin
[62] became available, sequence alignment revealed a much larger group of proteins in this superfamily, and the family was given the name cupin
[63] (from the Latin term ‘
cupa’, which means small barrel). The cupin superfamily contains monocupins, bicupins, and multicupins. It is known to be one of the most functionally diverse protein superfamilies
[64], including various proteins with enzymatic functions, non-enzymatic transcription factors, and the 11S and 7S seed storage proteins.
The signature of the cupins includes two sequence motifs separated by an inter-motif sequence with variable length (from 11 amino acids to over a hundred residues). The first motif was defined as GX_{5}HXHX_{3–4}EX_{6}G, and the second motif was characterized as GX_{5}PXGX_{2}HX_{3}N. The two histidines and the glutamate in motif 1 and the histidine in motif 2 may act as ligands to bind metal ions. In many cupins with enzymatic activity, a metal ion is part of the active site
[65]. However, the motifs are now known to tolerate variations and not all cupins have a metal ligand
[64]. Nevertheless, residues other than those specified above can also provide metal ligand coordination
[66].
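The two-motif cupin signature described above can be sketched as a pair of regular expressions plus a check on the inter-motif spacing. The Python example below is illustrative only: `MOTIF1`, `MOTIF2`, and `cupin_signature` are hypothetical names, the minimum gap of 11 residues is taken from the range quoted in the text, and the patterns are the strict forms of the motifs. Because the motifs are now known to tolerate variation, a strict match like this one would miss many genuine cupins.

```python
import re

# Strict forms of the two cupin motifs quoted in the text.
MOTIF1 = re.compile(r"G.{5}H.H.{3,4}E.{6}G")   # G X5 H X H X3-4 E X6 G
MOTIF2 = re.compile(r"G.{5}P.G.{2}H.{3}N")     # G X5 P X G X2 H X3 N


def cupin_signature(sequence, min_gap=11):
    """Locate motif 1 followed by motif 2 at least min_gap residues later.

    Returns (motif1_span, motif2_span, gap) with 0-based, end-exclusive
    spans, or None when no such pair is found.
    """
    seq = sequence.upper()
    for first in MOTIF1.finditer(seq):
        second = MOTIF2.search(seq, first.end() + min_gap)
        if second is not None:
            return first.span(), second.span(), second.start() - first.end()
    return None
```

In this sketch the metal-binding candidates named in the text correspond to the H, H, and E of the first pattern and the H of the second, so their positions can be read from the returned spans.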
The 11S globulins are the most widespread among seed storage protein groups. They are present in monocot and eudicot seeds, as well as in conifers and other gymnosperms. They are particularly abundant in legume seeds and are often called legumins. Typical legumins have molecular weights (MW) of about 300–450 kDa and consist of six subunits of about 60 kDa. These subunits are the products of a multigene family. Each subunit is post-translationally processed to give rise to an acidic chain (MW about 40 kDa) and a basic chain (about 20 kDa)
[67]. The acidic and the basic chains are linked by a single disulfide bond. The 11S globulins are rarely, if ever, glycosylated. This family of proteins accounts for many of the known major food allergens from the FDA-recognized major allergen sources including peanut allergen Ara h 3
[68], soybean allergen Gly m 6
[69], almond allergen Pru du 6
[70], Brazil nut allergen Ber e 2, cashew allergen Ana o 2
[71], hazelnut allergen Cor a 9
[72], macadamia (
Macadamia integrifolia Maiden and Betche) allergen Mac i 2
[73][74], pecan allergen Car i 4
[75], pistachio allergen Pis v 2
[50], sesame allergens Ses i 6 and Ses i 7
[76], and walnut allergens Jug n 4
[77] and Jug r 4
[78].
The first structure reported for a legumin food allergen from the FDA-recognized major food sources is that of Gly m 6, which was determined before it was designated as a food allergen
[13][79]. The 11S seed storage proteins in many species have more than one type of subunit; Gly m 6 has five
[79]. Mature 11S proteins are hexamers that can be composed of different subunits, which makes crystallization problematic. The crystal structure of Gly m 6 was determined using protein purified from genetically modified soybeans in which four of the subunits of the 11S protein had been deleted. The peanut 11S food allergen is also known to be coded by at least five different genes
[80]. Nevertheless, the fraction of the mature protein that is composed of the product of a single gene may be high, and the crystallization of Ara h 3 from wild-type peanuts was successful
[81]. The first crystal structure of a peanut allergen was that of Ara h 3 (
Figure 1D)
[82]. The structure of another 11S allergen from the FDA-recognized major food sources, Pru du 6, was also determined
[83]. Generally, the 11S allergens are dimers of trimers. The doughnut-shaped trimer is made up of three subunits in head-to-tail association, and back-to-back binding of two trimers forms the hexameric structure of the native molecule. The dimerization of the trimers buries the N-terminus of the basic subdomain, which is generated by cleavage at a conserved peptidase recognition site. The C-terminus of the acidic domain, however, moves away from the trimer–trimer interface to facilitate the packing of the mature hexamer, making it impossible to express the 11S allergen recombinantly with most of the commonly used strategies
[83]. The structure of a putative 11S allergen purified from coconut was also determined recently
[84][85].
7S vicilins. The 7S globulins are called vicilins, and they are also present in flowering plants and other spermatophytes. Vicilins are trimeric proteins with MW of ~150–190 kDa and a typical subunit MW of ~50 kDa. No disulfide bonds are found in vicilins, but proteolytic processing and glycosylation may occur
[86][87]. Thus, the subunit structure of vicilins revealed by SDS-PAGE is similar in the absence or presence of reducing agents. Vicilins also account for many known major food allergens from the FDA-recognized major allergen sources including peanut allergen Ara h 1
[88], soybean allergen Gly m 5
[69], almond vicilin
[89], cashew allergen Ana o 1
[90], hazelnut allergens Cor a 11 and Cor a 16
[91][92], macadamia allergen Mac i 1
[73], pecan allergen Car i 2
[93], pistachio allergen Pis v 3
[94], Korean pine (
Pinus koraiensis Siebold and Zucc.) allergen Pin k 2
[95][96], walnut allergens Jug n 2 and Jug r 2
[97], and sesame allergen Ses i 3
[53].
Vicilin leader peptides. Vicilins from some species, such as pea (
Pisum sativum L.) allergen Pis s 1
[98], contain just the di-cupin region and a signal peptide. However, vicilins from other species are known to have a variable region between the signal peptide, which can be predicted
[99], and the
C-terminal di-cupin domains
[100][101][102]. This variable leader peptide (VLP) is also called the vicilin leader peptide. When such peptides are found in a food independently of the mature vicilin protein and have demonstrated allergenicity, they are designated as vicilin isoallergens by the WHO/IUIS Allergen Nomenclature Sub-committee, e.g., Ara h 1.0101 (26–84). In many vicilins, this region can consist of one (as in almond vicilin allergen
[89] and peanut allergen Ara h 1
[88]) or more (as in pecan allergen Car i 2
[93]) repeats of a coupled-C3C (cC3C) motif, which has a quartet of cysteines arranged as a pair of CX_{3}C motifs linked by 8–12 amino acids. Interestingly, depending on the plant species, none, part, or all of this variable region can be found in the native vicilin purified from the seeds. The cC3C region of macadamia nut vicilin was reported to have antimicrobial activities
[103]. Available data on the cC3C repeat in the variable region of vicilin food allergens are summarized in
Figure 2.
Figure 2. cC3C repeats of the variable leader peptide of known vicilin allergens. The domain structure of vicilin is shown at the top. Several features of the N-terminal variable region of vicilin between the signal peptide (Sig. P.) and the C-terminal cupin domains are shown below the domain structure. A question mark in the second column indicates that the full sequence of the variable region of the allergen is not available. “ND” means no data available, and superscripts in the table cells indicate the reference numbers of the cited literature. The variable leader peptides were delineated from the signal peptides and the N-termini of the natural allergens. The sequences of the allergens were downloaded from the protein database at NCBI. The signal peptides were predicted using SignalP 6.0, and the N-termini were those reported in the relevant references. Almond vicilin was reported to be an allergen but does not have a name designated by the Allergen Nomenclature Sub-committee.
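The cC3C unit described in the text (two CX_{3}C elements joined by a linker of 8–12 residues) can be counted in a leader-peptide sequence with a short regular expression. The Python sketch below is a hedged illustration: `CC3C` and `count_cc3c` are hypothetical names, repeats are counted without overlap, and the wildcard for X also matches cysteine, so the counts for real sequences should be checked against the published annotations.

```python
import re

# One coupled-C3C unit as described in the text: C X3 C, 8-12 residues, C X3 C.
CC3C = re.compile(r"C.{3}C.{8,12}C.{3}C")


def count_cc3c(leader_peptide):
    """Count non-overlapping cC3C units in a leader-peptide sequence."""
    return len(CC3C.findall(leader_peptide.upper()))
```

Applied to the variable region between the predicted signal peptide and the cupin domains, a count of one would correspond to the single-repeat case and a count above one to the multi-repeat case described in the text.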