2.2. Challenging Evaluation of Plant TE Diversity and Classification
Our knowledge of TE structure, organization and transposition mechanisms has greatly progressed since the discovery of the first TE sequences in plants (the maize Ac/Ds elements) in 1984 by two different laboratories [14][15]. Transposable elements are extremely diverse and use various mechanisms to move. The so-called autonomous elements encode all the specific functions required for their mobility, while some non-autonomous elements hitch-hike on mobility proteins from autonomous copies, or from other TEs, to transpose.
Several types of TEs might co-exist in a given genome, and each TE type can harbor multiple TE copies clustered into different families according to their sequence similarity. As some transposition mechanisms are prone to generate mutations, each family might have evolved over time, displaying a continuum of more or less diverged copies composed of both autonomous and defective elements [7][8][10][16]. Such behaviour has made TE identification and classification a difficult task.
In 1989, the first TE nomenclature organized TEs into two classes according to their transposition intermediate [17]: (1) an RNA intermediate for retrotransposons (class I elements) that move via a replicative “copy-and-paste” mechanism, where a “mother” copy gives rise to several “daughter” copies without excising itself; (2) a DNA intermediate for DNA transposons (class II elements) that use a conservative “cut-and-paste” mechanism for their transposition, where the “mother” copy excises from its location to insert elsewhere in the genome (“jumping gene” concept).
This bimodal classification was further refined in 2007 (1) by creating subclasses in the class II group to include TEs that use a DNA replicative mechanism, such as Helitrons, and (2) by setting the “80-80-80” rule to define the identity percentage TE copies should share to belong to the same family, i.e., sequences over 80 bp sharing at least 80% sequence identity in at least 80% of their internal domain or terminal repeats (or both) [16][18][19]. This hierarchical TE classification splits TEs into the two previously cited classes (Class I and Class II), then into subclasses, orders and superfamilies [18]. It is based on the presence and order of coding regions for specific proteins or structural motifs present in TE sequences, and on their transposition specificities (presence and sequence of target site duplications (TSDs), i.e., short 2–11 bp DNA duplications generated at the new insertion site upon TE transposition). For example, the presence of long terminal repeats (LTRs) in direct orientation at the 5′ and 3′ ends of a TE is the signature of LTR-retrotransposons (LTR-RTs). The occurrence of a reverse transcriptase (RT) domain in a TE ORF is a hallmark of most but not all Class I retrotransposons. Like other self-replicating entities such as viruses, TEs have a modular evolution, exchanging essential or facultative protein-coding domains, which may blur TE classification [19].
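The “80-80-80” rule described above can be sketched as a simple predicate. This is a simplified reading of the rule as worded in this section, not the procedure of any particular annotation pipeline: the function name `same_family` and its three parameters are hypothetical, and the alignment statistics are assumed to have been computed beforehand by a sequence aligner.

```python
def same_family(aligned_length_bp: int,
                percent_identity: float,
                percent_coverage: float) -> bool:
    """Return True if two TE copies satisfy the 80-80-80 rule.

    aligned_length_bp: length of the aligned segment, in bp
    percent_identity:  sequence identity over that segment (0-100)
    percent_coverage:  share of the internal domain or terminal repeats
                       covered by the alignment (0-100)
    All three inputs are assumed to come from a prior pairwise alignment.
    """
    return (aligned_length_bp > 80          # "sequences over 80 bp"
            and percent_identity >= 80.0    # "at least 80% sequence identity"
            and percent_coverage >= 80.0)   # "in at least 80% of" the region


print(same_family(1200, 91.5, 88.0))  # True
print(same_family(1200, 76.0, 95.0))  # False: identity below 80%
```

In practice, the region considered (internal domain, terminal repeats, or both) has to be chosen before the coverage is computed, which this sketch leaves to the caller.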
Although these first four classification levels are generally well accepted, the need for extra levels to reach the TE family level is still unclear and may vary according to the TE considered. For example, the superfamily subgroups “chromovirus” and “non-chromovirus” have been introduced for Gypsy LTR-RTs [20]. An increase in family/lineage number, or even new superfamilies, will probably arise with the accumulation of genome sequencing data and the improvement of bioinformatic pipelines for TE detection and annotation. This hierarchical classification also does not include some non-autonomous TEs, i.e., TEs that are still able to transpose but need additional functions supplied in trans by another TE (phylogenetically related or not). These include LARDs (large retrotransposon derivatives), TRIMs (terminal repeat retrotransposons in miniature) and SMARTs (small LTR-retrotransposons) for Class I TEs, and MITEs (miniature inverted-repeat transposable elements), SNACs (small non-autonomous CACTA) or MULEs (Mu-like elements) for class II TEs. Many such non-autonomous elements share enough sequence similarity/motifs to be easily linked to a candidate “helper” autonomous TE family, as reported in rice with the isolation of the complete RIRE2 Gypsy retrotransposon, whose LTRs display high sequence similarity to the extremities of the defective Dasheng retrotransposon [21]. Another example is the description of both complete and truncated, but nevertheless active, CACTA Caspar elements in Triticeae [16][22]. Other non-autonomous TEs may not present any clear feature allowing them to be classified, such as some short non-autonomous TIR-harboring TEs that may share only a few bases of homology with the autonomous helper [16]. In the classification we propose, we choose to include the non-autonomous SINEs as a full order, as these TEs do not correspond to deleted versions of autonomous class I TEs.
Table 1 and Figure 1 present, respectively, the up-to-date classification and structures of plant transposable elements adapted from Wicker et al. [16]. It is important to note that only a subset of the TE superfamilies existing in all living organisms (as reported in Repbase, https://www.girinst.org/repbase/update/browse.php, accessed on 10 March 2022) has been detected in land plants (~30% of class II superfamilies and ~17% of class I superfamilies, representing 20% of all described superfamilies). Plant TEs fall into six different orders: four orders corresponding to class I elements (LTR-RTs, Penelope-like elements (PLEs), long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs)) and two orders including elements of class II (terminal inverted repeat (TIR) transposons, Helitrons) (Table 1).
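The six plant TE orders listed above can be written down as a small lookup table, which is sometimes convenient when tallying annotation results by class. The sketch below only restates the class and subclass assignments given in this section and in Table 1; the names `PLANT_TE_ORDERS` and `orders_by_class` are hypothetical, and the short order labels are abbreviations chosen for the example.

```python
from collections import defaultdict

# order label -> (class, subclass); subclasses are only defined for class II
PLANT_TE_ORDERS = {
    "LTR-RT":   ("Class I", None),
    "PLE":      ("Class I", None),
    "LINE":     ("Class I", None),
    "SINE":     ("Class I", None),
    "TIR":      ("Class II", "Subclass 1"),
    "Helitron": ("Class II", "Subclass 2"),
}


def orders_by_class(orders):
    """Group order labels by TE class, keeping the input order."""
    grouped = defaultdict(list)
    for order, (te_class, _subclass) in orders.items():
        grouped[te_class].append(order)
    return dict(grouped)


print(orders_by_class(PLANT_TE_ORDERS))
# {'Class I': ['LTR-RT', 'PLE', 'LINE', 'SINE'], 'Class II': ['TIR', 'Helitron']}
```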
Figure 1. Structure and organization of plant transposable element superfamilies (adapted from [16]). Schemes are not to scale. Protein coding domains: APE = apurinic endonuclease, CHR = chromodomain, EN = endonuclease, GAG = capsid protein, HEL = helicase, INT = integrase, PROT = proteinase, RH = RNase H, RPA = replication protein A, RT = reverse transcriptase. eORF = extra open reading frame (unknown function), Tpase = transposase (* with DDE motif), YR = tyrosine recombinase, Y2 = YR with YY motif, ◊ = different possible locations of an additional cellular-like ribonuclease H (aRH) specific to the Tat lineages (see Table 1). Optional protein-coding domains only present in some superfamily lineages are indicated in brackets. Some structural features are also represented. Terminal repeats in the same or reverse orientation are indicated by black arrows, and purple rectangles refer to diagnostic sequences present in non-coding sequences. The specific terminal bases of some TEs are also indicated. PBS = primer binding site, PPT = polypurine tract. The interrupted line in the Helitron representation means that the region may contain one or more additional ORFs.
Table 1. Plant transposable element (TE) classification compiled from [16], with updates from [23] for Copia lineages, [20] for Gypsy LTR retrotransposons (LTR-RTs), [24][25][26] for Penelope-like elements (PLEs), [27] for long interspersed nuclear elements (LINEs), [28][29] for short interspersed nuclear elements (SINEs), and [30][31] for Sola elements.
| Class | Order (Non-Autonomous TE Name) | Superfamily | Family/Lineage | Plant Family Examples |
|---|---|---|---|---|
| Class I | LTR-Retrotransposons | Copia | Osser | Volvox carteri Osser |
| (retrotransposons) | (LARD) | | Bryco | representatives in moss species |
| | (TRIM/SMART) | | Lyco | representatives in clubmoss species (Lycopodiaceae) |
| | | | Gymco-I | representatives in gymnosperm species |
| | | | Gymco-II | representatives in gymnosperm species |
| | | | Gymco-III | representatives in gymnosperm species |
| | | | Gymco-IV | representatives in gymnosperm species |
| | | | Ale/Retrofit | Oryza longistaminata Retrofit, Oryza sativa Hopscotch |
| | | | Ivana | Oryza sativa Oryko1-1 and Ilona, Hordeum vulgare HORPIA, Nicotiana tabacum Queenti |
| | | | Ikeros | Zea mays Sto-4 |
| | | | Tork | Nicotiana tabacum Tnt1, Tto1 and Tnt2, Solanum lycopersicum Tork4, Ipomoea batatas Batata |
| | | | Alesia | low copy number representatives in many Angiosperms, close to the Ale lineage |
| | | | Angela | Triticum aestivum Angela, Oryza sativa RIRE1, Hordeum vulgare BARE1 |
| | | | Bianca | Triticeae Bianca, Arabidopsis thaliana RomaniAT5 |
| | | | SIRE/Maximus | Solanum lycopersicum ToRTL1, Zea mays Opie-2, Glycine max SIRE1 |
| | | | TAR | Oryza spp. Houba and Osr-1, Arabidopsis thaliana ATcopia95 |
| | | Gypsy (Chromovirus) | Galadriel | Solanum esculentum Galadriel, Musa Monkey, Tntom1 |
| | | | Tekay | Hordeum vulgare Bagy-1, Arabidopsis thaliana Legolas Peabody, Oryza sativa RIRE3, Lilium henryi Del |
| | | | Reina | Zea mays Reina, Arabidopsis thaliana Gloin or Gimli |
| | | | CRM | Zea mays CRM (centromeric retrotransposon of maize), Beta vulgaris Beetle1, Oryza sativa RIRE7 |
| | | (Non-chromovirus) | Phygy | Physcomitrella patens Chr21 (4035670,4045566) |
| | | | Selgy | Selaginella moellendorffii LTR-RT |
| | | | Athila | Arabidopsis thaliana Athila4-1, Diaspora, Hordeum vulgare Bagy-2 |
| | | | TatI | Selaginella moellendorffii LTR-RT |
| | | | TatII | Picea abies, Picea glauca LTR-RTs |
| | | | TatIII | Picea abies, Picea glauca LTR-RTs |
| | | | Ogre/TatIV + TatV | Pisum sativum Ogre |
| | | | Retand/TatVI | Zea mays Cinful-1, Arabidopsis thaliana Tat4-1, Oryza sativa RIRE2, Sorghum bicolor RetroSor1, Silene latifolia Retand |
| | Non-LTR retrotransposons PLE | Penelope/Poseidon | | Pinus taeda (loblolly pine) and Picea abies (Norway spruce) Dryad PLEs by horizontal transfer |
| | | EN(-)PLE | | Selaginella moellendorffii spike moss, Pinus taeda and Picea abies EN(-)PLEs |
| | LINE | L1 | Llb | sweet potato Llb, Beta vulgaris BvL1 |
| | | | LINE-CS | Cannabis sativa LINE-CS, Beta vulgaris Belline2, Belline5 |
| | | | BNR | Beta vulgaris Belline1/BNR |
| | | | PUR | Carica papaya L1-26_Cpa, Solanum tuberosum L1-3_Stu, Vitis vinifera |
| | | | Cin4 | Zea mays Cin4 |
| | | | Karma | Oryza sativa Karma |
| | | | nubo | Oryza sativa LINE-1 or OSLINE1-4, Zea mays L1-2_ZM |
| | | RTE | plant RTE | Malus x domestica RTE-1_Mad, Solanum tuberosum RTE-1_Stu |
| | SINE | tRNA | | Nicotiana tabacum TS, Au, Solanales SolS-II, Brassicales BraS-I, SB families, mainly found in Angiosperms |
| Class II Subclass 1 | TIR (MITE) | Tc1-Mariner | | Stowaway (MITE): Sorghum bicolor Stowaway, Brassica BraSto |
| | | hAT | | Zea mays Ac/Ds, Antirrhinum majus Tam3, Nicotiana tabacum Slide |
| | | Sola | | Physcomitrella Sola1, found also in Capsicum annuum and C. baccatum |
| | (MULE) | MuDR-Foldback | | Zea mays Mu, MULEs |
| | | PIF-Harbinger | | Zea mays PIFa, Oryza sativa Pong; Tourist (MITE): mPing/Ping; mPIF/PIFa |
| | | CACTA | | Zea mays En/Spm, Arabidopsis thaliana CAC1, Antirrhinum majus Tam1, Petunia hybrida PsI |
| Subclass 2 | Helitron | Helitron | | Oryza sativa, Arabidopsis thaliana AthE1 Atrep, Ipomoea tricolor Hel-It1 |
In plants, class II elements (subclass 1) belonging to the TIR order (also called DNA transposons) fall into six superfamilies, based on the structure and sequence of their transposase and on the sequence of their terminal inverted repeats (TIRs) (Figure 1). The transposase is the protein catalyzing their transposition, while TIRs harbor key sequences recognized by the DNA-binding domains of the transposase during a transposition event. Some TIR elements also harbor additional coding sequences, such as the maize MuDR and plant CACTA or PIF/Harbinger elements [16]. Most of these superfamilies are also characterized by specific target site duplication (TSD) lengths; TSDs are generated when the DNA nicks made by the transposase at the integration site are filled in. Their transposition is not always strictly conservative and can lead to an increase in copy number if it occurs before a DNA replication fork [32].
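The two structural signatures mentioned above, TIRs and TSDs, lend themselves to a short illustration. The sketch below is a toy example that only handles perfect matches: the function names, the `max_len` default, and the example sequences are all hypothetical, and the 2–11 bp TSD range is the one quoted earlier in this section. Real TIRs and TSDs are often degenerate, so dedicated tools tolerate mismatches.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def tir_length(element: str, max_len: int = 50) -> int:
    """Length of the perfect terminal inverted repeat of an element.

    Counts how many leading bases equal the reverse complement of the
    trailing bases, stopping at the first mismatch.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


def tsd_length(left_flank: str, right_flank: str,
               min_len: int = 2, max_len: int = 11) -> int:
    """Longest perfect duplication shared by the two flanks (0 if none).

    Compares the last k bases before the element with the first k bases
    after it, for k from max_len down to min_len.
    """
    longest = min(max_len, len(left_flank), len(right_flank))
    for k in range(longest, min_len - 1, -1):
        if left_flank[-k:].upper() == right_flank[:k].upper():
            return k
    return 0


# Made-up element: 9 bp TIR, 8 bp internal region, inverted copy of the TIR
element = "CACTACAAG" + "ATGCCGTA" + reverse_complement("CACTACAAG")
print(tir_length(element))               # 9
print(tsd_length("GGATTCA", "TCAGGCT"))  # 3
```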
Replicating plant TEs fall into two major groups. (1) Helitrons (class II, subclass 2; see Table 1) replicate through a rolling-circle (RC) mechanism from one DNA strand, without generating TSDs, by using a RepHel protein with RC replication initiator (Rep) and DNA helicase (Hel) domains, in association with an ssDNA-binding “replication protein A” (RPA) [33]. (2) Class I retrotransposons replicate from RNA templates by reverse transcription using a TE-encoded reverse transcriptase (RT) and use at least one additional protein to mediate their insertion into the host genome, such as an endonuclease (EN) or a DDE integrase (INT). We do not include the DIRS superfamily among land plant TEs, as DIRS elements have so far only been found in green algae [34].
Among the four retrotransposon orders present in plants, SINEs occupy a particular place, as these small, non-coding and non-autonomous elements of a few hundred base pairs exploit the transposition machinery of LINEs to ensure their amplification. Plant SINEs are derived from tRNAs [28]. They are transcribed by RNA polymerase III, harbor short degenerate internal promoters (A and B boxes), and mostly display an A tail at the 3′-end. Apart from these small structural domains, SINEs display a high sequence diversity that hinders their detection and characterization. Recently, a 37 bp Angio-domain located at the 3′-end has been reported in many Angiosperm SINEs [29].
The second non-LTR retrotransposon order present in plants, LINEs, contains elements belonging to the L1 and the RTE (retrotransposable element) superfamilies (Table 1 and Figure 1), which are two of the five known superfamilies of LINEs detected in eukaryotes [16]. RTE and L1 LINEs have one or two open reading frames (ORFs), respectively, and code for proteins required for retrotransposition, such as an endonuclease (EN), an RT, and often a ribonuclease H (RNase H, RH). The L1 ORF1 is involved in the binding, protection and transport of the RNA intermediate used for retrotransposition. At their 3′-end lies a stretch of (A)n for L1 or (GTT)n for RTE, involved in the initiation of reverse transcription. A recent study shows that plant LINEs extracted from 23 genomes fall into only seven L1 and one RTE families/lineages/subclades [27]. As reverse transcription starts from the 3′-end of LINEs and does not always reach the 5′-end, many incomplete daughter copies can be generated.
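The 3′-end signatures described above, (A)n for L1 and (GTT)n for RTE, can be looked for with a simple pattern match. The sketch below is only illustrative: the function name `line_tail_type` and the minimum repeat counts are arbitrary assumptions made for the example, and a tail on its own is not sufficient to assign an element to a superfamily.

```python
import re


def line_tail_type(seq: str, min_a: int = 6, min_gtt: int = 3):
    """Guess the LINE tail type from the 3' end of a sequence.

    Returns "RTE" for a terminal (GTT)n run, "L1" for a terminal (A)n run,
    and None otherwise. The thresholds min_a and min_gtt are assumptions.
    """
    seq = seq.upper()
    if re.search(r"(?:GTT){%d,}$" % min_gtt, seq):
        return "RTE"
    if re.search(r"A{%d,}$" % min_a, seq):
        return "L1"
    return None


print(line_tail_type("ACGTGCAAAAAAAA"))   # L1
print(line_tail_type("ACGTGCGTTGTTGTT"))  # RTE
print(line_tail_type("ACGTGC"))           # None
```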
Between their bordering direct repeats (5′-LTR and 3′-LTR), autonomous LTR-retrotransposons (LTR-RTs) code for structural capsid-like (GAG) and functional (POL) proteins needed for their retrotransposition cycle (RT = reverse transcriptase, RH, INT = integrase), which resembles the replication cycle of retroviruses. Only two out of the five superfamilies found in eukaryotes are represented in plants [16]. Plant LTR-RTs are classified into the Copia/Ty1 or Gypsy/Ty3 superfamilies according to the order of their coding pol domains. Recently, a systematic survey of LTR-RTs in 80 plant genomes refined the classification by introducing 16 lineages/families into the Copia/Ty1 superfamily and 14 lineages/families into the Gypsy/Ty3 superfamily (six with a chromodomain and eight without) [20]. Two Gypsy lineages, Chlamyvir and Tcn1, which only have representatives in algae and non-Viridiplantae species, have not been included in Table 1. Non-autonomous derivatives of variable size (from a few hundred bp up to 25 kb) have been characterized in plants; they contain, between both LTRs, a DNA sequence of variable length that is either non-coding or reminiscent of some retrotransposon internal domains. Large internal sequences (>4 kb) define LARDs (large retrotransposon derivatives), and short ones (<4 kb) are often called TRIMs (terminal repeat retrotransposons in miniature) [16].
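The size criterion given above for naming non-autonomous LTR-RT derivatives amounts to a single threshold, as the following sketch shows. The function name `classify_ltr_derivative` is hypothetical, and the handling of an internal region of exactly 4 kb is an assumption, since the text only defines the cases above and below that value.

```python
def classify_ltr_derivative(internal_length_bp: int) -> str:
    """Name a non-autonomous LTR-RT derivative from its internal region length."""
    if internal_length_bp > 4000:
        return "LARD"   # large retrotransposon derivative (>4 kb)
    if internal_length_bp < 4000:
        return "TRIM"   # terminal repeat retrotransposon in miniature (<4 kb)
    return "unassigned"  # exactly 4 kb: not covered by the stated criterion


print(classify_ltr_derivative(6500))  # LARD
print(classify_ltr_derivative(300))   # TRIM
```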
Retrotransposons belonging to the Penelope-like element (PLE) order are also found in some plant genomes, but with a patchy distribution. PLEs encode an RT domain related to telomerase, a highly specialized class of non-mobile RTs responsible for chromosome end maintenance in most eukaryotes. Some PLEs also carry a second domain, an EN with a specific GIY-YIG motif. PLEs are bordered by repeats in direct or reverse orientation and are often subjected to 5′ truncation upon retrotransposition, like non-LTR retrotransposons. EN(+)PLEs (Dryad elements belonging to the Penelope/Poseidon group) and EN(-)PLEs have been found in some conifer genomes (Table 1, Figure 1), and some of them presumably derive from a horizontal transfer (HT) event [26].
The accumulation of sequenced genomes and TE detection pipelines allows the analysis and comparison of TE composition and diversity across plant genera. Figure 2 presents a heatmap of the genomic percentage of four types of TEs (LTR-RTs, LINEs, SINEs and TIR DNA transposons) across 74 Angiosperm species displaying variable genome sizes (data collected from Supplementary Tables S1 and S2 of [35]). Among the different types of TEs, LTR-RTs (Copia/Ty1 and Gypsy/Ty3) occupy the largest proportion of these genomes, reaching up to 80% in Zea mays. Such an increase can result from the rapid amplification of only a few families. For example, Oryza australiensis has undergone a recent burst of transposition involving only three families (one Copia, RIRE1, and two Gypsy, Wallabi and Kangourou), which compose 60% of its genome [36]. Genomes of the legume tribe Fabeae are also dominated by the Ty3/Gypsy Ogre family/lineage, which accounts for 57% of genome size variation on average in this clade [37]. The predominance of one type of LTR-RT may vary depending on the taxa considered. For example, in Gossypium species, Copia LTR-RTs have accumulated in the small genome of G. raimondii (880 Mb), while Gypsy LTR-RTs (mainly Gorge3) have proliferated in the large-genome lineages of G. herbaceum (1667 Mb) and G. exiguum (2460 Mb) [38]. Plant species belonging to the same order (see Brassicales and Poales, Figure 2) might display different TE compositions and genome sizes. Some plant species also harbor a specific TE composition, as shown in Figure 2, with a dominance of LINE retrotransposons or TIR DNA transposons for the aquatic plant coontail Ceratophyllum demersum (33.6% of total genome size) and the small herbaceous plant Trichopus zeylanicus (~27.3% of total genome size). Non-LTR retrotransposons have been shown to be abundant (~11.7%) in other plant genomes, such as Arachis ipaensis, one of the peanut parental genomes [39].
Figure 2. Transposable element (TE) profiles in some land plant genomes. Species are clustered according to their TE profiles. TE percentages, plant orders and genome size estimations for 74 land plant species have been collected from [35] (Supporting Information, Tables S1 and S2). Some plant orders, such as Poales or Brassicales, have been highlighted in color (green and red, respectively) in order to underline the diversity of TE composition between species belonging to the same plant order. Plants belonging to different orders, such as Dioscoreales and Ceratophyllales (in blue), can share similar TE compositions.