LINE-1 (L1) is a class of autonomous mobile genetic elements that generate somatic mosaicism in various tissues of the organism. The activity of L1 retrotransposons is strictly controlled by many factors in somatic and germ cells at all stages of ontogenesis. Altered L1 activity has been noted in a number of diseases, including neuropsychiatric and autoimmune diseases and various forms of cancer.
Dispersed DNA repeats of LINE-1 (L1) retrotransposons account for 17% of the human genome [1]. Most L1 copies, of which there are more than 500 thousand, are inactive because they are truncated or carry mutations in the protein-coding sequences required for retrotransposition [2]. However, approximately 150 copies are full-length and capable of copying themselves and spreading through the genome [3,4,5]. Moreover, L1 elements drive the spread of other genetic repeats, such as Alu and SVA [6,7,8]. Insertions of L1 elements occur mainly in the non-coding regions of the genome: introns and intergenic regions [9]. The presence of an L1 element at a given locus can affect gene expression and even lead to the formation of alternative transcripts, which can contribute significantly to the functions of individual cells, tissues, and the whole organism [10,11].
Figure 1) [12,13,14]. ORF1 encodes a ~40 kDa protein with the chaperone activity necessary to stabilize a new L1 copy [15,16,17,18]. ORF2 encodes a ~150 kDa protein with the endonuclease and reverse transcriptase activities required for retrotransposition [15,19,20]. In the antisense orientation, ORF0 and two antisense promoters are located in the 5′ and 3′ UTRs. The function of ORF0 remains poorly understood. According to some data, ORF0 is involved in the formation of chimeric proteins or enhances the mobility of L1 [13,21,22,23].
Figure 1. The structure of a full-length copy of the L1 retrotransposon. ORF1 consists of an N-terminal domain (N), a coiled-coil domain (CCD), an RNA recognition motif (RRM), and a C-terminal domain (CTD) [18]. ORF2 consists of an endonuclease domain (EN), a reverse transcriptase domain (RT), a cryptic domain (Cry), a Z-domain (Z), and a C-terminal domain with a cysteine-rich region (Cys-rich) [24].
Figure 2) [20,25,26]. A new RNA copy of the L1 element is transcribed from the sense strand, driven by the strong promoter in the 5′ UTR [14,27]. The new copy is then polyadenylated and leaves the nucleus [28]. Translation of ORF1 and ORF2 and assembly of the L1 ribonucleoprotein (L1 RNP) occur in the cytoplasm. The L1 RNP consists of the bicistronic L1 mRNA coated with ORF1 proteins together with one or two copies of the ORF2 protein [29]. In the cytoplasm, polyadenylate-binding protein 1 (PABPC1) binds to the poly(A) tail of L1; its presence is critical for the formation of L1 RNP [28]. Transport of L1 RNP from the cytoplasm to the nucleus involves the membrane-associated endosomal sorting complex required for transport (ESCRT) [30]. It has also recently been shown that the ORF1 protein interacts with the karyopherins KPNA2 and KPNB1, and possibly with other KPNA family proteins involved in nuclear protein import through the nuclear pore complex [18]. In a cancer cell model, L1 RNP entered the nucleus during mitosis and the new L1 copy was integrated into the genome in the S phase of the cell cycle [31]. Interestingly, the association of L1 retrotransposition with particular stages of the cell cycle can differ between tissues. For example, in neuronal cell cultures, retrotransposition has been shown to occur in non-dividing cells [32]. After the L1 RNP enters the nucleus and reaches the genomic DNA, the endonuclease recognizes the consensus cleavage site 5′-TTTT/AA-3′ [20,33,34,35] and creates a single-strand DNA break with a 5′-phosphate (5′-PO4) and a 3′-hydroxyl (3′-OH) group at the ends [20]. The L1 transcript anneals via its poly(A) tail to the region of the endonuclease recognition site, and reverse transcription of the L1 RNA begins [36,37,38].
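As an illustration of the endonuclease recognition step, the following minimal Python sketch scans a DNA sequence for exact matches to the consensus cleavage site 5′-TTTT/AA-3′ on both strands and reports where the nick would fall. It is a simplification and not a published tool: the function names, the coordinate convention, and the requirement for an exact match are assumptions made for this example, and the degenerate site preferences of the real enzyme are not modeled.

```python
# Consensus L1 endonuclease site from the text: 5'-TTTT/AA-3'.
# "/" marks the nick, i.e. the break falls after the fourth T.
CONSENSUS = "TTTTAA"
NICK_OFFSET = 4

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def _match_starts(seq):
    """Start indices of every exact consensus match in seq."""
    starts = []
    i = seq.find(CONSENSUS)
    while i != -1:
        starts.append(i)
        i = seq.find(CONSENSUS, i + 1)
    return starts


def find_cleavage_sites(seq):
    """List (strand, nick) pairs for exact consensus matches.

    nick is a 0-based boundary on the forward strand: the break lies
    between forward bases nick - 1 and nick. Strand "+" means the site
    is read on the given strand, "-" on its reverse complement.
    (Hypothetical convention chosen for this sketch.)
    """
    seq = seq.upper()
    n = len(seq)
    sites = [("+", s + NICK_OFFSET) for s in _match_starts(seq)]
    # Convert reverse-strand coordinates back to forward coordinates.
    sites += [
        ("-", n - (s + NICK_OFFSET))
        for s in _match_starts(reverse_complement(seq))
    ]
    return sites
```

For example, `find_cleavage_sites("GGTTTTAACC")` returns `[("+", 6)]`: the site is read on the given strand and the nick falls between the fourth T and the first A.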
The classical mechanism of retrotransposition requires host proteins involved in DNA repair and replication [34,39,40]. A complex of PARP1 and PARP2 proteins forms at single-strand DNA breaks [40]. PARP2 specifically recognizes the single-stranded DNA gap at the L1 integration site and is activated by poly-ADP-ribosylation (PARylation). Activated PARP2 interacts with the RPA complex, which allows the newly synthesized L1 copy to be integrated into the DNA. Replication protein A (RPA), a heterotrimeric complex consisting of the RPA70, RPA32, and RPA14 proteins, is required in eukaryotes to bind single-stranded DNA and to protect it from cytidine deamination [41]. The role of PARP1 in L1 retrotransposition is not fully understood, but PARP1 has been shown to interact directly with ORF2 through the reverse transcriptase domain. The absence of either PARP1 or PARP2 decreases retrotransposition by about 50%; the absence of both proteins or of the RPA complex reduces L1 retrotransposition by 80% [40]. The ORF2 complex, which forms at the site of integration into the genome and promotes reverse transcription, includes various proteins involved in DNA stabilization and enzyme processivity [42]. The first of these is the proliferating cell nuclear antigen (PCNA). ORF2 interacts with PCNA, and this interaction is critical for retrotransposition [43]. The RUVBL1 and RUVBL2 repair proteins are also required for L1 spreading, and their absence decreases retrotransposition [43]. In addition, the nonsense-mediated decay factor UPF1 and the MOV10 helicase were detected in the L1 RNP. Interestingly, UPF1 knockdown increases the amount of L1 mRNA and proteins but simultaneously reduces the efficiency of retrotransposition [43]. Inhibitory activity against L1 was shown for MOV10.
However, a recent study suggested that MOV10 may facilitate the attachment of UPF1 to L1 RNP [42,44]. Insertion of a full-length L1 copy is a rare event; usually, the new copy is truncated at the 5′ end [26,45,46,47]. The double-strand DNA break repair factors XRCC6 (Ku70/Ku80), Artemis (DCLRE1C), and LigIV (LIG4) are involved in the truncation of the new L1 copy [46] (Figure 2). The exact mechanism of their action remains unclear. It has been suggested that XRCC6 facilitates the attachment of ORF2 to the upstream target DNA, thus accelerating the completion of retrotransposition and leading to truncation [46]. After synthesis of the first (reverse complement) strand of the new L1 copy, the second strand of the target DNA is cleaved and the second strand of the new L1 copy is synthesized by host enzymes involved in DNA replication and repair [34,40]. Retrotransposition can also proceed by an alternative endonuclease-independent (EN-independent) mechanism in p53-defective cells or in cells carrying mutations in non-homologous end joining (NHEJ) DNA repair genes; this mechanism apparently uses existing DNA breaks to initiate reverse transcription [45,48,49,50].
Figure 2. Scheme of the classical retrotransposition mechanism of L1. The transition of L1 from one stage of retrotransposition to another is indicated by blue dashed arrows. The upper left part of the figure shows the expression of a full-length copy of active L1 in the cell nucleus. The L1 RNA transcript (marked in red) is transported to the cytoplasm. The L1 ORF1p and ORF2p proteins are synthesized and the L1 RNP is formed (lower right part of the figure). Then, through the endoplasmic reticulum (EPR) and the nuclear pore complex (NPC), the L1 RNP is transported to the nucleus, and the L1 DNA copy formed by reverse transcription is integrated into a new genomic locus (upper right part of the figure). The cellular factors involved in the retrotransposition process that are described in this review are also depicted.
L1s belong to the LINE class of mobile genetic elements and are found in the genomes of animals and plants [51,52]. Although animal L1 occurs in the genomes of some protostomes, the gradual accumulation and evolution of modern L1s can be traced in deuterostomes, in which L1s fall into three highly divergent groups: one uniting species from echinoderms to teleost fishes; one in non-mammalian vertebrates; and one in vertebrates from fish to mammals [52,53].
4.1. LINE Evolution in Deuterostomes and Non-Mammals
A high diversity of ancient L1 families was found in the lancelet, sea urchin, and tunicates; despite this variability, they make up a small proportion of the repeated sequences [53,54]. Mobile elements are in constant competition with each other and with cellular factors limiting their activity, and they persist by retaining the ability to spread in the genome and increasing their copy number by capturing new genomic loci. However, this battle is not always successful: in most bony fish genomes, with some exceptions, LINEs are few compared to other mobile elements and are outnumbered by DNA transposons [54,55,56]. In contrast, in the known living representatives of jawless fishes, cartilaginous fishes, coelacanths, and lungfishes, LINEs are no less abundant than other classes of DNA repeats and constitute 25–50% of all repeats in the genome [54,57,58,59,60,61,62]. Interestingly, increased diversity of L1 families is observed in fish [53]. Moreover, the highest diversity of L1 was found in the African coelacanth. Nonetheless, the most successful LINEs in this group are still CR1 and L2 [56]. A high diversity of L1 was also observed in amphibians, although, as in bony fish, the number of LINEs remains small and most repeats are either DNA transposons or LTR elements [53,54,58]. In reptiles, except the green anole, several LINE families (CR1, BovB, L2) are evolutionarily successful: they remain active in the genomes, and their number increases relative to other repeats [55,63,64,65,66,67,68]. The tuatara genome is distinguished by a variety of repeats, among which L2 is the most successful group [68]. CR1 is the most widespread LINE in the genomes of turtles, crocodiles, and birds, where it has completely displaced L1. The success of these elements was facilitated by the highly conserved hairpin structure and octameric microsatellite motif in their 3′ UTR [65,69,70,71]. In avian genomes, genome size and the number of repeated sequences decreased sharply, and LINE/CR1 elements make up the bulk of the remaining repeats [71].
4.2. LINE Evolution in Mammals
In mammals, many LINEs lost the ability to spread due to various mutations and truncations of full-length copies [64,72]. Only one family of LINEs remains active: L1, the most successful group of mobile elements in mammals. An exception is the monotremes, which have no L1 sequences [52]; in the platypus, L2 is the most prevalent LINE [73]. Metatheria (marsupials) and Eutheria (placental mammals) are similar in the composition and evolutionary tendencies of their mobile elements. In the genomes of most mammalian species, L1 has become the most successful and active group, while many ancient repeats have gradually disappeared from the genome in the course of evolution. Interestingly, some enhancers and ultra-conserved elements originated from ancient retrotransposon repeats [55,74,75]. Active L1s are species-specific genomic elements. Nevertheless, their structure is similar in all mammals, and the greatest differences are in the non-coding 5′ UTR, the size of which varies greatly between species [64]. Changes in the 5′ UTR play an important role in the interaction with cellular transcription factors that regulate L1 expression. Of the L1-encoded proteins, ORF1 is variable, whereas ORF2 is conserved [64]. Differences and evolutionary trends of L1 elements have been described in several groups of mammals. For example, some bats, similar to flying birds, have reduced genome sizes and are characterized by a loss of active L1 elements [71,76]. L1 extinction is also observed in certain mammalian species that are not adapted to flight. The disappearance of L1 activity was noted for Spermophilus tridecemlineatus from the superorder Afrotheria, perissodactyls, and sigmodontine rodents [77,78,79,80].
L1 is active in rats and mice. However, the accumulation and activity of mobile genetic elements in the widely studied mouse (Mus musculus) differ from the general tendencies of mammals, including humans and other primates, because LINE elements are quantitatively inferior to LTR repeats in its genome [81]. Even so, L1 makes up about 20% of the murine genome, and L1Md is currently active. LINEs account for about 23% of the rat genome [82]. In addition to the traditional L1, HAL1 (HALF-L1) elements, a shorter version of L1, became active in the rat genome. After integration into the genome, HAL1 elements retain their internal promoter, which is often truncated when full-length L1 elements integrate.
4.3. LINE Evolution in Primates
Primates separated from other ancestral mammals about 90–65 million years (myrs) ago and are characterized by the distribution of the L1PA-L1PB families [83,84,85]. Comparative evolutionary analysis of L1 revealed different trends in individual primate species [86]. In most primate species, L1 is the most active family capable of self-propagation in the genome, as well as the most capable of contributing to the amplification of SINE elements, whose copy number is the highest of all dispersed repeats. In most species of New and Old World primates, L1 remains active. Only in the New World (South American) spider monkeys was an absence of L1 activity found [87,88]. The evolutionary history of the Old World primates began approximately 21–25 myrs ago and is associated with the distribution of L1PA6–L1PA5 elements [83,89,90,91]. L1PA5–6 elements, which are evolutionarily closest to the modern active L1 subfamilies, are most widely distributed in the genomes of monkeys (Cercopithecoidea) [89,90]. Interestingly, the greatest differences in L1 number among primates were found within the Cercopithecoidea. For example, the baboon has the highest L1 amplification rate in the genome compared to other primates. In contrast, the green macaque has the lowest number of L1 repeats compared with other primates [86].
The branch of great apes split off about 26 myrs ago [92]. Among the great apes, the largest number of L1 insertion loci was found in the orangutan. Moreover, in the orangutan genome LINEs significantly outnumber other families of dispersed DNA repeats, while in other primates SINE insertions are the most common [86]. Compared to other primates, the number of LINE insertions in humans is not large. However, the largest number of currently active LINE elements was found in the human genome [86]. For example, the gorilla genome harbors twelve intact full-length gorilla-specific L1s belonging to the L1PA2 subfamily [93]. In chimpanzees, L1Pt-2 elements are active, and only nine copies are full-length elements with intact ORFs [94]. In contrast, in humans the active family is L1HS, consisting of several subfamilies, namely pre-Ta, Ta-0, Ta-1, Ta1-d, and Ta1-nd [9,95], of which about 146 copies are active [5]. Moreover, comparative analysis showed that the activity of human L1 copies is significantly higher than that of chimpanzee copies [90,94].
4.4. LINE Evolution in Ancient and Modern Humans
A number of studies have shown that the accumulation of loci containing L1 insertions is not random but depends on their functional significance. Thus, L1 insertions are more often retained in the trans-orientation relative to the gene, while insertions in the cis-orientation are eliminated from the genome [96,97]. The evolutionary trends of L1 in the Homo branch are of great interest. However, the genomic architecture of L1 elements in ancient humans (Homo sapiens sapiens) and in related subspecies of ancient hominids (Neanderthals and Denisovans) is poorly understood because repeat elements are difficult to map with the short reads obtained from sequencing ancient DNA. Nevertheless, several studies have analyzed mobile elements and shown introgression of L1 insertion loci from ancient humans into the DNA of modern humans, in a pattern consistent with that observed for SNVs [98,99]. Moreover, sequences corresponding to L1Ta1d, the most active mobile elements of the modern human genome, were identified in the genomes of ancient hominids. Thus, L1Ta1d could have originated in the common ancestor of ancient hominids and modern humans more than 800 thousand years ago [98]. An analysis of insertion loci in the genes of ancient and modern humans showed that most of the repeat insertion loci specific to modern humans, including L1, arose in genes that are highly expressed in the brain and are involved in neuronal maturation [99].
Analysis of L1 insertions in modern human populations from the Phase 3 data release of the 1000 Genomes Project, which included 2.5 thousand individuals from 26 populations, revealed 2.91 thousand polymorphic L1 loci [100]. The majority (over 93%) of the identified loci of active retrotransposons (L1, Alu, and SVA) have a low population frequency of less than 5%. Moreover, such low-frequency insertion loci show substantial geographic differentiation. In support of this, a recent study of the Simons Genome Diversity Project (SGDP), with a significantly smaller number of individuals (296) but greater population diversity (146 populations), identified a relatively large number of polymorphic non-reference L1 loci, 1.886 thousand [101]. In both studies, the number of polymorphic L1 loci is 6–10 times lower than the number of polymorphic Alu loci but 3.5–4 times higher than the number of polymorphic SVA loci. The polymorphism of the insertion loci of active retrotransposons reflects the evolutionary history of modern populations and global migration processes [100,101]. The greatest diversity is observed in African populations, which are evolutionarily basal among world populations [100,101]. Heterozygosity is decreased in the populations of Eurasia and reaches its minimum in Native Americans [101].
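The low-frequency criterion used above (population frequency below 5%) can be made concrete with a minimal Python sketch. It is an illustration only: the class and field names are hypothetical, the calculation assumes diploid genotypes at autosomal loci, and it does not reproduce the pipelines used in the cited studies.

```python
from dataclasses import dataclass

# Threshold taken from the text: "a low population frequency of less than 5%".
LOW_FREQ_THRESHOLD = 0.05


@dataclass(frozen=True)
class InsertionLocus:
    """One polymorphic insertion locus (hypothetical record layout)."""
    locus_id: str
    element: str         # "L1", "Alu" or "SVA"
    allele_count: int    # chromosomes carrying the insertion
    n_individuals: int   # diploid individuals genotyped

    @property
    def frequency(self):
        # Assumes two chromosomes per individual (autosomal locus).
        return self.allele_count / (2 * self.n_individuals)


def low_frequency_fraction(loci):
    """Fraction of loci whose insertion frequency is below the threshold."""
    loci = list(loci)
    if not loci:
        raise ValueError("no loci supplied")
    low = sum(1 for locus in loci if locus.frequency < LOW_FREQ_THRESHOLD)
    return low / len(loci)
```

Applied per element class or per population, the same calculation would give the share of low-frequency loci that the text reports as over 93% for the pooled L1, Alu, and SVA loci.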
4.5. LINE Evolution and Host Regulation
L1 regulatory factors have evolved along with L1 elements. The APOBEC3 protein family and the Piwi-interacting RNA (piRNA) pathway are involved in cellular defense mechanisms against the uncontrolled spread of L1 (see Regulation of L1 Activity). The APOBEC3 subfamily of antiviral factor genes is among the genes most subject to strong evolutionary selection, amplification, and divergence in mammalian genomes [102,103]. High divergence of APOBEC3 was noted in the genomes of bats and primates [104,105,106]. Interestingly, other closely related genes belonging to the AID/APOBEC family have lower evolutionary rates in mammals [107]. High evolutionary rates are also observed for the piRNA pathway, many genes of which are under positive selection [108]. The different regulatory pathways capable of repressing L1 elements have evolved as a reflection of the constant battle between mobile elements and the host cell defense [109,110,111]. Differences between animal species in the expression of genes involved in host defense against mobile elements play an important role in the effectiveness of L1 inhibition. One study showed higher expression of the APOBEC3B (also known as A3B) and PIWIL2 genes in human pluripotent stem cells than in those of the closest non-human primates (Pan troglodytes and Pan paniscus), and found that L1 silencing is more efficient in human cells than in chimpanzee cells [112].
The factors regulating L1 elements are considered in detail below.
The regulation of L1 activity throughout ontogenesis is complex. In most cells, L1 activity is inhibited at all stages of the retrotransposition process in various ways: by decreasing the accessibility of DNA through DNA methylation [110,113,114], histone modifications, and heterochromatin formation [110,115,116]; through post-transcriptional inhibition by degradation of new RNA copies of L1 [117,118]; through repression of ORF1 and ORF2 translation; through the binding of L1 RNPs and the obstruction of their transport to the nucleus [119,120,121,122]; and, at the final stage, when a new copy of the L1 element integrates into the genome, through DNA repair mechanisms [120,123,124,125,126] (Figure 3). The regulation of L1 activity changes during ontogenesis. At the stage of germ cell formation and in mature germ cells, L1 poses a great threat to the future organism and is therefore thoroughly repressed [127,128]. Most experimental knockouts of factors involved in L1 silencing in germ cells lead to germ cell death and infertility [129]. In the early stages of embryogenesis, L1 activity is also dangerous for the developing organism and is therefore repressed [130,131,132]. Some changes occur in the pathways of L1 repression during embryogenesis, and the activity of L1 elements increases at certain stages [133,134]. L1 elements are mainly repressed in the somatic tissues of a mature organism, but increased L1 activity is noted in some pathologies, including cancer as well as autoimmune and neuropsychiatric disorders [135]. With normal ageing, changes in the number of L1 insertions are insignificant [136]. However, in some tissues, especially in the brain, L1 is not completely suppressed and L1 retrotransposition can be activated [137,138,139].
Figure 3. Factors affecting the activity of L1 retrotransposons during ontogenesis. The factors involved in L1 regulation are grouped horizontally in accordance with the stages of prenatal development (pre- and post-implantation period and in germline cells) and the postnatal period (somatic cells), as well as vertically depending on the stage of the retrotransposition process (expression, L1 RNP formation, and integration into the genome).
Normally, L1 can be active in the brain [113,138,139,234,235,236]. In some neuropsychiatric pathologies, changes in L1 activity were detected. The most pronounced increase in L1 activity was found in Rett syndrome [241] and autism [400], as well as in ataxia telangiectasia [45]. The genetic causes of increased L1 activity have been studied mostly in Rett syndrome and ataxia telangiectasia, where they are associated with damage to the MECP2 and ATM genes [45,113]. The L1 control mechanism of these genes is described above. Some trends are observed in schizophrenia [401,402] and major depressive disorder [403]. However, the causes and factors that change the activity of L1 elements remain unknown for most diseases. Recently, some studies demonstrated a connection between L1 activity and genetic factors associated with neurodegenerative pathologies. One of these factors is the TAR DNA-binding protein (TDP-43), which is able to bind DNA and RNA and is involved in the regulation of many processes [404]. TDP-43 is associated with neurodegenerative pathologies such as amyotrophic lateral sclerosis (ALS) and frontotemporal degeneration (FTD) [405]. In these pathologies, the protein undergoes cleavage, hyperphosphorylation, and aggregation into ubiquitinated granules in the cytoplasm. A similar TDP-43 “proteinopathy” occurs in other neurodegenerative diseases such as Alzheimer’s disease [406], Parkinson’s disease [407], and Huntington’s disease [408], and also in hereditary inclusion body myopathy (HIBM) [409]. Controversial results have been obtained regarding the effect of TDP-43 on L1 activity. TDP-43 is involved in many processes that can affect L1 activity, such as autophagy, which contributes to the destruction of L1 stress granules [410], and double-strand DNA break repair, in which it binds to the damaged site and promotes formation of the XRCC4-DNA ligase IV complex, the activity of which can contribute to retrotransposition [49,50,411].
Additionally, one of the latest studies reported an inhibitory effect of TDP-43 on L1 activity: its absence increased the level of L1 retrotransposition through chromatin decompaction [412]. In contrast, other studies showed that TDP-43 regulates the transcription of many genes and of Alu retrotransposons but does not affect the activity of L1 elements [413,414]. In addition, an increase in HERV-K retroviral repeats was noted in amyotrophic lateral sclerosis, while no changes in L1 activity were detected [415,416]. Changes in the expression of retrotransposons are also associated with the Tau protein encoded by the MAPT (microtubule-associated protein tau) gene [417,418]. Tau pathology is observed in various neurodegenerative disorders, including Alzheimer’s disease [419,420]. The Tau protein becomes hyperphosphorylated and forms insoluble aggregates called neurofibrillary tangles [421,422,423]. One study showed activation of various retrotransposons, including L1 copies that had lost the ability to retrotranspose due to accumulated mutations in their open reading frames [418]. We found no significant changes in the copy number of L1 in Alzheimer’s disease [136]. Another study showed an increase in the expression of endogenous retroviruses, but not of active L1, as a result of chromatin decondensation and a decrease in both piRNA and piwi proteins associated with Tau pathology in Alzheimer’s disease [417]. Mitochondrial dysfunction and oxidative stress are characteristic features of a number of diseases, such as some forms of ataxia, neurodegenerative diseases (Parkinson’s disease in particular), and various forms of cancer [424,425,426,427]. Recent studies have shown that abnormalities and deficiency of the mitochondrial respiratory chain, as well as oxidative stress, cause DNA hypomethylation and increased L1 activity [428,429,430].
The stress sensor gene GADD45B has been linked to the death of dopaminergic neurons in Parkinson’s disease [431]. A recent study in mice showed that overexpression of Gadd45b leads to disorganized heterochromatin, increased DNA damage, vulnerability to oxidative stress, and subsequent stable changes in DNA methylation, particularly in introns of neuronal genes harboring L1 [432].