Inteins are mobile genetic elements that use standard enzymatic strategies to excise themselves post-translationally from a precursor protein through protein splicing. Since their discovery in the 1990s, advances in intein technology have turned them into a versatile biotechnological tool. Improved understanding of the structure and catalytic framework of cis- and trans-splicing inteins has driven the development of engineered inteins that support a range of efficient downstream techniques.
Splicing can be broadly categorized as RNA splicing and protein splicing, two mechanisms that act along the flow of information from a gene to its protein product and yield a functional protein whose sequence is not colinear with the gene. While group I introns self-splice at the precursor RNA level, intein splicing removes an intervening sequence at the precursor polypeptide level. This intervening polypeptide sequence was initially termed a spacer or protein intron and is now termed an intein (INTervening protEIN). In a post-translational event, an intein excises itself precisely from a larger precursor protein by sequential cleavage of peptide bonds and concomitant ligation of the flanking amino-terminal (N-) and carboxy-terminal (C-) sequences, termed exteins, through a new peptide bond, yielding an active protein product. Intein-mediated splicing requires no exogenous cofactors or high-energy molecules. Incorporating intein-mediated protein splicing into the “central dogma” of molecular biology adds a further level of complexity to the mechanism of gene expression.
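The excision-and-ligation step described above can be illustrated with a toy model that treats the precursor as a string of one-letter amino acid codes. This is only a sketch: the function name `splice`, the `SplicingResult` type, and the sequences are hypothetical placeholders, and the chemistry of the individual cleavage and ligation steps is not modeled.

```python
from typing import NamedTuple


class SplicingResult(NamedTuple):
    spliced_product: str  # N-extein joined directly to C-extein
    excised_intein: str   # the intervening sequence that was removed


def splice(n_extein: str, intein: str, c_extein: str) -> SplicingResult:
    """Toy model of protein splicing on one-letter amino acid strings."""
    precursor = n_extein + intein + c_extein
    start = len(n_extein)          # first intein position in the precursor
    end = start + len(intein)      # first C-extein position in the precursor
    # Cut at both intein boundaries, then join the flanking exteins.
    return SplicingResult(precursor[:start] + precursor[end:], precursor[start:end])


# Placeholder sequences, chosen only for illustration (not real proteins).
result = splice("MKTAY", "CLSGDTLVHN", "SVEEL")
print(result.spliced_product)  # MKTAYSVEEL
print(result.excised_intein)   # CLSGDTLVHN
```

The point of the sketch is that the spliced product is shorter than the gene predicts, which is why the mature protein sequence no longer maps one-to-one onto the coding sequence.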
Most inteins are interrupted by a homing endonuclease domain (HED). The HED mediates homing, a gene conversion process that copies the intein into intein-free alleles; the presence of an HED therefore increases the frequency of the intein-containing allele beyond Mendelian rates. The HED can nevertheless be removed from the intein without entirely compromising splicing activity. In intein nomenclature, homing endonucleases encoded within an intein carry the prefix “PI-”. Conventionally, an intein name comprises abbreviations of the genus and species followed by the name of the host protein; for instance, the intein residing in the GyrA protein of Mycobacterium xenopi is designated Mxe GyrA. Mxe GyrA is itself a naturally occurring mini-intein that lacks an HED.
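The naming convention can be sketched in a few lines. The abbreviation rule used here (first letter of the genus plus the first two letters of the species) is an assumption inferred from the Mxe GyrA example, not a rule stated in the text, and the function names are hypothetical.

```python
def species_abbreviation(genus: str, species: str) -> str:
    """Assumed rule: first letter of the genus + first two letters of the species."""
    return genus[0].upper() + species[:2].lower()


def intein_name(genus: str, species: str, host_protein: str) -> str:
    """Build an intein name as '<abbreviation> <host protein>'."""
    return f"{species_abbreviation(genus, species)} {host_protein}"


print(intein_name("Mycobacterium", "xenopi", "GyrA"))  # Mxe GyrA
```

The “PI-” prefix for intein-encoded homing endonucleases is deliberately left out of the sketch, since the text gives no further detail on how those names are completed.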
Inteins naturally exist in three different configurations (Figure 1): (1) full-length inteins, in which a sequence-specific homing endonuclease domain is embedded between the splicing (catalytic) domains; (2) mini-inteins, which lack the homing endonuclease domain and contain a contiguous protein splicing domain; and (3) split inteins, which are transcribed and translated as two separate polypeptides, each joined to an extein. The study of intein distribution, dissemination, and potential biological function is particularly fascinating in the field of translational research. Intein distribution is sporadic across the genomes of archaea, bacteria, and eukaryotes, as well as several viral genomes. This anomalous distribution has prompted numerous evolutionary scenarios, including a role for inteins in genetic mobility and as selfish DNA. Still, the question remains why inteins have persisted for millions of years. Do they perform a beneficial role in the host, or are they merely selfish genes? This question remains puzzling and needs to be explored further.
Figure 1. Intein configurations. Schematic representation of the types of intein: (a) full-length intein with homing endonuclease domain (HED), (b) mini-intein, and (c) split intein.
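The three configurations can be captured as a small decision rule. The sketch below assumes that two properties are enough to tell them apart: whether the intein is expressed as one or two polypeptides, and whether it carries a homing endonuclease domain. The type and field names are hypothetical, and checking the split case first is a simplifying choice of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class InteinConfiguration(Enum):
    FULL_LENGTH = "full-length intein"
    MINI = "mini-intein"
    SPLIT = "split intein"


@dataclass(frozen=True)
class InteinRecord:
    name: str
    has_homing_endonuclease: bool
    polypeptide_count: int  # 1 = contiguous intein, 2 = split intein


def classify(record: InteinRecord) -> InteinConfiguration:
    """Assign one of the three configurations; the split case is checked first."""
    if record.polypeptide_count not in (1, 2):
        raise ValueError("polypeptide_count must be 1 or 2")
    if record.polypeptide_count == 2:
        return InteinConfiguration.SPLIT
    if record.has_homing_endonuclease:
        return InteinConfiguration.FULL_LENGTH
    return InteinConfiguration.MINI


# Hypothetical records, one per panel of Figure 1.
examples = [
    InteinRecord("example (a)", has_homing_endonuclease=True, polypeptide_count=1),
    InteinRecord("example (b)", has_homing_endonuclease=False, polypeptide_count=1),
    InteinRecord("example (c)", has_homing_endonuclease=False, polypeptide_count=2),
]
for record in examples:
    print(f"{record.name}: {classify(record).value}")
```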
The potential to exploit inteins for practical purposes has led to a diverse array of applications in modern biotechnology. Inteins can be engineered to undergo conditional protein splicing (CPS), which requires environmental or molecular triggers such as light, changes in pH or temperature, a change in redox state, or the addition of small molecules. The biased occurrence of inteins in plant and human pathogens makes them attractive for novel drug development. Engineered inteins and synthetic intein systems have enabled efficient protein purification, ligation, and cyclization strategies. Recent advances in intein research have extended these in vitro applications to whole organisms. Such developing applications suggest that inteins are becoming a mature and critical biological tool, capable of opening new avenues of scientific research, including enhanced transgenic plants and novel therapeutic strategies.
The first intein sequence was discovered in 1990 in the Saccharomyces cerevisiae VMA1 gene, which encodes an alpha subunit of the vacuolar H+-ATPase. The translational product of the gene was calculated to be 118.6 kDa, but the protein was experimentally estimated at 67 kDa. The deduced amino acid sequence is similar to other ATPases in its N- and C-terminal regions, but the central region shows no such similarity. Experimental analysis by Kane et al. revealed two separate proteins with molecular weights of 69 and 50 kDa. Since then, further examples of inteins have been found in all three domains of life: in archaea, the DNA polymerase of the extremely thermophilic archaeon Thermococcus litoralis; in bacteria, the RecA proteins of M. tuberculosis and M. leprae; and in eukarya, the 69 kDa subunit of the vacuolar ATPase of the yeast Candida tropicalis. This highlights a wide distribution of inteins across all three domains of life (Figure 2b), suggesting an ancient origin that predates the separation of prokaryotes and eukaryotes. We searched the NCBI Gene database (www.ncbi.nlm.nih.gov/gene) to survey the distribution of inteins in all three domains of life. Of 2709 intein-containing genomes, 56% were found in eubacteria, 19.8% in archaea, and 6.64% in eukaryotes. We also assessed intein distribution in viruses and observed that 17.4% of the intein-containing genomes are viral (Figure 2a).
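The molecular weights reported for the VMA1 product lend themselves to a quick consistency check: if the 69 and 50 kDa proteins both derive from a single precursor, their sum should approximate the 118.6 kDa predicted from the gene. The snippet below is a back-of-the-envelope sketch using only the figures quoted above; assigning the 69 kDa band to the spliced ATPase subunit and the 50 kDa band to the excised intervening protein is our reading of those figures, and the variable names are illustrative.

```python
# Molecular weights quoted in the text, in kDa.
predicted_precursor_kda = 118.6
observed_products_kda = {
    "spliced ATPase subunit": 69.0,       # assumed assignment of the 69 kDa band
    "excised intervening protein": 50.0,  # assumed assignment of the 50 kDa band
}

products_total_kda = sum(observed_products_kda.values())
difference_kda = round(products_total_kda - predicted_precursor_kda, 1)

print(f"Sum of observed products: {products_total_kda:.1f} kDa")  # 119.0 kDa
print(f"Difference from prediction: {difference_kda} kDa")        # 0.4 kDa
```

The two products together account for the predicted precursor to within the precision of gel-based estimates, which is what one would expect if a single translation product gives rise to both proteins.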
Figure 2. Sporadic distribution of inteins. (a) Summary of intein distribution, with the total number of intein-containing genomes from the respective species indicated. The intein distribution data were extracted from the NCBI Gene database. (b) Schematic representation of the tree of life showing four phyla for bacteria, three phyla for archaea, and three kingdoms of eukarya (Metazoa, Fungi, and Viridiplantae). All other eukaryotes are shown with the basal branch. Intein-containing gene sequences were obtained from NCBI and analyzed with MEGA-X software. The phylogenetic tree was constructed using the neighbor-joining method.
Novikova et al. performed a large-scale survey of intein presence across bacteria and archaea. The survey revealed that half of the archaeal genomes analyzed had at least one intein, whereas only a quarter of the bacterial genomes studied were intein-positive. A recent study by Kelly et al. sheds light on intein distribution across bacteria and their phages. This analysis provides evidence that mycobacteriophages may act as facilitators of intein dissemination across mycobacteria. The study found that 19.1% of mycobacteriophages contain inteins, which reside mostly in nucleic acid binding proteins and are enriched in specific clusters. Although bioinformatic analyses report only a sparse presence of inteins in eukaryotes, inteins do occur in fungal nuclear genomes, algal chloroplast genomes, and a few eukaryotic viruses. Among eukaryotes, inteins are most prevalent in fungi, mostly in Ascomycota, which include noteworthy pathogenic fungi such as Candida sp. and Aspergillus sp. Inteins found in Basidiomycota occur in human pathogens such as Cryptococcus neoformans and C. gattii and in the plant pathogens Tilletia indica and T. walkeri. The chloroplast DNA of diverse algae and seaweeds in the phyla Rhodophyta, Chlorophyta, Cryptophyta, Ochrophyta, and Heterokonta contains a large number of inteins. Among known eukaryotic viruses, hundreds of inteins are found across four families, namely Iridoviridae, Marseilleviridae, Phycodnaviridae, and Mimiviridae. In the fungal pathogens mentioned above, inteins are commonly present in Prp8 (pre-mRNA processing factor 8), VMA1 (vacuolar ATPase, subunit A), DnaB (DNA replication helicase DnaB-like), DdRP (RNA polymerase subunit beta RpoB), DdDP (DNA polymerases), and RIR (ribonucleoside-diphosphate reductases).
The primary clue to intein origin lies in the two-domain structure, which suggests that a mobile intein arose from a fusion between two proteins, most likely a self-splicing intein and an endonuclease. Sequence and mutational studies showed that endonuclease activity is concentrated in the central portion of the intein, whereas splicing activity is located in the two terminal regions. However, it remains unclear which came first: the intein or the autocatalytic self-splicing domain found in regulatory proteins. Xiang-Qin Liu noted that a self-splicing mini-intein shows a correspondence between its structural and functional composition. Structurally, a mini-intein consists of two subdomains with a loop exchange between them. Functionally, the splicing pathway consists of two peptide cleavages and a coupling between them. This correspondence is unlikely to be coincidental and suggests a structure-function relationship in the mini-intein. Liu further hypothesized that a fusion between two coding sequences gave rise to a duplication of the domain responsible for self-cleaving activity, and that the resulting fusion protein retained the ability to perform self-cleavage independently. Homing endonucleases may have invaded such an element later. This idea is supported by the reasoning that a self-splicing element offers a mobile endonuclease gene a preferable integration site: because the element removes itself from the gene product, the function encoded by the surrounding genetic elements is not disrupted. It is reasonable to think that naturally occurring mini-inteins evolved from bifunctional mobile inteins by losing their endonuclease domain, because once an intein has entered a host protein there is little selection pressure to maintain endonuclease activity but strong selection pressure to maintain splicing activity.
A split intein may evolve from a mini-intein through a break in the intein’s coding region. The discovery of a naturally occurring split intein in a cyanobacterial DNA polymerase (DnaE) supports this idea. The N- and C-exteins of DnaE are each linked to their respective intein fragment, but the two halves are encoded by separate genes located in different parts of the genome.
Interestingly, inteins are biased toward invading regulatory proteins responsible for DNA metabolism (polymerases, topoisomerases, helicases, ribonucleotide reductases) and essential housekeeping genes, including those encoding essential proteases, metabolic enzymes, RNA processing proteins, and vital energy-supplying proteins. Their insertion sites coincide with conserved domains responsible for host protein function, such as catalytic or ligand-binding sites, enzyme active sites, and DNA-binding sites. Insertion at these critical sites favors the survival of inteins by making them less prone to deletion. This site-specific insertion behavior may be due to the activity of the homing endonuclease domain. The information gathered over the last two decades on the genome organization and expression of inteins has led to the understanding that mobile genetic elements are not solely parasitic sequences but also play a dynamic role in the evolution of species.