Satellite DNA (satDNA) is defined as highly repetitive DNA consisting of short sequences tandemly repeated a large number of times. Collectively known as the "satellitome", this genomic component offers exciting evolutionary insights into aspects of primate genome biology that raise new questions and challenge existing paradigms.
Primate genomes are enriched in repeats (more than 50%), some of which remain uncharacterized[1][2] [18,19]. Similar to other vertebrates, primate genomes include an abundance of tandem repeats that are organized in such a pattern that the sequences are repeated directly adjacent to each other[3] [20]. These repeat sequences consist of satellite DNA (satDNA), which is defined as tandemly arranged repeats that represent a considerable proportion of the heterochromatic portion of the eukaryotic genome, forming the main structural component (heterochromatin) of chromosomes [4][5][6][7][8][9][10][13,21,22,23,24,25,26]. SatDNA has been implicated in a variety of important functions, including segregation during cell division, homologous chromosomal pairing, kinetochore formation, chromatid attachment, chromosomal rearrangements, and differentiation of sex chromosomes[11][12][13][14][15][16][17] [27,28,29,30,31,32,33]. Perhaps most importantly, satDNA can constitute rapidly evolving sequences of the genome[18][19][20] [34,35,36] and is now considered to be important in driving genomic and karyotypic evolution[4][6][7][8][21] [13,22,23,24,37].
A range of evidence has been collected on the dynamics of satDNA in primates, which can be characterized as follows: (i) satDNA repeats may evolve independently in primate genomes, and differences in their genomic abundance among taxa can increase with phylogenetic distance; (ii) the predominant satDNA families are conserved in primates, with the exception of certain satDNA types that have undergone extreme divergence; (iii) specific portions of satDNA in the genome show population-, species-, or lineage-level divergence and a paradoxical link with the evolution of centromeres; (iv) the library model of satDNA evolution is still applicable to primate genomes; and (v) satDNA transcriptional activity can mediate the regulation of gene expression, consequently influencing wide-ranging cellular phenomena.
The genomes of most primates, such as monkeys, apes, and humans, comprise up to 50% repeat contents, of which satDNA may constitute as much as 10% of the total number of repeats [22][23][50,51]. RepeatMasker data[24] [52] for different primate species indicate that their genomes can contain a highly variable proportion of satDNA (Figure 1). Comparison of these data shows that satellite repeats are highly abundant in certain families, such as nocturnal primates (superfamily Lorisoidea), strepsirrhine primates (family Cheirogaleidae), and haplorrhine primates (family Tarsiidae) (Figure 1), which suggests extensive expansion of satDNA in the genomes of these lineages. By contrast, in Hominidae and Hylobatidae, satellite repeats are comparatively low in abundance. The genomes of Hominidae and Hylobatidae are invaded by TEs at higher proportions compared with those of the Lorisoidea, Cheirogaleidae, and Tarsiidae lineages. This observation suggests that phylogenetically close lineages show similar patterns of satellite abundance in their genomes, whereas differences in abundance among taxa increase with phylogenetic distance. However, data on the relative percentage of satDNA in different primate genomes must be treated carefully, and precise information on satDNA abundance in primate genomes is still lacking due to misassemblies, gaps, and unresolved assembled centromeric regions that span these repeats [2][25][19,55].
Figure 1. A comprehensive phylogeny of 301 primate species based on mitochondrial DNA sequences using Bayesian inference. Pie charts for selected common primate species show percentage differences of repeat types in the respective genomes. The abundance of satellite DNA in primate genomes varies considerably among lineages (red colored area of pie charts). Additionally, the comparative repeatomic landscape shows LINEs and SINEs emerged as the most expanded elements of primate genomes (blue- and orange-colored areas of pie charts) with consistent pattern across diverse lineages.
SatDNAs were initially identified by their buoyant densities (in g/mL) on cesium chloride gradients[26] [56]. This technique was formerly employed for satDNA detection but is a biased procedure: it can identify a single satellite, or sometimes multiple satellites, in a genome but cannot detect the entire set of satellite families. Modern techniques, such as next-generation sequencing (NGS) and fluorescence in situ hybridization (FISH), have replaced traditional methods and have substantially improved detection and characterization of satDNA[2] [19]. This methodological shift has brought advances in the identification of different repeat types and structural units of satDNA in primate genomes[27][28] [41,47]. Cytogenetic studies have examined the genomic organization and diversity of satDNA extensively in humans and, to some extent, in other primates. As a result, a wealth of knowledge is now available on the localization of satDNA repeats, their lengths, repeat units, variability, and copy numbers in different genomes[29][30][31][32][33][34][35] [55,57,58,59,60,61,62,63]. These repeats can be categorized into different types of satDNA to better understand their roles, evolution, and applications in phylogenetic analyses. The categories include satellites that are generally shared across all eukaryotic lineages and those that are exclusive to primate genomes.
Certain tandem repeat sequences can be classified by the number of base pairs (bp) in the repeat unit into two types: microsatellites (units of one to six or more bp) and minisatellites (usually 10 to 100 bp)[36] [64]. The human genome contains as much as 3% microsatellites [37][65] and several thousand chromosomal loci enriched with minisatellites[38] [66], also called variable number tandem repeats (VNTRs)[39] [67]. Previous isolation of microsatellites from the human genome has enabled researchers to amplify these sequences in several non-human primate (NHP) species, including apes, baboons, macaques, and some platyrrhine monkeys[39][40][41][42][43][44][45][46] [68,69,70,71,72,73,74]. Microsatellites tend to accumulate many substitutions and/or insertions/deletions, and are thus considered to show limited conservation across primate lineages[47] [75]. Nevertheless, many conserved microsatellites, such as AP74, which was discovered in New World monkeys, exhibit similar sequence lengths (up to 176 bp) in monkeys and humans[48][49] [76,77]. Boán et al.[50] [78] identified the minisatellite MsH42 in the human genome and performed a comparative analysis in 11 NHP species. Phylogenetic analysis detected several variants of MsH42, and a hypothesis for the evolutionary birth of minisatellites in the primate genome was proposed. According to this hypothesis, MsH42 arose within an intron early in primate lineage evolution, more than 40 million years ago. Subsequently, various mutations, including insertions, duplications, and single-nucleotide polymorphisms of repeat blocks, were probably the major forces governing the generation of this minisatellite and its divergence throughout primate evolution[50][78]. The hexamer TTAGGG, a specific microsatellite monomer, can be repeated many times as (TTAGGG)n, eventually forming the bulk of the telomeric region, which spans up to 15 kb on human chromosomes[51][52] [79,80].
These telomeric repeats can serve as binding sites for certain nucleoproteins, such as TRF1, TRF2, and POT1, forming a complex termed “shelterin”[53] [81] that interacts with a ribonucleoprotein[54] [82]. This complex is involved in DNA repair processes and the protection against degradation of chromosomal ends[55] [83].
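The figures quoted above for telomeric repeats can be made concrete with a short Python sketch. It is only an illustration: the function names and the toy sequence are hypothetical, and the calculation assumes a tract made purely of perfect TTAGGG units, which real telomeres are not.

```python
import re

TELOMERE_UNIT = "TTAGGG"  # hexamer named in the text


def longest_tandem_run(seq: str, unit: str = TELOMERE_UNIT) -> int:
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)


def copies_in_tract(tract_bp: int, unit: str = TELOMERE_UNIT) -> int:
    """Upper bound on perfect unit copies that fit in a tract of `tract_bp`."""
    return tract_bp // len(unit)


# Toy sequence (made up for illustration): five copies, a spacer, two copies.
toy = "ACGT" + "TTAGGG" * 5 + "CCA" + "TTAGGG" * 2
print(longest_tandem_run(toy))   # 5
print(copies_in_tract(15_000))   # 2500 copies in a 15 kb tract
```

Under these simplifying assumptions, a 15 kb telomeric tract corresponds to roughly 2500 copies of the hexamer.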
Well-characterized telomeric satellites of the human genome can also be applied broadly as informative markers to study a variety of hominoid species owing to their multiallelic variation and high degree of heterozygosity[41] [70]. The MsH42 locus shows high similarity with immunoglobulin regions and is involved in recombination events as well as in promoting high rates of unequal crossovers [56][57] [78,84,85]. Chromosomes also harbor short stretches of telomere-like sequences, termed interstitial telomeric sequences (ITSs), which are located far from the chromosomal ends. To trace the evolutionary origin of these sequences in NHP genomes, 22 ITS loci from the human genome were compared with their orthologs in 12 NHPs, representing great apes, gibbons, Old World monkeys, and New World monkeys. The comparison indicated that, unlike other microsatellites, these ITSs were not derived from expansion of pre-existing TTAGGG monomers but rather emerged abruptly during primate genome evolution as a result of double-strand break repair [58][86]. Similar findings were obtained from investigation of a chimpanzee-specific ITS.
A universal satDNA classification is still the subject of debate; however, satDNA is most commonly grouped according to its position and association with different chromosomal loci. SatDNA is primarily clustered within the heterochromatic regions of primate chromosomes. Heterochromatin is mainly localized in centromeric and telomeric regions, and sometimes within the interstitial regions of the chromosomes[59] [87], whereas satDNA sequences are mostly located in centromeric regions, and the nearby pericentromeres may be enriched with TEs. Different types of primate satDNA are discussed here and summarized in Supplementary Table S1.
The centromere cores of human chromosomes span abundant and highly enriched stretches of satDNA, and are surrounded by heterochromatin containing a combination of short satDNA sequences and retroelements[13][60] [29,88]. Occasionally, these centromeric regions are termed “satellite centromeres”[61] [89]. The centromere is an important region of the chromosome for preservation of genetic material and plays a critical role in chromosome segregation, cell division, kinetochore organization, and spindle attachment[61][62][63][64] [89,90,91,92]. In primates, the bulk of the centromere is composed of the pancentromeric alpha satellite (AS), organized as stretches of 171 bp monomers in a head-to-tail fashion extending from ~250 kbp up to ~5 Mbp per chromosome [65][66][67][68] [93,94,95,96] (Figure 2a(i)). This structure has been reported across diverse groups, including great apes, Old World monkeys, and New World monkeys [69][70][71][72][73][74] [96,97,98,99,100,101,102]. These centromere-associated satellites are arranged as superfamilies (SFs) that can be orthologous between human and gorilla[32] [60]. The surrounding pericentromeric satDNAs are essential elements that assist in stabilization of DNA–protein binding and regulation of chromosome segregation[75][76] [58,61]. These pericentromeric satellites vary greatly across NHP species but can be conserved among closely related species or may be species-specific[3][77] [20,103]. For instance, a large block of human chromosome 9 that spans a pericentromeric area enriched with satellite III (SatIII) shares close homology with the gorilla sequence [78] [104]. The Y chromosome of NHPs may carry higher numbers of copies of satellite III sequences than the human Y chromosome [79] [105].
FISH mapping of the pericentromeric-type satellite pW-1 SatIII DNA on chromosomes of various NHP species showed that these sequences might be lacking in the genomes of squirrel monkey (Saimiri sciureus) and baboon (Papio hamadryas)[79] [105]. These centromeric satellites can vary substantially across different species, but certain species-specific or even highly conserved satDNA may also be present in the centromere domains[3][77] [20,103]. For example, two major families of centromeric satellites, termed C1 and C2, detected in the Old World monkey species crested mona monkey (Cercopithecus pogonias) and sun-tailed monkey (Cercopithecus solatus) have remained highly conserved[80] [48]. For Old World monkeys, apes, and humans, each genome harbors evolutionarily distinct AS monomers[81] [106]. Although most primate centromeres can be enriched with satellite repeats, certain orangutan chromosomes have non-repeated centromeres[64][82][83][84] [92,107,108,109,110]. In such cases, the centromeres may resemble the neocentromeres that form after disruption of the centromeric region, such as in humans[64][85] [92,111]. Such non-repeated centromeres are likely to be evolutionary new centromeres (ENCs), forming neocentromeres that might have subsequently gained repeat sequences to stabilize the genome and become fixed in populations. This phenomenon can also occur in the centromeres of several non-primate species, such as horse and chicken[86][87] [112,113]. In the following, we focus mainly on AS repeats, the predominant centromeric satDNA in primate genomes.
The AS repeats were first observed as tandem repeats in the African green monkey (Chlorocebus aethiops) genome[65] [93], followed by identification of homologous repeats in New World monkeys and apes[68][88] [96,114]. These sequences are considered to be critical components for the various functions of primate centromeres[66] [94]. Previous results suggest that AS sequences were involved in stabilization of ENCs after their emergence in primate genomes[83][89] [109,115]. Human and macaque chromosomes contain a total of 14 ENCs, of which nine ENCs in the macaque genome show abundant arrays of AS[83] [109]. Interestingly, ENCs occur in macaque chromosome 4 and human chromosome 6, which are orthologous to each other (Figure 2a(ii)) [83][90][91] [109,116,117].
The AS monomer size is 171 bp, tandemly arranged in a head-to-tail manner, and shows as much as 70% sequence similarity. The combined monomers can form a long array spanning an uninterrupted 250–5000 kb stretch of repeated satellites, giving rise to higher-order repeats (HORs) (Figure 2a(iii)). Certain monomers in the HORs contain a 17 bp motif termed the CENP-B box. This motif acts as a binding site for the centromeric protein CENP-B in primates. The human genome project, which was declared complete in 2003, was still unable to recover a large proportion of the centromeric and other repeats, including more than 10% of the contents of the whole genome, mainly sex chromosomes. However, subsequent technological developments enabled assembly of the entire human Y chromosomal centromere[64][92] [62,118]. The Y chromosome assembly could be used as a reference sequence to extend evolutionary insights into the centromeric repeats of NHPs for which Y chromosome assemblies have not yet been completed.
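To give a sense of scale for these numbers, the following Python sketch estimates how many 171 bp monomers would fit into arrays at the two ends of the quoted size range. It is a back-of-the-envelope calculation only: the function name is hypothetical, and it assumes an array built exclusively of uninterrupted, full-length monomers.

```python
AS_MONOMER_BP = 171  # alpha satellite monomer length given in the text


def monomers_in_array(array_bp: int, monomer_bp: int = AS_MONOMER_BP) -> int:
    """Number of complete monomers that fit in an array of `array_bp` bases.

    Assumes a perfectly regular array with no interspersed elements.
    """
    if array_bp < 0 or monomer_bp <= 0:
        raise ValueError("array_bp must be >= 0 and monomer_bp must be > 0")
    return array_bp // monomer_bp


for array_kb in (250, 5000):
    count = monomers_in_array(array_kb * 1000)
    print(f"{array_kb} kb array: about {count} monomers")
# 250 kb array: about 1461 monomers
# 5000 kb array: about 29239 monomers
```

On these assumptions, a single centromeric array would hold on the order of 1500 to 29,000 monomer copies.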
In primates, centromere cores carry specialized HOR arrays, whereas in the flanking regions AS sequences are organized as unstructured, heterogeneous repeats, forming distinctive pericentromeres. In these pericentromeres, AS sequence repeats are arranged as monomers instead of HORs and are interrupted by additional elements, mainly retrotransposable elements in humans[93] [119] (Figure 2a(iii)), which may also be common to other primate genomes. The pericentromeres of certain human chromosomes may also show enrichment of several other repeat sequences, including the 5 bp satDNA II and III type sequences[77][94] [103,120]. The AS sequences can show nucleotide variation when one monomer is compared with the other repeats of the same array, with nucleotide identity ranging from 70% to 90%. The sequence of a monomer in one array may show up to 95% similarity with its counterpart unit in the other array at the same locus [35][95][96] [63,121,122]. In the human genome, the organization of HORs with their monomer units has been extensively studied[37][69][97][98] [65,97,123,124], and shows the occurrence of various subfamilies of chromosome-specific AS sequences. The sequences of HORs in great apes, such as orangutan, gorilla, and chimpanzee, show a lower degree of variation in comparison with HORs observed in the human genome[97][98][99][100][101][102] [125,126,127,128]. Initially, it was presumed that the organization of HORs might be restricted to hominids; however, HORs were subsequently detected in the genomes of gibbons[73][74][103] [101,102,129] and of Old World and New World monkeys[80][74][104] [48,102,130]. During the evolution of the primate genome, the 171 bp AS monomer underwent a series of sequence variations [59][87]. A novel AS monomer type of 189 bp was discovered in the centromeres of gorilla [60]. Chromosome-specific subfamilies are absent in Old World and New World monkeys as well as in gibbons[59][73][81] [87,101,106].
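The identity values quoted above are pairwise comparisons between monomers. A minimal Python sketch of such a comparison follows; it assumes the two sequences have already been aligned to equal length and contain no gaps, which is a simplification, and both the function name and the example sequences are made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences.

    Assumption: `a` and `b` are the same length and gap-free; a real
    comparison of monomers would first require an alignment step.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Two hypothetical 10 bp fragments differing at two positions.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAA"))  # 80.0
```

For full 171 bp monomers, an identity of 70% to 90% on this measure would correspond to roughly 17 to 51 mismatched positions.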
Cloning, sequencing, and hybridization of acrocentric chromosomes revealed novel AS sequence repeats in Azara’s owl monkey (Aotus azarae), which is a species of New World monkey[7][8] [22,23]. These repeats include three megasatellites, namely OwlRep, OwlAlp1, and OwlAlp2, which vary in size from 184 to 344 bp as identified in the centromeric and pericentromeric regions. Analysis of retina samples using three-dimensional FISH revealed that OwlRep is the major component of heterochromatin, which indicates its role in the evolution of night vision in this species[105][106] [131,132]. Recently, Cacheux et al.[107] [49] investigated the evolutionary dynamics of AS sequence repeats and their diversity in the Old World monkeys Cercopithecus pogonias and C. solatus using targeted sequencing and FISH mapping. These authors reported evidence of chromosome-specific subfamilies that might have evolved through homogenization. The OwlRep repeat shows ~82% homology with a satellite sequence termed HSAT6, which is a 126 bp long tandem centromeric repeat. The HSAT6 sequence was also detected in the owl monkey genome, and comparative analysis revealed its broad distribution among hominoids and New World and Old World monkeys. Phylogenetic analysis confirmed that OwlRep evolved from HSAT6[106] [132].
In addition to AS, another satellite family, termed the beta satellite, is distributed in the heterochromatin of primates[108][109][110] [133,134,135]. Beta satDNAs are repeats that comprise ~68 bp monomers. They are predominantly organized in the shorter arm of acrocentric chromosomes and arranged in stretches several kb in length[109][110][111][112][113][114] [136,137,138,139]. The beta satDNA repeats can form complexes with arrays of specific repeats, termed D4Z4 repeats, at certain acrocentric loci, such as 10q26 and 4q35[115][116] [140,141]. Evolutionary analyses involving cloning and FISH experiments have predicted that 4q35 containing D4Z4 repeats might represent an ancestral locus with an extensively radiated sequence region that evolved after the divergence of hominoids and Old World monkeys[117][118][119] [142,143,144]. The origin and evolution of beta satDNA vary in diverse species of hominids, such as humans, chimpanzee, and gorilla[120][121] [145,146]. FISH mapping data confirm that D4Z4 is also conserved in Old World and New World monkeys, whereas in primates distantly related to humans (e.g., lemurs), this sequence has retained tandem repetition but conservation is limited to promoter regions[122] [147]. Genomic analysis of orangutan has revealed the origin of beta satDNA in earlier ancestors of hominoids and shows that these repeats are preferentially located in pericentromeres[110] [135]. This study concluded that these repeats originated as low copies, remained non-duplicated in the early ape ancestors, and later evolved as duplicons acquiring the typical characteristics of classical satellites in humans and other primates. Adjacent to ASs, the classical non-alphoid satDNA repeat families I, II, and III are located in pericentromeres of human chromosomes[67] [95]. The human genome includes the Sat III family, which is composed of GGAAT and GGAGT repeat sequences in different percentages.
The satellite III family is mainly localized on the short arm of acrocentric chromosomes in humans and other primate species. This family is also present in the chimpanzee, gorilla, and orangutan genomes[123][124] [148,149]. The chromosomal organization of this satellite family has provided interesting evolutionary insights into primate genomes[79] [105]. Sequence comparisons have detected variation across different primate species and suggest that the Sat III family might have appeared ~16–23 million years ago in Hominoidea[79] [105]. The evolutionary origin and extensive diversification of centromeric satellites in primate genomes remain unclear; however, it is speculated that TEs are the possible progenitors and sources that form novel satellites by insertions into existing satellite regions[93] [119].
The telomere is located at the end of the chromosome and is enriched with non-coding, repetitive DNA sequences. The terminal 500 kb region of each chromosome arm is the so-called subtelomeric region[125] [150]. Both the telomere and the subtelomere have a high density of satDNA repeats. Telomeric regions of the primate genome show a high frequency of minisatellites, which also occur at other chromosomal loci[126][127] [67,151]. The bulk of telomere-specific regions is mainly composed of (TTAGGG)n microsatellites in humans[128] [79]. Adjacent to the telomere, the subtelomere region is mostly enriched in rapidly evolving satellite repeats with variable levels of repetitiveness and size [30][129][130] [57,152,153]. Although these subtelomeric satellites can be species-specific and often chromosome-specific, there are also satellites that remain highly conserved[131] [154]. The microsatellites (CCCTAA)n, (CCCCAA)n, and (CCCTCA)n are present in telomeres of primates [132][155], whereas (CCCGAA)n is restricted to subtelomeres [133][156] (Figure 2b). In New World monkeys, the subtelomeres can carry novel satDNA sequences. The subtelomeric regions of callitrichid monkeys harbor a satellite termed MarmoSAT that is composed of a 171 bp motif [134][157]. MarmoSAT occurs as a monomer, whereas in the common marmoset (Callithrix jacchus) it is organized in HORs with a sequence of 338 bp. Recently, some intriguing groups of satDNA sequences enriched with AT nucleotides, termed StSats, have been reported in telomeres of humans and great apes, including bonobo, chimpanzee, gorilla, and orangutan[135] [47]. The StSats are located in proximity to telomeric regions [136][137][138] [158,159,160]. Astonishingly, these satellites are very highly enriched in the gorilla and chimpanzee genomes compared with their abundance in humans[135] [47]. Previously, it was hypothesized that these repeats occurred in hominid ancestors and were lost in humans[136][137][138] [158,159,160].
The abundance of StSat repeats in the bonobo, chimpanzee, and gorilla genomes indicates that these sequences might contribute to important genomic functions in these species. Several functions have been proposed for these repeats, including roles in meiosis, telomere clustering, and control of replication duration in telomeric regions[136][137][138] [158,159,160].
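The telomeric motifs listed in this section are written from either strand: (CCCTAA)n is the reverse complement of the (TTAGGG)n repeat named for human telomeres, so the two notations describe the same double-stranded repeat. The short Python sketch below shows the conversion; the helper name is hypothetical and only the unambiguous bases A, C, G, and T are handled.

```python
# Translation table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


for motif in ("TTAGGG", "CCCCAA", "CCCTCA", "CCCGAA"):
    print(motif, "->", reverse_complement(motif))
# TTAGGG -> CCCTAA
# CCCCAA -> TTGGGG
# CCCTCA -> TGAGGG
# CCCGAA -> TTCGGG
```

When comparing motif catalogs from different studies, normalizing every motif to one strand in this way avoids counting the same repeat twice.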
Figure 2. Schematic illustration of satellite DNA repeats and their organization in primate genomes. (a) (i) Primate centromeric (red) and pericentromeric (green) regions are enriched with alpha satellite (AS) DNA as the most abundant satellite repeats of primate genomes and form the bulk of the heterochromatin core. (ii) A sketch highlighting the orthologous chromosomes and centromeric repositioning as evolutionary new centromeres (ENCs) between human and rhesus macaque. The circos plot depicts the syntenic relationship between the two genomes. Circos graphics was plotted using Synteny Portal [117]. Note that human chromosome 6 is completely orthologous to macaque chromosome 4, with evolved centromeres [139][140][109,116]. (iii) The AS constitute the tandem repeat units (blue triangles) and can be either organized as disordered arrays (monomeric) mostly located in pericentromeres, or highly ordered in a head-to-tail fashion (HORs) forming longer arrays in centromeres. Some monomers may also have a short sequence termed the CENP-B box (yellow line), which binds the centromeric regions to the DNA-binding proteins. Diverged monomers (orange and dark triangles), and interspersed repeats (purple rectangles) are also depicted. (b) Telomeric and subtelomeric regions of primate chromosomes are enriched with distinct microsatellites (light blue) and minisatellites (dark blue). Various primate-associated satellite examples are shown.
The distribution of two distinct satellite repeats, termed Cap-A and Cap-B, was reported in a New World monkey species, Cebus apella[141] [161]. The Cap-A sequence is 1500 bp long and forms heterochromatic blocks in the interstitial sites of chromosome 11 and a few telomeric regions. This suggests that this sequence underwent a new episode of amplification in New World monkeys. This satDNA repeat is absent in most marmoset species and present in species of the family Cebidae. By contrast, the Cap-B satellite, which is 342 bp long, is mainly localized in the centromeric regions of many chromosomes of New World monkeys. The Cap-B monomer sequence shares more than 60% identity with AS repeats, which indicates that Cap-B might be the New World monkey homolog of Old World monkey AS repeat sequences. Telomeric satDNA sequences can participate in the formation and maintenance of telomeres, and may have an incidental role in cases where conventional telomeric repeats are lost. In this way, telomeric ends are stabilized by satDNA[142] [162]. Further, it has been demonstrated that telomere-like sequences interspersed within subtelomeric DNA may also play a role in subtelomeric recombination and transcription, via the alternative lengthening of telomeres pathway, and in telomere healing[143] [163]. It is necessary to identify and characterize the telomeric/centromeric satDNA sequences, particularly at breakpoint sites, because of their role in mediating the chromosomal rearrangements[144] [164] that occurred during primate evolution. Such analyses have been performed with regard to the gorilla-specific translocation[145] [165] as well as the chromosome-scale variations that distinguish human and chimpanzee chromosomes. Various hotspot rearrangement regions of the gibbon genome have also been characterized[146] [166]. In contrast to the great apes, gibbons have chromosomes with higher levels of rearrangement compared to ancestral primate karyotypes[147] [167].
A comparison of human and chimpanzee karyotypes showed that two ancestral chromosomal homologs of chimpanzee chromosomes 12 and 13 underwent a fusion event to give rise to human chromosome 2[148] [168]. This fusion was mediated by recombination between telomeric satDNA repeats of the two sub-metacentric ancestral chromosomes. The hyper-expanded repeats are localized in subtelomeric regions of chimpanzee chromosomes. These repeat enriched regions are also prone to other types of rearrangement events such as duplicative transpositions and inter-chromosomal sequence variations[149] [169]. Since many primate-specific rearranged loci are enriched with high-copy repetitive sequence elements such as alpha satDNA repeats, SINEs, LINEs, and LTRs, a range of different molecular mechanisms were probably involved in promoting chromosomal breakage during the evolution of primate genomes[144] [164]. Genome-wide scale analyses at higher resolution are necessary to determine the precise mechanisms underlying the different types of rearrangement, and to assess their relative contribution to the process of evolutionary change.