1. G-Quadruplex-Binding Proteins
1.1. Detection of G-Quadruplex-Binding Proteins
The G4 structure is highly dynamic in vivo and depends on the cell type and chromatin state
[1]. Meanwhile, the formation and unwinding of G4 structures across the whole genome and transcriptome are directly or indirectly regulated by G4BPs, thereby affecting various biological processes
[1][2]. Thus, identifying G4BPs and studying them in depth can help to explain G4–protein interactions and their biological roles in vivo. These studies may further inspire the development of medical applications for these proteins.
G4BPs are most often identified by biochemical experiments. Commonly used methods include affinity chromatography, quantitative methods based on mass spectrometry, and fluorescence resonance energy transfer (FRET). Affinity chromatography is often combined with mass spectrometry to separate proteins that bind to specific G4 motifs
[3]. For instance, this method was applied to identify proteins binding to G4s in the 5′ UTR of tumor-associated mRNA
[4]. FRET is a spectroscopic technique that provides information about the conformation and dynamics of biomolecules. It has been widely used because it can detect whether there is a direct interaction between G4 structures and proteins in vivo
[5][6]. In addition to performing biochemical experiments, computational analyses could be conducted to identify G4BPs. For example, putative G4 motifs can be predicted at known binding sites of nucleic acid-binding proteins, or the computational modeling of structural features may be exploited to discover new G4BPs
[1][7][8][9].
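The motif-at-binding-site idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the cited studies: it assumes the widely used sequence pattern of four runs of at least three guanines separated by loops of 1 to 7 nucleotides, which this text does not itself specify. The function names, the toy sequence and the binding-site coordinates are all hypothetical.

```python
import re

# Assumed pattern (not given in the text): four G-runs of >= 3 guanines,
# separated by three loops of 1-7 nucleotides each.
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def find_putative_g4(seq):
    """Return (start, end, motif) for each non-overlapping match on the given strand."""
    seq = seq.upper().replace("U", "T")  # lets the same scan accept RNA input
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


def g4_in_binding_sites(seq, sites):
    """Map each (start, end) binding site to True if a putative G4 overlaps it."""
    hits = find_putative_g4(seq)
    return {
        site: any(start < site[1] and site[0] < end for start, end, _ in hits)
        for site in sites
    }


# Hypothetical toy input: four telomere-like repeats flanked by AT-rich sequence.
toy_seq = "ATAT" + "GGGTTAGGGTTAGGGTTAGGG" + "ATATATAT"
toy_sites = [(0, 10), (26, 33)]  # made-up protein binding-site coordinates

print(find_putative_g4(toy_seq))
print(g4_in_binding_sites(toy_seq, toy_sites))
```

The scan covers only the strand it is given; a C-rich match on the complementary strand would need the reverse complement to be searched as well. A sequence match only flags a candidate, so binding would still have to be confirmed experimentally.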
Recently, great advances have been made in the identification of G4BPs. Because affinity purification experiments do not take the native chromatin state into consideration, Shankar Balasubramanian et al. pioneered a co-binding-mediated protein profiling (CMPP) approach for the exploration of DNA G4BPs in living cells
[10]. Researchers designed small-molecule ligands that specifically target DNA G4 in cells so that the probes could approach G4BPs with minimal interference with G4-protein interactions and enable labeling by subsequent photoproximity crosslinking
[10]. The strategy was employed to identify hundreds of potential G4BPs, and finally in vitro experiments confirmed the binding specificity of several candidate proteins. Overall, this approach laid the foundations for the subsequent investigation of new G4BPs.
In summary, the detection methods comprise in vivo, in vitro and in silico approaches. Generally, in vivo and in silico approaches are employed to identify potential G4BPs, while in vitro approaches are used to confirm G4-protein interactions.
1.2. DNA G-Quadruplex-Binding Proteins
Recent discoveries related to the involvement of DNA G4BPs in the regulation of fundamental cellular functions are discussed in the following section.
1.2.1. Telomeric G-Quadruplex-Binding Proteins
The human telomeric sequence was one of the first sequences discovered to form G4 structures. Telomeres are nucleoprotein complexes that constitute the ends of eukaryotic chromosomes and play crucial roles in maintaining the integrity and stability of the genome
[11][12]. When specific proteins bind to telomeric DNA, they prevent not only the degradation of the chromosome ends by nucleases, but also the recognition of these ends as broken fragments by the DNA repair machinery
[12][13][14].
Telomeric DNA is highly conserved among vertebrates and consists of identical short TTAGGG repeats with a guanine-rich single-stranded 3′ overhang
[12]. These repeat sequences have the potential to form a G4 structure. Experiments using G4-specific antibodies and G4 ligands have also confirmed the existence of G4 structures at telomeres in vivo
[11].
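To make the G4-forming potential of the telomeric repeat concrete, the short sketch below counts the guanine tracts in a run of TTAGGG repeats. It rests on one stated assumption: that an intramolecular G4 needs four G-tracts on the same strand. The function names are hypothetical, and the check covers sequence potential only, not folding in cells.

```python
import re

TELOMERIC_REPEAT = "TTAGGG"  # vertebrate repeat unit named in the text


def g_tracts(seq, min_len=3):
    """Return (start, length) for every run of at least min_len guanines."""
    pattern = re.compile("G{%d,}" % min_len)
    return [(m.start(), len(m.group())) for m in pattern.finditer(seq.upper())]


def can_form_intramolecular_g4(n_repeats):
    """Assumption: an intramolecular G4 requires four G-tracts on one strand."""
    return len(g_tracts(TELOMERIC_REPEAT * n_repeats)) >= 4


print(g_tracts(TELOMERIC_REPEAT * 4))   # [(3, 3), (9, 3), (15, 3), (21, 3)]
print(can_form_intramolecular_g4(3))    # False
print(can_form_intramolecular_g4(4))    # True
```

Four consecutive repeats supply four GGG tracts separated by TTA loops, which is the smallest stretch that meets the assumed requirement.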
Mammalian telomeric DNA is bound by a protein complex called shelterin, which protects the DNA termini from being recognized as damaged and prevents the triggering of the repair mechanism (Figure 1)
[15][16][17]. The proteins TRF1 and TRF2 (Telomere Repeat Binding Factor 1 and 2) in shelterin bind to double-stranded telomeric DNA; POT1 (Protection of Telomeres protein 1) binds to the 3′ overhang of telomeric repeats and regulates the folding and unwinding of the G4 structures with its heterodimeric partner TPP1 (TIN2 Interacting Protein)
[11][18][19][20]. TPP1 connects POT1 to TRF1 and TRF2 via TIN2 (TRF1-interacting Nuclear protein 2)
[21]. In addition, studies have found that the helicases WRN (Werner syndrome ATP-dependent helicase) and BLM (Bloom syndrome protein) of the RecQ family are recruited to the telomeres and unfold the G4 structures to maintain the integrity of telomeres and ensure telomere replication
[22][23]. WRN colocalizes with TRF2 and POT1, and both WRN and BLM can bind to POT1 with high affinity, which indicates that the telomeric DNA-binding proteins are essential for the recruitment of helicases
[24].
Figure 1. Schematic diagram of the telomere-associated protein complexes shelterin and CST. Shelterin and CST play crucial roles in telomere maintenance. The TPP1-POT1 subunit of shelterin regulates the folding and unwinding of G4 structures. CST can resolve G4 structures and prevent their formation.
Mammalian cells also contain another telomere-associated protein complex called CST (CTC1-STN1-TEN1), which plays a crucial role in efficient telomere replication and in the maintenance of telomere length (Figure 1)
[21][25][26][27]. Human CST is a single-stranded DNA-binding protein complex that helps to resolve genome-wide replication problems
[21][26]. For example, GC-rich regions of the genome may pose obstacles to DNA replication, because DNA polymerase may stall at a G4. Experiments confirmed that CST can bind to G4s and unfold them
[21]. G4 structures possibly form on the lagging strand template at telomeres where WRN, BLM and POT1 all participate in G4 removal
[21][24][28][29]. However, the presence of CST could make the replication of double-stranded telomeric DNA more effective as this complex unwinds G4 structures more rapidly than POT1
[21].
Other telomere-binding proteins have functions similar to those of these two protein complexes.
1.2.2. G-Quadruplex-Binding Proteins Involved in Replication
The G4 structure has a dual effect on the process of DNA replication. On the one hand, the G4 structure has been demonstrated to support the initiation of DNA replication at the replication origin
[30]. Furthermore, the G4 structure may prevent the uncoupling of the leading- and lagging-strand polymerases, thereby protecting proper replication
[31]. On the other hand, the G4 could hinder the progression of the replication fork and influence DNA synthesis, which may lead to mutations and deletions in the genome. Consequently, helicases usually unfold the G4 structures before replication to maintain genome stability
[31].
FANCJ (Fanconi anemia complementation group J) is a 5′–3′ DNA helicase, which is involved in various biological processes such as DNA damage repair, G4 resolution, homologous recombination and genome stability maintenance
[32]. FANCJ can unfold and remove G4 structures for efficient DNA replication, whereas its absence stalls replication at G4s and eventually leads to DNA damage
[33]. Studies have shown that FANCJ might promote replication at G4s by two independent mechanisms
[34]. In the first, FANCJ may cooperate with the polymerase REV1 to aid replication at the replication fork
[35]. REV1 destabilizes the G4 structures so that FANCJ can unwind them from the other side of the G4 structures. In the second, WRN or BLM may assist FANCJ in binding and unfolding the G4s from the opposite direction in order to promote replication synergistically
[34][36][37].
The helicase Pif1 from yeast is able to bind and unfold G4 structures to support DNA replication. It is not clear whether Pif1 can unwind G4 structures on both strands or whether it preferentially binds G4 structures on one particular strand
[31]. However, recent studies have suggested that the ubiquitin ligase complex protein Mms1 is not only a DNA G4-binding protein, but also assists Pif1 in binding to a specific G4 structure located on the lagging strand. The absence of Mms1 leads to a reduction in Pif1 binding and slow replication at G4 motifs, and ultimately causes G4-dependent genome instability
[38].
1.2.3. G-Quadruplex-Binding Proteins Involved in Transcription
It has been found that about 50% of human genes contain G4 motifs near their promoter region, which indicates that G4s play an essential role in the regulation of gene expression
[24]. When DNA G4 is located at the first intron downstream of the transcription start site (TSS), it blocks the RNA polymerase and suppresses transcription
[39]. However, recent studies have shown that endogenous G4s in promoters are prominent binding sites for multiple transcription factors and are thus invariably linked to high transcription levels
[39][40]. Notably, G4s and their associated transcription factors cooperate to shape the cell-specific transcriptome
[39][41]. In fact, transcription factors account for a significant part of the G4BPs. Statistically, there are 14 transcription factors among the 56 DNA G4-binding proteins in the G4IPDB (G4 Interacting Proteins DataBase)
[42]. For example, SP1 (Specificity protein 1) is a zinc finger transcription factor that can bind to the G4 structures on the c-KIT promoter and regulate the expression of a variety of housekeeping genes
[39]. MAZ (Myc-associated zinc finger) and PARP-1 (Poly [ADP-ribose] polymerase 1) interact with the G4 structures upstream of the transcription start site of KRAS, and both of them are activators of KRAS [1][24][43].
The G4 motif occurs more frequently in proto-oncogenes and regulatory genes than in housekeeping genes and tumor suppressor genes
[24][44][45]. The first reported promoter G4 forms in the nuclease hypersensitivity element III1 (NHE III1), which is located upstream of the P1 promoter of the proto-oncogene c-MYC [32][46]. This guanine-rich region controls 85–90% of the transcriptional activation of the gene, and can fold into an intramolecular parallel G4 that acts as a transcriptional repressor element
[47]. In addition to c-MYC, many genes have been demonstrated to form G4 structures in their promoter regions, such as the proto-oncogenes VEGF [48], KRAS [49], BCL-2 [50] and c-KIT [51]; human platelet-derived growth factor receptor PDGFR-β [52]; human telomerase reverse transcriptase hTERT [53] and other genes
[32]. In particular, the G4s in the promoter regions of the proto-oncogenes have been most intensively studied so far
[1].
Nucleolin (NCL) is a multifunctional phosphoprotein that is most abundant in the nucleolus. Nucleolin is mainly associated with ribosome biosynthesis and also involved in chromatin remodeling, transcriptional regulation, G4 binding and apoptosis
[47]. Nucleolin can bind to the c-MYC G4 with high affinity and promote the formation and stabilization of G4 structures. Luciferase assay results also showed that the overexpression of nucleolin markedly reduces c-MYC-driven transcription
[47]. Another protein, NM23-H2, which belongs to the NM23 family of nucleoside diphosphate kinases (NDPKs), has a completely different structural effect on G4s than nucleolin. It has a variety of functions, including kinase activity, promoter binding, transcriptional regulation and DNA repair
[54]. Experiments have confirmed that NM23-H2 can bind to the c-MYC G4 and promote the unfolding of the G4 structure, thereby activating the transcription of c-MYC [54].
The tumor suppressor protein p53 functions in apoptosis, DNA repair, cell cycle regulation and aging. As a transcriptional regulator, p53 can inhibit the expression of cell cycle regulatory and growth promoting genes via multiple mechanisms and plays a key role in tumor suppression
[55]. Previous studies have found that wild-type p53 (wtp53) and several types of mutant p53 (mutp53) can selectively bind c-MYC and hTERT promoter G4s [56], and that the C-terminal region of p53 is essential for the recognition of the G4. Accordingly, the interaction between p53 and G4 structures in promoter regions of p53 target genes may play an important role in p53-mediated transcriptional regulation
[55].
1.2.4. Other DNA G-Quadruplex-Binding Proteins
Direct evidence has demonstrated that the endogenous human G4 DNA landscape is dynamically shaped by chromatin relaxation or cell status
[57][58]. Indeed, several G4BPs also function in chromatin structure regulation and histone modification
[59][60]. For example, various epigenetic and chromatin remodeling enzymes bind selectively to DNA G4
[1]. Genomic binding sites of the chromatin remodeling protein ATR-X colocalize with GC-rich tandem repeats and CpG islands (CGI) that have the potential to form G4 structures
[61][62].
Guanine-rich sequences are very common around CpG islands, with a high distribution rate of up to 80%
[32][63][64]. The presence of G4 structures is closely related to the hypomethylation of CpG islands in the human genome. Studies have revealed that DNMT1 (DNA methyltransferase 1) interacts with these G4 sites, which is consistent with the results observed in biophysical experiments. Specifically, DNMT1 shows a higher binding affinity to G4 compared with double-stranded, single-stranded or hemimethylated DNA
[65]. Biochemical analyses demonstrated that G4 structures inhibit the enzymatic activity of DNMT1, and that G4 formation also hinders DNMT1, thereby protecting specific CpG islands from methylation and inhibiting local methylation
[65].
In addition, it has been found that G4s colocalize with CTCF (CCCTC-binding factor) binding sites in CpG islands and interact with CTCF in vitro. G4 is also crucial to the localization of CTCF
[66]. CTCF is frequently recruited to CpG islands that are usually hypomethylated. Furthermore, the enrichment of G4s at CpG islands maintains CGI hypomethylation, which may explain the correlation between CpG islands and CTCF
[66]. CTCF also functions as a chromatin remodeling factor with the capability of nucleosome repositioning; therefore, G4 can facilitate the binding of CTCF to genomic DNA by recruiting chromatin proteins
[60].
1.3. RNA G-Quadruplex-Binding Proteins
It is easier for single-stranded RNA to form G4s in guanine-rich regions, and G4 is also an important structural characteristic of mRNA
[11][67]. Recently, in vitro experiments combining high-throughput sequencing with reverse transcriptase stalling at RNA G4s (rG4) have found more than 13,000 loci with the potential to form rG4 structures in the human transcriptome; and immunofluorescence using G4 specific antibodies demonstrated rG4 formation in cells
[68][69]. Notably, the highest abundance of rG4 is in functional regions including the 5′- and 3′-UTRs
[67]. All these observations of the enrichment of rG4s in functionally important regions suggest that they play crucial roles in transcription termination, alternative splicing, translational regulation, and chromosome integrity maintenance
[4][70].
A substantial number of proteins interacting with rG4s have been identified by biochemical experiments, for example hnRNPs, ribosomal proteins and splicing factors
[4][11]. Although there are DNA and RNA G4 specific proteins, their binding proteins have a significant overlap due to structural similarities between DNA and RNA G4s
[11]. Basically, the discrimination between DNA and RNA G4BPs may depend on their different biological functions. It was found that the fragile X mental retardation protein (FMRP) could bind to the G4s in its own mRNA coding region, thereby regulating its own translation through a negative feedback pathway
[71]. Additionally, FMRP is likely to interact with G4s in other mRNAs for translation repression by the recruitment of translation inhibitors, miRNA pathway activation, and direct interaction with ribosomes
[4]. FRAXE-associated mental retardation protein FMR2 could also bind to G4s in mRNAs and function in alternative splicing
[72].
The rG4 in the 5′-UTR of the proto-oncogene NRAS folds into a stable intramolecular parallel G4 structure and has been demonstrated to repress translation in vitro
[4]. A study revealed that the DEAD-box helicase DDX3X, which is involved in several pathways of RNA biology, can bind to NRAS rG4s, and that mutations of DDX3X are associated with tumorigenesis, especially medulloblastoma
[4]. In addition, some helicases such as DHX36 (DEAH-Box Helicase 36) and DDX21 are able to bind and unfold rG4 structures. Another multifunctional helicase, DHX9, shows a binding affinity for several secondary nucleic acid structures including G4s, but it is more inclined to bind RNA substrates. Therefore, helicases that recognize and resolve rG4s may play essential roles in post-transcriptional biological processes such as mRNA translation, transport and stability
[67].
Although the vast majority of rG4s are present in mRNAs, others are also detected in long non-coding RNAs (lncRNAs), including nuclear paraspeckle assembly transcript 1 (NEAT1). NEAT1 is involved in gene regulation as a scaffold for the assembly of paraspeckles
[73]. An upregulation of NEAT1 could be observed in the majority of solid tumors such as lung cancer, esophageal cancer and hepatocellular carcinoma, and NEAT1 also plays a critical role in neurodegenerative diseases and viral infection
[74][75]. Evidence has shown that nascent NEAT1 transcripts interact directly with the non-POU domain-containing octamer-binding protein (NONO) through their conserved rG4 motifs. The primary paraspeckle formation is required for the recruitment of NONO to NEAT1 transcripts, which stabilizes NEAT1 and lays the foundation for the recruitment of additional protein components to facilitate subsequent steps of assembly and maturation
[75].
2. Structural Properties of G-Quadruplex-Binding Proteins
2.1. RGG Domain
The RGG (Arginine-Glycine-Glycine) domain, also termed the RGG/RG motif or GAR (glycine-arginine-rich) domain, is composed of repeat sequences rich in RGG or RG and is highly conserved in evolution (Figure 2)
[76][77]. Researchers have discovered RGG/RG motifs in more than 1000 human proteins that influence transcription, precursor mRNA splicing, DNA damage signaling pathways, mRNA translation, and apoptosis
[76]. A study analyzed the amino acid composition of 77 human G4-binding proteins
[8]. Compared with a random subset of the human proteome and a well-defined group of nucleic acid binding proteins, G4BPs showed a significant enrichment of glycine and arginine and a high abundance of the dipeptides RR, GR and RG. The study also investigated the presence of a conserved RG-rich motif, which is a typical characteristic of G4BPs
[8].
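The kind of composition analysis described above can be illustrated with a minimal sketch. It is not the procedure of the cited study: the function names are hypothetical, the two sequences are invented toy strings rather than real proteins, and no statistical test is applied. The code simply computes the fraction of glycine and arginine residues and the fraction of RR, GR and RG dipeptides in a sequence.

```python
from collections import Counter


def residue_fraction(seq, residues="GR"):
    """Fraction of each listed residue among all residues in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {r: counts[r] / len(seq) for r in residues}


def dipeptide_fraction(seq, dipeptides=("RR", "GR", "RG")):
    """Fraction of each listed dipeptide among all overlapping residue pairs."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence too short for dipeptides")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {d: pairs[d] / total for d in dipeptides}


# Invented toy sequences, used only to show the calculation.
rg_rich_toy = "RGGRGGRGGA"
background_toy = "MKTAYIAKQL"

for name, seq in [("RG-rich toy", rg_rich_toy), ("background toy", background_toy)]:
    print(name, residue_fraction(seq), dipeptide_fraction(seq))
```

In a real comparison the same fractions would be computed over a set of G4BPs and over reference protein sets, and the difference would be assessed statistically.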
Figure 2. Structural properties of G-quadruplex-binding proteins. RGG/RG motifs are from NCL, hnRNP U and CIRBP. The RRM domain structure is derived from the Protein Data Bank with structure code 2KRR (NCL). The RRM domain is an αβ sandwich structure composed of one four-stranded antiparallel β-sheet and two α-helices packed against the β-sheet. The OB-fold domain structure is derived from the Protein Data Bank with structure code 5W2L (CTC1). The OB-fold domain is a β-barrel formed by a five-stranded antiparallel β-sheet.
The RGG domain is usually found in G4BPs and it has been shown to mediate G4-protein interactions. For example, hnRNP U contains the RGG domain
[12]. The C-terminal region of nucleolin, composed of RNA-binding domains (RBDs) 3 and 4 and the RGG domain, is essential for the recognition of the c-MYC NHE III1 sequence and the promotion of G4 formation
[9]. In addition, more than half of the newly identified NRAS rG4BPs contain the GAR domain, which has been shown to be critical for the NRAS rG4-DDX3X interaction
[67].
The short residue gap between RGG repeats in the RGG domain frequently contains aromatic amino acids. Research on the binding mechanisms of the RGG domain revealed that the short RGG motif segment in this domain contributes greatly to the G4 binding affinity. Huang et al. found that the internal arrangement of RGG repeats and gap amino acids is more fundamental to G4-protein interactions than the length of RGG peptides and the number of RGG repeats
[9]. Experiments demonstrated that peptide 12, which has seven RGG repeats, could efficiently bind to DNA G4s. Based on these results, they discovered that the cold-inducible RNA-binding protein (CIRBP), which contains peptide 12, could bind G4s both in vitro and in vivo, and that this RGG peptide is essential for the G4 recognition of CIRBP
[9]. The team provided a great deal of insight into the interaction between the RGG peptide and G4s, and identified a new G4-binding protein based on the exploration of G4-binding RGG motifs. In summary, this approach also adds a new dimension to the discovery of other G4BPs.
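The features discussed in this section (the number of RGG repeats, the residues in the gaps between them, and whether those gaps contain aromatic amino acids) can be extracted from a peptide with a short sketch like the one below. The function names are hypothetical, the peptide is an invented toy string and not peptide 12 or any sequence from the cited work, and treating F, W and Y as the aromatic residues is an assumption of this example.

```python
import re

AROMATIC = set("FWY")  # assumed definition of aromatic residues for this sketch


def rgg_repeats(peptide):
    """Return the start position of each non-overlapping RGG tripeptide."""
    return [m.start() for m in re.finditer("RGG", peptide.upper())]


def gaps_between_repeats(peptide):
    """Return the residues lying between consecutive RGG repeats."""
    peptide = peptide.upper()
    starts = rgg_repeats(peptide)
    return [peptide[a + 3:b] for a, b in zip(starts, starts[1:])]


def aromatic_gap_count(peptide):
    """Count the gaps that contain at least one aromatic residue."""
    return sum(1 for gap in gaps_between_repeats(peptide) if AROMATIC & set(gap))


toy_peptide = "RGGFRGGYGRGGA"  # invented example, not a real protein fragment
print(rgg_repeats(toy_peptide))           # [0, 4, 9]
print(gaps_between_repeats(toy_peptide))  # ['F', 'YG']
print(aromatic_gap_count(toy_peptide))    # 2
```

Such descriptors only summarize the sequence; as the text notes, the arrangement of repeats and gap residues has to be related to binding through experiments.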
2.2. RRM Domain
Several G4BPs, such as hnRNPs, nucleolin, CIRBP, TLS/FUS (translocated in liposarcoma, also known as fused in sarcoma), and EWS (Ewing’s sarcoma), have shared structural features, such as RNA recognition motifs (RRM) and RGG domains
[78]. RRM, also known as the RNA-binding domain (RBD) or ribonucleoprotein domain (RNP), is one of the most highly conserved nucleic acid binding domains; it occurs in approximately 0.5–1% of human genes and folds into an αβ sandwich structure composed of one four-stranded antiparallel β-sheet and two α-helices packed against the β-sheet (Figure 2)
[79][80][81]. Proteins with RRM are implicated in the regulation of transcription, translation, RNA processing, RNA export and stability
[82], and they are also common in G4BPs.
The RRM and RGG domains at the C-terminus of nucleolin are necessary to inhibit and induce the formation of the G4 on the c-MYC promoter. The RRM in nucleolin can form G4s with guanine-containing single strands, but it unfolds G4s without guanines in the single strands of the 5′ and 3′ termini
[82]. The RRMs of hnRNP A1 and hnRNP D are able to bind and unfold G4s. The crystal structure of the two RRMs of hnRNP A1 with single-stranded telomeric DNA showed that RRM1 and RRM2 interact directly with d(TAGG) and d(TTAGG), respectively
[83]. NMR analysis showed that the RRM of hnRNP D recognizes d(TAG) within d(TTAGGG)
[84]. A recent study indicated that a novel G4-binding protein SLIRP (stem-loop interacting RNA binding protein) also contains the RRM domain, which is required for efficient interaction between DNA G4s and SLIRP
[85]. Furthermore, the sequence alignment for the RRMs derived from SLIRP and other G4BPs such as hnRNP A1 and nucleolin showed similar amino acid composition of these domains
[85]. The findings of these studies shed light on the roles of the RRM domain conserved in many nucleic acid binding proteins and contribute greatly to the exploration of its biological functions.
2.3. OB-Fold Domain
The oligonucleotide/oligosaccharide binding (OB)-fold is a β-barrel structure comprising a five-stranded antiparallel β-sheet, and this barrel is capped by an α-helix located between the third and fourth strands (Figure 2)
[86]. The OB-fold structure is highly dynamic, and the dynamic properties enable OB-fold containing proteins to participate in multiple cellular pathways, including the re-initiation of DNA synthesis and the maintenance of genome stability
[87].
Replication protein A (RPA) is a single-stranded DNA-binding complex with three subunits that unfolds G4s and is involved in various biological processes such as DNA replication, repair and recombination. Although both RPA and POT1-TPP1 can bind to telomeric overhangs, RPA is more abundant in cells
[11]. The CST complex resembles RPA in that both harbor comparable arrays of OB-folds and possess small subunits with similar structures
[21]. Since CST contains multiple OB-folds (one each in STN1 and TEN1, and seven in CTC1), it has been proposed that CST could play distinct roles in replication using a dynamic binding mechanism similar to that observed in RPA
[21][88][89]. The dynamic properties of RPA binding due to the microscopic dissociation and re-association of individual OB-folds allow RPA to diffuse along the single-stranded DNA and to melt unwanted DNA secondary structures
[21]. In addition, POT1 also contains the OB-fold domain, and FRET has shown that it is critical for gradual G4 unfolding
[28].
DHX36 can bind DNA and RNA G4 structures with high affinity. It is a multifunctional helicase involved in G4-dependent transcriptional and post-transcriptional regulation, and plays a critical role in heart development, hematopoiesis and embryogenesis in mice
[90]. The DHX36-specific motif at the N-terminus of the protein forms a DNA-binding-induced α-helix that, together with the OB-fold-like subdomain, selectively binds to parallel G4s
[90].
This entry is adapted from the peer-reviewed paper 10.3390/biom12050648