Open chromatin is typically depleted of nucleosomes, which allows transcription factors (TFs) to physically interact with regulatory sequences and recruit the transcriptional machinery that mediates cell fate commitment and differentiation
[7][8]. It is well known that active CREs are normally located in accessible chromatin regions (ACRs)
[8]. To characterize CREs, different methods have been developed to profile chromatin accessibility at the genome scale. The Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq) identifies open chromatin by using a hyperactive Tn5 transposase that inserts sequencing adaptors into accessible regions harboring potential CREs
[9]. Compared with other chromatin accessibility assays like DNase-seq (DNase I hypersensitive sites sequencing), ATAC-seq requires a smaller number of cells, making it especially suitable for rare samples with limited cells
[10]. However, this bulk approach averages the signal across a population of cells, potentially obscuring the intricate cellular dynamics and regulatory programs specific to individual cells. The need to dissect the complexity and intercellular variability of chromatin accessibility has directed researchers toward single-cell techniques. Single-cell ATAC-seq (scATAC-seq) addresses these limitations by profiling DNA accessibility at cellular resolution
[11]. With the advancement of scATAC-seq, researchers can now dissect the heterogeneity of chromatin states among single cells within a population, gaining a more holistic insight into the gene regulatory networks (GRNs) and epigenetic mechanisms governing cellular identity. This technology has been widely applied in biomedical science, ranging from immunology
[12][13] and stem cell developmental biology
[14][15] to tumorigenesis
[16][17], and it has been gradually expanded to the field of plant developmental biology
[18][19][20].
2. Challenges for Application of scATAC-seq in Plants
scATAC-seq is a powerful tool for mapping the epigenomic landscape of complex organs at single-cell resolution and has been widely applied in animal science. However, its application in plants faces several challenges. Nuclei isolation is key to constructing high-quality libraries, yet only a few reports describe nuclei isolation methods for plant scATAC-seq. Most of these methods have been developed for immature or young organs in model plants, such as the root
[19][21][22], inflorescences
[23], and shoot apex
[24]. A recent report on maize prepared the scATAC-seq library for six organs by nuclei isolation from fresh or flash-frozen tissue
[20].
The lack of analytical tools is another obstacle which hinders the application of scATAC-seq in plants. It is well known that most bioinformatic pipelines for scATAC-seq are tailor-made for humans and mice, with parameters specifically optimized based on empirical and computational estimations from animal genomic features
[25]. Moreover, many tools only include built-in databases for mammals, making the creation of custom databases for plants a time-consuming and technically challenging task
[26]. Additionally, cell type annotation is more challenging in scATAC-seq than in scRNA-seq, primarily because the relationship between noncoding genomic regions and cell identity has not been fully elucidated in plants.
Lastly, the captured accessible regions need to be accurately mapped to a reference genome for the discovery of open or closed regions in a given cell
[27]. Therefore, a well-annotated genome is essential for downstream analyses. High-quality genome assemblies can more accurately determine the locations of ACRs. Poor genome assembly can lead to incorrect mappings, which may compromise peak calling and annotation.
2.1. Preparation of Nuclei Suspensions Compatible with scATAC-seq
A critical factor in the success of scATAC-seq assays is the isolation of sufficient high-quality nuclei whose overall structure remains intact throughout the experiment, which protects the DNA from degradation. A high-quality nuclei suspension is typically characterized by a low clump rate and intact nuclear membranes. Nuclei clumps are likely to distort cell/nuclei counts, clog the Chromium chip, and cause GEM (Gel Bead-in-Emulsion) generation to fail in the 10x Genomics microfluidic system. The native architecture of chromosomes may also be disrupted by improper mechanical handling or by endonuclease cleavage in broken nuclei, leading to the inaccurate enrichment of open chromatin. Therefore, the establishment of reliable nuclei isolation protocols is important for scATAC-seq.
Several well-established protocols are currently available for isolating nuclei from frozen or fresh mammalian samples for scATAC-seq
[28][29][30][31][32]. However, only a few of the documented methods have been specifically designed for nuclei isolation in plants, most of which are used for bulk assays and have not been evaluated for their applicability in single-cell omics analysis
[33]. Unlike animal cells, plant cells are enclosed by rigid cell walls and are rich in various secondary metabolites, especially in some woody species such as poplar. These characteristics can strongly interfere with nuclei isolation and possibly inhibit subsequent Tn5 transposition
[34]. Hence, the effective removal of these impurities remains a problem for scATAC-seq in plants.
Recently, there have been several reports applying scATAC-seq in plant science, where different nuclei isolation protocols compatible with scATAC-seq have been developed
[19][20][21][23][24][35][36][37][38]. Briefly, isolation protocols often involve laborious tissue homogenization, filtering the slurry through a cell strainer, and nuclei sorting by flow cytometry, and these protocols often require optimized chemical usage and centrifugation conditions for specific organs or species (
Figure 1). The initial step for crude nuclei isolation varies among protocols: tissues are homogenized either by chopping with a sharp razor blade or by grinding in liquid nitrogen. Crude nuclei can also be obtained by protoplast lysis, but this may not be applicable to plant tissues that are recalcitrant to protoplasting. Notably, FACS is a routine step in many nuclei isolation procedures, where it is used to separate intact nuclei from impurities such as cell debris
[20][21][23][24][35][37]. Whether or not to exploit FACS in plant nuclei isolation remains an open question. Recently, a handful of researchers have developed FACS-free nuclei isolation methods suitable for snATAC-seq (Single-nucleus ATAC sequencing)/snRNA-seq (Single-nucleus RNA sequencing), which generate high-quality libraries, as indicated by major metrics
[19][36][38]. Additionally, other studies have suggested that the need for FACS depends on the diameter of the nuclei in a given tissue: nuclei with a diameter below 30 μm can be purified by several rounds of filtering through a cell strainer, whereas only nuclei with a diameter greater than 30 μm require FACS. In summary, to achieve optimal results in scATAC-seq, nuclei isolation methods should be selected with care and adjusted where necessary for particular plant species.
Figure 1. General procedure for nuclei isolation for scATAC-seq in plants. Fresh or frozen tissues are homogenized in pre-cooled nuclei isolation buffer or liquid nitrogen; the crude nuclei suspension is filtered using a cell strainer; the filtered suspension is loaded onto the upper layer of a density gradient to remove cell debris and organelles; nuclei are stained with DAPI (4′,6-diamidino-2-phenylindole) and further purified by FACS-based sorting; purified nuclei are then counted and diluted to the proper concentration for library construction.
Maintaining nuclei integrity and reducing nuclei clump rates have always been key in nuclei isolation for scATAC-seq. However, the widespread use of detergents in nuclei isolation poses considerable challenges to nuclei quality as these chemicals can undermine the structure of the nuclei membrane. To tackle these challenges, formaldehyde fixation is introduced prior to organelle lysis with a detergent during nuclei isolation in plants, greatly decreasing the occurrence of nuclei clumps and mitigating nuclei membrane disruption caused by the detergent
[38]. Formaldehyde fixation makes nuclei resilient to detergent washes, so contamination from organelle DNA can be removed efficiently, which reduces the doublet rate and makes FACS unnecessary.
2.2. Analytical Tools Compatible with Plants
Most analytical tools, such as Cicero (version 1.20.0), SnapATAC (version 2.0), Signac (version 1.12.0), and ArchR (version 1.0.1), have been developed for mammalian model species such as humans and mice. They do not work out of the box for plants and require adaptation, which may be difficult for researchers with limited bioinformatics skills. Furthermore, most of these tools only include pre-defined databases for mammals, which are not applicable to plant species. Building a custom database for plants is a time-consuming process that demands substantial bioinformatic expertise.
Because scATAC-seq platforms are diverse (droplet-based or split-pool-based), no single tool meets the data preprocessing requirements of all platforms. For example, both cellranger and SnapATAC2 (version 2.5.3) are only applicable to the 10x Genomics scATAC-seq platform, which limits their scalability. Furthermore, most analytical tools address only specific problems and do not provide end-to-end analysis (from read alignment to downstream clustering and cell annotation). Finally, the latest single-cell sequencing technologies can profile multiple modalities (multi-omics) in the same set of cells. However, most current analytical tools can analyze single-cell data of only one modality and cannot integrate multi-omics data
[39].
2.3. Challenges for Cell Type Annotation in scATAC-seq
The open chromatin regions identified by single-cell ATAC-seq mainly fall within non-coding regions. At present, there is a lack of databases related to cis-acting elements and cell type annotation in plants. It is challenging to directly annotate cell types based on differences in peak information. While the read count in promoter regions for known cell type marker genes can be calculated as the predicted value of associated adjacent gene expression, some studies have found that simple promoter accessibility is not an ideal predictor of gene expression. Additionally, for most non-model species, tissue and cell-type-specific marker gene databases are not available.
Determining which genes are highly specific to a given cell type is a commonly used strategy for assigning an identity to unknown cell groups in single-cell omics analysis. The identification of marker genes with high cell-type specificity is therefore important for cell type annotation. Identifying marker genes in scATAC-seq is relatively challenging because the features used in this assay are a set of genomic coordinates, which are highly dataset-dependent and therefore hard to interpret and to compare across datasets. One solution to this problem is to calculate gene scores based on read counts from the gene body and promoter regions, which results in a cell-by-gene matrix similar to that of scRNA-seq.
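The gene-score idea described above can be illustrated with a minimal sketch. It assumes a precomputed cell-by-peak count matrix and sums, for each gene, the counts of all peaks that overlap the gene body plus an upstream promoter window. The function name, the 2 kb promoter length, and the toy coordinates are illustrative assumptions, not the scoring model of any particular tool (Signac, ArchR, and SnapATAC2 each use their own weighting schemes).

```python
import numpy as np

def gene_activity_matrix(counts, peaks, genes, upstream=2000):
    """Sum peak counts overlapping each gene body plus an upstream promoter.

    counts:   (n_cells, n_peaks) per-cell read counts in each peak.
    peaks:    list of (chrom, start, end), one per column of `counts`.
    genes:    list of (name, chrom, start, end, strand).
    upstream: promoter length in bp (assumed value; tune per genome).
    """
    counts = np.asarray(counts)
    activity = np.zeros((counts.shape[0], len(genes)), dtype=counts.dtype)
    for g, (name, chrom, start, end, strand) in enumerate(genes):
        # Extend the gene interval upstream of the TSS, respecting strand.
        if strand == "+":
            lo, hi = max(0, start - upstream), end
        else:
            lo, hi = start, end + upstream
        for p, (p_chrom, p_start, p_end) in enumerate(peaks):
            # Half-open interval overlap test on the same chromosome.
            if p_chrom == chrom and p_start < hi and p_end > lo:
                activity[:, g] += counts[:, p]
    return activity

# Toy data (hypothetical coordinates and gene names).
peaks = [("Chr1", 500, 900), ("Chr1", 3000, 3400),
         ("Chr1", 9000, 9300), ("Chr2", 100, 400)]
genes = [("geneA", "Chr1", 1000, 3200, "+"),
         ("geneB", "Chr1", 7000, 8500, "-")]
counts = [[2, 1, 0, 5],
          [0, 0, 3, 1],
          [1, 4, 2, 0]]

activity = gene_activity_matrix(counts, peaks, genes)
print(activity)  # rows are cells, columns are geneA and geneB
```

The nested loop is quadratic in the number of genes and peaks and is only meant to make the overlap rule explicit; a real analysis would rely on an interval index or on the gene-activity functions shipped with the tools named above.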
Presently, many analytical tools are available to identify marker genes for scATAC-seq, such as Signac (version 1.12.0), SnapATAC2 (version 2.5.3), ArchR (version 1.0.1), and SEMITONES (https://github.com/ohlerlab/SEMITONES, accessed on 22 January 2024). Most of these tools extract read counts from open chromatin regions within the gene body and promoter to calculate a gene activity score for accessible genes. Marker genes are then identified using a strategy similar to that used in scRNA-seq, and cell type annotation is implemented by interrogating the chromatin accessibility at canonical marker genes. Unlike Signac (version 1.12.0), ArchR (version 1.0.1), and SnapATAC2 (version 2.5.3)
[26][40][41], which are specifically designed for scATAC-seq analysis, SEMITONES is more scalable as it can identify marker features from scRNA-seq, scATAC-seq, and even spatially resolved transcriptome in a cluster-independent manner
[42]. SEMITONES calculates enrichment scores to identify cell-type-specific peaks. Because annotating cell types directly from marker peaks is a well-known difficulty, the GREAT (Genomic Regions Enrichment of Annotations Tool) algorithm is used to assign significantly accessible peaks to nearby genes. Cell type annotation is then made possible by the assigned genes and their enriched GO terms
[42]. SEMITONES was initially developed for human data and benchmarked with other tools to confirm its reliability and efficiency. Recently, this tool was applied to the scRNA-seq root atlas comparison between wild-type and cell type mutants in
Arabidopsis thaliana, indicating its applicability to the plant science community
[43].
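The peak-to-gene assignment step mentioned above can be illustrated with a deliberately simplified sketch. GREAT defines regulatory domains around each gene; the code below instead uses a plain nearest-TSS rule with a distance cutoff, which is an assumption made for brevity and not the GREAT algorithm itself. The function name, the 10 kb cutoff, and the toy coordinates are hypothetical.

```python
from bisect import bisect_left

def assign_peaks_to_nearest_gene(peaks, tss, max_dist=10_000):
    """Map each peak to the gene whose TSS is closest to the peak midpoint.

    peaks:    list of (chrom, start, end).
    tss:      list of (chrom, position, gene_name).
    max_dist: peaks farther than this from any TSS stay unassigned (None).
              The default is an assumed value, not a recommendation.
    """
    by_chrom = {}
    for chrom, pos, name in tss:
        by_chrom.setdefault(chrom, []).append((pos, name))
    for entries in by_chrom.values():
        entries.sort()

    assigned = []
    for chrom, start, end in peaks:
        entries = by_chrom.get(chrom, [])
        mid = (start + end) // 2
        i = bisect_left(entries, (mid,))
        # Candidates are the TSSs immediately left and right of the midpoint.
        candidates = entries[max(0, i - 1): i + 1]
        best = min(candidates, key=lambda e: abs(e[0] - mid), default=None)
        if best is not None and abs(best[0] - mid) <= max_dist:
            assigned.append(best[1])
        else:
            assigned.append(None)
    return assigned

# Toy annotation and peaks (hypothetical).
tss = [("Chr1", 1000, "g1"), ("Chr1", 20000, "g2"), ("Chr2", 500, "g3")]
marker_peaks = [("Chr1", 1200, 1600), ("Chr1", 18000, 18400),
                ("Chr1", 60000, 60400), ("Chr3", 10, 200)]

peak_genes = assign_peaks_to_nearest_gene(marker_peaks, tss)
print(peak_genes)
```

The genes returned for a cluster's marker peaks could then be passed to a GO enrichment tool. In compact plant genomes the appropriate distance cutoff is likely to differ from that used for mammals, so it should be treated as a parameter to tune.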
The identification of marker features in most analytical tools is based on Euclidean distance determined by feature differences between the cell cluster of interest and other cell clusters. However, the marker features identified in these tools may not be reliable in some cases. Specifically, features highly enriched in a rather small portion of cells in a given cell group, e.g., less than 5% in a cell cluster, are still likely to be identified as marker features. To avoid this potential problem, a new method called COSine similarity-based marker gene identification (COSG)
[44] was developed, which identifies marker genes based on cosine similarity and is reported to be accurate and scalable. It has been shown to be applicable to scRNA-seq, scATAC-seq, and spatially resolved transcriptomes. COSG (version 1.0.0) handles very large datasets well and completes marker gene or peak identification within a short time frame
[44]. Overall, the marker features identified by COSG (version 1.0.0) are reported to exhibit higher cell-type specificity than those identified by other existing approaches.
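To make the cosine-based idea concrete, the following sketch scores each feature by the cosine similarity between its per-cell accessibility vector and a binary indicator vector for each cluster. It illustrates the underlying principle only and is not the published COSG implementation, which applies additional normalization across groups. The function name and the toy data are assumptions.

```python
import numpy as np

def cosine_marker_scores(X, labels):
    """Cosine similarity between each feature and each cluster indicator.

    X:      (n_cells, n_features) non-negative matrix (gene scores or peaks).
    labels: length n_cells sequence of cluster labels.
    Returns (clusters, scores), scores shaped (n_clusters, n_features).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # One row per cluster: 1 for member cells, 0 otherwise ("ideal marker").
    indicators = (labels[None, :] == clusters[:, None]).astype(float)
    feature_norm = np.linalg.norm(X, axis=0)
    indicator_norm = np.linalg.norm(indicators, axis=1)
    denom = indicator_norm[:, None] * feature_norm[None, :]
    dots = indicators @ X
    # Features with all-zero counts get a score of 0 instead of NaN.
    scores = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    return clusters, scores

# Toy data: 40 cells in cluster "a" and 40 in cluster "b".
labels = ["a"] * 40 + ["b"] * 40
X = np.zeros((80, 2))
X[:40, 0] = 1.0   # feature 0: moderately open in every cell of cluster "a"
X[0, 1] = 50.0    # feature 1: very high in a single cell (2.5% of cluster "a")

clusters, scores = cosine_marker_scores(X, labels)
print(clusters)
print(np.round(scores, 3))
```

In this toy case, feature 0 receives the maximum score for cluster "a", whereas feature 1, which is strongly enriched in only one cell, receives a low score. This is the behavior that motivates a cosine-based criterion over one driven by the magnitude of the difference between clusters.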
2.4. The Annotation Quality of Reference Genomes
The reference genome serves as a map against which the DNA fragments obtained from scATAC-seq are aligned, which helps to identify accessible regions in the genome. The quality of the reference genome determines the accuracy of ACR localization, which may in turn affect downstream analysis and the interpretation of the results.
To evaluate the effect of reference quality on scATAC-seq analysis, researchers have analyzed the same dataset against genome assemblies with varying levels of annotation quality. The results indicate that reference quality has a strong impact on scATAC-seq in terms of QC (quality control) metrics, peak calling, and cell clustering. The 10x Genomics scATAC-seq tutorial also demonstrates that a high-quality genome reference is required to achieve desirable results and meet the QC metrics, which is equally important for scRNA-seq. In a recent study applying scRNA-seq in non-model plants, high-quality reference genomes were first assembled to improve mapping rates, ensuring the accurate quantification of scRNA-seq
[45]. Therefore, it is advisable to evaluate the quality of a genome assembly first to generate reliable results in scATAC-seq.