Classical Approaches at the Synchrotron Radiation Facilities

Synchrotron radiation sources are widely used in interdisciplinary research, generating an enormous amount of data while posing serious challenges to the storage, processing, and analysis capabilities of the large-scale scientific facilities worldwide.

  • synchrotron
  • data processing
  • data analysis

1. Introduction

In recent years, the rapid development of large-scale synchrotron radiation facilities has brought the electron beam divergence close to the diffraction limit, while steadily increasing both the photon flux and the coherence. Experimental techniques at the new-generation light source facilities are evolving to match modern users’ needs for high-throughput, multimodal, ultrafast, in situ, and dynamic investigations, laying the foundation for real-time, multi-functional, and cross-facility experiments. In addition, imaging sensors such as complementary metal oxide semiconductor (CMOS) and charge-coupled device (CCD) detectors have made remarkable advances toward smaller pixel sizes, larger sensitive areas, and faster frame rates, enabling experimental techniques with better spatial and temporal resolution. Their widespread use at beamlines has gradually made digital images the predominant raw scientific data format at synchrotron radiation facilities [1]. As a result, within the next few years the exponential increase in data volume will exceed the processing capability of existing classical methods that rely on manual data analysis. This “data deluge” effect [2] severely challenges all synchrotron radiation facilities worldwide, particularly in terms of data acquisition, local storage, data migration, data management, data analysis, and interpretation.
For example, X-ray photon correlation spectroscopy can now generate 2 MB images at 3000 Hz, corresponding to a data rate of 6 GB/s [3], which is comparable to the data rate of the Large Hadron Collider. Using the Oryx detector, tomography beamlines can acquire 1500 projections (each of 2048 × 2448 pixels) in 9 s, at a data rate exceeding 1 GB/s [4]. With these techniques, it is possible to study time-dependent phenomena over several weeks, accumulating an enormous amount of data. According to statistics from the National Synchrotron Light Source-II (NSLS-II) [5][6], over 1 PB of raw data was generated in 2021 alone, and future data volumes are expected to increase further. Furthermore, the High Energy Photon Source (HEPS) under development in China is expected to generate, with its 14 beamlines, 24 PB of raw experimental data per month during its initial phase [7].
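As a back-of-the-envelope check of these figures, the sketch below recomputes the quoted data rates from frame size and frame rate; the 16-bit pixel depth assumed for the tomography detector is an assumption, not a figure from the text.

```python
# Back-of-the-envelope data-rate estimates for the detector figures quoted above.
# Frame sizes and rates come from the text; the 16-bit pixel depth is assumed.

def data_rate_gb_per_s(frame_bytes: float, frames_per_second: float) -> float:
    """Sustained data rate in GB/s for a given frame size and frame rate."""
    return frame_bytes * frames_per_second / 1e9

# XPCS: 2 MB frames acquired at 3000 Hz
xpcs_rate = data_rate_gb_per_s(2e6, 3000)                   # ~6 GB/s

# Tomography: 1500 projections of 2048 x 2448 pixels in 9 s, assuming 2 bytes/pixel
projection_bytes = 2048 * 2448 * 2
tomo_rate = data_rate_gb_per_s(projection_bytes, 1500 / 9)  # ~1.7 GB/s

print(f"XPCS data rate:       {xpcs_rate:.1f} GB/s")
print(f"Tomography data rate: {tomo_rate:.1f} GB/s")
```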
Therefore, given the vast amount of data generated during these experiments, novel capabilities for on-site, real-time data analysis, processing, and interpretation at the beamlines are a crucial and urgent need for synchrotron radiation users. Not addressing these issues may result in a large portion of the users’ data never being effectively analyzed, obscuring any potential scientific discovery hidden within these data [8].

2. Classical Approaches at the Synchrotron Radiation Facilities

Synchrotron beamlines typically offer two approaches to provide users with on-site data processing and analysis services for their computationally intensive needs. The first approach involves uploading data and jobs to a national supercomputer via high-speed scientific network infrastructures. The second approach involves deploying on-premises high-performance workstations or small clusters that handle the data processing jobs locally.
For example, the Superfacility project [9] at the Lawrence Berkeley National Laboratory (LBNL) links research facilities with the National Energy Research Scientific Computing Center (NERSC)’s high-performance computing (HPC) resources via ESnet [10], allowing large-scale data analysis with minimal human intervention. This shortens analysis cycles from days or weeks down to minutes or hours. Users can access storage, open software, and tools without having to manage complex architectures or possess specialized computational expertise [11][12][13][14]. The Advanced Light Source (ALS) has launched several projects at NERSC, including a data portal, a data-sharing service, and an artificial intelligence (AI)/machine learning (ML) collaboration project, streamlining data ingestion, sharing, and labeling for users [15]. This approach adheres to the concept of resource concentration and intensification. However, resource allocation and scheduling, as well as the queuing time, are beyond the control of the beamline scientists and users, since they are governed by the operational regulations of the supercomputer itself. For instance, when the ALS used the SPOT framework and NERSC to process tomography data, the actual job execution time for computed tomography (CT) reconstructions was less than 10 min, while the queuing time in the NERSC scheduling system was approximately 30 min [15].
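To make this workflow concrete, the following is a minimal sketch of how a beamline might hand a dataset and a batch job to a remote HPC center running a Slurm scheduler (as NERSC does). The host name, paths, and batch script are hypothetical placeholders; this is not the actual Superfacility API.

```python
# Minimal sketch of offloading a processing job from a beamline to a remote
# HPC center with a Slurm scheduler. The host, paths, and batch script are
# hypothetical placeholders, not the actual Superfacility interfaces.
import subprocess

REMOTE = "user@hpc.example.org"          # hypothetical HPC login node
REMOTE_DIR = "/scratch/user/beamline"    # hypothetical scratch directory

def transfer_dataset(local_path: str) -> None:
    """Copy raw data to the HPC center's scratch file system."""
    subprocess.run(["rsync", "-a", local_path, f"{REMOTE}:{REMOTE_DIR}/"], check=True)

def submit_job(script_path: str) -> str:
    """Submit a Slurm batch script on the remote side and return its job ID."""
    result = subprocess.run(
        ["ssh", REMOTE, "sbatch", "--parsable", script_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    transfer_dataset("scan_0001.h5")                        # hypothetical dataset
    job_id = submit_job(f"{REMOTE_DIR}/reconstruct.sbatch")
    # When the job actually starts depends on the shared queue, which is the
    # latency issue discussed above.
    print(f"Submitted remote job {job_id}")
```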
The TOMCAT beamline at the Swiss Light Source has instead adopted an on-premises computing approach, installing the GigaFRoST detector system for fast data acquisition [16] and building an effective tomographic reconstruction pipeline that uses high-performance computing to manage and analyze the massive data influx [17][18]. The TomoPy framework [19], developed in Python at the Advanced Photon Source (APS), represents a highly effective data-intensive strategy. The ALS has adopted and further developed TomoPy, implementing a modernized user interface [20] that considerably increases the workflow efficiency of CT users. By 2019, the macromolecular crystallography (MX) beamlines at the Shanghai Synchrotron Radiation Facility (SSRF) had adopted an automated system, Aquarium [21], which employs a local dedicated high-performance computing cluster for large-scale parallel computations, expediting data reduction, single-wavelength anomalous diffraction (SAD) phasing, and model building so that they complete within a 5 to 10 min window. Although local dedicated small-scale computing clusters can ensure real-time job execution through resource exclusivity, they come with higher costs and lower scalability.
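As an illustration of the kind of on-premises reconstruction pipeline mentioned above, the following is a minimal TomoPy-style sketch. The input files and the rotation-center choice are placeholders; production pipelines at beamlines add detector-specific I/O, artifact removal, and parallel scheduling.

```python
# Minimal tomographic reconstruction sketch using TomoPy. The input files and
# the rotation center below are placeholders; real beamline pipelines are
# considerably richer.
import numpy as np
import tomopy

# Assume projections, flat fields, and dark fields are already available as
# NumPy arrays of shape (angles, rows, columns), e.g. exported from HDF5.
proj = np.load("projections.npy")   # hypothetical input files
flat = np.load("flats.npy")
dark = np.load("darks.npy")

theta = tomopy.angles(proj.shape[0], 0, 180)   # projection angles in radians

proj = tomopy.normalize(proj, flat, dark)      # flat-/dark-field correction
proj = tomopy.minus_log(proj)                  # convert to line integrals

center = proj.shape[2] / 2.0                   # placeholder rotation center
recon = tomopy.recon(proj, theta, center=center, algorithm="gridrec")
recon = tomopy.circ_mask(recon, axis=0, ratio=0.95)

np.save("reconstruction.npy", recon)
```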
Integrating the two approaches could therefore yield a more efficient solution that prioritizes local dedicated infrastructures to match the real-time needs of the users’ experiments, and offloads computational tasks to large computing centers when higher computational demands arise, all within a framework designed to accommodate the needs of diverse scientific communities. Such a hybrid approach offers substantial benefits but requires close collaboration between the local computing infrastructures at the beamlines and the large computing centers in order to ensure seamless integration and efficient data transfer.
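One way to picture such a hybrid policy is a small dispatcher that keeps latency-critical or small jobs on the local cluster and offloads the rest to a remote computing center once local capacity is exhausted. The thresholds and job attributes below are purely illustrative assumptions, not part of any existing facility software.

```python
# Illustrative sketch of a hybrid dispatch policy: latency-critical or small
# jobs stay on the local beamline cluster, everything else goes to a large
# remote computing center. All thresholds and job attributes are assumptions.
from dataclasses import dataclass

LOCAL_SLOTS = 8            # assumed capacity of the local cluster
LOCAL_MAX_DATASET_GB = 50  # assumed cut-off for jobs worth keeping local

@dataclass
class Job:
    name: str
    dataset_gb: float
    realtime: bool  # does the experiment need feedback while it is running?

def dispatch(job: Job, local_jobs_running: int) -> str:
    """Return 'local' or 'remote' according to a simple hybrid policy."""
    if local_jobs_running < LOCAL_SLOTS and (
        job.realtime or job.dataset_gb <= LOCAL_MAX_DATASET_GB
    ):
        return "local"    # guarantee low latency on-site while capacity allows
    return "remote"       # offload large or non-urgent work to the HPC center

if __name__ == "__main__":
    queue = [Job("xpcs_feedback", 8, realtime=True),
             Job("ct_batch_recon", 400, realtime=False)]
    for job in queue:
        print(job.name, "->", dispatch(job, local_jobs_running=3))
```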
The SSRF is the first medium-energy, third-generation synchrotron radiation source on the Chinese mainland. It features a 150 MeV linear accelerator, a 3.5 GeV booster, a 3.5 GeV storage ring, 27 operational beamlines, approximately 40 operational experimental endstations, support facilities, and a dedicated data center [22][23][24][25][26][27]. With the ongoing development of the SSRF and the expansion of its application scope, the amount of data generated continues to grow, and the computing requirements vary across beamlines.
In 2019, the SSRF generated over 0.8 PB of raw data and 2.4 PB of processed data. Once the Phase II project is completed, the SSRF is expected to generate approximately 30 PB of raw data and 100 PB of processed data per year. Assuming an average dataset size of 10 GB, the SSRF can currently process one dataset every 3 min. If the daily volume of data to be processed reaches 160 TB, i.e., 16,384 datasets, it would take 819 h to complete a single day of computing tasks [25].
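The backlog figure above follows directly from the stated throughput; the short calculation below reproduces it, treating 160 TB as 160 × 1024 GB, which is how the 16,384-dataset count in the text is obtained.

```python
# Reproduce the backlog estimate quoted above. 160 TB is treated as
# 160 * 1024 GB, which yields the 16,384-dataset figure from the text.
daily_volume_gb = 160 * 1024   # data to be processed per day
dataset_gb = 10                # average dataset size
minutes_per_dataset = 3        # current rate: one dataset every 3 min

datasets_per_day = daily_volume_gb / dataset_gb            # 16,384 datasets
processing_hours = datasets_per_day * minutes_per_dataset / 60

print(f"{datasets_per_day:.0f} datasets/day -> {processing_hours:.0f} h of processing")
# Output: 16384 datasets/day -> 819 h of processing, far beyond the 24 h available
```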
In this context, the processing and analysis of data from large-scale synchrotron radiation sources, together with improvements in computing resource usability and data transfer efficiency, are subjects of intensive research. As an emerging computing paradigm, edge computing [28][29] relocates computing resources and data processing capabilities to the network edge, thereby addressing latency, bandwidth bottlenecks, and other challenges inherent to conventional computing models. It therefore has the potential to be an effective data processing and analysis solution for synchrotron radiation facilities [30][31].