Classical Approaches at the Synchrotron Radiation Facilities

Synchrotron radiation sources are widely used in interdisciplinary research, generating an enormous amount of data while posing serious challenges to the storage, processing, and analysis capabilities of the large-scale scientific facilities worldwide.

  • synchrotron
  • data processing
  • data analysis

1. Introduction

In recent years, the rapid development of large-scale synchrotron radiation facilities has brought the electron beam divergence close to the diffraction limit, while steadily increasing both the photon flux and the coherence. The experimental techniques at the new-generation light source facilities are evolving to match modern users’ needs for high-throughput, multimodal, ultrafast, in situ, and dynamic investigations, laying the foundation for real-time, multi-functional, and cross-facility experiments. In addition, imaging sensors, such as the complementary metal oxide semiconductor (CMOS) and the charge-coupled device (CCD), have made remarkable advances in terms of smaller pixel sizes, larger areas, and faster frame rates, enabling experimental techniques with better spatial and temporal resolution. Their widespread use in beamlines has gradually made digital images the predominant scientific raw data format at synchrotron radiation facilities [1]. As a result, within the next few years, the resulting exponential growth in data volume will exceed the processing capability of the existing classical methods, which rely on manual data analysis. This “data deluge” effect [2] severely challenges synchrotron radiation facilities worldwide, particularly in terms of data acquisition, local storage, data migration, data management, data analysis, and interpretation.
For example, X-ray photon correlation spectroscopy can now generate 2 MB images at 3000 Hz, i.e., a data generation rate of 6 GB/s [3], which is comparable to the data rate of the Large Hadron Collider. Using the Oryx detector, tomography beamlines can acquire 1500 projections (each consisting of 2048 × 2448 pixels) in 9 s, at a data rate exceeding 1 GB/s [4]. Using these techniques, it is possible to study time-dependent phenomena over several weeks, accumulating an enormous amount of data. According to statistics from the National Synchrotron Light Source-II (NSLS-II) [5,6], over 1 PB of raw data was generated in 2021 alone, and future data volumes are expected to increase further. Furthermore, the High Energy Photon Source (HEPS) under development in China is expected to generate, with its 14 beamlines, 24 PB of raw experimental data per month during the initial phase [7].
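These data rates can be sanity-checked with a short back-of-the-envelope computation, sketched below in Python; the 16-bit pixel depth assumed for the Oryx example is not stated above and is an assumption.

```python
# Back-of-the-envelope check of the data rates quoted above.

MB = 1e6  # bytes (decimal units, as typically quoted by detector vendors)
GB = 1e9

# X-ray photon correlation spectroscopy: 2 MB frames at 3000 Hz.
xpcs_rate = 2 * MB * 3000 / GB
print(f"XPCS: {xpcs_rate:.1f} GB/s")  # -> 6.0 GB/s

# Tomography with the Oryx detector: 1500 projections of
# 2048 x 2448 pixels acquired in 9 s, assuming 2 bytes (16 bit) per pixel.
bytes_per_projection = 2048 * 2448 * 2
tomo_rate = 1500 * bytes_per_projection / 9 / GB
print(f"Tomography: {tomo_rate:.2f} GB/s")  # -> ~1.67 GB/s, i.e., >1 GB/s
```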
Therefore, given the vast amount of data generated during these experiments, novel capabilities for on-site real-time data analysis, processing, and interpretation at the beamlines are a crucial and urgent need for synchrotron radiation users. Failing to address this issue may leave a large portion of the users’ data unanalyzed, obscuring any potential scientific discovery hidden within these data [8].

2. Classical Approaches at the Synchrotron Radiation Facilities

Synchrotron beamlines typically offer two approaches to provide users with on-site data processing and analysis services addressing their computationally intensive needs. The first approach uploads data and jobs to a national supercomputer via high-speed scientific network infrastructures. The second approach deploys on-premises high-performance workstations, or small clusters, that handle the data processing jobs locally.
For example, the Superfacility project [9] at the Lawrence Berkeley National Laboratory (LBNL) links research facilities with the National Energy Research Scientific Computing Center (NERSC)’s high-performance computing (HPC) resources via ESnet [10], allowing for large-scale data analysis with minimal human intervention. This shortens analysis cycles from days or weeks down to minutes or hours. Users can access storage, open software, and tools without having to manage complex architectures or possess computational expertise [11,12,13,14]. The Advanced Light Source (ALS) has launched several projects at the NERSC, including a data portal, a data-sharing service, and an artificial intelligence (AI)/machine learning (ML) collaboration project, streamlining the data ingestion, sharing, and labeling processes for the users [15]. This approach adheres to the concept of resource concentration and intensification. However, resource allocation and scheduling, as well as the queuing time, are beyond the control of the beamline scientists and users, being subject to the operational regulations of the supercomputer itself. For example, when the ALS used the SPOT framework and NERSC to process tomography data, the actual job execution time for computed tomography (CT) reconstructions was less than 10 min, while the queuing time in the NERSC scheduling system was approximately 30 min [15].
The TOMCAT beamline at the Swiss Light Source has instead adopted an on-premises computing approach, installing the GigaFRoST detector system for fast data acquisition [16] and building an effective tomographic reconstruction pipeline that uses high-performance computing to manage and analyze the massive data influx [17,18]. The TomoPy framework [19], developed in Python at the Advanced Photon Source (APS), represents a highly effective data-intensive strategy. The ALS has adopted and further developed the TomoPy framework by implementing a modernized user interface [20], which can considerably increase the workflow efficiency of CT users. Since 2019, the macromolecular crystallography (MX) beamlines at the Shanghai Synchrotron Radiation Facility (SSRF) have utilized an automated system, Aquarium [21], which employs a local dedicated high-performance computing cluster for large-scale parallel computations, expediting the data reduction, single-wavelength anomalous diffraction (SAD) phasing, and model construction procedures, which complete within a 5 to 10 min time window. Although local dedicated small-scale computing clusters can ensure real-time job execution through resource exclusivity, they suffer from higher economic costs and lower scalability.
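As an illustration of the kind of pipeline TomoPy enables, the following minimal sketch reconstructs a simulated dataset with TomoPy’s gridrec algorithm; the phantom size, angle count, and centering are illustrative choices, not the parameters used at the APS or ALS beamlines.

```python
import tomopy

# Simulate a tomography experiment: a 2D Shepp-Logan phantom
# projected over 180 degrees (a stand-in for real detector frames).
obj = tomopy.shepp2d(size=256)        # test object, shape (1, 256, 256)
theta = tomopy.angles(180)            # 180 projection angles (radians)
proj = tomopy.project(obj, theta)     # simulated projection stack

# Reconstruct with the fast Fourier-based gridrec algorithm, then
# mask out the corners lying outside the reconstruction circle.
center = proj.shape[2] / 2.0          # assume the rotation axis is centered
recon = tomopy.recon(proj, theta, center=center, algorithm="gridrec")
recon = tomopy.circ_mask(recon, axis=0, ratio=0.95)
print(recon.shape)
```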
Integrating the two approaches could then yield a more efficient solution that prioritizes local dedicated infrastructures matching the real-time needs of the users’ experiments, and offloads computational tasks to large computing centers when higher computational demands arise, all within a framework designed to accommodate the needs of diverse scientific communities. This hybrid approach can provide substantial benefits, but it requires close collaboration between the local computing infrastructures at the beamlines and the large computing centers in order to ensure seamless integration and efficient data transfer.
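A minimal sketch of such a hybrid dispatch policy is given below; the job fields, the capacity threshold, and the assumed queuing time are hypothetical placeholders, not an interface offered by any of the facilities discussed here.

```python
from dataclasses import dataclass

# Hypothetical job descriptor; the field names are illustrative only.
@dataclass
class Job:
    dataset_gb: float   # size of the dataset to process
    deadline_s: float   # how quickly the user needs feedback

LOCAL_CAPACITY_GB = 50.0  # assumed capacity of the beamline cluster
REMOTE_QUEUE_S = 1800.0   # assumed mean HPC queuing time (~30 min, cf. [15])

def dispatch(job: Job) -> str:
    """Prefer the local cluster for small, latency-sensitive jobs;
    offload to the remote HPC center when demands exceed local capacity."""
    if job.dataset_gb <= LOCAL_CAPACITY_GB and job.deadline_s < REMOTE_QUEUE_S:
        return "local-cluster"
    return "remote-hpc"

print(dispatch(Job(dataset_gb=10, deadline_s=300)))     # -> local-cluster
print(dispatch(Job(dataset_gb=500, deadline_s=86400)))  # -> remote-hpc
```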
The SSRF is the first medium-energy, third-generation synchrotron radiation source on the Chinese mainland. It features a 150 MeV linear accelerator, a 3.5 GeV booster, a 3.5 GeV storage ring, 27 operational beamlines, approximately 40 operational experimental endstations, support facilities, and a dedicated data center [22,23,24,25,26,27]. With the ongoing development of the SSRF and the expansion of its application scope, the amount of data generated follows a similar upward trend, with computing requirements that vary from beamline to beamline.
In 2019, the SSRF generated over 0.8 PB of raw data and 2.4 PB of processed data. Once the Phase II project is completed, the SSRF is expected to generate approximately 30 PB of raw data and 100 PB of processed data per year. Assuming an average dataset size of 10 GB, the SSRF can currently process one dataset every 3 min. If the daily volume of processed data reaches 160 TB, i.e., 16,384 datasets, it would take 819 h, more than 34 days, to complete a single day of computing tasks [25].
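The backlog figure follows directly from the stated throughput, as the short calculation below shows; binary prefixes (1 TB = 1024 GB) are assumed in order to reproduce the quoted dataset count.

```python
# Reproduce the processing-backlog estimate quoted above.
daily_volume_gb = 160 * 1024      # 160 TB per day, using 1 TB = 1024 GB
dataset_gb = 10                   # average dataset size
minutes_per_dataset = 3           # current processing throughput

datasets = daily_volume_gb / dataset_gb       # -> 16,384 datasets
hours = datasets * minutes_per_dataset / 60   # -> 819.2 h
print(f"{datasets:.0f} datasets -> {hours:.0f} h "
      f"({hours / 24:.1f} days) per day of data")
```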
In this context, the processing and analysis of data from large-scale synchrotron radiation sources, as well as the improvement of computing resource usability and data transfer efficiency, are subjects of intensive research. As an emerging computing paradigm, edge computing [28,29] relocates computing resources and data processing capabilities to the network edge, thereby addressing latency, bandwidth bottlenecks, and other challenges inherent to conventional computing models. Thus, it has the potential to be an effective data processing and analysis solution for synchrotron radiation facilities [30,31].

References

  1. Wang, C.; Steiner, U.; Sepe, A. Synchrotron Big Data Science. Small 2018, 14, 1802291.
  2. Bell, G.; Hey, T.; Szalay, A. Beyond the Data Deluge. Science 2009, 323, 1297–1298.
  3. Pralavorio, C. LHC Season 2: CERN Computing Ready for Data Torrent; CERN: Geneva, Switzerland, 2015.
  4. FLIR Systems. Available online: https://www.flir.com/products/oryx-10gige (accessed on 1 May 2019).
  5. Campbell, S.I.; Allan, D.B.; Barbour, A.M.; Olds, D.; Rakitin, M.S.; Smith, R.; Wilkins, S.B. Outlook for artificial intelligence and machine learning at the NSLS-II. Mach. Learn. Sci. Technol. 2021, 2, 013001.
  6. Barbour, J.L.; Campbell, S.; Caswell, T.A.; Fukuto, M.; Hanwell, M.D.; Kiss, A.; Konstantinova, T.; Laasch, R.; Maffettone, P.M.; Ravel, B.; et al. Advancing Discovery with Artificial Intelligence and Machine Learning at NSLS-II. Synchrotron Radiat. News 2022, 35, 44–50.
  7. Hu, H. The design of a data management system at HEPS. J. Synchrotron Radiat. 2021, 28, 169–175.
  8. Parkinson, D.Y.; Beattie, K.; Chen, X.; Correa, J.; Dart, E.; Daurer, B.J.; Deslippe, J.R.; Hexemer, A.; Krishnan, H.; MacDowell, A.A.; et al. Real-time data-intensive computing. AIP Conf. Proc. 2016, 1741, 050001.
  9. Bard, D.; Snavely, C.; Gerhardt, L.M.; Lee, J.; Totzke, B.; Antypas, K.; Arndt, W.; Blaschke, J.P.; Byna, S.; Cheema, R.; et al. The LBNL Superfacility Project Report. arXiv 2022, arXiv:2206.11992.
  10. Bashor, J. NERSC and ESnet: 25 Years of Leadership; Lawrence Berkeley National Laboratory: Berkeley, CA, USA, 1999.
  11. Blaschke, J.; Brewster, A.S.; Paley, D.W.; Mendez, D.; Sauter, N.K.; Kröger, W.; Shankar, M.; Enders, B.; Bard, D.J. Real-Time XFEL Data Analysis at SLAC and NERSC: A Trial Run of Nascent Exascale Experimental Data Analysis. arXiv 2021, arXiv:2106.11469.
  12. Giannakou, A.; Blaschke, J.P.; Bard, D.; Ramakrishnan, L. Experiences with Cross-Facility Real-Time Light Source Data Analysis Workflows. In Proceedings of the 2021 IEEE/ACM HPC for Urgent Decision Making (UrgentHPC), St. Louis, MO, USA, 19 November 2021; pp. 45–53.
  13. Vescovi, R.; Chard, R.; Saint, N.; Blaiszik, B.; Pruyne, J.; Bicer, T.; Lavens, A.; Liu, Z.; Papka, M.E.; Narayanan, S.; et al. Linking Scientific Instruments and HPC: Patterns, Technologies, Experiences. arXiv 2022, arXiv:2204.05128.
  14. Enders, B.; Bard, D.; Snavely, C.; Gerhardt, L.M.; Lee, J.R.; Totzke, B.; Antypas, K.; Byna, S.; Cheema, R.; Cholia, S.; et al. Cross-facility Science with the Superfacility Project at LBNL. In Proceedings of the 2020 IEEE/ACM 2nd Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing (XLOOP), Atlanta, GA, USA, 12 November 2020; pp. 1–7.
  15. Deslippe, J.R.; Essiari, A.; Patton, S.J.; Samak, T.; Tull, C.E.; Hexemer, A.; Kumar, D.; Parkinson, D.Y.; Stewart, P. Workflow Management for Real-Time Analysis of Lightsource Experiments. In Proceedings of the 2014 9th Workshop on Workflows in Support of Large-Scale Science, New Orleans, LA, USA, 16 November 2014; pp. 31–40.
  16. Mokso, R.; Schlepütz, C.M.; Theidel, G.; Billich, H.; Schmid, E.; Celcer, T.; Mikuljan, G.; Sala, L.; Marone, F.; Schlumpf, N.; et al. GigaFRoST: The gigabit fast readout system for tomography. J. Synchrotron Radiat. 2017, 24, 1250–1259.
  17. Buurlage, J.-W.; Marone, F.; Pelt, D.M.; Palenstijn, W.J.; Stampanoni, M.; Batenburg, K.J.; Schlepütz, C.M. Real-time reconstruction and visualisation towards dynamic feedback control during time-resolved tomography experiments at TOMCAT. Sci. Rep. 2019, 9, 18379.
  18. Marone, F.; Studer, A.; Billich, H.; Sala, L.; Stampanoni, M. Towards on-the-fly data post-processing for real-time tomographic imaging at TOMCAT. Adv. Struct. Chem. Imag. 2017, 3, 1.
  19. Gürsoy, D.; De Carlo, F.; Xiao, X.; Jacobsen, C. TomoPy: A framework for the analysis of synchrotron tomographic data. J. Synchrotron Radiat. 2014, 21, 1188–1193.
  20. Pandolfi, R.J.; Allan, D.; Arenholz, E.A.; Barroso-Luque, L.; Campbell, S.I.; Caswell, T.A.; Blair, A.; De Carlo, F.; Fackler, S.W.; Fournier, A.P.; et al. Xi-cam: A versatile interface for data visualization and analysis. J. Synchrotron Radiat. 2018, 25 Pt 4, 1261–1270.
  21. Yu, F.; Wang, Q.; Li, M.; Zhou, H.; Liu, K.; Zhang, K.; Wang, Z.; Xu, Q.; Xu, C.; Pan, Q.; et al. Aquarium: An automatic data-processing and experiment information management system for biological macromolecular crystallography beamlines. J. Appl. Crystallogr. 2019, 52, 472–477.
  22. Jiang, M.H.; Yang, X.; Xu, H.J.; Ding, Z.H. Shanghai Synchrotron Radiation Facility. Chin. Sci. Bull. 2009, 54, 4171–4181.
  23. He, J.; Zhao, Z. Shanghai synchrotron radiation facility. Natl. Sci. Rev. 2014, 1, 171–172.
  24. Yin, L.; Tai, R.; Wang, D.; Zhao, Z. Progress and Future of Shanghai Synchrotron Radiation Facility. J. Vac. Soc. Jpn. 2016, 59, 198–204.
  25. Wang, C.; Yu, F.; Liu, Y.; Li, X.; Chen, J.; Thiyagalingam, J.; Sepe, A. Deploying the Big Data Science Center at the Shanghai Synchrotron Radiation Facility: The first superfacility platform in China. Mach. Learn. Sci. Technol. 2021, 2, 035003.
  26. Sun, B.; Wang, Y.; Liu, K.; Wang, Q.; He, J. Design of new sub-micron protein crystallography beamline at SSRF. In Proceedings of the 13th International Conference on Synchrotron Radiation Instrumentation, Taipei, Taiwan, 11–15 June 2018.
  27. Li, Z.; Fan, Y.; Xue, L.; Zhang, Z.; Wang, J. The design of the test beamline at SSRF. In Proceedings of the 13th International Conference on Synchrotron Radiation Instrumentation, Taipei, Taiwan, 11–15 June 2018.
  28. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
  29. Ning, H.; Li, Y.; Shi, F.; Yang, L.T. Heterogeneous edge computing open platforms and tools for internet of things. Future Gener. Comput. Syst. 2020, 106, 67–76.
  30. Yin, J.; Zhang, G.; Cao, H.; Dash, S.; Chakoumakos, B.C.; Wang, F. Toward an Autonomous Workflow for Single Crystal Neutron Diffraction. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Virtual Event, 23–25 August 2022.
  31. Hirschman, J.; Kamalov, A.; Obaid, R.; O’Shea, F.H.; Coffee, R.N. At-the-Edge Data Processing for Low Latency High Throughput Machine Learning Algorithms. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Virtual Event, 23–25 August 2022.