Optimizing Data Processing | Encyclopedia MDPI

Optimizing Data Processing: Comparison

Please note this is a comparison between Version 1 by Thanda Shwe and Version 2 by Fanny Huang.

Intelligent applications in several areas increasingly rely on big data solutions to improve their efficiency, but the processing and management of big data incur high costs. Although cloud-computing-based big data management and processing offer a promising solution to provide scalable and abundant resources, the current cloud-based big data management platforms do not properly address the high latency, privacy, and bandwidth consumption challenges that arise when sending large volumes of user data to the cloud. Computing in the edge and fog layers is quickly emerging as an extension of cloud computing used to reduce latency and bandwidth consumption, resulting in some of the processing tasks being performed in edge/fog-layer devices. Although these devices are resource-constrained, recent increases in resource capacity provide the potential for collaborative big data processing.

Keywords: big data; IoT; serverless; data processing

1. Introduction

In the last few years, a new class of applications that use unprecedented amounts of data generated from mobile devices and Internet of Things (IoT) sensors has been widely deployed in many areas. These applications improve quality of life by automating daily tasks and by enabling time and energy savings, monitoring, efficient communication, and better decision making [1,2,3]. Such applications, which are present in domains such as healthcare monitoring, smart cities, industry, and transportation [4], not only perform intensive computation on large amounts of sensor data but also occasionally require the output to be produced in real time to provide faster interactions and better user experiences. Cloud solutions are commonly considered for these applications because they provide the necessary computing resources, such as virtual machines, and storage resources, including object-based storage and databases; the data are stored and processed in the cloud, and the results are sent back to the IoT devices [5,6]. Most existing systems use big data management platforms that process large amounts of data on a cloud computing platform or in self-hosted, dedicated clusters or data centers [7,8]. However, some IoT applications, such as smart cities and machinery automation, are real-time applications, and the adoption of cloud computing for such applications can result in high latency because sensor data need to be transferred to a cloud data center in a remote location. Thus, cloud solutions may no longer be appropriate for these types of applications due to the limitations caused by network bandwidth consumption and latency constraints [9].
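To make the latency argument concrete, the following back-of-the-envelope sketch compares the end-to-end time of sending a sensor batch to a remote cloud data center with that of processing it on a nearby edge node. The additive model (transfer time plus round-trip time plus compute time) and every number in it are illustrative assumptions, not measurements from the cited works.

```python
def end_to_end_latency_s(payload_bytes, bandwidth_bps, rtt_s, compute_s):
    """Simple additive model: transfer time + network round trip + compute."""
    transfer_s = payload_bytes * 8 / bandwidth_bps
    return transfer_s + rtt_s + compute_s

# Hypothetical scenario: one 5 MB batch of sensor data.
payload = 5_000_000

# Assumed cloud path: 20 Mbit/s uplink, 80 ms round trip, fast compute.
cloud = end_to_end_latency_s(payload, 20e6, 0.080, 0.05)

# Assumed edge path: 100 Mbit/s local link, 2 ms round trip, slower compute.
edge = end_to_end_latency_s(payload, 100e6, 0.002, 0.50)

print(f"cloud: {cloud:.3f} s, edge: {edge:.3f} s")
```

Under these assumed values, the transfer time dominates the cloud path even though the cloud computes ten times faster; with a smaller payload or a faster uplink, the comparison could easily reverse.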

Another noteworthy issue with cloud-based processing is that all users’ data tend to be sent to the cloud. Specifically, there is potential for security and privacy issues and an increase in the communication payload and costs. The storage of sensitive data in the cloud raises concerns about data vulnerability. In a centralized and shared environment, there is a risk of unauthorized access or data breaches. Regarding data privacy, the geographic locations of cloud servers may impact data privacy. Navigating international data protection laws and ensuring data residency compliance becomes a complex task in terms of protecting sensitive information. In addition, transferring large volumes of data to and from the cloud can incur significant costs. Organizations must carefully manage data transfer to reduce their expenses. These issues are particularly relevant in the context of data processing applications, such as healthcare and sensitive data handling, where the importance of security and low latency necessitates more focused consideration [10,11,12,13,14]. Concerns about sensitive information being exposed to remote cloud data centers highlight the limitations of cloud-based processing.

To overcome these challenges, fog and edge computing, which move processing and storage tasks closer to the data source, have been introduced as complements to cloud computing to provide storage and computing capabilities [9,15,16]. In these types of computing, any local device, however small, is considered capable of some processing tasks. Edge/fog computing helps users to process data at lower latency and in the most appropriate location. At present, edge- and fog-layer devices are only used for some preprocessing tasks, such as filtering, feature processing, and compression. Computationally intensive tasks such as big data analytics and machine learning training are sent to the cloud, as they cannot be run efficiently on low-performance local edge and fog devices [17,18]. In such a layered architecture, real-time and lightweight tasks are processed on edge devices, whereas tasks that need high storage capacities and processing capabilities are moved to the cloud. However, this approach has an important drawback: sending large quantities of data between local devices and cloud providers consumes considerable bandwidth. If the data can be stored and processed locally, closer to the IoT devices, the data transmitted through the network would be significantly reduced.
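The kind of edge-side preprocessing mentioned above (filtering, simple feature extraction, and compression) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `preprocess_on_edge` and `payload_size`, the valid range, the window size, and the synthetic temperature trace are all hypothetical and are not taken from any cited platform.

```python
import json
import random
import statistics
import zlib

def preprocess_on_edge(readings, low, high, window):
    """Drop out-of-range readings, then average fixed-size windows."""
    valid = [r for r in readings if low <= r <= high]
    return [
        round(statistics.fmean(valid[i:i + window]), 2)
        for i in range(0, len(valid), window)
    ]

def payload_size(values):
    """Size in bytes of a zlib-compressed JSON payload."""
    return len(zlib.compress(json.dumps(values).encode("utf-8")))

# Synthetic trace (assumption): 1,000 noisy temperature readings.
rng = random.Random(42)
raw = [round(22.0 + rng.uniform(-2.0, 2.0), 2) for _ in range(1000)]
raw[100] = raw[500] = -999.0  # sensor glitches that the edge node should drop

summary = preprocess_on_edge(raw, low=-40.0, high=85.0, window=100)

print(len(raw), "raw readings ->", len(summary), "window averages")
print(payload_size(raw), "bytes raw vs", payload_size(summary), "bytes summarized")
```

Only the window averages would be forwarded to the cloud in this sketch, so the saving comes from both the aggregation and the compression. Whether such a summary is acceptable depends on what the downstream analytics need.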

Researchers can observe a recent breakthrough in edge- and fog-layer computing devices as the processing abilities of these devices have become increasingly efficient and capable [16]. In addition, most of the data or applications pass through the edge/fog-layer devices as a gateway before connecting to the cloud. Edge/fog computing is potentially a good solution for data processing problems, thanks to its recently developed resource-scaling capabilities. This provides the possibility of deploying edge- and fog-layer big data management platforms using local edge/fog devices as computing and storage nodes. Such a big data management platform can not only help to reduce latency, bandwidth consumption, and cloud budgets but can also be applied in some environments without an Internet connection.

IoT applications that span various domains, ranging from smart cities to industrial environments and healthcare systems, continuously produce massive volumes of data, including sensor readings, images, videos, and other types of information. Data processing platforms handle large volumes of diverse data, making them well-suited for the processing of the varied data generated by IoT sources. IoT deployments often involve distributed environments with numerous devices spread across different locations. The distributed computing models of big data processing allow them to efficiently process data across distributed clusters, accommodating the distributed nature of IoT deployments. As many IoT applications demand real-time or near-real-time processing to enable swift responses and decision making, centralized data management approaches encounter challenges in the efficient handling of information. Thus, by strategically integrating computational capabilities closer to the data source, within the edge and fog layers, it is possible to facilitate local data processing, reducing the necessity of transmitting all data to centralized cloud infrastructure. This not only addresses latency concerns but also optimizes bandwidth usage, making it particularly beneficial in scenarios where network resources are limited. Furthermore, the decentralized nature of edge/fog computing contributes to enhanced security and privacy by allowing sensitive data to be processed closer to their origin. In summary, the cooperative interconnection between IoT applications, data management, and edge/fog computing represents a strategic approach to meet the evolving demands of efficient, real-time, and secure data processing within the dynamic landscape of the Internet of Things.

In addition, the requirements of big data processing, such as the training of machine learning models [19,20,21], stream processing [22], and event processing [23,24], raise scalability issues and require significant computing and storage resources [25,26]. Fortunately, big data management and processing platforms such as Apache Spark [27], Apache Flink [28], and Apache OpenWhisk [29] address these concerns by facilitating distribution and collaboration. In such systems, big data processing tasks are performed using the collaborative power of nodes in clusters. With recent increases in the resource capacity of edge/fog devices and the capability of big data processing platforms, it has become increasingly important to explore the deployment of big data processing platforms on resource-constrained edge/fog devices.

2. Optimizing Data Processing

2.1. Capability of Resource-Constrained Devices

Due to the increasing capabilities of low-cost, single-board computers and their benefits in terms of cost, low power consumption, small size, and reasonable performance, several works [30,31,32] have studied the capabilities and performance of low-cost edge-layer devices for deep learning and machine learning inference. The works reported in [30,31] focused on the inference of deep learning models on various low-cost devices, such as the Raspberry Pi 4, Google Coral Dev Board, and Nvidia Jetson Nano, and evaluated their performance in terms of the inference time and power consumption. Their main focus was to implement the inference parts of deep learning models on edge devices in order to achieve real-time processing. The authors of [32] conducted a comprehensive survey of design methodologies for AI edge development, emphasizing the importance of single-layer specialization and cross-layer codesign, which includes hardware and software components for edge training, inference, caching, and offloading. Several insights were highlighted with respect to the quality of AI solutions in the edge computing layer. While the deployment of low-cost edge devices is still a relatively new paradigm for advanced applications, it has attracted widespread interest, particularly in function processing. Some approaches [33,34,35,36,37] have been developed for edge-layer serverless platforms. Specifically, all of these works attempted to develop edge-layer serverless platforms that can support edge AI applications. Moreover, increasing attention has been paid to the capabilities and viability of single-board computers; several works [38,39,40] have explored the characteristics and possibilities of deploying different big data applications in resource-constrained environments.

Some previous studies have evaluated data processing platforms under different application scenarios and proposed design changes to off-the-shelf software platforms to cater to the requirements of applications in edge/fog computing layers. In [41], the authors proposed a serverless edge platform based on the OpenWhisk serverless platform to address the challenges of real-time and data-intensive applications in fog/edge computing. The proposed platform comprised additional components for latency-sensitive computation offloading, stateful partitions, and real-time edge coordination. The platform was evaluated in terms of its resource utilization footprint, latency overhead, throughput, and scalability under different application scenarios. The results show that the serverless architecture reduced the burden of infrastructure management, allowed greater functionality to be deployed on fog nodes with limited resources, and fulfilled the requirements of different application scenarios and the heterogeneous deployment of fog nodes. To optimize stream processing in the edge/fog environment, the authors of [42] proposed Amnis, a stream processing framework that considers computational and network resources at the edge. It extended the Storm framework in terms of stream processing operator allocation and placement for stream queries. Compared to the default operator scheduler in Apache Storm, it performed better in terms of end-to-end latency and overall throughput. These works targeted only individual data processing paradigms, such as batch processing, stream processing, or function processing.
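The two metrics that recur in these evaluations, latency and throughput, can be measured for a single operator with a small harness such as the one below. This is a simplified, single-process sketch, not the methodology of the cited works: it captures only per-event processing time, whereas end-to-end latency in a real stream processing deployment also includes queueing and network delays. The operator `parse_and_flag` and the synthetic events are hypothetical.

```python
import json
import statistics
import time

def measure(operator, events):
    """Run operator on each event; return (latencies in s, events per s)."""
    latencies = []
    start = time.perf_counter()
    for event in events:
        t0 = time.perf_counter()
        operator(event)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return latencies, len(events) / elapsed

def parse_and_flag(raw_event):
    """Hypothetical operator: decode a JSON reading and flag high values."""
    reading = json.loads(raw_event)
    return reading["value"] > 30.0

# Synthetic input (assumption): 10,000 JSON-encoded sensor readings.
events = [json.dumps({"sensor": i % 8, "value": 20.0 + i % 15}) for i in range(10_000)]
latencies, throughput = measure(parse_and_flag, events)

p95 = statistics.quantiles(latencies, n=20)[18]  # 95th-percentile cut point
print(f"mean latency: {statistics.fmean(latencies) * 1e6:.1f} us")
print(f"p95 latency:  {p95 * 1e6:.1f} us")
print(f"throughput:   {throughput:,.0f} events/s")
```

Reporting a tail percentile alongside the mean is a common choice because a few slow events can matter more to a real-time application than the average does.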

2.3. Comparative Studies of Big Data Processing Platforms

Some previous works have included benchmark studies of the performance of stream processing systems, such as [43], which evaluated the performance of three stream processing platforms, namely Apache Storm, Apache Spark, and Apache Flink, in terms of throughput and latency. In [44], these aspects were also evaluated for Apache Spark and Apache Flink. With respect to batch processing platforms, the work reported in [44,45] included a comprehensive study of two widely used big data analytics tools, namely Apache Spark [27] and Hadoop MapReduce [46], on a common data mining task, i.e., classification. With respect to function processing platforms, the work reported in [47,48] provided a benchmarking framework for the characterization of serverless platforms in both commercial cloud and open-source platforms. The work that is most related to ours is [49], which evaluated computing resources across the computing continuum using three applications: video encoding, machine learning, and in-memory analytics. It also provided recommendations based on the evaluation results regarding where to perform the tasks across the computing continuum. The authors utilized a real test bed named the Carinthian Computing Continuum (C³) to extend cloud data centers with low-cost devices located close to the edge of the network. They recommended offloading the applications to edge and fog resources to reduce the network traffic and CO₂ emissions, with an acceptable performance penalty, while the cloud was used for lower execution times. Another work [50] mainly evaluated big data processing possibilities on a Raspberry Pi-based Apache Spark and Hadoop cluster and examined the impact on storage performance of employing three different external storage solutions. Then, the cluster performance was compared with that of a single desktop PC using microbenchmarks such as Wordcount, TeraGen/TeraSort, TestDFSIO, and Pi computation.

However, none of these works included a comprehensive investigation of the performance of existing big data processing platforms for the three computing paradigms, namely batch processing, stream processing, and function processing, based on various applications (e.g., image classification, object detection, image resizing). Therefore, exploring the operating performance of batch, stream, and function processing in the current edge, fog, and cloud layers is of great significance for research and industrial applications in the field of big data processing.