Data integrity is a prerequisite for the availability of IoT data and has received extensive attention in the field of IoT big data security. Stream computing systems are widely used in IoT for real-time data acquisition and computing. The real-time, volatile, bursty, and unordered nature of stream data makes data integrity verification difficult. The data integrity tracking and verification system described here is built on a data integrity verification scheme for stream computing systems (S-DIV) and tracks and analyzes the message data stream in real time. By verifying the integrity of each message throughout its life cycle, data corruption or data loss can be detected promptly, and error alarms and message recovery can be triggered actively.
1. Introduction
The rapid development of emerging technologies and applications such as the Internet of Things (IoT) and 5G networks has led to a worldwide data explosion, pushing human society into the era of big data. Because IoT big data relies on new computing and storage models, its management and protection measures differ from those of ordinary data. How to maintain and guarantee the integrity and consistency of data throughout its life cycle in the IoT big data environment has therefore become one of the important issues in the field of data security [1].
Currently, batch offline computing and stream real-time computing are the main computing models for IoT big data [2]. Batch computing is a computing model for static, persistent data: it usually stores the data first and then distributes the data and the computing logic to distributed computing nodes for computation. A common batch computing architecture is Hadoop. Stream computing is a computing model for data streams: it does not store all the data but instead performs the computation directly in memory over a certain period of time.
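The difference between the two models can be illustrated with a minimal Python sketch. The function names and the sample readings are hypothetical and chosen only for illustration; the running mean stands in for any computation a stream system keeps in memory.

```python
from typing import Iterable, Iterator


def batch_mean(stored: list[float]) -> float:
    """Batch model: the full data set is stored first, then computed over."""
    return sum(stored) / len(stored)


def stream_mean(readings: Iterable[float]) -> Iterator[float]:
    """Stream model: only a running count and total are kept in memory."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        # A result is available after every reading; the reading itself
        # is not retained once it has been folded into the running state.
        yield total / count


readings = [20.0, 22.0, 24.0]          # illustrative sensor values
batch_result = batch_mean(readings)    # one result after all data is stored
stream_results = list(stream_mean(iter(readings)))  # one result per reading
```

Because the stream version discards each reading after use, a lost or corrupted reading cannot be re-examined later, which is the root of the verification difficulty discussed in this entry.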
Data integrity is a prerequisite for ensuring the availability of IoT data and is the key to the secure operation of IoT big data systems. Almost all current data integrity verification schemes have been studied under the batch computing model, where the technology and research are highly mature, while no complete solution exists for the data stream integrity problem under the stream computing model. In the traditional batch computing model, data integrity verification mechanisms are divided into two categories based on whether fault-tolerant recovery measures are applied: provable data possession (PDP) and proof of retrievability (POR). However, the current PDP and POR schemes cannot be directly applied to IoT stream computing systems. Because data streams are real-time, volatile, bursty, unordered, and unbounded, data incompleteness issues such as data loss, duplication, and state inconsistency are becoming more prominent in stream computing systems, making the study of data integrity and consistency more difficult than ever [2].
Although most stream computing systems currently have an acker [3], a mechanism that checks whether each message has been processed completely, the integrity of the message data itself is not guaranteed. At the same time, because the acker module sits inside the system, running complex validation computations on it can easily reduce the efficiency of real-time message computation. Since the stream computing process is not persistent, the historical processing path of a message cannot be inspected, which makes it difficult to reproduce cases of incomplete message data.
This data security issue becomes a serious constraint to the application of stream computing systems in the Internet of Things.
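The gap between processing completeness and content integrity can be seen in a simplified sketch of XOR-based ack tracking, the general idea behind the acker in Apache Storm. The class below is a hypothetical illustration, not Storm's API, and it uses small fixed tuple IDs where a real system would use random 64-bit values.

```python
class XorAcker:
    """Tracks, per root message, the XOR of all emitted and acked tuple IDs."""

    def __init__(self) -> None:
        self.pending: dict[str, int] = {}

    def _update(self, root: str, tuple_id: int) -> bool:
        value = self.pending.get(root, 0) ^ tuple_id
        if value == 0:
            # Every emitted tuple ID has been XORed in twice (emit + ack).
            self.pending.pop(root, None)
            return True
        self.pending[root] = value
        return False

    def emitted(self, root: str, tuple_id: int) -> bool:
        return self._update(root, tuple_id)

    def acked(self, root: str, tuple_id: int) -> bool:
        return self._update(root, tuple_id)


acker = XorAcker()
steps = [
    acker.emitted("m1", 0x1A),
    acker.emitted("m1", 0x2B),
    acker.acked("m1", 0x1A),
    acker.acked("m1", 0x2B),   # the tuple tree is now fully processed
]
```

The acker only ever sees tuple IDs. A payload that was altered between two modules would still be acknowledged as fully processed, which is why content integrity needs a separate check.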
To solve the above problems, this research constructs an external data integrity tracking and verification system (i.e., external tracking and verification system) that monitors the integrity and correctness of both message content and processing paths in the stream computing system. The system accurately records the processing path of each message and verifies the integrity of the message content. It also detects errors, raises alarms, and recovers messages in time without affecting the efficiency of the original stream computing system.
The external tracking and verification system provides the following capabilities:
- Accuracy: Accuracy is a key consideration for stream real-time computing systems used in IoT. Only with a high level of accuracy and precision can the system be trusted by end users and be widely applied.
- Real-time: Data sharing in IoT requires high timeliness, so data integrity verification must also run in real time. Since the tracking and verification system is built outside the stream computing system, integrity verification does not affect the efficiency of the original system. Meanwhile, verification runs in step with the stream computation, making it possible to trace and recover erroneous messages as soon as possible.
- Transparency: Different stream computing systems for IoT may have different topological frameworks, corresponding to different business and application interfaces; thus, the design of external tracking and verification systems should be transparent in order to achieve system versatility.
2. Data Integrity Tracking and Verification System
The detailed design of the external data integrity tracking and verification system of stream computing system is shown in Figure.
1. The phase of real-time data collection: When a message is sent from module A to module B, data collection is performed at the data sending port (module A) and the data receiving port (module B), and the collected message data are sent to the message tracking data center.
2. The phase of message classification: After the data center receives a collected message, it uses the Flag to determine whether it is a sending message or a receiving message. A sending message (Flag = 00) is put into the sending data storage module (i.e., sending module), and a receiving message (Flag = 01) is put into the receiving data storage module (i.e., receiving module).
3. The phase of key generation: The key management center sends the pregenerated key to the batch data preprocessing module and the batch data verification module, respectively.
4. The phase of batch data preprocessing: The preprocessing module preprocesses the message data of the sending module, performs the calculation on each message M(Mid1,Sid1), and generates a verification tag T(Mid1,Sid1); it then sends the tag to the batch data verification module.
5. The phase of batch data integrity verification: The messages of the receiving module are sent to the batch data verification module, which verifies the integrity of the messages one by one according to T(Mid1,Sidi) and M(Mid1,Sidi). Specifically, T(Mid1,Sidi) and M(Mid1,Sidi) are aggregated by Mid: the set of messages {M}_Mid1 = {M(Mid1,Sid1), M(Mid1,Sid2), M(Mid1,Sid3), ...} with the same Mid is aggregated and verified against the corresponding series of tags {T}_Mid1 = {T(Mid1,Sid1), T(Mid1,Sid2), T(Mid1,Sid3), ...}. If the verification passes, this information is sent to the message tracking data center and the stream computing system, and the intermediate data in the two caches are deleted; if the verification fails, message alarm and recovery are carried out.
6. The phase of alarm and recovery: When the alarm module receives the error information, it retrieves the erroneous message from the batch data verification module and resends it to the message tracking data center according to its Mid and Sid. The data center locates the original message and sends it to the stream computing system. Finally, the stream computing system replays and recalculates the message along the original route.
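The phases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the S-DIV algorithm itself: the tag is a plain HMAC-SHA256 used as a stand-in, Mid and Sid are treated as opaque identifier strings, and the `Record` layout, function names, key, and sample payloads are all hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict
from dataclasses import dataclass

FLAG_SEND = "00"     # collected at the data sending port (module A)
FLAG_RECEIVE = "01"  # collected at the data receiving port (module B)


@dataclass(frozen=True)
class Record:
    """One collected copy of a message (hypothetical layout)."""
    flag: str
    mid: str
    sid: str
    payload: bytes


def make_tag(key: bytes, rec: Record) -> bytes:
    """Stand-in tag: HMAC-SHA256 over Mid, Sid and payload (not S-DIV)."""
    data = b"|".join([rec.mid.encode(), rec.sid.encode(), rec.payload])
    return hmac.new(key, data, hashlib.sha256).digest()


def classify(records):
    """Phase 2: split collected records into sending and receiving stores."""
    sending, receiving = {}, {}
    for rec in records:
        store = sending if rec.flag == FLAG_SEND else receiving
        store[(rec.mid, rec.sid)] = rec
    return sending, receiving


def preprocess(key, sending):
    """Phase 4: one tag T(Mid, Sid) per sent message M(Mid, Sid)."""
    return {ids: make_tag(key, rec) for ids, rec in sending.items()}


def verify_batch(key, tags, receiving):
    """Phase 5: group by Mid and check each received message against its tag.

    Returns {Mid: [Sid, ...]} with the Sids that failed; an empty list
    means every message with that Mid passed.
    """
    failures = defaultdict(list)
    for (mid, sid), tag in tags.items():
        failures[mid]  # ensure every Mid appears in the result
        rec = receiving.get((mid, sid))
        if rec is None or not hmac.compare_digest(tag, make_tag(key, rec)):
            failures[mid].append(sid)
    return dict(failures)


def recover(failures, sending):
    """Phase 6: look up the original messages to resend for replay."""
    return [sending[(mid, sid)] for mid, sids in failures.items() for sid in sids]


key = b"demo-key"  # phase 3: would come from the key management center
records = [
    Record(FLAG_SEND, "m1", "s1", b"temp=21.5"),
    Record(FLAG_RECEIVE, "m1", "s1", b"temp=21.5"),
    Record(FLAG_SEND, "m1", "s2", b"temp=21.5;ok"),
    Record(FLAG_RECEIVE, "m1", "s2", b"temp=99.9;ok"),  # corrupted content
    Record(FLAG_SEND, "m2", "s1", b"hum=40"),            # never received
]
sending, receiving = classify(records)
tags = preprocess(key, sending)
failures = verify_batch(key, tags, receiving)  # {"m1": ["s2"], "m2": ["s1"]}
to_replay = recover(failures, sending)
```

In this sketch both a corrupted payload and a missing receive record count as verification failures, and recovery returns the stored sending-side copy so that it can be replayed. A real deployment would replace the HMAC with the scheme's own tag construction and would delete cached entries once a Mid passes.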
This entry is adapted from the peer-reviewed paper 10.3390/s22176496