Internet of Things Device Identification and Two-Stage Clustering

Version	Summary	Created by	Modification	Content Size	Created at
1		Mizuki Asano	--	2177	2024-01-22 11:00:49
2	layout	Jessie Wu	+ 6 word(s)	2183	2024-01-23 02:10:59

This entry is adapted from the peer-reviewed paper 10.3390/fi16010017

Smart home environments, which consist of various Internet of Things (IoT) devices that support and improve our daily lives, are expected to be widely adopted in the near future. Owing to a lack of awareness regarding the risks associated with IoT devices and the difficulty of replacing or updating their firmware, adequate security measures have not been implemented. As an alternative, IoT device identification methods based on traffic analysis have been proposed. However, because conventional methods process and analyze all traffic data at once, bias in the occurrence rate of traffic patterns has a negative impact on the analysis results.

Keywords: device identification; internet of things; machine learning; traffic analysis; two-stage clustering

1. Internet of Things Device Identification Methods

Several methods for identifying Internet of Things (IoT) devices have been proposed. Takasaki et al. proposed an IoT device identification method based on two-stage traffic analysis ^[1]. In the first stage, the method extracts domain names from the domain name system (DNS) packets that IoT devices send regularly and estimates the manufacturers of the connected devices based on the domain names of the destination servers. It also classifies devices into three categories (IoT devices, non-IoT devices, and routers) using supervised machine learning. All packets sent from each device over 10 min were captured, and the number of packets, total and average data sizes, number of protocol types, and number of destination addresses were extracted from them as feature values for the first-stage analysis. This stage classified connected devices into the three categories with over 94% accuracy. The second stage attempts to classify the functional categories of the devices identified as IoT devices in the first stage by analyzing traffic waveforms that represent the time variation of the transmitted packets. This stage analyzed a set of 6000 packets collected from each IoT device every 100 milliseconds for 10 min as feature values using deep learning with a long short-term memory (LSTM) network ^[2] and a convolutional neural network. The accuracy of identifying the seven functional categories of the IoT devices was 83.7%. However, the accuracy was low for reactive devices that operate in response to the actions of users and sensors, such as light bulbs, healthcare devices, and air quality sensors.
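The first-stage feature values described above can be sketched as follows. This is a minimal illustration and not the implementation from ^[1]: the `Packet` record, its field names, and the sample addresses are assumptions made for the example, and only the five listed feature values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical packet record; field names are assumptions for this sketch.
    timestamp: float  # seconds since the start of the capture
    size: int         # packet size in bytes
    protocol: str     # protocol label, e.g. "TCP", "UDP", "DNS"
    dst_addr: str     # destination IP address


def first_stage_features(packets, window_s=600.0):
    """Summarize the packets one device sent within a 10 min capture window."""
    in_window = [p for p in packets if 0.0 <= p.timestamp < window_s]
    count = len(in_window)
    total_size = sum(p.size for p in in_window)
    return {
        "num_packets": count,
        "total_size": total_size,
        "avg_size": total_size / count if count else 0.0,
        "num_protocols": len({p.protocol for p in in_window}),
        "num_dst_addrs": len({p.dst_addr for p in in_window}),
    }


# Made-up sample traffic (documentation address range).
sample_packets = [
    Packet(0.5, 100, "DNS", "192.0.2.1"),
    Packet(1.0, 300, "TCP", "192.0.2.10"),
    Packet(2.0, 200, "TCP", "192.0.2.10"),
    Packet(700.0, 999, "UDP", "192.0.2.20"),  # falls outside the 10 min window
]
sample_features = first_stage_features(sample_packets)
```

A feature dictionary of this kind, one per device, would then be passed to whichever supervised classifier is chosen for the three-category decision.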

Koike et al. proposed a method for identifying the called functions of IoT devices using the characteristic time variability of their generated traffic to visualize their operating status ^[3]. This method collects traffic data while specific functions are called on a smart speaker. Since the time variability of the communication traffic is discriminative for each function, 30 consecutive packets were treated as one window. From each window, the average, variance, and standard deviation of the packet size, excluding the error packets of each transmission protocol, were extracted as feature values. In the experiments, 10 functions of the Amazon Echo Spot were estimated using a random forest. The results indicated that the classification accuracy for the ten functions was only 56.1%. The authors attribute the low accuracy to two factors: no-communication periods between function calls were labeled as part of each function, and the window size was fixed by definition. This method can automatically monitor the operation status of IoT devices, but the identification target is limited to one smart speaker.
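The window-based feature extraction can be sketched as below. The text does not say whether windows overlap or which variance estimator was used, so this sketch assumes non-overlapping windows and population statistics; the function name and sample sizes are hypothetical.

```python
from statistics import fmean, pstdev, pvariance

WINDOW_SIZE = 30  # consecutive packets per window, as stated in the text


def window_features(packet_sizes, window_size=WINDOW_SIZE):
    """Compute mean, variance, and standard deviation of packet size per window.

    Assumption: windows do not overlap, and a trailing partial window is dropped.
    """
    features = []
    for start in range(0, len(packet_sizes) - window_size + 1, window_size):
        window = packet_sizes[start:start + window_size]
        features.append({
            "mean": fmean(window),
            "variance": pvariance(window),
            "std": pstdev(window),
        })
    return features


# Made-up packet sizes: one constant window, one alternating window, 5 leftovers.
example_sizes = [100] * 30 + [100, 200] * 15 + [50] * 5
example_windows = window_features(example_sizes)
```

Each resulting dictionary corresponds to one training sample whose label is the function that was being called while those 30 packets were observed.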

Koike et al. proposed an improved version of this method ^[4]. This version uses two categories of features extracted from the traffic data of functions called on three smart speakers: Amazon Echo Spot, Amazon Echo Dot, and Amazon Echo Flex. The first category consists of features extracted from each individual packet, such as the transport protocol, port number, and source and destination addresses. The second category consists of features of each window of 30 consecutive packets, as in ^[3], but extended to the average, variance, standard deviation, maximum, minimum, and difference between the maximum and minimum packet sizes. Random forests were employed as the machine learning algorithm after comparing the identification accuracy of several supervised learning algorithms: random forests, extreme gradient boosting (XGBoost) ^[5], light gradient boosting machine (LightGBM) ^[6], and CatBoost ^[7]. In addition, to address the no-communication periods that contributed to the low accuracy of the previous method, this method treats the idle time before the first call and between calls, when no function is being called, as a separate function extracted from the traffic data. The experimental results demonstrated that the method could identify eleven called functions with an accuracy of 76.1% for the Amazon Echo Spot and nine functions with an accuracy of 89.8% for the Amazon Echo Dot, and that it achieved an accuracy of 85.2% for the Amazon Echo Flex. Moreover, a device identification method for the three types of Amazon Echo was implemented with an accuracy of 99.1%. This method can monitor how IoT devices operate automatically and more precisely than the previous method ^[3]. Nevertheless, the target is still limited to smart speakers.
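The idea of treating idle time as its own class can be illustrated with a short labeling sketch. It assumes that the start and end times of each function call are known during data collection; the function names, timestamps, and the `"idle"` label string are hypothetical and not taken from ^[4].

```python
IDLE_LABEL = "idle"  # hypothetical name for the no-communication class


def label_packets(timestamps, calls):
    """Assign each packet timestamp the function active at that time.

    `calls` is a list of (start, end, function_name) tuples that are assumed
    not to overlap. Packets outside every call interval are labeled as idle
    instead of being attributed to a neighboring function.
    """
    labels = []
    for t in timestamps:
        label = IDLE_LABEL
        for start, end, name in calls:
            if start <= t < end:
                label = name
                break
        labels.append(label)
    return labels


# Made-up call log and packet arrival times (seconds).
example_calls = [(10.0, 15.0, "weather"), (30.0, 38.0, "music")]
example_times = [2.0, 11.0, 14.9, 20.0, 31.0, 40.0]
example_labels = label_packets(example_times, example_calls)
```

With labels of this kind, a classifier trained on ten functions plus idle time would distinguish eleven classes, which is consistent with the eleven functions reported for the Amazon Echo Spot.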

Hattori et al. proposed a method to estimate the functions executed on several types of IoT devices ^[8], which extends the previous two methods. To examine whether it is possible to determine which functions of an IoT device have been executed, they used eight IoT devices from four categories: two smart cameras, two smart remote controllers, two smart speakers, and two smart plugs. The devices were connected to an edge router, where traffic data were collected from 10 s before to 10 s after the execution of each function. From the collected traffic data, the number and size of sending, receiving, TCP, and UDP packets, as well as the number of source and destination IP addresses, were extracted in time windows of 0.5, 1, and 1.5 s. The mean, maximum, variance, and standard deviation of the 30 extracted items were computed, giving 30 × 4 = 120 features, which were evaluated based on the feature importance calculated using random forests. The method then selected 28 important features from the 120 as feature values and identified the executed function of the IoT device using random forests. The accuracy of detecting eight functions was 73%, and the accuracy of detecting 16 combinations of eight executed functions and two IoT devices in each device category was 91%. This method enables the behavior of not only smart speakers but also a wide variety of IoT devices to be understood automatically.
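The feature counts in this paragraph can be checked with a small enumeration. The decomposition used here (eight packet measures plus two address counts per window, over three windows) is one reading of the text that reproduces the stated totals of 30 items and 120 features; the feature names and the importance scores are placeholders, not values from ^[8].

```python
from itertools import product

PACKET_KINDS = ("sent", "received", "tcp", "udp")
MEASURES = ("count", "size")
ADDRESS_COUNTS = ("num_src_ips", "num_dst_ips")
WINDOWS_S = (0.5, 1.0, 1.5)
STATISTICS = ("mean", "max", "variance", "std")


def base_items():
    """Enumerate the per-window items: (4 kinds x 2 measures + 2) x 3 windows."""
    per_window = [f"{kind}_{measure}" for kind, measure in product(PACKET_KINDS, MEASURES)]
    per_window += list(ADDRESS_COUNTS)
    return [f"{item}@{window}s" for window, item in product(WINDOWS_S, per_window)]


def all_features():
    """Apply each summary statistic to each base item."""
    return [f"{stat}({item})" for item, stat in product(base_items(), STATISTICS)]


def select_top(importances, k=28):
    """Keep the k feature names with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


feature_names = all_features()
# Placeholder scores; in practice these would come from a trained random forest.
placeholder_importances = {name: 1.0 / (i + 1) for i, name in enumerate(feature_names)}
selected_features = select_top(placeholder_importances)
```

Only the counting and the top-k selection are shown; computing real importances would require training a random forest on labeled traffic.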

Ammar et al. proposed an autonomous IoT device identification prototype ^[9], outlining a methodology for an IoT device identification assistant and its architecture. The objective was to identify the types of devices newly connected to the home gateway to help end users better manage their devices and obtain more services from them. In feature selection, the first set of features was extracted from flow characteristics: packet size, flow interarrival time, flow size, and protocol. The second set of features was extracted from the device description obtained by inspecting packet payloads: the manufacturer name from the MAC address and the device name from the DHCP information. This method identified 28 IoT devices with an average accuracy of 98% using a decision tree. The architecture was designed with operation in smart home environments in mind. However, all processing related to device identification was performed on a server, and the identification results were viewed in a web browser. This poses a problem in terms of user privacy protection.
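Deriving a manufacturer name from a MAC address typically relies on the first three octets, the organizationally unique identifier (OUI). The sketch below shows that lookup under stated assumptions: the prefixes and vendor names in the table are invented placeholders, and a real deployment would load the registry published by the IEEE instead.

```python
# Invented placeholder entries; not real OUI assignments.
HYPOTHETICAL_OUI_TABLE = {
    "AA:BB:CC": "ExampleCam Inc.",
    "AA:BB:DD": "ExamplePlug Ltd.",
}

HEX_DIGITS = set("0123456789ABCDEF")


def normalize_mac(mac):
    """Return a MAC address as upper-case, colon-separated octets."""
    digits = "".join(ch for ch in mac if ch.isalnum()).upper()
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def manufacturer_from_mac(mac, oui_table=HYPOTHETICAL_OUI_TABLE):
    """Look up the manufacturer by the OUI (first three octets) of the MAC."""
    oui = normalize_mac(mac)[:8]
    return oui_table.get(oui, "unknown")


known_vendor = manufacturer_from_mac("aa-bb-cc-01-02-03")
unknown_vendor = manufacturer_from_mac("11:22:33:44:55:66")
```

The returned name would serve as one categorical feature alongside the flow characteristics, not as an identification result on its own.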

Nguyen-An et al. proposed a method for visualizing IoT traffic characteristics and identifying IoT devices based on the average information content, known as information entropy ^[10]. This method analyzed IoT traffic properties by calculating the information entropy of the following traffic parameters observed for each source IP address over five minutes: the number of source and destination IP addresses, the number of source and destination ports, packet sizes, and the total amount of data. Subsequently, the method visualized them using behavior-shaped graphs. Moreover, an IoT device identification method based on the information entropy of IoT traffic features was implemented, and IoT devices were successfully identified with 94% accuracy. This method uses entropy to extract time series features from IoT device communication. However, focusing on entropy in only a fixed time interval could lead to inadequate feature extraction for some IoT devices.
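Information entropy here is the Shannon entropy H = Σ p·log2(1/p) of the observed value distribution of a traffic parameter. The following sketch computes it per parameter for one observation interval; the packet field names and sample values are assumptions for illustration and do not reproduce the exact parameters of ^[10].

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Return the Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy += p * log2(1.0 / p)
    return entropy


def traffic_entropy(packets, fields):
    """Compute the entropy of each named field over one observation interval."""
    return {field: shannon_entropy(pkt[field] for pkt in packets) for field in fields}


# Made-up packets from a single interval (documentation address range).
interval_packets = [
    {"dst_ip": "198.51.100.7", "dst_port": 443},
    {"dst_ip": "198.51.100.7", "dst_port": 443},
    {"dst_ip": "198.51.100.7", "dst_port": 80},
    {"dst_ip": "198.51.100.7", "dst_port": 80},
]
interval_entropy = traffic_entropy(interval_packets, ["dst_ip", "dst_port"])
```

A device that always contacts the same server yields zero entropy for the destination address, whereas a device spreading traffic evenly over many destinations yields a high value, which is what makes the measure discriminative.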

Okui et al. proposed a method for identifying IoT device models in the home domain using IP flow information export (IPFIX) records ^[11]. IPFIX is a standard for flow information, and the aim of using it is to reduce the data volume and thus the communication cost. In this method, IoT traffic captured on a gateway router was converted to IPFIX records, and the converted data were sent to the device identification server. Feature extraction from the IPFIX records and training of an identification model using LightGBM were then performed on the server. As a result, this method identified 25 IoT devices with 98.48% precision. Moreover, using IPFIX records reduced the data volume to approximately 11% of that of the raw traffic data. On the other hand, since all procedures for IoT device identification were performed on a single server, there are concerns about the concentrated load.
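The data reduction comes from replacing individual packets with one summary record per flow. The sketch below shows that aggregation step only: it does not produce the IPFIX wire format, and the dictionary keys and sample packets are assumptions chosen for the example.

```python
def aggregate_flows(packets):
    """Collapse packets into one summary record per 5-tuple flow."""
    flows = {}
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        record = flows.setdefault(
            key, {"packets": 0, "bytes": 0, "start": pkt["ts"], "end": pkt["ts"]}
        )
        record["packets"] += 1
        record["bytes"] += pkt["size"]
        record["start"] = min(record["start"], pkt["ts"])
        record["end"] = max(record["end"], pkt["ts"])
    return flows


def make_packet(ts, dst, sport, dport, proto, size):
    # Helper for the made-up sample data; all packets share one source address.
    return {"ts": ts, "src": "192.0.2.5", "dst": dst,
            "sport": sport, "dport": dport, "proto": proto, "size": size}


captured = [
    make_packet(0.0, "198.51.100.7", 50000, 443, "TCP", 120),
    make_packet(0.2, "198.51.100.7", 50000, 443, "TCP", 1400),
    make_packet(0.9, "198.51.100.7", 50000, 443, "TCP", 60),
    make_packet(1.0, "198.51.100.9", 50001, 53, "UDP", 80),
]
flow_records = aggregate_flows(captured)
```

Four packets become two flow records here; the saving grows with flow length because a record has a fixed size regardless of how many packets it summarizes.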

Trad et al. proposed a method to mitigate frequent retraining of IoT device identification models ^[12]^[13]. A Siamese neural network (SNN) was trained to generate embeddings that reflect the similarities of feature values, and a database of embedding vectors corresponding to IoT traffic was created using the SNN. For identification, 95 feature values extracted from IoT traffic were input to the SNN-based identification model, which output an embedding vector. The database was then searched for the embedding closest to the output vector, and the input traffic was classified as the IoT device corresponding to the retrieved embedding. When a new device was added, its traffic was also input to the SNN-based model, and the resulting embedding was added to the database. The database was thereby extended to include the new device type, and the model could recognize the new device without being retrained. According to the results in ^[13], this method yielded an F-measure of 85.8% even when data from 28 unknown IoT devices were added after training. However, this method also processes all traffic data at once, and all traffic data would have to be shared with cloud servers in a real environment.
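The database lookup that avoids retraining can be sketched as a nearest-neighbor search over stored embeddings. The trained SNN is not reproduced here: the two-dimensional vectors below stand in for its outputs, and the class name, device labels, and the choice of Euclidean distance are assumptions for the example.

```python
from math import dist


class EmbeddingDatabase:
    """Store (embedding, device label) pairs and classify by nearest embedding."""

    def __init__(self):
        self._entries = []

    def add(self, embedding, label):
        """Register a device; no model retraining is involved."""
        self._entries.append((tuple(embedding), label))

    def identify(self, embedding):
        """Return the label of the stored embedding closest to the query."""
        if not self._entries:
            raise LookupError("the embedding database is empty")
        query = tuple(embedding)
        _, label = min(self._entries, key=lambda entry: dist(entry[0], query))
        return label


# Placeholder vectors standing in for SNN outputs.
database = EmbeddingDatabase()
database.add((0.0, 0.0), "camera")
database.add((1.0, 1.0), "plug")

before_extension = database.identify((4.8, 5.1))  # nearest known device so far
database.add((5.0, 5.0), "speaker")               # new device type added later
after_extension = database.identify((4.8, 5.1))
```

The query that was forced onto the nearest known device before the extension is assigned to the new device afterwards, which mirrors how the method recognizes added devices by extending the database alone.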

2. Considerations Regarding Internet of Things Device Identification in Smart Home Environments

As mentioned previously, several IoT device identification methods based on traffic analysis with machine learning have been proposed. They automatically identify and understand IoT devices using traffic data. However, no consideration has been given to the implementation of these methods in a smart home environment.

IoT device identification in smart home environments involves three considerations. The first is the need to update identification models, because IoT devices, once installed, are unlikely to be used indefinitely in smart home environments. As new devices are installed and older devices are removed, the identification model must be updated frequently through periodic training so that the IoT devices in a smart home remain properly managed as they are replaced. However, conventional methods implement IoT device identification with only one-time learning and do not consider the computational and communication loads associated with model updating.

The second consideration is how to process the traffic data. Analyzing a large amount of traffic data all at once in one place can place a heavy load on the memory and CPU and can interfere with the analysis. With the growing use of IoT devices, ever larger amounts of collected IoT traffic must be processed. Although collecting and analyzing traffic data on a cloud server is one solution, this approach requires sending the traffic data captured in smart homes to the cloud server every time the identification model is trained, which is unsuitable from the perspective of communication cost. Additionally, since IoT traffic contains a large amount of user information, uploading all the traffic data to the cloud server could lead to the leakage of private user information. Furthermore, since the frequency of device usage and the amount of communication data vary by IoT device, biases potentially exist in the IoT traffic. These biases increase in proportion to the amount of traffic data, which can negatively influence traffic analysis. Therefore, it is necessary to consider how to process large amounts of traffic data appropriately when implementing these methods in smart home environments. However, conventional methods typically process all traffic data at once.

The third consideration is the target range for IoT device identification. Specific IoT devices in smart homes can be identified simply using static information such as IP and MAC addresses obtained from traffic data. From the perspective of security in smart homes, however, it is important not only to determine whether the appropriate IoT devices are connected but also to determine whether they are operating properly. Conventional methods analyze IoT traffic based on communication properties, such as the number of destination IP addresses, protocol types, and the total amount of data. However, they focus only on features extracted from traffic data over a fixed interval and thus ignore the fact that each IoT device has different communication and operation cycles. As a result, they overlook the behavior of some IoT devices, because features cannot be extracted from their traffic when the communication periods or behavior vary.

References

[1] Takasaki, C.; Korikawa, T.; Hattori, K.; Ohwada, H.; Shimizu, M.; Takaya, N. IoT device identification based on two-stage traffic analysis. IEICE Tech. Rep. 2021, 121, 47–51.
[2] Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
[3] Koike, D.; Ishida, S.; Arakawa, Y. Called function identification of IoT devices by network traffic analysis. In Proceedings of the Multimedia, Distrib., Cooperative & Mobile Symp. (DICOMO2020), Virtual Event, 24–26 June 2020; pp. 933–939.
[4] Koike, D.; Ishida, S.; Arakawa, Y. Called function identification of IoT devices by network traffic analysis. In Proceedings of the 36th Annual ACM Symp. on Applied Comput. (SAC2021), Virtual Event, Republic of Korea, 22–26 March 2021; pp. 737–743.
[5] dmlc XGBoost. XGBoost Documentation. Available online: https://xgboost.readthedocs.io/en/latest/ (accessed on 31 October 2023).
[6] Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems (NIPS2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 3149–3157.
[7] Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems (NIPS2018), Montreal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31, pp. 6639–6649.
[8] Hattori, Y.; Arakawa, Y.; Inoue, S. Function estimation of multiple IoT devices by communication traffic analysis. In Proceedings of the 4th International Conference on Activity and Behavior Computing (ABC2022), London, UK, 27–29 October 2022.
[9] Ammar, N.; Noirie, L.; Tixeuil, S. Autonomous IoT device identification prototype. In Proceedings of the 2019 Network Traffic Measurement and Analysis Conference (TMA), Paris, France, 19–21 June 2019; pp. 195–196.
[10] Nguyen-An, H.; Silverston, T.; Yamazaki, T.; Miyoshi, T. IoT traffic: Modeling and measurement experiments. IoT 2021, 2, 140–162.
[11] Okui, N.; Nakahara, M.; Miyake, Y.; Kubota, A. Identification of an IoT device model in the home domain using IPFIX records. In Proceedings of the 2022 IEEE 46th Annual Computing Software, and Applications Conference (COMPSAC), Virtual Event, 27 June–1 July 2022; pp. 583–592.
[12] Trad, F.; Hussein, A.; Chehab, A. Using siamese neural networks for efficient and accurate IoT device identification. In Proceedings of the 2022 Seventh International Conference on Fog and Mobile Edge Computing (FMEC), Paris, France, 12–15 December 2022; pp. 1–7.
[13] Trad, F.; Hussein, A.; Chehab, A. Assessing the effectiveness of siamese neural networks to mitigate frequent retraining in IoT device identification models. In Proceedings of the 2023 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 16–18 August 2023; pp. 47–52.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.

Information

Subjects: Telecommunications

Contributors:

Mizuki Asano

Takumi Miyoshi

Taku Yamazaki

Update Date: 23 Jan 2024
