Website fingerprinting is valuable for many security solutions as it provides insights into applications that are active on the network.
1. Introduction
Being able to automatically associate a portion of network traffic with a particular web application is useful to both network administrators and attackers. With the growing use of end-to-end encryption protocols (such as SSL/TLS), attackers cannot inspect the content of communications. However, encryption obscures only the content; it does not hide information such as the traffic volume and direction. This allows an attacker to exploit information leaked through side channels, such as packet length, timing, and order.
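As an illustration of the side-channel information mentioned above, the following minimal sketch summarizes an encrypted trace using only packet lengths, directions, and timing. The trace representation (a list of timestamp and signed-length pairs, positive for outgoing and negative for incoming) and the function name `side_channel_features` are assumptions made for this example and are not taken from any of the cited works.

```python
from statistics import mean

def side_channel_features(trace):
    """Summarize a trace without looking at any payload.

    trace: list of (timestamp_seconds, signed_length) pairs, where a positive
    length means client-to-server and a negative length means server-to-client.
    Lengths are assumed to be non-zero (hypothetical input format).
    """
    out_sizes = [size for _, size in trace if size > 0]
    in_sizes = [-size for _, size in trace if size < 0]
    times = [ts for ts, _ in trace]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "n_out": len(out_sizes),              # number of outgoing packets
        "n_in": len(in_sizes),                # number of incoming packets
        "bytes_out": sum(out_sizes),          # outgoing volume
        "bytes_in": sum(in_sizes),            # incoming volume
        "mean_gap": mean(gaps) if gaps else 0.0,  # average inter-arrival time
        "direction_order": [1 if size > 0 else -1 for _, size in trace],
    }

example_trace = [(0.00, 300), (0.05, -1500), (0.06, -1500), (0.30, 120)]
features = side_channel_features(example_trace)
```

Feature vectors of this kind are a typical input to the classifiers discussed in the following sections; the exact feature set varies from study to study.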
Recent studies have proposed a number of potential solutions to analyze encrypted traffic. A proposed framework
[1] monitors network traffic between users and network resources to identify the associated web application. Many machine learning algorithms (e.g., random forest) and deep learning methods, such as convolutional neural networks (CNNs), are used to uncover what applications are running on users’ smartphones
[2][3][4] or what webpages/websites users are visiting
[5][6][7]. Among these, the technique of identifying webpages/websites from network traffic is referred to as a Website Fingerprinting Attack (WFA).
Although many WFA methods have been proposed, previous studies have primarily focused on fingerprinting individual webpages (most existing methods simply treat homepages as representative webpages) to identify whether users have accessed a monitored website. They usually ignore sequences of visits, such as webpage transitions made by clicking hyperlinks. However, on most websites, users follow hyperlinks to carry out their actions. For example, users follow hyperlinks to read or post blogs on a social forum.
Identifying a web application via interaction patterns is practically significant. Creating a web application today is not difficult, since many ready-made templates are available. A website builder named Wix
[8] provides different types of templates (ranging from e-business and album templates to social forum ones) and advertises that customers can create a website in just four steps without any coding skills. Reports
[9] indicate that a police officer provided source code seized during the investigation of a case to other criminals, who used it to create a new gambling website and illegally obtain huge profits. In reality, even if a gambling website is targeted by law enforcement officers, criminals may modify its appearance (e.g., the website title and pictures) and easily build a new one. To block these slightly modified illegal websites, an approach that can detect a template-based web application is needed.
Intuitively, web applications that derive from the same template may share similar functional logic. A web application is often designed to provide users with different capabilities, i.e., users can perform certain actions. Criminals might modify the appearance of an illegal website to avoid punishment, but they cannot change those capabilities.
2. Website Fingerprinting Attack
The purpose of a Website Fingerprinting Attack (WFA) is to infer which websites/webpages are visited by users. This type of analysis can reveal private information about a user (e.g., interests, habits, and sexual and political orientations). WFA was first carried out by Cheng and Avnur
[10] in 1998. They demonstrated that the SSL protocol cannot protect against traffic analysis attacks. WFA has become a hot research topic in recent years, and many machine-learning techniques have proven very effective.
A work published in 2012
[11] was the first demonstration that application-level defenses, such as HTTPOS and randomized pipelining, are not secure. The authors modeled websites using Hidden Markov Models (HMMs), where each state corresponds to a page or a class of pages of the site. To simplify the model, they used states corresponding to page templates rather than individual pages. With this approach, an attacker can construct an HMM for each target website and use the forward algorithm to compute the log-likelihood that a given packet trace was generated by a user visiting that website. However, building an HMM for a website is not trivial.
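The forward algorithm mentioned above can be sketched as follows. This is a generic log-space implementation for a discrete HMM, not the model from [11]: the two states, the observation symbols ("small", "large"), and all probabilities are hypothetical values chosen only to make the example runnable.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) using the forward algorithm.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits symbol o
    """
    if not obs:
        return 0.0  # the empty sequence has probability 1
    n = len(start)
    alpha = [_log(start[i]) + _log(emit[i].get(obs[0], 0.0)) for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            _logsumexp([alpha[i] + _log(trans[i][j]) for i in range(n)])
            + _log(emit[j].get(symbol, 0.0))
            for j in range(n)
        ]
    return _logsumexp(alpha)

# Hypothetical two-template site model: state 0 = listing page, state 1 = article page.
start = [0.8, 0.2]
trans = [[0.3, 0.7], [0.4, 0.6]]
emit = [{"small": 0.9, "large": 0.1}, {"small": 0.2, "large": 0.8}]
score = forward_log_likelihood(["small", "large"], start, trans, emit)
```

In an attack setting, one such score would be computed per candidate website model, and the trace would be attributed to the model with the highest log-likelihood.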
Hayes and Danezis
[12] conducted a systematic analysis of feature importance, addressing the notable absence of feature analysis in the website fingerprinting literature. They proposed the k-fingerprinting attack based on random decision forests, which enables an attacker to infer which web page a client is browsing through encrypted or anonymized network connections. They demonstrated that Tor hidden services are easily distinguished from standard web pages, rendering them vulnerable to Website Fingerprinting Attacks.
FLOWPRINT
[4] is a semi-supervised mobile-app fingerprinting prototype. The authors observe that mobile apps are composed of different modules that often communicate with a relatively invariable set of network destinations. This property is leveraged to discover patterns in the network traffic. Fingerprints are created based on temporal correlations among network flows between monitored devices and their destinations.
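The idea of grouping destinations by temporal correlation can be illustrated with a heavily simplified sketch. It is not the FLOWPRINT algorithm itself: the time-slot width, the use of Jaccard similarity between activity slots, the threshold, and the connected-component grouping are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

def correlated_destination_groups(flows, slot_seconds=5.0, min_similarity=0.5):
    """Group destinations that tend to be active in the same time slots.

    flows: list of (timestamp_seconds, destination) pairs (hypothetical format).
    Returns a sorted list of sorted destination groups.
    """
    # Record the set of time slots in which each destination was contacted.
    active = defaultdict(set)
    for ts, dst in flows:
        active[dst].add(int(ts // slot_seconds))

    # Link two destinations when their activity slots overlap strongly (Jaccard).
    neighbours = {dst: {dst} for dst in active}
    for a, b in combinations(active, 2):
        similarity = len(active[a] & active[b]) / len(active[a] | active[b])
        if similarity >= min_similarity:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Each connected component of the link graph is one candidate fingerprint.
    groups, seen = [], set()
    for dst in active:
        if dst in seen:
            continue
        stack, component = [dst], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(sorted(component))
    return sorted(groups)

flows = [
    (0.5, "api.example"), (1.0, "cdn.example"),
    (11.0, "api.example"), (12.0, "cdn.example"),
    (31.0, "ads.example"),
]
groups = correlated_destination_groups(flows)
```

Here `api.example` and `cdn.example` are always contacted together and end up in one group, while `ads.example` forms its own group.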
Zhuo and Zhang et al.
[13] proposed a website-modeling method based on the profile hidden Markov model (PHMM); they exploited the hidden relationship between the first tab and the second tab to improve accuracy in identifying a particular website, rather than identifying web pages separately.
3. User Action Identification
User action identification has been studied extensively in the domain of personal mobile devices. Apps use the Wi-Fi and cellular networks of mobile devices to send and receive data. Users perform several actions while interacting with apps, and these actions generate data transmissions. The network traffic sequence of a given action typically follows a pattern that depends on the nature of the user–app interaction of that action. These patterns can be used to recognize specific user actions related to a particular app of interest in generic network traces
[14].
Conti and Mancini
[15] proposed a framework to infer which particular actions the user executes on some apps installed on her mobile phone. Dynamic Time Warping and Random Forest were used to measure the similarity between traffic sequences and classify unseen traffic traces, respectively. The authors considered seven popular apps with different purposes from the official Android market to assess their approach’s performance and showed that the accuracy and precision were higher than 95%.
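Dynamic Time Warping (DTW) can be sketched in a few lines. The version below is the textbook dynamic-programming formulation with absolute difference as the local cost; representing a traffic sequence as a list of signed packet lengths is an assumption for this example, and the Random Forest classification step is omitted.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance.

    a, b: sequences of numbers, e.g. signed packet lengths (hypothetical encoding).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of the first i items of a and first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] aligned with an already-used b item
                cost[i][j - 1],      # b[j-1] aligned with an already-used a item
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]

action_a = [300, -1500, -1500, 120]   # one observed action
action_b = [300, -1500, 120]          # same shape, one packet fewer
action_c = [300, 120, -1500]          # different ordering

same_shape = dtw_distance(action_a, action_b)
different_shape = dtw_distance(action_a, action_c)
```

Because DTW allows one element to align with several, `action_a` and `action_b` have distance zero even though their lengths differ, whereas the reordered `action_c` does not.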
Similar to
[15], Fu and Xiong investigated how to exploit encrypted Internet traffic for classifying in-App usages. They developed a system named CUMMA for classifying usages of mobile messaging Apps by jointly modeling user behavioral patterns, network traffic characteristics (packet length and time delay), and temporal dependencies
[16]. In their work, traffic flows were segmented into sessions consisting of a number of dialogs; the dialogs were then classified into single-type usages or outliers. A clustering Hidden Markov Model-based method was used to detect mixed dialogs among the outliers and decompose them into sub-dialogs of single-type usage. Experiments on WhatsApp and WeChat demonstrated the effectiveness and efficiency of their proposed method.
4. Other Related Works
A few previous papers are notable for using different techniques on similar problems. He and Yang
[17] selected features such as burst volumes and directions to represent application behaviors and used PHMM to model different types of applications (Web, FTP, P2P, and IM) on Tor. Their experimental results demonstrated that PHMM models network traffic well.
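Burst volume and direction features can be extracted as in the sketch below. It assumes a common convention from the traffic-analysis literature, namely that a burst is a maximal run of consecutive packets in the same direction; the precise definition used in [17] may differ, and the signed-length input format is hypothetical.

```python
from itertools import groupby

def bursts(signed_lengths):
    """Split a packet sequence into (direction, volume) bursts.

    signed_lengths: non-zero packet sizes, positive for outgoing and negative
    for incoming packets (hypothetical encoding).
    A burst is taken to be a maximal run of packets in one direction.
    """
    result = []
    for direction, run in groupby(signed_lengths, key=lambda s: 1 if s > 0 else -1):
        result.append((direction, sum(abs(size) for size in run)))
    return result

burst_sequence = bursts([300, 200, -1500, -1500, -400, 120])
```

The resulting sequence of (direction, volume) pairs is the kind of symbol stream that a sequence model such as a PHMM can then be trained on.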
Network traffic analysis has also been extended to research on smart home devices. PINGPONG
[18] automatically extracts fingerprints from the network traffic generated by smart home devices and recognizes their actions (such as turning a light on or off). Similarly, HoMonit
[19] analyzes the network traffic generated by smart home devices to determine the actions performed on the home device applications. Li and Feng et al.
[20] proposed generating fine-grained fingerprints based on the subtle differences between the file systems of various firmware images. They applied natural language processing techniques to process the file content and used the document object model to obtain the firmware fingerprint. Using this fingerprinting approach, they were able to recognize firmware on the Internet. However, their approach has to interact actively with the firmware and is therefore easy to detect.
Network traffic analysis has also been extended to intelligent software testing. In
[21], an automated penetration-testing framework is built to detect vulnerabilities through traffic analysis. Pyshark is used to capture traffic in four different states of IoT devices (booting, mobile application interaction, firmware mode, and offline mode). Then, tshark is used to read the .pcap files and check for vulnerabilities such as insecure firmware, lack of transport encryption, and insecure network services. Similar to
[20], this approach also interacts actively with the firmware.