Persistence Landscapes for Clustering Noisy IoT Time Series


Contributor: Renjie Chen

With the advancement of IoT technologies, there is a large amount of data available from wireless sensor networks (WSN), particularly for studying climate change. Clustering long and noisy time series has become an important research area for analyzing this data.

Keywords: elbow method; feature construction; IoT time series; persistence landscape

1. Introduction

Enhanced IoT technologies have been developing at a remarkable pace, allowing long streams of data to be collected from a large number of in situ wireless sensor networks. Application domains include business, biomedicine, energy, finance, insurance, and transportation, with sensors being installed across broad geographic regions. Data streams collected by these sensors constitute long, noisy time series with complex temporal dependence patterns, leading to several different types of interesting and useful data analysis. For example, specially adapted machine learning techniques for anomaly detection in internal temperatures, recorded by sensors that an insurance company placed inside thousands of buildings in the US, were developed in ^[1], with the goal of alerting clients and attempting to mitigate the risk of pipe freeze hazard. An extended analysis of the same IoT streams was provided in ^[2], which employed a Gaussian process model framework to assess the causal impact of a client reaction to an alert. Data analysis of energy usage, management, and monitoring on a large academic campus was described in ^[3]. The role of wireless sensor technologies in the agriculture and food industry was discussed in ^[4].

There is also considerable interest in analyzing IoT streams to understand different aspects of weather monitoring and climate change. For example, remote sensing in water environmental processes was discussed in ^[5], while ^[6] discussed how inexpensive open-source hardware is democratizing (climate) science, because open-source sensors are able to measure environmental parameters at a fraction of the cost of commercial equipment, thus offering opportunities for scientists in developed and developing countries to analyze climate change at both local and global regional levels. A report from the United Nations Intergovernmental Panel on Climate Change (IPCC) states that average temperatures are likely to continue rising, even with mitigating efforts in place (https://www.ipcc.ch/report/ar6/wg1/ (accessed on 1 October 2019)). NOAA observation systems collect data twice every day, from nearly 100 locations in the US. The National Weather Service (NWS) launches weather balloons, carrying instrument packages called radiosondes. Radiosonde sensors measure upper air conditions, such as atmospheric pressure, temperature and humidity, and wind speed and direction. The Automated Surface Observing Systems (ASOS) program is a joint effort by the National Weather Service (NWS), the Federal Aviation Administration (FAA), and the Department of Defense (DOD). The ASOS system serves as the primary surface weather observing network in the US, updating observations every minute (https://www.weather.gov/about/ (accessed on 1 October 2019)) (https://nosc.noaa.gov/OSC/ (accessed on 1 October 2019)).

When weather data are available from a large number of locations, clustering/grouping locations based on stochastic properties of the data is of considerable interest ^[7]^[8]^[9]^[10]. To this end, it is useful to develop effective algorithms that construct features capturing the behavior of the time series: clustering then proceeds on the basis of similarity/dissimilarity metrics between the features. There is a considerable literature on feature-based time series clustering. For example, ^[11] categorized feature representations for time series into four broad types: (i) data-adaptive representations, which are useful for time series of arbitrary lengths; (ii) non-data-adaptive approaches, which are used for time series of fixed lengths; (iii) model-based methods, which are used for representing time series in a stochastic modeling framework; and (iv) data-dictated approaches, which are automatically defined, based on raw time series.

Topological data analysis (TDA) ^[12] encompasses methods for discovering interesting shape-based patterns, by combining tools from algebraic topology, computer science, and statistics, and it is becoming an increasingly useful area in many time series applications. For a review of persistent homology for time series, and a tutorial using the R software, see ^[13]. In particular, the review discusses ideas such as transforming time series into point clouds via Takens embedding ^[14], creating persistence diagrams ^[15]^[16], and constructing persistence landscapes of all orders ^[17]. While persistent homology is a central tool in TDA for summarizing geometric and topological information in data using a persistence diagram (or a bar code), it is cumbersome to construct useful statistical quantities using metrics such as the Wasserstein distance. Persistence landscapes enable us to map persistence diagrams to a Hilbert space, thereby making it easier to apply tools from statistics and machine learning. Recent research has explored the use of persistence landscapes as features to either cluster or classify time series ^[18]^[19]^[20]. Persistence landscapes of all orders using weighted Fourier transforms of continuous-valued EEG time series were constructed by ^[21], and used as features for clustering the time series; randomness testing was used to examine the robustness of the approach to topology-preserving transformations, and its sensitivity to topology-destroying transformations. First-order persistence landscapes, constructed from Walsh–Fourier transforms of categorical time series from a large activity–travel transportation data set, were used by ^[22] to create features for clustering, arguing that the first-order landscape was sufficient for accurately clustering time series with relatively simple dependence properties. Several aspects of using persistent homology in time series analysis have been discussed in depth in ^[23].
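As a rough illustration of how a persistence diagram is turned into landscapes of increasing order, the following minimal sketch evaluates the standard definition from ^[17] on a grid: the k-th order landscape at t is the k-th largest value of max(0, min(t - birth, death - t)) over the points of the diagram. The function name `persistence_landscape`, the toy diagram, and the grid are assumptions made for this example; they are not taken from the cited works, and dedicated software such as the R package TDA ^[16] would normally be used instead.

```python
from typing import List, Sequence, Tuple


def persistence_landscape(
    diagram: Sequence[Tuple[float, float]],
    grid: Sequence[float],
    max_order: int,
) -> List[List[float]]:
    """Evaluate the landscapes of order 1..max_order on `grid`.

    Each (birth, death) pair contributes a "tent" function
    max(0, min(t - birth, death - t)); the k-th order landscape at t
    is the k-th largest tent value at t (0 if fewer than k pairs).
    """
    landscapes = [[0.0] * len(grid) for _ in range(max_order)]
    for j, t in enumerate(grid):
        # Tent heights at t, largest first.
        tents = sorted(
            (max(0.0, min(t - birth, death - t)) for birth, death in diagram),
            reverse=True,
        )
        for k in range(min(max_order, len(tents))):
            landscapes[k][j] = tents[k]
    return landscapes


# Hypothetical toy diagram with two nested features.
diagram = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
lam = persistence_landscape(diagram, grid, max_order=2)
# lam[0] is the first-order landscape, lam[1] the second-order one.
```

In this toy case the second-order landscape is nonzero only where both tents overlap, which reflects why higher orders tend to sit closer to zero than lower orders.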

It is well known that lower-order persistence landscapes contain more important topological features than higher-order landscapes, which are closer to zero ^[17]. In many situations, it may be unnecessary and computationally prohibitive to use persistence landscapes of all orders to elicit useful features of time series. Selecting the order of the persistence landscapes to serve as features requires a delicate balance between missing important signals and introducing too much noise. Existing research has not addressed the problem of data-based selection of the order of persistence landscapes that is sufficient to yield accurate clustering. The focus of this research was therefore to address the problem of deciding the order of persistence landscapes in a time series clustering scenario, and to study this question in the context of noisy, periodic stationary time series, using the smoothed second-order spectrum to construct persistence landscapes. The solution was an algorithm that automatically selected the optimal order of persistence landscapes in a sequential way. These features were then used in clustering the time series. The computational gain from the algorithm was demonstrated through extensive simulation studies, which showed a speed-up of approximately 13 times. The researchers then illustrated their approach, using long temperature streams from different US locations, and showed that features constructed from the selected orders of persistence landscapes produced meaningful clusters of the locations, which may be useful for climate scientists in a comparative study of temperature patterns over time in several locations.

2. Persistence Landscapes for Clustering Noisy IoT Time Series

Clustering locations or climate stations based on temperature or precipitation time series has been discussed in several recent articles. A two-step cluster analysis of 449 southeastern climate stations was described in ^[7], to determine general climate clusters for eight southeastern states in the US, and has been employed in several follow-up analyses involving the classification of synoptic climate types. In a similar vein, ref. ^[8] used a hierarchical cluster analysis to demarcate climate zones in the US, based on weather variables, such as temperature and precipitation.

Spatial grouping of over 1000 climate stations in the US was discussed in ^[10], by using a hybrid clustering approach, based on a measure of rank correlation as a metric of statistical similarity. Based on clustering the temperatures at these stations, they showed that roughly 25% of the sites accounted for nearly 80% of the spatial variability in seasonal temperatures across the country.

A framework for detecting critical transitions in daily cryptocurrency prices, using topological data analysis (TDA), was proposed by ^[19]. They (i) transformed the time series into point clouds, using Takens' delay embedding, (ii) computed persistence landscapes for each point cloud window, (iii) converted the landscapes to their $L_1$ norms, and (iv) used K-means clustering for these windowed time series.
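Steps (i) and (iii) of this pipeline can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function names `delay_embedding` and `l1_norm`, the embedding dimension and delay, and the trapezoidal approximation of the norm are choices made for this example and are not taken from ^[19]. Step (ii), the persistent homology computation, is omitted because it requires a dedicated TDA library.

```python
from typing import List, Sequence, Tuple


def delay_embedding(
    series: Sequence[float], dim: int, delay: int
) -> List[Tuple[float, ...]]:
    """Map x_t to the point (x_t, x_{t+delay}, ..., x_{t+(dim-1)*delay})."""
    n_points = len(series) - (dim - 1) * delay
    if n_points <= 0:
        raise ValueError("series is too short for this dim and delay")
    return [
        tuple(series[t + i * delay] for i in range(dim))
        for t in range(n_points)
    ]


def l1_norm(values: Sequence[float], grid: Sequence[float]) -> float:
    """Trapezoidal approximation of the integral of |values| over `grid`."""
    total = 0.0
    for i in range(1, len(grid)):
        width = grid[i] - grid[i - 1]
        total += 0.5 * (abs(values[i - 1]) + abs(values[i])) * width
    return total


# Hypothetical inputs: a short series and a landscape sampled on a grid.
cloud = delay_embedding([1.0, 2.0, 3.0, 4.0, 5.0], dim=2, delay=2)
norm = l1_norm([0.0, 1.0, 2.0, 1.0, 0.0], [0.0, 1.0, 2.0, 3.0, 4.0])
```

Each window of the series would yield one point cloud and, after step (ii), one norm value; the resulting sequence of norms is what a K-means step such as (iv) would then operate on.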

Persistence landscapes of all orders were employed by ^[20] as topological features for time series, via Takens' time-delayed embedding transformation, and principal component analysis for denoising the time series.

The problem of clustering continuous-valued EEG time series was studied by ^[21]. They constructed weighted Fourier transforms of the time series, computed persistence landscapes of all orders from them, and used these landscapes as features for clustering the time series. They also examined the robustness of their approach to topology-preserving transformations, and its sensitivity to topology-destroying transformations.

The use of persistence landscapes for clustering categorical time series from a large activity–travel transportation data set was described in ^[22]. They first constructed Walsh–Fourier transforms of the categorical time series, and then obtained first-order persistence landscapes from the Walsh–Fourier transforms, which they used as features for clustering the time series. In this case, they argued that the first-order landscape was sufficient for accurately clustering time series with relatively simple dependence properties, as in the activity–travel transportation data.

An analysis of multivariate time series using topological data analysis was proposed in ^[24], by converting the time series into point cloud data, calculating Wasserstein distances between the persistence diagrams, and using the k-nearest neighbors algorithm for supervised machine learning, with an application to predicting room occupancy during a time window.
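To make the notion of a Wasserstein distance between persistence diagrams concrete, here is a minimal brute-force sketch of the 1-Wasserstein distance with an L-infinity ground metric, in which unmatched points may be sent to the diagonal. The function name `wasserstein_1`, the choice of order 1 and of the ground metric, and the toy diagrams are assumptions for this example; the exact variant used in ^[24] is not stated here. The search over all permutations is only feasible for very small diagrams, and practical work would rely on an optimized TDA library.

```python
from itertools import permutations
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (birth, death)


def _cost(p: Optional[Point], q: Optional[Point]) -> float:
    """L-infinity matching cost; None stands for the diagonal."""
    if p is None and q is None:
        return 0.0
    if p is None:
        return (q[1] - q[0]) / 2.0
    if q is None:
        return (p[1] - p[0]) / 2.0
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def wasserstein_1(d1: Sequence[Point], d2: Sequence[Point]) -> float:
    """Brute-force 1-Wasserstein distance between two small diagrams."""
    # Pad each side with diagonal slots so every point can be matched
    # either to a point of the other diagram or to the diagonal.
    left = list(d1) + [None] * len(d2)
    right = list(d2) + [None] * len(d1)
    return min(
        sum(_cost(p, right[j]) for p, j in zip(left, perm))
        for perm in permutations(range(len(right)))
    )


# Hypothetical toy diagrams.
dist_close = wasserstein_1([(0.0, 4.0)], [(0.0, 3.0)])
dist_empty = wasserstein_1([(0.0, 4.0)], [])
```

A matrix of such pairwise distances between the diagrams of different time series could then be passed to a k-nearest neighbors classifier that accepts precomputed distances.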

Time series clustering with topological–geometric mixed distance (TGMD) was discussed in ^[25], which jointly considered the local geometric features and global topological characteristics of time series data. The results revealed that their proposed mixed-distance-based similarity measure could lead to promising results, and to better performance than standard time series analysis techniques that consider only topological or geometrical similarity.

This entry is adapted from the peer-reviewed paper 10.3390/fi15060195

References

Soliman, A.; Rajasekaran, S.; Toman, P.; Ravishanker, N.; Lally, N.; D’Addeo, H. A Custom Unsupervised Approach for Pipe-Freeze Online Anomaly Detection. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14 June–31 July 2021; pp. 663–668.
Toman, P.; Soliman, A.; Ravishanker, N.; Rajasekaran, S.; Lally, N.; D’Addeo, H. Understanding insured behavior through causal analysis of IoT streams. In Proceedings of the 2023 6th International Conference on Data Mining and Knowledge Discovery (DMKD 2023), Chongqing, China, 24–26 June 2023.
Chen, M.H.; Lim, D.; Ravishanker, N.; Linder, H.; Bolduc, M.; McKeon, B.; Nolan, S. Collaborative analysis for energy usage monitoring and management on a large university campus. Stat 2022, 11, e513.
Ruiz-Garcia, L.; Lunadei, L.; Barreiro, P.; Robla, J.I. A review of wireless sensor technologies and applications in agriculture and food industry: State of the art and current trends. Sensors 2009, 9, 4728–4750.
Cui, X.; Guo, X.; Wang, Y.; Wang, X.; Zhu, W.; Shi, J.; Lin, C.; Gao, X. Application of remote sensing to water environmental processes under a changing climate. J. Hydrol. 2019, 574, 892–902.
Levintal, E.; Suvočarev, K.; Taylor, G.; Dahlke, H.E. Embrace open-source sensors for local climate studies. Nature 2021, 599, 32.
Stooksbury, D.; Michaels, P. Cluster analysis of southeastern US climate stations. Theor. Appl. Climatol. 1991, 44, 143–150.
Fovell, R.G.; Fovell, M.Y.C. Climate zones of the conterminous United States defined using cluster analysis. J. Clim. 1993, 6, 2103–2135.
Fovell, R.G. Consensus clustering of US temperature and precipitation data. J. Clim. 1997, 10, 1405–1427.
DeGaetano, A.T. Spatial grouping of United States climate stations using a hybrid clustering approach. Int. J. Climatol. J. R. Meteorol. Soc. 2001, 21, 791–807.
Aghabozorgi, S.; Seyed Shirkhorshidi, A.; Ying Wah, T. Time-series Clustering—A Decade Review. Inf. Syst. 2015, 53, 16–38.
Edelsbrunner, H.; Harer, J. Computational Topology: An Introduction; American Mathematical Society: Providence, RI, USA, 2010.
Ravishanker, N.; Chen, R. An introduction to persistent homology for time series. Wiley Interdiscip. Rev. Comput. Stat. 2021, 13, e1548.
Takens, F. Detecting strange attractors in turbulence. Lect. Notes Math. 1981, 898, 366–381.
Perea, J.A.; Harer, J. Sliding Windows and Persistence: An Application of Topological Methods to Signal Analysis. Found. Comput. Math. 2015, 15, 799–838.
Fasy, B.T.; Kim, J.; Lecci, F.; Maria, C. Introduction to the R package TDA. arXiv 2014, arXiv:cs.MS/1411.1830.
Bubenik, P. Statistical Topological Data Analysis Using Persistence Landscapes. J. Mach. Learn. Res. 2015, 16, 77–102.
Truong, P. An Exploration of Topological Properties of High-Frequency One-Dimensional Financial Time Series Data Using TDA. Master’s Thesis, KTH Royal Institute of Technology, Mathematical Statistics, Stockholm, Sweden, 2017.
Gidea, M.; Goldsmith, D.; Katz, Y.; Roldan, P.; Shmalo, Y. Topological recognition of critical transitions in time series of cryptocurrencies. Phys. A Stat. Mech. Appl. 2020, 548, 123843.
Kim, K.; Kim, J.; Rinaldo, A. Time Series featurization via topological data analysis. arXiv 2018, arXiv:1812.02987.
Wang, Y.; Ombao, H.; Chung, M.K. Topological data analysis of single-trial electroencephalographic signals. Ann. Appl. Stat. 2018, 12, 1506.
Chen, R.; Zhang, J.; Ravishanker, N.; Konduri, K. Clustering activity—Travel behavior time series using topological data analysis. J. Big Data Anal. Transp. 2019, 1, 109–121.
Chen, R. Topological Data Analysis for Clustering and Classifying Time Series. Doctoral Dissertation, University of Connecticut, USA, 2019. Available online: https://opencommons.uconn.edu/dissertations/2365 (accessed on 1 January 2022).
Wu, C.; Hargreaves, C.A. Topological machine learning for multivariate time series. J. Exp. Theor. Artif. Intell. 2022, 34, 311–326.
Zhang, Y.; Shi, Q.; Zhu, J.; Peng, J.; Li, H. Time Series Clustering with Topological and Geometric Mixed Distance. Mathematics 2021, 9, 1046.
