Adaptive Graphs for Multi-View Subspace Clustering

Please note this is a comparison between Version 1 by Qiliang Liu and Version 2 by Jason Zhu.

Clustering of multi-source geospatial big data provides opportunities to comprehensively describe urban structures. Most existing studies focus only on the clustering of a single type of geospatial big data, which leads to biased results. Although multi-view subspace clustering methods are advantageous for fusing multi-source geospatial big data, exploiting a robust shared subspace in high-dimensional, non-uniform, and noisy geospatial big data remains a challenge.

multi-view subspace clustering
geospatial big data
shared nearest neighbor graph

1. Introduction

Multi-source geospatial big data, such as taxi GPS trajectories [1], smart card transactions [2], mobile phone data [3], social media check-in records [4], and points of interest (POIs) [5], have become increasingly available in the current era of big data. Geospatial big data provide a new opportunity for understanding the “human-earth” relationship [6]. Clustering geospatial big data is vital for describing urban structures and understanding the organization of cities [7]. For example, remote sensing techniques have been widely used to uncover urban land use information based on the physical characteristics of ground components (e.g., spectral, shape, and texture) [8]; however, remote sensing techniques struggle to capture the socioeconomic attributes and human dynamics that are closely related to urban land use [3]. In contrast, clustering human mobility data can help reveal urban land use from the perspective of social function, which is an important complement to remote sensing [6]. Clustering geospatial big data is also useful for identifying urban functional structures and human activity patterns, which support human-centric urban planning [9,10,11]. For example, the actual functions of a region may be inconsistent with the original zoning scheme designed by urban planners [12]. Clusters discovered from geospatial big data can reveal the urban functional zones that form naturally through human activities, which may provide useful calibration for urban planners [9]. Clusters discovered from social media check-in records are also useful for identifying emergency events in a city, which helps maintain public safety [4].

Although clustering of geospatial big data has received attention in recent years, most existing studies focus on a single type of geospatial big data [11,13]. Owing to the bias of each type of geospatial big data, clustering results obtained from single-source geospatial big data cannot provide a comprehensive view of urban structures [14]. A few studies have used a weighted average strategy to fuse multi-source geospatial big data [15,16]. However, multi-source geospatial big data usually reflect different or overlapping dimensions of human activities, and a weighted average strategy that ignores the shared and complementary information among the different data types may introduce unpredictable errors [17]. Multi-view subspace clustering has the potential to fuse the underlying complementary information of multi-source geospatial big data [18,19]; however, high-dimensional, non-uniform, and noisy geospatial big data bring two challenges [20,21,22,23]: (1) the quality of the low-dimensional subspace is substantially influenced by redundant features and noise in the original data; and (2) the neighboring relationships of data points in the high-dimensional and non-uniform original data space are difficult to preserve in a low-dimensional subspace. Therefore, existing multi-view subspace clustering methods are highly likely to generate an inaccurate subspace, which degrades the clustering performance. To overcome these challenges, this study developed a method with adaptive graphs to constrain multi-view subspace clustering of geospatial big data from multiple sources (agc2msc).

2. Multi-source Geospatial Big Data

Most existing studies focus on the clustering of a single type of geospatial big data, e.g., taxi GPS trajectories [1], social media check-in records [4], POIs [13], and mobile phone data [24]. After clustering features are extracted from a certain type of geospatial big data, traditional clustering methods such as k-means [25], spectral clustering [26], and DBSCAN [27] are used to identify clusters. To account for the dynamic characteristics of geospatial big data, some online and incremental clustering methods are also available [4]; these methods are useful for understanding the organization of cities from the perspective of social functions [7]. Despite these fruitful results, the bias of a single type of geospatial big data hinders a comprehensive understanding of urban structures [11,17]. To overcome this limitation, clustering of multi-source geospatial big data has received increasing attention in recent years. For example, some scholars [28] first combined taxi trajectory data and public transit records to reveal human mobility patterns, then used POI features as prior knowledge to extract features of these patterns, and finally applied k-means to the extracted features. To account for the contributions of different types of geospatial big data, the weighted average strategy has been employed to fuse the features of multi-source geospatial big data. The weights of the different data types can be determined from the proportions of total bus and cab ridership [15] or by the entropy weight approach [16]. Weighted average methods can fuse the information of multi-source geospatial big data to a certain extent; however, they cannot incorporate the complex interactions and correlations among multi-source geospatial big data.
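The weighted average strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [15] or [16]: the function name `weighted_average_fusion`, the toy bus and taxi feature matrices, and the ridership totals used as weights are all hypothetical, and the views are assumed to describe the same regions (rows) with the same feature dimensions (columns).

```python
def weighted_average_fusion(views, weights):
    """Fuse same-shaped feature matrices by a weighted average.

    views   -- list of matrices (lists of rows), one per data source
    weights -- one non-negative weight per source; normalized to sum to 1
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]

    n_rows, n_cols = len(views[0]), len(views[0][0])
    return [
        [sum(w * view[i][j] for w, view in zip(norm, views)) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical toy data: two regions described by two features in each view.
bus_features = [[10.0, 2.0], [0.0, 8.0]]
taxi_features = [[2.0, 6.0], [4.0, 4.0]]

# Hypothetical ridership totals used as weights (bus 300, taxi 100 -> 0.75, 0.25).
fused = weighted_average_fusion([bus_features, taxi_features], [300, 100])
print(fused)
```

Each fused entry is a fixed linear blend of the corresponding entries in the two views, so any structure that appears only through the interaction between views is not modeled.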
As an analogy, one can think of a cone as the socioeconomic information that comprehensively describes the urban structure (i.e., the underlying structure of multi-source geospatial big data). In practice, this socioeconomic information is embedded in different types of geospatial data, each of which shows only one projection of it, just as a cone appears as a triangle from the side and as a circle from above. Different types of geospatial data can therefore be regarded as different views of the same socioeconomic information. The weighted average strategy does not capture the complementarity of these views, so its result may be only a simple superposition of multiple features from which the underlying structure of multi-source geospatial big data cannot be reconstructed. Compared with the weighted average strategy, multi-view subspace clustering has the potential to reconstruct the underlying structure of multi-source geospatial data [18,19]. Multi-view subspace clustering assumes that multi-view data points are drawn from a shared low-dimensional subspace, rather than being uniformly distributed in the original space [29]. The features of each type of geospatial big data can be reconstructed from the shared subspace. In theory, multi-view subspace clustering can therefore fuse the shared and complementary information among different types of geospatial big data.
Existing multi-view subspace clustering methods are mainly extensions of self-representation-based subspace clustering methods [30,31]. Self-representation-based subspace clustering assumes that each point x_{i} can be represented by a linear combination of the other points x_{j} (j ≠ i) [32,33,34,35]. Earlier multi-view subspace clustering methods first calculate a subspace representation for each type of data and then combine the multiple subspace representations for clustering [21,29,36,37,38]. Although these methods can consider the shared and/or view-specific information of multi-source data, subspaces reconstructed from the original data are not robust to redundant features and noise in the original data [39]. To address this limitation, latent multi-view subspace clustering methods have recently been developed [18,22]; these methods first use dimension reduction techniques to project the original data features into a latent representation and then use the latent representation for subspace clustering. Although latent multi-view subspace clustering methods can boost the clustering performance on multi-source geospatial big data, two challenges remain: (1) Existing methods usually use a linear projection to transform the original data features into a latent representation [22,39,40]; however, the relationship between each type of data and its latent representation is usually non-linear [18,41]. The inaccurate latent representations obtained by existing methods may therefore degrade the clustering performance. (2) The neighboring relationships of data points in high-dimensional, non-uniform, and noisy original data are difficult to preserve in the shared subspace [34,42]. Some scholars have used neighbor graphs as constraints to preserve the neighboring relationships of data points in multi-view subspace clustering [21,43,44]; however, neighbor graphs defined using Euclidean distance and k-nearest neighbors cannot construct appropriate neighboring relationships for high-dimensional and non-uniform geospatial big data [45,46]. Therefore, existing methods are highly likely to generate an inaccurate subspace, which reduces the clustering quality [47].
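The self-representation assumption mentioned above can be written compactly as follows. This is a generic textbook-style formulation given for orientation, not the specific objective of any cited method; the symbols X, C, n, d, λ, and Ω are notation assumed here and do not appear in the text.

```latex
% Assumed notation: X = [x_1, ..., x_n] \in \mathbb{R}^{d \times n} stacks the n points as columns,
% and C \in \mathbb{R}^{n \times n} holds the self-representation coefficients.
\begin{align}
  x_i &= \sum_{j \neq i} C_{ji}\, x_j, \qquad i = 1, \dots, n, \\
  X   &= XC, \qquad \operatorname{diag}(C) = 0 .
\end{align}
% With noisy data the equality is usually relaxed; \Omega is a generic regularizer
% (for example a sparsity or low-rank penalty) and \lambda > 0 its weight.
\begin{equation}
  \min_{C} \; \lVert X - XC \rVert_F^2 + \lambda\, \Omega(C)
  \quad \text{s.t.} \quad \operatorname{diag}(C) = 0 .
\end{equation}
```

In formulations of this kind, a symmetric affinity such as (|C| + |C|^T)/2 is commonly passed to spectral clustering to obtain the final clusters; the constraint diag(C) = 0 rules out the trivial solution in which every point represents itself.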
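The difficulty with k-nearest-neighbor graphs on non-uniform data can be illustrated with a small sketch. The function `knn_graph` and the toy coordinates below are hypothetical and use only the Python standard library; the example is meant to show the failure mode, not to reproduce any cited method.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points
    under Euclidean distance (nearest first)."""
    graph = {}
    for i, p in enumerate(points):
        # Distances from point i to every other point, sorted ascending.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Hypothetical toy data: a dense group (indices 0-2) and a sparse group (indices 3-4).
points = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0), (5.0, 0.0), (7.0, 0.0)]

graph = knn_graph(points, k=2)
for i, neighbors in graph.items():
    print(i, neighbors)
```

With a fixed k = 2, the points in the dense group keep their neighbors inside that group, whereas each point in the sparse group has only one same-group candidate and is therefore linked to point 2 in the dense group. A single global k thus creates cross-group edges where the density varies.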