k-NN Query for High-Dimensional Index Using Machine Learning

k-NN Query for High-Dimensional Index Using Machine Learning: History

View Latest Version

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Computer Science, Theory & Methods

Contributor:

Dojin Choi

Jiwon Wee

Sangho Song

Hyeonbyeong Lee

Jongtae Lim

Kyoungsoo Bok

Jaesoo Yoo

The three k-nearest neighbor (k-NN) optimization techniques for a distributed, in-memory-based, high-dimensional indexing method to speed up content-based image retrieval. The techniques perform distributed, in-memory, high-dimensional indexing-based k-NN query optimization: a density-based optimization technique that performs k-NN optimization using data distribution; a cost-based optimization technique using query processing cost statistics; and a learning-based optimization technique using a deep learning model, based on query logs.

query optimization
data distribution
k-NN
high-dimensional index
machine learning

1. Introduction

With the recent development of real-time image processing, technologies for object recognition and object retrieval in images extracted from real-time operating devices, such as closed-circuit television (CCTV), are being actively implemented [1,2,3,4,5,6,7]. These technologies can be used in various fields, such as crime prevention, monitoring systems, and analyzing traffic information.

Content-based image retrieval (CBIR) involves the retrieving of images by using features extracted from objects in video images. CBIR involves vectorizing the features extracted from images and determining the similarities between the vectors to retrieve similar images. In sum, CBIR is a way of retrieving images from a database. In CBIR, a user specifies a query image and obtains images from the database that are similar to the query. To find the most similar images, CBIR compares the content of the query image with the database images. Owing to the development of new technologies, such as artificial intelligence and machine learning, researchers have conducted studies into extracting various features from images [8,9,10,11]. Data becomes increasingly high-dimensional as the features provided by images become more varied. Therefore, similarity retrieval techniques that use high-dimensional data are needed to retrieve and compare similar images or objects. Additionally, the indexing structure for high-dimensional data similarity retrieval must be appropriately constructed, in accordance with the characteristics of high-dimensional data.

Nearest neighbor search (NNS) deals with the problem of finding the closest or most similar item to a given point. Closeness is typically expressed in terms of a dissimilarity function, such as Euclidean distance, Manhattan distance, and other distance metrics. Formally, the nearest-neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q. k-NN search is a generalized NN problem. k-NN needs to find the k closest points.

Generally, in the distributed processing environment, a distributed k-NN query is processed as follows. First, each distributed node indexes data points. Depending on the indexing scheme, the k- NN query processing methodology is different. In general, when a k-NN query is inputted, all of the nodes are requested to process the k-NN query. Each node generates k closest data points and k data points generated in n nodes are merged. Finally, the result includes the k closest points to the query from the merged result.

Various studies have been conducted recently to address these problems [12,13,14,15,16,17]. A distributed, in-memory-based, high-dimensional indexing technique was proposed to efficiently perform image retrieval using high-dimensional feature data [12]. The authors of [12] utilized big data analysis platforms, such as Spark, to implement distributed, in-memory-based, high-dimensional indexing. In addition, the authors of [13] proposed an M-tree [13] indexing algorithm on Spark. Since all distributed servers participate in query processing [12], the load on all the servers can increase when there are many retrieval requests from users. In another study, a master/slave model for distributed index query processing was used to perform efficient image retrieval in airport video monitoring systems [14]. The researchers proposed a distributed MVP (multi-vantage-point) tree, which was based on the MVP tree. However, it had an inherent flaw: it was difficult to load large-scale, high-dimensional data into its memory. Moreover, backtracking operations frequently occurred when performing k-NN query processing in the tree. Backtracking is a commonly used algorithm for searching. When processing k-NN queries in a tree structure, it explores a specific node and returns to the parent node if there are no results. It generates query results by repeating this process. In another study, a distributed k-d tree [16] was proposed to enhance the performance of k-NN processing [15]. However, depending on the amount of distributed data, the height of the k-d tree could increase, which would increase the search time [15]. Using a k-d tree also results in frequent backtracking operations when performing k-NN processing [14].

Contrary to conventional hash functions [17], LSH (locality-sensitive hashing) aims to maximize hash collisions when indexing high-dimensional data. It stores similar data in the same bucket to improve the efficiency of the search and indexing. Using random vectors, LSH transforms high-dimensional data into low-dimensional bucket indexes. Following a query request, the system searches for the bucket that contains the query result using the query location, measures the actual distance between the data within the bucket, and performs k-NN query processing. However, in LSH, the k-NN results may include false positives, depending on the index creation parameters. More buckets can be searched to ensure accurate results; however, this increases the search cost because it requires distance comparisons among all the data in the adjacent buckets. Because k-NN query processing involves finding the closest k items, distance-based indexing is efficient as it can pre-calculate the distance values to the items and index them.

The authors of [18] used a deep learning technique based on a combination of CNNs to classify images, and used RNNs to analyze natural language queries. They also used CNNs to take advantage of the deep learning technology in image content classification. The RNN model helps users to make search queries more efficiently. The authors of [19] identified occupied and vacant parking lots using a hybrid deep learning model. The model proposed in that study combined the superior features of CNNs and LSTM deep learning methods. The authors of [20] proposed a CBIR system that was based on multi deep neural networks, which combined convolutional neural networks (CNNs) and k-NN methodologies. The feature extraction of the user-supplied image was performed based on the CNNs, and the image similarity was calculated based on the k-NN methodologies in order to return a list of images. The authors of [21] compared the image retrieval of various machine learning models, such as SVMs (support vector machines), k-NNs, and CNNs.

2. k-NN Query Optimization for High-Dimensional Index Using Machine Learning

To efficiently perform CBIR, researchers have studied high-dimensional indexing techniques for retrieval, using the high-dimensional feature vectors of objects within images [12,13,14,15,16,22,23,24,25,26].

A method to address the challenges in quickly and efficiently indexing large-scale multimedia data was proposed in [12]. The proposed technique used Spark to build a distributed M-tree, enabling fast and cost-effective multimedia database retrieval. However, each node (or executor) in Spark did not have a specified indexing area and performed partitioning and indexing using the data partitioning policy. Consequently, the nodes were not filtered, because all of them must be visited when processing a k-NN query; therefore, when many queries occur, the load on all of the nodes increases, because they have to be visited when processing the queries. As a result, when the search is concentrated on a single node, the overall query processing time can be delayed until the result for that node is returned.

A new indexing method to address the scalability issue of k-d trees for k-NN query processing was proposed in [16]. The researchers constructed a distributed k-d tree to index multi-dimensional data, which is often used in this manner [16]. The distributed k-d tree consists of a global k-d tree and local k-d trees, which can serve as both masters and as slaves at each terminal node of the global k-d tree. The global k-d tree is used to divide the entire data area for processing, and local k-d trees are constructed for each partitioned area to build the index. The master/slave indexing structure is built and processed to perform distributed processing. Because the distributed k-d tree divides the area, it is easy to identify nodes that do not participate in query processing. Therefore, in [1], a filtering feature was added, which enabled more efficient query processing. However, a disadvantage of the k-d tree are the frequent backtracking operations. Moreover, because the distributed k-d tree is rebuilt as a set of local k-d trees, in addition to the global k-d tree, the query processing time increases as the tree height increases, depending on the data distribution.

A distributed MVP tree, distributed-MVP (D-MVP), was implemented to perform high-dimensional indexing for image retrieval in an airport video monitoring system [14]. The MVP tree is an improved version of the VP tree. To improve the effectiveness of later tree searches, the MVP tree divides the data using multiple partition points and stores them in each node. It also stores the distance to the partition points. The D-MVP tree uses a master/slave model for query processing, wherein the master node divides the area. The master monitors the overall system load and maintains balance. To avoid overloading a slave node, when input data are concentrated on it, the master node increases the height of the tree—a method called “hot spot load balancing”—to dynamically add partition areas. However, the M-tree family constantly calculates distances at higher nodes to find terminal nodes that can conduct query processing, which increases the load on the master node and delays the processing time in real-time query processing environments.

iDistance indexing [22] is a distance-based indexing technique that represents high-dimensional data as one-dimensional distances and indexes the data in a B+-tree [22,23]. The distance is calculated between the reference point and the corresponding data when they are displayed in the distance space. Key-values are generated in order of proximity to the reference point, and these key-values are indexed in the B+-tree. Because indexing is performed using only distance, the exact location of the object cannot be determined, and there is a possibility of generating the same key-value. Since iDistance adds a constant to the distance, there is almost no probability that one data point will be included in two reference points. However, if the distance exceeds the constant, an incorrect key-value may be generated. Therefore, an appropriate constant must be assigned. K-NN queries are converted to range queries for processing in iDistance because the index is the distance to the reference point. It is not possible to precisely locate all the data. The distance value between the query and reference points is calculated in order to compute the actual key-value to be inserted into the B+ tree, and only the data within the search range in which the k-NN query is converted is searched to generate the k-NN result. The search range converted to iDistance is initially set to 1% of the total index area, which may lead to frequent search range expansions. Therefore, an optimization technique is needed to convert these k-NN queries into range queries.

The authors of [24] proposed an optimization technique based on product quantization. They applied the optimization technique in different similarity search scenarios, such as brute-force, approximate, and compressed-domain. They also presented a near-optimal algorithmic layout for exact and approximate k-nearest neighbor searches on GPU.

The authors of [25] proposed a distributed image retrieval framework, based on location-sensitive hash (LSH) on Spark. A distributed, K-means-based bag of visual words (BoVW) algorithm and an LSH algorithm were proposed to build LSH index vectors on Spark in parallel. It performs a K-means-based BoVW distributed algorithm by extracting SIFT feature data point sets. However, it takes more than 95% of the time for the whole process of constructing the index, and the effect of clustering is very uncertain.

The authors of [26] proposed a fast CBIR system using Spark (CBIR-S), which targets large-scale images. It uses a memory-centric distributed storage system called Tachyon to enhance the write operation. It can speed up by using a parallel k-NN search method, which is based on the MapReduce model on Spark. In addition, it can exploit the cache method of the Spark framework.

A distributed, in-memory, high-dimensional indexing technique was presented in [17]. It aims at efficient CBIR in large-scale, high-dimensional data, using a high-dimensional indexing technique with a master/slave structure. Additionally, Spark was used to build an index for high-dimensional vector data. A hybrid distributed high-dimensional index was implemented to address the issue of k-d trees and iDistance. Combining the advantages of both indexes, the overall structure consists of a master/slave structure, in which the master is responsible for data distribution and for selecting slaves for query processing, which reduces the system load. The slave nodes process the query and conduct the data indexing.

The k-NN algorithm has a wide range of uses because it finds k neighbors for a given value of k. However, it has a drawback: the processing cost increases proportionally to the amount of data or the number of dimensions. Several studies have been conducted to improve the throughput of k-NN [26,27,28].

The “jump method” was proposed to increase the speed of k-NN query processing [28]. To achieve this, the k-means algorithm is applied to the data, and clusters are generated using the formula proposed in the study. When the user provides a value for k, k centers are created, and the k-means algorithm allocates each data point to the nearest center.

Reference [29] overcomes the limitation of the original k-NN algorithm’s ignoring of the influence of the neighboring points, which directly affects localization accuracy. The researchers improved the indoor location identification using Wi-Fi-received, signal strength indicator (RSSI)-based fingerprints. The RSSI is highly dependent on the access point (AP) and the fingerprint learning stage, thus DNN is used in conventional k-NN to resolve this dependency. Additionally, DNN is used to classify fingerprint data sets. Then these possible locations in a certain class are classified by the improved k-NN algorithm to determine the final position. The improved k-NN is conducted by boosting the weights on K-nearest neighbors according to the number of matching AP.k-NN, and DNN were used to address the intrusion detection accuracy of existing intrusion detection systems [30]. Network attack data are classified by performing classification and labeling tasks on a “CICIDS-2017” dataset, which includes network attacks. The k-NN algorithm is used for machine learning, whereas DNN is used for deep learning, and the outputs of the two methods are then compared. A comparison with the general k-NN queries using the results demonstrates the superiority of DNN.

This entry is adapted from the peer-reviewed paper 10.3390/electronics12112375

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.