Spatial cognition tasks such as self-localization and navigation call for an efficient learning approach capable of mimicking human ability. An enduring challenge for autonomous agents in the fields of geoinformatics, computer vision, and robotics is determining their location in the environment. The concept of location is inherently relative: one cannot describe the location of an object without providing a reference or map.
1. Introduction
The human brain is a brilliant information processor and is exceptionally skilled at finding one’s location on a map. Such extraordinary abilities have attracted much attention from neuroscientists seeking to explore and model how the human brain performs this fundamental cognitive task. An early neuroscience study showed that an internal map of the environment, referred to as the “cognitive map”, uses a graph representation to locate oneself
[1] and navigate to a designated destination
[2]. For instance, in vector-based navigation, agents can find their location on a map based on the distance they have traversed and the corners they have turned
[3][4]. Understanding this process and building computational models of it is crucial for bringing advanced artificial intelligence capabilities to a number of applications, including path planning
[5] and navigation
[6].
In parallel with the exploration of biological mechanisms for localization and navigation, engineered solutions have also been designed to achieve such functionality. The most commonly used system is the Global Positioning System (GPS), which was established in the 1970s for outdoor positioning using a constellation of satellites
[7]. Apart from GPS, traditional relative localization typically utilizes visual or inertial information to simultaneously compute the platform’s pose and 3D environmental structure
[8]. Despite these studies, there is still no widely accepted solution for localization in challenging conditions, due to environmental confusers, sensor drift, multipath problems, and high computational costs.
Unlike the GPS embedded in devices, our brain’s system accesses location and navigation information by integrating multiple signals relating to internal self-motion (path integration)
[9] and planning direct trajectories to goals (vector-based navigation)
[2][10]. Recent research
[10][11] has shown that the mammalian brain uses a sophisticated GPS-like localization and tracking system of its own to help recognize locations and guide the animal from one location to the next. One typical method is path integration
[9], a mechanism for calculating location simply by integrating self-motion information, including the direction and speed of movement, a task carried out without reference to external cues such as physical landmarks. Another method represents space as a graph structure in which nodes denote specific places and links denote roads between pairs of nodes
[4]. The resulting graph reflects the topology of the explored environment, upon which localization and navigation can be implemented directly with graph search algorithms. This research aims to combine the characteristics of these two methods.
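The graph-based representation described above can be illustrated with a minimal sketch. It assumes a small hypothetical road map (`ROADS`) stored as adjacency lists and uses breadth-first search as one possible graph search algorithm; the names and the toy map are illustrative and are not taken from the cited works.

```python
from collections import deque

# Hypothetical toy map: nodes are places, edges are road segments.
ROADS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def shortest_route(graph, start, goal):
    """Breadth-first search returning one fewest-hops route, or None."""
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parent links to rebuild the route.
            route = []
            while node is not None:
                route.append(node)
                node = parents[node]
            return route[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


route = shortest_route(ROADS, "A", "E")
print(route)
```

Breadth-first search minimizes the number of road segments rather than metric distance; a weighted search such as Dijkstra's algorithm would be the natural substitute if edge lengths were available.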
With the recent progress in deep learning, especially in graph neural networks (GNNs)
[12][13][14][15], researchers have developed powerful models that yield expressive embeddings of non-Euclidean data and achieve promising performance in a variety of tasks
[6][16][17].
2. Visual Localization
A major category of work in the literature is dedicated to the use of images for localization, referred to as visual localization. These methods can be classified into photogrammetric localization
[18][19][20][21] and retrieval-based localization
[22][23]. The first set of approaches assumes the scene is represented by 3D sparse point clouds, which are commonly generated from structure from motion
[24]. Then, the camera pose for a given input image is directly estimated. The training dataset consists of pairs of images and the corresponding camera poses, where the camera pose is usually represented by a 6-DoF position and orientation. Despite the performance of these methods, the photogrammetric pipeline for generating and storing large 3D maps is not trivial and has a large memory footprint. Another set of methods works by matching a given image to a database of location-tagged images or location-tagged image features. From hand-crafted features such as SIFT
[25], bag-of-visual words
[26], Fisher Vector
[27] and VLAD
[28], to learned features
[29][30], all of these approaches struggle to find a representation that is robust to changes in viewpoint, appearance, and scale, a requirement that is hard to fulfill in practice. Furthermore, creating an up-to-date image/feature database is costly at best, if not impossible. Storing visual descriptors in the database also raises a potential privacy issue.
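As a rough sketch of the retrieval-based idea, the snippet below matches a query descriptor against a small database of location-tagged descriptors by cosine similarity. The three-dimensional descriptors, the coordinates in `DATABASE`, and the function names are hypothetical placeholders; real systems use high-dimensional descriptors such as VLAD or learned features together with approximate nearest-neighbor search.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical database: (location tag, global image descriptor).
DATABASE = [
    ((40.0, -83.0), [0.9, 0.1, 0.0]),
    ((40.1, -83.1), [0.1, 0.8, 0.3]),
    ((40.2, -83.2), [0.0, 0.2, 0.9]),
]


def retrieve_location(query, database):
    """Return the tag of the most similar database entry and its score."""
    best_tag, best_descriptor = max(
        database, key=lambda entry: cosine_similarity(query, entry[1])
    )
    return best_tag, cosine_similarity(query, best_descriptor)


location, score = retrieve_location([0.2, 0.9, 0.2], DATABASE)
print(location, round(score, 3))
```

The sketch also makes the maintenance cost visible: every mapped location needs a stored, up-to-date descriptor for retrieval to succeed.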
3. Probabilistic Localization
A common formulation of the localization problem uses sensory readings to estimate the absolute coordinates of the object on the map with Bayesian filtering
[31][32][33][34][35]. The authors of
[31] presented a Bayesian approach to model the posterior distribution of the position given the prior map, which is considered a classic method commonly adopted in the robotics field. However, this method requires GPS readings and relies on a rigorous mathematical model. In more recent studies
[32][33], the authors proposed a probabilistic self-localization method using OpenStreetMap and visual odometry where the location is determined by matching with road topology. The authors of
[34][35] presented a localization approach based on stochastic trajectory matching using brute-force search. However, all of these methods require the generation and maintenance of posterior distributions, which leads to complicated inference and high computational costs. For interested readers, a more comprehensive reference on probabilistic approaches is given in
[36].
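To make the idea of maintaining a posterior over map locations concrete, the following is a minimal discrete Bayes filter on a toy graph. It is a sketch under stated assumptions and not the method of any cited work: the four-node map, the binary "intersection" observation, the uniform random-walk motion model, and the sensor accuracy `hit` are all hypothetical.

```python
# Hypothetical toy map: adjacency lists and one observable attribute per node.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
IS_INTERSECTION = {0: False, 1: True, 2: False, 3: True}


def predict(belief, neighbors):
    """Motion update: assume the agent moves to a uniformly random neighbor."""
    new_belief = {node: 0.0 for node in belief}
    for node, prob in belief.items():
        share = prob / len(neighbors[node])
        for nxt in neighbors[node]:
            new_belief[nxt] += share
    return new_belief


def update(belief, observed_intersection, hit=0.9):
    """Measurement update: Bayes' rule with an assumed sensor accuracy `hit`."""
    unnormalized = {
        node: prob * (hit if IS_INTERSECTION[node] == observed_intersection else 1.0 - hit)
        for node, prob in belief.items()
    }
    total = sum(unnormalized.values())
    return {node: prob / total for node, prob in unnormalized.items()}


# Start from a uniform prior and process a short observation sequence.
belief = {node: 1.0 / len(NEIGHBORS) for node in NEIGHBORS}
for observation in [True, False, True]:
    belief = update(predict(belief, NEIGHBORS), observation)

print(max(belief, key=belief.get), belief)
```

Even in this toy case the filter touches every node at every step, which hints at why posterior maintenance becomes expensive on city-scale maps.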
4. Topological Localization
A small number of studies closely related to ours use topological maps and deep learning. Traditional approaches utilize topological road structures and try to match features onto the map using Chamfer distance and Hamming distance
[37][38]. Chen et al.
[6] proposed a topological approach to achieve localization and visual navigation using several different deep neural networks. However, the method targets visual navigation problems and has only been investigated in a small indoor environment. Wei et al.
[39] proposed a sequence-to-sequence labeling method for trajectory matching using a neural machine translation network. This approach was shown to work well only in synthetic scenarios in which the input trajectory was generated with a known sequence of nodes from the map. In
[40], the author presented a variable-length sequence classification method for motion trajectory localization using a recurrent neural network, which largely inspired us to employ motion-based data to achieve localization. Zha et al.
[41] introduced a topological map-based trajectory learning method that uses hypothesis generation and pruning strategies to achieve consistent geolocalization of moving platforms, with the problem formulated as conditional sequence prediction. In contrast, this research focuses on the node localization problem on a topological map based on motion trajectories and develops a subgraph embedding classification model using a graph neural network, which generalizes the sequence representation to a graph representation and better fits the graph-based map structure.
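The two distance measures mentioned for traditional map matching can be sketched as follows. The snippet assumes the common symmetric, mean nearest-neighbor form of the Chamfer distance over 2D points and the standard Hamming distance over equal-length codes; the cited works may use different variants, and the function names are illustrative.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both directions."""

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


# Hypothetical example: a traveled path versus a slightly offset road segment.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
road = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
print(chamfer_distance(trajectory, road))
print(hamming_distance("1011", "1001"))
```

A lower score under either measure indicates a better match between the observed feature and the candidate map element.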
5. Vector-Based Navigation
In neuroscience, much of the literature focuses on studying the mechanisms of animals’ ability to learn maps, as well as self-localization and navigation
[1][10][42]. These studies have shown that one typical method used by animals, such as desert ants, is path integration, a mechanism in which neurons calculate location by integrating self-motion. Self-motion includes the direction and speed of movement, which inspired us to utilize turning and distance information in this paper. In
[4], the authors elaborated on a topological strategy for navigation using place cells
[42][43] and metric vector navigation using grid cells
[11], from a biological perspective.
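Path integration from turning and distance information can be written as a short dead-reckoning routine. This is a minimal sketch, assuming a planar world, turn angles in degrees (positive counterclockwise), and noise-free measurements; the function name `integrate_path` and the `(turn, distance)` step format are illustrative choices.

```python
import math


def integrate_path(steps, start=(0.0, 0.0), heading_deg=0.0):
    """Accumulate (turn_deg, distance) steps into a final position and heading.

    Each step first rotates the heading by turn_deg, then moves forward
    by distance along the new heading.
    """
    x, y = start
    heading = math.radians(heading_deg)
    for turn_deg, distance in steps:
        heading += math.radians(turn_deg)
        x += distance * math.cos(heading)
        y += distance * math.sin(heading)
    return x, y, math.degrees(heading) % 360.0


# Go 10 units east, turn left and go 5 units, turn left again and go 10 units.
x, y, heading = integrate_path([(0.0, 10.0), (90.0, 5.0), (90.0, 10.0)])
print(round(x, 6), round(y, 6), round(heading, 6))
```

Because every step adds to the previous estimate, any measurement error also accumulates, which is why path integration alone drifts and benefits from being combined with a map.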
6. GNN on Spatial Data
The idea of a GNN is to generate representations of nodes, edges, or whole graphs that depend on the structure of the graph, as well as on any feature information associated with the graph. The basic GNN model can be motivated in a variety of ways, either from the perspective of a spatial domain
[14][44] or a spectral domain
[45][46]. Further comprehensive reviews can be found in
[12][13][47]. In recent years, GNNs have been applied to geospatial data owing to their powerful ability to model irregular data structures. For example, the authors of
[48] combined a convolutional neural network and a GNN to infer road attributes, which overcomes the limitation in capturing the long-term spatial propagation of the features; the authors of
[49] presented a graph neural network estimator for estimated time of arrival (ETA), which accounts for complex spatiotemporal interactions and has been employed in production at Google Maps; and the authors of
[50] improved the generalization ability of GNN through a sampling technique and demonstrated its performance on real-world street networks. Ref.
[51] proposed a GNN architecture to extract road graphs from satellite images.
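The spatial-domain view of a GNN can be illustrated with one message-passing step written in plain Python. This is a simplified sketch: it assumes mean aggregation over each node and its neighbors, a single weight matrix, and a ReLU nonlinearity, followed by mean pooling as a graph-level readout. The function names, the toy graph, and the identity weights are hypothetical, and practical models would use a library such as PyTorch Geometric with learned parameters.

```python
def mean_aggregate_layer(features, neighbors, weight):
    """One message-passing step: average self and neighbor features,
    apply a linear map (rows of `weight` are output units), then ReLU."""
    out = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors[node]]
        mean = [sum(column) / len(group) for column in zip(*group)]
        out[node] = [
            max(0.0, sum(m * w for m, w in zip(mean, row))) for row in weight
        ]
    return out


def mean_readout(features):
    """Graph-level embedding: average of all node embeddings."""
    vectors = list(features.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical path graph 0 - 1 - 2 with two-dimensional node features.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}
FEATURES = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights

hidden = mean_aggregate_layer(FEATURES, NEIGHBORS, IDENTITY)
graph_embedding = mean_readout(hidden)
print(hidden, graph_embedding)
```

Stacking several such layers lets each node embedding depend on a progressively larger neighborhood, which is how the representation comes to reflect graph structure as well as node features.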