1. Introduction
The rapid development of augmented reality (AR) technology is transforming digital interaction by seamlessly integrating virtual information with the real world. In outdoor settings, AR has been widely applied in the field of geospatial information [1]. It connects map information and virtual graphical objects [2] with real environments, providing more accurate, efficient, and intuitive interactive experiences. The application of AR in geospatial information holds significant value. First, AR geospatial applications make unmanned aerial vehicle (UAV) photography [3] efficient and reliable [4], enabling accurate positioning and navigation. Second, AR geospatial applications offer intuitive navigation experiences: whether locating destinations in urban areas [5], navigating for tourism [6] or emergencies [7], or investigating complex outdoor environments [8], AR maps enhance navigation with higher accuracy, real-time performance, and convenience. Furthermore, AR technology demonstrates immense potential in underground engineering construction. In underground pipeline laying [9] and mining surveys [10], AR systems provide a visual perception of the underground space, reducing errors and risks and enhancing work efficiency.
2. The Methods Applied in Geo-Registration
In augmented reality applications, geo-registration refers to the process of aligning and matching virtual objects with the geographic location and orientation of the real-world scene. Currently, there are three common methods for pose estimation: sensor-based approaches, vision-based approaches, and hybrid approaches. These methods have been extensively applied in numerous projects and research endeavors. However, pose estimation in outdoor scenarios still faces numerous challenges. Factors such as signal interference, environmental variations, lighting changes, feature scarcity, occlusions, and dynamic objects severely impact the accurate determination of the geographic north orientation, alignment with the terrain surface, and precision in coordinate system transformations. These factors can lead to cumulative errors in pose estimation.
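To make the coordinate-system transformation step concrete, the sketch below shows one common way to express a geographically referenced virtual object in a local East-North-Up (ENU) frame centered on the user, going through WGS84 geodetic and Earth-Centered Earth-Fixed (ECEF) coordinates. It is a minimal illustration in Python with NumPy; the coordinates and function names are hypothetical and not taken from any of the cited systems.

```python
# Minimal sketch: convert WGS84 geodetic coordinates (lat, lon, height) to a
# local East-North-Up (ENU) frame anchored at the user's position, so that a
# virtual object with known geographic coordinates can be placed in the scene.
import numpy as np

WGS84_A = 6378137.0            # semi-major axis [m]
WGS84_E2 = 6.69437999014e-3    # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, h):
    """Convert geodetic coordinates to ECEF."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    n = WGS84_A / np.sqrt(1.0 - WGS84_E2 * np.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * np.cos(lat) * np.cos(lon)
    y = (n + h) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - WGS84_E2) + h) * np.sin(lat)
    return np.array([x, y, z])

def ecef_to_enu(target_ecef, ref_lat_deg, ref_lon_deg, ref_h):
    """Express an ECEF point in the ENU frame of a reference (user) position."""
    lat, lon = np.radians(ref_lat_deg), np.radians(ref_lon_deg)
    ref_ecef = geodetic_to_ecef(ref_lat_deg, ref_lon_deg, ref_h)
    # Rotation from ECEF to the local tangent (ENU) frame.
    r = np.array([
        [-np.sin(lon),                np.cos(lon),               0.0],
        [-np.sin(lat) * np.cos(lon), -np.sin(lat) * np.sin(lon), np.cos(lat)],
        [ np.cos(lat) * np.cos(lon),  np.cos(lat) * np.sin(lon), np.sin(lat)],
    ])
    return r @ (target_ecef - ref_ecef)

# Example: a marker roughly 100 m north of a hypothetical user position.
user = (40.0, 116.3, 50.0)
marker = geodetic_to_ecef(40.0009, 116.3, 50.0)
print(ecef_to_enu(marker, *user))   # approximately [0, 100, 0] (east, north, up)
```

Any error in the device's estimated geographic pose propagates directly through this transform, which is why the pose estimation methods below matter for registration quality.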
2.1. Sensor-Based Methods
Localization in outdoor industrial environments typically relies on sensors such as GPS and magnetometers to obtain spatial coordinates and orientation [11]. Sensor-based approaches use non-visual sensors such as GPS, inertial measurement units (IMUs), and magnetometers to acquire a user's position and direction [12]; virtual information is then generated from geospatial databases and aligned with the real environment. These methods primarily rely on built-in, non-visual sensors (e.g., accelerometers, gyroscopes, magnetometers) to obtain the azimuth and tilt angles of the device [13][14]. Fused with GPS localization data, this information provides a combined position and orientation estimate. These sensors measure linear acceleration, angular velocity, and magnetic field strength; with filtering and attitude algorithms, the data can be used to infer the pose of the device. Sensor-based approaches are low-cost, low-complexity, and reasonably continuous, making them suitable for simple AR application scenarios. Behzadan and Kamat [2] demonstrated the effective use of GPS for real-time registration of virtual graphics in an outdoor AR setting. Accelerometers and gyroscopes can provide accurate and robust attitude initialization, as demonstrated by Tedaldi et al. [15]. However, these sensors also suffer from drift and noise over time [16]. Moreover, the heading angle measured by a magnetometer, which is often integrated with accelerometers and gyroscopes, is highly susceptible to ambient magnetic field noise [17].
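As an illustration of the filtering and attitude algorithms mentioned above, the following sketch implements a simple complementary filter that fuses gyroscope integration (accurate over short intervals but drifting) with accelerometer tilt and a tilt-compensated magnetometer heading (noisy but drift-free). The gain value and sign conventions are illustrative assumptions, not values from the cited works.

```python
# Complementary filter sketch: gyro rates handle fast motion, while the
# accelerometer (gravity) and magnetometer (magnetic north) slowly pull the
# estimate back toward drift-free references.
import numpy as np

ALPHA = 0.98  # trust in the integrated gyro; (1 - ALPHA) corrects from accel/mag

def accel_to_tilt(acc):
    """Roll and pitch from the gravity direction measured by the accelerometer."""
    ax, ay, az = acc
    roll = np.arctan2(ay, az)
    pitch = np.arctan2(-ax, np.sqrt(ay**2 + az**2))
    return roll, pitch

def mag_to_heading(mag, roll, pitch):
    """Tilt-compensated magnetic heading (yaw) from magnetometer readings."""
    mx, my, mz = mag
    # Rotate the magnetic vector into the horizontal plane before taking atan2.
    xh = mx * np.cos(pitch) + mz * np.sin(pitch)
    yh = (mx * np.sin(roll) * np.sin(pitch) + my * np.cos(roll)
          - mz * np.sin(roll) * np.cos(pitch))
    return np.arctan2(-yh, xh)

def complementary_step(angles, gyro, acc, mag, dt):
    """One filter update: integrate gyro, then blend in accel/mag references."""
    roll, pitch, yaw = angles + gyro * dt          # short-term gyro integration
    acc_roll, acc_pitch = accel_to_tilt(acc)       # long-term gravity reference
    roll = ALPHA * roll + (1 - ALPHA) * acc_roll
    pitch = ALPHA * pitch + (1 - ALPHA) * acc_pitch
    yaw = ALPHA * yaw + (1 - ALPHA) * mag_to_heading(mag, roll, pitch)
    return np.array([roll, pitch, yaw])
```

The yaw correction term is exactly where the magnetic-noise problem cited above enters: a disturbed magnetometer steadily drags the heading estimate away from true geographic north.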
An IMU offers a fast response and update rate for capturing motion changes, but it is vulnerable to noise, drift, and measurement range limitations, which affect the accuracy and stability of AR applications [18]. This causes deviations or oscillations in the orientation of virtual objects, degrading the accuracy and stability of alignment with the Earth's surface. Furthermore, cumulative errors may arise in spatial rotations, and the conversion between coordinate systems cannot guarantee the spatial rotational invariance of objects relative to the geographic reference frame. RTK-GPS offers centimeter-level positioning accuracy and works well during high-speed motion, but it may encounter failures and initialization challenges in certain outdoor AR map scenarios [19]. In general, the distance between the mobile station and the reference station should not exceed 10-15 km, beyond which positioning accuracy degrades or positioning fails outright [20]. Outdoor localization with consumer-grade sensors is therefore a challenging problem [21] that requires a method combining the strengths, and compensating for the weaknesses, of different sensors.
Combining RTK-GPS measurements of position, velocity, and heading angle with IMU data achieves improved outcomes at a lower cost. IMUs have a high output frequency and can measure the six-degrees-of-freedom (6DOF) pose, making them ideal for short-term use in environments with limited visual texture and rapid motion [22]. These characteristics make IMUs a valuable complement to RTK-GPS signals, and their integration with RTK-GPS has been widely adopted and extensively investigated [23][24]. However, the zero offset of the accelerometer and gyroscope leads to significant pose drift over time [25]. Moreover, when a low-accuracy (consumer-grade) RTK-GPS is paired with a highly accurate IMU, north-positioning errors become significant during substantial changes in altitude [26]. Furthermore, pose estimation that combines RTK-GPS and an IMU is limited to open areas because it relies on satellite availability [27].
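The sketch below illustrates the loosely coupled RTK-GPS/IMU integration discussed above, reduced to a single axis for brevity: high-rate IMU accelerations drive a Kalman-filter prediction, and each slower RTK-GPS position fix corrects the accumulated drift. The class name and noise parameters are illustrative assumptions, not values from the cited systems.

```python
# Loosely coupled GPS/IMU fusion sketch (one axis): the IMU provides the
# process model between GPS fixes; GPS observations bound the IMU drift.
import numpy as np

class GpsImuFilter1D:
    def __init__(self, q_accel=0.5, r_gps=0.02):
        self.x = np.zeros(2)          # state: [position, velocity]
        self.P = np.eye(2)            # state covariance
        self.q_accel = q_accel        # IMU acceleration noise (process)
        self.r_gps = r_gps            # RTK-GPS position noise (~cm level)

    def predict(self, accel, dt):
        """High-rate IMU step: integrate acceleration into position/velocity."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        B = np.array([0.5 * dt**2, dt])
        self.x = F @ self.x + B * accel
        Q = np.outer(B, B) * self.q_accel**2
        self.P = F @ self.P @ F.T + Q

    def update_gps(self, pos):
        """Low-rate RTK-GPS step: correct the accumulated IMU drift."""
        H = np.array([[1.0, 0.0]])
        S = H @ self.P @ H.T + self.r_gps**2
        K = (self.P @ H.T) / S        # Kalman gain
        self.x = self.x + (K * (pos - H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ H) @ self.P
```

When the GPS update stops arriving, as in the occluded scenarios noted above, only the prediction step runs and the accelerometer bias integrates into unbounded position error, which is exactly the zero-offset drift described in [25].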
Because such methods rely on readings like those of the magnetometer, deviations between sensors, environmental interference, and drift can destabilize their azimuth estimates. Sensor-based approaches are also prone to error accumulation: estimating position and pose from accelerometer and gyroscope data requires integration and filtering, which exacerbates accumulated error. As a result, sensor-based approaches lack sufficient accuracy and stability in both position alignment and map motion tracking.
2.2. Vision-Based Methods
An alternative is to use vision-based methods. These can be considered a specific instance of the broader sensor-based category, with the camera serving as the primary sensor. Unlike conventional sensor methods, vision-based approaches use cameras to capture real-world images, which are then processed with feature extraction, matching, and tracking to detect and match feature points in the environment. By analyzing the positional changes of these feature points, it is possible to estimate the device's pose, the user's location and orientation, and the geographic positioning within the real environment, enabling alignment between virtual information and the real world [28]. With the advancement of spatial data acquisition technologies, recent studies have focused on geographic localization and registration using multi-source data, including satellite imagery and video [29]. However, these methods are both complex and expensive. On the one hand, the heading-angle accuracy of certain multi-source localization methods is inadequate for outdoor, wide-range AR applications [30]. On the other hand, wide deployment is difficult because they require a substantial number of pre-matched georeferenced images or a large database of point clouds captured from the physical world [21]. A more efficient approach is to employ automated computer vision techniques to generate three-dimensional point clouds for positioning [31], but this still requires pre-generating the point clouds, which depend primarily on visual features. Camera-based pose estimation methods capture images of the environment and use computer vision algorithms to infer the position and orientation of the device. They typically involve feature point detection, camera calibration, and visual geometric algorithms, which demand significant computational resources and algorithmic complexity. Even with map motion tracking, they respond slowly because of the heavy computational load of feature recognition, localization, and mapping.
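The following sketch illustrates the feature extraction, matching, and relative-pose steps described above, using OpenCV's SIFT detector and essential-matrix decomposition. The camera intrinsics and image paths are placeholders; a real system would also need scale recovery and georeferencing, which this fragment omits.

```python
# Vision-based relative pose sketch: detect and match feature points between
# two frames, then recover the camera's rotation and translation direction.
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],   # hypothetical camera intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

img1 = cv2.imread("frame_prev.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_curr.png", cv2.IMREAD_GRAYSCALE)

# 1. Feature extraction (SIFT, as discussed in the text).
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Feature matching with Lowe's ratio test to reject ambiguous matches.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 3. Relative pose: essential matrix with RANSAC, then decompose to R, t.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("relative rotation:\n", R, "\ntranslation direction:", t.ravel())
```

The computational burden noted above is visible even in this small fragment: SIFT detection and RANSAC-based matching dominate per-frame runtime, and the recovered translation is only a direction, so absolute geographic position must come from other data.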
Visual methods are better suited to pose estimation in local scenes because the camera is primarily a local sensor [32]. In outdoor global geo-registration and pose estimation, the camera is susceptible to environmental interference, resulting in degraded image quality or loss of feature points. Blurred visual features become increasingly problematic as motion velocity increases, so visual data may not provide stable and reliable orientation estimates. The SIFT algorithm, for instance, is considered a relatively reliable visual feature algorithm [33], but it requires computing keypoint orientations, which are influenced by noise and lighting conditions; it is also limited to small-angle rotations and may fail under large-angle rotations. Neural-network-based methods extract image features more effectively than traditional visual feature algorithms, but they require large amounts of training data, and for rare or novel objects there may not be enough data to achieve satisfactory generalization [30].
In summary, visual methods are limited in global position alignment, and surface alignment depends on abundant features and motion constraints from mobile augmented reality (MAR) devices.
2.3. Hybrid Methods
Neither a single sensor nor vision alone is sufficient for robust, accurate, large-scale 6DOF localization in complex real-world environments. Hybrid pose estimation methods integrate non-visual sensor data with visual information to obtain more accurate and stable pose estimates. A robust and accurate outdoor geo-registration system requires fusing multiple complementary sensors. By leveraging the complementary advantages of sensors and visual data, the limitations of individual methods can be overcome, particularly in demanding outdoor applications such as urban and highway environments [34].
The visual-inertial fusion method has become popular for positioning and navigation, especially in GPS-denied environments [35], owing to the complementary nature of vision and IMU sensors [36]. By integrating the two [37], it overcomes the limitations of using either alone [38][39]. Vision sensors are affected by lighting, occlusion, and feature-matching failures, and visual localization methods depend on large amounts of georeferenced or pre-registered data; the IMU, on the other hand, suffers from accumulated errors and zero drift. By comparing and calibrating the IMU's attitude estimate against visual data, more accurate pose estimation can be achieved. Common fusion methods include the Extended Kalman Filter (EKF) and tightly coupled filtering. Fusing the camera with the IMU addresses both the low output frequency and the limited accuracy of visual pose estimation, enhancing the robustness of the positioning results.
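To illustrate the EKF-based fusion just described, the sketch below propagates a three-state Euler-angle attitude with gyroscope rates and corrects it with camera-derived attitude measurements. Under the small-angle, identity-measurement model used here the filter is effectively linear; a full visual-inertial EKF would also track position, velocity, and sensor biases with nonlinear models. All noise values are illustrative assumptions.

```python
# EKF-style visual-inertial attitude fusion sketch: the gyroscope propagates
# the state at high rate; visual attitude estimates bound the drift.
import numpy as np

class VisualInertialEKF:
    def __init__(self, q_gyro=0.01, r_vision=0.05):
        self.x = np.zeros(3)               # attitude [roll, pitch, yaw]
        self.P = np.eye(3) * 0.1           # state covariance
        self.Q = np.eye(3) * q_gyro**2     # gyro process noise
        self.R = np.eye(3) * r_vision**2   # visual measurement noise

    def predict(self, gyro, dt):
        """High-rate step: integrate gyroscope angular rates."""
        self.x = self.x + gyro * dt        # F = I under the small-angle model
        self.P = self.P + self.Q * dt

    def update_vision(self, visual_attitude):
        """Low-rate step: correct drift with a camera-derived attitude."""
        S = self.P + self.R                # H = I: vision observes attitude directly
        K = self.P @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ (visual_attitude - self.x)
        self.P = (np.eye(3) - K) @ self.P
```

The complementarity described in the text is encoded in the two steps: `predict` supplies the high-frequency response the camera lacks, while `update_vision` supplies the drift-free reference the IMU lacks.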
However, current multi-sensor fusion methods have limitations. Their strength is adaptability and stability, but their implementation is more complex, and they fail to deliver high-precision initial and motion pose estimation at low cost simultaneously: there is an inherent trade-off between low-cost solutions and high-precision requirements. Ren et al. [3] achieved geo-registration on low-cost UAVs by fusing RTK-GPS with the IMU, but the limited accuracy of the UAVs' IMU led to imprecise attitude fusion, and their method relies heavily on the stability of RTK-GPS data. Burkard and Fuchs-Kittowski [21] estimated the gravity vector and geographic north in visual-inertial fusion registration through user gesture calibration, but the registration accuracy depends on manually input information. Oskiper et al. [40] used road-segmentation direction information and annotated remote sensing imagery in their visual-inertial method to achieve accurate global initial registration, but pose-matching performance may degrade under continuous outdoor motion and spatial rotation. Hansen et al. [41] proposed a precise positioning-and-orientation method using LiDAR, a high-accuracy smart IMU, and an altitude-measuring pressure sensor, but the high cost of these devices prevents widespread adoption. Overall, multi-sensor fusion remains hindered primarily by imprecise surface alignment.
This entry is adapted from the peer-reviewed paper 10.3390/rs15153709