Accurate calibration of camera–LiDAR systems is crucial for effective data fusion, particularly in data collection vehicles. Data-driven calibration methods have gained prominence over target-based methods because they adapt better to diverse environments. However, current data-driven methods are sensitive to poor initial parameters, which can significantly degrade the accuracy and efficiency of the calibration process. In general, calibration approaches can be classified into four categories: target-based approaches, feature matching-based approaches, statistics-based approaches, and deep learning-based approaches.
1. Introduction
The fusion of LiDAR (Light Detection and Ranging) data and camera image data is an essential step in many fields such as autonomous driving, 3D reconstruction, urban planning, and environmental monitoring
[1][2][3]. The significance of data fusion arises from the dissimilarities between LiDAR and camera image data, which vary in terms of spatial resolution, velocity and distance estimation capacities, resistance to adverse weather conditions, and sensor sizes, among other factors
[4]. LiDAR sensors are capable of accurately capturing 3D spatial information while lacking color and texture data that images can provide
[5][6]. By fusing these two types of data, it is possible to create more accurate and detailed maps of the environment, which is crucial for applications like autonomous driving
[7]. Moreover, combining LiDAR and image data can improve the detection and recognition of objects in a scene. LiDAR can detect the presence and position of objects, while images provide details about their appearance and texture. Fusing these data sources can lead to better object detection and tracking, which is vital for applications such as robotics and autonomous vehicles. Lastly, LiDAR measures distances with high accuracy, but it cannot measure the depth of objects that are hidden from view. Combining LiDAR and image data can help overcome this limitation by providing a more complete picture of the scene’s depth and structure. Overall, the fusion of LiDAR and image data provides a more comprehensive understanding of the environment. It enables improved accuracy
[8], object detection
[9][10], depth perception
[11][12][13], and 3D modeling
[14][15], making it an essential technique for many fields.
When it comes to the fusion between LiDAR and image data, the precise calibration between the two sensors plays a key role
[16][17][18]. LiDAR sensors capture 3D point clouds of the environment, and camera sensors capture 2D images. To create a comprehensive 3D model, it is necessary to transform the camera image data into 3D coordinates that match the LiDAR point cloud data. Calibration ensures that the transformation is accurate, which is essential for generating accurate point clouds. Accurate calibration of LiDAR and camera sensors also enables better object detection and tracking
[9][10]. By precisely aligning the two sensor data streams, it is possible to accurately locate objects in 3D space. Moreover, calibration reduces measurement errors and noise, which can improve the overall accuracy of the data fusion process. This is particularly important for LiDAR data, which can be affected by noise caused by reflections and other factors.
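The role of the calibration parameters can be illustrated with a minimal sketch. It maps LiDAR points into pixel coordinates, which is the direction most fusion pipelines use in practice. The function name, the distortion-free pinhole model, and all numeric values are assumptions made for illustration and are not taken from the cited works.

```python
import numpy as np

def project_lidar_to_image(points_lidar, R, t, K):
    """Project Nx3 LiDAR points into pixel coordinates.

    R (3x3) and t (3,) are the assumed LiDAR-to-camera extrinsics,
    K (3x3) is the assumed pinhole intrinsic matrix (no lens distortion).
    Returns (pixels, depths) for the points in front of the camera.
    """
    points_cam = points_lidar @ R.T + t      # LiDAR frame -> camera frame
    in_front = points_cam[:, 2] > 0          # keep points with positive depth
    points_cam = points_cam[in_front]
    uvw = points_cam @ K.T                   # apply the intrinsics
    pixels = uvw[:, :2] / uvw[:, 2:3]        # perspective division
    return pixels, points_cam[:, 2]

# Hypothetical values for illustration only.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
points = np.array([[0.0, 0.0, 10.0], [1.0, -1.0, 5.0], [0.0, 0.0, -2.0]])
pixels, depths = project_lidar_to_image(points, R, t, K)
```

An error in R or t shifts every projected pixel, which is why even small changes in the relative pose degrade the fused data.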
Precise calibration is essential for achieving accurate data fusion between LiDAR and camera sensors. However, calibration is not a one-time process, as the sensors may shift over time. This is particularly common in data collection vehicles, where the sensors are mounted on a moving platform and their relative pose is therefore more likely to change. Sending the vehicle back to the calibration field for a thorough recalibration is time-consuming and resource-intensive. Accordingly, data-driven calibration is a valuable means of accounting for such fluctuations
[19], where the calibration parameters are updated using the data acquired by LiDAR and camera sensors. By updating the calibration parameters as necessary, data-driven calibration can adjust for changes in the environment and improve the accuracy of the data fusion process without the need for extensive calibration in a dedicated calibration field. This agile calibration approach saves time and resources while maintaining the integrity of the calibration process.
Currently, a considerable body of research is dedicated to the calibration of LiDAR and camera systems. Some approaches, as described in
[20][21][22], rely on predefined targets that are visible in both the camera and LiDAR data to estimate calibration parameters. However, to eliminate the need for pre-deployed targets, several calibration methods leverage feature extraction and matching techniques. These methods utilize various types of features, including point features
[23][24][25], line features
[26][27][28], surface features
[29], semantic features
[30], and 3D structure features
[31]. Instead of establishing explicit feature correspondence between the camera image and the point cloud, certain methods
[19][32][33][34][35][36][37][38] employ general appearance similarity as a metric to evaluate calibration quality, formulating the calibration problem as a nonlinear optimization task. Additionally, a few alternative approaches
[39][40][41][42][43][44][45][46] based on deep learning have emerged as promising paradigms for addressing this calibration challenge.
2. LiDAR Calibration in Data Collection Vehicles
Research on camera–LiDAR calibration has been ongoing since LiDAR was first used in vehicles. Calibration techniques can be broadly classified as either offline or online. Offline methods require a predefined target and are typically carried out before deployment rather than during operation. Online methods, on the other hand, rely only on the LiDAR and camera data and are more suitable for on-road applications. Online methods can be further classified into feature matching-based and statistics-based approaches. Statistics-based methods, also known as direct methods, use all the available information without finding corresponding points. Feature matching-based methods, also known as indirect methods, find corresponding features, which may be points, lines, or surfaces, and use those correspondences. In recent years, several calibration approaches based on deep learning have also emerged. In general, calibration approaches can be classified into four categories: target-based approaches, feature matching-based approaches, statistics-based approaches, and deep learning-based approaches, as shown in Table 1.
Table 1. Classification of camera–LiDAR calibration approaches presented in the literature into four main categories: target-based, feature matching-based, statistics-based, and deep learning-based approaches.
Category | Subcategory | References
Target-Based | - | [20][21][22]
Feature Matching-Based | Point Features | [23][24][25]
Feature Matching-Based | Line Features | [26][27][28]
Feature Matching-Based | Surface Features | [29]
Feature Matching-Based | Semantic Features | [30]
Feature Matching-Based | 3D Structure Features | [31]
Statistics-Based | Reflectivity – Grayscale intensity | [32][47]
Statistics-Based | Surface normal – Grayscale intensity | [37][38]
Statistics-Based | Gradient magnitude and orientation – Gradient magnitude and orientation | [34]
Statistics-Based | 3D semantic label – 2D semantic label | [48]
Deep Learning-Based | Regression | [39][40][41][42][43]
Deep Learning-Based | Calibration Flow | [44][45]
Deep Learning-Based | Keypoints | [46]
2.1. Target-Based Approaches
Target-based approaches for camera–LiDAR calibration rely on a predefined target that is visible to both sensors. Typically, the target has a known geometric structure and can be identified in both sensors' data, albeit in different forms.
The offline calibration method, using a calibration board as described in
[20], can accurately calculate the relative pose between a laser rangefinder and a camera by placing the calibration board indoors. However, this method cannot be performed in real time, and because the relative pose between the laser rangefinder and the camera can change during vehicle operation, a one-off offline result becomes outdated. Similarly, the method of using pre-positioned ground control points for the registration of unmanned aerial vehicle images and onboard laser point clouds, as described in
[21], also faces this problem.
2.2. Feature Matching-Based Approaches
Feature matching-based approaches typically begin by converting the LiDAR and camera data into a common coordinate system using the initial calibration parameters. Next, salient features, such as corners or edges, are extracted from the LiDAR and camera data using feature detection algorithms. These features are then matched between the LiDAR and camera data based on their descriptors, which are high-dimensional representations of the features. Once the matching features are identified, the calibration parameters can be estimated using optimization methods, such as the Perspective-n-Point (PnP) algorithm or bundle adjustment. These methods compute the transformation between the LiDAR and camera coordinate systems that minimizes the reprojection error between the matched features.
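The reprojection-error minimization can be written compactly as follows. The notation is an assumption introduced here for illustration: P_i is a matched 3D LiDAR feature, p_i its 2D image counterpart, K the camera intrinsic matrix, (R, t) the LiDAR-to-camera rotation and translation, and π the perspective division.

```latex
(\hat{R}, \hat{t}) = \arg\min_{R,\,t} \sum_{i=1}^{N}
  \left\| p_i - \pi\!\left( K \left( R P_i + t \right) \right) \right\|^2,
\qquad
\pi\!\left( [x,\; y,\; z]^\top \right) = [x/z,\; y/z]^\top
```

Individual methods differ mainly in how the pairs (P_i, p_i) are obtained and in whether points, lines, or surfaces take the place of the point residual.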
A feature-based method that uses Harris corner points of road markings for matching is described in
[23]. This method projects the point cloud onto a plane to form an intensity image, which is then matched with the image data. Similarly,
[24] extracts Harris corner points from images and performs an exhaustive search for corresponding points in the LiDAR data, with the use of the Fourier transform for computational acceleration. In
[25], the authors utilize SIFT
[49] to extract intensity features from point cloud images for point cloud registration.
In
[26], skyline features are extracted from both the point cloud projection and the image, and an ICP
[50] algorithm considering the point normal vectors is used to find the corresponding points on the skyline. Finally, the camera pose is calculated based on the corresponding points. Similarly,
[27] uses a brute-force search to iteratively solve the registration parameters, and the search range is reduced by half after each iteration. Furthermore,
[28] is also based on line matching. Canny edge lines are extracted from both the image and the point cloud projection, and the camera pose is calculated based on the correspondence relationship between the lines using the generalized collinearity equation.
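The range-halving brute-force search mentioned above can be sketched as follows. This is a generic coarse-to-fine search under stated assumptions, not the implementation of [27]: the function name, the three-samples-per-parameter grid, and the toy score are all hypothetical.

```python
import itertools
import numpy as np

def halving_search(score, initial, search_range, iterations=20):
    """Maximize score(params) with a brute-force search whose window halves each round.

    score: callable taking a 1-D parameter array and returning a float (higher is better).
    initial: starting parameter vector.
    search_range: initial half-width of the search window for each parameter.
    """
    best = np.asarray(initial, dtype=float)
    half_width = np.asarray(search_range, dtype=float)
    best_score = score(best)
    # Every combination of {-1, 0, +1} steps across the parameters.
    offsets = list(itertools.product((-1.0, 0.0, 1.0), repeat=best.size))
    for _ in range(iterations):
        center = best.copy()
        for offset in offsets:
            candidate = center + np.array(offset) * half_width
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        half_width = half_width / 2.0  # reduce the search range by half
    return best, best_score

# Toy score with a known optimum, standing in for a real alignment measure.
target = np.array([0.3, -0.2])
toy_score = lambda p: -np.sum((p - target) ** 2)
estimate, value = halving_search(toy_score, initial=[0.0, 0.0], search_range=[1.0, 1.0])
```

With six pose parameters this grid evaluates 3^6 = 729 candidates per round, so the cost is dominated by how expensive each score evaluation is.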
The method proposed by
[29] is based on surface matching. The method involves extracting features from both the point cloud and the digital image, then using a feature descriptor to match corresponding features. The matching is performed on planar surfaces, and the camera pose is estimated using an iterative closest point algorithm. This method can achieve high accuracy but it relies on the availability of planar surfaces in the scene.
Ref.
[30] proposed a camera–LiDAR calibration method based on semantic segmentation of images. Specifically, they extracted feature objects through the semantic segmentation of images and constructed a cost function based on how well the LiDAR points projected into the feature object region match it. Because the method uses semantic information, which is a higher-level representation, it is more robust to scene noise than edge-based methods. However, it imposes specific scene requirements, such as recognizable objects with certain shapes like cars, which limits its applicability in mapping applications.
In
[31], sparse point clouds are constructed through structure from motion (SFM)
[51] from images, and rigid ICP is used to align the sparse point clouds with the LiDAR point clouds. However, this method is essentially an offline method, since it first uses continuous image frames to construct sparse point clouds and then performs ICP alignment and a joint bundle adjustment (BA) with the LiDAR point clouds.
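The rigid alignment at the core of each ICP iteration can be sketched with the standard SVD-based (Kabsch) solution. This is a simplified illustration rather than the pipeline of [31]: it assumes the point correspondences are already known and that the scale of the image-derived point cloud has been resolved, whereas full ICP alternates nearest-neighbour matching with this step. All names and values are hypothetical.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares R, t such that R @ source_i + t approximates target_i.

    source, target: Nx3 arrays of already matched points.
    """
    src_mean, tgt_mean = source.mean(axis=0), target.mean(axis=0)
    H = (source - src_mean).T @ (target - tgt_mean)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_mean - R @ src_mean
    return R, t

# Synthetic check: recover a known 30-degree rotation about z plus a translation.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
rng = np.random.default_rng(1)
sfm_points = rng.random((50, 3))                      # stand-in for the sparse cloud
lidar_points = sfm_points @ R_true.T + t_true         # stand-in for the LiDAR cloud
R_est, t_est = rigid_align(sfm_points, lidar_points)
```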
2.3. Statistics-Based Approaches
Statistics-based approaches typically project the LiDAR point cloud onto the camera image plane using the initial calibration parameters. This creates a 2D projection image that can be compared to the actual camera image. Before the comparison, the two images are filtered separately, for example with edge detection, noise reduction, or other image processing techniques. After filtering, the two images are overlaid, and a statistical measure, such as a correlation coefficient or mutual information, is calculated to quantify their similarity. Non-linear optimization techniques are then used to refine the calibration parameters so that the similarity between the projection image and the camera image is maximized. This optimization is usually iterative, with the calibration parameters updated after each iteration until convergence is reached.
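As a concrete example of such a similarity measure, mutual information can be estimated from the joint histogram of the two images. The sketch below is a generic histogram-based estimator, not the formulation of any specific cited method; the function name, bin count, and synthetic images are assumptions.

```python
import numpy as np

def mutual_information(image_a, image_b, bins=32):
    """Estimate mutual information (in nats) between two equally sized images."""
    joint, _, _ = np.histogram2d(image_a.ravel(), image_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                 # joint probability
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of image_a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of image_b, shape (1, bins)
    nonzero = p_ab > 0                         # skip empty bins to avoid log(0)
    ratio = p_ab[nonzero] / (p_a @ p_b)[nonzero]
    return float(np.sum(p_ab[nonzero] * np.log(ratio)))

# Synthetic data: one image strongly related to the camera image, one independent.
rng = np.random.default_rng(0)
camera = rng.random((64, 64))
aligned = camera + 0.05 * rng.standard_normal((64, 64))
unrelated = rng.random((64, 64))
mi_aligned = mutual_information(camera, aligned)
mi_unrelated = mutual_information(camera, unrelated)
```

In a calibration setting the LiDAR projection is sparse, so the measure would normally be evaluated only at pixels that receive a projected point.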
Ref.
[32] utilizes the mutual information between image pixel values and laser reflectance intensity, and ref.
[33] computes the mutual information between the image pixel values and both the reflectance and the depth maps from the LiDAR data. One drawback of mutual information methods is that they heavily rely on local features, which results in a significant dependence on the initial registration parameters. Moreover, using reflectance values for the mutual information method has a drawback in that it requires calibration of the laser reflectance values, as uncalibrated reflectance values are considered invalid, which can lead to inaccurate similarity measurements. In their work, ref.
[19] proposed an approach that extracts edge points from both the camera images and the LiDAR data. The method utilizes an objective function that integrates the information of camera intensity and depth discontinuity in a product sum fashion. It can detect and correct miscalibration between the two data sources through a grid search optimization.
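A product-sum objective of this kind can be sketched as follows. This is a simplified illustration rather than the exact objective of [19]: it assumes an image edge-strength map and a per-point depth-discontinuity value have already been computed, and every name and number is hypothetical.

```python
import numpy as np

def edge_alignment_score(edge_image, pixels, discontinuity):
    """Sum of (image edge strength * LiDAR depth discontinuity) over projected points.

    edge_image: HxW array of image edge strength.
    pixels: Nx2 array of (u, v) projections of LiDAR points.
    discontinuity: length-N array of depth-discontinuity values, one per point.
    """
    h, w = edge_image.shape
    u = np.rint(pixels[:, 0]).astype(int)
    v = np.rint(pixels[:, 1]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)   # drop points outside the image
    return float(np.sum(edge_image[v[inside], u[inside]] * discontinuity[inside]))

# Toy scene: a vertical image edge at column 5 and three LiDAR edge points.
edge_image = np.zeros((10, 10))
edge_image[:, 5] = 1.0
discontinuity = np.array([2.0, 1.0, 0.5])
on_edge = np.array([[5.0, 2.0], [5.0, 4.0], [5.0, 6.0]])
shifted = on_edge + np.array([2.0, 0.0])               # simulated miscalibration
score_aligned = edge_alignment_score(edge_image, on_edge, discontinuity)
score_shifted = edge_alignment_score(edge_image, shifted, discontinuity)
```

A grid search would evaluate this score for small perturbations of the current parameters and keep the perturbation with the highest value.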
2.4. Deep Learning-Based Approaches
Deep learning-based approaches have emerged as a promising option for LiDAR–camera calibration. These approaches replace the manual feature extraction step with neural networks to better handle the complex data involved. By leveraging the representation learning capabilities of neural networks, they can automatically extract features that are more relevant to the calibration task. Moreover, the subsequent feature matching and parameter calculation can also be implemented using neural networks. This enables the entire calibration process to be performed in an end-to-end fashion, with the neural network taking raw data as input and directly outputting the calibration parameters.
In a recent study by
[39], an end-to-end approach is proposed to tackle the calibration problem. This approach employs convolutional neural networks (CNNs) to extract feature information from both camera and LiDAR-projected images, and subsequently, another CNN block is utilized to establish correspondence between the features. Finally, a fully connected network is employed to output the calibration parameters. Subsequent research
[40][41][42][43] has also employed neural networks to tackle the calibration problem. Despite their impressive performance, deep learning models tend to generalize poorly to sensor configurations that differ from those seen during training. In such cases, conventional calibration techniques may be more practical and efficient than re-training the models. Moreover, the lack of interpretability of deep learning models makes it difficult to analyze failure cases and to estimate operational limits analytically, which poses a significant challenge for these black-box approaches.