The METRIC dataset comprises more than 10,000 synthetic and real images of ChAruCo and checkerboard patterns. Each pattern is rigidly attached to the robot's end-effector, which is systematically moved in front of four cameras surrounding the manipulator, allowing image acquisition from various viewpoints. The real images encompass multiple sets captured by three distinct types of sensor networks: Microsoft Kinect V2, Intel RealSense Depth D455, and Intel RealSense Lidar L515. These images make it possible to evaluate the advantages and disadvantages of each sensor network for calibration purposes. Additionally, to accurately assess the impact of the camera-robot distance on calibration, we generated a comprehensive synthetic dataset. This dataset contains associated ground truth data and is divided into three different camera network setups, corresponding to three levels of calibration difficulty based on the cell size.
1. Introduction
The use of camera networks has become increasingly popular in various computer vision applications, such as human pose estimation, 3D object detection, and 3D reconstruction [1,2,3,4]. Multi-camera systems offer the advantage of monitoring larger areas and making several computer vision algorithms more robust against occlusion problems. These challenges frequently occur in complex real-world scenarios such as people-tracking applications [5] or robotic workcells [6].
Calibrating a camera network is a crucial step in setups involving multiple cameras, and it typically involves determining intrinsic and extrinsic parameters. Intrinsic calibration determines the internal sensor parameters required to accurately project the scene from each 3D camera reference frame onto the corresponding 2D image plane, and these parameters can be obtained using algorithms such as Zhang's or Sturm's [7,8]. Extrinsic calibration establishes a single 3D reference frame shared by all sensors in the camera network, which is essential for multi-camera applications, since it allows the accurate localization of objects or people in the scene with respect to this common reference system. Both intrinsic and extrinsic calibration involve an image acquisition phase in which a calibration pattern is placed at different positions and orientations in front of the sensors. Once the pattern control points have been detected, an optimization process estimates the camera parameters, for example by minimizing the reprojection error [9]. The calibration pattern can be either a planar model, such as a checkerboard or a ChAruCo pattern [10], or any other object of known shape with easily recognizable features [11]. Intrinsic calibration requires pattern images taken at short distances from the sensor so that the detections cover the entire image plane, whereas extrinsic calibration is often performed with the pattern at longer distances to ensure, for example, its simultaneous detection by multiple cameras. Hence, the two calibration processes are typically performed in two separate steps. Furthermore, intrinsic calibration is conducted separately for each sensor, and intrinsic parameters such as focal length and image center are occasionally provided by the sensor manufacturer; therefore, cameras are sometimes considered to be intrinsically calibrated.
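As a concrete illustration of the intrinsic calibration step described above, the following minimal sketch uses OpenCV's checkerboard detector and Zhang-style optimization. The board geometry (9 × 6 inner corners, 25 mm squares) and the image path are illustrative assumptions, not part of the METRIC setup.

```python
import glob
import cv2
import numpy as np

# Illustrative assumptions: a 9x6 inner-corner checkerboard with 25 mm squares,
# and calibration images stored under "calib/*.png".
PATTERN = (9, 6)
SQUARE_SIZE = 0.025  # metres

# 3D control points of the board in its own reference frame (z = 0 plane).
obj = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    # Refine corner locations to sub-pixel accuracy before optimization.
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-4))
    obj_points.append(obj)
    img_points.append(corners)

# Zhang-style optimization: jointly estimates the camera matrix and distortion
# coefficients by minimizing the corner reprojection error over all views.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(f"RMS reprojection error: {rms:.3f} px")
```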
Camera network calibration is required for several applications, including multi-camera navigation systems [12], people-tracking within camera networks [13], and surveillance systems [14]. It is also critical in robotic scenarios [15,16], especially when dealing with a robotic workcell composed of a robot arm surrounded by a camera network installed to monitor the workcell area [5,17,18]. In such cases, it is essential to provide the robot with accurate information about its working environment, and simply estimating the relative positions among the cameras is not enough. A single reference frame shared by all viewpoints is required, which may be an external world reference frame or, more commonly, may coincide with the robot's base. Defining the reference frame coincident with the robot's base allows the robot to locate an object of interest with respect to itself, which is needed in several domains, such as industrial and medical applications [19,20].
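To make the role of this shared reference frame concrete, the short sketch below chains two rigid transforms: the camera pose in the robot-base frame (from extrinsic calibration) and an object pose in the camera frame (from a detector). All numeric values are placeholders.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative values: extrinsic calibration gives each camera's pose in the
# robot-base frame (T_base_cam); a detector gives the object pose in the
# camera frame (T_cam_obj). Both matrices below are placeholders.
T_base_cam = to_homogeneous(np.eye(3), np.array([1.2, 0.0, 0.8]))
T_cam_obj = to_homogeneous(np.eye(3), np.array([0.0, 0.1, 0.6]))

# Chaining the two transforms expresses the detection directly in the robot's
# own reference frame, which is what motion planning needs.
T_base_obj = T_base_cam @ T_cam_obj
print("Object position in robot-base frame:", T_base_obj[:3, 3])
```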
2. METRIC—Multi-Eye to Robot Indoor Calibration Dataset
Several methods have been proposed in the literature targeting camera network calibration and robot-world hand-eye calibration, and in both use cases a planar calibration pattern such as a ChAruCo board or a checkerboard is typically used. However, these algorithms have typically been evaluated on dedicated setups and specific datasets, limiting the comparison among different methods. Currently, there is a lack of general datasets that can be used to evaluate the performance of calibration algorithms and test their robustness under different conditions, such as variations in the distance between the camera and the calibration pattern.
2.1. Camera Network Calibration
Camera network calibration techniques that share a common approach based on planar calibration patterns are described in more detail below.
Kim et al. [12] proposed an extrinsic calibration process for multi-camera systems composed of lidar-camera combinations for navigation systems. The proposed method used a planar checkerboard pattern, which was manually moved in front of the sensors during the calibration process. Furgale et al. [25] proposed Kalibr, a novel framework that employs maximum-likelihood estimation to jointly calibrate temporal offsets and geometric transformations of multiple sensors. The robustness of the approach is demonstrated through an extensive set of experiments, including the calibration of a camera and an inertial measurement unit (IMU). Tabb et al. [22] proposed a method for calibrating asynchronous camera networks. The method addresses the calibration of multi-camera systems regardless of the hardware or the synchronization level among cameras, which is typically a major factor influencing camera network calibration results. Caron et al. [26] introduced an algorithm for the simultaneous intrinsic and extrinsic calibration of a multi-camera system using a different model for each camera. The algorithm minimizes the corner reprojection error computed on each camera using the corresponding projection model, exploiting a set of images of a calibration pattern, such as a checkerboard, manually moved to different distances and positions in front of the cameras. Munaro et al. presented OpenPTrack, an open-source multi-camera calibration software designed for people-tracking in RGB-D camera networks [13]. They proposed a camera network calibration system that works on images acquired from all the sensors while a checkerboard is manually moved within the tracking space so that more than one camera can detect it, followed by a global optimization of the camera and checkerboard poses. All of the above methods address the calibration of specific camera network configurations using a checkerboard or a ChAruCo pattern, but each has been tested on its own dataset for a specific task, which limits the comparison of different techniques on a common benchmark.
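A building block common to several of the methods above is estimating the board pose per camera and chaining the results to obtain relative camera poses. The sketch below assumes two intrinsically calibrated cameras that detect the same board in a synchronized frame pair; `board_pts`, the corner arrays, and the intrinsics (`K_a`, `K_b`, distortion vectors) are assumed inputs from detection and intrinsic calibration, as in the earlier snippet.

```python
import cv2
import numpy as np

def board_pose(board_pts, corners, K, dist):
    """Pose of the board in the camera frame (T_cam_board) via PnP."""
    ok, rvec, tvec = cv2.solvePnP(board_pts, corners, K, dist)
    assert ok, "PnP failed"
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.ravel()
    return T

def relative_pose(board_pts, corners_a, K_a, dist_a, corners_b, K_b, dist_b):
    """Relative transform mapping camera B's frame into camera A's frame."""
    T_a = board_pose(board_pts, corners_a, K_a, dist_a)
    T_b = board_pose(board_pts, corners_b, K_b, dist_b)
    return T_a @ np.linalg.inv(T_b)
```

In practice, methods such as OpenPTrack accumulate many such pairwise estimates and then refine all camera and board poses jointly in a global optimization.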
2.2. Robot-World Hand-Eye Calibration
Several works in the literature have addressed the issue of robot-world hand-eye calibration adopting planar calibration patterns. In a previous study [27], researchers proposed a non-linear optimization algorithm to solve the robot-world hand-eye calibration problem with a single camera. The proposed method minimizes the corner reprojection error of a checkerboard rigidly attached to the robot's end-effector and moved in front of the sensor to different positions and orientations during image acquisition. In a scenario where a robot is surrounded by a camera network consisting of N sensors, this method must be applied N times, once for each sensor, to determine the pose of each camera with respect to the robot and thus to calculate the relative pose between the different cameras. Tabb et al. [24] proposed a robot-world hand-multiple-eye calibration procedure using a classic checkerboard and compared two main techniques, each based on a different cost function. The first cost function minimizes the difference between two transformation chains over the n positions of the robot arm reached during image acquisition, and it relies on the Perspective-n-Point (PnP) problem [28] of estimating the roto-translation between a camera and the calibration pattern. The second cost function minimizes the corner reprojection error. In addition, Li and Shah proposed two different procedures for robot-world hand-eye calibration using dual quaternions and the Kronecker product, respectively [29,30]. All of these works focus on calibration within small workcells, where the cameras are placed approximately 1 m from the robot, which limits the ability to analyze the robustness of different calibration methods, particularly as the distance between the cameras and the calibration pattern increases.
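The robot-world hand-eye problem discussed above is commonly written as AX = ZB, where A is the board-to-camera pose from PnP at each robot position and B is the base-to-gripper pose from the robot's forward kinematics. OpenCV implements both the Shah (Kronecker-product) and Li solvers; the sketch below exercises them on noise-free synthetic poses, so all transforms are placeholders rather than METRIC data.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def random_pose():
    # Random rigid transform: rotation from a Rodrigues vector, small translation.
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rng.uniform(-0.5, 0.5, 3))
    T[:3, 3] = rng.uniform(-0.3, 0.3, 3)
    return T

# Ground-truth unknowns of AX = ZB (placeholders for this sketch):
# X = base-to-world transform, Z = gripper-to-camera transform.
X_true, Z_true = random_pose(), random_pose()

R_world2cam, t_world2cam, R_base2gripper, t_base2gripper = [], [], [], []
for _ in range(10):
    B = random_pose()                       # base-to-gripper (robot kinematics)
    A = Z_true @ B @ np.linalg.inv(X_true)  # world-to-camera, so that A @ X = Z @ B
    R_world2cam.append(A[:3, :3]);    t_world2cam.append(A[:3, 3].reshape(3, 1))
    R_base2gripper.append(B[:3, :3]); t_base2gripper.append(B[:3, 3].reshape(3, 1))

# Shah's Kronecker-product solver; cv2.CALIB_ROBOT_WORLD_HAND_EYE_LI selects Li's method.
R_bw, t_bw, R_gc, t_gc = cv2.calibrateRobotWorldHandEye(
    R_world2cam, t_world2cam, R_base2gripper, t_base2gripper,
    method=cv2.CALIB_ROBOT_WORLD_HAND_EYE_SHAH)

print("rotation error:", np.abs(R_bw - X_true[:3, :3]).max())
print("translation error:", np.abs(t_bw.ravel() - X_true[:3, 3]).max())
```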
2.3. Calibration Datasets
Based on the previous analysis, it can be observed that most calibration methods have been developed for specific use cases, such as the calibration of a camera network or the calibration of one or more sensors with respect to a robot, which makes it challenging to evaluate the performance of different calibration methods on standardized benchmarks. In particular, two main limitations have been identified: (i) the lack of common datasets on which to compare different calibration methods, and (ii) a focus on small workcells and small camera networks.
Tabb et al. [22] released a dataset and the associated code that can be used to calibrate asynchronous camera networks. The dataset includes synthetic and real data aimed at calibrating a camera network of 12 Logitech C920 HD Pro webcams rigidly attached to the walls of a room and facing the center of the scene. In addition, the authors captured a separate dataset specifically designed to calibrate a network of four moving cameras. In all three datasets, ChAruCo models were employed to calibrate a sensor network positioned approximately 0.70 m from the calibration pattern. In [23], Wang and Jang presented a dataset used to calibrate a camera network. The dataset was obtained by manually moving a classical checkerboard in front of a multi-camera system consisting of four sensors placed 0.5 m apart and approximately 1 m away from the calibration pattern. Their proposed method generalizes the hand-eye calibration problem, jointly solving multi-eye-to-base problems in closed form to determine the geometric transformation between the sensors within the camera network. Hüser et al. [31] introduced a real-world dataset that includes different recordings of a calibration checkerboard manually moved in front of the sensors. The dataset was created to perform the intrinsic and extrinsic calibration of twelve synchronized cameras mounted on the walls of a small room, which were used to record and analyze the grasping actions of a monkey interacting with fruit and other objects. Another dataset, described in detail in [32], consists of a small number of Aruco calibration pattern images positioned about 0.5 m from the camera and used for object localization tasks. As far as datasets for testing robot-world hand-eye calibration methods are concerned, only a few are available in the literature. One such dataset is published in [33], where the authors propose a set of images of a planar calibration pattern positioned approximately 1 m away from the robot. The pattern consists of a grid of circles and is used for the hand-eye calibration of a manipulator equipped with a monocular camera (PointGrey Flea3) attached to the end-effector. Tabb et al. presented a dataset containing both synthetic and real data that can be used to assess hand-eye calibration techniques. The authors captured several images of a checkerboard by moving a robot arm, equipped with a multi-camera system attached to its end-effector, through various positions. The calibration pattern was positioned approximately 1 m away from the sensors during image acquisition [24].
The main drawback of many of these datasets is the limited number of images available to test different calibration methods, usually not exceeding 100 images. Additionally, each dataset contains images of a single, specific calibration pattern, which other state-of-the-art methods may not support due to the lack of suitable detectors, further limiting their applicability for evaluating the performance of other techniques.
This entry is adapted from the peer-reviewed paper 10.3390/info14060314