On 3D Reconstruction with RGB-D Cameras: History

The representation of the physical world is an issue of growing concern to the computer vision community. Research has focused on modern techniques and methods of photogrammetry and stereoscopy with the aim of reconstructing realistic three-dimensional models with high accuracy and metric information in a short time. To obtain such data at relatively low cost, various tools have been developed, including depth cameras. RGB-D cameras are novel sensing systems that capture RGB images along with per-pixel depth information.

  • RGB-D camera
  • 3D reconstruction
  • depth image processing

1. Introduction

One of the main tasks that computer vision deals with is the 3D reconstruction of the real world [1]. In computer science, three-dimensional reconstruction means the process of capturing the shape and appearance of real objects. This process is accomplished by active and passive methods, and for this purpose, several tools have been developed and applied. In the last decade, an innovative technology has emerged: depth (RGB-D) cameras. An elementary issue is “object detection,” which refers to recognizing objects in a scene and is divided into instance recognition and category-level recognition. Object recognition depends highly on RGB cameras and instance-specific information [2], whereas the quality of the recognized category depends on generalizing the properties or functionalities of the object to unseen instances of the same category. Although depth cameras provide 3D reconstruction models in real time, robustness and accuracy remain main issues for researchers [3]. In addition, the representation of an object can undergo changes such as scaling, translation, occlusion, or other deformations, which make category-level recognition a difficult topic. Object detection also has weaknesses due to illumination, camera viewpoint, and texture [4]. The recovered information is expressed in the form of a depth map, that is, an image or image channel that contains information relating to the distance of objects’ surfaces from the capturing camera. Depth maps are invariant to texture and illumination changes [5]. The modeling of objects can be categorized into geometric and semantic. Geometric modeling provides an accurate model of the geometry, whereas semantic modeling analyzes objects so that they can be understood by humans. A typical example of semantic information is the integration of RGB-D cameras into people’s daily life for space estimation (odometry), object detection, and classification (doors, windows, walls, etc.). RGB-D cameras provide continuous, detailed feedback about the surrounding area, which is especially helpful for visually impaired people. An RGB-D camera can serve as a navigational aid both indoors and outdoors, in addition to classic aids such as the walking stick. In recent years, with the emergence of the COVID-19 pandemic, people have been spending more time in the household, which particularly affects this group. In this context, algorithms combining RGB and depth cameras have been developed that deliver audio messages about the distance of objects from the user [6]. In addition, to ensure real-time performance, such systems are accelerated by offloading functions to the GPU in parallel [7].
Moreover, objects are represented either dynamically [8] or statically [9], capturing scenes with rapid movements or complex topology changes, and the mapped area, respectively. Three-dimensional reconstruction is applied in various fields, such as natural disasters [10], healthcare [11], heritage [12], and so on. Using various camera techniques and methods, scenes are captured in both dynamic and static environments. For a dynamic scene, there are two categories of methods, each with its own advantages and limitations: one tracks the motion of the object (fusion-based methods), and the other models a reconstruction without taking into account the number of photos taken simultaneously (frame-based methods) [13]. Detailed and complete analysis of various scenes is of major importance, especially for robotic applications where different environments exist. In these cases, semantic segmentation methods are applied, which enhance various tasks, such as semantically assisted person perception, (semantic) free space detection, (semantic) mapping, and (semantic) navigation [14]. Depth images provide complementary geometric information to RGB images, which can improve segmentation [15]. However, integrating depth information is difficult, because depth introduces the deviating statistics and characteristics of a different modality [16]. For the semantic recording, understanding, and interpretation of a scene, several methods were developed that were not particularly successful due to the increased memory consumption they required [17,18,19,20,21,22]. To solve this issue, new model architectures have been proposed, such as the shape-aware convolutional layer (ShapeConv) [23], the separation-and-aggregation gate (SA-Gate) [24], the attention complementary network (ACNet) [25], RFNet [26,27], and CMX [28], as well as methods that focus on extracting modality-specific features in order to ensure that the best features are extracted without errors [29,30,31,32]. The result of 3D reconstruction of objects with depth cameras is a depth map, which suffers from some limitations of this technology, such as sensor hardware, pose errors, and low-quality or low-resolution capture. In short, RGB-D cameras are a new technology that differs from classical RGB cameras in the sensors they integrate. Although their advantages are many, there are still some limitations regarding geometric errors, camera trajectory, texture, and so on [33]. Moreover, deep-learning-based reconstruction methods and systems are used in various applications, such as human action recognition and medical imaging, and they learn directly from image data to extract features [34,35,36,37]. Artificial neural networks (ANNs) are the foundation of deep learning techniques.

2. History of RGB and 3D Scene Reconstruction

The technology of time-of-flight (ToF) cameras with 3D imaging sensors that provide a depth image and an amplitude image with a high frame rate has developed rapidly in recent years. 
Depth cameras were developed in the last decade; however, the foundations were laid in 1989. Some milestones in the history of 3D reconstruction and RGB-D technology are as follows:
  • In the 1970s, the idea of 3D modeling and the significance of object shape were introduced.
  • In the 1980s, researchers focused on the geometry of objects.
  • From the 2000s, various techniques and methods related to the features and textures of objects and scenes were developed.
  • In the 2010s, appropriate algorithms were developed and implemented in applications, mainly involving dynamic environments and robotics. In this framework, deep learning methods are used that provide satisfactory models with high accuracy.
Nowadays, research is focused on ways of solving the existing limitations associated with quality fusion of the scenes.

3. Hardware and Basic Technology of RGB-D

Depth cameras are vision systems that alter the characteristics, mainly visual, of their environment in order to capture 3D scene data from their field of view. These systems use structured lighting, which projects a known pattern onto the scene and measures its distortion when viewed from a different angle. This source may use a wavelength in the visible spectrum but, more commonly, one selected from the infrared spectrum. More sophisticated systems implement the time-of-flight (ToF) technique, in which the return time of a light pulse reflected by an object in the scene is used to capture depth information (typically over short distances) from the scene of interest [71].
A typical depth (RGB-D) camera incorporates an RGB camera, a microphone, and a USB port for connection to a computer. In addition, it includes a depth sensor, which uses infrared structured light to calculate the distance (depth) of each object from the camera's horizontal optical axis, taking into account each point of the object being studied. Some cameras also have an infrared emitter (IR emitter), consisting of an IR laser diode, that beams modulated IR light into the field of view. The reflected light is collected by the depth sensor and an infrared receiver (IR sensor) mounted anti-diametrically. RGB-D sensors are a specific type of depth-sensing device that works in association with an RGB (red, green, blue) camera sensor. They are able to augment the conventional image with depth information (related to the distance to the sensor) on a per-pixel basis. The depth information obtained from infrared measurements is combined with the RGB image to yield an RGB-D image. The IR sensor is combined with an IR camera and an IR projector. This sensor system is highly mobile and can be attached to a mobile device such as a laptop [72].
As for how it works, the camera emits a pre-defined pattern of infrared light rays. The light is reflected by objects in the scene and measured by the depth sensor. Since the distance between the emitter and the sensor is known, the depth measurement with respect to the RGB sensor is obtained for each pixel from the difference between the observed and expected pattern position, using trigonometric relationships.
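As a rough illustration of this triangulation principle, the sketch below converts a per-pixel disparity (the difference between the observed and expected pattern position) into depth, assuming a pinhole model with known focal length and emitter-sensor baseline. The focal length, baseline, and disparity values are illustrative assumptions, not the parameters of any particular camera.

```python
import numpy as np

def depth_from_disparity(disparity_px, fx_px, baseline_m, eps=1e-6):
    """Triangulate depth (meters) from a pattern disparity (pixels).

    disparity_px : observed minus expected pattern position, per pixel.
    fx_px        : focal length of the IR sensor, in pixels (assumed value).
    baseline_m   : distance between IR emitter and IR sensor, in meters (assumed value).
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = np.abs(disparity) > eps            # avoid division by zero for missing pixels
    # Similar triangles: Z = f * b / d
    depth[valid] = fx_px * baseline_m / disparity[valid]
    return depth

# Illustrative values: 40-pixel shift, 580 px focal length, 7.5 cm baseline -> about 1.09 m
print(depth_from_disparity([[40.0]], fx_px=580.0, baseline_m=0.075))
```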

4. Conceptual Framework of 3D Reconstruction

4.1. Approaches to 3D Reconstruction (RGB Mapping)

Researchers have approached the subject of 3D reconstruction with various techniques and methods.
According to the literature, for the 3D reconstruction of various scenes, depth cameras are used in combination with various techniques and methods to extract more accurate, qualitative, and realistic models. However, when dealing with footage in mostly dynamic environments, there are some limitations that require solutions.

4.2. Multi-View RGB-D Reconstruction Systems That Use Multiple RGB-D Cameras

Three-dimensional reconstruction of a scene from a single RGB-D camera is risky and has certain limitations that should be taken seriously. For example, in complex, large scenes, a single camera has low performance and requires high memory capacity or estimation of the relative poses of all camera positions in the system. To address these issues, multiple RGB-D camera systems were developed. Using this approach, data are acquired independently from each camera and then brought into a single reference frame to form a holistic 3D reconstruction of the scene. Therefore, in these systems, calibration is necessary [90].
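A minimal sketch of this merging step, assuming the extrinsic calibration of each camera is already known, is shown below: each depth map is back-projected into its own camera frame and then transformed into a common world frame. The intrinsics, extrinsics, and depth values are placeholder assumptions for illustration only.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (meters) to an N x 3 point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0                                   # drop missing-depth pixels
    x = (u.ravel()[valid] - K[0, 2]) * z[valid] / K[0, 0]
    y = (v.ravel()[valid] - K[1, 2]) * z[valid] / K[1, 1]
    return np.stack([x, y, z[valid]], axis=1)

def to_world(points_cam, R, t):
    """Apply calibrated extrinsics (camera -> world) to a point cloud."""
    return points_cam @ R.T + t

# Illustrative two-camera setup: identical intrinsics, second camera shifted 1 m along x.
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1]])
depth_a = np.full((480, 640), 2.0)                  # fake flat scene 2 m away
depth_b = np.full((480, 640), 2.0)
extrinsics = [(np.eye(3), np.zeros(3)),
              (np.eye(3), np.array([1.0, 0.0, 0.0]))]

merged = np.vstack([to_world(backproject(d, K), R, t)
                    for d, (R, t) in zip([depth_a, depth_b], extrinsics)])
print(merged.shape)                                 # combined cloud in one reference frame
```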

4.3. RGB-D SLAM Methods for 3D Reconstruction

3D reconstruction can be used as a platform to monitor the performance of activities on a construction site [91]. The development of navigation systems is one of the major issues in robotic engineering. A robot needs information about the environment, objects in space, and its own position; therefore, various methods of navigation have been developed based on odometry [92], inertial navigation, magnetometers, active labels (GPS) [93], and label and map matching. Simultaneous localization and mapping (SLAM) is one of the most promising approaches to navigation: it is an advanced technique in the robotics community, originally designed for a mobile robot to consistently build a map of an unknown environment and simultaneously estimate its location within this map [94]. Recent progress in visual SLAM makes it possible to reconstruct a 3D map of a construction site in real time. When a camera is used as the only exteroceptive sensor, the technique is called visual SLAM (VSLAM) [95]. Modern SLAM solutions provide mapping and localization in an unknown environment [96], and some of them can be used to update a previously built map. SLAM is the general methodology for solving two problems [97,98]: (1) environment mapping and 3D model construction, and (2) localization using a generated map and trajectory processing [99].
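To make the two sub-problems concrete, the toy sketch below chains frame-to-frame relative poses into a camera trajectory (localization) and transforms each frame's points into a common map (mapping). The relative poses and per-frame points are random placeholders; a real SLAM system would estimate them from RGB-D feature matching or ICP, so this is a conceptual illustration rather than any specific SLAM pipeline.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4 x 4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Placeholder relative motions between consecutive frames (0.1 m forward each step);
# in practice these would come from RGB-D odometry.
relative_poses = [make_pose(np.eye(3), [0.1, 0.0, 0.0]) for _ in range(5)]

trajectory = [np.eye(4)]                 # localization: accumulated camera poses
global_map = []                          # mapping: points expressed in the world frame

for T_rel in relative_poses:
    trajectory.append(trajectory[-1] @ T_rel)
    # Placeholder per-frame points (in practice, back-projected depth pixels).
    pts_cam = np.random.rand(100, 3)
    pts_h = np.hstack([pts_cam, np.ones((100, 1))])
    global_map.append((trajectory[-1] @ pts_h.T).T[:, :3])

global_map = np.vstack(global_map)
print(len(trajectory) - 1, "poses,", global_map.shape[0], "map points")
```

A full system would additionally detect loop closures and optimize the whole trajectory to reduce accumulated drift; the sketch only shows the basic chaining of poses and points.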

5. Data Acquisition and Processing

5.1. RGB-D Sensors and Evolution

The data acquired from depth cameras play an important role in further processing in order to produce a qualitative and accurate 3D reconstructed model of the physical world. Therefore, the contribution of the sensors incorporated in depth cameras is of major importance. Nowadays, these sensors have many capabilities and continue to evolve. Their rapid evolution is due to the parallel development of related technologies, which is to be expected considering that depth cameras work together with other devices and software. In short, there are two main types of sensors, active and passive, which complement each other in various implementations [100]. Although sensors provide many benefits, they also present errors [101] and inaccurate measurements [102]. In general, to achieve a high degree of detail, depth cameras should be calibrated.

5.2. Sensing Techniques of RGB-D Cameras

There are different techniques to acquire data from depth cameras. These techniques fall into two categories, active and passive sensing, along with the recently developed monocular depth estimation. The techniques of the first category use structured energy emission to capture an object in a static environment [103] and capture the whole scene at the same time; with active techniques, 3D reconstruction becomes simpler. This category has two subcategories, time-of-flight (ToF) and structured light (SL) cameras [104]. The second category is based on the triangulation principle [105,106] and, through epipolar geometry, on the correspondence of key points. In the third category, the depth estimation for the 3D reconstruction of an object is derived from two-dimensional images [107].
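To make the ToF subcategory concrete, the sketch below converts a measured round-trip time (pulsed ToF) or phase shift (continuous-wave ToF) into distance; the modulation frequency and timing values are illustrative assumptions, not the specifications of any sensor.

```python
import numpy as np

C = 299_792_458.0                        # speed of light in m/s

def pulsed_tof_distance(round_trip_time_s):
    """Pulsed ToF: light travels to the object and back, so distance = c * t / 2."""
    return C * round_trip_time_s / 2.0

def cw_tof_distance(phase_shift_rad, mod_freq_hz):
    """Continuous-wave ToF: distance = c * phi / (4 * pi * f_mod),
    unambiguous only up to c / (2 * f_mod)."""
    return C * phase_shift_rad / (4.0 * np.pi * mod_freq_hz)

print(pulsed_tof_distance(20e-9))                     # 20 ns round trip    -> ~3.0 m
print(cw_tof_distance(np.pi / 2, mod_freq_hz=20e6))   # 90 deg shift, 20 MHz -> ~1.87 m
```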

5.3. Depth Image Processing (Depth Map)

The depth of the scene, combined with the color information, composes the RGB-D data, and the result is a depth map. A depth map is a metric-valued image that provides information about the distance of the surfaces of the scene objects. In fact, it is through depth estimation that the geometric relationships of objects within a scene are understood [108]. This is achieved by epipolar geometry (i.e., the geometry of stereoscopic vision), which describes a scene viewed by two cameras placed at different angles, or simply by the same camera shifted to different viewpoints.
A 3D point in world coordinates, written homogeneously as X = (X, Y, Z, 1), is projected onto the camera sensor at the pixel x = (u, v, 1) according to x = K [R | T] X, where K is the camera calibration matrix, and R and T are the 3 × 3 rotation matrix and 3 × 1 translation vector, respectively. From a single image, the only information that can be obtained is the half-line on which this point is located, a half-line starting from the center of projection of the camera and extending away from it. Therefore, if there is a second camera at a different position in space, covering the same scene, it is possible, through trigonometry, to calculate the exact 3D coordinates of the point, as long as the points of one camera can be mapped to the points of the other [109]. Solving the problem is straightforward, as it amounts to solving a small system of equations with three unknowns. The pixel position in the frame of each camera, as well as the relative transformation between the coordinate systems of the two cameras, is available as data. The mapping of pixels to a frame plane is done through the algorithms discussed above.
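One standard way to carry out this computation is linear triangulation: each camera contributes two linear constraints on the homogeneous 3D point, and the stacked system is solved in a least-squares sense via SVD. The sketch below assumes two already-calibrated projection matrices; the intrinsics, baseline, and test point are illustrative values.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two calibrated views.

    P1, P2 : 3 x 4 projection matrices K [R | T].
    x1, x2 : (u, v) pixel coordinates of the same point in each view.
    Returns the 3D point in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                  # de-homogenize

# Illustrative setup: identical intrinsics, second camera translated 0.2 m along x.
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.3, -0.1, 2.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))       # ~ [0.3, -0.1, 2.0]
```

The least-squares formulation tolerates small measurement noise in the pixel coordinates, which is why it is preferred over solving an exactly determined subset of the equations.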
Depth maps are produced using the methods described in Section 5.2, and their quality is directly related to environment lighting, object reflectance, and spatial resolution. For example, bright lighting is responsible for creating outliers [110]. In addition, depth maps suffer from reflective surfaces at certain viewing angles, occlusion boundaries [111], quantization levels, and random noise (mainly at indoor scene distances) [112], which are related to the distance of the object and the pixel position. To a certain extent, some of the above disadvantages can be addressed; for example, the fusion of frames from different viewpoints, shape-from-shading (SfS) and shape-from-polarization (SfP) techniques, or bilateral filtering help to repair the noise and smooth the depth map [113]. Qualitative depth maps have been an important concern for researchers, who have devised techniques to solve the problems created during the 3D reconstruction process.
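As a hedged illustration of such repair steps, the sketch below fills small holes in a synthetic depth map with a local median and then applies OpenCV's bilateral filter for edge-preserving smoothing; the kernel sizes, filter parameters, and synthetic data are illustrative choices, not recommended settings.

```python
import numpy as np
import cv2

# Synthetic noisy depth map (meters) with a small missing-depth hole.
rng = np.random.default_rng(0)
depth = (2.0 + 0.02 * rng.standard_normal((480, 640))).astype(np.float32)
depth[100:103, 200:203] = 0.0                      # hole encoded as zeros

# Fill the hole with the local median of neighboring depth values.
holes = depth == 0.0
filled = np.where(holes, cv2.medianBlur(depth, 5), depth)

# Edge-preserving smoothing: the bilateral filter averages only pixels with
# similar depth, so object boundaries are blurred less than with a Gaussian.
smoothed = cv2.bilateralFilter(filled, d=5, sigmaColor=0.1, sigmaSpace=5.0)
print(float(depth[101, 201]), float(smoothed[101, 201]))  # hole replaced by a local estimate
```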
Depth maps are of great importance for extracting 3D reconstruction models; however, there are still limitations that pose challenges to the scientific community. Moreover, there are still some issues that remain open and need to be explored in the future. The main limitations are as follows:
  • Only the first surface seen is recorded, so information about refracted surfaces cannot be obtained;
  • Noise arises from reflective surfaces at certain viewing angles, and occlusion boundaries blur the edges of objects;
  • Single-channel depth maps cannot convey multiple distances when multiple objects fall within the same pixel (e.g., grass, hair);
  • They may represent the perpendicular distance between an object and the camera plane rather than the actual distance from the camera, even though the true distances to surfaces seen in the corners of the image are greater than those to the central area;
  • In the case of missing depth data, many holes are created; a median filter can be used to fill them, but it corrupts sharp depth edges;
  • Cluttered spatial configuration of objects can create occlusions and shadows.
From the above limitations emerge several challenges, such as occlusions, camera calibration errors, low resolution, sensitivity to high levels of ambient light (for ToF), and unsuitability for outdoor operation (for structured light). In addition, depth noise increases quadratically with distance (for SL). Moreover, issues such as the correspondence between stereo or multi-view images, multiple depth cues, computational complexity, spatial resolution, angle of projection, and multiple-camera interference in dynamic scenarios remain open to investigation.

5.4. RGB-D Datasets

RGB-D data are essential for solving certain problems in computer vision. Nowadays, there are open databases containing large datasets of both indoor and outdoor scenes collected with RGB-D cameras and different sensors. The data relate to scenes and objects, human activities, gestures, and the medical field, and are used for applications such as simultaneous localization and mapping (SLAM) [114], representation [115], object segmentation [116], and human activity recognition [117].
The NYU Depth dataset is the most popular for RGB-D indoor segmentation. It was created using a Microsoft Kinect v1 sensor, is composed of aligned RGB and depth images, and contains labeled data with semantic segmentation as well as raw data [118]. There are two versions: NYUv1, with 64 scenes (108,617 frames) and 2347 labeled RGB-D frames [119], and the larger NYUv2, with 464 scenes (407,024 frames) and 1449 labeled, aligned RGB-D images at 640 × 480 resolution, split into a training set of 795 images and a testing set of 654 images. NYUv2 originally had 13 different categories; however, recent models mostly evaluate their performance on the more challenging 40-class setting [120].
The SUN RGB-D dataset [110] is in the same category as NYU. The data were acquired with structured light and ToF sensors and are used for semantic segmentation, object detection, and pose estimation. The dataset provides 10,335 RGB-D images with corresponding semantic labels. It contains images captured by different depth cameras (Intel RealSense, Asus Xtion, Kinect v1/2), since they were collected from previous datasets; therefore, the image resolutions vary depending on the sensor used. SUN RGB-D has 37 object classes. The training set consists of 5285 images, and the testing set consists of 5050 images [121].
The Stanford2D3D dataset consists of indoor scene images, taken with a structured light sensor, which are used for semantic segmentation. It is a large-scale dataset consisting of 70,496 RGB images with associated depth maps. The images have a resolution of 1080 × 1080 and are collected in a 360° scan fashion. The usual class setting employed is 13 classes [122].
The ScanNet dataset is an indoor dataset collected with a structured light sensor and contains over 2.5 million frames from 1513 different scenes. It is used for 3D semantic voxel segmentation [115].
The Hypersim dataset consists of indoor scenes that are captured synthetically and used for normal maps, instance segmentation, and diffuse reflectance [123].
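As a practical note, many RGB-D datasets distribute depth as 16-bit images with a dataset-specific scale factor alongside the RGB frames. The sketch below shows a generic, hypothetical way to load such a pair; the file names and the millimeter scale factor are assumptions that must be replaced by the conventions of the dataset actually used.

```python
import numpy as np
import cv2

# Hypothetical file names; real datasets define their own directory layout.
rgb = cv2.imread("frame_0001_rgb.png", cv2.IMREAD_COLOR)
depth_raw = cv2.imread("frame_0001_depth.png", cv2.IMREAD_UNCHANGED)  # keep 16-bit values
assert rgb is not None and depth_raw is not None, "adjust paths to the dataset in use"

# Assumed convention: depth stored in millimeters, 0 means "no measurement".
depth_m = depth_raw.astype(np.float32) / 1000.0
valid = depth_raw > 0

print(rgb.shape, depth_m.shape, float(depth_m[valid].mean()))
```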

This entry is adapted from the peer-reviewed paper 10.3390/digital2030022
