As a 3D reconstruction method, multi-view stereo (MVS) plays a vital role in 3D computer vision and has a wide range of applications in fields such as virtual reality, augmented reality, and autonomous driving. With the rapid development of deep learning in computer vision, learning-based multi-view stereo methods have achieved state-of-the-art results.
1. Introduction
As a 3D reconstruction method, multi-view stereo (MVS) plays a vital role in 3D computer vision and has a wide range of applications in fields such as virtual reality, augmented reality, and autonomous driving. Taking a series of images from different viewpoints and the corresponding camera parameters as input, multi-view stereo estimates the depth of each pixel and generates a 3D representation of the observed scene. As a key problem in 3D computer vision, multi-view stereo has received extensive research attention [1,2,3,4].
In recent years, with the rapid development of deep learning in computer vision, learning-based multi-view stereo methods have achieved state-of-the-art results [4,5,6].
Learning-based multi-view stereo algorithms usually consist of multiple components, including feature extraction, depth sampling, cost volume construction, cost volume regularization, and depth regression. However, the large GPU memory requirement not only limits image processing to low resolutions but also hinders the adoption of multi-view stereo on edge computing devices. In real-world 3D vision applications, the deployed devices often have limited computing resources. For example, in autonomous driving scenarios, LiDAR data is often processed with 3D point cloud compression techniques to reduce storage and transmission costs [7]. Unlike LiDAR data processing, the main computational challenge for multi-view stereo is to generate a point cloud from 2D images and camera parameters. Therefore, reducing the memory consumption of the algorithm can greatly improve the practicality of the technology. Recently, many researchers have proposed improved methods to deal with the high computational cost of learning-based multi-view stereo. In particular, coarse-to-fine architectures have been widely used to design efficient multi-view stereo networks [6,8,9,10,11,12].
Typically, in these methods, an initial cost volume is built at low resolution, new cost volumes are then iteratively built at higher resolutions based on the results of the preceding stage, and finally a depth map is obtained. Progressively reducing the number of depth hypothesis planes across stages [6,8,9,10,11,12] is also a key strategy for reducing computation. Although the coarse-stage output, as the input for constructing the fine-stage cost volumes, is of great significance to the final result, these existing methods pay insufficient attention to the feature information of the coarse stage. If feature extraction at the coarse stage is inadequate, the poor initial results may adversely affect the subsequent stages and the final output. However, denser feature extraction always increases the computational load and GPU consumption, so accuracy and computational efficiency still need to be balanced.
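The coarse-to-fine depth sampling described above can be sketched as follows. This is a minimal illustration, not the implementation of any specific paper: the depth range, plane counts, and interval-shrinking factor are illustrative assumptions.

```python
import numpy as np

def coarse_to_fine_hypotheses(depth_min, depth_max, prev_depth=None,
                              num_planes=8, interval_scale=1.0):
    """Generate depth hypotheses for one cascade stage.

    Coarse stage (prev_depth is None): uniformly sample the full range.
    Finer stages: sample a narrower band centred on the previous
    stage's (upsampled) per-pixel depth estimate.
    """
    if prev_depth is None:
        # Coarse stage: cover the whole plausible depth range.
        return np.linspace(depth_min, depth_max, num_planes)
    # Fine stage: shrink the interval around the previous estimate.
    interval = (depth_max - depth_min) / num_planes * interval_scale
    offsets = (np.arange(num_planes) - num_planes / 2) * interval
    # prev_depth has shape (H, W); broadcast to (num_planes, H, W).
    return prev_depth[None] + offsets[:, None, None]

# Toy usage: an 8-plane coarse stage, then a 4x4 previous depth map
# refined with 4 planes over a quarter-width interval.
coarse = coarse_to_fine_hypotheses(425.0, 935.0, num_planes=8)
prev = np.full((4, 4), 600.0)
fine = coarse_to_fine_hypotheses(425.0, 935.0, prev_depth=prev,
                                 num_planes=4, interval_scale=0.25)
```

The key point is that later stages spend their (fewer) hypothesis planes on a much smaller depth band, which is where the memory savings of cascade methods come from.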
In addition, another challenge for cascade-based multi-view stereo is adapting the depth hypothesis range. In the initial stage, plane sweeping covers the entire plausible depth range. In many cascade-based algorithms [6,8,10,12], the depth estimated at the previous stage is used as the center of the sampling interval when generating depth hypotheses at a finer stage, and every pixel shares a fixed sampling distance within each stage. However, setting a uniform sampling distance for every pixel is not ideal, because the difficulty of depth refinement varies between pixels of the same depth map: some pixels may have a stable depth while others exhibit significant variation. With this challenge in mind, Cheng et al. [11] used the probability distribution of each pixel to set its sampling distance; however, this approach performs poorly in terms of GPU memory usage and runtime, and its training time is also substantial.
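The per-pixel, distribution-driven sampling idea can be sketched as follows. This is a simplified illustration of variance-based interval estimation, not the exact procedure of [11]; the width factor `k` and the toy shapes are assumptions.

```python
import numpy as np

def adaptive_intervals(prob_volume, depth_hypotheses, k=1.5):
    """Per-pixel depth sampling range from a probability volume.

    prob_volume: (D, H, W) softmax probabilities over depth planes.
    depth_hypotheses: (D,) depth value of each plane.
    Returns per-pixel (lower, upper) bounds of width 2*k*sigma, so
    confident pixels get a narrow search range for the next stage
    and uncertain pixels get a wide one.
    """
    d = depth_hypotheses[:, None, None]
    mean = (prob_volume * d).sum(axis=0)               # expected depth
    var = (prob_volume * (d - mean) ** 2).sum(axis=0)  # distribution spread
    sigma = np.sqrt(var)
    return mean - k * sigma, mean + k * sigma

# Toy usage: one confident pixel (all mass on the middle plane) and
# one uncertain pixel (mass split between the outer planes).
depths = np.array([10.0, 20.0, 30.0])
probs = np.zeros((3, 1, 2))
probs[1, 0, 0] = 1.0                    # confident pixel
probs[0, 0, 1] = probs[2, 0, 1] = 0.5   # uncertain pixel
low, high = adaptive_intervals(probs, depths)
```

The confident pixel collapses to a near-zero interval, while the uncertain pixel keeps a wide one; this spatial variation is exactly what a uniform per-stage sampling distance cannot express.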
2. Multi-view stereo methods
2.1. Traditional multi-view stereo methods
Multi-view stereo (MVS), as a fundamental problem in 3D reconstruction in computer vision, addresses the recovery of the spatial geometry of a scene from photographs. Before the advent of deep learning, it had already attracted a great deal of attention and made substantial progress. Traditional multi-view stereo methods can be broadly divided into four categories: voxel-based methods [25,26,27,28,29], mesh-based methods [30,31], surfel-based methods [19,32,33], and depth map-based methods [1,20,21,34,35].
Among the four categories, voxel-based methods divide space into a set of voxels and require extremely high memory consumption. Mesh-based methods are less reliable because their final reconstruction quality depends on the initialization. Surfel-based methods represent the surface as a set of surfels, which is simple yet efficient; however, they require additional, cumbersome post-processing steps to generate the final 3D model. Depth map-based methods compute the depth of each pixel in each image, reproject the pixels into 3D space, and then fuse the points into a point cloud model. Of the four, depth map-based methods are the most flexible and the most widely used in this field. In recent years, depth map-based methods have achieved remarkable success, and solid algorithmic frameworks are in use, such as Furu [19], Gipuma [21], Tola [20], and COLMAP [1]. While the performance of traditional multi-view stereo is commendable, shortcomings remain: high computational requirements, slow processing speed, and poor handling of scenes with weak textures or reflective surfaces.
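The reprojection step at the heart of depth map-based fusion can be sketched as follows. This is a minimal pinhole-camera back-projection under the standard world-to-camera convention; the function name and the toy intrinsics are illustrative, not taken from any of the cited systems.

```python
import numpy as np

def backproject(u, v, depth, K, R, t):
    """Lift pixel (u, v) with depth into a world-space 3D point.

    K: 3x3 camera intrinsics.
    R, t: world-to-camera rotation and translation, i.e. the camera
    model is x_cam = R @ X_world + t.
    """
    pix = np.array([u, v, 1.0])
    x_cam = depth * (np.linalg.inv(K) @ pix)  # point in the camera frame
    return R.T @ (x_cam - t)                  # transform back to world frame

# Toy check: identity pose, unit focal length, principal point (1, 1).
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
p = backproject(1.0, 1.0, 5.0, K, R, t)  # pixel at the principal point
```

Fusing the reconstructions then amounts to running this lift for every pixel of every depth map and merging the resulting points, typically with consistency checks across views.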
2.2. Learning-based multi-view stereo methods
In recent years, with the rise of deep learning, learning-based multi-view stereo methods have developed rapidly and achieved outstanding performance. Yao et al. [4] introduced MVSNet, the first end-to-end learning-based multi-view stereo network, laying the foundation for rapid growth in the following years. MVSNet [4] uses a shared-weight 2D CNN to extract feature maps from the input images. Differentiable homographies [36] are then applied to warp these feature maps into the reference view. The method uses a series of depth hypothesis planes to construct a cost volume that represents the correlation between the source and reference images. A 3D CNN is then used for cost volume regularization, and finally the estimated depth map of the reference image is obtained by depth regression. On the DTU benchmark dataset [17], MVSNet [4] not only outperforms earlier traditional MVS methods [1,19,20] but also runs much faster. However, due to its high GPU memory consumption, only low-resolution images can be used as input to MVSNet.
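Two of the pipeline steps above have compact formulations worth spelling out: the variance-based matching cost over warped feature volumes, and soft-argmin depth regression. The sketch below is a simplified, single-scale illustration that assumes the source features have already been warped onto the reference view's depth planes (the homography warp itself is omitted).

```python
import numpy as np

def variance_cost(features):
    """Variance-based cost over N view-aligned feature volumes.

    features: (N, C, D, H, W) -- reference plus warped source features.
    Lower variance across views means better photometric consistency
    at that depth plane.
    """
    mean = features.mean(axis=0)
    return ((features - mean) ** 2).mean(axis=0)  # (C, D, H, W)

def depth_regression(prob_volume, depth_values):
    """Soft-argmin: expected depth under the per-pixel distribution.

    prob_volume: (D, H, W) probabilities after cost regularization.
    depth_values: (D,) hypothesis depths.
    """
    return (prob_volume * depth_values[:, None, None]).sum(axis=0)

# Toy usage: two perfectly matching views give zero cost, and a
# probability volume peaked at plane 1 regresses to that plane's depth.
feats = np.ones((2, 1, 3, 2, 2))        # (views, C, D, H, W)
cost = variance_cost(feats)
probs = np.zeros((3, 2, 2))
probs[1] = 1.0
depth = depth_regression(probs, np.array([10.0, 20.0, 30.0]))
```

Soft-argmin keeps the whole pipeline differentiable, which is what makes end-to-end training of the feature extractor and the regularization network possible.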
A number of learning-based MVS approaches have been proposed to deal with GPU memory consumption. Yao et al. [22] proposed an improved method, R-MVSNet [22], which replaces the 3D CNN used for cost regularization with a sequence of GRU convolutions. This change reduces GPU memory consumption and enables 3D reconstruction at high resolution. Gu et al. [6] proposed the CasMVSNet model, which builds on the Feature Pyramid Network (FPN) [13] to construct cascaded cost volumes. Thanks to its novel coarse-to-fine architecture, CasMVSNet can process input images from the DTU dataset [17] at native resolution. Similar to CasMVSNet [6], CVP-MVSNet [8] and Fast-MVS [23] also adopt coarse-to-fine frameworks, and both exhibit excellent performance on benchmark datasets [17,18].
Based on the coarse-to-fine cascade framework, UCS-Net [11] further introduces a depth sampling strategy that uses uncertainty estimation to adaptively generate spatially varying depth hypotheses. Vis-MVSNet [9] also uses uncertainty, explicitly inferring and integrating per-pixel occlusion information during multi-view cost volume fusion. PatchMatch [2], a classical stereo matching algorithm, has also been integrated into the learning-based MVS framework, and the resulting model is named PatchmatchNet [2]. Recently, Effi-MVS [10] was proposed, demonstrating a new way of constructing dynamic cost volumes during depth refinement. In addition, TransMVSNet [37] is the first learning-based MVS approach that leverages Transformers [38] to enable powerful long-range global context aggregation within and between images.