Multi-View Stereo Method: Comparison
Please note this is a comparison between Version 2 by Guangzheng Wu and Version 1 by Guangzheng Wu.

Multi-view stereo (MVS) is a 3D reconstruction method that plays a vital role in 3D computer vision, with wide-ranging applications in virtual reality, augmented reality, and autonomous driving. With the rapid development of deep learning in computer vision, learning-based multi-view stereo methods have achieved state-of-the-art results.

  • multi-view stereo
  • cost volume

1. Introduction

Multi-view stereo (MVS) is a 3D reconstruction method that plays a vital role in 3D computer vision, with wide-ranging applications in virtual reality, augmented reality, and autonomous driving. Taking a series of images from different viewpoints and the corresponding camera parameters as input, multi-view stereo estimates the depth of each pixel and generates a 3D representation of the observed scene. As a key problem in 3D computer vision, multi-view stereo has received extensive research attention [1,2,3,4].
In recent years, with the rapid development of deep learning in computer vision, learning-based multi-view stereo methods have achieved state-of-the-art results [4,5,6]. Learning-based multi-view stereo algorithms usually consist of several components: feature extraction, depth sampling, cost volume construction, cost volume regularization, and depth regression. However, their large GPU memory requirements not only restrict processing to low-resolution images but also hinder adoption on edge computing devices. In real-world 3D vision applications, deployed devices often have limited computing resources. For example, in autonomous driving scenarios, LiDAR data is often processed with 3D point cloud compression to reduce storage and transmission costs [7]. Unlike LiDAR data processing, the main computational challenge in multi-view stereo lies in generating a point cloud from the input 2D images and camera parameters. Reducing the memory consumption of the algorithm can therefore greatly improve the practicality of the technology. Recently, many researchers have proposed improved methods to address the high computational cost of learning-based multi-view stereo. In particular, coarse-to-fine architectures have been widely used to design efficient multi-view stereo networks [6,8,9,10,11,12].
Typically, in these methods, the initial cost volume is built at a low resolution rather than at a fixed resolution; new cost volumes are then built iteratively at higher resolutions based on the result of the previous stage, and a depth map is finally obtained. Progressively narrowing the depth hypothesis range across stages [6,8,9,10,11,12] is also a key strategy for reducing computation. Although the coarse-stage output, as the input to fine-stage cost volume construction, is of great significance to the final result, these existing methods need to pay more attention to the feature information of the coarse stage. If feature extraction in the coarse stage is insufficient, poor initial results may adversely affect the subsequent stages and the final output. However, dense feature extraction always increases the computational load and GPU memory consumption, so accuracy and computational efficiency still need to be balanced.
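The memory argument for coarse-to-fine designs can be made concrete with a little arithmetic. The sketch below is only an illustration: the resolutions, depth-plane counts, and channel counts are hypothetical values chosen for the example, not the settings of any cited network, and the helper name cost_volume_bytes is introduced here.

```python
def cost_volume_bytes(height, width, num_depths, channels, bytes_per_value=4):
    """Bytes needed to store one H x W x D x C cost volume (float32 by default)."""
    return height * width * num_depths * channels * bytes_per_value


# Hypothetical single-stage setup: full-resolution volume with many depth planes.
single_stage = cost_volume_bytes(512, 640, 192, 32)

# Hypothetical three-stage cascade: resolution doubles at each stage while the
# number of depth planes and feature channels shrinks.
cascade_stages = [
    (128, 160, 48, 32),  # coarse: 1/4 resolution, full depth range
    (256, 320, 32, 16),  # medium: 1/2 resolution, narrowed depth range
    (512, 640, 8, 8),    # fine: full resolution, narrowest depth range
]
cascade_peak = max(cost_volume_bytes(*stage) for stage in cascade_stages)

print(f"single stage: {single_stage / 2**20:.0f} MiB")
print(f"cascade peak: {cascade_peak / 2**20:.0f} MiB")
```

Under these assumed numbers the largest cascade volume is a small fraction of the single-stage volume, which is why narrowing the depth range lets the finer stages run at higher resolution.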
In addition, another challenge for cascade-based multi-view stereo is adapting the depth hypothesis range. In the initial stage, the plane sweep covers the entire plausible depth range. In many cascade-based algorithms [6,8,10,12], the depth estimated in the previous stage is then used as the center of the sampling interval when generating depth hypotheses at a finer stage, and every pixel has a fixed sampling distance within its stage. However, a uniform sampling distance for every pixel is not ideal, because the refinement needed differs between pixels in the same depth map: some pixels may already have a stable depth, while others may still vary significantly. With this challenge in mind, Cheng et al. [11] used the probability distribution of each pixel to set the sampling distance; however, this approach does not perform well in terms of GPU memory usage and runtime, and its training time is also long.
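The difference between a fixed sampling distance and a per-pixel, distribution-driven one can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation of any cited method: the function names, the scale factor, and the use of mean plus or minus a multiple of the standard deviation are choices made for the example.

```python
import numpy as np


def fixed_interval_hypotheses(prev_depth, interval, num_planes):
    """Sample num_planes depths centered on prev_depth with one shared interval.

    prev_depth has shape (H, W); the result has shape (num_planes, H, W).
    """
    offsets = (np.arange(num_planes) - (num_planes - 1) / 2.0) * interval
    return prev_depth[None, :, :] + offsets[:, None, None]


def adaptive_hypotheses(prob, hypotheses, num_planes, scale=1.5):
    """Sample per-pixel depths within mean +/- scale * std of the previous stage.

    prob and hypotheses have shape (D, H, W); prob sums to 1 along axis 0.
    The scale factor is an assumed hyperparameter.
    """
    mean = (prob * hypotheses).sum(axis=0)
    var = (prob * (hypotheses - mean[None]) ** 2).sum(axis=0)
    std = np.sqrt(var)
    steps = np.linspace(-1.0, 1.0, num_planes)
    return mean[None] + steps[:, None, None] * (scale * std)[None]


# Every pixel gets the same spacing around its previous estimate.
prev_depth = np.full((1, 2), 1.5)
print(fixed_interval_hypotheses(prev_depth, interval=0.1, num_planes=3)[:, 0, :])

# Two pixels with the same mean depth but different confidence.
hypotheses = np.linspace(1.0, 2.0, 5)[:, None, None] * np.ones((5, 1, 2))
prob = np.zeros((5, 1, 2))
prob[2, 0, 0] = 1.0   # pixel 0: all probability mass on one plane (confident)
prob[:, 0, 1] = 0.2   # pixel 1: uniform distribution (uncertain)
print(adaptive_hypotheses(prob, hypotheses, num_planes=3)[:, 0, :])
```

In the adaptive variant the confident pixel collapses to a single depth while the uncertain pixel keeps a wide search range, which is the behavior a uniform interval cannot express.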

2. Multi-View Stereo Methods

2.1. Traditional Multi-View Stereo Methods

Multi-view stereo (MVS), a fundamental problem in 3D reconstruction within computer vision, addresses the recovery of a scene's spatial geometry from photographs. Before the advent of deep learning, it had already attracted a great deal of attention and made substantial progress. Traditional multi-view stereo methods can be broadly divided into four categories: voxel-based methods [25,26,27,28,29], mesh-based methods [30,31], surfel-based methods [19,32,33], and depth map-based methods [1,20,21,34,35]. Voxel-based methods divide space into a set of voxels and require extremely high memory consumption. Mesh-based methods are less reliable because their final reconstruction quality depends on their initialization. Surfel-based methods represent the surface as a set of surfels, which is simple yet efficient; however, they require additional cumbersome post-processing steps to generate the final 3D model. Depth map-based methods compute a depth value for every pixel in every image, reproject the pixels into 3D space, and then fuse the points into a point cloud model. Among the four categories, depth map-based methods are the most flexible and the most widely used. In recent years they have achieved remarkable success, and well-established algorithmic frameworks are in use, such as Furu [19], Gipuma [21], Tola [20], and COLMAP [1].
Although the performance of traditional multi-view stereo is commendable, some shortcomings remain: high computational requirements, slow processing speed, and poor handling of scenes with weakly textured or reflective surfaces.
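The reprojection step of depth map-based methods can be sketched as follows. This is an illustrative example assuming a pinhole camera with intrinsic matrix K, a 4x4 camera-to-world pose, and depth measured along the camera z-axis; the function name backproject_depth is hypothetical. Real pipelines additionally filter points for cross-view consistency before fusing them, which this sketch omits.

```python
import numpy as np


def backproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (H*W, 3) array of world-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel coordinates
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels = pixels.reshape(-1, 3).astype(float)           # homogeneous pixels
    rays = pixels @ np.linalg.inv(K).T                     # camera rays with z = 1
    cam_points = rays * depth.reshape(-1, 1)               # scale each ray by depth
    cam_h = np.concatenate([cam_points, np.ones((h * w, 1))], axis=1)
    return (cam_h @ cam_to_world.T)[:, :3]                 # move to world frame


# Toy example: 2x2 depth map, focal length 2, principal point (1, 1).
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
points = backproject_depth(np.full((2, 2), 4.0), K, np.eye(4))
print(points)
```

A naive fusion would simply concatenate the arrays returned for each view into one point cloud.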

2.2. Learning-Based Multi-View Stereo Methods

In recent years, with the integration of deep learning, learning-based multi-view stereo methods have developed rapidly and achieved outstanding performance. Yao et al. [4] introduced MVSNet, the first end-to-end learning-based multi-view stereo network, which laid the foundation for the rapid growth of the following years. MVSNet [4] uses a shared-weight 2D CNN to extract feature maps from the input images. Differentiable homography [36] is then applied to warp these feature maps into the reference view. The method uses a series of depth hypothesis planes to construct a cost volume that represents the correlation between the source and reference images. A 3D CNN is then used for cost volume regularization. Finally, the estimated depth map of the reference image is obtained by depth regression. On the DTU benchmark dataset [17], MVSNet [4] not only outperforms previous traditional MVS methods [1,19,20] but also runs much faster. However, because of its high GPU memory consumption, MVSNet can only take low-resolution images as input. A number of learning-based MVS methods have been proposed to address GPU memory consumption. Yao et al. [22] proposed an improved method, R-MVSNet [22], which replaces the 3D CNN with a series of convolutional GRUs. This improvement reduces GPU memory consumption and enables 3D reconstruction at high resolution.
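The cost volume construction and depth regression steps described above can be sketched in NumPy. This is a simplified illustration, not the MVSNet implementation: it assumes the feature maps have already been warped to the reference view, uses variance across views as the matching cost, and replaces the learned 3D CNN regularization with a plain channel average. The function names are introduced for the example.

```python
import numpy as np


def variance_cost_volume(warped_features):
    """Variance across views. Input shape (V, C, D, H, W); output (C, D, H, W)."""
    return warped_features.var(axis=0)


def soft_argmin_depth(cost, depth_values):
    """Expected depth under a softmax over the negated matching cost.

    cost has shape (D, H, W); depth_values has shape (D,).
    """
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return (prob * depth_values[:, None, None]).sum(axis=0)


# Toy input: 3 views, 2 channels, 5 depth planes, 2x2 pixels.
# The views disagree on every plane except index 2, where they match exactly.
features = np.zeros((3, 2, 5, 2, 2))
features += np.arange(3).reshape(3, 1, 1, 1, 1)
features[:, :, 2] = 0.0

volume = variance_cost_volume(features)
cost = 50.0 * volume.mean(axis=0)        # stand-in for learned regularization
depth_values = np.linspace(1.0, 2.0, 5)
print(soft_argmin_depth(cost, depth_values))
```

Because the regression is an expectation over depth planes rather than a hard argmin, the output is continuous and differentiable, which is what allows end-to-end training.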
Gu et al. [6] proposed CasMVSNet, which builds cascade cost volumes on top of a Feature Pyramid Network (FPN) [13]. Thanks to its novel coarse-to-fine architecture, CasMVSNet can process input images from the DTU dataset [17] at their original resolution. Like CasMVSNet [6], CVP-MVSNet [8] and Fast-MVS [23] also adopt coarse-to-fine frameworks, and both perform excellently on benchmark datasets [17,18]. Building on the coarse-to-fine cascade framework, UCS-Net [11] further introduces a depth sampling strategy that uses uncertainty estimation to adaptively generate spatially varying depth hypotheses. Vis-MVSNet [9] also uses uncertainty to explicitly infer pixel-wise occlusion information and integrate it into multi-view cost volume fusion. PatchMatch [2], a classical stereo matching algorithm, has also been integrated into a learning-based MVS framework; the resulting model is named PatchmatchNet [2]. Recently, Effi-MVS [10] was proposed, demonstrating a new way to construct dynamic cost volumes during depth refinement. In addition, TransMVSNet [37] is the first learning-based MVS method that leverages the Transformer [38] to enable powerful, long-range global context aggregation within and across images.