Urban Scene Reconstruction via Neural Radiance Fields: Comparison

3D reconstruction of urban scenes is an important research topic in remote sensing. Neural Radiance Fields (NeRFs) offer an efficient solution for both structure recovery and novel view synthesis. The realistic 3D urban models generated by NeRFs have potential future applications in simulation for autonomous driving, as well as in Augmented and Virtual Reality (AR/VR) experiences.

  • neural radiance field
  • voxelization
  • camera pose estimation
  • multi-sensor fusion
  • 3D reconstruction

1. Introduction

The acceleration of urbanization creates challenges in constructing intelligent/digital cities, which requires understanding and modeling of urban scenes. Over the past years, data-driven deep learning models have been widely adopted for scene understanding [1]. However, deep learning models are often hindered by the domain gap [2,3] and depend heavily on vast amounts of annotated training data that are costly and complex to collect and label, particularly for multi-sensor data annotation [4]. 3D reconstruction [5,6] can be used not only for data augmentation but also for direct 3D modeling of urban scenes [7]. Specifically, in remote sensing mapping [8,9,10,11,12], it can generate high-precision digital surface models from multi-view satellite images [13,14] and combine the diversity of virtual environments with the richness of the real world, producing data that are more controllable and realistic than simulated data.
With the emergence of Neural Radiance Fields (NeRF) [15], research on 3D reconstruction algorithms has progressed rapidly [16]. Many researchers have applied the NeRF model to remote sensing mapping [17,18]. Compared to classic 3D reconstruction methods with explicit geometric representations, NeRF’s neural implicit representation is smooth, continuous, differentiable, and better able to handle complex lighting effects. It can render high-quality images from novel viewpoints given camera images and six-degrees-of-freedom camera poses [19,20,21]. The core idea of NeRF is to represent the scene as a density and radiance field encoded by a multi-layer perceptron (MLP) network and to train the MLP with differentiable volume rendering. Although NeRF can achieve satisfactory rendering quality, training the deep neural network is time-consuming, i.e., it takes hours or days, which limits its application. Recent studies show that voxel grid-based methods, such as Plenoxels [22], NSVF [23], DVGO [24] and Instant-NGP [25], can train NeRF within a few hours and reduce memory consumption through voxel pruning [23,24] and hash indexing [25]. Depth supervision-based methods such as DS-NeRF [26] utilize the sparse 3D point cloud output by COLMAP [27] to guide NeRF’s scene geometry learning and accelerate convergence.
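To make the depth-supervision idea concrete, the following minimal PyTorch sketch adds an L2 penalty on the expected ray-termination depth for rays that hit a sparse COLMAP point. It is only an illustration of the concept: DS-NeRF's actual objective is formulated differently (as a loss on the ray-termination distribution), and all function and variable names here are hypothetical.

```python
import torch

def depth_supervised_loss(weights, t_vals, colors_pred, colors_gt,
                          sparse_depth, has_depth, lambda_depth=0.1):
    """Illustrative sketch of combining NeRF's photometric loss with a
    depth term on rays matched to sparse COLMAP points (DS-NeRF idea).

    weights:      (N_rays, N_samples) volume-rendering weights
    t_vals:       (N_rays, N_samples) sample distances along each ray
    colors_pred:  (N_rays, 3) rendered colors
    colors_gt:    (N_rays, 3) ground-truth pixel colors
    sparse_depth: (N_rays,)   depth from the sparse point cloud (0 if none)
    has_depth:    (N_rays,)   boolean mask of rays with a sparse depth
    """
    # Standard NeRF photometric loss
    rgb_loss = ((colors_pred - colors_gt) ** 2).mean()

    # Expected ray-termination depth from the volume-rendering weights
    depth_pred = (weights * t_vals).sum(dim=-1)

    # Penalize deviation only where sparse geometry is available
    depth_loss = ((depth_pred - sparse_depth)[has_depth] ** 2).mean()

    return rgb_loss + lambda_depth * depth_loss
```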
Although these methods demonstrate robust 3D reconstruction in bounded scenes, they face several challenges when applied to unbounded urban scenarios. First, handling large-scale scenes with relatively fixed data collection is a common requirement. A NeRF representation requires spatial division of the 3D environment. Although NeRF++ [28] separates the scene into foreground and background networks for training, extending NeRF to unbounded scenes, dividing large-scale scenes requires more storage and computational resources, and the algorithm is difficult to use without further optimization. Real outdoor scenarios such as urban environments typically cover areas of hundreds of square meters, which presents a significant challenge for NeRF representation. In addition, urban scene data are usually collected with cameras mounted on the ground or on unmanned aerial vehicles, without focusing on any specific part of the scene. Some parts of the scene may therefore be observed rarely or missed entirely by the cameras, while others are captured many times from multiple viewpoints. Such uneven observation coverage increases the difficulty of reconstruction [29,30].
Another challenge for NeRF methods is complex scenes with highly variable environments. A scene often contains a variety of target objects, such as buildings, signs, vehicles, and vegetation. These targets differ significantly in appearance, geometric shape, and occlusion relationships, and reconstructing such diverse targets is limited by model capacity, memory, and computational resources. Additionally, because cameras usually use automatic exposure, captured images often show large exposure variation depending on the lighting conditions. NeRF-W [31] addresses occlusions and lighting changes in the scene through transient object embedding and latent appearance modeling. However, its rendering quality drops for areas, such as the ground, that appear rarely in training images, and blurriness often appears where camera poses are inaccurate. Thus, relying solely on image data makes accurate camera pose estimation difficult, leading to low-quality 3D reconstruction.
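The latent appearance modeling mentioned above can be illustrated with a short PyTorch sketch: each training image receives a learned embedding that conditions only the color branch, so per-image exposure and lighting are absorbed without disturbing the shared geometry. This is an illustrative simplification of NeRF-W's design, not its exact architecture; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AppearanceConditionedColorHead(nn.Module):
    """Illustrative sketch (not NeRF-W's exact architecture): each training
    image gets a learned appearance embedding that conditions the color
    branch, so per-image exposure/lighting can be absorbed without changing
    the shared geometry (density) prediction."""

    def __init__(self, num_images, feat_dim=256, embed_dim=48):
        super().__init__()
        self.appearance = nn.Embedding(num_images, embed_dim)
        self.color_mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),
        )

    def forward(self, point_features, image_ids):
        # point_features: (N, feat_dim) features from the shared NeRF trunk
        # image_ids:      (N,) index of the source image for each ray
        embed = self.appearance(image_ids)           # (N, embed_dim)
        return self.color_mlp(torch.cat([point_features, embed], dim=-1))
```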
This problem can be alleviated by using 3D LiDAR for pose inference and urban scene 3D geometric reconstruction [32,33,34,35]; however, LiDAR point clouds also have inherent disadvantages. Their resolution is usually low, and it is very difficult to obtain point cloud returns on glossy or transparent surfaces. To address these issues, Google proposed Urban Radiance Fields in 2021 [36], which compensates for scene sparsity with LiDAR point clouds and supervises rays pointing at the sky through image segmentation, addressing lighting changes in the scene.
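The sky supervision can be sketched as a simple opacity constraint: rays that a 2D segmentation labels as sky should not terminate on any geometry, so their accumulated volume-rendering weight is pushed toward zero, while the LiDAR term follows the same expected-depth idea as the earlier depth-supervision sketch. The exact losses and weightings in Urban Radiance Fields differ; the code below is only a hedged illustration with hypothetical names.

```python
import torch

def sky_opacity_loss(weights, sky_mask):
    """Illustrative sketch of supervising sky rays: rays labeled as 'sky'
    by a 2D segmentation should not hit geometry, so their accumulated
    opacity is pushed toward zero. (The paper's exact loss may differ.)

    weights:  (N_rays, N_samples) volume-rendering weights
    sky_mask: (N_rays,) boolean mask of rays whose pixels are sky
    """
    opacity = weights.sum(dim=-1)        # (N_rays,) accumulated alpha
    return (opacity[sky_mask] ** 2).mean()
```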

2. Classic Methods of 3D Reconstruction

Classic 3D reconstruction methods initially collate data into explicit 3D scene representations, such as textured meshes [37] or primitive shapes [38]. Although effective for large diffuse surfaces, these methods cannot handle urban scenes well due to their complex geometric structures. Alternative methods use 3D volumetric representations such as voxels [39] and octrees [40], but their resolution is limited and their storage demands for discrete volumes are high. For large-scale urban scenes, Li et al. proposed AADS [41], which uses images and LiDAR point clouds for reconstruction, combining perception algorithms and manual annotation to build a 3D point cloud representation of moving foreground objects. In contrast, SurfelGAN [42] employs surfels for 3D modeling, capturing the 3D semantic and appearance information of all scene objects. These methods rely on explicit 3D reconstruction algorithms such as SfM [43] and MVS [44], which recover dense 3D point clouds from multi-view imagery [45]. However, the resulting 3D models often contain artifacts and holes in weakly textured or specular regions and require further processing for novel view synthesis.

3. Neural Radiance Fields

3.1. Theory of Neural Radiance Fields

Neural rendering techniques, exemplified by Neural Radiance Fields (NeRF) [15], allow neural networks to implicitly learn static 3D scenes from a series of 2D images. Once the network has been trained, it can render 2D images from any viewpoint. More specifically, a multi-layer perceptron (MLP) is employed to represent the scene. The MLP takes the 3D position of a spatial point and a 2D viewing direction as inputs and maps them to the density and color at that location. A differentiable volume rendering method [46] is then used to synthesize novel views. Typically, this representation is trained for a specific scene: given a set of input camera images and poses, NeRF uses gradient descent to fit the function by minimizing the color error between rendered results and real images.
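The differentiable volume rendering step can be summarized in a few lines of PyTorch. Given per-sample densities and colors predicted by the MLP along a ray, alpha compositing produces the pixel color (and, as a by-product, an expected depth). The following is a minimal sketch with illustrative names, not a complete renderer.

```python
import torch

def render_rays(density, color, t_vals):
    """Minimal sketch of NeRF-style differentiable volume rendering:
    alpha-composite per-sample colors along each ray using predicted densities.

    density: (N_rays, N_samples)      sigma at each sample
    color:   (N_rays, N_samples, 3)   RGB at each sample
    t_vals:  (N_rays, N_samples)      sample distances along the ray
    """
    # Distance between adjacent samples (pad the last interval)
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)

    # Opacity of each interval and accumulated transmittance along the ray
    alpha = 1.0 - torch.exp(-density * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]
    weights = alpha * trans                          # (N_rays, N_samples)

    rgb = (weights[..., None] * color).sum(dim=-2)   # (N_rays, 3) pixel color
    depth = (weights * t_vals).sum(dim=-1)           # expected termination depth
    return rgb, depth, weights
```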

3.2. Advance in NeRF

Many studies [47,48,49,50,51,52,53,54] have augmented the original NeRF, enhancing reconstruction accuracy, rendering efficiency, and generalization. MetaNeRF [55] improved accuracy by leveraging data-driven priors from training scenes to supplement missing information in test scenes. NeRFactor [56] employed MLP-based factorization to extract illumination, object surface normals, and light fields. Barron et al. [57] substituted a view cone for line-of-sight sampling, reducing jagged artifacts and blur. Addressing NeRF’s oversampling of empty space, Liu et al. [23] proposed a sparse voxel octree structure for 3D modeling. Plenoxels [22] bypassed large MLP models for predicting density and color and instead stored these values directly on a voxel grid. Instant-NGP [25] and DVGO [24] constructed feature and density grids, computing per-point densities and colors from interpolated feature vectors with compact MLP networks. To improve generalizability, Yu et al. [20] introduced PixelNeRF, which performs view synthesis from minimal input by integrating spatial image features at the pixel level. Recently, Martin-Brualla et al. [31] reconstructed various outdoor landmark structures using data sourced from the Internet. DS-NeRF [26] reconstructed sparse 3D point clouds with COLMAP [27] and used the resulting depth information to supervise the NeRF objective, thereby accelerating convergence of the scene geometry.
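The grid-based acceleration shared by DVGO, Plenoxels, and (in hashed form) Instant-NGP can be illustrated by a single trilinear-interpolation query into a learnable feature volume; a compact MLP (or spherical-harmonic coefficients) then decodes the interpolated feature. The sketch below is a generic illustration of this idea, not any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def query_feature_grid(grid, points):
    """Illustrative sketch of grid-based radiance fields: features live on a
    dense voxel grid and are read by trilinear interpolation, so only a tiny
    MLP (or none at all) is needed per sample point.

    grid:   (1, C, D, H, W) learnable feature volume
    points: (N, 3) coordinates already normalized to [-1, 1]^3
    returns (N, C) interpolated feature vectors
    """
    # grid_sample expects a (1, 1, 1, N, 3) sampling grid in (x, y, z) order;
    # mode="bilinear" performs trilinear interpolation for 5D inputs.
    sample = points.view(1, 1, 1, -1, 3)
    feats = F.grid_sample(grid, sample, mode="bilinear", align_corners=True)
    return feats.view(grid.shape[1], -1).t()   # (N, C)
```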

4. Application of NeRF in Urban Scene

Some researchers have applied NeRF to urban scenes. Zhang et al. [28] addressed the parameterization challenge of extensive, unbounded 3D scenes by dividing the scene into foreground and background with an inverted-sphere parameterization. Urban Radiance Fields [36] from Google used LiDAR data to counteract scene sparsity and employed a per-camera affine color estimation to automatically compensate for variable exposure. Block-NeRF [58] decomposed city-scale scenes into individually trained neural radiance fields, decoupling rendering time from scene size. Moreover, City-NeRF [59] grew the network model and training set concurrently, incorporating new training blocks during training to enable multi-scale rendering from satellite to ground-level imagery.
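The inverted-sphere idea used for the unbounded background can be sketched as a simple coordinate contraction: a point outside the unit sphere is represented by its direction on the sphere plus an inverse radius, so the unbounded exterior maps into a bounded domain. The code below is a hedged illustration under that assumption; NeRF++ itself additionally trains separate foreground and background MLPs.

```python
import torch

def inverted_sphere_param(points, eps=1e-6):
    """Illustrative sketch of an inverted-sphere parameterization for
    background points: each point outside the unit sphere is mapped to a
    unit direction plus an inverse radius, giving bounded 4D coordinates.

    points: (N, 3) world-space coordinates of background samples
    returns (N, 4) contracted coordinates (x', y', z', 1/r)
    """
    # Clamp the radius so this mapping only applies outside the unit sphere
    r = points.norm(dim=-1, keepdim=True).clamp(min=1.0 + eps)
    unit_dir = points / r          # direction on the unit sphere
    inv_r = 1.0 / r                # in (0, 1], finite for any distance
    return torch.cat([unit_dir, inv_r], dim=-1)
```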

5. Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (No. 62302220), in part by the China Postdoctoral Science Foundation (No. 2023M731691), and in part by the Jiangsu Funding Program for Excellent Postdoctoral Talent (No. 2022ZB268).
