Accurate Measurement of Urban Environments: Comparison
Please note this is a comparison between Version 1 by Suan Lee and Version 2 by Rita Xu.

In the field of urban environment analysis research, image segmentation technology that groups important objects in the urban landscape image in pixel units has been the subject of increased attention.

  • urban environment analysis
  • streetscapes
  • image segmentation

1. Introduction

Semantic segmentation is a major task in computer vision whose purpose is to group similar regions of an image. Since digital images are composed of pixels, image segmentation techniques segment an image by predicting the class of every pixel. As segmentation technologies based on convolutional neural networks have been actively researched in recent years, they have been adopted in fields such as autonomous driving and medical imaging. These technologies must predict the shape of objects while also detecting their location in an image. Because image segmentation can extract information that is important for urban environment analysis from landscape images, such as the amount of green space, foreground openness, and the proportion of roads and building sidewalks in a scene, urban environment analysis research is also paying attention to this technology [1,2]. The technique is extensively employed in research that aims to enhance human activities in urban settings, such as assessing the walking environment of individuals in street settings, gauging the greenery of trees, or identifying areas with high crime rates [2,3]. Although algorithms that recognize objects and partition regions with full accuracy do not yet exist, the technique is widely used in various fields because it can quickly and effectively process large-scale data that are often challenging and expensive to examine individually in the field [4,5].
Studies have also used semantic segmentation to analyze images viewed from above, such as aerial or satellite images [6]. More recent studies, however, aim to identify visual characteristics from the human point of view, using sources such as Google Street View (GSV) [7,8]. As a result, efforts are being made to classify pedestrian-friendly streets by measuring the amount of roadside greenery in pixel units, following recent research suggesting that greenery in street environments positively affects pedestrian experience [9,10]. Furthermore, researchers are attempting to verify the effect of greenery by comparing the results with actual field survey data [11,12].
Most prior studies on semantic segmentation of collected GSV data employ models pre-trained on the Cityscapes dataset [13]. However, the accuracy of these models is suboptimal because the image characteristics of the data used for analysis differ from those of the data used for training [11]. Firstly, the images are captured in different locations, so the overall scene, such as the shape of the surrounding buildings and the road structure, varies from the training data. Secondly, raw images whose sizes and aspect ratios differ from those of the existing dataset must be cropped or reshaped, which reduces accuracy. Furthermore, accuracy is low because the classes required for calculating the green area ratio differ from the classes provided in Cityscapes [12]. In other words, the gap between the published dataset and the GSV data leads to low accuracy. Therefore, more accurate techniques are needed to segment unlabeled GSV images and identify pedestrian-friendly streets.
Semantic segmentation widely uses Intersection over Union (IoU) as an evaluation metric. For a given class, IoU is the ratio of the number of pixels in the intersection of the predicted and ground-truth regions to the number of pixels in their union. In general, the higher the value, the more accurate the segmentation is considered. Each image segmentation model has distinct characteristics, and its accuracy may vary for a specific class. Therefore, even if a model's average IoU is high, it may have a lower IoU than other models for certain classes. In research fields dealing with time series data, there are cases of building hybrid models that actively adopt the advantages of several models [14,15]. Based on these cases, in this paper, we propose a hybrid segmentation method to accurately predict unlabeled GSV images by exploiting the unique strengths of different segmentation models. Specifically, we built a hybrid model using the SegNet [16,17] and DeepLabv3+ [18] models. In general, DeepLab has a higher IoU than SegNet, but SegNet's results are often more accurate than DeepLab's for the contours of relatively small and complex objects such as people and cars. On the other hand, DeepLab has high segmentation accuracy for large objects such as roads, buildings, and the sky.
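As an illustration of the metric, the following is a minimal sketch of per-class IoU and mean IoU computed from two small label maps. The function names, the tiny 2x4 label maps, and the three class ids are hypothetical and chosen only for the example; they are not taken from the paper's implementation.

```python
def per_class_iou(pred, truth, num_classes):
    """Return one IoU value per class (None if the class appears in neither map)."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                # Pixel is in both the predicted and the true region of class p.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel belongs to the union of class p (predicted) and class t (true).
                union[p] += 1
                union[t] += 1
    return [i / u if u else None for i, u in zip(intersection, union)]


def mean_iou(ious):
    """Average the IoU over the classes that actually occur."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


# Hypothetical ground truth and prediction with three classes (0, 1, 2).
truth = [[0, 0, 1, 1],
         [0, 0, 2, 2]]
pred = [[0, 0, 1, 1],
        [0, 0, 0, 2]]

ious = per_class_iou(pred, truth, num_classes=3)
miou = mean_iou(ious)
print(ious, miou)
```

In this toy case one mislabeled pixel leaves class 1 untouched but halves the IoU of class 2, which shows how a reasonable average can hide a weak class.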

2. Green Area Measuring

Recent studies have increasingly applied deep learning techniques to analyze green spaces in urban areas, predominantly utilizing Google Street View (GSV) imagery to determine the amount of green space in each image. Researchers such as Li et al. [19], Lu et al. [20], Seiferling et al. [21], Wang et al. [22], and Yin and Wang [3] have demonstrated the potential of GSV for evaluating the tree cover of streets in Manhattan by computing the Green View Index (GVI) of each image and extracting the pixels occupied by plants. However, most studies using GSV have only measured GVI without distinguishing among trees, shrubs, and lawns. Some recent studies have attempted to overcome these limitations by classifying different tree species. Zarrin developed a new strategy for detecting various tree species based on their leaves [23]. Furthermore, Sun et al. proposed a method for classifying the type of vegetation (i.e., tree, low-lying vegetation, grass) in street view images [24]. More recently, Choi et al. utilized a semantic segmentation algorithm and graphical analysis to estimate tree profile parameters by determining the relative location of the interface between trees and the ground surface [12]. However, these studies still face limitations, as they fail to properly consider the morphological and phenological characteristics of each tree species. To compute the amount of green vegetation in a given area, researchers have employed two primary methods: (1) color band methods, which extract information based on pixel color, and (2) semantic segmentation techniques, which distinguish between natural greenery and non-vegetated surfaces, such as buildings and roads. The resulting GVI is typically expressed as a percentage or numerical score that reflects the proportion of green vegetation in a specific area.
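The two ways of computing GVI described above can be sketched as follows. This is a simplified illustration under stated assumptions: the vegetation class id (8), the color rule (green channel exceeding red and blue by a fixed margin), the margin value, and the function names are all hypothetical choices for the example, not the procedures used in the cited studies.

```python
def gvi_from_labels(label_map, vegetation_ids):
    """GVI (%) from a segmentation label map: share of pixels labeled as vegetation."""
    total = sum(len(row) for row in label_map)
    green = sum(1 for row in label_map for label in row if label in vegetation_ids)
    return 100.0 * green / total


def gvi_from_color(rgb_image, margin=20):
    """GVI (%) from pixel color: a pixel counts as green when its G value
    exceeds both R and B by more than `margin` (an assumed, illustrative rule)."""
    total = sum(len(row) for row in rgb_image)
    green = sum(
        1
        for row in rgb_image
        for (r, g, b) in row
        if g - r > margin and g - b > margin
    )
    return 100.0 * green / total


# Hypothetical label map in which class id 8 stands for vegetation.
label_map = [[8, 8, 0, 1],
             [8, 2, 0, 1]]
gvi_seg = gvi_from_labels(label_map, vegetation_ids={8})

# Hypothetical 1x4 RGB image: two clearly green pixels, one gray, one muted.
rgb_image = [[(30, 120, 40), (200, 200, 200), (90, 100, 95), (20, 80, 30)]]
gvi_color = gvi_from_color(rgb_image)

print(gvi_seg, gvi_color)
```

The color rule counts anything green, including painted surfaces, whereas the label-based version depends entirely on the quality of the segmentation, which is why segmentation accuracy matters for GVI.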

3. Image Segmentation

Long et al. [25] proposed a method for semantic segmentation using fully convolutional networks, which replaced the fully connected layer used in general CNN models for image classification with a convolutional layer for pixel-level classification. They also introduced a skip layer to improve accuracy during the up-sampling process. Similarly, Badrinarayanan et al. [16,17] developed SegNet, which combines the advantages of DeconvNet [26] and U-Net [27]. SegNet uses pooling indices instead of copying and cropping the entire feature map to improve memory efficiency, and removes the fully connected layer used in DeconvNet to reduce the number of parameters. Another notable method is DeepLab [28], proposed by Chen et al., which uses an atrous convolution-based semantic segmentation architecture and atrous spatial pyramid pooling (ASPP) [18,29,30] to improve the architecture. The authors proposed Panoptic DeepLab [31], which transforms the ASPP and decoder of DeepLabv3+ [18] into a dual form, with each decoder producing semantic and instance information as outputs for panoptic segmentation. Cheng et al. proposed Mask2Former [32] for universal image segmentation, which modifies the architecture of vision transformer [33] by applying transformer [34] and BERT [35] to computer vision. Mask2Former extracts localized features by constraining cross-attention within predicted mask regions. Kabilan et al. [36] improved segmentation accuracy and reduced complexity by using a three-step segmentation process involving the analysis of key components, faster mapping of similar objects, and segmentation of similar areas through color mapping. Semi-supervised learning is often used to solve the problem of insufficient labeled data, but it has a lazy mimicking problem. To address this issue, Huo et al. [37] proposed ATSO, a model that partitions unlabeled training data into two subsets and alternately uses one subset to fine-tune the model, updating labels on the other subset.
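To make the idea of atrous (dilated) convolution concrete, the following is a minimal one-dimensional sketch, not DeepLab's implementation. The function name, the signal, and the kernel are hypothetical; the operation is written as a cross-correlation without padding, which is how convolution layers are usually defined in deep learning.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D atrous convolution: kernel taps are spaced `rate` samples apart.
    rate=1 reduces to an ordinary (valid, unpadded) convolution."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(w * signal[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, rate):
    """Number of input samples covered by one output value."""
    return (kernel_size - 1) * rate + 1


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # same three weights for both rates

dense = atrous_conv1d(signal, kernel, rate=1)    # covers 3 input samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # covers 5 input samples

print(dense, receptive_field(3, 1))
print(dilated, receptive_field(3, 2))
```

With the same three weights, the larger rate lets each output value see a wider stretch of the input. ASPP applies several such rates in parallel to capture context at multiple scales.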

4. Hybrid and Fusion Scheme

There have been numerous studies on applying hybrid and fusion methods for the segmentation of urban and satellite images. Li et al. proposed a hybrid convolutional network (HCN) comprising U-Net and VGG sub-networks, which was applied for road segmentation [38]. Wang et al. developed a remote sensing image segmentation method using a hybrid method (division and merge) [39]. Khoshboresh et al. proposed a novel hybrid method that combines deep convolutional neural networks and a restricted Boltzmann machine (RBM) to take advantage of the semantic segmentation of high-resolution airborne imagery for automatic building detection [40]. Sun et al. proposed a novel RGB and thermal data fusion network called FuseSeg, which achieved superior performance in the semantic segmentation of urban scenes [41]. Khan et al. developed a hybrid deep learning model that combines the benefits of two deep models, i.e., DenseNet and U-Net [42]. Niu et al. proposed a novel attention-based framework named hybrid multiple attention network (HMANet) that adaptively captures global correlations from the perspective of space, channel, and category in a more effective and efficient manner [43]. Abdollahi et al. introduced two new deep convolutional models, the multilevel context gate UNet (MCg-UNet) and the bidirectional ConvLSTM UNet model (BCL-UNet), based on the UNet family for multi-object segmentation such as roads and buildings in aerial images [44]. Chen et al. presented a pipeline of hybrid supervision that designs auxiliary segmentation models using boundary box attention modules and boundary box filter modules [45]. Various deep learning models have been proposed to address the problem of semantic image segmentation, leveraging multiple information sources to achieve improved performance. For instance, Zhang et al. presented a hybrid deep neural network that combines a transformer and CNN for the semantic segmentation of very high-resolution remote sensing imagery [46]. 
Another study by Luo et al. introduced a hybrid convolutional neural network (H-ConvNet) to improve urban land cover mapping with MSR Sentinel-2 images [47]. Li et al. proposed a novel hybrid contrastive regularization (HybridCR) framework in a weakly supervised setting, which obtained competitive performance compared to its fully supervised counterpart [48]. Hossain et al. proposed a hybrid segmentation method with modifications such as using the reference polygon to identify optimal parameters and a donut-filling technique to reduce over-segmentation caused by roof elements and illumination differences [49]. Other models have leveraged multimodal fusion to achieve optimal joint predictions. For example, Valdez-Rodríguez et al. proposed a hybrid 2D-3D CNN architecture capable of obtaining semantic segmentation and depth estimation simultaneously [50]. Wang et al. presented a Bilateral Awareness Network that fully captures long-range relationships and fine-grained details in Very Fine Resolution (VFR) images using a dependency path and a texture path [51]. Men et al. proposed a novel model called Concatenated Residual Attention UNet (CRAUNet), which combines the residual structure and channel attention mechanism [52]. Another study by Wang et al. introduced a Transformer-based decoder and constructed a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation [53]. Finally, to take advantage of both CNN and Transformer, a novel Adaptive Enhanced Swin Transformer with U-Net (AESwin-UNet) was proposed for remote sensing segmentation [53,54].