Accurate Measurement of Urban Environments

Please note this is a comparison between Version 2 by Rita Xu and Version 1 by Suan Lee.

In urban environment analysis research, image segmentation technology, which groups the important objects in an urban landscape image at the pixel level, has attracted increasing attention.

urban environment analysis
streetscapes
image segmentation

1. Introduction

Semantic segmentation is a major task in computer vision whose purpose is to group similar regions of an image. Since digital images are composed of pixels, image segmentation techniques segment an image by predicting the class of every pixel. As segmentation technologies based on convolutional neural networks have been actively researched in recent years, they have been adopted in fields such as autonomous driving and medical imaging. These technologies must predict the shape of objects while also detecting their location in an image. Because image segmentation extracts information that is important for urban environment analysis from landscape images, such as the amount of green space, foreground openness, and the proportion of roads, buildings, and sidewalks in a scene, urban environment analysis research is also paying attention to this technology ^[1][2]. The technique is extensively employed in research that aims to enhance human activities in urban settings, such as assessing the walking environment of individuals in street settings, gauging the greenery of trees, or identifying areas with high crime rates ^[2][3]. Although algorithms that recognize objects and partition regions with full accuracy do not yet exist, the technique is widely used in various fields because it can quickly and effectively process large-scale data, which are often challenging and expensive to examine individually in the field ^[4][5].

Recent studies have also utilized semantic segmentation to analyze images viewed from above, such as aerial or satellite images ^[6]. However, more recent studies aim to identify visual characteristics from the human point of view, using street-level imagery such as Google Street View (GSV) ^[7][8]. As a result, efforts are being made to classify pedestrian-friendly streets by measuring the amount of roadside greenery at the pixel level, following recent research suggesting that greenery in street environments positively affects the pedestrian experience ^[9][10]. Furthermore, researchers are attempting to verify the effect of greenery by comparing the results with actual field survey data ^[11][12].

Most prior studies on the semantic segmentation of collected GSV data employ models pre-trained on the Cityscapes dataset ^[13]. However, the accuracy of these models is suboptimal because the image characteristics of the data being analyzed differ from those of the training data ^[11]. Firstly, the images are captured in different locations, so the overall scene, such as the shape of the surrounding buildings and the road structure, differs from the training data. Secondly, raw images whose sizes and aspect ratios differ from those of the existing dataset must be cropped or reshaped, which reduces accuracy. Furthermore, accuracy is low because the classes required for calculating the green area ratio differ from the classes provided in Cityscapes ^[12]. In other words, the gap between the published dataset and the GSV data leads to low accuracy. Therefore, more accurate techniques are needed to segment unlabeled GSV images and identify pedestrian-friendly streets.

Semantic segmentation is widely evaluated with Intersection over Union (IoU), which for a given class is the ratio of the number of pixels assigned to that class in both the prediction and the ground truth (the intersection) to the number of pixels assigned to it in either one (the union). In general, the higher the IoU, the more accurate the segmentation is considered. Each image segmentation model has distinct characteristics, and its accuracy may vary by class. Therefore, even if a model’s average IoU is high, it may score a lower IoU than other models on certain classes. In research dealing with time series data, there are cases in which a hybrid model is built to adopt the advantages of several models ^[14][15]. Based on these cases, the researchers propose a hybrid segmentation method to accurately predict unlabeled GSV images by exploiting the unique strengths of different segmentation models. Specifically, the researchers built a hybrid model using the SegNet ^[16][17] and DeepLabv3+ ^[18] models. In general, DeepLab has a higher IoU than SegNet, but SegNet’s results are often more accurate than DeepLab’s for the contours of relatively small and complex objects such as people and cars. On the other hand, DeepLab has high segmentation accuracy for large objects such as roads, buildings, and the sky.
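As a rough illustration of the two ideas in this paragraph, the sketch below computes per-class IoU (intersection divided by union of the predicted and ground-truth pixels of a class) and then merges two label maps by class. The merge rule, the function names, the class ids, and the toy label maps are all assumptions made for illustration; the source does not specify how the SegNet and DeepLabv3+ outputs are actually combined.

```python
def iou_per_class(pred, truth, num_classes):
    """Per-class IoU for two label maps; None if a class appears in neither map."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == t:
                # Pixel belongs to both the intersection and the union of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel is in the union of the predicted class and of the true class.
                union[p] += 1
                union[t] += 1
    return [i / u if u else None for i, u in zip(inter, union)]


def fuse_by_class(base_pred, detail_pred, detail_classes):
    """Keep base_pred, except where detail_pred assigns one of detail_classes."""
    return [
        [d if d in detail_classes else b for b, d in zip(b_row, d_row)]
        for b_row, d_row in zip(base_pred, detail_pred)
    ]


# Toy 2x4 label maps with hypothetical ids: 0 = road, 1 = building, 2 = person.
truth = [[0, 0, 2, 1], [0, 0, 2, 1]]
large_pred = [[0, 0, 0, 1], [0, 0, 0, 1]]  # stand-in for a model strong on large regions
small_pred = [[0, 1, 2, 1], [1, 0, 2, 0]]  # stand-in for a model strong on small objects

fused = fuse_by_class(large_pred, small_pred, detail_classes={2})

print(iou_per_class(large_pred, truth, 3))  # person class (id 2) is missed entirely
print(iou_per_class(fused, truth, 3))       # [1.0, 1.0, 1.0] on this toy example
```

In this toy case the first model has a reasonable average IoU but scores zero on the small class, which is the situation the paragraph describes; taking the small-object class from the second model recovers it.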

2. Green Area Measuring

Recent studies have increasingly applied deep learning techniques to analyze green spaces in urban areas, predominantly utilizing Google Street View (GSV) imagery to determine the amount of green space in each image. Researchers such as Li et al. ^[19], Lu et al. ^[20], Seiferling et al. ^[21], Wang et al. ^[22], and Yin and Wang ^[3] have demonstrated the potential of GSV for evaluating the tree cover of streets in Manhattan by computing the Green View Index (GVI) of each image from the pixels occupied by plants. However, most studies using GSV have only measured GVI without distinguishing among trees, shrubs, and lawns. Some recent studies have attempted to overcome these limitations by classifying different tree species. Zarrin developed a new strategy for detecting various tree species based on their leaves ^[23]. Furthermore, Sun et al. proposed a method for classifying the type of vegetation (i.e., tree, low-lying vegetation, grass) in street view images ^[24]. More recently, Choi et al. utilized a semantic segmentation algorithm and graphical analysis to estimate tree profile parameters by determining the relative location of the interface between trees and the ground surface ^[12]. However, these studies still face limitations, as they fail to properly consider the morphological and phenological characteristics of each tree species. To compute the amount of green vegetation in a given area, researchers have employed two primary methods: (1) color band methods, which extract information based on pixel color, and (2) semantic segmentation techniques, which distinguish between natural greenery and non-vegetated surfaces, such as buildings and roads. The resulting GVI is typically expressed as a percentage or numerical score that reflects the proportion of green vegetation in a specific area.
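The two ways of computing GVI mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the green-dominance rule in `gvi_from_rgb` (green channel larger than both red and blue) is a simplified stand-in for a color band method and is not the formula of any cited study, and the class ids and pixel values are hypothetical.

```python
def gvi_from_labels(label_map, vegetation_ids):
    """GVI (%) as the share of pixels whose predicted class is a vegetation class."""
    total = sum(len(row) for row in label_map)
    green = sum(1 for row in label_map for c in row if c in vegetation_ids)
    return 100.0 * green / total


def gvi_from_rgb(image):
    """GVI (%) from a simple color rule: a pixel counts as green if G > R and G > B."""
    total = sum(len(row) for row in image)
    green = sum(1 for row in image for (r, g, b) in row if g > r and g > b)
    return 100.0 * green / total


# Hypothetical class ids: 0 = road, 1 = building, 8 = tree, 9 = grass.
labels = [[8, 8, 1, 1],
          [9, 0, 0, 1]]

# Matching 2x4 RGB image (hypothetical values).
image = [[(30, 120, 40), (25, 110, 35), (150, 150, 150), (140, 140, 145)],
         [(60, 160, 50), (90, 90, 90), (85, 85, 85), (200, 60, 50)]]

print(gvi_from_labels(labels, {8, 9}))  # 37.5
print(gvi_from_rgb(image))              # 37.5
```

The label-based version can report separate ratios for trees, shrubs, and lawns simply by passing a different set of class ids, which a purely color-based rule cannot do.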

3. Image Segmentation

Long et al. ^[25] proposed a method for semantic segmentation using fully convolutional networks, which replaces the fully connected layers used in general CNN models for image classification with convolutional layers for pixel-level classification. They also introduced skip layers to improve accuracy during the up-sampling process. Similarly, Badrinarayanan et al. ^[16][17] developed SegNet, which combines the advantages of DeconvNet ^[26] and U-Net ^[27]. SegNet uses pooling indices instead of copying and cropping the entire feature map to improve memory efficiency, and removes the fully connected layers used in DeconvNet to reduce the number of parameters. Another notable method is DeepLab ^[28], proposed by Chen et al., which uses an atrous convolution-based semantic segmentation architecture and was later improved with atrous spatial pyramid pooling (ASPP) ^[18][29][30]. Panoptic-DeepLab ^[31] was subsequently proposed; it transforms the ASPP and decoder of DeepLabv3+ ^[18] into a dual form, with the two decoders producing semantic and instance information, respectively, as outputs for panoptic segmentation. Cheng et al. proposed Mask2Former ^[32] for universal image segmentation, which modifies the architecture of the Vision Transformer ^[33], itself an application of the Transformer ^[34] and BERT ^[35] to computer vision. Mask2Former extracts localized features by constraining cross-attention within predicted mask regions. Kabilan et al. ^[36] improved segmentation accuracy and reduced complexity by using a three-step segmentation process involving the analysis of key components, faster mapping of similar objects, and segmentation of similar areas through color mapping. Semi-supervised learning is often used to solve the problem of insufficient labeled data, but it suffers from a lazy mimicking problem. To address this issue, Huo et al. ^[37] proposed ATSO, a method that partitions the unlabeled training data into two subsets and alternately uses one subset to fine-tune the model while updating the labels on the other subset.
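Two of the building blocks named in this section can be shown on a one-dimensional toy signal: max pooling that remembers the position of each maximum (the pooling indices that SegNet reuses when up-sampling) and atrous (dilated) convolution, where kernel taps are spaced `rate` samples apart. This is a simplified sketch in plain Python, not the layers of SegNet or DeepLab themselves; the function names and the 1D setting are assumptions chosen for brevity.

```python
def max_pool_with_indices(values, window=2):
    """Non-overlapping max pooling that also returns the position of each maximum."""
    pooled, indices = [], []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        offset = max(range(window), key=lambda k: chunk[k])
        pooled.append(chunk[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, length):
    """Place each pooled value back at its stored position; other positions stay 0."""
    out = [0] * length
    for value, index in zip(pooled, indices):
        out[index] = value
    return out


def atrous_conv1d(values, kernel, rate=1):
    """Valid-mode cross-correlation whose kernel taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate
    return [
        sum(w * values[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(values) - span)
    ]


signal = [1, 5, 2, 0, 7, 3]
pooled, idx = max_pool_with_indices(signal)
restored = max_unpool(pooled, idx, len(signal))
print(pooled, idx)   # [5, 2, 7] [1, 2, 4]
print(restored)      # [0, 5, 2, 0, 7, 0]

ramp = [1, 2, 3, 4, 5, 6]
print(atrous_conv1d(ramp, [1, 0, -1], rate=1))  # [-2, -2, -2, -2]
print(atrous_conv1d(ramp, [1, 0, -1], rate=2))  # [-4, -4]
```

Only the integer indices need to be kept for unpooling, which is why this is cheaper in memory than storing whole encoder feature maps. With `rate=2` the same three-tap kernel covers five input samples, so the receptive field grows without adding weights.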

4. Hybrid and Fusion Scheme

There have been numerous studies on applying hybrid and fusion methods for the segmentation of urban and satellite images. Li et al. proposed a hybrid convolutional network (HCN) comprising U-Net and VGG sub-networks, which was applied for road segmentation ^[38]. Wang et al. developed a remote sensing image segmentation method using a hybrid method (division and merge) ^[39]. Khoshboresh et al. proposed a novel hybrid method that combines deep convolutional neural networks and a restricted Boltzmann machine (RBM) to take advantage of the semantic segmentation of high-resolution airborne imagery for automatic building detection ^[40]. Sun et al. proposed a novel RGB and thermal data fusion network called FuseSeg, which achieved superior performance in the semantic segmentation of urban scenes ^[41]. Khan et al. developed a hybrid deep learning model that combines the benefits of two deep models, i.e., DenseNet and U-Net ^[42]. Niu et al. proposed a novel attention-based framework named hybrid multiple attention network (HMANet) that adaptively captures global correlations from the perspective of space, channel, and category in a more effective and efficient manner ^[43]. Abdollahi et al. introduced two new deep convolutional models, the multilevel context gate UNet (MCg-UNet) and the bidirectional ConvLSTM UNet model (BCL-UNet), based on the UNet family for multi-object segmentation such as roads and buildings in aerial images ^[44]. Chen et al. presented a pipeline of hybrid supervision that designs auxiliary segmentation models using boundary box attention modules and boundary box filter modules ^[45].

Various deep learning models have been proposed to address the problem of semantic image segmentation, leveraging multiple information sources to achieve improved performance. For instance, Zhang et al. presented a hybrid deep neural network that combines a transformer and CNN for the semantic segmentation of very high-resolution remote sensing imagery ^[46]. Another study by Luo et al. introduced a hybrid convolutional neural network (H-ConvNet) to improve urban land cover mapping with MSR Sentinel-2 images ^[47]. Li et al. proposed a novel hybrid contrastive regularization (HybridCR) framework in a weakly supervised setting, which obtained competitive performance compared to its fully supervised counterpart ^[48]. Hossain et al. proposed a hybrid segmentation method with modifications such as using the reference polygon to identify optimal parameters and a donut-filling technique to reduce over-segmentation caused by roof elements and illumination differences ^[49]. Other models have leveraged multimodal fusion to achieve optimal joint predictions. For example, Valdez-Rodríguez et al. proposed a hybrid 2D-3D CNN architecture capable of obtaining semantic segmentation and depth estimation simultaneously ^[50]. Wang et al. presented a Bilateral Awareness Network that fully captures long-range relationships and fine-grained details in Very Fine Resolution (VFR) images using a dependency path and a texture path ^[51]. Men et al. proposed a novel model called Concatenated Residual Attention UNet (CRAUNet), which combines the residual structure and channel attention mechanism ^[52]. Another study by Wang et al. introduced a Transformer-based decoder and constructed a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation ^[53]. Finally, to take advantage of both CNN and Transformer, a novel Adaptive Enhanced Swin Transformer with U-Net (AESwin-UNet) was proposed for remote sensing segmentation ^[53][54].

References

Rousselet, J.; Imbert, C.E.; Dekri, A.; Garcia, J.; Goussard, F.; Vincent, B.; Rossi, J.P. Assessing species distribution using Google Street View: A pilot study with the pine processionary moth. PLoS ONE 2013, 8, e74918.
Rzotkiewicz, A.; Pearson, A.L.; Dougherty, B.V.; Shortridge, A.; Wilson, N. Systematic review of the use of Google Street View in health research: Major themes, strengths, weaknesses and possibilities for future research. Health Place 2018, 52, 240–246.
Yin, L.; Cheng, Q.; Wang, Z.; Shao, Z. ‘Big data’ for pedestrian volume: Exploring the use of Google Street View images for pedestrian counts. Appl. Geogr. 2015, 63, 337–345.
Berland, A.; Lange, D.A. Google Street View shows promise for virtual street tree surveys. Urban For. Urban Green. 2017, 21, 11–15.
Liu, D.; Jiang, Y.; Wang, R.; Lu, Y. Establishing a citywide street tree inventory with street view images and computer vision techniques. Comput. Environ. Urban Syst. 2023, 100, 101924.
Gupta, K.; Kumar, P.; Pathan, S.K.; Sharma, K.P. Urban Neighborhood Green Index–A measure of green spaces in urban areas. Landsc. Urban Plan. 2012, 105, 325–335.
Kim, J.H.; Lee, S.; Hipp, J.R.; Ki, D. Decoding urban landscapes: Google street view and measurement sensitivity. Comput. Environ. Urban Syst. 2021, 88, 101626.
Rundle, A.G.; Bader, M.D.; Richards, C.A.; Neckerman, K.M.; Teitler, J.O. Using Google Street View to audit neighborhood environments. Am. J. Prev. Med. 2011, 40, 94–100.
Lu, Y.; Sarkar, C.; Xiao, Y. The effect of street-level greenery on walking behavior: Evidence from Hong Kong. Soc. Sci. Med. 2018, 208, 41–49.
Ye, Y.; Richards, D.; Lu, Y.; Song, X.; Zhuang, Y.; Zeng, W.; Zhong, T. Measuring daily accessed street greenery: A human-scale approach for informing better urban planning practices. Landsc. Urban Plan. 2019, 191, 103434.
Ki, D.; Lee, S. Analyzing the effects of Green View Index of neighborhood streets on walking time using Google Street View and deep learning. Landsc. Urban Plan. 2021, 205, 103920.
Choi, K.; Lim, W.; Chang, B.; Jeong, J.; Kim, I.; Park, C.R.; Ko, D.W. An automatic approach for tree species detection and profile estimation of urban street trees using deep learning and Google street view images. ISPRS J. Photogramm. Remote Sens. 2022, 190, 165–180.
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Moon, J.; Park, S.; Rho, S.; Hwang, E. Robust building energy consumption forecasting using an online learning approach with R ranger. J. Build. Eng. 2022, 47, 103851.
Rew, J.; Cho, Y.; Moon, J.; Hwang, E. Habitat suitability estimation using a two-stage ensemble approach. Remote Sens. 2020, 12, 1475.
Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv 2015, arXiv:1505.07293.
Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
Li, X.; Zhang, C.; Li, W.; Ricard, R.; Meng, Q.; Zhang, W. Assessing street-level urban greenery using Google Street View and a modified green view index. Urban For. Urban Green. 2015, 14, 675–685.
Lu, Y.; Yang, Y.; Sun, G.; Gou, Z. Associations between overhead-view and eye-level urban greenness and cycling behaviors. Cities 2019, 88, 10–18.
Seiferling, I.; Naik, N.; Ratti, C.; Proulx, R. Green streets−Quantifying and mapping urban trees with street-level imagery and computer vision. Landsc. Urban Plan. 2017, 165, 93–101.
Wang, R.; Lu, Y.; Zhang, J.; Liu, P.; Yao, Y.; Liu, Y. The relationship between visual enclosure for neighbourhood street walkability and elders’ mental health in China: Using street view images. J. Transp. Health 2019, 13, 90–102.
Zarrin, I. Leaf based trees identification using convolutional neural network. In Proceedings of the 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), Bombay, India, 29–31 March 2019; pp. 1–4.
Sun, Y.; Wang, X.; Zhu, J.; Chen, L.; Jia, Y.; Lawrence, J.M.; Wu, J. Using machine learning to examine street green space types at a high spatial resolution: Application in Los Angeles County on socioeconomic disparities in exposure. Sci. Total Environ. 2021, 787, 147653.
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–15 June 2015; pp. 3431–3440.
Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241.
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.-C. Panoptic-DeepLab. arXiv 2019, arXiv:1910.04751.
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1290–1299.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
Kabilan, R.; Devaraj, G.P.; Muthuraman, U.; Muthukumaran, N.; Gabriel, J.Z.; Swetha, R. Efficient color image segmentation using fastmap algorithm. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 1134–1141.
Huo, X.; Xie, L.; He, J.; Yang, Z.; Zhou, W.; Li, H.; Tian, Q. ATSO: Asynchronous teacher-student optimization for semi-supervised image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1235–1244.
Li, Y.; Guo, L.; Rao, J.; Xu, L.; Jin, S. Road segmentation based on hybrid convolutional network for high-resolution visible remote sensing image. IEEE Geosci. Remote Sens. Lett. 2018, 16, 613–617.
Wang, J.; Jiang, L.; Wang, Y.; Qi, Q. An improved hybrid segmentation method for remote sensing images. ISPRS Int. J. Geo-Inf. 2019, 8, 543.
Khoshboresh Masouleh, M.; Shah-Hosseini, R. A hybrid deep learning–based model for automatic car extraction from high-resolution airborne imagery. Appl. Geomat. 2020, 12, 107–119.
Sun, Y.; Zuo, W.; Yun, P.; Wang, H.; Liu, M. FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Trans. Autom. Sci. Eng. 2020, 18, 1000–1011.
Khan, S.D.; Alarabi, L.; Basalamah, S. Deep hybrid network for land cover semantic segmentation in high-spatial resolution satellite images. Information 2021, 12, 230.
Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18.
Abdollahi, A.; Pradhan, B.; Shukla, N.; Chakraborty, S.; Alamri, A. Multi-object segmentation in complex urban scenes from high-resolution remote sensing data. Remote Sens. 2021, 13, 3710.
Chen, L.; Fu, Y.; You, S.; Liu, H. Efficient hybrid supervision for instance segmentation in aerial images. Remote Sens. 2021, 13, 252.
Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20.
Luo, X.; Tong, X.; Hu, Z.; Wu, G. Improving urban land cover/use mapping by integrating a hybrid convolutional neural network and an automatic training sample expanding strategy. Remote Sens. 2020, 12, 2292.
Li, M.; Xie, Y.; Shen, Y.; Ke, B.; Qiao, R.; Ren, B.; Lin, S.; Ma, L. HybridCR: Weakly-supervised 3D point cloud semantic segmentation via hybrid contrastive regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 14930–14939.
Hossain, M.D.; Chen, D. A hybrid image segmentation method for building extraction from high-resolution RGB images. ISPRS J. Photogramm. Remote Sens. 2022, 192, 299–314.
Valdez-Rodríguez, J.E.; Calvo, H.; Felipe-Riverón, E.; Moreno-Armendáriz, M.A. Improving depth estimation by embedding semantic segmentation: A hybrid CNN model. Sensors 2022, 22, 1669.
Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065.
Men, G.; He, G.; Wang, G. Concatenated Residual Attention UNet for Semantic Segmentation of Urban Green Space. Forests 2021, 12, 1441.
Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
Gu, X.; Li, S.; Ren, S.; Zheng, H.; Fan, C.; Xu, H. Adaptive enhanced swin transformer with U-net for remote sensing image segmentation. Comput. Electr. Eng. 2022, 102, 108223.