INtra-INter Spectral Attention Network for Pedestrian Detection

Pedestrian detection is a core task in safety-critical systems, but detecting pedestrians reliably is challenging in low-light and adverse weather conditions. Thermal images can improve robustness by providing information complementary to RGB images.

  • autonomous vehicle
  • computer vision
  • data augmentation
  • feature fusion

1. Introduction

Pedestrian detection, which involves predicting bounding boxes to locate pedestrians in an image, has long been studied due to its utility in various real-world applications, such as autonomous vehicles, video surveillance, and unmanned aerial vehicles [1][2][3][4]. In particular, robust pedestrian detection in challenging scenarios is essential in autonomous driving applications, since it is directly related to human safety. However, modern RGB-based pedestrian detection methods often fail in challenging environments characterized by low illumination, rain, and fog [5][6][7][8]. To alleviate this problem, several methods [5][9][10] have emerged that leverage a thermal camera as a sensor complementary to the RGB camera already in use. Thermal cameras offer visual cues in challenging environments by capturing the long-wavelength radiation emitted by subjects, thereby overcoming the limitations of RGB cameras in complex conditions.
To achieve successful multispectral pedestrian detection, it is important to consider three key factors: enhancing individual spectral features, understanding the relationships between inter-spectral features, and effectively aggregating these features. Building upon these principles, diverse multispectral pedestrian detection approaches have emerged, including single/multi-scale feature fusion [11][12][13][14][15][16] as well as iterative fusion-and-refinement methods [17][18]. These approaches have achieved impressive results with novel fusion techniques. However, most previous methods rely on convolutional layers to enhance modality-specific features and capture the correlations between them. Because of the limited receptive field imposed by their small kernel sizes, such convolutional layers have trouble capturing the long-range spatial dependencies of both intra- and inter-spectral images.
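A quick back-of-the-envelope illustration of this limitation, as a minimal Python sketch (the layer counts and the 640-pixel image width are illustrative assumptions, not taken from any cited method): a stack of k stride-1 3 × 3 convolutions sees only a (2k + 1)-pixel window per side.

```python
# Receptive field of k stacked stride-1 3x3 convolutions: (2k + 1) pixels per side.
def receptive_field(num_3x3_layers: int) -> int:
    return 2 * num_3x3_layers + 1

for k in (1, 5, 10):
    print(f"{k:2d} layers -> {receptive_field(k):2d} px")
# 1 layer -> 3 px, 5 layers -> 11 px, 10 layers -> 21 px: even ten stacked
# layers see only a 21-px window of a 640-px-wide image, whereas one
# self-attention layer can relate any two positions, within or across spectra.
```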
Recently, transformer-based fusion methods [19][20] that enhance the representation of each spectral feature map to improve multispectral feature fusion have emerged. These methods capture the complementary information between multispectral images by employing an attention mechanism that assigns importance to input sequences based on their relationships. While existing approaches achieve satisfactory detection results, they still neglect, or only inadequately address, the inherent relationships among intra-modality features.
In addition, the researchers observed that detection performance was restricted by the imbalanced distribution of locations where pedestrians appear. This imbalanced distribution frequently occurs in both multispectral [5][10] and single-spectral thermal pedestrian detection datasets [21][22]. To analyze this phenomenon, the researchers plot the distribution of the centers of annotated pedestrians in the KAIST multispectral dataset and the LLVIP dataset in Figure 1. As shown in the yellow square in Figure 1a, pedestrian appearances are concentrated in specific regions biased toward the right side. This result stems from the fact that KAIST dataset entries were acquired under right-hand traffic conditions, making it difficult to obtain a sufficient view of pedestrians on the left side. Pedestrian counts become intensely imbalanced in road scenarios in particular, where images were collected along arterial roads with sidewalks sharply divided from traffic lanes (as shown in Figure 1b). As observed in Figure 1c, the concentration of pedestrians persists even though the LLVIP dataset was captured from a video surveillance camera angle.
Figure 1. Analyzing the distribution of pedestrians in the KAIST multispectral dataset and the LLVIP dataset using Gaussian Kernel Density Estimation (Gaussian KDE). In the (a) KAIST dataset, and especially in the (b) road scenes, pedestrians are concentrated on the right side of the image for several reasons, including a road environment with clearly divided sidewalks and right-hand driving conditions. The (c) LLVIP dataset displays a more uniform distribution, yet pedestrians still appear disproportionately often on the right side of the images. A plasma colormap encodes the density, with blue indicating low density and yellow indicating high density; high-density regions are marked with a yellow square.
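Density maps like those in Figure 1 can be produced with standard tools. Below is a minimal sketch using scipy.stats.gaussian_kde; the synthetic right-biased centers, the 640 × 512 image size, and the grid resolution are illustrative assumptions standing in for parsed KAIST/LLVIP annotations.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical input: an (N, 2) array of annotated pedestrian box centers in
# pixel coordinates; synthetic right-biased samples stand in here so the
# sketch runs end to end (parsed dataset annotations would be loaded instead).
rng = np.random.default_rng(0)
centers = rng.normal(loc=[480.0, 300.0], scale=[80.0, 60.0], size=(1000, 2))

# Fit a 2D Gaussian KDE to the center coordinates (gaussian_kde expects
# a (dims, N) array, hence the transpose).
kde = gaussian_kde(centers.T)

# Evaluate the density on a grid covering an assumed 640 x 512 image.
xs, ys = np.meshgrid(np.linspace(0, 640, 160), np.linspace(0, 512, 128))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

# Plasma colormap as in Figure 1: dark (low density) to yellow (high density).
plt.imshow(density, cmap="plasma", origin="upper", extent=(0, 640, 512, 0))
plt.colorbar(label="density")
plt.title("Pedestrian center density (Gaussian KDE)")
plt.show()
```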

2. Multispectral Pedestrian Detection

Multispectral pedestrian detection research has made significant progress by leveraging thermal images to detect pedestrians accurately in a variety of challenging conditions. Hwang et al. [5] released a large-scale multispectral pedestrian dataset and proposed a hand-crafted Aggregated Channel Feature (ACF) approach that utilizes thermal channel features; this work had a significant impact on subsequent multispectral pedestrian detection research. Liu et al. [23] analyzed feature fusion performance at different stages using the Network-in-Network (NIN) fusion strategy. Li et al. [16] demonstrated that multi-task learning with semantic segmentation could improve object detection performance compared to a detection-only approach. Zhang et al. [17] proposed a cyclic multispectral feature fusion and refinement method that improves the representation of each modality's features. Yang et al. [24] and Li et al. [25] designed illumination-aware gates that adaptively modulate the fusion weights between RGB and thermal features using illumination information predicted from RGB images. Zhou et al. [18] leveraged common- and differential-mode information simultaneously to address the modality imbalance problem, considering both illumination and feature factors. Zhang et al. [11] proposed a Region Feature Alignment (RFA) module that adaptively predicts feature offsets to address weakly aligned image pairs. Kim et al. [15] proposed a novel multi-label learning method that distinguishes between paired and unpaired images for robust pedestrian detection in commercialized sensor configurations such as stereo vision systems. Although previous studies have achieved remarkable performance gains, convolution-based fusion strategies struggle to capture the global context of both intra- and inter-spectral images, despite its importance during the feature fusion process.
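To make this convolutional baseline concrete, the following is a minimal sketch of NIN-style halfway fusion in the spirit of Liu et al. [23]: mid-level RGB and thermal feature maps are concatenated and mixed by a 1 × 1 convolution. The channel count and module layout are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HalfwayNINFusion(nn.Module):
    """Illustrative NIN-style halfway fusion: concatenate mid-level RGB and
    thermal feature maps, then mix them with a 1x1 convolution."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.nin = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),  # cross-channel mixing
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_thermal: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([feat_rgb, feat_thermal], dim=1)  # (B, 2C, H, W)
        return self.nin(fused)                              # (B, C, H, W)

# Usage with dummy mid-level features:
rgb = torch.randn(2, 256, 64, 80)
thermal = torch.randn(2, 256, 64, 80)
fused = HalfwayNINFusion(256)(rgb, thermal)  # -> torch.Size([2, 256, 64, 80])
```

Note that the 1 × 1 kernel mixes channels only at each spatial position, which is exactly the limited-receptive-field behavior criticized above.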

3. Attention-Based Fusion Strategies

Attention mechanisms [26][27][28] have enabled models to learn enhanced modality-specific information. Zhang et al. [12] proposed a cross-modality interactive attention mechanism that encodes the interaction between RGB and thermal modalities and adaptively fuses their features to improve pedestrian detection performance. Fu et al. introduced a pixel-level feature fusion attention module that incorporates spatial and channel dimensions. Zhang et al. [13] designed Guided Attentive Feature Fusion (GAFF) to guide the fusion of intra-modality and inter-modality features with an auxiliary pedestrian mask. With the success of the attention-based transformer [29] in natural language processing (NLP) and the subsequent development of the vision transformer (ViT) [30], several methods have attempted to utilize transformer-based attention schemes for multispectral pedestrian detection. Shen et al. [20] proposed a dual cross-attention transformer feature fusion framework for simultaneous global feature interaction and complementary information capture across modalities; the framework uses a query-guided cross-attention mechanism to exchange cross-modal information. Zhu et al. [31] proposed a Multi-modal Feature Pyramid Transformer (MFPT) built on a feature pyramid architecture that simultaneously attends to spatial and scale information within and between modalities. Fang et al. [19] leveraged self-attention to perform intra-modality and inter-modality fusion simultaneously and to capture the latent interactions between RGB and thermal spectral information more effectively. However, transformer-based feature fusion methods have not yet fully realized the potential of attention mechanisms, as they do not effectively learn the complementary information between modalities.
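The common core of these transformer-based methods is cross-attention between the two spectra. Below is a minimal sketch in the spirit of [19][20], built on PyTorch's nn.MultiheadAttention; the token layout, dimensions, and residual scheme are illustrative assumptions rather than any cited architecture.

```python
import torch
import torch.nn as nn

class CrossSpectralAttention(nn.Module):
    """Minimal cross-attention sketch: RGB tokens query thermal tokens and
    vice versa, with residual connections preserving each modality."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.rgb_from_thermal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens: torch.Tensor, thermal_tokens: torch.Tensor):
        # Each modality's queries attend over the other's keys/values,
        # capturing complementary inter-spectral information.
        rgb_out, _ = self.rgb_from_thermal(rgb_tokens, thermal_tokens, thermal_tokens)
        th_out, _ = self.thermal_from_rgb(thermal_tokens, rgb_tokens, rgb_tokens)
        return rgb_tokens + rgb_out, thermal_tokens + th_out

# Tokens are flattened H*W feature vectors per modality (here 64 x 80 = 5120):
rgb = torch.randn(2, 64 * 80, 256)
thermal = torch.randn(2, 64 * 80, 256)
rgb_fused, thermal_fused = CrossSpectralAttention()(rgb, thermal)
```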

4. Data Augmentations in Pedestrian Detection

Data augmentation is a key technique for improving the robustness and generalization of object detectors. Pedestrian detection models commonly use geometric transformations, including flips, rotation, and cropping, as well as other techniques such as zoom-in, zoom-out, CutMix [32], and mixup [33]. Cygert et al. [34] proposed a patch-based augmentation that utilizes image distortions and stylized textures to achieve competitive results. Chen et al. [35] proposed shape transformations to generate more realistic-looking pedestrians. Chi et al. [36] and Tang et al. [37] introduced occlusion-simulating augmentation methods that divide pedestrians into parts and fill the removed parts with ImageNet [38] mean values or image content to improve robustness to occlusion. To address the motion blur problem in autonomous driving scenes, Khan et al. [39] designed hard mixup augmentation, an image-aware technique that combines mixup [33] augmentation with hard labels. To address the paucity of data on severe weather conditions, Tumas et al. [40] used a DNN-based augmentation that modifies training images with Gaussian noise to mimic adverse weather. Kim et al. [15] proposed semi-unpaired augmentation, which stochastically applies augmentation to only one of the multispectral images; breaking the pair in this way allows the model to learn from both paired and unpaired conditions, demonstrating good generalization performance.
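A minimal sketch of the semi-unpaired idea from Kim et al. [15] follows: with some probability, an augmentation is applied to only one randomly chosen modality, breaking the pair. The probability value and the example photometric augmentation are illustrative assumptions, not the published settings.

```python
import random
from typing import Callable, Tuple

import torch

def semi_unpaired_augment(
    rgb: torch.Tensor,
    thermal: torch.Tensor,
    augment: Callable[[torch.Tensor], torch.Tensor],
    p_unpaired: float = 0.5,  # assumed probability, not the published setting
) -> Tuple[torch.Tensor, torch.Tensor]:
    """With probability p_unpaired, augment only one randomly chosen modality
    (breaking the pair); otherwise augment both modalities consistently."""
    if random.random() < p_unpaired:
        if random.random() < 0.5:
            return augment(rgb), thermal   # only the RGB image is perturbed
        return rgb, augment(thermal)       # only the thermal image is perturbed
    return augment(rgb), augment(thermal)  # paired (consistent) case

# Usage with a simple photometric augmentation that leaves boxes valid:
brighten = lambda img: (img * 1.2).clamp(0.0, 1.0)
rgb_aug, thermal_aug = semi_unpaired_augment(
    torch.rand(3, 512, 640), torch.rand(1, 512, 640), brighten
)
```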

References

  1. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
  2. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 11621–11631.
  3. Wang, X.; Wang, M.; Li, W. Scene-specific pedestrian detection for static video surveillance. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 361–374.
  4. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
  5. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1037–1045.
  6. Xu, D.; Ouyang, W.; Ricci, E.; Wang, X.; Sebe, N. Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5363–5371.
  7. Devaguptapu, C.; Akolekar, N.; Sharma, M.M.; Balasubramanian, V.N. Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 15–20 June 2019.
  8. Kieu, M.; Bagdanov, A.D.; Bertini, M.; Del Bimbo, A. Task-conditioned domain adaptation for pedestrian detection in thermal imagery. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 546–562.
  9. González, A.; Fang, Z.; Socarras, Y.; Serrat, J.; Vázquez, D.; Xu, J.; López, A.M. Pedestrian detection at day/night time with visible and FIR cameras: A comparison. Sensors 2016, 16, 820.
  10. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 3496–3504.
  11. Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5127–5137.
  12. Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29.
  13. Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 72–80.
  14. Zheng, Y.; Izzat, I.H.; Ziaee, S. GFD-SSD: Gated fusion double SSD for multispectral pedestrian detection. arXiv 2019, arXiv:1903.06999.
  15. Kim, J.; Kim, H.; Kim, T.; Kim, N.; Choi, Y. MLPD: Multi-Label Pedestrian Detector in Multispectral Domain. IEEE Robot. Autom. Lett. 2021, 6, 7846–7853.
  16. Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral pedestrian detection via simultaneous detection and segmentation. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; pp. 225.1–225.12.
  17. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 276–280.
  18. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 787–803.
  19. Qingyun, F.; Dapeng, H.; Zhaokui, W. Cross-modality fusion transformer for multispectral object detection. arXiv 2021, arXiv:2111.00273.
  20. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913.
  21. Xu, Z.; Zhuang, J.; Liu, Q.; Zhou, J.; Peng, S. Benchmarking a large-scale FIR dataset for on-road pedestrian detection. Infrared Phys. Technol. 2019, 96, 199–208.
  22. Tumas, P.; Nowosielski, A.; Serackis, A. Pedestrian detection in severe weather conditions. IEEE Access 2020, 8, 62775–62784.
  23. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv 2016, arXiv:1611.02644.
  24. Yang, X.; Qian, Y.; Zhu, H.; Wang, C.; Yang, M. BAANet: Learning bi-directional adaptive attention gates for multispectral pedestrian detection. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2920–2926.
  25. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171.
  26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  27. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
  28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  31. Zhu, Y.; Sun, X.; Wang, M.; Huang, H. Multi-Modal Feature Pyramid Transformer for RGB-Infrared Object Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9984–9995.
  32. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032.
  33. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
  34. Cygert, S.; Czyżewski, A. Toward robust pedestrian detection with data augmentation. IEEE Access 2020, 8, 136674–136683.
  35. Chen, Z.; Ouyang, W.; Liu, T.; Tao, D. A shape transformation-based dataset augmentation framework for pedestrian detection. Int. J. Comput. Vis. 2021, 129, 1121–1138.
  36. Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. Pedhunter: Occlusion robust pedestrian detector in crowded scenes. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10639–10646.
  37. Tang, Y.; Li, B.; Liu, M.; Chen, B.; Wang, Y.; Ouyang, W. Autopedestrian: An automatic data augmentation and loss function search scheme for pedestrian detection. IEEE Trans. Image Process. 2021, 30, 8483–8496.
  38. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  39. Khan, A.H.; Nawaz, M.S.; Dengel, A. Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 5476–5485.
  40. Tumas, P.; Serackis, A.; Nowosielski, A. Augmentation of severe weather impact to far-infrared sensor images to improve pedestrian detection system. Electronics 2021, 10, 934.