Convolutional Neural Network-Based Layer-Adaptive Ground Control Points Extraction

Ground control points (GCPs) are of great significance for applications involving the registration and fusion of heterologous remote sensing images (RSIs). However, because they rely on low-level information rather than deep features, traditional methods based on intensity and local image features are unsuitable for heterologous RSIs, which exhibit large nonlinear radiation differences (NRD), inconsistent resolutions, and geometric distortions. Additionally, the limitations of current heterologous datasets and of existing deep-learning-based methods make it difficult to obtain sufficiently precise GCPs from different kinds of heterologous RSIs, especially from thermal infrared (TIR) images, which present low spatial resolution and poor contrast.

  • Ground Control Points (GCPs)
  • convolutional neural network (CNN)
  • layer-adaptive

1. Introduction

Ground control points (GCPs) of remote sensing images (RSIs) are widely used in image stitching, image registration, image fusion, and camera geometric correction [1][2][3]. GCPs of heterologous RSIs acquired from different sensors or imaging bands are essential for the further utilization of various satellite images. However, the severe nonlinear radiation difference (NRD) between heterologous RSIs leads to low GCP extraction accuracy and consequent positioning errors, which have become one of the most important factors limiting the further quantitative application of RSIs.
Thermal infrared (TIR) data reflect the thermal radiation of targets in the observation area. By measuring differences in the thermal radiation of the imaged targets, TIR images render invisible infrared radiation as visible content, which has very important applications in military target detection, camouflaged target disclosure, etc. However, several characteristics of TIR images make it challenging to extract sufficiently accurate GCPs from them. Affected by the thermal interaction between a target and its surrounding environment, the temperature differences among ground objects in the TIR imaging area are small, resulting in a concentrated gray distribution and poor contrast. Unlike visible images, TIR images record the thermal radiation characteristics of ground objects, so their gray levels have a nonlinear relationship with the reflectance characteristics of the targets; as a result, the gray-level and edge features of TIR RSIs are less distinct and the visual effect is relatively blurred. In addition, compared with visible light and short-wave infrared (SWIR), the longer wavelength of TIR leads to low spatial resolution. Moreover, cold and hot shadows in TIR RSIs cause discontinuous gray distributions and lower contrast in shadowed areas, as well as weak texture, edge, and other features.
These characteristics make GCP extraction from TIR remote sensing images face the following problems. First, traditional grayscale-based control point extraction algorithms rely on grayscale changes around feature points, and the cross-correlation matching process requires highly consistent gray mappings around control points. However, the gray distribution of TIR images is relatively concentrated, the contrast is poor, and the gray mapping differs considerably from the reflectance characteristics of the target, so control point extraction performs poorly (a minimal sketch of such grayscale correlation matching is given below). Second, control point extraction algorithms based on image features mainly rely on gray gradients, contours, textures, edges, and other information in the image itself; since TIR images have low resolution and indistinct gray-level and edge features, extracting precise control points from them is challenging. Furthermore, previous deep-learning-based methods simply use the feature map from a single, fixed network architecture even when the characteristics of the input images differ, which leads to inflexible networks and insufficient accuracy of the extracted GCPs.
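To make the grayscale cross-correlation matching referred to above concrete, the following is a minimal numpy sketch, not the implementation of any cited method; the function names and the exhaustive search strategy are illustrative only. It scores candidate offsets with normalized cross-correlation (NCC), the statistic that degrades when the gray distribution is concentrated, as in TIR imagery.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def match_by_ncc(template, search_image):
    """Exhaustively slide the template over the search image and
    return the offset (x, y) with the highest NCC score."""
    th, tw = template.shape
    sh, sw = search_image.shape
    best_score, best_xy = -np.inf, (0, 0)
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            score = ncc(template, search_image[y:y + th, x:x + tw])
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

Because NCC subtracts the mean and divides by the patch norms, a low-contrast TIR patch yields a flat, noise-dominated score surface, which illustrates why intensity-based matching struggles on such data.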

2. Convolutional Neural Network-Based Layer-Adaptive Ground Control Points Extraction

GCPs are of great significance for the further quantitative application of heterologous RSIs. GCP extraction has long been a popular research topic and has made great progress over the past decades. In general, GCP extraction methods are broadly classified into traditional methods and intelligent methods. Traditional methods mainly rely on grayscale and handcrafted features such as gradients, edges, and corners, as well as geometric texture, and can be roughly divided into two categories: intensity-based methods and feature-based methods.

An intensity-based method gathers statistics over an image window in the spatial or frequency domain and extracts control point pairs by optimizing a similarity measure over those statistics. Common similarity measures include mutual information (MI) [4] and normalized cross-correlation (NCC) [5]. Intensity-based methods are also called gray-based methods because gray-level information is commonly used for the statistics. Since intensity information is used directly to extract GCPs, gray-based methods are often sensitive to window size, illumination differences, geometric distortion, etc. They therefore can hardly meet the requirements of GCP extraction from heterologous RSIs with nonlinear radiation distortion (NRD), and their results degrade further on distorted images.

Feature-based methods first extract local features of the image (point, edge, texture, etc.) with a feature extraction operator and build the corresponding descriptors; GCPs are then screened out through descriptor matching and outlier removal algorithms [6]. Representative local feature detectors include the scale-invariant feature transform (SIFT) [7], the Harris operator [8], the Moravec operator [9], Features from Accelerated Segment Test (FAST) [10], and the Smallest Univalue Segment Assimilating Nucleus (SUSAN) [11]. In particular, SIFT, renowned for its invariance to scale, rotation, and illumination changes, is one of the most classical feature-based GCP extraction methods. Wang [12] used the SIFT algorithm to extract GCPs from mountainous-area images of Landsat-8 and the Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model (ASTER GDEM), achieving a positioning accuracy better than 1.0 pixel for panchromatic (PAN), near-infrared (NIR), and intermediate-infrared sensors. Relying on integral images for image convolutions, speeded-up robust features (SURF) [13] can be computed and compared much faster than previously proposed schemes. Affine-SIFT [14] extended SIFT to compute affine-invariant local image features, effectively covering all six parameters of the affine transform. To overcome intensity differences between heterologous RSIs, Ma et al. [15] proposed position scale orientation (PSO)-SIFT, which uses a new gradient definition and a feature matching method combining the position, scale, and orientation of each key point. Moravec is one of the earliest local feature detectors; it finds local maxima of the minimum intensity change by moving a rectangular window over the image. For thermal infrared (TIR) RSIs, which present low spatial resolution and poor contrast, Li et al. [3] proposed an accurate geometric-texture-based GCP extraction approach that achieves sub-pixel-level matching accuracy.
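As a concrete illustration of the generic feature-based pipeline (detect, describe, match, remove outliers), here is a minimal OpenCV sketch combining SIFT [7] with RANSAC-style filtering [6]; it is not any of the cited methods, the ratio and reprojection thresholds are illustrative, and it assumes an OpenCV build (4.4 or later) in which cv2.SIFT_create is available.

```python
import cv2
import numpy as np

def extract_gcps_sift(img_ref, img_sen, ratio=0.75, ransac_thresh=3.0):
    """Candidate GCP pairs via SIFT matching plus RANSAC outlier removal."""
    sift = cv2.SIFT_create()
    kp_ref, des_ref = sift.detectAndCompute(img_ref, None)
    kp_sen, des_sen = sift.detectAndCompute(img_sen, None)

    # Lowe's ratio test on the two nearest neighbours of each descriptor.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_ref, des_sen, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]

    pts_ref = np.float32([kp_ref[m.queryIdx].pt for m in good])
    pts_sen = np.float32([kp_sen[m.trainIdx].pt for m in good])

    # RANSAC fits a global homography and rejects geometrically inconsistent pairs.
    _, mask = cv2.findHomography(pts_ref, pts_sen, cv2.RANSAC, ransac_thresh)
    inliers = mask.ravel().astype(bool)
    return pts_ref[inliers], pts_sen[inliers]  # surviving pairs are GCP candidates
```

On heterologous pairs such as visible/TIR, the gradient-based descriptors on the two sides often encode different physical content, so the ratio test discards most matches; this is the NRD sensitivity described above.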
Furthermore, the phase congruency (PC) feature has been used to address NRD in multi-modal RSIs. Ye et al. [16] built a dense descriptor called the Histogram of Orientated Phase Congruency (HOPC), constructed from the magnitude and orientation of PC, that captures similar geometric structure or shape features across multi-modal images. Li et al. [17] detected corner and edge feature points on the PC map and constructed a maximum index map suitable for multi-modal image feature description. However, challenges remain with the traditional methods above, especially for heterologous images: the sensitivity of handcrafted features based on image intensity and gradient to NRD makes it difficult for traditional methods to achieve both robust and highly accurate GCP extraction from multi-modal RSIs.

Recently, deep learning has achieved great success in computer vision, and learning-based features have proved effective in image matching tasks [18][19][20][21]. Many trainable deep features outperform handcrafted features in GCP extraction from heterologous RSIs. Owing to the differences in imaging mechanisms and sensors between heterologous images, low-level handcrafted features may not be shared across modalities. For example, visible remote sensing sensors mainly receive sunlight reflected by ground objects, while TIR imaging mainly depends on the thermal radiation of the target itself, which is related to the temperature and radiation intensity of the imaged target. In such a situation, the handcrafted features of a visible image reflect mostly edge and texture information, while those of a thermal infrared image may reflect mostly temperature information. Grayscale-based handcrafted features, which thus carry different meanings under different radiation characteristics, can hardly be robust to NRD. In contrast, the image semantic information captured by deep features is often shared between heterologous RSIs, and a deep network can obtain features that are more abstract and global. A common approach is to combine the deep features extracted by neural networks such as convolutional neural networks (CNNs) [22] with traditional methods to obtain more robust and universal feature descriptors for matching. Yang et al. [23] used multi-scale feature descriptors generated by a CNN for the registration of multi-temporal satellite images; deep feature descriptors from different convolutional layers are shared by image patches of different sizes and are used together to describe the feature points. Considering the spatial relationship, Ma et al. [24] proposed a two-step method using both deep features extracted by a CNN and classical local handcrafted features; it adjusts the locations of matching blocks using convolutional features output from different layers, which makes the locations of matching points more accurate. Ye et al. [25] integrated SIFT and CNN features into the PSO-SIFT algorithm for RSI registration. These methods use a CNN as a feature extractor and then use the extracted CNN features to describe and match the feature points to obtain GCPs.
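As a hedged sketch of this CNN-as-feature-extractor idea (not the implementation of [23][24][25]), the following PyTorch snippet samples descriptors for given keypoints from several convolutional layers of a pretrained VGG-16 [29]; the layer indices, the bilinear sampling, and the class name are all illustrative choices.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class VGGDescriptor(torch.nn.Module):
    """Multi-layer deep descriptors sampled from a frozen, pretrained VGG-16."""

    def __init__(self, layer_ids=(8, 15, 22)):  # ReLU outputs of conv2_2, conv3_3, conv4_3
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)

    @torch.no_grad()
    def forward(self, image, keypoints_xy):
        """image: (1, 3, H, W) float tensor; keypoints_xy: (N, 2) pixel coordinates."""
        _, _, H, W = image.shape
        # Normalise pixel coordinates to [-1, 1] as required by grid_sample.
        grid = keypoints_xy.float().clone()
        grid[:, 0] = grid[:, 0] / (W - 1) * 2 - 1
        grid[:, 1] = grid[:, 1] / (H - 1) * 2 - 1
        grid = grid.view(1, -1, 1, 2)

        descriptors, x = [], image
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_ids:
                # Bilinearly sample the feature map at each keypoint location.
                d = F.grid_sample(x, grid, align_corners=True)    # (1, C, N, 1)
                descriptors.append(d.squeeze(-1).squeeze(0).t())  # (N, C)
        # Concatenate per-layer descriptors and L2-normalise for matching.
        return F.normalize(torch.cat(descriptors, dim=1), dim=1)
```

Descriptors from such a network can then be matched with nearest-neighbour search exactly as in the SIFT pipeline above; note that every input is processed by the same fixed stack of layers, which is the inflexibility the layer-adaptive design targets.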
Recently, two-branch Siamese networks have also been applied to feature extraction and patch matching. Han et al. [18] proposed a Siamese network architecture named “MatchNet”, which extracts patch-pair features for image patch matching. Zhu et al. [26] proposed a two-branch convolutional network with unshared weights that extracts features from each branch independently and casts the matching task as two-class classification; by using the DoG function instead of the s-LoG function, the image patch size can completely cover the texture structure around the key points. Hughes et al. [27] proposed a pseudo-Siamese CNN architecture to identify corresponding patches in optical and synthetic aperture radar (SAR) remote sensing imagery. Zhang et al. [28] proposed a Siamese fully convolutional network (SFcNet) with a hard negative mining strategy to obtain GCPs from optical, NIR, TIR, SAR, and map images.

In short, because of the strong feature extraction ability of deep learning networks, previous methods often use classic image classification networks, such as VGG-16 [29], as the feature extractor, and the feature maps output by the convolutional layers serve, after processing, as descriptors of the RSI feature points. However, these methods simply use the feature map from a single, fixed network architecture even when the characteristics of the input images differ, which leads to insufficient accuracy of the extracted GCPs and a lack of flexibility in the network. In contrast, this work does not use a fixed CNN but constructs a CNN feature extraction network with an adjustable structure through a layer-adaptive module. The layer-adaptive module adjusts its layers adaptively according to a preset GCP precision threshold, processing different input images with different network structures until the accuracy of the GCPs meets the requirement (a schematic sketch of this loop is given below). This precision-oriented approach enables the method to achieve higher-precision GCPs than other methods.
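The entry does not spell out the internals of the layer-adaptive module, so the following is only a schematic Python sketch of the precision-oriented control loop described above; extractor_factory, extract, and residual_rmse are hypothetical placeholders, and the real adjustment rule may differ.

```python
def layer_adaptive_gcps(img_ref, img_sen, extractor_factory, residual_rmse,
                        precision_threshold=1.0, max_depth=8):
    """Adapt the depth of the feature extraction network until the GCP
    residual meets the preset precision threshold (in pixels)."""
    gcps = None
    for depth in range(1, max_depth + 1):
        extractor = extractor_factory(depth)        # CNN with `depth` adaptive layers (hypothetical)
        gcps = extractor.extract(img_ref, img_sen)  # candidate GCP pairs (hypothetical API)
        if residual_rmse(gcps) <= precision_threshold:
            break                                   # accuracy requirement met; stop adapting
    return gcps
```

The point of the sketch is the stopping criterion: the network structure becomes a variable chosen per input pair, driven by a preset precision threshold, rather than a constant fixed at training time.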

References

  1. Jiang, L.Y.; Li, L.Y.; Li, X.Y.; Jiao, J.J.; Chen, F.S. Extrapolating distortion correction with local measurements for space-based multi-module splicing large-format infrared cameras. Opt. Express 2022, 30, 38043–38059.
  2. Yang, L.; Li, X.Y.; Jiang, L.Y.; Zeng, F.J.; Pan, W.H.; Chen, F.S. Resolution-Normalizing Image Stitching for Long-Linear-Array and Wide-Swath Whiskbroom Payloads. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7507705.
  3. Li, X.Y.; Hu, Z.Y.; Jiang, L.Y.; Yang, L.; Chen, F.S. GCPs Extraction with Geometric Texture Pattern for Thermal Infrared Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7000205.
  4. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
  5. Cole-Rhodes, A.A.; Johnson, K.L.; LeMoigne, J.; Zavorin, I. Multiresolution registration of remote sensing imagery by optimization of mutual information using a stochastic gradient. IEEE Trans. Image Process. 2003, 12, 1495–1511.
  6. Fischler, M.A.; Bolles, R.C. Random Sample Consensus—A Paradigm for Model-Fitting with Applications to Image-Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395.
  7. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  8. Harris, C.; Stephens, M. A combined corner and edge detector. In Alvey Vision Conference; Alvey Vision Club: Manchester, UK, 1988.
  9. Moravec, H.P. Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover. Ph.D. Dissertation, Stanford University, Stanford, CA, USA, 1980.
  10. Rosten, E.; Drummond, T. Fusing points and lines for high performance tracking. In Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, China, 17–20 October 2005.
  11. Smith, S.M.; Brady, J.M. SUSAN—A new approach to low level image processing. Int. J. Comput. Vis. 1997, 23, 45–78.
  12. Chen, B.Y.; Li, X.Y.; Zhang, G.X.; Guo, Q.; Wu, Y.P.; Wang, B.Y.; Chen, F.S. On-orbit installation matrix calibration and its application on AGRI of FY-4A. J. Appl. Remote Sens. 2020, 14, 024507.
  13. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417.
  14. Yu, G.S.; Morel, J.M. ASIFT: An Algorithm for Fully Affine Invariant Comparison. Image Process. On Line 2011, 1, 11–38.
  15. Ma, W.P.; Wen, Z.L.; Wu, Y.; Jiao, L.C.; Gong, M.G.; Zheng, Y.F.; Liu, L. Remote Sensing Image Registration with Modified SIFT and Enhanced Feature Matching. IEEE Geosci. Remote Sens. Lett. 2017, 14, 3–7.
  16. Ye, Y.; Shen, L. HOPC: A Novel Similarity Metric Based on Geometric Structural Properties for Multi-Modal Remote Sensing Image Matching. In Proceedings of the 23rd ISPRS Congress, Prague, Czech Republic, 12–19 July 2016; pp. 9–16.
  17. Li, J.Y.; Hu, Q.W.; Ai, M.Y. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans. Image Process. 2020, 29, 3296–3310.
  18. Han, X.F.; Leung, T.; Jia, Y.Q.; Sukthankar, R.; Berg, A.C. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3279–3286.
  19. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8084–8093.
  20. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 337–349.
  21. Zagoruyko, S.; Komodakis, N. Learning to Compare Image Patches via Convolutional Neural Networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4353–4361.
  22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
  23. Yang, Z.Q.; Dan, T.T.; Yang, Y. Multi-Temporal Remote Sensing Image Registration Using Deep Convolutional Features. IEEE Access 2018, 6, 38544–38555.
  24. Ma, W.P.; Zhang, J.; Wu, Y.; Jiao, L.C.; Zhu, H.; Zhao, W. A Novel Two-Step Registration Method for Remote Sensing Images Based on Deep and Local Features. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4834–4843.
  25. Ye, F.M.; Su, Y.F.; Xiao, H.; Zhao, X.Q.; Min, W.D. Remote Sensing Image Registration Using Convolutional Neural Network Features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236.
  26. Zhu, H.; Jiao, L.C.; Ma, W.P.; Liu, F.; Zhao, W. A Novel Neural Network for Remote Sensing Image Matching. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2853–2865.
  27. Hughes, L.H.; Schmitt, M.; Mou, L.C.; Wang, Y.Y.; Zhu, X.X. Identifying Corresponding Patches in SAR and Optical Images with a Pseudo-Siamese CNN. IEEE Geosci. Remote Sens. Lett. 2018, 15, 784–788.
  28. Zhang, H.; Ni, W.P.; Yan, W.D.; Xiang, D.L.; Wu, J.Z.; Yang, X.L.; Bian, H. Registration of Multimodal Remote Sensing Image Based on Deep Fully Convolutional Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3028–3042.
  29. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.