Ground control points (GCPs) are of great significance for applications involving the registration and fusion of heterologous remote sensing images (RSIs). However, traditional methods based on intensity and local image features use low-level information rather than deep features, and they prove unsuitable for heterologous RSIs because of the large nonlinear radiation difference (NRD), inconsistent resolutions, and geometric distortions. Additionally, the limitations of current heterologous datasets and of existing deep-learning-based methods make it difficult to obtain sufficiently precise GCPs from different kinds of heterologous RSIs, especially from thermal infrared (TIR) images, which have low spatial resolution and poor contrast.
Ground control points (GCPs) of remote sensing images (RSIs) are widely used in image stitching, image registration, image fusion, and camera geometric correction. GCPs of heterologous RSIs from different sensors or imaging bands are essential for further utilization of various satellite images. However, the severe nonlinear radiation difference (NRD) between heterologous RSIs lowers the accuracy of GCP extraction and causes positioning errors, which has been one of the most important factors limiting the further quantitative application of RSIs.
Thermal infrared (TIR) data reflect the thermal radiation of targets in the observation area. By measuring differences in the thermal radiation of the imaging target, TIR images convert invisible infrared light into visible content, which has very important applications in military target detection, camouflaged target disclosure, etc. However, several characteristics of TIR images make it challenging to extract sufficiently accurate GCPs from them. First, because of the thermal interaction between the target and the surrounding environment, the temperature differences among ground objects in the TIR imaging area are small, resulting in a concentrated gray distribution and poor contrast. Second, whereas visible images record the reflection characteristics of ground objects, TIR images record their thermal radiation characteristics, so the gray values of the two are nonlinearly related. As a result, the gray-level and edge features of TIR RSIs are less obvious and the visual effect is relatively blurred. Third, compared with visible light and short-wave infrared (SWIR), the longer wavelength of TIR leads to a lower image spatial resolution. Finally, cold and hot shadows in TIR RSIs cause a discontinuous gray distribution and lower image contrast in the shadow areas, as well as insignificant texture, edges, and other features.
These characteristics cause the following problems for GCP extraction from TIR remote sensing images. First, traditional grayscale-based control point extraction algorithms rely on the grayscale changes around feature points, and the cross-correlation matching process requires highly consistent gray mapping around control points. However, the gray distribution of a TIR image is relatively concentrated, its contrast is poor, and its gray mapping differs greatly from the reflection characteristics of the target, resulting in poor control point extraction. Second, control point extraction algorithms based on image features mainly rely on the gray gradient, contour, texture, edge, and other information of the image itself, whereas a TIR image has low resolution and indistinct gray-level and edge features, which makes it challenging to extract precise control points from TIR images. Third, previous deep-learning-based methods simply use the feature map from a single fixed network structure even when the characteristics of the input images differ, which leads to a lack of flexibility in the network and insufficient accuracy of the extracted GCPs.
2. Convolutional Neural Network-Based Layer-Adaptive Ground Control Points Extraction
GCPs are of great significance for the further quantitative application of heterologous RSIs. GCP extraction has long been a popular research topic and has made great progress in the past decades. In general, GCP extraction methods are broadly classified into traditional methods and intelligent methods.
Traditional methods mainly rely on grayscale and handcrafted features such as gradients, edges, and corners, as well as geometric texture. Traditional GCP extraction methods can be roughly divided into two categories: intensity-based methods and feature-based methods. An intensity-based method computes statistics of the information in an image window in the spatial domain or frequency domain and extracts control point pairs by optimizing a similarity measure of these statistics. Common similarity measures include mutual information (MI), normalized cross-correlation (NCC), etc. Intensity-based methods are also called gray-based methods because gray-level information is commonly used for the statistics. Since intensity information is used directly to extract GCPs, gray-based methods are often sensitive to window size, illumination differences, geometric distortion, etc. Therefore, these methods can hardly meet the requirements of GCP extraction from heterologous RSIs with nonlinear radiation differences (NRD), and their results become even worse for distorted images.
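The two similarity measures named above can be illustrated with a minimal sketch. It assumes integer gray values in 0..255 and flattened image windows; the function names `ncc` and `mutual_information`, the bin count, and the toy window are illustrative choices rather than part of any cited method. Contrast reversal is used here only as a simple stand-in for a radiation difference between sensors.

```python
import math
from collections import Counter


def ncc(a, b):
    """Normalized cross-correlation of two equal-length gray-value windows."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat window: correlation undefined, treated as no match
    return sum(x * y for x, y in zip(da, db)) / denom


def mutual_information(a, b, bins=4, levels=256):
    """Mutual information (bits) from a joint histogram of quantized grays."""
    n = len(a)
    qa = [x * bins // levels for x in a]  # assumes integer grays in [0, levels)
    qb = [y * bins // levels for y in b]
    joint = Counter(zip(qa, qb))
    pa, pb = Counter(qa), Counter(qb)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) * log2( p(i,j) / (p(i) p(j)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (pa[i] * pb[j]))
    return mi


# Toy 3x3 window (flattened) and a contrast-reversed copy of it.
window = [10, 50, 90, 130, 170, 210, 250, 30, 70]
reversed_window = [255 - g for g in window]

print(ncc(window, window))                          # close to +1
print(ncc(window, reversed_window))                 # close to -1
print(mutual_information(window, reversed_window))  # same as for (window, window)
```

In this toy case NCC changes sign under contrast reversal, whereas MI is unchanged because it depends only on the statistical dependence between the two gray distributions. Both measures still operate on raw gray values inside a fixed window, which is consistent with the sensitivity to window size and distortion noted above.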
Feature-based methods first extract local features (point features, edge features, texture features, etc.) of the image with a feature extraction operator and establish the corresponding descriptors. The GCPs are then screened out through descriptor matching and outlier removal algorithms. Representative local feature detection methods include the scale-invariant feature transform (SIFT), the Harris operator, the Moravec operator, Features from Accelerated Segment Test (FAST), Smallest Univalue Segment Assimilating Nucleus (SUSAN), etc. In particular, SIFT, known for its invariance to scale, rotation, illumination, etc., is one of the most classical feature-based GCP extraction methods. Wang used the SIFT algorithm to extract GCPs from mountainous area images of Landsat-8 and the Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model (ASTER GDEM), achieving a positioning accuracy of better than 1.0 pixel for the panchromatic (PAN), near-infrared (NIR), and intermediate infrared sensors. Relying on integral images for image convolutions, speeded-up robust features (SURF) can be computed and compared much faster than previously proposed schemes. Affine-SIFT extended SIFT to compute affine-invariant local image features, effectively covering all six parameters of the affine transform. To overcome the difference in image intensity between heterologous RSIs, Ma et al. proposed position scale orientation (PSO)-SIFT, which uses a new gradient definition and a feature matching method combining the position, scale, and orientation of each key point. Moravec is one of the earliest local feature detection operators; it moves a rectangular window over the image and finds the local maxima of the minimum intensity change. For thermal infrared (TIR) RSIs with low spatial resolution and poor contrast, Li et al. proposed an accurate geometric-texture-based GCP extraction approach that achieves sub-pixel matching accuracy. The phase congruency (PC) feature has also been used to address NRD in multi-modal RSIs. Ye et al. built a dense descriptor called the Histogram of Orientated Phase Congruency (HOPC), constructed from the magnitude and orientation of PC, that captures similar geometric structure or shape features of multi-modal images. Li et al. detected corner feature points and edge feature points on the PC map and constructed a maximum index map, which is suitable for multi-modal image feature description. However, challenges remain for these traditional methods, especially for heterologous images. Because hand-crafted features based on image intensity and gradient are sensitive to NRD, it is difficult for traditional methods to achieve both robust and highly accurate GCP extraction from multi-modal RSIs.
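The Moravec operator described above (local maxima of the minimum intensity change under window shifts) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the image is a list of rows of gray values, the window is 3x3, the eight one-pixel shifts are used, and the function names and threshold are hypothetical.

```python
def moravec_response(img, half=1):
    """Per-pixel minimum SSD between a window and its eight one-pixel shifts."""
    h, w = len(img), len(img[0])
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    margin = half + 1  # keep the shifted window inside the image
    resp = [[0] * w for _ in range(h)]
    for y in range(margin, h - margin):
        for x in range(margin, w - margin):
            ssds = []
            for dy, dx in shifts:
                ssd = 0
                for v in range(-half, half + 1):
                    for u in range(-half, half + 1):
                        d = img[y + v + dy][x + u + dx] - img[y + v][x + u]
                        ssd += d * d
                ssds.append(ssd)
            resp[y][x] = min(ssds)  # small if any shift leaves the window unchanged
    return resp


def moravec_corners(img, threshold, half=1):
    """Pixels whose response exceeds the threshold and is a local maximum."""
    resp = moravec_response(img, half)
    h, w = len(resp), len(resp[0])
    corners = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            r = resp[y][x]
            if r < threshold:
                continue
            neighbours = [resp[y + j][x + i]
                          for j in (-1, 0, 1) for i in (-1, 0, 1)
                          if (i, j) != (0, 0)]
            if r >= max(neighbours):
                corners.append((y, x))
    return corners


# Toy image: dark background with a bright block whose corner is at row 4, col 4.
image = [[100 if (r >= 4 and c >= 4) else 0 for c in range(9)] for r in range(9)]
print(moravec_corners(image, threshold=15000))
```

Along a straight edge, shifting the window parallel to the edge leaves it unchanged, so the minimum is zero and only the corner survives. Because the response is built directly from gray-value differences, it inherits the sensitivity to intensity and contrast discussed above.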
Recently, deep learning has achieved great success in computer vision, and learning-based features have performed well in image matching tasks. Many deep trainable features outperform handcrafted features in GCP extraction from heterologous RSIs. Because of the differences in imaging mechanisms and imaging sensors between heterologous images, low-level handcrafted features may not be shared across modalities. For example, visible remote sensing sensors mainly receive sunlight reflected by ground objects, whereas TIR imaging mainly depends on the thermal radiation of the target itself, which is related to the temperature and radiation intensity of the imaging target. In such a situation, the handcrafted features of a visible image reflect more edge and texture information, while those of a thermal infrared image may reflect more temperature information. Therefore, grayscale-based handcrafted features, which carry different meanings under different radiation characteristics, are hardly robust to NRD. In contrast, the semantic information obtained from deep features is often shared between heterologous RSIs, since a deep learning network can obtain deep features that are more abstract and global. A common approach is to combine the deep features extracted by neural networks such as convolutional neural networks (CNNs) with traditional methods to obtain more robust and universal feature descriptors for matching. Yang et al. used multi-scale feature descriptors generated from a CNN to register multi-temporal satellite images. Deep feature descriptors from different convolution layers are shared by image patches of different sizes and are used together to describe the feature points. Considering the spatial relationship, Ma et al. proposed a two-step method using both the deep features extracted by a CNN and classical local handcrafted features. This method adjusts the location of matching blocks using the features output from different convolutional layers, which makes the location of matching points more accurate. Ye et al. integrated SIFT and CNN features into the PSO-SIFT algorithm for RSI registration. These methods use a CNN as a feature extractor and then use the extracted CNN features to describe and match the feature points to obtain GCPs. More recently, two-branch Siamese networks have also been applied to feature extraction and patch matching. Han et al. proposed a Siamese network architecture named “MatchNet”, which extracts patch pair features for image patch matching. Zhu et al. proposed a two-branch convolutional network with unshared weights to extract features from each modality separately and transformed the matching task into a binary classification task. By using the DoG function instead of the s-LoG function, the image patch size can fully cover the texture structure around the key points. Hughes et al. proposed a pseudo-Siamese CNN architecture to identify corresponding patches in optical and synthetic aperture radar (SAR) remote sensing imagery. Zhang et al. proposed a Siamese fully convolutional network (SFcNet) with a hard negative mining strategy to obtain GCPs from optical, NIR, TIR, SAR, and map images.
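Once a CNN has produced a descriptor vector for each feature point, the describe-and-match step mentioned above reduces to pairing descriptors across the two images. The sketch below is one simple way to do this and is not taken from any of the cited works: it assumes cosine similarity as the measure, a mutual nearest-neighbour rule, and a hypothetical similarity threshold `min_sim`; the descriptors are toy vectors rather than real CNN outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two descriptor vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mutual_nearest_matches(desc_a, desc_b, min_sim=0.9):
    """Index pairs (i, j) that are each other's best match and similar enough."""
    sim = [[cosine(u, v) for v in desc_b] for u in desc_a]
    # Best partner in B for every descriptor in A, and vice versa.
    best_b = [max(range(len(desc_b)), key=lambda j: sim[i][j])
              for i in range(len(desc_a))]
    best_a = [max(range(len(desc_a)), key=lambda i: sim[i][j])
              for j in range(len(desc_b))]
    return [(i, j) for i, j in enumerate(best_b)
            if best_a[j] == i and sim[i][j] >= min_sim]


# Toy descriptors: the third descriptor in A is ambiguous and should be rejected.
desc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
desc_b = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.0, 0.0, 1.0]]
print(mutual_nearest_matches(desc_a, desc_b))
```

The mutual check discards one-sided matches, which plays a role similar to the outlier removal step of feature-based pipelines. Siamese and pseudo-Siamese networks replace the fixed similarity function with a learned one, but the pairing logic around it can remain of this form.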
In short, owing to the strong feature extraction ability of deep learning networks, previous methods often use classic image classification networks, such as VGG-16, as the feature extractor, and the feature maps output from the convolution layers serve, after processing, as the descriptors of RSI feature points. However, these methods simply use the feature map from a single fixed network structure even when the characteristics of the input images differ, which leads to insufficient accuracy of the extracted GCPs and a lack of flexibility in the network. In contrast, this work does not use a fixed CNN but constructs a CNN feature extraction network with an adjustable structure through a layer-adaptive module. The layer-adaptive module adaptively adjusts its layers according to a preset GCP precision threshold and processes different input images with different network structures until the accuracy of the GCPs meets the requirement. This precision-oriented approach enables the method to obtain higher-precision GCPs than other methods.
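The precision-oriented control flow described above can be outlined as a simple loop. This is only a schematic sketch under stated assumptions, not the actual layer-adaptive module: the callable `extract_at_depth`, the idea of trying depths in increasing order, the `max_depth` limit, and the toy error model are all hypothetical, and how the GCP error is measured is left to the caller.

```python
def layer_adaptive_extract(extract_at_depth, precision_threshold,
                           min_depth=1, max_depth=5):
    """Try network configurations until the GCP error meets the threshold.

    extract_at_depth(depth) is a hypothetical callable that runs feature
    extraction and matching with `depth` layers in the adaptive module and
    returns (gcps, error). Returns (depth, gcps, error) of the first
    configuration meeting the threshold, or of the best one tried.
    """
    best = None
    for depth in range(min_depth, max_depth + 1):
        gcps, error = extract_at_depth(depth)
        if best is None or error < best[2]:
            best = (depth, gcps, error)
        if error <= precision_threshold:
            break  # precision requirement met; stop adjusting the structure
    return best


def toy_extractor(depth):
    """Stand-in for a real pipeline: the error shrinks as layers are added."""
    gcps = [((10.0, 12.0), (10.4, 11.8))]  # one illustrative point pair
    return gcps, 3.0 / depth


depth, gcps, error = layer_adaptive_extract(toy_extractor, precision_threshold=0.8)
print(depth, error)
```

The point of the sketch is that the network structure becomes an output of the procedure for each input image pair rather than a fixed design choice, which is what distinguishes this approach from using a single fixed feature extractor.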