The purpose of multisource map super-resolution is to reconstruct high-resolution maps based on low-resolution maps, which is valuable for content-based map tasks such as map recognition and classification. However, there is no specific super-resolution method for maps, and the existing image super-resolution methods often suffer from missing details when reconstructing maps.
Earth observation systems and big data techniques gave birth to the exponential growth of large amounts of professionally generated and volunteered raster maps of various formats, themes, styles, and other attributes. As a great number of activities, including land cover/land use mapping 
, navigation 
, trajectory analysis 
, socio-economic analysis 
, etc., benefit from the geospatial information included in these raster maps, precisely retrieving the massive map dataset has become a pressing task. Traditional approaches for map retrieval, such as online search engines, generally rely on map annotations or the metadata of map files rather than the map content. These map annotations assigned to a raster map might vary due to subjective understanding as well as diverse map generation goal, map themes, and other factors 
. In comparison to map annotations and metadata, content-based map retrieval mainly focuses on employing the information included in a map to determine whether the retrieved map is truly needed by the user or for a task. Map text and symbols are the first map language and the essential part of map features with respect to map content 
. Thus, map text and symbol recognition has become a main research aspect of big map data retrieval. Recently, deep learning techniques such as convolutional neural networks (CNNs) have shown great strengths in map text and symbol recognition. Furthermore, Zhou et al. 
and Zhou 
reported that deep learning approaches could effectively support the retrieval of topographical and raster maps by recognizing text information.
However, poor spatial resolution and limited data size remain the two main obstacles for implementing state-of-the-art deep learning approaches into map text and symbol recognition. In the era of big data and volunteered geographical information, a majority of available maps are designed and created in unprofessional ways. This always makes the text characters and symbols in these maps unavailable for visual recognition due to poor spatial resolution or small data size.
Super-resolution techniques support the conversion of low-resolution images into high-resolution ones by extracting the mappings between low-resolution images and their corresponding high-resolution ones. Super-resolution techniques include multi-frame super-resolution (MFSR) and single-image super-resolution (SISR) 
. Considering that the generation of a map is always time-consuming, generating multiple types of maps that share a similar theme or style is generally impossible on the raster. Thus, SISR would be useful for research in map reconstruction. Currently, the community of machine intelligence and computer vision reported that CNNs have achieved great success in SISR 
. However, due to the limited receptive field of convolution kernels, it is difficult to effectively utilize global features. Moreover, although increasing the network depth can expand the receptive field of CNNs to some extent, this strategy still cannot fundamentally solve the receptive field in the spatial dimension. Specifically, increasing the depth might lead to an edge effect—the reconstruction of image edges is significantly worse than in the middle of the image. Otherwise, vision transformer (ViT) mainly focuses on modeling the features of the global receptive field using the attention mechanism 
Unlike natural images, maps contain a variety of information on different scales, containing both geographical information with global features and detailed information such as legends and annotations. The former follows Tobler’s first law of geography and consists mainly of low-frequency information being available for reconstruction using Transformer for global modeling. The latter has a large amount of high-frequency information and requires the use of CNN modules to focus on reconstructing the local details of maps. Up to now, super-resolution methods that fuse global and local information for maps have not been reported.
2. SISR Based on Deep Learning
2.1. CNN-Based SR
CNNs have been used for a long time. In the 1990s, LeCun et al. 
proposed LeNet using the backpropagation algorithm, which initially established the structure of CNNs. Dong et al. proposed the first super-resolution network, SRCNN 
, in 2014, which achieved results beyond previous traditional interpolation methods by using only three layers of convolution. As a pioneering work to introduce CNN into super-resolution, SRCNN had the problem of limited learning ability due to the shallow network, but it established the basic structure of image super-resolution, that is, the three-level structure of feature extraction, nonlinear mapping, and high-resolution reconstruction. They further proposed FSRCNN 
in 2016, where they moved the upsampling layer back, allowing feature extraction and mapping to be performed on low-resolution images, they used several small convolutions instead of a large convolution, both of which reduced the computations, and they replaced the upsampling method from interpolation to transposed convolution, which enhanced the learning ability of the model.
, VDSR 
, DRCN 
, and LapSRN 
were proposed to improve the existing super-resolution models from different perspectives. ESPCN proposed a PixelShuffle method for upsampling, which was proven to be better than transposed convolution and interpolation methods and has been widely used in later super-resolution models. Inspired by ResNet and RNN, Kim et al. proposed two methods, VDSR and DRCN, to deepen the model and improve the feature extraction ability. LapSRN was proposed as a progressive upsampling method, which was faster and more convenient for multi-scale reconstruction than single upsampling.
Lim et al. 
found that the Batch Normal (BN) layer normalized the image color and destroyed the original contrast information of the image, which hindered the convergence of training. Therefore, they proposed EDSR to remove the BN layer, implement a deeper network, and proposed residual scaling to solve the problem of numerical instability in the training process caused by the overly deep network. Inspired by DenseNet, Haris et al. 
proposed DBPN, which proposed an iterative upsampling and downsampling process that provided an error feedback mechanism for each stage and achieved excellent results in large-scale image reconstruction.
2.2. Transformer-Based SR
In 2017, Ashish et al. 
first proposed the Transformer model for machine translation using stacked self-attention layers and fully connected layers (MLP
) to replace the circular structure of the original Seq2Seq. Because of the great success of Transformer in NLP, Kaiser et al. 
soon introduced it to image generation work, and Alexey et al. 
proposed Vision Transformer (ViT), which segmented images into blocks and then serialized them, using Transformer to implement image classification. In recent years, Transformer, especially ViT, has gradually attracted the attention of the SISR academic community.
introduced Transformer to image super-resolution and proposed a channel attention (CA) mechanism that adaptively adjusted features considering the dependencies between channels, further improving the expressive capability of the network compared to CNN. Dai et al. 
proposed a second-order channel attention mechanism to better adaptively adjust the channel features considering that the global covariance could obtain higher-order and more discriminative feature information compared with the first-order pooling used in RCAN.
Inspired by Swin Transformer, Liang et al. 
proposed SwinIR, which used window partitioning to obtain several local windows, Transformer was used in each window, and the information across windows was fused by window shifting at the next layer. This solution of implementing Transformer for partitioned windows could greatly reduce the computations, had the advantage of processing larger sized images, and also could take advantage of Transformer by shifting windows to achieve modeling of the global dependency. Considering that the window partitioning strategy of SwinIR limited the receptive field and could not establish long dependencies at an early stage, Zhang et al. 
used a fast Fourier convolutional layer with a global receptive field to extend SwinIR, while Chen et al. 
combined SwinIR with a channel attention mechanism to propose a hybrid attentional Transformer model. Both approaches enabled SwinIR to establish long dependencies at the early stage and improved model performance using different methods.
Transformer breaks away from the dependence on convolution by using an attention mechanism and has a global receptive field, which can achieve better image super-resolution results compared to CNN. However, Transformer does not have the capability of capturing local features and is not sensitive enough to some local details of maps. Therefore, a Transformer model that fuses local features may be more effective for map super-resolution.