Single-Frame Low-Resolution Infrared Small Target Detection: History

Infrared small target detection technology is widely used in infrared search and tracking, infrared precision guidance, low and slow small aircraft detection, and other applications. Its detection capability is critical for finding unknown targets as early as possible, providing timely warning, and leaving the security system enough response time.

  • infrared image
  • small target detection
  • deep learning
  • self-attention

1. Introduction

Compared with visible light imaging detection and active radar imaging detection, infrared imaging detection technology has the following characteristics [1]: it is unaffected by light conditions, works in all weather, operates passively, has high imaging spatial resolution, adapts to various environments, has strong resistance to electromagnetic interference, and has a simple structure that is small in size and easy to carry and hide. Benefiting from these advantages, infrared detection and imaging technology has been widely used in infrared search and tracking, infrared precision guidance, low and slow small aircraft detection and identification, and other applications [2].
In scenarios that require advance judgment, the target to be detected is far from the infrared imaging system, so it appears in the image as a dim, small target that often lacks texture information. The targets are usually aircraft, drones, missiles, ships, vehicles, and other fast-moving objects [3,4], so their imaged outlines are blurred. In addition, affected by the surrounding environment and the detection equipment, small infrared targets are easily submerged in noise and complex backgrounds [5]. All of these factors make infrared small target detection challenging.
At present, many infrared detection devices with low imaging resolution are deployed in various fields [6]. It is therefore of practical significance to design a small target detection method for low-resolution infrared images. Because the target occupies few pixels in such images [7], predicting each target pixel more accurately (that is, improving the pixel-level metrics of low-resolution infrared small target detection) can significantly improve detection performance.
Research on infrared small target detection is divided into single-frame and multi-frame target detection [2]. This text focuses on the former. Early researchers mainly proposed model-driven methods. Filter-based methods [8,9] require the filtering template to be determined in advance from the structural characteristics of the image, so they adapt poorly to complex backgrounds. Local-contrast-based methods [10,11,12,13] suit situations where the target differs markedly in grayscale from the surrounding background, but they are prone to missed detections and false alarms. Low-rank-based [14,15] and tensor-based [16,17,18] methods can achieve good results, but their computational cost is high and their hyperparameters are sensitive to the image scene.
With the development of deep learning, data-driven methods and infrared small target datasets [7,19,20,21,22] have emerged in recent years. Given the weak and small characteristics of infrared small targets, detection is usually modeled as a semantic segmentation problem. To ensure that small target features are not submerged, some methods [7,19,22,23] enhance the fusion of features across different layers of the network. Exploiting the small proportion of the image that small targets occupy, other methods [24,25] address infrared small target detection by suppressing the background area so that the network pays more attention to the target area. Further studies [26,27,28,29] consider how to improve and innovate on the classic encoder-decoder structure.
Existing single-frame infrared small target detection methods [7,23,24] adapt poorly to low-resolution infrared small target images and suffer high false-alarm and missed-detection rates on them. This stems not only from the limited quality of existing datasets, which leads to unsatisfactory network training, but also from the large parameter counts of existing network structures or their insufficient local attention to small targets.

2. Infrared Small Target Datasets

The Society of Photo-Optical Instrumentation Engineers (SPIE) defines infrared small targets as having a total spatial extent of less than 81 pixels (9 × 9) in a 256 × 256 image [30]—that is, the proportion of small targets in the entire image is less than 0.12%. In addition, the size of small infrared targets varies greatly, ranging from only one pixel (i.e., dot target) to dozens of pixels (i.e., expanded target) [29].
In recent years, some scholars have done a lot of work on the collection and production of infrared small target datasets and have publicly released these datasets, which include single-frame datasets [7,19,20,21,22] (see Table 1) and multi-frame datasets [31,32,33] (see Table 2).
Table 1. Details on the present single-frame infrared small target datasets.
Table 2. Details on the present multi-frame infrared small target datasets.
As Table 1 and Table 2 show, the sample size of real single-frame infrared small target data is relatively small, whereas multi-frame infrared small target data are plentiful and can be used to expand single-frame data. Constructing a single-frame infrared small target dataset with a larger data volume and higher quality would promote the development of single-frame infrared small target detection.

3. Infrared Small Target Detection Methods

In recent years, deep learning has developed rapidly in terms of solving visual tasks such as image classification, object detection, and semantic segmentation. Some methods based on deep learning have also emerged for infrared small target detection.
Due to their “weak” and “small” characteristics, infrared small targets are easily overwhelmed by a network’s high-level features. However, if only low-level features are used, semantic information cannot be fully captured, making missed detections and false alarms likely. Therefore, some researchers have combined attention mechanisms with enhanced feature fusion across layers. Dai et al. proposed a bottom-up channel attention modulation method (ACM) [23] to preserve and highlight infrared small target features in high-level layers. Dai et al. [19] then modularized the traditional local contrast measure [10] as a network component to design a model-driven deep learning network (ALCNet). Li et al. [7] proposed DNANet, which achieves progressive information interaction between high-level and low-level features through densely nested interaction modules (DNIM). Chen et al. [22] introduced the transformer self-attention mechanism into their IRSTFormer, extracting multi-scale features from the input image through a hierarchical overlapped-patch self-attention structure.
Exploiting the small proportion of the image occupied by small targets, some researchers address infrared small target detection by suppressing the background area so that the network pays more attention to the target area. Wang et al. [24] proposed IAANet, a coarse-to-fine two-stage network: in the coarse stage, candidate target regions are obtained by a region proposal network (RPN), and in the fine stage, global features of all candidate target regions in the image are extracted by an attention encoder (AE). IAANet uses a hard-decision method to suppress background regions as much as possible. Cheng et al. [25] designed a supervised attention module, trained with small target diffusion maps, in their LPNet to suppress most background pixels irrelevant to small target features in a soft-decision manner.
The classical encoder-decoder structure has been shown to achieve good results in semantic segmentation [34], and some researchers have improved and innovated on it. Tong et al. [26] proposed MSAFFNet, which adds an EIFAM module containing edge information to the encoder-decoder structure and constructs multi-scale labels to focus on target contour details and internal features. Wu et al. [27] proposed UIU-Net (U-Net in U-Net), which embeds a tiny U-Net within a larger U-Net backbone to realize multi-level, multi-scale representation learning of objects. Chen et al. [28] proposed MultiTask-UNet (MTUNet) with both detection and segmentation heads; by sharing the backbone, the similar semantic features of the two tasks are fully exploited, and compared with a compound single-task model, MTUNet has fewer parameters and faster inference. Wu et al. [29] proposed a multi-level TransUNet (MTU-Net) whose encoder passes convolutional features through a multi-level ViT module (MVTM) to capture long-range dependencies.
Networks that combine attention mechanisms with multi-scale feature fusion [7,19,22,23] strengthen the network’s ability to extract image features, but their local attention to small targets remains insufficient, which limits further gains in small target detection. Networks that focus on the localized region of small targets [24,25] have larger parameter counts and computational cost, and thus slower prediction.

4. Evaluation Metrics

The output of the infrared small target detection network is pixel-level segmentation. Therefore, it is common to use semantic segmentation metrics to evaluate network performance, such as precision, recall, F1 score, ROC curve, and PR curve.
Precision and recall are the proportions of correctly predicted positive samples among all predicted positive samples and among all true positive samples, respectively. The F1 score is the harmonic mean of precision and recall. P (precision), R (recall), and F1 (F1 score) are defined as follows:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R)

where T, P, TP, FP, and FN denote the true (ground-truth positive) pixels, predicted positive pixels, true positives, false positives, and false negatives, respectively.
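As a concrete illustration, these pixel-level quantities can be computed directly from binary masks. The sketch below assumes boolean NumPy arrays and uses illustrative names; it is not code from any of the cited works.

```python
import numpy as np

def precision_recall_f1(pred, target):
    """Pixel-level precision, recall, and F1 from boolean masks of equal shape."""
    tp = np.logical_and(pred, target).sum()   # correctly predicted target pixels
    fp = np.logical_and(pred, ~target).sum()  # predicted but not in ground truth
    fn = np.logical_and(~pred, target).sum()  # ground-truth pixels missed
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Toy 4x4 image: the target spans 3 pixels, the prediction covers 2 of them.
target = np.zeros((4, 4), dtype=bool); target[1, 1:4] = True
pred = np.zeros((4, 4), dtype=bool);   pred[1, 1:3] = True
p, r, f1 = precision_recall_f1(pred, target)
# p = 1.0 (no false positives), r = 2/3, f1 = 0.8
```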
The receiver operating characteristic (ROC) curve shows how the model performs across all classification thresholds. The horizontal coordinate of the ROC curve is the false positive rate (FPR), and the vertical coordinate is the true positive rate (TPR). It goes through the points (0, 0) and (1, 1). The horizontal coordinate of the precision-recall (PR) curve is the recall rate, which reflects the classifier’s ability to cover positive examples. The vertical coordinate is the precision, which reflects the accuracy of the classifier’s prediction of positive examples.
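The threshold sweep behind the ROC curve can be sketched as follows. This is a minimal illustration over a per-pixel score map; the array values and thresholds are invented for the example.

```python
import numpy as np

def roc_points(scores, target, thresholds):
    """Return (FPR, TPR) pairs obtained by thresholding a per-pixel score map."""
    pos = target.sum()      # total positive (target) pixels
    neg = (~target).sum()   # total negative (background) pixels
    points = []
    for t in thresholds:
        pred = scores >= t
        tp = np.logical_and(pred, target).sum()
        fp = np.logical_and(pred, ~target).sum()
        points.append((fp / neg, tp / pos))
    return points

scores = np.array([[0.9, 0.2], [0.8, 0.1]])
target = np.array([[True, False], [True, False]])
pts = roc_points(scores, target, [0.0, 0.5, 1.1])
# Sweeping the threshold from 0.0 to above the maximum score moves the
# operating point from (1, 1) down to (0, 0).
```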
However, because this is a target detection task, some researchers have proposed pixel-level and target-level evaluation metrics based on the existing ones to better evaluate infrared small target detection performance.
IoU and nIoU are pixel-level metrics. IoU represents the ratio of intersection and union between the predicted and true results:
 
IoU = TP / (T + P − TP)
nIoU [19] normalizes by the IoU value of each individual target, as shown in (3), where N represents the total number of targets.
 
nIoU = (1/N) Σᵢᴺ TP[i] / (T[i] + P[i] − TP[i])
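A minimal sketch of IoU and nIoU over boolean masks follows, assuming the ground truth and prediction are supplied as one mask pair per target. It is illustrative only, not the reference implementation from [19].

```python
import numpy as np

def iou(pred, target):
    """Pixel-level IoU = TP / (T + P - TP) for boolean masks."""
    tp = np.logical_and(pred, target).sum()
    t, p = target.sum(), pred.sum()
    denom = t + p - tp
    return tp / denom if denom > 0 else 0.0

def niou(preds, targets):
    """nIoU: mean of the per-target IoU values over N targets."""
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(targets)

# Target 1 is segmented perfectly; target 2 is only partially covered.
t1 = np.zeros((5, 5), dtype=bool); t1[0, 0:2] = True
p1 = t1.copy()                                        # IoU = 1
t2 = np.zeros((5, 5), dtype=bool); t2[2, 1:4] = True
p2 = np.zeros((5, 5), dtype=bool); p2[2, 1:3] = True  # IoU = 2/3
n = niou([p1, p2], [t1, t2])
# n = (1 + 2/3) / 2 = 5/6
```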
Pd (probability of detection) and Fa (false-alarm rate) are target-level metrics [22]. Pd measures the ratio of the number of correctly predicted targets to the number of all targets. Fa measures the ratio of incorrectly predicted pixels to all pixels in the image.
 
Pd = # of true detections / # of actual targets
 
Fa = # of false predicted pixels / # of all pixels
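The target-level metrics can be sketched as below. Note that the exact rule for deciding when a target counts as “detected” varies across papers, so the any-overlap rule here is only one common convention, and the ground truth is assumed to be given as one boolean mask per target.

```python
import numpy as np

def pd_fa(pred, target_masks):
    """Target-level Pd and pixel-level Fa.

    pred: boolean prediction mask for the whole image.
    target_masks: one boolean mask per ground-truth target.
    A target counts as detected if the prediction overlaps any of its pixels.
    """
    detected = sum(1 for m in target_masks if np.logical_and(pred, m).any())
    pd = detected / len(target_masks)
    gt_union = np.logical_or.reduce(target_masks)  # all ground-truth pixels
    fa = np.logical_and(pred, ~gt_union).sum() / pred.size
    return pd, fa

# 8x8 image with two single-pixel targets; the prediction finds one of
# them and raises one false-alarm pixel.
t_a = np.zeros((8, 8), dtype=bool); t_a[1, 1] = True
t_b = np.zeros((8, 8), dtype=bool); t_b[5, 5] = True
pred = np.zeros((8, 8), dtype=bool); pred[1, 1] = True; pred[0, 7] = True
pd, fa = pd_fa(pred, [t_a, t_b])
# pd = 0.5, fa = 1/64
```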

This entry is adapted from the peer-reviewed paper 10.3390/rs15235539
