Low-Light Object Tracking in UAV Videos

Visual object tracking by unmanned aerial vehicles (UAVs) under low-light conditions is a crucial component of applications such as night surveillance, indoor search, night combat, and all-weather tracking. However, most existing tracking algorithms are designed for favorable lighting conditions. In low-light environments, images captured by UAVs typically exhibit reduced contrast, brightness, and signal-to-noise ratio, which hampers the extraction of target features. Moreover, the target’s appearance in low-light UAV video sequences often changes rapidly, rendering traditional fixed-template tracking mechanisms inadequate and resulting in poor tracker accuracy and robustness.

  • unmanned aerial vehicle
  • low-light tracking

1. Introduction

Visual object tracking is a fundamental task in computer vision with extensive applications in the unmanned aerial vehicle (UAV) domain. Recent years have witnessed the emergence of new trackers that exhibit exceptional performance in UAV tracking [1,2,3], which is largely attributable to the fine manual annotation of large-scale datasets [4,5,6,7]. However, current evaluation standards and tracking algorithms are primarily designed for favorable lighting conditions. In real-world scenarios, low-light conditions such as nighttime, rainy weather, and confined spaces are often encountered, yielding images with low contrast, low brightness, and low signal-to-noise ratio compared to normally lit images. These discrepancies give rise to inconsistent feature distributions between the two types of images, making it difficult to extend trackers designed for favorable lighting conditions to low-light scenarios [8,9] and rendering UAV tracking more challenging.
Conventional object-tracking algorithms exhibit poor robustness and tracking drift on low-light UAV video sequences, as illustrated in Figure 1. The problem of object tracking under low-light conditions can be divided into two sub-problems: enhancing low-light image features and tackling target appearance changes in low-light video sequences. First, the low contrast, low brightness, and low signal-to-noise ratio of low-light images make feature extraction more arduous than for normally lit images. Insufficient feature information hampers the subsequent tracking task and constrains the performance of object-tracking algorithms. The second obstacle arises from the characteristics of low-light video sequences. During tracking, the target’s appearance often changes; when the target becomes occluded or deformed, its features no longer correspond to the original template features, resulting in tracking drift. Such challenges are commonplace in visual object-tracking tasks and are more pronounced under low-light conditions due to unstable illumination, which is a crucial factor limiting the performance of object-tracking algorithms.
Figure 1. Tracker performance under low-light conditions.

2. Low-Light Image Enhancement

The objective of low-light image enhancement is to improve the quality of images by making the details that are concealed in darkness visible. In recent years, this area has gained significant attention and undergone continuous development and improvement in various computer vision domains. Two main types of algorithms are used for low-light image enhancement, namely model-based methods and deep learning-based methods.
Model-based methods were developed earlier and are based on the Retinex theory [10]. According to this theory, a low-light image can be separated into illumination and reflectance components. The reflectance component contains the essential attributes of the image, including edge details and color information, while the illumination component captures the general outline and brightness distribution of the objects in the image. Fu et al. [11,12] were the first to use the L2 norm to constrain illumination and proposed an image enhancement method that simultaneously estimates the illumination and reflectance components in the linear domain, demonstrating that the linear-domain formulation is more suitable than the logarithmic-domain one. Guo et al. [13] used relative total variation [14] as a constraint on illumination and developed a structure-aware smoothing model to obtain better estimates of the illumination component; however, this model suffers from overexposure. Li et al. [15] added a noise term to address low-light image enhancement under strong noise. They introduced new regularization terms to jointly estimate a piecewise-smooth illumination and a structure-revealing reflectance, and modeled noise removal and low-light enhancement as a unified optimization objective. Additionally, Ref. [16] proposed a semi-decoupled decomposition model to simultaneously enhance brightness and suppress noise. Although some models exploit camera response characteristics (e.g., LEACRM [17]), their results are often unsatisfactory and require manual adjustment of numerous parameters in real scenes.
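To make the decomposition concrete, the following is a generic Retinex-style formulation (an illustrative textbook form, not the exact objective of any single method cited above): the observed image S is modeled as the element-wise product of reflectance R and illumination I, and the two components are estimated jointly in the linear domain under smoothness regularizers.

```latex
S = R \circ I, \qquad
\min_{R,\,I}\;\underbrace{\|S - R \circ I\|_F^2}_{\text{fidelity}}
\;+\; \alpha\,\|\nabla I\|_1
\;+\; \beta\,\|\nabla R\|_F^2
```

Here $\circ$ denotes element-wise multiplication, $\nabla$ the spatial gradient, and $\alpha, \beta$ are weighting hyperparameters; the enhanced result is then typically obtained by gamma-correcting the estimated illumination, e.g. $S_{\text{enh}} = R \circ I^{1/\gamma}$.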
In recent years, deep learning-based methods have rapidly emerged with the advancement of computer technology. Li et al. [18] proposed a control-based method for optimizing UAV trajectories that incorporates energy conversion efficiency by deriving the model directly from the voltage and current of the UAV’s electric motor. EvoXBench [19] introduced an end-to-end pipeline to address the lack of a general problem formulation for NAS tasks from an optimization perspective. Zhang et al. [20] presented a low-complexity super-resolution (SR) strategy based on adaptive low-rank approximation (LRA), aiming to overcome the limitations of processing large-scale datasets. Jin et al. [21] developed a deep transfer learning method that leverages facial recognition techniques for computer-aided facial diagnosis, validated both on a single disease and on multiple diseases with healthy controls. Zheng et al. [22] proposed a two-stage data augmentation method for deep learning-based automatic modulation classification, utilizing spectral interference in the frequency domain to augment radio signals; this marks the first time frequency-domain information has been used to augment radio signals for modulation classification.

Meanwhile, deep learning-based low-light enhancement algorithms have also made significant progress. Chen et al. [23] created the LOL dataset by collecting low/normal-light image pairs with adjusted exposure times. As the first dataset containing image pairs captured in real scenes for low-light enhancement research, it has made a significant contribution to learning-based low-light image enhancement, and many algorithms have been trained on it. The Retinex decomposition network designed in [23], however, produced unnatural enhancement results. KinD [24] mitigated some of these issues by adjusting the network architecture and introducing additional training losses. DeepUPE [25] proposed a low-light image enhancement network that learns an image-to-illumination mapping. Yang et al. [26] developed a fidelity-based two-stage network, trained with a semi-supervised strategy, that first restores the signal and then further enhances the result to improve overall visual quality. EnGAN [27] used a GAN-based unsupervised training scheme to enhance low-light images from unpaired low/normal-light data, training the network with carefully designed discriminators and loss functions on carefully selected training data. SSIENet [28] proposed a maximum-entropy-based Retinex model that estimates the illumination and reflectance components simultaneously while being trained only on low-light images. ZeroDCE [29] heuristically constructs quadratic curves whose parameters are estimated from the low-light input and applies a curve projection model to iteratively brighten the image. However, these models focus on adjusting image brightness and do not consider the noise that inevitably occurs in real-world nighttime imaging. Liu et al. [30] introduced prior constraints based on Retinex theory to establish a low-light image enhancement model and constructed the overall network architecture by unrolling its optimization procedure. Recently, Ma et al. [31] added self-calibrating modules during training to reduce the model’s parameter count and improve inference speed.
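As a rough sketch of the curve-based idea in ZeroDCE [29], the snippet below iteratively applies the quadratic light-enhancement curve LE(x) = x + αx(1 − x) to a normalized image. This is a simplified illustration: in the actual method the α values are per-pixel parameter maps predicted by a small network, whereas scalar constants are assumed here.

```python
import numpy as np

def curve_enhance(img: np.ndarray, alphas: list) -> np.ndarray:
    """Iteratively apply the quadratic enhancement curve LE(x) = x + alpha*x*(1-x)
    to an image normalized to [0, 1].

    Simplified sketch: ZeroDCE predicts per-pixel alpha maps with a CNN;
    here each alpha is a scalar constant, for illustration only.
    """
    x = img.astype(np.float32)
    for alpha in alphas:
        # For alpha in [-1, 1] the curve is monotonic and keeps x within [0, 1].
        x = x + alpha * x * (1.0 - x)
    return x

# Example: brighten a synthetic dark image over four curve iterations.
dark = np.random.rand(64, 64, 3) * 0.2           # mostly dark pixels
bright = curve_enhance(dark, alphas=[0.8] * 4)   # progressively lifts low intensities
```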
However, these algorithms have limited stability and struggle to sustain superior performance, particularly in unknown real-world scenes, where unclear details and inappropriate exposure are common and image noise remains without a good solution.

3. Object Tracking

Object tracking algorithms of recent years can be broadly classified into methods based on discriminative correlation filtering [32,33,34] and methods based on Siamese networks. End-to-end training of trackers based on discriminative correlation filtering is challenging due to their complex online learning process. Moreover, limited by low-level hand-crafted features or ill-suited pre-trained classifiers, these trackers become ineffective under complex conditions.
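For context, a standard single-channel correlation-filter formulation (in the spirit of MOSSE/KCF-style filters, not the specific objective of any tracker cited above) learns the filter w by ridge regression against a desired Gaussian response y and admits a cheap closed-form solution in the Fourier domain:

```latex
\min_{w}\;\|\,w \star x - y\,\|_2^2 + \lambda\,\|w\|_2^2
\quad\Longrightarrow\quad
\hat{w} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda}
```

where $\star$ denotes circular correlation, hats denote the discrete Fourier transform, $^{*}$ complex conjugation, $\odot$ element-wise multiplication, and $\lambda$ the regularization weight. The efficiency of this closed-form update is what makes online learning feasible, but it also ties the tracker to the fixed features mentioned above.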
With the continuous improvement of computer performance and the establishment of large-scale datasets, tracking algorithms based on Siamese networks have become mainstream due to their superior performance. The Siamese series of algorithms began with SINT [35] and SiamFC [36], which treat target tracking as a similarity-learning problem and train Siamese networks on large amounts of image data. SiamFC introduced a correlation layer for feature fusion, which significantly improved accuracy. Building on the success of SiamFC, subsequent improvements were made. CFNet [37] added a correlation filter to the template branch to make the network shallower and more efficient. DSiam [38] proposed a dynamic Siamese network that can be trained as a whole on labeled video sequences, fully utilizing the rich spatiotemporal information of moving objects and achieving improved accuracy with an acceptable loss of speed. RASNet [39] used three attention mechanisms to weight the spatial and channel dimensions of SiamFC features, enhancing the network’s discriminative ability by decoupling feature extraction from discriminative analysis. SASiam [40] established a Siamese network containing semantic and appearance branches; the two branches are trained separately to maintain specificity and combined at test time to improve accuracy. However, these methods require multi-scale testing to cope with scale changes and cannot handle aspect-ratio changes caused by changes in target appearance.
To obtain more accurate target bounding boxes, B. Li et al. [41] introduced a region proposal network (RPN) [42] into the Siamese framework, improving accuracy and speed simultaneously. SiamRPN++ [43] further adopted a deeper backbone and a feature aggregation architecture to exploit the potential of deep networks within the Siamese paradigm and improve tracking accuracy. SiamMask [44] introduced a mask branch to achieve target tracking and image segmentation simultaneously. Xu et al. [45] proposed a set of criteria for estimating the target state in tracker design and, based on SiamFC, designed a new Siamese network, SiamFC++. DaSiamRPN [46] introduced existing detection datasets to enrich the positive samples and hard negative samples, improving the generalization and discrimination ability of trackers, and adopted a local-to-global strategy to achieve good accuracy in long-term tracking.
Anchor-free methods use per-pixel regression to predict four offsets at each pixel, reducing the hyperparameters introduced by RPNs. SiamBAN [47] proposed a tracking framework with multiple adaptive heads that requires neither multi-scale search nor predefined candidate boxes, directly classifying objects and regressing bounding boxes in a unified network. SiamCAR [48] added a centerness branch to help locate the target center point and further improve tracking accuracy. Recently, the Transformer [49] has been integrated into the Siamese framework to model global information and improve tracking performance.
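The similarity-learning core shared by SiamFC [36] and its successors can be sketched as follows (a minimal PyTorch illustration with a hypothetical toy backbone standing in for the real feature extractor; actual trackers add the RPN, anchor-free, or mask heads and fusion schemes described above):

```python
import torch
import torch.nn.functional as F

def siamese_response(backbone: torch.nn.Module,
                     template: torch.Tensor,   # exemplar crop, e.g. (1, 3, 127, 127)
                     search: torch.Tensor      # search region, e.g. (1, 3, 255, 255)
                     ) -> torch.Tensor:
    """SiamFC-style tracking step: embed both crops with a shared backbone,
    then cross-correlate the template feature over the search feature.
    The peak of the response map indicates the target location."""
    z = backbone(template)                     # (1, C, Hz, Wz) template features
    x = backbone(search)                       # (1, C, Hx, Wx) search features
    # Use the template feature map as a correlation kernel over the search map.
    response = F.conv2d(x, z)                  # (1, 1, Hx-Hz+1, Wx-Wz+1)
    return response

# Hypothetical usage with a toy single-conv backbone (a stand-in for a real CNN):
backbone = torch.nn.Conv2d(3, 64, kernel_size=3, stride=2)
resp = siamese_response(backbone, torch.rand(1, 3, 127, 127), torch.rand(1, 3, 255, 255))
peak = resp.flatten().argmax()                 # coarse target position in the response map
```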
Regarding target tracking under low-light conditions, a DCF framework integrated with a low-light enhancer was proposed in [50]; however, it is limited to hand-crafted features and lacks transferability. Ye et al. [51] developed a new unsupervised domain adaptation framework that uses a day-night feature discriminator to adversarially adapt a daytime tracking model for nighttime tracking. Overall, targeted research on this problem remains insufficient.