Real-time UAV tracking refers to tracking that is completed within the time it takes the drone's airborne imaging device to acquire the image sequence. It recovers the target's motion parameters frame by frame, including its position, speed, acceleration, and motion trajectory.
1. Visual Tracking for UAV Videos
In recent years, owing to their advantages in performance and cost, unmanned aerial vehicles (UAVs) have been deployed in a growing number of fields, such as security monitoring, disaster relief, agriculture, military equipment, and sports and entertainment. As a result, a huge amount of visual data has been produced, and the demand for intelligent processing of UAV videos has increased significantly.
Owing to the release of new benchmark datasets and improved methodologies, single-target tracking has become a research hotspot, and related work has advanced considerably. In terms of technique, current mainstream single-target trackers fall into two categories: trackers based on the Discriminative Correlation Filter (DCF) and trackers based on deep learning. Minimum Output Sum of Squared Error (MOSSE) is one of the most representative DCF-based trackers. Trackers of this kind are fast and easy to port to embedded hardware platforms for real-time processing, but their tracking accuracy is relatively low, so they struggle to meet high-accuracy tracking requirements. Researchers subsequently proposed various improved DCF-based trackers, such as the Circulant Structure of tracking-by-detection with Kernels (CSK) tracker, the Kernelized Correlation Filters (KCF) tracker, and the Spatially Regularized Discriminative Correlation Filter (SRDCF) tracker. These trackers improve tracking accuracy significantly, but at the cost of a marked reduction in tracking speed.
With the rapid development of deep learning, many trackers based on Convolutional Neural Networks (CNNs) have emerged. Compared with the earlier trackers, they yield higher tracking accuracy. In UAV target tracking scenarios, however, challenges such as relatively small object sizes and frequent orientation changes degrade the performance of all of the above trackers to varying degrees. An accurate and efficient tracker is therefore still needed for target tracking in UAV videos.
2. Real-time UAV Tracker by MultiRPN-DIDNet
MultiRPN-DIDNet is a real-time target tracking method for UAV videos based on multiple Region Proposal Networks (RPNs) and a Distance-IoU Discriminative Network (DIDNet). First, an instance-based RPN suited to the target tracking task is constructed under the framework of a Siamese neural network. The RPN performs bounding box regression and classification, and a channel attention mechanism is integrated into it to improve the representative capability of the deep features. The RPNs built on Block 2, Block 3 and Block 4 of ResNet-50 each output their own Regression (Reg) coefficients and Classification (Cls) scores, which are weighted and then fused to determine high-quality region proposals. Second, a DIDNet, trained with the Distance-IoU loss, is designed to finely correct the candidate target's bounding box through the fusion of multi-layer features. Experimental results on the public UAV20L and DTB70 datasets show that, compared with state-of-the-art UAV trackers, MultiRPN-DIDNet obtains better tracking performance with fewer region proposals and correction iterations. As a result, the tracking speed reaches 33.9 frames per second (FPS), which meets the requirements of real-time tracking tasks.
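For reference, the Distance-IoU (DIoU) measure takes the ordinary IoU of two boxes and subtracts a penalty equal to the squared distance between the box centers divided by the squared diagonal of the smallest box enclosing both; the corresponding loss is one minus that value. The following is a minimal sketch for axis-aligned boxes. The function names and the (x1, y1, x2, y2) box format are assumptions made for illustration and are not taken from the DIDNet implementation.

```python
def diou(box_a, box_b):
    """Distance-IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
       + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - d2 / c2 if c2 > 0 else iou


def diou_loss(pred_box, target_box):
    """DIoU loss: 0 for a perfect match, growing as the boxes drift apart."""
    return 1.0 - diou(pred_box, target_box)
```

Unlike plain IoU, the value keeps changing when the boxes do not overlap at all, because the center-distance term still varies, which is what makes it usable as a regression signal for box correction.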
As shown in Figure 1, the proposed single-object tracking method consists of multiple RPNs and a DIoU discriminative network, with ResNet-50 as the backbone network for feature extraction. Each RPN is constructed under the framework of a Siamese neural network (SNN) and performs bounding box regression and classification. The RPNs built on Block 2, Block 3 and Block 4 of ResNet-50 each output their own Reg coefficients and Cls scores. These outputs are weighted by a set of weight coefficients learned offline and then fused to obtain the final Reg coefficients and Cls scores. Foreground anchors with higher Cls scores are selected, and the corresponding region proposals are determined by applying the Reg coefficients of those anchors. The convolutional features from multiple layers of ResNet-50 are also fused. The fused features and the information about the candidate regions are fed into the DIoU discriminative network, and the region proposal with the best DIoU value is taken as the tracking result.
Figure 1. Target tracking framework combining multiple RPNs and a DIoU discriminative network. The multiple RPNs determine high-quality candidate regions; the DIoU discriminative network corrects these regions and outputs the final result.
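The weighted fusion of the three RPN outputs and the selection of a region proposal can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the anchor format (cx, cy, w, h), the common RPN regression parameterization (dx, dy, dw, dh), the choice of a single top-scoring anchor, and all function names are hypothetical. In the method described above, the weights are learned offline; here they are simply passed in as numbers.

```python
import math


def fuse_scores(branch_scores, weights):
    """Weighted sum of per-anchor Cls scores over the RPN branches."""
    # zip(*branch_scores) groups the scores of one anchor across branches.
    return [sum(w * s for w, s in zip(weights, per_anchor))
            for per_anchor in zip(*branch_scores)]


def fuse_regs(branch_regs, weights):
    """Weighted sum of per-anchor (dx, dy, dw, dh) over the RPN branches."""
    return [tuple(sum(w * reg[k] for w, reg in zip(weights, per_anchor))
                  for k in range(4))
            for per_anchor in zip(*branch_regs)]


def decode(anchor, reg):
    """Apply regression coefficients to an anchor (assumed parameterization)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = reg
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def best_proposal(anchors, branch_scores, branch_regs, cls_weights, reg_weights):
    """Fuse the branch outputs and decode the highest-scoring anchor."""
    scores = fuse_scores(branch_scores, cls_weights)
    regs = fuse_regs(branch_regs, reg_weights)
    best = max(range(len(scores)), key=scores.__getitem__)
    return decode(anchors[best], regs[best]), scores[best]


if __name__ == "__main__":
    # Two anchors, three branches (standing in for Block 2, 3 and 4).
    anchors = [(10.0, 10.0, 4.0, 4.0), (20.0, 20.0, 8.0, 8.0)]
    branch_scores = [[0.2, 0.9], [0.4, 0.7], [0.6, 0.5]]
    regs = [(0.0, 0.0, 0.0, 0.0), (0.5, 0.0, 0.0, 0.0)]
    branch_regs = [regs, regs, regs]
    weights = [0.2, 0.3, 0.5]  # placeholder values, not learned weights
    print(best_proposal(anchors, branch_scores, branch_regs, weights, weights))
```

A tracker following the description above would pass the resulting proposals, together with the fused multi-layer features, to the DIoU discriminative network for correction; that network is not modeled in this sketch.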