of the number of class instances in an image. To alleviate the gap between fully supervised and weakly supervised approaches,
point annotations have also been exploited. The idea behind these annotations is that point labels are far cheaper to obtain than BBs while still significantly increasing model performance. Nonetheless, no weakly supervised method has yet matched fully supervised performance.
The characteristics of Remote Sensing Images (RSIs) introduce additional challenges.
3. Weakly Supervised Object Detection Approaches
As reported by Zou et al. [7], in the past two decades, the progress of Object Detection (OD) has gone through two periods: the “traditional object detection period (before 2014)” and the “deep-learning-based object detection period (after 2014)”. As more specific branches of OD, both WSOD and RSWSOD have gone through the same historical phases.
Figure 1 presents a timeline of FSOD, WSOD, and RSWSOD with important milestones (indicated by a green flag) for each task. More specifically, during the traditional object detection period, most WSOD approaches relied on support vector machines (SVMs), the MIL framework [8][9], and low- and mid-level handcrafted features (e.g., SIFT [10] and HOG [11]). These methods obtained promising results on natural images but were difficult to apply to RSIs due to the previously discussed difficulties. With the advent of DL, OD architectures became more powerful and achieved successful results in many fields, but they required large amounts of annotated data. For this reason, many researchers shifted their focus to weakly supervised approaches. In Figure 1, it is interesting to note that most RSWSOD methods were developed after specific WSOD milestones: CAM, WSDDN, and OICR.
Figure 1. Timeline of the milestones in RSWSOD, compared with FSOD [7][12] and WSOD [2] over the years. The flag symbol represents milestones, while a simple line denotes other relevant methods. For clarity, not all methods are reported [13][14].
Several approaches have been proposed to address the WSOD task. Four major categories can be identified depending on how the detector is trained:
- TSI + TDL-based: These approaches are based on a simple two-stage framework: training set initialization (TSI) and target detector learning (TDL).
- MIL-based: These approaches are based on the Multiple Instance Learning (MIL) framework.
- CAM-based: These approaches are based on Class Activation Maps (CAMs), a well-known explainability technique.
- Other DL-based: A few methods reformulate the RSWSOD problem starting from the implicit results of other tasks, e.g., Anomaly Detection (AD).
3.1. TSI + TDL-Based
Before the advent of Deep Learning (DL), most object detectors were based on SVMs. These methods start by producing candidate proposals using either a Sliding Window (SW) [15][16][17] or a Saliency-based Self-adaptive Segmentation (Sb-SaS) [14][18][19][20] approach. SW generates proposals by sliding multiple BBs of different scales over the entire image, while Sb-SaS produces saliency maps that measure the uniqueness of each pixel in the image and exploits a multi-threshold segmentation mechanism to produce BBs. Both methods try to deal with the variation in target size and image resolution. Each proposal is characterized by a set of low- and mid-level features derived from methods such as SIFT [10] and HOG [11]. The extracted features can be further manipulated to produce high-level ones. Then, positive and negative candidates are chosen to initialize the training set. The training procedure is composed of two steps: (1) training the detector and (2) updating the training set by modifying the positive and negative candidates. These steps are repeated until a stopping condition is met.
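The sliding-window proposal stage can be sketched as follows. This is a minimal illustration; the window scales and stride fraction are hypothetical parameters chosen for the example, not values taken from the cited works:

```python
import numpy as np

def sliding_window_proposals(img_h, img_w, scales=(32, 64, 128), stride_frac=0.5):
    """Generate candidate BBs (x1, y1, x2, y2) by sliding square
    windows of several scales over the image with 50% overlap."""
    proposals = []
    for s in scales:
        stride = max(1, int(s * stride_frac))
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                proposals.append((x, y, x + s, y + s))
    return np.asarray(proposals)

# Proposals for a 256x256 image; each row is one candidate box.
boxes = sliding_window_proposals(256, 256)
```

Each proposal would then be described by handcrafted features (e.g., HOG) and scored by the SVM inside the iterative TSI + TDL loop.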
With the advent of convolutional neural networks (CNNs) [21], both WSOD and RSWSOD methods started to benefit from the powerful feature extraction capabilities of deep architectures. In 2015, Zhou et al. [19][20] proposed exploiting transfer learning on a CNN to extract high-level features to feed to an SVM-based detector. The scholars further highlighted the importance of the process used to select negative instances for training. Most previous methods select negative samples at random, which may cause performance to deteriorate or fluctuate during the iterative training procedure, because negative samples that are visually similar to positive samples tend to be easily misclassified. Thus, selecting ambiguous negative samples is fundamental to enhancing the effectiveness and robustness of the classifier. The scholars proposed using negative bootstrapping instead of random selection of negative samples, building a more robust detector. This technique is still used in modern state-of-the-art methods.
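The negative bootstrapping idea can be sketched in a few lines (function and variable names are hypothetical; the cited works define the selection criterion in more detail): instead of sampling negatives at random, the current detector's scores are used to pick the negatives it confuses most.

```python
import numpy as np

def bootstrap_negatives(scores, is_negative, k):
    """Select the k hardest negatives: candidates without the target
    class whose current detector score is highest (most ambiguous)."""
    neg_idx = np.flatnonzero(is_negative)
    hardest = neg_idx[np.argsort(scores[neg_idx])[::-1]]  # descending score
    return hardest[:k]

# Toy example: 6 candidates with detector scores in [0, 1].
scores = np.array([0.9, 0.8, 0.2, 0.7, 0.1, 0.6])
is_negative = np.array([False, True, True, True, True, False])
hard = bootstrap_negatives(scores, is_negative, k=2)  # indices 1 and 3
```

The selected hard negatives are added to the training set at each iteration, so the classifier keeps being confronted with the samples it currently misclassifies.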
3.2. MIL-Based
In MIL-based approaches, each image is treated as a collection of potential instances of the object to be found. Typically, MIL-based WSOD follows a three-step pipeline: proposal generation, feature extraction, and classification.
Proposal generation aims to extract a certain number of regions of interest, i.e., areas that may contain object instances, from the image. This can be accomplished in several ways, the most basic being the Sliding Window. More advanced and efficient proposal generation methods have been proposed, such as Selective Search (SS) [3], which leverages the advantages of both exhaustive search and segmentation to generate initial proposals, or Edge Boxes (EB) [4], which uses object edges to generate proposals. These methods are built to have a high recall, so the generated candidates are likely to contain an object instance. However, they are very time-consuming. To address this issue, it is possible either to exploit CAM-based approaches, which have no region proposal generation step, or to integrate the region proposal generation and feature extraction steps directly into the network using a Region Proposal Network (RPN). The latter exploits CNNs, can extract more relevant features for the areas of interest, and speeds up the process. Despite their advantages, RPNs are not widely used in WSOD since traditional techniques have proven to work well on natural images.
Feature extraction is needed to compute a feature vector for each candidate region extracted in the previous step. Features can be handcrafted or extracted by a CNN as in DL methods.
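As a toy illustration of the handcrafted route, a minimal HOG-like orientation histogram can be computed with numpy alone. This is a simplified sketch, not the full HOG of [11], which additionally uses cells, blocks, and overlapping normalization:

```python
import numpy as np

def orientation_histogram(patch, n_bins=9):
    """Toy HOG-like descriptor: histogram of gradient orientations,
    weighted by gradient magnitude, over a grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)          # L1-normalized feature vector

# Descriptor for one 16x16 proposal crop (random toy data).
rng = np.random.default_rng(0)
feat = orientation_histogram(rng.random((16, 16)))
```

Each candidate region yields one such fixed-length vector, which is what the classification step consumes regardless of whether the features are handcrafted or CNN-based.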
Classification is the last step and performs WSOD by reformulating the problem as a MIL classification task. The MIL problem was first introduced in [8]. In image classification, each image is considered a bag containing a set of feature vectors to be classified (one for each region proposal). For the training step, each image (or bag) is assigned a positive or negative label based only on the image-level label, i.e., the presence or absence of a specific class. Thus, an image can be represented as a positive bag for one class and as a negative bag for another class not present in the image (Figure 2). The aim is to infer instance-level labels for the proposals inside each image.
Figure 2. An example of positive/negative bags for the “airplane” class. Bounding boxes correspond to the proposals input to the network; blue indicates positive instances, while red indicates negative instances. MIL-based WSOD aims to differentiate between positive and negative instances based only on image-level labels.
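Under the standard MIL assumption, a bag is positive if and only if at least one of its instances is positive, so image-level scores can be obtained from proposal scores by max-pooling. A minimal numpy sketch follows; the scores are toy values, not outputs of any cited method:

```python
import numpy as np

# Per-proposal class scores for one image: 4 proposals x 2 classes
# (e.g., "airplane" and "car"); values are illustrative only.
instance_scores = np.array([[0.10, 0.20],
                            [0.90, 0.10],
                            [0.30, 0.05],
                            [0.20, 0.40]])

# Bag (image-level) score: max over instances, so the image-level
# label can supervise training without any box annotations.
bag_scores = instance_scores.max(axis=0)        # [0.9, 0.4]

# The arg-max instance is the proposal "responsible" for each class,
# which yields a rough localization for the positive classes.
top_proposal = instance_scores.argmax(axis=0)   # [1, 3]
```

Methods such as WSDDN replace the hard max with softmax-weighted pooling so that the aggregation stays differentiable end to end.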
The influence of WSDDN and OICR also reached the remote sensing community. However, when these techniques were applied to RSIs, the performance drop was severe. For this reason, many researchers focused on solving the RSWSOD problem by improving WSOD techniques and adding new modules that could overcome RSI challenges. For example, Cao et al. [17] exploited MIL and density estimation to predict vehicle locations starting from region-level labels. In 2018 (one year after OICR), Sheng et al. proposed MIRN [22], a MIL-based approach that leverages count information and an online labeling and refinement strategy, inspired by OICR, to perform vehicle detection and solve the multiple-instance problem.
3.3. CAM-Based
CAM-based approaches formulate the WSOD problem as a localizable feature map learning problem. The idea comes from the observation that every convolutional unit in a CNN is essentially an object detector able to locate the target object in the image [23]. For example, if the object appears in the upper left corner of the image, the upper left corner of a feature map after a convolutional layer will produce a stronger response. These localization capabilities of CNNs have been further studied in other works such as [24][25].
Class activation maps [25] were introduced in 2016 as weighted activation maps (heatmaps) showing the areas that contribute most to the classification. CAMs do not require any additional labels or training: they are obtained by weighting the feature maps of the last convolutional layer with the class weights of the final fully connected layer. Bounding boxes can then be produced by thresholding the CAM values. Figure 3 shows an example. Since then, many different CAM variants and CAM-based methods have been proposed for WSOD and especially for weakly supervised object localization (WSOL) [2].
Figure 3. In the middle, an example of CAM for the “airplane” class obtained from the image on the left. On the right, the green BB is the ground truth, whereas the red BB is obtained by thresholding the CAM values.
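CAM computation and box extraction can be sketched as follows, following the formulation of [25] in which the class weights of the final fully connected layer reweight the last convolutional feature maps. The shapes, threshold fraction, and toy data below are illustrative assumptions:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM for one class: weighted sum over the channels of the last
    convolutional feature maps, using that class's FC weights."""
    # feature_maps: (C, H, W); fc_weights: (num_classes, C)
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

def cam_to_bbox(cam, thresh_frac=0.5):
    """Threshold the CAM at a fraction of its maximum and return the
    tight bounding box (x1, y1, x2, y2) of the surviving region."""
    ys, xs = np.where(cam >= thresh_frac * cam.max())
    return xs.min(), ys.min(), xs.max(), ys.max()

# Toy example: a single-channel 8x8 map with a hot 2x2 region.
fmap = np.zeros((1, 8, 8))
fmap[0, 1:3, 1:3] = 1.0
cam = class_activation_map(fmap, np.array([[1.0]]), class_idx=0)
box = cam_to_bbox(cam)  # (1, 1, 2, 2)
```

The choice of `thresh_frac` is exactly the thresholding step whose sensitivity is discussed below for aircraft detection in RSIs.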
In the remote sensing community, researchers have started to exploit CAM-based approaches for the task of aircraft detection. For example, in 2018, Li et al. [26] proposed a Siamese network to overcome the fact that existing methods tend to treat scenes in isolation and ignore the mutual cues between scene pairs when optimizing deep networks. Moreover, a multi-scale scene-sliding-voting strategy is implemented to produce the CAM and address the multi-scale problem. The authors further propose different methods for thresholding the CAM and observe that the detection results for each class depend strongly on the chosen thresholding method. Ji et al. [27] proposed a method that reduces the false detection rate affecting many aircraft detectors by producing a more accurate attention map.