of the number of class instances in an image. To alleviate the gap between fully supervised and weakly supervised approaches,
point annotations have also been exploited. The idea behind these annotations is that point labels are far cheaper to obtain than BBs while still significantly increasing model performance. Nonetheless, no weakly supervised method has yet matched fully supervised performance.
The characteristics of Remote Sensing Images (RSIs) introduce additional challenges.
3. Weakly Supervised Object Detection Approaches
As reported by Zou et al. [7], in the past two decades, the progress of Object Detection (OD) has gone through two periods: the “traditional object detection period (before 2014)” and the “deep-learning-based object detection period (after 2014)”. As more specific branches of OD, both WSOD and RSWSOD have gone through the same historical phases.
Figure 1 presents a timeline of FSOD, WSOD, and RSWSOD with important milestones (indicated by a green flag) for each task. More specifically, during the traditional object detection period, most WSOD approaches relied on support vector machines (SVMs), the MIL framework [8][9], and low- and mid-level handcrafted features (e.g., SIFT [10] and HOG [11]). These methods obtained promising results on natural images but were difficult to apply to RSIs due to the previously discussed difficulties. With the advent of DL, OD architectures became more powerful and achieved successful results in many fields, but they required large amounts of annotated data. For this reason, many researchers shifted their focus to weakly supervised approaches. In Figure 1, it is interesting to note that most RSWSOD methods were developed after specific WSOD milestones: CAM, WSDDN, and OICR.
Figure 1. Timeline of the milestones in RSWSOD, compared with FSOD [7][12] and WSOD [2] over the years. The flag symbol represents milestones, while a simple line denotes other relevant methods. For clarity, not all methods are reported [13][14].
Several approaches have been proposed to address the WSOD task. Four major categories can be identified depending on how the detector is trained:
- TSI + TDL-based: These approaches are based on a simple two-stage framework: training set initialization (TSI) and target detector learning (TDL).
- MIL-based: These approaches are based on the Multiple Instance Learning (MIL) framework.
- CAM-based: These approaches are based on Class Activation Maps (CAMs), a well-known explainability technique.
- Other DL-based: A few methods reformulate the RSWSOD problem starting from the implicit results of other tasks, e.g., Anomaly Detection (AD).
3.1. TSI + TDL-Based
Before the advent of Deep Learning (DL), most object detectors were based on SVMs. These methods start by producing candidate proposals using either a Sliding Window (SW) [15][16][17] or a Saliency-based Self-adaptive Segmentation (Sb-SaS) [14][18][19][20] approach. SW generates proposals by sliding multiple BBs of different scales over the entire image, while Sb-SaS produces saliency maps that measure the uniqueness of each pixel in the image and exploits a multi-threshold segmentation mechanism to produce BBs. Both methods try to deal with the variation in target size and image resolution. Each proposal is characterized by a set of low- and mid-level features derived from methods such as SIFT [10] and HOG [11]. The extracted features can be further manipulated to produce high-level ones. Then, positive and negative candidates are chosen to initialize the training set. The training procedure is composed of two steps: (1) training the detector and (2) updating the training set by modifying the positive and negative candidates. These steps are repeated until a stopping condition is met.
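The sliding-window proposal stage can be sketched as follows. This is a minimal illustration; the window scales and stride fraction are hypothetical parameters chosen for the example, not values taken from the cited works:

```python
import numpy as np

def sliding_window_proposals(img_h, img_w, scales=(32, 64, 128), stride_frac=0.5):
    """Generate candidate BBs (x1, y1, x2, y2) by sliding square
    windows of several scales over the image with 50% overlap."""
    proposals = []
    for s in scales:
        stride = max(1, int(s * stride_frac))
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                proposals.append((x, y, x + s, y + s))
    return np.asarray(proposals)

# Proposals for a 256x256 image; each row is one candidate box.
boxes = sliding_window_proposals(256, 256)
```

Each proposal would then be described by handcrafted features (e.g., HOG) and scored by the SVM inside the iterative TSI + TDL loop.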
With the advent of convolutional neural networks (CNNs) [21], both WSOD and RSWSOD methods started to benefit from the powerful feature extraction capabilities of deep architectures. In 2015, Zhou et al. [19][20] proposed exploiting transfer learning on a CNN to extract high-level features to feed to an SVM-based detector. The scholars further highlighted the importance of the process used to select negative instances for training. Most previous methods select negative samples at random, which may cause performance to deteriorate or fluctuate during the iterative training procedure, because negative samples that are visually similar to positive samples tend to be easily misclassified. Thus, selecting ambiguous negative samples is fundamental to enhancing the effectiveness and robustness of the classifier. The scholars proposed using negative bootstrapping instead of random selection of negative samples, building a more robust detector. This technique is still used in modern state-of-the-art methods.
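The negative bootstrapping idea can be sketched in a few lines (function and variable names are hypothetical; the cited works define the selection criterion in more detail): instead of sampling negatives at random, the current detector's scores are used to pick the negatives it confuses most.

```python
import numpy as np

def bootstrap_negatives(scores, is_negative, k):
    """Select the k hardest negatives: candidates without the target
    class whose current detector score is highest (most ambiguous)."""
    neg_idx = np.flatnonzero(is_negative)
    hardest = neg_idx[np.argsort(scores[neg_idx])[::-1]]  # descending score
    return hardest[:k]

# Toy example: 6 candidates with detector scores in [0, 1].
scores = np.array([0.9, 0.8, 0.2, 0.7, 0.1, 0.6])
is_negative = np.array([False, True, True, True, True, False])
hard = bootstrap_negatives(scores, is_negative, k=2)  # indices 1 and 3
```

The selected hard negatives are added to the training set at each iteration, so the classifier keeps being confronted with the samples it currently misclassifies.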
3.2. MIL-Based
In MIL-based approaches, each image is treated as a collection of potential instances of the object to be found. Typically, MIL-based WSOD follows a three-step pipeline: proposal generation, feature extraction, and classification.
Proposal generation aims to extract a certain number of regions of interest, i.e., areas that may contain object instances, from the image. This can be accomplished in several ways, the most basic being the Sliding Window. More advanced and efficient proposal generation methods have been proposed, such as Selective Search (SS) [3], which leverages the advantages of both exhaustive search and segmentation to generate initial proposals, or Edge Boxes (EB) [4], which uses object edges to generate proposals. These methods are built to have a high recall, so the generated candidates are likely to contain an object instance. However, they are very time-consuming. To address this issue, it is possible either to exploit CAM-based approaches, which have no region proposal generation step, or to integrate the region proposal generation and feature extraction steps directly into the network using a Region Proposal Network (RPN). The latter exploits CNNs, can extract more relevant features for the areas of interest, and speeds up the process. Despite their advantages, RPNs are not widely used in WSOD since traditional techniques have proven to work well on natural images.
Feature extraction is needed to compute a feature vector for each candidate region extracted in the previous step. Features can be handcrafted or extracted by a CNN as in DL methods.
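As a toy illustration of the handcrafted route, a minimal HOG-like orientation histogram can be computed with numpy alone. This is a simplified sketch, not the full HOG of [11], which additionally uses cells, blocks, and overlapping normalization:

```python
import numpy as np

def orientation_histogram(patch, n_bins=9):
    """Toy HOG-like descriptor: histogram of gradient orientations,
    weighted by gradient magnitude, over a grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)          # L1-normalized feature vector

# Descriptor for one 16x16 proposal crop (random toy data).
rng = np.random.default_rng(0)
feat = orientation_histogram(rng.random((16, 16)))
```

Each candidate region yields one such fixed-length vector, which is what the classification step consumes regardless of whether the features are handcrafted or CNN-based.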
Classification is the last step and performs WSOD by reformulating the problem as a MIL classification task. The MIL problem was first introduced in [8]. In image classification, each image is considered a bag containing a set of feature vectors to be classified (one for each region proposal). For the training step, each image (or bag) is assigned a positive or negative label based only on the image-level label, i.e., the presence or absence of a specific class. Thus, an image can be represented as a positive bag for one class and as a negative bag for another class not present in the image (Figure 2). The aim is to infer instance-level labels for the proposals inside each image.
Figure 2. An example of positive/negative bags for the “airplane” class. Bounding boxes correspond to the proposals input to the network; blue indicates positive instances, while red indicates negative instances. MIL-based WSOD aims to differentiate between positive and negative instances based only on image-level labels.
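Under the standard MIL assumption, a bag is positive if and only if at least one of its instances is positive, so image-level scores can be obtained from proposal scores by max-pooling. A minimal numpy sketch follows; the scores are toy values, not outputs of any cited method:

```python
import numpy as np

# Per-proposal class scores for one image: 4 proposals x 2 classes
# (e.g., "airplane" and "car"); values are illustrative only.
instance_scores = np.array([[0.10, 0.20],
                            [0.90, 0.10],
                            [0.30, 0.05],
                            [0.20, 0.40]])

# Bag (image-level) score: max over instances, so the image-level
# label can supervise training without any box annotations.
bag_scores = instance_scores.max(axis=0)        # [0.9, 0.4]

# The arg-max instance is the proposal "responsible" for each class,
# which yields a rough localization for the positive classes.
top_proposal = instance_scores.argmax(axis=0)   # [1, 3]
```

Methods such as WSDDN replace the hard max with softmax-weighted pooling so that the aggregation stays differentiable end to end.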
The influence of WSDDN and OICR also reached the remote sensing community. However, when these techniques were applied to RSIs, the performance drop was severe. For this reason, many researchers focused on solving the RSWSOD problem by improving WSOD techniques and adding new modules that could overcome RSI challenges. For example, Cao et al. [17] exploited MIL and density estimation to predict vehicle locations starting from region-level labels. In 2018 (one year after OICR), Sheng et al. proposed MIRN [22], a MIL-based approach that leverages count information and an online labeling and refinement strategy, inspired by OICR, to perform vehicle detection and solve the multiple-instance problem.
3.3. CAM-Based
CAM-based approaches formulate the WSOD problem as a localizable feature map learning problem. The idea comes from the observation that every convolutional unit in a CNN is essentially an object detector able to locate the target object in the image [23]. For example, if the object appears in the upper left corner of the image, the upper left corner of a feature map after a convolutional layer will produce a stronger response. These localization capabilities of CNNs have been further studied in other works such as [24][25].
Class activation maps [25] were introduced in 2016 as weighted activation maps (heatmaps) showing the areas that contribute most to the classification. CAMs do not require any additional labels or training: they are obtained by weighting the feature maps of the last convolutional layer with the class weights of the final fully connected layer. Bounding boxes can then be produced by thresholding the CAM values. Figure 3 shows an example. Since then, many different CAM variants and CAM-based methods have been proposed for WSOD and especially for weakly supervised object localization (WSOL) [2].
Figure 3. In the middle, an example of CAM for the “airplane” class obtained from the image on the left. On the right, the green BB is the ground truth, whereas the red BB is obtained by thresholding the CAM values.
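CAM computation and box extraction can be sketched as follows, following the formulation of [25] in which the class weights of the final fully connected layer reweight the last convolutional feature maps. The shapes, threshold fraction, and toy data below are illustrative assumptions:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM for one class: weighted sum over the channels of the last
    convolutional feature maps, using that class's FC weights."""
    # feature_maps: (C, H, W); fc_weights: (num_classes, C)
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

def cam_to_bbox(cam, thresh_frac=0.5):
    """Threshold the CAM at a fraction of its maximum and return the
    tight bounding box (x1, y1, x2, y2) of the surviving region."""
    ys, xs = np.where(cam >= thresh_frac * cam.max())
    return xs.min(), ys.min(), xs.max(), ys.max()

# Toy example: a single-channel 8x8 map with a hot 2x2 region.
fmap = np.zeros((1, 8, 8))
fmap[0, 1:3, 1:3] = 1.0
cam = class_activation_map(fmap, np.array([[1.0]]), class_idx=0)
box = cam_to_bbox(cam)  # (1, 1, 2, 2)
```

The choice of `thresh_frac` is exactly the thresholding step whose sensitivity is discussed below for aircraft detection in RSIs.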
In the remote sensing community, researchers have started to exploit CAM-based approaches for the task of aircraft detection. For example, in 2018, Li et al. [26] proposed a Siamese network to overcome the fact that existing methods tend to treat scenes in isolation and ignore the mutual cues between scene pairs when optimizing deep networks. Moreover, a multi-scale scene-sliding-voting strategy is implemented to produce the CAM and address the multi-scale problem. The authors further propose different methods for thresholding the CAM and observe that the detection results for each class depend strongly on the chosen thresholding method. Ji et al. [27] proposed a method that reduces the false detection rate affecting many aircraft detectors by producing a more accurate attention map.