Few-Shot Object Detection: History

Few-shot object detection (FSOD) aims at designing models that can accurately detect targets of novel classes in a scarce data regime. 

  • few-shot object detection
  • object detection
  • deep learning

1. Introduction

Object detection (OD) via deep learning approaches [1][2][3][4][5] in computer vision has experienced tremendous progress in recent years. However, existing object detection models rely heavily on a substantial amount of annotated data and require long training times to achieve exceptional performance. Demonstrating good performance with a limited number of annotated samples is challenging. Object detection via few-shot learning methods, called few-shot object detection (FSOD), is a promising research branch for overcoming data scarcity.
Few-shot learning [6][7][8][9][10] aims at designing models that can operate successfully in a limited data regime. By leveraging few-shot learning, significant achievements have been made in few-shot classification (FSC) tasks [11][12][13][14]. FSOD is a more challenging research area that requires the simultaneous accomplishment of both novel object classification and localization. Recently, most FSOD research has focused on meta-learning approaches [15][16][17], which leverage support images to guide the detector in classifying and localizing novel-class objects. A crucial research question in meta-learning-based FSOD is how to aggregate features from support images and query images effectively.
Existing meta-learning-based FSOD methods [18][19][20][21][22][23] aggregate query features and support features, which are generated by a feature extractor called the backbone. However, the features extracted by the backbone network are coarse and may not highlight the key feature information of the samples. In other words, these methods do not effectively utilize precise features to accomplish feature aggregation, which leads to unsatisfactory detection performance. Training a model with a limited number of annotated samples means that the model must quickly focus on the recognizable feature representations of novel-class objects. Highlighting the key feature representations of objects (particularly objects from different classes that are highly similar) is therefore one of the central challenges in FSOD. In fact, several feature processing methods proposed for object detection or image segmentation address the problem of feature enhancement from various perspectives. SENet [24] learns channel weights to determine the importance of each channel and rescales feature maps channel-wise according to these weights during fusion. GCNet [25] utilizes a self-attention mechanism to capture global contextual information and integrate it with local features when fusing multiple feature maps.
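To make the channel-weighting idea behind SENet concrete, the following is a minimal PyTorch sketch of a squeeze-and-excitation block; the reduction ratio and tensor shapes are illustrative choices rather than the configuration used in the original paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: global-average-pool the spatial
    dimensions ("squeeze"), learn per-channel weights with a small MLP
    ("excitation"), and rescale the input feature map channel by channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # squeeze: (B, C) channel descriptor
        return x * weights.view(b, c, 1, 1)     # excitation: channel-wise rescaling

features = torch.randn(2, 64, 32, 32)           # toy feature map
print(SEBlock(64)(features).shape)              # torch.Size([2, 64, 32, 32])
```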
A limited number of annotated samples in a novel-class dataset implies that the dataset contains only a finite amount of feature information about objects from novel classes. Traditional object detection methods are prone to overfitting and poor generalization in data-scarce situations. The limited availability of annotated samples hampers the model’s ability to learn robust representations and generalize well to unseen samples of novel classes. Commonly used geometric transformations for data augmentation include flipping, rotation, cropping, scaling, and translation at the image level, typically applied during data pre-processing. Performing data augmentation at the pre-processing stage can effectively increase the diversity of inputs to a certain extent. Some data-augmentation-based few-shot learning methods train a hallucinator [6][26] to generate proposals or images containing novel-class objects by transferring knowledge from base classes to novel classes. Although the augmentation is still performed at the image level, the knowledge transferred from base classes is new to the unseen classes, which greatly enriches the training data of novel classes. A spatial transformer network (STN) [27] allows neural networks to actively manipulate and reason about the spatial transformations within input data. This enables a neural network to learn spatial invariance and perform geometric transformations on its input, such as translation, rotation, scaling, and cropping.
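As an illustration of the image-level geometric augmentation mentioned above, the following is a minimal sketch using torchvision transforms; the probabilities, angles, and crop size are arbitrary illustrative choices, and for detection the bounding boxes would additionally have to be transformed consistently with the image.

```python
import torchvision.transforms as T

# Illustrative pre-processing pipeline combining common geometric augmentations.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                    # flipping
    T.RandomRotation(degrees=15),                     # rotation
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping + scaling
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    T.ToTensor(),
])

# augmented = augment(pil_image)  # applied to a PIL image during data loading
```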

2. Few-Shot Object Detection

2.1. Object Detection

Localizing and identifying objects of interest in an image is the problem of object detection: the detector must predict each object’s bounding box and its correct category. With the advent of deep learning, CNN-based methods, which fall into two groups, two-stage and single-stage detectors, have emerged as the dominant paradigm in object detection. Two-stage detectors, including RCNN [28] and its variants [1][29][30][31][32], generate region proposals using a separate module called the region proposal module. RCNN [28] was the first technique to use a CNN to boost detection performance. Its region proposal module uses selective search to pick out proposals that are very likely to contain objects; the features of these proposals are then extracted into vectors by the CNN and finally classified by an SVM classifier. SPP Net [29] places the convolutional layers before the region proposal module, removing the need to resize input images to a uniform size and thereby avoiding the object deformation caused by input warping. Both RCNN and SPP Net are slow because their multiple components are trained separately. Fast RCNN [30] was proposed to solve this issue with an end-to-end trainable network: it replaces the pyramidal pooling layers with an RoI pooling layer, which associates the feature maps with proposals. Faster RCNN [1] introduces anchor boxes through a fully convolutional network called the region proposal network (RPN), thereby making the detector run faster. R-FCN [31] addresses the translation-invariance issue in CNNs while sharing most of the computation within the model. Mask RCNN [32] replaces the RoI pooling layer of Faster RCNN with RoIAlign and adds a mask head, parallel to the classification and box regression heads, that classifies each pixel within the proposals. Single-stage detectors such as YOLO [33], its variants [34][35], and SSD [36] perform classification and box regression simultaneously. YOLO [33] treats detection as a regression problem, using a fully connected layer to classify and locate objects. SSD [36] uses multiple hierarchical feature maps after feature extraction and performs regression for object position coordinates and classification. While single-stage detectors run faster than two-stage detectors, two-stage detectors generally offer better accuracy.
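As a minimal usage sketch of a two-stage detector, the following runs torchvision's pre-trained Faster R-CNN (backbone, RPN, and RoI heads) on a dummy image; depending on the torchvision version, the pre-trained weights may need to be requested with pretrained=True instead of weights="DEFAULT".

```python
import torch
import torchvision

# Load a pre-trained two-stage detector (backbone + RPN + RoI heads).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)      # dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])     # one dict of predictions per input image

# Each prediction contains bounding boxes, class labels, and confidence scores.
print(predictions[0]["boxes"].shape, predictions[0]["labels"].shape)
```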

2.2. Few-Shot Learning

Recent deep learning methods require substantial computation and resources since they train models with a great deal of data. Few-shot learning (FSL) refers to machine learning methods that can learn new knowledge or concepts from only a few training examples, and it is widely used in few-shot classification (FSC). Transferring knowledge from the base-class domain to the novel-class domain is the core goal of FSL. Meta-learning [37] is employed in most few-shot learning methods and is considered the basic technique for FSL. Metric-based methods [8][9][10][38] learn a distance function to measure the distance between two samples. The siamese neural network (SNN) [38] uses a pair of weight-shared CNNs and takes a pair of samples as input for image recognition; the network is trained to determine whether the two samples belong to the same category. The matching network [9] computes the cosine similarity between the embeddings of support and query images, unlike the L1 distance used in SNN. The prototypical network [10] encodes the query and support images into embedding vectors, and the prototype of each class is defined as the average of the embedding vectors of the support images in that class. The network makes predictions by calculating the squared Euclidean distance between the query’s embedding vector and each class prototype, which represents the similarity between the query image and each class. The relation network [8] utilizes a CNN to produce the similarity score instead of calculating a similarity metric with a fixed distance function. Optimization-based methods [39][40][41] aim to achieve good performance by optimizing the model on limited training data. The LSTM meta-learner [39] is adapted from long short-term memory (LSTM); it first generates model parameters on the training set and then optimizes them on the test set. Model-Agnostic Meta-Learning (MAML) [40] finds good initialization parameters that allow the model to adapt quickly to new tasks with only a few samples. Meta-Transfer Learning (MTL) [41] employs a pretrained deep neural network (DNN) to extract features and completes the meta-training procedure on the last layer of the classifier. Some model-based methods [42][43] design the model architecture specifically for the task at hand. Some fine-tuning-based methods [44][45] leverage transfer learning [46] to transfer knowledge from a related task on which the model has already been trained.
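The prototypical-network prediction rule described above fits in a few lines; the following sketch assumes the support and query embeddings have already been produced by some encoder and uses toy dimensions.

```python
import torch

def prototypical_predict(support_emb, support_labels, query_emb, n_classes):
    """Classify query embeddings by squared Euclidean distance to class
    prototypes, where each prototype is the mean support embedding of a class."""
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                                # (n_classes, dim)
    dists = torch.cdist(query_emb, prototypes) ** 2   # squared Euclidean distances
    return (-dists).softmax(dim=1)                    # closer prototype -> higher probability

# Toy 3-way 5-shot episode with 4-dimensional embeddings.
support = torch.randn(15, 4)
labels = torch.arange(3).repeat_interleave(5)
query = torch.randn(6, 4)
print(prototypical_predict(support, labels, query, n_classes=3).shape)  # (6, 3)
```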

2.3. Few-Shot Object Detection

Leveraging a vast amount of annotated images, general object detection networks perform excellently. Few-shot object detection (FSOD) is the difficult task of learning to detect objects of novel classes from just one or a few examples per class. Since localization is an additional task, FSOD is more complicated than FSC. Existing FSOD methods can be divided into two categories: fine-tuning-based methods and meta-learning methods. Fine-tuning-based methods [47][48][49][50], also called transfer-learning-based methods, aim to improve the detection performance on novel classes by transferring the knowledge learned from base classes to novel classes. TFA [47] employs the two-stage Faster RCNN framework and assumes that the features extracted by the backbone and RPN are class-agnostic: after the entire framework has been trained on base-class data in the first stage, the weights of the feature extractor are frozen, and only the parameters of the box classifier and box regressor are fine-tuned in the second stage. MPSR [48] uses an independent branch to process each object and resize its feature maps to various scales, and the model refines its predictions with these multi-scale positive samples. FSCE [49] introduces a contrastive head, parallel to the box classifier and box regressor, to measure similarity scores between proposal embeddings; leveraging the contrastive head and a contrastive proposal encoding loss, FSCE enlarges the distances between different clusters and increases the generalizability of the model. Meta-learning methods [18][19][20][21][51] use a siamese network with a query branch and a support branch to improve generalizability. FSRW [19] learns reweighting coefficients from a few samples, measuring the intrinsic importance of novel-class features, on an end-to-end YOLOv2 framework. MetaDet [51] fine-tunes a weight-prediction meta-model to predict the parameters of class-specific components from a few examples of novel classes. Meta RCNN [18] applies meta-learning over RoI features and introduces a predictor-head remodeling network (PRN) that shares a backbone with Faster RCNN; the PRN employs channel-wise soft attention to generate class-attentive vectors that are used to remodel the RoI features. DCNet [20] and DAnA [21] improve detection performance by proposing attention-based aggregation modules: DAnA highlights the relevant semantic features of support images through dual-awareness attention and incorporates the spatial correlations between query and support features, while DCNet utilizes a similar co-attention module.
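A minimal sketch of the TFA-style two-stage fine-tuning recipe is given below, using torchvision's Faster R-CNN as a stand-in; note that TFA additionally replaces the linear classifier with a cosine-similarity classifier, which is omitted here, and the learning rate is an illustrative choice.

```python
import torch
import torchvision

# Stage 1 (not shown): train the full Faster R-CNN on abundant base-class data.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Stage 2: freeze the (assumed class-agnostic) feature extractor and RPN, and
# fine-tune only the box classifier and box regressor on the few novel-class shots.
for p in model.parameters():
    p.requires_grad = False
for p in model.roi_heads.box_predictor.parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```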

This entry is adapted from the peer-reviewed paper 10.3390/electronics12194036

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
  2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), Computational and Biological Learning Society, San Diego, CA, USA, 7–9 May 2015.
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  4. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  5. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
  6. Wang, Y.X.; Girshick, R.; Hebert, M.; Hariharan, B. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7278–7286.
  7. Wu, J.; Dong, N.; Liu, F.; Yang, S.; Hu, J. Feature hallucination via maximum a posteriori for few-shot learning. Knowl.-Based Syst. 2021, 225, 107129.
  8. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
  9. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 3630–3638.
  10. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4077–4087.
  11. Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint distribution matters: Deep brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7972–7981.
  12. Yang, Z.; Wang, J.; Zhu, Y. Few-shot classification with contrastive learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 293–309.
  13. Guo, Y.; Du, R.; Li, X.; Xie, J.; Ma, Z.; Dong, Y. Learning calibrated class centers for few-shot classification by pair-wise similarity. IEEE Trans. Image Process. 2022, 31, 4543–4555.
  14. Bendou, Y.; Hu, Y.; Lafargue, R.; Lioi, G.; Pasdeloup, B.; Pateux, S.; Gripon, V. Easy—Ensemble augmented-shot-y-shaped learning: State-of-the-art few-shot classification with simple components. J. Imaging 2022, 8, 179.
  15. Chi, Z.; Gu, L.; Liu, H.; Wang, Y.; Yu, Y.; Tang, J. Metafscil: A meta-learning approach for few-shot class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14166–14175.
  16. Feng, Y.; Chen, J.; Xie, J.; Zhang, T.; Lv, H.; Pan, T. Meta-learning as a promising approach for few-shot cross-domain fault diagnosis: Algorithms, applications, and prospects. Knowl.-Based Syst. 2022, 235, 107646.
  17. Lee, K.; Maji, S.; Ravichandran, A.; Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10657–10665.
  18. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta r-cnn: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9577–9586.
  19. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429.
  20. Hu, H.; Bai, S.; Li, A.; Cui, J.; Wang, L. Dense relation distillation with context-aware aggregation for few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 10185–10194.
  21. Chen, T.I.; Liu, Y.C.; Su, H.T.; Chang, Y.C.; Lin, Y.H.; Yeh, J.F.; Chen, W.C.; Hsu, W. Dual-awareness attention for few-shot object detection. IEEE Trans. Multimed. 2021, 25, 291–301.
  22. Zhang, G.; Luo, Z.; Cui, K.; Lu, S.; Xing, E.P. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: Piscataway, NJ, USA, 2022; pp. 1–12.
  23. Huang, L.; Dai, S.; He, Z. Few-shot object detection with dense-global feature interaction and dual-contrastive learning. Appl. Intell. 2023, 53, 14547–14564.
  24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
  25. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
  26. Zhang, W.; Wang, Y.X. Hallucination improves few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13008–13017.
  27. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025.
  28. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
  30. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  31. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387.
  32. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
  33. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  34. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  35. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  36. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
  37. Schaul, T.; Schmidhuber, J. Metalearning. Scholarpedia 2010, 5, 4650.
  38. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese Neural Networks for One-Shot Image Recognition. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2015.
  39. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
  40. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  41. Sun, Q.; Liu, Y.; Chua, T.S.; Schiele, B. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 403–412.
  42. Munkhdalai, T.; Yu, H. Meta networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2554–2563.
  43. Cai, Q.; Pan, Y.; Yao, T.; Yan, C.; Mei, T. Memory matching networks for one-shot image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4080–4088.
  44. Wang, Y.; Chao, W.L.; Weinberger, K.Q.; Van Der Maaten, L. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv 2019, arXiv:1911.04623.
  45. Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J.B.; Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 266–282.
  46. Torrey, L.; Shavlik, J. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010; pp. 242–264.
  47. Wang, X.; Huang, T.; Gonzalez, J.; Darrell, T.; Yu, F. Frustratingly Simple Few-Shot Object Detection. In Proceedings of the International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 9919–9928.
  48. Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 456–472.
  49. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 7352–7362.
  50. Qiao, L.; Zhao, Y.; Li, Z.; Qiu, X.; Wu, J.; Zhang, C. Defrcn: Decoupled faster r-cnn for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8681–8690.
  51. Wang, Y.X.; Ramanan, D.; Hebert, M. Meta-learning to detect rare objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9925–9934.