Object detection is a complex problem because of high intra-class variance and low inter-class variance. High intra-class variance arises when objects of a single class look very different, for instance, humans in different poses or wearing different clothes. Low inter-class variance arises when objects of different classes look similar; for example, samples of the class chair can easily be misclassified as the class bench and vice versa.
Object detection is considered one of the most important and elementary tasks in computer vision. It deals with the identification and spatial localization of objects present in an image or a video [1]. Object detection also supports a wide range of other computer vision tasks, such as instance segmentation [2][3][4], visual question answering [5], image captioning [6][7], object tracking [8], activity recognition [9][10][11], and so on.
Early object detection algorithms relied on sliding windows, applying a classifier to each window to find objects [12][13][14]. Later, the sliding window concept was replaced with region proposals to narrow the search before applying classification [15][16][17][18][19]. The recent surge in deep learning has driven rapid progress in object detection systems, as it has in other fields.
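The sliding-window idea can be illustrated with a minimal sketch. The function name `sliding_windows`, the fixed window size, and the stride are assumptions made for illustration; they are not taken from the cited works, which typically also scan over several scales and aspect ratios.

```python
from typing import Iterator, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[int, int, int, int]


def sliding_windows(image_w: int, image_h: int,
                    win_w: int, win_h: int, stride: int) -> Iterator[Box]:
    """Yield every win_w x win_h window that fits fully inside the image."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            # In a detector, each window would be passed to a classifier here.
            yield (x, y, x + win_w, y + win_h)


# Hypothetical toy image of 8 x 6 pixels scanned with a 4 x 4 window.
windows = list(sliding_windows(image_w=8, image_h=6, win_w=4, win_h=4, stride=2))
print(len(windows), windows[0], windows[-1])
```

Even for this toy image the classifier would run six times. On real images, with multiple window sizes, the count grows quickly, which is the cost that region proposals aim to reduce.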
The prior published work in object detection can be further classified into three categories, which are explained below. Figure 1 depicts the basic difference between them:
1. Object Detection (OD): OD aims at detecting objects regardless of their class category [15][20]. OD algorithms [21][22][23][24] generally propose a large number of possible region proposals, from which, later on, the best possible candidates are selected according to certain criteria.
2. Salient Object Detection (SOD): SOD algorithms use the human attention mechanism concept to highlight and detect the objects in a picture or video [25][26].
3. Category-specific Object Detection (COD): COD aims at detecting multiple objects. Unlike OD and SOD, COD has to predict the category class and the location of the object in the image or video [14][27].
The deep learning-based object detection algorithms are categorized into two-stage object detectors and one-stage object detectors. Two-stage object detection architectures such as R-CNN [14], Fast R-CNN [28], and Faster R-CNN [21] separate the task of object localization from the object classification task. They first employ region proposal techniques to find candidate regions where an object is most likely to be present, and then classify these regions. Mask R-CNN [23] later introduced a segmentation output and better pooling techniques for detection [21]. On the other hand, one-stage object detection algorithms predict object locations and classes directly, without a separate region proposal step. For instance, one-stage detectors such as YOLO [22][29][30][31] and SSD [24] work with feature pyramid networks (FPNs) [32] as a backbone to detect objects at multiple scales in a single pass rather than first predicting regions and then classifying them.
Many surveys have been carried out on the topic of object detection [33][34][35][36]. This section covers some of the prior surveys.
Han et al. [37] presented a survey that reviews deep learning techniques for salient and category-specific object detection. In 2019, Zou et al. [38] performed an extensive survey of object detection methods proposed over the last 20 years. The authors discussed all the types of object detection algorithms proposed over the years and highlighted their improvements.
Another survey by Jiao et al. [39] discussed various deep learning-based methods for object detection. Their work provided a comprehensive overview of traditional and modern applications of object detection. Moreover, the authors discussed ways of building better and more efficient object detection methods by exploiting existing architectures. Arnold et al. [40] surveyed 3D object detection methods for autonomous driving and compared various 3D object detection approaches.
It is vital to mention that all of the prior surveys have focused on the general problem of object detection. Although these surveys explain how object detection has improved over the years, they do not cover the challenges and solutions for improving object detection performance in challenging environments, such as those with low light, occlusions, hidden objects, and so on. To the best of our knowledge, we provide the first survey that reviews the performance of deep learning-based approaches to object detection in challenging environments.
We have investigated the performance of current state-of-the-art object detection algorithms on the three most challenging datasets. The idea is to conduct an analysis that explains how well object detection algorithms can perform under harsh conditions. We employed Faster R-CNN [21], Mask R-CNN [23], YOLO V3 [30], Retina-Net [41], and Cascade Mask R-CNN [42] to benchmark their performance on the datasets of ExDARK [43], CURE-TSD [44], and RESIDE [45].
We leveraged transfer learning in our experiments. All the object detection networks use a ResNet50 [46] backbone pre-trained on the COCO dataset [47]. We fine-tuned all the models for 15 epochs with a learning rate of 2 × 10⁻⁵ and used Adam [48] as the optimizer. We resized images to 800 × 800 during the training and testing phases.
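The reported settings can be collected in a small configuration object, as in the sketch below. The class `FineTuneConfig` and the helper `rescale_box` are hypothetical names introduced here for illustration. The text does not say how annotations were adjusted after resizing, so the per-axis scaling of boxes is an assumption, valid only if images are stretched to 800 × 800 without preserving aspect ratio.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text; the field names are illustrative."""
    backbone: str = "resnet50"
    pretrained_on: str = "coco"
    epochs: int = 15
    learning_rate: float = 2e-5
    optimizer: str = "adam"
    image_size: Tuple[int, int] = (800, 800)  # (width, height)


def rescale_box(box: Box, orig_size: Tuple[int, int],
                new_size: Tuple[int, int]) -> Box:
    """Map a ground-truth box from the original image to the resized image.

    Assumes a plain resize that scales each axis independently.
    """
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


cfg = FineTuneConfig()
# Hypothetical 400 x 200 image with one annotated box.
scaled = rescale_box((100, 50, 300, 200), orig_size=(400, 200),
                     new_size=cfg.image_size)
print(cfg.learning_rate, scaled)
```

Keeping the values in one immutable object makes it easier to apply identical settings to every detector being compared.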
This section discusses the well-known evaluation criteria essential to standardize state-of-the-art results for object detection in difficult situations. Moreover, this section analyzes the performance of the approaches discussed in Section 3 with quantitative and qualitative illustrations. Finally, we will present the outcome of our experiments on the three most widely exploited challenging datasets.
A standardized way of assessing the performance of approaches on unified datasets is imperative. Since the task of object detection in a challenging environment is the same as in generic object detection, the approaches are assessed with the same evaluation metrics.
Precision [49] is defined as the percentage of a predicted region that belongs to the ground truth. Figure 16 illustrates the difference between precise and imprecise object detection. The formula for precision is given below:

$$\text{Precision} = \frac{\text{Predicted area in ground truth}}{\text{Total area of predicted region}} = \frac{TP}{TP + FP} \tag{1}$$

where TP denotes true positives and FP represents false positives.
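Both forms of the precision formula can be checked with a short sketch. The function names and the axis-aligned box representation (x_min, y_min, x_max, y_max) are assumptions made for illustration. Returning 0.0 when the denominator is zero is also a convention chosen here, not one stated in the text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def area_precision(pred: Box, gt: Box) -> float:
    """Predicted area inside the ground truth / total predicted area."""
    pred_area = area(pred)
    return intersection_area(pred, gt) / pred_area if pred_area > 0 else 0.0


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP), with 0.0 returned when there are no predictions."""
    total = tp + fp
    return tp / total if total > 0 else 0.0


# Half of the predicted box lies inside the ground truth.
print(area_precision((0, 0, 10, 10), (5, 0, 15, 10)))
# Eight correct detections out of ten predictions.
print(precision(tp=8, fp=2))
```

The area-based form scores a single predicted region, while the count-based form summarizes a whole set of detections.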
The main reason for the low performance of these state-of-the-art generic object detection algorithms is that they are not trained on challenging datasets that include low-light images or occluded images. Furthermore, the backbone network of these architectures cannot optimally extract the spatial features necessary for detecting objects in challenging environments. Hence, it is empirically established that generic object detection algorithms are not ideal for resolving object detection in challenging images.