1. Introduction
Deep learning [1][2] has played a key role in the rapid development of machine vision. In recent years, object detection has advanced to an unprecedented level; it is an important part of image information processing and machine vision, and a core component of surveillance information systems. With the development of deep-learning theory and technology, the application scope of object detection is also expanding greatly. Advanced algorithms and models can meet a variety of complex needs, such as autonomous driving, real-time tracking, and intelligent fruit and vegetable harvesting. Improving the accuracy and speed of object detection algorithms therefore not only provides more accurate object category and position information for downstream tasks, but also promotes the application of the downstream tasks that build on object detection. Representative downstream tasks, and the challenges they pose to object detection algorithms, include the following:
- (1) Pedestrian detection, where small-scale pedestrian detection in particular is one of the challenges faced by object detection algorithms [3];
- (2) Face detection, where occlusion and multi-scale targets are also difficult challenges [4];
- (3) Text detection, where detection in distorted, blurry, and low-quality images is one of the difficult problems that needs to be solved [5];
- (4) Fruit and vegetable detection [6][7]. Recently, many researchers have begun to explore new and more efficient techniques for detecting and identifying fruits and vegetables [8]. However, because traditional techniques cannot quickly and accurately collect and analyze massive numbers of images, their robustness and accuracy still fall short [9]. New fruit and vegetable detection techniques are now widely used in a variety of scenarios. They capture the appearance, structure, and properties of fruit more accurately, and can quickly identify the growth stage and quality of each fruit. These techniques have greatly improved the efficiency and accuracy of fruit and vegetable harvesting in agricultural production.
2. Related Technologies for Red Fruit Detection
Red fruit detection is an important fruit and vegetable detection technology that supports picking fresh produce, classifying it, assessing its ripeness, and detecting pests and diseases. Research in this field has been very active in recent years. In 2017, Bargoti [10] developed a detector based on Faster R-CNN, a region-based convolutional neural network (R-CNN), that can accurately identify three different types of fruits and vegetables. In 2018, Peng Hongxing [11] used the deep convolutional architecture of the Single-Shot MultiBox Detector (SSD) to identify a variety of fruits and vegetables. Li [12] used preprocessing techniques to train CaffeNet and raised the recognition accuracy for strawberries to 95%. In 2019, Bi Song et al. [13] designed a deep-learning-based method for citrus recognition in natural environments. Zeng Pingping [14] proposed a convolutional neural network model, borrowing from the LeNet architecture, for identifying four fruits: apples, pears, oranges, and peaches. Cheng Hongfang et al. [15] proposed an improved LeNet apple detection model that achieved a recognition rate of 93.79% for apples in natural environments. Huang Haojie [16] improved SSD to detect and classify apples, oranges, and bananas. In 2020, Zhang Enyu and his team [17] developed an SSD-based detection model that can effectively distinguish green apples in natural environments with good accuracy. Gao F and his team [18] proposed an apple detection method that uses Faster R-CNN to classify apples on the tree according to their occlusion state in complex orchard environments, with the aim of identifying fruit more accurately.
3. Object Detection Based on Neural Networks
Convolutional neural networks (CNNs) are an important tool for image processing: they do not have to rely on complex models, and they generalize well. CNNs [19][20] process the observed image information in layers, from low level to high level, with each layer handling specific information. As this processing is repeated across layers, low-level features are combined into high-level features, giving a deep representation of the observed objects. The weight-sharing mechanism simplifies the model's computation and saves computing resources. Images in various forms can be fed directly into the network as input, which makes processing more efficient and better suited to practical applications. When processing an image, a CNN performs convolution with convolution kernels, whose function is to find features in the image. Each kernel slides across the image from left to right and from top to bottom, producing a feature map. Different kernels extract different image features, and the values in each kernel are learned automatically by the algorithm rather than set manually [21].
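The sliding-kernel computation described above can be illustrated with a minimal sketch in plain Python. The function name `convolve2d`, the toy image, and the hand-set edge kernel are illustrative assumptions; in a real CNN the kernel values are learned, and, as in most deep-learning frameworks, the operation shown is technically a cross-correlation (the kernel is not flipped).

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    feature_map: Matrix = []
    for top in range(out_h):              # move the window from top to bottom
        row = []
        for left in range(out_w):         # move the window from left to right
            acc = 0.0
            for i in range(k_h):          # weighted sum over the current window
                for j in range(k_w):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image with a vertical edge between columns 1 and 2 (assumed data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-set vertical-edge kernel, used here only for illustration.
vertical_edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, vertical_edge_kernel)
```

Because the same kernel values are reused at every window position, the number of parameters does not grow with the image size, which is the saving that weight sharing provides.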
Object detection algorithms can be roughly divided into two categories: two-stage algorithms and single-stage algorithms. Representative two-stage algorithms are R-CNN, the Spatial Pyramid Pooling Network (SPP-Net) [22], Faster R-CNN [23], Region-based Fully Convolutional Networks (R-FCN), Mask R-CNN, Cascade R-CNN, and TridentNet. Representative single-stage algorithms include SSD, You Only Look Once (YOLO), YOLO9000, RetinaNet, You Only Look Once version 3 (YOLOv3), EfficientDet, and YOLOv4 [24].
In November 2013, Ross Girshick’s R-CNN achieved great success in object detection. However, the input to R-CNN must be resized to a preset size, and this processing may lose original image information, which in turn affects the final detection results.
In June 2014, He et al. proposed SPP-Net, which reduces the information loss caused by image scaling by using a spatial pyramid pooling (SPP) layer. However, like R-CNN, SPP-Net is inefficient at detection and must still process a large number of features, which places very high demands on device performance.
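The idea behind the SPP layer can be sketched as follows: a feature map of any size is max-pooled over a pyramid of grids, so the output length is fixed and the input image does not need to be scaled to a preset size. This is a minimal single-channel sketch in plain Python; the function name `spatial_pyramid_pool` and the pyramid levels (1, 2, 4) are illustrative assumptions, not the configuration of any particular implementation.

```python
import math
from typing import List, Sequence


def spatial_pyramid_pool(
    feature_map: List[List[float]], levels: Sequence[int] = (1, 2, 4)
) -> List[float]:
    """Max-pool one channel over n x n grids for each pyramid level n.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled: List[float] = []
    for n in levels:
        for by in range(n):
            # Row range of this bin: floor of the start, ceiling of the end.
            y0 = (by * h) // n
            y1 = math.ceil((by + 1) * h / n)
            for bx in range(n):
                x0 = (bx * w) // n
                x1 = math.ceil((bx + 1) * w / n)
                pooled.append(
                    max(
                        feature_map[y][x]
                        for y in range(y0, y1)
                        for x in range(x0, x1)
                    )
                )
    return pooled


# Two feature maps of different sizes (assumed toy data).
small_map = [[float(r * 6 + c) for c in range(6)] for r in range(4)]   # 4 x 6
large_map = [[float(r * 5 + c) for c in range(5)] for r in range(7)]   # 7 x 5

small_vec = spatial_pyramid_pool(small_map)
large_vec = spatial_pyramid_pool(large_map)
```

Both inputs produce a vector of the same length (1 + 4 + 16 = 21 values per channel), which is what allows the fully connected layers that follow to accept images of arbitrary size.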
In April 2015, Ross Girshick proposed Fast R-CNN, which effectively reduces the complexity of object detection. It integrates the classification and regression steps into a single module, greatly reducing the complexity of detection and speeding up the algorithm.
In June 2015, Redmon proposed YOLO, a single-stage object detection algorithm. It passes the image through a single network during detection, which significantly reduces computational complexity and memory usage and makes it faster. YOLO marks an important milestone: although it is less accurate than two-stage detection algorithms, it opened a new path toward faster, real-time detection by simplifying the pipeline. In December of that year, Liu et al. proposed the SSD algorithm, which builds on the advantages of YOLO while addressing its shortcomings. To balance speed and accuracy, it introduces the anchor box mechanism of two-stage algorithms into a single-stage algorithm to improve accuracy.
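The anchor mechanism mentioned above (anchor frames, more commonly called anchor boxes) can be illustrated with a short sketch: a fixed set of reference boxes with different scales and aspect ratios is placed at every cell of a feature-map grid, and the detector predicts offsets relative to these boxes. The function name `generate_anchors`, the stride, and the scale and ratio values below are illustrative assumptions rather than the settings of SSD or any other specific detector.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def generate_anchors(
    grid_h: int,
    grid_w: int,
    stride: float,
    scales: Sequence[float],
    ratios: Sequence[float],
) -> List[Box]:
    """Place len(scales) * len(ratios) anchor boxes at the center of each grid cell.

    `ratios` are width / height; each anchor keeps an area of scale ** 2.
    """
    anchors: List[Box] = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            # Center of the grid cell in input-image pixel coordinates.
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    width = scale * math.sqrt(ratio)
                    height = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, width, height))
    return anchors


# A 2 x 3 feature map with stride 16, one scale, and three aspect ratios (assumed values).
anchors = generate_anchors(2, 3, 16.0, scales=(32.0,), ratios=(0.5, 1.0, 2.0))
```

With these assumed settings, each of the 6 grid cells receives 3 anchors, so the detector scores 18 candidate boxes in a single pass instead of relying on a separate proposal stage.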
First proposed by He et al. in March 2017, Mask R-CNN replaces the “RoI Pooling” layer of Faster R-CNN with an “RoI Align” layer and adds a new branch that predicts an object mask. Unlike traditional object detection methods, it can describe and identify specific objects more precisely. However, the mask branch adds its own term to the loss function, which increases the amount of computation and makes it difficult to improve detection efficiency.
In August 2017, the Facebook AI Research team developed RetinaNet, a single-stage object detection method that avoids the time cost of two-stage methods. It also effectively mitigates the drop in accuracy that single-stage detectors suffer because of the imbalance between foreground and background samples.
In 2018, Redmon, the creator of YOLO, and his team introduced YOLOv3. It builds on the concept of the Feature Pyramid Network (FPN), combining three feature maps of different sizes to better achieve multi-scale object detection.
The excellent performance of FPN has attracted many researchers, who have drawn on it and continued to explore and improve its application. In 2019, Tan et al. introduced EfficientDet, which uses an improved version of FPN, the bidirectional feature pyramid network (BiFPN), built on bidirectional cross-scale connections and weighted feature fusion. It greatly expanded the application scope of BiFPN and brought new opportunities and challenges to the development of object detection algorithms. BiFPN detects targets effectively by fusing features in both directions between the P3 and P7 levels. Although EfficientDet has many notable features and excellent accuracy, its detection efficiency still needs to be improved.
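Weighted feature fusion can be sketched as a normalized weighted sum of feature maps: each input receives a non-negative learnable weight, and the weights are normalized so that the output stays on the same scale as the inputs. The sketch below follows the fast normalized fusion form, output = sum(w_i * I_i) / (eps + sum(w_j)); the function name, the toy feature maps, and the weight values are illustrative assumptions, the inputs are assumed to have already been resized to the same resolution, and the convolution that normally follows the fusion is omitted.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]


def fast_normalized_fusion(
    features: Sequence[FeatureMap], weights: Sequence[float], eps: float = 1e-4
) -> FeatureMap:
    """Fuse same-sized feature maps as sum(w_i * f_i) / (eps + sum(w_j))."""
    # Clip weights at zero (ReLU) so every contribution is non-negative.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps   # eps avoids division by zero

    h, w = len(features[0]), len(features[0][0])
    fused: FeatureMap = [[0.0] * w for _ in range(h)]
    for feature, weight in zip(features, clipped):
        for y in range(h):
            for x in range(w):
                fused[y][x] += weight * feature[y][x] / total
    return fused


# Two 2 x 2 feature maps standing in for a top-down and a bottom-up input (assumed data).
top_down = [[1.0, 1.0], [1.0, 1.0]]
bottom_up = [[3.0, 3.0], [3.0, 3.0]]

equal_fusion = fast_normalized_fusion([top_down, bottom_up], weights=(1.0, 1.0))
clipped_fusion = fast_normalized_fusion([top_down, bottom_up], weights=(-5.0, 2.0))
```

With equal weights the result is close to the average of the two inputs, and a negative weight is clipped to zero so that input is effectively ignored. In a trained network these weights are learned, letting the model decide how much each scale contributes.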
YOLOv3 has been widely adopted and has become one of the most common deep-learning object detection methods in use today. In 2020, Bochkovskiy et al. introduced YOLOv4, which replaces the Darknet-53 backbone of YOLOv3 with CSPDarknet53 while retaining the essence of the YOLO algorithm. YOLOv4 balances speed and accuracy, and its distinctive features are its SPP module, PANet, and a number of additional optimization techniques. It can effectively combine features from different receptive fields to better meet different needs. It was therefore the best object detection algorithm at that time, delivering both fast and accurate results.
In recent years, deep learning techniques have shown promising results in disease classification from image data. Yonis et al. [25] trained and evaluated five widely used deep learning models, namely AlexNet, VGG16, InceptionV3, MobileNetV3, and EfficientNet, on a dataset of sunflower disease images. Gulzar [26] used a well-known deep learning model, MobileNetV2, as the base model and modified it by adding five different layers to improve accuracy and reduce the error rate during classification. The proposed model was trained on a dataset containing 40 varieties of fruit and was validated to have the highest accuracy in recognizing different types of fruit. Poonam Dhiman et al. [27] described a systematic literature review (SLR) focused on disease identification and classification in citrus fruits using machine learning, deep learning, and statistical techniques, and presented the conceptual theories related to all the essential components of recognizing and classifying citrus fruit diseases. By answering nine research questions, this SLR covered nearly all the state-of-the-art frameworks applicable to detecting diseases in citrus fruits and set out stepwise measures for building an automatic framework to protect fruit from apparent disease. Normaisharah Mamat et al. [28] proposed an automatic image annotation approach that employs repetitive annotation tasks to annotate objects automatically, with the YOLOv5 deep learning model chosen to annotate the images. The method was shown to annotate new images quickly and with high accuracy, and it can greatly reduce the time required to classify fruit while also addressing the difficulty posed by massive numbers of unlabeled images.
Although current object detection technologies have made great progress, they are still limited by computing power, environmental noise, daytime lighting conditions, varying object sizes, and a variety of other environmental factors, and therefore do not always achieve the desired results.
This entry is adapted from the peer-reviewed paper 10.3390/pr12010015