Detection of the Stem End of Pomelo: Comparison
Please note this is a comparison between Version 1 by bowen hou and Version 2 by Lindsay Dong.

For the detection of the stem end of pomelo, there are no standard or even clear detection and grading guidelines; researchers usually design detectors from experience. Deep learning methods are good at extracting hidden information from labeled image datasets.

  • detection
  • pomelo
  • deep learning

1. Introduction

Belonging to the genus Citrus of the family Rutaceae, pomelo (Citrus grandis L. Osbeck) is one of the three basic species of citrus cultivars, which account for approximately 25% of the output of Citrus fruit in China [1]. Pomelo is fragrant, sweet and sour, cool and moist, rich in nutrition, and high in medicinal value. It is not only a fruit that people like to eat, but also one with therapeutic effects [2].
Nowadays, most fruit detection methods are traditional image processing methods, which require hand-crafted features for various situations, and designing those features takes much effort and time [3]. With traditional image processing, surface flaws of pomelo can be detected easily, but the stem end of pomelo is frequently mistaken for a flaw. In recent years, deep learning has become increasingly influential in the field of computer vision, and with its progress, image detection has improved significantly.
Researchers optimize algorithms to accomplish vision-based tasks with high accuracy and reliability [4]. Deep learning approaches, especially vision transformers, perform better on computer-vision-related tasks [5]. Deep learning algorithms outperform traditional image-processing methods for fruit detection [6]. They excel in feature representation and extraction, especially in automatically obtaining features from images [7]. Thanks to their powerful capabilities and easy assembly, they can solve complex and large problems more efficiently [8].

2. Detection of the Stem End of Pomelo

Before the advent of deep learning, pomelo peel flaw detection was usually carried out with machine learning. With the widespread adoption of deep learning, many fruit and vegetable detection algorithms now combine traditional image algorithms with deep learning methods. Xiao et al. [9] used a single-shot multi-box detector with improved feature fusion to extract RGB features for pomelo detection. The experimental results are good; however, their dataset is too small, the network performs only detection, and the generalization performance of the proposed model is poor. Huang [10] used a back-propagation neural network (BPNN) model to assess pomelo surface defects, shape, size, and other indicators; they built their own larger fruit dataset, drawn mainly from daily shooting and the web. Li et al. [11] proposed a least-squares support vector machine (LS-SVM) to identify pomelo on a 240-image dataset and achieved good results despite the small dataset; this machine learning method is applicable to the sorting of pomelo.
Moreover, for pomelo, some researchers even use infrared spectroscopy information [11][12]. Many traditional image algorithms have been used to construct systems for pomelo maturity measurement and detection [12]. Such works are comprehensive: to determine categories, researchers compute pomelo color histograms and use thermal cameras to detect defects. Undoubtedly, these methods increase the hardware cost relative to a model that uses only cameras. The study by Jie et al. [13] shows that a conventional convolutional neural network (CNN) achieved the best accuracy compared with LS-SVM and BPNN for Citrus grandis granulation determination. The quality of a detection model depends on its feature extraction; to improve the CNN's performance, they added batch normalization layers, and the detection model achieved 97.9% accuracy on the validation set. By analyzing the well-trained model layer by layer, they point out that the bands of 807–847 nm, 709–750 nm, and 660–721 nm are the spectra most strongly related to pomelo granulation. Combined with studies on functional groups, it is possible to speculate about changes in internal substances, which may provide hints for developing granulation-detecting equipment for pomelo.
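The color-histogram step mentioned above can be illustrated with a minimal sketch. The function below is a toy stand-in, not the cited authors' method: it approximates a hue histogram directly from an RGB array (a real system would use a proper RGB-to-HSV conversion, e.g. OpenCV's `cvtColor`), and the bin count is an arbitrary choice.

```python
import numpy as np

def hue_histogram(image_rgb, bins=16):
    """Coarse, normalized hue histogram of an RGB image (values in 0..255).

    A toy sketch of histogram-based category features; hue is derived
    from the standard piecewise max/min formula.
    """
    img = image_rgb.astype(np.float32) / 255.0
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mx = np.max(img, axis=-1)
    mn = np.min(img, axis=-1)
    delta = mx - mn
    hue = np.zeros_like(mx)
    mask = delta > 1e-6                      # gray pixels keep hue 0
    rmax = mask & (mx == r)
    gmax = mask & (mx == g) & ~rmax
    bmax = mask & ~rmax & ~gmax
    hue[rmax] = ((g - b)[rmax] / delta[rmax]) % 6.0
    hue[gmax] = (b - r)[gmax] / delta[gmax] + 2.0
    hue[bmax] = (r - g)[bmax] / delta[bmax] + 4.0
    hue /= 6.0                               # normalize hue to [0, 1)
    hist, _ = np.histogram(hue, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()
```

Such a histogram can serve as a cheap per-fruit feature vector; a pure-red test image puts all of its mass in the first hue bin.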
The limitations of the current state of the art that motivate the present study are the small size of existing pomelo datasets and the lack of targeted improvements to the deep learning models.

2.1. Detection Methods

There have mainly been two kinds of detectors since the advent of deep learning: the one-stage detection framework and the two-stage detection framework [14][15]. The two-stage detection framework, represented by RCNN [16] and Fast RCNN [17], generates a series of sparse candidate boxes through a CNN and then classifies and regresses these candidate boxes. Its training process is more complicated because of the multistage pipeline, and in practical applications its inference time is long [14], which also makes it difficult to optimize. RCNN [16] replaced empirically designed hand-crafted feature paradigms, such as the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT), with data-driven representation learning by using a CNN to extract image features, improving how well the features represent the samples. Fast RCNN [17] performs feature extraction over the whole image only once, introduces proposal-box information, and extracts the corresponding proposal-box features.
By comparison, one-stage detection frameworks (represented by YOLO [18], SSD [19], etc.) avoid the problems mentioned above. YOLO [18] takes the whole image as the network input and treats target detection as a regression problem, directly regressing the position and category of the preselection boxes at the output layer. SSD [19] extracts feature maps of different scales for detection: large-scale feature maps (those in the front of the network) are used to detect small objects, while small-scale feature maps (those in the back) are used to detect large objects. Moreover, SSD uses prior boxes (default boxes) with different scales and aspect ratios.
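The prior-box idea can be sketched in a few lines. This is a simplified illustration, not SSD's exact configuration: for one square feature map it tiles a default box of a given scale at every cell center, once per aspect ratio; the scale and ratio values passed in are placeholders.

```python
import math

def prior_boxes(fmap_size, scale, aspect_ratios):
    """Generate SSD-style default boxes as (cx, cy, w, h) in relative
    [0, 1] coordinates for one square feature map.

    Illustrative sketch only: real SSD also adds an extra box per cell
    and clips boxes to the image; those details are omitted here.
    """
    boxes = []
    step = 1.0 / fmap_size                    # cell width in relative coords
    for i in range(fmap_size):                # rows
        for j in range(fmap_size):            # columns
            cx = (j + 0.5) * step             # box center at the cell center
            cy = (i + 0.5) * step
            for ar in aspect_ratios:
                # same area (scale**2) for every ratio, different shape
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes
```

For a 2x2 feature map with two aspect ratios this yields 2 * 2 * 2 = 8 default boxes; earlier, larger feature maps produce many more, smaller boxes, matching the small-object role described above.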
In summary, one-stage detection frameworks detect objects in a single pass through the network, whereas two-stage detection frameworks split detection into two stages: in the first stage, the network proposes regions of interest (ROIs) where objects may be located; in the second, it classifies the proposed ROIs and refines their bounding boxes. One-stage detectors are faster and easier to use, but they sacrifice accuracy; two-stage detectors are more accurate but slower and more complex. In practical applications, provided that the real-time requirements are satisfied (FPS > 50), both one-stage and two-stage detection frameworks are suitable for distinguishing the stem end of pomelo from its black spots with high accuracy.

2.2. Vision Transformers

The original ViT [20] is an image classification model that applies a transformer-style architecture over patches of the image. The image is processed as a sequence of small patches, which makes it straightforward to model interactions between patches at all positions, i.e., global attention. ViT [20] contains three main components: patch embedding, feature extraction by stacked transformer encoders, and a classification head. However, due to its high computational complexity (growing quadratically with the image size), the original ViT cannot easily be applied to a wide range of visual tasks. By introducing shifted windows, which support patch reduction and local attention operations, the Swin-Transformer [21] mitigates the complexity problem and improves adaptability to dense prediction tasks such as object detection. The pooling-based vision transformer [22] reduces the size of the ViT structure and improves its spatial interaction ratio by controlling the self-attention layers. A few methods use vision transformers as detector backbones, but they achieve limited success [21][22][23].
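The patch-embedding component can be sketched as a reshape plus a linear projection. This is a minimal NumPy illustration, not the actual ViT implementation: the projection matrix `proj` stands in for ViT's learned linear layer, and positional embeddings and the class token are omitted.

```python
import numpy as np

def patch_embed(image, patch_size, proj):
    """Split an (H, W, C) image into non-overlapping patches and project
    each flattened patch with `proj` -> (num_patches, embed_dim).

    Toy stand-in for ViT patch embedding; `proj` plays the role of the
    learned projection and has shape (patch_size*patch_size*C, embed_dim).
    """
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image must divide into whole patches"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C)
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * c)   # one row per patch
    return patches @ proj                      # token sequence for the encoder
```

The resulting sequence length is (H/p) * (W/p); since self-attention cost grows with the square of this length, halving the patch size quadruples the token count, which is the quadratic-complexity issue noted above.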

2.3. Detection Transformers

Combining a convolutional neural network backbone with a transformer encoder–decoder, detection transformers discard hand-designed components such as anchor generation and non-maximum suppression. The study by Song et al. [24] shows that detection transformers can be effective detectors when the attention module is configured and the decoder is refined. Compared to previous detectors [16][17][18][19][22], the original DETR [25] achieves accurate detection results, but its convergence is slow: for example, Faster R-CNN requires only 50 epochs of training while DETR needs 150. To solve this problem, Zhu et al. [26] propose Deformable DETR, which uses deformable attention to accelerate the slow training of DETR and to exploit multi-scale features in the image.