1. Traditional Machine Learning
1.1. Support Vector Machines (SVM)
Support vector machines (SVM) were used for classifying vegetation by health status, classifying trees by type, identifying and classifying weeds to generate weed maps, and lastly, segmenting crop rows.
Tendolkar et al. 
proposed the use of an Agrocopter, a multipurpose farming drone, to assess and evaluate plant health status and to take corrective actions. The system assessed plant health on the basis of the NDVI index, texture, and color features of the individual pixels. These features were extracted utilizing a filter bank of 17 Gaussian and Laplacian filters. SVM was then used to perform semantic segmentation on the image pixels and to classify the pixels as healthy or unhealthy. Lastly, a segmented mask was generated and used to compute the health ratio of each image, defined as the ratio of the area of healthy pixels to the total area of the image. The health ratio was then used to classify images into healthy, moderately healthy, and unhealthy. The trained model had 85% precision, 81% recall, and an F1-score of 79%.
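The NDVI feature and the health-ratio step described above can be illustrated with a minimal sketch. It assumes the per-pixel healthy/unhealthy mask has already been produced by a classifier. The function names and the two grading cut-offs (0.7 and 0.4) are hypothetical, since the text does not report the thresholds that were actually used.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    # Guard against division by zero on completely dark pixels.
    return 0.0 if total == 0 else (nir - red) / total


def health_ratio(mask):
    """Fraction of pixels labeled healthy (1) in a binary mask."""
    flat = [value for row in mask for value in row]
    return sum(flat) / len(flat)


def grade_image(ratio, healthy_min=0.7, moderate_min=0.4):
    """Map a health ratio to one of three image-level grades.

    Both cut-offs are hypothetical placeholders, not values from the study.
    """
    if ratio >= healthy_min:
        return "healthy"
    if ratio >= moderate_min:
        return "moderately healthy"
    return "unhealthy"


# Toy 2 x 3 mask: 1 = healthy pixel, 0 = unhealthy pixel.
mask = [[1, 1, 0],
        [1, 0, 0]]
ratio = health_ratio(mask)
print(ratio, grade_image(ratio))
```

In practice the mask would come from the pixel-wise classifier, and the cut-offs would be tuned on labeled images.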
Natividade et al. 
proposed a pattern recognition system (PRS) to identify and classify vegetation using the NDVI scale as a segmentation threshold. An SVM was trained on two datasets: a tree dataset with five classes and a vineyard dataset with three classes. The best models achieved an accuracy of around 72% on the two datasets.
Pérez-Ortiz et al. 
introduced a UAV-based weed mapping system for the early detection of weeds in crop fields. They used a semi-supervised SVM (SSVM) which aims to find an optimal labeling for the test portion of the data using both labeled and unlabeled data. The system used crop-row detection, vegetation indices, and spectral features to classify pixels in field images as belonging to one of three classes of crop, weed, or soil. Crop-row detection was introduced to improve classifier performance in differentiating crops and weeds because their spectral features were similar. The proposed system took UAV-captured images, partitioned them into 1000 × 1000 pixel images, and then calculated the vegetation index of all image pixels. NDVI was used for multispectral images, and the excess green index (ExG) was employed for visible images. The Otsu thresholding procedure was then applied to the vegetation indices to create thresholds that divided the indices into three classes, with the highest vegetation index (VI) values pertaining to crops, lower values to weeds, and the lowest values to soil. The image was then binarized by taking crop pixels as 1s and weed and soil pixels as 0s. The binarized image was then fed into the Hough transform (HT) method to detect crop rows in the images. Lastly, a crop-row data feature, along with VI and spectral features, was used to train different machine learning models to classify pixels as soil, crop, or weed. The SSVM returned an MAE of 12.68%.
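The vegetation-index and thresholding steps can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it uses the common chromatic-coordinate form of ExG and the two-class variant of Otsu thresholding, whereas the system described above split the index into three classes. The pixel values, the bin count, and the function names are assumptions.

```python
def excess_green(r, g, b):
    """ExG = 2g - r - b, computed on channels normalized by their sum."""
    total = r + g + b
    if total == 0:
        return 0.0
    return (2 * g - r - b) / total


def otsu_threshold(values, bins=64):
    """Two-class Otsu: choose the cut that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_k, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for k in range(bins - 1):
        w0 += hist[k]           # pixels in the low class (bins 0..k)
        sum0 += k * hist[k]
        w1 = total - w0         # pixels in the high class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_k = var_between, k
    # Upper edge of the last bin assigned to the low class.
    return lo + (best_k + 1) * width


# Hypothetical RGB pixels: three soil-like and two vegetation-like.
pixels = [(120, 100, 80), (110, 95, 85), (60, 160, 40),
          (50, 150, 45), (125, 105, 90)]
exg = [excess_green(*p) for p in pixels]
cut = otsu_threshold(exg)
labels = [1 if v > cut else 0 for v in exg]   # 1 = vegetation
print(labels)
```

A binary mask built this way is the kind of input that a Hough transform could then scan for straight crop rows.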
César Pereira et al. 
compared the performance of multiple machine learning algorithms for the problem of crop-row segmentation. Their study used a single image of a sugar cane field as its dataset and compared the segmentation results of running this image through different classifiers to a manually labeled image. The manually segmented image’s pixels were classified into the two classes of crop row and background. Spectral features were extracted using ExG and VI, and textural features were extracted through a four-filter Gabor filter bank and a gray-level co-occurrence matrix (GLCM). The feature vectors and color features (RGB) were used to train SVM models. For the linear SVM model, the best combination of features was RGB, ExG, and Gabor filters. This combination yielded an F1-score of 88.01% and an IoU of 78.86%. The worst feature combination was RGB and GLCM. This combination yielded an F1-score of 62.48% and an IoU of 46.08%.
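Because F1-score and IoU recur throughout this review as segmentation metrics, a short sketch of how both follow from a predicted mask and a manually labeled mask may be useful. The masks below are toy data and the function names are illustrative. For a single binary mask the two metrics are tied by IoU = F1 / (2 − F1); plugging in F1 = 0.8801 gives roughly 0.786, close to the reported IoU of 78.86%.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(tp, fp, fn):
    """Intersection over union (Jaccard index) of the positive class."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1(tp, fp, fn):
    """F1-score (Dice coefficient) of the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy masks: 1 = crop row, 0 = background.
pred = [[1, 1, 0, 0],
        [1, 1, 0, 0]]
truth = [[1, 1, 1, 0],
         [1, 0, 0, 0]]
tp, fp, fn = confusion_counts(pred, truth)
print(iou(tp, fp, fn), f1(tp, fp, fn))
```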
1.2. K-Nearest Neighbors (KNN)
The K-nearest neighbor algorithm (KNN) has been used extensively in precision agriculture in land-cover classification, sugarcane planting line detection/fault studies, and crop-row segmentation.
Rodríguez-Garlito and Paz-Gallardo 
proposed a KNN-based land-cover classification system. This system classified land cover into olive trees, soil, weeds, and shadow. In this system, high-resolution, multispectral images of the studied field were first captured using a UAV. These images went through spatial partitioning to reduce the memory costs of the machine learning algorithm. As a result, processing windows were formed, with each window holding the spectral information of a row of image pixels. The KNN algorithm was then applied to one processing window at a time to perform land-cover classification, and to classify individual pixels into the classes to which they belonged. The trained KNN model had a precision of 95.5%, an accuracy of 91.8%, and an accuracy score of 90.9% on an equally balanced dataset. Similarly, Rocha et al. 
used KNN to detect gaps in curved sugarcane planting lines from aerial images. The training and test sets were created using RGB images and classified using decision tree, linear discriminant analysis, and KNN. KNN had the best results with a relative error of 1.65%, and it effectively evaluated the planting conditions.
Pereira Júnior et al. 
studied the use of the KNN algorithm in crop-row segmentation. Two KNN models with two different K values of 3 and 11 were used. Constructing a KNN model with a K value of either 3 or 11 yielded similar results. The models used Euclidean distance and RGB, ExG, and Gabor filters as features, and both models achieved an IoU score of about 76% and an F1-score of about 86%.
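A minimal sketch of KNN classification with Euclidean distance and the two K values mentioned above (3 and 11) is given below. The two-dimensional feature vectors are made-up stand-ins for descriptors such as ExG and a Gabor response; they are not data from any of the cited studies.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Label a query point by majority vote among its k nearest neighbors."""
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (ExG, texture response) features for labeled pixels.
crop = [(0.80, 0.70), (0.75, 0.65), (0.85, 0.72),
        (0.78, 0.60), (0.82, 0.68), (0.79, 0.74)]
background = [(0.10, 0.20), (0.05, 0.15), (0.12, 0.25),
              (0.08, 0.18), (0.15, 0.22), (0.09, 0.30)]
train_x = crop + background
train_y = ["crop row"] * len(crop) + ["background"] * len(background)

for k in (3, 11):
    print(k, knn_predict(train_x, train_y, (0.77, 0.66), k))
```

On well-separated classes like these toy points, both K values give the same label, which mirrors the similar results reported for K = 3 and K = 11.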
1.3. Decision Trees (DT) and Random Forests (RF)
Decision tree classifiers were used in precision agriculture to classify vegetation like trees and vineyards. Similarly, the random forest algorithm was used to classify sugar beet crops and weeds.
Natividade et al. 
used decision trees to detect and classify trees and vineyards in a field, where trees were classified into five distinct types and vineyards into three types. On the tree dataset, the best model resulted in 87% precision, 88% recall, and 74% accuracy. On the vineyard dataset, 87% precision, 90% recall, and 79% accuracy were achieved.
Lottes et al. 
proposed a crop and weed detection, feature extraction, and classification system that could identify and classify sugar beets and several types of weeds. NDVI and ExG were used as features. A segmented mask based on the VI threshold was then used to extract a spectral feature vector per segmented object in the image and a feature vector per key point in the image. These feature vectors, along with geometric and statistical features, were used to train a random forest model. The Phantom and Matrice training datasets contained UAV-captured images of crops and weeds, while the JAI training dataset contained ground-captured images. The Phantom dataset was used to test how well the model could classify vegetation into sugar beet crops, saltbush weeds, chamomile weeds, and other weeds. The model yielded a precision of 85% for both saltbush and chamomile weeds. The recall values were 95% and 87% for saltbush weeds and chamomile weeds, respectively. Lastly, a recall of only 45% was attained for other weeds. The overall accuracy of the model was 86%. When weed-type classification was ignored, and vegetation was classified into two classes, 99% recall and 97% precision were achieved.
2. Neural Networks and Deep Learning
2.1. Convolutional Neural Networks (CNN)
Convolutional neural networks (CNN) have been used extensively in analyzing images for precision agriculture. Specifically, transfer learning has often been used successfully using a variety of pretrained models, including Inception V3 and VGG. For example, Crimaldi et al. 
used the Inception V3 model and achieved 78.1% accuracy for classifying a crop into one of 14 crop types using data consisting of 54,309 images. Milioto et al. 
built a CNN model using RGB and NIR camera images. The model had 97.3% accuracy for images of early crop growth and 89.2% accuracy for images of crops in later stages. However, recall was similar across the two growth stages, with the early stage scoring 98% and the later stage scoring 99%. Similarly, Bah et al. 
used the AlexNet model on spinach, beet, and beans datasets and achieved precision of 93%, 81%, and 69%, respectively. The authors claimed that the bad results were primarily due to leaves overlapping between crops and weeds. Reddy et al. 
used a customized CNN model for their work on plant species identification and achieved 99.5% precision for Flavia, Swedish leaf, and UCI leaf datasets. Sembiring et al. 
focused on tomato plant disease detection. Their proposed model achieved 97.15% validation accuracy using the tomato leaf dataset from Plant Village. However, their model did not achieve the highest validation accuracy among all four trained models. The highest accuracy score of 98.28% was achieved by the VGG16 model. Geetharamani et al. 
achieved a classification accuracy of 96.46% using a customized nine-layer CNN model. The authors of 
used a residual learning CNN with an attention mechanism. The goal was to perform real-time corn leaf disease recognition. They also used the Plant Village disease classification challenge dataset 
. An overall accuracy of 98% was achieved. Nanni et al. 
used different combinations of CNNs, including ResNet50, GoogleNet, ShuffleNet, MobileNetv2, and DenseNet201, with different Adam optimization methods. These CNN models were trained on three datasets of insect images: the Deng dataset, the IP102 dataset, and the Xie2 dataset. The best-performing CNN achieved state-of-the-art accuracy on two of these datasets: 95.52% on Deng, a score that competed with human expert classifications, and 73.46% on IP102.
Atila et al. 
proposed using the EfficientNet architecture for plant disease classification on the Plant Village dataset and achieved 99.91% and 99.97% accuracy on original and augmented datasets, respectively. Prasad et al. 
proposed a two-step machine learning approach that analyzed low-fidelity and high-fidelity images from drones in sequence, preserving the efficiency and accuracy of plant diagnosis. The Pathology 2020 dataset and a set of synthetically generated images were used. A semi-supervised model derived from EfficientNet called EfficientDet was used. The end goal was to perform segmentation and classification. The model scored 75.5% for the average accuracy of the identifier model. Albattah et al. 
proposed a customized EfficientNet-based model using EfficientNetV2-B4 backbones to address plant disease classification. The Plant Village dataset and additional UAV images were used to train the model. The results were 99.63%, 99.93%, 99.99%, and 99.78% for precision, recall, accuracy, and F1-score, respectively.
Mishra et al. 
developed a standard CNN model to detect corn plant diseases in real time. The model was deployed on an Intel Movidius NCS and a Raspberry Pi 3b+ module. The authors used the Plant Village disease classification challenge dataset and divided the images into three classes: rust, northern leaf blight, and healthy. The system achieved an accuracy of 98.40% using a GPU and 88.56% on the NCS chip. Bah et al. 
used unsupervised data labeling for weed detection from UAV images. The dataset consisted of two fields: beans and spinach. Each dataset was divided into the two classes of crop and weed. Two-thirds of the data were labeled in a supervised manner, while one-third were labeled using unsupervised methods. The ResNet18 model was used to perform the classification. ResNet18 significantly outperformed SVM and RF methods in the bean field as it achieved an average AUC of 91.7% on both supervised and unsupervised labeled data in comparison to 52.68% using SVM and 66.7% using RF. On the other hand, RF resulted in a slightly better average AUC% in the spinach field compared to that achieved using ResNet18.
Zheng et al. 
proposed multiple CNN models to estimate percentage canopy cover and vineyard leaf area index in each field. The authors compared the estimation performance of five different models, including a CNN–ConvLSTM model, a vision transformer model, a joint model, a CNN model of 71 layers (Xception model), and a ResNet50 model. The five models were trained on a dataset containing approximately 840 images extracted from UAV videos taken of vineyard fields at Alcorn State University. The five models were evaluated using the RMSE of both leaf area index (LAI) and percentage canopy cover. For the prediction of leaf area index, Xception, CNN-ConvLSTM, vision transformer, ResNet50, and the joint model had RMSEs of 0.28, 0.32, 0.34, 0.41, and 0.43, respectively. For predicting percentage canopy cover, Xception, CNN-ConvLSTM, vision transformer, ResNet50, and the joint model had RMSEs of 4.01, 4.50, 4.56, 5.98, and 6.08, respectively. Xception therefore performed best in both LAI estimation and percentage canopy cover estimation.
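The RMSE used to rank the models above is the square root of the mean squared difference between predictions and reference values, so lower is better and the unit matches the predicted quantity. A small sketch with made-up LAI values (not data from the study) follows.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical leaf area index values for four image patches.
reference_lai = [1.0, 2.0, 3.0, 4.0]
predicted_lai = [1.5, 2.5, 2.5, 3.5]
print(rmse(predicted_lai, reference_lai))
```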
Yang et al. 
proposed a method of multisource data fusion for disease and pest detection of grape foliage using the ShuffleNet V2 model. The dataset consisted of 834 groups of grape foliage images. Each group contained three types of images of grape foliage: RGB image (RGBI) (2592 × 1944, three channels), multispectral image (MSI) (409 × 216, 25 channels), and thermal infrared image (TIRI) (640 × 512, three channels). The accuracy of MSI was 82.4%, that of RGB was 93.41%, and that of TIRI was 68.26%.
Briechle et al. 
used multispectral images to classify tree species and standing dead trees. They used the PointNet++ model. The data used were UAV-based light detection and ranging (LiDAR) data, including laser echo pulse width, and five-channel MS imagery. They also applied segmentation to the images during the preprocessing of the data. Their model achieved an accuracy of 90.2%.
Aiger et al. 
proposed a method of image classification based on multi-view image projections. Their method used projections of multiple images at multiple depth planes near the reconstructed surface. This enabled the classification of categories whose most noticeable aspect was appearance change under different viewpoints, such as water, trees, and other materials with complex reflection/light response properties. They obtained the best accuracy of 96.3% on their proposed 3D CNN.
Weinstein et al. 
developed a semi-supervised model for individual tree detection from UAV imagery. The model used an existing LIDAR algorithm to generate RGB trees that could be used for training as a starting point. The model was then retrained using a small number of manual labels to correct errors from the unsupervised detection. Then a pretrained ResNet50 backbone was used to classify the images. The model was tested on the NEON public dataset and achieved the best performance among existing LIDAR-based models (+2%) in comparison to that achieved by Silva et al. 
2.2. U-Net Architecture
The U-Net architecture was originally introduced in the medical domain by Ronneberger et al. 
and is commonly used for image segmentation. U-Net follows an encoder–decoder architecture. Many factors, such as the density of the crops, their growth stage, and the flight height of the drone, have an impact on how well a U-Net will perform. According to Kitano et al. 
, U-Net did not perform well when the plants were remarkably close together. However, some techniques could be used to solve this problem, such as using the morphological opening operator.
Lin et al. 
used U-Net to achieve an accuracy of 95.5% and an RMSE of 2.5% with 1000 manually labeled training images. Arun et al. 
achieved an accuracy of 95.34% and an RMSE of 7.45 using reduced U-Net by designing an efficient pixel-wise classifier for weeds and crops in agricultural field images. Hoummaidi et al. 
used the U-Net model to perform vegetation extraction and achieved an overall accuracy of 89.7%. However, palm trees and Ghaf trees had higher detection rates of 96.03% and 94.54%, respectively. The authors attributed the remaining errors to trees being obstructed by other trees. Palm trees also caused some errors due to their physical characteristics and the small crown sizes of some trees. The authors suggested that including young palms in the training data could improve the crown size error rate. Doha et al. 
used the U-Net architecture to detect crop rows by performing semantic segmentation on vertical aerial images. Zhang et al. 
used the dual-flow U-Net (DF-U-Net) to detect yellow rust severity in farmlands. The dataset was from the Yangling experiment field, which used a red-edge camera on board a DJI M100 UAV with a sensor size of 1336 × 2991. The F1-score, accuracy, and precision scores were 94.13%, 96.93%, and 94.02%, respectively. Sparse channel attention (SCA) was designed to increase the receptive field of the network and improve the ability to distinguish each category. Using U-Net, Lin et al. 
achieved high accuracy with a small dataset. Similarly, with only 48 images, Tsuichihara et al. 
achieved an accuracy of about 80% in detecting broad-leaved weeds.
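The morphological opening mentioned earlier in this subsection as a remedy for closely spaced plants can be sketched as an erosion followed by a dilation. The example assumes a binary segmentation mask and a 3 × 3 square structuring element; both the mask and the element size are illustrative choices, not settings taken from the cited work.

```python
def _neighborhood(mask, row, col):
    """Yield the 3 x 3 neighborhood, treating out-of-image pixels as 0."""
    rows, cols = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            yield mask[r][c] if 0 <= r < rows and 0 <= c < cols else 0


def erode(mask):
    """Keep a pixel only if its whole 3 x 3 neighborhood is foreground."""
    return [[int(all(_neighborhood(mask, r, c)))
             for c in range(len(mask[0]))] for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3 x 3 neighborhood is foreground."""
    return [[int(any(_neighborhood(mask, r, c)))
             for c in range(len(mask[0]))] for r in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation; removes thin links between blobs."""
    return dilate(erode(mask))


# Two 3 x 3 plant blobs joined by a one-pixel bridge at row 2, column 4.
mask = [[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 1, 1, 1, 0],
        [0, 1, 1, 1, 1, 1, 1, 1, 0],
        [0, 1, 1, 1, 0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]
opened = opening(mask)
for row in opened:
    print(row)
```

The opening removes the bridge pixel while leaving both blobs intact, so the two plants can then be counted as separate connected components.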
2.3. Other Segmentation Models
Efficient dense modules of asymmetric convolution (EDANet) is another model that works well for real-time semantic segmentation. Therefore, EDANet can be useful for real-time applications such as UAVs. Yang et al. 
proposed an EDANet that performs semantic segmentation for detecting rice lodging. Lodging occurs when the stem weakens and the plant falls over. EDANet outperformed many systems because of its efficiency, low computational cost, and model size. The model identified normal rice at 95.28% and lodging at 86.17% accuracy. The model accuracy was improved to 99.25% when less than 2.5% of rice lodging was neglected.
Weyler et al. 
proposed an ERFNet-based instance segmentation model that segments individual crop leaves in plant imagery to extract relevant phenotyping information and then groups the instances that belong to one crop together. This model made use of two decoders, one of which was used to predict the offset of image pixels from leaf regions, while the other was used to predict the offset of image pixels from plant regions. The two decoder outputs were then used to generate one image with leaf clusters and another with plant clusters. The model was trained on a dataset of 1316 RGB images of sugar beet fields captured by a camera onboard a UAV. The model was evaluated on its ability to perform crop leaf segmentation, as well as full crop segmentation. In crop leaf segmentation, the model was able to achieve an average precision of 48.7% and an average recall of 57.3%. The model achieved an average precision of 60.4% and an average recall of 68% for crop segmentation.
Guo et al. 
developed a three-stage model to perform plant disease identification for smart farming. The model located the diseased leaves using a region proposal network (RPN) algorithm trained on a leaf dataset in complex environments, after which regression and classification neural networks were used to locate and retrieve the diseased leaves. Later, the Chan-Vese algorithm 
was used to perform segmentation according to the set zero level set and minimum energy function. Lastly, the diseases were identified using a pretrained transfer learning model. The proposed model outperformed the traditional ResNet101 model significantly, with an accuracy of 83.75% in comparison to 42.5% by the latter.
Sanchez et al. 
used a multilayer perceptron (MLP) neural network for the early detection of broad-leaved weeds and grass weeds in wide-row crops from UAV imagery. The data were manually collected using a UAV quadcopter equipped with a low-cost RGB camera. Image segmentation was done using the multiresolution segmentation algorithm (MRSA). The model achieved an average overall accuracy of 80.9% on two classes of crops.
Zhang et al. 
proposed a unified CNN called UniStemNet for joint crop recognition and stem detection in real time. The architecture of UniStemNet is similar to that of Mask-RCNN. The architecture consists of a backbone and two subnets, among which the first performs crop recognition, while the other performs stem detection simultaneously. The backbone consists of five convolutional stages, where the first is a standard CNN with batch normalization, while the other four contain two MobileNetV2 inverted residual modules (IRMs). The subnets follow a varied-span feature fusion structure, as each has different detection targets. The evaluation was performed on the open-source CWF-788 dataset, and labels were manually annotated. The model obtained an F1-score of 97.4% and an IoU score of 94.5% in segmentation, which were slightly lower than those achieved by CR-DSS 
. Nonetheless, the model achieved the best-known results in stem detection with an SDR of 97.8%.
2.4. You Only Look Once (YOLO)
You Only Look Once (YOLO) is a real-time object detection neural network model where a single-stage neural network is applied to the full image. The network divides the image into regions and predicts bounding boxes along with probabilities for each region. The use of YOLO in agricultural disease and crop detection has recently been gaining popularity. For example, Chen et al. 
proposed a UAV to photograph and detect pests and employed a Tiny-YOLOv3 model built on NVIDIA Jetson TX2 to recognize their position in real time. The detected pest positions could later be used to plan optimal pesticide spraying routes, which agricultural UAVs would later follow. The model attained the best mAP score of 95.33% and 89.72% on 640 × 640 pixel test images.
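The region-wise prediction described at the start of this subsection can be illustrated with the grid assignment used in the original YOLO formulation, where the grid cell containing an object's center is responsible for predicting it. The sketch below is an assumption-laden illustration: the 20 × 20 grid corresponds to a stride of 32 on a 640 × 640 image, and the box coordinates are made up. Later YOLO versions add anchors and multiple scales, which are not shown.

```python
def responsible_cell(box, image_size, grid_size):
    """Return the (row, col) of the grid cell that contains the box center.

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is
    (width, height). All values here are illustrative.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    # Clamp so a center on the right or bottom edge stays inside the grid.
    col = min(int(center_x / width * grid_size), grid_size - 1)
    row = min(int(center_y / height * grid_size), grid_size - 1)
    return row, col


# Hypothetical pest bounding box on a 640 x 640 image with a 20 x 20 grid.
print(responsible_cell((300, 100, 340, 140), (640, 640), 20))
```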
Similarly, Qin et al. 
proposed a solution for precision crop protection based on a light deep neural network (DNN) called Ag-YOLO consisting of a modified version of ShuffleNet-v2 backbone, a ResBlock neck, and a YOLOv3 head. This model enabled the crop protection UAV to perform embedded real-time pest detection and autonomous spraying of pesticides. The model was tested on the Intel NCS2 hardware accelerator owing to its low weight and low power consumption. The detection system achieved an average F1-score of 92.05%.
Parico et al. 
proposed YOLO-WEED, a weed detection system trained with 720 annotated UAV images to detect instances of weeds, based on YOLOv3 using NVIDIA GeForce GTX 1060 for green onion crops. They obtained an mAP score of 93.81% and an F1-score of 94%.
Rui et al. 
proposed a novel comprehensive approach that combined transfer learning based on simulation data and adaptive fusion using YOLOv5 for improved detection of small objects. Their transfer learning and adaptive fusion mechanism led to a 7.1% improvement as compared to the original YOLOv5 model.
Parico et al. 
proposed a robust real-time pear fruit counter for mobile applications using only RGB data. Various variants of YOLOv4 (YOLOv4, YOLOv4-tiny, and YOLOv4-CSP) were compared. In terms of accuracy, YOLOv4-CSP was the best model, with an AP of 98%. In terms of speed and computational cost, YOLOv4-tiny showed a promising performance at a comparable rate with YOLOv4 at lower network resolutions. If considering the balance in terms of accuracy, speed, and computational cost, YOLOv4 was found to be the most suitable with AP >96%, inference speed of 37.3 FPS, and FN rate of 6%. Thus, YOLOv4-512 was chosen as the detection model for the pear counting system with Deep SORT.
Jintasuttisak et al. 
exploited the effective use of YOLO-V5 in detecting date palm trees in images captured by a UAV flying above farmlands in the Northern Emirates of the United Arab Emirates (UAE). The results of using YOLO-V5 for date palm tree detection in drone imagery were compared with those obtainable with other popular CNN architectures, YOLOv3, YOLOv4, and SSD300, both quantitatively and qualitatively. The results showed that, for the training data used, the YOLO-V5m (medium depth) model had the highest accuracy, resulting in an mAP of 92.34%. Furthermore, it provided the ability to detect and localize date palm trees of varied sizes in crowded, overlapped environments and areas where the date palm tree distribution was sparse.
Tian et al. 
proposed an anthracnose lesion detection method based on deep learning. Cycle GAN was used for data augmentation. DenseNet was then utilized to optimize the feature layers of the YOLO-V3 model, which had a lower resolution. The improved model exceeded faster RCNN with VGG16 and the original YOLO-V3 model and could realize real-time detection. The model obtained an F1-score of 81.6% and 91.7% IoU on the entire dataset.
2.5. Single-Shot Detector (SSD)
The single-shot detector (SSD) is a one-stage object detection network that can detect objects in one feed-forward pass with low-resolution input images 
. The model consists of three different modules. The first is a feature extraction module. This module is made up of a truncated base CNN model that is followed by convolutional layers used for the extraction of features at various scales. The second module is the object detection module which takes in feature maps and runs a set of default bounding boxes on their cells. The result is a defined number of box predictions, all of which have a shape offset and a class confidence score associated with them. The last module is the nonmaximal suppression module which chooses the best predictions out of the set presented by the detection module using a specific value of IoU and confidence score as a threshold. Lately, SSDs have made an appearance in precision agriculture for their ability to perform fast inference and work with low-resolution input images. These two features of SSDs make them desirable in real-time precision agriculture applications.
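The non-maximum suppression module described above can be sketched in a few lines. This is a class-agnostic, greedy version written for illustration; the IoU and confidence thresholds, the boxes, and the scores are assumed values, and a real SSD implementation typically applies the step per class.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_thresh=0.5, score_thresh=0.3):
    """Return indices of boxes kept after greedy suppression."""
    # Drop low-confidence predictions, then visit the rest best-first.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52),
         (100, 100, 140, 140), (11, 9, 49, 51)]
scores = [0.90, 0.80, 0.75, 0.20]
print(non_max_suppression(boxes, scores))
```

Here the second box is suppressed because it overlaps the highest-scoring box, and the fourth is discarded by the confidence threshold.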
Veeranampalayam Sivakumar et al. 
proposed using a single-shot detector to detect mid-to-late season weeds in soybean fields for weed-spread suppression. The authors used a feature extractor from the Inception V2 network and a stack of four extra convolutional layers to extract features at varying scales. The output of this feature extraction module was six feature maps that were then fed into the SSD’s detection module. A set of bounding boxes with five different aspect ratios and six different scales were used on all locations in all six feature maps, resulting in several box-bounded detection predictions, each with its own shape offset and class confidence score. An RMSProp optimizer was used. After training the model over 25,000 epochs, the model achieved a precision of 66%, a recall of 68%, an F1-score of 67%, a mean IoU of 84%, and an inference time of 21 s over 1152 × 1152 image test data.
Ridho and Irwan 
proposed a strawberry-picking robot that could detect strawberries of different health states in real time. The robot ran an SSD-MobileNet architecture on a single-board computer (SBC) to perform real-time inference. The network used a feature extraction module built with a MobileNet backbone. The choice of MobileNet was prompted by computational power and time restrictions associated with running a real-time inference model on a low-computational power single-board computer. Using transfer learning, the SSD-MobileNet V1 model was previously trained on 91 classes from the COCO dataset. The model was then retrained on two new datasets containing a total of 250 training images of strawberries in good and bad condition. The result of the training returned an accuracy of 90% in detecting good and bad strawberries on image input extracted from a real-time-streamed video.
2.6. Region-Based Convolutional Neural Networks
The region-based convolutional neural network (RCNN) is a two-stage object detection system that extracts many region proposals from input images, uses a CNN to perform forward propagation on each region proposal to extract its features, and then uses these features to predict the class and bounding box of this region proposal.
Sivakumar et al. 
proposed an approach where object detection-based CNN models were trained and evaluated using low-altitude UAV images to detect weeds in middle and late seasons in soybean fields. Faster RCNN and SSD were both evaluated and compared in terms of weed detection performance. When faster RCNN was configured with 200 box proposals, its weed detection performance was similar to that of the SSD model. The faster RCNN model with 200 box proposals returned a precision of 0.65, a recall of 0.68, an F1-score of 0.66, and an IoU of 0.85. On the other hand, the SSD model returned 0.66, 0.68, 0.67, and 0.84 for precision, recall, F1-score, and IoU, respectively. The performance of a patch-based CNN model was also evaluated and compared to the previous models. The faster RCNN model performed better than the patch-based CNN model.
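The F1-scores quoted above follow directly from the reported precision and recall, since F1 is their harmonic mean. The short check below uses only the figures given in the text; the function name is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


faster_rcnn_f1 = f1_from_precision_recall(0.65, 0.68)
ssd_f1 = f1_from_precision_recall(0.66, 0.68)
print(round(faster_rcnn_f1, 2), round(ssd_f1, 2))  # prints: 0.66 0.67
```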
Ammar et al. 
proposed an original deep-learning framework for the automated counting and geolocation of palm trees from aerial images. They applied several recent convolutional neural network models (faster RCNN, YOLOv3, YOLOv4, and EfficientDet) to detect palm trees and other trees and conducted a complete comparative evaluation in terms of average precision and inference speed. YOLOv4 and EfficientDet-D5 yielded the best tradeoff between accuracy and speed (up to 99% mAP and 7.4 FPS).
Su et al. 
used the Mask-RCNN model for identifying Fusarium head blight disease in wheat spikes and its degree of severity. To perform this task, two Mask-RCNNs performed instance segmentation on the input images, one of which segments individual spikes in the images and the other segments diseased areas of spikes. Thereafter, the severity of the infection on the spikes was evaluated by calculating the ratio of infected spike pixels in the images to the total number of spike pixels. The backbone of this model for feature map extraction was composed of a combination of a ResNet101 model and an FPN model. The model returned a prediction accuracy of 77.19% after comparing the results to a set of manually labeled images.
Yang et al. 
used an FCN-AlexNet model to perform real-time crop classification using edge computing. The authors collected 224 images using a UAV during the growing period of rice and corn. The quantitative analysis showed that the SegNet model slightly outperformed FCN-AlexNet by 1% in the overall recall rate of object classification.
Menshchikov et al. 
proposed an approach for fast and accurate detection of hogweed. The approach includes a UAV with an embedded system on board running various fully convolutional neural networks (FCNNs). They proposed an optimal architecture of FCNN for the embedded system relying on the tradeoff between the detection quality and frame rate. In their pilot study, they determined that different architectures could successfully solve the semantic segmentation task for the aerial hogweed detection of two classes. The SegNet model achieved the best ROC AUC of 96.9%. This model could detect hogweed that was not initially labeled. The modified U-Net architecture was characterized by a high frame rate (up to 0.7 FPS) and a reasonable recognition quality (ROC AUC > 0.938). Along with the low power consumption, the U-Net architecture demonstrated its applicability for real-time scenarios and running on edge-computing devices. One of the U-Net modifications could achieve 0.46 FPS on the NVIDIA Jetson Nano platform with an ROC AUC of 0.958.
Bah et al. 
proposed a model that combined CNN and the Hough transform to detect crop rows in images taken by a UAV. The model, called CRowNet, was a combination of SegNet (S-SegNet) and a CNN Hough transform (HoughCNet). The model achieved an accuracy of 93.58% and an IoU of 70%.
Hosseiny et al. 
proposed a model with the framework’s core based on a faster regional CNN (RCNN) model with a backbone of ResNet101 for object detection. The proposed framework’s primary idea was to generate unlimited simulated training data from an input image automatically. The authors proposed a fully unsupervised model for plant detection in UAV-acquired pictures of agricultural fields. Two datasets were used with 442 and 328 field patches, respectively. The precision, recall, and F1-score were 0.868, 0.849, and 0.855, respectively.
Weyler et al. 
addressed the problem of automated, instance-level plant monitoring in agricultural fields and breeding plots. They proposed a vision-based approach to perform a joint instance segmentation of crop plants and leaves in breeding plots. They developed a CNN-based encoder–decoder network with lateral skip connections that follows a two-branch architecture with two task-specific decoders to determine the position of specific plant key points and group pixels to detect individual leaf and plant instances. Lastly, they conducted pixel-wise instance segmentation of each crop and its associated leaves based on orthorectified RGB images captured by UAVs. Their method outperformed state-of-the-art instance segmentation approaches such as Mask-RCNN on this task. They achieved the highest score of 0.94 for AP50 at intermediate growth stages compared to 0.71 by Mask-RCNN with respect to the instance segmentation of sugar beet plants.
Lottes et al. 
presented a novel approach for joint stem detection and crop–weed segmentation using a fully convolutional network (FCN) integrating sequential information. Their proposed architecture enables the sharing of feature computations in the encoder while using two distinct task-specific decoder networks for stem detection and pixel-wise semantic segmentation of the input images. All their experiments were conducted using different generations of the BoniRob platform. BoniRob was built by BOSCH DeepField Robotics as a multipurpose field robot for research and development applications in precision agriculture, such as weed control, plant phenotyping, and soil monitoring. The system achieved the best mAP scores of 85.4%, 66.9%, 42.9%, and 50.1% for Bonn, Stuttgart, Ancona, and Eschikon datasets, respectively, for stem detection and 69.7%, 58.9%, 52.9% and 44.2% mAP scores for Bonn, Stuttgart, Ancona, and Eschikon datasets, respectively, for segmentation.
Su et al. 
proposed a deep neural network (DNN) that exploits the geometric location of ryegrass for the real-time segmentation of inter-row ryegrass weeds in a wheat field. Their proposed method introduced two subnets in a conventional encoder–decoder style DNN to improve segmentation accuracy. The two subnets treat inter-row and intra-row pixels differently and provide corrections to preliminary segmentation results of the conventional encoder–decoder DNN. A dataset captured in a wheat farm by an agricultural robot at different time instances was used to evaluate the segmentation performance, and the proposed method performed the best among various popular semantic segmentation algorithms (Bonnet, SegNet, PSPNet, DeepLabV3, and U-Net). The proposed method ran at 48.95 FPS with a consumer-level graphics processing unit and, thus, is real-time deployable at a camera frame rate. Their proposed model achieved the best mean accuracy and IoU scores of 96.22% and 64.21%, respectively.
Vaswani et al. 
proposed the transformer architecture based on the attention mechanism. A transformer is a sequence transduction model initially designed to tackle natural language processing (NLP) problems. The use of transformers for computer vision tasks was initially limited by the high computational cost of training. To address this issue, Dosovitskiy et al. 
proposed the vision transformer (ViT), which requires fewer resources while outperforming convolutional neural networks (CNNs). Other notable contributions include detection transformers (DETR), which target the same problem.
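The attention mechanism at the core of the transformer can be written as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The sketch below is a single-head, dependency-free illustration of that formula; it omits the learned projections, multi-head structure, and masking of a full transformer, and the toy matrices are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
uniform = attention([[0.0, 0.0]], K, V)   # equal scores, so V is averaged
focused = attention([[100.0, 0.0]], K, V)  # query aligned with the first key
```

In a ViT, the rows of Q, K, and V are derived from image patches rather than word tokens, which is why the cost grows quickly with image size.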
Thai et al. 
used ViTs for the early detection of infected cassava leaves and the classification of their diseases. Initially, they used the ImageNet-pretrained ViT model published by the Google Research Team. The model was then fine-tuned on the cassava leaf disease dataset. Later, the model was quantized to reduce its size and increase inference speed (FPS) before being deployed on a Raspberry Pi 4 Model B. Their model achieved a 90.3% F1-score, compared with the best CNN score of 89.2%, achieved by the ResNet50 model. Furthermore, they proposed a smart solution powered by the Internet of Things (IoT) that can be used in the agriculture industry for real-time detection of leaf diseases. The system consists of a drone that captures the leaf images, including the exact position of the spot in the field. The ViT model installed on the drone's Raspberry Pi classifies the images and clusters the infected leaves. The results are then combined with the spot's position and sent to a server via a 4G network to create a survey map of the field. Farmers and rescue agencies can obtain the map on their mobile phones and act before crops are lost.
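The text above does not state which quantization scheme was applied. As an illustration of why quantization shrinks a model, the following sketch assumes symmetric 8-bit post-training quantization of a weight vector; the function names and the toy weights are hypothetical.

```python
def quantize_symmetric_int8(weights):
    """Map float weights to integers in [-127, 127] with a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Toy weights (made up); each is stored in 1 byte instead of 4 after quantization
weights = [0.42, -1.27, 0.003, 0.88]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error per weight is bounded by half a quantization step, which is the price paid for the roughly fourfold reduction in storage relative to 32-bit floats.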
Reedha et al. 
used two different ViT models for plant classification in UAV images. Images were collected using a drone carrying a high-resolution camera and deployed over crop fields of beet, parsley, and spinach located in France. The camera captured orthorectified RGB images at regular intervals in the field. The data were manually labeled into five classes: weeds, beet, parsley, spinach, and off-type green leaves. They also employed data augmentation to improve the robustness and generalization capabilities of the model. They then trained ViT-B32 and ViT-B16 models, along with EfficientNet and ResNet CNN architectures for comparison. The results showed that the ViT models outperformed the CNN models, with F1-scores of 99.4% and 99.2% for ViT-B16 and ViT-B32, respectively. In comparison, the CNN models achieved slightly lower scores of 98.7% for EfficientNet B0 and 98.9% for B1, and a close 99.2% for ResNet50. The authors pointed out that although all techniques obtained high accuracy and F1-scores, the classification of crop and weed images using ViTs yielded the best prediction performance. However, the lower efficiency of ViTs compared with CNNs is another consideration if the model is to be deployed for real-time processing on a UAV.
Karila et al. 
used ViT models to estimate grass sward (i.e., short grass) quality and quantity in a field. The datasets were captured in the spring “primary growth phase” and again in the summer “regrowth phase” using a quadcopter drone equipped with two cameras. The first captured RGB images, while the second captured Fabry–Pérot interferometer (FPI) images. The results showed that the ViT RGB models performed best on the different datasets. VGG CNN models provided similarly satisfactory results in most cases.
Dersch et al. 
used a detection transformer (DETR) to detect single trees in high-resolution RGB true orthophotos (TDOPs) and compared it to a YOLOv4 single-stage detector. The multispectral images were collected by a 10-channel camera system with a horizontal field of view. The images were then post-processed using structure-from-motion (SfM) software and manually labeled, with a split of 80% training and 20% validation. DETR outperformed YOLOv4 in the mixed and deciduous plots, with F1-scores of 86% versus 65% in the mixed plots (a difference of about 20%) and 71% versus 67% in the deciduous plots (a difference of 4%). Across all three test plots, both methods had problems with over-segmentation. Furthermore, DETR was far worse than YOLOv4 at detecting smaller trees in multiple cases. The authors attributed these poor results to the fact that DETR uses lower-resolution feature maps than YOLOv4.
Chen et al. 
proposed a new efficient deep learning model called the density transformer (DENT) for automatic tree counting from aerial images. The model's architecture includes a multi-receptive field CNN (Multi-RF CNN) to compute a feature map over the input images, followed by a standard transformer encoder and a density map generator (DMG) to predict the density distribution over the input images. They also introduced a benchmark dataset of aerial images for tree counting, called the Yosemite tree dataset, and released it to the public. The model outperformed most state-of-the-art methods, with an MAE of 10.7 and an RMSE of 13.7, compared with 17.3 and 22.6, respectively, for YOLOv3. It is worth mentioning that the CANNet model achieved the closest values of 10.8 and 13.8, respectively, and achieved a better MAE score than the DENT model in one of four regions.
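Density-based counting methods are usually scored by comparing predicted and true object counts. The sketch below assumes the standard convention that the predicted count is the sum of the density map, and then computes MAE and RMSE over a set of images; the function names and toy values are illustrative and are not taken from the DENT implementation.

```python
import math


def count_from_density(density_map):
    """Predicted object count, taken as the sum of all density values."""
    return sum(sum(row) for row in density_map)


def mae_rmse(pred_counts, true_counts):
    """Mean absolute error and root mean squared error of count predictions."""
    errors = [p - t for p, t in zip(pred_counts, true_counts)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Two toy 2 x 2 density maps (made-up values) and their true tree counts
density_maps = [
    [[0.5, 0.5], [1.0, 0.0]],      # sums to 2.0
    [[0.25, 0.25], [0.25, 0.25]],  # sums to 1.0
]
true_counts = [3, 1]
pred_counts = [count_from_density(d) for d in density_maps]
mae, rmse = mae_rmse(pred_counts, true_counts)
```

Because RMSE squares each error before averaging, it penalizes a few large miscounts more heavily than MAE does, which is why both values are typically reported together.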
Lastly, Zhang et al. 
developed a spectral–spatial attention-based transformer (SSVT) to estimate crop nitrogen status from UAV imagery. The model is an improved version of the standard vision transformer (ViT), which extracts the spatial information of images. In addition, the proposed model captures spectral information, which contains most of the relevant features in agricultural applications. The model also tackles the computational complexity that ViT suffers from on large images and adopts a self-supervised learning (SSL) technique that allows models to train with unlabeled data. The results showed that the model, with 96.2% accuracy, outperformed the ViT model, with 94.4% accuracy. However, it required four million more parameters than the ViT model.
2.9. Semi-Supervised Convolutional Neural Networks
Bosilj et al. 
used the basic SegNet architecture to perform pixel-level classification and segmentation into three classes: soil, crop, and weed. The input comprised RGB and near-infrared (NIR) images. The authors used median frequency weighting to counter class imbalance, as soil pixels dominate any given field relative to crops or weeds. The input data were taken directly in the form of RGB and NIR channels because NDVI preprocessing typically results in minimal differences. The model was trained on three different datasets of sugar beets, carrots, and onions (SB16, CA17, and ON17), with fully labeled examples in some and partially labeled examples in others, using both pixel-level and object-level training. Object-based detection performed better than pixel-based detection in terms of precision, whereas pixel-based detection performed better in terms of recall. It is worth noting that the partially labeled ON17 dataset with SB16 weights outperformed the fully labeled dataset. The partially labeled CA17 dataset performed significantly worse than the fully labeled dataset, with a difference of almost 20% on weeds and 5% on crops.
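Median frequency weighting assigns each class a loss weight equal to the median class frequency divided by that class's own frequency, so rare classes such as weeds receive larger weights than dominant ones such as soil. The sketch below is a simplified version that computes frequencies over the whole dataset rather than per image; the function name and pixel counts are made up for illustration.

```python
from statistics import median


def median_frequency_weights(pixel_counts):
    """Return a loss weight per class: median(frequencies) / frequency.

    pixel_counts: dict mapping class name -> number of labeled pixels.
    Classes with zero pixels are skipped to avoid division by zero.
    """
    total = sum(pixel_counts.values())
    freqs = {c: n / total for c, n in pixel_counts.items() if n > 0}
    med = median(freqs.values())
    return {c: med / f for c, f in freqs.items()}


# Hypothetical pixel counts for a field image dominated by soil
counts = {"soil": 9000, "crop": 800, "weed": 200}
weights = median_frequency_weights(counts)
```

With these counts, the class at the median frequency (crop) receives a weight of 1, soil is down-weighted, and weed is up-weighted, which reduces the tendency of the network to label everything as soil.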
Coletta et al. 
used a semi-supervised classification algorithm that can aggregate information from clusters with that provided by a supervised algorithm such as SVM to discover new classes in an active learning manner. According to the authors, such an ability is highly convenient for inconsistent agricultural environments. The data were collected with a SenseFly eBee equipped with an RGB camera. The model consisted of two blocks: a classification block (ClaB) representing an area of 0.16 m² to be classified and a contextual block (ConB) providing supplementary context information. Both blocks formed a concentric pair that generates feature vectors to be classified. These vectors were manually labeled as belonging to one of three classes. Then a semi-supervised classifier was used to quantify the uncertainty of classification, and a density measure evaluated the importance of a classified feature vector. Instances with highly uncertain labels were flagged as novelties to be learned; these were chosen by entropy- and density-based selection (EDS), labeled later by a domain expert, and incorporated into the training set. The results showed that the all-class accuracy and recall improved iteratively.
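The uncertainty part of such a selection step can be illustrated with predictive entropy: an instance whose class probabilities are nearly uniform has high entropy and is a candidate for expert labeling. The sketch below covers only the entropy criterion, because the exact density measure is not described above; the threshold value and function names are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(class_probs, threshold):
    """Indices of instances whose predictive entropy exceeds the threshold."""
    return [i for i, p in enumerate(class_probs) if entropy(p) > threshold]


# Made-up three-class probabilities for three feature vectors
class_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, entropy close to ln(3)
    [0.70, 0.20, 0.10],  # moderately confident
]
to_label = select_uncertain(class_probs, threshold=0.9)
```

In an active learning loop, the selected instances would be sent to the expert, added to the training set with their new labels, and the classifier retrained before the next round.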
Li et al. 
used a radial basis function neural network (RBFNN) to predict farmland moisture accurately. In their work, they deployed a high-precision infrared sensor mounted on a UAV to collect discrete-time images of farmland for later analysis and used 20 uniformly distributed soil moisture sensors to obtain ground-truth data. To extract relevant information from the images, the authors used an image preprocessing pipeline that included adaptive median filtering, mean filtering, and edge information extraction using the Canny edge detection algorithm. Principal component analysis (PCA) was then used for dimensionality reduction, and its effect was studied by comparing the original model trained on the full dataset with the model trained on the PCA-reduced dataset. The evaluation results showed that the performance of the two models was similar, with the original achieving an R-squared score of 0.92176 and a mean percentage error (MPE) of 0.063, and the PCA-RBFNN model achieving an R-squared of 0.90157 and an MPE of 0.061. Ultimately, it could be concluded that applying PCA helped reduce the model's workload while maintaining similar accuracy.
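For readers unfamiliar with RBF networks, the sketch below shows the forward pass of a single-output network and the R-squared score used to evaluate it. Gaussian basis functions with a shared width are assumed here, as the text above does not specify the basis; the centers, weights, and inputs are hypothetical values rather than the trained model.

```python
import math


def rbf_forward(x, centers, width, weights, bias):
    """Output of a single-output RBF network with Gaussian hidden units.

    Each hidden unit responds with exp(-||x - c||^2 / (2 * width^2)),
    and the output is a weighted sum of those responses plus a bias.
    """
    hidden = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        hidden.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return bias + sum(w * h for w, h in zip(weights, hidden))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# An input that coincides with the only center activates that unit fully
at_center = rbf_forward([0.0, 0.0], [[0.0, 0.0]], 1.0, [2.0], 0.5)
perfect_fit = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Reducing the input dimensionality with PCA shortens each distance computation in the hidden layer, which is one way to read the reported reduction in workload.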