1. Introduction
The agriculture sector is the backbone of most countries, providing enormous employment opportunities to the community as well as goods manufacturing and food supply. Fruit plantation is one of the most important agricultural activities. The production and protection of fruit per capita has recently been considered an essential indicator of a country’s growth and quality of life
[1]. The global population is expected to grow from 7.2 billion to 9.6 billion by 2100. Advanced smart agriculture approaches must be used to meet the resulting demand for food
[2]. Several studies have recommended addressing the critical issue of improving management and production in the agriculture industry
[3][4]. Agriculture production has challenges in terms of productivity, environmental impact and sustainability. Agriculture ecosystems necessitate constant monitoring of several variables, resulting in a large amount of data. The data could be in the form of images that can be processed with various image processing algorithms to identify plants, diseases and other cases in varied agricultural situations
[5]. Advanced technology improvements have been made in agriculture with limited resources to ensure production, quality, processing, storage and distribution
[6]. The technology used in this field involves various scientific disciplines covering sensors, big data, artificial intelligence and robotics
[7]. Apart from using sensor technology to advance the agriculture industry
[8], the use of image annotation techniques to improve agriculture production is a relatively new technological development.
Image annotation has attracted widespread attention in the past few years due to the rapid growth of image data
[9][10][11]. This method is used to analyze big data images and predict labels for the images
[12]. Image annotation is the technique of labeling an image with keywords which reflect the character of the image and assist in the intelligent retrieval of relevant images using a simple query representation
[13]. Image annotation in the agriculture sector can annotate images according to the user’s requirement. Everything from plants and fruits to soil can be annotated to be recognized and classified. Moreover, it helps in plant detection, classification and segmentation based on the plant species, type, health condition or maturity. It can predict the label of a given image and can correspond well to the image content
[12]. Image annotation can describe images at the semantic level and has many applications that are not only focused on image analysis but also on urban management and biomedical engineering. Basically, image annotation algorithms are divided into traditional and deep neural network-based methods
[14]. However, traditional or manual image annotation has inherent weaknesses. Therefore, automatic image annotation (AIA) was introduced in the late 1990s by Mori et al.
[15].
The objective of automatic image annotation is to predict several textual labels for an unseen image representing its content, which is a labeling problem. This technology automatically annotates the image with semantic tags and has been applied in image retrieval, classification and the medical domain. The training data teach a model to assign semantic labels to new images automatically. One or more tags are transferred to the image based on image metadata or visual features. The technology has been proposed in many areas and shows outstanding achievement
[13][16]. Large amounts of data are required to improve the accuracy of annotating images of plants or diseases. To assist researchers in overcoming these severe challenges, Deng et al.
[17] introduced ImageNet, a publicly available large-scale image collection extensively used in computer vision. It has frequently been used as a benchmark for various computer vision problems. Another public dataset is PlantVillage
[18], an open-access platform for disease plant leaf images by Penn State University. Moreover, the datasets that are dedicated to fruit detection are MinneApple
[19], Date Fruit
[20] and MangoYOLO
[21], weed control datasets are DeepWeeds
[22] and Open Plant Phenotype Dataset
[23] and a dataset of plant seedlings at different growth stages is V2 Plant Seedling Dataset
[24].
AIA methods can be classified into many categories, which differ in their contribution, computational complexity, computational time and annotation accuracy. One of the categories is deep learning-based image annotation
[25][26]. Deep learning for AIA has attracted extensive attention both in theoretical studies and in various image processing and computer vision applications. It shows high potential in image processing capabilities for the future needs of agriculture
[27][28]. Deep learning, a subset of machine learning, was first introduced by Dechter
[29] in 1986 to machine learning and by Aizenberg et al.
[30] in 2000 to artificial neural networks. It transforms data through successive functions that represent the data hierarchically, expressing complex concepts in terms of simpler ones. It learns to perform tasks directly from images and produces high-accuracy responses
[31][32]. Several AIA techniques other than deep learning have been proposed, such as support vector machines, Bayesian models, texture resemblance and instance-based methods. Deep learning techniques, on the other hand, have succeeded in image processing throughout the last decade
[33]. The high accuracy of deep learning comes with high computational and storage requirements during the training and inference phases. The training process is both space consuming and computationally intensive, as millions of parameters must be refined over many training epochs
[34]. Due to the complexity of the data models, training is quite expensive. Furthermore, deep learning necessitates the use of costly graphics processing units (GPUs) and many machines, which raises the cost to users. The image annotation training set based on deep learning can be classified into supervised, unsupervised and semi-supervised categories.
Supervised deep learning involves training a data sample from a data source that has been classified correctly. Its algorithm is trained on input data that has been labeled for a certain output until it is able to discern the underlying links between the inputs and output findings. The system is supplied with labeled datasets during the training phase, which will inform it which outputs are associated with certain input values. Supervised learning provides a significant challenge due to the requirement of a huge amount of labeled data
[35][36] and at least hundreds of annotated images are required during the supervised training
[37]. The training approach consists of providing a large number of annotated images to the algorithm to help the model learn, then testing the trained model on unannotated images. To determine the accuracy of this method, annotated images with hidden labels are often employed in the algorithm’s testing stage. With enough annotated training images, supervised deep learning models achieve acceptable performance levels. Most studies apply supervised learning, as this method promises high accuracy, as proposed in
[38][39][40]. Another attractive annotation method is based on unsupervised learning. Unsupervised learning, in contrast to supervised learning, deals with unlabeled data. Labels are frequently difficult to obtain due to insufficient domain knowledge, or because labeling is prohibitively expensive. Furthermore, the lack of labels makes setting goals for the trained model problematic; consequently, determining whether the results are accurate is difficult. The study in
[41] employed unsupervised learning on two real weed datasets using a recent unsupervised deep clustering technique. The results on these datasets signal a potential direction for the use of unsupervised learning and clustering in agricultural challenges. For circumstances where cluster and class numbers vary, the suggested modified unsupervised clustering accuracy has proven to be a robust and easier-to-interpret clustering evaluation measure. The study also demonstrates how data augmentation and transfer learning can significantly improve unsupervised learning.
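The modified unsupervised clustering accuracy discussed above scores a clustering by first mapping each predicted cluster to a ground-truth class. A minimal sketch, using hypothetical labels and a brute-force search over one-to-one mappings (real implementations use the Hungarian algorithm, and the cited study's exact formulation also handles mismatched cluster/class counts):

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over all one-to-one mappings of clusters to classes.

    Brute-force over permutations: fine for small cluster counts and
    assumes no more clusters than classes.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_ids)
                   if mapping[c] == t)
        best = max(best, hits / len(true_labels))
    return best

# Hypothetical example: clusters 0/1 should correspond to "crop"/"weed".
truth = ["weed", "weed", "crop", "crop", "crop"]
preds = [1, 1, 0, 0, 1]  # cluster ids from an unsupervised model
print(clustering_accuracy(truth, preds))  # 0.8
```

Because the mapping is chosen after clustering, the measure rewards consistent grouping regardless of which arbitrary cluster id the model assigned to each class.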
Semi-supervised learning, like supervised and unsupervised learning, involves working with a dataset. However, the dataset is separated into labeled and unlabeled parts. When the labeling of acquired data is too difficult or expensive, this technique is frequently used. In fact, it is also possible to use it if the labeled data are poor quality
[42]. The fundamental issue in large-scale image annotation approaches based on semi-supervised learning is dealing with large, noisy datasets in which the number of images grows rapidly. The ability to identify unwanted plants has improved because of advancements in farm image analysis. However, the majority of these systems rely on supervised learning, which necessitates a large number of manually annotated images. As a result, given the huge variety of plant species being cultivated, supervised learning is economically infeasible for the individual farmer. Therefore, the studies in
[43][44][45] proposed an unsupervised image annotation technique to solve weed detection in farms using deep learning approaches.
Deep learning has significant potential in the agriculture sector in increasing the amount and quality of the produce by image-based classification. Consequently, many researchers have employed the technology and method of deep learning to improve and automate tasks
[3]. Its role in this sector gives excellent results in plant counting, leaf counting, leaf segmentation and yield prediction
[46]. Noon et al.
[47] reviewed the application of deep learning in the agriculture sector for early detection of plant leaf stress, enabling farmers to apply suitable treatment. Deep learning is effective in detecting leaf stress for various plants. However, implementing deep learning in agriculture requires a large amount of plant data, in terms of both collection and processing. The necessary data are basically collected using wireless sensors, drones, robots and satellites
[48]. The more data used to train the deep learning model, the more robust and pervasive the model becomes
[49].
Unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) are examples of robotics systems that provide a cost-effective, adaptable and scalable solution for product management and crop quality
[50]. Weeds are able to reduce crop production and their growth must be monitored regularly to keep them under control. Additionally, applying the same amount of herbicide to the entire field results in waste, pollution and a higher cost for farmers. The combination of image analytics from UAV footage and precision agriculture is able to assist agronomists in advising farmers on where to focus herbicides in particular regions in the field
[51][52]. As stated in
[53], the first stage in site-specific weed management is to detect weed patches in the field quickly and accurately. Therefore, the authors proposed object detection implemented with Faster RCNN in training and evaluating weed detection in soybean fields using a low-altitude UAV. The proposed technique was the best model in detecting weeds by obtaining an intersection over union (IoU) performance of 0.85. Franco et al.
[54] have captured a thistle weed species,
Cirsium arvense, in cereal crops by utilizing a UAV. This tool is used to gather a detailed view of an agricultural site and is attractive due to its low operational costs and flexible maneuvering. A UAV captured RGB images of thistles at 50 m above the ground; weed and cereal classes were annotated and grouped under unique pixel labels. According to
[51], labeling plants in a field image consumes a lot of time, and little attention has been paid to annotating the data for training a deep learning model. Therefore, the authors proposed a deep learning technique to detect weeds using UAV images by applying overlapping windows for weed detection
[51]. Deep learning techniques will provide the probability of the plant being a weed or crop for each window location. Deep learning can make harvesting robots more effective when generating robust and reliable computer vision algorithms to detect fruit
[55]. The usage of UAVs in dataset collection has also been applied in palm oil tree detection
[56], rice phenology
[57], detection and classification of soybean pests
[58], potato plant detection
[59], paddy field yield assessment
[60] and corn classification
[61].
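The intersection over union (IoU) measure mentioned above compares a predicted bounding box against a ground-truth box: the area of their overlap divided by the area of their union. A minimal sketch with hypothetical boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical predicted vs. ground-truth weed boxes:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 ≈ 0.333
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, which is why average IoU values like the 0.85 reported above indicate tight localization.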
Over the last few decades, UGVs have been used to achieve efficiency, particularly by reducing manpower requirements. UGVs have been employed for soil analysis
[62], precision spraying
[63], controlled weeding
[64] and crop harvesting
[65]. Mazzia et al.
[66] employed a UGV for path planning using deep learning as an estimator. Row-based crops are ideal for testing and deploying UGVs that can monitor and harvest the crops. The research demonstrated the viability of a complete autonomous global path planner based on the deep learning technique. In
[67], a robot harvester with the implementation of a deep learning algorithm is used to detect obstacles and observe the surrounding environment for rice. The image cascade network successfully detects obstacles and avoids collisions with an average success rate of 96.6%. Besides UAVs and UGVs, deep learning provides a practical solution in the agriculture field from satellite imagery. A vital component of agricultural monitoring systems is having accurate maps of crop types and acreage. Satellites can help determine the boundaries of smallholder farms, since these boundaries are hazy, irregularly shaped and frequently mixed with other land uses. Persello et al.
[68] presented a deep learning technique to automatically delineate smallholder farms using a convolutional network in combination with a globalization and grouping algorithm. The proposed solution outperforms alternative strategies by autonomously delineating field boundaries with F scores greater than 0.7 and 0.6 for the proposed test regions, respectively. Furthermore, satellites are implemented to capture images in identifying crops as presented in
[69]. The authors utilized multiexposure satellite imagery of agricultural land using image analysis and deep learning techniques for edge segmentation in an image. The implementation of a CNN for image edge smoothing achieves accuracy of 98.17%. According to
[70], enough data should be collected for training in order to predict crop yields and forecast crop prices reliably. Data availability is a significant limitation that can be overcome using satellite imagery, which can cover huge geographic areas. Combining deep learning with satellite imagery yields significant advantages in extracting field boundaries
[71], monitoring agricultural areas
[72], weather prediction
[73], crop classification
[74] and soil moisture forecast
[75].
Various implementations of deep learning in agriculture approaches have been extensively reviewed in recent years as proposed in
[5][37][76][77][78][79]. Among those, Koirala et al.
[77] reviewed the application of deep learning in fruit detection and yield estimation, Zhang et al.
[80] explored dense scene analysis applications of deep learning in agriculture and Moazzam et al.
[79] emphasized the challenges of weed and crop classification using deep learning. Given the great attention paid to deep learning in the agriculture sector in recent years, and contrary to existing surveys, this research concisely reviews the use of deep learning techniques in image annotation, focusing on plants and crop areas. It covers the most recent five years of research on this method in agriculture, including new technology and trends: the techniques of annotating images, the learning techniques, the various architectures proposed, the tools used and, finally, the applications. The application issues are basically plant detection, disease detection, counting, yield estimation, segmentation and classification. These tasks are difficult to perform manually, time consuming and labor intensive. The limits of human ability to identify objects for these tasks are compensated for by current technology and trends, particularly image annotation and deep learning techniques, which also boost process efficiency.
There are many different types of plants, and identifying them, especially rare ones, requires knowledge. Additionally, a systematic and disciplined approach to classifying various plants is crucial for recognizing and categorizing the vast amount of data acquired on the many known plants. To solve this problem, plant detection and classification are crucial tasks. Since segmentation helps to extract features from an image, it improves classification accuracy. Disease detection is another crucial concern in agriculture: without accurate identification of a disease and its causative agent, control procedures can waste time and resources and result in additional plant losses. Furthermore, in the agriculture industry, counting is essential in managing orchards, yet it can be difficult because of various issues, including overlapping.
In particular, counting leaves provides a clear image of the plant’s condition and stage of development. Especially in the age of global climate change, agricultural output assessment is essential for solving new concerns in food security. Accurate yield estimation benefits famine prevention efforts in addition to assisting farmers in making appropriate economic and management decisions.
2. Deep Learning for Image Annotation
Image annotation using deep learning is the most informative method but requires more complex training data. It is essential for functional datasets because it informs the training model about the crucial parts of the image, which can then be used to recognize the classes in test images. The majority of automatic image annotation methods first extract features from training and testing images. Secondly, the annotation model is developed based on the training data. Finally, annotations are generated based on the characteristics of the test images
[81].
Figure 1 illustrates the detail of the image annotation process. Feature extraction is a technique for indexing and extracting visual content from images. Color, texture, shape and domain-specific features are examples of primitive or low-level image features
[82].
Figure 1. The process of image annotation algorithm.
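Low-level color features such as those listed above can be captured with a simple per-channel histogram. A minimal sketch over hypothetical 3-channel pixel data (real systems combine this with far richer texture and shape descriptors):

```python
def color_histogram(pixels, bins=4):
    """Per-channel color histogram of RGB pixels, flattened into one
    normalized feature vector of length 3 * bins."""
    hist = [0] * (3 * bins)
    for r, g, b in pixels:
        for channel, value in enumerate((r, g, b)):
            bucket = min(value * bins // 256, bins - 1)  # 0..255 -> bin index
            hist[channel * bins + bucket] += 1
    n = len(pixels)
    return [count / n for count in hist]

# Hypothetical 2-pixel "image": one dark green pixel, one bright red pixel
features = color_histogram([(10, 200, 30), (250, 20, 40)], bins=4)
print(len(features))  # 12
```

The resulting fixed-length vector can be used directly for indexing and retrieval, or concatenated with other low-level features before training an annotation model.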
Depending on the approach utilized, various annotation types are used to annotate images. The popular image annotation techniques employed in agriculture based on deep learning are bounding box
[83][84][85][86] and segmentation
[87][88][89][90]. The study in
[91] proposed tools to boost the efficiency of annotating agriculture images, which frequently contain more varied objects and more detailed shapes than those in many general datasets. Feature extraction in deep learning architectures can be found in imaging applications. The architecture types that have frequently been applied in recent years are unsupervised pre-trained networks (UPNs), recurrent neural networks (RNNs) and convolutional neural networks (CNNs)
[92]. An RNN has the advantage of processing time-series data and making decisions about the future based on historical data. An RNN has been proposed by Alibabaei et al.
[93] to predict tomato yield according to the date, climate, irrigation amount and soil water content. RNN architectures include long short-term memory (LSTM), gated recurrent units (GRUs), bidirectional LSTM (BLSTM) and bidirectional GRU (BGRU). The research shows that BLSTM is able to capture the relationship between past and new observations and accurately predict the yield. However, the BLSTM model has a longer training time compared to the other implemented models. The authors also conclude that deep learning is able to estimate the yield at the end of the season.
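The LSTM units mentioned above gate how much of the past state is carried forward at each time step. A minimal numpy sketch of a single LSTM cell step with hypothetical random weights and made-up dimensions (deep learning frameworks provide optimized versions of exactly these equations):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4h, d) input weights, U: (4h, h) recurrent weights, b: (4h,) bias.
    Gate order in the stacked weights: input, forget, cell candidate, output.
    """
    h = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = 1.0 / (1.0 + np.exp(-z[:h]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[h:2 * h]))   # forget gate
    g = np.tanh(z[2 * h:3 * h])             # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3 * h:]))    # output gate
    c = f * c_prev + i * g                  # keep old state vs. write new
    return o * np.tanh(c), c                # new hidden and cell states

# Hypothetical setup: 3 input features per day (e.g., temperature,
# irrigation amount, soil water content), hidden size 5, 7 time steps.
rng = np.random.default_rng(0)
d, hsz = 3, 5
W = rng.normal(size=(4 * hsz, d))
U = rng.normal(size=(4 * hsz, hsz))
b = np.zeros(4 * hsz)
h, c = np.zeros(hsz), np.zeros(hsz)
for x in rng.normal(size=(7, d)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (5,)
```

A bidirectional LSTM simply runs a second cell over the sequence in reverse and concatenates both hidden states, which is what lets it relate past and future observations.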
A CNN is mainly used among deep learning architectures due to its high detection accuracy, reliability and feasibility
[94]. CNNs, or ConvNets, are designed to learn spatial features, for example edges, textures, corners or more abstract shapes. The core of learning these characteristics is the diverse and successive transformation of the input, which includes convolution at different spatial scales and pooling operations. These operations identify and combine both high-level concepts and low-level features
[95]. This method has been proven to be good in extracting abstract features from a raw image through convolutional and pooling layers
[96]. The architecture of CNNs was introduced by Fukushima
[97] who proposed the algorithm of supervised and unsupervised training of the parameter that learns from the incoming data. In general, a CNN receives the image data that form input layers and generates a vector of different characteristics assigned to object classes in the form of an output layer. There are hidden layers between the input and output layers consisting of a series of convolution and pooling layers and ending with a fully connected layer
[98]. CNNs are widely used as a powerful class of models to classify images in multiple agricultural problems such as fruit classification, plant disease detection, weed identification and pest classification
[99]. In addition, they can also detect and count the number of crops. Huang et al.
[100] chose a CNN to classify green coffee beans because CNN characteristics are good at extracting image color and shape.
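The convolution and pooling operations described above can be illustrated in a few lines. A minimal pure-Python sketch of one 3×3 convolution (valid padding) followed by 2×2 max pooling, the two building blocks that are stacked to form a CNN, using a hypothetical toy image:

```python
def conv2d(image, kernel):
    """Valid 2D convolution (strictly, cross-correlation, as in most CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2: keep the strongest local response."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A vertical-edge kernel on a hypothetical 6x6 image: dark left, bright right.
img = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
fmap = conv2d(img, edge)   # 4x4 feature map, strong response at the edge
print(max_pool2x2(fmap))   # 2x2 pooled map
```

In a trained CNN the kernel values are learned rather than hand-crafted, and many such kernels are applied in parallel and stacked over several layers before the fully connected output layer.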
Two categories of object detection in deep learning are defined: drawing bounding boxes around objects and classifying the object’s pixels. From a labeling perspective, drawing rectangular bounding boxes around an object is much easier than labeling the object’s pixels by drawing outlines. However, from a mapping perspective, pixel-level object detection is more accurate than the bounding box technique
[101]. According to Hamidinekoo et al.
[102], it is challenging to segment and compute the detection of individual fruits from images. Therefore, the authors applied a CNN to classify various parts of the plant inflorescence and estimate fruit numbers from the images. CNNs are also used in detecting fruit and disease. Onishi et al.
[103] proposed a high-speed and accurate method to detect the position of fruit and automate harvesting using a robot arm. The authors utilized a single-shot multibox detector (SSD), a CNN-based method that detects objects in an image using a single deep neural network. To achieve a high level of recognition accuracy, the SSD creates multiscale predictions from multiscale feature maps and explicitly separates the predictions based on aspect ratio. The image of fruit detection utilized in this method is shown in
Figure 2. Other fruits and leaves occlude some apples, but the method can still detect them. The results showed that fruit detection accuracy using the SSD is 90%, achieved in only 2 s.
Figure 2. Fruit detection using CNN
[103].
Another major concern in the agriculture sector nowadays is that many pathogens and insects threaten many farms. Since deep learning can dive into deep analysis and computation, this technique is one of the prominent methods for plant disease detection
[104]. Many approaches help to monitor the health of the crop, from semantic segmentation to other popular image annotation techniques. Labeling data for segmentation is more challenging than labeling for classification. Several image annotation methods based on supervised learning for object segmentation have been presented in recent years for this reason. Sharma et al.
[105] used image segmentation to detect disease by employing the CNN method. In order to obtain maximum data on disease symptoms, the image is segmented by extracting the affected parts of the leaves rather than using whole images. The quantifying result for each type of disease shows that the model is trained very well, achieving excellent results even under real conditions. Kang and Chen
[106] performed detection and segmentation of apple fruit and branches as shown in
Figure 3. As shown in
Figure 3a–f, apples are drawn in distinct colors, and branches are drawn in blue. These detections and segmentations are recognized by utilizing a CNN. The experiment achieved 0.873 accuracy of instance segmentation of apple fruits and 0.794 accuracy of branch segmentation.
Figure 3. Detection and segmentation of fruit and branch
[106].
Khattak et al.
[107] proposed a CNN to identify fruits and leaves in healthy and diseased conditions. The results show that the CNN has a test accuracy of 94.55%, making it a suggested support tool for farmers in classifying citrus fruit/leaf condition as either healthy or diseased. In yield estimation, Yang et al.
[108] trained a CNN to estimate corn grain yield. The experiment conducted by the authors produced 75.50% classification accuracy of spectral and color images. Fuentes
[109] successfully proved that the implementation of a deep learning technique can detect disease and pests in tomato plants. In addition, the technique is able to deal with a complex scenario from the surrounding area of the plant. The result obtained is shown in
Figure 4a–d, where deep learning achieves high accuracy in detecting disease and pests. The images from left to right for each sub-figure are the input image, annotated image and predicted results.
Figure 4. Detection result of disease and pests that affected tomato plants. (a) Gray mold, (b) Canker, (c) Leaf mold, (d) Plague
[109].
The architectures of CNNs have evolved gradually with an increasing number of convolutional layers, namely LeNet, AlexNet, Visual Geometry Group 16 (VGG16), VGG19, ResNet, GoogLeNet, ResNeXt, DenseNet and You Only Look Once (YOLO). The differences between these architectures are the number of layers, the non-linearity function and the pooling type used
[110]. Mu et al.
[111] applied VGGNet to detect the quality of blueberries through skin pigments during the seven stages of maturity. The technique was used to overcome the difficulty of identifying the maturity and quality grade of blueberry fruit by the human eye. The method has improved the accuracy and efficiency of blueberry quality detection. Lee et al.
[112] proposed three types of CNN architecture with different layers, namely, VGG16 with 16 layers, InceptionV3 with 48 layers and GoogLeNetBN with 34 layers. InceptionV2 inspired the GoogLeNetBN and InceptionV3 architectures and improves accuracy while reducing computational complexity. Batch normalization (BN) has been proven to limit overfitting and speed up convergence. In a study by
[113], three CNN architectures, AlexNet, InceptionV3 and SqueezeNet, were compared to assess their accuracy in evaluating tomato late blight disease. Among these architectures, AlexNet generates the highest accuracy in feature extraction with 93.4%. Gehlot and Saini
[114] also compared the performance of CNN architectures in classifying diseases in tomato leaves. The architectures assessed in the research are AlexNet, GoogLeNet, VGG-16, ResNet-101 and DenseNet-121. The accuracy of all these architectures is almost equal. However, DenseNet-121 is much smaller, at 89.6 MB, while ResNet-101 is the largest at 504.33 MB.
Figure 5 presents the details of image annotation and its deep learning-based techniques. Low-level features are used to represent images in image classification and retrieval. The initial stage in semantic comprehension is to extract efficient and effective visual features from an image’s unstructured array of pixels. The performance of semantic learning approaches is considerably improved by appropriate feature representation. Numerous feature extraction techniques, including image segmentation, color features, texture characteristics, shape features and spatial relationships, have been proposed
[115]. There are five categories of image annotation methods, which are generative model-based image annotation, nearest neighbor-based image annotation, discriminative model-based image annotation, tag completion-based image annotation and deep learning-based image annotation
[25][26]. In the past decade, tremendous progress has been made in deep learning techniques, allowing image annotation tasks to be solved using deep learning-based feature representation. The most recent advancements in deep learning enable a number of deep models for large-scale image annotation. Deep learning-based approaches commonly use a CNN to extract robust visual characteristics. Several versions of CNN architecture, such as LeNet, VGG, GoogLeNet, etc., have been proposed. The following section describes the most commonly employed CNN architectures. The four types of image annotation are image classification, object detection or recognition, segmentation and boundary recognition. All of these task types can be annotated using deep learning techniques. The training process of deep learning can be supervised, unsupervised or semi-supervised, depending on how the neural network is used. In most cases, supervised learning is used to predict a label or a number. Commonly used benchmarks for evaluating image annotation techniques are based on performance metrics.
Figure 5. Image annotation for deep learning-based technique.
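The performance metrics mentioned above commonly include precision, recall and the F1 score (the F scores reported for field delineation earlier are of this family). A minimal sketch over hypothetical detection counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and
    false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical detector output: 80 correct detections, 20 spurious, 20 missed
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

Precision penalizes spurious annotations, recall penalizes missed objects, and F1 balances the two, which is why it is the usual single-number summary for annotation benchmarks.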
This entry is adapted from the peer-reviewed paper 10.3390/agriculture12071033