Forecasting Plant and Crop Disease

Every year, plant diseases cause a significant loss of valuable food crops around the world. The management practices implemented to mitigate the damage caused by plant and crop diseases have changed considerably. Today, through the application of new information and communication technologies, it is possible to predict the onset or change in the severity of diseases using modern big data analysis techniques. In this paper, we present an analysis and classification of research studies conducted over the past decade that forecast the onset of disease at a pre-symptomatic stage (i.e., symptoms not visible to the naked eye) or at an early stage. We examine the specific approaches and methods adopted, pre-processing techniques and data used, performance metrics, and expected results, highlighting the issues encountered. The results of the study reveal that this practice is still in its infancy and that many barriers need to be overcome.

plant disease prediction
precision agriculture
machine learning
artificial intelligence
deep learning
food security
review

Forecasting Plant and Crop Disease: An Explorative Study on Current Algorithms

Gianni Fenu (https://orcid.org/0000-0003-4668-2476) and Francesca Maridina Malloci * (https://orcid.org/0000-0003-3287-4450)

Department of Mathematics and Computer Science, University of Cagliari, Via Ospedale 72, 09124 Cagliari, Italy; fenu@unica.it

* Correspondence: francescam.malloci@unica.it

1. Introduction

Crop and plant diseases have serious implications for food security and cause production losses. Over the years, sustained global trade and a changing climate have not only exacerbated the existing favorable conditions for plant and crop disease but have also created new conditions with which agriculture must now contend. As the Food and Agriculture Organization of the United Nations (FAO) [1] asserts, plant pests and diseases are responsible for losses of 20% to 40% of annual global food production. This means that timely disease management will be necessary in order to address the increase in food demand driven by the population growth projected by 2050 [2].

To meet these challenges, several studies [3,4,5,6,7,8] have been conducted with the aim of increasing our understanding of the seasonal effect of environmental and weather conditions on diseases affecting major food crops. The recent adoption of new information and communication technologies (ICT) such as the Internet of Things (IoT) [9], remote sensing [10], and cloud computing [11] is encouraging the spread of Precision Agriculture (PA), defined as the application of technologies and principles to manage the spatial and temporal variability associated with all aspects of agricultural production for the purpose of improving crop performance and environmental quality [12].

The aforementioned digital technologies contribute to improving our understanding by continuously monitoring and measuring different physical phenomena [13,14], producing a huge amount of data, termed Big Data [15]. Agricultural big data is successfully being used for various tasks, such as yield prediction [16], weed and pest/disease detection [17], crop and food detection [18], risk management, food safety [19] and spoilage prevention, and operational/equipment management, including plant and disease prediction [3,14,20]. The analysis of Big Data by means of Machine Learning (ML) [21], Deep Learning (DL) [22], and Artificial Intelligence (AI) [3] techniques has only recently begun to be applied [15].

A decade of research has generated considerable knowledge of the complex, interconnected, and dynamic process of crop management. It is well known that plant disease responds to different climatic and environmental variables in distinct ways [23,24], and so the outcome of any host–pathogen interaction in uncertain conditions is not readily predictable. However, according to Classen et al. [25], there is still a lack of models involved in determining plant health under a changing climate, as well as their direct and indirect effects and interactions.

As a consequence, more effort and research are urgently needed with the aim of developing novel solutions to prevent and mitigate the impact of crop and plant disease on food production, especially at an early stage.

Thus, the main contribution of this study is to present an analysis and classification of the algorithms applied in the prediction of crop and plant diseases by highlighting the problems encountered, the methods and techniques employed, and the data used. In the literature, plant diseases have been predicted in several ways. This review considers crop and plant disease prediction models that adopt AI, ML, and DL algorithms to predict symptoms before they appear in the field or at an early stage with mild and small lesions. Accordingly, detection techniques were not taken into consideration. In addition, we critically discuss open challenges and directions for future research.

2. Methodology

The methodological design for this study’s bibliographic analysis involved two phases: (a) the collection of related research and (b) the analysis of these contributions. Data were collected from the scientific databases IEEE Xplore, ScienceDirect, MDPI, Hindawi, and from the web-based scientific indexing services Web of Science and Google Scholar. Regarding the search keywords, the following query was performed:

[“crop disease” OR “plant disease”] AND [“prediction” OR “forecasting”]

Only conference papers and journal articles published between 2010 and 2020 were considered. From the results, we filtered out the papers that did not provide sufficient descriptive elements for the method adopted. In this way, the number of documents was reduced to 46. Finally, each paper was analyzed considering the general approach, the AI, ML, or DL techniques employed, the sources, the type of data used, the predicted output, and the applied performance metrics.
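
The selection step above can be illustrated with a short sketch. It assumes a hypothetical record layout (dictionaries with `title`, `year`, and `kind` fields) and simple substring matching; the databases listed above each have their own query syntax, so this is only an approximation of the Boolean query and the publication filter, not the procedure actually run by the authors.

```python
# Hypothetical record layout: {"title": str, "year": int, "kind": str}.
SUBJECT_TERMS = ("crop disease", "plant disease")
TASK_TERMS = ("prediction", "forecasting")
ALLOWED_KINDS = {"journal", "conference"}


def matches_query(text):
    """Return True if text satisfies (subject term) AND (task term)."""
    lowered = text.lower()
    has_subject = any(term in lowered for term in SUBJECT_TERMS)
    has_task = any(term in lowered for term in TASK_TERMS)
    return has_subject and has_task


def select_records(records, first_year=2010, last_year=2020):
    """Keep journal/conference records in the year window that match the query."""
    return [
        record for record in records
        if record["kind"] in ALLOWED_KINDS
        and first_year <= record["year"] <= last_year
        and matches_query(record["title"])
    ]


# Made-up records, for illustration only.
records = [
    {"title": "Forecasting plant disease from weather data", "year": 2018, "kind": "journal"},
    {"title": "Plant disease detection with images", "year": 2019, "kind": "journal"},
    {"title": "Crop disease prediction using sensors", "year": 2008, "kind": "conference"},
    {"title": "Crop disease prediction: a thesis", "year": 2015, "kind": "thesis"},
]
selected = select_records(records)
```

Only the first record survives: the second lacks a prediction or forecasting term, the third falls outside the time window, and the fourth is neither a journal article nor a conference paper. The manual filtering of papers with insufficient method descriptions is not automated here.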

To the best of our knowledge, this is the first study that focuses exclusively on predicting the symptoms of plant and crop diseases before they appear in the field or at an early stage with mild and minor lesions. A brief review of 10 scientific works related to pest detection and prediction is provided in [26], while other review works were conducted for different domains [22] and sub-domains [27,28,29,30] belonging to the agriculture sector.

3. Discussion

In the present section, we discuss the plant and crop disease predictions carried out through data analysis techniques, which fall into three areas of computer science: AI, ML, and DL. The review considered 46 scientific papers that predicted the onset of disease at a pre-symptomatic stage (i.e., symptoms not visible to the naked eye) or at an early stage, selected by adopting the methodology described in Section 2. Figure 3a shows the trend line of publications. From 2010 to 2020, there was an increasing trend (orange line). The majority of the papers were published after 2015, indicating how recent this sub-domain is in agriculture. More precisely, 22% of the papers were published from 2010 to 2015, with 2015 excluded (dashed line), and 78% from 2015 to 2020 (solid line). Figure 3b shows the number of citations for each year considered. As can be seen, the resulting line does not show a uniform trend, with a maximum in 2010 and a minimum in 2011–2013. However, to better understand this trend, it is necessary to evaluate the impact index of the conference proceedings and journals. To this end, a bar graph was constructed (Figure 3c) that shows the number of citations for each paper, with works grouped according to the journal’s H-index. Three clusters were identified: 0 ≤ H-Index ≤ 50 (blue), 50 < H-Index < 100 (orange), and H-Index ≥ 100 (gray). The information relating to the citations was retrieved from the Google Scholar service, while the H-index was taken from the Scimago (https://www.scimagojr.com/) online service. The graph shows that 67% of publications were related to an impact index of 50 or lower, 13% to an impact index between 50 and 100, and the remaining 20% to an impact index of 100 or higher. Therefore, the graph suggests that the number of citations is closely linked to how often the conference proceedings/journal is consulted.

Figure 3. (a) Number of research publications per year (2010–2020) related to plant and crop disease prediction, which predicted the onset of the disease at a pre-symptomatic (i.e., symptoms not visible to the naked eye) or early stage, retrieved by adopting the methodology described in Section 2; (b) number of citations for each year considered; (c) number of citations for each paper, grouped according to the journal’s H-index.
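
The grouping by H-index described above can be sketched as follows. The cluster names and the example H-index values are hypothetical; the split of 31, 6, and 9 papers is merely one partition of the 46 reviewed papers that reproduces the reported 67%, 13%, and 20% after rounding, since the text reports only the percentages.

```python
def h_index_cluster(h_index):
    """Map a venue H-index to one of the three clusters defined in the text."""
    if h_index < 0:
        raise ValueError("H-index cannot be negative")
    if h_index <= 50:
        return "low"      # 0 <= H-index <= 50
    if h_index < 100:
        return "medium"   # 50 < H-index < 100
    return "high"         # H-index >= 100


def cluster_shares(h_indices):
    """Return the percentage of papers in each cluster, rounded to integers."""
    if not h_indices:
        raise ValueError("at least one H-index is required")
    counts = {"low": 0, "medium": 0, "high": 0}
    for h_index in h_indices:
        counts[h_index_cluster(h_index)] += 1
    total = len(h_indices)
    return {name: round(100 * count / total) for name, count in counts.items()}


# Hypothetical H-index values: 31 low, 6 medium, and 9 high venues (46 papers).
example_h_indices = [20] * 31 + [75] * 6 + [120] * 9
shares = cluster_shares(example_h_indices)
```

Note that the boundary values matter: a venue with an H-index of exactly 50 falls in the first cluster, and one with exactly 100 falls in the third.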

Our study shows that the approaches used in the literature to tackle the problem under examination can be divided into three categories: forecast models based on weather data (the first category), forecast models based on image processing (the second category), and forecast models based on distinct types of data coming from various heterogeneous sources (the third category). The first and second categories are the most explored, with adoption rates of 63% and 22%, respectively, against 15% for the third category (see Figure 3c). Generally, RMs, ANNs, SVMs, and CNNs are the most used techniques.

Due to the high heterogeneity of the experimental conditions (i.e., approaches, datasets, parameters, and performance metrics), it is difficult, if not unreliable, to perform a systematic comparison of the performance reported in each paper. Therefore, our comparisons are strictly limited to the techniques used in each paper. Taking these factors into account, we observe that the SVM and SVR, when applied to the first category of models, outperform ANNs and traditional regression models. As underlined by the various studies, the advantage of the SVM lies in its ability to learn representations of data in non-linear problems with high dimensionality and small samples. Likewise, ANNs were favored for their ability to learn representations from past events in order to predict the future probability of occurrence of an event based on conducive conditions. The main obstacles encountered by the studies of the first category that used these techniques concerned the small size of the datasets, class imbalance, and overfitting. These problems had a major impact on the ANNs [59], whose performance was limited by their tendency to require more time and more data for training. Another disadvantage shared by both techniques is the “black box” problem: the relationship between input and output is difficult to explain and derive. As indicated by Gu et al. [47], the purpose of SVR is to make predictions rather than to provide explanations. Therefore, there are limitations in explaining the effects that variables have on other variables.

Although SVM showed superior performance in weather-based classification and regression problems, it did not perform as well in the image processing approaches. Models belonging to this category conducted spectral analysis using optical and thermal remote sensing images as well as multispectral and hyperspectral images. Multispectral images were found to be relevant for the detection of disease at an early stage, as demonstrated in [38]. Hyperspectral images, the type most used by the studies reviewed (six papers), allowed the prediction of disease at a pre-symptomatic stage; i.e., before the symptoms were visible to the naked eye. This difference is due to the spectral resolution of the two remote sensing technologies. Multispectral imaging collects spectral signals in a few discrete bands, each spanning a broad spectral range from tens to hundreds of nanometers. In contrast, hyperspectral imaging detects spectral signals in a series of continuous channels with a narrow spectral bandwidth (typically below 10 nm); therefore, it can capture fine-scale spectral features of targets that could otherwise be lost [30]. Compared with hyperspectral images, multispectral images have lower data complexity and information content [85]. However, the analysis of hyperspectral images brings with it various limitations. Several authors cited the high dimensionality of the data as one of the difficulties encountered. As Mahlein et al. [85] pointed out, the high degree of inter-band correlation results in information redundancy, which can cause convergence instability in multivariate prediction models.
Therefore, most papers devoted a great deal of effort to identifying the effective wavelengths for the extraction of the target properties; i.e., the patterns in the spectrum that distinguish a healthy leaf from a pre-symptomatic diseased leaf. Image resolution also affects model performance. Zhang et al. [37] combined weather data with MODIS images. The authors observed that the forecasting accuracy was affected by the limited spatial resolution and imaging quality of optical remote sensing images.

From the results obtained in the literature, we infer that remote sensing data represent a wealth of information that is useful for the development of autonomous non-invasive systems for the prediction of biotic and abiotic stress in plants. In this context, recent Deep Learning models, such as CNNs, seem capable of properly addressing many of the technical challenges related to perceptual problems, as seen in other use cases; e.g., yield prediction [16], land cover classification [86], and plant and weed recognition [87]. In particular, spectral images can be an aid to conventional adversity management techniques, which are often time-consuming, destructive, expensive, and impractical. An ideal system requires precision, speed, and non-destructive practices.

In general, the studies of the first and second categories show that the exclusive use of a single data source is not sufficient to build models capable of capturing and predicting the variability of a disease in the field. To increase the stability and generalization capabilities of the algorithms, several authors suggest the integration of multiple data sources, as well as the inclusion of more information such as plant age, cultivar, growth phase, and soil characteristics. The results obtained from the third category of models confirm this. Zhang et al. [67], by combining meteorological data with remote sensing data, recorded an increase in accuracy from 69% to 78%.
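
The information redundancy caused by inter-band correlation in hyperspectral data, noted above with reference to Mahlein et al. [85], can be illustrated with a small sketch. It uses a simple greedy correlation filter, which is an assumption made here for illustration and not the wavelength selection method of any reviewed paper; the reflectance values and the 0.95 threshold are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant band carries no correlation information
    return cov / sqrt(var_x * var_y)


def select_bands(spectra, threshold=0.95):
    """Greedily keep a band only if |r| with every kept band is below threshold.

    spectra: list of samples, each a list of reflectance values per band.
    Returns the indices of the retained bands.
    """
    n_bands = len(spectra[0])
    columns = [[sample[b] for sample in spectra] for b in range(n_bands)]
    kept = []
    for b in range(n_bands):
        if all(abs(pearson(columns[b], columns[k])) < threshold for k in kept):
            kept.append(b)
    return kept


# Made-up reflectance values: band 1 is a scaled copy of band 0,
# and band 3 is band 2 shifted by a constant offset.
spectra = [
    [0.10, 0.20, 0.50, 0.55],
    [0.20, 0.40, 0.30, 0.35],
    [0.30, 0.60, 0.45, 0.50],
    [0.40, 0.80, 0.20, 0.25],
]
kept_bands = select_bands(spectra)
```

In this toy example, bands 1 and 3 are dropped because each is almost perfectly correlated with a band already retained, leaving bands 0 and 2. Real hyperspectral cubes contain hundreds of bands, and the reviewed studies rely on more elaborate selection procedures.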

Most works have focused on forecasting a disease by analyzing mainly meteorological parameters. Temperature, humidity, and precipitation emerge as the variables that contribute most to the onset of disease. Each of these has different effects depending on the disease and the crop under examination. Rowlandson et al. [88] observed that leaf wetness, together with the variables mentioned above, is a parameter that should not be neglected. The authors pointed out that the analysis of leaf wetness periods of a specific duration is necessary, as this variable interacts with the propagule germination of most phytopathogens. Badnakhe et al. [70] demonstrated that soil temperature plays a crucial role in gummosis disease prediction. As Section 2.3.3 illustrates, a large variety of algorithms and techniques have been employed to predict the occurrence or severity of diseases affecting different crops and plants. From these, we observed that many scientific contributions focused on predicting the main diseases affecting rice, wheat, and potato crops, such as late blight [3,20], powdery mildew [37,67,68], downy mildew, and blast [46,48,55], as shown in Figure 4a,b.

Figure 4. (a) Number of publications based on crop and disease examined; (b) current state of crops and plants explored during the last 10 years in terms of percentage of research papers.
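
The analysis of leaf wetness periods of a specific duration, highlighted above with reference to Rowlandson et al. [88], can be sketched as a search for runs of consecutive wet hours. The sketch assumes hourly binary readings (1 for wet, 0 for dry) and an illustrative minimum duration of 4 hours; neither the sensor format nor the threshold comes from the source, and the relevant duration depends on the pathogen.

```python
def wet_periods(hourly_wet, min_hours):
    """Return (start_index, length) for each run of wet hours of at least min_hours."""
    periods = []
    run_start = None
    for i, wet in enumerate(hourly_wet):
        if wet and run_start is None:
            run_start = i                      # a wet run begins
        elif not wet and run_start is not None:
            length = i - run_start             # the wet run has just ended
            if length >= min_hours:
                periods.append((run_start, length))
            run_start = None
    if run_start is not None:                  # a run that lasts to the final reading
        length = len(hourly_wet) - run_start
        if length >= min_hours:
            periods.append((run_start, length))
    return periods


# Made-up hourly readings: wet runs of 3, 5, and 2 hours.
readings = [0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
long_periods = wet_periods(readings, min_hours=4)
all_periods = wet_periods(readings, min_hours=2)
```

With a 4-hour threshold, only the 5-hour run starting at index 6 is reported. Such period counts or durations could then be used as input features alongside temperature, humidity, and precipitation.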

Overall, with regard to the approaches adopted by the works surveyed, we can generalize that a prediction model for plant and crop disease should consist of three mandatory steps: pre-processing, feature selection, and classification. The flow diagram of our survey is shown in Figure 5.

Figure 5. Techniques popularly explored in the domain of plant and crop disease prediction models.
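
The three steps named above (pre-processing, feature selection, and classification) can be sketched as a minimal pipeline. All choices here are assumptions made for illustration: standardization as the pre-processing step, a class-mean-difference score for feature selection, and a nearest-centroid classifier as a deliberately simple stand-in for the SVMs, ANNs, and Random Forests used in the reviewed studies. The data are made up.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Pre-processing: learn per-feature mean and standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return means, stds


def scale(rows, means, stds):
    """Standardize rows with previously learned statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def select_features(rows, labels, k):
    """Feature selection: keep the k features whose class means differ most."""
    scores = []
    for j in range(len(rows[0])):
        diseased = [row[j] for row, y in zip(rows, labels) if y == 1]
        healthy = [row[j] for row, y in zip(rows, labels) if y == 0]
        scores.append(abs(mean(diseased) - mean(healthy)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(rows, labels, features):
    """Classification (training): compute one centroid per class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [[row[j] for j in features]
                   for row, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(row, centroids, features):
    """Classification (inference): return the label of the nearest centroid."""
    point = [row[j] for j in features]
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(point, centroids[label])),
    )


# Made-up samples: [temperature, relative humidity, uninformative value].
train = [[18, 90, 5], [19, 92, 1], [20, 88, 4], [27, 50, 2], [28, 55, 5], [29, 45, 1]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = disease occurred, 0 = no disease

means, stds = fit_scaler(train)
scaled = scale(train, means, stds)
features = select_features(scaled, labels, k=2)
centroids = fit_centroids(scaled, labels, features)
new_samples = scale([[19, 85, 3], [30, 40, 3]], means, stds)
predictions = [predict(row, centroids, features) for row in new_samples]
```

The scaler statistics are learned on the training data and reused for new samples, so that the selected features and the centroids live in the same scaled space. In this toy example, the third column is discarded by the selection step, and the cool, humid sample is assigned to the disease class.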

The Support Vector Machine was the most employed technique, followed by the Artificial Neural Network and Random Forest. Nevertheless, although the aforementioned techniques and others produced promising results, only a restricted number of studies tested their solutions on different datasets. This underlying trend may be influenced by the restricted availability of open data. To obtain weather and environmental parameters, a large number of studies relied on third-party organizations, such as national/regional governmental services. Only a restricted number of these works gathered data with their own on-field IoT networks. According to Kamilaris et al. [15], more big data repositories should become publicly available. Furthermore, we noticed that there is a shortage of model validation in real-world scenarios. Appropriate validation is needed for studies to have an accurate and broad impact. This shortage may be explained by the fact that such validation requires an extensive observation time as well as the involvement of human resources with different expertise.

4. Conclusions

In this paper, we performed an analysis and classification of forecasting models for plant and crop disease over the past 10 years (2010–2020). Forty-six research works were identified and reviewed, with an examination of the approaches adopted as well as the pre-processing techniques and data used. Issues and concerns were discussed in Section 3.

As we have seen in this study, the prediction of plant and crop disease is a complex problem to be solved due to the interaction of several environmental and climatic factors. Over the last 10 years, the literature has presented considerable advancements in understanding these dynamic processes by adopting different scientific approaches. As we observed, the problem under study requires high-quality, labeled data. However, the lack of open data is slowing the advance of knowledge in this agricultural sub-domain.

Indeed, regarding the state of the art, only a limited number of contributions have been presented in the literature from 2010 to today. The majority of these have focused on a few pathogens and crops; furthermore, only a few of these have considered data from various heterogeneous sources to predict disease occurrence. These gaps are hindering progress in achieving development goals and creating products that are able to face real-world scenarios, and so more effort is required in data collection and in developing novel solutions to prevent and mitigate the impact of crop and plant disease on food production, especially for those crops which represent staple foods for millions of people who live in the least developed countries.