The Segment Anything Model (SAM) is a versatile image segmentation model that enables zero-shot segmentation of objects in any image using prompts such as bounding boxes, points, and text. However, studies have shown that the SAM performs poorly in agricultural tasks such as crop disease segmentation and pest segmentation. To address this issue, the agricultural SAM adapter (ASA) is proposed, which incorporates agricultural domain expertise into the segmentation model through a simple but effective adapter technique.
1. Introduction
The development and use of foundation models in artificial intelligence have grown rapidly in recent years. These models are trained on large datasets to generalize across various tasks and domains. Large Language Models (LLMs)
[1] have become widely adopted, and ever-larger models are a recent trend, for example, GPT-4
[2], developed by OpenAI (2023). The Segment Anything Model (SAM)
[3] is a powerful and flexible visual segmentation model that has gained considerable attention for its ability to generate accurate and detailed segmentation masks based on user prompts. The model is trained on a large dataset of 11 million images and over 1 billion masks. Its promptable design and training enable it to adapt to new image distributions and tasks in a zero-shot manner, that is, without task-specific training samples. The SAM has shown excellent segmentation capabilities in various scenarios, and it is bringing innovations to the field of image segmentation and computer vision.
In this paper, the SAM is introduced into agricultural segmentation tasks. In this context, it not only segments agricultural images in a zero-shot manner but also improves the accuracy and efficiency of crop disease identification and pest detection. The SAM can therefore be a solution to agricultural segmentation problems caused by the specialized nature of the domain and the scarcity of available datasets, and it can provide scientific guidance and support for the agricultural industry. Crop disease segmentation is an important task in agriculture: by detecting and locating diseases, it allows necessary measures such as spraying pesticides and trimming leaves to be taken in time, which prevents diseases from spreading and worsening. This not only reduces pesticide usage and protects the ecosystem
[4] but also improves crop yield and quality. Similarly, effective image segmentation of agricultural pests can also reduce their harm to crops
[5]. For example, the level of damage and the effectiveness of pest control can be evaluated by segmenting out different types and numbers of pests to formulate a reasonable control program.
However, like other foundation models, the SAM has limitations, especially given the very wide range of computer vision applications. Since the training data cannot cover all possible image types and the working scenarios are constantly changing, the SAM does not perform well in some agricultural image segmentation tasks and difficult scenarios. To demonstrate this, the SAM was tested on crop disease and pest segmentation tasks, and it was found that the SAM does not “segment anything” well. This is because the training dataset of the SAM mainly consists of natural images, which often have clear edge information and differentiation, whereas agricultural images are taken in complex field environments and have the following characteristics: (1) low contrast between the target and the background, which makes segmentation difficult; (2) a complex and varied background, with foliage, soil, and rocks; (3) uneven illumination and shadows, which reduce image quality and clarity; and (4) targets, such as pests and diseases, with different morphologies. These characteristics make it difficult for the SAM to adapt to the needs of agricultural image segmentation, resulting in inaccurate or incomplete segmentation results. Therefore, the theoretical and practical problem this paper focuses on is as follows: How can these challenges be overcome so that the segmentation ability learned by the SAM on a massive dataset can benefit agricultural segmentation tasks?
Adaptation
[6,7,8] is an effective tool for fine-tuning large foundation models for downstream tasks, not only in NLP but also in computer vision. It requires only a small number of parameters to be learned (typically less than 5% of the total parameters), which allows efficient learning and faster updates while most of the parameters are kept frozen. It has also been shown that adaptation methods work better than full fine-tuning because they avoid catastrophic forgetting and generalize better to out-of-domain scenarios, especially in low-data regimes. It is believed that adaptation is the most suitable technique for transferring the SAM to the agriculture domain. Therefore, a simple but effective adapter specifically designed for agricultural image segmentation is proposed. This adapter is a lightweight model that leverages internal and external knowledge to adapt with relatively little data and injects task-specific guidance information from the samples of that task. By using visual prompts to transfer information to the network, it can efficiently adapt the frozen large-scale base model to various agricultural image segmentation tasks with minimal additional trainable parameters. This adapter is integrated into the SAM to obtain an adapted model called the agricultural SAM adapter (ASA). To assess the performance of the ASA, a dataset containing 5464 images of agricultural pests and a dataset containing 1100 images of coffee-leaf diseases are collected. The pest dataset consists of 10 common pest types, whereas the coffee-leaf-disease dataset includes individual leaf images as well as localized images of coffee trees. The extensive experimental results on 12 two-dimensional agricultural image segmentation tasks demonstrate that this approach significantly enhances the performance of the SAM in agricultural image segmentation tasks and compensates for its limitations in this context.
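To make the parameter budget concrete, the following minimal sketch counts the trainable parameters of a generic bottleneck adapter (a down-projection followed by an up-projection) attached to a frozen backbone. It is an illustration only: the adapter shape, the function names, and all sizes are assumptions chosen for the example and are not taken from the ASA design or from the paper's experiments.

```python
# Illustrative sketch only: a generic bottleneck adapter, not the ASA architecture.

def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def adapter_params(hidden_dim, bottleneck_dim):
    """Down-projection followed by up-projection (assumed adapter shape)."""
    down = linear_params(hidden_dim, bottleneck_dim)
    up = linear_params(bottleneck_dim, hidden_dim)
    return down + up


def trainable_fraction(frozen_params, hidden_dim, bottleneck_dim, num_blocks):
    """Share of all parameters that is trained when only the adapters are updated."""
    trainable = num_blocks * adapter_params(hidden_dim, bottleneck_dim)
    return trainable / (frozen_params + trainable)


# Hypothetical sizes chosen for illustration, not reported in the paper.
FROZEN_PARAMS = 90_000_000  # assumed size of the frozen backbone
HIDDEN_DIM = 768            # assumed token width
BOTTLENECK_DIM = 32         # assumed adapter width
NUM_BLOCKS = 12             # assumed number of adapted blocks

fraction = trainable_fraction(FROZEN_PARAMS, HIDDEN_DIM, BOTTLENECK_DIM, NUM_BLOCKS)
print(f"trainable share of all parameters: {fraction:.2%}")
```

Under these assumed sizes, the adapters account for well under 5% of the total parameters, which is the regime the paragraph above describes. Widening the bottleneck or adapting more blocks raises the share proportionally.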
2. Agricultural Image Segmentation
Adapters. The concept of adapters was initially introduced in the field of Natural Language Processing (NLP)
[9] to fine-tune a large pre-trained model using a compact and scalable model for each specific downstream task. Stickland, Cooper, and Murray
[10] explored multi-task approaches that shared a single BERT model with a small number of additional task-specific parameters using new adaptation modules, PALs, or “Projected Attention Layers” and obtained state-of-the-art results on the Recognizing Textual Entailment dataset. In the realm of computer vision, a method proposed by Facebook AI Research
[11] achieved competitive results in object detection tasks with minimal adaptations for fine-tuning the ViT (Vision Transformer) architecture
[12]. More recently, Chen et al.
[13] designed a simple but powerful adapter for the ViT architecture for dense prediction tasks, and it demonstrated excellent performance in several downstream tasks, including object detection, instance segmentation, and semantic segmentation. Liu et al.
[14] were inspired by the widely used pre-training and prompt-tuning protocols in NLP and proposed an EVP (Explicit Visual Prompting) technique that could efficiently combine explicit visual prompting with an adapter. This technique achieved state-of-the-art performance in low-level structure segmentation tasks. Additionally, Chen et al. presented the SAM adapter
[15], achieving state-of-the-art results in camouflaged object detection and shadow detection tasks by incorporating domain-specific information and visual prompts into segmented networks.
In this paper, the adapter approach is applied to the SAM to solve agricultural image segmentation tasks.
Segmentation. Image segmentation
[16,17,18] is an important task in computer vision that aims to assign each pixel in an image to a specific semantic category. Image segmentation has a wide range of applications in fields such as agriculture, medicine, and remote sensing. However, traditional image segmentation methods usually rely on a large amount of annotated data for training models, and such data are difficult or costly to obtain in some specific domains. Therefore, exploring how to achieve image segmentation with little or no annotated data is both a challenging and valuable problem.
Zero-Shot Segmentation. Zero-shot segmentation is an approach to image segmentation that does not require labeled training data for the target classes. It can segment objects of any class in an image based on various types of prompts, including points, bounding boxes, and text. The research field of zero-shot image segmentation has gained significant attention in recent years. For instance, Lüddecke and Ecker
[19] proposed a backbone network based on CLIP and a Transformer-based decoder. This framework enables the generation of image segments by leveraging arbitrary text or image prompts, and its versatility in handling diverse segmentation tasks and generalizing to new queries has been demonstrated. Roy et al.
[20] introduced SAM.MD, a zero-shot medical image segmentation method based on the SAM. The method effectively segments abdominal CT organs by utilizing point or bounding box prompts, and it has exhibited excellent performance across multiple medical datasets. Furthermore, research endeavors have explored the utilization of zero-shot image segmentation for more intricate tasks, including instance segmentation
[21,22,23] and video segmentation
[24]. Despite these advancements, zero-shot image segmentation methods remain underexplored in the field of agricultural image segmentation.
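The point and bounding-box prompts mentioned above can be pictured as plain data. The sketch below is a hedged illustration: the class and function names are hypothetical and are not part of the SAM's API, and deriving a box prompt from an annotated mask is an assumption about how prompts might be produced for evaluation, not a procedure described in the cited works. Text prompts are omitted.

```python
# Illustrative sketch: prompt types as plain data, with hypothetical names.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PointPrompt:
    x: int
    y: int
    is_foreground: bool = True  # a point may mark the target or the background


@dataclass
class BoxPrompt:
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def box_from_mask(mask: List[List[int]]) -> Optional[BoxPrompt]:
    """Tightest box around the nonzero pixels of a binary mask, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return BoxPrompt(min(xs), min(ys), max(xs), max(ys))


def center_point(box: BoxPrompt) -> PointPrompt:
    """Foreground point at the box center, in integer pixel coordinates."""
    return PointPrompt((box.x_min + box.x_max) // 2, (box.y_min + box.y_max) // 2)


# Toy 4x5 mask standing in for an annotated pest or disease region.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = box_from_mask(toy_mask)
point = center_point(box)
print(box, point)
```

A box prompt constrains the region more tightly than a single point, which is one reason both prompt types are commonly evaluated side by side. Note that the box center is not guaranteed to lie on the target for non-convex regions.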
Agricultural Image Segmentation. Agricultural image segmentation is a crucial task with significant applications and practical relevance, as it aids agricultural producers in identifying and monitoring crop conditions and pests. This, in turn, improves the efficiency and quality of agricultural production. In recent years, deep learning methods, for example, the FCN (Fully Convolutional Network)
[25], U-Net
[26], and Mask R-CNN (Region-Based Convolutional Neural Networks)
[23], have been widely used in agricultural image segmentation. These methods excel at automatically learning effective feature representations from data and are suitable for high-resolution and multi-class image segmentation tasks. Notably, deep learning techniques have achieved substantial performance advancements in crop disease and pest segmentation tasks, making them a mainstream approach. For instance, Ma et al.
[27] used a DCNN (Deep Convolutional Neural Network) to develop a method for the automatic diagnosis of cucumber diseases. This method provided a scientific basis for farmers to apply pesticides appropriately. Another notable work by Esgario et al.
[28] proposed a neural network-based multitasking system for the classification and severity estimation of coffee-leaf diseases. The system contains a suitable tool to assist experts and farmers in identifying and quantifying biotic stresses in coffee plantations. However, traditional deep learning approaches require extensive annotated data for training, which are scarce in the domain of agricultural image segmentation. Therefore, exploring how to extend zero-shot image segmentation methods to agricultural image segmentation presents a challenging and valuable research problem.
To address this research gap, the agricultural SAM adapter (ASA) is proposed, which is a fine-tuned version of the SAM customized for agricultural image segmentation, specifically designed for zero-shot segmentation with appropriate prompts. Experiments on 12 agricultural image segmentation tasks are conducted, and the results are shown in
Figure 1, which demonstrates that the proposed approach significantly outperforms the original SAM in both segmentation performance and generalization capability.
Figure 1. Visualized examples of the pre-trained SAM and ASA segmentation results. The ASA significantly improves segmentation performance in agricultural image segmentation tasks.