Weakly Supervised and Unsupervised Methods in Plant Segmentation

Plant segmentation is a challenging computer vision task due to the complexity of plant images. In many applications, we need to distinguish plant parts rather than the whole plant. The major complication of multi-part segmentation is the absence of well-annotated datasets: annotating datasets manually at the object-part level is very time-consuming and expensive.

Keywords: image instance segmentation; weakly supervised segmentation; multi-part segmentation

1. Introduction

Computer vision tasks, such as object detection and segmentation, require large-scale datasets to train neural network models [1]. An object detection task involves identifying object boundaries in an image, while a segmentation task requires classifying each pixel. These tasks pose several challenges for researchers: the first is image collection; the second concerns the preparation of high-quality annotations for the dataset. Obtaining precise annotations is a time-consuming and costly process, especially for large-scale datasets [2].
Computer vision tasks in the agriculture domain are even more challenging [3][4]. Plants are highly diverse and variable in appearance. For many practical problems, we have to solve an even more difficult task: distinguishing plant parts rather than the whole plant. Part-level masks identify different parts of the plant, such as leaves, stems, and fruits. This information can be used to quantify plant traits such as leaf area, stem diameter, and fruit size, and it can also help identify diseases that affect specific parts of the plant.

The use of computer vision systems in agriculture can help automate many tasks, such as crop monitoring, weed detection, and yield estimation, and accurate plant and part segmentation masks are essential for such automated systems. A plant part segmentation model can serve as a component in a larger pipeline for precision agriculture: for example, it can identify and segment leaves, stems, and fruits in images captured by drones or other imaging devices. This information can then be used to analyze plant health and growth, optimize irrigation and fertilization, and detect diseases or pests early on. The model can also reduce manual labor and increase efficiency in agricultural operations; by automating plant part segmentation, farmers save time and resources that would otherwise be spent on manual inspection and analysis.

2. Weakly Supervised Methods

The approaches below generate class activation maps (CAMs) by constructing graphs. In A2GNN [5], images are transformed into weighted graphs, where each node represents a super-pixel. To provide additional supervision from bounding box information, the authors introduce a multi-point (MP) loss designed specifically for A2GNN. In this work, image-level labels are used to generate the foreground via CAM inference, while bounding box labels are used to generate the background. In [6], a network is designed to produce CAMs and online accumulated class attention maps (OA-CAMs). In the OA-CAM approach, the different parts of the target object indicated by the attention map are combined to compensate for poor CAM quality. However, most of the attention is focused on enlarging salient regions around the target object, while objects outside the salient region do not receive enough attention. To activate such objects, a graph-based global reasoning unit is integrated into the classification branch of the network. Furthermore, to enhance the quality of pseudo-labels, a potential object mining (POM) module and a non-salient region masking (NSRM) module are employed. These modules combine semantic information about the target object and can generate pseudo-labels for complex scenes.
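
To make the shared starting point concrete, the sketch below shows how a basic CAM is typically extracted from an image-level classifier before any of the graph-based refinements above are applied. The ResNet-50 backbone, the `compute_cam` helper, and the normalization steps are illustrative assumptions rather than the exact pipelines of the cited works.

```python
# A minimal sketch of class activation map (CAM) extraction from an
# image-level classifier with global average pooling. Backbone choice and
# preprocessing are assumptions, not the cited methods' exact setups.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2").eval()  # weights argument needs a recent torchvision

def compute_cam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: (1, 3, H, W) normalized tensor; returns an (H, W) CAM in [0, 1]."""
    # Run the backbone up to the last convolutional block.
    x = model.conv1(image)
    x = model.bn1(x); x = model.relu(x); x = model.maxpool(x)
    x = model.layer1(x); x = model.layer2(x)
    x = model.layer3(x); x = model.layer4(x)           # (1, 2048, h, w) feature map
    # Weight the feature channels with the classifier weights of the target class.
    w = model.fc.weight[class_idx]                      # (2048,)
    cam = torch.einsum("c,bchw->bhw", w, x)             # weighted sum over channels
    cam = F.relu(cam)
    cam = F.interpolate(cam[None], size=image.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    cam -= cam.min()
    cam /= cam.max().clamp(min=1e-8)                    # normalize to [0, 1]
    return cam
```
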
Self-supervised equivariant attention mechanism (SEAM) [7], embedded discriminative attention mechanism (EDAM) [8], and the iterative self-improved model (ISIM) [9] are methods based on self-improvement. SEAM uses a Siamese network that takes the original and an augmented image as input and produces a CAM at the output. Each Siamese branch includes a pixel correlation module (PCM) that refines the CAM by incorporating low-level features. The CAMs and PCM activation maps from the two branches are regularized to be consistent. EDAM adds a discriminative activation (DA) layer after the backbone, as well as a collaborative multi-attention (CMA) module. The DA layer predicts a class-specific mask for each category; each mask is then multiplied with the feature map. The CMA module, located after the DA layer, applies a self-attention mechanism to explore the activation maps of each category and extract common category-specific information from the images in the batch. Together, these modules improve the network’s ability to discriminate between classes and attend to important features in the input. In the ISIM model, an input image and its image-level label are passed through an encoder network to extract a CAM; pseudo-segmentation labels are then generated with the dense conditional random field (dCRF) algorithm, which refines the CAM. The model is retrained using these pseudo-segmentation labels as ground truth. A pixel-level loss function activates less discriminative areas of the CAM, and the process is iterated with a CAM threshold tuned to optimize the final result.
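
The consistency idea behind SEAM can be summarized in a few lines: the CAM of a transformed image should agree with the transformed CAM of the original image. The sketch below assumes a hypothetical `model` that maps images to per-class activation maps; it is an illustration of the equivariance constraint, not the authors' implementation.

```python
# A minimal sketch of the equivariance regularization used in SEAM-style
# methods. `model` is a hypothetical CAM head returning (B, C, H, W) maps.
import torch
import torch.nn.functional as F

def equivariance_loss(model, images: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """images: (B, 3, H, W). Returns a scalar consistency loss."""
    # Branch 1: CAM of the original image, then downscaled.
    cam_full = model(images)                                    # (B, C, H, W)
    cam_full_scaled = F.interpolate(cam_full, scale_factor=scale,
                                    mode="bilinear", align_corners=False)
    # Branch 2: CAM of the downscaled image.
    images_scaled = F.interpolate(images, scale_factor=scale,
                                  mode="bilinear", align_corners=False)
    cam_scaled = model(images_scaled)
    # Regularize the two branches to be consistent (L1 distance).
    return (cam_full_scaled - cam_scaled).abs().mean()
```
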
Another approach is to divide images into patches. In [10], the authors propose a complementary patch network (CPN), formed by a triplet network with three branches. In the CPN, the original image is split into a pair of images with complementary hidden parts, and the CAM is defined as the sum of the CAMs of the pair. To refine the CAM, the proposed pixel-region correlation module (PRCM) is used; it finds semantic relations between regions or pixels and exploits this information with the help of the PCM module proposed in the SEAM work [7]. In the PPL [11] method, the image is split into patches, and each patch is fed to the subsequent convolutional layers separately. The neural network therefore has access only to local features, which pushes it to pay more attention to them. Patch learning proceeds from the low-level layers of the network to the high-level layers, which allows focusing on both low-level and high-level discriminative regions.
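
A minimal sketch of the basic patch-splitting operation that both CPN and PPL build on is given below; the patch size and flattening scheme are illustrative choices.

```python
# A minimal sketch of splitting an image batch into non-overlapping patches,
# the basic operation behind complementary-patch and progressive-patch learning.
import torch

def split_into_patches(images: torch.Tensor, patch: int = 56) -> torch.Tensor:
    """images: (B, C, H, W) with H and W divisible by `patch`.
    Returns (B * num_patches, C, patch, patch) so each patch is processed separately."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, patch, patch)
    return patches
```
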
Several approaches to weakly supervised semantic segmentation (WSSS) utilize bounding boxes as annotations. In [12], foreground and background regions are extracted from the bounding boxes, and segmentation labels are obtained from the CAM of the classification network using background-aware pooling (BAP); the CAM is computed for each bounding box. Finally, a CNN is trained for semantic segmentation with a noise-aware loss (NAL) to reduce the influence of noisy labels. In [13], foreground and background objects are treated as positive and negative instances, respectively, and a multiple instance learning (MIL) loss is applied to the bounding boxes. Since a bounding box usually contains multiple foreground objects, which complicates classification, a labeling-balance loss is used to overcome this drawback. A recent and most promising work describes the Segment Anything Model (SAM) [14] developed by Facebook. Input images are fed to an image encoder based on a pretrained vision transformer, which produces image embeddings; different kinds of prompts (points, boxes, or text) are then used to map the image embeddings into a mask. When the prompt is ambiguous, SAM produces multiple masks with different confidence scores.
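
For SAM specifically, prompting with a bounding box can look roughly like the sketch below, assuming the publicly released segment-anything package and a locally downloaded checkpoint; the checkpoint file name is a placeholder.

```python
# A minimal sketch of prompting SAM with a bounding box, assuming the
# `segment_anything` package released by the SAM authors is installed.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # checkpoint path is a placeholder
predictor = SamPredictor(sam)

def masks_from_box(image_rgb: np.ndarray, box_xyxy: np.ndarray):
    """image_rgb: (H, W, 3) uint8 image; box_xyxy: (4,) pixel coordinates.
    Returns candidate masks with confidence scores; several masks are returned
    when the prompt is ambiguous."""
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=True)
    return masks, scores
```
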
Other papers propose different approaches to solving the WSSS task. The ACFN [15] model is based on atrous (dilated) convolution and includes two modules: a cascade module and a pyramid module. The cascade module is composed of three atrous convolutional layers inserted in the middle of the backbone network. The pyramid module is composed of four parallel atrous convolutional layers with different atrous rates, allowing it to learn context information at different scales; after the pyramid module, the information from the different scales is fused. The SLAM [16] framework contains two training stages: in the first, a semantic encoder is trained to learn the features of each category; in the second, the segmentation network is trained using the features learned by the semantic encoder. AuxSegNet [17] is based on a cross-talk module consisting of three task-specific branches after the backbone. Since each branch is responsible for a specific type of learning (classification, saliency detection, semantic segmentation), the cross-task affinity learning module learns task-specific affinities and features, which are used to enlarge the feature map produced by the CAM for the saliency detection and semantic segmentation tasks. The two task-specific affinity maps are then combined into a global cross-task affinity map, which is used to refine both saliency and segmentation predictions. In the CODNet [18] model, a pair of images is used as input, and common semantic features are extracted; for each location in the target image, features from a similar region in the reference image are extracted and concatenated. In [19], the authors propose to erase misclassified regions of the CAM and then enlarge it properly. The contextual information captured by the semantic segmentation network guides the accurate erasure of misclassified areas in the CAM, and hierarchical deep seeded region growing (H-DSRG) then grows the semantic regions while taking the spatial distance between regions into account. The HSPP [20] model consists of parallel branches of global average pooling and max pooling at different scales, whose outputs are averaged; in addition, a visual word encoder (VWE) module encodes local visual words and improves the CAM. The TransCAM authors use the Conformer network as a backbone, which consists of a transformer branch and a CNN branch: the CNN branch generates the CAM, the transformer branch generates the attention map, and combining the two significantly improves the quality of the resulting CAM. The solution demonstrated in [21] is based on an anti-adversarial method called AdvCAM, which manipulates the attention map of an image to improve the classification inference. Whereas classic adversarial attacks use pixel-level perturbations to change the network output, AdvCAM involves more regions in the attention map and thereby improves the CAM result.
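
As one concrete illustration from this group, a pyramid of parallel atrous convolutions in the spirit of the ACFN pyramid module might be sketched as follows; the channel sizes and dilation rates are illustrative and not the values reported in the paper.

```python
# A minimal sketch of a pyramid of parallel atrous (dilated) convolutions with
# different rates, fused by a 1x1 convolution. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class AtrousPyramid(nn.Module):
    def __init__(self, in_ch: int = 512, out_ch: int = 256, rates=(1, 6, 12, 18)):
        super().__init__()
        # One branch per atrous rate; padding equals the rate to keep spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees a different receptive field; fuse them along channels.
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```
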

3. Unsupervised Methods

In the LOST model [22], features are obtained from a vision transformer: the image is divided into patches and fed into the DINO model [23], which uses the vision transformer mechanism. Similarities among patches are computed, and by selecting the patch with the fewest similarities (the seed), object parts are localized; seed expansion then adds correlated patches to the initial seed. However, the authors of [24] argue that the attention map provided by LOST is noisy and propose a method called TokenCut to eliminate this issue. TokenCut is based on a graph whose edges represent similarities between nodes. Segmentation of foreground and background objects is performed with the normalized cut (Ncut) approach, which relies on eigendecomposition; the foreground object is selected under the assumption that its eigenvector values are smaller than those of the background. Another graph-based approach, proposed in [25], also utilizes eigendecomposition. First, a weighted graph over image patches is constructed, where the edge weights express the affinity of patch pairs; this amounts to constructing a semantic affinity matrix for the image. The Laplacian eigenvectors of this matrix are calculated and can be used to produce a segmentation mask or a bounding box. The paper [26], like the LOST and TokenCut works, addresses the object detection task; the proposed network consists of a foreground model and a background model. In the foreground model, a feature map generator produces a feature map and a scalar attention map, which are used to predict object scales and positions; the background model is an autoencoder that learns to reconstruct the image background. The CCAM paper [27] proposes a model that produces cues which other models can use to improve their results. In CCAM, images are fed to an autoencoder, and the extracted features are used to produce a class-agnostic activation map; contrastive learning is then applied to distinguish foreground from background. CCAM predicts only one activation map, indicating the foreground and background regions of an image. When the background or foreground has complex colors or textures, rank weighting is designed to reduce the influence of dissimilarities. CCAM can be used to improve CAMs or object localization.
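
A compressed sketch of the graph-spectral idea shared by TokenCut and the deep spectral method is given below: build a patch affinity matrix from transformer features, take Laplacian eigenvectors, and bipartition on the second one. Feature extraction is abstracted away, and the similarity threshold and foreground-selection rule are illustrative simplifications.

```python
# A minimal sketch of spectral foreground discovery over patch features.
# `features` would come from a self-supervised transformer such as DINO.
import numpy as np

def spectral_foreground(features: np.ndarray, tau: float = 0.2) -> np.ndarray:
    """features: (N, D) L2-normalized patch features. Returns an (N,) boolean foreground mask."""
    # Affinity graph: cosine similarity, thresholded to keep confident edges.
    sim = features @ features.T
    w = (sim > tau).astype(np.float64) + 1e-6       # small epsilon keeps the graph connected
    deg = w.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    lap = np.eye(len(w)) - d_inv_sqrt @ w @ d_inv_sqrt
    _, eigvecs = np.linalg.eigh(lap)                # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                         # second-smallest eigenvector
    # Bipartition around the mean; take the smaller side as the foreground object.
    part = fiedler > fiedler.mean()
    return part if part.sum() < (~part).sum() else ~part
```
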

4. Few-Shot Methods

Standard segmentation models are unable to handle classes that were unseen during training. To eliminate this issue, few-shot learning was introduced; it can be used to build class-agnostic segmentation models that adapt to new classes. In few-shot learning, support sets are used to assist the model in learning and generalizing to new tasks or data. The support set contains an extremely small number of labeled examples for each specific task or class of interest. Besides the support images, the term query image refers to the image for which the model needs to generate a segmentation mask.
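
For clarity, a few-shot segmentation episode is often organized along the lines of the sketch below; the field names are illustrative and not tied to any specific paper.

```python
# A minimal sketch of a K-shot segmentation episode: K labeled support
# examples of the novel class and one query image to be segmented.
from dataclasses import dataclass
import torch

@dataclass
class Episode:
    support_images: torch.Tensor   # (K, 3, H, W): labeled examples of the novel class
    support_masks: torch.Tensor    # (K, H, W): binary masks for the novel class
    query_image: torch.Tensor      # (3, H, W): image the model must segment

    @property
    def num_shots(self) -> int:
        # "K-shot" refers to the number of labeled support examples.
        return self.support_images.shape[0]
```
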
One of the promising frameworks [28] utilizes singular value decomposition (SVD). Since the amount of support data is very small, the model can overfit; however, unfreezing the backbone and fine-tuning only a small fraction of its parameters helps to avoid this. To define the tunable parameters, all pretrained weight matrices are decomposed by SVD: only the singular value matrix is fine-tuned, while the other matrices remain frozen. This approach is called singular value fine-tuning (SVF).
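
A minimal sketch of the SVF idea applied to a single linear weight is given below: decompose the pretrained weight with SVD, freeze the singular vectors, and train only the singular values. Applying this layer by layer across a backbone, as the paper does, is omitted for brevity.

```python
# A minimal sketch of singular value fine-tuning (SVF) for one weight matrix.
import torch
import torch.nn as nn

class SVFLinear(nn.Module):
    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        # Decompose the pretrained weight W = U diag(S) V^T.
        u, s, vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("u", u)      # frozen singular vectors
        self.register_buffer("vh", vh)    # frozen singular vectors
        self.s = nn.Parameter(s)          # only the singular values are trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the weight from the (partly trainable) factors on the fly.
        weight = self.u @ torch.diag(self.s) @ self.vh
        return x @ weight.t()
```
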
Another idea is demonstrated in the Multi-Similarity and Attention Network (MSANet) [29]. A pretrained backbone extracts features from both the query and the support images. These features are fed to attention and multi-similarity modules, where attention maps are produced and visual affinities between the two images are found; both are used to obtain the final mask prediction.
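
One generic similarity cue used by this family of methods, masked average pooling of support features followed by a cosine-similarity map over query locations, can be sketched as below; this is an illustration rather than the exact MSANet module.

```python
# A minimal sketch of a query-to-support similarity cue for few-shot segmentation.
import torch
import torch.nn.functional as F

def similarity_map(query_feat: torch.Tensor,
                   support_feat: torch.Tensor,
                   support_mask: torch.Tensor) -> torch.Tensor:
    """query_feat, support_feat: (B, C, h, w); support_mask: (B, 1, h, w) in {0, 1}.
    Returns a (B, h, w) similarity map over query locations."""
    # Masked average pooling: the support prototype for the target class.
    prototype = (support_feat * support_mask).sum(dim=(2, 3)) \
                / support_mask.sum(dim=(2, 3)).clamp(min=1e-8)     # (B, C)
    # Cosine similarity between every query location and the prototype.
    q = F.normalize(query_feat, dim=1)
    p = F.normalize(prototype, dim=1)[:, :, None, None]
    return (q * p).sum(dim=1)
```
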
In [30], both the foreground and the background information of the support image are fully exploited. For this purpose, the authors propose dense pixel-wise cross-query-and-support attention-weighted mask aggregation (DCAMA). Similarities and dissimilarities between the query and support images are given different weights: semantically similar pixels receive more weight than dissimilar ones.
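
The core aggregation step can be sketched as a single attention operation in which query pixels attend to support pixels and collect their mask values; the sketch below omits the multi-scale and multi-head details of DCAMA.

```python
# A minimal sketch of attention-weighted mask aggregation: each query pixel
# attends to all support pixels, and its predicted mask value is the
# attention-weighted sum of the support mask values.
import torch

def aggregate_mask(query_feat: torch.Tensor,
                   support_feat: torch.Tensor,
                   support_mask: torch.Tensor) -> torch.Tensor:
    """query_feat, support_feat: (B, C, h, w); support_mask: (B, 1, h, w).
    Returns a (B, h, w) soft query mask."""
    b, c, h, w = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)           # (B, hw, C) queries
    k = support_feat.flatten(2).transpose(1, 2)         # (B, hw, C) keys
    v = support_mask.flatten(2).transpose(1, 2)         # (B, hw, 1) values (mask)
    attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # (B, hw, hw)
    mask = (attn @ v).transpose(1, 2).reshape(b, h, w)
    return mask
```
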
In [31], the authors address the issue that novel classes obtain lower activation than known ones. They propose a hierarchically decoupled matching network (HDMNet) based on an extended transformer architecture, in which an embedded correlation mechanism and correlation map distillation are used to extract more semantic information and mitigate overfitting.
The most recent high-performance approach [32] follows the generative pretrained transformer (GPT) paradigm from language modeling. The proposed segmentation GPT (SegGPT) framework can be applied to a wide spectrum of computer vision tasks, such as video object segmentation, semantic segmentation, panoptic segmentation, and few-shot segmentation. The key feature of this model is that it requires no additional fine-tuning yet still shows superior performance across the listed range of tasks.

References

  1. Sorscher, B.; Geirhos, R.; Shekhar, S.; Ganguli, S.; Morcos, A. Beyond neural scaling laws: Beating power law scaling via data pruning. Adv. Neural Inf. Process. Syst. 2022, 35, 19523–19536.
  2. Paton, N. Automating data preparation: Can we? should we? must we? In Proceedings of the 21st International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, Lisbon, Portugal, 26 March 2019.
  3. Lemikhova, L.; Nesteruk, S.; Somov, A. Transfer Learning for Few-Shot Plants Recognition: Antarctic Station Greenhouse Use-Case. In Proceedings of the 2022 IEEE 31st International Symposium on Industrial Electronics (ISIE), Anchorage, AK, USA, 1–3 June 2022; pp. 715–720.
  4. Nesteruk, S.; Shadrin, D.; Pukalchik, M.; Somov, A.; Zeidler, C.; Zabel, P.; Schubert, D. Image compression and plants classification using machine learning in controlled-environment agriculture: Antarctic station use case. IEEE Sens. J. 2021, 21, 17564–17572.
  5. Zhang, B.; Xiao, J.; Jiao, J.; Wei, Y.; Zhao, Y. Affinity Attention Graph Neural Network for Weakly Supervised Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8082–8096.
  6. Yao, Y.; Chen, T.; Xie, G.S.; Zhang, C.; Shen, F.; Wu, Q.; Tang, Z.; Zhang, J. Non-salient region object mining for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2623–2632.
  7. Wang, Y.; Zhang, J.; Kan, M.; Shan, S.; Chen, X. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12275–12284.
  8. Wu, T.; Huang, J.; Gao, G.; Wei, X.; Wei, X.; Luo, X.; Liu, C.H. Embedded discriminative attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
  9. Bircanoglu, C.; Arica, N. ISIM: Iterative Self-Improved Model for Weakly Supervised Segmentation. arXiv 2022, arXiv:2211.12455.
  10. Zhang, F.; Gu, C.; Zhang, C.; Dai, Y. Complementary patch for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7242–7251.
  11. Li, J.; Jie, Z.; Wang, X.; Zhou, Y.; Wei, X.; Ma, L. Weakly Supervised Semantic Segmentation via Progressive Patch Learning. IEEE Trans. Multimed. 2022, 25, 1686–1699.
  12. Oh, Y.; Kim, B.; Ham, B. Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Virtual, 19–25 June 2021.
  13. Ma, T.; Wang, Q.; Zhang, H.; Zuo, W. Delving Deeper Into Pixel Prior for Box-Supervised Semantic Segmentation. IEEE Trans. Image Process. 2022, 31, 1406–1417.
  14. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. arXiv 2023, arXiv:2304.02643.
  15. Xu, L.; Xue, H.; Bennamoun, M.; Boussaid, F.; Sohel, F. Atrous convolutional feature network for weakly supervised semantic segmentation. Neurocomputing 2021, 421, 115–126.
  16. Chen, J.; Zhao, X.; Liu, M.; Shen, L. SLAM: Semantic Learning based Activation Map for Weakly Supervised Semantic Segmentation. arXiv 2022, arXiv:2210.12417.
  17. Xu, L.; Ouyang, W.; Bennamoun, M.; Boussaid, F.; Sohel, F.; Xu, D. Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
  18. Wan, W.; Chen, J.; Yang, M.H.; Ma, H. Co-attention dictionary network for weakly-supervised semantic segmentation. Neurocomputing 2022, 486, 272–285.
  19. Chong, Y.; Chen, X.; Tao, Y.; Pan, S. Erase then grow: Generating correct class activation maps for weakly-supervised semantic segmentation. Neurocomputing 2021, 453, 97–108.
  20. Ru, L.; Du, B.; Wu, C. Learning Visual Words for Weakly-Supervised Semantic Segmentation. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Virtual, 19–27 August 2021.
  21. Lee, J.; Kim, E.; Yoon, S. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4071–4080.
  22. Siméoni, O.; Puy, G.; Vo, H.V.; Roburin, S.; Gidaris, S.; Bursuc, A.; Pérez, P.; Marlet, R.; Ponce, J. Localizing objects with self-supervised transformers and no labels. arXiv 2021, arXiv:2109.14279.
  23. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660.
  24. Wang, Y.; Shen, X.; Hu, S.X.; Yuan, Y.; Crowley, J.L.; Vaufreydaz, D. Self-supervised transformers for unsupervised object discovery using normalized cut. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14543–14553.
  25. Melas-Kyriazi, L.; Rupprecht, C.; Laina, I.; Vedaldi, A. Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8364–8375.
  26. Sauvalle, B.; de La Fortelle, A. Unsupervised Multi-object Segmentation Using Attention and Soft-argmax. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3267–3276.
  27. Xie, J.; Xiang, J.; Chen, J.; Hou, X.; Zhao, X.; Shen, L. C2AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation. arXiv 2022, arXiv:2203.13505.
  28. Sun, Y.; Chen, Q.; He, X.; Wang, J.; Feng, H.; Han, J.; Ding, E.; Cheng, J.; Li, Z.; Wang, J. Singular Value Fine-tuning: Few-shot Segmentation requires Few-parameters Fine-tuning. Adv. Neural Inf. Process. Syst. 2022, 35, 37484–37496.
  29. Iqbal, E.; Safarov, S.; Bang, S. MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation. arXiv 2022, arXiv:2206.09667.
  30. Shi, X.; Wei, D.; Zhang, Y.; Lu, D.; Ning, M.; Chen, J.; Ma, K.; Zheng, Y. Dense cross-query-and-support attention weighted mask aggregation for few-shot segmentation. In Computer Vision–ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 151–168.
  31. Peng, B.; Tian, Z.; Wu, X.; Wang, C.; Liu, S.; Su, J.; Jia, J. Hierarchical Dense Correlation Distillation for Few-Shot Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23641–23651.
  32. Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. SegGPT: Segmenting everything in context. arXiv 2023, arXiv:2304.03284.