Zero-Shot Semantic Segmentation with No Supervision Leakage

Zero-shot semantic segmentation (ZS3), the task of segmenting classes for which no explicit training samples are available, poses a significant challenge. Although pre-trained vision–language models have driven notable progress, they suffer from "supervision leakage" on the unseen classes, because their large-scale pre-training data may already cover those classes.

  • Semantic Segmentation
  • ZS3

1. Introduction

Semantic segmentation underpins several high-level computer vision applications, such as autonomous driving, medical imaging, and other areas that involve identifying and classifying objects within an image. Deep supervised learning has been instrumental in driving advances in semantic segmentation [1][2][3][4]. However, fully supervised methods typically require extensive image databases with pixel-level annotations, and they are designed to handle a pre-defined set of classes, restricting their application in diverse, real-world scenarios.
Weakly supervised semantic segmentation (WSSS) approaches have been proposed to ease this burden. These methods capitalize on easily accessible annotations such as scribbles [5], bounding boxes [6], and image-level labels [7], and generate pseudo-ground-truths through network visualization techniques such as class activation maps (CAMs) [8][9]. However, they still rely on a certain degree of labeled data and require retraining the entire model whenever new classes are introduced.
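As an illustration of how such pseudo-ground-truths can be derived, the following sketch computes a class activation map by projecting a classifier's weights back onto its final convolutional feature map. The ResNet-18 backbone, the ImageNet class index, and the threshold mentioned in the comments are illustrative assumptions, not the exact pipelines of the cited WSSS works.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal CAM sketch: project the classifier weights of an ImageNet-pretrained
# ResNet back onto its last convolutional feature map. ResNet-18 and the
# thresholding value below are illustrative assumptions.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
backbone = torch.nn.Sequential(*list(model.children())[:-2])  # up to the last conv block
fc_weights = model.fc.weight                                   # (num_classes, C)

def class_activation_map(image, class_idx):
    """Return a CAM for one class, upsampled to the input resolution."""
    with torch.no_grad():
        feats = backbone(image)                                        # (1, C, h, w)
        cam = torch.einsum("c,bchw->bhw", fc_weights[class_idx], feats)
        cam = F.relu(cam)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalize to [0, 1]
        return F.interpolate(cam[None], size=image.shape[-2:],
                             mode="bilinear", align_corners=False)[0]

# A pseudo-label could then be obtained by thresholding the normalized CAM,
# e.g. (cam > 0.4) marks candidate foreground pixels for that class.
image = torch.randn(1, 3, 224, 224)               # placeholder input
cam = class_activation_map(image, class_idx=281)  # 281 = "tabby cat" in ImageNet
```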
Humans possess an intuitive ability to recognize and classify new classes based solely on descriptive details, a powerful skill that current machine learning systems have yet to emulate fully. This observation has catalyzed the exploration of zero-shot semantic segmentation (ZS3) [10][11][12][13].
ZS3 aims to exploit the semantic relationships between image pixels and their corresponding text descriptions, predicting unseen classes through language-guided semantic information about those classes rather than dense annotations. ZS3 techniques are broadly divided into generative and discriminative methods [14]. Generative ZS3 methods [15][16] usually train a semantic generator network that maps unseen-class language embeddings into the visual feature space and fine-tune the pre-trained classifier on the generated features. While these generative methods have demonstrated impressive performance, their effectiveness is constrained by the multi-stage training strategy they require. Discriminative methods directly learn a joint embedding space for visual and language features: SPNet [17] maps visual features to fixed semantic representations, bridging the gap between visual information and its corresponding semantic understanding, and JoEm [14] optimizes both the visual and semantic features within a joint embedding space.
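The discriminative idea can be summarized as follows: per-pixel visual features are projected into the word-embedding space and scored against fixed class embeddings, so unseen classes can be added at test time simply by supplying their embeddings. The sketch below follows this spirit; the feature dimensions, the 1×1-convolution projection, and the temperature-scaled cosine scoring are illustrative assumptions, not the exact architectures of SPNet or JoEm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjectionHead(nn.Module):
    """Project per-pixel visual features into the word-embedding space and
    classify pixels by similarity to frozen class embeddings (an illustrative
    sketch in the spirit of SPNet/JoEm, not their exact architectures)."""

    def __init__(self, visual_dim=512, semantic_dim=300, temperature=0.05):
        super().__init__()
        self.proj = nn.Conv2d(visual_dim, semantic_dim, kernel_size=1)
        self.temperature = temperature

    def forward(self, feats, class_embeddings):
        # feats: (B, visual_dim, H, W); class_embeddings: (K, semantic_dim)
        proj = F.normalize(self.proj(feats), dim=1)            # (B, D, H, W)
        cls = F.normalize(class_embeddings, dim=1)             # (K, D)
        logits = torch.einsum("bdhw,kd->bkhw", proj, cls)      # per-pixel cosine scores
        return logits / self.temperature

# At test time, unseen classes are handled simply by appending their word
# embeddings to `class_embeddings`; no visual training samples are needed.
head = SemanticProjectionHead()
feats = torch.randn(2, 512, 64, 64)      # placeholder backbone features
class_emb = torch.randn(20, 300)         # e.g. word2vec vectors of class names
pixel_logits = head(feats, class_emb)    # (2, 20, 64, 64)
```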

2. Zero-Shot Semantic Segmentation with No Supervision Leakage

Semantic segmentation: Semantic segmentation has made significant progress with the advent of deep learning. Chen et al. [1], Long et al. [2], Ronneberger et al. [3], and Zhao et al. [4] have leveraged deep learning architectures to make semantic segmentation more accurate and efficient. These fully supervised approaches operate under the assumption that pixel-level annotations are available for all training data. The DeepLab model has notably improved segmentation performance on renowned datasets such as PASCAL VOC2012 [18] and MS-COCO [19], employing techniques such as multiple scales [13][20] and dilated convolution [21][22]. Other algorithms, such as UNet [3] and SegNet [23], have also demonstrated commendable performance using a diverse set of strategies. Furthermore, the vision transformer (ViT) [24], the pioneer in deploying the transformer architecture for recognition tasks, has been transformative, and the Swin Transformer subsequently extended the transformer's capabilities to dense prediction tasks, achieving top-tier performance in the process. However, these cutting-edge methods rely heavily on costly pixel-level segmentation labels and presuppose that training data for all categories are available beforehand.

In the quest to circumvent these obstacles, weakly supervised semantic segmentation (WSSS) methods have emerged, leveraging more readily accessible annotations such as bounding boxes [6], scribbles [5], and image-level labels [7]. A cornerstone of prevailing WSSS pipelines is the generation of pseudo-labels, chiefly facilitated by network visualization techniques such as class activation maps (CAMs). Some works employ expanding strategies to stretch the initial CAM regions so that they encapsulate entire objects. Still, obtaining pseudo-labels that accurately delineate entire object regions with fine-grained boundaries remains a significant challenge [25][26].

Zero-shot semantic segmentation: ZS3 models are primarily categorized into two main types, discriminative and generative, and discerning the nuances between these categories provides a comprehensive picture of current strategies in the field. Among discriminative methods, Zhao et al. [10] pioneered a strategy for predicting unseen classes with a hierarchical approach, building on the data's inherent structure and using hierarchies to draw insights into unseen classes. SPNet [17] adopted a different approach, leveraging a semantic embedding space in which visual features are mapped onto fixed semantic representations, bridging the gap between visual information and its corresponding semantic understanding. Similarly, JoEm [14] aligns visual and semantic features within a shared embedding space, fostering a direct correlation between the two. Other studies explore the generative landscape of ZS3. ZS3Net [11], for example, employed a Generative Moment Matching Network (GMMN) to synthesize visual features, although its intricate three-stage training pipeline can potentially introduce bias into the system.
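To make the generative route concrete, the sketch below pairs a small generator, which maps a class word embedding plus noise to a synthetic visual feature, with a maximum mean discrepancy (MMD) objective so that synthetic features of seen classes match real ones. The layer sizes and the single Gaussian kernel are illustrative simplifications of the multi-kernel GMMN used in ZS3Net, not its exact formulation.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Map a class word embedding concatenated with noise to a synthetic
    visual feature vector (illustrative sizes, not ZS3Net's exact layers)."""

    def __init__(self, semantic_dim=300, noise_dim=100, visual_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(semantic_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, visual_dim),
        )

    def forward(self, class_emb, noise):
        return self.net(torch.cat([class_emb, noise], dim=-1))

def gaussian_mmd(x, y, sigma=1.0):
    """Maximum mean discrepancy with a single Gaussian kernel, a simplification
    of the multi-kernel GMMN objective."""
    def kernel(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Training-step sketch: make synthetic features of a *seen* class match real
# pixel features of that class; at test time the same generator is queried
# with *unseen* class embeddings to fine-tune the classifier.
gen = FeatureGenerator()
class_emb = torch.randn(64, 300)    # repeated word embedding of one seen class
real_feats = torch.randn(64, 512)   # pooled pixel features from annotated regions
fake_feats = gen(class_emb, torch.randn(64, 100))
loss = gaussian_mmd(fake_feats, real_feats)
loss.backward()
```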
To mitigate the bias introduced by such multi-stage pipelines, CSRL [13] leverages the relations between seen and unseen classes to preserve structural information during feature synthesis. Likewise, CaGNet [12] introduced a channel-wise attention mechanism in dilated convolutional layers to facilitate the extraction of visual features. Recently, some works have explored large-scale pre-trained models for zero-shot semantic segmentation [27][28][29]. However, the pre-training data usually contain both seen and unseen labels (e.g., CLIP's WebImageText dataset of roughly 400M pairs), which gives rise to a supervision leakage problem. Supervision leakage is a crucial concern in machine learning, referring to the unintended incorporation of information about unseen classes during the training phase. Given that CLIP models are trained on approximately 400 million image–text pairs, the text labels can be expected to cover a diverse range of seen and unseen classes. Consequently, such models might unintentionally learn unseen classes during pre-training, creating a form of supervision leakage that compromises the integrity of the zero-shot learning task, since the models are no longer genuinely learning from a "zero-shot" perspective. In response to this challenge, the research introduces a solution that navigates the complexities of zero-shot learning while eliminating the risk of supervision leakage, thereby enhancing the reliability and adaptability of semantic segmentation models. The approach offers a robust framework for accurately processing and categorizing visual data in a genuinely zero-shot setting, and is envisaged as a cornerstone for future research and advancements in semantic segmentation, paving the way for more authentic and reliable zero-shot learning models.

Visual-language learning: The domain of image–language pair learning has undergone exponential growth, shaped by contributions such as CLIP [30] and ALIGN [31]. Both models, pre-trained on hundreds of millions of image–language pairs, have marked substantial advances and pushed the boundaries of what is possible in image–language learning. Building on this, Yang et al. [32] put forth a unified contrastive learning method that successfully integrates image–language pairs and image-label data. In the ever-evolving domain of zero-shot learning, CLIP-based methods [27][30][33][34][35] have been recognized for their substantial contributions and their potential to provide effective solutions; these models capitalize on large-scale image–text pair datasets to deliver remarkable performance. However, a critical challenge that potentially undermines the legitimacy of their zero-shot learning capabilities is the risk of supervision leakage.
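For context, the sketch below shows the zero-shot mechanism that CLIP-based methods build on: class names are encoded as prompt embeddings and compared with image features by cosine similarity. Dense (per-pixel) variants replace the pooled image embedding with feature-map embeddings; the prompt template and class list here are illustrative assumptions. Because the 400M web-crawled captions likely already mention many of the "unseen" class names, this mechanism is exactly where supervision leakage can enter.

```python
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

# Minimal sketch of the CLIP zero-shot mechanism that CLIP-based ZS3 methods
# build on: class names become prompt embeddings and are scored against image
# features. The prompt template and class list are illustrative assumptions.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # `preprocess` turns a PIL image into a tensor

class_names = ["person", "car", "dog", "sofa"]   # may include nominally "unseen" classes
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(prompts)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def zero_shot_scores(image_tensor):
    """Cosine scores between one preprocessed image and the class prompts."""
    with torch.no_grad():
        img_emb = model.encode_image(image_tensor.unsqueeze(0).to(device))
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        return (img_emb @ text_emb.T).squeeze(0)   # (num_classes,)

# Because CLIP's web-crawled captions likely already describe classes such as
# "dog" or "sofa", scoring them this way is not leakage-free zero-shot learning.
```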