Medical image segmentation primarily relies on hybrid models that combine a Convolutional Neural Network (CNN) with sequentially stacked Transformers, whose multi-head self-attention mechanisms provide comprehensive global context modelling. However, despite their success in semantic segmentation, this feature extraction process is inefficient and demands considerable computational resources, which limits the network's robustness. To address this issue, this research presents two innovative methods: PTransUNet (PT model) and C-PTransUNet (C-PT model). The C-PT module refines the Vision Transformer by replacing its sequential design with a parallel one. This boosts the feature extraction capability of Multi-Head Self-Attention via self-correlated feature attention and channel feature interaction, while also streamlining the Feed-Forward Network to lower computational demands.
1. Introduction
Medical image segmentation is a vital research area: its properties differ markedly from those of RGB image segmentation, and its importance to clinical applications makes it indispensable. The encoder–decoder structure based on the Convolutional Neural Network (CNN) was a pioneering development in this field [1][2]. Its deep layers offered large receptive fields and rich contextual information, making it adaptable to multiscale input images, and it provided an end-to-end training model that received significant attention at the time. This innovation gave rise to the foundational U-Net framework [3], built on a "U"-shaped network structure, and sparked a wave of research enthusiasm. The U-Net architecture is characterized by its simplicity, featuring a fully symmetric encoder–decoder with skip connections. Owing to its outstanding performance, it has dominated the field of medical image segmentation. However, CNNs with local receptive fields struggle to extract global features in tasks with long-range dependencies and therefore cannot fully capture global information, which prevents the CNN from realizing its full potential.
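To make the fully symmetric encoder–decoder with skip connections concrete, the sketch below shows a minimal two-level U-Net-style network in PyTorch; the channel widths and depth are illustrative assumptions, not the configuration of any specific model discussed here.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic building block of a U-Net stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Minimal two-level U-Net: symmetric encoder/decoder with skip connections."""
    def __init__(self, in_ch=1, num_classes=2, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)           # encoder stage 1
        self.enc2 = conv_block(base, base * 2)        # encoder stage 2
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)    # input = upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)   # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

logits = TinyUNet()(torch.randn(1, 1, 128, 128))  # -> (1, 2, 128, 128)
```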
Recently, network models based on the Transformer architecture [4] have been challenging the dominance of CNNs, primarily because their self-attention mechanism can model long-range contextual information. This addresses the limitations of CNNs and has made Transformers shine in the field of medical imaging. The idea of incorporating Transformer modules into the U-Net architecture has reignited a wave of research on Transformer-based approaches to medical image segmentation. On one hand, most researchers have explored how to embed serial Transformer modules into the U-Net structure, leading to a series of classic networks such as TransUNet [5], Swin-Unet [6], UNETR [7], and others [8][9][10][11][12]. Undeniably, serial Transformer network models have significantly improved the accuracy of medical image segmentation. Notably, TransUNet [5] was the first model to apply the Vision Transformer (ViT) [13] to medical image segmentation, combining the global contextual modelling capability of Transformers with the local feature extraction of CNNs, and it has provided highly effective solutions for the medical domain. It is evident, however, that this competitive advantage has been achieved through increased model complexity, which inevitably brings high computational and memory costs and can hinder the practical use of these models in clinical segmentation [14]. On the other hand, there has been relatively little research on parallel Transformer modules for medical image segmentation [15][16][17], because traditional parallel designs tend to increase network parameters and feature dimensions, which degrades both efficiency and accuracy. The emergence of parallel ViT [18], however, has opened a new direction for applying parallel Transformers to medical image segmentation. Under the same parameter budget, replacing the serial ViT [13] with a parallel ViT increases network width while reducing network depth; by reducing module depth, parallel ViT eases optimization and makes training less challenging than with serial ViT. Nevertheless, even with the same parameter count it still incurs high computational costs, and the reduced depth weakens its semantic representation and contextual awareness, limiting the applicability of parallel ViT in medical image segmentation. Therefore, an effective parallel structure is needed that can simultaneously improve the accuracy and efficiency of medical image segmentation and break through the limitations of parallel ViT.
2. Development of Vision Transformers
The Transformer, originally designed for Natural Language Processing, has been successfully carried over to Computer Vision as the ViT model, which competes favourably with conventional CNN approaches in tasks such as image classification, object detection, and semantic segmentation. Its success is attributable to its dynamic attention mechanism and long-range modelling capability, which demonstrate robust feature learning. ViT divides the image into multiple small patches, turns these patches into a sequence that serves as the input features, and processes the sequence with an N-layer Transformer to produce a thorough feature representation of the entire image. Through the self-attention mechanism, the Transformer captures long-distance dependencies among image features and enables higher-order spatial information exchange. It excels at global relational modelling, expanding the receptive field and acquiring rich contextual details, which effectively compensates for the limited global modelling capability of CNNs.
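As an illustration of this pipeline, the following minimal PyTorch sketch shows how an image can be split into patches, linearly embedded as a token sequence, and passed through a stack of standard Transformer encoder layers; the patch size, embedding width, and depth are arbitrary assumptions chosen for brevity rather than the settings of any model cited here.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style encoder: patchify -> linear embed -> N Transformer layers."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A strided convolution is an equivalent way to cut and embed non-overlapping patches
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/patch, W/patch)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        tokens = tokens + self.pos_embed            # add learnable position information
        return self.encoder(tokens)                 # (B, num_patches, dim), globally attended

feats = MiniViT()(torch.randn(2, 3, 224, 224))  # -> (2, 196, 256)
```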
Recently, a variety of novel models built on the ViT backbone have emerged; they can be categorized into sequential and parallel Transformer architectures. Sequential ViT models include DeiT [19], CeiT [20], Swin Transformer [21], T2T-ViT [22], PVT [23], DeepViT [24], and others. Because the Transformer module focuses on global contextual information, building relationships between pixels across the whole image, it cannot capture local visual features the way a standard CNN does through its inductive bias, which increases the difficulty of training ViT and slows its convergence. Touvron et al. [19] proposed the DeiT model, which learns the inductive bias of image data through knowledge distillation: a CNN teacher passes its knowledge to the Transformer student, enhancing feature extraction via the convolutional bias and accelerating model convergence. For sequence input, ViT partitions the input image into numerous patch blocks, and these fixed divisions lose the image's local features. The Swin Transformer [21] model adopts dynamic attention over neighbouring pixels, using sliding windows to model globally in the spatial dimension while performing self-attention within each patch block and attention computation across blocks; this dynamic generation of attention weights reduces the computational complexity of self-attention and improves local feature extraction. Every Transformer layer in ViT operates on features at the same resolution, which results in a high computational cost. Yuan et al. [22] proposed the T2T-ViT model, which uses a deep and narrow hierarchical Transformer architecture to enhance the features, but at a high computational cost. Wang et al. [23] proposed the Pyramid Vision Transformer (PVT), which has a progressively shrinking feature-pyramid hierarchy that yields multi-scale feature maps, together with a spatial-reduction attention (SRA) layer that lowers the cost of processing high-resolution feature maps. DeepViT [24] introduces Re-Attention, which re-performs the self-attention operation using features from multiple layers at low additional computational cost, alleviating the feature-saturation problem of deep ViTs and allowing the network to learn more complex representations.
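To make the windowed-attention idea concrete, the sketch below partitions a feature map into non-overlapping windows and applies self-attention only within each window, so the attention cost scales with the window size rather than the full image; the window size and feature dimensions are illustrative assumptions, and the shifted-window and cross-window steps of the actual Swin Transformer are omitted.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws*ws, C): group tokens into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, B, H, W):
    """Inverse of window_partition: (B * num_windows, ws*ws, C) -> (B, H, W, C)."""
    C = win.shape[-1]
    x = win.view(B, H // ws, W // ws, ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

class WindowAttention(nn.Module):
    """Self-attention restricted to local windows: cost is quadratic in ws*ws, not in H*W."""
    def __init__(self, dim=96, heads=3, ws=7):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, H, W, C), H and W divisible by ws
        B, H, W, _ = x.shape
        win = window_partition(x, self.ws)       # each window is an independent token sequence
        out, _ = self.attn(win, win, win)        # attention only among tokens of the same window
        return window_reverse(out, self.ws, B, H, W)

y = WindowAttention()(torch.randn(2, 56, 56, 96))  # -> (2, 56, 56, 96)
```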
Currently, research on serial Transformer structures is thriving, whereas research on parallel Transformer structures remains limited [16][17][25]. Parallel ViT [18], proposed by the Meta team, is the first such improvement: serially connected Transformer blocks are converted to parallel processing, decreasing the model's depth while increasing its width. Because the residual contribution of each block becomes smaller as the network deepens, the parallel arrangement can be regarded as approximately equivalent to the sequential ViT [13], while the number of model parameters and FLOPs remains unchanged. Depth [26][27] and width [28] are two critical factors of neural network architecture. To boost performance, most ViT variants [19][20][21][22][23][24] increase depth by concatenating Transformer blocks, but deep networks are difficult to optimize, and the model's separability is affected by the size of the feature dimension. Studies on widening ViT are still scarce [18]; the main concerns are that parallel ViT raises the network's computational cost and model complexity, and that overly high feature dimensions make it prone to overfitting.
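A minimal sketch of the difference, assuming generic attention and MLP sub-blocks rather than the exact formulation of the cited parallel ViT: a sequential design applies residual blocks one after another, whereas the parallel design evaluates several branches on the same input and sums them into a single residual update, halving depth while doubling width for the same parameter count.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Pre-norm multi-head self-attention sub-block."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        return self.attn(h, h, h)[0]

class MLP(nn.Module):
    """Pre-norm feed-forward sub-block."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                 nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

dim, heads, hidden = 256, 8, 1024
attn1, attn2 = MHSA(dim, heads), MHSA(dim, heads)
mlp1, mlp2 = MLP(dim, hidden), MLP(dim, hidden)
x = torch.randn(2, 196, dim)  # a batch of token sequences

# Sequential composition (depth 4): blocks are applied one after another.
y = x
for block in (attn1, mlp1, attn2, mlp2):
    y = y + block(y)

# Parallel composition (depth 2, width 2): the same sub-blocks are evaluated on the
# same input and summed into one residual update; parameters and FLOPs are unchanged.
z = x
z = z + attn1(z) + attn2(z)
z = z + mlp1(z) + mlp2(z)
```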
3. Transformer-Based Medical Image Segmentation Methods
Replacing the convolutional blocks of the U-shaped network with Transformer modules capable of global feature extraction is a promising avenue for applying the Transformer to medical image segmentation. TransUNet [5] was the pioneering network model to implement the ViT for medical image segmentation. Since a CNN captures only local information, embedding a Transformer in the encoder to extract global features from the CNN image-coding block provides long-distance dependencies and rich spatial information. To achieve accurate segmentation, the decoder up-samples the encoded features and fuses them with low-level CNN features for precise localization. The TransUNet model thus incorporates the self-attention mechanism into the U-Net architecture to enhance contextual comprehension, but this comes with a notable computational expense. The Swin-Unet [6] model, inspired by the Swin Transformer [21] module, replaces the convolutional layers of the U-Net directly, yielding the first pure-Transformer structure for medical image segmentation. The input image undergoes a non-overlapping patch operation before being fed into the Transformer encoder to learn a global deep feature representation; the decoder then combines the encoded features with up-sampled features to recover the feature map and perform segmentation prediction. This approach resolves the difficulty convolution has in learning global semantic information. UNETR [7] converts the 3D segmentation task into a sequence-to-sequence prediction problem: its encoder learns long-range semantic features with a pure Transformer architecture, while its decoder recovers high-resolution features with a CNN structure. UNETR adopts this hybrid Transformer-CNN approach because, although ViT excels at extracting global features, it performs poorly at acquiring local semantic information, and the Transformer carries a greater computational overhead than the CNN. As discussed above, conventional network architectures such as TransUNet [5], Swin-Unet [6], and UNETR [7] use ViT or Swin Transformer modules to enhance feature extraction by increasing network depth (i.e., the serial connectivity pattern of the blocks [29]). Increasing network depth clearly has a strong impact on model performance, yet it may not be the ideal choice once network optimization, separability, and computational cost are taken into account.
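As an illustration of this hybrid pattern, the sketch below inserts a small Transformer encoder between a CNN encoder and a CNN decoder, in the spirit of TransUNet; the layer counts, channel widths, and the absence of multi-stage skip connections are simplifying assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class HybridSegNet(nn.Module):
    """CNN encoder -> Transformer bottleneck over the coarse feature map -> CNN decoder."""
    def __init__(self, in_ch=1, num_classes=2, dim=128, heads=4, depth=2):
        super().__init__()
        # CNN encoder: local features, downsampled 4x
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer bottleneck: treats each spatial position as a token for global context
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # CNN decoder: upsample back to input resolution and predict per-pixel classes
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, x):
        f = self.encoder(x)                              # (B, dim, H/4, W/4)
        B, C, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)            # (B, H*W, dim) tokens on the coarse grid
        tokens = self.transformer(tokens)                # global self-attention over all tokens
        f = tokens.transpose(1, 2).reshape(B, C, H, W)   # back to a spatial feature map
        return self.decoder(f)                           # (B, num_classes, input H, input W)

out = HybridSegNet()(torch.randn(1, 1, 128, 128))        # -> (1, 2, 128, 128)
```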