Accurate segmentation of retinal vessels is an essential prerequisite for the subsequent analysis of fundus images. Recently, a number of deep learning methods, especially U-Net and its variants, have been proposed and have demonstrated promising segmentation performance. However, tiny vessels and low-contrast vessels remain hard to detect, for two reasons: consecutive down-sample operations cause a loss of spatial details, and vanilla skip connections fuse multi-level features inadequately.
1. Introduction
Fundus imaging is a non-invasive, reproducible, and inexpensive method that shows retinal vessels and pathology [1]. In the medical domain, morphological changes of retinal vessels, e.g., vessel diameter, branch angles, and branch lengths, can be used as clinical indicators for the detection and diagnosis of diabetes, hypertension, atherosclerosis, and other diseases [2]. In addition, because the morphology of the retinal vascular tree is unique to each individual, it can serve as a biometric identifier for identification systems in the social security domain [3]. Retinal vessel segmentation is the process of determining whether each pixel of a fundus image belongs to a vessel or not, and it is the preliminary step in objectively assessing the retinal vasculature and quantitatively interpreting its morphometrics. Nevertheless, manual segmentation of retinal vessels by trained experts is expensive, time-consuming, and laborious, especially when screening a large population. Furthermore, manual segmentation cannot guarantee consistent performance, because the results often vary from expert to expert owing to subjective judgment. Therefore, an automatic and high-precision method for retinal vessel segmentation is highly desirable. However, the retinal vascular tree presents an extremely complicated morphological structure and contains many tiny vessels, often fewer than ten pixels or even a single pixel wide, which are generally difficult to distinguish from the background. Similarly, owing to uneven illumination and lesion regions, the contrast between blood vessels and non-vascular structures is relatively low. Because of these problems, it remains a challenging task to accurately segment retinal vessels from fundus images, especially tiny vessels and low-contrast vessels.
In 1989, Chaudhuri et al. were the first to address the problem of automatically segmenting retinal vessels [4]. Following this research, and spurred by developments in digital image processing technology, many methods have been proposed for retinal vessel segmentation in recent decades [5]. Early studies based on various hand-crafted features, e.g., shape [6], color [7], and edge [4], usually exhibit low accuracy and poor robustness, because such shallow features cannot sufficiently express semantic-rich information. Recently, deep learning methods, especially deep convolutional neural networks (DCNNs), have achieved superior results on many computer vision tasks, e.g., image classification [8], object detection [9], human pose estimation [10], and semantic segmentation [11]. Compared with conventional methods, DCNNs automatically learn richer representations from raw input data and demonstrate superior segmentation performance [12]. In particular, Long et al. [13] proposed a novel end-to-end, pixel-to-pixel semantic segmentation network, called FCN, which introduced what is now the most basic framework for image segmentation: the encoder–decoder structure. However, unlike the abundance of natural image datasets, medical image datasets are relatively small, because they are difficult to collect due to patient privacy and ethical issues. In this regard, Ronneberger et al. [14] proposed U-Net, an improvement on FCN that can be trained with only a few images and still produce precise predictions. U-Net is a breakthrough for deep learning in medical image segmentation. Beyond its encoder–decoder structure, the success of U-Net is largely attributed to the skip connections between the encoder sub-network and the decoder sub-network, which combine multi-level features from different stages. As a general rule, the low-level features of shallow layers carry abundant spatial details but lack semantic information, while the high-level features of deep layers carry semantic-rich information but lose spatial details. Adopting skip connections to fuse the spatial details of the encoder sub-network with the semantic information of the decoder sub-network is therefore an intuitive design, as illustrated by the sketch below.
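To make this concrete, the following is a minimal, single-level PyTorch sketch of the encoder–decoder pattern with one vanilla skip connection; the layer widths and depth are illustrative assumptions, not those of the original U-Net [14].

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU: the basic building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One-level encoder-decoder with a single vanilla skip connection."""
    def __init__(self, in_ch=3, base_ch=16, n_classes=1):
        super().__init__()
        self.enc = conv_block(in_ch, base_ch)        # low level: rich spatial detail
        self.down = nn.MaxPool2d(2)                  # down-sample: detail is lost here
        self.mid = conv_block(base_ch, base_ch * 2)  # high level: rich semantics
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        self.dec = conv_block(base_ch * 2, base_ch)  # sees the concatenated features
        self.head = nn.Conv2d(base_ch, n_classes, 1)

    def forward(self, x):
        skip = self.enc(x)                     # (B, C, H, W)
        deep = self.mid(self.down(skip))       # (B, 2C, H/2, W/2)
        up = self.up(deep)                     # (B, C, H, W)
        fused = torch.cat([up, skip], dim=1)   # the vanilla skip connection
        return self.head(self.dec(fused))      # per-pixel vessel logits

logits = TinyUNet()(torch.randn(1, 3, 64, 64))  # -> shape (1, 1, 64, 64)
```

Concatenating the encoder features with the up-sampled decoder features is exactly the "vanilla" fusion whose limitations are discussed next.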
Even though U-Net and its variants have achieved state-of-the-art results on many medical image segmentation tasks, including kidney, pancreas, and liver segmentation, they still cannot segment retinal vessels efficiently and effectively. In general, there are two main limitations. Firstly, consecutive down-sample operations in the encoder sub-network lose the spatial information of tiny vessels and vessel edges, and the final segmentation map cannot recover this lost information through skip connections and up-sample operations in the decoder sub-network. Clinically, tiny vessels consisting of only a few pixels provide an indispensable reference for diagnosing diseases such as neovascular diseases, so they deserve more attention than thick vessels. Secondly, there exists a certain semantic gap between low-level features and high-level features in fundus images, especially in low-contrast regions. The vanilla skip connection introduces too much irrelevant, redundant information, which harms segmentation performance, especially for low-contrast vessels. It is therefore essential to selectively enhance vessel representations while suppressing background noise. The toy experiment below illustrates the first limitation.
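As a toy illustration (the sizes and threshold here are chosen for demonstration, not taken from the paper), consider a one-pixel-wide "vessel" passed through an 8× average-pooling, a stand-in for three consecutive 2× pooling stages, and then up-sampled back to the input resolution:

```python
import torch
import torch.nn.functional as F

# A toy 16x16 map containing a one-pixel-wide "vessel" (a column of ones).
x = torch.zeros(1, 1, 16, 16)
x[0, 0, :, 7] = 1.0

# 8x down-sampling (equivalent to three nested 2x average poolings),
# then bilinear up-sampling back to the input resolution.
low = F.avg_pool2d(x, kernel_size=8)   # 16x16 -> 2x2
rec = F.interpolate(low, size=(16, 16), mode="bilinear", align_corners=False)

print(x.max().item())     # 1.0  : the vessel is clearly present in the input
print(rec.max().item())   # 0.125: after 8x down/up, its response is 1/8
print((rec > 0.5).any())  # False: a 0.5 threshold erases the vessel entirely
```

The one-pixel structure survives pooling only as a heavily diluted response, which no amount of up-sampling can restore.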
2. Deep Convolutional Neural Networks
To date, many segmentation networks based on the fully convolutional network (FCN) with the encoder–decoder (high-to-low and low-to-high in series) architecture have been proposed in the field of semantic segmentation. Among them, U-Net [14] and its variants have achieved remarkable performance in medical image segmentation, including retinal vessel segmentation. For instance, DUNet [15] replaced standard convolutions with deformable convolutions to cope with the highly complex structures of retinal vessels. Zhang et al. [16] introduced edge-aware flows into U-Net to make predictions more sensitive to vessel edge information. For multi-source vessel image segmentation, Yin et al. [17] designed a deep fusion network, called DF-Net, which is composed of multi-scale fusion, feature fusion, and classifier fusion. Li et al. [18] proposed a multi-task symmetric network, called GDF-Net, which consists of three U-Net-shaped sub-networks: a global segmentation branch, a detail enhancement branch, and a fusion branch. As an alternative to the encoder–decoder architecture, Guo [19] put forward a low-to-high segmentation architecture, called CSGNet, which first obtains low-resolution representations and then learns high-resolution representations with their help. Recently, some studies [20][21] have demonstrated that learning high-resolution representations throughout the network preserves the spatial details of tiny vessels and vessel edges, which benefits segmenting tiny vessels and locating vessel boundaries. A representative method is HRNet, which was originally proposed for human pose estimation and has since been used for other position-sensitive vision tasks [20]. HRNet maintains high resolution from input to output, without needing to restore it, and generates semantic-rich high-resolution representations by repeatedly exchanging information across multi-resolution features. Motivated by HRNet, Lin et al. [21] proposed MPS-Net, a high-resolution representation network with multi-path scales. In MPS-Net, there are three paths with different resolutions: the main path maintains high resolution throughout the entire process, while two branch paths with low-resolution representations run in parallel and feed into the main path. The core exchange mechanism is sketched below.
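The following is a minimal two-branch PyTorch sketch of this multi-resolution exchange idea; the channel widths, branch count, and fusion details are simplified assumptions, not the published HRNet or MPS-Net configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """Fuses a high-resolution and a low-resolution branch, HRNet-style:
    each branch receives the re-sampled features of the other."""
    def __init__(self, ch_high=16, ch_low=32):
        super().__init__()
        self.low_to_high = nn.Conv2d(ch_low, ch_high, 1)  # 1x1 conv, then upsample
        self.high_to_low = nn.Conv2d(ch_high, ch_low, 3, stride=2, padding=1)

    def forward(self, high, low):
        # high: (B, 16, H, W); low: (B, 32, H/2, W/2)
        up = F.interpolate(self.low_to_high(low), size=high.shape[-2:],
                           mode="bilinear", align_corners=False)
        new_high = high + up                    # high res enriched with semantics
        new_low = low + self.high_to_low(high)  # low res enriched with detail
        return new_high, new_low

high = torch.randn(1, 16, 64, 64)  # high-resolution branch, kept end to end
low = torch.randn(1, 32, 32, 32)   # parallel low-resolution branch
unit = ExchangeUnit()
for _ in range(3):                 # repeated information exchange
    high, low = unit(high, low)
```

The key point is that the high-resolution branch is never down-sampled away: it accumulates semantics from the low-resolution branch while keeping its spatial detail.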
3. Self-Attention Modules
Generally speaking, humans can analyze and understand complex scenes naturally and effectively. Motivated by this observation, attention mechanisms [22][23] were introduced into deep learning to dynamically adjust the weights of feature maps. In particular, Vaswani et al. [24] proposed the self-attention mechanism to capture long-range dependencies in sequential signals, which facilitates machine translation and natural language processing. Then, Wang et al. [25] introduced the self-attention mechanism into computer vision to capture long-range dependencies via non-local operations. Based on the self-attention mechanism, Fu et al. [26] presented DANet for scene segmentation, which includes a position-attention module that models relationships in the spatial dimension and a channel-attention module that models interdependencies across channels. However, the self-attention mechanism must generate a huge attention matrix, whose complexity is $\mathcal{O}((H \times W) \times (H \times W))$, where $H \times W$ denotes the resolution of the input feature map; this seriously limits its practical applicability, as the sketch below makes explicit.
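Here is a minimal PyTorch sketch of 2D self-attention in the style of the non-local operation [25]; the channel-reduction factor and residual form are common conventions assumed for illustration, not details taken from that paper.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Non-local self-attention over a feature map. The attention matrix has
    shape (H*W, H*W), hence O((H*W) x (H*W)) time and memory."""
    def __init__(self, channels, reduced=8):
        super().__init__()
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.key(x).flatten(2)                    # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k, dim=-1)           # (B, HW, HW): the bottleneck
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                # residual connection

x = torch.randn(1, 32, 64, 64)  # HW = 4096 -> a 4096x4096 attention matrix,
y = SelfAttention2d(32)(x)      # already ~67 MB in float32 per image
```

Even a modest 64×64 feature map already produces a multi-megabyte attention matrix, which is why the quadratic cost dominates at fundus-image resolutions.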
Therefore, several variants of the self-attention mechanism have been proposed to reduce its computational complexity. For instance, Huang et al. [27] viewed the self-attention operation as a graph convolution and replaced the densely connected graph generated by the original self-attention mechanism with several sparsely connected graphs. To do so, they introduced a criss-cross attention module, in which each position attends to the $H + W - 1$ positions in its row and column rather than to all $H \times W$ positions, reducing the computational complexity from $\mathcal{O}((H \times W) \times (H \times W))$ to $\mathcal{O}((H \times W) \times (H + W - 1))$.
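For a concrete sense of this saving (the feature-map size is chosen purely for illustration):

```python
H = W = 64                           # illustrative feature-map size
full = (H * W) * (H * W)             # 16,777,216 pairwise attention weights
criss_cross = (H * W) * (H + W - 1)  # 520,192 weights
print(full / criss_cross)            # ~32.3x fewer weights per attention map
```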
In addition, Li et al. [28] reformulated the self-attention mechanism in an expectation–maximization manner to obtain a much more compact set of bases, reducing the computational complexity from $\mathcal{O}((H \times W) \times (H \times W))$ to $\mathcal{O}((H \times W) \times K)$, where $K$ denotes the number of compact bases. Li et al. [29] designed a lightweight dual-direction attention block that generates its attention weights with computational complexity of $\mathcal{O}(H \times W)$ via horizontal and vertical pooling operations. However, these existing variants are insufficient for retinal vessel segmentation, as they fail to focus on the characteristics of vessel structures.
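To illustrate how pooling along the two axes keeps the cost at $\mathcal{O}(H \times W)$, the following is a sketch of axis-pooling attention in the spirit of the dual-direction idea above; the sigmoid gating and 1×1 convolutions are illustrative assumptions, not the published block of [29].

```python
import torch
import torch.nn as nn

class AxisPoolingAttention(nn.Module):
    """Attention from horizontal and vertical pooling: only H + W values per
    channel are produced, so the cost scales as O(H*W), not O((H*W)^2)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, 1)
        self.conv_w = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # x: (B, C, H, W)
        pool_h = x.mean(dim=3, keepdim=True)  # (B, C, H, 1): average over width
        pool_w = x.mean(dim=2, keepdim=True)  # (B, C, 1, W): average over height
        # Broadcasting the two profiles reconstitutes a full (B, C, H, W) gate
        # without ever materializing a (HW x HW) matrix.
        attn = torch.sigmoid(self.conv_h(pool_h) + self.conv_w(pool_w))
        return x * attn

y = AxisPoolingAttention(32)(torch.randn(1, 32, 64, 64))  # same shape as input
```

Cheap as such axis-wise gating is, it has no notion of thin, curvilinear vessel geometry, which is precisely the shortcoming noted above.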
This entry is adapted from the peer-reviewed paper 10.3390/s23218899