The exploration of aquatic environments has recently become popular due to the growing scarcity of natural resources and the growth of the global economy
[1]. Machine vision has been shown to be a low-cost and dependable method that has the benefits of non-contact monitoring, long-term steady operation, and a broad range of applications. Underwater object detection is pivotal in numerous applications, such as underwater search and rescue operations, deep-sea exploration and archaeology, and sea life monitoring
[2]. These applications require effective and precise vision-based underwater analytics, including image enhancement, image quality assessment, and object detection methods. However, capturing underwater images using optical imaging systems poses greater problems than capturing images in open-air conditions. More specifically, underwater images frequently suffer from degradation due to severe color distortion, low contrast, non-uniform illumination, and noise from artificial lighting sources, which dramatically reduces image visibility and lowers the accuracy of underwater object detection
[1]. Over recent years, underwater image enhancement technologies have been developed that work as preprocessing operations to boost detection accuracy by improving the visual quality of underwater images.
On the other hand, underwater object detection performance is associated with the characteristics of underwater biological organisms. Usually, because of differences in size or shape and the overlapping or occlusion of marine organisms, traditional hand-designed feature extraction methods cannot meet detection requirements for actual underwater scenes. Most studies have emphasized the extraction of traditional low-level features, such as color, texture, contours, and shape
[3], which has led to the disadvantages of traditional object detection methods, such as poor recognition, low accuracy, and slow speed. By directly benefiting from deep learning methods, object detection has witnessed a great boost in performance over recent years. However, general object detection algorithms that are based on deep learning have not yet achieved satisfactory detection performance for marine organisms due to the low quality of underwater imaging and complex underwater environments.
2. Underwater Image Enhancement and Underwater Biological Detection
2.1. Underwater Image Enhancement (UIE) Methods
Underwater image enhancement (UIE) is a necessary step to improve the visual quality of underwater images. UIE can be divided into three categories: model-free, physical model-based, and deep learning-based approaches.
White balance
[4], Gray World theory
[5], and histogram equalization
[6] are examples of model-free enhancement methods that improve the visual quality of underwater images by directly adjusting the pixel values of images. Ancuti et al. proposed a multi-scale fusion method for underwater image enhancement that fused color-corrected and contrast-enhanced versions of the input to obtain high-quality images
[7]. Based on prior research, Ancuti et al. also proposed a weighted multi-scale fusion method built on underwater image white balance that could restore faded information and edge information in the original images using gamma correction and sharpening
[8]. Fu et al. proposed a Retinex-based enhancement system that included color correction, layer decomposition, and underwater image enhancement in the Lab color space
[9]. Zhang et al. extended the Retinex-based method by using bilateral and trilateral filters to enhance the three channels of underwater images in the CIELAB color space
[10]. However, because they do not take the physical degradation process of underwater images into account, model-free UIE approaches can generate noise, artifacts, and color distortion, which makes them unsuitable for many applications.
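To make the idea of directly adjusting pixel values concrete, the following is a minimal sketch of Gray World white balance, assuming an RGB image stored as a floating-point NumPy array with values in [0, 1]. The function name and the clipping step are illustrative choices and are not taken from the cited works.

```python
import numpy as np

def gray_world_white_balance(image: np.ndarray) -> np.ndarray:
    """Gray World white balance for an H x W x 3 RGB image with values in [0, 1].

    Assumption: the average colour of the scene should be neutral gray, so each
    channel is rescaled until its mean matches the mean over all channels.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB image")
    img = image.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)   # mean of R, G, B
    gray = channel_means.mean()                       # target gray level
    # Guard against division by zero for an all-black channel.
    gains = gray / np.maximum(channel_means, 1e-8)
    return np.clip(img * gains, 0.0, 1.0)

# Example: a uniform image with a blue-green cast, as is typical underwater.
cast = np.full((4, 4, 3), [0.1, 0.3, 0.4])
balanced = gray_world_white_balance(cast)
```

Because the gains depend only on global channel statistics, the sketch also illustrates the limitation noted above: nothing in it models how light is attenuated or scattered in water.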
Physical model-based methods regard underwater image enhancement as the inverse of an image degradation problem. These algorithms can provide clear images by calculating the transmission and background light using Definition 1. Because underwater imaging models are similar to atmospheric models for fog, dehazing algorithms are often used to enhance underwater images. He et al. proposed a dehazing algorithm that was based on the dark channel prior (DCP), which could effectively estimate the thickness of fog and obtain fog-free images
[11]. Based on DCP, Drew et al. proposed an underwater dark channel prior that considered red light attenuation in water
[12]. Peng et al. developed a generalized dark channel prior (GDCP) for underwater image enhancement that incorporated adaptive color correction into an image formation model
[13]. Model-based approaches often need prior information, and the quality of the enhanced images depends on precise parameter estimation.
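The inversion that these methods perform can be sketched as follows. The sketch assumes the simplified imaging model commonly used in the dehazing literature, I = J * t + B * (1 - t), where I is the observed image, J the scene radiance, t the transmission, and B the background light. The function names, the 3 x 3 patch, and the constants omega and t_min are illustrative assumptions rather than values from the cited papers, and the background light is assumed to be given.

```python
import numpy as np

def dark_channel(image, patch=3):
    """Per-pixel minimum over the colour channels and a patch x patch window.

    Assumes an H x W x 3 float image and an odd patch size.
    """
    min_rgb = image.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    out = np.empty_like(min_rgb)
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + patch, x:x + patch].min()
    return out

def estimate_transmission(image, background, omega=0.95, patch=3):
    """DCP-style estimate: t = 1 - omega * dark_channel(I / B)."""
    background = np.asarray(background, dtype=np.float64)
    normalized = image / np.maximum(background, 1e-8)
    return 1.0 - omega * dark_channel(normalized, patch)

def restore_scene(image, background, transmission, t_min=0.1):
    """Invert I = J * t + B * (1 - t) for J, with t bounded below by t_min."""
    background = np.asarray(background, dtype=np.float64)
    t = np.maximum(transmission, t_min)[..., None]
    return np.clip((image - background) / t + background, 0.0, 1.0)
```

The lower bound t_min keeps the division stable where the estimated transmission is close to zero. The sketch also makes the dependence on parameter estimation visible: an error in the background light or the transmission propagates directly into the restored image.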
Deep learning enhancement methods usually construct convolutional neural networks and train them using pairs of degraded underwater images and their high-quality counterparts
[14]. Li et al. proposed an unsupervised generative adversarial network (WaterGAN) that generated underwater images from in-air RGB-D images and then trained an underwater image recovery network using the synthesized training data
[15]. To produce paired underwater image datasets, Fabbri et al. suggested an underwater color transfer model that was based on CycleGAN
[16] and built an underwater image recovery network using a gradient penalty technique
[17]. Ye et al. proposed an unsupervised adaptive network for joint learning that could jointly estimate scene depth and correct the color of underwater images
[18]. Chen et al. proposed two perceptual enhancement cascade models, which used gradient strategy feedback information to enhance more prominent features in images
[14]. Deep learning UIE approaches that are trained on synthetic images generally require large datasets
[19]. Because the quality of the synthetic images cannot be guaranteed, these methods cannot be reliably applied to real underwater scenes.
2.2. Attention Mechanisms
Several studies on attention mechanisms have been presented in the literature. Attention models enable networks to extract information from crucial areas at reduced computational cost, thereby enhancing CNN performance. Wang et al. proposed a residual attention network that was based on an attention mechanism, which could continuously extract large amounts of attention information
[20]. Hu et al. proposed SENet, which contained architectural “squeeze” and “excitation” units. These modules enhanced network expressiveness by modeling the interdependencies between channels
[21]. Woo et al. proposed a lightweight module (CBAM) that combined channel attention and spatial attention to refine features
[22]. This method could achieve considerable performance improvements while maintaining small overheads.
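As an illustration of channel attention, the following is a minimal NumPy sketch of a squeeze-and-excitation step on a single feature map. It assumes a C x H x W layout and two weight matrices with reduction ratio r; bias terms are omitted and the weights would normally be learned, so the sketch is not the reference SENet implementation.

```python
import numpy as np

def squeeze_excitation(features, w1, w2):
    """Reweight the channels of a C x H x W feature map.

    Assumed shapes: w1 is (C // r, C) and w2 is (C, C // r) for a reduction
    ratio r. Both are hypothetical, untrained weights in this sketch.
    """
    squeezed = features.mean(axis=(1, 2))           # squeeze: global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)         # excitation, step 1: ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # excitation, step 2: sigmoid gate in (0, 1)
    return features * scale[:, None, None]          # rescale each channel by its gate

# Example with C = 4 channels and reduction ratio r = 2.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 8, 8))
reweighted = squeeze_excitation(
    feature_map, rng.normal(size=(2, 4)), rng.normal(size=(4, 2))
)
```

The only extra parameters are the two small matrices, which is consistent with the small overheads reported for this family of modules.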
2.3. Underwater Object Detection Algorithms
Deep learning-based object detection algorithms are currently divided into two categories: one-stage regression detectors and two-stage region generation detectors. One-stage detection methods mainly include the YOLO series
[23,24,25], SSD
[26], RetinaNet
[27], and RefineDet
[28], which directly predict objects without region generation. Two-stage detection methods mainly include R-CNN
[29], Fast R-CNN
[30], Faster R-CNN
[31], and Cascade R-CNN
[32]. Initially, these object detection methods were used for natural environments on land. As deep learning technology has advanced, more and more object detection algorithms have been applied to challenging underwater environments. Li et al. used Faster R-CNN to detect fish species and achieved outstanding performance
[33]. Li et al. employed a residual network to detect deep-sea plankton. Their experiments revealed that deep residual networks generalized well to plankton classification
[34]. Cui et al. introduced a CNN-based fish detection system and optimized it using data augmentation, network simplification, and training process acceleration
[35]. Huang et al. presented three data augmentation approaches for underwater imaging that could imitate the illumination of marine environments
[36]. Fan et al. proposed a cascade underwater detection framework with feature augmentation and anchor refinement, which could address the issue of imbalanced underwater samples
[37]. Zhao et al. designed a new composite backbone network to detect fish species by improving the residual network and used it to learn change information within ocean scenes
[3]. However, little research has been conducted in the field of underwater object detection using YOLO.