Ship Localization, Classification, and Detection Based on CNNs

Object detection is a common application within the computer vision area. Its tasks include the classic challenges of object localization and classification, which makes object detection a challenging task. Furthermore, this technique is crucial for maritime applications, since situational awareness can bring various benefits to surveillance systems. The literature presents various models to improve automatic target recognition and tracking capabilities that can be applied to and leverage maritime surveillance systems.

  • maritime surveillance
  • classification
  • localization
  • detection
  • artificial intelligence
  • neural networks

1. Introduction

With the growth in ocean exploration by cruise ships, ocean liners, and other marine vessels, the need for monitoring systems has increased considerably. As a result, monitoring stations have become increasingly well equipped to identify possible issues. Among the maritime monitoring applications, one can mention potential collision prediction [1], navigation support, tracking of ship drift [2], target tracking, and maritime safety [3]. Visual ship tracking provides crucial kinematic traffic information to maritime traffic participants, which helps to accurately predict ship traveling behaviors in the near future. Each of these applications requires a different operating architecture [4].
Automatic maritime surveillance assumes the use of sensors that can provide enough information for automatic situational awareness tasks, such as localization and classification. In localization, a single object is found in an image. In classification, the object is defined as belonging to a specific class. Detection combines the characteristics of these two techniques to locate and classify multiple targets in the scene.
The fusion of different sensing sources can provide a better situational view of the monitored environment and help one take the necessary actions. Sensors based on sound or electromagnetic waves, such as sonar and radar, are generally employed in long-range applications. Optical sensors can be more economical alternatives for applications that require greater detail of the ships and aim for low power consumption. They can be employed in ship tracking and classification tasks, requiring only that these sensors be combined with visual detection techniques that are efficient, fast, and robust to enable the advancement of maritime applications [5].
The number of sensors involved can also vary, and the most common ones for this purpose are thermal cameras, optical cameras, and radar. Choosing the best techniques to obtain maritime situational information is not trivial, since the literature offers a huge number of models that employ optical data. To make their application even more complicated, weather or water conditions, such as wind speed, tidal changes, rain, and fog, can blur or entirely obstruct objects in an image. Additionally, increasing distance between the monitored object and the sensors can further complicate visual tasks, as it causes large scale variation.
Early research on ship detection, as with object detection in general, employed simple, handcrafted features, but recently, convolutional neural networks (CNNs) have become part of this field of study because of their extraordinary ability to extract and represent visual features [6]. For example, in automatic navigation systems, the role of CNNs is to interpret the visual data collected by the cameras. The detection information is then added to the data from other sensors, allowing the data fusion processing system to have enough information for decision-making.

2. Image Acquisition

During image capture, several types of sensors can be used. However, the focus of this work is on optical sensors, whether remote (installed on satellites or aircraft) or side-view sensors used in inshore or offshore scenarios, for example on other ships or on fixed constructions on land, usually near the coast. Optical images can be further divided into the visible and infrared (IR) spectra, and the range of both is very similar, from the order of meters to at most a few kilometers. The main differences between optical sensors are related to sensitivity to the environment and to the quality and quantity of visual information generated by the sensor [7]. Comparing sensitivity to illumination, both sensor types have problems working outside their respective designations, i.e., while the visible light sensor performs poorly for nighttime applications, the IR sensor presents high saturation in images captured during the day. In addition, the visible light sensor is less robust to the effects of light reflection on ships caused by water dynamics. However, the visual data generated by the visible light sensor are more detailed, in terms of the quality and quantity of captured elements, than those from the IR sensor [7]. Thus, this sensor can lead to the training of detectors with higher reliability. Optical remote sensing images suffer from weather conditions, such as rain, waves, fog, and clouds, which, in some cases, makes it necessary to preprocess the image to improve its quality before analysis.

3. Preprocessing Techniques

The preprocessing step can be used to improve the quality of images through techniques that attenuate interference caused by elements of the environment, such as extreme brightness and contrast, as well as limitations of the camera lenses used in the capture process, yielding a better-quality dataset [10]. Among the various techniques that can be used in preprocessing, it is possible to mention super-resolution [11,12] and deblurring [13,14]. The main benefit of improving images before they are used in localization, classification, or detection models is usually the gain in accuracy achieved merely by increasing the quality of the dataset [15]. An example of detection enhancement can be seen in Figure 2.
Figure 2. Detection enhancement with preprocessing.
Super-resolution techniques are used to recover quality and improve the resolution of an image. With this, instead of receiving low-resolution images, the model starts to operate with more detailed images, which in many situations leads to an immediate performance improvement [16]. The field of image super-resolution has been dominated by methods based on convolutional neural networks in recent years [17]. Among the models related to the super-resolution task, one can mention those trained to minimize the mean squared error (MSE), such as the super-resolution convolutional neural network (SRCNN) [18], super-resolution residual network (SRResNet) [19], enhanced deep super-resolution network (EDSR) [20], multi-scale deep super-resolution (MDSR) [20], and deep back-projection networks (DBPN) [21], and also models based on generative adversarial networks (GANs), such as the super-resolution generative adversarial network (SRGAN) [22], enhanced super-resolution generative adversarial network (ESRGAN) [23], and rank super-resolution generative adversarial network (RankSRGAN) [24]. In addition to being based on different forms of training, these models also differ in the layer structures that make up their architectures, which influence their performance, both in accuracy and in processing time. Super-resolution models trained with an MSE estimator use the distance between training images and the associated predictions as a cost function. As a result, these models tend to produce smoother images. On the other hand, models based on GANs use generators to create new synthetic images that mimic the expected result, so that the discriminative model has maximum difficulty distinguishing between the images synthesized by the generator and real images. This process generates output images with more realistic detail but can introduce some unwanted noise into the image during the super-resolution process [25].
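A minimal PyTorch sketch of the two training objectives contrasted above is shown below; the `discriminator` module, tensor names, and shapes are placeholders, and the sketch is illustrative rather than a reference implementation of any specific model.

```python
import torch
import torch.nn.functional as F

def pixel_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    # MSE-trained models (e.g., SRCNN, SRResNet, EDSR, MDSR, DBPN) minimize the
    # pixel-wise distance, which favors smooth, slightly blurred outputs.
    return F.mse_loss(sr, hr)

def adversarial_generator_loss(discriminator, sr: torch.Tensor) -> torch.Tensor:
    # GAN-based models (e.g., SRGAN, ESRGAN, RankSRGAN) add a term that rewards the
    # generator when the discriminator mistakes synthesized images for real ones,
    # encouraging sharper, more realistic texture at the risk of artifacts.
    logits_fake = discriminator(sr)
    return F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
```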

3.1. SRCNN

The SRCNN model is based on a CNN architecture trained to learn an end-to-end mapping between low- and high-resolution images for the super-resolution (SR) problem. It was one of the first architectures to apply the concept of deep learning to the super-resolution task, achieving one of the best results for this task in 2015. When the authors proposed the use of CNNs, the most common approach was traditional sparse-coding-based SR methods [18]. The model is divided into three stages: the first is responsible for extracting and representing the low-quality image within the network; the second applies a nonlinear mapping, where the CNN layers extract as much information about the image as possible; the last step rebuilds the image at a higher resolution than the image applied to the model input [18].
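A minimal sketch of this three-stage pipeline is given below; the 9-1-5 filter sizes and 64/32 channel widths follow a commonly reported SRCNN configuration and are assumptions, and the input is taken to be a low-resolution image already upscaled (e.g., bicubically) to the target size.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-stage SRCNN sketch: extraction -> nonlinear mapping -> reconstruction."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)      # patch extraction/representation
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)                        # nonlinear mapping
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)   # high-resolution reconstruction
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: low-resolution image already upscaled (e.g., bicubic) to the target size
        x = self.relu(self.extract(x))
        x = self.relu(self.mapping(x))
        return self.reconstruct(x)

# usage sketch with a placeholder input
model = SRCNN()
sr = model(torch.rand(1, 1, 128, 128))
```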

3.2. SRResNet

The SRResNet model was developed with an architecture based on the residual network (ResNet), but with modifications in the optimization of the MSE loss function to achieve high upscaling factors (4×), as reported in [22]. With this optimization, the model was consolidated in 2017 as the new state of the art, and its performance was evaluated by peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), two metrics widely used in image quality assessment [26]. Another proposal by the authors of [22] was to replace the MSE loss function with one based on features of the visual geometry group (VGG) model. The authors then compared the MSE-optimized version with the version modified to use the VGG loss. The result was that the VGG-based model improved the visual mean opinion score (MOS) metric but performed worse on PSNR and SSIM [22].
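A sketch of such a VGG-feature loss is shown below: the MSE is computed in a feature space of a fixed VGG19 network instead of pixel space. The truncation index (36, a deep ReLU layer of torchvision's VGG19) is an assumption, and the snippet assumes torchvision 0.13 or later and 3-channel inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class VGGFeatureLoss(nn.Module):
    """Perceptual loss sketch: MSE computed in a VGG19 feature space instead of pixel space."""
    def __init__(self):
        super().__init__()
        # Which layer to truncate at is a design choice; a deep ReLU layer is used here.
        self.features = vgg19(weights=VGG19_Weights.DEFAULT).features[:36].eval()
        for p in self.features.parameters():
            p.requires_grad = False  # the loss network stays fixed during SR training

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        # sr and hr are 3-channel image batches of the same size
        return F.mse_loss(self.features(sr), self.features(hr))
```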

3.3. SRGAN

The SRGAN model is trained through a generative architecture based on ResNet. This architecture generates super-resolution images and has its loss evaluated by a second structure with a discriminative function, which only acts during training. The SRGAN proposal is to use a new loss function based on features of the VGG model, which, combined with the discriminative network, helps to capture the difference between the generated image and the reference image. According to tests based on the MOS metric [27], which evaluates the perceptual quality of an image empirically through a visual rating scale, the trained model achieves results close to the state of the art in the literature [22].
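A sketch of an SRGAN-style generator objective combining the two terms discussed above is given below. The weighting of the adversarial term, the VGG truncation index, and the `discriminator` module are assumptions; the snippet assumes torchvision 0.13 or later and 3-channel inputs.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Fixed VGG feature extractor used by the content term.
vgg_features = vgg19(weights=VGG19_Weights.DEFAULT).features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def srgan_generator_loss(discriminator, sr: torch.Tensor, hr: torch.Tensor,
                         adv_weight: float = 1e-3) -> torch.Tensor:
    # Content term: distance in VGG feature space rather than pixel space.
    content = F.mse_loss(vgg_features(sr), vgg_features(hr))
    # Adversarial term: reward the generator when the discriminator is fooled.
    logits = discriminator(sr)
    adversarial = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return content + adv_weight * adversarial
```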

3.4. EDSR

The EDSR model, like the other SR models already presented, also has an architecture based on ResNet and shares characteristics with SRResNet. However, unnecessary modules are removed from the architecture to optimize the model. Among these changes, batch normalization (BN) is removed from the residual blocks, which simplifies the model and reduces memory usage [20]. Like SRResNet, the EDSR model also achieves 4× upscaling factors. In [20], the authors trained the model for upscaling of 2×, 3×, and 4×. In addition to optimizing the model's processing time and simplifying the architecture, the authors also obtained results superior to the other networks tested, such as SRCNN, SRResNet, and MDSR.
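A sketch of an EDSR-style residual block is shown below: compared with the original ResNet block, BN is removed and a residual scaling factor stabilizes training; the channel width and the 0.1 scaling value are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EDSRResBlock(nn.Module):
    """Residual block without batch normalization, in the style used by EDSR/MDSR."""
    def __init__(self, channels: int = 64, res_scale: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.res_scale = res_scale  # scales the residual branch before adding the skip connection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.res_scale * self.body(x)
```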

3.5. MDSR

The MDSR model was proposed by the same authors who developed the EDSR model, and both architectures are described in [20]. The MDSR network has a certain increase in complexity compared to EDSR because it uses extra blocks with different scales at the beginning of the architecture. The removal of the BN layers, as suggested in [20], is also adopted in this model. Unlike EDSR, which reconstructs only a single super-resolution scale, the MDSR network applies an initial stage that operates with parallel image processing structures for different scales. This helps mitigate problems caused by variations in image scale. Both models, EDSR and MDSR, were entered in the NTIRE 2017 Super-Resolution Challenge [28], taking first and second place, respectively. With this, the authors claimed to have simultaneously achieved the state of the art in super-resolution and transformed the ResNet architecture into a more compact model.
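A sketch of the multi-scale idea is given below: a single feature body is shared across scales, while scale-specific pre-processing blocks and sub-pixel upsamplers handle 2×, 3×, and 4× inputs. Block counts, kernel sizes, and channel widths are placeholders.

```python
import torch
import torch.nn as nn

class MDSRSketch(nn.Module):
    """Shared feature body with scale-specific heads, in the spirit of MDSR."""
    def __init__(self, channels: int = 64, scales=(2, 3, 4)):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # scale-specific pre-processing blocks at the start of the network
        self.pre = nn.ModuleDict({str(s): nn.Conv2d(channels, channels, 5, padding=2) for s in scales})
        # shared feature body (in practice a stack of residual blocks such as the one sketched above)
        self.body = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)])
        # scale-specific upsamplers using sub-pixel convolution
        self.up = nn.ModuleDict({
            str(s): nn.Sequential(nn.Conv2d(channels, channels * s * s, 3, padding=1), nn.PixelShuffle(s))
            for s in scales
        })
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x: torch.Tensor, scale: int) -> torch.Tensor:
        x = self.head(x)
        x = self.pre[str(scale)](x)
        x = self.body(x)
        return self.tail(self.up[str(scale)](x))
```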

3.6. ESRGAN

The ESRGAN model is an improved version of the SRGAN network in three respects. The first was the replacement of the residual blocks by residual-in-residual dense blocks (RRDB) to facilitate training, together with the removal of the BN layers in favor of residual scaling and smaller initialization, as suggested in [20], because this allows the training of a deeper architecture. The second difference was the replacement of the standard GAN discriminator by a relativistic average GAN (RaGAN); instead of judging whether an image is real or fake, this network estimates which image is more realistic. Finally, the perceptual loss was improved by using VGG features before activation, rather than after activation as in SRGAN. This last change makes the model produce sharper edges and visually more satisfying results [23]. With these modifications, the model reached the state of the art in 2018, presenting the best perceptual quality results and taking first place in the perceptual image restoration and manipulation super-resolution (PIRM-SR) challenge [17]. The challenge evaluated several models in terms of visual perception quality as well as PSNR and SSIM. From the evaluation of the models in this challenge, it was possible to notice that increasing values of PSNR and SSIM were not always accompanied by an increase in perceptual quality. In many cases, this resulted in increasingly blurred and unnatural outputs, which reinforces the previously cited results of [22].
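A sketch contrasting the standard discriminator objective with a relativistic average (RaGAN-style) objective is shown below; the discriminator outputs are assumed to be raw logits, and the binary cross-entropy formulation is one common choice among several.

```python
import torch
import torch.nn.functional as F

def standard_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Standard GAN discriminator: "is this image real or fake?"
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    return (F.binary_cross_entropy_with_logits(d_real, ones) +
            F.binary_cross_entropy_with_logits(d_fake, zeros))

def relativistic_avg_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # RaGAN-style discriminator: "is the real image more realistic, on average,
    # than the generated ones?"
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    return (F.binary_cross_entropy_with_logits(d_real - d_fake.mean(), ones) +
            F.binary_cross_entropy_with_logits(d_fake - d_real.mean(), zeros))
```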

3.7. RankSRGAN

The RankSRGAN model is based on the GAN architecture but adopts a Siamese architecture to learn perceptual metrics and rank images according to the quality score found during training. This model combines different SR algorithms to improve perceptual metrics [24]. To train the ranker, the authors used three models: SRResNet, SRGAN, and ESRGAN. With their combination, RankSRGAN was able to optimize the natural image quality evaluator (NIQE) metric [29], a visual metric that measures the naturalness of the image in the scene. With this, the model achieved performance superior to the individual models used when applied to the dataset of the PIRM-SR Challenge 2018 [24].
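A sketch of the Siamese ranking idea is given below: the same small scoring network evaluates two restored images, and a margin ranking loss pushes the score of the perceptually better one above the other. The ranker backbone, margin, and the convention that higher scores mean better quality are all placeholders.

```python
import torch
import torch.nn as nn

# Placeholder ranker: any CNN that maps an image to a single quality score.
ranker = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)

rank_loss = nn.MarginRankingLoss(margin=0.5)

def ranker_step(img_better: torch.Tensor, img_worse: torch.Tensor) -> torch.Tensor:
    # Siamese setup: the same weights score both inputs.
    s_better, s_worse = ranker(img_better), ranker(img_worse)
    # target = 1 means s_better should exceed s_worse by at least the margin.
    target = torch.ones_like(s_better)
    return rank_loss(s_better, s_worse, target)
```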

3.8. DBPN

The DBPN model is an improved version of the SRCNN network, but instead of using predefined upsampling, it uses interleaved upsampling and downsampling layers. Unlike other methods that build the SR image in a purely feed-forward manner, this network focuses on directly improving the SR features by using multiple upsampling and downsampling stages that feed error predictions into each depth. The error feedback values from the upscaling and downscaling steps are used to guide the network toward a better result. The model performed on par with the state of the art in 2018. In addition, the network was trained with 8× magnification, higher than that used in the creation of SRResNet [19] and EDSR [20].
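A sketch of a DBPN-style up-projection unit illustrating this error feedback is shown below: an initial upsampling, a back-projection to low resolution, and a second upsampling of the residual error that corrects the first estimate. The kernel, stride, and padding values correspond to 2× scaling and are assumptions.

```python
import torch
import torch.nn as nn

class UpProjection(nn.Module):
    """DBPN-style up-projection sketch: upsample, back-project, correct with the error."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # kernel=6, stride=2, padding=2 gives exact 2x up/down scaling
        self.up1 = nn.ConvTranspose2d(channels, channels, 6, stride=2, padding=2)
        self.down = nn.Conv2d(channels, channels, 6, stride=2, padding=2)
        self.up2 = nn.ConvTranspose2d(channels, channels, 6, stride=2, padding=2)
        self.act = nn.PReLU()

    def forward(self, lr_feat: torch.Tensor) -> torch.Tensor:
        h0 = self.act(self.up1(lr_feat))   # first high-resolution estimate
        l0 = self.act(self.down(h0))       # back-projection to low resolution
        err = l0 - lr_feat                 # reconstruction error at low resolution
        h1 = self.act(self.up2(err))       # error mapped back to high resolution
        return h0 + h1                     # corrected high-resolution features
```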

Unlike super-resolution techniques, deblurring techniques were developed to remove noise and blur that hinder the visualization of the image. When noisy images are treated before being inserted into detection and classification systems, system performance can increase considerably [30]. Some of the techniques that can be applied to the deblurring task are the deblur generative adversarial network (DeblurGAN), DeblurGAN-V2, and deblurring and shape recovery of fast moving objects (DeFMO).

3.9. DeblurGAN

The DeblurGAN model is composed of a GAN architecture, and its purpose is to remove blur from images. The model features a CNN architecture composed of residual blocks (ResBlocks), each consisting of a convolution layer, an instance normalization layer, and a ReLU activation [31]. The authors of DeblurGAN validated their results by applying the you only look once (YOLO) model to detect and classify objects in blurred images and in images processed by the deblurring model. There is a gain in accuracy in the YOLO results when the input images are first improved by the DeblurGAN model, showing that it significantly contributes to image quality and, consequently, to the performance of subsequent processing systems [31].
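A sketch of a residual block of the kind described above (convolution, instance normalization, ReLU) is shown below; the channel width and the exact block layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DeblurResBlock(nn.Module):
    """Residual block with instance normalization and ReLU, in the style of DeblurGAN's generator."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)
```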

3.10. DeblurGAN-V2

The DeblurGAN-V2 model builds on the original DeblurGAN but with some modifications to improve the network [32]. Among them, the generator in DeblurGAN-V2 integrates the feature pyramid network (FPN) technique. This technique was initially developed for object detection purposes [33]. Still, in the case of DeblurGAN-V2, the authors used the FPN as the core block for restoring the blurred image [32]. In addition to integrating the FPN technique, the new version allows the selection of different backbones, each designed to improve some performance parameter. For example, the Inception-ResNetV2 backbone yields state-of-the-art deblurring quality, whereas the mobile network depthwise separable convolution (MobileNet-DSC) backbone yields an increase in processing speed, some 10 to 100 times faster than the top competitors in 2019 [32].

3.11. DeFMO

Motion blur is one of the existing blur types; it is caused by the rapid movement of objects during capture or by fast camera motion while capturing still objects, producing blurred photos or videos [34]. DeFMO is designed to act on this type of blur. The proposed network is based on a novel self-supervised loss function that improves the model's accuracy when applied to images with motion blur. By presenting a good generalization capability, this model can be applied to different areas in computer vision, such as the improvement of security cameras, microscopes, and photos with high noise levels [35]. This model is the first fully neural approach to fast-moving-object (FMO) deblurring, filling the gap between deblurring, 3D modeling, and FMO sub-frame tracking for trajectory analysis.

4. Processing Techniques

Most previously proposed models for ship localization, classification, or detection have relied on handcrafted features applied to image processing. These models are built with the expert knowledge of designers. Within the scope of handcrafted feature models, it is possible to point out several works that employ different techniques, such as the Gabor filter in [36] for automatic target detection, the discrete cosine transform (DCT) in [37] for maritime surveillance on non-stationary surface platforms, as well as Haar–Cascade [38], the scale-invariant feature transform (SIFT) [39], local binary patterns (LBP) [40], support vector machines (SVM) [41], and histograms of oriented gradients (HOG) [42] for the remote sensing of ships. As a result, the extracted features reflect only limited aspects of the problem, leading to low model accuracy and poor generalization. Thus, deep learning approaches from the computer vision research community, such as CNNs, proved to be more suitable for building and training feature extractors [43].
Techniques based on CNNs dominate the most recent works, as shown in Table 1, which details the evolution of the works over the years, pointing out aspects such as the type of image used, the applications, and the techniques involved in each work. CNNs gained great momentum after winning the ImageNet challenge in 2012 and have achieved excellent results in several image processing tasks for obtaining visual information [44]. Another point in favor of this type of network is the growth in the size of available datasets, given that CNNs usually require a large number of training samples. With this, the use of detection models based on CNNs has accelerated even more because, according to [45], a good object detector should improve when given more training data.
Within these networks, there is a subclass, the region-based convolutional neural networks (R-CNNs), whose working principle is based on generating region proposals for object detection, for example via selective search, as shown in Figure 3. Work related to this type of technique began with the R-CNN, proposed by Ross Girshick [46]. Since then, other variations and related detectors have been proposed, such as fast R-CNN [45], faster R-CNN [47], mask R-CNN [48], the single-shot detector (SSD) [49], YOLO [50], YOLOv2/9000 [51], YOLOv3 [52], YOLOv4 [52], and YOLOv5 [52]. These models have modifications in their topologies to increase their speed and prediction performance or even to add a new function, as is the case of segmentation in mask R-CNN.
Figure 3. Regions with CNN features.

4.1. R-CNN

The R-CNN emerged to localize objects through a CNN with high detection capability even with a small number of annotated training samples. It is basically divided into three modules. The first is responsible for generating several class-agnostic region proposals through a method called selective search (SS) [53]. The second is a CNN, which extracts a fixed number of features for each of the proposals. Finally, the third module is based on linear SVMs, one trained specifically for each possible class. With this, the network can not only locate an object but also indicate which of the possible classes it belongs to. This classification is performed through a score generated by the classifiers [46].
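A highly simplified sketch of these three modules is given below: class-agnostic proposals, a fixed-length CNN feature per warped proposal, and per-class linear scorers. The `propose_regions` function is only a stand-in for selective search, and the untrained ResNet-18 backbone and linear layer are placeholders that demonstrate the data flow rather than a trained R-CNN.

```python
import torch
import torch.nn.functional as F
import torchvision

def propose_regions(image: torch.Tensor, n: int = 10):
    """Stand-in for selective search: returns n random (x1, y1, x2, y2) boxes."""
    _, h, w = image.shape
    boxes = []
    for _ in range(n):
        x1 = torch.randint(0, w // 2, (1,)).item()
        y1 = torch.randint(0, h // 2, (1,)).item()
        boxes.append((x1, y1, x1 + w // 2, y1 + h // 2))
    return boxes

# Module 2: a CNN that maps each warped proposal to a fixed-length feature vector.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()  # expose the 512-d pooled feature

# Module 3: one linear scorer per class (untrained placeholder for the per-class SVMs).
num_classes = 3
class_scorers = torch.nn.Linear(512, num_classes)

image = torch.rand(3, 480, 640)  # placeholder frame
for x1, y1, x2, y2 in propose_regions(image):
    crop = image[:, y1:y2, x1:x2].unsqueeze(0)
    crop = F.interpolate(crop, size=(224, 224))  # warp each proposal to a fixed input size
    with torch.no_grad():
        feat = backbone(crop)          # fixed-length feature for the proposal
        scores = class_scorers(feat)   # per-class scores (SVM scores in the original R-CNN)
```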

4.2. Fast R-CNN

Fast R-CNN introduces single-stage training with an update of all layers and avoids disk storage for feature caching [45]. Regarding the detection task, it has the advantage of achieving a higher mean average precision (mAP) compared to the standard version. In this model, the linear SVMs used in R-CNN are replaced by a softmax classifier. Using the same training algorithm and hyperparameters as in R-CNN, the authors also trained an SVM classifier for fast R-CNN and justified the use of softmax by its slight advantage in mAP over the SVM [45].

4.3. Faster R-CNN

This model uses the region proposal network (RPN), which comprises CNNs capable of providing region proposals to fast R-CNN, informing at the same time the object boundaries and the scores of each proposed region. RPN calculates proposal regions much faster and more efficiently compared to SS. Moreover, it brings another advantage by sharing convolutional layers between the proposal generation network and the classification network, optimizing the network training [47].
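Detectors of this family are available off the shelf; a minimal inference sketch with torchvision's pretrained Faster R-CNN (ResNet-50 FPN backbone) is shown below. The weights enum assumes torchvision 0.13 or later, and the input tensor and 0.5 confidence threshold are arbitrary placeholders.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

# Pretrained two-stage detector: an RPN proposes regions, a second head classifies them.
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

image = torch.rand(3, 480, 640)  # placeholder for a normalized [0, 1] RGB frame
with torch.no_grad():
    predictions = model([image])[0]  # the model accepts a list of image tensors

keep = predictions["scores"] > 0.5     # arbitrary confidence threshold
boxes = predictions["boxes"][keep]     # (x1, y1, x2, y2) in pixel coordinates
labels = predictions["labels"][keep]   # COCO class indices for these pretrained weights
```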

4.4. Mask R-CNN

It follows the same principle as faster R-CNN but adds a second output to the model for segmenting objects [48]. Pixel-by-pixel object segmentation is performed by superimposing a mask produced by this second output. This mask is applied to each region of interest (RoI) and is based on the fully convolutional network (FCN) model [54].

4.5. SSD

Compared to the previous two-stage methods, SSD is a more straightforward method because it encapsulates all computation in a single deep neural network, eliminating the need to generate object proposals in a separate stage. This increases the speed of the system and facilitates training by providing a unified structure for training and inference. It scores bounding boxes and adjusts them to best match the shape of the object, using boxes of different aspect ratios to handle objects of different sizes [49].
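A sketch of the multi-aspect-ratio default boxes that an SSD-style detector places at every feature-map cell is shown below; the base scale, aspect ratios, and 300-pixel image size are placeholders.

```python
import math

def default_boxes(feat_w: int, feat_h: int, img_size: int = 300,
                  base_scale: float = 0.2, ratios=(1.0, 2.0, 0.5)):
    """Centered (cx, cy, w, h) default boxes, in pixels, for one feature map."""
    boxes = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = (j + 0.5) / feat_w * img_size   # cell center, mapped to image coordinates
            cy = (i + 0.5) / feat_h * img_size
            for r in ratios:
                w = base_scale * img_size * math.sqrt(r)
                h = base_scale * img_size / math.sqrt(r)
                boxes.append((cx, cy, w, h))
    return boxes

# e.g., a 10x10 feature map with 3 aspect ratios yields 300 default boxes,
# each of which the SSD heads score and regress toward a ground-truth object.
print(len(default_boxes(10, 10)))  # 300
```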

4.6. YOLO

Like SSD, this is also a single-stage detector, whose performance is optimized within a unified detection model. In this method, object detection is performed as a regression task for bounding boxes, which, at the same time, provides the object locations together with their respective classes. The primary source of error in this network is the incorrect localization of small objects [50].
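A sketch of how a YOLO-style grid prediction can be decoded into image-space boxes is shown below: each cell predicts box centers relative to the cell and widths/heights relative to the whole image, plus a confidence value. The grid size, input resolution, and (x, y, w, h, conf) tensor layout are assumptions.

```python
import torch

def decode_yolo_grid(pred: torch.Tensor, img_size: int = 448):
    """Decode an (S, S, 5) tensor of (x, y, w, h, conf) cell predictions into pixel boxes."""
    S = pred.shape[0]
    boxes = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf = pred[row, col]
            # x, y are offsets inside the cell; w, h are fractions of the whole image
            cx = (col + x.item()) / S * img_size
            cy = (row + y.item()) / S * img_size
            bw, bh = w.item() * img_size, h.item() * img_size
            boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2, conf.item()))
    return boxes

# usage with a random 7x7 grid prediction (placeholder for real network output)
boxes = decode_yolo_grid(torch.rand(7, 7, 5))
```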
Table 1. Models and features in related works (image view: side view or remote; approaches: localization and/or classification).
Papers  Side View  Remote  Localization  Classification  Techniques/Models
2017 [55] - x x - FusionNet
2017 [56] x - - x VGG16
2018 [57] x - x - Faster R-CNN+ResNet
2018 [58] - x x - ResNet-50
2018 [59] - x x - SNN
2018 [60] - x x x Faster R-CNN+Inception-ResNet
2018 [61] - x x - RetinaNet
2018 [62] - x x x R-CNN
2018 [63] - x x - R-CNN
2019 [64] - x - x VGG19
2019 [65] - x - x VGG16
2019 [66] x - - x Skip-ENet
2019 [67] - x x x Cascade R-CNN+B2RB
2019 [68] - x - x ResNet-34
2019 [69] x - x - YOLOv3
2019 [70] - x x x VGG16
2019 [71] x - x - Faster R-CNN
2020 [72] - x x x SSS-Net
2020 [73] - x x x YOLOv3
2020 [74] - x x x CNN
2020 [75] x - x - CNN Segmentation
2020 [76] - x x - YOLO
2020 [77] - x x x ResNet-50+RNP
2020 [78] x - - x CNN
2020 [79] x - x x YOLOv4
2020 [80] - x x - YOLOv3
2020 [81] - x - x VGG16
2020 [82] x - x - Mask R-CNN+YOLOv1
2021 [83] - x x x Mask RPN+DenseNet
2021 [84] - x x - VGG16
2021 [85] x - x x SSD MobileNetV2
2021 [86] x - x x YOLOv3
2021 [87] x - x - Faster R-CNN
2021 [88] x - x - R-CNN
2021 [89] x - x x BLS
2021 [90] x - x - YOLOv5
2021 [3] x - x x MobileNet+YOLOv4
2021 [91] - x x x Cascade R-CNN
2021 [92] x - x x YOLOv3
2021 [93] - x x x YOLOv3
2021 [94] x - x x YOLOv3
2021 [95] - x x x YOLOv4
2021 [96] x - x x ResNet-152
2021 [97] - x x - Faster R-CNN
2022 [92] x - x x YOLOv4
2022 [98] - x x - YOLOv3
2022 [99] x - x x MobileNetV2+YOLOv4
2022 [100] - x x x YOLOv5
 