2. Recognition
One of the most basic tasks in fisheries, aquaculture, and ecological monitoring is the detection and counting of fish and other relevant species. This can be done underwater in order to determine the population in a given area [11], over conveyor belts during buying, selling, and tank transference [12], or out of the water to determine, for example, the number of fish of each species captured during a catch. Estimating fish populations is important to avoid over-fishing and keep production sustainable [13,14], as well as for wildlife conservation and animal ecology purposes [11]. Automatically counting fish can also be useful for the inspection and enforcement of regulations [15]. Fish detection is often the first step of more complex tasks such as behavior analysis, detection of anomalous events [16], and species classification [17].
Lighting variations caused by turbidity, light attenuation at lower depths, and waves on the surface may cause severe detection problems, leading to error [5,10,14,18,19]. Possible solutions include using larger sensors with better light sensitivity (which usually cost more) or employing artificial lighting, which may not be feasible under natural conditions and can also attract or repel fish, skewing the observations [2]. Moreover, artificial lighting often makes the problem worse due to the backscattering of light in front of the camera. Other authors have combined infrared sensors with RGB cameras to increase the information collected under deep-water conditions [20]. Post-processing the images using denoising and enhancement techniques is another option that can at least partially address the issue of poor image quality [18,21], but this type of technique tends to be computationally expensive [19]. Finally, some authors exploit the results obtained under more favorable conditions to improve the analysis of more difficult images or video frames [22].
The background of images may contain objects such as cage structures, lice skirts, biofouling organisms, coral, seaweed, etc., which greatly increases the difficulty of the detection task, especially if some of those objects mimic the visual characteristics of the fish of interest [10]. Banno et al. [2] reported a considerable number of false positives due to complex backgrounds, but added that those errors could be easily removed manually. The buildup of fouling on the camera’s lenses was also pointed out by Marini et al. [23] as a potential source of error that should be prevented either by regular maintenance or by using protective gear.
One of the main sources of errors in underwater fish detection and counting is occlusion by other fish or objects, especially when several individuals are present simultaneously [23]. Some of the methods proposed in the literature were designed specifically to address this problem [5,24,25], but their success under uncontrolled conditions has been limited [2]. Partial success was achieved by Labao and Naval [14], who devised a cascade structure that automatically corrects the initial estimates by including the contextual information around the objects of interest. Another possible solution is the use of sophisticated tracking strategies applied to video recordings, but even in this case occlusions can lead to low accuracy (see Section 3.3). Structures and objects present in the environment can also cause occlusions, especially considering that fish frequently seek shelter and try to hide whenever they feel threatened. Potential sources of occlusion need to be identified and taken into account if the objective is to reliably estimate the fish population in a given area from digital images taken underwater [11].
Underwater detection, tracking, measurement, and classification of fish require dealing with the fact that individuals will cross the camera’s line of sight at different distances [26]. This poses several challenges. First, fish outside the range of the camera’s depth of field will appear out of focus, and the consequent loss of information can lead to error. Second, fish located too far from the camera will be represented by only a few pixels, which may not be enough for the task at hand [14], thus increasing the number of false negatives [23]. Third, fish that pass too close to the camera may not appear in their entirety in any given image/frame, again limiting the information available. Coro and Walsh [20] explored color distributions in the object to compensate for the lack of resolvability of fish located too close to the camera.
One way to deal with the difficulties mentioned so far is to focus the detection on certain distinctive body structures rather than the whole body. Costa et al. [27] dealt with problems caused by body movement, bending, and touching specimens by focusing the detection on the eyes, which represented each individual more unambiguously than their whole bodies. Qian et al. [28] focused on the fish heads in order to better track individuals in a fish tank.
The varying quality of underwater images poses challenges not only to automated methods but also to the human experts responsible for annotating the data [4]. Especially in the case of low-quality images, annotation errors can be frequent and, as a result, the model ends up being trained with inconsistent data [29]. Banno et al. [2] have shown that the counting results yielded by two different people can differ by more than 20%, and even repeated counts carried out by the same person can be inconsistent. Annotation becomes even more challenging and prone to subjectivity-related inconsistency with more complex detection tasks, such as pose estimation [30]. Given the intrinsic subjectivity of the annotation process, inconsistencies are mostly unavoidable, but their negative effects can be mitigated by using multiple experts and applying a majority rule to assign the definite labels [31]. The downside of this strategy is that manual annotation tends to be expensive and time-consuming, so the best strategy will ultimately depend on how reliable the annotated data needs to be.
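The majority rule mentioned above amounts to a few lines of code. The sketch below is illustrative only (the helper `majority_label` is hypothetical, not taken from any of the cited works); ties are returned as undecided so they can be routed to an additional annotator:

```python
from collections import Counter

def majority_label(annotations):
    # Majority rule over the labels assigned by multiple experts.
    # A tie means no majority exists, so None is returned and the
    # sample can be sent to an extra annotator. Hypothetical helper.
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority
    return counts[0][0]
```

For example, `majority_label(["salmon", "salmon", "trout"])` resolves to `"salmon"`, while `majority_label(["salmon", "trout"])` is undecided.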
With so many factors affecting the characteristics of the images, especially when captured under uncontrolled conditions, models must be prepared to deal with such variety. In other words, the dataset used to train the models needs to represent the range of conditions and variations expected in practice. In turn, this often means that thousands of images need to be captured and properly annotated, which explains why virtually all image datasets used in the reported studies have some kind of limitation that decreases the generality of the trained models [16,30,32] and, as a result, limits their potential for practical use [2,4]. This is arguably the main challenge preventing more widespread use of image-based techniques for fish monitoring and management. Given the importance of this issue, it is revisited from slightly different angles in both Section 3.4 and Section 4.
3. Measurement
Non-invasively estimating the size and weight of fish is very useful for both ecological and economic purposes. Biomass estimation in particular can provide clues about the feeding process, possible health problems, and potential production in fisheries. It can also reveal important details about the condition of wild populations in vulnerable areas. In addition, fish length is one of the key variables needed both for taking short-term management decisions and for modeling stock trends [1], and automating the measurement process can reduce costs and produce more consistent data [33,34]. Automatic measurement of body traits can also be useful after catch to quickly provide information about the characteristics of the fish batch, which can, for example, be done during transportation on conveyor belts [12].
Bravata et al. [35] enumerated several shortcomings of manual measurements. In particular, conventional length and weight data collection requires the physical handling of fish, which is time-consuming for personnel and stressful for the fish. Additionally, measurements are commonly taken in the field, where conditions can be suboptimal for ensuring precision and accuracy. This highlights the need for a more objective and systematic way to ensure accurate measurements.
Fish are not rigid objects, and models must learn how to adapt to changes in posture, position, and scale [1]. High accuracies have been achieved with dead fish in an out-of-water context using deep learning techniques [1,36,37], although even in those cases errors can occur due to unfavorable fish poses [38]. Measuring fish underwater has proven to be a much more challenging task, with high accuracies being achieved only under tightly controlled or unrealistic conditions [39,40,41], and even then, some kind of manual input is sometimes needed [42]. Despite the difficulties, some progress has been made under more challenging conditions [43], with body bending models showing promise when paired with stereo vision systems [33]. Other authors have employed a semi-automatic approach, in which the human user needs to provide some information for the system to perform the measurement accurately [44].
Partial or complete body occlusion is a problem that affects all aspects of image-based fish monitoring and management, but it is particularly troublesome in the context of fish measurement [37,45]. Although statistical methods can partially compensate for the lost information under certain conditions [1], errors caused by occlusions are usually unavoidable [43], even if a semi-automatic approach is employed [44].
Some studies have dealt with the problem of measuring different fish body parts for a better characterization of the specimens [43]. One difficulty with this approach is that the limits between different body parts are usually not clear even to experienced evaluators, making the problem relatively ill-defined. This is intrinsic to the problem, which means that some level of uncertainty will likely always be present.
One aspect of body measurement that is sometimes overlooked is that converting from pixels to a standard measurement unit such as centimeters is far from trivial [1]. First, it is necessary to know the exact distance between the fish and the camera in order to estimate the dimensions of each pixel, but this distance changes along the body contours, so in practice each pixel has a different conversion factor associated with it. The task is further complicated by the fact that pixels are not circles but squares: the diagonal of a pixel is more than 40% longer than any line parallel to the square’s sides. These facts make it nearly impossible to obtain an exact conversion, but properly defined statistical corrections can lead to highly accurate estimates [1]. Proper corrections are also critical to compensate for lens distortion, especially considering the growing use of robust and waterproof action cameras, which tend to have significant radial distortion [38].
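To illustrate why a single conversion factor is not enough, the sketch below uses an idealized pinhole-camera relation in which the projected size of a pixel grows linearly with the fish-to-camera distance. All names and parameter values are illustrative assumptions, not a procedure from the cited works, and lens distortion is ignored:

```python
import math

def pixels_to_cm(length_px, distance_cm, focal_length_mm, pixel_pitch_um):
    # Ideal pinhole camera: the physical footprint of one pixel on a
    # plane at distance_cm scales with distance / focal length.
    pixel_size_cm = (distance_cm * pixel_pitch_um * 1e-4) / (focal_length_mm * 0.1)
    return length_px * pixel_size_cm

# A diagonal run of N pixels spans sqrt(2) ~ 1.41x the length of an
# axis-aligned run of N pixels: the ~40% discrepancy noted in the text.
DIAGONAL_FACTOR = math.sqrt(2)
```

Under these assumptions, with a 10 mm lens and a 10 µm pixel pitch, a fish 1 m away projects about 1 mm per pixel, and the same fish doubles its pixel length when it halves its distance to the camera, which is why a per-pixel conversion factor is needed along a curved body contour.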
Most models are trained with maximum accuracy as the target, which normally means properly balancing false positives and false negatives. However, in some applications one type of error can be much more damaging than the other. In the context of measurement, fish need to be detected first and then properly measured. If spurious objects are detected as fish, their measurements will be completely wrong, which in practice may cause problems such as lowering the prices paid to fishermen or skewing inspection efforts [29].
Research on the use of computer vision techniques for measuring fish is still in its infancy. Because many studies aim at providing a solid proof of concept rather than generating models ready for practical use, the datasets used in such studies are usually limited in terms of both number of samples and variability [41,46,47]. As the state of the art evolves, more comprehensive databases will be needed (see Section 4). One negative consequence of dataset limitations is that overfitting occurs frequently [35]. Overfitting is a phenomenon in which the model adapts very well to the data used for training but lacks the generality to deal with new data, leading to low accuracies. A few measures can be taken to avoid overfitting, such as stopping training early and applying image augmentation to the training subset, but the best way to deal with the problem is to increase the size and variability of the training dataset [4,5].
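The image augmentation mentioned above can be as simple as random flips and brightness jitter. The sketch below is a minimal illustration (assuming images are NumPy arrays with values in [0, 1]), not a production pipeline:

```python
import numpy as np

def augment(image, rng):
    # Produce one randomly augmented copy of an H x W x C float image
    # in [0, 1]: a horizontal flip half of the time, plus brightness
    # jitter. Minimal sketch of the augmentation idea only.
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]  # mirror left-right
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)
    return out
```

Each call yields a slightly different training sample from the same source image, which artificially increases dataset variability without new annotation effort.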
One major reason for the lack of truly representative datasets for fish segmentation and measurement is that the point-level annotations needed in this case are significantly more difficult to acquire than image-level annotations. If the fish population is large, a more efficient approach is to indicate only that the image contains at least one fish and then let the model locate all the individuals in the image [32], thus effectively automating part of the annotation process. More research effort is needed to improve accuracy before this type of approach becomes viable.
4. Tracking
Many studies dedicated to the detection, counting, measurement, and classification of fish use individual images to reach their goal. However, videos or multiple still images are frequently used in underwater applications. This implies that each fish will likely appear in multiple frames/images, some of which will be more suitable for image analysis than others. Thus, considering multiple recognition candidates for the same fish is a reasonable strategy [6,17]. This approach implicitly requires that individual fish be tracked. Fish tracking is also a fundamental step in determining the behavior of individuals or shoals [28,48,49], which in turn is used to detect problems such as diseases [50], lack of oxygenation [51], the presence of ammonia [52] and other pollutants [53], feeding status [26,54], changes in the environment [51], welfare status [55,56], etc.
The term “tracking” is adopted here in a broad sense, as it includes not only studies dedicated to determining the trajectory of fish over time but also those focusing on the activity and behavior of fish over time, in which case the exact trajectory may not be as relevant as other cues extracted from videos or sequences of images [49].
There are many challenges to overcome for proper fish tracking. Arguably the most difficult one is keeping track of large populations containing many visually similar individuals, particularly when the intention is to track individual fish rather than whole shoals [13,57]. Occlusions can be particularly insidious because, as fish merge and separate, their identities can be swapped and tracking fails [58]. To deal with a problem this complex, some authors have employed deep learning techniques such as semantic segmentation [13], which can implicitly extract image features that enable more accurate tracking. Other authors adopted a sophisticated multi-step approach designed specifically for this kind of challenge [59]. However, when too little individual information is available, which is usually the case in densely packed shoals with a high rate of occlusions [29], camera-based individual tracking becomes nearly unfeasible. For this reason, some authors have adopted strategies that track the shoal as a whole rather than following individual fish [51,60].
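The identity-swap failure mode can be made concrete with the simplest association scheme: greedy matching of detections to tracks by bounding-box overlap. This is a generic baseline sketch, not the method of any cited work; when two fish occlude each other, both detections can score above the threshold for either track, and the greedy choice may swap them:

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(tracks, detections, threshold=0.3):
    # Greedily associate detections to existing tracks by decreasing IoU.
    # Returns {track_id: detection_index}; unmatched detections would
    # start new tracks elsewhere.
    pairs = sorted(((iou(box, det), tid, di)
                    for tid, box in tracks.items()
                    for di, det in enumerate(detections)), reverse=True)
    assigned, used = {}, set()
    for score, tid, di in pairs:
        if score < threshold:
            break
        if tid not in assigned and di not in used:
            assigned[tid] = di
            used.add(di)
    return assigned
```

The deep learning and multi-step approaches cited above exist precisely because this kind of appearance-free matching cannot recover the correct identities once visually similar fish cross paths.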
Another challenge is that fish become more difficult to detect and track as they move farther from the camera [13]. There are two main reasons for this. First, the farther the fish are from the camera, the fewer pixels are available to characterize the animal. Second, some level of turbidity will almost always be present, so visibility can decrease rapidly with distance. In addition, real underwater fish images are generally of poor quality due to limited range, non-uniform lighting, low contrast, color attenuation, and blurring [29]. These problems can be mitigated, but not completely overcome, using image enhancement and noise reduction techniques such as Retinex-based and bilateral trigonometric filters [13,50]. A possible way to deal with this issue is to employ multiple cameras providing an extended field of view, which can be very useful not only to counteract visibility issues but also to meet the requirements of shoal tracking and monitoring [51]. However, the additional equipment may raise costs to unacceptable levels and make it more complex to manage the system and to track fish across multiple cameras.
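As a much simpler stand-in for the Retinex-based and bilateral filters cited above, percentile-based contrast stretching illustrates what an enhancement step does to a low-contrast underwater frame. This is a sketch assuming an 8-bit grayscale frame, not the technique used in the cited works:

```python
import numpy as np

def stretch_contrast(gray, low_pct=2, high_pct=98):
    # Percentile-based contrast stretching: remap the intensity range
    # between the low_pct and high_pct percentiles onto the full
    # [0, 255] range, clipping outliers at both ends.
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:
        return gray.copy()  # essentially flat frame, nothing to stretch
    out = (gray.astype(np.float32) - lo) / (hi - lo)
    return (np.clip(out, 0.0, 1.0) * 255).astype(np.uint8)
```

Even this trivial operation makes dim, hazy frames easier to threshold and segment, which hints at why more sophisticated enhancement pays off despite its computational cost.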
Due to body bending while free swimming, the same individual can be observed with very different shapes, and fish size and orientation can vary [29]. If not taken into account, this can cause intermittency in the tracking process [28,48]. A solution frequently employed in such situations is to use deformable models capable of mirroring the actual fish poses [29,56]. Some studies explore the posture patterns of the fish to draw conclusions about their behavior and for early detection of potential problems [61].
Tracking is usually carried out on videos captured at a relatively high frame rate, so when occlusions occur, tracking may resume as soon as the individual reappears a few frames later. However, there are instances in which plants and algae (both moving and static), rocks, or other fish hide a target for too long for the tracker to be able to resume tracking properly. In such cases, it may be possible to apply statistical techniques (e.g., covariance-based models) to refine tracking decisions [59], but tracking failures are likely to happen from time to time, especially if many fish are being tracked simultaneously [28,62]. If the occlusion is only partial, approaches based on deep learning techniques have achieved some degree of success in avoiding tracking errors [63]. Another solution that has been explored is a multi-view setup in which at least two cameras with different orientations are used simultaneously for tracking [64]. Exploiting only the body parts with the most distinctive features, such as the head [65], is another way that has been tested to counterbalance the difficulties involved in tracking large groups of individuals. Under tightly controlled conditions, some studies have succeeded in identifying the right individuals and resuming tracking even days after the first detection [66].
As in the case of fish measurement, the majority of studies related to fish tracking are performed using images captured in tanks under at least partially controlled conditions. In addition, many of the methods proposed in the literature require that the data be recorded in shallow tanks with depths of no more than a few centimeters [62]. While these constraints are acceptable in prospective studies, they are often too restrictive for practical use. Thus, further progress depends on investigating new algorithms better adapted to the conditions expected in the real world.
One limitation of many fish tracking studies is that trajectories are followed in a 2D plane, while real movement occurs in three-dimensional space, thus limiting the conclusions that can be drawn from the data [62,67]. To deal with this limitation, some authors have been investigating 3D models more suitable for fish tracking [49,52,64,68,69]. Many of those efforts rely on stereo-vision strategies that require accurate calibration of multiple cameras or unrealistic assumptions about the acquired data, making them unsuitable for real-time tracking [62]. This has led some authors to explore single sensors capable of acquiring depth information, such as Microsoft’s Kinect, although in this case the maximum distance for detectability can be limited [62].