Real-Time Automatic Drone Surveillance and Wildlife Monitoring: Comparison

Wildlife monitoring can be time-consuming and expensive, but the fast-developing technologies of uncrewed aerial vehicles, sensors, and machine learning pave the way for automated monitoring.

  • wildlife monitoring
  • uncrewed aerial systems
  • UAV

1. Introduction

The use of aerial drones for wildlife monitoring has increased exponentially in the past decade [1][2][3][4][5][6][7]. These drones, also known as uncrewed aerial vehicles (UAVs), unmanned aerial systems (UASs), and remotely piloted aircraft systems (RPASs), can carry a variety of sensors, including high-resolution visible-light (RGB) cameras and thermal infrared (TI) cameras. As the technologies advance and the price of these drones and sensors drops, they become more accessible to conservation biologists, wildlife managers, and other professionals working with wildlife monitoring [2][3][4][5]. Drones have already been shown to save time in wildlife monitoring, to produce better imagery and spatial data, especially for cryptic and nocturnal animals [8][9], and to reduce the risks and hazards for the observer [10][11]. However, the methods are still in their early stages and need further development before they are truly superior to, and more cost-effective than, traditional monitoring methods. Automatic detection is pivotal for this development, and computer vision is likely to be the solution [1].

2. Automatic Detection and Computer Vision

Over the past decade, artificial intelligence has led to significant progress in the domain of computer vision, automating image and video analysis tasks. Among computer vision methods, Convolutional Neural Networks (CNNs) are particularly promising for future advances in automating wildlife monitoring [6][12][13][14][15][16][17][18]. Corcoran et al. [3] concluded that, when implementing automatic detection, fixed-wing drones with RGB sensors are ideal for detecting larger animals in open terrain, whereas, for small, elusive animals in more complex habitats, multi-rotor systems with infrared (IR) or thermal infrared sensors are the better choice, especially when monitoring cryptic and nocturnal animals. They also noted a knowledge gap in understanding how the chosen drone platform, sensors, and survey design affect the false positive detections made by the trained models, potentially leading to overestimates [3].

3. You-Only-Look-Once-Based UAV Technology

A popular and open-source group of CNNs is the YOLO (You Only Look Once) family of object detection and image segmentation models, which has several iterations and is under active development [14][19][20][21][22]; a technology cross-fusion with drones has already been proposed as YOLO-Based UAV Technology (YBUT) [6]. The advantages of the YOLO models are that they are fast [8], making it possible to perform object detection in real time on live footage, and that they are relatively user-friendly and intuitive, making them approachable to non-computer scientists. Because they are accessed through the Python programming language, they are also open to custom development, and they can be deployed on external hardware so that, for example, object detection can be carried out in real time onboard a drone. Object detection and tracking of cars and people are already integrated into several unmanned aerial systems, such as the DJI Matrice 300 RTK [23], but customization of these systems is limited. The YOLO framework and YBUT show potential for active community development [6][24]. Examples of this are architectures based on YOLOv5 that improve the model's ability to detect very small objects in drone imagery [12][25], an improved infrared image object detection network, YOLO-FIRI [26], and an improved YOLOv5 framework to detect wildlife in dense spatial distributions [17].
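As a rough illustration of how approachable such a workflow can be, the minimal sketch below uses the open-source ultralytics Python package to run a YOLO model on a live video stream frame by frame. The weights file "thermal_wildlife.pt", the stream address, and the confidence threshold are placeholder assumptions for illustration, not details from the cited studies.

# Minimal sketch: run a YOLO model on a live video stream with the
# open-source "ultralytics" Python package. "thermal_wildlife.pt" is a
# hypothetical custom-trained weights file, and the RTSP address is a
# placeholder for whatever stream the drone platform provides.
from ultralytics import YOLO

model = YOLO("thermal_wildlife.pt")  # hypothetical custom weights

# stream=True yields results frame by frame instead of loading the whole
# source into memory, which is what a real-time pipeline would need.
for result in model.predict(source="rtsp://192.168.1.10/live", stream=True, conf=0.25):
    for box in result.boxes:
        cls_name = result.names[int(box.cls)]
        confidence = float(box.conf)
        print(f"Detected {cls_name} with confidence {confidence:.2f}")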

4. Mean Average Precision

When training neural networks, here called models, one of the main parameters describing the performance of a model is the mean average precision (mAP) [27]. This is a metric used to evaluate how well a model predicts bounding boxes at different confidence levels, and thereby to compare the precision of the trained model with that of other models applied to the same test dataset. A training dataset is typically a collection of manually annotated images divided into a set for the training itself, a validation set, and a test set, also known as dataset splitting [27]. The validation set is used to detect overfitting of a trained model, and the test set is used to evaluate its performance on unseen data. Mean average precision builds on several parameters: precision, recall, and intersection over union (IoU) [18][27]. The precision of a model, calculated as the number of true positives divided by the sum of true and false positives generated by the model, describes the proportion of positive predictions that are correct. Precision does not, however, take false negatives into account. The recall of a model, calculated as the number of true positives divided by the sum of true positives and false negatives, describes how many of the actual positives the model correctly detects. There is thus a trade-off between precision and recall: making more predictions at a lower confidence level reduces precision but increases recall. Precision–recall curves visualize how the precision of the model behaves as the selected confidence threshold changes. The IoU measures the overlap between a manually annotated bounding box on an image from the test dataset and the bounding box predicted by the trained model on the same image. The IoU therefore indicates how much of the object of the specified class, and how much of the surroundings, is included in the detection. The mAP is the mean of the precision–recall curves over all classes and over all IoU thresholds for each class, so it takes into account both the numbers of false negatives and false positives and how precisely the bounding boxes are drawn around the object to be detected [18][27].
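As a minimal illustration of these definitions, the sketch below computes the IoU of two axis-aligned bounding boxes and the precision and recall from detection counts; the box coordinates and counts are invented numbers used purely for illustration.

# Minimal sketch of the metrics described above, with boxes given as
# (x_min, y_min, x_max, y_max). All numbers are illustrative only.

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned bounding boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# A predicted box compared with its manually annotated ground truth:
print(iou((10, 10, 50, 50), (12, 8, 48, 52)))   # ~0.83

# 8 correct detections, 2 spurious detections, 3 animals missed:
print(precision_recall(8, 2, 3))                # (0.8, ~0.73)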
Povlsen et al. [28] flew predetermined flight paths at 60 m altitude with a DJI Mavic 2 Enterprise Advanced with the thermal camera pointing directly down (90°), covering transects that were simultaneously surveyed from the ground, monitoring hare, deer, and fox. Using transect counting, it was possible to spot roughly the same number of animals as with the traditional ground-based spotlight count [28]. However, this method covered a relatively small area per flight and required post-processing of the captured imagery, still making it time-consuming. In the present study, the researchers tried a slightly different approach by manually piloting the UAV continuously, using the scouring method, which has also been shown to match and potentially surpass the traditional spotlight method [9]. By scouring the area with the camera angled at about 45°, the researchers attained better situational awareness and covered a larger area per flight. This approach does require some experience from the drone pilot [24], both in piloting the drone and camera and in spotting animals in thermal imagery, but, as the researchers show, there is potential in automating this approach using machine learning (ML) to improve post-processing efficiency and possibly even to collect data in real time automatically while the drone is airborne, as sketched below.
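As a hedged sketch of what automated post-processing of a survey flight could look like, the example below runs a trained model over a recorded video and writes every detection to a CSV file, so that counting can happen after landing rather than by manual review. The weights file "thermal_wildlife.pt", the video and output file names, and the confidence threshold are assumptions for illustration, not details from the cited surveys.

# Minimal sketch: log every detection in a recorded survey video to a CSV
# file for later counting. File names and the threshold are hypothetical.
import csv
from ultralytics import YOLO

model = YOLO("thermal_wildlife.pt")  # hypothetical custom weights

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame", "species", "confidence"])
    # stream=True processes the recording frame by frame.
    for frame_idx, result in enumerate(
        model.predict(source="survey_flight.mp4", stream=True, conf=0.4)
    ):
        for box in result.boxes:
            writer.writerow(
                [frame_idx, result.names[int(box.cls)], f"{float(box.conf):.2f}"]
            )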