Image-Based Obstacle Detection Methods: Comparison
Please note this is a comparison between Version 1 by Masood Varshosaz and Version 2 by Jason Zhu.

Mobile robots lack a driver or a pilot and, thus, should be able to detect obstacles autonomously. This work reviews image-based obstacle detection techniques for mobile robots such as Unmanned Surface Vehicles (USVs), Unmanned Aerial Vehicles (UAVs), and Micro Aerial Vehicles (MAVs). The techniques are divided into monocular and stereo. The former uses a single camera, while the latter makes use of images taken by two synchronised cameras. Monocular obstacle detection methods are discussed in appearance-based, motion-based, depth-based, and expansion-based categories. Monocular approaches have simple, fast, and straightforward computations. Thus, they are more suited to robots like MAVs and compact UAVs, which usually are small and have limited processing power. On the other hand, stereo-based methods use pair(s) of synchronised cameras to generate a real-time 3D map of the surrounding objects to locate the obstacles. Stereo-based approaches are classified into Inverse Perspective Mapping (IPM)-based and disparity histogram-based methods. Whether aerial or terrestrial, disparity histogram-based methods suffer from common problems: computational complexity, sensitivity to illumination changes, and the need for accurate camera calibration, especially when implemented on small robots.

  • obstacle detection
  • image-based
  • UAV

1. Introduction

The use of mobile robots such as Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) has increased in recent years for photogrammetry [1][2][3][1,2,3] and many other applications. A remotely piloted robot should be able to detect obstacles automatically. In general, obstacle detection techniques can be divided into three groups: image-based [4][5][4,5], sensor-based [6][7][6,7], and hybrid [8][9][8,9]. In sensor-based methods, various active sensors such as lasers [10][11][12][13][10,11,12,13], radar [14][15][14,15], sonar [16], ultrasonic [17][18][17,18], and Kinect [19] have been used.
Sensor-based methods have their own merits and disadvantages. For instance, in addition to being reasonably priced, sonar and ultrasonic sensors can determine the direction and position of an obstacle. However, sonar and ultrasonic waves are affected by both constructive and destructive interference of ultrasonic reflections from multiple environmental obstacles [20][29]. In some situations, radar waves may be an excellent alternative, mainly when no visual data is available. Nevertheless, radar sensors are not small or light, which means installing them on small robots is not always feasible [21][22][30,31]. Moreover, infrared waves have a limited Field Of View (FOV), and their performance is dependent on weather conditions [23][32]. Despite being a popular sensor, LiDAR is relatively large and, thus, cannot always be installed on small robots like MAVs. Therefore, despite their popularity and ease of use, active sensors may not be ideal for obstacle detection when weight, size, energy consumption, sensitivity to weather conditions, and radio-frequency interference issues matter [23][24][32,33].
An alternative to active sensors is lightweight cameras, which provide visual information about the environment the robot travels in. Cameras are passive sensors and are used by numerous image-based algorithms to detect obstacles using grayscale values [25][34] and point [26][35] or edge [27][28][36,37] features. They can provide details regarding the amount of the robot’s movement or displacement and the obstacle’s colour, shape, and size [29][38]. In addition to enabling real-time and safe obstacle detection, image-based techniques are not disturbed by environmental electromagnetic noise. Additionally, the visual data can be used to guide the robot through various image-based navigation techniques currently available.


2. Monocular Obstacle Detection Techniques

2.1. Appearance-Based

These methods consider an obstacle as a foreground object against a uniform background (i.e., ground or sky). They work based on some prior knowledge of the relevant background in the form of edge [21][30], colour [25][34], texture [30][56], or shape [23][32] features. Obstacle detection is performed on single images taken sequentially using a camera mounted in front of the robot. Each pixel of the acquired image is examined to see if it conforms to the sky or ground features; if it does not, it is considered an obstacle pixel. The result is a binary image in which obstacles are presented in white and the rest in black pixels.
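The per-pixel test described above can be illustrated with a minimal Python sketch. It assumes the background is summarised by a single reference colour and that a pixel conforms to it when its Euclidean colour distance is within a tolerance; the function name, the reference colour, and the tolerance are illustrative assumptions, not part of any cited method, which typically use richer edge, texture, or learned models.

```python
from math import dist

def obstacle_mask(image, background_rgb, tolerance):
    """Return a binary mask: 1 (white) = obstacle pixel, 0 (black) = background.

    image          -- list of rows, each row a list of (R, G, B) tuples
    background_rgb -- assumed reference colour of the ground or sky
    tolerance      -- assumed maximum colour distance for a background pixel
    """
    mask = []
    for row in image:
        # A pixel that does not conform to the background model is an obstacle.
        mask.append([0 if dist(pixel, background_rgb) <= tolerance else 1
                     for pixel in row])
    return mask

# Toy 2x3 image: grey "ground" with a single red object pixel.
image = [
    [(120, 120, 120), (122, 118, 121), (200, 30, 30)],
    [(119, 121, 120), (121, 120, 119), (118, 122, 120)],
]
mask = obstacle_mask(image, background_rgb=(120, 120, 120), tolerance=10.0)
```

A single fixed reference colour is the simplest possible background model; it also makes visible why such methods degrade when illumination differs from what the model was built for.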
In terrestrial robots, the ground data, such as the road or floor, are detected first and then used to detect the obstacles. Ulrich and Nourbakhsh [30][56] proposed a technique where each pixel in the image is labelled as an obstacle or ground using its pixel values. Their system is trained by moving the robot within different environments. As a result, in practical situations, if the illumination conditions vary from those used during the training phase, obstacles will not be identified effectively. In another study, Lee et al. [23][32] used Markov Random Field (MRF) segmentation to detect small and narrow obstacles in an indoor environment. However, the camera should not be more than 6.3 cm away from the ground [23][32]. Shih An et al. [31][58] used an omnidirectional camera system and the Hue Saturation Value (HSV) colour model to separate obstacles. They created a binary image and filtered out the image noise using a breadth-first search technique. In another study [32][59], object-based background subtraction and image-based obstacle detection techniques were used for static and moving objects, with a single wide-angle camera providing real-time obstacle detection. Moreover, Liu et al. [33][57] proposed a real-time monocular obstacle detection method for USVs, like boats or ships, based on water horizon line identification and saliency estimation. The system was developed to detect objects below the water edge that may pose a threat to USVs. They claimed their method outperforms similar state-of-the-art techniques [33][57].
Conventional image processing techniques do not usually meet the expectations of real-time applications. Therefore, recent research has focused on increasing the speed of obstacle detection using Convolutional Neural Networks (CNNs). For example, to increase the speed, Talele et al. [34][60] used TensorFlow [35][61] and OpenCV [36][62] to detect obstacles by scanning the ground for distinct pixels and classifying them as obstacles. Similarly, Rane et al. [37][63] used TensorFlow to identify pixels different from the ground. Their method was real-time and applicable to various environments. To recognise and track typical moving obstacles, Qiu et al. [38][52] used YOLOv3 and Simple Online and Real-time Tracking (SORT). To solve the low accuracy and slow reaction time of existing detection systems, the You Only Look Once v4 (YOLOv4) network, which is an upgraded version of YOLO [39][64], was proposed by He et al. [40][65]. It improved the recognition of obstacles at medium and long distances [40][65].
Furthermore, He and Liu [41][66] developed a real-time technique for fusing features to boost the effectiveness of detecting obstacles in misty conditions. Additionally, Liu et al. [42][67] introduced a novel semantic segmentation algorithm based on a spatially constrained mixture model for real-time obstacle detection in marine environments. A Prior Estimation Network (PEN) was proposed to improve the mixture model.
As for airborne robots, most research looks for a way to separate the sky from the ground. For example, Huh et al. [21][30] separated the sky from the ground using a horizon line. They then determined moving obstacles using the particle filter algorithm. Their method could be used in complex environments and low-altitude flights. Despite being efficient in detecting moving obstacles, their technique could not be used for stationary obstacles [21][30]. In another study, a method was introduced by Mashaly et al. [25][34] to find the sky in a complex environment, with obstacles separated from the sky in a binary image [25][34]. De Croon and De Wagter [43][68] suggested a self-supervised learning method to discover the horizon line [43][68]. In another study [32][59], a single wide-angle camera was used for real-time obstacle detection. The object-based background subtraction and image-based obstacle detection techniques were used for static and moving objects.
As can be seen, appearance-based methods have been used on various robots. Every study has had its own concerns. One study has investigated narrow and small obstacle detection (i.e., [23][32]), whereas moving obstacle detection has been the centre of focus in a few others. Except for Shih An et al. [31][58], other articles have focused on detecting obstacles in front of the robot. It is also evident that the attention in the last three years has been on the speed of obstacle detection procedures. Appearance-based methods are generally limited to environments where obstacles can easily be distinguished from the background. This assumption can easily be violated, particularly in complex environments containing objects like buildings, trees, and humans [44][69] with varying shapes and colours. Some of the techniques are affected by the distance to the object or the noise in the images. Moreover, when using deep learning approaches, the detection of obstacles is primarily affected by changes in the environment and by the number and types of samples in the training data set. Therefore, researchers suggest using a semantic segmentation algorithm performed by deep learning networks, provided that sufficient training data is available. Another alternative would be to enrich appearance-based methods by providing them with distance-to-object data that can be obtained using a sensor or a depth-based algorithm (see the following sections).

2.2. Motion-Based

In motion-based methods, it is assumed that nearby objects have sharp movements that can be detected using motion vectors in the image. The process involves taking two successive images or frames in a very short time. At first, several match points are extracted on both frames. Then, the displacement vectors of the match points are computed. Since objects closer to the camera have larger displacements, any point with a displacement value that exceeds a particular threshold is considered an obstacle pixel.
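The thresholding step above can be sketched in a few lines of Python. The sketch assumes the point matching has already been done and is given as pairs of pixel coordinates; the function name and the threshold value are illustrative assumptions.

```python
from math import hypot

def motion_obstacles(matches, threshold_px):
    """Flag matched points whose image displacement exceeds a threshold.

    matches      -- list of ((x1, y1), (x2, y2)) pairs: the same point in
                    two successive frames (matching assumed done elsewhere)
    threshold_px -- assumed displacement threshold, in pixels
    """
    obstacles = []
    for (x1, y1), (x2, y2) in matches:
        displacement = hypot(x2 - x1, y2 - y1)  # length of the motion vector
        if displacement > threshold_px:
            # Larger displacement suggests the point is closer to the camera.
            obstacles.append((x2, y2))
    return obstacles

matches = [((10, 10), (11, 10)), ((50, 40), (62, 49)), ((80, 20), (80, 22))]
near_points = motion_obstacles(matches, threshold_px=5.0)
```

Here only the second match moves by 15 pixels and is flagged; the other two move by 1 and 2 pixels and are treated as distant.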
Various studies have been conducted in this field. Jia et al. [45][70] introduced a novel method that uses motion features to distinguish obstacles from shadows and road markings. Instead of using all pixels, they only used corners and Scale Invariant Feature Transform (SIFT) features to achieve real-time obstacle detection. Such an algorithm can fail if the number of mismatched features is high [45][70].
Optical flow is the data used in most motion-based approaches. Ohnishi and Imiya [46][71] prevented a mobile robot from colliding with obstacles without having a map of the environment. Gharani and Karimi [47][53] used two consecutive frames to estimate the optical flow for obstacle detection on smartphones to help visually impaired people navigate indoor environments. Using a context-aware data combination method, they determined the distance between two consecutive frames. Tsai et al. [48][72] used a Support Vector Machine (SVM) [49][73] to validate Speeded-up Robust Features (SURF) [50][74] point detector locations as obstacles. In this research, a dense optical flow approach was used to extract the data for training the SVM. Then, they used obstacle points and measures related to the spatial weighted saliency map to find the obstacle locations. The algorithm presented in their research applies to mobile robots with a camera installed at low altitudes. Consequently, it might not be possible to use it on UAVs that usually fly at high altitudes.
Motion-based obstacle detection relies mainly on the quality of the matching points. Thus, its quality can decrease if the number of mismatched features is high. In addition, if optical flow is used for motion estimation, care must be taken with image points close to the centre. This is because, in optical flow, the number of motion vectors there is not high. Indeed, detecting obstacles in front of the robot using optical flow is still challenging [51][52][75,76]. To resolve this problem, researchers suggest using an expansion-based approach to detect obstacles in the central parts of the images. This integration ensures that the strength of the expansion-based technique in detecting frontal objects is employed, while the other parts are analysed by the motion-based method. Alternatively, researchers may use deep learning networks to solve such problems.

2.3. Depth-Based

Like motion-based methods, depth-based approaches obtain depth information from images taken by a single camera. There are two ways to accomplish this, the first being motion stereo and the second being deep learning. In the former, a pair of consecutive images is captured by a camera mounted on the robot. Although these images are taken using only a single camera, they can be considered a pair of stereo images, from which the depth of object points can be estimated. For this, the images are searched for matching points. Then, using standard depth estimation calculations [53][77], the depth of object points is computed. Pixels whose depth is less than a threshold value are regarded as obstacles.
A recent alternative to the above process is employing a deep learning network. At first, the network is trained using appropriate data, so it can produce a depth map from a single image [54][78]. Samples of such networks can be found in [55][56][79,80]. The process then tests any image taken by the robot’s camera to determine the depth of its pixels. Then, similar to a classic approach, pixels having a depth smaller than a threshold are considered obstacles.
In some depth-based methods, instead of using motion vectors, a complete three-dimensional model of the surroundings is constructed and used to detect nearby obstacles [57][58][81,82]. Some of these methods use motion stereo. For example, Häne et al. [59][83] used motion stereo to produce depth maps using four fisheye images. In this system, an object on the ground was considered an obstacle. Such an algorithm cannot detect moving obstacles. Moreover, it provides a complete map of the environment, which requires complex computations.
In another study, Lin et al. [60][84] used a fisheye camera and an Inertial Measurement Unit (IMU) for autonomous navigation of a MAV. As part of the system, an algorithm was developed to detect obstacles in a wide FOV using fisheye images. Each fisheye image was converted into two distortion-free pinhole images with a combined horizontal viewing angle of 180°. Depth estimation was based on keyframes. Because the depth can only be estimated when the drone moves, this system will not work on MAVs in hovering mode. Moreover, as the quality of the parts on the sides of a fisheye image is low, the accuracy of the resulting depth image can be low. Besides, the production of two horizontal pinhole images can decrease the vertical FOV and, thus, limit the areas where obstacles can be detected.
Artificial neural networks and deep learning have been used to estimate depth in recent years [61][85]. Contrary to methods like 3D model construction, deep learning-based techniques do not require complex computations for obstacle detection. Kumar et al. [62][86] used four single fisheye images and a CNN to estimate the depth in all directions. They used LiDAR data as ground truth for depth estimation to train the network. Their self-driving car dataset was collected with a 64-beam Velodyne LiDAR and four wide-angle fisheye cameras. In this study, the distortion of the fisheye image was not corrected. It is recommended to improve the results by using more consecutive frames to exploit the motion parallax and by using better CNN encoders [62][86]. Their research required further training. Therefore, another future goal for this work is to improve semi-supervised learning using synthetic data and to run unsupervised learning algorithms [62][86]. In another study, Mancini et al. [63][87] developed a new CNN framework that uses image features obtained via fine-tuning the VGG19 network to compute the depth and consequently detect obstacles.
Moreover, Haseeb et al. [64][88] presented DisNet, a distance estimation system based on multi-hidden-layer neural networks. They evaluated the system under static conditions, while evaluation of the system mounted on a moving locomotive remained a challenge. In another study, Hatch et al. [65][54] presented an obstacle avoidance system for small UAVs, in which the depth is computed using a vision algorithm. The system works by incorporating a high-level control network, a collision prediction network, and a contingency policy. Urban and Caplier [52][76] developed a navigation module for visually impaired pedestrians, using a video camera in an intelligent lightweight glasses device. It includes two modules: a static data extractor and a dynamic data extractor. The first is a convolutional neural network used to determine the obstacle’s location and distance from the robot. The second is a fully connected neural network that computes the Time-to-Collision by stacking the obstacle data from multiple frames.
Furthermore, some researchers have developed methods to create and use a semantic map of the environment to recognise obstacles [66][67][89,90]. A semantic map is a representation of the robot’s environment that incorporates both geometric data (e.g., height, roughness) and semantic data (e.g., navigation-relevant classes such as trail, grass, obstacle, etc.) [68][91]. When used in urban autonomous vehicle applications, semantic maps can provide the vehicles with a longer sensing range and greater manoeuvrability than onboard sensory devices. Some studies have used multi-sensor fusion to improve the robustness of the segmentation algorithms used to create semantic maps. More details can be found in [69][70][71][72][73][92,93,94,95,96].

2.4. Expansion-Based

These methods employ the same principle used by humans to detect obstacles, i.e., the object expansion rate between consecutive images. An object appears continuously larger as it approaches. Thus, to determine obstacles, points and/or regions on two sequential images can be used to estimate the object’s enlargement value. This value could be computed between homologous areas, distances, or even the SIFT scales of the extracted points. In expansion-based algorithms, if the enlargement value relating to an object exceeds a specific threshold, that object is considered an obstacle.
Several expansion-based studies have been conducted, using sequential frames and various enlargement criteria to detect obstacles. For example, Mori and Scherer [74][98] used the characteristics of the SURF algorithm to detect the initial positions of obstacles that differed in size. This algorithm has simple calculations but may fail due to its slow reaction time to obstacles. In another study, Zeng et al. [44][69] used edge motion in two successive frames to identify approaching obstacles. If the object’s edge shifts outwards (relative to its centre in successive frames), the object is growing larger [44][69]. This approach applies to both fixed and mobile robots when the background is homogeneous. However, if the background is complicated, this approach only applies to static objects. Aguilar et al. [75][99] only detected obstacles conforming to some primary patterns. As a result, obstacles other than those following the predefined patterns cannot be identified.
To detect obstacles, Al-Kaff et al. [26][35] used SIFT [76][100] to extract and match points across successive frames. They then formed the convex hull of the matched points. The points were regarded as obstacle points if the change in their SIFT scale values and the convex hull area exceeded a certain threshold. The technique may simultaneously identify both near and far points as obstacles. As a result, the mobile robot will have limited manoeuvrability in complex environments. In addition, the convex hull area change ratio criterion loses its efficiency if the corresponding points are wrong.
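The two expansion measures mentioned above, the change in convex hull area and the change in feature scale, can be sketched as follows. This is a simplified illustration rather than the cited authors' implementation: it assumes matched points and their feature scales are already available, uses Andrew's monotone chain for the hull, and treats both criteria as ratios compared against assumed thresholds.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0

def expansion_measures(prev_pts, curr_pts, prev_scales, curr_scales):
    """Return (hull area ratio, mean feature-scale ratio) between two frames."""
    prev_area = polygon_area(convex_hull(prev_pts))
    if prev_area == 0:
        raise ValueError("degenerate hull in the previous frame")
    area_ratio = polygon_area(convex_hull(curr_pts)) / prev_area
    scale_ratio = sum(c / p for p, c in zip(prev_scales, curr_scales)) / len(prev_scales)
    return area_ratio, scale_ratio

# Matched points on an object that appears 1.5x larger in the second frame.
prev_pts = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
curr_pts = [(0, 0), (3, 0), (3, 3), (0, 3), (1.5, 1.5)]
area_ratio, scale_ratio = expansion_measures(prev_pts, curr_pts, [2.0] * 5, [3.0] * 5)
# Assumed thresholds: both measures must exceed 1.2 to flag an obstacle.
is_obstacle = area_ratio > 1.2 and scale_ratio > 1.2
```

Because the hull is built from all matched points at once, a single wrong correspondence can inflate or shrink the area, which is the weakness noted in the text.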
To solve this problem, Badrloo and Varshosaz [77][55] identified obstacle points as those with an average distance ratio greater than a specified threshold. Their technique was able to distinguish far and near obstacles properly. Others have solved the problem in different ways. For example, similar to Badrloo and Varshosaz [77][55], Padhy et al. [28][37] computed the Euclidean distance between each point and the centroid of all other matched points. In another study, Escobar et al. [78][101] computed the optical flow to obtain the expansion rate for obstacle recognition in unknown and complex environments. Badrloo et al. [79][102] later used the expansion rate of region areas for accurate obstacle detection.
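A distance-based expansion criterion of the kind described above can be sketched as below. This is a simplified reading under stated assumptions, not the exact formulation of either cited study: for each matched point, the distance to the centroid of all other matched points is measured in both frames, and the ratio of the two distances is compared with an assumed threshold.

```python
from math import dist

def centroid(points):
    """Mean position of a non-empty list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def distance_ratios(prev_pts, curr_pts):
    """Per-point ratio of distance-to-centroid between two frames.

    prev_pts[i] and curr_pts[i] are assumed to be the same matched point;
    at least two matched points are required.
    """
    ratios = []
    for i in range(len(prev_pts)):
        others_prev = prev_pts[:i] + prev_pts[i + 1:]
        others_curr = curr_pts[:i] + curr_pts[i + 1:]
        d_prev = dist(prev_pts[i], centroid(others_prev))
        d_curr = dist(curr_pts[i], centroid(others_curr))
        # Guard against a point coinciding with the centroid.
        ratios.append(d_curr / d_prev if d_prev > 0 else 1.0)
    return ratios

# Three matched points on an object that appears 1.5x larger in frame two.
prev_pts = [(10.0, 10.0), (20.0, 10.0), (15.0, 20.0)]
curr_pts = [(15.0, 15.0), (30.0, 15.0), (22.5, 30.0)]
ratios = distance_ratios(prev_pts, curr_pts)
# Assumed threshold of 1.2 on the expansion ratio.
obstacle_flags = [r > 1.2 for r in ratios]
```

A ratio is produced for every point separately, so a threshold can be applied per point instead of to one hull-wide value.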
Recently, deep learning solutions have been proposed to improve both the speed and the accuracy of obstacle detection, especially in complex and unknown environments. For instance, Lee et al. [5] detected obstacle trees in tree plantations. They trained a machine learning model, the so-called Faster Region-based Convolutional Neural Network (Faster R-CNN), to detect tree trunks for drone navigation. This approach uses the ratio of an obstacle height in the image to the image height. Additionally, the image widths between trees were used to find obstacle-free pathways.
Compared to other monocular techniques, expansion-based approaches employ a simple principle, i.e., the expansion rate. Such techniques are fast, as they do not require extensive computations. However, they may fail when the surrounding objects become complex. Thus, in recent years, deep neural networks have been employed to meet the expectations of real-time applications.
As seen from the above, expansion-based approaches use points or convex hulls to detect obstacles to increase speed. This leads to the inclusion of incomplete obstacle shapes, which can limit their accuracy. A recent method provides regions of an obstacle for complete and precise obstacle detection [79][102], although it does not yet meet the requirements of real-time applications. Researchers suggest using methods based on deep neural networks to accelerate the complete and precise detection of obstacles in expansion-based methods.

3. Stereo-Based Obstacle Detection Techniques

Obstacle detection based on stereo uses two synchronised cameras fixed on the robot [27][36].

3.1. IPM-Based Method

The IPM-based methods were primarily used to detect all types of road obstacles [80][104] and to eliminate the perspective effect of the original images in lane detection problems [81][82][105,106]. Currently, IPM images are mostly used in monocular methods [83][84][85][107,108,109].
Assuming the road has a flat surface, the IPM algorithm produces an image representing the road seen from the top, using the internal and external parameters of the cameras. Then, the difference in the grey levels of pixels in the overlapping regions is computed, from which a polar histogram image is generated. If the image textures are uniform, this histogram contains two triangles/peaks: one for the lane and one for the potential obstacle. These peaks are then used to detect the obstacle, i.e., the non-lane object. In effect, obstacle detection relies on identifying these two triangles based on their shapes and positions. In practice, it becomes difficult to form such ideal triangles due to the diversity of textures in the images, objects of irregular shapes, and variations in the brightness of the pixels.
There is limited research on stereo IPM obstacle detection [86][110]. An example is that by Bertozzi et al. [86][110] for short-range obstacle detection. This method detects obstacles using the difference between the left and right IPM images. Although it may be accurate in some conditions, it has a limited range and cannot show the actual distance to obstacles. Kim et al. [87][111] used a stereo pair of cameras to create two IPM images for each camera. These images were then combined with another IPM image, created using a pair of consecutive images taken with the camera having a smaller FOV, to detect the obstacles.
Although IPM-based methods are very fast, they have two notable limitations. First, since they use object portions with uniform texture or colour for obstacle detection [88][89][112,113], they can only be used to detect objects like a car that has a uniform material [90][114]. Second, errors in the homography model, unknown camera motion, and light reflection from the floor can generate noise in the images [90][114]. Indeed, implementing the IPM transform requires a priori knowledge of the specific acquisition conditions (camera location, orientation, etc.) and some assumptions regarding the objects being imaged. Consequently, it can only be utilised in structured environments, where, for instance, the camera is fixed or when the system calibration and the surrounding environment can be monitored by another type of sensor [91][115]. Due to the limitations of this method, researchers recommend using it only for lane detection and obstacle detection in cars. This is because the necessary data and conditions for this method are not necessarily available in unknown environments, particularly when using drones.
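The dependence of IPM on the flat-road assumption and on known camera parameters can be made concrete with a small geometric sketch. It assumes an ideal pinhole camera with focal length and principal point in pixels, mounted at a known height and pitched downward by a known angle, with no roll or lens distortion; all names and values are illustrative. Each pixel's viewing ray is intersected with the ground plane to obtain its top-view position.

```python
from math import sin, cos, radians

def pixel_to_ground(u, v, focal_px, cx, cy, cam_height_m, pitch_rad):
    """Map an image pixel to ground-plane coordinates under a flat-road model.

    Returns (lateral_m, forward_m) relative to the point directly below the
    camera, or None when the pixel lies at or above the horizon.
    Assumes a pinhole camera pitched down by pitch_rad, with no roll.
    """
    # Normalised image coordinates: x to the right, y downwards.
    x = (u - cx) / focal_px
    y = (v - cy) / focal_px
    s, c = sin(pitch_rad), cos(pitch_rad)
    # Downward component of the viewing ray in the world frame.
    down = y * c + s
    if down <= 0:
        return None  # the ray never reaches the ground
    t = cam_height_m / down  # scale at which the ray hits the ground plane
    return (t * x, t * (c - y * s))

# Camera 1 m above the road, pitched down by 45 degrees: the principal
# point looks at the ground 1 m ahead of the camera.
ground_pt = pixel_to_ground(320, 240, 500.0, 320.0, 240.0, 1.0, radians(45))
# A pixel high in the image with a shallow pitch is above the horizon.
sky_pt = pixel_to_ground(320, 0, 500.0, 320.0, 240.0, 1.0, radians(10))
```

Any error in the assumed height or pitch shifts every projected point, which is one way to see why the method is restricted to structured settings with a fixed, calibrated camera.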

3.2. Disparity Histogram-Based

In these methods, two cameras are installed a fixed distance apart at the front of the robot. The cameras have similar properties, such as focal length and FOV. They simultaneously capture two images of the surroundings. The acquired images are rectified. The distance between the matched pixels (disparity) is then calculated. This is repeated for all of the image pixels. The result is a disparity map, which is then used to compute the depth map of the surrounding objects [92][116]. Pixels having a depth smaller than a threshold are considered obstacle points. The majority of stereo-based obstacle detection techniques developed so far are disparity histogram-based, and these are reviewed in this section.
Disparity histogram-based methods can be discussed for robots on the ground and in the air. Both groups are reviewed in the following.
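The conversion from disparity to depth and the final thresholding can be sketched as follows. The sketch assumes rectified images and the standard pinhole stereo relation Z = f·B/d, with focal length f in pixels, baseline B in metres, and disparity d in pixels; the function names and the numeric values are illustrative assumptions.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a matched pixel from its disparity (rectified stereo pair)."""
    if disparity_px <= 0:
        return float("inf")  # no valid match, or a point at infinity
    return focal_px * baseline_m / disparity_px

def obstacle_mask_from_disparity(disparity_map, focal_px, baseline_m, max_depth_m):
    """Binary mask: 1 where the depth is smaller than the threshold."""
    return [
        [1 if depth_from_disparity(d, focal_px, baseline_m) < max_depth_m else 0
         for d in row]
        for row in disparity_map
    ]

# Toy 2x2 disparity map; f = 700 px and B = 0.1 m, so depth = 70 / d.
disparity_map = [[35.0, 7.0], [0.0, 14.0]]
depths = [[depth_from_disparity(d, 700.0, 0.1) for d in row] for row in disparity_map]
mask = obstacle_mask_from_disparity(disparity_map, 700.0, 0.1, max_depth_m=6.0)
```

With these values the depths are 2 m, 10 m, unknown, and 5 m, so only the first and last pixels fall under the 6 m threshold.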

3.2.1. Disparity Histogram-Based Obstacle Detection for Terrestrial Robots

Disparity histogram-based obstacle detection techniques were initially developed for terrestrial robots. Kim et al. [93][117] proposed a Hierarchical Census Transform (HCT) matching method to develop car parking assistance using images taken by a pair of synchronised fisheye cameras. As the quality of points at the edges of a fisheye image is low, the detection was only accurate enough in areas close to the image centre. Moreover, the algorithm’s accuracy decreased when shadows and complicated or reflective backgrounds were present. Later, Ball et al. [94][118] introduced an obstacle detection algorithm that could continuously adapt to changes in the illumination and brightness in farm environments. They developed two distinct steps for obstacle detection. The first removes both the crop and the stubble. After that, stereo matching is performed only on the remaining small portions to increase the speed. This first step is unable to detect hidden obstacles. The second step bypasses this constraint by defining obstacles as unique observations in their appearance and structural cues [94][118]. Salhi and Amiri [95][119] proposed a faster algorithm implemented on Field Programmable Gate Arrays (FPGA) to simulate human visual systems.
Disparity histogram-based techniques rely on computationally intensive matching algorithms. A solution to speed up the computations is to reduce the matching search space. Jung et al. [96][103] and Huh et al. [27][36] removed the road pixels to reduce the search space and regarded the other pixels as obstacles for vehicles travelling along a road. To detect the road, they used normal-FOV cameras. As a result, only obstacles in front of the vehicle could be detected. In a similar approach, to guide visually impaired individuals, Huang et al. [97][120] used depth data obtained using a Kinect sensor to identify and remove the road points. Furthermore, Muhovič et al. [98][121] approximated the water surface by fitting a plane to the point cloud and processed the outlying points further to identify potential obstacles. As a recent technique, Murmu and Nandi [99][122] presented a novel lane and obstacle detection algorithm that uses video frames captured by a low-cost stereo vision system. The suggested system generates a real-time disparity map from the sequential frames to identify lanes and other cars. Moreover, Sun et al. [100][123] used 3D point cloud candidates extracted by height analysis for obstacle detection instead of using all 3D point clouds.
With the development of neural networks, many researchers have recently turned their attention to deep learning methods [101][102][103][124,125,126]. Choe et al. [104][127] proposed a stereo object matching technique that uses 2D contextual information from images and 3D object-level information in the field of stereo matching. Luo et al. [105][128] also used CNNs that can produce extremely accurate results in less than one second. Moreover, in disparity histogram-based obstacle detection studies, Dairi et al. [101][124] developed a hybrid encoder that combines Deep Boltzmann Machines (DBM) and Auto-Encoders (AE). In addition, Song et al. [106][129] trained a convolutional neural network using manually labelled Regions Of Interest (ROI) from the KITTI data set to classify the left/right side of the host lane. The 3D data generated by stereo matching is used to generate an obstacle mask. Zhang et al. [102][125] introduced a method that uses stereo images and deep learning methods to avoid car accidents. The algorithm was developed for drivers who are reversing with a limited view of the objects behind. This method detects and locates obstacles in the image using a Faster R-CNN algorithm.
Haris and Hou [107][130] addressed how to improve the robustness of obstacle detection methods in a complex environment by integrating an MRF for obstacle detection, road segmentation, and the CNN model to navigate safely. Their research evaluated the detection of small obstacles left on the road [107][130]. Furthermore, Mukherjee et al. [108][131] provided a method for detecting and localising pedestrians using a ZED stereo camera. They used the Darknet YOLOv2 to locate and achieve more accurate and rapid obstacle detection results. Compared with traditional methods, deep learning has the advantages of robustness, accuracy, and speed. In addition, it can achieve real-time, high-precision recognition and distance measurement through the combination of stereo vision techniques [102][125].

3.2.2. Disparity Histogram-Based Obstacle Detection for Aerial Robots

Whereas terrestrial robots are mainly surrounded by known objects, aerial robots move in unknown environments. The processing load of disparity histogram-based methods can be too heavy for onboard MAV microprocessors. To simplify the search space and speed up the depth calculation, McGuire et al. [29][38] used vertical edges within the stereo images to detect obstacles. Such an algorithm would not work in complicated environments where horizontal and diagonal edges are present. Tijmons et al. [109][132] introduced the Droplet strategy to identify and use only strong matched points and, thus, decrease the search space. In this study, the resolution of the images was reduced to increase the speed of disparity map generation. When the environment becomes complex, the processing speed of this technique decreases. Moreover, reducing the resolution may eliminate tiny obstacles such as tree branches, rope, and wire. Barry et al. [4] concentrated only on fixed obstacles at a 5–10 m distance from a UAV to speed up the process. The algorithm was implemented on a light drone (less than 1 kg) and detected obstacles at 120 frames per second. The baseline of the cameras was only 14 inches. Thus, in addition to being limited to detecting fixed objects, this work faces a major challenge in its need for accurate calibration of the system to obtain reliable results. Lin et al. [110][133] considered dynamic environments and stereo cameras to detect moving obstacles using depth from stereo images. However, they only considered the obstacles’ estimated position, size, and velocity. Therefore, some characteristics of objects, such as direction, volume, and shape, and influencing factors like environmental conditions were not considered. Hence, such algorithms may have difficulty detecting some of the moving obstacles.
One of the state-of-the-art methods is that by Grinberg and Ruf [111][134], which includes several components: image rectification, pixel matching, semi-global matching (SGM) optimisation, compatibility check, and median filtering. This algorithm runs on an ARM processor of the Towards Ubiquitous Low-Power Image Processing Platforms (TULIPP). Therefore, its image processing performance is suitable for real-time applications on a UAV [111][134]. Beyond computational complexity, the second problem of these methods is sensitivity to illumination variations, which has been tackled using deep learning networks in recent years. The third problem is the accurate calibration of the stereo cameras. If the stereo cameras are not calibrated correctly, the detection error increases very quickly over time. This is due to system instability, which affects the accuracy of computing the distance to the obstacle, especially when the baseline of the cameras is small, e.g., when they are mounted on small-sized UAVs.
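The sensitivity to a small baseline can be illustrated with the first-order error model of the stereo depth relation Z = f·B/d. Differentiating with respect to the disparity gives a depth error of approximately Z²/(f·B) times the disparity error, so the same matching or calibration error costs more depth accuracy as the baseline shrinks. The sketch below assumes this simplified model; the focal length, baselines, and disparity error are illustrative values only.

```python
def depth_error(depth_m, focal_px, baseline_m, disparity_error_px):
    """First-order depth error of a rectified stereo pair.

    From Z = f * B / d it follows that |dZ| is approximately
    Z**2 / (f * B) * |dd| for a small disparity error dd.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Same obstacle at 10 m, same camera, same half-pixel disparity error.
err_wide = depth_error(10.0, 700.0, 0.35, 0.5)    # assumed 35 cm baseline
err_narrow = depth_error(10.0, 700.0, 0.07, 0.5)  # assumed 7 cm baseline
```

Under these assumptions the wide baseline gives an error of roughly 0.2 m and the narrow one roughly 1 m: the error grows in inverse proportion to the baseline and with the square of the distance.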