Computer Vision-Based Structural Deformation Monitoring in Field Environments: Comparison

Computer vision-based structural deformation monitoring techniques have been studied in a large number of applications in the field of structural health monitoring (SHM). Numerous laboratory tests and short-term field applications have contributed to the formation of the basic framework of computer vision deformation monitoring systems and point the way towards long-term, stable monitoring in field environments.

  • computer vision
  • structural deformation monitoring
  • field environment

1. Introduction

Transportation infrastructure systems such as bridges, tunnels and railroads are important components of national social production and development. With the tremendous growth of social productivity, these transportation infrastructures are challenged in two major ways. On the one hand, the tonnage and number of existing means of transportation may exceed the design load-carrying capacity; on the other hand, civil engineering structures, including bridges, are subjected to various external loads and disasters (such as fire and earthquakes) during their service life, which in turn reduces the remaining service life of the structures. By carrying out inspection, monitoring, evaluation, and maintenance of these structures, we can ensure the long life and safe service of national infrastructure and transportation arteries, which is of great strategic importance for supporting the sustainable development of the national economy.
In the past two decades, structural health monitoring (SHM) has emerged with the fundamental purpose of collecting the dynamic response of structures using sensors and then reporting the results to evaluate the structures’ performance. The wide deployment of SHM systems on real engineering structures is limited by the cumbersome and expensive installation and maintenance of sensor networks and data acquisition systems [1,2,3]. At present, the sensors used for SHM are mainly divided into contact types (linear variable differential transformers (LVDTs), optical fiber sensors [4,5,6,7,8,9], accelerometers [10,11], strain gauges, etc.) and non-contact types (such as global positioning systems (GPS) [12,13,14], laser vibrometers [15], total stations [16], interferometric radar systems [17], levels, and computer vision-based sensors). Among the existing non-contact sensors, the GPS sensor is easy to install, but its measurement accuracy is limited, usually between 5 mm and 10 mm, and its sampling frequency is limited (i.e., less than 20 Hz) [18,19,20,21,22]. Xu et al. [23] performed a statistical analysis of data collected using accelerometers and pointed out that introducing maximum likelihood estimation into the fusion of GPS displacement data with the corresponding acceleration data can improve the accuracy of displacement readings. The accuracy of the laser vibrometer is usually very good, ranging from 0.1 mm to 0.2 mm, but the equipment is expensive and its range is usually less than 30 m [24]. Remote measurements can be performed with better than 0.2 mm accuracy using a total station or level, but the dynamic response of the structure cannot be collected [25,26].
With the development of computer technology, optical sensors and image processing algorithms, computer vision has gradually been applied in various fields of civil engineering. High-performance cameras are used to collect field images, and various algorithms then perform image analysis on a computer to obtain information such as strain, displacement, and inclination. After further processing of these data, dynamic characteristics such as mode shape, frequency, acceleration and damping ratio can be obtained. Some researchers extract the influence line [27,28] and the influence surface [29] of a bridge structure from the spatial and temporal distribution of vehicle loads, which are used as indicators to evaluate the safety performance of the structure. However, the long-term application of computer vision in the field is limited in many ways, for example by the selection of targets, measurement efficiency and accuracy, and environmental impacts (especially the effects of temperature and illumination changes).

2. System Composition

A computer vision-based structural deformation monitoring system includes an image acquisition system and an image processing system. The image acquisition system includes a camera, lens, and target to collect video images, while the image processing system performs camera calibration, feature extraction, target tracking, and deformation calculation in order to process the acquired images and calculate the structural deformation. This section briefly introduces the basic components of the image acquisition system in the computer vision monitoring system. The image processing system will be introduced in Section 3.

2.1. Camera

The camera is an important part of the image acquisition system, and its most essential function is to transform received light into an electrical signal through a photosensitive chip and transmit it to the computer. Photosensitive chips can be divided into CCD and CMOS according to the way the digital signal is transferred. The main differences between them are that CCD has an advantage over CMOS in imaging quality, but its cost is much higher, so it is suitable for high-quality image acquisition; CMOS is highly integrated and consumes less power than CCD, but it is more susceptible to optical, electrical and magnetic interference and has weaker noise immunity, so it is more suitable for high-frequency vibration acquisition [30,31]. The selection of the camera needs to consider the following points: (1) the appropriate chip type and size should be selected according to the measurement accuracy and application scenario; (2) because of the limited bandwidth, the frame rate and resolution of the camera are in conflict, so they should be balanced in camera selection; (3) industrial cameras appear to be the only option for long-term on-site monitoring.

2.2. Lens

The lens plays an important role in a computer vision system, and its function is similar to that of the lens in a human eye. It gathers light and directs it onto the camera sensor to achieve photoelectric conversion. Lenses are divided into fixed-focus lenses [32,33] and zoom lenses [34]. Fixed-focus lenses are generally used in laboratories, while high-power zoom lenses are generally used for long-distance monitoring, such as of long-span bridge structures and high-rise buildings. The depth of field is related to the focal length of the lens: the longer the focal length, the shallower the depth of field. The following points need to be considered in the selection of lenses: (1) a low-distortion lens can improve the calibration efficiency; (2) a focal length appropriate to the camera sensor size, camera resolution and measuring distance should be selected; (3) a high-power zoom lens is appropriate for medium- and long-distance shooting.

2.3. Target

The selection of targets directly affects the measurement accuracy, and an appropriate target can be selected according to the required accuracy. There are mainly two kinds of target: artificial targets and natural targets. Ye et al. [35] introduced six types of artificial targets [19,36,37] (flat panels with regular or irregular patterns, artificial light sources, irregular artificial speckles, regular boundaries of artificial speckle bands, and laser spots) and a class of natural targets [38,39]. Artificial targets can provide high accuracy and are robust to changes in the external environment; for example, artificial light sources can improve the robustness of targets to lighting conditions and make night-time monitoring possible. The disadvantage of artificial targets is that they need to be installed manually, which may change the dynamic characteristics of the structure. Natural targets rely on the surface texture or geometric shape of the structure, are sensitive to changes in the external environment, and offer lower accuracy. The following points should be noted in the selection of targets: (1) when installation conditions permit, priority should be given to artificial targets to obtain stable measurement results; (2) the selected target should correspond to the target tracking algorithm in order to achieve better monitoring results.

3. Basic Process

The flowchart of deformation monitoring based on computer vision is shown in Figure 1, and can be summarized as follows: (1) assemble the camera and lens and aim at artificial or natural targets, then acquire images; (2) calibrate the camera; (3) extract features or templates from the first image frame, then track these features in the other image frames; (4) calculate the deformation. The following is a brief description of camera calibration, feature extraction, target tracking and deformation calculation.
Figure 1. Process of deformation monitoring based on computer vision.

3.1. Image Acquisition

Image acquisition includes these steps: (1) determine the position to be monitored; (2) arrange artificial targets or use natural targets at the measurement points; (3) select an appropriate camera and lens; (4) assemble the camera and lens and mount them firmly on a relatively stationary object; (5) aim at the target and acquire images.

3.2. Camera Calibration

Camera calibration [40] is the process of determining a set of camera parameters which associate points in the real world with points in the image. Camera parameters can be divided into internal parameters and external parameters: internal parameters define the geometric and optical characteristics of the camera, while external parameters describe the rotation and translation of the image coordinate system relative to a predefined global coordinate system [41]. In order to obtain the structural displacement from the captured video images, it is necessary to establish the transformation from physical coordinates to pixel coordinates. The common coordinate conversion methods are the full projection matrix, the planar homography matrix, and the scale factor.

3.2.1. Full Projection Matrix

The full projection matrix transformation reflects the whole projection from a 3D object to the 2D image plane. The camera internal and external matrices can be obtained by observing a calibration board; they can be used to eliminate image distortion and offer high accuracy [42]. Commonly used calibration boards include the checkerboard [43] and the dot lattice [44]. Figure 2a shows the relationship between the camera coordinate system, the image coordinate system and the world coordinate system. A point T (X, Y, Z) in the real 3D world appears at the position t (x, y) in the image coordinate system after the projection transformation (where the origin of the coordinates is P). The relationship between the pixel coordinate system and the image coordinate system is shown in Figure 2b. Therefore, the equation for converting a point from the 3D world coordinate system to the pixel coordinate system is
$$
S \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} f_x & \gamma & u_x & 0 \\ 0 & f_y & u_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
= M_1 M_2 X
\tag{1}
$$
where S is the scale factor from Equation (3); fx and fy are the focal lengths of the camera along the lateral and vertical image axes; γ is the skew (angle) factor of the lens; ux and uy are the lateral and vertical offsets of the principal point, respectively; R is the 3 × 3 rotation matrix; t is the 3 × 1 translation vector; M1 is the camera internal parameter matrix; and M2 is the camera external parameter matrix.
Figure 2. (a) Relationship among the camera coordinate system, the image coordinate system, and the world coordinate system; and (b) Relationship between the pixel coordinate system and the image coordinate system.
Park et al. [45] and Chang et al. [41] calibrated with a T-bar and a checkerboard, respectively, to eliminate the measurement error caused by camera distortion and accurately measure the 3D dynamic response of a structure.
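As an illustration of Equation (1), the following minimal Python/NumPy sketch projects a 3D world point into pixel coordinates. All intrinsic and extrinsic values (fx, fy, ux, uy, R, t) are assumed example numbers, not parameters from any cited study; in practice they would be estimated by observing a checkerboard or dot-lattice calibration board.

```python
import numpy as np

# Intrinsic matrix M1 (assumed example values; focal lengths and offsets in pixels)
fx, fy = 2000.0, 2000.0        # focal lengths along the lateral and vertical axes
ux, uy = 960.0, 540.0          # principal point offsets
gamma = 0.0                    # skew (angle) factor, usually ~0 for modern sensors
M1 = np.array([[fx, gamma, ux, 0.0],
               [0.0,  fy,  uy, 0.0],
               [0.0, 0.0, 1.0, 0.0]])

# Extrinsic matrix M2 built from rotation R (3x3) and translation t (3x1), assumed values
R = np.eye(3)                           # camera axes aligned with world axes
t = np.array([[0.0], [0.0], [10.0]])    # camera 10 m in front of the structure (m)
M2 = np.vstack([np.hstack([R, t]), [0.0, 0.0, 0.0, 1.0]])

def project(X_world):
    """Project a 3D world point [X, Y, Z] to pixel coordinates via Equation (1)."""
    X_h = np.append(np.asarray(X_world, dtype=float), 1.0)  # homogeneous coordinates
    s_xy1 = M1 @ M2 @ X_h        # equals S * [x, y, 1]
    return s_xy1[:2] / s_xy1[2]  # divide out the scale factor S

print(project([0.5, 0.2, 0.0]))  # pixel position of a point 0.5 m right and 0.2 m up
```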

3.2.2. Planar Homography Matrix

In practical engineering applications, the above calibration process is relatively complex. To simplify it, Equation (1) can be expressed as
$$
S \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
= K \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
\tag{2}
$$
where K is called the planar homography matrix [46]; it reflects the relationship between corresponding points on two images and is not affected by the angle between the optical axis and the structural plane [43]. The planar homography matrix is suitable for cases where there is an angle between the image plane and the moving plane of the object and this angle is not easy to measure [42]. The positions of at least four known points on the moving plane can be used to solve for the planar homography matrix. Khuc et al. [29] and Xu et al. [47,48] both used known structural dimensions to solve for the planar homography matrix, construct the corresponding relationship between image coordinates and 3D world coordinates, and estimate the time histories of lateral and vertical displacement of a bridge.
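The sketch below shows one way to solve for and apply a planar homography using OpenCV. The four world/image point pairs are assumed example values standing in for known structural dimensions; they are not taken from the cited studies.

```python
import numpy as np
import cv2

# Four points with known positions on the structural plane (mm); assumed example values
world_pts = np.array([[0, 0], [500, 0], [500, 300], [0, 300]], dtype=np.float32)

# The same four points located in the first image frame (pixels); assumed example values
image_pts = np.array([[812, 604], [1410, 598], [1415, 957], [818, 962]], dtype=np.float32)

# Solve for the planar homography mapping image coordinates to plane coordinates
H, _ = cv2.findHomography(image_pts, world_pts)

def to_plane(u, v):
    """Map a tracked pixel position (u, v) to coordinates on the structural plane (mm)."""
    p = np.array([[[u, v]]], dtype=np.float32)
    return cv2.perspectiveTransform(p, H)[0, 0]

# Displacement between two tracked pixel positions, expressed in mm on the plane
print(to_plane(1101.3, 780.6) - to_plane(1100.0, 780.0))   # [dx, dy] in mm
```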

3.2.3. Scale Factor

The scale factor (S) provides a simple and practical calibration method. As shown in Figure 3a, when the camera optical axis is perpendicular to the surface of the object, S (unit: mm/pixel) can be obtained in a simplified calculation from the internal parameters of the camera (focal length, pixel size) and the external geometry between the camera and the object surface (measurement distance); this corresponds to Equation (3) with θ = 0.
Figure 3. Scaling factor calibration method. (a) Camera optical axis orthogonal to measured object surface; (b) Camera optical axis intersecting obliquely with measured object surface.
When the optical axis of the camera is not perpendicular to the surface of the measured object (as shown in Figure 3b), the included angle would affect the measurement accuracy [49]. Feng et al. [1] studied the influence of different angles between the optical axis and the surface of the measured object on the accuracy, and found that S can be determined by:
$$
S = \frac{L}{f \cos^2\theta}\, d_{pixel} = \frac{D}{d \cos^2\theta}\, d_{pixel}
\tag{3}
$$
where f represents the focal length; L represents the distance from the camera to the measured object surface along the optical axis, also known as the object distance; D represents the distance from the measuring point to the optical axis; d represents the distance from the measuring point on the image to the image origin; dpixel is the physical size of one pixel; and θ is the tilt angle shown in Figure 3b. References [50,51,52,53,54,55] construct S from known physical dimensions on the surface of the object (such as the dimension of an artificial target or a structural member dimension obtained from the design drawings) and the corresponding image dimensions to measure the displacement of the structure. Among these camera calibration algorithms, the appropriate coordinate conversion method needs to be selected according to the field environment and the measurement purpose. The full projection matrix and the planar homography matrix impose no restriction on camera position but need a calibration plate or known point positions. The full projection matrix is suitable for 3D deformation monitoring, while the planar homography matrix and the scale factor are suitable for 2D deformation monitoring.
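A minimal sketch of the scale factor calibration of Equation (3) follows; the camera parameters and the tracked pixel displacement are assumed example values, and θ = 0 recovers the perpendicular case of Figure 3a.

```python
import numpy as np

def scale_factor(L, f, pixel_size, theta_deg=0.0):
    """Scale factor S (mm/pixel) following Equation (3).
    L: object distance along the optical axis (mm)
    f: focal length (mm)
    pixel_size: physical size of one sensor pixel, d_pixel (mm/pixel)
    theta_deg: tilt of the optical axis away from the perpendicular case (degrees)."""
    return L / (f * np.cos(np.radians(theta_deg)) ** 2) * pixel_size

# Assumed example: 30 m object distance, 85 mm lens, 5.86 um pixels, 10 degree tilt
S = scale_factor(L=30_000.0, f=85.0, pixel_size=0.00586, theta_deg=10.0)
pixel_displacement = 12.4               # pixel motion reported by target tracking (assumed)
print(S, pixel_displacement * S)        # mm/pixel and the displacement in mm
```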

3.3. Feature Extraction and Target Tracking

Feature extraction is used to obtain the unique information in the image (such as shape features, feature points, grayscale features, and particle features). The purpose of target tracking algorithms is to find these features again in other image frames. Common target tracking algorithms in civil engineering structural deformation monitoring include shape matching, feature point matching, optical flow estimation and digital image correlation (DIC) template matching.

3.3.1. Shape Matching

In an image, a shape is a description of an edge or region, and shape matching is an image matching algorithm that identifies and locates measured objects through image edge features. There are many algorithms for edge detection, such as the Zernike operator [56], Roberts operator, Sobel operator [57], Log operator [58], Canny operator [59] and the generalized Hough algorithm [60]. Among them, the Canny operator is widely used because of its high performance [61,62]. The principle of shape matching is relatively simple, and it can be used for displacement monitoring of structures with distinct shapes. Its advantages are: (1) the calculation is relatively simple and the matching speed is fast; (2) it is robust to changes in illumination because it tracks the geometric boundary of the object; (3) it is particularly advantageous for linear structures such as slings.
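As a sketch of edge-based tracking, the snippet below uses the Canny operator (via OpenCV) to extract the dominant edge contour inside a user-defined region and returns its centroid, which can then be followed frame by frame. The region of interest and the centroid-of-edge-pixels choice are assumptions for illustration, not a prescription from the cited works.

```python
import cv2
import numpy as np

def edge_centroid(frame_bgr, roi):
    """Centroid of the dominant edge contour inside a region of interest.
    roi = (x, y, w, h) around the monitored member or target (assumed to be known)."""
    x, y, w, h = roi
    gray = cv2.cvtColor(frame_bgr[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                       # Canny edge map
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    pts = max(contours, key=len).reshape(-1, 2)            # longest edge contour
    cx, cy = pts.mean(axis=0)                              # centroid of its edge pixels
    return x + cx, y + cy                                  # position in full-frame pixels

# Tracking this centroid in successive frames gives the pixel displacement of the shape.
```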

3.3.2. Feature Point Matching

Feature point matching is a target tracking method based on feature extraction and matching. Key points in computer vision are those which are stable, unique and invariant to image transformations, such as building corners, connection bolts, or other shaped targets [63,64]. Common methods of feature point detection include the Harris corner [65], Shi–Tomasi corner [66], scale-invariant feature transform (SIFT) [32,67], speeded-up robust features (SURF) [68], binary robust independent elementary features (BRIEF) [69], binary robust invariant scalable keypoints (BRISK) [70], and fast retina keypoint (FREAK) [71]. A feature point matching algorithm needs to select appropriate feature descriptors according to the measurement object in order to describe feature points mathematically and carry out image registration. It is usually suitable for structures with rich texture or a distinct shape (such as a circle, hexagon or rectangle). Feature point matching has the following characteristics: (1) it deals with the whole image area and has accurate matching performance [72]; (2) it extracts texture features of the structure and is not sensitive to illumination and shape transformations; (3) the greater the number of feature points used, the higher the precision, although this increases the calculation time.
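A minimal feature point matching sketch with OpenCV is given below, using BRISK keypoints and brute-force Hamming matching; taking the median shift of the matched points is a simple robust choice assumed here rather than a method from the cited works.

```python
import cv2
import numpy as np

detector = cv2.BRISK_create()                       # BRISK keypoints + binary descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def feature_shift(ref_gray, cur_gray):
    """Median pixel shift of matched keypoints between a reference frame and the
    current frame (both grayscale images of the monitored region)."""
    kp1, des1 = detector.detectAndCompute(ref_gray, None)
    kp2, des2 = detector.detectAndCompute(cur_gray, None)
    matches = matcher.match(des1, des2)
    shifts = np.array([np.subtract(kp2[m.trainIdx].pt, kp1[m.queryIdx].pt)
                       for m in matches])
    return np.median(shifts, axis=0)                # robust [dx, dy] in pixels
```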

3.3.3. Optical Flow Algorithm

The optical flow algorithm is an image registration technique in which surface motion in a three-dimensional environment is approximated as a two-dimensional field by using the spatio-temporal pattern of image intensity [73]. The optical flow algorithm can accurately provide the velocity and displacement of an object by tracking the trajectories of pixels, but it has significant limitations and relies on the following assumptions [74]: (1) the brightness of objects in adjacent frames remains unchanged; (2) the motion of objects between adjacent frames is small; (3) the motion of neighbouring pixels is consistent [75]. Common optical flow algorithms include the Lucas–Kanade method [76,77], Horn–Schunck method [78], Farneback method [79], block matching method [80], and phase-based optical flow [45,81,82]. Among these, the Lucas–Kanade method is fast and easy to implement, and it can perform motion tracking in a selected measurement area, especially of robust feature points, while the other algorithms need to compute every pixel in the image, which is slow. The optical flow algorithm is similar to the feature point matching algorithm in that it tracks feature points in the image and prefers target patterns with distinct and robust features over the whole test period. The optical flow algorithm has the following characteristics: (1) target features need to be clear; (2) it is sensitive to illumination changes; (3) only motion components perpendicular to the local edge direction can be detected, such as bridge cable vibration; (4) optical flow describes the motion of the image brightness pattern and is more suitable for measuring dynamic displacement.
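The sketch below tracks Shi–Tomasi corners with the pyramidal Lucas–Kanade optical flow in OpenCV; the video file name and the tracking parameters are assumptions chosen for illustration.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("bridge.mp4")            # assumed monitoring video
ok, first = cap.read()
ref_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)

# Robust Shi–Tomasi corners in the reference frame serve as the tracked points
p0 = cv2.goodFeaturesToTrack(ref_gray, maxCorners=50, qualityLevel=0.01, minDistance=7)

lk = dict(winSize=(21, 21), maxLevel=3,
          criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas–Kanade tracking of the selected points against the reference frame
    p1, status, _ = cv2.calcOpticalFlowPyrLK(ref_gray, cur_gray, p0, None, **lk)
    good = status.flatten() == 1
    dxy = (p1[good] - p0[good]).reshape(-1, 2).mean(axis=0)
    print(dxy)                                   # mean [dx, dy] motion in pixels
```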

3.3.4. DIC Template Matching

The basic principle of DIC is to compare the same points (or pixels) recorded in two images before and after deformation and to calculate the motion of each point [83]. As a representative non-interferometric optical technique, DIC has the advantage of continuous measurement of the whole displacement field and strain field. It is a powerful and flexible surface deformation measurement tool in experiments on solids, and it has been widely accepted and used [84,85,86,87]. If only a small pixel area is tracked, the displacement of the measuring points of the structure can be tracked and monitored [88,89], which is called template matching. The basic process of monitoring displacement by template matching is as follows [90,91,92]: (1) select some areas of the first image frame as templates; (2) use these templates to scan line by line in a new image frame; (3) use the relevant criteria to measure the degree of similarity and determine the pixel coordinates of the matched template; (4) calculate the pixel displacement and convert it to the actual displacement. The relevant criteria include the following six mathematical algorithms: (1) cross-correlation (CC); (2) normalized cross-correlation (NCC); (3) zero-normalized cross-correlation (ZNCC); (4) sum of squared differences (SSD); (5) normalized sum of squared differences (NSSD); and (6) zero-normalized sum of squared differences (ZNSSD) [93]. In computer vision-based displacement measurement, the NCC matching method is the most popular, and there are numerous applications of the method. Template matching based on DIC has the following characteristics: (1) it is not very robust to light changes, slight occlusions, and scale changes; (2) an artificial target helps improve the success rate of matching; (3) template matching is computationally expensive, although performing the correlation in the frequency domain can save computation time.
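A minimal NCC template matching sketch with OpenCV follows: a template is cut from the first frame and located in each subsequent frame. The video name and template region are assumed values, and only integer-pixel matching is shown; sub-pixel refinement (e.g., interpolating around the correlation peak) would normally be added.

```python
import cv2

cap = cv2.VideoCapture("bridge.mp4")                 # assumed monitoring video
ok, first = cap.read()
first_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)

x0, y0, w, h = 600, 400, 80, 80                      # assumed template region around the target
template = first_gray[y0:y0 + h, x0:x0 + w]

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    ncc = cv2.matchTemplate(gray, template, cv2.TM_CCORR_NORMED)  # NCC similarity map
    _, score, _, (x, y) = cv2.minMaxLoc(ncc)         # location of the best match
    print(x - x0, y - y0, score)                     # pixel displacement and match quality
```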

3.4. Deformation Calculation

Deformation computation is the process of transforming pixel displacement into actual displacement. First, high-quality images are collected; then, 3D motion in the real world is related to planar motion through camera calibration; next, the matching algorithm is used to track the target and calculate the pixel distance it moves in the image plane. Finally, the pixel distance is converted into the corresponding actual distance. The accuracy of the displacement depends not only on the camera calibration method and target tracking algorithm but also on the environment, so the influence of the environment on the accuracy of the displacement calculation needs to be understood. This is the problem that needs to be solved in current field applications, and the most important task is to improve the algorithms so that they can adapt to changing environments.
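To close the loop, the sketch below converts a tracked pixel displacement history into physical displacement using the calibrated scale factor and then estimates the dominant vibration frequency from its spectrum; the file name, scale factor and frame rate are assumed placeholders rather than values from any cited application.

```python
import numpy as np

fps = 60.0                                   # camera frame rate (assumed)
S = 1.73                                     # mm/pixel scale factor from calibration (assumed)
pixel_disp = np.load("tracked_pixels.npy")   # pixel displacement history from tracking (assumed file)

disp_mm = pixel_disp * S                     # pixel displacement -> physical displacement (mm)

# Dominant vibration frequency from the displacement spectrum
spec = np.abs(np.fft.rfft(disp_mm - disp_mm.mean()))
freqs = np.fft.rfftfreq(disp_mm.size, d=1.0 / fps)
print(freqs[np.argmax(spec)], "Hz")
```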