The architecture, engineering and construction (AEC) sector is a significant driver of economic activity around the world. Structure- and workplace-related safety accidents can be life-threatening, yet safety remains one of the most frequently overlooked issues in the sector.
In the United States, around 40% of bridges are over 50 years old, and more than 9% of them are rated as structurally deficient; the total cost of bridge rehabilitation is estimated at around $123 billion. In addition to the need to design more robust structures under various loads, efficient structural monitoring is also important for aging infrastructure. Accurate structural health assessments are the basis for decisions on infrastructure maintenance, repair and rehabilitation. Typically, structural health monitoring (SHM) relies on approaches such as regular visual inspections or structural monitoring sensors. Visual inspections require experienced inspectors to carry inspection instruments to the structure surface and conduct the inspection there, a process that can be labor-intensive, time-consuming and sometimes risky. Sensor-based monitoring can identify defects both on the structure surface and in its interior, and it is more reliable as long as the sensors are functional. Over time, however, its accuracy may be compromised by changing environments or sensor aging. Under these circumstances, noise filtering approaches can be used to correct the data, but this is also tedious and requires expertise.
Similarly, workforce safety in jobsite safety management (JSM) is also a challenge for the AEC industry. For example, the US Occupational Safety and Health Administration (OSHA) recorded 1008 construction worker fatalities in 2018, mainly caused by common on-site accidents such as being struck by falling objects and falling from heights. Traditionally, on-site construction safety monitoring relies on site patrols and surveillance. However, the complex nature of site dynamics makes on-site safety monitoring more difficult and less proactive. In addition, the fatigue level of workers cannot be accurately identified.
Over the past few years, researchers have been developing machine learning (ML) applications for a wide range of fields. Prime ML applications in SHM and JSM include structural damage detection and on-site worker safety monitoring. The rapid evolution of graphics processing units (GPUs) has dramatically improved the computational capacity available for ML algorithms, which has led to a growing number of deep learning (DL) applications. In particular, the convolutional neural network (CNN), a DL algorithm, achieved extraordinary results in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012), a benchmark in object classification and detection covering thousands of object classes and millions of images. DL has since outperformed many advanced algorithms in numerous fields. DL applications are increasingly being developed and deployed to address image classification, data augmentation and object detection problems. In addition, scholars have made encouraging progress in integrating DL and natural language processing (NLP) for text extraction from construction safety reports. By analyzing and classifying such reports, hidden dangers can be identified in time, and corresponding measures can be taken to avoid similar accidents in the future. ML and DL therefore have great potential in image recognition and data analysis and are likely to be the best options to address the challenges of SHM and JSM.
In recent years, researchers have used computer vision-based methods for the visual inspection of surface defects and have demonstrated considerable merits. These methods are primarily based on image processing techniques (IPTs), such as histogram transformation, texture recognition and edge detection. However, they are vulnerable to changes in lighting conditions and to image distortion.
To enhance the performance of IPT-based approaches for defect detection, researchers have integrated ML algorithms. Technically, ML algorithms can efficiently classify the different damage features extracted by IPTs. ML-based methods mostly focus on identifying typical structural defects such as cracks, rusting, spalling and loose bolts. Nevertheless, these methods require defect features to be clearly defined and extracted before they can be classified with suitable classifiers. Overall, they lack efficiency, feasibility and accuracy. Rapidly developing DL techniques are expected to solve these problems. The CNN, as an end-to-end model, can significantly improve the efficiency of defect detection and localization because it learns the defect features automatically from the labeled defects in the training samples. Normally, a CNN is used to find defects in images as follows: a fixed-size sliding window scans the image and separates it into small patches, and a well-trained CNN then detects defects on each patch separately. Because the scales and shapes of defects vary, it is difficult in practice to find a window size that fits all of them.
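The sliding-window procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any cited study: the image is a toy grid of intensities, and `dark_patch_stub` is a hypothetical stand-in for a trained CNN patch classifier (all names here are assumptions).

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel intensities
Patch = List[Sequence[float]]


def sliding_windows(height: int, width: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield the (top, left) corner of every full window that fits in the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left


def detect_patches(image: Image, window: int, stride: int,
                   classify_patch: Callable[[Patch], bool]) -> List[Tuple[int, int]]:
    """Run the patch classifier on each window and return the corners it flags."""
    height, width = len(image), len(image[0])
    hits = []
    for top, left in sliding_windows(height, width, window, stride):
        patch = [row[left:left + window] for row in image[top:top + window]]
        if classify_patch(patch):
            hits.append((top, left))
    return hits


def dark_patch_stub(patch: Patch) -> bool:
    """Hypothetical stand-in for a trained CNN: flags patches with low mean intensity."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels) < 0.5


# Toy 4x4 image with a dark "defect" in the bottom-right 2x2 corner.
toy_image = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

hits_small = detect_patches(toy_image, window=2, stride=2, classify_patch=dark_patch_stub)
hits_large = detect_patches(toy_image, window=3, stride=1, classify_patch=dark_patch_stub)
print(hits_small, hits_large)
```

With a 2x2 window the defect fills one patch and is flagged, whereas with a 3x3 window the same defect is diluted by background pixels in every patch and the stub flags nothing. This is a small-scale version of the window-size sensitivity noted above.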
To overcome this drawback, the region-based CNN (R-CNN) was proposed to replace the sliding-window method. The R-CNN is a two-stage detector. First, it employs a selective search approach to generate region proposals. Then, defect features are extracted from these regions for classification, and the detected defects are highlighted by bounding boxes.
Although a pixel-level representation of structural defects is beneficial for SHM, it can only identify the damage level on the structure surface and cannot infer the performance of internal structural components, which may already have deteriorated. Vibration data are the main data source used in SHM. Technically, any structural damage changes the stiffness and mass distributions of the structure and leads to differences in the natural frequencies and mode shapes. Hence, vibration-based SHM methods have the potential to detect internal structural damage by analyzing abnormal data acquired from sensors (e.g., accelerometers). Previous research on vibration-based SHM mainly focused on building a physical model to imitate the status of a real structure. Basically, this model-driven method employs mathematical modeling and physical laws to represent the monitored structure. Hence, the level and location of the damage can be determined accurately by analyzing and solving the model. Nevertheless, it is challenging to build and solve such a model as the complexity of the monitored structure increases and environmental factors are taken into account. The most critical drawback of the model-driven approach is that modeling usually requires expertise and is time-consuming. As a result, model-driven methods have been progressively replaced by data-driven methods. Unlike the model-driven method, the data-driven method can identify anomalous data directly from the measurements collected by the sensors. Most data-driven methods are based on the ML paradigm. Because appropriate sensor layouts can improve the efficiency and accuracy of data collection and transmission, ML algorithms such as the genetic algorithm (GA) have also been used to determine optimal sensor layouts.
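The link between stiffness loss and natural frequency mentioned above can be illustrated with the simplest possible case, an undamped single-degree-of-freedom oscillator, for which f = sqrt(k/m) / (2π). The sketch below is only an idealization; the mass, stiffness and 10% stiffness loss are illustrative assumptions, not values from any real structure.

```python
import math


def natural_frequency_hz(stiffness: float, mass: float) -> float:
    """Undamped natural frequency f = sqrt(k/m) / (2*pi) of a single-DOF oscillator."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)


# Illustrative values only (assumptions, not measurements).
mass_kg = 2.0e5
stiffness_intact = 8.0e7                     # N/m
stiffness_damaged = 0.9 * stiffness_intact   # hypothetical 10% stiffness loss

f_intact = natural_frequency_hz(stiffness_intact, mass_kg)
f_damaged = natural_frequency_hz(stiffness_damaged, mass_kg)
relative_drop = 1.0 - f_damaged / f_intact

print(f"intact:  {f_intact:.3f} Hz")
print(f"damaged: {f_damaged:.3f} Hz")
print(f"relative frequency drop: {relative_drop:.1%}")
```

Because the frequency scales with the square root of the stiffness, a 10% stiffness loss lowers the natural frequency by only about 5%. Real structures have many modes and require an eigenvalue analysis of the full mass and stiffness matrices, but the same square-root sensitivity explains why small damage produces small frequency shifts.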
However, when vibration-based SHM methods are applied in practice, the natural frequencies of the structure are easily affected by environmental factors (e.g., temperature). For example, if a structure has only small-scale damage, the resulting changes in its natural frequencies may be masked by those environmental variables. Some scholars have analyzed the evolution of structural properties and their relationship with changes in environmental parameters. Among these studies, the monitoring of the Z24 bridge is emblematic of efforts to address this issue. Although significant efforts have been made in this regard, such analysis requires comprehensive expertise and is time-consuming.
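One common way to reduce this masking effect is to model the dependence of a natural frequency on temperature from baseline data and then inspect the residual instead of the raw frequency. The sketch below uses a simple least-squares line and entirely synthetic numbers; the linear relationship, the data and the `RESIDUAL_THRESHOLD_HZ` value are assumptions for illustration, not results from the Z24 bridge or any other study.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual(temp_c: float, freq_hz: float, slope: float, intercept: float) -> float:
    """Measured frequency minus the frequency predicted from temperature alone."""
    return freq_hz - (slope * temp_c + intercept)


# Synthetic baseline from the (assumed) healthy state: frequency falls as temperature rises.
baseline_temp_c: List[float] = [0.0, 10.0, 20.0, 30.0]
baseline_freq_hz: List[float] = [4.00, 3.95, 3.90, 3.85]
slope, intercept = fit_line(baseline_temp_c, baseline_freq_hz)

# New synthetic reading: 3.93 Hz at 5 degrees C.
new_temp_c, new_freq_hz = 5.0, 3.93
new_residual = residual(new_temp_c, new_freq_hz, slope, intercept)

RESIDUAL_THRESHOLD_HZ = 0.02  # hypothetical alarm threshold
within_raw_range = min(baseline_freq_hz) <= new_freq_hz <= max(baseline_freq_hz)
flagged = abs(new_residual) > RESIDUAL_THRESHOLD_HZ
print(within_raw_range, round(new_residual, 3), flagged)
```

In this toy case the new frequency lies inside the range seen in the baseline, so a check on the raw value would not raise an alarm, while the temperature-corrected residual exceeds the threshold. Real environmental effects are often nonlinear and involve several variables, so a straight-line fit should be read only as the simplest instance of the idea.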
On-site surveillance videos and images have been used for automated detection of unsafe behavior in recent years. Objects such as hard hats, safety vests and workers can be detected using computer vision techniques (e.g., background subtraction algorithms, the histogram of oriented gradients (HOG) method, and the scale-invariant feature transform (SIFT)). Such methods, which require considerable manual effort for feature extraction, are gradually being replaced by DL.
Mneymneh et al. developed a CNN-based framework that could determine whether workers, even moving ones, were wearing hard hats on the construction site. Xie et al. modified a CNN to detect workers' hard hats, and the model produced excellent results in terms of the mean average precision (mAP) metric. Similarly, the Faster R-CNN and SSD methods have also been employed to detect hard hats.
Fang et al. modified the Faster R-CNN to identify whether workers were wearing their harnesses properly. Kolar et al. employed a VGG-16 model to detect whether safety guardrails were installed correctly to prevent workers from falling from heights. Siddula et al. integrated a Gaussian mixture model (GMM) with CNNs to detect roofers on roof construction sites. This research can help reduce fall risks on roof sites.
In the area of unsafe activity identification, Ding et al. coupled a long short-term memory (LSTM) model with CNNs to identify whether a worker was climbing a ladder unsafely. Kim et al. developed an image-based risk prevention system that displays the safety-related information of each construction worker on a wearable augmented reality (AR) device. Luo et al. utilized a Faster R-CNN to determine workers' activities from construction site images. Considering that temporal information is necessary for detecting dynamic activities, Luo et al. later extended the framework to video-based worker activity recognition by incorporating temporal information. Some researchers have also investigated construction vehicle detection using DL. Kim et al. employed a region-based FCN to detect construction vehicles. Fang et al. used a Faster R-CNN to identify the spatial relationship between workers and excavators on construction sites. This study provided a basic prototype of a site safety alert system that can prevent workers from being hit by heavy equipment. Son et al. used a Faster R-CNN to identify on-site workers in diverse poses against complex backgrounds.
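To make the idea of a worker-equipment spatial relationship concrete, the sketch below shows one simple way a proximity alert could be derived from detector output. It is not the method of the cited studies: it assumes that a detector has already produced axis-aligned bounding boxes in pixel coordinates, and `ALERT_GAP_PX` is a hypothetical threshold. A real system would also need to map pixel distances to physical distances.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def box_gap(a: Box, b: Box) -> float:
    """Shortest distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)  # horizontal separation
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)  # vertical separation
    return math.hypot(dx, dy)


ALERT_GAP_PX = 10.0  # hypothetical alert threshold in pixels


def too_close(worker: Box, equipment: Box, threshold: float = ALERT_GAP_PX) -> bool:
    """Return True when the worker box is within the threshold of the equipment box."""
    return box_gap(worker, equipment) < threshold


# Illustrative detections (made-up coordinates).
worker_box: Box = (0.0, 0.0, 10.0, 20.0)
excavator_box: Box = (13.0, 24.0, 50.0, 60.0)
far_excavator_box: Box = (100.0, 0.0, 150.0, 40.0)

gap_near = box_gap(worker_box, excavator_box)
gap_far = box_gap(worker_box, far_excavator_box)
print(gap_near, too_close(worker_box, excavator_box))
print(gap_far, too_close(worker_box, far_excavator_box))
```

The gap is measured between box edges and not between box centers, so that a large excavator box does not appear artificially far from a small worker box standing next to it.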