2. Audio Signal Processing Applications and Methods
2.1. Application Contexts of Auditory Surveillance
Auditory surveillance has been applied to detect abnormal events in various contexts. As shown in
Table 1, the literature reports implementations of sound-based technology in the following areas: home security
[23], public environments
[16][17][18][19][20][21][22][24][25][26], offices
[29], medical and healthcare facilities
[27][28], and industrial plants
[31][32].
Table 1.
Applications of sound surveillance.
Public environments have received the most attention from the research community. Despite complicated noises in public, many studies achieved promising results for the surveillance of abnormal sounds such as glass breaking, screams, gunshots, and explosions. For example, Ref.
[24] developed a technique for detecting shouting events in a real-life railway environment. Not only are audio events automatically detected, but the positions of the acoustic sources are also localized
[17]. Audio events of gunshots could be detected based on a novelty detection approach, which offers a solution to detect abnormality in continuous audio recordings in public places
[16]. Another similar architecture for acoustic surveillance of abnormal situations under different acoustic backgrounds was built to detect vocal reactions (screams, expressions of pain) and non-vocal atypical events associated with hazardous situations (gunshots and explosions)
[19]. Models for the detection of abnormal acoustic events from normal background sound were also developed by several authors
[20]. A few efforts have been made focusing on the detection of crimes in elevators
[25] and the detection of human emotions based on verbal sounds in hazardous situations
[18][26]. Ref.
[29] built a system for acoustic surveillance to detect abnormal sounds from people talking, opening and closing doors, and using computers and other devices in an office.
In addition, many studies have implemented signal processing for the surveillance of healthcare facilities. For example, a system to extract sound features for medical telesurvey was developed by
[27]. It can classify detected sounds into normal and abnormal types. The system’s purpose is to detect severe accidents such as falls or fainting anywhere in the living area, which is useful for the surveillance of the elderly, the convalescent, or pregnant women. Furthermore, since older adults living alone can get into trouble when they fall and may be unable to call for assistance, a framework that detects falls by analyzing environmental sounds was proposed
[28].
In industrial sectors, current research on applications of sound-based detection has focused on detecting abnormal behaviors of the machine or the equipment. Acoustic-based fault diagnosis techniques of a three-phase induction motor are presented to see if the motor is in bad or good condition
[31]. Many rotating electric motors can be diagnosed using acoustic signals, which can prevent unexpected failure and improve the maintenance of electric motors. The advantages of the proposed acoustic-based fault diagnosis technique are that it is non-invasive, low in cost, and able to measure acoustic signals instantly. A novel optimization technique for the unsupervised anomaly detection of sound signals using an autoencoder (AE) is proposed
[32]. The goal is to detect unknown sounds without training data to identify abnormality or failure in the operation of the stereolithography 3D printer, the air blower pump, and the water pump.
2.2. Principles of Audio-Based Hazard Detection
The detection of hazards using acoustic data is typically based on the extraction of the following four types of information: type of sound, location of the sound, the direction of moving sound sources, and the abnormality of sound (see
Figure 1). These auditory event characteristics are used as input for detecting hazardous situations. The type of sound is the most common indicator used for detecting hazardous events and differentiating abnormal sounds from normal sound events. The occurrence of abnormal sounds, such as a gunshot or an explosion, is an important indicator of a dangerous situation requiring a quick safety response. Additionally, sound localization cues, such as the location and the direction of a moving sound source, are essential for detecting potential abnormal events. They can inform receivers whether or not they are at an unsafe distance from the hazard. Moreover, measuring the abnormality of ambient sound has been used for evaluating hazardous situations. For example, when a machine is operating, any unusual sound arising from its parts or from the assembly process is regarded as abnormal and requires the attention of maintenance officers. Given that one may not be able to classify all the unknown abnormal sound events that occur during the inspection of equipment operating noise, anomaly detection is useful in cases where researchers are developing abnormal-noise inspection devices to automate the process.
Figure 1.
Indicators of hazardous events.
Automatic detection of auditory cues requires computational advances in processing the auditory signals. As discussed earlier, hazardous situations are detected based on auditory features, including the type of sound, sound localization, and sound abnormality.
Figure 2 provides a typical procedure of sound-based hazard detection found in the literature. The process entails three steps: (1) acoustic feature extraction, (2) auditory event recognition, and (3) assessment of hazardous situations. Of those steps, recognizing auditory events (Step 2) is the most critical in the signal-processing procedure.
Figure 2.
Overall architecture of auditory signal processing to detect hazardous situations.
2.3. Sound Localization
Sound localization refers to the ability to identify acoustic sources in terms of direction and distance. It is one of the essential acoustic capabilities for recognizing and locating hazardous events. When dangerous objects or equipment are within an unsafe distance of a construction worker, detecting the location of the acoustic source can enable the worker to make an appropriate preventive response. Methods to localize sound events are mainly based on calculating the difference in the arrival times of the signal. The similarity of the signals received at different times is then examined in either the time or the frequency domain. In the time domain, acoustic impulses reach microphones that are spatially distant from one another at different Times of Arrival (TOA). Each pair of microphones in the array therefore has an associated time delay, and the best estimate of the signal’s Direction of Arrival (DOA) is determined from the recorded time delays and the known array geometry. The frequency-domain approach instead relies on differences in the sound pressure reaching the microphones of the array and is mostly used to localize higher-frequency sounds.
Table 2 lists the methods for sound localization, of which the details are provided below.
Table 2.
Summary of methods for sound localization.
Calculating the Time Difference of Arrivals (TDOA) of the signal is one approach to the detection of sound location. Valenzise et al.
[17] adopted the Maximum-Likelihood Generalized Cross Correlation (GCC) method and linear-correction least-square localization algorithm for estimating the TDOA of the signal. Ref.
[34] utilized the Euclidean Distance (EUD) to measure the similarity in the time domain. Other alternatives measure the similarity between signals at different arrivals in the frequency domain. Compared with the time domain, the frequency domain showed significant improvements in the precision of sound localization, as demonstrated by Satoh et al.’s
[35] study, where they computed the cross-correlation in the frequency domain based on the Normalized Cross Correlation (NCC) method. Tarzia et al.
[36] used the Fast Fourier Transform (FFT) to measure the frequency-domain similarity. Lastly, Wirz, Roggen, and Tröster
[37] proposed an innovative method called the fingerprinting algorithm to measure the similarity based on the naive Bayes algorithm. Their results showed that the estimation of the quantitative distance in meters between a device D_A and another device D_B reaches an accuracy of approximately 80% when using ambient sound as a relevant source for obtaining proximity information.
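As a rough illustration of the TDOA idea described above, the following sketch estimates the delay between two microphone signals with a plain (unweighted) cross-correlation and converts it to a far-field angle. It is not the Maximum-Likelihood GCC method of [17]; the function names, the 343 m/s speed of sound, and the two-microphone far-field geometry are assumptions made only for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def cross_correlation_delay(sig_a, sig_b, max_lag):
    """Return the lag in samples by which sig_b trails sig_a.

    A positive lag means the sound reached microphone A first.
    Plain time-domain cross-correlation, searched over [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, a in enumerate(sig_a):
            m = n + lag
            if 0 <= m < len(sig_b):
                score += a * sig_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_from_delay(lag_samples, sample_rate, mic_spacing):
    """Far-field angle in degrees from broadside for one microphone pair."""
    tau = lag_samples / sample_rate              # delay in seconds
    ratio = SPEED_OF_SOUND * tau / mic_spacing   # equals sin(angle)
    ratio = max(-1.0, min(1.0, ratio))           # guard against rounding
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    click = [1.0, -0.5, 0.25]                    # synthetic impulse
    mic_a = [0.0] * 20 + click + [0.0] * 40
    mic_b = [0.0] * 34 + click + [0.0] * 26      # same click, 14 samples later
    lag = cross_correlation_delay(mic_a, mic_b, max_lag=30)
    print(lag, doa_from_delay(lag, sample_rate=48000, mic_spacing=0.2))
```

With more than two microphones, each pair yields its own delay, and the delays are combined with the array geometry to refine the DOA estimate.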
Recently, a Deep Neural Network (DNN)-based approach has been proposed as an alternative to the parametric approaches (TDOA and IID) for sound source localization. Adavanne et al.
[38] employed a convolutional recurrent neural network (RNN) for the localization of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram timeframes as a multichannel input and maps it to a multi-output regression. The study shows that the proposed method is generic, applicable to any array structure, and robust to unseen DOA values, reverberation, and low-SNR scenarios.
2.4. Sound Abnormality Detection
Measuring sound abnormality is another challenging issue, given that preparing all labeled sound data related to hazardous situations is unrealistic. If a surveillance system is trained using the data of specific sounds, such as explosions or gunshots, it cannot be applied to detect other auditory events. A method for detecting unknown abnormal sound events is required to develop an automated auditory surveillance system that is useful in more general cases. To address this issue, Ref.
[39] developed a method that models the abnormality of sounds without using any samples of labeled sounds. The technique can detect sounds that rarely occur in a normal situation. It first processes sound in the usual situation and trains a statistical model of the normal sounds. After training the model, the system continues to process sounds and calculates the likelihood of each sound. If the likelihood value goes beyond a predefined threshold, that sound is considered abnormal. Lu et al.
[40] used another approach to detect abnormal sounds, based on a case-based identification algorithm. In this method, it is necessary to first convert the sound data into feature representation vectors and then to establish their distribution with a supervised learning model. This supervised learning requires a small training dataset of sample elements of abnormality.
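The likelihood-threshold idea described above can be sketched as follows. This is a minimal illustration, not the model of [39]: it assumes a single scalar feature per frame (here a hypothetical frame energy in dB), a one-dimensional Gaussian as the "statistical model of normal sounds", and a threshold placed at three standard deviations. The score is a negative log-likelihood, so exceeding the threshold means the sound is unlikely under the normal model.

```python
import math
import statistics


def fit_normal_model(features):
    """Fit a 1-D Gaussian to feature values recorded in the normal situation."""
    return statistics.mean(features), statistics.stdev(features)


def negative_log_likelihood(x, mean, std):
    """Score a new observation; larger values mean less likely under the model."""
    z = (x - mean) / std
    return 0.5 * z * z + math.log(std * math.sqrt(2.0 * math.pi))


def threshold_at_sigma(model, k):
    """Threshold equal to the score of a point k standard deviations away."""
    mean, std = model
    return negative_log_likelihood(mean + k * std, mean, std)


def is_abnormal(x, model, threshold):
    mean, std = model
    return negative_log_likelihood(x, mean, std) > threshold


if __name__ == "__main__":
    # Hypothetical frame energies (dB) from the usual, hazard-free soundscape.
    normal_energy_db = [60.2, 61.0, 59.5, 60.8, 60.1, 59.9, 60.4, 61.2]
    model = fit_normal_model(normal_energy_db)
    threshold = threshold_at_sigma(model, 3.0)
    for value in (60.5, 85.0):
        print(value, is_abnormal(value, model, threshold))
```

No labeled abnormal samples are needed: only recordings of the normal situation are used to fit the model, which is what makes this family of methods applicable to unknown events.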
In the construction industry, detecting abnormality in equipment sound is complicated by the fact that the sound of mobile equipment varies between different types of equipment, which affects model performance. This could be because the acoustic characteristics of one sound are more difficult to learn than those of another. Other equipment characteristics that may affect sound abnormality measurement are the model, brand, age, and maintenance program. Obtaining such metadata for the audio dataset would help us understand how these characteristics affect the detection capability. Collecting such data requires a significant effort in terms of time, cost, and human resources.
3. Audio-Based Surveillance in Construction
3.1. Feasibility of Implementing Audio-Based Hazard Detection in Construction
The potential auditory indicators of hazards in construction are provided in
Table 3. As shown, one of the most important types of sounds is the sound of moving heavy equipment or machinery, which can cause collision hazards
[30][33][41]. Furthermore, the detection of screams, shouts, or cries also supports acoustic surveillance and monitoring of negative situations, since human emotions are conveyed in sound events
[17][18][19][20][23][24][25][26][27][28][42][43][44][45][46]. Other sounds released by people can also help detect abnormal situations. Detecting ground ambient sound (i.e., a group of people) allows workers to be aware of violent events, natural disasters, riots, or chaotic acts in crowds
[26]. Detecting safety-critical sounds associated with dangerous situations, such as alarms for fire and earthquakes, explosions, and gunshots, could improve the situational awareness of hazards
[23][25][45][46]. For example, the detection of an explosion allows workers to stay away from the hazardous source
[16] and gunshot detection allows workers to stay away from a gun attack
[16][17][19][20][25][42][43].
Table 3.
Summary of auditory event characteristics in the construction field.
Using each type of auditory information individually is insufficient for evaluating hazardousness in construction, which requires simultaneous consideration of many factors, including the size of the equipment involved, the braking speed of a machine, the average reaction time of a worker, and the speed of the worker.
Table 4 presents a list of hazardous situations on construction sites along with the required auditory indicators. Specifically, hazardous situations include heavy equipment or machines being at an unsafe distance from workers, which is detected using the sound of moving equipment. An alert for equipment approaching or operating in an abnormal condition is raised if the direction of the moving sound source is toward the worker or if the sound is abnormal. Other situations, such as someone crying or shouting, a crowd approaching, alert alarms, an explosion, or a gunshot, can also indicate a hazardous situation. Defining hazardous situations that require quick and effective responses from construction workers is the priority of automated auditory surveillance in the construction field and could contribute to construction workers’ safety.
Table 4.
List of hazardous situations and required auditory event characteristics (see
Table 3 for the auditory event notations).
Other than the classification of sound, sound source location
[34][35][36][37], the direction of a moving sound source
[34][35][36][37], and sound abnormality
[39] are useful for detecting hazardous situations in the construction field. For example, information on the location and the direction of moving construction equipment, derived from the engine sound, could help raise an alert if a heavy construction vehicle is in proximity to a worker. Additionally, if construction equipment breaks down by falling, collapsing, colliding, or through engine failure, information on the abnormality of the sound could give a cue to provide proper assistance. An example is the noise produced by excessive vibration of equipment that is not usually expected to vibrate.
Existing frameworks for detecting safety cues rely solely upon a single type of information, for example, high-frequency filtering
[47], sound identity classification
[48][49], or direction of arrival of sound
[50].
3.2. State-of-the-Art Research in Auditory Signal Processing for Construction
There have been emerging studies on audio-based activity detection for improving construction management and productivity due to its advantages in terms of cost and applicability
[51]. Other efforts have aimed to develop new methods for assisting workers in hearing critical sounds, which is a crucial need given the typical heterogeneity of sounds generated from diverse construction work activities, including static equipment and hand tools
[52][53]. The examination of various OSHA accident reports by Hinze et al., 2011
[41] revealed that the heterogeneous nature of concurrent construction sounds (e.g., equipment sounds and alarms) indeed decreased workers’ safety awareness, since alarm signals may be drowned out or not audible enough for workers. They also reported that there were cases where multiple alarm signals issued warnings at the same time, influencing workers’ judgment and making the alert signals less effective or ineffective. However, due to the technical challenges of processing complex soundscapes, this problem has gained little attention from the academic community in the past decade.
Only a few papers were found related to this technology in construction management. In general, previous studies focused on tracking the activities of construction equipment, identifying working and operation activities, proximity detection and alert systems, and embedded sensory systems. Of those areas, the majority of studies aimed at monitoring the activities of heavy construction equipment to reduce operating costs
[30][54][55][56]. This is probably because a large portion of the expenses in a construction project is allocated to the operating costs of heavy equipment. For example, Ref.
[57] applied the Hidden Markov Model (HMM) and a frequency-domain technique on spectrogram data to accurately classify types of construction sounds and to identify patterns from each type of construction task. The strength and location of the classified sound signals are visualized with a Building Information Modeling (BIM) platform. The acoustic signals from construction activities were used to calculate working periods, allowing field managers to track work progress and productivity and providing a means to efficiently enhance project schedule management. Ref.
[58] worked on a sound monitoring system for the prevention of underground pipeline damage. To develop a dataset similar to what is found on a construction site, they collected sound data from working equipment that typically poses construction threats, including excavators, hammers, road cutters, and electric hammers. They also collected the background noise of a typical construction environment, such as pedestrians, traffic, and wind sound. Two random forest-based classifiers were developed to detect suspicious sounds and to help prevent pipeline damage caused by construction activities. The end goal was to create an alarm system that uploads a report if the duration of a construction threat sound exceeds a specified threshold value.
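The duration-based alarm rule mentioned above can be sketched as a simple run-length check over per-frame classifier decisions. This is only an illustration: the classifier itself is omitted, and the function name, frame length, and threshold are hypothetical rather than taken from [58].

```python
def threat_alarm_times(frame_labels, frame_seconds, min_duration_seconds):
    """Return start times (s) of threat runs longer than the duration threshold.

    frame_labels: per-frame decisions from some classifier (truthy = threat).
    frame_seconds: duration of one analysis frame in seconds (assumed fixed).
    """
    alarms = []
    run_start = None
    # A trailing False closes any run that is still open at the end.
    for i, is_threat in enumerate(list(frame_labels) + [False]):
        if is_threat and run_start is None:
            run_start = i
        elif not is_threat and run_start is not None:
            duration = (i - run_start) * frame_seconds
            if duration > min_duration_seconds:
                alarms.append(run_start * frame_seconds)
            run_start = None
    return alarms


if __name__ == "__main__":
    labels = [0, 1, 1, 0, 1, 1, 1, 1, 1, 0]   # hypothetical classifier output
    print(threat_alarm_times(labels, frame_seconds=1.0, min_duration_seconds=3.0))
    # -> [4.0]  (the 2 s run is ignored, the 5 s run triggers a report)
```

Requiring a minimum duration suppresses reports for brief, isolated detections, at the cost of a reporting delay equal to the threshold.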
Another study proposed a hybrid system for recognizing multiple construction equipment activities
[30]. The study trained a machinery task classifier on integrated audio and kinematic data using Support Vector Machines (SVM). The results indicate that the hybrid system is capable of providing up to 20% more accurate results compared with cases using individual sources of data such as images
[6][7][8], sensors
[9][10], and audio
[56][59][60]. The system allows construction managers to monitor and track productivity, detect equipment downtime and idle time, estimate equipment cycle time, and control equipment fuel use. Wei et al.
[61] also developed a noise hazard prediction method that combines a wearable audio sensor with Building Information Modeling (BIM) data to predict and visualize spatial noise distribution on BIM models.
There has also been a widespread usage of machine learning algorithms to train sound data for construction activity monitoring. Ref.
[59] implemented a supervised machine learning-based sound identification approach to enhance the monitoring of construction site activities. Ref.
[52] also developed a risk assessment framework using a machine learning algorithm. This method used the activity classification information from auditory signals to estimate safety risks based on the occupational injury and illness information contained in historical construction accident data. All these audio-based frameworks for detecting construction activities are effective, especially for night-time tasks, since activities on construction sites can be detected even at visibility levels that are not suitable for image-based approaches.
Another line of effort on auditory surveillance in construction is focused on the identification of construction collision hazards. The advanced computational techniques in auditory signal processing for collision hazard detection in construction are motivated by strong acoustic emissions from equipment operation, since construction equipment often produces unique sound patterns while performing certain activities (e.g., moving vs. idling)
[55], which can be used as an indicator of safety-critical cues or warning signals. For example, the research done by Lee and Yang
[47] used a high-frequency sound (18 kHz, inaudible to construction workers) to analyze the change in the Doppler effect caused by a single subject’s movements to prevent struck-by hazards. They installed a speaker on equipment that plays a predefined high-frequency sound. A smartphone carried by an on-foot worker was used to capture the sound produced by the speaker. The input signal was processed to extract the position of the equipment relative to the on-foot worker. The proposed technology was able to classify the movement direction and speed with 97.4% accuracy. Although the study demonstrates the potential for detecting collision threats from equipment, it still has some limitations. Firstly, only a struck-by hazard situation involving a single type of moving construction equipment was tested. Since the construction site is a complex environment in which multiple pieces of equipment work simultaneously, leading to signal overlap, the mixture of similar sounds would prevent the recognition of the movement of individual pieces of equipment. Another limitation is that the sound source was a speaker attached to the equipment, not the sound produced by the mobile equipment itself. This means that deployment will require the expensive installation of speakers on every piece of equipment present on the job site.
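To make the Doppler principle behind this approach concrete, the sketch below uses the textbook relation for a moving source and a stationary listener, f_obs = f_src * c / (c - v), where v is the source speed toward the listener. It is not the processing pipeline of [47]; the stationary-worker assumption, the 343 m/s speed of sound, and the function names are assumptions for illustration only.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def observed_frequency(f_source, radial_speed):
    """Frequency heard by a stationary listener.

    radial_speed > 0 means the source (speaker on the equipment) approaches;
    radial_speed < 0 means it moves away.
    """
    return f_source * SPEED_OF_SOUND / (SPEED_OF_SOUND - radial_speed)


def radial_speed_from_shift(f_source, f_observed):
    """Invert the relation above to recover the approach speed in m/s."""
    return SPEED_OF_SOUND * (1.0 - f_source / f_observed)


if __name__ == "__main__":
    tone_hz = 18000.0                           # predefined tone, as in the text
    heard = observed_frequency(tone_hz, 2.0)    # equipment approaching at 2 m/s
    print(round(heard, 1), radial_speed_from_shift(tone_hz, heard))
```

At 18 kHz, an approach speed of 2 m/s shifts the tone by roughly 100 Hz, so the sign of the shift indicates direction (toward or away) and its size indicates speed.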
Recent studies by Refs.
[48][49] developed sound classification models that can distinguish between mobile equipment and stationary equipment to support collision hazard detection. These studies collected and synthesized the sound of construction equipment at different signal-to-noise ratios and used the dataset to develop a machine learning model using a CNN for automated detection of mobile equipment occurrence. The efficacy of this model was tested on a real construction site, and the model achieved an accuracy of 99% in detecting sounds related to collision hazards when the signals were not buried in background noise. Compared with earlier efforts, Refs.
[48][49] offer a clear advancement, as their models are able to deal with complex soundscapes with overlapping sound sources, including mobile and stationary equipment and background noise (e.g., workers communicating, materials’ movement, and street noises). Another study that utilized equipment sounds for collision hazard assessment was performed by Ref.
[50], which aimed at localizing the sound sources using Direction of Arrival (DOA) signal processing techniques. The determination of sound location can supplement the sound classifiers developed by Refs.
[48][49] to enable a more comprehensive assessment of hazardous situations. This information is vital for construction workers and safety engineers to reduce false alarms by only notifying workers if they are in a danger zone, based on the distance and direction of the hazard. For example, a mobile piece of equipment moving further away from an on-foot worker does not pose a hazard.
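Synthesizing training sounds at different signal-to-noise ratios, as mentioned above, generally amounts to scaling a background-noise recording before adding it to the equipment sound. The sketch below shows one common way to do this; it is an assumption about the general technique, not the procedure used in [48][49], and the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a signal given as a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so that signal power / noise power equals snr_db, then add.

    Returns the mixture and the scaled noise (both truncated to the shorter input).
    """
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(signal, scaled)]
    return mixture, scaled


def snr_db(signal, noise):
    """Measured signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


if __name__ == "__main__":
    equipment = [math.sin(0.1 * n) for n in range(1000)]           # stand-in signal
    background = [((n * 37) % 11 - 5) / 5.0 for n in range(1000)]  # stand-in noise
    for level in (20.0, 10.0, 0.0):
        _, scaled = mix_at_snr(equipment, background, level)
        print(level, round(snr_db(equipment, scaled), 6))
```

Lower SNR values bury the equipment sound deeper in the background, which is the condition under which the reported accuracy no longer applies.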
Quantitative benchmarking of existing frameworks for preventing struck-by hazards is very difficult because they were all tested under nonidentical conditions (e.g., software, hardware, job-site characteristics, and assumptions). Nevertheless, some performance metrics, such as recall, cost, computational power, and data usage, can still be used for a fair comparison. In terms of performance, they all yielded a competitive recall: 99%
[48], 98%
[62], 99%
[48][49], and 100%
[63]. Cost is another metric for comparing past efforts to prevent struck-by hazards with mobile equipment. Audio-based collision hazard detection by Ref.
[48] requires little financial investment, as the model can be quickly deployed on workers’ smartphones, while a relatively high deployment cost is needed for many sensor devices
[63] or high-quality cameras
[62][64].