Due to the one-off nature of construction projects and organizations, practitioners and academics tend to rely on experience-based information to make decisions. Experience-based methods, such as brainstorming, the Delphi method, questionnaires, interviews, literature studies, and combinations thereof, are primarily used to collect data and information [6][7][8]. This may lead to biased judgements, and the results are limited by the number and expertise of the experts involved. In particular, with the development of the construction industry, the density of information is increasing. The large amount of information (over 80%) stored in daily project management documents or standardized texts makes information retrieval and management with traditional methods difficult [9][10]. Text documents contain a great deal of potentially valuable information, but it is difficult for individuals or researchers to transform this information into knowledge. Therefore, in the context of Industry 4.0, emerging information technologies need to be introduced for the mining and analysis of text-based information in the construction field.
2. Text-Based Analysis Methods
Text analysis techniques used in construction-related text mining research typically involve the following phases: corpus acquisition, pre-processing, text representation, and model training. Here, the corpus refers to paper or electronic text (e.g., Word, PDF, Excel) and image files.
2.1. Text Pre-Processing
Text pre-processing, including text cleaning, error correction, formatting, word segmentation, part-of-speech (POS) tagging, and stop-word filtering, is the preliminary work performed on the original text in order to adapt it to a machine-readable form. For word segmentation and POS tagging, a number of mature open-source NLP tools are currently available; see Table 1. The ICTCLAS Chinese word segmentation system has been employed in the field of construction engineering, in order to segment and POS-tag construction quality acceptance requirements and documents in the area of coal mine safety [12]. Safety risk management for urban rail transit construction has been conducted by utilizing the LTP toolkit [13]. Xue and Zhang [14] have pointed out that generic lexicons are limited and that the performance of open-source pre-processing tools may degrade when dealing with domain-specific documents. Therefore, future studies will also require the manual building of dictionaries and ontologies relevant to the construction domain [15].
Table 1. NLP open-source tools and implemented functions.
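These pre-processing steps can be illustrated with a short Python sketch. jieba is used here purely as a convenient open-source stand-in for the segmenters discussed above (e.g., ICTCLAS, LTP), and the added domain terms and stop-word list are toy assumptions, not a curated construction lexicon:

import jieba
import jieba.posseg as pseg

# Hypothetical domain terms registered with the segmenter; a real study
# would load a manually built construction dictionary instead.
jieba.add_word("盾构机")    # "shield tunneling machine"
jieba.add_word("安全隐患")  # "safety hazard"

STOP_WORDS = {"的", "了", "在", "和", "对"}  # toy stop-word list

def preprocess(text):
    """Segment, POS-tag, and stop-word-filter a Chinese sentence."""
    return [(pair.word, pair.flag)
            for pair in pseg.cut(text)
            if pair.word.strip() and pair.word not in STOP_WORDS]

print(preprocess("盾构机在隧道施工中存在安全隐患"))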
2.2. Text Representation
Text representation (i.e., text feature generation) digitizes text into machine-readable data structures, such as vectors or matrices. Table 2 provides a brief overview of current feature generation methods, based on modern developments. Traditional NLP techniques extract features from text data by analyzing the syntactic structure. A literature search revealed that the vector space model (VSM) is relatively simple and dominates among feature generation methods. While term frequency (TF) has historically been popular as a metric for identifying key features, TF-IDF, first proposed by Jones [16], has become the main method for determining feature weights in documents. With the development of computer technology, deep learning methods based on neural networks have emerged, including Word2Vec, ELMo, and BERT.
Table 2. Feature generation methods and introduction.
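As a concrete illustration of VSM/TF-IDF feature generation, the following minimal sketch (using scikit-learn; the three-document corpus is invented) turns short documents into a sparse document-term matrix of TF-IDF weights:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "crane lift plan approved by site engineer",
    "scaffold inspection report submitted",
    "crane inspection scheduled before the lift",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # rows = documents, columns = TF-IDF term weights
print(X.shape)                       # (3, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])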
2.3. Model Training
The last phase in text mining is model training, which uses the previously created features to carry out various tasks, such as document classification, incident analysis, and compliance evaluation. Several algorithmic models that appeared frequently in the literature analysis are listed in Table 3. The majority of earlier studies employed conventional machine learning techniques, such as SVM, KNN, and CRF, with SVM models generally outperforming the others. Convolutional neural networks (CNN), recurrent neural networks (RNN), bidirectional long short-term memory (Bi-LSTM), and other neural network architectures have received significant attention in recent years. The BERT model was put forward by Google in 2018, and since then, self-attention mechanisms have been applied in the construction field. In the years to come, the number of publications on self-attention-based approaches for construction text analysis is anticipated to rise [17].
Table 3. Training model and introduction.
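A minimal sketch of the training phase, pairing TF-IDF features with an SVM classifier via a scikit-learn pipeline; the four labelled snippets and the two class names are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = [
    "rebar delivery delayed by two days",
    "worker slipped on wet decking",
    "concrete pour rescheduled to friday",
    "fall from scaffold reported on site",
]
labels = ["schedule", "safety", "schedule", "safety"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())  # features + SVM in one pipeline
model.fit(train_docs, labels)
print(model.predict(["ladder fall near the access ramp"]))  # expected: ['safety'] on this toy data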
3. Current Theme and Topic Analysis
The results of the keyword analysis provided in Section 3.4 were slightly modified, based on a thorough reading of the 185 articles. The selected articles were grouped into four application directions: Document Management (DM), Automated Compliance Checking (ACC), Safety Management (SM), and Risk Management (RM). In order to better structure the analysis of the selected articles, these four main categories were further sub-divided according to text mining tasks, as detailed in Table 4.
Table 4. Statistical results of articles by category and task.
3.1. Document Management
The main research objectives of DM can be divided into the following three areas: knowledge extraction, knowledge retrieval, and document classification/clustering.
In document knowledge extraction, Al Qady and Kandil [26] have used NLP techniques to parse contract documents into noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). By identifying subject-verb-object triples <subj, VP, obj>, they extracted contextually relevant semantic knowledge to improve functions such as document classification and retrieval, with an F-measure of 90%. Ren et al. [70] have proposed a semantic rule-based information extraction method to automatically extract construction execution steps from construction procedure documents, reducing the workload of manually collecting such information while achieving an accuracy of 97.08% and a recall of 93.23%.
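The triple-extraction idea can be sketched with off-the-shelf dependency parsing. This is not the procedure of [26] or [70], only a crude analogue using spaCy (assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Collect <subj, VP, obj> triples from subject/object dependency arcs."""
    doc = nlp(text)
    triples = []
    for tok in doc:
        if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
            for child in tok.head.children:
                if child.dep_ in ("dobj", "obj"):
                    triples.append((tok.text, tok.head.lemma_, child.text))
    return triples

print(extract_triples("The contractor shall submit the revised schedule."))
# -> [('contractor', 'submit', 'schedule')]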
Owing to the availability of data and algorithms, document retrieval was the main topic of research, particularly in the early years, and NLP was progressively applied to numerous retrieval applications up until 2015. Document knowledge retrieval is often accomplished by comparing the similarity of two representation vectors. Using TF-IDF and cosine similarity, Li and Ramani [71] have created an ontology-based design document query system that outperformed keyword-based search methods. Utilizing a Bayesian classifier to retrieve feature documents through similarity matching, Yu and Hsu [29] have developed a technique to reduce the dimensionality of the VSM, enabling the automatic and rapid retrieval of CAD documents from 2094 Chinese-annotated CAD drawings gathered from two actual building projects.
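A minimal retrieval sketch in the spirit of the TF-IDF-plus-cosine-similarity approach; the corpus and query are invented, and scikit-learn supplies both components:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "crane lift plan for tower block",
    "scaffold inspection report",
    "fire safety drill log",
]
query = "scaffold safety inspection"

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)          # index the document collection
query_vec = vectorizer.transform([query])            # represent the query in the same space
scores = cosine_similarity(query_vec, doc_vecs)[0]   # one similarity score per document
best = scores.argmax()
print(corpus[best], round(float(scores[best]), 3))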
In document classification/clustering, the general workflow consists of pre-processing, text representation, and classification modelling. In the early literature, Caldas and Soibelman [20] have implemented an automated document classification system that classifies construction project documents according to project components, with an average classification accuracy of 92.05% across the three hierarchy levels. The recent research of Hassan and Le [72] has classified contract language into requirement and non-requirement text using Word2Vec and SVM, in order to shorten reading times and enhance comprehension of the contract scope. As a supplement to analytical tools such as CiteSpace, some researchers [73] have recently used LDA topic modelling to automatically assign one or more topics to documents, thereby achieving document tagging, which is used to analyze historical documents and extract clustered subject terms.
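A minimal LDA topic-tagging sketch in the spirit of the document-tagging work above; the four snippets and the choice of two topics are invented, and scikit-learn is used for illustration:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "crane lift plan approved",
    "fall from scaffold reported",
    "lift schedule updated for crane",
    "scaffold guardrail found missing",
]

counts = CountVectorizer()
X = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):   # the top words of each topic act as a tag
    print(k, [terms[i] for i in weights.argsort()[-3:]])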
3.2. Automated Compliance-Checking
Automated compliance-checking using NLP techniques is another hot topic in text mining applications. It requires understanding and extracting constraints from various building regulation documents, and then converting them into a formal format that allows for checking and reasoning. Two authors, Zhang, J. and El-Gohary, have made significant contributions to this field. In 2015, Zhang and El-Gohary [40] proposed extracting rules based on pattern matching and conflict resolution rules. In the same year, they [38] proposed a bottom-up conversion method based on semantic mapping and conflict resolution rules, in order to extract constraints and convert them into first-order logic expressed in Prolog syntax. Building on this research, Zhang and El-Gohary [74] have extracted regulatory concepts and Industry Foundation Classes (IFC) concepts from compliance documents. They then identified the relationships between each pair of regulatory and IFC concepts to create extended IFC schemas. At present, NLP-based compliance checking is mainly used to assess architectural designs [41] and work process dependencies [75].
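The extract-then-formalize idea can be caricatured with a single regular-expression pattern that maps one requirement template to a Prolog-style clause. The pattern, predicate names, and clause shape are all illustrative assumptions, not the actual method of [38][40]:

import re

# Illustrative template: "<element> shall be at least <value> <unit>"
RULE = re.compile(
    r"(?P<element>[\w\s]+?) shall be at least (?P<value>\d+(?:\.\d+)?) (?P<unit>\w+)"
)

def to_clause(sentence):
    """Map a matching requirement sentence to a Prolog-style clause."""
    match = RULE.search(sentence.lower())
    if match is None:
        return None
    element = match.group("element").strip().replace(" ", "_")
    value, unit = match.group("value"), match.group("unit")
    return f"compliant({element}) :- quantity({element}, V, {unit}), V >= {value}."

print(to_clause("Concrete cover shall be at least 40 mm"))
# -> compliant(concrete_cover) :- quantity(concrete_cover, V, mm), V >= 40.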
3.3. Safety Management
The scope of safety management includes scheduling, cost, the construction process, and so on. Rupasinghe et al. [65] have used support vector machines (SVM), linear regression (LR), k-nearest neighbors (KNN), decision trees (DT), naive Bayes (NB), and ensemble models to analyze construction accident reports and classify the causes of accidents. Tixier et al. [51] have developed a manual rule-based NLP program to automatically extract attributes and outcomes from injury reports, with an F1 score of 96%. Chi et al. have combined TF-IDF, principal component analysis (PCA), and SVM to classify accident categories from documents.
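A sketch of this TF-IDF, dimensionality-reduction, and SVM pipeline; TruncatedSVD stands in for PCA here because it accepts sparse TF-IDF matrices directly, and the labelled reports are invented:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

reports = [
    "worker struck by falling brick",
    "electric shock from exposed cable",
    "slip on wet walkway near gate",
    "shock received from faulty drill",
]
causes = ["struck-by", "electrical", "slip", "electrical"]

pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # sparse-friendly analogue of PCA
    SVC(kernel="linear"),
)
pipeline.fit(reports, causes)
print(pipeline.predict(["cable shock injury reported"]))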
3.4. Risk Management
Risk management is broadly defined as the measurement and assessment of, and the development of contingency strategies for, all aspects of the construction production process. Current text mining research related to risk management focuses on risk factor identification and analysis, as well as risk prediction. Siu et al. [62] have applied NLP software to identify 16 new risk categories for engineering contracts from unstructured text descriptions of NEC projects in Hong Kong, and used decision trees to analyze risk ratings. Kim and Kim [63] have identified factors related to building fire accidents from news articles, and then analyzed the main factors causing fire accidents in different seasons using principal component analysis (PCA). Li et al. [76] have developed four main safety accident data sets in which the documents were represented by doc2vec vectors. As new incident reports emerged, the most similar data sets were selected based on doc2vec similarity, in order to share key factors that predict injury levels. Deep learning models were then recommended for the data sets based on their meta-features (e.g., the proportion of categorical factors), in order to maximize prediction performance. Xu et al. [77] have proposed an information entropy-weighted term frequency (TF-H) for assessing term importance in the case of Chinese metro construction projects, extracting 37 safety risk factors from 221 metro construction accident reports.
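The exact TF-H formulation of [77] is not reproduced here; the sketch below assumes one plausible reading, in which a term's raw frequency is weighted by the Shannon entropy of its distribution across accident reports, so that factors recurring evenly across many reports are up-weighted. The reports and the combination rule are both illustrative assumptions:

import math
from collections import Counter

# Toy tokenized accident reports (invented).
reports = [
    ["scaffold", "collapse", "rain"],
    ["gas", "leak", "tunnel"],
    ["scaffold", "overload", "collapse"],
]

def tf_h(reports):
    """Assumed TF-H variant: term frequency x (1 + entropy across reports)."""
    totals = Counter(term for report in reports for term in report)
    weights = {}
    for term, total in totals.items():
        probs = [report.count(term) / total for report in reports if term in report]
        entropy = -sum(p * math.log2(p) for p in probs)
        weights[term] = total * (1 + entropy)
    return weights

ranked = sorted(tf_h(reports).items(), key=lambda kv: -kv[1])
print(ranked[:3])  # highest-weighted candidate risk terms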