Data Harmonization: History

Data harmonization (DH) is a field that unifies the representation of data of disparate natures. Over the years, multiple solutions have been developed to minimize the heterogeneity and format disparity of big-data types.

  • data harmonization
  • heterogeneous data
  • text preprocessing

1. Introduction

Big Data plays a vital role in the assessment of the massive data produced every second by real-world applications, using dedicated tools and algorithms [1]. Real-life application domains of Big Data include healthcare, telecommunication, financial firms, retail, law enforcement, marketing, new product development, banking, energy and utilities, insurance, education, agriculture, and urban planning, as discussed in Reference [2]. Nowadays, data are produced in various formats, ranging from structured and semi-structured to unstructured (SSU), and are generated by heterogeneous resources [3,4]. Data of such a disparate nature cannot be processed with simple tools and techniques [2,5], which makes it challenging for decision-makers to base decisions on scattered data. Emerging technologies, such as the Internet of Things (IoT), Industry 4.0 (I4.0), and extended reality (XR), produce distinct kinds of information via heterogeneous sources and real-world applications, creating heterogeneity issues [6] in IoT integration, security, analytics, and computational time [7,8,9]. Among the proposed solutions, data harmonization (DH), which describes the uniform representation of heterogeneous data, was introduced in References [10,11].

IoT is a system of interrelated computing objects, such as unique tags, RFID, or machine interactions, that can transfer data without requiring human-to-human or human-to-machine interaction [12]. As technology has evolved, the IoT has grown into the Industrial IoT (IIoT), which deals with heterogeneous data produced by real-world applications, industrial products, and devices, such as privacy-authentication logs of IIoT devices [13], business-architecture device data [14], and heterogeneous IIoT device data [6]. In addition, I4.0 deals with IoT-based automation, technologies, and decision-making that help decision-makers to make decisions based on the disparate nature of the data produced [15]. Applications of I4.0 include higher education, predictive maintenance [16,17], food logistics [18], knowledge management [19], business [20], and the supply chain [21]. The main problem faced by these applications is managing the heterogeneous data produced in bulk by employing I4.0 and the IIoT. The data produced by industries include digital data for manufacturing purposes, unstructured data for predictive maintenance, customer data for food logistics, customer reviews for knowledge management, business data for the supply chain, and manufacturing data for the supply chain. XR deals with real and virtual environments with the help of machine and human interaction [22]. XR brings heterogeneous manufacturing data into the digital world, and its tools must advance so that user acceptance and better product usability are achieved [23]. AI can be used effectively to address the disparate nature of manufacturing data and thereby best serve the XR industry [24].

To resolve the problems mentioned earlier, the disparity of data needs to be reviewed in detail, so that data harmonization models, tools, techniques, algorithms, and their performance can be evaluated for extensive heterogeneous textual information. Although related work has been carried out on multimodalities for text, image, audio, and video [25,26,27], no such studies have highlighted the work associated with textual data, core data harmonization techniques, and performance measurement. Multiple studies have dealt with applications such as sentiment analysis, text similarity, word embedding, and emotion recognition with the help of classification and clustering techniques. Therefore, solving real-world application problems, such as those of a medical and healthcare nature, requires data to be harmonized and uniformly represented, so that decisions can be made efficiently. Based on the needs and contributions of emerging technologies and real-world application domains, we aimed to conduct a systematic review of the literature that could demonstrate the heterogeneity issues faced by real-world applications, data harmonization as a solution architecture for the disparate nature of data, techniques that can deal with large heterogeneous textual datasets, and the performance assessment of models.

2. How Does Data Harmonization Resolve the Issues of Heterogeneity?

In this section, 25 studies were selected which discuss data harmonization, data integration, and data fusion. The details of each study are discussed below.
Initially, heterogeneous oil and gas data are unorganized and difficult to manage. To address this, data harmonization was proposed by Danyaro and Liew [43], using semantic web and BD tools, and the resulting precision, recall, and F-score were found to be better than those of existing techniques. In addition, agriculture data are stored in clusters, and such heterogeneous data are difficult to handle. Therefore, a uniform format was reported by Sambrekar, Rajpurohit, and Joshi [44], using Couchbase and NoSQL, and it was found that records could be fetched quickly. Apart from this, different frameworks have been developed by different organizations to support decisions, but no framework had been proposed for value creation. In their study, Saggi and Jain [45] created a framework for value creation from SSU data and also highlighted in depth the issues of heterogeneity, harmonization, and BD techniques. Their work shows the importance of data integration for industrial data, decisions, product reviews, and the visualization of future strategies; artificial intelligence, ML, and cloud computing will be helpful for BD analysts. Moreover, Li, Chai, and Chen [46] summarized that heterogeneous data in industry are produced easily but are difficult to store, manage, and audit; in their study, the heterogeneity issue of large firms was solved using a NoSQL-based data integration model. Furthermore, health data are very important for patient treatment, monitoring, and satisfaction. Health data are generated by all institutes using open-source web data, but no online module had been proposed to integrate them in a centralized, web-based manner. In their study, Hong, Wang, et al. [47] presented a web-based FHIR visualization tool, using a standard structured-format API. Again, Lopes, Bastião, and Oliveira [48] described that file-sharing between users was difficult for heterogeneous data; therefore, a real-time integration and interoperability model was developed using PostgreSQL to facilitate different users.
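Several of these studies build on the HL7 FHIR standard, which exposes clinical resources through a REST API. The following minimal Python sketch illustrates the general pattern of fetching Patient resources as JSON; the server URL is a hypothetical placeholder, and the snippet is not taken from any of the cited tools.

    import requests

    # Hypothetical FHIR server base URL (placeholder, not from the cited studies)
    FHIR_BASE = "https://example-fhir-server.org/baseR4"

    # Standard FHIR search request: GET {base}/Patient?name=smith, JSON representation
    response = requests.get(
        f"{FHIR_BASE}/Patient",
        params={"name": "smith"},
        headers={"Accept": "application/fhir+json"},
        timeout=10,
    )
    response.raise_for_status()
    bundle = response.json()  # search results come back as a FHIR Bundle resource

    # Each Bundle entry wraps one Patient resource
    for entry in bundle.get("entry", []):
        patient = entry["resource"]
        print(patient.get("id"), patient.get("name"))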
In addition, Yuan, Holtz, Smith, and Luo [49] mentioned that child-patient disorder/condition data were complex and unmanageable due to manual work and human involvement. To overcome this issue, different preprocessing, NLP, and ML tools are used to create patient data in digital form and without any biases; the performance on the autism spectrum disorder data is calculated using precision and recall. Furthermore, Daniel [50] highlighted the issues and challenges faced by educational institutes and researchers, such as data integration and sharing between campuses and branches. Besides this, free-text or unstructured data in healthcare create issues for management and storage; therefore, data fusion was suggested by Kraus et al. [10] to manage the heterogeneous data. Moreover, in an online learning system, data need to be integrated and efficient for smart educational systems. Data processing and storage for audio, video, image, and text formats were developed by Dahdouh, Dakkak, Oughdir, and Messaoudi [51] with the help of Hadoop, MapReduce, and Spark; as a result, smart decisions can be taken within seconds. Additionally, Patel and Sharma [52] explained the various issues of data harmonization in their survey. Previously, data warehousing and OLAP were used, which do not support huge open-source datasets in unstructured formats; in the end, different BD and ML techniques are suggested for dealing with huge data. Consequently, as identified by Alguliyev, Aliguliyev, and Hajirahimova [53], data in the oil and gas industry are generated in operational formats from different clusters at a time, which requires data integration to collect them in a centralized place for making timely decisions.
Wang [54] mentioned that disparate data are generated in unstructured formats, such as sensor and text data, which describe heterogeneous behavior. For this reason, a data integration model was developed to solve the technical and quality problems of BDA. The model was developed using ML and DL techniques so that BD analysts could visualize, analyze, and make decisions from disparate data. Additionally, Chondrogiannis et al. generated a tool for clinical data in heterogeneous form; for data integration, an ontology-based tool was suggested to arrange the data in a structured format. Moreover, patient cohort and biomedical data play an important role in prior health treatment and analysis, and data provided by patients in a heterogeneous structure need to be harmonized, as argued by Kourou et al. [11], so that all patient data are available to medical staff in an online tool during analysis. In this survey, different cohort harmonization techniques, such as ML, DL, and ontology techniques, were highlighted, which will help in healthcare applications. In addition, Souza et al. [55] mentioned many issues related to basic needs in an urban town; the objective of that study was to turn the urban town into a smart urban town. Data are generated by different departments as JSON, strings, and maps, and to make smart decisions, all of these data must be integrated.
Furthermore, Scheurwegs, Luyckx, Luyten, Daelemans, and Van den Bulcke [56] reported that hospital patient-stay data with different codes were not publicly available for turning health records into an EHR. By using Naïve Bayes and Random Forest on the UZA dataset, patient classification was performed. Similarly, the researchers Jayaratne et al. [57] stated in their study that the web-portal-based patient data produced by many healthcare hospitals in different formats made decision-making difficult due to decentralization. To solve this issue, an automated and centralized web portal was developed that helps with online decisions. In contrast, the research team of Hong, Wen, Stone, et al. [58] analyzed patients with obesity and comorbidities who were monitored after discharge from hospital. The objective of that study was to develop a patient-centric FHIR system using NLP toolkits and ML algorithms on the Mayo Clinic, MIMIC-III, and i2b2 datasets. The overall performance of this system is measured in precision, recall, and F-score. In addition, the same authors, Hong et al. [59], proposed a model for quality- and performance-based data integration for information extraction, using NLP, ML, and Bag of Words (BoW). Moreover, Hong et al. [60] used a Mayo Clinic dataset with the help of NLP toolkits to build a digital FHIR system. In contrast, Chen, Zhong, Yuan, and Hu [61] conducted a review and suggested a unified model for SSU data, using MapReduce. Besides that, Legaz-García, Miñarro-Giménez, Menárguez-Tortosa, and Fernández-Breis [62] presented a semantics-oriented data harmonization model in which XML-based OGOLOD datasets were accessed using ontology tools.
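Several of the clinical studies above combine Bag-of-Words features with classifiers such as Naïve Bayes or Random Forest and report precision, recall, and F-score. The sketch below, with made-up sentences and toy labels rather than the UZA or Mayo Clinic data, shows one common way such a pipeline can be assembled with scikit-learn.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import precision_recall_fscore_support

    # Toy clinical-style sentences (illustrative only, not real patient data)
    train_texts = [
        "patient reports obesity and hypertension",
        "routine follow-up with no comorbidities",
        "obesity with type 2 diabetes noted",
        "healthy adult seen for annual physical exam",
    ]
    train_labels = [1, 0, 1, 0]  # 1 = obesity/comorbidity mentioned

    test_texts = ["discharge note mentions obesity", "normal check-up, no issues"]
    test_labels = [1, 0]

    vectorizer = CountVectorizer()                    # Bag-of-Words features
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    clf = MultinomialNB().fit(X_train, train_labels)  # Naïve Bayes classifier
    predictions = clf.predict(X_test)

    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predictions, average="binary"
    )
    print(f"precision={precision:.2f} recall={recall:.2f} F-score={f1:.2f}")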
In Saudi Arabia, patient health data generated in public and private hospitals are not shared and integrated with the health information system because of data heterogeneity. Therefore, Banu, Kuppuswamy, and Sasikala [63] proposed NLP- and BDA-based systems. Lastly, Hong, Wen, Shen, et al. [64] developed online FHIR-based web portals, using NLP techniques and open-source tools on the Mayo Clinic dataset, to centralize the data generated in heterogeneous formats. The contributions of all studies in all domains are discussed in Table 7.
Table 7. RQ2 domain and contributions.
Study Reference Domain Contributions
[43] Oil and Gas High-performance measure
[44] Agriculture High performance, high availability, and high scalability, using the latest techniques
[45] General-Purpose Data generation, storing, fetching, analysis, visualization, and decision-making
[46] Banking Helps in auditing the multisource data
[47] Healthcare Facilitates navigation of HL7 FHIR core resources
[48] General-Purpose Delivers automatic services to an interoperable system
[49] Healthcare Helps in developing an automatic system for patients with disorders
[50] Education To motivate researchers and academicians about the latest techniques
[10] Healthcare Useful for decisions of scientific, clinical, and administrative work
[51] Education Facilitates online learning, storage, processing, and academic activities
[52] General-Purpose Recommendation system, opinion mining, and parallelism can be targeted
[53] Oil and Gas Helpful for decision-makers during exploration, drilling, and production
[54] General-Purpose Facilitates data fetching and performance measurement
[65] Healthcare Helpful for disease prevention, tracking, and policy-making
[11] Healthcare Helps in boosting statistical power of sustainable and robust data
[55] Infrastructure Geography-based smart city for aggregation, visualization, and analysis
[56] Healthcare Helps in predicting the clinical codes of patient stays
[57] Healthcare Helps in patient-centered care decision-making among stakeholders
[58] Healthcare Helps in finding the patient having obesity and comorbidities
[59] Healthcare Helps in developing patient diagnostic criteria and representation
[61] General-Purpose Support in integration, storage, computation, and visualization
[62] Healthcare Open biomedical repositories can be developed in semantic web formats
[60] Healthcare Normalizing and integration of structured and unstructured EHR data
[63] Healthcare Helps health information system to keep a record of patients’ data
[64] Healthcare Helps in standardizing clinical data normalization

3. Which Techniques Are Being Used for Solving the Harmonization Issue for Large Textual Datasets?

In previous studies, SSU heterogeneous data were used in the form of text, images, audio, video, and social media formats. The BD and BDA literature has proposed many models and frameworks for data harmonization or integration. Among them, textual data play an important role, involving the semantic, syntactic, and schematic aspects of large datasets. In different industries, different approaches are used by BD analysts to meet the demands of users and owners.
In this section, 16 studies have been selected that highlight the core techniques and their contributions in terms of performance, time, and accuracy in data harmonization, data integration, and data fusion. The details of each study are discussed below.
At first, Tekli [66] found that, in the entertainment industry, the feedback given by the audience comes in the form of long sentences, and extracting semantic meaning from such XML documents is very challenging. Additionally, Sanyal, Bhadra, and Das [67] pointed out that sentence similarity can be retrieved by using a business intelligence tool, and the technique proposed for the IT ecosystem has been adopted by business firms. Apart from that, in the health sector, data are also important for harmonization, as noted by Adduru et al. [68]. They also discussed how clinical datasets contain many clinical codes, making information retrieval and text classification difficult. NLP techniques, such as N-grams, Jaccard similarity, Word2Vec, and different DL approaches, are used to create a paraphrase dataset from clinical data. Similarly, the research team of Mujtaba et al. [69] revealed, in a clinical-text-classification review, that approaches for textual data play an important role, especially supervised ML techniques. Likewise, Yanshan Wang et al. [70] presented the view that a medical prescription is a document of proof about a patient's health history recorded during diagnosis, but it is sometimes difficult to understand the semantics of the prescribed medicines. In that study, the Mayo Clinic dataset was utilized with the help of NLP techniques to find semantic and similarity scores of medical texts. On the contrary, a study by Chen, Hao, Hwang, Wang, and Wang [71] states that healthcare communities manage healthcare data on web-based portals, but these data are not available to all medical practitioners. For the prediction of chronic diseases, ML classification algorithms, such as CNN, NB, KNN, and DT, are used for analysis. Besides that, Pathak and Lal [72] focused on heterogeneous datasets based on open-source files and used modified IDF cosine similarity for information retrieval. A very detailed and descriptive survey was carried out by Torfi, Shirvani, Keneshloo, Tavvaf, and Fox [73]; in this survey, different open-source datasets for NLP tasks, using different DL methods and BERT models, were discussed for text summarization and word embedding. In addition, Wu, Zhao, and Li [74] proposed that phrases in NLP models be vectorized by using the phrase2Vec model to overcome the issues of BoW and preprocessing. In the same way, Moscatelli et al. [75] stated that patient data are very critical and that sharing them is possible with high-security algorithms; by using NoSQL, MongoDB, and NLP techniques on XLS, CSV, and TXT files, data acquisition and simulation are possible. Similarly, Chen, Du, Kim, Wilbur, and Lu [76] also emphasized that, with the use of advanced technology, the health sector can be upgraded. Furthermore, health records can be clinical data in digital form and support multiple formats, but it is not easy to fetch similar data from digital records without the latest text-mining techniques; DL-based entities fetched from STS datasets are combined with rich features. Despite this, Malawi and Sasi [77] found that data from the large Enron email dataset can be extracted by using NLP and sentiment analysis to make them available in a structured format. Furthermore, Eke, Norman, Shuib, and Nweke [78] noted that the other parts of NLP are also important; in their study, lexical analysis and ML-based detection of emotional behavior in text messages were used to check the level of criticism or hurt in the Sarcasm dataset.
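Sentence-similarity measures such as Jaccard similarity and (modified) IDF or TF-IDF cosine similarity recur throughout these studies. The following small sketch, using invented example sentences, shows how the two basic measures are typically computed in Python; it illustrates the general techniques rather than reproducing any cited method.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def jaccard_similarity(a: str, b: str) -> float:
        """Jaccard similarity over lower-cased token sets."""
        set_a, set_b = set(a.lower().split()), set(b.lower().split())
        return len(set_a & set_b) / len(set_a | set_b) if (set_a | set_b) else 0.0

    s1 = "the patient was prescribed aspirin for chest pain"
    s2 = "aspirin was prescribed to the patient for pain"

    print("Jaccard similarity:", round(jaccard_similarity(s1, s2), 3))

    # TF-IDF cosine similarity between the two sentences
    tfidf = TfidfVectorizer().fit_transform([s1, s2])
    score = float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
    print("TF-IDF cosine similarity:", round(score, 3))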
Moreover, biomedical text mining has been performed by using the text preprocessing, clustering, classification, and information-extraction techniques mentioned by Allahyari et al. [79]. This led García, Ramírez-Gallego, Luengo, Benítez, and Herrera [80] to focus on Indian regional multilingual data processed with the help of natural-language-processing techniques. Finally, Harish and Rangan [81] suggested that text be processed through ML and DL algorithms for semantics, while BD processing of huge data is performed by using BD tools and libraries.
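As a rough illustration of the preprocess-then-cluster step that Allahyari et al. [79] describe for biomedical text mining, the sketch below groups a handful of invented documents using TF-IDF features and k-means; the documents and the number of clusters are assumptions made purely for demonstration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Invented short documents (illustrative only)
    docs = [
        "gene expression profiles in tumor cells",
        "tumor growth linked to gene mutation",
        "hospital discharge summary for a cardiac patient",
        "follow-up note for the cardiac patient",
    ]

    # Preprocessing (lower-casing, tokenization, stop-word removal) plus TF-IDF weighting
    features = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Cluster the documents into two groups
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    print(labels)  # cluster id assigned to each document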

4. How Are NN Algorithms Well-Suited with Respect to Efficiency for Large Sequential Datasets?

In this section, 8 studies have been selected that highlight the performance of Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) used for sequential data. The details of each study are discussed below.
At first, the researchers Yin et al. [82] and Ouyang et al. [83], in both surveys, discussed the use of NLP and DL techniques for fake-news detection and sequential data. By using these techniques, it is found that the accuracy of the model is up to 93%. Moreover, a comparison of CNNs and RNNs reveals that RNNs are better than CNNs. The techniques that can be used for sentiment analysis, relational tasks, textual entailment, answer selection, QA path queries, and POS tagging were pointed out by Lopez and Kalita [84]. Additionally, Chai and Li [85] selected studies that address the Chinese community; in these, the performance of Chinese-language clinical NER was increased by using NLP techniques with DL. Similarly, Oshikawa, Qian, and Wang [86] presented that other techniques, such as RNNs with DL, consistently show better results. In addition, Young, Hazarika, Poria, and Cambria [87,88] also highlighted that, with the help of NLP in different domains, the performance on sequential data is optimal. Lastly, a survey conducted by Jing and Xu [89,90] depicts that the performance of RNNs, with the addition of NLP, is at its peak.
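To make the comparison concrete, the following minimal Keras sketch shows the kind of RNN (here an LSTM) text classifier that these surveys evaluate against CNNs; the randomly generated integer-encoded sequences, vocabulary size, and layer sizes are arbitrary placeholders rather than settings from the cited works.

    import numpy as np
    import tensorflow as tf

    vocab_size, seq_len = 1000, 20

    # Toy integer-encoded token sequences and binary labels (placeholders only)
    x = np.random.randint(1, vocab_size, size=(32, seq_len))
    y = np.random.randint(0, 2, size=(32,))

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 16),   # token embeddings
        tf.keras.layers.LSTM(32),                    # recurrent layer reads the sequence
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=2, verbose=0)

    print(model.predict(x[:2], verbose=0).ravel())   # probabilities for two toy sequences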
The contributions, techniques, and domains of all studies are discussed in Table 9.
Table 9. Model Performance Techniques.
Study Reference Domain Techniques Contributions
[82] General CNN, RNN for NLP RNN performs better
[83] Healthcare RNN, N-Gram RNN performs better by using N-grams
[84] General Compared with existing CNN algorithms RNN outperformed
[85] General Used in many NLP and audio-video functionality Better for sequential text
[86] Fake News RNN for larger data sets of fake news 93% accuracy
[87] General CNN, RNN RNN is better as per recent studies
[88] Cancer, healthcare DL classifier is better than conventional classifier Model accuracy is better by using RNN
[89] General FFNNLM, RNNLM RNN Language model is best
[90] Medical, General CNN, DBN, RNN RNN is better in terms of NLP
 

This entry is adapted from the peer-reviewed paper 10.3390/app11178275
