Alavi and Leidner discussed knowledge, knowledge management, and knowledge management systems [5] and described the issues, challenges, and benefits of knowledge management systems [6]. Brent Gallupe considered three levels of knowledge management technologies: tools, generators, and specific KMSs [7]. Several studies have examined knowledge management in the age of big data, covering aspects such as knowledge bases, knowledge discovery, and knowledge fusion. Suchanek and Weikum gave an overview of methods for building large knowledge bases [8]. Begoli and Horey presented three system design principles that can be integrated into knowledge discovery infrastructure and reported development experiences with big data problems [9]. Dong et al. introduced a web-scale probabilistic knowledge base that employed supervised machine-learning methods for knowledge fusion from existing repositories [10]. These studies considered the presentation of big data in their systems but did not provide a comprehensive process of knowledge development. Tretiakov et al. [11] adapted and extended a generic model of knowledge management systems, including factors relevant to healthcare; their experiments were conducted on data collected from 263 doctors within two district health boards in New Zealand. Maramba et al. [12] presented a comprehensive synopsis of the challenges in implementing computer-based KMSs in healthcare institutions. Manogaran et al. [13] proposed a big-data-based KMS supporting clinical decisions and provided an overview of big data tools and technologies that can be used in KMSs. These studies remain at the level of knowledge exploration and do not apply new knowledge in concrete practice. Recently, Le Dinh et al. proposed an architecture for implementing big-data-driven knowledge management systems [14]. A knowledge management system in a big data context must fully support the knowledge development process, which comprises four stages: capture, organize, transfer, and apply. However, that study stays at the abstract level of KMS without an implementation.
To overcome the above challenges, researchers propose a big-data-driven healthcare knowledge management system that supports diagnostic decisions in a parallel and distributed environment. This large-scale healthcare system ensures a complete and comprehensive knowledge development process, including both knowledge exploration and knowledge exploitation. In addition, artificial intelligence and big data processing provide real-time diagnostic decision support over massive volumes of medical records within a reasonable response time. The proposed healthcare knowledge management system for supporting medical diagnosis comprises four layers: a data layer, an information layer, a knowledge layer, and an application layer. The system is illustrated using machine-learning techniques in the knowledge layer to generate knowledge for hypertension and brain hemorrhage diagnosis, with data collected from several hospitals and from health-monitoring devices.

Hypertension is one of the leading causes of disability and death worldwide. According to the World Health Organization (WHO), an estimated 9.4 million deaths are caused by high blood pressure. This dangerous disease must be detected and treated promptly to limit the risk of death as well as disease complications. Researchers use decision trees to generate knowledge for hypertension diagnosis and classification; decision trees learn and generate simple rules from a complex decision-making process in a way that resembles human reasoning.

In addition, researchers use deep-learning techniques to generate knowledge for brain hemorrhage detection and classification. A brain hemorrhage is a type of stroke caused by an artery bursting in the brain; stroke is the second leading cause of death according to the WHO. Because the diagnosis of the disease is based on cerebral CT/MRI images, researchers propose deep-learning techniques for hemorrhage detection and classification. The trained model with Faster R-CNN Inception ResNet v2 achieves a mean average precision of 79% in classifying four types of brain hemorrhage.
2. Knowledge Management Systems
Knowledge management systems have a dramatic impact on the decision-making support of organizations. However, an effective KMS needs to ensure the whole process of knowledge management, including knowledge exploration and knowledge exploitation. Le Dinh et al. proposed an architecture for big-data-driven knowledge management systems including a set of constructs, a model, and a method [14]. This architecture complies with the requirements of the knowledge development process and the knowledge management process. Based on the research of Le Dinh et al., researchers have proposed an architecture for a knowledge management system supporting medical diagnosis comprising four layers: a data layer, an information layer, a knowledge layer, and an application layer (Figure 1). This knowledge management system ensures all four stages of the knowledge development process, namely data, information, knowledge, and understanding, corresponding to the four main activities of capture, organize, transfer, and apply. The objective of this entry is to present an architecture for medical diagnosis decision-support systems that collect and analyze big data. This proposal addresses two major challenges: knowledge management and knowledge organization from disparate data sources.
Figure 1. Proposed architecture for healthcare knowledge management systems.
The system processes two types of data: batch data (patient records collected over a long time period) and real-time data (collected from wearable devices). The batch data are loaded into the data lake (HDFS), and the real-time data are ingested into the processing system with Kafka and Spark Streaming. From this large amount of medical data, the system filters out the information useful for disease diagnosis and classification, preprocesses it, and stores it in HBase. This information is then used for knowledge transformation to create machine-learning models. New knowledge is created and made available to users through queries from websites or wearable devices.
2.1. Data Layer
There are two data sources used in this entry: historical datasets collected from hospitals and real-time data collected from patients via health-monitoring wearable devices. The batch data are loaded into the Hadoop Distributed File System (HDFS), a well-known fault-tolerant distributed file system. HDFS is designed to store very large datasets reliably and to stream them at high bandwidth to user applications. The real-time data are ingested into the system with Apache Kafka, a distributed, reliable, high-throughput, low-latency publish-subscribe messaging system. Kafka is commonly paired with Apache Spark to process streaming data, combining the strengths of both. Researchers use Kafka to ingest real-time event data and stream it to Spark Streaming. The data can be in text format or images, especially the CT/MRI images commonly used in medical diagnosis. These raw data are collected and fed into the system for storage at the data layer.
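As a minimal sketch of such an ingestion pipeline, the code below uses PySpark's Structured Streaming Kafka source (the modern counterpart of the Spark Streaming API named in the entry); the broker address, topic name, and HDFS paths are illustrative assumptions, not the entry's actual deployment.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vitals-ingest").getOrCreate()

# Subscribe to the (hypothetical) topic the wearable devices publish to.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "vitals")
          .load())

# Kafka delivers key/value as binary; decode the payload to text.
decoded = events.selectExpr("CAST(value AS STRING) AS event_json")

# Land the raw events in the data lake (HDFS) for later processing.
query = (decoded.writeStream
         .format("parquet")
         .option("path", "hdfs:///datalake/vitals")
         .option("checkpointLocation", "hdfs:///checkpoints/vitals")
         .start())
query.awaitTermination()
```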
Figure 2. Training phase in a Spark cluster.
Testing phase: Researchers extract features from the testing set and use them to evaluate the accuracy of the trained models. The trained model is used to predict whether or not a patient has a disease, and the execution of queries in this phase is likewise implemented in a distributed, parallel environment. The models' performance can be evaluated with precision, recall, and F1 score. The models appropriate for the problem are stored on a distributed storage system for future use.
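A minimal sketch of this evaluation step, assuming a fitted PySpark classifier `model` and a held-out DataFrame `test_df` with `label` and `features` columns (hypothetical names, since the entry does not publish its code):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Apply the trained model to the held-out records.
predictions = model.transform(test_df)

# Weighted precision, recall, and F1 score over the prediction column.
for metric in ("weightedPrecision", "weightedRecall", "f1"):
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName=metric)
    print(metric, evaluator.evaluate(predictions))
```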
2.4. Application Layer
In this layer, applications are built to input patient information into the system and to output diagnoses and disease classifications. The applications are designed to perform patient data entry and then execute knowledge queries that return new knowledge about the patient's health status. The execution of queries in this layer is implemented in a distributed environment.
3. Healthcare Knowledge Management Systems
3.1. High Blood Pressure Diagnosis Support
Blood pressure is the force that blood exerts against vessel walls as it moves through the vessels [15]. It is expressed as two numbers: systolic pressure and diastolic pressure. Systolic is the higher number, corresponding to the period when the heart beats to push blood into the arteries. Diastolic is the lower number, corresponding to the rest period between two consecutive heartbeats. Typically, high blood pressure is when the blood pressure measured in medical facilities is greater than or equal to 140/90 mmHg. According to the seventh report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure (JNC 7) [16], the classification of blood pressure for adults aged 18 and older is presented in Table 1.
Table 1. Classification of blood pressure for adults.

| Class | Systolic (mmHg) | Diastolic (mmHg) |
| --- | --- | --- |
| Normal | <120 | and <80 |
| Prehypertension | 120–139 | or 80–89 |
| Stage 1 hypertension | 140–159 | or 90–99 |
| Stage 2 hypertension | ≥160 | or ≥100 |
3.1.1. Decision Tree for High Blood Pressure Detection
Preprocessing: The text data contain many empty fields, zero values, and even non-viable values that would affect the operation of the knowledge layer. Data preprocessing therefore removes non-viable values from the dataset, while empty fields are filled using mathematical interpolation. The resulting dataset is saved as a .csv file and stored in HBase for later use in distributed environments.
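A minimal sketch of this cleaning step using pandas is shown below; the file name, column names, and validity ranges are illustrative assumptions, not the entry's actual schema.

```python
import pandas as pd

# Hypothetical patient-record file; column names are illustrative.
df = pd.read_csv("patient_records.csv")

# Treat physiologically impossible readings as missing (non-viable values).
df.loc[(df["systolic"] <= 0) | (df["systolic"] > 300), "systolic"] = None
df.loc[(df["diastolic"] <= 0) | (df["diastolic"] > 200), "diastolic"] = None

# Fill empty fields by interpolation, as described above.
df[["systolic", "diastolic"]] = df[["systolic", "diastolic"]].interpolate(
    method="linear", limit_direction="both")

df.to_csv("patient_records_clean.csv", index=False)
```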
Researchers label the data records based on diagnosis results concluded by professional doctors with high reliability. A record is labeled 1 if the patient is diagnosed with high blood pressure and 0 otherwise. After labeling, researchers process the string information in the dataset to build a feature extraction model and obtain the feature vectors.
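The entry does not publish its exact pipeline, but a plausible PySpark sketch of this step encodes the symptom text as a sparse bag-of-words vector and appends the numeric vital-sign fields; all column names here are hypothetical. (The sparse vectors in Table 2 are consistent with such a layout: a 25,152-dimensional symptom vector plus 11 numeric slots gives the 25,163-dimensional feature vector.)

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, VectorAssembler

# Encode the free-text symptoms as a sparse bag-of-words vector.
tokenizer = Tokenizer(inputCol="symptoms", outputCol="symptom_tokens")
vectorizer = CountVectorizer(inputCol="symptom_tokens",
                             outputCol="symptom_vec")

# Append the numeric vital-sign fields to the symptom vector.
assembler = VectorAssembler(
    inputCols=["symptom_vec", "age", "systolic", "diastolic",
               "pulse", "respiration", "height", "weight", "temperature"],
    outputCol="features")

pipeline = Pipeline(stages=[tokenizer, vectorizer, assembler])
feature_df = pipeline.fit(labeled_df).transform(labeled_df)  # labeled records
```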
Model training: Researchers fit a decision tree, splitting the dataset 70/30 between the training and testing phases. A classification decision tree is built on the training set, and the test set is then used to evaluate model performance.
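Continuing the sketch above, training a PySpark decision tree on the assembled features might look as follows; `feature_df` comes from the hypothetical pipeline sketched earlier, and the seed is arbitrary.

```python
from pyspark.ml.classification import DecisionTreeClassifier

# 70/30 train/test split, as described above.
train_df, test_df = feature_df.randomSplit([0.7, 0.3], seed=42)

# maxDepth=6 is the depth the entry ultimately settles on for detection.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                            maxDepth=6)
model = dt.fit(train_df)
```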
Table 2 shows the dataset after labeling and feature extraction, i.e., the state of the data just before the train/test split.
Table 2. Examples of data before training models.

| Symptoms | Diagnosis | Label | Index | Symptoms Classification | Features |
| --- | --- | --- | --- | --- | --- |
| Headache, vomit | Intracranial injury | 0 | 194 | (25,152, [194], [1.0]) | (25,163, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 205], [17.0, 100.0, 60.0, 80.0, 18.0, 1.57, 22, 53, 48.0, 37.0, 1.0]) |
| Fever | Chickenpox | 0 | 7 | (25,152, [7], [1.0]) | (25,163, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 18], [1.0, 36.0, 140.0, 60.0, 78.0, 20.0, 1.7, 39, 68, 50.0, 39.0, 1.0]) |
| Tired | Hypertension | 1 | 1 | (25,152, [1], [1.0]) | (25,163, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12], [49.0, 210.0, 140.0, 104.0, 22.0, 1.73, 40, 55, 80.0, 37.0, 1.0]) |
| Abdominal pain | Acute appendicitis | 0 | 0 | (25,152, [0], [1.0]) | (25,163, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [23.0, 110.0, 70.0, 87.0, 20.0, 1.46, 40.0, 50.0, 40.0, 37.0, 1.0]) |
| Dizzy | Vestibular dysfunction; Hypertension | 1 | 4 | (25,152, [4], [1.0]) | (25,163, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15], [1.0, 53.0, 170.0, 100.0, 84.0, 18.0, 1.5, 42, 55, 50.0, 37.0, 1.0]) |
In addition, based on the trained model, researchers use the featureImportances attribute provided by the PySpark library to select the variables that most influence the disease diagnosis in the dataset. The importance of a variable is weighted by the Gini importance, defined as the total decrease in node impurity, calculated from the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature. Researchers can rely on this result to remove unimportant data fields, reducing training time and increasing model accuracy. The results obtained from featureImportances are shown in Figure 3; a sketch of how they can be read from a trained model follows the figure.
Figure 3. Feature importance in predicting high blood pressure.
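A minimal sketch of reading these importances, assuming the model was trained on the tabular fields alone so the importances align with a short list of feature names (the names below are illustrative, not the entry's actual schema):

```python
# `model` is a fitted DecisionTreeClassificationModel.
importances = model.featureImportances.toArray()  # Gini importances

# Hypothetical field names, in assembler order.
feature_names = ["age", "systolic", "diastolic", "pulse", "respiration",
                 "height", "head_circumference", "chest_circumference",
                 "weight", "temperature", "gender", "onset_days", "bmi"]

# Rank the fields from most to least important.
for name, score in sorted(zip(feature_names, importances),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```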
Researchers decided to remove two unimportant data fields (head circumference and chest circumference) and retrain the models with the dataset consisting of only 11 data fields. Researchers train different decision tree models by varying the tree depth and perform the training phase in a distributed environment under the three proposed scenarios.
Training results: Researchers construct decision trees of different depths. Each tree has rules that give different prediction results: a tree of depth n inherits the inner branches of a tree of depth n−1 and adds further conditions for making predictions. An example decision tree with a depth of 4 is shown in Figure 4.
Figure 4. Decision tree of depth 4 for the high blood pressure detection problem.
In addition, based on the decision tree models and the rules generated, researchers found that several health factors are closely related to high blood pressure. For example, a patient with a systolic blood pressure over 147 mmHg usually has symptoms such as headache, dizziness, and fatigue, and people over the age of 55 are likely to have a high risk of hypertension. Researchers train the models on a Spark cluster; the training time is presented in Figure 5a. The deeper the tree, the more time the training process takes. After training, researchers evaluate the detection models by applying them to the testing set. The accuracy of the models is presented in Figure 5b: the precision of the models at different tree depths reaches 84% to 87%. After training and evaluating the models, researchers choose to stop at a tree depth of 6 because the generated rules are consistent with reality; increasing the depth further causes redundant branches to appear, and the decision trees over-fit.
Figure 5. Training time and accuracy of the detection models. (a) Training time; (b) Accuracy.
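A depth sweep like the one summarized in Figure 5 can be sketched as follows, reusing the hypothetical `train_df`/`test_df` split from the earlier sketch; the depth range is illustrative.

```python
import time

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="weightedPrecision")

# Record training time and test precision for each tree depth.
for depth in range(2, 9):
    start = time.time()
    model = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                   maxDepth=depth).fit(train_df)
    elapsed = time.time() - start
    precision = evaluator.evaluate(model.transform(test_df))
    print(f"depth={depth}: {elapsed:.1f}s, precision={precision:.3f}")
```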
3.1.2. Decision Tree for High Blood Pressure Classification
Model training: The classification of high blood pressure is based on Table 1. Researchers label each record by comparing the patient's systolic and diastolic blood pressure as follows (a code sketch of the rule appears after the list).
- Label 0: systolic < 120 and diastolic < 80
- Label 1: systolic ≥ 120 and diastolic ≥ 80
- Label 2: systolic ≥ 140 and diastolic ≥ 90
- Label 3: systolic ≥ 160 and diastolic ≥ 100
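Because the ranges overlap, the rules must be checked from the most severe class down; a direct transcription of the list above might look like this:

```python
def blood_pressure_label(systolic: float, diastolic: float) -> int:
    """Map a blood pressure reading (mmHg) to the labels listed above."""
    if systolic >= 160 and diastolic >= 100:
        return 3  # stage 2 hypertension
    if systolic >= 140 and diastolic >= 90:
        return 2  # stage 1 hypertension
    if systolic >= 120 and diastolic >= 80:
        return 1  # prehypertension
    return 0      # normal


assert blood_pressure_label(210, 104) == 3
assert blood_pressure_label(118, 75) == 0
```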
Classification is performed after disease detection; thus, researchers do not pay attention to label 0. Researchers train decision trees for the classification problem on the same dataset, with a 70/30 train/test split, under the three proposed scenarios.
Results: As with hypertension detection, researchers build high blood pressure classification models with decision trees at different depths. Researchers choose to stop training at a tree depth of 4 because, as the depth increases, redundant branches start to appear and the tree over-fits. An example decision tree that classifies hypertension with a depth of 4 is shown in Figure 6.
Figure 6. Decision tree of depth 4 for the problem of high blood pressure classification.
The classification models are trained on a Spark cluster; the training time is presented in Figure 7a. The deeper the tree, the more time the training process takes. Researchers evaluate the classification models with precision, recall, and F1 score. The accuracy of the models is presented in Figure 7b: precision exceeds 92% across all three models.
Figure 7. Training time and accuracy of the classification models. (a) Training time; (b) Accuracy.
3.2. Brain Hemorrhage Diagnosis Support
Brain hemorrhage is a dangerous type of stroke that can lead to death or disability. There are four common types of cerebral hemorrhage [27]: epidural hematoma (EDH), subdural hematoma (SDH), subarachnoid hemorrhage (SAH), and intracerebral hemorrhage (ICH). Hypertension is the most common cause of primary intracerebral hemorrhage. To detect a brain hemorrhage, doctors usually rely on the Hounsfield units (HU) of the hemorrhagic region in a CT/MRI image. Researchers therefore propose a diagnosis support system for brain hemorrhage detection and classification using HU values. The machine-learning approach used in the knowledge layer for this disease is deep learning, specifically Faster R-CNN Inception ResNet v2.
The Hounsfield unit represents different types of tissue on a scale of −1000 (air) to 1000 (bone). Table 3 lists several tissue types with their HU densities; a hemorrhagic region has HU values in the range of 40 to 90. The HU value of a pixel is calculated by Equation (1), where p_value is the raw value of the pixel and r_slope and r_intercept are the rescale values stored in the CT/MRI images:

HU = p_value × r_slope + r_intercept. (1)
Table 3. HU density on CT/MRI images.

| Matter | Density (HU) |
| --- | --- |
| Air | −1000 |
| Water | 0 |
| White matter | 20 |
| Gray matter | 35–40 |
| Hematoma | 40–90 |
| Bone | 1000 |
3.2.1. Training Phase
Preprocessing: The CT/MRI images are converted into digital images (.jpg) according to their HU values. Because the location of a brain hemorrhage is determined by HU values, preprocessing yields a digital image dataset with highlighted hemorrhagic regions. The hemorrhagic regions are then labeled under the supervision of specialists.
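A minimal sketch of this conversion, assuming the source scans are DICOM files read with pydicom; the entry does not specify its exact windowing, so the 40-90 HU hematoma range from Table 3 is used here for illustration.

```python
import numpy as np
import pydicom
from PIL import Image

# Load one CT slice; the file name is illustrative.
ds = pydicom.dcmread("slice_001.dcm")

# Equation (1): raw pixel values -> Hounsfield units, using the rescale
# slope and intercept stored in the DICOM header.
hu = (ds.pixel_array.astype(np.float32) * float(ds.RescaleSlope)
      + float(ds.RescaleIntercept))

# Window to the hematoma range (40-90 HU, Table 3) so bleeding regions
# stand out, then scale to 8-bit grayscale and save as .jpg.
lo, hi = 40.0, 90.0
windowed = np.clip(hu, lo, hi)
img = ((windowed - lo) / (hi - lo) * 255.0).astype(np.uint8)
Image.fromarray(img).save("slice_001.jpg")
```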
Feature extraction: Researchers perform feature extraction using a pretrained Inception ResNet v2 CNN as the backbone of the Faster R-CNN to reduce computation time. This step speeds up brain hemorrhage classification.
Model training: The extracted features are trained with Faster R-CNN. The training process is monitored through the loss value, and training stops when the loss no longer improves (decreases). The loss value of the model is very low (below 10%) after 60,000 training steps, as illustrated in Figure 8, meaning the error rate of the proposed model in brain hemorrhage prediction is very low.
Figure 8. Loss values over training steps.
3.2.2. Testing Phase
After the training process, researchers evaluate the proposed model for brain hemorrhage detection and classification on the test dataset. Preprocessing and feature extraction are also performed on the testing set before evaluation. The trained Faster R-CNN Inception ResNet v2 is then used to detect and classify the four common types of brain hemorrhage. It correctly detects the contours of entire hemorrhage regions with an accuracy of 100%. An example of multiple-hemorrhage detection on a single image is presented in Figure 9: the model predicts a bleeding time of 2 to 3 days, recognizes the hemorrhage types as ICH and SAH, and accurately segments the bleeding regions.
Figure 9. Multiple brain hemorrhage segmentation.
The average precisions (AP) of the proposed model for the four types of brain hemorrhage (EDH, SDH, SAH, and ICH) are 0.70, 0.59, 0.72, and 0.71, respectively (Figure 10). Averaging these values, the model gives an mAP of 0.68 for the detection and classification of the four classes of brain hemorrhage. The results show that the system can support doctors in accurately diagnosing cerebral hemorrhage and providing appropriate treatment regimens.
Figure 10. Average precision (AP) of four brain hemorrhage types.