Representation Learning for Electronic Health Records

Please note this is a comparison between Version 1 by Atieh Khodadadi and Version 2 by Sirius Huang.

Electronic health records (EHRs) are a vital, high-dimensional source of medical concepts. Discovering implicit correlations in these data, together with their research and informative value, can improve the treatment and management process. The main challenge is that limitations of the data sources make it difficult to find a stable model that relates medical concepts and exploits the connections between them.

electronic health record
deep learning
intensive care unit

1. Introduction

Medical and therapeutic techniques have benefited substantially from the collection of health data and the application of data science to such data [1,2,3]. EHRs are one of these enormous sources of data and are helpful for a variety of predictive tasks in medical applications [4]. EHRs hold a patient’s demographics, medical history, vital signs, laboratory tests, prescribed medications, diagnoses, and clinical outcomes recorded during an encounter [5,6]. EHR databases may contain several visits per patient, establishing a longitudinal patient record that can be used to support the treatment process, for example through disease prediction, mortality prediction, and improving the efficacy of therapy.

Initially, EHR systems were intended to manage the basic administrative functions of hospitals, permitting the use of controlled terminologies and coding schemes. Numerous coding schemes exist, including ICD (International Statistical Classification of Diseases) codes for diagnoses [7,8,9,10], CPT (Current Procedural Terminology) codes for procedures [11,12,13], LOINC (Logical Observation Identifiers Names and Codes) for laboratory tests [14,15], ATC (Anatomical Therapeutic Chemical) codes for drugs [16,17], and RxNorm for medications [12]. These coding schemes produce standardised datasets for different specialisations. As EHR systems develop, the volume of EHR data increases annually, and several studies have been conducted on the secondary use of these data.

EHRs offer numerous benefits, including improved patient care, increased efficiency, and reduced healthcare costs [18]. Despite the potential of EHRs in various applications, their effective use is hindered by data-specific restrictions [6], such as high missingness and irregular sampling [19,20,21], as well as imbalanced classes due to the uneven prevalence of illnesses [22]. It is therefore important to address these limitations in order to fully realise the potential of EHRs.

2. Representation Learning for EHR Applications

2.1. Vector-Based Methods

One learning model that represents patient information as a vector is the fully connected Deep Neural Network (DNN). Futoma et al. ^[23][27] evaluated the ability of various models to predict hospital readmissions using data from a large EHR database. Their results show that the DNN outperforms approaches previously applied to this problem in terms of prediction performance. The study in ^[24][28] employed a deep generative learning model to overcome the problem of insufficient data by learning and categorising tumour locations from MRI images. The work by Zheng et al. ^[25][29] on predicting suicide ideation, behaviour, or death was based on the health records of patients who had visited a Berkshire Health System hospital. In ^[26][30], multiple machine learning and deep learning methods were applied to EHRs to classify patient severity, and the experimental findings indicate that the DNN performed exceptionally well. A deep neural network architecture was also adopted for type II diabetes prediction ^[27][31]. All these studies demonstrate that DNNs can be used for EHR data analysis and diagnosis. Nevertheless, most recent research regards this architecture as the classic approach ^[28][32].
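As an illustration of what a vector-based patient representation can look like, the following minimal sketch turns each patient's set of medical codes into a fixed-length 0/1 vector, the kind of input a fully connected network could consume. It is not taken from any of the cited studies; the patient identifiers, the code strings, and the helper names `build_vocabulary` and `multi_hot` are assumptions made for this example.

```python
from typing import Dict, List


def build_vocabulary(patients: Dict[str, List[str]]) -> Dict[str, int]:
    """Map every distinct medical code to a fixed column index."""
    codes = sorted({code for record in patients.values() for code in record})
    return {code: index for index, code in enumerate(codes)}


def multi_hot(record: List[str], vocabulary: Dict[str, int]) -> List[int]:
    """Encode one patient's codes as a fixed-length 0/1 vector."""
    vector = [0] * len(vocabulary)
    for code in record:
        vector[vocabulary[code]] = 1
    return vector


# Hypothetical toy records: patient id -> codes seen across all visits.
patients = {
    "p1": ["ICD:E11", "ATC:A10BA02", "LOINC:4548-4"],
    "p2": ["ICD:I10", "ICD:E11"],
}
vocabulary = build_vocabulary(patients)
vectors = {pid: multi_hot(record, vocabulary) for pid, record in patients.items()}
```

Note that this encoding discards the order and timing of events, which is one reason the temporal, graph, and sequence representations discussed later are of interest.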

Autoencoders are vector-based, unsupervised deep learning models that provide an efficient dimensionality reduction technique with promising performance for the deep representation of medical data ^[29][33]. Autoencoders have also been applied effectively to datasets comprising massive collections of electronic health records, where they are adept at handling missing data ^[30][34]. A comparison study by Sadati et al. ^[31][35] emphasised the effectiveness of several types of autoencoders for EHR-based datasets. Combining a recurrent autoencoder with two GANs, Lee et al. ^[32][36] proposed a dual adversarial autoencoder (DAAE) for sequential electronic health records. Biswal et al. ^[33][37] synthesised sequences of discrete EHR encounters and encounter features using a variational autoencoder. Very recently, a dual-autoencoder model was explored for adverse drug event preventability in EHRs ^[34][38]. Wang et al. ^[35][39] compared a model using autoencoder features with traditional models and reported reasonable results.

Convolutional Neural Networks (CNNs) are a further vector-based technique. EHR research using CNNs ^[36][40] focuses on capturing the local temporal dependence of the data, which is then used to predict multiple diseases and for other related tasks. Wang et al. ^[37][41] adopted a CNN with 1929 features for the classification of 1099 international diseases. The researchers in ^[38][42] aimed to develop a convolutional neural network model for predicting the risk of advanced nonmelanoma skin cancer (NMSC) in Taiwanese adults. In an intriguing study ^[39][43], a CNN was applied to electronic health records to determine the top 20 lung-cancer-related indicators in order to avoid radiation exposure and costs. CNNs have shown a superior ability to measure patient similarity. However, the traditional CNN architecture cannot properly exploit the temporal and contextual information of EHRs for disease prediction. Consequently, it remains difficult to represent the timing and content of EHR data simultaneously ^[40][44].
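The idea of capturing local temporal dependence can be sketched with a single one-dimensional convolution over a short time series, followed by a ReLU and max pooling. This is only an illustrative sketch: the daily values and the hand-picked kernel are hypothetical, whereas a trained CNN would learn many such filters from data.

```python
from typing import List


def conv1d(series: List[float], kernel: List[float]) -> List[float]:
    """Slide the kernel over consecutive time steps (no padding, stride 1)."""
    width = len(kernel)
    return [
        sum(series[start + k] * kernel[k] for k in range(width))
        for start in range(len(series) - width + 1)
    ]


def relu(values: List[float]) -> List[float]:
    """Keep positive filter responses and zero out the rest."""
    return [max(0.0, value) for value in values]


# Hypothetical daily measurements for one patient over seven days.
series = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
# Hand-picked filter that responds when the value rises across three days.
kernel = [-1.0, 0.0, 1.0]

feature_map = relu(conv1d(series, kernel))
pooled_feature = max(feature_map)  # max pooling over time
```

Each entry of `feature_map` depends only on three neighbouring time steps, which illustrates why a plain convolution captures local patterns but not long-range temporal context.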

Word2vec ^[41][45] originated in natural language processing, where it was developed to learn word embeddings from large-scale text resources. In ^[42][46], the authors used the word2vec technique to train a two-layer neural network, improving prediction accuracy in clinical applications relative to baselines. Choi et al. ^[43][47] applied skip-gram to longitudinal EHR data to learn low-dimensional representations of medical concepts. To improve the performance of a convolutional neural network for patient phenotyping, Yang et al. ^[44][48] explored a model that combines token-level and sentence-level inputs. Similarly, in ^[45][49], clinical text was used to predict clinical concepts. Steinberg et al. ^[46][50] proposed a novel analogue of language modelling for discretised clinical time-series data. However, these techniques do not explicitly model dynamic temporal information or address the challenges of heterogeneous data sources ^[47][51].
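To make the skip-gram idea concrete, the sketch below generates (target, context) training pairs from one patient's ordered sequence of medical codes, treating codes the way word2vec treats words. It covers only pair generation, not the embedding training itself, and it is not the procedure of any cited paper; the code sequence, the window size, and the function name `skipgram_pairs` are assumptions for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(sequence: List[str], window: int) -> List[Tuple[str, str]]:
    """Pair each code with the codes that occur within `window` positions of it."""
    pairs = []
    for position, target in enumerate(sequence):
        start = max(0, position - window)
        stop = min(len(sequence), position + window + 1)
        for neighbour in range(start, stop):
            if neighbour != position:
                pairs.append((target, sequence[neighbour]))
    return pairs


# Hypothetical chronologically ordered codes for one patient.
codes = ["ICD:I10", "ATC:C09AA02", "ICD:E11", "ATC:A10BA02"]
pairs = skipgram_pairs(codes, window=1)
```

A skip-gram model would then be trained to predict the context code from the target code, so that codes appearing in similar contexts receive similar low-dimensional vectors.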

2.2. Temporal Matrix-Based Methods

Lee and Seung ^[48][52] presented Non-Negative Matrix Factorisation (NMF) as a method for discovering a set of basis functions for representing non-negative data. For electronic health records, the data form a matrix with a time dimension and a clinical event dimension. Bioinformatics has used NMF extensively for clustering sources of variation ^[49][50][51][53,54,55]. There have been other efforts to use NMF or its variants to represent patient data in EHRs. In ^[52][56], disease trajectories are analysed using NMF to extract multi-morbidity patterns from a very large collection of electronic health records. Zhao et al. ^[53][57] suggested that NMF can identify relationships between genetic variants and disease phenotypes. In a recent study ^[54][58], NMF was used to examine the symptoms of COVID-19 and predict long-term infection. Controlling the degree of sparseness of the representation is difficult, since sparseness is a side effect of the NMF algorithm ^[55][59]. The huge number of distinct diagnosis codes is an additional obstacle: it leads to a combinatorial explosion in the number of possible disease combinations, many of which are unique to a single patient ^[56][60].
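The factorisation described above can be sketched with the well-known multiplicative update rules for the squared-error objective, which approximate a non-negative matrix V by the product of two smaller non-negative matrices W and H. The code below is a minimal pure-Python sketch under stated assumptions: the toy matrix, the rank, the iteration count, and the small constant `eps` (added to avoid division by zero) are all illustrative choices, not values from the cited studies.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(column) for column in zip(*a)]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iterations: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Approximate V (rows x cols) by W (rows x rank) times H (rank x cols)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps)
              for j in range(cols)] for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps)
              for a in range(rank)] for i in range(rows)]
    return w, h


# Hypothetical matrix: rows are time windows, columns are clinical event counts.
events_over_time = [
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
]
w, h = nmf(events_over_time, rank=2)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. Any sparseness in the result emerges from the data rather than from an explicit constraint, which matches the difficulty of controlling sparseness noted above.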

2.3. Graph-Based Methods

Graph-based methods represent the EHR as a graph, using nodes to represent medical events and edges between the nodes to capture the temporal links among clinical events. One emerging method of deep learning on graph-structured data is the Graph Neural Network (GNN) ^[57][61]. GNNs can infer missing information, leading to a representation that is more explicable ^[58][62]. The hierarchical relationships in EHRs have been captured using GNNs, as described in ^[59][60][63,64]. In ^[61][65], a GNN reflected the links between drugs, side effects, diagnoses, associated treatments, and test results. For instance, Park et al. ^[62][66] proposed knowledge graph-based question answering with EHRs. The study in ^[63][67] introduced an EHR-oriented knowledge graph system to make efficient use of unused information buried in EHRs. In EHRs, it is typical for spurious edges to be included and for other edges to be absent. Even when the observed graph is clean, it may not suit the properties of GNNs because it is not jointly optimised with them. These flaws in the observed graph may sharply degrade the performance of GNNs ^[64][68].
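The graph construction described above can be sketched as follows: consecutive clinical events in each patient's sequence become directed edges, and one round of neighbour averaging shows how information can flow along those edges. This is a simplified sketch and not a full GNN layer, which would additionally apply learned weights and a non-linearity; the event names, the scalar node features, and both helper functions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def build_event_graph(sequences: List[List[str]]) -> Dict[Edge, int]:
    """Count directed edges between temporally consecutive events."""
    edges: Dict[Edge, int] = defaultdict(int)
    for sequence in sequences:
        for source, target in zip(sequence, sequence[1:]):
            edges[(source, target)] += 1
    return dict(edges)


def aggregate_neighbours(features: Dict[str, float],
                         edges: Dict[Edge, int]) -> Dict[str, float]:
    """Average each node's feature with the features of its in-neighbours."""
    incoming = defaultdict(list)
    for source, target in edges:
        incoming[target].append(source)
    updated = {}
    for node, value in features.items():
        neighbourhood = [value] + [features[source] for source in incoming[node]]
        updated[node] = sum(neighbourhood) / len(neighbourhood)
    return updated


# Hypothetical event sequences for two patients.
sequences = [
    ["fever", "blood_test", "antibiotic"],
    ["fever", "antibiotic"],
]
edges = build_event_graph(sequences)
features = {"fever": 3.0, "blood_test": 0.0, "antibiotic": 0.0}
updated = aggregate_neighbours(features, edges)
```

Since the averaging relies entirely on which edges are present, a spurious or missing edge changes the result directly, which illustrates why flaws in the observed graph can degrade downstream performance.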

2.4. Sequence-Based Methods

Sequence-based patient representation turns EHR data into a temporally ordered sequence of clinical events for use in prediction. According to Sherstinsky’s study ^[65][69], a recurrent neural network (RNN) is a neural network that includes the GRU and LSTM networks as special cases. RNNs are widely used in patient representation research that focuses on combinations or sequences of clinical codes ^[58][62]. Applications include support for early diagnosis ^[66][67][70,71] and disease prediction ^{[68][69][70][71][72][73][74][75]}[72,73,74,75,76,77,78,79]. Recently, Gupta et al. ^[76][80] adopted a general LSTM network architecture to improve predictions of BMI and obesity. Ref. ^[77][81] examined the performance of various deep neural network architectures, including LSTM, in scenarios involving clinical factors and chest X-ray radiology reports, and found that the recommended BiLSTM model outperforms other DNN baseline models. However, RNNs are frequently presented without context or rationale, and training equations are often omitted entirely; such partial descriptions and missing formulas may lead to inefficient use of RNNs ^[65][69].
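The two steps described above, ordering events in time and processing them recurrently, can be sketched with the basic recurrence h_t = tanh(W_x x_t + W_h h_(t-1) + b) of a plain RNN. GRU and LSTM cells add gating on top of this recurrence. The sketch is illustrative only: the timestamps, event codes, and fixed weight values are hypothetical, and a real model would learn the weights by training.

```python
import math
from typing import Dict, List, Tuple


def order_events(events: List[Tuple[int, str]]) -> List[str]:
    """Sort (day, code) pairs chronologically and keep only the codes."""
    return [code for _, code in sorted(events)]


def rnn_step(x: List[float], h_prev: List[float], w_x: List[List[float]],
             w_h: List[List[float]], b: List[float]) -> List[float]:
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_(t-1) + b)."""
    return [
        math.tanh(
            sum(w_x[i][j] * x[j] for j in range(len(x)))
            + sum(w_h[i][j] * h_prev[j] for j in range(len(h_prev)))
            + b[i]
        )
        for i in range(len(b))
    ]


def encode(sequence: List[str], vocabulary: Dict[str, int],
           w_x: List[List[float]], w_h: List[List[float]],
           b: List[float]) -> List[float]:
    """Feed one-hot encoded events through the recurrence; return the final state."""
    h = [0.0] * len(b)
    for code in sequence:
        x = [0.0] * len(vocabulary)
        x[vocabulary[code]] = 1.0
        h = rnn_step(x, h, w_x, w_h, b)
    return h


# Hypothetical events recorded out of order as (day, code) pairs.
events = [(12, "lab"), (3, "diagnosis"), (7, "prescription")]
sequence = order_events(events)
vocabulary = {"diagnosis": 0, "prescription": 1, "lab": 2}
# Fixed illustrative weights for a two-unit hidden state.
w_x = [[0.5, -0.5, 0.1], [0.2, 0.3, -0.4]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
patient_state = encode(sequence, vocabulary, w_x, w_h, b)
```

Because each hidden state depends on the previous one, feeding the same events in a different order generally yields a different final state, which is the property that distinguishes sequence-based from purely vector-based representations.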

2.5. Tensor-Based Methods

Tensor-based methods use an n-dimensional tensor to represent patient information. The multi-dimensional, higher-order structure of tensor factors makes complex relationships in EHR data understandable and interpretable ^[78][82]. Zhao et al. ^[79][83] identified previously unknown cardiovascular characteristics using a modified non-negative tensor-factorisation technique. Afshar et al. ^[80][84] implemented temporal and static tensor factorisation to extract clinically significant characteristics. Hernandez et al. ^[81][85] used a novel tensor-based dimensionality reduction method to predict the onset of haemodynamic decompensation.
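As a minimal sketch of how patient information might be arranged into a tensor before any factorisation is applied, the code below builds a three-way count tensor indexed by patient, diagnosis, and medication. The choice of these three modes, the record triples, and the function name `build_tensor` are assumptions for illustration; the cited studies may define their tensors differently.

```python
from typing import List, Tuple

Tensor3 = List[List[List[int]]]


def build_tensor(records: List[Tuple[str, str, str]], patients: List[str],
                 diagnoses: List[str], medications: List[str]) -> Tensor3:
    """Count how often each (patient, diagnosis, medication) triple occurs."""
    patient_index = {name: i for i, name in enumerate(patients)}
    diagnosis_index = {name: i for i, name in enumerate(diagnoses)}
    medication_index = {name: i for i, name in enumerate(medications)}
    tensor = [[[0 for _ in medications] for _ in diagnoses] for _ in patients]
    for patient, diagnosis, medication in records:
        p = patient_index[patient]
        d = diagnosis_index[diagnosis]
        m = medication_index[medication]
        tensor[p][d][m] += 1
    return tensor


# Hypothetical co-occurrence records: (patient, diagnosis, medication).
records = [
    ("p1", "diabetes", "metformin"),
    ("p1", "diabetes", "metformin"),
    ("p1", "hypertension", "enalapril"),
    ("p2", "hypertension", "enalapril"),
]
tensor = build_tensor(
    records,
    patients=["p1", "p2"],
    diagnoses=["diabetes", "hypertension"],
    medications=["metformin", "enalapril"],
)
```

A tensor factorisation would then decompose this array into a small number of components, each linking a group of patients with a group of diagnoses and a group of medications.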