Personalized medicine is an emerging medical practice based on a data-driven approach that considers relevant medical, genetic, behavioral, and environmental information about an individual to determine patient-specific therapy
[2][3][4]. By linking together diverse datasets to reveal hitherto-unknown causal pathways and correlations, big data allows for far more precision and tailoring than was ever before possible
[3]. Recent scientific advancements in high-throughput, high-resolution data-generating technologies enable cost-effective analysis of big datasets on individual health
[5]. However, analyzing and integrating such large volumes of information requires new computational approaches: faster, more integrated processors, larger computer memories, improved sensors, more sophisticated algorithms and methodologies, and cloud computing, which may guide future clinical practice by providing clinically useful information
[5][6]. The basic aim of precision medicine is to support the practicing clinician by making that information of pragmatic value. Precision medicine can be succinctly defined as an approach to provide the right treatments to the right patients at the right time
[7]. However, for most clinical problems, precision strategies remain aspirational. The challenge of reducing biology to its component parts, then identifying which can and should be measured to choose an optimal intervention, the patient population that will benefit, and when they will benefit most, cannot be overstated. However, the increasing use of hypothesis-free, big data approaches promises to help us reach this aspirational goal
[8].
2. The Conceptualization of Big Data
The definition of “big data” comprises distinct dimensions, namely volume, velocity, variety, value, variability, visualization, virality, and veracity, which together describe massive volumes of structured, semi-structured, and unstructured data
[9][10][11][12]. The Health Directorate of the Directorate-General for Research and Innovation of the European Commission defines it as follows: “Big data in health encompasses high volume, high diversity biological, clinical, environmental, and lifestyle information collected from single individuals to large cohorts, in relation to their health and wellness status, at one or several time points”
[13]. Sources of big data in the health care industry and in biomedical research include patients’ medical records, medical examination results, hospital records, and more
[14]. In addition, advances in technology have already created, and continue to create, thousands or even millions of measurements, including DNA and RNA sequencing and the characterization of proteins: their sequence, structure, posttranslational modifications, and function, alongside clinical features. To extract useful information from this huge amount of data, high-end computing solutions, along with appropriate infrastructure to systematically generate and analyze big data, are urgently needed. Moreover, advanced machine learning algorithms and techniques (such as deep learning and cognitive computing) represent the future toolbox and an emerging reality, and can be effectively applied to deliver integrative solutions for multi-view big data analysis in order to explain an event or predict an outcome
[2].
However, it is also important to note that when working with genetic data, the number of examples (patients) is usually very small relative to the number of genes or genetic variables measured. The solution is therefore bounded by the number of patients rather than the number of variables, which makes it a “little big data” problem. As a consequence, the mathematical models built to solve such decision problems (regressors or classifiers) have a huge uncertainty space: the set of models that predict the observed data within the same error bounds. These models are located in flat curvilinear valleys of the cost function landscape
[15][16]. This holds independently of the inverse problem being solved and concerns the uncertainty analysis of inverse and classification problems, which are by definition ill-posed. Such problems are very difficult to solve because noise in the data can dramatically perturb the solution, generating spurious, unphysical solutions. The best way to deal with them is therefore to reduce the dimension in order to perform a robust uncertainty analysis of the corresponding medical decision problem
[17][18]. This kind of approach requires robust sampling methods to consider multiple possible scenarios.
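To make the small-n, large-p issue concrete, the toy sketch below (hypothetical numbers, plain Python) shows two very different linear models that reproduce the same observations exactly, illustrating the non-uniqueness described above:

```python
# Hypothetical toy data: 3 patients (rows), 5 genetic variables (columns).
# With far fewer patients than variables, the linear system is
# underdetermined, so many distinct models fit the data equally well.
X = [[1.0, 0.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 1.0, 0.0, 2.0],
     [2.0, 1.0, 0.0, 1.0, 1.0]]
y = [3.0, 4.0, 5.0]  # observed outcome per patient (made-up values)

def predict(w, x):
    """Linear model prediction for one patient's variables x."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Two very different coefficient vectors that both reproduce y exactly:
w1 = [0.0, 5.0, 1.0, 1.0, -1.0]
w2 = [4.0, 2.0, 2.0, -5.0, 0.0]

print([predict(w1, x) for x in X])  # [3.0, 4.0, 5.0]
print([predict(w2, x) for x in X])  # [3.0, 4.0, 5.0]
```

Both models sit in the same "valley" of zero training error, yet they attribute the outcome to entirely different variables, which is exactly why dimensionality reduction and uncertainty analysis are needed before drawing biological conclusions.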
3. Computational Approaches toward Personalized Medicine
Personalized medicine refers to the patient’s treatment based on their personal clinical characterization
[19]. The patient’s individual characteristics are used to modify treatment in a way that might be more intricate than the standard course
[20]. It is evident from recent advances in the pharmacological and genetic behavior of various drugs that genetic variations in a single individual could lead to differences in the response to drugs
[21]. All of these factors support the notion of personalized medicine. The main aim of personalized medicine is to achieve the right treatments being given to the right patients. Nowadays, computational models are integrated in different fields in medicine and drug development, ranging from disease modeling and biomarker research to the assessment of drug efficacy and safety
[22]. The added value of such computational models, sometimes called digital evidence, in medicine is also accepted by the scientific community
[23][24] and the U.S. Food and Drug Administration (FDA) or the European Medicines Agency (EMA)
[25][26]. There are two types of models: mechanistic models and data-driven models. The basic aim of mechanistic models is the structural representation of the governing physiological processes in the model equations to support a functional understanding of the underlying mechanisms. On the other hand, data-driven approaches (machine learning (ML) and deep learning (DL)) use algorithms and artificial intelligence (AI) methodology
[27][28].
3.1. Molecular Interaction Maps (MIMs)
MIMs represent knowledge-based physical and causal interactions among biological species in the form of networks
[29]. MIMs explore the information about different mechanistic pathways and regulatory modules involved in a disease such as Parkinson’s
[30] or signaling in cancer
[31]. MIMs rely on graph-theory concepts to identify static network properties such as (i) critical nodes; (ii) communities; and (iii) hidden links to be predicted. Furthermore, upon overlaying expression data, such maps serve as visualization tools for the activity levels of regulators and their targets among established disease markers, providing the simplest mechanistic visualization of the data
[22].
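The three static analyses listed above can be sketched on a toy interaction network. The node names and edges below are illustrative, not a curated disease map, and degree centrality and the Jaccard coefficient are just one common choice of measures:

```python
# Hypothetical toy interaction network stored as adjacency sets.
edges = [("TP53", "MDM2"), ("TP53", "CDKN1A"), ("MDM2", "CDKN1A"),
         ("EGFR", "GRB2"), ("GRB2", "SOS1"), ("EGFR", "SOS1"),
         ("TP53", "EGFR"), ("TP53", "ATM")]

adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# (i) Critical nodes via degree centrality: degree / (n - 1).
n = len(adj)
centrality = {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}
hub = max(centrality, key=centrality.get)
print(hub)  # TP53 is the most connected node in this toy graph

# (iii) Hidden-link prediction via the Jaccard coefficient of the
# neighbourhoods of a currently non-adjacent pair of nodes.
def jaccard(u, v):
    shared = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return len(shared) / len(union) if union else 0.0

print(jaccard("MDM2", "EGFR"))  # 0.25: they share the neighbour TP53
```

Community detection, item (ii), follows the same pattern at larger scale; libraries such as NetworkX provide modularity-based algorithms for it.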
3.2. Constraint-Based Models
Genome-scale metabolic (GEM) models are the best example of constraint-based models; they provide a mathematical framework for understanding the metabolic capacities of a cell, enabling system-wide analysis of genetic perturbations, the exploration of metabolic diseases, and the identification of essential enzymatic reactions and drug targets
[32]. Most importantly, the GEM modeling approach is being used in multiple medical domains such as cancer
[33], obesity
[34], and in Alzheimer’s disease
[35].
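A minimal sketch of the core constraint behind GEM models, steady-state mass balance (S·v = 0) plus flux bounds, on a hypothetical one-metabolite network. A full GEM analysis would additionally optimize an objective flux by linear programming, e.g. with COBRApy:

```python
# Hypothetical toy network: one metabolite A, three reactions
# (uptake -> A, A -> drain1, A -> drain2).
# Rows of S: metabolites; columns: reactions (uptake, drain1, drain2).
S = [[1.0, -1.0, -1.0]]           # mass balance of metabolite A
bounds = [(0, 10), (0, 10), (0, 10)]  # allowed range for each flux

def is_feasible(v, S, bounds, tol=1e-9):
    """Check steady-state mass balance S @ v = 0 and flux bounds."""
    balanced = all(abs(sum(s * f for s, f in zip(row, v))) < tol
                   for row in S)
    in_bounds = all(lo <= f <= hi for f, (lo, hi) in zip(v, bounds))
    return balanced and in_bounds

print(is_feasible([6.0, 4.0, 2.0], S, bounds))  # True: 6 = 4 + 2
print(is_feasible([6.0, 1.0, 2.0], S, bounds))  # False: A accumulates
```

Every flux distribution a GEM model considers must pass exactly this kind of check; the constraints carve out the space of metabolically possible behaviors without requiring any kinetic parameters.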
3.3. Boolean Models (BMs)
BMs are the simplest logic-based models in which nodes are assigned one of two possible states: 1 (ON, activation) or 0 (OFF, inactivation)
[36]. Moreover, the regulatory relationships between regulators (upstream nodes) and targets (downstream nodes) are expressed by logical operators such as AND, OR, and NOT. BMs therefore do not require detailed kinetic data for parameter estimation, which makes them applicable to large biological systems. In the context of systems medicine, this approach is often applied in cancer research
[37][38].
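A minimal sketch of a synchronous BM, assuming a hypothetical three-node circuit (not a published cancer model), showing how the logical operators drive the network into an attractor:

```python
# Hypothetical three-node Boolean network updated synchronously:
# every node's next state is a logical function of the current states.
def step(state):
    a, b, c = state
    return (
        a or c,       # A := A OR C
        a and not c,  # B := A AND (NOT C)
        not b,        # C := NOT B
    )

# Iterate from an initial state until a previously seen state recurs;
# the repeating cycle of states is an attractor of the network.
state, seen = (True, False, False), []
while state not in seen:
    seen.append(state)
    state = step(state)

cycle = seen[seen.index(state):]  # the attractor
print(len(cycle))  # 2: the network oscillates between two states
```

In disease modeling, attractors of such networks are commonly interpreted as stable cellular phenotypes, and perturbing node rules (e.g. fixing a node to 0 to mimic a drug) shows how the reachable attractors change.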
3.4. Quantitative Models (QMs)
QMs are typically ordinary differential equation (ODE)-based approaches used to analyze the quantitative behavior of biochemical reactions over time. QMs consist of a set of differential equations, containing variables and parameters, that describe how the system responds to different stimuli or perturbations
[39]. This quantitative modeling approach describes biological-system dynamics in detail but is usually applied to single pathways because of the detailed kinetic data required for parameter estimation. Most importantly, in personalized medicine, ODE models are applied to individual biomarker discovery
[40], drug response, and tailored treatments
[41].
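A minimal sketch of an ODE-based QM: a single protein with constant synthesis and first-order degradation, integrated with a simple forward-Euler scheme. The rate constants are illustrative, not measured values:

```python
# Hypothetical single-protein model: dP/dt = k_syn - k_deg * P.
k_syn, k_deg = 2.0, 0.5   # synthesis (conc/time) and degradation (1/time)
P, dt = 0.0, 0.01         # initial concentration and Euler time step

# Forward-Euler integration over 50 time units.
for _ in range(int(50 / dt)):
    P += dt * (k_syn - k_deg * P)

# The trajectory relaxes toward the steady state k_syn / k_deg = 4.0.
print(round(P, 3))  # 4.0
```

Real QMs replace this toy equation with coupled ODEs for a whole pathway and a stiff solver (e.g. SciPy's `solve_ivp`), but the structure, parameters plus differential equations integrated over time, is the same.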
3.5. Pharmacokinetic Models
Pharmacokinetic models describe the concentration of a drug in plasma or in different tissues, so drug pharmacokinetics are commonly used as a surrogate for drug-induced responses. Pharmacokinetics can be described by compartmental pharmacokinetic (PK) modeling
[42] or by physiologically based PK (PBPK) modeling
[43].
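A minimal sketch of a one-compartment PK model with first-order elimination after an intravenous bolus. The parameter values are illustrative, not for any real drug:

```python
import math

# Hypothetical one-compartment model: C(t) = (Dose / V) * exp(-k_e * t).
dose = 500.0   # mg, intravenous bolus
V = 50.0       # L, apparent volume of distribution
k_e = 0.1      # 1/h, first-order elimination rate constant

def concentration(t):
    """Plasma concentration (mg/L) at t hours after the bolus."""
    return (dose / V) * math.exp(-k_e * t)

half_life = math.log(2) / k_e              # elimination half-life, ~6.93 h
print(concentration(0.0))                  # 10.0 mg/L initially
print(round(concentration(half_life), 3))  # 5.0 mg/L after one half-life
```

PBPK models extend this idea by replacing the single lumped compartment with physiologically meaningful organ compartments linked by blood flows, which is what makes them suitable for extrapolating between patient populations.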
4. Machine Learning Perspectives on Personalized Medicine
Nowadays, personalized medicine combined with machine learning is considered an emerging reality and is strongly connected with genomics and proteomics datasets. Machine learning approaches have been applied to massive data collected through genome sequencing, with the aim of precisely defining which treatment will work for an individual
[44]. These methodologies have provided a deep understanding of underlying disease mechanisms, while the integration of assorted patient data has yielded improved and more robust biomarker discovery for the diagnosis of various diseases. It has been argued that without machine learning approaches, the full potential of personalized medicine cannot be realized in clinical practice. Based on machine learning, various algorithms focused on specific diseases have been proposed. Among them is the FDA-approved MammaPrint prognostic test for breast cancer, based on a 70-gene signature
[45]. MammaPrint is a microarray-based signature method using formalin-fixed, paraffin-embedded (FFPE) or fresh tissue for microarray analysis
[46][47]. Moreover, the BluePrint test has also demonstrated expression data that could support personalized medicine in the MINDACT and IMPACt trials
[48]. Similarly, Bejnordi et al. reported an algorithm that is trained to detect metastases in various lymph nodes in stained tissue sections of breast cancer
[49]. A machine learning echocardiography algorithm proposed by Madani et al. provided an accuracy of greater than 90% for the diagnosis of cardiac disease
[50]. For the early detection of Alzheimer’s disease, Ding et al. proposed a machine learning-based system with high accuracy and sensitivity
[20].
Machine learning and AI approaches work with different types of data including genetic, genomic
[51], epigenomic
[52][53], transcriptomic
[54], metabolomic data
[55], medical images, biobank data
[56], electronic health records (EHR)
[57], scientific literature data, etc., and can combine all of this information to design optimal classifiers
[58]. In this respect, two types of problems are of interest: regression and classification. In regression, the aim is to predict a continuous, real-valued quantity, for instance, the level of cholesterol in blood based on other biomarkers. In classification problems, the aim is to predict the label of a set of individuals gathered into broad classes, for instance, to separate the patients whose survival time is greater than the average from the rest. The interest in formulating prediction problems as classification problems comes from the reduction in the uncertainty space. In particular, phenotype prediction problems are of great use for better understanding the altered genetic pathways responsible for the development of a disease and for speeding up the drug discovery process
[59].
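The regression-versus-classification distinction can be illustrated by taking the same (hypothetical) survival times either as a continuous regression target or thresholding them at the mean to obtain class labels:

```python
# Hypothetical survival times (months) for six patients.
survival_months = [12.0, 30.0, 7.0, 45.0, 22.0, 60.0]

# Regression target: the continuous values themselves.
# Classification target: threshold at the cohort mean, as in the
# above-average-survival example in the text.
mean = sum(survival_months) / len(survival_months)  # ~29.33 months
labels = [1 if t > mean else 0 for t in survival_months]
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Collapsing the continuous target into two classes discards detail but shrinks the space of answers the model must get right, which is the uncertainty-space reduction the text refers to.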