Requirements | Implementation

Findability |
- Assign PIDs meaningfully.
  - Each PID should uniquely identify a single patient and needs to be consistent between branch laboratories with parallel systems.
  - Develop solutions for unknown emergency patients, which allow correct assignment of test results when personal data is identified later on.
  - Develop solutions for analyses conducted for research purposes. Avoid cumulative PIDs.
- Record actual sampling time instead of planned sampling time.
- Connect all analytical devices to the lab IT system to avoid manual entries.
- Connect the lab IT system to the hospital’s central IT system to enable searches by clinicians and researchers.

Accessibility |
- Protect lab data adequately with:
- Design ETL processes efficiently.
- Consider the general consent status of patients and allow access to data accordingly.
- Employ modern technical solutions such as multiparty computing and homomorphic encryption for merging data from different sites.

Interoperability |
- Code analyses in a standardized manner, e.g., with LOINC codes.
- Additionally, code the device manufacturer and kit version in a standardized way.
- Code newly developed analyses in a homogeneous way, even if no standardized codes are available yet.
- Enable consolidation of data from different labs.

Reusability |
- Provide detailed metadata to maximize reproducibility, including:
  - LOINC codes,
  - batch numbers,
  - quality management data,
  - SPREC codes.

+ |
- Offer your laboratory medicine expertise to clinicians and researchers, as no one knows the intricacies of your laboratory data better than you.

Abbreviations: ETL: extract-transform-load; lab: laboratory; LOINC: Logical Observation Identifiers Names and Codes; PID: patient identifier; SPREC: Standard PREanalytical Code. + signifies the additional human resource (laboratory expertise).
Findable data must be stored in a way that enables easy retrieval. For “standard” examinations, this is usually realised through a patient identifier (PID) and date, so individual results can be assigned to the respective patients and collection times. Depending on the organization of the laboratory, this is easier said than done. Potential pitfalls are, for example, that the same PIDs might be assigned to different patients in different branch laboratories, or that analyses conducted for unidentified emergency patients cannot be attributed to the correct person once their identity has been clarified. Additionally, results of different patients might be combined under a “collective” PID for research purposes. Moreover, data can be confusing when samples are registered with the planned collection date instead of the actual collection date, resulting in analysis time points prior to collection. Equipment for special examinations poses particular challenges to findability, as such devices are frequently not connected to the LIS. Here, the patient ID may be entered manually into the evaluation files in a way that does not conform to the standard, which can lead to confusion and incomplete entries. An example of this is “-omics” analyses: analytical devices routinely produce and output files too large for transfer and storage in the central LIS. Therefore, they need to be linked, preferably in a searchable manner, to enable offline findability. Likewise, findability has to be addressed in the sharing of machine-actionable (meta)data online. Good metadata makes data findable. In web 1.0/2.0 approaches, this was addressed by the Linked Data Principles, a set of best practices for publishing structured data to the web
[14]. These principles were however proposed before the emergence of FAIR, meaning that little emphasis was put on standardization and a variety of inherently different schemas were proposed
[15]. One of the most recent efforts for making semantic artefacts FAIR has been launched by the FAIRsFAIR project, where the authors list recommendations for findable (meta)data, highlighting the need for GUPRIs (Globally Unique, Persistent and Resolvable Identifiers), highly enriched and searchable (meta)data descriptions, and, especially relevant for clinical laboratory sciences, the need to publish data and metadata separately
[16]. Findability remains one of the most important aspects of the FAIRification of Big Data analysis, as a lack of appropriate metadata standards affects the availability of research data in the long term. A recent study observed decreased findability of UK health datasets over time
[17], a trend also observed in a greater context of data-driven science, both in terms of the findability of datasets and the reachability of the responsible authors
[18].
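The sampling-time pitfall described above lends itself to a simple automated plausibility check. The following minimal sketch (in Python; PIDs and timestamps are invented for illustration) flags results whose analysis timestamp precedes the recorded collection timestamp, a typical sign that the planned rather than the actual sampling time was registered:

```python
from datetime import datetime

def implausible_timing(collected_at, analysed_at):
    """Flag a result whose analysis time precedes its recorded
    collection time."""
    return analysed_at < collected_at

# Purely illustrative records: (PID, collection time, analysis time)
records = [
    ("P001", datetime(2023, 5, 2, 8, 15), datetime(2023, 5, 2, 9, 40)),
    ("P002", datetime(2023, 5, 2, 14, 0), datetime(2023, 5, 2, 11, 5)),
]

flagged = [pid for pid, coll, ana in records if implausible_timing(coll, ana)]
print(flagged)  # ['P002']
```

In practice, such a check would run inside the ETL pipeline rather than as a standalone script, with flagged records routed back to the laboratory for correction.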
The accessibility of laboratory data can also be a challenge. LISs usually do not have freely accessible query functionalities because of regulatory requirements. Therefore, LISs that are not connected to central clinical data warehouses must be accessed through the laboratory IT personnel. This often leads to an enormous amount of additional work, since laboratory data are highly attractive for a variety of research projects
[19]. For use in clinical data warehouses, the LISs must be electronically connected, and the data prepared via ETL processes (Extract, Transform, Load). This requires the use of universal web standards including HTTP (Hypertext Transfer Protocol), standardized data exchange formats (e.g., FHIR
[20] and the semantic-based Resource Description Framework (RDF)
[21][22]) and tools which allow querying respective data (e.g., SPARQL
[23]). Additionally, data models like OMOP
[24][25] or i2b2
[26] are in common use. In true FAIR fashion, LISs must present a standard API (Application Programming Interface) with secure access protocols (e.g., SSL) for data management and retrieval
[16]. Generally, the entire content of the databases is not transferred, but a limited subset of data (e.g., data records that can be clearly assigned to patients) is identified and transmitted. A special challenge in this context is posed by legacy systems that are solely operated in read-only mode, where the effort for the technical connection must be weighed against the benefit of the further use of the data contained. Moreover, as the available data for researchers grows, there need to be mechanisms in place to enable privacy protection through de-identification or anonymization algorithms. While textbook methods, for instance k-anonymity
[27] or l-diversity
[28], are often cited, they do not come without their limitations
[29][30][31]. In this context, the question arises as to who is allowed to access the laboratory data and under what conditions. For example, data relating to infection serologies or staff medical service is particularly sensitive and requires careful data governance
[32]. Another important aspect is the question of patient consent: access for research projects needs to be restricted according to regulatory requirements
[33]. The use of patient data in research in Switzerland is governed by the Federal Act on Data Protection (FADP 1992, art. 3c) and the Human Research Act (HRA RS 810.30). Notably, the governance of Big Data is not different from “regular” research data: A request on the disposal and use of sensitive data must be submitted to a cantonal REC (Research Ethics Committee). Big Data research raises novel ethical concerns
[34], mostly surrounding the notions of privacy (hindrance of individual reidentification) and consent (possibility to later revoke consent), where traditional ethics oversight practice is often unaware of the direct societal impact of its decisions
[35]. A recent study in Switzerland showed that members of the seven Swiss RECs had broadly differing views regarding the opportunities and challenges of Big Data, citing insufficient expertise in big data analytics or computer science to adequately judge the use of Big Data in clinical research
[36]. This situation can become especially cumbersome for researchers when data from different institutions are merged—in this case, modern systems that work with secure multiparty computing and homomorphic encryption, such as the MedCo system, can be a promising approach
Wirth et al. offer an excellent overview of privacy-preserving data-sharing infrastructures for medical research
[38].
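As an illustration of the k-anonymity criterion cited above [27], the following sketch checks whether every combination of quasi-identifier values occurs at least k times in a released data set. Records and quasi-identifiers are invented for illustration; real de-identification pipelines additionally apply generalization and suppression strategies, and the cited limitations [29][30][31] still apply:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values
    occurs at least k times in the released data set."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Fully fictitious released records
released = [
    {"age_band": "40-49", "zip3": "301", "potassium": 4.1},
    {"age_band": "40-49", "zip3": "301", "potassium": 3.8},
    {"age_band": "50-59", "zip3": "302", "potassium": 4.5},
]

# The 50-59/302 combination is unique, so 2-anonymity fails
print(is_k_anonymous(released, ["age_band", "zip3"], k=2))  # False
```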
The next big and perhaps most important aspect for Big Data in laboratory medicine is the necessary semantic interoperability. This means that the individual data items must be clearly assigned semantically, ideally by means of standardized coding, e.g., along the lines of LOINC (Logical Observation Identifiers Names and Codes). This represents an enormous challenge, which has been addressed in Switzerland, for example, by the L4CHLAB project
[39]. It is not enough to identify laboratory analyses only by their trivial name (e.g., “potassium”)—the necessary granularity is defined by the requirements of the research projects based on it. Thus, a creatinine measurement of any kind may be sufficient as a “safety lab measurement” but be completely insufficient for a method comparison study or the establishment of reference intervals. It should be noted that currently there is no universal standard, as even LOINC does not specify, e.g., device manufacturer and kit version, which need to be coded additionally. Unique identifiers for medical devices, e.g., from the GUDID
[40] or EUDAMED database
[41], or type identifiers, e.g., from medical device nomenclatures such as GMDN
[42] or EMDN
[43], may enrich the LOINC system and increase its acceptance. Extensive preparatory work to address this issue has been done by the Swiss Personalized Health Network (SPHN), which established corresponding “concepts”
[44]. Particular difficulties arise from historically grown LISs, which are often not structured according to the 1:1 principle of LOINC nomenclature, preventing a clean assignment of laboratory analyses to unambiguous codes. This must be considered especially when replacing and updating LISs, so that the master data remains future-proof and interoperable
[11]. The use of advanced data models such as RDF is beneficial here, as it allows a data scheme to evolve over time without the need to change the original data
[22]. In the university environment, the latest test technology might be employed, using analyses that do not yet have a LOINC code assigned, making it necessary to deviate accordingly. For the consolidation of large amounts of data from different sources, a high semantic granularity, which is necessary for individual questions, can be problematic, as equivalent analyses must be defined as such in order to enable comprehensive evaluations. Here, Minimum Information Checklists (MICs), stating the minimum requirements for quality and quantity to make data descriptions accurate and useful, could offer a needed standardization to track data quality from various sources
[45][46]. It is essential that a core vocabulary features support for descriptions to be machine-readable RDF
[47], closely linking the commonly used semantics in laboratory medicine with machine-actionable descriptions. The use of semantic web technologies, such as RDF, in the laboratory environment could also help to establish the common use of Electronic Lab Notebooks (ELNs)
[48]. Notably, the application of suitable data formats facilitates, but by itself does not guarantee, actual interoperability of data sets from different data providers. Seemingly trivial details including spelling, cardinalities, datatypes, consistent use of GUPRIs, or measurement units must be carefully assessed. In the context of RDF, the Shapes Constraint Language (SHACL) allows the testing and validating of data against a set of predefined requirements
[49]. These conditions (SHACL rules) constitute a “shape graph” against which the actual data (as a “data graph”) is matched. The expression of complex constraints is facilitated by SHACL extensions supporting SPARQL and JavaScript
[50][51]. Despite the rise of user-friendly validation tools, semantic standards alone are not a “silver bullet” against data mayhem. In fact, even with maximum semantic care, the competence of experts in laboratory medicine remains in high demand. Different automated approaches for resolving the semantic heterogeneity when mapping different ontologies have been launched but still require human oversight
[52][53]. For many researchers who come from non-analytical subjects, the differences in the meaning of the analysis codes are not obvious at first glance. Considerable misinterpretations can occur, e.g., calculation of eGFR from urine creatinine. Here, the laboratory holds responsibility since it has the necessary competence to avoid such errors.
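The 1:1 mapping principle discussed above can be supported by simple automated checks on the laboratory's master data. In the sketch below, the local analysis codes and the mapping itself are invented for illustration (the LOINC codes shown should be verified against the current LOINC release); the check reports local analyses without an assigned code and LOINC codes reused for several local analyses:

```python
# Illustrative local-code-to-LOINC map; not an authoritative mapping.
local_to_loinc = {
    "K_SERUM": "2823-3",   # Potassium [Moles/volume] in Serum or Plasma
    "CREA_S": "2160-0",    # Creatinine [Mass/volume] in Serum or Plasma
    "CREA_U": "2161-8",    # Creatinine [Mass/volume] in Urine
    "NEWOMICS_1": None,    # in-house assay, no LOINC code assigned yet
}

def check_mapping(mapping):
    """Report local analyses without a code, and LOINC codes reused for
    several local analyses (violations of the 1:1 principle)."""
    unmapped = sorted(k for k, v in mapping.items() if v is None)
    seen = {}
    for local, code in mapping.items():
        if code is not None:
            seen.setdefault(code, []).append(local)
    ambiguous = {c: sorted(ls) for c, ls in seen.items() if len(ls) > 1}
    return unmapped, ambiguous

unmapped, ambiguous = check_mapping(local_to_loinc)
print(unmapped)   # ['NEWOMICS_1']
print(ambiguous)  # {}
```

Keeping serum and urine creatinine under distinct codes, as in this mapping, is exactly what prevents misinterpretations such as the eGFR example mentioned above.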
The reusability of laboratory medical data depends to a large extent on the existence and level of detail of the associated metadata. This includes—as already mentioned—not only analysis-related data (mapped in the dimensions of LOINC) but also batch numbers, quality management data, and, if applicable, SPRECs (Standard PREanalytical Codes)
[54]. In essence, everything that is or could be of importance for optimal replicability of the measurement results. It can be problematic that the metadata are stored in separate databases and cannot be provided automatically via the ETL processes, so that they can neither be exported nor viewed. Not only the (meta)data needs to be reusable but also the algorithms and data-processing scripts. With “FAIRly big”, a functional framework for retracing and verifying the computational processing of large-scale data based on machine-actionable provenance records, high performance could be observed regarding data sharing, transparency, and scalability, despite ignoring explicit metadata standards
[55]. Reusability can also refer to the efficient use of statistical models that may arise using machine learning methodology. The latter may involve a feedback process, where the model is validated and even further calibrated as information arrives through the expansion of the database with fresh data. Potential pitfalls impairing reusability may include legislative limitations imposed by national research acts or legal ambiguities in Data Transfer and Use Agreements (DTUA) of multicentre cohort studies involving several data providers.
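The metadata requirements sketched above can be enforced with a simple completeness check at export time. In the following sketch, the required field set and the record layout are illustrative assumptions, not a fixed standard:

```python
# Reuse-relevant metadata fields, following the items discussed above;
# this field set is an illustrative assumption.
REQUIRED_METADATA = {"loinc", "unit", "reagent_lot", "qc", "sprec"}

def missing_metadata(record):
    """Return the reuse-relevant metadata fields absent from a record."""
    return sorted(REQUIRED_METADATA - record.keys())

complete = {
    "loinc": "2160-0",          # analysis code
    "value": 78, "unit": "umol/L",
    "reagent_lot": "LOT-4711",  # batch number (fictitious)
    "qc": {"level_1_ok": True}, # quality management data
    "sprec": None,              # Standard PREanalytical Code, if available
}
print(missing_metadata(complete))                          # []
print(missing_metadata({"loinc": "2160-0", "value": 78}))  # ['qc', 'reagent_lot', 'sprec', 'unit']
```

Such a check is most useful inside the ETL pipeline, where records lacking reuse-relevant metadata can be reported before they reach a data warehouse.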
3.2. Risks
The use of laboratory medical data for Big Data analytics does not only have advantages but is also associated with a considerable number of risks: like all health data, laboratory values are worthy of special protection. As with all information compiled in large databases, there is an imminent risk of data leaks, especially if the data are accessible from the outside. Structured laboratory data can also be copied easily and quickly due to their small file size, so there is a considerable risk of unauthorized data duplication. Similarly, data governance must be ensured, which requires a comprehensive authorization framework—this is easier to implement in closed LISs. Another essential aspect is data integrity, which must be ensured in particular through the ETL process pipelines and also for further processing. LISs, as medical products, usually fulfil the necessary standards, but with self-written transformation scripts this may be different, so meticulous quality control must be enforced. However, this has the advantage that errors unrelated to the data transfer can also be detected and eliminated. In any case, certification of the IT processes is both sensible and costly. Post-analytics can also cause difficulties—the IT systems of the receivers (clinicians or researchers) must be able to handle the data formats supplied and must not alter or falsify their presentation. Another enormously problematic aspect is change tracking. In the LISs, laboratory tests are often identified by means of their internal analysis numbers—if changes occur here, e.g., due to the inclusion of new analyses, these changes must be reported to the peripheral systems—preferably automatically and with acknowledgement of receipt—otherwise serious analysis mix-ups can occur. Finally, when individual laboratory data are queried, the framework of the findings is no longer guaranteed—the analyses lose their context and, thus, their interpretability.
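The data-integrity risk in self-written ETL steps can be mitigated, for example, with content checksums. The sketch below computes an order-independent fingerprint of a record set so that the source and target of a transformation can be compared for silent loss or alteration; the record structure and values are invented for illustration:

```python
import hashlib
import json

def fingerprint(records):
    """Order-independent checksum over a set of records, allowing the
    source and target of an ETL step to be compared."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

# Same content, different key order and record order -- fictitious values
source = [{"pid": "P001", "value": 4.1}, {"pid": "P002", "value": 3.8}]
target = [{"value": 3.8, "pid": "P002"}, {"value": 4.1, "pid": "P001"}]
print(fingerprint(source) == fingerprint(target))  # True
```

A mismatch between the two fingerprints would indicate that records were lost, duplicated, or altered somewhere in the pipeline and warrants investigation before the target data are used.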
3.3. Chances
The introduction of “Big Data” technologies holds great potential for laboratory medicine, and some aspects will be specifically addressed here.
Setting up ETL processes inevitably leads to the detection of inadequacies in the structure and content of the laboratory’s master data. Frequently, LISs have grown over years and—although continuously maintained—are not organized in a fundamentally consistent manner. Before one can begin with the extraction and processing of laboratory data, the data organization, structure, and meta information must already be disclosed in the source system. A thorough review of these data should be carried out in the source database, because tidying up is necessary in any case, and it is quite obviously better done in the source system than in subordinate databases. Another important aspect is the necessary introduction of clear semantics—this is a laborious process that initially represents a large workload but is subsequently relatively easy to maintain. Many laboratories are reluctant to take on this effort—here, the diagnostics manufacturers are asked to supply the necessary codes (e.g., extended LOINC codes, see above) for the analyses they offer, e.g., in tabular form, which makes bulk import considerably easier and a matter of a few days. For researchers, in particular, it is also extremely helpful to have a data catalogue created in this context. Laboratory catalogues are often available electronically but are usually organized around request profiles rather than the individual analyses that are often of importance for research questions. The IT teams of the data warehouses will also be very grateful for appropriate documentation. This also offers the opportunity to make extensive metadata accessible and usable for interested researchers. Together with the introduction of semantics and data catalogues, transparent change tracking should be integrated, so queries in the data warehouses can be adapted accordingly if, for example, analyses have changed or new kits have been used.
Change tracking is also clearly to be advocated from a good laboratory practice (GLP) point of view.
Another aspect of outstanding importance for laboratory medicine as a scientific subject is the visibility and documentability of the contribution of laboratory medicine to research projects. In the vast majority of clinical studies, laboratory data play an extremely important role, be it as outcome variables, as safety values, as quality and compliance indicators, or as covariates. With a transparent database and query structure, the use and publication impact of laboratory data can be shown more clearly, and the position of the laboratory in the university environment as an essential collaboration and research partner can be strengthened. Other aspects include the improved use of patient data for research purposes—turning laboratory databases from graveyards of findings into fertile ground for research, an aspect that is certainly in the interest of patients in the context of improvement of treatment options. The improved indexability of laboratory data in large “data lakes” would also make it possible to link them to clinical data.
3.4. Fields of Application
Big Data, with its technological environment, does not yet in itself constitute a medical application; rather, it should be regarded as a basis and facilitator for a large number of potential uses. Primarily, applications come into consideration that already require a large amount of information to be processed and, thus, bring the human part of the evaluation pipeline to its processing limit. These include, of course, data-intensive “-omics” technologies, including not only pattern recognition in specialized metabolic diagnostics and new-born screening but also technical and medical validation and quality management. Further applications can be population-based evaluations such as the creation of reference value intervals. In the following, some of the potential fields of application are described.
An obvious field for Big Data technologies in laboratory medicine are “-omics” applications
[56][57][58]. These have been developed for nucleic acid-based techniques as, e.g., genomics
[59][60], transcriptomics
[61], and epigenomics
[62], as well as for mass spectrometry-based methodologies such as proteomics
[63][64], metabolomics
[65][66], lipidomics
[67], and others. The particular challenges in this field include connecting the analysis systems to the corresponding data lakes—it is no longer possible to work with traditional database technologies, and new approaches, for example, Hadoop
[68] become necessary. Even more than in the case of highly standardized routine procedures in classical laboratory medicine, metadata play an outstanding role in evaluability, comparability, and replicability. In addition, the raw data generated with these procedures are often formatted in a proprietary manner and are also of enormous size—comparable only with the data sets of the imaging disciplines. For retrieval, indexing and linking to the respective patient must be ensured; this can be achieved, for example, by linking tables of processed results instead of raw data output. The extent to which transformation and evaluation steps already make sense in the ETL process depends on the respective question, but following the FAIR principles, open file formats should be made available in addition to raw data, even if the transformation process is often accompanied by a loss of information (e.g., in mass spectrometry).
Moreover, in other diagnostic fields where a large number of different analyses have to be medically validated synoptically, Big Data technologies offer a good basis for the development of pattern recognition and AI algorithms, which not only help to automate workflows efficiently but also can recognize conspicuous patterns without fatigue and, thus, lead to a reduced false negative rate. New-born screening is a prime example of this
[69], but complex metabolic diagnostics will also benefit from data that is machine learning ready—there is still considerable potential for development
[70]. For algorithms to be registered as “medical devices”, the hurdles to be taken are fairly high, including proper assessment of potential risks, detailed software design specifications, traceability, data security, etc., just to name a few obligations to be compliant with the new “Medical Device Regulation” (MDR) of the European Union
[71].
Besides laboratory diagnostics itself, there are a large number of other fields of application for Big Data in laboratory medicine, one example being the field of quality management. Mark Cervinski notes that “modelling of Big Data allowed us to develop protocols to rapidly detect analytical shifts”—additionally, administrative and process-oriented aspects, such as optimizing turnaround time (TAT), can also benefit from Big Data
[10]. This is especially true since, under a heavy workload, the main factor affecting TATs is not the verification step of test results but rather the efficiency of the laboratory equipment
[72]. With the help of predictive modelling, TATs that are likely to exceed their allocated time could be highlighted. Furthermore, these highlighted TATs could potentially be relayed to the ordering clinician, allowing new levels of laboratory-reporting transparency.
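Such predictive TAT flagging can be illustrated with a deliberately simple median-based heuristic; a real implementation would use a proper predictive model, and the step names, durations, and target time below are invented for illustration:

```python
from statistics import median

def likely_to_exceed(elapsed_min, pending_step_history, target_min=60):
    """Flag an open order as likely to exceed its target turnaround time,
    using the median historical duration of each step still pending."""
    expected_remaining = sum(median(h) for h in pending_step_history)
    return elapsed_min + expected_remaining > target_min

# Historical step durations in minutes -- illustrative values
centrifugation = [8, 10, 9, 12]
analysis = [25, 30, 28]

print(likely_to_exceed(30, [centrifugation, analysis]))  # True  (30 + 9.5 + 28 = 67.5)
print(likely_to_exceed(5, [centrifugation, analysis]))   # False (5 + 37.5 = 42.5)
```

An order flagged this way could be relayed to the ordering clinician while it is still in progress, rather than after the target time has already been missed.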
Clinical decision support systems are more oriented towards clinical needs and are essentially based on laboratory data. They can take the form of integrated devices
[73] or of more or less complex algorithms that enable the integration of multimodal information and allow clinicians to quickly and reliably make statements about the diagnostic value of the constellations of findings. An example of this is the prediction of the growth of bacteria in urine culture based on urine-flow cytometric data
[74].
Perhaps the most exciting field of application for Big Data in laboratory medicine, however, is predictive and preemptive diagnostics. With the help of laboratory data, probabilities for a variety of patient-related events can be calculated and, in the best case, therapeutic countermeasures can be initiated, so that the events do not occur in the first place. This can range from the prediction of in-house mortality, in the sense of an alarm triage
[75][76], to the prediction of derailments in the blood glucose levels of diabetic patients
[77]—the possible applications are almost unlimited.