Integrating text-mining into the curation of disease maps

Integrating text-mining into the curation of disease maps: History

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Others

Contributor: Liza Vinhoven

Text mining algorithms can be used to analyse the natural language of scientific publications. These types of algorithms can take humanly readable text passages and convert them into a more ordered, machine-usable data structure. To support the creation of systems medicine models such as disease maps by text mining, an interactive, user-friendly disease map viewer was developed to sit at the interface between computational text mining and the manual expert creation of disease maps. The disease map viewer displays text mining results in a systems biology map, where the user can review them and either validate or reject identified interactions. Ultimately, the viewer brings together the time-saving advantages of text mining with the accuracy of manual data curation. The disease map viewer, installation instructions, and the exemplary cystic fibrosis data set are available under https://s.gwdg.de/8bK6f5.

Text mining
disease map
systems biology

1. Introduction

Every day, more and more data and knowledge on different diseases and their underlying biological pathways are being acquired. Thus, it is becoming increasingly important to develop methods of data and knowledge integration, storage, and representation in ways that can be interpreted and analysed by humans and computers alike. One of these approaches is systems medicine disease maps, which has been proposed by Mazein et al. in 2018. The authors define disease maps as a “comprehensive, knowledge-based representation of disease mechanisms” [1]. They evolved from and are comparable to metabolic and signaling pathways, stored and represented in standardized formats such as the Systems Biology Graphical Notation (SBGN) [2] or Systems Biology Markup Language (SBML) [3]. Disease maps can be used for a multitude of purposes, such as identifying disease biomarkers and drug targets, drug repositioning, structuring omics data, and developing improved diagnostics [1,4]. Most recently, a large, interdisciplinary community of over 230 researchers launched a project to create a COVID-19 disease map [5]. This resulted in what, to the best of our knowledge, is the largest disease map to date, currently consisting of 5499 elements, which are connected by 1836 interactions across 42 diagrams. The data for this enormous knowledge resource were curated from 617 publications and preprints, highlighting the sheer time and manpower required to create these manually curated disease maps. One way to support the construction of disease maps is text mining, the automated annotation of texts that produces a condensed keyword list, which can then be formatted into machine- and human-readable media and to consist of the core information of that text. In principle, text mining means the extraction of information from textual data, thereby speeding up the curation and annotation process of human-written text [6]. To do so, many possible information technologies are applicable, for example, machine learning, pattern matching, or the processing of natural, human-readable language [7].

In general, a text mining algorithm will follow the steps below. As an input, the algorithm will take a human-readable sentence, in this case from a biological paper. It will then first highlight the named entities (NE), which are terms that are then normalized and transformed into identifiers. These NEs can be proteins, genes, diseases, or any other biologically relevant term, taken from an underlying database that contains NEs that the system should be able to identify. The entities are then assigned to unique identifiers, which are then organized into an identifier scheme. Afterward, the extracted relationships from the input text data are included between named entities. The resulting network of nodes and relationships can then be compared and expanded with additional text data. With the help of this network, new hypotheses can be formed and these can then be the subject of further research [7].

Even though great strides have been made in the development of text mining algorithms with high sensitivity and specificity, they cannot yet replace a human expert curator. We, therefore, developed a tool to bring together the advantages of text mining and the expert knowledge and experience of scientists to support the creation of systems biology disease maps. Our tool consists of an interactive disease map viewer, which takes the output of an independent text mining system, translates it to the required format, and displays it in a disease map-like cellular layout. This allows the user to utilize the text mining approach they find most suitable for their use case or even include results from more than one system. The user then has the possibility to examine the interactions identified by the text mining algorithm and evaluate them based on the text passage they are based on. In the end, this results in a list of automatically parsed but expert-validated interactions, which can then be used as a basis for a disease map. Ultimately, this simplifies and significantly speeds up the curation step during the construction of disease maps.

2. Disease Map Viewer

In order to support the creation of disease maps, we developed a tool capable of displaying text mining results as disease maps and validating them through the integration of expert domain knowledge.

For this purpose, we used an independent, exchangeable text mining algorithm to parse molecular interactions between biological entities’ data from publicly available scientific text. The results are output in two simple, reproducible CSV files, one containing the interactions between the entities themselves and the other specifying their subcellular localization. A flowchart of the input data, software, and output data of the systems can be seen in Figure 1.

Figure 1. Flowchart of the processes included in the tool. Input knowledge and data are shown in green on the right, the software modules are shown in yellow, and the output files are shown in blue on the right. Two CSV files, one containing the list of interactions and one containing the subcellular localisation of the entities, serve as input for the CytoscapeJSON parser implemented in Python. The resulting JSON file serves as input for the disease map viewer, where the interactions are validated by expert knowledge. The validated interactions can then be exported in a cellular layout in a JSON file or as a list of interactions in a CSV file.

To prepare text mining results that are easy to store, share, and use, we used a Python script to convert them from a simple CSV file to JSON format. Simply put, the JSON data structure of the text mining results is a list of every element (nodes, compartments, and edges) in the disease map. This SBML-based JSON format is used by the Cytoscape.js library to create the graphical SBGN map from it. The interface is built around the Cytoscape.js instance that renders and displays disease maps to help the user annotate and review the text-mined disease map conveniently.

Figure 2 shows the interface with exemplary data. The main graph is shown in a cell-like layout, where the user can zoom in and out. The rectangular nodes represent the molecular entities and are localized in the subcellular compartment specified in the JSON file. The arrow-shaped edges represent molecular interactions between them. All entities (genes/proteins and compartments), as well as their respective edges, can be moved freely by dragging to improve structure and visibility to fit the user’s needs.

Figure 2. Interface of the disease map viewer. The large window in the middle shows the text mining data as a coarse disease map in a cellular layout. The left sidebar shows the legend and filter options, and the right sidebar shows the review function, where the supporting sentences from the parsed publications are displayed and the user can validate or reject an interaction. The buttons on the bottom left show the timeline option, where the interaction data can be filtered by date of publication.

The colouring is the colour of categorization of found verbs. All “activating” edges are coloured green, “inhibiting” edges are coloured red, “neutral” edges are coloured blue, and “undefined” edges have a grey colour, while incoherent interactions are shown in brown.

The left sidebar shows the legend and filter options for the edges in the graph. As a default, all edges are displayed, but the user can uncheck types of edges to hide them and thus obtain a better overview of the remaining categories of edges. This legend can be opened and closed by clicking the top button “hide/show filter”.

Another way the data from the text mining are categorized is by the thickness of the edges in the graph. The more distinct publications have been found to have both connected nodes mentioned in the same sentence, the thicker the edge between them. In the bottom-left corner of the filter window, the user can filter the edges depending on the number of supporting publications. The slider can be moved to define a minimum number of publications an edge needs to have to display it. Moreover, below the slider is a button that will reset the filter and reload the map.

In order to integrate expert knowledge and validate text-mined data, we included a review function, as observed in the right-hand panel of the interface. The user can examine all interactions with two methods: by clicking the “Next edge” button to iterate all interactions that need to be reviewed or by directly selecting a specific edge from the graph. The review panel will then display the two nodes connected by the clicked edge and the colour of the edge between both, as well as the current review status of the interaction. Below this, a list of PubMed IDs is displayed together with the sentences that have been used to identify the interaction in each reference. The verbs that have been used to categorize the interaction are coloured in red. The user can then load the entire text to obtain more context for the sentence. The user can then review the interaction with all available data on hand and assign a status to the interaction. If the expert approves the text-mined interaction, the “accept” status can be selected. If the text-mined interaction is a false positive, the “decline” status is appropriate, and if more research needs to be conducted to approve the interaction, the “further inspection needed” status can be assigned.

To view the status of the review process, the data can be downloaded either as a CSV file with all interactions, their current review status, and the PubMed ID from with the interaction, which was text mined from the disease map, or as a JSON file with the entire disease map in a JSON object that can be saved for reloading in a later session or to share with other users.

To show how the viewer operates, we used an individualized text mining workflow to create a sample data set with the use case of cystic fibrosis, based on the CFTR Lifecycle Map we previously curated [24].

The disease map viewer, installation instructions, and the exemplary cystic fibrosis data set are available under https://s.gwdg.de/8bK6f5.

©Text is available under the terms and conditions of the Creative Commons-Attribution ShareAlike (CC BY-SA) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.