The interactive, user-friendly disease map viewer was developed to support the automated creation of systems medicine models such as disease maps by text mining. It sits at the interface between computational text mining and the manual expert creation of disease maps and brings together the time-saving advantages of text mining with the accuracy of manual data curation.
1. Introduction
In light of the rapidly increasing data and knowledge on diseases and their underlying biological pathways, it is becoming more and more essential to integrate, store and visualize them. In order to analyse and interpret the data efficiently, it is important that these knowledge representations are human- as well as machine-readable. For this purpose, approaches such as systems medicine disease maps have been gaining importance in recent years. Disease maps were proposed by Mazein et al. and are defined as "comprehensive, knowledge-based representations of disease mechanisms" [1]. They are based on systems biology models and written in the Systems Biology Graphical Notation (SBGN) [2], but combine regulatory networks, metabolic and signalling pathways, as well as extensions such as different phenotypes. These disease maps can be used for a range of applications, such as identifying disease biomarkers and drug targets, drug repositioning, structuring omics data, and developing improved diagnostics [1][3]. The largest disease map to date is the COVID-19 disease map [4]. It was created by 130 researchers and consists of 42 diagrams with a total of 5499 elements, connected by 1836 interactions, which were curated from 617 publications and preprints. This highlights the sheer time and manpower required to manually curate these valuable knowledge resources. One way to support the construction of disease maps is by text mining. Text mining refers to the automated annotation of human-written texts to extract information and bring it into a human- and machine-readable format, thereby speeding up the curation and annotation process [5]. Various techniques can be applied to do so, for example, machine learning, pattern matching, or natural language processing [6].
In general, a text mining algorithm follows the steps below.
1. As an input, the algorithm takes a human-readable sentence, in this case from a biological paper. It first highlights the named entities (NEs), which are the terms that are subsequently normalized and transformed into identifiers. These NEs can be proteins, genes, diseases, or any other biologically relevant terms, taken from an underlying database that contains the NEs the system should be able to identify.
2. The entities are assigned unique identifiers, which are then organized into an identifier scheme.
3. The relationships extracted from the input text are added between the named entities.
The resulting network of nodes and relationships can then be compared and expanded with additional text data. With the help of this network, new hypotheses can be formed and these can then be the subject of further research [6].
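The three steps above can be illustrated with a deliberately minimal sketch. It assumes a dictionary-based entity lookup and a small verb lexicon; the entity names, identifiers, verb list, and function names are hypothetical placeholders, and real text mining systems rely on far more sophisticated methods.

```python
import re

# Hypothetical dictionary of known named entities and their identifiers.
ENTITY_IDS = {"GeneA": "ID:0001", "GeneB": "ID:0002"}
# Hypothetical verb lexicon used to categorize a relationship.
VERB_CATEGORIES = {"activates": "activating", "inhibits": "inhibiting", "binds": "neutral"}


def find_entities(sentence):
    """Step 1: return the known named entities in order of appearance."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    return [token for token in tokens if token in ENTITY_IDS]


def normalize(entities):
    """Step 2: map each named entity to its unique identifier."""
    return [ENTITY_IDS[entity] for entity in entities]


def extract_relation(sentence):
    """Step 3: link the first two entities using the first known verb."""
    entities = find_entities(sentence)
    if len(entities) < 2:
        return None
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    category = next(
        (VERB_CATEGORIES[t] for t in tokens if t in VERB_CATEGORIES), "undefined"
    )
    source, target = normalize(entities[:2])
    return {"source": source, "target": target, "category": category}
```

Calling `extract_relation("GeneA inhibits GeneB in the nucleus.")` would yield one edge between the two identifiers with the category "inhibiting"; a sentence with fewer than two known entities yields no edge.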
In recent years, great strides have been made in the development of text mining algorithms with high sensitivity and specificity, but they cannot yet replace a human expert curator. Therefore, the researchers developed a tool that combines the speed of text mining with the accuracy of scientists' expert knowledge and experience to support the creation of systems medicine disease maps.
The tool consists of an interactive disease map viewer, which takes the output of text mining algorithms, translates it to the required format, and displays it in a cellular layout similar to that of a disease map. As the disease map viewer is a stand-alone tool, the user is able to utilize the text mining approach they find most suitable for their use case or even include results from more than one system. The user can then examine the interactions identified by the text mining algorithm and evaluate them against the text passages from which they were extracted. In the end, this results in a list of automatically parsed but expert-validated interactions, which can then be used as a basis for a disease map. Ultimately, this simplifies and significantly speeds up the curation step during the construction of disease maps.
2. Application
The required input data for the disease map viewer is biological interaction data parsed by a text mining algorithm. The results have to be formatted in two simple, reproducible CSV files, one containing the interactions between the entities themselves and the other specifying the subcellular localization of each biological entity. A flowchart of the input data, software, and output data of the system can be seen in Figure 1.
Figure 1. Flowchart of the processes included in the tool. Input knowledge and data are shown in green on the right, the software modules are shown in yellow, and the output files are shown in blue on the right. Two CSV files, one containing the list of interactions and one containing the subcellular localisation of the entities, serve as input for the CytoscapeJSON parser implemented in Python. The resulting JSON file serves as input for the disease map viewer, where the interactions are validated by expert knowledge. The validated interactions can then be exported in a cellular layout in a JSON file or as a list of interactions in a CSV file.
To prepare text mining results that are easy to store, share, and use, the researchers used a Python script to convert them from simple CSV files to JSON format. Simply put, the JSON data structure of the text mining results is a list of every element (nodes, compartments, and edges) in the disease map. This SBML-based JSON format is used by the Cytoscape.js library to create the graphical SBGN map. The interface is built around the Cytoscape.js instance that renders and displays the disease map, so that the user can conveniently annotate and review the text-mined interactions.
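As an illustration of this conversion, the following sketch turns the two CSV inputs into a flat list of elements in the general `group`/`data` form accepted by Cytoscape.js, with compartments as parent nodes. The column names (`source`, `target`, `category`, `pmid`, `entity`, `compartment`) and the function name are assumptions made for this example and are not taken from the parser described here.

```python
import csv
import io
import json


def build_elements(interactions_csv, localization_csv):
    """Convert two CSV texts into a flat list of Cytoscape.js-style elements."""
    locations = {
        row["entity"]: row["compartment"]
        for row in csv.DictReader(io.StringIO(localization_csv))
    }
    elements = []
    # Each compartment becomes a parent (compound) node.
    for compartment in sorted(set(locations.values())):
        elements.append({"group": "nodes", "data": {"id": compartment}})
    # Each entity becomes a child node inside its compartment.
    for entity, compartment in locations.items():
        elements.append(
            {"group": "nodes", "data": {"id": entity, "parent": compartment}}
        )
    # Each interaction row becomes an edge between two entity nodes.
    for index, row in enumerate(csv.DictReader(io.StringIO(interactions_csv))):
        elements.append({
            "group": "edges",
            "data": {
                "id": f"e{index}",
                "source": row["source"],
                "target": row["target"],
                "category": row["category"],
                "pmid": row["pmid"],
            },
        })
    return elements


def to_json(elements):
    """Serialize the element list so that it can be stored or shared."""
    return json.dumps(elements, indent=2)
```

Keeping nodes, compartments, and edges in one flat list mirrors the description above, where the JSON structure is simply a list of every element in the map.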
Figure 2 shows the interface with example data. The main graph is shown in a cell-like layout, where the user can zoom in and out. The rectangular nodes represent the molecular entities and are localized in the subcellular compartment specified in the JSON file. The arrow-shaped edges represent molecular interactions between them. All entities (genes/proteins and compartments), as well as their respective edges, can be moved freely by dragging to improve structure and visibility to fit the user’s needs.
Figure 2. Interface of the disease map viewer. The large window in the middle shows the text mining data as a coarse disease map in a cellular layout. The left sidebar shows the legend and filter options, and the right sidebar shows the review function, where the supporting sentences from the parsed publications are displayed and the user can validate or reject an interaction. The buttons on the bottom left show the timeline option, where the interaction data can be filtered by date of publication.
The colour of each edge reflects the category of the verbs found for the interaction. All “activating” edges are coloured green, “inhibiting” edges are coloured red, “neutral” edges are coloured blue, and “undefined” edges are coloured grey, while incoherent interactions are shown in brown.
The left sidebar shows the legend and filter options for the edges in the graph. As a default, all edges are displayed, but the user can uncheck types of edges to hide them and thus obtain a better overview of the remaining categories of edges. This legend can be opened and closed by clicking the top button “hide/show filter”.
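The colour scheme and the category filter can be expressed as a simple lookup followed by a selection. This is a minimal sketch: the dictionary keys, the `category` field, and the fallback to grey for unknown categories are assumptions for illustration, not the viewer's actual implementation.

```python
# Colour per edge category, following the scheme described in the text.
EDGE_COLOURS = {
    "activating": "green",
    "inhibiting": "red",
    "neutral": "blue",
    "undefined": "grey",
    "incoherent": "brown",
}


def edge_colour(category):
    """Return the display colour; unknown categories fall back to grey (assumption)."""
    return EDGE_COLOURS.get(category, EDGE_COLOURS["undefined"])


def visible_edges(edges, checked_categories):
    """Keep only edges whose category is still checked in the legend."""
    return [edge for edge in edges if edge["category"] in checked_categories]
```

With all categories checked, `visible_edges` returns every edge, which corresponds to the default view; unchecking a category removes the matching edges from the result.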
The text mining data are also categorized by the thickness of the edges in the graph. The more distinct publications mention both connected nodes in the same sentence, the thicker the edge between them. In the bottom-left corner of the filter window, the user can filter the edges depending on the number of supporting publications. The slider can be moved to define the minimum number of publications an edge needs in order to be displayed. Below the slider is a button that resets the filter and reloads the map.
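The relationship between supporting publications, edge thickness, and the slider filter can be sketched as follows. The `pmids` field, the linear width formula, and its parameters are hypothetical choices for illustration; the viewer's actual scaling is not specified here.

```python
def publication_count(edge):
    """Count the distinct publications that support an edge."""
    return len(set(edge["pmids"]))


def edge_width(edge, base=1.0, step=0.5, max_width=8.0):
    """Grow the width linearly with the publication count, up to a cap (assumed scaling)."""
    return min(base + step * (publication_count(edge) - 1), max_width)


def filter_by_support(edges, min_publications):
    """Keep only edges supported by at least `min_publications` distinct publications."""
    return [edge for edge in edges if publication_count(edge) >= min_publications]
```

Counting distinct identifiers rather than sentences ensures that several sentences from the same publication do not inflate the thickness of an edge.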
In order to integrate expert knowledge and validate text-mined data, the researchers included a review function, shown in the right-hand panel of the interface. The user can examine the interactions in two ways: by clicking the “Next edge” button to iterate through all interactions that need to be reviewed or by directly selecting a specific edge from the graph. The review panel then displays the two nodes connected by the selected edge and the colour of the edge between them, as well as the current review status of the interaction. Below this, a list of PubMed IDs is displayed together with the sentences that have been used to identify the interaction in each reference. The verbs that have been used to categorize the interaction are coloured in red. The user can load the entire text to obtain more context for a sentence and can then review the interaction with all available data on hand and assign a status to it. If the expert approves the text-mined interaction, the “accept” status can be selected. If the text-mined interaction is a false positive, the “decline” status is appropriate, and if more research needs to be conducted to approve the interaction, the “further inspection needed” status can be assigned.
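The review workflow can be modelled as a small state assignment per edge. In this sketch, the three review statuses follow the text, while the `UNREVIEWED` state, the `status` field, and the function names are assumptions introduced for illustration.

```python
from enum import Enum


class ReviewStatus(Enum):
    UNREVIEWED = "unreviewed"  # assumed initial state, not named in the text
    ACCEPTED = "accept"
    DECLINED = "decline"
    FURTHER_INSPECTION = "further inspection needed"


def next_edge(edges):
    """Return the next edge that still needs a review, or None if all are reviewed."""
    return next(
        (edge for edge in edges if edge["status"] is ReviewStatus.UNREVIEWED), None
    )


def review(edge, status):
    """Assign the expert's decision to an edge."""
    if status is ReviewStatus.UNREVIEWED:
        raise ValueError("a review must assign one of the three decisions")
    edge["status"] = status
```

Repeatedly calling `next_edge` and `review` walks through the remaining interactions in order, similar in spirit to the “Next edge” button; selecting an edge directly corresponds to calling `review` on any chosen edge.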
To view the status of the review process, the data can be downloaded in two forms: as a CSV file listing all interactions in the disease map, their current review status, and the PubMed IDs from which each interaction was text mined, or as a JSON file containing the entire disease map in a JSON object, which can be saved for reloading in a later session or shared with other users.
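The two export formats could be produced along the following lines. This is a sketch under stated assumptions: the column names, the semicolon-separated PubMed ID field, and the `elements` wrapper key are hypothetical and may differ from the files the viewer actually writes.

```python
import csv
import io
import json


def export_csv(edges):
    """Write one row per interaction with its review status and supporting PubMed IDs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "status", "pmids"])
    for edge in edges:
        writer.writerow(
            [edge["source"], edge["target"], edge["status"], ";".join(edge["pmids"])]
        )
    return buffer.getvalue()


def export_json(elements):
    """Serialize the whole map as one JSON object for later reloading or sharing."""
    return json.dumps({"elements": elements}, indent=2)
```

The CSV form is convenient for passing validated interactions on to map construction, whereas the JSON form preserves the complete map, including layout-relevant elements, for a later session.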