Integrating text-mining into the curation of disease maps

Integrating text-mining into the curation of disease maps: Comparison

Please note this is a comparison between Version 1 by Liza Vinhoven and Version 5 by Catherine Yang.

Text mining algorithms can be used to analyse the natural language of scientific publications. These types of algorithms can take human-readable text passages and convert them into a more ordered, machine-usable data structure. To support the automated creation of systems medicine models such as disease maps by text mining, an interactive, user-friendly disease map viewer was developed to sit at the interface between computational text mining and the manual expert creation of disease maps. The disease map viewer displays text mining results in a systems biology map, where the user can review them and either validate or reject identified interactions. Ultimately, the viewer brings together the time-saving advantages of text mining with the accuracy of manual data curation. The disease map viewer, installation instructions, and the exemplary cystic fibrosis data set are available under https://s.gwdg.de/8bK6f5.

Text mining
disease map
systems biology

1. Introduction

In light of the rapidly increasing data and knowledge on diseases and their underlying biological pathways, it is becoming more and more essential to integrate, store and visualize them. In order to analyse and interpret the data efficiently, it is important that these knowledge representations are human- as well as machine-readable. For this purpose, approaches such as systems medicine disease maps have been gaining importance over the last years. Disease maps were proposed by Mazein et al. and are defined as "comprehensive, knowledge-based representations of disease mechanisms" ^[1]. They are based on systems biology models and written in the Systems Biology Graphical Notation ^[2], but combine regulatory networks, metabolic and signalling pathways, as well as extensions such as different phenotypes. These disease maps can be used for a range of applications, such as identifying disease biomarkers and drug targets, drug repositioning, structuring omics data, and developing improved diagnostics ^[1][3]. The largest disease map to date is the COVID-19 disease map ^[4]. It was created by over 230 researchers and consists of 42 diagrams with a total of 5499 elements, connected by 1836 interactions, which were curated from 617 publications and preprints. This highlights the sheer time and manpower required to manually curate these valuable knowledge resources. One way to support the construction of disease maps is by text mining. Text mining refers to the automated annotation of human-written texts to extract the information and bring it into a human- and machine-readable format, thereby speeding up the curation and annotation process ^[5]. To do so, many possible information technologies are applicable, for example, machine learning, pattern matching, or the processing of natural, human-readable language ^[6].

Every day, more and more data and knowledge on different diseases and their underlying biological pathways are being acquired. Thus, it is becoming increasingly important to develop methods of data and knowledge integration, storage, and representation in ways that can be interpreted and analysed by humans and computers alike. One of these approaches is systems medicine disease maps, which were proposed by Mazein et al. in 2018. The authors define disease maps as a “comprehensive, knowledge-based representation of disease mechanisms” [1]. They evolved from and are comparable to metabolic and signaling pathways, stored and represented in standardized formats such as the Systems Biology Graphical Notation (SBGN) [2] or Systems Biology Markup Language (SBML) [3]. Disease maps can be used for a multitude of purposes, such as identifying disease biomarkers and drug targets, drug repositioning, structuring omics data, and developing improved diagnostics [1,4]. Most recently, a large, interdisciplinary community of over 230 researchers launched a project to create a COVID-19 disease map [5]. This resulted in what, to the best of our knowledge, is the largest disease map to date, currently consisting of 5499 elements, which are connected by 1836 interactions across 42 diagrams. The data for this enormous knowledge resource were curated from 617 publications and preprints, highlighting the sheer time and manpower required to create these manually curated disease maps. One way to support the construction of disease maps is text mining, the automated annotation of texts that produces a condensed keyword list, which consists of the core information of that text and can then be formatted into machine- and human-readable media. In principle, text mining means the extraction of information from textual data, thereby speeding up the curation and annotation process of human-written text [6]. To do so, many possible information technologies are applicable, for example, machine learning, pattern matching, or the processing of natural, human-readable language [7].

In general, a text mining algorithm will follow the steps below.

1. As an input, the algorithm will take a human-readable sentence, in this case from a biological paper. It will then first highlight the named entities (NEs), which are terms that are then normalized and transformed into identifiers. These NEs can be proteins, genes, diseases, or any other biologically relevant terms, drawn from an underlying database of the NEs that the system should be able to identify.

2. The entities are then assigned unique identifiers, which are organized into an identifier scheme.

3. Afterward, the relationships extracted from the input text are added between the named entities.

The resulting network of nodes and relationships can then be compared and expanded with additional text data. With the help of this network, new hypotheses can be formed and these can then be the subject of further research ^[6][7].
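The three steps above can be illustrated with a minimal sketch. It is not the text mining system used for the viewer; it assumes a simple dictionary lookup for entity recognition and a co-occurrence rule for relations. The lexicon, the identifiers, the verb list, and the example sentence are all made up for illustration.

```python
import re
from itertools import combinations

# Hypothetical lexicon: surface forms mapped to made-up identifiers.
LEXICON = {"proteina": "ID:0001", "proteinb": "ID:0002"}
# Hypothetical verb list with interaction categories.
VERB_CATEGORIES = {
    "activates": "activating",
    "inhibits": "inhibiting",
    "binds": "neutral",
}


def find_entities(sentence):
    """Steps 1 and 2: find lexicon terms and return their identifiers in text order."""
    lowered = sentence.lower()
    hits = []
    for term, identifier in LEXICON.items():
        match = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        if match:
            hits.append((match.start(), identifier))
    return [identifier for _, identifier in sorted(hits)]


def extract_relations(sentence):
    """Step 3: link every pair of entities that co-occur in the sentence."""
    entities = find_entities(sentence)
    words = re.findall(r"[a-z]+", sentence.lower())
    # Use the first known verb to categorize the relation, if any.
    category = next(
        (VERB_CATEGORIES[w] for w in words if w in VERB_CATEGORIES), "undefined"
    )
    return [(a, category, b) for a, b in combinations(entities, 2)]


relations = extract_relations("ProteinA inhibits ProteinB in the cytoplasm.")
print(relations)  # [('ID:0001', 'inhibiting', 'ID:0002')]
```

Real systems typically replace the dictionary lookup and the co-occurrence rule with trained models, but the output has the same shape: identifier pairs connected by a categorized relation.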

In the last years, great strides have been made in the development of text mining algorithms with high sensitivity and specificity, but they cannot yet replace a human expert curator. We therefore developed a tool that brings together the speed of text mining and the accuracy of expert knowledge to support the creation of systems medicine disease maps.

Our tool consists of an interactive disease map viewer, which takes the output of an independent text mining system, translates it to the required format, and displays it in a cellular layout similar to a disease map. As the disease map viewer is a stand-alone tool, the user is able to utilize the text mining approach they find most suitable for their use case or even include results from more than one system. The user can then examine the interactions identified by the text mining algorithm and evaluate them based on the text passages from which they were extracted. In the end, this results in a list of automatically parsed but expert-validated interactions, which can then be used as a basis for a disease map. Ultimately, this simplifies and significantly speeds up the curation step during the construction of disease maps.

2. Disease Map Viewer

In order to support the creation of disease maps, we developed a tool capable of displaying text mining results as disease maps and validating them through the integration of expert domain knowledge. The required input data for the disease map viewer are molecular interactions between biological entities, parsed from publicly available scientific text by an independent, exchangeable text mining algorithm. The results have to be formatted as two simple CSV files, one containing the interactions between the entities and the other specifying the subcellular localization of each biological entity. A flowchart of the input data, software, and output data of the system can be seen in Figure 1.

Figure 1. Flowchart of the processes included in the tool. Input knowledge and data are shown in green on the left, the software modules are shown in yellow, and the output files are shown in blue on the right. Two CSV files, one containing the list of interactions and one containing the subcellular localisation of the entities, serve as input for the CytoscapeJSON parser implemented in Python. The resulting JSON file serves as input for the disease map viewer, where the interactions are validated by expert knowledge. The validated interactions can then be exported in a cellular layout in a JSON file or as a list of interactions in a CSV file.

To prepare text mining results that are easy to store, share, and use, we used a Python script to convert them from simple CSV files to JSON format. Simply put, the JSON data structure of the text mining results is a list of every element (nodes, compartments, and edges) in the disease map. This SBML-based JSON format is used by the Cytoscape.js library to create the graphical SBGN map. The interface is built around the Cytoscape.js instance that renders and displays disease maps to help the user annotate and review the text-mined disease map conveniently. Figure 2 shows the interface with exemplary data. The main graph is shown in a cell-like layout, where the user can zoom in and out. The rectangular nodes represent the molecular entities and are localized in the subcellular compartment specified in the JSON file. The arrow-shaped edges represent molecular interactions between them. All entities (genes/proteins and compartments), as well as their respective edges, can be moved freely by dragging to improve structure and visibility to fit the user’s needs.
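As a rough illustration of the conversion step, the sketch below turns two small CSV inputs into a flat list of elements in the Cytoscape.js elements JSON style, where a node is placed inside a compartment through its `parent` field. This is not the parser shipped with the viewer: the column names (`source`, `target`, `category`, `pmid`, `entity`, `compartment`), the entity names, and the placeholder publication ID are assumptions chosen for the example.

```python
import csv
import io
import json

# Hypothetical CSV contents; the real column names may differ.
INTERACTIONS_CSV = """source,target,category,pmid
ProteinA,ProteinB,activating,PMID_PLACEHOLDER
"""
LOCALIZATION_CSV = """entity,compartment
ProteinA,cytoplasm
ProteinB,nucleus
"""


def build_elements(interactions_text, localization_text):
    """Return one flat list holding compartments, nodes, and edges."""
    elements = []
    seen_compartments = set()
    for row in csv.DictReader(io.StringIO(localization_text)):
        compartment = row["compartment"]
        if compartment not in seen_compartments:
            # Each compartment becomes a compound (parent) node, added once.
            seen_compartments.add(compartment)
            elements.append({"group": "nodes", "data": {"id": compartment}})
        # The entity is nested in its compartment via the "parent" field.
        elements.append(
            {"group": "nodes", "data": {"id": row["entity"], "parent": compartment}}
        )
    for index, row in enumerate(csv.DictReader(io.StringIO(interactions_text))):
        elements.append(
            {
                "group": "edges",
                "data": {
                    "id": f"e{index}",
                    "source": row["source"],
                    "target": row["target"],
                    "category": row["category"],
                    "pmid": row["pmid"],
                },
            }
        )
    return elements


elements = build_elements(INTERACTIONS_CSV, LOCALIZATION_CSV)
print(json.dumps(elements, indent=2))
```

Keeping compartments, entities, and edges in a single list mirrors the description above of the JSON structure as a list of every element in the map.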

Figure 2. Interface of the disease map viewer. The large window in the middle shows the text mining data as a coarse disease map in a cellular layout. The left sidebar shows the legend and filter options, and the right sidebar shows the review function, where the supporting sentences from the parsed publications are displayed and the user can validate or reject an interaction. The buttons on the bottom left show the timeline option, where the interaction data can be filtered by date of publication.

The colouring of the edges reflects the categorization of the verbs found in the supporting sentences. All “activating” edges are coloured green, “inhibiting” edges are coloured red, “neutral” edges are coloured blue, and “undefined” edges have a grey colour, while incoherent interactions are shown in brown. The left sidebar shows the legend and filter options for the edges in the graph. As a default, all edges are displayed, but the user can uncheck types of edges to hide them and thus obtain a better overview of the remaining categories of edges. This legend can be opened and closed by clicking the top button “hide/show filter”.

Another way the data from the text mining are categorized is by the thickness of the edges in the graph. The more distinct publications mention both connected nodes in the same sentence, the thicker the edge between them. In the bottom-left corner of the filter window, the user can filter the edges depending on the number of supporting publications. The slider can be moved to define the minimum number of publications an edge needs in order to be displayed. Moreover, below the slider is a button that will reset the filter and reload the map.

In order to integrate expert knowledge and validate text-mined data, we included a review function, shown in the right-hand panel of the interface. The user can examine all interactions with two methods: by clicking the “Next edge” button to iterate through all interactions that need to be reviewed or by directly selecting a specific edge from the graph. The review panel will then display the two nodes connected by the clicked edge and the colour of the edge between both, as well as the current review status of the interaction. Below this, a list of PubMed IDs is displayed together with the sentences that have been used to identify the interaction in each reference. The verbs that have been used to categorize the interaction are coloured in red.
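The colouring, thickness, and filtering rules described above can be summarized in a short sketch. The colour mapping follows the text; everything else is an assumption made for illustration, including the edge dictionary layout, the function name `style_edges`, the placeholder publication IDs, and the choice to use the publication count directly as the edge width.

```python
# Colour per interaction category, as described in the text.
EDGE_COLOURS = {
    "activating": "green",
    "inhibiting": "red",
    "neutral": "blue",
    "undefined": "grey",
    "incoherent": "brown",
}


def style_edges(edges, min_publications=1, hidden_categories=()):
    """Filter edges and attach a colour and a width to each remaining edge."""
    styled = []
    for edge in edges:
        # Count distinct supporting publications only.
        support = len(set(edge["pmids"]))
        if support < min_publications or edge["category"] in hidden_categories:
            continue
        styled.append(
            {
                **edge,
                "colour": EDGE_COLOURS.get(edge["category"], "grey"),
                "width": support,  # assumed scaling: one unit per publication
            }
        )
    return styled


# Hypothetical edges with placeholder publication IDs.
example_edges = [
    {"source": "ProteinA", "target": "ProteinB", "category": "activating",
     "pmids": ["P1", "P2", "P2"]},
    {"source": "ProteinB", "target": "ProteinC", "category": "inhibiting",
     "pmids": ["P3"]},
]
visible = style_edges(example_edges, min_publications=2)
print(visible)
```

With a minimum of two publications, only the first edge remains, since its three mentions come from two distinct publications and the second edge has a single supporting publication.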
The user can then load the entire text to obtain more context for the sentence. With all available data on hand, the user can review the interaction and assign it a status. If the expert approves the text-mined interaction, the “accept” status can be selected. If the text-mined interaction is a false positive, the “decline” status is appropriate, and if more research needs to be conducted to approve the interaction, the “further inspection needed” status can be assigned.

To view the status of the review process, the data can be downloaded either as a CSV file with all interactions, their current review status, and the PubMed IDs from which each interaction was text mined, or as a JSON file containing the entire disease map, which can be saved for reloading in a later session or shared with other users.

To show how the viewer operates, we used an individualized text mining workflow to create a sample data set with the use case of cystic fibrosis, based on the CFTR Lifecycle Map we previously curated [24]. The disease map viewer, installation instructions, and the exemplary cystic fibrosis data set are available under https://s.gwdg.de/8bK6f5.
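The review statuses and the CSV export described above could be modelled roughly as follows. This is a sketch under stated assumptions, not the viewer's actual export code: the three status names come from the text, while the column layout, the "not reviewed" default, the semicolon separator for PubMed IDs, and the function names are hypothetical.

```python
import csv
import io

# The three review statuses named in the text.
REVIEW_STATUSES = ("accept", "decline", "further inspection needed")


def set_status(interaction, status):
    """Assign a review status, rejecting anything outside the known set."""
    if status not in REVIEW_STATUSES:
        raise ValueError(f"unknown review status: {status!r}")
    interaction["status"] = status
    return interaction


def export_reviews(interactions):
    """Write interactions, status, and supporting PubMed IDs as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "status", "pmids"])
    for item in interactions:
        writer.writerow(
            [
                item["source"],
                item["target"],
                item.get("status", "not reviewed"),  # assumed default label
                ";".join(item["pmids"]),
            ]
        )
    return buffer.getvalue()


# Hypothetical interactions with placeholder publication IDs.
reviewed = [
    set_status({"source": "ProteinA", "target": "ProteinB", "pmids": ["P1", "P2"]},
               "accept"),
    {"source": "ProteinB", "target": "ProteinC", "pmids": ["P3"]},
]
report = export_reviews(reviewed)
print(report)
```

Validating the status on assignment keeps the exported file limited to the statuses the review panel offers, plus the assumed placeholder for interactions that have not been looked at yet.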