Integrating text-mining into the curation of disease maps


Please note this is a comparison between Version 2 by Liza Vinhoven and Version 1 by Liza Vinhoven.

Text mining algorithms can be used to analyse the natural language of scientific publications. These types of algorithms can take human-readable text passages and convert them into a more ordered, machine-usable data structure. To support the automated creation of systems medicine models such as disease maps by text mining, an interactive, user-friendly disease map viewer was developed. It sits at the interface between computational text mining and the manual expert creation of disease maps. The disease map viewer displays text mining results in a systems biology map, where the user can review them and either validate or reject identified interactions. Ultimately, the viewer brings together the time-saving advantages of text mining with the accuracy of manual data curation. The disease map viewer, installation instructions, and the exemplary cystic fibrosis data set are available under https://s.gwdg.de/8bK6f5.

Text mining
disease map
systems biology

1. Introduction

In light of the rapidly increasing data and knowledge on diseases and their underlying biological pathways, it is becoming more and more essential to integrate, store, and visualize them. In order to analyse and interpret the data efficiently, it is important that these knowledge representations are human- as well as machine-readable. For this purpose, approaches such as systems medicine disease maps have been gaining importance over the last years. Disease maps were proposed by Mazein et al. and are defined as "comprehensive, knowledge-based representations of disease mechanisms" ^[1]. They are based on systems biology models and written in the Systems Biology Graphical Notation ^[2], but combine regulatory networks, metabolic and signalling pathways, as well as extensions such as different phenotypes. These disease maps can be used for a range of applications, such as identifying disease biomarkers and drug targets, drug repositioning, structuring omics data, and developing improved diagnostics ^[1][3]. The largest disease map to date is the COVID-19 disease map ^[4]. It was created by over 230 researchers and consists of 42 diagrams with a total of 5499 elements, connected by 1836 interactions, which were curated from 617 publications and preprints. This highlights the sheer time and manpower required to manually curate these valuable knowledge resources. One way to support the construction of disease maps is text mining. Text mining refers to the automated annotation of human-written texts to extract their information and bring it into a human- and machine-readable format, thereby speeding up the curation and annotation process ^[5]. Many techniques are applicable to this end, for example machine learning, pattern matching, or natural language processing ^[6].

In general, a text mining algorithm will follow the steps below:

1. As an input, the algorithm takes a human-readable sentence, in this case from a biological paper. It first highlights the named entities (NEs), which are terms that are later normalized and transformed into identifiers. These NEs can be proteins, genes, diseases, or any other biologically relevant term, taken from an underlying database that contains the NEs the system should be able to identify.

2. The entities are then assigned to unique identifiers, which are then organized into an identifier scheme.

3. Afterward, the relationships extracted from the input text are added between the named entities.

The resulting network of nodes and relationships can then be compared with and expanded by additional text data. With the help of this network, new hypotheses can be formed, which can then be the subject of further research ^[6].
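The three steps above can be illustrated with a deliberately simple sketch. It is not the text mining system used for the viewer: it assumes a tiny hand-written dictionary (`ENTITY_DICTIONARY`), placeholder identifiers, and a hypothetical verb list (`VERB_CATEGORIES`), and it links entities purely by co-occurrence in one sentence.

```python
import re
from itertools import combinations

# Hypothetical mini-dictionary mapping surface forms to identifiers.
# The identifiers are placeholders, not entries from a curated database.
ENTITY_DICTIONARY = {
    "cftr": "GENE:CFTR",
    "hsp70": "GENE:HSP70",
    "cystic fibrosis": "DISEASE:CF",
}

# Hypothetical verb list used to label a relationship.
VERB_CATEGORIES = {
    "activates": "activating",
    "stabilizes": "activating",
    "inhibits": "inhibiting",
    "binds": "neutral",
}


def find_entities(sentence):
    """Steps 1 and 2: find dictionary terms and normalize them to identifiers."""
    lowered = sentence.lower()
    hits = []
    for term, identifier in ENTITY_DICTIONARY.items():
        match = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        if match:
            hits.append((match.start(), identifier))
    # Sort by position so that entities keep their order in the sentence.
    return [identifier for _, identifier in sorted(hits)]


def find_category(sentence):
    """Return the category of the first known verb, or 'undefined'."""
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in VERB_CATEGORIES:
            return VERB_CATEGORIES[word]
    return "undefined"


def extract_relations(sentence):
    """Step 3: connect every pair of co-occurring entities with one edge."""
    entities = find_entities(sentence)
    category = find_category(sentence)
    return [(a, category, b) for a, b in combinations(entities, 2)]


relations = extract_relations("Hsp70 binds CFTR in the endoplasmic reticulum.")
print(relations)
```

Real systems replace the dictionary lookup with trained named entity recognizers and the co-occurrence rule with proper relation extraction, but the resulting node and edge structure has the same shape.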

Even though great strides have been made in recent years in the development of text mining algorithms with high sensitivity and specificity, they cannot yet replace a human expert curator. Therefore, we developed a tool that brings together the speed of text mining and the accuracy of expert knowledge to support the creation of systems medicine disease maps.

Our tool consists of an interactive disease map viewer, which takes the output of independent text mining algorithms, translates it to the required format, and displays it in a disease map-like cellular layout. As the disease map viewer is a stand-alone tool, the user is able to utilize the text mining approach they find most suitable for their use case or even include results from more than one system. The user can then examine the interactions identified by the text mining algorithm and evaluate them based on the text passages they were extracted from. In the end, this results in a list of automatically parsed but expert-validated interactions, which can then be used as a basis for a disease map. Ultimately, this simplifies and significantly speeds up the curation step during the construction of disease maps.

2. Application

The Disease Map Viewer

The required input data for the disease map viewer is biological interaction data parsed by a text mining algorithm from publicly available scientific text. The results have to be formatted in two simple CSV files, one containing the interactions between the entities and the other specifying the subcellular localization of each biological entity. A flowchart of the input data, software, and output data of the system can be seen in Figure 1.

Figure 1. Flowchart of the processes included in the tool. Input knowledge and data are shown in green on the right, the software modules are shown in yellow, and the output files are shown in blue on the right. Two CSV files, one containing the list of interactions and one containing the subcellular localisation of the entities, serve as input for the CytoscapeJSON parser implemented in Python. The resulting JSON file serves as input for the disease map viewer, where the interactions are validated by expert knowledge. The validated interactions can then be exported in a cellular layout in a JSON file or as a list of interactions in a CSV file.

To prepare text mining results that are easy to store, share, and use, we used a Python script to convert them from simple CSV files to JSON format. Simply put, the JSON data structure of the text mining results is a list of every element (nodes, compartments, and edges) in the disease map. The Cytoscape.js library uses this SBML-based JSON format to create the graphical SBGN map.

The interface is built around the Cytoscape.js instance that renders and displays disease maps, so that the user can conveniently annotate and review the text-mined disease map. Figure 2 shows the interface with exemplary data. The main graph is shown in a cell-like layout, where the user can zoom in and out. The rectangular nodes represent the molecular entities and are localized in the subcellular compartment specified in the JSON file. The arrow-shaped edges represent molecular interactions between them. All entities (genes/proteins and compartments), as well as their respective edges, can be moved freely by dragging to improve structure and visibility according to the user’s needs.
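The CSV-to-JSON conversion described above might look roughly like the following sketch. It is not the parser shipped with the viewer: the column names (`source`, `target`, `category`, `pubmed_id`, `entity`, `compartment`) and the sample rows are assumptions made for illustration. Only the general element layout (a flat list of objects with a `group` and a `data` dictionary, where a `parent` key places a node inside a compound node) follows the Cytoscape.js elements JSON convention.

```python
import csv
import io
import json

# Hypothetical column names and rows; the viewer's real input files may differ.
INTERACTIONS_CSV = """source,target,category,pubmed_id
HSP70,CFTR,neutral,111
HSP70,CFTR,neutral,222
"""
LOCALIZATION_CSV = """entity,compartment
HSP70,cytosol
CFTR,plasma membrane
"""


def build_elements(interactions_file, localization_file):
    """Turn the two CSV tables into a flat Cytoscape.js element list."""
    elements = []
    seen_compartments = set()
    for row in csv.DictReader(localization_file):
        compartment = row["compartment"]
        if compartment not in seen_compartments:
            # Compartments become compound (parent) nodes.
            seen_compartments.add(compartment)
            elements.append({"group": "nodes", "data": {"id": compartment}})
        elements.append(
            {"group": "nodes", "data": {"id": row["entity"], "parent": compartment}}
        )

    # Merge rows describing the same interaction and collect their references.
    edges = {}
    for row in csv.DictReader(interactions_file):
        key = (row["source"], row["target"], row["category"])
        edge = edges.setdefault(key, {
            "group": "edges",
            "data": {
                "id": "|".join(key),
                "source": row["source"],
                "target": row["target"],
                "category": row["category"],
                "pubmed_ids": [],
            },
        })
        edge["data"]["pubmed_ids"].append(row["pubmed_id"])
    elements.extend(edges.values())
    return elements


elements = build_elements(io.StringIO(INTERACTIONS_CSV), io.StringIO(LOCALIZATION_CSV))
print(json.dumps(elements, indent=2))
```

Collecting the PubMed IDs per edge at this stage keeps the number of supporting publications available for later display, for example as edge thickness.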

Figure 2. Interface of the disease map viewer. The large window in the middle shows the text mining data as a coarse disease map in a cellular layout. The left sidebar shows the legend and filter options, and the right sidebar shows the review function, where the supporting sentences from the parsed publications are displayed and the user can validate or reject an interaction. The buttons on the bottom left show the timeline option, where the interaction data can be filtered by date of publication.

The colour of an edge reflects the category of the verbs found in the supporting sentences. All “activating” edges are coloured green, “inhibiting” edges are coloured red, “neutral” edges are coloured blue, and “undefined” edges are coloured grey, while incoherent interactions are shown in brown. The left sidebar shows the legend and filter options for the edges in the graph. By default, all edges are displayed, but the user can uncheck types of edges to hide them and thus obtain a better overview of the remaining categories. This legend can be opened and closed by clicking the top button “hide/show filter”.

The text mining data are also categorized by the thickness of the edges in the graph. The more distinct publications mention both connected nodes in the same sentence, the thicker the edge between them. In the bottom-left corner of the filter window, the user can filter the edges by the number of supporting publications. The slider defines the minimum number of publications an edge needs in order to be displayed. Below the slider is a button that resets the filter and reloads the map.

In order to integrate expert knowledge and validate text-mined data, we included a review function, shown in the right-hand panel of the interface. The user can examine the interactions in two ways: by clicking the “Next edge” button to iterate over all interactions that need to be reviewed or by directly selecting a specific edge in the graph. The review panel then displays the two nodes connected by the selected edge and the colour of the edge between them, as well as the current review status of the interaction. Below this, a list of PubMed IDs is displayed together with the sentences that were used to identify the interaction in each reference. The verbs that were used to categorize the interaction are coloured in red. The user can also load the entire text to obtain more context for the sentence.

With all available data at hand, the user can then review the interaction and assign it a status. If the expert approves the text-mined interaction, the “accept” status can be selected. If the text-mined interaction is a false positive, the “decline” status is appropriate, and if more research needs to be conducted to approve the interaction, the “further inspection needed” status can be assigned. To record the state of the review process, the data can be downloaded either as a CSV file with all interactions, their current review status, and the PubMed IDs from which each interaction was text mined, or as a JSON file containing the entire disease map, which can be saved for reloading in a later session or shared with other users.

To show how the viewer operates, we used an individualized text mining workflow to create a sample data set with the use case of cystic fibrosis, based on the CFTR Lifecycle Map we previously curated [24]. The disease map viewer, installation instructions, and the exemplary cystic fibrosis data set are available under https://s.gwdg.de/8bK6f5.
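The filtering and export behaviour described above can be sketched in a few lines. This is an illustration only and not the viewer's own code, which runs in the browser on top of Cytoscape.js. The colour mapping is taken from the text, while the edge records, field names, and the `;` separator for PubMed IDs are assumptions.

```python
import csv
import io

# Colour per verb category, as described in the text.
EDGE_COLOURS = {
    "activating": "green",
    "inhibiting": "red",
    "neutral": "blue",
    "undefined": "grey",
    "incoherent": "brown",
}

# Hypothetical edge records; the field names are assumptions for illustration.
edges = [
    {"source": "HSP70", "target": "CFTR", "category": "neutral",
     "pubmed_ids": ["111", "222"], "status": "accept"},
    {"source": "CFTR", "target": "PROTEIN_X", "category": "activating",
     "pubmed_ids": ["333"], "status": "further inspection needed"},
]


def visible_edges(edges, shown_categories, min_publications):
    """Apply the category checkboxes and the publication-count slider."""
    return [
        edge for edge in edges
        if edge["category"] in shown_categories
        and len(set(edge["pubmed_ids"])) >= min_publications
    ]


def export_review(edges):
    """Write one CSV row per interaction together with its review status."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "category", "status", "pubmed_ids"])
    for edge in edges:
        writer.writerow([edge["source"], edge["target"], edge["category"],
                         edge["status"], ";".join(edge["pubmed_ids"])])
    return buffer.getvalue()


# Show all categories, but only edges backed by at least two publications.
shown = visible_edges(edges, set(EDGE_COLOURS), min_publications=2)
for edge in shown:
    print(edge["source"], edge["target"], EDGE_COLOURS[edge["category"]])
print(export_review(edges))
```

Counting distinct PubMed IDs rather than sentences mirrors the description that edge thickness depends on the number of distinct supporting publications.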

References

Alexander Mazein; Marek Ostaszewski; Inna Kuperstein; Steven Watterson; Nicolas Le Novère; Diane Lefaudeux; Bertrand De Meulder; Johann Pellet; Irina Balaur; Mansoor Saqi; et al. Systems medicine disease maps: community-driven comprehensive representation of disease mechanisms. npj Systems Biology and Applications 2018, 4, 1-10, 10.1038/s41540-018-0059-y.
Nicolas Le Novère; Michael Hucka; Huaiyu Mi; Stuart Moodie; Falk Schreiber; Anatoly Sorokin; Emek Demir; Katja Wegner; Mirit I Aladjem; Sarala Wimalaratne; et al. The Systems Biology Graphical Notation. Nature Biotechnology 2009, 27, 735-741, 10.1038/nbt.1558.
Marek Ostaszewski; Stephan Gebel; Inna Kuperstein; Alexander Mazein; Andrei Zinovyev; Ugur Dogrusoz; Jan Hasenauer; Ronan M T Fleming; Nicolas Le Novère; Piotr Gawron; et al. Community-driven roadmap for integrated disease maps. Briefings in Bioinformatics 2018, 20, 659-670, 10.1093/bib/bby024.
Marek Ostaszewski; Anna Niarakis; Alexander Mazein; Inna Kuperstein; Robert Phair; Aurelio Orta-Resendiz; Vidisha Singh; Sara Sadat Aghamiri; Marcio Luis Acencio; Enrico Glaab; et al. COVID-19 Disease Map, a computational knowledge repository of virus-host interaction mechanisms. Molecular Systems Biology 2021, 17, e10851, 10.15252/msb.202110851.
Nathan Harmston; Wendy Filsell; Michael P H Stumpf. What the papers say: Text mining for genomics and systems biology. Human Genomics 2010, 5, 17, 10.1186/1479-7364-5-1-17.
Fei Zhu; Preecha Patumcharoenpol; Cheng Zhang; Yang Yang; Jonathan Chan; Asawin Meechai; Wanwipa Vongsangnak; Bairong Shen. Biomedical text mining and its applications in cancer research. Journal of Biomedical Informatics 2013, 46, 200-211, 10.1016/j.jbi.2012.10.007.