Community-Specific Overview of Knowledge Graph Research

Please note this is a comparison between Version 1 by Mayank Kejriwal and Version 3 by Beatrix Zheng.

Knowledge graphs (KGs) have rapidly emerged as an important area in AI over the last ten years. Building on a storied tradition of graphs in the AI community, a KG may be simply defined as a directed, labeled, multi-relational graph with some form of semantics. In part, the rise of KGs has been fueled by the increased publication of structured datasets on the Web and by well-publicized successes of large-scale projects such as the Google Knowledge Graph and the Amazon Product Graph. However, another factor that is less discussed, but which has been equally instrumental in the success of KGs, is the cross-disciplinary nature of academic KG research. Arguably, because of the diversity of this research, a synthesis of how the different KG research strands tie together could help enable more ‘moonshot’ research and large-scale collaborations.

knowledge graphs
applications
natural language processing
semantic web
data mining
knowledge representation
graph databases

1. Introduction

With accelerating growth of the Web over the 2000s, and the rise of both e-commerce and social media, knowledge graphs (KGs) have emerged as important models for representing, storing and querying heterogeneous pieces of data that have some relational structure between them, and that typically have real-world semantics ^[1][26]. The semantics are closely associated with the domain for which the KG has been designed ^[2][27]. A formal way to define such a domain, favored in the Semantic Web (SW) community, is through an ontology ^[3][28]. The most common definition of a KG is that it is a directed graph where both edges and nodes have labels. Nodes are considered to be entities, ranging from everyday entities such as people, organizations and locations to highly domain-specific entities such as proteins and viruses (assuming the domain is a biological one). Edges, also known as properties or predicates, represent either relations between entities (e.g., an ‘employed_at’ relation between a person and organization entity) or an attribute of an entity (e.g., a person’s date of birth), typically represented as a literal. Edges and nodes may also be used to represent an entity’s attribute (e.g., the ‘date_of_birth’ of a person entity) and the attribute’s value (e.g., ‘1970-01-01’), respectively. Even definitionally, diversity is observed in KG research. For example, the SW community makes formal distinctions between the two uses of nodes and edges mentioned above, while others, such as NLP, are less formal. (Within SW, nodes representing entities and attribute values are generally referred to as ‘resources’ and ‘literals’, respectively. Similarly, edges representing entity-relations and attributes are, respectively, referred to as ‘object properties’ and ‘datatype properties’.) An illustrative KG fragment from the tourism domain is visualized in Figure 1. 
The fragment contains both the actual KG fragment (called the A-Box) and the concepts (nodes shaded in orange) that are part of the T-Box or ontology that models the domain of interest. Put differently, concepts are the types or classes of entities allowable in the domain. Another important aspect of the domain is the set of allowable edge-labels (called properties or predicates) and the constraints associated with them. For example, the ‘employed_at’ relation can be constrained to only map from an entity of type ‘Person’ to an entity of type ‘Organization’. Formally, ‘Person’ and ‘Organization’ would be declared as the allowable domain and range of the predicate ‘employed_at’, similar to a functional constraint in mathematics. The ontology can also have other axioms and constraints. (An intuitive example is a cardinality constraint, e.g., the requirement can be imposed that a ‘married_to’ predicate can be linked to at most one entity-object.) A special predicate called rdf:type serves as an explicit bridge between the A-Box and the T-Box by declaring an entity’s type (which, by definition, is in the T-Box).
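The domain, range and cardinality constraints described above can be illustrated with a minimal sketch. All entity and predicate names below are hypothetical, and the sketch treats the declarations as simple closed-world checks, which is a simplification of how ontology reasoners actually interpret such axioms.

```python
# Toy T-Box: allowed domain and range per predicate, plus an optional cardinality limit.
# All names are hypothetical illustrations, not a standard vocabulary.
TBOX = {
    "employed_at": {"domain": "Person", "range": "Organization"},
    "married_to": {"domain": "Person", "range": "Person", "max_objects": 1},
}

# Toy A-Box: (subject, predicate, object) triples.
# rdf:type links each entity to a concept in the T-Box.
ABOX = [
    ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Person"),
    ("carol", "rdf:type", "Person"),
    ("acme", "rdf:type", "Organization"),
    ("alice", "employed_at", "acme"),
    ("acme", "employed_at", "bob"),      # violates domain and range
    ("alice", "married_to", "bob"),
    ("alice", "married_to", "carol"),    # violates the cardinality limit
]


def find_violations(tbox, abox):
    """Return (subject, predicate, object, kind) for each constraint violation."""
    # Collect the declared types of every entity.
    types = {}
    for s, p, o in abox:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)

    violations = []
    objects_seen = {}
    for s, p, o in abox:
        rule = tbox.get(p)
        if rule is None:
            continue  # no constraints declared for this predicate
        if rule["domain"] not in types.get(s, set()):
            violations.append((s, p, o, "domain"))
        if rule["range"] not in types.get(o, set()):
            violations.append((s, p, o, "range"))
        objects_seen.setdefault((s, p), set()).add(o)
        limit = rule.get("max_objects")
        if limit is not None and len(objects_seen[(s, p)]) > limit:
            violations.append((s, p, o, "cardinality"))
    return violations


violations = find_violations(TBOX, ABOX)
```

In this toy A-Box, the triple with ‘acme’ as the subject of ‘employed_at’ is flagged twice (domain and range), and the second ‘married_to’ triple is flagged for exceeding the cardinality limit.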

Figure 1. A knowledge graph (KG) fragment. Concepts (that typically belong in the T-Box) are shaded in orange. Links in the figure were accessed on 17 March 2022.

Per the brief formalism above, the semantics of the KG are provided by the ontology itself, in conjunction with a reasoning engine that (in principle) can detect when the KG is violating the ontology in some way. However, while this formalism is among the most mature in the AI community for expressing, codifying and manipulating the semantics of domain knowledge, it is not the only way. The NLP, knowledge discovery and database communities have much more lightweight and implicit notions of an ontology (usually called a ‘schema’ in academic work, if mentioned explicitly at all).

2. Community-Specific Overview of KG Research

Given that different aspects of KG research are prioritized in different communities, an important component of this entry is to first review the main research priorities (as pertinent to KGs) within those communities. The treatment herein does not imply exclusivity; for example, information extraction (IE), which is predominantly researched in NLP, has also witnessed interesting research in knowledge discovery and SW ^[4][5][60,61]. However, an attempt is made to capture the norms and priorities of the overall community to a reasonable extent. One manner in which this attempt was made systematically was to consider the tutorials, workshops and demonstrations published in the top conferences covering these sub-fields over the last 5 years, including the International Semantic Web Conference (ISWC), the Knowledge Discovery and Data Mining (KDD) conference, the Association for Computational Linguistics (ACL), the Web Conference (WebConf; formerly known as the World Wide Web Conference) and core machine learning conferences, such as NeurIPS, International Conference on Learning Representations (ICLR) and International Conference on Machine Learning (ICML). In all of these conferences, there was at least one tutorial, and multiple workshops and demonstrations involving an important aspect of KG research. Some recent (non-exhaustive) examples of such workshops include Heterogeneous Graph Deep Learning and Applications (KDD 2021), Mining Knowledge Graph for Deep Insights (KDD 2020), International Workshop on Semantic Evaluation (ACL 2021) and Workshop on Deep Learning for Knowledge Graphs (ISWC 2021). In short, only those communities where substantial KG-related research has been published, demonstrated or otherwise promoted (e.g., through tutorials and workshops) to date are considered. A good example of an important AI community that would not meet this condition is Computer Vision.
Although some KG research has been published in Computer Vision ^[6][62], including the construction of multimodal KGs ^[7][52], the number of KG-related publications is still relatively small compared to the other communities that are covered in this section. Finally, it bears noting that, because KG research is rapidly advancing as a field, some of the areas discussed below may become less relevant for presenting advances in KG research, and others (not currently discussed in depth, such as computer vision) may gain in importance. Hence, this selection of areas should be interpreted as being only quasi-objective and subject to change even in the near future.

2.1. Natural Language Processing (NLP)

KG research can trace its origins to at least two different research areas (NLP and the Semantic Web, which is re-visited subsequently). Within NLP, KGs first emerged as a result of progress in the domain of information extraction (IE), starting from the 1990s with the institution of the Message Understanding Conferences ^[8][63]. The majority of IE research published over the last three decades has involved either named entity recognition (NER) or relation extraction (RE). Good surveys on the former include work by ^[9][10][64,65] (the second of which focuses on deep learning methods), while ^[11][12][66,67] provide a recent, comprehensive survey on the latter.

Since RE research has almost always involved binary relations (where the relation is assumed to exist between a pair of entities), extracted relations and entities can be modeled as triples and placed into (what has been traditionally denoted as) a knowledge base (KB). Prior to the growth of the Web, there was no reason to model these KBs as graphs. Connections between entities became more apparent and important both when the same entity started getting extracted from multiple documents and (much later) when it was discovered that the structural properties of the KB, such as entity and relation co-occurrence features, could lead to improved performance on related tasks such as entity linking ^[13][68]. Entity linking is the problem of automatically linking an extracted entity to its equivalent in an agreed-upon ‘canonical’ KB like Wikipedia ^[14][69]. To take a simple example of the utility of a structural feature like co-occurrence, suppose that both ‘V. Williams’ and ‘Wimbledon’ were extracted from a single document. If the entity extraction system attempts to link these two extractions to Wikipedia independently, it becomes difficult to decide whether V. Williams refers to Venus Williams (the tennis player) or Vanessa Williams (the actress), and also whether Wimbledon refers to the tennis grand slam tournament of the same name or Wimbledon, London (where the championships are held, but which is technically different from the event itself). Co-occurrence helps resolve this ambiguity: when the two extractions are linked jointly rather than independently, the fact that they appear together favors Venus Williams and the tennis tournament. More complex features help improve performance even further, and a similar philosophy would also apply to related tasks such as co-reference resolution ^[15][70], which is the problem of determining when words and phrases (including pronouns) refer to a unique entity.
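The ‘V. Williams’ and ‘Wimbledon’ example can be sketched as a small joint-linking procedure. The candidate names, prior scores and coherence scores below are made-up values chosen only for illustration; an actual entity linking system would estimate them from data and would rarely enumerate all assignments exhaustively.

```python
from itertools import product

# Hypothetical candidate entities per mention, with made-up prior scores.
# The priors are tied, so linking each mention independently cannot decide.
CANDIDATES = {
    "V. Williams": {"Venus_Williams": 0.5, "Vanessa_Williams": 0.5},
    "Wimbledon": {"Wimbledon_Championships": 0.5, "Wimbledon,_London": 0.5},
}

# Hypothetical coherence scores between candidate pairs, standing in for
# co-occurrence statistics derived from the canonical KB.
COHERENCE = {
    frozenset({"Venus_Williams", "Wimbledon_Championships"}): 0.9,
    frozenset({"Venus_Williams", "Wimbledon,_London"}): 0.3,
    frozenset({"Vanessa_Williams", "Wimbledon_Championships"}): 0.1,
    frozenset({"Vanessa_Williams", "Wimbledon,_London"}): 0.1,
}


def link_jointly(candidates, coherence):
    """Pick the assignment maximizing the sum of priors and pairwise coherence."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(candidates[m][e] for m, e in zip(mentions, assignment))
        for i in range(len(assignment)):
            for j in range(i + 1, len(assignment)):
                pair = frozenset({assignment[i], assignment[j]})
                score += coherence.get(pair, 0.0)
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best


links = link_jointly(CANDIDATES, COHERENCE)
```

With these illustrative scores, the joint assignment selects the tennis player and the tournament, which the tied priors alone could not distinguish.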

From the perspective of KG research, IE, entity linking and other problems such as co-reference resolution, all play a vital role because they ultimately lead to a higher-quality initial KG. If two extractions, such as ‘V. Williams’ and ‘Venus Williams’, can indeed be linked to the same Wikipedia entry, for example, then they can be modeled as a single node in the KG. Good co-reference resolution can help add more data to the KG (e.g., more facts and relations). For these reasons, and also because of other applications that have arisen over the years (such as question answering ^[16][71]), improving performance through the design of more sophisticated algorithms and representation learning techniques has always been an important goal in the community. IE problems such as Open IE and event extraction continue to pose challenges ^[17][18][72,73].

2.2. Semantic Web

Earlier, the concepts of the A-Box and the T-Box were briefly introduced. These notions are primarily inspired by description logics, which have heavily influenced KG research in the SW community ^[19][74]. For example, ^[20][75] describe how description logics serve as ontology languages for the semantic web. However, in the broader community, modeling and representing KGs is only one part of the equation. An equally important goal is to devise better ways of publishing, linking and accessing this data on the Web. According to a seminal article by ^[21][76], the Semantic Web is fundamentally an effort to transform the Web by ‘augmenting Web pages with data targeted at computers’.

With the advent of a movement called Linked Data ^[22][77], KGs modeled in formal graph-friendly languages like Resource Description Framework (RDF) started becoming more common on the Web ^[23][78], although they are still dwarfed by the volume of natural language text. The KG fragment that was illustrated earlier in Figure 1 is an RDF graph. Data are represented as a set of triples of the form (subject, predicate, object), intuitively representing a directed edge in the graph, where the subject and predicate must be uniform resource identifiers or URIs (and are typically just uniform resource locators for actual datasets), while the object may be a URI or a literal. (Technically, they must be internationalized resource identifiers, which subsume URIs.)
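The shape of a triple as described above (subject and predicate are URIs, object is a URI or a literal) can be made concrete with a short sketch. The `Literal` wrapper, the helper names and the `http://example.org/` namespace are assumptions made for illustration and are not part of any RDF library; the check also only accepts HTTP(S) URIs, which is narrower than what RDF permits.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Literal:
    """Hypothetical wrapper that marks a value as a literal rather than a URI."""
    value: str


def looks_like_http_uri(term):
    """Rough check for an absolute HTTP(S) URI (a simplification)."""
    if not isinstance(term, str):
        return False
    parts = urlparse(term)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_well_formed(triple):
    """Subject and predicate must be URIs; the object may be a URI or a literal."""
    s, p, o = triple
    return (
        looks_like_http_uri(s)
        and looks_like_http_uri(p)
        and (isinstance(o, Literal) or looks_like_http_uri(o))
    )


EX = "http://example.org/"  # hypothetical namespace
triples = [
    (EX + "alice", EX + "employed_at", EX + "acme"),               # URI object
    (EX + "alice", EX + "date_of_birth", Literal("1970-01-01")),   # literal object
    (Literal("alice"), EX + "employed_at", EX + "acme"),           # literal subject
]
checks = [is_well_formed(t) for t in triples]
```

The first two triples pass the check, while the third is rejected because a literal appears in the subject position.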

Linked Data are defined as a set of four best practices (https://www.w3.org/wiki/LinkedData, accessed on 17 March 2022) for publishing ‘structured data’ (that are, by and large, KGs) on the Web: (i) use URIs as names for things, (ii) use HTTP URIs to enable people to look up those names, (iii) provide useful information when a person looks up a URI and (iv) include links to other URIs to enable greater discoverability ^[22][77]. Linked Open Data started in 2007 with only a handful (<10) of datasets and has since grown to hundreds of datasets in recent years ^[24][79], spanning domains as varied as social media ^[25][26][80,81], biology and life sciences ^[27][28][82,83], and computational linguistics ^[29][30][84,85]. The fourth principle, in particular, has made this possible, since without it, different datasets obeying the other three Linked Data principles may still have been siloed. Both classic and recent research in the 50-year-old problem of entity resolution (ER) has made automatic linking of equivalent entities in independent datasets to one another (even at the Web scale, e.g., the author’s previous work on entity name systems ^[31][86]) much more feasible ^[32][33][29,87].

Other research priorities in SW include the development of efficient KG querying infrastructures, such as triplestores ^[34][88]. Recently, such triplestores (along with the related technology of graph databases, which has been a subject of heavy research in the core Database research community, as subsequently detailed) have also started gaining prominence, with at least one major cloud service (Amazon Neptune) available for them ^[35][89]. Another paradigm that has recently been proposed for data integration and access is the Virtual Knowledge Graph (VKG) paradigm. This paradigm is inspired by the literature on Ontology-Based Data Access (OBDA), which is a well-known problem in the Semantic Web community. The key difference between VKGs and OBDA is that the former replaces the rigidly structured tables that are a key feature of the latter with flexible graphs. Similar to OBDA, however, the graphs do not have to be ‘materialized’ but can be maintained as a virtual layer and used to capture and represent domain knowledge. A comprehensive overview of systems and use-cases for VKGs is provided in ^[36][90].
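The notion of a graph that is kept as a virtual layer rather than being materialized can be sketched as follows. The table contents, the mapping and the function names are hypothetical, and the sketch simply generates triples on demand; it is not how any particular VKG or OBDA system is implemented, and such systems may answer queries quite differently (for instance, by translating them into queries over the underlying source).

```python
# Hypothetical relational rows, standing in for a table in a source database.
EMPLOYEES = [
    {"id": 1, "name": "Alice", "employer": "Acme"},
    {"id": 2, "name": "Bob", "employer": "Globex"},
]


def virtual_triples(rows):
    """Lazily yield triples from rows; the graph is never stored anywhere."""
    for row in rows:
        subject = f"person/{row['id']}"
        yield (subject, "name", row["name"])
        yield (subject, "employed_at", f"org/{row['employer']}")


def match(pattern, triples):
    """Return triples matching a pattern in which None acts as a wildcard."""
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


# Ask "who is employed at Acme?" against the virtual graph.
result = match((None, "employed_at", "org/Acme"), virtual_triples(EMPLOYEES))
```

Because `virtual_triples` is a generator, the triples exist only while a query is being answered, which mirrors the idea that the graph is a view over the underlying data.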

2.3. Core Machine Learning: Representation Learning and Probabilistic Graphical Models

Representation learning and probabilistic graphical models (the best-known examples of the latter being Markov logic networks and Bayesian networks ^[37][38][39][40][91,92,93,94]) have played an equally important role in recent KG research. Representation learning is the more recent phenomenon, beginning with the structured embedding paper by ^[41][95] and followed by influential architectures such as TransE, ConvE and the neural tensor network. Several surveys of such KG-embedding approaches have been published, examples being ^[42][43][44][96,97,98]. The basic purpose of these methods is to ‘embed’ each node and relation in the KG into a dense, continuous, real-valued vector space. Similar to word embeddings, operations such as link prediction can then be optimized in vector space. In recent years, KG representation learning and refinement have also become popular in other KG communities, such as SW ^[45][99], natural language processing ^[46][100] and broad AI topics such as commonsense reasoning ^[47][101]. More recent surveys on KG embeddings and representation learning include ^[43][44][97,98]. Beyond surveys, examples of KG applications and algorithms in SW include ^[48][49][50][102,103,104]. Unsurprisingly, the success of these approaches closely mirrors the success of deep learning methods and architectures in related areas. Representation learning has been particularly successful in ‘refining’ KGs by predicting links, detecting incorrect triples and resolving entities.
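To make the idea of link prediction in vector space concrete, the following sketch applies the translation-based scoring used by TransE, in which a triple (h, r, t) is considered plausible when the vector h + r lies close to t. The two-dimensional embeddings are hand-picked for illustration only; in practice they are learned from the KG, and the entity and relation names are hypothetical.

```python
import math

# Hand-picked 2-D embeddings, purely illustrative; real models learn these from data.
ENTITY = {
    "alice": (0.0, 0.0),
    "acme": (1.0, 0.0),
    "globex": (0.0, 1.0),
    "paris": (3.0, 3.0),
}
RELATION = {"employed_at": (1.0, 0.1)}


def transe_distance(head, relation, tail):
    """Euclidean norm of (h + r - t); smaller means a more plausible triple."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def predict_tail(head, relation):
    """Rank all other entities as candidate tails, most plausible first."""
    candidates = [e for e in ENTITY if e != head]
    return sorted(candidates, key=lambda e: transe_distance(head, relation, e))


ranking = predict_tail("alice", "employed_at")
```

Under these toy embeddings, ‘acme’ is ranked first as the tail of (‘alice’, ‘employed_at’, ?), because it lies closest to the translated head vector.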

As KG embeddings have become more advanced, several authors have sought to use other classes of interesting ‘information sets’ with which to obtain higher-quality embeddings. One such type of information is temporal information. Since KG facts can be time-sensitive in some domains (e.g., X co-authored a paper with Y in a given year), the goal is to use time-aware embedding models to further improve KG embeddings ^[51][52][105,106]. One way in which this can be accomplished is by imposing temporal order constraints on time-sensitive relation pairs. Another way is to model the temporal evolution of KGs by using quadruples rather than triples. This kind of representation is especially well suited for medical or sensor-based domains (e.g., Internet of Things). Other kinds of information sets that have inspired similar research in the machine learning community include relation paths, which are designed to incorporate richer context into the relationship between a pair of entities than the ‘single-hop’ relations represented by a single edge in the KG ^[53][54][107,108], and even logical rules. Although the use of such rules, once a staple of expert systems, is more common in communities such as Semantic Web, their use as regularizers when learning better KG embeddings shows the interdisciplinary connections between these fields. Examples of systems that use rules or rule-based constraints to refine KG embeddings include ^[55][56][57][109,110,111].
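The two temporal ideas mentioned above (quadruples and temporal order constraints on relation pairs) can be sketched with plain data structures. The facts, relation names and ordering rules below are hypothetical, and the sketch checks the constraints explicitly; time-aware embedding models would instead incorporate such constraints into training.

```python
# Quadruples extend triples with a timestamp: (subject, predicate, object, year).
QUADS = [
    ("x", "born_in", "lisbon", 1950),
    ("x", "graduated_from", "mit", 1972),
    ("x", "died_in", "rome", 2010),
    ("y", "born_in", "oslo", 1980),
    ("y", "graduated_from", "eth", 1975),  # inconsistent with the birth year
]

# Temporal order constraints on relation pairs: the first must precede the second.
ORDER = [
    ("born_in", "graduated_from"),
    ("graduated_from", "died_in"),
    ("born_in", "died_in"),
]


def order_violations(quads, order):
    """Return (subject, earlier_relation, later_relation) for each broken constraint."""
    # Group timestamps by subject and predicate.
    by_subject = {}
    for s, p, _o, year in quads:
        by_subject.setdefault(s, {}).setdefault(p, []).append(year)

    found = []
    for s, facts in by_subject.items():
        for earlier, later in order:
            for t1 in facts.get(earlier, []):
                for t2 in facts.get(later, []):
                    if t1 >= t2:
                        found.append((s, earlier, later))
    return found


broken = order_violations(QUADS, ORDER)
```

Only the facts about ‘y’ violate an ordering rule here, since the graduation year precedes the birth year.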

The application of probabilistic graphical models and probabilistic soft logic (PSL) to problems like link prediction predates representation learning by several years ^[58][59][2,112]. PSL is well suited for large-scale KGs because its optimization is convex. A particularly interesting use case is knowledge graph identification (KGI), wherein the confidence-annotated outputs of tasks like IE and ER (the ‘initial’ KG) are fed into a PSL program, along with ontological constraints ^[60][113]. The output of the program is a much cleaner KG. The advantage of PSL is that it is able to incorporate a combination of domain knowledge and probabilistic reasoning to ‘identify’ the true KG. Results have been promising. The possible synergy of such probabilistic models with representation learning is an interesting avenue for future research.
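A small numerical sketch may help explain why soft-logic rules lead to a convex objective. PSL relaxes Boolean truth values to the interval [0, 1] using the Lukasiewicz logic operators, and penalizes a rule by its ‘distance to satisfaction’, which is a hinge function of the truth values. The rule and the confidence values below are hypothetical, and the function names are illustrative rather than taken from any PSL implementation.

```python
def lukasiewicz_and(*truths):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truths) - (len(truths) - 1))


def distance_to_satisfaction(body, head):
    """Hinge penalty for the rule body -> head; zero when the rule is satisfied."""
    return max(0.0, lukasiewicz_and(*body) - head)


# Hypothetical rule:
#   employed_at(P, O) & subsidiary_of(O, Q) -> affiliated_with(P, Q)
# The body values stand in for confidences attached to extracted facts.
body = [0.9, 0.8]

penalty_low = distance_to_satisfaction(body, head=0.2)   # head too weakly believed
penalty_high = distance_to_satisfaction(body, head=0.9)  # rule satisfied
```

The penalty is positive when the head is believed much less strongly than the body and zero once the head is believed at least as strongly. Because each penalty is a hinge function of the truth values, a weighted sum of such penalties is convex, which is consistent with the scalability property noted above.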

2.4. Databases, Data Mining, and Knowledge Discovery in Databases (KDD)

Although distinct from the SW or NLP communities, the knowledge discovery in databases (KDD) and data mining communities have also had a significant influence on KG research in the last 5 years. KGs have been used in innovative applications, including recommender systems ^[61][62][114,115]. One reason that KGs can improve recommender systems’ performance is their ability to provide useful external knowledge, especially when combined with deep learning. Gao et al. provide a survey on deep learning on KGs for recommender systems ^[63][116]. They cite the emergence of graph neural networks (GNNs) as an important recent advance in this space ^[64][117]. Using GNNs in tandem with KGs, recommender systems can be adapted to become more knowledge-aware, and in turn, this also helps such systems adapt to problems such as cold-start. In their survey ^[63][116], Gao et al. also cite publicly available open-source code and benchmark datasets (examples of which include ^[65][66][118,119]), showing that the ecosystem is starting to mature, making it more likely that these algorithms will be adopted and refined by independent developers (and possibly, smaller companies that may not have a significant research and development budget) in the near future. Although the use of external knowledge and even taxonomies is not novel in this space ^[67][68][69][70][71][72][73][120,121,122,123,124,125,126], KGs have historically been difficult to work with due to both scale and noise. GNNs present a robust solution to the problem ^[74][127].
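The intuition for why KG knowledge helps with cold-start can be sketched with a single round of neighborhood aggregation, which is the basic step that many GNNs build on. Everything below is an assumption made for illustration: the item and entity names, the two-dimensional features, and the use of an unweighted mean with no learned parameters, which is far simpler than any actual GNN-based recommender.

```python
# Hypothetical 2-D features for KG entities linked to items (e.g., genre, director).
ENTITY_VEC = {
    "sci_fi": (1.0, 0.0),
    "drama": (0.0, 1.0),
    "director_a": (0.8, 0.2),
}

# KG edges from items to related entities; "new_film" has no user interactions yet.
ITEM_NEIGHBORS = {
    "old_film": ["sci_fi", "director_a"],
    "new_film": ["sci_fi", "director_a"],
    "other_film": ["drama"],
}


def aggregate(item):
    """Represent an item as the mean of its KG neighbors' feature vectors."""
    vecs = [ENTITY_VEC[n] for n in ITEM_NEIGHBORS[item]]
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def dot(u, v):
    """Plain dot product used as a relevance score."""
    return sum(a * b for a, b in zip(u, v))


# A user who liked "old_film" is profiled by that item's KG-derived vector.
user_profile = aggregate("old_film")
scores = {item: dot(user_profile, aggregate(item)) for item in ("new_film", "other_film")}
```

Even though ‘new_film’ has no interaction history, it receives a higher score than ‘other_film’ because it shares KG neighbors with an item the user liked.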

KGs have also been studied under the umbrella of heterogeneous information networks or HINs ^[75][128]. The HIN model resembles a KG and it is also a directed graph, but the schema (called a network schema ^[76][129]) is less formal than the ontologies that are commonly found in the SW community. HINs have found applications in many of the domains that KGs have, including social media, healthcare and bibliographic domains ^[76][129]. To take healthcare as an example, Ding et al. ^[77][130] propose considering a biological system to be a ‘complex HIN’ that can be used to explore heterogeneous and complicated relationships between biological entities such as molecules to study distinct phenotypes. This treatment of HINs is reminiscent of domain-specific KGs, especially in biology and medicine (including recently proposed KGs for COVID-19) ^[78][27][53,82]. HINs have also been applied to recommender systems ^[79][131], as well as for tasks such as sentiment link prediction and learning structure-aware embeddings ^[80][81][132,133].

Last but not least, because efficient querying is an important problem in KG research ^[82][134], techniques developed by the database community, especially in query reformulation and graph databases, have also been influential ^[83][84][135,136]. Indeed, as argued in a synthesis lecture series on querying graphs ^[85][137], executing queries on modern graph database systems involves a ‘complete lifecycle’ of processing, with relevant topics of research including graph data models and query languages, graph constraints, query specification and formulation, and query processing. Many challenges remain outstanding in the community, including defining schemas for property graphs, understanding graph representations in a comprehensive and comparative framework, understanding and formalizing advanced graph query optimization techniques, and efficiently evaluating certain classes of queries. These topics are directly relevant to building, maintaining and optimizing KG access (which is fundamentally a graph querying problem), and they continue to be explored in the database community (in particular), with recently published work including ^[86][87][88][89][138,139,140,141].
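To illustrate why KG access is fundamentally a graph querying problem, the sketch below evaluates a conjunctive query (a list of triple patterns that share variables) over a small set of triples. The data, the ‘?’ variable convention and the function names are assumptions made for illustration; the naive nested-loop evaluation shown here is exactly what the query processing and optimization techniques discussed above aim to improve upon.

```python
# Hypothetical KG fragment as (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "employed_at", "acme"),
    ("bob", "employed_at", "acme"),
    ("carol", "employed_at", "globex"),
    ("acme", "located_in", "berlin"),
    ("globex", "located_in", "paris"),
]


def is_var(term):
    """Terms starting with '?' are treated as query variables."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend the binding so that the pattern matches the triple, or return None."""
    new = dict(binding)
    for want, got in zip(pattern, triple):
        if is_var(want):
            if new.setdefault(want, got) != got:
                return None  # variable already bound to a different value
        elif want != got:
            return None
    return new


def query(patterns, triples):
    """Evaluate the patterns one by one, joining on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new = match_pattern(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings


# Who is employed at an organization located in Berlin?
answers = query(
    [("?p", "employed_at", "?o"), ("?o", "located_in", "berlin")],
    TRIPLES,
)
people = sorted(b["?p"] for b in answers)
```

The shared variable `?o` joins the two patterns, so only people whose employer is located in Berlin are returned.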