Decentralized Federated Learning and Knowledge Graph Embedding: Comparison

Anomaly detection plays a crucial role in data security and risk management across various domains, such as financial insurance security, medical image recognition, and Internet of Things (IoT) device management. Researchers increasingly rely on machine learning to detect and address potential threats and thereby enhance data security.

  • knowledge graph embedding
  • anomaly detection
  • relation

1. Introduction

The development of Internet technology has made digitized data and information easy to transmit and analyze, and the subtle connections between them easier to mine [1]. At the same time, hidden crises and potential risks, such as abnormal data and fraudulent behavior, are mixed in with this information. Whether it is fraud detection in finance, device quality monitoring in the IoT industry, disease diagnosis in healthcare, or intrusion detection in network security, all of these rely on anomaly detection to ensure system reliability and data integrity, and failures in any of these sectors can cause serious economic losses and crises of trust. Therefore, the research and development of effective detection mechanisms for the management and analysis of digitized information has become crucial.
However, anomaly samples remain rare, and traditional auto insurance fraud detection relies directly on manual expert review, which makes fraud detection extremely inefficient [2]. To reduce human error and missed inspections, insurance companies have started to leverage the automated intelligence of machine learning [3]. In addition to using unsupervised learning, semi-supervised learning, and other methods to improve the performance of anomaly detection models, researchers also use the following strategies: (1) Data generation: generating abnormal data through transformation, expansion, or learning from existing data for sample expansion; for example, Zhang et al. utilized MetaGAN, based on a Generative Adversarial Network, to generate images that strengthen sample-level image classification [4] (a minimal data-generation sketch is given below). (2) Transfer learning [5]: learning anomaly detection models on a source-domain dataset and transferring them to the target domain. (3) Active learning [6]: improving model performance by intelligently selecting which samples should be labeled, thereby reducing dependence on labeled data. (4) Cooperative training [7]: collaboration between different data holders to jointly build anomaly detection models, which can also help alleviate data scarcity. Centralized training on pooled data can indeed improve detection performance, but it completely disregards privacy concerns [8]. Especially in the financial insurance industry, where a substantial amount of customer information is involved, data sharing must be carried out under the precondition of privacy protection.
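The data-generation idea in item (1) can be illustrated with a minimal, SMOTE-style interpolation sketch. This is not the MetaGAN approach of [4]; the function name and parameters below are purely illustrative, and the sketch only shows how a small set of labeled anomalies can be expanded before training a detector.

```python
import numpy as np

def smote_like_oversample(anomalies, n_new, k=5, seed=0):
    """Generate synthetic anomaly samples by interpolating between a random
    anomaly and one of its k nearest anomalous neighbours (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(anomalies))
        x = anomalies[i]
        # distances to all anomalies; index 0 of argsort is x itself, skip it
        d = np.linalg.norm(anomalies - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        x_nn = anomalies[rng.choice(neighbours)]
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(x + lam * (x_nn - x))  # point on the segment x -> x_nn
    return np.vstack(synthetic)

# Usage: expand 20 labelled anomalies to 120 samples before training a detector.
anomalies = np.random.default_rng(1).normal(loc=5.0, scale=0.3, size=(20, 8))
augmented = np.vstack([anomalies, smote_like_oversample(anomalies, n_new=100)])
print(augmented.shape)  # (120, 8)
```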

2. Decentralized Federated Learning

Artificial Intelligence is developing steadily, and data remains its underlying technical support. The quantity, quality, and dimensionality of data have become some of the most important factors constraining scientific and technological progress. Because data owners must consider data security, competitive relationships, and legal regulations when exchanging and sharing data, "data silos" arise between enterprises and industries [9]. How to share data safely and effectively has therefore become a popular research topic.
In 2017, Google first proposed and constructed a Federated Learning (FL) framework to realize the idea of updating models locally [10]. The aim was to improve next-word prediction for Android users typing on their mobile devices. Subsequently, a large number of scholars conducted more in-depth research on data security and personalized models. In 2019, Google released TensorFlow Federated, the first FL framework in the world, and in the same year Professor Yang Qiang and his team open-sourced the first FL framework in China, Federated AI Technology Enabler [11], as a secure computing framework to support federated AI systems.
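The core aggregation step of FL can be sketched in a FedAvg-like form: each client trains on its private data, and a central server averages the returned models weighted by local sample counts. This is a minimal sketch, not Google's implementation; a plain linear-regression model stands in for the on-device model, and the function names (`local_update`, `fedavg_round`) are hypothetical.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps of linear regression
    on private data that never leaves the client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """Central-server step: broadcast the global model, collect locally
    trained models, and average them weighted by local sample counts."""
    local_ws, sizes = [], []
    for X, y in clients:
        local_ws.append(local_update(global_w, X, y))
        sizes.append(len(X))
    sizes = np.array(sizes, dtype=float)
    return np.average(local_ws, axis=0, weights=sizes / sizes.sum())

# Usage: three clients with private data, ten federated rounds.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))
global_w = np.zeros(3)
for _ in range(10):
    global_w = fedavg_round(global_w, clients)
print(np.round(global_w, 2))  # close to [ 1. -2.  0.5]
```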
FL can not only break through the "data silos" and "small data" limitations during training, but also ensure a certain degree of data privacy and security while benefitting all participants [12]. Because of this, it has attracted considerable attention from researchers in various fields and has found a wide range of applications, including medical image processing [13], license plate recognition [14], in-air handwriting recognition [15], and so on.
However, because FL depends heavily on the central server, it cannot cope with a single point of failure at that server, and Decentralized Federated Learning (DFL) has emerged in response [16]. Decentralized federated aggregation is achieved through communication and interaction between participants, and this framework has already been applied in several fields. Lu et al. [17] extracted medical patient features more securely by constructing a DFL model that conforms to realistic cooperation. In addition, Kalapaaking et al. [18] combined blockchain with FL, using the traceable and tamper-proof characteristics of blockchain to improve system security; they replaced the central server with a blockchain combined with a Trusted Execution Environment to improve the fault tolerance and attack resistance of the system.
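The decentralized aggregation idea can be illustrated as neighbour-to-neighbour (gossip) averaging over a communication graph. This is a hedged, minimal sketch rather than the protocol of any specific work cited above; local training between rounds is omitted, and the ring topology is only an example.

```python
import numpy as np

def gossip_round(models, adjacency):
    """One decentralized aggregation round: each client averages its model
    with its neighbours' models instead of sending it to a central server."""
    n = len(models)
    new_models = []
    for i in range(n):
        neighbours = [j for j in range(n) if adjacency[i][j]] + [i]  # include self
        new_models.append(np.mean([models[j] for j in neighbours], axis=0))
    return new_models

# Usage: 4 clients on a ring topology converge toward a common model.
rng = np.random.default_rng(0)
models = [rng.normal(size=3) for _ in range(4)]
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]])
for _ in range(20):
    models = gossip_round(models, ring)
print(np.round(np.std(models, axis=0), 4))  # disagreement shrinks toward 0
```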
Compared with traditional FL, DFL has no central communication bottleneck, but it generates a large client-to-client communication overhead. To deal with this problem, Liu et al. [19] pioneered the application of the Lloyd–Max algorithm to DFL: by exchanging model information between neighboring nodes, they adaptively adjust the quantization level and improve communication efficiency by reducing the amount of model-parameter data transmitted. Sun et al. [20] investigated Decentralized FedAvg with Momentum (DFedAvgM), based on the FedAvg paradigm, which reduces communication overhead by combining mixing matrices, momentum, multiple local client training iterations, and quantization of the transmitted models.
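The communication-saving role of quantization can be illustrated with a simple uniform quantizer. This is not the adaptive Lloyd–Max scheme of Liu et al. [19]; it is only a sketch of the general trade-off between precision and transmitted bits, with illustrative function names.

```python
import numpy as np

def quantize(params, n_bits=4):
    """Uniformly quantize parameters to 2**n_bits levels before transmission;
    only the integer codes plus (min, step) need to be sent."""
    lo, hi = params.min(), params.max()
    step = (hi - lo) / (2 ** n_bits - 1)
    codes = np.round((params - lo) / step).astype(np.uint8)  # bit-packing omitted
    return codes, lo, step

def dequantize(codes, lo, step):
    """Receiver side: reconstruct approximate parameters from the codes."""
    return lo + codes.astype(float) * step

# Usage: a 10k-parameter update is sent as 4-bit codes instead of 32-bit floats.
rng = np.random.default_rng(0)
update = rng.normal(scale=0.05, size=10_000).astype(np.float32)
codes, lo, step = quantize(update, n_bits=4)
recovered = dequantize(codes, lo, step)
print("max abs error:", float(np.abs(update - recovered).max()))
```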

3. Knowledge Graph Embedding

A Knowledge Graph (KG) is a kind of mesh database that manages loose, multi-source heterogeneous data through a standardized structural organization of (head entity, relation, tail entity) triples [21]. With the advantages of a graph structure for reflecting and managing information and for supporting accurate positioning and searching, KGs provide strong underlying support for downstream applications such as semantic web search, personalized recommendation, intelligent question answering systems, and big-data decision-making.
However, such a triple structure is difficult to process directly because of the low portability of its underlying symbolic representation. Researchers therefore apply knowledge graph embedding (KGE), which embeds entities and relations into a continuous low-dimensional vector space, keeping computation simple while preserving the structural information of the KG [22]. The entities and relations are reduced in dimensionality and stored in the low-dimensional space as vectors, matrices, or tensors.
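The basic representation can be sketched as follows: symbolic triples are indexed, and each entity and relation becomes a row of a dense embedding matrix. The toy triples, names, and dimensionality below are purely illustrative.

```python
import numpy as np

# A toy KG: symbolic triples (head entity, relation, tail entity).
triples = [("alice", "works_for", "acme"),
           ("acme",  "located_in", "berlin"),
           ("alice", "lives_in",  "berlin")]

entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
relations = sorted({r for _, r, _ in triples})
ent_id = {e: i for i, e in enumerate(entities)}
rel_id = {r: i for i, r in enumerate(relations)}

# Embedding: each entity/relation is stored as a row of a dense matrix, so the
# symbolic triple ("alice", "works_for", "acme") becomes three 8-d vectors.
dim = 8
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(entities), dim))   # entity embeddings
R = rng.normal(scale=0.1, size=(len(relations), dim))  # relation embeddings

h, r, t = triples[0]
print(E[ent_id[h]].shape, R[rel_id[r]].shape, E[ent_id[t]].shape)  # (8,) each
```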
Training a KGE model also involves semantic understanding: it is necessary to consider how to extract relations and entities from non-aligned heterogeneous data, and how to capture the real meaning of different relations and entities for the alignment task [23].
Existing knowledge graph embedding methods are mainly categorized into three types: (i) Translational Distance Models; (ii) Semantic Matching Models; and (iii) Neural Network Models. Many concrete models have been derived from these three families.
Translational Distance Models are based on TransE [24] and extend to derived models such as TransR [25], RotatE [26], and HAKE [27]. This type of method defines the scoring function by modeling the relation as a transformation from the head entity to the tail entity, such as the translation (Euclidean distance) of TransE or the rotation of RotatE; a worked TransE example is given below. Semantic Matching Models measure the plausibility of triples at the semantic level to construct scoring functions, and mainly include bilinear models such as RESCAL [28]. Because hyperbolic spaces lack a clear Euclidean inner-product correspondence, Balažević et al. first combined a bilinear model with the Poincaré ball to extend such computations to hyperbolic space [29]; their MuRP model can outperform Euclidean models on the link prediction task at lower dimensionality. Neural Network Models score triples by feeding the embeddings of the head entity, relation, and tail entity into a neural network. Zhang et al. [30] observed that data-driven link prediction relies on large numbers of labels and exploits only the structural information of the graph. Inspired by knowledge distillation, their DA-GCN uses logical rules to reduce the dependence of graph neural networks on data and iteratively applies rules to construct graph convolutional networks; the KGE trained by DA-GCN performs excellently on link prediction tasks.
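As a concrete illustration of the translational-distance idea, the sketch below scores a triple as the negative distance of h + r from t and trains the embeddings with a margin ranking loss against a corrupted (negative) triple. It is a hedged simplification, not the original TransE implementation: squared L2 distance is used inside the loss for simpler gradients, and the indices, dimensions, and function names are illustrative.

```python
import numpy as np

def transe_score(E, R, h, r, t):
    """TransE plausibility: the smaller ||h + r - t||, the more plausible,
    so the score is the negative L2 distance."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def margin_step(E, R, pos, neg, margin=1.0, lr=0.05):
    """One SGD step on the margin ranking loss
    max(0, margin + d(pos)^2 - d(neg)^2), using squared L2 distance."""
    (h, r, t), (h2, r2, t2) = pos, neg
    d_pos = E[h] + R[r] - E[t]
    d_neg = E[h2] + R[r2] - E[t2]
    if margin + d_pos @ d_pos - d_neg @ d_neg <= 0:
        return                                   # margin satisfied, no update
    E[h] -= lr * 2 * d_pos;  R[r] -= lr * 2 * d_pos;  E[t] += lr * 2 * d_pos
    E[h2] += lr * 2 * d_neg; R[r2] += lr * 2 * d_neg; E[t2] -= lr * 2 * d_neg

# Usage: a toy setup with 3 entities and 3 relations; the index triples are
# (head, relation, tail) ids, and neg corrupts the tail of pos.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(3, 8))   # 3 entity embeddings, dimension 8
R = rng.normal(scale=0.1, size=(3, 8))   # 3 relation embeddings
pos, neg = (0, 0, 1), (0, 0, 2)
for _ in range(200):
    margin_step(E, R, pos, neg)
print(transe_score(E, R, *pos) > transe_score(E, R, *neg))  # True
```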