CMKG: Construction Method of Knowledge Graph for Image Recognition

With the continuous development of artificial intelligence technology and the exponential growth in the number of images, image detection and recognition technology is being used ever more widely, and the need for image knowledge management has become pressing. The data sources of a knowledge graph are not limited to text and structured data; they also include visual and auditory data such as images, video, and audio.

  • knowledge graph
  • image recognition

1. Introduction

With the continuous development of artificial intelligence technology and the exponential growth in the number of images, image detection and recognition technology is being used ever more widely. Computer image recognition is usually divided into several main steps: information acquisition, preprocessing, feature extraction and selection, classifier design, and classification decision. In recent years, the best results in image recognition have come from instance segmentation, which builds on convolutional-neural-network-based target detection and semantic segmentation. In 2014, inspired by the two-stage R-CNN [1] target detection framework, Bharath Hariharan proposed the SDS [2] model, the earliest instance segmentation algorithm, which laid the foundation for subsequent research.
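The classic recognition pipeline outlined above can be sketched in a few lines. The following minimal example is an illustration only: it assumes scikit-learn and its built-in 8 × 8 digit images, and strings preprocessing, feature input, classifier design, and the classification decision together.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Information acquisition: a small built-in dataset of 8x8 grayscale digit images.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Preprocessing (standardization), flattened-pixel features, and classifier design
# combined in one pipeline; the SVM choice is an assumption for demonstration.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
clf.fit(X_train, y_train)

# Classification decision on unseen images.
print("test accuracy:", clf.score(X_test, y_test))
```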
Mask R-CNN, an instance segmentation model that adds a semantic segmentation branch to the Faster R-CNN [3] detector proposed by Ren et al., was introduced in 2017 and has become the basic framework for many instance segmentation tasks. In 2020, Chen et al. proposed BlendMask [4]. Addressing the shortcomings of one-stage instance segmentation, the method takes FCOS [5] as its framework and combines top-down and bottom-up approaches [6]. A blender module was designed to fuse high-level features with low-level features. However, because of the convolution kernels used in the feature fusion process, the receptive field on large (high-resolution) feature layers is too small [7], which leads to inaccurate feature extraction and fusion.
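For readers who want to experiment with such models, a minimal inference sketch is given below. It assumes the pretrained Mask R-CNN (ResNet-50 FPN backbone) shipped with torchvision and a dummy input tensor; the exact weights argument may differ across torchvision versions.

```python
import torch
import torchvision

# Mask R-CNN with a ResNet-50 FPN backbone pretrained on COCO.
# Assumption: torchvision >= 0.13 (older releases use pretrained=True instead).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Dummy RGB image tensor with values in [0, 1]; replace with a real image in practice.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])  # one dict per input image

# Each dict holds 'boxes', 'labels', 'scores', and per-instance 'masks' (N x 1 x H x W).
keep = outputs[0]["scores"] > 0.5
print("instances kept:", int(keep.sum()))
```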

2. Construction Method of Knowledge Graph for Image Recognition

2.1. Target Detection

In 2001, Viola et al. proposed the Haar feature extraction method and combined it with the AdaBoost [8] classification algorithm to achieve face detection. In 2005, Dalal et al. proposed the histogram of oriented gradients (HOG) algorithm, which performs feature recognition through edge features and, combined with an SVM classifier, realizes pedestrian detection. In 2008, Bay et al. proposed the SURF [9] algorithm as an improvement on the SIFT algorithm, greatly reducing running time and increasing robustness. In 2015, Redmon et al. [10] proposed YOLO, a new approach that frames object detection as a regression problem from the image to spatially separated bounding boxes and associated class probabilities. Since 2016, the YOLO series has evolved from YOLOv1 to YOLOv7. Anchor-based and anchor-free regression methods dominate object detection, but both have shortcomings. In 2020, Bin et al. [11] proposed CPM R-CNN, which contains three efficient modules to optimize the anchor-based point-guided method. In recent years, unsupervised pretraining methods have been designed for target detection, but they usually fall short on image classification. Enze et al. [12] proposed DetCo, a simple and effective self-supervised method for target detection.
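As a concrete illustration of the HOG + SVM pedestrian-detection idea mentioned above, the sketch below assumes OpenCV's built-in HOG descriptor and its default people-detector coefficients; the input file name is hypothetical.

```python
import cv2

# HOG descriptor initialized with OpenCV's built-in linear-SVM people detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Hypothetical input path; any street-scene photograph will do.
image = cv2.imread("pedestrians.jpg")

# Sliding-window detection over an image pyramid.
rects, weights = hog.detectMultiScale(image, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)

# Draw detected pedestrian bounding boxes and save the result.
for (x, y, w, h) in rects:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("pedestrians_detected.jpg", image)
```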

2.2. Classical Model

In 2014, Szegedy et al. designed the GoogLeNet [13] network model and proposed the Inception structure and its branches; the model achieved a new record-low error rate on the ImageNet dataset. At the same time, Simonyan et al. proposed the VGG-Net [14] model. In 2015, He et al. proposed the ResNet [15] network model, which achieved an error rate of only 3.6% on ImageNet. Convolutional neural networks (CNNs) have achieved remarkable success in many image classification tasks in recent years. Wen-Shuai et al. [16] proposed an automatic CNN architecture design method that uses genetic algorithms to effectively address image classification tasks. To suppress uncertainty, in 2021, Kai et al. [17] proposed a simple yet efficient Self-Cure Network (SCN) based on ResNet-18, which prevents deep networks from over-fitting uncertain facial images. In 2022, Jing et al. [18] introduced a regulator module as a memory mechanism to extract complementary features of the middle layers and feed them back into ResNet. The size of the convolution kernel has also changed recently: in 2022, Xiaohan Ding et al. [19] proposed RepLKNet, a pure CNN architecture whose kernel size is as large as 31 × 31, in contrast to the commonly used 3 × 3. In 2023, combining MobileNet with the ResNet-18 model, Lee et al. [20] proposed a block processing strategy that effectively improved the efficiency of facial expression processing.
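The key idea behind ResNet, the identity skip connection, can be sketched in a few lines of PyTorch. The block below is a simplified, same-channel version of a ResNet-18 basic block and is intended purely as an illustration, not as the models' actual implementation.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block in the spirit of ResNet-18 (illustrative sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # residual addition eases optimization of deep nets

block = BasicBlock(64)
print(block(torch.rand(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```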

2.3. Instance Segmentation

In 2015, Dai et al. proposed an Instance-Sensitive Fully Convolutional Network [21] to make up for the translation-invariance defect of the Fully Convolutional Network (FCN) [22] and complete the task of instance segmentation. Faster R-CNN introduces an RPN network on top of the R-CNN series of algorithms to obtain accurate candidate regions; it is an end-to-end detection model for multi-object classification and localization.
In 2019, Bolya et al. proposed the YOLACT model, which adds a mask branch to an existing one-stage detection model in the same way that Mask R-CNN extends Faster R-CNN, but without an explicit feature localization step. Formulating instance segmentation as instance-center classification and dense distance regression, Xie et al. proposed the PolarMask [23] model. Based on instance categories defined by quantized object center positions and object sizes, Wang et al. proposed the SOLO model [24] in 2019, which assigns each pixel not a single output category but a category carrying location information. In the same year, based on the principle that detection and segmentation should promote each other, Wang et al. proposed RDSNet [25] to improve instance segmentation performance by making full and reasonable use of the information interaction between target detection and instance segmentation. In 2022, Lu et al. proposed Segmenting Objects from Relational Visual Data [26], which promoted the development of image segmentation. In 2023, Lei et al. [27] modeled image formation as the composition of two overlapping layers and used this bilayer structure to model occlusion relationships, which naturally decouples the boundaries between instances and effectively solves the image segmentation problem under occlusion.
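Instance segmentation quality is commonly judged by the overlap between predicted and ground-truth masks. The sketch below shows a straightforward mask IoU computation with NumPy; the toy masks are assumptions used only for illustration.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary instance masks (H x W boolean arrays)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: two partially overlapping square masks.
pred = np.zeros((100, 100)); pred[10:60, 10:60] = 1
gt = np.zeros((100, 100)); gt[30:80, 30:80] = 1
print(round(mask_iou(pred, gt), 3))  # about 0.22
```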

2.4. Knowledge Graph

As early as 1960, semantic networks were proposed as a method of knowledge representation, mainly used in the field of natural language understanding. In 2006, Tim Berners-Lee introduced linked data to highlight the essence of the semantic web: establishing links between open data. In recent years, the application of knowledge graph technology in various industries has become an important trend [28], for example the Baidu and Google knowledge graphs in the search field, the knowledge graph of traditional Chinese medicine [29] in the medical field, and JD.com's knowledge graph in the e-commerce field. These fully illustrate the universality of the knowledge graph.
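A knowledge graph for image recognition can be assembled from (subject, relation, object) triples extracted from recognition results. The sketch below assumes hypothetical entity names and uses networkx purely for illustration; a production system would typically use a dedicated triple store.

```python
import networkx as nx

# Directed multigraph as a lightweight triple store.
kg = nx.MultiDiGraph()

# Hypothetical triples linking an image to recognized objects and their classes.
triples = [
    ("image_001", "contains", "object_1"),
    ("object_1", "instance_of", "person"),
    ("person", "subclass_of", "agent"),
]
for subj, rel, obj in triples:
    kg.add_edge(subj, obj, relation=rel)

# Simple query: which entities does image_001 directly contain?
contained = [o for _, o, d in kg.out_edges("image_001", data=True)
             if d["relation"] == "contains"]
print(contained)  # ['object_1']
```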