In the education field, foundational subject-related knowledge reflects concepts frequently expressed as single words or phrases. Educational concepts must be blended into learning and instructional practices; students’ inadequate conceptual understanding can cause them to forget information during the learning process. Similarly, instructors can enhance learning materials’ quality through a clear sense of subject-specific concepts. Extracting concepts from various subject areas based on extensive unstructured text is thus critical for instructors and students, especially to enrich teaching and learning activities.
1. Generic Concept Extraction Methods
Concept extraction, a core NLP task, refers to identifying and extracting predefined concepts or patterns from textual data. This task is difficult given the complex and dynamic nature of language. Scholars have investigated concept extraction in numerous areas, such as clinical medicine, information retrieval, and automation engineering. A recent review of the clinical literature on this topic indicated that most approaches to extracting domain concepts from clinical text fall into two categories: (1) rule-based methods and (2) machine learning methods, including deep learning approaches and hybrid approaches [4]. Likewise, Kang et al. suggested that concept extraction strategies related to information retrieval can be classified as machine learning methods, corpus-based methods, glossary-based methods, or heuristic-based methods [23]. Currently, concept extraction often involves either supervised or unsupervised strategies. These approaches normally follow a four-step procedure: preprocessing, generating a list of candidate concepts, identifying concepts from the candidates, and evaluating those concepts.
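The four-step procedure can be illustrated with a minimal Python sketch. It is not the method of any cited study: the function names, the tiny stopword list, the n-gram candidate generator, and the frequency-based selection are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Assumed toy stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to"}

def preprocess(text):
    """Step 1: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def generate_candidates(tokens, max_len=2):
    """Step 2: collect n-grams (up to max_len words) that contain no stopword."""
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOPWORDS for tok in gram):
                candidates.append(" ".join(gram))
    return candidates

def identify_concepts(candidates, top_k=3):
    """Step 3: keep the top_k most frequent candidates (a deliberately naive score)."""
    return [cand for cand, _ in Counter(candidates).most_common(top_k)]

def evaluate(predicted, gold):
    """Step 4: precision, recall, and F1 against a hand-labeled concept set."""
    pred, gold = set(predicted), set(gold)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    tokens = preprocess("The cell wall and the cell membrane protect the cell.")
    concepts = identify_concepts(generate_candidates(tokens))
    print(concepts, evaluate(concepts, ["cell wall", "cell membrane"]))
```

Any of the method families discussed in this section could replace the frequency count in step 3 while the other three steps stay unchanged.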
Concept extraction methods can be further divided into four groups (summarized in Table 1): rule-based methods, dictionary-based methods, statistical methods, and semantic-based methods.

First, rule-based methods rely on predefined rules and patterns to extract concepts from text; regular expressions and pattern-matching techniques are prevalent. Rule-based concept extraction applies grammatical rules, semantic rules, and related criteria to a corpus and extracts multi-character units that conform to the predefined rules. These units are then labeled as concepts. Szwed employed a rule-based method that transformed detected names according to Polish grammar rules, with transformation patterns specified through annotations, to extract concepts from unstructured Polish texts [6]. Stanković et al. developed a rule-based approach that relies on a system of language resources to tackle the multi-word term problem in domain concept extraction [7]. One benefit of rule-based methods is their capacity to manage well-defined patterns and encode domain-specific knowledge. Nevertheless, these approaches are labor-intensive and time-consuming to develop, and they may not capture linguistic diversity.

Second, dictionary-based concept extraction methods use predefined dictionaries of concept keywords to extract relevant information. Words or phrases in the target text are compared with dictionary entries via similarity metrics or string-matching algorithms [8,24]. These techniques are faster to implement than rule-based methods but depend heavily on the dictionary's quality and coverage; they may include noise or miss concepts that are absent from the dictionary.

Third, statistical methods emphasize modeling and analyzing potential patterns among domain concepts in a target text. Statistical metrics such as term frequency–inverse document frequency (TF–IDF), co-occurrence, and neighbors are popular heuristics for ranking candidate concepts. The TF–IDF method is premised on the observation that domain-specific concepts occur much more frequently in some domains than in others, a pattern that TF–IDF weighting captures [9]. Candidate domain concepts can also be ranked statistically by depicting the extracted concepts as nodes on a graph and appraising their roles using network properties such as concept centrality and connectivity. Concepts that occupy more prominent positions within the graph receive higher scores, reflecting the representativeness of both the node and the concept [5]. Graph-based methods for concept extraction include the TextRank [10] approach and its variations, such as Ne-rank [25], TopicRank [26], and MultipartiteRank [27].
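Graph-based ranking can be sketched in a few lines of Python. The code below is a simplified, TextRank-style illustration, not the published algorithm: it links tokens that co-occur within a small window and scores nodes with an unweighted PageRank computed by power iteration. The function names, the window size, the damping factor of 0.85, and the fixed iteration count are assumptions, and the part-of-speech filtering and phrase merging used by TextRank are omitted.

```python
from collections import defaultdict

def build_cooccurrence_graph(tokens, window=2):
    """Link each token to the distinct tokens in the next `window` positions."""
    graph = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != tok:
                graph[tok].add(tokens[j])
                graph[tokens[j]].add(tok)
    return graph

def rank_nodes(graph, damping=0.85, iterations=50):
    """Unweighted PageRank by power iteration; returns {node: score}."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor shares its current score equally among its own neighbors.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores

if __name__ == "__main__":
    tokens = "force mass acceleration force energy force work".split()
    scores = rank_nodes(build_cooccurrence_graph(tokens))
    for term in sorted(scores, key=scores.get, reverse=True):
        print(f"{term}: {scores[term]:.3f}")
```

Tokens that co-occur with many other tokens end up with higher scores, which mirrors the idea that more central nodes are more representative concepts.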
Table 1. A summary of generic concept extraction methods.
| Methods | Core Processes | Strengths | Weaknesses | Articles |
|---|---|---|---|---|
| Rule-based methods | Adhering to grammatical rules, semantic rules, and related aspects to process a corpus; extracting multi-character units that conform to predefined rules and are labeled as concepts. | Having the ability to manage well-defined patterns; supporting domain-specific knowledge; enabling access to and providing transparency of the concept extraction process. | Being labor-intensive; being time-consuming; disregarding linguistic diversity. | [6,7] |
| Dictionary-based methods | Using a predefined concept dictionary to compare words or phrases in the text with dictionary entries; employing similarity metrics or string-matching algorithms to extract relevant information. | Allowing faster implementation; allowing quick extension for domain adaptation; providing scalability to large datasets. | Depending on dictionary quality and coverage; missing out-of-dictionary concepts; underperforming in concept extraction. | [8,24] |
| Statistical methods | Analyzing word frequency and co-occurrence patterns; ranking based on statistical features to identify concepts in the text (e.g., weight-based ranking, graph-based ranking). | Having the ability to model potential patterns among domain concepts; allowing highly customized extraction of domain concepts; being robust to noise. | Being sensitive to data quality; disregarding contextual information; missing semantic associations. | [9,25,26,27] |
| Semantic-based methods | Using predefined grammar rules to identify candidate concepts; utilizing pretrained models to obtain semantic vectors for candidates; applying post-processing techniques to determine target concepts from the text. | Obtaining higher precision and recall scores by capturing deeper meaning and contextual information in concept extraction; taking advantage of state-of-the-art NLP techniques (e.g., word embeddings, BERT); having scalability to different domains. | Relying on the quality and coverage of word embeddings; requiring larger computational resources; having challenges in evaluation and explainability. | [12,13] |
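The contrast between rule-based and dictionary-based extraction can be made concrete with a short Python sketch. Everything in it is hypothetical: the single regular-expression rule, the toy dictionary, and the similarity cutoff are illustrative assumptions rather than resources from the cited studies. The dictionary matcher uses the standard-library `difflib.get_close_matches` as one possible similarity metric.

```python
import re
from difflib import get_close_matches

# Hypothetical rule: a capitalized word (optionally possessive) followed by a
# head noun that often signals a named concept.
RULE = re.compile(r"\b([A-Z][a-z]+(?:'s)? (?:theorem|law|equation))\b")

def rule_based(text):
    """Return every span that conforms to the predefined pattern."""
    return [match.group(1).lower() for match in RULE.finditer(text)]

def dictionary_based(text, dictionary, cutoff=0.85):
    """Compare single words and adjacent word pairs with dictionary entries,
    tolerating small spelling differences through a similarity cutoff."""
    words = re.findall(r"[a-z']+", text.lower())
    spans = words + [" ".join(pair) for pair in zip(words, words[1:])]
    found = []
    for span in spans:
        match = get_close_matches(span, dictionary, n=1, cutoff=cutoff)
        if match and match[0] not in found:
            found.append(match[0])
    return found

if __name__ == "__main__":
    text = "Students apply the Pythagorean theorem before studying Hooke's law."
    toy_dictionary = ["pythagorean theorem", "ohm's law", "kinetic energy"]
    print(rule_based(text))                        # spans matched by the rule
    print(dictionary_based(text, toy_dictionary))  # spans matched by the dictionary
```

In this toy run the rule picks up "hooke's law" because it fits the pattern, whereas the dictionary matcher misses it because the entry is absent, which is the coverage weakness listed in Table 1.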
2. Concept Extraction in Education
Concept extraction is a fundamental technique for knowledge mining in education (e.g., when identifying topics in students' online discussions or building educational knowledge graphs). Studies have demonstrated that automated domain concept extraction yields deep insights for teaching and learning [14,15,28,29]. Chen et al. identified e-learning domain concepts from academic articles to assemble a concept map; this helped teachers create adaptive learning materials and enabled students to better grasp the complete picture of subject knowledge [16]. Conde developed a tool to identify terms in electronic textbooks and assist teachers in crafting instructional materials [17]. Peng et al. extracted topic concepts from students' forum posts, enabling instructors to detect and trace students' learning engagement with discourse content [30]. A set of concepts extracted from subject materials, along with a group of association rules, can be used to construct knowledge graphs and thereby promote teaching and learning. A systematic review revealed that the relationships among domain concepts are essential for estimating or predicting learners' knowledge states [15]. Together, such research has shown concept extraction to be crucial in teaching and learning practices. However, popular approaches in educational studies (e.g., TF–IDF and latent Dirichlet allocation) depend on word frequency statistics, so they can easily overlook low-frequency educational concepts and struggle to capture the semantic information behind text. Therefore, it is imperative to determine how to exploit semantic information from educational concepts to facilitate concept extraction.
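The dependence of TF–IDF on word frequency can be seen in a small worked example. The Python sketch below uses one common formulation, with term frequency as the count divided by document length and inverse document frequency as log(N / df); other variants exist, and the function name and toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    tf(t, d) = count of t in d / length of d
    idf(t)   = log(number of documents / number of documents containing t)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

if __name__ == "__main__":
    docs = [
        ["force", "mass", "force"],
        ["mass", "energy"],
        ["mass", "velocity"],
    ]
    for doc_scores in tf_idf(docs):
        print({term: round(score, 3) for term, score in doc_scores.items()})
```

In this toy corpus "mass" appears in every document and therefore scores zero, while a term that occurs only once in its document scores below a repeated term. The score depends only on counts, so a rare but pedagogically important concept receives no credit for its meaning.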
Many strategies adopted in educational settings involve TF–IDF, C/NC values, and graph-based ranking. These statistical approaches to extracting concepts from textual data generally depend on word frequency or keywords. Lin Zhang proposed a hybrid method based on the TextRank algorithm and TF–IDF for key concept extraction and sentiment analysis of educational texts [21]. Liu improved the Chinese term extraction method by using C/NC values [20]. Although statistical methods are applicable to concept extraction, they traditionally require extensive domain knowledge and labeling to identify meaningful features. In contrast, word embedding techniques can learn directly from text corpora without manual labeling or feature engineering; that is, they can learn in an unsupervised manner. Each dimension of a word embedding can also reflect certain aspects of lexical meaning, thereby providing rich semantic information [31].
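How embeddings supply semantic information that frequency counts lack can be sketched with cosine similarity. The vectors below are invented three-dimensional toy values, not output from any pretrained model, and the function names are hypothetical; a real system would load vectors with hundreds of dimensions from a trained embedding model.

```python
import math

# Hypothetical toy vectors for illustration only.
TOY_EMBEDDINGS = {
    "fraction": [0.90, 0.10, 0.00],
    "numerator": [0.85, 0.15, 0.05],
    "denominator": [0.80, 0.20, 0.10],
    "homework": [0.10, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(seed, candidates, embeddings):
    """Sort candidates by cosine similarity to the seed concept's vector."""
    seed_vec = embeddings[seed]
    return sorted(candidates,
                  key=lambda word: cosine(seed_vec, embeddings[word]),
                  reverse=True)

if __name__ == "__main__":
    ranked = rank_by_similarity(
        "fraction", ["homework", "denominator", "numerator"], TOY_EMBEDDINGS)
    print(ranked)
```

A candidate can rank highly here even if it occurs only once in the corpus, because the score comes from vector proximity to a known concept rather than from counts.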
At present, pre-trained large language models can obtain word representations with more semantic information and have been employed for educational concept extraction. Pan et al. extended the pre-trained embedding model by adding a graph propagation algorithm to capture relationships between words and courses, enabling domain concepts to be identified within a course [18]. Albahr et al. used the skip-gram model with the Wikipedia corpus to obtain word embedding vectors for concept extraction in massive open online courses [19]. To address noisy and incomplete annotations during high-quality knowledge concept extraction from these types of courses, Lu et al. developed a three-stage framework [22]. It harnessed pre-trained language models explicitly and implicitly and integrated discipline-embedding models with a self-training strategy. These models are usually trained on large-scale corpora, making them highly robust and able to implicitly encode real knowledge concepts [32]. However, when pre-trained models are used for concept extraction, the generality of their training corpora may cause extracted concepts not to match the semantics of a specific domain. Put simply, this method's feasibility is limited in domain-specific concept extraction in the absence of extensive and high-quality domain-specific corpora. Publicly available pre-trained word embedding models are adequate for many general NLP tasks, and researchers from different fields have fine-tuned these models on target-domain texts to improve performance in downstream NLP tasks. In the legal domain, Chalkidis et al. developed the Legal-BERT model based on BERT and achieved higher performance [33]. Wang et al. showed that word embeddings trained on biomedical corpora captured the semantics of medical terms better than word embeddings trained on general-domain corpora [34]. Clavié and Gal noted that domain-specific large pre-trained models could have promising results for learning analytics [35]. Concept extraction performance can thus be enhanced by optimizing pre-trained models. The remaining challenge lies in effectively incorporating pre-trained models tailored to domain-specific semantics into educational concept extraction.
This entry is adapted from the peer-reviewed paper 10.3390/app132212307