Concept Extraction in Education

This entry is adapted from the peer-reviewed paper 10.3390/app132212307

In the education field, foundational subject-related knowledge reflects concepts frequently expressed as single words or phrases. Educational concepts must be blended into learning and instructional practices; students’ inadequate conceptual understanding can cause them to forget information during the learning process. Similarly, instructors can enhance learning materials’ quality through a clear sense of subject-specific concepts. Extracting concepts from various subject areas based on extensive unstructured text is thus critical for instructors and students, especially to enrich teaching and learning activities.

Keywords: concept extraction; artificial intelligence; education

1. Generic Concept Extraction Methods

Concept extraction, a core natural language processing (NLP) task, refers to identifying and extracting predefined concepts or patterns from textual data. This task is difficult given the complex and dynamic nature of language. Scholars have investigated concept extraction in numerous areas, such as clinical medicine, information retrieval, and automation engineering. A recent review of the clinical literature on this topic indicated that most approaches to extracting domain concepts from clinical text fall into two categories: (1) rule-based methods and (2) machine learning methods, including deep learning approaches and hybrid approaches ^[1]. Likewise, Kang et al. suggested that concept extraction strategies related to information retrieval can be classified as machine learning, corpus-based, glossary-based, or heuristic-based methods ^[2]. Currently, concept extraction often involves either supervised or unsupervised strategies. These approaches normally follow a four-step procedure: preprocessing, generating a list of candidate concepts, identifying concepts from the candidates, and evaluating those concepts.
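The four-step procedure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a method taken from the cited studies: the stopword list, the n-gram candidate generator, the frequency threshold, and the tiny gold-standard set are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "is", "are", "to", "for", "from", "on", "with"}

def preprocess(text):
    """Step 1: lowercase the text and split it into sentences of word tokens."""
    sentences = re.split(r"[.!?]+", text.lower())
    return [re.findall(r"[a-z]+", s) for s in sentences if s.strip()]

def generate_candidates(sentences, max_len=2):
    """Step 2: count every n-gram (up to max_len words) that contains no stopword."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    counts[" ".join(gram)] += 1
    return counts

def identify_concepts(counts, min_freq=2):
    """Step 3: keep candidates that occur at least min_freq times."""
    return {cand for cand, freq in counts.items() if freq >= min_freq}

def evaluate(predicted, gold):
    """Step 4: precision, recall, and F1 against a gold-standard concept set."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

text = ("Photosynthesis converts light energy into chemical energy. "
        "Light energy is absorbed by chlorophyll. "
        "Chlorophyll is a pigment in the chloroplast.")
gold = {"light energy", "chlorophyll", "photosynthesis"}

predicted = identify_concepts(generate_candidates(preprocess(text)))
precision, recall, f1 = evaluate(predicted, gold)
print(sorted(predicted), precision, recall, f1)
```

In this toy run the frequency threshold misses "photosynthesis", which occurs only once. That is the kind of low-frequency concept that purely frequency-driven identification tends to overlook.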

Concept extraction methods can be further divided into four groups (summarized in Table 1): rule-based methods, dictionary-based methods, statistical methods, and semantic-based methods.

First, rule-based methods rely on predefined rules and patterns to extract concepts from text; regular expressions and pattern-matching techniques are prevalent. Rule-based concept extraction applies grammatical rules, semantic rules, and related constraints to a corpus and extracts multi-character units that conform to those rules. These units are then labeled as concepts. Szwed employed a rule-based method that involved transforming detected names according to Polish grammar rules, utilizing a user-friendly approach for specifying transformation patterns through annotations to extract concepts from unstructured Polish texts ^[3]. Stanković et al. developed a rule-based approach that relies on a system of language resources to tackle the multi-word term problem in domain concept extraction ^[4]. One benefit of rule-based methods is their capacity to manage patterns and implement domain-specific knowledge. Nevertheless, these approaches are labor-intensive and time-consuming to develop, and they may not capture linguistic diversity.

Second, dictionary-based concept extraction methods use predefined dictionaries of concept keywords to extract relevant information. Words or phrases in the target text are compared with dictionary entries via similarity metrics or string-matching algorithms ^[5]^[6]. These techniques are faster than rule-based methods but depend heavily on the dictionary’s quality and coverage; they may include noise or miss concepts that are absent from the dictionary.

Third, statistical methods emphasize modeling and analyzing potential patterns among domain concepts in a target text. Statistical metrics such as term frequency–inverse document frequency (TF–IDF), co-occurrence, and neighbors are popular heuristics for ranking candidate concepts. The TF–IDF method rests on the premise that domain-specific concepts occur much more frequently in some domains than in others, a pattern that TF–IDF weighting captures ^[7]. Candidate domain concepts can also be ranked statistically by depicting the extracted concepts as nodes on a graph and appraising their roles using network properties such as concept centrality and connectivity. Concepts that occupy more prominent positions within the graph receive higher scores, reflecting the representativeness of both the node and the concept ^[8]. Graph-based methods for concept extraction include the TextRank ^[9] approach and its variations, such as NE-Rank ^[10], TopicRank ^[11], and MultipartiteRank ^[12].

Fourth, semantic-based methods typically identify candidate concepts with predefined grammar rules, obtain semantic vectors for the candidates from pretrained models, and apply post-processing techniques to determine the target concepts ^[13]^[14].
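The two statistical rankings mentioned above, TF–IDF weighting and graph-based ranking, can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact formulation of any cited paper: `tf_idf` uses one common variant (relative term frequency times the log of inverse document frequency), and `graph_rank` is a TextRank-style PageRank iteration over an unweighted co-occurrence graph without the part-of-speech filtering that keyphrase extractors usually add. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one dict per document mapping term -> TF-IDF weight.

    TF is the relative frequency in the document; IDF is log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def graph_rank(tokens, window=2, damping=0.85, iterations=50):
    """Rank terms by PageRank-style scores on an undirected co-occurrence graph.

    Two terms are linked when they appear within `window` tokens of each other.
    """
    neighbors = {t: set() for t in tokens}
    for i, term in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != term:
                neighbors[term].add(other)
                neighbors[other].add(term)
    rank = {t: 1.0 for t in neighbors}
    for _ in range(iterations):
        rank = {
            t: (1 - damping)
            + damping * sum(rank[u] / len(neighbors[u]) for u in neighbors[t])
            for t in neighbors
        }
    return rank

# Toy "documents" from two subject areas; "lesson" appears in both.
math_doc = "fraction numerator fraction denominator lesson".split()
bio_doc = "cell membrane cell nucleus lesson".split()
weights = tf_idf([math_doc, bio_doc])
top_math_term = max(weights[0], key=weights[0].get)

physics_tokens = "energy kinetic energy potential energy heat".split()
ranks = graph_rank(physics_tokens)
top_physics_term = max(ranks, key=ranks.get)
print(top_math_term, top_physics_term)
```

The word shared by both documents ("lesson") receives a TF–IDF weight of zero, whereas the term that co-occurs with the most distinct neighbors ("energy") receives the highest graph score.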

Table 1. A summary of generic concept extraction methods.

Methods	Core Processes	Strengths	Weaknesses	Articles
Rule-based methods	▪ Adhering to grammatical rules, semantic rules, and related aspects to process a corpus; ▪ Extracting multi-character units that conform to predefined rules and are labeled as concepts.	▪ Having the ability to manage well-defined patterns; ▪ Supporting domain-specific knowledge; ▪ Enabling access and providing transparency of the concept extraction process.	▪ Being labor-intensive; ▪ Being time-consuming; ▪ Disregarding linguistic diversity.	^[3]^[4]
Dictionary-based methods	▪ Using a predefined concept dictionary to compare words or phrases in the text with dictionary entries; ▪ Employing similarity metrics or string-matching algorithms to extract relevant information.	▪ Allowing faster implementation; ▪ Allowing for quick extensions for domain adaptation; ▪ Providing scalability to large datasets.	▪ Depending on dictionary quality and coverage; ▪ Missing out-of-dictionary concepts; ▪ Underperforming in concept extraction.	^[5]^[6]
Statistical methods	▪ Analyzing word frequency and co-occurrence patterns; ▪ Ranking based on statistical features to identify concepts in the text (e.g., weight-based ranking, graph-based ranking).	▪ Having the ability to model potential patterns among domain concepts; ▪ Allowing highly customized extraction of domain concepts; ▪ Being robust to noise.	▪ Being sensitive to data quality; ▪ Disregarding contextual information; ▪ Missing semantic associations.	^[7]^[10]^[11]^[12]
Semantic-based methods	▪ Using predefined grammar rules to identify candidate concepts; ▪ Utilizing pretrained models to obtain semantic vectors for candidates; ▪ Applying post-processing techniques to determine target concepts from the text.	▪ Obtaining higher precision and recall scores by capturing deeper meaning and contextual information in concept extraction; ▪ Taking advantage of the state-of-the-art NLP techniques (e.g., word embeddings, BERT); ▪ Having scalability to different domains.	▪ Relying on the quality and coverage of word embeddings; ▪ Requiring larger computational resources; ▪ Having challenges in evaluation and explainability.	^[13]^[14]
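To make the first two rows of Table 1 concrete, the sketch below contrasts a rule-based extractor with a dictionary-based one. Both are illustrative assumptions rather than the systems cited in the table: the single definitional rule ("X is defined as ..." / "X refers to ...") and the four-entry glossary are hypothetical, and the dictionary lookup uses exact longest-match string matching instead of a similarity metric.

```python
import re

# Hypothetical definitional rule: the term is whatever precedes the cue phrase.
DEFINITION_RULE = re.compile(
    r"([A-Za-z][A-Za-z -]*?) (?:is defined as|are defined as|refers to|refer to)\b"
)

def rule_based(text):
    """Extract terms that match the predefined definitional pattern."""
    concepts = []
    for match in DEFINITION_RULE.finditer(text):
        # Drop a leading article so "A rational number" becomes "rational number".
        term = re.sub(r"^(?:a|an|the)\s+", "", match.group(1).strip(), flags=re.IGNORECASE)
        concepts.append(term.lower())
    return concepts

def dictionary_based(text, dictionary):
    """Extract dictionary entries found in the text, preferring the longest match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    max_len = max(len(entry.split()) for entry in dictionary)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):  # try longer phrases first
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                found.append(phrase)
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found

TEXT = ("A rational number is defined as a ratio of two integers. "
        "Photosynthesis refers to the process plants use to make food.")
GLOSSARY = {"rational number", "integer", "photosynthesis", "chlorophyll"}

rule_hits = rule_based(TEXT)
dict_hits = dictionary_based(TEXT, GLOSSARY)
print(rule_hits, dict_hits)
```

The dictionary lookup misses "integers" because only the singular form is in the glossary, which illustrates the coverage weakness listed in the table. The rule, in turn, finds only terms that happen to be introduced with one of its cue phrases.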

2. Concept Extraction in Education

Concept extraction is a fundamental technique for knowledge mining in education (e.g., when identifying topics in students’ online discussions or constructing educational knowledge graphs). Studies have demonstrated that automated domain concept extraction yields valuable insights for teaching and learning ^[15]^[16]^[17]^[18]. Chen et al. identified e-learning domain concepts from academic articles to assemble a concept map; this helped teachers create adaptive learning materials and enabled students to better grasp the complete picture of subject knowledge ^[19]. Conde et al. developed a tool to identify terms in electronic textbooks and assist teachers in crafting instructional materials ^[20]. Peng et al. extracted topic concepts from students’ forum posts, enabling instructors to detect and trace students’ learning engagement with discourse content ^[21]. A set of concepts extracted from subject materials, along with a group of association rules, can be used to construct knowledge graphs and thereby promote teaching and learning. A systematic review revealed that the relationships among domain concepts are essential for estimating or predicting learners’ knowledge states ^[16]. Together, such research has shown concept extraction to be crucial in teaching and learning practices. However, popular approaches in educational studies (e.g., TF–IDF and latent Dirichlet allocation) depend on word frequency statistics; they can easily overlook low-frequency educational concepts and struggle to capture the semantic information behind text. Therefore, it is imperative to determine how to exploit semantic information from educational concepts to facilitate concept extraction.

Many strategies adopted in educational settings involve TF–IDF, C/NC values, and graph-based ranking. These statistical approaches to concept extraction (i.e., from textual data) are generally contingent on word frequency or keywords. Zhang et al. proposed a hybrid method based on the TextRank algorithm and TF–IDF for key concept extraction and sentiment analysis of educational texts ^[22]. Liu and Shao improved a Chinese term extraction method by using C/NC values ^[23]. Although statistical methods are applicable to concept extraction, they traditionally require extensive domain knowledge and labeling to identify meaningful features. In contrast, word embedding techniques can learn directly from text corpora without manual labeling or feature engineering; that is, they can learn in an unsupervised manner. Each dimension of word embeddings can also reflect certain aspects of lexical meaning, thereby providing rich semantic information ^[24].
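As an illustration of the C-value half of the C/NC-value approach mentioned above, the sketch below scores multi-word candidate terms. It follows the C-value formula as it is commonly stated in the term extraction literature, not the improved variant of the cited study: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of the candidate's length in words. The function names and the frequency table are hypothetical.

```python
import math

def is_nested(short, long):
    """True if the word list `short` occurs contiguously inside the longer list `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def c_value(freqs):
    """Compute a C-value score for each candidate term.

    freqs maps a candidate term (space-separated words) to its corpus frequency.
    Single-word candidates score 0 because log2(1) == 0; the measure is
    intended for multi-word terms.
    """
    scores = {}
    for term, freq in freqs.items():
        words = term.split()
        # Frequencies of longer candidates that contain this term.
        longer = [f for other, f in freqs.items() if is_nested(words, other.split())]
        adjusted = freq - sum(longer) / len(longer) if longer else freq
        scores[term] = math.log2(len(words)) * adjusted
    return scores

# Hypothetical candidate frequencies from a course corpus.
candidate_freqs = {
    "neural network": 10,
    "deep neural network": 4,
    "recurrent neural network": 2,
}
scores = c_value(candidate_freqs)
print(scores)
```

Here "neural network" is discounted because part of its frequency comes from occurrences inside the two longer terms, yet it still scores highest because it also occurs often on its own.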

At present, pre-trained large language models can obtain word representations with more semantic information and have been employed for educational concept extraction. Pan et al. extended the pre-trained embedding model by adding a graph propagation algorithm to capture relationships between words and courses, enabling domain concepts to be identified within a course ^[25]. Albahr et al. used the skip-gram model with the Wikipedia corpus to obtain word embedding vectors for concept extraction in massive open online courses ^[26]. To address noisy and incomplete annotations during high-quality knowledge concept extraction from these types of courses, Lu et al. developed a three-stage framework ^[27]. It harnessed pre-trained language models explicitly and implicitly and integrated discipline-embedding models with a self-training strategy. These models are usually trained on large-scale corpora, making them highly robust and able to implicitly encode real knowledge concepts ^[28]. However, when using pre-trained models for concept extraction, the generality of the training corpora may cause extracted concepts not to match the semantics of a specific domain. Put simply, this method’s feasibility is limited in domain-specific concept extraction in the absence of extensive and high-quality domain-specific corpora. Publicly available pre-trained word embedding models are often sufficient for general NLP tasks. Researchers from different fields have nonetheless fine-tuned these models on target domain texts to improve performance in downstream NLP tasks. In the legal domain, Chalkidis et al. developed the Legal-BERT model based on BERT and achieved higher performance ^[29]. Wang et al. showed that word embeddings trained on biomedical corpora captured the semantics of medical terms better than word embeddings trained on general domain corpora ^[30]. Clavié and Gal noted that domain-specific large pre-trained models could have promising results for learning analytics ^[31]. Concept extraction performance can thus be enhanced by optimizing pre-trained models. The true test lies in effectively incorporating pre-trained models tailored to domain-specific semantics into educational concept extraction.
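One simple way that embedding vectors can be used for concept extraction is to rank candidates by their cosine similarity to the centroid of a few known domain concepts. The sketch below illustrates that idea only and is not the method of any study cited here. The three-dimensional vectors are made-up toy values; in practice they would come from a pre-trained or domain-adapted embedding model, and the seed terms and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_candidates(candidates, embeddings, seed_terms):
    """Sort candidates by similarity to the centroid of the seed concepts."""
    domain_vector = centroid([embeddings[term] for term in seed_terms])
    scored = [(term, cosine(embeddings[term], domain_vector)) for term in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real model output (assumed values).
toy_embeddings = {
    "fraction": [0.9, 0.1, 0.0],
    "denominator": [0.8, 0.2, 0.1],
    "equation": [0.7, 0.3, 0.0],
    "homework": [0.1, 0.9, 0.2],
    "teacher": [0.0, 0.8, 0.5],
}

ranked = rank_candidates(
    candidates=["homework", "teacher", "equation"],
    embeddings=toy_embeddings,
    seed_terms=["fraction", "denominator"],
)
print(ranked)
```

Because the ranking depends entirely on the vectors, embeddings trained on general corpora and embeddings adapted to a subject area can order the same candidates differently, which is the motivation for the domain-specific models discussed above.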

References

Fu, S.; Chen, D.; He, H.; Liu, S.; Moon, S.; Peterson, K.J.; Shen, F.; Wang, L.; Wang, Y.; Wen, A.; et al. Clinical Concept Extraction: A Methodology Review. J. Biomed. Inform. 2020, 109, 103526.
Kang, Y.-B.; Haghighi, P.D.; Burstein, F. CFinder: An Intelligent Key Concept Finder from Text for Ontology Development. Expert Syst. Appl. 2014, 41, 4494–4504.
Szwed, P. Concepts Extraction from Unstructured Polish Texts: A Rule Based Approach. In Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), Lodz, Poland, 13–16 September 2015; pp. 355–364.
Stanković, R.; Krstev, C.; Obradović, I.; Lazić, B.; Trtovac, A. Rule-Based Automatic Multi-Word Term Extraction and Lemmatization. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 507–514.
Gong, L.; Yang, R.; Liu, Q.; Dong, Z.; Chen, H.; Yang, G. A Dictionary-Based Approach for Identifying Biomedical Concepts. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1757004.
Levow, G.-A.; Oard, D.W.; Resnik, P. Dictionary-Based Techniques for Cross-Language Information Retrieval. Inf. Process. Manag. 2005, 41, 523–547.
Aizawa, A. An Information-Theoretic Perspective of Tf–Idf Measures. Inf. Process. Manag. 2003, 39, 45–65.
Firoozeh, N.; Nazarenko, A.; Alizon, F.; Daille, B. Keyword Extraction: Issues and Methods. Nat. Lang. Eng. 2020, 26, 259–291.
Mihalcea, R.; Tarau, P. Textrank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
Bellaachia, A.; Al-Dhelaan, M. NE-Rank: A Novel Graph-Based Keyphrase Extraction in Twitter. In Proceedings of the 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Macau, China, 4–7 December 2012; IEEE: Piscataway, NJ, USA, 2012.
Bougouin, A.; Boudin, F.; Daille, B. Topicrank: Graph-Based Topic Ranking for Keyphrase Extraction. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, 14–19 October 2013; pp. 543–551.
Boudin, F. Unsupervised Keyphrase Extraction with Multipartite Graphs. arXiv 2018, arXiv:1803.08721.
Tulkens, S.; Šuster, S.; Daelemans, W. Unsupervised Concept Extraction from Clinical Text through Semantic Composition. J. Biomed. Inform. 2019, 91, 103120.
Xiong, A.; Liu, D.; Tian, H.; Liu, Z.; Yu, P.; Kadoch, M. News Keyword Extraction Algorithm Based on Semantic Clustering and Word Graph Model. Tsinghua Sci. Technol. 2021, 26, 886–893.
Daems, O.; Erkens, M.; Malzahn, N.; Hoppe, H.U. Using Content Analysis and Domain Ontologies to Check Learners’ Understanding of Science Concepts. J. Comput. Educ. 2014, 1, 113–131.
Abyaa, A.; Khalidi Idrissi, M.; Bennani, S. Learner Modelling: Systematic Review of the Literature from the Last 5 Years. Educ. Technol. Res. Dev. 2019, 67, 1105–1143.
Kong, S.C.; Li, P.; Song, Y. Evaluating a Bilingual Text-Mining System with a Taxonomy of Key Words and Hierarchical Visualization for Understanding Learner-Generated Text. ACM J. Educ. Resour. Comput. 2018, 56, 369–395.
Chau, H.; Labutov, I.; Thaker, K.; He, D.; Brusilovsky, P. Automatic Concept Extraction for Domain and Student Modeling in Adaptive Textbooks. Int. J. Artif. Intell. Educ. 2021, 31, 820–846.
Chen, N.-S.; Wei, C.-W.; Chen, H.-J. Mining E-Learning Domain Concept Map from Academic Articles. Comput. Educ. 2008, 50, 1009–1021.
Conde, A.; Larrañaga, M.; Arruarte, A.; Elorriaga, J.A.; Roth, D. Litewi: A Combined Term Extraction and Entity Linking Method for Eliciting Educational Ontologies from Textbooks. J. Assoc. Inf. Sci. Technol. 2016, 67, 380–399.
Peng, X.; Han, C.; Ouyang, F.; Liu, Z. Topic Tracking Model for Analyzing Student-Generated Posts in SPOC Discussion Forums. Int. J. Educ. Technol. High. Educ. 2020, 17, 35.
Zhang, L.; Li, X.-P.; Zhang, F.-B.; Hu, B. Research on Keyword Extraction and Sentiment Orientation Analysis of Educational Texts. J. Comput. 2017, 28, 301–313.
Liu, J.; Shao, X. An Improved Extracting Chinese Term Method Based on C/NC-Value. In Proceedings of the 2010 International Symposium on Intelligence Information Processing and Trusted Computing, Wuhan, China, 28–29 October 2010; IEEE: Piscataway, NJ, USA, 2010.
Mikolov, T.; Yih, W.-T.; Zweig, G. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 746–751.
Pan, L.; Wang, X.; Li, C.; Li, J.; Tang, J. Course Concept Extraction in MOOCS via Embedding-Based Graph Propagation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, 28–30 November 2017; Asian Federation of Natural Language Processing: Volume 1: Long Papers. pp. 875–884.
Albahr, A.; Che, D.; Albahar, M. A Novel Cluster-Based Approach for Keyphrase Extraction from MOOC Video Lectures. Knowl. Inf. Syst. 2021, 63, 1663–1686.
Lu, M.; Wang, Y.; Yu, J.; Du, Y.; Hou, L.; Li, J. Distantly Supervised Course Concept Extraction in MOOCs with Academic Discipline. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023. Volume 1: Long Papers.
Niven, T.; Kao, H.-Y. Probing Neural Network Comprehension of Natural Language Arguments. arXiv 2019, arXiv:1907.07355.
Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The Muppets Straight out of Law School. arXiv 2020, arXiv:2010.02559.
Wang, Y.; Liu, S.; Afzal, N.; Rastegar-Mojarad, M.; Wang, L.; Shen, F.; Kingsbury, P.; Liu, H. A Comparison of Word Embeddings for the Biomedical Natural Language Processing. J. Biomed. Inform. 2018, 87, 12–20.
Clavié, B.; Gal, K. Edubert: Pretrained Deep Language Models for Learning Analytics. arXiv 2019, arXiv:1912.00690.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Information

Subjects: Computer Science, Information Systems

Contributors:

Jingxiu Huang

Ruofei Ding

Xiaomin Wu

Shumin Chen

Jiale Zhang

Lixiang Liu

Yunxiang Zheng

Update Date: 24 Nov 2023

Version	Summary	Created by	Modification	Content Size	Created at
1		Yunxiang Zheng	--	1526	2023-11-23 08:40:10
2	Reference format revised.	Lindsay Dong	Meta information modification	1526	2023-11-24 07:26:50
