In the modern educational landscape, the ability to predict and understand student performance is paramount. It not only provides educators with valuable insights into learning patterns but also allows students to recognize areas of improvement, fostering a conducive learning environment.
Numerous scholarly investigations have examined diverse aspects of student performance. However, rapid advances in machine learning offer a novel perspective, holding the potential to deliver more precise and nuanced forecasts. The present research, grounded in an empirical framework, aims to exploit these advances, with a specific emphasis on evaluating the effectiveness of various machine learning models in predicting academic outcomes. By contrasting classic prediction methods with state-of-the-art machine learning models, it seeks to identify the most effective approaches to academic forecasting. The following sections describe the methodology employed, the datasets used, and the insights obtained, building a full picture of the role and promise of machine learning in forecasting student outcomes.
Students’ performance prediction and students’ academic analytics are two of the most active research topics in the educational literature. Despite having distinct goals, performance analysis has a substantial impact on prediction research. Learning analytics, according to the common definition, is the process of collecting and analyzing information on students and their surroundings with the goal of improving both [1]. The field of educational data mining is closely related to learning analytics: educational data mining is concerned with the application of data mining, machine learning (ML), and statistics to information generated and gathered within educational settings in order to discover knowledge about how people learn, while learning analytics applies the same tools to such information with a focus on understanding and improving learning itself [2]. Statistical techniques alone may be insufficient for reliably linking individual variables to outcomes [3]. Supervisors and learners alike may benefit from the groundbreaking insights that can be gleaned by applying more sophisticated algorithms [4,5]. The rapid development of data mining tools has motivated many investigators to dig further into the knowledge discovery process, and a few have been using data mining techniques for this purpose for quite some time [6]; at present, however, such investigations remain scarce. This effort to enhance the quality of learning is facilitated by the increasing accessibility of digital data gathered in recent years by various academic resource management platforms and educational tools [7,8,9].
1.1. Learning Management Systems and Students’ Academic Analytics
Nowadays, Learning Management Systems (LMSs) and Open Course Ware (OCW) platforms play a pivotal role in the application and exploitation of learning analytics because they are designed to host educational materials (open and web-based materials in the case of OCW platforms), typically organized as courses. OCW platforms, for example, offer adaptable learning environments and enable the automated gathering of learning data such as student activity and performance, while also including course preparation materials and assessment tools. Thus, when properly mined using data-mining methods, this massive trove of educational data can serve as a source of essential information for educators at all levels. Data, mostly from online education environments, that may be used to enhance students’ learning have become accessible thanks to the widespread adoption of digital equipment in the classroom. Learning analytics have been applied in a wide variety of settings and organizations [10,11].
Learning analytics may be interpreted and used in a number of ways, including as a tool for analyzing students’ performance and habits in comparison with those of their peers. More specifically, learning analytics may be applied in a number of contexts, such as delivering specific indicators of student progress or enabling the automatic online customization of course modules. Predicting how well a student will do on an exam or other kind of assessment is one of the most common and beneficial uses of learning analytics. This is thought to be especially useful for identifying students “at risk” of dropping out or failing a course in order to offer them extra help or tutoring in a timely manner (e.g., to identify students who are likely to fail an end-of-semester test). This is particularly crucial when a large number of students are enrolled in a course offered through distance learning. Beyond that, it can be used to provide individualized learning plans and assessments to students according to their prior performance and areas of interest. Teachers can benefit from the data gleaned from these learning analytics by learning which courses and teaching programs need improvement, adaptation, or the development of new curriculum offerings, and then hiring staff or planning interventions to better support students (either individually or in groups).
There has been a great deal of research published on the topic of student test performance analytics. Students’ test performance was the primary focus of these investigations, and the task was framed as a classification problem by dividing students into “pass” and “fail” groups; the goal was to identify which students were most likely to fail a given class. Kotsiantis et al. [12] used machine learning methods such as Naive Bayes (NB) and k-Nearest Neighbors (kNN) to classify students as dropouts or not. Their findings revealed that students at risk of dropping out may be identified from their demographic information alone or, more succinctly, their background. Possible EDM goals were outlined and research reported between 1995 and 2005 was consolidated in [13]. More recently, in 2010, the same authors [14] released another comprehensive assessment covering approximately 300 studies from 1993 to 2009. This framework categorizes EDM research into eleven distinct categories, one of which is the forecasting of student achievement. By that time, learning analytics as a field had also taken off. Analytical investigations in education were initially driven by two research communities: the worldwide academic data mining society and the society for educational analytics research. One representative study predicted students’ final exam marks in an online class [15]; its authors evaluated three distinct ways of classifying students’ performance: two classes (“pass” and “fail”); three classes (“high”, “middle”, and “poor”); and nine classes.
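To make the pass/fail formulation above concrete, the following is a minimal sketch of binary classification with NB and kNN in Python with scikit-learn. It is not the pipeline of [12] or [15]; the CSV file and column names are hypothetical placeholders, and features are assumed to be numeric (categorical attributes would need encoding first).

```python
# Minimal sketch of "pass"/"fail" classification with NB and kNN.
# "students.csv" and the "passed" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("students.csv")            # hypothetical dataset
X = df.drop(columns=["passed"])             # demographic/background features
y = df["passed"]                            # 1 = pass, 0 = fail

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=3)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, f"accuracy = {acc:.3f}")
```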
Although EDM approaches have been used by academics in both conventional and computer-based online education, applications to the former are somewhat less common than to the latter. In line with this, the researchers behind two EDM review publications [14,16] found very little research focusing on the conventional schooling model. Furthermore, the individual EDM surveys of student achievement all pool findings from both online and conventional classroom-based learning; it is worth noting that no survey paper on EDM specifically addressing classroom education could be found. Prior surveys of student performance have focused primarily on predictors, methodologies, and prediction efficiency. None, however, considers the passage of time, even though predictions made before and after a course begins may serve two entirely different functions. In one study, Shih and Lee [17] used the kNN algorithm to estimate the appropriateness of the study materials presented to pupils. The findings of these analyses generally demonstrated that various ML models produced excellent results and performed comparably well, although outcomes varied depending on the kind of input data used. Furthermore, when it comes to maximizing the effectiveness of a given ML approach, the published research has not yet offered a thorough examination of what constitutes the best possible input dataset.
1.2. Complexity of the Learning Process and the Role of Machine Learning
The domain of education, known for its complex and diverse characteristics, presents difficulties when examined solely from an efficiency standpoint. The process of learning is not merely a straightforward progression from a state of ignorance to one of knowledge; rather, it is characterized by a diverse array of personal encounters, obstacles, and moments of enlightenment.
It is of utmost importance to underscore that the objective of this study is not to categorize students into binary groups based solely on their performance in examinations. The objective, rather, is to utilize data in order to extract valuable insights that can contribute to the comprehensive learning experience of students. Machine learning offers a powerful framework for predictive analytics. However, it is imperative to adopt a critical perspective when interpreting its outcomes, recognizing that it mostly captures trends and patterns rather than encompassing the totality of an individual’s educational progress.
2. Data Mining Techniques for Students’ Performance Predictive Analysis
Predicting a student’s performance is one of the most essential elements in anticipating students’ career prospects. The prediction assists not only students but also the various agencies that require efficiency in academic management, such as student retention, admissions, and alumni relations, as well as more precise and relevant advertising and promotion. School-based intervention initiatives can also benefit from the prediction of children potentially at risk.
2.1. Risk Prediction in Student Performance Employing Machine Learning
In order to study student behavior in VLE systems, Kuzilek [19] relied on General Unary Hypotheses Automaton (GUHA) analysis along with Markov Chain-based exploration. The collection covered thirteen situations. This evaluation relied on a dataset that included (a) students’ grades for assigned work and (b) a record of students’ activities inside a virtual learning environment (VLE). The LISP-Miner program was used to carry out the analysis. The authors determined that either approach would provide useful results when mining the dataset, and that a graphical model built on the Markov Chain can make the information more readily apparent. The patterns extracted with these techniques support intervention strategies: students’ future academic success may be forecast from data on their past behaviors.
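As an illustration of the Markov Chain-based exploration mentioned above, the sketch below estimates a first-order transition matrix from VLE activity sequences. The state labels and traces are invented for illustration and do not come from the dataset of [19].

```python
# Sketch: estimating a first-order Markov Chain over VLE activity states.
# The state labels and the two traces are hypothetical examples.
from collections import defaultdict

sequences = [                    # one activity trace per student (illustrative)
    ["login", "content", "quiz", "forum", "quiz"],
    ["login", "forum", "content", "quiz"],
]

counts = defaultdict(lambda: defaultdict(int))
for seq in sequences:
    for a, b in zip(seq, seq[1:]):           # consecutive state pairs
        counts[a][b] += 1

# Row-normalize the counts into transition probabilities P(next | current).
transitions = {
    a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
    for a, nxt in counts.items()
}
print(transitions["login"])      # e.g. {'content': 0.5, 'forum': 0.5}
```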
He et al. [20] addressed risk detection in massive open online courses. They presented two transfer learning algorithms: LR-SEQ for sequential data and LR-SIM for simultaneous data. DisOpt1 and DisOpt2 data were used to test and compare the effectiveness of the suggested algorithms. Compared with the original Logistic Regression (LR) model, LR-SIM achieved a higher AUC value in the first week, making it the clear winner over LR-SEQ. This finding indicated that useful forecasts can be made very early in a course.
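The comparison in [20] hinges on the AUC computed from first-week data. The sketch below shows that evaluation protocol with a plain Logistic Regression on synthetic features; the LR-SEQ/LR-SIM transfer-learning step itself is not reproduced here.

```python
# Sketch: a plain Logistic Regression baseline scored by AUC, the metric
# used in [20]. Features and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                         # week-1 activity features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # 1 = at-risk student

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("week-1 AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```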
Kovacic [21], utilizing ML methods, investigated the possibility of predicting student progress in advance. The review analyzed the relationship between socio-demographic factors (such as education, employment, gender, marital status, and disability) and instructional factors (such as course program and course block) for the sake of accurate forecasting. These dataset characteristics were gathered from the Open Polytechnic of New Zealand. Feature-selection methods were applied to zero in on the most important factors influencing student performance. The analysis concluded that course program, ethnicity, and course block are the most significant factors influencing students’ performance.
While attempting to predict how students in an introductory programming course would perform, Watson [24] took their activity records into account. This analysis suggested using indirect, automatically assessed criteria to track student achievement over time. The authors created an evaluation method for student programs called WATWIN, which awards points for different kinds of work; a student’s score was based on their responsiveness to, and speed in fixing, programming problems.
In contrast to other research, Marbouti [28] studied prediction models to detect underperforming students in a standards-based education setting. They used data from the first-year engineering program at a Midwestern US university in 2013 and 2014 and applied feature selection techniques to reduce the feature space. The student progress data included quiz scores, homework grades, teamwork marks, project checkpoints, quantitative modeling tasks, and exam results. Six classifiers were chosen in this study: kNN, LR, NB, SVM, Multi-Layer Perceptron (MLP), and Decision Tree (DT). Overall accuracy, pass-rate accuracy, and failure-rate accuracy were used to assess the performance of these classifiers. As part of the feature selection approach, only features with a Pearson’s correlation coefficient greater than 0.3 were employed in the prediction procedure. NB models trained with 16 features performed better (88 percent accuracy).
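The correlation-based filter described above is straightforward to sketch: keep only features whose Pearson correlation with the pass/fail label exceeds the 0.3 threshold, then train a classifier. The file and column names are hypothetical, and taking the absolute value of the coefficient is an assumption on our part.

```python
# Sketch of a Pearson-correlation feature filter (threshold 0.3, as in [28])
# followed by NB training. "grades.csv" and "passed" are hypothetical names.
import pandas as pd
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("grades.csv")                 # hypothetical gradebook export
y = df["passed"]
X = df.drop(columns=["passed"])

corr = X.apply(lambda col: col.corr(y))        # Pearson r per feature
selected = corr[corr.abs() > 0.3].index        # |r| > 0.3 is our assumption
print("kept features:", list(selected))

model = GaussianNB().fit(X[selected], y)
```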
Additionally, Iqbal [29] predicted learners’ GPA utilizing collaborative filtering (CF), matrix factorization (MF), and Restricted Boltzmann Machines (RBM) in a comparable assessment. This analysis relied on data gathered from Information Technology University (ITU) in Lahore, Pakistan. A feedback model was presented to determine the extent to which a student had grasped the material covered in a given class. In addition, the authors proposed a method for tuning a Hidden Markov Model (HMM) so that it may be used to foretell how well students would perform in a given class. The experimental data were divided into a 70% training set and a 30% testing set, and the ML-based models were ranked by Root-Mean-Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE). On this dataset, RBM performed well, producing low MSE, MAE, and RMSE values.
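The evaluation protocol named above (70/30 split scored by RMSE, MSE, and MAE) can be sketched as follows; a plain ridge regressor on synthetic data stands in for the CF/MF/RBM models of [29].

```python
# Sketch of a 70/30 split with RMSE/MSE/MAE scoring for GPA regression.
# Data are synthetic; Ridge is a stand-in model, not the method of [29].
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))                                # course features
y = 2.0 + 0.5 * X[:, 0] + rng.normal(scale=0.2, size=300)    # synthetic GPA

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
pred = Ridge().fit(X_tr, y_tr).predict(X_te)

mse = mean_squared_error(y_te, pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_te, pred))
```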
Using AutoML, Zeineddine [36] improved the accuracy of student performance prediction by capitalizing on characteristics collected before the commencement of the new curriculum. With AutoML, they were able to reduce the false prediction rate while maintaining an accuracy of 75.9% and a Kappa value of 0.5. They concluded that researchers in this area should use AutoML to find the best possible learner performance prediction model, particularly when starting from pre-start data. To support students in imminent need of assistance, they recommended using pre-admission data to initiate intervention and consultation sessions before students begin their academic progress. Because of the uneven distribution of the available data, they used the SMOTE pre-processing approach together with automatically generated ensemble methods to make accurate predictions about which students would fail. The authors acknowledged SMOTE’s overgeneralization weakness and suggested ways to mitigate the imbalanced-data issue without resorting to a less-than-ideal artificially balanced dataset.
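A minimal sketch of the SMOTE oversampling step mentioned above, using the imbalanced-learn package on a synthetic imbalanced dataset; the AutoML model search of [36] is not reproduced.

```python
# Sketch: SMOTE oversampling of an imbalanced pass/fail dataset.
# Requires the imbalanced-learn package; data are synthetic.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in: roughly 90% pass vs. 10% fail.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                  # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after :", Counter(y_res))              # classes balanced 1:1
```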
Using past data, Bueno-Fernández [37] advocated using ML techniques to predict learners’ final grades. They used information gathered from the computer engineering departments of Ecuadorian universities. The collection of a large amount of high-quality data was the primary objective of this study; their strategy produced a wealth of information that, with proper processing, could be repurposed into several beneficial tools for the field of education. The study offered a unique method for pre-processing and clustering students with similar patterns. A wide variety of supervised learning techniques were then applied to identify the students whose patterns were most similar and to forecast their final grades. The findings from the ML approaches were then compared with the state of the art. The authors reported 91.5% accuracy using ensemble methodologies, demonstrating the efficacy of ML approaches for predicting students’ performance.
Many academics, Reddy and Rohith noted, had used sophisticated ML algorithms to accurately forecast student performance; nevertheless, they had failed to provide any helpful suggestions for students who were struggling. To overcome this barrier, Reddy and Rohith set out to uncover which individual factors can predict a student’s poor performance in a classroom setting. With the help of DT, SVM, Gradient Boosting (GB), and Random Forest (RF), data from the University of Minnesota were analyzed. They reported an accuracy of over 75% in determining which students would fail a given term, based on a set of characteristics broad enough to apply to all of them.
2.2. Student Dropout
Alhusban [39] used an ML study to track and improve retention rates among college learners. The variables assessed for Al al-Bayt University students included gender, enrollment type, entrance marks, place of birth, nationality, marital status, and the courses studied during elementary and secondary school. They used Hadoop, an open-source distributed data-processing platform, since the large number of features made the sample dataset large. The results of the admissions exam were found to have a substantial impact on the chosen field of study. Furthermore, they argued that particular sexes tend to dominate certain spheres, such as the medical field, where many more females than males choose to specialize. They further argued that pupils’ socioeconomic backgrounds had an impact on their academic success. Finally, they reported that single learners outperformed their married counterparts.
Yukselturk [40] reviewed data acquired via online surveys in order to evaluate data mining approaches for dropout prediction. The online questionnaire consisted of ten sections, covering age, profession, gender, self-efficacy, education level, prior knowledge, coverage, previous online experience, and locus of control. A total of 189 students participated. The investigation made use of four different ML models and a genetic-algorithm-based feature selection strategy. According to the findings, 3NN (kNN with k = 3) was the most effective classifier, with an accuracy rate of 87%.
The authors of [41] provided some insight into the role of temporal characteristics in the prediction of student dropout. Using information derived from student quiz scores and data collected from networking forums through the Canvas API, the temporal features were able to capture the changing patterns of student performance over time. The attributes retrieved from the data included active days, number of module views, number of forum views, dropout week, number of quiz views, number of discussion posts, and social network degree. Bayesian Networks (BN) and Decision Trees (DT) were used as the classification strategies.
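The attribute list above maps naturally onto a feature table. The sketch below assembles such a table with illustrative values and fits a Decision Tree, one of the two classifier families used in [41]; the Canvas API extraction step is not reproduced, and all values and labels are invented.

```python
# Sketch: the temporal attributes of [41] as a feature table + Decision Tree.
# All values and dropout labels are illustrative placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = pd.DataFrame({
    "active_days":      [5, 1, 4],
    "module_views":     [40, 3, 25],
    "forum_views":      [12, 0, 7],
    "quiz_views":       [6, 1, 4],
    "discussion_posts": [3, 0, 2],
    "network_degree":   [4, 0, 2],
})
dropped_out = [0, 1, 0]                       # illustrative labels

clf = DecisionTreeClassifier(max_depth=3).fit(features, dropped_out)
print(clf.predict(features))
```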
Liang and Zhen [42] examined information about students’ classroom participation to calculate the likelihood of student withdrawal in the subsequent days. The proposed framework comprised gathering information from the XuetangX platform, pre-processing it, extracting and selecting features, and applying machine learning techniques. The XuetangX online learning dataset covered 39 Open-Edx-based courses, with student actions and behaviors recorded over a 40-day period. The raw log data were first processed in order to train the ML algorithms. A total of 121 characteristics were retrieved, grouped into three categories: users, courses, and enrollments. The dataset was then split into a training set with 120,054 cases and a testing set with 80,360 instances. GBT, SVM, and RF classifiers were utilized, with the GBT classifier achieving a large average area under the curve (AUC). Two types of information were used to forecast pupils’ future performance: (a) static information and (b) information that changes over time. Thaker [43] presented a framework for adaptable textbooks based on a dynamic model of student knowledge. The dynamic learner performance data reportedly included learner success and failure diaries compiled as learners engaged with the LMS. Learner interaction sessions with an e-learning platform are an example of dynamic data, whose features vary over time: dynamic learner performance data are continuously updated, whereas static data, such as enrollment and demographic information, are obtained just once. The suggested framework takes students’ reading habits and quiz scores into account to estimate their current state of knowledge. The framework adds two more sophisticated variants of the standard Behavioral Model (BM): the Behavior-Performance Model (BPM) and the Individualized Behavior-Performance Model (IBPM). The suggested models were implemented using the Feature Aware Student Knowledge Tracing (FAST) tool and outperformed the standard Behavior Model in terms of RMSE and AUC.
Carlos [44] introduced an ML classification model for predicting learner performance; this model incorporates a data gathering strategy to glean information about student learning and behavior from instructional settings. The support vector machine (SVM) algorithm was used to divide students into three groups by performance: high, medium, and low. The author gathered information from 336 pupils across 61 dimensions. The first experiment used only behavioral characteristics for classification, the second only learning features, the third combined learning and behavioral features, and the fourth used only selected features for predicting student success. Predictions of student performance over a ten-week period were based on the eight general behavioral variables and 53 learning factors included in the dataset. The findings demonstrated that the classifier’s performance improved each week as more data were collected. Moreover, by week 10, combining behavioral and learning variables yielded a classification accuracy of 74.10%.
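A minimal sketch of the three-class (high/medium/low) SVM set-up described above, with synthetic stand-ins for the 336 students and 61 behavioral/learning features; it is not the pipeline of [44].

```python
# Sketch: three-class SVM classification of student performance.
# Features and labels are synthetic stand-ins for the data of [44].
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
X = rng.normal(size=(336, 61))                 # 336 students x 61 features
y = rng.integers(0, 3, size=336)               # 0 = low, 1 = medium, 2 = high

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)        # SVC handles multiclass natively
print("accuracy:", accuracy_score(y_te, svm.predict(X_te)))
```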