2. Reinforcement Learning in Adaptive and Intelligent Educational System (RLATES)
Distance learning has attracted increasing attention in recent years. When students and teachers cannot meet face to face in the same classroom, distance learning is essential. Learning through web resources (text, video, pictures, voice, etc.) or online tutorials provided by teachers can sometimes fulfill the basic requirements of distance learning. However, when questions arise, students cannot always find accurate answers through online resources, and one-to-many teacher–student sessions ultimately fail to provide one key function: adaptive instruction ^{[2]}. One-to-one lessons are more effective and more satisfying than small-group lessons because they allow for personalized teaching strategies, but their high financial cost makes it impossible to extend this approach to all groups of students. Based on this situation, the Adaptive and Intelligent Educational System (AIES) was designed to give each student with access to a computer a personalized teaching strategy, allowing them to benefit from a one-to-one teaching model at a relatively low cost, with each student having their own virtual teacher.
The principle of AIESs is to resequence all course knowledge modules based on student characteristics, and a variety of machine learning techniques are used in the system to learn these characteristics ^{[3]}. If reinforcement learning is introduced into such a system, the student can interact with the system while the system model continuously improves its learning, thus enhancing its performance. An AIES that incorporates reinforcement learning in this way is called an RLATES.
RLATES comprises two models, the knowledge model and the pedagogical strategy model ^{[4]}. In the knowledge model, the content of the teaching is decided, for example, which chapters of the textbook will be covered and which format (video, audio, text, or pictures) will be used for delivery. In the pedagogical strategy model, the teaching strategy is developed, which determines how the material will be delivered.
However, RLATES cannot be used for teaching directly from the start. In the early stage, the model must first be trained on training data so that the system learns which teaching strategy to use for students with different characteristics. Therefore, the whole experimental process should be separated into two phases when designing the system: the training phase and the teaching phase ^{[5]}. Only after the model has been successfully trained can it be deployed in real teaching.
2.1. Current Research
In this section, the current status of research in the domain of intelligent educational systems is presented. Although numerous studies focus on intelligent educational systems similar to AIESs, a literature search shows that only a fraction of them adopted reinforcement learning algorithms. The details are shown in Table 1.
Table 1. Current research on intelligent tutoring systems.
According to Table 1, it can be concluded that, in the domain of intelligent teaching systems, most authors still adopt the classical Q-learning algorithm, since Q-learning is a model-free, off-policy reinforcement learning algorithm that is suitable for implementation in intelligent teaching systems. However, Q-learning has drawbacks: its processing speed is sluggish, and the system response time increases when the Q-table becomes excessively large. Nonetheless, Q-learning is one of the classical reinforcement learning algorithms and is relatively simple to apply in practice compared with other model-free reinforcement learning algorithms, which is probably part of the reason why many authors chose it in their studies.
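To make the model-free, off-policy character of Q-learning concrete, the following minimal sketch shows a single tabular update. The state/action counts and hyperparameter values are illustrative assumptions, not taken from any of the surveyed systems:

```python
import numpy as np

# Hypothetical sizes: 8 learning states, 5 knowledge modules (actions).
N_STATES, N_ACTIONS = 8, 5
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (illustrative values)

Q = np.zeros((N_STATES, N_ACTIONS))  # tabular Q-table

def q_update(s, a, r, s_next):
    """One Q-learning update. It is model-free (no transition model is used)
    and off-policy: the target takes the max over actions in s_next,
    regardless of which action the behaviour policy actually selects."""
    td_target = r + GAMMA * Q[s_next].max()
    Q[s, a] += ALPHA * (td_target - Q[s, a])

# Example transition: in state 0, module 2 was shown, reward 1.0, new state 3.
q_update(s=0, a=2, r=1.0, s_next=3)
```

Note that the Q-table grows with the product of states and actions, which is exactly the scaling problem mentioned above when the table becomes excessively large.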
For the articles listed above, although articles 1–5 all adopt the Q-learning algorithm, their evaluation metrics differ. Most of the authors selected the number of actions, the number of students, or time consumption to evaluate model performance. The most remarkable is article 1, in which the authors develop their own evaluation metric, called PFM. According to the authors’ settings in that article, if PFM ≥ 60, the performance of the model is good; if PFM < 60, it is poor. An assessment using PFM can also indicate the difficulty of the learning content to some extent: poor performance suggests that the learning content is probably relatively difficult. The authors use this metric to compare the three strategies within the article; although it does not permit a horizontal comparison of the model’s performance against other articles, it makes the evaluation results more intuitive and easier to understand.
2.2. Applied Reinforcement Learning in RLATES
Based on the introduction to reinforcement learning, it can be seen that reinforcement learning comprises five main components. In order to apply reinforcement learning to RLATES, it is essential to ensure that the components of the system correspond to each of the five components of reinforcement learning algorithms. Therefore, in this section, the application of reinforcement learning to RLATES is introduced.
First, the following descriptions are given of how the components of RLATES correspond to those of the reinforcement learning algorithm ^{[5]}:

Agent: In RLATES, the agent is the student, who drives the subsequent learning process by interacting with the system; therefore, the student corresponds to the agent in the reinforcement learning algorithm.

Environment: In a broad sense, the environment is the entire knowledge structure of the system, and it collects information on the characteristics of the students and tests their knowledge through exams and quizzes distributed throughout the knowledge modules.

Action: Actions are the selections that an agent needs to take at each step, so in RLATES, the actions correspond to the knowledge modules, each of which represents an action.

State: In reinforcement learning algorithms, the state is what the environment returns after the agent performs an action. In RLATES, the state therefore corresponds to the student’s learning state, i.e., how well the student has mastered the knowledge. A vector is used to store the data, with all state values in the range 0–1: if a piece of knowledge has been fully mastered and correctly understood, its value is set to 1; if it has not been mastered, its value is set to 0.

Reward: In reinforcement learning algorithms, each selection returns a different reward value; similarly, in RLATES, each knowledge module yields a different reward according to its significance. The goal in RLATES is to maximize the cumulative value of this reward.
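The correspondence above can be sketched as a minimal data structure. The module names, reward values, and mastery convention below are illustrative assumptions rather than part of any published RLATES implementation:

```python
from dataclasses import dataclass, field

# Illustrative mapping of RLATES components to RL elements.
# Module names and per-module rewards are hypothetical examples.
MODULES = ["intro", "variables", "loops", "functions"]              # actions
REWARDS = {"intro": 1.0, "variables": 1.0, "loops": 2.0, "functions": 2.0}

@dataclass
class StudentState:
    # One mastery value per knowledge module, each in [0, 1]:
    # 1.0 = fully mastered and correctly understood, 0.0 = not mastered.
    mastery: dict = field(default_factory=lambda: {m: 0.0 for m in MODULES})

    def is_goal(self) -> bool:
        # Goal state: every knowledge module is fully mastered.
        return all(v >= 1.0 for v in self.mastery.values())

s = StudentState()          # a fresh student: nothing mastered yet
s.mastery["intro"] = 1.0    # after working through one module
```

The agent here would be the student interacting with this state, and the environment would be the knowledge structure that updates the mastery vector after quizzes and exams.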
Next, the application of the reinforcement learning algorithm to RLATES is described in Algorithm 1. Coupling the components in RLATES to the elements in the reinforcement learning algorithm yields the following process ^{[4]}^{[5]}:
Algorithm 1 Applying the reinforcement learning algorithm to RLATES

Initialize Q(s, a) for all s ∈ S and a ∈ A
Test the current state of the student’s knowledge, s
Loop for each episode:
    Pick a knowledge module a and show this module to the student, using the ε-greedy policy
    Get the reward r: if the RLATES goal is achieved, a positive r is obtained; otherwise, a null r is obtained
    Test the current state of the student’s knowledge, s′
    Update Q(s, a)
    s ← s′
until s reaches the goal state
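Algorithm 1 can be sketched in Python as follows. The simulated student, module names, and hyperparameter values are assumptions made purely so the loop is runnable, not a faithful model of real learners:

```python
import random

random.seed(0)  # reproducibility for this sketch

MODULES = ["basics", "practice", "advanced"]   # knowledge modules = actions (illustrative)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2          # assumed hyperparameter values

Q = {}  # tabular Q-values: Q[(state, action)] -> float

def choose(state):
    """ε-greedy policy: explore a random module with probability EPSILON,
    otherwise pick the module with the highest current Q-value."""
    if random.random() < EPSILON:
        return random.choice(MODULES)
    return max(MODULES, key=lambda a: Q.get((state, a), 0.0))

def step(state, action):
    """Simulated student (an assumption for this sketch): showing a module
    masters it; the goal state is mastery of all modules. A positive reward
    is given only when the goal is reached, a null reward otherwise."""
    next_state = state | {action}
    done = len(next_state) == len(MODULES)
    return frozenset(next_state), (1.0 if done else 0.0), done

for episode in range(200):
    state = frozenset()                        # initial test: nothing mastered yet
    done = False
    while not done:
        action = choose(state)                             # pick a module (ε-greedy)
        next_state, reward, done = step(state, action)     # show it, get r and s'
        best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in MODULES)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)  # update Q(s, a)
        state = next_state                                 # s ← s'
```

After training, the greedy policy over Q gives the learned module sequence; in a real RLATES, the simulated `step` would be replaced by an actual student taking quizzes and exams.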