Automatically Detecting Incoherent Written Math Answers of Fourth-Graders


Please note this is a comparison between Version 1 by Roberto Araya and Version 2 by Camila Xu.

Arguing and communicating are basic skills in the mathematics curriculum. Making arguments in written form facilitates rigorous reasoning. It also allows peers to review the arguments and give feedback on them.

written short answers
incoherent answer detection
natural language processing

1. Introduction

Arguing and communicating are basic skills in the mathematics curriculum. For example, the U.S. Common Core State Standards for Mathematics (CCSSM) [1] state that students should “construct viable arguments and critique the reasoning of others”. According to the CCSSM, mathematically proficient students should “understand and use stated assumptions, definitions, and previously established results in constructing arguments”. Elementary students should be able to “construct arguments using concrete referents such as objects, drawings, diagrams, and actions. Such arguments can make sense and be correct, even though they are not generalized or made formal until later grades”. The CCSSM establishes that all students at all grades should be able to “listen or read the arguments of others, decide whether they make sense, and ask useful questions to clarify or improve the arguments”. In Chile, the National Standards for Mathematics also include arguing and communicating as one of the four core mathematics skills for all grades. An extensive literature supports the inclusion of these elements in the mathematics curricula of different countries, emphasizing the importance of developing the ability to argue and communicate in mathematics. For example, [2] states that students in grades 3 to 5 should learn to create general arguments and learn to critique their own and others’ reasoning. Mathematics instruction should help students learn to communicate solutions with teachers and with peers [3,4].

However, this is not an easy task. For example, an analysis of third-grade German textbooks found that no more than 5–10% of all textbook tasks ask for reasoning [5]. The same situation happens in other countries. In Chile, for example, textbooks and national standardized tests do not have explicit reasoning or communication questions.

On the other hand, arguing and communicating in writing has several advantages over doing so only verbally. It allows students to reason immediately and visually about the correctness of their solution [6]. It also supports reasoning and the building of extended chains of arguments. Writing facilitates critique of the reasoning of others [1], reviewing the argumentation of peers, and receiving feedback from them. Writing in mathematics can also serve many other purposes [7].

Developing arguing and communication competencies in mathematics for elementary school classrooms is a great challenge. If, in addition, the teacher wants the students to do so in writing, then there are several additional implementation challenges. It requires at least two conditions: all students should be able to write their answers and all should be able to comment on answers written by peers. At the same time, they should receive immediate feedback.

One solution is to use online platforms. The teacher poses a question and receives the students' answers in real time. Giving feedback should take one or two minutes. This is possible since fourth graders in our population typically write answers of eight to nine words. However, reviewing the answers on a notebook or smartphone is very demanding for the teacher, who must review 30 answers in real time. To facilitate this review, the first task is to automate the detection of incoherent answers. These answers can reflect a negative attitude on the part of the student. They can also show an intention not to respond. Thus, automatic detection enables the teacher to immediately ask the students to correct them.

Some incoherent answers are obvious, for example, “jajajajah”. Others are more complex: detecting them requires some degree of understanding of both the question and the answer, and the ability to compare them.
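The obvious case can be illustrated with a minimal sketch. It is not the detection method studied in this entry; it is only a hypothetical heuristic that flags answers dominated by a short letter pattern repeated several times in a row, such as “jajajajah”. The function names, the pattern length (1 to 3 letters), and the 0.5 threshold are assumptions chosen for illustration.

```python
import re

# A chunk of 1-3 letters repeated at least three times in a row,
# e.g. "jajaja" or "aaaa". Digits are excluded on purpose so that
# numeric answers such as "1000" are not flagged.
REPEATED_CHUNK = re.compile(r"([^\W\d_]{1,3}?)\1{2,}")


def repeated_fraction(answer: str) -> float:
    """Fraction of the answer's characters (ignoring spaces) inside a repeated run."""
    words = answer.lower().split()
    total = sum(len(word) for word in words)
    if total == 0:
        return 0.0
    covered = sum(
        len(match.group(0))
        for word in words
        for match in REPEATED_CHUNK.finditer(word)
    )
    return covered / total


def looks_obviously_incoherent(answer: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: flag the answer if repeated runs cover most of it."""
    return repeated_fraction(answer) >= threshold


print(looks_obviously_incoherent("jajajajah"))
print(looks_obviously_incoherent("the answer is 12 because 3 times 4 is 12"))
```

A rule like this only catches the trivial cases. The more complex incoherent answers described above still require comparing the answer with the question.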

2. Automatically Detecting Incoherent Written Math Answers of Fourth-Graders

Ref. [8] reports the building of a question classifier for six types of questions: Yes-No questions (confirmation questions), Wh-questions (factoid questions), choice questions, hypothetical questions, causal questions, and list questions, for 12 types of term categories, such as health, sports, arts, and entertainment. To classify question types, the authors use a sentence representation based on grammatical attributes. Using domain-specific types of common nouns, numerals, and proper nouns together with ML algorithms, they produced a classifier with 90.1% accuracy. Ref. [9] combined lexical, syntactic, and semantic features to build question classifiers. They classified questions into six broad classes, and each of these into several more refined types. The authors applied nearest neighbors (NN), naïve Bayes (NB), and support vector machine (SVM) algorithms, using bag-of-words and bag-of-n-grams. They obtained 96.2% and 91.1% accuracy for coarse- and fine-grained question classification, respectively.

Some authors use the BERT model to represent questions as vectors and create classifiers. These representations have obtained outstanding results in text classification when compared to traditional machine learning [10]. Ref. [11] reported the development of a BERT-based classifier for agricultural questions relating to the Common Crop Disease Question Dataset (CCDQD). The classifier obtained an accuracy of 92.46%, a precision of 92.59%, a recall of 91.26%, and an F1 score (the harmonic mean of precision and recall) of 91.92%. The authors found that the BERT-based fine-tuning classifier had a simpler structure, fewer parameters, and a higher speed than the other two classifiers tested on the CCDQD: the bidirectional long short-term memory (Bi-LSTM) self-attention network classification model and the Transformer classification model. Ref. [12] reported the use of a Swedish database with 5500 training questions and 500 test questions. The taxonomy was hierarchical, with six coarse-grained classes of questions: location, human, description, entity, abbreviation, and number. It also included 50 fine-grained classes of questions. Two BERT-based classifiers were built. Both classifiers outperformed human classification.

Ref. [13] reported the building of an SVM model to classify questions into 11 classes: advantage/disadvantage, cause and effect, comparison, definition, example, explanation, identification, list, opinion, rationale, and significance. The authors tested the classifiers on a sample of 1000 open-ended questions that they either created or obtained from various textbooks.

In answer classification, there have been several reported studies, although not for answer coherence. The authors of ref. [14] used NLP algorithms to assess language production in e-mail messages sent by elementary students on an online tutoring system during the course of a year. They found that lexical and syntactic features were significant predictors of math success. In their work, the students did not answer questions. Therefore, it was not possible to verify whether the answers were coherent. The authors of ref. [15] analyzed 477 written justifications of 243 third, fourth, and sixth graders, and found that these could be accounted for by a one-dimensional construct with regard to a model of reasoning. However, they did not code incoherent responses. The absence of incoherent answers may be due to the nature of that project. In a small research project, the behavior of the students is different: in their handwritten answers, the students wrote only coherent answers. In contrast, in a large-scale project where students write every week, students may behave differently. Ref. [16] explored the impact of misspelled words (MSW) on automated computer scoring systems in the context of scientific explanations. The results showed that, while English language learners (ELLs) produced twice as many MSW as non-ELLs, MSW were relatively uncommon in the corpora. They found that MSW in the corpora were an important feature of computer scoring models. Linguistic and concept redundancy in student responses explained the weak connection between MSW and scoring accuracy. This study focused on the impact of poorly written responses but did not examine answers that may have been incoherent or irrelevant to the open-ended questions.

There is an extensive literature on automated short answer grading (ASAG). In ref. [17], it was found that automated scoring systems with simple hand-crafted feature extraction were able to accurately assess the coherence of written responses to open-ended questions. The study revealed that a training sample of 800 or more human-scored student responses per question was necessary to accurately construct scoring models, and that there was nearly perfect agreement between human and computer-automated scoring based on both holistic and analytic scores. These results indicate that automated scoring systems can provide feedback to students and guide science instruction on argumentation. The authors of ref. [18] identified two main challenges intrinsic to the ASAG task: (1) students may express the same concept or intent through different words, sentence structures, and grammatical orders, and (2) it can be difficult to distinguish between nonsense and relevant answers, as well as attempts to fool the system. Existing methods may not be able to account for these problem cases, highlighting the importance of including such considerations in automated scoring systems.

Finally, NLP has had a significant impact on text classification in educational data mining, particularly in the context of question and answer classification [18]. Shallow models, such as machine learning algorithms using bag-of-words and bag-of-n-grams representations, have been employed to classify question types with high accuracy. Deep learning models, including BERT-based classifiers [19,20], have shown outstanding performance in representing and classifying questions, outperforming traditional machine learning approaches [18]. Ensemble models, such as XGBoost classifiers [21], have also been utilized for question classification, achieving impressive results [22].
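To make the shallow approach concrete, the following is a minimal sketch of a question-type classifier that combines a bag of word n-grams with naïve Bayes. It does not reproduce the classifiers of the cited works. The training questions are invented for illustration, the three labels are borrowed from the question types listed above, and the add-one smoothing, the helper `ngram_features`, and the class `NaiveBayes` are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, n_max=2):
    """Return the bag of word n-grams (n = 1..n_max) of a lowercased text."""
    tokens = text.lower().replace("?", " ?").split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngram_features(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in ngram_features(text):
                if feat in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training questions, invented for illustration only.
train = [
    ("Is 12 an even number?", "yes-no"),
    ("Are the two fractions equal?", "yes-no"),
    ("Does the triangle have a right angle?", "yes-no"),
    ("What is the perimeter of the square?", "wh"),
    ("How many apples are left?", "wh"),
    ("Why is the result larger?", "wh"),
    ("Is 3/4 or 2/3 the larger fraction?", "choice"),
    ("Would you use addition or subtraction?", "choice"),
    ("Is the answer 10 or 12?", "choice"),
]
model = NaiveBayes().fit(*zip(*train))

print(model.predict("What is the area of the rectangle?"))  # wh
print(model.predict("Is the answer 7 or 8?"))               # choice
```

With only nine training questions, the predictions depend heavily on a few shared n-grams such as "what is" or "or". The accuracies reported in the literature come from far larger datasets and richer feature sets.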