Student Modeling Method Integrating Knowledge Tracing and IRT with Decay Effect


Shinichi Oeda (1) and Kouta Asai (2)

(1) Department of Information and Computer Engineering, National Institute of Technology, Kisarazu College, 11-1, Kiyomidaihigashi 2-chome, Kisarazu City, Chiba, Japan, oeda@j.kisarazu.ac.jp
(2) Advanced Control and Information Engineering Course, National Institute of Technology, Kisarazu College

Kouta Asai is currently with NIFTY Corporation, Human Resources Department, Shinjuku Front Tower 21-1, Kita-shinjuku 2-chome, Shinjuku-ku, Tokyo, Japan, asai.kota@nifty.co.jp

Abstract. Educational data mining (EDM) involves the application of data mining, machine learning, and statistics to information generated from educational settings. Modeling students' knowledge is a fundamental part of intelligent tutoring systems. One of the most popular methods for estimating students' knowledge is knowledge tracing. It is the de facto standard for inferring students' knowledge from performance data. The goal of this study is to estimate future student performance from massive amounts of examination results. We propose a novel method to improve the precision of student modeling using knowledge tracing with item response theory, including the decay theory of forgetting.

Keywords: Educational data mining, knowledge tracing, item response theory, hidden Markov model, decay theory

1 Introduction

Intelligent tutoring systems (ITS) and learning management systems (LMS) are widely used in education and allow us to collect log data from learners such as students. Educational data mining (EDM) aims to discover useful information from the massive amounts of electronic data collected by these educational systems. EDM is an emerging multidisciplinary research area in which methods and techniques for exploring data originating from various educational information systems have been developed [1]. One of the goals of EDM is student modeling, which is a key factor for automated tutoring systems in making instructional decisions.

The purpose of student modeling is to estimate students' skills and to predict, from log data such as examination results, whether a student will solve an item. One of the most popular methods for estimating student knowledge is knowledge tracing [2]. It is the de facto standard for inferring students' knowledge from performance data. An ITS provides an efficient learning environment by assigning items suited to a student's skill level, and it employs a student model to do so. To create a high-performance ITS, a student model is needed that can predict students' answers and estimate the state of their skills.

However, knowledge tracing does not consider the decay theory of forgetting, whereby human memory fades over time. Conventional knowledge tracing methods cannot handle the decay effect because it is difficult to estimate the model parameters when the forgetting process is included. To understand learning effects in the educational process, it is important to study how the distribution of students' latent skills changes over time. We address this issue by incorporating the decay effect through item response theory. In this paper, we propose a novel method to improve the precision of student modeling using knowledge tracing with item response theory, including the decay theory of forgetting.

2 Knowledge Tracing

Knowledge tracing was developed in 1995 and has since become established as a well-known method of student modeling. Figure 1 shows a graphical model of knowledge tracing in plate notation. A question item in an examination requires several skills to solve. In the diagram, $t$ is a learning opportunity, $k_t$ is a latent variable representing the student's skill state (mastered or not mastered), and $y_t$ is an observed variable representing the result (correct or incorrect) of the student's response. Knowledge tracing is represented as a hidden Markov model, since a student's skill states are not observed while the student's results are observed.
In knowledge tracing, four parameters $P(L_0)$, $P(T)$, $P(G)$, $P(S)$ are defined for each skill as follows:

$$P(L_0) \overset{\mathrm{def}}{=} P(k_0 = \mathrm{true}) \quad \text{(already know)}, \tag{1}$$
$$P(T) \overset{\mathrm{def}}{=} P(k_t = \mathrm{true} \mid k_{t-1} = \mathrm{false}) \quad \text{(learn)}, \tag{2}$$
$$P(G) \overset{\mathrm{def}}{=} P(y_t = \mathrm{true} \mid k_t = \mathrm{false}) \quad \text{(guess)}, \tag{3}$$
$$P(S) \overset{\mathrm{def}}{=} P(y_t = \mathrm{false} \mid k_t = \mathrm{true}) \quad \text{(slip)}. \tag{4}$$

$P(L_0)$ is the initial probability of knowing a skill a priori, that is, the probability that a student has learned how to apply a knowledge component prior to the first opportunity to apply it in the ITS. $P(T)$ is the probability of a student's knowledge of a skill transitioning from the not-known to the known state after an opportunity to apply it. Here, knowledge tracing assumes that a student does not forget a skill once it has been mastered. Accordingly, the probability of a skill transition from mastered to not mastered is zero. $P(G)$ is the probability of correctly applying an unknown skill, and $P(S)$ is the probability of making a mistake when applying a known skill.

[Fig. 1. Knowledge tracing. Graphical model in which student knowledge ($k_0$, $k_t$) is linked by the "already know" and "learn" transitions and produces student performance ($y_0$, $y_t$) through "guess or slip".]

Given that the parameters $P(L_0)$, $P(T)$, $P(G)$, $P(S)$ are set for all skills, a student's knowledge of each skill is updated from the results of the student's answers up to opportunity $t$ using Equations (5) to (8):

$$P(L_t = \mathrm{true} \mid y_t = \mathrm{true}) = \frac{P(L_t)(1 - P(S))}{P(L_t)(1 - P(S)) + (1 - P(L_t))P(G)}, \tag{5}$$
$$P(L_t = \mathrm{true} \mid y_t = \mathrm{false}) = \frac{P(L_t)P(S)}{P(L_t)P(S) + (1 - P(L_t))(1 - P(G))}, \tag{6}$$
$$P(L_{t+1} = \mathrm{true}) = P(L_t \mid y_t) + (1 - P(L_t \mid y_t))P(T), \tag{7}$$
$$P(y_{t+1} = \mathrm{true}) = P(L_{t+1})(1 - P(S)) + (1 - P(L_{t+1}))P(G). \tag{8}$$

Equations (5) and (6) update a skill state from the answer at opportunity $t$. The skill state at the next opportunity $t + 1$ is calculated by Equation (7) from the updated value given by Equation (5) or (6). Moreover, the probability that a student can answer an assigned item at $t + 1$ is calculated by Equation (8) from the value derived in Equation (7).

2.1 Estimation of parameters

In the knowledge tracing model, the four parameters $P(L_0)$, $P(T)$, $P(G)$, $P(S)$ per skill are unknown. Although these parameters can be defined by an expert, they are generally estimated from past data. We can estimate these parameters with the Baum-Welch algorithm [3], since knowledge tracing is a hidden Markov model.
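The update cycle of Equations (5) to (8) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `KTParams`, `posterior_known`, `next_known`, `predict_correct`, and `trace` are hypothetical, the parameter values in the usage note are arbitrary, and parameter estimation (Section 2.1) is not covered.

```python
from dataclasses import dataclass


@dataclass
class KTParams:
    """Per-skill knowledge tracing parameters (hypothetical container)."""
    p_l0: float  # P(L_0): probability the skill is already known
    p_t: float   # P(T): probability of learning at an opportunity
    p_g: float   # P(G): probability of a correct guess when not mastered
    p_s: float   # P(S): probability of a slip when mastered


def posterior_known(p_l, correct, prm):
    """Equations (5) and (6): update P(L_t) after observing the answer y_t."""
    if correct:
        num = p_l * (1 - prm.p_s)
        den = num + (1 - p_l) * prm.p_g
    else:
        num = p_l * prm.p_s
        den = num + (1 - p_l) * (1 - prm.p_g)
    return num / den


def next_known(p_post, prm):
    """Equation (7): apply the learning transition (no forgetting)."""
    return p_post + (1 - p_post) * prm.p_t


def predict_correct(p_l, prm):
    """Equation (8): probability of a correct answer given P(L)."""
    return p_l * (1 - prm.p_s) + (1 - p_l) * prm.p_g


def trace(responses, prm):
    """Return the prediction made before each response and the final P(L)."""
    p_l = prm.p_l0
    predictions = []
    for correct in responses:
        predictions.append(predict_correct(p_l, prm))
        p_l = next_known(posterior_known(p_l, correct, prm), prm)
    return predictions, p_l
```

For example, `trace([True, True, False], KTParams(0.5, 0.1, 0.2, 0.1))` returns one predicted probability per response together with the final estimate of mastery. The sketch assumes the denominators in Equations (5) and (6) are non-zero, which holds whenever the guess and slip probabilities lie strictly between 0 and 1.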

3 Item Response Theory

3.1 Overview of model

Item response theory (IRT) [4] is the study of examination and item scores based on assumptions concerning the mathematical relationship between a latent ability and item responses. An IRT model predicts the probability that a certain student will give a certain response to a certain item. Students can have different levels of ability, and items can differ in many respects. In IRT, the Rasch model applies a logistic function to the ability variable to explain examinees' item responses as follows:

$$P_{ij}(y = \mathrm{true}) = \frac{1}{1 + \exp(-1.7(\theta_i - \beta_j))}, \tag{9}$$

where index $i$ indicates a student, $j$ indicates an item, $\theta_i$ is the ability parameter of student $i$, and $\beta_j$ is the difficulty parameter of item $j$. Variable $\theta_i$ is considered the ability required to perform well on question items. The item response function gives the probability that a student with a given ability level will answer a question correctly. Students with lower ability have less of a chance, whereas those with higher ability are more likely to answer correctly.

3.2 Estimation of parameters

The common estimation methods for IRT are joint maximum likelihood estimation, marginal maximum likelihood estimation, and Bayesian estimation. Joint maximum likelihood becomes difficult to calculate as the number of students increases. Marginal maximum likelihood estimation overcomes this issue by marginalizing out the student parameters. However, it does not work when the results are all correct or all incorrect. In this paper, we use Bayesian estimation because it solves the above problems. Although Bayesian estimation can be solved analytically for a simple model like the Rasch model in Equation (9), it cannot be solved analytically for the more complex models that follow. We therefore use the Markov chain Monte Carlo method, which can estimate the parameters of a complex model.
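As a small illustration of Equation (9), the Rasch response probability can be written as a single function. The function name `rasch` and the sample values in the usage note are assumptions made for this sketch, and the parameter estimation discussed in Section 3.2 is not shown.

```python
import math


def rasch(theta, beta, scale=1.7):
    """Equation (9): probability that a student with ability `theta`
    answers an item with difficulty `beta` correctly.

    `scale` is the constant 1.7 that appears in Equation (9).
    """
    return 1.0 / (1.0 + math.exp(-scale * (theta - beta)))
```

When ability equals difficulty, `rasch(0.0, 0.0)` gives 0.5. Raising `theta` above `beta` pushes the probability toward 1, and lowering it pushes the probability toward 0, which matches the description of the item response function above.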
4 Related Work

4.1 Rasch model with forgetting

Lindsey et al. extended the Rasch model with a theory of forgetting [5]. Their model, based on Equation (9), is given by Equation (10):

$$P_{ij}(y = \mathrm{true}) = \frac{(1 + h t_{ij})^{-\exp(\tilde{\theta}_i - \tilde{\beta}_j)}}{1 + \exp(-1.7(\theta_i - \beta_j))}, \tag{10}$$

where $t_{ij}$ indicates the elapsed time between the initial presentation of item $j$ to student $i$ and a later recall test, $\tilde{\theta}_i$ indicates a forgetting parameter for student $i$, $\tilde{\beta}_j$ indicates a forgetting parameter for item $j$, and $h$ is a scaling parameter. The Rasch model with forgetting takes into account the elapsed time and the forgetting parameters. It is believed that human memory decays over time. Because the model in Equation (10) incorporates elapsed time, the probability of a correct response decreases with time.

4.2 Combination of knowledge tracing and IRT

Khajah et al. developed a method that combines knowledge tracing and the Rasch model in Equation (9), and it yielded a higher prediction accuracy than previous methods [6]. We describe how the two models are combined. Equation (8) for knowledge tracing is rearranged as Equation (11):

$$P(y_t \mid y^{(t-1)}) = \sum_{l \in \{\text{mastered},\ \text{not mastered}\}} P(y_t \mid k_t = l)\, P(k_t = l \mid y^{(t-1)}), \tag{11}$$

where $y^{(t-1)} = y_0 \ldots y_{t-1}$. The term $P(y_t \mid k_t = l)$ on the right-hand side of Equation (11) represents slip and guess. This term is replaced with the Rasch model as follows:

$$P(y_t \mid y^{(t-1)}) = \sum_{l \in \{\text{mastered},\ \text{not mastered}\}} \mathrm{Rasch}(\theta_{it}, \beta_{jt}, c_l)\, P(k_t = l \mid y^{(t-1)}). \tag{12}$$

In Equation (12), the Rasch model takes an additional parameter $c_l$. Although IRT has no slip and guess parameters, the state-dependent parameter $c_l$ is added to the Rasch model of Equation (9), which yields Equation (13):

$$\mathrm{Rasch}(\cdot) = c_l + \frac{1 - c_l}{1 + \exp(-1.7(\theta_i - \beta_j))}. \tag{13}$$

5 Proposed Method

In this paper, we propose a method that combines knowledge tracing and the Rasch model with forgetting in order to improve prediction accuracy. In the proposed model, we replace the Rasch function in Equation (12) with the Rasch model with forgetting in Equation (10). We similarly add the parameter $c_l$ to Equation (10).
The combined model can be represented as follows:

$$P(y_t \mid y^{(t-1)}) = \sum_{l \in \{\text{mastered},\ \text{not mastered}\}} \mathrm{RF}(\theta_{it}, \beta_{jt}, c_l)\, P(k_t = l \mid y^{(t-1)}), \tag{14}$$

$$\mathrm{RF}(\cdot) = c_l + \frac{(1 + h t_{ij})^{-\exp(\tilde{\theta}_i - \tilde{\beta}_j)} - c_l}{1 + \exp(-1.7(\theta_i - \beta_j))}. \tag{15}$$

We employed Bayesian estimation to estimate the parameters of the model, as in Section 3.2. Rather than a simple Bayesian model, we applied a hierarchical Bayesian model, which has hyperprior distributions.

6 Experiments

6.1 Overview of experiments

We conducted two experiments to evaluate the proposed model. Each dataset was divided into training and test data. The training data were used to fit the parameters of the model and the test data to assess its generalization error. We verified whether the proposed method could predict if a given answer by a student was correct. We compared our method with two others: (i) original knowledge tracing, and (ii) the method represented in Equation (12). The proposed method is given by Equations (14) and (15).

We employed AUC (area under the curve) and RMSE (root mean squared error) as evaluation measures. AUC is a metric for a two-class prediction problem; its value is 1 if the prediction is completely correct and 0.5 if the prediction is random. RMSE is a metric for numerical predictions; its value represents the difference between the values predicted by a model and those observed. In short, a high-performance model has an AUC close to 1 and an RMSE close to 0.

6.2 Dataset

In this experiment, we applied the three methods to two datasets: synthetic data and Bridge to Algebra 2006-2007 [7]. Table 1 presents an overview of each dataset.

Table 1. Details of datasets.

Dataset     Records   Students   Items   Skills
Synthetic   200,000   1,000      25      5
Algebra     225,880   1,127      612     114

(1) Synthetic data
We employed IRT to generate the synthetic data. We assumed that if an item was assigned to a student once, the student's skill in solving the item increased. To add a decay effect, we calculated the retention interval between the initial presentation of an item to a student and a later recall assignment.
If the elapsed time was long, the student's skill in solving the item decreased.

(2) Bridge to Algebra 2006-2007
This dataset was used in the KDD Cup 2010 Educational Data Mining Challenge and consists of actual data from an e-learning system. We omitted items that have fewer than 200 records and items requiring a defined skill to be solved.
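The prediction step of the proposed model and the two evaluation measures of Section 6.1 can be sketched as follows. The sketch rests on stated assumptions: Equation (15) is read as RF = c_l + ((1 + h t)^(-exp(theta_f - beta_f)) - c_l) / (1 + exp(-1.7 (theta - beta))), where `theta_f` and `beta_f` stand for the student and item forgetting parameters; all function and argument names are hypothetical; the AUC is computed by direct pairwise comparison; and the Bayesian parameter estimation used in the paper is not shown.

```python
import math


def rasch_forgetting(theta, beta, theta_f, beta_f, c, t, h):
    """Assumed reading of Equation (15) for one knowledge state.

    theta, beta      : ability and difficulty
    theta_f, beta_f  : student and item forgetting parameters
    c                : state-dependent parameter c_l
    t, h             : elapsed time and its scaling parameter
    """
    decay = (1.0 + h * t) ** (-math.exp(theta_f - beta_f))
    return c + (decay - c) / (1.0 + math.exp(-1.7 * (theta - beta)))


def predict_response(p_mastered, c_mastered, c_not_mastered,
                     theta, beta, theta_f, beta_f, t, h):
    """Equation (14): mix the two knowledge states by P(k_t = l | history)."""
    rf_m = rasch_forgetting(theta, beta, theta_f, beta_f, c_mastered, t, h)
    rf_n = rasch_forgetting(theta, beta, theta_f, beta_f, c_not_mastered, t, h)
    return p_mastered * rf_m + (1.0 - p_mastered) * rf_n


def rmse(y_true, y_prob):
    """Root mean squared error between 0/1 outcomes and predictions."""
    squared = [(p - y) ** 2 for y, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))


def auc(y_true, y_prob):
    """AUC as the fraction of (correct, incorrect) pairs ranked in the
    right order, counting ties as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

With `t = 0` the decay factor equals 1, so under this reading `rasch_forgetting` reduces to the form of Equation (13); larger elapsed times lower the predicted probability. The pairwise AUC is quadratic in the number of records, so for datasets of the size in Table 1 a rank-based implementation would be the more practical choice.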

[Fig. 2. Results with synthetic data: (a) AUC and (b) RMSE for the Original, Previous, and Proposed methods.]

6.3 Results

(1) Synthetic data
Figure 2 shows the prediction results of each method for the synthetic data. From left to right, the graphs show (i) the original method (knowledge tracing), (ii) the previous method (knowledge tracing and IRT), and (iii) the proposed method (knowledge tracing and IRT with forgetting). In Figure 2(a), the AUC values of the previous method and the proposed method were greater than that of original knowledge tracing. There was no significant difference between the previous method and the proposed method. However, the RMSE values in Figure 2(b) show that the proposed method has better predictive ability than the previous methods. Therefore, the results indicate that the proposed method was the most effective.

(2) Bridge to Algebra 2006-2007
Figure 3(a) shows the prediction results of each method on actual data. The proposed method yielded the best performance, although there was only a slight difference between the results of the proposed method and the previous method. However, the RMSE of the proposed method was lower than that of the previous method in Figure 3(b).

7 Conclusion

In this paper, we proposed a novel combination of knowledge tracing and IRT with a decay effect in order to improve the previous method. The proposed approach showed promising effectiveness on real-world datasets.

[Fig. 3. Results of Bridge to Algebra 2006-2007: (a) AUC and (b) RMSE for the Original, Previous, and Proposed methods.]

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP16K01095.

References

1. T. Calders, M. Pechenizkiy, Introduction to The Special Section on Educational Data Mining, SIGKDD, Vol. 13, Issue 2, pp. 3-5, 2011.
2. A. T. Corbett, J. R. Anderson, Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge, User Modeling and User-Adapted Interaction, 4(4), pp. 253-278, 1995.
3. S. E. Levinson, L. R. Rabiner, M. M. Sondhi, An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition, Bell System Technical Journal, Vol. 62, Issue 4, pp. 1035-1074, 1983.
4. Wim J. van der Linden, Ronald K. Hambleton, Handbook of Modern Item Response Theory, Springer, 1996.
5. R. V. Lindsey, M. C. Mozer, Predicting Individual Differences in Student Learning via Collaborative Filtering, Submitted, 2014.
6. M. Khajah, Y. Huang, J. P. González-Brenes, M. C. Mozer, P. Brusilovsky, Integrating Knowledge Tracing and Item Response Theory: A Tale of Two Frameworks, Proceedings of Workshop on Personalization Approaches in Learning Environments (PALE2014) at the 22nd International Conference on User Modeling, Adaptation, and Personalization, pp. 7-12, 2014.
7. J. Stamper, A. Niculescu-Mizil, S. Ritter, G. J. Gordon, K. R. Koedinger, Bridge to Algebra 2006-2007, Development data set from KDD Cup 2010 Educational Data Mining Challenge, (http://pslcdatashop.web.cmu.edu/kddcup/downloads.jsp).