Can a Machine Learn to Teach?

Brandon Rule (5356629)
December 6

1 Introduction

Computers have the extraordinary ability to recall a lookup table with perfect accuracy given a single presentation. We humans are not so fortunate. To learn, we must see the entries of a lookup table many times. However, it is neither sufficient nor efficient to simply see the entries several times in one sitting. We must repeatedly be reminded of an entry at spaced intervals. The spacing is not arbitrary: if we wait too long, we forget the entry; not long enough, and we waste time on a familiar entry. The spacing is also not constant: more difficult entries must be reviewed more frequently. The process of learning a lookup table in this manner is called spaced repetition.

The goal of spaced repetition is to maximize the number of lookup table entries stored in a student's memory at a given time. However, it is not possible to have complete confidence that any particular entry is known, so we consider two alternatives:

Goal 1: Maximize the expected number of entries known at a given time.

Goal 2: Maximize the number of entries that we are highly confident the student knows at a given time.

It is not clear that either goal is superior in all circumstances. If the student wants to score well on a simple knowledge retrieval test, then we might argue that we should target the first goal, because this would maximize the expected score on the exam. On the other hand, if the lookup table consists of the vocabulary of a language, then it might be better to target the second goal, since the first may be prone to leaving the student with a vocabulary of partially known words across a range of topics, rendering her unable to speak fluently about any single one.
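
To make the contrast concrete, here is a minimal sketch in Python (our illustration; the scoring functions and the recall probabilities are hypothetical, not taken from the program studied later) of how the two goals can rank the same state of knowledge differently:

    def expected_known(probs):
        """Goal 1: the expected number of entries the student knows."""
        return sum(probs)

    def confident_known(probs, gamma=0.9):
        """Goal 2: the number of entries known with confidence at least gamma."""
        return sum(1 for p in probs if p >= gamma)

    # One recall probability per lookup table entry.
    broad_but_shaky = [0.6, 0.6, 0.6, 0.6, 0.6]
    narrow_but_solid = [0.95, 0.95, 0.95, 0.10, 0.10]

    print(round(expected_known(broad_but_shaky), 2), confident_known(broad_but_shaky))    # 3.0 0
    print(round(expected_known(narrow_but_solid), 2), confident_known(narrow_but_solid))  # 3.05 3

Goal 1 barely distinguishes the two states, while Goal 2 strongly prefers the narrow but solid vocabulary.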

2 Our Model

To clarify the problem, we specify a probabilistic model. We are given a set of students S = {a, b, c, ...} and a lookup table T = {(x_1, y_1), ..., (x_n, y_n)}. Here the set S is arbitrary, and the x_k and y_k are also arbitrary. The reader may take x_k and y_k to be numbers, words, names, or any other objects that a person might be interested in committing to memory. Associated with each student and entry is a history H = (t_0, t_1, ...) ∈ R^N of times (more accurately, a sequence of elements of an affine space acted on by R), indicating when the student is exposed to the given entry. For example, suppose Adam is trying to learn Spanish, and has seen the flashcard "Hello → Hola" at 7:00pm, 8:00pm and 10:00pm. In this case, we could represent Adam as student a, the flashcard as entry ("Hello", "Hola"), and the history as H = (7, 8, 10).

We model the experiment of testing student a on entry (x_k, y_k) at time t given history H using a Bernoulli random variable X whose probability is a function of a, k, t and H. We set X = 1 if student a knows entry (x_k, y_k) at time t given history H, and X = 0 otherwise. We denote the probability that X = 1 by f(a, k, t, H). In symbols, we have

    f(a, k, t, H)  :=  Pr(X = 1; a, k, t, H).

With our new definitions, we see that the task is to construct a history for each student and entry, given our knowledge of the outcomes of a series of Bernoulli experiments. We thus restate Goal 1 as follows. Given student a, vocabulary T, and time t, find

    argmax_H  sum_{k=1}^{n} E[X; a, k, t, H]  =  argmax_H  sum_{k=1}^{n} f(a, k, t, H).

Goal 2 can be stated using an additional parameter γ indicating what we mean by "highly confident." For example, we might say we are highly confident that a student knows an entry if we believe there is at least a 90% chance that she knows it; in this case, we would set γ = 0.9. Given γ, a and t, our goal is to find

    argmax_H  sum_{k=1}^{n} 1{ f(a, k, t, H) ≥ γ }.

For this project, we focus on the latter goal.

3 Existing Solution

Our data was collected by a program used by a single student to learn the language Xhosa. In this case, the entries of the lookup table consisted of pairs of words giving the translation from English to Xhosa, for example ("Dog", "Inja"). The program uses a simple algorithm intended to maximize the number of entries with confidence greater than 90%.

For each word, the program keeps track of the student's past performance. For example, if at a given point in time the student has been presented with a given word five times, answering incorrectly the first two times and correctly the last three, the student's performance on the word would be (0, 0, 1, 1, 1). The algorithm associates with a given history a feature called the word's streak, defined to be the value and length of the longest constant suffix of the history. For example, the history (0, 0, 1, 1, 1) would have streak (1, 3). Intuitively, this says that the student has answered the word correctly the past three times in a row. The history (1, 0, 0) would have streak (0, 2), indicating that she has answered incorrectly the past two times in a row.

Associated with each type of streak that has occurred, the program stores a number indicating how many milliseconds it should wait before presenting the student with any word having that streak. For example, if the student has history (0, 1, 1) for a particular word, the student last saw the word at 8:00pm, and the program has a time of 1 hour associated with the streak type (1, 2), then the student will be scheduled to see the given word again at 9:00pm. Note that the repetition interval selected by the algorithm is purely a function of the streak of a particular word, taking no other features of the word or its history into account.

In order to target the goal of maximizing the number of words with confidence above 90%, the program tunes the times associated with the various streak types as follows. Whenever the student answers correctly after a given streak type, the time associated with that streak type is multiplied by 1.1; whenever she answers incorrectly, it is multiplied by 1.1^-9. Thus, if a student is answering correctly after a given streak type 90% of the time, then on average, out of every 10 answers, 9 will be correct and 1 will be incorrect, so the time will be multiplied by 1.1^9 · 1.1^-9 = 1, causing the time to oscillate. If she is answering more than 90% of the words correctly, the time will increase until it starts to oscillate; similarly, it will decrease if she answers correctly less than 90% of the time.
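
The streak feature is simple to state in code. Here is a minimal sketch (our own; we do not have the original program's source):

    def streak(history):
        """Return the (value, length) of the longest constant suffix of a 0/1 history."""
        value, length = history[-1], 1
        # Walk backwards while the answers match the most recent one.
        for answer in reversed(history[:-1]):
            if answer != value:
                break
            length += 1
        return (value, length)

    assert streak((0, 0, 1, 1, 1)) == (1, 3)  # three correct in a row
    assert streak((1, 0, 0)) == (0, 2)        # two incorrect in a row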

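The tuning rule is equally compact. Below is a minimal sketch, assuming update factors of 1.1 on a correct answer and 1.1^-9 on an incorrect one, the values implied by the equilibrium condition above; the per-streak-type structure is what matters:

    UP = 1.1          # correct answer: grow the wait time for this streak type
    DOWN = 1.1 ** -9  # incorrect answer: shrink it, so nine ups and one down cancel

    def update_wait(wait_ms, correct):
        """Nudge the wait time stored for a streak type after one answer."""
        return wait_ms * (UP if correct else DOWN)

    wait = 3_600_000  # one hour, in milliseconds
    for correct in (1, 1, 1, 1, 1, 1, 1, 1, 1, 0):  # a stretch at exactly 90% accuracy
        wait = update_wait(wait, correct)
    print(round(wait))  # ~3600000: at 90% accuracy the interval oscillates rather than drifts
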
The data demonstrates that this technique appears to work well: in a history consisting of 6,744 answers, we observed that the student answered words correctly 88.3% of the time, on average. However, this model takes into account only a single feature of the word: its current streak. It makes no distinction between, for example, the histories (1) and (0, 0, 0, 0, 1), both of which have streak (1, 1). We decided to investigate the impact of other features on the probability of answering a word correctly.

4 Testing other features

Given our data of 6,744 answers across 964 words, with times selected by the algorithm described in the previous section, we trained a logistic regression model to predict whether the student will answer a word correctly given the word's history, testing the predictive capabilities of various features. However, it wasn't possible to treat all histories uniformly, because the way the times were selected was not uniform. For instance, we initially attempted to find a correlation between the time since last seeing a word and the probability of answering correctly, and it was difficult to find any correlation. This was to be expected, however, because the times were carefully selected by an algorithm to target a 90% probability of answering correctly. To overcome this bias, we split the data according to streak types. This way, within a single streak type, there is no bias in how the time was selected.

We then tried various features to determine which might have an impact on the probability of answering correctly. Although we tried more than a dozen features, only a few ended up being predictive. We give seven here, though as we'll see in the data, not all of them were particularly predictive:

- The time since the student last saw the word
- The number of times the student has answered the word incorrectly
- The number of times the student has answered the word correctly
- The longest streak of incorrect answers the student has had for the given word
- The number of times the student has answered the given word incorrectly after a streak of the current type
- An exponentially weighted count of the times the student has answered the current word correctly: answering correctly the previous time counts for 1, the time before for γ, before that γ^2, and so on. We found γ = 0.8 to be most effective (sketched below).
- An exponentially weighted sum of the amounts of time the student has gone between viewings of the word while still answering it correctly

We tested the features using 70%/30% hold-out cross validation, using the area under the ROC curve as our metric. To select features for a particular streak type, we used forward search.
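
The exponentially weighted count from the list above can be made concrete in a few lines (a sketch of our own, using the γ = 0.8 we found most effective):

    def exp_weighted_correct(history, gamma=0.8):
        """Exponentially weighted count of correct answers in a 0/1 history.

        The most recent answer has weight 1, the one before it gamma, then
        gamma**2, and so on, so recent performance dominates old performance.
        """
        return sum(answer * gamma ** i for i, answer in enumerate(reversed(history)))

    # Two histories with the same raw count of correct answers:
    print(round(exp_weighted_correct((1, 1, 0, 0)), 3))  # 1.152  (the correct answers are stale)
    print(round(exp_weighted_correct((0, 0, 1, 1)), 3))  # 1.8    (the correct answers are recent)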

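The report does not name its tooling; as one plausible rendering only, the per-streak-type evaluation could look like the following scikit-learn sketch, fitting one model per streak type and scoring it by held-out ROC AUC (the demo data is synthetic):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def auc_for_streak_type(X, y, seed=0):
        """Fit logistic regression on one streak type's answers; return held-out AUC.

        X holds one row of features per answer observed after this streak type;
        y is 1 where the student answered correctly and 0 where she did not.
        """
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        model = LogisticRegression().fit(X_tr, y_tr)
        return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    # Synthetic demo data standing in for one streak type's answers.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
    print(auc_for_streak_type(X, y))

Forward search then greedily adds whichever remaining feature most improves this score.
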
5 Results

We present our results in Figure 5.1. We note that for different streak lengths, different features tend to be more predictive. For short correct or incorrect streaks, we see that the exponentially weighted count of correct answers, as well as the longest wrong streak, tends to be indicative, while for long correct streaks, the simple count of total wrong answers for the word tends to be most indicative.

6 Future work

In future work, we'd like to incorporate the features we tested into a new model for selecting the times at which to show a word. It would also be interesting to attempt to come up with a model that optimizes Goal 1, the expected number of words known. Finally, it would be useful to collect data that is not influenced by a selection algorithm, since this would allow us to test whether the streak length itself is a good feature to use.

[Figure 5.1: ROC curves (true positive rate vs. false positive rate) for the different streak types. Panels: (a) wrong streak of 1, (b) wrong streak of 2, (c) right streak of 1, (d) right streak of 2, (e) right streak of 3, (f) right streak of 4, (g) right streak of 5; (h) legend, with one curve per feature: time, exp time, past streak, exp count, correct, wrong, and wrong streak.]