Development of Multistage Tests based on Teacher Ratings


Development of Multistage Tests based on Teacher Ratings
Stéphanie Berger¹ ², Jeannette Oostlander¹, Angela Verschoor³, Theo Eggen² ³ & Urs Moser¹
¹ Institute for Educational Evaluation, ² Research Center for Examinations and Certification, University of Twente, ³ CITO
IACAT, Cambridge, 14th to 16th of September 2015

Overview
- Introduction
- Test development
  - Test design
  - Test construction based on teacher ratings
  - Routing rules based on heuristics
- Results
  - Routing
  - Correlation between ratings and item difficulty
  - Information per module and path
  - Reliability
- Discussion and conclusion

Introduction
- Development of standardized tests for secondary school in Northwestern Switzerland
- Assessment of student ability in four different school subjects
- Individual reporting; high stakes
- Target population: secondary school students, grade 8, across three different school types
- Content framework: new Swiss curriculum
- Computer-based assessment

Research Question
- The target population covers a broad ability range, which motivates multistage testing
- New item pool, but no resources for pretesting; teacher ratings serve as an approximation of item difficulty

Questions:
- What are the implications of using teacher ratings instead of pretest data for constructing a multistage test?
- Do teacher ratings allow us to construct a reliable multistage test?

Advantages of Multistage Testing (Yan, Lewis & von Davier, 2014)
- Adaptive optimization of the fit between item difficulty and student ability
- More efficient and precise measurement of student ability compared to linear tests
- Higher control over content balance and test structure compared to fully adaptive tests
- Allows students to navigate and review items within one module
- Reduced test copying compared to linear tests

Multistage Test Design: Mathematics
Double 1-3-3-3 multistage design comprising 252 items, with parallel A/B modules at every position and three difficulty levels (easy, medium, difficult):
- Stage 1 (routing): modules 1A/1B, 9 items each
- Stage 2: modules 2A/2B (easy), 3A/3B (medium), 4A/4B (difficult), 9 items each
- Stage 3: modules 5A/5B (easy), 6A/6B (medium), 7A/7B (difficult), 15 items each
- Stage 4: modules 8A/8B (easy), 9A/9B (medium), 10A/10B (difficult), 15 items each

Practical considerations:
- Testing time: 2 lessons = 90 minutes
- Reduce copying through multiple test versions
- Allow for recovery from inadequate routing
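The nominal panel structure above can be written down as a small data structure. This is an illustrative sketch, not code from the project; it shows one of the two parallel panels (the B modules mirror it):

```python
# One of the two parallel 1-3-3-3 panels described above
# (module names and nominal item counts taken from the slide).
panel = {
    1: [("1A", "routing", 9)],
    2: [("2A", "easy", 9), ("3A", "medium", 9), ("4A", "difficult", 9)],
    3: [("5A", "easy", 15), ("6A", "medium", 15), ("7A", "difficult", 15)],
    4: [("8A", "easy", 15), ("9A", "medium", 15), ("10A", "difficult", 15)],
}

# Items in one panel, and items any single student actually sees
# (one module per stage, whichever path is taken).
items_per_panel = sum(n for stage in panel.values() for _, _, n in stage)
items_per_path = sum(stage[0][2] for stage in panel.values())
print(items_per_panel, items_per_path)  # 126 48
```

Doubling the panel gives the 252 items reported on the slide, while each student answers 48 items within the 90-minute limit.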

Test Construction based on Teacher Ratings
- Teacher ratings of item difficulty
- 6 secondary school teachers from Northwestern Switzerland
- Rating of printed items, including the item key
- Categorization of items into three categories: easy, medium, difficult

Distribution of Items per Module
[Figure: number of items assigned to each module across Stages 1-4]

Routing Rules based on Heuristics
- Routing based on raw score
- Target difficulty per module: p = 0.66
- Predicted mean score: 2/3 of the maximum score
- Predicted SD: 1/6 of the maximum score
- Goal: route an equal share of students along each path
  - 1/3 per path after the routing module and the medium modules: routing based on P33 and P66 of the predicted score
  - 1/2 per path after the easy and difficult modules: routing based on the mean of the predicted score
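Under these heuristics, the cut scores follow from a normal approximation of the raw-score distribution. A minimal sketch (illustrative only; the function name is invented, not from the project):

```python
from statistics import NormalDist

def routing_cutoffs(max_score, n_branches):
    """Cut scores under the heuristic above: predicted mean = 2/3 of the
    maximum score, predicted SD = 1/6 of the maximum score; split at
    P33/P66 for three branches, at the predicted mean for two."""
    mean, sd = 2 * max_score / 3, max_score / 6
    dist = NormalDist(mean, sd)
    if n_branches == 3:
        return dist.inv_cdf(1 / 3), dist.inv_cdf(2 / 3)
    return (mean,)

# Routing module, maximum score 9: predicted mean 6.0, SD 1.5.
p33, p66 = routing_cutoffs(9, 3)
print(round(p33, 2), round(p66, 2))
```

The slide reports P33 = 5.3 and P66 = 6.6 for the routing module; this approximation gives about 5.35 and 6.65, matching to within rounding.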

Routing Rules based on Heuristics (continued)
[Figure: routing diagram with predicted score distributions and raw-score bands per module]
- Stage 1 (routing module, 9 items, max = 9): predicted mean 6.0, SD 1.5; P33 = 5.3, P66 = 6.6; routing bands 0-5 / 6-7 / 8-9
- Stage 2 (9 items per module): easy max = 14 (predicted mean 9.3, SD 2.3), bands 0-9 / 10-14; medium max = 16 (predicted mean 10.7, P33 = 9.5, P66 = 11.8), bands 6-9 / 10-12 / 13-16; difficult max = 18, bands 8-12 / 13-18
- Stage 3 (15 items per module): easy max = 24, bands 0-16 / 17-24; medium max = 29, bands 8-17 / 18-21 / 22-29; difficult max = 33, bands 13-22 / 23-33
- Stage 4 (15 items per module): easy max = 32, medium max = 39, difficult max = 48

Calibration
- Sample: N = 7176 grade-8 students
- Item response model: One Parameter Logistic Model (OPLM; Verhelst & Glas, 1995)
- Item calibration with the OPLM program (Verhelst, Glas & Verstralen, 1995)
- Marginal maximum likelihood (MML) estimation
- Exclusion of 15 items due to poor model fit, low discrimination, or low p-value

Results I: Descriptive Values per Module
(Levels: R = routing, E = easy, M = medium, D = difficult)

Stage  Module  Level  # Items  Mean β   Mean SE(β)  # Obs.  % Obs.  Mean θ
1      1A      R          8    -1.041        .033     3659     51%  -0.538
1      1B      R          8     -.988        .049     3518     49%  -0.512
2      2A      E          9     -.913        .051     1810     25%  -1.135
2      2B      E          9     -.354        .059     1811     25%  -1.162
2      3A      M          8      .390        .060     1239     17%  -0.108
2      3B      M          9      .132        .067     1099     15%  -0.100
2      4A      D          8     1.766        .161      588      8%   0.546
2      4B      D          7      .557        .215      628      9%   0.501
3      5A      E         15     -.480        .076     1969     27%  -1.156
3      5B      E         14     -.207        .073     1884     26%  -1.144
3      6A      M         15     -.157        .064     1348     19%   0.042
3      6B      M         14     -.520        .076     1302     18%   0.041
3      7A      D         15      .364        .117      328      5%   0.892
3      7B      D         13     1.478        .192      322      4%   0.813
4      8A      E         13    -1.070        .057     2052     29%  -1.123
4      8B      E         13     -.323        .067     2087     29%  -1.103
4      9A      M         15     -.143        .067     1346     19%   0.128
4      9B      M         14     -.273        .074     1239     17%   0.098
4      10A     D         15     1.021        .172      296      4%   1.007
4      10B     D         15      .607        .170      298      4%   0.914

Results II: Routing
[Figure: observed routing proportions from modules 1A/B through 7A/B to the subsequent stages]

Results III: Correlation between Ratings and Item Difficulty
- r = 0.44 (n = 220, p < 0.01)
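The reported r is a plain Pearson correlation between the three-level teacher categories (coded numerically, e.g. easy = 1 to difficult = 3) and the calibrated item difficulties. A self-contained sketch with invented toy data (the actual analysis used the n = 220 calibrated items):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Toy data only: teacher categories (easy=1, medium=2, difficult=3)
# against hypothetical calibrated difficulties (beta).
ratings = [1, 1, 2, 2, 3, 3, 1, 3, 2, 1]
betas = [-1.0, -0.4, 0.1, -0.6, 1.2, 0.5, 0.3, 1.8, -0.2, -1.3]
print(round(pearson_r(ratings, betas), 2))
```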

Results IV: Information per Module
[Figure: test information functions per module]

Results V: Information per Path
[Figure: test information functions per path]

Results VI: Test Reliability
Simulation:
- Item parameters from the calibration
- 50 000 simulees drawn from N(mean = -0.546, SD = 0.890)
- Estimated reliability: ρ = Var(T) / Var(X) = Var(θ) / Var(θ̂)

                     Mean test length  Mean test score  Estimated reliability  Test length for comparable reliability
Multistage test                  44.9             22.0                   0.90                                    35.8
Random linear test               45.0             18.5                   0.87                                    56.5
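The reliability estimate can be reproduced in outline: draw simulees from the ability distribution, simulate item responses, estimate each ability, and take Var(θ)/Var(θ̂). The sketch below substitutes a plain Rasch model with made-up difficulties for the calibrated OPLM parameters and uses 2 000 simulees instead of 50 000, so its value will not exactly match the 0.90/0.87 on the slide:

```python
import math
import random

random.seed(1)

def rasch_ml_theta(responses, betas, iters=25):
    """Maximum-likelihood Rasch ability estimate via Newton-Raphson."""
    score = sum(responses)
    # Perfect/zero scores have no finite ML estimate; nudge them inward.
    score = min(max(score, 0.5), len(betas) - 0.5)
    theta = 0.0
    for _ in range(iters):
        probs = [1 / (1 + math.exp(b - theta)) for b in betas]
        info = sum(p * (1 - p) for p in probs)
        theta += (score - sum(probs)) / info
    return theta

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Made-up linear test: 45 items with difficulties spread over [-2, 2].
betas = [-2 + 4 * i / 44 for i in range(45)]
thetas = [random.gauss(-0.546, 0.890) for _ in range(2000)]

theta_hats = []
for theta in thetas:
    resp = [int(random.random() < 1 / (1 + math.exp(b - theta))) for b in betas]
    theta_hats.append(rasch_ml_theta(resp, betas))

# True-score variance over observed-score variance, as on the slide.
rho = variance(thetas) / variance(theta_hats)
print(round(rho, 2))
```

Because θ̂ contains measurement error, Var(θ̂) exceeds Var(θ) and the ratio lands below 1; longer or better-targeted tests push it upward, which is the mechanism behind the multistage test's advantage over the random linear test.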

Discussion & Conclusion
- Moderate correlation between teacher ratings and estimated item difficulty
- General underestimation of item difficulty by the teachers
- Multistage item collection designs involve a risk of an unbalanced number of observations per module
- Higher reliability of the multistage test compared to a random linear test

Questions and Discussion
Contact: Stephanie.Berger@ibe.uzh.ch

References
Verhelst, N. D., Glas, C. A. W., & Verstralen, H. H. F. M. (1995). OPLM: One Parameter Logistic Model [Computer program]. Arnhem: CITO.
Verhelst, N. D., & Glas, C. A. W. (1995). The One Parameter Logistic Model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York, NY: Springer.
Yan, D., Lewis, C., & von Davier, A. A. (2014). Overview of computerized multistage tests. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 3-20). Boca Raton, FL: CRC Press.