Evidence for Reliability, Validity and Learning Effectiveness

Introduction

Pearson Knowledge Technologies has conducted a large number and wide variety of reliability and validity studies for both the Intelligent Essay Assessor (IEA) and Summary Street components of WriteToLearn. Although comparable studies have produced roughly comparable results, this paper focuses on studies conducted in the last two years so as to best represent the current status of the products.

The KAT Engine

The technology underlying Summary Street, IEA, and WriteToLearn is based on Knowledge Analysis Technologies (KAT), including Pearson Knowledge Technologies' unique implementation of Latent Semantic Analysis (LSA), an approach that is trained to measure the semantic similarity of words and passages by analyzing large bodies of relevant text. LSA can then closely approximate the degree of similarity in meaning between two texts as judged by human readers. This ability has been documented in a number of top-ranked refereed professional journals (e.g., Landauer & Dumais, 1997, in Psychological Review). (An illustrative sketch of this kind of similarity computation appears below.)

Summary Street

Summary Street has been evaluated in a number of language arts and subject-matter classrooms. Across several smaller studies and experiments, it has been shown to measurably increase students' summarization, general writing, and reading comprehension skills. Although Summary Street and WriteToLearn cannot provide the detailed and elaborated critiques that teachers can, they do provide the opportunity for much more evaluated reading and writing practice than can the typical teacher, who often teaches more than 100 students a day. Thus Summary Street increases the actual reading and writing practice that literacy researchers have shown to be the most important determiner of achievement.

University of Colorado IERI Research Results

Researchers from the University of Colorado (CU) Institute of Cognitive Science evaluated Summary Street over the course of a five-year study funded by the Interagency Education Research Initiative (IERI), a collaborative effort sponsored by the National Science Foundation, the U.S. Department of Education and the National Institutes of Health. The IERI project at CU included: controlled experiments comparing users and non-users of Summary Street; system-measured progress in writing summaries that met content quality and English and language arts requirements; and data from state tests of English and language arts by students in field trials involving a large number of diverse schools, including Title 1 schools, randomly assigned to be given Summary Street or not during the third and fourth years of the program.
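Before turning to the study results, here is a minimal, hypothetical sketch of the kind of LSA-style similarity computation described under The KAT Engine above. It is not the KAT engine or its training corpus: it builds a small term-document matrix, reduces it with a truncated SVD, and compares two new texts by cosine similarity in the reduced space. The tiny corpus, the tf-idf weighting, and the two-dimensional space are placeholder choices.

```python
# Illustrative sketch of LSA-style similarity scoring (not the KAT engine itself).
# Assumes scikit-learn and numpy are available; the corpus and k=2 are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

background_corpus = [
    "The water cycle moves water between oceans, air, and land.",
    "Evaporation turns liquid water into vapor that rises into the atmosphere.",
    "Clouds form when water vapor condenses around tiny particles.",
    "Precipitation returns water to the surface as rain or snow.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(background_corpus)      # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # reduced "semantic" space
svd.fit(X)

def semantic_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts after projection into the reduced space."""
    a, b = svd.transform(vectorizer.transform([text_a, text_b]))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(semantic_similarity(
    "Rain and snow bring water back down to the ground.",
    "A student summary about precipitation returning water to the surface.",
))
```

In a production system the background corpus would be large and domain-relevant (for example, the reading material a summary is based on) and the number of retained dimensions far higher; the point here is only the shape of the computation.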

A white paper detailing the findings, entitled "Building Student Summarization, Writing and Reading Comprehension Skills with Guided Practice and Automated Feedback: Highlights from Research at the University of Colorado," can be found on the WriteToLearn.net website. Here is a brief summary.

In the randomized controlled experiment, students in Colorado middle schools wrote a series of four summaries of increasingly advanced readings over four days in a two-week period. In some schools they used Summary Street; in others they did not. All summaries were blind scored, with high reliability, by two teachers. The experimental-group students, especially those in the lower 75 percent of the control summary scores, showed significantly greater growth in summarizing ability. In only four weeks of practice, the students improved their content summary scores by an overall effect size of d = 0.9 (see Figure 1). For a class of mixed-ability students, a student scoring at the fiftieth percentile improved his or her writing performance with difficult materials to the eighty-second percentile. When the performance of low- and medium-ability students (the lower 75 percent of the distribution) is considered, the effect size was d = 1.5 for the most difficult materials. (An effect size of 1.0 corresponds to approximately a one-grade difference in elementary school. An effect size of 1.5 is extremely rare in educational research.)

[Figure 1: Effect Size for Content Scores for Students of Different Ability Levels. The figure plots effect size (Cohen's d) against difficulty of assignment (Text 1 through Text 4) for two ability levels: medium + low and high.]

Figure 1. Effect sizes for summary content scores for low- and medium-ability (bottom 75 percent) vs. high-ability (top 25 percent) students. Based on 120 eighth-grade students randomly assigned to either a control group with no feedback or an experimental group.
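The effect sizes quoted above are standardized mean differences (Cohen's d), and the move from the fiftieth to roughly the eighty-second percentile follows from d = 0.9 under a normal model. Below is a minimal sketch of both calculations; the score arrays are made-up placeholders, not the study's data.

```python
# Illustrative sketch: Cohen's d and its conversion to a percentile shift.
# The score arrays are made-up placeholders, not the study's data.
import numpy as np
from scipy.stats import norm

def cohens_d(treatment: np.ndarray, control: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * treatment.var(ddof=1) +
                  (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
control = rng.normal(loc=3.0, scale=1.0, size=60)     # placeholder content scores, no feedback
treatment = rng.normal(loc=3.9, scale=1.0, size=60)   # placeholder content scores, with feedback

d = cohens_d(treatment, control)
# Under a normal model, a student at the control group's 50th percentile moves to roughly
# the norm.cdf(d) percentile of the control distribution; d = 0.9 gives about the 82nd.
print(f"d = {d:.2f}; 50th percentile -> {norm.cdf(d) * 100:.0f}th percentile")
```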

As corroborating evidence, independent scorers found increases not only in students' ability to condense and abstract, but also in measures of organization and style. Repeated, guided practice led to improvement in writing ability across a broad range of measures. In the analysis of the long-term data, the before-and-after summary performance of matched classes showed that the increase in performance endured, even when students wrote without the support of Summary Street. Students included 50 percent more relevant content in their test summaries after having used Summary Street in their schoolwork. Finally, students who used Summary Street increased their performance on gist-level comprehension items of a standardized comprehension test (the Colorado State Assessment Program, or CSAP) by an effect size of d = 0.42, compared with students who practiced summary writing without getting feedback. Scores on CSAP reading comprehension items were 1.4 standard deviations higher for the lower 75 percent of students after just two weeks (four summaries) with Summary Street than were those from matched control classes. The results of this work have been reported in refereed professional journals, conferences, and edited volumes (see Caccamise, Franzke, Eckhoff, Kintsch & Kintsch, in press; and Franzke, Kintsch, Caccamise, Johnson & Dooley, 2005).

Earlier Study Results

In the earliest classroom trials of Summary Street in 2001, two sixth-grade English and language arts classes in a Boulder, Colo., middle school used either the then full-featured system or a system with the same interface and interaction features but no feedback, essentially a word processor. Students in both conditions wrote summaries by hand in their classrooms, then typed them into an input box on a classroom computer, revised them until satisfied, and then handed them in to the teacher for grading and comment. The experimental group (one of whom suggested the name Summary Street) used the system during three class hours in the first of the three weeks of the experiment, and thereafter used the no-feedback version. The other group was in the no-feedback condition during the first and third weeks. After the last week, the class teacher and another English and language arts teacher each rated all of the summaries independently, in random order and without knowledge of the condition in which each was written. Ratings included adequacy of summary content, style and mechanics. Summaries written with Summary Street were significantly better on all counts. Summaries written without Summary Street feedback after using Summary Street retained a significant advantage. Students with Summary Street voluntarily spent twice as much time revising, on average going beyond class periods, as did those without it. Interviews produced highly favorable evaluations by teachers and students. After the experiment, the teachers requested continued use, as did some teachers who had seen or heard of Summary Street but not used it. Some of them eventually got access and used it without training.

Intelligent Essay Assessor

Evidence about the Intelligent Essay Assessor

For essay scores, the same measure is customarily used to evaluate both reliability and validity: how well human readers agree with one another and, in this case, how well humans and an automated scoring system agree. However, Pearson Knowledge Technologies believes that this criterion is subject to some amount of doubt, and to important supplementation, when the goal is the test's use in developmental or formative assessment. On the one hand, human inter-rater reliability may be affected by training readers to agree where uninfluenced judgment might not, and some aspects of writing, such as spelling, may actually be more consistently evaluated with computer help. On the other hand, other criteria, such as whether scores vary appropriately with known differences in test-taker abilities, can be of equal or greater interest. Therefore, the various studies conducted have included reliability and validity measures of several different kinds: independent teacher judgments without scoring collaboration or instruction; administration of tests before and after studying relevant material; scoring by readers with different levels of expertise; comparing performance of students in different grades and of different chronological ages (vocabulary and other language skills improve with cultural immersion); correlation with other kinds of tests; effects on scores on other tests taken later; and improvements engendered and measured by taking the test itself multiple times. In reviewing these studies, we focus on the largest, cleanest and most statistically and methodologically precise studies and experiments, but we also describe some studies of lower power that add wider perspective on the informative value of the assessment.

Introductory methodological note: For validation, we favor product-moment correlations between continuous KAT engine scores and whatever score the human graders give over agreement on grade or score categories. This minimizes the contribution of quantification error and exploits the greater precision of our machine scores relative to the roughly two-bit Miller's Law limit of human absolute judgments. It also avoids classifying the scores into discrete score groups, a matter that often involves pedagogical and policy decisions largely irrelevant to questions of measurement precision. Nonetheless, we also analyze exact, adjacent and discrepant scores so as to report results in the customary manner.
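The two families of statistics named in the note above, product-moment correlation on the one hand and exact/adjacent/discrepant category agreement on the other, can both be computed from the same paired human and machine scores. Here is a minimal sketch with placeholder scores on a six-point rubric; the numbers are invented for illustration only.

```python
# Illustrative sketch: correlation vs. categorical agreement for paired scores.
# The score arrays are invented placeholders on a 1-6 rubric.
import numpy as np
from scipy.stats import pearsonr

human = np.array([3, 4, 2, 5, 4, 3, 1, 4, 5, 2])
machine = np.array([3, 4, 3, 5, 5, 3, 2, 4, 4, 2])

r, _ = pearsonr(human, machine)           # product-moment (Pearson) correlation
diff = np.abs(human - machine)
exact = np.mean(diff == 0)                # same score category
adjacent = np.mean(diff <= 1)             # exact or within one category
discrepant = np.mean(diff >= 2)           # two or more categories apart

print(f"r = {r:.2f}, exact = {exact:.0%}, "
      f"exact + adjacent = {adjacent:.0%}, discrepant = {discrepant:.0%}")
```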

The most recent and most elegant study, conducted in 2006 in collaboration with MetaMetrics Inc., used holistically scored essay data collected by its researchers. We regard this study as research into the fundamental accuracy of our automatic essay grading methods. The data were 3,453 moderate-length essays written by fourth-, sixth-, eighth-, 10th- and 12th-grade students on 18 different commonly used English and language arts prompts, in a design in which each student answered six prompts, each on a different day, each scored by six different experienced and independently working readers, with all variables factorially counterbalanced such that the effects of grade, student, prompt, day of test and reader could be measured and partialled out of the final score, which was computed by a Rasch Lexile method. With these data, the automatic scoring engine had a reliability correlation of 0.91 with the human readers, identical to the very high mean human-human reliability.[1]

With these same purified residual data, we were also able to analyze how well human and IEA scores compared in measuring differences between scores on the same essays when written by students in different grades, and thus points of normal progress in writing skills. Of the 18 prompts, three had been answered by both fourth- and sixth-grade students, three by sixth and eighth, three by eighth and 10th, and three by 10th and 12th, each pair containing a maximum of 108 and a minimum of 98 students in both grades. The average absolute difference between the scores of students separated by two school grades had effect sizes of 0.45 and 0.43, respectively, for human and IEA scoring. These average changes were highly significant statistically, p < 10^-4 in both cases; but the difference between them was not (one-tailed p ≈ 0.73), i.e., no appreciable difference.[2]

In 2002, we performed a similar analysis on holistic scores from a sample of 900 minimally constrained, open-ended creative fiction essays. These had reliability correlations of 0.90 for both human double scoring and IEA-to-human scoring. Student school grades ranged from fifth to seventh, with varying numbers of responses, and we had the chronological age of each student as well. In that case, correlations of human and IEA sensitivity to both school grade and chronological age were significantly (p < 0.01) higher for IEA. We hypothesize that the apparent superiority of IEA in that case might be attributable to greater amounts of noise in the development-related component of the score variance, and less in other components, for the human than for the machine grading scores.

Using the purified residuals data in the 2006 study, we also explored new ways to score essays without regard to the prompt being answered, relying only on features common to good content in any essay and on IEA measurements of the quality of writing. After some exploration, we attained 0.87 reliability with a variant of the usual IEA scoring model. Although this reliability will undoubtedly shrink somewhat in the noisy case of ordinary assessment, the new method adds another kind of free-response assessment to our quiver, one that will be especially useful for Summary Street, in which no pre-scored training data are used. Correlations are shown below for prompt-independent scoring (scoring essays on all 18 prompts with the same scoring engine, without retraining or re-calibration) for the same essays as described above.

  N        PKT correlation    Human-human correlation
  3,453    0.87               0.80
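The "purified residual" scores referred to above come from measuring facet effects (grade, student, prompt, day of test, reader) and partialling them out before human and IEA scores are compared. The sketch below illustrates that general idea with a plain dummy-coded least-squares adjustment on made-up data; the study itself used a Rasch Lexile method, so this is an analogy for the residualization step, not the actual procedure.

```python
# Illustrative sketch: partialling facet effects out of ratings before correlating.
# All data are simulated placeholders; the real study used a Rasch Lexile model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "grade": rng.choice(["4", "6", "8", "10", "12"], size=n),
    "prompt": rng.choice([f"p{i}" for i in range(18)], size=n),
    "reader": rng.choice([f"r{i}" for i in range(6)], size=n),
})

ability = rng.normal(size=n)                                   # latent writing ability
prompt_fx = dict(zip([f"p{i}" for i in range(18)], rng.normal(scale=0.6, size=18)))
reader_fx = dict(zip([f"r{i}" for i in range(6)], rng.normal(scale=0.3, size=6)))

df["human"] = (ability + df["prompt"].map(prompt_fx) + df["reader"].map(reader_fx)
               + rng.normal(scale=0.4, size=n))
df["machine"] = ability + df["prompt"].map(prompt_fx) + rng.normal(scale=0.4, size=n)

# Residualize each score on the design facets, then correlate the residuals.
resid_h = smf.ols("human ~ C(grade) + C(prompt) + C(reader)", data=df).fit().resid
resid_m = smf.ols("machine ~ C(grade) + C(prompt) + C(reader)", data=df).fit().resid

print(f"raw r = {np.corrcoef(df['human'], df['machine'])[0, 1]:.2f}")
print(f"residual r = {np.corrcoef(resid_h, resid_m)[0, 1]:.2f}")
```

Removing the shared prompt and reader effects isolates agreement about individual essays; whether the residual correlation ends up higher or lower than the raw one depends on how the facet effects and noise are distributed across the two sets of scores.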

Data on IEA Reliabilities Over Grade Levels 6 to 12 with 81 Prentice Hall Prompts

Our most extensive reliability evaluation study with operational data was conducted with results from 33,205 essays written to 81 different Prentice Hall English and language arts prompts from its textbook companion website, 10 to 15 different prompts per grade level, over seven grade levels, sixth through 12th. The overall mean inter-rater reliabilities for holistic grades are given in the table below.

                                    Human-Human    IEA-Human
  Correlation                       0.86           0.90
  Exact agreement (%)               61.7           61.1
  Exact + adjacent agreement (%)    97.7           98.1

IEA-human correlations were higher than human-human correlations for all seven grade levels, the difference in r ranging from 0.03 to 0.06. Overall, IEA was correlated with averaged pairs of human scores on each essay significantly better than the human scores were correlated with each other (p < .001, by a two-tailed paired-comparison t-test, df = 6, over the seven grade-level means. Estimating the statistical significance this way avoids inflating the degrees of freedom with between-prompt and between-grade-level variation and with possible multiple scores by the same student, which would have occurred if the number of individual essays were taken as the basis; it is an application of the Greenhouse-Geisser approach.) The degree of superiority of IEA in this case, 4 percent in the product-moment correlation coefficient and 9 percent greater shared variance, is unusual and may be of practically important magnitude for reflecting the small incremental skill gains expected in formative developmental assessment. Exact agreement was almost identical for human-to-human and IEA-to-human scoring (p < 0.54, by the same statistical method). The contrast between the correlation and exact-agreement results could be because IEA's distributions over score categories are slightly different from those of human readers.
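The significance test described above is a paired, two-tailed t-test over the seven grade-level means (hence df = 6). Below is a minimal sketch of how such a comparison can be run. The per-grade correlation values are placeholders, and the Fisher z transformation applied before testing is an added assumption for illustration, not something the original analysis states.

```python
# Illustrative sketch: paired two-tailed t-test over seven grade-level mean correlations.
# The per-grade correlations are placeholders, not the study's numbers.
import numpy as np
from scipy.stats import ttest_rel

grades = [6, 7, 8, 9, 10, 11, 12]
human_human = np.array([0.84, 0.85, 0.86, 0.86, 0.87, 0.86, 0.88])  # placeholder r per grade
iea_human   = np.array([0.88, 0.89, 0.90, 0.90, 0.91, 0.90, 0.92])  # placeholder r per grade

# Fisher z-transform the correlations before comparing (an added assumption here).
z_hh, z_ih = np.arctanh(human_human), np.arctanh(iea_human)

t_stat, p_value = ttest_rel(z_ih, z_hh)   # paired comparison; df = len(grades) - 1 = 6
print(f"t({len(grades) - 1}) = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```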

Trait Scoring for Essays

In several trial and operational uses of IEA, we have also provided scores on various components, or traits, of English writing. Generally, human scores on traits have considerably lower inter-rater reliabilities than do holistic scores, and the trait scores are all highly correlated with each other and with holistic scores. IEA scores trained and calibrated on the human scores to match them are therefore also lower and highly correlated with each other. Results are shown below for one exemplary set with a large variety of traits and better-than-usual double-scoring reliability by humans. This IEA scoring was done by Pearson Knowledge Technologies for a large, independent professional testing company client.

Correlations (r) among trait scores:

  Description                             Holistic   Content   Effective   Focus &        Grammar, usage,   Word
                                                                sentences   Organization   mechanics         Choice
  H-H r between holistic and traits       0.71       0.80      0.74        0.82           0.60              0.69
  IEA-IEA r between holistic and traits   --         0.81      0.75        0.82           0.62              0.71
  H-IEA r on the same traits              0.72       0.81      0.76        0.82           0.74              0.71

The human-to-human correlation and the human-to-IEA correlation on the holistic scores were almost identical, r = 0.71 and 0.72 respectively. The correlations of individual traits with the holistic score were also very similar for IEA and for human scores (the correlation between the two sets of correlations over the five individual traits was r = 0.84). The mean difference favoring the human-to-IEA agreement over the human-to-human agreement was very small and not statistically reliable (p < 0.15).[3]

Conclusions

The practical overall conclusions from this body of reliability and validity evaluation data are (a) that human and machine scoring methods are very nearly equivalent; (b) that the use of standard IEA for formative assessment and feedback on essays, as used in WriteToLearn, is well justified; and (c) that the use of IEA for WriteToLearn feedback and for formative developmental assessment is sufficiently robust to serve a steering function for both day-to-day student learning and teacher guidance, as well as for district and state oversight of English and language arts learning.

Endnotes:

1. With raw scores instead of the residuals, the IEA-to-human correlation was 0.73.

2. Note the small size of these differences, which were all positive except for the 10th-to-12th-grade comparison (which was, puzzlingly, slightly negative): an average effect size gain of only 0.23 sigma per year and a total gain of only 1.7 sigma over the whole fourth-to-12th-grade period. While it might be thought that this was due to restriction of range by the six-point scoring rubric, the fact that the analyzed scores were noise-reduced averages of six independent ratings on fairly large samples casts some doubt on such an explanation.

3. We note that in some scoring programs that have elicited very short or highly unusual kinds of essays, or used highly unusual scoring procedures (for example, greater-than-10-point scales with erratic category distributions), IEA has not equaled human reliabilities. WriteToLearn, however, uses only testing and scoring procedures for which IEA has proven reliable.
