Investigating the Validity and Reliability of Automatically Generated Reading Comprehension Questions

Leander S. Hughes

About
Leander Hughes is an Associate Professor at the Saitama University Center for English Education and Development. He is interested in quantitative language research methods, extensive reading, and computer-assisted language learning. Other interests include learner autonomy and applying principles of social psychology to the language-teaching context.

Abstract
This study seeks to obtain a preliminary evaluation of the validity and reliability of reading questions generated by the program Jist for the purpose of gauging learners' online extensive reading progress. Specifically, the study compares Jist-generated questions to teacher-generated questions in terms of their internal consistency reliability, the relationship of their scores with measures of general English reading proficiency, and their inter-correlations after general proficiency is factored out. Findings suggest that Jist questions may actually be more reliable measures of basic comprehension of a text than questions created by teachers.

In a previous article published in this journal, Hughes (2012) introduced Jist, an online application currently available at www.leanderhughes.com/jist for assisting teachers in checking online extensive reading assignments. Specifically, Jist attempts to automatically gauge a learner's reading of any digital text pasted by that learner into the program's text input box by having the learner answer automatically generated questions on that text and then emailing the results, along with the actual text read, to both the student and her teacher. Although Jist is not the first attempt by researchers to create a system that can automatically measure reading comprehension (see Gates, 2008, for a thorough review of other attempts at automatic question generation), it is, to the author's knowledge, the first attempt at using automatic question generation to encourage and evaluate extensive reading progress among learners of English as a second or foreign language.

While Hughes (2012) provided an in-depth rationale for and description of Jist, the question of whether Jist can actually be trusted as a valid and reliable basic assessment of learners' comprehension of a text was left unanswered. Thus, the aim of the present study is to investigate the reliability and validity of Jist-generated questions by answering the following research questions:

1. Do Jist questions on a given text match or surpass teacher-generated questions on the same text in terms of their reliability?
2. Is there a significant positive relationship between scores on Jist questions and English reading proficiency test scores?
3. Is there a significant positive relationship between scores on Jist questions and scores on teacher-generated reading comprehension questions made for the same text, after controlling for general English reading proficiency?

First, it is important to find out whether the Jist-generated questions are internally consistent, meaning that they measure the same thing every time they are administered. An affirmative answer to the first research question above would imply that Jist questions demonstrate such measurement consistency. Next, if Jist questions really are able to provide confirmation of whether or not a student has read a text, it follows that the questions must to some degree also measure English reading proficiency. Thus, for Jist questions to be valid, the answer to the second research question should be affirmative. Finally, even if Jist questions are found to have a significant positive relationship with general English reading proficiency, there is still a chance that students are answering Jist-generated questions by relying solely on their general English reading proficiency without actually having read the text for which those questions were generated. To indicate that Jist questions measure knowledge and effort specifically dependent on the content of a given text, and not just general English reading proficiency, Jist question scores should also share a significant positive relationship with scores on teacher-generated questions for that text after learners' general English reading proficiency is factored out. Affirmative answers to all three research questions would support the claim that Jist questions can provide basic confirmation of whether or not learners have read a given text, which in turn would allow teachers to relegate the work required for such confirmation to Jist, leaving them with more time to devote to class preparation and professional development.

Participants and Procedure

This study initially involved 1,087 university students at a public university in the Kanto area. Of these initial participants, 740 were dropped from the study because they failed to complete all of the assignments. Thus, the final sample consisted of 347 students, whose Test of English for International Communication (TOEIC) reading section scores ranged from 70 to 385 (out of 495).

Participants read one news/human-interest article and answered two teacher-generated and two Jist-generated questions on each article online at home, once a week for 20 weeks, as part of their homework for the Preparation for TOEIC courses in which they were enrolled. These courses were taught by 12 different teachers, who each chose one or two of the articles used in the study and wrote the teacher-generated questions for them. Articles ranged in length from 616 to 1,217 words, with an average of 871. Each assignment was worth 1.5 grade points, and students were given this credit as long as they completed the assignment. However, as will be explained further in the next section, the only way participants could complete an assignment was by answering each question correctly, retrying the question if necessary until this was achieved.

Measures

This section explains each question type and how it was scored, as well as how general English reading proficiency was measured. As mentioned above, the two human-generated questions per reading assignment were created by the teachers of the students participating in the study. Both questions were multiple choice with four choices and were modeled after the questions used in the Reading Comprehension section (Part 7) of the TOEIC test. The first of the two questions typically asked about specific information contained in the first half of the reading (see Figure 1 for an example), while the second usually required a more global understanding of the reading (Figure 2). Each question was checked by another teacher and edited if necessary to help ensure that it matched the other questions in its aims and difficulty level. The first teacher-generated question appeared immediately when participants began the reading assignment, while the second teacher-generated question appeared after the first Jist-generated question was answered.

Figure 1. Example of a TOEIC-Style Specific Information Question Employed in the Study

*The article featured in this and the following examples was taken from http://www.reuters.com/article/2012/04/25/us-apple-results-idUSBRE83N19Q20120425

Figure 2. Example of a TOEIC-Style Global Comprehension Question Employed in the Study

Two different types of questions were generated by Jist. The first was a missing-phrase question (referred to as a Progress Check in Hughes, 2012), which displayed four choices, each five to eight words in length, one of which was not from the text (Figure 3). Participants had one minute to determine which of the phrases presented was missing from the text. The second question generated by Jist was an ordering question, which presented participants with three sections of the text up to 15 words long and one distractor of equal length taken from a different text (Figure 4). Participants were given two minutes to determine which choice was the distractor and to put the other three choices in the order in which they appeared in the text. As with the human-generated questions, participants had access to the actual text while they were answering the questions. Although this allowed for the possibility that learners could respond correctly simply by scanning the text for the sections presented in the questions, the time limits were meant to make this difficult (see Analysis and Results for further discussion of this possibility).

Figure 3. A Jist Phrase Question Generated for an Article Used in the Study
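Purely as an illustration, the selection logic behind the two item types just described might be sketched as follows. The function names and sampling heuristics here are assumptions made for the sake of the sketch; they do not reproduce Jist's actual implementation.

    import random

    def phrase_question(text_words, other_words, min_len=5, max_len=8):
        # Sketch of a missing-phrase item: three phrases come from the
        # target text and one distractor comes from a different text;
        # the learner must spot the phrase that is NOT in the text.
        def grab(words):
            n = random.randint(min_len, max_len)
            start = random.randrange(len(words) - n)
            return " ".join(words[start:start + n])

        distractor = grab(other_words)
        choices = [grab(text_words) for _ in range(3)] + [distractor]
        random.shuffle(choices)
        return choices, distractor  # the distractor is the correct answer

    def ordering_question(text_words, other_words, section_len=15):
        # Sketch of an ordering item: three sections drawn from the
        # target text (kept in original order as the answer key) plus
        # one equal-length distractor from another text. A real
        # implementation would also keep the sections from overlapping.
        starts = sorted(random.sample(range(len(text_words) - section_len), 3))
        sections = [" ".join(text_words[s:s + section_len]) for s in starts]
        d = random.randrange(len(other_words) - section_len)
        distractor = " ".join(other_words[d:d + section_len])
        choices = sections + [distractor]
        random.shuffle(choices)
        return choices, sections, distractor

Note that in Jist itself the choices regenerate on each retry (see below), which this sketch does not attempt to reproduce, and section_len is fixed at the 15-word maximum for simplicity.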

Figure 4. A Jist Ordering Question Generated for an Article Used in the Study

To discourage random guessing and also increase measurement sensitivity, participants who answered a question incorrectly had to retry that question after a time delay (10 seconds for human-generated questions, 30 seconds for phrase questions, and one minute for ordering questions) until they were able to answer it successfully. While the human-generated questions remained identical with each retry, the choices for the Jist questions changed with each retry, preventing participants from answering those items correctly simply by scanning for previously displayed choices while waiting to retry the question. Scores were calculated for each question by dividing one point by the number of tries it took to answer that question correctly (one point for answering on the first try, half a point for doing so on the second try, a third of a point for a correct response on the third try, and so on). Individual scores for each question were then summed to obtain composite scores for each question type. Finally, general English reading proficiency was measured by participants' scores on the reading section of a TOEIC pre-test given before the study began, as well as their scores on the same section of a TOEIC post-test administered at the end of the study period. Note that most, but not all, participants took these tests (N = 345 for the pre-test and N = 322 for the post-test).
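Expressed as code, the retry-based scoring rule described above amounts to the following (a trivial sketch with illustrative names):

    def question_score(tries):
        # One point divided by the number of attempts needed: 1 on the
        # first try, 1/2 on the second, 1/3 on the third, and so on.
        return 1.0 / tries

    def composite_score(tries_per_question):
        # Sum of per-question scores across one question type.
        return sum(question_score(t) for t in tries_per_question)

    # A learner needing 1, 2, and 1 tries on three questions earns
    # 1 + 0.5 + 1 = 2.5 points.
    print(composite_score([1, 2, 1]))  # 2.5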

Analysis and Results

This section explains how each research question was investigated, the results, and their interpretation.

Reliability

To assess and compare the reliability of the Jist-generated versus teacher-generated questions, a Cronbach's alpha coefficient of internal consistency reliability was obtained for each set of items. The results are presented in Table 1 below.

Table 1. Internal Consistency Reliability of Question Sets

                                             Cronbach's alpha
    Teacher-Generated Questions (40 items)         .66
    Jist Phrase Questions (20 items)               .76
    Jist Ordering Questions (20 items)             .77
    All Jist Questions (40 items)                  .84

Tavakol and Dennick (2011, p. 55) state that, depending on the researcher and context, the value considered acceptable for a Cronbach's alpha ranges between .70 and .95. As shown above, the 40 teacher-generated items approached but failed to reach the lower threshold of .70, whereas both the 20 Jist phrase items and the 20 Jist ordering items surpassed it. Moreover, combining the Jist items raised their internal consistency reliability further, past the midpoint between the lower and upper thresholds given by Tavakol and Dennick. In short, the Jist questions were more reliable than the questions developed by the university teachers.
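For reference, Cronbach's alpha can be computed from an examinee-by-item matrix of the question scores described above using the standard formula (a generic sketch, not the analysis script actually used in this study):

    import numpy as np

    def cronbach_alpha(scores):
        # scores: (n_examinees x n_items) matrix of item scores.
        # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
        X = np.asarray(scores, dtype=float)
        k = X.shape[1]
        item_vars = X.var(axis=0, ddof=1)       # variance of each item
        total_var = X.sum(axis=1).var(ddof=1)   # variance of summed scores
        return k / (k - 1) * (1 - item_vars.sum() / total_var)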

Relationship with English Reading Proficiency

To determine and compare the strength of the relationships between the different question sets and general English reading proficiency, a correlation analysis was conducted using students' pre- and post-TOEIC reading section scores (labeled Prof 1 and Prof 2, respectively) and the total scores for each question type. Table 2 below displays the results.

Table 2. Correlations of Question Scores with English Reading Proficiency

                       Teacher-Generated   Jist Phrase   Jist Ordering   All Jist
                       Questions           Questions     Questions       Questions
    Prof 1 (N = 345)   .17*                .35**         .33**           .39**
    Prof 2 (N = 322)   .19**               .36**         .36**           .41**

    * significant at the .01 level, one-tailed
    ** significant at the .001 level, one-tailed

Though all question types shared a significant positive correlation with the proficiency measurements, the Jist questions clearly correlated more highly with the two reading proficiency measurements than did the teacher-generated questions. Additionally, Dörnyei (2007, p. 223) indicates that correlations above .30 can generally be considered meaningful; as shown, only the Jist questions surpass this benchmark. Thus, it appears that the Jist questions actually provided a better measurement of learners' English reading proficiency than the teacher-generated questions.

Relationship with Teacher-Generated Questions

While Jist questions appear to be a better measurement of English reading proficiency than the teacher-generated questions, one could still argue that this is because readers relied on their general proficiency to scan for or otherwise guess the correct answers to Jist questions rather than actually reading and comprehending the text upon which those questions were based. On the other hand, perhaps the teacher-generated questions were less strongly correlated with general proficiency because readers could not rely solely on their general proficiency to determine the correct answers and instead had to read and understand the actual content of each text. Depending on that content, individual readers may have achieved higher or lower scores on teacher-generated questions than their TOEIC scores would predict, corresponding to individual differences in background knowledge, interest, and effort devoted to the reading of the text. To test this possibility, a partial correlation analysis was conducted to investigate the relationship between scores on teacher-generated and Jist-generated questions after factoring out general English reading proficiency.
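With a single control variable, such a partial correlation reduces to the standard first-order formula, sketched below (a generic illustration, not the study's actual analysis script):

    import numpy as np

    def partial_corr(x, y, z):
        # First-order partial correlation of x and y controlling for z:
        # r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2))
        r_xy = np.corrcoef(x, y)[0, 1]
        r_xz = np.corrcoef(x, z)[0, 1]
        r_yz = np.corrcoef(y, z)[0, 1]
        return (r_xy - r_xz * r_yz) / np.sqrt(
            (1 - r_xz ** 2) * (1 - r_yz ** 2))

Equivalently, one can regress both sets of question scores on the proficiency measure(s) and correlate the residuals, an approach that generalizes naturally to more than one control variable.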

Table 3 below presents the results.

Table 3. Correlations after Controlling for English Reading Proficiency

                                               Jist Phrase   Jist Ordering   All Jist
                                               Questions     Questions       Questions
    Teacher-Generated Questions (df = 328)     .32**         .22**           .32**

    ** significant at the .001 level, one-tailed

As shown, even after their relationships with the general reading proficiency measurements were factored out, Jist questions still shared a significant positive correlation with teacher-generated questions (although the ordering questions failed to surpass Dörnyei's benchmark for meaningfulness). This finding implies that students did not simply rely on their general proficiency to answer Jist questions but, as with the teacher-generated questions, actually had to read and comprehend each individual text to some extent in order to respond correctly. At the same time, the weaker correlation for the Jist ordering questions could imply that students relied on scanning and guessing strategies derived from general proficiency more often on those questions than on the Jist phrase questions. This in turn may have been because 1) more time was allotted for the ordering question, during which learners could have mechanically scanned the text to find the order of the text sections presented, and 2) each ordering choice consisted of more words than the choices for the phrase question, which would have increased the relative likelihood that a choice could be located in the text purely by scanning for it, while also making it more likely that the correct response could be guessed from the more complete information each choice provided.

Conclusion

The results of this preliminary study indicate that questions automatically generated by the program Jist are more reliable and provide a better indication of reading proficiency than teacher-generated questions on the same content. At the same time, Jist questions share a significant positive relationship with teacher-generated questions after controlling for general reading proficiency, suggesting that responding correctly to Jist questions requires at least some reading of the text for which they were generated and not merely scanning or guessing ability.

Future studies on Jist questions should examine how varying different aspects of the questions increases or decreases their reliability and validity. For example, future studies would do well to investigate whether the questions can be further improved by hiding the text when questions appear, thereby eliminating the possibility of answering by scanning (as the current online version of Jist does). Other variables that might usefully be manipulated include question time limits, time delays between retries for both teacher- and Jist-generated questions, the number of words per choice, and the number of choices presented.

This study represents a first step toward determining the reliability and validity of questions generated by Jist. Though it would be premature to conclude that Jist is a suitable replacement for teachers when it comes to writing reading comprehension questions, the findings of this study support the claim that Jist can be used effectively as an automatic measure of how much learners have read and understood a text. In conclusion, Jist may indeed represent a quick yet valid way for teachers to encourage and gauge their learners' online extensive reading progress.

Acknowledgements

The author would like to thank Adriana Edwards Wurzinger, Stacey Vye, Nathan Krug, Jason White, Debjani Ray, Richard Sheehan, Decha Hongthong, Samuel Nfor, Kevin Hawkins, Stewart Fulton, Robert Palka, Naoki Otani, Gabriel Wilkes, and Risa Aoki for their assistance with this study.

References

Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative and mixed methodologies. Oxford: Oxford University Press.

Gates, D. M. (2008). Automatically generating reading comprehension look-back strategy questions from expository texts (Master's thesis). Carnegie Mellon University.

Hughes, L. S. (2012). Gauging online extensive reading progress via automatically generated comprehension questions. The Journal of Saitama City Educators, 2(5), 13-21.

Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach's alpha. International Journal of Medical Education, 2, 53-55.