Investigating the Validity and Reliability of Automatically Generated Reading Comprehension Questions

Leander S. Hughes

About

Leander Hughes is an Associate Professor at the Saitama University Center for English Education and Development. He is interested in quantitative language research methods, extensive reading, and computer-assisted language learning. Other interests include learner autonomy and applying principles of social psychology to the language-teaching context.

Abstract

This study seeks to obtain a preliminary evaluation of the validity and reliability of reading questions generated by the program Jist for the purpose of gauging learners' online extensive reading progress. The study specifically aims to compare Jist-generated questions to teacher-generated questions in terms of their internal consistency reliability, the relationship of their scores with measures of general English reading proficiency, and their inter-correlations after factoring out general proficiency. Findings suggest that Jist questions may actually be more reliable measures of basic comprehension of a text than questions created by teachers.
In a previous article published in this journal, Hughes (2012) introduced the program Jist, an online application currently available at www.leanderhughes.com/jist for assisting teachers in checking online extensive reading assignments. Specifically, Jist attempts to automatically gauge a learner's reading of any digital text pasted by that learner into the program's text input box by having the student answer automatically generated questions on that text and then emailing the results, along with the actual text read, to both the student and her teacher. Although Jist is not the first attempt by researchers to create a system that can automatically measure reading comprehension (see Gates, 2008, for a thorough review of other attempts at automatic question generation), it is, to the author's knowledge, the first attempt at using automatic question generation to encourage and evaluate extensive reading progress among learners of English as a second or foreign language. While Hughes (2012) provided an in-depth rationale for and description of Jist, the question of whether or not Jist can actually be trusted as a valid and reliable basic assessment of learners' comprehension of a text was left unanswered. Thus, the aim of the present study is to investigate the reliability and validity of Jist-generated questions by answering the following research questions:

1. Do Jist questions on a given text match or otherwise surpass teacher-generated questions on the same text in terms of their reliability?

2. Is there a significant positive relationship between scores on Jist questions and English reading proficiency test scores?

3. Is there a significant positive relationship between scores on Jist questions and scores on teacher-generated reading comprehension questions made for the same text, after controlling for general English reading proficiency?
First, it is important to find out whether the Jist-generated questions are internally consistent, meaning that they measure the same thing every time they are administered. An affirmative answer to the first research question above would imply that Jist questions demonstrate such measurement consistency. Next, if Jist questions really are able to provide confirmation of whether or not a student has read a text, then the questions must also, to some degree, measure English reading proficiency. Thus, for Jist questions to be valid, the answer to the second research question should be affirmative. Finally, even if Jist questions are found to have a significant positive relationship with general English reading proficiency, there is still a chance that students are answering Jist-generated questions by relying solely on their general English reading proficiency without actually having read the text for which those questions were generated. To indicate that Jist questions are measuring knowledge and effort specifically dependent on the content of a given text and not general English reading proficiency, Jist question scores should also share a significant positive relationship with scores on teacher-generated questions for that text after factoring out learners' general English reading proficiency. Affirmative answers to all three research questions would support the claim that Jist questions can provide basic confirmation of whether or not learners have read a given text, which in turn would allow teachers to relegate the work required for such confirmation to Jist, leaving them with more time to devote to class preparation and professional development.

Participants and Procedure

This study initially involved 1,087 university students at a public university in the Kanto area. Of these initial participants, 740 were dropped from the study because they failed to complete all of the assignments.
Thus, the final sample consisted of 347 students, whose Test of English for International Communication (TOEIC) reading section scores ranged from 70 to 385 (out of 495).
Participants read one news/human-interest article and answered two teacher-generated and two Jist-generated questions on each article online at home, once a week for 20 weeks, as part of their homework for the Preparation for TOEIC courses in which they were enrolled. These courses were taught by 12 different teachers, who each chose one or two of the articles used in the study and wrote the teacher-generated questions for them. Articles ranged in length from 616 to 1,217 words, with an average of 871. Each assignment was worth 1.5 grade points, and students were given this credit as long as they completed the assignment. However, as will be explained further in the next section, the only way participants could complete an assignment was by answering each question correctly, retrying the question if necessary until this was achieved.

Measures

This section explains each question type and how it was scored, as well as how general English reading proficiency was measured. As mentioned above, the two human-generated questions per reading assignment were created by the teachers of the students participating in the study. Both questions were multiple choice with four options and were modeled after the questions used in the Reading Comprehension section (Part 7) of the TOEIC test. The first of the two questions typically asked about specific information contained in the first half of the reading (see Figure 1 for an example), while the second usually required a more global understanding of the reading (Figure 2). Each question was checked by another teacher and edited if necessary to help ensure that it matched the other questions in its aims and difficulty level. The first teacher-generated question appeared immediately when participants began the reading assignment, while the second teacher-generated question appeared after the first Jist-generated question was answered.
Figure 1. Example of a TOEIC-Style Specific Information Question Employed in the Study

*The article featured in this and the following examples was taken from http://www.reuters.com/article/2012/04/25/us-apple-results-idUSBRE83N19Q20120425

Figure 2. Example of a TOEIC-Style Global Comprehension Question Employed in the Study
Two different types of questions were generated by Jist. The first was a missing phrase question (referred to as a Progress Check in Hughes, 2012), which displayed four choices, each five to eight words in length, one of which was not from the text (Figure 3). Participants had one minute to determine which of the phrases presented was missing from the text. The second question generated by Jist was an ordering question, which presented participants with three sections of the text up to 15 words long and one distractor of equal length taken from a different text (Figure 4). Participants were given two minutes to determine which choice was the distractor and to put the other three choices in the order in which they appeared in the text. As with the human-generated questions, participants had access to the actual text while they were answering the questions. Although this allowed for the possibility that learners could respond correctly simply by scanning the text for the sections presented in the questions, the time limits were meant to make this difficult (see Analysis and Results for further discussion of this possibility).

Figure 3. A Jist Phrase Question Generated for an Article Used in the Study
Figure 4. A Jist Ordering Question Generated for an Article Used in the Study

To discourage random guessing and also increase measurement sensitivity, participants who answered a question incorrectly had to retry that question after a time delay (10 seconds for human-generated questions, 30 seconds for phrase questions, and one minute for ordering questions) until they were able to answer it successfully. While the human questions remained identical with each retry, the choices for the Jist questions actually changed with each retry, thus preventing participants from answering those items correctly simply by scanning for previously displayed choices while waiting to retry the question. Scores were calculated for each question by dividing one point by the number of tries it took to answer that question correctly (one point for answering on the first try, a half point for doing so on the second try, a third of a point for a correct response on the third try, and so on). Individual scores for each question were then summed to obtain composite scores for each question type. Finally, general English reading proficiency was measured by participants' scores on the reading section of a TOEIC pre-test given before the study began, as well as their scores on the same section of a TOEIC post-test administered at the end of the study period. Note that most, but not all, participants took the tests (N = 345 for the pre-test and N = 322 for the post-test).
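To make the scoring rule concrete, it can be sketched in a few lines of Python. This is a minimal illustration only, not code from Jist itself; the function names are invented for this sketch.

```python
def question_score(tries: int) -> float:
    """Score for one question: one point divided by the number of
    tries needed to answer it correctly (1, 1/2, 1/3, ...).
    Illustrative sketch; not the actual Jist implementation."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    return 1.0 / tries


def composite_score(tries_per_question: list[int]) -> float:
    """Composite score for a question type: the sum of the
    individual question scores."""
    return sum(question_score(t) for t in tries_per_question)
```

For example, a student who answered four questions on the first, second, first, and fourth tries would receive 1 + 0.5 + 1 + 0.25 = 2.75 points toward that composite.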
Analysis and Results

This section explains how each research question was investigated, the results, and their interpretation.

Reliability

To assess and compare the reliability of the Jist versus teacher-generated questions, a Cronbach's alpha coefficient of internal consistency reliability was obtained for each set of items. The results are presented in Table 1 below.

Table 1. Internal Consistency Reliability of Question Sets

                   Teacher-Generated   Jist Phrase   Jist Ordering   All Jist
                   Questions           Questions     Questions       Questions
                   (40 items)          (20 items)    (20 items)      (40 items)
Cronbach's alpha   .66                 .76           .77             .84

Tavakol and Dennick (2011, p. 55) state that, depending on the researcher and context, the value considered acceptable for a Cronbach's alpha ranges between .70 and .95. As shown above, the 40 teacher-generated items approached but failed to reach the lower threshold of .70, whereas both the 20 Jist phrase items and the 20 Jist ordering items surpassed it. Also, combining the Jist items raised their internal consistency reliability further, such that it surpassed the midpoint between the lower and upper thresholds given by Tavakol and Dennick. In short, the Jist questions were more reliable than the questions developed by the university teachers.

Relationship with English Reading Proficiency

To determine and compare the strength of the relationships between the different question sets and general English reading proficiency, a correlation analysis was conducted using students' pre- and post-TOEIC reading section scores (labeled Prof 1 and Prof 2, respectively) and the total scores for each question type. Table 2 below displays the results.
Table 2. Correlations of Question Scores with English Reading Proficiency

                   Teacher-Generated   Jist Phrase   Jist Ordering   All Jist
                   Questions           Questions     Questions       Questions
Prof 1 (N = 345)   .17*                .35**         .33**           .39**
Prof 2 (N = 322)   .19**               .36**         .36**           .41**

* significant at the .01 level, one-tailed
** significant at the .001 level, one-tailed

Though all question types shared a significant positive correlation with the proficiency measurements, the Jist questions clearly correlate more highly with the two reading proficiency measurements than do the teacher-generated questions. Additionally, Dörnyei (2007, p. 223) indicates that correlations above .30 can generally be considered meaningful. As shown, only the Jist questions surpass this benchmark. Thus, it appears that the Jist questions actually provided a better measurement of learners' English reading proficiency than the teacher-generated questions.

Relationship with Teacher-Generated Questions

While Jist questions appear to be a better measurement of English reading proficiency than the teacher-generated questions, one could still argue that this was because readers relied on their general proficiency to scan for or otherwise guess the correct answers to Jist questions rather than actually reading and comprehending the text upon which those questions were based. On the other hand, perhaps the reason why the teacher-generated questions were less strongly correlated with general proficiency was that readers could not rely solely on their general proficiency to determine the correct answers to these questions and instead had to read and understand the actual content of each text. Depending on that content, individual readers may have achieved higher or lower scores on teacher-generated questions than their TOEIC scores would predict, corresponding to individual differences in background knowledge, interest, and effort devoted to the reading of the text.
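The two statistics used so far, Cronbach's alpha and the Pearson correlation, can be sketched in pure Python. This is a minimal illustration of the formulas, not the analysis software used in the study, and the function names are invented for this sketch.

```python
from math import sqrt
from statistics import mean, pvariance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances /
    variance of total scores), where items[i][j] is person j's score
    on item i and k is the number of items."""
    k = len(items)
    sum_item_var = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

With real data, a dedicated statistics package would additionally supply the one-tailed significance levels reported in Table 2.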
To test this hypothesis, a partial correlation analysis was conducted to investigate the relationship between teacher-generated and Jist-generated questions after factoring out general English reading proficiency. Table 3 below presents the results.
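Before turning to the results, the computation behind such a control can be sketched: a first-order partial correlation between two variables with a third held constant is obtained from the three pairwise Pearson correlations. The study factored out proficiency measured at two time points, but the single-covariate formula below illustrates the principle (a hypothetical sketch, not the analysis software used).

```python
from math import sqrt


def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y with z held
    constant, from the three pairwise correlations. Illustrative
    single-covariate case only."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For instance, if two question types correlate at .6 and each correlates at .5 with proficiency, the partial correlation is (.6 - .25) / .75, roughly .47.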
Table 3. Correlations after Controlling for English Reading Proficiency

                                          Jist Phrase   Jist Ordering   All Jist
                                          Questions     Questions       Questions
Teacher-Generated Questions (df = 328)    .32**         .22**           .32**

** significant at the .001 level, one-tailed

As shown, even after factoring out their relationships with the general reading proficiency measurements, Jist questions still shared a significant positive correlation with teacher-generated questions (although the ordering questions failed to surpass Dörnyei's benchmark for meaningfulness). This finding implies that students did not simply rely on their general proficiency to answer Jist questions but, as with the teacher-generated questions, actually had to read and comprehend each individual text to some extent in order to respond correctly. At the same time, the weaker correlation for Jist ordering questions could imply that students relied on scanning and guessing strategies derived from general proficiency more often on those questions than on Jist phrase questions. This in turn may have been because 1) more time was allotted for the ordering question, during which learners could have mechanically scanned the text to find the order of the text sections presented, and 2) each ordering choice consisted of more words than the choices for the phrase question, which would have increased the likelihood that a choice could be located in the text purely through scanning while also making it more likely that the correct response could be guessed based purely on the more complete information provided by each choice.

Conclusion

The results of this preliminary study indicate that questions automatically generated by the program Jist are more reliable and provide a better indication of reading proficiency than teacher-generated questions on the same content.
At the same time, Jist questions share a significant positive relationship with teacher-generated questions after controlling for general reading proficiency, suggesting that responding correctly to Jist questions requires at least some reading of the text for which they were generated and not merely scanning or guessing abilities.
Future studies on Jist questions should look at how varying different aspects of the questions either increases or decreases their reliability and validity. For example, future studies would do well to investigate whether or not the questions can be further improved by hiding the text when questions appear, thereby eliminating the possibility of answering by scanning (as the current online version of Jist does). Other variables that might usefully be manipulated include question time limits, time delays between retries for both teacher- and Jist-generated questions, the number of words per choice, and the number of choices presented. This study represents a first step toward determining the reliability and validity of questions generated by Jist. Though it would be premature to conclude that Jist is a suitable replacement for teachers when it comes to writing reading comprehension questions, the findings of this study support the claim that Jist can be effectively used as an automatic measure of how much learners have read and understood a text. In conclusion, Jist may indeed represent a quick, yet valid way for teachers to encourage and gauge their learners' online extensive reading progress.

Acknowledgements

The author would like to thank Adriana Edwards Wurzinger, Stacey Vye, Nathan Krug, Jason White, Debjani Ray, Richard Sheehan, Decha Hongthong, Samuel Nfor, Kevin Hawkins, Stewart Fulton, Robert Palka, Naoki Otani, Gabriel Wilkes, and Risa Aoki for their assistance with this study.
References

Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative and mixed methodologies. Oxford: Oxford University Press.

Gates, D. M. (2008). Automatically generating reading comprehension look-back strategy questions from expository texts (Master's thesis). Carnegie Mellon University.

Hughes, L. S. (2012). Gauging online extensive reading progress via automatically generated comprehension questions. The Journal of Saitama City Educators, 2(5), 13-21.

Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach's alpha. International Journal of Medical Education, 2, 53-55.