
THE PENNSYLVANIA STATE UNIVERSITY
SCHREYER HONORS COLLEGE
DEPARTMENT OF MATHEMATICS

ASSESSING THE EFFECTIVENESS OF MULTIPLE CHOICE MATH TESTS

ELIZABETH ANNE SOMERS
Spring 2011

A thesis submitted in partial fulfillment of the requirements for a baccalaureate degree in Mathematics with honors in Mathematics.

Reviewed and approved* by the following:
Stanley Smith, Associate Professor, Thesis Supervisor
Mark Levi, Professor, Honors Adviser
* Signatures are on file in the Schreyer Honors College.

Abstract

This study was conducted at Penn State University and focused on the final exam results for Math 021, College Algebra. The goal of the study was to examine the effectiveness and difficulty of multiple choice exam questions. The study used group data from the fall 2008 final exam (Schreyer, 2008); based on these data, five questions were selected to reappear on the spring 2010 final. These questions were chosen for their high effectiveness and appropriate difficulty level. The study found that both the effectiveness and the difficulty of these questions changed from the fall 2008 exam to the spring 2010 exam. Two of the five questions increased in difficulty, one decreased in difficulty, and the remaining two had fairly consistent difficulty levels across the two exam years. The effectiveness of the questions decreased from the fall 2008 exam to the spring 2010 exam: the first three identical questions showed a large decrease in effectiveness, while the remaining two had more consistent effectiveness across exam years (MATH, 2010). The factors that may have affected the effectiveness and difficulty of the exam questions are discussed. One of these factors, the ALEKS program (ALEKS, 2010) used during the spring 2010 semester, appeared to influence the changes in effectiveness and difficulty.

Table of Contents

Introduction
Methods and Materials
    Table 1: Fall 2008 Data
    Table 2: Distribution of items by ITEM EFFECT: Biserial Coefficient
    Table 3: Distribution of items by % Correct
    Figure 1: Questions found on both fall 2008 and spring 2010
Results
    Table 4: Spring 2010 Data
    Table 5: Difficulty and Effectiveness
Discussion
Conclusion
Appendices
    Appendix A: Glossary
    Appendix B: Consent Forms
References

Acknowledgements

I would like to thank all of those who aided me in completing this project. First, I would like to thank my thesis advisor, Dr. Stanley Smith, for all the time and effort he gave. I would also like to thank Mary Erickson, Coordinator of First Year Courses, and all the others in the Math Department who allowed me to complete this study, as well as Crystal Ramsay, Instructional Consultant for the Schreyer Institute for Teaching Excellence, for providing me with information integral to my study. Finally, I would like to thank my family and friends for the support and encouragement they gave me as I completed this project.

Introduction

The use of multiple choice questions to test knowledge is prevalent in academic institutions in the United States. Many institutions rely on multiple choice questions to test students and assess their knowledge and learning. Multiple choice questions appear on standardized tests such as the PSSA and the SAT, and on classroom tests in primary schools, secondary schools, and universities. Multiple choice exams fall under the larger category of assessment, and the types and quality of assessment are being examined in the mathematics community. In 1991, teachers participating in a study by Garet and Mills (1995) rated how frequently they used multiple choice tests in their classrooms. On a scale of 1 to 5, where 1 indicated "never" and 5 indicated "very frequently," teachers rated their use at 2.5. The Thompson article that reports this study promotes the use of alternative assessments, but it also advocates for improving tests themselves (Thompson, 1997). Tests can be improved by increasing the effectiveness of test items and by ensuring that the test items, individually and collectively, have an appropriate difficulty level. A meeting with Crystal Ramsay, an Instructional Consultant for the Schreyer Institute for Teaching Excellence, brought to light three university websites that address these issues along with other aspects of multiple choice item analysis (C. Ramsay, personal communication, October 7, 2010). A website provided by the University of Texas at Austin explains how to analyze test items, addressing item discrimination and difficulty along with other measures pertinent to the analysis of a multiple choice question. This site also supplies information about the testing process, including how to write test items, how to produce data that reflect the test items, and how to analyze test items (Instructional, 2010). The University of Wisconsin Oshkosh provides a website that discusses item discrimination and its importance in making judgments about

a test item; this site also includes a method for calculating the item discrimination (University, 2005). Finally, Vassar College hosts a website that calculates statistical information about a data set, including the item discrimination (Lowry, 2010). Each of these websites includes information about calculating the effectiveness of test items, and the presence of these pages indicates the importance of test effectiveness. Penn State University is also working to improve question effectiveness on multiple choice math exams; this study was a part of that process. The purpose of this study was to investigate the effectiveness and difficulty of multiple choice items on the MATH 021, College Algebra, final exam. This thesis describes the study that was conducted to generate data and uses those data to draw conclusions about the effectiveness and difficulty of test items.

Methods and Materials

This study was conducted using exam score data from the College Algebra I class, Math 021, at Penn State University, University Park. The study consisted of two rounds of testing. The first round took place before the study began, and group data were obtained from it (Schreyer, 2008). These data were used to help shape the second round of testing. The data from the second round include fewer participants but a more in-depth analysis (MATH, 2010).

Data from the fall 2008 Math 021 final exam were obtained from the Schreyer Institute for Teaching Excellence at Penn State University. The sample size of this data set was 653 students, each of whom took one of four versions of the exam: 162 students took version A, 162 took version B, 166 took version C, and 163 took version D. Each version of the exam was made up of the same 30 questions, and the questions appeared in the same order on all four versions. The versions differed only in the order in which the answer choices appeared, and no single pattern characterized how the answer choices varied across versions (Schreyer, 2008). The data produced by the Schreyer Institute for Teaching Excellence provide the percent of students who chose each answer choice and the difficulty of each question. A key is provided along with the mean score for each form, and the results include the item effect and the reliability of the test scores. The group data for the fall 2008 exam are displayed in Table 1 (Schreyer, 2008), which includes the results for the five problems that appear identically on the spring 2010 exam.

Table 1: Fall 2008 Data
Version A (162 students): reliability 0.842, mean 108.58
Version B (162 students): reliability 0.836, mean 106.36
Version C (166 students): reliability 0.819, mean 105.21
Version D (163 students): reliability 0.830, mean 106.35
Table 1. Extracted from the item analysis data for five of the 30 questions on the fall 2008 exam (Schreyer, 2008).

The group data from the fall 2008 exam were studied, and questions were chosen to reappear on the spring 2010 exam based on a high item effect across the four versions and a difficulty that, averaged across versions, came to roughly a C. The information provided by the Schreyer Institute for Teaching Excellence includes the ranges of values that indicate ineffective questions and questions with low, medium, and high effectiveness. A table displaying these ranges can be found in the item analysis produced by the Schreyer Institute for Teaching Excellence and is reproduced here as Table 2 (Schreyer, 2008).

Table 2: Distribution of items by ITEM EFFECT: Biserial Coefficient
Negative: ineffective
.00-.20: low effectiveness
.21-.40: medium effectiveness
.41-1.00: high effectiveness
Table 2. The ranges for item effect as displayed in the item analysis data (Schreyer, 2008).

With these ranges in mind, the cutoff item effect chosen for

this study was 0.40; that is, a question had to have an item effect of 0.40 or greater on every version of the exam. There were 14 questions on the fall 2008 exam that met this criterion (Schreyer, 2008). Of these 14 questions, five were used verbatim on the spring 2010 exam; these are the five questions represented in Table 1. The answer order for corresponding versions was also identical from the fall 2008 exam to the spring 2010 exam. Two of the 14 eligible questions, numbers 28 and 29, involved content that was not being tested in the spring 2010 semester of Math 021, and questions 10 and 12 were also not used on the spring 2010 exam. The remaining eligible questions appeared with some alterations on the spring 2010 final: questions 3 and 24 had different distracters; in question 6, a negative was factored out of part of the equation being solved; and question 11 appears on the spring 2010 exam, but its answers are not shuffled between versions. Questions 4 and 5 also appeared on the spring 2010 exam, but these were not questions chosen for this study (Penn State, 2010). The difficulty of each question was also considered; the goal was to create an exam whose overall percent difficulty corresponded to a C. The ranges for difficulty levels can be found in Table 3 (Schreyer, 2008).

Table 3: Distribution of items by % Correct
0-20: very difficult
21-60: difficult
61-90: moderately difficult
91-100: easy
Table 3. The ranges for item difficulty as displayed in the item analysis data (Schreyer, 2008).

The remainder of the exam was written by Mary Erickson, Coordinator of First Year Courses at Penn State University. The questions focused on in the data collection and review are the five that are identical to those that appeared on the fall 2008 exam. The spring 2010 exam included questions chosen for their effectiveness and difficulty on the fall 2008 exam. There were 25 questions on this exam and four versions. The

versions contained the same questions, but with the answer choices in different orders (Penn State, 2010). This exam was administered in May of 2010 to students taking the Penn State course Math 021. Before completing the exam, students were asked to indicate whether their scores could be used for the study. The informed consent form used for this process (Appendix B) was in compliance with the IRB Office at Penn State University (Research, 2010). There were 143 students who indicated that their scores could be used; these scores were extracted from the group data after being stripped of identifying information. There were 34 students who completed version A, 38 who completed version B, 40 who completed version C, and 31 who completed version D (MATH, 2010). The extracted data, provided by my thesis advisor, Stanley Smith, Associate Professor and Director of Online Instruction, included the answer choice each student made on each question (MATH, 2010). The exams were graded in Excel after an answer key was made for each version. The percent of students who chose each answer choice was determined, along with the difficulty, effectiveness, overall reliability, and mean for each version. The difficulty and mean were calculated in an Excel spreadsheet. The overall reliability was calculated using the Kuder-Richardson formula 20 (KR-20) value (Appendix A); this is the same formula used by testing services to calculate the overall reliability (Schreyer, 2008). The formula for this calculation was found in Psychometrics: An Introduction by Furr and Bacharach (Furr, 2008). The effectiveness of each question was determined using an online calculator provided by Vassar College (Lowry, 2010). Crystal Ramsay, an instructional consultant for the Schreyer Institute for Teaching Excellence, recommended this calculator and confirmed that Penn State University uses a similar calculation to determine the effectiveness of exam questions. The top 33 percent and bottom 33 percent of scorers are used to calculate the effectiveness of a

question (C. Ramsay, personal communication, October 7, 2010). When defining the top 33 percent and bottom 33 percent of scorers, there were overlapping scores between the top and middle groups or between the middle and bottom groups. The determination was made in Excel: the scores were sorted from smallest to largest, and the top 33 percent were then determined using this order. To make sure that the process of choosing which students' scores to use did not significantly affect the effectiveness of a question, the effectiveness of the five questions studied in greater depth was recalculated. The students with the same scores were each assigned a number, and a TI-83 calculator was used to randomly generate numbers indicating which of these scores would and would not be used. The spring 2010 data were compared to the original group data, and the difficulty and effectiveness of the five chosen questions were compared to the difficulty and effectiveness of these questions on the fall 2008 exam. The five questions are displayed in Figure 1 (Penn State, 2010).

Figure 1: Questions found on both fall 2008 and spring 2010. The five multiple choice questions ask students to (1) simplify a rational algebraic expression, (2) simplify $(3^{-2} - 2^{-2})^{-1}$, (3) write a given expression in simplest radical form, (4) simplify a quotient of radical expressions in $b$, and (5) find the center of the circle $x^2 + y^2 - 4x + 6y + 1 = 0$. Figure 1. The five questions that appear identically on the fall 2008 and spring 2010 exams (Penn State, 2010).
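To make the calculations described above concrete, the following is a minimal Python sketch of the kind of item analysis performed in this study: per-item difficulty (percent correct) and an upper-versus-lower-group effectiveness index computed from the top and bottom thirds of scorers. The variable names and the tiny example data set are illustrative only; the study itself used Excel and the Vassar College calculator (Lowry, 2010), which reports a point-biserial coefficient rather than the simple group difference shown here.

# Minimal item-analysis sketch (illustrative only; not the study's actual code).
# responses[s][i] = 1 if student s answered item i correctly, else 0.
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

n_students = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Difficulty: percent of students answering each item correctly
# (a higher percent means an easier item, as in Table 3).
difficulty = [100.0 * sum(row[i] for row in responses) / n_students
              for i in range(n_items)]

# Effectiveness: sort students by total score, take the top and bottom thirds,
# and compare each item's proportion correct in the two groups.
order = sorted(range(n_students), key=lambda s: totals[s])
third = max(1, n_students // 3)
bottom, top = order[:third], order[-third:]

def prop_correct(group, item):
    return sum(responses[s][item] for s in group) / len(group)

effectiveness = [prop_correct(top, i) - prop_correct(bottom, i)
                 for i in range(n_items)]

for i in range(n_items):
    print(f"Item {i + 1}: difficulty {difficulty[i]:.1f}% correct, "
          f"upper-lower index {effectiveness[i]:+.2f}")

Note that the 0.40 cutoff used in this study applies to the biserial coefficient reported in the item analysis; the simple upper-lower index above is related to, but not numerically identical to, that coefficient.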

Results

Table 4: Spring 2010 Data
Version A (34 students): reliability 0.76, mean 71.41
Version B (38 students): reliability 0.67, mean 70.84
Version C (40 students): reliability 0.75, mean 65.2
Version D (31 students): reliability 0.80, mean 70.71
Table 4. Item analysis for the spring 2010 exam based on raw data (MATH, 2010).

The results for the spring 2010 exam, based on the raw data provided by the thesis advisor of this study (MATH, 2010), are displayed in Table 4. The results are displayed for each question on the exam; the number of participants, mean score, overall reliability, difficulty, and effectiveness for each question are present in these data. The five questions that will be further examined are highlighted. The specific questions and the comparisons between the fall 2008 and spring 2010 exams are displayed in Table 5.

Table 5. Difficulty and effectiveness of the five selected problems. Values are from the fall 2008 item analysis (Schreyer, 2008) and calculated from the spring 2010 raw data (MATH, 2010).

The percent difficulty for each of the five selected questions is displayed in Table 5. These data were obtained from the item analysis for the fall 2008 exam provided by the Schreyer Institute for Teaching Excellence (Schreyer, 2008) and calculated from the raw data provided by the thesis advisor for this study (MATH, 2010). The difference in difficulty between the fall 2008 and spring 2010 exams is also shown in this table.

The first question that was identical on both the fall 2008 and spring 2010 exams asked students to simplify an algebraic expression (Penn State, 2010). Students had a high success rate, with all versions' percent correct falling between 91 and 100 and therefore being labeled easy (Schreyer, 2008), but the question was somewhat harder on the fall 2008 exam: for each version, the percentage of students who answered correctly increased from the fall 2008 exam to the spring 2010 exam.

Two questions, question 2 and question 4, showed an increase in difficulty across the two exams. For question 2, the spring 2010 results fall in the very difficult category and the lower end of the difficult category, whereas the fall 2008 results are in the middle to upper end of the difficult category (Schreyer, 2008). For question 4, the fall 2008 results fall in the moderately difficult category, whereas the spring 2010 results fall in the very difficult and difficult categories (Schreyer, 2008).

The remaining two questions, question 3 and question 5, showed an increase in difficulty on some versions and a decrease on others. For the third question, the results stayed fairly consistent across the exams, lying within the upper end of the moderately difficult category and the easy category. The fifth question also had

fairly consistent results: across both exams and all four versions, its difficulty ratings remained in the moderately difficult category (Schreyer, 2008).

The effectiveness of the questions also changed from the fall 2008 exam to the spring 2010 exam, with an overall trend of decreasing effectiveness across the two exam years (Table 5). The effectiveness of the first question could not be calculated for version A because every student in the sample group answered this question correctly (MATH, 2010). Questions one, two, and three showed a large decrease in effectiveness, while questions four and five showed more consistent effectiveness across exam years.

Discussion

This study took place in Math 021 at Penn State University. It followed five final exam questions (Penn State, 2008) that were chosen to reappear on the Math 021 final exam based on the effectiveness and difficulty ratings they received on the fall 2008 exam (Schreyer, 2008). The results of this study show discrepancies in both the effectiveness and the difficulty of these questions across exams (Table 5). The five questions on the fall 2008 and spring 2010 exams correspond identically, as set up by this study: the answer choices are in the same order for version A of the fall 2008 exam and version A of the spring 2010 exam, and likewise for all other corresponding versions across exam years. There are disparities, however, in the conditions under which these questions were administered the first and second times. These disparities are discussed below in order to conjecture about the causes of the differences.

First, the exams contained a different number of questions, the questions outside the identical five were different across exam years, and the exams covered a slightly different range of knowledge. The fall 2008 exam contained 30 questions and included ellipses and hyperbolas (Penn State, 2008); the spring 2010 exam contained 25 questions and did not include ellipses and hyperbolas (Penn State, 2010). These disparities may affect the effectiveness of the test questions because effectiveness is calculated by examining how many high scoring students and how many low scoring students answered a question correctly; a highly effective question has the high scoring students answering correctly and the low scoring students answering incorrectly (Instructional, 2010). If two exams test even slightly different knowledge or skills, the range of skills being tested changes, and perhaps a student who

was a high scoring student on the first exam would become a middle scoring student on the second exam, which would alter the effectiveness of the question.

Another difference between the fall 2008 data and the spring 2010 data was the sample size. The fall 2008 data were existing group data calculated by the Schreyer Institute for Teaching Excellence; the population included all Math 021 students taking the exam, a total of 653 students, with over 160 students taking each version (Schreyer, 2008). For the spring 2010 exam, the population represented in the data was defined differently. The first factor influencing the population was that permission had to be obtained from students to use their exam data; only 143 students granted permission for their scores to be used, which meant that between 30 and 40 students were represented for each version (MATH, 2010). The second factor that affected the population of the spring 2010 exam was a program called ALEKS, an online method of instruction (ALEKS, 2010) that gave students skill practice throughout the semester. Students needed to master each category for the class, and those who mastered all of the categories before the end of the semester took the exam at an earlier date. Only those who took the exam on the final exam day were asked to participate in this study; therefore, the early test takers are not represented in the population examined for the spring 2010 data. The difference in population could affect the effectiveness and difficulty of a question because it is unknown whether the sample represented in this study is a true representation of the entire group.

The ALEKS program (ALEKS, 2010) used during the spring of 2010 was another factor that could have affected question difficulty and effectiveness. The students who took this exam in the fall of 2008 did not use this program in their studies throughout the semester (Math Department, 2008). This program has students take a

pre-test that determines which skills need to be practiced in order for students to master all of the content of the course. Students must then master each content area by answering enough questions in that topic correctly (ALEKS, 2010). The responsibility students have to practice and master skills could affect both the difficulty and the effectiveness of a question. Exam material could be either well supported or poorly supported by the content covered in ALEKS; concepts on the exam that are well supported may show a decrease in difficulty level. The effectiveness of questions could also be affected, depending on how well the program mirrors the final exam content. Students could master the ALEKS material, but if there were discrepancies between the topics on the final exam and the topics included in the program, expertise in the ALEKS material would not necessarily correlate with a good score on the exam. These students could score well on the ALEKS-correlated questions but poorly on the questions not correlated with ALEKS, so the effectiveness of a question would depend on whether that question was ALEKS-correlated and whether the exam consisted mostly of ALEKS-correlated questions.

Finally, the fall 2008 data were calculated by the Schreyer Institute for Teaching Excellence (Schreyer, 2008), while the spring 2010 data were presented in this study as raw data (MATH, 2010). The spring 2010 measurements were calculated in Excel and with an online calculator on the Vassar College website whose purpose is to calculate the effectiveness of a test question (Lowry, 2010). In addition, some scores fell on the line between the top 33 percent and the middle 33 percent, or between the middle and the bottom, as calculated from the spring 2010 raw data (MATH, 2010). In this study, the scores were sorted in Excel from smallest to largest, and the top 33 percent were then determined based on that order. Testing services may use a different method for choosing which scores should be included. The disparity between the methods used

to calculate the effectiveness could have caused slight variations between the two analyses of effectiveness.

Many of these differences could not be controlled in this study, and examining their effects is not possible within the study's constraints. An ideal study would compare the results from every student in two different semesters of Math 021 in which the exams were identical. The changes in the content tested on the exam were determined by Mary Erickson, the Coordinator of First Year Courses, and were made to reflect the content covered during the semester in the Math 021 class. The length of the exam was also determined by Mary Erickson and the other Penn State faculty who contributed to creating the exam. The constraints of this study do not allow the effects that the differences in content and length have on the effectiveness and difficulty of the studied questions to be examined; however, only three questions on the fall 2008 exam ask students to use their knowledge of ellipses and hyperbolas (Penn State, 2008). The difference in length between the fall 2008 and spring 2010 exams is consistent for every student taking the exam in each of these semesters. The length of the exam could matter if time is an issue for students: if students do not have adequate time to complete the exam, the difficulty of questions could be affected because students cannot spend as much time on them as they would like. It is the belief of this study, however, that the effectiveness of a question would not be greatly affected by time constraints, because both weak students and strong students were given the same amount of time to complete the exam.

The second factor discussed in regard to the differences between the fall 2008 and spring 2010 exams is the sample and the method of calculating the data; this difference was a consequence of the data available for this study. The data from the

population that was available for the study were examined using methods similar to those used by the Schreyer Institute for Teaching Excellence. Crystal Ramsay, an Instructional Consultant from the Schreyer Institute for Teaching Excellence, provided guiding information to help align the methods used in this study as closely as possible with the calculations for the fall 2008 exam (C. Ramsay, personal communication, October 7, 2010). The students taking Math 021 in the spring of 2010 who finished the ALEKS program early took the final exam early (Math Department, 2010), which means that their scores could not be included in the population of scores examined. These students are probably among the top students in the class, and the absence of these data could have a great effect on the difficulty and effectiveness of questions. Unfortunately, the constraints of this study, specifically the data available, do not allow this disparity to be examined.

The presence of the ALEKS program for the students taking Math 021 during the spring semester of 2010 is a factor that can be further examined. The ALEKS program aims to create a learning environment for each student based on that student's needs. The program contains a list of topics for a particular course, and instructors can choose which of these topics they want students to master. Students begin with a pre-test that assesses what they already know; based on how successfully students solve each pre-test task, the program determines what other pre-test questions the student should be asked. After the pre-test is complete, ALEKS creates an individual pie chart based on the results. The pie chart includes each of the topics the instructor indicated, grouped into categories, and the topics the student has already mastered, as determined by the pre-test, are indicated on the chart. Students must then work to master each skill required for the course, and they must answer questions relating to a skill correctly multiple times before ALEKS determines that

students have learned a skill. The ALEKS program does not give students access to practice for all skills at once; instead it opens only those skills that students can successfully complete using their prior knowledge and that they will need in order to complete the later skills. As students mastered earlier skills, later skills became available for them to complete (ALEKS, 2010).

The ALEKS program was used in Math 021 to help students practice and learn the course content, and students were required to master each of the skills in order to complete the curriculum (ALEKS, 2010). Therefore, the correlation between the content presented in ALEKS and the content presented on the exam could greatly affect the results of the exam. In particular, the difficulty of a question would be greatly affected by how well the ALEKS program covered the skills needed to complete it.

The difference in the difficulty of the five questions that appeared on both the fall 2008 exam and the spring 2010 exam was examined. Questions two and four of Table 4 had an increase in difficulty from the fall 2008 exam to the spring 2010 exam (Table 5). The skills needed to complete each of these questions were determined, as was the depth to which these skills were covered in ALEKS.

Question two of Table 4 asks students to simplify the expression $(3^{-2} - 2^{-2})^{-1}$ (Penn State, 2010). On the fall 2008 exam, there was approximately a 50/50 right-to-wrong ratio (Schreyer, 2008); on the spring 2010 exam, however, the ratio was approximately 25/75 (MATH, 2010). Therefore, it is important to note how effectively ALEKS addresses the skills tested in this problem. Students must first simplify what is inside the parentheses. They must be able to rewrite the terms without negative exponents, obtaining $\left(\frac{1}{9} - \frac{1}{4}\right)^{-1}$, and then combine these terms using common denominators, since calculators are not allowed on the exam, obtaining $\left(-\frac{5}{36}\right)^{-1}$. Lastly, students must apply the negative exponent to the simplified expression to obtain $-\frac{36}{5}$. The students who answered this question incorrectly most often chose 5 as their

answer on both the fall 2008 exam (Schreyer, 2008) and the spring 2010 exam (MATH, 2010). Perhaps this solution resulted from the following process:

$(3^{-2} - 2^{-2})^{-1} = 3^{2} - 2^{2} = 9 - 4 = 5$

Students appear to have illegally distributed the exponent and then simplified the new expression. The ALEKS program provides students with practice on exponents and order of operations, along with additional practice with exponents; however, practice with negative exponents was not observed to be available at the base level. This category of practice also has students simplify expressions in which what is inside the parentheses must be simplified before the exponent can be addressed (ALEKS, 2010). Students perhaps were not given enough initial practice with negative exponents and order of operations.

Question four of Table 4 asks students to simplify a quotient of radical expressions in the variable $b$ (Penn State, 2010). The increase in difficulty from the fall 2008 exam to the spring 2010 exam was an average of 40 percentage points (Table 5). The skills needed to solve this problem were broken down to help assess which of them were present in the ALEKS program. Students should begin by rewriting the expression with fractional exponents. They must then simplify the expression by writing the $b$ terms on either the top or the bottom of the fraction with a positive exponent, and they must recognize that the resulting power of $b$ is equivalent to a radical expression. Upon examination of the skills presented in ALEKS, no practice with fractional exponents was observed at the point where students begin their practice of exponents (ALEKS, 2010). Students perhaps require more initial practice on fractional exponents in order to be able to

complete exercises such as this one (Penn State, 2010) and to apply their knowledge to more complex problems.

The ALEKS program used during the spring semester of the Math 021 course was a factor that could have affected students' learning and performance on the final exam. It appears that some skills were not mastered well by students, perhaps because these skills were not represented enough in the ALEKS program; students' performance on questions that required these skills decreased from the fall of 2008 to the spring of 2010. There were, however, questions that appeared on both the fall 2008 and spring 2010 exams that had an increase in success or a constant trend of success (Table 5), and the content covered in the ALEKS program most likely had some bearing on that increased or constant success. There are, of course, other factors that helped to determine students' success on test questions; however, the material that students must master before finishing the course plays an important role in the skills students acquire across the course.

The effectiveness of the questions remained constant for some of the identical questions but showed a downward trend for others (Table 5). Perhaps the effectiveness changed because of differences in the background knowledge students had going into the final exam. The first identical question had most students answering correctly, so if even one top student answered it incorrectly, the effectiveness would be greatly affected. Question two of the identical questions was very difficult, and, as examined above, the skills needed for the question were not completely present in ALEKS; students who studied the ALEKS materials believing this would prepare them for the exam may not have been fully prepared for this question. This, however, did not seem to affect the fourth identical question, which also saw an increase in difficulty.
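As a side note to the discussion of question two above, the arithmetic can be checked quickly with exact fractions. The short Python sketch below evaluates both the correct simplification and the erroneous "distribute the exponent" process; it is purely illustrative and was not part of the study's methodology.

from fractions import Fraction

# Correct process: simplify inside the parentheses first, then invert.
inside = Fraction(1, 9) - Fraction(1, 4)   # 3**-2 - 2**-2 = -5/36
correct = 1 / inside                       # (-5/36)**-1 = -36/5
print(correct)                             # -36/5

# Common error: "distribute" the outer exponent across the subtraction.
erroneous = 3**2 - 2**2                    # 9 - 4 = 5
print(erroneous)                           # 5, the distractor most often chosen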

There are many factors that contributed to the changes in difficulty and effectiveness from the fall 2008 exam to the spring 2010 exam. The factors that were uncontrollable unfortunately could not be assessed further to determine their effect on the results. The ALEKS program was a factor that could be further examined, and this examination suggested that the skills needed to answer some of the questions were not present in the program to the extent they perhaps needed to be. A final factor worth noting is that the students who finished ALEKS early took the exam early and are not present in this data sample. These students are probably the top students, so their results would likely have increased the success level on each of the questions.

Conclusion

This study followed five questions that appeared identically on both the fall 2008 (Penn State, 2008) and spring 2010 (Penn State, 2010) exams. The effectiveness and difficulty of these questions across the two exam years were examined, along with the factors that could have affected the changes in these measurements. The results of this study show an increase in difficulty for two of the five identical problems, a decrease in difficulty for one, and consistent difficulty for the remaining two. This information was obtained by comparing the data analysis from the fall 2008 exam (Schreyer, 2008) with the analyzed raw data from the spring 2010 exam (MATH, 2010). The effect that the ALEKS program (ALEKS, 2010), used during the spring 2010 semester, had on the difficulty was examined, and it was determined that the skills needed for the questions that saw an increase in difficulty were perhaps not as present in the program as they needed to be. The effectiveness of the five identical questions showed a general decrease: the first three identical questions showed a significant decrease in effectiveness, while the fourth and fifth questions had more consistent effectiveness. This comparison was made by examining the item analysis from the fall 2008 exam (Schreyer, 2008) and the analysis of the raw data from the spring 2010 exam (MATH, 2010). Many factors could account for these changes, including the knowledge students gained over the semester, the sample size, and the method of calculating these measurements.

This study revealed a great deal about the effectiveness and difficulty of these five identical exam questions. Studies like this one can be done by classroom teachers on an individual basis to help improve the reliability of their exams. Universities such as Penn State provide the Schreyer Institute for Teaching Excellence to produce these data for professors to

examine. Calculators, such as the Vassar College calculator (Lowry, 2010), and other methods of analysis are available through university websites for teachers to access and to compute these data independently.

Appendix A: Glossary

Item effect: The item effect is the biserial coefficient for the exam question (Schreyer, 2008). The biserial coefficient is described on the University of Texas website, where it is referred to as the item discrimination. The item discrimination, or item effect, measures whether the students who had high total scores on the exam and the students who had low total scores answered the question correctly. An item with high effect has the students receiving the highest scores on the exam choosing the correct answer and the students receiving the lowest scores choosing one of the incorrect answers (Instructional, 2010). The item effect is a number between -1.0 and 1.0, where values closer to 1.0 indicate a higher item effect (Schreyer, 2008).

Reliability: The reliability of an exam is calculated using the Kuder-Richardson formula 20 (KR-20) value (Schreyer, 2008). This value is calculated using the following formula:

$r = \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right]$

where $k$ is the number of items on the exam, $\sum_{i=1}^{k}\sigma_i^2$ is the sum of the variances of the individual questions, and $\sigma_X^2 = \frac{\sum_j (X_j - \bar{X})^2}{n}$ (with $X_j$ the total score of student $j$ and $n$ the number of students) is the variance of the total scores (Furr, 2008: 119-120). The reliability is a value between .00 and 1.00; the closer the value is to 1.00, the higher the reliability. A reliability below 0.50 is a poor reliability score (Schreyer, 2008).
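The following is a minimal Python sketch of the KR-20 calculation defined above, applied to a small hypothetical 0/1 response matrix; the function name and example data are illustrative only and are included to make the formula concrete. For dichotomous items, the variance of item i equals p_i(1 - p_i), which is what the sketch uses for the item-variance sum.

# Illustrative KR-20 reliability sketch (hypothetical example data).
def kr20(responses):
    """responses[s][i] = 1 if student s answered item i correctly, else 0."""
    n_students = len(responses)
    k = len(responses[0])                      # number of items
    totals = [sum(row) for row in responses]   # total score per student

    # Item variances: p_i * (1 - p_i) for dichotomous items.
    item_var = []
    for i in range(k):
        p = sum(row[i] for row in responses) / n_students
        item_var.append(p * (1 - p))

    # Variance of the total scores.
    mean_total = sum(totals) / n_students
    total_var = sum((x - mean_total) ** 2 for x in totals) / n_students

    return (k / (k - 1)) * (1 - sum(item_var) / total_var)

responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]
print(round(kr20(responses), 3))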

Appendix B: Consent Forms

Informed Consent Form for Social Science Research
The Pennsylvania State University

Title of Project: ASSESSING THE EFFECTIVENESS OF MULTIPLE CHOICE MATH TESTS
Principal Investigator: Elizabeth Somers, Undergraduate Student, 317 Simmons Hall, University Park, PA 16802, eas5196@psu.edu, 610 742 6821
Advisor: Associate Professor Stanley Smith, 104 G McAllister, smith_s@math.psu.edu

1. Purpose of the Study: The purpose of this study is to conduct research that will lead to the improvement of the effectiveness of multiple choice math exams. The particular test that this study will concentrate on is the MATH 021 final exam. This study is being done for research purposes and is being performed for an undergraduate honors thesis. The research for this study will aid in the improvement of the MATH 021 final exam.

2. Procedures to be followed: For this study, you will be asked to complete the MATH 021 final exam as you would in the absence of a study. At the time of the final exam, you will be asked to indicate whether or not you wish to participate in this study.

3. Duration/Time: This study will not require any additional time apart from filling out the consent form.

4. Statement of Confidentiality: Your participation in this research is confidential. The data will be stored and secured at McAllister Building in an archived file. In the event of a publication or presentation resulting from the research, no personally identifiable information will be shared. Your test results will be archived as they normally are. The student researcher will only have access to the data after all identifying information has been removed by the Mathematics Department.

5. Right to Ask Questions: Please contact Elizabeth Somers at (610) 742 6821 with questions or concerns about this study. If you have questions regarding the purpose, outcomes, or any other aspects of this study, please contact Elizabeth Somers.

6. Voluntary Participation: Your decision to include your final exam responses in this research is voluntary. You may request that your final exam results be removed from the study at any time by contacting the Mathematics Department, 104 McAllister Building.

You must be 18 years of age or older to consent to take part in this research study. If you agree to take part in this research study and the information outlined above, please be sure to sign your name and indicate the date on the informed consent form attached to your final exam. The preceding information will be provided for you at this time as well.

Informed Consent Form for Social Science Research
The Pennsylvania State University

Title of Project: ASSESSING THE EFFECTIVENESS OF MULTIPLE CHOICE MATH TESTS
Principal Investigator: Elizabeth Somers, Undergraduate Student, 317 Simmons Hall, University Park, PA 16802, eas5196@psu.edu, 610 742 6821
Advisor: Associate Professor Stanley Smith, 104 G McAllister, smith_s@math.psu.edu

7. Purpose of the Study: The purpose of this study is to conduct research that will lead to the improvement of the effectiveness of multiple choice math exams. The particular test that this study will concentrate on is the MATH 021 final exam. This research is being done for an undergraduate honors thesis.

8. Procedures to be followed: For this study, you will be asked to complete the MATH 021 final exam as you would in the absence of a study. You are asked to indicate whether or not you wish to participate in this study. Please check yes or no, and then sign your name to indicate this.

9. Duration/Time: This study will not require any additional time apart from filling out the consent form.

10. Statement of Confidentiality: Your participation in this research is confidential. The data will be stored and secured at McAllister Building in an archived file. In the event of a publication or presentation resulting from the research, no personally identifiable information will be shared. Your test results will be archived as they normally are. The student researcher will only have access to the data after all identifying information has been removed.

11. Right to Ask Questions: Please contact Elizabeth Somers at (610) 742 6821 with questions or concerns about this study. If you have questions regarding the purpose, outcomes, or any other aspects of this study, please contact Elizabeth Somers.

12. Voluntary Participation: Your decision to include your final exam responses in this research is voluntary. You may request that your final exam results be removed from the study at any time by contacting the Mathematics Department, 104 McAllister Building.

You must be 18 years of age or older to consent to take part in this research study. If you agree to take part in this research study and the information outlined above, please be sure to sign your name and indicate the date. You will be given a copy of this form for your records.

I agree to allow my final exam results from MATH 021 to be released to the principal investigator and the research team of this study for the purpose of researching the effectiveness of the MATH 021 exam.

I DO NOT agree to allow my final exam results from MATH 021 to be released to the principal investigator and the research team of this study.

Participant Signature / Date
Person Obtaining Consent / Date

References

ALEKS. (2010). ALEKS. Retrieved November 16, 2010, from http://www.aleks.com/

Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: An Introduction. Los Angeles: Sage Publications.

Instructional Assessment Resources. (2010). Assess Students: Item analysis. Retrieved October 7, 2010, from http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php

Lowry, R. (2010). Point Biserial Correlation Coefficient. Retrieved October 7, 2010, from http://faculty.vassar.edu/lowry/pbcorr.html

MATH 021. (2010, spring). Raw data. Unpublished raw data.

Math Department. (2008). MATH 021 Syllabus. Unpublished manuscript.

Math Department. (2010). MATH 021 Syllabus. Unpublished manuscript.

Penn State Department of Mathematics. (2008, fall). MATH 021 Final Exam, Versions A, B, C, D. Unpublished manuscript.

Penn State Department of Mathematics. (2010, spring). MATH 021 Final Exam, Versions A, B, C, D. Unpublished manuscript.

Research at Penn State. (2010). Conducting a Human Participant Research Study. Retrieved from http://www.research.psu.edu/orp/humans/conducting-study

Schreyer Institute for Teaching Excellence. (2008, Dec. 18). MATH 021 001-023. Unpublished item analysis data.

Thompson, D. R., Beckmann, C. E., & Senk, S. L. (1997, January). Improving Classroom Tests as a Means of Improving Assessment. Mathematics Teacher, 90(1), 58-64.

University of Wisconsin Oshkosh. (2005, August 30). Testing Services: Item Discrimination I. Retrieved October 7, 2010, from http://www.uwosh.edu/testing/facultyinfo/itemdiscrimone.php

Vita

Elizabeth Somers
149 Fawn Lane, Haverford, PA 19041
eas5196@gmail.com

Education:
Pennsylvania State University, Spring 2011
Bachelor of Science, Mathematics (Teacher Certification Option), Honors in Mathematics
Thesis: Assessing the Effectiveness of Multiple Choice Math Tests
Thesis Advisor: Dr. Stanley Smith

Honors:
Dean's List, Fall 2007 - Spring 2010

Related Experience:
Pre-service student teaching in mathematics, Fall 2010
Private geometry tutoring for the SAT, Summer 2009
Math tutor with Volunteers in Public Schools at State College Area High School, Fall 2007, 2008, and Spring 2008

Experience and Activities:
ESF Summer Camps counselor, Summer 2010
ESF Summer Camps swim instructor, Summer 2009
Pre-team swim coach, Summer 2006 and Summer 2007
Penn State Natatorium lifeguard, October 2007 - December 2010