
An Evaluation of the Item Pools Used for Computerized Adaptive Test Versions of The Maryland Functional Tests

Steven L. Wise, Ph.D.

July, 1997

A Report Prepared for the Assessment Branch of the Maryland State Department of Education

In order to meet state-mandated high school graduation requirements, Maryland students must demonstrate competency in mathematics, writing, reading, and citizenship. The Maryland Functional Tests were developed in the late 1970s to ensure that Maryland's high school graduates were competent in these four core subject areas. Early in the 1990s, the Maryland State Department of Education (MSDE) developed computerized adaptive test (CAT) versions of three of the core area tests (mathematics, reading, and citizenship). The CAT versions have augmented rather than supplanted the paper-and-pencil versions; they are typically used with transfer students and with students who are retaking one or more tests.

The purpose of this report is to present the findings of my evaluation of the item pools used in the CAT versions of the Maryland Functional Tests. The item pools were evaluated with regard to (a) pool size, (b) the adequacy of the test information provided by the pool, and (c) the balance of content domains and item difficulty levels. I provide recommendations regarding both targeted expansion of the item pools and testing methods that will make more effective use of the existing pools. In addition, I discuss the problem of item exposure and the vulnerability of the MSDE item pools.

Item Pool Size and Structure

The success of any CAT program depends largely on the quality of the item pool (sometimes termed an item bank) from which the administered items are drawn. Quality can be conceptualized according to two basic criteria. First, the total number of items in the pool must be sufficient to supply informative items throughout a testing session.

Second, the items in the pool must have characteristics that provide adequate information at the proficiency levels that are of greatest interest to the test developer. This criterion primarily means that, at all important levels of proficiency, there are sufficient numbers of items whose difficulty parameters provide useful information. Thus, a high-quality item pool will contain enough useful items to permit efficient, informative testing at the important levels of proficiency.

When a testing program has developed a large number of items to be used in the CAT, the first criterion is obviously met. Merely having a large number of items, however, does not ensure that the second criterion will be satisfied. Unless the developed items have a distribution of difficulties that is reasonably matched to the important levels of proficiency, there will likely be regions of proficiency in which the test information provided by the CAT accumulates at too slow a rate. In the sense of the pool analogy, the item pool will be too "shallow" in some proficiency region(s). In these regions the CAT will be less efficient, resulting in either higher standard errors of proficiency estimation (for a fixed-length CAT) or a longer test being required to reach a desired level of precision (for a variable-length CAT). An obvious solution to this problem is to develop additional pool items that provide additional information (depth) where it is most needed.

The pool depth issue becomes more complicated when the pool is subdivided into a number of content domains, each of which must be represented to a prespecified degree in the CAT. Ideally, each content domain should exhibit a distribution of item difficulties that resembles that of the entire pool. In practice, however, this is difficult to attain. Items written for domains that represent more elementary knowledge tend to be easier than those written for domains representing more advanced knowledge in a given subject.

Having a larger item pool has an additional advantage: items in the pool tend to be exposed in fewer CAT administrations. The integrity of the CAT depends on the item parameters remaining unchanged. If an item is presented too often, students (and/or teachers) may become familiar with it and prepare for it. This would decrease the item's actual difficulty, which in turn would positively bias proficiency estimation. Hence, item exposure can become a serious problem, particularly in a high-stakes testing program.

Purpose of the Maryland Functional Tests

It should be emphasized that the purpose of the Maryland Functional Tests is to ensure that Maryland's high school graduates are minimally competent in the four core subject areas. That is, they are primarily criterion-referenced tests rather than norm-referenced tests, which is important in assessing the suitability of the subject area item pools for CAT. In minimum competency testing, the primary goal of measurement is to determine whether or not a student has attained a proficiency level that exceeds the minimum passing standard. Differentiation among students whose performance lies above (or below) the passing standard is of secondary importance. This implies that the best measurement needs to occur at the passing score; in an item response theory sense, test information should be maximized at the passing score. My judgments and recommendations regarding the item pools are based on this perspective.

Although it is tempting to adopt the dual goals of having the Maryland Functional Tests be both good minimum competency tests and good norm-referenced tests, I believe that it would be unwise to do so in the case of the CAT versions. The adaptive procedures are quite different in each type of CAT, and if one tried to meet both goals concurrently, then neither goal would likely be met very well.
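To make the idea of "test information at the passing score" concrete, the following sketch (added for illustration only; the item difficulties shown are assumed values, not items from the MSDE pools) computes the Fisher information that Rasch-calibrated items contribute at a cut score and the resulting standard error of the proficiency estimate. Under the Rasch model an item is most informative when its difficulty equals the examinee's proficiency, so a pool serves a minimum competency test best when many of its difficulties lie near the passing score.

    import math

    def rasch_item_information(theta, delta):
        """Fisher information of a Rasch item with difficulty delta at proficiency theta."""
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))  # probability of a correct response
        return p * (1.0 - p)                          # maximized (0.25) when theta == delta

    # Illustrative values only; these deltas are hypothetical, not taken from an MSDE pool.
    passing_score = 2.15                              # Mathematics cut score, in logits
    deltas = [-1.0, 0.0, 1.0, 2.0, 2.3]               # hypothetical item difficulties
    test_information = sum(rasch_item_information(passing_score, d) for d in deltas)
    standard_error = 1.0 / math.sqrt(test_information)  # SE of the proficiency estimate at the cut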

Assessment of the Maryland Functional Test Item Pools

Mathematics

The current Mathematics item pool consists of 180 items from 7 content domains. Figure 1 shows the distribution of item difficulties for each content domain. The passing score for the Mathematics test, transformed into Rasch model logits, is 2.15.

Figure 1. Mathematics Item Pool Census, Broken Down by Content Domain and Item Difficulty (Delta)

                                              Item Difficulty (Delta)
                                  -3.00 to  -2.00 to  -1.00 to  0.00 to  1.01 to  2.01 to  3.01 to  5.01 to
Mathematics Domain                 -2.01     -1.01     -0.01     1.00     2.00     3.00     5.00     9.00    Total
Number Concepts                        1         1         3        3        3        2        0        0       13
Whole Number Operations                3        17         5        1        0        0        0        0       26
Mixed Number/Fraction Operations       0         0         5       18        4        0        0        0       27
Decimal Operations                     0         4         9        9        2        1        0        0       25
Measurement                            1         6         5        4        7        3        1        0       27
Using Data                             2         7         9       11        9        1        1        0       40
Problem Solving                        0         2         5        8        5        1        1        0       22
Total                                  7        37        41       54       30        8        3        0      180

Figure 1 reveals several problems with the Mathematics item pool. The most serious problem is that the pool is too easy. For a minimum competency test with a passing score of 2.15 logits, an effective item pool would have most of its delta values in the region of 2.15. The Mathematics pool, however, has only 8 items (4%) in the 2.01 to 3.00 range, and no more than 41 (23%) within 1.15 logits of the passing score. This indicates that the majority of the items provide relatively little information at the passing score, where it is most needed. This means that (a) it is difficult for a CAT to match items to students whose proficiency is in the vicinity of the passing score and (b) the CATs administered to moderate to high proficiency students will be virtually identical, which creates an item exposure problem.

The distribution of item difficulties also varies substantially across content domains. Three domains exhibited particular problems. Whole Number Operations contains a fairly narrow range of very easy items. The items in the Mixed Number/Fraction Operations domain were generally more difficult, but showed an even narrower range. In addition, there were relatively few items available (13) in Number Concepts.
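The percentages quoted above can be checked directly against the marginal counts in Figure 1. The short sketch below (illustrative only; the bin counts are copied from the table, and the result is an upper bound because items in the 3.01 to 5.00 bin are not necessarily below 3.30 logits) tallies the items whose difficulty bins overlap the interval within 1.15 logits of the 2.15 cut score.

    # Marginal counts from Figure 1: (lower delta, upper delta) -> number of items
    bins = {(-3.00, -2.01): 7, (-2.00, -1.01): 37, (-1.00, -0.01): 41,
            (0.00, 1.00): 54, (1.01, 2.00): 30, (2.01, 3.00): 8,
            (3.01, 5.00): 3, (5.01, 9.00): 0}

    passing_score = 2.15
    window = 1.15   # the interval of interest runs from 1.00 to 3.30 logits

    # Count every bin that overlaps the open interval (1.00, 3.30);
    # this yields the "no more than 41" figure cited in the text.
    near_cut = sum(n for (lo, hi), n in bins.items()
                   if hi > passing_score - window and lo < passing_score + window)
    total = sum(bins.values())
    print(near_cut, total, round(100 * near_cut / total))   # 41 180 23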

Reading

The current Reading item pool consists of 191 items from 5 content domains. Figure 2 shows the distribution of item difficulties for each content domain. The passing score for the Reading test, transformed into logits, is 1.75.

Figure 2. Reading Item Pool Census, Broken Down by Content Domain and Item Difficulty (Delta)

                                              Item Difficulty (Delta)
                                  -3.00 to  -2.00 to  -1.00 to  0.00 to  1.01 to  2.01 to  3.01 to  5.01 to
Reading Domain                     -2.01     -1.01     -0.01     1.00     2.00     3.00     5.00     9.00    Total
Following Directions                   0         0         7       11       10        2        3        0       33
Locating Information                   0         3        11        6        8        0        1        0       29
Main Ideas                             0         0         0        2       12       34       11        0       59
Using Details                          0         0         1       10        9        7        1        0       28
Understanding Forms                    0         0         2       14        9        7        6        4       42
Total                                  0         3        21       43       48       50       22        4      191

The Reading item pool, in contrast to the Mathematics pool, is well centered over the passing score: 141 (74%) of the items lie between logit values of 0 and 3. The only problematic domain was Locating Information, for which the distribution of items was too easy.

Citizenship

The current Citizenship item pool consists of 261 items from 3 content domains. Figure 3 shows the distribution of item difficulties for each content domain. The passing score for the Citizenship test, transformed into logits, is 1.00.

Figure 3. Citizenship Item Pool Census, Broken Down by Content Domain and Item Difficulty (Delta)

                                                  Item Difficulty (Delta)
                                      -3.00 to  -2.00 to  -1.00 to  0.00 to  1.01 to  2.01 to  3.01 to  5.01 to
Citizenship Domain                     -2.01     -1.01     -0.01     1.00     2.00     3.00     5.00     9.00    Total
Constitutional Government                  1         2        17       53       26        4        0        0      103
Principles, Rights, Responsibilities       5        24        30       14        3        1        1        0       78
Politics and Political Behavior            0         2        18       38       22        0        0        0       80
Total                                      6        28        65      105       51        5        1        0      261

As with the Reading item pool, the Citizenship pool is well centered over the passing score; 156 (60%) of the item difficulties lie within 1 logit of the passing score. Moreover, there is a good distribution of item difficulties within each of the three content domains.

Recommendations Regarding the Item Pools

My primary recommendation is that the Mathematics item pool be augmented with more items. The overall pool size is not very large; because of the high-stakes nature of the Maryland Functional Tests, I would recommend that it be expanded to at least 250 items.

The set of new items should be substantially more difficult than the current ones, in order to increase the information of the entire pool at the passing score. Special attention should also be paid to increasing the relative number of items in the Number Concepts domain.

For the Reading item pool, I recommend that the pool size be increased to 250 items, using items of moderate difficulty. In addition, I would suggest that special attention be paid to adding more difficult items to the Locating Information domain.

The Citizenship item pool is in very good shape in terms of both the number of items and the distribution of items within content domains. I have no changes to recommend for this pool.

I have two additional general recommendations regarding the item pools. First, I encourage MSDE staff to review all the items in each of the pools to ensure that (a) the item stem and options are correctly entered and (b) the keyed answer is correct. I have heard concerns expressed regarding the correctness of keyed answers; a review would address these concerns. Second, I believe that the item response theory (IRT) parameters used for the CAT item pools were based on calibrations of paper-and-pencil versions of the items. It is unclear whether computer administration affects the parameters of the items, and the research on this issue is limited. It would be useful to conduct a study investigating the robustness of the item parameters across administration media. I would, however, consider this a lower priority than a review of the items for accurate content and correct keying.

Recommendations Regarding CAT Administration

I believe that the three item pools can be used more effectively through changes in the MicroCAT testing programs. There are two shortcomings in the programs currently being used.

First, each CAT comprises a 30-item test that administers a predetermined number of items from each content domain. For example, the Citizenship CAT administers a sequence of 10 items from the first content domain, then 10 from the second, and finally 10 from the third. While this procedure ensures that a third of each test will come from each content area, it precludes the possibility of variable-length CATs, which are a key advantage of adaptive testing. The second shortcoming of the current CATs is that they use essentially norm-referenced procedures to make criterion-referenced decisions about student proficiency.

There are alternative testing procedures that are better suited to criterion-referenced CATs. One is the adaptive mastery testing (AMT) procedure (Weiss & Kingsbury, 1984). In the AMT procedure, the CAT terminates if the confidence interval around a student's estimated proficiency does not include the passing score, which indicates sufficient certainty regarding the classification of the student's competency status. This procedure results in shorter tests for many students whose proficiency levels lie well above (or below) the passing score, because fewer items are needed to reach the termination criterion. For borderline students, longer tests would be administered, because more information is needed to make the more difficult classification decision. Note that the AMT procedure uses variable-length CATs.
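Expressed as a simple rule (a minimal sketch added for illustration, not MicroCAT code; the 95% confidence multiplier and the function name are assumptions), the AMT termination check after each stage of the test looks like this:

    import math

    def amt_should_stop(theta_hat, test_information, passing_score, z=1.96):
        """Return True when the confidence interval around the proficiency estimate
        no longer contains the passing score, so a pass/fail decision can be made."""
        se = 1.0 / math.sqrt(test_information)           # standard error of the estimate
        lower, upper = theta_hat - z * se, theta_hat + z * se
        return passing_score < lower or passing_score > upper

    # Example: a student estimated well above the Reading cut of 1.75 logits
    print(amt_should_stop(theta_hat=3.0, test_information=6.0, passing_score=1.75))  # True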

Both shortcomings can be overcome by adopting a CAT procedure that combines the AMT procedure with the use of testlets (Wainer & Kiely, 1987). A testlet is a set of items that are administered as a group. Testlets can be formed that are of differing average difficulty, and a CAT can adaptively administer testlets instead of items. After a testlet has been administered, the CAT algorithm looks for the most informative remaining testlet to administer. Because the content domains are represented equally in each of the three subject CATs, testlets could readily be used to maintain content balance. The size of the testlet in each subject area would be equal to the number of content domains; thus, the Mathematics, Reading, and Citizenship CATs would use testlets of size seven, five, and three, respectively. The AMT procedure could then be used to adaptively administer testlets (perhaps with a minimum test length imposed) until the termination criterion was reached, that is, until a student's proficiency confidence interval no longer contained the passing score. This procedure would permit efficient, variable-length testing while maintaining content balance.

To illustrate the testlet-based AMT procedure, consider the Reading test. A group of five-item testlets would be formed; each testlet would contain one item from each content domain. Suppose that a minimum of 15 items was imposed, with a maximum of 30. After a testlet had been administered (beginning with the third), the procedure would check to see whether the termination criterion had been reached. If so, the procedure would end; otherwise, the most informative testlet remaining in the pool would be selected and administered. This process would continue until either the termination criterion was reached or a sixth testlet had been administered, at which point the CAT would end and the pass/fail status of the student would be determined by comparing his or her final proficiency estimate to the passing score; a sketch of this loop follows below. Use of the testlet-based AMT procedure would make more efficient use of the item pool, and items would consequently be exposed at a lower rate. Moreover, the testlets could be reassigned periodically, to help balance the rates at which items are exposed.
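The following sketch (added for illustration; it is not the MicroCAT implementation, and the proficiency-estimation and item-administration steps are left as caller-supplied functions) shows the shape of the testlet-based AMT loop for the Reading test, with five-item testlets, a 15-item minimum, and a 30-item maximum.

    import math

    PASSING_SCORE = 1.75   # Reading cut score, in logits
    MIN_TESTLETS = 3       # 3 testlets x 5 items = 15-item minimum
    MAX_TESTLETS = 6       # 6 testlets x 5 items = 30-item maximum

    def rasch_info(theta, delta):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))
        return p * (1.0 - p)

    def testlet_info(theta, testlet):
        """Information a testlet (a list of item difficulties) provides at theta."""
        return sum(rasch_info(theta, d) for d in testlet)

    def run_testlet_amt(testlet_pool, administer, estimate_theta, z=1.96):
        """testlet_pool: list of testlets (lists of deltas); administer(testlet) returns the
        student's responses; estimate_theta(responses) returns (theta_hat, test_information)."""
        remaining = list(testlet_pool)
        responses = []
        theta_hat, info = 0.0, 0.0
        for t in range(MAX_TESTLETS):
            # Select and administer the most informative remaining testlet at the current estimate.
            best = max(remaining, key=lambda tl: testlet_info(theta_hat, tl))
            remaining.remove(best)
            responses.extend(administer(best))
            theta_hat, info = estimate_theta(responses)
            # Apply the AMT termination criterion once the 15-item minimum has been met.
            if t + 1 >= MIN_TESTLETS:
                se = 1.0 / math.sqrt(info)
                if abs(theta_hat - PASSING_SCORE) > z * se:
                    break
        return theta_hat >= PASSING_SCORE   # pass/fail decision at the end of the CAT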

The testlet-based AMT procedure could readily be programmed using the current version of the MicroCAT testing software. Testlets are called "cluster items" in the MicroCAT scripting language. In addition, code for implementing the AMT procedure in MicroCAT has been published (Roos, Wise, Yoes, & Rocklin, 1996).

Increasing Test Security

Any high-stakes testing program must be concerned about security. The credibility and integrity of the testing program depend on how well the security of the test items is maintained. Because all of the items are accessible in a CAT, particular care must be taken to protect the item pool. MSDE has already experienced the loss of a 240-item pool in Mathematics because the item contents were potentially revealed to an unauthorized individual; use of the Mathematics CAT had to be suspended until a new item pool was developed. The current Citizenship item pool, however, contains all of the available items. If this item pool were to be compromised, the results would be disastrous, because there are no calibrated paper-and-pencil items from which a new pool could be developed. Hence, unless new Citizenship items are developed, MSDE is in a very vulnerable position. Ideally, MSDE should have the resources to develop a backup item pool for each subject test. Moreover, periodic rotation of these pools could further decrease item exposure.

MSDE should review all of its procedures for distributing CAT software and collecting student data. MSDE staff must be very careful to limit who has access to the item pools. Moreover, student CAT administration output files should contain a record of exactly which items each student was administered.

At the item level, to the degree that particular pool items are known to students (or teachers), the validity of the CAT is threatened. This problem can be addressed, in part, by controlling the frequency with which a particular item is exposed. Increasing the size of the item pools and/or adopting testlet-based AMT procedures will help control item exposure. If testlets are used, periodic reassignment of the items to testlets by MSDE staff could be done in a fashion that distributes exposure fairly uniformly across the pool.

Resource Allocation Versus CAT Program Longevity

MSDE currently has limited resources for monitoring and maintaining the CAT versions of the Maryland Functional Tests. Developing new items and backup item pools will require expenditures of time and resources that may be difficult to allocate. Furthermore, there are plans to eventually phase out the Maryland Functional Tests after the Maryland High School Assessment is implemented. Judgments regarding how much effort should be directed toward changing the present CAT tests must take into account the likelihood that the tests may soon be discontinued. I believe, however, that with relatively small expenditures of time and money the current CAT tests can be markedly improved.

References

Roos, L. L., Wise, S. L., Yoes, M. E., & Rocklin, T. R. (1996). Conducting self-adapted testing using MicroCAT. Educational and Psychological Measurement, 56, 821-827.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

At the item level, to the degree to which particular pool items are known to students (or teachers), the validity of the CAT is threatened. This problem can be addressed, in part, by controlling the frequency with which a particular item will be exposed. Increasing the size of the item pools and/or adopting testlet-based AMT procedures will help control item exposure. If testlets are used, periodic reassignment of file items by MSDE staff could be done in a fashion that distributes the exposure fairly uniformly across the pool. Resources Allocation V.ersus CAT Program Longevity MSDE currently has limited resources for monitoring and maintaining the CAT versions of the Maryland Functional Tests. Developing new items and backup item pools will require expenditures of time and resources that may be difficult to allocate. Furthermore, there are plans to eventually phase out the Maryland Functional Tests after the Maryland High School Assessment is implemented. Judgments regarding how much effort should be directed toward changing the present CAT tests must take into account the likelihood that the tests may soon be discontinued. I believe, however, that with relatively small expenditures of time and money the current CAT tests can be markedly improved. References Roos, L. L., Wise, S. L., Yoes, M. E., & Rocklin, T. R. (1996). Conducting self-adapted testing using MicroCAT. Educational and Psychological Measurement, 56, 821-827. Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201. Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.