
An Evaluation of the Item Pools Used for Computerized Adaptive Test Versions of The Maryland Functional Tests

Steven L. Wise, Ph.D.

July, 1997

A Report Prepared for the Assessment Branch of the Maryland State Department of Education

In order to meet state-mandated high school graduation requirements, Maryland students must demonstrate competency in mathematics, writing, reading, and citizenship. The Maryland Functional Tests were developed in the late 1970s to ensure that Maryland's high school graduates were competent in these four core subject areas. Early in the 1990s, the Maryland State Department of Education (MSDE) developed computerized adaptive test (CAT) versions of three of the core area tests (mathematics, reading, and citizenship). The CAT versions have augmented rather than supplanted the paper-and-pencil versions; they are typically used with transfer students and with students who are retaking one or more tests.

The purpose of this report is to present the findings of my evaluation of the item pools used in the CAT versions of the Maryland Functional Tests. The item pools were evaluated with regard to (a) pool size, (b) the adequacy of the test information provided by the pool, and (c) the balance of content domains and item difficulty levels. I provide recommendations regarding both targeted expansion of the item pools and testing methods that will make more effective use of the existing pools. In addition, I discuss the problem of item exposure and the vulnerability of the MSDE item pools.

Item Pool Size and Structure

The success of any CAT program depends largely on the quality of the item pool (sometimes termed an item bank) from which the administered items are drawn. Quality can be conceptualized according to two basic criteria. First, the total number of items in the pool must be sufficient to supply informative items throughout a testing session.

Second, the items in the pool must have characteristics that provide adequate information at the proficiency levels that are of greatest interest to the test developer. This criterion primarily means that, at all important levels of proficiency, there are sufficient numbers of items whose difficulty parameters provide useful information. Thus, a high-quality item pool will contain enough useful items to permit efficient, informative testing at the important levels of proficiency.

When a testing program has developed a large number of items to be used in the CAT, the first criterion is obviously met. Merely having a large number of items, however, does not ensure that the second criterion will be satisfied. Unless the developed items have a distribution of difficulties that is reasonably matched to the important levels of proficiency, there will likely be regions of proficiency in which the test information provided by the CAT accumulates at too slow a rate. In the sense of the pool analogy, the item pool will be too "shallow" in some proficiency region(s). In these regions the CAT will be less efficient, resulting in either higher standard errors of proficiency estimation (for a fixed-length CAT) or a longer test being required to reach a desired level of precision (for a variable-length CAT). An obvious solution to this problem is to develop additional pool items that provide additional information (depth) where it is most needed.

The pool depth issue becomes more complicated when the pool is subdivided into a number of content domains, each of which must be represented to a prespecified degree in the CAT. Ideally, each content domain should exhibit a distribution of item difficulties that resembles that of the entire pool. In practice, however, this is difficult to attain. Items written for domains that represent more elementary knowledge tend to be easier than those written for domains representing more advanced knowledge in a given subject.

Having a larger item pool has an additional advantage: items in the pool tend to be exposed in fewer CAT administrations. The integrity of the CAT depends on the item parameters remaining unchanged. If an item is presented too often, students (and/or teachers) may become familiar with it and prepare for it. This would decrease the item's actual difficulty, which in turn would positively bias proficiency estimation. Hence, item exposure can become a serious problem, particularly in a high-stakes testing program.

Purpose of the Maryland Functional Tests

It should be emphasized that the purpose of the Maryland Functional Tests is to ensure that Maryland's high school graduates are minimally competent in the four core subject areas. That is, they are primarily criterion-referenced tests rather than norm-referenced tests, which is important in assessing the suitability of the subject area item pools for CAT. In minimum competency testing, the primary goal of measurement is to determine whether or not a student has attained a proficiency level that exceeds the minimum passing standard. Differentiation among students whose performance lies above (or below) the passing standard is of secondary importance. This implies that the best measurement needs to occur at the passing score; in an item response theory sense, test information should be maximized at the passing score. My judgments and recommendations regarding the item pools are based on this perspective.

Although it is tempting to adopt the dual goals of having the Maryland Functional Tests be both good minimum competency tests and good norm-referenced tests, I believe that it would be unwise to do so in the case of the CAT versions. The adaptive procedures are quite different in each type of CAT, and if one tried to meet both goals concurrently, then neither goal would likely be met very well.
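To make the idea of "test information at the passing score" concrete, the following sketch (added for illustration only; the item difficulties shown are assumed values, not items from the MSDE pools) computes the Fisher information that Rasch-calibrated items contribute at a cut score and the resulting standard error of the proficiency estimate. Under the Rasch model an item is most informative when its difficulty equals the examinee's proficiency, so a pool serves a minimum competency test best when many of its difficulties lie near the passing score.

    import math

    def rasch_item_information(theta, delta):
        """Fisher information of a Rasch item with difficulty delta at proficiency theta."""
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))  # probability of a correct response
        return p * (1.0 - p)                          # maximized (0.25) when theta == delta

    # Illustrative values only; these deltas are hypothetical, not taken from an MSDE pool.
    passing_score = 2.15                              # Mathematics cut score, in logits
    deltas = [-1.0, 0.0, 1.0, 2.0, 2.3]               # hypothetical item difficulties
    test_information = sum(rasch_item_information(passing_score, d) for d in deltas)
    standard_error = 1.0 / math.sqrt(test_information)  # SE of the proficiency estimate at the cut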

Assessment of the Maryland Functional Test Item Pools

Mathematics

The current Mathematics item pool consists of 180 items from 7 content domains. Figure 1 shows the distribution of item difficulties for each content domain. The passing score for the Mathematics test, transformed into Rasch model logits, is 2.15.

Figure 1. Mathematics Item Pool Census, Broken Down by Content Domain and Item Difficulty (Delta)

                                              Item Difficulty (Delta)
                                  -3.00 to  -2.00 to  -1.00 to  0.00 to  1.01 to  2.01 to  3.01 to  5.01 to
Mathematics Domain                 -2.01     -1.01     -0.01     1.00     2.00     3.00     5.00     9.00    Total
Number Concepts                        1         1         3        3        3        2        0        0       13
Whole Number Operations                3        17         5        1        0        0        0        0       26
Mixed Number/Fraction Operations       0         0         5       18        4        0        0        0       27
Decimal Operations                     0         4         9        9        2        1        0        0       25
Measurement                            1         6         5        4        7        3        1        0       27
Using Data                             2         7         9       11        9        1        1        0       40
Problem Solving                        0         2         5        8        5        1        1        0       22
Total                                  7        37        41       54       30        8        3        0      180

Figure 1 reveals several problems with the Mathematics item pool. The most serious problem is that the pool is too easy. For a minimum competency test with a passing score of 2.15 logits, an effective item pool would have most of its delta values in the region of 2.15. The Mathematics pool, however, has only 8 items (4%) in the 2.01 to 3.00 range, and no more than 41 (23%) within 1.15 logits of the passing score. This indicates that the majority of the items provide relatively little information at the passing score, where it is most needed. This means that (a) it is difficult for a CAT to match items to students whose proficiency is in the vicinity of the passing score and (b) the CATs administered to moderate to high proficiency students will be virtually identical, which creates an item exposure problem.

The distribution of item difficulties also varies substantially across content domains. Three domains exhibited particular problems. Whole Number Operations contains a fairly narrow range of very easy items. The items in the Mixed Number/Fraction Operations domain were generally more difficult, but showed an even narrower range. In addition, there were relatively few items available (13) in Number Concepts.
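The percentages quoted above can be checked directly against the marginal counts in Figure 1. The short sketch below (illustrative only; the bin counts are copied from the table, and the result is an upper bound because items in the 3.01 to 5.00 bin are not necessarily below 3.30 logits) tallies the items whose difficulty bins overlap the interval within 1.15 logits of the 2.15 cut score.

    # Marginal counts from Figure 1: (lower delta, upper delta) -> number of items
    bins = {(-3.00, -2.01): 7, (-2.00, -1.01): 37, (-1.00, -0.01): 41,
            (0.00, 1.00): 54, (1.01, 2.00): 30, (2.01, 3.00): 8,
            (3.01, 5.00): 3, (5.01, 9.00): 0}

    passing_score = 2.15
    window = 1.15   # the interval of interest runs from 1.00 to 3.30 logits

    # Count every bin that overlaps the open interval (1.00, 3.30);
    # this yields the "no more than 41" figure cited in the text.
    near_cut = sum(n for (lo, hi), n in bins.items()
                   if hi > passing_score - window and lo < passing_score + window)
    total = sum(bins.values())
    print(near_cut, total, round(100 * near_cut / total))   # 41 180 23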

Reading

The current Reading item pool consists of 191 items from 5 content domains. Figure 2 shows the distribution of item difficulties for each content domain. The passing score for the Reading test, transformed into logits, is 1.75.

Figure 2. Reading Item Pool Census, Broken Down by Content Domain and Item Difficulty (Delta)

                                              Item Difficulty (Delta)
                                  -3.00 to  -2.00 to  -1.00 to  0.00 to  1.01 to  2.01 to  3.01 to  5.01 to
Reading Domain                     -2.01     -1.01     -0.01     1.00     2.00     3.00     5.00     9.00    Total
Following Directions                   0         0         7       11       10        2        3        0       33
Locating Information                   0         3        11        6        8        0        1        0       29
Main Ideas                             0         0         0        2       12       34       11        0       59
Using Details                          0         0         1       10        9        7        1        0       28
Understanding Forms                    0         0         2       14        9        7        6        4       42
Total                                  0         3        21       43       48       50       22        4      191

The Reading item pool, in contrast to the Mathematics pool, is well centered over the passing score: 141 (74%) of the items lie between logit values of 0 and 3. The only problematic domain was Locating Information, for which the distribution of items was too easy.

Citizenship

The current Citizenship item pool consists of 261 items from 3 content domains. Figure 3 shows the distribution of item difficulties for each content domain. The passing score for the Citizenship test, transformed into logits, is 1.00.

Figure 3. Citizenship Item Pool Census, Broken Down by Content Domain and Item Difficulty (Delta)

                                                  Item Difficulty (Delta)
                                      -3.00 to  -2.00 to  -1.00 to  0.00 to  1.01 to  2.01 to  3.01 to  5.01 to
Citizenship Domain                     -2.01     -1.01     -0.01     1.00     2.00     3.00     5.00     9.00    Total
Constitutional Government                  1         2        17       53       26        4        0        0      103
Principles, Rights, Responsibilities       5        24        30       14        3        1        1        0       78
Politics and Political Behavior            0         2        18       38       22        0        0        0       80
Total                                      6        28        65      105       51        5        1        0      261

As with the Reading item pool, the Citizenship pool is well centered over the passing score; 156 (60%) of the item difficulties lie within 1 logit of the passing score. Moreover, there is a good distribution of item difficulties within each of the three content domains.

Recommendations Regarding the Item Pools

My primary recommendation is that the Mathematics item pool be augmented with more items. The overall pool size is not very large; because of the high-stakes nature of the Maryland Functional Tests, I would recommend that it be expanded to at least 250 items.

The set of new items should be substantially more difficult than the current ones, in order to increase the information of the entire pool at the passing score. Special attention should also be paid to increasing the relative number of items in the Number Concepts domain.

For the Reading item pool, I recommend that the pool size be increased to 250 items, using items of moderate difficulty. In addition, I would suggest that special attention be paid to adding more difficult items to the Locating Information domain.

The Citizenship item pool is in very good shape in terms of both the number of items and the distribution of items within content domains. I have no changes to recommend for this pool.

I have two additional general recommendations regarding the item pools. First, I encourage MSDE staff to review all the items in each of the pools to ensure that (a) the item stem and options are correctly entered and (b) the keyed answer is correct. I have heard concerns expressed regarding the correctness of keyed answers; a review would address these concerns. Second, I believe that the item response theory (IRT) parameters used for the CAT item pools were based on calibrations of paper-and-pencil versions of the items. It is unclear whether computer administration affects the parameters of the items, and the research on this issue is limited. It would be useful to conduct a study investigating the robustness of the item parameters across administration media. I would, however, consider this a lower priority than a review of the items for accurate content and correct keying.

Recommendations Regarding CAT Administration

I believe that the three item pools can be used more effectively through changes in the MicroCAT testing programs. There are two shortcomings in the programs currently being used.

First, each CAT comprises a 30-item test that administers a predetermined number of items from each content domain. For example, the Citizenship CAT administers a sequence of 10 items from the first content domain, then 10 from the second, and finally 10 from the third. While this procedure ensures that a third of each test will come from each content area, it precludes the possibility of variable-length CATs, which are a key advantage of adaptive testing. The second shortcoming of the current CATs is that they use essentially norm-referenced procedures to make criterion-referenced decisions about student proficiency.

There are alternative testing procedures that are better suited to criterion-referenced CATs. One is the adaptive mastery testing (AMT) procedure (Weiss & Kingsbury, 1984). In the AMT procedure, the CAT terminates if the confidence interval around a student's estimated proficiency does not include the passing score, which indicates sufficient certainty regarding the classification of the student's competency status. This procedure results in shorter tests for many students whose proficiency levels lie well above (or below) the passing score, because fewer items are needed to reach the termination criterion. For borderline students, longer tests would be administered, because more information is needed to make the more difficult classification decision. Note that the AMT procedure uses variable-length CATs.
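Expressed as a simple rule (a minimal sketch added for illustration, not MicroCAT code; the 95% confidence multiplier and the function name are assumptions), the AMT termination check after each stage of the test looks like this:

    import math

    def amt_should_stop(theta_hat, test_information, passing_score, z=1.96):
        """Return True when the confidence interval around the proficiency estimate
        no longer contains the passing score, so a pass/fail decision can be made."""
        se = 1.0 / math.sqrt(test_information)           # standard error of the estimate
        lower, upper = theta_hat - z * se, theta_hat + z * se
        return passing_score < lower or passing_score > upper

    # Example: a student estimated well above the Reading cut of 1.75 logits
    print(amt_should_stop(theta_hat=3.0, test_information=6.0, passing_score=1.75))  # True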

Both shortcomings can be overcome by adopting a CAT procedure that combines the AMT procedure with the use of testlets (Wainer & Kiely, 1987). A testlet is a set of items that are administered as a group. Testlets can be formed that are of differing average difficulty, and a CAT can adaptively administer testlets instead of items. After a testlet has been administered, the CAT algorithm looks for the most informative remaining testlet to administer. Because the content domains are represented equally in each of the three subject CATs, testlets could readily be used to maintain content balance. The size of the testlet in each subject area would be equal to the number of content domains; thus, the Mathematics, Reading, and Citizenship CATs would use testlets of size seven, five, and three, respectively. The AMT procedure could then be used to adaptively administer testlets (perhaps with a minimum test length imposed) until the termination criterion was reached, that is, until a student's proficiency confidence interval no longer contained the passing score. This procedure would permit efficient, variable-length testing while maintaining content balance.

To illustrate the testlet-based AMT procedure, consider the Reading test. A group of five-item testlets would be formed; each testlet would contain one item from each content domain. Suppose that a minimum of 15 items was imposed, with a maximum of 30. After a testlet had been administered (beginning with the third), the procedure would check to see whether the termination criterion had been reached. If so, the procedure would end; otherwise, the most informative testlet remaining in the pool would be selected and administered. This process would continue until either the termination criterion was reached or a sixth testlet had been administered, at which point the CAT would end and the pass/fail status of the student would be determined by comparing his or her final proficiency estimate to the passing score; a sketch of this loop follows below. Use of the testlet-based AMT procedure would make more efficient use of the item pool, and items would consequently be exposed at a lower rate. Moreover, the testlets could be reassigned periodically, to help balance the rates at which items are exposed.
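The following sketch (added for illustration; it is not the MicroCAT implementation, and the proficiency-estimation and item-administration steps are left as caller-supplied functions) shows the shape of the testlet-based AMT loop for the Reading test, with five-item testlets, a 15-item minimum, and a 30-item maximum.

    import math

    PASSING_SCORE = 1.75   # Reading cut score, in logits
    MIN_TESTLETS = 3       # 3 testlets x 5 items = 15-item minimum
    MAX_TESTLETS = 6       # 6 testlets x 5 items = 30-item maximum

    def rasch_info(theta, delta):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))
        return p * (1.0 - p)

    def testlet_info(theta, testlet):
        """Information a testlet (a list of item difficulties) provides at theta."""
        return sum(rasch_info(theta, d) for d in testlet)

    def run_testlet_amt(testlet_pool, administer, estimate_theta, z=1.96):
        """testlet_pool: list of testlets (lists of deltas); administer(testlet) returns the
        student's responses; estimate_theta(responses) returns (theta_hat, test_information)."""
        remaining = list(testlet_pool)
        responses = []
        theta_hat, info = 0.0, 0.0
        for t in range(MAX_TESTLETS):
            # Select and administer the most informative remaining testlet at the current estimate.
            best = max(remaining, key=lambda tl: testlet_info(theta_hat, tl))
            remaining.remove(best)
            responses.extend(administer(best))
            theta_hat, info = estimate_theta(responses)
            # Apply the AMT termination criterion once the 15-item minimum has been met.
            if t + 1 >= MIN_TESTLETS:
                se = 1.0 / math.sqrt(info)
                if abs(theta_hat - PASSING_SCORE) > z * se:
                    break
        return theta_hat >= PASSING_SCORE   # pass/fail decision at the end of the CAT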

The testlet-based AMT procedure could readily be programmed using the current version of the MicroCAT testing software. Testlets are called "cluster items" in the MicroCAT scripting language. In addition, code for implementing the AMT procedure in MicroCAT has been published (Roos, Wise, Yoes, & Rocklin, 1996).

Increasing Test Security

Any high-stakes testing program must be concerned about security. The credibility and integrity of the testing program depend on how well the security of the test items is maintained. Because all of the items are accessible in a CAT, particular care must be taken to protect the item pool. MSDE has already experienced the loss of a 240-item pool in Mathematics because the item contents were potentially revealed to an unauthorized individual; use of the Mathematics CAT had to be suspended until a new item pool was developed. The current Citizenship item pool, however, contains all of the available items. If this item pool were to be compromised, the results would be disastrous, because there are no calibrated paper-and-pencil items from which a new pool could be developed. Hence, unless new Citizenship items are developed, MSDE is in a very vulnerable position. Ideally, MSDE should have the resources to develop a backup item pool for each subject test. Moreover, periodic rotation of these pools could further decrease item exposure.

MSDE should review all of its procedures for distributing CAT software and collecting student data. MSDE staff must be very careful to limit who has access to the item pools. Moreover, student CAT administration output files should contain a record of exactly which items each student was administered.

At the item level, to the degree that particular pool items are known to students (or teachers), the validity of the CAT is threatened. This problem can be addressed, in part, by controlling the frequency with which a particular item is exposed. Increasing the size of the item pools and/or adopting testlet-based AMT procedures will help control item exposure. If testlets are used, periodic reassignment of the items to testlets by MSDE staff could be done in a fashion that distributes exposure fairly uniformly across the pool.

Resource Allocation Versus CAT Program Longevity

MSDE currently has limited resources for monitoring and maintaining the CAT versions of the Maryland Functional Tests. Developing new items and backup item pools will require expenditures of time and resources that may be difficult to allocate. Furthermore, there are plans to eventually phase out the Maryland Functional Tests after the Maryland High School Assessment is implemented. Judgments regarding how much effort should be directed toward changing the present CAT tests must take into account the likelihood that the tests may soon be discontinued. I believe, however, that with relatively small expenditures of time and money the current CAT tests can be markedly improved.

References

Roos, L. L., Wise, S. L., Yoes, M. E., & Rocklin, T. R. (1996). Conducting self-adapted testing using MicroCAT. Educational and Psychological Measurement, 56, 821-827.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

At the item level, to the degree to which particular pool items are known to students (or teachers), the validity of the CAT is threatened. This problem can be addressed, in part, by controlling the frequency with which a particular item will be exposed. Increasing the size of the item pools and/or adopting testlet-based AMT procedures will help control item exposure. If testlets are used, periodic reassignment of file items by MSDE staff could be done in a fashion that distributes the exposure fairly uniformly across the pool. Resources Allocation V.ersus CAT Program Longevity MSDE currently has limited resources for monitoring and maintaining the CAT versions of the Maryland Functional Tests. Developing new items and backup item pools will require expenditures of time and resources that may be difficult to allocate. Furthermore, there are plans to eventually phase out the Maryland Functional Tests after the Maryland High School Assessment is implemented. Judgments regarding how much effort should be directed toward changing the present CAT tests must take into account the likelihood that the tests may soon be discontinued. I believe, however, that with relatively small expenditures of time and money the current CAT tests can be markedly improved. References Roos, L. L., Wise, S. L., Yoes, M. E., & Rocklin, T. R. (1996). Conducting self-adapted testing using MicroCAT. Educational and Psychological Measurement, 56, 821-827. Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201. Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.