ASSESSMENT OVERVIEW. Norm-Referenced and Criterion-Referenced Tests


Tests are systematic procedures for assessing behavior under specified conditions. Norm-referenced tests compare people to each other. Criterion-referenced tests compare a person's performance to a specified standard. Norm-referenced tests are especially useful for selecting the relatively high and low members of a group. Criterion-referenced tests are useful for identifying those who meet or fail to meet a standard of performance.

A good item on a norm-referenced test is one that some examinees pass and some fail. An item that everybody (or nearly everybody) passed would be eliminated from a norm-referenced test. On a criterion-referenced test being used to evaluate instruction, however, such an item might be very valuable. The main use of criterion-referenced tests in education is to evaluate the process of instruction, showing where learning has occurred and where it has not.

There are several approaches to developing criterion-referenced tests. One traditional approach bases the tests on a set of very specific, curriculum-independent instructional objectives (e.g., the Brigance Inventory of Basic Skills and similar tests popular in the 1960s and 70s). This type of test requires teachers to specify what they are trying to teach, to pretest the students, and to get feedback after instruction through post-testing. Such procedures can assist in improving programs and instruction. However, constructing this type of test presents difficulties:

1. Writing objectives and building tests is a lot of work. If every teacher had to do it, the wasted effort would be considerable.

2. It is difficult to evaluate good and bad test items without reference to an instructional program. The test might be failed because of problems with program implementation, use of directions and response requirements different from those the students were taught, use of examples beyond the range of those taught, and so forth.

3. It is not possible to specify how specific or how general an objective should be in the absence of an instructional program.

Another approach ties the tests to a specified instructional program. To be maximally useful, tests must be specifically referenced to defined instructional materials, which are in turn aligned with annual achievement expectations (e.g., state or district goals). When this is done, it is possible to monitor the process of instruction throughout the program and to take corrective action whenever it is needed (proactively). Clear diagnosis of problems and their remediation is possible with such tests. Also, when working with a program that has previously been demonstrated to work, it is possible to analyze the test results to determine whether failures are due to program implementation difficulties (most students fail items which have been "taught") or to student difficulties (one or more students fail many tasks on the test which are not failed by most students), as sketched in the example below.
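To make the contrast concrete, here is a minimal Python sketch of how item pass rates might be read for the two purposes described above. The student names, item labels, and cutoff values are hypothetical and not taken from the text; this is an illustration, not a prescribed procedure.

```python
# Illustrative sketch only: names, items, and thresholds are hypothetical.
# results[student][item] = 1 if passed, 0 if failed
results = {
    "Ana":   {"item1": 1, "item2": 1, "item3": 0, "item4": 1},
    "Ben":   {"item1": 1, "item2": 1, "item3": 0, "item4": 1},
    "Carla": {"item1": 1, "item2": 0, "item3": 0, "item4": 1},
    "Dev":   {"item1": 0, "item2": 0, "item3": 0, "item4": 0},
}
items = ["item1", "item2", "item3", "item4"]
n_students = len(results)

# Proportion of students passing each item.
pass_rate = {
    item: sum(results[s][item] for s in results) / n_students for item in items
}

# Norm-referenced view: items nearly everyone passes (or fails) do not help
# spread people out, so a test builder would drop them (thresholds arbitrary).
dropped_for_norm_test = [i for i, p in pass_rate.items() if p >= 0.9 or p <= 0.1]

# Criterion-referenced view: a taught item failed by most students points to a
# program-implementation problem; a student who fails many items that most
# students pass points to an individual student difficulty.
implementation_problems = [i for i, p in pass_rate.items() if p < 0.5]
student_difficulties = [
    s for s in results
    if sum(1 for i in items if results[s][i] == 0 and pass_rate[i] >= 0.5) >= 2
]

print("pass rates:", pass_rate)
print("dropped from a norm-referenced test:", dropped_for_norm_test)
print("possible implementation problems:", implementation_problems)
print("possible student difficulties:", student_difficulties)
```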

Understanding Test Scores

Tests are used to measure behavior. Measurement is a comparison procedure. In norm-referenced testing, comparisons are made to the performance of other people. In criterion-referenced testing, comparisons are made with a standard of performance. Five statistical concepts permit a more detailed examination of the nature of norm-referenced and criterion-referenced scores.

A frequency distribution graphically depicts how many people received what scores on a test. The vertical axis of the graph shows "how many" people; the horizontal axis shows "what scores."

Two important statistics for describing the set of scores that make up a frequency distribution are the mean and the standard deviation. The mean (M) is an average: all of the scores (Xs) are summed and then divided by the number of scores (N). This tells us where the "middle" of the frequency distribution is.

M = (sum of Xs) / N

The standard deviation (SD) is a measure of the degree to which scores in a distribution deviate from the mean. It can be thought of as the average of the deviations from the mean, except that the deviations are squared and then the average is later "unsquared" by taking the square root. The standard deviation tells us how far, on the average, the scores in a distribution spread out from the mean.

The mean and the standard deviation of a distribution can be used to convert each score in the distribution to a standard score (SS). Each raw score (X) is expressed as a deviation from the mean and then divided by the standard deviation.

SS = (X - M) / SD

The sign of a standard score tells you immediately whether the score is above or below the mean. Most raw scores will fall between +3 and -3 on a standard score scale. Standard scores provide one kind of "consistent frame of reference" for comparing scores of individuals within a distribution, and between distributions involving different measurements.

The most common approach to comparing scores in criterion-referenced testing is to compute the percent right. These scores tell how close one comes to meeting the objective the test was designed to measure. An example involving spelling and math was presented to show how widely different conclusions could be drawn from the same test scores depending on whether standard scores or percent-right scores are used: "how much is much" depends on the comparison standard. Standard scores discard the absolute level of performance in looking at score distributions. Percent-right scores retain this information and are generally more informative when one is concerned with teaching mastery or competency. The computations are sketched below.
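A minimal Python sketch of these computations; the raw scores and the assumed 20-item test length are made up for illustration.

```python
import math

# Hypothetical raw scores (X) and test length.
scores = [12, 15, 9, 18, 14, 11, 17, 16]
items_on_test = 20

N = len(scores)
M = sum(scores) / N                                        # M = (sum of Xs) / N
SD = math.sqrt(sum((x - M) ** 2 for x in scores) / N)      # square, average, then "unsquare"

standard_scores = [(x - M) / SD for x in scores]           # SS = (X - M) / SD
percent_right = [100 * x / items_on_test for x in scores]  # criterion-referenced view

print(f"M = {M:.2f}, SD = {SD:.2f}")
for x, ss, pr in zip(scores, standard_scores, percent_right):
    # A positive SS is above the mean, a negative SS below it;
    # percent right shows the absolute level of mastery.
    print(f"X = {x:2d}  SS = {ss:+.2f}  percent right = {pr:.0f}%")
```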

Norm Referenced

Given a set of scores, a mean and standard deviation can be computed to describe the frequency distribution for those scores. The mean and standard deviation can then be used to express raw scores in standard score form. Standard scores readily tell where a score falls in a frequency distribution relative to other scores.

Criterion Referenced

Given a set of scores, we can compute a mean and standard deviation, but it is unlikely that the latter would be used even if computed. A frequency distribution can be plotted if desired. Standard scores would not be used. Instead, percent-right scores would be computed to see how many students met a criterion of, say, 85 or 90 percent right.

Constructing Curriculum-Embedded Tests

It makes no sense to try to build a test based on an instructional program if in fact no consistent program is followed. So be sure first that the program you are implementing can be used with consistency before spending time building tests to use with it.

It is also important in building tests to be able to tell the difference between sets of skills that define a general case and those that do not. A general case has been taught when, after teaching some members of a defined set, all members can be performed correctly. Examples or applications of concepts, operations, or problem-solving rules form general-case sets. Linear-additive sets are skill sets in which each new member has to be taught. This can occur because a rote teaching method is used or because of the inherent structure of the knowledge and skill set. Usually the sets of language concepts, mathematical operations, and problem-solving strategies each form linear-additive sets: learning about some member of the set does not teach how to do the others. When a class of skills to be tested involves a general-case set, the class can be described by the characteristics of the set. When a linear-additive set is involved, the specific members of the set must be identified.

The steps to follow in constructing progress tests for an instructional program are these (a sketch of the resulting structure follows this list):

First, identify the major end skills (annual goals). A scope and sequence chart or a teacher's guide is a good place to start.

Second, identify the specific directions and response requirements necessary to show mastery of the objectives.

Third, find where each skill is taught (if it is taught). This will lead to an analysis of subskills which may be needed and provide a basis for a flowchart showing where each skill is introduced and how long it is taught.

Fourth, divide the skills into pathways. The goal is to show what is being taught when, and how tasks or skills build on each other. A pathway may be defined by a set of skills which are taught in a common format, or as a sequence of skills using different formats which build toward a major end goal.

Fifth, divide each pathway into testing units. A test for each two weeks of progress may be needed.

Sixth, decide which skills to test at each testing cycle. Since not everything can be tested, key building blocks and their consolidation into more complex skills are given priority. Testing also focuses on members of sets which are likely to be confused with recently taught members.

Seventh and last, construct the test items.
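The following is a purely illustrative sketch of steps four and five: it groups invented skills into pathways and two-week testing units. The skill names, lesson numbers, and the ten-lessons-per-unit assumption are hypothetical, not part of the original text.

```python
# Hypothetical pathways: skills in teaching order with the lesson where each is introduced.
pathways = {
    "column addition": [
        ("add 1-digit facts", 1),
        ("add 2-digit, no regrouping", 8),
        ("add 2-digit with regrouping", 15),
    ],
    "story problems": [
        ("identify the question", 5),
        ("choose the operation", 12),
        ("write and solve the equation", 20),
    ],
}

LESSONS_PER_TESTING_UNIT = 10   # roughly two weeks of daily lessons (assumption)

def testing_units(pathway):
    """Group a pathway's skills into testing units by the lesson where each is introduced."""
    units = {}
    for skill, lesson in pathway:
        unit = (lesson - 1) // LESSONS_PER_TESTING_UNIT + 1
        units.setdefault(unit, []).append(skill)
    return units

for name, pathway in pathways.items():
    print(name)
    for unit, skills in sorted(testing_units(pathway).items()):
        # Each unit would be tested at the end of its two-week cycle, giving
        # priority to key building blocks and recently taught skills.
        print(f"  testing unit {unit}: {skills}")
```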

Five guidelines for constructing test items were suggested:

1. Test what has been taught.
2. Give preference to the most recently taught items and to highly similar items taught earlier.
3. Do not test a skill unit until it has been taught for three days.
4. Do not test trivial skills, that is, skills which are never used in the program again.
5. Avoid ambiguous instructions.

When selecting instructional programs, you can apply the same analysis procedures used in building tests for an instructional program.

Evaluating Curriculum-Embedded Tests

In evaluating curriculum-embedded tests it is first necessary to decide how many different teaching outcomes you wish to test. This depends on the stage of instruction and on whether you are dealing with outcomes involving general-case or linear-additive sets. Usually, a general-case set provides the basis for one test; exceptions may occur before the set is fully taught. In a linear-additive set, each member should be treated as a separate test unless all members have had a good chance to be taught.

In looking at item reliability we are concerned with the degree to which there is performance consistency on items assumed to measure the same thing. In dealing with a general-case set, a percent-agreement index can be computed for any pair of items by counting the number of students for whom the two items have the same outcome and dividing by the number of students. The average agreement over all possible pairs of items on the test can then be computed (a small sketch of this computation appears below). Usually, inspection of the table of plusses and minuses will reveal the problem items without this computation. Low agreement may occur where items are written ambiguously and need revision. It can also occur between subgroups of items which are internally consistent; in this latter case, it is likely that you are testing two different things.

With linear-additive sets, each member constitutes a different teaching objective. An effective measure of reliability would use at least two measures of each member of the set. A percent-agreement index can then be computed for each member of the set and averaged over members. It may be economical to use double-item testing, at least in a tryout form of the test. If good reliability is found, then testing with one item for each member of the set is possible. After all members of a linear-additive set have been taught, it is reasonable to consider testing with only a sample of the set. However, where poor performances are found, a full testing of the set should be undertaken to guide remediation.

In looking at test validity we are concerned with whether the test is a measure of the specific teaching objective. This can be determined logically, by analyzing the content validity of the items, and empirically, by examining the items' sensitivity to instruction.
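A minimal sketch of the percent-agreement computation for a general-case set; the pass/fail data (1 = pass, 0 = fail) are invented for the example.

```python
from itertools import combinations

# Hypothetical results: rows are students, columns are items assumed to
# measure the same general case.
item_results = {
    "item_a": [1, 1, 0, 1, 1, 0],
    "item_b": [1, 1, 0, 1, 0, 0],
    "item_c": [1, 0, 0, 1, 1, 0],
}

def percent_agreement(x, y):
    """Share of students with the same outcome (both pass or both fail) on two items."""
    agreements = sum(1 for a, b in zip(x, y) if a == b)
    return agreements / len(x)

pairs = list(combinations(item_results, 2))
pair_agreement = {
    (i, j): percent_agreement(item_results[i], item_results[j]) for i, j in pairs
}
average_agreement = sum(pair_agreement.values()) / len(pair_agreement)

for (i, j), agreement in pair_agreement.items():
    print(f"{i} vs {j}: {agreement:.2f}")
print(f"average agreement over all pairs: {average_agreement:.2f}")
```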

Content validity is concerned with whether the test item falls within the set of performances defined by a teaching objective. Sensitivity to instruction is demonstrated by showing that an item is not passed prior to instruction and is passed after instruction. Where items are failed both before and after instruction, one has to determine the adequacy of the instruction before a judgment about the test items can be made. Where items are failed before and passed after instruction, it is necessary to evaluate whether the change could have occurred because of instruction taking place elsewhere. A validity index was proposed based on the gain in the percentage passing from pre- to post-testing (a sketch of this index appears below). The evaluation of a testing procedure should also consider cost (in terms of money and teaching time) and the usefulness of the test information for determining remedial procedures.

A crucial caution: it is possible to get good reliability and validity data on curriculum-embedded tests even when the underlying program is defective. Many curricula teach misrules, trivia, or limited cases where a more general case could be taught. A careful examination of what is being taught should come first.

Approaches to Monitoring Student Progress

An effective monitoring system first requires a set of procedures for placing students in a program, or at least for ensuring that they have the preskills the program assumes. Second, a method for identifying progress steps through a curriculum sequence is needed. Third, a procedure for checking the quality of student work on each progress unit is required; this may consist of formal mastery tests, less formal verbal checkouts, or independent work exercises. The fourth requirement of a monitoring system is a set of procedures for guiding students through an instructional sequence and keeping track of where they are. Fifth, it helps to motivate progress if goals are established for each student or group of students and a method is devised to show progress toward the goal visually. Finally, a set of procedures for correcting errors or reteaching objectives not mastered is required.

With well-constructed curriculum-embedded tests, the information needed to make logical decisions to improve instruction as it proceeds is available. With a carefully designed sequence of instruction, it is possible to build a testing, reporting, and teacher-coaching system to support that instruction. The important elements of the system are:

1. Placement procedures to get the students started in the program where they need to be.

2. A report of lessons taught, which can be related to the days available for teaching and to the goals for the year.

3. Curriculum-embedded tests which check the quality of student progress through the program. When quality of progress is considered along with rate of progress (item 2 above), a strong basis for making instructional decisions exists.

4. Teacher coaches working within this monitoring system, who are in an excellent position to focus solutions on instructional procedures and to aid the teacher in correcting student performance problems.
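Returning to the validity index mentioned above, here is a minimal sketch of the pre- to post-test gain and the sensitivity-to-instruction readings it supports. The items, pass counts, and 50-percent cutoffs are hypothetical.

```python
# Hypothetical pre/post pass counts for three items on a 20-student class.
pre_post = {
    # item: (number passing at pre-test, number passing at post-test)
    "item_1": (2, 18),
    "item_2": (3, 5),
    "item_3": (17, 19),
}
n_students = 20

for item, (pre_pass, post_pass) in pre_post.items():
    pre_pct = 100 * pre_pass / n_students
    post_pct = 100 * post_pass / n_students
    validity_index = post_pct - pre_pct   # gain in percentage passing

    # Crude reading of sensitivity to instruction (thresholds are arbitrary):
    if pre_pct < 50 and post_pct >= 50:
        note = "sensitive to instruction (check whether instruction occurred elsewhere)"
    elif post_pct < 50:
        note = "failed before and after: check adequacy of instruction first"
    else:
        note = "largely passed before instruction: item tells little about this program"
    print(f"{item}: gain = {validity_index:.0f} points; {note}")
```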

Outcomes Evaluation: Criterion- and Norm-Referenced Tests

Criterion-referenced tests are particularly useful in outcome evaluations where different instructional programs can be compared on the same objectives, or where different procedures for teaching the same programs can be compared. In order to interpret the findings from a comparison of programs it is important to: (1) demonstrate that the students in the programs are equivalent on important characteristics, (2) ensure that the programs were implemented with fidelity, (3) evaluate the time devoted to teaching the programs, and (4) consider differential outcomes for sub-objectives (illustrated in the sketch below).

When programs have common objectives but use different directions and response requirements, this difference needs to be considered in constructing the tests. When the objectives of two programs are different, interpretation of any comparison across the different objectives is most difficult. Criterion-referenced tests are ideally suited for comparing programs on their ability to produce standards of competency.
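A small, hypothetical illustration of point (4): comparing two programs sub-objective by sub-objective rather than on a single overall score. The program labels, objectives, and percentages are invented.

```python
# Hypothetical percentage of students meeting an 85%-right criterion on each
# shared sub-objective, for two programs A and B.
percent_meeting_criterion = {
    # sub-objective: (Program A, Program B)
    "regrouping in addition":   (88, 71),
    "reading word problems":    (62, 80),
    "writing number sentences": (75, 74),
}

for objective, (prog_a, prog_b) in percent_meeting_criterion.items():
    difference = prog_a - prog_b
    # Differential outcomes: one program may win on some sub-objectives and
    # lose on others, which an overall average would hide.
    print(f"{objective}: A = {prog_a}%, B = {prog_b}%, difference = {difference:+d} points")
```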