ASSESSMENT OVERVIEW

Norm-Referenced and Criterion-Referenced Tests

Tests are systematic procedures for assessing behavior under specified conditions. Norm-referenced tests compare people to each other; criterion-referenced tests compare a person's performance to a specified standard. Norm-referenced tests are especially useful in selecting the relatively high and low members of a group. Criterion-referenced tests are useful in identifying those who meet or fail to meet a standard of performance. A good item on a norm-referenced test is one that some pass and some fail; an item that everybody (or nearly everybody) passed would be eliminated from a norm-referenced test. On a criterion-referenced test being used to evaluate instruction, by contrast, such an item might be very valuable. The main use of criterion-referenced tests in education is to evaluate the process of instruction, showing where learning has occurred and where it has not.

There are several approaches to developing criterion-referenced tests. One traditional approach bases the tests on a set of very specific, curriculum-independent instructional objectives (e.g., the Brigance Inventory of Basic Skills and similar tests popular in the 1960s and '70s). This type of test requires teachers to specify what they are trying to teach, to pretest the students, and to get feedback after instruction through post-testing. Such procedures can assist in improving programs and instruction. However, constructing this type of test presents difficulties:

1. Writing objectives and building tests is a lot of work. If every teacher had to do it, the duplicated effort would be considerable.

2. It is difficult to evaluate good and bad test items in the absence of an instructional program. An item might be failed because of problems with program implementation, because the directions and response requirements differ from those the students were taught, because the examples fall beyond the range of those taught, and so forth.

3. It is not possible to specify how specific or how general an objective should be in the absence of an instructional program.

Another approach ties the tests to a specified instructional program. To be maximally useful, tests must be specifically referenced to defined instructional materials, which are in turn aligned with annual achievement expectations (e.g., state or district goals). When this is done, it is possible to monitor the process of instruction throughout the program and to take corrective action proactively, whenever it is needed. Such tests make clear diagnosis of problems and their remediation possible. Also, when working with a program that has previously been demonstrated to work, the test results can be analyzed to determine whether failures are due to program implementation difficulties (most students fail items which have been "taught") or to student difficulties (one or more students fail many tasks on the test which are not failed by most students).
Understanding Test Scores

Tests are used to measure behavior. Measurement is a comparison procedure. In norm-referenced testing, comparisons are made to the performance of other people. In criterion-referenced testing, comparisons are made with a standard of performance. Five statistical concepts were introduced to permit a more detailed examination of the nature of norm-referenced and criterion-referenced scores.

A frequency distribution graphically depicts how many people received which scores on a test. The vertical axis of the graph shows "how many people"; the horizontal axis shows "what scores." Two important statistics for describing the set of scores that make up a frequency distribution are the mean and the standard deviation. The mean (M) is an average: all of the scores (Xs) are summed and then divided by the number of scores (N). This tells us where the "middle" of the frequency distribution is.

    M = (sum of Xs) / N

The standard deviation (SD) is a measure of the degree to which scores in a distribution deviate from the mean. It can be thought of as the average of the deviations from the mean, except that the deviations are squared and then the average is later "unsquared" by taking the square root. The standard deviation tells us how far, on the average, the scores in a distribution spread out from the mean.

The mean and the standard deviation of a distribution can be used to convert each score in the distribution to a standard score (SS). The raw score (X) is expressed as a deviation from the mean and then divided by the standard deviation:

    SS = (X - M) / SD

The sign of a standard score tells you immediately whether the score is above or below the mean. Most raw scores will fall between +3 and -3 on a standard score scale. Standard scores provide one kind of "consistent frame of reference" for comparing scores of individuals within a distribution, and between distributions involving different measurements.
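These formulas can be sketched directly in Python; the raw scores below are hypothetical and used only for illustration:

```python
import math

# Hypothetical raw scores (Xs) for five students.
raw = [10, 12, 14, 16, 18]

def mean(scores):
    # M = (sum of Xs) / N
    return sum(scores) / len(scores)

def standard_deviation(scores):
    m = mean(scores)
    # Average the squared deviations from the mean, then "unsquare"
    # the result by taking the square root.
    return math.sqrt(sum((x - m) ** 2 for x in scores) / len(scores))

def standard_scores(scores):
    m = mean(scores)
    sd = standard_deviation(scores)
    # SS = (X - M) / SD
    return [(x - m) / sd for x in scores]

print(mean(raw))                                      # 14.0
print(round(standard_deviation(raw), 2))              # 2.83
print([round(ss, 2) for ss in standard_scores(raw)])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

Note that the signs of the standard scores immediately show which students fall below and above the mean.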
The most common approach to comparing scores in criterion-referenced testing is to compute the percent right. These scores tell how close one comes to meeting the objective the test was designed to measure. An example involving spelling and math was presented to show how widely different conclusions could be drawn from the test scores using standard scores rather than percent right scores. "How much is much?" depends on the comparison standard. Standard scores discard the absolute level of performance in looking at score distributions. Percent right scores retain this information and are generally more informative when one is concerned with teaching mastery or competency.
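A minimal sketch of how the two kinds of scores can point in different directions. The numbers are hypothetical: a student answers 45 of 50 spelling items correctly in a class whose mean is 48 with a standard deviation of 2.

```python
# Hypothetical spelling-test numbers used to contrast the two score types.
raw_score = 45     # items the student got right
n_items = 50
class_mean = 48.0  # assumed class mean
class_sd = 2.0     # assumed class standard deviation

# Criterion-referenced view: percent right.
percent_right = 100 * raw_score / n_items

# Norm-referenced view: standard score.
standard_score = (raw_score - class_mean) / class_sd

print(percent_right)   # 90.0 -- close to mastery
print(standard_score)  # -1.5 -- well below the class mean
```

The same raw score looks like near-mastery on the percent-right scale but well below average on the standard-score scale, because the standard score discards the absolute level of performance.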
Norm Referenced

Given a set of scores, a mean and standard deviation can be computed to describe the frequency distribution of those scores. The mean and standard deviation can then be used to express raw scores in standard-score form. Standard scores readily tell where a score falls in a frequency distribution relative to other scores.

Criterion Referenced

Given a set of scores, we can compute a mean and standard deviation, but the latter is unlikely to be used even if computed. A frequency distribution can be plotted if desired. Standard scores would not be used. Instead, percent right scores would be computed to see how many students met a criterion of, say, 85 or 90 percent right.

Constructing Curriculum-Embedded Tests

It makes no sense to try to build a test based on an instructional program if in fact no consistent program is followed. So first be sure that the program you are implementing can be used with consistency before spending time building tests to use with it. It is also important in building tests to be able to tell the difference between sets of skills that define a general case and those that do not. A general case has been taught when, after teaching some members of a defined set, all members can be performed correctly. Examples or applications of concepts, operations, or problem-solving rules form general-case sets. Linear-additive sets are skill sets in which each new member has to be taught. This can occur because a rote teaching method is used or because of the inherent structure of the knowledge and skill set. When taught by rote, even language concepts, mathematical operations, and problem-solving strategies form linear-additive sets: learning one member of the set does not teach how to do the others. When a class of skills to be tested involves a general-case set, the class can be described by the characteristics of the set. When a linear-additive set is involved, the specific members of the set must be identified.
The steps to follow in constructing progress tests for an instructional program are these. First, identify the major end skills (annual goals); a scope and sequence chart or a teacher's guide is a good place to start. Second, identify the specific directions and response requirements necessary to show mastery of the objectives. Third, find where each skill is taught (if it is taught); this will lead to an analysis of subskills which may be needed and provide a basis for a flowchart showing where each skill is introduced and how long it is taught. Fourth, divide the skills into pathways. The goal is to show what is being taught when, and how tasks or skills build on each other. A pathway may be defined by a set of skills which are taught in a common format, or as a sequence of skills using different formats which build to a major end goal. Fifth, divide the pathway into testing units; a test for each two weeks of progress may be needed. Sixth, decide which skills to test at each testing cycle. Since not everything can be tested, key building blocks and their consolidation into more complex skills are given priority. Testing also focuses on members of sets which are likely to be confused with recently taught members. The seventh and last step involves construction of the test items.
Five guidelines for the construction of test items were suggested:

1. Test what has been taught.
2. Give preference to the most recently taught items and to highly similar items taught earlier.
3. Do not test a skill unit until it has been taught for three days.
4. Do not test trivial skills, that is, skills which are never used in the program again.
5. Avoid ambiguous instructions.

When selecting instructional programs, you can apply the same analysis procedures used in building tests for an instructional program.

Evaluating Curriculum-Embedded Tests

In evaluating curriculum-embedded tests it is first necessary to decide how many different teaching outcomes you wish to test. This depends on the stage of instruction and on whether you are dealing with outcomes involving general-case or linear-additive sets. Usually, a general-case set provides the basis for one test; exceptions may occur before the set is fully taught. In a linear-additive set, each member should be treated as a separate test unless all members have had a good chance to be taught. In looking at item reliability we are concerned with the degree to which there is performance consistency on items assumed to measure the same thing. For a general-case set, a percent-agreement index can be computed for any pair of items by counting the number of students for whom the two outcomes agree and dividing by the number of students. The average agreement over all possible pairs of items on the test can then be computed. Usually, inspection of the table of plusses and minuses will reveal the problem items without this computation. Low agreement may occur where items are written ambiguously and need revision. It can also occur between subgroups of items which are each consistent within themselves; in this latter case, it is likely that you are testing two different things. With linear-additive sets, each member constitutes a different teaching objective.
An effective measure of reliability would use at least two measures of each member of the set. A percent-agreement index can then be computed for each member of the set and averaged over members. It may be economical to use double-item testing, at least in a tryout form of the test; if good reliability is found, testing with one item for each member of the set is then possible. After all members of a linear-additive set have been taught, it is reasonable to consider testing with only a sample of the set. However, where poor performances are found, a full testing of the set should be undertaken to guide remediation.

In looking at test validity we are concerned with whether the test is a measure of the specific teaching objective. This can be determined logically, by analyzing the content validity of the items, and empirically, by examining the items' sensitivity to instruction.
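The percent-agreement index described above can be sketched as follows; the pass/fail outcomes are made up for illustration:

```python
from itertools import combinations

# Each row: one student's pass (True) / fail (False) outcomes on four items
# assumed to measure the same general-case set (hypothetical data).
results = [
    [True,  True,  True,  True],
    [True,  True,  False, True],
    [False, False, False, False],
    [True,  True,  True,  False],
    [False, False, False, False],
]

def pair_agreement(results, i, j):
    # Percent of students with the same outcome (both pass or both fail)
    # on items i and j.
    agree = sum(1 for row in results if row[i] == row[j])
    return 100 * agree / len(results)

# Average the agreement over all possible pairs of items.
n_items = len(results[0])
pairs = list(combinations(range(n_items), 2))
average_agreement = sum(pair_agreement(results, i, j) for i, j in pairs) / len(pairs)
print(average_agreement)  # 80.0
```

As the text notes, simply inspecting the table of plusses and minuses will usually reveal the problem items (here, item 3) without the computation.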
Content validity is concerned with whether the test item falls in the set of performances defined by a teaching objective. Sensitivity to instruction is demonstrated by showing that an item is not passed prior to instruction and is passed after instruction. Where items are failed both before and after instruction, one has to determine the adequacy of the instruction before a judgment about the test items can be made. Where items are failed before and passed after instruction, it is necessary to evaluate whether the change could have occurred because of instruction taking place elsewhere. A validity index was proposed based on the gain in the percentage passing from pre- to post-testing. The evaluation of a testing procedure should also consider cost (in terms of money and teaching time) and the usefulness of the test information for determining remedial procedures. A crucial caution: it is possible to get good reliability and validity data on curriculum-embedded tests even when the underlying program is defective. Many curricula teach misrules, trivia, or limited cases where a more general case could be taught. A careful examination of what is being taught should come first.

Approaches to Monitoring Student Progress

An effective monitoring system first requires a set of procedures for placing students in a program, or at least ensuring that they have the preskills assumed by the program. Second, a method for identifying progress steps through a curriculum sequence is needed. Third, a procedure for checking the quality of student work on each progress unit is required. This may consist of formal mastery tests, less formal verbal checkouts, or independent work exercises. The fourth requirement of a monitoring system is a set of procedures for guiding students through an instructional sequence and keeping track of where they are.
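The validity index proposed earlier, the gain in the percentage passing from pre- to post-testing, can be sketched with hypothetical counts for a single item:

```python
# Hypothetical counts for one item administered to 20 students.
n_students = 20
passed_pre = 3    # passed the item before instruction
passed_post = 18  # passed the item after instruction

pct_pre = 100 * passed_pre / n_students
pct_post = 100 * passed_post / n_students

# Validity index: gain in percent passing from pre- to post-test.
validity_index = pct_post - pct_pre
print(validity_index)  # 75.0 -- the item is sensitive to instruction
```

A large gain suggests the item is sensitive to instruction; a gain near zero sends you back to examine either the item or the adequacy of the instruction before judging the test.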
Fifth, it helps to motivate progress if goals are established for each student or group of students and a method is devised to visually show progress toward the goal. Finally, a set of procedures for correcting errors or reteaching objectives not mastered is required.

With well-constructed curriculum-embedded tests it is possible to have the information needed to make logical decisions to improve instruction as it proceeds. With a carefully designed sequence of instruction, it is possible to build a testing, reporting, and teacher-coaching system to support that instruction. The important elements of the system are:

1. Placement procedures to get the students started in the program where they need to be.

2. A report of lessons taught, which can be related to days available for teaching and to goals for the year.

3. Curriculum-embedded tests which check the quality of student progress through the program. When quality of progress is considered along with rate of progress (2 above), a strong basis for making instructional decisions exists.

4. Teacher coaches working within this monitoring system are in an excellent position to focus solutions on instructional procedures and to aid the teacher in correcting student performance problems.
Outcomes Evaluation: Criterion- and Norm-Referenced Tests

Criterion-referenced tests are particularly useful in outcome evaluations where different instructional programs can be compared on the same objectives, or where different procedures for teaching the same programs can be compared. In order to interpret the findings from a comparison of programs it is important to: (1) demonstrate that the students in the programs are equivalent on important characteristics, (2) ensure that the programs were implemented with fidelity, (3) evaluate the time devoted to teaching the programs, and (4) consider differential outcomes for sub-objectives. When programs have common objectives but use different directions and response requirements, this difference needs to be considered in constructing tests. When the objectives of two programs are different, interpretation of any comparison across those objectives is most difficult. Criterion-referenced tests are ideally suited for comparing programs on their ability to produce standards of competency.