RESEARCH DESIGN AND METHODOLOGY SECTION. Generalizability of Oral Reading Fluency Measures: Application of G Theory to Curriculum-Based Measurement

Similar documents
OVERVIEW OF CURRICULUM-BASED MEASUREMENT AS A GENERAL OUTCOME MEASURE

QUESTIONS ABOUT ACCESSING THE HANDOUTS AND THE POWERPOINT

Using CBM for Progress Monitoring in Reading. Lynn S. Fuchs and Douglas Fuchs

Evidence for Reliability, Validity and Learning Effectiveness

Progress Monitoring & Response to Intervention in an Outcome Driven Model

NCEO Technical Report 27

Alpha provides an overall measure of the internal reliability of the test. The Coefficient Alphas for the STEP are:

PROGRESS MONITORING FOR STUDENTS WITH DISABILITIES Participant Materials

BENCHMARK TREND COMPARISON REPORT:

Further, Robert W. Lissitz, University of Maryland Huynh Huynh, University of South Carolina ADEQUATE YEARLY PROGRESS

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

How to Judge the Quality of an Objective Classroom Test

NCSC Alternate Assessments and Instructional Materials Based on Common Core State Standards

Psychometric Research Brief Office of Shared Accountability

On-the-Fly Customization of Automated Essay Scoring

THE PENNSYLVANIA STATE UNIVERSITY SCHREYER HONORS COLLEGE DEPARTMENT OF MATHEMATICS ASSESSING THE EFFECTIVENESS OF MULTIPLE CHOICE MATH TESTS

Norms How were TerraNova 3 norms derived? Does the norm sample reflect my diverse school population?

Extending Place Value with Whole Numbers to 1,000,000

Document number: 2013/ Programs Committee 6/2014 (July) Agenda Item 42.0 Bachelor of Engineering with Honours in Software Engineering

Using CBM to Help Canadian Elementary Teachers Write Effective IEP Goals

CAAP. Content Analysis Report. Sample College. Institution Code: 9011 Institution Type: 4-Year Subgroup: none Test Date: Spring 2011

The Oregon Literacy Framework of September 2009 as it Applies to grades K-3

STA 225: Introductory Statistics (CT)

CONTINUUM OF SPECIAL EDUCATION SERVICES FOR SCHOOL AGE STUDENTS

Social Emotional Learning in High School: How Three Urban High Schools Engage, Educate, and Empower Youth

Algebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview

Ph.D. in Behavior Analysis Ph.d. i atferdsanalyse

TU-E2090 Research Assignment in Operations Management and Services

A Study of Metacognitive Awareness of Non-English Majors in L2 Listening

success. It will place emphasis on:

The My Class Activities Instrument as Used in Saturday Enrichment Program Evaluation

Learning By Asking: How Children Ask Questions To Achieve Efficient Search

Proficiency Illusion

Assessing Functional Relations: The Utility of the Standard Celeration Chart

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING

Conceptual and Procedural Knowledge of a Mathematics Problem: Their Measurement and Their Causal Interrelations

Early Warning System Implementation Guide

Peer Influence on Academic Achievement: Mean, Variance, and Network Effects under School Choice

ACADEMIC AFFAIRS GUIDELINES

Lecture 1: Machine Learning Basics

The Good Judgment Project: A large scale test of different methods of combining expert predictions

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District

Math 96: Intermediate Algebra in Context

A cautionary note is research still caught up in an implementer approach to the teacher?

Lahore University of Management Sciences. FINN 321 Econometrics Fall Semester 2017

A Note on Structuring Employability Skills for Accounting Students

Sheila M. Smith is Assistant Professor, Department of Business Information Technology, College of Business, Ball State University, Muncie, Indiana.

Tun your everyday simulation activity into research

Linking the Common European Framework of Reference and the Michigan English Language Assessment Battery Technical Report

Learning Disabilities and Educational Research 1

I N T E R P R E T H O G A N D E V E L O P HOGAN BUSINESS REASONING INVENTORY. Report for: Martina Mustermann ID: HC Date: May 02, 2017

SSIS SEL Edition Overview Fall 2017

Hierarchical Linear Models I: Introduction ICPSR 2015

EVALUATING MATH RECOVERY: THE IMPACT OF IMPLEMENTATION FIDELITY ON STUDENT OUTCOMES. Charles Munter. Dissertation. Submitted to the Faculty of the

Identifying Students with Specific Learning Disabilities Part 3: Referral & Evaluation Process; Documentation Requirements

SETTING STANDARDS FOR CRITERION- REFERENCED MEASUREMENT

VIEW: An Assessment of Problem Solving Style

PSYC 620, Section 001: Traineeship in School Psychology Fall 2016

Preprint.

Mandarin Lexical Tone Recognition: The Gating Paradigm

IEP AMENDMENTS AND IEP CHANGES

Field Experience Management 2011 Training Guides

MSW POLICY, PLANNING & ADMINISTRATION (PP&A) CONCENTRATION

Instructional Intervention/Progress Monitoring (IIPM) Model Pre/Referral Process. and. Special Education Comprehensive Evaluation.

Initial English Language Training for Controllers and Pilots. Mr. John Kennedy École Nationale de L Aviation Civile (ENAC) Toulouse, France.

ISD 2184, Luverne Public Schools. xcvbnmqwertyuiopasdfghjklzxcv. Local Literacy Plan bnmqwertyuiopasdfghjklzxcvbn

w o r k i n g p a p e r s

THEORY OF PLANNED BEHAVIOR MODEL IN ELECTRONIC LEARNING: A PILOT STUDY

Miriam Muñiz-Swicegood Arizona State University West. Abstract

WHY SOLVE PROBLEMS? INTERVIEWING COLLEGE FACULTY ABOUT THE LEARNING AND TEACHING OF PROBLEM SOLVING

Practices Worthy of Attention Step Up to High School Chicago Public Schools Chicago, Illinois

SPECIALIST PERFORMANCE AND EVALUATION SYSTEM

King-Devick Reading Acceleration Program

Wonderworks Tier 2 Resources Third Grade 12/03/13

EQuIP Review Feedback

Research Design & Analysis Made Easy! Brainstorming Worksheet

Running head: LISTENING COMPREHENSION OF UNIVERSITY REGISTERS 1

Causal Relationships between Perceived Enjoyment and Perceived Ease of Use: An Alternative Approach 1

Concept mapping instrumental support for problem solving

LANGUAGE TESTING: RECENT DEVELOPMENTS AND PERSISTENT DILEMMAS

MERGA 20 - Aotearoa

Grade Dropping, Strategic Behavior, and Student Satisficing

Improving recruitment, hiring, and retention practices for VA psychologists: An analysis of the benefits of Title 38

CONNECTICUT GUIDELINES FOR EDUCATOR EVALUATION. Connecticut State Department of Education

Executive Summary. Laurel County School District. Dr. Doug Bennett, Superintendent 718 N Main St London, KY

Technical Manual Supplement

ECON 365 fall papers GEOS 330Z fall papers HUMN 300Z fall papers PHIL 370 fall papers

DOES OUR EDUCATIONAL SYSTEM ENHANCE CREATIVITY AND INNOVATION AMONG GIFTED STUDENTS?

Chapters 1-5 Cumulative Assessment AP Statistics November 2008 Gillespie, Block 4

Analyzing Linguistically Appropriate IEP Goals in Dual Language Programs

Aviation English Training: How long Does it Take?

Empowering Students Learning Achievement Through Project-Based Learning As Perceived By Electrical Instructors And Students

Massachusetts Department of Elementary and Secondary Education. Title I Comparability

Applying Florida s Planning and Problem-Solving Process (Using RtI Data) in Virtual Settings

FY year and 3-year Cohort Default Rates by State and Level and Control of Institution

University of Waterloo School of Accountancy. AFM 102: Introductory Management Accounting. Fall Term 2004: Section 4

Strategic Practice: Career Practitioner Case Study

Summary results (year 1-3)

Technology and Assessment Study Collaborative

Transcription:

School Psychology Quarterly, Vol. 15, No. 1, 2000, pp. 52-68 RESEARCH DESIGN AND METHODOLOGY SECTION Generalizability of Oral Reading Fluency Measures: Application of G Theory to Curriculum-Based Measurement John M. Hintze University of Massachusetts at Amherst Steven V. Owen University of Connecticut Edward S. Shapiro Lehigh University Edward J. Daly III Western Michigan University The purpose of this study was to demonstrate the use of Generalizability (G) theory as an alternative method of validating direct behavioral measures. Reliability and validity from a classic test score theory are explored and rephrased in terms of G theory. Two studies that used oral reading fluency measures within a curriculum-based measurement (CBM) approach are examined with G theory. Results indicate that CBM oral reading fluency measures are highly dependable and can be reliably used to make both between individual (nomothetic) and within individual (idiographic) decisions. Formulated by Cronbach, Gleser, Nanda, and Raj aratnam (1972), Generalizability (G) theory is a statistical technique developed specifically to assess the dependability of behavioral measurements. As an alternative to classical test score theory, G theory allows researchers to examine multiple sources of error simultaneously (e.g., across different occasions, materials, or examiners) (Cronbach et al., 1972; Shavelson& Webb, 1991; Shavelson, Webb, & Rowley, 1989; Suen, 1990). In the process, G theory provides a summary coefficient reflecting the level of dependability a generalizability coefficient analogous to classical test score theory's reliability coefficient which partitions variance that can be attributed to various sources of error (e.g., different raters, occasions, test forms, etc.). Unlike classic test score theory, which attributes everything not explained by true score variance Address correspondence to John M. Hintze, Ph.D., University of Massachusetts at Amherst, School of Education, School Psychology Program, Amherst, MA 01003; E-mail: hintze@educ.umass.edu 52

GENERAUZABILITY OF ORAL READING FLUENCY MEASURES 53 to error, G theory allows researchers to estimate proportions of variance attributable to environmental arrangements and contexts (Burns, 1998). G theory also provides information about the precision of decisions made with various measurement techniques. Thus, the reliability of decisions made about the relative standing of individuals (e.g., "John scored higher than 95% of his peers."), and decisions about an individual's absolute performance across a variety of contexts (e.g., "John's score would be expected to be similar from one week to the next, across different forms of the test, in a different environments, etc.") can be ascertained. G theory is able to provide not only estimates of overall reliability ("true score") and error ("residual"), but it also allows the researcher to explore the dependability of decisions made with measures from both intra- and interindividual perspectives. These added features make G theory particularly relevant to behavioral assessment measures, which often use repeated measurement over time, in a variety of contexts with multiple raters. THEORY AND APPLICATION OF GENERALIZABILITY (G) THEORY G theory is different from classic test score theory in a number of important respects. Within G theory, reliability and error variance are considered within the context of the testing situation (Suen, 1990). Conceptualizing reliability in this manner is in direct contrast to classic test score theory, which describes measurement error as random without any specific context (Allen & Yen, 1979). G theory attempts to specify portions of error that can be accounted for by various situational variables under which the measurements were taken; it identifies and explains portions of variance that in classic test score theory are simply attributed to random error. A direct consideration of measurement error is an assumption that is particularly relevant to behavioral assessment measures which often assume a domain sampling approach where alternate forms of similar measures are randomly drawn from a total universe of possible items that are administered repeatedly over time by different raters within a variety of contexts (Suen, 1990). As an example of such an approach, curriculum-based measurement (CBM) progress monitoring in reading uses a series of alternate form reading passages sampled from a larger universe of all potential reading samples (typically those that students are expected to read by the end of a given school year) administered repeatedly over time (typically once or twice per week) by a teacher, instructional support team member, etc. (Fuchs & Deno, 1991). Within this context, G theory can provide a coefficient that is interpreted similarly to the reliability coefficient of classical test score theory (Allen & Yen, 1979). In this manner, the G coefficient would provide an indication of the dependability of the CBM measures, with higher coefficients suggesting stronger and more robust measurement properties. The second advantage of G theory is its ability to assess multiple sources of measurement error. To assess these multiple sources of error, G theory uses a repeated measures ANOVA design that allows researchers to estimate a variance compo-

54 HINTZE ET AL. nent (a ) for each source of variation in observed scores. In the CBM progress monitoring example, variance components for persons (here, true individual differences among students), occasions (repeated measures over time), raters (evaluators), and residual (unexplained error) may be calculated. By partitioning variability in this manner, G theory enables the researcher to pinpoint the major sources of measurement error and estimate the relative magnitude of each source. What is considered unexplained error in classic test score theory may be partitioned into distinct components in G theory. By identifying such sources of measurement variability, testing personnel are provided with information that can be used to improve assessment procedures and the characteristics of tests themselves. Perhaps most importantly, G theory provides a mechanism by which researchers can examine the nature and fidelity of decisions made with the scores under study. Termed Decision (D) studies, the resultant findings provide the researcher with information about the usefulness of the measure in producing data for making interindividual and intraindividual decisions. Interindividual decisions are ones that address "how much better" one individual performed compared to another (Shavelson & Webb, 1991). These studies are similar to the between-individual comparisons that are made within behavioral assessment. Intraindividual decisions focus on "how well" an individual can perform, regardless of the performance of his or her peers (Shavelson & Webb, 1991). Again, such analyses explore the type of idiographic decisions that are made with behavioral assessment data. For interindividual decisions, only those variance components that influence the relative standing of the individual within the group are used for analysis. These components include variables that may interact with a person's score (e.g., the effect of different test forms or evaluators on a person's score, or the effect of the time of testing on a person's score) and are considered in comparison with what is known about how individuals typically respond to the measurement device. The results of this analysis yield a coefficient that signals the credibility of the decisions that involve comparing individuals. For intraindividual decisions, all facets of variance are considered in comparison with what is known about individual variation associated with the measurement device. The results of such an analysis yield a coefficient that suggests the strength of within-individual decisions that can be made (i.e., across times, test forms, etc.) regardless of an individual's ranking within a group. In the case of CBM in reading, D studies can provide valuable information regarding the appropriateness of using oral reading fluency measures to make both between-individual and within-individual decisions. The purpose of this article is twofold. The first objective is to provide a workable example of G theory in practice that illustrates its usefulness with direct behavioral measurements. Data from two previously published studies in the area of CBM will be used for illustrative purposes. The second objective is to evaluate the technical merits of the CBM oral reading fluency metric from a G theory perspective. Of particular importance is assessing the dependability of CBM measures and examining

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 55 the variability that can be accounted for across time, curricula, and difficulty levels. Based on extant literature, we hypothesize that the CBM oral reading fluency measure would prove highly dependable and would not be influenced unduly by sources of error because of variations in curricula or the time series nature of the measurement system. In addition, in the current study we were interested in assessing the practicality of using oral reading fluency measures to make both inter- and intraindividual decisions. Study 1 Participants and Procedures METHOD The primary purpose of Study 1 was to ascertain whether CBM procedures, typically constructed for use in a traditional skills-based basal reading series, would be sensitive to progress in literature-based basal reading series over time. Participants were 160 general education students from 31 second- through fifth-grade classrooms, located in two different school districts. All students received primary reading instruction in their general education classroom. Half of the students (20 from each grade) were instructed primarily in a literature-based reading series, whereas the remaining 80 participants (20 from each grade) were instructed primarily in a traditional skills-based reading program. Progress monitoring sessions were conducted twice a week over an 8-week period. Missed probe sessions were not made up. At each session, students were provided with two reading passages, one from each of the reading series (i.e., literature- and traditional skills-based) corresponding to the long-term goal level material of the student's respective grade. Order of presentation of reading passages was counterbalanced across sessions for each student. The number of words read correctly per minute on each reading passage served as the oral reading fluency outcome datum for each individual probe session. Data Analysis G study. The process through which the magnitudes of error associated with each facet in a measurement design are estimated is referred to as the G study (Suen, 1990). To estimate the magnitude of the various sources of measurement error, the data in the current study were analyzed using a repeated measures ANOVA (BMDP 8 V; Dixson, 1992). The purpose of this ANOVA was to calculate variance components for the object of measurement (represented by persons), different fac- 1. Portions of these data have been previously published by Hintze and Shapiro (1997). The current work represents new and previously unpublished analyses of the data.

56 HINTZE ET AL. TABLE 1. Estimates of Variance Components for Study 1 Facet n Estimated Percentage Variance of Total Component Variance Person (CT2P) 160 1013.89.48 Grade (a2g) 4 399.65.19 Method (a2m) 2 0.93.00 Occasion (a20) 16 19.26.01 Person x Method (s2p,m) 2 0.43.00 Person x Occasion (a2p,0) 16 154.99.07 Method x Occasion (CT2m,0) 32 16.49.01 Grade x Method (a2g>m) 8 2.02.00 Grade x Occasion (a2g,0) 64 20.39.01 Grade x Method x Occasion (a2gjmi0) 128 32.71.02 Person x Grade x Method x Occasion + 128 454.48.21 Residual (C72p,g,m,o,e) Total 2115.24 1.00 Note. "Person" refers to participants in the study, "grade" refers to the four different grade levels, "method" refers to the two different reading series used for progress monitoring, "occasion" refers to the 16 repeated progress monitoring sessions conducted over the 8-week period, n = the number of contributors to the variance component. ets of measurement (represented by different grades, alternate forms, and repeated measures taken over time), and the interactions among persons and facets. Data from missing probe sessions were imputed through regression. The resulting variance components indicated the expected degree of score variation for a single level of each facet. For example, the proportion of error that can be explained by the average subject, a single form of CBM, and a single occasion of measurement. Table 1 provides the variance component estimates for the current study. The greatest amount of observed variation in oral reading fluency scores is explained by individual variation among the participants (a p) and developmental changes across grades (a g) (approximately 48 and 19% respectively). The variance attributed to the type of material used for progress monitoring (a m), repeated measures over time (a 0), and the interaction of persons by method (a p m), persons by occasion (a p,0), method by occasion (a m)o), grade by method (a g,m), grade by occasion (a g,o), and grade by method by occasion (a g,m,o) were considerably smaller (combined total of approximately 12%). Lastly, the variance attributed to the residual term (a p,m,o,g,e) was low, about 21% of the observed variation in oral reading scores. What this analysis indicates is that of the total amount of variance present, the bulk of variation in oral reading fluency is accounted for by individual variation 2. In nested designs not all sources of variability can be estimated because of confounding. As such, not all possible interactions are noted.

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 57 among the participants and expected developmental changes across grades, and very little is attributable to the CBM progress monitoring methods or unexplained sources of variability. In the current framework, this suggests that much of the observed variation is a result of individual differences among the students themselves and developmental changes in oral reading fluency, and comparatively little to the assessment method itself standard expectations of any worthwhile assessment method. D study. Recall that in addition to partitioning variance associated with various facets of measurement, G theory allows the researcher to calculate a generalizability coefficient analogous to classical test score theory's reliability coefficient. There are two types of G coefficients that can be calculated, depending on how the researcher intends to use the scores. In the case of intra-individual or within-individual decision making, the researcher needs to calculate (/absolute- The first step in this procedure is rather straightforward and is expressed as a2 - g gj go apm ap amo, ggm, ggo agmo Ppgmoe,»g «m no»m «o "m"o "g"m "g«o «g«nt«o "g«m"o«e 2 2 where a abs is the total amount of measurement error for absolute decisions; a is a variance component from the ANOVA source table (see Table 1); and n is the number of contributors to the variance component (e.g., a 0 has 16 contributors, sono= 16). In the current study, this is represented as 2 _ 39925 0.93 1926 0.43 154.99 16.49 2.02 ^ 20.39 32.71 454.48 _ Utfw aabs 4 ~ +~ T ~ + 16 +~ 2 ~ + 16 + 32 +~ 1 T + 64 + 128 + 128 Next, an adaptation to classic test score theory, which expresses reliability as the ratio of observed score variance to true score variance plus error, is made to reflect what proportion of the total variance is because of true score variance. Using the language of generalizability theory this may be expressed as In the current study, this translates to 2 Gabs =, 2 " 2 x (3) (gp+gabs) G - 1013-89 (4) abs 1013.89 + 11631 which produces a generalizability coefficient of.90. This result suggests that researchers and practitioners can expect a high level of dependability of measurement for making intraindividual decisions when implementing CBM in Grades 2 through 5 over the course of 8 weeks (16 progress monitoring sessions) in two different sets of monitoring materials. Those, however, who routinely use CBM will

58 HINTZE ET AL. be quick to note that, as specified, CBM is not typically conducted in two sets of monitoring materials. Furthermore, individual decisions are usually made across a maximum of two grade levels, not four as was the case in the example study. In this case, the G coefficient of.90 may be an artifact of the specific research design and may not translate directly to practice. Fortunately, one of the added features of D studies is that parameters within the study can be isolated and examined analogously to using the Spearman-Brown formula in classic test score theory. For example, in the current study the effects of using only one source of monitoring materials (which is typical of CBM progress monitoring) in two grade levels may be predicted. The only change that is required is adjusting the n values that appear in the first equation to reflect one grade level and one set of progress monitoring materials over eight progress monitoring sessions. Inserting these values into Equation 1 produces: 2 399.25 0.93 19.26 0.43 154.99 16.49 2.02 20.39 32.71 454.48 - _0,r\ abs 2 1 16 1 16 16 2 32 32 32 or a2abs = 229.78. Inserting this new value of a2abs into Equation 3 yields C.= ")l3-89 (6) 1013.89 + 229.78 or a G coefficient of.82. The results of this analysis suggest that researchers and practitioners can expect adequate levels of dependability with CBM progress monitoring as is conducted typically with data that can be used to make within-individual decisions over the course of an 8-week period. Overall, the results of both absolute D studies make it quite clear that the dependability of the CBM measurement system for use in making individual decisions is quite strong. Clearly, researchers and practitioners can feel secure in the dependability of the CBM oral reading metric as it is currently used in monitoring individual progress over time, across a variety of curricula and grades. As the second step in the D study, the researcher may also want to explore the dependability of relative or interindividual decision making. Here, as in the case of intraindividual decisions, the researcher needs to calculate Grelative- This formula is simpler because it only concerns variance that has to do with the rank ordering of persons. As such, 2 _ ^pit > apgmoe /-TX okl 1- i {/) In the current study this is represented as 2 _ 0.43 154.99 454.48 Q1 (O) rel 2 16 128

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 59 which reduces to a rel - 13.47. The formula for Greiative changes slightly so that G"'=Rfe (9) In the current study, this once again translates to C - 1013-89 (10) rel 1013.89 + 13.47 or more simply, a generalizability coefficient of.99. This result suggests that researchers and practitioners can expect an exceedingly high level of dependability of measurement for making interindividual decisions when implementing CBM over the course of 8 weeks (16 progress monitoring sessions) in two different sets of monitoring materials in Grades 2 through 5. As in the case of absolute D study, it makes sense to examine the relative decision-making power when only one set of materials is used for progress monitoring over a shorter period. As in the absolute D study, the only change that is required is adjusting the n values as they appear in Equation 7 to reflect this change in design. Making these changes produces 2 0.43 154.99 454.48 /1t, Grel = + + (11) rel 1 8 128 or a rel = 23.35. Inserting this new value of a rel into Equation 9 yields C= l0'3-89 (12) 1013.89 + 2335 or more simply, a generalizability coefficient of.98. Thus, researchers and practitioners can expect high levels of dependability with CBM progress monitoring as typically conducted with data that can be used to make interindividual decisions in as little as 4 weeks. Using only three reading passages (as is typically done in survey level assessment and developing local norms) the generalizability coefficient is.95. Researchers and practitioners can place a high level of trust in the dependability of the CBM oral reading fluency metric as it is currently used in developing local norms, identifying and certifying problems between individuals, and estimating performance discrepancies of students within local curricula. Clearly, such reliability coefficients are as good if not better than most published norm-referenced materials used for similar purposes. Study 2 Participants and Procedures The purpose of the second study was to compare the growth rates obtained using instructional and challenging level long-term goal level material with CBM progress

60 HINTZE ET AL. TABLE 2. Estimates of Variance Components for Study 2 Facet n Estimated Percentage Variance Total Component Variance Person (CT2P) 80 1387.28.42 Grade (a2g) 4 1187.54.36 Method (o-2m) 2 20.78.01 Occasion (a20) 20 52.01.02 Person x Method (CT2p,m) 2 29-42 01 Person x Occasion (a2p,0) 20 146.71.04 Method x Occasion (a2m,o) 40 15.30.00 Grade x Method (a2g,m) 8 16.18.00 Grade x Occasion (a2g,0) 80 90.17.03 Grade x Method x Occasion (a2g>m,0) 160 81.26.02 Person x Grade x Method x Occasion + 160 300.94.09 Residual (<T2p,g,m,o,e) Total 3327.59 1.00 Note. "Person" refers to participants in the study; "grade" refers to the four different grade levels; "method" refers to the two different reading series used for progress monitoring; "occasion" refers to the repeated progress monitoring sessions conducted over the 10-week period, n = the number of contributors to the variance component. monitoring. Participants included 80 students from 12 first through fourth-grade classrooms located in one elementary school. Of the sample, 88% of the students received their reading instruction within general education classroom with no supplementary assistance, and 12% received either remedial or special educational services in reading outside the classroom, in addition to their instruction in the general education classroom. Progress monitoring sessions were conducted twice a week over a 10-week period. Missed probe sessions were not made up. At each session, students were provided with two reading passages, one from the instructional level and one from the challenging level of the reading basal used in the school at each grade. Order of presentation of reading passages was counterbalanced across sessions for each student. The number of words read correctly per minute on each reading passage served as the oral reading fluency outcome datum for each individual probe session. G Study. To estimate the magnitude of the various sources of measurement error, the data were analyzed through a repeated measures ANOVA (BMDP 8 V; Dixson, 1992). Because missed progress monitoring sessions were not made up, missing data were imputed with regression substitution. Table 2 presents the variance com- 3. Portions of these data may also be found in Hintze, Daly, and Shapiro (1998). The current work represents new and previously unpublished analyses of the data.

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 61 ponent estimates. Individual variation among the participants (a p) and developmental differences across grade (a g) contributed to the greatest amount of observed variation in oral reading fluency scores (approximately 42 and 36% respectively). The variance attributed to the type of material used for progress monitoring (a m), repeated measures over time (a o), and the interaction of persons by method (a n,m), persons by occasion (a n,o), method by occasion (a m,o), grade by method (a g,m), grade by occasion (a g,0), and grade by method by occasion (cr g,m,o) were considerably smaller (combined total of approximately 13%). Furthermore, the variance attributed to the residual term (a p,m,o,g,e) was low, explaining only about 9% of the observed variation in oral reading scores. Results of the G study suggest that of the total amount of variance present, roughly three-quarters of the variation in oral reading fluency was accounted for by variation within individuals and developmental differences across grades, and very little to the CBM progress monitoring procedures themselves or by unexplained error. These results concur with those from Study 1 and attest to the construct validity of oral reading fluency as it is used in CBM. D Study. To explore the dependability of the CBM progress-monitoring procedures for decision making, both absolute and relative decision studies were conducted. First, to estimate Gabsolute the proper terms from Table 2 are inserted into Equation 1 as 2 1187.56 20.78 52.01 29.42 146.71 15.30 16.18 90.17 8126 300.94 n ~ 4 2 20 2 20 40 8 80 160 160 which reduces to a abs = 337.85. Second, to estimate Gabsolute this information is introduced into Equation 3 so that C. - 1387'28 (14) 1387.28+337.85 which gives a generalizability coefficient of.80. This result suggests that researchers and practitioners can expect adequate dependability of measurement for making intraindividual decisions when implementing CBM over the course of 10 weeks across both instructional and long-term goal level material. Further analysis using a case scenario in which only one set of progress monitoring materials are used over two grade levels can be explored much in the same manner as in Study 1. Adjustments to Equation 13 to reflect such a change in a abs would appear as 2 1187.56 20.78 52.01 29.42 146.71 15.30 16.18 90.17 8126 300.94 nck CTabs = + + + + + + + + + (15) 2 1 20 1 20 20 2 20 20 20 or a2abs = 686.42. As such, _ 1387.28 O b = (16) 1387.28+686.42 }

62 HINTZE ET AL which gives a generalizability coefficient of.67. The results of this analysis suggest that researchers and practitioners can expect lower levels of dependability with CBM progress monitoring data when the materials are either too easy or too difficult. It would appear that carefully assessing the difficulty level of CBM progress monitoring reading passages and a student's response to such passages would be important when used for making within-individual decisions. To investigate the dependability of between- or interindividual decision making, the proper terms from Table 2 are first inserted into Equation 7 so that 2 29.42 146.71 300.94 a= + + (17) rel 2 20 160 2 2 or a rel - 23.93. Inserting a rel into Equation 9 indicates that C= 138728 (18) e 1387.28 + 23.93 which gives a generalizability coefficient of.98. Further investigation of relative decisions using only one set of progress monitoring materials over 5- or 3-week periods, respectively, indicates that: and G - 1387-28 (19) rel 1387.28 + 51.61 C= l38728 (20) 1387.28+66.41 which produce generalizability coefficients of.96 and.95, respectively. Once again, using only three reading passages from one reading series in one grade (as is done in survey level assessment and creating local norms) a resultant generalizability coefficient of.88 is observed. These results concur with those from Study 1, which suggests good dependability in the CBM oral reading fluency metric as is currently used in developing local norms, identifying and certifying problems between individuals, and in determining performance discrepancies of students within local curricula. DISCUSSION The purposes of this study were to provide workable examples of G theory with direct behavioral measures and to explore the dependability and sensitivity of decisions that can be made with oral reading fluency measures such as those used in CBM. With a repeated measures ANOVA and a few straightforward calculations, researchers can begin to study the degree to which a given set of measurements of an individual generalize to a broader and more extensive set of measurements that

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 63 help answer important questions regarding the technical adequacy of a measurement system for making decisions. As illustrated, G theory provides a flexible and straightforward framework for examining the dependability of behavioral measurements. A principal assumption of the theory posits that a measurement taken on a person is only a random sample of that person's behavior. More importantly, the usefulness of any measurement depends on the degree to which any one measurement sample can be generalized accurately to the behavior of the same person across a wider set of contexts. From this perspective, G theory fits well into most forms of clinical assessment, which typically make assumptions and draw inferences from isolated measurement of behavior to overall global functioning and patterns across a variety of situations. This notion is in contrast to the concept of reliability from a classic test score approach. Instead of asking how reliably a set of scores represents a particular construct of interest, G theory asks how well a set of observed scores can be used to represent a person's behavior in a general manner. From a clinical perspective, the answer to this question is the sine qua non of good assessment. Not only is it important that we measure behavior and constructs reliably, but the scores from these measures must also be representative of a person's functioning across time and settings. In addition, G theory extends classic test score theory in a number of important ways. As illustrated by the current work and elucidated by Shavelson and Webb (1991), G theory allows the researcher to estimate statistically the magnitude of each source of measurement error separately in one analysis and provides a mechanism for optimizing the reliability of measurement. By providing a G coefficient that is analogous to a classic test score reliability coefficient, G theory provides feedback to researchers and test developers about sources of error variance and the magnitude of each source of error affecting measurement. Perhaps its most unique feature is the ability to distinguish between inter- and intraindividual decisions. Unlike classic test score theory, researchers are able to evaluate empirically the dependability of any measure from the perspective of how much better one individual performed from another, or conversely, how well an individual can perform regardless of his or her peers' performance. This characteristic is particularly salient to school psychologists, who historically have used both summative and formative assessment measures, and to test developers, who increasingly are being asked to develop time- and cost-efficient measurement procedures for use in schools. Applications of G Theory to CBM Results of the current study indicate that CBM oral reading fluency measures are extremely dependable for a variety of decision-making purposes and continue to be a highly reliable means for indexing students' oral reading proficiency. More specifically, from an interindividual decision-making perspective, the current findings indicate that the dependability of CBM oral reading fluency measures for making between-individual decisions is quite strong. For CBM assessments as typically

64 HINTZE ET AL used in practice (i.e., survey level assessments, problem identification and certification), G coefficients of approximately.90 were observed in both cases. The magnitude of such coefficients suggests that practitioners can feel safe in using CBM data for screening and educational decisions that may include classification or eligibility determination (Salvia & Ysseldyke, 1995). The current findings are also consistent with research showing CBM oral reading fluency measures to be highly related to reported differences in oral reading performance across students of different grades and classifications (Deno, Marston, Shinn, & Tindal, 1983; Fuchs, Fuchs, Hamlett, Walz, & Germann, 1993; Hintze & Shapiro, 1997; Hintze, Daly, & Shapiro, 1998; Hintze, Shapiro, Conte, & Basile, 1997; Shinn & Marston, 1985; Shinn, Tindal, & Stein, 1988) and to teachers' judgement of student reading proficiency in both general and special education (Fuchs & Deno, 1981; Fuchs, Fuchs, & Deno, 1982; Marston & Deno, 1982). In addition, the current findings continue to support the intraindividual decision-making abilities of CBM oral reading fluency measures. As such, the use of CBM for developing and monitoring Individualized Education Plan (IEP) goals and objectives (Fuchs, 1993; Fuchs et al., 1993; Fuchs & Deno, 1991; Fuchs, Fuchs, & Deno, 1985), and the monitoring of individual progress over time (Fuchs, 1986,1989, 1993; Fuchs & Fuchs, 1986; Fuchs, Fuchs, & Hamlett, 1989a, 1989b, 1989c) appears to be psychometrically defensible. Interestingly, the current study lends support to the notion that the difficulty level of the material chosen for progress monitoring can have a substantial effect on resultant CBM outcomes (Hintze et al., 1998). Moreover, results suggest that practitioners may obtain reliable estimates of performance based on the most recent 8 to 10 data points. Although other work has suggested that a minimum of 20 points is required for accurate prediction to some future point in time (Good & Shinn, 1990; Shinn, Good, & Stein, 1989), the current findings suggest that fewer data points can serve as reliable indicators of generalized performance over time. The good news for practitioners is that fewer data points require less time and effort without sacrificing precision. Clearly, such efficient and reliable measures fit the recent amendments to the Individuals with Disabilities Education Act (IDEA), which require that evaluations be linked to IEP and programming objectives through the use of classroom-based data (Turnbull & Turnbull, 1998). Limitations and Considerations for Future Research Although the current article has attempted to provide a working example of G theory as it pertains to direct measures of reading, the results must be interpreted within context. First, because the main purpose of the work was to illustrate a set of methodological techniques, the experimental design and data were ex post facto in nature. As noted, the two data sets were part of two previous research endeavors. Readers should recognize and be cautious in over-generalizing the results because of the retrospective nature of the analyses. Indeed, future work should include a

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 65 careful consideration of design elements and data analytic plans concurrently to strengthen external validity. Second, researchers interested in using G theory should consider carefully the use of power analysis before designing experiments. Although not important from a hypothesis testing perspective (i.e., establishing alpha level), without proper sample size, variability may be either under- or overestimated, which in turn would affect the variance component estimates. For example, with small Nthe variance component estimates may be underestimated (because of reduced variability) and with large N the variance component estimates may be overestimated (because of increased variability). A well-done power analysis in this case should provide the estimated sample size that should neither over- nor underestimate the variance component estimates (Cohen, 1988). In addition to these methodological considerations, future research may also focus on other features that interact with the assessment process. Among the many universes to which a score may belong, Cone (1977) has identified six "generalities" that are particularly relevant to behavioral assessors and behavioral assessment measures: (a) scorer, (b) item, (c) time, (d) method, (e) setting, and (f) dimension. These universes of generalizability represent measurement conditions under which a given behavior for a given individual may be measured. Scorer generality refers to the extent to which data obtained by one observer or scorer are comparable to the observations of all observers who have been used. Item generality reflects the extent to which a given response or set of responses is representative of a larger universe of similar responses. In behavioral assessment, item generality would be most closely linked with broad- and narrow-band informant report measures used as verbal analogues to actual behavior. The relevance of time generality concerns the extent to which data collected on one occasion are representative of those that might have been collected at other times. Although behavioral assessors have long subscribed to the controlling effects of situational specificity, such temporal generalizability is of specific concern when measures of treatment outcome are used to make high-stakes decisions (e.g., using treatment outcome data to make categorical classifications). In other words, evaluators can investigate whether changes in behavior over time are reliable or the result of unidentified sources of variance. Method generality refers to the comparability of data produced from two or more ways of measuring the same construct. For example, evaluators would be interested in knowing the degree to which direct observations of inattention agree with informant reports of the same construct. The extent to which the two measurement methods agree is evidence of method generality. Setting generality asks whether data obtained in one situation are representative of those obtainable in others. For example, does a measure of inattention during independent seat work apply to the same measures taken during other academic periods of the day, such as small group activities? Such information is especially important for behavior change agents who are interested in the external validity of a particular intervention across a variety of situations and contexts. For example, "Will the results of a token economy in a spe-

66 HINTZE ET AL. cial education classroom generalize if it is implemented in a general education classroom?" Having such knowledge a priori may influence the intervention decisions of a behavior change agent, depending on the ultimate goals for generalization. Finally, dimension generality refers to the comparability of data on two or more different behaviors. For example, "To what extent are measures of students' academic engaged time associated with academic achievement?" CONCLUSIONS As with other forms of assessment, the validity of behavioral assessment measures have frequently been called into question. At the forefront of such questions has been an apparent difficulty of behavioral assessment measures to meet the principles and assumptions of classic test score theory. Because of a basic belief that direct behavioral measures were situation-specific samples of behavior, some have argued that behavior as a basic unit of datum should not be expected to evidence properties such as test-detest reliability, concurrent validity across situations, or convergent validity across methods (Nelson, 1983). However, the principles and assumptions underlying G theory are well suited for the validation of behavioral assessment measures. The current paper has argued that differences between classic test score theory and G theory are more conceptual than methodological or statistical. Nonetheless, one important advantage of using G theory with behavioral assessment measures is the acknowledgment that behavior is greatly influenced by the environmental context in which it occurs. Sensitivity to situational specificity, in addition to the ability to partition the variance attributed to contextual arrangements, make G theory a conceptually strong and compatible methodology for validating behavioral measures and exploring the sensitivity of the decisions made with such measures at the same time. REFERENCES Allen, M.J.,& Yen, W.M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole. Bums, K. J. (1998). Beyond classical reliability: Using generalizability theory to assess dependability. Research in Nursing & Health, 21, 83-90. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. Cone, J. D. (1977). The relevance of reliability and validity for behavioral assessment. Behavior Therapy, 5,411-426. Cronbach, L. J., Gleser, G. C, Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley. Deno, S. L., Marston, D., Shinn, M. R., & Tindal, G. (1983). Oral reading fluency: A simple datum for scaling reading disability. Topics in Learning and Learning Disabilities, 2, 53-59. Dixson, W. J. (Ed.). (1992). BMDP statistical software manual. Los Angeles: University of California Press.

GENERALIZABILITY OF ORAL READING FLUENCY MEASURES 67 Fuchs, L. S. (1986). Monitoring progress among mildly handicapped pupils: Review of current practice and research. Remedial and Special Education, 7, 5-12. Fuchs, L. S. (1989). Evaluating solutions, monitoring progress and revising intervention plans. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 153 181). New York: Guilford. Fuchs, L. S. (1993). Enhancing instructional programming and student achievement with curriculum-based measurement. In J. J. Kramer & J. C. Conoley (Eds.), Curriculum-based measurement (pp. 65 103). Lincoln, NE: University of Nebraska-Lincoln, Buros Institute of Mental Measurements. Fuchs, L. S., & Deno, S. L. (1991). Paradigmatic distinctions between instructionally relevant measurement models. Exceptional Children, 57, 488-500. Fuchs, L. S., & Deno, S. L. (1981). The relationship between curriculum-based mastery measures and standardized achievement tests in reading (Report No. 57). Minneapolis, MN: University of Minnesota Institute for Research on Learning Disabilities (ERIC Document Reproduction Service No. ED 212 662). Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53, 199-208. Fuchs, L. S., Fuchs, D., & Deno, S. L. (1982). Reliability and validity of curriculum-based informal reading inventories. Reading Research Quarterly, 18, 6-25. Fuchs, L. S., Fuchs, D., & Deno, S. L. (1985). Importance of goal ambitiousness and goal mastery to student achievement. Exceptional Children, 52, 63-71. Fuchs, L. S., Fuchs, D., & Hamlett, C. L. (1989a). Computers and curriculum-based measurement: Effects of teacher feedback systems. School Psychology Review, 18, 112 125. Fuchs, L. S., Fuchs, D., & Hamlett, C. L. (1989b). Effects of alternative goal structures within curriculum-based measurement. Exceptional Children, 55, 429 438. Fuchs, L. S., Fuchs, D., & Hamlett, C. L. (1989c). Effects of instrumental use of curriculum-based measurement to enhance instructional programs. Remedial and Special Education, 10, 43-52. Fuchs, L. S., Fuchs, D., Hamlett, C. L., Walz, L., & Germann, G. (1993). Formative evaluation of academic progress: How much growth can we expect? School Psychology Review, 22, 27 48. Good, R. H., & Shinn, M. R. (1990). Forecasting accuracy of slope estimates for reading curriculum-based measurement: Empirical evidence. Behavioral Assessment, 12, 179-193. Hintze, J. M., Daly, E. J., & Shapiro, E. S. (1998). An investigation of the effects of passage difficulty level on oral reading fluency for progress monitoring. School Psychology Review, 2 7,433 445. Hintze, J. M., & Shapiro, E. S. (1997). Curriculum-based measurement and literature-based reading: Is curriculum-based measurement meeting the needs of changing reading curricula? Journal of School Psychology, 35, 351-375. Hintze, J. M., Shapiro, E. S., Conte, K. L., & Basile, I. M. (1997). Oral reading fluency and authentic reading material: Criterion validity of the technical features of CBM survey-level assessment. School Psychology Review, 26, 535 553. Marston, D., & Deno, S. L. (1982). Implementation of direct and repeated measurement in the school setting (Report No. 106). Minneapolis, MN: University of Minnesota Institute for Research on Learning Disabilities. Nelson, R. O. (1983). Behavioral assessment: Past, present, and future. Behavioral Assessment, 5, 195-206. Salvia, J., & Ysseldyke, J. E. (1995). Assessment (6th ed.). Boston: Houghton Mifflin. Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage. Shavelson, R. J., & Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44, 922-932. Shinn, M. R., Good, R. H., & Stein, S. (1989). Summarizing trend in student achievement: A comparison of methods. School Psychology Review, 18, 356-370.

68 HINTZE ET AL. Shinn, M. R., & Marston, D. (1985). Differentiating mildly handicapped, low-achieving and regular education students: A curriculum-based approach. Remedial and Special Education, 6, 31 45. Shinn, M. R., Tindal, G., & Stein, S. (1988). Curriculum-based assessment and identification of mildly handicapped students: A research review. Professional School Psychology, 3, 69 85. Suen, H. K. (1990). Principles of tests theories. Hillsdale, NJ: Erlbaum. Turnbull, H. R., & Turnbull, A. P. (1998). Free appropriate public education: The law and children with disabilities (5th ed.). Denver, CO: Love. Action Editor: Timothy Z. Keith Acceptance Date: July 26, 1999