
Developing a Statewide FCAT Growth Scale

A presentation at the 2007 Annual Meeting of the Florida Educational Research Association

Bob Johnson
Bidya Subedi
Richard Williams

Department of Research and Evaluation
School District of Palm Beach County

November 2007


Developing a Statewide FCAT Growth Scale

Bob Johnson
Bidya Subedi
Richard Williams
Department of Research and Evaluation
School District of Palm Beach County

Abstract

In recent years the demand for effective ways of evaluating educational programs has increased. However, evaluations using standardized test scores are hampered by two factors: (1) the lack of equitable scales on standardized tests, and (2) the difficulty of putting the size of effects into terms that decision makers can understand. The School District of Palm Beach County (SDPBC) has previously addressed these issues by creating (1) Local Normal Curve Equivalents (LNCE), (2) Portion of a Year's Growth (PYG), (3) years Basic to Proficient (BTP), and (4) a new classification of effect size. These efforts were expanded in 2007 to include the creation of two new metrics that have advantages over the previous scales: a State Normal Curve Equivalent (SNCE) and a State Growth Scale.

Accountability and program evaluations

Introduction

In the age of accountability, school districts face increasing pressure to improve student achievement. Consequently, the demand for evaluating educational programs to determine their effectiveness is increasing. Evaluations often use standardized test scores as the basis of their analyses. One would think that the effect of a program could simply be determined by calculating the difference between pretest and posttest performance (simple gain scores). For several reasons, this is rarely the case.

The first problem is that scale scores often unfairly advantage or disadvantage certain groups of students (and therefore, programs) because of their achievement level, socio-economic status, ethnicity, or grade level. Unfortunately, most standardized score scales exhibit a pattern of larger annual changes at the lower grades and smaller annual changes at the upper grades. For example, Figure 1 shows the average developmental scale scores (DSS) for students on the 2006 FCAT SSS (Florida Comprehensive Assessment Test - Sunshine State Standards) reading test by grade level. The differences between lower grades are larger than the differences between upper grades: the difference between the grade 3 and grade 4 averages is 165 points, while the difference between the grade 9 and grade 10 averages is 28 points. Consequently, on this test series, students in lower grades are likely to gain more scale score points per year than students in upper grades.

Figure 1. Median developmental scale scores (DSS) by grade level for the State of Florida on the 2006 FCAT SSS test in reading (Gr. 3: 1382, Gr. 4: 1547, Gr. 5: 1619, Gr. 6: 1709, Gr. 7: 1773, Gr. 8: 1834, Gr. 9: 1890, Gr. 10: 1918)

This problem also exists within grade levels. Figure 2 shows the average annual scale score gains for grade 4 students in the School District of Palm Beach County (SDPBC) on the FCAT SSS reading test. The horizontal axis represents groups of students categorized by overall reading level (the average of the pretest and posttest scores was used to assign students to groups, in order to avoid the regression to the mean that would occur if either pretest or posttest were used separately). The graph illustrates that low-scoring fourth graders exhibit greater annual scale score growth than high-scoring fourth graders.

Figure 2. Average annual scale score gains within a grade level for the SDPBC on the 2003 FCAT SSS reading test in grade 4 (vertical axis: DSS gain; horizontal axis: groups of students from lowest scoring to highest scoring)

The second problem is that many scale score scales contain internal inconsistencies. For example, Figure 3 shows the average developmental scale scores by grade for the 2001 FCAT SSS mathematics test. The amount of scale score change from grade to grade is irregular: the difference from grade 4 to grade 5 is 179 points, while the difference from grade 5 to grade 6 is only 22 points.

Figure 3. Average developmental scale scores (DSS) by grade level for the State of Florida on the 2001 FCAT SSS test in mathematics (Gr. 3: 1332, Gr. 4: 1447, Gr. 5: 1626, Gr. 6: 1648, Gr. 7: 1769, Gr. 8: 1866, Gr. 9: 1893, Gr. 10: 1991)

These problems make the use of scale scores from most standardized tests inappropriate for program evaluation. Certain techniques can be attempted to adjust for these disparities, but the adjustments are often problematic and frequently produce adjusted scores that are difficult to understand.

The development of new score scales

Previous work by the SDPBC

To address these issues, the SDPBC previously created several new score metrics for calculating and presenting the annual growth of groups of students. These metrics were (1) Local Normal Curve Equivalents (LNCE), (2) Portion of a Year's Growth (PYG), (3) years Basic to Proficient (BTP), and (4) a new classification of effect size (see Appendix A for more details).

LNCEs were developed by constructing cohorts of students who had test scores for both the pretest and the posttest on the annual FCAT (a separate cohort was created for each grade level; for example, students who were in grade 3 on the pretest and grade 4 on the posttest formed the grade 3-4 cohort). The pretest and posttest scores for each cohort were then normalized separately. This normalization forced the distribution of scaled scores into a normal curve, and consequently differs from merely standardizing the scores. The normalized scores, which had a mean of zero and a standard deviation of one, were then converted into normal curve equivalent scores with a mean of 50 and a standard deviation of 21.07. The resulting LNCE scores represented a type of ranking of students relative to their peers. For example, an average student in the cohort would have a pretest LNCE of 50, a posttest LNCE of 50, and an LNCE gain of zero. Similarly, a typical below-average student (one scoring at the local 25th percentile) would have a pretest LNCE of 36 and a posttest LNCE of 36, again with an LNCE gain of zero. Any student who maintained the same relative position among students in the cohort would have the same LNCE for both pretest and posttest. Therefore, an LNCE gain of zero represented one year's average growth for low-scoring, high-scoring, and average-scoring students. As a result, a group of students with a positive LNCE gain would have increased in rank relative to other students in the cohort (i.e., made more gain in one year than students at the same level of performance would normally make). Similarly, a group of students with a negative LNCE gain would have made less than one year's growth, compared to the average growth of District students at the same level of performance.
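As a rough illustration of this construction (this is not the SDPBC's actual code), the following Python sketch converts pretest and posttest developmental scale scores into LNCEs for a single grade-level cohort and computes the LNCE gain; the DataFrame and its column names ("pre_dss", "post_dss") are hypothetical.

import pandas as pd
from scipy.stats import norm

def to_nce(scores: pd.Series) -> pd.Series:
    # Force the scores onto a normal curve via percentile ranks, then rescale
    # to NCE units (mean 50, SD about 21.06; the paper reports 21.07).
    pct = scores.rank(method="average") / (len(scores) + 1)
    z = norm.ppf(pct)                       # normalized score: mean ~0, SD ~1
    return pd.Series(50 + 21.06 * z, index=scores.index)

def lnce_gain(cohort: pd.DataFrame) -> pd.Series:
    # Pretest and posttest are normalized separately, so the cohort-average
    # gain is zero by construction.
    return to_nce(cohort["post_dss"]) - to_nce(cohort["pre_dss"])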

Repeated use of these LNCE measures in the SDPBC showed that they were unbiased with respect to low- and high-performing students. Figure 4 provides an example of this consistency. Students were grouped based on their average achievement (the pretest and posttest LNCEs were averaged, which eliminated the regression-to-the-mean effect that would occur if either score were used alone to determine a student's overall achievement level), and the mean LNCE gain of each group was calculated. The LNCE gain of each group was consistently near zero, apart from random variation.

Figure 4. Mean LNCE gains for students at different achievement levels, using 2005-2006 FCAT SSS data for grade 8 reading (vertical axis: LNCE gain; horizontal axis: groups from lowest scoring to highest scoring)
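A minimal sketch of this check, under the same hypothetical column layout as above (here "pre_lnce" and "post_lnce"): students are binned on the average of their two LNCEs, and the mean gain of each bin is inspected, which should be near zero for an unbiased metric.

import pandas as pd

def gain_by_achievement_group(cohort: pd.DataFrame, n_groups: int = 10) -> pd.Series:
    avg = (cohort["pre_lnce"] + cohort["post_lnce"]) / 2   # overall achievement
    gain = cohort["post_lnce"] - cohort["pre_lnce"]
    groups = pd.qcut(avg, q=n_groups, labels=False)        # achievement deciles
    return gain.groupby(groups).mean()                     # ~0 in every group if unbiased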

Figure 5 illustrates that LNCE gains were also unbiased relative to ethnic groups, socio-economic status (SES), and other criteria. The gains of these groups were near zero, with the exception of ELL students, who consistently showed positive gains. This exception was understood, because many of these students had academic abilities beyond their English abilities and tended to improve rapidly on standardized tests as they advanced in learning English.

Figure 5. Mean LNCE gains for subgroups of students, using 2005-2006 FCAT SSS data for grade 8 reading (subgroups: Asian, Black, Hispanic, White, Non-White, Federal Lunch, ESE, Gifted, ELL; vertical axis: LNCE gain)

LNCE gains were also unbiased by grade level, since the mean LNCE gain of each grade-level cohort was zero by definition.

Additional measures were developed to communicate the LNCE results to relevant stakeholders. The Portion of a Year's Growth (PYG) converted LNCE gains into a proportion of one year's growth: a PYG of 1.0 represents one year's growth, 1.5 represents one and a half years' growth, and so on. Further, the years Basic to Proficient (BTP) measure was developed to report the number of years it would take to move a group of students from the Basic level (25th percentile) to a proficient level (50th percentile) at a given rate of LNCE gain. Lastly, a new classification of effect size was created to express these gains in terms that are easy to understand (see Appendix A for a description of the calculation of PYG and BTP and an explanation of the new SDPBC effect sizes). These measures are further explained in Johnson & Subedi (2006).

Advantages and limitations of LNCEs

Although extremely useful, the LNCE scale exhibited a few limitations. First, the LNCE scale assumed no District growth. In other words, the District itself always had a normative LNCE growth of zero, which corresponded to a PYG of 1.0 years. Consequently, while LNCEs and PYGs were very helpful for measuring the growth of students, schools, and programs against the District, the growth of the District itself could not be calculated. In a similar way, the evaluation of any group of students within the District would not receive credit for gains made by the entire District. Second, since the District mean gain was defined to be zero, LNCEs could not accurately measure the gains of a group if that group made up a majority of the students in the District. Third, LNCEs were limited in that smaller districts would not have enough students to develop these measures.

The development of SNCEs

For these reasons, the SDPBC worked to develop State Normal Curve Equivalents (SNCE). These measures were created with the same methodology used to develop the LNCEs, but using the larger population of FCAT SSS takers in the State of Florida. The resulting SNCEs proved to be highly correlated with the original LNCE measures previously developed for SDPBC students, and they possessed the same unbiased quality as the original LNCEs. The advantage of SNCEs, however, was that the growth of the District could then be compared to the typical growth of students in the State. Additionally, the annual growth of District schools could be compared with the typical annual growth of other schools Statewide having similar demographics.

Advantages and limitations of SNCEs

Although superior to LNCEs, the SNCE scale had one significant limitation. Although districts, schools, and programs could easily be compared to the State, the State itself was assured of having a zero SNCE gain (or a State PYG of 1.0). As a result, SNCEs could not measure any real gain made by the State. For this reason, SNCE gains were not measured in absolute terms, but only as they were greater or less than the average State growth.

Both LNCEs and SNCEs had additional limitations. First, these scores apply only to a cohort of students, and therefore students not in the cohort do not have these scores. This made it impossible to refer to the average SNCE of all students tested at any one point in time (since some of these students would not be in the cohort). Second, a given student could possibly have two different LNCEs or SNCEs for the same test in the same year. For example, a fourth-grade student in SY2006 (SY2006 refers to the 2005-2006 school year) would have one SNCE as the posttest of the SY2005-SY2006 cohort, and this same student would have another SNCE as the pretest of the SY2006-SY2007 cohort. These two SNCEs may not be the same. Third, there is the problem of including retained students in the cohort (see the discussion later in this paper).

The development of the State Growth Scale

Creation of the State Growth Scale metric

The SDPBC began to address these issues by creating a State Growth Scale. The State Growth Scale was developed primarily to allow the measurement of real growth of the State over time. By attempting to estimate real State growth, the State Growth Scale would not only allow the comparison of districts, schools, or programs to the State's performance, but would also allow the measurement of gains of districts, schools, programs, and the State itself against an absolute standard.

Adjustment for actual State gain

The following method was developed to adjust for the real growth of the State over time. First, a baseline year was established (SY2003). In this baseline year, the mean of the cohort at every grade was already defined to be an NCE of 50, which fixes a particular developmental scale score (DSS) that is equivalent to an NCE of 50 at every grade. The adjustment process then finds, for every other year, the NCE that is equivalent to the given DSS. All of the NCEs for that year and grade are then adjusted so that the given DSS becomes an NCE of 50. In that manner, at any given grade an NCE of 50 corresponds to the mean of the cohort for the baseline year (SY2003).

For example, suppose that in SY2003 a DSS of 1100 at grade 4 was equivalent to an NCE of 50. Then suppose that in SY2004 a DSS of 1100 was equivalent to an NCE of 48 (which would indicate that students scored better in SY2004 than in SY2003). The process would then adjust all of the NCEs for SY2004 by +2, so that a SY2004 DSS of 1100 becomes an NCE of 50 for that year. The same process is applied to all years. As a result, the NCEs of all years are adjusted so that an NCE of 50 always represents the same DSS.
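A minimal sketch of this adjustment, assuming arrays that describe one year's DSS-to-NCE relationship for a single grade (the function and array names are hypothetical, not the SDPBC's code):

import numpy as np

def nce_adjustment(dss: np.ndarray, nce: np.ndarray, baseline_dss_at_50: float) -> float:
    # dss/nce: paired values giving this year's DSS-to-NCE relationship at one grade.
    order = np.argsort(dss)
    nce_at_baseline = np.interp(baseline_dss_at_50, dss[order], nce[order])
    # If the baseline DSS (e.g., 1100) maps to an NCE of 48 this year, adjust by +2.
    return 50.0 - nce_at_baseline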

Table 1 shows the resulting adjustments when this method was applied across years and grades in reading.

Table 1. Sample NCE adjustments for different years and grades in reading (posttest, promoted and retained students)

Grade   SY2002   SY2003   SY2004   SY2005   SY2006
4         -2.4      0.0      4.0      5.1      2.1
5         -1.7      0.0      1.1      4.6      5.3
6         -0.6      0.0      0.1      0.6      5.8
7         -0.8      0.0      0.2      0.2      4.6
8         -1.7      0.0     -2.1     -3.3     -1.6
9         -0.7      0.0      1.2      3.6      6.9
10         0.5      0.0     -1.6     -2.9     -2.8

The data show that the NCEs for years after SY2003 tend to have positive adjustments, while the years prior to SY2003 tend to have negative adjustments. This indicates that, in general, the State increased its performance from SY2002 to SY2006.

However, these adjustments created another problem: within a year, the adjustments are rather inconsistent. For example, in SY2006 it seems illogical to have a -1.6 adjustment at grade 8 and a +6.9 adjustment at grade 9. One possible way to deal with this problem is to use a regression approach across the grades within each year. Note, for example, the unsmoothed values above for SY2005. It would be possible to take these values and regress them on grade; the red line in Figure 6 is the regression line that fits the pattern for that year. Such an adjustment smooths the adjustments across grades within each year.

Figure 6. Sample of unsmoothed adjustments for SY2005 (adjustments plotted against grade, with the fitted regression line)
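As a rough sketch of this step (again, not the SDPBC's code), a simple linear regression of the raw adjustments on grade can be fitted for each year; the values below are the SY2005 reading adjustments from Table 1.

import numpy as np

grades = np.array([4, 5, 6, 7, 8, 9, 10], dtype=float)
raw = np.array([5.1, 4.6, 0.6, 0.2, -3.3, 3.6, -2.9])   # SY2005 column of Table 1

slope, intercept = np.polyfit(grades, raw, deg=1)        # fit adjustment ~ grade
smoothed = intercept + slope * grades                    # one smoothed adjustment per grade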

Figure 7 below shows the smoothed values for mathematics for all years and grades.

Figure 7. Smoothed adjustments for mathematics (adjustments plotted against grade, grades 4-10, with one line per year, FY2002 through FY2007)

Here we can see that the smoothed values for mathematics form a consistent pattern, showing increased growth at the State level each year in mathematics. Further, the growth is greater at the lower grades and smaller at the upper grades. Figure 8 below shows the same smoothed data for reading.

Figure 8. Smoothed adjustments for reading (adjustments plotted against grade, grades 4-10, with one line per year, FY2002 through FY2007)

Here the pattern is not quite as consistent, as the adjustments for FY2004, FY2005, and FY2007 cross over the lines for FY2002, FY2003, and FY2006 (FY2004 is the same as SY2004). The above analysis was cross-checked by computing similar statistics using the published average FCAT SSS DSS scores for the entire State (the method above used only cohort students). This cross-check produced patterns extremely similar to those above. The conclusion was that, whatever the nature of the pattern, it is characteristic of the Statewide FCAT data. The irregularities for reading could have been handled by either (1) manually adjusting the pattern to make it similar to that of mathematics, or (2) using the empirical data as is. The SDPBC decided to use the existing data, because adjusting the pattern would probably have rested on unjustifiable assumptions.

The calculation of the State Growth Scale

State Growth Scale scores are based on adjusted SNCEs. To avoid confusion with SNCEs, the State Growth Scale was designed with a different calculation so that the two metrics would look completely different.

The State Growth Scale is calculated on a pseudo grade-equivalent basis, using the adjusted SNCEs. The SDPBC had previously estimated that the average growth from one year to the next is approximately 6 NCEs for reading and 8 NCEs for mathematics. This information was then used to compute a grade-related score using the following formulas.

For reading: State Growth Scale score = [(grade level + 0.8) + (NCE - 50) / 6] * 100

For mathematics: State Growth Scale score = [(grade level + 0.8) + (NCE - 50) / 8] * 100

where 0.8 represents the eighth month of the school year, 50 is the mean NCE, and six or eight is the average NCE growth from one year to the next. This translates SNCEs into State Growth Scale scores as illustrated in Table 2 below.

Table 2. Sample calculation of State Growth Scale scores for reading

Percentile   NCE   Grade 4   Grade 5   Grade 6   Grade 7   Grade 8   Grade 9   Grade 10
99            99      1297      1397      1497      1597      1697      1797      1897
90            77       930      1030      1130      1230      1330      1430      1530
75            64       717       817       917      1017      1117      1217      1317
50            50       480       580       680       780       880       980      1080
25            36       243       343       443       543       643       743       843
10            23        30       130       230       330       430       530       630
1              1      -337      -237      -137       -37        63       163       263

A grade 4 percentile of 50, which reflects average performance for a fourth grader, translates to a State Growth Scale score of 480. Dividing this by 100 gives 4.8, an approximate grade equivalent for an average fourth grader (corresponding to an FCAT spring administration in grade 4, eighth month). Similarly, a grade 4 percentile of 75 (equivalent to an NCE of 64) gives a score of 717, which is roughly equivalent to the average performance of students in the second month of seventh grade.
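A minimal sketch of this conversion in Python (the function name is an assumption; the constants are the paper's):

def growth_scale(nce: float, grade: int, subject: str = "reading") -> float:
    # 0.8 = eighth month of the school year; 6 NCEs/year in reading, 8 in math.
    per_year = 6.0 if subject == "reading" else 8.0
    return ((grade + 0.8) + (nce - 50.0) / per_year) * 100.0

print(growth_scale(50, 4))      # 480.0 -> grade equivalent 4.8 for an average 4th grader
print(growth_scale(64.2, 4))    # ~717 for a 75th-percentile 4th grader (NCE ~64)

# The difference of two scores, divided by 100, is the portion of a year's growth (PYG).
pyg = (growth_scale(50, 5) - growth_scale(50, 4)) / 100.0   # 1.0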

Table 3 below gives the corresponding translation for mathematics.

Table 3. Sample calculation of State Growth Scale scores for mathematics

Percentile   NCE   Grade 4   Grade 5   Grade 6   Grade 7   Grade 8   Grade 9   Grade 10
99            99      1093      1193      1293      1393      1493      1593      1693
90            77       818       918      1018      1118      1218      1318      1418
75            64       658       758       858       958      1058      1158      1258
50            50       480       580       680       780       880       980      1080
25            36       302       402       502       602       702       802       902
10            23       142       242       342       442       542       642       742
1              1      -133       -33        67       167       267       367       467

In this way, State Growth Scale scores have the advantage of being interpretable as approximate grade equivalents. Further, the difference of two State Growth Scale scores (divided by 100) is the PYG. For example, if a student moved from a score of 480 in grade 4 to 580 in grade 5, the difference would be 100 points, which translates to a PYG of 1.00. If a student moved from 480 in grade 4 to 655 in grade 5, the difference would be 175 points, which translates to a PYG of 1.75.

Pre/post differences

As mentioned previously, for any of these metrics (LNCE, SNCE, or the State Growth Scale), a given student may have two different NCEs for the same test in the same year. This occurs because the test score of any particular student (e.g., SY2005 grade 4 reading) is the posttest of the SY2004-SY2005 cohort as well as the pretest of the SY2005-SY2006 cohort, and these two SNCEs may not be the same. An investigation was conducted to determine the extent of these differences. In most cases, the relationship of DSS to SNCE for pretest and posttest was virtually identical, as illustrated in Figure 9 below.

Figure 9. Sample of a consistent pre/post relationship

However, in a few instances the pretest and posttest relationships were distinctly different, as seen in Figure 10.

Figure 10. Sample of an inconsistent pre/post relationship

Most of these differences occurred at grade 9 and may be related to the large numbers of students who are retained at that grade. The pre/post relationships at most grades could therefore be reduced to a single relationship of DSS to SNCE. For the problematic grades, however, some type of sophisticated adjustment would have been required, since the pretest relationship for the second cohort would have to be adjusted to match the posttest relationship at the same grade. This, of course, would necessitate the same adjustment being applied to the posttest of the cohort for the second year, and these adjustments would accumulate over many years. Because of the extreme complexity of doing this, the decision was made to continue using separate pre/post DSS-to-SNCE relationships without attempting these adjustments.

Smoothing

The process used to develop the State Growth Scale uses a small measure of smoothing. Originally, PROC TPSPLINE in SAS was used with LAMBDA0 = 100000 to create smooth curves for the relationship of DSS to the State Growth Scale. However, later investigations showed that this degree of smoothing introduced a small amount of bias in isolated instances. As a result, the smoothing was tightened using LAMBDA0 = 100, which produced results very similar to the original unsmoothed data.
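For illustration only (this is a Python analogue, not the SAS PROC TPSPLINE call used by the SDPBC), a lightly smoothed curve can be fitted to a DSS-to-NCE relationship with a smoothing spline; the data below are fabricated, and the smoothing parameter s only loosely corresponds to LAMBDA0.

import numpy as np
from scipy.interpolate import UnivariateSpline

dss = np.linspace(1000, 2000, 41)                          # hypothetical DSS grid
nce = 50 + 21.06 * (dss - 1500) / 400 + np.random.normal(0, 0.5, dss.size)

curve = UnivariateSpline(dss, nce, s=dss.size * 0.25)      # mild smoothing
fitted = curve(dss)                                        # smoothed NCE at each DSS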

The problem of retention

The LNCE, SNCE, and State Growth Scale measures discussed above all include retained students (for comparison purposes, they were also calculated without retained students). In this process, a single cohort was formed of all students who had both pretest and posttest measures. For example, all students who had a pretest in grade 4 in SY2005 were included in a single cohort. Most of these students would have been promoted (i.e., would have taken the grade 3 test in SY2004), but some of them would have been retained (i.e., would have taken the grade 4 test in SY2004); a very few might have other grade combinations (e.g., grade 6 to grade 8).

The above process includes these retained students under the assumption of scaled score equivalence: when a test series is vertically scaled, there should be no systematic difference in scale score whether a student or group of students takes an on-level test or a below-level test (as long as the test is reasonably appropriate in difficulty for the students being assessed). Under this assumption, it should not matter whether the pretest in the example above is the grade 3 test or the grade 4 test. However, there is some reason to suspect that this assumption may not hold in some cases, since the grade-to-grade DSS differences on the FCAT SSS, as discussed in the introduction, are often irregular. To check this, NCE gains from SY2006 to SY2007 were calculated separately for promoted and retained students. Table 4 shows this data.

Table 4. Sample NCE gains from SY2006 to SY2007 for promoted and retained students

           Reading                Math
Grade   Promoted  Retained    Promoted  Retained
4          -0.1      3.8         -0.1      3.8
5          -0.1      7.3          0.0     -0.3
6          -0.1      2.0         -0.2      6.9
7           0.0     -1.0          0.2     -8.1
8           0.0      0.0          0.1     -3.0
9           0.1     -2.5          0.3     -5.7
10          0.0      4.1          0.0     -4.6

Since the promoted students form the majority of each cohort, their gains remain close to zero (as expected). However, the NCE gains of retained students exhibit an irregular pattern, varying substantially from grade to grade. This data adds weight to the supposition that there is a problem when calculating the gains of retained students.

The SDPBC briefly experimented with an alternative method of calculating SNCEs for retained students. In this method, retained students were measured against their grade-appropriate counterparts instead of their cohort counterparts. For example, in the standard calculation above, all students who had a pretest in grade 4 in SY2005 were included in a single cohort.

In this standard calculation, the pretest DSS scores of the retained students, who would have taken the grade 4 test in SY2004, were effectively being compared to the grade 3 students in SY2004 who make up the majority of the pretest group. In the alternate calculation, these retained students were compared with other students who took the grade 4 test in SY2004 as their pretest (most of whom would be in the SY2005 grade 5 cohort). These NCEs, of course, were artificially low because they were being compared with students a grade higher than in the standard calculation. These low NCEs were then adjusted to approximate what the NCEs might have been had the students been tested in their actual cohort. This method produced greater grade-to-grade consistency. However, the correct adjustment for the artificially low NCEs could only be guessed at, and the calculation of State Growth Scale values for these retained students would have been more complex. For this reason, the alternate calculation was set aside and the standard calculation continued.

Advantages / disadvantages of these metrics

Conclusions

All of these metrics have the distinct advantage of being unbiased; that is, virtually all subgroups of students (high scoring, low scoring, federal lunch, non-federal lunch, ESE, non-ESE, Black, Hispanic, White, etc.) have systematically equal amounts of annual gain. This is essential for program evaluations, since it is undesirable for a program to be automatically advantaged or disadvantaged in an evaluation because of the demographics of the students involved. All of these metrics also have the advantage of being convertible into a PYG (portion of a year's growth). The PYG metric has proved useful in explaining the results of evaluations to educational decision makers. These and other advantages (+) and disadvantages of these metrics are summarized below.

Table 5. Advantages and disadvantages of LNCE, SNCE, and the State Growth Scale

Advantage                                               LNCE   SNCE   State Growth Scale
Provides method of unbiased analysis of groups            +      +           +
Easily understandable when translated to PYG              +      +           +
Can compare district, school gains to State gains                +           +
Can report State, district, school change over time                          +
Provides scores for non-cohort students                                       +
Has approximate normative interpretation (as GEs)                             +

Disadvantage
Pre/post for same student may be different
Retained students may have problems
Has only cohort students
Assumes no district-level change
Assumes no State-level change

Suggested uses for these metrics

For LNCEs:
1. Use for program evaluations when the group(s) being analyzed are less than 50% of the District.
2. Use to compare school gains to the District gain.
3. Use to compare gains of other groups (teachers, classes, etc.) to the District gain.

It is not possible for LNCEs to measure the gain of the District. Also, it is not recommended that LNCEs be used to measure the gains of groups of students that comprise a majority of the District, since the District LNCE gain is zero by definition.

For SNCEs:
1. Use for program evaluations when the group(s) being analyzed are a majority of students in the District.
2. Use to compare schools in the District (or in the State) with similar schools in the State.

It is not recommended that SNCEs be used to measure District gain, since that gain would only be relative to the assumption of zero State gain.

For the State Growth Scale:
1. Use for all absolute measures of gain for the State, districts, or schools.
2. Use for all absolute measures of gain for students.

It is not recommended that the State Growth Scale be used for program evaluations, since the adjustments for gain are only approximate and may differ from grade to grade.

Issues to be addressed

The LNCE, SNCE, and State Growth Scale measures provide useful advantages for program evaluations, as well as for other situations where it is desirable to measure growth over time. These metrics provide an unbiased measure that is generally not available elsewhere. Given additional time, the issues of retained students, pre/post differences, and the method of adjustment for true State gain over time should be further addressed.

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Lawrence Erlbaum.

Johnson, B., & Subedi, B. R. (2006). Uniform measurement scales developed for program evaluation. Paper presented at the annual meeting of the Florida Educational Research Association, November 15-17, 2006.

Appendix A. Description of the PYG, BTP, and the new SDPBC effect size categories

Portion of a year's growth

To aid interpretation of LNCEs, the SDPBC translated LNCE gains into a portion of a year's growth. In 2005, the DRE analyzed normative data for a number of standardized tests and found that the average grade-to-grade difference was approximately 6.0 NCEs in reading and 8.0 NCEs in mathematics. Using this relationship, any LNCE gain can be translated into an approximate portion of a year's growth (PYG).

For reading: PYG = 1 + (LNCE gain / 6)

For mathematics: PYG = 1 + (LNCE gain / 8)

The table below illustrates the relationship of LNCE gain to PYG.

Relationship of LNCE gains and Portion of a Year's Growth (PYG)

LNCE Gain   PYG Reading   PYG Mathematics
  10.0          2.7            2.3
   9.0          2.5            2.1
   8.0          2.3            2.0
   7.0          2.2            1.9
   6.0          2.0            1.8
   5.0          1.8            1.6
   4.0          1.7            1.5
   3.0          1.5            1.4
   2.0          1.3            1.3
   1.0          1.2            1.1
   0.0          1.0            1.0
  -1.0          0.8            0.9
  -2.0          0.7            0.8
  -3.0          0.5            0.6
  -4.0          0.3            0.5
  -5.0          0.2            0.4
  -6.0          0.0            0.3
  -7.0         -0.2            0.1
  -8.0         -0.3            0.0
  -9.0         -0.5           -0.1
 -10.0         -0.7           -0.3

Years Basic to Proficient (BTP)

Another measure was developed to communicate LNCE gains to stakeholders. The years Basic to Proficient (BTP) estimates how many years it would take to move a group of students from considerably below average (the 25th percentile) to average (the 50th percentile) at a given rate of LNCE gain. BTP is calculated from the following formula:

BTP = 14.21 / LNCE gain

(because the distance between the 25th percentile and the 50th percentile is 14.21 NCEs). The table below illustrates the relationship of annual LNCE gains to BTPs.

Number of years required to move from the 25th to the 50th percentile at various rates of LNCE gain

LNCE Gain   Years Needed Basic to Proficient (BTP)
  +10            1.4
  + 9            1.6
  + 8            1.8
  + 7            2.0
  + 6            2.4
  + 5            2.8
  + 4            3.6
  + 3            4.7
  + 2            7.1
  + 1           14.2
    0            n/a
  - 1          -14.2
  - 2           -7.1
  - 3           -4.7
  - 4           -3.6
  - 5           -2.8
  - 6           -2.4
  - 7           -2.0
  - 8           -1.8
  - 9           -1.6
  -10           -1.4
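A minimal sketch of these two conversions in Python (the function names are assumptions; the constants are the paper's):

def pyg(lnce_gain: float, subject: str = "reading") -> float:
    # 6 NCEs per year in reading, 8 in mathematics.
    per_year = 6.0 if subject == "reading" else 8.0
    return 1.0 + lnce_gain / per_year

def btp(lnce_gain: float) -> float | None:
    # 14.21 NCEs separate the 25th and 50th percentiles; undefined at zero gain.
    return None if lnce_gain == 0 else 14.21 / lnce_gain

print(pyg(3.0))   # 1.5 years' growth in reading
print(btp(3.0))   # ~4.7 years from Basic to Proficient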

New SDPBC effect size categories

In 2007, the SDPBC adopted new categories for educational effect size. These categories provide a method of understanding the extent to which a difference in scores impacts student achievement. They are used to categorize the size of (1) growth, (2) year-to-year change, or (3) differences between groups.

Educational Effect Size   Reading PYG       Math PYG          Years BTP       Effect Size Range
EXCEPTIONAL (+)           3.8 and above     3.1 and above     0.8 and below   0.8000 and up
EXTENSIVE (+)             2.8 to <3.8       2.3 to <3.1       0.8 to 1.3      0.5000 to 0.7999
SUBSTANTIAL (+)           1.7 to <2.8       1.5 to <2.3       1.3 to 3.4      0.2000 to 0.4999
MODERATE (+)              1.5 to <1.7       1.4 to <1.5       3.4 to 6.7      0.1333 to 0.1999
SLIGHT (+)                1.2 to <1.5       1.2 to <1.4       6.7 to 13.5     0.0667 to 0.1332
INCONSEQUENTIAL (+)       1.0 to <1.2       1.0 to <1.2       13.5 and up     0.0000 to 0.0666
INCONSEQUENTIAL (-)       0.8 to <1.0       0.8 to <1.0       N/A             -0.0666 to -0.0001
SLIGHT (-)                0.5 to <0.8       0.6 to <0.8       N/A             -0.1332 to -0.0667
MODERATE (-)              0.3 to <0.5       0.5 to <0.6       N/A             -0.1999 to -0.1333
SUBSTANTIAL (-)           -0.8 to <0.3      -0.3 to <0.5      N/A             -0.4999 to -0.2000
EXTENSIVE (-)             -1.8 to <-0.8     -1.1 to <-0.3     N/A             -0.7999 to -0.5000
EXCEPTIONAL (-)           below -1.8        below -1.1        N/A             -0.8000 and below

These categories were originally developed using Cohen's (1988) definitions of effect sizes and adapted for educational settings. The SDPBC categories of Substantial, Extensive, and Exceptional correspond to Cohen's categories of small, medium, and large. The categories of Slight and Moderate provide classifications for educational effects that are smaller than Cohen's "small" but which would still be considered important in education.
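As an illustrative sketch only (the function name and boundary handling are assumptions, and the effect-size cut points are taken from the table above as reconstructed here), a classifier for these categories might look like this:

def sdpbc_category(effect_size: float) -> str:
    # Positive and negative effects share the same magnitude cut points.
    bounds = [(0.80, "EXCEPTIONAL"), (0.50, "EXTENSIVE"), (0.20, "SUBSTANTIAL"),
              (0.1333, "MODERATE"), (0.0667, "SLIGHT")]
    sign = "+" if effect_size >= 0 else "-"
    magnitude = abs(effect_size)
    for cutoff, label in bounds:
        if magnitude >= cutoff:
            return f"{label} ({sign})"
    return f"INCONSEQUENTIAL ({sign})"

print(sdpbc_category(0.35))    # SUBSTANTIAL (+)
print(sdpbc_category(-0.10))   # SLIGHT (-)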