Karla Brooks Baehr, Ed.D.
Senior Advisor and Consultant, The District Management Council

This paper aims to inform the debate about how best to incorporate student learning into teacher evaluation systems in ways that will promote, rather than undermine, teachers' motivation to improve their practice. Spurred by the requirements the U.S. Department of Education established for Race to the Top and the Title I waiver process, many states have revamped their teacher evaluation systems. Each has had to think through exactly how to make teacher impact on student learning play a significant part in teacher and principal evaluation. Many states have decided to use statewide assessment results as the single measure of impact for teachers in so-called tested grades and subjects, and most have established specific weights or percentages to determine how much each teacher's impact on student learning must count toward his or her overall effectiveness rating.

Massachusetts has done neither. Some would call us out as wimps; not surprisingly, I disagree. Of course, I have a bias. As former deputy commissioner at the Massachusetts Department of Elementary and Secondary Education (DESE), I had major responsibility for developing the new Massachusetts regulations for educator evaluation in 2011-12. Subsequently, I led the team charged with designing and launching the state's model system, which has been adopted by most of the state's three-hundred-plus districts. In both roles, I have had the privilege of working with many educators on the ground (superintendents, principals, union leaders, and teachers), as well as assessment experts, policy experts, and state officials from across the country. I am convinced more than ever that the strategy we have taken in Massachusetts has great promise for advancing the profession and moving us closer to our goal of having every student taught by an effective teacher in a school led by an effective leader. I hope to convince you, or at least give you some rich food for thought. I will do so by:

- highlighting the critical culture shifts the Massachusetts approach to educator evaluation is designed to accelerate;
- very briefly framing our overall approach to educator evaluation;
- highlighting how our approach differs from that of most states in terms of
  - how we are determining teacher impact on student learning, and
  - how we will count teacher impact on student learning;
- arguing why our approach makes sense and will contribute to the changes we need in our schools.

Critical Culture Shifts

Let me start by asserting the obvious: we are not going to fire our way to excellence. Of course, we have to rid the profession of that small minority who really don't belong (and the Massachusetts regulations

make that easier to do). But the real imperative we face is to create conditions in every school whereby all teachers are motivated and supported to get better. If we assume that there is currently a bell-curve distribution of teacher effectiveness, then our imperative is straightforward: we must move the entire bell curve to the right, so that vastly more of our teachers are performing at proficient and highly proficient levels. To do so, the culture typical in many schools needs to change to embrace high standards, shared accountability for results, collaboration, and continuous improvement.

Changes in evaluation processes and assumptions can contribute to the culture change we need. If, for example, teaching performance standards are specific and described in detail at several levels of proficiency, and, at the same time, the growth each student makes in a year is something teacher teams are expected to both consider and predict, then a foundation can be laid for a culture that reflects and embraces high standards. If classroom observation and evidence collection focus not just on what the teacher is doing but also on what students are learning, then the foundation for shared accountability can be strengthened, but only if the data about what students are learning is identified, shared, and compared in ways that build teachers' sense of autonomy, respect, and efficacy. If each educator's experience of evaluation moves from something that is "done to me" to a valuable process of self-assessment and goal setting that is "done with me," then collaboration and continuous improvement can become the norm.

Of course, to support commitment to and motivation for improving craft, we need to make meaningful distinctions about teacher effectiveness. We cannot afford the "widget effect," whereby 99 percent of teachers are considered proficient, or the "Lake Wobegon effect," in which the performance of most teachers is judged to be above average. We need a system that reliably identifies that small proportion of teachers whose performance is truly unsatisfactory as well as those at the other extreme, who can serve as models of exemplary performance. For those in the middle, especially, we need to provide specific and actionable feedback they can use to improve their practice of the complex craft of teaching.

Highlights of the Massachusetts Approach

Massachusetts has taken three key steps to design and implement its teacher evaluation system: (1) a forty-plus-member stakeholder group established the initial framework for the approach; (2) the Board of Elementary and Secondary Education adopted regulations to refine and implement the framework; and (3) DESE worked with stakeholders to develop a model that districts could adopt or adapt through the collective bargaining process. Districts also had the option of designing their own system and submitting it to DESE for review against the requirements of the regulations. DESE's model, developed in consultation with key stakeholders, included model collective bargaining contract language; model rubrics for teachers, school-based administrators, and superintendents; and several implementation guides (see www.doe.mass.edu/edeval/model).

The foundation of the Massachusetts approach is a five-step continuous improvement cycle for educators that aims to make every evaluator an active partner throughout the cycle and embodies a process that is built on and promotes collaboration and accountability for results.

Self-Assessment is Step 1 of the cycle. The educator uses a comprehensive rubric to assess strengths and weaknesses. The rubric describes thirty-three dimensions of teaching divided among four standards: curriculum, planning, and assessment; teaching all students; family and community engagement; and professional culture. Each dimension is described at four performance levels: exemplary, proficient, needs improvement, and unsatisfactory. In addition to the rubric, the educator uses data about student learning (including, as will be explained later, a rating of his or her impact on student learning), survey data from students, and information about past performance from prior evaluations.

More than 95 percent of the state's districts adopted the model rubrics for teachers, principals, and superintendents. As a result, for the first time, the state has the basis for a shared statewide understanding of what effective teaching and leadership look like, a critical step to advancing the profession. Drawing in part on the work of Ron Ferguson and others, DESE is building a model student survey, which it plans to make available to all Massachusetts districts at little or no cost in 2013-14. Just as the department's model rubric has been adopted by nearly every district, it is expected that the model survey will be widely adopted as well.

In Step 2, Analysis, Goal Setting, and Plan Development, the educator proposes at least one professional practice goal and one student learning goal. To encourage the collaboration that can be key to heightened motivation for improvement, team goals must be considered. Together, the educator and his or her evaluator refine the goals and develop a plan outlining what each will do to support achievement of the goals and what evidence each will be collecting to assess progress on the goals, as well as overall performance against the four standards. The evaluator has the final say over both the goals and the plan.

Step 3, Implementation of the Plan, focuses on observations of practice and collection of evidence across all four standards. The regulations mandate unannounced observations. Though not required by the regulations or the model contract language, peer observation is encouraged.

In Step 4 of the cycle, Formative Assessment/Evaluation, the educator participates in a mid-cycle review to examine evidence, monitor progress, make mid-course corrections to the educator plan, and help guard against unwelcome surprises.

Step 5 is the Summative Evaluation stage, which focuses on progress on goals as well as performance on each of the four standards. Educators earn one of the four performance ratings for each standard and overall: exemplary, proficient, needs improvement, or unsatisfactory.

Student Learning Goals in Action

The student learning goal proposed by the teacher and reviewed and approved by the evaluator in Step 2 is the first significant way student learning is integrated into the Massachusetts evaluation system. Following is an example of how goal setting is working in practice. This was an eighth-grade teacher's self-assessment and goal-setting process:

1. He first considered his team's emerging goals and was tempted to sign on.
2. Before doing so, he examined the prior year's student growth data from the state assessment (MCAS), produced by a growth model adapted from Colorado's. Under that model, every student in grades 4-8 who has taken the MCAS test for at least two consecutive years can be assigned a student growth percentile (SGP) score from 1 (lowest) to 99 (highest), which compares his or her improvement from one year to the next to that of all students in the state with similar MCAS score histories. (A simplified illustration of the SGP idea appears after this example.)
3. When this teacher examined SGP scores for the students he had taught the previous year, he was troubled to see a bimodal distribution for his weakest students: about half had SGP scores above 50; the other half had very low SGP scores, between 10 and 35.
4. Troubled, he looked at data from two years earlier and found a similar pattern. Digging more closely into the data, he found that the low-SGP students seemed to falter most on the open-ended questions requiring writing.
5. He was surprised, because the writing portfolios he and his students kept showed strong progress over the course of the year, including for those students who had struggled most with writing at the beginning of the year. He began to speculate on how his deep devotion to a class writers workshop might be playing a role in creating this pattern. He valued very much the way his writers workshop built students' voice and their motivation to write. But he wondered if the near absence of on-demand writing assignments might be putting his most struggling students at a disadvantage. Might their low MCAS growth reflect their unfamiliarity with on-demand writing? And might the relative absence of on-demand writing in his class hinder their adjustment to other courses where there was a much greater emphasis on it? Although his students seemed to fare reasonably well the next time they took the state assessment, in tenth grade, he speculated that their adjustment to high school might be more fraught because of the sudden need for on-demand writing.
6. As a result of his reflection, he proposed a student learning goal of increasing SGP scores below 40 by 50 percent.
7. He then developed a linked professional practice goal: to modify the writers workshop for identified struggling students to ensure that they would have more exposure to and practice with on-demand writing.
8. He set out in pursuit of his goals, knowing that his success in meeting his student learning goal would help determine his Summative Performance Rating of exemplary, proficient, needs improvement, or unsatisfactory.

This teacher's example of self-assessment and goal setting reflects the intention of the Massachusetts evaluation design: to build motivation for improvement by nurturing a sense of autonomy, respect, and efficacy. It rests on a deceptively simple assumption: data is illuminating, educators want to be effective, and they will act on data they have reason to trust because they have had a role in choosing it.
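The state's actual SGP calculation uses a quantile-regression model over each student's full MCAS score history, so the sketch below is not how DESE computes the score. As a rough illustration of the underlying idea, here is a minimal Python example, in which each student's current score is simply percentile-ranked against peers with a similar prior-year score; the field names and the ten-point peer banding are my own simplifying assumptions.

```python
from collections import defaultdict

def simplified_sgp(students):
    """Percentile-rank each student's current score among peers with a
    similar prior-year score (same 10-point band). A rough stand-in for
    the quantile-regression model the state actually uses."""
    # Group students into coarse "score history" peer groups.
    bands = defaultdict(list)
    for s in students:
        bands[s["prior_score"] // 10].append(s)

    sgps = {}
    for peers in bands.values():
        scores = [p["current_score"] for p in peers]
        n = len(scores)
        for p in peers:
            # Fraction of peers scoring below this student, mapped
            # onto the 1-99 SGP scale.
            below = sum(1 for sc in scores if sc < p["current_score"])
            sgps[p["id"]] = max(1, min(99, round(100 * below / n)))
    return sgps

students = [
    {"id": "A", "prior_score": 214, "current_score": 230},
    {"id": "B", "prior_score": 216, "current_score": 222},
    {"id": "C", "prior_score": 218, "current_score": 238},
]
print(simplified_sgp(students))  # {'A': 33, 'B': 1, 'C': 67}
```

With a real roster, sorting students by these scores is exactly the kind of step the teacher above took to spot his cluster of low-growth students.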

Summative Performance Ratings: Early Results

The early results of the first year of implementation of the new system are promising. Here, for example, is the distribution of ratings among the performance levels used until last year in Springfield, Massachusetts, the state's second-largest school district. This was the year before the district's implementation of the new educator evaluation system:

- 0.25% unsatisfactory
- 54% meets expectations
- 46% exceeds expectations

In 2011-12, Springfield's distribution closely mirrored the national pattern, in which 99 percent of teachers were rated satisfactory in binary systems, 94 percent of teachers were rated in the top two categories in systems using three or more rating categories, and fewer than 1 percent of teachers were rated unsatisfactory in any system, regardless of the number of rating categories.[1]

At the end of its first year of implementation, here is the distribution of Springfield's ratings using the state's required four rating levels:

- 2% unsatisfactory
- 18% needs improvement
- 75% proficient
- 5% exemplary

These results reflect a major shift in expectations.

[1] See The Widget Effect.

A Second Rating for Every Educator: Impact on Student Learning

In addition to the Summative Performance Rating, which incorporates progress on student learning goals into the educator's performance rating, there is a second significant way that the Massachusetts system recognizes and incorporates teacher impact on student learning. Every teacher receives a second rating to complement the Summative Performance Rating: the Impact on Student Learning Rating of high, moderate, or low, based on trends and patterns of student growth on the state assessment, where applicable, and on district-determined measures of student growth common across the same grades and subjects in the district.

District-determined measures (DDMs) are measures of student growth that are common to grade levels or courses across the district. For the 17 percent of educators for whom MCAS growth scores are available, student growth percentiles must be included along with the district-determined measures. Any rating for impact on student learning has to be based on at least four data points, because it must take into account patterns and trends; that is, at least two measures over at least two years.

DDMs are engaging Massachusetts educators in meaningful discussions about the questions at the heart of the work: What's most important for our students to learn? How can we assess it fairly? What can the results tell us about our curriculum and instruction, both collectively and individually?

Districts are being invited to take an expansive view of assessment as they develop, adapt, or adopt their DDMs. No one wants the new teacher evaluation system to unleash a flood of additional paper-and-pencil, multiple-choice testing on students. For now, considerations of technical reliability and validity are taking something of a backseat to a more seat-of-the-pants sense of relevance and fairness. State leaders are hoping to hear from teachers comments such as these: "I had a voice." "Using this assessment will help me because I can use the results to refine my teaching." "Administrators won't rush to judgment." Over time, as teachers and districts develop their assessment literacy and come to see the power of strong assessments, greater reliability and validity will come.

That same eighth-grade teacher was invited by his superintendent to work with his teaching team to propose three DDMs to use in determining their Impact on Student Learning Ratings. The superintendent plans to have a committee of administrators and teachers review the proposals from all of the district's teacher teams. They will either accept the proposed measures or require revisions. For those teachers whose impact can be fairly judged by the state assessment, MCAS growth scores will also be used to determine the Impact Rating. Here is what the eighth-grade English team proposed for DDMs:

1. A fall and spring test of grammar
2. A pre- and post-unit assessment of persuasive writing, using a common rubric
3. A fall and spring sample of narrative writing (a short story or a play), using a common rubric

The teacher reported that the process of reaching consensus about what measures to propose required deeper discussion of the team's assumptions and practices about curriculum, teaching methods, and assessment than members had ever had before. He described the discussion as difficult yet illuminating and enlightening. Forced to articulate the rationale for his preferences and listen carefully to what his colleagues had to say about theirs, he developed a deeper understanding of his own practice and ways to improve it.

The research about the impact of attaching high-stakes consequences for teachers to low student test scores is in its infancy. At the same time, administrator and teacher knowledge of effective assessment practices is highly variable. For both these reasons, Massachusetts did not want the Impact Rating to be based on limited data or a limited time frame. If all three measures proposed by the eighth-grade English team were accepted as DDMs, then each teacher's Impact Rating would be based on eight separate data points: the three the team proposed, each given in two consecutive years, plus the MCAS growth score in ELA for each of two years. With eight data points and the phenomenon of regression to the mean, it is reasonable to predict that a substantial majority of teachers would earn an Impact Rating of moderate. Even with the minimum of four data points (two measures for each of two years), Massachusetts anticipates that most teachers would earn Impact Ratings of moderate.
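The regulations fix the inputs (at least two measures over at least two years), but to my knowledge they do not prescribe a single formula for combining data points into a high, moderate, or low rating. The sketch below is therefore one hypothetical aggregation rule, a simple median, using the eighth-grade team's proposed measures as example data; the function and the measure names are illustrative, not DESE's method.

```python
from statistics import median

# Hypothetical aggregation of an Impact on Student Learning Rating.
LEVELS = {"low": 1, "moderate": 2, "high": 3}
NAMES = {1: "low", 2: "moderate", 3: "high"}

def impact_rating(data_points):
    """data_points: list of (measure_name, year, level) tuples, where
    level is 'low', 'moderate', or 'high' growth on that measure."""
    measures = {m for m, _, _ in data_points}
    years = {y for _, y, _ in data_points}
    if len(measures) < 2 or len(years) < 2:
        raise ValueError("Need at least two measures over at least two years")
    # Median of the numeric levels; an even-count tie rounds to moderate.
    score = median(LEVELS[lvl] for _, _, lvl in data_points)
    return NAMES[round(score)]

points = [
    ("grammar test", 2013, "moderate"),
    ("persuasive writing", 2013, "high"),
    ("MCAS ELA SGP", 2013, "moderate"),
    ("grammar test", 2014, "moderate"),
    ("persuasive writing", 2014, "moderate"),
    ("MCAS ELA SGP", 2014, "low"),
]
print(impact_rating(points))  # moderate
```

A median-style rule also makes concrete why, with eight data points and regression to the mean, most teachers would land at moderate: a single very strong or very weak year cannot move the overall rating on its own.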

How the Summative Performance Rating and Impact on Student Learning Rating Interact

Massachusetts also decided that the Impact Rating would not carry very high stakes, at least in the beginning years of the new educator evaluation system. Therefore, the Impact Rating does not trump the Summative Rating, nor does it carry a percentage or specified weight in determining an overall rating. Instead, the Impact Rating informs the length and focus of the educator's plan.

Summative Rating   | Low Impact                       | Moderate Impact                  | High Impact
-------------------|----------------------------------|----------------------------------|-----------------------------------------------
Exemplary          | 1-Year Self-Directed Growth Plan | 2-Year Self-Directed Growth Plan | 2-Year Self-Directed Growth Plan + Recognition
Proficient         | 1-Year Self-Directed Growth Plan | 2-Year Self-Directed Growth Plan | 2-Year Self-Directed Growth Plan + Recognition
Needs Improvement  | Directed Growth Plan             | Directed Growth Plan             | Directed Growth Plan
Unsatisfactory     | Improvement Plan                 | Improvement Plan                 | Improvement Plan

The Rating of Impact on Student Learning is based on multiple measures of performance, including MCAS Student Growth Percentile (SGP) when available.

The teacher's Summative Performance Rating determines the educator plan. A teacher with tenure (called "professional teacher status" in Massachusetts) whose Summative Rating is proficient or exemplary is placed on a two-year self-directed growth plan. A teacher rated needs improvement is placed on a directed growth plan for six months to a year, depending on the judgment of the evaluator. If the educator does not reach the proficient level of performance by the plan's end, she or he earns a rating of unsatisfactory. A teacher rated unsatisfactory is placed on an improvement plan, which can be as short as thirty days and in no case longer than one year.

The Impact Rating has no impact on directed growth or improvement plans. The Massachusetts assumption is that a rating of needs improvement or unsatisfactory calls for urgent action by the educator and the evaluator. The regulations shorten the typical timeline for termination and streamline the process as well. Impact Ratings have no bearing on those decisions. Instead, Impact Ratings affect the educator plan only for those educators whose performance is rated proficient or exemplary (a sketch consolidating these rules follows the list):

1. If a teacher has been rated proficient or exemplary but his or her Impact Rating is low, the educator plan is reduced to one year, and a central focus of the plan must be on uncovering the reason for the discrepancy: Has the evaluator missed important gaps in the teacher's practice? Are the assessments being used to determine the Impact Rating inadequate indicators of student learning? The supervisor of the evaluator is now expected to play a role, because assistance and direction may be needed to shape a rigorous inquiry.
2. If a teacher has been rated proficient or exemplary and his or her Impact Rating is high, then the district must commit to recognizing the educator, because it is essential that districts identify those teachers who are having a positive impact on student learning so that their practices can be disseminated, modeled, and replicated.
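To pull the matrix and the rules above together, here is a minimal Python sketch of the plan-assignment logic, assuming a teacher with professional teacher status. The function name and return strings are my own; actual plan durations within the allowed ranges remain the evaluator's call.

```python
def educator_plan(summative, impact):
    """Plan assignment per the matrix above, for a teacher with
    professional teacher status. Names and strings are illustrative."""
    if summative == "unsatisfactory":
        # Impact Rating has no bearing here.
        return "Improvement Plan (30 days to 1 year)"
    if summative == "needs improvement":
        # Impact Rating has no bearing here either.
        return "Directed Growth Plan (6 months to 1 year)"
    # Remaining cases: proficient or exemplary.
    if impact == "low":
        # Shortened plan whose central focus is explaining the
        # discrepancy between strong practice and low measured impact.
        return "1-Year Self-Directed Growth Plan (discrepancy inquiry)"
    if impact == "high":
        return "2-Year Self-Directed Growth Plan + district recognition"
    return "2-Year Self-Directed Growth Plan"

print(educator_plan("proficient", "low"))   # 1-Year Self-Directed Growth Plan ...
print(educator_plan("exemplary", "high"))   # 2-Year ... + district recognition
```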

It is too early to know for sure whether this low-stakes use of the Impact Ratings will achieve the Massachusetts goals. In many respects, Massachusetts is trying to thread the needle in its approach to assessing teacher impact on student learning. We want very much for data about student learning to be at the center of teacher self-assessment, goal setting, and evidence collection. We believe thoughtful, rigorous, and collaborative analysis of a wide range of data about student learning will be illuminating and will change educator practice. At the same time, we believe that an overreliance on student assessment data to determine teacher performance ratings runs too great a risk of stifling the collaboration and spirit of inquiry that are critical to improving practice. We fear it would, instead, elicit defensive responses that are the antithesis of the embrace of data that we seek.

At its heart, our approach represents an effort to bring a healthy dose of humility to the challenge of educator evaluation. Humility doesn't always come easily to state policy makers and bureaucrats like us. But we would argue that it is essential in the much-needed work of changing the culture of schools to a culture of high standards, shared accountability for results, collaboration, and continuous improvement.