Do High-School Teachers Really Affect Test-Score and Non-Test Score Outcomes?: The Need to Account for Tracking


By C. KIRABO JACKSON*

In a high-school setting, both non-random assignment of students and unobserved track-level treatments will bias teacher effect estimates. Using administrative teacher and student data, I present a strategy that exploits detailed course-taking information to credibly estimate the effects of 9th grade teachers on test scores, college aspirations, attendance, disciplinary infractions, and remaining in school. I find strong evidence of bias due to track-specific treatments. There are modest effects of math teachers on math test scores and no effect of English teachers on English test scores. While math teachers have no effect on non-test score outcomes, English teachers affect attendance, disciplinary infractions, and dropout. I show that the patterns are not driven by student sorting. (JEL I21, J00)

Studies that identify elementary school teachers associated with student test-score gains show that a one standard deviation increase in teacher quality leads to between one-tenth and one-fifth of a standard deviation increase in math and reading scores (Rivkin, Hanushek, & Kain, 2005; Rockoff, 2004; Kane & Staiger, 2008). Using similar methodologies in high-school, researchers find similar effects on math (Aaronson, Barrow, & Sander, 2007) and reading (Koedel, 2008) achievement. The apparent similarity across studies has led to a sense that this relationship between teacher quality and student achievement is universal across subjects and grade levels, and has led to the use of value-added measures to assess teacher quality for personnel evaluations and merit pay programs in elementary, middle, and high-schools.
Because elementary and secondary schools differ in important ways, value-added methodologies designed for elementary school teachers may be inappropriate for measuring teacher quality in high-school for two reasons. First, while much of the literature on teacher effects has focused on biases due to the non-random matching of students to teachers in elementary school settings (Koedel & Betts, 2009; Rothstein, 2010), in a high-school setting, even with random assignment of students to teachers, if different teachers teach in different tracks and students in different tracks are exposed to different treatments (e.g. students in the gifted and talented track receive extra college-prep sessions, or students in the remedial math

class all have a bad social studies teacher) there will be bias due to "track treatment effects". That is, in high-schools there is possible selection bias due to non-random placement of students into tracks, and omitted variables bias due to track-specific treatments. This second source of bias creates additional challenges to identifying teacher effects in high-school. The extant literature has not accounted for this second source of bias, and may therefore be misleading about the importance of teachers in high-school. The second reason test-based value-added may be an inappropriate measure of teacher quality in high-school is that it captures only one aspect of student output. There is mounting evidence that the effects of high-schools on test scores are not the best predictors of their effects on educational attainment (Booker, Sass, Gill, & Zimmer, 2008; Deming, Hastings, Kane, & Staiger, 2011) or other important adult outcomes. In light of the fact that in 2002 nearly half of African American and Latino students, and 11% of white students, attended high-schools in which graduation was not the norm (Balfanz & Legters, 2004), improving test scores may not be the most important objective. However, there is little understanding of the effect of teachers on important non-test-score outcomes, so current teacher test-score value-added models likely paint an incomplete picture of the importance of teachers.[1] This lack of completeness is disquieting given that the relationship between effectiveness in increasing test scores and effectiveness in improving important non-test score outcomes is unknown, even though there is a current policy push to use test-score based value-added in the hiring and firing of teachers.[2]
I aim to address both of these concerns by (1) presenting a strategy to estimate the effects of math and English teachers on test scores in 9th grade that exploits very detailed course-taking information and allows one to address both sources of bias; (2) estimating the effects of 9th grade math and English teachers on college aspirations, attendance, disciplinary infractions, and remaining in school for 10th grade (a very good predictor of not dropping out and of eventual graduation)[3]; and (3) determining whether those teachers who are effective at raising test scores in 9th grade are also effective at reducing student dropout and delinquent behaviors in 9th grade and improving attendance and college aspirations.

To address the unobserved track-level treatment problem, because I can observe both the courses students take and the level of instruction within particular courses, I can control for the unique set of courses (and their levels) taken by students, so that all comparisons are made among students who are at the same school and in the same academic track. Comparing the outcomes of students with different teachers at the same school and in the same track removes the influence of any school-by-track level treatments that could confound comparisons of teachers who teach in different tracks (such as students in the advanced math course taking physics as opposed to biology, which may have a direct effect on their math performance). Making comparisons among students in the same track and school also removes bias due to sorting or selection into tracks. In such models, variation comes from comparing the outcomes of students who are in the same track and school but are exposed to different teachers, due either to (a) changes in the teachers for a particular course and track over time, or (b) schools having multiple teachers for the same course and track in the same year. Because personnel changes within schools over time may be correlated with other improvements within schools, I estimate models that include school-by-year fixed effects. The remaining concern is that comparisons among students within the same track may be susceptible to selection bias. I argue that most plausible stories of student selection involve selection to tracks or courses rather than to teachers within tracks or courses, so that teacher effects based on within-school-track variation should not be confounded by selection.

[1] One exception is Koedel (2009), who uses a sample of 30 teachers in 3 schools to analyze the effects of math teachers on high school dropout.
[2] For example, under Race to the Top (Senate Bill 736), teacher and administrator evaluations must base at least 50% of the evaluation on student learning growth measured by state assessments or end-of-course exams.
[3] Using the North Carolina data, I find that the correlation between enrolling in 10th grade and graduating on time is 0.5, and that between being enrolled in 10th grade and dropping out is -0.3. Specifically, being in school for tenth grade is associated with a 68 percentage point increase in graduating within four years (a 1.5 sd effect size) and a 24 percentage point reduction in the likelihood of dropout (a 1.7 sd effect size).
However, to assuage this concern, I show that conditional on track and course fixed-effects, teacher assignments are orthogonal to observable student characteristics. Also, I exploit the statistical fact that any selection within tracks will be eliminated by aggregating the treatment at the track level, and show that, over time, cohorts in tracks exposed to higher average pre-sample estimated teacher value-added have better outcomes on average. While I find little evidence of bias due to student sorting, I do find evidence of substantial bias due to omitted track-level treatments, even in models that condition on a rich set of covariates and multiple lags of test scores. Specifically, in models that do not account for track-level treatments within schools, there is substantial covariance across mean classroom residuals for the same teacher over time for all test score and non-test score outcomes, indicative of large persistent teacher effects. However, when track-by-school fixed-effects are

included, there are important effects of math teachers on math performance, but little evidence that English teachers systematically improve English test scores, suggesting that models that do not explicitly control for track-specific treatments may be biased. While math teachers have no effect on non-test-score outcomes, English teachers have persistent effects on attendance, discipline problems, and remaining in school for 10th grade (a proxy for not dropping out). These findings are robust across several models and are validated by various empirical tests. This paper makes four important contributions. First, this is the first study to show that track-level treatments are important in a high-school setting, and to present a methodology that can be used by both policy-makers and researchers to credibly identify teacher effects in a high-school setting. Second, this is the first study to show that high-school teacher effects do not exist for test scores in all subjects. Third, this paper is among the first to present credible evidence that high-school teachers have non-trivial effects on important non-test score outcomes, so that policy-makers and researchers should take a broader view of the importance of teachers. Finally, this paper demonstrates that while test-score based measures of quality are reasonable for high-school math teachers, using such measures in the hiring and firing of English (and perhaps other subject) teachers may be misguided, and that schools may want to use both test-score and non-test-score value-added in evaluating teachers. The remainder of the paper is as follows: Section II describes the data used, Section III details the empirical framework, Section IV details the identification strategy, Section V presents the main results, robustness checks, and specification checks, and Section VI concludes.
II Data: This paper uses data on all public middle- and high-school students in North Carolina from 2005 to 2010 from the North Carolina Education Research Data Center.[4]

[4] These data have been used by Clotfelter, Ladd, & Vigdor (2010) to look at the effect of high-school teachers' qualifications on student test scores. They use variation across subjects within the same student to identify the effect of teacher characteristics on student outcomes. This methodology does not allow one to identify individual teacher effects and does not address the bias due to track-specific treatments.

The student data include demographic characteristics, detailed transcript data for all courses taken, middle-school achievement data, end-of-course scores for Algebra I and English I, and codes allowing one to link students' end-of-course test-score data to the individual teachers who administered the tests.[5] Because the teacher identifier listed is not always the student's teacher, I link these data to detailed personnel records and teaching activities and remove all student and teacher records that are not associated with a regular classroom teacher who teaches Algebra I or English I. Because English I and Algebra I are the two tests that have been the most consistently administered over time, I limit the analysis to students who took either the Algebra I course or the English I course. Over 90 percent of all 9th graders take at least one of these courses, so the resulting sample is representative of 9th graders as a whole. To avoid the endogeneity bias that would result from teachers having an effect on repeating ninth grade, the master data are based on the first observation for which a student is in ninth grade. Summary statistics are presented in Table 1. The data cover 377,662 ninth grade students in 619 secondary schools, in classes with 6,538 English I teachers and 6,215 Algebra I teachers. Roughly half of the students are male; about 60 percent are white, 29 percent are black, 6 percent are Hispanic, 2 percent are Asian, and the remaining one percent are Native American, mixed race, or other. About 3.8 percent of students have a highest parental education level (i.e. the highest level of education of the student's two parents) below high-school, 27.7 percent have a parent with a high school degree, 8.5 percent with a junior college or trade school degree, and about 19 percent with a four-year college degree or greater (41 percent of observations in the dataset have missing parental education and are coded as such). About 10.8 percent of students have limited English proficiency. The achievement data have all been normalized and standardized to be mean zero with unit variance for each cohort and test.
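The per-cohort standardization described above can be sketched as follows (a generic z-scoring routine with hypothetical scores, not the paper's actual data pipeline; the paper standardizes each cohort-test cell to mean zero and unit variance):

```python
import statistics

def standardize_by_group(scores, groups):
    """Z-score each value within its group (e.g. cohort-test cell): mean 0, sd 1 per group."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    stats = {g: (statistics.fmean(v), statistics.stdev(v)) for g, v in by_group.items()}
    return [(s - stats[g][0]) / stats[g][1] for s, g in zip(scores, groups)]

# Hypothetical raw scale scores for two cohorts of the same exam.
scores = [52.0, 60.0, 55.0, 47.0, 71.0, 64.0, 68.0, 77.0]
cohorts = [2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006]
z = standardize_by_group(scores, cohorts)
# Each cohort's z-scores now have mean 0 and unit (sample) standard deviation.
```

This keeps scores comparable across cohorts and test forms that use different raw scales.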
Mean incoming 7th and 8th grade test scores in the 9th grade sample are approximately one-tenth of a standard deviation higher than the average in 7th or 8th grade. This is because the sample of 9th grade students is less likely to have repeated a grade or to have dropped out of the schooling system. Looking to the main outcomes: the average yearly attendance was 168 out of 180 days (indicating mean unexcused absences of 12 days), about 74 percent of students claimed to have aspirations to attend some form of college, about 76 percent of students were in 10th grade the following year (i.e. 24 percent of 9th graders were not in 10th grade, a very good predictor of dropping out of high school), and the average student committed 0.39 offences.

[5] There are also end-of-course scores for History and Science. However, because exams in these subjects are not given in all years, I limit the analysis to students who took English I or Algebra I.

The Measure of Track

One of the most important variables in the analysis is the measure of academic track and the definition of a school-track. Even though schools may not have explicit labels for tracks, they may practice de facto tracking by placing students into distinct groups of courses. While there are hundreds of courses that students can take (including special topics and reading groups), 20 courses make up over 70 percent of all courses taken in 9th grade, and some combination of these 20 courses makes up more than 75 percent of all courses for most students. I list these courses in Table 2. Of these twenty courses, 10 are elective courses and 10 are core academic courses (shown in bold). English I is the most common academic course, taken by 89.2 percent of 9th graders, followed by World History, which 84 percent of students take, Earth Science, which 63 percent of students take, and Algebra I, which 51 percent of students take. Other common academic courses that fewer than 50 percent of students take include Art, Pre-Algebra, Biology, Introduction to Algebra, Basic Earth Science, and Spanish I. Even though World History and Earth Science are very common courses, there is no end-of-course exam corresponding exactly to these courses for all years. A key detail in these data is not just the specific course taken but also the level of the course taken. Even among students taking Algebra I or English I, there are three different levels of instruction (advanced, regular, and basic), so that not all students who take Algebra I or English I are in the same academic track. As such, I exploit the richness of the data and take as my measure of a school-track the unique combination of the 10 largest academic courses, the level of Algebra I taken, and the level of English I taken at a particular school.
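As a concrete sketch of this definition (illustrative only; the school names, course lists, and helper function here are hypothetical, not drawn from the paper's data), a school-track identifier can be built as the unique combination of school, the set of core academic courses taken, and the levels of Algebra I and English I:

```python
from collections import defaultdict

def school_track_id(school, courses, algebra_level, english_level, core_courses):
    """Build a school-track key: same school, same set of core courses,
    and same Algebra I / English I levels => same school-track."""
    core_taken = tuple(sorted(c for c in courses if c in core_courses))
    return (school, core_taken, algebra_level, english_level)

# Hypothetical core-course list standing in for the paper's 10 largest academic courses.
CORE = {"English I", "World History", "Earth Science", "Algebra I", "Biology"}

students = [
    ("s1", "Northside", ["English I", "World History", "Algebra I", "Band"], "regular", "regular"),
    ("s2", "Northside", ["Algebra I", "English I", "World History", "Chorus"], "regular", "regular"),
    ("s3", "Northside", ["English I", "World History", "Algebra I"], "advanced", "regular"),
    ("s4", "Southside", ["English I", "World History", "Algebra I"], "regular", "regular"),
]

tracks = defaultdict(list)
for sid, school, courses, alg, eng in students:
    tracks[school_track_id(school, courses, alg, eng, CORE)].append(sid)

# s1 and s2 share a track (electives are ignored); s3 differs by Algebra I level;
# s4 is at a different school, hence a different school-track.
print(len(tracks))  # 3 distinct school-tracks
```

Grouping on a composite key like this is what makes the fixed-effects comparisons below operate strictly within a school-track.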
As such, all students who take the same number and set of courses, and the same levels of English I and Algebra I, at the same school are in the same school-track. Students who take the same set of courses at another school are in different school-tracks; students who are in the same school but took either a different number of courses or at least one different course are in different school-tracks; and students at the same school who took the same courses but took Algebra I or English I at different levels are in different school-tracks. Defining tracks flexibly at the school-by-course-by-level level allows different schools to have different selection models and treatments for each track. Due to the nature of how tracks are defined, some tracks have only one student, but because many students pursue the same course of study, only 3.7 percent of all student

observations are in singleton tracks. In total there are 18,226 non-singleton school-tracks across the 726 schools, 60 percent of students are in school-tracks with more than 50 students, and the average student is in a school-track with 117 other students. To provide a sense of the variation within and across school-tracks, I present the same summary statistics for the data aggregated at the school-by-year level and the school-track-by-year level in Table 1. Comparing the standard deviations of the variables provides some indication of sorting into tracks. The standard deviations of reading and math scores in 8th grade are about 0.95 for both the student-level data and the track-level data. Given that the track-by-school level data are more aggregated, if students were randomly assigned to schools and tracks, the standard deviation of the school-by-track means would be smaller than that of the overall data. Because there are approximately 6.5 students per track-school (including singleton tracks), the standard deviation of the track-by-school level data would be approximately √6.5 ≈ 2.5 times smaller than that of the individual data under random assignment, but the standard deviation of the individual data is in fact very similar to that of the school-track level means. This suggests that students are systematically grouped into tracks by incoming achievement in a manner that increases the standard deviation of the school-track mean by a factor of approximately 2.5 over what one would observe with random assignment to tracks and schools. One can see a similar pattern for other covariates: students are systematically grouped into tracks by ethnicity, parental education, and LEP status, in a manner that increases the standard deviation of the track mean by a factor of approximately 2.5 over what one would observe with random assignment to tracks and schools.
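The logic of this check can be illustrated with a small simulation (a sketch with assumed parameters, not the paper's computation): under random assignment, the standard deviation of group means shrinks by roughly the square root of the group size, so group means that are as dispersed as the individual data signal systematic sorting.

```python
import random
import statistics

random.seed(0)

GROUP_SIZE = 25   # hypothetical track size (the paper's average is about 6.5)
N_GROUPS = 4000

# Individual scores drawn i.i.d. standard normal, then randomly grouped.
scores = [random.gauss(0, 1) for _ in range(GROUP_SIZE * N_GROUPS)]
group_means = [
    statistics.fmean(scores[g * GROUP_SIZE:(g + 1) * GROUP_SIZE])
    for g in range(N_GROUPS)
]

sd_individual = statistics.stdev(scores)
sd_group_means = statistics.stdev(group_means)

# Under random assignment the ratio should be close to sqrt(GROUP_SIZE) = 5.
print(round(sd_individual / sd_group_means, 2))
```

In the paper's data the observed ratio is close to 1 rather than √6.5, which is the evidence of sorting into tracks.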
For Algebra I and English I scores, the standard deviation of the school-by-track-by-year level data is larger than that of the individual-level data. This indicates that something is happening at the track level to increase the dispersion of these outcomes. This could be due to sorting into tracks or due to track-level treatments, suggesting that taking track-level sorting or treatments into account may be important. It is important to note that much of this evidence of tracking may occur at the school level rather than at the track level within schools.

Correlation Between the Outcome Measures

In this paper I aim to look at the effect of teachers on test-score and non-test score outcomes, and to determine whether those teachers who improve test-score outcomes are also those who improve non-test score outcomes. As such, it is instructive to determine whether

the test-score and non-test-score outcomes are correlated. In Table 3 I present the correlations between the main outcomes of interest for the raw individual-level data, for the individual-level data after taking out school and year means, and for the individual-level data after taking out the track-by-school and year means. The correlations between the test-score and non-test score outcomes are in the direction one would expect, but they are surprisingly low. Specifically, the correlation between one's 9th grade English score and one's 9th grade algebra score is about 0.54, while those between either the 9th grade English score or the 9th grade algebra score and the non-test score outcomes (college aspirations, offences in 9th grade, attendance, enrollment in 10th grade) are all below 0.2 in magnitude. This implies that while students who have better test scores tend to have better non-test-score outcomes, the relationship between test-score and non-test score outcomes is weak. This might imply that those teachers who improve test-score outcomes may not be the same ones who improve non-test score outcomes, and vice versa. In the second column, I present the same correlations after demeaning the outcomes by school and year (to allow for comparison of students within the same school and year). This has a small effect on the correlations. In the bottom panel, I present the same correlations after demeaning the outcomes by school-by-track and year (to allow for comparison of students within the same school, track, and year). Now the correlations between test-score outcomes and non-test score outcomes are all close to zero. This means that among students in the same school and track, those who have better test-score outcomes are not necessarily those who have the best non-test score outcomes. This further implies that those teachers who improve test-score outcomes may not be the same teachers who improve non-test score outcomes.
To test whether this is due to a lack of variation within tracks, I computed the fraction of the variance in each outcome that exists within tracks relative to the full variation. For all outcomes, over half of the variance is within tracks. The only exception is being in 10th grade, for which the within-track share is 49 percent, so the lack of correlation is not due to a lack of variation in these outcome measures within tracks. Overall, the data indicate that there is not a lot of covariance between the test-score and non-test score outcomes, and that what small covariance does exist likely occurs at the school-track level rather than within school-tracks.

III Empirical Framework Under Tracking: The objective of this paper is to estimate the effect of individual math and English teachers on 9th grade students' test score and non-test score outcomes, such as entering 10th grade (the best predictor of not dropping out of school), aspirations for college, disciplinary infractions, and attendance. This entails comparing the outcomes of students who have one teacher to the outcomes of students who have another teacher. In a high-school setting, obtaining consistent estimates of teacher effects can be challenging for two reasons. First, because parents and students perceive track placement as having important long-run implications for the trajectory of a student's life, students may select to tracks on unobserved dimensions, which would lead to selection bias. Second, high-school students are often placed into tracks (groups of courses), such that taking a class with a particular teacher means taking a particular course, which may be associated with taking a particular set of other courses, being counseled in a different way, and being exposed to different peers (in other classes). These other "treatments" associated with a particular track may have an independent effect on outcomes and may therefore confound estimated individual teacher effects. Specifically, for any school-track c, insofar as students select to tracks within high-school based on unobserved determinants of the outcomes, there may be unobserved determinants of outcomes that are correlated with a student's track. Also, with track-specific treatments P, there is an additional unobserved determinant of student outcomes (P_c) associated with being in that school-track. Under the assumptions that inputs are additively separable in the production of student achievement and that lagged achievement is a summary statistic for the full history of family, school, and student inputs, one can write student achievement as [1] below.
[1] Y ijcy = A iy-1 δ + X iy β + I ji θ j + π(P c) + ε ijcy

Here, Y ijcy is the outcome of student i with teacher j in school-track c in year y, A iy-1 is the incoming achievement level of student i, X iy is a matrix of student-level covariates obtained in 8th grade (including parental education, Limited English Proficiency status, ethnicity, and gender), I ji is an indicator variable equal to 1 if student i is in class with teacher j and equal to 0 otherwise, θ j is a teacher fixed effect, π(P c) is the effect of the treatment specific to students in school-track c, and ε ijcy is the idiosyncratic error term. When the school-track is unobserved, teacher effect estimates will be biased by a term that is equal to the correlation between having teacher j and being in school-track c, multiplied

by the effect of the other treatments in school-track c, plus the effect of selection into school-track c. That is, when the school-track is unobserved, the conditional expectation of the estimated teacher effect θ̂ j is given by [2] below.

[2] E(θ̂ j | I, X, A) = θ j + (σ_Iji,c / σ²_Iji)[π(P c) + E(ε ijcy | c)]

Equation [2] makes clear that in the presence of tracking, unless there are no other treatments associated with being in a track (i.e. π(P c) = 0) and there is no selection into tracks (i.e. E(ε ijcy | c) = E(ε ijcy)), estimates obtained from equation [1] without accounting for track treatment effects and selection into school-tracks will be biased. This also shows that, even with random assignment of students to teachers, if certain teachers are associated with certain tracks at certain schools and there are track-level treatments (as is likely true in high-school settings), teacher effect estimates from [1] that do not account for track-level treatments will be biased. Also, if the school-track-level treatments and student selection are orthogonal to teacher quality, then failing to account for these sources of bias will likely lead one to overstate the variability of teacher quality. More formally,

Var(θ̂ j | I, X, A) = Var(θ j) + Var((σ_Iji,c / σ²_Iji)[π(P c) + E(ε ijcy | c)]) ≥ Var(θ j).

Without detailed information on exactly what treatments P are associated with each school-track c, and exactly what characteristics of students ε ijcy are associated with each school-track c, it is difficult to control for these sources of bias directly. However, one can remove both the influence of any track-specific treatments and the effects of selection across tracks by making inferences within groups of students in the same track at the same school. In a regression context, if one can observe track placement, this is achieved by including I ci, an indicator variable equal to 1 if student i is in school-track c and 0 otherwise. This leads to [3] below.
[3] Y ijcy = A iy-1 δ + X iy β + I ji θ j + I ci θ c + ε ijcy

In expectation, the coefficient on this school-track indicator variable is E(θ̂ c) = π(P c) + E[ε ijcy | c]. This reflects a combination of both the unobserved treatment specific to school-track c and selection on unobserved dimensions into school-track c. By conditioning on school-tracks (a group of courses taught at a particular level within a particular school), one can obtain consistent estimates of the teacher effects θ j as long as there is no selection to teachers within a school-track. In Section V.2 I show that the main results are not driven by selection within tracks.

Sources of identifying variation:

Because the main models include school-by-track fixed effects, teacher effects are identified by comparing the outcomes of students at the same school in the same track who have different teachers. In these models, identification of teacher effects comes from two sources of variation: (1) comparisons of teachers at the same school teaching students in the same track at different points in time, and (2) comparisons of teachers at the same school teaching students in the same track at the same time. To illustrate the sources of variation, consider the simple case illustrated in Table 4. There are five tracks, A, B, C, D, and E, in a single school. Each track is defined by the school, the academic courses, and the level of the Algebra I and English I class taken. There are four math teachers at the school at all times, but the identities of the teachers change from year to year due to staffing changes. The first source of variation is due to changes in the identities of Algebra I and English I teachers over time owing to staffing changes within schools. For example, between 2000 and 2005 teachers 3 and 4 were replaced with teachers 5 and 6. Because teachers 3 and 6 both teach in tracks C and D (in different years), one can estimate the value-added of teacher 3 relative to teacher 6 by comparing the outcomes of students in tracks C and D with teacher 3 in 2000 with those of students in tracks C and D with teacher 6 in 2005. Similarly, because teachers 4 and 5 both teach in tracks B and E (in different years), one can estimate the value-added of teacher 4 relative to that of teacher 5. To account for any mean differences in outcomes between 2000 and 2005 that might confound comparisons within tracks over time (such as school-wide changes that may be coincident with the hiring of new teachers), one can use the change in outcomes between 2000 and 2005 for teachers 1 and 2 (who are in the school in both years) as a basis for comparison.
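The confound that the framework above formalizes, teachers tied to school-tracks that carry their own treatments, can be illustrated with a small simulation. This is an illustrative sketch, not the paper's code; every parameter value (the 0.5 track treatment, the 80 percent teacher-track overlap) is invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Two school-tracks; track 1 carries an extra treatment bundle pi(P_c) = 0.5.
track = rng.integers(0, 2, n)
# Teacher 1 teaches mostly track-1 students, teacher 0 mostly track-0 students.
teacher = np.where(rng.random(n) < 0.8, track, 1 - track)
# True teacher effects are zero: outcomes depend only on the track treatment.
y = 0.5 * track + rng.normal(0.0, 1.0, n)

# Naive estimate: raw mean difference between teachers, biased by the track
# bundle (expected bias = 0.5 * [P(track 1 | teacher 1) - P(track 1 | teacher 0)] = 0.3).
naive = y[teacher == 1].mean() - y[teacher == 0].mean()

# Within-track estimate: demean outcomes by track (the analogue of track
# fixed effects), then compare teachers; the track treatment drops out.
y_dm = y - np.where(track == 1, y[track == 1].mean(), y[track == 0].mean())
within = y_dm[teacher == 1].mean() - y_dm[teacher == 0].mean()
```

The naive comparison recovers roughly 0.3 even though the two teachers are identical, while the within-track comparison is close to zero, which is the sense in which conditioning on school-track indicators removes the track-level confound.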
In a regression setting this is accomplished with the inclusion of school-by-year fixed effects (Jackson & Bruegmann, 2009). This source of variation is valid as long as students do not select across cohorts (e.g. stay back a grade or skip a grade) or schools in response to changes in Algebra I and English I teachers. In Section IV.1 I show that teacher value-added (estimated out of sample) is unrelated to observable student characteristics that predict outcomes, suggesting that such selection is not a serious cause for concern. The second source of variation comes from having multiple teachers teaching the same course in the same track (e.g. there are two Algebra I teachers who teach the college-bound students). For example, because both teachers 2 and 4 taught students in track B in 2000, one can

estimate the value-added of teacher 2 relative to that of teacher 4 by comparing the outcomes of teachers 2 and 4 among those students in track B in 2000. Similarly, because both teachers 2 and 5 taught students in track B in 2005, one can estimate the value-added of teacher 2 relative to teacher 5 by comparing the outcomes of teachers 2 and 5 among those students in track B in 2005. 6 Because this variation comes from comparing teachers in the same track at the same school at the same time, the key identifying assumption is that while students may select into tracks, students do not select to individual teachers within tracks. In Section IV.1 I show that differences in teacher value-added (estimated out of sample) within school-tracks are unrelated to differences in observable student characteristics that predict outcomes within school-tracks, suggesting that selection to teachers within tracks is not a serious cause for concern. Moreover, in Section V.2, I show that the main findings cannot be driven by student selection within tracks.

IV Identification Strategy: While one approach to estimating the importance of teacher quality is to estimate equation [3] and compute the variance of the estimated teacher effects θ̂ j, this would overstate the variance of true persistent teacher quality because teacher effects are estimated with error due to sampling variation and there are classroom-level disturbances (such as a dog barking outside the classroom on the day of the test). As such, to estimate the importance of individual teachers for student outcomes while using only the within-track variation, I follow Kane and Staiger (2008) and Jackson (2010). 7 Specifically, I estimate a model without teacher indicator variables, and then compute the covariance of mean teacher-level residuals for the same teacher over time as my estimate of the variance of the persistent component of teacher quality that is observed across years. Specifically, in the first stage I estimate equation [4] below.
[4] Y ijcy = A iy-1 δ + X iy β + X* c π c + I ci θ c + θ sy + ε* ijcy

The key conditioning variable is I ci, an indicator variable denoting the school-track c (defined at the school-by-course-group-by-course-level) of student i. A iy-1 is the incoming achievement of student i. In response to concerns about dynamic tracking, I include both math and reading test scores from both 7th and 8th grade (two lags of achievement). X iy is a matrix of additional student covariates such as parental education, ethnicity, gender, and LEP status. To account for classroom-level characteristics that may be correlated with teacher quality, I also include the mean characteristics of the other students in the classroom, X* c. These include peer math and reading scores from 8th and 7th grade in addition to mean peer parental education, mean LEP status, and the ethnic and gender composition of the class. To account for any school-level time effects (such as the hiring of a new school principal) that would affect all students in the school, I also include school-by-year fixed effects θ sy. Because this model does not include teacher indicator variables, the error term includes the teacher effect, so that ε* ijcy = θ j + ε ijcy. In the second stage, I compute mean residuals from [4] for each teacher in each year,

ē jy = (1/n jy) Σ i e* ijcy = θ j + e jy,

where n jy is the number of students in class with teacher j in year y and e jy is the mean of the non-persistent errors for teacher j in year y. To compute the variance of persistent teacher quality, I calculate the covariance of mean residuals for the same teacher in year y and year y-1. Under the assumption that the non-persistent error components e jy are uncorrelated over time (recall that the model includes school-by-year fixed effects) and uncorrelated with teacher quality, the covariance of mean residuals for the same teacher over time is a measure of the true variance of persistent teacher quality. That is, where Cov(θ j, e jy) = Cov(θ j, e jy-1) = Cov(e jy, e jy-1) = 0, the covariance of mean residuals for the same teacher over time is Cov(ē jy, ē jy-1) = Cov(θ j, θ j) = Var(θ j). For inference purposes, I also present the p-value associated with the null hypothesis that the residuals for one year are uncorrelated with residuals from another year for the same teacher.

Footnote 6: One can identify all teacher effects for Algebra I so long as there is sufficient overlap of courses and tracks. In this example, teacher 2 can be compared to 4 within track B in 2000, and compared to 3 within track C in 2000. Similarly, 4 can be compared to 5 within tracks B and E across 2000 and 2005, and 3 can be compared to 6 within tracks C and D across 2000 and 2005. As such, 2 can be directly compared to 3 and 4, and indirectly to 5 and 6, so that all teachers who teach Algebra I can be compared to each other.

Footnote 7: This procedure is also used in Jackson & Cowan (2010) and Jackson (2009).
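The second-stage calculation (mean residuals by teacher-year, then the cross-year covariance) can be sketched in a few lines. This is a stylized implementation under the stated assumptions, with column names invented and the student-level residuals taken as given from the first stage:

```python
import pandas as pd

def persistent_teacher_variance(df: pd.DataFrame) -> float:
    """Estimate Var(theta_j) as the covariance, across teachers, of mean
    residuals for the same teacher in adjacent years.
    Expects columns: teacher, year, resid (student-level residuals from [4])."""
    # Mean residual for each teacher-year cell, e_bar_{jy}.
    cell = df.groupby(["teacher", "year"])["resid"].mean().reset_index()
    # Pair each teacher-year with the same teacher's previous year.
    cell["year_prev"] = cell["year"] - 1
    paired = cell.merge(
        cell, left_on=["teacher", "year_prev"], right_on=["teacher", "year"],
        suffixes=("", "_lag"),
    )
    # Cov(e_bar_{jy}, e_bar_{jy-1}) = Var(theta_j) under the stated assumptions.
    return paired["resid"].cov(paired["resid_lag"])
```

On data with no true teacher effects but classroom-level shocks, this covariance is centered on zero, which is the sense in which it is more conservative than an F-test on estimated teacher indicators.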
This is a test for persistent teacher quality effects, which is different from a test of the null hypothesis that all the teacher effects are equal to zero. Because estimated teacher effects contain classroom-level disturbances, in small samples the F-test will tend to over-reject the null. 8 As such, I present the more conservative test for persistent teacher quality effects. Also, while aspects of teacher quality may be transitory, from the policy-maker's perspective it is the persistent component that is important for predicting teacher performance in the future.

Footnote 8: This can be illustrated with a simple example. Suppose there are no teacher effects, but there are binary classroom-level disturbances that are either -1 or 1, each with probability one half. If each teacher is observed twice, then the possible outcomes for each teacher are (1,-1), (-1,1), (1,1), and (-1,-1). Half the time the classroom effect will be the same across years ((-1,-1) or (1,1)) and the teacher's mean will be non-zero. In this simple scenario, even though there are no teacher effects, in expectation half of the teachers have estimated teacher effects (the average of the two classroom effects) that are distinguishable from zero. As such, the F-test of the null hypothesis that all of the teacher effects are equal to zero will be rejected. This is essentially a small-sample problem that is lessened when teachers are observed in a large number of classrooms. Results from simulations confirm that with classroom-level disturbances a simple F-test strongly rejects the null hypothesis of no teacher effects even where there are none. In contrast, on the same simulated data, the test of persistence across mean classroom residuals fails to reject. The two tests perform similarly only where there are no classroom-level disturbances.

IV.1 Test for Selection or Sorting Into Tracks

Before presenting the main results it is helpful to assess the degree to which non-random sorting of students into tracks could lead to bias. To test for the extent of tracking in general, I follow Aaronson, Barrow, and Sander (2007) and Koedel (2008), who assess the extent to which students may be sorted based on test scores or test-score gains. Specifically, I calculate the mean within-teacher-year student test-score dispersion (i.e. the average, across all teacher-years, of the standard deviation of test scores computed for each teacher's classroom in a given year) for the observed teacher assignments and compare it to the mean within-teacher-year student test-score dispersion under counterfactual teacher assignments. Table 5 displays the average within-teacher-year standard deviation of 8th grade scores, 7th grade scores, and test-score growth between 7th and 8th grade for both math and reading. I present the actual within-teacher-year test-score dispersion, what one would observe with full student sorting (on the variable) within schools, full student sorting across all classrooms and schools, random student assignment within schools, and finally random student assignment across all classrooms and schools. Comparing the actual test-score dispersion to that when students are perfectly sorted across teachers within their school reveals that a within-school sorting mechanism would reduce the within-teacher-year standard deviation to between 25 and 30 percent of the actual observed within-teacher-year standard deviation, depending on the exact test score and subject. In contrast, the actual within-teacher-year test-score dispersion is between 88 and 100 percent of what one would observe under random assignment of students to classrooms within schools, depending on the exact incoming achievement measure used. This is very similar to findings in other studies and suggests that tracking based on ability is minimal, so that value-added estimates are unlikely to be biased due to sorting into tracks. To provide further evidence on sorting bias, I compute predicted outcomes for all students and test whether predicted outcomes are correlated with the pre-sample estimated value-added of the

teacher. Specifically, I split the data in two, estimating teacher value-added using data from 2005 through 2007 and computing predicted outcomes for students in 2008 through 2010. The predicted outcomes are fitted values from a linear regression of the outcomes on all observable student characteristics. 9 To account for estimation error in the value-added estimates, I compute empirical Bayes estimates following Kane and Staiger (2008). For details on how this is constructed, see Appendix Note 1. I then run a regression of standardized teacher effects estimated on 2005-2007 data on predicted outcomes from 2008-2010 (with school and year fixed effects) to see whether students who are likely to have better outcomes (based on observed characteristics) tend to be assigned to teachers who had better or worse than average value-added historically. With positive assortative matching the coefficient would be positive, and with negative assortative matching it would be negative. If there is little systematic sorting of students to teachers (based on teacher quality and student observable characteristics), the coefficient on pre-sample teacher value-added will be zero. I present the results in the top panel of Table 6. The point estimates are very small, and one out of 12 is statistically significant at the 5 percent level. Given that one would expect 0.6 out of 12 estimates to be statistically significant at the 5 percent level due to sampling variation alone, this is compelling evidence of no sorting bias. Because these regressions only include school and year fixed effects, the results suggest that there is little bias due to sorting into tracks. As such, any difference in results with and without track fixed effects is likely due to track-specific treatments rather than sorting bias.
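The logic of this sorting test can be sketched as a bivariate regression. The version below is stylized, with simulated data; it omits the school and year fixed effects and the empirical Bayes step used in the paper, and all variable names and parameter values are invented:

```python
import numpy as np

def sorting_slope(teacher_va: np.ndarray, predicted: np.ndarray) -> float:
    """OLS slope from regressing pre-sample teacher value-added on students'
    predicted outcomes; near zero if assignment is unrelated to observables."""
    x = predicted - predicted.mean()
    y = teacher_va - teacher_va.mean()
    return float((x @ y) / (x @ x))

rng = np.random.default_rng(3)
va = rng.normal(0.0, 1.0, 5000)           # value-added of each student's teacher
pred_random = rng.normal(0.0, 1.0, 5000)  # predicted outcomes, independent of va
pred_sorted = 0.5 * va + rng.normal(0.0, 1.0, 5000)  # positive assortative matching

slope_random = sorting_slope(va, pred_random)  # close to zero: no sorting
slope_sorted = sorting_slope(va, pred_sorted)  # clearly positive
```

Under random assignment the slope is statistically indistinguishable from zero, while systematic matching of stronger students to stronger teachers produces a clearly signed coefficient.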
Footnote 9: As discussed in Jackson (2010), this is a more efficient and straightforward test of the hypothesis that there may be meaningful selection on observables than estimating the effect of the treatment on each individual covariate. This is because (a) the predicted outcomes are a weighted average of all the observed covariates, where each covariate is weighted by its importance in determining the outcome; (b) with several covariates, selection on individual covariates may work in different directions, making interpretation difficult; and (c) with multiple covariates, some point estimates may be statistically significantly different from zero by random chance.

IV.2 Test for Selection or Sorting Within Tracks

While the results thus far suggest that there is little selection into tracks, the main specification is based on variation within tracks. As such, it is important to show that there is no selection to teachers within tracks. I implement this additional test by testing whether there is a relationship between teacher quality (estimated out of sample) and predicted student outcomes (based on observable pre-exposure covariates) within tracks. This is accomplished by

augmenting the test described in Section IV.1 to include track-by-school fixed effects rather than only school fixed effects. These estimates are presented in the lower panel of Table 6. As before, the point estimates are very small, and none of the coefficients is statistically significant at the 5 percent level. Two of the estimates are significant at the 10 percent level, which is consistent with sampling variation. In any case, none of the point estimates is close to being economically significant. These results suggest that the estimated teacher effects based on the within-school-track variation are likely valid, because there is little evidence of selection to teachers within tracks.

V Results

I present the estimated variance of the true persistent teacher effects for each outcome, based on correlations across teachers' classrooms over time, in Table 7. Because a variance cannot be negative, negative variance estimates are suppressed but are indicated in the table with the associated p-value.
To show how the estimated variances change as one accounts for various sources of bias, I present estimates from six models:
(1) a parsimonious model with only school fixed effects, student covariates, and lagged student achievement;
(2) model (1) plus the second lag of student achievement;
(3) model (2) plus the mean covariates, lagged achievement, and second lag of achievement of the other students in the class;
(4) a model with track-by-school fixed effects, student covariates, lagged student achievement, and the second lag of student achievement;
(5) model (4) plus the mean covariates, lagged achievement, and second lag of achievement of the other students in the class; and
(6) model (5) plus school-by-year fixed effects.

In the basic model that is typically estimated, with one lag of achievement, student covariates, and school fixed effects, the estimated variance of math (algebra) teacher effects is 0.446 standard deviations (in student achievement units). Adding controls for the second lag of

achievement reduces the estimate by 15 percent to 0.378σ. This is consistent with Rothstein (2010), who suggests that the second lag of achievement is an important variable to condition on when estimating teacher quality effects. While this estimate may seem large, it is on the same order of magnitude as the math teacher effects found in Aaronson, Barrow and Sander (2007), who estimate that the variance of 9th grade math teacher effects is between 0.13σ and 0.39σ depending on the specification. Adding controls for the mean characteristics of classroom and school peers has negligible effects on the estimated variance of algebra teacher quality. Accounting for track-by-school fixed effects, however, leads to a large reduction in the estimated variance of teacher effects. Specifically, the estimated variance of algebra teacher effects on algebra test scores falls from 0.378σ to 0.146σ, a reduction of about 60 percent, with the inclusion of track-by-school fixed effects. Including peer characteristics and school-by-year effects has little effect on the estimated variance of math teacher quality on algebra test scores, such that the full model yields an estimate of 0.1534 standard deviations. The p-value associated with the null hypothesis that these are not real persistent effects is 0.003, suggesting that there is a real math teacher effect on algebra test scores that persists across classrooms. Looking to other outcomes, a general pattern emerges: math teacher effects appear to be sizable and statistically significant in models that do not account for track-by-school effects, but are much smaller and fail to be statistically significant conditional on track-by-school fixed effects. Specifically, the parsimonious model (1) yields statistically significant and economically meaningful math teacher effects on attendance, the number of disciplinary offences, and staying in school for 10th grade.
Specifically, moving from an algebra teacher at the median of the quality distribution to one at the 85th percentile would increase attendance by 4.8 days (≈0.17σ), reduce the number of offences by 0.52 (≈0.3σ), and increase the likelihood of remaining in school until 10th grade by 0.08 percentage points (≈0.18σ). 10 However, in models that account for track-by-school effects, the estimated effects on attendance, offences, and being in 10th grade fall by over 70 percent, and one cannot reject the null hypothesis of no effect on these non-test-score outcomes at the 5 percent level. However, the effect on staying in school for 10th grade is 0.06σ (p-value = 0.11), so there may be an imprecisely estimated math teacher effect on this outcome.

Footnote 10: These effects on remaining in school through 10th grade from the parsimonious model are in line with the estimated dropout effects from Koedel (2008).
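Median-to-85th-percentile comparisons of this kind can be reproduced from the estimated standard deviation of teacher effects: under a normal distribution of teacher quality, the 85th percentile lies about 1.04 standard deviations above the median. The sketch below makes the conversion explicit; the normality assumption and the 4.6-day figure (backed out from the reported 4.8-day attendance effect) are mine, for illustration only:

```python
from statistics import NormalDist

def median_to_85th(teacher_effect_sd: float) -> float:
    """Gain from swapping a median teacher for an 85th-percentile teacher,
    assuming teacher quality is normally distributed."""
    z_85 = NormalDist().inv_cdf(0.85)  # about 1.04 standard deviations
    return z_85 * teacher_effect_sd

# If the SD of math teacher effects on attendance were about 4.6 days,
# the median-to-85th-percentile move is worth roughly 4.8 days.
attendance_gain = median_to_85th(4.6)
```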

Looking to English teachers tells a rather different story. Specifically, in the basic model (model 1) the estimated effect of English teachers on English test scores is 0.212 standard deviations; however, the test that this reflects a real persistent teacher effect fails to reject at the 5 percent level. In models that account for track-by-school effects, the estimated covariance is negative, small, and not statistically significantly different from zero. This suggests that, conditional on track-by-school effects, English teachers have no persistent effect on English exams. However, unlike algebra teachers, English teachers do have effects on non-test-score outcomes. Specifically, in the full model (model 6), moving from an English teacher at the median of the quality distribution to one at the 85th percentile would increase attendance by 4.6 days (≈0.17σ), reduce the number of offences by 0.24 (≈0.15σ), and increase the likelihood of staying in school for 10th grade by 2.7 percentage points (≈0.06σ). One can reject the null hypothesis of no persistent teacher effect for these three outcomes at the 5 percent level. As a check on the estimated specifications, one can look at the effect of math teachers on English performance and vice versa. For all specifications, the estimated covariance of math teacher effects on English test scores is small, negative, and not statistically significantly different from zero, suggesting that there is no effect of algebra teachers on English performance. Similarly, for specifications that do not account for track-by-school effects, the estimated covariance of English teacher effects on math test scores is small and not statistically significantly different from zero.
In models that include school-by-track effects but do not also include school-by-year effects, the estimated covariances are negative, suggesting no effect, but the p-values on these negative covariances are less than 0.05, suggesting some strange mean reversion in the data. In any case, the full model with school-by-track effects and school-by-year effects yields a small covariance that is negative and not statistically significantly different from zero. This is a useful test for bias due to tracking, and it confirms the previous tests indicating that there is little bias due to sorting.

V.1 Is this reduction in the variance of teacher effects due to over-controlling?

Given that researchers typically find meaningful teacher effects in all subjects, readers may wonder whether part of the reason that there are no test-score effects for English is that there is little variation in teacher quality within school-tracks. There are a few reasons why this is unlikely