
WORKING PAPER #31

An Evaluation of Empirical Bayes Estimation of Value-Added Teacher Performance Measures

Cassandra M. Guarino, Indiana University
Michelle Maxfield, Michigan State University
Mark D. Reckase, Michigan State University
Paul Thompson, Michigan State University
Jeffrey M. Wooldridge, Michigan State University

December 12, 2012

The content of this paper does not necessarily reflect the views of The Education Policy Center or Michigan State University.

An Evaluation of Empirical Bayes Estimation of Value-Added Teacher Performance Measures

Cassandra M. Guarino
Michelle Maxfield
Mark D. Reckase
Paul Thompson
Jeffrey M. Wooldridge

December 12, 2012

Abstract: Empirical Bayes (EB) estimation is a widely used procedure to calculate teacher value-added. It is primarily viewed as a way to make imprecise estimates more reliable. In this paper we review the theory of EB estimation and use simulated data to study its ability to properly rank teachers. We compare the performance of EB estimators with that of other widely used value-added estimators under different teacher assignment scenarios. We find that, although EB estimators generally perform well under random assignment of teachers to classrooms, their performance generally suffers under non-random teacher assignment. Under nonrandom assignment, estimators that explicitly (if imperfectly) control for the teacher assignment mechanism perform the best out of all the estimators we examine. We also find that shrinking the estimates, as in EB estimation, does not itself substantially boost performance.

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305D and R305B to Michigan State University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

1 Introduction

Empirical Bayes (EB) estimation of teacher effects has gained recent popularity in the value-added research community and has been applied in numerous recent studies (see, for example, McCaffrey et al., 2004; Kane and Staiger, 2008; Chetty et al., 2011; Corcoran et al., 2011; and Jacob and Lefgren, 2005 and 2008). Researchers motivate the use of EB estimation as a way to decrease classification error of teachers, especially when there are limited data from which to compute value-added estimates. When minimal data are available for teachers, either due to small class sizes or few years of data, teacher value-added estimates can be very noisy. EB estimates of teacher value-added reduce the variability of the estimates by shrinking them toward the average estimated teacher effect in the sample (hence, EB estimators are often referred to as "shrinkage estimators"). As the degree of shrinkage depends on class size, estimates for teachers with smaller class sizes are more affected, potentially mitigating the misclassification of these teachers. Another possible benefit of EB estimation is that it can be less computationally demanding than methods that view the teacher effects as fixed parameters to estimate. Despite the potential shrinkage benefits of EB estimation, EB estimates of teacher effects can suffer from severe bias under nonrandom teacher assignment. By treating the teacher effects as random, EB estimation requires that teacher assignment is uncorrelated with any factors that affect student achievement, including, importantly, observed factors such as past test scores. If this key assumption does not hold, the EB estimators are inconsistent: even as we get more students per teacher, the estimates do not converge to the true teacher effects. Other estimators that directly control for observed factors such as past test scores while estimating the teacher effects are generally more robust in terms of consistency.
An important question is: In typical empirical settings where VAMs are estimated, how does the EB approach compare with other VAM estimators? In this paper we address the following research questions: (1) How much do EB estimates of the teacher effects vary compared with those estimates obtained from other standard estimation techniques in both real and simulated data? (2) How does the performance of EB estimators compare with that of estimators that treat the teacher effect as fixed when moving from a random teacher assignment scenario to various nonrandom assignment scenarios? (3) Are there cases where it is beneficial to use an EB-type approach to shrink estimates of teacher fixed effects despite the lack of theoretical motivation for such a procedure? The remainder of this paper proceeds as follows. In Section 2, we derive the EB estimator and discuss the underlying identifying assumptions. In Section 3,

we introduce the alternative estimators that will be used for comparison in this study. In Section 4, we apply EB and other value-added estimation techniques to real data to determine the degree to which the estimation method changes the teacher value-added estimates. Section 5 describes our simulation design as well as the evaluation measures that will be used to assess the performance of the estimators. The main simulation results are presented in Section 6, and Section 7 summarizes and concludes.

2 Empirical Bayes Estimation

There are several ways to derive empirical Bayes estimators of teacher value-added. We begin with a so-called mixed estimation (ME) approach, as in Ballou, Sanders, and Wright (2004) (BSW), because it is relatively straightforward and does not require delving into the particulars of Bayesian estimation methods. Our focus here is on estimating teacher effects grade by grade. Therefore, we assume either that we have a single cross section or multiple cohorts of students for each teacher. We do not include cohort effects in what follows, so multiple cohorts are allowed by pooling students across cohorts for each teacher. Let $y_i$ denote a measure of achievement for student $i$, randomly drawn from the population. This measure could be a test score or a gain score. Suppose there are $G$ teachers and the teacher effects are $b_g$, $g = 1, \ldots, G$. In the mixed effects (and EB) setting, these are treated as random variables as opposed to fixed population parameters. Viewing the $b_g$ as random variables independent of other observable factors affecting test scores has consequences for the properties of EB estimators. Typically VAMs are estimated controlling for other factors, which we denote by a row vector $x_i$. These factors generally include student demographics and, in some instances, prior test scores. We assume the coefficients on these covariates are fixed parameters.
We can write a mixed effects linear model relating the student achievement outcome $y_i$ to teacher effects and the controls as

$$y_i = x_i \gamma + z_i b + u_i, \quad (1)$$

where $z_i$ is a $1 \times G$ row vector of teacher assignment dummies, $b$ is the $G \times 1$ vector of teacher effects, and $u_i$ contains the unobserved student-specific effects. Because a student is assigned to one and only one teacher, $z_{i1} + z_{i2} + \cdots + z_{iG} = 1$. Equation (1) is an example of a mixed model because it includes the usual fixed population parameters $\gamma$ and the random coefficients $b$. Even if there are no covariates, $x_i$ typically includes an intercept. If $x_i \gamma$ is only a constant, so $x_i \gamma = \gamma$, then $\gamma$ is the

average teacher effect and we can then take $E(b) = 0$. This means that $b_g$ is the effect of teacher $g$ net of the overall mean teacher effect. Equation (1) is written for a particular student $i$ so that teacher assignment is determined by the vector $z_i$. A standard assumption is that, conditional on $b$, (1) represents a linear conditional mean:

$$E(y_i \mid x_i, z_i, b) = x_i \gamma + z_i b, \quad (2)$$

which follows from

$$E(u_i \mid x_i, z_i, b) = 0. \quad (3)$$

An implication of (3) is that each $u_i$ is uncorrelated with $b$, which means that unobserved student characteristics are not related to the quality of the teachers. A more important restriction is that $u_i$ is uncorrelated with $z_i$, so that teacher assignment is not systematically related to unobserved student characteristics. This assumption is less of a concern when we have good controls in $x_i$ (because then there is less left over in $u_i$). If we assume a sample of $N$ students assigned to $G$ teachers we can write (1) in matrix notation as

$$y = X\gamma + Zb + u, \quad (4)$$

where $y$ and $u$ are $N \times 1$, $X$ is $N \times K$, and $Z$ is $N \times G$. In order to obtain the best linear unbiased estimator (BLUE) of $\gamma$ and the best linear unbiased predictor (BLUP) of $b$, we need to assume that the covariates and teacher assignments satisfy a strict exogeneity assumption:

$$E(u_i \mid X, Z, b) = 0, \quad i = 1, \ldots, N. \quad (5)$$

An implication of assumption (5) is that the inputs and teacher assignments of other students do not affect the outcome of student $i$. Given assumption (5) we can write the conditional expectation of $y$ as

$$E(y \mid X, Z, b) = X\gamma + Zb. \quad (6)$$

In the EB literature a standard assumption is

$$b \text{ is independent of } (X, Z), \quad (7)$$

in which case

$$E(y \mid X, Z) = X\gamma + Z\,E(b \mid X, Z) = X\gamma = E(y \mid X) \quad (8)$$

because $E(b \mid X, Z) = E(b) = 0$. Assumption (7) has the implication that student $i$'s teacher assignment does not depend on the quality of the teacher (as measured by the $b_g$). From an econometric perspective, equation (8) means that $\gamma$ can be estimated in an unbiased way by OLS regression of

$$y_i \text{ on } x_i, \quad i = 1, \ldots, N. \quad (9)$$

Consequently, we can estimate the effects of the covariates $x_i$ by omitting the teacher assignment dummies. From a more traditional econometric perspective, this means we are assuming teacher assignment is uncorrelated with the covariates $x_i$. Under (5) and (7), the OLS estimator of $\gamma$ is unbiased and consistent, but it is inefficient if we impose the standard classical linear model assumptions on $u$. In particular, if

$$Var(u \mid X, Z, b) = Var(u) = \sigma_u^2 I_N \quad (10)$$

then

$$Var(y \mid X, Z) = E[(Zb + u)(Zb + u)' \mid X, Z] = Z\,Var(b)\,Z' + Var(u) = \sigma_b^2 ZZ' + \sigma_u^2 I_N,$$

where we also add the standard assumption

$$Var(b) = \sigma_b^2 I_G, \quad (11)$$

and $\sigma_b^2$ is the variance of the teacher effects, $b_g$. Under the assumption that $\sigma_b^2$ and $\sigma_u^2$ are known (actually, it suffices to know their ratio), the BLUE of $\gamma$ under the preceding assumptions is the GLS estimator,

$$\check{\gamma} = [X'(\sigma_b^2 ZZ' + \sigma_u^2 I_N)^{-1} X]^{-1} X'(\sigma_b^2 ZZ' + \sigma_u^2 I_N)^{-1} y. \quad (12)$$

The $N \times N$ matrix $ZZ'$ is a block diagonal matrix with $G$ blocks of the form

$$\mathbf{1}_{N_g}\mathbf{1}_{N_g}' \quad (\text{an } N_g \times N_g \text{ matrix of ones}),$$

where block $g$ is $N_g \times N_g$ and $N_g$ is the number of students taught by teacher $g$. The GLS estimator $\check{\gamma}$ is the well-known random effects (RE) estimator popular from panel data and cluster sample analysis. Notice that the random effects in this case are teacher effects, not individual-specific student effects. (Remember, we only have one observation per student.) Before we discuss $\check{\gamma}$ further, as well as estimation of $b$, it is helpful to write down the mixed effects model in perhaps a more common form. After students have been designated to classrooms, we can write $y_{gi}$ as the outcome for student $i$ in class $g$, and similarly for $x_{gi}$ and $u_{gi}$. Then, for classroom $g$, we have

$$y_{gi} = x_{gi}\gamma + b_g + u_{gi} \equiv x_{gi}\gamma + r_{gi}, \quad i = 1, \ldots, N_g, \quad (13)$$

where $r_{gi} \equiv b_g + u_{gi}$ is the composite error term. Equation (13) makes it easy to see that the BLUE of $\gamma$ is the random effects estimator under the previous assumptions. It also highlights the assumption that $b_g$ is assumed to be independent of the covariates $x_{gi}$, and the assumption $E(u_{gi} \mid X_g, b_g) = 0$ rules out covariates from student $h$ affecting the outcome of student $i$. Of course, from this equation we can also see that OLS pooled across $i$ and $g$ is unbiased for $\gamma$ because we are assuming $E(b_g \mid X_g) = 0$. As shown in, say, BSW, the BLUP of $b$ under assumptions (5), (7), and (10) is

$$\check{b} = (Z'Z + \rho I_G)^{-1} Z'(y - X\check{\gamma}) \equiv (Z'Z + \rho I_G)^{-1} Z'\check{r}, \quad (14)$$

where $\rho = \sigma_u^2/\sigma_b^2$ and $\check{r} = y - X\check{\gamma}$ is the vector of residuals. Straightforward algebra shows that

$$(Z'Z + \rho I_G)^{-1} = \begin{pmatrix} (N_1 + \rho)^{-1} & & & 0 \\ & (N_2 + \rho)^{-1} & & \\ & & \ddots & \\ 0 & & & (N_G + \rho)^{-1} \end{pmatrix}$$

and

$$Z'\check{r} = \begin{pmatrix} \sum_{i=1}^{N_1} \check{r}_{1i} \\ \sum_{i=1}^{N_2} \check{r}_{2i} \\ \vdots \\ \sum_{i=1}^{N_G} \check{r}_{Gi} \end{pmatrix}.$$

Therefore, we can write

$$\check{b}_g = (N_g + \rho)^{-1} \sum_{i=1}^{N_g} \check{r}_{gi} = \left(\frac{N_g}{N_g + \rho}\right)\bar{\check{r}}_g = \left(\frac{\sigma_b^2}{\sigma_b^2 + (\sigma_u^2/N_g)}\right)\bar{\check{r}}_g = \left(\frac{\sigma_b^2}{\sigma_b^2 + (\sigma_u^2/N_g)}\right)(\bar{y}_g - \bar{x}_g\check{\gamma}), \quad (15)$$

where

$$\bar{\check{r}}_g = N_g^{-1}\sum_{i=1}^{N_g}\check{r}_{gi} = \bar{y}_g - \bar{x}_g\check{\gamma} \quad (16)$$

is the average of the residuals $\check{r}_{gi} = y_{gi} - x_{gi}\check{\gamma}$ within classroom $g$. To operationalize $\check{\gamma}$ and $\check{b}_g$ we must replace $\sigma_b^2$ and $\sigma_u^2$ with estimates. There are different ways to obtain estimates depending on whether one uses OLS residuals after an initial estimation or a joint estimation method. With the composite error defined as $r_{gi} = b_g + u_{gi}$ we can write

$$\sigma_r^2 = \sigma_b^2 + \sigma_u^2.$$

An estimator of $\sigma_r^2$ can be obtained from the usual sum of squared residuals from the OLS regression

$$y_{gi} \text{ on } x_{gi}, \quad i = 1, \ldots, N_g,\; g = 1, \ldots, G. \quad (17)$$

Call the residuals $\hat{r}_{gi}$. Then a consistent estimator is

$$\hat{\sigma}_r^2 = \frac{1}{(N-K)}\sum_{g=1}^{G}\sum_{i=1}^{N_g}\hat{r}_{gi}^2, \quad (18)$$

which is just the usual degrees-of-freedom (df) adjusted error variance estimator from OLS.
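As a concrete illustration, the shrinkage in (15) can be computed directly once the variance components are in hand. The sketch below uses assumed values for $\sigma_b^2$, $\sigma_u^2$, and the class sizes, and simulated stand-ins for the classroom mean residuals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed variance components and unequal class sizes (illustrative).
sigma2_b, sigma2_u = 0.25, 1.0
Ng = np.array([5, 12, 25, 60])            # students per teacher
rho = sigma2_u / sigma2_b

# Average within-classroom residuals r_bar_g: simulated stand-ins here,
# drawn with the variance implied by the composite-error model.
r_bar = rng.normal(scale=np.sqrt(sigma2_b + sigma2_u / Ng))

# Equation (15): shrink each classroom mean residual toward zero.
# Note sigma2_b / (sigma2_b + sigma2_u/Ng) equals Ng / (Ng + rho).
shrink = sigma2_b / (sigma2_b + sigma2_u / Ng)
b_check = shrink * r_bar

# Smaller classes are shrunk more aggressively (shrink is increasing in Ng).
assert np.all(np.diff(shrink) > 0)
```

The factor is strictly between 0 and 1, so every classroom mean is pulled toward zero, with small classes pulled hardest.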

To estimate $\sigma_u^2$, write $r_{gi} - \bar{r}_g = u_{gi} - \bar{u}_g$, where $\bar{r}_g$ is the within-teacher average, and similarly for $\bar{u}_g$. A standard result on demeaning a set of uncorrelated random variables with the same variance gives $Var(u_{gi} - \bar{u}_g) = \sigma_u^2(1 - N_g^{-1})$ and so, for each $g$,

$$E\left[\sum_{i=1}^{N_g}(r_{gi} - \bar{r}_g)^2\right] = \sigma_u^2(N_g - 1).$$

When we sum across teachers it follows that

$$\frac{1}{(N-G)}\sum_{g=1}^{G}\sum_{i=1}^{N_g}(r_{gi} - \bar{r}_g)^2 \quad (19)$$

has expected value $\sigma_u^2$, where $N = \sum_{g=1}^{G} N_g$. To turn (19) into an estimator we can replace the $r_{gi}$ with the OLS residuals $\hat{r}_{gi}$ from the regression in (17), as before. The estimator based on the OLS residuals is

$$\hat{\sigma}_u^2 = \frac{1}{(N-G)}\sum_{g=1}^{G}\sum_{i=1}^{N_g}(\hat{r}_{gi} - \bar{\hat{r}}_g)^2. \quad (20)$$

With fixed class sizes and $G$ getting large, the estimator that uses $N$ in place of $N - G$ is not consistent. Therefore, we prefer the estimator in equation (20), as it should have less bias in applications where $G/N$ is not small. With many students per teacher the difference should be minor. We could also use $N - G - K$ as a further df adjustment, but subtracting off $K$ does not affect the consistency. Given $\hat{\sigma}_r^2$ and $\hat{\sigma}_u^2$ we can estimate $\sigma_b^2$ as

$$\hat{\sigma}_b^2 = \hat{\sigma}_r^2 - \hat{\sigma}_u^2. \quad (21)$$

In any particular data set (especially if the data have been generated so as not to satisfy the assumptions we imposed to derive the GLS estimate of $\gamma$ and the BLUP estimates of the $b_g$) there is no guarantee that expression (21) is nonnegative. A simple solution to this problem (and one used in software packages that have random effects estimation commands, such as Stata) is to set $\hat{\sigma}_b^2 = 0$ whenever $\hat{\sigma}_r^2 < \hat{\sigma}_u^2$. In order to ensure this happens infrequently with multiple cohorts, we compute $\hat{\sigma}_u^2$ by replacing $\bar{\hat{r}}_g$ with the average obtained for the particular cohort. This ensures

that, for a given cohort, the terms $\sum_{i=1}^{N_g}(\hat{r}_{gi} - \bar{\hat{r}}_g)^2$ are as small as possible. In theory, if there are no cohort effects we could use an overall mean for $\bar{\hat{r}}_g$. But using cohort-specific means reduces the problem of negative $\hat{\sigma}_b^2$ when the model is misspecified. An alternative approach is to essentially estimate $\sigma_b^2$ and $\sigma_u^2$ jointly along with $\gamma$, using software that ensures nonnegativity of the variance estimates. By far the most common approach to doing so is to assume joint normality of the teacher effects, $b_g$, and the student effects, $u_{gi}$, across all $g$ and $i$, along with the previous assumptions. One important point is that the resulting estimators are consistent even without the normality assumption; so, technically, we can think of them as quasi-maximum likelihood estimators. The maximum likelihood estimator of $\sigma_u^2$ has the same form as in equation (20), except the residuals are based on the MLE of $\gamma$ rather than the OLS estimator. A similar comment holds for the MLE of $\sigma_b^2$ (if we do not constrain it to be nonnegative). See, for example, Hsiao (2003, Section 3.3.3). Unlike the GLS estimator of $\gamma$, the FGLS estimator is no longer unbiased (even under assumptions (5) and (7)), and so we must rely on asymptotic theory. In the current context, the estimator is consistent and asymptotically normal provided $G \to \infty$ with $N_g$ fixed. In practice, this means that the number of teachers, $G$, should be substantially larger than the number of students per teacher, $N_g$. Typically this is the case in VAM studies, which are applied to large school districts or even to entire states and therefore include many teachers. Often the number of students per teacher is fewer than 100, whereas we might have several hundred if not several thousand teachers. When $\check{\gamma}$ is replaced with the FGLS estimator and the variances $\sigma_b^2$ and $\sigma_u^2$ are replaced with estimators, the EB estimator is no longer a BLUP. (For one, the FGLS estimator of $\gamma$ is not even unbiased.)
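The residual-based variance-component estimators in (18), (20), and (21), together with the nonnegativity floor discussed above, can be sketched as follows. The data-generating values here are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated composite residuals r_gi = b_g + u_gi for G teachers with
# varying (assumed) class sizes.
G = 200
Ng = rng.integers(15, 30, size=G)
teacher = np.repeat(np.arange(G), Ng)
N = teacher.size
sigma2_b, sigma2_u = 0.3, 1.0
r = rng.normal(scale=np.sqrt(sigma2_b), size=G)[teacher] \
    + rng.normal(scale=np.sqrt(sigma2_u), size=N)

K = 1  # nominal parameter count of the first-stage regression (assumed)
# Equation (18): overall residual variance with a df adjustment.
sigma2_r_hat = (r ** 2).sum() / (N - K)

# Equation (20): within-teacher variance from demeaned residuals.
r_bar = np.bincount(teacher, weights=r) / np.bincount(teacher)
within = r - r_bar[teacher]
sigma2_u_hat = (within ** 2).sum() / (N - G)

# Equation (21): teacher-effect variance by differencing, floored at 0
# exactly as random-effects software does when the difference is negative.
sigma2_b_hat = max(sigma2_r_hat - sigma2_u_hat, 0.0)
```

With these sample sizes, `sigma2_u_hat` and `sigma2_b_hat` land close to the generating values of 1.0 and 0.3.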
Nevertheless, we use the same formula as in (15) for operationalizing the BLUPs. Conveniently, certain statistical packages, such as Stata with its xtmixed command, allow one to recover the operationalized BLUPs after maximum likelihood estimation. When we use the (quasi-) MLEs to obtain the $\check{b}_g$ we obtain what are typically called the empirical Bayes estimates. One way to understand the shrinkage nature of $\check{b}_g$ is to compare it with the estimator obtained by treating the teacher effects as fixed parameters. Let $\hat{\gamma}$ and $\hat{\beta}$ be the OLS estimators from the regression

$$y \text{ on } X, Z. \quad (22)$$

Then $\hat{\gamma}$ is the so-called fixed effects (FE) estimator obtained by a regression of $y_i$ on the controls in $x_i$ and the teacher assignment dummies $z_i$. In the context of the model

$$y = X\gamma + Z\beta + u \quad (23)$$
$$E(u \mid X, Z) = 0$$
$$Var(u \mid X, Z) = \sigma_u^2 I_N,$$

$\hat{\gamma}$ is the BLUE of $\gamma$ and $\hat{\beta}$ (the FE estimates of the teacher effects) is the BLUE of $\beta$. It is well known that $\hat{\gamma}$ can be obtained by an OLS regression where $y_{gi}$ and $x_{gi}$ have been deviated from within-teacher averages (see, for example, Wooldridge, 2010, Chapter 10). Further, the teacher fixed effects estimates can be obtained as

$$\hat{\beta}_g = \bar{y}_g - \bar{x}_g\hat{\gamma}. \quad (24)$$

Equation (24) can make computation of the teacher VAMs fairly efficient if one does not want to run the long regression in (22). By comparing equations (15) and (24) we see that $\check{b}_g$ differs from $\hat{\beta}_g$ in two ways. First, the RE estimator $\check{\gamma}$ is used in computing $\check{b}_g$ while $\hat{\beta}_g$ uses the FE estimator of $\gamma$. Second, $\check{b}_g$ shrinks the average of the residuals toward zero by the factor

$$\frac{\sigma_b^2}{\sigma_b^2 + (\sigma_u^2/N_g)} = \frac{1}{1 + (\rho/N_g)}, \quad (25)$$

where

$$\rho = \sigma_u^2/\sigma_b^2. \quad (26)$$

Equation (25) illustrates the well-known result that the smaller the number of students for teacher $g$, $N_g$, the more the average residual is shrunk toward zero. An important point is that, unlike the RE estimator of $\gamma$, the FE estimator allows teacher assignment to be arbitrarily correlated with the covariates $x_i$: we make no assumption about the relationship between $Z$ and $X$ in (23). Using asymptotic theory on estimating $\gamma$ that fixes the $N_g$ and lets $G$ (the number of teachers) get large, $\hat{\gamma}$ is consistent for $\gamma$ allowing any kind of correlation between $z_i$ and $x_i$. As mentioned earlier, asymptotic theory for fixed $N_g$ with $G$ growing is relevant in many applications because the number of students per teacher tends to be relatively small. Nevertheless, recent work by Hansen (2007) shows that $\hat{\gamma}$ has good large sample properties when $G$ is roughly the same magnitude as the $N_g$.¹ Such a scenario is

¹ In simulations, Hansen shows that the asymptotic properties work well when $G$ and $N_g$ are roughly around 40.

relevant especially if we are studying middle school or high school teachers, who may have a hundred or more students. If these are, say, math teachers in a certain grade we may only have a couple of hundred teachers. Hansen's results effectively justify the standard kinds of inference applied to $\hat{\gamma}$ (including clustering by classroom) provided the outcomes are independent across classrooms. The bottom line is that the FE estimator of $\gamma$ can be expected to have good properties provided the number of teachers is not small relative to the number of students. Of course the asymptotic theory does not provide direct evidence on how the estimators behave for particular sample sizes, or how estimation of $\gamma$ affects estimation of the teacher effects. In Sections 5 and 6 we describe the simulation approach and findings that shed light on the finite-sample properties of the estimated teacher effects. A well-known algebraic result (see, for example, Wooldridge, 2010, Chapter 10) that holds for any given number of teachers $G$ is that $\check{\gamma} \to \hat{\gamma}$ as $\rho \to 0$ or $N_g \to \infty$. In fact, the RE estimator of $\gamma$ can be obtained from the pooled OLS regression

$$y_{gi} - \theta_g \bar{y}_g \text{ on } x_{gi} - \theta_g \bar{x}_g, \quad (27)$$

where

$$\theta_g = 1 - \left(\frac{\sigma_u^2}{\sigma_u^2 + N_g \sigma_b^2}\right)^{1/2} = 1 - \left(\frac{1}{1 + (N_g/\rho)}\right)^{1/2}. \quad (28)$$

It is easily seen that $\theta_g \to 1$ as $\rho \to 0$ or $N_g \to \infty$. In other words, with many students per teacher or large teacher effects relative to student effects, the RE and FE estimates can be very close. But they are never identical. Not coincidentally, the shrinkage factor in equation (25) tends to unity as $\rho \to 0$ or $N_g \to \infty$. The bottom line is that with a large number of students per teacher the shrinkage estimates of the teacher effects can be close to the fixed effects estimates. The RE and FE estimates also tend to be similar when $\sigma_u^2$ (the student effect) is small relative to $\sigma_b^2$ (the teacher effect), but this scenario seems unlikely.
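A quick numerical check of the quasi-demeaning parameter in (28), using assumed variance values, shows the behavior just described:

```python
import numpy as np

# theta_g from equation (28); the default variances are assumptions
# chosen for illustration (rho = sigma2_u / sigma2_b = 4 here).
def theta(Ng, sigma2_u=1.0, sigma2_b=0.25):
    return 1.0 - np.sqrt(sigma2_u / (sigma2_u + Ng * sigma2_b))

Ng = np.array([5, 20, 100, 1000])
th = theta(Ng)
print(th)  # increases toward 1 as Ng grows: RE approaches FE

# theta_g also tends to 1 as rho -> 0, i.e., when the teacher-effect
# variance dominates the student-level noise.
assert np.all(np.diff(th) > 0)
```

At `theta_g = 0` the RE regression collapses to pooled OLS, and at `theta_g = 1` it coincides with the within (FE) regression, which is why RE sits between the two.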
An important point that appears to go unnoticed in applying the shrinkage approach is that in situations where $\check{\gamma}$ and $\hat{\gamma}$ substantively differ, $\check{\gamma}$ suffers from systematic bias because it assumes teacher assignment is uncorrelated with $x_i$. Because $\check{\gamma}$ is used in constructing the $\check{b}_g$ in equation (15), the bias in $\check{\gamma}$ generally results in biased teacher effects, and the teacher effects would be biased even if (15)

did not employ a shrinkage factor. The shrinkage likely exacerbates the problem: the estimates are being shrunk toward values that are systematically biased for the teacher effects. Without covariates, the difference between the EB and fixed effects estimates of the $b_g$ is much less important: they differ only due to the shrinkage factor. In practice, the fixed effects estimates, $\hat{\beta}_g$, are obtained without removing an overall teacher average, which means $\hat{\beta}_g = \bar{y}_g$. (This is the same as regressing the test score outcomes on the full set of teacher dummies, without a constant.) To obtain a comparable expression for $\check{b}_g$ we must account for the GLS estimator of the mean teacher effect, which would be obtained as the intercept (the only parameter) in the RE estimation. Call this estimator $\check{\mu}_b$, which in the case of no covariates is $\check{\gamma}$. Then the teacher effects are

$$\check{b}_g = \check{\mu}_b + \eta_g(\bar{y}_g - \check{\mu}_b) = \eta_g \bar{y}_g + (1 - \eta_g)\check{\mu}_b = \bar{y}_g - (1 - \eta_g)(\bar{y}_g - \check{\mu}_b), \quad (29)$$

where $\eta_g$ is the shrinkage factor in equation (25). Compared with the FE estimate of $b_g$, $\check{b}_g$ is shrunk toward the overall mean $\check{\mu}_b$. When the teacher effects are treated as parameters to estimate, the $\check{b}_g$ are biased because of the shrinkage factor, even in the case where they are BLUPs. Returning to the general model with covariates, the expression in equation (15) motivates a common two-step estimator of the teacher effects. In the first step of the procedure, one obtains $\hat{\gamma}$ using the OLS regression in equation (17) and obtains the residuals, $\hat{r}_{gi}$. In the second step, one averages the residuals $\hat{r}_{gi}$ within each teacher (that is, from $i = 1$ to $N_g$) to obtain the teacher effect for teacher $g$. We call this approach the average residual (AR) method. After obtaining the averages of the residuals we can, in a third step, shrink these using the empirical Bayes shrinkage factors in equation (15). Typically the estimates in equations (18) and (20), based on the OLS residuals, are used in obtaining the shrinkage factors.
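The three-step average-residual-plus-shrinkage procedure just described can be sketched end to end. Everything below (sample sizes, variances, the single covariate, random assignment) is an assumed toy setup, not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: G teachers with unequal class sizes, one covariate x.
G = 100
Ng = rng.integers(10, 40, size=G)
teacher = np.repeat(np.arange(G), Ng)
N = teacher.size
x = rng.normal(size=N)
b = rng.normal(scale=0.5, size=G)               # true teacher effects
y = 1.0 + 0.8 * x + b[teacher] + rng.normal(size=N)

# Step 1: OLS of y on x with teacher dummies omitted, as in (17).
X = np.column_stack([np.ones(N), x])
gamma_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ gamma_hat

# Step 2: average the residuals within each teacher (the AR estimates).
counts = np.bincount(teacher)
ar = np.bincount(teacher, weights=resid) / counts

# Step 3: shrink with the EB factors from (15), using (18), (20), (21).
K = X.shape[1]
sigma2_r = (resid ** 2).sum() / (N - K)
within = resid - ar[teacher]
sigma2_u = (within ** 2).sum() / (N - G)
sigma2_b = max(sigma2_r - sigma2_u, 0.0)
sar = sigma2_b / (sigma2_b + sigma2_u / counts) * ar
```

Because assignment is random in this toy setup, omitting the dummies in step 1 is harmless here; the bias discussed above arises when `teacher` is correlated with `x`.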
We call the resulting estimator the shrunken average residual (SAR) method. The SAR approach differs from the EB approach only in how the residuals are obtained (and possibly in the specific estimates of $\sigma_b^2$ and $\sigma_u^2$ used in the shrinkage factors). Consequently SAR, like EB, generally suffers from systematic bias if teacher assignment is correlated with the covariates $x_i$ (whether we shrink the estimates or not). In effect, the AR approach partials $x_i$ out of $y_i$ but does not partial $x_i$ out of $z_i$. If $x_i$ and $z_i$ are correlated it is crucial to partial $x_i$ out of $z_i$ in order to consistently estimate the teacher effects. The so-called fixed effects (where "fixed effects" refers to

the teacher effects) regression in (22) partials $x_i$ out of $z_i$ (and also partials $x_i$ out of $y_i$), which makes it a more reliable estimator under nonrandom teacher assignment. If one is going to assume $x_i$ and $z_i$ are uncorrelated, one might as well use the EB approach. Unless $x_i$ has very large dimension, the computational saving from using OLS rather than feasible GLS (or MLE) to estimate $\gamma$ is minor. As we discussed earlier, the algebraic relationship between RE and FE means that $\check{\gamma}$ tends to be closer to the FE estimator $\hat{\gamma}$ than is the OLS estimator. Consequently, under nonrandom teacher assignment, the estimated teacher effects using the RE estimator of $\gamma$ will tend to have less bias than the estimates that begin with OLS estimation of $\gamma$. Plus, if it turns out that teacher assignment is uncorrelated with the covariates, the OLS estimator of $\gamma$ is inefficient relative to the RE estimator under the standard random effects assumptions (and probably more generally). Before we leave this section we must emphasize once more that the so-called fixed effects estimates of the teacher VAMs allow any correlation between $z_i$ and $x_i$, and so we expect them to be more reliable in situations with nonrandom teacher assignment.

3 Summary of Estimation Methods

In this paper we examine seven different value-added estimators used to recover the teacher effects and apply them to both real and simulated data. Some of the estimators use EB or shrinkage techniques, while others do not. They can all be cast as special cases of the estimators described in the previous section. For clarity, we briefly describe each one.

3.1 Dynamic Methods

Five of the estimators can be obtained from a dynamic equation of the form

$$A_{it} = \lambda A_{i,t-1} + X_{it}\delta + Z_{it}\beta + v_{it}, \quad (30)$$

where $A_{it}$ is achievement (measured by a test score) for student $i$ in grade $t$, $X_{it}$ is a vector of student characteristics, and $Z_{it}$ is the vector of teacher assignment dummies.
Note that this is similar to equation (1), but with the lagged test score written separately from $X_{it}$ for clarity. Also, $X_{it}$ is omitted from the estimation of the teacher effects in the simulation analysis below, as student characteristics aren't included in the data generating process. Recall that the EB estimator above was derived using a single cross-section of students (meaning that a particular student is only in the analysis once). Thus, we use only one grade (fifth grade) for the

analysis. One estimator we use estimates (30) by OLS, where we pool across students and classrooms. We refer to this estimator as dynamic OLS, or DOLS. Notice that DOLS treats the teacher effects as fixed parameters to estimate via the inclusion of a dummy variable for each teacher in the regression. The lagged test score is included to account for the possibility that teacher assignment is related to the past test score. In order to conclude that DOLS is consistent for $\beta$ (and the other parameters) as the number of students per teacher grows, we would need to assume $v_{it}$ is uncorrelated with $A_{i,t-1}$, $X_{it}$, and $Z_{it}$. When (30) is derived from a structural cumulative effects model (CEM), which is common in the educational production function literature, the key condition for consistency does not hold unless teacher assignment is strictly exogenous with respect to past shocks and a certain common factor restriction holds.² However, as shown in the simulations in Guarino, Reckase, and Wooldridge (2012) (GRW), the DOLS estimator often does well for estimating $\beta$, at least for the purposes of ranking and classifying teachers, even when the assumptions underlying the consistency of DOLS fail. This finding can be explained by noting that the inclusion of $A_{i,t-1}$ directly controls, or proxies, for a variety of nonrandom teacher assignment mechanisms. Plus, it is not necessarily important to get a consistent estimator of $\lambda$, $\delta$, or even $\beta$ to get estimated VAMs that do a good job ranking the teachers, which is of primary concern. Shrinking the DOLS estimates using the standard shrinkage formula in (25) could provide more precise estimates than not shrinking, and it may improve the performance of DOLS relative to EB.
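Under an assumed toy data-generating process (random assignment, no $X_{it}$, as in the simulations), DOLS is just OLS of the current score on the lagged score plus a full set of teacher dummies, and shrinking those dummies' coefficients is a one-line afterthought. The variance plug-ins below are crude and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy DOLS setup (assumed values): two class sizes so shrinkage can matter.
G = 50
Ng = np.repeat([10, 30], 25)                 # class size per teacher
teacher = np.repeat(np.arange(G), Ng)
N = teacher.size
Z = np.eye(G)[teacher]                       # teacher dummies
A_lag = rng.normal(size=N)                   # prior test score
beta = rng.normal(scale=0.5, size=G)         # true teacher effects
A = 0.7 * A_lag + Z @ beta + rng.normal(size=N)

# DOLS: pooled OLS of A on the lagged score and all teacher dummies
# (no separate intercept; the dummies span it).
W = np.column_stack([A_lag, Z])
coef, *_ = np.linalg.lstsq(W, A, rcond=None)
dols = coef[1:]                              # teacher fixed effects

# Shrinking the demeaned estimates by the factor in (25) gives the
# SDOLS variant; crude plug-in variance estimates for illustration only.
sigma2_u_hat = ((A - W @ coef) ** 2).sum() / (N - G - 1)
sigma2_b_hat = max(dols.var() - sigma2_u_hat / Ng.mean(), 1e-6)
shrink = sigma2_b_hat / (sigma2_b_hat + sigma2_u_hat / Ng)
sdols = dols.mean() + shrink * (dols - dols.mean())
```

With equal class sizes `shrink` would be a single constant, leaving teacher ranks unchanged; the two class sizes here are what make shrinking consequential for rankings.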
Therefore, the second estimator we consider is a shrunken DOLS (SDOLS) estimator, which takes the DOLS estimated teacher fixed effects and then shrinks them by the shrinkage factor (which can be thought of as a reliability coefficient here) derived in Section 2, using the variance estimates from equations (18), (20), and (21). SDOLS is rarely used in practice, as it is not a true empirical Bayes estimator. We include it as an exploratory exercise in order to better determine the effects of shrinking. When the class sizes are all the same, the SDOLS and DOLS estimates differ only by a constant positive multiple, and so in this case shrinking the DOLS estimates will have no effect in terms of ranking teachers. Our interest is to see whether shrinking can help when the class sizes differ. A third estimator we consider is the average residual (AR) method described in Section 2, where the teacher dummies are omitted from the regression in equation (30). Thus, in the first step we regress $A_{it}$ on its lag, $A_{i,t-1}$, and $X_{it}$ and obtain the residuals, say $\hat{r}_{it}$. Then, we average these residuals by classroom as the estimated

² See Guarino, Reckase, and Wooldridge (2012) for a detailed discussion of the underlying identifying assumptions behind the estimators in this paper.

teacher effect. The fourth estimator, which we call the SAR (for shrunken average residual) estimator, takes the AR estimates and shrinks them by the shrinkage factor and variance estimates described in Section 2. Recall that because we are using OLS rather than GLS in the first-stage regression, shrinking the AR estimates does not result in a true EB estimator, but it is commonly used as a simple way of operationalizing the EB approach (see, for example, Kane and Staiger, 2008). AR and SAR will be fairly similar with large class sizes and will be consistent under the same assumptions (with the number of students per teacher increasing). The finite-sample performance of these estimators will differ only due to the shrinkage factor that is applied to obtain the SAR estimates. It is important to keep in mind that, unlike DOLS and SDOLS, the AR and SAR estimators do not allow for general correlation between the teacher assignment and past test scores (or other student covariates). To compare with the AR approaches, we also include a true EB estimator. This is a dynamic MLE version of the EB estimator (EB LAG) that also treats the teacher effects as random, but uses maximum likelihood in the first stage instead of OLS.³ After obtaining the MLE estimates of the teacher effects in the first stage, the shrinkage factor is applied to obtain the EB MLE estimates.⁴ Given that MLE is being used in the first stage instead of OLS (as in the average residual approach), we would expect the EB LAG estimator to outperform the SAR estimator in most scenarios.

3.2 Gain Score Methods

If we set $\lambda = 1$ in equation (30) we obtain the standard gain score equation:

$$\Delta A_{it} = X_{it}\delta + Z_{it}\beta + v_{it}. \quad (31)$$

We examine three value-added estimators based on this gain score specification. We first examine a pooled OLS estimator, which we call POLS, which includes a dummy variable for each teacher in equation (31).
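With no covariates, the POLS teacher effects from (31) reduce to classroom-average gain scores; a minimal sketch under an assumed toy simulation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy data: gain scores generated as teacher effect plus noise.
G, Ng = 40, 25
teacher = np.repeat(np.arange(G), Ng)
beta = rng.normal(scale=0.5, size=G)
gain = beta[teacher] + rng.normal(size=teacher.size)   # Delta A_it

# POLS with a full set of teacher dummies and no covariates is just
# the within-classroom mean gain for each teacher.
pols = np.bincount(teacher, weights=gain) / np.bincount(teacher)

# Check: identical to the long-regression coefficients on the dummies.
Z = np.eye(G)[teacher]
coef, *_ = np.linalg.lstsq(Z, gain, rcond=None)
assert np.allclose(pols, coef)
```

This equivalence is why, in the no-covariate simulation runs, POLS coincides with an average-residual estimator applied to the gain score.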
We also consider an estimator, SPOLS, that shrinks these estimates using the same shrinkage formula and variance estimates derived in Section 2. As there are no covariates in equation (31) when we evaluate the simulated data, the POLS estimates in that case are identical to those obtained using an average residual approach (here, using the gain score as the dependent variable), meaning that the teacher effects are treated as random. Thus, SPOLS can be thought of as an EB-type estimator with the simulated data (but not with the real data). Lastly, we also include a version of the EB estimator that uses MLE in the first step, which we call EB GAIN (this is a true EB estimator along with EB LAG). If assignment is based on past scores and λ ≠ 1, we expect these estimators to be outperformed by DOLS (and GRW find this is generally the case) since they are derived assuming the λ = 1 restriction holds.

3 As described in Section 2, technically, we should refer to the MLE as a quasi-MLE because one need not assume normality to obtain consistent estimators. The QMLE is convenient for jointly estimating the parameters in the mean and variance in a single step, and is asymptotically equivalent to the feasible GLS estimator.

4 As described in Rabe-Hesketh and Skrondal (2012), this two-step procedure can be performed in one step using Stata's xtmixed command with teacher random effects. The predicted random effects of this regression are identical to shrinking the MLE estimates by the shrinkage factor.

4 Comparing Estimates Across Methods Using Real Data

We first apply these estimation methods to actual student-level test score data from a large, diverse southern state and examine the rank correlations between the estimated teacher effects of the various estimators for each school district. 5 We perform this analysis to get an idea of the variation in the VAM estimates produced by these various estimators when using a real data set. The data are from 2001 through 2007 and provide estimates of value-added for 5th grade teachers. Overall, we estimate 20,749 teacher effects using test score data from the state's annual assessment exam. We estimate teacher effects using equations (30) and (31) with controls for various student characteristics 6 and include dummies for the year. The teacher effects are estimated using data on multiple cohorts of students.

Figure 1 presents box plots that depict the distributions of the within-district rank correlations between the various estimates in the panel data case. Although the median rank correlation between the SAR and DOLS estimates is 0.95, we observe rank correlations below 0.8 for about 10 percent of our sample.
The EB GAIN and POLS estimates exhibit a distribution of rank correlations similar to that of SAR and DOLS and that of EB LAG and DOLS. Although we observe median rank correlations between EB LAG and DOLS, SAR and DOLS, and EB GAIN and POLS that are near 0.95, a few districts have rank correlations below 0.6. Thus, it appears that differences in methods and/or differences in the assignment mechanisms used by some districts are influencing the estimated teacher effects. On the other hand, the rank correlations between AR and SAR and between DOLS and SDOLS are strikingly high (median rank correlations of 0.97) and not very dispersed, suggesting that the shrinking process itself has little effect on the ranking of teachers. The SAR and EB LAG estimates have rank correlations between 0.8 and 0.99, suggesting that the shrunken estimates are somewhat sensitive to how the teacher effects are calculated in the first stage. Finally, the ranking of teachers appears to be quite different depending on whether or not one restricts λ = 1 when estimating the teacher effects, as is shown by the lower and more dispersed rank correlations between POLS and DOLS and between EB LAG and EB GAIN.

5 There are 67 school districts in the data set.

6 Student characteristics include race, gender, disability status, free/reduced price lunch eligibility, limited English proficiency status, and the number of days the student was absent from school.

5 Comparing Estimates Across Methods Using Simulated Data

Although using real data can provide insight into how closely the teacher rankings of various VAM estimators match each other, we are unable to fully determine which methods perform best in the real data setting. The question of which VAM estimators perform best can only truly be addressed in simulations where the teacher effects are known. Therefore, to further evaluate the performance of the EB estimators relative to other common value-added estimators, we apply these methods to simulated data where the true teacher effects are known. This approach allows us to examine how well various estimators recover the true effect of the teacher under a variety of assignment scenarios. Using data generated from the data generating processes described in Section 5.1, we apply the set of value-added estimators discussed in Section 3 for one grade (cross-section). We then compare the resulting estimates with the true underlying teacher effects.

5.1 Simulation Design

Much of our main analysis focuses on a base case that restricts the data generating process (DGP) to a relatively narrow set of idealized conditions.
These ideal conditions do not allow for measurement error or peer effects and assume that teacher effects are constant over time. The data are constructed to represent grades three through five (the tested grades) in a hypothetical school, but we only calculate estimates of teacher effects for fifth grade teachers. We create data sets that contain students nested within teachers nested within schools, with students followed longitudinally over time in order to reflect the institutional structure of an elementary school. Our simple baseline DGP is as follows:

A_i3 = λA_i2 + β_i3 + c_i + e_i3
A_i4 = λA_i3 + β_i4 + c_i + e_i4   (32)
A_i5 = λA_i4 + β_i5 + c_i + e_i5

where A_i2 is a baseline score reflecting the subject-specific knowledge of child i entering third grade, A_it is the grade-t test score (for t = 3, 4, 5), λ is a time-constant decay parameter, β_it is the teacher-specific contribution to growth (the true teacher value-added effect), c_i is a time-invariant student-specific effect (which may be thought of as ability or motivation), and e_it is a random deviation for each student. Because we assume independence of e_it over time, we are maintaining the so-called common factor restriction in the underlying cumulative effects model. 7

In all of the simulations reported in this paper, the random variables A_i2, β_it, c_i, and e_it are drawn from normal distributions. The standard deviation of the teacher effect is .25, the standard deviation of the student fixed effect is .5, and that of the random noise component is 1. These give relative shares of 5, 19, and 76 percent of the total variance in gain scores (when λ = 1), respectively. Given that the student and noise components are larger than the teacher effects, we call these small teacher effects. We also allow for correlation between the time-invariant child-specific heterogeneity, c_i, and the baseline test score, A_i2, which we set to 0.5.

Our data structure has the following characteristics that do not vary across simulation scenarios:

- 1 school
- Varying numbers of students per classroom
- Class sizes of 10, 15, 20, and 30
- Teachers receive the same class size in each cohort (i.e., a teacher that receives a class size of 20 in year t will receive a class size of 20 in year t + 1)
- 45 5th grade teachers 8

7 This restriction implies that past shocks to student learning decay at the same rate as all inputs. See Guarino, Reckase, and Wooldridge (2012) for a more detailed discussion of this assumption.

8 We vary class size in this way (with teachers having the same number of students each year and more teachers with small classes) in order to create a situation where there is a substantial number of teachers with a small average number of students to showcase the disparities between EB/shrinkage and other estimators.

- 6 with class size of 30
- 9 with class size of 20
- 12 with class size of 15
- 18 with class size of 10
- 720 5th grade students
- 4 cohorts of students
- No crossover of students to other schools

To create different scenarios, we vary certain key features: the grouping of students into classes, the assignment of classes of students to teachers within schools, and the amount of decay in prior learning from one period to the next. We generate data using each of the 9 different mechanisms for the assignment of students outlined in Table 1. These grouping and assignment procedures are not purely deterministic: there is slight randomness in the process. For many of our main analyses, we allow for only a small degree of randomness in the assignment mechanism (a random component with standard deviation of 0.1). As a sensitivity check, we allow for more randomness in this assignment process (a random component with standard deviation of 1) to see if any of these methods perform better as more noise is introduced. We also vary the decay parameter λ as follows: (1) λ = 0.5 (significant decay in student learning), (2) λ = 0.75 (slight decay), and (3) λ = 1 (no decay, or complete persistence in student learning). The estimators used with the simulated data are the estimators discussed in Section 3, but with only teacher dummies and, for the dynamic specifications, the lagged test score included as covariates. We use 100 Monte Carlo replications per scenario in evaluating each estimator.

5.2 Evaluation Measures

For each iteration (and for each of the seven estimators), we save the estimated individual teacher effects and also retain the true teacher effects, which are fixed across the iterations for each teacher. To study how well the methods uncover the true teacher effects, we adopt five simple summary measures using the teacher-level data. The first is a measure of how well the estimates preserve the rankings of the true effects.
We compute the Spearman rank correlation, ρ̂, between the estimated teacher effects and the true effects and report the average ρ̂ across the 100 iterations. Second, we compute a measure of misclassification. These misclassification rates reflect the percentage of above-average teachers (in the true quality distribution) who are misclassified as below average in the distribution of estimated teacher effects.

In addition to examining rank correlations and misclassification rates, it is also helpful to have a measure that quantifies some notion of the magnitude of the bias in the estimates. Given that some teacher effects are biased upwards while others are biased downwards, it is difficult to capture the overall bias in the estimates in a simple way. Our approach is to create a statistic, θ̂, that captures how closely the magnitude of the deviation of the estimates from their mean tracks the size of the deviation of the true effects from the true mean. To create this measure, we regress the deviation of the estimated teacher effects from their overall estimated mean on the analogous deviation of the true effects generated from the simulation for each estimator. We can represent this simple regression as

β̂_j − mean(β̂) = θ̂ (β_j − mean(β)) + residual_j,   (33)

where β̂_j is the estimated teacher effect and β_j is the true effect of teacher j. From this simple regression, we report the average coefficient, θ̂, across the 100 replications of the simulation for each estimator. This regression tells us whether the estimated teacher effects are correctly distributed around the average teacher. If θ̂ = 1, then a movement of β_j away from its mean is tracked by the same movement of β̂_j away from its mean. If the demeaned β̂_j are essentially perfect predictors of the demeaned β_j, then the average θ̂ across simulations will be close to one. When θ̂ is close to one, the magnitudes of the estimated teacher effects can be compared across teachers. If θ̂ > 1, then the estimated teacher effects amplify the true teacher effects. In other words, teachers above average will be estimated to be even more above average, and vice versa for below-average teachers.
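Concretely, the coefficient in equation (33) is just the OLS slope of the demeaned estimates on the demeaned true effects. The following sketch computes it, along with the rank correlation and misclassification rate, for made-up effects; the specific values (45 teachers, a compression factor of 0.8) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true and estimated teacher effects for 45 teachers.
beta_true = rng.normal(0, 0.25, 45)
beta_hat = 0.8 * beta_true + rng.normal(0, 0.05, 45)  # compressed and noisy

# theta-hat: OLS slope of demeaned estimates on demeaned true effects.
d_true = beta_true - beta_true.mean()
d_hat = beta_hat - beta_hat.mean()
theta_hat = (d_true @ d_hat) / (d_true @ d_true)

# Misclassification: share of truly above-average teachers whose estimate
# falls below the average of the estimated effects.
above = beta_true > beta_true.mean()
misclass = np.mean(beta_hat[above] <= beta_hat.mean())

# Spearman rank correlation: Pearson correlation of the two rank vectors.
r_true = beta_true.argsort().argsort()
r_hat = beta_hat.argsort().argsort()
rho = np.corrcoef(r_true, r_hat)[0, 1]
```

Here theta_hat lands near the 0.8 compression factor built into the fake estimates, while the rank correlation stays high: compressed estimates can still rank teachers well.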
An estimation method that produces θ̂ substantially above one can still do a good job of ranking teachers, but the magnitudes of the differences in estimated teacher effects cannot be trusted. The magnitudes also cannot be trusted if θ̂ < 1; in this case, ranking the teachers becomes more difficult because the estimated effects are compressed relative to the true teacher effects. Although the magnitude of the estimated teacher effects is generally of secondary importance to rankings in most policy applications, it is helpful to examine the extent to which the shrinkage of the estimates, as in the EB methods, increases bias in these noisy estimates. Thus, we report the average value of θ̂ across the simulations because it provides evidence of which methods, under which scenarios, produce estimated teacher effects whose magnitudes have meaning, and it provides insight into why some methods can be relatively successful at ranking teachers even when the estimated effects are systematically biased.

Efficiency of the teacher effect estimates is also a key consideration when evaluating the different methods. As described in Section 2, EB methods for calculating the teacher effects are not unbiased when thinking about the teacher effects as fixed parameters we are trying to estimate. However, if the identifying assumptions hold, these methods should provide more efficient estimates. This is one motivation for using such methods, as estimates should be more stable over time, leading to smaller variance in the teacher effects. As the teacher effect is fixed for each teacher across the 100 iterations, we have 100 estimates of each teacher effect. To get a summary measure of the efficiency of the estimators, we calculate the standard deviation of the 100 teacher effect estimates for each teacher and then take a simple average.

Using θ̂ as a measure of bias and the average of the standard deviations as a measure of efficiency, it is useful to combine them in a way that compares the trade-off between bias and efficiency across the estimators. Generally, this is done using the mean squared error (MSE), which can be decomposed as the sum of the variance and the squared bias of the estimators. Thus, in order to obtain a single measure for each estimation procedure, we use

MSE = ŝd² + (1 − θ̂)²   (34)

This provides a simple statistic to determine whether the bias induced by shrinking is justifiable due to efficiency gains across a variety of scenarios.

6 Simulation Results

In each of the results tables we report the five evaluation measures for each particular estimator-assignment scenario combination. The first is the average rank correlation between the estimated and true teacher effects over the 100 replications. The second is the average proportion of below-average teachers who are misclassified as being above average. The third measure is the average value of θ̂ from the regression described in equation (33).
The fourth is the average standard deviation of the teacher effects, and the fifth is the pseudo MSE measure.

6.1 Base Case Results

Tables 2 through 4 present the simulation results for the base case, where each teacher's value-added estimate is based on four cohorts of students, teacher effects are small, and there is very little noise in the assignment mechanism. The tables have varying levels of decay in student learning: λ values of 0.5, 0.75, or 1. We begin this discussion with the pure random assignment (RA) case, where EB-type estimation methods are justified. The results of the random assignment case are given in the top row of each of the tables. As the theory suggests, EB LAG and EB GAIN perform well, with the rank correlations between the estimated effects and the true teacher effects near 0.8 at all levels of decay. The SAR estimator, which uses OLS instead of MLE, performs equally well in terms of the rank correlation even though it is not theoretically preferred. In fact, all of the other estimators we examine perform very similarly to the EB estimators in this case in terms of their ability to rank teachers. In general, the estimators that do not restrict the value of λ to equal 1 rank teachers better. Even in the case where λ = 1, the estimators that do not use this information perform roughly the same as the estimators that do (Table 4).

Use of EB and shrinkage estimators is often motivated as a way to reduce the noise in the estimation of teacher effects, particularly for teachers with a small number of students. More stability in the estimated effects might be thought to reduce misclassification of teachers. However, our simulation results indicate that there is no substantial improvement in the misclassification of teachers for the EB, SPOLS, SDOLS, or SAR estimators. In fact, in the random assignment case, the shrunken estimators have higher misclassification rates than those of the corresponding unshrunken estimators across all levels of decay. This result appears to be more noticeable when applying the shrinkage factor to the estimates obtained via estimators including teacher fixed effects (POLS and DOLS). For example, in the λ = 0.5 case (Table 2), DOLS has a misclassification rate of .200, compared with a higher misclassification rate for SDOLS.
The misclassification rate for SAR is likewise slightly higher than that for AR. It appears that the slight increase in misclassification is largely due to the increase in bias resulting from the shrinking process. As we can see from the tables across all values of λ, the average θ̂ is very close to 1 for POLS, DOLS, and AR. The shrunken versions of these estimators and the two EB estimators have θ̂ values that are much smaller than 1 due to the shrinking process. SDOLS and SAR are the least biased among the shrunken estimators, with θ̂ values between 0.85 and 0.87 across all values of λ. The EB and SPOLS estimators have θ̂ values below 0.8 for all levels of decay in the random assignment case. Although there is a slight efficiency gain from EB estimation (and also from shrinking in general), as witnessed by the smaller average standard errors of the teacher effects, this gain is more than offset by the induced bias: the shrinkage and EB estimators have much higher pseudo MSE measures than the POLS, DOLS, and AR estimators, which do not shrink. Also note that the two true EB estimators have the highest pseudo MSE values.
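The bias-efficiency combination in equation (34) can be sketched in a few lines for a single estimator. The estimates below are simulated stand-ins (100 replications of 45 teacher effects with a 0.9 compression factor), not the paper's actual output.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical estimates: 100 replications x 45 teachers for one estimator.
beta_true = rng.normal(0, 0.25, 45)
estimates = 0.9 * beta_true + rng.normal(0, 0.1, (100, 45))  # illustrative

# Efficiency: per-teacher SD over replications, averaged across teachers.
sd_bar = estimates.std(axis=0, ddof=1).mean()

# Bias: average theta-hat, the slope of demeaned estimates on demeaned truth
# from equation (33), computed once per replication.
d_true = beta_true - beta_true.mean()
slopes = [(d_true @ (e - e.mean())) / (d_true @ d_true) for e in estimates]
theta_bar = np.mean(slopes)

# Pseudo MSE, equation (34): squared noise plus squared shrinkage bias.
pseudo_mse = sd_bar**2 + (1 - theta_bar)**2
```

An estimator that shrinks harder lowers sd_bar but pushes theta_bar further below one, so the two terms of the pseudo MSE move in opposite directions.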

We now move to the cases where the students are nonrandomly grouped together but teachers are still randomly assigned to the classrooms (DG-RA and HG-RA). In these two scenarios, we see a fairly similar pattern as in the RA scenario, although the overall performance of all estimators is slightly diminished. In contrast to the RA and DG-RA cases, the gain score estimators outperform the lag score estimators in the HG-RA case when λ < 1. Interestingly, though, when λ = 1, these gain score estimators are outperformed by the lag score estimators in this scenario. Despite the two cases where the gain score estimators perform particularly well, the overall analysis of the random assignment case suggests that the lag score methods are the most robust to changes in the decay rate and to different grouping mechanisms. Of these estimators, the DOLS and AR methods appear to be the most preferable given that they exhibit the lowest pseudo MSE measures and rank correlations that are comparable to their shrunken counterparts, SDOLS and SAR, and the EB LAG estimator.

The performance of the various estimators diverges notably under nonrandom teacher assignment. We again allow for nonrandom grouping based on either the prior year's test score or on student-level heterogeneity, but now allow for nonrandom assignment of students to teachers. Students with high test scores or high unobserved ability can be assigned to either the best (positive assignment) or worst (negative assignment) teachers. Given that some estimators explicitly control for teacher assignment through the inclusion of teacher fixed effects, we expected methods that allow in some way for correlation between the assignment mechanism and the teacher assignment to perform better in these scenarios than those treating the teacher effect as random. The simulation results presented here largely support this hypothesis.
A key finding of this analysis is the disparity in performance of the AR and EB LAG estimators compared with the DOLS estimator in the DG-PA and DG-NA cases. For example, in Table 2, we observe that DOLS has a rank correlation of around 0.75 for both DG-PA and DG-NA, while AR and EB LAG have rank correlations of 0.13 and 0.14 in the DG-PA case and 0.09 and 0.12 in the DG-NA case, respectively. AR and EB LAG also have very large misclassification rates in both cases, ranging from 0.46 to 0.49, compared with a rate of 0.2 in both cases for DOLS. This stark difference in performance is almost entirely due to bias caused by the failure of the AR and EB LAG approaches to net out the correlation between the lagged test score and the teacher assignment (i.e., the assignment mechanism in these DG scenarios), a correlation which DOLS explicitly allows for with the inclusion of teacher dummies in the regression. We see that even before applying the shrinkage factor to the AR estimates, the magnitudes of the estimated teacher effects are severely compressed relative to the true magnitudes, as evidenced by θ̂ values of between 0.26 and 0.30 in the DG scenarios. The EB LAG estimator produces even more biased estimates, with θ̂ values as low as 0.12. Although the AR and EB LAG estimators have a slight efficiency gain compared with DOLS, the large biases result in pseudo MSE values for AR and EB LAG that are substantially larger than the DOLS pseudo MSE. From Table 2, we observe pseudo MSE values of between 0.5 and 0.56 for AR and between 0.76 and 0.82 for EB LAG. DOLS, which has θ̂ values very close to 1, has pseudo MSE values of 0.4 for both DG scenarios.

As was the case in the RA scenarios, shrinking the estimates does not have any substantial positive impact on performance and in most cases is detrimental. Despite achieving a lower variance across the teacher effects, shrinking the estimates biases the estimated effects. Analyzing this bias-variance trade-off shows that the bias outweighs the gain in efficiency. For example, in the DG-PA case with λ = 0.5, SAR has a pseudo MSE of 0.39; the corresponding AR value appears in Table 2. These results, in conjunction with the analogous results in the RA scenarios, suggest that the unshrunken estimates are preferred to the shrunken versions.

We observe that the performance of the gain score estimators in these DG cases is very sensitive to the level of decay and the type of assignment. In the λ = 0.5 case, gain score estimators perform very poorly across all evaluation measures in the DG-PA case. POLS, for example, has a rank correlation of −0.78 and a pseudo MSE of 3.05, which is starkly different from DOLS with a rank correlation of 0.75 and a pseudo MSE of 0.41. The EB GAIN and SPOLS estimators perform equally poorly, but have lower MSE values due to the increased precision and reduced bias resulting from the shrinking process.
In the DG-NA case, however, these estimators outperform the lag score estimators in terms of rank correlations and misclassification rates. The gain score estimators have rank correlations around 0.96 and misclassification rates of around 0.09, while DOLS has a rank correlation of 0.76 and a misclassification rate of 0.2. These gain score estimators are substantially biased, however, with θ̂ values above 2 (hence the ability of these estimators to rank teachers well). This extreme bias leads the pseudo MSE to be substantially larger for the gain score estimators than for DOLS. A similar pattern of results for the gain score estimators follows when λ = 0.75, although these estimators do exhibit better performance in the DG-PA case. Most notably, the rank correlation measures are no longer negative in this DG-PA case as λ approaches 1. The estimated effects also appear to become less biased in the DG-NA scenario: the θ̂ values are much closer to 1 in the λ = 0.75 case than in the λ = 0.5 case. When the λ = 1 restriction is satisfied, the gain score estimators outperform the lag score estimators in terms of rank correlation and misclassification rate, but are severely biased (θ̂ values near 2). The rank correlations for the gain score estimators become negative in the DG-NA case when λ = 1.

Finally, we examine the case of nonrandom assignment of students grouped on the basis of student-level heterogeneity. The results for these HG scenarios are especially unstable: all estimators do an excellent job of ranking teachers under positive teacher assignment, and all estimators do a very poor job under negative teacher assignment (with very large negative rank correlations). In the HG-PA case, the bias in the estimated VAMs is amplified, as can be seen from the large average values of θ̂, and this helps rank the teachers. But with HG-NA, the estimated and true VAMs are actually negatively correlated, a very troubling scenario.

6.2 Sensitivity Analysis

While we did not see an advantage to EB-type or shrinkage estimation in the case of four cohorts of students, it may be that these approaches are beneficial with very limited data on the teachers. Thus, we replicate the above analysis, but with only one cohort of students for each teacher. In this case, the teacher effect estimates are based on as few as 10 students (and up to 30). The results of this analysis are presented in Tables 5 through 7. With limited data, the performance of all estimators is reduced; however, the main patterns observed in the four cohort case are also observed when we only use one cohort. Despite the similarity in many of the results, the AR and EB LAG estimators do appear to outperform DOLS more noticeably in the DG-RA and HG-RA scenarios when using one cohort instead of four. While this suggests that these other methods may be preferred to DOLS in cases where teachers are randomly assigned to classrooms, it does not mean that these methods should be preferred in general when using minimal data.
When the assignment becomes nonrandom, the performance of AR and EB LAG suffers, as was the case in the four cohort analysis. Given that the true assignment mechanism is likely not known when computing these value-added estimates, this sharp drop in performance for AR and EB LAG under nonrandom assignment should give researchers some pause when choosing among these various methods. Although DOLS does not perform as well with minimal data, it is still quite robust to the various assignment mechanisms and likely should still be preferred to AR and EB methods when the assignment mechanism is unknown.

We also vary the level of randomness in the assignment mechanism that assigns teachers to classrooms. In our base case, we allow for nearly deterministic assignment. In a real-world setting, however, we may think that, due to scheduling conflicts, parental requests, and other factors, there may not be such deterministic assignment based on the prior test score. The results presented in Tables 8 and 9 for the DG-PA and DG-NA scenarios allow for a greater degree of randomness in the assignment mechanism. The patterns of results in this case are much the same as in the base case (with minimal noise in the assignment). The AR and EB LAG estimators perform slightly better than in the base case in these two DG scenarios. Given that these estimators perform well under random assignment, it is not all that surprising that their performance improves as more randomness is introduced into the assignment mechanism. However, despite the slight improvement in the performance of these estimators, the main issue remains: the bias resulting from the failure to partial out the correlation between the lagged test score and the teacher assignment still causes the AR and EB LAG estimators to be biased and to be outperformed by DOLS. Moreover, DOLS performs just as well under noisy assignment as it did under nearly deterministic assignment.

7 Conclusion

Using simulated experiments, where the true teacher effects are known, we have explored the properties of two commonly used Empirical Bayes estimators as well as the effects of shrinking estimates of teacher effects in general. Overall, EB methods do not appear to have much, if any, advantage over simple methods such as DOLS that treat the teacher effects as fixed, even in the case of random teacher assignment where EB estimation is theoretically justified. Under random assignment, all estimators perform well in terms of ranking teachers, and any efficiency gains from EB estimation are offset by the bias introduced by this method. Importantly, EB estimation is not appropriate under nonrandom teacher assignment.
The hallmark of EB estimation of teacher effects is to treat the teacher effects as random variables that are independent of (or at least uncorrelated with) any other covariates. This assumption is tantamount to assuming that teacher assignment does not depend on things such as past test scores (this is also true for the AR methods). When teacher assignment is not random, estimators that either explicitly control for the assignment mechanism or proxy for it in some way typically provide more reliable estimates of the teacher effects. Among the estimators and assignment scenarios we study, DOLS and SDOLS (a shrunken version of DOLS) are the only estimators that control for the assignment mechanism (again, either explicitly or by proxy) through the inclusion of both the lagged test score and the teacher assignment dummies. As expected, DOLS and SDOLS perform well relative to the other estimators in the nonrandom teacher assignment scenarios. In the analysis of the real data, we found that the rank correlations between, say, DOLS and SAR or DOLS and EB GAIN are quite low for some districts, suggesting that the decision among these estimators is important. Thus, if there is a possibility of nonrandom assignment, DOLS is the best choice.

We also find that estimators that do not impose restrictions on λ are generally preferred to those that impose λ = 1 (POLS, SPOLS, and EB GAIN). We saw in the real data that the rank correlations between POLS and DOLS and between EB LAG and EB GAIN were the lowest. Thus, the decision to restrict λ = 1 can lead to very different results. We saw in the simulations that estimators that do not restrict λ, such as DOLS, are more robust across both different values of λ and the different sorting and assignment scenarios.

Lastly, we find that shrinking the estimates of the teacher effects does not itself substantially improve the performance of the estimators, even in the case where estimates are based on one cohort of students (with some teachers having as few as 10 students). The rank correlations in our simulations are extremely close for those estimators that differ only due to the shrinkage factor: AR and SAR; DOLS and SDOLS; and POLS and SPOLS. The simulations show that shrinking does not change the rankings of teachers much, and this was true in the empirical analysis, too: the rank correlations between AR and SAR, as well as between DOLS and SDOLS, are very close to one in almost all districts. Also, shrinking the estimates generally increases the misclassification rate in the simulations. Thus, our evidence suggests that the rationale of using shrinkage estimators to reduce the misclassification of teachers due to noisy estimates of teacher effects should not be given too much weight when choosing among estimators. It is much more important to account for nonrandom teacher assignment.
Given the robustness of the DOLS estimator to a wide variety of grouping and assignment scenarios, it should be widely preferred to AR and EB methods when there is uncertainty about the true underlying assignment mechanism. If the assignment mechanism is known to be random, then applying the AR and EB estimators can be appropriate, especially when the amount of data per teacher is minimal. Given that the assignment mechanism is not likely to be known, however, blindly applying these AR and EB methods can be extremely problematic, especially if teachers are truly assigned nonrandomly to classrooms. Therefore, we stress caution when applying these AR and EB methods and urge researchers to be mindful of the underlying assignment mechanism and the robustness of DOLS when choosing among these various value-added methods.


8 Tables and Figures

Figure 1: Spearman Rank Correlations Across Different VAM Estimators

Figure 2: Spearman Rank Correlations Across Different VAM Estimators

Table 1: Definitions of Grouping-Assignment Mechanisms

Acronym | Process for grouping students in classrooms | Process for assigning students to teachers
RA      | Random                                      | Random
DG-RA   | Dynamic (based on prior test scores)        | Random
DG-PA   | Dynamic (based on prior test scores)        | Positive correlation between teacher effects and prior student scores (better teachers with better students)
DG-NA   | Dynamic (based on prior test scores)        | Negative correlation between teacher effects and prior student scores
HG-RA   | Static (based on student heterogeneity)     | Random
HG-PA   | Static (based on student heterogeneity)     | Positive correlation between teacher effects and student fixed effects
HG-NA   | Static (based on student heterogeneity)     | Negative correlation between teacher effects and student fixed effects
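One row of Table 1 can be made concrete with a short sketch. This is a hypothetical illustration of the DG-PA mechanism (dynamic grouping with positive assignment), with made-up class sizes and counts: students are sorted into classrooms by prior test score, and teachers, ranked by true effect, are matched to classrooms in the same order.

```python
import random

random.seed(2)

J, n = 20, 25                                            # classrooms, class size
students = [random.gauss(0, 1) for _ in range(J * n)]    # prior test scores
teachers = sorted(random.gauss(0, 1) for _ in range(J))  # teacher effects, sorted

# Dynamic grouping (DG): sort students by prior score and cut into classrooms.
ordered = sorted(students)
classrooms = [ordered[k * n:(k + 1) * n] for k in range(J)]

# Positive assignment (PA): the k-th classroom (ranked by prior scores) gets
# the k-th teacher (ranked by effect), inducing the positive correlation
# between teacher effects and prior student scores described in Table 1.
assignment = list(zip(teachers, classrooms))

class_means = [sum(c) / n for _, c in assignment]
print(all(class_means[i] <= class_means[i + 1] for i in range(J - 1)))  # prints True
```

Reversing the teacher list before zipping would give the DG-NA mechanism, and shuffling it would give DG-RA; the HG variants replace the prior score with a fixed student heterogeneity term.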


More information

Evaluation of Teach For America:

Evaluation of Teach For America: EA15-536-2 Evaluation of Teach For America: 2014-2015 Department of Evaluation and Assessment Mike Miles Superintendent of Schools This page is intentionally left blank. ii Evaluation of Teach For America:

More information

Machine Learning and Development Policy

Machine Learning and Development Policy Machine Learning and Development Policy Sendhil Mullainathan (joint papers with Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, Ziad Obermeyer) Magic? Hard not to be wowed But what makes

More information

Analysis of Enzyme Kinetic Data

Analysis of Enzyme Kinetic Data Analysis of Enzyme Kinetic Data To Marilú Analysis of Enzyme Kinetic Data ATHEL CORNISH-BOWDEN Directeur de Recherche Émérite, Centre National de la Recherche Scientifique, Marseilles OXFORD UNIVERSITY

More information

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Proceedings of 28 ISFA 28 International Symposium on Flexible Automation Atlanta, GA, USA June 23-26, 28 ISFA28U_12 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM Amit Gil, Helman Stern, Yael Edan, and

More information

South Carolina College- and Career-Ready Standards for Mathematics. Standards Unpacking Documents Grade 5

South Carolina College- and Career-Ready Standards for Mathematics. Standards Unpacking Documents Grade 5 South Carolina College- and Career-Ready Standards for Mathematics Standards Unpacking Documents Grade 5 South Carolina College- and Career-Ready Standards for Mathematics Standards Unpacking Documents

More information

Essentials of Ability Testing. Joni Lakin Assistant Professor Educational Foundations, Leadership, and Technology

Essentials of Ability Testing. Joni Lakin Assistant Professor Educational Foundations, Leadership, and Technology Essentials of Ability Testing Joni Lakin Assistant Professor Educational Foundations, Leadership, and Technology Basic Topics Why do we administer ability tests? What do ability tests measure? How are

More information

Early Warning System Implementation Guide

Early Warning System Implementation Guide Linking Research and Resources for Better High Schools betterhighschools.org September 2010 Early Warning System Implementation Guide For use with the National High School Center s Early Warning System

More information

The Impact of Formative Assessment and Remedial Teaching on EFL Learners Listening Comprehension N A H I D Z A R E I N A S TA R A N YA S A M I

The Impact of Formative Assessment and Remedial Teaching on EFL Learners Listening Comprehension N A H I D Z A R E I N A S TA R A N YA S A M I The Impact of Formative Assessment and Remedial Teaching on EFL Learners Listening Comprehension N A H I D Z A R E I N A S TA R A N YA S A M I Formative Assessment The process of seeking and interpreting

More information

Conceptual and Procedural Knowledge of a Mathematics Problem: Their Measurement and Their Causal Interrelations

Conceptual and Procedural Knowledge of a Mathematics Problem: Their Measurement and Their Causal Interrelations Conceptual and Procedural Knowledge of a Mathematics Problem: Their Measurement and Their Causal Interrelations Michael Schneider (mschneider@mpib-berlin.mpg.de) Elsbeth Stern (stern@mpib-berlin.mpg.de)

More information

Honors Mathematics. Introduction and Definition of Honors Mathematics

Honors Mathematics. Introduction and Definition of Honors Mathematics Honors Mathematics Introduction and Definition of Honors Mathematics Honors Mathematics courses are intended to be more challenging than standard courses and provide multiple opportunities for students

More information

DO CLASSROOM EXPERIMENTS INCREASE STUDENT MOTIVATION? A PILOT STUDY

DO CLASSROOM EXPERIMENTS INCREASE STUDENT MOTIVATION? A PILOT STUDY DO CLASSROOM EXPERIMENTS INCREASE STUDENT MOTIVATION? A PILOT STUDY Hans Gremmen, PhD Gijs van den Brekel, MSc Department of Economics, Tilburg University, The Netherlands Abstract: More and more teachers

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

Earnings Functions and Rates of Return

Earnings Functions and Rates of Return DISCUSSION PAPER SERIES IZA DP No. 3310 Earnings Functions and Rates of Return James J. Heckman Lance J. Lochner Petra E. Todd January 2008 Forschungsinstitut zur Zukunft der Arbeit Institute for the Study

More information

LANGUAGE DIVERSITY AND ECONOMIC DEVELOPMENT. Paul De Grauwe. University of Leuven

LANGUAGE DIVERSITY AND ECONOMIC DEVELOPMENT. Paul De Grauwe. University of Leuven Preliminary draft LANGUAGE DIVERSITY AND ECONOMIC DEVELOPMENT Paul De Grauwe University of Leuven January 2006 I am grateful to Michel Beine, Hans Dewachter, Geert Dhaene, Marco Lyrio, Pablo Rovira Kaltwasser,

More information

Is there a Causal Effect of High School Math on Labor Market Outcomes?

Is there a Causal Effect of High School Math on Labor Market Outcomes? Is there a Causal Effect of High School Math on Labor Market Outcomes? Juanna Schrøter Joensen Department of Economics, University of Aarhus jjoensen@econ.au.dk Helena Skyt Nielsen Department of Economics,

More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information