Evaluating Dropout Prevention and Recovery Models. The University of California Educational Evaluation Center, Dr. John T. Yun, Director. NGA Center for Best Practices, State Strategies to Achieve Graduation for All. Seaport Hotel, Boston, Massachusetts, September 20, 2010
Presentation Outline: Introduction; Importance of evaluation in today's policy climate; Evaluation Types; Goals versus Objectives; Theories of Action; Methods for the Madness; Some Time to Work; Questions
Evaluation Types. Process: an evaluation designed to assess the implementation of a program. Audience? Formative: an evaluation designed to guide program development and improvement. Audience? Summative: an evaluation designed to assess programmatic impact. Audience?
What are the Differences? Process evaluations look at implementation (fidelity) and do not address whether what is being done is effective. Formative evaluations are designed to provide information solely for the purpose of program improvement. Summative evaluations examine outcomes and test the theory of action.
What are the Differences? There are many different understandings of these terms; here are mine. Formative and summative are largely differences in philosophy and purpose, not necessarily true differences in approach: they can use similar methods and generally differ in the rigor required. Process evaluations are more clearly defined, and a process evaluation can be either formative or summative.
Program Goals versus Objectives. Goals are the ultimate outcomes of a program (distal outcomes), such as a happier life, greater income, attending college, or less cost to society; they have broad impact and may be very difficult to measure. Program objectives are measurable changes that should occur during the project/intervention, and they contain criteria for measuring success and failure.
SMART Objectives: S, Specific; M, Measurable; A, Appropriate; R, Realistic; T, Time-bound. By keeping to these general rules for objectives, you can be sure that what you say you want to do is measurable.
Importance of Theories of Action. A theory of action tells you what you believe the outcomes of the program are, and shows the causal links between program components, proximate outcomes, and distal outcomes. It allows for testing of both theory and implementation. The clearer the theory of action, the easier the evaluation, though this is not always easy in practice.
Sample Theory of Action (logic model). Intervention: provide more information to parents and students about college. Proximate outcome: increased comfort with and interest in college-going. Distal outcome: increased rates of college-going.
Parent Education Program logic model. SITUATION: during a county needs assessment, a majority of parents reported that they were having difficulty parenting and felt stressed as a result. Copyright 2008 Board of Regents of the University of Wisconsin System, d/b/a Division of Cooperative Extension of the University of Wisconsin-Extension.
Implementation v. Theory Failure. Implementation failure: the program wasn't implemented well. Theory failure: things don't work the way you think they will. Both look the same based on outcomes alone.
Failures. Both kinds of failure show no change in the outcome, so you MUST be able to distinguish between them. In the earlier theory of action (provide more information to parents and students about college, increase comfort and interest in college-going, increase rates of college-going): poor implementation leading to bad outcomes is implementation failure, while good implementation with no change in outcomes is theory failure.
Limitations of the Logic Model Approach. To a hammer, everything looks like a nail: logic models can become an end and not a means, and they may keep you from seeing what's actually happening in an organization. Don't assume that because the program fits into the box, the box fits the program: can program complexity be captured in a logic model/theory of action? Can you think of an example where it cannot? Don't assume that the box will always stay the same: all logic models are time-bound.
Logic Model Takeaway. It is critical to have a clearly delineated logic model/theory of action. It provides guideposts for evaluation design and creates a powerful test of key program assumptions; by building in as much detail as possible, you can look at both process and outcomes.
Methods for the Madness. I will focus on impact evaluation, the most important to the purposes of policy. Experiments (first-best world): the very strongest methodology; allows for causal attribution under certain circumstances. Quasi-experiments (second-best world): regression discontinuity analysis, propensity score models, interrupted time series, student fixed effects models, and difference-in-difference models.
Definitions. Experimentation involves deliberate intrusion into an ongoing process to identify the effects of that intrusion. Randomized experiments involve assignment to treatment and comparison groups based on chance. Quasi-experiments involve assignment to treatment not based on chance.
How to Approach Design. The goal of many designs (but not all) is to establish causality, so it is key to understand the power and limits of each approach. The central problem is establishing the counterfactual: what would have happened if the student had not participated. Most poor evaluations are due to comparisons of non-identical students. Experiments are great at causal description but NOT causal explanation: they can tell us what results from deliberately manipulating a single experimental condition, but are not as good at determining why the condition led to the outcome.
Where Causality and Random Assignment Meet. The logic of causal relationships: the cause must precede the effect, the cause must covary with the effect, and you must rule out alternative causes. Randomized experiments do all this: they give the treatment and then measure the effect, they can easily measure covariation, and randomization makes most other causes less likely (this is related to threats to internal validity). Quasi-experiments are problematic on the third criterion.
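The third criterion is worth seeing concretely. The simulation below (a minimal sketch with entirely hypothetical data, not from any study discussed here) shows why chance assignment rules out alternative causes: any pre-existing characteristic, measured or not, ends up nearly balanced across the two groups.

```python
import random

def randomize(units, seed=0):
    """Randomly split a list of units into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pre-program test scores for 10,000 students.
rng = random.Random(42)
scores = [rng.gauss(50, 10) for _ in range(10_000)]

treatment, control = randomize(scores)
mean = lambda xs: sum(xs) / len(xs)

# Chance assignment balances this (and every other) pre-existing
# characteristic across groups, ruling out alternative causes on average.
balance_gap = abs(mean(treatment) - mean(control))
```

The same logic applies to characteristics nobody measured, which is exactly what quasi-experiments cannot guarantee.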
Advantages of Experiments Unbiased estimates of effects Relatively few, transparent and testable assumptions More statistical power than alternatives Long history of implementation in health, and in some areas of education Credibility in science and policy circles
Disadvantages of Experiments. Not always feasible for reasons of ethics, politics, or logistics. Experience is limited, especially with higher-order units like whole schools. You also need no differential attrition and no contamination across the different conditions.
What No Effect Looks Like
What a Main Effect Looks Like
Regression Discontinuity. Assignment is based only on a cutoff score. It is the second-best design for causal inference: proofs show that it provides unbiased inference, and there is empirical evidence that it produces results similar to an experiment. It can be widely used in education. Data analysis is quite tricky, but manageable.
Assignment under RD. Assignment can be by a merit score, a need score, first come first served, or date of birth; RD can actually involve any assignment variable that is ordered, including made-up ones. The key concepts are an assignment variable, a cutoff score, and an outcome. Think of RD as a randomized experiment at the cutoff point, and as a design with a completely known assignment process.
Upshot for RD. A very powerful design, with lots of opportunity for use in education. Depends on the ability to get a good cut score and on people sticking to it. Can be combined with randomized designs. The model can be difficult to specify correctly, and RD has less power than a randomized experiment (it needs samples approximately 2.5x larger).
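As a sketch of the basic RD estimation step, the simulation below (hypothetical cutoff, effect size, and data, chosen only for illustration) fits a separate regression line on each side of the cutoff; the jump between the two lines at the cutoff is the RD estimate of the program effect.

```python
import random

rng = random.Random(0)
CUTOFF = 60.0        # hypothetical eligibility cutoff
TRUE_EFFECT = 5.0    # hypothetical program effect (unknown in practice)

# Simulated students: those scoring BELOW the cutoff receive the program.
data = []
for _ in range(20_000):
    score = rng.uniform(20, 100)     # the assignment variable
    treated = score < CUTOFF         # completely known assignment rule
    outcome = 0.5 * score + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0, 2)
    data.append((score, outcome))

def ols(xs, ys):
    """Intercept and slope of a simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
          / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# Fit a line on each side of the cutoff; the gap between the two
# predictions AT the cutoff is the RD estimate.
below = [(s, y) for s, y in data if s < CUTOFF]
above = [(s, y) for s, y in data if s >= CUTOFF]
a_b, b_b = ols([s for s, _ in below], [y for _, y in below])
a_a, b_a = ols([s for s, _ in above], [y for _, y in above])
rd_estimate = (a_b + b_b * CUTOFF) - (a_a + b_a * CUTOFF)
```

Real RD analyses are trickier than this (bandwidth choice, functional form, manipulation checks), which is the "tricky but manageable" point above.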
Propensity Scores. Propensity score analysis tries to model selection into treatment. A propensity score is the probability, given your observables (measured variables), that you will be assigned to treatment. The goal of propensity score analysis is to find people with IDENTICAL probabilities of being in treatment, some of whom were treated and some of whom were not, so that you get an equivalent comparison group.
Upshot for Propensity Scores. You need as many observables as possible that are relevant to selection into the program and measured PRIOR to the intervention. Ideally, these should be strongly correlated with assignment and less correlated with the outcome. Propensity scores work best when there is a clear selection theory that can be modeled, because this allows you to select good variables to use.
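The intuition can be sketched with a toy simulation (all values hypothetical; real analyses estimate the propensity score with a model such as logistic regression rather than knowing it exactly). Here, "motivated" students both enroll more often and score higher, so a naive treated-versus-untreated comparison is badly biased, while comparing only people with identical propensity scores recovers the effect.

```python
import random

rng = random.Random(1)
TRUE_EFFECT = 2.0   # hypothetical program effect

# Simulated selection: motivated students are far more likely to enroll,
# and motivation also raises the outcome directly.
people = []
for _ in range(20_000):
    motivated = rng.random() < 0.5
    propensity = 0.7 if motivated else 0.3   # P(treatment | observables)
    treated = rng.random() < propensity
    outcome = (10.0 if motivated else 0.0) \
            + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0, 1)
    people.append((propensity, treated, outcome))

mean = lambda xs: sum(xs) / len(xs)

# Naive comparison mixes the motivation difference into the estimate.
naive = mean([y for p, t, y in people if t]) \
      - mean([y for p, t, y in people if not t])

# Comparing only people with IDENTICAL propensity scores removes that bias;
# average the within-score contrasts, weighted by group size.
total, weight = 0.0, 0
for p in (0.3, 0.7):
    t = [y for ps, tr, y in people if ps == p and tr]
    c = [y for ps, tr, y in people if ps == p and not tr]
    total += (mean(t) - mean(c)) * (len(t) + len(c))
    weight += len(t) + len(c)
propensity_estimate = total / weight
```

The catch, as the slide notes, is that this only works for selection on observables: any unmeasured driver of enrollment stays in the estimate.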
Interrupted Time Series (ITS). Represents a whole family of design types (short time series, difference-in-difference, fixed-effects models): a series of observations on a dependent variable over time, interrupted by the introduction of an intervention. Roughly N = 100 observations is the desirable standard, but N < 100 is still helpful, even with very few observations (e.g., N = 7). The time series should show an effect at the time of the interruption.
ITS. A very powerful design, dependent on the availability of good archived outcome data and on the ability to gather time-series outcomes. Note that much more archived data is available at the school and district level than at the individual level. Design elements can do much to improve the ability to make causal inferences: for example, comparisons to untreated groups, or to outcomes that are unlikely to be affected by the treatment but likely to be affected by contextual variables.
Student Fixed Effects. Can be considered a subset of ITS: you compare students' growth on important outcomes (test scores, motivation, etc.), and the key is to see whether growth is affected post-intervention. In addition, using a fixed effect you subtract out the mean outcome for each student, so you are not comparing students to one another, but only student changes relative to the intervention.
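The "subtract out the mean" step can be sketched directly (a minimal simulation with hypothetical students, years, and effect size): each student has a different baseline ability, but demeaning removes it, so only within-student changes identify the effect.

```python
import random
from collections import defaultdict

rng = random.Random(2)
TRUE_EFFECT = 3.0   # hypothetical after-school program effect

# Panel: 2,000 students observed for three years, treated only in year 2.
rows = []   # (student, treated, score)
for s in range(2_000):
    baseline = rng.gauss(50, 10)             # each student's own level
    for year in (1, 2, 3):
        treated = (year == 2)
        score = baseline + (TRUE_EFFECT if treated else 0.0) + rng.gauss(0, 2)
        rows.append((s, treated, score))

# Subtract each student's mean score (the "fixed effect"): the baseline
# drops out, so each student is compared only with themself.
by_student = defaultdict(list)
for s, t, y in rows:
    by_student[s].append(y)
student_mean = {s: sum(v) / len(v) for s, v in by_student.items()}

mean = lambda xs: sum(xs) / len(xs)
demeaned_t = [y - student_mean[s] for s, t, y in rows if t]
demeaned_c = [y - student_mean[s] for s, t, y in rows if not t]
fe_estimate = mean(demeaned_t) - mean(demeaned_c)
```

Note what the demeaning buys: even though students differ enormously in baseline ability (standard deviation 10 here), the estimate depends only on within-student changes around the treated year.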
(Figure: a student's test scores by year. The student receives the after-school program in one year out of three; the untreated years act as the control group, and we test for a break from trend growth in the treated year.)
Difference in Difference (DID). A version of ITS and fixed-effects models for non-experimental situations: identify groups that underwent a policy change and compare their trends to those of groups that did not. The most obvious example is comparing trends in schools that did and did not receive the intervention. DID usually uses only a few data points, which distinguishes it from traditional ITS.
(Figure: school-level test scores by year. The school receives the after-school program in one year out of three; a control school with similar score trends provides the comparison, and we test for a break from trend growth in the treated year.)
Upshot for DID and Fixed Effects. You need equivalent groups to compare growth, which can be difficult in practice. Impacts must happen quickly, and you need good controls. The more data, the more powerful the method.
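The simplest DID calculation is just two subtractions. The numbers below are hypothetical school-mean test scores, invented for illustration; the point is that the control school's trend is subtracted out, so only the treated school's extra change is attributed to the program.

```python
# 2x2 difference-in-differences:
#   (treated after - treated before) - (control after - control before)
treated_before, treated_after = 50.0, 58.0   # hypothetical treated-school means
control_before, control_after = 48.0, 53.0   # hypothetical control-school means

# The control school's trend (+5) proxies for what the treated school
# would have done anyway; the remaining +3 is the DID estimate.
did = (treated_after - treated_before) - (control_after - control_before)
```

This only works if the two schools would have trended in parallel without the program, which is why "equivalent groups" is the first requirement above.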
Other Considerations: Statistical Power and Sample Size. The statistical power of a study refers to the probability that you can see an effect if it exists. Statistical power increases with large samples (or MANY clusters of schools or teachers), outcome variables that have low natural variation, and lots of baseline (pre-experiment) measures of the outcome variable (to account for random initial differences). Major funders request that applicants indicate the Minimum Detectable Effect Size (MDES) of a proposed study. For example, an MDES of 0.2 means the study could reject the hypothesis of zero effect with high probability if the true effect was 0.2 standard deviations or higher.
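For a rough sense of what an MDES of 0.2 demands, the sketch below uses the standard normal-approximation formula for a two-group comparison with equal group sizes. This is a simplification: it assumes individual (not cluster) randomization and ignores covariates and baseline measures, both of which change the answer in real proposals.

```python
from math import sqrt
from statistics import NormalDist

def mdes(n_per_group, alpha=0.05, power=0.80):
    """Minimum detectable effect size (in standard-deviation units) for a
    two-group comparison with equal group sizes, using the normal
    approximation.  Ignores clustering and covariate adjustment."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(2 / n_per_group)

# Under these assumptions, detecting a 0.2 SD effect with 80% power
# takes roughly 400 participants per group; smaller samples can only
# detect larger effects.
mdes_small = mdes(100)
mdes_large = mdes(400)
```

Clustered designs (randomizing whole schools) need far more participants than this individual-level formula suggests, which is why the slide stresses MANY clusters.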
Some Work Time. Pre-Questions for Evaluation Presentation. The following questions would be useful to reflect upon prior to the evaluation presentation on Thursday, September 30th. 1. What is/are the main goal(s) of your program? 2. Specifically, how do the components of the program lead to these outcomes? What are the key intermediate steps? (Theory of Action; Causal Model) 3. How can you measure the outcomes of your program? 4. Who is the audience for the evaluation? What is the purpose of the evaluation: program improvement or justification of funding? 5. What evaluation technique(s) could you implement? a) First-best: Randomized Controlled Trial (RCT) b) Second-best: Regression Discontinuity (RD) c) Third-best: Propensity Score Models d) Third-best: Interrupted Time Series Model (ITS) e) Third-best: Difference-in-Difference Models (DID) f) Third-best: Student Fixed-Effect Models
Questions
Conclusions. It is key to clearly delineate your logic model/theory of action, and you must choose methods that are appropriate to your goal. There are many methods that can get you to your answer, each with critical tradeoffs. Consider whether cost-benefit analysis is appropriate for your goals.
Thank you for your time John T. Yun Director, University of California Educational Evaluation Center jyun@education.ucsb.edu ucec@education.ucsb.edu (805)893-2342