Introduction to Causal Inference. Problem Set 1. Required Problems

Introduction to Causal Inference
Problem Set 1

Professor: Teppei Yamamoto
Due Friday, July 15 (at beginning of class)

Only the required problems are due on the above date. The optional problems will not directly count toward your grade, though you are encouraged to complete them as your time permits. If you choose not to work on the optional problems, use them for your self-study during the summer vacation.

Required Problems

Problem 1

Suppose that you have a random sample of size $n$ from the population of interest. Answer the following questions, which are designed to help you get familiar with potential outcomes. Try to keep your answers brief and your language precise. Throughout the problem, assume that the Stable Unit Treatment Value Assumption (SUTVA) holds.

(a) Explain the notation $Y_i(0)$.

(b) Contrast the meaning of $Y_i(0)$ with the meaning of $Y_i$.

(c) Contrast the meaning of $Y_i(0)$ with the meaning of $Y_i(1)$. Is it ever possible to observe both at the same time? Why?

(d) Explain the notation $E[Y_i(0) \mid D_i = 1]$, where $D_i$ is a binary variable that gives the treatment status for subject $i$: 1 if treated, 0 if control.

(e) Contrast the meaning of $E[Y_i(0)]$ with the meaning of $E[Y_i \mid D_i = 0]$.

(f) Contrast the meaning of $E[Y_i(0) \mid D_i = 1]$ with the meaning of $E[Y_i(0) \mid D_i = 0]$.

(g) Which of the following quantities (that you explained in parts (d) through (f)) can be identified from observed information? Do not make any additional assumptions about the distributions of $Y$ or $D$, except SUTVA and that there is at least one observation with $D_i = 1$ and at least one with $D_i = 0$ in the observed data.

    $E[Y_i(0) \mid D_i = 1]$    (1)
    $E[Y_i(0)]$                 (2)
    $E[Y_i \mid D_i = 0]$       (3)
    $E[Y_i(0) \mid D_i = 0]$    (4)

(h) Now, assume that $D_i$ is randomly assigned to the units in this sample. Which of the above quantities can be identified from the observed data?

Problem 2

This problem is for those of you who are unfamiliar with R, which we will use throughout the rest of the course in lectures and problem sets. If you are already comfortable using R, skip this problem and proceed to Problem 3.

1. If you haven't already, download and install R and RStudio on your own computer. Make sure to choose the right versions for your operating system.

2. Complete Lesson 1 on this website. That is, you should watch each of the video tutorials (Parts 1.1 to 1.6) and complete all the swirl quizzes. (You will be able to find out how to install and use swirl by following the links on the above website.) As proof of your hard work, type the following directly in your console after you complete all the swirl quizzes:

    > savehistory(file = "pset1prob3.txt")

and submit the file along with your problem set answers.

Problem 3

This problem is designed for those of you who are already familiar with R and need a quick review of key concepts and operations. If you are new to R and just getting started with it, this problem is optional for you.

1. Using the data given below, answer the following questions in R.

Major Foreign Holders of U.S. Treasury Securities

Country                     Dec2007   Dec2006   Dec2005   Dec2004   Dec2003
Japan                         581.2     622.9     670.0     689.9     550.8
China, Mainland               477.6     396.9     310.0     222.9     159.0
United Kingdom                158.1      92.6     146.0      95.8      82.2
Oil Exporters                 137.9     110.2      78.2      62.1      42.6
Brazil                        129.9      52.1      28.7      15.2      11.8
Caribbean Banking Centers     116.4      72.3      77.2      51.1      47.3
All Others
Total                        2353.2    2103.1    2033.9    1849.3    1523.1

Total Public Debt Outstanding

Year      Debt Held by the Public   Intra-governmental Holdings    Total
Dec2003                    4044.2                        2953.7   6997.9
Dec2004                    4408.4                        3187.8   7596.2
Dec2005                    4714.8                        3455.6   8170.4
Dec2006                    4901.0                        3779.2   8680.2
Dec2007                    5136.3                        4092.9   9229.2

(a) How much did the debt held by the public increase from 2003 to 2007? What proportion of this new debt was bought up by foreigners? What proportion of the new debt was bought up by the Chinese?

(b) What proportion of U.S. Treasury Securities held by foreigners in 2007 were held by the top six countries/groupings?

(c) As a percentage, what is the nominal increase in holdings from 2003 to 2007 of these six countries/groupings as a group? What if we exclude Japan?

(d) Create a vector of the percentage changes for each country from 2003 to 2007. Be sure to include names indicating which value corresponds to which country.

(e) Repeat the exercises in parts (b) and (c) using the vectors you have just created. Use sum() and indices.

(f) What was the ratio of debt held by the public relative to intra-governmental holdings for each year from 2003 to 2007? When did this ratio hit its high point and low point? Make sure to include year labels.

2. (a) Download the Middle East dataset available on the course website and save it to some convenient location. Change your working directory, via setwd() or using the pulldown menu, to the directory in which you saved the data. Read the data using the read.csv("filename.csv") command. How many observations (countries) are there? How many variables? [Don't just count off the screen. Imagine we were working with a much larger dataset.]
(b) Create a new variable in the data frame that equals GDP per capita in each country, and calculate the mean GDP per capita in the entire region.

(c) What is the class of the religion variable? If it is not already a character string, coerce the plurality religion for each country to a character string.

(d) You can use negative indices to remove rows or columns from objects. For example, X[-1,] will give you the matrix X less its first row. Using this trick, remove the capital and currency variables from the dataset.

(e) How many of these countries have an HDI value above 0.75? (Hint: The sum(x) function counts up the number of TRUEs in x if x is a logical vector.)

(f) How many of these countries have both a density equal to or above 35 and a population above 10 million?

(g) Create a new variable within the Middle East dataset indicating whether each country has a high (above 0.75) or low HDI value.

(h) Create a new dataset consisting of just the high HDI countries.

(i) The save.image function allows you to save your workspace into a file so you can retrieve your work later on. Use the following syntax to save your objects in your working directory: save.image(file = "nameyoulike.rdata"). (You can also do the same via RStudio's drop-down menu.) Now, close RStudio (don't forget to save your code as well!), reopen it, and load the saved workspace using the load function.

Optional Problems

Problem A

You and some colleagues are conducting an intervention in Ghana's 2016 election.[1] Your goal is to assess the effect of deploying a new biometric voting machine on the incidence of electoral fraud at polling stations. Though Ghana is made up of 275 constituencies, due to the political context you are only allowed to perform your experiment within one constituency. Unconcerned, you and your team randomly select eight polling stations in the constituency, and among the eight, randomly assign half to receive the new voting machines and the other half to serve as a control group. $D_i$ gives the resulting treatment status for each polling station $i$, for $i \in \{1, \ldots, N\}$, where $D_i \in \{0, 1\}$ and $N = 8$. The outcome of interest is the percentage of votes in a polling station attributable to fraud, $Y_i$.

[1] This problem is loosely inspired by true events. If you are interested in the substantive or methodological issues, see Asunka et al. 2015.

a) Assume that both parts of SUTVA hold. Calculate, explaining your answer:

i) For each polling station $i$, the number of potential outcomes that can be defined;

ii) For each polling station $i$, the number of unit treatment effects that can be defined; and,

iii) For the sample of polling stations, the number of (unconditional) average treatment effect estimands that can be defined.

b) A Ghanaian political insider gets wind of your study, and gives you some information. She says that because all of the polling stations in your study are within one small constituency, there will be interference between units. She explains how interference might occur in this case: In Ghana, the political operatives in each polling station who are responsible for committing fraud will move elsewhere if their efforts are frustrated. Their range of movement is predetermined by the geographic influence of the party bosses. Local conditions are such that the operatives in each polling station can only move to one, and only one, other polling station, as shown in Figure 1.

Figure 1 (diagram of polling stations 1 through 8 connected by arrows; not reproduced here): Operatives may move to only one other polling station, as indicated by the arrows (for example, from the second to third polling station, but not the reverse).

Given this structure, answer again questions (i) - (iii).

c) Your funder is keen to spend their budget, and demands that you field the intervention as described originally: $N = 8$, and for $D_i \in \{0, 1\}$, $\sum_{i=1}^{N} D_i = 4$. You set aside your concerns, and observe the data in Table 1 on the percentage of votes in each polling station attributable to fraud. Write out an appropriate estimator for the average treatment effect (ATE) given SUTVA. Drawing on your insights so far and your knowledge of causal inference, under what conditions will this estimator be unbiased? Apply it to the data and compute an estimate of the effect of the new biometric voting machines on the incidence of fraud.

Table 1: Observed Data: Treatment and Ballot Stuffing

Unit   D_i   Y_i
1      1      4%
2      1      2%
3      0      8%
4      1      9%
5      0     12%
6      0     13%
7      0      4%
8      1      1%

d) Given that you know the structure of the interference network, you believe you may be able to rescue something from the study. Using Figure 1, translate Table 1 into a table formatted like Table 2, where the latter columns give $Y_i(D_i, D_{i-1})$ for the combinations of possible values for $D_i$ and $D_{i-1}$. (And for $i = 1$, $Y_1(D_1, D_8)$, per Figure 1.) Where you can, fill the cells of your table with values from Table 1; leave blank the cells representing unobserved potential outcomes.

Table 2: Observed and Unobserved Potential Outcomes

Unit   D_i   D_{i-1}   Y_i(1,1)   Y_i(1,0)   Y_i(0,1)   Y_i(0,0)
1
...
8

e) Formally express each of the estimands described below using potential outcomes; propose appropriate estimators for each; and finally estimate them with the help of your new table.

i) The ATE, conditional on a neighbor taking treatment.

ii) The ATE, conditional on a neighbor taking control.

iii) The magnitude of effect modification due to assignment of treatment to a neighboring unit.

f) Having heard about your complicated experimental design, a colleague approaches you for advice about a potential field experiment. Their experiment, which randomizes the provision of information about the quality of schooling, will be conducted in a dense urban area. Treatment is to be assigned at the household level. After questioning your colleague extensively, you ascertain that the structure of the urban area can be described by a ring lattice network with the number of households given by $N$, and the interconnectedness of the households given by degree $d$:

Figure 2 (diagram of households 1 through 6 and the connections between them; not reproduced here): Households and connections between households in the village. This example has $N = 6$, where $N$ is the number of nodes (units), and $d = 4$, where $d$ is the degree of the network, that is, how many connections each node has to other nodes.

Figure 2 implies that any household in the village is connected to the four ($d = 4$) nearest households, so that interference can be ruled out between households that are three or more lots apart. We can rule out interference between any households not immediately connected by an edge. Given this setup, and given $N > d$, answer the following:

i) For each household $j$, give the number of potential outcomes that can be defined.

ii) For each household $j$, give the number of unit treatment effects that can be defined.

iii) Imagine now that the experimental intervention is no longer binary, but ternary. It comprises two distinct treatments and a control. In this scenario, how many potential outcomes are there for each $j$? And how many unit treatment effects?

iv) Assume that as $N$ increases the structure of the households continues to follow an expanding ring lattice as shown in Figure 2, where the number of connections is still defined by $d$, for any $d < N$. In terms of $k$, the number of levels of treatment (e.g., $k = 2$ if treatment is binary), and $d$, the degree of the ring lattice network, write down a general expression for $M_{k,d}$, the number of unique unit treatment effects for any given unit $j$. An increase in which parameter, $k$ or $d$, has worse implications for the tractability of a study?

Problem B

This problem will introduce you to directed acyclic graphs (DAGs), and reasoning about them. Consider the following two DAGs, in which $X$, $Y$, and $Z$ are observed, and $U_a$ and $U_b$ are hypothesized but unobserved.

a) List all of the nodes in Graph A.

b) List all of the edges in Graph A.

c) List each path terminating at $Y$ that exists in Graph A, and repeat for Graph B. Interpret the differences between graphs.

Figure 3: Causal Graph A. Figure 4: Causal Graph B. (Diagrams not reproduced here; the node labels that survive from the figures are $U_a$, $U_b$, $X_D$, $X$, $Y$, and $Z$.)

d) Explain the relationship between $X_D$ and $X$ in Graph B.

e) Is the relationship between $X$ and $Y$ identified in Graph B? Why?

f) Assuming $D_i \in \{0, 1\}$, write down the ATE of $X$ on $Y$ for Causal Graph B using the notation introduced in class.
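
The vocabulary used in Problem B (nodes, edges, and paths terminating at a given node) can be illustrated with a small R sketch. The graph below is hypothetical and is not Graph A or Graph B; the node names U, X, Y and the helper functions neighbors() and paths_to() are made up for illustration. The sketch also assumes that a "path" may traverse an arrow against its direction and may not revisit a node, which may differ from the definition used in class.

```r
# Hypothetical DAG (NOT Graph A or Graph B): U -> X, U -> Y, X -> Y.
edges <- data.frame(
  from = c("U", "U", "X"),
  to   = c("X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# The nodes are the distinct endpoints of the edges.
nodes <- sort(unique(c(edges$from, edges$to)))

# Nodes adjacent to `node`, ignoring edge direction (assumption: a path
# may follow an arrow in either direction).
neighbors <- function(node) {
  unique(c(edges$to[edges$from == node], edges$from[edges$to == node]))
}

# Enumerate every path that ends at `target` by extending the path backwards
# one node at a time, never revisiting a node already on the path.
# Returns a list of character vectors ordered from start node to target.
paths_to <- function(target) {
  extend <- function(path) {
    out <- list()
    for (nb in setdiff(neighbors(path[1]), path)) {
      new_path <- c(nb, path)
      out <- c(out, list(new_path), extend(new_path))
    }
    out
  }
  extend(target)
}

paths  <- paths_to("Y")
labels <- vapply(paths, paste, character(1), collapse = " - ")

print(nodes)   # the node list
print(edges)   # the edge list, one row per arrow
print(labels)  # one string per path terminating at Y
```

Storing the graph as an edge list keeps the node list and the path enumeration derived from a single object, so changing the rows of edges is enough to explore a different hypothetical graph.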