Investigation & Classification of Median Income


Investigation & Classification of Median Income Based on US Gov't Scorecard Data
Toyya Pujol-Mitchell, Chris Shartrand

Problem Description

In Fall of 2015, President Obama announced the release of the US Department of Education's College Scorecard. The goal of the College Scorecard was to allow American families to make better and more informed decisions when choosing a college. The raw data was posted on www.data.gov for public use. As the cost of college in the United States continues to rise, more American families are looking at college as a financial investment. Income measured after attendance is a practical assessment of one's return on the investment of college. Hence, the goal of this project is to assess the College Scorecard data with respect to median income. First, we hope to find which college characteristics are most important to a student's income six and ten years after enrolling. Second, we aim to determine whether we can accurately classify colleges into groups that produce high income earners and low income earners.

Data Description and Preparation

The 2014 financial and college data together consisted of over 1700 variables for the over 7800 post-secondary institutions in the US and its territories. The data attempts to provide a comprehensive view of the schools and their students. The financial data included financial information about the college and its students, for example average instructional expenditure per full-time equivalent student and average family income of dependent students. Non-financial data included data on the competitiveness of the school and the strength of enrolled students (average SAT scores, acceptance rate, etc.). The data also included other relevant information about the school such as regional data (state and zip code), accreditation agency, demographics of the school, and highest degree offered.
The data preparation was extensive and involved:
- Merging the financial and non-financial data
- Removing rows with more than 20% missing data
- Removing columns that contained the same value for every row
- Identifying factor variables and converting them into binary variables for lasso regression
- Removing columns with over 1000 factors

Methods

Model Development

We first performed multiple linear regression as a baseline for the model. With the large number of variables, the model achieved an Adjusted R-squared of 80.9% for the median income 6 years after enrollment and 84.4% for 10 years after. This model included the 481 variables that survived the data cleaning. Due to the high number of variables, we performed Lasso for variable selection. We decided against stepwise selection, since the complexity of the data was too high for stepwise to run efficiently. Lasso was able to reduce the number of variables by about 55%, which we believed was still too large for practical use. Hence, we chose eighteen variables based on the MSE vs Variables chart (see below), since it provided a low number of variables without exponential error growth. The model of eighteen variables had an Adjusted R-squared of 73.4% for six years and 76.2% for ten years.

Additional fitting was performed (transformation of data for normality correction, addition of interaction terms via stepwise regression, and removal of outliers and influential points), resulting in an Adjusted R-squared of 82.1% and 83.9% for six and ten years respectively. This is quite close to the initial regression with all the variables, and hence a good fit given the large reduction in the number of variables. The results of the selected models (not including interaction terms) as well as the descriptions of the variable names are below.
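The Lasso variable selection step described above can be illustrated with a small sketch. This is not the authors' code: it is a minimal coordinate-descent Lasso in plain Python, assuming centred features and target (so no intercept is fitted), and the toy data and the names `lasso_coordinate_descent`, `soft_threshold`, and `penalty` are hypothetical. The point it shows is that the L1 penalty sets some coefficients exactly to zero, which is what makes Lasso usable for selecting variables.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything inside [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=100):
    """Minimise (1/2n) * sum((y - Xb)^2) + penalty * sum(|b|) on centred data."""
    n = len(rows)
    p = len(rows[0])
    coefs = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0    # feature j times the residual that ignores feature j
            scale = 0.0  # sum of squares of feature j
            for row, y in zip(rows, targets):
                prediction = sum(c * x for c, x in zip(coefs, row))
                partial_residual = y - prediction + coefs[j] * row[j]
                rho += row[j] * partial_residual
                scale += row[j] ** 2
            coefs[j] = soft_threshold(rho / n, penalty) / (scale / n)
    return coefs


# Hypothetical toy data: the target depends only on the first column.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]

coefs = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, c in enumerate(coefs) if c != 0.0]
print(coefs, selected)
```

In this toy case the second column is dropped and only the first is selected. Increasing `penalty` removes more variables, which is the trade-off an MSE versus number-of-variables chart is meant to expose.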

Looking at the chart below, we can evaluate the coefficients of the selected variables. The six and ten-year models share many variables, which demonstrates consistency. The majority of the variables are financial and debt data of the student, such as family income and cumulative debt. Family income for both independent and dependent students has a positive effect on median income. The impact of family income can also be seen indirectly in that the percent of students who received a Pell Grant, a federal grant for low income students, is a significant variable. The mix of majors awarded also has an effect. The percent of engineering, mechanic and repair technologies, transportation/materials moving, business, and social sciences degrees have a positive effect on median income. On the other hand, the percent of visual/performing arts and culinary/personal services degrees have a negative effect. The last theme is financial data of the college. Tuition revenue and expenditures per full-time equivalent student (FTE) both have positive effects on income.
Coefficients of the selected variables:

| Variable | 6 Year Model | 10 Year Model |
|---|---|---|
| (Intercept) | 9.675 | 16.63 |
| DEP_INC_AVG | 7.23E-05 | 9.71E-05 |
| RPY_3YR_RT_SUPP | 7.806 | 4.909 |
| WDRAW_DEBT_MDN | 1.96E-04 | 1.86E-04 |
| INEXPFTE | 1.90E-04 | 1.86E-04 |
| PCIP14 | 25.93 | 33.65 |
| PCIP12 | -5.6 | -8.546 |
| IND_INC_AVG | 2.03E-04 | 1.45E-04 |
| UGDS_ASIAN | 23.11 | 31.54 |
| CUML_DEBT_P25 | 2.89E-04 | 3.24E-04 |
| PCTPELL | -2.268 | -4.716 |
| PCIP50 | -9.481 | -7.664 |
| PCIP47 | 5.77 | 8.479 |
| GRAD_DEBT_N | -6.83E-05 | 3.44E-04 |

Variables with a coefficient in only one of the two models:

| Variable | Coefficient |
|---|---|
| TUITFTE | 6.12E-05 |
| PCIP49 | 10.55 |
| NOTFIRSTGEN_DEBT_N | -5.48E-05 |
| PCIP39 | -6.064 |
| IND_INC_N | 2.27E-04 |
| HIGHDEG | -0.1474 |
| APPL_SCH_PCT_GE4 | 7.217 |
| PCIP52 | 6.612 |
| PCIP45 | 8.235 |
| CIP11BACHL | 0.3769 |

Variable descriptions:

| Variable Name | Variable Description |
|---|---|
| DEP_INC_AVG | Mean Family Income for Dependent Student |
| RPY_3YR_RT_SUPP | 3 Year Loan Repayment Rate |
| WDRAW_DEBT_MDN | Median debt of students not completing school |
| INEXPFTE | Instructional expenditure per FTE |
| PCIP14 | Percent of Engineering Degrees Awarded |
| PCIP12 | Percent of Personal And Culinary Services Degrees Awarded |
| IND_INC_AVG | Mean Family Income for Independent Student |
| CUML_DEBT_P25 | Cumulative loan debt at the 25th percentile |
| UGDS_ASIAN | % undergraduate degree-seeking students who are Asian |
| PCTPELL | Percent of Undergraduates Who Received Pell Grant |
| PCIP50 | Percent of Visual And Performing Arts Degrees Awarded |
| TUITFTE | Net tuition revenue per full-time equivalent student |
| PCIP47 | Percent of Mechanic And Repair Technologies/Technicians Degrees Awarded |
| GRAD_DEBT_N | Median Debt Completers Cohort |
| PCIP49 | Percent of Transportation And Materials Moving Degrees Awarded |
| NOTFIRSTGEN_DEBT_N | No. of Students in the Median Debt Not 1st Generation Students Cohort |
| PCIP39 | Percent of Theology And Religious Vocations Degrees Offered |
| IND_INC_N | No. of Students in the Family Income Independent Students Cohort |
| HIGHDEG | Highest Level of Degree Offered |
| APPL_SCH_PCT_GE4 | No. of schools on FAFSA applications >= 4 |
| PCIP52 | Percent of Business, Management, Marketing, And Related Support Services Degrees Offered |
| PCIP45 | Percent of Social Sciences Degrees Offered |
| CIP11BACHL | Bachelor's degree in Computer And Information Sciences And Support Services Offered |

Using Lasso Regression and Linear Regression, we were able to determine the most important variables in predicting median income. Next, we wanted to see whether clustering the data would help us assess the similarities of the colleges and any trends in the data.

Clustering

After developing the linear models through Lasso Regression for income six years out and ten years out, it was imperative to understand which of the eighteen variables for each model affected the income. We also wanted to assess whether or not there were any grouping trends in the data. As a result, we decided to run k-means clustering on the data for both six and ten years out. We chose k = 3 groups in the hope that the institutions would cluster into distinct groups of low, medium and high income. It immediately became clear that full visualization of the data would be impossible, as an 18-dimensional space would be necessary. Initially, we dealt with this issue by producing plots of the clustering based on pairs of variables in order to view trends. Four of these plots can be seen below, all corresponding to income six years out. Plots for income ten years out display similar trends.
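The k-means step with k = 3 can be sketched as follows. This is an illustrative plain-Python version of Lloyd's algorithm, not the authors' implementation. The two-feature toy points (family income and instructional expenditure, both in thousands of dollars) and the starting centroids are hypothetical, and in practice the eighteen variables would normally be standardised first so that no single variable dominates the distance.

```python
def kmeans(points, centroids, iterations=50):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical (family income, instructional expenditure) pairs in $1000s.
points = [(20, 4), (24, 6), (50, 9), (54, 11), (90, 19), (94, 21)]
start = [points[0], points[2], points[4]]  # assumed starting centroids

centroids, labels = kmeans(points, start)
print(centroids)
print(labels)
```

Each resulting centroid can then be passed through the fitted linear model to obtain a predicted income for that cluster, which is how the low, medium and high groups are compared in the text.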

Evident from the plots was the lack of cluster predictability from the majority of the eighteen variables. The last of the four plots, instructional expenditures per full-time student against average family income of dependent students, was one of the few perspectives that yielded clear cluster boundaries. We additionally found the three cluster centroids and computed the predicted income based on the linear model to discover whether we had successfully grouped the data into clusters of low, medium and high income. For income six years out, we found the predicted income of the three centroids to be $,, $,, and $, respectively. In the case of income ten years out, the income of the centroids was found to be $,, $,, and $, respectively. These results were a key factor in confirming our intuition that college attendees can expect to have a higher income ten years out of school than six years out of school.

Despite these somewhat promising observations, we were unable to fully confirm our original goal. While the plot of instructional expenditure against average family income showed distinct cluster groups, it was only one perspective of an 18-dimensional space. Hence we could not yet say confidently that we were able to cluster the data into three groups of low, medium and high income. Furthermore, it was evident from the plots that visualization alone could not tell us which of the eighteen variables contributed most to the differences in income six and ten years out. Therefore, we deemed dimension reduction via principal component analysis to be necessary.

We began by reducing the size of the data for income six years out. Because we still wanted to be able to visualize the clustering, we set a goal of using the first three principal components. Analysis of the explained variance found that . % of the variance in income six years out could be explained solely by the first three components, a level that was satisfactory for using only the first three principal components. Similarly for income ten years out, . % of the variance could be explained by the first three principal components. Using the three principal components for both six-year and ten-year income, we reran the k-means clustering for three groups. The 3-dimensional plots for both six years and ten years can be seen below.

As reflected in the two plots, we were able to successfully cluster the data into three groups of low, medium and high income for both the six-year and ten-year data. It is also interesting to observe the larger variation in the clusters for the ten-year data. This observation matches the intuition that income becomes harder to model the longer you have been out of college.

By analyzing the principal component scores, we were able to find which of the underlying variables most strongly affected income. For income six years out, average family income of dependent students, average family income of independent students and institutional tuition revenue per student were the most heavily weighted variables for the first, second, and third principal component respectively. This tells us that the amount of money that your family makes and the amount of money that you spend to attend college are the most important determining factors for the income you would expect to make six years out of college. In the case of income ten years out, average family income of dependent students, average family income of independent students and institutional expenditure per student were the most heavily weighted variables for the first, second, and third principal component respectively. Comparing the two results shows that even after an extra four years out of college, the amount of money your family makes is still the most important factor in determining the amount of money you make. However, there is a key shift from institutional tuition revenue to institutional expenditure.
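The two quantities used in the principal component analysis above, the share of variance explained by a component and the variable weighted most heavily in it, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' procedure: it extracts only the first component by power iteration (later components would need deflation or a full eigendecomposition), it assumes the uniform starting vector is not orthogonal to that component, and the two-column toy data is hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, steps=500):
    """Power iteration: returns the largest eigenvalue and its unit eigenvector."""
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(steps):
        product = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, vec


# Hypothetical toy data: the second column is exactly ten times the first.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]

cov = covariance_matrix(rows)
eigenvalue, loadings = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance  # share of variance on component 1
top_variable = max(range(len(loadings)), key=lambda j: abs(loadings[j]))
print(explained, loadings, top_variable)
```

Because the toy columns are perfectly collinear, the first component carries essentially all of the variance, and the second column has the largest loading. With unscaled real data, a variable measured in large units would dominate in the same way, which is one reason variables are often standardised before PCA.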
This shift from tuition revenue to expenditure indicates that in the short term, your earnings depend on the amount of money you paid to your college. For example, you may accept a lower paying job after graduation to start paying off student debt rather than staying unemployed for longer in order to find a higher paying job. In the long term, however, your earnings depend on the amount of money that your college paid to educate you, indicating that an institution's willingness to fund the educational process greatly affects its students' future earnings.

Classification

Following the cluster and principal component analysis, we found that while we had gained considerable insight into the factors that affect income, we still lacked the ability to predict whether a college is likely to produce low or high income students based on its institutional data. The creation of a logistic regression was therefore warranted. To generate the binomial data of low or high income, we found the mean income from the institutional data and classified all 3,046 institutions as low income if they were below the mean and high income otherwise. This process was run for both the six-year income data and the ten-year income data. Based on the logistic model that was produced for each set of data, we computed the predicted probability that an institution belonged to the high income class. If the probability was below 0.5, we placed that institution into the low income class, and all others were placed into the high income class. Following this, we created a confusion table to quantify the predictive capabilities of our classification model. The table for both the six and ten-year logistic regression models and the K-Nearest Neighbors (KNN) models, described afterwards, can be seen below.
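The labelling and cut-off rules just described can be written down as a short sketch. It assumes the fitted logistic model's predicted probabilities are already available, so the `probabilities` list here is a hypothetical stand-in, and the helper names `label_by_mean`, `classify`, and `percent_correct` are illustrative and do not come from the original analysis.

```python
def label_by_mean(incomes):
    """1 (high income) if at or above the mean income, otherwise 0 (low income)."""
    mean_income = sum(incomes) / len(incomes)
    return [1 if income >= mean_income else 0 for income in incomes]


def classify(probabilities, cutoff=0.5):
    """Assign the high income class unless the probability falls below the cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def percent_correct(predicted, actual):
    """Return (number correct, number incorrect, percent correct)."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct, len(actual) - correct, 100.0 * correct / len(actual)


# Hypothetical median incomes and model probabilities for four institutions.
incomes = [22000, 31000, 45000, 60000]
probabilities = [0.1, 0.6, 0.7, 0.9]

actual = label_by_mean(incomes)
predicted = classify(probabilities)
print(actual, predicted, percent_correct(predicted, actual))
```

The same percent-correct arithmetic applied to the reported six-year logistic counts, 2714 correct out of 3046, gives 89.10%. Changing `cutoff` is all that is needed to explore other thresholds.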

| Classification Model | Correct | Incorrect | Percent Correct | Percent Incorrect |
|---|---|---|---|---|
| Logistic 6 Years Out | 2714 | 332 | 89.10% | 10.90% |
| Logistic 10 Years Out | 2700 | 346 | 88.64% | 11.36% |
| KNN 6 Years Out | 631 | 131 | 82.81% | 17.19% |
| KNN 10 Years Out | 624 | 138 | 81.89% | 18.11% |

Despite the limitations of K-Means discussed in the clustering section, we also wanted to try a nearest neighbor algorithm for classification, and chose KNN. We used K = 4, which was determined by leave-one-out cross validation. For the KNN implementation, only part of the data (762 institutions) was held out for evaluation, with the rest used for training, which is why the nominal number correct is lower than for the logistic models. The correct rates can be seen in the table above. KNN performed quite well, but slightly worse than the logistic regression.

Results and Future Work

The results of our analysis are summarized below:
- We were able to create a linear model with an Adjusted R-squared of 82.1% for 6 years from start of enrollment and 83.9% for 10 years.
- Using the model selection process of the linear model, we were able to find the eighteen most important variables for predicting median income. These eighteen variables were the basis for the clustering and classification.
- We were able to successfully cluster our data into groups of low, medium and high income for both six years out and ten years out. The predicted incomes for the centroids of the groups were $,, $,, and $, and $,, $,, and $, respectively.
- By using principal component analysis, we were able to discover that six years from enrollment, the amount of money that your family makes and the amount of money that your institution makes per student were the most important factors in determining the amount of money that you make. For ten years out of college, the amount of money that your family makes and the amount of money that your institution spent on educating its students were the most important factors.
- By fitting a logistic regression on our data to classify institutions as producing low or high income earning students, we were able to correctly classify 89.10% of institutions six years out, but that number dropped slightly to 88.64% for ten years out. KNN classification yielded slightly worse results than the logistic regression, with correct rates of 82.81% and 81.89% respectively.

Some suggestions for additional work on this project are described in this section. One suggestion is changing the value of K for the KNN algorithm. There are methods to find an optimal K, such as cross validation, which may improve our classification accuracy. In addition, other classification methods such as Linear/Quadratic Discriminant Analysis, Principal Component Analysis or Support Vector Machines could be explored to determine whether they yield better results. For the logistic regression, we chose the default probability of 0.5 as the determining cut-off. A range, such as 0.4 to 0.6, could be explored to see whether the correct rate of the logistic regression could be increased. The number of variables could also be increased to 30, which was the smallest number before the MSE growth rate was no longer linear.
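Choosing K by leave-one-out cross validation, as mentioned for the KNN model, can be sketched as follows. This is a plain-Python illustration under assumptions, not the authors' code: the one-dimensional toy points, the candidate values of K, and the rule that ties go to the lower class label are all hypothetical choices.

```python
def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k nearest training points (ties go to the lower label)."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = [train_labels[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)


def loocv_accuracy(points, labels, k):
    """Hold out each point in turn, predict it from the rest, and average the hits."""
    hits = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            hits += 1
    return hits / len(points)


def best_k(points, labels, candidates):
    """Candidate K with the highest leave-one-out accuracy (first one wins ties)."""
    return max(candidates, key=lambda k: loocv_accuracy(points, labels, k))


# Hypothetical toy data: two well separated groups on a single feature.
points = [(1,), (2,), (3,), (10,), (11,), (12,)]
labels = [0, 0, 0, 1, 1, 1]

for k in (1, 3, 5):
    print(k, loocv_accuracy(points, labels, k))
print(best_k(points, labels, [1, 3, 5]))
```

On this toy data K = 1 and K = 3 classify every held-out point correctly, while K = 5 always outvotes the held-out point's own group, so the smallest candidate is returned. On real data the accuracies usually differ less sharply, and the same loop could be reused to compare a wider range of K values.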