PRESENTATION TITLE. A Two-Step Data Mining Approach for Graduation Outcomes CAIR Conference


A Two-Step Data Mining Approach for Graduation Outcomes
2013 CAIR Conference
Afshin Karimi (akarimi@fullerton.edu)
Ed Sullivan (esullivan@fullerton.edu)
James Hershey (jrhershey@fullerton.edu)
Sunny Moon (hmoon@fullerton.edu)
November 21, 2013

Data Mining
Science of extracting patterns and knowledge from large data sets to predict future trends and behavior.
o Supervised Learning
o Unsupervised Learning

Two-Step Process
Classification: decision tree model to predict six-year graduation of first-time freshmen (FTF) (supervised learning)
Cluster analysis (K-Means clustering) on the identified at-risk students to reveal patterns and suggest cluster-level intervention (unsupervised learning)

Classification Model Using Decision Tree
Decision Tree vs. Neural Networks, Logistic Regression, SVM, etc.
Decision trees are easy to understand, implement, and visualize

Decision Trees Continued
Used in different disciplines, including Operations Research
Inverted trees with the root at the top; used to create a model that predicts the target variable
Generated by recursive partitioning
An example of a node selection criterion is Information Gain (C5.0), which selects as the node variable the one whose split leaves the least entropy in the target variable
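The node selection idea can be illustrated with a minimal sketch. It computes plain information gain (entropy of the target before a split minus the weighted entropy after it); it is not the exact C5.0 procedure. The toy records and the names `failed_course`, `gender`, and `graduated` are made up for illustration and are not the study's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy left after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy records (not the study's data).
rows = [
    {"failed_course": "yes", "gender": "F", "graduated": "no"},
    {"failed_course": "yes", "gender": "M", "graduated": "no"},
    {"failed_course": "no", "gender": "F", "graduated": "yes"},
    {"failed_course": "no", "gender": "M", "graduated": "yes"},
]

# The attribute with the highest gain would be chosen as the node variable.
gains = {a: information_gain(rows, a, "graduated") for a in ("failed_course", "gender")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy set, splitting on `failed_course` separates the two outcomes completely, while splitting on `gender` leaves the outcome as uncertain as before, so `failed_course` is selected.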

Example decision tree
Play tennis or not? (depending on weather conditions)
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf assigns a classification

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rainy -> Wind
    Strong -> No
    Weak -> Yes

Example taken from Kurt Driessens' slides
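As a sketch, the play-tennis tree can be written as a nested structure and walked from the root to a leaf. This assumes the tree reads as Outlook at the root, Humidity tested under Sunny, Wind tested under Rainy, and Overcast leading directly to Yes; the names `TREE` and `classify` are hypothetical.

```python
# Internal nodes are dicts that name the attribute they test;
# leaves are plain strings holding the classification.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy": {"attribute": "Wind", "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}


def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(TREE, {"Outlook": "Sunny", "Humidity": "High"}))
print(classify(TREE, {"Outlook": "Overcast"}))
print(classify(TREE, {"Outlook": "Rainy", "Wind": "Weak"}))
```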

Overfitting
The generated decision tree relies too much on irrelevant features of the training data.
The generated model performs poorly on future/unseen data.
To reduce overfitting, use pruning (a technique in which leaf nodes that do not add to the discriminative power of the decision tree are removed).

Training/Building the Tree
Using 24 predictor variables:
o 12 socio-economic, demographic, and HS performance variables
o 12 first-term college variables
All converted to nominal variables
1 target variable: 6 Yr Degree (with Yes/No values)
Using the fall 03, 04, 05, 06 FTF cohorts for training

Predictor Variables
Gender
Under-Represented Status
Residence (county)
Parents' Education
HS GPA
# of College Prep Math Courses Passed in HS
# of College Prep Science Courses Passed in HS
# of College Prep Social Science Courses Passed in HS
# of College Prep Art Courses Passed in HS
SAT Math
SAT Verb
Prior Institution Type
Admission Basis Code
Pell Grant Recipient
Freshman Program Participation
College (Entry)
Entry Level Math Proficiency
English Proficiency
Degree-Applicable Units Earned in First Semester
F, D or WU Grade in 1st Semester
First Term GPA
Math Course (1st term)
English Course (1st term)

Model Validation & Testing
Total of 14,152 records from the fall 03, 04, 05, 06 cohorts (records with missing HS GPAs or SATs excluded) for model training
Random 1,000 records removed and set aside for future testing
Remaining 13,152 records used for training/validation using 5-fold cross validation

5-Fold Cross Validation
In each of the 5 runs, a different fold of 2,630 records is held out for validation and the remaining 10,522 records are used for training.
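A minimal sketch of how the 13,152 training/validation records could be divided for 5-fold cross validation. The function name `k_fold_indices` and the shuffling seed are assumptions for illustration; the slides do not show how RapidMiner performs the split.

```python
import random


def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


# 13,152 is not divisible by 5, so two folds hold 2,631 records
# and three hold 2,630 (the slides round to 2,630 / 10,522).
sizes = [(len(train), len(validation)) for train, validation in k_fold_indices(13152)]
print(sizes)
```

Each record is used for validation exactly once and for training in the other four runs; the model's reported accuracy is then the average over the five runs.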

Model's Accuracy
Classification accuracy is the average accuracy of the 5 runs:
Classification Accuracy: 66.4%
Sensitivity (true positive rate): 72.4%
Specificity (true negative rate): 60.3%

RapidMiner 5.0

Relevance (weights) of the variables based on the Information Gain Ratio
Variable: Weight (normalized)
F, D or WU Grade in 1st Semester: 0.075
Degree-Applicable Units Earned in First Semester: 0.042
First Term GPA: 0.036
Math Course (1st term): 0.033
Admission Basis Code: 0.015
HS GPA: 0.01
Gender: 0.009
Freshman Program Participation: 0.008
Entry Level Math Proficiency: 0.007
English Course (1st term): 0.007
Under-represented Status: 0.007
# of College Prep Math Courses Passed in HS: 0.004
English Proficiency: 0.004
College (entry): 0.004
Parents' Education: 0.003
SAT Verbal: 0.003
Pell Grant Recipient: 0.002
SAT Math: 0.002
Prior Institution Type: 0.002
Residence (county): 0.001
# of College Prep Social Science Courses Passed in HS: 0.001
# of College Prep Science Courses Passed in HS: 0.001
# of College Prep Art Courses Passed in HS: 0.001
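For reference, the information gain ratio divides a variable's information gain by the entropy of the split itself, which penalizes variables with many distinct values. The sketch below illustrates that definition only; the toy columns are invented, and the way RapidMiner normalizes the resulting weights is not shown in the slides and is not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values, target_values):
    """Information gain of the attribute divided by its split information."""
    total = len(target_values)
    groups = {}
    for a, t in zip(attribute_values, target_values):
        groups.setdefault(a, []).append(t)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(target_values) - remainder
    split_info = entropy(attribute_values)
    # An attribute with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy columns (not the study's data).
graduated = ["yes", "yes", "no", "no"]
student_id = ["a", "b", "c", "d"]            # unique per record
failed_course = ["no", "no", "yes", "yes"]   # two values

ratio_id = gain_ratio(student_id, graduated)
ratio_failed = gain_ratio(failed_course, graduated)
print(ratio_id, ratio_failed)
```

Both toy attributes separate the outcomes perfectly, but the unique identifier splits the data four ways, so its ratio is half that of the two-valued attribute.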

Generated Tree

Testing
Tested the model using the 1,000 records that were NOT used in building the model.
Also, later (when summer 13 degrees were posted), tested the model using the Fall 07 cohort.

Testing with Fall 07 FTF Cohort (Sept 13)
Model predicts 1,717 (out of 4,026) students not to graduate in 6 years
Model's classification accuracy: (1183 + 1567)/4026 = 68%
o sensitivity: 1567/2101 = 75%
o specificity: 1183/1925 = 61%
Top half of predicted non-graduates predicted with 82% accuracy
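The Fall 07 figures can be checked with a few lines of arithmetic. The sketch assumes that the "positive" class is graduating within six years, which is what the denominators suggest (2,101 + 1,925 = 4,026); the variable names are illustrative.

```python
# Counts taken from the Fall 07 test slide; "positive" = graduated within six years (assumed).
true_positives = 1567     # graduates the model predicted to graduate
actual_positives = 2101   # all graduates
true_negatives = 1183     # non-graduates the model predicted not to graduate
actual_negatives = 1925   # all non-graduates

false_negatives = actual_positives - true_positives   # graduates predicted not to graduate
false_positives = actual_negatives - true_negatives   # non-graduates predicted to graduate
total = actual_positives + actual_negatives

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / actual_positives
specificity = true_negatives / actual_negatives
predicted_non_graduates = true_negatives + false_negatives

print(f"accuracy={accuracy:.0%} sensitivity={sensitivity:.0%} specificity={specificity:.0%}")
print(f"predicted non-graduates: {predicted_non_graduates} of {total}")
```

Under that assumption the derived count of predicted non-graduates (true negatives plus false negatives) matches the 1,717 reported on the slide.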

Clustering
Place these 859 students who were predicted not to graduate in clusters such that:
o Students in each cluster are as similar as possible (based on their HS and 1st-term college academic performance), and
o Clusters are as different from each other as possible (again based on students' HS and 1st-term college academic performance)

K-Means Clustering Using Mixed Euclidean Distance (both numeric and nominal variables)
Focus is on the HS-to-college transition
Variables used (only academic performance, pre-college and 1st term):
o HS GPA
o SAT Verb
o SAT Math
o Number of degree-applicable units earned in 1st term
o Number of F, D, WU or NC grades in 1st term
o 1st term type of math course passed/failed
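A minimal sketch of k-means over mixed numeric and nominal variables. It assumes that a mixed Euclidean distance adds the squared difference for each numeric variable and a 0/1 mismatch for each nominal variable, and that numeric values have already been scaled to a common range. The function names, the toy records, and the fixed starting centroids are hypothetical; this is not RapidMiner's implementation.

```python
from collections import Counter
from math import sqrt


def mixed_distance(a, b):
    """Squared difference for numeric values, 0/1 mismatch for nominal (string) values."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return sqrt(total)


def centroid(points):
    """Mean of each numeric column, most common value of each nominal column."""
    return tuple(
        Counter(col).most_common(1)[0][0] if isinstance(col[0], str) else sum(col) / len(col)
        for col in zip(*points)
    )


def k_means(points, initial_centroids, iterations=20):
    """Alternate assignment and centroid update; returns (centroids, assignment)."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda c: mixed_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = centroid(members)
    return centroids, assignment


# Hypothetical records: (scaled HS GPA, scaled units earned, 1st-term math outcome).
points = [
    (0.1, 0.1, "failed GE math"),
    (0.2, 0.0, "failed GE math"),
    (0.0, 0.2, "failed GE math"),
    (0.9, 1.0, "passed GE math"),
    (1.0, 0.9, "passed GE math"),
    (0.8, 0.8, "passed GE math"),
]
# Real tools typically pick starting centroids at random; fixed here for repeatability.
centroids, assignment = k_means(points, [points[0], points[3]])
print(assignment)
print(centroids)
```

The two loop steps correspond to the two clustering goals: assigning each student to the nearest centroid keeps clusters internally similar, and recomputing centroids from their members moves the clusters apart.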

Clusters Centroid Plot

Clusters Analysis

Cluster  N     HS GPA          SAT Math        SAT Verb        Units earned    F/D/WU/NC grades
               Mean    σ       Mean    σ       Mean    σ       Mean    σ       Mean    σ
0        324   2.84    0.22    493     88.2    469     83      1.57    1.95    3.27    1.01
1        208   3.45    0.23    472     87.6    451     77      2.41    2.41    2.57    1.11
2        327   2.96    0.23    471     81.6    453     75      6.35    3.06    1.39    0.59

Units earned = degree-applicable units earned; F/D/WU/NC grades = # of F, D, WU or NC grades

Clusters Analysis Continued
1st Term Math Course Outcome

Cluster  Failed Remedial Math  Failed GE Math  Passed Remedial Math  Passed GE Math  None
0        20%                   57%             16%                   6%              2%
1        15%                   45%             29%                   6%              5%
2        18%                   30%             29%                   20%             3%

Cluster 0 (The Un-motivated)
HS GPA 2.8
SAT Math 493, SAT Verb 469
1st term college:
o Earned 1.6 degree-applicable units
o # of F, D, WU or NC grades: 3.3
o 57% took & failed GE math, 20% took and failed remedial math
o 1st term GPA: 0.58
Mostly men (59% men, 41% women)
College of major group mode: hierarchical, followed by semi-hierarchical
Benefits from (Probation) Advisement

Cluster 2 (The Slow Starters)
HS GPA 2.9
SAT Math 471, SAT Verb 453
1st term college:
o Earned 6.3 degree-applicable units
o # of F, D, WU or NC grades: 1.4
o 30% took & failed GE math, 30% took and passed remedial math
o 1st term GPA: 1.63
Mostly women (47% men, 53% women)
College of major group mode: semi-hierarchical, followed by non-hierarchical
Benefits from Academic Support

Cluster 1 (The Disconnected)
HS GPA: 3.4 (above avg. HS GPA of fall 07 incoming freshmen)
SAT Math 472, SAT Verb 451
1st term college:
o Earned 2.4 degree-applicable units
o # of F, D, WU or NC grades: 2.6
o 45% took & failed GE math, 29% took and passed remedial math
o 1st term GPA: 0.83
Largely 1st generation college students (40.4%)
Majority underrepresented students (55.3%)
Majority from outside local area high schools (57%)
Mostly women (36% men, 64% women)
Benefits from Practices that Promote Campus Engagement, Early Warning System

Summary
Predictive model for early identification of at-risk students using early indicators (not past 1st term in college)
Provides insight into clusters of at-risk students; suggests cluster-level intervention
Don't need expertise in machine learning, AI, statistics (data mining tools handle algorithms)
Need to know the data intimately (data compilation & preparation most critical, most time-consuming)

Questions/Comments? Contact email: akarimi@fullerton.edu