Let the data speak: Machine Learning methods for data editing and imputation. Paper by: Felibel Zabala Presented by: Amanda Hughes

Let the data speak: Machine Learning methods for data editing and imputation
Paper by: Felibel Zabala
Presented by: Amanda Hughes
September 2015

Objective
Machine Learning (ML) methods can help us analyse and understand erroneous data and non-response in various data collections. This presentation describes the machine learning methods we have explored to help develop sound editing and imputation methodology, using Statistics New Zealand's Household Economic Survey as a case study.

Context: Statistics New Zealand
The Household Economic Survey (HES) is currently migrating to the Household Processing Platform, and edit rules are being developed. The editing system will have a contextual editor that gives users a relational view of the data requiring manual editing. To assist in developing the contextual editor, we are exploring association rule mining as a way to create editing rules.

Association rule mining: Introduction
Association rule mining was introduced by Agrawal et al. in 1993. It originated from analysing market-basket transactions to generate association rules that describe which items tend to occur together in a transaction. Item associations are generated based on the strength of the association, the frequency of occurrence, and the predictive utility of the relationship.

Association rule mining: Definition
Association rule: an implication expression of the form X → Y, where X and Y are disjoint itemsets. Example: {Milk, Diapers} → {Beer}.
Rule evaluation measures:
Support (s): the fraction of transactions that contain both X and Y.
Confidence (c): how often items in Y appear in transactions that contain X.
TID  Items
1    Bread, Milk
2    Bread, Diapers, Beer, Eggs
3    Milk, Diapers, Beer, Coca Cola
4    Bread, Milk, Diapers, Beer
5    Bread, Milk, Diapers, Coca Cola
Example: {Milk, Diapers} → {Beer}, where σ(·) counts the transactions that contain an itemset and |T| is the total number of transactions.
s = σ(Milk, Diapers, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diapers, Beer) / σ(Milk, Diapers) = 2/3 = 0.67
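The support and confidence figures in the example can be checked with a few lines of code. The following is a minimal Python sketch; the function names (support_count, support, confidence) are illustrative and do not come from the presentation.

```python
# Market-basket transactions from the example table (TID -> items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Coca Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Coca Cola"},
}


def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Share of the transactions containing X that also contain Y."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x, y = {"Milk", "Diapers"}, {"Beer"}
s = support(x, y, transactions)
c = confidence(x, y, transactions)
print(round(s, 2), round(c, 2))  # 0.4 0.67
```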

Association rule mining: Goal
The goal of association rule mining is to find all rules with support and confidence above defined thresholds. First, generate all itemsets whose support ≥ minsup (called frequent itemsets). Then extract all the high-confidence rules from the frequent itemsets. The most popular algorithm for association rule mining is the Apriori algorithm (available, for example, in the R package arules).
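The two stages (frequent itemsets first, then high-confidence rules) can be sketched in plain Python. This is a simplified illustration of the Apriori idea, not the arules implementation; the names apriori, extract_rules and baskets, and the thresholds used, are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that reach the minimum support.
        level = {}
        for cand in current:
            s = sup(cand)
            if s >= minsup:
                level[cand] = s
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        current = {
            cand for cand in candidates
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1))
        }
    return frequent


def extract_rules(frequent, minconf):
    """Return (lhs, rhs, support, confidence) for rules with confidence >= minconf."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = s / frequent[lhs]  # subsets of a frequent itemset are frequent
                if conf >= minconf:
                    found.append((lhs, itemset - lhs, s, conf))
    return found


baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coca Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coca Cola"},
]
freq = apriori(baskets, minsup=0.4)        # illustrative threshold
found = extract_rules(freq, minconf=0.6)   # illustrative threshold
```

With these illustrative thresholds, the rule {Milk, Diapers} → {Beer} from the market-basket example is among the rules returned.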

Association rule mining: Limitation
Association rules apply to categorical data. HES data are mostly quantitative, so we had to use an ordinal association workaround. Ordinal association rule mining is done in two steps: first, find ordinal rules with a minimum confidence (using a version of the Apriori algorithm); second, identify data items that break the rules and can be considered outliers.
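The slides do not spell out the form of the ordinal rules. One common formulation, assumed here, treats a rule as an ordering relation between two ordinal attributes (for example A ≤ B) that holds for at least a minimum share of records; records that break such a rule are flagged. The sketch below follows that assumption. The attribute names, the data and the threshold are hypothetical, and the pairwise search stands in for the Apriori variant mentioned above.

```python
import operator
from itertools import combinations

# Ordering relations considered between two ordinal attributes (an assumption).
RELATIONS = {"<=": operator.le, ">=": operator.ge}


def ordinal_rules(records, attributes, minconf):
    """Find relations 'a <= b' or 'a >= b' that hold in at least minconf of records."""
    found = []
    for a, b in combinations(attributes, 2):
        for name, rel in RELATIONS.items():
            conf = sum(rel(r[a], r[b]) for r in records) / len(records)
            if conf >= minconf:
                found.append((a, name, b, conf))
    return found


def rule_breakers(records, rule):
    """Indices of records that violate a rule: candidate outliers for review."""
    a, name, b, _ = rule
    rel = RELATIONS[name]
    return [i for i, r in enumerate(records) if not rel(r[a], r[b])]


# Hypothetical ordinal codes; not HES data.
records = [
    {"qualification_level": 1, "income_band": 2},
    {"qualification_level": 2, "income_band": 3},
    {"qualification_level": 2, "income_band": 2},
    {"qualification_level": 3, "income_band": 5},
    {"qualification_level": 4, "income_band": 1},
]
rules = ordinal_rules(records, ["qualification_level", "income_band"], minconf=0.8)
outliers = rule_breakers(records, rules[0])
print(rules, outliers)
```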

Association rule mining: HES data
We investigated unedited HES data consisting of 4,292 records described by 33 attributes, using ordinal rules to illustrate the identification of outliers. Age and income were converted to ordinal attributes: age into five-year age groups for ages 15-64 plus a 65+ group, and income into percentiles of the income distribution of the data set.
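The conversion to ordinal attributes can be sketched as follows. The slides do not say how ties or ages below 15 were handled, so both choices here are assumptions: tied incomes share a percentile, and ages under 15 raise an error. The function names are illustrative.

```python
from bisect import bisect_right


def age_group(age):
    """Five-year groups for ages 15-64 plus an open-ended 65+ group."""
    if age < 15:
        # Assumption: ages under 15 fall outside the groups described.
        raise ValueError("age below 15 is outside the described groups")
    if age >= 65:
        return "65+"
    low = 15 + 5 * ((age - 15) // 5)
    return f"{low}-{low + 4}"


def income_percentiles(incomes):
    """Map each income to its percentile (1-100) within the data set."""
    ordered = sorted(incomes)
    n = len(ordered)
    # Ceiling of 100 * (number of values <= income) / n, using integer arithmetic.
    return [-(-100 * bisect_right(ordered, v) // n) for v in incomes]


print(age_group(37))                                     # 35-39
print(income_percentiles([10000, 20000, 30000, 40000]))  # [25, 50, 75, 100]
```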

Association rule mining: HES data, results
Association rules with a minimum confidence of 0.15 were extracted.
Association rule: Highest qualification → Total income

Association rule: Age → Total income

Household Economic Survey (HES)
HES collects income and expenditure data, plus a wide range of demographic information. For a personal income questionnaire to count as a response in the current HES, all key questions must have a valid answer. Current imputation method: nearest neighbour donor imputation.
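Nearest neighbour donor imputation copies the missing value from the most similar complete record (the donor). The sketch below assumes numerically coded matching variables and a weighted absolute-difference distance; the distance function, variable names and data are illustrative assumptions, not the production HES method.

```python
def nearest_donor_impute(records, matching_vars, target, weights=None):
    """Fill missing `target` values from the closest record that has a value."""
    weights = weights or {v: 1.0 for v in matching_vars}
    donors = [r for r in records if r[target] is not None]
    completed = []
    for rec in records:
        if rec[target] is not None:
            completed.append(dict(rec))
            continue

        def distance(donor, rec=rec):
            # Weighted sum of absolute differences over the matching variables.
            return sum(weights[v] * abs(rec[v] - donor[v]) for v in matching_vars)

        donor = min(donors, key=distance)  # raises ValueError if there are no donors
        completed.append({**rec, target: donor[target]})
    return completed


# Hypothetical records; None marks a missing income.
records = [
    {"age_group": 3, "hours": 40, "income": 52000},
    {"age_group": 9, "hours": 0, "income": 18000},
    {"age_group": 3, "hours": 38, "income": None},
]
completed = nearest_donor_impute(records, ["age_group", "hours"], "income")
print(completed[2]["income"])  # 52000
```

The input records are left unchanged; the function returns completed copies.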

Household Economic Survey (HES)
Proposed methodology: the same imputation method, but with the income questionnaire divided into three modules: the jobs module, the government transfers module and the investment module. A previous Statistics New Zealand (SNZ) project investigated and recommended a machine learning method as a standard tool for creating homogeneous imputation classes.

Classification methods for imputation: Classification and Regression Trees
Imputation is done within homogeneous classes to minimise potential non-response bias. Sometimes a large number of variables are available to form imputation classes. A Statistics New Zealand project proposed using decision tree learning methods (classification and regression trees, CART) to determine which variables are useful for creating imputation classes.

Classification and Regression Trees
The basic structure of a classification or regression tree consists of a root node that grows through a series of splits to create terminal nodes. Different criteria are used for splitting.
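Two widely used splitting criteria are the sum of squared errors (for regression trees) and the Gini index (for classification trees). The slides do not say which criterion was applied, so the sketch below shows both as assumptions, searching for the best threshold on a single numeric variable. All names and data are illustrative.

```python
def sse(values):
    """Sum of squared errors around the mean: a regression criterion."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def weighted_gini(labels):
    """Node size times Gini impurity: a classification criterion."""
    n = len(labels)
    if n == 0:
        return 0.0
    return n * (1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)))


def best_split(xs, ys, cost):
    """Return (threshold, cost) of the split 'x <= threshold' with the lowest cost.

    The threshold is None when no split improves on the unsplit node.
    """
    best = (None, cost(ys))
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = cost(left) + cost(right)
        if total < best[1]:
            best = (t, total)
    return best


xs = [1, 2, 3, 10, 11, 12]
print(best_split(xs, [5, 6, 5, 20, 21, 19], sse)[0])                    # 3
print(best_split(xs, ["a", "a", "a", "b", "b", "b"], weighted_gini))    # (3, 0.0)
```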

Classification and Regression Trees: Case study
CART was used to explore the matching variables needed to impute missing income in the three income modules using nearest neighbour imputation. Data from HES 2010/11 were used as the training sample to identify the regression tree that will be used for future data. We used the R package rpart (Recursive Partitioning).


Classification and Regression Trees: Case study
The complexity parameter (cp) is used to control the size of the tree and to select the optimal tree size. Tree construction does not continue unless the next split would decrease the overall lack of fit by a factor of cp. If a smaller cp had been used, say 0.001, more branches would have been produced, resulting in more imputation cells.
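The effect of cp on tree size can be illustrated with a toy one-variable regression tree. This is a simplified greedy stopping rule written for illustration, assuming that a split is kept only if it reduces the squared error by at least cp times the error at the root; it is not the rpart algorithm, and the data and names are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean (the lack of fit of a node)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def count_leaves(xs, ys, cp, root_sse=None):
    """Grow a one-variable regression tree; return its number of terminal nodes."""
    if root_sse is None:
        root_sse = sse(ys)
    node_sse = sse(ys)
    best_gain, best_t = 0.0, None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = node_sse - sse(left) - sse(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    # Stop unless the best split improves the fit by at least cp * root error.
    if best_t is None or root_sse == 0 or best_gain / root_sse < cp:
        return 1
    left = [(x, y) for x, y in zip(xs, ys) if x <= best_t]
    right = [(x, y) for x, y in zip(xs, ys) if x > best_t]
    return (count_leaves(*zip(*left), cp, root_sse)
            + count_leaves(*zip(*right), cp, root_sse))


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 1, 2, 2, 9, 9, 10, 10]
print(count_leaves(xs, ys, cp=0.01))   # 2
print(count_leaves(xs, ys, cp=0.001))  # 4
```

In this toy data the second-level splits each improve the fit by 1/130 of the root error, which passes a cp of 0.001 but not 0.01, so the smaller cp yields more terminal nodes.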

Classification and Regression Trees: Case study, weights
To find the corresponding weights, we used random forests (R package randomForest). Random forests can be used to rank the importance of variables in a classification problem. A random forest combines the predictions made by multiple decision trees, where each tree is generated using a random vector drawn from some fixed probability distribution.
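The idea of ranking variables with an ensemble of randomised trees can be sketched with a heavily simplified stand-in: bagged one-split trees (stumps), each fitted to a bootstrap sample using a randomly chosen subset of variables, with importance measured as each variable's share of the total reduction in squared error. This is not the randomForest package, which grows full trees and provides its own importance measures; every name, parameter and data value below is an assumption for illustration.

```python
import random


def sse(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_gain(rows, ys, var):
    """Largest reduction in squared error from one split on `var`."""
    node = sse(ys)
    best = 0.0
    for t in sorted({r[var] for r in rows})[:-1]:
        left = [y for r, y in zip(rows, ys) if r[var] <= t]
        right = [y for r, y in zip(rows, ys) if r[var] > t]
        best = max(best, node - sse(left) - sse(right))
    return best


def stump_forest_importance(rows, ys, variables, n_trees=200, mtry=1, seed=0):
    """Share of total error reduction credited to each variable."""
    rng = random.Random(seed)
    totals = {v: 0.0 for v in variables}
    n = len(rows)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        b_rows = [rows[i] for i in idx]
        b_ys = [ys[i] for i in idx]
        candidates = rng.sample(variables, mtry)     # random subset of variables
        gains = {v: best_gain(b_rows, b_ys, v) for v in candidates}
        winner = max(gains, key=gains.get)
        totals[winner] += gains[winner]
    total = sum(totals.values())
    return {v: (g / total if total else 0.0) for v, g in totals.items()}


# Hypothetical data: the target depends on hours, not on region.
rows = [{"hours": h, "region": h % 2} for h in range(12)]
ys = [1000 * h for h in range(12)]
weights = stump_forest_importance(rows, ys, ["hours", "region"])
print(weights)
```

Shares of this kind could then serve as relative weights for the matching variables in a distance function.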

Classification and Regression Trees: Case study, results

Classification and Regression Trees: Case study, evaluation
Results were evaluated using HES 2012/13 data. Missing income records were simulated at rates of 5% and 10% from the true income records. CANCEIS was used to impute the missing records using the recommended matching variables. The presence of bias was measured by comparing the imputed data with the true observed data. RESULT: no evidence of bias in the imputed values.
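An evaluation of this kind can be sketched as: blank out a known share of true values, impute them, and compare the completed data with the truth. The slides do not state the bias measure used, so the relative bias of the mean below is an assumption, and simple mean imputation stands in for the CANCEIS nearest neighbour step purely to keep the example self-contained. The income values are made up.

```python
import random


def simulate_missing(values, rate, seed=0):
    """Blank out a random share of values; return the masked list and the indices."""
    rng = random.Random(seed)
    k = round(rate * len(values))
    missing = set(rng.sample(range(len(values)), k))
    return [None if i in missing else v for i, v in enumerate(values)], missing


def relative_bias(true_values, completed_values):
    """(mean after imputation - true mean) / true mean."""
    true_mean = sum(true_values) / len(true_values)
    completed_mean = sum(completed_values) / len(completed_values)
    return (completed_mean - true_mean) / true_mean


# Hypothetical true incomes; not HES data.
incomes = [30000, 42000, 55000, 61000, 38000, 47000, 52000, 29000, 75000, 44000,
           36000, 58000, 49000, 41000, 67000, 33000, 51000, 46000, 39000, 62000]

masked, missing = simulate_missing(incomes, rate=0.10)

# Stand-in imputation (an assumption): fill with the mean of observed values.
observed = [v for v in masked if v is not None]
fill = sum(observed) / len(observed)
completed = [fill if v is None else v for v in masked]

print(relative_bias(incomes, completed))
```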

Conclusion
We have found useful applications of machine learning methods for data editing and imputation. Association rule mining allows us to extract rules from datasets. Classification methods allow us to create homogeneous imputation classes and to determine efficient matching variables for use in nearest neighbour imputation.

Future work
In the future we plan to explore the use of cluster analysis for data editing and imputation. Future training on data editing and imputation will include relevant topics on machine learning methods. We also plan to develop a user interface that will make it easier to investigate machine learning methods for data editing and imputation.