Topics in Business Intelligence

Topics in Business Intelligence Lecture 1: Introduction to BI & case study Tommi Tervonen Econometric Institute, Erasmus School of Economics

What is Business Intelligence (BI)?
- BI refers to computer-based techniques used in spotting, digging out, and analyzing business data, such as sales revenue by products and/or departments, or associated costs and incomes
- BI technologies provide historical, current, and predictive views of business operations
- Business Intelligence often aims to support better business decision-making
Source: wikipedia.org/wiki/business_intelligence

Examples of BI

BI framework Watson & Wixom, 2007

Main components in BI

Knowledge discovery process

Why data mining?
- Tremendous amount of data
  - Walmart: customer buying patterns, a data warehouse 7.5 terabytes large in 1995
  - VISA: detecting credit card interoperability issues, 6800 payment transactions per second
- High dimensionality of data
  - Many dimensions to be combined together
- High complexity of data
  - Time-series data, temporal data, sequence data
  - Spatial, spatiotemporal, multimedia, text and Web data

Data mining
Subtypes:
- Text mining: mining of patterns from text
- Web mining: discovering patterns from the web

Data mining: predictive analysis types
- Classification assigns observations to (possibly ordered) classes, e.g. credit card transactions to normal or fraudulent ones
- Prediction is similar, but instead of assigning observations to classes, we try to predict the value of a numerical variable, e.g. the amount of a credit card purchase
- Association rules, or affinity analysis, tell what is associated with the observations. Recommender systems (e.g. amazon.com) use association rules
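To make the association-rule idea concrete, here is a minimal sketch of the two standard rule measures, support and confidence. The baskets and item names are invented for illustration and are not from the course material.

```python
from fractions import Fraction

# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return Fraction(hits, len(baskets))

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule "bread -> milk": how often the pair occurs, and how reliable the rule is.
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

A recommender would typically keep only rules whose support and confidence both exceed chosen thresholds; the thresholds themselves are a modelling decision.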

Data mining: pre-analyses
- Data visualization allows an easy overview of the data
- Data exploration often needs to be done with large data sets to answer more vague questions. Similar variables and observations can be aggregated to get a better picture of the data
- Data reduction consolidates a large number of variables or cases into a smaller set
  - Correlation & principal component analyses
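As a rough illustration of how correlation analysis can support data reduction, the sketch below computes the Pearson correlation between pairs of variables. The three variables and their values are hypothetical; a strongly correlated pair is a candidate for being merged or reduced to one variable.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical customer variables (invented values).
avg_donation = [5.0, 10.0, 15.0, 20.0, 25.0]
last_donation = [6.0, 9.0, 16.0, 19.0, 26.0]
visits = [3.0, 1.0, 4.0, 1.0, 5.0]

r_donations = pearson(avg_donation, last_donation)  # nearly redundant pair
r_visits = pearson(avg_donation, visits)            # weakly related pair
print(round(r_donations, 3), round(r_visits, 3))
```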

What is data?
Data can essentially be:
1. Continuous: ordered values with a scale. E.g. client monthly spending (€), speed of car (km/h)
2. Categorical: discrete, possibly ordered values. E.g. car class (small family car, large family car, executive, ...), bank customer credit class (A, B, C, D)
Often data is categorical due to the form of reporting (e.g. from questionnaires: monthly salary)

Data mining methods for BI
Mostly:
- Statistical methods for analysis of continuous variables
- Machine learning for analysis of categorical variables
Variables are divided into predictors and responses

Data nature & methods

                       | Continuous response | Categorical response  | No response
Continuous predictors  | Linear regression   | Logistic regression   | Principal components
                       | Neural nets         | Neural nets           | Cluster analysis
                       | k-nearest neighbors | Discriminant analysis |
                       |                     | k-nearest neighbors   |
Categorical predictors | Linear regression   | Neural nets           | Association rules
                       | Neural nets         | Classification trees  |
                       | Regression trees    | Logistic regression   |
                       |                     | Naive Bayes           |

- Ordered categorical variables (e.g. 1, 2, 3) can often be converted to continuous ones
- Continuous variables can always be converted to categorical ones through frequency analysis (binning)
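The conversion from continuous to categorical values can be sketched as equal-frequency binning. This is one possible reading of "frequency analysis (binning)", not necessarily the exact procedure used in the course; the spending values and class labels are made up.

```python
def equal_frequency_bins(values, n_bins):
    """Give each value a bin index 0..n_bins-1 so bins hold roughly equal counts.

    Values are ranked by size; tied values may land in neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

# Hypothetical monthly spending per client.
spending = [120.0, 15.5, 300.0, 80.0, 45.0, 210.0]
labels = ["low", "medium", "high"]
classes = [labels[b] for b in equal_frequency_bins(spending, 3)]
print(classes)
```

Equal-width bins (fixed value ranges) are the other common choice; equal-frequency bins avoid nearly empty classes when the data is skewed.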

Data mining process

Learning modes
- In unsupervised learning, no outcome variable is predicted
- In supervised learning, the model is trained to predict a known response
- The data needs to be split into training and test sets
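A minimal sketch of such a split, assuming a random 70/30 partition with a fixed seed. The fraction, the seed, and the function name are illustrative choices rather than anything prescribed by the course.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and return (training rows, test rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on the training rows only; the test rows are held back to estimate how well it predicts observations it has not seen.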

Supervised learning with linear regression: x = 200, y = ?
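The question on this slide can be sketched as an ordinary least-squares fit followed by a prediction at x = 200. The training points below are invented, since the slide's plotted data is not part of the transcript.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data (known predictor and response pairs).
xs = [50.0, 100.0, 150.0, 250.0, 300.0]
ys = [12.0, 21.0, 33.0, 52.0, 61.0]

slope, intercept = fit_line(xs, ys)
predicted = intercept + slope * 200.0  # the unknown y for x = 200
print(round(predicted, 2))
```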

Data mining process
1. Develop an understanding of the purpose of the data mining project
2. Obtain the dataset to be used in the analysis
3. Explore, clean, and preprocess the data
4. Reduce the data, if necessary, and (in supervised learning) separate it into training, test, and validation sets
5. Determine the data mining task (classification, prediction, etc.)
6. Choose the technique to be used
7. Apply algorithms
8. Interpret results
9. Deploy model

Q?

Course organization
Lectures:
- 1st: Introduction to BI & case study
- 2nd: Data reduction
- 3rd: Model validation
- 4th: No lecture, use the time for preparing your presentation
- 5th: Student lectures: Naive Bayes and k-nn, Classification trees
- 6th: Student lectures: Logistic regression, Neural nets
- 7th: Overview of results, comparison with (yet another) test set, feedback

Course learning objectives
1. Knowledge of basic principles of data warehouse
2. Comprehension of business implications of BI and data mining
3. Application of a single data mining classification method
4. Evaluation of data mining results

Course evaluation & material
Evaluation:
- Student lecture & case analysis (100%)
- Student lectures have mandatory attendance (1 miss allowed)
Online material (all will be available at http://smaa.fi/tommi/courses/tbi/):
- My slides from the first 3 lectures
- Slides of the student lectures
- Scientific papers
Course book: Shmueli, Patel & Bruce, Data Mining for Business Intelligence. It helps in preparing the student lecture but is not mandatory

Student lectures
- Prepared in pairs or small groups
- Each lecture should consist of at least the following:
  1. Theoretical explanation of the method
  2. An application of the method to a simple case
  3. Presentation of real-life BI applications of the method
  4. Analysis of the case study with the method
- Each lecture should be 40 min + 5 min discussion: expect to spend 2 weeks in preparation

Case study
- Direct mailings to potential customers ("junk mail") can be an effective way to market a product or service. However, most junk mail is of no interest to the majority of people, and ends up being thrown away
- More directed marketing to high-potential customers saves waste and effort, and consequently lowers costs and increases profits

Case study scope
Our customer is a Dutch charity organization that wants to be able to classify its supporters into donators and non-donators. The non-donators are sent a single marketing mail a year, whereas the donators receive multiple ones (up to 4).
Tasks:
1. Develop a data mining model for classifying the customers into donators and non-donators
2. Explain through the model which factors are important in deciding who is a donator

Case study data
Information about donators in 8 variables:
- TIMELR: TIME since Last Response (nr weeks)
- TIMECL: TIME as CLient (nr years)
- FRQRES: FReQuency of RESponse (to mailings)
- MEDTOR: MEDian of Time Of Response
- AVGDON: AVeraGe DONation (per responded mailing)
- LSTDON: LaST DONation
- ANNDON: Average ANNual DONation
- DONIND: Donation indicator in the considered mailing (response)
Training and test sets of over 4000 customers
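As a starting point for exploring such data outside RapidMiner, the sketch below separates the seven predictors from the response. It assumes a CSV layout with the variable names as column headers and DONIND coded as 0/1; both are assumptions, and the sample rows are invented rather than taken from the course data set.

```python
import csv
import io

# Invented sample rows; the real course data set is not reproduced here.
# Assumed layout: one column per variable, DONIND coded 1 (donated) or 0.
SAMPLE = """TIMELR,TIMECL,FRQRES,MEDTOR,AVGDON,LSTDON,ANNDON,DONIND
10,5,0.8,2,25.0,30.0,40.0,1
60,2,0.1,6,5.0,5.0,2.0,0
25,8,0.5,3,15.0,10.0,20.0,1
90,1,0.0,0,0.0,0.0,0.0,0
"""

PREDICTORS = ["TIMELR", "TIMECL", "FRQRES", "MEDTOR", "AVGDON", "LSTDON", "ANNDON"]

def load_customers(text):
    """Return (list of predictor dicts, list of 0/1 responses) from CSV text."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append({name: float(row[name]) for name in PREDICTORS})
        labels.append(int(row["DONIND"]))
    return features, labels

features, labels = load_customers(SAMPLE)
donator_share = sum(labels) / len(labels)  # base rate of donators in the sample
print(len(features), donator_share)
```

The base rate is worth knowing before modelling: a classifier is only useful if it beats the accuracy of always predicting the majority class.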

Tools
- Spreadsheet software (e.g. gnumeric, OpenOffice calc, or Excel)
- RapidMiner: an open-source, cross-platform tool with available commercial support

Motivation: current directions in BI (debatable)
- Packaged analytic applications, delivered as both on-premises software and software as a service (SaaS), will push control of the information used for decision making toward business units and away from IT organizations
- The economic crisis will reveal which enterprises have a sound information infrastructure and which do not
- The application of social software to the collaborative decision-making process will demonstrate the business value of the information coming from BI systems by directly tying it to decisions made
Gartner Inc., 2009

Rhine's paradox
- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP)
- He devised an experiment where subjects were asked to guess the color of 10 hidden cards: red or blue
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine's paradox
- He told these people they had ESP and called them in for another test of the same type
- Alas, he discovered that almost all of them had lost their ESP
- What did he conclude?

You shouldn't tell people they have ESP. It causes them to lose it.

Bonferroni's principle
If you look for interesting patterns in more places than your amount of data will support, you are bound to find crap
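The arithmetic behind Rhine's result can be checked directly. The sketch assumes each guess is an independent fair coin flip and uses a hypothetical pool of 1000 subjects; the pool size is not given on the slides.

```python
from fractions import Fraction

# Chance that pure guessing gets all 10 red/blue cards right.
p_all_right = Fraction(1, 2) ** 10

n_subjects = 1000  # hypothetical number of people tested

# Expected number of "ESP" cases produced by luck alone in the first test.
expected_first = n_subjects * p_all_right

# Expected number of those who would also pass a second, identical test.
expected_second = expected_first * p_all_right

print(p_all_right, float(expected_first), float(expected_second))
```

With a success probability of 1/1024, testing about a thousand people is expected to produce roughly one perfect score by chance, and that person is very unlikely to repeat it. This is the pattern Bonferroni's principle warns about.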

1st week of case study
(Download, install, and explore RapidMiner)
1. Develop an understanding of the purpose of the data mining project
2. Obtain the dataset to be used in the analysis
3. Explore the data
(Import data into RapidMiner)