Filip Wójcik, data scientist, senior .NET developer, Wroclaw University lecturer


MACHINE LEARNING: when big data is not enough
Filip Wójcik
Data scientist, senior .NET developer, Wroclaw University lecturer
filip.wojcik@outlook.com

What is machine learning? (1/4)
[Diagram of related terms: artificial intelligence, machine learning, big data, data mining, data science]

What is machine learning? (2/4)
[Diagram of overlapping fields: domain expertise, statistical research, mathematics, data science, machine learning, data processing, computer science]

What is machine learning? (3/4)
- Data volumes are increasing
- Massive amounts of data need to be processed
- Data analysis processes need to be automated

What is machine learning? (4/4)
Big data:
- Storage and processing of large volumes of data
- Highly parallelized algorithms
- Sophisticated architecture
- Hardware-related (clusters, nodes, server machines)
Machine learning:
- Smart data processing methods
- Domain-agnostic
- Technology-agnostic
- Hardware-agnostic
- Predictions and modelling
- Strongly related to statistics

Machine learning tools

Machine learning use cases (1/2)
Supervised: classification, regression
Unsupervised: pattern recognition, grouping
Example applications:
- Customer preference discovery
- Automated construction of expert systems
- Assigning new data to groups
- Market basket analysis
- Explaining data
- Detecting irrelevant features/columns
- Detecting highly correlated features/columns
- Detecting noise
- Financial trend discovery
- Statistical analysis
- Prediction of numerical values/outcomes
- Customer grouping
- Discovering similarities
- Feature importance recognition

Machine learning use cases (2/2)
Black box methods:
- Cannot be interpreted by humans
- Their internal structure is complicated and hard to understand
- Mostly very sophisticated mathematically
- Justifications of predictions are purely mathematical
White box methods:
- Easily interpretable
- Can be translated to a human-friendly form
- Not so sophisticated mathematically

Key data structures (1/3)
Structured data: SQL-like (tables), flat files
Unstructured data: logs, text data, semantic networks

Key data structures (2/3)
Data frame: columns are features/attributes, rows are records/objects.

Company     Financial instruments   Status   Revenue
Company X   Equities                Open     0.6
Company Y   Corporate Bonds         Open     0.03
Company Z   Structure hybrid        Closed   0.02

Company and Financial instruments are discrete features, Status is a Boolean feature, Revenue is a numerical feature.

Key data structures (3/3)
Data frame with the original values:

Company     Financial instruments   Status   Revenue
Company X   Equities                Open     0.6
Company Y   Corporate Bonds         Open     0.03
Company Z   Structure hybrid        Closed   0.02

The same data frame with discrete and Boolean values encoded numerically:

Company   Financial instruments   Status   Revenue
001       001                     1        0.6
010       010                     1        0.03
100       100                     0        0.02
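The encoding on this slide looks like one-hot encoding: each distinct value of a discrete feature gets a bit string with a single 1. A minimal sketch of that idea follows; the function name `one_hot_codes` and the rule "first value gets the rightmost bit" are assumptions chosen to reproduce the codes shown on the slide, not something the slides specify.

```python
def one_hot_codes(values):
    """Map each distinct value to a bit string containing exactly one '1'."""
    distinct = list(dict.fromkeys(values))  # drop duplicates, keep first-seen order
    width = len(distinct)
    codes = {}
    for index, value in enumerate(distinct):
        bits = ["0"] * width
        bits[width - 1 - index] = "1"  # assumption: first value gets the rightmost bit
        codes[value] = "".join(bits)
    return codes


companies = ["Company X", "Company Y", "Company Z"]
statuses = ["Open", "Open", "Closed"]

company_codes = one_hot_codes(companies)
encoded_companies = [company_codes[name] for name in companies]

# A Boolean feature needs only a single 0/1 column.
encoded_statuses = [1 if status == "Open" else 0 for status in statuses]

print(encoded_companies, encoded_statuses)
```

Numerical features such as Revenue are left untouched; only discrete and Boolean columns are re-coded.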

Algorithms overview
[Tree diagram rooted at "Machine learning". Node labels: supervised, unsupervised, learning expert systems, regression, decision trees, rule-based systems, neural networks, optimization, correlations finders, model-based systems, linear, discrete, evolutionary algorithms, clustering, probabilistic expert systems, adjusted regression, swarm algorithms, association miners, fuzzy expert systems]

Supervised learning

Supervised learning (1/3)
Two data sets:
- Training: known answers, given to the algorithm
- Test: known answers, not given to the algorithm
Teacher/oracle:
- Objective rating function
- Checks the algorithm's progress
Learning based on experience:
- Applying the teacher's/oracle's suggestions to improve the score
- Avoiding overfitting

Supervised learning (2/3)
Data partitioning: training data 70%, test data 30%
- Sometimes the amount of data with known answers is limited
- Dividing the data helps to control the learning process better
- It makes the use of the available data more effective
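The 70/30 partitioning can be sketched in a few lines of plain Python. The function name `train_test_split`, the fixed `seed` and the ten placeholder records are assumptions made for illustration; shuffling before cutting is a common precaution, not something the slide prescribes.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # placeholder records with known answers
train, test = train_test_split(records)
print(len(train), len(test))
```

The two parts are disjoint, so the test records stay unseen until the final evaluation.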

Supervised learning (3/3)
Learning loop on the training data:
1. Present the data WITHOUT THE ANSWERS
2. Predict the answers
3. Calculate the error rate
4. Punish bad answers, reward good ones
5. Update internal memory and repeat
When the error rate is low enough: FINAL TEST on the test data
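One way to make this loop concrete is a perceptron, used here only as an illustrative learner (the slides do not name a specific algorithm). In this sketch the "internal memory" is a weight vector plus a bias, and the "punishment" is the weight correction applied after a wrong answer. The data sets, the learning rate and all function names are assumptions.

```python
def predict(weights, bias, features):
    """Answer 1 if the weighted sum is positive, otherwise 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def error_rate(weights, bias, data):
    """Fraction of records whose predicted answer differs from the known one."""
    wrong = sum(predict(weights, bias, x) != answer for x, answer in data)
    return wrong / len(data)


def train(data, learning_rate=1.0, max_epochs=100, target_error=0.0):
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(max_epochs):
        for features, answer in data:
            guess = predict(weights, bias, features)  # the answer is not shown here
            error = answer - guess                    # 0 if right, +1 or -1 if wrong
            # Update internal memory; nothing changes when the guess was right.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
        if error_rate(weights, bias, data) <= target_error:
            break  # error rate is low enough, stop learning
    return weights, bias


# Hypothetical, linearly separable data: the answer is 1 only when both inputs are high.
training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
test_data = [((0.9, 0.9), 1), ((0.1, 0.2), 0), ((1.0, 0.8), 1), ((0.2, 0.9), 0)]

weights, bias = train(training_data)
print("final test error rate:", error_rate(weights, bias, test_data))
```

The test data is touched only once, after training has stopped, which mirrors the final step on the slide.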

Supervised learning decision trees

Supervised learning Decision trees (1/5)
General approach:
- Uses structured data
- Recursive top-down approach: divide and conquer, based on the most promising attributes
- Can use both numerical and discrete data
Pros:
- Very flexible
- Easy to implement
- Easy for humans to interpret
- Can be translated to easy-to-read rules and included in reports/documentation

Supervised learning Decision trees (2/5)
Divide the data using the attributes that reduce the chaos the most:
1. Calculate the entropy/chaos of the input data
2. Select the attribute with the biggest chaos reduction
3. Divide the data using the selected attribute
4. Create a decision node and add child links
5. Process the children recursively
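These steps correspond closely to an ID3-style tree builder. The sketch below is a simplified illustration under several assumptions: entropy is the Shannon entropy of the class labels, only discrete attributes are handled (the numeric `money_spent` column is left out), and the nested-dict tree format and all function names are invented for this example. The rows are the client/hotel/add-ons example used in this deck.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (the 'chaos') of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split(rows, attribute):
    """Group rows by the value they hold for one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def gain(rows, attribute, target):
    """Chaos reduction achieved by dividing the rows on one attribute."""
    remainder = sum(
        len(group) / len(rows) * entropy([r[target] for r in group])
        for group in split(rows, attribute).values()
    )
    return entropy([r[target] for r in rows]) - remainder


def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority answer
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    children = {value: build_tree(group, remaining, target)
                for value, group in split(rows, best).items()}
    return {"attribute": best, "children": children}  # decision node with child links


def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree


rows = [
    {"client": "business", "hotel": "Hilton", "addons": "trip", "offer": "deluxe"},
    {"client": "business", "hotel": "Hilton", "addons": "full board", "offer": "deluxe"},
    {"client": "business", "hotel": "Hilton", "addons": "trip", "offer": "deluxe"},
    {"client": "middle class", "hotel": "Meta", "addons": "none", "offer": "basic"},
    {"client": "middle class", "hotel": "Meta", "addons": "meal", "offer": "basic"},
    {"client": "manager", "hotel": "Meta", "addons": "spa", "offer": "premium"},
]

tree = build_tree(rows, ["client", "hotel", "addons"], "offer")
print(tree)
```

A production implementation would also need thresholds for numeric attributes, handling of attribute values unseen during training, and some form of pruning to limit overfitting.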

Supervised learning Decision trees (3/5)

client         hotel    addons       money_spent   offer
business       Hilton   trip         40,000        deluxe
business       Hilton   full board   38,000        deluxe
business       Hilton   trip         40,000        deluxe
middle class   Meta     none         800           basic
middle class   Meta     meal         900           basic
manager        Meta     spa          1,500         premium

Distribution of the offer column:

Value     Count   Fraction
Deluxe    3       0.5
Basic     2       0.333
Premium   1       0.167

Supervised learning Decision trees (4/5)

client         hotel    addons       money_spent   offer
business       Hilton   trip         40,000        deluxe
business       Hilton   full board   38,000        deluxe
business       Hilton   trip         40,000        deluxe
middle class   Meta     none         800           basic
middle class   Meta     meal         900           basic
manager        Meta     spa          1,500         premium

Decision node: Client == business?

True branch:

hotel    addons       money_spent   offer
Hilton   trip         40,000        deluxe
Hilton   full board   38,000        deluxe
Hilton   trip         40,000        deluxe

False branch:

hotel   addons   money_spent   offer
Meta    none     800           basic
Meta    meal     900           basic
Meta    spa      1,500         premium
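How much "chaos" this particular split removes can be checked numerically. The sketch below assumes Shannon entropy (in bits) as the chaos measure and uses only the client and offer columns of the table; the variable names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# (client, offer) pairs taken from the table on the slide
client_offer = [
    ("business", "deluxe"),
    ("business", "deluxe"),
    ("business", "deluxe"),
    ("middle class", "basic"),
    ("middle class", "basic"),
    ("manager", "premium"),
]

parent = [offer for _, offer in client_offer]
true_branch = [offer for client, offer in client_offer if client == "business"]
false_branch = [offer for client, offer in client_offer if client != "business"]

# Entropy left after the split, weighted by branch size
weighted_children = (
    len(true_branch) * entropy(true_branch)
    + len(false_branch) * entropy(false_branch)
) / len(parent)

information_gain = entropy(parent) - weighted_children
print(round(entropy(parent), 3), round(information_gain, 3))
```

The True branch contains only deluxe offers, so all of the remaining uncertainty sits in the False branch, which still mixes basic and premium.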

Supervised learning Decision trees (5/5)
Use cases:
- Classification/regression tasks
- Explaining complicated data
- Detecting irrelevant features
- Client profiling
- Data visualization
- Building rule systems

Unsupervised learning

Unsupervised learning
One data set:
- A single set of data
- No correct answers provided (in most cases)
No teacher/oracle:
- No option to evaluate predictions against correct answers
- Algorithm evaluation is based on similarity measures, chaos measures, etc.
The algorithm operates on the data on its own:
- It explores the possible partitionings of the data
- It maintains its own internal error measures
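An internal error measure of this kind can be as simple as the scatter of points around their group centre. The sketch below compares two candidate partitionings of the same one-dimensional data using the within-group sum of squared distances; the measure, the data and the function name are assumptions picked for illustration, not taken from the slides.

```python
def within_group_scatter(groups):
    """Sum of squared distances between each point and its group's mean."""
    total = 0.0
    for points in groups:
        center = sum(points) / len(points)
        total += sum((p - center) ** 2 for p in points)
    return total


# Two candidate partitionings of the same six hypothetical measurements
partition_a = [[1.0, 1.2, 0.8], [9.0, 9.5, 8.5]]
partition_b = [[1.0, 9.0, 0.8], [1.2, 9.5, 8.5]]

scatter_a = within_group_scatter(partition_a)
scatter_b = within_group_scatter(partition_b)
print(scatter_a, scatter_b)
```

No correct answers are involved: the partitioning with the lower scatter is preferred purely because its groups are more compact.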

Unsupervised learning association analysis

Unsupervised learning Association analysis (1/3)
General approach:
- Ordered data
- Searching for coincidences/correlations in the data
Features:
- Works only with nominal data or with discretized (binned)/thresholded numeric data
- Easy to implement
- Flexible
- Easy for humans to interpret
- Can significantly reduce the number of irrelevant features

Unsupervised learning Association analysis (2/3)

Transaction number   Products
1                    Soya milk, Salad
2                    Salad, Walnuts, Wine, Bread
3                    Soya milk, Walnuts, Wine, Juice
4                    Salad, Soya milk, Walnuts, Wine
5                    Salad, Soya milk, Walnuts, Juice

Support is the fraction of transactions that contain all of the listed items.

Frequent items                Support
Soya milk, salad              0.6
Soya milk, salad, walnuts     0.4
Salad                         0.8

Implications                            Support
Soya milk => walnuts                    0.6
Soya milk => salad                      0.6
Soya milk, walnuts, wine => juice       0.2
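Support can be computed directly from the five transactions listed on this slide. In the sketch below, `support` is the fraction of transactions containing every item of an item set; `confidence` is an additional standard rule measure that the slide does not show and is included only for context. All function names are illustrative.

```python
transactions = [
    {"soya milk", "salad"},
    {"salad", "walnuts", "wine", "bread"},
    {"soya milk", "walnuts", "wine", "juice"},
    {"salad", "soya milk", "walnuts", "wine"},
    {"salad", "soya milk", "walnuts", "juice"},
]


def count(itemset):
    """Number of transactions that contain every item of the item set."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset):
    """Fraction of all transactions that contain the item set."""
    return count(itemset) / len(transactions)


def rule_support(antecedent, consequent):
    """Support of a rule: both sides must appear in the same transaction."""
    return support(antecedent | consequent)


def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the share that also has the consequent."""
    return count(antecedent | consequent) / count(antecedent)


for items in ({"salad"}, {"soya milk", "salad"}, {"soya milk", "salad", "walnuts"}):
    print(sorted(items), support(items))

print(rule_support({"soya milk"}, {"walnuts"}), confidence({"soya milk"}, {"walnuts"}))
```

Real association miners such as Apriori avoid enumerating every possible item set by discarding any candidate that contains an infrequent subset.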

Unsupervised learning Association analysis (3/3)
Use cases of unsupervised learning algorithms:
- Anomaly detection
- Searching for correlations
- Data explanation
- Pattern recognition
- Irrelevant feature detection
- Clustering

Must-reads

- ML lectures
- Practical examples & code
- Math & theory

THANK YOU!