cse634 DATA MINING Professor Anita Wasilewska Spring 2018

COURSE SYLLABUS

Course Web Page
www.cs.stonybrook.edu/ cse634
The webpage contains:
Detailed Lecture Notes slides
Some Course Book slides
Some previous Research Presentations
Course Syllabus
Please check it often; this is also a way I will communicate with you

Course Text Book
DATA MINING Concepts and Techniques
Jiawei Han, Micheline Kamber
Morgan Kaufmann Publishers, 2003, 2011
Second Edition
There is a new Third Edition, but we will follow the Second one as it is more widely available (and cheaper)
We will follow the book very closely

Course Description
Data Mining, also called Knowledge Discovery in Databases (KDD) and now also called BIG DATA, is a multidisciplinary field.
It brings together research and ideas from database technology, machine learning, statistics, pattern recognition, knowledge-based systems, information retrieval, high-performance computing, and data visualization, to name a few.

Course Description
The main focus of Data Mining is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.
The course will closely follow the book.
Course Lectures are designed to explain in detail the material from the book chapters.

Course Description
The course is designed to give a broad, yet in-depth overview of the Data Mining field.
It will examine the most BASIC recognized algorithms and techniques in rigorous detail.
It will also explore the newest trends and developments in the field through student talks based on recent research developments and papers; these will be the subjects of the students' Research Presentations.

COURSE STRUCTURE
Part 1 Introduction
Book chapters 1, 2 and Lectures 1, 2
Part 2 Classification
Decision Tree Induction and Neural Networks
Book chapter 6 and Lectures 3-7
Team Classification Project
See the Project Description in the Syllabus and check the link on the course website.

COURSE STRUCTURE
Part 3 Association Analysis
Apriori Algorithm
Classification by Association
Book chapter 5 and Lectures 8, 9
Test Review One: Lecture 10
Part 4 Genetic Algorithms
Genetic Algorithms Introduction
Genetic Algorithms Examples
Book chapter 6 and Lectures 11, 12

COURSE STRUCTURE
Test Review Two: Lecture 13
Midterm/Final Test
It is an in-class test and covers material from Parts 1-4
Part 5 Cluster Analysis
Book chapter 7 and Lecture 14
Part 6 Lecture 15
Part 7 Foundations of Data Mining
Students' Research Presentations
Attention: the Project and Research Presentations are to be conducted in teams

GRADING COMPONENTS
During the semester students are responsible for the following (in the order listed):
Team Project (40pts)
Midterm/Final Test (70pts)
Team Research Presentation (60pts)
Final Report (30pts)

FINAL GRADE COMPUTATION
NONE of the GRADES will be CURVED
During the semester you can earn 300pts or more (in the case of extra points)
The % grade will be determined in the following way: # of earned points divided by 3 = % grade
The % grade is translated into a letter grade in the standard way, as follows:
100-90% is the A range: A (100-96%), A- (95-90%)
89-80% is the B range: B+ (89-86%), B (85-83%), B- (82-80%)
79-70% is the C range: C+ (79-76%), C (75-73%), C- (72-70%)
69-60% is the D range
F is below 60%
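The grade computation above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the handling of fractional percentages (a grade falls into the band whose lower bound it reaches) is an assumption, since the slide lists whole-number ranges only.

```python
def percent_grade(points):
    """Convert earned points to a % grade: points divided by 3."""
    return points / 3


# Lower bound of each letter band, taken from the ranges listed on the slide.
# Assumption: a fractional % grade belongs to the band whose lower bound it
# reaches (for example, 89.5% is treated as B+, not A-).
LETTER_BOUNDS = [
    (96, "A"), (90, "A-"),
    (86, "B+"), (83, "B"), (80, "B-"),
    (76, "C+"), (73, "C"), (70, "C-"),
    (60, "D"),
]


def letter_grade(percent):
    """Translate a % grade into a letter grade; anything below 60% is F."""
    for lower, letter in LETTER_BOUNDS:
        if percent >= lower:
            return letter
    return "F"


if __name__ == "__main__":
    for pts in (291, 255, 180, 150):
        pct = percent_grade(pts)
        print(f"{pts} pts -> {pct:.1f}% -> {letter_grade(pct)}")
```

Extra points simply raise the number of earned points, so a % grade above 100 still maps to A in this sketch.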

Course Contents and Schedule
The course will follow the book very closely; in particular, we will cover all or parts of the following chapters and subjects. The order does not need to be sequential.
Chapter 1 Introduction and General overview
What is Data Mining, which data, what kinds of patterns can be mined - Lecture
Chapter 2 Data preprocessing
Data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation - Lecture

Course Contents and Schedule
Chapters 3, 4 Data Warehouse and OLAP technology for Data Mining
Students' Presentations
Chapter 5 Mining Association Rules in Large Databases
Transactional databases and Apriori Algorithm - Lecture and Students' Presentation

Course Contents and Schedule
Chapter 6 Classification and Prediction
1. Decision Tree Induction ID3, C4.5 - Lecture and Students' Presentations
2. Neural Networks - Lecture and Students' Presentations
3. Bayesian Classification - Lecture and Students' Presentations
4. Classification based on Concepts from Association rule mining - Lecture
5. Genetic algorithms - Lecture and Students' Presentations
6. Statistical Prediction - Students' Presentations

Course Contents and Schedule
Chapter 7 Cluster Analysis
A Categorization of major Clustering methods - Lecture and Students' Presentations
Chapters 8-11 Applications and TRENDS in DM
Reading and/or Students' Presentations
Foundations of Data Mining
SPRINGER Encyclopedia of Complexity and Systems Science, 2009
Editor-in-chief: Meyers, Robert A
http://www.springer.com/us/book/9780387758886

PROJECT DESCRIPTION
The project goal is to use Internet-based Classification Tools to build two types of classifiers: descriptive and non-descriptive.
Discuss the results in both cases.
Compare the two approaches on the basis of the obtained results.
The detailed project description is in the course Syllabus. It is also published as a link on the course webpage.

PROJECT DESCRIPTION
Descriptive Classifier
Use a Decision Tree tool to generate sets of discriminant rules describing the content of the data.
Use WEKA (http://www.cs.waikato.ac.nz/ ml/weka/index.html)
Non-Descriptive Classifier
Use a Neural Networks tool to build your classifier.
Use WEKA or a tool of your choice.
Describe the specifics of your tool in a way that makes your report comprehensible to others.

PROJECT DESCRIPTION
The project data is provided on the course web page.
This is real-life classification data with TYPE DE ROCHE (Rock Type) as the class attribute.
There are 98 records with 48 attributes and 6 classes.
This is real-life experimental data and it contains a lot of missing data (no value).
The project has to follow the steps of the DM Process to build different classifiers defined by three experiments.

Project Experiments
Experiment 1
Use the preprocessed data to perform a full classification (learning). This means building a classifier for all classes C1-C6 simultaneously.
Experiment 2
Use the preprocessed data to perform a contrast classification (contrast learning). This means building a classifier contrasting class C1 with a class notC1 that contains the other classes.
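The relabeling step behind Experiment 2 can be sketched as follows. This is a minimal illustration, not part of the project description: the function name and the toy label list are hypothetical, and the actual project data and tool (for example WEKA) are not modeled here.

```python
from collections import Counter


def contrast_labels(labels, target="C1"):
    """Keep the target class and merge every other class into 'not<target>'."""
    return [lab if lab == target else f"not{target}" for lab in labels]


if __name__ == "__main__":
    # Hypothetical class labels; the real data set has 98 records in 6 classes.
    full = ["C1", "C3", "C1", "C6", "C2", "C4"]
    contrast = contrast_labels(full)
    print(contrast)           # labels used for contrast classification
    print(Counter(contrast))  # class balance after merging
```

Experiment 1 would train on the original labels, while Experiment 2 would train on the relabeled ones; merging five classes into one usually shifts the class balance, which is worth reporting alongside the results.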

Project Experiments
Experiment 3
Repeat Experiments 1, 2 for reprocessed data with the most important attributes as defined by the expert.
Write a detailed Project Description with methods, motivations, and results, and submit it via e-mail to the TA and the Professor.
It is a team project. The teams are the same as for the Research Presentation.

Research Presentations
Each presentation must consist of the following two parts:
Part 1 (40pts)
It is a Lecture-type presentation, 20-25 minutes long
Part 2 (20pts)
It is a short, 5-10 minute presentation of a research paper or an application

Research Presentation
Presentation Part 1
The main goal is to teach others the material.
It must be a detailed, Lecture-type presentation.
It can be based on, or extend, content of the book not covered by the course lectures.
It can also cover other subjects not covered in the course book and taken from other sources.

Research Presentation
Presentation Part 2
It is a presentation of a research paper or of a recent commercial application connected with the subject covered in Presentation Part 1.
The structure of Presentation Parts 1, 2 is described in the Syllabus.
Each group member must present some part of the whole group's work. How you decide to do it is left to you as a group.

Presentation Subjects
Students can find their own subjects, but here are suggestions of some possible subjects:
Data Warehouse and OLAP technology - Chapter 3 of the Book
Data Cube Computation and Data Generalization - Chapter 4 of the Book

Presentation Subjects
Statistical Methods 1 - Statistical Prediction, Prediction by Regression, or any other purely statistical methods
Statistical Methods 2 - Bayesian Classification
Statistical Methods 3 - Cluster Analysis and categorization of major Clustering methods
Evolutionary Computing
Genetic algorithms as optimization
Genetic algorithms as classification
Other evolutionary computing methods

Presentation Subjects
NEW ADVANCES in Data Mining
Deep Learning
Web Mining - an overview of methods and problems
Text Mining - an overview of methods and problems
Visualization and DM techniques
Natural Language Processing and DM techniques
FIND YOUR OWN subject and discuss it with the Professor

FINAL REPORT
Each student has to write a report about 10 research presentations.
The detailed format of the report is in the course Syllabus. It is also published as a link on the course webpage.