MD - Data Mining

Coordinating unit: 270 - FIB - Barcelona School of Informatics
Teaching unit: 715 - EIO - Department of Statistics and Operations Research; 723 - CS - Department of Computer Science
Academic year: 2017
Degree: BACHELOR'S DEGREE IN INFORMATICS ENGINEERING (Syllabus 2010). (Teaching unit Optional)
ECTS credits: 6
Teaching languages: Catalan

Teaching staff
Coordinator:
- Karina Gibert Oliveras (karina.gibert@upc.edu)
- Mario Martín Muñoz (mmartin@cs.upc.edu)

Prior skills
Foundations of probability and statistics. Basic programming in R.

Requirements
- Prerequisite PE
- Prerequisite PRO

Degree competences to which the subject contributes
Specific:
CSI2.2. To conceive, deploy, organize and manage computer systems and services, in business or institutional contexts, to improve the business processes; to take responsibility and lead the start-up and the continuous improvement; to evaluate their economic and social impact.
CSI2.3. To demonstrate knowledge and application capacity of extraction and knowledge management systems.
CSI2.6. To demonstrate knowledge and capacity to apply decision support and business intelligence systems.
Generic:
G3. THIRD LANGUAGE: to know the English language at an adequate oral and written level, according to the needs of graduates in Informatics Engineering. Capacity to work in a multidisciplinary group and in a multi-language environment and to communicate, orally and in writing, knowledge, procedures, results and ideas related to the technical informatics engineer profession.
G9. PROPER THINKING HABITS: capacity for critical, logical and mathematical reasoning. Capacity to solve problems in their study area. Abstraction capacity: capacity to create and use models that reflect real situations. Capacity to design and perform simple experiments and to analyse and interpret their results. Analysis, synthesis and evaluation capacity.

Teaching methodology
The learning methodology consists of the analysis of case studies concerning complex data sets from real problems. The necessary body of scientific knowledge is introduced from these problems. Theoretical and practical lessons are interleaved so that programming and/or integrating data mining functions reinforces the concepts explained. The open programming environment R will be used in the laboratory.
The laboratory classes are devoted to solving problems related to the knowledge provided in the theory classes, after which the students solve a similar problem themselves. This problem may include very brief conceptual questions and is handed in for evaluation.
Finally, the students must complete two full practical works: a statistical modelling problem and a modelling problem of the "scientific", "transaction" or "marketing" kind (the student chooses only one of the three). This last practical work is presented orally to the whole class.

Learning objectives of the subject
1. Know the main types of Data Mining problems
2. Data quality assessment and preprocessing
3. Problem solving: identify the statistical and/or machine learning techniques most appropriate to solve the problem
5. Implement simple learning algorithms
6. Validation of results
7. Presentation of results in a professional environment for decision making

Study load
Total learning time: 150h
- Hours large group: 30h (20.00%)
- Hours medium group: 0h (0.00%)
- Hours small group: 30h (20.00%)
- Guided activities: 6h (4.00%)
- Self study: 84h (56.00%)

Content
- Introduction to Data Mining: statistical modelling and types of problems: analysis of binary data ("transactions"), analysis of scientific data and analysis of data from enterprises.
- Visualization and dimensionality reduction: feature selection and extraction. Visualization of multivariate data.
- Clustering: direct partitioning methods, hierarchical methods and expectation maximization.
- Predictive Methods: multiple and generalized linear regression. Logistic regression. Neural networks.
- Decision Trees: classification and regression trees (CART).
- Validation protocols and data resampling: holdout, cross-validation and the bootstrap.

- Generation of association rules: Apriori and Eclat algorithms.
- Discriminant Analysis: Bayesian decision theory. LDA and QDA discriminant analysis and Naïve Bayes.
- Non-parametric discrimination: nearest neighbours.
- Regression Shrinkage and Variable Selection: regularized linear regression. LASSO and the Elastic Net methods.
- Formal concept analysis: formal method for pattern finding.
- Preprocessing a

- Bagging and ensemble methods

Planning of activities

Development of Unit 1
Hours: 2h. Theory classes: 2h. Laboratory classes: 0h. Self study: 0h. Learning objectives: 1.

A review of the R language
Hours: 6h. Theory classes: 0h. Laboratory classes: 6h. Self study: 0h.

Development of item 2
Hours: 16h. Theory classes: 4h. Laboratory classes: 4h. Self study: 8h.

Development of item 3
Hours: 9h. Laboratory classes: 2h.

Development of item 4
Hours: 11h. Laboratory classes: 4h.

Development of item 5
Hours: 9h. Laboratory classes: 2h.

Development of item 6
Hours: 7h. Laboratory classes: 0h.

Development of item 7
Hours: 9h. Laboratory classes: 2h.

Development of item 8
Hours: 11h. Laboratory classes: 4h.

Development of item 9
Hours: 11h. Laboratory classes: 2h. Self study: 6h.

Development of item 10
Hours: 13h. Laboratory classes: 4h. Self study: 6h. Learning objectives: 6.

Practice 1
Hours: 3h. Guided activities: 3h. Self study: 0h. Learning objectives: 2, 3, 5, 6.

Practice 2
Hours: 3h. Guided activities: 3h. Self study: 0h. Learning objectives: 3, 5, 6, 7.

Qualification system
The evaluation of the course is based on the grades obtained in the exercises developed during the lab sessions and on two practical works. For each practical work, the student hands in the corresponding written report. At the end of the course, the students must also present the second practical work orally. The student is required to demonstrate the necessary reasoning as well as English skills; these skills are evaluated using the corresponding rubrics.
The overall laboratory grade is the average of the grades obtained for the exercises developed during the laboratory sessions. The final mark is obtained as follows:
Lab = overall laboratory grade
PR1 = grade for the first practical work
PR2 = grade for the second practical work
Final grade = 0.2*Lab + 0.4*PR1 + 0.4*PR2
In each practical work (counting 40% each), 35% corresponds to technical correctness and 5% to the 'reasoning' generic competence, so that this competence has an overall weight of 10% of the final grade.
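As a worked illustration of the weighting above, here is a minimal sketch in Python (the course laboratories use R; Python is chosen here only for illustration). The function name `final_grade`, the 0 to 10 marking scale and the sample marks are hypothetical. The laboratory weight of 0.2 is an assumption made so that the three weights sum to 1, alongside the stated 40% for each practical work.

```python
# Assumed weights: 0.2 for the laboratory average (so the weights sum to 1),
# and 0.4 for each practical work, as stated in the grading description.
WEIGHTS = {"lab": 0.2, "pr1": 0.4, "pr2": 0.4}

# Each practical work's 40% splits into 35% technical correctness
# and 5% for the 'reasoning' generic competence.
REASONING_PER_PRACTICAL = 0.05
REASONING_TOTAL = 2 * REASONING_PER_PRACTICAL  # overall weight of the competence


def final_grade(lab: float, pr1: float, pr2: float) -> float:
    """Return the weighted final grade from the three partial grades."""
    return (
        WEIGHTS["lab"] * lab
        + WEIGHTS["pr1"] * pr1
        + WEIGHTS["pr2"] * pr2
    )


if __name__ == "__main__":
    # Hypothetical marks on an assumed 0-10 scale.
    print(round(final_grade(lab=7.0, pr1=8.0, pr2=6.5), 2))
    print(round(REASONING_TOTAL, 2))
```

With these hypothetical marks the laboratory contributes 1.4 points and the two practical works contribute 3.2 and 2.6 points, which shows how strongly the practical works dominate the final mark.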

Bibliography
Basic:
- Hand, D.J. Construction and assessment of classification rules. Wiley, 1997. ISBN 978-0-471-96583-1.
- Hastie, T.; Tibshirani, R.; Friedman, J. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer, 2009. ISBN 9780387848570.
- Hernández Orallo, J.; Ramírez Quintana, M.J.; Ferri Ramírez, C. Introducción a la minería de datos. Pearson, 2004. ISBN 9788420540917.
- Maindonald, J.H.; Braun, J. Data analysis and graphics using R: an example-based approach. 3rd ed. Cambridge University, 2010. ISBN 9780521762939.
- Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern classification. 2nd ed. John Wiley & Sons, 2001. ISBN 0-471-05669-3.
Complementary:
- Aluja Banet, T.; Morineau, A. Aprender de los datos: el análisis de componentes principales: una aproximación desde el Data Mining. EUB, 1999. ISBN 9788483120224.
Other resources:
- http://www.cran.es.r-project.org
- http://www.kdnuggets.com/
- http://www.cs.waikako.ac.nz