CSCI 6366.01, Data Mining and Warehousing
Spring 2015

Instructor: Zhixiang Chen
Office: ENGR 3.272
Phone: 665-3520
Email: zchen@utpa.edu
WWW Home Page: faculty.utpa.edu/zchen/

Office Hours: Monday Tuesday Wednesday Thursday Friday
              10:45 PM -- 11:45 PM    4:35 AM -- 5:35 AM

Lectures: CSCI 6366, Thursday 5:45 -- 8:25 PM, ENGR 1.272

Course Description:
UTPA Graduate Catalog: CSCI 6366 Data Mining and Warehousing. As a multidisciplinary field, data mining draws on work from areas including database technology, artificial intelligence, machine learning, neural networks, statistics, information retrieval, and data visualization. Theoretical and practical methods will be presented on knowledge discovery and systems design and implementation.

Text and Materials:
The textbook is "Introduction to Data Mining", Pang-Ning Tan, Michael Steinbach, Vipin Kumar, 2nd (or first) edition, Pearson/Addison Wesley.

Other suggested materials:
Will be given in class as the semester progresses.

Prerequisites:
CSCI 6305 Foundation of Algorithms, Data and Programming Languages in Computer Science: In-depth analysis of computing algorithms and data structures for implementation in the context of software engineering design using structured programming languages.

Course Topics:

Introduction
  o What Is Data Mining?
  o Motivating Challenges
  o The Origins of Data Mining
  o Data Mining Tasks

Data
  o Types of Data
  o Data Quality
  o Data Preprocessing
  o Measures of Similarity and Dissimilarity

Exploring Data
  o The Iris Data Set
  o Summary Statistics
  o Visualization
  o OLAP and Multidimensional Data Analysis

Classification: Basic Concepts, Decision Trees, and Model Evaluation
  o Preliminaries
  o General Approach to Solving a Classification Problem
  o Decision Tree Induction
  o Model Overfitting
  o Evaluating the Performance of a Classifier
  o Methods for Comparing Classifiers

Classification: Alternative Techniques
  o Rule-Based Classifier
  o Nearest-Neighbor Classifiers
  o Bayesian Classifiers
  o Artificial Neural Network (ANN)
  o Support Vector Machine (SVM)
  o Ensemble Methods
  o Class Imbalance Problem

Association Analysis: Basic Concepts and Algorithms
  o Problem Definition
  o Frequent Itemset
  o Rule Generation
  o Compact Representation of Frequent Itemsets
  o Alternative Methods for Generating Frequent Itemsets
  o FP-Growth Algorithm
  o Evaluation of Association Patterns
  o Effect of Skewed Support Distribution

Association Analysis: Advanced Concepts
  o Handling Categorical Attributes
  o Handling Continuous Attributes
  o Handling a Concept Hierarchy
  o Sequential Patterns
  o Subgraph Patterns
  o Infrequent Patterns

Cluster Analysis: Basic Concepts and Algorithms
  o Overview
  o K-means
  o Agglomerative Hierarchical Clustering
  o DBSCAN
  o Cluster Evaluation

Cluster Analysis: Additional Issues and Algorithms
  o Characteristics of Data, Clusters, and Clustering Algorithms
  o Prototype-Based Clustering
  o Density-Based Clustering
  o Graph-Based Clustering
  o Scalable Clustering Algorithms
  o Which Clustering Algorithm?

Anomaly Detection
  o Preliminaries
  o Statistical Approaches
  o Proximity-Based Outlier Detection
  o Density-Based Outlier Detection
  o Clustering-Based Techniques

Course Objectives:
After completing this course, you should be able to:
  o Understand algorithms and methods of data mining.
  o Develop data mining programs and applications.
  o Program using available data mining tools and general-purpose languages.
  o Understand analysis, metrics, visualization, and navigation of data mining results.
  o Use a few commercial data mining tools.
  o Know basic techniques for both directed and undirected knowledge discovery.
  o Know and use software package techniques for mining.
  o Have a good understanding of data mining techniques: association rules, clustering, anomaly detection, etc.
  o Design data schemas for a warehouse environment.

Student Learning Outcomes:
Upon successful completion of the course, students are able to:
  o Understand the basic principles of the primary data mining techniques.
  o Understand the difference between data mining, data warehousing, machine learning, etc.
  o Design mining models and manage databases to enable data mining technologies as part of larger systems.

Exam, Assignment and Grading:

Midterm                        20%
Final                          30%
Project one                    10%
Project two                    10%
Project three                  10%
Term paper (just one)          10%
Presentation of term paper      5%
Attendance                      5%
Total                         100%

The letter grade will be determined as follows:
A: 90-100%   B: 80-89%   C: 70-79%   D: 60-69%   F: 0-59%

Assignment Policies:
All assignments must be in the instructor's hands before class on the due date, which will be specified on each assignment. Late assignments will be accepted up to two days late with a one-time 30% late penalty. Any work submitted more than two days past the deadline will not be accepted. Assignments will be graded on the basis of correctness, logic, clarity, motivation, and style.

Unless stated otherwise, all assignments are individual assignments and are expected to be a student's own work. General discussions regarding understanding problems are encouraged, but giving or receiving major sections of solutions to problems will be considered cheating and will be dealt with on an individual basis.

Attendance:
You are responsible for all materials covered in class, the textbook, and homework assignments.

Integrity:
Cheating of any kind will not be tolerated. Any assignment or exam that is handed in must be your own work. However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important.

If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. We will not distinguish between the person who copied a solution and the person whose solution was copied. Both people will be treated as cheaters. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words, based on your own understanding of the solution.

If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage you to help one another understand the material presented in class, in the book, and general issues relevant to the assignments.

When taking an exam, you must work independently. Any collaboration during an exam will be considered cheating. When cheating is caught, zero marks will be given for the work in question, and the case will be forwarded to the Department chair and beyond if necessary.

ADA Announcement:
If you have a documented disability which will make it difficult for you to carry out the work as I have outlined here and/or if you need special accommodation/assistance due to a disability, please contact the Office of Services for Persons with Disabilities (OSPD), Emilia Ramirez-Schunior Hall, Room 1.101, immediately, or the Associate Director at MAUREEN@UTPA.EDU, Ext. 7005. Appropriate arrangements/accommodations can be arranged. Verification of disability and processing of special services required, such as note takers, extended test time, and separate accommodations for testing, will be determined by OSPD. Please do not assume adjustments/accommodations are impossible. Please consult with the Associate Director, OSPD, at Ext. 7005.
Additional Policies:

Collaboration
All assignments in this course are to be done individually. This does not mean that you cannot discuss anything about this course with others. What it does mean is that anything that you hand in must accurately represent your knowledge and work.

Plagiarism
This class will heavily involve the use of the written works of others. Your own written work will involve discussing the ideas of others. When using the ideas of others, it is important to acknowledge whose ideas you are using, and to clearly distinguish the ideas of others from your own. To convey the impression, whether inadvertently or deliberately, that another's work is your own, is called plagiarism. Plagiarism is a serious offense in the university.