ROCHESTER INSTITUTE OF TECHNOLOGY
COURSE OUTLINE FORM

COLLEGE OF SCIENCE
School of Mathematical Sciences

NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

1.0 Course Designations and Approvals

Required course approvals:            Approval request date:    Approval granted date:
Academic Unit Curriculum Committee
College Curriculum Committee

Optional designations:    Is designation desired?    *Approval request date:    **Approval granted date:
General Education:        No
Writing Intensive:        No
Honors                    No

2.0 Course information:

Course title: Principles of Statistical Data Mining
Credit hours: 3
Prerequisite(s): One course in basic statistics
Co-requisite(s): None
Course proposed by: Ernest Fokoué
Effective date: August 2013

                   Contact hours    Maximum students/section
Classroom          3                25
Lab                0
Studio             0
Other (specify)    0

2.a Course Conversion Designation*** (Please check which applies to this course).
*For more information on Course Conversion Designations please see page four.

Semester Equivalent (SE) Please indicate which quarter course it is equivalent to:
Semester Replacement (SR) Please indicate the quarter course(s) this course is replacing: 0307-846- Principles of Statistical Data Mining

2.b Semester(s) offered (check)
September 2010
Fall (online)    Spring (campus)    Summer    Other

All courses must be offered at least once every 2 years. If course will be offered on a bi-annual basis, please indicate here:

2.c Student Requirements

Students required to take this course: (by program and year, as appropriate)
None

Students who might elect to take the course:
This is an elective for graduate students in the Advanced Certificate and MS programs in Applied Statistics. Graduate students in other programs who are interested in statistical data mining may also elect to take this class.

In the sections that follow, please use sub-numbering as appropriate (e.g., 3.1, 3.2, etc.)

3.0 Goals of the course (including rationale for the course, when appropriate):

3.1 To achieve a practical understanding of modern statistical data mining techniques.
3.2 To develop the ability to correctly apply modern data mining techniques to a variety of real-world case studies involving massive, high-dimensional, complex data.
3.3 To gain hands-on experience with data mining through case studies such as: describing website visitors, market basket analysis, describing customer satisfaction, predicting credit risk of small businesses, predicting e-learning student performance, predicting customer lifetime value, and operational risk management.

4.0 Course description (as it will appear in the RIT Catalog, including pre- and corequisites, and quarters offered).
Please use the following format:

COS-STAT-747 Principles of Statistical Data Mining I
This course covers topics such as clustering, classification and regression trees, multiple linear regression under various conditions, logistic regression, PCA and kernel PCA, model-based clustering via mixture of Gaussians, spectral clustering, text mining, neural networks, support vector machines, multidimensional scaling, variable selection, model selection, k-means clustering, k-nearest neighbors classifiers, statistical tools for modern machine learning and data mining, naïve Bayes classifiers, variance reduction methods (bagging) and ensemble methods for predictive optimality.

5.0 Possible resources (texts, references, computer packages, etc.)

Required texts
5.1 Applied Data Mining for Business and Industry, 2nd ed., Paolo Giudici and Silvia Figini (2009), Wiley, ISBN 978-0-470-74582-3

Recommended Texts
5.2 Statistical Data Mining Using SAS Applications, 2nd ed., George Fernandez (2009), CRC Press, ISBN 978-1-439-81075-3
5.3 Data Mining Using SAS Enterprise Miner, Randall Matignon (2009), Wiley
5.4 Getting Started with SAS Enterprise Miner (From SAS)
5.5 Applied Analytics Using SAS Enterprise Miner (From SAS)

6.0 Topics (outline):

6.1. Complex data structures and the emergence of Data Mining and Machine Learning
6.2. Measures of location and measures of variability
6.3. Distance measures, Similarity Measures and Dependency measures
6.4. Multiple linear regression and its extensions to Radial Basis Function regression
6.5. Difference of focus between model identification and predictive optimality
6.6. Principles and applications of dimensionality reduction techniques
6.7. Principal Component Analysis and Singular Value Decomposition
6.8. Cluster analysis via Hierarchical and Non-Hierarchical Methods
6.9. Factor Analysis and Mixtures of Factor Analyzers
6.10. Multidimensional scaling and its relationship to other techniques
6.11. Model Based Clustering via Mixtures of Gaussians
6.12. Logistic regression for Pattern Recognition
6.13. Linear and Quadratic Discriminant Analysis
6.14. Classification and Regression Trees
6.15. Neural networks: Multilayer Perceptron and Kohonen networks
6.16. Support Vector Machines for classification and regression
6.17. Nearest-neighbor models: k-means and k-Nearest Neighbors
6.18. Variance Reduction Techniques: Bagging Predictors
6.19. Non-parametric modeling and Bayesian Modeling
6.20. Generalized linear models and Log-linear models
6.21. Graphical models and their applications
6.22. Model Evaluation and model selection techniques
6.23. Ensemble Methods for Predictive Optimality: Boosting
7.0 Intended course learning outcomes and associated assessment methods of those outcomes (please include as many Course Learning Outcomes as appropriate, one outcome and assessment method per row).

Course Objectives
Assessment Method: Homework, Exams, Projects

Level 2: Comprehension:
2.1. Understands the central role of model uncertainty in data mining, and maintains a keen awareness of the difference between accurate model identification and optimal prediction
2.2. Appreciates and takes into account the ever-present bias/variance dilemma in model selection and model building, and strives to find solutions that achieve a bias/variance trade-off
2.3. Knows when and how to combine unsupervised learning techniques (e.g., PCA for feature extraction) with supervised learning techniques (e.g., neural networks) to achieve optimality
2.4. Recognizes when and how to use ensemble methods rather than selecting a single model, and also knows when to use variance reduction techniques like bagging
2.5. Understands the profound meaning of the No Free Lunch theorem, refrains from relying solely on a single data mining method, and always compares various methods before making recommendations

Level 3: Application:
3.1. Identifies an interesting real-world engineering problem during the course of study and formulates it statistically
3.2. Recognizes, for each real-world case study, which classes of data mining methods are more appropriate
3.3. Uses statistical software like SAS Enterprise Miner to perform a thorough data mining analysis of real-world problems

Level 4: Analysis:
4.1. Determines which statistical model(s) appear to be most appropriate for the task at hand in light of the graphs and descriptive statistics obtained from exploratory data analysis
4.2. Fits the chosen plausible model(s) using a statistical software package like SAS Enterprise Miner, then extracts and interprets the estimates of the parameters
4.3. Performs additional statistical hypothesis tests wherever needed
4.4. Checks all the assumptions underlying each method/technique used
4.5. Interprets the statistical estimation and prediction results produced by the software package

Level 5: Synthesis:
5.1. Selects the best model according to some of the usual model selection criteria
5.2. Provides any needed/required formal prediction or estimation
5.3. Uses an ensemble (aggregation) of methods wherever the need arises
5.4. Draws conclusions and interpretations about the original engineering task based on sound formal analysis such as confidence intervals and results of hypothesis testing

Level 6: Evaluation:
6.1. Evaluates several potential statistical models and decides on the most appropriate one for a given purpose
6.2. Provides any needed/required formal prediction or estimation
6.3. Makes recommendations in clear, nontechnical language based on a thorough assessment of the statistical findings
8.0 Program outcomes and/or goals supported by this course

Relationship to Program Outcomes (1 = slightly, 2 = moderately, 3 = significantly)

Program Outcomes and/or Goals for CQAS        Level of Support: 1  2  3

8.1 Advanced Certificate in Lean Six Sigma
8.1.1 Demonstrates a solid understanding of statistical thinking and Lean Six Sigma methodology in solving real-world problems.
8.1.2 Leads Lean Six Sigma improvement projects.

8.2 Advanced Certificate and Master of Science in Applied Statistics
8.2.1 Demonstrates a solid understanding of statistical thinking and applied statistics methodology in solving real-world problems.
8.2.2 Designs studies that are efficient and valid.
8.2.3 Analyzes data using appropriate statistical methods.
8.2.4 Communicates the results of statistical analysis with effective reports and presentations.

Note: Students obtaining the Advanced Certificate in Applied Statistics will not be expected to perform at the same level as students obtaining a Master of Science degree.

9.0 - Not Applicable

General Education Learning Outcome Supported by the Course, if appropriate        Assessment Method

Communication
Express themselves effectively in common college-level written forms using standard American English
Revise and improve written and visual content
Express themselves effectively in presentations, either in spoken standard American English or sign language (American Sign Language or English-based Signing)
Comprehend information accessed through reading and discussion

Intellectual Inquiry
Review, assess, and draw conclusions about hypotheses and theories
Analyze arguments, in relation to their premises, assumptions, contexts, and conclusions
Construct logical and reasonable arguments that include anticipation of counterarguments
Use relevant evidence gathered through accepted scholarly methods and properly acknowledge sources of information

Ethical, Social and Global Awareness
Analyze similarities and differences in human experiences and consequent perspectives
Examine connections among the world's populations
Identify contemporary ethical questions and relevant stakeholder positions

Scientific, Mathematical and Technological Literacy
Explain basic principles and concepts of one of the natural sciences
Apply methods of scientific inquiry and problem solving to contemporary issues
Comprehend and evaluate mathematical and statistical information
Perform college-level mathematical operations on quantitative data
Describe the potential and the limitations of technology
Use appropriate technology to achieve desired outcomes

Creativity, Innovation and Artistic Literacy
Demonstrate creative/innovative approaches to course-based assignments or projects
Interpret and evaluate artistic expression considering the cultural context in which it was created

10.0 Other relevant information (such as special classroom, studio, or lab needs, special scheduling, media requirements, etc.)
None

*Optional course designation; approval request date: This is the date that the college curriculum committee forwards this course to the appropriate optional course designation curriculum committee for review. The chair of the college curriculum committee is responsible to fill in this date.

**Optional course designation; approval granted date: This is the date the optional course designation curriculum committee approves a course for the requested optional course designation. The chair of the appropriate optional course designation curriculum committee is responsible to fill in this date.

***Course Conversion Designations
Please use the following definitions to complete table 2.a on page one.

Semester Equivalent (SE) Closely corresponds to an existing quarter course (e.g., a 4 quarter credit hour (qch) course which becomes a 3 semester credit hour (sch) course.) The semester course may develop material in greater depth or length.

Semester Replacement (SR) A semester course (or courses) taking the place of a previous quarter course(s) by rearranging or combining material from a previous quarter course(s) (e.g. a two semester sequence that replaces a three quarter sequence).

New (N) - No corresponding quarter course(s).