ST 562: Data Mining with SAS Enterprise Miner

In Workflow
1. 17ST GR Director of Curriculum (demarti4@ncsu.edu; bondell@stat.ncsu.edu)
2. 17ST Grad Head (demarti4@ncsu.edu; bondell@stat.ncsu.edu; fuentes@ncsu.edu)
3. COS CC Coordinator GR (alun_lloyd@ncsu.edu; clbowma2@ncsu.edu)
4. COS CC Meeting GR (alun_lloyd@ncsu.edu; clbowma2@ncsu.edu)
5. COS CC Chair GR ()
6. COS Final Review GR (clbowma2@ncsu.edu; alun_lloyd@ncsu.edu)
7. COS Dean GR (cohen@math.ncsu.edu)
8. ABGS Coordinator (george_hodge@ncsu.edu; lian_lynch@ncsu.edu; mlnosbis@ncsu.edu)
9. ABGS Meeting (george_hodge@ncsu.edu; lian_lynch@ncsu.edu; mlnosbis@ncsu.edu)
10. ABGS Chair (george_hodge@ncsu.edu; lian_lynch@ncsu.edu; mlnosbis@ncsu.edu)
11. Grad Final Review (george_hodge@ncsu.edu; lian_lynch@ncsu.edu; mlnosbis@ncsu.edu)
12. PeopleSoft (ldmihalo@ncsu.edu; blpearso@ncsu.edu; Charles_Clift@ncsu.edu; jmharr19@ncsu.edu; Tracey_Ennis@ncsu.edu)

Approval Path
1. Thu, 17 Mar 2016 17:01:52 GMT Donald Martin (demarti4): Approved for 17ST GR Director of Curriculum
2. Thu, 17 Mar 2016 17:19:44 GMT Donald Martin (demarti4): Approved for 17ST Grad Head
3. Thu, 17 Mar 2016 17:38:13 GMT Cheryll Bowman-Medhin (clbowma2): Approved for COS CC Coordinator GR
4. Thu, 17 Mar 2016 17:41:43 GMT Cheryll Bowman-Medhin (clbowma2): Approved for COS CC Meeting GR
5. Wed, 06 Apr 2016 12:22:36 GMT Melissa Nosbisch (mlnosbis): Approved for COS CC Chair GR
6. Wed, 06 Apr 2016 12:27:20 GMT Melissa Nosbisch (mlnosbis): Approved for COS Final Review GR
7. Wed, 06 Apr 2016 14:50:45 GMT Jo-Ann Cohen (cohen): Approved for COS Dean GR
8. Mon, 11 Apr 2016 18:01:31 GMT George Hodge (george_hodge): Approved for ABGS Coordinator
9. Thu, 21 Apr 2016 13:27:44 GMT Melissa Nosbisch (mlnosbis): Approved for ABGS Meeting

New Course Proposal
Date Submitted: Thu, 17 Mar 2016 16:48:09 GMT
Viewing: ST 562: Data Mining with SAS Enterprise Miner
Changes proposed by: boos
Change Type: Major
Course Prefix: ST (Statistics)
Course Number: 562
Dual-Level Course:
Cross-listed Course:
Title: Data Mining with SAS Enterprise Miner
Abbreviated Title: Data Mining with SAS
College: College of Sciences
Academic Org Code: Statistics (17ST)
CIP Discipline Specialty Number: 27.0501
CIP Discipline Specialty Title: Statistics, General.
Term Offering: Spring Only
Year Offering: Offered Every Year
Effective Date: Spring 2017
Previously taught as Special Topics? Yes
Number of Offerings within the past 5 years: 5

Course Prefix/Number       Semester/Term Offered    Enrollment
610, 610, 610, 590, 590    spring                   18, 27, 25, 49, 44

Course Delivery: Face-to-Face (On Campus); Distance Education (DELTA)
Grading Method: Graded/Audit
Credit Hours: 3
Course Length: 15 weeks

Contact Hours (Per Week)
Component Type: Lecture
Contact Hours: 3

Course Is Repeatable for Credit:
Instructor Name: David Dickey
Instructor Title: Professor
Grad Faculty Status: Full

Anticipated On-Campus Enrollment
Enrollment Component    Per Semester    Per Section    Multiple Sections?    Comments
Lecture                 50              50

DELTA/Online Enrollment:
Delivery Format         Per Semester    Per Section    Multiple Sections?    Comments
LEC                     30              30

Course Prerequisites, Corequisites, and Restrictive Statement
ST 512 or ST 514 or ST 515 or ST 517

Is the course required or an elective for a Curriculum?

Catalog Description
This is a hands-on course using modeling techniques designed mostly for large observational studies. Estimation topics include recursive splitting, ordinary and logistic regression, neural networks, and discriminant analysis. Clustering and association analysis are covered under the topic unsupervised learning, and the use of training and validation data sets is emphasized. Model evaluation alternatives to statistical significance include lift charts and receiver operating characteristic curves. SAS Enterprise Miner is used in the demonstrations, and some knowledge of basic SAS programming is helpful.

Justification for new course:
We are in an era in which large amounts of data are being collected, sometimes without a particular goal in mind, then later used for decision making. These data are typically observational in nature rather than from controlled studies, and there can be outliers and large chunks of missing values in some of the variables. Such data call for additional tools in the modern analyst's tool bag.
Fast methods that accommodate missing values and outliers, such as recursive splitting methods, have arisen, and computer methods for speeding up traditional analyses like logistic regression have been included in software such as SAS Institute's Enterprise Miner package. Flexible models like neural networks have developed a following among analysts. When loyalty cards are scanned at a store, they provide data for association analysis. Learning which items are purchased together and segmenting customers by clustering have also found application in business. The demand for graduates with the skills to analyze such data far exceeds the supply, and that demand is growing. Hands-on experience with an industrial-strength data mining package that has all of the above capabilities, such as the SAS Enterprise Miner used in this course, empowers our students at NC State to be competitive in the workforce.

Does this course have a fee?

Consultation

Instructional Resources Statement
Since the course has been taught for 5 years as a special topics course, there will be no need for additional resources.

Course Objectives/Goals
The goal of this course is to introduce the basic elements of data mining techniques to students with backgrounds equivalent to that supplied by the department's statistical methodology sequence. Students will get hands-on experience with the SAS Enterprise Miner product as well as SAS programming through in-class demonstrations and practice with homework data sets.

Student Learning Outcomes
By the end of this course, the students will be able to:
  Use SAS Enterprise Miner to run analyses
  Check for problem data and mitigate the problems
  Use classification and regression trees to perform recursive splitting
  Perform and interpret logistic regression
  Evaluate and compare models with modern tools like lift charts
  Run discriminant analysis and compare it to modern methods
  Fit neural network models to data
  Perform cluster analyses for large data sets
  Use association and sequence analysis on large data sets

Student Evaluation Methods
Evaluation Method    Weighting/Points for Each    Details
Homework             20%
Exam                 20%
Exam                 20%
Exam                 20%
Final Exam           20%

Topical Outline/Course Schedule

Topic: Overview, diagrams, ordinary regression
Time Devoted to Each Topic: 2 weeks
Activity: Summary of upcoming topics with examples, creating data mining diagrams, setting up the environment for running SAS Enterprise Miner, linking data sets to be used in examples

Topic: Classification Trees, Regression Trees
Time Devoted to Each Topic: 4 weeks
Activity: Use of chi-square tests in recursive splitting, Interpretation of decision trees using the famous Framingham Heart Study data, Splitting algorithms for decision trees, Treatment of missing values, Simplifying decision trees using validation data, Building trees for estimates, decisions, or ranking gives different results, Build several trees on an example data set, Compare decision trees to regression trees and give a regression tree example, Compute lift charts for trees

Topic: Discriminant Analysis
Time Devoted to Each Topic: 2 weeks
Activity: Review multivariate normal distribution, Develop discriminant functions from normal distribution definition, Discuss the role of priors in discriminant analysis, Interpret posterior probabilities and error rates, Compare quadratic discriminants to linear ones

Topic: Ordinary and Logistic Regression
Time Devoted to Each Topic: 3 weeks
Activity: Explain the need for new regression methods when the response is categorical (focus on binary), Show the logistic function, Develop maximum likelihood estimators, graph the likelihood function and discuss Gauss-Newton estimation, Show a logistic example within SAS (space shuttle O-ring data), Review additional data cleaning steps needed here but not in tree-based methods, Develop logistic regressions within Enterprise Miner, Interpret logistic output including a discussion of concordance

Topic: Neural Networks
Time Devoted to Each Topic: 1 week
Activity: Relate hyperbolic tangent functions to familiar logistic functions, Demystify neural nets somewhat by showing them as compositions of hyperbolic tangents, Explore the flexibility of neural networks, Control neural network complexity using logistic regression model building as a preliminary variable selection tool

Topic: Evaluation Methods
Time Devoted to Each Topic: 1 week
Activity: Develop the ROC curve idea, Show how ROC curves relate to concordance, Compare several of the above models through their ROC curves and lift charts, Pick a winner among models and export the model code to C, Java, or SAS code

Topic: Clustering
Time Devoted to Each Topic: 1 week
Activity: Distinguish agglomerative, divisive, and direct clustering, Describe single, average, and complete linkage, Ward's method, and k-means and give examples, Describe the two-step method used in Enterprise Miner, Cluster some Census Bureau data on U.S. households within Enterprise Miner, Show graphical depictions of cluster compositions

Topic: Association Analysis and other topics as time permits
Time Devoted to Each Topic: 1 week
Activity: Relate association analysis to simple conditional probability, Compute lift for association analysis, Show association and sequence analysis on some banking data. Other topics as time permits: multidimensional scaling, bagging and boosting of tree-based models

Syllabus
ST_562_syllabus.pdf

Additional Documentation

Additional Comments
mlnosbis 4/11/2016: No overlapping courses.
ghodge 4/16/2016: No consultation required as it does not seem to overlap with any courses. Ready for ABGS reviewers.

ABGS Reviewer Comments:
-Good, but syllabus has no details about grading of assignments.

Course Reviewer Comments