Fall Syllabus. HAP 780: Data Mining in Health Care

College of Health and Human Services
Fall 2016 Syllabus

Course information
HAP 780: Data Mining in Health Care
Time: Mondays, 7:20pm - 10pm
Location: Aquia Building 219

Course placement
( ) Core  (X) Concentration  (X) Elective  ( ) Pre-requisite(s)
(X) Course(s) recommended before taking this course: HAP 700, HAP 709, HAP 602
It is impossible to mine data without good knowledge of database systems. HAP 709 or other relational database courses (with SQL) are strongly recommended before taking this course.

Instructor
Janusz Wojtusiak, PhD
jwojtusi@gmu.edu
Office Hours: by appointment, Wednesdays 1-4pm (Northeast Module, Room 108, Fairfax Campus)

Course description
An introductory course to data mining and knowledge discovery in health care. Methods for mining health care databases and synthesizing task-oriented knowledge from computer data and prior knowledge are emphasized. Topics include fundamental concepts of data mining, data preprocessing, classification and prediction (decision trees, attributional rules, Bayesian networks), constructive induction, cluster and association analysis, knowledge representation and visualization, and an overview of practical tools for discovering knowledge from medical data. These topics are illustrated by examples of practical applications in health care.

Course objectives
Upon completion of the course, students will be able to:
1. Understand and describe data mining techniques and their use in knowledge discovery as it applies to health-related fields.
2. Define a health-related problem to be solved by means of data mining.
3. Apply data preprocessing techniques to clean and prepare data sets for analysis.

4. Build and assess predictive models using various techniques such as decision trees, decision rules, Bayesian networks, and clustering.
5. Develop skills in using recent data mining software for solving practical problems in health services research and other medical and public health related fields.
6. Use methods for presenting knowledge in natural language and other understandable forms.
7. Review and critique current research papers on data mining algorithms and implementations.

Required textbook(s) and/or materials
Required Text: Class notes and slides.
Recommended Readings:
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.
Witten, I.H., Frank, E., Hall, M.A. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition. Morgan Kaufmann.
Black, K. (2008). Business Statistics for Contemporary Decision Making. New Jersey: John Wiley & Sons.

Course requirements
Computer requirements
This is a computationally intensive course and you are expected to access databases, software tools, and other content. You will need:
- A fast computer (multicore PC or Mac) with at least 100GB of free disk space and at least 4GB of RAM (more is recommended), running Windows 7 or newer. Mac users may require more powerful computers to enable virtualization to run Windows.
- A fast internet connection
- Microsoft Office for viewing and preparing files
- Other software will be provided in class (SQL Server, Weka, R, Genie, Python)
If you do not have a sufficient computer, you can request access to the Health Informatics Learning Lab, located in the Northeast Module, or use one of the computer labs at GMU. It is the responsibility of students to configure and maintain their own computers, and to make sure that they are set up correctly and that installed software (e.g., security software) does not interfere with software used in class.

Expectations: Students are responsible for assigned readings, class content and material.
Students are also responsible for finding the right computer equipment that allows accessing the course materials and using data and software tools, and for checking email/Blackboard on a daily basis.

Data mining is a very broad topic, which is condensed here into a one-semester course. This course requires students to participate in lectures and spend at least another 6 hours per week on assignments, reading, and the project.

Evaluation Methods: If you are taking this course as part of a graduate-level course, you will receive a grade. Your grade will depend on your participation, the quality of your project work, and your team work. Assignments and projects are graded based on multiple criteria that will be discussed in detail. Always write all answers in your own words. Do not copy-and-paste. You can ask questions by sending email to the instructor. In most cases you will receive a response within 48 hours.

Participation Outside Classroom
You should attend a meeting (conference, seminar, local chapter meeting, etc.) and write about a one-page description of what you learned and how the attended event relates to this course. It is not sufficient to simply pay the membership fee for a professional organization without participating in the organization in any way. The report is due on the last day of classes. Look for a meeting early in the semester. In-person meetings are strongly suggested.

Data Mining Topic presentation
You will need to prepare a 10-minute presentation about a topic related to the class. I strongly recommend finding a journal article, analyzing it in detail, and presenting it. Do not prepare presentations that repeat topics covered in class. Do not repeat topics presented by other students.

Final Project
Data mining requires combining theoretical knowledge with practical skills. In order to develop skills in the context of health care applications, a semester-long project is the most important component of the grade. The project topics should be related to analyzing healthcare data in order to solve clinical or administrative problems.
The project should include, but not be limited to: (1) problem description; (2) data selection; (3) data pre-processing; (4) selection of DM methods; (5) application of methods; (6) analysis of results; (7) review of available literature and related work; (8) conclusions and description of impact on healthcare. Also include a brief description of what you learned in the project. Direct application of existing software to publicly available datasets is not sufficient. The projects must demonstrate significant effort in data manipulation, processing, and mining. Projects must also illustrate understanding of the applied techniques as well as the healthcare problem addressed.

In the past, some student projects were revised, extended, and submitted for conference presentations.

Teaching methods
(X) Lecture  ( ) Group work  ( ) Independent research  ( ) Field work  ( ) Papers  ( ) Guest speakers  ( ) Student presentations  ( ) Case Studies  (X) Lab  ( ) Class discussion  ( ) Other

Evaluation
Weekly Assignments                35%
DM topic presentation             10%
Participation Outside Classroom    5%
Semester-long project             50%

Grading Scale
96+       A
90-95     A-
86-89     B+
80-85     B
75-79     B-
70-74     C
Below 70  F

Mason Honor Code
The complete Honor Code is as follows: To promote a stronger sense of mutual responsibility, respect, trust, and fairness among all members of the George Mason University community and with the desire for greater academic and personal achievement, we, the student members of the university community, have set forth this honor code: Student members of the George Mason University community pledge not to cheat, plagiarize, steal, or lie in matters related to academic work. (From the 2016-17 Catalog catalog.gmu.edu)

Individuals with Disabilities
The university is committed to providing equal access to employment and educational opportunities for people with disabilities. Mason recognizes that individuals with disabilities may need reasonable accommodations to have equally effective opportunities to participate in or benefit from the university educational programs, services, and activities, and have equal employment opportunities. The university will adhere to all applicable federal and state laws, regulations, and guidelines with respect to providing reasonable accommodations as necessary to afford equal employment opportunity and equal access to programs for qualified people with disabilities. Applicants for admission and students requesting reasonable accommodations for a disability should call the Office of Disability Services at 703-993-2474. Employees and applicants for employment should call the Office of Equity and Diversity Services at 703-993-8730.
Questions regarding reasonable accommodations and discrimination on the basis of disability should be directed to the Americans with Disabilities Act (ADA) coordinator in the Office of Equity and Diversity Services. (From the 2016-17 Catalog catalog.gmu.edu)

E-mail Policy
Web: masonlive.gmu.edu
Mason uses electronic mail to provide official information to students. Examples include notices from the library, notices about academic standing, financial aid information, class materials, assignments, questions, and instructor feedback. Students are responsible for the content of university communication sent to their Mason e-mail account and are required to activate that account and check it regularly. Students are also expected to maintain an active and accurate mailing address in order to receive communications sent through the United States Postal Service. (From the 2016-17 Catalog catalog.gmu.edu)

I plan to videotape selected lectures for the future use of online students. The recorded videos will be posted online for students to view. The camera will be facing the screen and instructor. Because live interaction with the class is recorded, some of your questions and your voice may also be recorded. If you do not wish to be on the final recording, please let me know. Then, I will ask for your help to review the final versions of recordings to ensure that you are completely edited out.

Tentative Weekly Schedule
The schedule below is approximate and may be changed to adapt to students' needs and requests, new material, and for other reasons. Due dates and assignments are subject to change and will be provided weekly.

Wk  Date         Topics                                                  Assignments                              Due Date
1   8/29         Introduction to data mining in health care;             What do you know?                        9/11
                 Review of databases; Introduction to software
    9/5          No Class - Labor Day
2   9/12         Measuring/Describing the world;                         What do you know? / prepare sample data  9/18
                 Data preprocessing - part 1
3   9/19         Data preprocessing - part 2;                            What do you know? / prepare sample data  9/25
                 Knowledge representation
4   9/26         Data preprocessing - part 3: Exploratory data           What do you know?                        10/2
                 analysis, simple statistics;
                 Review of types of health data
5   10/3         Mining Frequent Patterns/Associations                   What do you know?                        10/9
6   10/11 (Tue)  Classification and Regression: Basics                   What do you know?                        10/16
7   10/17        Classification 2                                        What do you know?                        10/23
8   10/24        Cluster Analysis                                        What do you know?                        10/30
9   10/31        Outlier Detection                                       What do you know?                        11/6
10  11/7         Time and Space                                          What do you know?                        11/13
11  11/14        No class meeting - online material assigned             What do you know?                        11/21
                 on healthcare applications
12  11/21        Text and Image Mining; Genomic data                     What do you know?                        11/21
13  11/28        BIG DATA Analysis                                       What do you know?                        12/5
14  12/5         Final Project Presentations                             All Missing Assignments Due

Sample Assignments
Below are draft instructions for some of the assignments. They are for information purposes only, to help students better plan their time and understand course content. The actual assignments will be posted on Blackboard and may be different from the ones presented here.

Assignment 1 Introduction to Databases and Data Mining
When answering questions: (1) use your own words; (2) discuss answers; (3) do not copy-and-paste; (4) provide enough detail, so I can help if an answer is incorrect.
1. How is a data warehouse different from an operational database? Are there any similarities?
2. Why are outliers particularly important in healthcare applications? Explain and give examples.
3. How does the data mining process for very large data differ from that for very small data? Describe challenges related to both types of data.
4. You are a consultant hired by a hospital. The hospital is engaged in a quality improvement process to reduce medical errors. You are asked to learn why some reported incidents result in lawsuits or claims, while others don't. Describe how you would approach this problem. What type of data mining is involved? What do you need in the data in order to perform this task?
5. Load the attached file hepatitis.csv into SQL Server, MySQL, PostgreSQL, or another relational database (do not use MS Access). Some data description is available at: http://archive.ics.uci.edu/ml/datasets/hepatitis
Prepare SQL queries to answer the following questions:
- what is the average value of bilirubin?
- what is the average value of bilirubin for patients that live?

- what is the average value of bilirubin for patients that are dead?
- how many patients have a value of bilirubin higher than the average in the data?
- how many patients are at least 50 years old? How many of them are dead?
- how many patients are younger than 50 years? How many of them are dead?
- is the average age of dead patients higher than the average age of alive patients?
Present both queries and results. Do not click-in the queries. Write them in SQL.
6. Decide on the topic and date of your presentation. Submit the date and topic of the presentation to doodle.
7. Propose the topic of your project. Write 1-2 paragraphs describing what you want to do. The topic may evolve later, but you need to start thinking about it now.
8. Download SQL Server 2016 Enterprise 64-bit (English) from DreamSpark. Install the server yourself, or keep the downloaded files on your laptop for installation in class. We will proceed with step-by-step installation.
http://e5.onthehub.com/webstore/welcome.aspx?vsro=8&ws=058b3ace-2512-e111-a703-f04da23e67f6
It is free for HAP students for academic use! We will do the installation in class next week. If you are a Mac user, you need to install it in a virtual machine running Windows. Windows can also be downloaded from DreamSpark. You need virtualization software such as Parallels or VirtualBox.

Assignment 2 Preprocessing Part 1
1. What types of dirty data can one expect when starting a data mining project? List at least five potential problems and give examples.
2. Why do you have to specify field types when loading data into a database or data warehouse? Why not make everything a text field?
3. Why are semantic/analytic data types important in data mining?
4. Use the claims data from the HAP 709 class (you can get Excel files at http://gunston.gmu.edu/healthscience/709/querriesreports.asp#analyze%20data ).
- How many patients have chronic conditions?
- What is the most common chronic condition?
- What is the maximum number of body systems involved for a single patient? Which patient(s)?
You should use the definitions of chronic conditions and body systems from the website of the Agency for Healthcare Research and Quality. The mapping is available in the CCI2012.CSV file at the website: http://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp

You will need to load the claims data into SQL Server (preferred) or Access. Then you need to load the chronic condition indicators (CCI2012.CSV file). Then you need to link the two files in queries that answer the questions above. Please send screenshots of the different steps of your work, the final results, and any SQL code you used.

Assignment 3 Preprocessing Part 2
1. Why is it important to explore data before executing any DM algorithms? What can go wrong? What can be discovered in an exploratory analysis?
2. What are the three types of missing values? Give examples.
3. Suppose you are given two datasets: a survey result about satisfaction with a clinic visit and patient records. For privacy reasons, personal information (name, record number, address) has been removed from both datasets. Your goal is to find out if there is any relationship between the survey results and the severity of cases (severity can be obtained from medical records).
- How do you approach the problem of linking the two datasets? Speculate on what attributes you would use to link them.
- Should you assume that a medical record is found for every patient who completed the survey?
- Should you assume that a survey is found for every patient who was treated at the clinic?
4. Using the Hepatitis dataset from Assignment 1:
- Write SQL queries to calculate the arithmetic mean, standard deviation, median, mode, and all three quartiles for SEX, AGE, and BILIRUBIN. Note: make calculations only for values that make sense.
- Using SQL, prepare data for a Q-Q plot of BILIRUBIN levels for male vs. female patients. You do not need to make the actual plot (although the prepared data can be easily copied and plotted as a scatterplot in Excel). Prepare the data in the form of a table with 2 columns, in which the first column corresponds to male and the second to female patients, and the rows correspond to selected quantiles. Note: even if you fail at some details here and something does not work, describe the procedure for how you would approach the problem.
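The two-column quantile table described for the Q-Q plot can be illustrated outside SQL. The following is a minimal Python sketch, not the required SQL solution; the function name `qq_table` and all sample values are hypothetical and invented for illustration (they are not taken from hepatitis.csv). It assumes deciles as the "selected quantiles".

```python
from statistics import quantiles

def qq_table(sample_a, sample_b, n=10):
    """Pair the n-1 cut points that split each sample into n equal-probability groups."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical values, invented for illustration only (not from hepatitis.csv).
male = [0.6, 0.7, 0.9, 1.0, 1.2, 1.5, 2.0, 2.8, 4.1]
female = [0.5, 0.7, 0.8, 1.0, 1.3]

rows = qq_table(male, female)
for i, (m, f) in enumerate(rows, start=1):
    # Each row is one quantile level: column 1 = first group, column 2 = second group.
    print(f"{i * 10:>3}%  male={m:.2f}  female={f:.2f}")
```

Because both columns are evaluated at the same quantile levels, the two groups can be compared even though they contain different numbers of patients.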
5. Install the Weka software http://www.cs.waikato.ac.nz/ml/weka/ on your computer. It runs on most platforms (Windows, Mac, Linux) and requires Java (JRE). Installation is simple. We will be using Weka in class. Send screenshots.

Assignment 4 Data Transformation
1. Why is it important to select the right attributes for DM algorithms? Why not keep all attributes?
2. Give an example (different from the one in the lecture) of when the creation of new attributes may be useful.

3. When should stratified sampling be used?
4. Using the hepatitis data, write SQL code for:
- Sampling 50% of the data without repetitions
- Sampling 50% of the data with repetitions
- Stratified sampling, to make the frequency of both classes equal
5. In Weka, using the hepatitis data:
- Compare the results of three different attribute selection methods
- Compare the results of three different sampling methods (show plotted data distributions).

Assignment 5 Association Rule Mining
1. Describe why association rules and classification rules are not the same. Give examples.
2. Why is the FP-Tree algorithm usually faster than apriori? Give some intuitive explanation.
3. List at least four metrics of quality for association rules. Provide formulas.
4. Using the heritage data (release 1) in SQL:
a. Find the support for all single itemsets
b. List all itemsets with 2 elements and support of at least 0.2
c. List all itemsets with 3 elements and support of at least 0.2
5. In Weka:
a. Load the heritage data (release 1)
b. Apply at least two association rule generation algorithms and compare the results
c. Apply the FP-Tree algorithm with at least two measures of rule metrics

Assignment 6 Classification and Regression
1. Describe the process of preparing data for classification learning.
2. Why is it important to select the correct type of model? List at least three reasons.
3. Suppose you are asked to create a model to predict hospitalization. What you have is claims data. Describe the process of preparing the data, constructing the model, and testing the model. Should you use regression learning or classification learning for this problem? Why?
4. In SQL/Weka:
a. Prepare the heritage data for classification learning
b. Load the heritage data release 3 (preprocessed to binary representation, including demographics and output attribute(s))
c. Perform exploratory analysis
d. Create at least three classification models for predicting hospitalization based on Year 1 data.
e. Which model performs the best on Year 2 data?

f. Create a regression model for predicting hospitalization days.
g. What is the difference between regression and classification models?
h. Present your results in the form of a short report that includes screenshots, tables, and any needed description.

Assignment 7 Classification Part 2
1. When can ROC be used as a measure of the quality of models? How does it differ from other measures such as accuracy, precision, or recall?
2. When designing a system for detecting life-threatening events for ICU patients, what is more important: precision or recall? Why?
3. Describe how multiple classification models can be combined.
4. Using the heritage release 3 data prepared last week:
a. Include drug information in the data
b. Include laboratory information in the data
c. Import the newly created data into Weka and run classification algorithms
d. Does inclusion of the information improve predictions?
There are many ways to complete question 4, so you need to make different decisions. Try not to overcomplicate the problem.

Assignment 8 Cluster Analysis
1. Describe differences and similarities between clustering and classification. Use examples.
2. Suppose you are given a dataset with 1000 binary attributes representing the presence of diagnoses. Describe potential problems with clustering data with that many dimensions.
3. Using the data table shown below:
a. Calculate the distance between all points in the 1-norm, 2-norm, and infinity-norm. Show the dissimilarity matrix.
b. Is there any need to preprocess the data to be more suitable for clustering? If so, describe the operations and show the resulting data table.
c. Apply the k-means clustering algorithm with k=2.

ID  Age  BMI  Gender  Total Cholesterol
1   30   24   M       180
2   70   19   M       190
3   65   26   M       220
4   40   32   F       260
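The three norms named in question 3a of Assignment 8 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the three sample points are hypothetical and are not the rows of the table above, and the sketch assumes purely numeric attributes (a categorical attribute such as Gender would first need to be encoded or handled separately).

```python
import math

def manhattan(a, b):
    """1-norm distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """2-norm distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Infinity-norm distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def dissimilarity_matrix(points, dist):
    """Square matrix whose entry [i][j] is dist(points[i], points[j])."""
    return [[dist(p, q) for q in points] for p in points]

# Hypothetical numeric points, invented for illustration.
points = [(0, 0), (3, 4), (6, 8)]

for name, dist in [("1-norm", manhattan), ("2-norm", euclidean), ("inf-norm", chebyshev)]:
    print(name)
    for row in dissimilarity_matrix(points, dist):
        print("  ", row)
```

Note that attributes measured on very different scales dominate all three distances unless the data are rescaled first, which is one reason preprocessing matters before clustering.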

4. In Weka, using the heritage 3 dataset:
a. Apply the k-means algorithm for k=2, 3, 5, 10
b. Apply the EM algorithm. What is the optimal number of clusters obtained by EM?
c. Compare the created clusters to classification based on hospitalization in year 2.

Assignment 9 Outlier Detection
1. How do outliers differ from noise? Provide examples of both.
2. Why are collective outliers harder to find than individual outliers?
3. What are the issues in applying classification methods for outlier detection? What are the requirements?
4. Using the Heritage data, are there any patients that can be considered outliers in the data? Why?

Assignment 10 Text Mining
1. Why is text mining important in healthcare applications?
2. Describe the pre-processing of clinical notes before classification learning can be applied. What methods can be used at each step of pre-processing?
3. Write a regular expression to:
a. Detect zip codes in text
b. Find the last names of all patients whose first name is John (note that regular expressions may have some false positives/false negatives).
4. List challenges in automatically retrieving ICD-9 codes from clinical notes. Search the literature to find relevant published work. Also, include your own observations and comments.
5. Using the SMS data:
a. Split the data into training (80%) and testing (20%) sets
b. Build a naïve Bayes classifier for detecting spam based on bag of words
i. List all words in the documents
ii. Count occurrences in spam and ham
iii. Assign likelihoods P(word | spam) and P(word | ham) for all words
iv. Convert the test data into a list of words. For each message you need 2 columns: message id and word
v. Classify the test data. This can be done by a series of joins with the data prepared in (iii).
vi. Calculate the quality of your model (accuracy, precision, recall)
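The bag-of-words steps in question 5b can be sketched in Python to show the logic before writing the joins. This is a minimal sketch under stated assumptions, not the expected solution: the messages are invented (they are not the SMS data), the function names are hypothetical, tokenization is a plain whitespace split, and add-one (Laplace) smoothing is an assumption added here so that unseen word/class pairs do not produce zero likelihoods.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def train(messages):
    """messages: list of (label, text). Returns per-class message counts,
    per-class word counts, and the vocabulary (steps i and ii)."""
    doc_counts = Counter(label for label, _ in messages)
    word_counts = {label: Counter() for label in doc_counts}
    for label, text in messages:
        word_counts[label].update(tokenize(text))
    vocab = set().union(*word_counts.values())
    return doc_counts, word_counts, vocab

def classify(text, doc_counts, word_counts, vocab):
    """Pick the class with the highest log prior plus summed log likelihoods
    (steps iii to v)."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Assumption: add-one (Laplace) smoothing of P(word | label).
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training and test messages, invented for illustration.
train_set = [
    ("spam", "win free prize now"),
    ("spam", "free cash offer"),
    ("ham", "see you at lunch"),
    ("ham", "call me after class"),
]
test_set = [("spam", "free prize"), ("ham", "see you after class")]

model = train(train_set)
predictions = [classify(text, *model) for _, text in test_set]
accuracy = sum(p == label for p, (label, _) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; in SQL the same effect can be obtained by summing logarithms of the likelihoods after the joins.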