IST 718 Advanced Information Analytics

Course: Advanced Information Analytics
Semester: Summer 2016
Instructor: Gary Krudys
Email: gekrudys@syr.edu
Office: Hinds 114
Phone: 315-857-7243 (cell)
Office Hours: by Appointment/Online
Meeting Place: Online

Catalog Description

A broad introduction to analytical processing tools and techniques for information professionals. Students will develop a portfolio of resources, demonstrations, recipes, and examples of various analytical techniques.

Prerequisite Skills

There is no required prerequisite for this course, but you will find it much easier to succeed if you have completed IST687, IST777, or both. This course makes extensive use of the open source R package as a framework for analytical computing. In addition to algebra, Boolean algebra, and probability, this course makes extensive use of complex data structures. If you have not taken either IST687 or IST777, then having some background in scripting or programming languages will be helpful.

Course Description

Analytics is a huge topic, comprising quantitative analysis, systematic and automated analysis of qualitative material (such as text), parsing, measurement, missing data mitigation, data reduction, data mining, descriptive statistics, modeling, machine learning, visualization, and a variety of other areas. As this list suggests, it would be impossible to cover all of these topics in a single semester. Rather than attempt to cover all of these areas badly, this course focuses on becoming familiar and comfortable with a range of the available tools in the context of challenging, data-focused problems. Addressing these problems in creative ways, by connecting datasets and tools, can provide a practical understanding of analytics as a whole while allowing students to develop specialization in one or more areas of interest to them.
The primary goal of this course is for you to become familiar and comfortable with a variety of methods for obtaining, screening, cleaning, linking, manipulating, analyzing, and displaying data. This is not a course on data visualization per se, but you will learn to create summaries, overviews, models, analyses, and basic displays such as tables, histograms, trees, and scattergrams. Upon successful completion of this course, you will have developed some or all of the following areas of skill and knowledge:
- Review of data repositories, sources of archival data, database structures, and metadata
- Essential quantitative analysis including descriptive statistics, summarization, and a brief review of inferential statistics (a complete treatment of inferential statistics occurs in IST777)
- Linked data and data mashups
- Scripting methods for handling data in R and other tools
- Translating the provenance and structure of a linked data set into a set of reasonable analyses and displays
- Matching available analyses to the information needs of clients and users
- Debugging problems in data processing and results
- Drawing conclusions and presenting data

Learning Objectives

During the course, we will emphasize:
- Experiential learning through reading and practical exercises.
- Collaborative learning through online discussions between instructors and peers.
- Self-learning with appropriate instructional support and timely feedback using analytical case studies.

In order to be successful in this course, the student will:
- Proactively research solution options rather than relying solely on textbook content.
- Actively code while completing the reading assignments.
- Present results in a professional manner (comments, clarity, correctness).
- Submit assignments on time.

Upon completion of the course, the student will be able to:
- Understand complex data structures, transformation of data structures, and manipulation of data elements.
- Understand essential analysis techniques including descriptive statistics, summarization, and elementary modeling.
- Understand scripting methods, including debugging methodologies, for handling data in R and other tools.
- Appreciate the range of applicability of information analytics to real problems in areas such as business, science, and engineering.
- Match available analytical methodologies to the information needs of clients and users and present results in a meaningful way.
Course Materials

- Hogan, Thomas P., Bare Bones R: A Brief Introductory Guide, Sage, 2010. (Optional for those who have taken IST687 or IST777)
- Stanton, Jeffrey M., Introduction to Data Science, 2013. (Free to download at http://jsresearch.net; optional for those who have taken IST687 or IST777)
- Leipzig, Jeremy and Xiao-Yi Li, Data Mashups in R, O'Reilly, 2011. (Required)
- Matloff, Norman, The Art of R Programming: A Tour of Statistical Software Design, No Starch Press, 2011. (Required)

Student Evaluation:

1. Five Laboratory Exercises: 30%, due biweekly
2. Linked Dataset: 10%, due mid-semester
3. Tool Exploration Case Study: 20%, due by week 10
4. Final Project: 30%, due at end of semester
5. Discussion: 10%, all semester long

Laboratory Exercises

Laboratory exercises provide problem-solving experiences that reinforce the material covered in the readings. The laboratory exercises facilitate the first learning objective of the course by providing the opportunity to apply techniques from class to realistic problem-solving situations. A separate laboratory template document will be provided with specific instructions for each assignment. There are 5 graded laboratory exercises in this course, worth a total of 30% of the course grade, or about 6% apiece. The exercises come at about two-week intervals. Maximum points are possible if the submission is on time, complete, and correct. Late exercises will only be accepted within 1 week of the due date. Scoring:
- 5 = Solid / no mistakes (or really minor), well commented/documented
- 4 = Good / some mistakes
- 3 = Fair / some major conceptual errors
- 2 = Poor / did not finish
- 0 = Did not participate / did not hand in
- On time: +1
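The rubric and component weights above amount to simple arithmetic. The Python sketch below is one possible reading of them, not the instructor's official calculation. It assumes that a lab's rubric score (0 to 5) plus the on-time point maps onto the 6 percentage points that lab is worth, that a lab which was not handed in earns no on-time point, and that every other component is expressed as a fraction of its maximum. The names `lab_points` and `course_percent` and all example scores are hypothetical.

```python
# Component weights from the Student Evaluation list (percent of course grade).
WEIGHTS = {
    "labs": 30,
    "linked_dataset": 10,
    "tool_case_study": 20,
    "final_project": 30,
    "discussion": 10,
}

# Quality levels that appear in the lab rubric (there is no level 1).
VALID_RUBRIC_SCORES = {0, 2, 3, 4, 5}


def lab_points(rubric_score, on_time):
    """Points for one lab: rubric score plus 1 if it was submitted on time.

    Assumption: a lab that was not handed in (score 0) earns no on-time point.
    """
    if rubric_score not in VALID_RUBRIC_SCORES:
        raise ValueError(f"not a rubric level: {rubric_score}")
    bonus = 1 if (on_time and rubric_score > 0) else 0
    return rubric_score + bonus


def course_percent(fractions):
    """Weighted course percentage from per-component fractions in [0, 1]."""
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# Made-up example: five labs as (rubric score, on time?), one of them late.
labs = [(5, True), (4, True), (5, False), (3, True), (5, True)]
lab_total = sum(lab_points(score, on_time) for score, on_time in labs)
max_lab_total = 6 * len(labs)  # 5 rubric points + 1 on-time point per lab

example = {
    "labs": lab_total / max_lab_total,
    "linked_dataset": 0.90,
    "tool_case_study": 0.85,
    "final_project": 0.90,
    "discussion": 1.00,
}
print(round(course_percent(example), 1))
```

Keeping the weights in a single dictionary makes it easy to confirm that they sum to 100 and to see how much a late or missing lab moves the total.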
Linked Data Set

One of the most critical and difficult tasks that analysts face lies in bringing together disparate data sets to create analytical possibilities that do not exist with simpler arrangements. Some examples here include crime data joined with maps; census data joined with health outcome records; national economic data joined with cultural factors; and polling data joined with social media activity. A linked dataset suitable for subsequent analysis offers a successful join, robust checks for accuracy, missing data mitigation, and metadata fully describing the contents and provenance of the new dataset. A separate document will be provided with specifications for creating the linked dataset. The code, documentation, and data will be used in later phases of the semester, so on-time completion of this assignment is essential.

Tool Exploration Case Study

Data science is a young and fast-moving professional field. Vendors continually develop new tools and capabilities for analysts. In this case study project, you will locate, explore, and learn a new technology tool of your own choosing. The tool must provide an interface or connection to R, must be open source or available in a free educational version, and must provide a demonstrable or visible result that can be shared with other members of the class. An example in this category is the RHadoop toolset provided by Revolution Analytics as an interface between R and Hadoop.

Final Project

For the final project, students will identify a set of questions that pertain to their linked data set, will conduct analysis to explore those questions, will draw conclusions based on the outputs of those analyses, and will produce a readable report explaining the results. Maximum points are possible if the submission is on time, complete, and demonstrates the student's ability to match the appropriate analytical methods to the chosen problem, draw appropriate conclusions, and present the results in a meaningful way.
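As a rough illustration of what the Linked Data Set section asks for (a join, accuracy checks, explicit missing values, and provenance metadata), here is a minimal Python sketch using only the standard library. It is not the assignment specification, which comes in a separate document. The function `link_datasets`, the field names, and the toy records are all hypothetical, and the sketch assumes the two sources share no non-key field names.

```python
from datetime import date


def link_datasets(left, right, key):
    """Left-join two lists of dict records on `key` and report on the result."""
    right_by_key = {}
    duplicate_keys = set()
    for row in right:
        if row[key] in right_by_key:
            duplicate_keys.add(row[key])  # ambiguous match: flag it, keep first
        else:
            right_by_key[row[key]] = row

    right_fields = sorted({f for row in right for f in row if f != key})
    linked, unmatched = [], []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched.append(row[key])
        merged = dict(row)
        for field in right_fields:
            # Missing values stay explicit (None) so they can be handled later.
            merged[field] = match.get(field) if match else None
        linked.append(merged)

    report = {
        "rows": len(linked),
        "unmatched_keys": unmatched,
        "duplicate_right_keys": sorted(duplicate_keys),
        "missing_by_field": {
            f: sum(1 for r in linked if r[f] is None) for f in right_fields
        },
    }
    return linked, report


# Toy records with made-up values, for illustration only.
crime = [
    {"county": "A", "incidents": 120},
    {"county": "B", "incidents": 45},
    {"county": "C", "incidents": 80},
]
census = [
    {"county": "A", "population": 50000},
    {"county": "B", "population": None},
]

linked, report = link_datasets(crime, census, "county")

# Provenance metadata kept alongside the linked data.
metadata = {
    "sources": ["crime (toy)", "census (toy)"],
    "join_key": "county",
    "created": date.today().isoformat(),
    "checks": report,
}
print(report)
```

The report makes the join's weak spots visible before any analysis starts: county C has no census match, and population is missing for two of the three linked rows.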
Class-Wide Phone Conferences

For the online version of this course, the instructor will answer student questions during periodic toll-free phone conference calls. There will be an introductory call early in the semester, and then one call prior to each of the three non-lab assignments. The phone conferences are optional, but participation is highly encouraged, as course learning objectives, specific concepts, and upcoming assignments will be discussed.
Course Grading

Grades for specific assignments and the course final grade will be assigned by the instructor. There are 1000 possible grade points in this course, and each assignment's grade value goes directly toward the total earned by each student. The numeric final point total will translate to the final letter grade for the course as follows:

A = 95-100
A- = 90-94.9
B+ = 85-89.9
B = 80-84.9
B- = 75-79.9
C+ = 70-74.9
C = 65-69.9
C- = 60-64.9
F = below 60

Grades will be available for viewing in the Grade Book section of the course's online site.

Academic Integrity

The academic community of Syracuse University and of the School of Information Studies requires the highest standards of professional ethics and personal integrity from all members of the community. Violations of these standards are violations of a mutual obligation characterized by trust, honesty, and personal honor. As a community, we commit ourselves to standards of academic conduct, impose sanctions against those who violate these standards, and keep appropriate records of violations. The academic integrity statement can be found at http://supolicies.syr.edu/ethics/acad_integrity.htm.

Blackboard

The iSchool uses Syracuse University's Blackboard system to facilitate distance learning and main campus resources. The environment is composed of a number of elements that will help you be successful in both your current coursework and your lifelong learning opportunities. To access Blackboard, go to the following URL: http://blackboard.syr.edu. Use your Syracuse University NetID and password to log into Blackboard. For questions regarding technical aspects of Blackboard, please submit a help ticket to the iSchool dashboard at My.iSchool.Dashboard (https://my.ischool.syr.edu). Log in with your NetID, select Submit a Helpdesk Ticket, and select Blackboard as the request type. The iSchool Blackboard support team will assist you.
Students with Disabilities

In compliance with Section 504 of the Americans with Disabilities Act (ADA), Syracuse University is committed to ensure that no otherwise qualified individual with a disability shall, solely by reason of disability, be excluded from participation in, be denied the benefits of, or be subjected to discrimination under any program or activity. If you feel that you are a student who may need academic accommodations due to a disability, you should immediately register with:

Office of Disability Services (ODS)
804 University Avenue
Room 308, 3rd Floor
315.443.4498 or 315.443.1371 (TTD only)

ODS is the Syracuse University office that authorizes special accommodations for students with disabilities.
Course Schedule as of 3/16/2016

Week 0 (5/16/16): Course Introduction / Aligning
Topics: Align class with the methods, goals, and expectations of the course. Syllabus Walk Through (course navigation, subject content, exercises/assignments, discussion threads, grading, course communication). Final Project Walk Through.
Readings: Introduction Lecture; Final Project Lecture
Activities/Assignments: Complete and post Student Profile; Introduce Yourself; access the R-Bloggers site (http://www.r-bloggers.com/) and follow the instructions to subscribe to the daily newsletter

Week 1 (5/23/16): Setting Up
Topics: Data/Bare Bones R: installation, data sets, workspace, functions, graphics
Readings: Hogan Ch. 1
Activities/Assignments: Install the R open source software package on your computer; install RStudio; Exercise 1

Week 2 (5/30/16): Describing
Topics: Demonstrating ability to describe a data set via summary statistics and visualization.
Readings: Hogan Ch. 2; Matloff Ch. 12 (brief overview)
Activities/Assignments: Exercise 2

Week 3 (6/6/16): Modeling
Topics: Model patterns in data to better understand a business process.
Readings: Matloff Intro, Ch. 1, 2
Activities/Assignments: Discussion: Ideas for Linked Data Set

Week 4 (6/13/16): Building
Topics: Expand our initial modeling efforts to build information from data.
Readings: Matloff Ch. 3, 4
Activities/Assignments: Exercise 3

Week 5 (6/20/16): Scripting
Topics: Script our initial methods in order to deal with the volume and velocity of data.
Readings: Matloff Ch. 5, 6
Activities/Assignments: Discussion: Dealing with Messy Data

Week 6 (6/27/16): Inferring
Topics: Use analytics to infer the unknown given a set of knowns.
Readings: Matloff Ch. 7, 8
Activities/Assignments: Exercise 4; submit 1 page proposal for the dataset or data source you plan to use for your Final Project. Follow Final Project framework guidelines.

Week 7 (7/4/16): Mapping
Topics: Explore how to gain information from geospatial data.
Readings: Matloff Ch. 9, 10, 11
Activities/Assignments: Linked Data Set Submission

Week 8 (7/11/16): Mashups
Topics: Use scripting skills to combine (or mash up) data sets and produce meaningful analysis.
Readings: TBD
Activities/Assignments: Exercise 5; Linked Data Set Discussion Board Commentary; submit 2 page proposal outlining data analysis plan. Follow Final Project framework guidelines.

Week 9 (7/18/16): Mashups
Readings: TBD
Activities/Assignments: Discussion: Good Research Questions

Week 10 (7/25/16): Mashups
Readings: TBD
Activities/Assignments: Tool Exploration Case Study Submission; submit 3 page project report describing results of data screening, cleaning, and linking. Follow Final Project framework guidelines.

Week 11 (8/1/16): Presenting
Topics: Examine how to present results in a meaningful way.
Readings: Hogan Ch. 3; Matloff Ch. 12
Activities/Assignments: Final Project submissions due. Follow Final Project framework guidelines.

Week 12 (8/8/16): Debugging
Topics: Examine the process for testing for and removing defects from a system.
Readings: Matloff Ch. 13
Activities/Assignments: Discussion: Efficient Debugging and Problem Solving
Additional Information

Read More About It:

- Bivand, R. S., Pebesma, E. J., & Gomez-Rubio, V. (2008). Applied Spatial Data Analysis with R. New York: Springer.
- Davenport, T. H., & Harris, J. G. (2007). Competing on Analytics. Boston: Harvard Business School Press.
- Faraway, J. J. (2006). Extending the Linear Model with R. Boca Raton: Chapman & Hall / CRC.
- Provost, F., & Fawcett, T. (2013). Data Science for Business. Sebastopol, CA: O'Reilly Media, Inc.