Collecting and Analyzing Big Data

University of Oslo
The Faculty of Social Sciences
Oslo Summer School in Comparative Social Science Studies 2017

Lecturer: Associate Professor Neal Caren, Department of Sociology, The University of North Carolina, Chapel Hill, USA
Main disciplines: Sociology, Economics, Political Science, Big Data
Dates: 31 July - 4 August 2017
Course Credits: 10 pts (ECTS)
Limitation: 25 participants

Objectives

This course is an introduction to collecting and analyzing "big data" for social scientists. Over the last decade, the variety and types of data available to researchers have exploded. This includes not only contemporary data, such as from websites and social media platforms, but also historical data, from digitized interviews to 19th century newspapers. At the same time, analytic techniques from computer science are increasingly being used to solve social science problems.

One week is not enough time to master the techniques for collecting and analyzing big data. You will, however, be able to establish the foundation for developing these skills. The course is designed as a practical overview. The emphasis in each class will be on applying the specific techniques rather than on their mathematical basis. The course will provide an overview in that each lesson will introduce a new method in order to demonstrate the range of methods. Taken together, the lessons will give students the skills and resources to apply these methods to theoretically relevant problems in the social sciences.

By the end of the course, it is expected that students will be able to:

- Collect data from the internet using web scraping and APIs.
- Read and write digital text files.
- Analyze data using supervised learning techniques such as random forest models.
- Analyze data using unsupervised learning techniques such as topic models.
- Understand and apply current methods for analyzing texts.
- Link machine learning methods to relevant social science questions.
- Program in Python.

Course credits

Students have the option of submitting a research paper in order to receive ECTS credits. These research papers (6,000 to 8,000 words) should apply one or more of the techniques used in the course to a theoretically interesting research question. Papers should generally follow the format of a research article in the student's discipline, although the literature review may be more concise than normal. Additionally, students must provide code, and where feasible, data, to replicate the analysis. The paper is to be completed within 8 weeks after the course.

Requirements

Students should have a Python distribution appropriate for data science. The recommended way to do this is to install Continuum's Anaconda Python distribution. It is free and available for all operating systems. Students are not expected to have any knowledge of Python.

Reading list

Müller, Andreas C. and Sarah Guido. 2017. Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc. 392 pages.

Additional readings will be made available.

COURSE OUTLINE

Session 1: Big data, machine learning and the social sciences

This lecture will unpack some of the major findings from the intersection of social science and big data. The focus will be on the specific tools and methods that were used. We will also review the major sources of data and tools currently available for data science.

Müller, Andreas C. and Sarah Guido. Introduction. Chapter 1 (pages 1-24) in Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc.
O'Neil, Cathy and Rachel Schutt. Introduction: What is Data Science? Chapter 1 (pages 1-16) in Doing Data Science. O'Reilly Media, Inc.
O'Neil, Cathy and Rachel Schutt. Statistical Inference, Exploratory Data Analysis, and the Data Science Process. Chapter 2 (pages 17-50) in Doing Data Science. O'Reilly Media, Inc.

Session 2: Getting Started with Python

This lecture will focus on getting students up and running with Python for social science applications. This includes both an overview of the elements of the Python data science stack (e.g. IPython/Jupyter, pandas, matplotlib, scikit-learn) and a more detailed introduction to working with Python.

McKinney, Wes. Python Language Basics, IPython, and Jupyter Notebooks. Chapter 2 (pages 15-54) in Python for Data Analysis, 2nd Edition (2017).

Session 3: Harvesting data from the web: APIs

Collecting big data is often done through web application programming interfaces, or APIs. An API is a way for developers, or researchers, to access data stored on governmental or corporate servers. For example, Twitter, Facebook, and Yelp all make some of their data available through APIs. This lecture will introduce the basics of collecting data from an API in Python.

Mitchell, Ryan. Using APIs. Chapter 4 (pages 49-70) in Web Scraping with Python: Collecting Data from the Modern Web, O'Reilly Media, Inc.
"Chronicling America API." http://chroniclingamerica.loc.gov/about/api/

Session 4: Harvesting data from the web: Web scraping

A second major source of big data is collecting it directly from websites. Web scraping involves visiting one or more pages and collecting and storing the relevant information in an automated fashion. This lecture will introduce the basics of web scraping in Python.

Mitchell, Ryan. Using APIs. Chapter 4 (pages 49-70) in Web Scraping with Python: Collecting Data from the Modern Web, O'Reilly Media, Inc.
Mitchell, Ryan. Your First Web Scraper. Chapter 1 (pages 3-12) in Web Scraping with Python: Collecting Data from the Modern Web, O'Reilly Media, Inc.
Mitchell, Ryan. Advanced HTML Parsing. Chapter 2 (pages 13-30) in Web Scraping with Python: Collecting Data from the Modern Web, O'Reilly Media, Inc.
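As a rough illustration of what Session 4 means by collecting information from a page in an automated fashion, the sketch below pulls every link out of a small HTML document using only Python's standard library. The page content, the `LinkCollector` class, and the `scrape_links` helper are hypothetical names made up for this example; they are not taken from the course materials, and a real scraper would first download the page rather than use a hard-coded string.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this
# (for example with urllib.request.urlopen) before parsing it.
SAMPLE_HTML = """
<html><body>
  <h1>Newspaper archive</h1>
  <ul>
    <li><a href="/issue/1896-01-01">1 January 1896</a></li>
    <li><a href="/issue/1896-01-02">2 January 1896</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_links(html):
    """Parse an HTML string and return the links it contains."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = scrape_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

The same pattern (parse the page, keep only the elements of interest, store them in a simple structure) carries over to third-party parsing libraries, which are usually more convenient for messy real-world pages.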

Session 5: Manipulating Big Data

By most estimates, 80% of data analysis is cleaning and merging the data. This lecture introduces best practices for preparing your data in Python.

Vanderplas, Jake. Data Manipulation with Pandas. Chapter 3 (pages 97-216) in Python Data Science Handbook. O'Reilly Media, Inc.

Session 6: Supervised Learning I

You are likely familiar with supervised learning, but you probably don't call it that. Supervised learning is the machine learning term for modeling one variable as a function of a set of other variables, as in linear or logistic regression. This lecture reviews common methods for regression and classification, such as linear regression, and introduces more complex algorithms.

Müller, Andreas C. and Sarah Guido. Supervised Learning. Chapter 2 (pages 25-69) in Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc.

Session 7: Model Evaluation

Keeping the data used to evaluate your model separate from the data used to develop your model is critical to the machine learning workflow. This is particularly important when overfitting is a concern. This lecture introduces the idea of cross validation and reviews methods for evaluating model fit.

Müller, Andreas C. and Sarah Guido. Model Evaluation and Improvement. Chapter 5 (pages 251-304) in Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc.
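The idea behind Session 7, fitting a model on one part of the data and scoring it on a part it has never seen, can be sketched in plain Python. The example below is a minimal illustration under stated assumptions, not the course's own code: the toy data are made up, the helper names (`k_fold_indices`, `fit_line`, `cross_validate`) are hypothetical, and the folds are contiguous blocks, whereas real data would normally be shuffled first. In practice a library such as scikit-learn provides this workflow.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validate(xs, ys, k):
    """Return the held-out mean squared error for each of the k folds."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        # Fit only on observations outside the current fold
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Score only on the observations the model has not seen
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return scores


# Toy data, made up for illustration: roughly y = 2x + 1 with small noise
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9]

fold_mse = cross_validate(xs, ys, k=5)
print(fold_mse)
```

Averaging the per-fold errors gives a single estimate of out-of-sample fit; a model that scores much better on its training data than on the held-out folds is likely overfitting.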

Session 8: Supervised Learning II

This lecture extends our focus on supervised learning to include decision trees and random forest models.

Müller, Andreas C. and Sarah Guido. Supervised Learning. Chapter 2 (pages 70-125) in Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc.

Session 9: Working with Text Data

This lecture introduces the basics of manipulating and analyzing text data, including counting and analyzing term frequencies for text categorization.

Müller, Andreas C. and Sarah Guido. Working with Text Data. Chapter 7 (pages 323-346) in Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc.

Session 10: Unsupervised Learning with Text Data

This lecture will introduce methods for analyzing themes in text data. The focus will be on topic modeling, which involves assigning each document to one or multiple topics.

Müller, Andreas C. and Sarah Guido. Working with Text Data. Chapter 7 (pages 347-356) in Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, Inc.
O'Neil, Cathy and Rachel Schutt. Next-Generation Data Scientists, Hubris, and Ethics. Chapter 16 (pages 349-362) in Doing Data Science. O'Reilly Media, Inc.
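Both text sessions start from the same building block: a count of how often each term appears in each document. The sketch below shows one simple way to build such counts with the standard library. The two example documents and the `tokenize` helper are made up for illustration, and the tokenizer (lowercase alphabetic runs only) is a deliberate simplification. scikit-learn's `CountVectorizer` produces a comparable document-term matrix, which can then be passed to a classifier or a topic model.

```python
import re
from collections import Counter

# Made-up example documents, for illustration only
documents = [
    "Protesters marched downtown. The march was peaceful.",
    "The council voted on the budget; the budget passed.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# One Counter of term frequencies per document
term_counts = [Counter(tokenize(doc)) for doc in documents]

# A shared, sorted vocabulary defines the columns of the matrix
vocabulary = sorted(set().union(*term_counts))

# Document-term matrix: one row per document, one column per term
doc_term_matrix = [
    [counts[term] for term in vocabulary] for counts in term_counts
]

for doc_id, counts in enumerate(term_counts):
    print(doc_id, counts.most_common(3))
```

Note that this tokenizer treats "march" and "marched" as different terms; whether to merge such variants (for example by stemming) is a modeling choice.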

The Lecturer

Neal Caren is an Associate Professor of Sociology at the University of North Carolina, Chapel Hill. His research interests center on the quantitative analysis of protest and social movements. His work has been published in the American Sociological Review, Social Forces, Social Problems, and the Annual Review of Sociology. The data in many of his publications has been either scraped from the web, downloaded using APIs, or otherwise involved collecting and analyzing texts. He is the author of a well-used publicly available script for converting Lexis-Nexis article downloads into a CSV file. For several years, he has run a graduate workshop on computational social science and digital data collection, has given external workshops on the topic, and has several tutorials available online. He is also the editor of the social movements journal Mobilization.