Text as Data Text Analytics

Text as Data
Text Analytics
Robert Stine
Wharton School of the University of Pennsylvania
www-stat.wharton.upenn.edu/~stine

Introduction

Why look at text as data?
- Why look at text?
  - Interesting
    - How does ETS score the written SAT? Diagnose autism?
    - What gives away how a justice on the Supreme Court will vote?
  - Opportunity to augment classical data
    - How can I use these written comments?
  - Connections to modern statistical modeling
    - Issues of big data, neural networks/deep learning, and variable/model selection
- Examples of text data
  - Medical data that combine lab measurements with clinical evaluations
  - Open-ended survey responses (e.g., ANES)
  - Written employment applications
  - Ad click prediction based on search text

Illustrative Applications
- Two types: supervised and unsupervised
  - Supervised analyses have a known response to guide the analysis
  - Unsupervised analyses don't (think cluster analysis)
- Unsupervised examples
  - Are Facebook posts about my company positive or negative?
  - What topics dominate articles written in science?
- Supervised examples
  - Does the content of a speech indicate political leaning?
  - Can you anticipate the popularity of a movie from an initial review?
  - Does text improve models or proxy for numerical data?

Lecture Schedule
- Plan
  - Monday: Introduction; a deep dive, then back to fundamentals
  - Tuesday: Sentiment analysis, vector space models; latent semantic analysis
  - Wednesday: Generative probability models; Naive Bayes and hierarchical topic models
  - Thursday: Overflow, deep learning; language models
- Style
  - First hour of lecture, some computing
  - Second hour more focused on R computing

Further Topics in Text
- Not covering everything!
  - Emphasize problems with a connection to statistics
- Some things you will want to learn more about
  - Linguistics, structure of language
    - Parts of speech, named entities. Make a friend of a linguist!
  - Language modeling, translation
    - Sequence-to-sequence modeling needs even more data
  - Text manipulations using regular expressions
    - Get a copy on-line of egrep_for_linguists.pdf
- Books
  - Manning and Schütze (1999), Foundations of Statistical NLP
  - Jurafsky and Martin (2008), Speech and Language Processing

Software
- Comparison to Mosteller & Wallace analysis
  - They studied authorship of the Federalist papers by hand
  - Mosteller and Wallace (1963). Inference in an authorship problem.
- JMP, SAS
  - Text tools now found in mainstream packages
- R
  - Reproducible research: scripting versus point and click
  - tm (text miner) supplemented by tidytext
  - Supporting packages: dplyr, ggplot2, stringr, readr
- Alternative: NLTK and Python
  - But then you have to move to R for the analysis

Overview Example

Questions and Data
- Wine tasting notes
  - Can you distinguish a red wine from a white wine using a brief note that describes its taste and aroma?
  - Can you recognize the variety of red wine? Cabernet vs merlot vs pinot vs zinfandel (classification)
  - Can you predict the price? Rating points? (regression)
  - Each tasting note is short, but we have a lot of them
- Does text add value?
  - Have numerical data, traditional predictive features
  - Does information in the text add value?

Tasting Notes Data
- 21,000 tasting notes from Beverage Tasting Institute
  - "Earthy, herbal, slightly herbaceous aromas. A medium-bodied palate leads to a short finish that is earthy, tart and has limited fruit."
  - "Toasty oak, cherry and thyme aromas. A rich entry leads to a full-bodied palate and a well-structured finish with vibrant acidity, refined tannins, and lovely varietal fruit."
- Lots of tasting notes, but each is relatively short
- Mark Liberman: http://languagelog.ldc.upenn.edu/nll/?p=3887/
- Do people describe taste, or do they describe color?
- The color of odors

Typical Steps
- Prepare data (90% or more of effort)
  - Deciding on role for text
  - Editing: removing weird characters, such as HTML markup
  - Feature engineering: e.g., making regression variables
- Modeling choices, issues
  - Unsupervised (clustering) vs supervised (regression)
  - Structural (probability model) vs predictive (conditional mean)
- Inference
  - What is the inferential context? Do you have a sample?
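
The editing step can be sketched in a few lines. The course does its computing in R; this is only a Python illustration using the standard library, and the sample string and the name clean_note are made up for the example. A regex tag stripper is a rough tool, so a real HTML parser may be the better choice for messy pages.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace

def clean_note(raw: str) -> str:
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(no_tags)
    return SPACE_RE.sub(" ", decoded).strip()

# Hypothetical raw note with markup left over from a web page.
raw_note = "<p>Toasty oak, cherry &amp; thyme aromas.<br/>A rich entry.</p>"
cleaned = clean_note(raw_note)
print(cleaned)
```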

Browsing the Data
- Always good to wander around in your data
- Visual, interactive software tools like JMP make this painless
- Novelty for stat data: several columns are long strings
- wine.jmp

Browsing the Data
- Always good to wander around in your data
- Visual, interactive software tools like JMP make this painless
- Several quantitative variables were extracted from the label
- Regular expressions used to match patterns in data
  - Which is that?
- wine.jmp
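
As a rough illustration of pulling quantitative variables out of a label with regular expressions, here is a Python sketch. The label format, the two patterns, and the function name parse_label are assumptions for this example; the slides do not show the patterns actually used on the wine data.

```python
import re

VINTAGE_RE = re.compile(r"\b(?:19|20)\d{2}\b")        # a four-digit year, 1900-2099
ALCOHOL_RE = re.compile(r"(\d{1,2}(?:\.\d+)?)\s*%")   # a number followed by a percent sign

def parse_label(label: str) -> dict:
    """Extract vintage and alcohol content, or None when a pattern does not match."""
    vintage = VINTAGE_RE.search(label)
    alcohol = ALCOHOL_RE.search(label)
    return {
        "vintage": int(vintage.group(0)) if vintage else None,
        "alcohol": float(alcohol.group(1)) if alcohol else None,
    }

# Hypothetical label text, not a row from wine.jmp.
label = "2008 Cabernet Sauvignon, Napa Valley, 14.5% alc."
fields = parse_label(label)
print(fields)
```

Returning None for a failed match keeps missing values explicit, which matters later when the extracted columns go into a regression.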

Regression Model for Price
- Traditional multiple regression
  - Log(price) as response
  - Features: alcohol, vintage, color, and points
  - Too many varieties to use this one
- With n=16,421, every feature is statistically significant
  - Numerous missing prices
- Be careful interpreting these: the response is on a log scale.
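
To make the log-scale caution concrete: if the response is the natural log of price (an assumption here, since the slide only says "Log"), a coefficient b means that a one-unit increase in the feature multiplies the predicted price by exp(b), a change of 100*(exp(b) - 1) percent. A small Python sketch, with a coefficient value that is invented and not taken from the fitted model:

```python
import math

def percent_change(coef: float, delta: float = 1.0) -> float:
    """Percent change in price implied by a coefficient on natural-log price."""
    return 100.0 * (math.exp(coef * delta) - 1.0)

# Hypothetical coefficient for rating points; NOT an estimate from the wine regression.
coef_points = 0.05
print(f"{percent_change(coef_points):.2f}% higher price per additional point")
```

For small coefficients the percent change is close to 100*b, which is why such coefficients are often read directly as percentages.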

What's the benefit of text?
- Does adding information gleaned from the tasting notes improve this regression?
  - Is the model more predictive? Does R^2 grow?
  - If so, can we interpret the effects of adding text?
  - Analogous to using physician notes in diagnostic medicine
- How can we find out? Two approaches
  - Feature engineering: hand-craft new variables
  - Black box (at the moment): JMP's "Text Explorer" tool
    - We will look inside this tool in the coming lectures

Feature Engineering
- Make new variables
  - Length of the tasting note. Rationale: people probably write more about a good wine than a crummy wine
  - Recode other features, particularly variety, to make them useful
  - Indicators for special words: yummy, delicious, great
  - Sentiment analysis
  - And no peeking at the response!
- R^2 grows from 0.32 to 0.35
  - Interesting to see effects of varieties
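
Two of the hand-crafted variables above, note length and indicators for special words, can be sketched as follows. This is a Python illustration only; the function name note_features and the sample note are made up. The features are computed from the text alone, so nothing here peeks at price.

```python
import re

SPECIAL_WORDS = ("yummy", "delicious", "great")   # the indicator words named on the slide

def note_features(note: str) -> dict:
    """Word count plus 0/1 indicators for each special word."""
    words = re.findall(r"[a-z']+", note.lower())
    features = {"n_words": len(words)}
    for w in SPECIAL_WORDS:
        features[f"has_{w}"] = int(w in words)
    return features

# Hypothetical tasting note.
note = "A rich entry leads to a full-bodied palate and a great finish."
feats = note_features(note)
print(feats)
```

Each key becomes a candidate column in the regression alongside alcohol, vintage, color, and points.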

Going Deeper into Text
- Explore the description more carefully
  - What other characteristics can be exploited?
  - What words, phrases are common enough to be interesting?
- What's a token?
- term = word type
- Author likes to use the word "medium" in a phrase.
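
In the usual terminology, a token is each occurrence of a word in the text and a type is a distinct word, so a note has at least as many tokens as types. The sketch below counts both for the first tasting note quoted earlier. It is written in Python for illustration, and the tokenizer (lowercase letters, hyphenated words kept whole) is just one possible choice.

```python
import re
from collections import Counter

def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters, allowing internal hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

note = ("Earthy, herbal, slightly herbaceous aromas. A medium-bodied palate "
        "leads to a short finish that is earthy, tart and has limited fruit.")

tokens = tokenize(note)          # every occurrence
type_counts = Counter(tokens)    # distinct words with their counts
n_tokens = len(tokens)
n_types = len(type_counts)
print(n_tokens, n_types, type_counts.most_common(2))
```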

Document Term Matrix
- Count word types that appear in each document
  - One row for every document (an observation)
  - One column for every word type (a variable)

        w1    w2    w3    ...   wm
  d1
  d2                c23
  d3
  ...
  dn

  c23 = number of times word type w3 appears in document 2
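
A minimal sketch of building such a matrix, in Python with dense lists for readability. The three short documents are made up, and document_term_matrix is an illustrative name rather than a function from tm or tidytext.

```python
import re
from collections import Counter

def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per word type."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))               # column order: alphabetical word types
    matrix = [[c[w] for w in vocab] for c in counts]   # Counter gives 0 for absent types
    return vocab, matrix

# Hypothetical mini-corpus.
docs = [
    "earthy herbal aromas",
    "toasty oak and cherry aromas",
    "earthy finish, earthy fruit",
]
vocab, dtm = document_term_matrix(docs)
for row in dtm:
    print(row)
```

With thousands of documents and word types, a sparse representation that stores only the nonzero counts would be the practical choice.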

Document Term Matrix
- Count word types that appear in each document
  - What's a word? Where did common words like "a" and "the" go?
  - Stemming? Are "herb" and "herbs" different words?
  - Accept defaults for now, with explicit choices when using R
- DTM is huge
  - One row for every document, one column for every type
  - Sparse: most tokens are common, most types are rare
  - Treat large matrix using idea from statistics: principal components
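
The choices raised above change how many columns the matrix has. The Python sketch below drops a tiny, made-up stop word list, applies a deliberately crude plural stripper (not a real stemmer such as Porter's), and measures sparsity as the share of zero cells. All names and the sample text are illustrative assumptions.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "that", "has"}  # tiny illustrative list

def crude_stem(word: str) -> str:
    """Strip one trailing 's' so 'herbs' and 'herb' share a type (not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def word_types(text: str, drop_stop_words: bool = True, stem: bool = True) -> set:
    tokens = re.findall(r"[a-z]+", text.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return set(tokens)

def sparsity(matrix) -> float:
    """Fraction of cells in a document term matrix that are zero."""
    cells = [c for row in matrix for c in row]
    return sum(1 for c in cells if c == 0) / len(cells)

text = "The herb and the herbs: a herbal aroma and herbal aromas."
raw_types = word_types(text, drop_stop_words=False, stem=False)
kept_types = word_types(text)
print(len(raw_types), sorted(kept_types))
```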

Latent Semantic Analysis (LSA)
- Principal components analysis of the document term matrix
  - Variations based on how one normalizes the variables, just like standardizing variables in regression analysis
- Default results
  - Do you see clusters?
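
One way to see what the first component is: take the leading singular vector of the document term matrix. The Python sketch below finds it by power iteration on A^T A, using only the standard library. The toy counts are invented, and the matrix is used as given, with no centering or scaling. That normalization is exactly the kind of variation mentioned above, and a principal components routine would typically center the columns first. This is not how JMP or R compute the decomposition.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def leading_component(A, n_iter=200):
    """Power iteration on A^T A: returns (singular value, term loadings, document scores)."""
    At = transpose(A)
    v = [1.0 / math.sqrt(len(At))] * len(At)    # start from a unit vector
    for _ in range(n_iter):
        w = matvec(At, matvec(A, v))            # w = A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = matvec(A, v)                       # A v = sigma * u
    sigma = math.sqrt(sum(s * s for s in scores))
    return sigma, v, scores

# Toy counts (rows = documents, columns = word types); made-up numbers.
dtm = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
sigma, loadings, scores = leading_component(dtm)
print(round(sigma, 3), [round(s, 3) for s in scores])
```

In this toy example the first two documents get large scores and the last two get scores near zero, so the component separates the two groups of documents. Scores like these are the columns added to the regression on the next slide.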

Using the Principal Components
- Add the principal components to the regression
  - Come back Tuesday and Wednesday to find out how this magic works and what those components mean.
- The model improves again
  - R^2 grows from 0.32 to 0.35 to 0.40
  - Should we add more?

Next Steps
- What's the science behind the success of using text?
  - Description features alone explain 28% of variation in price
- Details, details
  - Glossed over several choices: What's a word? Do we keep all the words? What about phrases? What's this singular value thing?
  - The choices might not actually matter, but you need to know what the choices are and why they might matter.
- Software
  - JMP is pretty neat, but it does not implement some methods, such as sentiment analysis and topic models
  - Plus, it's not free (at least not after a 30-day trial)