TTIC 31210: Advanced Natural Language Processing. Lecture 14: Finish up Bayesian/Unsupervised NLP, Start Structured Prediction

TTIC 31210: Advanced Natural Language Processing. Kevin Gimpel, Spring 2017. Lecture 14: Finish up Bayesian/Unsupervised NLP, Start Structured Prediction

Today and Wednesday: structured prediction. No class Monday May 29 (Memorial Day). Final class is Wednesday May 31.

Assignment 3 has been posted, due Thursday June 1. Final project report due Friday, June 9.

Key Quantities. Our data is a set of samples.

Gibbs Sampling Template

LDA

Expectation Maximization (EM). EM is an algorithmic template that finds a local maximum of the marginal likelihood of the observed data.

EM. E step: compute posteriors over latent variables, q(z) = p(z | x; θ). M step: update parameters given posteriors, θ ← argmax_θ Σ_z q(z) log p(x, z; θ).
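To make the template concrete, here is a minimal EM sketch (my own illustration, not the lecture's code) for a two-coin mixture model: each observation is a count of heads from one of two coins with unknown biases, and the coin identity is the latent variable. The model choice, initialization, and data are all assumptions for the example.

```python
# Minimal EM sketch for a two-coin mixture (illustrative assumption).
import random

def em_two_coins(counts, flips, iters=50):
    """counts[i] = number of heads observed in sample i (out of `flips`)."""
    theta = [0.3, 0.7]   # initial coin biases (asymmetric to break symmetry)
    pi = [0.5, 0.5]      # mixture weights
    for _ in range(iters):
        # E step: posterior over the latent coin for each sample
        # (the binomial coefficient cancels in the normalization)
        posts = []
        for h in counts:
            lik = [pi[k] * theta[k]**h * (1 - theta[k])**(flips - h)
                   for k in range(2)]
            z = sum(lik)
            posts.append([l / z for l in lik])
        # M step: update parameters given posteriors
        for k in range(2):
            resp = sum(p[k] for p in posts)            # expected # samples from coin k
            pi[k] = resp / len(counts)
            theta[k] = sum(p[k] * h for p, h in zip(posts, counts)) / (resp * flips)
    return pi, theta

random.seed(0)
# simulate 30 samples from a 0.2-biased coin and 30 from a 0.8-biased coin
data = [sum(random.random() < b for _ in range(10))
        for b in [0.2] * 30 + [0.8] * 30]
print(em_two_coins(data, flips=10))   # theta should approach ~(0.2, 0.8)
```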

Different Views of the Dirichlet Process (DP). Last time we discussed the stick-breaking view of the DP; today we'll briefly discuss the Chinese Restaurant Process view. With both views, we still have the same DP hyperparameters (base distribution & concentration parameter).

Base Distribution for DP. Our unbounded distribution over items will choose them from the base distribution. The base distribution usually has infinite support. A simple example base distribution for our morph lexicon:
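As an illustration standing in for the slide's example (an assumption on my part, not the lecture's actual choice), one common base distribution over a lexicon of strings draws characters uniformly and stops with a fixed probability after each one, giving infinite support over all nonempty strings:

```python
# Hedged sketch of a geometric-length, uniform-character base distribution
# over strings; P_STOP and all names are illustrative assumptions.
import random
import string

P_STOP = 0.3
ALPHABET = string.ascii_lowercase

def sample_morph():
    """Sample a string: at least one character, then stop w.p. P_STOP."""
    chars = [random.choice(ALPHABET)]
    while random.random() > P_STOP:
        chars.append(random.choice(ALPHABET))
    return "".join(chars)

def base_prob(morph):
    """Probability of a string under this base distribution."""
    first = 1 / len(ALPHABET)
    cont = (1 - P_STOP) / len(ALPHABET)
    return first * cont ** (len(morph) - 1) * P_STOP
```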

Concentration Parameter. In the stick-breaking process, the concentration parameter determines how much of the stick we break off each time. High concentration == small parts of the stick.

The stick-breaking construction of the DP is useful for specifying models and defining inference algorithms. Another useful way of representing a draw from a DP is with the Chinese Restaurant Process (CRP). The CRP provides a distribution over partitions with an unbounded number of parts.

Imagine a Chinese restaurant with an infinite number of tables.

The first customer sits at the first table.

The second customer enters and chooses a table: table 1 (already occupied by one customer) or a new table.

The third customer enters and chooses a table: an occupied table or a new one.

The fourth customer enters; we compute p(choose table 1), p(choose table 2), and p(choose new table): each occupied table is chosen with probability proportional to the number of customers already seated at it, and a new table is chosen with probability proportional to the concentration parameter.

A large value of the concentration parameter makes new tables more likely (many small clusters); a small value makes joining existing tables more likely (a few large clusters).

A Draw G from a DP (Stick-Breaking Representation). Draw infinite probabilities from the stick-breaking process with parameter s; draw atoms from the base distribution. Atoms can be repeated!
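A hedged sketch of this draw, truncated to K sticks so it terminates; the truncation and the names are my assumptions, not the lecture's construction. With large s, each Beta(1, s) draw is small, so the stick is broken into many small pieces, matching the earlier slide.

```python
# Truncated stick-breaking sample of G ~ DP(s, base); illustrative sketch.
import random

def stick_breaking_draw(s, base, K=100):
    """Return (atoms, probs): a K-stick truncation of a draw G from the DP."""
    probs, atoms, remaining = [], [], 1.0
    for _ in range(K):
        beta = random.betavariate(1, s)   # fraction of the remaining stick
        probs.append(remaining * beta)
        remaining *= (1 - beta)
        atoms.append(base())              # atoms can repeat!
    probs.append(remaining)               # leftover mass on one final atom
    atoms.append(base())
    return atoms, probs

# e.g., with a base distribution like sample_morph from the earlier sketch:
# atoms, probs = stick_breaking_draw(s=1.0, base=sample_morph)
```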

A Representation of G Drawn from a DP (Chinese Restaurant Process Representation). Draw table assignments for n customers with parameter s; for each occupied table, draw an atom from the base distribution. Each draw from G is an atom, and its probability comes from the number of customers at its table.
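A hedged sketch (my own, with assumed names) of sampling the table assignments described above: with concentration s, customer i+1 joins occupied table k with probability counts[k]/(i+s) and opens a new table with probability s/(i+s).

```python
# Illustrative CRP table-assignment sampler.
import random

def crp_tables(n, s):
    """Return a list giving each of n customers' table index."""
    assignments, counts = [], []
    for i in range(n):
        # table k w.p. counts[k]/(i+s); new table w.p. s/(i+s)
        r = random.uniform(0, i + s)
        total = 0.0
        for k, c in enumerate(counts):
            total += c
            if r < total:
                assignments.append(k)
                counts[k] += 1
                break
        else:                       # r fell in the final s-sized slice
            assignments.append(len(counts))
            counts.append(1)
    return assignments

print(crp_tables(10, s=1.0))
# to represent G: draw one atom from the base distribution per occupied
# table; a table's probability is (# customers at it) / (n + s)
```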

When to be Bayesian? If you're doing unsupervised learning or learning with latent variables; if you want to marginalize out some model parameters; if you want to learn the structure/architecture of your model; if you want to learn a potentially-unbounded lexicon (Bayesian nonparametrics).

What is Structured Prediction?

Modeling, Inference, Learning

Modeling, Inference, Learning. Modeling: define a score function score(x, y; θ). How do we assign a score to an (x, y) pair using parameters θ?

Modeling, Inference, Learning. Modeling: define the score function. Inference: solve argmax_y score(x, y; θ). How do we efficiently search over the space of all labels?

Modeling, Inference, Learning. Modeling: define the score function. Inference: solve the argmax. Learning: choose the parameters θ. How do we choose θ?

Modeling, Inference, Learning. Structured Prediction: the size of the output space is exponential in the size of the input, or is unbounded (e.g., machine translation), so we can't just enumerate all possible outputs.

Simplest kind of structured prediction: Sequence Labeling. Part-of-Speech Tagging. The slide overlays two candidate tag sequences for one sentence (differing on Tim, Cook, and Apple):

Sentence: Some questioned if Tim Cook 's first product would be a breakaway hit for Apple .
Tags (1): determiner, verb (past), prep., proper, proper, poss., adj., noun, modal, verb, det., adjective, noun, prep., proper, punc.
Tags (2): determiner, verb (past), prep., noun, noun, poss., adj., noun, modal, verb, det., adjective, noun, prep., noun, punc.

Formulating segmentation tasks as sequence labeling via B-I-O labeling: Named Entity Recognition.

Some questioned if Tim Cook 's first product → O O O B-PERSON I-PERSON O O O
would be a breakaway hit for Apple . → O O O O O O B-ORGANIZATION O

B = begin, I = inside, O = outside
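As a small illustration of why B-I-O labels encode a segmentation (my own sketch, not course code; the function name and span convention are assumptions): typed spans can be recovered deterministically from the label sequence.

```python
# Illustrative decoder from B-I-O labels to (start, end_exclusive, type) spans.
def bio_to_spans(labels):
    """Convert a B-I-O label sequence into typed spans."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels):
        # close an open span unless this label continues it with I-<same type>
        if start is not None and not (label.startswith("I-") and label[2:] == etype):
            spans.append((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    if start is not None:                      # span running to the end
        spans.append((start, len(labels), etype))
    return spans

labels = "O O O B-PERSON I-PERSON O O O".split()
print(bio_to_spans(labels))   # [(3, 5, 'PERSON')]
```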

Constituent Parsing. (S (NP the man) (VP walked (PP to (NP the park)))) [tree diagram of the same parse, with POS tags DT NN VBD IN DT NN on the words]. Key: S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, DT = determiner, NN = noun, VBD = verb (past tense), IN = preposition.

Dependency Parsing. The $ is a "wall" symbol marking the start of the sentence: $ konnten sie es übersetzen ? ($ could you translate it ?)

Coreference Resolution. As we head towards training camp, the Philadelphia Eagles have finally filled most of their needs on offense. One of the main goals for this off-season was to find weapons for the team's franchise quarterback, Carson Wentz. The Eagles needed a wide receiver who could stretch the field and give Wentz the opportunity to throw the long ball. They signed receiver Torrey Smith to a 3-year deal. While the signing of Smith was huge for the team, the biggest signing the Eagles made was former Chicago Bears receiver Alshon Jeffery. He had a solid 5-year stint in Chicago, but as the team started to fall apart, Jeffery was forced to explore other options.

Coreference Resolution. Input: a document. Output: a set of mentions (textual spans in the document), and memberships of those mentions in clusters.

Semantic Role Labeling. Who did what to whom, and where?

The police officer | detained | the suspect | at the scene of the crime
Agent (ARG0) | Predicate (V) | Theme (ARG2) | Location (AM-LOC)

Input: a sentence. Output: one span in the sentence identified as a predicate, and a set of other spans identified as particular roles for that predicate. (J&M/SLP3)

Supervised Word Alignment. Given parallel sentences, predict word alignments. (Brown et al., 1990)

Machine Translation. Phrase-based model (Koehn et al., 2003). Example phrase pairs for the sentence pair:

konnten : could
sie : you
es : it
übersetzen : translate
konnten sie : could you
es übersetzen : translate it
sie es übersetzen : you translate it
? : ?

Input: a sentence in the source language. Output: a segmentation of the source sentence into segments, a translation of each segment, and an ordering of the translations.

Key Categories of Structured Prediction. I think of structured prediction methods in two primary categories: score-based and search-based.

Score-Based Structured Prediction. Focus on defining the score function of the structured input/output pair. In dependency parsing, this is called graph-based parsing because minimum spanning tree algorithms can be used to find the globally-optimal max-scoring tree.

Search-Based Structured Prediction. Focus on the procedure for searching through the structured output space (usually involves simple greedy or beam search). Design a classifier to score a small number of decisions at each position in the search; this classifier can use information about the current state as well as the entire history of the search. In dependency parsing, this is called transition-based parsing because it consists of greedily, sequentially deciding what parsing decision to make.

Structured Prediction to make SP practical, we need to decompose the SP problem into parts this is true whether we are going to use search-based or score-based SP score-based: score function decomposes additively into scores of parts search-based: search factors into a sequence of decisions, each one adding a part to the final output structure 47