Loss-augmented Structured Prediction

Loss-augmented Structured Prediction CMSC 723 / LING 723 / INST 725 Marine Carpuat Figures, algorithms & equations from CIML chap 17

POS tagging Sequence labeling with the perceptron Sequence labeling problem Input: sequence of tokens x = [x_1 ... x_L], of variable length L Output (aka label): sequence of tags y = [y_1 ... y_L], with K possible tags Size of output space? K^L Structured perceptron The perceptron algorithm can be used for sequence labeling, but there are challenges: How to compute the argmax efficiently? What are appropriate features? Approach: leverage the structure of the output space
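
As a reminder, the structured perceptron update itself is unchanged from the binary case once an argmax routine is available. A minimal sketch (phi and argmax_fn are assumed interfaces, filled in by the following slides, not code from the course):

```python
def perceptron_epoch(data, phi, argmax_fn, w):
    """One pass of the structured perceptron over (x, y) training pairs.

    phi(x, y)       -> feature vector for an input/output pair (numpy array)
    argmax_fn(x, w) -> highest-scoring output sequence under weights w
    Both are assumed to be defined elsewhere (e.g., Viterbi for argmax_fn).
    """
    for x, y in data:
        y_hat = argmax_fn(x, w)                 # predict with current weights
        if list(y_hat) != list(y):              # on a mistake...
            w = w + phi(x, y) - phi(x, y_hat)   # ...move toward gold, away from prediction
    return w
```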

Solving the argmax problem for sequences with dynamic programming Efficient algorithms are possible if the feature function decomposes over the input This holds for the unary and Markov features used for POS tagging

Feature functions for sequence labeling Standard features for POS tagging Unary features: # times word w has been labeled with tag l, for all words w and all tags l Markov features: # times tag l is adjacent to tag l′ in the output, for all tags l and l′ Size of the feature representation is constant with respect to input length
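
As a sketch, the unary and Markov features of a tagged sentence can be collected as a sparse count map (the other snippets here assume this has been mapped to a fixed-length vector through a feature index, which is omitted):

```python
from collections import Counter

def phi(x, y):
    """Sparse feature map for a tagged sentence: unary (word, tag) counts
    plus Markov (previous tag, tag) counts. A sketch; real taggers add
    richer features (prefixes, suffixes, capitalization, ...)."""
    feats = Counter()
    prev = "<s>"                                # assumed start-of-sequence tag
    for word, tag in zip(x, y):
        feats[("unary", word, tag)] += 1        # word labeled with this tag
        feats[("markov", prev, tag)] += 1       # adjacent tag pair
        prev = tag
    return feats
```

The number of distinct features depends on the vocabulary and tag set, not on the sentence length L, which is what makes the representation constant with respect to input length.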

Solving the argmax problem for sequences Trellis for sequence labeling Any path through the trellis represents a labeling of the input sentence (gold-standard path shown in red) Each edge receives a weight such that adding the weights along a path gives the score of that input/output configuration Any max-weight path algorithm can find the argmax, e.g. the Viterbi algorithm, O(LK^2)
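
A minimal Viterbi sketch in Python, assuming an edge_score(l, k_prev, k) function that returns the trellis edge weight defined on the next slide (illustrative names, not code from the course):

```python
import numpy as np

def viterbi(edge_score, L, K):
    """Max-weight path through the tagging trellis, O(L * K^2).

    edge_score(l, k_prev, k) -> weight of the edge entering position l
    with tag k, coming from tag k_prev at position l-1 (k_prev is None
    at l = 0). A sketch under these assumed interfaces.
    """
    alpha = np.full((L, K), -np.inf)    # best prefix score ending in tag k
    back = np.zeros((L, K), dtype=int)  # backpointers for recovering the path
    for k in range(K):
        alpha[0, k] = edge_score(0, None, k)
    for l in range(1, L):
        for k in range(K):
            scores = [alpha[l - 1, kp] + edge_score(l, kp, k) for kp in range(K)]
            back[l, k] = int(np.argmax(scores))
            alpha[l, k] = scores[back[l, k]]
    # follow backpointers from the best final tag
    y = [int(np.argmax(alpha[L - 1]))]
    for l in range(L - 1, 0, -1):
        y.append(back[l, y[-1]])
    return list(reversed(y))
```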

Defining the weight of an edge in the trellis Unary features at position l together with Markov features that end at position l Weight of the edge that goes from time l−1 to time l and transitions from tag y_{l−1} to tag y_l

Dynamic program Define α_l(k) as the score of the best possible output prefix up to and including position l that labels the l-th word with tag k With decomposable features, the alphas can be computed recursively
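
In symbols, writing α_l(k) for that score and w_l(k′, k) for the trellis edge weight defined above, the recurrence is (a sketch consistent with this setup; ⟨s⟩ is an assumed start symbol):

```latex
\alpha_1(k) = w_1(\langle s\rangle, k), \qquad
\alpha_l(k) = \max_{k'} \big[ \alpha_{l-1}(k') + w_l(k', k) \big] \quad (l = 2, \dots, L)
```

The best total score is max_k α_L(k), and the argmax sequence is recovered by following backpointers, as in the Viterbi sketch above.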

A more general approach to the argmax: Integer Linear Programming ILP: an optimization problem of the form max_z a·z for a fixed vector a, subject to linear constraints, with the additional requirement that z be integer-valued Pro: can leverage well-engineered solvers (e.g., Gurobi) Con: not always the most efficient approach

POS tagging as ILP Markov features become binary indicator variables z Constraints enforce well-formed solutions Output sequence: y(z) is obtained by reading off the variables z Define a such that a·z equals the score of y(z)
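
Written out, one standard encoding (a sketch in the spirit of CIML's construction; the slide's exact formulation may differ) uses binary variables z_{l,k′,k} that are 1 iff position l−1 carries tag k′ and position l carries tag k, and a vector a with a_{l,k′,k} equal to the weighted feature score of that transition:

```latex
\hat z \;=\; \arg\max_{z}\; a \cdot z
\quad\text{subject to}\quad
z_{l,k',k} \in \{0,1\},\qquad
\sum_{k',\,k} z_{l,k',k} = 1 \;\;\forall l,\qquad
\sum_{k'} z_{l,k',k} \;=\; \sum_{k''} z_{l+1,k,k''} \;\;\forall l,k
```

The first constraint says each position makes exactly one transition; the second makes consecutive transitions agree on the shared tag, so that a well-formed tag sequence y(z) can be read off and a·z equals its score.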

Sequence labeling Structured perceptron A general algorithm for structured prediction problems such as sequence labeling The Argmax problem Efficient argmax for sequences with Viterbi algorithm, given some assumptions on feature structure A more general solution: Integer Linear Programming Loss-augmented structured prediction Training algorithm Loss-augmented argmax

In the structured perceptron, all errors are equally bad

But not all bad output sequences are equally bad Hamming loss Gives a more nuanced evaluation of the output than the 0-1 loss Consider two candidate outputs ŷ = [A, A, A, A] and ŷ′ = [N, V, N, N]
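
The Hamming loss counts per-position tag errors rather than scoring the whole sequence as simply right or wrong:

```latex
\ell^{\mathrm{Ham}}(y, \hat y) \;=\; \sum_{l=1}^{L} \mathbf{1}\big[\, y_l \neq \hat y_l \,\big]
```

Under this loss, two incorrect candidates can receive different penalties depending on how many positions each gets wrong, whereas the 0-1 loss treats every imperfect output the same.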

Loss functions for structured prediction Recall learning as optimization for classification Let's define a structure-aware optimization objective, e.g., the structured hinge loss It is 0 if the true output's score beats the score of every imposter output; otherwise it scales linearly with the score difference between the most confusing imposter and the true output
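
One standard way to write it (a sketch consistent with the description above; φ is the joint feature function and ℓ^Ham the Hamming loss):

```latex
\ell^{\mathrm{s\text{-}h}}(y, x; w) \;=\; \max_{\hat y}\Big[\, \ell^{\mathrm{Ham}}(y, \hat y) \;+\; w \cdot \phi(x, \hat y) \,\Big] \;-\; w \cdot \phi(x, y)
```

Taking ŷ = y inside the max shows the loss is never negative; it is 0 exactly when the true output outscores every imposter by at least that imposter's Hamming loss, and otherwise it grows linearly with the score gap to the most confusing imposter.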

Optimization: stochastic subgradient descent Subgradients of structured hinge loss?

Optimization: stochastic subgradient descent Subgradients of the structured hinge loss: a subgradient is given by the feature vector of the loss-augmented argmax minus the feature vector of the true output
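
Since the structured hinge loss is a max over functions that are linear in w, a subgradient comes from whichever candidate attains the max, i.e., the loss-augmented argmax (a sketch of this standard result):

```latex
\hat y \;=\; \arg\max_{\bar y}\Big[\, \ell^{\mathrm{Ham}}(y, \bar y) + w \cdot \phi(x, \bar y) \,\Big],
\qquad
\phi(x, \hat y) - \phi(x, y) \;\in\; \partial_w\, \ell^{\mathrm{s\text{-}h}}(y, x; w)
```

When the loss-augmented argmax coincides with the true output, the subgradient is zero and no update is needed.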

Optimization: stochastic subgradient descent Resulting training algorithm Only 2 differences compared to structured perceptron!
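
A minimal sketch of the resulting training loop (loss_augmented_argmax and phi are the assumed interfaces from the earlier sketches, not code from the slides):

```python
import numpy as np

def train_subgradient(data, phi, loss_augmented_argmax, n_features, epochs=10, eta=0.1):
    """Stochastic subgradient descent on the structured hinge loss (a sketch).

    Compared to the structured perceptron, the two changes are:
      1. the search step uses the LOSS-AUGMENTED argmax, and
      2. updates are scaled by a learning rate eta instead of being unit steps.
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in data:
            y_hat = loss_augmented_argmax(x, y, w)   # includes Hamming loss in the score
            g = phi(x, y_hat) - phi(x, y)            # subgradient of the hinge term
            w -= eta * g                             # zero step if y_hat == y
        # eta could be decayed between epochs
    return w
```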

Loss-augmented inference/search Recall dynamic programming solution without Hamming loss

Loss-augmented inference/search Dynamic programming with Hamming loss We can use the Viterbi algorithm as before, as long as the loss function decomposes over the input consistently with the features!
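
Concretely, because the Hamming loss decomposes over positions, it can be folded into the trellis edge weights: every edge entering position l with tag k simply picks up an extra term 1[y_l ≠ k]. The earlier recurrence becomes (same notation, a sketch):

```latex
\alpha_l(k) \;=\; \mathbf{1}\big[\, y_l \neq k \,\big] \;+\; \max_{k'} \big[ \alpha_{l-1}(k') + w_l(k', k) \big]
```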

Sequence labeling Structured perceptron A general algorithm for structured prediction problems such as sequence labeling The Argmax problem Efficient argmax for sequences with Viterbi algorithm, given some assumptions on feature structure A more general solution: Integer Linear Programming Loss-augmented structured prediction Training algorithm Loss-augmented argmax

Syntax & Grammars From Sequences to Trees

Syntax & Grammar Syntax: from Greek syntaxis, meaning 'setting out together'; refers to the way words are arranged together. Grammar: the set of structural rules governing the composition of clauses, phrases, and words in any given natural language Descriptive, not prescriptive Panini's grammar of Sanskrit, ~2000 years ago

Syntax and Grammar Goal of syntactic theory: explain how people combine words to form sentences and how children attain knowledge of sentence structure Grammar: the implicit knowledge of a native speaker, acquired without explicit instruction, minimally able to generate all and only the possible sentences of the language [Phillips, 2003]

Syntax in NLP Syntactic analysis often a key component in applications Grammar checkers Dialogue systems Question answering Information extraction Machine translation

Two views of syntactic structure Constituency (phrase structure) Phrase structure organizes words in nested constituents Dependency structure Shows which words depend on (modify or are arguments of) which other words

Constituency Basic idea: groups of words act as a single unit Constituents form coherent classes that behave similarly With respect to their internal structure: e.g., at the core of a noun phrase is a noun With respect to other constituents: e.g., noun phrases generally occur before verbs

Constituency: Example The following are all noun phrases in English... Why? They can all precede verbs They can all be preposed/postposed

Grammars and Constituency For a particular language: What is the right set of constituents? What rules govern how they combine? Answer: not obvious, and difficult. That's why there are many different theories of grammar and competing analyses of the same data! Our approach: focus primarily on the machinery

Context-Free Grammars Context-free grammars (CFGs) Aka phrase structure grammars Aka Backus-Naur form (BNF) Consist of Rules Terminals Non-terminals

Context-Free Grammars Terminals We'll take these to be words Non-terminals The constituents in a language (e.g., noun phrase) Rules Consist of a single non-terminal on the left and any number of terminals and non-terminals on the right

An Example Grammar
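
As an illustration (a toy grammar in the usual rule notation, not the grammar from the slide):

S → NP VP
NP → Det Noun | Pronoun
VP → Verb NP | Verb NP PP
PP → Prep NP
Det → the   Noun → letter | shelf   Pronoun → they   Verb → hid   Prep → on

Under this grammar, "they hid the letter on the shelf" can be bracketed as [S [NP they] [VP hid [NP the letter] [PP on [NP the shelf]]]], the bracket-notation equivalent of its parse tree.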

Parse Tree: Example Note: equivalence between parse trees and bracket notation

Dependency Grammars CFGs focus on constituents Non-terminals don't actually appear in the sentence In dependency grammar, a parse is a graph (usually a tree) where: Nodes represent words Edges represent dependency relations between words (typed or untyped, directed or undirected)

Dependency Grammars Syntactic structure = lexical items linked by binary asymmetrical relations called dependencies

Dependency Relations

Example Dependency Parse They hid the letter on the shelf Compare with constituent parse What's the relation?
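
One plausible analysis (assuming UD-style relation names, not the parse drawn on the slide): hid is the root; They is its nominal subject (nsubj); letter is its object (obj), with the as determiner (det); shelf attaches either to hid or to letter (the familiar PP-attachment ambiguity), with on as a case marker and the as its determiner. As for the relation to the constituent parse: a dependency tree can be read off a constituency tree by choosing a head word for each constituent (head rules) and linking the remaining words to it.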

Universal Dependencies project Set of dependency relations that are Linguistically motivated Computationally useful Cross-linguistically applicable [Nivre et al. 2016] Universaldependencies.org

Summary Syntax & Grammar Two views of syntactic structures Context-Free Grammars Dependency grammars Can be used to capture various facts about the structure of language (but not all!) Treebanks as an important resource for NLP