Lecture 2: Mixing Compositional Semantics and Machine Learning


Lecture 2: Mixing Compositional Semantics and Machine Learning Kyle Richardson kyle@ims.uni-stuttgart.de April 14, 2016

Plan. Main paper: Liang and Potts 2015 (conceptual basis of the class); secondary: Mooney 2007 (semantic parsing big ideas), Domingos 2012 (remarks about ML). 2

Classical Semantics vs. Statistical Semantics (caricature). Logical Semantics: logic, algebra, set theory; compositional analysis, beyond words, inference, brittle. Statistical Semantics: optimization, algorithms, geometry; distributional analysis, word-based, grounded, shallow. The two types of approaches share the long-term vision of achieving deep natural language understanding... 3

Montague-style Compositional Semantics Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies 4

Montague-style Compositional Semantics Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies (λx.(study x))(john) (study john ) {True, False} john John (λx.(study x)) studies 4

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) 5

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> fun_application(study, 'bill') ## What will we get? 5

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> fun_application(study, 'bill') ## What will we get? False 5

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> fun_application(study, 'mary') ## What will we get? 6

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. john John (λx.(study x)) studies >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> fun_application(study, 'mary') ## What will we get? True 6

Montague-style Compositional Semantics Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not 7

Montague-style Compositional Semantics Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not (λx.(not (study x)))(bill) bill (λx.(not (study x))) Bill (λf.λx.(not (f x))) does not (λx.(study x)) study 7

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> neg = lambda F : (lambda x : not F(x)) 8

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> neg = lambda F : (lambda x : not F(x)) >>> neg(study)('bill') # True 8

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> neg = lambda F : (lambda x : not F(x)) >>> neg(study)('bill') # True >>> fun_application(neg, study)('bill') 8

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> neg = lambda F : (lambda x : not F(x)) >>> neg(study)('bill') # True >>> fun_application(neg, study)('bill') >>> fun_application(fun_application(neg, study), 'bill') 8

A mini functional interpreter (python) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not >>> students_studying = set(['john', 'mary']) >>> study = lambda x : x in students_studying >>> fun_application = lambda fun, val : fun(val) >>> neg = lambda F : (lambda x : not F(x)) >>> neg(study)('bill') # True >>> fun_application(neg, study)('bill') >>> fun_application(fun_application(neg, study), 'bill') >>> neg(neg(study))('bill') # False 8

Montague-style Compositional Semantics: What's needed Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures. 9

Montague-style Compositional Semantics: Issues Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not Features and (Computational) Issues: compositional, provides a full analysis. supports further inferencing 10

Montague-style Compositional Semantics: Issues Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not Features and (Computational) Issues: compositional, provides a full analysis. supports further inferencing issue: Does not provide an analysis of words (not grounded). 10

Montague-style Compositional Semantics: Issues Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not Features and (Computational) Issues: compositional, provides a full analysis. supports further inferencing. issue: Does not provide an analysis of words (not grounded). issue: Is brittle, cannot handle uncertainty. 10

Montague-style Compositional Semantics: Issues Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: Bill does not study. bill Bill (λx.(study x)) study (λf.λx.(not (f x))) does not Features and (Computational) Issues: compositional, provides a full analysis. supports further inferencing. issue: Does not provide an analysis of words (not grounded). issue: Is brittle, cannot handle uncertainty. issue: Says nothing about how the translation to logic works. 10

Statistical Approaches to Semantics Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean. Turney et al. (2010) corpus: The furry dog is walking outside... / The shiny car is driving... / A furry cat is walking around... / A shiny bike is driving... word-context matrix (columns: furry, walking, shiny, driving): dog = [10, 20, 0, 0], cat = [12, 25, 2, 0], car = [0, 0, 23, 26], bike = [0, 1, 30, 25] 11

Statistical Approaches to Semantics Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean. Turney et al. (2010) corpus word-context matrix (columns: furry, walking, shiny, driving): dog = [4, 20, 0, 0], cat = [3, 25, 2, 0], car = [0, 0, 5, 26], bike = [1, 1, 4, 25] 12

Example Tasks and Applications: Turney et al. (2010) Statistical semantic models are often used in downstream classification or clustering tasks/applications. Term-document matrices Document retrieval/clustering/classification. Question Answering and Retrieval. Essay scoring. Word-Context Matrices Word similarity/clustering/classification Word-sense disambiguation Automatic thesaurus generation/paraphrasing Pair-pair matrices Relational similarity/clustering/classification. Analogy comparison. 13

Statistical Approaches to Semantics Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean. Turney et al. (2010) corpus: The furry dog is walking outside... / The shiny car is driving... / A furry cat is walking around... / A shiny bike is driving... word-context matrix (columns: furry, walking, shiny, driving): dog = [10, 20, 0, 0], cat = [12, 25, 2, 0], car = [0, 0, 23, 26], bike = [0, 1, 30, 25] Features and Issues (caricature): Robust, requires little manual effort, grounded. Can provide rich analysis of content words. 14

Statistical Approaches to Semantics Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean. Turney et al. (2010) corpus: The furry dog is walking outside... / The shiny car is driving... / A furry cat is walking around... / A shiny bike is driving... word-context matrix (columns: furry, walking, shiny, driving): dog = [10, 20, 0, 0], cat = [12, 25, 2, 0], car = [0, 0, 23, 26], bike = [0, 1, 30, 25] Features and Issues (caricature): Robust, requires little manual effort, grounded. Can provide rich analysis of content words. issue: Hard to scale beyond words. 14

Statistical Approaches to Semantics Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean. Turney et al. (2010) corpus: The furry dog is walking outside... / The shiny car is driving... / A furry cat is walking around... / A shiny bike is driving... word-context matrix (columns: furry, walking, shiny, driving): dog = [10, 20, 0, 0], cat = [12, 25, 2, 0], car = [0, 0, 23, 26], bike = [0, 1, 30, 25] Features and Issues (caricature): Robust, requires little manual effort, grounded. Can provide rich analysis of content words. issue: Hard to scale beyond words. issue: In general, hard to model logical operations, shallow. 14
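
To make the distributional idea concrete, here is a minimal sketch (an illustration, not from the lecture) that compares two rows of the word-context count matrix above with cosine similarity; the helper names are ours.

import math

# Toy word-context counts (columns: furry, walking, shiny, driving), as on the slide.
counts = {
    'dog':  [10, 20, 0, 0],
    'cat':  [12, 25, 2, 0],
    'car':  [0, 0, 23, 26],
    'bike': [0, 1, 30, 25],
}

def cosine(u, v):
    # cosine similarity between two count vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

>>> cosine(counts['dog'], counts['cat'])   # high: the two words share contexts
>>> cosine(counts['dog'], counts['car'])   # 0.0: no shared contexts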

Mixing compositional and statistical semantics Desiderata: Want a model of semantics that is robust, reflects real-world usage, and is learnable, but one that is also compositional. 15

Mixing compositional and statistical semantics Desiderata: Want a model of semantics that is robust, reflects real-world usage, and is learnable, but one that is also compositional. Generalization 15

Mixing compositional and statistical semantics Desiderata: Want a model of semantics that is robust, reflects real-world usage, and is learnable, but one that is also compositional. Generalization Logical semantics: generalize using composition and abstract recursive structures. 15

Mixing compositional and statistical semantics Desiderata: Want a model of semantics that is robust, reflects real-world usage, and is learnable, but one that is also compositional. Generalization Logical semantics: generalize using composition and abstract recursive structures. Machine Learning (classification): learns generalizations through real-world examples (e.g. target input-output) 15

Mixing compositional and statistical semantics Desiderata: Want a model of semantics that is robust, reflects real-world usage, and is learnable, but one that is also compositional. Generalization Logical semantics: generalize using composition and abstract recursive structures. Machine Learning (classification): learns generalizations through real-world examples (e.g. target input-output) Bridge: get our learning to target compositional structures. 15

A simple model: Liang and Potts Model: a simple discriminative learning framework. compositional model: (semantic) context-free grammar. learning model: linear classification and first-order optimization. 16

Compositional Model: Linguistic Objects: < u, s, d > u: utterance. s: semantic representation of u (a logical form). d: denotation of s (written ⟦s⟧). 17

Compositional Model: Linguistic Objects: < u, s, d > u: utterance. s: semantic representation of u (a logical form). d: denotation of s (written ⟦s⟧). Example: < seven minus five, (- 7 5), 2 > 17

Compositional Model: Linguistic Objects: < u, s, d > u: utterance. s: semantic representation of u (a logical form). d: denotation of s (written ⟦s⟧). Example: < seven minus five, (- 7 5), 2 > < minus times, (* (- 2 2) 2), 0 > 17

Compositional Model: Linguistic Objects: < u, s, d > u: utterance. s: semantic representation of u (a logical form). d: denotation of s (written ⟦s⟧). Example: < seven minus five, (- 7 5), 2 > < minus times, (* (- 2 2) 2), 0 > semantic parsing: u → s 17

Compositional Model: Linguistic Objects: < u, s, d > u: utterance. s: semantic representation of u (a logical form). d: denotation of s (written ⟦s⟧). Example: < seven minus five, (- 7 5), 2 > < minus times, (* (- 2 2) 2), 0 > semantic parsing: u → s interpretation: s → d 17
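
As a minimal sketch of the interpretation step s → d (our own illustration: the logical forms above are encoded here as nested Python tuples), a tiny recursive evaluator suffices.

# Hypothetical tuple encoding of logical forms, e.g. (- 7 5) becomes ('-', 7, 5).
ops = {'+': lambda x, y: x + y, '-': lambda x, y: x - y, '*': lambda x, y: x * y}

def interpret(s):
    # atomic expressions (numbers) denote themselves
    if isinstance(s, (int, float)):
        return s
    op, left, right = s
    return ops[op](interpret(left), interpret(right))

>>> interpret(('-', 7, 5))             # 2
>>> interpret(('*', ('-', 2, 2), 2))   # 0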

Computational Modeling: The full picture. Standard processing pipeline: input: List samples that contain every major element → Semantic Parsing → sem: (FOR EVERY X / MAJORELT : T; (FOR EVERY Y / SAMPLE : (CONTAINS Y X); (PRINTOUT Y))) → Interpretation (against a Knowledge Representation of the world) → ⟦sem⟧ = {S10019, S10059, ...}. Lunar QA system (Woods (1973)) 18

Compositional Model: Context-free grammar provides the background grammar and interpretation rules 19

Compositional Model: Context-free grammar provides the background grammar and interpretation rules example: u = times plus three N: (plus (mult 2 2) 3) N : (mult 2 2) R : plus N : 3 R : mult plus three times 20

Compositional Model: Context-free grammar provides the background grammar and interpretation rules example: u = times plus three N: (plus (mult 2 2) 3) N : (mult 2 2) R : plus N : 3 R : mult plus three times >>> plus = lambda x,y : x + y >>> mult = lambda x,y : x * y 20

Compositional Model: Context-free grammar provides the background grammar and interpretation rules example: u = times plus three N: (plus (mult 2 2) 3) N : (mult 2 2) R : plus N : 3 R : mult plus three times >>> plus = lambda x,y : x + y >>> mult = lambda x,y : x * y >>> plus(2,2) # 4 20

Compositional Model: Context-free grammar provides the background grammar and interpretation rules example: u = times plus three N: (plus (mult 2 2) 3) N : (mult 2 2) R : plus N : 3 R : mult plus three times >>> plus = lambda x,y : x + y >>> mult = lambda x,y : x * y >>> plus(plus(2,3),2) # 7 21

Compositional Model: Context-free grammar provides the background grammar and interpretation rules example: u = times plus three N: (plus (mult 2 2) 3) N : (mult 2 2) R : plus N : 3 R : mult plus three times >>> plus = lambda x,y : x + y >>> mult = lambda x,y : x * y >>> plus(mult(2,2),3) # 7 22

Compositional Model: Components Components: Grammar rules for building syntactic structure. 23

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. 23

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures (later lecture) 23

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures (later lecture) Rule extraction (later lecture) 23

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures (later lecture) Rule extraction (later lecture) Issues: 23

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures (later lecture) Rule extraction (later lecture) Issues: example: u = times plus three N: (plus (mult 2 2) 3) N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times 23

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures (later lecture) Issues: example: u = times plus three N: (plus (mult 2 2) 3) N: (mult (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : mult N : 3 R : mult plus three R : mult plus three times times 24

Compositional Model: Components Components: Grammar rules for building syntactic structure. Interpretation rules for composing meaning. Decoding algorithm for generating structures (later lecture) Issues: example: u = times plus three N: (plus (mult 2 2) 3) N: (mult 2 (plus 2 3)) N : (mult 2 2) R : plus N : 3 R : mult N : (plus 2 3) R : mult plus three times R: plus N : 3 times plus three 25
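
The ambiguity illustrated above can be made concrete with a small sketch (a naive enumerator for this one utterance pattern, not the CFG machinery of the model; the ambiguous lexicon and the two bracketings are assumptions for illustration): several derivations compete for the same utterance, and they denote different things.

from itertools import product

# Hypothetical ambiguous lexicon: each operator word might denote either operation.
lexicon = {'times': ['mult', 'plus'], 'plus': ['plus', 'mult']}
ops = {'plus': lambda x, y: x + y, 'mult': lambda x, y: x * y}

def interpret(s):
    if isinstance(s, int):
        return s
    op, a, b = s
    return ops[op](interpret(a), interpret(b))

def candidates(utterance="times plus three"):
    # enumerate logical forms for the pattern 'OP1 OP2 three' with an implicit (2 2),
    # trying both lexical choices and both bracketings
    w1, w2, _ = utterance.split()
    for r1, r2 in product(lexicon[w1], lexicon[w2]):
        yield (r2, (r1, 2, 2), 3)    # left bracketing:  (OP2 (OP1 2 2) 3)
        yield (r1, 2, (r2, 2, 3))    # right bracketing: (OP1 2 (OP2 2 3))

>>> for lf in candidates():
...     print(lf, "=>", interpret(lf))   # e.g. ('plus', ('mult', 2, 2), 3) => 7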

Learning Model Goal: Helps us learn the correct derivations and handle uncertainty (word mappings, composition). Classifier: a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. Domingos (2012). 26

Learning Model Goal: Helps us learn the correct derivations and handle uncertainty (word mappings, composition). Classifier: a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. Domingos (2012). Components 26

Learning Model Goal: Helps us learn the correct derivations and handle uncertainty (word mappings, composition). Classifier: a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. Domingos (2012). Components training data D = {(x_i, y_i)}_{i=1..n} 26

Learning Model Goal: Helps us learn the correct derivations and handle uncertainty (word mappings, composition). Classifier: a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. Domingos (2012). Components training data D = {(x_i, y_i)}_{i=1..n} feature representation of data 26

Learning Model Goal: Helps us learn the correct derivations and handle uncertainty (word mappings, composition). Classifier: a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. Domingos (2012). Components training data D = {(x_i, y_i)}_{i=1..n} feature representation of data scoring and objective function 26

Learning Model Goal: Helps us learn the correct derivations and handle uncertainty (word mappings, composition). Classifier: a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. Domingos (2012). Components training data D = {(x_i, y_i)}_{i=1..n} feature representation of data scoring and objective function optimization procedure 26

Training data Goal: Find the correct derivations and output using our compositional model 27

Training data Goal: Find the correct derivations and output using our compositional model Logical forms (more information) (u = minus times, s = (* (- 2 2) 2)) 27

Training data Goal: Find the correct derivations and output using our compositional model Logical forms (more information) (u = minus times, s = (* (- 2 2) 2)) Denotations (less information) (u = minus times, r = 0) 27

Training data Goal: Find the correct derivations and output using our compositional model Logical forms (more information) (u = minus times, s = (* (- 2 2) 2)) Denotations (less information) (u = minus times, r = 0) Weakly Supervised: In both cases, details are still hidden from the learner. 27

Learning from Semantic Representations example: ( times plus three, (plus (mult 2 2) 3)) N: (plus (mult 2 2) 3) N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times 28

Learning from Semantic Representations example: ( times plus three, (plus (mult 2 2) 3)) N: (plus (mult 2 2) 3) N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times Trade-off: More information (good) but more annotation (bad) 28

Learning from Denotations example: ( times plus three, 7) N: (plus (mult 2 2) 3) N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times 29

Learning from Denotations example: ( times plus three, 7) N: (plus (mult 2 2) 3) N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times Trade-off: Less annotation (good) but less information (maybe bad) 29

Weak Supervision Goal: Find the correct derivations and output using our compositional model Logical forms (more information) (u = minus times, s = (* (- 2 2) 2)) Denotations (less information) (u = minus times, r = 0) Current learning methods for NLP require annotating large corpora with supervisory information...[e.g. pos tags, syntactic parse trees, semantic role labels]... Building such corpora is an expensive, arduous task. As one moves towards deeper semantic analysis the annotation task becomes increasingly more difficult and complex. Mooney (2008) 30

Feature Representations: General Remark At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Domingos (2012) 31

Feature selection and overfitting What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality... This problem is called overfitting. Domingos (2012) Bias: Tendency to consistently learn the wrong thing. Variance: Tendency to learn random things irrespective of the real signal. 32

Good vs. Bad Feature Selection 33

Feature Extraction Example input: x = times plus three. y_1 = N: (plus (mult 2 2) 3) y_2 = N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times φ(x, y_1) = { R:mult[times]: 1, R:plus[plus]: 1, top[R:plus]: 1, ... } φ(x, y_2) = { R:plus[times]: 1, R:plus[plus]: 1, top[R:plus]: 1, ... } 34
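
A minimal sketch of what a feature function φ might compute (the derivation encoding and helper names below are our own assumptions, chosen to mirror the features on the slide): count which grammar rule rewrote which word, plus which rule heads the derivation.

from collections import Counter

def features(derivation):
    # derivation is assumed to be a pair (rule-word pairs, top rule), e.g.
    # ([('R:mult', 'times'), ('R:plus', 'plus')], 'R:plus')
    rules, top = derivation
    phi = Counter()
    for rule, word in rules:
        phi["%s[%s]" % (rule, word)] += 1   # rule-word co-occurrence features
    phi["top[%s]" % top] += 1               # rule at the top of the derivation
    return phi

>>> features(([('R:mult', 'times'), ('R:plus', 'plus')], 'R:plus'))
# Counter({'R:mult[times]': 1, 'R:plus[plus]': 1, 'top[R:plus]': 1})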

Scoring Function (Linear) Score Function: Score_w(x, y) = w · φ(x, y) = Σ_{j=1}^{d} w_j φ_j(x, y) 35

Scoring Function (Linear) Score Function: Score_w(x, y) = w · φ(x, y) = Σ_{j=1}^{d} w_j φ_j(x, y) weight vector w = [w_1 = 0.1, w_2 = 0.2, w_3 = 0.0, ...] 35

Scoring Function (Linear) Score Function: Score_w(x, y) = w · φ(x, y) = Σ_{j=1}^{d} w_j φ_j(x, y) weight vector w = [w_1 = 0.1, w_2 = 0.2, w_3 = 0.0, ...] φ(x, y_2): w_1 ↔ R:plus[times] = 1, w_2 ↔ R:plus[plus] = 1, w_3 ↔ top[R:plus] = 1, ... Score_w(x, y_2) = w · φ(x, y_2) = (0.1 × 1.0) + (0.2 × 1.0) + (0.0 × 1.0) = 0.3 35

Scoring Function (Linear) Score Function: Score_w(x, y) = w · φ(x, y) = Σ_{j=1}^{d} w_j φ_j(x, y) weight vector w = [w_1 = 0.1, w_2 = 0.2, w_3 = 0.0, ...] prediction: arg-max_{y ∈ Y} Score_w(x, y) 36
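
A minimal sketch of the linear score and the arg-max prediction (assuming sparse dictionaries for w and φ, and the features function sketched above):

def score(w, phi):
    # Score_w(x, y) = w · φ(x, y) over sparse dictionaries
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def predict(w, candidate_derivations):
    # arg-max over the candidate derivations Y for one input
    return max(candidate_derivations, key=lambda y: score(w, features(y)))

>>> w = {'R:plus[times]': 0.1, 'R:plus[plus]': 0.2, 'top[R:plus]': 0.0}
>>> score(w, {'R:plus[times]': 1, 'R:plus[plus]': 1, 'top[R:plus]': 1})   # 0.3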

Objectives: What do we want to learn? (informal) General Idea: we want to learn a model (or weight vector) that can distinguish correct and incorrect derivations. y_1 = N: (plus (mult 2 2) 3) y_2 = N: (plus (plus 2 2) 3) N : (mult 2 2) R : plus N : 3 N : (plus 2 2) R : plus N : 3 R : mult plus three R : plus plus three times times φ(x, y_1) = { R:mult[times]: 1, R:plus[plus]: 1, top[R:plus]: 1, ... } φ(x, y_2) = { R:plus[times]: 1, R:plus[plus]: 1, top[R:plus]: 1, ... } 37

Objectives: What do we want to learn? (informal) General Idea: we want to learn a model (or weight vector) that can distinguish correct and incorrect derivations. y_1 = N: (plus (mult 2 2) 3) N: (mult 2 (plus 2 3)) N : (mult 2 2) R : plus N : 3 R : mult N : (plus 2 3) R : mult plus three times R: plus N : 3 times plus three φ(x, y_1) = { R:mult[times]: 1, R:plus[plus]: 1, plus[R:mult]: 1, ... } φ(x, y_2) = { R:plus[times]: 1, R:plus[plus]: 1, mult[R:plus]: 1, ... } 38

Objectives: What do we want to learn? (formal) hinge loss (learning from logical forms): min_{w ∈ R^d} Σ_{(x,y) ∈ D} [ max_{y' ∈ Y} (Score_w(x, y') + c(y, y')) − Score_w(x, y) ] example: ( minus times, s = (* (- 2 2) 2)) 39

Objectives: What do we want to learn? (formal) hinge loss (learning from logical forms): min_{w ∈ R^d} Σ_{(x,y) ∈ D} [ max_{y' ∈ Y} (Score_w(x, y') + c(y, y')) − Score_w(x, y) ] example: ( minus times, s = (* (- 2 2) 2)) In English: select parameters that minimize the cumulative loss over the training data. 39

Objectives: What do we want to learn? (formal) hinge loss (learning from logical forms): min_{w ∈ R^d} Σ_{(x,y) ∈ D} [ max_{y' ∈ Y} (Score_w(x, y') + c(y, y')) − Score_w(x, y) ] example: ( minus times, s = (* (- 2 2) 2)) In English: select parameters that minimize the cumulative loss over the training data. Missing: A decoding algorithm for generating Y (not trivial, Y might be very large). 39
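
A minimal sketch of the per-example hinge loss above (assuming the score and features sketches from earlier, an explicitly enumerated candidate set, and a simple 0/1 cost c):

def zero_one_cost(gold, y):
    # c(y, y'): 0 for the gold derivation, 1 otherwise (one simple choice)
    return 0.0 if y == gold else 1.0

def hinge_loss(w, gold, candidate_derivations, cost=zero_one_cost):
    # max_{y'} [Score_w(x, y') + c(y, y')] - Score_w(x, y) for one training pair
    augmented = max(score(w, features(y)) + cost(gold, y) for y in candidate_derivations)
    return augmented - score(w, features(gold))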

Optimization: How do I achieve this objective? Stochastic gradient descent: An online learning and optimization algorithm (more about this in future lectures). 40
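
A minimal sketch of one stochastic (sub)gradient step on that objective (our own illustration; the learning rate eta and the cost-augmented arg-max over an enumerated candidate set are assumptions): find the most violating candidate and move the weights toward the gold features and away from it.

def sgd_step(w, gold, candidate_derivations, cost=zero_one_cost, eta=0.1):
    # cost-augmented prediction: the currently most violating derivation
    y_hat = max(candidate_derivations,
                key=lambda y: score(w, features(y)) + cost(gold, y))
    if y_hat != gold:
        phi_gold, phi_hat = features(gold), features(y_hat)
        for f in set(phi_gold) | set(phi_hat):
            # subgradient of the hinge is φ(x, y_hat) - φ(x, gold); step against it
            w[f] = w.get(f, 0.0) - eta * (phi_hat.get(f, 0) - phi_gold.get(f, 0))
    return w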

Optimization: Illustration 41

Learning Model Components training data: D = {(x_i, y_i)}_{i=1..n} 42

Learning Model Components training data: D = {(x_i, y_i)}_{i=1..n} feature representation of data 42

Learning Model Components training data: D = {(x_i, y_i)}_{i=1..n} feature representation of data scoring and objective function 42

Learning Model Components training data: D = {(x_i, y_i)}_{i=1..n} feature representation of data scoring and objective function optimization procedure 42

Learning Model Components training data: D = {(x_i, y_i)}_{i=1..n} feature representation of data scoring and objective function optimization procedure Important Ideas What kind of data do we learn from? (differs quite a bit) What kind of features do we need? 42

Experimentation and Evaluation Training Set: a portion of the data to train the model on. Test Set: an unseen portion of the data to evaluate on. Dev Set (optional): an unseen portion of the data for analysis, tuning hyperparameters, etc. 43

Experimentation and Evaluation Training Set: a portion of the data to train the model on. Test Set: an unseen portion of the data to evaluate on. Dev Set (optional): an unseen portion of the data for analysis, tuning hyperparameters, etc. Evaluation 1: Given unseen examples, how often does my model produce the correct output semantic representation? Evaluation 2: Given unseen examples, how often does my model produce the correct output answer? 43
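
A minimal sketch of the two evaluations on a held-out set (assuming predict(w, candidates(x)) returns the model's best logical form for utterance x and interpret maps a logical form to its answer, as in the sketches above):

def evaluate(w, test_data, candidates):
    # test_data: list of (utterance, gold logical form, gold answer) triples
    lf_correct = answer_correct = 0
    for x, gold_lf, gold_answer in test_data:
        y = predict(w, candidates(x))
        lf_correct += (y == gold_lf)                      # Evaluation 1: exact logical form
        answer_correct += (interpret(y) == gold_answer)   # Evaluation 2: correct answer
    n = float(len(test_data))
    return lf_correct / n, answer_correct / n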

Conclusions and Takeaways Presented a simple model that mixes machine learning and compositional semantics. Conceptually describes most of the work in this class. Technically describes many of the models we will use. Fundamental Problem: Which semantic representations do we use, and what do we learn from? 44

Conclusions and Takeaways Presented a simple model that mixes machine learning and compositional semantics. Conceptually describes most of the work in this class. Technically describes many of the models we will use. Fundamental Problem: Which semantic representations do we use, and what do we learn from? Question: Does this particular model actually work? 44

Conclusions and Takeaways Presented a simple model that mixes machine learning and compositional semantics. Conceptually describes most of the work in this class. Technically describes many of the models we will use. Fundamental Problem: Which semantic representations do we use, and what do we learn from? Question: Does this particular model actually work? Yes! Liang et al. (2011) (lecture 5), Berant et al. (2013); Berant and Liang (2014) (presentation papers) 44

Roadmap Lecture 2: Lecture 3: Lecture 4: Lecture 5: rule extraction, decoding (parsing perspective) rule extraction, decoding (MT perspective) structured classification and prediction. grounded learning (might skip). 45

References
Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP 2013, pages 1533-1544.
Berant, J. and Liang, P. (2014). Semantic parsing via paraphrasing. In ACL (1), pages 1415-1425.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10):78-87.
Liang, P., Jordan, M. I., and Klein, D. (2011). Learning dependency-based compositional semantics. In Proceedings of ACL 2011, pages 590-599.
Mooney, R. (2008). Learning to connect language and perception. In Proceedings of AAAI 2008.
Turney, P. D., Pantel, P., et al. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141-188.
Woods, W. A. (1973). Progress in natural language understanding: an application to lunar geology. In Proceedings of the June 4-8, 1973, National Computer Conference and Exposition, pages 441-450. 46