
Lecture 19: Language Acquisition II Professor Robert C. Berwick berwick@csail.mit.edu

The Menu Bar
Administrivia: lab 5-6 due this Weds!
Language acquisition, the Gold standard & basic results, or: the (Evil) Babysitter is Here (apologies to Dar Williams)
- Informal version
- Formal version
- Can we meet the Gold standard?
- What about probabilistic accounts? Stochastic CFGs & Bayesian learning

Conservative Strategy
Baby's hypothesis should always be the smallest language consistent with the data. Works for finite languages? Let's try it.
Language 1: {aa, ab, ac}
Language 2: {aa, ab, ac, ad, ae}
Language 3: {aa, ac}
Language 4: {ab}
Babysitter says -> Baby guesses:
aa -> L3
ab -> L1
ac -> L1
ab -> L1
aa -> L1
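
A minimal sketch of this conservative strategy in Python (illustrative code and names, not from the lecture), reproducing the guesses in the table above:

LANGUAGES = {
    "L1": {"aa", "ab", "ac"},
    "L2": {"aa", "ab", "ac", "ad", "ae"},
    "L3": {"aa", "ac"},
    "L4": {"ab"},
}

def conservative_guess(heard):
    """Smallest language in the family consistent with everything heard so far."""
    consistent = [name for name, lang in LANGUAGES.items() if set(heard) <= lang]
    return min(consistent, key=lambda name: len(LANGUAGES[name]))

heard = []
for sentence in ["aa", "ab", "ac", "ab", "aa"]:   # the Babysitter's text
    heard.append(sentence)
    print(sentence, "->", conservative_guess(heard))
# Prints L3 after "aa", then L1 for every later sentence, matching the table.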

Evil Babysitter
To find out whether Baby is perfect, we have to see whether it gets 100% correct even in the most adversarial conditions.
Assume Babysitter is trying to fool Baby, although she must speak only sentences from L_T, and she must eventually speak each such sentence.
Does C-Baby's strategy work on every possible fair sequence for every possible language? In the finite-number-of-languages case, yes. Why?

A learnable ("identifiable") family of languages
Family of languages: let L_n = the set of all strings of length < n, over some fixed alphabet Σ = {a, b}. What is L_0? What is L_1? What is L_n?
Let the family be L = {L_0, L_1, ..., L_n}.
No matter which L_i is the target, can Babysitter really follow the rules? She must eventually speak every sentence of the target L_i. Is this possible? Yes, enumerate in order of length: ε; a, b; aa, ab, ba, bb; aaa, aab, aba, abb, baa, ...
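
That enumeration is easy to produce; here is a small illustrative Python generator (not lecture code) that yields every string over Σ = {a, b} in length order, so every sentence is eventually spoken:

from itertools import count, product

def fair_enumeration(alphabet=("a", "b")):
    """Yield '', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', ... -- every string appears eventually."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

text = fair_enumeration()
print([next(text) for _ in range(7)])   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']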

An Unlearnable Family of Languages: the so-called superfinite family
Let L_n = the set of all strings of length < n. What is L_0? What is L_1? What is L_∞ (all strings, of any length)?
Our (infinite) family is L = {L_0, L_1, ..., L_n, ..., L_∞}.
A perfect C-Baby must be able to distinguish among all of these depending on a finite amount of input. But there is no perfect C-Baby.

An Unlearnable Family
Our class is L = {L_0, L_1, ..., L_∞}.
C-Baby adopts the conservative strategy, always picking the smallest possible language in L. So if Babysitter's longest sentence so far has 75 words, Baby's hypothesis is L_76.
This won't always work for all languages in L. What language can't a conservative Baby learn?
So, C-Baby cannot always pick the smallest possible language and win.

An Unlearnable Family
Could a non-conservative baby be a perfect Baby, and eventually converge to any of the languages in the family?
Claim: any perfect C-Baby must be "quasi-conservative": if the true language is L_76, and Baby posits something else, Baby must still eventually come back and guess L_76 (since it's perfect). So if the longest sentence so far is 75 words, and Babysitter keeps talking from L_76, then eventually Baby must actually return to the conservative guess L_76. Agreed?

The Evil Babysitter
If the longest sentence so far is 75 words, and Babysitter keeps talking from L_76, then eventually a perfect C-Baby must actually return to the conservative guess L_76.
But suppose the true target language is L_∞. Evil Babysitter can then prevent our supposedly perfect C-Baby from converging to it in the limit.
If Baby ever guesses L_∞, say when the longest sentence is 75 words: then Evil Babysitter keeps talking from L_76 until Baby capitulates and revises her guess to L_76, as any perfect C-Baby must. So Baby has not stayed at L_∞, as required. Then Babysitter can go ahead with longer sentences. If Baby ever guesses L_∞ again, she plays the same trick again (and again).
Conclusion: there is no perfect Baby that is guaranteed to converge to L_0, L_1, ..., or L_∞ as appropriate. If C-Baby always succeeds on the finite languages, Evil Babysitter can trick it on the infinite language; if C-Baby succeeds on the infinite L_∞, then Evil Babysitter can force it never to learn the finite L_n's.
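
To make the flavor of the argument concrete, here is a toy Python illustration (my own example, not lecture code): a learner that correctly identifies every nonempty finite L_k from text will keep changing its mind forever on a text for L_∞.

from itertools import count, product

def enumerate_strings(alphabet=("a", "b")):
    """A fair text for L_inf: every string over the alphabet, in length order."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def guess(history):
    """Guess L_inf right after a record-breaking (longest-yet) sentence,
    otherwise the conservative L_{max+1}. This learner identifies every
    nonempty finite L_k on any fair text for L_k, because the record
    eventually stops breaking."""
    longest = max(len(s) for s in history)
    if len(history) > 1 and len(history[-1]) > max(len(s) for s in history[:-1]):
        return "L_inf"
    return f"L_{longest + 1}"

history, guesses = [], []
text = enumerate_strings()
for _ in range(16):
    history.append(next(text))
    guesses.append(guess(history))
print(guesses)
# ['L_1', 'L_inf', 'L_2', 'L_inf', 'L_3', 'L_3', 'L_3', 'L_inf', 'L_4', ...]
# The guesses never settle: every new longest sentence triggers a flip to L_inf
# and then a retreat, so this learner never identifies L_inf -- exactly the
# trade-off the theorem forces.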

What does this result imply? Any family of languages that includes all the finite languages plus at least one infinite language (a so-called superfinite family) is not identifiable in the limit from positive-only evidence. This includes the family of all finite-state languages, the family of all context-free languages, etc.

Is this too adversarial? Should we assume Babysitter is evil? Maybe she is more like Google. Perhaps Babysitter isn't trying to fool the baby: not an adversarial situation.

Formally: Notation & definitions

The locking sequence (Evil Babysitter) theorem: after the locking sequence has been seen, the learner lives happily ever after inside the sphere of radius ε around g.

Proof

Construct Evil babysitter text

To get the classical result for exact identification, use the 0-1 metric (radius 1/2).
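
For reference, the classical locking-sequence lemma (Blum & Blum, 1975) in the exact-identification setting can be stated as follows:

% Locking-sequence lemma (exact identification from text)
\textbf{Lemma.} Suppose learner $A$ identifies the language $L$ in the limit from every fair text for $L$.
Then there is a finite sequence $\sigma$ of sentences of $L$ (a \emph{locking sequence}) such that:
(i) $A(\sigma)$ is a grammar for $L$; and
(ii) for every finite sequence $\tau$ of sentences drawn from $L$, $A(\sigma\,\tau) = A(\sigma)$,
% i.e., appending $\tau$ after $\sigma$ never changes the guess.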

Classic Gold Result ("Superfinite theorem")
Proof: by contradiction. Suppose A can identify the family L. Therefore A can identify the infinite language L_∞. Therefore there is a finite locking sequence for L_∞; call it σ_inf. But L' = range(σ_inf) is a finite language, and so L' ∈ L. Then t = σ_inf extended by repeating its own k = length(σ_inf) sentences forever is a fair text for L'. Since A learns L' on all fair texts for L', it must converge to (a grammar for) L' on t. But every sentence of t comes from L_∞ and t extends the locking sequence σ_inf, so A never moves off its guess for L_∞ ≠ L'. Therefore A does not identify L', a contradiction.

Extensions reveal the breadth of Gold's result

What happens if we go probabilistic?
Everyone always complains about the Gold results: Gold is too stringent about the way texts are used (identification on all texts). Suppose we relax this to get measure-1 learnability. Upshot: this does not enlarge the class of learnable languages unless...
Two senses: (1) distribution-free (the modern sense), where we pay attention to complexity; (2) some assumed distribution (e.g., exponentially declining, as for CFGs).
What is different? For (2), not much:

What if we make the grammars probabilistic?
Horning, 1969: the class of unambiguous probabilistic CFGs is learnable in the limit. [Why unambiguous?]
Intuition: since the probability of long sentences becomes vanishingly small, in effect the language is finite. If Baby hasn't heard a sentence beyond a certain length/complexity, they never will. (This idea can be pursued in other ways.)
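
One way to see the intuition (an illustrative bound, not from the slide): if sentence length falls off geometrically, almost all the probability mass sits on short sentences.

% Assume P(|s| = n) \le c\, r^{n} for some constants c > 0 and 0 < r < 1. Then
P(|s| > N) \;=\; \sum_{n > N} P(|s| = n) \;\le\; c \sum_{n > N} r^{n} \;=\; \frac{c\, r^{N+1}}{1 - r},
% which vanishes exponentially fast in N: with high probability the learner never
% needs to distinguish hypotheses that differ only on very long sentences.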

Punchline
What about the class of probabilistic CFGs? Suppose Babysitter has to output sentences randomly with the appropriate probabilities (what does that mean?). Is s/he then unable to be too evil?
Are there then perfect Babies that are guaranteed to converge to an appropriate probabilistic CFG? I.e., from hearing a finite number of sentences, Baby can correctly converge on a grammar that predicts an infinite number of sentences.
But only if Baby knows the probability distribution function of the sentences a priori (Angluin). Even then, what is the complexity (# of examples, time)?

Learning probabilistically

And the envelope please

Beyond this: PAC learning (probably approximately correct)

Learning Probabilistic Grammars: Horning
Need a criterion to select among grammars. Horning uses a Bayesian approach.
To develop this idea, we need the idea of a grammar-grammar, that is, a grammar that itself generates the family of possible grammars. If the grammar-grammar is probabilistic, it defines a probability distribution over the class of grammars it generates.
The complexity of a grammar G is then defined as -log_2 p(G).

Horning's approach II
In this metric, the more variables (nonterminals) a generated grammar has, the more alternatives there are for each, or the longer those alternatives are, the smaller the grammar's probability and the greater its complexity.
This provides a metric for selecting the simplest grammar compatible with the data seen so far. Example:

Example grammar-grammar
Let G be the probabilistic grammar-grammar with the following productions, which generates regular grammars with 1 or 2 variables (S, A) and 1 or 2 terminal symbols (a, b). Quoted symbols are terminals of the grammar-grammar: the names and arrow that appear in the generated grammar.
1. S → R [0.5]
2. S → R R [0.5]
3. R → N "→" P [1.0]
4. P → A [0.5]
5. P → P, A [0.5]
6. A → T [0.5]
7. A → T N [0.5]
8. T → a [0.5]
9. T → b [0.5]
10. N → "S" [0.5]
11. N → "A" [0.5]

Example: left-most derivation of a sentence (= a grammar)
The grammar (sentence) is:
S → b, bS, aA
A → a, bA, aS
or, as one sentence: S → b, bS, aA A → a, bA, aS
This takes 27 (left-most) steps in the grammar-grammar!

Derivation of the grammar from the grammar-grammar
S ⇒ R R [0.5]
⇒ N→P R [1.0]
⇒ S→P R
⇒ S→P,A R
⇒ S→P,A,A R
⇒ S→A,A,A R
⇒ S→T,A,A R
⇒ S→b,A,A R
⇒ S→b,TN,A R
⇒ S→b,bN,A R

Derivation of the grammar (continued)
⇒ S→b,bS,A R
⇒ S→b,bS,TN R
⇒ S→b,bS,aN R
⇒ S→b,bS,aA R
⇒ S→b,bS,aA N→P [1.0]
⇒ S→b,bS,aA A→P
⇒ S→b,bS,aA A→P,A
⇒ S→b,bS,aA A→P,A,A
⇒ S→b,bS,aA A→A,A,A
⇒ S→b,bS,aA A→T,A,A
⇒ S→b,bS,aA A→a,A,A
⇒ S→b,bS,aA A→a,TN,A
⇒ S→b,bS,aA A→a,bN,A
⇒ S→b,bS,aA A→a,bA,A
⇒ S→b,bS,aA A→a,bA,TN
⇒ S→b,bS,aA A→a,bA,aN
⇒ S→b,bS,aA A→a,bA,aS
Whew!! 27 steps. Done. p(G) = (0.5)^25 (25 choices at probability 0.5; the two applications of the R rule have probability 1), so the complexity is -log_2 p(G) = 25.
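
A two-line check of that arithmetic (illustrative Python, hypothetical function name):

import math

def grammar_complexity(rule_probs):
    """Complexity in bits: -log2 of the product of the probabilities of the rules used."""
    return -sum(math.log2(p) for p in rule_probs)

# 27 left-most steps: 2 forced applications of the only R-rule (prob 1.0), 25 choices at prob 0.5
print(grammar_complexity([1.0] * 2 + [0.5] * 25))   # 25.0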

Note that if we change the productions of the grammar-grammar, we can change what the output grammars look like. For example, if we change Rule 7, A → T N [0.5], so that its probability is lower, then we penalize the length of a right-hand side.
Now we can play the Bayesian game, since we can compute the prior probability of each grammar, as above, from its generation probability. We can also compute the probability of a sentence if we assign probabilities to each production in the generated grammar, in the usual way (viz., CGW or lab 5/6); assume these to be uniform at first.

Horning's Bayesian game
The prior probability of a grammar G_i in the hypothesis space is denoted p(G_i).
The probability of an input sequence of sentences S_j given a grammar G_i is denoted p(S_j | G_i) and is just the product of the probabilities that G_i generated each sentence s_1, s_2, ..., s_k, i.e., p(s_1 | G_i) ... p(s_k | G_i).
But we want to find the probability that G_i is really the correct grammar, given the data sequence S_j, i.e., we want to find p(G_i | S_j) [the posterior probability].
Now we can use Bayes' rule, which determines this as:
p(G_i | S_j) = p(G_i) p(S_j | G_i) / p(S_j)

We want the best (highest-probability) G_i given the data sequence S_j:
argmax_{G_i} p(G_i | S_j) = argmax_{G_i} p(G_i) p(S_j | G_i) / p(S_j) = argmax_{G_i} p(G_i) p(S_j | G_i)   (since p(S_j) is constant)
And we can compute this! We just need to search through all the grammars and find the one that maximizes this. Can this be done? Horning has a general result for unambiguous CFGs; for a more recent (2011) approach that works with 2 simple grammar types & child language, see Perfors et al.
Note: again, the G's only approach the best G with increasing likelihood.

Another view of this: the maximize-posterior-probability view, argmax_{G_i} p(G_i) p(S_j | G_i).
Now let's assume: (1) that p(G_i) ∝ 2^(-|G_i|), so that smaller grammars are more probable; (2) by Shannon's source coding theorem, the optimal encoding length of the data S_j w.r.t. grammar G_i approaches -log_2 p(S_j | G_i).
Then maximizing this posterior probability, after taking -log_2, is equivalent to finding the minimum of:
|G_i| - log_2 p(S_j | G_i)
This is usually called minimum description length (MDL): we want to find the shortest (smallest) grammar plus the encoding of the data using that grammar.
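
A minimal sketch of that score in Python (hypothetical names; the grammar-size term stands in for whatever encoding of grammars one adopts):

import math

def mdl_score(grammar_size_bits, sentence_probs):
    """|G| - sum_j log2 p(s_j | G): bits to describe the grammar plus bits to
    encode the data with that grammar. Smaller is better."""
    data_bits = -sum(math.log2(p) for p in sentence_probs)
    return grammar_size_bits + data_bits

# Toy comparison: a larger grammar that fits the observed sentences well
# vs. a tiny "anything goes" grammar that spreads its probability too thin.
print(mdl_score(25, [0.02, 0.01, 0.02]))   # ~42.9 bits
print(mdl_score(5,  [1e-6, 1e-6, 1e-6]))   # ~64.8 bits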

The most restrictive grammar just lists the observed utterances → only the observed data is grammatical, so the data gets a high probability (but the grammar is large).
A simple grammar could be made that allowed any sentence → the grammar would have a high probability → but the data a very low one.
MDL finds a middle ground between always generalizing and never generalizing.

Complexity and Probability
More complex grammar → longer coding length, so lower probability.
More restrictive grammar → fewer choices for the data, so each possibility has a higher probability.

Minimum description length as a criterion has a long pedigree: Chomsky, 1949, Morphophonemics of Modern Hebrew.
So, this MDL criterion was there from the start: minimize the grammar size, and minimize the length of the exceptions that can't be compressed by the grammar, plus the data that can be.

What about actually learning stochastic CFGs?
Basic idea (from Suppes, 1970, onwards to Perfors et al.): start with uniform probabilities on the rules, then adjust according to maximum-likelihood counts to find the best p(G | D).
Use a search method, because exhaustive search using Horning's idea has too many possibilities. The standard search method for finding the maximum likelihood is expectation-maximization (EM). The measure of merit is how well the grammar predicts the sentences.

Idea: Learn PCFGs with EM
Classic experiments on learning PCFGs with Expectation-Maximization [Lari and Young, 1990]:
- Full binary grammar over n symbols {X_1, X_2, ..., X_n} (all rules of the form X_A → X_B X_C)
- Parse uniformly/randomly at first
- Re-estimate rule expectations off of the parses
- Repeat
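
A brute-force illustration of that loop in Python (an illustrative sketch with made-up symbol names, not the Lari and Young implementation): the E-step simply enumerates every binary parse tree, which is fine only for toy inputs; a real implementation uses the cubic-time Inside-Outside algorithm mentioned on the next slide.

from collections import defaultdict

NONTERMS = ("X1", "X2")
START = "X1"

def all_trees(words):
    """Every labeled binary tree over words: ('lex', A, w) or ('bin', A, left, right)."""
    if len(words) == 1:
        for A in NONTERMS:
            yield ("lex", A, words[0])
        return
    for split in range(1, len(words)):
        for left in all_trees(words[:split]):
            for right in all_trees(words[split:]):
                for A in NONTERMS:
                    yield ("bin", A, left, right)

def tree_prob(tree, probs):
    if tree[0] == "lex":
        return probs.get((tree[1], tree[2]), 0.0)
    _, A, left, right = tree
    return (probs.get((A, left[1], right[1]), 0.0)
            * tree_prob(left, probs) * tree_prob(right, probs))

def add_counts(tree, weight, counts):
    if tree[0] == "lex":
        counts[(tree[1], tree[2])] += weight
        return
    _, A, left, right = tree
    counts[(A, left[1], right[1])] += weight
    add_counts(left, weight, counts)
    add_counts(right, weight, counts)

def em_step(sentences, probs):
    counts = defaultdict(float)
    for words in sentences:                       # E-step: posterior-weighted rule counts
        parses = [t for t in all_trees(words) if t[1] == START]
        total = sum(tree_prob(t, probs) for t in parses)
        for t in parses:
            p = tree_prob(t, probs)
            if p > 0:
                add_counts(t, p / total, counts)
    lhs_total = defaultdict(float)                # M-step: renormalize per left-hand side
    for rule, c in counts.items():
        lhs_total[rule[0]] += c
    return {rule: c / lhs_total[rule[0]] for rule, c in counts.items()}

sentences = [s.split() for s in ["the dog bites a man", "a dog bites a bone"]]
vocab = sorted({w for s in sentences for w in s})
uniform = 1.0 / (len(NONTERMS) ** 2 + len(vocab))    # equal weight for every rule
probs = {(A, B, C): uniform for A in NONTERMS for B in NONTERMS for C in NONTERMS}
probs.update({(A, w): uniform for A in NONTERMS for w in vocab})
for _ in range(5):                                   # "repeat"
    probs = em_step(sentences, probs)
print(sorted(probs.items(), key=lambda kv: -kv[1])[:5])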

Re-estimation of PCFGs
The basic quantity needed for re-estimation with EM is the expected number of times each rule is used in parses of the observed sentences. This can be calculated in cubic time with the Inside-Outside algorithm.
Consider an initial grammar where all productions have equal weight: then all trees have equal probability initially. Therefore, after one round of EM, the posterior over trees will (in the absence of random perturbation) be approximately uniform over all trees, and symmetric over symbols.
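
In standard inside-outside notation (not copied from the slide), with inside probabilities β_A(i,j) = P(A ⇒* w_i ... w_j) and outside probabilities α_A(i,j), that expected count for a sentence w_1 ... w_n is:

\mathbb{E}\big[\mathrm{count}(A \to B\,C) \mid w_{1\ldots n}\big]
  \;=\; \frac{1}{P(w_{1\ldots n})}
        \sum_{1 \le i \le k < j \le n}
          \alpha_A(i,j)\; P(A \to B\,C)\; \beta_B(i,k)\; \beta_C(k{+}1,j)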

An Example of a run: learning English vs. German
Initial grammar (all weights 1.0):
1.0 S1 → S
1.0 S → NP VP
1.0 NP → Det N
1.0 VP → V
1.0 VP → V NP
1.0 VP → NP V
1.0 VP → V NP NP
1.0 VP → NP NP V
1.0 Det → the    1.0 N → the    1.0 V → the
1.0 Det → a      1.0 N → a      1.0 V → a
1.0 Det → dog    1.0 N → dog    1.0 V → dog
1.0 Det → man    1.0 N → man    1.0 V → man
1.0 Det → bone   1.0 N → bone   1.0 V → bone
1.0 Det → bites  1.0 N → bites  1.0 V → bites
1.0 Det → gives  1.0 N → gives  1.0 V → gives

Example sentences fed in:
the dog bites a man
the man bites a dog
a man gives the dog a bone
the dog gives a man the bone
a dog bites a bone

What is this doing? Does it always work so well?

Resulting grammar:
1        S1 → S
1        S → NP VP
1        NP → Det N
0.6      VP → V NP
0.4      VP → V NP NP
0.416667 Det → the
0.583333 Det → a
0.416667 N → dog
0.333333 N → man
0.25     N → bone
0.6      V → bites
0.4      V → gives
But this is not surprising!
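
Why it is not surprising: under the intended Det/N/V analysis, those numbers are just relative frequencies in the five training sentences. A quick check (illustrative Python; the word-class assignment is assumed here, not learned):

from collections import Counter

sentences = [
    "the dog bites a man",
    "the man bites a dog",
    "a man gives the dog a bone",
    "the dog gives a man the bone",
    "a dog bites a bone",
]
DETS, NOUNS = {"the", "a"}, {"dog", "man", "bone"}

det, noun, verb = Counter(), Counter(), Counter()
for s in sentences:
    for w in s.split():
        if w in DETS:
            det[w] += 1
        elif w in NOUNS:
            noun[w] += 1
        else:
            verb[w] += 1

for label, counts in [("Det", det), ("N", noun), ("V", verb)]:
    total = sum(counts.values())
    for w, k in counts.items():
        print(f"{k / total:.6f} {label} -> {w}")
# 0.416667 Det -> the, 0.583333 Det -> a, 0.416667 N -> dog, 0.333333 N -> man,
# 0.250000 N -> bone, 0.600000 V -> bites, 0.400000 V -> gives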

What is this doing? Does it always work so well?

walking on ice
(A) is the right structure. Why? Can a stochastic CFG learning algorithm find (A), rather than the other structures? In fact, this turns out to be hard. The SCFG picks (E)! Why? The entropy of (A) turns out to be higher (worse) than that of (E)-(H). A learner that uses this criterion will go wrong.