Formal Language Theory

Gerhard Jäger, University of Tübingen
Workshop "Artificial Grammar Learning and Formal Language Theory" (AGL Workshop), Nijmegen, November 23, 2010

Formal Language Theory
- Formal language: a set of strings over a finite vocabulary; the set itself may be finite or infinite.
- Formal Language Theory (FLT): a collection of mathematical and algorithmic tools for
  - defining formal languages (with finite means)
  - processing formal languages (recognizing, parsing, translating)
- FLT is not about
  - the semantics of formal languages
  - the statistical properties of formal languages
- Initiated by Chomsky in the 1950s to motivate generative grammar.
- Plays an important role in formal linguistics and theoretical computer science.
- A recent new domain of application is bioinformatics.

The Chomsky Hierarchy
- Formal grammar (FG): a finite specification of a formal language.
- Chomsky defined a general format for FGs: string rewriting systems.
- A string rewriting system essentially consists of
  - a set of rewrite rules $\alpha \to \beta$ ($\alpha$ and $\beta$ are strings of symbols)
  - a designated start symbol $S$
- A derivation starts with $S$ and applies rewrite rules to substrings until no further rule can be applied.
- The language defined by a grammar is the set of strings that can be derived this way. [1]

[1] I am skipping over the (at this point) inessential distinction between non-terminal and terminal symbols.
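A derivation in a string rewriting system can be sketched in a few lines of Python. This is only an illustration: the two rules S → aSb and S → ab are a hypothetical toy grammar that does not appear on the slides, and the length cut-off is sound only because neither rule shortens the string.

```python
from collections import deque

# Hypothetical toy rule set (not from the slides): S -> aSb, S -> ab.
RULES = [("S", "aSb"), ("S", "ab")]

def successors(form, rules=RULES):
    """All strings obtained from `form` by rewriting one occurrence of a left-hand side."""
    out = set()
    for lhs, rhs in rules:
        start = form.find(lhs)
        while start != -1:
            out.add(form[:start] + rhs + form[start + len(lhs):])
            start = form.find(lhs, start + 1)
    return out

def derived_strings(start="S", max_len=6, rules=RULES):
    """Strings derivable from `start` to which no further rule applies.

    Forms longer than `max_len` are discarded; this is safe here only
    because the toy rules never shrink a string.
    """
    seen, queue, finished = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        nxt = successors(form, rules)
        if not nxt:  # no rule applies: the derivation has ended
            finished.add(form)
            continue
        for s in nxt:
            if len(s) <= max_len and s not in seen:
                seen.add(s)
                queue.append(s)
    return sorted(finished, key=len)

print(derived_strings())  # ['ab', 'aabb', 'aaabbb']
```

The function `derived_strings` follows the definition on the slide literally: a string belongs to the language once the derivation cannot continue.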

The Chomsky Hierarchy
- The format of string rewriting systems is very general: every (formal) language that can be defined algorithmically can be defined by an FG in this sense.
- Chomsky Hierarchy: a hierarchy of ever more restricted versions of FGs, which defines a hierarchy of formal languages:
  1. Type 0: recursively enumerable
  2. Type 1: context-sensitive
  3. Type 2: context-free (phrase structure)
  4. Type 3: regular (finite state)

The Chomsky Hierarchy
Type-0 grammars and recursively enumerable languages
- no restrictions on the general format of rewrite rules
- equivalent to Turing machines
- describe all languages that can be defined algorithmically
Examples:
- Peano arithmetic
- set of all numbers that are the sum of two primes
- set of first-order theorems
- set of equivalent pairs of regular expressions with exponentiation (decidable but not context-sensitive)

The Chomsky Hierarchy
Context-sensitive grammars and languages
- restriction on the format of rewrite rules: rules are non-shrinking, i.e. for every rule $\alpha \to \beta$: $\mathrm{length}(\alpha) \le \mathrm{length}(\beta)$
- this ensures decidability
- the membership problem is PSPACE-hard in the worst case
Examples:
- set of all primes
- set of all square numbers
- copy language
- $a^n b^m c^n d^m$
- triple-copy language ($\{w^3 \mid w \in \Sigma^*\}$)
- $a^n b^n c^n$
- $a^n b^n c^n d^n e^n$
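Why non-shrinking rules ensure decidability can be illustrated with a brute-force membership test: since no rule shortens a string, every intermediate form longer than the target can be discarded, which leaves a finite search space. The grammar below for $a^n b^n c^n$ (n ≥ 1) is a common textbook-style choice and an assumption of this sketch; it is not taken from the slides. Upper-case letters play the role of non-terminals.

```python
from collections import deque

# Assumed non-shrinking grammar for a^n b^n c^n, n >= 1 (not from the slides):
# S -> abc | aSBc,  cB -> Bc,  bB -> bb
RULES = [("S", "abc"), ("S", "aSBc"), ("cB", "Bc"), ("bB", "bb")]

def is_non_shrinking(rules):
    """Check length(lhs) <= length(rhs) for every rule."""
    return all(len(lhs) <= len(rhs) for lhs, rhs in rules)

def derivable(target, rules=RULES, start="S"):
    """Decide membership by exhaustive search over forms no longer than `target`."""
    if not is_non_shrinking(rules):
        raise ValueError("the length bound is only valid for non-shrinking rules")
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                # Forms never shrink, so anything longer than the target is a dead end.
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False

print(derivable("aabbcc"), derivable("aabcbc"))  # True False
```

The search is exponential in general, which is in line with the PSPACE-hardness mentioned on the slide; it shows decidability, not efficiency.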

The Chomsky Hierarchy
Context-free grammars and languages
- further restriction on the rule format: the left-hand side contains exactly one symbol, $A \to \alpha$
- the membership problem is decidable in cubic time
Examples:
- mirror language
- $a^n b^n$
- $a^n b^m c^m d^n$
- well-formed parentheses
- algebraic expressions
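One standard way to obtain a cubic-time membership test is the CYK algorithm, which the slides do not name; it is used here as an illustration. The sketch assumes a grammar in Chomsky normal form for $a^n b^n$ (n ≥ 1); the grammar and all names are hypothetical.

```python
from itertools import product

# Assumed grammar in Chomsky normal form for a^n b^n, n >= 1:
# S -> A B | A X,  X -> S B,  A -> a,  B -> b
BINARY = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}
LEXICAL = {"a": {"A"}, "b": {"B"}}

def cyk(word, start="S"):
    """CYK recognition: three nested loops over positions give O(n^3) steps."""
    n = len(word)
    if n == 0:
        return False  # this grammar does not derive the empty string
    # table[i][j] holds the non-terminals that derive word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(LEXICAL.get(ch, ()))
    for span in range(2, n + 1):            # length of the substring
        for i in range(n - span + 1):       # start of the substring
            for split in range(1, span):    # length of the left part
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for pair in product(left, right):
                    table[i][span - 1] |= BINARY.get(pair, set())
    return start in table[0][n - 1]

print(cyk("aaabbb"), cyk("aab"))  # True False
```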

The Chomsky Hierarchy
Regular grammars and languages
- further restriction on the rule format: the right-hand side contains at most one non-terminal symbol, preceding all terminal symbols
- terminal symbols: symbols that never occur on the left-hand side of a rule
- $A \to (B)\alpha$, where $\alpha$ is a string of terminal symbols
- the membership problem is decidable in linear time
Examples:
- $a^n b^m$
- set of multiples of 4
- set of natural numbers that leave a remainder of 3 when divided by 4
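The two number-theoretic examples can be recognized by a four-state automaton that reads each symbol once. The sketch below assumes that numbers are written in binary, most significant digit first; the slides do not fix an encoding, so this is an assumption.

```python
def remainder_mod4(bits):
    """Run a 4-state automaton over a binary numeral; the state is the value mod 4."""
    state = 0
    for ch in bits:
        if ch not in "01":
            raise ValueError(f"not a binary digit: {ch!r}")
        # Appending a digit doubles the value read so far and adds the digit.
        state = (2 * state + int(ch)) % 4
    return state

def is_multiple_of_4(bits):
    return remainder_mod4(bits) == 0

def leaves_remainder_3(bits):
    return remainder_mod4(bits) == 3

print(is_multiple_of_4("1100"), leaves_remainder_3("111"))  # True True
```

Each input symbol triggers exactly one state transition, which is where the linear time bound comes from.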

NL and the Chomsky Hierarchy
Where are natural languages located?
- a hotly contested issue over several decades
- typical argument:
  - find a recursive construction C in a natural language L
  - argue that the competence of speakers admits unlimited recursion (while performance certainly imposes an upper limit)
  - reduce C to a formal language L′ of known complexity via homomorphisms
  - make a case that L must be at least as complex as L′
  - extrapolate to all human languages: if there is one language which is at least as complex as ..., then the human language faculty must allow it in general

NL and the Chomsky Hierarchy
- Chomsky 1957: English is not regular.
- The following constructions can be arbitrarily embedded into each other:
  - If S1, then S2.
  - Either S3 or S4.
  - The man that said that S5 is arriving today.
- Therefore, Chomsky says, English cannot be regular:
  "It is clear, then that in English we can find a sequence a + S1 + b, where there is a dependency between a and b, and we can select as S1 another sequence c + S2 + d, where there is a dependency between c and d ... etc. A set of sentences that is constructed in this way ... will have all of the mirror image properties of [the mirror language] which exclude [the mirror language] from the set of finite state languages." (Chomsky 1957)

NL and the Chomsky Hierarchy
Closure properties of regular languages
- Theorem 1: If $L_1$ and $L_2$ are regular languages, then $L_1 \cap L_2$ is also a regular language.
- Theorem 2: The class of regular languages is closed under homomorphism.
- Theorem 3: The class of regular languages is closed under inversion (reversal of strings).

NL and the Chomsky Hierarchy
- The argument is formally questionable because "either" may occur without "or", "or" without "either", "if" without "then", and "then" without "if".
- The logic of the argument is correct, though; it can be made formally watertight with, e.g., neither-nor constructions:
  Neither did John claim that he neither smokes while... nor snores, nor did anybody believe it.
- English has (in principle) an unlimited number of nested dependencies of unbounded length.

NL and the Chomsky Hierarchy
- homomorphism:
  - neither $\mapsto a$
  - nor $\mapsto b$
  - everything else $\mapsto \varepsilon$
- Example: "If it neither rains nor snows, then if it rains then it snows." $\mapsto ab$
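The homomorphism can be applied mechanically, word by word. The following is a minimal sketch; the tokenization (lower-casing and keeping only runs of letters) is an assumption made for the illustration.

```python
import re

# Homomorphism from the slide: "neither" -> a, "nor" -> b, every other word -> empty string.
IMAGE = {"neither": "a", "nor": "b"}

def h(sentence):
    """Map a sentence to its image under the homomorphism, word by word."""
    words = re.findall(r"[a-z]+", sentence.lower())  # assumed tokenization
    return "".join(IMAGE.get(w, "") for w in words)

print(h("If it neither rains nor snows, then if it rains then it snows."))  # ab
print(h("Neither did John claim that he neither smokes while... nor snores, "
        "nor did anybody believe it."))  # aabb
```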

NL and the Chomsky Hierarchy
- The homomorphism maps English not to the mirror language, but to the language $L_1$:
  - $S \to aST$
  - $T \to bST$
  - $T \to bS$
  - $S \to \varepsilon$
- This is the language over $\{a, b\}$ where each $a$ is followed by a number of $b$s.

NL and the Chomsky Hierarchy
The pumping lemma for regular languages
Let $L$ be a regular language. Then there is a constant $n$ such that if $z$ is any string in $L$ and $\mathrm{length}(z) \ge n$, we may write $z = uvw$ in such a way that $\mathrm{length}(uv) \le n$, $v \ne \varepsilon$, and for all $i \ge 0$, $uv^iw \in L$.
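The lemma is typically used in the contrapositive. As a sketch, the code below checks, for a candidate constant p, that the string $a^p b^p$ cannot be split as the lemma requires: every admissible split fails for some i. The function names are hypothetical, and testing only i = 0, 1, 2 is a simplification that suffices for this language.

```python
def in_anbn(s):
    """Membership in { a^n b^n : n >= 0 }."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def refutes_pumping(p):
    """True if no split z = uvw of z = a^p b^p (|uv| <= p, v non-empty) survives pumping."""
    z = "a" * p + "b" * p
    for uv_len in range(1, p + 1):
        for v_len in range(1, uv_len + 1):
            u = z[:uv_len - v_len]
            v = z[uv_len - v_len:uv_len]
            w = z[uv_len:]
            if all(in_anbn(u + v * i + w) for i in range(3)):
                return False  # this split would be compatible with the lemma
    return True

print(all(refutes_pumping(p) for p in range(1, 8)))  # True
```

Since $|uv| \le p$ forces $v$ to consist of $a$s only, pumping changes the number of $a$s but not the number of $b$s.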

NL and the Chomsky Hierarchy
- Suppose English is regular.
- Due to closure under homomorphism, $L_1$ is regular.
- $a^*b^*$ is a regular language.
- Thus $a^*b^* \cap L_1$ is a regular language, due to Theorem 1:
  $L_2 = L_1 \cap a^*b^* = \{a^n b^m \mid n \le m\}$
- Due to closure under inversion and homomorphism,
  $L_3 = \{a^n b^m \mid n \ge m\}$
  is also regular.
- Hence $L_4$ is regular:
  $L_4 = L_2 \cap L_3 = a^n b^n$
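The intersection step can be checked on short strings by enumerating $L_1$. The sketch below assumes the grammar S → aST, T → bST, T → bS, S → ε, read off the preceding slide; with these rules every non-empty string begins with $a$, so the check covers n ≥ 1 together with the empty string.

```python
import re
from collections import deque

# Grammar for L1 as read off the slide: S -> aST | empty,  T -> bST | bS
RULES = [("S", "aST"), ("S", ""), ("T", "bST"), ("T", "bS")]

def l1_strings(max_len):
    """All strings of L1 up to length `max_len`, found by leftmost derivation."""
    seen, queue, words = {"S"}, deque(["S"]), set()
    while queue:
        form = queue.popleft()
        m = re.search("[ST]", form)
        if m is None:          # no non-terminal left: a finished string
            words.add(form)
            continue
        i = m.start()
        for lhs, rhs in RULES:
            if form[i] != lhs:
                continue
            new = form[:i] + rhs + form[i + 1:]
            # Every T still yields at least one b, so the number of non-S
            # symbols is a lower bound on the length of the final string.
            if sum(c != "S" for c in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words

L1 = l1_strings(6)
L2 = {w for w in L1 if re.fullmatch("a*b*", w)}          # L1 intersected with a*b*
L4 = {w for w in L2 if w.count("a") >= w.count("b")}     # then with { n >= m }
print(sorted(L4, key=len))  # ['', 'ab', 'aabb', 'aaabbb']
```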

NL and the Chomsky Hierarchy
- If English is regular, $L_1$ is regular.
- If $L_1$ is regular, $a^n b^n$ is regular. (This is the technical stuff.)
- $a^n b^n$ is not regular (strings of the form aaa...bbb).
- Therefore English cannot be a regular language.

NL and the Chomsky Hierarchy
Dissenting view:
- all arguments to this effect use center-embedding
- humans are extremely bad at processing center-embedding
- a notion of competence that ignores this is dubious
- natural languages are regular after all

NL and the Chomsky Hierarchy
Are natural languages context-free?
History of the problem:
- Chomsky 1957: conjecture that natural languages are not context-free
- sixties, seventies: many attempts to prove this conjecture
- Pullum and Gazdar 1982:
  - all these attempts have failed
  - for all we know, natural languages (conceived as string sets) might be context-free
- Huybregts 1984, Shieber 1985: proof that Swiss German is not context-free
- Culy 1985: proof that Bambara is not context-free

NL and the Chomsky Hierarchy
Nested and crossing dependencies
- CFLs, unlike regular languages, can have unbounded dependencies
- however, these dependencies can only be nested, not crossing
- examples:
  - $a^n b^n$ has unlimited nested dependencies, hence it is context-free
  - the copy language has unlimited crossing dependencies, hence it is not context-free
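The difference between the two dependency patterns can be made concrete by writing each dependency as a pair of string positions. This is a small sketch with hypothetical function names: the i-th $a$ of $a^n b^n$ is paired with its mirror-image $b$, and position i of a copy-language string $ww$ is paired with position i + |w|.

```python
def arcs_mirror(n):
    """Dependencies in a^n b^n: the i-th a pairs with the b at the mirrored position."""
    return [(i, 2 * n - 1 - i) for i in range(n)]

def arcs_copy(n):
    """Dependencies in ww with |w| = n: position i pairs with position i + n."""
    return [(i, i + n) for i in range(n)]

def has_crossing(arcs):
    """Two arcs (i, j) and (k, l) cross if i < k < j < l."""
    return any(i < k < j < l for (i, j) in arcs for (k, l) in arcs)

print(has_crossing(arcs_mirror(3)))  # False: (0, 5), (1, 4), (2, 3) are nested
print(has_crossing(arcs_copy(3)))    # True:  (0, 3) and (1, 4) cross
```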

NL and the Chomsky Hierarchy
The respectively argument
- Bar-Hillel and Shamir (1960): English contains the copy language, hence it cannot be context-free
- Consider the sentence:
  John, Mary, David, ... are a widower, a widow, a widower, ..., respectively.
- Claim: the sentence is only grammatical under the condition that if the nth name is male (female), then the nth phrase after the copula is "a widower" ("a widow")

NL and the Chomsky Hierarchy
The respectively argument
- [figure: dependency structure of the copy language]
- formal argument:
  - If English is context-free, then the copy language is context-free.
  - The copy language is not context-free.
  - Hence English is not context-free.

NL and the Chomsky Hierarchy
Counterargument
- crossing dependencies triggered by "respectively" are semantic rather than syntactic
- compare the above example to:
  (Here are John, Mary and David.) They are a widower, a widow and a widower, respectively.

NL and the Chomsky Hierarchy
Cross-serial dependencies in Dutch
- Huybregts (1976):
  - Dutch has copy-language-like structures
  - thus Dutch is not context-free
(1) dat Jan Marie Pieter Arabisch laat zien schrijven
    that Jan Marie Pieter Arabic let see write
    'that Jan let Marie see Pieter write Arabic'

Proof of non-context-freeness
German
- dass der Karl die Maria (dem Peter)^n (den Hans)^m schwimmen (lehren)^m (helfen)^n lässt
  'that Karl lets Maria help Peter to teach Hans how to swim'
- the German structure corresponds to the formal language $a^m b^n d^n c^m$, which is context-free
Dutch
- dat Karel Marie (Piet)^n (Jan)^m laat (helpen)^n (leren)^m zwemmen
- the Dutch structure corresponds to the formal language $a^m b^n c^m d^n$, which is not context-free

Proof of non-context-freeness
Swiss German
- dass de Karl d'Maria (em Peter)^n (de Hans)^m laat (hälfe)^n (lärne)^m schwüme
- the Swiss German structure corresponds to the formal language $a^m b^n c^m d^n$, which is not context-free

Tree Adjoining Grammars
- since around 1980, several attempts to move slightly beyond context-free power
- perhaps most influential: Aravind Joshi's Tree Adjoining Grammars (TAG)

Tree Adjoining Grammars
Context-free derivations as tree growth
[figure: derivation trees built from S, NP, VP, V, D and N nodes with the words "man" and "saw"]

Tree Adjoining Grammars
- TAG generalizes this to the insertion of trees in the middle of other trees
- this creates vertical nested dependencies
- these may cash out as crossing dependencies in the string

Related formalisms
- Linear Indexed Grammars: a pushdown stack as part of context-free rules
- Combinatory Categorial Grammar, Linear Context-Free Rewriting Systems, Head Grammars: partial reshuffling of constituents during an (essentially context-free) derivation
- Minimalist Grammars:
  - a version of Chomsky's latest paradigm, formalized by Ed Stabler
  - lexically controlled movement of constituents during the derivation is possible

Mildly context-sensitive grammar formalisms
- two closely related families of mutually equivalent formalisms

Mildly context-sensitive grammar formalisms
TAG and relatives
- parsing problem: $O(n^6)$
- examples:
  - $a^n b^m c^n d^m$
  - copy language
  - $a^n b^n c^n (d^n)$
LCFRS and relatives
- parsing problem in PTIME
- examples:
  - $a^n b^n c^n d^n e^n$
  - triple-copy language (actually any k-copy language for fixed k)
  - $a_1^n \cdots a_k^n$ for fixed k

General properties of MCS grammar formalisms
- Joshi 1985: introduced the notion
- semi-formal characterization: a class of languages is mildly context-sensitive if
  - it contains all context-free languages
  - it can describe a limited number of types of cross-serial dependencies
  - its parsing problem is in PTIME
  - all languages in it have the constant growth property
- the last property excludes the set of primes, the set of square numbers, etc.
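The constant growth property says, roughly, that the lengths of the strings in a language never lie more than a fixed constant apart once they are sorted. The sketch below compares the length gaps of $a^n b^n c^n$ with those of the square numbers on a finite sample; the unary encoding of the squares (a string of length $n^2$) is an assumption, and a finite sample only illustrates the trend.

```python
def max_gap(lengths):
    """Largest difference between consecutive distinct string lengths."""
    ordered = sorted(set(lengths))
    return max(b - a for a, b in zip(ordered, ordered[1:]))

# Lengths of a^n b^n c^n are 3n; lengths of unary squares are n^2 (assumed encoding).
anbncn_lengths = [3 * n for n in range(1, 50)]
square_lengths = [n * n for n in range(1, 50)]

print(max_gap(anbncn_lengths))  # 3: the gap stays constant
print(max_gap(square_lengths))  # 97: the gap 2n + 1 keeps growing with n
```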

Are all natural languages MCS?
- Michaelis and Kracht 1997: Old Georgian is not semilinear.
- All MCS formalisms mentioned above describe semilinear languages; thus not all natural languages are describable by TAGs or LCFRSs.
(2) govel-i igi sisxl-i saxl-isa-j m-is Saul-is-isa-j
    all-NOM ART-NOM blood-NOM house-GEN-NOM ART-GEN Saul-GEN-GEN-NOM
    'all the blood of the house of Saul'
- subordinate nouns carry case marking for all superordinate nouns ("case stacking")
- if productive, this makes Old Georgian a non-LCFRS language
- it is an open issue whether the pattern was productive, or whether it exists productively in living languages

Relation of Chomsky Hierarchy to other complexity measures
Algorithmic complexity
- weak link: each class in the Chomsky hierarchy places an upper bound on the algorithmic complexity:
  - Type 0: recursively enumerable
  - context-sensitive: PSPACE
  - LCFRS: PTIME
  - TAG: $O(n^6)$
  - context-free: $O(n^3)$
  - regular: linear
- individual languages may have much lower complexity than would be expected from the smallest Chomsky hierarchy class that contains them:
  - linear: copy language, k-copy language, $a_1^n \cdots a_k^n$
  - polynomial: set of square numbers
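That the copy language and the k-copy languages can be recognized in linear time is easy to see from a direct implementation: the string is cut into k equal blocks and each block is compared with the first. A minimal sketch with hypothetical function names:

```python
def in_k_copy(s, k):
    """Membership in { w^k : w any string }, using one pass of block comparisons."""
    if k <= 0:
        raise ValueError("k must be positive")
    size, rest = divmod(len(s), k)
    if rest != 0:
        return False
    first = s[:size]
    return all(s[i * size:(i + 1) * size] == first for i in range(1, k))

def in_copy_language(s):
    """Membership in the copy language { ww }."""
    return in_k_copy(s, 2)

print(in_copy_language("abab"), in_copy_language("abba"))  # True False
```

No grammar is consulted at all here, which is the point of the slide: the class a language belongs to only gives an upper bound on the cost of recognizing it.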

Relation of Chomsky Hierarchy to other complexity measures
Kolmogorov complexity
- intuitive idea: the length of the shortest computer program that produces the object in question as its output
- measures the complexity of strings (or of objects representable as strings)
- not directly applicable to languages, i.e. sets of strings
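Kolmogorov complexity itself is not computable, but the size of a compressed encoding gives a rough upper-bound proxy that conveys the intuition. The sketch below uses Python's `zlib` for this purpose; the choice of compressor and the two sample strings are assumptions of the illustration, not part of the slides.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input: a crude stand-in for description length."""
    return len(zlib.compress(data, 9))

# A highly regular string: "ab" repeated 500 times.
regular = b"ab" * 500

# An irregular string of the same length over the same alphabet (fixed seed for repeatability).
rng = random.Random(0)
irregular = bytes(rng.choice(b"ab") for _ in range(1000))

print(compressed_size(regular) < compressed_size(irregular))  # True
```

Both strings have the same length and alphabet, yet the regular one admits a much shorter description, which is what the measure is meant to capture for individual strings.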