Context-Free Grammars and Languages

September 23, 2014

Limitations of finite automata
There are languages, such as $\{0^n 1^n \mid n \ge 0\}$, that cannot be described (specified) by finite automata or regular expressions.
Context-free grammars provide a more powerful mechanism for language specification.
Context-free grammars can describe features that have a recursive structure, which makes them useful beyond the reach of finite automata.

Historical notes
Context-free grammars were first used to study human languages.
One way of understanding the relationship between syntactic categories (such as noun, verb, and preposition) and their respective phrases leads to natural recursion.
This is because noun phrases may occur inside verb phrases and vice versa.

Note
Context-free grammars can capture important aspects of these relationships.

Important application
Context-free grammars are used as a basis for compiler design and implementation.
Context-free grammars are used as specification mechanisms for programming languages.
Designers of compilers use such grammars to implement a compiler's components, such as scanners, parsers, and code generators.
The implementation of a programming language is typically preceded by a context-free grammar that specifies it.

Context-free languages
The languages specified by context-free grammars are called context-free languages.
Context-free languages include the regular languages and many others.
Here we will study the formal concepts of context-free grammar and context-free language.

Notations
We abbreviate the phrase context-free grammar to CFG.
We abbreviate the phrase context-free language to CFL.
We write a CFG substitution rule as $\mathit{lhs} \to \mathit{rhs}$, where lhs stands for left-hand side and rhs stands for right-hand side.

More on substitution rules
The lhs of a substitution rule is also called a variable and is denoted by capital letters.
The rhs of a substitution rule is also called a specification pattern and consists of a string of variables and constants.
The constants that occur in a specification pattern are also called terminal symbols.

CFG: Informal
A CFG consists of a collection of substitution rules in which one variable is designated as the start variable.
Example: the CFG $G_1$ has the following specification rules:
$A \to 0A1$
$A \to B$
$B \to \#$

Note
The nonterminals of CFG $G_1$ are $\{A, B\}$, and $A$ is the start variable.
The terminals of CFG $G_1$ are $\{0, 1, \#\}$.

More terminology
The substitution rules of a CFG are also called productions.
Nonterminals used in the specification rules defining a CFG may be strings.
Terminals in the substitution rules defining a CFG are constant strings.

Terminals
Terminals used in CFG specification rules are analogous to the input alphabet of an automaton.
Examples of terminals used in CFGs are letters of an alphabet, numbers, special symbols, and strings of such elements.

Language specification
A CFG is used as a language specification mechanism by generating each string of the language in the following manner:
1. Write down the start variable; unless specified otherwise, it is the lhs of the top rule.
2. Find a variable that is written down and a rule whose lhs is that variable. Replace the written-down variable with the rhs of that rule.
3. Repeat step 2 until no variables remain in the string thus generated.

Example string generation
Using CFG $G_1$ we can generate the string 000#111 as follows:
$A \Rightarrow 0A1 \Rightarrow 00A11 \Rightarrow 000A111 \Rightarrow 000B111 \Rightarrow 000\#111$
Note: The sequence of substitutions used to obtain a string using a CFG is called a derivation and may be represented by a tree called a parse tree.
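The three generation steps can be replayed mechanically. The following is a minimal sketch, not part of the original slides: the names RULES and derive are hypothetical, each symbol is assumed to be a single character, and the leftmost variable is always the one replaced (the procedure itself allows any written-down variable).

```python
# Hypothetical encoding of G_1: variables are the dictionary keys,
# and each right-hand side is a string of one-character symbols.
RULES = {
    "A": ["0A1", "B"],  # A -> 0A1 | B
    "B": ["#"],         # B -> #
}

def derive(start, choices):
    """Replay a derivation from `start`.

    `choices` gives, for each step, the index of the rule to apply
    to the leftmost variable. Returns every string written down.
    """
    current = start                      # step 1: write down the start variable
    steps = [current]
    for choice in choices:
        # Step 2: find a written-down variable (here: the leftmost one).
        pos = next(i for i, sym in enumerate(current) if sym in RULES)
        rhs = RULES[current[pos]][choice]
        # Replace that variable with the right-hand side of the chosen rule.
        current = current[:pos] + rhs + current[pos + 1:]
        steps.append(current)
    return steps

steps = derive("A", [0, 0, 0, 1, 0])
print(" => ".join(steps))
# A => 0A1 => 00A11 => 000A111 => 000B111 => 000#111
```

The caller decides when to stop (step 3); the sketch does not check that the final string is free of variables.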

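Example derivation tree
The derivation tree of the string 000#111 using CFG $G_1$ is in Figure 1.
[Figure 1: Derivation tree for 000#111. The root $A$ has children 0, $A$, 1; this pattern repeats for three levels, and the innermost $A$ has the single child $B$, whose child is #.]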

Note
All strings of terminals generated in this way constitute the language specified by the grammar.
We write $L(G)$ for the language generated by the grammar $G$. Thus, $L(G_1) = \{0^n \# 1^n \mid n \ge 0\}$.
The language generated by a context-free grammar is called a context-free language (CFL).

More notations
To distinguish nonterminal from terminal strings, we often enclose nonterminals in angle brackets, $\langle\ \rangle$, and terminals in quotes.
If two or more rules have the same lhs, as in the example $A \to 0A1$ and $A \to B$, we may combine them using the form $A \to 0A1 \mid B$, where $\mid$ is used with the meaning of "or".
In general, if there are multiple rules of the form $\mathit{lhs} \to \mathit{rhs}_1$, $\mathit{lhs} \to \mathit{rhs}_2$, ..., $\mathit{lhs} \to \mathit{rhs}_n$, we may compactly write them in the form $\mathit{lhs} \to \mathit{rhs}_1 \mid \mathit{rhs}_2 \mid \ldots \mid \mathit{rhs}_n$.

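CFG $G_2$
The CFG $G_2$ specifies a fragment of English:
$\langle\text{SENTENCE}\rangle \to \langle\text{NounPhrase}\rangle \langle\text{VerbPhrase}\rangle$
$\langle\text{NounPhrase}\rangle \to \langle\text{CpNoun}\rangle \mid \langle\text{CpNoun}\rangle \langle\text{PrepPhrase}\rangle$
$\langle\text{VerbPhrase}\rangle \to \langle\text{CpVerb}\rangle \mid \langle\text{CpVerb}\rangle \langle\text{PrepPhrase}\rangle$
$\langle\text{PrepPhrase}\rangle \to \langle\text{Prep}\rangle \langle\text{CpNoun}\rangle$
$\langle\text{CpNoun}\rangle \to \langle\text{Article}\rangle \langle\text{Noun}\rangle$
$\langle\text{CpVerb}\rangle \to \langle\text{Verb}\rangle \mid \langle\text{Verb}\rangle \langle\text{NounPhrase}\rangle$
$\langle\text{Article}\rangle \to \text{a} \mid \text{the}$
$\langle\text{Noun}\rangle \to \text{boy} \mid \text{girl} \mid \text{flower}$
$\langle\text{Verb}\rangle \to \text{touches} \mid \text{likes} \mid \text{sees}$
$\langle\text{Prep}\rangle \to \text{with}$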

Note
The CFG $G_2$ has ten variables (capitalized and in angle brackets) and nine terminals (written in the standard English alphabet), plus a space character.
Also, the CFG $G_2$ has 18 rules.
Example strings that belong to $L(G_2)$ are:
a boy sees
the boy sees a flower
a girl with a flower likes the boy

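Example derivation with $G_2$
$\langle\text{SENTENCE}\rangle \Rightarrow \langle\text{NounPhrase}\rangle \langle\text{VerbPhrase}\rangle$
$\Rightarrow \langle\text{CpNoun}\rangle \langle\text{VerbPhrase}\rangle$
$\Rightarrow \langle\text{Article}\rangle \langle\text{Noun}\rangle \langle\text{VerbPhrase}\rangle$
$\Rightarrow \text{a } \langle\text{Noun}\rangle \langle\text{VerbPhrase}\rangle$
$\Rightarrow \text{a boy } \langle\text{VerbPhrase}\rangle$
$\Rightarrow \text{a boy } \langle\text{CpVerb}\rangle$
$\Rightarrow \text{a boy } \langle\text{Verb}\rangle$
$\Rightarrow \text{a boy sees}$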

Formal definition of a CFG
A context-free grammar is a 4-tuple $(V, \Sigma, R, S)$ where:
1. $V$ is a finite set called the variables or nonterminals.
2. $\Sigma$ is a finite set of strings, disjoint from $V$, called terminals.
3. $R$ is a finite set of rules (or substitution rules) of the form $\mathit{lhs} \to \mathit{rhs}$, where $\mathit{lhs} \in V$ and $\mathit{rhs} \in (V \cup \Sigma)^*$.
4. $S \in V$ is the start variable.

Example
CFG $G_1 = (\{A, B\}, \{0, 1, \#\}, R, A)$, where $R$ is:
$A \to 0A1$
$A \to B$
$B \to \#$

Direct derivation
If $u, v, w \in (V \cup \Sigma)^*$ (i.e., are strings of variables and terminals) and $A \to w \in R$ (i.e., is a rule of the grammar), then we say that $uAv$ yields $uwv$, written $uAv \Rightarrow uwv$.
We may also say that $uwv$ is directly derived from $uAv$ using the rule $A \to w$.

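Derivation
Suppose $u, v \in (V \cup \Sigma)^*$ are strings of variables and terminals.
We say that $u$ derives $v$, written $u \Rightarrow^* v$, if $u = v$ or if a sequence $u_1, u_2, \ldots, u_k \in (V \cup \Sigma)^*$ exists, for $k \ge 0$, such that $u \Rightarrow u_1 \Rightarrow u_2 \Rightarrow \ldots \Rightarrow u_k \Rightarrow v$.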

Language specified by G
If $G = (V, \Sigma, R, S)$ is a CFG, then the language specified by $G$ (or the language of $G$) is $L(G) = \{w \in \Sigma^* \mid S \Rightarrow^* w\}$.
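To make the definition of the language of a grammar concrete, the sketch below enumerates the strings of $L(G_1)$ up to a length bound by searching over derivations. It is an illustration only: the names G1_RULES and language_up_to are hypothetical, symbols are assumed to be single characters, and the pruning is valid only under the stated assumption that no rule has an empty right-hand side.

```python
from collections import deque

# Hypothetical encoding of G_1: variables are the dictionary keys.
G1_RULES = {"A": ["0A1", "B"], "B": ["#"]}

def language_up_to(rules, start, max_len):
    """Return every terminal string w with start =>* w and len(w) <= max_len.

    Assumes no rule has an empty right-hand side, so sentential forms
    never shrink and any form longer than max_len can be discarded.
    """
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Index of the leftmost variable, or None if only terminals remain.
        pos = next((i for i, sym in enumerate(form) if sym in rules), None)
        if pos is None:
            found.add(form)
            continue
        # One "yields" step for each rule whose lhs is that variable.
        for rhs in rules[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found

print(sorted(language_up_to(G1_RULES, "A", 7), key=len))
# ['#', '0#1', '00#11', '000#111']
```

Replacing only the leftmost variable loses no strings, because every derivable terminal string also has a derivation that always rewrites the leftmost variable.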

Note
Often we specify a grammar by writing down only its rules.
We can identify the variables as the symbols that appear as the lhs of the rules.
Terminals are the remaining strings used in the rules.

More examples of CFGs
Consider the grammar $G_3 = (\{S\}, \{a, b\}, \{S \to aSb \mid SS \mid \varepsilon\}, S)$.
$L(G_3)$ contains strings such as abab, aaabbb, and aababb.
Note: if one thinks of a and b as ( and ), then $L(G_3)$ is the language of all strings of properly nested parentheses.
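The nesting reading of $L(G_3)$ can be checked with a simple depth counter. The sketch below is an illustration, not part of the slides; the function name is_nested is hypothetical. It tests the "properly nested" property directly rather than searching for a derivation in $G_3$.

```python
def is_nested(w):
    """Return True if w, read with a as '(' and b as ')', is properly nested."""
    depth = 0
    for ch in w:
        if ch == "a":
            depth += 1          # an opening parenthesis
        elif ch == "b":
            depth -= 1          # a closing parenthesis
            if depth < 0:       # closed more than were opened so far
                return False
        else:
            return False        # symbol outside the alphabet {a, b}
    return depth == 0           # every opening parenthesis was closed

for w in ["abab", "aaabbb", "aababb", "ba", "aab"]:
    print(w, is_nested(w))
```

The first three strings are the examples listed above and are accepted; "ba" and "aab" are rejected.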

Arithmetic expressions
Consider the grammar $G_4 = (\{E, T, F\}, \{a, +, *, (, )\}, R, E)$, where $R$ is:
$E \to E + T \mid T$
$T \to T * F \mid F$
$F \to (E) \mid a$
$L(G_4)$ is the language of arithmetic expressions.

Note
Arithmetic operations in $L(G_4)$ are addition, represented by +, and multiplication, represented by *.
An example of a derivation using $G_4$ is in Figure 2.

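Example derivation with $G_4$
[Figure 2: Derivation tree for a+a*a. The root $E$ expands to $E + T$; the left $E$ derives a through $T$ and $F$, and the right $T$ expands to $T * F$, where $T$ derives a through $F$ and $F$ derives a.]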

Designing CFGs
As with the design of automata, the design of CFGs requires creativity.
CFGs are even trickier to construct than finite automata because we are more accustomed to programming a machine than we are to specifying programming languages.

Design techniques
Many CFLs are unions of simpler CFLs. Hence the suggestion is to construct smaller, simpler grammars first and then to join them into a larger grammar.
The mechanism of grammar combination consists of putting all their rules together and adding the new rule $S \to S_1 \mid S_2 \mid \ldots \mid S_k$, where the variables $S_i$, $1 \le i \le k$, are the start variables of the individual grammars and $S$ is a new variable.

Example grammar design
Design a grammar for the language $\{0^n 1^n \mid n \ge 0\} \cup \{1^n 0^n \mid n \ge 0\}$.
1. Construct the grammar $S_1 \to 0 S_1 1 \mid \varepsilon$, which generates $\{0^n 1^n \mid n \ge 0\}$.
2. Construct the grammar $S_2 \to 1 S_2 0 \mid \varepsilon$, which generates $\{1^n 0^n \mid n \ge 0\}$.
3. Put them together, adding the rule $S \to S_1 \mid S_2$, thus getting:
$S \to S_1 \mid S_2$
$S_1 \to 0 S_1 1 \mid \varepsilon$
$S_2 \to 1 S_2 0 \mid \varepsilon$
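The combination step can be sketched in code as follows. This is an illustration under stated assumptions, not the slides' own implementation: the name union_grammar and the dictionary encoding are hypothetical, an empty list stands for $\varepsilon$, and the individual grammars are assumed to use pairwise distinct variable names (the sketch raises an error otherwise).

```python
def union_grammar(parts, new_start="S"):
    """Combine grammars given as (rules, start_variable) pairs.

    `rules` maps a variable to a list of right-hand sides, each a list
    of symbols. Variable names must be distinct across the grammars
    and different from `new_start`.
    """
    combined = {}
    for rules, _ in parts:
        for var, alternatives in rules.items():
            if var in combined or var == new_start:
                raise ValueError(f"variable {var!r} is not fresh")
            combined[var] = [list(rhs) for rhs in alternatives]
    # The new rule S -> S_1 | S_2 | ... | S_k.
    combined[new_start] = [[start] for _, start in parts]
    return combined, new_start

g_a = ({"S1": [["0", "S1", "1"], []]}, "S1")   # S1 -> 0 S1 1 | epsilon
g_b = ({"S2": [["1", "S2", "0"], []]}, "S2")   # S2 -> 1 S2 0 | epsilon

rules, start = union_grammar([g_a, g_b])
for var, alternatives in rules.items():
    shown = [" ".join(rhs) if rhs else "epsilon" for rhs in alternatives]
    print(var, "->", " | ".join(shown))
```

Requiring distinct variable names keeps the rules of one grammar from interfering with those of another once they are put together.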

Second design technique
Constructing a CFG for a regular language is easy if one can first construct a DFA for that language.
Conversion procedure:
1. Make a variable $R_i$ for each state $q_i$ of the DFA.
2. Add the rule $R_i \to a R_j$ to the CFG if $\delta(q_i, a) = q_j$ is a transition in the DFA.
3. Add the rule $R_i \to \varepsilon$ if $q_i$ is an accept state of the DFA.
4. If $q_0$ is the start state of the DFA, make $R_0$ the start variable of the CFG.
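The four conversion steps translate almost line by line into code. The sketch below is illustrative only: the function name dfa_to_cfg, the dictionary encoding of $\delta$, and the example DFA (binary strings ending in 1) are assumptions made for this example and do not come from the slides. An empty list stands for $\varepsilon$.

```python
def dfa_to_cfg(states, delta, start_state, accept_states):
    """Build CFG rules from a DFA following the four conversion steps."""
    # Step 1: one variable R_i per state q_i.
    var = {q: f"R{i}" for i, q in enumerate(states)}
    rules = {var[q]: [] for q in states}
    # Step 2: R_i -> a R_j for every transition delta(q_i, a) = q_j.
    for (q, a), target in delta.items():
        rules[var[q]].append([a, var[target]])
    # Step 3: R_i -> epsilon for every accept state q_i.
    for q in accept_states:
        rules[var[q]].append([])
    # Step 4: the variable of the start state is the start variable.
    return rules, var[start_state]

# Hypothetical example DFA: accepts binary strings that end in 1.
STATES = ["q0", "q1"]
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}

rules, start = dfa_to_cfg(STATES, DELTA, "q0", {"q1"})
print(start, rules)
```

For this DFA the resulting grammar is $R_0 \to 0R_0 \mid 1R_1$ and $R_1 \to 0R_0 \mid 1R_1 \mid \varepsilon$, with start variable $R_0$.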

Third design technique
Certain CFLs contain strings with two related substrings, as are $0^n$ and $1^n$ in $\{0^n 1^n \mid n \ge 0\}$.
Example of relationship: to recognize such a language, a machine would need to remember an unbounded amount of information about one of the substrings.

Note
A CFG that handles this situation uses a rule of the form $R \to uRv$, which generates strings wherein the portion containing the $u$'s corresponds to the portion containing the $v$'s.

Fourth design technique
In a complex language, strings may contain certain structures that appear recursively.
Example: in arithmetic expressions, any time the symbol a appears, an entire parenthesized expression may appear instead.

Ambiguity
If a CFG $G$ generates the same string $x$ in several different ways, we say that $x$ is derived ambiguously in $G$.
If a CFG $G$ generates some string ambiguously, we say that the grammar $G$ is ambiguous.

Example
Consider the grammar $G_4$, whose rules are:
$E \to E + T \mid T$
$T \to T * F \mid F$
$F \to (E) \mid a$
and the grammar $G_5$, whose rules are:
$E \to E + E \mid E * E \mid (E) \mid a$
$L(G_4) = L(G_5)$
Note: one can show this by proving the inclusions $L(G_4) \subseteq L(G_5)$ and $L(G_5) \subseteq L(G_4)$.
$G_5$ generates some arithmetic expressions ambiguously, while $G_4$ does not.

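Ambiguous expressions
Figure 3 shows two different derivation trees for a+a*a.
[Figure 3: Two derivation trees for a+a*a. In one tree the root applies $E \to E * E$ and its left $E$ expands to $E + E$; in the other the root applies $E \to E + E$ and its right $E$ expands to $E * E$.]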

Note
The grammar $G_5$ does not capture the usual precedence relations, so it may group + before * or vice versa.
In contrast, the grammar $G_4$ generates the same language, but every generated string has a unique derivation tree.
Hence, $G_5$ is ambiguous and $G_4$ is not, i.e., $G_4$ is unambiguous.

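Another example
$G_2$ below is another ambiguous grammar:
$\langle\text{SENTENCE}\rangle \to \langle\text{NounPhrase}\rangle \langle\text{VerbPhrase}\rangle$
$\langle\text{NounPhrase}\rangle \to \langle\text{CpNoun}\rangle \mid \langle\text{CpNoun}\rangle \langle\text{PrepPhrase}\rangle$
$\langle\text{VerbPhrase}\rangle \to \langle\text{CpVerb}\rangle \mid \langle\text{CpVerb}\rangle \langle\text{PrepPhrase}\rangle$
$\langle\text{PrepPhrase}\rangle \to \langle\text{Prep}\rangle \langle\text{CpNoun}\rangle$
$\langle\text{CpNoun}\rangle \to \langle\text{Article}\rangle \langle\text{Noun}\rangle$
$\langle\text{CpVerb}\rangle \to \langle\text{Verb}\rangle \mid \langle\text{Verb}\rangle \langle\text{NounPhrase}\rangle$
$\langle\text{Article}\rangle \to \text{a} \mid \text{the}$
$\langle\text{Noun}\rangle \to \text{boy} \mid \text{girl} \mid \text{flower}$
$\langle\text{Verb}\rangle \to \text{touches} \mid \text{likes} \mid \text{sees}$
$\langle\text{Prep}\rangle \to \text{with}$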

Example ambiguous string
The sentence "the girl touches the boy with the flower" has two different parse trees, so it is derived ambiguously.
The two parse trees correspond to the two readings:
(the girl touches the boy) (with the flower)
(the girl touches) (the boy with the flower)

Note
When a grammar generates a string ambiguously, it means that the string has two different parse trees, not merely two different derivations.
Two different derivations, however, may produce the same derivation tree, because they may differ only in the order in which they replace nonterminals, not in the rules they use.
To concentrate on the structure, we define a type of derivation that replaces variables in a fixed order.

Fixing rule application order
Leftmost derivation: a derivation of a string $w$ in a grammar $G$ is a leftmost derivation if at every step the leftmost nonterminal is replaced.

Ambiguity again
A string $w$ is derived ambiguously in the CFG $G$ if it has two or more different leftmost derivations.
A CFG $G$ is ambiguous if it generates some string ambiguously.
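Ambiguity of a particular string can be checked by counting its parse trees, which correspond one-to-one with its leftmost derivations. The sketch below is an illustration, not part of the slides: the names count_trees, G4 and G5 are hypothetical, symbols are assumed to be single characters, and the recursion assumes a grammar with no empty right-hand sides and no cycles of unit rules (both hold for $G_4$ and $G_5$).

```python
from functools import lru_cache

# Hypothetical encodings; each right-hand side is a string of symbols.
G4 = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["(E)", "a"]}
G5 = {"E": ["E+E", "E*E", "(E)", "a"]}

def count_trees(rules, start, w):
    """Count the parse trees of w from `start`.

    Assumes no empty right-hand sides and no cycles of unit rules,
    so every recursive call works on a smaller span or a later symbol.
    """
    @lru_cache(maxsize=None)
    def from_var(var, i, j):
        # Trees deriving w[i:j] from var: sum over its rules.
        return sum(from_seq(rhs, i, j) for rhs in rules[var])

    @lru_cache(maxsize=None)
    def from_seq(seq, i, j):
        # Ways for the symbol sequence seq to derive w[i:j].
        if not seq:
            return 1 if i == j else 0
        head, rest = seq[0], seq[1:]
        if head not in rules:                      # head is a terminal
            if i < j and w[i] == head:
                return from_seq(rest, i + 1, j)
            return 0
        total = 0
        # head covers w[i:k]; each remaining symbol needs at least one character.
        for k in range(i + 1, j - len(rest) + 1):
            left = from_var(head, i, k)
            if left:
                total += left * from_seq(rest, k, j)
        return total

    return from_var(start, 0, len(w))

print(count_trees(G5, "E", "a+a*a"))  # 2
print(count_trees(G4, "E", "a+a*a"))  # 1
```

A count of two or more for some string shows that the grammar is ambiguous; a count of one for a single string says nothing about other strings.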

Note
Sometimes when we have an ambiguous grammar (such as $G_5$), we can find an unambiguous grammar (such as $G_4$) that generates the same language.

Inherent ambiguity
Some CFLs, however, can be generated only by ambiguous grammars.
A CFL that can be generated only by ambiguous grammars is called inherently ambiguous.
Example of an inherently ambiguous language: $\{0^i 1^j 2^k \mid i = j \text{ or } j = k\}$.