Context-Free Grammars


Context-Free Grammars
Notes on Automata and Theory of Computation
Chia-Ping Chen
Department of Computer Science and Engineering
National Sun Yat-Sen University
Kaohsiung, Taiwan ROC
Context-Free Grammars p. 1

Introduction

Consider the language L = {aⁿbⁿ : n ≥ 0}. L describes a nested structure, such as nested parentheses. L has been shown not to be regular. We will introduce the context-free grammar (cfg), which can characterize L. A language is context-free if there exists a cfg for it. The set of context-free languages includes the regular languages as a subset.

Parsing

The membership problem for cfgs is this: given a cfg G and a string w, is w ∈ L(G)? If w ∈ L(G), then there is a sequence of production rules that leads to w starting from S. An important concept in the study of cfgs is parsing. A parsing algorithm determines how a string w can be derived with a grammar G. Parsing describes sentence structure. It is important for understanding natural languages as well as programming languages.

Context-Free Grammar

A grammar G = (V, T, S, P) is context-free if every production rule in P has the form A → x, where A ∈ V and x ∈ (V ∪ T)*. It is called context-free because the left side is a single variable: no context around the variable is relevant, so the application of a rule does not depend on other parts of the sentential form. We can see that a regular grammar is a cfg.

Context-Free Language

Recall that the language of a grammar G is defined by L(G) = {w ∈ T* : S ⇒* w}. A language L is said to be context-free if L = L(G) for some cfg G. For example, a regular language is context-free since a regular grammar is context-free.

Example of cfg

The language L = {wwᴿ : w ∈ {a, b}*} is context-free since it can be generated by S → aSa | bSb | λ. Note that if x ∈ L, then xᴿ = x; such strings are called palindromes.
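As a quick sanity check, the grammar above can be simulated in a few lines of Python. The generator below (the function name and the length bound are our own, not from the text) enumerates every terminal string derivable from S up to a given length and confirms that each one is a palindrome.

```python
def gen(form: str, max_len: int) -> set:
    """All terminal strings of length <= max_len derivable from `form`
    with the rules S -> aSa | bSb | lambda."""
    if "S" not in form:
        return {form}
    if len(form) - 1 > max_len:     # terminals alone already exceed the bound
        return set()
    out = set()
    for rhs in ("aSa", "bSb", ""):  # the three rules; lambda written as ""
        out |= gen(form.replace("S", rhs, 1), max_len)
    return out

words = gen("S", 4)
# every derivable string equals its own reversal
assert all(w == w[::-1] for w in words)
```

Note that this grammar yields exactly the even-length palindromes over {a, b}, since every string in L has the form wwᴿ.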

Another Example

We design a cfg for the language L = {aⁿbᵐ : n ≠ m}. We consider the cases n > m and n < m separately. For extra a's, we decompose a string into a run of a's (generated by A) followed by a's and b's in equal number (generated by S₁): S → AS₁; S₁ → aS₁b | λ; A → aA | a. Similarly for extra b's. So the rules for L are S → AS₁ | S₁B; S₁ → aS₁b | λ; A → aA | a; B → Bb | b.
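The combined grammar can be checked mechanically. The sketch below renames S₁ to T so that every variable is a single letter (the renaming and the enumerator are our own illustration); it expands the leftmost variable at each step, as in a leftmost derivation, and verifies that every generated string has unequal numbers of a's and b's.

```python
RULES = {
    "S": ["AT", "TB"],   # T plays the role of S1 in the text
    "T": ["aTb", ""],    # lambda written as ""
    "A": ["aA", "a"],
    "B": ["Bb", "b"],
}

def yields(form: str, max_len: int) -> set:
    """Terminal strings of length <= max_len derivable from `form`."""
    variables = [c for c in form if c in RULES]
    if not variables:
        return {form}
    if len(form) - len(variables) > max_len:  # too many terminals already
        return set()
    i = form.index(variables[0])              # expand the leftmost variable
    out = set()
    for rhs in RULES[form[i]]:
        out |= yields(form[:i] + rhs + form[i + 1:], max_len)
    return out

words = yields("S", 5)
assert all(w.count("a") != w.count("b") for w in words)
```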

Yet Another Example

A grammar can be context-free but not linear, e.g. S → aSb | SS | λ. Though it looks simple, this cfg is a useful one: it generates L = {w ∈ {a, b}* : n_a(w) = n_b(w) and n_a(v) ≥ n_b(v) for every prefix v of w}, which maps, via the homomorphism a ↦ ( and b ↦ ), onto the set of properly nested parentheses.

Derivation

A derivation of a string w ∈ L(G) is a sequence of sentential forms from S to w. When a cfg is not linear, a production rule may have more than one variable on the right side, so there may be more than one variable in a sentential form. In such cases, we have a choice of which variable to replace next by a corresponding right side.

Leftmost/Rightmost Derivation

A derivation is said to be leftmost if in each step the leftmost variable in the sentential form is replaced; it is rightmost if the rightmost variable is replaced in each step. Leftmost and rightmost derivations always exist for a string w ∈ L(G).

Derivation Tree

A derivation tree of a cfg G = (V, T, S, P) is a tree in which the root is S, an interior node is labeled by a variable A ∈ V, and a leaf is labeled by a terminal a ∈ T or by λ. The label of an interior node and the labels of its children constitute a rule in P. A leaf labeled λ has no siblings. A derivation tree shows which rules are used in the derivation of w; the order in which the rules are applied is not shown.

Partial Derivation Tree

A partial derivation tree is like a derivation tree, except that the root need not be S and a leaf may be labeled by any symbol in V ∪ T ∪ {λ}. The string of leaf symbols read from left to right, omitting λ's, is called the yield. Here "left to right" means the tree is traversed depth-first, always taking the leftmost unexplored branch. The yield of a derivation tree for w is w.

Theorem

We first establish the connection between derivations and derivation trees. Let G be a cfg. If w ∈ L(G), i.e. there exists a derivation S ⇒* w, then there exists a derivation tree whose yield is w. Conversely, if w is the yield of a derivation tree, then w ∈ L(G). In addition, if t_G is any partial derivation tree rooted at S, then the yield of t_G is a sentential form of G.

Proof

We first show that for every sentential form, say u, there is a corresponding partial derivation tree. If u can be derived from S in one step, then there must be a rule S → u. Suppose the claim is true for all sentential forms derivable in n steps. For a u derived from S in (n + 1) steps, the first n steps correspond to a partial tree by the inductive assumption, and a new partial derivation tree can be built from it using the last step of the derivation. Similarly, we can prove that every partial derivation tree rooted at S corresponds to a sentential form. The theorem follows since a terminal string in L(G) is a sentential form, and a derivation tree is a partial derivation tree.

Existence of Leftmost Derivation

The derivation tree is a representation of a derivation in which the order of the production rules is irrelevant. From a derivation tree, we can always obtain a sequence of partial derivation trees rooted at S in which the leftmost variable node is expanded at each step. In terms of sentential forms, the leftmost variable is expanded, which corresponds to a leftmost derivation. We conclude that for each w ∈ L(G), there is a leftmost derivation.

Parsing

Given G, we may want to know L(G), i.e. the set of strings that can be derived using G. Given G and a string w, we may be interested in whether w ∈ L(G); this is the membership problem. If w ∈ L(G), then there exists a sequence of productions by which w is derived from S. Parsing is the process of finding such a sequence.

Brute Force Parsing

The brute-force (exhaustive) method to decide whether w ∈ L(G) is to construct all derivations and see if any of them matches w. We can do this round by round. First we construct all sentential forms derivable from S in one step. If none matches w, we expand the leftmost variable of every such form, which gives all sentential forms derivable from S in two steps, and so on. If w ∈ L(G), there is a leftmost derivation for w in a finite number of steps, so eventually w will be matched. Let's look at an example.
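The round-by-round search can be sketched as follows. The `max_rounds` cut-off is our own safety guard, not part of the method itself: as discussed next, the plain search need not terminate.

```python
def brute_force_parse(rules: dict, w: str, max_rounds: int) -> bool:
    """Round-by-round exhaustive leftmost search.  `rules` maps a variable
    to its list of right-hand sides; uppercase letters are variables."""
    forms = {"S"}
    for _ in range(max_rounds):
        next_forms = set()
        for form in forms:
            i = next((k for k, c in enumerate(form) if c in rules), None)
            if i is None:
                continue            # all-terminal form; cannot be expanded
            for rhs in rules[form[i]]:
                new = form[:i] + rhs + form[i + 1:]
                if new == w:
                    return True
                next_forms.add(new)
        forms = next_forms
    return False

grammar = {"S": ["aSb", ""]}        # generates {a^n b^n : n >= 0}
```

With this grammar, `brute_force_parse(grammar, "aabb", 10)` succeeds in three rounds, while a non-member such as "aab" simply exhausts the round budget.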

Flaw and Remedy

Brute-force parsing has a serious flaw: it may never terminate. If w ∉ L(G), w will never be matched, and the search can go on forever. In that case we want to be able to terminate the search once we are sure of the answer. We can put restrictions on the form of the production rules so that the search can be terminated whenever w ∉ L(G). These restrictions have virtually no effect on the descriptive power of cfgs.

Theorem

If no production rule is of the form A → λ or A → B, then the exhaustive search terminates in no more than 2|w| rounds. (Proof) Under this condition, each derivation step increases either the number of terminals or the length of the sentential form. Since neither quantity can exceed |w| in a derivation of w, we need no more than 2|w| rounds to decide whether w ∈ L(G).

Efficiency Issue

While the previous theorem guarantees termination, the number of sentential forms may grow excessively large. If we restrict ourselves to leftmost derivations, there can be no more than |P| sentential forms after the first round, |P|² sentential forms after the second round, and so on. So the maximum number of sentential forms generated during the exhaustive search is |P| + |P|² + ... + |P|^(2|w|) = O(|P|^(2|w|+1)). Exhaustive search is thus in general very inefficient.

Simple Grammar

A more efficient algorithm than exhaustive search can decide whether w ∈ L(G) in a number of steps proportional to |w|³. Even O(|w|³) can be excessive: is there a linear-time parsing algorithm? A cfg G = (V, T, S, P) is said to be a simple grammar, or s-grammar, if all of its production rules are of the form A → ax, where a ∈ T, x ∈ V*, and any pair (A, a) occurs at most once in P.

Linear Time

For a simple grammar G, any string w ∈ L(G) can be parsed in |w| steps. Suppose w = a₁a₂...aₙ ∈ L(G). Since there is at most one rule with S on the left side and a₁ starting the right side, the derivation must begin with S ⇒ a₁A₁...Aₘ. Similarly, there is at most one rule with A₁ on the left side and a₂ starting the right side, so the next sentential form is forced as well. Each step produces exactly one more terminal, so the entire derivation cannot take more than |w| steps.
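Because every step is forced, s-grammar parsing is just a loop that consumes one input symbol per step. In the sketch below the rule table is keyed by (variable, terminal) pairs; the table layout and the example s-grammar for {aⁿbⁿ : n ≥ 1} (S → aT, T → aTB | b, B → b, with S₁ renamed T) are our own illustration.

```python
def s_parse(rules: dict, w: str) -> bool:
    """Parse w with an s-grammar.  `rules` maps a (variable, terminal) pair
    to the string of variables following the terminal on the right side.
    Because each pair occurs at most once, every derivation step is forced."""
    variables = ["S"]                  # variables of the current sentential form
    for ch in w:
        if not variables:
            return False               # sentential form is already all-terminal
        key = (variables.pop(0), ch)   # the leftmost variable must emit ch
        if key not in rules:
            return False
        variables = list(rules[key]) + variables
    return not variables

# s-grammar for {a^n b^n : n >= 1}: S -> aT, T -> aTB | b, B -> b
S_RULES = {("S", "a"): "T", ("T", "a"): "TB", ("T", "b"): "", ("B", "b"): ""}
```

Each input symbol triggers exactly one table lookup, so the whole parse takes |w| steps, matching the bound above.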

Ambiguity of Grammar

A cfg G is said to be ambiguous if there exists some w ∈ L(G) with two or more distinct derivation trees (parses). Ambiguity implies the existence of two or more leftmost derivations for some string. See example 5.11. While it may be possible to resolve the ambiguity by associating precedences with operators, it is better to rewrite the grammar. Ambiguity is not desired in programming languages. In some cases, one can rewrite an ambiguous grammar as an equivalent, unambiguous one.

Ambiguity of Language

Suppose L is a context-free language. L is unambiguous if there exists an unambiguous cfg for it. Otherwise, i.e. if every cfg for L is ambiguous, L is said to be inherently ambiguous. While the grammar in example 5.11 is ambiguous, the language is not, as there is an unambiguous cfg that generates the same language. It is in general difficult to show that a language is inherently ambiguous; see example 5.13.

Example

Consider the language L = {aⁿbⁿcᵐ : n, m ≥ 0} ∪ {aⁿbᵐcᵐ : n, m ≥ 0}. L is context-free: L = L₁ ∪ L₂, where L₁ is generated by P₁ = {S₁ → S₁c | A, A → aAb | λ}, and similarly for L₂. A grammar for L is P = P₁ ∪ P₂ ∪ {S → S₁ | S₂}. A string aⁱbⁱcⁱ has two distinct derivations, one beginning with S ⇒ S₁ and the other with S ⇒ S₂, so the grammar is ambiguous. It does not follow that the language is ambiguous; showing that it actually is requires a rigorous proof, which is quite technical and is omitted here.

Programming Languages

One important application of formal languages is the definition of programming languages and the construction of compilers and interpreters. We want to define a programming language precisely so that the definition can be used to write translation programs. Both regular and context-free languages are important in the design of programming languages: one is used to recognize certain patterns, and the other is used to model more complicated structures.

Backus-Naur Form

A programming language can be defined by a grammar. This is traditionally specified in Backus-Naur form (BNF), which is essentially the same as a cfg but with a different system of notation. It is easy to see from an example of BNF how it corresponds to a cfg.

Syntax and Semantics

Those aspects of a programming language that can be modeled by a cfg are called its syntax. Even if a program is syntactically correct, it may not be acceptable; for example, type clashes may not be permitted in the language. The semantics of a programming language covers the aspects not modeled by the syntax; it concerns the interpretation or meaning of objects. Finding effective methods to model programming-language semantics is an ongoing research topic.

Transforming Grammars

Our definition of cfgs places no restriction on the form of the right side of a rule. Such flexibility is in fact not necessary: given a cfg, we can transform it into an equivalent cfg whose rules conform to certain restrictions. Specifically, a normal form is a restricted class of cfgs that is nevertheless broad enough to cover all context-free languages (except perhaps {λ}). We will introduce the Greibach and the Chomsky normal forms.

A Technical Note

The empty string λ often requires special attention, so we will assume that the languages in the following discussion are λ-free. This loses no generality because of the following facts: if L is a λ-free context-free language, then L ∪ {λ} is context-free as well; and if L is context-free, then there exists a cfg for L − {λ}.

Substitution Rule

Suppose A and B are different variables and there is a rule A → x₁Bx₂. Then this rule can be substituted by A → x₁y₁x₂ | x₁y₂x₂ | ... | x₁yₙx₂, where B → y₁ | y₂ | ... | yₙ is the set of rules with B on the left side. In other words, B can be replaced by all strings it derives in one step.

Proof

Suppose w ∈ L(G), so S ⇒* w in G. If the derivation does not use the substituted rule, then the same derivation exists in Ĝ, so w ∈ L(Ĝ). If it does use that rule, then the introduced B eventually has to be replaced. We may assume B is replaced immediately, and then there is clearly a rule in Ĝ leading to the resulting sentential form. Therefore w ∈ L(Ĝ).

Useless Production

A variable A is said to be useful iff there exists w ∈ L(G) such that S ⇒* xAy ⇒* w, where x, y ∈ (V ∪ T)*. Otherwise it is useless. A variable may be useless because it cannot be reached from S, or because it cannot derive a terminal string. A production rule is useless if it involves any useless variable. Useless rules can be removed from P without changing L(G).

Dependency Graph

To decide whether a variable can be reached from S, we can use a dependency graph. In this graph, each vertex corresponds to a variable, and there is an edge from C to D iff there exists a rule of the form C → xDy. A variable A is then unreachable from S (hence useless) iff there is no path from S to A in the dependency graph.

Theorem

Let G be a cfg. Then there exists an equivalent cfg Ĝ which has no useless variables or productions. We first construct G₁ containing only variables that can derive terminal strings:
1. Set V₁ = ∅. Repeat until no more variables are added: add A to V₁ if there exists a rule A → α where all symbols of α are in V₁ ∪ T.
2. Take P₁ to be the rules in P that involve only symbols in V₁ ∪ T.
We then remove the variables in V₁ not reachable from S, using the dependency graph described above.
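A minimal sketch of the two-step construction, assuming (our convention, not the text's) that variables are single uppercase letters and everything else is a terminal:

```python
def remove_useless(rules: dict, start: str = "S") -> dict:
    """rules maps each variable to a list of right-hand sides (strings)."""
    # Step 1: collect the variables that can derive a terminal string.
    deriving = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a not in deriving and any(
                all(not c.isupper() or c in deriving for c in r) for r in rhss
            ):
                deriving.add(a)
                changed = True
    # Keep only rules whose symbols all survive step 1.
    r1 = {
        a: [r for r in rhss if all(not c.isupper() or c in deriving for c in r)]
        for a, rhss in rules.items() if a in deriving
    }
    # Step 2: keep only variables reachable from the start symbol
    # (a breadth-first walk of the dependency graph).
    reachable, frontier = set(), {start}
    while frontier:
        v = frontier.pop()
        if v in reachable or v not in r1:
            continue
        reachable.add(v)
        for r in r1[v]:
            frontier |= {c for c in r if c.isupper()}
    return {a: r1[a] for a in r1 if a in reachable}
```

For example, with S → aSb | A | c, A → aA, B → b, the variable A cannot derive a terminal string and B is unreachable, so only S → aSb | c survives.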

λ-production

A λ-production is a rule of the form A → λ. A variable A is said to be nullable if A ⇒* λ. λ-productions can be removed from a grammar; example 6.4 gives an example.

Theorem

Let G be a cfg with λ ∉ L(G). Then there exists an equivalent cfg Ĝ without λ-productions. We first find the set V_N of nullable variables:
1. For every rule A → λ, add A to V_N.
2. Repeat until no more variables are added: for any B ∈ V, if there exists a rule B → α where all symbols of α are in V_N, then add B to V_N.
Then, for each production rule A → x₁...xₘ in P, put this rule, together with all variants obtained by replacing nullable variables with λ in every possible combination, into the new production set.
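The computation of V_N and the rewriting step can be sketched as follows (function names and conventions are ours: λ is written as the empty string, and rules that collapse entirely to λ are dropped, which is safe since λ ∉ L(G)):

```python
from itertools import combinations

def nullable_vars(rules: dict) -> set:
    """Variables A with A =>* lambda; lambda is written as ""."""
    null = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a not in null and any(all(c in null for c in r) for r in rhss):
                null.add(a)
                changed = True
    return null

def remove_lambda(rules: dict) -> dict:
    """Keep each rule plus every variant with nullable occurrences dropped,
    in all possible combinations; discard rules that collapse to lambda."""
    null = nullable_vars(rules)
    new = {}
    for a, rhss in rules.items():
        out = set()
        for r in rhss:
            idx = [i for i, c in enumerate(r) if c in null]
            for k in range(len(idx) + 1):
                for drop in combinations(idx, k):
                    s = "".join(c for i, c in enumerate(r) if i not in drop)
                    if s:
                        out.add(s)
        new[a] = sorted(out)
    return new
```

For instance, S → aSb | λ becomes S → aSb | ab, a λ-free grammar for {aⁿbⁿ : n ≥ 1}.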

Unit-Production

A unit-production is a rule of the form A → B, where A, B ∈ V. Let G be a cfg without λ-productions. Then there exists an equivalent cfg Ĝ without unit-productions. We first add all non-unit production rules of P to the new set P̂. Then, for every pair A, B with A ⇒* B, we add to P̂ the rules A → y₁ | ... | yₙ, where B → y₁ | ... | yₙ is the set of rules in P̂ with B on the left side.
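A sketch of this construction, under the same single-uppercase-letter convention as before (ours, not the text's): it first computes all pairs (A, B) with A ⇒* B via unit productions only, then copies B's non-unit rules over to A.

```python
def remove_units(rules: dict) -> dict:
    """Assumes no lambda-productions; variables are single uppercase letters."""
    # Transitive closure of unit pairs: (X, Y) means X =>* Y by unit rules.
    pairs = {(a, a) for a in rules}
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            for r in rhss:
                if len(r) == 1 and r.isupper() and r in rules:  # unit rule A -> B
                    for (x, y) in list(pairs):
                        if y == a and (x, r) not in pairs:
                            pairs.add((x, r))
                            changed = True
    # Each variable inherits the non-unit rules of everything it unit-derives.
    return {
        a: sorted({r for (x, b) in pairs if x == a
                   for r in rules[b] if not (len(r) == 1 and r.isupper())})
        for a in rules
    }
```

For example, with S → A | aa, A → B | b, B → ab, the result gives S → aa | ab | b, A → ab | b, B → ab, with no unit-productions left.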

Theorem

Let L be a context-free language with λ ∉ L. Then there exists a cfg G for L that has no useless production rules, no λ-productions, and no unit-productions.

Chomsky Normal Form

A cfg is said to be in Chomsky normal form if all production rules are of the form A → BC or A → a, where a ∈ T and B, C ∈ V. That is, the right side is either a single terminal symbol or a string of exactly two variables. (Theorem 6.6) Let L be a context-free language with λ ∉ L. Then there exists a cfg in Chomsky normal form for L.

Greibach Normal Form

A cfg is said to be in Greibach normal form if all production rules are of the form A → ax, where a ∈ T and x ∈ V*. That is, each right side is a terminal symbol followed by a string of variables of arbitrary length. (Theorem 6.7) Let L be a context-free language with λ ∉ L. Then there exists a cfg in Greibach normal form for L.

Membership Algorithm

The membership problem for cfgs is: given G and w, decide whether w ∈ L(G). An algorithm that answers correctly for all instances of G and w is called a membership algorithm for cfgs. Does such an algorithm exist? We claimed earlier that there is one with complexity O(|w|³): the CYK algorithm, named after Cocke, Younger and Kasami.

CYK Algorithm

The idea of CYK is to solve one big problem by solving a sequence of smaller ones. Assume we have a grammar in Chomsky normal form and a string w = a₁...aₙ. Define the sets of variables V_ij = {A ∈ V : A ⇒* w_ij = aᵢ...aⱼ}. Note that w ∈ L(G) iff S ∈ V_1n.

Details

To decide V_ij, observe that A ∈ V_ii iff A → aᵢ, so V_ii can be decided trivially for all i. For j > i, A ⇒* w_ij iff there is a rule A → BC with B ⇒* w_ik and C ⇒* w_{k+1,j} for some k. That is, V_ij = ⋃_{k ∈ {i,...,j−1}} {A : A → BC, B ∈ V_ik, C ∈ V_{k+1,j}}. The order of computation is thus: first V_11, V_22, ..., V_nn; then V_12, V_23, ..., V_{n−1,n}; then V_13, V_24, ..., V_{n−2,n}; and so on.
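The table-filling order above translates directly into code. This is a minimal sketch of CYK (representation is ours: the grammar is a list of (left side, right side) pairs in Chomsky normal form, with 0-based indices instead of the 1-based V_ij of the text):

```python
def cyk(rules: list, w: str, start: str = "S") -> bool:
    """rules: list of (lhs, rhs) pairs in Chomsky normal form; each rhs is
    either a single terminal or a two-variable string.  O(|w|^3 * |rules|)."""
    n = len(w)
    if n == 0:
        return False                      # CNF grammars cannot derive lambda
    # V[i][j] holds the variables deriving w[i..j] (0-based, inclusive).
    V = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):            # the diagonal: V_ii from A -> a_i
        V[i][i] = {a for a, r in rules if r == ch}
    for span in range(2, n + 1):          # longer substrings, shortest first
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):         # split point between B and C
                for a, r in rules:
                    if len(r) == 2 and r[0] in V[i][k] and r[1] in V[k + 1][j]:
                        V[i][j].add(a)
    return start in V[0][n - 1]

# A CNF grammar for {a^n b^n : n >= 1} (our example):
# S -> AB | AT, T -> SB, A -> a, B -> b
CNF = [("S", "AB"), ("S", "AT"), ("T", "SB"), ("A", "a"), ("B", "b")]
```

Each cell depends only on cells for shorter substrings, which is exactly the diagonal-by-diagonal order described above.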