Context Free Grammars


Context Free Grammars, UNIT III. By Prof. T. H. Gurav, Smt. Kashibai Navale COE, Pune

Context-Free Grammar: Definition. A context-free grammar is a 4-tuple G = (V, T, P, S), also written G = (V, Σ, P, S), where: V = a finite set of non-terminals (variables); T (or Σ) = a finite set of terminals (the alphabet); P = a finite set of productions; S = the start variable, S ∈ V. Productions have the form A → α, where A ∈ V and α ∈ (V ∪ T)*.

String generation by a CFG. Generate strings by repeated replacement of non-terminals with strings of terminals and non-terminals: 1. Write down the start variable (a non-terminal). 2. Replace a non-terminal with the right-hand side of a rule that has that non-terminal as its left-hand side. 3. Repeat the above until no non-terminals remain.
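The three steps above can be sketched in code. This is a minimal illustration for the toy grammar S → aSb | e; the dictionary encoding and function names are this sketch's own choices, not part of the slides.

```python
import random

# Sketch of the replacement procedure for the toy grammar S -> aSb | e.
GRAMMAR = {"S": [["a", "S", "b"], []]}   # [] encodes the empty production S -> e

def generate(symbol="S", rng=None):
    """Step 1: start from S; step 2: replace a non-terminal by one rule's RHS;
    step 3: recurse until only terminals remain."""
    rng = rng or random.Random(0)
    if symbol not in GRAMMAR:                 # terminal symbol: emit as-is
        return symbol
    rhs = rng.choice(GRAMMAR[symbol])         # pick one production for symbol
    return "".join(generate(s, rng) for s in rhs)

w = generate("S", random.Random(1))
print(w)   # some string of the form a^n b^n
```

Each run produces one string of the language; different random choices give different derivations.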

Context-Free Languages: Definition. Given a context-free grammar G = (T, NT, P, S), the language generated (derived) from G is the set L(G) = {w ∈ T* | S ⇒* w}. All intermediate strings arising from the start symbol S during the derivation process are called sentential forms. Definition. A language L is context-free if there is a context-free grammar G = (T, NT, P, S) such that L is generated by G.

Types of derivations. There are two ways to derive a string from a grammar: 1. Leftmost derivation: when at each step of the derivation a production is applied to the leftmost non-terminal, the derivation is said to be leftmost. 2. Rightmost derivation: when at each step of the derivation a production is applied to the rightmost non-terminal, the derivation is said to be rightmost.

Consider the following grammar: S → AB, A → e | aA | bAA, B → b | bc | Bc | bB. Sample derivations of aabb: S ⇒ AB ⇒ aAB ⇒ aaAB ⇒ aaB ⇒ aabB ⇒ aabb (leftmost) and S ⇒ AB ⇒ AbB ⇒ Abb ⇒ aAbb ⇒ aaAbb ⇒ aabb (rightmost). These two derivations use the same productions, but in different orders.

Parse Trees. The pictorial representation of a derivation in the form of a tree is very useful. This tree is called a parse tree or derivation tree. Root label = start symbol. Each interior label = a variable. Each parent/child relation = a derivation step. Each leaf label = a terminal or e. All leaf labels together = derived string = yield. [Figure: parse tree for aabb with root S.]

Yield of a parse tree. If we look at the leaves of any parse tree and concatenate them from left to right, we get a string called the yield of the parse tree.
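The yield can be computed mechanically. The sketch below represents a parse tree as nested tuples (label, children...), a hypothetical encoding chosen for illustration.

```python
# A parse tree as nested tuples: (label, children...); leaves are terminal
# strings, with "" standing for the empty production e.
def tree_yield(node):
    """Concatenate leaf labels from left to right: the yield of the tree."""
    if isinstance(node, str):          # a leaf: a terminal (or "" for e)
        return node
    label, *children = node            # interior node: a variable
    return "".join(tree_yield(c) for c in children)

# Parse tree for aabb in the grammar S -> aSb | e
t = ("S", "a", ("S", "a", ("S", ""), "b"), "b")
print(tree_yield(t))   # -> aabb
```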

Derivation trees / parse trees. [Figure: several distinct derivation trees, each with yield w = aabb, for the grammar above.] Other derivation trees for this string? Infinitely many others are possible.

CFGs & CFLs: Example 1. {a^n b^n | n ≥ 0} is not regular (already proved by the pumping lemma for regular languages), but it can be represented by the CFG G = ({S}, {a,b}, {S → Є, S → aSb}, S).
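A small companion check: membership in {a^n b^n | n ≥ 0} can be tested by undoing applications of S → aSb. This recognizer is a sketch, not part of the slides.

```python
# Recognizer for {a^n b^n | n >= 0} that undoes derivation steps of S -> aSb | e.
def in_anbn(w):
    while w.startswith("a") and w.endswith("b"):
        w = w[1:-1]                # undo one application of S -> aSb
    return w == ""                 # the remainder must come from S -> e

print([w for w in ["", "ab", "aabb", "aab", "abab"] if in_anbn(w)])
# -> ['', 'ab', 'aabb']
```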

Example 2. Construct a CFG for the language L of all strings that are palindromes over Σ = {0,1} (example of a palindrome: madam). G_pal = ({S}, {0,1}, A, S), where A = {S → e, S → 0, S → 1, S → 0S0, S → 1S1}. Over {a,b} the analogous grammar is S → e | a | b | aSa | bSb. Sometimes we group productions with the same head, e.g. S → e | 0 | 1 | 0S0 | 1S1.

Example. The string abaaba can be derived as: S (start symbol) ⇒ aSa (rule S → aSa) ⇒ abSba (rule S → bSb) ⇒ abaSaba (rule S → aSa) ⇒ abaЄaba (rule S → Є) = abaaba. So abaaba is a palindrome.

Ambiguity: Definition. A CFG is ambiguous if there is a string in the language that is the yield of two or more parse trees. Equivalently: a CFG is ambiguous if there is a terminal string that has multiple leftmost derivations from the start variable (equivalently, multiple rightmost derivations).

Example. Let G = ({E}, {a,b,-,/}, P, E), where P = {E → E-E | E/E | a | b} and E is the start symbol. Solution: consider two leftmost derivations of the string a-a/b. Derivation 1: E ⇒ E-E ⇒ a-E ⇒ a-E/E ⇒ a-a/E ⇒ a-a/b. Derivation 2: E ⇒ E/E ⇒ E-E/E ⇒ a-E/E ⇒ a-a/E ⇒ a-a/b.

Parse trees. [Figure: the two parse trees for a-a/b. One has / at the root with its left subtree deriving a-a, i.e. the grouping (a-a)/b; the other has - at the root with its right subtree deriving a/b, i.e. the grouping a-(a/b).]

Reasons. The relative precedence of subtraction and division is not uniquely defined, and the two groupings correspond to expressions with different values. The grammar also doesn't capture associativity!

Unambiguous G: E → E - T | T, T → T / F | F, F → (E) | I, I → a | b. Try a-b/a now!
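A minimal recursive-descent parser for this unambiguous grammar, offered as a sketch (the left-recursive rules E → E - T and T → T / F are handled with loops, which preserves left associativity). Tree nodes are tuples, an encoding chosen for this illustration.

```python
# Parser for E -> E - T | T ; T -> T / F | F ; F -> (E) | a | b
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat(t):
        nonlocal pos
        assert peek() == t, f"expected {t}"
        pos += 1
    def factor():
        if peek() == "(":
            eat("("); e = expr(); eat(")"); return e
        t = peek(); eat(t); return t            # identifier a or b
    def term():
        node = factor()
        while peek() == "/":                    # T -> T / F, left-associative
            eat("/"); node = ("/", node, factor())
        return node
    def expr():
        node = term()
        while peek() == "-":                    # E -> E - T, left-associative
            eat("-"); node = ("-", node, term())
        return node
    tree = expr()
    assert pos == len(tokens), "trailing input"
    return tree

print(parse(list("a-b/a")))   # ('-', 'a', ('/', 'b', 'a')): / binds tighter
```

For a-b/a there is now exactly one parse, grouping it as a-(b/a).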

CFG Simplification. A grammar may contain extra symbols that unnecessarily increase its length, so simplification is needed: 1. Eliminate ambiguity. 2. Eliminate useless variables. 3. Eliminate e-productions: A → e. 4. Eliminate unit productions: A → B. 5. Eliminate redundant productions.

Eliminate useless variables. A variable is useful if it occurs in a derivation that begins with the start symbol and generates a terminal string. Two types of symbols (NT or T) are useless: non-generating symbols, which do not generate any terminal string; and non-reachable symbols, which cannot be reached from the start symbol. We use the dependency-graph method to decide which non-terminals are not reachable. Example: S → aA, A → a, B → b. Here A is reachable and B is not reachable from S.
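The dependency-graph idea is a fixed-point search from the start symbol. The sketch below computes the reachable non-terminals for a tiny hypothetical grammar (same shape as the example: B is unreachable).

```python
# Fixed-point computation of reachable non-terminals (dependency-graph method).
PRODS = [("S", ["a", "A"]), ("A", ["a"]), ("B", ["b"])]   # B unreachable from S
TERMINALS = {"a", "b"}

def reachable(start="S"):
    seen, frontier = {start}, [start]
    while frontier:
        head = frontier.pop()
        for lhs, rhs in PRODS:
            if lhs == head:                      # follow edges head -> rhs symbols
                for sym in rhs:
                    if sym not in TERMINALS and sym not in seen:
                        seen.add(sym)
                        frontier.append(sym)
    return seen

print(sorted(reachable()))   # ['A', 'S'] -- B is useless here
```

A symmetric pass (from terminals upward) finds the generating symbols; useful symbols are those that are both.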

Eliminate e-productions: A → e. A CFG may have productions of the form A → e; such a production is used to erase A and is called a null production. While eliminating e-rules from a grammar, the language of the CFG must not change. Example: G: S → 0S | 1S | e. Construct G' generating L(G) − {e}. Solution: substitute S → e into the other rules to generate new rules, i.e. S → 0 and S → 1. Therefore G': S → 0S | 1S | 1 | 0.

Eliminate unit productions: A → B. Unit productions are productions in which one non-terminal yields another single non-terminal, e.g. A → B or X → Y. Steps: 1. Select a unit production A → B such that there exists a production B → X1X2X3…Xn. 2. While removing A → B, add A → X1X2X3…Xn to the grammar. 3. Eliminate A → B from the grammar.

Example. G: { S → 0A | 1B | C, A → 0S | 00, B → 1 | A, C → 01 }. Solution: the unit productions are S → C and B → A. We have C → 01, so S → 0A | 1B | 01. We have A → 0S | 00, so B → 0S | 00 | 1. Thus G' = { S → 0A | 1B | 01, A → 0S | 00, B → 0S | 00 | 1, C → 01 } (C is now unreachable and can also be dropped as useless).
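The two steps can be automated. The sketch below encodes the example's grammar with right-hand sides as symbol lists (a representation chosen for this illustration) and removes unit productions by repeated substitution.

```python
# Sketch of unit-production elimination for the grammar in the example.
# A RHS that is a single non-terminal (a key of the dict) is a unit production.
G = {
    "S": [["0", "A"], ["1", "B"], ["C"]],
    "A": [["0", "S"], ["0", "0"]],
    "B": [["1"], ["A"]],
    "C": [["0", "1"]],
}

def eliminate_units(g):
    g = {a: [r[:] for r in rules] for a, rules in g.items()}
    changed = True
    while changed:
        changed = False
        for a, rules in g.items():
            for rhs in list(rules):
                if len(rhs) == 1 and rhs[0] in g:        # unit production A -> B
                    rules.remove(rhs)
                    for repl in g[rhs[0]]:               # add A -> X1...Xn
                        if repl not in rules and repl != [a]:
                            rules.append(repl)
                    changed = True
    return g

g2 = eliminate_units(G)
print(g2["S"])   # S -> 0A | 1B | 01, as derived above
```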

Two normal forms: 1. Chomsky Normal Form (CNF). 2. Greibach Normal Form (GNF).

Chomsky Normal Form. A grammar is in CNF if all of its rules are of the form NT → NT NT or NT → T. In CNF we restrict both the length of the RHS and the nature of the symbols on the RHS of rules.

Greibach Normal Form. A CFG is in Greibach Normal Form if each rule is of the form NT → (one terminal)(any number of NTs). Example: S → aA and S → a are in GNF, but S → AA and S → Aa are not.

Rules: 1. Substitution rule. Let G = (V,T,P,S) be a given grammar with a production A → Bα and B → β1 | β2 | β3 | … | βn. Then we can convert the A-rule toward GNF as A → β1α | β2α | β3α | … | βnα. Example: let S → Aa and A → aA | bA | aAS | b. Applying rule 1: S → aAa | bAa | aASa | ba, with A → aA | bA | aAS | b unchanged.

2. Left recursion rule. Let G = (V,T,P,S) be a given grammar with productions A → Aα1 | Aα2 | Aα3 | … | β1 | β2 | β3 | … | βn, such that no βi starts with A. An equivalent grammar without left recursion is: A → β1 | β2 | β3 | … | βn | β1Z | β2Z | β3Z | … | βnZ and Z → α1 | α2 | α3 | … | αn | α1Z | α2Z | α3Z | … | αnZ, where Z is a new non-terminal.
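The rule is mechanical enough to code directly. A sketch, using symbol lists for right-hand sides and "Z" as the fresh non-terminal; the function name is this sketch's choice.

```python
# Left-recursion rule: A -> A a1 | ... | b1 | ... becomes
# A -> b1 | ... | b1 Z | ...,  Z -> a1 | ... | a1 Z | ...  (Z fresh).
def remove_left_recursion(a_rules, nt="A", fresh="Z"):
    alphas = [r[1:] for r in a_rules if r and r[0] == nt]    # A -> A alpha_i
    betas = [r for r in a_rules if not r or r[0] != nt]      # A -> beta_j
    new_a = betas + [b + [fresh] for b in betas]
    new_z = alphas + [al + [fresh] for al in alphas]
    return new_a, new_z

# E -> E - T | T  (the expression grammar from earlier, tokens as symbols)
new_e, new_z = remove_left_recursion([["E", "-", "T"], ["T"]], nt="E")
print(new_e)   # [['T'], ['T', 'Z']]
print(new_z)   # [['-', 'T'], ['-', 'T', 'Z']]
```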

Left-linear and right-linear grammars. 1. If a non-terminal appears only as the rightmost symbol in each production of a CFG, it is called a right-linear grammar. 2. If a non-terminal appears only as the leftmost symbol in each production, it is called a left-linear grammar. Linear grammars (either left or right) generate exactly the regular languages, and are also called regular grammars (which means that all regular languages are also context-free).

Regular grammars. Right-linear grammars: rules of the forms A → ε, A → a, A → aB. Left-linear grammars: rules of the forms A → ε, A → a, A → Ba. Here A, B are variables (NTs) and a is a terminal.

RLG to FA. Grammar G is right-linear. Example: S → aA | B, A → aA | B, B → bB | a.

Steps. Given a grammar G, the corresponding FA M is obtained as follows: 1. The initial state of the FA is the start non-terminal of G. 2. Each production in G corresponds to a transition in M. 3. The transitions in M are defined as: a. Each production A → aB gives a transition from state A to state B on input symbol a. b. Each production A → a gives a transition from state A to Qf (the final state of the FA) on input symbol a.

Example. Construct an NFA M for the grammar: S → aA | B, A → aB, B → a | bB.

1. Every grammar variable becomes a state, plus a special final state VF. 2. Add an edge for each production: (a) S → aA gives S -a→ A, and S → B gives an e-edge S -e→ B. (b) A → aB gives A -a→ B. (c) B → bB gives the loop B -b→ B, and B → a gives B -a→ VF. Then L(G) = L(M).
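The construction can be checked by simulation. A sketch, assuming the example grammar reads S → aA | B, A → aB, B → a | bB (the unit production S → B becomes an e-move, handled via epsilon closure).

```python
# NFA built from the right-linear grammar by the construction above.
EPS = ""
DELTA = {
    ("S", "a"): {"A"}, ("S", EPS): {"B"},   # S -> aA ; S -> B (e-move)
    ("A", "a"): {"B"},                      # A -> aB
    ("B", "b"): {"B"}, ("B", "a"): {"F"},   # B -> bB ; B -> a to final state F
}
FINAL = {"F"}

def eclose(states):
    """All states reachable via e-moves."""
    stack, out = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in DELTA.get((q, EPS), ()):
            if r not in out:
                out.add(r)
                stack.append(r)
    return out

def accepts(w, start="S"):
    current = eclose({start})
    for ch in w:
        nxt = set()
        for q in current:
            nxt |= DELTA.get((q, ch), set())
        current = eclose(nxt)
    return bool(current & FINAL)

print([w for w in ["a", "ba", "aaa", "ab"] if accepts(w)])   # ['a', 'ba', 'aaa']
```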

FA to RLG. Steps: 1. The start state of the FA becomes the start symbol of G. 2. Create the set of productions as follows: a. If q0 (the initial state of the FA) ∈ F, add the production S → Ԑ to P. b. For every transition B -a→ C, add the production B → aC. c. For every transition B -a→ C where C is a final state, also add the production B → a.

FA to RLG (example). Convert the FA below to a right-linear grammar. [Figure: FA with states q0, q1, q2, q3 and labelled transitions on a and b.] Reading off the productions: q0 → aq1, q1 → bq1 | aq2, q2 → b | bq3, with L(G) = L(M).

Conversion from RLG to LLG and vice versa: Right-linear G ↔ transition graph ↔ left-linear G. Steps (from RLG to LLG): 1. Represent the RLG as a transition graph (FA). 2. Interchange the start state and the final state. 3. Reverse the directions of all transitions, keeping the labels and the states unchanged. 4. Write the left-linear G from the changed transition graph.

Properties of CFLs. 1. The union and concatenation of two context-free languages are context-free, but the intersection need not be. 2. The reverse of a context-free language is context-free, but the complement need not be. 3. Every regular language is context-free because it can be described by a regular grammar. 4. The intersection of a context-free language and a regular language is always context-free. 5. There exist context-sensitive languages which are not context-free. 6. To prove that a given language is not context-free, one may employ the pumping lemma for context-free languages.

Pumping lemma for CFLs. Let G be a CFG. Then there exists a constant n such that any string w ∈ L(G) with |w| ≥ n can be rewritten as w = uvxyz, subject to the following conditions: 1. |vxy| ≤ n (the middle portion is short). 2. vy ≠ Є (the strings v and y to be pumped are not both empty). 3. For all i ≥ 0, uv^i x y^i z is in L (the two strings v and y can be pumped zero or more times, together).

[Figure: decomposition of w into u v x y z.]

Example 1. L = {a^n b^n c^n | n ≥ 0}. Assume L is a CFL and let n be the pumping-lemma constant. Choose w = a^n b^n c^n ∈ L. Applying the PL, w = uvxyz where |vy| > 0 and |vxy| ≤ n, such that uv^i x y^i z ∈ L for all i ≥ 0. Since |vxy| ≤ n, it spans at most two of the three letter blocks. Two possible cases: vxy is a combination of a's and b's, so uv²xy²z has more a's and/or b's than c's and is not in L; or vxy is a combination of b's and c's, so uv²xy²z has more b's and/or c's than a's and is not in L. Contradiction: L is not a CFL.
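A brute-force companion to the proof: for a small concrete constant we can enumerate every legal decomposition of w and confirm that none of them pumps back into L at i = 2. This exhaustive check is a sketch written for this note, not a substitute for the proof.

```python
# For w = a^4 b^4 c^4 and n = 4, no split w = uvxyz with |vxy| <= n and
# |vy| >= 1 keeps u v^2 x y^2 z inside L = {a^n b^n c^n}.
def in_L(w):
    k = w.count("a")
    return len(w) == 3 * k and w == "a" * k + "b" * k + "c" * k

n = 4
w = "a" * n + "b" * n + "c" * n
ok = True
for i in range(len(w) + 1):                      # v = w[i:j], x = w[j:k], y = w[k:l]
    for j in range(i, len(w) + 1):
        for k in range(j, len(w) + 1):
            for l in range(k, len(w) + 1):
                if l - i > n:                    # enforce |vxy| <= n
                    continue
                u, v, x, y, z = w[:i], w[i:j], w[j:k], w[k:l], w[l:]
                if not v and not y:              # enforce |vy| >= 1
                    continue
                if in_L(u + v * 2 + x + y * 2 + z):
                    ok = False
print(ok)   # True: no valid split pumps, matching the contradiction above
```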

Grammar types. There are 4 types of grammars according to the form of their rules; each type generates a class of languages: general (unrestricted) grammars → RE languages; context-sensitive grammars → CS languages; context-free grammars → CF languages; linear (regular) grammars → regular languages.

Chomsky Hierarchy Comprises four types of languages and their associated grammars and machines. Type 3: Regular Languages Type 2: Context-Free Languages Type 1: Context-Sensitive Languages Type 0: Recursively Enumerable Languages These languages form a strict hierarchy

1. Type 3: A → є, A → a, A → aB (or A → Ba). 2. Type 2: A → α, where A ∈ V and α ∈ (V ∪ T)*. 3. Type 1: αAβ → αXβ with |αXβ| ≥ |αAβ|, where α, β, X are strings of NTs and/or Ts, X is not null, and A is an NT.

Language | Grammar | Machine | Example
Regular language | right-linear or left-linear grammar | deterministic or nondeterministic finite-state acceptor (FA) | a*
Context-free language | context-free grammar | pushdown automaton (PDA) | a^n b^n
Context-sensitive language | context-sensitive grammar | linear-bounded automaton | a^n b^n c^n
Recursively enumerable language | unrestricted grammar | Turing machine (TM) | any computable function

Graph grammars. Graph grammars were invented in order to generalize (Chomsky) string grammars.

Graph grammars: definition. A graph grammar is a pair GG = (G0, P), where G0 is called the starting graph and P is a set of production rules. L(GG) is the set of graphs that can be derived by starting with G0 and applying the rules in P.

Continued. A set of production rules is used to replace one subgraph by another. The replacement process depends upon the embedding: edges to/from the old subgraph must be transformed into edges to/from the new subgraph.

Types of GG Often, on a high level, two kinds of graph grammars are distinguished: Hyperedge replacement grammars Rewrite rule replaces (hyper)edge by new graph Node replacement grammars Rewrite rule replaces node by new graph

Node replacement grammars. Node replacement grammars have rules of the form N → G / E, with a node label N, a labeled graph G, and embedding rules E. Replace any node with label N by G, connecting G to N's neighborhood according to the embedding rules listed in E. Embedding rules are based on node labels.

Example NR-GG rule: N → G / {(a,b), (b,c)}. [Figure: two applications of the rule to a host graph, each replacing a node labeled N by the labeled graph G and reconnecting G to the former neighbors of N according to the embedding rules (a,b) and (b,c).]

Production rules. The following two styles are used to describe the production rules in a GG: 1. Algebraic (using a gluing construction). 2. Set-theoretic (uses expressions to describe the embedding).

Applications. Picture processing: a picture can be represented as a graph, where labelled nodes represent primitives and labelled edges represent geometric relations (such as "is right of", "is below"). Diagram recognition.

Recursively enumerable languages. A TM accepts a string w if the TM halts in a final state. A TM rejects a string w if the TM halts in a non-final state or never halts. A language L is recursively enumerable if some TM accepts it; hence such languages are also called Turing-acceptable, or recognizable.

For a Turing-acceptable language L: [Figure: an input string fed to a Turing machine for L, with outcomes q_accept and q_reject.] It is possible that for some input strings the machine enters an infinite loop.

Recursive languages. A language L is recursive if some TM accepts it and halts on every input. Recursive languages are also called decidable languages, because a Turing machine can decide membership in them (it either accepts or rejects every string).

For a decidable language L: [Figure: a decider for L reads the input string and halts in q_accept or q_reject.] On halt, the decision is Accept or Reject: for each input string, the computation halts in the accept or reject state.

Undecidable languages. An undecidable language is one that is not decidable: there is no Turing machine which accepts the language and makes a decision (halts) for every input string. (A machine may still make a decision for some input strings.) For an undecidable language, the corresponding problem is undecidable (unsolvable).

Applications of RE and CFG in compilers

[Figure: Programming language (source) → Compiler → Machine language (target).]

The Structure of a Compiler

1. REs and FAs are usually used to classify the basic symbols (e.g. identifiers, constants, keywords) of a language. 2. Context-free grammars: 1. describe the structure of a program; 2. are used to match nested, counted constructs: brackets (), begin...end, if...then...else.

Lexical analysis / scanning converts a stream of characters (the input program) into a stream of tokens. Terminology: Token: name given to a family of words, e.g. integer constant. Lexeme: actual sequence of characters representing a word, e.g. 32894. Pattern: notation used to identify the set of lexemes represented by a token, e.g. [0-9]+.

Some more examples:
Token | Sample lexemes | Pattern
while | while | while
integer constant | 32894, -1093, 0 | [0-9]+
identifier | buffer size | [a-zA-Z]+
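A token table like this maps almost directly onto a scanner. The sketch below builds one with Python's re module; the token names, the underscore-friendly identifier pattern, and the ordering (keywords before identifiers) are this sketch's own choices.

```python
import re

# One named group per token; earlier alternatives win, so keywords come first.
TOKEN_SPEC = [
    ("WHILE", r"while\b"),
    ("INT_CONST", r"-?[0-9]+"),
    ("IDENT", r"[a-zA-Z_]+"),
    ("SKIP", r"\s+"),            # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Convert a stream of characters into a stream of (token, lexeme) pairs."""
    out = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            out.append((m.lastgroup, m.group()))
    return out

print(tokenize("while buffer_size 32894"))
# [('WHILE', 'while'), ('IDENT', 'buffer_size'), ('INT_CONST', '32894')]
```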

Patterns. How do we compactly represent the set of all lexemes corresponding to a token? For instance, the token integer constant represents the set of all integers: all sequences of digits (0-9), preceded by an optional sign (+ or -). Obviously, we cannot simply enumerate all lexemes. Instead, use regular expressions.

Regular definitions assign names to regular expressions. For example: digit → 0 | 1 | … | 9, and natural → digit digit*. Shorthands: a+ is the set of strings with one or more occurrences of a; a* is the set of strings with zero or more occurrences of a. Example: integer → (+ | -)? digit+.
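The integer definition above translates directly into a Python regex. A small sketch; the character-class form [+-] is just this sketch's rendering of the optional sign.

```python
import re

# integer -> (+ | -)? digit+   as an anchored Python regex
INTEGER = re.compile(r"[+-]?[0-9]+\Z")

print([s for s in ["42", "-7", "+100", "3.14", "abc"] if INTEGER.match(s)])
# ['42', '-7', '+100']
```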

Regular definitions and lexical analysis. Regular expressions and definitions specify sets of strings over an input alphabet. They can hence be used to specify the set of lexemes associated with a token; that is, regular expressions and definitions serve as the pattern language.

Parsing / syntax analysis. Main function of the parser: produce a parse tree from the stream of tokens received from the lexical analyzer; the tree is then used by the code generator to produce target code. This tree is the main data structure a compiler uses to process the program: by traversing it, the compiler can produce machine code. Secondary function of the parser: syntactic error detection, reporting to the user where the errors in the source code are.

Applications of REs. 1. Data validation: test for a pattern within a string. For example, you can test an input string to see if a telephone-number pattern or a credit-card-number pattern occurs within it. This is called data validation.
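The telephone-number test can be sketched as follows. The exact format (a hypothetical US-style 123-456-7890 shape) is this sketch's assumption; the text does not mandate any particular pattern.

```python
import re

# Hypothetical US-style phone pattern, e.g. 123-456-7890
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def contains_phone(s):
    """Data validation: does a phone-number pattern occur within s?"""
    return PHONE.search(s) is not None

print(contains_phone("call 555-867-5309 today"))   # True
print(contains_phone("no numbers here"))           # False
```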

Continued. 2. Pattern matching: you can find specific text within a document or input field. For example, you may need to search an entire Web site, remove outdated material, and replace some HTML formatting tags. In this case, you can use a regular expression to determine whether the material or the formatting tags appear in each file; this reduces the affected-files list to those containing material targeted for removal or change. You can then use a regular expression to remove the outdated material, and finally to search for and replace the tags.