Syntax Analysis: Context-Free Grammar


CMPSC 470, Lecture 05
Topics: overview of the parser; context-free grammar (CFG); eliminating ambiguity in a CFG.

A. Overview of Parser

Source program → Lexical analyzer → (token / get next token) → Parser, with both phases consulting the symbol table.

a) Role of the parser
1. Report any syntax errors.
2. Recover from commonly occurring errors so that processing of the remainder of the program can continue.
3. Create the parse tree.

b) Parsers use grammars
A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language. For certain classes of grammars, a parser can be constructed automatically.

c) Types of parsers and grammars
Commonly used parsing methods: top-down and bottom-up.
Common grammar classes: LL and LR; left-recursive and non-left-recursive grammars.
A grammar is ambiguous if it permits more than one parse tree for some sentence.

d) Programming errors
Lexical errors: misspelled identifiers, keywords, or operators.
Syntactic errors: misplaced semicolons; extra or missing braces or parentheses ( {, }, (, ) ); a case statement without an enclosing switch.
Semantic errors: type mismatches between operators and operands.
Logical errors: anything from incorrect reasoning on the part of the programmer to incorrect use of a language construct; the program produces unintended or undesired output or behavior.

B. Context-Free Grammar (CFG)

A context-free grammar (CFG) is a type of formal grammar: a set of production rules that describes all possible strings in a given formal language. A CFG is used to specify the syntax of a language.

a) Format
A CFG production has the following form:
    stmt → if ( expr ) stmt else stmt
Terminals: the components of the tokens output by the lexical analyzer; they become the leaf nodes of the parse tree (here, the keywords if and else and the parentheses).
Nonterminals: syntactic variables that denote sets of strings (here, stmt and expr).
Productions (of a grammar): they specify how terminals and nonterminals can be combined to form strings.
Start symbol: one of the nonterminals. Conveniently, the productions for the start symbol are listed first.

Example) Suppose we have the program:
    while ( i > 0 )
        if ( i % 2 == 0 ) i = i / 2;
        else i = i - 1;
The grammar that supports the above program can be:

b) Notational conventions
1. Terminals:
   (a) lowercase letters early in the alphabet, such as a, b, c;
   (b) operator symbols such as +, *, /;
   (c) punctuation symbols such as parentheses and commas;
   (d) the digits 0, 1, ..., 9;
   (e) boldface strings such as id or if.
2. Nonterminals:
   (a) uppercase letters early in the alphabet, such as A, B, C;
   (b) the letter S, which is usually the start symbol;
   (c) lowercase, italic names such as expr or stmt.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly empty) strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus a generic production can be written as A → α, where A is the head and α the body.
6. A set of productions A → α1, A → α2, ..., A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | ... | αk. Call α1, α2, ..., αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.

c) Derivations
Suppose we have the following grammar G:
    E → E + E | ( E ) | id
Starting from E, we can obtain the sentence ( id ) by sequentially replacing E:
    E ⇒ ( E ) ⇒ ( id )
We call such a sequence of replacements a derivation of ( id ) from E, and we say E derives ( id ).
Symbols: ⇒ means "derives in one step"; ⇒* means "derives in zero or more steps"; ⇒+ means "derives in one or more steps".
Rules:
1. α ⇒* α, for any string α;
2. if α ⇒* β and β ⇒ γ, then α ⇒* γ.
Sentential forms and sentences: if S is the start symbol and S ⇒* α, then α is a sentential form of grammar G. A sentential form may contain both terminals and nonterminals, and may be empty. If α is a sentence of G, then α contains only terminals.
The language generated by a grammar is its set of sentences. A string of terminals w is in L(G) if and only if w is a sentence of G (that is, S ⇒* w):
    L(G) = { w | S ⇒* w }
A context-free language is a language that can be generated by a CFG. Two grammars are equivalent if they generate the same language.
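A derivation can be mechanized as repeated string replacement. The following Python sketch (not part of the lecture; the helper name replace_leftmost is ours) performs the derivation E ⇒ ( E ) ⇒ ( id ) for the grammar above:

```python
def replace_leftmost(sentential_form, nonterminal, body):
    """One derivation step: rewrite the leftmost occurrence of nonterminal."""
    return sentential_form.replace(nonterminal, body, 1)

step0 = "E"
step1 = replace_leftmost(step0, "E", "( E )")   # E => ( E )
step2 = replace_leftmost(step1, "E", "id")      # ( E ) => ( id )
print(" => ".join([step0, step1, step2]))       # E => ( E ) => ( id )
```

Because each step rewrites only the leftmost nonterminal, this is in fact a leftmost derivation.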

Leftmost derivation: the leftmost nonterminal is always the one replaced; written ⇒lm.
Rightmost derivation: the rightmost nonterminal is always the one replaced; written ⇒rm.

d) Parse Tree
A parse tree is a graphical representation of a derivation. Each interior node of a parse tree represents the application of a production: interior nodes are nonterminals, and leaves are terminals. A parse tree filters out the order in which productions are applied to replace nonterminals.

e) Ambiguity
A grammar is ambiguous if it produces more than one parse tree for some sentence or, equivalently, if it produces more than one leftmost (or rightmost) derivation for the same sentence. For most parsers, it is desirable that the grammar be unambiguous.
Example) The following grammar is ambiguous:
    E → E + E | E * E | id
because it permits two distinct leftmost derivations for a sentence such as id + id * id.
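The ambiguity can be counted mechanically. This Python sketch (our own illustration; count_trees is a hypothetical helper) counts the distinct parse trees that the ambiguous grammar E → E + E | E * E | id assigns to a token string:

```python
from functools import lru_cache

def count_trees(tokens):
    """Count the parse trees for a token list under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # One tree if the span is exactly "id"; otherwise try every operator split.
        n = 1 if tokens[i:j] == ("id",) else 0
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))

print(count_trees(["id", "+", "id", "*", "id"]))  # 2: (id+id)*id and id+(id*id)
```

With more operators in the sentence, the number of parse trees grows further, which is why ambiguity must be engineered out of the grammar rather than ignored.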

f) CFG is a more powerful notation than regular expressions
Every construct described by a regular expression can be described by a CFG: every regular language is a context-free language, but not vice versa.
Converting an NFA to a CFG. Example) Determine a CFG accepting the regular expression (a|b)*ab:
1. Determine an NFA accepting the regular expression.
2. For each state i of the NFA, create a nonterminal Ai.
3. If state i has a transition to state j on input a, add the production Ai → aAj. If i has an ε-transition to j, add the production Ai → Aj.
4. If i is an accepting state, add Ai → ε.
5. If i is the start state, make Ai the start symbol of the grammar.
The language L = { a^n b^n | n ≥ 1 }, with an equal number of a's and b's, can be described by a CFG but not by a regular expression; we say that finite automata cannot count. There is no DFA accepting L. A CFG accepting L = { a^n b^n | n ≥ 1 } is:
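The claim that a CFG can describe L = { a^n b^n | n ≥ 1 } can be checked with a small recursive-descent recognizer for the grammar S → a S b | a b (a standard grammar for this language; the code is our sketch, not from the lecture):

```python
def matches(s):
    """Recognize L = { a^n b^n | n >= 1 } with grammar S -> a S b | a b."""
    def S(i):
        # Returns the index just past a match starting at i, or None.
        if i < len(s) and s[i] == "a":
            j = S(i + 1)                                 # try S -> a S b
            if j is not None and j < len(s) and s[j] == "b":
                return j + 1
            if i + 1 < len(s) and s[i + 1] == "b":       # try S -> a b
                return i + 2
        return None

    return S(0) == len(s)

print(matches("aaabbb"), matches("aab"))  # True False
```

The recursion depth tracks the nesting of a...b pairs, which is exactly the "counting" a finite automaton cannot do.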

g) Non-context-free languages
Some checks cannot be expressed by a CFG and must be deferred to semantic analysis.
Example 1) Consider the abstract language L1 = { wcw | w is in (a|b)* }. In a programming language, the first w represents the declaration of an identifier, the second w represents a use of that identifier, and c represents the program text in between. In a programming language like C/C++/Java, L1 abstracts the problem of checking that identifiers are declared before they are used. A CFG cannot describe a non-context-free language like L1, so this check must be done in the semantic analysis phase.
Example 2) In L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }, a^n and b^m represent the formal parameter lists of two function declarations, and c^n and d^m represent the corresponding argument lists in calls to those functions. A CFG cannot describe this language either, so the semantic analysis phase should check that the number of arguments in each call matches the number of formal parameters.
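The wcw check (every identifier declared before use) is exactly what the semantic-analysis phase does with a symbol table. A minimal sketch (our illustration; check_uses is a hypothetical helper):

```python
def check_uses(declared, used):
    """Return the used names that were never declared: the semantic check
    abstracted by the non-context-free language L1 = { wcw }."""
    table = set(declared)          # symbol table built from declarations
    return [name for name in used if name not in table]

print(check_uses(["x", "y"], ["x", "z"]))  # ['z'] is used but not declared
```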

C. Eliminate Ambiguity in CFG

An ambiguous grammar can have more than one parse tree for a given string of terminals. Since a string with more than one parse tree has more than one meaning, we want unambiguous grammars. Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.

a) Associativity of operators
Suppose we have a grammar G for assignments such as a = b = c:
    A → A = A | a | b | c | ... | z
Given the sentence a = b = c, there are two parse trees:
Case 1: the parse tree built with a left-associative grammar, where b belongs to the left operator =.
Case 2: the parse tree built with a right-associative grammar, where b belongs to the right operator =.
A new grammar that eliminates the ambiguity by making operator = left-associative:
A new grammar that eliminates the ambiguity by making operator = right-associative:
The new parse tree of a = b = c in each case is:
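How associativity follows from the shape of the grammar can be seen in code: right recursion in the grammar A → id = A | id produces a right-leaning tree, which is the usual choice for = (as in C). A Python sketch (ours, not the lecture's):

```python
def parse_assign(tokens):
    """Parse a = b = c with grammar A -> id = A | id (right-associative)."""
    head, *rest = tokens
    if rest[:1] == ["="]:
        # A -> id = A : the right recursion nests the remainder under this node.
        return (head, "=", parse_assign(rest[1:]))
    return head                    # A -> id

print(parse_assign(["a", "=", "b", "=", "c"]))  # ('a', '=', ('b', '=', 'c'))
```

Note that b ends up grouped with c, i.e. a = (b = c); a left-recursive grammar would group it the other way.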

An ambiguous grammar for the expression 9 + 5 - 2 is given as follows:
    E → E + E | E - E | num
The operators + and - are, in general, left-associative. How can the ambiguity of this grammar be removed?

b) Precedence of operators
Consider the expression 9 + 5 * 2. There are two ways of interpreting it: (9 + 5) * 2 and 9 + (5 * 2). The associativity rule cannot be applied here, because * has higher precedence than +. Given the following grammar for + and *, which are left-associative:
    E → E + F | E * F | F
    F → num
this grammar handles associativity but gives + and * the same precedence. It can be rewritten so that * has higher precedence than +:
The parse trees of 9 + 5 * 2 and 9 * 5 + 2 are:

In a similar manner, the grammar can be rewritten to support parentheses, which have the highest precedence. Draw the parse tree of (1 + 2 * 3) * 4:
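The precedence-and-parenthesis grammar can be exercised with a small recursive-descent evaluator. This is our sketch (function names are ours) of the usual layered grammar E → E + T | E - T | T, T → T * F | F, F → ( E ) | num, with the left recursion turned into loops so + - * keep their left associativity:

```python
import re

def evaluate(text):
    """Evaluate text under E -> E+T | E-T | T,  T -> T*F | F,  F -> (E) | num."""
    tokens = re.findall(r"\d+|[-+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def F():                       # F -> ( E ) | num
        if peek() == "(":
            eat(); value = E(); eat()      # consume '(' E ')'
            return value
        return int(eat())

    def T():                       # T -> T * F | F, written as a loop
        value = F()
        while peek() == "*":
            eat(); value *= F()
        return value

    def E():                       # E -> E + T | E - T | T, written as a loop
        value = T()
        while peek() in ("+", "-"):
            value = value + T() if eat() == "+" else value - T()
        return value

    return E()

print(evaluate("9 + 5 * 2"), evaluate("(1 + 2 * 3) * 4"))  # 19 28
```

Because T sits below E, every * binds tighter than + and -, and parentheses in F override both.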

c) Eliminate dangling else
Consider the following grammar for if-statements:
    stmt → if expr then stmt
         | if expr then stmt else stmt
         | other
The grammar is ambiguous, since the sentence
    if E1 then if E2 then S1 else S2
has two parse trees. This situation is called the dangling else. In almost all programming languages, the first parse tree, in which each else matches the closest unmatched then, is preferred. Rewrite the grammar to eliminate the dangling else.

d) Eliminate immediate left recursion
A grammar has immediate left recursion if there is a derivation A ⇒ Aα for some string α of terminals and/or nonterminals.
Example) Given the following grammar:
    E → E + T | T
the body of the first production begins with E, so a top-down procedure for E calls itself recursively without consuming input. This is immediate left recursion; a top-down parser cannot handle an immediately left-recursive grammar.
The immediate left recursion can be eliminated by rewriting the grammar as follows:
    E → T R
    R → + T R | ε
Generalize: suppose we have a grammar
    A → Aα | β
A is immediately left-recursive, and the left recursion can be removed by rewriting the grammar using a new nonterminal R:
    A → β R
    R → α R | ε
Now R is right-recursive, and the grammar has no left recursion. Note that the two grammars generate the same sentences, of the form βα...α (β followed by zero or more α's).

Generalized elimination of immediate left recursion: consider the immediately left-recursive A-productions
    A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
Rewrite the grammar to eliminate the immediate left recursion:
    A → β1 A' | β2 A' | ... | βn A'
    A' → α1 A' | α2 A' | ... | αm A' | ε
Example) Eliminate the immediate left recursion of the following grammar:
    E → E + T | E - T | T

e) Eliminate left recursion
A grammar is left recursive if there is a derivation A ⇒+ Aα for some string α.
Example) Consider the following grammar:
    S → Aa | Bb | c
    A → Ac | Bd | Se | ε
    B → Af | Sg | ε
A is immediately left-recursive. S and B are not immediately left-recursive, but they are left recursive (for example, S ⇒ Aa ⇒ Sea).
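The generalized rewrite can be expressed as a short function. A sketch (the function name and the representation of productions as lists of symbol lists are ours):

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
    A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | epsilon."""
    recursive = [b[1:] for b in bodies if b[:1] == [head]]   # the alpha_i
    others = [b for b in bodies if b[:1] != [head]]          # the beta_j
    if not recursive:
        return {head: bodies}        # no immediate left recursion: nothing to do
    new = head + "'"
    return {
        head: [b + [new] for b in others],
        new: [a + [new] for a in recursive] + [[]],          # [] denotes epsilon
    }

# E -> E + T | T   becomes   E -> T E'  and  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```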

Eliminate the left recursion:
1a. For the S-productions: apply S → Aa to the A-productions and S → Bb to the B-productions (substituting for S wherever it appears in their bodies); repeat.
1b. Remove the immediate left recursion among the S-productions.
2a. For the A-productions: apply A → Bd to the B-productions (substituting for A in their bodies).
2b. Remove the immediate left recursion among the A-productions.
3a. For the B-productions: substitute in the same way.
3b. Remove the immediate left recursion among the B-productions.
4. Stop, because there are no other productions from which to eliminate left recursion.
Finally, the resulting grammar has no left recursion:

f) Left factoring
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive, or top-down, parsing.
Example) Consider the following if-statement productions:
    stmt → if expr then stmt
         | if expr then stmt else stmt
On seeing the input token if, or even the prefix if expr then stmt, we cannot immediately choose which alternative production to apply. By rewriting the grammar, the if-statement can be left-factored to defer the decision until enough input has been seen:
    stmt → if expr then stmt rest
    rest → else stmt | ε
This new grammar is still ambiguous, because it still has the dangling-else problem; that problem is resolved separately.
Generalize: when the choice between two or more alternative productions for a nonterminal A is not clear, find the longest prefix α common to two or more of its alternatives. Let the A-productions have the form
    A → αβ1 | αβ2 | ... | αβn | γ
where α, β1, ..., βn, γ are strings of terminals and/or nonterminals. The grammar can be rewritten to defer the decision until the input is clear, using a new nonterminal A':
    A → αA' | γ
    A' → β1 | β2 | ... | βn
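The transformation above can be sketched in code (left_factor and the list-of-symbol-lists representation are our assumptions, not the lecture's):

```python
from os.path import commonprefix   # works element-wise on any sequences

def left_factor(head, bodies):
    """For A -> alpha b1 | alpha b2 | ... | gamma, pull out the longest common
    prefix alpha:  A -> alpha A' | gamma   and   A' -> b1 | b2 | ..."""
    first = bodies[0][:1]
    prefixed = [b for b in bodies if b[:1] == first]   # alternatives sharing a prefix
    rest = [b for b in bodies if b[:1] != first]       # the gamma alternatives
    alpha = list(commonprefix(prefixed))               # the longest common prefix
    if len(prefixed) < 2 or not alpha:
        return {head: bodies}                          # nothing to factor
    new = head + "'"
    return {
        head: [alpha + [new]] + rest,
        new: [b[len(alpha):] for b in prefixed],       # [] denotes epsilon
    }

# stmt -> if expr then stmt | if expr then stmt else stmt
print(left_factor("stmt", [["if", "expr", "then", "stmt"],
                           ["if", "expr", "then", "stmt", "else", "stmt"]]))
```

The result matches the rewritten if-statement grammar: the common prefix moves into the head production, and the new nonterminal carries the else branch or ε.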