Syntax analysis

Previously we used context-free grammars for specifying an input language, and for predictive parsing. Now we consider this topic systematically, in much greater detail. After we have developed the theory of predictive parsing, we will turn to the study of a more powerful (that is, more widely applicable) but decidedly less intuitive approach to parsing. Eventually, we'll also experiment with the use of the parser generator Yacc.

We begin by looking more closely at the basics of the underlying theory of context-free grammars.

Context-free grammars

A context-free grammar (CFG) is a 4-tuple G = (V, Σ, S, P) where
  V is a finite set of variables, or nonterminal symbols,
  Σ is a finite set of terminal symbols, or tokens,
  S ∈ V is the start symbol,
  P is a finite set of productions of the form A → α, where A ∈ V and α is a string over V ∪ Σ.

Let us first define the notion of a derivation step in a CFG G. Given a variable A and strings α, β, γ over V ∪ Σ, we can write
  αAγ ⇒_G αβγ
if there is a production A → β. In this case, we say that αAγ derives αβγ in one step. (We usually suppress the subscript G.)
Example

Take G = ({E, A}, {(, ), +, *, num}, E, P) where P consists of
  E → E A E
  E → ( E )
  E → num
  A → +
  A → *

Then we have, for instance,
  E ⇒ num
and
  E ⇒ (E) ⇒ (EAE) ⇒ (numAE) ⇒ (num+E) ⇒ (num+num)

In general, we can abbreviate n A-productions A → α_1, A → α_2, ..., A → α_n by writing A → α_1 | α_2 | ... | α_n, so the CFG above can be written
  E → E A E | ( E ) | num
  A → + | *

Derivable strings, sentential forms, and sentences

We write α ⇒* β to say that α derives β in zero or more steps. More precisely, α ⇒* α for any string α over V ∪ Σ, and α ⇒* γ if there is a β such that α ⇒* β and β ⇒ γ.

In particular, for every variable A, there is a set of strings derivable from A: the strings α over V ∪ Σ such that A ⇒* α.

A sentential form of G is a string derivable from the start symbol. A sentence of G is a sentential form of G in which no variables occur. The language generated by G, denoted L(G), is the set of sentences of G.

Example: num and (num + num) are sentences of the CFG G with productions
  E → E A E | ( E ) | num
  A → + | *
while E and (numAE) are sentential forms that are not sentences.

Is ((num + E)) a sentence of G? A sentential form?
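The definitions of a derivation step, of derivability in zero or more steps, and of a sentence can be tried out mechanically. The following is a minimal sketch, not part of the original slides: the encoding of sentential forms as tuples of symbols and the names `PRODUCTIONS`, `derive_one_step`, `derives`, and `is_sentence` are assumptions made for illustration. The search terminates only because no production of this particular grammar shortens a sentential form.

```python
from collections import deque

# Assumed encoding: a sentential form is a tuple of symbols (strings);
# each variable maps to the list of right-hand sides of its productions.
PRODUCTIONS = {
    "E": [("E", "A", "E"), ("(", "E", ")"), ("num",)],
    "A": [("+",), ("*",)],
}

def derive_one_step(form, productions):
    """Return every form obtained from `form` by rewriting one variable once."""
    results = set()
    for i, symbol in enumerate(form):
        for rhs in productions.get(symbol, []):
            results.add(form[:i] + rhs + form[i + 1:])
    return results

def derives(start, target, productions):
    """Decide whether start =>* target by breadth-first search.

    Forms longer than the target are pruned, which is sound here because
    no production in this grammar shrinks a sentential form.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for nxt in derive_one_step(form, productions):
            if len(nxt) <= len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def is_sentence(form, productions):
    """A form is a sentence candidate when no variable occurs in it."""
    return all(symbol not in productions for symbol in form)

target = tuple("( num + num )".split())
print(derives(("E",), target, PRODUCTIONS))   # True
print(is_sentence(target, PRODUCTIONS))       # True
```

A form is a sentential form of G exactly when `derives(("E",), form, PRODUCTIONS)` holds, and a sentence when `is_sentence` holds as well.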
Leftmost and rightmost derivations

At each step in a derivation, two choices are made: which variable to replace, and which production to apply. We will see that the behavior of top-down parsers, such as the predictive parsers we played around with, corresponds to leftmost derivations, in which the production at each step is applied to the leftmost variable in the sentential form.

Example: We previously saw two leftmost derivations,
  E ⇒ num
and
  E ⇒ (E) ⇒ (EAE) ⇒ (numAE) ⇒ (num+E) ⇒ (num+num)
in the grammar with productions
  E → E A E | ( E ) | num
  A → + | *

Another derivation of (num + num) is
  E ⇒ (E) ⇒ (EAE) ⇒ (EAnum) ⇒ (E+num) ⇒ (num+num)
which is not leftmost.

Rightmost derivations, parse trees

In fact, the derivation
  E ⇒ (E) ⇒ (EAE) ⇒ (EAnum) ⇒ (E+num) ⇒ (num+num)
is an example of a rightmost derivation: the production at each step is applied to the rightmost variable in the sentential form. We'll see later that some powerful parsing algorithms are best understood in terms of rightmost derivations.

A parse tree can be understood as a graphical representation of a derivation which suppresses as much information as possible about the order of production applications. For example, the parse tree corresponding to the above rightmost derivation also corresponds to the previous leftmost derivation, and to additional derivations that are neither leftmost nor rightmost. But every parse tree corresponds to a unique leftmost derivation, and a unique rightmost derivation.
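The claim that one parse tree determines exactly one leftmost and one rightmost derivation can be illustrated by reading both derivations off the same tree. This is a sketch under assumed conventions, not taken from the slides: a tree node is a pair `(symbol, children)`, terminal leaves have `children` set to `None`, and the helper names `leaf` and `derivation` are hypothetical.

```python
def leaf(token):
    """A terminal leaf; None marks that there is nothing to expand."""
    return (token, None)

# Parse tree for (num+num) in the grammar E -> E A E | ( E ) | num, A -> + | *
TREE = ("E", [
    leaf("("),
    ("E", [
        ("E", [leaf("num")]),
        ("A", [leaf("+")]),
        ("E", [leaf("num")]),
    ]),
    leaf(")"),
])

def derivation(tree, leftmost=True):
    """List the sentential forms of the leftmost (or rightmost) derivation
    that corresponds to `tree`."""
    frontier = [tree]                      # nodes not yet expanded, left to right
    forms = ["".join(sym for sym, _ in frontier)]
    while True:
        # positions of variables (internal nodes) still waiting to be expanded
        pending = [i for i, (_, kids) in enumerate(frontier) if kids is not None]
        if not pending:
            return forms
        i = pending[0] if leftmost else pending[-1]
        frontier[i:i + 1] = frontier[i][1]  # replace the node by its children
        forms.append("".join(sym for sym, _ in frontier))

print(" => ".join(derivation(TREE, leftmost=True)))
# E => (E) => (EAE) => (numAE) => (num+E) => (num+num)
print(" => ".join(derivation(TREE, leftmost=False)))
# E => (E) => (EAE) => (EAnum) => (E+num) => (num+num)
```

Both derivations have the same number of steps, one per internal node of the tree; only the order in which the nodes are expanded differs.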
Ambiguous grammars

A CFG G is ambiguous if there is more than one leftmost derivation for a sentence of G. Equivalently, G is ambiguous if there is more than one rightmost derivation for a sentence of G. Equivalently, G is ambiguous if there is more than one parse tree for a sentence of G.

Claim: The CFG with productions
  E → E A E | ( E ) | num
  A → + | *
is ambiguous. Consider, for example, the sentence num + num * num. It has two parse trees: one groups num + num first, the other groups num * num first.

We previously discussed the kind of problem that such ambiguity causes for translation or evaluation of such an expression.

For example, the ambiguity of
  E → E A E | num
can be easily eliminated by transforming the grammar as follows:
  E → E A num | num

In some cases, eliminating ambiguity may make the grammar harder to understand. Sometimes it is better to use other means for disambiguating a grammar. In fact, it is often nice to write ambiguous grammars with commonsensical disambiguating rules, such as operator precedence for expressions. We will eventually see this idea applied in Yacc. More generally, for some approaches to parsing, we want unambiguous grammars.
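Ambiguity of the expression grammar can be checked on a concrete sentence by counting its parse trees. The sketch below is an illustration only; the function name `count_parse_trees` and the grammar encoding are assumptions. It relies on two properties of this grammar: there are no ε-productions, so every symbol covers at least one token, and there are no cycles of unit productions. Without those properties the recursion would not be guaranteed to terminate.

```python
from functools import lru_cache

# E -> E A E | ( E ) | num,  A -> + | *   (right-hand sides as tuples)
GRAMMAR = {
    "E": (("E", "A", "E"), ("(", "E", ")"), ("num",)),
    "A": (("+",), ("*",)),
}

def count_parse_trees(tokens, start="E", grammar=GRAMMAR):
    """Count the parse trees that derive `tokens` from `start`.

    Assumes no epsilon-productions and no unit-production cycles.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        if symbol not in grammar:                 # terminal: must be one matching token
            return int(j == i + 1 and tokens[i] == symbol)
        return sum(count_sequence(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(symbols, i, j):
        if not symbols:
            return int(i == j)
        first, rest = symbols[0], symbols[1:]
        total = 0
        # each remaining symbol needs at least one token, so stop early enough
        for k in range(i + 1, j - len(rest) + 1):
            left = count_symbol(first, i, k)
            if left:
                total += left * count_sequence(rest, k, j)
        return total

    return count_symbol(start, 0, len(tokens))

print(count_parse_trees("num + num * num".split()))   # 2
print(count_parse_trees("( num + num )".split()))     # 1
```

A count greater than one for any sentence shows that the grammar is ambiguous; a count of one for the sentences tried shows nothing in general, since ambiguity is a property of the whole grammar.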
Ambiguity cannot always be eliminated: some context-free languages are inherently ambiguous. For example, the language
  L = { a^m b^m a^n b^n | m, n ∈ N } ∪ { a^m b^n a^n b^m | m, n ∈ N }
is context-free, but cannot be generated by any unambiguous CFG. Consider the following grammar for this language:
  S → A A | B
  A → a A b | ε
  B → a B b | C
  C → b C a | ε
Intuitively, any grammar for L will have two parse trees for a^n b^n a^n b^n for some n ∈ N.

There is no general method for identifying ambiguity, or for eliminating it when possible. In fact, even the question of whether an arbitrary CFG is ambiguous is unsolvable (undecidable).

Regular languages and context-free languages

Recall: some context-free languages are not regular. For example,
  L = { a^n b^n | n ∈ N }
is context-free, since it is generated by the grammar
  S → a S b | ε
but it is not regular.

Every regular language is context-free. Consider any NFA M = (States, Σ, move, S_0, Final). A CFG that generates the language accepted by M is G = (States, Σ, S_0, P), where P is constructed by taking
  A → a B for each B ∈ move(A, a), for all A ∈ States and a ∈ Σ ∪ {ε},
and also taking
  A → ε for every A ∈ Final.
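The NFA-to-CFG construction is short enough to write out directly. The following sketch is illustrative only: the dictionary encoding of `move`, the use of the empty string for an ε-move, the sample NFA for a*b, and the function names `nfa_to_cfg` and `sentences` are all assumptions, not part of the slides.

```python
def nfa_to_cfg(states, move, final):
    """Build the productions A -> aB (for B in move(A, a)) and A -> epsilon
    (for A in Final). `move` maps (state, symbol) to a set of states;
    the symbol "" stands for an epsilon move."""
    productions = {state: [] for state in states}
    for (state, symbol), targets in move.items():
        for target in sorted(targets):
            rhs = (symbol, target) if symbol else (target,)
            productions[state].append(rhs)
    for state in final:
        productions[state].append(())          # A -> epsilon
    return productions

def sentences(productions, start, max_len):
    """Enumerate the sentences with at most max_len terminals. Works for the
    grammars built above, whose forms are a terminal prefix plus one variable."""
    found, seen, stack = set(), {("", start)}, [("", start)]
    while stack:
        prefix, var = stack.pop()
        for rhs in productions[var]:
            if not rhs:                         # epsilon production ends the derivation
                found.add(prefix)
                continue
            item = (prefix + "".join(rhs[:-1]), rhs[-1])
            if len(item[0]) <= max_len and item not in seen:
                seen.add(item)
                stack.append(item)
    return found

# Hypothetical NFA accepting a*b: q0 loops on a, moves to the final state q1 on b.
STATES = ["q0", "q1"]
MOVE = {("q0", "a"): {"q0"}, ("q0", "b"): {"q1"}}
FINAL = ["q1"]

cfg = nfa_to_cfg(STATES, MOVE, FINAL)
print(cfg)                                # {'q0': [('a', 'q0'), ('b', 'q1')], 'q1': [()]}
print(sorted(sentences(cfg, "q0", 3)))    # ['aab', 'ab', 'b']
```

Each derivation in the resulting grammar traces one path through the NFA: the single variable at the end of the sentential form is the current state, and it can be erased only in a final state.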
Some languages are not context-free

A simple example is { ww | w ∈ (a|b)* }.

In practice, such limitations are one reason that the grammar for a typical programming language will allow syntactically illegal programs, for example, programs in which an identifier is used without being declared. Recall: this sort of thing might be handled via a symbol table. (One attribute of an identifier could reflect whether or not the identifier is declared.)

The book gives some additional examples, trying to relate specific examples of languages that are not context-free to associated problematic structures of programming languages.

Top-down parsing and left-recursion

The behavior of a top-down parser corresponds to a leftmost derivation: at each step, apply a production to the leftmost variable in the current sentential form. The immediate goal is to generate a sentential form whose first character is the current input symbol.

For instance, for input bbb and grammar
  A → A b | b
we could begin by constructing the parse tree in which the root applies A → A b and the inner A applies A → b, so that the sentential form (in this case, sentence) generated is bb. While this parse is wrong, in the sense that it can't generate the whole input string, it does generate the first (and second) input character.
Of course a successful top-down parse of bbb using the grammar
  A → A b | b
would instead generate the parse tree that applies A → A b twice and then A → b, to match the first b. After doing this you're also ready to match the second and third b.

So it appears that we may succeed with this grammar, at least if we allow backtracking, so that we can go back and try something else if we happen to guess wrong at some point.

But there is a problem. When attempting a top-down parse of bb (or bbb, or ...) with this grammar, we could instead find ourselves generating a parse tree in which A → A b is applied to the leftmost variable again and again, without end. Notice that this never leads to a match with the first character of the input: we never generate an initial terminal character.

The crucial difficulty here is that the production A → A b is left-recursive. But even a grammar with no immediate left-recursion can exhibit this behaviour.
For example, we could have a similar difficulty with the grammar
  A → B | b
  B → A
which is also left-recursive, although there is no single production that is left-recursive. A top-down parse can expand A to B, then B to A, and so on, without ever generating a terminal.

So we want a definition of left-recursive grammars, and a method for systematically eliminating left recursion.

We begin with a preliminary definition. We write α ⇒+ β to say that α derives β in one or more steps. More precisely, α ⇒+ β if α ⇒ β, and α ⇒+ γ if there is a β such that α ⇒+ β and β ⇒ γ.

A CFG G = (V, Σ, S, P) is left-recursive if there is a variable A such that A ⇒+ Aα for some string α over V ∪ Σ.

So, for example, the grammar
  A → B | b
  B → A
is left-recursive because, for instance, A ⇒+ A (via A ⇒ B ⇒ A).
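The definition of a left-recursive grammar can be turned into a simple check. The sketch below is one possible approach, not the method of the course or the book: it records, for each variable A, which variables can appear leftmost after one production (skipping over symbols that can derive ε), takes the transitive closure, and reports every A that reaches itself. The function names and the three sample grammars are assumptions chosen for illustration.

```python
def nullable_variables(grammar):
    """Variables that derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, alternatives in grammar.items():
            if var not in nullable and any(
                all(symbol in nullable for symbol in rhs) for rhs in alternatives
            ):
                nullable.add(var)
                changed = True
    return nullable

def left_recursive_variables(grammar):
    """Return the variables A with A =>+ A alpha for some string alpha."""
    nullable = nullable_variables(grammar)
    # leftmost[A]: variables that can start a string derived from A in one
    # production step, once any nullable symbols before them are erased
    leftmost = {var: set() for var in grammar}
    for var, alternatives in grammar.items():
        for rhs in alternatives:
            for symbol in rhs:
                if symbol in grammar:
                    leftmost[var].add(symbol)
                if symbol not in nullable:
                    break                      # nothing further right can be leftmost
    changed = True
    while changed:                             # transitive closure
        changed = False
        for var in grammar:
            for mid in list(leftmost[var]):
                new = leftmost[mid] - leftmost[var]
                if new:
                    leftmost[var] |= new
                    changed = True
    return {var for var in grammar if var in leftmost[var]}

IMMEDIATE = {"A": [("A", "b"), ("b",)]}            # A -> A b | b
INDIRECT = {"A": [("B",), ("b",)], "B": [("A",)]}  # A -> B | b,  B -> A
RIGHT = {"A": [("b", "A"), ("b",)]}                # A -> b A | b (not left-recursive)

print(left_recursive_variables(IMMEDIATE))          # {'A'}
print(sorted(left_recursive_variables(INDIRECT)))   # ['A', 'B']
print(left_recursive_variables(RIGHT))              # set()
```

A grammar is left-recursive exactly when the returned set is non-empty. The nullable check matters: with productions such as A → C A and C → ε, the variable A is left-recursive even though A is not the first symbol of any right-hand side.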
For next time

Next time we'll (at least begin to) learn how to systematically eliminate left-recursion. Read 4.1-4.3 if you haven't already.