Lecture 6: Context-Free Grammars
September 27, 2016
CS 1010 Theory of Computation

So far, we've classified languages as regular or not regular. Regular languages are recognized by DFAs, NFAs, and regexes, and are closed under the regular operations. Every regular language also satisfies the pumping lemma. However, as we have seen by deriving contradictions with the pumping lemma, some fairly simple languages are not regular.

Topics Covered
1. Context-Free Grammars
2. Context-Free Languages
3. Ambiguity
4. Chomsky Normal Form

1 Context-Free Grammars

Consider the regex $0^*(10 \cup 11)^*1^*$. Let $A = 0^*$, $B = (10 \cup 11)^*$, and $C = 1^*$. We can express this regex as the concatenation $ABC$. We can also write down rules for generating elements of the language of the regular expression:

$A \to \varepsilon \mid 0A$
$B \to \varepsilon \mid 10B \mid 11B$
$C \to \varepsilon \mid 1C$

For example, we can follow the chain of steps:

$\text{start} \Rightarrow ABC \Rightarrow 0ABC \Rightarrow 00ABC \Rightarrow 00BC \Rightarrow 0011BC \Rightarrow 001110BC \Rightarrow 001110C \Rightarrow 0011101C \Rightarrow 00111011C \Rightarrow 001110111C \Rightarrow 001110111$.

This binary string, 001110111, is in the language of $0^*(10 \cup 11)^*1^*$.

A context-free grammar (CFG) is a 4-tuple $(V, \Sigma, R, S)$ such that
1. $V$ is a finite set of variables
2. $\Sigma$ is a finite set of terminals (a.k.a. the alphabet)
3. $R$ is a finite set of rules, where each $r \in R$ is of the form $A \to w$ with $A \in V$ and $w \in (V \cup \Sigma)^*$
4. $S \in V$ is the designated start variable

If $G = (V, \Sigma, R, S)$ is a CFG; $u$, $v$, $w$ are strings in $(V \cup \Sigma)^*$; $A$ is in $V$; and $A \to w$ is in $R$, then we say that $vAu$ yields $vwu$, written $vAu \Rightarrow vwu$. We also say that $u$ derives $v$, written $u \Rightarrow^* v$, if $u = v$ or if there exist $u_1, \ldots, u_k$ with $k \geq 0$ such that $u \Rightarrow u_1 \Rightarrow \cdots \Rightarrow u_k \Rightarrow v$. A leftmost derivation is a derivation in which, at every step, a rule is applied to the leftmost variable.

2 Context-Free Languages

For a CFG $G = (V, \Sigma, R, S)$, the language of $G$ is $L(G) = \{w \mid w \in \Sigma^* \text{ and } S \Rightarrow^* w\}$. A language $L$ is context-free if $L = L(G)$ for some CFG $G$.

Theorem If $L$ is regular, then it is context-free.

Theorem Not every context-free language is regular.

As an example of the second theorem, recall our favorite nonregular language, $\{0^n 1^n \mid n \geq 0\}$. We can define a CFG for this language with a single variable: $S \to \varepsilon \mid 0S1$. Thus we can see that not every context-free language is regular.

Language Structure

We can use CFGs to capture the structure of spoken language. Consider the following set of rules, whose variables stand in for various parts of speech:

<sentence> $\to$ <noun-phrase> <verb-phrase>
<noun-phrase> $\to$ <complex-noun> $\mid$ <complex-noun> <prep-phrase>
<verb-phrase> $\to$ <complex-verb> $\mid$ <complex-verb> <prep-phrase>
<prep-phrase> $\to$ <prep> <complex-noun>
<prep> $\to$ with
<complex-noun> $\to$ <article> <noun>
<article> $\to$ a $\mid$ the
<noun> $\to$ boy $\mid$ girl $\mid$ flower $\mid$ heart
<complex-verb> $\to$ <verb> $\mid$ <verb> <noun-phrase>
<verb> $\to$ touches $\mid$ sees $\mid$ likes
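The yields relation and leftmost derivations defined above can be made concrete with a short Python sketch. The encoding is an assumption made for illustration only: a grammar is a dictionary from each variable to its list of right-hand sides, and the names `Rules`, `ZERO_N_ONE_N`, and `leftmost_derivation` are hypothetical, not part of the lecture.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: each variable maps to a list of right-hand sides,
# and each right-hand side is a tuple of symbols (the empty tuple is epsilon).
Rules = Dict[str, List[Tuple[str, ...]]]

ZERO_N_ONE_N: Rules = {"S": [(), ("0", "S", "1")]}  # S -> epsilon | 0S1


def leftmost_derivation(
    rules: Rules, start: str, choices: List[int]
) -> List[Tuple[str, ...]]:
    """Return the sentential forms of a leftmost derivation.

    choices[i] is the index of the rule applied to the leftmost variable
    at step i. Any symbol that is a key of `rules` is treated as a variable.
    """
    form: Tuple[str, ...] = (start,)
    forms = [form]
    for choice in choices:
        # Locate the leftmost variable in the current sentential form.
        pos = next((i for i, sym in enumerate(form) if sym in rules), None)
        if pos is None:
            raise ValueError("no variable left to rewrite")
        rhs = rules[form[pos]][choice]
        # One "yields" step: vAu => vwu.
        form = form[:pos] + rhs + form[pos + 1:]
        forms.append(form)
    return forms


# Apply S -> 0S1 twice, then S -> epsilon.
steps = leftmost_derivation(ZERO_N_ONE_N, "S", [1, 1, 0])
print(" => ".join("".join(f) or "ε" for f in steps))
```

Running the sketch prints `S => 0S1 => 00S11 => 0011`, a leftmost derivation of a string in $\{0^n 1^n \mid n \geq 0\}$.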
We can use these rules to create a sentence, structuring the derivation as a parse tree:

<sentence>
    <noun-phrase>
        <complex-noun>
            <article>
                a
            <noun>
                girl
    <verb-phrase>
        <complex-verb>
            <verb>
                sees

Alternatively, we can write out the derivation that leads to a sentence; for example:

<sentence> $\Rightarrow$ <noun-phrase> <verb-phrase>
$\Rightarrow$ <complex-noun> <verb-phrase>
$\Rightarrow^*$ a girl <verb-phrase>
$\Rightarrow$ a girl <complex-verb> <prep-phrase>
$\Rightarrow^*$ a girl sees <prep-phrase>
$\Rightarrow^*$ a girl sees with the heart

3 Ambiguity

Is it possible to create the same sentence in multiple ways? Consider the sentence "a girl sees a boy with a flower". There are at least two ways to parse this sentence:
1. We might have the noun-phrase "a girl", the verb "sees", and the noun-phrase "a boy with a flower".
2. We could also have the noun-phrase "a girl", the complex-verb "sees a boy", and the prepositional-phrase "with a flower".

These two readings correspond to two different parse trees for the same sentence. This leads to the definition of an ambiguous CFG.

A CFG $G$ is ambiguous if there exists some $w$ in $L(G)$ such that $w$ has at least two different parse trees. Equivalently, $w$ has more than one leftmost derivation.

If $G$ is ambiguous, there are two possibilities:
1. There exists $G'$, not ambiguous, such that $L(G) = L(G')$.
2. $L(G)$ is inherently ambiguous, so for all $G'$ such that $L(G) = L(G')$, $G'$ is ambiguous.
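The ambiguity of the example sentence can be checked mechanically. The following Python sketch counts the parse trees of a sentence under the part-of-speech grammar from the Language Structure example. The dictionary encoding and the names `GRAMMAR` and `count_parse_trees` are assumptions made for illustration, and the counting strategy relies on this particular grammar having no $\varepsilon$-rules and no cycles of single-variable rules.

```python
from functools import lru_cache

# The part-of-speech grammar, encoded as variable -> list of right-hand sides.
GRAMMAR = {
    "sentence": [("noun-phrase", "verb-phrase")],
    "noun-phrase": [("complex-noun",), ("complex-noun", "prep-phrase")],
    "verb-phrase": [("complex-verb",), ("complex-verb", "prep-phrase")],
    "prep-phrase": [("prep", "complex-noun")],
    "prep": [("with",)],
    "complex-noun": [("article", "noun")],
    "article": [("a",), ("the",)],
    "noun": [("boy",), ("girl",), ("flower",), ("heart",)],
    "complex-verb": [("verb",), ("verb", "noun-phrase")],
    "verb": [("touches",), ("sees",), ("likes",)],
}


def count_parse_trees(sentence, start="sentence"):
    """Count the parse trees of a space-separated sentence under GRAMMAR."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of parse trees rooted at `symbol` that yield words[i:j].
        if symbol not in GRAMMAR:  # terminal: must match exactly one word
            return 1 if j - i == 1 and words[i] == symbol else 0
        return sum(count_sequence(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(symbols, i, j):
        # Number of ways the symbol sequence can yield words[i:j].
        if len(symbols) == 1:
            return count_symbol(symbols[0], i, j)
        # With no epsilon-rules, every symbol yields at least one word,
        # so the first symbol takes words[i:k] for some valid split point k.
        total = 0
        for k in range(i + 1, j - len(symbols) + 2):
            total += count_symbol(symbols[0], i, k) * count_sequence(symbols[1:], k, j)
        return total

    return count_symbol(start, 0, len(words))


print(count_parse_trees("a girl sees a boy with a flower"))
```

Under these assumptions the sketch prints `2`, matching the two readings listed above, while `count_parse_trees("a girl sees")` returns `1`.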
4 Chomsky Normal Form

A CFG $G = (V, \Sigma, R, S)$ is in Chomsky Normal Form (CNF) if every rule $r \in R$ is in one of the following forms:
1. $S \to \varepsilon$
2. $A \to BC$, where $A$, $B$, and $C$ are all variables and $B, C \neq S$
3. $A \to a$, where $A$ is a variable and $a$ is a terminal

Intuitively, we don't want $S$ to ever appear on the righthand side of a rule. Moreover, in a parse tree corresponding to a CFG in Chomsky Normal Form, each internal node has either exactly two children, both variables, or a single child that is a terminal leaf. This will make it possible to design polynomial-time algorithms to check for membership in a CFL. We will cover this in more detail in future lectures.

Theorem Let $G$ be a CFG. Then there exists a CFG $G'$ in CNF such that $L(G) = L(G')$.

Proof This is not a true proof, as we only outline a process for converting a CFG into CNF without justification of correctness. Suppose that $A \to w$, where $w$ is neither a terminal nor of the form $BC$. In other words, $A \to w$ is a rule that is not in Chomsky Normal Form. There are two cases to consider:

1. First, $A \to B_1 B_2 \ldots B_k$ for variables $B_i \in V$ with $k \geq 3$. To fix this, we successively add new variables $C_1, \ldots, C_{k-2}$ and replace the original rule with the rules $A \to B_1 C_1$, $C_1 \to B_2 C_2$, $\ldots$, $C_{k-2} \to B_{k-1} B_k$. These rules are all of the form $A \to BC$, which is acceptable for CNF.
2. Second, we might have the case that $A \to w_1 w_2 \ldots w_k$ with $k \geq 2$, where some $w_i \in \Sigma$ is a terminal. To fix this, we create a new variable $B$ and add the rule $B \to w_i$. We replace the original rule with $A \to w_1 \ldots w_{i-1} B w_{i+1} \ldots w_k$. We continue this process until no terminals remain on the righthand side, and then fix the resulting instance of the first case, if needed.

There are three more cases to consider, where the righthand side is not necessarily too long but is incorrect in a different way.

3. If $S$ appears on the righthand side of a rule, as in $A \to S$ or $A \to SB$, then we create a new start variable $S_0$ and add $S_0 \to S$ to the grammar.
4. A rule such as $A \to \varepsilon$, where $A$ is not the start variable, is incorrect because $\varepsilon$ is not a terminal. To fix this, we remove the rule and find every rule of the form $B \to AC$, $B \to CA$, and $B \to AA$. For either of the first two forms, add another rule $B \to C$. If $B \to AA$, add $B \to A$ and $B \to \varepsilon$. If a rule has previously been removed, do not add it again.
5. The final case is when a rule is of the form $A \to B$. To fix this, whenever we have the rule $C \to AD$, we add $C \to BD$. If $C \to A$, we add $C \to B$, unless previously removed. (The standard textbook formulation instead removes $A \to B$ and, for every rule $B \to u$, adds $A \to u$ unless that rule was previously removed; this also covers the case where $A$ is the start variable.)
Notice that many of these fixes create new situations or improper rules that will be successively dealt with as we iterate through the process. It is important to consider why, despite this, the algorithm is not circular. In particular, whenever we add a rule, we first check that it has not been previously removed.
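As an illustration of the first case in the conversion procedure, the following Python sketch splits a long rule $A \to B_1 B_2 \ldots B_k$ into binary rules. It covers only that one case, not the full conversion. The function name `binarize`, the `Rule` encoding, and the `prefix` parameter for naming fresh variables are hypothetical, and the sketch assumes the generated names $C_1, \ldots, C_{k-2}$ do not already occur in the grammar.

```python
from typing import List, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand variable, right-hand side)


def binarize(lhs: str, rhs: Tuple[str, ...], prefix: str = "C") -> List[Rule]:
    """Split A -> B1 B2 ... Bk (k >= 3) into k - 1 rules of the form A -> BC.

    Rules whose right-hand side has at most two symbols are returned unchanged.
    """
    k = len(rhs)
    if k <= 2:
        return [(lhs, rhs)]
    # Fresh variables C1, ..., C_{k-2}; assumed not to clash with existing names.
    fresh = [f"{prefix}{i}" for i in range(1, k - 1)]
    heads = [lhs] + fresh
    rules: List[Rule] = []
    # A -> B1 C1, C1 -> B2 C2, ..., C_{k-3} -> B_{k-2} C_{k-2}
    for i in range(k - 2):
        rules.append((heads[i], (rhs[i], fresh[i])))
    # Final rule: C_{k-2} -> B_{k-1} B_k
    rules.append((heads[k - 2], (rhs[k - 2], rhs[k - 1])))
    return rules


for head, body in binarize("A", ("B1", "B2", "B3", "B4")):
    print(head, "->", " ".join(body))
```

For $k = 4$ this prints the three rules `A -> B1 C1`, `C1 -> B2 C2`, and `C2 -> B3 B4`. In a full conversion each call would need its own fresh variable names, for example by passing a different `prefix` per rule.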