Quick Grammar Type Recognition: Concepts and Techniques

Amin Milani Fard +, Arash Deldari *, and Hossein Deldari +
+ Department of Computer Engineering, Ferdowsi University, Mashhad, Iran
* Department of Computer Engineering, Sadjad University, Mashhad, Iran
milanifard@stu-mail.um.ac.ir

Abstract. This paper gives an overview of grammar classification in terms of language specification and parsing methods, an important and perennially relevant topic in computer science, compilers, and language processing. It is known that when a conflict arises in constructing the parsing table, the grammar is not acceptable to that parsing method; we are instead interested in quick ways to determine the type of a given grammar. Although many papers and books contain useful information on this subject, none of them covers all aspects of grammar recognition, especially quick methods. We conclude the work with our quick grammar recognizer algorithm for detecting the grammar type.

1 Introduction

In computer science, parsing is the process of analyzing a sequence of tokens in order to determine its grammatical structure with respect to a given formal grammar. The parsing process, formally known as syntax analysis, transforms input text into a data structure, usually a tree, which is suitable for later processing. Generally, parsers operate in two stages: first identifying the meaningful tokens in the input, and then building a parse tree from those tokens. The task of the parser is essentially to determine if and how the input can be derived from the start symbol within the rules of the formal grammar.

2 Language Specifications

The concepts and terminology for describing the syntax of languages are taken from Noam Chomsky's work on linguistic structure [1], [2].
His classification of grammars and the related theory became the basis of further work on formal language theory, the theory of computation, and efficient parsing methods in compiler design [3], [4], [5], [6]. Various restrictions on the productions define the different types of grammars and corresponding languages in the Chomsky hierarchy: Type-0 grammars (unrestricted grammars) include all formal grammars and place no restrictions on the productions. They generate the recursively enumerable languages, i.e. all languages that can be recognized by a Turing machine.
Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages. Their rules are noncontracting (the right-hand side is at least as long as the left-hand side), with one exception: S → ε is allowed if S never occurs on any right-hand side. In normal form these rules have the form αAβ → αγβ, with A a non-terminal and α, β and γ strings of terminals and non-terminals. The strings α and β may be empty, but γ must be non-empty. The rule S → ε may also be included. All these languages can be recognized by a linear-bounded automaton.

Type-2 grammars (context-free grammars) generate the context-free languages. These are defined by rules of the form A → γ, with A a non-terminal and γ a string of terminals and non-terminals. These languages can be recognized by a pushdown automaton. Context-free languages are the theoretical basis for the syntax of most programming languages.

Type-3 grammars (regular grammars) generate the regular languages. Their rules have the form A → a or A → aX, where A, X ∈ N are non-terminals and a is a terminal. Such a grammar restricts its rules to a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal, possibly followed by a single non-terminal. The rule S → ε is also allowed if S does not appear on the right-hand side of any rule. These languages can be decided by a finite-state automaton and can be described by regular expressions. Regular languages are commonly used to define search patterns and the lexical structure of programming languages.

The Chomsky hierarchy, depicted in Fig. 1, indicates that every regular language is context-free, every context-free language is context-sensitive, and every context-sensitive language is recursively enumerable.

Fig. 1. The Chomsky hierarchy (Type-3 ⊂ Type-2 ⊂ Type-1 ⊂ Type-0, with the finite languages innermost)

From a practical point of view, grammars may be used to solve the membership problem: given a string over the alphabet, does it belong to the language L(G) or not? Another problem is the so-called parsing problem: finding a sequence of rewriting steps from the grammar's start symbol to the given sentence.
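The four production shapes above can be checked mechanically. The following sketch is illustrative only: the representation of grammars as (lhs, rhs) tuples and the function name are our own assumptions, not part of any standard API. It reports the most restrictive type whose rule shape every production satisfies, ignoring the S → ε special cases for brevity.

```python
def chomsky_type(productions, nonterminals):
    """Return the most restrictive Chomsky type (3, 2, 1 or 0) whose
    production shape every rule satisfies.  Productions are (lhs, rhs)
    pairs of symbol tuples; epsilon right-hand sides (empty tuples) are
    skipped here for simplicity."""
    def is_nt(s):
        return s in nonterminals

    regular = context_free = context_sensitive = True
    for lhs, rhs in productions:
        if not rhs:                      # ignore S -> epsilon special cases
            continue
        if len(lhs) > len(rhs):          # Type-1 rules must be noncontracting
            context_sensitive = False
        if len(lhs) != 1 or not is_nt(lhs[0]):   # Type-2: single NT on the left
            context_free = regular = False
            continue
        # Type-3 (right-linear): one terminal, optionally followed by one NT
        if not (len(rhs) <= 2 and not is_nt(rhs[0])
                and (len(rhs) == 1 or is_nt(rhs[1]))):
            regular = False
    if regular:
        return 3
    if context_free:
        return 2
    if context_sensitive:
        return 1
    return 0
```

For example, S → aS | b is recognized as Type-3, while S → aSb | c is only Type-2.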
Parsing can be seen as structuring the input according to the given grammar. The algorithm that performs this structuring is called a parser [7].
3 Parsing algorithms

The most commonly known context-free parsing algorithms are top-down and bottom-up. In top-down parsing, the parser begins with the start symbol of the grammar and attempts to generate the same sentence that it is trying to parse. The most commonly known top-down parsing algorithms are the LL family [3]. In bottom-up parsing, the parser matches the input against the right-hand sides of the productions and builds a derivation tree in reverse. Bottom-up parsing traditionally uses one symbol of lookahead to guide the choice of action. The most commonly known bottom-up parsing algorithms are LR, SLR and LALR [3], [4], [8], [9]. These parsing algorithms are commonly limited to subclasses of the context-free grammars [9]. The hierarchy of these subclasses [4] is shown in Fig. 2.

A grammar is said to be LL(k) if a parser can be written for it that decides which production to apply at any stage simply by looking at most at the next k symbols of the input. LL(1) grammars are a simple but important category, for which one symbol of lookahead is adequate to implement a top-down predictive parser [7]. A grammar is said to be LR(k) if a parser can be written for it that makes a single left-to-right pass over the input with a lookahead of at most k symbols. These grammars can be parsed with bottom-up parsers requiring no backtracking. LL(k) and LR(k) parsers do not backtrack, so they operate efficiently. Available parser generator tools commonly support only some subclasses: Yacc [10], SableCC [11], CUP [12] and most other tools support LALR(1), while ANTLR [13], PCCTS [14] and some others support LL(k) parsing.

Fig. 2. Hierarchy of context-free grammar classes (LL(0) ⊆ LL(1) ⊆ LL(k) and LR(0) ⊆ SLR ⊆ LALR(1) ⊆ LR(1) ⊆ LR(k) among the unambiguous grammars)

Whether parsing top-down or bottom-up, an ambiguous grammar cannot be handled by any of the parsers above.
This is due to the ambiguity that arises in constructing the derivation tree. Detecting whether a grammar is ambiguous does not always reduce to an easy rule-based check; however, one simple heuristic is the following:
Ambiguous grammars typically contain productions of the form A → AαA | β, in which both left recursion and right recursion occur simultaneously, either directly or indirectly after some non-terminal replacement.

3.1 Unger parsing

An Unger parser [15] is the simplest known method that can parse any context-free grammar. Because of its exponential time complexity, it is unattractive whenever the grammar can be handled by a more restricted method, but it remains applicable even when the grammar is ambiguous. In the Unger algorithm, for each right-hand side of a production we first generate all possible partitions of the input sentence among its members. Generating partitions is not difficult: if a right-hand side has m members, numbered from 1 to m, and the input has length n, with characters numbered from 1 to n, we enumerate all partitions such that the characters assigned to each member are consecutive and appear in the same order as the members themselves. A partition fails if a terminal symbol in the right-hand side does not match the corresponding part of the partition. Each surviving partition leads to similar split-ups as sub-problems; these sub-problems must all be answered in the affirmative, or the partition is not the right one. For an ambiguous grammar that contains loops, there are infinitely many derivations to be found, so the process must cut off the search in these cases. This can be done by maintaining a list of the partitions currently under investigation: if a new partitioning already appears in the list, we do not investigate it and proceed as if it had been answered negatively. Fortunately, if the grammar does not contain such a loop, the cut-off does no harm either, because that branch of the search is doomed to fail anyway [7].

3.2 Top-down parsing

Although it is possible to program a backtracking top-down parser, the resulting parser will be complex and slow.
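Such an exhaustive, backtracking strategy is essentially the Unger method of Sect. 3.1. A minimal recognizer sketch follows; the dictionary-based grammar encoding and the loop cut-off via a set of "active" goals are our illustrative choices, not the original formulation.

```python
def unger_parse(grammar, start, tokens):
    """Unger-style recognizer: for each alternative of a non-terminal,
    try every partition of the input slice among the right-hand-side
    members.  `grammar` maps each non-terminal to a list of alternatives
    (tuples of symbols); symbols absent from `grammar` are terminals and
    () stands for epsilon.  A set of in-progress goals cuts off loops."""
    active = set()

    def match(sym, lo, hi):
        if sym not in grammar:                  # terminal: match one token
            return hi - lo == 1 and tokens[lo] == sym
        goal = (sym, lo, hi)
        if goal in active:                      # loop cut-off (Sect. 3.1)
            return False
        active.add(goal)
        try:
            return any(split(alt, lo, hi) for alt in grammar[sym])
        finally:
            active.remove(goal)

    def split(alt, lo, hi):
        if not alt:                             # all members consumed
            return lo == hi
        head, rest = alt[0], alt[1:]
        # try every consecutive slice for the first member
        return any(match(head, lo, mid) and split(rest, mid, hi)
                   for mid in range(lo, hi + 1))

    return match(start, 0, len(tokens))
```

Note that the ambiguous grammar E → E+E | a is handled without looping forever, exactly because of the cut-off list.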
Predictive parsers (sometimes called recursive descent parsers) do no backtracking: they can always determine which production to use. Clearly, predictive parsers can be written for grammars in which all production alternatives start with different terminal symbols. A production of the form A → Aα | β | γ is called left recursive. When one of the productions in a grammar is left recursive, a predictive parser may loop forever. To overcome this problem, the left recursive rule can be replaced with the following:

A → βA' | γA'
A' → αA' | ε

Left factoring addresses another problem in top-down parsing. When a non-terminal has two or more productions whose right-hand sides start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing. Given the productions A → αβ1 | αβ2 | … | αβn | γ, which contain the left factor α, the following replacement solves the problem:

A → αA' | γ
A' → β1 | β2 | … | βn
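Both transformations are mechanical. The sketch below is a hypothetical rendering of the two rewrite rules just shown, under our own conventions: alternatives are tuples of symbols, () stands for ε, and the fresh non-terminal takes a primed name. Only immediate left recursion and a single shared prefix symbol are handled.

```python
from collections import defaultdict

def eliminate_left_recursion(a, alts):
    """Replace A -> A alpha | beta ... with
       A -> beta A' ...   and   A' -> alpha A' | epsilon."""
    alphas = [alt[1:] for alt in alts if alt and alt[0] == a]
    betas  = [alt     for alt in alts if not alt or alt[0] != a]
    if not alphas:                           # no immediate left recursion
        return {a: alts}
    a2 = a + "'"
    return {a:  [beta + (a2,) for beta in betas],
            a2: [alpha + (a2,) for alpha in alphas] + [()]}

def left_factor(a, alts):
    """One step of left factoring: pull out a shared first symbol,
       A -> x b1 | x b2 | y   becomes   A -> y | x A',  A' -> b1 | b2."""
    groups = defaultdict(list)
    for alt in alts:
        groups[alt[:1]].append(alt)          # group alternatives by first symbol
    for head, grp in groups.items():
        if head and len(grp) > 1:            # a shared left factor exists
            a2 = a + "'"
            rest = [alt for alt in alts if alt not in grp]
            return {a:  rest + [head + (a2,)],
                    a2: [alt[1:] for alt in grp]}
    return {a: alts}                         # already left-factored
```

For example, A → Aa | b becomes A → bA', A' → aA' | ε, matching the replacement scheme above.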
A CFG is LL(1) if for each collection of productions A → α1 | α2 | … | αn the following holds:
1. A has no left recursion;
2. First(αi) ∩ First(αj) = ∅ for all i ≠ j (no left factoring);
3. if αi ⇒* ε, then
   (a) αj ⇒* ε does not hold for any j ≠ i, and
   (b) First(αj) ∩ Follow(A) = ∅ for all j ≠ i.

A CFG is LL(k) if, whenever there are two leftmost derivations
1. S ⇒*_lm ωAα ⇒_lm ωβα ⇒*_lm ωx
2. S ⇒*_lm ωAα ⇒_lm ωγα ⇒*_lm ωy
such that First_k(x) = First_k(y), it follows that β = γ [20].

3.3 Bottom-up parsing

Bottom-up parsers start with the tokens in the input string rather than with the start symbol of the grammar. A bottom-up parser produces the rightmost derivation in reverse. Shift-reduce parsers are based on two operations: the shift operation reads and stores an input symbol, and the reduce operation matches a group of adjacent stored symbols with the right-hand side of a production and replaces them by the corresponding left-hand side.

3.3.1 Precedence parsing

There is a certain class of grammars, called precedence grammars, for which it is possible to write relatively simple parsers. Here, precedence relationships between adjacent symbols determine the actions of the parser. Details of the technique are given in standard compiler textbooks [3]. At first sight, precedence parsing looks like a good technique: it is simple and implementations can be very efficient. However, it is now rarely used in practice because it is difficult, if not impossible, to transform an average programming-language grammar into precedence form. A CFG is a precedence grammar if the following conditions are met:
1. no two non-terminals occur next to each other;
2. no epsilon (empty) productions occur.

3.3.2 LR parsing

LR parsers are efficient bottom-up parsers that can be constructed for a large class of context-free grammars. An LR(k) grammar is one that generates strings each of which can be parsed in a single deterministic scan from left to right without looking ahead more than k symbols.
These parsers are generally very efficient and good at
error reporting, but unfortunately they are very difficult to write without the help of special parser-generating programs. Even top-down parsers have their problems: left recursion has to be removed and further restrictions have to be imposed to ensure a deterministic and efficient parser. The parsing technique for LR(k) grammars was first described by Knuth [17] and has since been widely used and much developed. A convenient way of implementing an LR(1) parser is via a parsing table [16]. Each entry (indexed by the current input symbol and the state number at the top of the stack) contains a description of the next action the parser should perform. The possible actions are shift, reduce, accept and error. It is known that when a conflict arises in constructing the parsing table, the grammar is not acceptable by that parsing method. For example, a grammar is not LR(1) if its item sets have either a shift-reduce conflict (a state s contains an item [A → α.xβ, t], with x a terminal, together with an item of the form [B → γ., x]) or a reduce-reduce conflict (a state s contains two items of the form [A → α., t] and [B → β., t]). Our concern, however, is whether there exist quick ways to determine the type of a given grammar. Three methods, in order of increasing power, are simple LR (SLR), lookahead LR (LALR), and canonical LR (CLR). The SLR and LALR approaches reduce the size of the parsing table, but they cannot handle all the grammars that can be parsed by the canonical LR method. The SLR(1) parser is based on an LR(0) parsing table, but one-symbol lookahead is added after the table has been built [7]. Intuitively, a grammar is LR(0) if you can take a valid token sequence, chop it in two, and still make sense of the left part. The LR grammar hierarchy is as follows:

LR(0) ⊆ SLR(1) ⊆ LR(1) ⊆ LR(k)

A CFG is LR(0) if it is LL(1) and has no epsilon productions. Almost every LL grammar is LR(0) and thus LALR; the exceptions are grammars with empty rules, some of which may be LL without being LR(0) [18].
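The table-driven driver just described can be sketched as follows. The table encoding (action and goto dictionaries, '$' as end marker, a missing entry meaning error) is a simplification of our own, not a standard format; the tables themselves would be produced by an LR table constructor.

```python
def lr_parse(action, goto, rules, tokens):
    """Driver loop for a table-driven LR parser.  Hypothetical encoding:
    action[(state, lookahead)] is ('shift', s), ('reduce', r) or 'accept';
    a missing entry means error.  rules[r] = (lhs, rhs_length);
    goto[(state, lhs)] gives the state entered after reducing to lhs."""
    stack = [0]                            # stack of state numbers
    toks = list(tokens) + ['$']            # '$' marks end of input
    pos = 0
    while True:
        act = action.get((stack[-1], toks[pos]))
        if act == 'accept':
            return True
        if act is None:                    # error entry
            return False
        kind, arg = act
        if kind == 'shift':
            stack.append(arg)
            pos += 1
        else:                              # reduce by rule `arg`
            lhs, n = rules[arg]
            del stack[len(stack) - n:]     # pop one state per RHS symbol
            stack.append(goto[(stack[-1], lhs)])

# Hand-built tables for the toy grammar S -> a S | b (LR(0) machine):
rules  = {0: ('S', 2), 1: ('S', 1)}
action = {(0, 'a'): ('shift', 2), (0, 'b'): ('shift', 3),
          (2, 'a'): ('shift', 2), (2, 'b'): ('shift', 3),
          (3, '$'): ('reduce', 1), (4, '$'): ('reduce', 0),
          (1, '$'): 'accept'}
goto   = {(0, 'S'): 1, (2, 'S'): 4}
```

With these tables, `lr_parse(action, goto, rules, ['a', 'a', 'b'])` accepts, shifting the a's and b and then reducing twice by S → aS.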
A "null" non-terminal symbol is defined as a non-terminal that derives only the null string (epsilon): if First(A) = {ε} then A is null, otherwise it is not. A "p-reduced" grammar is a reduced grammar in which no non-terminal symbol is null. A CFG is LALR(1) if it is LL(1) and p-reduced [18].

A CFG is SLR(1) if, in every state of its LR(0) machine:
1. for any item [A → α.xβ] with x a terminal, there is no complete item [B → γ.] with x ∈ Follow(B);
2. for any two complete items [A → α.] and [B → β.], Follow(A) ∩ Follow(B) = ∅.

A CFG is SLR(k) if and only if the following two statements are true for all states q of the LR(0) machine for the augmented grammar [19], [21]:
1. whenever q contains a pair of distinct complete items [A1 → ω1.] and [A2 → ω2.], then Follow_k(A1) ∩ Follow_k(A2) = ∅;
2. whenever q contains a pair of items [A → α.aβ] and [B → ω.], where a is a terminal, then First_k(aβ Follow_k(A)) ∩ Follow_k(B) = ∅.
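The nullable and "null" notions used in these tests can both be computed by simple fixed points. The sketch below uses our own grammar encoding (non-terminals map to lists of tuple alternatives, () standing for ε) and assumes a reduced grammar, so every non-terminal derives some terminal string.

```python
def nullable_set(grammar):
    """Non-terminals that can derive the empty string (fixed point)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            if nt not in nullable and any(
                    all(s in nullable for s in alt) for alt in alts):
                nullable.add(nt)
                changed = True
    return nullable

def null_nonterminals(grammar):
    """'Null' symbols in the sense of the LALR test above: non-terminals
    whose every derivation yields epsilon, i.e. First(A) = {eps}.
    Assumes a reduced grammar."""
    nullable = nullable_set(grammar)
    productive = set()          # can derive a non-empty terminal string
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                # an alternative yields something non-empty if some symbol
                # is a terminal or an already non-empty-capable non-terminal
                if nt not in productive and any(
                        (s not in grammar) or (s in productive) for s in alt):
                    productive.add(nt)
                    changed = True
    return {nt for nt in nullable if nt not in productive}
```

A grammar containing a null non-terminal fails the p-reduced requirement and therefore the quick LALR(1) test.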
A CFG is LR(1) if:
1. for any item [A → α.xβ, a] with x a terminal, there is no complete item [B → γ., x] in the same state;
2. for any two complete items [A → γ., a] and [B → β., b] in the same state, a ≠ b.

Let G = (N, Σ, P, S) be a CFG and let G' = (N', Σ, P', S') be its augmented grammar. G is LR(k), k ≥ 0, if the three conditions below imply that αAω = γBx (that is, α = γ, A = B, and x = y) [19], [20]:
1. S' ⇒*_rm αAω ⇒_rm αβω
2. S' ⇒*_rm γBx ⇒_rm αβy
3. First_k(ω) = First_k(y)
A grammar is LR if it is LR(k) for some k.

Main theorem for LR detection

A CFG is in first normal form (1NF), i.e. Chomsky normal form, if and only if all production rules are of the form A → BC, A → a, or S → ε, where A, B and C are non-terminal symbols, a is a terminal symbol, S is the start symbol, and ε is the empty string. Also, neither B nor C may be the start symbol. Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be efficiently transformed into an equivalent one in Chomsky normal form. With the exception of the optional rule S → ε (included when the grammar may generate the empty string), all rules of a grammar in Chomsky normal form are expansive; thus, throughout the derivation of a string, each sentential form is always either the same length as, or one element longer than, the previous one [22].

A CFG is in second normal form (2NF), i.e. Greibach normal form, if all production rules are of the form A → aX or S → ε, where A is a non-terminal symbol, a is a terminal symbol, X is a (possibly empty) sequence of non-terminal symbols not including the start symbol, S is the start symbol, and ε is the null string. Observe that such a grammar has no left recursion [22].

Let G be a grammar in 1NF. Then do the following as often as possible. Pick some non-terminal A:
1. If A is left-recursive, apply full left-recursion elimination.
2. Unfold all occurrences of A in the grammar.
3.
Eliminate the productions for A from the grammar (as it has become unreachable).

A grammar G is said to be in third normal form (3NF) if it is in 2NF and there are no two productions whose right-hand sides start with the same symbol, as in Z → xu | xv. Except for the aforementioned rare termination problem, this normal form can obviously be obtained by applying left factoring wherever possible. However, we can improve the efficiency by delaying the left factorings as long as possible; this may be called "lazy left factoring" [23].

Main Theorem: Pepper [23] proved that if G is a grammar and G' is its transformed 3NF version, then the original grammar G is LR(k) if and only if the transformed grammar G' is LL(k).

4 Proposed mechanism

Given the detection methods presented in the previous section, a procedural approach is needed to determine the type of a given grammar. To this end, we propose the recognition steps shown in Fig. 3, which find the type of a context-free grammar in the most efficient order. In this approach, if a sign of ambiguity is detected, parsing is possible only with the backtracking Unger method; otherwise the detection framework continues with the LL tests. Each TryX(n) function returns false if the grammar cannot be parsed with the corresponding parsing method and true if it can. Thus, whenever a parsing method rejects the grammar, a more powerful one is evaluated: when LL rejects, the LR evaluation starts, and when LR rejects, backtracking is used. Fig. 4 shows a simple method to detect whether a grammar is ambiguous by checking for simultaneous left and right recursion; the other detection algorithms follow the criteria discussed earlier.

if (!IsAmbiguous())
    if (!TryLL(0))
        if (!TryLL(1))
            if (!TryLL(k))
                if (!TryLR(0))
                    if (!TrySLR(1))
                        if (!TryLALR(1))
                            if (!TryLR(1))
                                if (!TryLR(k))
                                    TryBackTrack();
else
    TryBackTrack();

Fig. 3. The proposed quick grammar recognizer algorithm

if (CanReplaceNonTerminals())
    if (IsLeftRecursive() && IsRightRecursive())
        return true;
return false;

Fig. 4. The IsAmbiguous algorithm
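The cascade of Fig. 3 can be expressed compactly by passing the Try* predicates as a list, making the order of increasing power explicit. This is a sketch only: the stub predicates in the usage below are placeholders for the real tests of Sect. 3, not implementations of them.

```python
def classify(grammar, tests, is_ambiguous):
    """Sketch of the quick recognizer of Fig. 3: run grammar-class tests
    in order of increasing power and report the first class that accepts;
    fall back to backtracking otherwise.  `tests` is a list of
    (name, predicate) pairs standing in for TryLL/TryLR and so on."""
    if is_ambiguous(grammar):
        return 'backtracking (Unger)'   # ambiguous: only exhaustive parsing
    for name, accepts in tests:
        if accepts(grammar):
            return name                 # weakest class that handles it
    return 'backtracking (Unger)'

# Hypothetical usage with stub predicates standing in for the Try* tests:
tests = [('LL(1)', lambda g: False),    # TryLL(1) rejects
         ('LR(0)', lambda g: False),    # TryLR(0) rejects
         ('SLR(1)', lambda g: True)]    # TrySLR(1) accepts
print(classify('G', tests, lambda g: False))   # -> SLR(1)
```

Ordering the tests from weakest to strongest guarantees the reported class is the cheapest parser able to handle the grammar.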
if (IsLeftRecursive() || HasLeftFactoring())
    return false;
if (TwoProductsReachEpsilon())
    return false;
if (First_FollowIntersect())
    return false;
return true;

Fig. 5. TryLL(1) algorithm

if (HasTwoLMD() && EqualFirstK())
    if (!EqualLM())
        return false;
return true;

Fig. 6. TryLL(k) algorithm

if (HasEpsilonProduct())
    return false;
if (IsLeftRecursive() || HasLeftFactoring())
    return false;
if (TwoProductsReachEpsilon())
    return false;
if (First_FollowIntersect())
    return false;
return true;

Fig. 7. TryLR(0) algorithm

if (ExistSameProduct())
    if (FollowSetsIntersect())
        return false;
return true;

Fig. 8. TrySLR(1) algorithm
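The First/Follow machinery behind TryLL(1) (Fig. 5) can be made concrete as below. This is a partial sketch under our own encoding: it computes First sets by fixed point and flags only the overlapping-First condition (which covers left recursion and left factors); the Follow-set check for nullable alternatives is omitted for brevity.

```python
from itertools import combinations

EPS = ''   # epsilon marker

def first_of(seq, first, grammar):
    """First set of a symbol sequence, given First sets of non-terminals."""
    out = set()
    for sym in seq:
        if sym not in grammar:          # terminal: it starts everything
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:       # sym cannot vanish, stop here
            return out
    out.add(EPS)                        # the whole sequence can vanish
    return out

def first_sets(grammar):
    """Fixed-point computation of First for every non-terminal."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                f = first_of(alt, first, grammar)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

def ll1_first_conflict(grammar):
    """Core of the TryLL(1) test: does some non-terminal have two
    alternatives with overlapping First sets?"""
    first = first_sets(grammar)
    for nt, alts in grammar.items():
        for a, b in combinations(alts, 2):
            if first_of(a, first, grammar) & first_of(b, first, grammar):
                return True
    return False
```

The left-factored grammar A → ab | ac is flagged immediately, since both alternatives have First = {a}.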
if (ExistNullNonTerminal())
    return false;
if (IsLeftRecursive() || HasLeftFactoring())
    return false;
if (TwoProductsReachEpsilon())
    return false;
if (First_FollowIntersect())
    return false;
return true;

Fig. 9. TryLALR(1) algorithm

while (ExistsUnchecked_LR1Item()) {
    if (HasTheSameLookahead())
        return false;
}
return true;

Fig. 10. TryLR(1) algorithm

if (HasTwoRMD() && EqualFirstK())
    if (!EqualRM())
        return false;
return true;

Fig. 11. TryLR(k) algorithm

5 Conclusion and future work

In this paper we investigated grammar classification techniques, in terms of both language specification and parsing methods, within a systematic framework. The work concerns an important and perennially relevant topic in computer science, compilers and language processing: grammar specification and parsing. It is known that when a conflict arises in constructing the parsing table, the grammar is not acceptable by that parsing method; we instead built a framework to quickly determine the type of a given grammar, and finalized the work with our quick grammar recognizer algorithm for detecting the grammar type. Our future work will pursue a mathematical approach to formalize grammars, apply an interpolation curve-fitting method, and compare the results with the proposed approach.
References
1. Chomsky, N., "Three Models for the Description of Language", IRE Transactions on Information Theory, 2(3), pp. 113-124, 1956.
2. Chomsky, N., "On Certain Formal Properties of Grammars", Information and Control, 2(2), pp. 137-167, 1959.
3. Aho, A. V., Sethi, R., Ullman, J. D., Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1986.
4. Appel, A. W., Modern Compiler Implementation in Java, Cambridge University Press, 1998.
5. Parsons, T. W., Introduction to Compiler Construction, Computer Science Press, New York, 1992.
6. Slonneger, K., Kurtz, B. L., Formal Syntax and Semantics of Programming Languages: A Laboratory Based Approach, Addison-Wesley, 1995. Available at: http://www.cs.uiowa.edu/~slonnegr/plf/book/
7. Jokipii, A., "Grammar-based Data Extraction Language (GDEL)", Master of Science Thesis in Information Technology, University of Jyväskylä, Department of Mathematical Information Technology, 10 October 2003.
8. Aho, A. V., Ullman, J. D., The Theory of Parsing, Translation, and Compiling, Volume 1: Parsing, Prentice-Hall, 1972.
9. Grune, D., Jacobs, C. J. H., Parsing Techniques: A Practical Guide, Ellis Horwood, 1990.
10. Johnson, S. C., "YACC - Yet Another Compiler-Compiler", Technical Report Computer Science 32, Bell Laboratories, Murray Hill, New Jersey, 1975. Available at: http://epaperpress.com/lexandyacc/download/yacc.pdf
11. Gagnon, E., "SableCC, an Object-Oriented Compiler Framework", PhD thesis, School of Computer Science, McGill University, Montreal, March 1998. Available at: http://www.sablecc.org/thesis.pdf
12. Hudson, S. E., CUP Parser Generator for Java, 1997. Available at: http://www.cs.princeton.edu/~appel/modern/java/cup/
13. Parr, T. J., Quong, R. W., "ANTLR: A Predicated-LL(k) Parser Generator", Software Practice and Experience, 25(7), pp. 789-810, July 1995. Available at: http://www.antlr.org/papers/antlr.ps
14. Parr, T. J., Language Translation Using PCCTS & C++, Automata Publishing Company, 1997. ISBN 0962748854.
15. Unger, S. H., "A Global Parser for Context-free Phrase Structure Grammars", Communications of the ACM, 11(4), pp. 240-247, April 1968.
16. Watson, D., High-Level Languages and their Compilers, International Computer Science Series, Addison-Wesley, Wokingham, England, 1989.
17. Knuth, D. E., "On the Translation of Languages from Left to Right", Information and Control, 8(6), pp. 607-639, 1965.
18. Beatty, J. C., "On the Relationship Between the LL(1) and LR(1) Grammars", Journal of the ACM, 29, 1982.
19. Žemlička, M., "Principles of Kind Parsing - An Introduction", Technical Report KSI MFF UK No. 2002/1, MFF UK, Praha, December 2002.
20. Aho, A. V., Ullman, J. D., The Theory of Parsing, Translation, and Compiling, Vol. I: Parsing, Prentice-Hall, 1972. ISBN 0-13-914556-7.
21. Sippu, S., Soisalon-Soininen, E., Parsing Theory, Volume II: LR(k) and LL(k) Parsing, EATCS Monographs 20, Springer-Verlag. ISBN 3-540-51732-4.
22. Martin, J., Introduction to Languages and the Theory of Computation, McGraw-Hill, 2003. ISBN 0-07-232200-4. Section 6.6 (pp. 237-240): simplified forms and normal forms.
23. Pepper, P., "LR Parsing = Grammar Transformation + LL Parsing - Making LR Parsing More Understandable and More Efficient", Technical Report No. 99-5, April 1999. http://citeseer.ist.psu.edu/pepper99lr.html