CS502: Compilers & Programming Systems


Context Free Grammars
Zhiyuan Li
Department of Computer Science, Purdue University, USA

Course Outline

Languages which can be represented by regular expressions are called regular languages. Most language constructs are more complex than regular languages. Example: it is impossible to use a DFA to recognize all sequences of balanced (possibly nested) parentheses. The pumping lemma is often used to prove that a certain language is too complex to be regular. Context-free grammars (CFGs) are commonly used to define a wider class of languages because they are powerful enough to specify common syntax rules.
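To make the parenthesis example concrete: the balanced-parenthesis language is generated by the CFG S -> ( S ) S | epsilon. The Python sketch below is a minimal recursive recognizer for that grammar (the code and names are illustrative, not from the lecture):

```python
# Grammar: S -> ( S ) S | epsilon
# Recognizes balanced, possibly nested parentheses -- a language no DFA
# can recognize, since a DFA cannot count unbounded nesting depth.

def parse_s(s, i=0):
    """Try to match S starting at position i; return the next position."""
    if i < len(s) and s[i] == '(':
        i = parse_s(s, i + 1)        # S -> ( S ...
        if i >= len(s) or s[i] != ')':
            raise SyntaxError("expected ')'")
        return parse_s(s, i + 1)     # ... ) S
    return i                         # S -> epsilon

def is_balanced(s):
    try:
        return parse_s(s) == len(s)
    except SyntaxError:
        return False

assert is_balanced("(()())") and not is_balanced("(()")
```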

What is the grammar used for?

The grammar defines the correct forms of program constructs. Program semantics will be defined in terms of program constructs. Given a context-free grammar, the compiler writer tries to construct a parser to recognize syntax constructs. The parser checks whether the program conforms to the grammar, i.e., whether it has the correct syntax structure. For an arbitrary context-free grammar, we may or may not be able to automatically build a parser that recognizes all programs conforming to the grammar. Recall that, given an arbitrary regular expression R, any of the three methods we studied (RE to NFA, NFA to DFA, direct RE to DFA) can be used to automatically build a lexical analyzer that recognizes all strings defined by R, without backtracking. By contrast, we may need to rewrite the grammar (manually) into a form for which we know how to build a parser.
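A standard instance of such rewriting (a textbook transformation, not spelled out on this slide) is eliminating left recursion, which defeats top-down parsers:

```
Left-recursive:              Right-recursive equivalent:
  E -> E + T | T               E  -> T E'
                               E' -> + T E' | epsilon
```

Both grammars generate the same language, but only the right-hand form can be parsed by simple recursive descent.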

Impact of the parser on semantics processing

Not only must the parser recognize the correct syntactic forms, it must also be suitable for triggering the correct semantic actions that build the correct abstract syntax tree (AST). This is vital to the generation of the correct final code. Hence, we must study how to properly design the grammar for a programming language we want to implement.
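As a concrete (hypothetical) illustration of a semantic action: when a parser recognizes a production such as E -> E + T, it can immediately build the corresponding AST node:

```python
# Hypothetical AST-building action for the production  E -> E + T.
# In a real parser this runs at the moment the production is matched.

def make_add_node(left, right):
    """Semantic action: build an AST node for 'left + right'."""
    return ('+', left, right)   # tuple-encoded AST node

# If the parser has already built subtrees for "a" and "b * c",
# recognizing E -> E + T triggers:
ast = make_add_node('a', ('*', 'b', 'c'))
print(ast)   # ('+', 'a', ('*', 'b', 'c'))
```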

Basic Concepts

A language L is a set of strings formed by symbols from an alphabet. In the parsing phase, those symbols are tokens, so a program is viewed as a sequence of tokens. L is also often said to be a set of sentences; for programming languages, each sentence is a program(!) We use an example to explain the following terminology:
- Production rules and grammar symbols
- The start symbol
- A derivation step and a derivation sequence
- A terminal: a grammar symbol which derives nothing but itself. (The set of terminals forms the vocabulary of L.)
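A small example grammar in the spirit of the slide (the specific grammar is illustrative, not necessarily the instructor's):

```
Grammar G, start symbol E; nonterminals E, T, F; terminals id, +, *, (, ):
  E -> E + T | T
  T -> T * F | F
  F -> ( E ) | id
```

Each line lists the production rules for one nonterminal, and a sentence such as id + id * id is in L(G).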

Beginning with the start symbol, every time we replace a nonterminal by the right-hand side of one of its production rules, we have performed a derivation step and derived a new sentential form. A sentence is a special case of a sentential form, namely one that consists of terminals only. We distinguish left-most (lm) and right-most (rm) derivations. Given a program, the parser in a modern compiler essentially performs lm (or rm) derivations. In each derivation step, a new node (and some new edges) may get inserted in the AST, or some new type information may get extracted and placed in the symbol table. If a sequence of tokens can be derived from the start symbol, then it is accepted by the CFG.
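For example, a left-most derivation of id + id in the expression grammar above (each step replaces the left-most nonterminal):

```
E => E + T => T + T => F + T => id + T => id + F => id + id
```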

A Parse Tree

A parse tree corresponds to a set of derivation sequences for a given input. Given a parse tree, there exist a unique left-most derivation sequence and a unique right-most derivation sequence. The parser can be viewed as incrementally (and implicitly) constructing a parse tree. A CFG is called ambiguous if and only if there exists a sentence for which there is more than one parse tree. A CFG which contains a cycle (a nonterminal that can derive itself, e.g. A =>+ A) is definitely ambiguous. Why? And why are ambiguous grammars bad? Program semantics is defined in terms of program constructs, so ambiguity in the CFG often causes ambiguity in the definition of program constructs and operation orders.
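A classic example of ambiguity (illustrative, not from the slides): the grammar E -> E + E | E * E | id is ambiguous, because id + id * id has two distinct parse trees, each with its own left-most derivation:

```
  E => E * E => E + E * E => id + E * E => id + id * E => id + id * id
       (parses as (id + id) * id)

  E => E + E => id + E => id + E * E => id + id * E => id + id * id
       (parses as id + (id * id))
```

The two trees imply different evaluation orders, which is exactly why ambiguity threatens the definition of program semantics.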

When an ambiguous grammar is retained and disambiguated by a set of preference rules (such as operator precedence and associativity), a key issue is whether the correct AST can be built by following that set of preference rules. Sometimes, additional rules are introduced (in English descriptions, for example) in order to define such constructs or orders unambiguously, e.g., how to handle the dangling-else case.
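The standard dangling-else example: in a statement such as

```
if (a) if (b) s1; else s2;
```

the grammar allows the else to attach to either if. The usual additional rule states that an else matches the nearest unmatched if, so the statement is parsed as if (a) { if (b) s1; else s2; }.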

Some common forms of production rules

- Use left recursion or right recursion to define a list of constructs. Example: a list of statements.
- Use a mirrored recursion to define nested pairs. Example: balanced and nested pairs of parentheses.
- Binary expressions.
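Sketches of each form (illustrative notation, not from the slides):

```
List of statements (left-recursive or right-recursive):
  L -> L ; S | S        or        L -> S ; L | S

Nested pairs (mirrored recursion):
  P -> ( P ) | epsilon

Binary expression:
  E -> E + T | T
```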

Parsers

There are two fundamental approaches to parsing: top-down vs. bottom-up. With the top-down approach, the parser performs left-most derivations, beginning with the start nonterminal. With the bottom-up approach, the parser traces right-most derivations backward, beginning with the given sentence (i.e., the sequence of tokens).
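A minimal top-down parser sketch (assuming the right-recursive toy grammar E -> T + E | T, T -> id; all names are illustrative): each function call expands one nonterminal, so the call sequence traces a left-most derivation.

```python
# Recursive-descent (top-down) parser for the toy grammar
#   E -> T + E | T      T -> id
# Right recursion is used deliberately: a left-recursive rule
# (E -> E + T) would make parse_e call itself forever.

def parse_e(tokens, i=0):
    """Match E at tokens[i]; return (AST, next position)."""
    left, i = parse_t(tokens, i)
    if i < len(tokens) and tokens[i] == '+':   # E -> T + E
        right, i = parse_e(tokens, i + 1)
        return ('+', left, right), i
    return left, i                             # E -> T

def parse_t(tokens, i):
    """Match T, which here is a single 'id' token."""
    if i < len(tokens) and tokens[i] == 'id':
        return 'id', i + 1
    raise SyntaxError(f"expected 'id' at position {i}")

tokens = ['id', '+', 'id', '+', 'id']
ast, end = parse_e(tokens)
assert end == len(tokens)
print(ast)   # ('+', 'id', ('+', 'id', 'id'))
```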