CITS2211 Discrete Structures
Lectures for Semester 2 2017

Non-Regular Languages

October 30, 2017

Highlights
- We have seen that FSMs are surprisingly powerful
- But we also saw some languages that FSMs cannot recognise
- Now we will learn a useful theorem for testing whether a language is regular or not: the pumping lemma for regular languages
- We will study a new type of automaton: pushdown automata
- And a new class of languages: context-free languages

Reading: Introduction to the Theory of Computation by Michael Sipser
- Chapter 1: Regular Languages, Section 1.4, The pumping lemma for regular languages
- Chapter 2: Context-free Languages, Section 2.2, Pushdown automata

Lecture Outline
1. Non-regular languages
2. The pumping lemma for regular languages
3. Push-down automata
4. Context-free languages

Non-regular languages

Q: Are all languages regular?
A: No. We can use a diagonalization argument to show that there are non-regular languages. Recall that we used a diagonalization argument to show that there are uncountable sets. Now we will show that there must be some non-regular languages over the alphabet {0, 1}. Firstly, notice that the set of regular expressions is countable: we can arrange them in order of length, and lexicographically within each length. Therefore we can form a complete list R1, R2, R3, ... of regular expressions. Let L1, L2, ... be the corresponding list of regular languages described by these regular expressions.

A diagonalization argument

Now we will form a new language L and show that it is not in the list above. Recall that the earlier diagonalization arguments formed a new object that was definitely different from every object on the supposedly complete list. We will copy this strategy as follows. Consider the strings 1, 11, 111, 1111, ... and form the new language L as follows:
- If 1 ∉ L1, then add it to L (else do nothing)
- If 11 ∉ L2, then add it to L
- If 111 ∉ L3, then add it to L
- If 1111 ∉ L4, then add it to L
and so on...

A diagonalization argument (cont.) Then clearly L is different from L1, L2, ... (it differs from each Lk on the string 11...1 of length k), and thus L is not regular. This is an example of a non-constructive proof. We have shown the existence of a non-regular language, but have not actually specified one.
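The diagonal construction can be made concrete for any finite prefix of the list. The Python sketch below (not part of the original notes; the list of regular expressions is a hypothetical sample, since the real list is infinite) adds the string 1^k to L exactly when 1^k is not matched by the k-th expression:

```python
import re

# Hypothetical finite prefix of the complete list R1, R2, ... of regular
# expressions; the real proof uses an infinite complete enumeration.
regexes = [r"1*", r"0*", r"(11)*", r"1", r"0(0|1)*"]

def diagonal_language(regexes):
    """Add the string 1^k to L exactly when 1^k is NOT in L_k."""
    L = set()
    for k, rx in enumerate(regexes, start=1):
        w = "1" * k
        if not re.fullmatch(rx, w):
            L.add(w)
    return L

L = diagonal_language(regexes)
# By construction L differs from each L_k on the string 1^k,
# so L is none of the languages in the (sampled) list.
```
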

FSMs and regular languages

Consider the following recognizer.

[State diagram: two states, s0 (start) and s1 (accepting); from either state, input 1 leads to s1 and input 0 leads back to s0.]

What language does this machine recognize?

We can start by looking at some input strings to see if there is a pattern...
Accepted: 1, 01, 001
Rejected: 0, 00, 10
We hypothesise that this FSM accepts exactly the strings ending in 1, and a moment's thought shows that this is indeed correct. Therefore the language accepted by this FSM is L(M) = (0 + 1)*1
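A two-state recognizer like this is easy to simulate directly. The sketch below (Python, not in the original notes) tracks the current state while reading the input:

```python
def accepts(s):
    """Simulate the two-state recognizer: s0 rejects, s1 accepts;
    reading a 1 moves to s1 and reading a 0 moves back to s0."""
    state = "s0"
    for c in s:
        state = "s1" if c == "1" else "s0"
    return state == "s1"
```

Running it on the sample strings above reproduces the accept/reject pattern, confirming that the machine accepts exactly the strings ending in 1.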

Another example: what language does this FSM recognise?

[State diagram: four states s0 (start), s1, s2 and s3, with transitions on 0 and 1; s3 is a trap state with self-loops on both 0 and 1.]

For this example we can see that if the FSM ever reaches state s3 then it can never reach the accepting states (s1 or s2). Therefore any accepted string must start with 0. Another 0 would change the state to s3, so the second character must be a 1. We continue this argument to see that the FSM will only be in an accepting state provided it has received a sequence of 01 pairs. As soon as this pattern is broken, the machine changes to state s3. Therefore the language accepted by this FSM is L(M) = (01)*
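A machine with a trap state is conveniently simulated with an explicit transition table. The sketch below (Python, not in the original notes) is one DFA for (01)*; the table layout and accepting set are my own reconstruction and may not match the lecture diagram state-for-state, but it recognises the same language:

```python
# Transition table for a four-state DFA recognising (01)*.
# s3 is the trap state: once entered, it is never left.
DELTA = {
    ("s0", "0"): "s1", ("s0", "1"): "s3",
    ("s1", "0"): "s3", ("s1", "1"): "s2",
    ("s2", "0"): "s1", ("s2", "1"): "s3",
    ("s3", "0"): "s3", ("s3", "1"): "s3",
}
ACCEPTING = {"s0", "s2"}   # the empty string is in (01)*, so s0 accepts

def accepts_01(s):
    state = "s0"
    for c in s:
        state = DELTA[(state, c)]
    return state in ACCEPTING
```
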

The pumping lemma We now turn to the problem of determining when a language is non-regular, and hence cannot be recognized by a finite state machine. The main tool for this is a result called the pumping lemma, which can be proved using the Pigeonhole Principle we studied a few weeks ago. The pumping lemma for regular languages was first stated by Y. Bar-Hillel, Micha A. Perles, and Eli Shamir in 1961.

Theorem: Pumping Lemma (for regular languages)

Theorem: If L is a regular language then there exists an integer p, called the pumping length of L, such that every word w ∈ L with |w| ≥ p can be written as w = xyz where
1. xy^i z ∈ L for every i ≥ 0
2. |y| ≥ 1
3. |xy| ≤ p

That is, if L is regular then any sufficiently long word in the language contains a non-empty substring that can be repeated an arbitrary number of times (0, 1 or more) to give another word in the language.
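Before using the lemma to prove non-regularity, it helps to see it hold on a language that IS regular. The Python check below (an illustration, not part of the notes) takes L = (0|1)*1 and one valid split of the word 0001, and verifies that every pumped word stays in L:

```python
import re

# L = (0|1)*1, the regular language of strings ending in 1.
# Take w = "0001" with the split x = "", y = "0", z = "001":
# |y| = 1 satisfies condition 2, and |xy| = 1 <= p for any p >= 1.
x, y, z = "", "0", "001"
pumped = [x + y * i + z for i in range(5)]   # i = 0, 1, 2, 3, 4
# Every pumped word still ends in 1, so all of them remain in L.
ok = all(re.fullmatch(r"(0|1)*1", s) for s in pumped)
```
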

How to use the pumping lemma

The pumping lemma is of the form R ⇒ P. That is, IF a language is regular (R) THEN certain properties must hold (P).
Usage 1: Assume R. From the lemma we also have P. Now derive a contradiction. The contradiction tells us that the original assumption R must be false. This way we can prove that a language is NOT regular.
Usage 2: R ⇒ P is equivalent to its contrapositive ¬P ⇒ ¬R. So if we can show ¬P then by the pumping lemma we can deduce that the language is NOT regular.
Note that we can NOT say that if the pumping properties hold then the language is regular, because we do not have P ⇒ R.

Use of the pumping lemma The trickiness of applying the pumping lemma is using the ∀ and ∃ conditions correctly. You need to practice! To prove that a language is not regular, find some sufficiently long string that is in the language but which cannot be pumped. The existence of such a string shows that the given language is not regular. We use an adversary game argument [Hopcroft and Ullman]: your choices in the game correspond to the universally quantified parts of the Pumping Lemma (the word w and the exponent i), and the adversary's choices correspond to the existentially quantified parts (the pumping length p and the split x, y, z).

Proof of the pumping lemma

RTP: if L is regular, then ∃p such that ∀w ∈ L with |w| ≥ p, ∃x, y, z with w = xyz, |xy| ≤ p and |y| ≥ 1, such that ∀i ≥ 0, xy^i z ∈ L.

Proof. Suppose L is a regular language. Then it can be recognized by a finite state machine M with p states. Now suppose that w = c1 c2 ... cn is a word of length n ≥ p. Consider the states that occur when M is run with input string w:

s0 --c1--> s1 --c2--> s2 --> ... --cn--> sn

This run visits n + 1 ≥ p + 1 states, but M has only p states, so by the pigeonhole principle at least two of these states must be the same. Let si and sj (with i < j) be the first two occurrences of the first repeated state.

Pumping Lemma (cont.)

Now set
x = c1 c2 ... ci
y = c(i+1) c(i+2) ... cj
z = c(j+1) ... cn
Now we can see that y is a string that takes the finite state machine in a circle from a state back to itself. This means that we can pump the input string by repeating this portion as often as we like (or omitting it), and still get a string that is recognized by the machine. Hence xz, xyyz, xyyyz and in general xy^i z are all recognized by M and hence in the language L. QED
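The proof is constructive, so for a concrete DFA we can actually compute the split at the first repeated state. The sketch below (Python, not in the original notes) does exactly this for a 2-state machine, so p = 2, and then verifies that every pumped word is still accepted:

```python
def run_states(delta, start, w):
    """Return the list of states s0, s1, ..., sn visited while reading w."""
    states = [start]
    for c in w:
        states.append(delta[(states[-1], c)])
    return states

def pumping_split(delta, start, w):
    """Split w = xyz at the first repeated state, as in the proof."""
    states = run_states(delta, start, w)
    seen = {}
    for pos, s in enumerate(states):
        if s in seen:
            i, j = seen[s], pos
            return w[:i], w[i:j], w[j:]
        seen[s] = pos
    raise ValueError("w is shorter than the number of states")

# 2-state DFA for strings ending in 1 (p = 2); take w = "0011", |w| >= p.
delta = {("a", "0"): "a", ("a", "1"): "b",
         ("b", "0"): "a", ("b", "1"): "b"}
x, y, z = pumping_split(delta, "a", "0011")
# y loops from a state back to itself, so every xy^i z ends in state b:
pumped_ok = all(run_states(delta, "a", x + y * i + z)[-1] == "b"
                for i in range(4))
```
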

Adversary Argument using the Pumping Lemma
1. Select the language L to be proven non-regular.
2. The adversary (she) picks the pumping constant p.
3. You select some string w ∈ L (based on your knowledge of p).
4. She breaks w into any x, y, z she wants, subject to the constraints |xy| ≤ p and |y| ≥ 1.
5. You achieve a contradiction to the Pumping Lemma by showing that, for any x, y, z chosen by the adversary, there is an i such that xy^i z ∉ L. Your choice of i may depend on p, x, y, and w.
6. From this it can be concluded that L is not regular.

Adversary Argument Example

Example: Show that the language L is not regular, where
L = {w : w has an equal number of 0s and 1s}

Proof. Suppose L is regular, and let the pumping length be p. Choose w = 0^p 1^p. Clearly w ∈ L. We will see this is a useful choice of w for the proof (not all words in L are useful choices). For any adversary choice of xyz = w, both x and y can only contain 0s, since |xy| ≤ p (constraint 3). Say x = 0^m and y = 0^n for some m + n ≤ p with n ≥ 1. Now, the pumping lemma states that xyyz ∈ L, but we know xyyz ∉ L because xyyz has more 0s than 1s. We have derived a contradiction. Therefore L is not regular. QED
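Because the adversary's choices are finite once p is fixed, we can exhaustively check the argument for one concrete p. The sketch below (Python, not in the original notes; p = 7 is an arbitrary sample) confirms that EVERY legal split of 0^p 1^p fails when pumped with i = 2:

```python
def in_L(s):
    """Membership in L = {w : equal number of 0s and 1s}."""
    return s.count("0") == s.count("1")

p = 7                      # a hypothetical pumping length
w = "0" * p + "1" * p      # the word chosen in the proof
assert in_L(w)

# Whatever split w = xyz the adversary picks with |xy| <= p and |y| >= 1,
# y lies entirely inside the leading 0s, so pumping to xyyz adds 0s only.
all_splits_fail = all(
    not in_L(w[:xy - n] + w[xy - n:xy] * 2 + w[xy:])   # x + yy + z
    for xy in range(1, p + 1)        # xy = |xy|
    for n in range(1, xy + 1)        # n  = |y|
)
```
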

Take Care: the Pumping Lemma uses ⇒, not ⇔

Note that while the pumping lemma states that all regular languages satisfy the conditions described above, the converse of this statement is not true. A language that satisfies the pumping conditions may still be non-regular.

Example: L = {a^i b^j c^k : i, j, k ≥ 0 and (i = 1 ⇒ j = k)}
a) show that L is not regular
b) show that w = a^i b^j c^k satisfies the pumping lemma conditions (for some i, j, k)
c) explain why parts a) and b) do not contradict the pumping lemma
This question is an exercise in this week's tutorial.

Context-Free Languages Context-free languages can describe certain features with a recursive structure. The automata that recognise them are more powerful than FSMs because they have some limited memory in the form of a stack.

What are non-regular languages like? Context-free languages were first studied as a way of understanding human languages. For English we have the following loose rule
sentence → noun-phrase verb-phrase
which we interpret as saying "A valid sentence consists of a noun-phrase followed by a verb-phrase". To complete the description, we then need to define noun-phrase, verb-phrase and so on, which are defined in the same way:
noun-phrase → article noun
verb-phrase → verb adverb

Context Free languages for Computer Science An important use of CF languages in Computer Science is the specification and compilation of programming languages such as Java or SQL.

Context Free languages and Automata Context-free languages include all the regular languages and many more. For example, we saw that {0^n 1^n : n ≥ 0} is not a regular language, but today we will learn that it is a context-free language. Context-free languages are precisely the class of languages that can be recognised by pushdown automata, which are finite state machines that use a stack as a memory device. Context-free languages are probably the most important formal languages for Computer Science, because many computer languages are context free (at least in part, if not entirely).

Grammars Context-free languages can be specified using grammars. For example, the language {0^n 1^n : n ≥ 0} is generated by the grammar
A → 0A1 | ε
Example construction sequence: A, 0A1, 00A11, 000A111, 000111
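The construction sequence above can be mechanised: repeatedly apply A → 0A1, then finish with A → ε. A small Python sketch (not part of the original notes):

```python
def derive(n):
    """Apply A -> 0A1 exactly n times, then A -> ε, recording each step."""
    s = "A"
    steps = [s]
    for _ in range(n):
        s = s.replace("A", "0A1")   # one application of A -> 0A1
        steps.append(s)
    steps.append(s.replace("A", ""))  # final application of A -> ε
    return steps

# derive(3) reproduces the construction sequence on this slide,
# ending in the string 000111.
```
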

Grammars: the mechanics A grammar grows all the strings of a language. A grammar is a collection of substitution rules called productions. Each rule has a left hand side symbol, an arrow and a right hand side. Variable symbols are called non-terminals and are usually represented by capital letters. The other symbols, drawn from the alphabet of the language, are called terminals and are usually represented by lower case letters. One symbol is designated the start variable, usually written S. Strings in the language are grown by starting with the start symbol and then replacing non-terminals according to the production rules.

Grammar Example 1 The language of all expressions with balanced brackets is generated by the grammar
S → SS | (S) | ε
Example construction sequence: S, SS, SSS, (S)SS, ((S))SS, (())SS, (())(S)S, (())()S, (())()(S), (())()()

Grammar Example 2
A → 0A1
A → B
B → x
Example derivation: A, 0A1, 00A11, 00B11, 00x11
Rules can be written on separate lines (Example 2), or using | to denote a list of alternative rules for the same non-terminal (Example 1).

Context-free grammar definition

Definition: A context-free grammar is a 4-tuple (V, Σ, R, S) where
1. V is a finite set called the variables (usually denoted by capital letters)
2. Σ is a finite set with Σ ∩ V = ∅, called the terminals (usually denoted by lower case letters or symbols; this is the alphabet of the language)
3. R is a finite set of rules, each of the form A → X where A ∈ V and X is a string of variables and terminals
4. S ∈ V is the start variable

Sipser Definition 2.2, page 104 in the 3rd edition

Idea of Push-Down Automata (PDA) Context-free languages can be recognised by automata called PDAs. PDAs are similar to non-deterministic FSMs (NFSMs), but they have an extra component called a stack. The stack provides extra memory, in addition to the states. This memory allows a PDA to do counting that an NFSM cannot.

Formal defn of PDAs

Definition: A pushdown automaton (PDA) is defined to be a 6-tuple (Q, Σ, Γ, δ, q0, F) where
- Q is a finite set of states
- Σ is a finite alphabet of input symbols
- Γ is the finite stack alphabet
- δ : Q × Σε × Γε → P(Q × Γε) is the transition function (where Σε = Σ ∪ {ε} and Γε = Γ ∪ {ε})
- q0 ∈ Q is the start state
- F ⊆ Q is the set of accepting states (F may be the empty set)

PDA transitions Note this definition is nearly the same as for NFSMs, with the exception of the transition function and the addition of a stack alphabet Γ. As well as changing state for a given input, a PDA may read and pop a symbol from the top of the stack, and push a symbol onto the top of the stack.

Writing PDA transitions Transitions are written a, b → c where a is an input symbol. If the machine sees input a then it may replace b on top of the stack with c. In other words, b is the symbol popped off the stack and c is the symbol pushed onto the stack.
- If b is ε (the empty symbol) then make the transition without any pop (read) operation
- If c is ε then make the transition without any push (write) operation
- $ is a special symbol used to denote the bottom of the stack: seeing it means the stack is otherwise empty

PDA Example for 0^n 1^n

[State diagram: states q1 (start), q2, q3, q4, with transitions
q1 → q2 on ε, ε → $
q2 → q2 on 0, ε → 0
q2 → q3 on 1, 0 → ε
q3 → q3 on 1, 0 → ε
q3 → q4 on ε, $ → ε]

Reading PDA transitions
ε, ε → $ : Given no input and nothing to pop, push the bottom-of-stack symbol $ onto the stack. This PDA starts with this transition.
0, ε → 0 : On seeing an input 0, push a 0 onto the stack. Do not pop anything from the stack.
1, 0 → ε : On seeing input 1 with a 0 on the stack, pop the 0 from the stack and do not push anything on. This step pairs off each 1 with a previously stored 0.
ε, $ → ε : On seeing no further input and the $ on the stack, accept the string, since we must now have seen the same number of 1s as 0s.
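This particular PDA is deterministic enough to simulate with an explicit stack. The sketch below (Python, not in the original notes) follows the four transitions just described, one comment per transition:

```python
def pda_0n1n(s):
    """Simulate the PDA for {0^n 1^n : n >= 0} with an explicit stack."""
    stack = ["$"]            # ε, ε -> $ : mark the bottom of the stack
    i = 0
    while i < len(s) and s[i] == "0":
        stack.append("0")    # 0, ε -> 0 : push each 0 read
        i += 1
    while i < len(s) and s[i] == "1":
        if stack[-1] != "0":
            return False     # a 1 with no stored 0 to pair it with
        stack.pop()          # 1, 0 -> ε : pop a 0 for each 1
        i += 1
    # ε, $ -> ε : accept iff the input is exhausted and only $ remains
    return i == len(s) and stack == ["$"]
```
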

Palindrome language Design a PDA to recognize the language {ww^R : w ∈ {0, 1}*}. Here w is any binary string, and w^R means w written backwards. So, for example, 001110011100 is in the language. We will use non-determinism to guess when the middle of the string has been reached.
Approach:
1. Push each symbol read onto the stack
2. Guess that you have reached the middle of the word
3. Then pop elements off the stack if they match the next input symbol
4. Accept the string if every popped symbol matches the input, and the stack empties at the same time the end of the input is reached
5. Reject otherwise
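Non-determinism can be simulated by trying every possible guess. The sketch below (Python, not in the original notes) tries each midpoint in turn, pushing the first half and popping against the second half, exactly as in the approach above:

```python
def accepts_wwR(s):
    """Simulate the nondeterministic PDA for {ww^R : w in {0,1}*}
    by trying every possible guess of the middle."""
    for mid in range(len(s) + 1):    # the nondeterministic guess
        stack = list(s[:mid])        # phase 1: push everything read
        matched = True
        for c in s[mid:]:            # phase 2: pop and compare
            if not stack or stack.pop() != c:
                matched = False
                break
        if matched and not stack:    # accept: input done AND stack empty
            return True
    return False                     # no guess of the middle worked
```
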

Palindrome language PDA

[State diagram: states q1 (start), q2, q3, q4 (accepting), with transitions
q1 → q2 on ε, ε → $
q2 → q2 on 0, ε → 0 and on 1, ε → 1 (push phase)
q2 → q3 on ε, ε → ε (guess the middle)
q3 → q3 on 0, 0 → ε and on 1, 1 → ε (pop-and-match phase)
q3 → q4 on ε, $ → ε]
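The guess-the-middle move (ε, ε → ε from q2 to q3) is what makes this PDA genuinely nondeterministic: no input is consumed and nothing on the stack forces the choice. A quick way to see why the guess matters is to check that committing to the wrong midpoint fails even on a string that is in the language. The sketch below (Python, not in the original notes) checks one fixed midpoint at a time:

```python
def accepts_with_fixed_middle(s, mid):
    """One nondeterministic branch: guess that the middle is at index mid,
    then require the stack to empty exactly at the end of the input."""
    stack = list(s[:mid])            # push phase (state q2)
    for c in s[mid:]:                # pop-and-match phase (state q3)
        if not stack or stack.pop() != c:
            return False
    return not stack                 # ε, $ -> ε : stack must be empty

# "0110" is in the language, but only the branch mid = 2 accepts it;
# the overall PDA accepts because SOME branch accepts.
branches = [accepts_with_fixed_middle("0110", m) for m in range(5)]
```
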

Designing PDAs See Sipser Lemma 2.21 for details (p117 in 3rd ed). Idea: build a PDA that accepts input w if the grammar G generates it, by following derivations. Use non-determinism to allow for the choice of productions.
- Push the start symbol S onto the stack: ε, ε → S
- If the top of the stack is a non-terminal (S), then non-deterministically choose a production and substitute the non-terminal by its right-hand side. Since a rule may generate several symbols, this needs a sequence of pushes. No input is consumed at this stage.
- If the top of the stack is a terminal symbol (0 or 1), then check whether it matches the next symbol in the input string. If they don't match then go to a non-accepting state; if they do match then continue.

Designing PDAs (cont) Example - see tutorial questions - to be done in class

How to show a language is context-free Theorem: PDAs are equivalent in power to context-free grammars. This is a useful result because it gives two options for proving that a language is CF:
1. specify a PDA for the language
2. specify a CF grammar for the language

Backus-Naur form (for information) Context-free grammars related to computer languages are often given in a special shorthand notation known as Backus-Naur form (BNF).

<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
<letter> ::= a | b | c | ... | z
<digit> ::= 0 | 1 | ... | 9

In BNF, the non-terminals are identified by the angle brackets, and productions with the same left-hand side are combined into a single statement with the OR symbol |.
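The <identifier> grammar is left-recursive, which unrolls into a simple loop: a letter followed by any mix of letters and digits. A Python sketch (not part of the original notes):

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)

def is_identifier(s):
    """Check s against the <identifier> BNF: a <letter> first, then any
    number of letters or digits (the left recursion becomes this loop)."""
    if not s or s[0] not in LETTERS:
        return False
    return all(c in LETTERS | DIGITS for c in s[1:])
```
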