IBAN LANGUAGE PARSER USING RULE BASED APPROACH

IBAN LANGUAGE PARSER USING RULE BASED APPROACH
Chia Yong Seng
Master of Advanced Information Technology
2010

IBAN LANGUAGE PARSER USING RULE BASED APPROACH
CHIA YONG SENG
A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Advanced Information Technology
Faculty of Computer Science and Information Technology
UNIVERSITI MALAYSIA SARAWAK
2009

ACKNOWLEDGEMENT
The author wishes to express sincere appreciation to Dr. Edwin Mit, Ms. Suhaila, and Dr. Alvin Yeo for their assistance in the preparation of this dissertation. In addition, special thanks go to those whose familiarity with the needs and ideas of this research project was helpful during the early programming phase of this undertaking. Thanks also to the members of the school council for their valuable input, and finally to my family members for their faithful support.

TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
ABSTRAK
CHAPTER 1: INTRODUCTION
  1.1 Introduction
  1.2 Research Background
  1.3 Scope Of The Research
  1.4 Objectives Of The Research
  1.5 Significance Of The Research
  1.6 Problem Statements
  1.7 Proposed Solution
  1.8 Chapter Summary
CHAPTER 2: LITERATURE REVIEW
  2.1 Introduction
  2.2 The Parser
    2.2.1 The Parsing Process
    2.2.2 Word Tokenizing
    2.2.3 Word Tagging
    2.2.4 Word Aligning
  2.3 Computer Perception On Linguistic
  2.4 Different Approaches Of Parser
    2.4.1 The Top Down Approach Parser
    2.4.2 The Bottom Up Approach Parser
  2.5 Reviews On Language Parsers
    2.5.1 Apple Pie Parser
    2.5.2 LingSoft's ENGCG Parser
    2.5.3 Parse A Sentence (Phrase Parser)
    2.5.4 SalingWika (A Top Down Parser)
    2.5.5 Overview Comparisons
  2.6 Chapter Summary
CHAPTER 3: METHODOLOGY
  3.1 Introduction
  3.2 Development Methodology
    3.2.1 Spiral Methodology Cycles
  3.3 Parser's Process Flow
  3.4 Iban Formal Grammar
  3.5 Rule Based Grammar Applied
  3.6 The Top Down Approach Parser
  3.7 The Bottom Up Approach Parser
  3.8 Chapter Summary
CHAPTER 4: IMPLEMENTATIONS
  4.1 Introduction
  4.2 Implementing The Parser
    4.2.1 The Secondary Word Tagger
    4.2.2 The Source Of Iban Dictionary
    4.2.3 Database Design
    4.2.4 Tagset Used
    4.2.5 Finding Object And Subject In Sentence
    4.2.6 Finding Subject And Object In Multiple Sentence
    4.2.7 Iban Tree Structure
  4.3 System Development
  4.4 System Input Output
  4.5 Chapter Summary
CHAPTER 5: DISCUSSION (RESULTS & TESTING)
  5.1 Introduction
  5.2 Test Samples
  5.3 Conditional Coverage Testing
  5.4 Predicate Coverage Testing
  5.5 Permutable Predicate Coverage Testing
  5.6 Lengthy Predicate Coverage Testing
  5.7 Permutable And Lengthy Predicate Coverage Testing
  5.8 Multiple Predicates Coverage Testing
  5.9 Performance Metric
  5.10 Total Words Not Available From Dictionary
  5.12 Analysis Results
  5.13 Iban Parser Limitations
  5.14 Chapter Summary
CHAPTER 6: FUTURE WORKS/EXTENSIONS
  6.1 Introduction
  6.2 Achievements
  6.3 Recommendations For Future Works
  6.4 Chapter Summary
REFERENCES
APPENDIX A: LIST OF TEST SAMPLES
APPENDIX B: PROTOTYPE SCREENSHOTS

LIST OF FIGURES
Figure 2.1 Example of parse tree in Apple Pie Parser
Figure 2.2 Score calculation formulae in Apple Pie Parser
Figure 2.3 Screenshot taken from Apple Pie Parser
Figure 2.4 Screenshot taken from ENGCG Parser
Figure 2.5 Screenshot taken from Phrase Parser
Figure 2.6 Phrase Parser's connector connections
Figure 3.1 Architecture of proposed Iban Parser System
Figure 3.2 Spiral methodology taken in building Iban language Parser
Figure 3.3 Process flow for parsing an Iban sentence
Figure 4.1 Example of Iban word in Iban dictionary
Figure 4.2 Subject and Object in sentence
Figure 4.3 Basic construction of conjunction for multiple sentences
Figure 4.4 Iban Tree Structure
Figure 4.5 Interface layout of input interface
Figure 4.6 Interface layout of Output interface
Figure 5.1 Top Down approach separation point
Figure 5.2 Bottom Up approach separation point
Figure 5.3 Pronoun on first parse in Top Down approach
Figure 5.4 Pronoun on first parse in Bottom Up approach
Figure 6 Iban Parser's input interface
Figure 7 Apache Tomcat's console display
Figure 8 Iban Parser's result
Figure 9 Iban Parser's Top Down Tree structure
Figure 10 Iban Parser's Bottom Up Tree structure
Figure 11 Iban Parser's Tree structure (Tomcat's Console)
Figure 12 Iban Parser's Top Down Tree structure (for Conjunction sentence), Part 1
Figure 13 Iban Parser's Top Down Tree structure (for Conjunction sentence), Part 2
Figure 14 Iban Parser deployment, Java Servlets classes
Figure 15 Iban Parser deployment, Java Server Pages (JSP)
Figure 16 Iban Parser source, Part 1
Figure 17 Iban Parser source, Part 2
Figure 18 Iban Parser source, Part 3
Figure 19 Iban Parser source, Part 4
Figure 20 Iban Parser dictionary, Part 1
Figure 21 Iban Parser dictionary, Part 2

LIST OF TABLES
Table 2.1 Comparison between Parsers
Table 4.1 IBAN_ENG_LEXICON database schema
Table 5.1 Test sample for testing
Table 5.2 Conditional coverage testing
Table 5.3 Predicate coverage testing
Table 5.4 Permutable Predicate coverage testing
Table 5.5 Lengthy Predicate coverage testing
Table 5.6 Permutable and Lengthy Predicate coverage testing
Table 5.7 Multiple Predicates coverage testing
Table 5.8 Iban Parser's performance metric
Table 5.9 Iban Parser's performance metric on Regular sentences
Table 5.10 Iban Parser's performance metric on Irregular sentences
Table 5.11 Total words not available from Iban dictionary

ABSTRACT
There is a need for documentation and studies on the Iban language in Natural Language Processing (NLP), because no tools or Parser for the Iban language are available. To understand and learn the Iban language, an Iban Parser is required to generate Iban sentence structure, which allows computer scientists to study the Iban language academically. The purpose of this research project is to propose an Iban Parser, a Parser that parses Iban sentences. The Parser recognizes the part of speech of each word in a sentence using a Rule Based Grammar. Upon recognizing all the Iban words in a sentence, the Parser presents that sentence as a Tree data structure. The proposed Iban Parser is designed to parse sentences with a Top Down approach and a Bottom Up approach; the two approaches perform sentence parsing differently. This research project ran multiple tests on the Iban Parser: (1) Conditional coverage testing, (2) Predicate coverage testing, (3) Lengthy Predicate coverage testing, (4) Permutable Predicate coverage testing, (5) Lengthy and Permutable Predicate testing, and lastly (6) Multiple Predicates coverage testing. Overall, the test results showed that the Iban Parser can recognize the Part Of Speech in Iban sentences. The design and the tests recorded in this research project serve as a stepping stone for related research fields in the Iban language.

ABSTRAK
There is a need for documentation or study of the Iban language in Natural Language Processing (NLP), because no Parser tool for understanding the Iban language is available. In order to understand and learn the Iban language, an Iban Parser tool is required to generate Iban sentence structure, which enables computer scientists to study the Iban language academically. The purpose of this research project is to propose an Iban Parser tool that will tokenize Iban sentences. The Iban Parser will recognize the part of speech based on Iban grammar rules. Once the Iban Parser has recognized all the words in a sentence, it will present the sentence as a Tree data structure. The proposed Iban Parser will tokenize sentences with the Top Down approach and the Bottom Up approach, and the two approaches perform tokenizing differently. This research project ran a series of tests to test these approaches for the Iban Parser. Overall, the test results show that the Iban Parser can recognize the part of speech in Iban sentences. The design and the series of tests recorded in this research project document will serve as a stepping stone for related research fields in the Iban language.

CHAPTER 1: INTRODUCTION

1.1 Introduction
A Parser is a Natural Language Processing tool for generating sentence structure; different languages require different Parsers. A Parser's role is to break a sentence (the input) into atomic forms (also known as tokens) so that a computer can recognize the grammatical representation of each word. The purpose of this research project is to present the basic conceptual design, the parsing process flow, and the parsed data presentation of the Iban Parser. This research project serves as a reference for audiences such as computer scientists and researchers working on Natural Language Processing for the Iban language.

The dissertation is organized as follows. Chapter 1 (Introduction) introduces the background and objectives of this research project. Chapter 2 (Literature Review) reviews existing Parsers and their approaches. Chapter 3 (Methodology) describes some of the design aspects of the Iban Parser. Chapter 4 (Implementation) records the steps taken to construct the Parser. Chapter 5 (Discussion) analyzes the test results on the Iban Parser and reviews its limitations. Chapter 6 (Future Works) concludes this dissertation with achievements and recommendations for future work.

1.2 Research Background
According to the research projects list compiled by John Hutchins (2009) of the European Association for Machine Translation on behalf of the International Association for Machine Translation, there are no documented works on translating English to the Iban language or vice versa. Research related to the Iban language is not listed and not available for reference. This dissertation therefore also acts as a stepping stone for further or related research.

1.3 Scope Of The Research
This research project takes Iban sentences (5 to 10 words) as inputs, constructs a Parser for parsing these sentences, and recognizes the sentence structure based on an author-defined Rule Based Grammar. The project also uses a small Iban dictionary (with 10,000 entries).

1.4 Objectives Of The Research
The objectives of this research project are:
(1) Develop a prototype of an Iban language Parser.
(2) Automate the generation of Iban sentence structure.
(3) Recognize the Iban language's Part Of Speech (e.g., RJN (Rambai Jaku Nama), RJA (Rambai Jaku Adjektif), and RJP (Rambai Jaku Pengawa)).

1.5 Significance Of The Research
This research project will be useful as a reference in learning and understanding Iban language structure. The possible benefits foreseen from this research project are:
(1) Assist human translators in translating Iban language documents.
(2) Act as a foundation for applications such as concordance tools and grammar checkers.
(3) Serve as a reference for other related research in the Natural Language Processing field.

1.6 Problem Statements
This research project was initiated due to the following factors:
(1) There is a lack of documented works on the Iban language that are related or similar to this research project. Proper documentation is important and acts as a reference for related work in Iban language translation.
(2) Natural Language Processing tools, in particular a Parser, are not available for the Iban language. A Parser is needed for recognizing Iban sentence structure.
(3) There is a lack of documented, computationally defined grammar rules for Iban sentences in Natural Language Processing.

1.7 Proposed Solution
To tackle the problems identified in section 1.6, the following solutions are proposed in this research project:
(1) This research project will provide a written record of the studies done on the Iban Parser. It will be documented as a dissertation and be anchored as a reference in related research fields.
(2) This research project proposes an Iban Parser design. The proposed Iban Parser will automatically generate Iban sentence structure.
(3) This research project proposes defined Iban sentence grammar rules for the Natural Language Processing field.

1.8 Chapter Summary
As mentioned in this chapter, no Iban Parser has yet been developed. In order to translate and learn the Iban language (based on sentence structure), an Iban Parser is required. This research project proposes and presents a suitable experimental Iban Parser.

CHAPTER 2: LITERATURE REVIEW

2.1 Introduction
This chapter discusses language Parsers that have been made available and studies that have been done on the parsing process. The Parsers chosen for review are Apple Pie Parser, ENGCG Parser, Phrase Parser and SalingWika; the review examines their parsing processes and identifies their distinctive features. This chapter also discusses the steps of the parsing process: Word Tokenizing, Word Tagging, and Word Aligning.

2.2 The Parser
A Natural Language Parser (NLP) is a program constructed to recognize the grammatical structure of a sentence. The Parser breaks the sentence into small parts and later regroups them in a generated sentence structure, for example as the Object or Subject of a verb (James Allen, 1995). The generated sentence structure is represented by lexical symbols (referred to as Keys in this research project); each symbol represents part of a sentence in computational linguistic terms. Putting the lexical symbols together forms the grammatical presentation of the sentence.

Below is a list of common Part Of Speech symbols:
NP       Noun Phrase, referring to things, places, qualities, concepts, events or objects.
S        Sentence, used to assert, query or command.
VP       Verb Phrase, a Predicate.
VP[inf]  VP starting with the infinitive form.
S[inf]   Sentence in the infinitive form.
PP       Prepositional Phrase, as used with verbs that involve a specific Prepositional Phrase.
ADJP     Adjective Phrase, consisting of a single Adjective.
ADVP     Adverbial Phrase.
ART      An article.
N        A common noun.

2.2.1 The Parsing Process
Parsing a sentence can be done in two ways: Syntactic parsing and Semantic parsing. According to Wikipedia (2009), Syntactic parsing checks a sentence token by token and creates expressions (or recognitions), usually governed by a Context-free grammar (CFG), which is used to describe the structure of a language. Semantic parsing (Wikipedia, 2009) takes place after Syntactic parsing and tries to work out the implications of the expression.

This research involves only Syntactic parsing in the Iban Parser: an Iban sentence is broken into small tokens and goes through the parsing process. In a generic Parser, the parsing process of a sentence involves Tokenizing, Tagging and Aligning.

2.2.2 Word Tokenizing
A Tokenizer is an NLP tool that scans a string of characters (James Allen, 1995), such as a line of text entered at a command prompt, and converts it into a list of words and punctuation marks. Each item in this list is called a "token". When parsing a sentence, the Parser does not work on the whole "chunk" (the entire input sentence); instead it works with tokens, which is faster and easier. Without a Tokenizer, the Parser would need to go through steps such as recognizing word boundaries, skipping whitespace, and finding delimiters (such as quotes and parentheses). The Tokenizer performs all of this in advance when a string is tokenized, so these steps are not repeated in the parsing process.
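The tokenizing step described above can be illustrated with a minimal sketch. The regular expression and the function name `tokenize` are assumptions made for this illustration; they are not taken from the Iban Parser or from any of the reviewed Parsers, and a production Tokenizer would need language-specific rules.

```python
import re

# Hypothetical token pattern (an assumption, not from the dissertation):
# a run of letters or digits, optionally joined by internal apostrophes
# or hyphens, or else any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def tokenize(text):
    """Split a line of text into word and punctuation tokens, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize('He said, "parse this sentence."')
print(tokens)
# ['He', 'said', ',', '"', 'parse', 'this', 'sentence', '.', '"']
```

Because word boundaries, whitespace and delimiters are all resolved once in `tokenize`, a later parsing stage can iterate over the returned list without inspecting individual characters again.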

2.2.3 Word Tagging
Tagging is a process handled by a Tagger that assigns a Key (a part of speech such as Noun, Verb, or Adjective) to a word; in many cases a word can take more than one Key, for example Noun or Determiner (James Allen, 1995). Tagging is done by matching the word against a large predefined tag library, and it usually comes after a sentence has been tokenized.

2.2.4 Word Aligning
Aligning is the process of matching a word with another word (James Allen, 1995); this is usually done with a predefined lexicon source (dictionary).

2.3 Computer Perception On Linguistic
Unlike a human, a computer cannot recognize a string like "The quick brown fox jumps over the lazy dog" as a sentence; it sees only an array of 43 characters, including whitespace. For a computer to learn and understand this string, a new presentation is required (James Allen, 1995). With the Word Tokenizing process, this 43-character array is recognized as 9 words based on word boundaries and whitespace.
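As a small illustration of the steps described above, the sketch below counts the characters and words of the example sentence and then tags each word by looking it up in a lexicon. The lexicon is a toy invented for this example, the Keys ADJ, V and P are illustrative labels, and the function name `tag` is hypothetical; a real Tagger would consult a much larger tag library.

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Toy lexicon, invented for illustration only (not the Iban dictionary).
LEXICON = {
    "the": "ART", "quick": "ADJ", "brown": "ADJ", "fox": "N",
    "jumps": "V", "over": "P", "lazy": "ADJ", "dog": "N",
}

def tag(tokens, lexicon):
    """Pair each token with its Key, or 'UNKNOWN' if it is not in the lexicon."""
    return [(tok, lexicon.get(tok.lower(), "UNKNOWN")) for tok in tokens]

tokens = sentence.split()   # split on whitespace word boundaries
print(len(sentence))        # 43 characters, including the 8 spaces
print(len(tokens))          # 9 word tokens
print(tag(tokens, LEXICON))
```

A word missing from the lexicon is reported as UNKNOWN rather than guessed, which mirrors the situation where a word is not available in the dictionary.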

One common way of storing tokens in persistent form is an XML (Extensible Markup Language) document. This is due to XML's simplicity, its usability over the Internet, and its support, via Unicode, for languages around the world. The following is an example of a sentence converted into XML format, which can later be recognized by a computer during the parsing process. The computer can now treat each word as a separate entity instead of one whole "chunk" (in this case, the entire input sentence) in a string array.

The sentence "The quick brown fox jumps over the lazy dog" can be represented in XML format as:

<sentence>
  <word>the</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
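The XML representation above can be produced and read back with a few lines of code. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the function names `sentence_to_xml` and `xml_to_tokens` are hypothetical, and the sketch makes no claim about how the Iban Parser itself stores its tokens.

```python
import xml.etree.ElementTree as ET

def sentence_to_xml(tokens):
    """Wrap each token in a <word> element under a single <sentence> root."""
    root = ET.Element("sentence")
    for tok in tokens:
        ET.SubElement(root, "word").text = tok
    return ET.tostring(root, encoding="unicode")

def xml_to_tokens(xml_text):
    """Read the text between each <word> and </word> pair back into a list."""
    root = ET.fromstring(xml_text)
    return [word.text for word in root.findall("word")]

tokens = "the quick brown fox jumps over the lazy dog".split()
xml_text = sentence_to_xml(tokens)
print(xml_text)
print(xml_to_tokens(xml_text))
```

The round trip returns the same list of tokens that went in, so the XML document can serve as the persistent form of a tokenized sentence.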

The computer recognizes the sentence word by word (e.g., "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", and "dog") in the XML document by distinguishing the content between each <word> markup and </word> markup.

2.4 Different Approaches Of Parser
The two common strategies used in parsing a sentence are the Top Down approach and the Bottom Up approach (James Allen, 1995). The Top Down approach Parser generates sentence structure in an expansive manner (from the first word to the last), while the Bottom Up approach Parser uses a reductive approach (beginning with the last word and ending with the first). Each strategy has its strengths and weaknesses depending on how it is used. Tokenization involves demarcating and classifying sections of an input string.

2.4.1 The Top Down Approach Parser
The Top Down approach Parser breaks the sentence (S) into atomic forms (tokens) from left to right (left most derivation), starting from the first word and ending with the last word in S. This approach is known as goal oriented, because a hypothesis about each symbol is made based on the units expected to be found in the sentence (James Allen, 1995).

The Top Down approach Parser uses a stack data structure; the two strategies available for this Parser are the Depth First strategy and the Breadth First strategy. The Depth First strategy uses a "Last In First Out" (LIFO) stack and the Breadth First strategy uses a "First In First Out" (FIFO) stack (that is, a queue).

The Depth First strategy pursues the main interpretation and expands it; only if that interpretation fails does it consider the alternatives. The Breadth First strategy examines the main interpretation and its alternatives together before proceeding to the next step of the search. The Depth First strategy may reach a result faster than the Breadth First strategy, but it may take a long time if it pursues the wrong interpretation.

2.4.2 The Bottom Up Approach Parser
The Bottom Up approach Parser matches words from right to left (right most derivation). Unlike the Top Down approach Parser, it starts from a known word in the sentence, which is the last word in the sentence (S) (James Allen, 1995). The Bottom Up approach Parser rewrites a word as its possible Key (a Part Of Speech attribute such as Noun, Verb, or Adjective) and replaces a sequence of symbols that matches the right-hand side of a grammar rule. A stack data structure is also used to store partial results during the search.

The parsing process in this Parser is based on Keys. A Key is used for a string based either on a rule that starts with the Key itself, or on a rule that has already been started by a previous Key, where the presence of the current Key completes or extends the rule.

2.5 Reviews On Language Parsers
To understand Parsers, this research project reviews some available English Parsers based on their features and techniques. The selected Parsers are:
(1) Proteus Project - Apple Pie Parser
(2) LingSoft's ENGCG
(3) Parse a sentence (Phrase Parser)
(4) SalingWika (A Top Down Parser)

2.5.1 Apple Pie Parser
Apple Pie Parser (APP) is a Bottom Up approach Parser from the Proteus Project that uses a best first search algorithm. The Parser (Proteus Project, 2009) finds the best parse tree based on the score given by the search algorithm. It generates a syntactic tree similar to Penn Treebank (PTB) bracketing. The later version of PTB (version 2.0) includes argument structure labels, which are not available in the syntactic tree generated by APP. This Parser was developed for parsing simple English sentences. Unlike most PTB Parsers, which search the whole sentence for a complete Part Of Speech match, APP searches the sentence partially.
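To make the Depth First and Breadth First strategies of Section 2.4.1 concrete, the following is a minimal sketch of a Top Down recognizer whose pending search states are taken either from the end of the agenda (LIFO) or from the front (FIFO). The toy English grammar, the lexicon, and the function name `recognize` are all assumptions made for this illustration; they are not the Iban grammar and are not drawn from Apple Pie Parser or any other reviewed Parser.

```python
from collections import deque

# Toy grammar and lexicon, invented for illustration (not the Iban grammar).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"the": "ART", "lazy": "ADJ", "dog": "N", "cat": "N",
           "sees": "V", "sleeps": "V"}

def recognize(words, strategy="depth"):
    """Return (accepted, states_examined) for a top-down search from S."""
    # Each state is (symbols still to be found, index of the next word).
    agenda = deque([(("S",), 0)])
    examined = 0
    while agenda:
        # LIFO pop for depth first, FIFO pop for breadth first.
        symbols, pos = agenda.pop() if strategy == "depth" else agenda.popleft()
        examined += 1
        if not symbols:
            if pos == len(words):
                return True, examined   # every symbol found, every word used
            continue
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:
            rules = GRAMMAR[first]
            if strategy == "depth":
                rules = reversed(rules)  # so the first rule ends up on top
            for rhs in rules:
                # Expand the leftmost symbol with a rule that rewrites it.
                agenda.append((tuple(rhs) + rest, pos))
        elif pos < len(words) and LEXICON.get(words[pos]) == first:
            # The next word has the expected Key: consume it.
            agenda.append((rest, pos + 1))
    return False, examined

for strategy in ("depth", "breadth"):
    print(strategy, recognize("the dog sees the lazy cat".split(), strategy))
```

Both strategies accept or reject the same sentences; they differ only in the order in which alternatives are explored, and therefore in how many states are examined before an answer is reached. The sketch only recognizes a sentence and does not build a tree.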