JKimmo: A Multilingual Computational Morphology Framework for PC-KIMMO

Similar documents
An Interactive Intelligent Language Tutor Over The Internet

Parsing of part-of-speech tagged Assamese Texts

COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Phonological Processing for Urdu Text to Speech System

Developing a TT-MCTAG for German with an RCG-based Parser

Syntax Parsing 1. Grammars and parsing 2. Top-down and bottom-up parsing 3. Chart parsers 4. Bottom-up chart parsing 5. The Earley Algorithm

CS 598 Natural Language Processing

LING 329 : MORPHOLOGY

Introduction to HPSG. Introduction. Historical Overview. The HPSG architecture. Signature. Linguistic Objects. Descriptions.

Improved Effects of Word-Retrieval Treatments Subsequent to Addition of the Orthographic Form

ARNE - A tool for Namend Entity Recognition from Arabic Text

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

Natural Language Processing. George Konidaris

1. Introduction. 2. The OMBI database editor

Modeling full form lexica for Arabic

Interfacing Phonology with LFG

Character Stream Parsing of Mixed-lingual Text

Parallel Evaluation in Stratal OT * Adam Baker University of Arizona

Cross Language Information Retrieval

A relational approach to translation

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

The Smart/Empire TIPSTER IR System

Grammars & Parsing, Part 1:

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

Bluetooth mlearning Applications for the Classroom of the Future

A Framework for Customizable Generation of Hypertext Presentations

Applications of memory-based natural language processing

Using dialogue context to improve parsing performance in dialogue systems

Refining the Design of a Contracting Finite-State Dependency Parser

Program Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading

An Online Handwriting Recognition System For Turkish

BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS

The presence of interpretable but ungrammatical sentences corresponds to mismatches between interpretive and productive parsing.

Topic: Making A Colorado Brochure Grade : 4 to adult An integrated lesson plan covering three sessions of approximately 50 minutes each.

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data

ENGBG1 ENGBL1 Campus Linguistics. Meeting 2. Chapter 7 (Morphology) and chapter 9 (Syntax) Pia Sundqvist

November 17, 2017 ARIZONA STATE UNIVERSITY. ADDENDUM 3 RFP Digital Integrated Enrollment Support for Students

Software Maintenance

The Interface between Phrasal and Functional Constraints

Proof Theory for Syntacticians

Indian Institute of Technology, Kanpur

A First-Pass Approach for Evaluating Machine Translation Systems

On-Line Data Analytics

Adapting Stochastic Output for Rule-Based Semantics

Linking Task: Identifying authors and book titles in verbose queries

Derivational: Inflectional: In a fit of rage the soldiers attacked them both that week, but lost the fight.

Constructing Parallel Corpus from Movie Subtitles

CREATING SHARABLE LEARNING OBJECTS FROM EXISTING DIGITAL COURSE CONTENT

Project in the framework of the AIM-WEST project Annotation of MWEs for translation

Pre-Processing MRSes

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A General Class of Noncontext Free Grammars Generating Context Free Languages

CS 446: Machine Learning

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

Florida Reading Endorsement Alignment Matrix Competency 1

The stages of event extraction

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Longman English Interactive

cambridge occasional papers in linguistics Volume 8, Article 3: 41 55, 2015 ISSN

What s in a Step? Toward General, Abstract Representations of Tutoring System Log Data

arxiv:cmp-lg/ v1 7 Jun 1997 Abstract

Longest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays

Building an HPSG-based Indonesian Resource Grammar (INDRA)

GACE Computer Science Assessment Test at a Glance

Infrared Paper Dryer Control Scheme

Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. Grzegorz Chrupa la

Setting Up Tuition Controls, Criteria, Equations, and Waivers

Named Entity Recognition: A Survey for the Indian Languages

A Comparison of Two Text Representations for Sentiment Analysis

Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]

Language properties and Grammar of Parallel and Series Parallel Languages

Spoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers

DIGITAL GAMING & INTERACTIVE MEDIA BACHELOR S DEGREE. Junior Year. Summer (Bridge Quarter) Fall Winter Spring GAME Credits.

Accurate Unlexicalized Parsing for Modern Hebrew

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

AQUA: An Ontology-Driven Question Answering System

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

The Strong Minimalist Thesis and Bounded Optimality

Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability

1/20 idea. We ll spend an extra hour on 1/21. based on assigned readings. so you ll be ready to discuss them in class

Case government vs Case agreement: modelling Modern Greek case attraction phenomena in LFG

Analysis: Evaluation: Knowledge: Comprehension: Synthesis: Application:

CHANCERY SMS 5.0 STUDENT SCHEDULING

DIBELS Next BENCHMARK ASSESSMENTS

ME 443/643 Design Techniques in Mechanical Engineering. Lecture 1: Introduction

LTAG-spinal and the Treebank

Double Double, Morphology and Trouble: Looking into Reduplication in Indonesian

Informatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy

Speech Recognition at ICSI: Broadcast News and beyond

Specifying Logic Programs in Controlled Natural Language

MYCIN. The embodiment of all the clichés of what expert systems are. (Newell)

Introduction, Organization Overview of NLP, Main Issues

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

MASTER OF SCIENCE (M.S.) MAJOR IN COMPUTER SCIENCE

Approaches to control phenomena handout Obligatory control and morphological case: Icelandic and Basque

Ensemble Technique Utilization for Indonesian Dependency Parser

CORRECT YOUR ENGLISH ERRORS BY TIM COLLINS DOWNLOAD EBOOK : CORRECT YOUR ENGLISH ERRORS BY TIM COLLINS PDF

Transcription:

JKimmo: A Multilingual Computational Morphology Framework for PC-KIMMO Md. Zahurul Islam and Mumit Khan Center for Research on Bangla Language Processing, Department of Computer Science and Engineering, BRAC University, Dhaka Bangladesh zahurul@bracu.ac.bd, mumit@bracu.ac.bd Abstract Morphological analysis is of fundamental interest in computational linguistics and language processing. While there are established morphological analyzers for mostly Western and a few other languages using localized interfaces, the same cannot be said for Indic and other less-studied languages for which language processing is just beginning. There are three primary obstacles to computational morphological analysis of these less-studied languages: the generative rules that define the language morphology, the morphological processor, and the computational interface that a linguist can use to experiment with the generative rules. In this paper, we present JKimmo, a multilingual morphological open-source framework that uses the PC-KIMMO two-level morphological processor and provides a localized interface for Bangla morphological analysis. We then apply Jkimmo to Bangla computational morphology, demonstrating both its recognition and generation capabilities. Jkimmo s internationalization (i18n) frame-work allows easy localization in other languages as well, using a property file for the interface definitions and a transliteration scheme for the analysis. 1. Introduction Morphological analysis is a key component of Natural Language Processing (NLP) and Computational Linguistics, and is a fundamental requirement of most advanced language processing applications from grammar checkers to automatic machine translators. With the current wave of work in Bangla Computational Linguistics, the need for a robust morphological analyzer has become critical. Our goal is to create a robust and reusable framework for doing morphological analysis of Bangla. There are three primary components in such a robust morphological analyzer for a language: the generative morphological rules, the underlying morphological processor, and the computational interface through which the user experiments with the language morphology. There is ongoing work in developing the computational morphology for Bangla, using both simple rewriting rules and feature unification grammars [1-4]. There are also well-established implementations for two-level morphological analyzers, with PC-KIMMO being one of the more widely available ones that implements Kimmo Koskenniemi's two-level morphology [5-8]. What is missing however is the framework in which Bangla morphology can be implemented using Bangla language interface. The available processors were created before the widespread use of Unicode [9], predominantly using the Latin script. This creates an obstacle in creating us-able local language interfaces, making it difficult to experiment with the morphology of languages that use complex scripts, such as the Indic scripts including Bangla. Instead of creating yet another two-level morphological processor, we chose instead to Jkimmo by harnessing the existing PC-KIMMO implementation [8], using the generative rules defined by existing efforts, and created a software interface that allows Bangla language interface to PC-KIMMO. Our implementation uses Java Native Interface [10] as the bridge between PC-KIMMO and the Unicode-enabled user interface, allowing the user to experiment in any script supported by the Unicode standard. Since the analysis framework uses standard internationalization (i18n) schemes, it is trivially localized to any language by using property files for interface definitions, and transliteration schemes for the Latin-Unicode-Latin conversion needed to interface to PC-KIMMO backend. In section 2, we review some related work including work on Bangla morphological analyzers, followed by our methodology and implementation details in sections 3 and 4, and then conclude with some discussion of Jkimmo. 2. Related work Pykimmo [11] is a python implementation of PC- KIMMO developed by Carl de Marcken, Beracah Yankama, and Rob Speer at Massachusetts Institute of Technology. It was designed for laboratory experimentation with two-level morphological rules. However, since Pykimmo uses Latin scripts for both input and output, it requires the use of transliteration and English language user interface to interact with the system, thereby limiting its use. Another limitation of Pykimmo is that it s based PC-KIMMO version 1, which implements the two-level rules and the lexicon, but does not implement the grammar needed to describe non-concatenative and otherwise complex morphology. An effort for creating an interface for Bangla

morphological analysis has been developed at the Indian Institute of Technology - Kharagpur [12], which provides a web interface to the underlying morphological engine using the itrans transliteration scheme. Another such effort is the Xerox Arabic Morphological Analyzer and Generator [13], created with the Xerox Finite-State Technology. It has a Java Applet interface and uses ISO-8859-6 and Unicode character encodings. It is notable that none of these systems, unlike Jkimmo, is easily extendible to other languages using Unicode-encoded input and output. 3. Methodology 3.1. PC-KIMMO overview PC-KIMMO is a morphological analyzer based on Kimmo Koskenniemi's Two-Level Morphology model [5]. The first implementation of the two-level model was PC-KIMMO version 1, developed by the Summer Institute of Linguistics in 1990. PC-KIMMO implemented the rules and lexicon components of the two-level model using two files: (i) the rules file (.RUL) which specifies all the orthographic rules, and (ii) the lexical file (.LEX) which specifies all the lexicons, classification of lexicons and morphotactic constraints of these classes. [5] The structure of PC-KIMMO is shown in Figure 1. Figure 1: Structure of PC-KIMMO parser version 1 One limitation of the version 1 was its inability to perform syntactic parsing using a word grammar. This limitation was removed in PC-KIMMO version 2 with the introduction of a word grammar, adding it as the third component. [7] The version 2 included a featurestructure unification based chart parser capable of producing parse trees based on Shieber s PATR-II formalist. [14] The word grammar is specified using a grammar (.GRM) file, so this version requires a total of 3 files for morphological processing. PC-KIMMO has two functional components: generation and recognition. 3.2. Generation and Recognition The generator uses the two-level rules to recursively compute the surface form from the lexical form. The rules are encoded using a finite automata that can be automatically generated from the rules. The recognizer performs the inverse task to compute the lexical form from the surface form. Unlike the generator, the recognizer needs a lexicon in addition to the two-level rules. 4. Implementation JKimmo is a graphical user interface (GUI) implemented in the JAVA programming language, using PC-KIMMO version 2 as the back end. PC-KIMMO has tree main component: two level orthographic rule, lexicon and grammar. These are also the main components of Jkimmo; in addition, JKimmo has another component the transliteration scheme. The rule file must be loaded for morphological generation and both the rule and lexicon files must be loaded for morphological recognition. For generation, JKimmo does not need the grammar file; for recognition, the grammar file is optional. Since it uses PC-KIMMO as the backend, JKimmo automatically uses feature unification grammar. 4.1. JKimmo components 4.1.1. Transliteration file. The original PC-KIMMO software is written in C programming language and uses only Latin alphanumeric characters for input and output purposes. For inputs using scripts other than Latin, the user has to come up with his/her own transliteration scheme that uses Latin characters corresponding to characters of the non-latin script. Viewing and understanding the input and output strings in such a way can be cumbersome and non-intuitive for the user. JKimmo solves this problem in a modular, abstract fashion. It requires that the whole transliteration scheme be written down in a separate file. The user can then load that transliteration file. Once the transliteration file is loaded, the user can input strings and view output strings in his preferred language in an intuitive way. Transliteration scheme for Bengali language is given in Table 1. 4.1.2. Rule file. Two level orthographic rules are required for JKimmo. The rule file is same as PC- KIMMO rule file, reproduced here from the reference manual: the general structure of the rules file is a list of declarations composed of a keyword followed by data. The set of valid keywords in a rules file includes COMMENT, ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END. The COMMENT, SUBSET and RULE declarations are optional and also can be used more than once in a rules file. The END declaration is also optional, but can only be used once. [7] PC-KIMMO only recognizes Latin characters in rule file. To implement rule for language that uses other than Latin script we must follow the transliteration scheme. There is a free rule compiler for PC-KIMMO called kgen is available. It takes rule specification and it generate rule for PC-KIMMO. There are more free tools available that can be used for rule generation.

Figure 2: Communication protocol of JKimmo and PC-KIMMO Figure 3: Main components of JKimmo 4.1.3. Lexicon file. The lexicon contains the indivisible words and morphemes in their lexical forms, i.e., the lexical items, as well as the morphotactic constraints. Its primary task is to decompose a word into its constituent morphemes using a simple positional analysis. The positional analysis need only go far enough to ensure that all correct parses are produced but not too many incorrect parses. Co-occurrence restrictions between morpheme positions are best handled in the word grammar, not the lexicon, because that will raise complexities of morphotactic analysis. The format for the lexicon is reproduced from the reference manual: A lexicon consists of one main lexicon file plus one or more files of lexical entries. The general structure of the main lexicon file is a list of keyword declarations. The set of valid keywords is ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END. [7] To write lexicons that will be used in JKimmo for language that use other than Latin script then we have to follow the transliteration scheme. 4.1.4. Grammar file. The word grammar is encoded in the grammar file, which is optional for PC-KIMMO and, consequently, JKimmo. The grammar file has three sections: (i) feature abbreviations, (ii) category templates, and (iii) grammar rules. As in any featurestructure language, the grammar rules specify the feature constraints. Table 1: Bengali transliteration scheme Bangla Latin Bangla Latin Bangla Latin Bangla Latin Bangla Latin ^ a G N R a A I G t L å F I? T ш S i H u C d $ и L U C D S u M R J n H Q e J p ' V E Q P " e W o a V b Y X O W B % o Z к k X m & F х K Z y ~ Figure 4a: Generation example Figure 4b: Recognition example

4.1.5. Localized interface. JKimmo provides the choice of language for its interface. Currently JKimmo only support Bangla and English language for its interface. New language can be added by adding a new java ResourceBundles property file for that language [15]. 4.2. Algorithm The algorithms used by the underlying morphological processor are described in [7]. JKimmo communicates with the PC-KIMMO API using the two data structures: KimmoData and KimmoResult. The KimmoData data structure collects the information used for data processing within the PC-Kimmo functions, and designed to hold as much of the processing parameters as possible to reduce the number of parameters needed for each function. The KimmoResult data structure contains a single result from one of the PC-Kimmo processing functions (applykimmogenerator, applykimmorecognizer). It can be used to build a linked list for ambiguous results. These algorithms pertain only to the communication between JKimmo interface and PC-KIMMO library. We have used JNI as a bridge between JKimmo interface and PC-KIMMO library. We have used both PC-KIMMO data structures to access internal components. The JNI also have some native methods for communication. This algorithm is for languages that do not use Latin script. For languages that uses Latin script just omit transliteration related portion. 4.2.1. The generator. This algorithm has some perquisites like transliteration file and rule file must be loaded. The algorithm works as follows: 1 If the input specified in the lexical form is empty but user click on generate button 1.1 JKimmo will do nothing 2 For each input pair containing the first character in the lexical form as the lexical character, do the following steps: 2.1 If input string is correct: 2.1.1 Translate the Unicode string to Latin characters string. 2.1.2 JKimmo interface calls generate native method with translated string as argument. 2.1.3 Native method calls applykimmogenerator function of PC-KIMMO library. PC-KIMMO library save the result into result data structure. 2.1.4 JKimmo interface now call getresult native method to get the result. 2.1.5 Native method extracts the result (Latin character string) from KimmoResult data structure and sends to JKimmo interface. 2.1.6 JKimmo interface translate the Latin characters string to Unicode string and show the result. 2.2 If input string is wrong 2.2.1 JKimmo will show a warning message and do nothing. Figure 4a shows an example of JKimmo generation. 4.2.2. The recognizer. This algorithm also has some perquisites like transliteration file, rule file, lexicon must be loaded and grammar is optional. The algorithm works as follows: 1 If the input (surface) is empty but user click on recognize button 1.1 JKimmo will do nothing 2 For each input pair containing the first character in the surface form as the lexical character, do the following steps: 2.3 If input string is correct 2.3.1 Translate the Unicode string to Latin characters string. 2.3.2 JKimmo interface calls recognize native method with translated string as argument. 2.3.3 Native method calls applykimmorecognizer function of PC-KIMMO library. PC-KIMMO library save the results into result data structure. 2.3.4 JKimmo interface now call getresult and getgloss native method to get the results. 2.3.5 Native method extracts the results (Latin character string) from KimmoResult data structure and send to JKimmo interface. 2.3.6 JKimmo interface translate the Latin characters string to Unicode string and show the results. 2.4 If input string is wrong 2.4.1 JKimmo will show a warning message and do nothing. Figure 4b shows an example of JKimmo recognition.

5. Conclusion Our goal is to develop a reusable and robust opensource framework for computational morphological analysis of Bangla. We started with the existing efforts in defining the Bangla generative morphology for the rules, PC-KIMMO version 2 for the two-level morphological processor for the backend, and developed a Unicode-based multilingual interface, JKimmo, that can be used to experiment with Bangla morphology using Bangla language interface. JKimmo has been developed from the ground up as internationalized software, which means that it can be localized in any language using standard localization idioms such as property files and transliteration schemes. Some of the limitations of the current implementation of JKimmo are however noteworthy. One of most useful features of PC-KIMMO version 2 is creating the parse tree when recognizing a surface form. JKimmo currently only shows the lexical form and its glosses. The other limitation is in error handling, specifically where the errors are generated by the back-end. The next release of JKimmo will correct both of the limitations. 6. Acknowledgement This work has been supported in part by the PAN Localization Project (www.panl10n.net), grant from the International Development Research Center, Ottawa, Canada, administrated through Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan. We would also like to thank Arnab Zaheen, Naira Khan and other members of our research group. 7. References [1] P. Sengupta and B.B. Chaudhuri, Morphological processing of Indian languages for lexical interaction with application to spelling error correction, Sadhana, Vol. 21, Part. 3, 1996, pp. 363-380. [2] S. Bhattacharya, M. Choudhury, S. Sarkar and A. Basu, Inflectional Morphology Synthesis for Bengali Noun, Pro-noun and Verb Systems, Proc. of the National Conference on Computer Processing of Bangla, Dhaka, Bangladesh, March, 2005, pp. 34-43. [3] S. Dasgupta and M. Khan, Morphological Parsing of Bangla Words Using PC-KIMMO, Proc. 7th International Conference on Computer and Information Technology, ICCIT 2004, Dhaka, Bangladesh, Dec., 2004. [4] S. Dasgupta and M. Khan, Feature Unification for Morphological Parsing in Bangla, Proc. 7th International Conference on Computer and Information Technology, ICCIT 2004, Dhaka, Bangladesh, 2004. [5] K. Koskenniemi, Two-level morphology: a general computational model for word-form recognition and production., Publication No. 11. Helsinki: University of Helsinki Department of General Linguistics, 1983. [6] E.L. Antworth, PC-KIMMO: a two-level processor for morphological analysis, Occasional Publications in Academic Computing No. 16, Dallas, TX: Summer Institute of Linguistics, 1990. [7] E.L. Antworth. Morphological Parsing with Unification-based Word Grammar, A paper presented at North Texas Natural Language Processing Workshop, May 23, 1994. [8] PC-KIMMO available at http://www.sil.org/pckimmo/ [9] Unicode 4.1 specification, available from http://www.unicode.org/ [10] Java Native Interface Documentation available at http://java.sun.com/j2se/1.4.2/docs/guide/jni/ [11] Pykimmo is available at http://web.mit.edu/course/6/6.863/pykimmo/ [12] Bengali Morphological Analyzer demo, available at www.mla.iitkgp.ernet.in/morph_analyzer.html [13] K.R. Beesley, Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001, ACL Work-shop on Arabic Language Processing: Status and prospects (Invited talk), 2001. [14] S.M. Shieber, An introduction to unificationbased approaches to grammar, CSLI Lecture Notes No. 4. Stanford, CA, 1986. [15] Java Localization documentation at http://java.sun.com/developer/technicalarticles/intl/re sourcebundles/