Khmer Part-of-Speech Tagger

PAN Localization Project
Project No:
Ref. No: PANL10n/KH/Report POS
Khmer Part-of-Speech Tagger
20 September 2008
Cambodia Country Component
PAN Localization Project
PAN Localization Cambodia (PLC) of IDRC

Table of Contents
1 Introduction
  1.1 POS overview
  1.2 Khmer language overview
2 Khmer POS Tagger
  2.1 Khmer POS tagset
  2.2 Semi-automatic tagging
    2.2.1 Treetagger
    2.2.2 Corpus data
    2.2.3 Training and results
3 Conclusion

1 Introduction

1.1 POS overview

Every language has its parts of speech, such as verb, noun and adjective. Part-of-speech (POS) tagging is the process of automatically assigning a part of speech to each word, based on both the word's definition and its context. It can be used for text parsing, information extraction, text summarization, grammar checking and machine translation [4]. There are several models for building an automatic POS tagger, such as the Hidden Markov Model (HMM), transformation-based tagging and decision trees [4]. We chose the TreeTagger application, which is based on the decision tree model, for semi-automatic tagging of the Khmer language, for two reasons. First, it is freely available for research purposes. Second, it has been used successfully to tag many languages, such as German, English, French and a few others [2].

1.2 Khmer language overview

Khmer, also known as Cambodian, is the official language of the Kingdom of Cambodia. Khmer grammar raises many issues. Because of limited time we cannot describe all of them; instead, we summarize those we faced during our research on Khmer POS tagging.

Khmer words are written continuously, without a word delimiter. Resources for the Khmer language are limited: the CHOUN NATH Dictionary is the only dictionary recognized as official in Cambodia [4]. Khmer has many ambiguities, i.e. a word can have a different POS depending on the context of the sentence. In addition, some words can be joined together to form a word with another meaning and POS. These problems are only a subset of those we faced during our research on the Khmer POS tag set.

We defined 21 tags for the Khmer POS tag set by drawing on the rules in a Khmer grammar book, a Khmer dictionary and some documents related to Khmer grammar.

2 Khmer POS Tagger

2.1 Khmer POS tagset

The tag set we defined contains only major tags, based on three main materials listed in the references section [3] [8] [9]. We do not include every sub-category POS (a POS that is a division of a major POS) defined by other authors [4], because ongoing debate makes it hard to consider them standardized. There are 21 tags, defined and documented according to a set of rules discussed in the materials used. The following table lists all 21 tags.

Category                 No.  POS Name                 POS Tag
Noun                     1    Noun                     NN
                         2    Proper Noun              NNP
Pronoun                  3    Pronoun                  PRP
                         4    Relative Pronoun         RPN
Verb                     5    Verb                     V
                         6    Auxiliary                AUX
Adverb                   7    Adverb                   RB
Adjective                8    Adjective                JJ
                         9    Quantitative adjective   QAD
                         10   Possessive adjective     PAD
                         11   Determiner adjective     DAD
Ordinal Number           12   Ordinal Number           ON
Preposition              13   Preposition              IN
Interjection             14   Interjection             UH
Conjunction              15   Conjunction              CC
Foreign words            16   Foreign words            FW
Abbreviation             17   Abbreviation             AB
Symbol                   18   Symbol                   SYM
Mark                     19   Mark                     M
Tense expression words   20   Tense expression words   TXW
Additional Word          21   Additional Word          AW

Table 1: Khmer tagset
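As a minimal sketch, the tag set in Table 1 could be held as a simple mapping so that manually tagged data can be checked for tags outside the set. The tags and names are copied from Table 1; the names KHMER_TAGSET and unknown_tags are illustrative and do not come from the report.

```python
# Tags and POS names copied from Table 1 of the report.
# KHMER_TAGSET and unknown_tags are hypothetical names used for illustration.
KHMER_TAGSET = {
    "NN": "Noun",
    "NNP": "Proper Noun",
    "PRP": "Pronoun",
    "RPN": "Relative Pronoun",
    "V": "Verb",
    "AUX": "Auxiliary",
    "RB": "Adverb",
    "JJ": "Adjective",
    "QAD": "Quantitative adjective",
    "PAD": "Possessive adjective",
    "DAD": "Determiner adjective",
    "ON": "Ordinal Number",
    "IN": "Preposition",
    "UH": "Interjection",
    "CC": "Conjunction",
    "FW": "Foreign words",
    "AB": "Abbreviation",
    "SYM": "Symbol",
    "M": "Mark",
    "TXW": "Tense expression words",
    "AW": "Additional Word",
}


def unknown_tags(tags):
    """Return the tags, in first-seen order, that are not in the tag set."""
    seen = []
    for tag in tags:
        if tag not in KHMER_TAGSET and tag not in seen:
            seen.append(tag)
    return seen
```

A check of this kind could be run over the tag column of the manually tagged data before training, to catch typing mistakes in tag names.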

We are not going to explain all the tags, only the last one. Additional Word is a category tag name given by our team. The Khmer language has a POS called NiBatSap. There is no English word for this POS, so we decided to use Additional Word as its name in our POS tag set. An additional word is a word used to clarify the meaning of a noun or verb in a sentence. In the current study, we did not identify the exact rule for this POS (Additional Word). Although we defined 21 tags for Khmer POS, some tags are still missing and may require further research and development.

2.2 Semi-automatic tagging

2.2.1 Treetagger

The TreeTagger is a set of applications developed for annotating text with POS and lemma information. It was developed within the TC project at the Institute for Computational Linguistics of the University of Stuttgart [5]. The TreeTagger consists of two programs: train-tree-tagger for training (2.2.1 Step 1) and tree-tagger for tagging (2.2.1 Step 2). Semi-automatic tagging takes the two main steps given below.

Step 1: Training

The training process requires three files (lexicon, open class tags and tagged data) and produces a parameter file in which the resulting tagger parameters are stored.

Lexicon: a file that contains words with their POS. A word can have more than one part of speech, but each word may appear in the lexicon file only once. The lexicon file is produced from the tagged data.
File format: [Word]([tab][Tag][space][Lemma])+

Figure 1: Lexicon file format

Open class tags: a file that contains a list of open class tags, i.e. the possible tags for unknown words. This information is needed to estimate the possible tags of unknown words. We chose the following tags as the open class tags: NN, V, JJ, M, IN, FW, AW, RB (see 2.1).
File format: ([tag][space])+
Figure 2: Open class tag file format

Tagged data: a file that contains the tagged training data. The data must be in one-word-per-line format: each line contains one token and one tag, in that order, separated by a tab.
File format: [token][tab][tag]
Figure 3: Tagged data file format
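The three training inputs described above are related: the lexicon is derived from the tagged data, and the open class file is a fixed list of tags. The following is a minimal sketch of that derivation, not the tooling the team used. It assumes each lexicon entry after the word is a tag followed by a space and a lemma, and it uses "-" as a placeholder lemma because the report does not say which lemmas were supplied. The function names and the sample tokens w1 and w2 are hypothetical.

```python
from collections import defaultdict

# Open class tags chosen in the report (section 2.2.1).
OPEN_CLASS_TAGS = ["NN", "V", "JJ", "M", "IN", "FW", "AW", "RB"]


def parse_tagged_lines(lines):
    """Parse 'token<TAB>tag' lines into (token, tag) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs


def build_lexicon_lines(pairs, lemma="-"):
    """Build one lexicon line per distinct word.

    Each line is the word followed by a tab-separated list of 'tag lemma'
    entries. The placeholder lemma "-" is an assumption of this sketch.
    """
    tags_by_word = defaultdict(list)
    for token, tag in pairs:
        if tag not in tags_by_word[token]:
            tags_by_word[token].append(tag)
    return [
        word + "".join(f"\t{tag} {lemma}" for tag in tags)
        for word, tags in tags_by_word.items()
    ]


def open_class_line(tags=OPEN_CLASS_TAGS):
    """Return the open class tags as a single space-separated line."""
    return " ".join(tags)
```

Grouping tags by word is what guarantees that no word is duplicated in the lexicon while still allowing a word to carry several tags.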

Output file: a file in which the resulting tagger parameters are stored.

Figure 4: The process model of the train-tree-tagger application (inputs: Lexicon.txt, Open Class.txt, Tagged Data; program: Train-tree-tagger.exe; output: Output.par)

Step 2: Tagging

The application requires the parameter file and an untagged file, and it produces a tagged file.

Parameter file: the output file obtained from train-tree-tagger in Step 1.

Input file: the data in this file must be in one-word-per-line format.
Figure 5: Input file format

Output file: the tagged data produced by running tree-tagger.

Figure 6: The process model of the tree-tagger application (inputs: Input.txt, Khmer.par; program: Tree-tagger.exe; output: Output.txt)
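The two steps can be sketched as two command lines built from the file names shown in Figures 4 and 6. This is an illustration under stated assumptions: the argument order (lexicon, open class file, tagged data, output parameter file for training; parameter file, input, output for tagging) and the -token option reflect the TreeTagger command-line usage as commonly documented and should be checked against the installed version. The helper names and the file name TaggedData.txt are hypothetical.

```python
import subprocess


def train_command(lexicon, open_class, tagged_data, output_par,
                  program="train-tree-tagger"):
    """Build the Step 1 command: three input files, then the parameter file."""
    return [program, lexicon, open_class, tagged_data, output_par]


def tag_command(parameter_file, input_file, output_file,
                program="tree-tagger"):
    """Build the Step 2 command.

    -token asks tree-tagger to print each token next to its tag, so the
    output has the same token/tag layout as the training data.
    """
    return [program, "-token", parameter_file, input_file, output_file]


def run(command):
    """Run a command and raise an error if the program exits with a failure."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run(...) on a machine with TreeTagger installed.
    print(train_command("Lexicon.txt", "Open Class.txt", "TaggedData.txt", "Output.par"))
    print(tag_command("Khmer.par", "Input.txt", "Output.txt"))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces, such as Open Class.txt.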

2.2.2 Corpus data

We took the text from two different sources: the LICADHO reports [6] and the Khmer Rouge Trial website [7]. The documents are written in both official and everyday spoken language. The text must be preprocessed before it can be used as data for the tagger application. The preprocessing steps are conversion and word segmentation.

Conversion: the documents retrieved from both sources are written in non-Unicode encodings, so we need to convert them to Unicode.

Word segmentation: as mentioned in the Khmer language overview (1.2), Khmer words are written continuously without a delimiter. The tagger application requires input in which every word is separated from the next by a new line, so we need to split each sentence into words. After that, the untagged text is ready for tagging.

We needed a manually tagged corpus of about 20,000 words for the semi-automatic tagging.

2.2.3 Training and results

Figure 7: The process model of training corpus (manually tagged words (first 20,000 words) -> lexicon and open class files -> train-tree-tagger -> parameter file; untagged words and parameter file -> tree-tagger -> tagged by program -> manually verified)
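The cycle in Figure 7 can be sketched as one function per round: train on everything tagged so far, tag the next batch, verify it by hand, measure accuracy, and add the verified batch to the training data. This is a sketch only. The train, tag and verify arguments are stand-ins for train-tree-tagger, tree-tagger and the manual correction step, and the accuracy definition used here (the share of program-assigned tags left unchanged by manual verification) is an assumption, since the report does not define how accuracy was computed.

```python
def tagging_accuracy(program_tags, verified_tags):
    """Percentage of program-assigned tags that match the verified tags."""
    if len(program_tags) != len(verified_tags):
        raise ValueError("tag sequences must have the same length")
    if not program_tags:
        raise ValueError("cannot compute accuracy on an empty batch")
    correct = sum(p == v for p, v in zip(program_tags, verified_tags))
    return 100.0 * correct / len(program_tags)


def bootstrap_round(tagged, untagged_batch, train, tag, verify):
    """Run one round of the semi-automatic tagging cycle.

    tagged          list of (word, tag) pairs verified so far
    untagged_batch  list of words to tag in this round
    train           callable: tagged pairs -> model (stand-in for train-tree-tagger)
    tag             callable: (model, words) -> list of tags (stand-in for tree-tagger)
    verify          callable: (words, tags) -> corrected tags (stand-in for manual checking)

    Returns the enlarged tagged data and the accuracy of this round.
    """
    model = train(tagged)
    program_tags = tag(model, untagged_batch)
    verified_tags = verify(untagged_batch, program_tags)
    accuracy = tagging_accuracy(program_tags, verified_tags)
    enlarged = tagged + list(zip(untagged_batch, verified_tags))
    return enlarged, accuracy
```

Because each round adds only verified data, the training set grows by the size of the batch every time the function is called.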

We started from the first 20,000 tagged words to generate a lexicon file. We got a parameter file from the train-tree-tagger application. We used the parameter file to tag the next 10,000 untagged words with the tree-tagger application, which gave us 10,000 tagged words. Then we corrected those 10,000 tagged words manually and checked the accuracy. The table below shows the results of our experiment.

Target data   Tagged data                Lexicon   Words to tag   Accuracy
31,230        20,414 (manually tagged)   2,531     10,816         90.39%
41,413        31,230                     3,362     10,183         94.52%
51,835        41,413                     3,854     10,422         96.61%
62,522        51,835                     4,259     10,687         86.26%
73,206        62,522                     5,233     10,684         92.50%
78,492        73,206                     5,934     5,286          98.80%

Table 2: Result of semi-automatic tagging

3 Conclusion

To obtain higher accuracy, a larger amount of training data is required. In addition, the data should be taken from different domains. For our research, we used the decision tree model to tag the data. The application cannot tag raw data in which the text is written continuously without a delimiter (word marker); every word must be separated by a space character or a new line. We conclude that Khmer word segmentation needs to be improved to give more accurate results. Besides that, new words are created every day, such as names of people, names of organizations, or words translated from foreign languages. This is another reason to expand the training data to get higher accuracy, and we plan to extend the training corpus to 150,000 words.

References

[1]. Saleh Yousef Al-Hudail. A Hidden Markov Model Based POS Tagger for Arabic.
[2]. TreeTagger - a language independent part-of-speech tagger. http://www.ims.uni-stuttgart.de/projekte/corplex/treetagger/decisiontreetagger.html, 02/09/2008.
[3]. Dictionnaire Cambodgien, (5th ed.). 1967. Institute of Buddhist. Cambodia Printhouse, 165.
[4]. Chena NOU. Khmer Part-of-Speech Tagging. Global Information and Telecommunication Studies, WASEDA UNIVERSITY.
[5]. http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm, 02/09/2008.
[6]. http://www.licadho.org, 02/09/2008.
[7]. http://www.krtrial.info, 02/09/2008.
[8]. Khin Sok. 2004. Khmer Language Grammar. Royal Academy of Cambodia.
[9]. Khmer Grammar Study Program for Freshmen. 2005. Institute of Foreign Languages (IFL), Department of English.