A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles


Rayner Alfred 1, Adam Mujat 1, and Joe Henry Obit 2

1 School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia
2 Labuan School of Informatics Science, Universiti Malaysia Sabah, Labuan, Malaysia
ralfred@ums.edu.my, adammujat@gmail.com, joehenryobit@yahoo.com

Abstract. Malay is an Austronesian language spoken in several countries of the South East Asia region, including Malaysia, Indonesia, Singapore, Brunei and Thailand. Traditional linguistics is well developed for Malay, but very few resources and tools are available or accessible for computational linguistic analysis of the language. Assigning a part of speech (POS) to each running word in a sentence is one stage of the Natural Language Processing (NLP) pipeline, and it has not been well investigated for Malay. This paper outlines an approach to POS tagging for Malay text articles: we apply a simple Rule-based Part of Speech (RPOS) tagger to Malay text articles. POS tagging can be described as the automatic annotation of each word in a text document with its syntactic category. A rule-based POS tagger generally combines a POS tag dictionary with a set of rules that identify the part of speech of each word. We propose a framework that applies Malay affixing rules to identify the POS tag of a word, and relations between words to select the best POS tag for words that have two or more valid POS tags. The results show that the rule-based POS tagger reaches 89% accuracy on news articles and 86% on biomedical articles, which is higher than the 67.9% reported for a statistical trigram HMM tagger for Malay. This indicates that the proposed RPOS tagger can predict the POS of unknown words with promising accuracy.
Keywords: Rule-Based POS Tagger, Computational Linguistics, Malay Affixing Rules, Malay Word Relation.

A. Selamat et al. (Eds.): ACIIDS 2013, Part II, LNAI 7803, pp. 50-59, 2013. © Springer-Verlag Berlin Heidelberg 2013

1 Introduction

In Malaysia, the Malay language is officially known as Bahasa Malaysia, which translates as the "Malaysian language". Standard Malay has about 18 million speakers, and about 170 million people speak Indonesian, which is a form of Malay. Malay is the national language of Malaysia and Indonesia and is ranked fourth, after Spanish, among the most widely spoken languages on earth. Nevertheless, it is one of the least studied and least known languages, to the extent that it is even left out of rank orders of the world's major languages. Traditional linguistics is well developed for Malay, but very few resources and tools are available or accessible for computational linguistic analysis of the language. For example, few tools exist for part of speech (POS) tagging of Malay text articles. POS tagging is the process of assigning to each word in a text its part of speech tag, based on the word's definition and its relation to neighbouring words.

A POS tagger for Malay has several end applications. Firstly, it can be used in a grammar checker that validates word relations based on word class, by checking the word class before and after each word. Secondly, it can be used to classify questions by identifying the question focus [6] (e.g., a noun and a verb after the interrogative word and keyword can be used to identify the question focus).

For English, a simple rule-based POS tagger was first introduced by Eric Brill [1], who showed that a rule-based tagger for English can perform as well as taggers based on probabilistic or statistical models. Statistical tagging has been widely applied to English text to produce tagged corpora using various approaches. One of the early techniques was the Hidden Markov Model (HMM) algorithm [12], which achieved an accuracy of more than 96% for English text articles. For Malay, a statistical POS tagger using a trigram Hidden Markov Model has been designed, but it achieved an accuracy of only 67.9%. Efforts on statistical POS taggers have mainly focused on European languages such as English, German and Spanish [7,8,9,10,11]. This research has been driven mainly by the availability of language resources for those languages, such as dictionaries and annotated corpora.
Less-resourced languages such as Malay still need more research support in order to assist the development of tools for computational linguistic analysis. In this work, a framework for rule-based POS tagging of Malay is outlined, since only a very limited POS-tagged corpus is accessible to Malay language researchers. This paper is organized as follows. Section 2 explains the background of POS tagging for Malay. Section 3 outlines the rule-based POS tagger framework for Malay text articles. Section 4 describes the experimental setup and evaluation. Section 5 discusses the results obtained from the experiments, and Section 6 concludes the paper with future work.

2 Part of Speech Tagging for Malay Language

Part of Speech (POS) tagging is the process of tagging each word in a text with its word class, or part of speech, based on word definition and word relation. A simple rule-based POS tagger for Malay applies a POS tag dictionary and affixing rules in order to identify the Malay word definition. The POS tag dictionary is manually extracted from the Malay thesaurus and stored in a text format [2]. Fig. 1 illustrates a snapshot of the Malay POS tag dictionary. Table 1 shows the list of POS tags for Malay words.
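As a minimal sketch of how such a text-format dictionary might be read, the Python fragment below parses word and tag pairs into a lookup table. The file layout (one word followed by its tags, with alternatives separated by "/") and the names SAMPLE and parse_dictionary are assumptions made for illustration; the paper does not specify the storage format. The entries are those of Fig. 1, plus the ambiguous word suka from the example in this section.

```python
# Hypothetical plain-text layout: one "word TAG" pair per line, with
# alternative tags separated by "/". This layout is an assumption.
SAMPLE = """\
adunan NN
agak GUT
agaknya PEN
mengagak-agak VB
teragak VB
teragak-agak VB
suka JJ/VB/RB
"""


def parse_dictionary(text):
    """Map each word to the list of POS tags recorded for it."""
    dictionary = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, tags = line.split(maxsplit=1)
        dictionary[word.lower()] = tags.split("/")
    return dictionary


pos_dict = parse_dictionary(SAMPLE)
print(pos_dict["agak"])         # ['GUT']
print(pos_dict["suka"])         # ['JJ', 'VB', 'RB']
print(pos_dict.get("makanan"))  # None: word not in the dictionary
```

A lookup therefore has three possible outcomes: a single tag, several candidate tags, or no entry at all.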

adunan | NN
agak | GUT
agaknya | PEN
mengagak-agak | VB
teragak | VB
teragak-agak | VB

Fig. 1. Malay POS tag dictionary snapshot

All the affixing rules applied in the proposed approach were studied and manually extracted from Tatabahasa Dewan Edisi Ketiga [3]. The word relations are derived from word types: each word type may co-occur only with certain other word types (see Table 2). For instance, consider the following Malay phrase ("I like to eat"):

    saya suka makan
    saya (NN) suka (JJ/VB/RB) makan (VB)

Here saya is a noun that co-occurs with the word suka, which may be an adjective, a verb or an adverb. However, makan is a verb, and of these three candidates only an adverb is allowed to co-occur with makan. Thus, we obtain the following word relations:

    saya suka makan
    saya (NN) suka (RB) makan (VB)

Table 1. POS tag list for Malay

Word type (English) | Subtype (English) | Subtype (Malay) | Tag
Noun | - | - | NN
Noun | Proper noun | - | NNP
Verb | - | - | VB
Adjective | - | - | JJ
Function | Conjunction | Kata hubung | CC
Function | Interjection | Kata seru | UH
Function | Interrogative | Kata tanya | WP
Function | Command | Kata perintah | CO
Function | - | Kata pangkal ayat | PNG
Function | Auxiliary | Kata bantu | AUX
Function | (Amplifier) | Kata penguat | GUT
Function | Particles | Kata penegas | RP
Function | Negation | Kata nafi | NEG
Function | - | Kata pemeri | MER
Function | Preposition | Kata sendi nama | IN
Function | - | Kata pembenar | BNR
Function | Direction | Kata arah | DR
Function | Cardinal number | Kata bilangan | CD
Function | - | Kata penekan | PEN
Function | - | Kata pembenda | BND
Adverb | Adverb | - | RB
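The saya suka makan example can be sketched in Python as follows. This is one possible reading of the word type relation check, not the authors' implementation: it assumes that a candidate tag is kept only if it appears in the Table 2 row of both the preceding and the following word type. The names VALID_NEIGHBOURS and select_tags are hypothetical, and only the Noun and Verb rows of Table 2 are included.

```python
# Two rows of Table 2 (Noun and Verb). Each value is the set of word types
# that may co-occur with the key word type.
VALID_NEIGHBOURS = {
    "NN": {"JJ", "RB", "VB", "NN", "IN"},
    "VB": {"AUX", "RB", "NN", "PEN", "BND"},
}


def select_tags(candidates, prev_tag=None, next_tag=None):
    """Keep the candidate tags that are valid next to both neighbours.

    Assumption: a missing neighbour (sentence boundary) imposes no constraint.
    """
    kept = []
    for tag in candidates:
        ok_prev = prev_tag is None or tag in VALID_NEIGHBOURS.get(prev_tag, set())
        ok_next = next_tag is None or tag in VALID_NEIGHBOURS.get(next_tag, set())
        if ok_prev and ok_next:
            kept.append(tag)
    return kept


# saya (NN) suka (JJ/VB/RB) makan (VB)
print(select_tags(["JJ", "VB", "RB"], prev_tag="NN", next_tag="VB"))  # ['RB']
```

All three candidates are valid after the noun saya, but only RB is listed for the verb makan, so suka is tagged as an adverb, matching the result given in the text.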

Table 2. Word Type Relation

Word type | Valid sequences of word types
Noun (NN) | adjective (JJ), adverb (RB), verb (VB), noun (NN), preposition (IN)
Verb (VB) | auxiliary (AUX), adverb (RB), noun (NN), penekan (PEN), pembenda (BND)
Adjective (JJ) | penguat (GUT), preposition (IN)
Adverb (RB) | verb (VB), preposition (IN), adjective (JJ), noun (AUX)
Direction (DR) | noun (NN), preposition (IN)
Preposition (IN) | noun (NN), verb (VB), adjective (JJ)
Auxiliary (AUX) | adjective (JJ), verb (VB), preposition (IN)
Cardinal number | noun (NN)
Penekan (PEN) | adverb (RB), noun (NN), conjunction (CC)
Pembenda (BND) | conjunction (CC), noun (NN)
Conjunction (CC) | noun (NN), verb (VB), preposition (IN), adjective (JJ)
Penguat (GUT) | adjective (JJ)
Interrogative (WP) | noun (NN), verb (VB)
Pangkal ayat (PNG) | noun (NN)

Most Malay POS tagging systems rely on a POS tag dictionary and on affixing rules for POS identification (see Section 3), because resources such as POS-tagged corpora are unavailable.

2.1 Analysis of Affixed Words

Ranaivo-Malancon analyzes Malay affixed words by identifying them, segmenting them and finally interpreting them [4]. In Malay, the form of a word can be simple or complex. Affixed words are complex words generated by a morphological process called affixation, which includes prefixation, suffixation, circumfixation and infixation. Prefixation is the process of adding a prefix at the left side of the base, and suffixation is the adding of a suffix at the right side of the base (see Fig. 2). Circumfixation is the simultaneous adding of a discontinuous morphological unit, called a circumfix, at the left and right sides of the base [4]. A circumfix is a combination of a prefix and a suffix treated as a single morphological unit. In Malay, infixation is the insertion of an infix just after the first consonant of the base.
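The four affixation processes defined above can be illustrated with plain string operations. The sketch below is only a reading aid: the function names are hypothetical, the example words (berjalan, makanan, keadilan, telunjuk) are common Malay words chosen for illustration rather than taken from the paper, and spelling changes that some prefixes trigger on the base are deliberately ignored.

```python
def prefixation(base, prefix):
    """Add the prefix at the left side of the base."""
    return prefix + base


def suffixation(base, suffix):
    """Add the suffix at the right side of the base."""
    return base + suffix


def circumfixation(base, prefix, suffix):
    """Add both halves of a circumfix around the base at once."""
    return prefix + base + suffix


def infixation(base, infix):
    """Insert the infix just after the first consonant of the base.

    Assumption: the base starts with a single consonant letter.
    """
    return base[0] + infix + base[1:]


print(prefixation("jalan", "ber"))         # berjalan
print(suffixation("makan", "an"))          # makanan
print(circumfixation("adil", "ke", "an"))  # keadilan
print(infixation("tunjuk", "el"))          # telunjuk
```

A tagger works in the opposite direction: it recognises these affixes on a surface word in order to guess its word class, which is what the affixing rules in Section 3 do.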
Fig. 2. Clitics, Affixing and Particle in Malay word [diagram labels: affixed word; Proclitic (ku, kau); Circumfix; Prefix; Infix; Base; Suffix; Enclitic (ku, kau, mu, nya); Particle]

In her work, Ranaivo-Malancon identified the affixed words, the clitics and the particles, and the relations among them. A word containing a clitic and a particle cannot be affixed, but an affixed word may carry a clitic and a particle. Fig. 2 shows that an affixed word can be the host of one and only one clitic and/or one and only one particle. A clitic attached before the base is called a proclitic, and a clitic attached after the base is called an enclitic. Fig. 2 shows the structure of an affixed word in Malay with the addition of a clitic (proclitic or enclitic) and a particle [4]. In Malay, there are two proclitics, four enclitics and three particles. Ku and kau are the two proclitics, and they generate passive words. On the other hand, ku, kau, mu and nya are the four enclitics, which function as an object pronoun of an active verb and as a possessive adjective. In addition, the enclitic nya also functions as a subject pronoun of a passive verb and as a definite article. Finally, the Malay particles are kah and tah, which generate question markers, and lah, which generates imperative and predicative markers.

3 A Rule-Based Part of Speech (RPOS) Tagger for Malay Texts

In this paper, the proposed rule-based POS tagger for Malay applies three general tagging conventions of the Penn Treebank [6]:
a) the part of speech tags are defined based on their syntactic distribution rather than their semantic function,
b) the tagger capitalizes words tagged as proper nouns, and
c) the tagger tags abbreviations and initials.

In addition, the proposed rule-based POS tagger for Malay has additional POS tags which are not included in the Penn Treebank tags [3]. These tags are:
a) kata perintah, command (CO)
b) kata pangkal ayat (PNG)
c) kata bantu, auxiliary (AUX)
d) kata penguat (GUT)
e) kata nafi, negation (NEG)
f) kata pemeri (MER)
g) kata pembenar (BNR)
h) kata arah, direction (DR)
i) kata penekan (PEN)
j) kata pembenda (BND)

In this paper, we outline a simple rule-based POS tagger for Malay. The rules involve the affixing and word relation rules [3].
Malay affixing uses prefixes, infixes, suffixes and combinations of these. In this paper, only prefixes, suffixes and combinations are considered, because infixation is not a productive affixing process and it can cause ambiguity in POS tagging, as the same infix may occur in nouns, verbs and adjectives. The affixing rules cover nouns (Table 3), proper nouns, adjectives (Table 4), verbs (Table 5), pembenda, penegas and penekan.

The penegas rule matches a sequence of characters ending in kah, lah or tah. The pembenda rule matches a non-noun root word ending with nya. Finally, the penekan rule matches a noun root word ending with nya. In addition to the affixing rules, we also include the word type relation rules. A word type relation rule is used to select the best POS tag for a word that has more than one POS tag. This is done by checking the validity of the word type relation before and after the word, as explained in Section 2. The word type relation list shown in Table 2 was extracted from Tatabahasa Dewan Edisi Ketiga [3] and is not exhaustive.

Table 3. Noun Affixing Identification Rules

Rule | Prefix | Next character | Sequence of characters | May end with | Suffix
1a | pe | ny, ng, r, l and w | a-z | an | -
1b | pem | b and p | a-z | an | -
1c | pen | d, c, j, sy and z | a-z | an | -
1d | peng | g, kh, h, k and vowel | a-z | an | -
1e | penge | - | a-z (3 to 4 characters) | an | -
1f | pel or ke | - | a-z | an | -
1g | juru, maha, tata, pra, swa, tuna, eka, dwi, tri, panca, pasca, pro, anti, poli, auto, sub, supra | - | a-z | - | -
1h | not starting with me, meng, mem, menge, ber, be, di, diper | - | a-z | - | an, at, in, wan, wati, isme, isasi, logi, tas, man, nita, ik, is, al

Table 4. Adjective Affixing Identification Rules

Rule | Prefix | Next character | Sequence of characters | May end with | Suffix
2a | ter, se, bi | - | a-z | - | -
2b | ke | - | a-z | an | -
2c | not starting with di and men | - | a-z | - | in, at, ah, iah, a sequence of vowels then wi, or a sequence of consonants ending with i
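Rules of this kind translate naturally into regular expressions. The sketch below encodes rules 1a to 1d of Table 3 under stated assumptions: the "may end with an" column is read as an optional ending (so it is already covered by the free run of letters), the names NOUN_RULES and matching_noun_rules are hypothetical, and the test words are common Malay words chosen for illustration. It is not the authors' implementation and it omits the remaining rules.

```python
import re

# Table 3, rules 1a-1d: a prefix, then one of the listed next characters,
# then any run of lowercase letters (which also covers the optional "an").
NOUN_RULES = {
    "1a": re.compile(r"^pe(?:ny|ng|r|l|w)[a-z]*$"),
    "1b": re.compile(r"^pem[bp][a-z]*$"),
    "1c": re.compile(r"^pen(?:d|c|j|sy|z)[a-z]*$"),
    "1d": re.compile(r"^peng(?:g|kh|h|k|[aeiou])[a-z]*$"),
}


def matching_noun_rules(word):
    """Return the ids of all noun rules whose pattern matches the word."""
    word = word.lower()
    return [rule_id for rule_id, pattern in NOUN_RULES.items()
            if pattern.match(word)]


for w in ["pembaca", "penjual", "pengajar", "makan"]:
    print(w, matching_noun_rules(w))
```

Under this reading, a word such as pengajar satisfies both rule 1a (pe followed by ng) and rule 1d (peng followed by a vowel). Both rules assign the noun tag, so the overlap is harmless here, but an implementation covering several word classes would need an explicit rule order, which the paper does not specify.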

Table 5. Verb Affixing Identification Rules

Rule | Prefix | Next character | Sequence of characters | May end with | Suffix
3a | me | ny, ng, r, l, w, y, p, t, k, s | a-z | - | -
3b | mem | b, f, p and v | a-z | kan or i | -
3c | men | d, c, j, sy, z, t and s | a-z | kan or i | -
3d | meng | g, gh, kh, h, k and vowel | a-z | - | -
3e | menge | - | a-z (3 to 4 characters) | - | -
3f | memper or diper | - | a-z | kan or i | -
3g | ber | not r | a-z | kan or an | -
3h | bel | - | a-z | - | -
3i | ter | not r | a-z | - | -
3j | ke | - | a-z | - | an
3k | - | - | a-z | - | i or kan
3l | di or diper | - | a-z | kan or i | -

Fig. 3. The Rule-Based POS Tagger Framework for Malay Text Articles [flowchart: WORD -> POS tag dictionary; single POS tag -> TAGGED WORD; more than one POS tag -> WORD RELATION RULES -> TAGGED WORD; word POS tag does not exist in dictionary -> affixing rules (NOUN, ADJECTIVE, VERB, PENEGAS, PEMBENDA, PENEKAN)]

Fig. 3 illustrates the framework of the proposed rule-based POS (RPOS) tagger for Malay text articles, which consists of a POS tag dictionary, a set of affixing rules and word relation rules. The POS tag dictionary consists of Malay words with their POS tags. These words were extracted manually from Thesaurus Bahasa Melayu, which has more than 8,700 tagged words [2]. Malay affixing generates a new word and meaning, and in this paper we apply affixing characteristics to identify POS tags only for the noun, adjective, verb, penegas, pembenda and penekan.

First, the rule-based POS tagger checks whether the word has a POS tag in the POS tag dictionary. If the word exists in the dictionary and has only one tag, the tagging of that word is complete. If the word exists in the dictionary and has more than one possible tag, the valid word type relations are identified in order to select the proper POS tag. Otherwise, if the word does not exist in the POS tag dictionary, the word is processed with the affixing rules before it goes through the tagging process again.

4 Experimental Setup and Evaluations

In this experiment, we extracted ten sets of news articles from Malay online news and ten sets of biomedical articles from the Malaysian Journal of Community Health (http://161.142.92.97/). We performed rule-based POS tagging on these sets of news and biomedical articles using the affixing and word type relation rules. To evaluate the accuracy of the proposed algorithm, we also tagged the words manually and compared the tagger output with these manual tags. Table 6 shows the accuracy of the rule-based POS tagger against the manual tagging for both the news and the biomedical articles. In Table 6, the total token count is the actual number of words found in the test set, and the counted token count is the number of words actually used for POS tagging.

Table 6. Experiment Results for Rule-based POS tagging for Malay language

Test set | News: total tokens | News: counted tokens | Biomedical: total tokens | Biomedical: counted tokens | Accuracy (%), news articles | Accuracy (%), biomedical articles
1 | 386 | 296 | 456 | 399 | 94 | 86
2 | 302 | 229 | 436 | 400 | 91 | 89
3 | 199 | 167 | 501 | 440 | 85 | 89
4 | 185 | 132 | 490 | 444 | 88 | 82
5 | 177 | 141 | 434 | 390 | 86 | 88
6 | 189 | 136 | 453 | 379 | 90 | 79
7 | 149 | 107 | 477 | 420 | 92 | 83
8 | 175 | 136 | 411 | 380 | 92 | 93
9 | 304 | 225 | 300 | 250 | 88 | 87
10 | 434 | 340 | 231 | 187 | 86 | 85
Average | - | - | - | - | 89 | 86

5 Discussions

The results show that the proposed rule-based Malay POS tagger achieves 89 percent accuracy for the Malay news articles and 86 percent accuracy for the Malay biomedical articles. The accuracy for the biomedical articles is lower because they contain words that Malay has borrowed from English. For the news articles, we identified some words whose POS tags the rule-based tagger failed to identify. These include kopersai (NN), berniaga (VB), selepas (RB), waktu (NN/AUX), bertugas (VB), selepas (RB) and waktu (AUX). For the biomedical articles, on the other hand, the tagger failed to identify the POS tags of words borrowed from English, such as antropometri (anthropometry, a noun), dialysis (dialysis, a noun), inflamasi (inflammation, a noun), komplikasi (complication, a noun), vascular (vascular, a noun or adjective), nefropati (nephropathy, a noun), neuropati (neuropathy, a noun), retinopati (retinopathy, a noun), infarksi (infarction, a noun), myocardium (myocardium, a noun), amputasi (amputation, a noun) and superfisial (superficial, an adjective).

6 Conclusion

In this paper, we have outlined the framework for a simple Rule-based Part of Speech (RPOS) tagger for Malay text articles. Based on our experimental results, the proposed rule-based POS tagger performs well compared with the statistical POS tagger reported earlier (89% and 86% accuracy against 67.9%). This indicates that a rule-based POS tagger for Malay can predict the POS of unknown words with promising accuracy. The performance of the proposed tagger can be improved by adding more word type relations and by adding more POS tags to the POS tag dictionary. With more word type relations, more sentence formats can be handled.

References

1. Brill, E.: A simple rule-based part of speech tagger. In: HLT 1991: Proceedings of the Workshop on Speech and Natural Language, pp. 112-116. Association for Computational Linguistics, Morristown (1992)
2. Thesaurus Bahasa Melayu, New Edition. Dewan Bahasa dan Pustaka, Kuala Lumpur (2008) ISBN 983628558X
3. Karim, N.S., Onn, F.M., Musa, H.H., Mahmood, A.H.: Tatabahasa Dewan Edisi Ketiga. Dewan Bahasa dan Pustaka, Kuala Lumpur (2008)
4. Ranaivo-Malancon, B.: Computational Analysis of Affixed Words in Malay Language. In: The 8th International Symposium on Malay/Indonesian Linguistics (ISMIL8), Penang, Malaysia (2004)
5. Purwarianti, A.: Developing Cross Language Systems for Language Pair with Limited Resource-Indonesian-Japanese CLIR and CLQA. PhD thesis, Toyohashi University of Technology (2007)
6. Santorini, B.: Part-of-Speech tagging guideline for the Penn Treebank Project, 3rd Revision, 2nd Printing (1990)
7. Merialdo, B.: Tagging English Text with a Probabilistic Model. Computational Linguistics 20(2), 155-171 (1994)
8. Elworthy, D.: Does Baum-Welch Re-estimation Help Taggers? In: Proceedings of the 4th ACL Conference on Applied Natural Language Processing, ANLP (1994)
9. Banko, M., Moore, R.C.: Part of Speech Tagging in Context. In: Proceedings of the 8th International Conference on Computational Linguistics, COLING (2004)
10. Wang, Q.I., Schuurmans, D.: Improved Estimation for Unsupervised Part-of-Speech Tagging. In: Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE (2005)
11. Biemann, C., Giuliano, C., Gliozzo, A.: Unsupervised Part-Of-Speech Tagging Supporting Supervised Methods. In: Proceedings of RANLP 2007, Borovets, Bulgaria (2007)
12. Jurafsky, D., Martin, J.H.: Speech and language processing. Prentice Hall, New Jersey (2000)