A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles

Rayner Alfred 1, Adam Mujat 1, and Joe Henry Obit 2

1 School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia
2 Labuan School of Informatics Science, Universiti Malaysia Sabah, Labuan, Malaysia
ralfred@ums.edu.my, adammujat@gmail.com, joehenryobit@yahoo.com

Abstract. The Malay language is an Austronesian language spoken in most countries of the South East Asia region, including Malaysia, Indonesia, Singapore, Brunei and Thailand. Traditional linguistics is well developed for Malay, but very few resources and tools are available or accessible for computational linguistic analysis of the language. Assigning a part of speech (POS) to each running word in a sentence is one of the pipeline processes in Natural Language Processing (NLP), and it has not been well investigated for Malay. This paper outlines an approach to POS tagging for Malay text articles. We apply a simple Rule-based Part of Speech (RPOS) tagger to perform the tagging operation on Malay text articles. POS tagging can be described as the automatic annotation of a syntactic category for each word in a text document. A rule-based POS tagger generally relies on a POS tag dictionary and a set of rules to identify the part of speech of each word. In this paper, we propose a framework that applies Malay affixing rules to identify the POS tag of a Malay word, and the relations between words to select the best POS tag for words that have two or more valid POS tags. The results show that the accuracy of the rule-based POS tagger is higher than that of a statistical POS tagger. This indicates that the proposed RPOS tagger is able to predict the POS of unknown words with promising accuracy.
Keywords: Rule-Based POS Tagger, Computational Linguistics, Malay Affixing Rules, Malay Word Relation.

A. Selamat et al. (Eds.): ACIIDS 2013, Part II, LNAI 7803. Springer-Verlag Berlin Heidelberg 2013

1 Introduction

In Malaysia, the Malay language is officially known as Bahasa Malaysia, which translates as the "Malaysian language". The total number of speakers of Standard Malay is about 18 million. About 170 million more people speak Indonesian, which is a form of Malay. Malay is the national language of Malaysia and Indonesia and is ranked fourth, after Spanish, among the most widely spoken languages on earth. Nevertheless, it is one of the least studied and least known, to the extent that it is even left out of rank orders of the world's major languages. Traditional linguistics is well developed for Malay but there are very limited resources and tools that
are available or made accessible for computational linguistic analysis of the Malay language. For example, POS tagging tools for Malay text articles are in short supply. POS tagging is the process of assigning each word in a text its part of speech tag, based on the word's definition and its relation to other words. A part of speech (POS) tagger for the Malay language has several end-product applications. First, it can be used in a grammar checker that identifies word relations based on word class, by checking the word class before and after each word. Second, it can be used to classify questions by identifying the question focus [6] (e.g., a noun and verb after the interrogative word and keyword can be used to identify the question focus).

For the English language, a simple rule-based POS tagger was first introduced by Eric Brill [1]. He showed that a rule-based tagger for English can perform as well as taggers based on probabilistic or statistical models. Statistical tagging of English text has been widely applied to tagged corpora using various approaches. Among the early techniques was the Hidden Markov Model (HMM) algorithm [12], which achieved an accuracy of more than 96% for English text articles. For Malay, a statistical POS tagger using a trigram Hidden Markov Model has been designed, but it achieved an accuracy of only 67.9%. Efforts on statistical POS taggers have mainly focused on European languages such as English, German and Spanish [7,8,9,10,11]. This research has been driven mainly by the availability of language resources such as dictionaries and annotated corpora.
Minority languages such as Malay still need more research support to assist the development of tools for computational linguistic analysis. In this work, a framework for rule-based POS tagging of Malay is outlined, since only a very limited POS-tagged corpus is accessible to Malay language researchers. This paper is organized as follows. Section 2 explains the background of POS tagging for the Malay language. Section 3 outlines the rule-based POS tagger framework for Malay text articles. Section 4 describes the experimental setup and presents the experimental results. Section 5 discusses the results obtained from the experiments and, finally, Section 6 concludes the paper with future work.

2 Part of Speech Tagging for Malay Language

Part of Speech (POS) tagging is the process of tagging a text with the corresponding word class or part of speech, based on word definition and word relation. A simple rule-based POS tagger for Malay applies a POS tag dictionary and affixing rules in order to identify the Malay word definition. The POS tag dictionary is manually extracted from the Malay thesaurus and stored in a text format [2]. Fig. 1 illustrates a snapshot of the Malay POS tag dictionary. Table 1 shows the list of POS tags for Malay language words.
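As a rough illustration of the dictionary lookup, the sketch below parses a small in-memory POS tag dictionary. The one-entry-per-line layout (`word TAG` or `word TAG/TAG`), the function name `parse_pos_dictionary` and the sample text are assumptions made for this example; the paper only states that the dictionary is stored in a text format.

```python
from typing import Dict, List

# Sample entries are modelled on the snapshot in Fig. 1 and on the
# "suka" example in Section 2. The line layout itself is an assumption.
SAMPLE_DICTIONARY_TEXT = """\
adunan NN
agak GUT
agaknya PEN
mengagak-agak VB
teragak VB
teragak-agak VB
suka JJ/VB/RB
"""


def parse_pos_dictionary(text: str) -> Dict[str, List[str]]:
    """Map each lower-cased word to its list of candidate POS tags."""
    entries: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, tags = parts
        entries[word.lower()] = tags.split("/")
    return entries


pos_dictionary = parse_pos_dictionary(SAMPLE_DICTIONARY_TEXT)
print(pos_dictionary["agak"])  # ['GUT']
print(pos_dictionary["suka"])  # ['JJ', 'VB', 'RB']
```

A word with a single tag can be tagged directly from such a mapping, while a word with several tags, such as suka here, needs a further disambiguation step.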

adunan          NN
agak            GUT
agaknya         PEN
mengagak-agak   VB
teragak         VB
teragak-agak    VB

Fig. 1. Malay POS tag dictionary snapshot

All the affixing rules applied in the proposed approach were studied and manually extracted from Tatabahasa Dewan Edisi Ketiga [3]. The derived word relations are based on word types, where some word types co-occur with other word types (see Table 2). For instance, consider the following Malay phrase:

saya suka makan
saya (NN) suka (JJ/VB/RB) makan (VB)

Here saya is a noun that co-occurs with the word suka, which can be classified as an adjective, a verb or an adverb. However, makan is a verb, and of these three classes only an adverb is allowed to co-occur with makan. Thus, we obtain the following word relations:

saya suka makan
saya (NN) suka (RB) makan (VB)

Table 1. POS tag list for Malay

Word Type (English language) | Subtype (English language) | Subtype (Malay language) | Tag
Noun | - | - | NN
Noun | Proper noun | - | NNP
Verb | - | - | VB
Adjective | - | - | JJ
Function | Conjunction | Kata hubung | CC
Function | Interjection | Kata seru | UH
Function | Interrogative | Kata tanya | WP
Function | Command | Kata perintah | CO
Function | - | Kata pangkal ayat | PNG
Function | Auxiliary | Kata bantu | AUX
Function | (Amplifier) | Kata penguat | GUT
Function | Particles | Kata penegas | RP
Function | Negation | Kata nafi | NEG
Function | - | Kata pemeri | MER
Function | Preposition | Kata sendi nama | IN
Function | - | Kata pembenar | BNR
Function | Direction | Kata arah | DR
Function | Cardinal number | Kata bilangan | CD
Function | - | Kata penekan | PEN
Function | - | Kata pembenda | BND
Adverb | Adverb | - | RB
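The disambiguation of suka in the example above can be sketched as follows. This is a minimal sketch, assuming that a candidate tag is acceptable when it appears in the Table 2 row of each neighbouring word's type; the paper does not spell out the exact matching procedure, and the names `VALID_NEIGHBOURS` and `select_tag` are hypothetical. Only the noun and verb rows of Table 2 are included.

```python
from typing import Dict, List, Optional, Set

# Two rows of Table 2: word types listed as valid for a noun and for a verb.
VALID_NEIGHBOURS: Dict[str, Set[str]] = {
    "NN": {"JJ", "RB", "VB", "NN", "IN"},
    "VB": {"AUX", "RB", "NN", "PEN", "BND"},
}


def select_tag(candidates: List[str],
               prev_tag: Optional[str],
               next_tag: Optional[str]) -> List[str]:
    """Keep the candidate tags allowed by both neighbours' rows."""
    remaining = list(candidates)
    for neighbour in (prev_tag, next_tag):
        allowed = VALID_NEIGHBOURS.get(neighbour) if neighbour else None
        if allowed is None:
            continue  # no neighbour, or no row listed for its type
        filtered = [tag for tag in remaining if tag in allowed]
        if filtered:  # assumption: never filter down to nothing
            remaining = filtered
    return remaining


# suka (JJ/VB/RB) between saya (NN) and makan (VB)
print(select_tag(["JJ", "VB", "RB"], prev_tag="NN", next_tag="VB"))  # ['RB']
```

The noun row accepts all three candidates, while the verb row accepts only the adverb, which reproduces the tagging saya (NN) suka (RB) makan (VB).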

Table 2. Word Type Relation

Word Type | Valid Sequences of Word Types
Noun (NN) | adjective (JJ), adverb (RB), verb (VB), noun (NN), preposition (IN)
Verb (VB) | auxiliary (AUX), adverb (RB), noun (NN), penekan (PEN), pembenda (BND)
Adjective (JJ) | penguat (GUT), preposition (IN)
Adverb (RB) | verb (VB), preposition (IN), adjective (JJ), noun (AUX)
Direction (DR) | noun (NN), preposition (IN)
Preposition (IN) | noun (NN), verb (VB), adjective (JJ)
Auxiliary (AUX) | adjective (JJ), verb (VB), preposition (IN)
Cardinal number | noun (NN)
Penekan (PEN) | adverb (RB), noun (NN), conjunction (CC)
Pembenda (BND) | conjunction (CC), noun (NN)
Conjunction (CC) | noun (NN), verb (VB), preposition (IN), adjective (JJ)
Penguat (GUT) | adjective (JJ)
Interrogative (WP) | noun (NN), verb (VB)
Pangkal ayat (PNG) | noun (NN)

Most Malay POS tagging systems rely on a POS tag dictionary and on POS affixing rules (see Section 3), because resources such as a POS-tagged corpus are unavailable.

2.1 Analysis of Affixed Word

Bali analyzes Malay affixed words by identifying them, segmenting them and finally interpreting them [4]. In Malay, the form of words can be simple or complex. Affixed words are complex words generated by a morphological process called affixation, which includes prefixation, suffixation, circumfixation and infixation. Prefixation is the process of adding a prefix at the left side of the base, and suffixation is the adding of a suffix to the right side of the base (see Fig. 2). Circumfixation is the simultaneous adding of a discontinuous morphological unit, called a circumfix, at the left and right sides of the base [4]. A circumfix is a combination of a prefix and a suffix treated as a single morphological unit. In Malay, infixation is the insertion of an infix just after the first consonant of the base.
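The four affixation processes described above can be sketched as plain string operations. This is a simplified illustration: it ignores the sound changes that real Malay prefixes such as meN- trigger, treats every consonant as a single letter, and uses example words that are not taken from the paper. The function names are hypothetical.

```python
VOWELS = set("aeiou")


def add_prefix(base: str, prefix: str) -> str:
    """Prefixation: attach the prefix at the left side of the base."""
    return prefix + base


def add_suffix(base: str, suffix: str) -> str:
    """Suffixation: attach the suffix at the right side of the base."""
    return base + suffix


def add_circumfix(base: str, prefix: str, suffix: str) -> str:
    """Circumfixation: attach both halves of the circumfix in one step."""
    return prefix + base + suffix


def add_infix(base: str, infix: str) -> str:
    """Infixation: insert the infix just after the first consonant."""
    for i, ch in enumerate(base):
        if ch.lower() not in VOWELS:
            return base[: i + 1] + infix + base[i + 1:]
    return base  # no consonant found, so the base is left unchanged


print(add_prefix("jalan", "ber"))          # berjalan
print(add_suffix("makan", "an"))           # makanan
print(add_circumfix("sihat", "ke", "an"))  # kesihatan
print(add_infix("gigi", "er"))             # gerigi
```

A tagger works in the opposite direction, recognising these patterns in a surface word, but the forward operations make the four definitions concrete.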
[Figure: structure of an affixed word. Labels: Proclitic (ku, kau); Prefix, Infix, Suffix, Circumfix; Base; Enclitic (ku, kau, mu, nya); Particle]

Fig. 2. Clitics, Affixing and Particle in Malay word

In Bali's work, she identified the affixed words, the clitics and the particles, and the relations between them. A word containing a clitic and a particle cannot be affixed, but an affixed word may carry a clitic and a particle. Fig. 2 shows that an affixed word can be
the host of one and only one clitic and/or one and only one particle. A clitic attached before the base is called a proclitic and a clitic attached after the base is called an enclitic. Fig. 2 shows the structure of an affixed word in Malay with the addition of a clitic (proclitic or enclitic) and a particle [4]. In Malay, there are two proclitics, four enclitics and three particles. Ku and kau are the two proclitics; they generate passive words. The four enclitics ku, kau, mu and nya function as the object pronoun of an active verb and as a possessive adjective. In addition, the enclitic nya also functions as the subject pronoun of a passive verb and as a definite article. Finally, the Malay particles are kah and tah, which mark questions, and lah, which marks imperatives and predicates.

3 A Rule-Based Part of Speech (RPOS) Tagger for Malay Texts

In this paper, the proposed rule-based POS tagger for Malay applies three general tagging conventions of the Penn Treebank [6]: a) part of speech tags are defined based on their syntactic distribution rather than their semantic function, b) the tagger capitalizes words tagged as proper nouns, and c) the tagger tags abbreviations and initials. In addition, the proposed tagger has additional POS tags that are not included in the Penn Treebank tags [3]. These tags are:

a) kata perintah, command (CO)
b) kata pangkal ayat (PNG)
c) kata bantu, auxiliary (AUX)
d) kata penguat (GUT)
e) kata nafi, negation (NEG)
f) kata pemeri (MER)
g) kata pembenar (BNR)
h) kata arah, direction (DR)
i) kata penekan (PEN)
j) kata pembenda (BND)

In this paper, we outline a simple rule-based POS tagger for the Malay language. The rules comprise the affixing rules and the word relation rules [3].
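One way the dictionary and the two kinds of rules could fit together is sketched below, following the flow of Fig. 3: dictionary lookup first, affixing rules for words missing from the dictionary, and word relation rules when several tags remain. The function `tag_sentence`, its parameters and the toy stand-ins are hypothetical; in particular, the sketch treats the output of the affixing rules directly as the candidate tag list, which is a simplification.

```python
from typing import Callable, Dict, List, Optional

Candidates = List[str]


def tag_sentence(
    words: List[str],
    dictionary: Dict[str, Candidates],
    affix_rules: Callable[[str], Candidates],
    resolve: Callable[[Candidates, Optional[Candidates], Optional[Candidates]], str],
) -> List[str]:
    """Return one POS tag per word."""
    # Step 1: collect candidate tags for every word.
    candidates: List[Candidates] = []
    for word in words:
        tags = dictionary.get(word.lower())
        if not tags:  # word is not in the dictionary
            tags = affix_rules(word)
        candidates.append(tags)

    # Step 2: a single candidate is final; otherwise ask the relation rules.
    result: List[str] = []
    for i, tags in enumerate(candidates):
        if len(tags) == 1:
            result.append(tags[0])
            continue
        prev_tags = candidates[i - 1] if i > 0 else None
        next_tags = candidates[i + 1] if i + 1 < len(candidates) else None
        result.append(resolve(tags, prev_tags, next_tags))
    return result


# Toy stand-ins, for illustration only.
toy_dictionary = {"saya": ["NN"], "suka": ["JJ", "VB", "RB"], "makan": ["VB"]}


def toy_affix_rules(word: str) -> Candidates:
    # Crude guess: a "ber" prefix suggests a verb, anything else a noun.
    return ["VB"] if word.lower().startswith("ber") else ["NN"]


def toy_resolve(tags: Candidates,
                prev_tags: Optional[Candidates],
                next_tags: Optional[Candidates]) -> str:
    # Prefer an adverb directly before an unambiguous verb.
    if next_tags == ["VB"] and "RB" in tags:
        return "RB"
    return tags[0]


print(tag_sentence(["Saya", "suka", "makan"],
                   toy_dictionary, toy_affix_rules, toy_resolve))
# ['NN', 'RB', 'VB']
```

Passing the rules in as functions keeps the control flow separate from the rule tables, so either set of rules can be extended without touching the loop.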
Malay affixing includes prefixes, infixes, suffixes and combinations of these; in this paper only prefixes, suffixes and combinations are considered. This is because infixation is not a productive affixing process and it can cause ambiguity in POS tagging, as a similar infix may exist in nouns, verbs and adjectives. The affixing rules cover nouns (Table 3), proper nouns, adjectives (Table 4), verbs (Table 5), pembenda, penegas and penekan.

The penegas rule matches a sequence of characters ending in kah, lah or tah. The pembenda rule matches a non-noun root word ending with nya. Finally, the penekan rule matches a noun root word ending with nya. In addition to the affixing rules, we also include the word type relation rules. A word type relation rule is used to select the best POS tag for a word that has more than one POS tag. This is done by checking the validity of the word type relation before and after the word, as explained in Section 2. The word type relation list shown in Table 2, which is extracted from Tatabahasa Dewan Edisi Ketiga [3], is not exhaustive.

Table 3. Noun Affixing Identification Rules

Rules | Prefix | Next character | Sequences of character | May end with | Suffix
1a | pe | ny, ng, r, l and w | a-z | an | -
1b | pem | b and p | a-z | an | -
1c | pen | d, c, j, sy and z | a-z | an | -
1d | peng | g, kh, h, k and vowel | a-z | an | -
1e | penge | - | a-z (3 to 4 character) | an | -
1f | pel or ke | - | a-z | an | -
1g | juru, maha, tata, pra, swa, tuna, eka, dwi, tri, panca, pasca, pro, anti, poli, auto, sub, supra | - | a-z | - | -
1h | not started with me, meng, mem, menge, ber, be, di, diper | - | a-z | - | an, at, in, wan, wati, isme, isasi, logi, tas, man, nita, ik, is, al

Table 4. Adjective Affixing Identification Rules

Rules | Prefix | Next character | Sequences of character | May end with | Suffix
2a | ter, se, bi | - | a-z | - | -
2b | ke | - | a-z | an | -
2c | not starting with di and men | - | a-z | - | in, at, ah, iah, sequences of vowels then wi, and sequences of consonants ending with i
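Some of these rules translate naturally into regular expressions. The sketch below encodes noun rules 1a to 1d of Table 3 together with the penegas, pembenda and penekan rules. The function `guess_tag`, the `root_tags` lookup and the order in which the rules are tried are assumptions made for illustration. Because a pattern such as "ends in lah" also matches ordinary words, the sketch is only meant for words that were not found in the POS tag dictionary.

```python
import re
from typing import Dict, List, Optional

# Noun rules 1a-1d from Table 3: prefix, then a required next character.
NOUN_PREFIX_RULES = {
    "1a": re.compile(r"^pe(?:ny|ng|r|l|w)[a-z]+$"),
    "1b": re.compile(r"^pem[bp][a-z]+$"),
    "1c": re.compile(r"^pen(?:d|c|j|sy|z)[a-z]+$"),
    "1d": re.compile(r"^peng(?:g|kh|h|k|[aeiou])[a-z]+$"),
}

# Penegas rule: a sequence of characters ending in kah, lah or tah.
PENEGAS = re.compile(r"^[a-z]+(?:kah|lah|tah)$")


def guess_tag(word: str, root_tags: Dict[str, List[str]]) -> Optional[str]:
    """Guess a tag for a word that is missing from the dictionary."""
    w = word.lower()
    # Penekan: noun root + nya. Pembenda: non-noun root + nya.
    if w.endswith("nya") and len(w) > 3:
        root = w[:-3]
        if root in root_tags:
            return "PEN" if "NN" in root_tags[root] else "BND"
    if PENEGAS.match(w):
        return "RP"  # tag for kata penegas in Table 1
    if any(rule.match(w) for rule in NOUN_PREFIX_RULES.values()):
        return "NN"
    return None  # no rule applies


roots = {"buku": ["NN"], "laju": ["JJ"]}  # hypothetical root entries
for sample in ["pembaca", "pelari", "siapakah", "bukunya", "lajunya", "makan"]:
    print(sample, guess_tag(sample, roots))
# pembaca NN
# pelari NN
# siapakah RP
# bukunya PEN
# lajunya BND
# makan None
```

The remaining rows of Tables 3 and 4 could be added in the same style, with the suffix columns expressed as alternations at the end of each pattern.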

Table 5. Verb Affixing Identification Rules

Rules | Prefix | Next character | Sequences of character | May end with | Suffix
3a | me | ny, ng, r, l, w, y, p, t, k, s | a-z | - | -
3b | mem | b, f, p and v | a-z | kan or i | -
3c | men | d, c, j, sy, z, t and s | a-z | kan or i | -
3d | meng | g, gh, kh, h, k and vowel | a-z | - | -
3e | menge | - | a-z (3 to character) | |
3f | memper or diper | - | a-z | kan or i | -
3g | ber | not r | a-z | kan or an | -
3h | bel | - | a-z | - | -
3i | Ter | not r | a-z | - | -
3j | Ke | - | a-z | - | An
3k | - | - | a-z | - | i or kan
3l | di or diper | - | a-z | kan or i | -

[Figure: flowchart. WORD -> POS tag dictionary. Word POS tag does not exist in dictionary -> affixing rules (NOUN, ADJECTIVE, VERB, PENEGAS, PEMBENDA, PENEKAN). More than one POS tag -> WORD RELATION RULES. Single POS tag -> TAGGED WORD.]

Fig. 3. The Rule-Based POS Tagger Framework for Malay Text Articles

Fig. 3 illustrates the framework of the proposed rule-based POS (RPOS) tagger for Malay text articles, which consists of a POS tag dictionary, a set of affixing rules and word relation rules. The POS tag dictionary consists of Malay words with their POS tags, and these Malay words are extracted manually from Thesaurus Bahasa Melayu
that has more than 8,700 tagged words [2]. Malay affixing generates a new word and meaning, and in this paper we apply affixing characteristics to identify POS tags only for the noun, adjective, verb, penegas, pembenda and penekan. First, the rule-based POS tagger checks whether the word exists in the POS tag dictionary. If the word exists in the dictionary and has only one tag, then the tagging of the word is complete. If the word exists in the dictionary and has more than one possible tag, the valid word type relations are identified in order to select the proper POS tag. Otherwise, if the word does not exist in the POS tag dictionary, the word is processed with the affixing rules before it goes through the tagging process again.

4 Experimental Setup and Evaluations

In this experiment, we extracted ten sets of news articles from Malay online news and ten sets of biomedical articles from the Malaysian Journal of Community Health. We performed the rule-based POS tagging process on these sets of news and biomedical articles based on the affixing and word type relation rules. We also tagged the words manually in order to evaluate the accuracy of the proposed algorithm, and compared the tagger's results with these actual tags. Table 6 shows the percentage accuracy of the rule-based POS tagger against the manual tagging for both the news and biomedical articles. In Table 6, the total token is the actual number of words found in the test sets. The counted token is the number of words actually used for POS tagging.

Table 6. Experiment Results for Rule-based POS tagging for Malay language

[Table columns: Test Set; News Articles: Total token, Counted token, Accuracy (%); Biomedical Articles: Total token, Counted token, Accuracy (%). Final row: Average.]

5 Discussions

The results show that the proposed rule-based Malay POS tagger achieves 89 percent accuracy for the Malay news articles and 86 percent accuracy for the Malay biomedical articles. The result of the rule-based Malay POS tagger for Malay biomedical
articles is lower due to the presence of words borrowed into Malay from the English language. Based on our experimental results for the news articles, we also identified some words whose POS tags the rule-based POS tagger failed to identify. These include kopersai (NN), berniaga (VB), selepas (RB), waktu (NN/AUX) and bertugas (VB). For the biomedical articles, on the other hand, the words whose POS tags the tagger failed to identify include words borrowed from the English language, such as antropometri (anthropometry, a noun), dialysis (dialysis, a noun), inflamasi (inflammation, a noun), komplikasi (complication, a noun), vascular (vascular, a noun or adjective), nefropati (nephropathy, a noun), neuropati (neuropathy, a noun), retinopati (retinopathy, a noun), infarksi (infarction, a noun), myocardium (myocardium, a noun), amputasi (amputation, a noun) and superfisial (superficial, an adjective).

6 Conclusion

In this paper, we have outlined the framework for a simple Rule-based Part of Speech (RPOS) tagger for Malay text articles. Based on our experimental results, the performance of the proposed rule-based POS tagger is acceptable compared to the performance of a statistical POS tagger reported earlier. This indicates that a rule-based POS tagger for Malay is able to predict the POS of unknown words with promising accuracy. The performance of the proposed tagger can be improved by adding more word type relations and by adding more POS tags to the POS tag dictionary. By improving the word type relations, more sentence formats can be handled.

References

1. Brill, E.: A simple rule-based part of speech tagger. In: HLT 1991: Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, Morristown (1992)
2. Thesaurus Bahasa Melayu, New Edition. Dewan Bahasa dan Pustaka, Kuala Lumpur (2008) ISBN X
3. Karim, N.S., Onn, F.M., Musa, H.H., Mahmood, A.H.: Tatabahasa Dewan Edisi Ketiga. Dewan Bahasa dan Pustaka, Kuala Lumpur (2008)
4. Ranaivo-Malancon, B.: Computational Analysis of Affixed Words in Malay Language. In: The 8th International Symposium on Malay/Indonesian Linguistics (ISMIL8), Penang, Malaysia (2004)
5. Purwarianti, A.: Developing Cross Language Systems for Language Pair with Limited Resource-Indonesian-Japanese CLIR and CLQA. PhD thesis, Toyohashi University of Technology (2007)
6. Santorini, B.: Part-of-Speech tagging guideline for the Penn Treebank Project, 3rd Revision, 2nd Printing (1990)
7. Merialdo, B.: Tagging English Text with a Probabilistic Model. Computational Linguistics 20(2) (1994)
8. Elworthy, D.: Does Baum-Welch Re-estimation Help Taggers? In: Proceedings of the 4th ACL Conference on Applied Natural Language Processing, ANLP (1994)
9. Banko, M., Moore, R.C.: Part of Speech Tagging in Context. In: Proceedings of the 8th International Conference on Computational Linguistics, COLING (2004)
10. Wang, Q.I., Schuurmans, D.: Improved Estimation for Unsupervised Part-of-Speech Tagging. In: Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE (2005)
11. Biemann, C., Giuliano, C., Gliozzo, A.: Unsupervised Part-Of-Speech Tagging Supporting Supervised Methods. In: Proceedings of RANLP 2007, Borovets, Bulgaria (2007)
12. Jurafsky, D., Martin, J.H.: Speech and language processing. Prentice Hall, New Jersey (2000)
