
Sci.Int.(Lahore),27(5),4479-4483,2015 ISSN 1013-5316; CODEN: SINTE 8

DEVELOPING A POS TAGGED RESOURCE OF URDU

Tahira Asif, Aasim Ali, Kamran Malik
Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
aasim.ali@pucit.edu.pk

ABSTRACT: Part of speech (POS) is an important piece of linguistic information that is fundamental to several advanced stages of text processing, such as Named Entity Recognition and Statistical Machine Translation. Several existing POS tagsets are analyzed to define a tagset that has maximal tags. Consequently, 46 POS and 4 morphological tags are used to tag 440,000 tokens in over 20,000 sentences of an Urdu corpus of religious text, using bootstrapping assisted by a statistical tagger with human-reviewed tagging. Increasing the data size shows gradual improvement in accuracy for both seen and unseen vocabulary, with an overall best match of 95.59%.

1 INTRODUCTION

Part of speech (POS) tagging is the task of identifying the appropriate POS category for each word in running text. A POS tagged corpus is a foundation that may be used to understand the advanced features of a language, such as syntax, semantics, pragmatics, and speech. This paper presents our attempt at developing a POS tagged resource. In addition, we also tagged the morphological features of each word. We selected an Urdu translation of religious text for this work. Urdu is a language for which POS tagging has not been done on a significant amount of data, and where it has been done, very little tagged data is freely available. A supervised approach (using a statistical tagger) is used to assist the tagging of around 440,000 tokens. Various POS tagsets have been tested on different data sizes to analyze the impact of each.

2 POS TAGGING FOR URDU LANGUAGE

There have been several efforts at POS tagging of Urdu data. Several tagsets have been developed in those efforts [12], [27], [20], [32]; we analyzed them to decide on the POS tagset for this work.
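The bootstrapping workflow described in the introduction — pre-tag new text with a tagger trained on the human-reviewed portion, then have a human correct the proposals — can be sketched minimally as below. This is an illustration, not the authors' actual pipeline: it uses a simple most-frequent-tag lexicon rather than the TnT tagger, and the romanized Urdu words are invented examples.

```python
from collections import Counter, defaultdict

def train_unigram_lexicon(tagged_sentences):
    """Count (word, tag) frequencies from human-reviewed sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep the most frequent tag per known word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def pre_tag(lexicon, words):
    """Propose tags for a new sentence; UNK marks unseen vocabulary
    that a human reviewer must tag by hand."""
    return [(w, lexicon.get(w, "UNK")) for w in words]

# Illustrative romanized-Urdu tokens, not data from the paper.
reviewed = [[("kitab", "NN"), ("hai", "VBT")],
            [("acchi", "JJ"), ("kitab", "NN")]]
lexicon = train_unigram_lexicon(reviewed)
print(pre_tag(lexicon, ["kitab", "likhi"]))
# → [('kitab', 'NN'), ('likhi', 'UNK')]
```

Each bootstrapping round adds the newly corrected sentences to `reviewed`, so the lexicon (or, in the paper, the statistical tagger's model) grows with every pass.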
Muaz et al. used statistical taggers for tagging news data [20]. Sajjad et al. used the statistical tagging approach with and without an external dictionary [32]. Hardie used a rule-based approach and developed a morphologically induced POS tagset, hence a huge list of tags, for tagging text from a book and a transcription of speech data [12]. A POS tagger trained on Hindi text has also been used to tag Urdu text [37]. A larger tagset for Urdu POS tagging has been used to show a reduction in ambiguity [4].

3 FEATURES OF URDU LANGUAGE

The Urdu language has various morphological features across POS categories such as noun, pronoun, verb, and adjective.

3.1 Noun
In Urdu grammars, a noun is generally classified with respect to its structure, meaning, number, and gender. Nouns are also inflected to show case: nominative, oblique, or vocative.

3.2 Verb
Verbs are divided into the following types: root, imperfective participle, perfective participle, and infinitive. Verbs can also be categorized as (i) transitive or (ii) intransitive. Verbs in Urdu have rich inflectional features; around 60 inflected forms of a verb exist [1], [34].

3.3 Adjectives
For adjectives, there are no particular oblique suffixes to handle the plural. When two or more nouns appear in a sentence, the adjective agrees in gender and number with the head noun nearest to it in natural reading order [1], [34].

3.4 Morphological features
Urdu words may have the following morphological features.
Gender: Urdu has only two possible values for gender, male and female. Male is also used as the default when the gender of the word/concept is not available.
Number: There are only two possible values for number in Urdu, singular and plural.
Case: Urdu nouns have three cases at the level of morphology: nominative, oblique, and vocative. When a noun is used to call someone, it is in its vocative case.
When a noun is followed by a semantic marker, the noun appears in its oblique case; otherwise it is in its nominative case.
Honor: There are several levels of showing honor in Urdu. We have noted them as H0, H1, H2, and H3, where H3 denotes the highest level of honor.

4 PROPOSED TAGSET FOR URDU

4.1 Part of Speech (POS) tags
The tagset proposed here is a modified version of [27], which was designed for developing an English-Urdu parallel corpus [20] and is very close to the Penn Treebank tagset of English. The tagset proposed here is referred to as (Proposed POS

Tagset with Morphological marking). The modification is the addition of morphological tags and one additional tag in the POS category. This modification was required to make the tagset suitable for the selected data and to provide additional grammatical information about the words. A couple of tags from the source tagset were not included in the proposed tagset for the reasons described in subsection 5.2 below. Table 1 lists the proposed POS tagset.

Table 1: Sorted list of POS tags and descriptive titles

Tag      POS Tag Title
AUXA     Aspectual auxiliary
AUXT     Tense auxiliary
CC       Coordinating conjunction
CD       Cardinal
CM       Semantic case marker
DM       Demonstrative
DMRL     Relative demonstrative
FR       Fractional
FW       Foreign word
I        Intensifier
INJ      Interjection
ITRP     Intensifier particle
JJ       Adjective
JJRP     Adjectival particle
KER      Serial verb joiner
MOPE     Pre-Mohmil
MOPO     Post-Mohmil
MUL      Multiplicative
NN       Noun
NNC      Combined noun / noun continued
NNCM     Prepositional noun / noun after case marker
NNCR     Combined noun continuation terminated
NNP      Proper noun
NNPC     Proper noun continued
OD       Ordinal
PM       Phrase marker
PR       Personal pronoun
PRP$     Personal possessive pronoun
PRRF     Reflexive pronoun
PRRFP$   Reflexive possessive pronoun
PRRL     Relative pronoun
Q        Quantifier
QW       Question word
RB       Adverb
RBRP     Adverbial particle
SC       Subordinating conjunction
SM       Sentence marker
SYM      Symbol
U        Measuring unit
UNK      Unknown
VB       Verb, bare form
VBI      Infinitive verb
VBL      Light verb
VBLI     Infinitive light verb
VBT      Verb to-be
WALA     The word 'wala'

4.2 DISCUSSION ON POS TAGS
Differences between demonstratives (DM) and pronouns (PR) are found at the phrase level. A word is tagged DM when a demonstrative is followed by a noun in the same noun phrase, whereas a pronoun forms a phrase by itself, i.e., it appears without a noun as the subsequent word. An adjective may either follow a noun or be followed by nouns.
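The DM-versus-PR distinction just described is essentially a local decision on the following token. A minimal sketch of that rule, assuming the decision is made on the next token's tag; the helper name and the exact subset of noun tags chosen here are our assumptions, not the authors' implementation:

```python
# Noun-family tags from Table 1 (an assumed subset for illustration).
NOUN_TAGS = {"NN", "NNC", "NNP", "NNPC"}

def dm_or_pr(next_tag):
    """Tag a demonstrative word: DM if the next token in the same noun
    phrase is a noun, PR if it stands alone (next_tag=None at sentence end)."""
    return "DM" if next_tag in NOUN_TAGS else "PR"

print(dm_or_pr("NN"))   # demonstrative followed by a noun → DM
print(dm_or_pr(None))   # demonstrative with no following noun → PR
```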
In Urdu, most proper nouns are derived from adjectives. Similarly, inflected forms of adjectives also occur as nouns [34]. The verb to-be tag (VBT) and the tense auxiliary tag (AUXT) occur at the same position in a sentence and are sometimes tagged ambiguously in automatic tagging [20]. A light verb with the VBL tag is added to handle complex predicates. A light verb is one that does not give complete meaning in a sentence without the help of a noun, an adjective, or even another verb; it forms a compound verb by combining with a noun, adjective, or verb to give complete meaning in the sentence [1]. The VBI tag is used to handle infinitive verbs. The VBLI tag is also used to handle complex predicates: an infinitive light verb likewise forms a compound verb by combining with a noun, adjective, or verb to give complete meaning [1]. The serial verb joiner (KER) is a word that joins two or more verb phrases and shows the completion of the previous verbs in a sentence. In some sentences, the semantic marker kay is also tagged KER. Mohmils are words that do not have meanings of their own; in a sentence, a Mohmil cannot occur alone and always comes before or after a meaningful word.

4.3 Morphological Tags
Table 2 lists the proposed morphological tags. The morphological features are: gender, with its two values masculine and feminine; number, with its two values singular and plural; case, with its three values nominative, oblique, and vocative; and honor, with its four values H0, H1, H2, and H3. The POS tags foreign word (FW), to deal with cross-language words (e.g., Arabic), and unknown (UNK), to provide training space for out-of-vocabulary words, complete the tagset.

Table 2: List of morphological tags categorized according to morphological features.
Gender                 Number
F    Feminine          P    Plural
M    Male              S    Singular

Case                   Honor
NOM  Nominative        H0   Honor level 0
OBL  Oblique           H1   Honor level 1
VOC  Vocative          H2   Honor level 2
                       H3   Honor level 3

4.4 Discussion on Morphological Tags
The nominative case can mark either subject-verb agreement or object-verb agreement. When the subject is in nominative form, the subject agrees with the verb; the subject can be

a noun or pronoun [2]. If the subject is in a non-nominative case and the object is in the nominative case, then the object starts to agree with the verb. The nominative case is also observed in different types of sentences [2]. A word is in the oblique case if it is followed by a case marker (CM); it may be a noun, pronoun, or verb, or a word with the special tag WALA. The vocative case of a word is used to call a person; it sometimes plays the role of an interjection [34].

5 EXPERIMENTAL SETUP

We perform experiments using the TnT tagger [6] on six different tagsets, using the different training and testing data listed in Table 3. Accuracies for known words, unknown words, and known + unknown words are calculated for each tagset. Known words are all those words that are part of the relevant language model, whereas unknown words are those that do not exist in the language model.

Table 3: Count of words in each version of training and test data

Version   Training Data   Test Data
I              56,415       100,044
II            106,448       101,543
III           184,221        55,487

The purpose of conducting these experiments is to analyze the results of the different tagsets. The dataset tagged with the base tagset is our basic dataset for experimentation; from it we derive the datasets for the remaining tagsets. For our first experiment we build our model on 56,415 words and test on 100,044 words. The detailed results are given in Table 4.

Table 4: Accuracies (%) on different tagsets using Version I data

                  Tagset 1         Tagset 2         Tagset 3
                  POS    +Morph    POS    +Morph    POS    +Morph
Known + Unknown   92.65  78.02     94.32  79.16     93.89  78.70
Known             94.23  80.44     95.64  81.53     95.21  81.03
Unknown           63.05  32.48     69.50  34.63     69.05  34.79

The results in Table 4 show that among the plain tagsets the best overall accuracy, 94.32%, is achieved on the second tagset, with 93.89% on the third and 92.65% on the first. Similarly, after adding the morphological features, we again train and test the tagger on the Version I data.
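The known/unknown split used throughout these tables can be computed by bucketing each test token by its membership in the training vocabulary. A sketch under that assumption; function and variable names are illustrative, not from the paper:

```python
def split_accuracy(train_vocab, gold, predicted):
    """Score known (in training vocabulary) and unknown words separately,
    mirroring the Known / Unknown rows of Tables 4-6."""
    stats = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
        bucket = "known" if word in train_vocab else "unknown"
        stats[bucket][0] += int(pred_tag == gold_tag)
        stats[bucket][1] += 1
    # Convert counts to percentages, guarding against empty buckets.
    return {k: 100.0 * c / n if n else 0.0 for k, (c, n) in stats.items()}

# Illustrative tokens and tags, not data from the paper.
vocab = {"kitab", "hai"}
gold = [("kitab", "NN"), ("hai", "VBT"), ("likhi", "VB")]
pred = [("kitab", "NN"), ("hai", "AUXT"), ("likhi", "VB")]
print(split_accuracy(vocab, gold, pred))
# → {'known': 50.0, 'unknown': 100.0}
```

The overall (known + unknown) figure is then just the same ratio over all tokens together.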
After adding the morphological information, the overall (known + unknown) accuracies of the three morphologically tagged datasets are 78.02%, 79.16%, and 78.70% respectively. These experiments show that adding morphological information decreases accuracy. For the second experiment we used 106,448 words for training and 101,543 words for testing. We use the same tagsets and build six models. In this experiment the training data is much larger than before, which yields higher accuracies. The detailed results are given in Table 5; with models built on larger training data, all tagsets produce better results than before.

Table 5: Accuracies (%) on different tagsets using Version II data

                  Tagset 1         Tagset 2         Tagset 3
                  POS    +Morph    POS    +Morph    POS    +Morph
Known + Unknown   93.93  79.89     95.21  80.73     94.95  80.35
Known             94.89  81.44     95.98  82.25     95.71  81.83
Unknown           63.16  30.43     70.26  32.31     70.62  33.06

For the third experiment, 184,221 words and 55,487 words are taken for training and testing respectively. Models are trained and tested on all six datasets with their different tagsets. The detailed results are shown in Table 6.

Table 6: Accuracies (%) on different tagsets using Version III data

                  Tagset 1         Tagset 2         Tagset 3
                  POS    +Morph    POS    +Morph    POS    +Morph
Known + Unknown   94.58  78.22     95.60  78.90     95.31  78.44
Known             95.17  79.20     96.06  79.80     95.77  79.35
Unknown           68.82  35.13     75.50  39.40     75.26  38.76

After analyzing the accuracies, we observe that some dataset accuracies increase while others decrease.

Figure 1: Known + unknown accuracies with respect to each tagset.

Details of the known + unknown, known, and unknown accuracies with respect to each tagset are shown in Figure 1, Figure 2, and Figure 3 respectively.

Figure 2: Known accuracies with respect to each tagset

Figure 3: Unknown accuracies with respect to each tagset

6 CONCLUSION

A quick view of all results for the overall, known-word, and unknown-word cases on the combined test set is presented in the tables above. In this study, three models were originally built as our basic language models; these models varied from each other with respect to their knowledge. Later, fifteen more models were built with the help of the basic models, so the study covers a total of eighteen (18) language models across three versions. All these models were applied to the chunk test data as well as the combined test data, and accuracies were achieved with differences in their rates. Across the three versions, we identified the tags that were confused with other tags during tagging, using our basic dataset. Table 7 lists some confused tags in pairs, showing which tag was assigned incorrectly and what its correct tag was in the corpus. These confusions between tags were identified during the post-editing of all TnT tagged files based on our basic dataset.

Table 7: Misclassified tags in the corpus during TnT tagging

Assigned   Correct   |   Assigned   Correct
NNPC       NNP       |   VBT        AUXT
NN         NNP       |   PRRL       DMRL
NNP        NN        |   DMRL       PRRL
PR         SC        |   PR         DM
VBL        VB        |   DM         PR
NNPC       VB        |   NN         VBI
VB         NN        |   VBI        NN
NN         VB        |   VBI        VBLI
VB         VBL       |   NN         VBLI
QW         VBL       |   CC         VOC
VB         AUXA      |   CD         VB
PRRF       PR        |   VB         CD
PR         CM        |   DM/PR      CD
CM         PR        |   RBRP       JJRP
VBL        KER       |   NN         U
Q          CC        |   NN         RB

This incorrect tagging became the cause of degradation in accuracy rates. After removing the incorrect tagging from the dataset, we proceeded to the experiment phase.
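Confusion pairs of the kind listed in Table 7 can be collected by comparing the tagger's output against the post-edited gold tags. A small sketch; the function name and sample tags are illustrative:

```python
from collections import Counter

def confusion_pairs(gold_tags, predicted_tags):
    """Count (assigned, correct) tag pairs where the tagger's output
    disagreed with the human post-editor, as in Table 7."""
    return Counter((p, g) for g, p in zip(gold_tags, predicted_tags) if p != g)

# Illustrative gold and predicted tag sequences for one sentence.
gold = ["NNP", "VBT", "NN", "AUXT"]
pred = ["NNPC", "AUXT", "NN", "AUXT"]
for (assigned, correct), n in confusion_pairs(gold, pred).most_common():
    print(f"{assigned} -> {correct}: {n}")  # e.g. NNPC -> NNP: 1
```

Sorting the counter surfaces the most frequent confusions first, which is how systematic problems like the VBT/AUXT ambiguity become visible.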
We performed several experiments and obtained diverse accuracy rates on different datasets. Here we analyze the reasons for the low and high accuracy rates on datasets that contain the same text with the same statistics in each version. The reasons we identified are the following. The tagsets we chose are syntactically based; some of them have sub-classes within the tags of one POS class, whereas others have no such classification in that particular POS class. These sub-classes are not syntactically distinct, and they lowered accuracy rates through incorrect tagging. Similarly, the addition of morphological tags also affected the accuracy rates; these tags only increase the linguistic information in a corpus. As a simple example of accuracy-rate variation, take the tagset that has only one tag (VB) in the verb POS class and one tag (NN) in the noun POS class, whereas another tagset has four

sub-classes in the verb POS class and three sub-classes in the noun POS class. All four verb sub-classes can be mapped onto the single verb POS class, because they are not syntactically different. During manual editing of POS tags on our basic dataset, we observed that the TnT tagger was confused when tagging POS sub-classes that have no difference at the syntax level (shown in Table 7), which affected the accuracy rates. For example, confusion between the noun and verb classes in one dataset affected its accuracy rates, whereas no such confusion was found in another. Similarly, in the case of the additional morphological tags, accuracy rates on the morphologically tagged datasets were lower than on the corresponding plain datasets in all versions. So the datasets tagged with the original tagsets, i.e., without morphological tags, have better accuracy rates than those carrying the extra information. Hence, if we want to increase the linguistic information in datasets, we must accept lower accuracy rates.

Future Work
The tagset for this work was designed with a view to its direct mapping onto the other tagsets used in this study. Its mapping to further tagsets, such as another POS tagset of Urdu [41], may also be investigated.

REFERENCES
1. Hardie, A. 2003. Developing a tagset for automated part-of-speech tagging in Urdu. In Archer, D., Rayson, P., Wilson, A., McEnery, T. (eds.), Proceedings of the Corpus Linguistics 2003 Conference. UCREL Technical Papers Volume 16. Department of Linguistics, Lancaster University, UK.
2. Muaz, A., Ali, A., Hussain, S. 2009. Analysis and Development of Urdu POS Tagged Corpora. Proceedings of the 7th Workshop on Asian Language Resources, IJCNLP'09, Suntec City, Singapore.
3. Hussain, S. 2008. Resources for Urdu Language Processing. Proceedings of the 6th Workshop on Asian Language Resources, IJCNLP'08, IIIT Hyderabad, India.
4. Sajjad, H. 2007. Statistical Part of Speech Tagger for Urdu. Unpublished MS thesis, National University of Computer and Emerging Sciences, Lahore, Pakistan.
5. Srivastava, K. A. 2008. Unsupervised Approaches to Part-of-Speech Tagging (a survey of five methodologies).
6. Anwar, W., Wang, X., Li, L., Wang, X. 2007. A Statistical Based Part of Speech Tagger for Urdu Language. Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
7. Ali, A. 2010. Study of Morphology of Urdu Language, for its Computational Modeling. VDM Publishing.
8. Schmidt, R. 1999. Urdu: An Essential Grammar. Routledge, London, UK.
9. Ali, A. 2011. Syntax of Urdu Language (A Survey of Urdu Language Syntax). Lambert Academic Publishing.
10. Brants, T. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA.
11. Urooj, S., Hussain, S., Mustafa, A., Parveen, R., Adeeba, F., Ahmed, T., Butt, M., Hautli, A. 2014. The CLE Urdu POS Tagset. In LREC Proceedings, pp. 2920-2925.