DESIGNING POS TAG SET FOR KANNADA Presented by: Vijayalaxmi.F. Patil LDC-IL
CONTENTS
Introduction
Dravidian Languages
Tag set: Meaning and Structure
Kannada Tag set: Category, Type, Attribute
Conclusion
INTRODUCTION
This paper presents the importance and structure of a POS tag set for Kannada, one of the major languages of the Dravidian language family. POS tagging is the process of marking up each word in a text or corpus with a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph.
POS tagging is often the first stage of natural language processing, after which further processing such as chunking and parsing is done. Tags play a vital role in speech recognition, information retrieval and information extraction. Recent machine learning techniques make use of corpora to acquire high-level language knowledge. This knowledge is estimated from corpora that are usually tagged with the correct part-of-speech labels. Many words occurring in natural language texts are not listed in any catalog or lexicon.
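As a rough illustration of estimating tagging knowledge from a tagged corpus, the sketch below counts how often each word occurs with each tag and then picks the most frequent one. This is a simple baseline chosen for illustration, not the method of this paper. The toy corpus, the English stand-in words and the function names are all hypothetical; only the category codes (N, V, J, UNK) come from the Kannada tag set.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, tag) pairs. English words stand in for
# Kannada tokens; the tags are top-level category codes from the tag set.
TAGGED_CORPUS = [
    ("book", "N"), ("book", "N"), ("book", "V"),
    ("red", "J"), ("come", "V"),
]

def estimate_lexicon(tagged_corpus):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return counts

def most_frequent_tag(counts, word, unknown_tag="UNK"):
    """Return the tag seen most often with `word`, or `unknown_tag` if unseen."""
    if word not in counts:
        return unknown_tag
    return counts[word].most_common(1)[0][0]

counts = estimate_lexicon(TAGGED_CORPUS)
print(most_frequent_tag(counts, "book"))
print(most_frequent_tag(counts, "unseen-word"))
```

Words that never appear in the corpus fall back to the Unknown (UNK) category, which mirrors the point that many words are not listed in any lexicon.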
DRAVIDIAN LANGUAGES
The South Indian languages derive from a common source, and these cognate languages constitute a single family known as the Dravidian family. There are about 23 languages in the Dravidian family, which appears to be unrelated to any other known language family. There are more than 40 million speakers of Dravidian languages. The Dravidian languages are classified on the basis of geography, shared innovations and characteristic features into three subgroups: South Dravidian, Central Dravidian and North Dravidian.
South Dravidian languages: As the name indicates, these are the languages spoken in the southern part of India. They are eight in number, viz. Kannada, Malayalam, Tamil, Tulu, Kodagu, Badaga, Toda and Kota.
Central Dravidian languages: These are the languages spoken in the central part of India. They are 12 in number, viz. Telugu, Gondi, Konda, Kui, Kuvi, Pengo, Manda, Kolami, Naiky, Parji, Gadaba Ollari and Gadaba Sillur.
North Dravidian languages: These are the languages spoken in the northern part of India. They are three in number, viz. Kurukh, Malto and Brahui.
Kannada is spoken predominantly in the state of Karnataka; its native speakers are called Kannadigas (Kannadigaru). It is the 27th most spoken language in the world. It is one of the scheduled languages of India and the official and administrative language of the state of Karnataka. Based on the recommendations of the Committee of Linguistic Experts appointed by the Ministry of Culture, the Government of India officially recognized Kannada as a classical language. Over later centuries Kannada, along with other Dravidian languages such as Telugu, Tamil and Malayalam, has been greatly influenced by Sanskrit in vocabulary, grammar and literary style.
Tag set: Meaning and Structure
What is a tag set? A set of defined tags, i.e. a set of word categories to be applied to the word tokens of a text.
Types of tag set: flat tag set, hierarchical tag set, fine-grained tag set.
A flat tag set simply lists the categories applicable to a particular language, without any provision for modularity or feature reusability.
A hierarchical tag set is one in which the categories are structured relative to one another, rather than forming a large number of independent categories. It contains a small number of Categories; each Category contains a number of Types, and each Type contains Attributes, and so on, in a tree-like structure.
A fine-grained tag set is one that captures minute distinctions and is accurate in syntactic analysis.
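The Category, Type, Attribute tree can be sketched as a nested mapping. The fragment below is a minimal illustration that assumes a plain Python dict is an acceptable representation; it covers only the Demonstrative and Adverb categories from the Kannada tag set in this presentation, and the names `TAGSET` and `attributes_of` are hypothetical.

```python
# Category -> Type -> list of Attributes, as a nested dict.
# Only two categories of the tag set are shown here.
TAGSET = {
    "D": {                      # Demonstrative
        "DAB": ["Dimension"],   # Absolute
        "DWH": [],              # Wh-demonstrative (no attributes listed)
    },
    "A": {                      # Adverb
        "AMN": ["Clitic"],      # Manner
    },
}

def attributes_of(tagset, type_code):
    """Walk the tree and return (category, attributes) for a type code."""
    for category, types in tagset.items():
        if type_code in types:
            return category, types[type_code]
    raise KeyError(f"unknown type code: {type_code}")

print(attributes_of(TAGSET, "DAB"))
```

A flat tag set would instead need one independent entry for every combination of type and attribute values, which is the loss of modularity described above.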
The present paper is based on a hierarchical tag set.
Preprocessing: the normalization of text before tokenization.
Part of speech: categories that group lexical items which perform similar grammatical functions.
Lexicon: a list of possible tags for the root forms of all the valid words in a given language.
KANNADA TAG SET
Categories: Noun (N), Pronoun (P), Demonstrative (D), Nominal Modifier (J), Verb (V), Adverb (A), Participle (L), Particle (C), Numeral (NUM), Reduplication (RDP), Residual (RD), Unknown (UNK), Punctuation (PU)
NOUN
Category: Noun (N)
Types and their attributes:
Common (NC): Gender, Number, Case Marker, Adverbial suffix, Adjectival suffix, Postposition, Negative, Clitic
Proper (NP): Gender, Number, Case Marker, Adverbial suffix, Adjectival suffix, Postposition, Negative, Clitic
Verbal (NV): Case Marker, Postposition, Negative, Clitic
Spatio-temporal (NST): Dimension, Case Marker, Postposition, Clitic
E.g.
(1) \NC.hum.pl.nom.0.0.0.0.emp people
(2) \NP.mas.sg.gen.0.0.pp.0.0 with Ramesh
(3) \NV.acc.0.0.emp doing
(4) \NST.dis.gen.pp.incl till there
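The noun examples suggest that a tag is the type code followed by one dot-separated value per attribute, in the order the attributes are listed, with 0 apparently marking an attribute that is not realized. Under that reading (an assumption, not something stated in the slides), a tag can be decoded as sketched below. The names `NOUN_SCHEMA` and `decode_tag` and the snake_case attribute labels are hypothetical.

```python
# Attribute order for each Noun type, following the Noun slide.
# The snake_case labels are this sketch's own naming.
_FULL = ["gender", "number", "case_marker", "adverbial_suffix",
         "adjectival_suffix", "postposition", "negative", "clitic"]

NOUN_SCHEMA = {
    "NC": _FULL,                                                  # Common
    "NP": _FULL,                                                  # Proper
    "NV": ["case_marker", "postposition", "negative", "clitic"],  # Verbal
    "NST": ["dimension", "case_marker", "postposition", "clitic"],  # Spatio-temporal
}

def decode_tag(tag):
    """Split a noun tag into its type code and a dict of attribute values."""
    # Drop the leading backslash, then split on dots.
    type_code, *values = tag.lstrip("\\").split(".")
    names = NOUN_SCHEMA[type_code]
    if len(values) != len(names):
        raise ValueError(
            f"{type_code} expects {len(names)} values, got {len(values)}"
        )
    return type_code, dict(zip(names, values))

# Example (4) from the Noun slide; a raw string keeps the backslash literal.
print(decode_tag(r"\NST.dis.gen.pp.incl"))
```

The length check is a design choice for this sketch: it makes a tag with a missing or extra value fail loudly rather than silently shifting values onto the wrong attributes.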
PRONOUN
Category: Pronoun (P)
Types and their attributes:
Pronominal (PRP): Gender, Number, Person, Case Marker, Dimension, Adverbial suffix, Adjectival suffix, Postposition, Negative, Clitic
Reflexive (PRF): Gender, Number, Person, Case Marker, Adverbial suffix, Postposition, Negative, Clitic
Reciprocal (PRC): Gender, Number, Person, Case Marker, Adverbial suffix, Postposition, Negative, Clitic
Wh-Pronoun (PWH): Gender, Number, Person, Case Marker, Adverbial suffix, Adjectival suffix, Postposition, Negative, Clitic
E.g.
(5) \PRP.fem.sg.3rd.nom.dis.0.0.0.0.0 she
(6) \PRF.hum.pl.nom.0.0.0.emp yourself
(7) \PRC.hum.pl.0.nom.0.0.0.0.0 reciprocal
(8) \PWH.hum.0.0.nom.0.0.0.0.0 who
DEMONSTRATIVE
Category: Demonstrative (D)
Types and their attributes:
Absolute (DAB): Dimension
Wh-demonstrative (DWH)
E.g.
(9) \DAB.dis that
(10) \DWH which
NOMINAL MODIFIER
Category: Nominal Modifier (J)
Types and their attributes:
Adjective (JJ): Negative, Adjectival suffix, Clitic
Quantifier (JQ): Gender, Number, Numeral, Case Marker, Adverbial suffix, Adjectival suffix, Postposition, Dimension, Negative, Clitic
Intensifier (JINT): Clitic
E.g.
(11) \JJ.0.adj.0 beautiful
(12) \JQ.neu.0.nnm.acc.0.0.0.dis.0.emp that much
(13) \JINT.0 much
VERB
Category: Verb (V)
Attributes: Gender, Number, Person, Tense, Causative, Aspect, Mood, Finiteness, Negative, Defective verb, Clitic
E.g.
(14) \V.fem.sg.3rd.fut.n.prg.intr.nfn.n.n.intr will she come?
(15) \V.fem.sg.3rd.pst.n.prg.0.nfn.n.n.0 he will divide
ADVERB
Category: Adverb (A)
Types and their attributes:
Manner (AMN): Clitic
E.g.
(16) \AMN.emp slowly
PARTICIPLE
Category: Participle (L)
Types and their attributes:
Relative (LRL): Tense, Negative, Adjectival suffix, Postposition, Negative, Clitic
Verbal (LV): Tense, Negative, Clitic
Nominal (LN): Gender, Number, Tense, Negative, Case Marker, Adverbial suffix, Adjectival suffix, Postposition, Clitic
Conditional (LC): Adjectival suffix, Negative, Clitic
E.g.
(17) \LRL.pst.0.0.0.emp which has come
(18) \LV.pst.0 go
(19) \LN.hum.pl.pst.y.nom.0.0.0.0 those who have not come
(20) \LC.0.y.0 if not tell
PARTICLE
Category: Particle (C)
Types, attributes and examples:
Co-ordinating (CCD). Attributes: Clitic. Examples: (21) (and), (but)
Subordinating (CSB). Examples: (22) (or)
Interjection (CIN). Examples: (23) (oh), (alas)
(Dis)Agreement (CAGR). Examples: (24) (yes), (no)
Confirmative (CCON). Examples: (25) (isn't it)
Delimitive (CDLIM). Attributes: Clitic. Examples: (26) (only)
Dubitative (CDUB). Examples: (27) (probably)
Inclusive (CINCL). Examples: (28) (also)
Others (CX)
NUMERAL
Category: Numeral (NUM)
Attributes: Case marker, Clitic, Adverbial suffix, Postposition
Types and examples:
Real (NUMR): (29) 10, 20, 30, 40
Serial (NUMS): (30) 10.5, 25.02
Calendric (NUMC): (31)
Ordinal (NUMO): (32) 3rd, 4th, 20th
REDUPLICATION
Category: Reduplication (RDP)
Attributes: Gender, Number, Person, Case marker, Postposition, Adverbial suffix, Clitic
E.g.
(33) \RDP.hum.pl.0.nom.0.adv.0 one by one
(34) \RDP.hum.pl.3rd.gen.pp.0.0 with them
RESIDUAL
Category: Residual (RD)
Types: Foreign Word (RDF), Symbol (RDS)
E.g.
(35) काम work
(36) Ink
(37) @ # $ & %
UNKNOWN
Category: Unknown (UNK)
E.g.
(38) Sanskrit shloka
PUNCTUATION
Category: Punctuation (PU)
E.g.
(39) , . / ? : ; } [ \ = + _ /
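In the category slides above, every type code begins with the code of its category (for example NST under N, NUMR under NUM, RDF under RD). This is an observation about the codes shown, not a rule stated in the presentation. Assuming it holds, the category of a type code can be recovered with a longest-prefix match, as in the sketch below. The longest match matters because N is a prefix of NUM, RD of RDP, and P of PU. The names `CATEGORY_CODES` and `category_of` are hypothetical.

```python
# The thirteen top-level category codes of the Kannada tag set.
CATEGORY_CODES = ["N", "P", "D", "J", "V", "A", "L", "C",
                  "NUM", "RDP", "RD", "UNK", "PU"]

def category_of(type_code):
    """Return the longest category code that is a prefix of `type_code`."""
    matches = [c for c in CATEGORY_CODES if type_code.startswith(c)]
    if not matches:
        raise ValueError(f"no category for type code: {type_code}")
    return max(matches, key=len)

for code in ["NST", "NUMR", "RDF", "RDP", "PWH", "PU"]:
    print(code, "->", category_of(code))
```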
ATTRIBUTES AND THEIR VALUES
Attribute: Values
Person \PER: First\1, Second\2, Third\3
Number \NUM: Singular\sg, Plural\pl
Gender \GEN: Masculine\mas, Feminine\fem, Neuter\neu, Human\hum
Case Marker \CSM: Nominative\nom, Accusative\acc, Instrumental\ins, Dative\dat, Ablative\abl, Genitive\gen, Locative\loc
Tense \TNS: Present\prs, Past\pst, Future\fut
Aspect: Imperfect\ipfv, Perfect\prf, Progressive\prog
Mood \MOOD: Interrogative\int, Habitual\hab, Imperative\imp, Optative\opt, Hortative\hort, Debitive\debt, Potential\potn
Finiteness \FIN: Finite\fin, Non-finite\nfn, Infinitive\inf
Dimension \DIM: Proximal\prx, Distal\dst
Clitic \CL: Interrogative\int, Inclusive\incl, Indefiniteness\ind, Emphatic\emp, Comparative\com, Hearsay\hers
Numeral \NML: Cardinal\crd, Ordinal\ord, Non-numeral\nnm
Negative \NEG: Yes\y, No\n
Adverbial suffix\adv
Adjectival suffix\adj
Defective verb \def: Yes\y, No\n
CONCLUSION
The use of morphological features is especially helpful in developing a reasonable POS tagger when tagged resources are limited. In POS tagging, one word may have more than one part-of-speech label. Syntactic and semantic parsing of natural language sentences generally depends on adequate part-of-speech tagging.