Shallow Parser for Kannada Sentences using Machine Learning Approach


Prathibha, R J, Sri Jayachamarajendra College of Engineering, Mysore, India, rjprathibha@gmail.com
Padma, M C, P.E.S. College of Engineering, Mandya, India, padmapes@gmail.com

ABSTRACT: Kannada is an inflectional, agglutinative and morphologically rich language. Kannada is a relatively free-word-order language, but in phrasal construction it behaves like a fixed-word-order language. In other words, the order of words in a Kannada sentence is flexible, but within a chunk the order of words is fixed. This paper presents a statistical chunker for the Kannada language using a conditional random field model. The input to the chunker is parts-of-speech-tagged Kannada text. The proposed chunker is trained on the Enabling Minority Language Engineering (EMILLE) corpus. The performance of the proposed model is tested on novels and stories datasets collected from the EMILLE corpus. Accuracies of 92.77% and 93.28% are achieved on the novels and stories datasets respectively.

Keywords: Conditional Random Field, Chunking, Machine Translation System, Shallow Parser, Statistical Approach

Received: 16 July 2017, Revised 1 September 2017, Accepted 3 October 2017

2017 DLINE. All Rights Reserved

1. Introduction

In a machine translation system, chunking is the basic step towards parsing of natural language sentences. Chunking, or shallow parsing, is the task of identifying and labeling simple phrases in a sentence. In other words, chunking refers to the identification of syntactically correlated groups of words in a sentence. A chunker divides sentences into non-recursive, inseparable phrases such as noun-phrases, verb-phrases, adverb-phrases and adjective-phrases, with only one head per phrase. A chunk is a minimal, non-recursive phrase consisting of correlated, inseparable words, such that the intra-chunk dependencies are not distorted. Based on this definition, a chunk contains a head and its modifiers.
Once the constituents and their syntactic phrases have been identified, full parsing finds the syntactico-semantic relations between the constituents. The input to a chunker or shallow parser is Parts of Speech (PoS) tagged or annotated text. The accuracy of a parser directly depends on the accuracy of the shallow parser, and hence it is essential to develop an efficient shallow parser before moving to the parsing stage. A shallow parser also substantially advances work in the direction of machine translation. Kannada is a relatively free-word-order language, but in phrasal construction it behaves like a fixed-word-order language: the order of words in a Kannada sentence is flexible, but within a chunk the order of words is fixed.

International Journal of Computational Linguistics Research, Volume 8, Number 4, December 2017

The output of a chunker for the simple English sentence "I ate the green apple" is given in Figure 1.

Figure 1. Output of a chunker for a simple sentence

In Figure 1, the inner boxes show the word-level tokenization and PoS tagging, while the outer boxes show the higher-level chunking. Each of the outer boxes is called a chunk. The example sentence is divided into three phrases (chunks): i) noun-phrase (NP: I-PRP), ii) verb-phrase (VP: ate-VBD), iii) noun-phrase (NP: the-DT, green-JJ, apple-NN).

The rest of the paper is organized as follows. Section 2 presents previous work on the design of chunkers or shallow parsers for different natural languages. The components of the chunker are described in Section 3. The proposed work is explained in Section 4. Experimental results and discussion are presented in Section 5. Conclusions are given in Section 6.

2. Previous Works

Chunks or phrases are normally non-recursive, correlated groups of words. The types of chunks present in Kannada sentences include noun-phrases, verb-phrases, adverbial-phrases, etc. Both rule-based and statistical approaches have been used to design chunkers for natural languages. Some existing chunkers are discussed below.

James Hammerton et al. [9] discussed the complexity of rule-based and machine learning approaches for developing chunkers for morphologically rich languages: handcrafted linguistic rules are language dependent, and machine learning approaches only work well when the features have been carefully selected and weighted. Kuang-hua Chen and Hsin-Hsi Chen [8] designed a probabilistic chunker for English using a statistical approach.
The Susanne corpus, a modified and reduced version of the Brown corpus, was used as the training dataset, and an accuracy of 98% was obtained. Chakraborty et al. [4] developed a rule-based chunker for English by framing handcrafted morphological rules; it was tested on 50 English text documents and obtained 84% accuracy. The drawback of rule-based chunkers is that the morphological or linguistic rules are language dependent and require language experts. Akshay Singh et al. [17] designed a chunker for Hindi using a Hidden Markov Model (HMM) approach; they trained the system on a corpus of 2,00,000 words and achieved 91.7% accuracy. Sneha Asopa et al. [2] designed a rule-based chunker for Hindi, tested it on 500 sentences and obtained an accuracy of 74.16%. This shows that chunkers designed using statistical approaches achieve better accuracy than rule-based ones.

Sankar De et al. [6] developed a chunker for Bangla using a rule-based approach and obtained an accuracy of 94.62%, but the dataset on which the system was tested is not reported. Kishorjit Nongmeikapam et al. [10] proposed a chunker for Manipuri using the Conditional Random Field (CRF) approach; they used 20,000 words to train the system, tested it on 10,000 words and obtained an accuracy of 74.21%. Chirag Patel and Dilip Ahalpara [13] designed a chunker for Gujarati using the statistical CRF method; the system was trained on about 5,000 sentences collected from a corpus designed by the Central Institute of Indian Languages (CIIL), Mysore, and obtained an accuracy of 96%. Dhanalakshmi et al. [19] designed a chunker for Tamil using the CRF approach, on a corpus created by the authors; the system was trained and tested on 2,25,000 Tamil words and obtained 97.49% accuracy. S. Lakshmana Pandian and T.V. Geetha [12] proposed a chunker for Tamil using the CRF approach, tested on a corpus manually created by the authors, and obtained an accuracy of 84.25%. The major limitation of chunkers designed for Tamil is that an annotated Tamil corpus is publicly unavailable; hence the corpora required for training and testing were manually created by the authors.

In 2007, a workshop on Shallow Parsing for South Asian Languages (SPSAL) was conducted and a contest was announced. Training and testing data of approximately 20,000 words and 5,000 words respectively were released to the participants. Chunk-annotated data was released for Hindi, Bengali and Telugu using the IIIT-H tagset in Shakti Standard Format (SSF). Different authors used different statistical methodologies to develop chunkers for Hindi, Bengali and Telugu.
The details of the authors, methodologies used and accuracies obtained are shown in Table 1.

Table 1. Details of shallow parsers for South Asian languages developed during the contest at the SPSAL workshop

The literature shows that a few chunkers have been developed for Hindi, Bengali, Telugu, Tamil, etc., using rule-based and statistical (machine learning) approaches. The limitations of existing chunkers designed using these approaches are discussed below. In morphologically rich languages, the critical information required for PoS tagging and chunking is available in the internal structure of the word itself; hence rule-based chunkers give good results. However, linguistic rules are language dependent and require language expertise. The performance of stochastic chunkers is better than that of rule-based chunkers; however, stochastic approaches require pre-tagged or annotated text to train the system. The accuracy of a stochastic chunker directly depends on the size of the training dataset: as the size of the training data increases, the accuracy also increases. But for most Indian languages, pre-tagged chunked text is publicly unavailable. The inference drawn from the literature survey is that the conditional random field model gives better accuracy than other statistical and rule-based approaches. However, no papers have been published on a chunker for the Kannada language. Hence, a statistical chunker for Kannada using the CRF approach is proposed in this paper.

3. Components in Chunking

3.1 Chunk Types
The guidelines given in AnnCorra [3] have been followed to prepare customized chunks for the Kannada language. The following types of chunks were identified for the proposed Kannada chunker.

Noun Chunk: Noun chunks include non-recursive noun-phrases; the noun is always the head of a noun chunk.
Verb Chunk: The verb group includes the main verb and auxiliary verbs. There are three types of verb chunks.
Finite Verb Chunk: In a finite verb chunk, the main verb itself may not be finite; finiteness is indicated by the auxiliary verbs.
Non-Finite Verb Chunk: A verb chunk containing non-finite verbs is called a non-finite verb chunk.
Verb Chunk Gerund: A verb chunk having a gerund is called a verb chunk gerund.
Adjectival Chunk: An adjectival chunk consists of all adjectives, including predicative ones. However, adjectives appearing before a noun are grouped together with the noun chunk.
Adverb Chunk: This chunk includes all adverbial phrases.
Chunk for Negatives: If a negative particle is present around a verb, it is considered a negative chunk.
Conjuncts: Conjuncts are functional units required to build larger sentences.
Miscellaneous Entities: Entities such as interjections and discourse markers that do not belong to any of the above chunks are kept in a separate chunk called the miscellaneous chunk.

3.2 Chunk Boundary Identification
To identify chunks, it is necessary to find and mark the positions where a chunk can end and a new chunk can begin. The PoS tags are used to discover these positions. The chunk boundaries are identified by handcrafted linguistic rules that check whether two neighboring PoS tags belong to the same chunk or not; if they do not, a chunk boundary is assigned between the words. The I/O/B (Inside, Outside/end, Begin) tags are used to indicate the boundaries of each chunk:

I - intermediate word, inside a chunk.
O - outside a chunk, or the end of the sentence.
B - the current word begins a chunk, which may be followed by another chunk.

Framing handcrafted linguistic rules is not a trivial task. However, we have manually framed linguistic rules and used them as a reference to identify chunk boundaries. We arrived at 167 linguistic rules to identify the chunk boundaries for Kannada; a few of them are listed below.

ROOT → S
S → NP VP

1. Noun Phrase (NP):
NP → NN
NP → QF NN
NP → QC NN
NP → PRP NN
NP → NN NN
NP → QF JJ NN
NP → PRP JJ NN
NP → NNP QC NN
NP → NNP NN JJ NN
NP → NN NN QC JJ NN
NP → DEM JJ NN
NP → VNAJ NN
NP → PRP VP
NP → PRP VINT
NP → PRP NNQ_NN
NP → NNQ

2. Verb Phrase (VP):
VP → PRO NN PRO VM
VP → PRO NN NN VM
VP → PRO RB VM
VP → PRO VM
VP → NN NN VM
VP → PRO NN VM RB
VP → PRO NN RB VM
VP → PRO NN VM NN VM
VP → PRO VM NN VM

Notations used in the above linguistic rules:
NN - Noun
VM - Main verb
PRO - Pronoun
JJ - Adjective
QC - Cardinal
DEM - Demonstrative
QF - Quantifier
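As a rough sketch of how such PoS-pair rules can drive I/O/B boundary marking, the following uses a small hypothetical subset of same-chunk tag pairs (the pairs below are illustrative only, not the actual 167 rules):

```python
# Sketch of rule-driven chunk-boundary marking with I/O/B tags.
# SAME_CHUNK holds (previous PoS, current PoS) pairs that stay in one
# chunk -- a tiny illustrative subset, not the paper's 167 rules.
SAME_CHUNK = {
    ("QF", "NN"),    # quantifier + noun      -> one noun chunk
    ("JJ", "NN"),    # adjective + noun       -> one noun chunk
    ("DEM", "JJ"),   # demonstrative + adjective
    ("VM", "VAUX"),  # main verb + auxiliary  -> one verb chunk
}

def mark_boundaries(tagged):
    """tagged: list of (word, pos) pairs; returns one I/O/B tag per word."""
    tags, prev_pos = [], None
    for word, pos in tagged:
        if pos == "SYM":                      # punctuation lies outside chunks
            tags.append("O")
        elif (prev_pos, pos) in SAME_CHUNK:
            tags.append("I")                  # continue the current chunk
        else:
            tags.append("B")                  # start a new chunk
        prev_pos = pos
    return tags
```

For example, a PoS sequence DEM JJ NN VM SYM yields B I I B O, i.e. one noun chunk followed by one verb chunk and a sentence-final symbol.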

3.3 Chunk Labelling
After chunk boundary identification, the chunks are labeled. The PoS tags within a chunk help to assign the chunk label. The chunk labels chosen from AnnCorra [3] for the proposed Kannada chunker are given in Table 2.

Sl.No  Chunk Type              Tag Name
1      Noun Chunk              NP
2.1    Verb Chunk              VP
2.2    Finite Verb Chunk       VGF
2.3    Non-Finite Verb Chunk   VGINF
2.4    Verb Chunk Gerunds      VGNN
3      Adjectival Chunk        JJP
4      Adverb Chunk            RBP
5      Chunk for Negatives     NEGP
6      Conjuncts               CCP
7      Miscellaneous Entities  BLK

Table 2. Chunk tags used in the proposed Kannada chunker

3.3.1 Parts of Speech Tagset
Bharati et al. [3] proposed a common Parts of Speech (PoS) tagset for Indian languages. The same tagset has been used in this paper to manually assign a PoS tag to each word of the input sentences. The smaller the tagset, the better the efficiency of machine learning. The PoS tagset used in this work consists of 24 tags, listed in Table 3. This tagset is used when annotating the input words with their relevant PoS tags.

3.4 Chunk Feature Analysis
In the chunking process, the PoS tags of the previous and next words influence the chunk tag of the current word. The training features used in the proposed chunker are as follows:

<word-2>    next-to-previous word
<word-1>    previous word
<word 0>    current word
<word 1>    next word
<word 2>    next-to-next word
<PoS tag-2> PoS of next-to-previous word
<PoS tag-1> PoS of previous word
<PoS tag 0> PoS of current word
<PoS tag 1> PoS of next word
<PoS tag 2> PoS of next-to-next word

The template file describes the features used for training and testing the system. Each line in the template file denotes one template. In each template, the special macro %x[row, col] specifies a token in the input data: row is the position relative to the current token, and col is the absolute position of the column. The content of the template file used in the training phase is given below.

U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]

U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,1]
U08:%x[-1,1]
U09:%x[0,1]
U10:%x[1,1]
U11:%x[2,1]
U12:%x[-2,1]/%x[-1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[1,1]/%x[2,1]
U16:%x[-2,1]/%x[-1,1]/%x[0,1]
U17:%x[-1,1]/%x[0,1]/%x[1,1]
U18:%x[0,1]/%x[1,1]/%x[2,1]

Sl.No.  Tag      Description
1       NN       Noun
2       NNP      Proper Noun
3       PRP      Pronoun
4       DEM      Demonstrative
5       VM       Verb Finite
6       VAUX     Auxiliary Verb
7       JJ       Adjective
8       RB       Adverb
9       PSP      Postposition
10      CC       Conjuncts
11      WQ       Question Words
12      QC       Cardinal
13      QF       Quantifiers
14      QO       Ordinal
15      INTF     Intensifier
16      INJ      Interjection
17      NEG      Negation
18      SYM      Symbol
19      RDP      Reduplication
20      UT       Quotative
21      NUM      Numbers
22      ECH      Echo Words
23      UNK      Unknown
24      FOREIGN  Foreign Words

Table 3. PoS tagset used in the proposed Kannada chunker

For example, if a noun is preceded by an adjective, it gets the chunk tag I-NP and the noun-phrase begins with the adjective. On the other hand, if it is preceded by a noun, the current word is chunk-tagged as the beginning of a noun-phrase (B-NP); in this case, the feature U08:%x[-1,1] is used in the chunking process. If an adjective is followed by a noun, the adjective becomes the start of a noun-phrase (B-NP); if it is followed by a verb, the adjective becomes an independent adjective-phrase (B-JJP). This shows that the chunk tag of the current word also depends on the PoS of the next word, giving the feature U10:%x[1,1] from the template.
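To make the %x[row, col] semantics concrete, the following is a small sketch (our own illustration, not part of any CRF toolkit) that expands one template at a given token position; column 0 holds the word and column 1 its PoS tag, using words from the sample sentence above:

```python
# Sketch: expand a %x[row,col] feature template at token position i.
# rows is the training-data token matrix: one [word, pos] row per token.
def apply_template(rows, i, cells):
    """cells: list of (row_offset, col) pairs taken from one U-template."""
    parts = []
    for off, col in cells:
        j = i + off
        # positions outside the sentence are padded, as CRF-style tools do
        parts.append(rows[j][col] if 0 <= j < len(rows) else "_B")
    return "/".join(parts)

rows = [["krishnanu", "NNP"], ["benneyannu", "NN"], ["kaddanu", "VM"]]

print(apply_template(rows, 1, [(0, 1)]))           # U09: current PoS -> "NN"
print(apply_template(rows, 1, [(-1, 1), (0, 1)]))  # U13: prev/current PoS -> "NNP/NN"
```

Each such expanded string becomes one binary feature for the CRF at that token position.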

4. Proposed Model

4.1 Architecture of the Proposed Kannada Chunker
The architecture of the proposed Kannada chunker is shown in Figure 2. For training, the input to the chunker is an annotated (both PoS- and chunk-tagged) sentence. The proposed chunker is implemented using a statistical approach, the Conditional Random Field (CRF) model. Each word in the input sentence is identified and assigned a chunking label (I/O/B). The output of the chunker is the set of chunks or phrases present in the input sentence.

Figure 2. Architecture of the proposed Kannada chunker

4.2 Methodology
Chunking refers to the identification of syntactically correlated groups of words in a sentence, and is usually the first step towards parsing a natural language sentence. It divides the sentence into phrases such as noun-phrases, verb-phrases, adverb-phrases, etc. In the chunking process, two tasks are very important: chunk boundary identification and chunk labeling. Various statistical approaches can be used to determine the most appropriate chunk-tag sequence for a given sentence; these approaches require training data that is chunked and tagged manually. The proposed chunker for Kannada is designed using the CRF model. Since the CRF model is a statistical approach, it requires a pre-tagged, chunk-annotated corpus to train the system.

4.3 Corpus Used
4.3.1 Training Corpus
The training data used in the CRF model should be in a particular format. Training the chunker is done in two phases: first the chunk boundary is extracted, and then the chunk label is marked for each word in the corpus. The chunk boundary markers are begin-chunk (B) and intermediate-chunk (I). In the first phase, chunk tags (both chunk boundary and chunk label) are assigned to each word in the training data, and the model is trained to predict the corresponding B-L (Boundary-Label) tag.
In the second phase, the system is trained on the feature template to predict the chunk boundary markers (B). Finally, the chunk label markers from the first phase and the chunk boundary markers from the second phase are combined to obtain the chunk tag. To train the system, 6,000 sentences (approximately 80,000 words) were taken from the EMILLE (Enabling Minority Language Engineering) corpus; chunk boundaries were identified and chunk labels marked manually for each word in the corpus, and finally chunk tags were assigned to each identified chunk. The training data consists of multiple tokens, each represented by a fixed number of columns. The columns carry consistent semantics across tokens: the first column is the word, the second column is its PoS

tag, and the third column is the chunk tag of the word. The last column represents the answer tag to be learned by the CRF model. In this proposed chunker, we have used a three-column format. Samples of training data in English and Kannada are given in Table 4 and Table 5 respectively.

4.3.2 Testing Corpus
The input to the chunker module should be annotated (PoS-tagged) text. The proposed chunker is tested on the novels and stories categories of the EMILLE corpus, containing 2,732 sentences (9,000 words) and 3,971 sentences (40,000 words) respectively.

5. Experimental Results and Discussion
The major contributions of this paper towards the design of a chunker for Kannada are: i) framing of 167 linguistic rules to determine the chunk boundaries and chunk labels in the input corpus; ii) assignment of a parts-of-speech tag to each word in the dataset of 80,000 words (6,000 sentences); iii) manual identification of the different chunks and assignment of chunk tags to the identified chunks in the input corpus, to train the system.

Table 4. Sample training data in English

The input to the proposed CRF chunker is a PoS-tagged sentence; the output is a chunked sentence. A sample input for the chunker is shown below:

Krishnanu <NNP> benneyannu <NN> kaddanu <VM> . <SYM>

The output obtained from the proposed chunker is given below:

Table 5. Sample training data in Kannada

Krishnanu <NNP> <B-NP> benneyannu <NN> <I-NP> kaddanu <VM> <B-VP> . <SYM> O

In our experiments, we found that over 85% of the identified chunks were given the correct chunk labels. Thus, the best method for chunk boundary identification is to train the system with the conditional random field model using both boundary and syntactic label information together. Given a test sample, the trained CRF can then identify both the chunk boundaries and the labels; the chunk labels can then be dropped to obtain data marked with chunk boundaries only.

The accuracy of the proposed CRF chunker for Kannada is calculated as the ratio of correctly chunked words to the total number of input words, as given in equation (1):

Accuracy (%) = (Number of correctly chunked words / Total number of input words) × 100    (1)

Based on equation (1), accuracies of 92.77% and 93.28% are achieved on the novels (2,732 sentences) and stories (3,971 sentences) datasets respectively. The training dataset was divided into 8 divisions based on size. The results obtained by the proposed Kannada chunker on the novels and stories datasets are tabulated in Table 6 and Table 7 respectively; the corresponding graphs are given in Figure 3 and Figure 4. It is observed from these graphs that the accuracy of the chunker increases as the size of the training data increases.

6. Conclusions
Almost all Indian languages are free-word-order languages, but within phrases the order of words is fixed. Chunking, or shallow parsing, is the task of identifying and labeling simple phrases or chunks, such as noun-phrases, verb-phrases, adverb-phrases, etc., in a sentence. In this paper, a chunker for the Kannada language is proposed using a statistical approach, the conditional random field model. The stories and novels datasets from the EMILLE corpus are used to train and test the proposed chunker.
Accuracies of 92.77% and 93.28% are achieved on the novels (2,732 sentences) and stories (3,971 sentences) datasets respectively. It is observed from the results that the accuracy of the chunker increases as the size of the training data increases. The experimental results show that the performance of the proposed chunker is significantly good.

Table 6. Accuracy of the proposed Kannada chunker on the novels dataset (2,732 sentences containing 29,638 words) with different training data sizes

Table 7. Accuracy of the proposed Kannada chunker on the stories dataset (3,971 sentences containing 44,469 words) with different training data sizes

Figure 3. Accuracy of the Kannada chunker on the novels dataset
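The accuracy measure of equation (1) can be sketched in a few lines of Python as a word-level comparison of predicted and gold chunk tags (the tag values below are illustrative):

```python
# Sketch: word-level chunking accuracy as in equation (1) --
# correctly chunked words divided by total input words, times 100.
def chunk_accuracy(predicted, gold):
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

pred = ["B-NP", "I-NP", "B-VP", "O"]
gold = ["B-NP", "I-NP", "I-VP", "O"]
print(chunk_accuracy(pred, gold))  # -> 75.0 (3 of 4 words chunked correctly)
```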

Figure 4. Accuracy of the Kannada chunker on the stories dataset

References

[1] Agrawal, Himanshu. (2007). POS tagging and chunking for Indian languages. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 37-40.
[2] Asopa, Sneha., Asopa, Pooja., Mathur, Iti., Joshi, Nisheeth. (2016). Rule based chunker for Hindi. In: 2nd International Conference on Contemporary Computing and Informatics, p. 242-245, March 2016.
[3] Bharati, Akshar., Sangal, Rajeev., Sharma, Dipti Misra., Bai, Lakshmi. (2006). AnnCorra: Annotating corpora guidelines for POS and chunk annotation for Indian languages. LTRC-TR31.
[4] Chakraborty, Neelotpal., Malakar, Samir., Sarkar, Ram., Nasipuri, Mita. (2016). A rule based approach for noun phrase extraction from English text document. In: Seventh International Conference on CNC-2016, p. 13-26.
[5] Dandapat, Sandipan. (2007). Part of speech tagging and chunking with maximum entropy model. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 29-32.
[6] De, Sankar., Dhar, Arnab., Biswas, Suchismita., Garain, Utpal. (2011). On development and evaluation of a chunker for Bangla. In: Second International Conference on Emerging Applications of Information Technology, p. 321-324.
[7] Ekbal, Asif., Mandal, Samiran., Bandyopadhyay, Sivaji. (2007). POS tagging using HMM and rule based chunking. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 25-28.
[8] Chen, Kuang-hua., Chen, Hsin-Hsi. (1993). A probabilistic chunker. In: Proceedings of ROCLING-93, p. 99-117.
[9] Hammerton, James., Osborne, Miles., Armstrong, Susan., Daelemans, Walter. (2002). Introduction to special issue on machine learning approaches to shallow parsing. Journal of Machine Learning Research, 2, 551-558.
[10] Nongmeikapam, Kishorjit., Chingangbam, Chiranjiv., Keisham, Nepoleon., Varte, Biakchungnunga., Bandopadhyay, Sivaji. (2014). Chunking in Manipuri using CRF. International Journal on Natural Language Computing (IJNLC), 3, 121-127.
[11] Pammi, Sathish Chandra., Prahallad, Kishore. (2007). POS tagging and chunking using decision forests. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 33-36.
[12] Pandian, S. Lakshmana., Geetha, T. V. (2009). CRF models for Tamil part of speech tagging and chunking approach. In: ICCPOL, LNAI 5459, Springer-Verlag Berlin Heidelberg, p. 11-22.

[13] Patel, Chirag., Ahalpara, Dilip. (2015). A statistical chunker for Indian language Gujarati. International Journal of Computer Engineering and Applications, 9, 173-180.
[14] Avinesh, PVS., Karthik, G. (2007). Part-of-speech tagging and chunking using conditional random fields and transformation based learning. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 21-24.
[15] Rao, Delip., Yarowsky, David. (2009). Part of speech tagging and shallow parsing for Indian languages. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 17-20.
[16] Sastry, G. M. Ravi., Chaudhuri, Sourish., Reddy, P. Nagender. (2009). A HMM based part-of-speech tagger and statistical chunker for three Indian languages. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 13-16.
[17] Singh, Akshay., Bendre, S. M., Sangal, Rajeev. (2005). HMM based chunker for Hindi. In: Proceedings of the Second International Joint Conference on Natural Language Processing, October 2005.
[18] Pattabhi, R. K., Rao, T., Vijay Sundar Ram, R., Vijayakrishna, R., Sobha, L. (2009). A text chunker and hybrid POS tagger for Indian languages. In: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages, p. 9-12.
[19] Dhanalakshmi, V., Padmavathy, P., Anand Kumar, M., Soman, K. P., Rajendran, S. (2009). Chunker for Tamil. In: International Conference on Advances in Recent Technologies in Communication and Computing, p. 436-438.