Ritesh Kumar & Dr. Girish Nath Jha Jawaharlal Nehru University New Delhi


Magahi Verb Analyser and Generator Ritesh Kumar & Dr. Girish Nath Jha Jawaharlal Nehru University New Delhi

Magahi

Magahi appeared as a distinct language around the 10th century, like other New Indo-Aryan (NIA) languages. Grierson classified Magahi under the Eastern group of the Outer sub-branch. Magahi currently has 13,978,565 speakers (Census, 2001). Ethnologue (1996) reports that Magahi is spoken mainly in Bihar and Jharkhand, but it is also spoken in some parts of West Bengal, such as Maldah District.

Three distinct varieties of Magahi can currently be recognized: Central Magahi of Patna, Gaya and Hazaribagh; South-Eastern Magahi of Ranchi and some parts of Orissa; and Eastern Magahi of Begusarai and Monghyr. Of these, the Magahi spoken in and around Gaya and Patna is generally considered standard, for social and political reasons.

Verbs in Magahi

Finite verbs in Magahi have three tenses: present, past and future. The present is unmarked, the past is marked by -l-, and -b- functions as the marker for the future. There are three aspects: progressive, stative and habitual. There are also two moods, presumptive and subjunctive, represented morphologically on the verb.
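The tense marking just described can be sketched as a small lookup. This is only an illustration in Python, not the authors' Java/JSP implementation; the function name `tense_of` and the hyphen-segmented input convention are assumptions, and `STEM` is a schematic placeholder rather than a Magahi form.

```python
# Tense markers as stated in the text: present is unmarked,
# past is -l-, future is -b-.
TENSE_MARKERS = {"l": "past", "b": "future"}


def tense_of(segmented_form: str) -> str:
    """Return the tense of a hyphen-segmented verb form.

    Assumption: the first segment is the stem and the remaining
    segments are suffixes; the first suffix that is a known tense
    marker decides the tense, otherwise the form is present.
    """
    _stem, *suffixes = segmented_form.split("-")
    for suffix in suffixes:
        if suffix in TENSE_MARKERS:
            return TENSE_MARKERS[suffix]
    return "present"


# "STEM" is a schematic placeholder, not an attested word.
print(tense_of("STEM-l-i"))  # past
print(tense_of("STEM-b"))    # future
print(tense_of("STEM"))      # present
```

A real analyser would also have to handle unsegmented input and the aspect and mood markers, which are not listed here.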

There are three basic types of verb stems in Magahi:
Primitive, monomorphic basic stems such as /kʰɑ-/, /d̪ekʰ-/, /sʊn-/, etc.
Derivative stems, formed by adding various kinds of derivative suffixes to a verbal or non-verbal stem.
Complex verbs, formed by adding various kinds of modals to the primitive and derived stems.

Complex verbs in Magahi

The complex verbs in Magahi can be divided into two categories: compound verbals and conjunct verbals. Compound verbals involve combinations of two verb stems. Conjunct verbals are those that involve the combination of a substantive (i.e., a noun or an adjective) and a verb stem.

Agreement in Magahi

The most intriguing and unique feature of Magahi is its agreement system. The verb in Magahi agrees with both the subject and the object simultaneously. There is no gender or number agreement in Magahi. The verb agrees with the person and honorificity of both subject and object.

Some examples:

(1) həm okərɑ d̪ekʰə-l-i-əi
    I him (-Honour) saw 3P object (-Honour)
    'I saw him'; 3P Object, -Honour

(2) həm ʊnkɑ d̪ekʰə-l-i-əin
    I him (+Honour) saw 3P object (+Honour)
    'I saw him'; 3P Object, +Honour
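The object agreement shown in these two examples can be sketched as a table keyed on the object's person and honorificity. This is a minimal Python illustration covering only the two attested endings; the names `BASE`, `OBJECT_AGREEMENT` and `with_object_agreement` are hypothetical, and the verb is written with a dental d̪ on the assumption that the stray space in the transcription marks a lost diacritic.

```python
# First-person-subject past form of 'see', taken from the examples
# (stem + past marker -l- + -i-). The dental diacritic is an assumption.
BASE = "d̪ekʰə-l-i"

# Object agreement endings attested in the examples only:
# (object person, object is honorific) -> suffix
OBJECT_AGREEMENT = {
    ("3", False): "əi",   # 3P object, -Honour
    ("3", True): "əin",   # 3P object, +Honour
}


def with_object_agreement(base: str, person: str, honorific: bool) -> str:
    """Append the object agreement suffix to an already inflected base."""
    key = (person, honorific)
    if key not in OBJECT_AGREEMENT:
        raise KeyError(f"no agreement suffix recorded for {key}")
    return f"{base}-{OBJECT_AGREEMENT[key]}"


print(with_object_agreement(BASE, "3", False))
print(with_object_agreement(BASE, "3", True))
```

Endings for other persons are not given in the examples, so the sketch raises an error rather than guessing them.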

Agreement with the object can also be suspended altogether in certain constructions, as in the following examples:

(1) həm d̪ekʰəli / d̪ekʰəlio
    'I saw'; Neutral object

(2) həm okərɑ d̪ekʰəliəi / d̪ekʰəlio
    'I saw'; 3P Object, -Honour

Magahi as LRL

According to the Census of India, 2001, Magahi is considered a dialect of Hindi. It is, however, a completely different language, more closely related to Bangla, Oriya, etc. than to Hindi. Literate, urban parents dissuade and forcefully stop children from using the language, since it is considered 'uncouth' and the 'language of the illiterate'.

Consequently, Magahi does not have any online resources. There is hardly any effort to develop such resources, since neither the government nor the speakers feel a need for computationally useful resources for the language. The basic aim of this analyser is to initiate some resource building and language processing for Magahi.

Needs for LRL

Like any other LRL, Magahi has two basic needs:
The need to standardise whatever few resources exist so that they can be utilised for developing different tools, applications, etc.
The need to develop the language foundations (i.e., basic grammatical descriptions, dictionaries, etc.) and tools so that these standardised resources can be utilised.

Developmental Phases for LRLs

There are four developmental phases for LRLs:
Initial Phase (Foundations): building of a lexical database.
Second Phase: basic tools like morphological analysers, POS taggers, etc.
Third Phase: development of advanced tools and applications like web crawlers and search engines.
Fourth Phase: development of general applications like those for information retrieval and extraction, question answering systems, etc.

Phases in Magahi

In the case of Magahi, the foundational work has yet to be completed. There is no corpus as such, since very little data, if any at all, has been transferred to the computer. However, the primary job at the foundational stage, i.e., the grammatical and linguistic description of the language, is complete to a very large extent.

The Analyser/Generator

In this paper we have also tried to take the work further to the second stage by developing a basic morphological analyser/generator for the verbs of Magahi. This tool analyses a given verb form and gives its grammatical category, and also generates the verb paradigm for that particular verb root.

The users are provided with a GUI in which they are required to input a verb root or verb form. The system gives the verb root and its grammatical category, and generates all other forms of the verb. The data for this analyser is stored in three files in UTF-8 encoding. One file has all the lemmas and their English equivalents, and the other two files have the inflections and the ECVs, along with the grammatical tags.

The input form is searched for in the list of roots with the help of a lexicon reader and a lexicon search engine. If it is found, all the inflections and ECVs are attached to it, and the result is returned and displayed to the user with all the forms. If it is not found, the system checks whether it is a derived form of a verb.
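The lookup flow described above can be sketched as follows. This is a hedged Python illustration, not the authors' Java/JSP code: the in-memory tables `ROOTS` and `INFLECTIONS` stand in for the UTF-8 data files, their contents are illustrative assumptions, and the generated strings are schematic root-plus-marker concatenations rather than attested surface forms.

```python
from typing import Optional

# Stand-ins for the data files; contents are illustrative assumptions.
ROOTS = {"kʰɑ": "eat", "sʊn": "hear"}            # root -> English equivalent
INFLECTIONS = [("l", "past"), ("b", "future")]    # (marker, grammatical tag)


def generate(root: str) -> list:
    """Attach every inflection to the root (schematic concatenation)."""
    return [(f"{root}-{marker}", tag) for marker, tag in INFLECTIONS]


def analyse(form: str) -> Optional[dict]:
    """Look the input up as a root first, then as a derived form."""
    if form in ROOTS:
        return {"root": form, "gloss": ROOTS[form],
                "tag": "root", "forms": generate(form)}
    for root, gloss in ROOTS.items():
        for candidate, tag in generate(root):
            if candidate == form:
                return {"root": root, "gloss": gloss,
                        "tag": tag, "forms": generate(root)}
    return None  # not found: the caller should ask for another form


print(analyse("sʊn"))
print(analyse("sʊn-b"))
print(analyse("xyz"))
```

Matching derived forms by regenerating each paradigm is the simplest choice for a small lexicon; a larger one would more likely strip suffixes from the input instead.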

The output is displayed both in the Devanagari script and in IPA. If the form is not found, there is no output and the system prompts the user to enter another verb root form. The system is developed using Java/JSP as the programming language in the web domain. A demo of the system: http://sanskrit.jnu.ac.in/student_projects/magahi-sea

The way ahead: Fixing up the Bugs

As is clear from the demo, the programme is not very clean. We need to fix a few issues, such as completing the IPA transcriptions. Searching will also be enabled through IPA and English equivalents.

The way ahead: CL system

We are planning to expand the system and make it more robust by adopting the 'construction labelling (CL) system' for enhancing the argument structure specification. This system is especially designed for LRLs and requires extensive linguistic expertise. It is a system for representing detailed morphosyntactic and semantic information in a computationally useful way.

The main aim of the CL system is to identify and enumerate all the construction types (within the linguistic limits) of a particular language in a particular domain, down to a certain degree of detail. In this system the construction types are represented by strings of letters and hyphens, which are called 'templates'. These templates are made up of 'labels'.

Each construction is described from the top: first its properties as a whole are given, followed by the properties of its main constituents, first their syntactic properties and finally their semantic properties. The area occupied by each type of label is called a 'slot'. Thus each slot consists of different labels, such as those for 'Parts of Speech', 'valency', etc.
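As a rough illustration of the template idea, a template string can be split into its labels and each label assigned to a slot by position. Everything specific in this Python sketch is an assumption: the example template "v-tr", the slot names, and the rule that hyphens separate slots are hypothetical and are not taken from the CL specification.

```python
# Hypothetical slot order: only 'Parts of Speech' and 'valency' are
# named in the text; further slots are simply numbered.
SLOT_NAMES = ["part_of_speech", "valency"]


def parse_template(template: str) -> dict:
    """Split a hyphen-separated template into slot -> label pairs."""
    slots = {}
    for index, label in enumerate(template.split("-")):
        if index < len(SLOT_NAMES):
            name = SLOT_NAMES[index]
        else:
            name = f"slot_{index + 1}"
        slots[name] = label
    return slots


# "v-tr" is an invented example template, not a documented one.
print(parse_template("v-tr"))
```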

This construction labelling approach would help in developing morphological analysers/generators (and many other tools and applications) that could analyse the morphemes of the words in a given sentence or even a complete text. Later on it could also be developed into a language generation tool.

Open to Questions!