Bulgarian Inflectional Morphology in Universal Networking Language


Velislava STOYKOVA
Institute for Bulgarian Language - BAS, 52, Shipchensky proh. str., bl. 17, 1113 Sofia, Bulgaria

ABSTRACT

The paper presents a web-based application of semantic networks to model Bulgarian inflectional morphology. It demonstrates the general ideas, principles, and problems of inflectional grammar knowledge representation used for encoding Bulgarian inflectional morphology in Universal Networking Language (UNL). The analysis of UNL formalism is outlined in terms of its expressive power to present inflection, and the principles and related programming encodings are explained and demonstrated.

KEYWORDS: Morphology and POS tagging, Grammar and formalisms, Under-resourced languages.

Proceedings of COLING 2012: Demonstration Papers, pages 423-430, COLING 2012, Mumbai, December 2012.

1 Introduction

Modeling inflectional morphology is a key problem for any natural language processing application for the Bulgarian language. It can lead to a wide range of practical applications; however, different formal models and theories offer different insights into the encoding of almost all grammar features and allow the use of related encoding principles.

2 General problems with applications of word inflectional morphology

The problems facing natural language processing applications for word inflectional morphology are generally of two types: (i) problems of language theory at the level of phonology, morphonology, and morphology, and (ii) the adequacy of existing methodologies and techniques for building applications capable of interpreting the complexity of natural language phenomena. Thus, formal representations and interpretations of inflectional morphology need a logical framework that can deal with regularity, irregularity, and subregularity and that provides a logical basis for interpreting language phenomena such as suppletion, syncretism, declension, conjugation, and paradigm.

2.1 The traditional academic representation and computational morphology formal models of inflectional morphology

The traditional interpretation of inflectional morphology given in academic descriptive grammars (Popov and Penchev, 1983) is a presentation of tables. The tables list all possible inflected forms of a word with respect to its grammar features. Artificial intelligence (AI) techniques offer a computationally tractable encoding preceded by a related semantic analysis, which suggests a subsequent architecture. Representing inflectional morphology in AI frameworks is, in fact, representing a specific type of grammar knowledge.
The computational approach to both derivational and inflectional morphology represents words as a rule-based concatenation of morphemes, and the main task is to construct the relevant rules for their combination. The problem of how to segment words into morphemes is central, and there are two basic approaches to it (Blevins, 2001). The first is the Word and Paradigm (WP) approach, which uses the paradigm to segment morphemes. The second is the Item and Arrangement (IA) approach, which uses sub-word units and morpho-syntactic units for word segmentation.

With respect to the number and types of morphemes, different theories offer different approaches depending on whether stems or suffixes vary: (i) the conjugational solution uses an invariant stem and variant suffixes, and (ii) the variant stem solution uses variant stems and an invariant suffix. Both approaches suit languages that rarely use inflection to express syntactic structures. For languages with rich inflection, in cases where phonological alternations appear both in the stem and in the concatenated morpheme, a "mixed" approach is used to account for the complexity. Complicated cases in which both prefixes and suffixes have to be processed also require this approach. We regard the "mixed" approach as the most appropriate for the task because it treats both stems and suffixes as variables and can also account for specific phonetic alternations. An additional requirement is that, during inflection, all generated inflectional rules (using both prefixes and suffixes) have to produce more than one type of inflected form.

Figure 1: The word structure according to the general linguistic morphological theory.

2.2 Interpreting sound alternations

Sound alternations influence the inflectional morphology of almost all parts of speech in standard Bulgarian and, as a result, produce irregular word forms. In fact, we have a rather unsystematic variety of regular and irregular sound alternations, which is very difficult to interpret formally. The phonetic alternations in Bulgarian are of various types and influence both derivational and inflectional morphology. The general morphological theory offers a segmentation of words (Fig. 1) into a root to which prefixes, suffixes or endings are attached. In Bulgarian, all three types of morphemes are used, and additional difficulties come from the fact that sound alternations can occur in stems, prefixes, and suffixes, and also on their boundaries, which calls for extremely complicated solutions.

3 The Universal Networking Language

In the UNL approach, information conveyed by natural language is represented as a hypergraph composed of a set of directed binary labelled links (referred to as "relations") between nodes or hypernodes (the "Universal Words", UWs), which stand for concepts (Uchida and Della Senta, 2005). UWs can also be annotated with "attributes" representing context information (UNL, 2011).

Universal Words (UWs) represent universal concepts and correspond to the nodes to be interlinked by "relations" or modified by "attributes" in a UNL graph. They can be associated with natural language open lexical categories (noun, verb, adjective and adverb). Additionally, UWs are organized in a hierarchy (the UNL Ontology), and are defined in the UNL Knowledge Base and exemplified in the UNL Example Base, which are the lexical databases for UNL. As language-independent semantic units, UWs are equivalent to the sets of synonyms of a given language, approaching the concept of "synset" used by WordNet.
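The graph described above can be sketched as a small data structure. This is a minimal illustration only: the class names, the concept labels and the attribute and relation labels are hypothetical and are not taken from the UNL specification, and hypernodes are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of the graph: one Universal Word plus its attributes."""
    uw: str                                       # concept label (hypothetical spelling)
    attributes: set = field(default_factory=set)  # annotations attached to this node


@dataclass(frozen=True)
class Relation:
    """A directed, labelled, binary link between two nodes."""
    label: str
    source: str
    target: str


@dataclass
class UNLGraph:
    nodes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_node(self, uw, *attributes):
        self.nodes[uw] = Node(uw, set(attributes))

    def relate(self, label, source, target):
        # A relation may only link nodes that are already in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be nodes of the graph")
        self.relations.append(Relation(label, source, target))


# Illustrative graph; the labels below are assumptions, not official UNL tags.
graph = UNLGraph()
graph.add_node("speak", "past")
graph.add_node("prince", "def")
graph.add_node("language")
graph.relate("agent", "speak", "prince")
graph.relate("object", "speak", "language")
```

Keeping attributes on the node and relations in a separate list mirrors the distinction between annotations on a single node and links between two nodes.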
Attributes are arcs linking a node to itself. In contrast to relations, they correspond to one-place predicates, i.e., functions that take a single argument. In UNL, attributes have normally been used to represent information conveyed by natural language grammatical categories (such as tense, mood, aspect, number, etc.). Attributes are annotations made to nodes or hypernodes of a UNL hypergraph. They denote the circumstances under which these nodes (or hypernodes) are used. Attributes may convey three different kinds of information: (i) information on the role of the node in the UNL graph, (ii) information conveyed by bound morphemes and closed classes, such as affixes (gender, number, tense, aspect, mood, voice, etc.), determiners (articles and demonstratives), etc., and (iii) information on the (external) context of the utterance. Attributes represent information that cannot be conveyed by UWs and relations.

Relations are labelled arcs connecting a node to another node in a UNL graph. They correspond to two-place semantic predicates holding between two UWs. In UNL, relations have normally been used to represent semantic cases or thematic roles (such as agent, object, instrument, etc.) between UWs.

Figure 2: The statistical word distribution of part-of-speech for UNL interpretation of Bulgarian inflectional morphology.

UNL-NL Grammars are sets of rules for translating UNL expressions into natural language (NL) sentences and vice versa. They are normally unidirectional, i.e., either an enconversion grammar (NL-to-UNL) or a deconversion grammar (UNL-to-NL), even though both share the same basic syntax. In the UNL Grammar there are two basic types of rules: (i) transformation rules, used to generate natural language sentences out of UNL graphs and vice versa, and (ii) disambiguation rules, used to improve the performance of transformation rules by constraining their applicability.

UNL offers a universal, language-independent and open-source platform for multilingual web-based applications (Boitet and Cardenosa, 2007), available for many languages (Martins, 2011), including Slavonic languages such as Russian (Boguslavsky, 2005).

3.1 Representing Bulgarian inflectional morphology in UNL

The UNL specifications offer types of grammar rules particularly designed to interpret inflectional morphology with respect to prefixes, suffixes, and infixes, as well as to the sound alternations that take place during inflection. Thus, UNL allows two types of inflectional transformation rules: (i) A-rules (affixation rules), which apply over isolated word forms (to generate possible inflections), and (ii) L-rules (linear rules), which apply over lists of word forms (to provide transformations in the surface structure).

Affixation rules are used for adding morphemes to a given base form in order to generate inflections or derivations. There are two types of A-rules: (i) simple A-rules involve a single action (such as prefixation, suffixation, infixation and replacement), and (ii) complex A-rules involve more than one action (such as circumfixation).
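The distinction between simple and complex A-rules can be sketched in Python as functions over a base form. This is an illustrative model, not UNL rule syntax: the function names are hypothetical, `replacement` is assumed here to rewrite the last occurrence of a substring, and the morphemes "pre" and "suf" are placeholders.

```python
from functools import reduce


def prefixation(morpheme):
    """Simple A-rule: add a morpheme at the beginning of the base form."""
    return lambda base: morpheme + base


def suffixation(morpheme):
    """Simple A-rule: add a morpheme at the end of the base form."""
    return lambda base: base + morpheme


def infixation(morpheme, position):
    """Simple A-rule: insert a morpheme at a given position inside the base form."""
    return lambda base: base[:position] + morpheme + base[position:]


def replacement(old, new):
    """Simple A-rule: change the base form by rewriting the last occurrence of `old`."""
    def rule(base):
        head, found, tail = base.rpartition(old)
        # rpartition returns an empty separator when `old` does not occur.
        return head + new + tail if found else base
    return rule


def complex_rule(*actions):
    """Complex A-rule: apply several simple actions one after another."""
    return lambda base: reduce(lambda form, action: action(form), actions, base)


# Circumfixation modelled as prefixation followed by suffixation.
circumfixation = complex_rule(prefixation("pre"), suffixation("suf"))
print(circumfixation("base"))  # prebasesuf
```

Modelling each rule as a function from string to string makes a complex rule simply the composition of simple ones.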

Figure 3: The inflectional rules definitions for the word "ezik".

There are four types of simple A-rules: (i) prefixation, for adding morphemes at the beginning of the base form, (ii) suffixation, for adding morphemes at the end of the base form, (iii) infixation, for adding morphemes to the middle of the base form, and (iv) replacement, for changing the base form.

The analysed application of Bulgarian inflectional morphology (Noncheva and Stoykova, 2011) was developed within The Little Prince Project of the UNDL Foundation, which aims to develop UNL grammar and lexical resources for several European languages based on the book The Little Prince. Hence, the lexicon is limited to the text of the book. It offers an interpretation of inflectional morphology for nouns, adjectives, numerals, pronouns (Stoykova, 2012) and verbs, which uses A-rules (Fig. 2).

The UNL interpretation of nouns defines 74 word inflectional types. Every inflectional type uses its own rules to generate all possible inflected forms for the features of number and definiteness. Here we analyse the inflectional rules of the Bulgarian word for language, "ezik"¹.

The inflectional rules for generating all inflected word forms are defined as separate rules (Fig. 3). The four suffixation rules (including the one labelled MUL) use the idea of introducing stems to which the inflectional morphemes are added. The two A-rules for replacement, labelled PLR, also reflect the idea of introducing inflectional stems consisting of root plus infix. The generated inflected word forms of the example word "ezik" are given in Fig. 4.

In general, the UNL scheme for presenting lexical information follows the idea of WordNet for semantic hierarchical representation and also allows the presentation of synonyms and of the translation of the word, which is introduced in the application as well.
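As a rough illustration of how suffixation and replacement rules can generate a noun paradigm, the sketch below encodes each rule as a pair (number of final characters to drop, ending to add). This encoding is a hypothetical simplification and not the UNL rule syntax. The forms are the standard Bulgarian forms of the noun in a Latin transliteration assumed here for illustration (the vowel of the full definite article is written "u"); they are not copied from Fig. 3 or Fig. 4.

```python
# Hypothetical rule table for the noun "ezik" (language).
# Each entry maps a descriptive label to (characters to drop, ending to add).
# A drop count of 0 behaves like suffixation; a drop count of 1 replaces the
# final consonant, which models the k -> c alternation in the plural.
EZIK_RULES = {
    "singular": (0, ""),
    "singular, short definite": (0, "a"),
    "singular, full definite": (0, "ut"),
    "plural": (1, "ci"),
    "plural, definite": (1, "cite"),
}


def apply_rule(base, drop, ending):
    """Drop `drop` final characters of the base form, then add the ending."""
    return base[: len(base) - drop] + ending


def paradigm(base, rules):
    """Generate every inflected form defined by the rule table."""
    return {label: apply_rule(base, drop, ending)
            for label, (drop, ending) in rules.items()}


forms = paradigm("ezik", EZIK_RULES)
for label, form in forms.items():
    print(f"{label}: {form}")
```

Folding the sound alternation into the rule itself, rather than into a separate phonological component, corresponds to interpreting the alternation within the definition of the inflectional rule.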
¹ Here and elsewhere in the description we use the Latin alphabet instead of Cyrillic. Because the two alphabets do not match, some Bulgarian phonological alternations are written with two letters instead of the single letter used in the Cyrillic alphabet.

Figure 4: The word forms of the word "ezik" generated by the system.

Adjectives are defined by 14 word inflectional types, and every inflectional type uses its own rules to generate all possible inflected forms for the features of gender, number and definiteness. The interpretation of numerals and pronouns consists of 5 and 6 word inflectional types, respectively. Verbs, in turn, are represented by 48 inflectional types. The UNL interpretation also offers a syntactic and a semantic account. The syntactic account is represented by 21 syntactic rules for subcategorization frames and linearization, together with rules that define the semantic relations.

In general, the UNL interpretation of Bulgarian inflectional morphology interprets sound alternations mostly through A-rules. The inflectional rules are defined without the use of a hierarchical inflectional representation, even though they define the related inflectional types. The sound alternations and the irregularity are interpreted within the definition of the main inflectional rule. The UNL application also represents a web-based intelligent information and knowledge management system which allows different types of semantic search with respect to context, such as search for semantic co-occurrence relations, keywords or key concepts.

Conclusion

The demonstrated application of Bulgarian inflectional morphology uses the formal representation schemes of semantic networks, with UNL as the formalism. However, it encodes the inflectional knowledge using both the expressive power and the limitations of the formalism used. The UNL knowledge representation scheme offers well-defined types of inflectional rules and differentiates inflectional, semantic, and lexemic hierarchies. The treatment of inflectional classes as nodes in the inflectional hierarchy is used extensively as well. The application is open to further improvement and development through additional grammar rules and a larger database for use in different projects.

References

UNL (2011). URL http://www.undl.org.

Blevins, J. (2001). Morphological paradigms. Transactions of the Philological Society, 99:207-210.
Boguslavsky, I. (2005). Some lexical issues of UNL. In J. Cardenosa, A. Gelbukh, E. Tovar (eds.) Universal Network Language: Advances in Theory and Applications. Research on Computing Science, 12:101-108.

Boitet, C., B. I. and Cardenosa, J. (2007). An evaluation of UNL usability for high quality multilingualization and projections for a future UNL++ language. In A. Gelbukh (ed.) Proceedings of CICLing, Lecture Notes in Computer Sciences, 4394:361-376.

Martins, R. (2011). Le Petit Prince in UNL. In Proceedings from Language Resources Evaluation Conference 2011, pages 3201-3204.

Noncheva, V., S. Y. and Stoykova, V. (2011). The Little Prince Project encoding of Bulgarian grammar. http://www.undl.org (UNDL Foundation).

Popov, K., G. E. and Penchev, J. (1983). The Grammar of Contemporary Bulgarian Language (in Bulgarian). Bulgarian Academy of Sciences Publishing House.

Stoykova, V. (2012). The inflectional morphology of Bulgarian possessive and reflexive-possessive pronouns in Universal Networking Language. In A. Karahoca and S. Kanbul (eds.) Procedia Technology, 1:400-406.

Uchida, H., Z. M. and Della Senta, T. (2005). Universal Networking Language. UNDL Foundation.