Development of the First LRs for Macedonian: Current Projects


Ruska Ivanovska-Naskova
Faculty of Philology, University St. Cyril and Methodius
Bul. Krste Petkov Misirkov bb, 1000 Skopje, Macedonia
rivanovska@flf.ukim.edu.mk

Abstract

This paper briefly presents several ongoing projects that aim to develop the first language resources (LRs) for Macedonian: the raw corpus compiled by Prof. George Mitrevski at Auburn University; the preparations for compiling a reference corpus of written Macedonian at the MASA (Macedonian Academy of Sciences and Arts); the first small annotated corpus, based on the Macedonian translation of Orwell's 1984; the electronic dictionary of simple words created by Aleksandar Petrovski for the Macedonian module of the corpus processing system Intex/Nooj; and the morphological dictionary developed by the LTRC (Language Technology Research Center). We also discuss the importance of developing basic LRs for Macedonian, both as a means of preservation and as a prerequisite for the creation of the first commercial language products for this Slavic language.

1. Introduction

The Macedonian language belongs to the group of minority, or so-called lesser-used, languages which, owing to a lack of funding, a shortage of specialized human resources and a relatively small market for commercial language products, lag far behind the leading languages in the field of computational linguistics. Still, the creation of LRs for Macedonian is essential for its preservation: it will encourage linguistic investigations that include this Slavic language and will at the same time act as a starting point for the development of commercial language products. We present a few ongoing projects that aim to develop the first corpora and electronic dictionaries for Macedonian.

2. Corpora

2.1 The first Macedonian online corpus

The first initiative was launched by Prof. George Mitrevski at Auburn University, Alabama, who has been working on the compilation of a corpus of written Macedonian (Mitrevski, in press). An incomplete version (approximately one million words) can be consulted on the Web (http://omilia.uio.no/ce/mak/). It is a raw corpus developed with the IMS Corpus Workbench of the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. The corpus is made up of around 10 different types of texts, mainly retrieved from the Internet. Each text is described with several parameters: its number in the database, title, author, genre, subject, publisher, date of publication, date of registration in the database, ISBN or other identification number, text format, whether a sample or the entire text is included, number of words, etc. The corpus can be used to build concordances for single words or groups of up to five words, to extract collocations, etc. Queries can be applied to the whole corpus or only to a group of texts selected according to the criteria included in the description of the texts. Future development concerns the size of the corpus (it is planned to reach ten million words) and POS annotation compatible with the Corpus Encoding Standard (CES) and the multilingual MULTEXT-East data set.

2.2 The MASA reference corpus of written Macedonian

The second initiative comes from the Research Center for Areal Linguistics at the Macedonian Academy of Sciences and Arts. Academician Zuzana Topoljinska and her team launched the idea of developing a reference corpus of the written Macedonian language, as part of a larger regional project of the network of the Academies of South-Eastern Europe. The project is still in its preparatory phase, and many features of the corpus and the strategy for its development are yet to be defined.
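The concordance queries described in Section 2.1 (single words or groups of up to five words, optionally restricted to texts selected by their descriptive parameters) can be illustrated with a minimal sketch. This is not the IMS Corpus Workbench query mechanism; the functions select_texts and concordance, the metadata key genre and the toy English texts are all assumptions made for the example.

```python
def select_texts(texts, **criteria):
    """Keep only the texts whose metadata match every given criterion."""
    return [t for t in texts
            if all(t.get(key) == value for key, value in criteria.items())]


def concordance(tokens, query, width=4):
    """Return (left context, match, right context) for each hit of the query.

    The query is a list of one to five words, mirroring the limit
    mentioned for the online corpus; matching ignores letter case.
    """
    if not 1 <= len(query) <= 5:
        raise ValueError("query must contain between one and five words")
    wanted = [w.lower() for w in query]
    lowered = [t.lower() for t in tokens]
    n = len(wanted)
    hits = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == wanted:
            left = " ".join(tokens[max(0, i - width):i])
            match = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            hits.append((left, match, right))
    return hits


# Toy data (hypothetical): two "texts" with a genre parameter.
TEXTS = [
    {"genre": "fiction",
     "tokens": "the cat sat on the mat and the cat slept".split()},
    {"genre": "news",
     "tokens": "the cat show opened today".split()},
]

fiction = select_texts(TEXTS, genre="fiction")
fiction_hits = concordance(fiction[0]["tokens"], ["the", "cat"], width=2)
```

Restricting the query to a subcorpus is then a matter of calling select_texts first and running concordance over each selected text.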
Currently a team of linguists is working on a typology of the authentic texts written in modern Macedonian that will be included in the corpus, on their preparation and on the creation of a referencing system. The planned annotation is at the morphological level (POS tagging). A team of engineers is selecting tools suited to the nature of the Macedonian language and adapting them for organizing and processing the corpus. The actual compilation of the corpus is planned to start in the course of 2006. A similar initiative is to be undertaken by the Institute of the Macedonian Language Krste Petkov Misirkov (Venovska-Antevska, 2005).

2.3 The tagged Macedonian translation of Orwell's 1984

This first small annotated corpus of the Macedonian language is part of the bilateral Macedonian-Slovene project "Gathering, annotation and analysis of Macedonian-Slovene language resources", in which several researchers participated (Zdravkova et al., 2005; Vojnovski et al., 2005). It is the first attempt to create Macedonian morpho-lexical resources that conform to the MULTEXT-East guidelines. The text of the corpus is the Macedonian translation of Orwell's 1984. The first stage in creating the corpus was scanning the paper version of the text and converting it into a digital format. The preprocessing stage also included segmentation, tokenization and the compilation of a dictionary of the word forms, which were later annotated. Each word form is associated with a morpho-syntactic description: a part-of-speech tag (11 grammatical categories) and information about the corresponding attributes (84 for the Macedonian language) and their values (134). The morpho-syntactic description is represented as a string of characters. The word forms were classified semi-automatically: 60% automatically, according to their inflection, and the rest manually. Each word is also associated with its lemma.

This annotated corpus was used to train TnT (Trigrams'n'Tags), an efficient, language-independent statistical part-of-speech tagger suitable for training on large corpora. Tested on the same corpus, the tagger achieved an accuracy of 98.1%. Further work of this research group concerns the finalization of the lexical lists through a rule-based lexicon, retraining of the tagger and testing it on another text.

3. Electronic dictionaries

The second part of this paper focuses on the compilation of two electronic dictionaries.

3.1 Intex/Nooj electronic dictionary of simple words

Aleksandar Petrovski is developing an electronic dictionary of simple words, which is the starting point for the creation of the Macedonian module of the corpus processing system Intex, recently renamed Nooj (Petrovski, 2005; Silberztein, 2005). One of the main aims of Intex/Nooj is to allow the construction of formalized descriptions of languages and their application to large corpora compiled according to the needs of the user. The main linguistic resources of this development environment are the morpho-syntactic e-dictionaries and various types of grammars (inflectional, derivational, lexical, orthographical, syntactic, semantic, etc.) represented as sets of graphs. Finite-state automata, finite-state transducers and other computational devices are used to formalize the linguistic phenomena. The system works through language-specific modules, developed by several teams of researchers, that can be upgraded and modified by the user. The level of elaboration of the modules differs from language to language.

The core of each language module is its e-dictionaries, which conform to the methodology promoted by the RELEX network: the first step is the creation of a dictionary of lemmas with their corresponding inflectional codes (DELAS), from which the dictionary of all inflected forms (DELAF) is built automatically. Petrovski has started working on the Macedonian DELAS, which was presented at the 8th Nooj workshop. The main source used for the creation of the lexical database was Blaze Koneski's traditional dictionary of the Macedonian language. The dictionary was scanned and the scanning errors were corrected. The basic tag set, presented in Figure 1, consists of 10 grammatical categories represented by corresponding codes. Each grammatical category is further described with a certain number of attributes and values (e.g., nouns are described with four attributes: gender with three values, number with four, case with three and definiteness with three).

Figure 1: Table of the word groups, the categories and the codes (Petrovski, 2005)

At the time the paper was presented, the Macedonian DELAS consisted of 61,296 lemmas, which produce 426,161 inflected forms, as shown in Figure 2.

Figure 2: The total number of DELAS and DELAF entries (Petrovski, 2005)

Future activities concern the compilation of the electronic dictionary of compound words (DELACF) and of a set of local grammars that can be used for disambiguation when a text is being processed. Furthermore, the Macedonian module should be adapted to the new version of the system, called Nooj (Silberztein, 2005), which differs from Intex in several respects, especially in the organization and compilation of the dictionaries. The basic feature of the Nooj dictionaries is the absence of the dictionary of inflected forms (DELAF) and the co-existence of both simple and compound words in the same dictionary. Other main innovations in Nooj include processing corpora rather than single texts and support for more than 100 file formats, which makes the system quite flexible and easy to use. The development of a rich Macedonian module for Nooj will allow linguists to use Nooj for processing Macedonian corpora, as well as for tasks such as terminology extraction.

Figure 3: Example of text in Macedonian processed with INTEX (Petrovski, 2005)
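The DELAS-to-DELAF step described in this section, where each lemma carries an inflectional code from which all inflected forms are generated, can be sketched as follows. This is a minimal illustration, not the Intex/Nooj implementation or its dictionary format: the paradigm code N_F_A, the tuple layout and the feature strings are assumptions made for the example, the paradigm covers only four forms, and the transliterated nouns masa and kniga are toy entries.

```python
# Hypothetical paradigm table: inflectional code -> list of rules.
# Each rule is (number of final characters to strip, suffix to add, features).
PARADIGMS = {
    "N_F_A": [
        (0, "", "f:sg"),
        (0, "ta", "f:sg:def"),
        (1, "i", "f:pl"),
        (1, "ite", "f:pl:def"),
    ],
}

# Toy DELAS: (lemma, grammatical category, inflectional code).
DELAS = [
    ("masa", "N", "N_F_A"),
    ("kniga", "N", "N_F_A"),
]


def inflect(lemma, category, code, paradigms=PARADIGMS):
    """Generate (form, lemma, category, features) for one DELAS entry."""
    forms = []
    for strip, suffix, features in paradigms[code]:
        stem = lemma[:len(lemma) - strip]
        forms.append((stem + suffix, lemma, category, features))
    return forms


def build_delaf(delas, paradigms=PARADIGMS):
    """Expand every DELAS entry into its inflected forms."""
    delaf = []
    for lemma, category, code in delas:
        delaf.extend(inflect(lemma, category, code, paradigms))
    return delaf


DELAF = build_delaf(DELAS)
```

With the figures reported above (61,296 lemmas yielding 426,161 forms), each lemma expands to roughly seven inflected forms on average, so keeping only lemmas and codes by hand and generating the rest is a considerable saving.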

3.2 Morphological dictionary

The second electronic dictionary is being constructed by the recently established LTRC (Language Technology Research Center). It is a morphological dictionary that associates each inflected form with a lemma and with inflectional information represented by different tags. Various sources were used to extract the words included in the initial database: lexicons, corpora of texts retrieved from the Internet, etc. Once the raw database of word forms was completed, the team had to extract the lemmas and to elaborate a methodology for generating all inflected forms. This proved to be a difficult task, since the Macedonian language is characterized by a complex morphological system on both the derivational and the inflectional level. The analysis and generation of the inflected forms was done semi-automatically: around 50 inflectional paradigms for 10 word classes were developed, and each lemma was assigned a code referring to the corresponding paradigm. The expanded database was then corrected manually, since many of the lemmas are inconsistent with the corresponding inflectional rules. At the same time the LTRC group elaborated a tag set, which was used for the annotation of the word forms.

CLASS
NN        noun
V         verb
ADJ       adjective
ADV       adverb
PREP      preposition
CONJ      conjunction
PRON      pronoun
NUM       numeral
PART      particle
INTERJ    interjection
ABBR      abbreviation
ADJPRON   adjective/pronouns

Figure 4: The word classes in the LTRC dictionary

Figure 4 shows the tags used for the different word classes, which are almost identical to the tags of the previous dictionary and of the annotated corpus of Orwell's 1984.
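A morphological dictionary of this kind, which maps each inflected form to a lemma and a word-class tag, can be pictured as a simple lookup table. The sketch below is only an illustration under stated assumptions: the entry layout, the function names build_index, analyze and coverage, and the transliterated toy entries are hypothetical and do not reproduce the LTRC database. It also shows how the share of corpus tokens recognized by such a dictionary could be measured.

```python
from collections import defaultdict

# Toy entries (hypothetical): (inflected form, lemma, word-class tag).
ENTRIES = [
    ("kniga", "kniga", "NN"),
    ("knigata", "kniga", "NN"),
    ("knigi", "kniga", "NN"),
    ("knigite", "kniga", "NN"),
    ("vo", "vo", "PREP"),
    ("i", "i", "CONJ"),
]


def build_index(entries):
    """Index the entries by lower-cased form; a form may have several analyses."""
    index = defaultdict(list)
    for form, lemma, word_class in entries:
        index[form.lower()].append((lemma, word_class))
    return index


def analyze(index, token):
    """Return all (lemma, word class) analyses of a token, or an empty list."""
    return index.get(token.lower(), [])


def coverage(index, tokens):
    """Fraction of tokens that have at least one analysis in the dictionary."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if analyze(index, token))
    return known / len(tokens)


INDEX = build_index(ENTRIES)
SAMPLE = ["Knigata", "i", "knigite", "vo", "grad"]
SAMPLE_COVERAGE = coverage(INDEX, SAMPLE)
```

Storing a list of analyses per form leaves room for ambiguous forms, which a tagger or local grammars would later have to resolve.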
SUBCLASS                     SUBCLASS
VOC    vocative              ADJC   comparative adjective
OBL    oblique               ADJS   superlative adjective
DIM    diminutive            VADJ   verbal adjective
AUG    augmentative          PERS   personal
PEJ    pejorative            REL    relative
PR     present               POS    possessive
IM     imperfect             DEM    demonstrative
AO     aorist                IND    indefinite
IML    imperfect l           VADV   verbal adverb
AOL    aorist l              NUM    numerals

Figure 5: Subtypes in the LTRC dictionary

Still, the tag set of the dictionary differs slightly from the MULTEXT-East notation system in the organization of the attributes and values, as presented in Figure 5. Besides the two columns that represent the type and the subtype of the word, there are several others that hold information about gender, number, article (three different types of articles are added as suffixes to nouns), case (if any) and an identification number. Currently the dictionary contains 1,535,668 generated word forms, distributed as follows:

ABBR            249
ADJ         603,863
ADJPRON         128
ADV          13,938
CONJ             57
INTERJ          189
NN          407,351
NUM             288
PART             56
PREP             63
PRON            413
V           509,073
total     1,535,668

Figure 6: Distribution of the word classes

The high number of nouns can be explained by the relatively large database of proper names included in the dictionary. The high number of adjective word forms is due to the fact that the comparative and the superlative are analytical. The dictionary was tested on a half-million-word corpus and recognized 99.02% of the words.

4. Conclusions and future work

The corpora are intended as a source of data for linguistic research. They will help to capture all the meanings of a word, their frequency and the contexts in which they appear. Information on the relevance and frequency of each meaning can be incorporated in the lexicon of the Macedonian language. The reference corpus can also be used for more complex research, for the detection of word patterns and for the enlargement of the Macedonian morphological lexicon.
The corpora, as well as the dictionaries, will be used in the future to build various NLP tools.

5. References

Istrazuvacki centar za arealna lingvistika (2005). Skopje: Makedonska Akademija na naukite i umetnostite.

Mitrevski, G. Makedonski elektronski korpus: dizajn, implementacija, pristap. In Predavanja na XXXVIII megunaroden seminar za makedonski jazik, literatura i kultura. Skopje: UKIM, Megunaroden seminar za makedonski jazik, literatura i kultura. In press.

Petrovski, A. (2005). Macedonian DELAS: first results. laseldi.univ-fcomte.fr/document/colloque/nooj_2005/powerpoint/petrovski.ppt

Petrovski, A. Za makedonskata kompjuterska leksikografija. In Jazicnata politika. Informatikata i lingvistikata. Denovi posveteni na Blagoja Korubin, maj 2005. Skopje: Institut za makedonski jazik Krste Petkov Misirkov. In press.

Silberztein, M. (2004). Intex. http://msh.univ-fcomte.fr/intex/downloads/manual.pdf

Silberztein, M. (2005). Nooj. http://perso.wanadoo.fr/rosavram/nooj%20manual.pdf

Venovska-Antevska, S. (2005). Makedonski jazicen korpus. Ideja, moznosti, realizacija. In Predavanja na XXXVII seminar za makedonski jazik, literatura i kultura. Skopje: UKIM, Megunaroden seminar za makedonski jazik, literatura i kultura, pp. 77-92.

Vojnovski, V., S. Dzeroski and T. Erjavec (2005). Learning POS Tagging from a Tagged Macedonian Text Corpus. In Proceedings of SIKDD 2005, Ljubljana, 2005. In press.

Zdravkova, K., A. Ivanovska, S. Dzeroski and T. Erjavec (2005). Learning Rules for Morphological Analysis and Synthesis of Macedonian Nouns. In Proceedings of SIKDD 2005, Ljubljana, 2005. In press.
