
M. Hnátková, V. Petkevič, H. Skoumalová

LINGUISTIC ANNOTATION OF CORPORA IN THE CZECH NATIONAL CORPUS

This paper was supported by the grant MSM0021620823 of the Ministry of Education of the Czech Republic.

0. Introduction

In the project "Czech National Corpus and the Corpora of Other Languages", the key role is played by extensive corpora (comprising hundreds of millions of words) of written contemporary Czech: SYN2005, SYN2006PUB, SYN2009PUB, SYN2010 and SYN (cf. http://www.korpus.cz). An important characteristic of these corpora is that their texts are lemmatized and morphologically annotated. We describe the whole system and the individual phases of this entirely automatic process of linguistic annotation.

1. Phases of linguistic annotation

The linguistic annotation consists of the following three phases:
a) the morphological phase,
b) the disambiguation phase,
c) the complementary phase.

1.1 Morphological phase

At the beginning of the processing, the text of a document is a plain sequence of characters, including blank spaces. The morphological phase is composed of the following parts:

a) premorphological phase: preprocessing of the input plain text, consisting in the concatenation of neighbouring strings or in splitting strings into several strings separated by blanks, as well as in corrections of obvious typos;

b) morphological analysis in a broader sense, which involves: tokenization, i.e. identification of individual textual words and punctuation marks as independent elements (tokens); sentence segmentation, i.e. separation of the input text into sentences based on punctuation and segmentation rules; and morphological analysis proper, in which each token is assigned all of its lemmas, i.e. representations of the lexemes pertaining to the token, and all of its part-of-speech (POS) and morphological properties in the form of tags.

Morphological analysis is performed by a morphological analyzer: it analyses every token and assigns to it all of its lemmas and POS properties regardless of context, i.e. only on the basis of the token itself and its properties recorded in the morphological dictionary. The dictionary contains ca 350,000 lexemes, including 155,000 proper names. It comprises primarily the vocabulary of the Dictionary of the Standard Czech Language (Slovník spisovného jazyka českého, Praha, 1960-1971) and the Dictionary of Standard Czech (Slovník spisovné češtiny, Praha, 1994), which contain 194,000 and 57,000 lexemes, respectively. In addition, it includes further lexemes derived from these, e.g. deadjectival nouns and adverbs, and it is gradually being extended.

c) postmorphological phase: ad hoc corrections of possible errors in morphological analysis that, for organizational reasons, could not be rectified directly in the morphological dictionary.
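To make the context-free character of this step concrete, a minimal Python sketch of dictionary lookup follows. The toy dictionary holds only the readings of the word poslech discussed in section 1.2 below; it is an invented fragment, not the project's actual morphological dictionary or analyzer.

    # Minimal sketch of context-free morphological analysis (illustrative only).
    # Each dictionary entry maps a word form to all of its (lemma, tag) readings,
    # which are assigned to the token regardless of context.
    from typing import Dict, List, Tuple

    Reading = Tuple[str, str]  # (lemma, positional tag)

    # Toy fragment of a morphological dictionary; the real one holds ca 350,000 lexemes.
    MORPH_DICT: Dict[str, List[Reading]] = {
        "poslech": [
            ("poslech", "NNIS1-----A-----"),      # noun, masc. inanimate, nom. sg.
            ("poslech", "NNIS4-----A-----"),      # noun, masc. inanimate, acc. sg.
            ("posel", "NNMP6-----A-----"),        # noun, masc. animate, loc. pl.
            ("poslechnout", "VpYS---XR-AA---6"),  # verb, past participle
        ],
    }

    def analyze(token: str) -> List[Reading]:
        """Return every reading of the token found in the dictionary."""
        return MORPH_DICT.get(token.lower(), [])

    if __name__ == "__main__":
        for lemma, tag in analyze("poslech"):
            print(f"{lemma}\t{tag}")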

1.2 Disambiguation phase

The morphological phase is followed by the disambiguation phase: for homographic tokens, the lemmas and the POS and morphological tags are disambiguated. Natural language texts can generally be disambiguated with respect to POS and morphology by three types of methods:

(i) statistically (stochastically), on the basis of machine learning;
(ii) by linguistic rules, either (ii1) rules automatically inferred from texts, or (ii2) hand-crafted rules;
(iii) by a combination of (i) and (ii), i.e. a hybrid method.

For the disambiguation of the Czech corpora, method (iii) was selected as the optimum: it combines the statistical tagger MorČe (= Morfologie češtiny) and the LanGr tagger based on hand-crafted rules (type (ii2)).

The statistical tagger MorČe is based on machine learning: it uses a training corpus of several hundred thousand words, and some of its features or their combinations can be parameterized. At present it is the best statistical tagger for Czech. It is also very robust: it need not be retrained when the tagset or the input data are moderately modified.

The other tagger, LanGr, is based on a system of thousands of manually written rules that are (a) developed on the basis of linguistic introspection and checked against corpus data, and (b) inferred non-automatically from corpus data. The linguistic rules of LanGr are written in a special programming language, and their application consists in the context-based gradual deletion of incorrect lemmas and tags assigned to individual tokens.

The tagger first processes the output of morphological analysis, which assigns to every token all of its tags and lemmas. After morphological analysis, in the typical case every token in the input sentence carries, among others, the lemma and tag that are correct in the given context, i.e. the recall is generally almost 100%, since the morphological dictionary covers the whole vocabulary of contemporary Czech. However, as the morphological analyzer assigns to every token all of its lemmas and tags regardless of context, the tokens also carry the largest possible number of incorrect tags. This is quantified by the precision measure, which is at its lowest at the input of disambiguation. Disambiguation therefore consists in keeping recall as high as possible (close to 100%) while increasing precision by removing the lemmas and tags that are incorrect in the given context. In some cases, moreover, morphological analysis yields very general tags that are transformed into more specific tags during disambiguation.
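Before turning to a concrete example, the notions of recall and precision used here can be illustrated by a short sketch on invented data: recall asks whether the contextually correct (lemma, tag) pair is still among a token's readings, and precision is taken as the share of all assigned readings that are correct. This is only one simple way to operationalize the two measures, not necessarily the exact formula used in the project.

    # Sketch of recall and precision over ambiguous analyses (toy data only).
    from typing import List, Set, Tuple

    Reading = Tuple[str, str]  # (lemma, tag)

    def recall_precision(assigned: List[Set[Reading]],
                         gold: List[Reading]) -> Tuple[float, float]:
        """assigned[i] holds the readings kept for token i, gold[i] the correct one."""
        hits = sum(1 for readings, g in zip(assigned, gold) if g in readings)
        total = sum(len(readings) for readings in assigned)
        return hits / len(gold), hits / total

    # Before disambiguation: all four dictionary readings of "poslech" are kept.
    before = [{("poslech", "NNIS1-----A-----"), ("poslech", "NNIS4-----A-----"),
               ("posel", "NNMP6-----A-----"), ("poslechnout", "VpYS---XR-AA---6")}]
    gold = [("poslech", "NNIS1-----A-----")]
    print(recall_precision(before, gold))   # (1.0, 0.25): full recall, low precision

    # After disambiguation: only the contextually correct reading remains.
    after = [{("poslech", "NNIS1-----A-----")}]
    print(recall_precision(after, gold))    # (1.0, 1.0)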

Example (1): Zajímá mě poslech rozhlasu. (lit. "Interests me listening of radio", i.e. "I am interested in listening to the radio.")

Morphological analysis assigns the word poslech the following four (lemma, tag) pairs:

a) lemma = poslech (E. listening), noun masc. inanimate nom. sg., tag = NNIS1-----A-----
b) lemma = poslech (E. listening), noun masc. inanimate acc. sg., tag = NNIS4-----A-----
c) lemma = posel (E. messenger), noun masc. animate loc. pl., tag = NNMP6-----A-----
d) lemma = poslechnout (E. obey), verb past participle masc. sg. active, tag = VpYS---XR-AA---6

The task of disambiguation is to remove the incorrect lemmas and tags b), c) and d), since the only correct reading is alternative a), i.e. nom. sg. masc. inanimate of the noun lemma poslech.

The disambiguation rules fall into two groups: a) safe rules and b) heuristic rules. They are applied to the tokens of the input sentence and remove the tags and lemmas that are contextually inappropriate (e.g. the tags NNIS4-----A----- and NNMP6-----A----- and the corresponding lemmas of the token poslech in sentence (1)). An input sentence is disambiguated further and further by successive rule applications until, ideally, full disambiguation is achieved, i.e. each token is assigned the single correct lemma and tag, in our example alternative a) for poslech in sentence (1). If the rule-based tagger is unable to disambiguate all tokens of an input sentence completely, i.e. some tokens are still assigned more than one tag, the remaining incorrect readings are removed by the statistical tagger MorČe.
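The principle of context-based deletion can be imitated in a few lines. The rule below is an invented simplification for illustration only (the actual LanGr rules are written in the tagger's own special language and are far more sophisticated): it deletes locative noun readings when no locative-governing preposition precedes the token, and, like a safe rule, it never deletes a token's last remaining reading. The tags of the context tokens are approximate.

    # Illustrative sketch of one context-based deletion rule (not an actual LanGr rule).
    from typing import Dict, List, Tuple

    Reading = Tuple[str, str]    # (lemma, positional tag)
    Token = Dict[str, object]    # {"form": str, "readings": set of Reading}

    LOCATIVE_PREPOSITIONS = {"v", "na", "o", "po", "při"}  # prepositions that can govern the locative

    def delete_unlicensed_locatives(sentence: List[Token]) -> bool:
        """Delete locative noun readings (case position '6') of a token when no
        locative-governing preposition occurs before it; keep at least one reading.
        Returns True if anything was deleted."""
        changed = False
        for i, token in enumerate(sentence):
            preceding = {str(t["form"]).lower() for t in sentence[:i]}
            if preceding & LOCATIVE_PREPOSITIONS:
                continue  # the locative is licensed by a preposition, rule does not apply
            keep = {r for r in token["readings"]
                    if not (r[1].startswith("N") and len(r[1]) > 4 and r[1][4] == "6")}
            if keep and keep != token["readings"]:  # a safe rule never empties the reading set
                token["readings"] = keep
                changed = True
        return changed

    # Sentence (1): "Zajímá mě poslech rozhlasu."  (context tags approximate)
    sentence: List[Token] = [
        {"form": "Zajímá",   "readings": {("zajímat", "VB-S---3P-AA---")}},
        {"form": "mě",       "readings": {("já", "PP-S4--1-------")}},
        {"form": "poslech",  "readings": {("poslech", "NNIS1-----A-----"),
                                          ("poslech", "NNIS4-----A-----"),
                                          ("posel", "NNMP6-----A-----"),
                                          ("poslechnout", "VpYS---XR-AA---6")}},
        {"form": "rozhlasu", "readings": {("rozhlas", "NNIS2-----A-----")}},
    ]
    delete_unlicensed_locatives(sentence)
    print(sentence[2]["readings"])  # the locative-plural reading of "posel" has been removed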

The POS and morphological disambiguation also involves the collocational module Phras, which identifies and properly disambiguates so-called grammatical and non-grammatical collocations. Thus, the following modules take part in the disambiguation process:

(i) the LanGr tagger, based on hand-crafted rules;
(ii) the collocational/phraseme module Phras, based on manually written rules and a dictionary of collocations;
(iii) the parameterizable stochastic tagger MorČe.

The collaboration of the modules can be described by the following sequence of operations applied to a sentence (a schematic sketch of this control flow is given at the end of this section):

1st step: the output of morphological analysis is processed by the safe rules. The rules gradually disambiguate the sentence, i.e. the number of incorrect tags decreases. The process continues until there is nothing left to disambiguate, i.e. until the rules, applied in recurrent cycles, exhaust their disambiguation capacity.
2nd step: the collocational module Phras is invoked: it identifies the collocations in the sentence and disambiguates them.
3rd step: both the safe and the heuristic rules of the LanGr tagger are applied until there is nothing left to disambiguate.
4th step: the remaining incorrect tags left untouched by the LanGr tagger are removed by the stochastic tagger MorČe.

Our experience shows that this is the optimum disambiguation strategy for a morphologically complex language such as Czech. This is due to the following main properties of the language system of Czech, which considerably influence the accuracy of the disambiguation of Czech sentences:

a) complex morphology (the number of tags is ca 5,000, of which ca 1,500 are actually used);
b) a high degree of morphological syncretism;
c) a large amount of casual, synchronically unmotivated ambiguity;
d) many exceptions and irregularities in morphology and syntax;
e) relatively free word order, enabled by property a) above;
f) relatively few reference points in a sentence that could be safely exploited by disambiguation rules;
g) strict rules of orthography, including punctuation rules.
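The four-step collaboration of the modules described above can be summarized by the sketch below; the safe rules, heuristic rules, the Phras module and the stochastic tagger are represented only by placeholder callables, so the sketch captures the control flow rather than the real modules.

    # Schematic sketch of the four-step disambiguation pipeline (control flow only;
    # the rule sets, the Phras module and the stochastic tagger are placeholders).
    from typing import Callable, Dict, List, Tuple

    Reading = Tuple[str, str]
    Sentence = List[Dict[str, object]]   # tokens, each with a "readings" set
    Rule = Callable[[Sentence], bool]    # a rule returns True if it deleted anything

    def apply_until_stable(sentence: Sentence, rules: List[Rule]) -> None:
        """Apply deletion rules in recurrent cycles until no rule removes anything."""
        changed = True
        while changed:
            changed = any([rule(sentence) for rule in rules])

    def disambiguate(sentence: Sentence,
                     safe_rules: List[Rule],
                     heuristic_rules: List[Rule],
                     phras: Callable[[Sentence], None],
                     morce: Callable[[Sentence], None]) -> None:
        apply_until_stable(sentence, safe_rules)                     # 1st step: safe rules
        phras(sentence)                                              # 2nd step: collocations (Phras)
        apply_until_stable(sentence, safe_rules + heuristic_rules)   # 3rd step: safe + heuristic rules
        if any(len(token["readings"]) > 1 for token in sentence):    # residual ambiguity
            morce(sentence)                                          # 4th step: stochastic tagger (MorČe)

    # Example wiring with trivial placeholders:
    # disambiguate(sentence, [some_safe_rule], [some_heuristic_rule],
    #              phras=lambda s: None, morce=lambda s: None)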

1.3 Complementary phase

The two main phases are followed by the third, complementary phase. It consists of the following steps:

(a) the aspect module, which assigns an aspect value to verbs (in the future, aspect assignment will be performed within morphological analysis). In Czech, a verb can be:
perfective (e.g. přidat, E. add, R. добавить),
imperfective (e.g. přidávat, E. add, R. добавлять), or
biaspectual (e.g. adaptovat, E. adapt, R. адаптировать);

(b) various parameterizable modules: correction of tokenization based on the already disambiguated tokens; optional POS corrections (e.g. adverb vs. particle); optional tagging of some morphosyntactic functions (e.g. auxiliary verbs), etc.

2. Evaluation

The accuracy of the whole system, where recall and precision coincide because every token ends up with a single tag, is close to 95%; the precision of the individual steps was not computed. The recall of morphological analysis is 99.25%, the recall of the safe disambiguation rules is 99.09%, the safe rules together with the Phras module reach a recall of 99.07%, and with the heuristic rules added the rule-based system has a recall of 98.82%. The large gap between 98.82% and 95% is due to the complexity of the remaining disambiguation problems solved by the statistical tagger MorČe, which, however, considerably increases precision.
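To round off the description, the aspect assignment of section 1.3 can be pictured as a simple lemma lookup performed once disambiguation has fixed a verb's lemma; the lemma-to-aspect table below is a tiny invented fragment containing only the examples from the text, not the project's actual aspect resource, and the tags are approximate.

    # Minimal sketch of the aspect module: assign an aspect value to disambiguated verbs.
    from typing import Dict, List, Optional, Tuple

    # Tiny illustrative lemma-to-aspect fragment (examples taken from section 1.3).
    ASPECT: Dict[str, str] = {
        "přidat": "perfective",
        "přidávat": "imperfective",
        "adaptovat": "biaspectual",
    }

    def assign_aspect(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str, Optional[str]]]:
        """tokens are disambiguated (lemma, tag) pairs; verbal tags start with 'V'.
        Verbs receive an aspect value from the table, other tokens get None."""
        return [(lemma, tag, ASPECT.get(lemma) if tag.startswith("V") else None)
                for lemma, tag in tokens]

    # Approximate tags, for illustration only.
    print(assign_aspect([("přidat", "Vf--------A-----"), ("kniha", "NNFS4-----A-----")]))
    # [('přidat', 'Vf--------A-----', 'perfective'), ('kniha', 'NNFS4-----A-----', None)]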