Project in the framework of the AIM-WEST project Annotation of MWEs for translation


Annotation of MWEs for translation
Project in the framework of the AIM-WEST project
Agnès Tutin, LIDILEM/LIG, Université Grenoble Alpes
30 October 2014

Outline
- Why annotate MWEs in corpora?
- A first experiment
- Typology of MWEs
- Annotation scheme
- Corpora
- Annotation process

Why annotate MWEs in corpora?
Theoretical aims:
- To validate the MWE typology
- To explore which MWEs are most productive in which genres
  - E.g. are idiomatic metaphoric expressions really more frequent in spoken genres?
  - E.g. are collocations really more frequent than true idiomatic expressions?
- To observe the syntactic properties of MWEs
Hypothesis: MWEs are highly variable and few of them are "frozen expressions" (cf. Moon 1998)

Why annotate MWEs in corpora? (2)
Practical aims:
- Few MWE-annotated corpora exist, especially for French
  - Small corpora with adverbial and nominal MWEs (Laporte et al. 2008, Laporte & Voyatzi 2008)
  - French Treebank (Abeillé): 1 million words, but few verbs, only contiguous ones (e.g. faire part), and no discontinuous expressions (e.g. prendre ce problème en compte). No typology of MWEs.
- Useful for MT applications, to evaluate which MWEs are more difficult to translate
  - Hypothesis (partially) confirmed by the first LIG-LIDILEM study (internship): contiguous MWEs are easier to translate


A first experiment in Grenoble
- Internship (LIG & LIDILEM): Master's students Justine Reverdy and Manolo Iborra, May-July 2014; supervisors: L. Besacier, A. Tutin
- Creation of an MWE-annotated corpus of 12,500 words (French version of the BAF corpus, scientific corpus)
- Manual annotation of the corpus with several types of MWEs: collocations, compounds, functional words, multiword terms, named entities
Example of annotation:
<epl type="locution verbale" id="14577575" d="0">être</epl>
<epl type="locution verbale" id="14577575" d="0">en</epl>
<epl type="locution verbale" id="14577575" d="0">mesure</epl>
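The token-level markup in the example above can be generated mechanically once the tokens of an MWE are known. The following is a minimal sketch in Python, not the tool used during the internship: the helper name `annotate_tokens` is hypothetical, and only the `epl` element and its `type`, `id` and `d` attributes come from the slide. The slide does not explain what `d` encodes, so the value is simply passed through.

```python
from xml.sax.saxutils import escape, quoteattr


def annotate_tokens(tokens, mwe_type, mwe_id, d="0"):
    """Wrap each token of one MWE in an <epl> element sharing the same id.

    Hypothetical helper: attribute names follow the slide's example,
    the meaning of `d` is not documented there.
    """
    parts = []
    for tok in tokens:
        parts.append(
            f"<epl type={quoteattr(mwe_type)} id={quoteattr(str(mwe_id))} "
            f"d={quoteattr(d)}>{escape(tok)}</epl>"
        )
    return " ".join(parts)


print(annotate_tokens(["être", "en", "mesure"], "locution verbale", 14577575))
```

Sharing one `id` across all tokens is what lets a later step regroup the parts of an expression, including parts that are not adjacent in the sentence.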

A first experiment (2)
Evaluation of the translation of MWEs (Moses)
- 80% of MWEs are translated as MWEs
- 70% of MWEs are correctly translated (manual evaluation with 3 values: OK, to revise, incorrect)

Type of MWE | Number of occurrences | MWE in target language | Correct translation | Translation to revise | Incorrect translation
Functional words | 390 | 71% | 77% | 6% | 17%
Nominal and adjectival compounds | 72 | 72% | 46% | 4% | 50%
Complex terms | 122 | 100% | 73% | 9% | 18%
Verbal phrasemes | 66 | 73% | 61% | 12% | 27%
Collocations | 124 | 93% | 65% | 6% | 29%
Total | 785 | 80% | 70% | 7% | 23%

- As expected, most MWEs are translated by MWEs
  - Fewer MWEs in the target: functional words
  - More MWEs in the target: complex terms and collocations
- As expected, the best translations are obtained for functional words and complex terms
- Unexpectedly, verbal phrasemes are translated better than adjectival and nominal compounds
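The percentages in the total row can be checked as occurrence-weighted averages of the per-type rows. The sketch below is only an arithmetic check under that assumption; the variable and function names are illustrative. Note that the per-type counts as transcribed sum to 774, whereas the total row reads 785; the sketch weights by the per-type counts.

```python
# (occurrences, MWE in target %, correct %, to revise %, incorrect %),
# copied from the slide's table.
ROWS = {
    "Functional words": (390, 71, 77, 6, 17),
    "Nominal and adjectival compounds": (72, 72, 46, 4, 50),
    "Complex terms": (122, 100, 73, 9, 18),
    "Verbal phrasemes": (66, 73, 61, 12, 27),
    "Collocations": (124, 93, 65, 6, 29),
}
COLUMNS = ("mwe_in_target", "correct", "to_revise", "incorrect")


def weighted_totals(rows):
    """Average each percentage column, weighting rows by their occurrence count."""
    total_n = sum(r[0] for r in rows.values())
    return {
        name: sum(r[0] * r[i + 1] for r in rows.values()) / total_n
        for i, name in enumerate(COLUMNS)
    }


totals = weighted_totals(ROWS)
for name, value in totals.items():
    print(f"{name}: {value:.1f}%")
```

Because the per-type percentages are themselves rounded, the recomputed totals agree with the slide's total row only to within about one percentage point.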

An experiment to be extended?
An interesting experiment, worth extending:
- With a larger and more diverse corpus
  - Including spoken and written language
  - With several genres: TED corpus, literary texts, Europarl, scientific writings
- With a semi-automatic annotation process (and a more detailed annotation scheme)
- With a more precise evaluation protocol (in collaboration with Emmanuelle Esperança-Rodier)

Typology of MWEs
Inspired by Heid (2008), Mel'čuk (2011)
- "Full phrasemes" (non-compositional)
  - Nominal compounds: pomme de terre ("potato"), dead end
  - Adjectival compounds: bon marché ("cheap")
  - Verbal phrasemes: to take into account
  - Functional MWEs
    - Adverbs: on the one hand
    - Prepositions: in front of
    - Conjunctions: insofar as
    - Determiners: a large number of
- Collocations or semi-phrasemes (including light verb constructions): to have a shower, heavy smoker, freshly baked

Typology of MWEs (2)
- Pragmatemes (spoken): You're welcome, see you later
- Proverbs: Jack of all trades, master of none. First come, first served
- Multiword terms: natural language processing, syntactic parser
- Named entities: Université Stendhal, Laboratoire d'Informatique de Grenoble

Annotation scheme (to be developed)
Principles: simple surface annotation
Features:
- Identifier
- Type of MWE: full phraseme, collocation, complex term
- Syntactic category of the full expression: verb, adverb, noun, ...
- Syntactic category of each part of the MWE
- Lemma of the expression
Example: Nous avons pris ce problème en compte.
Nous avons <epl id="23" type="fphraseme" catepl="verb" catw="verb" lemma="prendre_en_compte">pris</epl> ce problème <epl id="23" catw="prep">en</epl> <epl id="23" catw="noun">compte</epl>
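Since every part of an expression carries the same `id`, the annotated sentence above can be read back to regroup the parts and to tell whether the MWE is discontinuous. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `<s>` wrapper element and the function name `read_mwes` are assumptions added for illustration and are not part of the scheme on the slide.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SENTENCE = (
    'Nous avons <epl id="23" type="fphraseme" catepl="verb" catw="verb" '
    'lemma="prendre_en_compte">pris</epl> ce problème '
    '<epl id="23" catw="prep">en</epl> <epl id="23" catw="noun">compte</epl>'
)


def read_mwes(annotated):
    """Group <epl> parts by id and flag MWEs interrupted by other material."""
    # <s> is a hypothetical wrapper so the fragment parses as one XML element.
    root = ET.fromstring(f"<s>{annotated}</s>")
    elems = list(root)
    mwes = defaultdict(lambda: {"parts": [], "lemma": None, "discontinuous": False})
    for i, el in enumerate(elems):
        mwe_id = el.get("id")
        entry = mwes[mwe_id]
        entry["parts"].append((el.text, el.get("catw")))
        entry["lemma"] = entry["lemma"] or el.get("lemma")
        # A gap exists if more parts of this MWE follow and either plain words
        # or a part of another MWE come right after the current part.
        if any(e.get("id") == mwe_id for e in elems[i + 1:]):
            next_is_other = elems[i + 1].get("id") != mwe_id
            if (el.tail or "").strip() or next_is_other:
                entry["discontinuous"] = True
    return dict(mwes)


print(read_mwes(SENTENCE))
```

In the example, the words "ce problème" sit between "pris" and "en", so the expression is flagged as discontinuous, which is the case the first experiment suggested is harder to translate.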

Annotation process
Semi-automatic annotation process:
- Using MWE lexicons
- Surface annotation with finite-state techniques as a first step, then syntactic parsing (XIP?)
Dictionaries used:
- MWEs extracted from the French Treebank. Interest: most frequent MWEs, decomposition of MWEs
- Dictionnaire Electronique des Mots (Dubois and Dubois-Charlier): wide coverage, semantic features
- Wiktionary: wide coverage
- DELAC: wide coverage
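The first step, surface annotation from MWE lexicons, can be illustrated with a plain dictionary lookup. This is a deliberately simplified stand-in for the finite-state approach named above, not an implementation of it: the toy `LEXICON`, the function name `tag_contiguous` and the greedy longest-match strategy are all assumptions, and the sketch handles contiguous, uninflected expressions only.

```python
# Toy lexicon (hypothetical): token tuple -> MWE type from the typology.
LEXICON = {
    ("pomme", "de", "terre"): "fphraseme",
    ("bon", "marché"): "fphraseme",
}
MAX_LEN = max(len(key) for key in LEXICON)


def tag_contiguous(tokens):
    """Greedy longest-match lookup; returns (start, end, type) spans."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-token expressions.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in LEXICON:
                spans.append((i, i + n, LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no MWE starts at this token
    return spans


print(tag_contiguous("Une pomme de terre bon marché".split()))
```

Inflected forms and discontinuous expressions such as prendre ce problème en compte fall outside this lookup, which is where lemmatization and the later syntactic parsing step would come in.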

Example of semi-automatic annotation with NooJ (FST)
abandonner,v+vdn+lemma=abandonner_la_partie+n0hum+det1=la+n1=partie+passif+flx=aimer
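The entry above appears to consist of a head word, a comma, and a list of `+`-separated features, some of which carry a value after `=`. The sketch below splits the line on that surface layout only; it is not NooJ's own parser, the function name `parse_entry` is hypothetical, and the slide does not say what the individual feature names mean.

```python
ENTRY = (
    "abandonner,v+vdn+lemma=abandonner_la_partie+n0hum"
    "+det1=la+n1=partie+passif+flx=aimer"
)


def parse_entry(line):
    """Split 'head,feat+feat=value+...' into a head word and a feature dict.

    Features without '=' are stored as True; this reading of the layout
    is an assumption based on the example line alone.
    """
    head, _, tail = line.partition(",")
    features = {}
    for item in tail.split("+"):
        name, sep, value = item.partition("=")
        features[name] = value if sep else True
    return head, features


head, features = parse_entry(ENTRY)
print(head, features["lemma"], features["det1"], features["n1"])
```

Read this way, the `lemma`, `det1` and `n1` values spell out the full expression abandonner la partie attached to the verb entry.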

Manual check of the annotation on concordances and generation of the annotation