Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning

Guillaume Wisniewski, Nicolas Pécheux, Souhir Gahbiche-Braham, François Yvon
Université Paris-Sud & LIMSI-CNRS
October 28, 2014

Context
- Supervised machine learning techniques have established new performance standards for many NLP tasks.
- Their success crucially depends on the availability of annotated in-domain data.
- Such data is often unavailable (e.g. for under-resourced languages).
- What can we do then?

Context (continued)
- Unsupervised learning
- Crawled data (e.g. Wiktionary)

Context (continued)
- Transfer from a resource-rich language to a less-resourced language
- Cross-lingual transfer (weakly supervised learning)

Example
  Making/VERB a/DET Market/NOUN for/ADP Scientific/NOUN Research/NOUN
  Un marché pour la recherche scientifique
  Tags projected onto the French side: NOUN ADP NOUN NOUN

State of the art
- In most cases, transfer only results in partially annotated data.
- Alternative ML techniques need to be designed.

Existing approaches
- Partially observed CRF [Täckström et al., 2013]
- Posterior regularization [Ganchev and Das, 2013]
- Expectation maximization [Wang and Manning, 2014]

Contributions
1. We cast this problem in the framework of ambiguous learning [Bordes et al., 2010; Cour et al., 2011].
2. We present a novel method to learn from ambiguous supervision data.
3. We show significant improvements over the prior state of the art.
4. We conduct a detailed analysis that allows us to identify the limits of transfer-based methods and of their evaluation.

Part I: Projecting Labels across Aligned Corpora

Hypothesis
- In this work we focus on POS tagging.
- Strong assumption: syntactic categories in the source language can be directly related to those in the target language.
- Universal tagset [Petrov et al., 2012]: { Noun, Verb, Adj, Adv, Pron, Det, Adp, Num, Conj, Prt, ., X }
- All annotations are mapped to this universal tagset.

Type and token constraints
- Transfer-based methods only deliver partial and noisy supervision.
  - Heuristic filtering rules [Yarowsky et al., 2001]
  - Graph-based projection [Das and Petrov, 2011]
  - Combination with monolingual information [Täckström et al., 2013]
- Type and token constraints [Täckström et al., 2013]
  1. type constraints from a dictionary
  2. token constraints projected through alignment links

Type constraints
From tag dictionaries:
- automatically extracted from Wiktionary
- built from the labels projected across the aligned corpora
  Example (figure, layout lost in extraction): English "market" and "walked", tagged NOUN and VERB, and French "marché", tagged NOUN and VERB.
We use the intersection of the two dictionaries above.
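The intersection of the two dictionaries can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy entries, and the choice to leave words missing from either dictionary unconstrained are all assumptions.

```python
from collections import defaultdict


def projected_tag_dictionary(aligned_tokens):
    """Collect, for each target word, the set of source tags projected onto it.

    aligned_tokens: iterable of (target_word, source_tag) pairs, one per
    alignment link (hypothetical input format).
    """
    tags = defaultdict(set)
    for target_word, source_tag in aligned_tokens:
        tags[target_word.lower()].add(source_tag)
    return tags


def type_constraints(wiktionary, projected):
    """Keep, for each word, only the tags licensed by both dictionaries.

    Assumption: a word absent from either dictionary, or with an empty
    intersection, gets no entry and is treated as unconstrained downstream.
    """
    constraints = {}
    for word in wiktionary.keys() & projected.keys():
        common = wiktionary[word] & projected[word]
        if common:
            constraints[word] = common
    return constraints


# Toy data, invented for illustration only.
wiktionary = {
    "marché": {"NOUN", "VERB"},
    "pour": {"ADP"},
    "la": {"DET", "PRON", "NOUN"},
}
aligned_tokens = [
    ("marché", "NOUN"), ("marché", "VERB"),
    ("pour", "ADP"),
    ("la", "DET"), ("la", "ADJ"),  # "ADJ" stands in for a noisy projection
]

projected = projected_tag_dictionary(aligned_tokens)
constraints = type_constraints(wiktionary, projected)
for word in sorted(constraints):
    print(word, sorted(constraints[word]))
```

In this toy run the noisy ADJ projection for "la" is filtered out by the Wiktionary entry, and the Wiktionary tags that were never projected are dropped as well.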

Token constraints
1. Use the type constraints.
2. Use the alignment links from the parallel corpora.
3. Tag the source side (resource-rich).
4. Project labels if licensed by the type constraints.

Example
  Making/VERB a/DET Market/NOUN for/ADP Scientific/NOUN Research/NOUN
  Un marché pour la recherche scientifique
  Candidate tags for the French words (tag lattice flattened by extraction): ADJ DET NOUN PRON NOUN VERB ADP DET NOUN NOUN PRON NOUN VERB NOUN ADJ
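The four steps above can be sketched in a few lines. This is a hedged illustration: the alignment format (a dict from target index to source index), the alignment links themselves, the per-word tag sets, and the fallback to the full type-constraint set when a projected tag is not licensed are assumptions made for the example.

```python
ALL_TAGS = frozenset(
    {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", ".", "X"}
)


def token_constraints(target_words, source_tags, alignment, type_constraints):
    """Return one set of allowed tags per target token.

    alignment maps a target index to a source index (hypothetical format).
    A projected tag is kept only if the type constraints license it;
    otherwise the token keeps its whole type-constraint set.
    """
    allowed_per_token = []
    for j, word in enumerate(target_words):
        allowed = type_constraints.get(word.lower(), ALL_TAGS)
        i = alignment.get(j)
        if i is not None and source_tags[i] in allowed:
            allowed_per_token.append({source_tags[i]})
        else:
            allowed_per_token.append(set(allowed))
    return allowed_per_token


# Source side tagged by a supervised English tagger (tags from the slide).
source_tags = ["VERB", "DET", "NOUN", "ADP", "NOUN", "NOUN"]
target_words = ["Un", "marché", "pour", "la", "recherche", "scientifique"]

# Hypothetical alignment links and type constraints, for illustration only.
alignment = {0: 1, 1: 2, 2: 3, 4: 5, 5: 4}
types = {
    "un": {"ADJ", "DET", "NOUN", "PRON"},
    "marché": {"NOUN", "VERB"},
    "pour": {"ADP"},
    "la": {"DET", "NOUN", "PRON"},
    "recherche": {"NOUN", "VERB"},
    "scientifique": {"ADJ", "NOUN"},
}

result = token_constraints(target_words, source_tags, alignment, types)
for word, allowed in zip(target_words, result):
    print(word, sorted(allowed))
```

The unaligned word "la" stays ambiguous, which is exactly the partially annotated data that the learner then has to cope with.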

Part II: Modeling Sequences under Ambiguous Supervision

Problem
  Un marché pour la recherche scientifique
  Candidate tags (tag lattice flattened by extraction): ADJ DET NOUN PRON ADP NOUN DET NOUN PRON NOUN NOUN
- Gold labels: a set of possible labels, of which only one is true.
- How can we learn from ambiguous supervision?
- This can be cast in the framework of ambiguous learning [Bordes et al., 2010; Cour et al., 2011].

History-based model: inference
  x: Un marché pour la
  y: DET NOUN ADP ?

  $y_i^* = \arg\max_{y \in \{\text{NOUN}, \text{VERB}, \ldots\}} F(x, y, y_{i-1}, y_{i-2}, \ldots)$

Principle
- Structured prediction is reduced to a sequence of multi-class classification problems.
- At each step, the decision is based on the input structure and on the partially tagged sequence so far.
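A greedy left-to-right decoder of this kind might look like the sketch below. The feature templates, the weight values, and the restricted tagset are hypothetical; the slides only state that a "standard feature set" is used.

```python
TAGSET = ["ADP", "DET", "NOUN", "PRON", "VERB"]  # toy subset of the universal tagset


def features(words, i, tag, history):
    """Hypothetical feature templates: current word, previous tag, tag bias."""
    prev = history[-1] if history else "<s>"
    return [f"w={words[i].lower()}|t={tag}", f"p={prev}|t={tag}", f"t={tag}"]


def score(weights, feats):
    """Linear score: sum of the weights of the active features."""
    return sum(weights.get(f, 0.0) for f in feats)


def greedy_tag(words, weights, tagset):
    """Tag left to right; each decision sees the tags already predicted."""
    history = []
    for i in range(len(words)):
        best = max(tagset, key=lambda t: score(weights, features(words, i, t, history)))
        history.append(best)
    return history


# Hand-set toy weights, invented for illustration.
weights = {
    "w=un|t=DET": 1.0,
    "w=marché|t=NOUN": 1.0,
    "w=marché|t=VERB": 1.0,
    "p=DET|t=NOUN": 0.5,   # history feature that resolves "marché"
    "w=pour|t=ADP": 1.0,
    "w=la|t=DET": 1.0,
    "w=la|t=PRON": 1.0,
    "p=ADP|t=DET": 0.5,    # history feature that resolves "la"
}

words = ["Un", "marché", "pour", "la"]
print(greedy_tag(words, weights, TAGSET))
```

With these weights the word features alone leave "marché" and "la" tied between two tags; the previous-tag features break the ties, which is the role the history plays in the model.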

History-based model: training
Linear classifier:
  $y_i^* = \arg\max_{y \in \mathcal{Y}} w^T \phi(x, i, y, h_i)$

Perceptron update (full supervision):
  if $y_i^* \neq \hat{y}_i$ then $w_{t+1} \leftarrow w_t - \phi(x, i, y_i^*, h_i) + \phi(x, i, \hat{y}_i, h_i)$
  Heighten the gold label's score at the cost of the wrongly predicted one.

Perceptron-like update (ambiguous supervision):
  if $y_i^* \notin \hat{Y}_i$ then $w_{t+1} \leftarrow w_t - \phi(x, i, y_i^*, h_i) + \sum_{\hat{y}_i \in \hat{Y}_i} \phi(x, i, \hat{y}_i, h_i)$
  Heighten the gold labels' scores at the cost of the wrongly predicted one.

Theoretical guarantees exist for similar problems under mild assumptions [Bordes et al., 2010; Cour et al., 2011].
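The ambiguous update can be sketched as below. The update rule follows the slide; everything else is an assumption made to obtain a runnable example: the feature templates, the toy training sentence, the unit step size, and the choice of the best-scoring allowed tag as history after an update.

```python
from collections import defaultdict

TAGSET = ["ADP", "DET", "NOUN", "PRON", "VERB"]  # toy subset, fixed order for ties


def features(words, i, tag, history):
    """Hypothetical feature templates: current word, previous tag, tag bias."""
    prev = history[-1] if history else "<s>"
    return [f"w={words[i].lower()}|t={tag}", f"p={prev}|t={tag}", f"t={tag}"]


def predict(weights, words, i, history, candidates):
    """Best-scoring candidate tag under the current linear model."""
    return max(
        candidates,
        key=lambda t: sum(weights[f] for f in features(words, i, t, history)),
    )


def ambiguous_update(weights, words, i, history, predicted, allowed):
    """Lower the wrongly predicted tag, raise every tag of the allowed set."""
    for f in features(words, i, predicted, history):
        weights[f] -= 1.0
    for gold in allowed:
        for f in features(words, i, gold, history):
            weights[f] += 1.0


def train(data, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, allowed_sets in data:
            history = []
            for i, allowed in enumerate(allowed_sets):
                predicted = predict(weights, words, i, history, TAGSET)
                if predicted not in allowed:
                    ambiguous_update(weights, words, i, history, predicted, allowed)
                    # Assumption: continue with the best allowed tag as history.
                    predicted = predict(weights, words, i, history, sorted(allowed))
                history.append(predicted)
    return weights


def tag(weights, words):
    history = []
    for i in range(len(words)):
        history.append(predict(weights, words, i, history, TAGSET))
    return history


# One toy sentence; the last token keeps an ambiguous label set.
sentence = ["Un", "marché", "pour", "la"]
allowed_sets = [{"DET"}, {"NOUN"}, {"ADP"}, {"DET", "PRON"}]
weights = train([(sentence, allowed_sets)])
predicted_tags = tag(weights, sentence)
print(predicted_tags)
```

No update is triggered when the prediction already falls inside the allowed set, so the model is free to settle on any of the candidate labels for an ambiguous token.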

Part III: Experiments

Experimental setup
- Experiments on 10 languages from different families
- English as the source side
- Standard feature set

Our method needs:
  Parallel corpora      Europarl, NIST, Open Subtitle
  English POS tagger    Wapiti
  Crawled dictionary    Wiktionary
  Labeled test data     CoNLL'07, UDT v2.0, Treebanks

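Results (error rates, %)

  Lang  CRF    HBAL   Diff   Other reported results
  ar    33.9   27.9   -6.0   49.9
  cs    11.6   10.4   -1.2   19.3  18.9
  de    12.2    8.8   -3.4    9.6   9.5  14.2  18.7
  el    10.9    8.1   -2.8    9.4  10.5  20.8  28.2
  es    10.7    8.2   -2.5   12.8  10.9  13.6  18.7
  fi    12.9   13.3   +0.4
  fr    11.6   10.2   -1.4   12.5  11.6
  id    16.3   11.3   -5.0
  it    10.4    9.1   -1.3   10.1  10.2  13.5  31.9
  sv    11.6   10.1   -1.5   10.8  11.1  13.9  29.9

The slide labels the remaining columns "[1] [2] [3] Unsupervised [1]"; the values are listed in the order in which they appear, and rows with fewer values do not show which columns are empty.

  CRF   Partially supervised CRF baseline [Täckström et al., 2013]
  HBAL  Our history-based model
  [1]   [Ganchev and Das, 2013]
  [2]   [Täckström et al., 2013]
  [3]   [Li et al., 2012]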

Part IV: Discussion

Discussion
A closer look at the Spanish results (error rates):
- State of the art: 10.9%
- Our model (HBAL): 8.2%
- Our model trained on supervised data (HBSL): 2.4%
Our method still falls short of a fully supervised model!

Why such a large gap?
Noisy constraints
- Type-constraint precision on test data is 94%, i.e. using our type constraints as hard constraints at decoding time yields at least 6% errors.
- In this setting HBSL gets 7.3%.
Noisy dictionaries, but not only?
Out-of-domain evaluation
1. tokenization differs
2. domain differs
3. annotation conventions differ
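The link between constraint precision and the error floor can be made concrete with a small sketch. Precision is taken here to be the fraction of test tokens whose gold tag is licensed by the type constraints; that definition, the treatment of unknown words as unconstrained, and the toy data are assumptions.

```python
ALL_TAGS = frozenset(
    {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", ".", "X"}
)


def constraint_precision(tokens, gold_tags, type_constraints):
    """Fraction of tokens whose gold tag is allowed by the type constraints.

    Assumption: words without an entry are unconstrained, so they always count
    as licensed.
    """
    licensed = 0
    for word, gold in zip(tokens, gold_tags):
        allowed = type_constraints.get(word.lower(), ALL_TAGS)
        if gold in allowed:
            licensed += 1
    return licensed / len(tokens)


# Toy test set; the entry for "la" is deliberately wrong to mimic a noisy dictionary.
tokens = ["Un", "marché", "pour", "la", "recherche"]
gold_tags = ["DET", "NOUN", "ADP", "DET", "NOUN"]
types = {
    "un": {"DET", "PRON"},
    "marché": {"NOUN", "VERB"},
    "pour": {"ADP"},
    "la": {"PRON"},
    "recherche": {"NOUN", "VERB"},
}

precision = constraint_precision(tokens, gold_tags, types)
error_floor = 1.0 - precision  # no tagger restricted to these sets can do better
print(f"precision = {precision:.2f}, minimum error rate = {error_floor:.2f}")
```

With hard constraints, any token whose gold tag is excluded is necessarily mislabeled, which is why a precision of 94% translates into at least 6% errors whatever the quality of the learner.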

The annotation convention problem
- Several independently designed information sources are combined.
- They follow conflicting annotation conventions.

Example (figure, layout lost in extraction): numbers, foreign names, and the pair "few" / "poco", shown with the tags NUM, NOUN, ADJ, DET, X and PRON.

Impact of annotation and train/test mismatches
Fixing some annotation mismatches in the type constraints (error rates, %):

                ar    cs    de    el    es    fi    fr    id    it    sv
  HBAL          27.9  10.4   8.8   8.1   8.2  13.3  10.2  11.3   9.1  10.1
  HBAL + match  24.1   7.6   8.0   7.3   7.4  12.2   7.4   9.8   8.3   8.8
  Difference    -3.8  -2.8  -0.8  -0.8  -0.8  -1.1  -2.8  -1.5  -0.8  -1.3

Supervised experiments for Spanish:

  Train     Train labels                        Test error rate
  UDT       manual                              2.4%
  Europarl  HBSL                                4.2%
  Europarl  FreeLing                            6.1%
  Europarl  cross-lingual transfer (ambiguous)  8.2%

Performance may be underestimated.

Part V: Conclusion

Conclusion
- We introduce a new, simple and efficient learning criterion.
- Performance surpasses the best reported results.
- Are the results close to the best achievable performance?
- Evaluation in such settings must be interpreted with great care.
- Additional gains might be more easily obtained by fixing systematic biases than by designing more sophisticated weakly supervised learners.

Thank you for your attention. Questions?
Tools and resources are available from http://perso.limsi.fr/wisniews/weakly

References
Bordes, A., Usunier, N., and Weston, J. (2010). Label ranking under ambiguous supervision for learning semantic correspondences. In ICML, pages 103-110.
Cour, T., Sapp, B., and Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12:1501-1536.
Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 600-609, Stroudsburg, PA, USA. Association for Computational Linguistics.
Ganchev, K. and Das, D. (2013). Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1996-2006, Seattle, Washington, USA. Association for Computational Linguistics.
Li, S., Graça, J. V., and Taskar, B. (2012). Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1389-1398, Stroudsburg, PA, USA. Association for Computational Linguistics.
Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Doğan, M. U., Maegaard, B., Mariani, J., Odijk, J., and Piperidis, S., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC '12), Istanbul, Turkey. European Language Resources Association (ELRA).
Täckström, O., Das, D., Petrov, S., McDonald, R., and Nivre, J. (2013). Token and type constraints for cross-lingual part-of-speech tagging.