Automatic Generation of a Training Set for NER on Portuguese journalistic text


Jorge Teixeira - jft@fe.up.pt
DSIE'11 - January 2011

Outline
- Motivation & Main Objectives
- Method & Approach
- Experimental Set-Up
- Results
- Analysis & Discussion
- Conclusions & Future Work

Motivation
- The number of news items published every day is huge. How can all this information be organized?
- Media clipping and entity tracking are usually performed by experts, semi-manually.
  - How to subscribe to news mentioning José Sócrates?
  - How to study different perspectives on, and the evolution of, Obama's speeches?
- Automatically identifying names of people is not trivial:
  - Ronaldo: This is perfect time to go to Camp Nou
  - Wikileaks: Hillary Clinton continues contacts with foreign leaders
- Simple hand-crafted rules are not enough to identify these names.

Main Objectives
- Automatically create a training set for NER, because manual annotation:
  - Is extremely costly in time and human resources
  - Suffers from divergence between annotators
  - Yields a training set of limited size
- Use Conditional Random Fields (CRF) to automatically identify names of people in news.
Example:
A partir de Fevereiro, os programas de <PN>Marcelo Rebelo de Sousa</PN> e <PN>António Vitorino</PN>, na RTP1 (...)
[Starting in February, the shows of <PN>Marcelo Rebelo de Sousa</PN> and <PN>António Vitorino</PN> on RTP1 (...)]

Related Work
Patterns and complex feature generation methods for NER
- (Minkov et al., 2005) proposed a set of specialized structural features for identifying personal names in emails
  - Four manually annotated corpora with 573 documents
  - Used CRF and obtained F-measures ranging from 68.1 to 91.9
  - Improved results using (i) repetition of named entities in emails and (ii) a dictionary of names and their variations
- (McCallum and Li, 2003) used feature induction and web-enhanced lexicons for NER with CRF
  - Automatic feature induction selects the most relevant features for the task
  - Web-enhanced lexicons are lexicons augmented using the web
  - Used CoNLL 2003, a corpus of English newspaper text with 964 documents and 4 entity types (PER, LOC, ORG, MISC)
  - Obtained an F-measure of 84.04 on the test set

Related Work
Wikipedia as external knowledge to improve NER
- (Kazama and Torisawa, 2007) extracted category labels from sentences with the structure: Jimi Hendrix (...) was an American guitarist
  - These categories were used as features in a CRF-based model
  - On CoNLL 2003, the F-measure improved by 1.58 over the baseline
Big versus small gazetteers
- (Mikheev et al., 1999) considered that compiling large gazetteers is sometimes the bottleneck of NER systems
  - It was sufficient to use small gazetteers of well-known names rather than large gazetteers of low-frequency names
Portuguese
- (Sarmento, 2006) developed SIEMES, a NER system for Portuguese that uses rules of form and similarity supported by a wide-scope gazetteer for Portuguese, REPENTINO.

Method & Approach
1) Initial set of names
- Voxx: a system that automatically extracts quotations from online news
- N_Voxx: 1045 names
- Names of well-known people (frequent names)
2) Annotation process (training set)
- C_news: 20,000 news items n_i = (title, body), with 110,000 sentences
- C_news is automatically annotated with the names from the initial set N_Voxx
- Annotation rules: exact and soft matches, erroneous names
- Result: 6,600 instances (annotated names) and 562 different names
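The annotation step can be illustrated with a minimal sketch that covers only the exact-match rule. The function name `annotate` and the regular-expression approach are assumptions made for illustration. The slides do not describe how Voxx names are matched, and the soft-match and erroneous-name rules are not reproduced here.

```python
import re


def annotate(sentence, names):
    """Wrap every exact occurrence of a known name in <PN>...</PN> tags.

    Hypothetical helper: exact matching only, not the authors' full rule set.
    """
    if not names:
        return sentence
    # Try longer names first so a full name wins over a shorter name it contains.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookarounds keep a name from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")
    return pattern.sub(r"<PN>\1</PN>", sentence)


if __name__ == "__main__":
    initial_set = {"Marcelo Rebelo de Sousa", "António Vitorino"}
    text = ("A partir de Fevereiro, os programas de Marcelo Rebelo de Sousa "
            "e António Vitorino, na RTP1")
    print(annotate(text, initial_set))
```

Because only names from the initial set can be tagged, a scheme of this kind would tend towards high precision and lower recall.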

Method & Approach
3) Feature generation
- Word-level features
- Window of 3 tokens to the left and 3 tokens to the right
- REPENTINO: gazetteer for the Portuguese language with 100 different categories

Feature                    Example
Capitalized word           Pedro or Miguel
Acronym                    NATO or USA
Word length                musician - 8
End of sentence            (none)
Grammatical category       said - verb
Lemma                      doors - door
Semantic category          journalist - job
List of REPENTINO names    Eduardo de Melo
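A minimal sketch of the word-level features over a window of 3 tokens on each side might look as follows. The function names, the feature keys and the single-token gazetteer lookup are illustrative assumptions. Grammatical category, lemma and semantic category are left out because they need external tools that the slides do not name.

```python
def token_features(tokens, i, gazetteer):
    """Surface features of tokens[i] (hypothetical feature names)."""
    tok = tokens[i]
    return {
        "capitalized": tok[:1].isupper(),
        "acronym": len(tok) > 1 and tok.isalpha() and tok.isupper(),
        "length": len(tok),
        "end_of_sentence": i == len(tokens) - 1,
        # Simplification: single-token lookup in a stand-in gazetteer set.
        "in_gazetteer": tok in gazetteer,
    }


def window_features(tokens, i, gazetteer, size=3):
    """Features of tokens[i] and of up to `size` neighbours on each side."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for name, value in token_features(tokens, j, gazetteer).items():
                # Keys look like "-1:capitalized" or "+0:length".
                feats[f"{offset:+d}:{name}"] = value
    return feats


if __name__ == "__main__":
    sentence = ["Pedro", "joined", "NATO"]
    for key, value in sorted(window_features(sentence, 0, {"Pedro"}).items()):
        print(key, value)
```

Prefixing each feature with its offset lets the model tell "the previous token is capitalized" apart from "the current token is capitalized".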

Method & Approach
4) CRF model
- Well suited to sequence analysis, particularly NER on newswire data (McCallum and Li, 2003)
- Straightforward CRF templates that describe the tokens, their positions and features
- Build a CRF model
5) Identification of names
- Use the CRF model on HAREM
- HAREM is a corpus annotated with named entities for Portuguese
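A CRF is trained on token sequences with one label per token, so the <PN>-annotated sentences have to be converted into that form first. The sketch below assumes a BIO labelling scheme (B-PN, I-PN, O) and whitespace tokenization. Both are assumptions, since the slides do not say which encoding or tokenizer was used, and `to_bio` is a hypothetical name.

```python
import re

# Matches one annotated name, e.g. <PN>António Vitorino</PN>.
TAG = re.compile(r"<PN>(.*?)</PN>")


def to_bio(annotated):
    """Convert a <PN>-annotated sentence into (token, label) pairs.

    Assumed scheme: B-PN starts a name, I-PN continues it, O is any other token.
    """
    rows = []
    pos = 0
    for match in TAG.finditer(annotated):
        # Tokens before the name are outside any entity.
        rows += [(tok, "O") for tok in annotated[pos:match.start()].split()]
        for k, tok in enumerate(match.group(1).split()):
            rows.append((tok, "B-PN" if k == 0 else "I-PN"))
        pos = match.end()
    rows += [(tok, "O") for tok in annotated[pos:].split()]
    return rows


if __name__ == "__main__":
    for token, label in to_bio("os programas de <PN>António Vitorino</PN> na RTP1"):
        print(token, label)
```

Whitespace splitting leaves punctuation attached to neighbouring tokens. A real pipeline would use a proper tokenizer.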

Experimental Set-Up
Evaluate what?
1. The quality of the annotation of the training set
2. The quality of the CRF annotator for names of people
Evaluate how?
1. Manually evaluate 1% of the news corpus (200 news items)
2. Use HAREM, an annotated corpus of named entities for Portuguese (the gold-standard corpus)
With which measures?
- Precision, Recall and F-measure
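The three measures can be sketched as follows, assuming that a predicted name counts as correct only when its span matches a gold-standard span exactly. This matching criterion and the (sentence, start, end) span representation are assumptions for illustration. The official HAREM scoring may differ.

```python
def precision_recall_f(gold, predicted):
    """Precision, recall and balanced F-measure over sets of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical spans: (sentence index, first token, last token).
    gold_spans = {(0, 5, 8), (0, 10, 11), (1, 0, 2), (2, 3, 4)}
    predicted_spans = {(0, 5, 8), (1, 0, 2), (1, 4, 5)}
    p, r, f = precision_recall_f(gold_spans, predicted_spans)
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```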

Results
1. Quality of the annotation of the training set
- Precision of 95%
- Recall of 74%
2. Quality of the CRF annotator
- Baseline method (only features with names from REPENTINO)
  - Precision of 55%
  - Recall of 8%
- Best method (features with names, structural information, and syntactic and semantic information)
  - Precision of 79%
  - Recall of 23%

Analysis & Discussion
Annotation of the training set:
- Precision of 95% means that almost all annotated names were correctly identified
- Recall of 74% means the method misses some names, in particular names with only one word
CRF annotator:
- Baseline (P=55%, R=8%, F-measure=14%): poor features
- Best method (P=79%, R=23%, F-measure=36%):
  - Error type I: incorrectly identified names (36%)
  - Error type II: name used in a different context (33%)
  - Error type III: missed name (31%)
- Milidiú et al. (2007) used HMM and achieved an F-measure of 88%, with a small, manually annotated training corpus (2100 sentences)
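Assuming the F-measure is the balanced harmonic mean of precision and recall (the slides do not state the weighting), the reported values follow directly from P and R:

```latex
F = \frac{2PR}{P + R}
\qquad
F_{\text{baseline}} = \frac{2 \times 0.55 \times 0.08}{0.55 + 0.08} = \frac{0.088}{0.63} \approx 0.14
\qquad
F_{\text{best}} = \frac{2 \times 0.79 \times 0.23}{0.79 + 0.23} = \frac{0.3634}{1.02} \approx 0.36
```

In both cases the low recall dominates the score, which is why the precision gain from 55% to 79% moves the F-measure less than the recall gain from 8% to 23%.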

Conclusions
We built a NER system for Portuguese that is:
- Specialized in names of people
- Completely automatic (from the training set to the model construction and the final identification of names)
No human annotation is necessary.
The results achieved are encouraging.

Future Work
- Names with only one word (nicknames)
- Study the influence of training sets of different sizes
- Study and test different features to increase recall
- Active learning
- Other NEs such as company names, locations and jobs
- Wikipedia as an additional source of NEs

Questions?