Automatic Generation of a Training Set for NER on Portuguese journalistic text

Jorge Teixeira - jft@fe.up.pt
DSIE 11 - January 2011

Outline
- Motivation & Main Objectives
- Method & Approach
- Experimental Set-Up
- Results
- Analysis & Discussion
- Conclusions & Future work

Motivation
- The number of news items published every day is huge. How can all this information be organized?
- Media clipping and entity tracking are usually performed by experts, semi-manually
  - How to subscribe to news mentioning José Sócrates?
  - How to study different perspectives on, and the evolution of, Obama's speeches?
- It is not trivial to automatically identify names of people:
  - Ronaldo: This is perfect time to go to Camp Nou
  - Wikileaks: Hillary Clinton continues contacts with foreign leaders
- Simple hand-crafted rules are not enough to identify these names

Main objectives
- Automatically create a training set for NER, since manual annotation means:
  - An extremely time-consuming task that demands many human resources
  - Divergence between annotators
  - A limited-size training set
- Use Conditional Random Fields (CRF) to automatically identify names of people in news
- Example: A partir de Fevereiro, os programas de <PN>Marcelo Rebelo de Sousa</PN> e <PN>António Vitorino</PN>, na RTP1(...)
  [ Starting in February, the shows of <PN>Marcelo Rebelo de Sousa</PN> and <PN>António Vitorino</PN> on RTP1 (...) ]

Related Work
Patterns and complex feature generation methods for NER
- (Minkov et al., 2005) proposed a set of specialized structural features for identifying personal names in emails
  - Four manually annotated corpora with 573 documents
  - Used CRF and obtained F-measures varying from 68.1 to 91.9
  - Improved results using: (i) repetition of NEs in emails; (ii) a dictionary of names and their variations
- (McCallum and Li, 2003) used feature induction and web-enhanced lexicons for NER with CRF
  - Automatic feature induction selects the most relevant features for the task
  - Web-enhanced lexicons are augmented using the web
  - Used CoNLL 2003, a corpus of English newspapers with 964 documents and 4 entity types (PER, LOC, ORG, MISC)
  - Obtained an F-measure of 84.04 on the test set

Related Work
Wikipedia as external knowledge to improve NER
- (Kazama and Torisawa, 2007) extracted category labels from Wikipedia using structures such as: Jimi Hendrix (...) was an American guitarist
  - These categories were used as features in a CRF-based model
  - On CoNLL 2003, the F-measure improved by 1.58 over the baseline
Big versus small gazetteers
- (Mikheev et al., 1999) considered that compiling large gazetteers is sometimes the bottleneck of NER systems
  - It was sufficient to use small gazetteers of well-known names rather than large gazetteers of low-frequency names
Portuguese
- (Sarmento, 2006) developed SIEMES, a NER system for Portuguese that uses rules of form and similarity, supported by REPENTINO, a wide-scope gazetteer for Portuguese

Method & Approach
1) Initial set of names
- Voxx: a system that automatically extracts quotations from online news
- N_Voxx = 1045 names
- Names of well-known people (frequent names)
2) Annotation process (training set)
- C_news: 20,000 news items n_i = (title, body) and 110,000 sentences
- C_news is automatically annotated with the names from the initial set N_Voxx
- Annotation rules: exact and soft matches, erroneous names
- Result: 6,600 instances (annotated names) and 562 different names
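
The annotation step can be pictured as dictionary matching of the seed names against each sentence. The following is a minimal sketch, not the authors' implementation. The slides do not define the matching rules, so it assumes that an exact match is a seed name appearing verbatim and that a "soft" match ignores case and accents. The names `fold`, `normalize` and `annotate` are hypothetical, and the handling of erroneous names is left out.

```python
import re
import unicodedata


def fold(ch):
    """Map one character to an unaccented, lowercase character.

    Exactly one character is returned for each input character, so
    offsets in the folded text line up with the original text.
    """
    base = unicodedata.normalize("NFD", ch)[0]
    low = base.lower()
    return low if len(low) == 1 else base


def normalize(text):
    """Assumed definition of a 'soft' match: ignore case and accents."""
    return "".join(fold(c) for c in text)


def annotate(sentence, seed_names):
    """Wrap every occurrence of a seed name in <PN>...</PN> tags."""
    spans = []
    norm_sentence = normalize(sentence)
    # Longer names first, so a full name wins over a shorter seed it contains.
    for name in sorted(seed_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(normalize(name)) + r"\b"
        for m in re.finditer(pattern, norm_sentence):
            # Skip matches that overlap a span already taken by a longer name.
            if all(m.end() <= s or m.start() >= e for s, e in spans):
                spans.append((m.start(), m.end()))
    # Insert tags from right to left so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        sentence = (sentence[:start] + "<PN>" + sentence[start:end]
                    + "</PN>" + sentence[end:])
    return sentence
```

Matching the longest names first is a design choice of this sketch: it keeps a short seed from splitting a longer name that contains it.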

Method & Approach
3) Features generation
- Word-level features
- Window of 3 tokens to the left and 3 tokens to the right
- REPENTINO: gazetteer for the Portuguese language with 100 different categories
Feature                    Example
Capitalized word           Pedro or Miguel
Acronym                    NATO or USA
Word length                musician - 8
End of sentence
Grammatical category       said - verb
Lemma                      doors - door
Semantic category          journalist - job
List of REPENTINO names    Eduardo de Melo
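
Some of the word-level features above can be computed directly from the token string. The sketch below is illustrative only. It assumes that an acronym is an all-uppercase token of at least two characters, that "end of sentence" marks the last token, and that gazetteer membership is tested per token against a small set standing in for REPENTINO. Grammatical category, lemma and semantic category need external tools and are omitted. The function names are hypothetical.

```python
def token_features(token, is_last, gazetteer_tokens):
    """Features that depend on a single token only."""
    return {
        "capitalized": token[:1].isupper(),
        "acronym": len(token) > 1 and token.isupper(),
        "length": len(token),
        "end_of_sentence": is_last,
        "in_gazetteer": token in gazetteer_tokens,
    }


def window_features(tokens, i, gazetteer_tokens, size=3):
    """Collect the features of token i and of its neighbours.

    Keys are prefixed with the relative position, e.g. '-1:capitalized'.
    Positions outside the sentence get a padding marker instead.
    """
    feats = {}
    last = len(tokens) - 1
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j <= last:
            local = token_features(tokens[j], j == last, gazetteer_tokens)
            for name, value in local.items():
                feats[f"{offset:+d}:{name}"] = value
        else:
            feats[f"{offset:+d}:pad"] = True
    return feats
```

For the sentence `["Pedro", "works", "at", "NATO", "."]`, the features of the first token include `+0:capitalized` for "Pedro" and `+3:acronym` for "NATO", while the three positions to its left are padding.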

Method & Approach
4) CRF model
- Well suited to sequence analysis, particularly NER on newswire data (McCallum and Li, 2003)
- Straightforward CRF templates that describe the tokens, their positions and their features
- Build a CRF model
5) Identification of names
- Use the CRF model on HAREM
- HAREM is a corpus annotated with Named Entities for Portuguese
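
A CRF assigns one label per token, so inline <PN> annotations have to be turned into token-level labels before a model can be trained. The sketch below shows one way to do that. It assumes whitespace tokenization and a B-/I-/O labelling scheme. The slides specify neither, and `to_bio` is a hypothetical helper.

```python
import re

# Matches one annotated name; the group captures the text inside the tags.
TAG = re.compile(r"<PN>(.*?)</PN>")


def to_bio(annotated):
    """Turn a sentence with inline <PN> tags into (token, label) pairs."""
    pairs = []
    pos = 0
    for m in TAG.finditer(annotated):
        # Text before the name: every token is outside a name.
        pairs += [(tok, "O") for tok in annotated[pos:m.start()].split()]
        # First token of the name gets B-PN, the remaining ones I-PN.
        for k, tok in enumerate(m.group(1).split()):
            pairs.append((tok, "B-PN" if k == 0 else "I-PN"))
        pos = m.end()
    # Text after the last name.
    pairs += [(tok, "O") for tok in annotated[pos:].split()]
    return pairs
```

The resulting label sequence, paired with per-token features, is the usual input shape for CRF training tools.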

Experimental Set-Up
Evaluate what?
1. The quality of the annotation of the training set
2. The quality of the CRF annotator for names of people
Evaluate how?
1. Manually evaluate 1% of the news corpus (200 news items)
2. Use HAREM, a corpus annotated with Named Entities for Portuguese (the gold-standard corpus)
With which measures?
- Precision, Recall and F-measure
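
Precision and recall can be computed by comparing the predicted name spans with the gold-standard spans. The sketch below assumes that a prediction counts as correct only when its span matches a gold span exactly. The slides do not state the matching criterion, so this is an assumption, and `precision_recall` is a hypothetical name.

```python
def precision_recall(gold, predicted):
    """Entity-level precision and recall over (start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    # Precision: share of predicted names that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of gold names that were found.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

With four gold names and two predictions of which one is correct, this gives a precision of 0.5 and a recall of 0.25.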

Results
1. Quality of the annotation of the training set
- Precision of 95%
- Recall of 74%
2. Quality of the CRF annotator
- Baseline method (only features with names from REPENTINO): Precision of 55%, Recall of 8%
- Best method (features with names, structural information, and syntactic and semantic information): Precision of 79%, Recall of 23%

Analysis & Discussion
Annotation of the training set:
- Precision of 95% means that almost all annotated names were correctly identified
- Recall of 74% means the method misses some names, namely names with only one word
CRF annotator:
- Baseline (P=55%, R=8% and F-measure=14%): poor features
- Best method (P=79%, R=23% and F-measure=36%):
  - Error type I: Incorrectly identified names (36%)
  - Error type II: Name used in different context (33%)
  - Error type III: Missed name (31%)
- Milidiú et al. (2007) used HMM and achieved an F-measure of 88%, with a small, manually annotated training corpus (2100 sentences)
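
The reported F-measures are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that this balanced form (beta = 1) is the one used. The slides do not say so explicitly, but the numbers agree.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline = f_measure(0.55, 0.08)  # baseline CRF annotator
best = f_measure(0.79, 0.23)      # best CRF annotator
print(f"{baseline:.0%} {best:.0%}")
```

The same formula applied to the training-set annotation (P=95%, R=74%) gives roughly 83%, a figure the slides do not report.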

Conclusions
- We built a NER system for Portuguese:
  - Specialized in names of people
  - Completely automatic (from the training set to the model construction and the final identification of names)
  - No human annotation is necessary
- Results achieved are encouraging

Future work
- Names with only one word (nicknames)
- Study the influence of training sets of different sizes
- Study and test different features to increase recall
- Active learning
- Other NEs, such as company names, locations and jobs
- Wikipedia as an additional resource of NEs

Questions?