Cross-lingual named entity extraction and disambiguation

Tadej Štajner 1,2, Dunja Mladenić 1,2
1 Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
tadej.stajner@ijs.si

Abstract. We propose a method for identifying and disambiguating named entities in a scenario where the language of the input text differs from the language of the knowledge base. We demonstrate this functionality on English and Slovene named entity disambiguation.

Keywords: Natural language processing, knowledge management, multilingual information management, cross-lingual information retrieval

1 Introduction

Since much of our world's knowledge is present in textual form in multiple languages, rather than in a more explicit or language-neutral format, an interesting challenge is automatically integrating texts with structured and semi-structured resources, such as knowledge bases: collections of entities with various properties, such as labels and textual descriptions. Recent work takes into account the fact that all of this knowledge can be spread over many languages [6]. While Wikipedia, the free encyclopaedia, is a famous example, the same problem applies to many domains where text is present in multiple languages. In the domain of cross-lingual text annotation, we focus on the tasks of named entity extraction and disambiguation (NED). We demonstrate a multilingual named entity extraction and disambiguation pipeline, operating on English and Slovene, in order to demonstrate the capability of re-using language resources across languages within the Enrycher system [8].

1.1 Motivation

Many machine translation systems are not aware of named entities and the special handling they often require, and instead simply attempt to translate them literally. This often results in errors, for instance Google Translate changing the name of the music band Foo Fighters into Sigur Rós, an Icelandic music band, when translating from English to Icelandic. This illustrates the need for special handling of proper names in machine translation. By performing named entity extraction and disambiguation before translation, we are able to use a knowledge base to find a correct translation for that named entity.

The second problem arises when performing NED in a language that has poor domain coverage in the knowledge base. Entities that are extracted are then not correctly disambiguated, since they do not exist in that particular language, even though the entity we are looking for may exist in the knowledge base in a different language. However, directly using that other language introduces new problems, since many of the components assume that the language of the input text corresponds to the language of the knowledge base labels and descriptions.

2 Related work

The simplest solution for cross-lingual entity disambiguation is to disregard the language mismatch and use the full textual content to compute context similarity without any additional processing [1]. The authors showed that using a merged bilingual knowledge base performed significantly better than using only the document-language knowledge base, mainly due to better domain coverage, but still much worse than a monolingual scenario. Another simple baseline uses only a context-independent mention popularity measure, backed by a dictionary [2]. Such a dictionary can be constructed from anchor texts linking non-English pages to English Wikipedia pages. An ideal system would simply translate the document into the desired language and perform the disambiguation on the translation. While doing so manually is not feasible for our task, machine translation can be used instead [6]. Although this achieves up to 94% of the performance of a monolingual baseline, machine translation greatly complicates and slows down the processing, opening a window for more efficient approaches.
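
To make the dictionary-backed popularity baseline of [2] concrete, the following sketch builds a mention-to-entity count table from anchor-text pairs and links each mention to its most frequent target. The data and function names are illustrative only and do not reproduce the implementation used in [2].

```python
from collections import Counter, defaultdict

# Hypothetical anchor-text pairs harvested from links into English Wikipedia:
# (surface form in the source-language page, target English article).
anchor_pairs = [
    ("Pariz", "Paris"),
    ("Pariz", "Paris,_Texas_(film)"),
    ("Pariz", "Paris"),
    ("Washington", "Washington,_D.C."),
    ("Washington", "George_Washington"),
]

def build_popularity_dictionary(pairs):
    """Count how often each surface form links to each entity."""
    counts = defaultdict(Counter)
    for mention, entity in pairs:
        counts[mention][entity] += 1
    return counts

def most_popular_entity(dictionary, mention):
    """Context-independent baseline: pick the most frequent target for a mention."""
    candidates = dictionary.get(mention)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

popularity = build_popularity_dictionary(anchor_pairs)
print(most_popular_entity(popularity, "Pariz"))  # -> "Paris"
```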

3 Problem description

We state the problem as identifying and disambiguating concepts that appear as mentions within a fragment of text. Disambiguation is important because phrases may have many distinct meanings. While human readers are able to infer the meaning from context, this task is difficult for computers. For instance, the phrase Washington can be a person, a location or an organization, and even constraining its type to a location still yields over sixty different locations with that name.

3.1 Named entity extraction

Named entity extraction is the task of using the surrounding context to isolate the part of the text that represents an entity referred to by a proper name. It is often coupled with entity classification, which determines the class the entity belongs to, for instance a person or an organization. In general, these components are implemented as supervised sequence classifiers.

3.2 Named entity disambiguation

The ambiguities inherently present in natural language pose the challenge of determining the actual identities of the entities mentioned in a document (e.g., Paris can refer to the city in France, but also to a small city in Texas, USA, or to the 1984 film directed by Wim Wenders titled Paris, Texas). Well-defined entities and relationships are a property of a knowledge model which asserts that a single term has only a single meaning. In that case, we refer to terms as entities. We achieve this property by performing entity resolution [3]. In general, state-of-the-art entity disambiguation systems use three main heuristics [5]:

Mention popularity captures the overall most likely meanings of entity phrases. It is typically modelled by the conditional probability of the named entity given a mention.

Context similarity captures the entity that best fits the topical context around the mention. It is modelled by the similarity between the mention's context and the entity's context, using a similarity measure operating on a bag-of-words model. The mention's context is a window of words around the mention in the input text, and the entity's context is its description.

Coherence collectively captures the entities that make sense appearing together because they are somehow related to one another. While context similarity operates on a single mention-entity pair, the coherence heuristic is collective, operating on the whole input document. It is typically approximated by a greedy graph pruning algorithm.
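
As a minimal sketch of how the first two heuristics can be combined (coherence is omitted), assume a candidate list that carries a popularity count and a short textual description per entity; the linear weighting, the names and the toy data below are illustrative assumptions rather than the exact model of any cited system.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very simple bag-of-words context representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score_candidates(mention_context, candidates, alpha=0.5):
    """Combine mention popularity and context similarity for one mention.

    candidates: list of (entity_id, popularity_count, description) tuples.
    """
    total = sum(pop for _, pop, _ in candidates) or 1
    ctx = bag_of_words(mention_context)
    scored = []
    for entity, pop, description in candidates:
        popularity = pop / total                       # estimate of P(entity | mention)
        similarity = cosine(ctx, bag_of_words(description))
        scored.append((alpha * popularity + (1 - alpha) * similarity, entity))
    return max(scored)

# Illustrative candidates for the mention "Washington" in a political news context.
candidates = [
    ("Washington,_D.C.", 900, "capital city of the United States federal government"),
    ("George_Washington", 500, "first president of the United States army general"),
    ("Washington_(state)", 300, "state in the Pacific Northwest of the United States"),
]
print(score_candidates("the government in Washington announced new federal policy", candidates))
```

In the full systems cited above, the coherence heuristic would additionally re-rank such per-mention decisions jointly over the whole document.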

3.3 Cross-lingual named entity disambiguation

When extending this pipeline to a scenario where the input and the knowledge base are represented in different languages, the biggest impact of this change is on the context similarity heuristic. Because it operates on the level of lexical similarity, its output has little meaning once the assumption of a single language is removed.

4 Proposed method

We propose a method that incorporates a cross-lingual similarity measure into the framework. Instead of computing literal context similarity between two contexts in different languages, we use an additional linear mapping that maps a vector of bag-of-words features into a corresponding vector in the other language. This enables us to perform meaningful similarity computation in the same vector space. The method used in this approach is Regression Canonical Correlation Analysis (rCCA) [4], a dimensionality reduction technique operating on two views that finds linear combinations of vectors from both views (languages) that are maximally correlated. The first vector corresponds to the input document, while the second one corresponds to its optimal mapping. However, instead of calculating this mapping in advance, we solve the optimization problem for each input document separately, using the input document as the initial projection vector.

Figure 1: The setup of obtaining similarity in cross-lingual NED (input text, cross-lingual mapping, mapped text, knowledge base entity; direct similarity versus cross similarity).

Figure 1 represents the two ways of obtaining a context similarity measure between an input document and one of the candidate entities. When the languages of the input and the knowledge base are the same, we use direct similarity. When they differ, we first apply the cross-lingual mapping to project the input text into a vector space compatible with the knowledge base. However, using a cross-lingual mapping exposes us to the risk of poor domain coverage: initial experiments show that when the cross-lingual mapping is not able to map some of the words from the input document, performance suffers. Therefore, we interpolate the cross similarity with the direct similarity, weighted by the proportion of the words that the cross-lingual mapping was able to recognize.
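
The following sketch illustrates the coverage-based interpolation under simplifying assumptions: the learned rCCA projection is replaced by a toy word-level mapping, the similarity function is a simple word overlap, and weighting the cross similarity by the mapping coverage is one natural reading of the scheme described above.

```python
# Sketch of cross-lingual context similarity with coverage-based interpolation.
# The rCCA projection is approximated by a toy dictionary (`toy_mapping`);
# in the real system the mapping is a learned linear operator between
# bag-of-words spaces of the two languages.

def map_context(context_words, mapping):
    """Map source-language words into the knowledge-base language where possible."""
    mapped, recognized = [], 0
    for word in context_words:
        if word in mapping:
            mapped.append(mapping[word])
            recognized += 1
    coverage = recognized / len(context_words) if context_words else 0.0
    return mapped, coverage

def word_overlap(words_a, words_b):
    """Simple set-overlap similarity used for illustration."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def interpolated_similarity(context_words, entity_description, mapping, sim=word_overlap):
    """Interpolate cross similarity and direct similarity by mapping coverage."""
    mapped_words, coverage = map_context(context_words, mapping)
    cross = sim(mapped_words, entity_description)
    direct = sim(context_words, entity_description)
    return coverage * cross + (1.0 - coverage) * direct

# Hypothetical Slovene-to-English word mapping standing in for the rCCA projection.
toy_mapping = {"mesto": "city", "francija": "france", "glavno": "capital"}

context = ["pariz", "glavno", "mesto", "francija"]
entity_description = ["paris", "capital", "city", "of", "france"]
print(interpolated_similarity(context, entity_description, toy_mapping))
```

The practical effect is that when the mapping covers little of the input, the score degrades gracefully towards the direct, same-language similarity.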

In pre-processing, we use the Stanford Named Entity Recognizer [9] for English named entity recognition. For Slovene, we have developed a named entity recognizer based on a CRF (conditional random fields) model trained on the SSJ-500k corpus [7].
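
The paper does not state which CRF toolkit was used; as a hedged illustration of how such a sequence labeller can be set up, the sketch below uses the sklearn-crfsuite package with minimal token features and an invented toy sentence in place of the annotated corpus.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sentence, i):
    """Minimal per-token features for an illustrative NER model."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

# Invented toy training data; the actual model is trained on the annotated corpus.
sentences = [["Janez", "Novak", "dela", "v", "Ljubljani", "."]]
labels = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```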

5 Discussion and conclusions

Preliminary experiments show that obtaining a cross-lingual mapping does improve on context-similarity-based NED when the training corpus and the input text share a common topic. However, it is not yet certain whether it compares favourably to a machine-translation-based system. Current work demonstrates that the interpolation between direct and cross-lingual similarity helps the robustness of the system. Future work will involve evaluating different cross-lingual similarity models, as well as transliteration models and the data integration issues that arise when dealing with multilingual knowledge bases.

References

[1] A. Lommatzsch et al. Named Entity Disambiguation for German News Articles. WIR 2010.
[2] V. I. Spitkovsky and A. X. Chang. Strong Baselines for Cross-Lingual Entity Linking. TAC 2011.
[3] T. Štajner and D. Mladenić. Entity Resolution in Texts Using Statistical Learning and Ontologies. ASWC 2009.
[4] J. Rupnik and B. Fortuna. Regression Canonical Correlation Analysis. Learning from Multiple Sources, NIPS Workshop, 2008.
[5] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust Disambiguation of Named Entities in Text. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 782-792, 2011.
[6] P. McNamee, J. Mayfield, D. W. Oard, D. Lawrie, and D. Doermann. Cross-Language Entity Linking. IJCNLP 2011, 255-263.
[7] Učni korpus (training corpus), Sporazumevanje v slovenskem jeziku, http://www.xn--sloveninaqfb73g.eu/vsebine/sl/aktivnosti/ucnikorpus.aspx, April 2012.
[8] T. Štajner, D. Rusu, L. Dali, B. Fortuna, D. Mladenić, and M. Grobelnik. A Service Oriented Framework for Natural Language Text Enrichment. Informatica (Ljubljana), vol. 34, no. 3, 307-313, 2010. http://enrycher.ijs.si
[9] J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the ACL (ACL 2005), 363-370, 2005.

For wider interest

When attempting to understand text, one of the tasks that needs to be solved is named entity disambiguation: for instance, Paris can refer to the city in France, but also to a small city in Texas, USA, or to the 1984 film directed by Wim Wenders titled Paris, Texas. Knowing the correct answer depends on the context. However, context is difficult to interpret if the input text is expressed in a different language than the knowledge base that these entities belong to. This is a very common scenario when processing Slovene text: while using the Slovene Wikipedia for this purpose is easy, it does not contain many entities that we may be interested in, and although the English one is over thirty times bigger, it introduces a language barrier. We overcome this by applying techniques from cross-lingual information retrieval to the problem of identifying proper names in text and linking them to concrete knowledge base concepts. Another goal was to re-use language resources from languages with more resources in languages with fewer available resources. The work presented has resulted in a usable named entity extraction and disambiguation service that is able to work on Slovene text even while having a knowledge base in English. The demonstration is available at http://enrycher.ijs.si