Cross-lingual named entity extraction and disambiguation

Tadej Štajner 1,2, Dunja Mladenić 1,2
1 Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
tadej.stajner@ijs.si

Abstract. We propose a method for identifying and disambiguating named entities in a scenario where the language of the input text differs from the language of the knowledge base. We demonstrate this functionality on English and Slovene named entity disambiguation.

Keywords: natural language processing, knowledge management, multilingual information management, cross-lingual information retrieval

1 Introduction

Much of the world's knowledge is available as text in multiple languages rather than in a more explicit or language-neutral format. An interesting challenge is therefore to automatically integrate texts with structured and semi-structured resources such as knowledge bases: collections of entities with various properties, such as labels and textual descriptions. Recent work focuses on the fact that this knowledge can be spread over many languages [6]. While Wikipedia, the free encyclopaedia, is a famous example, the same problem applies to many domains where text is present in multiple languages.

Within the domain of cross-lingual text annotation, we focus on the tasks of named entity extraction and disambiguation (NED). We present a multilingual named entity extraction and disambiguation pipeline for English and Slovene, built within the Enrycher system [8], to demonstrate that language resources can be re-used across languages.

1.1 Motivation

Many machine translation systems are not aware of named entities and the special handling they often require, and instead simply attempt to translate them literally. This often results in errors, for instance Google Translate changing the name of the music band Foo Fighters into Sigur Ros, an Icelandic music band, when translating from English to Icelandic. This illustrates the need for special handling of proper names in machine translation. By performing named entity extraction and disambiguation before translation, we are able to use a knowledge base to find a correct translation for that named entity.

A second problem arises when performing NED in a language that has poor domain coverage in the knowledge base. Entities that are extracted are then not correctly disambiguated, since they do not exist in the knowledge base in that particular language. The entity that we are looking for may nevertheless exist in the knowledge base in a different language. Directly using that language, however, introduces new problems, since many of the components assume that the language of the input text corresponds to the language of the knowledge base labels and descriptions.

2 Related work

The simplest solution for cross-lingual entity disambiguation disregards the language mismatch and uses the full textual content to compute context similarity without any additional processing [1]. The authors have shown that using a merged bilingual knowledge base performed significantly better than using only the knowledge base in the document language, mainly due to better domain coverage, but much worse than a monolingual scenario.

Another simple baseline relies only on the context-independent mention popularity measure, backed by a dictionary [2]. The dictionary can be constructed from the anchor texts of links from non-English to English Wikipedia pages.

An ideal system would simply translate the document into the desired language and perform the disambiguation on the translation. While doing so manually is not feasible for our task, one may use machine translation to do this [6].
While that approach achieves up to 94% of the performance of a monolingual baseline, machine translation greatly complicates and slows down the processing, leaving room for more efficient approaches.

3 Problem description

We state the problem as identifying and disambiguating concepts that appear as mentions within a fragment of text. Disambiguation is important because phrases may have many distinct meanings. While human readers are able to infer the meaning from context, this task is difficult for computers. For instance, the phrase Washington can refer to a person, a location or an organization, and even constraining its type to a location yields over sixty different possible locations with that name.

3.1 Named entity extraction

Named entity extraction is the task of using the surrounding context to isolate the part of the text that represents an entity referred to by a proper name. It is often coupled with entity classification, which determines the class to which the entity belongs, for instance a person or an organization. In general, both are implemented as supervised sequence classifiers.

3.2 Named entity disambiguation

Ambiguities, which are inherently present in natural languages, make it challenging to determine the actual identities of entities mentioned in a document (e.g., Paris can refer to a city in France, but it can also refer to a small city in Texas, USA, or to a 1984 film directed by Wim Wenders having the title Paris, Texas). Well-defined entities and relationships are a property of a knowledge model which asserts that a single term has only a single meaning. In that case, we refer to terms as entities. We achieve this property by performing entity resolution.

In general, state-of-the-art entity disambiguation systems use three main heuristics:

Mention popularity: This heuristic captures the overall most likely meanings of entity phrases. It is typically modelled by the conditional probability of the named entity given a mention.

Context similarity: This heuristic captures the entity that best fits the topical context around the mention. It is modelled by the similarity of the mention's context and the entity's context, using a similarity measure operating on a bag-of-words model. The mention's context is a window of words around the mention in the input text, and the entity's context is its description.

Coherence: This heuristic collectively captures the entities that make sense appearing together because they are related to one another. While context similarity operates on a single mention-entity pair, the coherence heuristic is collective, operating on the whole input document. It is typically implemented with a greedy graph pruning algorithm.
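
As an illustration of the first two heuristics, the following is a minimal Python sketch, not the implementation used in Enrycher. It assumes that mention popularity is estimated from counts of how often a mention links to each entity, and that context similarity is the cosine similarity of raw word counts. The function names, the entity identifiers and the toy data are all hypothetical.

```python
import math
from collections import Counter


def mention_popularity(anchor_counts, mention):
    """Estimate P(entity | mention) from link counts (assumed data source)."""
    counts = anchor_counts.get(mention, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in counts.items()}


def bag_of_words(text):
    """Lower-cased, whitespace-tokenized word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_similarity(mention_window, entity_description):
    """Similarity of the words around a mention and an entity description."""
    return cosine(bag_of_words(mention_window), bag_of_words(entity_description))


# Toy data, invented purely for illustration.
anchor_counts = {"Paris": {"Paris_(France)": 90, "Paris_(Texas)": 10}}
descriptions = {
    "Paris_(France)": "capital city of France on the river Seine",
    "Paris_(Texas)": "small city in Lamar County Texas USA",
}
window = "the rodeo in Paris drew visitors from all over Texas"

popularity = mention_popularity(anchor_counts, "Paris")
similarity = {e: context_similarity(window, d) for e, d in descriptions.items()}
best_by_popularity = max(popularity, key=popularity.get)
best_by_context = max(similarity, key=similarity.get)
```

In this toy case the two heuristics disagree: popularity alone favours the French capital, while the context window shares more words with the description of the Texan city. How the heuristics are combined is a separate design decision that this sketch does not cover.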

3.3 Cross-lingual named entity disambiguation

When this pipeline is extended to a scenario where the input and the knowledge base are in different languages, the context similarity heuristic is affected the most. Because it operates on the level of lexical similarity, its output has little meaning once the assumption of a single language is removed.

4 Proposed method

We propose a method that incorporates a cross-lingual similarity measure into the framework. Instead of computing literal context similarity between two contexts in different languages, we use an additional linear mapping that maps a vector of bag-of-words features in one language into such a vector in another language. This enables us to perform a meaningful similarity computation in a common vector space.

The method used in this approach is Regression Canonical Correlation Analysis (RCCA) [4], a dimensionality reduction technique operating on two views that finds linear combinations of vectors from both views (languages) that are maximally correlated. The first vector corresponds to the input document, while the second one corresponds to its optimal mapping. However, instead of calculating this mapping in advance, we solve the optimization problem separately for each input document, using the input document as the initial projection vector.

[Figure 1: The setup for obtaining similarity in cross-lingual NED. Diagram labels: input text, cross-lingual mapping, mapped text, knowledge base entity, direct similarity, cross similarity.]

Figure 1 represents the two ways of obtaining a context similarity measure between an input document and one of the candidate entities. When the languages of the input and the knowledge base are the same, we use direct similarity. When they differ, we first use the cross-lingual mapping (green triangle) to map the input text into a vector space compatible with the knowledge base. However, using a cross-lingual mapping exposes us to the risk of poor domain coverage. Initial experiments show that performance suffers when the cross-lingual mapping is not able to map some of the words from the input document. Therefore, we interpolate between the cross similarity and the direct similarity, weighted by the proportion of the words that the cross-lingual mapping was able to recognize.

In pre-processing, we use the Stanford Named Entity Recognizer [9] for English named entity recognition. For Slovene, we have developed a named entity recognizer using a CRF (conditional random fields) model trained on the SSJ-500k corpus [7].

5 Discussion and conclusions

Preliminary experiments show that a cross-lingual mapping does improve context-similarity based NED when the training corpus and the input text share a common topic. However, it is not yet certain whether it compares favourably to a system based on machine translation. The current work demonstrates that interpolating between direct and cross-lingual similarity helps the robustness of the system. Future work will involve evaluating different cross-lingual similarity models, as well as transliteration models and the data integration issues that arise when dealing with multilingual knowledge bases.

References:

[1] A. Lommatzsch et al., Named Entity Disambiguation for German News Articles, WIR 2010
[2] Spitkovsky, V.I. and Chang, A.X., Strong baselines for cross-lingual entity linking, TAC 2011
[3] T. Štajner and D. Mladenić: Entity resolution in texts using statistical learning and ontologies, ASWC 2009
[4] J. Rupnik, B. Fortuna. Regression Canonical Correlation Analysis. Learning from Multiple Sources, NIPS Workshop, 2008
[5] Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G. (2011). Robust Disambiguation of Named Entities in Text. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 782-792.
[6] McNamee, P., Mayfield, J., Oard, D. W., Lawrie, D., & Doermann, D. (2011). Cross-Language Entity Linking. IJCNLP 2011, 255-263.
[7] Učni korpus Sporazumevanje v Slovenskem Jeziku, http://www.xn--sloveninaqfb73g.eu/vsebine/sl/aktivnosti/ucnikorpus.aspx, April 2012
[8] Štajner, T., Rusu, D., Dali, L., Fortuna, B., Mladenić, D., Grobelnik, M. A service oriented framework for natural language text enrichment. Informatica (Ljublj.), 2010, vol. 34, no. 3, 307-313. http://enrycher.ijs.si
[9] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the ACL (ACL 2005), pp. 363-370.
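
As a supplementary note to Section 4, the interpolation between direct and cross similarity can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper does not give the exact formula, so the sketch assumes a linear interpolation in which the cross similarity is weighted by the mapping's coverage of the input words and the direct similarity by the remainder. The function names, the vocabulary and the similarity values are hypothetical.

```python
def coverage(input_words, mapped_vocabulary):
    """Fraction of input words that the cross-lingual mapping recognizes."""
    if not input_words:
        return 0.0
    known = sum(1 for word in input_words if word in mapped_vocabulary)
    return known / len(input_words)


def interpolated_similarity(direct_sim, cross_sim, cov):
    """Linear interpolation between cross and direct similarity.

    Assumption: the cross similarity is trusted in proportion to the
    coverage `cov`, and the direct similarity fills in the rest.
    """
    return cov * cross_sim + (1.0 - cov) * direct_sim


# Toy Slovene input; the vocabulary and similarity values are invented.
words = ["predsednik", "obiskal", "washington", "danes"]
vocabulary = {"predsednik", "obiskal", "danes"}

cov = coverage(words, vocabulary)
score = interpolated_similarity(direct_sim=0.2, cross_sim=0.6, cov=cov)
```

With full coverage the score reduces to the cross similarity, and with no coverage it falls back to the direct similarity, which matches the robustness argument made in Section 5.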

For wider interest

When attempting to understand text, one of the tasks that needs to be solved is named entity disambiguation: for instance, Paris can refer to a city in France, but it can also refer to a small city in Texas, USA, or to a 1984 film directed by Wim Wenders having the title Paris, Texas. Knowing the correct answer depends on the context. However, context is difficult to interpret if the input text is expressed in a different language than the knowledge base that these entities belong to. This is a very common scenario in processing Slovene text. While using the Slovene Wikipedia for this purpose is easy, it does not contain many of the entities that we may be interested in. The English one is over thirty times bigger, but it introduces a language barrier. We overcome this by applying techniques from cross-lingual information retrieval to the problem of identifying proper names in text and linking them to concrete knowledge base concepts. Another goal was to re-use language resources from well-resourced languages for languages with fewer available resources. The work presented has resulted in a usable named entity extraction and disambiguation service that is able to work on Slovene text even when the knowledge base is in English. The demonstration is available at http://enrycher.ijs.si