NATURAL LANGUAGE PROCESSING


LESSON 13: PARAPHRASING / ONTOLOGY MAPPING

OUTLINE
Paraphrase
Methods (linguistic resources, corpus based)
Ontology Mapping
Monolingual Ontology Mapping
Cross Lingual Ontology Mapping
CLOM Approaches

PARAPHRASE
The richness of language allows humans to express the same idea in very different ways. This variability of expression is a major source of difficulty in most NLP applications. One way to address the problems it causes is to acquire paraphrases.
Paraphrase: a set of sentences expressing the same idea or describing the same event.

TYPES OF PARAPHRASE
Lexical paraphrase or synonym: individual lexical elements having the same meaning (eat / consume).
Sub-sentence paraphrase: textual units (segments or fragments of text) sharing the same semantic content (Y was built by X / X is the creator of Y).
Sentential paraphrase: two sentences representing the same semantic content (I finished my work / I completed my assignment).
The presence of paraphrase greatly complicates all applications aimed at modeling, understanding and producing natural language by machine.

APPLICATION AREAS
Most automatic language processing systems are confronted in some way with the phenomenon of paraphrase. However, most work dealing with paraphrase focuses on using its features to improve automatic systems rather than on understanding paraphrase itself.
Question Answering System (QAS)
Machine Translation
Document Summarization

QUESTION ANSWERING SYSTEM (QAS)
Question answering (QA) is challenging because natural language can express the same information need in many different ways. As a result, small variations in semantically equivalent questions may yield different answers. For example, a hypothetical QA system must recognize that the questions "who created Microsoft" and "who started Microsoft" have the same meaning, and that both convey the founder relation, in order to retrieve the correct answer from a set of documents.

MACHINE TRANSLATION
The hypotheses produced by a system are evaluated by measuring their similarity to reference translations created by humans. These similarity measures are essentially based on the number of groups of words (n-grams) that the two sentences have in common. However, a single reference translation cannot cover the different formulations of the same semantic content. This can penalize translation hypotheses that convey the same meaning but use expressions different from those in the reference.
Source: çizgi filmleri görmek istiyorum
Reference: I would like to watch cartoons
Sys 1: I want to see cartoons
Sys 2: I would like to watch movies
Sys 1 keeps the meaning of the source but shares few words with the reference, while Sys 2 shares most of its words with the reference but changes the meaning.

DOCUMENT SUMMARIZATION
In automatic summarization, identifying paraphrases makes it possible to condense the information contained in several documents and to improve the quality of automatic summaries. Producing a paraphrase shorter than the original sentence condenses a text, an essential step in automatic summarization. For example, the sentence "She hates apples, oranges, pears." is summarized as "She hates fruits."
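The machine translation example above can be checked with a small sketch of an n-gram overlap measure. This is a simplified, assumed stand-in for the similarity measures the slide refers to (it computes only clipped n-gram precision against one reference, with whitespace tokenization); the function names are illustrative and not taken from the lesson.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (groups of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Clip each count so a repeated n-gram is not credited more often
    # than it appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "I would like to watch cartoons"
sys1 = "I want to see cartoons"          # same meaning, different wording
sys2 = "I would like to watch movies"    # similar wording, different meaning

for name, hyp in (("Sys 1", sys1), ("Sys 2", sys2)):
    print(name,
          round(ngram_precision(hyp, reference, 1), 2),
          round(ngram_precision(hyp, reference, 2), 2))
```

With this toy measure Sys 2 scores higher than Sys 1 on both unigrams and bigrams, even though Sys 1 is the better translation. This is the penalty on valid paraphrases that a single reference produces.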

PREVIOUS WORK
In recent years, several works have addressed the processing of paraphrase. Paraphrases can be extracted by two main families of methods:
Methods exploiting linguistic resources
Corpus-based methods

METHODS EXPLOITING LINGUISTIC RESOURCES
For a source segment, a paraphrase is obtained by replacing certain words with their synonyms.
1. Extract synonyms for the terms to be substituted from a semantic network such as WordNet.
2. Choose the synonym best suited to the context in which each term appears.
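The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `SYNONYMS` table stands in for a lookup in a semantic network such as WordNet, and the `COOCCURRENCE` counts are invented toy data standing in for whatever context model a real system would use. Neither comes from the lesson.

```python
# Hypothetical synonym table (step 1); a real system would query a
# semantic network such as WordNet instead.
SYNONYMS = {
    "finished": ["ended", "completed"],
    "eat": ["devour", "consume"],
}

# Hypothetical counts of how often a candidate synonym appears near a
# context word (step 2); invented numbers for illustration only.
COOCCURRENCE = {
    ("completed", "work"): 8,
    ("ended", "work"): 2,
    ("consume", "food"): 5,
    ("devour", "food"): 3,
}


def context_score(candidate, context):
    """Sum the co-occurrence counts of a candidate with the context words."""
    return sum(COOCCURRENCE.get((candidate, word), 0) for word in context)


def paraphrase(sentence):
    """Replace each word that has synonyms by the best-fitting synonym."""
    tokens = sentence.split()
    output = []
    for i, token in enumerate(tokens):
        candidates = SYNONYMS.get(token.lower())
        if not candidates:
            output.append(token)
            continue
        # The context is every other word of the sentence.
        context = [t.lower() for j, t in enumerate(tokens) if j != i]
        output.append(max(candidates, key=lambda c: context_score(c, context)))
    return " ".join(output)


print(paraphrase("I finished my work"))
```

The choice of a co-occurrence score is only one option; any measure of how well a synonym fits its context could be plugged into `context_score`.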

CORPUS-BASED METHODS
The techniques used to extract paraphrases generally depend heavily on the type of corpus on which they were developed.

Corpus types:
Monolingual corpus
Comparable monolingual corpus
Parallel monolingual corpus
Parallel multilingual corpus

MONOLINGUAL CORPUS
A corpus of similar documents from the Web. For example, paraphrases can be recognized automatically from the revisions of Wikipedia (a free online encyclopedia, created and edited by volunteers around the world).
Example (Hubert Beuve-Méry): He founded the French-speaking [newspaper / daily paper] "Le Monde" in 1944.

COMPARABLE MONOLINGUAL CORPUS
It is composed of text pairs associated on the basis of a measure of textual similarity, such as newspaper articles published in the same time interval.
CNN: Bush says he'll help NY with $20 billion
Washington Post: Bush Reassures New York of $20 Billion

PARALLEL MONOLINGUAL CORPUS
It consists of pairs of statements with equivalent meaning, aligned in a supervised manner, such as multiple translations of the same book:
Emma burst into tears and he tried to comfort her, saying things to make her smile.
Emma cried, and he tried to console her, adorning his words with puns.
or groups of questions having the same answer:
How many ounces are there in a pound?
What's the number of ounces per pound?
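For the comparable monolingual corpus described above, the pairing of texts by textual similarity can be sketched as below. The lesson does not say which similarity measure is used; Jaccard overlap of lowercased word sets and the threshold value are assumptions chosen for simplicity, and the third headline is invented as a non-matching example.

```python
def jaccard(text_a, text_b):
    """Word-set overlap: size of the intersection over size of the union."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_comparable(source_a, source_b, threshold):
    """Pair every text of one source with the texts of the other source
    whose similarity reaches the threshold."""
    return [(a, b)
            for a in source_a
            for b in source_b
            if jaccard(a, b) >= threshold]


cnn = "Bush says he'll help NY with $20 billion"
washington_post = "Bush Reassures New York of $20 Billion"
unrelated = "Stocks close higher after earnings reports"  # invented headline

pairs = pair_comparable([cnn], [washington_post, unrelated], threshold=0.2)
print(pairs)
```

The two Bush headlines share only three word forms ("bush", "$20", "billion"), so the threshold has to be fairly low. This shows why comparable corpora give noisier pairs than parallel ones.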

PARALLEL MULTILINGUAL CORPUS
It consists of pairs of sentences available in two or more languages (such as transcripts of European parliamentary debates). Bannard and Callison-Burch (2005) propose a pivot approach in which segments aligned with the same terms in the pivot language are considered potential paraphrases.
Example from a German-English corpus: "in check" and "under control" are both aligned with the German "unter Kontrolle", so they are potential paraphrases of each other.

ONTOLOGY
In recent years, with the rapid evolution of the World Wide Web (WWW), the sources of information have become more and more varied in form (article, wiki, video, photo, library, etc.). These sources are represented in forms that are useful to users but difficult for a computer to process automatically. Several computer applications, such as information retrieval, summarization or machine translation, therefore require more and more tools able to manage knowledge expressed in natural language.
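Returning to the pivot approach of Bannard and Callison-Burch (2005) described under PARALLEL MULTILINGUAL CORPUS: a candidate paraphrase e2 of a phrase e1 can be scored by summing, over every pivot phrase f, the probability of translating e1 into f times the probability of translating f back into e2. The sketch below illustrates this with two tiny phrase tables. All probabilities, and the extra entries "im griff" and "in hand", are invented for illustration and do not come from the lesson or from a real corpus.

```python
from collections import defaultdict

# Hypothetical phrase-translation probabilities p(german | english).
P_DE_GIVEN_EN = {
    "under control": {"unter kontrolle": 0.8, "im griff": 0.2},
}

# Hypothetical phrase-translation probabilities p(english | german).
P_EN_GIVEN_DE = {
    "unter kontrolle": {"under control": 0.75, "in check": 0.25},
    "im griff": {"under control": 0.5, "in hand": 0.5},
}


def paraphrase_scores(phrase):
    """Score paraphrase candidates of an English phrase via German pivots:
    score(e2) = sum over pivots f of p(f | e1) * p(e2 | f)."""
    scores = defaultdict(float)
    for pivot, p_pivot in P_DE_GIVEN_EN.get(phrase, {}).items():
        for candidate, p_candidate in P_EN_GIVEN_DE.get(pivot, {}).items():
            if candidate != phrase:  # a phrase is not its own paraphrase
                scores[candidate] += p_pivot * p_candidate
    return dict(scores)


scores = paraphrase_scores("under control")
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In a real setting the two tables would be estimated from word-aligned parallel text; here they are hard-coded so that the computation is easy to follow.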

Such systems generally require intelligent processing of the textual content of the information sources available on the web. The crucial problem to solve is the polysemy of words. Many efforts have been made in this field, with the aim of enabling the machine to understand the information and to extract its meaning from the words, in order to facilitate its use in automatic processing. As a result, techniques and tools for the automatic pre-processing of information sources have become a necessity.

Ontologies are among the tools that allow the semantic representation of information sources in order to make them interpretable by machine. In particular, ontologies make it possible to represent a body of knowledge in a form usable by machine. They aim to provide shared, common knowledge about a domain to facilitate knowledge sharing and reuse. This knowledge is represented as a structured set of concepts organized in the form of a graph, whose relations can be semantic relations.

Concretely, in the context of NLP, the use of an ontology aims to improve the quality and generality of a system. Ontologies make it possible to obtain a representation of the text that is deeper, more abstract and independent of language.

MONOLINGUAL ONTOLOGY MAPPING
The heterogeneity issue arises when ontologies are authored by different actors. It resembles the database management problem in which database administrators use different terms to store the same information in different database systems. The views on the same domain of interest differ from one person to the next, depending on their conceptual model and background knowledge. To address the heterogeneity issue arising from ontologies, ontology mapping has become an important research field.

Ontology mapping is viewed as a two-step process: the first step generates candidate correspondences (i.e. pre-evaluation) and the second step generates validated correspondences (i.e. post-evaluation).
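The two-step process above can be sketched as follows. The lesson does not specify how candidates are generated or validated, so everything concrete here is an assumption: label similarity is computed with Python's `difflib.SequenceMatcher`, a loose threshold stands in for candidate generation, and a stricter threshold plus a best-match-per-source rule stands in for validation. The ontology labels are invented.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """String similarity between two labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_candidates(source_labels, target_labels, min_score=0.5):
    """Step 1 (pre-evaluation): keep every loosely similar label pair."""
    candidates = []
    for source in source_labels:
        for target in target_labels:
            score = label_similarity(source, target)
            if score >= min_score:
                candidates.append((source, target, score))
    return candidates


def validate(candidates, min_score=0.8):
    """Step 2 (post-evaluation): keep, for each source label, its best
    candidate, provided the score is high enough."""
    best = {}
    for source, target, score in candidates:
        if score >= min_score and (source not in best or score > best[source][1]):
            best[source] = (target, score)
    return [(source, target, score) for source, (target, score) in best.items()]


source_ontology = ["Author", "Paper"]               # invented labels
target_ontology = ["Authors", "Writer", "Article"]  # invented labels

candidates = generate_candidates(source_ontology, target_ontology)
validated = validate(candidates)
print([(source, target) for source, target, _ in validated])
```

Note that "Paper" and "Article" are not matched: a purely lexical comparison cannot see that they mean the same thing, which is the kind of heterogeneity that richer mapping techniques are meant to handle.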

CROSS LINGUAL ONTOLOGY MAPPING
Many tools have been developed to facilitate monolingual ontology matching, that is, matching between ontologies written in the same natural language. However, knowledge representations are not restricted to a single natural language, so matching tools and techniques must be able to work with ontologies written in different natural languages. For example, a match may be established between the concept <"#Nebat"> in the source ontology and the concept <"#bitki"> in the target ontology (i.e. both ontologies are in Turkish). However, when lexical comparison is not possible between two different languages (e.g. English and Turkish), a match to the concept <"#Plant"> in an English ontology would be missed by monolingual matching tools.

Given the limitations of existing matching tools, which focus mostly on monolingual matching, there is a pressing need for matching techniques that can work with ontologies in different natural languages. One way to enable semantic interoperability between such ontologies is cross-lingual ontology mapping. Cross-lingual ontology mapping (CLOM) refers to the process of establishing relationships among ontological resources from two or more independent ontologies, where each ontology is labeled in a different natural language.

CATEGORIES OF CLOM APPROACHES
Current approaches to CLOM can be grouped into five categories:
Manual CLOM
Corpus-based CLOM
CLOM via linguistic enrichment
CLOM via indirect alignment
Translation-based CLOM

MANUAL CLOM
Manual CLOM refers to approaches that rely only on human experts, who generate the mappings by hand. An example of manual CLOM: the English thesaurus AGROVOC (developed by the FAO, containing a set of agricultural vocabularies) is mapped by hand to the Chinese thesaurus CAT (Chinese Agricultural Thesaurus, developed by the Chinese Academy of Agricultural Sciences). The thesauri are assigned to groups of terminologists who generate the mappings. These manually generated mappings are then reviewed and stored. The advantage of this approach is that the mappings are likely to be accurate and reliable. However, for large and complex ontologies it can be time-consuming.

CORPUS-BASED CLOM
Corpus-based CLOM refers to approaches that require the assistance of bilingual corpora when generating mappings. An example is presented in [Ngai et al., 2002]: Ngai et al. use a bilingual newspaper corpus to align WordNet (in English) and HowNet (in Chinese). The advantage of this approach is that the corpora do not need to be parallel, which makes their construction easier. However, a disadvantage is that constructing corpora can be costly for domain-specific ontologies.

CLOM VIA LINGUISTIC ENRICHMENT
Pazienza & Stellato [2005] developed an interface that allows synonyms (e.g. extracted from WordNet) to be added during ontology development. Linguistic enrichment of ontological resources offers strong evidence for the mapping generation process. However, this enrichment process is currently not standardized. As a result, it can be difficult to build CLOM algorithms on top of these linguistically enriched ontologies.

CLOM VIA INDIRECT ALIGNMENT
This refers to the process of generating new CLOM results from pre-existing CLOM results. An example is presented in [Jung et al., 2009], which describes indirect alignment among ontologies in English, Korean and Swedish. Given an alignment A generated between ontology O_i (e.g. in Korean) and O_j (e.g. in English), and an alignment A' generated between ontology O_j and O_k (e.g. in Swedish), mappings between O_i and O_k can be generated by reusing A and A', since both involve the common ontology O_j.

TRANSLATION-BASED CLOM
Here the CLOM problem is first converted to a monolingual ontology mapping (MOM) problem, which is then solved using MOM techniques. It can be summarized as follows: given ontologies O1 and O2 that are labeled in different natural languages, the labels of O1 are first translated into the language used by O2. As both ontologies are now labeled in the same natural language, the mappings between them can be created by applying monolingual ontology matching techniques. The outcome of the mapping process depends on the translations selected for the given ontology resources. To generate quality mapping results, translations must be selected appropriately.
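Returning to indirect alignment: reusing A and A' amounts to composing the two alignments through the shared ontology O_j. The sketch below assumes that an alignment is a list of (entity, entity, confidence) triples and that confidences are multiplied along the path; both choices are illustrative assumptions, not details given in the lesson or attributed here to [Jung et al., 2009]. The concept labels are invented examples.

```python
def compose_alignments(alignment_ij, alignment_jk):
    """Derive O_i-to-O_k correspondences from an O_i-to-O_j alignment
    and an O_j-to-O_k alignment that share the ontology O_j."""
    # Index the second alignment by its O_j entity for direct lookup.
    by_shared_entity = {}
    for entity_j, entity_k, confidence in alignment_jk:
        by_shared_entity.setdefault(entity_j, []).append((entity_k, confidence))

    composed = []
    for entity_i, entity_j, confidence_ij in alignment_ij:
        for entity_k, confidence_jk in by_shared_entity.get(entity_j, []):
            # Assumption: confidence of the derived mapping is the product.
            composed.append((entity_i, entity_k, confidence_ij * confidence_jk))
    return composed


# A: Korean (O_i) to English (O_j); A': English (O_j) to Swedish (O_k).
alignment_a = [("식물", "Plant", 0.9), ("동물", "Animal", 0.8)]
alignment_a_prime = [("Plant", "Växt", 0.8)]

for mapping in compose_alignments(alignment_a, alignment_a_prime):
    print(mapping[0], "->", mapping[1], round(mapping[2], 2))
```

Only entities whose O_j counterpart appears in both alignments yield a derived mapping, so the quality and coverage of the result are bounded by those of A and A'.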

[Figure: translation-based CLOM pipeline. Elements, in order: O1 (labels + structure, language L1); translators (Google, BabelNet, ...); candidate translations; Appropriate Translation Selection (ATS); ATS results; ontology interpretation; O1 and O2 (labels + structure, language L2); MOM; CLOM results.]

APPROPRIATE TRANSLATION SELECTION
[Figure: ATS flowchart with input, strategy and output columns. Elements: translation repository; string match against the O2 repository; YES branch: 1 to 1 (one matched target resource) or 1 to * (several matched target resources), label acquisition, appropriate translation = target label; NO branch: surrounding semantic generation, source semantic surrounding and target semantic surrounding, similarity comparison, appropriate translation = highest-ranked target label.]
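One possible reading of the appropriate translation selection step is sketched below. It is an interpretation under stated assumptions, not the lesson's algorithm: candidate translations are taken as given (no real translation service is called), a concept's "semantic surrounding" is assumed to be the set of labels of its neighbouring concepts, already translated for the source side, and similarity is assumed to be Jaccard overlap. Handling the "several matches" case by the same similarity ranking is also an assumption. All labels are invented.

```python
def jaccard(a, b):
    """Overlap of two label sets, compared case-insensitively."""
    a = {label.lower() for label in a}
    b = {label.lower() for label in b}
    return len(a & b) / len(a | b) if a | b else 0.0


def select_translation(candidates, translated_surrounding, target_ontology):
    """Choose the target label that best translates a source concept.

    candidates: candidate translations of the source label.
    translated_surrounding: translated labels of the source concept's neighbours.
    target_ontology: dict mapping each target label to its neighbours' labels.
    """
    by_lowercase = {label.lower(): label for label in target_ontology}
    matched = [by_lowercase[c.lower()] for c in candidates
               if c.lower() in by_lowercase]
    if len(matched) == 1:
        # String match with exactly one target resource: take its label.
        return matched[0]
    # Several matches, or none: rank target labels by how similar their
    # surrounding is to the surrounding of the source concept.
    pool = matched if matched else list(target_ontology)
    if not pool:
        return None
    return max(pool, key=lambda label: jaccard(translated_surrounding,
                                               target_ontology[label]))


target = {
    "Plant": {"Tree", "Flower", "Organism"},
    "Herb": {"Spice", "Medicine"},
    "Factory": {"Machine", "Worker"},
}

# Source concept "bitki" with hypothetical candidate translations.
print(select_translation(["plant", "herb"], {"tree", "flower"}, target))
```

Once a translation has been selected for every source label, the two ontologies are labeled in the same language and a monolingual matcher can be applied, as described under TRANSLATION-BASED CLOM.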