Aligning Sentences from Standard Wikipedia to Simple Wikipedia. Written by Hwang et al. Presented by Xia Cui for

Transcription:

Aligning Sentences from Standard Wikipedia to Simple Wikipedia Written by Hwang et al. Presented by Xia Cui for NLP@UoL

Overview
- Wikipedia
  - Simple article: shorter sentences, simpler words and grammar
  - Standard article
- Aim: sentence alignment. For every simple sentence, find the corresponding sentence (or sentence fragments) in Standard Wikipedia.
- Problem: the two versions are not strictly parallel and may present content in a very different order.
- Solution: sentence-level scoring combined with sequence-level search.

Sentence-Level Scoring
- Kauchak, 2013: cosine distance between vector representations of the tf.idf scores of the words in each sentence
  - tf.idf: term frequency, inverse document frequency; measures how important a word is to a document
- Wu and Palmer, 1994: word-level pairwise semantic similarity score
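The tf.idf scoring above can be illustrated with a minimal sketch. It is not the implementation from Kauchak (2013): treating each sentence as its own "document", the smoothed idf formula, and the function names `tfidf_vectors` and `cosine_similarity` are all assumptions made for illustration. Cosine distance is simply one minus the similarity computed here.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one sparse tf.idf vector (a dict) per tokenized sentence.

    Assumption: each sentence is treated as its own document.
    """
    n = len(sentences)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for sent in sentences for word in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        # Smoothed idf (an assumed variant) so weights never drop to zero.
        vectors.append(
            {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf}
        )
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy sentences, invented for this example.
sentences = [
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(sentences)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

The first two toy sentences share several words and so score higher than the unrelated pair, which shares none and scores zero.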

Sequence-Level Search
- Zhu et al., 2010: no constraint
  - alignments can be one-to-many
  - two sentences are aligned if their similarity score > threshold
- Coster and Kauchak, 2011; Barzilay and Elhadad, 2003: with a sequential constraint
  - dynamic programming, recursive optimization
  - relies on consistent ordering, which does not always hold for Wikipedia
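The contrast between the two search styles can be sketched as follows. Both functions are simplified illustrations, not the cited systems: `align_unconstrained` applies only a threshold, and `align_ordered` is an assumed, generic monotonic dynamic program (one-to-one, skips allowed) standing in for the sequentially constrained approaches. The score matrix is made up.

```python
def align_unconstrained(scores, threshold):
    """Keep every pair whose score exceeds the threshold (one-to-many allowed).

    scores[i][j] is the similarity of simple sentence i and standard sentence j.
    """
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s > threshold
    ]

def align_ordered(scores):
    """Monotonic one-to-one alignment that maximizes the total similarity."""
    n, m = len(scores), len(scores[0])
    # best[i][j]: best total score using the first i simple and j standard sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                              # skip simple sentence i-1
                best[i][j - 1],                              # skip standard sentence j-1
                best[i - 1][j - 1] + scores[i - 1][j - 1],   # align the two
            )
    # Trace back through the table to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    return pairs[::-1]

# Two sentence pairs that match strongly but in crossed order.
crossing = [[0.1, 0.9],
            [0.8, 0.2]]
unconstrained_pairs = align_unconstrained(crossing, 0.5)
ordered_pairs = align_ordered(crossing)
```

On this crossed toy input the unconstrained search recovers both strong pairs, while the ordered search can keep only one of them, which is the weakness the slide attributes to the sequential constraint.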

Simplification Datasets
- Good: the semantics of the simple and standard sentences match completely
- Good Partial: one sentence covers the other but contains additional information
- Partial: the sentences discuss unrelated concepts but share a short related phrase
- Bad: the sentences discuss unrelated concepts

Simplification Datasets (Cont.)
- Manually annotated
  - annotated by native speakers
  - 67,853 pairs (277 good, 281 good partial, 117 partial and 67,178 bad)
- Automatically aligned
  - threshold > 0.45; good: 0.67; good partial: 0.53
  - 150K good, 130K good partial, 110K unlabelled
  - 51.5M potential pairs (threshold < 0.45)
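One way to read the automatic labelling thresholds is sketched below. The slide gives only the cut-off values, so the boundary handling (`>=` versus `>`) and the reading of "unlabelled" as the band between 0.45 and 0.53 are assumptions, and `label_pair` is a hypothetical helper name.

```python
def label_pair(score, good=0.67, good_partial=0.53, keep=0.45):
    """Map a sentence-pair similarity score to a dataset label.

    Assumed reading of the slide: scores at or above each cut-off get that
    label, kept pairs below the good-partial cut-off stay unlabelled, and
    everything at or below the keep threshold is only a potential pair.
    """
    if score >= good:
        return "good"
    if score >= good_partial:
        return "good partial"
    if score > keep:
        return "unlabelled"
    return "potential"

labels = [label_pair(s) for s in (0.70, 0.60, 0.50, 0.30)]
```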

Sentence Alignment
- Sentence-level score, built on word-level similarity
  - WikNet similarity
  - Structural semantic similarity
- Greedy search

Word-Level Similarity: WikNet Similarity
- WikNet: a graph that leverages synonym information in Wiktionary plus word-definition co-occurrence
  - Node: a word
  - Edge: added between w1 and w2 if word w2 appears in the definition of any sense of word w1
- Preprocessing
  - morphological variations are mapped to their base form
  - atypical word senses are removed
  - stopwords are removed
- Extended Jaccard Coefficient
  - Jaccard coefficient (Salton and McGill, 1983): based on the number of shared neighbors of two words
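The graph construction and the plain (non-extended) Jaccard coefficient can be sketched on toy data. The definitions below are invented rather than taken from Wiktionary, edges are treated as undirected (an assumption), and `build_graph` and `jaccard` are hypothetical helper names. The extended coefficient used for WikNet is not reproduced here.

```python
from collections import defaultdict

def build_graph(definitions):
    """Add an edge between w1 and w2 when w2 appears in a definition of w1.

    definitions maps each word to the content words of its definitions,
    assumed to be already lemmatized and stripped of stopwords.
    """
    graph = defaultdict(set)
    for word, definition_words in definitions.items():
        for other in definition_words:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)  # assumption: edges are undirected
    return graph

def jaccard(graph, w1, w2):
    """Shared direct neighbors divided by all direct neighbors of either word."""
    n1 = graph.get(w1, set())
    n2 = graph.get(w2, set())
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0

# Invented toy definitions, for illustration only.
definitions = {
    "car": ["vehicle", "wheel", "engine"],
    "truck": ["vehicle", "wheel", "cargo"],
    "banana": ["fruit", "yellow"],
}
graph = build_graph(definitions)
car_truck = jaccard(graph, "car", "truck")    # 2 shared of 4 neighbors
car_banana = jaccard(graph, "car", "banana")  # no shared neighbors
```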

WikNet Similarity (Cont.)
- Extended Jaccard Coefficient
  - uses neighbors within n-step reach (Fogaras and Rácz, 2005)
  - additional term: whether the two words are direct neighbors or not
  - if the words or their neighbors have synonym sets in Wiktionary, the shared synonyms are used
- If two words are in each other's synonym lists, the similarity is set to 1
- Otherwise, the similarity is computed from the neighbor sets, where the set for wi is its l-step neighbor set
- https://ssli.ee.washington.edu/tial/projects/simplify.html

Structural Semantic Similarity
- Similarity between words plus the dependency structure between words in a sentence
- Stanford's dependency parser (de Marneffe et al., 2006)
- Create a triplet for each word
  - w: the given word
  - h: the head word
  - r: the relationship between w and h
- The similarity between w1 and w2 combines
  - the WikNet similarity
  - the dependency similarity between relations r1 and r2, which takes one value when the relations are in the same category and another otherwise

Greedy Sequence-Level Alignment
- Compute the similarity between all sentences Sj in the simple article and Ai in the standard article
- Select the most similar sentence pair and remove all other pairs involving either of its sentences
  - (S*, A*) = argmax s(Sj, Ai)
- Repeat until all sentences in the shorter document are aligned
- Match types
  - Good: a simple sentence matched with a standard sentence Ai
  - Good Partial: a simple sentence matched with fragments of the standard sentence Ai
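The greedy procedure above can be sketched over a precomputed similarity matrix. This is a minimal illustration under stated assumptions: the matrix values are made up, `greedy_align` is a hypothetical name, and the handling of sentence fragments for good partial matches is left out.

```python
def greedy_align(sim):
    """Greedy one-to-one alignment.

    sim[j][i] is the similarity between simple sentence j and standard
    sentence i. Visiting pairs from highest to lowest score and skipping
    any pair whose sentence is already used is equivalent to repeatedly
    taking the argmax and removing the conflicting pairs.
    """
    candidates = sorted(
        ((score, j, i) for j, row in enumerate(sim) for i, score in enumerate(row)),
        key=lambda item: -item[0],
    )
    used_simple, used_standard, alignment = set(), set(), []
    for score, j, i in candidates:
        if j in used_simple or i in used_standard:
            continue  # one of the two sentences is already aligned
        alignment.append((j, i, score))
        used_simple.add(j)
        used_standard.add(i)
    return alignment

# Two simple sentences against three standard sentences (made-up scores).
sim = [
    [0.20, 0.90, 0.40],
    [0.80, 0.85, 0.10],
]
alignment = greedy_align(sim)
```

Here simple sentence 1 scores highest against standard sentence 1, but that sentence is already taken by the stronger pair, so it falls back to standard sentence 0. The loop ends naturally once every sentence in the shorter document is aligned.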

Experiments
- Preprocessing
  - topic names, list markers and non-English text are removed
  - data was tokenized, lemmatized and parsed by Stanford CoreNLP (http://stanfordnlp.github.io/corenlp/)
- Evaluation
  - precision-recall; max F1; AUC
- Comparison (against Greedy Structural WikNet)
  - Unconstrained WordNet (Mohler and Mihalcea, 2009): an unconstrained search for aligning sentences, with WordNet semantic similarity
  - Unconstrained Vector Space (Zhu et al., 2010): vector space representation and an unconstrained search for aligning sentences
  - Ordered Vector Space (Coster and Kauchak, 2011): dynamic programming for sentence alignment, with vector space scoring
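The max F1 metric listed under Evaluation can be sketched as a sweep over score thresholds. The scores and gold labels below are invented, the `>=` convention at the threshold is an assumption, and the function names are hypothetical. AUC is not covered by this sketch.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall and F1 when pairs scoring >= threshold are predicted aligned."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(l and not p for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def max_f1(scores, labels):
    """Best F1 over all thresholds taken from the observed scores."""
    return max(
        (precision_recall_f1(scores, labels, t)[2], t) for t in sorted(set(scores))
    )

# Invented similarity scores and gold labels (True = correct alignment).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]
best_f1, best_threshold = max_f1(scores, labels)
```

On this toy data the best trade-off is at threshold 0.4, where all three true pairs are recovered at the cost of one false positive.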

Results

Results (Cont.)

Future Work
- Introducing other techniques using the introduced datasets
- Better text preprocessing
- Learning similarities
- Phrase alignment to obtain better partial matches