Aligning Sentences from Standard Wikipedia to Simple Wikipedia
Written by Hwang et al.
Presented by Xia Cui for NLP@UoL
Overview
- Wikipedia: Simple articles (shorter sentences, simpler words and grammar) vs. Standard articles
- Aim: sentence alignment. For every simple sentence, find the corresponding sentence (or sentence fragments) in standard Wikipedia.
- Problem: the two versions are not strictly parallel and have very different presentation ordering
- Solution: sentence-level scoring + sequence-level search
Sentence-Level Scoring
- Kauchak (2013): cosine similarity between vector representations of the tf.idf scores of the words in each sentence
  - tf.idf (term frequency-inverse document frequency): how important a word is to a document
- Wu and Palmer (1994): word-level pairwise semantic similarity score
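The tf.idf cosine score can be sketched in plain Python. This is a minimal sketch: the pre-tokenized sentence lists, the log(N/df) idf variant, and the sparse dict vectors are illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build a sparse tf.idf vector (dict) for each tokenized sentence.
    tf = within-sentence term count; idf = log(N / df), where df is the
    number of sentences containing the word."""
    n = len(sentences)
    df = Counter()
    for sent in sentences:
        df.update(set(sent))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(sent).items()}
            for sent in sentences]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that a word occurring in every sentence gets idf 0, so sentence pairs that share only very common words score near 0.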
Sequence-Level Search
- Zhu et al. (2010): without a constraint, alignment can be one-to-many; two sentences are aligned if their similarity score exceeds a threshold
- Coster and Kauchak (2011); Barzilay and Elhadad (2003): with a sequential constraint, the alignment is recursively optimized by dynamic programming
  - relies on consistent ordering, which does not always hold for Wikipedia
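The two search strategies can be contrasted in a few lines, assuming a precomputed similarity matrix; the threshold handling and the DP recurrence here are illustrative sketches of these cited approaches, not their exact implementations.

```python
def align_unconstrained(sim, threshold):
    """Zhu et al. (2010)-style search: align every pair whose similarity
    exceeds the threshold, so one sentence may align to many."""
    return [(i, j) for i, row in enumerate(sim)
                   for j, s in enumerate(row) if s > threshold]

def align_ordered(sim):
    """Barzilay and Elhadad (2003)-style sequential search: dynamic
    programming over a monotone alignment path, maximizing the total
    similarity. Assumes both documents order content consistently."""
    n, m = len(sim), len(sim[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],                          # skip row sentence
                             best[i][j - 1],                          # skip column sentence
                             best[i - 1][j - 1] + sim[i - 1][j - 1])  # align the pair
    return best[n][m]
```

The DP version is exactly where the ordering assumption bites: a crossing alignment that the unconstrained search would find is unreachable on a monotone path.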
Simplification Datasets
- Good: the semantics of the simple and standard sentences match completely
- Good Partial: one sentence covers the other but contains additional information
- Partial: the sentences discuss unrelated concepts but share a short related phrase
- Bad: the sentences discuss unrelated concepts
Simplification Datasets (Cont.)
- Manually annotated: by a native speaker; 67,853 pairs (277 good, 281 good partial, 117 partial and 67,178 bad)
- Automatically aligned: overall threshold > 0.45 (good: 0.67; good partial: 0.53)
  - 150K good, 130K good partial, 110K unlabelled pairs
  - 51.5M potential pairs (threshold < 0.45)
Sentence Alignment
- Sentence-level score builds on word-level similarity:
  - WikNet similarity
  - Structural semantic similarity
- Greedy search
Word-Level Similarity: WikNet Similarity
- WikNet: a graph that leverages synonym information in Wiktionary plus word-definition co-occurrence
  - each word is a node; an edge connects w1 to w2 if w2 appears in any sense of the definitions of w1
- Preprocessing:
  - morphological variations are mapped to their base form
  - atypical word senses are removed
  - stopwords are removed
- Jaccard Coefficient (Salton and McGill, 1983): the number of shared neighbors for two words, normalized by their combined neighborhood
WikNet Similarity (Cont.)
- Extended Jaccard Coefficient: neighbors within an n-step reach (Fogaras and Rácz, 2005), plus an additional term for whether a word is a direct neighbor
- If words or their neighbors have synonym sets in Wiktionary, the shared synonyms are used
- If two words appear in each other's synonym lists, the similarity is set to 1
- Otherwise, the extended Jaccard coefficient over the l-step neighbor sets of the two words is used
- https://ssli.ee.washington.edu/tial/projects/simplify.html
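A minimal sketch of the plain and extended Jaccard coefficients over WikNet-style neighbor sets, assuming the graph is given as a word-to-neighbor-set dict; the paper's Wiktionary synonym handling and the extra direct-neighbor term are omitted here.

```python
def jaccard(neighbors, w1, w2):
    """Jaccard coefficient (Salton and McGill, 1983) over direct neighbor
    sets: |N(w1) & N(w2)| / |N(w1) | N(w2)|."""
    n1, n2 = neighbors.get(w1, set()), neighbors.get(w2, set())
    if not n1 and not n2:
        return 0.0
    return len(n1 & n2) / len(n1 | n2)

def extended_jaccard(neighbors, w1, w2, steps=2):
    """Extended variant: expand each word to everything reachable within
    `steps` hops (cf. Fogaras and Racz, 2005) before taking the overlap."""
    def reach(w):
        frontier, seen = {w}, set()
        for _ in range(steps):
            frontier = {v for u in frontier
                          for v in neighbors.get(u, set())} - seen
            seen |= frontier
        return seen
    r1, r2 = reach(w1), reach(w2)
    if not r1 and not r2:
        return 0.0
    return len(r1 & r2) / len(r1 | r2)
```

The multi-step reach lets two words score above 0 even when their definitions share no word directly, only a common definition-of-a-definition.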
Structural Semantic Similarity
- Extends word similarity with the dependency structure between words in a sentence
- Stanford's dependency parser (de Marneffe et al., 2006) creates a triplet (w, h, r) for each word: w is the given word, h its head word, and r the relationship between w and h
- Similarity between two triplets combines the WikNet similarity of the words with a dependency similarity between the relations r1 and r2: a full score if the relations fall in the same category, a reduced score otherwise
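A sketch of scoring two dependency triplets; the relation-category table, the 0.5 same-category weight, and the product combination are illustrative assumptions rather than the paper's exact formula.

```python
# Hypothetical grouping of Stanford dependency relations into categories.
REL_CATEGORY = {"nsubj": "subject", "nsubjpass": "subject",
                "dobj": "object", "iobj": "object"}

def relation_sim(r1, r2):
    """1 for identical relations, 0.5 for relations in the same category,
    0 otherwise (the 0.5 weight is an illustrative assumption)."""
    if r1 == r2:
        return 1.0
    c1, c2 = REL_CATEGORY.get(r1), REL_CATEGORY.get(r2)
    return 0.5 if c1 is not None and c1 == c2 else 0.0

def triplet_sim(t1, t2, word_sim):
    """Score two (word, head, relation) triplets by combining a word-level
    similarity (e.g. WikNet similarity) of the words and of the heads with
    the relation compatibility."""
    (w1, h1, r1), (w2, h2, r2) = t1, t2
    return word_sim(w1, w2) * word_sim(h1, h2) * relation_sim(r1, r2)
```

Passing the similarity function in keeps the structural layer independent of how word-level similarity is computed.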
Greedy Sequence-Level Alignment
- Compute the similarity between all sentences Sj in the simple document and Ai in the standard document
- Select the most similar pair S*, A* = argmax sim(Sj, Ai), and remove all other pairs involving either sentence
- Repeat until all sentences in the shorter document are aligned
- Good and Good Partial matches may align Sj to fragments of a standard sentence Ai
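The greedy procedure above can be sketched as follows; fragment-level alignment for partial matches is omitted, and the similarity matrix is assumed precomputed.

```python
def greedy_align(sim):
    """Greedy sequence-level alignment: repeatedly take the highest-scoring
    remaining (simple, standard) pair and retire both sentences, until the
    shorter document is exhausted. sim[j][i] = score(S_j, A_i)."""
    free_s = set(range(len(sim)))       # unaligned simple sentences
    free_a = set(range(len(sim[0])))    # unaligned standard sentences
    pairs = []
    while free_s and free_a:
        j, i = max(((j, i) for j in free_s for i in free_a),
                   key=lambda p: sim[p[0]][p[1]])
        pairs.append((j, i))
        free_s.discard(j)
        free_a.discard(i)
    return sorted(pairs)
```

Unlike the dynamic-programming search, this makes no ordering assumption, so crossing alignments are allowed.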
Experiments
- Preprocessing: topic names, list markers and non-English text are removed; data was tokenized, lemmatized and parsed with Stanford CoreNLP (http://stanfordnlp.github.io/corenlp/)
- Evaluation: precision-recall curves; maximum F1; AUC
- Comparison (against Greedy Structural WikNet):
  - Unconstrained WordNet (Mohler and Mihalcea, 2009): an unconstrained search for aligning sentences, with WordNet semantic similarity
  - Unconstrained Vector Space (Zhu et al., 2010): a vector-space representation with an unconstrained search for aligning sentences
  - Ordered Vector Space (Coster and Kauchak, 2011): dynamic programming for sentence alignment, with vector-space scoring
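Given per-pair scores and gold labels, the maximum-F1 metric is a sweep over decision thresholds; this is a sketch of that metric only (AUC and tie handling are not addressed), and the input format is an assumption.

```python
def max_f1(scores, labels):
    """Best F1 over all decision thresholds. scores: similarity per
    candidate pair; labels: 1 if the pair is truly aligned, else 0."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp:
            p, r = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * p * r / (p + r))
    return best
```

Reporting the maximum over thresholds makes systems comparable without committing to one operating point, which suits precision-recall curves like those in the results.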
Results
Results (Cont.)
Future Work
- Introducing other techniques using the introduced datasets
- Better text preprocessing
- Learning similarities
- Phrase alignment to obtain better partial matches