Readability for Sentences: Motivation, Methods and Evaluation. Sowmya Vajjala


Readability for Sentences: Motivation, Methods and Evaluation (with Detmar Meurers)
Center for Language Technology, University of Gothenburg, Sweden
20 November 2014

What is readability analysis?
We want to measure how difficult it is to read a text, based on properties of the text, using criteria which are:
- data-induced: using corpora with graded texts
- theory-driven: constructs known to reflect complexity
given a purpose, e.g.:
- humans: to support reading and comprehension, to read texts at a specific level of language proficiency, to carry out specific tasks (e.g., answer questions), etc.
- machines: evaluation of generation systems
sometimes personalized to a user, through information obtained directly (e.g., a questionnaire) or indirectly (e.g., inferred from the nature of a search query).

Why do we need readability for sentences? Some application scenarios:
- selecting appropriate sentences for language learners in CALL (Segler 2007; Pilán et al. 2013, 2014)
- understanding the difficulty of survey questions (Lenzner 2013)
- predicting sentence fluency in Machine Translation (Chae & Nenkova 2009)
- text simplification (Vajjala & Meurers 2014a; Dell'Orletta et al. 2014)

Why do WE need it?
- identifying target sentences for text simplification
- evaluating text simplification approaches

Overview
- Corpus: publicly accessible, sentence-level corpora (texts not prepared by us)
- Features: from Vajjala & Meurers (2014b); features that work well at the text level
- Methods: 1. binary classification (easy vs. difficult); 2. applying a document-level regression model to sentences; 3. pair-wise ranking
- Evaluation: within- and cross-corpus evaluations with multiple real-life datasets

Corpus 1: Wikipedia-SimpleWikipedia
Zhu et al. (2010) created a publicly available, sentence-aligned corpus from Wikipedia and Simple Wikipedia: 80,000 pairs of sentences in simplified and unsimplified versions.
Example pair:
1. Wiki: Chinese styles vary greatly from era to era and are traditionally named after the ruling dynasty.
2. Simple Wiki: There are many Chinese artistic styles, which are usually named after the ruling dynasty.

Corpus 2: OneStopEnglish.com
OneStopEnglish (OSE) is an English teachers' resource website published by the Macmillan Education Group. They publish Weekly News Lessons, which consist of news articles sourced from The Guardian. The articles are rewritten by teaching experts for English language learners at three reading levels (elementary, intermediate, advanced). We obtained permission to collect articles and compiled a corpus of 76 article triplets (228 articles in total).

Corpus 2: OneStopEnglish.com, sentence-aligned corpus creation
Creation process:
1. parse the PDF files and extract the text content
2. split all texts into sentences
3. compare sentences between versions and match them by their cosine similarity (Nelken & Shieber 2006)
Two versions of the corpus:
1. OSE2Corpus: 3000 sentence pairs
2. OSE3Corpus: 850 sentence triplets
(Contact me if you want to use this corpus.)
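As a rough sketch of step 3 (an illustration of the cosine-matching idea only, not the exact procedure of Nelken & Shieber 2006, which combines TF-IDF scores with a probabilistic alignment model), one could pair each simplified sentence with its most similar original sentence; the greedy matching and the 0.5 threshold here are arbitrary illustrative choices:

    # Sketch: align simplified sentences to originals by TF-IDF cosine similarity.
    # The greedy best-match strategy and the 0.5 threshold are illustrative choices.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def align(original_sents, simplified_sents, threshold=0.5):
        vectorizer = TfidfVectorizer()
        # Fit one shared vocabulary so vectors from both versions are comparable.
        tfidf = vectorizer.fit_transform(original_sents + simplified_sents)
        orig_vecs = tfidf[:len(original_sents)]
        simp_vecs = tfidf[len(original_sents):]
        sims = cosine_similarity(simp_vecs, orig_vecs)
        pairs = []
        for i, row in enumerate(sims):
            j = row.argmax()  # index of the best-matching original sentence
            if row[j] >= threshold:
                pairs.append((original_sents[j], simplified_sents[i]))
        return pairs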

OSE Corpus: Example
- adv: In Beijing, mourners and admirers made their way to lay flowers and light candles at the Apple Store.
- inter: In Beijing, mourners and admirers came to lay flowers and light candles at the Apple Store.
- ele: In Beijing, people went to the Apple Store with flowers and candles.

Features 1: from Vajjala & Meurers (2014b)
Lexical:
- lexical richness features from Second Language Acquisition (SLA) research, e.g., type-token ratio, noun variation, ...
- POS density features, e.g., # nouns/# words, # adverbs/# words, ...
- traditional features and formulae, e.g., sentence length in words, ...
Syntactic:
- syntactic complexity features from SLA research, e.g., # dependent clauses/clause, average clause length, ...
- other parse tree features, e.g., # NPs per sentence, avg. parse tree height, ...

Features 2:
- Morphological properties of words, e.g., does the word contain a stem along with an affix? (abundant = abound + -ant)
- Age of Acquisition (AoA): average age-of-acquisition of the words in a text
- Other psycholinguistic features, e.g., word abstractness
- Avg. number of senses per word (obtained from WordNet)
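To make the feature types concrete, here is a minimal sketch (my own illustration; it does not reproduce the original system's feature extraction) of two of the lexical features using NLTK:

    # Sketch: type-token ratio and POS density with NLTK.
    # Assumes the 'punkt' tokenizer and POS-tagger models have been downloaded.
    import nltk

    def lexical_features(sentence):
        tokens = nltk.word_tokenize(sentence)
        tags = nltk.pos_tag(tokens)
        ttr = len(set(t.lower() for t in tokens)) / len(tokens)  # type-token ratio
        nouns = sum(1 for _, tag in tags if tag.startswith('NN'))
        return {'ttr': ttr,
                'noun_density': nouns / len(tokens),  # # nouns / # words
                'sent_length': len(tokens)}           # traditional feature

    print(lexical_features("Chinese styles vary greatly from era to era."))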

Sentence Readability: Binary Classification (Vajjala & Meurers 2014a)
We started by training a sentence-level readability model on the Wikipedia corpus:
- binary classification: simple vs. hard
- 65-68% accuracy, depending on training set size
- increasing the training sample size from 10K to 80K samples did not improve the accuracy much!
- as regression: r = 0.4
Why is it so bad?
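For reference, the binary setup can be sketched as an ordinary supervised classifier over such feature vectors (a generic scikit-learn stand-in for illustration, not the original experimental pipeline):

    # Sketch: "simple vs. hard" as binary classification over feature vectors.
    # X: per-sentence feature vectors; y: 0 = Simple Wikipedia, 1 = Wikipedia.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def binary_readability_accuracy(X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return accuracy_score(y_te, clf.predict(X_te))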

What is the problem?
What happens if we just apply a document-level readability model on this corpus?
Model (Vajjala & Meurers 2014b): outputs a readability score on a scale of 1-5, 5 being difficult.
[Plot: "Distribution of reading levels of Normal and Simplified Wiki"; percentage of total sentences vs. reading level (1 to 5), with one curve each for Wiki and Simple Wiki.]

What can we infer?
- There are all sorts of sentences in both versions.
- Wikipedia has more sentences at higher reading levels than Simple Wikipedia.
Is this the reason binary classification failed? One idea: a simple sentence is only simpler than its unsimplified version; it can still be hard. Simplification could be relative, not absolute.

Is Simplification Relative?
How can we study this? One approach:
- compute reading levels of normal (N) and simplified (S) sentences using our document-level readability model
- evaluate simplification classification using the percentages of S<N, S=N and S>N
- the higher the percentage for S<N, the better the model is at evaluating sentence-level readability. Why? Simplified versions are expected to be at a lower reading level than normal versions!
How big must |S - N| be to interpret it as a categorical difference in reading level? We call this threshold the d-value.

What exactly is the d-value?
It is a measure of how fine-grained the model is in identifying reading-level differences between sentences. For example, let us say d = 0.3:
- If N = 3.4 and S = 3.2, then |S - N| = 0.2 < d, so we count S=N.
- If N = 3.5 and S = 3.1, then |S - N| = 0.4 > d, so we count S<N.
What is good for us? The model should identify as many pairs as possible as S<N. S=N is probably okay, but S>N is bad.
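The comparison rule fits in a few lines; in this sketch, predict_level is a hypothetical stand-in for the document-level model's 1-5 readability score:

    # Sketch of the d-value comparison; predict_level() is a hypothetical
    # stand-in for the document-level readability model (1-5 scale).
    from collections import Counter

    def compare(n_sent, s_sent, predict_level, d=0.3):
        n, s = predict_level(n_sent), predict_level(s_sent)
        if abs(s - n) <= d:
            return 'S=N'                  # too small to count as categorical
        return 'S<N' if s < n else 'S>N'

    def outcome_percentages(pairs, predict_level, d=0.3):
        # pairs: iterable of (normal_sentence, simplified_sentence)
        counts = Counter(compare(n, s, predict_level, d) for n, s in pairs)
        total = sum(counts.values())
        return {k: 100 * v / total for k, v in counts.items()}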

Influence of d
Question 1: Does changing the d-value affect our results?
[Plot: "Comparison of Normal and Simplified"; percentage of total samples vs. d-value (0 to 1), with curves for S<N, S=N and S>N.]
Desired scenario: percentage of (S<N) > (S=N) > (S>N).

Influence of N
Question 2: How does the reading level of the unsimplified sentence (N) affect the results?
[Two plots of percentage of total samples vs. d-value (0 to 1), with curves for S<N, S=N and S>N: "Comparison at Higher Values of N" for harder sentences (N >= 2.5), and "Comparison at Lower Values of N" for easier sentences (N < 2.5).]

What do we learn from these graphs?
The accuracy of the relative comparison of sentence reading levels depends on:
1. the minimum |S - N| required to identify a categorical difference (d)
2. the reading level of the original, unsimplified sentence (N)
It is difficult to identify simplifications of an already simple sentence, but the approach works well for complex sentences. (More details: Vajjala & Meurers 2014a.)

What Next?
How about modeling this as pair-wise ranking instead? Why?
1. We do not have exact reading-level annotations at the sentence level.
2. But we know that the simplified version should have a lower reading level.
Ranking cares only about relative differences, not absolute ones; perhaps an ideal learning method for this problem?

Pairwise Ranking: A Primer
- typically used in information retrieval, to rank search results by pair-wise comparison between them
- learning goal: minimize the number of ordering errors, i.e., mis-ranked pairs
- usual purpose: look at a pair of documents and rank them by their relevance to the query
- our purpose: learn a binary classifier that can tell which sentence is simpler, given a pair of sentences
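The standard reduction from pairwise ranking to binary classification (in the spirit of SVMrank; this is my illustration of the idea, not SVMrank's internals) turns each aligned pair into training examples over feature differences, assuming each sentence is represented as a NumPy feature vector:

    # Sketch: pairwise-ranking reduction. Each (simpler, harder) pair yields
    # two examples over feature differences; a linear classifier on these
    # differences induces a ranking function (higher score = harder).
    import numpy as np
    from sklearn.svm import LinearSVC

    def pairs_to_examples(simple_feats, hard_feats):
        X, y = [], []
        for fs, fh in zip(simple_feats, hard_feats):
            X.append(fh - fs); y.append(1)   # "first sentence is harder"
            X.append(fs - fh); y.append(0)   # "first sentence is simpler"
        return np.array(X), np.array(y)

    def train_ranker(simple_feats, hard_feats):
        X, y = pairs_to_examples(simple_feats, hard_feats)
        return LinearSVC().fit(X, y)         # coef_ scores new sentences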

Pairwise Ranking: Evaluation
- error: percentage of reversed pairs; accuracy: percentage of correctly ranked pairs
- if there are two sentences (N: unsimplified, S: simplified) and rank(S) > rank(N), it is counted as an error
- if there are three sentences (A: advanced, I: intermediate, B: beginner) and our system gives the readability ranking I, B, A, there are two ranking errors: 1. I is ranked higher than A; 2. B is ranked higher than A
Note: there are other measures of ranking quality, tailored to information retrieval applications.
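Counting reversed pairs is straightforward. A minimal sketch, where score is the ranker's difficulty score (higher = harder) and each gold group lists sentences from hardest to easiest:

    # Sketch: ranking accuracy = fraction of gold-ordered pairs scored in the
    # right direction; each reversed (or tied) pair counts as one error.
    from itertools import combinations

    def ranking_accuracy(gold_ordered_groups, score):
        correct = errors = 0
        for group in gold_ordered_groups:           # e.g., [adv, inter, ele]
            for harder, easier in combinations(group, 2):
                if score(harder) > score(easier):
                    correct += 1
                else:
                    errors += 1
        return correct / (correct + errors)

For the triplet example above, the ranking I, B, A yields one correct pair (I before B) and two errors, matching the count on the slide.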

Ranking with Train-Test Data: Setup
Training sets:
1. Wiki-Train: 2000 pairs of Wikipedia-SimpleWikipedia parallel sentences
2. OSE2-Train: 2000 pairs of sentences from the OSE corpus (advanced-beginner, advanced-intermediate, and intermediate-beginner combinations)
3. OSE3-Train: 750 sentence triplets from the OSE corpus (each triplet has a single sentence in 3 versions)
4. WikiOSE: mixed training set consisting of Wiki-Train and OSE2-Train (size: 4000 pairs)
Test sets:
1. Wiki-Test: 78,000 pairs
2. OSE2-Test: 1000 pairs
3. OSE3-Test: 100 triplets

Results
Algorithm: SVMrank

Training with 2-level datasets:

    Test set   | Wiki-Train | OSE2-Train
    Wiki-Test  | 81.8%      | 77.5%
    OSE2-Test  | 74.6%      | 81.5%
    OSE3-Test  | 74.7%      | 79.3%

Training with a 3-level corpus and a mixed corpus:

    Test set   | OSE3-Train | WikiOSE
    Wiki-Test  | 78.6%      | 81.3%
    OSE2-Test  | 82.4%      | 80.7%
    OSE3-Test  | 79.6%      | 84.0%

How does this compare with the previous approach? Ranking vs. Regression
1. With the regression model, depending on the d-value, we: predicted correctly in 60% of the cases; identified no difference between the sentence versions in 10% of the cases; identified the order wrongly in 30% of the cases.
2. With the ranking approach, we can predict the order correctly with ~80% accuracy, and it works across multiple datasets and levels of simplification.
Clearly, ranking works better than regression.

What features work with ranking?
Training set: OSE 2-level; test set: OSE2-Test. Performance of feature groups:
1. Psycholinguistic: 69.1%
2. Syntactic complexity: 67.1%
3. Celex (morphological): 72.2%
4. Lexical richness: 67%
Good single features reach around 55-60% accuracy.

Summary so far
- Readability-based ranking works well in distinguishing simplified and unsimplified versions of a sentence.
- Features that worked at the document level work well on sentences too, with good accuracy.
- The approach can also make distinctions between multiple levels without losing performance.
- We get good results with cross-corpus and mixed-corpus evaluations too.
- It is fairly generalizable to other texts that are informational in nature (e.g., news, encyclopaedia articles, etc.).

Current and Future Work
- What feature selection approaches work for ranking?
- Which linguistic properties change the most between Advanced-to-Intermediate vs. Intermediate-to-Beginner simplification?
- How do we eliminate correlated features?
- Using ranking with actual automatic text simplification approaches, to evaluate them in a data-driven manner.
...

End of Story!
Thank you for your patience :-) Questions?
email: sowmya@sfs.uni-tuebingen.de

References
Chae, J. & A. Nenkova (2009). Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL).
Dell'Orletta, F., M. Wieling, A. Cimino, G. Venturi & S. Montemagni (2014). Assessing the Readability of Sentences: Which Corpora and Features? In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications (BEA9). Baltimore, Maryland, USA: ACL, pp. 163-173.
Lenzner, T. (2013). Are Readability Formulas Valid Tools for Assessing Survey Question Difficulty? Sociological Methods & Research.
Nelken, R. & S. M. Shieber (2006). Towards robust context-sensitive sentence alignment for monolingual corpora. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 161-168.
Pilán, I., E. Volodina & R. Johansson (2013). Automatic Selection of Suitable Sentences for Language Learning Exercises. In L. Bradley & S. Thouësny (eds.), 20 Years of EUROCALL: Learning from the Past, Looking to the Future. Proceedings of the 2013 EUROCALL Conference, pp. 218-225.
Pilán, I., E. Volodina & R. Johansson (2014). Rule-based and machine learning approaches for second language sentence-level readability. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications (BEA9). Baltimore, Maryland, USA: ACL, pp. 174-184.

Segler, T. M. (2007). Investigating the Selection of Example Sentences for Unknown Target Words in ICALL Reading Texts for L2 German. Ph.D. thesis, Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh. URL http://homepages.inf.ed.ac.uk/s9808690/thesis.pdf.
Vajjala, S. & D. Meurers (2014a). Assessing the relative reading level of sentence pairs for text simplification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden: ACL, pp. 288-297.
Vajjala, S. & D. Meurers (2014b). Readability Assessment for Text Simplification: From Analyzing Documents to Identifying Sentential Simplifications. International Journal of Applied Linguistics, Special Issue on Current Research in Readability and Text Simplification, eds. Thomas François & Delphine Bernhard.
Zhu, Z., D. Bernhard & I. Gurevych (2010). A Monolingual Tree-based Translation Model for Sentence Simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), August 2010. Beijing, China, pp. 1353-1361.