Sentence-level Readability: Motivation, Methods and Evaluation
Sowmya Vajjala (with Detmar Meurers)
Center for Language Technology, University of Gothenburg, Sweden
20 November 2014
What is readability analysis?
We want to measure how difficult it is to read a text, based on properties of the text, using criteria which are
- data-induced: using corpora with graded texts
- theory-driven: constructs known to reflect complexity
given a purpose, e.g.,
- humans: to support reading and comprehension, to read texts at a specific level of language proficiency, to carry out specific tasks (e.g., answer questions), etc.
- machines: evaluation of generation systems
sometimes personalized to a user, through information obtained directly (e.g., a questionnaire) or indirectly (e.g., inferred from the nature of a search query)
Why do we need readability for sentences?
Some application scenarios:
- selecting appropriate sentences for language learners in CALL (Segler 2007; Pilán et al. 2013, 2014)
- understanding the difficulty of survey questions (Lenzner 2013)
- predicting sentence fluency in Machine Translation (Chae & Nenkova 2009)
- text simplification (Vajjala & Meurers 2014a; Dell'Orletta et al. 2014)
Why do WE need it?
- identifying target sentences for text simplification
- evaluating text simplification approaches
Our Approach: Overview
Corpus: publicly accessible, sentence-level corpora (texts not prepared by us)
Features: from Vajjala & Meurers (2014b), which work well at the text level
Approaches:
1. binary classification (easy vs. difficult)
2. applying a document-level regression model to sentences
3. pair-wise ranking
Evaluation: within- and cross-corpus evaluations with multiple real-life datasets
Corpus 1: Wikipedia-SimpleWikipedia
Zhu et al. (2010) created a publicly available, sentence-aligned corpus from Wikipedia and Simple Wikipedia: 80,000 pairs of sentences in simplified and unsimplified versions.
Example pair:
1. Wiki: Chinese styles vary greatly from era to era and are traditionally named after the ruling dynasty.
2. Simple Wiki: There are many Chinese artistic styles, which are usually named after the ruling dynasty.
Corpus 2: OneStopEnglish.com
OneStopEnglish (OSE) is an English teachers' resource website published by the Macmillan Education Group.
- They publish Weekly News Lessons, which consist of news articles sourced from The Guardian.
- The articles are rewritten by teaching experts for English language learners at three reading levels (elementary, intermediate, advanced).
- We obtained permission to collect articles and compiled a corpus of 76 article triplets (228 articles in total).
Corpus 2: OneStopEnglish.com (sentence-aligned corpus creation)
Creation process:
1. Parse the PDF files and extract the text content.
2. Split all texts into sentences.
3. Compare sentences between versions and match them by their cosine similarity (Nelken & Shieber 2006).
Two versions of the corpus:
1. OSE2Corpus: 3,000 sentence pairs
2. OSE3Corpus: 850 sentence triplets
* Contact me if you would like to use this corpus.
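The cosine-matching step (step 3) can be sketched roughly as follows. This is a minimal bag-of-words illustration, not the actual alignment pipeline; the function names and the 0.5 threshold are my own assumptions:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def align(adv_sents, ele_sents, threshold=0.5):
    """Greedily match each advanced-level sentence to its most similar
    elementary-level counterpart, keeping pairs above the threshold."""
    pairs = []
    for a in adv_sents:
        best = max(ele_sents, key=lambda e: cosine_similarity(a, e), default=None)
        if best is not None and cosine_similarity(a, best) >= threshold:
            pairs.append((a, best))
    return pairs
```

A real implementation (following Nelken & Shieber 2006) would use TF-IDF weighting rather than raw counts, but the matching logic is the same.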
OSE Corpus: Example
adv: In Beijing, mourners and admirers made their way to lay flowers and light candles at the Apple Store.
inter: In Beijing, mourners and admirers came to lay flowers and light candles at the Apple Store.
ele: In Beijing, people went to the Apple Store with flowers and candles.
Features 1 (from Vajjala & Meurers 2014b)
Lexical:
- lexical richness features from Second Language Acquisition (SLA) research, e.g., type-token ratio, noun variation, ...
- POS density features, e.g., # nouns/# words, # adverbs/# words, ...
- traditional features and formulae, e.g., sentence length in words, ...
Syntactic:
- syntactic complexity features from SLA research, e.g., # dependent clauses per clause, average clause length, ...
- other parse tree features, e.g., # NPs per sentence, average parse tree height, ...
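Two of the lexical features named above are simple enough to compute directly; a minimal sketch (function names are hypothetical, and a real system would use a POS tagger rather than hand-written tags):

```python
def type_token_ratio(tokens):
    """Lexical richness: number of distinct word types divided by
    the total number of tokens."""
    return len(set(tokens)) / len(tokens)

def pos_density(tags, target="NOUN"):
    """POS density: share of tokens carrying the given POS tag,
    e.g. # nouns / # words."""
    return sum(1 for t in tags if t == target) / len(tags)
```

For example, "the cat chased the dog" has four types over five tokens (TTR = 0.8) and two nouns over five tokens (noun density = 0.4).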
Features 2
- Morphological properties of words, e.g., does the word contain a stem along with an affix? (abundant = abound + ant)
- Age of Acquisition (AoA): average age of acquisition of the words in a text
- Other psycholinguistic features, e.g., word abstractness, average number of senses per word (obtained from WordNet)
Sentence Readability: Binary Classification (Vajjala & Meurers 2014a)
We started by training a sentence-level readability model on the Wikipedia corpus:
- binary classification (simple vs. hard): 65-68% accuracy, depending on training set size
- increasing the training sample size from 10K to 80K samples did not improve the accuracy much!
- as regression: r = 0.4
Why is it so bad?
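As a rough illustration of this binary setup (not the authors' actual pipeline): train a classifier on labelled feature vectors, then correlate its scores with the gold labels to get the regression-style r. The toy data and model choice below are my assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy features: "hard" sentences have a shifted feature mean.
X_simple = rng.normal(0.0, 1.0, size=(200, 2))
X_hard = rng.normal(1.5, 1.0, size=(200, 2))
X = np.vstack([X_simple, X_hard])
y = np.array([0] * 200 + [1] * 200)  # 0 = simple, 1 = hard

clf = LogisticRegression().fit(X, y)
accuracy = clf.score(X, y)

# Treating the binary labels as scores, Pearson's r between the
# predicted probability of "hard" and the gold label:
probs = clf.predict_proba(X)[:, 1]
r = np.corrcoef(probs, y)[0, 1]
```

On this cleanly separated toy data both numbers come out high; the point of the slide is that on real Wikipedia/Simple Wikipedia sentence pairs they did not.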
What is the problem?
What happens if we just apply a document-level readability model to this corpus?
Model (Vajjala & Meurers 2014b): outputs a readability score on a scale of 1-5, 5 being most difficult.
[Figure: Distribution of reading levels of Normal and Simplified Wiki. X-axis: reading level (1-5); y-axis: percentage of the total sentences at that level, plotted for Wiki vs. Simple Wiki.]
What can we infer?
- There are all sorts of sentences in both versions.
- Wikipedia has more sentences at higher reading levels than Simple Wikipedia.
- Is this the reason binary classification failed?
One idea: a simplified sentence is only simpler than its unsimplified version; it can still be hard. Simplification could be relative, not absolute.
Is Simplification Relative? How can we study this?
One approach:
- compute the reading levels of normal (N) and simplified (S) sentences using our document-level readability model
- evaluate simplification classification using the percentages of S<N, S=N and S>N
- the higher the percentage for S<N, the better the model is at evaluating sentence-level readability
Why? Simplified versions are expected to be at a lower reading level than the normal versions!
How big must N - S be to interpret it as a categorical difference in reading level? We call this threshold the d-value.
What exactly is the d-value?
It is a measure of how fine-grained the model is in identifying reading-level differences between sentences.
For example, let us say d = 0.3:
- when N = 3.4 and S = 3.2, N - S = 0.2 < d, so S=N
- when N = 3.5 and S = 3.1, N - S = 0.4 > d, so S<N
What is good for us? The model should identify as many pairs as possible as S<N. S=N is probably okay, but S>N is bad.
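The d-value decision can be written down directly; a minimal sketch (the function name and the default d are illustrative):

```python
def compare(n_level: float, s_level: float, d: float = 0.3) -> str:
    """Categorise a sentence pair by the model's predicted reading
    levels: the simplified version counts as categorically easier
    only when the gap exceeds the granularity threshold d."""
    diff = n_level - s_level
    if diff > d:
        return "S<N"   # simplified version is categorically easier
    if diff < -d:
        return "S>N"   # simplified version scored harder (bad)
    return "S=N"       # difference too small to call
```

With d = 0.3 this reproduces the two examples above: (3.4, 3.2) falls within the threshold (S=N), while (3.5, 3.1) exceeds it (S<N).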
Influence of d
Question 1: Does changing the d-value affect our results?
[Figure: Comparison of Normal and Simplified. Percentage of S<N, S=N and S>N pairs as a function of the d-value (0 to 1).]
Desired scenario: percentage of (S<N) > (S=N) > (S>N)
Influence of N
Question 2: How does the reading level of the unsimplified sentence (N) affect the results?
[Figure, left: Comparison at higher values of N (harder sentences, N >= 2.5). Right: Comparison at lower values of N (easier sentences, N < 2.5). Both plot the percentage of S<N, S=N and S>N pairs against the d-value (0 to 1).]
What do we learn from these graphs?
The accuracy of the relative comparison of reading levels of sentences depends on:
1. the minimum N - S required to identify a categorical difference (d)
2. the reading level of the original, unsimplified sentence (N)
It is difficult to identify simplifications of an already simple sentence, but this approach works well for complex sentences.
* More details: Vajjala & Meurers (2014a).
What Next?
How about modeling this as pair-wise ranking instead? Why?
1. We do not have exact reading-level annotations at the sentence level.
2. But we know that the simplified version should have a lower reading level.
Ranking cares only about relative differences, not absolute differences; perhaps an ideal learning method for this problem?
Pairwise Ranking: A Primer
- typically used in information retrieval, to rank search results by doing pair-wise comparisons between them
- learning goal: minimize the number of ordering errors, i.e., mis-ranked pairs
- usual purpose: look at a pair of documents and rank them based on their relevance to the query
- our purpose: learn a binary classifier that can tell which sentence is simpler, given a pair of sentences
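The standard pairwise trick (as in SVMrank) reduces ranking to binary classification on feature-vector differences. A toy sketch, substituting scikit-learn's LinearSVC for SVMrank; the helper name and toy data are my own assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def make_pairwise_data(pairs):
    """Turn (harder_features, simpler_features) pairs into a binary
    classification problem on feature differences: label +1 when the
    difference vector points from simpler to harder, -1 otherwise."""
    X, y = [], []
    for n_vec, s_vec in pairs:
        X.append(np.asarray(n_vec) - np.asarray(s_vec)); y.append(1)
        X.append(np.asarray(s_vec) - np.asarray(n_vec)); y.append(-1)
    return np.array(X), np.array(y)

# Hypothetical toy pairs: the first feature tracks difficulty.
pairs = [([3.0, 1.0], [1.0, 1.0]),
         ([4.0, 2.0], [2.0, 2.0])]
X, y = make_pairwise_data(pairs)
clf = LinearSVC().fit(X, y)
```

Given a new sentence pair, classifying the difference of their feature vectors then tells us which one is predicted to be harder.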
Pairwise Ranking: Evaluation
- errors: percentage of reversed pairs; accuracy: percentage of correctly ranked pairs
- if there are two sentences (N = unsimplified, S = simplified) and rank(S) > rank(N), it is counted as an error
- if there are three sentences (A, I, B) and our system gives the readability ranking I, B, A, there are two ranking errors: 1. I is ranked higher than A; 2. B is ranked higher than A
Note: there are other evaluation measures for ranking, tailored to information retrieval applications.
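Counting reversed pairs can be sketched as follows, assuming rankings are given hardest-first (the function name is hypothetical):

```python
from itertools import combinations

def ranking_errors(gold_order, predicted_order):
    """Count pairs whose relative order in the prediction is reversed
    with respect to the gold ranking (given hardest-first)."""
    gold_pos = {s: i for i, s in enumerate(gold_order)}
    pred_pos = {s: i for i, s in enumerate(predicted_order)}
    errors = 0
    for a, b in combinations(gold_order, 2):
        # A pair is reversed when its items swap relative positions.
        if (gold_pos[a] - gold_pos[b]) * (pred_pos[a] - pred_pos[b]) < 0:
            errors += 1
    return errors
```

On the slide's example, with gold order A, I, B and predicted order I, B, A, this counts exactly the two errors involving A.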
Ranking with Train-Test Data
Setup:
Training sets:
1. Wiki-Train: 2,000 pairs of Wikipedia-SimpleWikipedia parallel sentences
2. OSE2-Train: 2,000 pairs of sentences from the OSE corpus (advanced-beginner, advanced-intermediate, and intermediate-beginner combinations)
3. OSE3-Train: 750 sentence triplets from the OSE corpus (each triplet has a single sentence in 3 versions)
4. WikiOSE: mixed training set, consisting of Wiki-Train and OSE2-Train (size: 4,000 pairs)
Test sets:
1. Wiki-Test: 78,000 pairs
2. OSE2-Test: 1,000 pairs
3. OSE3-Test: 100 triplets
Results (algorithm: SVMrank)

Training with 2-level datasets:

Test set   | Wiki-Train | OSE2-Train
Wiki-Test  | 81.8%      | 77.5%
OSE2-Test  | 74.6%      | 81.5%
OSE3-Test  | 74.7%      | 79.3%

Training with a 3-level corpus and a mixed corpus:

Test set              | OSE-3Level-750 | WikiOSE
Wiki-Test (78K pairs) | 78.6%          | 81.3%
OSE2-Test             | 82.4%          | 80.7%
OSE3-Test             | 79.6%          | 84.0%
How does this compare with the previous approach? Ranking vs. Regression
1. With the regression model, depending on the d-value, we:
- predicted correctly in 60% of the cases
- identified no difference between the sentence versions in 10% of the cases
- identified the order wrongly in 30% of the cases
2. With the ranking approach:
- we can predict the order correctly with 80% accuracy
- it works with multiple datasets and levels of simplification
Clearly, ranking works better than regression.
What features work with ranking?
Training set: OSE-2Level; test set: OSE-2Level-Test.
Performance of feature groups:
1. Psycholinguistic: 69.1%
2. Syntactic complexity: 67.1%
3. Celex: 72.2%
4. Lexical richness: 67%
Good single features reach around 55-60% accuracy.
Summary so far
- Readability-based ranking works well in distinguishing simplified and unsimplified versions of a sentence.
- Features that worked at the document level work well on sentences too, with good accuracy.
- The approach can also make distinctions between multiple levels without losing performance.
- We get good results with cross-corpus and mixed-corpus evaluations too.
- It is fairly generalizable to other texts that are informational in nature (e.g., news, encyclopaedia articles, etc.).
Current and Future Work
- What feature selection approaches work for ranking?
- What linguistic properties change the most between advanced-to-intermediate vs. intermediate-to-beginner simplification?
- How do we eliminate correlated features?
- Using ranking with actual automatic text simplification approaches, to evaluate them in a data-driven manner ...
End of Story!
Thank you for your patience :-)
Questions?
email: sowmya@sfs.uni-tuebingen.de
References
Chae, J. & A. Nenkova (2009). Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL).
Dell'Orletta, F., M. Wieling, A. Cimino, G. Venturi & S. Montemagni (2014). Assessing the Readability of Sentences: Which Corpora and Features? In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications (BEA9). Baltimore, Maryland, USA: ACL, pp. 163-173.
Lenzner, T. (2013). Are Readability Formulas Valid Tools for Assessing Survey Question Difficulty? Sociological Methods and Research.
Nelken, R. & S. M. Shieber (2006). Towards robust context-sensitive sentence alignment for monolingual corpora. In 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL). ACL, pp. 161-168.
Pilán, I., E. Volodina & R. Johansson (2013). Automatic Selection of Suitable Sentences for Language Learning Exercises. In L. Bradley & S. Thouësny (eds.), 20 Years of EUROCALL: Learning from the Past, Looking to the Future. Proceedings of the 2013 EUROCALL Conference, pp. 218-225.
Pilán, I., E. Volodina & R. Johansson (2014). Rule-based and machine learning approaches for second language sentence-level readability. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications (BEA9). Baltimore, Maryland, USA: ACL, pp. 174-184.
Segler, T. M. (2007). Investigating the Selection of Example Sentences for Unknown Target Words in ICALL Reading Texts for L2 German. Ph.D. thesis, Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh. URL http://homepages.inf.ed.ac.uk/s9808690/thesis.pdf.
Vajjala, S. & D. Meurers (2014a). Assessing the relative reading level of sentence pairs for text simplification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden: ACL, pp. 288-297.
Vajjala, S. & D. Meurers (2014b). Readability Assessment for Text Simplification: From Analyzing Documents to Identifying Sentential Simplifications. International Journal of Applied Linguistics, Special Issue on Current Research in Readability and Text Simplification, edited by Thomas François & Delphine Bernhard.
Zhu, Z., D. Bernhard & I. Gurevych (2010). A Monolingual Tree-based Translation Model for Sentence Simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China, pp. 1353-1361.