Automatic Readability Assessment

Dissertation Defense: Automatic Readability Assessment
Candidate: Lijun Feng
September 03, 2010
Committee:
  Prof. Matt Huenerfauth, Mentor, Queens College
  Prof. Virginia Teller, Hunter College
  Prof. Heng Ji, Queens College
  Prof. Andrew Rosenberg, Queens College
  Outside Member: Prof. Noemie Elhadad, Columbia University

Some texts are harder than others

Text A: "Quacker, Squeaky, and I are building a snowman. Squeaky tried to put a nose on the snowman, but he couldn't reach that high. Quacker teased him about being short. Now Squeaky is upset. What should I say to him?"

Text B: "The floes normally provide a floating nursery for pups while adults dive to root for clams and other food in the seabed in shallow coastal waters along the continental shelf. Last month, the federal Fish and Wildlife Service, responding to a lawsuit by the Center for Biological Diversity, an environmental group, concluded that there was sufficient scientific evidence of rising stress on the animals from climate change to consider granting the Pacific walrus protection under the Endangered Species Act."

Background
- What is readability? A measure of the ease with which a text can be understood.
- Factors affecting readability: discourse, syntactic, lexical, etc.; the reader's side.
- How is readability measured? Corpora, expert ratings, reader responses, etc.
- How to assess readability automatically: NLP and machine learning techniques with novel features.

Motivations & Impact
- Effectively select reading material appropriate to a target reader's proficiency level (school children, second-language learners, adults with low literacy, language instructors).
- Rank documents by reading difficulty (text summarization, machine translation, text ordering, text coherence, text generalization).
- Evaluate the performance of text simplification systems and guide simplification processes.

Previous Work: Traditional approaches (see Section 3.1)
- Shallow features: avg. sentence length, avg. number of syllables per word, etc.
- Learning technique: simple linear functions.
- Examples:
  Flesch-Kincaid Grade Level = 0.39 * (total words / total sentences) + 11.8 * (total syllables / total words) - 15.59
  Gunning FOG = 0.4 * [(total words / total sentences) + 100 * (complex words / total words)]
- Limitations: easy to calculate, but highly unreliable.
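To make the two formulas concrete, here is a minimal sketch in Python. The syllable counter is a rough vowel-group heuristic (an assumption for illustration, not the exact counter behind these formulas), and "complex words" follows the common Gunning FOG convention of three or more syllables.

```python
import re

def count_syllables(word):
    # Rough heuristic: count runs of consecutive vowels (assumption,
    # not an exact syllabifier).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid(words, sentences):
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

def gunning_fog(words, sentences):
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences)
                  + 100 * complex_words / len(words))

words = "The floes normally provide a floating nursery for pups".split()
print(flesch_kincaid(words, sentences=[words]))  # one-sentence toy example
print(gunning_fog(words, sentences=[words]))
```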

Previous Work: Recent statistical approaches (see Section 3.2)
- Features explored at various linguistic levels, based on:
  -- language modeling: Si and Callan (2001), Collins-Thompson & Callan (2004), Schwarm & Ostendorf (2005), Heilman et al. (2007, 2008)
  -- syntactic parse trees: Schwarm & Ostendorf (2005)
  -- part-of-speech: Heilman et al. (2007)
  -- local entity coherence: Barzilay & Lapata (2008)
  -- analysis of discourse connectives between sentences: Pitler & Nenkova (2008)
- Learning techniques: EM, SVM, etc.

Methodology: General guidelines (see Section 4.1)
- View readability from a text comprehension point of view; focus in particular on extracting features at various discourse levels.
- Readability indicators:
  -- grade levels
  -- broader categories such as complex & simple
  -- expert ratings
  -- readers' comprehension responses collected from reading experiments
- Combine NLP, machine learning techniques, and empirical studies to understand the relations between these various proxies.

Corpora

Training corpora:

  Corpus           Level       # docs   avg. words/sent
  WeeklyReader     Grade 2     174       9.72
  WeeklyReader     Grade 3     289      11.26
  WeeklyReader     Grade 4     428      13.54
  WeeklyReader     Grade 5     542      15.06
  NewYorkTimes100  Grade 7     100      25.39

Cross-domain evaluation corpus:

  LocalNews2008    original     11      18.01
  LocalNews2008    simplified   11      10.60

Corpora used to replicate the features of Schwarm et al.:

  Britannica       original    114      25.31
  Britannica       simplified  114      16.18
  LiteracyNet      original    115      16.81
  LiteracyNet      simplified  115      12.34

Feature Extraction & Representation: Novel discourse features (see Section 6.1)
- Entity density: named entities + general nouns (Name Finder by OpenNLP)
- Lexical chains: Lexical Chainer by Galley and McKeown (2003); annotation of semantic relations among entities
- Coreference inference: Coreference Resolution by OpenNLP
- Entity grid: Brown Coherence Toolkit (v0.2) by Elsner et al. (2007); grammatical/syntactic transition patterns of entities in adjacent sentences
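As a rough illustration of the entity density idea, the sketch below counts named entities and general nouns per sentence. It uses spaCy as a stand-in for the OpenNLP Name Finder used in the dissertation, and assumes the en_core_web_sm model is installed; the exact feature definitions here are simplified.

```python
import spacy  # stand-in for OpenNLP; not the dissertation's toolchain

nlp = spacy.load("en_core_web_sm")

def entity_density_features(text):
    """Simplified entity density sketch: named entities plus general
    nouns, normalized by sentence count."""
    doc = nlp(text)
    n_sents = sum(1 for _ in doc.sents)
    named_entities = len(doc.ents)
    nouns = sum(1 for tok in doc if tok.pos_ in ("NOUN", "PROPN"))
    return {
        "named_entities_per_sent": named_entities / n_sents,
        "nouns_per_sent": nouns / n_sents,
    }

print(entity_density_features(
    "The Fish and Wildlife Service studied the Pacific walrus. "
    "Pups rest on floes while adults dive for clams."))
```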

Feature Extraction & Representation
- Language modeling (SRI Language Modeling Toolkit)
  -- feature selection schemes: (a) information gain (IG); (b) generic word sequence; (c) POS sequence; (d) paired word/POS sequence
  -- train LMs on in-domain vs. cross-domain corpora
- Part-of-speech (Charniak's parser)
  -- nouns, verbs, adjectives, adverbs, prepositions
  -- content vs. function words
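The dissertation trains its LMs with the SRI toolkit. As a from-scratch illustration of the in-domain idea, the sketch below trains one add-one-smoothed bigram LM per grade level and turns a document's perplexity under each LM into a feature vector; the corpus variables at the bottom are hypothetical placeholders.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Add-one-smoothed bigram LM from tokenized sentences (a toy
    stand-in for the SRILM-trained models)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(unigrams)

def perplexity(sentences, lm):
    unigrams, bigrams, vocab = lm
    log_prob, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# One LM per grade; a document's perplexity under each becomes one feature.
# grade_lms = {g: train_bigram_lm(sents) for g, sents in corpus_by_grade.items()}
# features = [perplexity(doc_sents, grade_lms[g]) for g in sorted(grade_lms)]
```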

Feature Extraction & Representation
- Syntactic parse trees (Charniak's parser)
  -- ratio of terminal & non-terminal nodes
  -- NPs, VPs, PPs, SBARs (avg. phrasal length & counts)
- Shallow features
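To make the parse-tree features concrete, here is a small sketch that counts phrase types and the terminal/non-terminal node ratio from bracketed parser output, using NLTK's Tree class. The dissertation used Charniak's parser; any parser emitting bracketed trees would feed this the same way.

```python
from nltk import Tree

def syntactic_features(parse_str):
    """Count NP/VP/PP/SBAR subtrees and the terminal/non-terminal
    node ratio from a bracketed parse string."""
    tree = Tree.fromstring(parse_str)
    counts = {label: 0 for label in ("NP", "VP", "PP", "SBAR")}
    non_terminals = 0
    for sub in tree.subtrees():
        non_terminals += 1
        if sub.label() in counts:
            counts[sub.label()] += 1
    counts["terminal_nonterminal_ratio"] = len(tree.leaves()) / non_terminals
    return counts

print(syntactic_features("(S (NP (DT The) (NN dog)) (VP (VBD barked)))"))
# {'NP': 1, 'VP': 1, 'PP': 0, 'SBAR': 0, 'terminal_nonterminal_ratio': 0.5}
```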

Feature Summary

  Feature set                # features
  Discourse features         45
    -- entity density        16
    -- lexical chain         6
    -- coreference           7
    -- entity grid           16
  Language modeling          128
    -- in-domain LMs         80
    -- cross-domain LMs      48
  POS features               64
  Parsed syntactic features  21
  Shallow features           9
  OOV                        6
  Total                      273

Empirical Comparison of Features
- Readability assessment as a classification task
- Corpus: WeeklyReader (Grade 2 to 5)
- State-of-the-art classification models:
  -- LIBSVM (Chang & Lin, 2001)
  -- Weka machine learning toolkit (Hall et al., 2009): logistic regression, SMO, J48
- Evaluation:
  -- repeated 10-fold stratified cross-validation
  -- measures: accuracy, mean squared error (MSE), number of misclassifications by more than one grade level
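A hedged sketch of this evaluation protocol, using scikit-learn's SVC (which wraps LIBSVM internally) with repeated stratified 10-fold cross-validation. X and y below are random placeholders, not the WeeklyReader features.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data: 500 documents x 273 features, grade labels 2-5.
rng = np.random.default_rng(0)
X = rng.random((500, 273))
y = rng.integers(2, 6, size=500)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="accuracy")
print(f"accuracy (%): {100 * scores.mean():.2f} +/- {100 * scores.std():.2f}")
```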

Results
Baseline accuracy (%): 37.8
Flesch-Kincaid Grade Level accuracy (%): 1.4

  Feature Set     # Features   LIBSVM Accuracy (%)   MSE           miss > 1 grade
  Schwarm et al.  25           63.18 ± 1.664         0.55 ± 0.03   73 ± 6 (5.1%)
  All comb.       273          72.21 ± 0.821         0.43 ± 0.02   56 ± 5 (3.9%)
  WekaFS          28           70.06 ± 0.777         0.49 ± 0.02   69 ± 5 (4.8%)
  GAOB            122          74.01 ± 0.847         0.39 ± 0.02   47 ± 7 (3.3%)

* WekaFS: a subset of features obtained by Weka's attribute filter using best-first forward search
* GAOB: a subset of features obtained by a groupwise add-one-best approach

Comparison of Feature Subsets

  Feature Set   # Features   LIBSVM Accuracy (%)
  5gramWR       80           68.38 ± 0.93
  discourse     45           60.50 ± 0.99
  POS           64           59.82 ± 1.24
  syntactic     21           57.79 ± 1.02
  shallow       9            56.04 ± 1.36
  all comb.     273          72.21 ± 0.82

Key Observations: Language modeling features
- LMs trained on the in-domain corpus are much more effective.
- The information gain approach outperforms the other feature selection schemes.
- Using the in-domain corpus, LMs trained on word sequences and paired word/POS sequences achieve accuracy close to the IG approach, without complicated feature selection.

  Feature Set        cross-domain LMs (3gramBL)   in-domain LMs (5gramWR)
  information gain   52.21 ± 0.83                 62.52 ± 1.20
  word sequence      45.57 ± 0.81                 60.17 ± 1.21
  POS sequence       49.62 ± 0.51                 56.21 ± 2.35
  word/POS seq.      44.91 ± 1.17                 60.38 ± 0.82
  all combined       53.61 ± 0.84                 68.38 ± 0.93

* The dissertation includes a detailed comparison with Schwarm et al.'s study (2005).

Key Observations: Discourse features
- Entity density features demonstrate dominant discriminative power.

  Feature Set     LIBSVM Accuracy (%)
  entity density  59.63 ± 0.63
  lexical chain   45.86 ± 0.82
  coreference     40.93 ± 0.84
  entity grid     45.92 ± 1.16
  all combined    60.50 ± 0.99

Key Observations: POS features
- Among all word classes examined, nouns exhibit the most significant discriminative power.
- Prepositions show surprisingly high discriminative power, second only to nouns.
- The high discriminative power of nouns in turn explains the good performance of the entity density features, which are heavily based on nouns.

  Feature Set   LIBSVM Accuracy (%)
  nouns         58.15 ± 0.86
  prepositions  56.77 ± 1.28
  verbs         54.40 ± 1.03
  adjectives    53.87 ± 1.13
  adverbs       52.66 ± 0.97
  all comb.     59.82 ± 1.24

* The dissertation includes a detailed comparison with Heilman et al.'s study (2007).

Key Observations: Syntactic features
- VPs and node ratios are the better predictors.
- SBARs appear to be the least discriminative.

LIBSVM classifiers:

  Feature Set      Accuracy (%)
  VPs              53.07 ± 0.60
  NPs              51.56 ± 1.05
  PPs              49.36 ± 1.28
  SBARs            44.42 ± 1.07
  ratios of nodes  53.02 ± 0.57

* The dissertation includes a detailed comparison with Schwarm et al.'s study (2005).

Key Observations: Shallow features
- Avg. sentence length displays dominant discriminative power over the lexicon-based shallow features.
- Flesch-Kincaid scores, although they perform poorly when used directly to model text readability, exhibit significant discriminative power when used as features for classifiers.

  Feature Set                     J48 Accuracy (%)
  avg. sentence length            52.18 ± 0.36
  avg. num. syllables per word    42.41 ± 0.51
  % of poly-syllabic words/doc    41.22 ± 0.56
  Dale-Chall                      42.46 ± 0.38
  Flesch-Kincaid                  53.52 ± 0.00

Generalization Across Corpora
- Evaluation corpus: LocalNews2008, paired original/simplified texts
- Assumption: simplified texts should be easier to read
- Readability measures:
  -- predictions by LIBSVM classifiers trained on corpora with grade-level annotation (WeeklyReader & NewYorkTimes100)
  -- expert ratings
  -- text difficulty experienced by adults with MID (hierarchical latent trait model)
- Two layers of evaluation:
  -- investigate whether each readability measure can distinguish between original and simplified texts (paired t-test)
  -- correlations between each pair of readability measures

More Diverse Training Data
Limitations of the current models:
-- they cannot generalize to texts with complexity higher than Grade 5
-- the text complexity of several original articles in LocalNews2008 exceeds Grade 5
Improvement: add NewYorkTimes100 (labeled as Grade 7) to the training corpora (WeeklyReader).

Improved Models
Classification accuracy of LIBSVM classifiers using repeated 10-fold cross-validation, trained on WeeklyReader + NewYorkTimes100:

  Feature Set   WR only        WR             NYT100
  5gramWR       68.38 ± 0.93   68.12 ± 1.16   91.30 ± 2.34
  POS           59.82 ± 1.24   59.10 ± 0.87   89.30 ± 2.41
  syntactic     57.79 ± 1.02   56.99 ± 1.99   90.70 ± 2.83
  discourse     60.50 ± 0.99   59.53 ± 1.15   89.60 ± 2.50
  shallow       56.04 ± 1.36   56.01 ± 0.99   80.30 ± 3.82
  all comb.     72.21 ± 0.82   71.86 ± 1.21   93.30 ± 1.55
  GAOB          74.01 ± 0.85   73.84 ± 1.05   95.00 ± 2.28
  WekaFS        70.06 ± 0.78   70.67 ± 0.88   91.70 ± 2.10

Key observations:
- Adding NYT100 to the training data does not affect classification accuracy on WeeklyReader texts.
- The new models recognize NewYorkTimes100 texts with high accuracy, indicating high generalizability to new data; however, this may have resulted from strong genre bias.

Discrimination by Improved Models: Paired t-test
- The known labels for LocalNews2008 are complex & simple; model predictions are in grade levels.
- To evaluate model performance, we conduct a paired t-test on the predictions for original articles vs. simplified articles.
- We investigate whether the prediction difference between original & simplified texts is statistically significant (p < 0.05).

  Model      p-value
  5gramWR    0.0011
  POS        0.0001
  syntactic  3.99e-06
  discourse  8.14e-06
  shallow    3.59e-07
  all comb.  0.0027
  GAOB       0.0026
  WekaFS     2.17e-05

Key observation: all models' predictions are able to distinguish between original and simplified texts with high confidence.
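This test is a standard paired t-test; a minimal sketch with SciPy is below. The prediction arrays are hypothetical placeholders standing in for one model's grade-level predictions on the 11 paired LocalNews2008 articles.

```python
from scipy.stats import ttest_rel

# Hypothetical predictions: grade levels for the 11 original articles and
# their 11 simplified counterparts, paired by topic.
pred_original = [7.0, 6.5, 8.0, 7.5, 6.0, 9.0, 7.0, 8.5, 6.5, 7.0, 8.0]
pred_simplified = [4.0, 3.5, 5.0, 4.5, 3.0, 5.5, 4.0, 5.0, 3.5, 4.0, 4.5]

t_stat, p_value = ttest_rel(pred_original, pred_simplified)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")  # significant if p < 0.05
```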

Expert Ratings

Inter-expert correlations:

            Expert B   Expert C
  Expert A  0.85       0.76
  Expert B             0.77

Paired t-test p-values:

  Expert A  0.0011
  Expert B  0.0001
  Expert C  3.99e-06

Key observation: expert ratings are able to distinguish between original and simplified texts with high confidence.

Correlations: Model Predictions & Expert Ratings

  Model      Expert A   Expert B   Expert C
  5gramWR    0.60       0.44       0.39
  POS        0.76       0.51       0.38
  syntactic  0.82       0.68       0.52
  discourse  0.84       0.63       0.54
  shallow    0.81       0.57       0.58
  all comb.  0.73       0.60       0.64
  GAOB       0.65       0.47       0.51
  WekaFS     0.75       0.62       0.50

Key observation: the strong correlations observed between model predictions and expert ratings suggest that our models generalize quite well to cross-domain texts, coming close to human judgment.
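Each cell above is a correlation between one model's predictions and one expert's ratings over the same articles; a minimal sketch of such a computation with SciPy follows, using hypothetical placeholder values rather than the actual predictions or ratings.

```python
from scipy.stats import pearsonr

# Hypothetical paired lists: one model's predicted grade levels and one
# expert's difficulty ratings for the same 11 LocalNews2008 articles.
model_predictions = [4.0, 5.5, 7.0, 3.5, 6.0, 8.0, 4.5, 5.0, 7.5, 6.5, 3.0]
expert_ratings = [1.0, 2.0, 3.0, 1.0, 2.5, 3.5, 1.5, 2.0, 3.0, 2.5, 1.0]

r, p = pearsonr(model_predictions, expert_ratings)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```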

Inferring Text Difficulty for Adults with ID
Characteristics of the reading experiment design:
- subjects read texts and answered comprehension questions
- two versions of each text per topic: original vs. simplified
- the same set of comprehension questions for the original and simplified texts of a given topic
- for a given topic, no participant read both versions of the text

Hierarchical latent trait model (joint work: Jansche, Feng & Huenerfauth, 2010):
- subjects are assumed to have varying reading ability (a real number)
- the intrinsic difficulty of texts is measured on the same latent scale
- the difference between ability and difficulty determines a given subject's chance of answering questions about a given text correctly
- text difficulty is inferred from observed subject responses
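The core of this setup resembles a Rasch-style item response model: the probability of a correct answer is a logistic function of ability minus difficulty. Below is a deliberately simplified, non-hierarchical sketch of that idea (the dissertation's actual model is hierarchical and richer); all data are toy placeholders, and real data would mask unobserved subject-text pairs.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

# Rasch-style response model:
# P(subject i answers a question on text j correctly)
#   = sigmoid(ability_i - difficulty_j)

def neg_log_likelihood(params, responses, n_subjects):
    ability = params[:n_subjects]
    difficulty = params[n_subjects:]
    p = expit(ability[:, None] - difficulty[None, :])
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# Toy placeholder data: 4 subjects x 3 texts, 1 = correct answer.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(4, 3)).astype(float)

result = minimize(neg_log_likelihood, x0=np.zeros(4 + 3), args=(responses, 4))
# Difficulties are identifiable only up to a common shift, which is fine
# for comparing original vs. simplified texts.
print("inferred text difficulties:", result.x[4:])
```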

Inferred Text Difficulty for Adults with ID
Result: the p-value from a paired t-test on the inferred text difficulty is 0.1845 > 0.05.
Implication: the text difficulty inferred for adult participants with MID is NOT able to distinguish between original and simplified texts.

Correlations: Model Predictions, Expert Ratings & Inferred Text Difficulty

Correlation with inferred text difficulty (adults with MID):

  Model      corr.
  5gramWR    0.53
  POS        0.41
  syntactic  0.20
  discourse  0.35
  shallow    0.20
  all comb.  0.50
  GAOB       0.47
  WekaFS     0.20

  Expert     corr.
  A          0.26
  B          0.03
  C          0.14

Key observation: compared with expert ratings, the text difficulty experienced by adults with MID is not ideal for evaluating our automatic readability assessment tool (ARAT).

Summary
- We combined NLP and machine learning techniques to build a high-performing automatic readability assessment tool (74% accuracy).
- We examined the usefulness of features within and across various linguistic levels for predicting text readability in terms of grade levels, and identified a set of features with significant discriminative power.
- We adapted, refined, and evaluated our ARAT on cross-domain data, and showed that it generalizes well to unseen cross-domain texts, behaving similarly to expert judgments.
- Our experimental results suggest that, for future readability studies targeting adult readers with ID, expert ratings are a more reliable choice for annotating text difficulty for this group of readers.

Summary of Contributions
- Novel features and techniques; improvements over previous work at several linguistic levels.
- Creation of a local news corpus with text difficulty as experienced by adults with MID, useful for future study.
- Detailed comparisons of feature effectiveness.
- An efficient tool for assessing text readability automatically, with state-of-the-art performance.
- Development of a hierarchical latent trait model (joint work with Jansche, Feng & Huenerfauth, 2010) that captures this particular reading experiment design and generalizes to future research.

Future Directions
Ideal scenario for future work: large, validated, freely available corpora of diverse texts annotated with reading difficulty for a well-defined group of readers. At the moment, we have no information about the extent to which the grade-level annotations in WeeklyReader reflect reading difficulty as experienced by elementary school students.
Key future direction: creation and validation of corpora.
-- annotation guidelines
-- validation with the target group of readers
-- data may already be available from educational testing

and special thanks to F L O R A You make me smile!