Dissertation Defense Automatic Readability Assessment Candidate: Lijun Feng September 03, 2010 Committee: Prof. Matt Huenerfauth, Mentor, Queens College Prof. Virginia Teller, Hunter College Prof. Heng Ji, Queens College Prof. Andrew Rosenberg, Queens College Outside Member: Prof. Noemie Elhadad, Columbia University
Some texts are harder than others
Text A: Quacker, Squeaky, and I are building a snowman. Squeaky tried to put a nose on the snowman, but he couldn't reach that high. Quacker teased him about being short. Now Squeaky is upset. What should I say to him?
Text B: The floes normally provide a floating nursery for pups while adults dive to root for clams and other food in the seabed in shallow coastal waters along the continental shelf. Last month, the federal Fish and Wildlife Service, responding to a lawsuit by the Center for Biological Diversity, an environmental group, concluded that there was sufficient scientific evidence of rising stress on the animals from climate change to consider granting the Pacific walrus protection under the Endangered Species Act.
Background
What is readability? A measure of the ease with which a text can be understood.
Factors affecting readability: discourse, syntactic, lexical, etc., plus factors on the reader's side.
How is readability measured? Corpora, expert ratings, reader responses, etc.
How to assess readability automatically? NLP and machine learning techniques with novel features.
Motivations & Impact
Effectively select reading material appropriate to a target reader's proficiency level (school children, second-language learners, adults with low literacy, language instructors)
Rank documents by reading difficulty (text summarization, machine translation, text ordering, text coherence, text generation)
Evaluate the performance of text simplification systems and guide simplification processes
Previous Work: Traditional approaches (see Section 3.1)
Shallow features (avg. sentence length, avg. number of syllables per word, etc.)
Learning technique: simple linear functions
Examples:
  Flesch-Kincaid Grade Level = 0.39 * (total words / total sentences) + 11.8 * (total syllables / total words) - 15.59
  Gunning FOG = 0.4 * [(total words / total sentences) + 100 * (complex words / total words)]
Limitations: easy to calculate, but highly unreliable
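These traditional formulas are simple enough to compute directly. A minimal Python sketch follows, using a rough vowel-group heuristic for syllable counting; exact syllable counters and the definition of "complex words" (approximated here as three or more syllables) vary across implementations, so this is an illustrative approximation rather than any canonical code:

```python
import re

def count_syllables(word):
    """Rough syllable count: number of contiguous vowel groups (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level, per the formula on the slide."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

def gunning_fog(text):
    """Gunning FOG index; 'complex words' approximated as 3+ syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

Both measures depend only on surface counts, which is exactly why they are easy to calculate yet unreliable: they ignore syntax, discourse, and word meaning entirely.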
Previous Work: Recent statistical approaches (see Section 3.2)
Features explored at various linguistic levels, based on:
  language modeling: Si and Callan (2001), Collins-Thompson & Callan (2004), Schwarm & Ostendorf (2005), Heilman et al. (2007, 2008)
  syntactic parse trees: Schwarm & Ostendorf (2005)
  part-of-speech: Heilman et al. (2007)
  local entity coherence: Barzilay & Lapata (2008)
  analysis of discourse connectives between sentences: Pitler & Nenkova (2008)
Learning techniques: EM, SVM, etc.
Methodology: General guidelines (see Section 4.1)
View readability from a text-comprehension point of view; focus in particular on extracting features at various discourse levels
Readability indicators: grade levels; broader categories such as complex & simple; expert ratings; readers' comprehension responses collected from reading experiments
Combine NLP, machine learning techniques, and empirical studies to understand the relations between these various proxies
Corpora (# docs, avg. number of words/sentence)
Training corpora:
  WeeklyReader Grade 2: 174, 9.72
  WeeklyReader Grade 3: 289, 11.26
  WeeklyReader Grade 4: 428, 13.54
  WeeklyReader Grade 5: 542, 15.06
  NewYorkTimes100 (Grade 7): 100, 25.39
Cross-domain evaluation corpus:
  LocalNews2008 original: 11, 18.01
  LocalNews2008 simplified: 11, 10.60
Corpora used to replicate features by Schwarm et al.:
  Britannica original: 114, 25.31; simplified: 114, 16.18
  LiteracyNet original: 115, 16.81; simplified: 115, 12.34
Feature Extraction & Representation
Novel discourse features (see Section 6.1):
  entity density: named entities + general nouns (Name Finder by OpenNLP)
  lexical chains: Lexical Chainer by Galley and McKeown (2003); annotation of semantic relations among entities
  coreference inference: Coreference Resolution by OpenNLP
  entity grid: Brown Coherence Toolkit (v0.2) by Elsner et al. (2007) (grammatical/syntactic transition patterns of entities in adjacent sentences)
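To illustrate the entity-density idea, the sketch below computes density features from pre-tagged sentences. The dissertation identified entities with OpenNLP's Name Finder plus general nouns; this toy version approximates entities by noun POS tags alone, and the input format (lists of token/tag pairs) is assumed purely for illustration:

```python
def entity_density_features(sentences):
    """
    Simple entity-density features over pre-tagged sentences.
    Each sentence is a list of (token, tag) pairs with Penn Treebank
    POS tags ('NN', 'NNS', 'NNP', ...).  Entities are approximated
    here by noun tags alone -- a stand-in for NER + general nouns.
    """
    total_entities = 0
    unique_entities = set()
    for sent in sentences:
        for token, tag in sent:
            if tag.startswith("NN"):        # common + proper nouns
                total_entities += 1
                unique_entities.add(token.lower())
    num_sents = len(sentences)
    num_words = sum(len(s) for s in sentences)
    return {
        "entities_per_sentence": total_entities / num_sents,
        "entities_per_word": total_entities / num_words,
        "unique_entities_per_sentence": len(unique_entities) / num_sents,
    }
```

The intuition behind such features: each new entity a reader must track adds working-memory load, so texts dense in entities tend to be harder to comprehend.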
Feature Extraction & Representation
Language modeling (SRI Language Modeling Toolkit):
  feature selection schemes: a. information gain (IG); b. generic word sequence; c. POS sequence; d. paired word/POS sequence
  train LMs on in-domain vs. cross-domain corpora
Part-of-speech (Charniak's parser):
  nouns, verbs, adjectives, adverbs, prepositions
  content vs. function words
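The information-gain selection scheme scores each candidate feature (e.g. an n-gram's presence in a document) by how much it reduces uncertainty about the grade label. A minimal stdlib-only sketch of that computation follows (an illustration of the IG criterion itself, not the SRILM-based pipeline used in the dissertation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """
    IG of a binary feature (its presence in a document) w.r.t. labels:
    H(label) - H(label | feature).  `docs` is a list of token sets.
    """
    with_f = [lab for doc, lab in zip(docs, labels) if feature in doc]
    without_f = [lab for doc, lab in zip(docs, labels) if feature not in doc]
    n = len(labels)
    cond = ((len(with_f) / n) * entropy(with_f) if with_f else 0.0) + \
           ((len(without_f) / n) * entropy(without_f) if without_f else 0.0)
    return entropy(labels) - cond
```

Ranking n-grams by this score and keeping the top-k is the "IG" selection scheme; the word-sequence and word/POS-sequence schemes skip this scoring step entirely.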
Feature Extraction & Representation
Syntactic parse trees (Charniak's parser):
  ratio of terminal & non-terminal nodes
  NPs, VPs, PPs, SBARs (avg. phrasal length & counts)
Shallow features
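Such features can be read off a bracketed parse of the kind the Charniak parser emits. The sketch below counts node labels and terminals with a tiny tokenizer; note it counts POS preterminals among the non-terminals, and the dissertation's exact node-ratio definition may differ:

```python
def parse_counts(tree_str):
    """
    Count node labels and terminals in a bracketed parse string,
    e.g. "(S (NP (DT the) (NN cat)) (VP (VBZ sat)))".
    Returns (label_counts, num_terminals).
    """
    tokens = tree_str.replace("(", " ( ").replace(")", " ) ").split()
    labels = {}
    terminals = 0
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            label = tokens[i + 1]           # node label follows '('
            labels[label] = labels.get(label, 0) + 1
            i += 2
        elif tokens[i] == ")":
            i += 1
        else:
            terminals += 1                  # a word at a leaf
            i += 1
    return labels, terminals

def syntactic_features(tree_str):
    """Phrase counts and terminal/non-terminal ratio for one parse."""
    labels, terminals = parse_counts(tree_str)
    nonterminals = sum(labels.values())     # includes POS preterminals
    return {
        "terminal_nonterminal_ratio": terminals / nonterminals,
        "NP_count": labels.get("NP", 0),
        "VP_count": labels.get("VP", 0),
        "PP_count": labels.get("PP", 0),
        "SBAR_count": labels.get("SBAR", 0),
    }
```

Deeper, more heavily nested parses yield more non-terminals per word, so a lower terminal/non-terminal ratio signals more complex syntax.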
Feature Summary
Feature set                # features
Discourse features             45
  entity density               16
  lexical chain                 6
  coreference                   7
  entity grid                  16
Language modeling             128
  in-domain LMs                80
  cross-domain LMs             48
POS features                   64
Parsed syntactic features      21
Shallow features                9
OOV                             6
Total                         273
Empirical Comparison of Features
Readability assessment as a classification task
Corpus: WeeklyReader (Grades 2 to 5)
State-of-the-art classification models:
  LIBSVM (Chang & Lin, 2001)
  Weka machine learning toolkit (Hall et al., 2009): logistic regression, SMO, J48
Evaluation:
  repeated 10-fold stratified cross-validation
  measures: accuracy, mean squared error (MSE), number of misclassifications by more than one grade level
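The three evaluation measures can be computed directly from paired true/predicted grade labels; a minimal sketch (classifier training and the cross-validation loop omitted):

```python
def grade_metrics(true_grades, pred_grades):
    """
    Evaluation measures from the slide: classification accuracy,
    mean squared error over grade levels, and the number of documents
    misclassified by more than one grade level.
    """
    n = len(true_grades)
    correct = sum(t == p for t, p in zip(true_grades, pred_grades))
    mse = sum((t - p) ** 2 for t, p in zip(true_grades, pred_grades)) / n
    miss_gt1 = sum(abs(t - p) > 1 for t, p in zip(true_grades, pred_grades))
    return {"accuracy": correct / n, "mse": mse, "miss_gt_1_grade": miss_gt1}
```

Unlike plain accuracy, MSE and the miss-by-more-than-one-grade count exploit the ordering of grade levels: confusing Grade 4 with Grade 5 is penalized less than confusing it with Grade 2.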
Results
Baseline accuracy (%): 37.8; Flesch-Kincaid Grade Level: 1.4
Feature Set     # Features   LIBSVM Accuracy (%)   MSE           miss > 1 grade
Schwarm et al.  25           63.18 ± 1.664         0.55 ± 0.03   73 ± 6 (5.1%)
All combined    273          72.21 ± 0.821         0.43 ± 0.02   56 ± 5 (3.9%)
WekaFS          28           70.06 ± 0.777         0.49 ± 0.02   69 ± 5 (4.8%)
GAOB            122          74.01 ± 0.847         0.39 ± 0.02   47 ± 7 (3.3%)
* WekaFS: a subset of features obtained by Weka's attribute filter using best-first forward search
* GAOB: a subset of features obtained by a groupwise add-one-best approach
Comparison of Feature Subsets
Feature Set    # Features   LIBSVM Accuracy (%)
5gramWR        80           68.38 ± 0.93
discourse      45           60.50 ± 0.99
POS            64           59.82 ± 1.24
syntactic      21           57.79 ± 1.02
shallow         9           56.04 ± 1.36
all combined   273          72.21 ± 0.82
Key Observations: Language modeling features
LMs trained on an in-domain corpus are much more effective.
The information gain approach outperforms the other feature selection schemes.
Using the in-domain corpus, LMs trained on word sequences and paired word/POS sequences achieve accuracy close to the IG approach without complicated feature selection.
Feature Set        cross-domain LMs (3gramBL)   in-domain LMs (5gramWR)
information gain   52.21 ± 0.83                 62.52 ± 1.20
word sequence      45.57 ± 0.81                 60.17 ± 1.21
POS sequence       49.62 ± 0.51                 56.21 ± 2.35
word/POS sequence  44.91 ± 1.17                 60.38 ± 0.82
all combined       53.61 ± 0.84                 68.38 ± 0.93
* Dissertation includes detailed comparison with Schwarm et al.'s study (2005)
Key Observations: Discourse features
Entity density features demonstrate dominant discriminative power.
Feature Set     LIBSVM Accuracy (%)
entity density  59.63 ± 0.63
lexical chain   45.86 ± 0.82
coreference     40.93 ± 0.84
entity grid     45.92 ± 1.16
all combined    60.50 ± 0.99
Key Observations: POS features
Among all word classes examined, nouns exhibit the most significant discriminative power.
Prepositions demonstrate surprisingly high discriminative power, second only to nouns.
The high discriminative power of nouns in turn explains the good performance of entity density features, which are heavily based on nouns.
Feature Set   LIBSVM Accuracy (%)
nouns         58.15 ± 0.86
prepositions  56.77 ± 1.28
verbs         54.40 ± 1.03
adjectives    53.87 ± 1.13
adverbs       52.66 ± 0.97
all combined  59.82 ± 1.24
* Dissertation includes detailed comparison with Heilman et al.'s study (2007)
Key Observations: Syntactic features
VPs and node ratios are better predictors; SBARs appear to be the least discriminative.
LIBSVM classifiers:
Feature Set      Accuracy (%)
VPs              53.07 ± 0.60
NPs              51.56 ± 1.05
PPs              49.36 ± 1.28
SBARs            44.42 ± 1.07
ratios of nodes  53.02 ± 0.57
* Dissertation includes detailed comparison with Schwarm et al.'s study (2005)
Key Observations: Shallow features
Avg. sentence length displays dominant discriminative power over lexicon-based shallow features.
Flesch-Kincaid scores, although they perform poorly when used directly to model text readability, exhibit significant discriminative power when used as features for classifiers.
Feature Set                        J48 Accuracy (%)
avg. sentence length               52.18 ± 0.36
avg. num. syllables per word       42.41 ± 0.51
% of poly-syllabic words per doc   41.22 ± 0.56
Dale-Chall                         42.46 ± 0.38
Flesch-Kincaid                     53.52 ± 0.00
Generalization Across Corpora
Evaluation corpus: LocalNews2008, paired original/simplified articles
Assumption: simplified texts should be easier to read
Readability measures:
  predictions by LIBSVM classifiers trained on corpora with grade-level annotation (WeeklyReader & NewYorkTimes100)
  expert ratings
  text difficulty experienced by adults with MID (hierarchical latent trait model)
Two layers of evaluation:
  investigate whether each readability measure can distinguish between original and simplified texts (paired t-test)
  correlations between each pair of readability measures
More Diverse Training Data
Limitations of current models:
  cannot generalize to texts with complexity higher than Grade 5
  text complexity of several original articles in LocalNews2008 exceeds Grade 5
Improvement: add NewYorkTimes100 (labeled as Grade 7) to the training corpora (WeeklyReader)
Improved Models
Classification accuracy of LIBSVM classifiers trained on WeeklyReader + NewYorkTimes100, using repeated 10-fold cross-validation:
Feature Set   WR only        WR             NYT100
5gramWR       68.38 ± 0.93   68.12 ± 1.16   91.30 ± 2.34
POS           59.82 ± 1.24   59.10 ± 0.87   89.30 ± 2.41
syntactic     57.79 ± 1.02   56.99 ± 1.99   90.70 ± 2.83
discourse     60.50 ± 0.99   59.53 ± 1.15   89.60 ± 2.50
shallow       56.04 ± 1.36   56.01 ± 0.99   80.30 ± 3.82
all combined  72.21 ± 0.82   71.86 ± 1.21   93.30 ± 1.55
GAOB          74.01 ± 0.85   73.84 ± 1.05   95.00 ± 2.28
WekaFS        70.06 ± 0.78   70.67 ± 0.88   91.70 ± 2.10
Key observations:
Adding NYT100 to the training data does not affect classification accuracy on WeeklyReader texts.
The new models recognize NewYorkTimes100 texts with high accuracy, indicating high generalizability to new data; however, this may result from a strong genre bias.
Discrimination by Improved Models: Paired t-test
Known labels for LocalNews2008 are complex & simple; model predictions are in grade levels.
To evaluate model performance, we conduct a paired t-test on predictions for original articles and simplified articles, investigating whether the prediction difference between original & simplified texts is statistically significant (p < 0.05).
Models        p-value
5gramWR       0.0011
POS           0.0001
syntactic     3.993e-06
discourse     8.14e-06
shallow       3.59e-07
all combined  0.0027
GAOB          0.0026
WekaFS        2.17e-05
Key observation: all model predictions are able to distinguish between original and simplified texts with high confidence.
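The paired t-test above can be illustrated by computing the t statistic over per-article prediction differences. A stdlib-only sketch follows; the p-value itself requires a t-distribution CDF (e.g. scipy.stats.ttest_rel), which Python's standard library does not provide, so only the statistic and degrees of freedom are computed here:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(original_scores, simplified_scores):
    """
    Paired t statistic over per-article differences (original minus
    simplified).  The p-value would come from the t distribution with
    n-1 degrees of freedom; returns (t, degrees_of_freedom).
    """
    diffs = [o - s for o, s in zip(original_scores, simplified_scores)]
    n = len(diffs)
    # t = mean difference / standard error of the mean difference
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

Pairing matters: each original/simplified pair shares a topic, so differencing within pairs removes topic-level variation that an unpaired test would treat as noise.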
Expert Ratings
Inter-expert correlations:
          B     C
Expert A  0.85  0.76
Expert B        0.77
Paired t-test p-values:
Expert A  0.0011
Expert B  0.0001
Expert C  3.993e-06
Key observation: expert ratings are able to distinguish between original and simplified texts with high confidence.
Correlations: Model Predictions & Expert Ratings
Models        Expert A   Expert B   Expert C
5gramWR       0.60       0.44       0.39
POS           0.76       0.51       0.38
syntactic     0.82       0.68       0.52
discourse     0.84       0.63       0.54
shallow       0.81       0.57       0.58
all combined  0.73       0.60       0.64
GAOB          0.65       0.47       0.51
WekaFS        0.75       0.62       0.50
Key observation: the strong correlations observed between model predictions and expert ratings suggest that our models generalize quite well to cross-domain texts, coming close to human judgment.
Infer Text Difficulty for Adults with ID
Characteristics of the reading experiment design:
  subjects read texts and answered comprehension questions
  two versions of each text topic: original vs. simplified
  same set of comprehension questions for original and simplified texts
  for a given topic, no participant read both versions of the text
Hierarchical latent trait model (joint work: Jansche, Feng & Huenerfauth, 2010):
  subjects are assumed to have varying reading ability (a real number)
  intrinsic difficulty of texts is measured on the same latent scale
  the difference between ability and difficulty determines a given subject's chance of answering questions about a given text correctly
  text difficulty is inferred from observed subject responses
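The latent trait idea can be sketched with a Rasch-style response model: the probability of a correct answer is a logistic function of ability minus difficulty. The toy inference below recovers one text's difficulty by grid-search maximum likelihood given known abilities; the actual hierarchical model inferred abilities and difficulties jointly, which this sketch does not attempt:

```python
import math

def p_correct(ability, difficulty):
    """Rasch-style response model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def infer_difficulty(responses, abilities, grid=None):
    """
    Maximum-likelihood difficulty for one text, given binary responses
    (1 = correct) and the responding subjects' known abilities, found
    by grid search over candidate difficulty values.
    """
    if grid is None:
        grid = [g / 10.0 for g in range(-50, 51)]   # -5.0 .. 5.0

    def loglik(d):
        return sum(math.log(p_correct(a, d)) if r
                   else math.log(1.0 - p_correct(a, d))
                   for r, a in zip(responses, abilities))

    return max(grid, key=loglik)
```

On this shared latent scale, a subject whose ability equals a text's difficulty answers correctly half the time; the larger the ability-difficulty gap, the closer the probability moves to 1 or 0.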
Inferred Text Difficulty for Adults with ID
Result: the p-value obtained from a paired t-test on the inferred text difficulty is 0.1845 > 0.05.
Implication: the inferred text difficulty for adult participants with MID is NOT able to distinguish between original and simplified texts.
Correlations: Model Predictions, Expert Ratings & Inferred Text Difficulty
Correlation with inferred text difficulty:
Models        corr.
5gramWR       0.53
POS           0.41
syntactic     0.20
discourse     0.35
shallow       0.20
all combined  0.50
GAOB          0.47
WekaFS        0.20
Experts       corr.
Expert A      0.26
Expert B      0.03
Expert C      0.14
Key observation: compared with expert ratings, the text difficulty experienced by adults with MID is not ideal for evaluating our automatic readability assessment tool (ARAT).
Summary
We combined NLP and machine learning techniques to build a high-performing automatic readability assessment tool (74% accuracy).
We examined the usefulness of features within and across various linguistic levels for predicting text readability in terms of grade levels, through which we identified a set of features with significant discriminative power.
We adapted, refined, and evaluated our ARAT on cross-domain data, showing that it generalizes well to unseen cross-domain texts and behaves similarly to expert judgments.
Our experimental results suggest that, for future readability studies targeting adult readers with ID, expert ratings are a more reliable choice for annotating text difficulty for this group of readers.
Summary of Contributions
Novel features and techniques; improvement of previous work at several linguistic levels
Creation of a local news corpus annotated with difficulty as experienced by adults with MID, useful for future study
Detailed comparisons of feature effectiveness
An efficient tool to assess text readability automatically with state-of-the-art performance
Development of a hierarchical latent trait model (joint work: Jansche, Feng & Huenerfauth, 2010) to capture the particular reading experiment design, generalizable for future research
Future Directions
Ideal scenario for future work: large, validated, freely available corpora of diverse texts annotated with reading difficulty for a well-defined group of readers. At the moment, we have no information about the extent to which grade-level annotations in WeeklyReader reflect reading difficulty as experienced by elementary school students.
Key future direction: creation and validation of corpora
  annotation guidelines
  validation with the target group of readers
  data may already be available from educational testing
and special thanks to F L O R A You make me smile!