Automatic Readability Assessment


1 Dissertation Defense: Automatic Readability Assessment
Candidate: Lijun Feng
September 03, 2010
Committee:
-- Prof. Matt Huenerfauth, Mentor, Queens College
-- Prof. Virginia Teller, Hunter College
-- Prof. Heng Ji, Queens College
-- Prof. Andrew Rosenberg, Queens College
-- Outside Member: Prof. Noemie Elhadad, Columbia University

2 Some texts are harder than others
Text A: Quacker, Squeaky, and I are building a snowman. Squeaky tried to put a nose on the snowman, but he couldn't reach that high. Quacker teased him about being short. Now Squeaky is upset. What should I say to him?
Text B: The floes normally provide a floating nursery for pups while adults dive to root for clams and other food in the seabed in shallow coastal waters along the continental shelf. Last month, the federal Fish and Wildlife Service, responding to a lawsuit by the Center for Biological Diversity, an environmental group, concluded that there was sufficient scientific evidence of rising stress on the animals from climate change to consider granting the Pacific walrus protection under the Endangered Species Act.

3 Background
-- What is readability? A measure of the ease with which a text can be understood
-- Factors affecting readability: discourse, syntactic, lexical, etc.; the reader's side
-- How is readability measured? Corpora, expert ratings, reader responses, etc.
-- How to assess readability automatically: NLP and machine learning techniques with novel features

4 Motivations & Impact
-- effectively select reading material appropriate to the target reader's proficiency level (school children, SL learners, adults with low literacy, language instructors)
-- rank documents by reading difficulty (text summarization, machine translation, text ordering, text coherence, text generalization)
-- evaluate the performance of text simplification systems and guide simplification processes

5 Previous Work
Traditional approaches (see Section 3.1)
-- shallow features (avg. sentence length, avg. number of syllables per word, etc.)
-- learning technique: simple linear functions
-- examples:
   Flesch-Kincaid grade level: 0.39*(total words/total sentences) + 11.8*(total syllables/total words) - 15.59
   Gunning FOG: 0.4*[(total words/total sentences) + 100*(complex words/total words)]
-- limitations: easy to calculate, but highly unreliable
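The two traditional formulas on this slide can be sketched as plain functions of document counts. This is a minimal illustration, not code from the dissertation: the function and parameter names are hypothetical, the counts are assumed to be computed elsewhere, and the Flesch-Kincaid constants (11.8 and 15.59) are the standard published ones.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts (standard published constants)."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning FOG index; complex_words is the count of words judged complex."""
    words_per_sentence = total_words / total_sentences
    percent_complex = 100 * (complex_words / total_words)
    return 0.4 * (words_per_sentence + percent_complex)


if __name__ == "__main__":
    # Hypothetical counts for a 100-word, 5-sentence passage.
    print(flesch_kincaid_grade(100, 5, 150))
    print(gunning_fog(100, 5, 10))
```

Both scores depend only on sentence length and word length or complexity, which is why the slide calls them easy to calculate.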

6 Previous Work
Recent statistical approaches (see Section 3.2)
-- features explored at various linguistic levels based on:
   language modeling: Si and Callan (2001), Collins-Thompson & Callan (2004), Schwarm & Ostendorf (2005), Heilman et al. (2007, 2008)
   syntactic parse trees: Schwarm & Ostendorf (2005)
   part-of-speech: Heilman et al. (2007)
   local entity coherence: Barzilay & Lapata (2008)
   analysis of discourse connectives between sentences: Pitler & Nenkova (2008)
-- learning techniques: EM, SVM, etc.

7 Methodology
General guidelines (see Section 4.1)
-- view readability from a text comprehension point of view, focusing in particular on extracting features at various discourse levels
-- readability indicators:
   grade levels
   broader categories such as complex & simple
   expert ratings
   readers' comprehension responses collected from reading experiments
-- combine NLP, machine learning techniques and empirical studies to understand the relations between various proxies

8 Corpora
Columns: corpus, # docs, avg. number of words/sentence [values lost in extraction]
Training corpora:
-- WeeklyReader (Grades 2, 3, 4 and 5)
-- NewYorkTimes100 (Grade 7)
Cross-domain evaluation corpus:
-- LocalNews2008 (original, simplified)
Corpora used to replicate features by Schwarm et al.:
-- Britannica (original, simplified)
-- LiteracyNet (original, simplified)

9 Feature Extraction & Representation
Novel discourse features (see Section 6.1)
-- entity density: named entities + general nouns (Name Finder by OpenNLP)
-- lexical chains: annotation of semantic relations among entities (Lexical Chainer by Galley and McKeown (2003))
-- coreference inference (Coreference Resolution by OpenNLP)
-- entity grid: grammatical/syntactical transition patterns of entities in adjacent sentences (Brown Coherence Toolkit (v0.2) by Elsner et al. (2007))
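The entity-grid idea (transition patterns of entities across adjacent sentences) can be illustrated with a small sketch. It assumes the grid has already been built by an upstream parser and coreference step, and it uses the four role labels S, O, X and "-" from Barzilay & Lapata (2008). The function name is hypothetical and this is not the Brown Coherence Toolkit's API. Four roles give 4 x 4 = 16 length-two transitions; whether these correspond to the 16 entity-grid features counted in the feature summary is an assumption.

```python
from collections import Counter
from itertools import product

# Assumed role inventory: subject, object, other, absent.
ROLES = ("S", "O", "X", "-")


def transition_probabilities(grid):
    """Relative frequency of each role transition between adjacent sentences.

    grid maps an entity name to its list of roles, one role per sentence.
    """
    counts = Counter()
    for roles in grid.values():
        # Pair each sentence's role with the role in the next sentence.
        for prev, curr in zip(roles, roles[1:]):
            counts[(prev, curr)] += 1
    total = sum(counts.values())
    return {
        pair: (counts[pair] / total if total else 0.0)
        for pair in product(ROLES, repeat=2)
    }


if __name__ == "__main__":
    # Hypothetical three-sentence grid with two entities.
    grid = {"snowman": ["O", "O", "-"], "Squeaky": ["S", "-", "S"]}
    for pair, prob in transition_probabilities(grid).items():
        if prob:
            print(pair, prob)
```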

10 Feature Extraction & Representation
-- language modeling (SRI Language Modeling Toolkit)
   feature selection schemes: a. information gain (IG); b. generic word sequence; c. POS sequence; d. paired word/POS sequence
   train LMs on in-domain vs. cross-domain corpora
-- part-of-speech (Charniak's Parser)
   nouns, verbs, adjectives, adverbs, prepositions
   content vs. function words
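Information gain, the first feature selection scheme listed on this slide, can be sketched for a single word as follows. The sketch assumes each document is reduced to a set of words, treats the word as a binary present/absent feature, and uses grade levels as class labels. The function names are hypothetical, and the dissertation's exact setup may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs in a document."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without_word = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    conditional = (
        (len(with_word) / n) * entropy(with_word)
        + (len(without_word) / n) * entropy(without_word)
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Hypothetical toy corpus: two Grade 2 and two Grade 5 documents.
    docs = [{"snowman", "nose"}, {"snowman"}, {"walrus", "nose"}, {"walrus"}]
    labels = [2, 2, 5, 5]
    print(information_gain(docs, labels, "snowman"))
    print(information_gain(docs, labels, "nose"))
```

Words with high information gain separate the grade levels well and would be kept as language-model vocabulary under such a scheme.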

11 Feature Extraction & Representation
-- syntactic parse trees (Charniak's Parser)
   ratio of terminal & non-terminal nodes
   NPs, VPs, PPs, SBARs (avg. phrasal length & counts)
-- shallow features
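As a rough illustration of the parse-tree features, the sketch below counts nodes in a Penn-Treebank-style bracketed parse, the kind of output a parser such as Charniak's produces. It makes several assumptions: the function name is hypothetical, POS tags (preterminals) are counted as non-terminal nodes, and average phrasal length is left out for brevity.

```python
import re
from collections import Counter

PHRASE_TYPES = ("NP", "VP", "PP", "SBAR")


def parse_tree_counts(bracketed):
    """Count terminal and non-terminal nodes and selected phrase types."""
    # Every label directly after "(" is a non-terminal (POS tags included).
    labels = re.findall(r"\(([^\s()]+)", bracketed)
    # Every token directly before ")" is a terminal (a word).
    words = re.findall(r"\s([^\s()]+)\)", bracketed)
    label_counts = Counter(labels)
    return {
        "terminals": len(words),
        "non_terminals": len(labels),
        "ratio": len(words) / len(labels) if labels else 0.0,
        "phrases": {p: label_counts[p] for p in PHRASE_TYPES},
    }


if __name__ == "__main__":
    tree = "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
    print(parse_tree_counts(tree))
```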

12 Feature Summary
Feature sets and number of features [some counts lost in extraction]
-- Discourse features
   entity density
   lexical chain: 6
   coreference: 7
   entity grid: 16
-- Language modeling
   in-domain LMs
   cross-domain LMs: 48
-- POS features: 64
-- Parsed syntactic features: 21
-- Shallow features: 9
-- OOV: 6
Total: 273

13 Empirical comparison of features
-- Readability assessment as a classification task
-- Corpus: WeeklyReader (Grade 2 to 5)
-- State-of-the-art classification models:
   LIBSVM (Chang & Lin, 2001)
   Weka machine learning toolkit (Hall et al., 2009): logistic regression, SMO, J48
-- Evaluation:
   repeated 10-fold stratified cross-validation
   measures: accuracy, mean squared error (MSE), number of misclassifications by more than one grade level
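The three evaluation measures named on this slide can be computed from gold and predicted grade levels as sketched below. The function name is hypothetical, and the sketch covers a single set of predictions rather than the repeated cross-validation loop.

```python
def evaluate_grade_predictions(gold, predicted):
    """Return accuracy, mean squared error, and the count of predictions
    that miss the gold grade by more than one level."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / n
    mse = sum((g - p) ** 2 for g, p in pairs) / n
    missed_by_more_than_one = sum(abs(g - p) > 1 for g, p in pairs)
    return accuracy, mse, missed_by_more_than_one


if __name__ == "__main__":
    # Hypothetical predictions for four documents.
    print(evaluate_grade_predictions([2, 3, 4, 5], [2, 3, 5, 3]))
```

Unlike accuracy, MSE and the miss-by-more-than-one count penalize a Grade 5 text labeled Grade 2 more heavily than one labeled Grade 4, which suits ordered grade labels.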

14 Results
Baseline accuracy (%): 37.8
Flesch-Kincaid Grade Level: 1.4
Feature Set | # Features | LIBSVM Accuracy (%) | MSE | miss > 1G [most values lost in extraction]
Schwarm et al. ± ± ± 6 (5.1%)
All comb ± ± ± 5 (3.9%)
WekaFS ± ± ± 5 (4.8%)
GAOB ± ± ± 7 (3.3%)
* WekaFS: a subset of features obtained by Weka's attribute filter using the best-fit forward search method
* GAOB: a subset of features obtained by the groupwise-add-one-best approach

15 Comparison of Feature Subsets
Feature Set | # Features | LIBSVM Accuracy (%) [feature counts and mean accuracies lost in extraction]
5gramWR ± 0.93
discourse ± 0.99
POS ± 1.24
syntactic ± 1.02
shallow ± 1.36
all comb ± 0.82

16 Key Observations
Language modeling features
-- LMs trained on the in-domain corpus are much more effective.
-- The information gain approach outperforms the other feature selection schemes.
-- Using the in-domain corpus, LMs trained on word sequences and paired word/POS sequences achieve accuracy close to the IG approach without complicated feature selection.
Feature Set | cross-domain LMs (3gramBL) | in-domain LMs (5gramWR) [most values lost in extraction]
information gain ± ± 1.20
word sequence ± ± 1.21
POS sequence ± ± 2.35
word/POS seq ± ± 0.82
all combined ± ± 0.93
* Dissertation includes a detailed comparison with Schwarm et al.'s study (2005)

17 Key Observations
Discourse features
-- Entity density features demonstrate dominating discriminative power.
Feature Set | LIBSVM Accuracy (%) [mean accuracies lost in extraction]
entity density ± 0.63
lexical chain ± 0.82
coreference ± 0.84
entity grid ± 1.16
all combined ± 0.99

18 Key Observations
POS features
-- Among all word classes examined, nouns exhibit the most significant discriminative power.
-- Prepositions show surprisingly high discriminative power, second only to nouns.
-- The high discriminative power of nouns in turn explains the good performance of entity density features, which are heavily based on nouns.
Feature Set | LIBSVM Accuracy (%) [mean accuracies lost in extraction]
nouns ± 0.86
prepositions ± 1.28
verbs ± 1.03
adjectives ± 1.13
adverbs ± 0.97
all comb ± 1.24
* Dissertation includes a detailed comparison with Heilman et al.'s study (2007)

19 Key Observations
Syntactic features
-- VPs and node ratios are better predictors
-- SBARs appear to be least discriminative
LIBSVM classifiers: Feature Set | Accuracy (%) [mean accuracies lost in extraction]
VPs ± 0.60
NPs ± 1.05
PPs ± 1.28
SBARs ± 1.07
ratios of nodes ± 0.57
* Dissertation includes a detailed comparison with Schwarm et al.'s study (2005)

20 Key Observations
Shallow features
-- Avg. sentence length displays dominating discriminative power over lexical-based shallow features.
-- Flesch-Kincaid scores, although they perform poorly when used directly to model text readability, exhibit significant discriminative power when used as features for classifiers.
Feature Set | J48 Accuracy (%) [mean accuracies lost in extraction]
avg. sentence length ± 0.36
avg. num. syll. per word ± 0.51
% of poly-syll. words per doc ± 0.56
ChallDale ± 0.38
Flesch-Kincaid ± 0.00

21 Generalization across corpora
-- Evaluation corpus: LocalNews2008, paired original/simplified
-- Assumption: simplified texts should be easier to read
-- Readability measures:
   predictions by LIBSVM classifiers trained on corpora with grade level annotation (WeeklyReader & NewYorkTimes100)
   expert ratings
   text difficulty experienced by adults with MID (hierarchical latent trait model)
-- Two layers of evaluation:
   investigate whether each readability measure can distinguish between original and simplified texts (using a paired t-test)
   correlations between each pair of readability measures

22 More diverse training data
Limitations of current models:
-- cannot generalize to texts with complexity higher than Grade 5,
-- text complexity of several original articles in LocalNews2008 exceeds Grade 5.
Improvement: add NewYorkTimes100 (labeled as Grade 7) to the training corpora (WeeklyReader)

23 Improved Models
Classification accuracy generated by LIBSVM classifiers using repeated 10-fold cross-validation
Feature Set | WROnly | WeeklyReader + NewYorkTimes100: WR | NYT100 [most values lost in extraction]
5gramWR ± ± ± 2.34
POS ± ± ± 2.41
syntactic ± ± ± 2.83
discourse ± ± ± 2.50
shallow ± ± ± 3.82
all comb ± ± ± 1.55
GAOB ± ± ± 2.28
WekaFS ± ± ± 2.10
Key observations
-- Adding NYT100 to the training data does not affect the classification accuracy on WeeklyReader texts.
-- The new models recognize NewYorkTimes100 texts with high accuracy, indicating their high generalizability to new data. However, this may have resulted from strong genre bias.

24 Discrimination by improved models
Paired t-test
-- Known labels for LocalNews2008 are complex & simple
-- Model predictions are in grade levels
-- To evaluate model performance, we conduct a paired t-test on predictions for original articles and simplified articles
-- Investigate whether the prediction difference between original & simplified texts is statistically significant (p < 0.05)
Models | p-value [most values lost in extraction]
5gramWR POS syntactic discourse shallow 8.14e e-07 all comb GAOB WekaFS 2.17e-05
Key observations
-- All model predictions are able to distinguish between original and simplified texts with high confidence.
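The paired t-test on this slide compares each original article's predicted grade with that of its simplified counterpart. A minimal sketch of the test statistic is given below. The function name and sample data are hypothetical, and the p-value is left out because it requires the Student t distribution, which the Python standard library does not provide (a statistics package would normally be used for that step).

```python
import math
import statistics


def paired_t_statistic(original, simplified):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences.
    """
    if len(original) != len(simplified) or len(original) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [o - s for o, s in zip(original, simplified)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical predicted grade levels for four article pairs.
    t, df = paired_t_statistic([5, 6, 7, 6], [3, 4, 4, 5])
    print(t, df)
```

A large positive t with the given degrees of freedom indicates that original articles are consistently predicted to be harder than their simplified versions.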

25 Expert ratings
Expert B C A B 0.77 [remaining table values lost in extraction]
p-value [values lost in extraction]: Expert A, Expert B, Expert C
Key observations
-- Expert ratings are able to distinguish between original and simplified texts with high confidence.

26 Correlations: model predictions & expert ratings
Models | Expert A | Expert B | Expert C [values lost in extraction]
5gramWR, POS, syntactic, discourse, shallow, all comb, GAOB, WekaFS
Expert B C A B 0.77 [remaining table values lost in extraction]
Key observations
-- Strong correlations observed between model predictions and expert ratings suggest that our models generalize to cross-domain texts quite well, close to human judgment.
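The slide does not say which correlation coefficient was used. As an illustration only, the sketch below computes the Pearson correlation between model predictions and one expert's ratings. The choice of Pearson, the function name and the sample numbers are all assumptions.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical predicted grades and expert ratings for five texts.
    predictions = [3, 5, 4, 7, 6]
    expert_ratings = [2, 4, 4, 5, 5]
    print(pearson_correlation(predictions, expert_ratings))
```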

27 Infer text difficulty for adults with ID
Characteristics of the reading experiment design
-- subjects read texts and answered comprehension questions
-- two versions of texts for each topic: original vs. simplified
-- same set of comprehension questions for original and simplified texts
-- for a given topic, no participant read both versions of the text
Hierarchical latent trait model (joint work, Jansche, Feng & Huenerfauth, 2010)
-- subjects assumed to have varying reading ability (a real number)
-- intrinsic difficulty of texts measured on the same latent scale
-- difference of ability and difficulty determines a given subject's chance of answering questions about a given text correctly
-- infer text difficulty from observed subject responses
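The core relation of the latent trait model (the difference between ability and difficulty determines the chance of a correct answer) can be illustrated with a simple logistic link, as in a Rasch-style model. The logistic form and the function name are assumptions made for illustration. The slide does not give the link function, and the hierarchical structure and the inference of difficulty from responses are not shown.

```python
import math


def probability_correct(ability, difficulty):
    """Chance of a correct answer under an assumed logistic (Rasch-style) link.

    Both arguments are real numbers on the same latent scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    # Hypothetical subject with ability 0.5 reading an easy and a hard text.
    print(probability_correct(0.5, -1.0))  # easier text
    print(probability_correct(0.5, 2.0))   # harder text
```

Under this form a subject whose ability equals the text's difficulty answers correctly half of the time, and the chance rises as ability exceeds difficulty.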

28 Inferred text difficulty for Adults with ID Results: p-value obtained from a paired t-test on the inferred text difficulty: > 0.05 Implication: The inferred text difficulty for adult participants with MID is NOT able to distinguish between original and simplified texts.

29 Correlations: model predictions, expert ratings & inferred text difficulty
Models | corr. w user
5gramWR 0.53
POS 0.41
syntactic 0.20
discourse 0.35
shallow 0.20
all comb [value lost in extraction]
GAOB 0.47
WekaFS 0.20
Experts | corr. w user
A 0.26
B 0.03
C 0.14
Models | Expert A | Expert B | Expert C [values lost in extraction]
5gramWR, POS, syntactic, discourse, shallow, all comb, GAOB, WekaFS
Expert B C A B 0.77 [remaining table values lost in extraction]
Key observations
-- Compared with expert ratings, text difficulty experienced by adults with MID is not ideal for evaluating our automatic readability assessment tool (ARAT).

30 Summary
-- We combined NLP and machine learning techniques to build an automatic readability assessment tool (ARAT) with high performance (74% accuracy).
-- We examined the usefulness of features within and across various linguistic levels for predicting text readability in terms of grade levels, and thereby identified a set of features with significant discriminative power.
-- We adapted, refined and evaluated our ARAT on cross-domain data. We showed that it generalizes well to unseen cross-domain texts, behaving similarly to expert judgments.
-- Our experimental results suggest that, for future readability studies targeting adult readers with ID, expert ratings are a more reliable choice for annotating text difficulty for this group of readers.

31 Summary of contributions
-- Novel features and techniques
-- Improvement of previous work at several linguistic levels
-- Creation of local news corpora experienced by adults with MID, useful for future study
-- Detailed comparisons of feature effectiveness
-- An efficient tool to assess text readability automatically with state-of-the-art performance
-- Development of a hierarchical latent trait model (joint work, Jansche, Feng & Huenerfauth, 2010) to capture a particular reading experiment design, generalizable for future research

32 Future Directions
Ideal scenario for future work: large, validated, freely available corpora of diverse texts annotated with reading difficulty for a well-defined group of readers.
At the moment, we have no information on the extent to which grade level annotations in WeeklyReader reflect reading difficulty as experienced by elementary school students.
Key future direction: creation and validation of corpora.
-- annotation guidelines
-- validation with the target group of readers
-- data may already be available from educational testing

33 and special thanks to F L O R A You make me smile!
