Automatic Readability Assessment
1 Dissertation Defense: Automatic Readability Assessment
Candidate: Lijun Feng
September 03, 2010
Committee:
- Prof. Matt Huenerfauth, Mentor, Queens College
- Prof. Virginia Teller, Hunter College
- Prof. Heng Ji, Queens College
- Prof. Andrew Rosenberg, Queens College
- Outside Member: Prof. Noemie Elhadad, Columbia University
2 Some texts are harder than others
Text A: Quacker, Squeaky, and I are building a snowman. Squeaky tried to put a nose on the snowman, but he couldn't reach that high. Quacker teased him about being short. Now Squeaky is upset. What should I say to him?
Text B: The floes normally provide a floating nursery for pups while adults dive to root for clams and other food in the seabed in shallow coastal waters along the continental shelf. Last month, the federal Fish and Wildlife Service, responding to a lawsuit by the Center for Biological Diversity, an environmental group, concluded that there was sufficient scientific evidence of rising stress on the animals from climate change to consider granting the Pacific walrus protection under the Endangered Species Act.
3 Background
- What is readability? A measure of the ease with which a text can be understood.
- Factors affecting readability: discourse, syntactic, lexical, etc., plus the reader's side.
- How is readability measured? Corpora, expert ratings, reader responses, etc.
- How can readability be assessed automatically? NLP and machine learning techniques with novel features.
4 Motivations & Impact
- Effectively select reading material appropriate to a target reader's proficiency level (school children, second-language learners, adults with low literacy, language instructors)
- Rank documents by reading difficulty (text summarization, machine translation, text ordering, text coherence, text generation)
- Evaluate the performance of text simplification systems and guide simplification processes
5 Previous Work
Traditional approaches (see Section 3.1)
- Shallow features (avg. sentence length, avg. num. of syllables per word, etc.)
- Learning technique: simple linear functions
- Examples:
  Flesch-Kincaid grade level = 0.39*(total words/total sentences) + 11.8*(total syllables/total words) - 15.59
  Gunning FOG = 0.4*[(total words/total sentences) + 100*(complex words/total words)]
- Limitations: easy to calculate, but highly unreliable
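These shallow formulas reduce to arithmetic over three or four raw counts; a minimal sketch (function and variable names are illustrative, not from the dissertation):

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)


def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning FOG index; 'complex' words have three or more syllables."""
    return 0.4 * ((total_words / total_sentences)
                  + 100.0 * (complex_words / total_words))
```

For example, a 100-word, 10-sentence text with 150 syllables scores a Flesch-Kincaid grade of about 6.0.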
6 Previous Work
Recent statistical approaches (see Section 3.2)
- Features explored at various linguistic levels, based on:
  - language modeling: Si and Callan (2001), Collins-Thompson & Callan (2004)
  - syntactic parse trees: Schwarm & Ostendorf (2005)
  - part-of-speech: Heilman et al. (2007, 2008)
  - local entity coherence: Barzilay & Lapata (2008)
  - analysis of discourse connectives between sentences: Pitler & Nenkova (2008)
- Learning techniques: EM, SVM, etc.
7 Methodology
General guidelines (see Section 4.1)
- View readability from a text-comprehension point of view; focus in particular on extracting features at various discourse levels
- Readability indicators: grade levels; broader categories such as complex & simple; expert ratings; readers' comprehension responses collected from reading experiments
- Combine NLP, machine learning techniques and empirical studies to understand the relations between these various proxies
8 Corpora
Training corpora:
- WeeklyReader (Grades 2-5)
- NewYorkTimes100 (Grade 7)
Cross-domain evaluation corpus:
- LocalNews2008 (original & simplified)
Corpora used to replicate features by Schwarm et al.:
- Britannica (original & simplified)
- LiteracyNet (original & simplified)
9 Feature Extraction & Representation
Novel discourse features (see Section 6.1)
- entity density: named entities + general nouns (Name Finder by OpenNLP)
- lexical chains: Lexical Chainer by Galley and McKeown (2003); annotation of semantic relations among entities
- coreference inference: Coreference Resolution by OpenNLP
- entity grid: Brown Coherence Toolkit (v0.2) by Elsner et al. (2007) (grammatical/syntactic transition patterns of entities in adjacent sentences)
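The entity-density idea is simple to sketch: count noun-like mentions per sentence. This toy version assumes POS-tagged input in place of the OpenNLP pipeline used in the dissertation:

```python
def entity_density(tagged_sentences):
    """Average number of entity mentions per sentence, approximating
    'named entities + general nouns' with Penn Treebank noun tags."""
    NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
    mentions = sum(1 for sent in tagged_sentences
                   for _, tag in sent if tag in NOUN_TAGS)
    return mentions / len(tagged_sentences)
```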
10 Feature Extraction & Representation
Language modeling (SRI language modeling toolkit)
- Feature selection schemes: (a) information gain (IG); (b) generic word sequence; (c) POS sequence; (d) paired word/POS sequence
- Train LMs on in-domain vs. cross-domain corpora
Part-of-speech (Charniak's parser)
- nouns, verbs, adjectives, adverbs, prepositions
- content vs. function words
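The in-domain LM features score a text under language models trained per grade level; SRILM builds smoothed n-gram models, but the idea can be sketched with an add-one-smoothed unigram model (names are illustrative):

```python
import math
from collections import Counter


def train_unigram_lm(tokens):
    """Add-one-smoothed unigram model; returns a probability function."""
    counts = Counter(tokens)
    total = len(tokens)
    vocab = len(counts) + 1  # reserve one slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)


def log_likelihood(lm, tokens):
    """Log-likelihood of a text under a model; the grade whose LM
    scores the text highest is the predicted level."""
    return sum(math.log(lm(w)) for w in tokens)
```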
11 Feature Extraction & Representation
Syntactic parse trees (Charniak's parser)
- ratio of terminal & non-terminal nodes
- NPs, VPs, PPs, SBARs (avg. phrasal length & counts)
Shallow features
12 Feature Summary
Feature sets (# features):
- Discourse features: entity density; lexical chain (6); coreference (7); entity grid (16)
- Language modeling: in-domain LMs; cross-domain LMs (48)
- POS features: 64
- Parsed syntactic features: 21
- Shallow features: 9
- OOV: 6
Total: 273
13 Empirical Comparison of Features
- Readability assessment cast as a classification task
- Corpus: WeeklyReader (Grades 2 to 5)
- State-of-the-art classification models: LIBSVM (Chang & Lin, 2001); Weka machine learning toolkit (Hall et al., 2009): logistic regression, SMO, J48
- Evaluation: repeated 10-fold stratified cross-validation; measures: accuracy, mean squared error (MSE), number of misclassifications by more than one grade level
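The three evaluation measures are straightforward to compute from gold and predicted grade levels; a small sketch (not the LIBSVM/Weka tooling itself):

```python
def evaluate(gold, pred):
    """Accuracy, mean squared error, and the number of predictions
    off by more than one grade level."""
    n = len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / n
    mse = sum((g - p) ** 2 for g, p in zip(gold, pred)) / n
    miss_gt_one = sum(abs(g - p) > 1 for g, p in zip(gold, pred))
    return accuracy, mse, miss_gt_one
```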
14 Results
Baseline accuracy (%): 37.8; Flesch-Kincaid grade level: 1.4
Feature sets compared with LIBSVM (accuracy, MSE, misclassifications by more than one grade):
- Schwarm et al. (miss > 1 grade: 5.1%)
- all combined (miss > 1 grade: 3.9%)
- WekaFS (miss > 1 grade: 4.8%)
- GAOB (miss > 1 grade: 3.3%)
* WekaFS: a subset of features obtained by Weka's attribute filter using best-first forward search
* GAOB: a subset of features obtained by the groupwise-add-one-best approach
15 Comparison of Feature Subsets
Feature sets compared by LIBSVM accuracy (%): 5gramWR, discourse, POS, syntactic, shallow, all combined
16 Key Observations: Language Modeling Features
- LMs trained on an in-domain corpus are much more effective.
- The information gain approach outperforms the other feature selection schemes.
- Using the in-domain corpus, LMs trained on word sequences and paired word/POS sequences achieve accuracy close to the IG approach without complicated feature selection.
(Feature sets compared, cross-domain vs. in-domain LMs: 3gramBL, 5gramWR, information gain, word sequence, POS sequence, word/POS sequence, all combined)
* The dissertation includes a detailed comparison with Schwarm et al.'s study (2005)
17 Key Observations: Discourse Features
Entity density features show dominant discriminative power.
(Feature sets compared by LIBSVM accuracy: entity density, lexical chain, coreference, entity grid, all combined)
18 Key Observations: POS Features
- Among all word classes examined, nouns exhibit the most significant discriminative power.
- Prepositions are, surprisingly, the next most discriminative after nouns.
- The high discriminative power of nouns in turn explains the good performance of the entity density features, which are heavily based on nouns.
(Feature sets compared by LIBSVM accuracy: nouns, prepositions, verbs, adjectives, adverbs, all combined)
* The dissertation includes a detailed comparison with Heilman et al.'s study (2007)
19 Key Observations: Syntactic Features
- VPs and node ratios are better predictors.
- SBARs appear to be the least discriminative.
(LIBSVM classifiers compared by accuracy: VPs, NPs, PPs, SBARs, ratios of nodes)
* The dissertation includes a detailed comparison with Schwarm et al.'s study (2005)
20 Key Observations: Shallow Features
- Average sentence length displays dominant discriminative power over the lexical shallow features.
- Flesch-Kincaid scores, although they perform poorly when used directly to model text readability, exhibit significant discriminative power when used as features for classifiers.
(Feature sets compared by J48 accuracy: avg. sentence length, avg. num. syllables per word, % of poly-syllabic words per doc, Dale-Chall, Flesch-Kincaid)
21 Generalization Across Corpora
- Evaluation corpus: LocalNews2008, paired original/simplified
- Assumption: simplified texts should be easier to read
- Readability measures: predictions by LIBSVM classifiers trained on corpora with grade-level annotation (WeeklyReader & NewYorkTimes100); expert ratings; text difficulty experienced by adults with MID (hierarchical latent trait model)
- Two layers of evaluation: (1) investigate whether each readability measure can distinguish between original and simplified texts (using a paired t-test); (2) correlations between each pair of readability measures
22 More Diverse Training Data
Limitations of the current models:
- cannot generalize to texts with complexity higher than Grade 5
- the text complexity of several original articles in LocalNews2008 exceeds Grade 5
Improvement: add NewYorkTimes100 (labeled as Grade 7) to the training corpus (WeeklyReader)
23 Improved Models
Classification accuracy generated by LIBSVM classifiers using repeated 10-fold cross-validation on WeeklyReader + NewYorkTimes100 (feature sets: 5gramWR, POS, syntactic, discourse, shallow, all combined, GAOB, WekaFS; reported for WR only, WR, and NYT100)
Key observations:
- Adding NYT100 to the training data does not affect classification accuracy on WeeklyReader texts.
- The new models recognize NewYorkTimes100 texts with high accuracy, indicating high generalizability to new data. However, this may have resulted from strong genre bias.
24 Discrimination by Improved Models: Paired t-test
- Known labels for LocalNews2008 are complex & simple; model predictions are in grade levels.
- To evaluate model performance, we conduct a paired t-test on the predictions for original articles and for simplified articles.
- We investigate whether the prediction difference between original & simplified texts is statistically significant (p < 0.05).
(Models tested: 5gramWR, POS, syntactic, discourse, shallow, all combined, GAOB, WekaFS; reported p-values include 8.14e-07 and 2.17e-05)
Key observation: all model predictions are able to distinguish between original and simplified texts with high confidence.
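The paired t-test on per-article prediction differences can be sketched in a few lines; the slide's p-values come from comparing the statistic to a t distribution with n-1 degrees of freedom (this sketch stops at the statistic):

```python
import math


def paired_t(original, simplified):
    """Paired t statistic over per-document differences
    (original minus simplified prediction); returns (t, df)."""
    diffs = [o - s for o, s in zip(original, simplified)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1
```

A large positive t means the originals are consistently predicted harder than their simplified counterparts.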
25 Expert Ratings
Pairwise correlations among Experts A, B, and C (e.g., 0.77), with p-values.
Key observation: expert ratings are able to distinguish between original and simplified texts with high confidence.
26 Correlations: Model Predictions & Expert Ratings
(Correlations between each model's predictions — 5gramWR, POS, syntactic, discourse, shallow, all combined, GAOB, WekaFS — and the ratings of Experts A, B, and C)
Key observation: the strong correlations observed between model predictions and expert ratings suggest that our models generalize to cross-domain texts quite well, close to human judgment.
27 Inferring Text Difficulty for Adults with ID
Characteristics of the reading experiment design:
- subjects read texts and answered comprehension questions
- two versions of each topic's text: original vs. simplified
- the same set of comprehension questions for the original and simplified texts; for a given topic, no participant read both versions
Hierarchical latent trait model (joint work; Jansche, Feng & Huenerfauth, 2010):
- subjects are assumed to have varying reading ability (a real number)
- the intrinsic difficulty of texts is measured on the same latent scale
- the difference between ability and difficulty determines a given subject's chance of answering questions about a given text correctly
- text difficulty is inferred from observed subject responses
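The core link of the latent trait model, that the ability-minus-difficulty gap drives the chance of a correct answer, is a Rasch-style logistic. A minimal sketch of that link (the actual model is hierarchical and fitted to observed responses):

```python
import math


def p_correct(ability, difficulty):
    """Probability that a subject with the given latent ability answers
    a question about a text of the given latent difficulty correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))
```

A subject whose ability equals the text's difficulty answers correctly half the time; the probability rises toward 1 as ability exceeds difficulty.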
28 Inferred Text Difficulty for Adults with ID
Result: the p-value obtained from a paired t-test on the inferred text difficulty is > 0.05.
Implication: the inferred text difficulty for adult participants with MID is NOT able to distinguish between original and simplified texts.
29 Correlations: Model Predictions, Expert Ratings & Inferred Text Difficulty
Correlation with inferred user difficulty:
- Models: 5gramWR 0.53, POS 0.41, syntactic 0.20, discourse 0.35, shallow 0.20, all combined, GAOB 0.47, WekaFS 0.20
- Experts: A 0.26, B 0.03, C 0.14
(Model-vs-expert and expert-vs-expert correlations also reported)
Key observation: compared with expert ratings, the text difficulty experienced by adults with MID is not ideal for evaluating our ARAT.
30 Summary
- We combined NLP and machine learning techniques to build an automatic readability assessment tool with high performance (74% accuracy).
- We examined the usefulness of features within and across various linguistic levels for predicting text readability in terms of grade levels, through which we identified a set of features with significant discriminative power.
- We adapted, refined and evaluated our ARAT on cross-domain data, and showed that it generalizes well to unseen cross-domain texts, behaving similarly to expert judgments.
- Our experimental results suggest that, for future readability studies targeting adult readers with ID, expert ratings are a more reliable choice for annotating text difficulty for this group of readers.
31 Summary of Contributions
- Novel features and techniques; improvement over previous work at several linguistic levels
- Creation of a local-news corpus annotated with the difficulty experienced by adults with MID, useful for future study
- Detailed comparisons of feature effectiveness
- An efficient tool to assess text readability automatically with state-of-the-art performance
- Development of a hierarchical latent trait model (joint work; Jansche, Feng & Huenerfauth, 2010) to capture the particular reading experiment design, generalizable for future research
32 Future Directions
Ideal scenario for future work: large, validated, freely available corpora of diverse texts annotated with reading difficulty for a well-defined group of readers. At the moment, we have no information about the extent to which the grade-level annotations in WeeklyReader reflect reading difficulty as experienced by elementary school students.
Key future direction: creation and validation of corpora
- annotation guidelines
- validation with the target group of readers
- data may already be available from educational testing
33 and special thanks to F L O R A You make me smile!
More informationLinguistic Variation across Sports Category of Press Reportage from British Newspapers: a Diachronic Multidimensional Analysis
International Journal of Arts Humanities and Social Sciences (IJAHSS) Volume 1 Issue 1 ǁ August 216. www.ijahss.com Linguistic Variation across Sports Category of Press Reportage from British Newspapers:
More informationResolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Jeju Island, South Korea, July 2012, pp. 777--789.
More informationExposé for a Master s Thesis
Exposé for a Master s Thesis Stefan Selent January 21, 2017 Working Title: TF Relation Mining: An Active Learning Approach Introduction The amount of scientific literature is ever increasing. Especially
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationProgram Matrix - Reading English 6-12 (DOE Code 398) University of Florida. Reading
Program Requirements Competency 1: Foundations of Instruction 60 In-service Hours Teachers will develop substantive understanding of six components of reading as a process: comprehension, oral language,
More informationThe Discourse Anaphoric Properties of Connectives
The Discourse Anaphoric Properties of Connectives Cassandre Creswell, Kate Forbes, Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi Λ, Bonnie Webber y Λ University of Pennsylvania 3401 Walnut Street Philadelphia,
More informationBANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS
Daffodil International University Institutional Repository DIU Journal of Science and Technology Volume 8, Issue 1, January 2013 2013-01 BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS Uddin, Sk.
More informationCaMLA Working Papers
CaMLA Working Papers 2015 02 The Characteristics of the Michigan English Test Reading Texts and Items and their Relationship to Item Difficulty Khaled Barkaoui York University Canada 2015 The Characteristics
More informationAtypical Prosodic Structure as an Indicator of Reading Level and Text Difficulty
Atypical Prosodic Structure as an Indicator of Reading Level and Text Difficulty Julie Medero and Mari Ostendorf Electrical Engineering Department University of Washington Seattle, WA 98195 USA {jmedero,ostendor}@uw.edu
More informationOutline. Web as Corpus. Using Web Data for Linguistic Purposes. Ines Rehbein. NCLT, Dublin City University. nclt
Outline Using Web Data for Linguistic Purposes NCLT, Dublin City University Outline Outline 1 Corpora as linguistic tools 2 Limitations of web data Strategies to enhance web data 3 Corpora as linguistic
More informationEdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,
More informationPredicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks
Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com
More informationMachine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler
Machine Learning and Data Mining Ensembles of Learners Prof. Alexander Ihler Ensemble methods Why learn one classifier when you can learn many? Ensemble: combine many predictors (Weighted) combina
More informationMultilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities
Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB
More informationWord Sense Disambiguation
Word Sense Disambiguation D. De Cao R. Basili Corso di Web Mining e Retrieval a.a. 2008-9 May 21, 2009 Excerpt of the R. Mihalcea and T. Pedersen AAAI 2005 Tutorial, at: http://www.d.umn.edu/ tpederse/tutorials/advances-in-wsd-aaai-2005.ppt
More informationConstraining X-Bar: Theta Theory
Constraining X-Bar: Theta Theory Carnie, 2013, chapter 8 Kofi K. Saah 1 Learning objectives Distinguish between thematic relation and theta role. Identify the thematic relations agent, theme, goal, source,
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationProduct Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments
Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &
More informationGenerative models and adversarial training
Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?
More informationIntroduction to HPSG. Introduction. Historical Overview. The HPSG architecture. Signature. Linguistic Objects. Descriptions.
to as a linguistic theory to to a member of the family of linguistic frameworks that are called generative grammars a grammar which is formalized to a high degree and thus makes exact predictions about
More informationReading Grammar Section and Lesson Writing Chapter and Lesson Identify a purpose for reading W1-LO; W2- LO; W3- LO; W4- LO; W5-
New York Grade 7 Core Performance Indicators Grades 7 8: common to all four ELA standards Throughout grades 7 and 8, students demonstrate the following core performance indicators in the key ideas of reading,
More informationUNIVERSITY OF OSLO Department of Informatics. Dialog Act Recognition using Dependency Features. Master s thesis. Sindre Wetjen
UNIVERSITY OF OSLO Department of Informatics Dialog Act Recognition using Dependency Features Master s thesis Sindre Wetjen November 15, 2013 Acknowledgments First I want to thank my supervisors Lilja
More informationOptimizing to Arbitrary NLP Metrics using Ensemble Selection
Optimizing to Arbitrary NLP Metrics using Ensemble Selection Art Munson, Claire Cardie, Rich Caruana Department of Computer Science Cornell University Ithaca, NY 14850 {mmunson, cardie, caruana}@cs.cornell.edu
More informationMETHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS
METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar
More informationFormulaic Language and Fluency: ESL Teaching Applications
Formulaic Language and Fluency: ESL Teaching Applications Formulaic Language Terminology Formulaic sequence One such item Formulaic language Non-count noun referring to these items Phraseology The study
More informationRobust Sense-Based Sentiment Classification
Robust Sense-Based Sentiment Classification Balamurali A R 1 Aditya Joshi 2 Pushpak Bhattacharyya 2 1 IITB-Monash Research Academy, IIT Bombay 2 Dept. of Computer Science and Engineering, IIT Bombay Mumbai,
More informationMachine Learning from Garden Path Sentences: The Application of Computational Linguistics
Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,
More informationThe Karlsruhe Institute of Technology Translation Systems for the WMT 2011
The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationProbability estimates in a scenario tree
101 Chapter 11 Probability estimates in a scenario tree An expert is a person who has made all the mistakes that can be made in a very narrow field. Niels Bohr (1885 1962) Scenario trees require many numbers.
More information