CS474 Natural Language Processing

Last class
  Introduction to the field of NLP
  Course requirements, syllabus, etc.
Today
  Introduction to an important class of statistical methods in NLP: generative models

Language Modeling
  Introduction to generative models of language (today)
    » What are they?
    » Why they're important
    » Issues for counting words
    » Statistics of natural language
    » Unsmoothed n-gram models

What are generative models of language?
  Word prediction
    Once upon a ...
    I'd like to make a collect ...
    Let's go outside and take a ...
  Generative models can assign probabilities to
    Possible next words
    Sequences of words

Why are word prediction models important?
  Augmentative communication systems
    For the disabled, to predict the next words the user wants to speak
  Computer-aided education
    System that helps kids learn to read (e.g. Mostow et al. system)
  Speech recognition
  Context-sensitive spelling correction

Why are word prediction models important?
  Can be used to assign a probability to the next word in an incomplete sentence
  Closely related to the problem of computing the probability of a sequence of words
  Useful for part-of-speech tagging, probabilistic parsing, ...

The need for models of word prediction in NLP has not been uncontroversial
  "But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." - Noam Chomsky (1969)
  "Every time I fire a linguist the recognition rate improves." - Fred Jelinek (IBM speech group, 1988)

Paradigms in NLP
  Knowledge-based methods
    Rely on the manual encoding of linguistic (and world) knowledge
      » E.g. FSAs for morphological parsing, syntactic parsing
  Statistical/learning methods
    Rely on the automatic acquisition of linguistic knowledge from corpora

Statistical/machine learning in NLP
  Conference     some ML        no ML
  1992 ACL       24% (8/34)     76%
  1994 ACL       35% (14/40)    65%
  1996 ACL       39% (16/41)    61%
  1999 ACL       60% (41/69)    40%
  2001 NAACL     87% (27/31)    13%

Word prediction models
  Important in real-life situations...
    Miss words in a conversation, lecture, movie, etc.

Word prediction gone awry
  Woody Allen's Take the Money and Run
  http://www.tcm.com/mediaroom/video/224555/take-the-money-and-run-movie-clip-gub.html

Word prediction gone amok
  Seinfeld Sentence Finisher
  http://www.youtube.com/watch?v=01tezktyjqa&feature=related

N-gram model
  Uses the previous N-1 words to predict the next word
    2-gram: bigram
    3-gram: trigram
    1-gram: unigram
  In speech recognition, these statistical models of word sequences are referred to as a language model

Want to use n-gram models to...
  Determine the next word in a sequence
    Probability distribution across all words in the language
    $P(w_n \mid w_1 w_2 \ldots w_{n-1})$
  Determine the probability of a sequence of words
    $P(w_1 w_2 \ldots w_{n-1} w_n)$
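The two uses of an n-gram model listed above can be illustrated with a minimal sketch for the bigram case (N = 2). This is not taken from the slides: the toy corpus and the function names `train_bigram_counts`, `next_word_distribution`, and `sequence_probability` are assumptions made for illustration, and the estimates are plain unsmoothed relative frequencies.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Unsmoothed relative-frequency estimate of P(next word | prev)."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # prev was never seen with a following word
    return {word: c / total for word, c in counts[prev].items()}


def sequence_probability(counts, tokens):
    """Product of bigram probabilities for tokens[1:], conditioned on tokens[0]."""
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= next_word_distribution(counts, prev).get(nxt, 0.0)
    return prob


# Toy corpus (hypothetical, for illustration only).
corpus = "once upon a time there was a king and a time to rest".split()
counts = train_bigram_counts(corpus)

# Use 1: distribution over possible next words after "a".
dist = next_word_distribution(counts, "a")

# Use 2: probability of a word sequence under the bigram approximation.
p_seq = sequence_probability(counts, ["upon", "a", "time"])
```

Because nothing is smoothed, any sequence containing a bigram that never occurred in the corpus receives probability zero under this sketch.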

Next
  Language Modeling
    Introduction to generative models of language
      » What are they?
      » Why they're important
      » Issues for counting words
      » Statistics of natural language
      » Unsmoothed n-gram models

Counting words in corpora
  Ok, so how many words are in this sentence?
    Depends on whether or not we treat punctuation marks as words
    Important for many NLP tasks
      » Grammar-checking, spelling error detection, author identification, part-of-speech tagging
  Spoken language corpora
    Utterances don't usually have punctuation, but they do have other phenomena that we might or might not want to treat as words
      » I do uh main- mainly business data processing
    Fragments
    Filled pauses
      » um and uh behave more like words, so most speech recognition systems treat them as such
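How much these decisions change a word count can be seen in a small sketch. The regular expressions, the `FILLED_PAUSES` set, and the rule that a fragment is any token ending in a hyphen are all assumptions chosen for this illustration, not conventions stated in the slides.

```python
import re

# Assumed inventory of filled pauses; real systems may use a longer list.
FILLED_PAUSES = {"uh", "um"}


def count_written_words(text, count_punctuation=True):
    """Count tokens in written text, optionally treating punctuation marks as words."""
    pattern = r"\w+|[^\w\s]" if count_punctuation else r"\w+"
    return len(re.findall(pattern, text))


def count_spoken_words(utterance, keep_filled_pauses=True, keep_fragments=False):
    """Count tokens in a transcribed utterance.

    A fragment is assumed to be a token ending in '-', as in 'main-'.
    """
    count = 0
    for token in utterance.split():
        if token.endswith("-") and not keep_fragments:
            continue
        if token.lower() in FILLED_PAUSES and not keep_filled_pauses:
            continue
        count += 1
    return count


sentence = "Ok, so how many words are in this sentence?"
utterance = "I do uh main- mainly business data processing"

with_punct = count_written_words(sentence, count_punctuation=True)
without_punct = count_written_words(sentence, count_punctuation=False)
speech_default = count_spoken_words(utterance)
speech_strict = count_spoken_words(utterance, keep_filled_pauses=False)
```

The defaults keep filled pauses and drop fragments, mirroring the observation that most speech recognition systems treat um and uh as words.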

Counting words in corpora
  Capitalization
    Should "They" and "they" be treated as the same word?
      » For most statistical NLP applications, they are
      » Sometimes capitalization information is maintained as a feature
        E.g. spelling error correction, part-of-speech tagging
  Inflected forms
    Should "walks" and "walk" be treated as the same word?
      » No for most n-gram based systems
      » These are based on the wordform (i.e. the inflected form as it appears in the corpus) rather than the lemma (i.e. the set of lexical forms that have the same stem)

Counting words in corpora
  Need to distinguish
    word types
      » the number of distinct words
    word tokens
      » the number of running words
  Example
    All for one and one for all.
    8 tokens (counting punctuation)
    6 types (assuming capitalized and uncapitalized versions of the same token are treated separately)
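The type/token counts in the example can be checked with a short sketch. The tokenizing regular expression and the function name `tokens_and_types` are assumptions for illustration; punctuation marks are counted as tokens, as in the example.

```python
import re


def tokens_and_types(text, fold_case=False):
    """Return (number of tokens, number of types) for a text.

    Punctuation marks count as tokens. With fold_case=True, capitalized
    and uncapitalized versions of a word are merged into a single type.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))


example = "All for one and one for all."
n_tokens, n_types = tokens_and_types(example)
n_tokens_folded, n_types_folded = tokens_and_types(example, fold_case=True)
```

With case kept distinct this gives 8 tokens and 6 types, matching the example. Folding case merges "All" and "all", which leaves 8 tokens but only 5 types.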

Introduction to generative models of language
  » What are they?
  » Why they're important
  » Issues for counting words
  » Statistics of natural language
  » Unsmoothed n-gram models