Introduction to Natural Language Processing. Hongning Wang

Introduction to Natural Language Processing Hongning Wang CS@UVa

What is NLP?
كلب ھو مطاردة صبي في الملعب. (Arabic text)
How can a computer make sense out of this string?
- Morphology: What are the basic units of meaning (words)? What is the meaning of each word?
- Syntax: How are words related to each other?
- Semantics: What is the combined meaning of words?
- Pragmatics: What is the "meta-meaning"? (speech act)
- Discourse: Handling a large chunk of text
- Inference: Making sense of everything
CS@UVa CS6501: Text Mining

An example of NLP
Sentence: A dog is chasing a boy on the playground.
- Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
- Syntactic analysis (parsing):
  - Noun Phrase: A dog
  - Complex Verb: is chasing
  - Noun Phrase: a boy
  - Prep Phrase: on + Noun Phrase (the playground)
  - Verb Phrase: Complex Verb + Noun Phrase
  - Verb Phrase: Verb Phrase + Prep Phrase
  - Sentence: Noun Phrase + Verb Phrase
- Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
- Inference: adding the rule Scared(x) if Chasing(_,x,_) yields Scared(b1).
- Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back.
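
The inference step on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: facts are stored as (predicate, arguments) tuples, and the function name `infer_scared` is hypothetical, not something from the lecture.

```python
# Facts from the semantic analysis of
# "A dog is chasing a boy on the playground."
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to a set of facts."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            _, chased, _ = args  # only the second argument matters
            derived.add(("Scared", (chased,)))
    return derived


print(infer_scared(facts))
```

The rule fires once, on the single Chasing fact, and binds x to b1.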

If we can do this for all the sentences in all languages, then we could:
- Automatically answer our emails
- Translate languages accurately
- Help us manage, summarize, and aggregate information
- Use speech as a UI (when needed)
- Talk to us / listen to us
BAD NEWS: Unfortunately, we cannot right now. General NLP = Complete AI

NLP is difficult!
Natural language is designed to make human communication efficient. Therefore:
- We omit a lot of common sense knowledge, which we assume the hearer/reader possesses
- We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve
This makes EVERY step in NLP hard:
- Ambiguity is a killer!
- Common sense reasoning is a prerequisite

An example of ambiguity
Get the cat with the gloves.
The phrase "with the gloves" can attach to "get" (use the gloves to get the cat) or to "the cat" (the cat that has the gloves).

Examples of challenges
- Word-level ambiguity
  - "design" can be a noun or a verb (Ambiguous POS)
  - "root" has multiple meanings (Ambiguous sense)
- Syntactic ambiguity
  - "natural language processing" (Modification)
  - "A man saw a boy with a telescope." (PP Attachment)
- Anaphora resolution
  - "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
- Presupposition
  - "He has quit smoking." implies that he smoked before.

Despite all the challenges, research in NLP has also made a lot of progress

A brief history of NLP
- Early enthusiasm (1950s): Machine Translation
  - Too ambitious
  - Bar-Hillel report (1960) concluded that fully-automatic high-quality translation could not be accomplished without knowledge (Dictionary + Encyclopedia)
- Less ambitious applications (late 1960s & early 1970s): Limited success, failed to scale up
  - Speech recognition
  - Dialogue (Eliza): shallow understanding
  - Inference and domain knowledge (SHRDLU = "block world"): deep understanding in limited domain
- Real world evaluation (late 1970s to now)
  - Story understanding (late 1970s & early 1980s): knowledge representation
  - Large scale evaluation of speech recognition, text retrieval, information extraction (1980 to now): robust component techniques
  - Statistical approaches enjoy more success (first in speech recognition & retrieval, later others): statistical language models
- Current trend:
  - Boundary between statistical and symbolic approaches is disappearing.
  - We need to use all the available knowledge
  - Application-driven NLP research (bioinformatics, Web, Question answering): applications

The state of the art
Sentence: A dog is chasing a boy on the playground
Tags: Det Noun Aux Verb Det Noun Prep Det Noun
Phrases: Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence
- POS Tagging: 97%
- Parsing: partial >90%
- Semantics: some aspects
  - Entity/relation extraction
  - Word sense disambiguation
  - Anaphora resolution
- Inference: ???
- Speech act analysis: ???

Machine translation

Dialog systems
- Apple's Siri system
- Google search

Information extraction
- Google Knowledge Graph
- Wiki Info Box

Information extraction
- CMU Never-Ending Language Learning
- YAGO Knowledge Base

Building a computer that understands text: The NLP pipeline

Tokenization/Segmentation
Split text into words and sentences
Task: what is the most likely segmentation/tokenization?
Input: There was an earthquake near D.C. I've even felt it in Philadelphia, New York, etc.
Output:
- There + was + an + earthquake + near + D.C.
- I + 've + even + felt + it + in + Philadelphia, + New + York, + etc.
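
A rule-based tokenizer gives a feel for why this step is not trivial. The sketch below is an assumption-laden illustration, not the method from the lecture: it uses one hand-written regular expression that keeps dotted abbreviations ("D.C.", "etc.") together, splits clitics such as "'ve", and, unlike the slide, separates commas into their own tokens. It does not attempt sentence splitting, which is the harder part when "D.C." also ends a sentence.

```python
import re

# Alternatives are tried left to right, so abbreviations win over plain words.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"   # dotted abbreviations such as D.C. or U.S.
    r"|\betc\."             # a listed abbreviation that keeps its period
    r"|'\w+"                # clitics such as 've, 's, 'll
    r"|\w+"                 # ordinary words and numbers
    r"|[^\w\s]"             # any other single punctuation mark
)


def tokenize(text):
    """Return the list of tokens found by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


text = ("There was an earthquake near D.C. "
        "I've even felt it in Philadelphia, New York, etc.")
print(tokenize(text))
```

A statistical tokenizer would instead score candidate segmentations and pick the most likely one, which is how the slide frames the task.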

Part-of-Speech tagging
Marking up a word in a text (corpus) as corresponding to a particular part of speech
Task: what is the most likely tag sequence?
Input: A + dog + is + chasing + a + boy + on + the + playground
Output: Det Noun Aux Verb Det Noun Prep Det Noun
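
One common way to find "the most likely tag sequence" is Viterbi decoding over a hidden Markov model. The sketch below assumes that setup; the lecture does not say which model it has in mind, and every probability in the tables is made up for illustration rather than estimated from a corpus.

```python
import math

TAGS = ["Det", "Noun", "Aux", "Verb", "Prep"]

# Hypothetical probabilities, chosen by hand for this one sentence.
START = {"Det": 0.8, "Noun": 0.2}
TRANS = {
    "Det": {"Noun": 1.0},
    "Noun": {"Aux": 0.4, "Verb": 0.3, "Prep": 0.3},
    "Aux": {"Verb": 1.0},
    "Verb": {"Det": 0.7, "Prep": 0.3},
    "Prep": {"Det": 1.0},
}
EMIT = {
    "Det": {"a": 0.5, "the": 0.5},
    "Noun": {"dog": 0.3, "boy": 0.3, "playground": 0.3, "chasing": 0.1},
    "Aux": {"is": 1.0},
    "Verb": {"chasing": 0.6, "is": 0.4},
    "Prep": {"on": 1.0},
}


def log(p):
    """Log-probability, with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the highest-scoring tag sequence for a list of words."""
    # best[i][t]: log-probability of the best path for words[:i+1] ending in t
    best = [{t: log(START.get(t, 0)) + log(EMIT[t].get(words[0], 0))
             for t in TAGS}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + log(TRANS[p].get(t, 0)))
            scores[t] = (best[-1][prev] + log(TRANS[prev].get(t, 0))
                         + log(EMIT[t].get(word, 0)))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


words = "a dog is chasing a boy on the playground".split()
print(list(zip(words, viterbi(words))))
```

Note that "is" and "chasing" are each ambiguous in the emission table; the transition probabilities are what resolve them to Aux and Verb.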

Named entity recognition
Determine text mapping to proper names
Task: what is the most likely mapping?
Input: Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe.
Output:
- Organization: Board of Visitors
- Location: U.S.
- Person: Thomas Jefferson, James Madison, James Monroe
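
The simplest baseline for this task is a gazetteer lookup: a fixed list of known names, each with an entity type. The sketch below assumes such a list; the `GAZETTEER` table and the function name are hypothetical, and a lookup like this does not compute a "most likely" mapping the way a statistical tagger would.

```python
# Hypothetical gazetteer covering only the names in the example sentence.
GAZETTEER = {
    "Board of Visitors": "Organization",
    "U.S.": "Location",
    "Thomas Jefferson": "Person",
    "James Madison": "Person",
    "James Monroe": "Person",
}


def tag_entities(text):
    """Return (mention, type) pairs for gazetteer entries, in text order."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            found.append((start, name, label))
            start = text.find(name, start + len(name))
    return [(name, label) for _, name, label in sorted(found)]


sentence = ("Its initial Board of Visitors included U.S. Presidents "
            "Thomas Jefferson, James Madison, and James Monroe.")
for mention, label in tag_entities(sentence):
    print(f"{label}: {mention}")
```

A lookup of this kind fails on unseen names and on ambiguous ones, which is why the task is framed as finding the most likely mapping.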

Syntactic parsing
Grammatical analysis of a given sentence, conforming to the rules of a formal grammar
Task: what is the most likely grammatical structure?
Input: A + dog + is + chasing + a + boy + on + the + playground
Tags: Det Noun Aux Verb Det Noun Prep Det Noun
Structure:
- Noun Phrase: A dog
- Complex Verb: is chasing
- Noun Phrase: a boy
- Prep Phrase: on + Noun Phrase (the playground)
- Verb Phrase: Complex Verb + Noun Phrase
- Verb Phrase: Verb Phrase + Prep Phrase
- Sentence: Noun Phrase + Verb Phrase
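
A small CKY chart parser shows how a formal grammar yields this structure. The sketch assumes a toy grammar written by hand to cover only this sentence, with the abbreviations S, NP, VP, PP and CV (for "Complex Verb"); it keeps one tree per span and does no probability scoring, so it does not choose among competing parses.

```python
from collections import defaultdict

# Toy lexicon and binary rules, written for this one sentence only.
LEXICON = {
    "a": "Det", "the": "Det",
    "dog": "Noun", "boy": "Noun", "playground": "Noun",
    "is": "Aux", "chasing": "Verb", "on": "Prep",
}
RULES = {
    ("NP", "VP"): "S",
    ("Det", "Noun"): "NP",
    ("Aux", "Verb"): "CV",
    ("CV", "NP"): "VP",
    ("VP", "PP"): "VP",
    ("Prep", "NP"): "PP",
}


def cky_parse(words):
    """Return a nested-tuple parse tree rooted at S, or None if none exists."""
    n = len(words)
    chart = defaultdict(dict)  # (start, end) -> {symbol: tree}
    for i, word in enumerate(words):
        tag = LEXICON[word]
        chart[i, i + 1][tag] = (tag, word)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, left_tree in chart[i, k].items():
                    for right, right_tree in chart[k, j].items():
                        parent = RULES.get((left, right))
                        if parent:
                            chart[i, j].setdefault(
                                parent, (parent, left_tree, right_tree))
    return chart[0, n].get("S")


words = "a dog is chasing a boy on the playground".split()
print(cky_parse(words))
```

Adding a rule such as NP -> NP PP would make the sentence ambiguous (the boy on the playground versus chasing on the playground), which is where a probabilistic grammar becomes necessary.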

Relation extraction
Identify the relationships among named entities
Shallow semantic analysis
Input: Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe.
Output:
1. Thomas Jefferson Is_Member_Of Board of Visitors
2. Thomas Jefferson Is_President_Of U.S.
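
A hand-written pattern is enough to pull these relations out of the example sentence. The sketch below assumes a single surface pattern of the form "<Organization> included <Country> Presidents <list of people>"; the pattern and the function name are illustrative only, and real systems learn many such patterns or use a trained classifier.

```python
import re

PATTERN = re.compile(
    r"(?P<org>[A-Z]\w+(?: (?:of )?[A-Z]\w+)*) included "
    r"(?P<country>(?:[A-Z]\.)+) Presidents (?P<people>.+)\."
)


def extract_relations(sentence):
    """Return (subject, relation, object) triples matched by PATTERN."""
    match = PATTERN.search(sentence)
    if not match:
        return []
    # Split "A, B, and C" (or "A and B") into individual names.
    people = [p for p in re.split(r",\s*(?:and\s+)?|\s+and\s+", match["people"]) if p]
    triples = []
    for person in people:
        triples.append((person, "Is_Member_Of", match["org"]))
        triples.append((person, "Is_President_Of", match["country"]))
    return triples


sentence = ("Its initial Board of Visitors included U.S. Presidents "
            "Thomas Jefferson, James Madison, and James Monroe.")
for triple in extract_relations(sentence):
    print(triple)
```

The slide lists only the two relations for Thomas Jefferson; the same pattern also produces the corresponding pairs for Madison and Monroe.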

Logic inference
Convert chunks of text into more formal representations
Deep semantic analysis: e.g., first-order logic structures
Input: Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe.
Output: ∃x (Is_Person(x) & Is_President_Of(x, "U.S.") & Is_Member_Of(x, "Board of Visitors"))
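
Once text is in a formal representation, a query like the one on this slide can be checked mechanically. The sketch below assumes the facts have already been extracted and stored as tuples; the fact set and the helper name `witnesses` are hypothetical, and the check is a brute-force search over a tiny domain rather than a general theorem prover.

```python
PEOPLE = ["Thomas Jefferson", "James Madison", "James Monroe"]

# Hypothetical knowledge base built from the example sentence.
FACTS = set()
for person in PEOPLE:
    FACTS.add(("Is_Person", person))
    FACTS.add(("Is_President_Of", person, "U.S."))
    FACTS.add(("Is_Member_Of", person, "Board of Visitors"))


def witnesses(facts, domain):
    """Return every x in domain satisfying
    Is_Person(x) & Is_President_Of(x, "U.S.")
    & Is_Member_Of(x, "Board of Visitors")."""
    return [
        x for x in domain
        if ("Is_Person", x) in facts
        and ("Is_President_Of", x, "U.S.") in facts
        and ("Is_Member_Of", x, "Board of Visitors") in facts
    ]


domain = PEOPLE + ["U.S.", "Board of Visitors"]
found = witnesses(FACTS, domain)
print("Formula holds:", bool(found))  # the existential is true if any x works
print("Witnesses:", found)
```

The existential formula is true as soon as one witness exists; here all three presidents satisfy it.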

Towards understanding of text
- Who is Carl Lewis?
- Did Carl Lewis break any records?

Major NLP applications
- Speech recognition: e.g., auto telephone call routing
- Text mining (our focus)
  - Text clustering
  - Text classification
  - Text summarization
  - Topic modeling
  - Question answering
- Language tutoring
  - Spelling/grammar correction
- Machine translation
  - Cross-language retrieval
  - Restricted natural language
- Natural language user interface

NLP & text mining
- Better NLP => Better text mining
- Bad NLP => Bad text mining?
- Robust, shallow NLP tends to be more useful than deep, but fragile NLP.
- Errors in NLP can hurt text mining performance

How much NLP is really needed?
Columns: Tasks | Dependency on NLP | Scalability
Tasks (in slide order): Classification, Clustering, Summarization, Extraction, Topic modeling, Translation, Dialogue, Question Answering, Inference, Speech Act

So, what NLP techniques are the most useful for text mining?
- Statistical NLP in general.
- The need for high robustness and efficiency implies the dominant use of simple models

What you should know
- Different levels of NLP
- Challenges in NLP
- NLP pipeline