CS460/626 : Natural Language

CS460/626: Natural Language Processing/Speech, NLP and the Web
(Lecture 2: Wordnet and Word Sense Disambiguation)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
6th Jan, 2011

Perspectivising NLP: Areas of AI and their inter-dependencies
[Figure: Search, Logic, Knowledge Representation, Machine Learning, Planning, Expert Systems, NLP, Vision, Robotics]

Books etc.
Main Text(s):
Natural Language Understanding: James Allen
Speech and NLP: Jurafsky and Martin
Foundations of Statistical NLP: Manning and Schütze
Other References:
NLP a Paninian Perspective: Bharati, Chaitanya and Sangal
Statistical NLP: Charniak
Journals:
Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC
Conferences:
ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

Wordnet: a lexical knowledgebase based on conceptual lookup
Organizes concepts in a semantic network.
Organizes lexical information in terms of word meaning, rather than word form.
Can also be used as a thesaurus.

Psycholinguistic Theory
Human lexical memory for nouns is organized as a hierarchy.
Can a canary sing? - Pretty fast response.
Can a canary fly? - Slower response.
Does a canary have skin? - Slowest response.
Hierarchy: Animal (can move, has skin) > Bird (can fly) > canary (can sing)
Wordnet - a lexical reference system based on psycholinguistic theories of human lexical memory.
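The canary example can be read as property lookup in an inheritance hierarchy, where the response gets slower as more levels have to be climbed. Below is a minimal Python sketch of that reading. It assumes a toy three-node hierarchy taken from the slide; the names `PARENT`, `PROPERTIES` and `levels_to_property` are illustrative and not part of any wordnet API.

```python
# Toy noun hierarchy from the slide: canary -> bird -> animal.
PARENT = {"canary": "bird", "bird": "animal", "animal": None}

# Each property is stored only at the most general node it applies to.
PROPERTIES = {
    "animal": {"can move", "has skin"},
    "bird": {"can fly"},
    "canary": {"can sing"},
}


def levels_to_property(concept, prop):
    """Return how many hypernym links are climbed before `prop` is found,
    or None if no node on the path to the root carries the property."""
    levels = 0
    node = concept
    while node is not None:
        if prop in PROPERTIES.get(node, set()):
            return levels
        node = PARENT[node]
        levels += 1
    return None


for prop in ("can sing", "can fly", "has skin"):
    print(prop, levels_to_property("canary", prop))
```

Under this assumption, the number of levels climbed stands in for response time: "can sing" is found at the canary node itself, "can fly" one level up, and "has skin" two levels up.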

Lexical Matrix

Wordnet is a network of words linked by lexical and semantic relations.
The first wordnet in the world was for English, developed at Princeton over 15 years.
The EuroWordNet, a linked structure of European language wordnets, was built in 1998 over 3 years with funding from the EC as a mission mode project.
Wordnets for Hindi and Marathi being built at IIT Bombay are amongst the first IL (Indian language) wordnets.
All these are proposed to be linked into the IndoWordnet, which eventually will be linked to the English and the Euro wordnets.

[Figure: INDOWORDNET, linking wordnets for Urdu, Bengali, Dravidian languages, Sanskrit, Hindi, Punjabi, North East languages, Konkani, English and Marathi]

Fundamental Design Question
Syntagmatic vs. paradigmatic relations?
Psycholinguistics is the basis of the design.
When we hear a word, many words come to our mind by association.
For English, about half of the associated words are syntagmatically related and half are paradigmatically related.
For cat:
animal, mammal - paradigmatic
mew, purr, furry - syntagmatic

Stated Fundamental Application of Wordnet: Sense Disambiguation
Determination of the correct sense of the word.
The crane ate the fish vs. The crane was used to lift the load
bird vs. machine
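One simple way to illustrate the crane example is gloss overlap: pick the sense whose definition shares the most words with the sentence. The Python sketch below uses a simplified Lesk-style heuristic, which the lecture does not name; it is swapped in here purely for illustration. The two glosses and the stopword list are hand-written assumptions, not entries from any wordnet.

```python
# Hypothetical mini sense inventory for "crane"; glosses are hand-written
# for illustration, not copied from any wordnet.
SENSES = {
    "bird": "large wading bird with long legs and neck that eats fish and frogs",
    "machine": "machine used to lift and move a heavy load on a construction site",
}

STOPWORDS = {"the", "a", "an", "to", "was", "and", "on", "with", "that", "of"}


def content_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))


print(disambiguate("The crane ate the fish"))               # bird
print(disambiguate("The crane was used to lift the load"))  # machine
```

The first sentence overlaps with the bird gloss only on "fish"; the second overlaps with the machine gloss on "used", "lift" and "load". With glosses this short the method is brittle, which is one reason real systems also draw on the relations between synsets.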

The Problem of Sense Tagging
Given a corpus, assign the correct sense to each word. This is sense tagging.
Needs Word Sense Disambiguation (WSD).
Highly important for Question Answering, Machine Translation and Text Mining tasks.

Classification of Words
Word
  Content Word: Verb, Noun, Adjective, Adverb
  Function Word: Preposition, Conjunction, Pronoun, Interjection

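Example of sense marking: its need
एक_4187 नए शोध_1138 के अनुसार_3123 जिन लोगों_1189 का सामाजिक_43540 जीवन_125623 व्यस्त_48029 होता है उनके दिमाग_16168 के एक_4187 हिस्से_120425 में अधिक_42403 जगह_113368 होती है
(According to a new research, those people who have a busy social life have larger space in a part of their brain.)
नेचर न्यूरोसाइंस में छपे एक_4187 शोध_1138 के अनुसार_3123 कई_4118 लोगों_1189 के दिमाग_16168 के स्कैन से पता_11431 चला कि दिमाग_16168 का एक_4187 हिस्सा_120425 एमिगडाला सामाजिक_43540 व्यस्तताओं_1438 के साथ_328602 सामंजस्य_166 के लिए थोड़ा_38861 बढ़_25368 जाता है
यह शोध_1138 58 लोगों_1189 पर किया गया जिसमें उनकी उम्र_13159 और दिमाग_16168 की साइज़ के आंकड़े_128065 लिए गए
अमरीकी_413405 टीम_14077 ने पाया_227806 कि जिन लोगों_1189 की सोशल नेटवर्किंग अधिक_42403 है उनके दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 बाकी_130137 लोगों_1189 की तुलना_में_38220 अधिक_42403 बड़ा_426602 है
दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 भावनाओं_1912 और मानसिक_42151 स्थिति_1652 से जुड़ा हुआ माना_212436 जाता है
(According to a study published in Nature Neuroscience, brain scans of many people showed that one part of the brain, the amygdala, grows slightly to cope with social busyness. The study was carried out on 58 people, recording their age and brain size. The American team found that in people with more social networking, the amygdala region of the brain is larger than in other people. The amygdala region of the brain is considered to be linked to emotions and mental state.)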
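The sense-marked text above follows a `word_ID` convention, with untagged function words left bare. A short Python sketch of reading that format into (word, sense id) pairs is given here. It assumes the number after the last underscore is the sense identifier; the slide does not say which sense inventory the numbers index, and the function name `parse_sense_tagged` is illustrative.

```python
def parse_sense_tagged(text):
    """Split sense-marked text into (word, sense_id) pairs.

    A token like 'शोध_1138' yields ('शोध', 1138); an untagged token
    yields (token, None). The id is taken after the LAST underscore so
    that multiword units such as 'तुलना_में_38220' stay intact.
    """
    pairs = []
    for token in text.split():
        word, sep, suffix = token.rpartition("_")
        if sep and suffix.isdecimal():
            pairs.append((word, int(suffix)))
        else:
            pairs.append((token, None))
    return pairs


sample = "एक_4187 नए शोध_1138 के अनुसार_3123 तुलना_में_38220"
for word, sense_id in parse_sense_tagged(sample):
    print(word, sense_id)
```

Splitting on the last underscore rather than the first is a deliberate choice: it keeps a multiword entry such as तुलना_में together as one tagged unit.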

Ambiguity of लोग (People)
Sense 1: लोग, जन, लोक, जनमानस, पब्लिक - एक से अधिक व्यक्ति (more than one person) "लोगों के हित में काम करना चाहिए" (One should work in the interest of the people.)
(English synset) multitude, masses, mass, hoi_polloi, people, the_great_unwashed - the common people generally "separate the warriors from the mass" "power to the people"
Sense 2: दुनिया, दुनियाँ, संसार, विश्व, जगत, जहाँ, जहान, ज़माना, जमाना, लोक, दुनियावाले, दुनियाँवाले, लोग - संसार में रहने वाले लोग (the people living in the world) "महात्मा गाँधी का सम्मान पूरी दुनिया करती है / मैं इस दुनिया की परवाह नहीं करता / आज की दुनिया पैसे के पीछे भाग रही है" (The whole world respects Mahatma Gandhi / I do not care about this world / Today's world is running after money.)
(English synset) populace, public, world - people in general considered as a whole "he is a hero in the eyes of the public"

Basic Principle
Words in natural languages are polysemous.
However, when synonymous words are put together, a unique meaning often emerges.
Use is made of Relational Semantics.
Componential Semantics, where each word is a bundle of semantic features (as in the Schankian Conceptual Dependency system or Lexical Componential Semantics), is to be examined as a viable alternative.

Componential Semantics
Consider cat and tiger. Decide on componential attributes.
Attributes: Furry, Carnivorous, Heavy, Domesticable
For cat: (Y, Y, N, Y)
For tiger: (Y, Y, Y, N)
Complete and correct attributes are difficult to design.
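The cat/tiger comparison can be written down as feature tuples and compared attribute by attribute. This is a minimal Python sketch using only the four attributes and Y/N values from the slide; the boolean encoding and the function name `differing_attributes` are assumptions made for illustration.

```python
ATTRIBUTES = ("furry", "carnivorous", "heavy", "domesticable")

# Y/N values as given on the slide, encoded as booleans.
FEATURES = {
    "cat": (True, True, False, True),
    "tiger": (True, True, True, False),
}


def differing_attributes(word_a, word_b):
    """Return the attribute names on which the two words disagree."""
    return [
        name
        for name, a, b in zip(ATTRIBUTES, FEATURES[word_a], FEATURES[word_b])
        if a != b
    ]


print(differing_attributes("cat", "tiger"))  # ['heavy', 'domesticable']
```

Two words are told apart only if some attribute separates them, so every new word that collides with an existing one forces a new attribute. This is one way to see why a complete attribute set is hard to design.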

Semantic relations in wordnet
1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy
1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).

Synset: the foundation (house)
1. house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house")
2. house -- (an official assembly having legislative powers; "the legislature has two houses")
3. house -- (a building in which something is sheltered or located; "they had a large carriage house")
4. family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")
5. theater, theatre, house -- (a building where theatrical performances or motion-picture shows can be presented; "the house was full")
6. firm, house, business firm -- (members of a business organization that owns or operates one or more establishments; "he worked for a brokerage house")
7. house -- (aristocratic family line; "the House of York")
8. house -- (the members of a religious community living together)
9. house -- (the audience gathered together in a theatre or cinema; "the house applauded"; "he counted the house")
10. house -- (play in which children take the roles of father or mother or children and pretend to interact like adults; "the children were playing house")
11. sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided)
12. house -- (the management of a gambling house or casino; "the house gets a percentage of every bet")

Creation of Synsets
Three principles:
Minimality
Coverage
Replaceability

Synset creation (continued)
Home
John's home was decorated with lights on the occasion of Christmas.
Having worked for many years abroad, John returned home.
House
John's house was decorated with lights on the occasion of Christmas.
Mercury is situated in the eighth house of John's horoscope.

Synsets (continued)
{house} is ambiguous.
{house, home} has the sense of a social unit living together; is this the minimal unit?
{family, house, home} will make the unit completely unambiguous.
For coverage: {family, household, house, home}, ordered according to frequency.
Replaceability of the most frequent words is a requirement.
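The minimality argument above can be checked mechanically: a candidate synset pins down a meaning when exactly one sense is shared by all of its words. The Python sketch below assumes a small hand-made sense inventory; the sense labels and the names `SENSE_INVENTORY`, `shared_senses` and `is_unambiguous` are hypothetical and far coarser than a real dictionary.

```python
# Hypothetical sense inventory: each word maps to the set of sense labels
# it can express. The labels are illustrative, not taken from a dictionary.
SENSE_INVENTORY = {
    "house": {"dwelling", "legislature", "social unit", "theatre", "zodiac sign"},
    "home": {"dwelling", "social unit"},
    "family": {"social unit", "kinship group"},
    "household": {"social unit"},
}


def shared_senses(words):
    """Senses that every word in the candidate synset can express."""
    return set.intersection(*(SENSE_INVENTORY[w] for w in words))


def is_unambiguous(words):
    """True when the words jointly single out exactly one sense."""
    return len(shared_senses(words)) == 1


print(sorted(shared_senses(["house", "home"])))
print(is_unambiguous(["house", "home"]))
print(is_unambiguous(["family", "house", "home"]))
```

In this toy inventory {house, home} still leaves two readings (dwelling and social unit), while adding "family" leaves only the social-unit reading. Coverage then adds the remaining synonyms, such as "household", without changing the shared sense.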

Synset creation
From first principles:
Pick all the senses from good standard dictionaries.
Obtain synonyms for each sense.
Needs hard and long hours of work.

Synset creation (continued)
From the wordnet of another language in the same family:
Pick the synset and obtain the sense from the gloss.
Get the words of the target language.
Often same words can be used - especially for words.
Translation, insertion and deletion.
Hindi Synset: ABI M A (experienced person)
Marathi Synset: ABI N