Wordnet, Multiword, Metaphor and UW

Pushpak Bhattacharyya
Department of Computer Science and Engineering, IIT Bombay
COLING 2012 UNL Panel, IIT Bombay, 15 December 2012

Foundations

Two pictures
- NLP Trinity (figure with three axes)
  - Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics
  - Language: Hindi, Marathi, English, French
  - Algorithm: HMM, MEMM, CRF
- NLP alongside Vision and Speech (figure), resting on Statistics and Probability + Knowledge Based methods

NLP Layer (figure: processing layers, listed in order of increasing complexity of processing)
- Morphology
- POS tagging
- Chunking
- Parsing
- Semantics Extraction
- Discourse and Coreference

Relational Semantics (lexical matrix: rows are word meanings, columns are word forms)

Word Meanings | Word Forms
              | F1              F2            F3                ...  Fn
M1            | (depend) E1,1   (bank) E1,2   (rely) E1,3
M2            |                 (bank) E2,2   (embankment) E2,
M3            |                 (bank) E3,2   E3,3
...           |
Mm            |                                                      Em,n
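The lexical matrix can be sketched as a small data structure. This is only an illustration: the meaning labels for M2 and M3 and the helper names `synonyms` and `senses` are assumptions, not part of the slides. Forms that share a row are synonyms; a form that appears in several rows is polysemous.

```python
# Toy lexical matrix: each meaning (row) lists the word forms (columns) expressing it.
# The M2 and M3 meaning labels are illustrative assumptions.
LEXICAL_MATRIX = {
    "M1: depend": ["depend", "bank", "rely"],
    "M2: sloping land beside water": ["bank", "embankment"],
    "M3: financial institution": ["bank"],
}


def synonyms(form):
    """Return the forms that share at least one row (meaning) with `form`."""
    result = set()
    for forms in LEXICAL_MATRIX.values():
        if form in forms:
            result.update(forms)
    result.discard(form)
    return result


def senses(form):
    """Return the rows (meanings) in which `form` appears; more than one means polysemy."""
    return [meaning for meaning, forms in LEXICAL_MATRIX.items() if form in forms]


print(sorted(synonyms("bank")))
print(len(senses("bank")))
```

Reading along a row gives synonymy, reading down a column gives polysemy, which is the relational view of meaning the slide describes.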

Componential Semantics
Consider cat and tiger. Decide on componential attributes.

        Furry  Carnivorous  Heavy  Domesticable
cat     Y      Y            N      Y
tiger   Y      Y            Y      N

A complete and correct set of attributes is difficult to design.
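A minimal sketch of the componential view, using only the four attributes and two animals from the slide. The function name `differing_attributes` is a hypothetical helper introduced for illustration.

```python
# Attribute order follows the slide: Furry, Carnivorous, Heavy, Domesticable.
ATTRIBUTES = ("furry", "carnivorous", "heavy", "domesticable")

FEATURES = {
    "cat": (True, True, False, True),
    "tiger": (True, True, True, False),
}


def differing_attributes(a, b):
    """Return the attributes on which concepts `a` and `b` disagree."""
    return [
        name
        for name, x, y in zip(ATTRIBUTES, FEATURES[a], FEATURES[b])
        if x != y
    ]


print(differing_attributes("cat", "tiger"))
```

The sketch also shows the weakness noted on the slide: the two concepts are told apart only because someone chose attributes that happen to separate them.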

Fundamental Design Question
Syntagmatic vs. paradigmatic relations?
Psycholinguistics is the basis of the design. When we hear a word, many words come to mind by association. For English, about half of the associated words are syntagmatically related and half are paradigmatically related.
For cat:
- animal, mammal: paradigmatic
- mew, purr, furry: syntagmatic

Coming to UW

Universal Word
- The repository of UWs is supposed to be universal.
- Maybe the entities themselves are not!
- Every concept expressed in every language should find a place in the UW dictionary.

IITB's NLP effort and UW++
Connect Indian languages to the other languages of the world through a pivot of interlingual lexemes that will make machine translation easier among these languages.

Indian Languages: a complex landscape
Major streams: Indo-European, Dravidian, Sino-Tibetan, Austro-Asiatic
Some languages rank within the top 20 in the world by number of speakers:
- Hindi and Urdu: 5th (~500 million)
- Bangla: 7th (~300 million)
- Marathi: 14th (~70 million)
TDIL program of DIT, Ministry of IT: launched large consortia projects on MT and IR

Some UW++ entries which are MWs
Cabman
"cabman(icl>driver>thing,equ>taxidriver)" {n} "SOMEONE WHO DRIVES A TAXI FOR A LIVING"
E [cabman] {CABMAN:AGENS,COUNT,STRONGCOUNT}
F [chauffeur_de_taxi] {CAT(CATN),GNR(MAS)}

Another multiword UW
"counterbalance(icl>cancel>do, equ>counteract, agt>thing, obj>thing)" {v} "OPPOSE AND MITIGATE THE EFFECTS OF CONTRARY ACTIONS" "THIS WILL COUNTERACT THE FOOLISH ACTIONS OF MY COLLEAGUES"
"counterbalance(icl>balance>be, equ>compensate, obj>thing, aoj>thing)" {v} "ADJUST FOR" "ENGINEERS WILL WORK TO CORRECT THE EFFECTS OF AIR RESISTANCE"
"counterbalance(icl>contrast>do, equ>oppose, agt>thing, obj>thing)" {v} "OPPOSE WITH EQUAL WEIGHT OR FORCE"
"counterbalance(icl>structure>thing, equ>balance)" {n} "EQUALITY OF DISTRIBUTION"
"counterbalance(icl>weight>thing, equ>counterweight)" {n} "A WEIGHT THAT BALANCES ANOTHER WEIGHT"
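The entries above share one surface shape: a headword followed by comma-separated restrictions, each a relation label with a chain of `>`-separated values. The following is a rough sketch of a parser for that shape only. It is not the UNL tooling; `parse_uw` and `UW_PATTERN` are hypothetical names, and the grammar is inferred from the examples on these slides.

```python
import re

# Assumed surface form, inferred from the slide examples: headword(label>a>b, label>c, ...)
UW_PATTERN = re.compile(r"^\s*([^()]+?)\s*\((.*)\)\s*$")


def parse_uw(uw):
    """Split a restricted UW into its headword and a dict of restriction chains."""
    match = UW_PATTERN.match(uw)
    if match is None:
        raise ValueError(f"not a restricted UW: {uw!r}")
    headword, body = match.groups()
    restrictions = {}
    for part in body.split(","):
        label, *chain = [piece.strip() for piece in part.split(">")]
        # A label may occur more than once, so keep a list of chains per label.
        restrictions.setdefault(label, []).append(chain)
    return headword, restrictions


head, rest = parse_uw("counterbalance(icl>cancel>do, equ>counteract, agt>thing, obj>thing)")
print(head, rest)
```

Keeping a list of chains per label means an entry such as `sensational(icl>adj,icl>good)` is parsed with two `icl` chains rather than silently losing one.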

UW dictionary is a linked structure like the wordnet
"waddle(icl>walk>do, equ>toddle, agt>thing)" {v} "WALK UNSTEADILY" "SMALL CHILDREN TODDLE"
toddle, coggle, totter, dodder, paddle, waddle -- (walk unsteadily; "small children toddle")
  => walk -- (use one's feet to advance; advance by steps; "Walk, don't run!")
    => travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?")
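The linked structure can be followed programmatically. The sketch below is a simplification under stated assumptions: it links single words rather than synsets, it contains only the three levels shown on the slide, and `HYPERNYM` and `hypernym_chain` are names introduced here for illustration.

```python
# Word-level stand-in for the synset chain on the slide:
# {toddle, ..., waddle} => walk => {travel, go, move, locomote}
HYPERNYM = {
    "waddle": "walk",
    "toddle": "walk",
    "walk": "travel",
}


def hypernym_chain(word):
    """Follow hypernym links upward from `word` until no parent is recorded."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


print(hypernym_chain("waddle"))
```

The same walk applies to the `icl` chain inside a UW, for example `icl>walk>do` in the waddle entry.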

Lexical and Semantic relations in wordnet
1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy
1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).

WordNet Sub-Graph (figure)
- Central synset: house, home
- Gloss: a place that serves as the living quarters of one or more families
- Hypernymy: dwelling, abode
- Hyponymy: hermitage, cottage
- Meronymy: kitchen, backyard, veranda, bedroom, study, guestroom

Verbs in wordnet

INDOWORDNET (figure: linked wordnets)
Hindi Wordnet, English Wordnet, Sanskrit Wordnet, Urdu Wordnet, Bengali Wordnet, Dravidian Language Wordnet, Kashmiri Wordnet, Oriya Wordnet, North East Language Wordnet, Konkani Wordnet, Gujarati Wordnet, Punjabi Wordnet, Marathi Wordnet

Categories of Synsets (2/2)
- Language specific: synsets which are unique to a language (e.g. Bihu in Assamese)
- Rare: synsets which express technical terms (e.g. ngram)
- Synthesized: synsets created in the language due to the influence of another language (e.g. Pizza)

Need for categorization
To bring systematicity to the way the wordnet synsets are linked
Categories: Universal, Pan Indian, Language Family, Language, Synthesized, Rare
All members have finished the Universal and Pan Indian synsets

Categorization methodology
34378 Hindi synsets were sent to all IndoWordnet groups through a tool that offered two options for each synset: Yes or No.
- Universal synsets: synsets that were categorized Yes and also have equivalent English words or synsets.
- Pan-Indian synsets: synsets that were categorized Yes and do not have equivalent English words or synsets.
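The two definitions amount to a small decision rule, sketched below. One point is an assumption rather than something the slide states: the sketch treats a synset as "categorized Yes" only when every responding group answered Yes. The function name `categorize` and the group labels are hypothetical.

```python
def categorize(votes, has_english_equivalent):
    """Classify a Hindi synset from the groups' Yes/No answers.

    votes: mapping of group name -> True (Yes) or False (No).
    Assumption: "categorized Yes" means all groups answered Yes.
    Returns "Universal", "Pan-Indian", or None when the synset is in neither class.
    """
    if not votes or not all(votes.values()):
        return None
    return "Universal" if has_english_equivalent else "Pan-Indian"


print(categorize({"Marathi": True, "Bangla": True}, has_english_equivalent=False))
```

Synsets for which the rule returns `None` would fall to the remaining categories (language family, language specific, synthesized, rare), which need separate criteria.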

Expansion approach: linking is a subtle and difficult process
- To link or not to link
- While linking: face lexical and semantic chasms
- Syntactic divergences in the example sentences
  - Change of POS
  - Copula drop (Hindi, Bangla)

Linking kinship relations and fine-grained concepts
Relative > Uncle > Chacha, Mama
प न direct आब प न hypernym श (case of Kashmiri)

Important decision: TWO kinds of linkages
- Direct
- Hypernymy
प न direct आब प न hypernym श (case of Kashmiri)

How to express a concept not present in the language?

Transliteration: often employed
Synset ID: 39
POS: adjective
Synonyms: सन थ (sanaatha)
Gloss: जसक क ई प लन-प षण य द खभ ल करन व ल ह (opposite of orphan)
Example statement: "सन थ ब लक क अन थ ब लक क मदद करन च हए" (children who are looked after should help the orphans) / "स धक भ क ह ज न पर अन थ नह रहत, सन थ ह ज त ह"
Transliterated and adopted by Bangla and Gujarati

Short phrase: often employed Bangla Urdu (meaning Inauspicious)

Linking synsets across languages: influence on Hindi Wordnet
Hindi wordnet has to add new synsets to accommodate language-specific concepts, e.g., in Gujarati ભ રવજપ (bhairav jap)
ID :: 103040
CAT :: NOUN
CONCEPT :: म क लए जप करत ह ए पव त पर स अपन आप क गर न (taking God's name and throwing oneself from atop a mountain to attain liberation)
EXAMPLE :: गरन र क शखर पर स य क भ रवजप करत थ एस म न ज त ह (it is thought that pilgrims used to do bhairav jap atop Girnar mountain)
SYNSET-HINDI :: भ रवजप

Multiwords

MWs can be long
Long expressions with variable relationships
Colon Cancer Tumor Suppressor Protein
Head: Protein
- Mod(protein-5, suppressor-4): protein causing suppression
- Mod(suppressor-4, tumor-3): *suppressor causing tumor; rather, suppressor/suppressing of tumor
- Mod(tumor-3, cancer-2): tumor caused by cancer
- Mod(cancer-2, colon-1): cancer of colon
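The Mod relations above form a chain in which each word modifies its right neighbour. A minimal sketch that reproduces that chain follows. The chain shape is taken from this one example and is an assumption for any other compound; `modifier_chain` is a hypothetical name. The sketch produces only the structure, not the semantic label of each link, which is the hard part the slide points at.

```python
def modifier_chain(compound):
    """Return (head, modifier) pairs for a compound, assuming each word
    modifies the word immediately to its right. Positions are 1-based."""
    words = compound.lower().split()
    return [
        (f"{words[i]}-{i + 1}", f"{words[i - 1]}-{i}")
        for i in range(len(words) - 1, 0, -1)
    ]


for head, modifier in modifier_chain("Colon Cancer Tumor Suppressor Protein"):
    print(f"Mod({head}, {modifier})")
```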

Necessary and Sufficient Conditions for MWness
Necessary condition
- Word sequence separated by space/delimiter
Sufficient conditions
- Non-compositionality of meaning
- Fixity of expression
  - In lexical items
  - In structure and order

Examples: necessary condition
Non-MWE example:
- Marathi: सरकार हक्काबक्का झाले
- Roman: sarakara HakkAbakkA JZAle
- Meaning: the government was surprised
MWE example:
- Hindi: गरीब नवाज़
- Roman: gariba navajza
- Meaning: one who nourishes the poor

Examples: sufficient conditions (non-compositionality of meaning)
- Konkani: पोटांत चाबता
  Roman: potamta cabata (literally, biting in the stomach)
  Meaning: to feel jealous
- Telugu: చెట్టు కింద ప్లీడరు
  Roman: cevttu kimda plidaru (literally, a lawyer sitting under the tree)
  Meaning: an idle person
- Bangla: মাটির মানুষ
  Roman: matira manusa
  Meaning: a simple person/son of the soil

Examples: sufficient conditions (fixity of expression), in lexical items
- Hindi: usane muje gali di (he abused me); *usane muje gali pradana ki
- Bangla: jabajjibana karadamda (life imprisonment); *jibanabhara karadamda; *jabajjibana jela
- English (1): life imprisonment; *lifelong imprisonment
- English (2): Many thanks; *Plenty thanks

Examples: sufficient conditions (fixity in structure and order)
- English example: kicked the bucket (died); the bucket was kicked (not passivizable in the sense of dying)
- Hindi example: उम्र क़ैद, umra keda (life imprisonment); umra bhara keda

Characterization of IL-MWs

Reduplicative MWs
Complete
- Onomatopoeic (gutar gutar (Hindi), meaning the sound made by pigeons)
- Non-onomatopoeic (ghar ghar (Hindi), meaning in every house)
Partial
- With echo words (pani vani (Hindi), meaning water etc.; bai tai (Bangla), meaning book etc.)
- With words of different origin (pran thawai (Manipuri), meaning soul; sena lanmi (Manipuri), meaning army): both composed of Sanskrit and Manipuri
- With meaningless words (balancing compounds) (irugu povrugu (Telugu), meaning neighbours)
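A crude string-level sketch of how the complete and echo-word cases might be told apart on romanized input. The heuristic is an assumption, not the method behind these slides: it treats an echo word as the first word with only its initial letter changed, so it misses balancing compounds such as irugu povrugu and cannot tell onomatopoeic from non-onomatopoeic reduplication. `classify_reduplication` is a hypothetical name.

```python
def classify_reduplication(expression):
    """Classify a two-word romanized expression as complete or echo-word
    reduplication using surface string comparison only."""
    words = expression.lower().split()
    if len(words) != 2:
        return "not a two-word expression"
    first, second = words
    if first == second:
        return "complete"
    # Echo word heuristic: same length, same tail, different first letter.
    if len(first) == len(second) and len(first) > 1 and first[1:] == second[1:]:
        return "partial (echo word)"
    return "none"


for example in ("ghar ghar", "pani vani", "bai tai", "ghar baadi"):
    print(example, "->", classify_reduplication(example))
```

A string test like this corresponds to the shallowest processing level; the semantic classes (synonym, antonym pairs) need lexical resources such as a wordnet.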

Non-Reduplicative MWs
- Synonyms (ghar baadi (Bangla), meaning houses/homes)
- Antonyms (jannat jahannum (Urdu), meaning heaven and hell)
- Complex predicates
  - Conjunct verbs (kitappil 'in state of lying' + pootu > kitappil pootu 'keep something pending' (Tamil))
  - Compound verbs (faao khalam (Bodo), meaning to finish acting on a task)

MW task (NLP + ML) (figure: grid of MW types against processing depth and technique)
NLP axis (processing depth): String + Morph, POS, POS + WN, POS + List, Chunking, Parsing
ML axis (technique): Rules, Statistical
MW types placed on the grid:
- Onomatopoeic reduplication (tik tik, chham chham)
- Non-onomatopoeic reduplication (ghar ghar)
- Non-reduplicative: synonym, antonym, hyponym (raat din, dhan doulat)
- Statistical collocations or fixed expressions (many thanks)
- Conjunct verb (verbalizer list), compound verb (vector verb list) (salaha dena, has uthama)
- Non-contiguous something
- Non-contiguous complex predicate
Idioms will be list: morph + look up

MWE Extraction Engine: pipeline architecture
- Developed at IIT Bombay to extract multiwords from an input corpus
- Combination of filters
- MWE list produced after passing the corpus through the pipeline

MWE Pipeline
Input Corpus (POS tagged) -> RegEx Pattern Extraction Filter -> Linguistic Filter -> Statistical Filter -> Named Entity Filter -> Human Filtering -> MWE List
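The pipeline shape can be sketched as a chain of small functions. Everything inside each stage is a placeholder assumption: the slides name the stages but not their rules, so the POS patterns, the stopword check, the frequency threshold, and all function names here are hypothetical. Pattern extraction is done by matching tag pairs rather than with an actual regular expression, and the human filtering stage is left out.

```python
from collections import Counter

POS_PATTERNS = {("NN", "NN"), ("JJ", "NN")}  # assumed candidate patterns
STOPWORDS = {"the", "a", "of"}               # assumed linguistic filter


def extract_candidates(tagged_sentences):
    """Pattern extraction: adjacent word pairs whose tags match a pattern."""
    candidates = []
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if (t1, t2) in POS_PATTERNS:
                candidates.append((w1.lower(), w2.lower()))
    return candidates


def linguistic_filter(candidates):
    """Drop candidates containing a stopword (placeholder rule)."""
    return [c for c in candidates if not STOPWORDS.intersection(c)]


def statistical_filter(candidates, min_count=2):
    """Keep candidates seen at least `min_count` times, in sorted order."""
    counts = Counter(candidates)
    return sorted(c for c, n in counts.items() if n >= min_count)


def named_entity_filter(candidates, named_entities):
    """Drop candidates that are known named entities."""
    return [c for c in candidates if c not in named_entities]


def mwe_pipeline(tagged_sentences, named_entities=frozenset(), min_count=2):
    candidates = extract_candidates(tagged_sentences)
    candidates = linguistic_filter(candidates)
    candidates = statistical_filter(candidates, min_count)
    return named_entity_filter(candidates, named_entities)


corpus = [
    [("life", "NN"), ("imprisonment", "NN"), ("is", "VB"), ("harsh", "JJ")],
    [("he", "PRP"), ("got", "VB"), ("life", "NN"), ("imprisonment", "NN")],
    [("new", "JJ"), ("delhi", "NN"), ("is", "VB"), ("big", "JJ")],
    [("new", "JJ"), ("delhi", "NN")],
]
print(mwe_pipeline(corpus, named_entities={("new", "delhi")}))
```

Because each stage takes and returns a candidate list, stages can be reordered or replaced independently, which is the main benefit of a pipeline architecture.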

Metonymy

Metonymy
- Associated with metaphors, which are epitomes of semantics
- Oxford Advanced Learner's Dictionary definition: the use of a word or phrase to mean something different from the literal meaning

Insight from Sanskritic Tradition
Power of a word: Abhidha, Lakshana, Vyanjana
Meaning of hall:
- The hall is packed (abhidha)
- The hall burst into laughing (lakshana)
- The hall is full (unsaid: and so we cannot enter) (vyanjana)
How will hall be represented in these three cases in the UW dictionary?

Metaphors in Indian Tradition
upamana and upameya
- upamana: the object compared with (the standard of comparison)
- upameya: the object being compared
Richard the Lion (Richard: upameya; Lion: upamana)

Upamana, rupak, atishayokti
- upamana: explicit comparison
  King Richard was like a lion leading the crusaders
- rupak: implicit comparison
  King Richard was a lion leading the crusaders
- atishayokti (exaggeration): upamana and upameya dropped
  King Richard led the crusaders from the front. The lion was everywhere in the battlefield.

Modern study (1956 onwards, Richards et al.)
Three constituents of metaphor:
- Vehicle (item used metaphorically)
- Tenor (the metaphorical meaning of the vehicle)
- Ground (the basis for metaphorical extension)
Example: the foot of the mountain
- Vehicle: foot
- Tenor: lower portion
- Ground: spatial parallel between the relation of the foot to the human body and the relation of the lower portion of the mountain to the rest of the mountain

Interaction of semantic fields (Haas)
- Core vs. peripheral semantic fields
- Interaction of two words in metonymic relation brings in new semantic fields with selective inclusion of features
- Leg of a table
  - Does not stretch or move
  - Does stand and support

Lakoff's (1987) contribution
- Source Domain
- Target Domain
- Mapping Relations

Mapping Relations: ontological correspondences
Anger is heat of fluid in container

      Heat (source)          Anger (target)
(i)   Container              Body
(ii)  Agitation of fluid     Agitation of mind
(iii) Limit of resistance    Limit of ability to suppress
(iv)  Explosion              Loss of control

Image Schemas
- Categories: container, contained
- Quantity
  - More is up, less is down: outputs rose dramatically; accident rates were lower
  - Linear scales and paths: Ram is by far the best performer
- Time
  - Stationary event: we are coming to exam time
  - Stationary observer: weeks rush by
- Causation: desperation drove her to extreme steps

Patterns of Metonymy
- Container for contained: The kettle boiled (water)
- Possessor for possessed/attribute: Where are you parked? (car)
- Represented entity for representative: The government will announce new targets
- Whole for part: I am going to fill up the car with petrol

Patterns of Metonymy (contd)
- Part for whole: I noticed several new faces in the class
- Place for institution: Lalbaug witnessed the largest Ganapati
Question: can you have part-part metonymy?

Feature sharing not necessary
In a restaurant:
- Jalebii ko abhi dudh chaiye ("the jalebi (a sweet) now wants milk"): no feature sharing
- The elephant now wants some coffee (a fat man desiring coffee): feature sharing

Proverbs Describes a specific event or state of affairs which is applicable metaphorically to a range of events or states of affairs provided they have the same or sufficiently similar image-schematic structure

Investigation into Sanskritic traditions
- Rich work on samasa and their types
- Concept of samarthya: when can adjacent words combine to give a single meaning?
- Example:
  krishnena bhramarena damshitavati radha rorudyamati cha (bitten by the black bee, Radha is crying)
  krishnabhramarena damshitavati radha rorudyamati cha (bitten by the black bee, Radha is crying)
- Helped by the same subanta (declension)
- But modern descendants of Sanskrit have very little agreement between the adjective and the qualified noun

Conclusions (1/2)
- To ensure coverage, UWs need to represent MWs and metaphors
- More precision, if possible, is needed in the theory of UWs
  - sensational(icl>adj,icl>good); two parents??
- Such a theory is needed, even if limited
- Can specify exceptions (like Panini)

Conclusions (2/2)
- IMP: not all words in the sentence correspond to a UW; some correspond to an attribute (e.g., she seems disturbed; seems should go as an attribute)
- Named Entities (not covered) need to be
  - Detected only once
  - Stored for the future
- Disambiguation needed (Washington voted Washington to power)
- Very closely linked with coreference resolution

Thank You
http://www.cse.iitb.ac.in/~pb
http://www.cfilt.iitb.ac.in