CFILT. Center for Indian Language Technology. Indian Institute of Technology Bombay Mumbai. Pushpak Bhattacharyya

Transcription:

NLP @ CFILT
Center for Indian Language Technology
Indian Institute of Technology Bombay, Mumbai
Pushpak Bhattacharyya
pb@cse.iitb.ac.in
www.cfilt.iitb.ac.in
March 2016

Brief Introduction to CFILT
Natural Language Processing @ IIT Bombay started in 1996.
Work started with support from United Nations University, Tokyo, for the Universal Networking Language.
The Center was established in 2000.
Many faculty members, Ph.D., M.Tech. and B.Tech. students, and linguists are associated with the lab.

Multilinguality is a key theme
5+1 language families:
  Indo-Aryan (74% of the population)
  Dravidian (24%)
  Austro-Asiatic (1.2%)
  Tibeto-Burman (0.6%)
  Andaman languages (2 families?)
  + English (West Germanic)
22 scheduled languages
11 languages with more than 25 million speakers
29 languages with more than 1 million speakers
Only India has 2 languages (+ English) in the world's 10 most spoken languages
7-8 Indian languages in the top 20 most spoken languages

Key features of Indian languages
Word order: Subject-Object-Verb
  Hindi: हम ओसाका से क्योटो तक ट्रेन में आए
  Gloss: we osaka+from kyoto+to train+in came
  English: We came from Osaka to Kyoto in a train
Morphologically rich
  Marathi: आम्ही ओसाकापासून क्योटोपर्यंत ट्रेनमध्ये आलो
  Gloss: we osaka+from kyoto+to train+in came

Key Research Areas
Machine Translation
Sentiment Analysis
Information Retrieval
Lexical Semantics
Information Extraction
Cognitive NLP

Machine Translation

MT@IITB: Overview
Translation among Indian languages:
  English to Indian languages
  Indian languages to English
  Between Indian languages
Paradigms:
  Interlingua-based MT
  Transfer-based MT
  Statistical MT

Statistical MT (1)
Phrase-based SMT: Incorporating linguistic knowledge
  Source reordering: En-IL, IL-En, various representations (IJCNLP 08)
  Factor-based: Dependency parse information for generating case markers correctly (ACL 09)
  Handling morphologically rich languages: unsupervised segmentation (ICON 14)
  Post-ordering: Mainly for IL-En translation (ICON 15)
Translation & Transliteration among related languages: Scaling Statistical MT systems to a large number of languages with high accuracy and fewer resources
  Relatedness of languages
  Comparative study of pan-Indian translation (LREC 14)
  Reuse of resources, leveraging similarities (LREC 14, ICON 14, NAACL 15)
  Unsupervised transliteration and translation (NAACL 16, under review)

Statistical MT (2)
Pivot-based SMT: Addressing language divergence issues
  Multiple assisting languages (NAACL 15)
  Addressing word order (ICON 15)
  Addressing morphological richness (ICON 15)
  Combining character-based and phrase-based SMT
MT Evaluation: Incorporate semantics and address rich morphology
  Analysis of BLEU (ICON 07)
  METEOR for Indic languages (LREC 14)
  Textual entailment for evaluation (WMT 14)
Crowdsourcing: Exploring quality control issues
  Translation & transliteration resources with crowdsourcing (LREC 14)
  Translation crowdsourcing pipeline (ACL 13)
Shata-Anuvaadak MT System: http://www.cfilt.iitb.ac.in/indic-translator/
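The slide does not say how the pivot-based systems combine their models. One common formulation in the pivot SMT literature is phrase-table triangulation, which marginalises over shared pivot phrases: p(t|s) = sum over p of p(p|s) * p(t|p). The sketch below illustrates that formulation only; it is not the CFILT implementation, and the function name, the romanised phrases and the probabilities are made up for illustration.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine source->pivot and pivot->target phrase tables.

    Each table maps a phrase to {translation: probability}.
    Returns p(t|s) = sum_p p(p|s) * p(t|p) over shared pivot phrases.
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy, made-up tables: romanised Marathi -> Hindi (pivot) -> English.
mr_to_hi = {"ghar": {"ghar": 0.9, "makaan": 0.1}}
hi_to_en = {"ghar": {"house": 0.7, "home": 0.3}, "makaan": {"house": 1.0}}

mr_to_en = triangulate(mr_to_hi, hi_to_en)
print(mr_to_en)
```

With several assisting (pivot) languages, the triangulated tables would additionally have to be combined, for example by interpolation; that step is left out here.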

UNL-based English-Hindi Translation System (JMT, 2001)
[Figure: source text is analysed into the interlingua (UNL), from which target text is generated. Languages shown: Hindi, English, French, Chinese.]

Indian Language MT Project (ILMT)
Translation between Indian languages
Transfer-based MT system
Every language vertical develops analyzers and synthesizers
Analysis up to shallow parsing
Morphological analysis has an important role
Language pairs:
  Tamil to Hindi
  Telugu to Hindi
  Marathi to Hindi
  Tamil to Telugu
  Telugu to Tamil
  Urdu to Hindi
  Hindi to Urdu
  Punjabi to Hindi
  Hindi to Punjabi
Sampark MT system: http://sampark.iiit.ac.in/sampark/web/index.php/content

Sampark Architecture

Lexical Semantics

IndoWordNet (LREC 2010, GWC 2002, GWC 2010)

Activities related to IndoWordNet

Word Sense Disambiguation (IJCNLP 2011, ACL 2013, NAACL 2015)

(ACL 2013, GWC 2014)

Enriching & Creating NLP resources using Deep Learning
Enriching existing resources
  Automatic linking of synsets
    Within a language-specific wordnet
    Cross-lingual
  Refining pretrained vector repositories
    Detection and removal of nonspecific vectors
    Estimating task-specific approximate representations for out-of-vocabulary words
Creating new resources
  Creating vector representations of complex lexical entities such as:
    Synsets
    Phrases
    Sentences
    Question/Answer pairs
  Investigating compositional and noncompositional methods of creating vectors

Information Retrieval

Sandhan (sandhansearch.in)
[Figure: cross-lingual search pipeline. A Hindi query goes to the CLIR engine, which uses language resources and a target-language index (in English) built from crawled and indexed web pages. The engine returns a ranked list of results, with the target information in English and result snippets in Hindi. The example query and snippet concern travelling to Tirupati by train.]
Supports 9 languages: Hindi, Marathi, Punjabi, Oriya, Bengali, Tamil, Telugu, Gujarati and Assamese

Cross Lingual Search for Indian Languages
Query Expansion
  Multilingual Pseudo-relevance feedback (ACL 2010, SIGIR 2010)
  Structure Cognizant PRF (IJCNLP 2013)
Query Transliteration
  Character Sequence Modelling (TALIP 2010)
  Using orthographic syllables of Indic scripts
Crawling
  Conservative focussed crawling under resource constraints (ICON 2015)
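For readers unfamiliar with pseudo-relevance feedback (PRF), the basic monolingual idea is to treat the top-ranked documents of a first retrieval pass as relevant and add their most frequent terms to the query. The sketch below shows only that baseline, under simplifying assumptions (whitespace tokenisation, raw term counts); it is not the multilingual or structure-cognizant variant cited on the slide, and all names and documents are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=1,
                 stopwords=frozenset()):
    """Basic pseudo-relevance feedback.

    Assumes the first top_k documents in ranked_docs are relevant and
    appends their n_new_terms most frequent terms that are neither
    query terms nor stopwords.
    """
    seen = set(query_terms)
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        for term in doc.lower().split():
            if term not in seen and term not in stopwords:
                counts[term] += 1
    new_terms = [term for term, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms


# Toy ranked list from a hypothetical first retrieval pass.
docs = [
    "trains to tirupati run daily from mumbai",
    "tirupati trains and temple timings",
    "cricket scores today",
]
expanded = expand_query(["tirupati", "travel"], docs,
                        stopwords=frozenset({"to", "from", "and"}))
print(expanded)
```

Only the top two documents feed the expansion, so the unrelated third document cannot introduce noise; in practice `top_k` and the term-weighting scheme are tuning choices.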

Information Extraction

Indian language IE tools: resource constraints, multilinguality
Relation Extraction
POS, NER, Chunkers (ACL 2006, COLING 2010)
Co-reference resolution
Making sense of data
Textual Entailment (ICON 2013)
Noun Compound Interpretation (RANLP 2015)

Multilingual Named Entity Recognition Using Deep Learning (CICLING 2016)
Deep Learning techniques do feature learning.
Word embeddings combined with Deep Learning have given results comparable to existing state-of-the-art feature-engineered systems.
Use Deep Learning to learn language-independent features.
Named Entities should have a common representation across languages.

(ICON 2013, CICLING 2016)

Example relations: medicine_for(otic drops, ear discomfort), medicine_for(norco, pain)
Exploring rich feature design using syntactic and dependency information
Exploring representation learning with convolutional neural networks

Noun Compound Interpretation
Noun compound: a sequence of two or more nouns that act as a single noun
Interpretation: identifying the relations between the nouns in a noun compound
Examples: apple pie, student protest, colon cancer, colon cancer symptoms, etc.
Motivation (translation):
  ENG: Honey Singh became the latest victim of celebrity death hoax.
  HIN: हनी सिंह प्रसिद्ध व्यक्ति की मृत्यु के बारे में अफवाह का ताजा शिकार बने
Problem:
  Labeling: apple pie: Made-Of
  Paraphrasing: apple pie: "a pie made of apple", or "a pie with apple flavor"
Given a noun+noun compound, assign an abstract label (the relationship between the two nouns).
The set of abstract relations is defined by Tratz and Hovy (2010).
Challenges: highly productive, no clue from the context, and pragmatic influence

Sentiment Analysis

Detecting Granularity in Words: for the Purpose of Sentiment Analysis
Words have many hidden properties, beyond being positive or negative, that can enrich existing sentiment analysis systems.
The goal is to identify these properties in polar words for different applications in sentiment analysis.
Properties of a polar word:
  Domain dependence for polarity
  Domain dependence for significance
  Intensity within a semantic category
  Intensity within a sense
(IJCNLP 2013, EMNLP 2015)
Applications in SA:
  In-domain SA
  Cross-domain SA
  Star-rating prediction
  Intensity in SentiWordNet

Computational Sarcasm (ACL 2015, WASSA 2015, WISDOM 2015)
Definition: Computational approaches to sarcasm
Examples: "This phone is awesome. Use it as a paperweight." "I loooovvvee Nicki Minaj!"
Sarcasm Generation
  An open-source chatbot that responds sarcastically
  An emotion tracking engine
Sarcasm Detection
  Detection using incongruity within text
  Detection using author's historical text
Sarcasm Studies in Humans
  Sentiment understanding using eye-tracking
  Sarcasm understanding using eye-tracking

Emotion Analysis from Text
Hierarchical classification for emotion analysis
  Leverage the hierarchy of relations between emotion labels to improve emotion analysis using Hierarchical Naive Bayes
Emotion Analysis in Narratives and Discourses
  Model as a sequence labelling problem

Sentiment Analysis and Deep Learning
Models explored for sentiment analysis:
  Convolutional Neural Networks (CNN)
  Long Short-term Memory (LSTM) networks
For sentiment classification tasks like:
  Positive/negative/neutral sentiment detection
  Aspect classification
On different types of data like:
  Movie reviews in languages like English and Hindi
  Social media texts like tweets

Cognitive NLP

Cognitive NLP http://www.cfilt.iitb.ac.in/cognitive-nlp/

Some problems being investigated

Education Technology

Automatic Essay Grading (QATS 2016)
Score various aspects of the essay, such as language complexity, word usage, organization and coherence, and combine them into an overall score that reflects the quality of the essay.
Text complexity calculation and its effect on the quality of the essay.
Extraction of words and phrases, and estimation of their contribution to the quality of an essay.
Eye-tracking to evaluate the organization, coherence and cohesion of the essay.
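The slide does not say how the per-aspect scores are combined into the overall score. One simple possibility, shown purely as an assumption, is a weighted average; the aspect names follow the slide, while the function name, the scores and the weights are hypothetical.

```python
def overall_essay_score(aspect_scores, weights=None):
    """Weighted average of per-aspect essay scores.

    aspect_scores: {aspect name: score}, all on the same scale.
    weights: optional {aspect name: non-negative weight}; equal
    weights are used when omitted.
    """
    if not aspect_scores:
        raise ValueError("at least one aspect score is required")
    if weights is None:
        weights = {aspect: 1.0 for aspect in aspect_scores}
    total_weight = sum(weights[aspect] for aspect in aspect_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(score * weights[aspect]
                       for aspect, score in aspect_scores.items())
    return weighted_sum / total_weight


# Made-up scores on a 1-5 scale.
scores = {"language_complexity": 3, "word_usage": 4,
          "organization": 5, "coherence": 4}
equal = overall_essay_score(scores)
emphasis = overall_essay_score(
    scores,
    weights={"language_complexity": 1, "word_usage": 1,
             "organization": 2, "coherence": 1},
)
print(equal, emphasis)
```

A learned model (for example, a regression over the aspect scores) could replace the fixed weights; the weighted average is only the simplest way to make the combination step concrete.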

Automated Grammatical Error Correction
Addressing class imbalance in grammatical error correction (ICON 2015)
Adapting methods in Machine Translation to grammar correction (CoNLL 2014)
Addressing Subject-Verb Agreement errors (CoNLL 2013)

Thank You!
Resources: http://www.cfilt.iitb.ac.in
Publications: http://www.cse.iitb.ac.in/~pb