Transliterated Search BITS PILANI HYDERABAD CAMPUS TEAM [ABHINAV MUKHERJEE, ANIRUDH RAVI, KAUSTAV DATTA]


Subtask 1: Language identification and back transliteration
A few challenges were faced:
- Since the data given to us came from user chats, it contained varied spellings and grammatical errors.
- Forward transliteration produces many words that can be classified as both E (English) and L (Language).

Training Set
- The data sets used were the ones provided, along with an external data set consisting of 5,000 frequently used English words.
- Character n-grams were extracted and used as features in the training data set.
- The training data set was constructed in Sparse ARFF format, e.g.:
  {45 1, 54 1, 81 1, 86 1, 1653 1, 1873 1, 2634 1, 2755 1, 3377 1, 4039 1, 9394 1, 13316 1, 19162 1, 19550 english}
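
As an illustration of this feature construction, the following sketch builds a character n-gram vocabulary and emits one sparse ARFF-style instance per token. The n-gram sizes (2 and 3) and the attribute ordering are assumptions, not the exact settings used by the system.

```python
# Minimal sketch, assuming 2- and 3-character n-grams; the real pipeline's
# n-gram sizes and attribute ordering are not specified in the slides.
def char_ngrams(word, n_values=(2, 3)):
    """Return all character n-grams of `word` for the given sizes."""
    return [word[i:i + n] for n in n_values for i in range(len(word) - n + 1)]

def to_sparse_arff(word, label, vocab):
    """Encode a token as a sparse ARFF instance: '{index 1, ..., class label}'."""
    indices = sorted({vocab[g] for g in char_ngrams(word) if g in vocab})
    cells = ", ".join(f"{i} 1" for i in indices)
    return "{" + cells + f", {len(vocab)} {label}" + "}"

# Toy vocabulary; in practice it is built from the full training corpus.
train_words = ["hello", "kya", "haal", "hai"]
vocab = {g: i for i, g in enumerate(sorted({g for w in train_words
                                            for g in char_ngrams(w)}))}

print(to_sparse_arff("hello", "english", vocab))
```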

Language classification
- Weka was used for machine learning.
- The dataset was trained using the Support Vector Machine algorithm with a linear kernel function.
- Classifier performance was evaluated on the training set by cross-validation and then optimised.
- The model was then tested on the test data provided.
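
The actual training was done in Weka; purely to illustrate the same flow (a linear-kernel SVM over character n-gram features, evaluated by cross-validation), an analogous scikit-learn sketch is shown below. The toy data and the n-gram range are assumptions.

```python
# Analogous sketch in scikit-learn; the original system used Weka's SVM with a
# linear kernel, so this only mirrors the training/evaluation flow, not the tool.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy tokens labelled English (E) or transliterated Hindi (L) -- assumptions.
tokens = ["hello", "friend", "today", "please", "kya", "haal", "kal", "milte"]
labels = ["E", "E", "E", "E", "L", "L", "L", "L"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4), binary=True),
    LinearSVC(),  # linear kernel, as in the Weka setup described above
)

# Cross-validation on the training set, mirroring the evaluation step above.
scores = cross_val_score(model, tokens, labels, cv=2)
print("CV accuracy:", scores.mean())
```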

Context Consideration
- The forward-transliterated forms of many words have an ambiguous classification, e.g. to (तो), me (में), b (भी), use (उसे).
- For these words in the training set, we built a Naïve Bayes classifier that considers the language of the surrounding words.
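
A minimal sketch of how the surrounding-word context can feed a Naïve Bayes decision for such ambiguous tokens is given below; the feature choice (previous and next labels only) and the add-one smoothing are assumptions, not the authors' exact model.

```python
# Minimal sketch: predict the language of an ambiguous token from the labels
# of its neighbours with Naive Bayes counts. Feature choice and smoothing are
# assumptions, not the exact model described above.
import math
from collections import Counter, defaultdict

def train(sequences):
    """sequences: list of [(token, label), ...] annotated chat utterances."""
    prior = Counter()
    cond = defaultdict(Counter)  # cond[label][(position, neighbour_label)]
    for seq in sequences:
        for i, (_, label) in enumerate(seq):
            prior[label] += 1
            if i > 0:
                cond[label][("prev", seq[i - 1][1])] += 1
            if i + 1 < len(seq):
                cond[label][("next", seq[i + 1][1])] += 1
    return prior, cond

def predict(prior, cond, prev_label, next_label):
    """Most probable label for an ambiguous token given neighbouring labels."""
    total = sum(prior.values())
    def score(label):
        s = math.log(prior[label] / total)
        for feat in (("prev", prev_label), ("next", next_label)):
            s += math.log((cond[label][feat] + 1) / (prior[label] + 2))
        return s
    return max(prior, key=score)

prior, cond = train([[("kya", "L"), ("use", "L"), ("pata", "L")],
                     [("please", "E"), ("use", "E"), ("it", "E")]])
print(predict(prior, cond, "E", "E"))  # -> E
```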

Results

Metric                                          Run 1       Run 2       Max Score   Med Score
EQMF All                                        0.005       0.004       0.005       0.001
EQMF without NE                                 0.010       0.009       0.010       0.003
EQMF without Mix                                0.005       0.004       0.005       0.001
EQMF without Mix and NE                         0.010       0.009       0.010       0.003
EQMF All (No transliteration)                   0.205       0.177       0.276       0.194
EQMF without NE (No transliteration)            0.285       0.257       0.427       0.285
EQMF without Mix (No transliteration)           0.205       0.177       0.276       0.194
EQMF without Mix and NE (No transliteration)    0.285       0.257       0.427       0.285
ETPM                                            1923/2156   1876/2109   NA          NA
H-Precision                                     0.879       0.863       0.942       0.853
H-Recall                                        0.794       0.781       0.917       0.861
H-F Score                                       0.835       0.820       0.911       0.810
E-Precision                                     0.780       0.767       0.895       0.767
E-Recall                                        0.881       0.865       0.987       0.881
E-F Score                                       0.827       0.813       0.901       0.797
Transliteration Precision                       0.156       0.152       0.200       0.109
Transliteration Recall                          0.756       0.738       0.760       0.6335
Transliteration F Score                         0.258       0.252       0.304       0.1835
Labelling Accuracy                              0.838       0.826       0.886       0.792

Subtask 2
There were three main problems to tackle:
- Mixed-script documents and queries
- Spelling variations, e.g. "raja ki aaegi baraat" vs. "raaja ki aaegi baaraat"
- Breaking and joining of words, e.g. "lejaenge lejaenge dilwale dulhaniya lejaenge" vs. "le jaenge le jaenge dilwale dulhaniya le jaenge"

Mixed Script Information Retrieval
There are two possibilities:
- Query expansion to both scripts, which would require forward and backward transliteration of the query words.
- Converting the documents and queries to a single script, which would require backward transliteration of both queries and documents (if the native script is chosen).

Mixed Script Information Retrieval
- We chose the second option, as backward transliteration is more accurate than forward transliteration.
- We used Google's online transliteration tool to perform back transliteration, which returns the five nearest Hindi words.

Spelling variations
Special rules were implemented to normalise the spelling variations in the corpus:
- The letter ह is trimmed from the end of words ending with that letter, as it is a common source of spelling variation. For example, म ह and म, न यह चाँद होगा and न ये चाँद होगा.
- Words ending with ये, यै, यो, or यौ are often written with ए, ऐ, ओ, or औ instead. For example, आइये can be written as आइए.
- On many occasions, when the vowels इ or ई (or a combination of both) occur on consecutive consonants, the later vowel is ignored. For example, र श न and र न.
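
A hedged sketch of how the first two rules could be applied as simple string rewrites is given below; the exact rule set, the rewrite direction, and the ordering used in the system may differ.

```python
# Sketch of the first two normalisation rules only; the rule set, rewrite
# direction and ordering are assumptions and may differ from the system's.
import re

YE_MAP = {"ये": "ए", "यै": "ऐ", "यो": "ओ", "यौ": "औ"}

def normalise(word):
    # Rule 1: trim a trailing ह, a frequent source of spelling variation.
    word = re.sub("ह$", "", word)
    # Rule 2: rewrite a word-final ये/यै/यो/यौ as the bare vowel ए/ऐ/ओ/औ.
    for src, dst in YE_MAP.items():
        if word.endswith(src):
            word = word[: -len(src)] + dst
            break
    return word

print(normalise("आइये"))  # -> आइए
```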

Sub-word indexing
- Hindi words are broken and joined mostly along vowels (स्वर). This operation does not affect the consonants (व्यंजन) that make up the words.
- So the consonant pattern of each word in the document is concatenated to obtain the consonant pattern of the whole document. The document is indexed along with character n-grams (n = 3, 4, 5, 6) of this consonant pattern.
- "lejayenge lejayenge dilwale dulhaniya lejayenge" and "le jayenge le jayenge dilwale dulhaniya le jayenge" both give the same base, which is: ल-ज-ग-ल-ज-ग-द-ल-व-ल-द-ल-ह-न-य-ल-ज-ग
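
A minimal sketch of this sub-word indexing step is shown below; the set of characters stripped (vowel signs, independent vowels and spaces) and the handling of semivowels such as य are assumptions, so the resulting base may differ slightly from the one shown above.

```python
# Minimal sketch of the sub-word indexing step, assuming that stripping matras,
# independent vowels and spaces is enough; the treatment of semivowels such as
# य may differ from the system described above.
DEVANAGARI_VOWEL_SIGNS = set("ािीुूृेैोौंःँ़्")
INDEPENDENT_VOWELS = set("अआइईउऊऋएऐओऔ")

def consonant_pattern(text):
    """Keep only the consonant letters of a Devanagari string."""
    return "".join(ch for ch in text
                   if ch not in DEVANAGARI_VOWEL_SIGNS
                   and ch not in INDEPENDENT_VOWELS
                   and not ch.isspace())

def index_ngrams(pattern, sizes=(3, 4, 5, 6)):
    """Character n-grams of the consonant pattern, used as index terms."""
    return [pattern[i:i + n] for n in sizes for i in range(len(pattern) - n + 1)]

joined = "लेजायेंगे लेजायेंगे दिलवाले दुल्हनिया लेजायेंगे"
split = "ले जायेंगे ले जायेंगे दिलवाले दुल्हनिया ले जायेंगे"
assert consonant_pattern(joined) == consonant_pattern(split)
print(index_ngrams(consonant_pattern(joined))[:5])
```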

Final query expansion
- Consonant-pattern indexing made the system resistant to spelling variations in vowels. To add resistance to spelling variations in consonants, we expand the query with sub-word patterns in which the consonants are varied.
- The following mapping is used to vary the consonants, as these pairs were seen as the major cause of spelling variations:
  क ↔ ख, ग ↔ घ, च ↔ छ, ज ↔ झ, त ↔ ट, ठ ↔ थ, द ↔ ध, न ↔ ण, ब ↔ भ
- This is done within a certain limit, so that only a few permutations of consonant swapping are taken into consideration.
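
A hedged sketch of this expansion is given below; the maximum number of simultaneous swaps (here 2) is an assumption, since the slides only say that the permutations were limited.

```python
# Sketch of the query-expansion step: generate variants of the query's
# consonant pattern by swapping confusable consonant pairs, up to a small
# limit on simultaneous swaps. The limit value (2) is an assumption.
from itertools import combinations

SWAPS = {"क": "ख", "ख": "क", "ग": "घ", "घ": "ग", "च": "छ", "छ": "च",
         "ज": "झ", "झ": "ज", "त": "ट", "ट": "त", "ठ": "थ", "थ": "ठ",
         "द": "ध", "ध": "द", "न": "ण", "ण": "न", "ब": "भ", "भ": "ब"}

def expand(pattern, max_swaps=2):
    """Return the pattern plus variants with up to `max_swaps` consonants swapped."""
    positions = [i for i, ch in enumerate(pattern) if ch in SWAPS]
    variants = {pattern}
    for k in range(1, max_swaps + 1):
        for combo in combinations(positions, k):
            chars = list(pattern)
            for i in combo:
                chars[i] = SWAPS[chars[i]]
            variants.add("".join(chars))
    return variants

print(expand("दलवल"))  # variants of a toy consonant pattern, e.g. for "dilwale"
```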

Results of Subtask 2

Run    NDCG@1   NDCG@5   MAP      MRR      Recall
Run1   0.7500   0.7817   0.6263   0.7929   0.6818
Run2   0.7708   0.7954   0.6421   0.8171   0.6918