HANDLING AMBIGUITIES AND UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING ANAPHORA RESOLUTION

Deepti Chopra and Dr. G.N. Purohit
Department of Computer Engineering, Banasthali Vidyapith, Rajasthan, INDIA

DOI:10.5121/ijcsa.2013.3504

ABSTRACT

Anaphora Resolution is a method of determining what a particular noun phrase or pronoun refers to at a given point in a text. In this paper, we discuss how Anaphora Resolution is useful in performing computational linguistic tasks in various natural languages, including the Indian languages. We also discuss how Anaphora Resolution helps in handling unknown words in Named Entity Recognition.

KEYWORDS

Anaphora Resolution, Named Entities, Performance Metrics

1. INTRODUCTION

Anaphora is a form of presupposition or cohesion in which an expression points back to an earlier element of the text. The expression that points back is called the anaphor, and the entity to which it refers is known as the antecedent. Anaphora Resolution is therefore the task of finding the antecedent of an anaphor. Its applications include Named Entity Recognition, Machine Translation, Information Retrieval, Question Answering and Information Extraction.

Named Entity Recognition (NER) is one of the subtasks of Information Extraction, in which named entities or proper nouns are located in a large text and classified into various categories. Transliteration plays a crucial role in resolving the problem of unknown words in NER. If a word has been learned from training examples in one language, it can also be transliterated into other languages, e.g.

नुसरत/PER सेब खाती है (Nusrat eats an apple)
दीपिका/PER कानपुर/CITY में रहती है (Deepika lives in Kanpur)
दीपा/PER नागपुर/CITY में रहती है (Deepa lives in Nagpur)

The Named Entities in these Hindi sentences are transliterated into English as nusrat, deepika, Kanpur, deepa and nagpur.

2. RELATED WORK

Named Entity Recognition (NER) is the process in which Named Entities or proper nouns are identified and then categorized into Named Entity classes, e.g. name of Person, Sport, River, Country, State, City, Organization, etc. Unknown words in NER can be handled in several ways. In one approach, an unknown word is allotted a specific tag, e.g. an unk tag [1][2]. Capitalization information cannot be relied on to identify the POS tag, since unknown words can also be capitalized. Moreover, capitalization is a usable cue for Named Entities only in English [3][4].

3. OUR APPROACH

In our approach, the OTHER tag means "not a Named Entity". Figure 1 depicts our approach during training in NER. We process each token of the training data and check its tag. If the token is not tagged OTHER, we transliterate it into different languages and save the transliterated Named Entities, along with their tags, in a file. If the token is tagged OTHER, we simply move on to the next token in the sentence. This procedure continues until all the training sentences have been processed.

Figure 1. Steps taken while performing training in Named Entity Recognition. [Flowchart: START; TRAINING DATA (process each token); is the token's tag not OTHER? If yes, transliterate the token and attach its tag; if no, process the next token; end of sentence reached? EXIT.]
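The training procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `token/TAG` input format follows the paper's examples, while `toy_transliterate`, its lookup table, the language codes `hi` and `ur`, and the tab-separated file layout are hypothetical stand-ins for a real sound-mapping transliterator and for the paper's unspecified file format.

```python
from typing import Callable, Dict, Iterable, List, Tuple

OTHER = "OTHER"

# Hypothetical stand-in for a real transliterator: a tiny hand-written table
# keyed by (token, target language code).
TOY_TRANSLITERATIONS: Dict[Tuple[str, str], str] = {
    ("Ram", "hi"): "राम",
    ("Ram", "ur"): "رام",
}


def toy_transliterate(token: str, language: str) -> str:
    """Look the token up in the toy table; fall back to the token itself."""
    return TOY_TRANSLITERATIONS.get((token, language), token)


def parse_tagged(sentence: str) -> List[Tuple[str, str]]:
    """Split 'Ram/PER plays/other' into [('Ram', 'PER'), ('plays', 'OTHER')]."""
    pairs = []
    for item in sentence.split():
        token, separator, tag = item.rpartition("/")
        if not separator:  # no tag attached: treat the item as OTHER
            token, tag = item, OTHER
        pairs.append((token, tag.upper()))
    return pairs


def build_transliteration_table(
    sentences: Iterable[str],
    languages: Iterable[str],
    transliterate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Map every transliterated Named Entity to the tag seen in training."""
    languages = list(languages)
    table: Dict[str, str] = {}
    for sentence in sentences:
        for token, tag in parse_tagged(sentence):
            if tag == OTHER:
                continue  # not a Named Entity: move on to the next token
            for language in languages:
                table[transliterate(token, language)] = tag
    return table


def save_table(table: Dict[str, str], path: str) -> None:
    """Write one 'entity<TAB>tag' pair per line (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for entity, tag in table.items():
            handle.write(f"{entity}\t{tag}\n")


table = build_transliteration_table(
    ["Ram/PER plays/other cricket/other"], ["hi", "ur"], toy_transliterate
)
print(table)
```

Only tokens whose tag differs from OTHER reach the transliterator, which mirrors the branch in Figure 1. A dictionary keyed by the transliterated form is used here because the testing stage only needs to look an unknown token up and read off its tag.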

If we have the sound mapping from one language to another, the transliteration task is straightforward. For example, if Ram/PER plays/other cricket/other is a training sentence, then Ram is transliterated into other languages, since it is a Named Entity tagged PER rather than OTHER. plays and cricket are not Named Entities, since they are tagged OTHER, so they are not transliterated. Our approach is therefore language independent: it can recognize Named Entities in text written in any natural language, provided that the NER system has been trained in some language and that sound mappings exist from which the transliterated Named Entities in the other languages can be obtained.

Figure 2 displays the steps taken while performing testing in a NER system. All the tokens of a testing sentence are processed one at a time. If the current token is not an unknown entity, a tag is assigned to it, using the Viterbi algorithm if a Hidden Markov Model is used to perform NER. If the current token is an unknown entity, the transliteration file generated during training is searched. If the unknown entity is found in the transliteration file, the corresponding Named Entity tag is attached to it; otherwise the OTHER tag is allotted to it. This procedure continues until all the testing sentences have been processed.

[Flowchart: TESTING SENTENCE (process each token); is the token an unknown entity? If no, attach tag to token; if yes, search the transliteration file; does the entity exist there? If yes, attach the Named Entity tag; if no, attach the OTHER tag; end of sentence? EXIT.]

Figure 2. Steps taken while performing testing in Named Entity Recognition.

Consider the testing sentence राम क्रिकेट खेलता है ("Ram plays cricket") and the training sentence Ram/PER plays/other cricket/sport. During training, the Named Entities Ram and cricket are transliterated, and the transliterated Named Entities are stored in a transliteration file along with their tags. During testing, राम and क्रिकेट are therefore given the PER and SPORT tags respectively. The tokens खेलता and है are unknown entities, since they were neither seen during training nor present in the transliteration file, so they receive the OTHER tag.

3. RESULT

Handling unknown words is a crucial issue in Named Entity Recognition. We have developed code that generates a list of the Named Entities used in the training file. These Named Entities are then transliterated into different languages and stored in a separate transliteration file for each language. TABLE 1 displays the transliterations of Named Entities in different languages obtained using our transliteration code.

TABLE 1 Transliteration of Named Entities in different languages

ENGLISH | HINDI | FRENCH | URDU | TELUGU | BENGALI
Ram | राम | Bélier | رام | మేషరాశి | গড্ডল
Cricket | क्रिकेट | Cricket | کرکٹ | క్రికెట్ | ক্রিকেট খেলা
India | भारत | Inde | بھارت | భారతదేశం | ভারত
Ganga | गंगा | Ganga | گنگا | గంగా | গঙ্গা

For a given testing sentence, we first check whether all its tokens exist in the word list created during training. If every word has been seen in training, the Viterbi algorithm can generate the optimal state sequence. If one of the words is unknown, i.e. it does not exist in the training file, we check the transliteration file; if the word exists there, the correct tag can be allotted to it, and the unknown-word problem is resolved. For example, if training was performed in English and a Hindi word is given during testing, that word is treated as an unknown word on which no training has yet been performed.
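The lookup order described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `tag_with_fallback` and `stub_tagger` are hypothetical names, and `stub_tagger` merely stands in for Viterbi decoding over a trained Hidden Markov Model. How the decoder itself should treat a sentence containing unknown tokens is not specified in the text, so the sketch simply overrides the decoder's tag for every out-of-vocabulary token.

```python
from typing import Callable, Dict, List, Set, Tuple

OTHER = "OTHER"


def tag_with_fallback(
    tokens: List[str],
    vocabulary: Set[str],
    sequence_tagger: Callable[[List[str]], List[str]],
    transliteration_table: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Tag known tokens with the sequence tagger; look unknown ones up."""
    tags = sequence_tagger(tokens)
    result = []
    for token, tag in zip(tokens, tags):
        if token not in vocabulary:
            # Unknown word: use the tag stored during training if the
            # transliterated form is present, otherwise fall back to OTHER.
            tag = transliteration_table.get(token, OTHER)
        result.append((token, tag))
    return result


def stub_tagger(tokens: List[str]) -> List[str]:
    """Placeholder for Viterbi decoding: tags every token OTHER."""
    return [OTHER for _ in tokens]


# Training was in English, so every Hindi token below is out of vocabulary.
vocabulary = {"Ram", "plays", "cricket"}
transliteration_table = {"राम": "PER", "क्रिकेट": "SPORT"}

tagged = tag_with_fallback(
    "राम क्रिकेट खेलता है".split(), vocabulary, stub_tagger, transliteration_table
)
print(tagged)
```

With these toy inputs the two transliterated Named Entities receive PER and SPORT, and the remaining two tokens receive OTHER, matching the worked example in the text.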
We can therefore generate the transliterated list of Named Entities from English to other languages and so handle the problem of unknown words in NER. TABLE 2 displays the results obtained while performing NER on a multilingual corpus. The training and testing files include sentences from the following languages: English, Hindi, Marathi, Punjabi and Urdu. We performed training on 142 tokens and testing on 212 tokens. The Recall, Precision and F-Measure obtained are 95.80%, 96.3% and 96.04% respectively.

TABLE 2 Unknown words handling in NER having 28 sentences in Training file

Tag set | {Person, City, Sport}
Number of training sentences | 28 (142 tokens)
Number of testing sentences | 41 (212 tokens)
Number of sentences tagged wrongly | 0

Frequency of tags | PER | City | Sport | Not-a-name
According to human (correct answer) | 42 | 6 | 1 | 163
Observed, system provided (experimental results) | 41 | 3 | 1 | 167

Recall (R) = Correct / (Correct + Incorrect + Spurious) = 209 / (209 + 8 + 1) = 95.80%
Precision (P) = Correct / (Correct + Incorrect + Missing) = 209 / (209 + 8 + 0) = 96.3%
F-Measure = (2 * P * R) / (P + R) = (2 * 96.3 * 95.8) / (96.3 + 95.8) = 96.04%

TABLE 3 lists the Named Entity tags on which transliteration into different languages is performed. The transliterated Named Entities are stored in a file along with their tags, which can then be used to resolve the problem of unknown entities in NER.

TABLE 3 Named Entities transliterated into different languages

S.No. | TAGS
1 | PER (Name of Person)
2 | OZ (Name of Organization)
3 | CN (Name of Country)
4 | MAG (Name of Magazine)
5 | WEEK (Name of Week)
6 | LOC (Name of Location)
7 | PC (Name of Personal Computer)
8 | MONTH (Name of Month)
9 | CITY (Name of City)
10 | ST (Name of State)
11 | SPORT (Name of Sport)
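The Recall, Precision and F-Measure figures quoted with TABLE 2 can be recomputed from the four counts with a short sketch. The function name `ner_scores` is hypothetical, and the formulas are exactly the ones stated in this paper. Note that the MUC scoring convention assigns the Spurious count to Precision and the Missing count to Recall, the reverse of the assignment used here; the sketch deliberately follows the paper's definitions.

```python
from typing import Tuple


def ner_scores(
    correct: int, incorrect: int, missing: int, spurious: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) using the paper's formulas."""
    precision = correct / (correct + incorrect + missing)
    recall = correct / (correct + incorrect + spurious)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts quoted with TABLE 2: 209 correct, 8 incorrect, 0 missing, 1 spurious.
p, r, f = ner_scores(correct=209, incorrect=8, missing=0, spurious=1)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")
```

Evaluated without intermediate rounding, the quotients come to roughly 96.31%, 95.87% and 96.09%, so they differ in the second decimal place from the rounded values quoted in the text.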

4. PERFORMANCE METRICS

Performance Metrics are measures used to estimate the performance of a NER system. They can be calculated in terms of three parameters: Precision, Recall and F-Measure. The output of a NER system is termed the response, and the human interpretation is termed the answer key [5]. Consider the following terms [6]:

1. Correct: the response is the same as the answer key.
2. Incorrect: the response is not the same as the answer key.
3. Missing: the answer key is tagged but the response is not.
4. Spurious: the response is tagged but the answer key is not.

Hence, we define Precision, Recall and F-Measure as follows [7][8][9]:

Precision (P): Correct / (Correct + Incorrect + Missing)
Recall (R): Correct / (Correct + Incorrect + Spurious)
F-Measure: (2 * P * R) / (P + R)

5. CONCLUSION

In this paper, we have discussed transliteration. We have presented the results obtained while performing Named Entity Recognition on a multilingual corpus and handling unknown words using the transliteration approach. We have also discussed Performance Metrics, which are an important means of judging the performance of a Named Entity Recognition system.

ACKNOWLEDGEMENT

I would like to thank all those who helped me in accomplishing this task.

REFERENCES

[1] http://www.cse.iitb.ac.in/~cs626-460-2012/seminar_ppts/ner.pdf
[2] http://staff.um.edu.mt/cabe2/publications/nfhmms.pdf
[3] http://acl.ldc.upenn.edu/acl2002/main/pdfs/main036.pdf
[4] http://www.sigkdd.org/explorations/issues/7-1-2005-06/4-fu.pdf
[5] Shilpi Srivastava, Mukund Sanglikar & D.C. Kothari. Named Entity Recognition System for Hindi Language: A Hybrid Approach. International Journal of Computational Linguistics (IJCL), Volume (2) : Issue (1) : 2011. Available at http://cscjournals.org/csc/manuscript/journals/ijcl/volume2/issue1/ijcl-19.pdf
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan. A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu. IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay. Language Independent Named Entity Recognition in Indian Languages. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33-40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/ijcnlp-2008-ekbal.pdf
[8] Darvinder Kaur, Vishal Gupta. A survey of Named Entity Recognition in English and other Indian Languages. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010.

[9] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju, K.S.M.V. Kumar. Named Entity Recognition for Telugu Using Maximum Entropy Model.

About Authors

Deepti Chopra is working as an Assistant Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She received her B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011 and her M.Tech in Computer Science and Engineering from Banasthali University, Rajasthan in 2013. Her research interests include Artificial Intelligence, Natural Language Processing and Information Retrieval. She has published many papers in international journals and conferences.

Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali University (Rajasthan). Before joining Banasthali University, he was Professor and Head of the Department of Mathematics, University of Rajasthan, Jaipur. He has been chief editor of a research journal and a regular reviewer for many journals. His present interests are in O.R., Discrete Mathematics and Communication Networks. He has published around 40 research papers in various journals.