Named Entity Recognition Using Deep Learning


Named Entity Recognition Using Deep Learning
Rudra Murthy
Center for Indian Language Technology, Indian Institute of Technology Bombay
rudra@cse.iitb.ac.in
https://www.cse.iitb.ac.in/~rudra
Deep Learning Tutorial, ICON 2017, Kolkata, 21st December 2017

Outline
- What is NER?
- Traditional ML Approaches
- Motivating Deep Learning
- Deep Learning Solutions
- Summary

Introduction
What is Named Entity Recognition? The task of identifying person names, location names, organization names, and other miscellaneous entities in a given piece of text.
Example: In "Malinga omitted from squad for Pakistan ODIs", Malinga will be tagged as a Person entity and Pakistan as a Location entity.

You thought NER was trivial

Challenges
- Named entities are ambiguous: "I went to Washington" (location) vs. "I met Washington" (person).
- Named entities form an open class: new names such as Box8 and Alphabet keep appearing.

Challenges
List of unique/crazy person names: Ahmiracle, Anna, I'munique, Baby Girl, Abcde, North West, Melanomia, Heaven Lee, Tu Morrow, Moxie Crimefighter, Abstinence, Apple, Facebook, Danger, Colon, Mercury Constellation Starcruiser, Pilot Inspektor, Rage, Billion, Audio Science, Sadman, Hashtag
Source: http://www.momjunction.com/articles/worst-baby-names-in-the-world_00400377/#gref

Traditional ML Approaches

Training data (word → tag): Vince's → Person, maiden → O, test → O, fifty → O, keeps → O, England → Misc, ticking → O, Mumbai → Misc, drop → O, Nayar → Person, ...
Tagged sentences like these are fed to a machine learning algorithm, which produces an NER model.

The same data with a finer tagset (word → tag): Vince's → Person, maiden → O, test → O, fifty → O, keeps → O, England → Team, ticking → O, Mumbai → Team, drop → O, Nayar → Person, ...
The machine learning algorithm learns probabilities over words:
P(Person | Vince's) = ?  P(Location | Vince's) = ?  P(Team | Vince's) = ?  P(O | Vince's) = ?

Problem Formulation
Given a word sequence $(w_1, w_2, \ldots, w_n)$, find the most probable tag sequence $(y_1, y_2, \ldots, y_n)$, i.e., find the most probable entity label for every word in the sentence.
Best tag sequence: $y^* = \arg\max_{y_1, \ldots, y_n} P(y_1, y_2, \ldots, y_n \mid w_1, w_2, \ldots, w_n)$
Why sequence labeling and not a classification task? Sequence labeling performs better at identifying named entity phrases, since an entity often spans several consecutive words.

Problem Formulation (CRF)
Given a word sequence $(w_1, w_2, \ldots, w_n)$, find the most probable tag sequence $(y_1, y_2, \ldots, y_n)$:
$P(\mathbf{y} \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k f_k(y_t, y_{t-1}, \mathbf{w}) \Big)$
Here, $f_k(y_t, y_{t-1}, \mathbf{w})$ is a feature function whose weight $\lambda_k$ needs to be learned during training, and $Z(\mathbf{w})$ is the normalization constant. The feature functions are used to define various features.
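
As an illustration of feature functions and their weights, a toy sketch (not from the talk; the features and weights are invented for the example):

```python
import math

# Toy indicator feature functions f_k(y_t, y_prev, words, t): return 1.0
# if the pattern fires at position t, else 0.0.
def f_capitalized_person(y_t, y_prev, words, t):
    return 1.0 if words[t][0].isupper() and y_t == "PER" else 0.0

def f_after_to_location(y_t, y_prev, words, t):
    return 1.0 if t > 0 and words[t - 1] == "to" and y_t == "LOC" else 0.0

def f_person_continues(y_t, y_prev, words, t):
    return 1.0 if y_prev == "PER" and y_t == "PER" else 0.0

FEATURES = [f_capitalized_person, f_after_to_location, f_person_continues]
WEIGHTS = [1.2, 0.8, 0.5]  # the lambda_k values, normally learned from data

def unnormalized_score(tags, words):
    """exp(sum_t sum_k lambda_k * f_k(y_t, y_{t-1}, w)); dividing by Z(w),
    the sum of this quantity over all tag sequences, gives P(y|w)."""
    total = 0.0
    for t in range(len(words)):
        y_prev = tags[t - 1] if t > 0 else "<START>"
        total += sum(lam * f(tags[t], y_prev, words, t)
                     for lam, f in zip(WEIGHTS, FEATURES))
    return math.exp(total)

print(unnormalized_score(["O", "O", "O", "LOC"], ["I", "went", "to", "Washington"]))
```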

Typical Features
- Word features: context words, POS tags, gazetteers
- Subword features: suffixes, gazetteers
- Handcrafted features: Does the word begin with an uppercase character? Does it contain any digits? Does it contain special characters?
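
To make these concrete, a small sketch of a handcrafted feature extractor of the kind used with CRF-style taggers; the exact feature set is an illustrative assumption:

```python
import re

def word_features(words, t):
    """Hypothetical handcrafted feature dict for the word at position t."""
    w = words[t]
    return {
        "word.lower": w.lower(),
        "suffix3": w[-3:],                                   # subword feature
        "prev_word": words[t - 1] if t > 0 else "<BOS>",     # context word
        "next_word": words[t + 1] if t < len(words) - 1 else "<EOS>",
        "is_title": w[0].isupper(),                          # uppercase start?
        "has_digit": any(c.isdigit() for c in w),
        "has_special": bool(re.search(r"[^A-Za-z0-9]", w)),
    }

print(word_features(["Malinga", "omitted", "from", "squad"], 0))
```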

Why Deep Learning?

Why Deep Learning?
Neural networks provide a hierarchical architecture. Lower layers of the network can discover subword features (morphology), layers above them can learn word-specific features (POS tagging), and the highest layers can use the information coming from the lower layers to identify named entities (NER tagging).

Word Embeddings
[Plot: spectral word embeddings for words from the English CoNLL 2003 test data, with each word colored by its most frequent named entity tag.]
We observe named entities of the same type forming a cluster in the embedding space.

Deep Learning Solutions

We have looked at various neural network architectures.
- What are the important features for NER?
- What neural network architectures can we use to make the model learn these features?

Deep Learning Models for NER: Timeline
- Hammerton [2003]: LSTM word model
- Collobert et al. [2011]: convolutional word model
- dos Santos et al. [2015]: CNN subword features + convolutional word model
- Huang et al. [2015]: Bi-LSTM word model
- Chiu and Nichols [2016]: CNN subword features + Bi-LSTM word model
- Murthy and Bhattacharyya [2016]: CNN subword features + Bi-LSTM word model
- Lample et al. [2016]: Bi-LSTM subword features + Bi-LSTM word model
- Ma and Hovy [2016]: CNN subword features + Bi-LSTM word model
- Yang et al. [2017]: RNN subword features + RNN word model

Deep Learning Model for NER [Murthy and Bhattacharyya, 2016]
Given a dataset D consisting of tagged sentences, let $X = (x_1, x_2, \ldots, x_n)$ be the sequence of words in a sentence and $Y = (y_1, y_2, \ldots, y_n)$ the sequence of corresponding tags.
The goal is to maximize the likelihood of the tag sequence given the word sequence:
maximize $P(Y \mid X) = P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(y_i \mid x_1, x_2, \ldots, x_n, y_{i-1})$
We maximize the log-likelihood of every tag sequence in the training data; see the training-step sketch below.
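
As a concrete sketch, the factorized log-likelihood becomes a per-position cross-entropy with the gold previous tag teacher-forced during training. The model interface here is hypothetical (a matching model sketch appears after the architecture slide below):

```python
import torch
import torch.nn.functional as F

# Assumed convention for these sketches: tag index 0 is the O tag, which is
# also used as the "previous tag" before the first word of a sentence.
START_TAG = 0

def sequence_nll(model, word_ids, char_ids, gold_tags):
    """-log P(Y|X) = -sum_i log P(y_i | x_1..x_n, y_{i-1}), with the gold
    previous tag fed in at every position (teacher forcing)."""
    prev = torch.cat([torch.tensor([START_TAG]), gold_tags[:-1]])
    logits = model(word_ids, char_ids, prev)   # (seq_len, num_tags)
    return F.cross_entropy(logits, gold_tags, reduction="sum")
```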

Deep Learning Architecture for NER

Deep Learning Architecture for NER
- The inputs to the model are the words and the character sequence forming each word.
- The one-hot representation of each word is sent through a lookup table, which is initialized with pre-trained embeddings.
- Additionally, the character sequence is fed to a CNN to extract subword features.
- The word embedding and subword features are concatenated to get the final word representation.
- This representation is fed to a Bi-LSTM layer, which disambiguates the word (w.r.t. the NER task) in the sentence.
- Finally, the output of the Bi-LSTM is fed to a softmax layer, which predicts the named entity label. (A code sketch of this architecture follows.)
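
A minimal PyTorch sketch of the described pipeline, not the authors' implementation; the class name, layer sizes, and the single width-3 CNN are illustrative assumptions (the talk uses multiple CNNs of varying widths):

```python
import torch
import torch.nn as nn

class NERModel(nn.Module):
    """Word lookup table + char-CNN subword features -> Bi-LSTM -> softmax
    over tags, with the previous tag fed into the output layer."""

    def __init__(self, vocab_size, num_chars, num_tags,
                 word_dim=100, char_dim=30, cnn_feats=30, hidden=100, tag_dim=10):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # init from pre-trained
        self.char_emb = nn.Embedding(num_chars, char_dim)
        # width-3 convolution over the character sequence of each word
        self.char_cnn = nn.Conv1d(char_dim, cnn_feats, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + cnn_feats, hidden,
                              bidirectional=True, batch_first=True)
        self.tag_emb = nn.Embedding(num_tags, tag_dim)      # previous-tag embedding
        self.out = nn.Linear(2 * hidden + tag_dim, num_tags)

    def forward(self, word_ids, char_ids, prev_tags):
        # word_ids: (seq_len,); char_ids: (seq_len, max_word_len); prev_tags: (seq_len,)
        words = self.word_emb(word_ids)                        # (seq_len, word_dim)
        chars = self.char_emb(char_ids).transpose(1, 2)        # (seq_len, char_dim, L)
        subword, _ = self.char_cnn(chars).max(dim=2)           # max-pool over positions
        rep = torch.cat([words, subword], dim=1).unsqueeze(0)  # (1, seq_len, ...)
        ctx, _ = self.bilstm(rep)                              # (1, seq_len, 2*hidden)
        feats = torch.cat([ctx.squeeze(0), self.tag_emb(prev_tags)], dim=1)
        return self.out(feats)                                 # logits per position
```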

Word Embeddings
- Word embeddings represent words as d-dimensional real-valued vectors.
- Word embeddings exhibit the property that named entities tend to form clusters in the embedding space.
- Providing word embeddings as input is more informative than the one-hot representation.
- Word embeddings are updated during training.
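
In PyTorch, initializing the lookup table from pre-trained vectors while keeping it trainable looks like this (the random tensor is a stand-in for real word2vec/GloVe vectors):

```python
import torch
import torch.nn as nn

# One d-dimensional vector per vocabulary word, e.g. loaded from a
# word2vec/GloVe file; randn is a placeholder here.
pretrained = torch.randn(10000, 100)

# freeze=False keeps the vectors trainable, so they are updated during
# training together with the rest of the network, as the slide describes.
word_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
```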

Subword Features
- We use multiple CNNs of varying widths to extract subword features.
- Every character is represented using a one-hot vector.
- The input is a matrix whose i-th row is the one-hot vector of the i-th character in the word.
- The output of each CNN is fed to a max-pooling layer.
- We extract 15-50 features from the CNNs for every word; these form the subword features of the word.

Subword Features
This module should be able to discover various subword features: capitalization, affixes, the presence of digits, etc.

CNNs to Extract Subword Features
[Figure, built up over several slides: the characters of the word "lohagad" in one-hot representation, a simple linear layer (filters W1, W2, W3) looking at 3 characters at a time, producing one feature vector per character trigram (^lo, loh, oha, hag, aga, gad, ad$), followed by a max-pooling layer.]
- How do we go from a variable-length to a fixed-length representation? Use max-pooling.
- What are we expecting the subword feature extractor to do? Cluster all words ending with the suffix "gad" together, i.e., select most of the features from the n-grams "gad" and "ad$". (A code sketch follows.)
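
A minimal numeric sketch of this pipeline, assuming one-hot character vectors, a single width-3 convolution, and 30 output features (all sizes illustrative):

```python
import torch
import torch.nn as nn

ALPHABET = "^$abcdefghijklmnopqrstuvwxyz"   # ^ and $ mark word boundaries
char_index = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_word(word):
    """Matrix whose i-th row is the one-hot vector of the i-th character."""
    padded = f"^{word}$"
    m = torch.zeros(len(padded), len(ALPHABET))
    for i, c in enumerate(padded):
        m[i, char_index[c]] = 1.0
    return m

conv = nn.Conv1d(len(ALPHABET), 30, kernel_size=3)  # linear layer over 3 chars

def subword_features(word):
    x = one_hot_word(word).T.unsqueeze(0)   # (1, alphabet_size, len+2)
    trigram_feats = conv(x)                 # one 30-dim vector per trigram
    return trigram_feats.max(dim=2).values  # max-pool -> fixed 30-dim vector

# Words of different lengths map to the same fixed-size representation.
print(subword_features("lohagad").shape)    # torch.Size([1, 30])
print(subword_features("raigad").shape)     # torch.Size([1, 30])
```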

CNNs to Extract Subword Features
- Use CNNs to extract various subword features. By "extracting", we mean that words with similar features end up closer in the feature space.
- The features could be capitalization, similar suffixes, similar prefixes, time expressions, etc.
- This is similar to, say, suffix embeddings, except that the suffix pattern is discovered by the model.

Subword Features
[Plot: subword features for different Marathi words.]
We observe that the CNN was able to cluster words with similar suffixes together.

Bi-LSTM Layer
- We have observed that both word embeddings and subword features cluster similar words together: all location names form a cluster in the word embedding space, and all words with similar suffixes form a cluster in the subword feature space. The latter acts as a proxy for the suffix features used in traditional ML methods.
- So far we have looked only at global features. What about local features, such as contextual features?

Bi-LSTM Layer
- The word embeddings and extracted subword features give global information about the word, but whether a word is a named entity or not depends on the specific context in which it is used. For example: "I went to Washington" vs. "I met Washington".
- The Bi-LSTM layer is responsible for disambiguating the word in the sentence, where the disambiguation is w.r.t. named entity tags.

Bi-LSTM Layer
Given a sequence of words $(x_1, x_2, \ldots, x_n)$, the Bi-LSTM layer employs two LSTM modules:
- The forward LSTM reads the sequence from left to right and disambiguates each word based on its left context, extracting features from the current word representation and the previous word's forward LSTM output: $h^f_i = f(x_i, h^f_{i-1})$
- Similarly, the backward LSTM reads the sequence from right to left and disambiguates each word based on its right context: $h^b_i = f(x_i, h^b_{i+1})$
(A short sketch follows.)
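
In PyTorch, a bidirectional LSTM returns the concatenation $[h^f_i ; h^b_i]$ for every position; a minimal sketch with sizes matching the earlier model sketch:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=130, hidden_size=100, bidirectional=True)

x = torch.randn(7, 1, 130)   # 7 word representations (word + subword), batch of 1
out, _ = bilstm(x)           # (7, 1, 200): [h^f_i ; h^b_i] for each word
print(out.shape)
```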

What does the Bi-LSTM Layer compute? Revisiting the deep learning architecture for NER [architecture diagram, shown over two slides].

Bi-LSTM Layer
- The Bi-LSTM layer extracts a set of features for every word in the sentence. We will call this representation the instance-level representation.
- Consider two sentence snippets containing the Hindi word उत्तर, which is ambiguous between "north" and "answer":
  1. वर्तमान में उत्तर प्रदेश का जौनसार बावर... (Currently, the Jaunsar Bavar area of Uttar Pradesh...)
  2. ...बावजूद भी कोई (संतोषजनक) उत्तर प्राप्त नहीं हुआ (...even after that, no satisfactory answer was obtained)
- The word उत्तर now has two instance-level representations, one from each sentence. We will now query the nearest neighbors of उत्तर using the instance-level representations from both sentences.

Bi-LSTM Layer
Nearest neighbors (cosine similarity) of the ambiguous word उत्तर:

Word-embedding neighbors (no tag):
देश 0.8722, पश्चिम 0.8596, मध्य 0.8502, पूरब 0.8432, अरुणाचल 0.8430

Sentence 1 instance-level neighbors (all tagged LOC):
उत्तर 0.9088, उत्तर 0.9033, तिब्बत 0.8669, शिमला 0.8641, किन्नौर 0.8495

Sentence 2 instance-level neighbors (all tagged O):
उत्तर 0.9183, उत्तर 0.9155, उत्तर 0.9137, उत्तर 0.9125, उत्तर 0.9124

In sentence 1, the nearest neighbors are all location entities. In sentence 2, different instances of उत्तर appear as nearest neighbors, and all of them take the "answer" meaning.

Analyzing the Bi-LSTM Layer
Nearest neighbors of उत्तर using the sentence-2 (instance-level) representation:

Neighbor  Score   Tag  Sentence context
उत्तर     0.9183  O    ...का उत्तर देने वाले व्यक्ति...
उत्तर     0.9155  O    ...अनुसार उत्तर दिया परन्तु...
उत्तर     0.9137  O    ...उस उत्तर देने में...
उत्तर     0.9125  O    ...सही उत्तर की संभावना...
उत्तर     0.9124  O    ...एक भी उत्तर न दे...

In sentence 2, different instances of उत्तर appear as nearest neighbors, and all of them take the "answer" meaning.
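
Tables like the ones above can be produced by ranking cosine similarities between representations; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(query, vectors, words, k=5):
    """Rank `words` by cosine similarity of their vectors (word-embedding or
    instance-level) to the query representation."""
    sims = F.cosine_similarity(query.unsqueeze(0), vectors, dim=1)
    scores, idx = sims.topk(k)
    return [(words[i], round(s.item(), 4)) for i, s in zip(idx.tolist(), scores)]
```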

Softmax Layer (Linear + Softmax)
- The output of the Bi-LSTM module and the correct previous tag are fed as input to the softmax layer.
- The correct previous tag is crucial for identifying named entity phrase boundaries.
- During testing, we do not have the previous tag information, so we use beam search to find the best possible tag sequence (a sketch follows).
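
A sketch of beam-search decoding under this model, reusing the hypothetical NERModel and START_TAG from the earlier sketches; re-running the model at every step is a simplification for clarity, not an efficient implementation:

```python
import torch
import torch.nn.functional as F

def beam_search(model, word_ids, char_ids, num_tags, beam_size=3):
    """Approximate the best tag sequence when gold previous tags are
    unavailable at test time."""
    n = len(word_ids)
    beams = [(0.0, [])]                       # (cumulative log-prob, tags so far)
    for i in range(n):
        candidates = []
        for score, tags in beams:
            # Feed the hypothesis tags as "previous tags"; pad future positions.
            prev = torch.tensor([START_TAG] + tags + [START_TAG] * (n - 1 - i))
            logits = model(word_ids, char_ids, prev)     # (n, num_tags)
            log_probs = F.log_softmax(logits[i], dim=0)  # position i only
            for tag in range(num_tags):
                candidates.append((score + log_probs[tag].item(), tags + [tag]))
        # Keep only the best `beam_size` partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]                        # best-scoring full tag sequence
```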

Results
We perform the NER experiments on the following set of languages:

Language   Dataset
English    CoNLL 2003 Shared Task
Spanish    CoNLL 2002 Shared Task
Dutch      CoNLL 2002 Shared Task
Hindi      IJCNLP 2008 Shared Task
Bengali    IJCNLP 2008 Shared Task
Telugu     IJCNLP 2008 Shared Task
Marathi    In-House Data

Results
The following table shows the F1-score obtained using the deep learning system:

Language   F1-Score
English    90.94
Spanish    84.85
Dutch      85.20
Hindi      59.80
Marathi    61.78
Bengali    43.24
Telugu     21.11

Demo

Thank You

Questions?

References
- Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research.
- dos Santos, C. and Guimarães, V. (2015). Boosting named entity recognition with neural character embeddings. In Proceedings of NEWS 2015, The Fifth Named Entities Workshop.
- Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
- Lample, G., Ballesteros, M., Kawakami, K., Subramanian, S., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016, San Diego, US.

References
- Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2016). Multilingual language processing from bytes. In Proceedings of NAACL-HLT 2016, San Diego, US.
- Murthy, R. and Bhattacharyya, P. (2016). Complete deep learning solution for named entity recognition. CICLing 2016, Konya, Turkey.
- Murthy, R., Khapra, M., and Bhattacharyya, P. (2016). Sharing network parameters for crosslingual named entity recognition. CoRR, abs/1607.00198.