CogALex-V Shared Task - LexNET: Integrated Path-based and Distributional Method for the Identification of Semantic Relations

CogALex-V Shared Task - LexNET: Integrated Path-based and Distributional Method for the Identification of Semantic Relations
Vered Shwartz and Ido Dagan
Bar-Ilan University
December 12, 2016

CogALex Shared Task: corpus-based identification of semantic relations. Given two words x and y:
Subtask 1: decide whether they are related or not, e.g. related: (misery, sadness); unrelated: (misery, school)
Subtask 2: decide which semantic relation holds between them, e.g. ANT: (child, parent), HYPER: (child, human), PART_OF: (child, family), SYN: (child, kid), RANDOM: (child, mix)

Outline
LexNET Architecture
Subtask 1 - Word Relatedness
Subtask 2 - Semantic Relation Classification

LexNET Architecture

LexNET Architecture (1)
(x, y) is represented as a feature vector v_xy, the concatenation of:
Path-based features - the averaged path embedding v_paths(x,y)
Distributional features - x's and y's word embeddings v_wx and v_wy
An MLP classifies (x, y) to the semantic relation that holds between them, with a softmax over v_xy = [v_wx ; v_paths(x,y) ; v_wy].
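To make the flow concrete, here is a minimal numpy sketch of the classification step: the feature vector is the concatenation of the two word embeddings and the averaged path embedding, followed by a softmax layer. The shapes, parameter names, and the absence of a hidden layer are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the LexNET classification step (illustrative only; the
# actual system may include a hidden layer and uses its own toolkit).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_pair(v_wx, v_paths_xy, v_wy, W, b):
    """v_wx, v_wy: word embeddings of x and y; v_paths_xy: averaged path embedding.
    W, b: parameters of the output layer, one row of W per semantic relation."""
    v_xy = np.concatenate([v_wx, v_paths_xy, v_wy])  # feature vector for (x, y)
    return softmax(W @ v_xy + b)                     # distribution over relations
```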

LexNET Architecture (2)
Dependency Path Representation [Shwartz et al., 2016]:
1. An edge is a concatenation of 4 component vectors: dependent lemma / dependent POS / dependency label / direction, e.g. be/verb/root/-
2. Edges are fed sequentially to an LSTM to get the path embedding; the embeddings of all paths connecting x and y (e.g. "X/NOUN/nsubj > be/verb/root < Y/NOUN/attr", "X/NOUN/dobj > define/verb/root < as/adp/prep < Y/NOUN/pobj") are average-pooled into v_paths(x,y).
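The following is a small sketch of this path encoder, written in PyTorch for illustration (the original work does not use PyTorch); the vocabulary sizes, embedding dimension, and all names are assumptions.

```python
# Sketch of the path encoder: edge vectors -> LSTM -> average over paths.
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    def __init__(self, n_lemmas, n_pos, n_deps, n_dirs, dim=50):
        super().__init__()
        self.lemma = nn.Embedding(n_lemmas, dim)
        self.pos   = nn.Embedding(n_pos, dim)
        self.dep   = nn.Embedding(n_deps, dim)
        self.dir   = nn.Embedding(n_dirs, dim)
        self.lstm  = nn.LSTM(input_size=4 * dim, hidden_size=dim, batch_first=True)

    def forward(self, paths):
        """paths: list of LongTensors of shape (path_len, 4), one per dependency
        path connecting x and y; columns are lemma / POS / dep label / direction."""
        embs = []
        for p in paths:
            # each edge vector is the concatenation of its 4 component embeddings
            edges = torch.cat([self.lemma(p[:, 0]), self.pos(p[:, 1]),
                               self.dep(p[:, 2]), self.dir(p[:, 3])], dim=-1)
            _, (h, _) = self.lstm(edges.unsqueeze(0))
            embs.append(h[-1, 0])             # final LSTM state = path embedding
        return torch.stack(embs).mean(dim=0)  # average pooling: v_paths(x, y)
```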

Experimental Settings
Most hyper-parameters are tuned on a validation set:
We split the provided train set into 90% train and 10% validation.
Our split is lexical (on the x slot), to avoid lexical memorization [Levy et al., 2015]; see the sketch below.
Some hyper-parameters are fixed:
We use Wikipedia as the corpus (3B tokens).
The network's word embeddings are initialized with GloVe [Pennington et al., 2014] (6B tokens).
More on corpus size later...
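As a rough illustration of such a lexical split (function name and ratio are assumptions, not the authors' exact procedure): all pairs sharing the same x word land on the same side, so no x seen in training appears in the validation set.

```python
# Sketch of a lexical split on the x slot (illustrative; not the exact split code).
import random

def lexical_split(pairs, val_ratio=0.1, seed=0):
    """pairs: list of (x, y, label) tuples. All pairs sharing the same x go to
    the same side, so no x word from training appears in the validation set."""
    xs = sorted({x for x, _, _ in pairs})
    random.Random(seed).shuffle(xs)
    val_xs = set(xs[:int(len(xs) * val_ratio)])
    train = [p for p in pairs if p[0] not in val_xs]
    val   = [p for p in pairs if p[0] in val_xs]
    return train, val
```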

Subtask 1 - Word Relatedness

Common Approaches
Typically: compute vector similarity between x's and y's distributional representations, and tune a threshold to separate related from unrelated word pairs.
Most common: cosine similarity, which achieves F1 = 0.747 on the test set.
When can this go wrong?
The relation holds in a rare sense of x or y, e.g. (fire, shoot).
The relation is weak / non-prototypical, e.g. (compact, car).
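A minimal sketch of this baseline, assuming pre-trained word vectors are already loaded; the threshold value shown is a placeholder to be tuned on held-out data.

```python
# Cosine-similarity relatedness baseline (sketch; threshold is a placeholder).
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_related(v_x, v_y, threshold=0.25):
    """Predict 'related' iff the distributional similarity exceeds the threshold."""
    return cosine(v_x, v_y) >= threshold
```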

Subtask 1 Model
We combine cosine similarity with LexNET:
Train LexNET to distinguish between related and unrelated pairs.
Compute a linear combination of the cosine score and LexNET's related-class score:
Rel(x, y) = w_C * cos(v_wx, v_wy) + w_L * c[related]
The weights, the threshold, and the word embeddings (for the cosine) are tuned on the validation set.
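A sketch of this combination; the weight and threshold values below are placeholders standing in for the tuned ones.

```python
# Combined relatedness score Rel(x, y) (sketch; weights/threshold are placeholders).
import numpy as np

def relatedness(v_wx, v_wy, p_related, w_c=0.5, w_l=0.5):
    """p_related is LexNET's softmax score for the 'related' class."""
    cos = float(v_wx @ v_wy / (np.linalg.norm(v_wx) * np.linalg.norm(v_wy)))
    return w_c * cos + w_l * p_related

def predict_related(v_wx, v_wy, p_related, threshold=0.5):
    return relatedness(v_wx, v_wy, p_related) >= threshold
```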

Subtask 1 Results

Method              P      R      F1
Majority Baseline   0.000  0.000  0.000
Random Baseline     0.283  0.503  0.362
ROOT18              -      -      0.731
Cosine Similarity   0.841  0.672  0.747
LexNET              0.754  0.777  0.765
Mach5               -      -      0.778
GHHH                -      -      0.790

Table: Performance scores on the test set for our method, the baselines, and the top 4 systems.

Top-performing systems achieve similar results.
The cosine baseline is strong: word2vec [Mikolov et al., 2013] trained on Google News, 100B tokens.
LexNET contributes for rare senses and non-prototypical relatedness.

Subtask 2 - Semantic Relation Classification

Subtask 2 Model (1)
Vanilla setting: train LexNET to distinguish between hypernyms, meronyms, antonyms, synonyms, and random.
Problem: the dataset is highly imbalanced, so the model overfits the random class!
Solution: use the subtask 1 model to classify pairs as random / related, then train LexNET to classify the related pairs into the different semantic relations (see the sketch below).
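A toy sketch of this two-step pipeline; the two callables passed in stand for the subtask 1 relatedness model and the 4-way LexNET classifier, and all names are placeholders.

```python
# Two-step pipeline for subtask 2 (sketch; both models are passed in as callables).
def classify_relation(x, y, is_related, relation_classifier):
    """is_related: subtask 1 model (related vs. random);
    relation_classifier: 4-way LexNET over HYPER / PART_OF / ANT / SYN."""
    if not is_related(x, y):
        return "RANDOM"
    return relation_classifier(x, y)
```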

Subtask 2 Model (2)
LexNET is now trained to distinguish between hypernyms, meronyms, antonyms, and synonyms.
Problem: synonyms are hard to recognize!
Path-based: synonyms do not tend to occur together.
Distributional: synonyms are often mistaken for antonyms, which also occur in similar contexts.
Solution: add a heuristic: if (x, y)'s classification scores for synonym and another relation R are similar, classify the pair as synonym only if x and y occur together fewer than 3 times in the corpus.
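A sketch of this heuristic; the score-closeness margin is an assumption (the slides only say "similar"), while the co-occurrence cutoff of 3 comes from the slides.

```python
# Synonym heuristic (sketch; `margin` is an assumed notion of "similar scores").
def choose_relation(scores, cooccurrences, margin=0.05):
    """scores: relation name -> classifier score for (x, y), including "SYN";
    cooccurrences: how many times x and y occur together in the corpus."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[0]
    rival = ranked[1] if top == "SYN" else top  # the relation R competing with SYN
    if abs(scores["SYN"] - scores[rival]) <= margin:
        # SYN and R have similar scores: prefer SYN only for rarely co-occurring pairs
        return "SYN" if cooccurrences < 3 else rival
    return top
```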

Subtask 2 Results

Method              P      R      F1
Majority Baseline   0.000  0.000  0.000
Random Baseline     0.073  0.201  0.106
ROOT18              -      -      0.262
Mach5               -      -      0.295
Concatenation       0.469  0.371  0.411
GHHH                -      -      0.423
LexNET              0.480  0.418  0.445

Table: Performance scores on the test set for our method, the baselines, and the top 4 systems.

Only GHHH achieves similar results.
The overall performance is very low!

Analysis
The low results contrast with the success of previous methods on common datasets.
This can be attributed to the stricter and more informative evaluation:
random is considered noise and excluded from the F1 average;
the dataset is lexically split, disabling lexical memorization [Levy et al., 2015].
This motivates further research on this task!

Recap
We presented our submission to the CogALex shared task.
The submission is based on LexNET [Shwartz and Dagan, 2016], an integrated path-based and distributional method for semantic relation classification.
LexNET was the best-performing system on subtask 2, and the only system using path-based information...
Performance on subtask 2 was low for all participating systems.
This demonstrates the difficulty of the task and motivates further research.
Thank you!

References
Levy, O., Remus, S., Biemann, C., and Dagan, I. (2015). Do supervised distributional methods really learn lexical inference relations? In NAACL.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111-3119.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543.
Shwartz, V. and Dagan, I. (2016). Path-based vs. distributional information in recognizing lexical semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V).
Shwartz, V., Goldberg, Y., and Dagan, I. (2016). Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of ACL 2016 (Volume 1: Long Papers), pages 2389-2398.

Appendix - Corpus Size
LexNET:
Main corpus: Wikipedia (3B tokens).
Pre-trained GloVe embeddings [Pennington et al., 2014], trained on Wikipedia + Gigaword 5 (6B tokens).
Cosine: pre-trained word2vec embeddings [Mikolov et al., 2013], trained on Google News (100B tokens).