Contents

1 Brief recap
2 Models evaluation
3 Off-the-shelf tools to train and use models
4 Model formats
5 Hyperparameters influence


INF5820 Distributional Semantics: Extracting Meaning from Data
Lecture 3: Practical aspects of training and using distributional models
Andrey Kutuzov (andreku@ifi.uio.no)
9 November 2016

1 Brief recap

What we are going to cover today:
- Models evaluation;
- Off-the-shelf tools to train and use models;
- Model formats;
- Model hyperparameters.

2 Models evaluation

How do we evaluate trained models? This is subject to many discussions; it was the topic of a dedicated workshop at ACL 2016: https://sites.google.com/site/repevalacl16/

- Semantic relatedness (what is the degree of association?):
  - RG dataset [Rubenstein and Goodenough, 1965]
  - WordSim353 dataset [Finkelstein et al., 2001]
  - MEN dataset [Bruni et al., 2014]
  - SimLex-999 dataset [Hill et al., 2015]
- Synonym detection (which word is most similar?):
  - TOEFL dataset (1997)
- Concept categorization (what groups with what?):
  - ESSLLI 2008 dataset
  - Battig dataset (2010)
- Analogical inference (A is to B as C is to ?):
  - Google Analogy dataset [Le and Mikolov, 2014]
  - Many domain-specific datasets inspired by the Google Analogy dataset
- Correlation with manually crafted linguistic features:
  - QVEC uses word affiliations with WordNet synsets [Tsvetkov et al., 2015]

3 Off-the-shelf tools to train and use models

Main frameworks and toolkits:
1. Dissect [Dinu et al., 2013] (http://clic.cimec.unitn.it/composes/toolkit/);
2. The original word2vec C code [Le and Mikolov, 2014] (https://word2vec.googlecode.com/svn/trunk/);
3. The Gensim framework for Python, including word2vec implementations (http://radimrehurek.com/gensim/);
4. word2vec implementations in Google's TensorFlow (https://www.tensorflow.org/tutorials/word2vec);
5. The GloVe reference implementation [Pennington et al., 2014] (http://nlp.stanford.edu/projects/glove/).
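The analogical-inference evaluation above is usually run with the vector-offset method: the answer to "A is to B as C is to ?" is the word whose vector is closest to B - A + C. A minimal NumPy sketch with an invented toy vocabulary (the vectors here are made up purely for illustration):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Toy embedding table; a real model has tens of thousands of words.
emb = {
    "king":  normalize(np.array([0.9, 0.8, 0.1])),
    "queen": normalize(np.array([0.9, 0.1, 0.8])),
    "man":   normalize(np.array([0.1, 0.9, 0.2])),
    "woman": normalize(np.array([0.1, 0.2, 0.9])),
    "apple": normalize(np.array([0.5, 0.5, 0.5])),
}

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the offset b - a + c."""
    target = normalize(emb[b] - emb[a] + emb[c])
    # The three query words are excluded, as in the standard protocol.
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    # Highest cosine similarity wins (vectors are unit-normalized).
    return max(candidates, key=lambda w: candidates[w] @ target)

print(analogy("man", "king", "woman"))  # → queen
```

Accuracy on a dataset like the Google Analogy test set is then simply the fraction of questions for which the top candidate matches the gold answer.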

4 Model formats

Models can come in several formats:
1. Plain text: one word per line, followed by the values of its vector; the first line gives the number of words in the model and the vector size.
2. The same in binary form.
3. Gensim's native binary format: NumPy matrices saved via Python pickles; it stores a lot of additional information (input vectors, training algorithm, word frequencies, etc.).
Gensim works with all of these formats.

5 Hyperparameters influence

Things are complicated: model performance depends heavily on the training settings (hyperparameters):
1. CBOW or skip-gram algorithm. This needs further research; skip-gram is generally better (but slower), while CBOW seems to be better on small corpora (fewer than 100 million tokens).
2. Vector size: how many distributed semantic features (dimensions) we use to describe a word. More is not always better.
3. Window size: context width and the influence of distance. Wider windows yield topical (associative) models; narrower ones yield functional (properly semantic) models.
4. Frequency threshold: useful for getting rid of the long, noisy lexical tail.
5. Selection of learning material: hierarchical softmax or negative sampling (the latter is used more often).
6. Number of iterations over the training data, etc.
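The plain text format described above is simple enough to parse by hand, which makes its structure concrete. A standard-library-only sketch (for real models you would use Gensim's loader, which also handles the binary variant):

```python
from io import StringIO

def read_text_model(fileobj):
    """Parse the word2vec plain text format: a header line with the
    vocabulary size and vector dimensionality, then one word per line
    followed by its vector components, separated by spaces."""
    n_words, dim = map(int, fileobj.readline().split())
    model = {}
    for _ in range(n_words):
        parts = fileobj.readline().rstrip().split(" ")
        word, values = parts[0], parts[1:]
        assert len(values) == dim, f"bad vector length for {word!r}"
        model[word] = [float(x) for x in values]
    return model

# A tiny model: 2 words, 3 dimensions.
demo = StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.3 0.2 0.1\n")
model = read_text_model(demo)
print(model["cat"])  # → [0.1, 0.2, 0.3]
```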

A bunch of observations:
- Wikipedia is not the best training corpus: performance fluctuates wildly depending on the hyperparameters; perhaps its language is too specific.
- Normalize your data: lowercase, lemmatize, and merge multi-word entities.
- It helps to augment words with PoS tags before training (boot_NOUN, boot_VERB): the resulting model becomes aware of morphological ambiguity.
- Remove stop words yourself: the statistical downsampling implemented in the word2vec algorithms can easily deprive you of valuable text data.

[Figure: model performance on the semantic relatedness task as a function of context width and vector size.]

Questions?

Homework: obligatory assignment 3.
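The PoS-augmentation trick mentioned above is a one-line preprocessing step once the text has been tagged (a sketch assuming the input already carries tags; in practice they would come from a tagger such as the ones in NLTK or spaCy):

```python
def augment_with_pos(tagged_sentence):
    """Turn (token, tag) pairs into word_TAG tokens, so that e.g. the
    noun and verb readings of 'boot' get separate embedding vectors."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_sentence]

tagged = [("The", "DET"), ("boot", "NOUN"), ("fits", "VERB")]
print(augment_with_pos(tagged))  # → ['the_DET', 'boot_NOUN', 'fits_VERB']
```

Train on these augmented tokens and the model will keep, say, boot_NOUN and boot_VERB apart instead of conflating both readings into one vector.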

Next week: Beyond words, distributional representations of texts. Representing phrases, sentences and documents; semantic fingerprints; paragraph vector (doc2vec); deep inverse regression, etc.

References

Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1-47.
Dinu, G., Pham, T. N., and Baroni, M. (2013). DISSECT: Distributional semantics composition toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 31-36. Association for Computational Linguistics.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406-414. ACM.
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4).
Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML, volume 14, pages 1188-1196.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532-1543.
Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.
Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., and Dyer, C. (2015). Evaluation of word vector representations by subspace alignment. In Proceedings of EMNLP.