Deep Learning of Text Representations


Fatih Uzdilli, 21.01.2015

Outline: Deep Learning & Text Analysis, Word Representations, Compositionality, Results

What is the role of deep learning in text analysis?

What is deep learning? Deep learning algorithms learn multiple levels of representation of increasing complexity/abstraction from raw sensory inputs.

What are raw sensory inputs? Images: intensity at each pixel. Audio: amplitude at each time point. Text: ???

Machine learning (for text) until now: human-designed representations and input features; machine learning then often means linear models, just optimizing weights.

Good input features are essential for successful ML! (feature engineering = 90% of effort in industrial ML)

Bag-of-Words Feature. The simplest approach for text, also called unigram; a perfect example of the 80/20 rule. How to: build a vector of length |Vocabulary|, where every index represents one word; for each word occurring in the text, set the value at that word's index to 1. Can't distinguish: + "White blood cells destroying an infection" vs. - "An infection destroying white blood cells."
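As a minimal sketch, the construction looks like this in plain Python (the vocabulary and sentences are just the slide's example, not a real pipeline):

```python
# Minimal bag-of-words sketch (binary unigram features), assuming a fixed vocabulary.
def bag_of_words(text, vocabulary):
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)          # one slot per vocabulary word
    for word in text.lower().split():
        if word in index:
            vector[index[word]] = 1         # set value at the word's index to 1
    return vector

vocab = ["an", "blood", "cells", "destroying", "infection", "white"]
a = bag_of_words("White blood cells destroying an infection", vocab)
b = bag_of_words("An infection destroying white blood cells", vocab)
assert a == b   # identical vectors: word order is lost, as noted above
```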

Some state-of-the-art features for sentiment detection on tweets:
- N-grams (n = 1-4)
- N-grams with lemmas (n = 1-4)
- N-grams with middle word(s) removed (n = 1-4)
- N-grams using word clusters (n = 1-4)
- Substring n-grams
- POS n-grams (n-grams with middle words replaced by their POS tag)
- Encoding of negation context into words
- Number of all-capitalized words
- Number of hashtags
- Number of POS tags
- Number of words in a negated context
- Number of elongated words
- Whether the text ends with punctuation
- Length of the longest continuous punctuation run
- Whether the last word is in a negative-words list
- Whether the last word is in a positive-words list
- Sentiment lexicon score of the last token
- Total sentiment lexicon score over all tokens
- Maximum sentiment lexicon score over all tokens
- Number of tokens with a positive sentiment lexicon score
Stats about the resulting feature vector: vector size 2.1M, avg. ~1100 non-zero values => very, very sparse.

Problem 1: Handcrafting Features. Handcrafting features is time-consuming, requires experience and skill, and has to be redone for each task/domain. Alternative: representation learning, i.e. let the machine learn good feature representations.

Problem 2: Curse of Dimensionality. Current natural-language-processing systems are fragile because of their atomic symbol representations: "He is smart" vs. "He is brilliant" share nothing. Curse of dimensionality: to generalize we need examples covering all relevant variations => more dimensions than variations available!

We need Distributed Representations. Distributed representations capture multiple dimensions of similarity via non-mutually-exclusive features => an exponentially large set of distinguishable configurations.

Problem 3: Not Enough Labeled Data. Most methods require labeled training data (i.e., supervised learning), but almost all available data is unlabeled. Alternative: unsupervised feature learning.

Purely supervised setup

Semi-supervised setup

Let's start with word representations

Neural Word Embeddings as a Distributed Representation. Trained on large amounts of data; similar in spirit to soft clustering models like LSI and LDA; allows adding supervision from multiple tasks => can become more meaningful. A word is represented as a dense vector, e.g. linguistics = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]

Word Embeddings Visualization

Vector Operations for Analogy Testing. Syntactically: X_apple - X_apples ≈ X_car - X_cars ≈ X_family - X_families. Semantically: X_shirt - X_clothing ≈ X_chair - X_furniture; X_switzerland - X_zurich + X_istanbul ≈ X_turkey.
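A sketch of how such analogy tests can be scored with cosine similarity; the embedding table below is random placeholder data, so only the mechanics are real:

```python
import numpy as np

# Toy embedding table with random placeholder vectors; a real system would
# load vectors trained by word2vec or a similar model.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["switzerland", "zurich", "istanbul", "turkey"]}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Find the word whose vector is closest to X_a - X_b + X_c."""
    target = emb[a] - emb[b] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

# With trained embeddings this should come out as "turkey";
# with the random placeholders above, the answer is arbitrary.
print(analogy("switzerland", "zurich", "istanbul"))
```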

A Neural Probabilistic Language Model (Y. Bengio, 2003). Task: given a sequence of words in a window, predict the next word. The input vectors are also involved in backpropagation. Hierarchical Softmax (Morin & Bengio, 2005). Speedup: the number of output nodes to update at every step shrinks from n to log(n).
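A minimal PyTorch sketch of that architecture, under assumed layer sizes (the original paper's dimensions differ); note the output layer spans the whole vocabulary, which is exactly the cost hierarchical softmax cuts to log(n):

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Bengio-style neural probabilistic language model (sketch)."""
    def __init__(self, vocab_size, emb_dim=64, window=4, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # input vectors, updated by backprop too
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)          # softmax over the whole vocabulary

    def forward(self, context):                           # context: (batch, window) word ids
        e = self.emb(context).flatten(1)                  # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                # logits for the next word

model = NPLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 4)))           # a batch of 8 windows of 4 words
print(logits.shape)                                       # torch.Size([8, 10000])
```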

Word2Vec (Continuous Bag-of-Words), Mikolov et al., 2013. Speedup: a neural language model with the hidden layer removed. Speedup + quality improvement: negative sampling. Quality improvement: subsampling of frequent words. Still in use: hierarchical softmax.
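For illustration, these knobs map directly onto gensim's Word2Vec implementation (a usage sketch assuming the gensim 4.x API; the two-sentence corpus is a placeholder):

```python
from gensim.models import Word2Vec

# Tiny placeholder corpus: one tokenized sentence per list.
sentences = [
    ["deep", "learning", "of", "text", "representations"],
    ["word", "embeddings", "are", "dense", "vectors"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the dense word vectors
    sg=0,              # 0 = CBOW (the variant above), 1 = skip-gram
    negative=5,        # negative sampling (use hs=1, negative=0 for hierarchical softmax)
    window=5,
    min_count=1,
)
print(model.wv["word"][:5])   # first 5 dimensions of the learned vector
```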

Other approaches for word embeddings. Predict whether a given sequence exists in nature (Collobert et al., 2011). Example: "the cat chills on a mat" (positive) vs. "the cat chills hello a mat" (negative). Negative examples are created by replacing the middle word in a window with a random word. Other ideas: add standard NLP tasks such as POS tagging, named entity recognition (NER), etc.
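A sketch of that corruption step (the ranking network that scores real vs. corrupted windows is omitted):

```python
import random

def corrupt(window, vocabulary):
    """Make a negative example: replace the middle word of a window with a random word."""
    negative = list(window)
    negative[len(window) // 2] = random.choice(vocabulary)
    return negative

vocab = ["the", "cat", "chills", "on", "a", "mat", "hello"]
positive = ["the", "cat", "chills", "on", "a", "mat"]   # observed in the corpus
print(corrupt(positive, vocab))  # e.g. ['the', 'cat', 'chills', 'hello', 'a', 'mat']
```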

Word2Vec Demo

Let's go to a higher level: compositionality

Phrase representations by summing up word vectors: X_south + X_africa = X_south_africa; X_new + X_york = X_new_york. Works well for up to 3-grams.

What about whole sentences? Sentence representation is difficult because of the varying size!

No final answer for sentences yet. Some notable approaches:
- Convolutional networks with max-over-time pooling (Collobert & Weston, 2008, 2011)
- Recursive neural networks (Socher et al., 2010)
- Recursive autoencoders (Socher et al., 2011)
- Recursive neural tensor networks (Socher et al., 2013)
- Paragraph Vector, based on Word2Vec (Le & Mikolov, 2014)
- Convolutional networks with various pooling schemes and regularizations (Kim, 2014; Kalchbrenner et al., 2014)

Paragraph Vector (Le & Mikolov, 2014). Adds an additional vector for each sentence/document during word2vec training, thereby learning a vector per sentence/document. Slow at test time!
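A usage sketch with gensim's Doc2Vec implementation of this idea (assuming the gensim 4.x API; the corpus is a placeholder). The last line shows why test time is slow: inferring a vector for an unseen document requires fresh gradient steps:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus: each document gets a tag whose vector is trained alongside the words.
docs = [
    TaggedDocument(words=["deep", "learning", "for", "text"], tags=[0]),
    TaggedDocument(words=["sentence", "representations", "are", "hard"], tags=[1]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(model.dv[0][:5])                                  # trained vector of document 0
# Test time: a fresh vector is fitted by extra gradient steps -> slow.
print(model.infer_vector(["unseen", "text"])[:5])
```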

Convolutional Neural Networks for Sentence Classification (Yoon Kim, 2014)
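A minimal PyTorch sketch of the architecture described in the paper: embedding lookup, parallel convolutions of several filter widths, max-over-time pooling, dropout, and a softmax classifier. All sizes here are placeholders, not Kim's hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim (2014)-style CNN for sentence classification (sketch)."""
    def __init__(self, vocab_size, emb_dim=128, widths=(3, 4, 5), n_filters=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(len(widths) * n_filters, n_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len) word ids
        x = self.emb(tokens).transpose(1, 2)      # (batch, emb_dim, seq_len)
        # Each filter yields one feature via max-over-time pooling; concatenate all.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

model = TextCNN(vocab_size=10000)
print(model(torch.randint(0, 10000, (8, 40))).shape)    # torch.Size([8, 2])
```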

Results

Results on SemEval-2014 Shared Task 9: sentiment (3-class) classification on Twitter data.
System         | Deep Learning Part | Classical Features Part | Final Score
Best System    | -                  | 70.96                   | 70.96
Coooolll       | 66.86              | 67.07                   | 70.14
Think Positive | 67.04              | -                       | 67.04
For practical use, deep learning has so far been just a provider of one additional feature!

Results on the Sentiment Treebank dataset: sentiment of movie-review sentences, with labels provided on each sub-phrase for training. Accuracy for 5-class (++, +, o, -, --) and 2-class (+, -):
Model                                                  | 5-class | 2-class
Bag-of-Words + SVM                                     | 40.0    | 82.2
Feature-engineered Twitter sentiment classifier (2013) | 43.7    | 84.1
RAE (Socher et al., 2011)                              | 43.2    | 82.4
MV-RNN (Socher et al., 2012)                           | 44.4    | 82.9
RNTN (Socher et al., 2013)                             | 45.7    | 85.4
DCNN (Kalchbrenner et al., 2014)                       | 48.5    | 86.8
Paragraph-Vec (Le and Mikolov, 2014)                   | 48.7    | 87.8
CNN (Kim, 2014)                                        | 47.4    | 88.1

The 2014 results suggest that deep learning systems have become useful on their own, BUT: deep learning systems with good results are difficult to reproduce.

Summary. Semi-supervised distributed representation learning shows the future direction. Word representations: easy. Sentence representations: still an open issue. Good results are difficult to reproduce.

Further Reading:
Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space: http://arxiv.org/pdf/1301.3781.pdf
Le & Mikolov (2014), Distributed Representations of Sentences and Documents: http://cs.stanford.edu/~quocle/paragraph_vector.pdf
Kim (2014), Convolutional Neural Networks for Sentence Classification: http://emnlp2014.org/papers/pdf/emnlp2014181.pdf