Learning Feature-based Semantics with Autoencoder


Wonhong Lee, Minjong Chung
wonhong@stanford.edu, mjipeo@stanford.edu

Abstract

It is essential to reduce the dimensionality of features, not only for computational efficiency but also for extracting the most meaningful patterns. This is particularly important in areas such as computer vision and natural language processing, where researchers usually find it difficult to specify features manually. In this project, we applied the autoencoder model to represent the semantics of phrases and evaluated it by computing the correlation between the similarity of the autoencoded features and similarity ratings manually assigned by humans.

1. Introduction

Even though many supervised learning algorithms, such as SVMs and linear regression, work well given a set of appropriate features, manually specifying features of the input data is tedious in domains like computer vision and natural language processing, and it is sometimes quite difficult to define features that represent the input data in an intuitive way. It is therefore increasingly important to reduce the dimensionality of the data and to extract only a set of meaningful features, which improves both computational efficiency and learning accuracy.

Natural language processing is one of the most promising areas for the autoencoder approach. In particular, compared to syntactic information, the semantics of a word, phrase, or sentence is difficult to model accurately and to evaluate systematically. Furthermore, supervised approaches to learning the semantics of words or phrases are limited by the difficulty of obtaining accurately labeled training data. Turian et al. (2010) emphasized the effectiveness of semi-supervised approaches for word representation.

In this project, we used an autoencoder to automatically learn useful features from the input data, and we used the compressed representation to find phrases similar to a given phrase, a task for which it is usually hard to specify features manually. First, we made a simplifying assumption about the autoencoder model that reduces the number of parameters, which leads to faster convergence than the standard version, and we derived new update rules for this model. We visualized the trained hidden units to see what kinds of patterns the model tries to extract from the input features. To apply the model to natural language processing, we used Mitchell and Lapata's (2008) phrase pairs, with manual ratings by humans, to train the autoencoder, and we computed the Pearson correlation between the manual ratings and the autoencoded (compressed) similarity to evaluate the model. Finally, we compared the performance of our modified autoencoder against other methods for computing phrase similarity.

2. Background

2.1. Autoencoder

Autoencoder networks are feed-forward neural networks that can have more than one hidden layer. These networks attempt to reconstruct the input data at the output layer. The targets at the output layer are the same as the input data, so the size of the output layer equals the size of the input layer. In other words, the network tries to learn a function $h_{W,b}(x) \approx x$, making the output as similar as possible to the input. By adjusting the number of nodes in the hidden layer, we can discover interesting structure in the data. Given this property of the autoencoder, we also expect to find semantic correlations between various sentences.
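
As a point of reference for the tied-weight variant introduced in Section 3, the following is a minimal sketch of the generic autoencoder just described: a feed-forward network with a smaller hidden layer trained to reproduce its input. The dimensions and parameter names are illustrative, not the authors' setup.

```python
import numpy as np

# Minimal sketch of a generic (untied) autoencoder: a feed-forward network
# trained to reproduce its input at the output layer.
rng = np.random.default_rng(0)
n, m = 64, 30                          # input size, hidden size (m < n)

W1 = rng.uniform(-0.1, 0.1, (m, n))    # encoder weights
b1 = np.zeros(m)
W2 = rng.uniform(-0.1, 0.1, (n, m))    # decoder weights (untied here)
b2 = np.zeros(n)

def reconstruct(x):
    """h_{W,b}(x): encode into the smaller code space, then decode back."""
    h = np.tanh(W1 @ x + b1)           # hidden code (reduced representation)
    return W2 @ h + b2                 # reconstruction of x

x = rng.normal(size=n)
reconstruction_error = 0.5 * np.sum((reconstruct(x) - x) ** 2)
```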

Figure 1. Autoencoder model.

An autoencoder is an unsupervised neural network model trained with a gradient descent method such as backpropagation. Since the size of the hidden layer is smaller than the size of the input, the input data is reduced to a lower-dimensional code space at the hidden layer. The outputs of the hidden layer are then reconstructed into the original data at the output layer. Like PCA, autoencoders provide mappings in both directions between the data space and the code space.

2.2. Word representation

Semi-supervised approaches such as Ando and Zhang (2005) and Suzuki and Isozaki (2008) achieve remarkable accuracy. However, these approaches constrain the choice of training model because of their distinct characteristics, so it is very difficult to plug an existing supervised natural language processing system into such semi-supervised methods. Consequently, it is becoming preferable to use unsupervised techniques to induce word features.

Figure 2. Word representation.

A word can be represented as a mathematical object, such as a vector, associated with that word. Each dimension's value corresponds to a feature and might even have a semantic or grammatical interpretation, so we call it a word feature. However, in the real world, labeled training data for word representations is rare, which results in poor estimation. Because of this limitation, NLP researchers have investigated unsupervised methods for inducing word features. One powerful approach is clustering, which has been used by a variety of researchers. In this project, we choose data sets in these categories.

3. Model

3.1. Autoencoder

3.1.1. Assumption

Our model starts from the assumption

$$W_2 = W_1^T.$$

In other words, we assume that the weight matrix of the second layer is the transpose of the weight matrix of the first layer, which effectively halves the number of weight parameters. With this assumption, the output vector of the autoencoder can be written in terms of three parameters, $W$, $b_1$, and $b_2$:

$$Y = W^T f(WX + b_1) + b_2,$$

where $b_1$ and $b_2$ are the biases of the first and second layer, respectively, and $f(x)$ is the activation function. In this project we use $f(x) = \tanh(x)$. Finally, the cost function to minimize is

$$J(W, b_1, b_2) = \frac{1}{2}(Y - X)^T (Y - X).$$
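
The tied-weight formulation can be made concrete with a short sketch (our illustration, not the authors' code; the dimensions are hypothetical, and the initialization follows the range described later in Section 4.2):

```python
import numpy as np

# Tied-weight autoencoder: a single weight matrix W, with the decoder using W.T.
rng = np.random.default_rng(0)
n, m = 50, 30                           # input dimension, hidden dimension

W  = rng.uniform(0.0, 1.0 / n, (m, n))  # initialized in [0, 1/q], q = input dim
b1 = np.zeros(m)                        # first-layer bias
b2 = np.zeros(n)                        # second-layer bias

def forward(X):
    """Y = W^T f(W X + b1) + b2, with f = tanh."""
    H = np.tanh(W @ X + b1)             # hidden code
    return W.T @ H + b2                 # reconstruction Y

def cost(X):
    """J(W, b1, b2) = 1/2 (Y - X)^T (Y - X)."""
    Y = forward(X)
    return 0.5 * np.sum((Y - X) ** 2)
```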

3.1.2. Derivation

The stochastic gradient descent update rule for the weight matrix $W$ is

$$W := W - \alpha \frac{\partial J}{\partial W}.$$

Since $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^n$, the cost function can be written as a sum over dimensions:

$$J(W, b_1, b_2) = \frac{1}{2}(Y - X)^T (Y - X) = \frac{1}{2}\sum_{i=1}^{n}(Y_i - X_i)^2.$$

We can now apply the chain rule to compute the derivative of the cost function with respect to a weight parameter (note that $W \in \mathbb{R}^{m \times n}$, $b_1 \in \mathbb{R}^m$, and $b_2 \in \mathbb{R}^n$):

$$\frac{\partial J}{\partial W_{pq}} = \sum_{i=1}^{n} \frac{\partial}{\partial Y_i}\!\left[\frac{1}{2}\sum_{j=1}^{n}(Y_j - X_j)^2\right]\frac{\partial Y_i}{\partial W_{pq}} = \sum_{i=1}^{n}(Y_i - X_i)\,\frac{\partial Y_i}{\partial W_{pq}}.$$

The value of the $i$-th dimension of $Y$ can be written as

$$Y_i = \sum_{k=1}^{m} W_{ki}\, f\!\left(\sum_{l=1}^{n} W_{kl} X_l + b_{1k}\right) + b_{2i},$$

so that

$$\frac{\partial J}{\partial W_{pq}} = \sum_{i=1}^{n}(Y_i - X_i)\, W_{pi}\, f'\!\left(\sum_{j=1}^{n} W_{pj} X_j + b_{1p}\right) X_q + (Y_q - X_q)\, f\!\left(\sum_{j=1}^{n} W_{pj} X_j + b_{1p}\right).$$

To sum up, the update rules in matrix form are (the derivations for $b_1$ and $b_2$ are similar):

$$\frac{\partial J}{\partial W} = \big[\, W(Y - X) \circ f'(WX + b_1) \,\big] X^T + f(WX + b_1)\,(Y - X)^T,$$
$$\frac{\partial J}{\partial b_1} = W(Y - X) \circ f'(WX + b_1),$$
$$\frac{\partial J}{\partial b_2} = Y - X,$$

where $\circ$ denotes the element-wise product (Hadamard product).
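
The update rules translate directly into code. The following sketch (our illustration in NumPy, not the authors' implementation) performs one stochastic gradient descent step using the matrix-form gradients above; the optional weight-decay term anticipates Section 4.2, and its exact form is an assumption since the paper does not spell it out.

```python
import numpy as np

def sgd_step(W, b1, b2, X, alpha=0.02, decay=0.0):
    """One stochastic gradient descent update using the gradients above.

    W: (m, n) tied weight matrix, b1: (m,), b2: (n,), X: (n,) training vector.
    `decay` adds decay * W to dJ/dW, assuming a (decay/2) * ||W||^2 penalty;
    the paper mentions a weight-decay term but does not specify its form.
    """
    Z = W @ X + b1                   # pre-activation of the hidden layer
    H = np.tanh(Z)                   # f(WX + b1)
    Y = W.T @ H + b2                 # reconstruction
    R = Y - X                        # residual (Y - X)
    fprime = 1.0 - H ** 2            # tanh'(z) = 1 - tanh(z)^2

    dW  = np.outer((W @ R) * fprime, X) + np.outer(H, R) + decay * W
    db1 = (W @ R) * fprime
    db2 = R
    return W - alpha * dW, b1 - alpha * db1, b2 - alpha * db2
```

A finite-difference check of the cost against `dW` is an easy way to confirm the derivation numerically.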

3.2. Phrase representation

We want our modified autoencoder to learn the semantics of phrases consisting of two words, expecting each hidden unit to capture some meaningful pattern in the composition of the words. Mitchell and Lapata (2008) introduced vector-based models of semantic composition of two words and showed that such models work well on a phrase similarity task. In this project, we use a different form of composition of the words' vector representations: concatenation.

Figure 3. Phrase representation.

Since the autoencoder operates in an unsupervised way, we can take advantage of unbounded amounts of training data without any manual labeling effort. This is especially attractive for language modeling, since it is quite difficult to define what each dimension of a phrase feature vector should represent. After training the model, that is, after optimizing the weight matrix and the two bias vectors with stochastic gradient descent, we represent each phrase by computing $WX + b_1$ and compare the similarity between these vectors, which are intended to represent the semantics of the phrases. The similarity measures we use are Euclidean distance and cosine similarity, defined as

$$\mathrm{EUC}(u, v) = \lVert u - v \rVert_2, \qquad \mathrm{COS}(u, v) = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}.$$

4. Experiment

4.1. Visualization

To verify our model assumption, we visualized what the autoencoder learned. The model was trained by repeatedly picking one of 10 images at random and sampling a random 8x8 patch from the selected 512x512 image, with 64 input units and 30 hidden units. After training, each cell in Figure 4 shows the input patch that maximizes the activation of the corresponding hidden unit.

Figure 4. Visualization of trained hidden units.

Each hidden unit clearly captures a different pattern in the input features, which leads to effective dimensionality reduction.

4.2. Phrase similarity

The word embeddings preprocessed from Collobert and Weston's (2008) neural language model were used as the input features for this natural language processing experiment. In these data sets, each word is represented as an n-dimensional vector (n = 25, 50, 100, 200), and all the data sets are provided in both scaled and unscaled versions. As mentioned above, stochastic gradient descent is used to minimize the cost function $J(W, b_1, b_2)$. We tried learning rates of 0.0002, 0.002, 0.02, and 0.2. Furthermore, to prevent the weight matrix from growing too large, we added a weight-decay term to the cost function and tried decay parameters of 0.0001, 0.001, 0.01, and 0.1. The best-performing combination was a learning rate of 0.02 and a decay parameter of 0.001. We also varied the number of iterations, but training tends to converge before 2 million iterations. The bias parameters are initialized to zero vectors and the weight matrix to random values in the range [0, 1/q], where q is the dimension of the input feature. We tried different initialization approaches, but there was no notable difference among them.

The Pearson correlation is used to measure accuracy; it is computed between the similarities of the autoencoded phrases and the similarity levels manually rated by humans. For the Euclidean distance measure the correlation should be negative, and for the cosine similarity it should be as close to 1 as possible. For comparison, we also implemented several other models, including a sparse autoencoder. In the table below, AUTO denotes the similarity of autoencoded phrases, RAW the uncompressed concatenated phrase vectors, PAIRAVG the pairwise average similarity between corresponding words of the phrases, and SPARSEAUTO the sparse-autoencoded phrases.

Measure            Correlation
EUC, AUTO          -0.0842
COS, AUTO           0.0502
EUC, RAW            0.0341
COS, RAW           -0.0201
EUC, PAIRAVG        0.0341
COS, PAIRAVG        0.0063
EUC, SPARSEAUTO     0.0742
COS, SPARSEAUTO     0.0020

Table 1. Correlation between similarities computed by different models.

Note that the RAW and PAIRAVG models do not use the training data to learn parameters; they simply compute the similarity from the given words and their representations.
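
The evaluation protocol of Section 4.2 can be sketched as follows (a minimal illustration, assuming hypothetical `phrase_pairs` and `human_ratings` inputs; `W` and `b1` are the trained tied-weight parameters from the earlier sketches):

```python
import numpy as np

def evaluate(W, b1, phrase_pairs, human_ratings):
    """Pearson correlation between autoencoded similarity and human ratings.

    phrase_pairs: list of (x1, x2) concatenated-word input vectors
    human_ratings: one manually assigned similarity score per pair
    (Both are hypothetical inputs used only to illustrate the protocol.)
    """
    def code(x):                     # compressed phrase representation W x + b1
        return W @ x + b1

    euc, cos = [], []
    for x1, x2 in phrase_pairs:
        u, v = code(x1), code(x2)
        euc.append(np.linalg.norm(u - v))                            # EUC(u, v)
        cos.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))  # COS(u, v)

    ratings = np.asarray(human_ratings, dtype=float)
    pearson_euc = np.corrcoef(euc, ratings)[0, 1]   # should come out negative
    pearson_cos = np.corrcoef(cos, ratings)[0, 1]   # closer to 1 is better
    return pearson_euc, pearson_cos
```
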
5. Discussion

Even though the Pearson correlation values of our autoencoder model are not large compared to the other measures, the model consistently showed a negative correlation under the Euclidean distance measure across various combinations of parameters such as learning rate and weight decay. Under the cosine similarity measure it also performed well, with a relatively higher positive correlation than the other models. On the other hand, the raw representation of phrases, which is just the concatenation of two word vectors, showed inconsistent, nearly random correlations across different parameter combinations, as did the pairwise average similarity between corresponding words. From this result, we conclude that these two ways of representing a two-word phrase fail to capture the essential parts of the input features, while the autoencoder model shows a consistent correlation pattern.

Although the correlation pattern is consistent, its absolute value is not large enough to conclude that there is a strong correlation. Several factors may have affected the experimental results. First of all, the training data set is not large enough to cover the higher-dimensional input features, such as the 100- or 200-dimensional Collobert and Weston embeddings. Since unsupervised learning lets us exploit large amounts of unlabeled, unprocessed data, scaling up the training data is one direction for future work following this research.

Furthermore, we could use a variety of other feature vectors to represent each word, not only Collobert and Weston's neural language model. Since the autoencoder tries to capture interesting patterns in the concatenated words, we believe that uncompressed raw vector representations, in which each dimension has a specific meaning such as POS tag, frequency, or other syntactic or semantic information, would also be suitable inputs for this model. Finally, a variety of composition methods for building phrase representations from word vectors could be tried as input features, including Mitchell and Lapata's additive and multiplicative models.

6. Acknowledgement

Thanks to Richard Socher for useful discussion.

References

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML '08).

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.

R. Ando and T. Zhang. 2005. A high-performance semi-supervised learning method for text chunking. In Proceedings of ACL.

J. Suzuki and H. Isozaki. 2008. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of ACL-08: HLT, pp. 665-673.

Richard Socher. 2010. Deep Chinese Character. Department of Computer Science, Stanford University.

Andrew Ng. 2010. Sparse Autoencoder. Department of Computer Science, Stanford University.