CSCI 315: Artificial Intelligence through Deep Learning


CSCI 315: Artificial Intelligence through Deep Learning, W&L Winter Term 2017, Prof. Levy. Autoencoder Networks: Embedding and Representation Learning (Chapter 6)

Motivation Representing words and other data as arbitrary one-hot codes is convenient for building a simple dictionary, but creates two problems: (1) it ignores similarity between word meanings (cat should be closer to mouse than either is to bell), making tasks like language learning much harder; (2) one-hot codes become computationally impractical once we approach a realistic vocabulary (20-30k words), because for each word the softmax would need to compute the probability distribution over tens of thousands of words: $y_i = f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

Data Compression So what we'd like to be able to do is pick an arbitrary, reasonable size for our codes (maybe a few hundred units), and compress the entire vocabulary into codes of that size. Such a compression (a.k.a. embedding) should also put similar words near each other in the vector space. [Figure: words such as the, if, bell, collar, frighten, chase, speak, cat, dog, mouse plotted as points in a 2-D embedding space, with semantically similar words clustered together]

Elman (1990) Revisited Elman's SRN showed that a neural network can "find structure in time" (discover semantic similarities among words) by being trained to predict the next word from the current and previous words. SRN is a variety of auto-encoder network: a network where the input and output layers represent the same thing (hence, same # of units), and the cool stuff happens on the hidden layer.

The Simplest Autoencoder Although Buduma cites Hinton & Salakhutdinov (2006) for autoencoders, a three-layer backprop network (available since 1986) can serve as a simple autoencoder: they used to be called auto-associators; e.g., Pollack's RAAM (1990). Let's build a simple 8-3-8 auto-associator using our Backprop class.
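Here is a minimal sketch of that 8-3-8 network in plain NumPy rather than the Backprop class itself (whose interface isn't shown here); the learning rate and epoch count are arbitrary choices that may need tuning:

```python
import numpy as np

# Eight one-hot patterns; the target for each pattern is the pattern itself.
X = np.eye(8)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (8, 3)), np.zeros(3)   # input -> hidden (the 3-unit code)
W2, b2 = rng.normal(0, 0.5, (3, 8)), np.zeros(8)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for epoch in range(20000):
    H = sigmoid(X @ W1 + b1)          # hidden codes
    Y = sigmoid(H @ W2 + b2)          # reconstructed one-hot outputs
    dY = (Y - X) * Y * (1 - Y)        # backprop through the output sigmoid
    dH = (dY @ W2.T) * H * (1 - H)    # backprop through the hidden sigmoid
    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

# The hidden codes should round to (roughly) a distinct 3-bit pattern per input
print(np.round(sigmoid(X @ W1 + b1)))
```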

As we can see, our simple 8-3-8 auto-associator has "discovered" (devised, learned) a binary representation for the one-hot codes on its hidden layer. Now we are ready to look at Hinton & Salakhutdinov (2006). First, let's predict: How did their network differ from our simple one? What dataset did they use?

All layers are fully connected, with a sigmoidal activation function. Why only two units in the innermost hidden layer?
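A sketch of that kind of deep autoencoder in Keras; the layer sizes follow the spirit of Hinton & Salakhutdinov's symmetric encoder/decoder but are illustrative here, and their layer-by-layer pre-training scheme is omitted:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Symmetric, fully connected encoder/decoder with sigmoidal units,
# squeezing 784-pixel images down to a 2-unit code for visualization.
autoencoder = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(1000, activation="sigmoid"),
    layers.Dense(500, activation="sigmoid"),
    layers.Dense(250, activation="sigmoid"),
    layers.Dense(2),                          # innermost 2-unit code layer
    layers.Dense(250, activation="sigmoid"),
    layers.Dense(500, activation="sigmoid"),
    layers.Dense(1000, activation="sigmoid"),
    layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
```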

Dimensionality Reduction People can only visualize data in two or three dimensions. Several techniques exist for reducing high-dimensional data to such low dimensions; the most popular, dating back to 1933, is Principal Component Analysis (PCA). The figure above shows a 2 → 1 dimension reduction, but PCA can be used for any number of dimensions.

Limitations of PCA As Fig. 6-2 shows, PCA is a fundamentally linear technique: it works by re-aligning the data along a few mutually orthogonal (right-angled) axes, like X, Y or X, Y, Z. This works well for many kinds of data; however, a simple example shows the limitations of the linearity assumption. Does this non-linearity problem remind you of anything?

PCA vs. DL: The MNIST Challenge Reducing 784 dimensions to two: (On the other hand, PCA can be coded in two or three lines of Python!)
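To back up that parenthetical claim, here is roughly what those lines might look like with scikit-learn, assuming the MNIST images have already been flattened into an array X of shape (n_samples, 784):

```python
from sklearn.decomposition import PCA

# Project the 784-dimensional MNIST vectors onto their top two principal components
X_2d = PCA(n_components=2).fit_transform(X)
```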

Autoencoder as De-noiser (a.k.a. Cleanup Memory) Give the trained network a degraded image, and see what comes out the other end:
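To make the idea concrete at toy scale (rather than with the MNIST images on the slide), we can continue from the 8-3-8 NumPy sketch above: corrupt a one-hot code with noise and pass it through the trained network.

```python
# Continuing from the 8-3-8 sketch: degrade a one-hot code, then reconstruct it.
degraded = X[3] + rng.normal(0, 0.3, 8)
cleaned = sigmoid(sigmoid(degraded @ W1 + b1) @ W2 + b2)
print(np.round(cleaned, 2))   # the largest output should again be at position 3
```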

word2vec: Word Prediction Revisited Predicting the next word from the current word may be too narrow a view of how to find structure in time. Looking at a window ("bag") of a few words before and after the current word can be an even better way of discovering relationships: We saw some LIONS and ELEPHANTS at the ZOO. The ZOO had no LIONS, but lots of ELEPHANTS. ELEPHANTS and LIONS live in the ZOO. Since sequence order no longer matters, we don't need a recurrent net (and its associated complexity) to learn the relationships. This is not a novel idea! Before looking more at word2vec, let's look at an earlier approach...

Latent Semantic Analysis: The Ultimate Bag-of-Words Algorithm LSA is an extremely simple matrix-based method: one word per row, one document per column, where each cell (j, k) shows the number of times word j appears in document k. Applying a clever transformation (Singular Value Decomposition) reveals latent ("hidden") relationships between words and the documents in which they should have appeared!
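A minimal sketch of the LSA recipe with scikit-learn, using the zoo sentences from the previous slide as toy documents (note that CountVectorizer builds the transpose of the layout above, one document per row, and the choice of two latent dimensions is arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["We saw some LIONS and ELEPHANTS at the ZOO",
        "The ZOO had no LIONS, but lots of ELEPHANTS",
        "ELEPHANTS and LIONS live in the ZOO"]

# Raw word counts: one row per document, one column per word
counts = CountVectorizer().fit_transform(docs)

# Truncated SVD keeps only the top latent dimensions
lsa = TruncatedSVD(n_components=2)
doc_vecs = lsa.fit_transform(counts)   # document coordinates in the latent space
word_vecs = lsa.components_.T          # word coordinates in the latent space
```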

[LSA example slides; figure source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.7422&rep=rep1&type=pdf]

word2vec The architecture is simple: a classic input-hidden-output (n-h-n) auto-encoder. Two flavors: Continuous Bag Of Words (CBOW) predicts a single word (LIONS) from its neighboring words (ZOO, ELEPHANTS), and is useful for smaller datasets; Skip-Gram predicts the neighbors (ZOO, ELEPHANTS) from a single word (LIONS), and is useful for larger datasets. But this still leaves us with the second original problem: given the low-dimensional embedding of a word, how can the output layer (decoder) compute the softmax over n > 10K words?
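As a concrete sketch of the two flavors, here is how they might be invoked with the gensim library (gensim isn't part of the chapter; the toy corpus, window, and dimensionality are arbitrary, and the parameter names assume gensim 4.x):

```python
from gensim.models import Word2Vec

sentences = [["we", "saw", "some", "lions", "and", "elephants", "at", "the", "zoo"],
             ["the", "zoo", "had", "no", "lions", "but", "lots", "of", "elephants"],
             ["elephants", "and", "lions", "live", "in", "the", "zoo"]]

# sg=0 -> CBOW (predict a word from its window); sg=1 -> Skip-Gram (predict the window from a word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# With a real corpus, nearest neighbors become meaningful; on this toy corpus they are mostly noise
print(model.wv.most_similar("lions"))
```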

word2vec: Noise-Contrastive Estimation Instead of trying to compute the softmax over all n vocabulary words, word2vec compares the actual input/target pattern (MONKEY → ZOO, LIONS) with a randomly selected bogus ("noise") pattern (MONKEY → DOLLAR, CLOCK). The closer the obtained output is to the bogus output, the higher the value of the loss function, and the greater the adjustment on the weights from the hidden (embedding) layer to the output (one-hot) layer.
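A back-of-the-envelope sketch of this loss in NumPy, using negative sampling (the simplified form of noise-contrastive estimation that word2vec actually uses), with random vectors standing in for the learned embedding and output weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim = 50
v_center = rng.normal(size=dim)        # embedding of the input word (e.g. MONKEY)
v_true   = rng.normal(size=(2, dim))   # output vectors of the true targets (e.g. ZOO, LIONS)
v_noise  = rng.normal(size=(2, dim))   # output vectors of sampled bogus words (e.g. DOLLAR, CLOCK)

# Push the true pairs' scores up and the noise pairs' scores down;
# only these few output rows get adjusted, never all n > 10K of them.
loss = -(np.log(sigmoid(v_true @ v_center)).sum() +
         np.log(sigmoid(-v_noise @ v_center)).sum())
print(loss)
```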

word2vec: Analogy as Linear Transformation [Figure source: https://www.tensorflow.org/images/linear-relationships.png]
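Given a word2vec model trained on a large corpus (the toy model sketched above would not suffice), the analogy in the figure can be queried directly; this again assumes a gensim model called model:

```python
# "king" - "man" + "woman" ≈ "queen", found by cosine similarity in the embedding space
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```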