A Distributional Representation Model For Collaborative Filtering

Similar documents
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Assignment 1: Predicting Amazon Review Ratings

Deep Neural Network Language Models

Python Machine Learning

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention

Lecture 1: Machine Learning Basics

CSL465/603 - Machine Learning

Attributed Social Network Embedding

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Calibration of Confidence Measures in Speech Recognition

Second Exam: Natural Language Parsing with Neural Networks

A study of speaker adaptation for DNN-based speech synthesis

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

A Deep Bag-of-Features Model for Music Auto-Tagging

A deep architecture for non-projective dependency parsing

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Test Effort Estimation Using Neural Network

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION

Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition

A Review: Speech Recognition with Deep Learning Methods

Evolutive Neural Net Fuzzy Filtering: Basic Description

On the Formation of Phoneme Categories in DNN Acoustic Models

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

Improvements to the Pruning Behavior of DNN Acoustic Models

Learning Methods for Fuzzy Systems

Generative models and adversarial training

Distributed Learning of Multilingual DNN Feature Extractors using GPUs

Deep Recurrent Neural Networks for Aspect-Oriented Sentiment Analysis of User Reviews in Multiple Languages

LIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting

Product Feature-based Ratings for Opinion Summarization of E-Commerce Feedback Comments

Dropout improves Recurrent Neural Networks for Handwriting Recognition

Reducing Features to Improve Bug Prediction

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT

Australian Journal of Basic and Applied Sciences

Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski

A Neural Network GUI Tested on Text-To-Phoneme Mapping

Human Emotion Recognition From Speech

On the Combined Behavior of Autonomous Resource Management Agents

EE-589 Introduction to Neural Networks: Course Outline, Course Grading, Where to Go for Help, Academic Integrity

Knowledge Transfer in Deep Convolutional Neural Nets

Model Ensemble for Click Prediction in Bing Search Ads

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH

Advanced Machine Learning with Python by John Hearty

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

INPE São José dos Campos

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Comment-based Multi-View Clustering of Web 2.0 Items

Chapter 10: Applying Topic Modeling to Forensic Data. Alta de Waal, Jacobus Venter and Etienne Barnard

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Artificial Neural Networks written examination

Probabilistic Latent Semantic Analysis

Rule Learning With Negation: Issues Regarding Effectiveness

A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

Time series prediction

Word Segmentation of Off-line Handwritten Documents

Learning Methods in Multilingual Speech Recognition

CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

ADVANCES IN DEEP NEURAL NETWORK APPROACHES TO SPEAKER RECOGNITION

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak

Dual-Memory Deep Learning Architectures for Lifelong Learning of Everyday Human Behaviors

TRANSFER LEARNING IN MIR: SHARING LEARNED LATENT REPRESENTATIONS FOR MUSIC AUDIO CLASSIFICATION AND SIMILARITY

DIRECT ADAPTATION OF HYBRID DNN/HMM MODEL FOR FAST SPEAKER ADAPTATION IN LVCSR BASED ON SPEAKER CODE

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Word Embedding Based Correlation Model for Question/Answer Matching

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

Lip Reading in Profile

Rule Learning with Negation: Issues Regarding Effectiveness

Georgetown University at TREC 2017 Dynamic Domain Track

Online Updating of Word Representations for Part-of-Speech Tagging

A Vector Space Approach for Aspect-Based Sentiment Analysis

The 9th International Scientific Conference eLearning and Software for Education, Bucharest, April 25-26

Semantic Segmentation with Histological Image Data: Cancer Cell vs. Stroma

Speech Emotion Recognition Using Support Vector Machine

Learning to Schedule Straight-Line Code

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH

Semi-Supervised Face Detection

A Framework for Clustering Cardiac Patients' Records Using Unsupervised Learning Techniques

Data Fusion Through Statistical Matching

Speaker Identification by Comparison of Smart Methods

A Distributional Representation Model For Collaborative Filtering

Zhang Junlin, Cai Heng, Huang Tongwen, Xue Huiping
Chanjet.com
{zhangjlh,caiheng,huangtw,xuehp}@chanjet.com

Abstract

In this paper, we propose a concise deep learning approach to collaborative filtering that jointly models distributional representations of users and items. The proposed framework obtains better performance than current state-of-the-art algorithms, which makes the distributional representation model a promising direction for further research in collaborative filtering.

1. Introduction

Recommender systems are best known for their use on e-commerce websites. Because better recommendation algorithms bring extra profit to these websites, they have attracted attention from both industry and the academic community. Collaborative Filtering (CF) is one of the most popular recommendation approaches. It uses user feedback to infer relations between users, between items, and ultimately to relate users to the items they like. At the same time, recent years have witnessed breakthroughs in applying deep learning to object recognition [1][2] and speech recognition [3][4]. NLP is another field in which deep learning is widely used. Inspired by the successful application of deep learning to NLP tasks [5][6][7][8][9], especially the distributional representation method, we explore distributional representations of users and items for collaborative filtering.

In this paper, we propose a framework that combines a three-layer neural network with distributional representations of users and items for collaborative filtering. By explicitly encoding features into vectors, we can explore the complex nonlinear interdependencies of features through this neural network. Though the method seems simple, our experiment results show that it is effective in the recommendation domain.

The main contribution of this work can be summarized as follows: we propose a distributional representation approach for recommender systems which, to the best of our knowledge, is the first study to introduce the word embedding concept into collaborative filtering, and the experiment results show that it is a promising direction for further research.

Section 2 describes the distributional representation framework for collaborative filtering. In Section 3, we present experiment results indicating that

the proposed method outperforms many commonly used algorithms in this research field. Section 4 presents a brief overview of related work, and the final section concludes the paper.

2. Distributional Representation Model For Recommendation

[Fig 1. Neural network structure of the distributional representation model: user i and item j are mapped through a lookup table, their vectors are concatenated to form the input layer, which feeds a hidden layer and then the output layer.]

Collaborative Filtering is one of the most popular approaches to building recommendation systems. CF mostly relies on past user behavior such as previous transactions or product ratings (for convenience, we will refer to a transaction or product as an "item" in the rest of this paper). To identify new user-item associations, it analyzes relationships between users and interdependencies among items. Our proposed model explicitly transforms each user and item into a vector that encodes latent features, and then uses the concatenated vectors as the neural network's input to explore the complex nonlinear interdependencies of features. The proposed model treats CF as a regression problem. Figure 1 shows the main structure of the distributional representation model.

2.1 Transforming the user and item into vectors

Each user $i \in D_{user}$ is embedded into a $d$-dimensional space through the lookup table $LT_W(\cdot)$:

$$LT_W(i) = W^U_i$$

where $W^U \in \mathbb{R}^{d \times |D_{user}|}$ is a parameter matrix to be learned through training, $W^U_i \in \mathbb{R}^d$ is the $i$-th column of $W^U$, and $d$ is the vector size, defined as a hyper-parameter. Likewise, each item $j \in D_{item}$ is represented as a $d$-dimensional vector through the lookup table $LT_W(\cdot)$:

$$LT_W(j) = W^I_j$$

where $W^I \in \mathbb{R}^{d \times |D_{item}|}$ is the parameter matrix and $W^I_j \in \mathbb{R}^d$ is the $j$-th column of $W^I$. When user $i$ and item $j$ are given as input to the recommendation system to predict the score of item $j$ for user $i$, we concatenate the user vector and the item vector into a longer vector $\{W^U_i, W^I_j\}$ by applying the lookup table to each of them.

2.2 Neural Network Structure

Our proposed neural network has three layers: an input layer, a hidden layer, and an output layer. As mentioned above, the input layer is the concatenated vector $\{W^U_i, W^I_j\}$. Each node of the hidden layer is fully connected to the nodes of the input layer and transforms the features encoded in the user and item vectors into a real value through a nonlinear function. We use the hyperbolic tangent (tanh) as the nonlinearity:

$$f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

The tanh function is a rescaled version of the sigmoid, with output range $[-1, 1]$ instead of $[0, 1]$. Here $z$ is a linear function of the input vector $\{W^U_i, W^I_j\}$ and the weight matrix $W_{L1}$ that connects the input layer to the hidden layer. The output of the hidden layer is used as features for a logistic regression classifier (the output layer), which returns a probability interpreted as the predicted score of item $j$ for user $i$. The sigmoid function scales the output to $[0, 1]$, and a bigger score means stronger preference. However, real-life applications usually prefer a score in the range $[0, k]$, say $k = 5$. We can rescale the output of the neural network to the right range simply by multiplying the result by $k$.
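For concreteness, here is a minimal NumPy sketch of the forward pass described in Sections 2.1 and 2.2: lookup-table embedding, concatenation, a tanh hidden layer, and a sigmoid output rescaled by $k$. The random initialization and the omission of bias terms are illustrative assumptions; $d = 24$, 40 hidden nodes, and $k = 5$ are the best-run values reported later in Section 3.2, and the user/item counts are the MovieLens 1M sizes from Section 3.1.

```python
# Sketch of the DR model forward pass (Sections 2.1-2.2).
# Initialization scale and the absence of bias terms are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items = 6040, 3900   # MovieLens 1M sizes (Section 3.1)
d, n_hidden, k = 24, 40, 5      # best-run hyper-parameters; k = rating scale

# Lookup tables: one d-dimensional column per user / item.
W_U = rng.normal(scale=0.1, size=(d, n_users))        # W^U
W_I = rng.normal(scale=0.1, size=(d, n_items))        # W^I
W_L1 = rng.normal(scale=0.1, size=(n_hidden, 2 * d))  # input -> hidden
W_L2 = rng.normal(scale=0.1, size=(1, n_hidden))      # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(i, j):
    """Predicted score of item j for user i, rescaled to [0, k]."""
    x = np.concatenate([W_U[:, i], W_I[:, j]])  # {W_i^U, W_j^I}
    h = np.tanh(W_L1 @ x)                       # hidden layer (tanh)
    p = sigmoid(W_L2 @ h)                       # output layer in [0, 1]
    return k * p.item()                         # rescale to [0, k]

print(predict(0, 0))  # untrained model: a score near k/2
```

Because $W^U$ and $W^I$ are trained together with $W_{L1}$ and $W_{L2}$, the user and item embeddings are learned end-to-end rather than fixed in advance.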

2.3 Training

From Section 2.2, the following parameters need to be trained:

$$\theta = \{W^U, W^I, W_{L1}, W_{L2}\}$$

Here $W_{L2}$ is the weight matrix between the hidden layer and the output layer. Users' rating records are used as the training set, and each training example takes the form of a triplet $[U_i, I_j, y]$, where $y$ is the rating given by user $U_i$ to item $I_j$. The full learning objective takes the form of structural risk minimization, which tries to minimize the prediction error during training:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \big( f(\theta) - y \big)^2 + \lambda \|\theta\|_2^2$$

where $f(\theta)$ is the prediction function of the distributional representation model, which applies the tanh function followed by the sigmoid function. We use standard L2 regularization of all parameters, weighted by the hyper-parameter $\lambda$. The model is trained with standard back-propagation, taking derivatives with respect to the four groups of parameters, and we use mini-batched L-BFGS for optimization, which converges to a local optimum of the objective function.
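Below is a small, self-contained sketch of this objective and of the RMSE metric used for evaluation in Section 3. The value lam = 0.01 and the toy ratings are illustrative assumptions (the paper treats $\lambda$ as a tuned hyper-parameter and does not report it), and the L-BFGS loop itself is omitted.

```python
# Sketch of the learning objective J(theta) (Section 2.3) and the RMSE
# metric (Section 3.2). lam = 0.01 and the toy data are assumptions;
# the paper does not report the lambda it used.
import numpy as np

def objective(preds, ratings, params, lam=0.01):
    """J(theta) = 1/2 * sum_i (f(theta) - y)^2 + lambda * ||theta||_2^2."""
    squared_error = 0.5 * np.sum((preds - ratings) ** 2)
    l2_penalty = lam * sum(np.sum(w ** 2) for w in params)
    return squared_error + l2_penalty

def rmse(preds, ratings):
    """Root mean squared error over held-out ratings."""
    return float(np.sqrt(np.mean((preds - ratings) ** 2)))

# Toy usage: three held-out ratings vs. model predictions.
y = np.array([4.0, 3.0, 5.0])
f = np.array([3.8, 3.2, 4.6])
params = [np.ones((2, 2)), np.ones(3)]  # stands in for {W^U, W^I, W_L1, W_L2}
print(objective(f, y, params), rmse(f, y))
```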

3. Experiment

3.1 Datasets

To evaluate the proposed model, we use the MovieLens 1M [10] and EachMovie [11] datasets. The MovieLens 1M dataset contains 1,000,209 ratings of approximately 3,900 movies by 6,040 MovieLens users, and EachMovie contains 2,811,983 ratings entered by 72,916 users for 1,628 different movies. For all experiments, ninety percent of the rating data was randomly chosen for training and the remaining 10% was used as the test set.

3.2 Experiment Results

RMSE is a commonly used evaluation metric for recommendation systems, and we use it throughout the experiments. To compare the performance of the distributional representation model (DR model) with state-of-the-art CF algorithms, we use Mahout [12] as the test bed. The most commonly used recommendation algorithms, including the classical KNN-based models, SlopeOne, ALS, SVD++, and the improved KNN-based model proposed by Koren [13], were carefully tuned to perform as well as we could make them. The experiment results are listed in Table 1. The best run of the DR model has the following parameters: the lengths of the user vector and the item vector are both 24, and the number of hidden-layer nodes is 40. These results show consistently good performance of the DR model on both datasets, which makes the distributional representation model a promising direction for further research in collaborative filtering.

Table 1. RMSE results on the MovieLens and EachMovie datasets

Model                     RMSE (MovieLens)   RMSE (EachMovie)
User-Based KNN            1.0476             0.2930
Item-Based KNN            1.0084             0.2618
SlopeOne                  0.9370             0.2486
ALS                       0.9550             0.2628
SVD++                     0.9495             0.2991
Koren's Item-based KNN    0.9234             0.2564
DR Model                  0.9037             0.2409

4. Related Works

Many popular CF algorithms have been proposed in recent years. Among them, the improved item-based KNN proposed by Koren [13] and latent factor CF [14] show clear performance advantages. Latent factor CF models explain ratings by characterizing both items and users in terms of factors inferred from rating patterns. One of the most successful realizations of latent factor models is matrix factorization [15], such as SVD and SVD++. Our proposed distributional representation model can be categorized as latent factor CF because it explicitly encodes the latent features of users and items into word-embedding-style vectors. Compared with SVD++-like matrix factorization, the distributional representation model directly combines the latent factor vectors with a neural network structure, so it can explore the complex nonlinear interdependencies of features within one framework. Among neural network and deep learning approaches to CF, the RBM [16][17] and Wang's model [18] use different network structures and optimization targets than our proposed model.

5. Conclusion

We present in this paper a concise distributional representation model for collaborative filtering. To the best of our knowledge, this is the first study to apply word embeddings to recommendation systems. The experiment results show that this model outperforms state-of-the-art algorithms in many cases, which makes the distributional representation model a promising direction for further research in collaborative filtering. If we introduce tensors into the DR model, it is natural to regard the current model as a special case of a tensor-based DR model. We will explore this more general tensor-based deep learning model in future work.

References

[1] Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011). The manifold tangent classifier. In NIPS 2011.
[2] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012).
[3] Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 33-42.
[4] Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14-22.
[5] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167. ACM.
[6] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537.
[7] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384-394, 2010.
[8] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, March 2003.
[9] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations via global context and multiple word prototypes. In ACL 2012, 2012.
[10] MovieLens dataset, http://grouplens.org/datasets/movielens/
[11] EachMovie dataset, http://grouplens.org/datasets/eachmovie/
[12] Apache Mahout, http://mahout.apache.org/
[13] R. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In IEEE International Conference on Data Mining (KDD-Cup'07), 2007.
[14] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426-434, New York, NY, USA, 2008. ACM.
[15] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.
[16] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791-798. ACM.
[17] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, pages 448-455, 2009.
[18] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. arXiv:1409.2944, 2014.