Deanonymizing Quora Answers

Pranav Jindal, pranavj@stanford.edu
Ashwin Paranjape, ashwinpp@stanford.edu

1 Introduction

Quora is a knowledge-sharing website where users can ask and answer questions, with the option of doing so anonymously. We investigate the problem of author identification for Quora answers using deep learning techniques from natural language processing.

1.1 Problem Statement

We aim to achieve high precision on the task of identifying users from their writing, with the end goal of recognizing the authors of anonymous answers on Quora. Previous work indicates that writing style carries essential cues about authorship, and we believe deep learning is a powerful tool for extracting features that distinguish between the writing styles of different people. The work also applies to several other tasks, such as forensic linguistics, email spam detection, and identity tracing in cyber forensics.

1.2 Background Reading

There has been a fair amount of interest in author identification in previous NLP work, most of it focusing on manually engineered features:

1. Comparing Frequency and Style-Based Features for Twitter Author Identification [1]: Examines author identification in short texts, focusing on messages retrieved from Twitter. To determine the most effective feature set for recognizing authors, the authors compare bag-of-words and style-marker features and use SVMs for the classification task.

2. A Comparative Study of Language Models for Book and Author Recognition [2]: Evaluates similarity between documents and authors, and shows that syntactic features are less successful than function words for author attribution.

3. A Survey of Modern Authorship Attribution Methods [3]: Discusses how the field has developed substantially over the past few decades by taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing.

1.3 Dataset

To generate a small version of the problem, we select 200 writers from the list of top Quora writers and use RSS feeds to build a dataset containing exactly 50 answers per user.
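
A minimal sketch of the collection step, assuming the feedparser and BeautifulSoup libraries; the feed URL pattern is a placeholder, not necessarily Quora's actual one, and the real pipeline may have differed:

```python
import feedparser
from bs4 import BeautifulSoup

def fetch_answers(username, max_answers=50):
    """Fetch up to max_answers answers from a writer's RSS feed.

    The feed URL pattern below is hypothetical; the real Quora
    feed location may differ.
    """
    feed = feedparser.parse(f"https://www.quora.com/{username}/rss")
    answers = []
    for entry in feed.entries[:max_answers]:
        # Feed bodies are HTML; keep only the visible text.
        text = BeautifulSoup(entry.get("summary", ""), "html.parser").get_text()
        answers.append(text)
    return answers

# dataset = {user: fetch_answers(user) for user in top_writers}
```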

Figure 1: Dataset details

Quora Top Writers are a group of people with recognized expertise, knowledge, and authenticity. Some are experts in specific fields, while others are simply great writers with a talent for describing the human condition and the world around us. Selecting these people ensures high-quality content in the form of a large number of long answers per author.

Figure 2: Data statistics. (a) Word length vs. frequency; (b) average answer length distribution.

We observe that most authors in the dataset average fewer than 150 words per answer, though a few write very long answers. Also, not surprisingly, the word length vs. frequency curve follows a power law.

2 Technical Approach and Models

The labels are hidden for a fraction of answers for every author in order to test our final model; the remaining dataset is used to train on the task of author attribution. We used both machine learning models on engineered features and deep learning models for the task.

2.1 Model 1: Style marker features

The following style marker features, commonly used in previous author-identification work, were used for the classification task (a sketch of the feature extraction follows the list):

1. Number of words in the answer
2. Fraction of words that are punctuation
3. Average word length
4. Standard deviation of word length
5. Number of sentences in the answer
6. Average sentence length
7. Number of digits in the answer
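
A minimal sketch of extracting these seven features with the Python standard library; the punctuation and sentence-splitting rules here are simplifying assumptions, not necessarily the exact ones used in the project:

```python
import re
import string
import statistics

def style_markers(answer: str):
    """Compute the seven style-marker features for one answer."""
    tokens = answer.split()
    words = [t.strip(string.punctuation) for t in tokens]
    word_lengths = [len(w) for w in words if w]
    # Crude sentence split on runs of . ! ? followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s+", answer) if s.strip()]
    n_words = len(tokens)
    n_punct = sum(all(c in string.punctuation for c in t) for t in tokens)
    return [
        n_words,                                                   # 1. word count
        n_punct / max(n_words, 1),                                 # 2. punctuation fraction
        statistics.mean(word_lengths) if word_lengths else 0.0,    # 3. mean word length
        statistics.pstdev(word_lengths) if word_lengths else 0.0,  # 4. std of word length
        len(sentences),                                            # 5. sentence count
        n_words / max(len(sentences), 1),                          # 6. mean sentence length
        sum(c.isdigit() for c in answer),                          # 7. digit count
    ]
```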

2.2 Model 2: Word frequency model

Each answer is modeled by a feature vector whose length is the size of the vocabulary and which contains the count of each vocabulary word in that answer. The vocabulary is varied by incrementally adding tokens in order of decreasing frequency; using the complete vocabulary works best. A sketch of this featurization follows.
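
A minimal sketch of the word-frequency featurization using scikit-learn, where max_features keeps only the most frequent tokens and so mirrors the vocabulary-size sweep; None corresponds to the full vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

def count_features(train_texts, test_texts, vocab_size=None):
    """Bag-of-words count vectors over the vocab_size most frequent tokens.

    vocab_size=None uses the complete training vocabulary, which
    worked best in our experiments.
    """
    vectorizer = CountVectorizer(max_features=vocab_size)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    return X_train, X_test

# Example sweep over increasing vocabulary sizes:
# for v in (1000, 5000, 20000, None):
#     X_tr, X_te = count_features(train_answers, test_answers, v)
```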

2.3 Model 3: LSTM with mean pooling

The LSTM model is a recurrent neural network with memory units, allowing each cell to remember or forget its previous state as needed.

Notation: x_t is the input to the memory cell layer at time t; W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, and V_o are weight matrices; b_i, b_f, b_c, and b_o are bias vectors; ⊙ denotes element-wise multiplication.

Figure 3: LSTM unit

Memory unit update: First, we compute i_t, the input gate activation, and C̃_t, the candidate value for the states of the memory cells at time t:

    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (1)
    C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (2)

Second, we compute f_t, the activation of the memory cells' forget gates at time t:

    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (3)

Given the input gate activation i_t, the forget gate activation f_t, and the candidate state value C̃_t, we can compute C_t, the memory cells' new state at time t:

    C_t = i_t ⊙ C̃_t + f_t ⊙ C_{t-1}    (4)

With the new state of the memory cells, we can compute the value of their output gates and, subsequently, their outputs:

    o_t = σ(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)    (5)
    h_t = o_t ⊙ tanh(C_t)    (6)

Figure 4: Mean pooling with LSTMs

Mean pooling: The model is composed of a single LSTM layer followed by an average pooling layer and a logistic regression layer, as illustrated in Figure 4. From an input sequence x_0, x_1, ..., x_n, the memory cells in the LSTM layer produce a representation sequence h_0, h_1, ..., h_n. This sequence is averaged over all time steps, yielding a representation h, which is finally fed to a logistic regression layer whose target is the class label associated with the input sequence.
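
A minimal sketch of this architecture in Keras; the vocabulary, embedding, and hidden dimensions are assumptions, GlobalAveragePooling1D plays the role of the mean-pooling layer, and the softmax Dense layer acts as the logistic regression:

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 128      # assumed embedding dimension
HIDDEN_DIM = 128     # assumed LSTM state size
NUM_AUTHORS = 200

model = tf.keras.Sequential([
    # Randomly initialized embeddings (which outperformed pre-trained
    # vectors in our experiments; see Section 3.2.2).
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Single LSTM layer returning h_0 ... h_n for every time step.
    tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True),
    # Mean pooling over time steps gives the representation h.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Logistic regression layer over the author classes.
    tf.keras.layers.Dense(NUM_AUTHORS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```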

3 Results

The dataset of 200 authors with 50 answers per author was split 80:10:10 into train, validation, and test sets.

3.1 Visualization (t-SNE)

t-SNE is a tool for visualizing high-dimensional data: it converts similarities between data points into joint probabilities and minimizes the KL divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. We visualize the two sets of baseline features using t-SNE to show how well each feature set separates the classes. Each scatter point represents an answer, and its color represents the author.

Figure 5: t-SNE visualizations. (a) Style markers; (b) unigram features.

The unigram features clearly do a much better job of separating the data than the style markers, confirming the findings in [1, 2].

3.2 Evaluation Metrics

We use top-k accuracy to evaluate the performance of our models: a prediction is counted as correct if the true author is among the model's top k predictions for a given answer. Random guessing gives a top-1 accuracy of 0.505% (198 classes).

3.2.1 Traditional Machine Learning Methods

Various classifiers, including random forests, multinomial Naive Bayes, AdaBoost, and gradient boosting, were tested on both feature sets. The best performance was achieved by random forests; multinomial Naive Bayes also achieved reasonable accuracy.

Dataset      Top-1 Accuracy   Top-5 Accuracy   Top-10 Accuracy
Training     45.51            74.14            84.70
Validation   6.71             17.8             27.12
Test         6.63             17.91            27.22

Table 1: Results with model 1 features for the best classifier (random forests)

Dataset      Top-1 Accuracy   Top-5 Accuracy   Top-10 Accuracy
Training     97.44            98.44            98.72
Validation   33.83            55.85            65.16
Test         32.34            55.15            66.27

Table 2: Results with model 2 features for the best classifier (random forests)

With the word frequency feature vector, random forests capture co-occurrence of words, giving a significant improvement over both random guessing and the style-based features. A sketch of the evaluation follows.
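
A minimal sketch of the top-k metric and the random forest evaluation with scikit-learn and NumPy, reusing the count features and labels from earlier; the hyperparameter setting is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_k_accuracy(probs, classes, y_true, k):
    """Fraction of samples whose true class is among the k classes
    with the highest predicted probability."""
    top_k = classes[np.argsort(probs, axis=1)[:, -k:]]
    return np.mean([y in row for y, row in zip(y_true, top_k)])

clf = RandomForestClassifier(n_estimators=100)  # assumed setting
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)
for k in (1, 5, 10):
    acc = top_k_accuracy(probs, clf.classes_, y_test, k)
    print(f"top-{k} accuracy: {acc:.4f}")
```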

3.2.2 Deep Learning Model: LSTM with Mean Pooling

Training LSTMs on entire answers is computationally expensive, so each answer was split into smaller chunks and the model was trained to predict the author of each chunk rather than of each answer (a sketch of the chunking step follows the comments below).

Dataset      Top-1 Accuracy   Top-5 Accuracy   Top-10 Accuracy
Training     51.32            75.6             83.98
Validation   20.26            37.78            53.51
Test         20.12            36.96            53.42

Table 3: Results for the LSTM model with chunk size 50 words

Comments:

1. Random word vector initialization performed better than pre-trained word vectors from the Wikipedia dataset. A possible reason is that, for this specific task, words that are semantically very similar (e.g., synonyms) may still need to be separated in the word vector space by the way different people use them.

2. A direct relation is observed between average answer length and accuracy, indicating that author attribution is easier for longer answers: authors with an average answer length above 400 words have at least 70% accuracy (see Figure 6).

Figure 6: Correlation between accuracy and answer length

3. The performance is much better than random guessing. It is lower than that of the random forest model, however, which was trained on full answers, whereas the LSTM was trained on chunked answers.
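
A minimal sketch of the chunking step, assuming whitespace tokenization and a 50-word chunk size; every chunk inherits the author label of its source answer:

```python
def chunk_answer(answer, author, chunk_size=50):
    """Split one answer into consecutive chunk_size-word chunks,
    each labeled with the answer's author."""
    words = answer.split()
    return [(" ".join(words[i:i + chunk_size]), author)
            for i in range(0, len(words), chunk_size)]

# chunks = [c for ans, a in train_pairs for c in chunk_answer(ans, a)]
```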

4 Future Work

Although computationally expensive, training on entire answers should improve results compared to training on individual chunks.

The model was prone to overfitting to specific words in the training data that happened to disambiguate the authors, leading to poor generalization. Applying the mean pooling layer after a per-timestep softmax, rather than directly on the hidden layer outputs, might help overcome this, since the softmax normalizes the hidden layer outputs to sum to 1.

Using pre-trained word vectors from the Wikipedia dataset led to worse performance than random initialization. However, our dataset was relatively small, and results should improve if the word vectors are instead first trained on the author-attribution task over a much larger dataset and then used as the initialization.

References

[1] Green, R. M., & Sheppard, J. W. (2013). Comparing frequency- and style-based features for Twitter author identification. In Proceedings of the Twenty-Sixth International FLAIRS Conference.

[2] Uzuner, Ö., & Katz, B. (2005). A comparative study of language models for book and author recognition. In Natural Language Processing - IJCNLP 2005 (pp. 969-980). Springer Berlin Heidelberg.

[3] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.

[4] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

[5] Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.

[6] Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, Vol. 385. Springer.