What Your Username Says About You
Aaron Jaech & Mari Ostendorf, University of Washington

Motivation
Understanding personal information in online interactions.
Why usernames? Three reasons:
- Expressiveness: used to advertise oneself (@VINCEEinNYC, @AngelTheBunny, @Gunservatively)
- Ubiquity: Twitter, Instagram, Pinterest, Snapchat, Vine, YouTube
- Complementary to text cues & relatively unexplored

Example
Can you guess anything about Twitter user @mo_alq?

Example
Can you guess anything about Twitter user @mo_alq? Does looking at these users help?
@moh_alsaeid @mohamad_al3jmei @mohamad_alarshi @mohamad_aljasim @mohamad_alkhale @mohamad_almdnee @mohamad_almo0ha @mohamad_almsfr @mohamad_alrashe @mohamed_alattas @mohamed_alhassn @mohamed_almored @mohand_alsharif @mohd_alfozan @mohd_alsaleh1 @mohmmad_al3mry

Overview
Problem: find out what can be inferred about an individual from their username alone. We use gender and language identification tasks to demonstrate the technique.
Strategy:
- Split the input into username morphemes (u-morphs)
- Learn the relationship between u-morphs and class labels

Username Morphology
Usernames are often formed by concatenation: @taylorswift13 → taylor + swift + 13.
We use the Morfessor algorithm to build a u-morph lexicon from unsupervised data.
- Preprocess to use casing (JohnDoe → john$doe)
- Lexicon size / morph length tuned to maximize performance on each task
- Experiments compare u-morph segmentation to character 3-grams & 4-grams (used in prior work)
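A minimal sketch of this segmentation step, assuming the Morfessor 2.0 Python package (morfessor.BaselineModel and its documented quickstart calls) and hypothetical file names for the raw and preprocessed username lists:

```python
import re
import morfessor  # assumes the Morfessor 2.0 package (pip install morfessor)

def preprocess(username):
    """Lowercase a username, marking internal case changes with '$' (JohnDoe -> john$doe)."""
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '$', username).lower()

# Write preprocessed usernames to a corpus file Morfessor can read.
# 'usernames_raw.txt' / 'usernames_prep.txt' are hypothetical file names.
with open('usernames_raw.txt') as fin, open('usernames_prep.txt', 'w') as fout:
    for line in fin:
        fout.write(preprocess(line.strip().lstrip('@')) + '\n')

# Train an unsupervised segmentation model; its lexicon serves as the u-morph inventory.
# (The slide notes that lexicon size / morph length are tuned per task.)
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file('usernames_prep.txt')))
model.train_batch()

segments, _cost = model.viterbi_segment(preprocess('taylorswift13'))
print(segments)  # e.g. ['taylor', 'swift', '13'] if those u-morphs were learned
```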

Classifier
Each username is just a sequence of n u-morphs: @taylorswift13 → [m_1 m_2 m_3]; m_1 = taylor, m_2 = swift, m_3 = 13.
Model the relationship between u-morphs and class labels using unigram language models:
- Class prior
- Class-specific u-morph probabilities
- Class labels depend only on the observed u-morphs
- Class-dependent smoothing weights are needed when class priors are skewed
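This reads as a naive-Bayes-style decision rule, P(c | m_1 ... m_n) ∝ P(c) ∏_i P(m_i | c), with each class model smoothed toward a background distribution. A minimal sketch under that reading; the interpolation weights, the pooled background distribution, and the unseen-morph floor are illustrative assumptions, not the paper's exact estimates:

```python
import math
from collections import Counter

def train_unigram(usernames_by_class, background_lambda):
    """Estimate class priors and smoothed u-morph unigram models.

    usernames_by_class: dict mapping class label -> list of segmented usernames
                        (each username is a list of u-morphs).
    background_lambda:  dict mapping class label -> interpolation weight lambda_c
                        (class-dependent smoothing, per the slide).
    """
    total = sum(len(v) for v in usernames_by_class.values())
    priors = {c: len(v) / total for c, v in usernames_by_class.items()}

    # Background u-morph distribution pooled over all classes, used for smoothing.
    bg_counts = Counter(m for v in usernames_by_class.values() for u in v for m in u)
    bg_total = sum(bg_counts.values())
    bg = {m: n / bg_total for m, n in bg_counts.items()}

    models = {}
    for c, usernames in usernames_by_class.items():
        counts = Counter(m for u in usernames for m in u)
        n = sum(counts.values())
        lam = background_lambda[c]
        # Interpolated estimate: P(m|c) = lam * ML estimate + (1 - lam) * background.
        models[c] = {m: lam * counts[m] / n + (1 - lam) * bg[m] for m in bg}
    return priors, models

def classify(u_morphs, priors, models):
    """Return the class maximizing log P(c) + sum_i log P(m_i | c)."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(models[c].get(m, 1e-9)) for m in u_morphs)
    return max(priors, key=score)
```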

Gender ID
Task: label usernames as male/female.
Data: 44k labeled OkCupid usernames and 3.5 million unlabeled Snapchat usernames; test set drawn from OkCupid.
Approach:
- Use the Snapchat data to build the u-morph lexicon
- Train male/female unigram models on the OkCupid data
- Improve the models using self-training on the unlabeled Snapchat data
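A rough sketch of that self-training loop, under one common reading: classify the unlabeled usernames, keep the confident ones as pseudo-labeled data, and re-estimate the models. The confidence threshold, iteration count, and the retrain_fn callback are illustrative assumptions:

```python
import math

def posterior(u_morphs, priors, models):
    """Posterior over classes under the naive unigram model; returns (best class, its probability)."""
    log_scores = {c: math.log(priors[c]) +
                     sum(math.log(models[c].get(m, 1e-9)) for m in u_morphs)
                  for c in priors}
    z = max(log_scores.values())
    probs = {c: math.exp(s - z) for c, s in log_scores.items()}
    total = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / total

def self_train(labeled, unlabeled, retrain_fn, rounds=3, threshold=0.9):
    """Self-training on unlabeled usernames.

    labeled:    list of (u_morphs, label) pairs (e.g. from OkCupid).
    unlabeled:  list of u_morph sequences (e.g. from Snapchat).
    retrain_fn: callable mapping a list of (u_morphs, label) pairs to (priors, models),
                e.g. a wrapper around the train_unigram sketch above.
    """
    priors, models = retrain_fn(labeled)
    for _ in range(rounds):
        confident = []
        for u in unlabeled:
            label, prob = posterior(u, priors, models)
            if prob >= threshold:
                confident.append((u, label))
        priors, models = retrain_fn(labeled + confident)
    return priors, models
```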

Gender ID Results
Top features:
         u-morph                    trigram
Male     guy, mike, matt, josh      guy, uy#, kev, joe
Female   girl, marie, lady, miss    irl, gir, grl, emm

Top u-morph features all have a strong semantic relationship to the task. N-grams are more prone to confusion, e.g. "guy" in Nguyen and "miss" in mission.

Error rates:
Features   Supervised   Self-Training
3-gram     28.7%        32.0%
4-gram     28.7%        29.4%
u-morph    27.8%        25.8%

Supervised learning: u-morphs give a 3% reduction in error rate. Self-training: n-grams don't benefit, but the u-morph model has a 10% error reduction from the baseline.
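For reference, the quoted relative reductions appear to follow directly from the table: supervised, (28.7 − 27.8) / 28.7 ≈ 3.1%; self-training, (28.7 − 25.8) / 28.7 ≈ 10.1% relative to the supervised n-gram baseline.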

Language ID
Task: predict the language of a tweet from the Twitter username.
Data: 540,000 usernames from the 9 most popular Twitter languages, labeled by the Twitter API + the langid.py classifier.
Approach:
- Build the u-morph lexicon on international Twitter data
- Train u-morph and n-gram language models
- Average the posterior probabilities from the u-morph and n-gram models to create a combination model
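A minimal sketch of the posterior-averaging combination; the per-language posterior dictionaries and the interpolation weight are assumptions for illustration (the slide describes a simple average):

```python
def combine_posteriors(umorph_post, ngram_post, weight=0.5):
    """Average per-language posteriors from the u-morph and character n-gram models.

    umorph_post, ngram_post: dicts mapping language code -> posterior probability
                             for one username under each model.
    weight: interpolation weight on the u-morph model (0.5 = simple average).
    """
    languages = set(umorph_post) | set(ngram_post)
    combined = {lang: weight * umorph_post.get(lang, 0.0)
                      + (1 - weight) * ngram_post.get(lang, 0.0)
                for lang in languages}
    return max(combined, key=combined.get), combined

# Example usage with made-up posteriors for a single username:
label, scores = combine_posteriors(
    {'en': 0.6, 'ar': 0.3, 'es': 0.1},
    {'en': 0.4, 'ar': 0.5, 'es': 0.1})
print(label)  # 'en' under a simple 0.5/0.5 average
```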

Language ID Results
The 4-gram model has higher recall; u-morphs give better precision. The combination model benefits from both. The multilingual u-morph lexicon is less well matched to infrequent languages (each only 1-2% of the total data).

Conclusions
- u-morphs are a good representation for username classification
- Personal characteristics can be inferred from usernames with just u-morph unigrams
- Accuracy is good with the username alone (for language ID, roughly comparable to using the whole tweet)
- Usernames are complementary to existing features

Future Work
- Explore more tasks
- Pooled language-specific u-morph lexicons
- Improve the classifier: higher-order LM, other language models (e.g. log-bilinear)
Code and data available at: https://github.com/ajaech/username_analytics