Social Media, Anonymity, and Fraud: HP Forest Node in SAS Enterprise Miner


Taylor K. Larkin, The University of Alabama, Tuscaloosa, Alabama
Denise J. McManus, The University of Alabama, Tuscaloosa, Alabama

ABSTRACT

With an ever-increasing flow of information from the Web, droves of data are constantly being introduced into the world. While the internet provides people with an easy avenue to purchase their favorite products or participate in social media, the anonymity of users can make it a breeding ground for fraud, distribution of illegal material, or communication between terrorist cells. In response, areas such as author identification through online writeprinting are increasing in popularity. Much like a human fingerprint, a writeprint quantifies the characteristics of a person's writing style, such as the frequency of certain words, the lengths and structures of sentences, and punctuation patterns. Translating someone's post into quantifiable inputs makes it possible to apply advanced machine learning tools to predict the identity of malicious users. To demonstrate prediction on this type of problem, the Amazon Commerce Reviews dataset from the UCI Machine Learning Repository is used. This dataset consists of 1,500 observations, representing 30 product reviews from each of 50 authors, and 10,000 numeric writeprint variables. Given that the target variable has 50 classes, only select models are adequate for this classification task. Fortunately, with the help of the HP Forest node in SAS Enterprise Miner, we can apply the popular Random Forest model very efficiently, even in a setting where the number of predictor variables is much larger than the observation count. Results show that the HP Forest node produces the best results compared to the HP Neural, MBR, and Decision Tree nodes.

INTRODUCTION

In today's society, the internet plays a pivotal role in everyday life. It has a variety of uses, whether for communicating with business leaders across the world or something as mundane as paying bills. This ease of use encourages individuals to engage in online shopping and social media. Despite the wealth of information the internet presents, it is no stranger to misuse. Specifically in the communications realm, malicious users can spread illegal materials, such as pirated software and child pornography, via online messaging. Moreover, terrorist organizations remain active in this type of transmission (Li, Zheng, & Chen, 2006). Therefore, it is desirable to demystify online anonymity for the purpose of identifying cyber criminals and terrorist cells.

This can be accomplished empirically by analyzing an individual's writeprint. Much like a fingerprint, a writeprint characterizes writing style by identifying specific attributes. Examples include (Liu, Liu, Sun, & Liu, 2011):

- Word and sentence length
- Vocabulary richness
- Occurrence of special characters and punctuation
- Character n-gram frequencies
- Types of words used and misspellings

The goal of this author identification system is "to identify a set of features that remain relatively constant among a number of writings by a particular author" (Li, Zheng, & Chen, 2006, p. 78). Due to the variety of possible characteristics to include in an individual's writeprint, these datasets can quickly become high dimensional (i.e., the number of predictor variables is much larger than the number of observations); a minimal sketch of computing a few such features appears below.
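To make the notion of a writeprint concrete, the following Python sketch computes a handful of simple stylometric features from a block of text. It is purely illustrative: the 10,000 variables in the Amazon Commerce Reviews dataset were engineered by the original authors (Liu, Liu, Sun, & Liu, 2011), and the feature names below are hypothetical.

```python
import re
from collections import Counter

def writeprint_features(text):
    """Compute a few illustrative writeprint-style features from raw text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = re.sub(r"\s+", "", text.lower())

    features = {
        # Lexical features: word/sentence length and vocabulary richness
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
        # Structural features: punctuation usage per word
        "comma_rate": text.count(",") / max(len(words), 1),
        "exclaim_rate": text.count("!") / max(len(words), 1),
    }
    # Character bigram frequencies (a tiny slice of an n-gram profile)
    bigrams = Counter(chars[i:i + 2] for i in range(len(chars) - 1))
    for gram, count in bigrams.most_common(5):
        features[f"bigram_{gram}"] = count / max(len(chars) - 1, 1)
    return features

print(writeprint_features("Great blender! It crushed ice easily, and cleanup was quick."))
```

Profiling many such characteristics across thousands of reviews is what drives these datasets into the high-dimensional regime described above.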
Thus, in order to perform classification tasks, powerful algorithms such as Random Forests (RFs) are necessary to handle the nature of these datasets. These models can be executed efficiently through SAS products such as Enterprise Miner. This paper first introduces the concept of RFs and their implementation, the HP Forest node, in SAS Enterprise Miner 13.1. Next, it demonstrates the HP Forest node in action on a real, high-dimensional writeprint dataset along with some other model nodes for comparison. Finally, it discusses the predictive performance and computational expense of each node within Enterprise Miner on this dataset.

WHAT ARE RANDOM FORESTS?

Classification and Regression Trees (CART) (Breiman, Friedman, Olshen, & Stone, 1984) are a very popular technique for analyzing data due to their practical interpretations. They identify predictive splits on predictor variables in a dataset using binary recursive partitioning. In deciding these partitions, the split that maximizes the information gain according to the Gini impurity criterion is typically chosen. Advantages of these tree models are that they assume no distribution, produce interpretable logic rules, can handle varying scales and types of data as well as missing values, and are able to capture complex interactions (Lewis, 2000). However, tree models can be highly variable, which can induce over-fitting (Hastie, Tibshirani, & Friedman, 2009). In other words, because tree construction is a hierarchical process, small changes in the data can drastically change the tree structure.

To remedy this issue, Breiman (1996) suggests performing bagging, or bootstrap aggregation, in conjunction with CART: train many trees independently on bootstrap samples (sampling with replacement) of the same size as the training data and average the individual predictions. To improve upon this procedure, Breiman (2001) introduces additional randomness, arriving at the RF. RFs grow a plethora of trees, and at each node in each tree a random subset of the available predictor variables is chosen to search for the best split. This randomization promotes a forest of diverse tree models, which is necessary for accurate predictions (Breiman, 2001).

Besides increasing performance, another advantage of these models is the ability to estimate the error rate during model training through the use of out-of-bag (OOB) observations. The OOB observations for a tree are those not drawn in the sample that generates that tree's training data. Running the OOB observations through each constructed tree and aggregating the error rate across all the trees in the forest yields error estimates very similar to those from cross-validation (Hastie, Tibshirani, & Friedman, 2009).

Although RFs are typically viewed as a black-box approach, variable importance rankings can be extracted. However, the variable selection procedure in RFs can be biased when the predictor variables differ in number of categories or scale of measurement, and the bootstrap aggregation process itself can affect the reliability of these variable importance estimates (Strobl, Boulesteix, Zeileis, & Hothorn, 2007); thus, subsampling (sampling without replacement) may be preferred in scenarios where inference is important. Nevertheless, RFs have become incredibly popular in both academic literature and in practice because they inherit many of the benefits of decision trees while boosting predictive performance.
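As a rough illustration of the ideas above, the sketch below grows a small forest with scikit-learn and reports the OOB misclassification rate alongside a held-out test error. This is a generic RF in Python, not the HP Forest node, and the synthetic dataset and parameter values are arbitrary stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a writeprint dataset: many predictors, few observations.
X, y = make_classification(n_samples=600, n_features=500, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees grown in the forest
    max_features="sqrt",     # random subset of predictors searched at each node
    oob_score=True,          # estimate error from out-of-bag observations
    random_state=0,
)
rf.fit(X_train, y_train)

print(f"OOB misclassification rate:  {1 - rf.oob_score_:.3f}")
print(f"Test misclassification rate: {1 - rf.score(X_test, y_test):.3f}")
```

In practice the OOB estimate tracks the held-out error closely, which is why it is such a convenient by-product of training.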
METHODS

To construct RFs via SAS, the HP Forest node within Enterprise Miner can be used. This implementation improves upon the original RF algorithm in a couple of ways. First, the training data for each tree is subsampled rather than bootstrapped. Second, instead of searching for the best splitting rule across the entire subset of randomly selected predictor variables, the HP Forest node preselects the predictor variable in the subset with the greatest association with the target variable, determined by a statistical test, and then finds that variable's best splitting rule (Maldonado, Dean, Czika, & Haller, 2014). This reduces the chance of producing a spurious split and decreases the biases introduced by an exhaustive search for the Gini information gain, leading to more reliable predictions on new data and more reliable variable importance rankings. Using tests of association to determine splits is similar to the RF variant proposed by Strobl, Boulesteix, Zeileis, and Hothorn (2007), which can produce unbiased variable selection in each tree of the forest.

The three main parameters to tune in HP Forest are highlighted in red in Figure 1 with their default values. Maximum Number of Trees, Number of vars to consider in split search, and Proportion of obs in each sample correspond to the number of trees constructed in the forest, the number of predictor variables randomly selected at each node, and the number of observations dedicated to each tree's training data under subsampling, respectively. A numerical default for Number of vars to consider in split search is not listed because it varies with each dataset; as is typical for RFs, the square root of the number of predictor variables is a suitable default. Tuning this parameter controls the amount of correlation between the trees: smaller values yield less correlated trees, whereas if only a few predictor variables are informative for the target, this value should be larger to increase the chance of those variables being selected. This parameter will likely drive differences in predictive performance, unlike Maximum Number of Trees, which does not greatly impact the prediction once a sufficient number of trees is used.
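The split-variable preselection idea can be mimicked in plain Python: among a random subset of candidate predictors, pick the one most associated with the class target (here via a one-way ANOVA F-test from SciPy) before any split search is attempted. This is a conceptual sketch only, not the statistical test or search actually used inside the HP Forest node.

```python
import numpy as np
from scipy.stats import f_oneway

def preselect_split_variable(X, y, mtry, rng):
    """Among mtry randomly drawn predictors, return the one most associated
    with the class target (smallest one-way ANOVA p-value)."""
    candidates = rng.choice(X.shape[1], size=mtry, replace=False)
    best_var, best_p = None, np.inf
    for j in candidates:
        # Does the mean of predictor j differ across the target classes?
        groups = [X[y == c, j] for c in np.unique(y)]
        _, p_value = f_oneway(*groups)
        if p_value < best_p:
            best_var, best_p = j, p_value
    return best_var, best_p

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1000))
y = rng.integers(0, 3, size=150)
X[:, 42] += y            # make one predictor genuinely informative
print(preselect_split_variable(X, y, mtry=31, rng=rng))  # mtry ~ sqrt(1000)
```

In a real forest this preselection would be repeated at every node, on that node's observations, before searching the chosen variable for its best split point.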

Figure 1. Main Sections of HP Forest Node's Properties Panel

DATA AND EXPERIMENTAL PROCEDURE

To investigate the performance of the HP Forest node for author identification tasks, the writeprint dataset Amazon Commerce Reviews is used, which can be found on the UCI Machine Learning Repository (Lichman, 2013). This dataset represents 30 product reviews from each of 50 active users. Accompanying these observations, 10,000 writeprint variables are constructed from the text of the product reviews (see Liu, Liu, Sun, & Liu, 2011 for more detail). Given the large number of target classes (in this case 50), only certain models can be utilized for prediction.

Figure 2 shows the Enterprise Miner diagram constructed. Along with the HP Forest node, the HP Neural, MBR (Memory Based Reasoning), and Decision Tree nodes are also executed for comparison; these are all left at their default settings. Three versions of the HP Forest node are investigated. Let ntree denote Maximum Number of Trees and mtry denote Number of vars to consider in split search. The three versions are:

1. ntree = 50; mtry = 100 (default)
2. ntree = 500; mtry = 100
3. ntree = 5,000; mtry = 1,000

Figure 2. Enterprise Miner Diagram for Experimentation
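For readers working outside SAS, the three configurations could be approximated with a generic RF implementation, as in the sketch below. It uses scikit-learn as a stand-in and assumes the 10,000 writeprint variables are already loaded into arrays X_train and y_train; it is not equivalent to the HP Forest node, which subsamples without replacement and preselects split variables by association tests.

```python
from sklearn.ensemble import RandomForestClassifier

# (ntree, mtry) pairs mirroring the three HP Forest versions above.
configurations = [
    {"n_estimators": 50,   "max_features": 100},    # default-like setting
    {"n_estimators": 500,  "max_features": 100},
    {"n_estimators": 5000, "max_features": 1000},
]

def fit_versions(X_train, y_train):
    """Fit one forest per configuration and return them keyed by version number."""
    forests = {}
    for i, params in enumerate(configurations, start=1):
        forests[i] = RandomForestClassifier(
            oob_score=True, n_jobs=-1, random_state=0, **params
        ).fit(X_train, y_train)
    return forests
```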

To estimate the misclassification error for the comparison models, a random, stratified split is conducted with 70% of observations allocated for training (1,018 observations) and 30% for testing. Capitalizing on the OOB feature of RFs, the Number of obs in each sample parameter is set to 1,018 so that each tree in the forest is trained on the same number of observations as the comparison models.

RESULTS

Table 1 displays the misclassification rates for each model tested. Each of the RF models delivers a lower error than the comparison models. Increasing ntree boosts performance at the default level of mtry; however, drastically increasing both ntree and mtry does not lead to a large difference in error on this dataset compared to the default setting, while incurring much larger computational expense. Note that the processing time for a normal run of the Decision Tree node is only a few seconds less than that of the default HP Forest node, which constructs 50 trees. Since the trees in a RF are built independently from one another, the model can be run in parallel using multiple cores, thanks to the SAS High-Performance (HP) Data Mining framework within Enterprise Miner (Maldonado, Dean, Czika, & Haller, 2014).

Model           Misclassification Rate   Run Duration (Minutes:Seconds)
HP Forest 1     0.693                    2:51
HP Forest 2     0.483                    3:58
HP Forest 3     0.649                    40:38
HP Neural       0.923                    2:08
MBR             0.751                    1:01
Decision Tree   0.865                    2:36

Table 1. Predictive Performance and Computational Expense

Figure 3 plots the misclassification rate as more trees are added to the best model. Improvements in the misclassification rate largely level off at around 100 trees, indicating that it may not be necessary to construct so many trees for this dataset. As expected, the misclassification rate given by the OOB observations is less optimistic than when the RF is trained and evaluated on all the data. This reinforces the necessity of always testing predictive models on independently held-out samples of data; fortunately for RFs, this can be measured at the same time as model training.

Figure 3. Misclassification Rate across Varying Numbers of Trees for Best Performing HP Forest Node
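A curve like the one in Figure 3 can be reproduced in a generic setting by growing a forest incrementally and recording the OOB error after each batch of trees. The sketch below uses scikit-learn's warm_start mechanism on synthetic data; it illustrates the shape of such a curve, not the actual numbers from the HP Forest runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=500, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# warm_start=True adds trees to the existing forest instead of refitting from scratch.
rf = RandomForestClassifier(max_features="sqrt", oob_score=True,
                            warm_start=True, random_state=0, n_jobs=-1)

oob_curve = []
for n_trees in range(50, 501, 50):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_curve.append((n_trees, 1 - rf.oob_score_))  # OOB misclassification rate

for n_trees, err in oob_curve:
    print(f"{n_trees:4d} trees -> OOB error {err:.3f}")
```

Plotting the error against the number of trees typically shows steep early gains that flatten out, mirroring the behavior reported above.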

CONCLUSION

Since their introduction, RFs have become an increasingly popular data mining tool for prediction and inference. In this study, the RF implementation in SAS Enterprise Miner, the HP Forest node, is used to predict the authorship of various Amazon users based on their online writeprint from their product reviews. Because of the large number of target classes, only certain models are adequate for this classification task. The HP Forest node out-performs the other comparison models tested with little increase in computational expense, thanks to the high-performance environment within Enterprise Miner. As the internet becomes more prevalent in society, it becomes all the more important to utilize the best analytical tools to prevent the distribution of illegal content and the communication of malicious entities. Given the high-dimensional nature of online writeprint datasets, RFs can be a data-driven solution to help authorities identify authorship in online settings.

REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York: Springer.

Lewis, R. J. (2000, May). An introduction to classification and regression tree (CART) analysis. In Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California (pp. 1-14).

Li, J., Zheng, R., & Chen, H. (2006). From fingerprint to writeprint. Communications of the ACM, 49(4), 76-82.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Liu, S., Liu, Z., Sun, J., & Liu, L. (2011). Application of synergetic neural network in online writeprint identification. International Journal of Digital Content Technology and its Applications, 5(3), 126-135.

Maldonado, M., Dean, J., Czika, W., & Haller, S. (2014). Leveraging ensemble models in SAS Enterprise Miner. In Proceedings of the SAS Global Forum 2014 Conference. Cary, NC: SAS Institute Inc.

Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 1.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Name: Taylor Larkin
Enterprise: The University of Alabama
Address: Box 870226
City, State ZIP: Tuscaloosa, AL 35487
E-mail: tklarkin@crimson.ua.edu

Name: Denise McManus
Enterprise: The University of Alabama
Address: Box 870226
City, State ZIP: Tuscaloosa, AL 35487
Work Phone: 205-348-7571
E-mail: dmcmanus@cba.ua.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.