Social Media, Anonymity, and Fraud: HP Forest Node in SAS Enterprise Miner


Taylor K. Larkin, The University of Alabama, Tuscaloosa, Alabama
Denise J. McManus, The University of Alabama, Tuscaloosa, Alabama

ABSTRACT

With an ever-increasing flow of information from the Web, droves of data are constantly being introduced into the world. While the internet provides people with an easy avenue to purchase their favorite products or participate in social media, the anonymity of users can make it a breeding ground for fraud, distribution of illegal material, or communication between terrorist cells. In response, areas such as author identification through online writeprinting are increasing in popularity. Much like a human fingerprint, a writeprint quantifies the characteristics of a person's writing style, such as the frequency of certain words, the lengths and structures of sentences, and punctuation patterns. Translating someone's post into quantifiable inputs makes it possible to apply advanced machine learning tools to predict the identity of malicious users. To demonstrate prediction on this type of problem, the Amazon Commerce Reviews dataset from the UCI Machine Learning Repository is used. This dataset consists of 1,500 observations, representing 30 product reviews from each of 50 authors, and 10,000 numeric writeprint variables. Given that the target variable has 50 classes, only select models are adequate for this classification task. Fortunately, with the help of the HP Forest node in SAS Enterprise Miner, we can apply the popular Random Forest model very efficiently, even in a setting where the number of predictor variables is much larger than the observation count. Results show that the HP Forest node produces the best results compared to the HP Neural, MBR, and Decision Tree nodes.

INTRODUCTION

In today's society, the internet plays a pivotal role in everyday life. It has a variety of uses, whether for communicating with business leaders across the world or something as mundane as paying bills. This ease of use encourages individuals to engage in online shopping and social media. Despite the wealth of information the internet presents, it is no stranger to misuse. Specifically in the communications realm, malicious users can spread illegal materials, such as pirated software and child pornography, via online messaging. Moreover, terrorist organizations remain active in this type of transmission (Li, Zheng, & Chen, 2006). Therefore, it is desirable to demystify online anonymity for the purpose of identifying cyber criminals and terrorist cells.

This can be accomplished empirically by analyzing an individual's writeprint. Much like a fingerprint, a writeprint characterizes writing style by identifying specific attributes. Examples include (Liu, Liu, Sun, & Liu, 2011):

- Word and sentence length
- Vocabulary richness
- Occurrence of special characters and punctuation
- Character n-gram frequencies
- Types of words used and misspellings

The goal of this author identification system is "to identify a set of features that remain relatively constant among a number of writings by a particular author" (Li, Zheng, & Chen, 2006, p. 78). Due to the variety of possible characteristics to include in an individual's writeprint, these datasets can quickly become high dimensional (i.e., the number of predictor variables is much larger than the number of observations); a minimal sketch of computing a few such features appears below.
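To make the notion of a writeprint concrete, the following Python sketch computes a handful of simple stylometric features from a block of text. It is purely illustrative: the 10,000 variables in the Amazon Commerce Reviews dataset were engineered by the original authors (Liu, Liu, Sun, & Liu, 2011), and the feature names below are hypothetical.

```python
import re
from collections import Counter

def writeprint_features(text):
    """Compute a few illustrative writeprint-style features from raw text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = re.sub(r"\s+", "", text.lower())

    features = {
        # Lexical features: word/sentence length and vocabulary richness
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
        # Structural features: punctuation usage per word
        "comma_rate": text.count(",") / max(len(words), 1),
        "exclaim_rate": text.count("!") / max(len(words), 1),
    }
    # Character bigram frequencies (a tiny slice of an n-gram profile)
    bigrams = Counter(chars[i:i + 2] for i in range(len(chars) - 1))
    for gram, count in bigrams.most_common(5):
        features[f"bigram_{gram}"] = count / max(len(chars) - 1, 1)
    return features

print(writeprint_features("Great blender! It crushed ice easily, and cleanup was quick."))
```

Profiling many such characteristics across thousands of reviews is what drives these datasets into the high-dimensional regime described above.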
Thus, in order to perform classification tasks, powerful algorithms such as Random Forests (RFs) are necessary to handle the nature of these datasets. These models can be executed efficiently through SAS products such as Enterprise Miner. This paper first introduces the concept of RFs and their implementation, the HP Forest node, in SAS Enterprise Miner 13.1. Next, it demonstrates the HP Forest node in action on a real, high-dimensional writeprint dataset along with some other model nodes for comparison. Finally, it discusses the predictive performance and computational expense of each node within Enterprise Miner on this dataset.

WHAT ARE RANDOM FORESTS?

Classification and Regression Trees (CART) (Breiman, Friedman, Olshen, & Stone, 1984) are a very popular technique for analyzing data due to their practical interpretations. They identify predictive splits on predictor variables in a dataset using binary recursive partitioning. In deciding these partitions, the split that maximizes the information gain according to the Gini impurity criterion is typically chosen. Advantages of these tree models are that they assume no distribution, produce interpretable logic rules, can handle varying scales and types of data as well as missing values, and are able to capture complex interactions (Lewis, 2000). However, tree models can be highly variable, which can induce over-fitting (Hastie, Tibshirani, & Friedman, 2009). In other words, because tree construction is a hierarchical process, small changes in the data can drastically change the tree structure.

To remedy this issue, Breiman (1996) suggests performing bagging, or bootstrap aggregation, in conjunction with CART: train many trees independently on bootstrap samples (sampling with replacement) of the same size as the training data and average the individual predictions. To improve upon this procedure, Breiman (2001) introduces additional randomness, arriving at the RF. RFs grow a plethora of trees, and at each node in each tree a random subset of the available predictor variables is chosen to search for the best split. This randomization promotes a forest of diverse tree models, which is necessary for accurate predictions (Breiman, 2001).

Besides increasing performance, another advantage of these models is the ability to estimate the error rate during model training through the use of out-of-bag (OOB) observations. The OOB observations for a tree are those not drawn in the sample that generates that tree's training data. Running the OOB observations through each constructed tree and aggregating the error rate across all the trees in the forest yields error estimates very similar to those from cross-validation (Hastie, Tibshirani, & Friedman, 2009).

Although RFs are typically viewed as a black-box approach, variable importance rankings can be extracted. However, the variable selection procedure in RFs can be biased when the predictor variables differ in number of categories or scale of measurement, and the bootstrap aggregation process itself can affect the reliability of these variable importance estimates (Strobl, Boulesteix, Zeileis, & Hothorn, 2007); thus, subsampling (sampling without replacement) may be preferred in scenarios where inference is important. Nevertheless, RFs have become incredibly popular in both academic literature and in practice because they inherit many of the benefits of decision trees while boosting predictive performance.
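As a rough illustration of the ideas above, the sketch below grows a small forest with scikit-learn and reports the OOB misclassification rate alongside a held-out test error. This is a generic RF in Python, not the HP Forest node, and the synthetic dataset and parameter values are arbitrary stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a writeprint dataset: many predictors, few observations.
X, y = make_classification(n_samples=600, n_features=500, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees grown in the forest
    max_features="sqrt",     # random subset of predictors searched at each node
    oob_score=True,          # estimate error from out-of-bag observations
    random_state=0,
)
rf.fit(X_train, y_train)

print(f"OOB misclassification rate:  {1 - rf.oob_score_:.3f}")
print(f"Test misclassification rate: {1 - rf.score(X_test, y_test):.3f}")
```

In practice the OOB estimate tracks the held-out error closely, which is why it is such a convenient by-product of training.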
METHODS

To construct RFs via SAS, the HP Forest node within Enterprise Miner can be used. This implementation improves upon the original RF algorithm in a couple of ways. First, the training data for each tree is subsampled rather than bootstrapped. Second, instead of searching for the best splitting rule across the entire subset of randomly selected predictor variables, the HP Forest node preselects the predictor variable in the subset with the greatest association with the target variable, determined by a statistical test, and then finds that variable's best splitting rule (Maldonado, Dean, Czika, & Haller, 2014). This reduces the chance of producing a spurious split and decreases the biases introduced by an exhaustive search for the Gini information gain, leading to more reliable predictions on new data and more reliable variable importance rankings. Using tests of association to determine splits is similar to the RF variant proposed by Strobl, Boulesteix, Zeileis, and Hothorn (2007), which can produce unbiased variable selection in each tree of the forest.

The three main parameters to tune in HP Forest are highlighted in red in Figure 1 with their default values. Maximum Number of Trees, Number of vars to consider in split search, and Proportion of obs in each sample correspond to the number of trees constructed in the forest, the number of predictor variables randomly selected at each node, and the number of observations dedicated to each tree's training data under subsampling, respectively. A numerical default for Number of vars to consider in split search is not listed because it varies with each dataset; as is typical for RFs, the square root of the number of predictor variables is a suitable default. Tuning this parameter controls the amount of correlation between the trees: smaller values yield less correlated trees, whereas if only a few predictor variables are informative for the target, this value should be larger to increase the chance of those variables being selected. This parameter will likely drive differences in predictive performance, unlike Maximum Number of Trees, which does not greatly impact the prediction once a sufficient number of trees is used.
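The split-variable preselection idea can be mimicked in plain Python: among a random subset of candidate predictors, pick the one most associated with the class target (here via a one-way ANOVA F-test from SciPy) before any split search is attempted. This is a conceptual sketch only, not the statistical test or search actually used inside the HP Forest node.

```python
import numpy as np
from scipy.stats import f_oneway

def preselect_split_variable(X, y, mtry, rng):
    """Among mtry randomly drawn predictors, return the one most associated
    with the class target (smallest one-way ANOVA p-value)."""
    candidates = rng.choice(X.shape[1], size=mtry, replace=False)
    best_var, best_p = None, np.inf
    for j in candidates:
        # Does the mean of predictor j differ across the target classes?
        groups = [X[y == c, j] for c in np.unique(y)]
        _, p_value = f_oneway(*groups)
        if p_value < best_p:
            best_var, best_p = j, p_value
    return best_var, best_p

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1000))
y = rng.integers(0, 3, size=150)
X[:, 42] += y            # make one predictor genuinely informative
print(preselect_split_variable(X, y, mtry=31, rng=rng))  # mtry ~ sqrt(1000)
```

In a real forest this preselection would be repeated at every node, on that node's observations, before searching the chosen variable for its best split point.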

Figure 1. Main Sections of HP Forest Node's Properties Panel

DATA AND EXPERIMENTAL PROCEDURE

To investigate the performance of the HP Forest node for author identification tasks, the writeprint dataset Amazon Commerce Reviews is used, which can be found on the UCI Machine Learning Repository (Lichman, 2013). This dataset represents 30 product reviews from each of 50 active users. Accompanying these observations, 10,000 writeprint variables are constructed from the text of the product reviews (see Liu, Liu, Sun, & Liu, 2011 for more detail). Given the large number of target classes (in this case 50), only certain models can be utilized for prediction.

Figure 2 shows the Enterprise Miner diagram constructed. Along with the HP Forest node, the HP Neural, MBR (Memory Based Reasoning), and Decision Tree nodes are also executed for comparison; these are all left at their default settings. Three versions of the HP Forest node are investigated. Let ntree denote Maximum Number of Trees and mtry denote Number of vars to consider in split search. The three versions are:

1. ntree = 50; mtry = 100 (default)
2. ntree = 500; mtry = 100
3. ntree = 5,000; mtry = 1,000

Figure 2. Enterprise Miner Diagram for Experimentation
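For readers working outside SAS, the three configurations could be approximated with a generic RF implementation, as in the sketch below. It uses scikit-learn as a stand-in and assumes the 10,000 writeprint variables are already loaded into arrays X_train and y_train; it is not equivalent to the HP Forest node, which subsamples without replacement and preselects split variables by association tests.

```python
from sklearn.ensemble import RandomForestClassifier

# (ntree, mtry) pairs mirroring the three HP Forest versions above.
configurations = [
    {"n_estimators": 50,   "max_features": 100},    # default-like setting
    {"n_estimators": 500,  "max_features": 100},
    {"n_estimators": 5000, "max_features": 1000},
]

def fit_versions(X_train, y_train):
    """Fit one forest per configuration and return them keyed by version number."""
    forests = {}
    for i, params in enumerate(configurations, start=1):
        forests[i] = RandomForestClassifier(
            oob_score=True, n_jobs=-1, random_state=0, **params
        ).fit(X_train, y_train)
    return forests
```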

To estimate the misclassification error for the comparison models, a random, stratified split is conducted with 70% of observations allocated for training (1,018 observations) and 30% for testing. Capitalizing on the OOB feature of RFs, the Number of obs in each sample parameter is set to 1,018 so that each tree in the forest is trained on the same number of observations as the comparison models.

RESULTS

Table 1 displays the misclassification rates for each model tested. Each of the RF models delivers a lower error than the comparison models. Increasing ntree boosts performance at the default level of mtry; however, drastically increasing both ntree and mtry does not lead to a large difference in error on this dataset compared to the default setting, while incurring much larger computational expense. Note that the processing time for a normal run of the Decision Tree node is only a few seconds less than that of the default HP Forest node, which constructs 50 trees. Since the trees in a RF are built independently from one another, the model can be run in parallel using multiple cores, thanks to the SAS High-Performance (HP) Data Mining framework within Enterprise Miner (Maldonado, Dean, Czika, & Haller, 2014).

Model           Misclassification Rate   Run Duration (Minutes:Seconds)
HP Forest 1     0.693                    2:51
HP Forest 2     0.483                    3:58
HP Forest 3     0.649                    40:38
HP Neural       0.923                    2:08
MBR             0.751                    1:01
Decision Tree   0.865                    2:36

Table 1. Predictive Performance and Computational Expense

Figure 3 plots the misclassification rate as more trees are added to the best model. Improvements in the misclassification rate largely level off at around 100 trees, indicating that it may not be necessary to construct so many trees for this dataset. As expected, the misclassification rate given by the OOB observations is less optimistic than when the RF is trained and evaluated on all the data. This reinforces the necessity of always testing predictive models on independently held-out samples of data; fortunately for RFs, this can be measured at the same time as model training.

Figure 3. Misclassification Rate across Varying Numbers of Trees for Best Performing HP Forest Node
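A curve like the one in Figure 3 can be reproduced in a generic setting by growing a forest incrementally and recording the OOB error after each batch of trees. The sketch below uses scikit-learn's warm_start mechanism on synthetic data; it illustrates the shape of such a curve, not the actual numbers from the HP Forest runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=500, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# warm_start=True adds trees to the existing forest instead of refitting from scratch.
rf = RandomForestClassifier(max_features="sqrt", oob_score=True,
                            warm_start=True, random_state=0, n_jobs=-1)

oob_curve = []
for n_trees in range(50, 501, 50):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_curve.append((n_trees, 1 - rf.oob_score_))  # OOB misclassification rate

for n_trees, err in oob_curve:
    print(f"{n_trees:4d} trees -> OOB error {err:.3f}")
```

Plotting the error against the number of trees typically shows steep early gains that flatten out, mirroring the behavior reported above.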

CONCLUSION

Since their introduction, RFs have become an increasingly popular data mining tool for prediction and inference. In this study, the RF implementation in SAS Enterprise Miner, the HP Forest node, is used to predict the authorship of various Amazon users based on their online writeprint from their product reviews. Because of the large number of target classes, only certain models are adequate for this classification task. The HP Forest node out-performs the other comparison models tested with little increase in computational expense, thanks to the high-performance environment within Enterprise Miner. As the internet becomes more prevalent in society, it becomes all the more important to utilize the best analytical tools to prevent the distribution of illegal content and the communication of malicious entities. Given the high-dimensional nature of online writeprint datasets, RFs can be a data-driven solution to help authorities identify authorship in online settings.

REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York: Springer.

Lewis, R. J. (2000, May). An introduction to classification and regression tree (CART) analysis. In Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California (pp. 1-14).

Li, J., Zheng, R., & Chen, H. (2006). From fingerprint to writeprint. Communications of the ACM, 49(4), 76-82.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Liu, S., Liu, Z., Sun, J., & Liu, L. (2011). Application of synergetic neural network in online writeprint identification. International Journal of Digital Content Technology and its Applications, 5(3), 126-135.

Maldonado, M., Dean, J., Czika, W., & Haller, S. (2014). Leveraging ensemble models in SAS Enterprise Miner. In Proceedings of the SAS Global Forum 2014 Conference. Cary, NC: SAS Institute Inc.

Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 1.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Name: Taylor Larkin
Enterprise: The University of Alabama
Address: Box 870226
City, State ZIP: Tuscaloosa, AL 35487
E-mail: tklarkin@crimson.ua.edu

Name: Denise McManus
Enterprise: The University of Alabama
Address: Box 870226
City, State ZIP: Tuscaloosa, AL 35487
Work Phone: 205-348-7571
E-mail: dmcmanus@cba.ua.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.