Social Media, Anonymity, and Fraud: HP Forest Node in SAS Enterprise Miner

Size: px
Start display at page:

Download "Social Media, Anonymity, and Fraud: HP Forest Node in SAS Enterprise Miner"

Transcription

1 ABSTRACT Social Media, Anonymity, and Fraud: HP Forest Node in SAS Enterprise Miner Taylor K. Larkin, The University of Alabama, Tuscaloosa, Alabama Denise J. McManus, The University of Alabama, Tuscaloosa, Alabama With an ever-increasing flow of information from the Web, droves of data are constantly being introduced into the world. While the internet provides people with an easy avenue to purchase their favorite products or participate in social media, the anonymity of users can make the internet a breeding ground for fraud, distribution of illegal material, or communication between terrorist cells. In response, areas such as author identification through online writeprinting are increasing in popularity. Much like a human fingerprint, writeprinting quantifies the characteristics of a person s writing style in ways such as the frequency of certain words, the lengths of sentences and their structures, and the punctuation patterns. The ability to translate someone s post into quantifiable inputs allows the ability to implement advanced machine learning tools to make predictions as to the identity of malicious users. To demonstrate prediction on this type of problem, the Amazon Commerce Reviews dataset from the UCI Machine Learning Repository is used. This dataset consists of 1,500 observations, representing 30 product reviews from 50 authors, and 10,000 numeric writeprint variables. Given that the target variable has 50 classes, only select models are adequate for this classification task. Fortunately, with the help of the HP Forest node in SAS Enterprise Miner, we can apply the popular Random Forest model very efficiently, even in a setting where the number of predictor variables is much larger that the observation count. Results show that the HP Forest node produces the best results compared to the HP Neural, MBR, and Decision Tree nodes. INTRODUCTION In today s society, the internet plays a pivotal role in everyday life. It has a variety of uses, whether it be for communicating with business leaders across the world or something as mundane as paying bills. The ease of use promotes individuals to be more involved in online shopping and social media. Despite the wealth of information the internet presents, it is no stranger to misuse. Specifically in the communications realm, malicious users can spread illegal materials, such as pirated software and child pornography, via online messaging. Moreover, terrorist organizations remain active in this type of transmission (Li, Zheng, & Chen, 2006). Therefore, it is desirable to demystify online anonymity for the purpose of identifying cyber criminals and terrorist cells. This can be accomplished empirically through analyzing an individual s writeprint. Much like a fingerprint, a writeprint characterizes the writing style by identifying specific attributes. Examples of this include (Liu, Liu, Sun, & Liu, 2011): Word and sentence length Vocabulary richness Occurrence of special characters and punctuations Character n-grams frequencies Types of words used and misspellings The goal of this author identification system is to identify a set of features that remain relatively constant among a number of writings by a particular author (Li, Zheng, & Chen, 2006, p. 78). Due to the variety of possible characteristics to include in an individual s writeprint, these datasets can quickly become high dimensional (i.e. where the number of predictor variables is much larger than the number of observations). Thus, in order to perform classification tasks, powerful algorithms such as Random Forests (RFs) are necessary to handle the nature of these datasets. These models can be executed efficiently through SAS products such as Enterprise Miner. This paper first introduces the concept of RFs and its implementation, the HP Forest node, in SAS Enterprise Miner Next, it demonstrations the HP Forest node in action on a real, high-dimensional writeprint dataset along with some other model nodes for comparison. Finally, it discusses the predictive performances and computational expenses of each node within Enterprise Miner on the dataset. 1

2 WHAT ARE RANDOM FORESTS? Classification and Regression Trees (CART) (Brieman, Friedman, Olshen, & Stone, 1984) are a very popular technique for analyzing data due to their practical interpretations. They can identify predictive splits on predictor variables in a dataset using binary recursive partitioning. In deciding these partitions, maximizing the information gain according to the Gini impurity criterion is typically used to determine which predictor variables to use. Some advantages to using these tree models are that they assume no distribution, produce interpretable logic rules, can handle varying scales and types of data as well as missing values, and are able to capture complex interactions (Lewis, 2000). However, these tree models can be highly variable, which can induce over-fitting (Hastie, Tibshirani, & Friedman, 2009). In other words, small changes in the data can drastically change the tree structure since it is a hierarchical process. To remedy this issue, Breiman (1996) suggests it is advantageous to perform bagging, or bootstrap aggregation, in conjunction with CART. That is, train many trees independently on bootstrap samples (sampling with replacement) of the same size as the training data and average the individual predictions. To improve upon this procedure, Breiman (2001) introduces randomness to posit the idea of a RF. RFs grow a plethora of trees that choose a random subset of predictor variables from all those available at each node in each tree to search for the best split. This randomization promotes a forest of diverse tree models, which is necessary for accurate predictions (Breiman, 2001). Besides increasing performance, another advantage to these models is the ability to effectively estimate the error rate simultaneously during model training through the use of out-of-bag (OOB) observations. OOB observations are comprised from the observations not used in the sampling for generating each tree s training data. Running the OOB observations through each constructed tree and aggregating the error rate across all the trees in the forest yields very similar error to that from cross-validation (Hastie, Tibshirani, & Friedman, 2009). Although RFs are typically viewed as a black box approach, variable importance rankings can be extracted. However, it is important to note that the variable selection procedure in RFs can be biased when the predictor variables differ in number of categories or scale of measurement and due to the bootstrap aggregation process which will affect the reliability of these variable importance estimates (Strobl, Boulesteix, Zeileis, & Hothorn, 2007); thus, subsampling (sampling without replacement) may be preferred in some scenarios when some inference is important. Nevertheless, RFs have become incredibly popular in both academic literature and in practice because they inherit many of the benefits of decision trees while boosting predictive performance. METHODS To construct RFs via SAS, the HP Forest node within Enterprise Miner can be used. This implementation improves upon the original RF algorithm in a couple of ways. First, the training data for each tree is subsampled as opposed to being bootstrapped. Second, instead of searching for the best splitting rule in the subset of randomly selected predictor variables, the HP Forest node preselects the predictor variable with the greatest amount of association, determined by a statistical test, between the subset and the target variable and then finds its best splitting rule (Maldonado, Dean, Czika, & Haller, 2014). This helps reduce the chances of producing a spurious split and decreases the biases prompted by an exhaustive search for the Gini information gain, thus, leading to more reliable predictions on new data and variable importance rankings. The method of using these tests of association for determining splitting is similar to the RF variant proposed by Strobl, Boulesteix, Zeileis, and Hothorn (2007) which can produce unbiased variable selection in each tree of the forest. The main three parameters to tune in HP Forest are highlighted in red in figure 1 with their default values. The Maximum Number of Trees, Number of vars to consider in split search, and Proportion of obs in each sample correspond to the number of trees constructed in the forest, the number of predictor variables to be randomly selected at each node, and the number of observations dedicated to each tree s training data given by the subsampling, respectively. A numerical value for the Number of vars to consider in split search is not listed because it varies for each dataset. As is typical for RFs, the square root of the number of predictor variables is a suitable default. Tuning this parameter controls the amount of correlation between the trees. Smaller values yield less correlation in the trees; on the other hand, if only a few predictor variables are informative for the target, then this value should be larger to increase the chances of these being selected. This parameter will likely lead to differences in the predictive performance unlike for Maximum Number of Trees, which does not greatly impact the prediction for a sufficient number of trees. 2

3 Figure 1. Main Sections of HP Forest Node's Properties Panel DATA AND EXPERIMENTAL PROCEDURE To investigate the performance of the HP Forest node for author identification tasks, the writeprint dataset, Amazon Commerce Reviews, is used which be found on the UCI Machine Learning Repository (Lichman, 2013). This dataset represents 30 product reviews from 50 active users. Accompanying these observations, 10,000 writeprint variables are constructed from the text of the product reviews (see (Liu, Liu, Sun, & Liu, 2011) for more detail). Given the large number of target classes (in this case 50), only certain models can be utilized for prediction. Figure 2 shows the Enterprise Miner diagram constructed. Along with the HP Forest node, the HP Neural, MBR (Memory Based Reasoning), and Decision Tree nodes are also executed for comparison. These are all left at their default settings. Three versions of the HP Forest node are investigated. Denote ntree to represent Maximum Number of Trees and mtry to symbolize Number of vars to consider in split search. The three version include: 1. ntree = 50; mtry = 100 (default) 2. ntree = 500; mtry = ntree = 5,000; mtry = 1,000 Figure 2. Enterprise Miner Diagram for Experimentation 3

4 To estimate the misclassification error for the comparison models, a random, stratified split is conducted with 70% allocated for training (1,018 observations) and 30% for testing. Capitalizing on the OOB feature for RFs, the Number of obs in each sample parameter is set to 1,018 so that each tree in the forest is trained on the same number of observations as the comparison models. RESULTS Table 1 displays the misclassification rates for each model tested. Each of the RF models delivers a lower error compared to the comparison models. Increasing ntree boosts performance at the default level of mtry; however, increasing both ntree and mtry drastically does not lead to a large difference in error on this dataset compared to the default setting with much larger computational expense. Note that processing time for a normal run of the Decision Tree node is only a few seconds less than the default HP Forest node, which constructs 50 trees. Since the trees are built independently from one another in a RF, this model can be ran in parallel using multiple cores, thanks to the SAS High-Performance (HP) Data Mining framework within Enterprise Miner (Maldonado, Dean, Czika, & Haller, 2014). Model Misclassification Rate Run Duration (Minutes:Seconds) HP Forest :51 HP Forest :58 HP Forest :38 HP Neural :08 MBR :01 Decision Tree :36 Table 1. Predictive Performance and Computational Expense. Figure 2 plots the changing misclassification rate as more trees are added to the best model. Major decreases in the misclassification rate begin to decline at around 100 trees indicating that it may not be necessary to construct so many trees for this dataset. As expected, the misclassification rate given by the OOB observations is less optimistic than when the RF is trained on all the data. This reinforces the necessity to always test predictive models on independently held-out samples of data. Fortunately for RFs, this can be measured at the same time as model training. Figure 3. Misclassification Rate across Varying Numbers of Trees for Best Performing HP Forest Node. CONCLUSION Since their introduction, RFs have become an increasingly popular data mining tool for prediction and inference. In this study, the RF implementation in SAS Enterprise Miner, HP Forest, is used to predict the authorship of various 4

5 Amazon users based on their online writeprint from their product reviews. Because of the large number of target classes, only certain models are adequate for this classification task. The HP Forest node out-performs the other comparison models tested with little increases in computational expense due to the high-performance environment within Enterprise Miner. As the internet becomes more prevalent in society, it becomes all the more important to utilize the best analytical tools to prevent the distribution of illegal content and the communication of malicious entities. Given the high-dimensional nature of these online writeprint datasets, RFs can be a data-driven solution to help authorities identify authorship in online settings. REFERENCES Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), Breiman, L. (2001). Random forests. Machine learning, 45(1), Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks. Hastie, T., Tibshirani R., & Friedman, J. (2009). The elements of statistical learning. New York: Springer. Lewis, R. J. (2000, May). An introduction to classification and regression tree (CART) analysis. In Annual meeting of the society for academic emergency medicine in San Francisco, California (pp. 1-14). Li, J., Zheng, R., & Chen, H. (2006). From fingerprint to writeprint. Communications of the ACM, 49(4), Lichman, M. (2013). UCI Machine Learning Repository [ Irvine, CA: University of California, School of Information and Computer Science. Liu, S., Liu, Z., Sun, J., & Liu, L. (2011). Application of synergetic neural network in online writeprint identification. International Journal of Digital Content Technology and its Applications, 5(3), Maldonado, M., Dean, J., Czika, W., & Haller, S. (2014). Leveraging Ensemble Models in SAS Enterprise Miner. In Proceedings of the SAS Global Forum 2014 Conference. Cary, NC: SAS Institute Inc. Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1), 1. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the authors at: Name: Taylor Larkin Enterprise: The University of Alabama Address: Box City, State ZIP: Tuscaloosa, AL tklarkin@crimson.ua.edu Name: Denise McManus Enterprise: The University of Alabama Address: Box City, State ZIP: Tuscaloosa, AL Work Phone: dmcmanus@cba.ua.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 5

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees

Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees Mariusz Łapczy ski 1 and Bartłomiej Jefma ski 2 1 The Chair of Market Analysis and Marketing Research,

More information

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio

Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio SCSUG Student Symposium 2016 Analyzing sentiments in tweets for Tesla Model 3 using SAS Enterprise Miner and SAS Sentiment Analysis Studio Praneth Guggilla, Tejaswi Jha, Goutam Chakraborty, Oklahoma State

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

Machine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler

Machine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler Machine Learning and Data Mining Ensembles of Learners Prof. Alexander Ihler Ensemble methods Why learn one classifier when you can learn many? Ensemble: combine many predictors (Weighted) combina

More information

Python Machine Learning

Python Machine Learning Python Machine Learning Unlock deeper insights into machine learning with this vital guide to cuttingedge predictive analytics Sebastian Raschka [ PUBLISHING 1 open source I community experience distilled

More information

Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition

Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Todd Holloway Two Lecture Series for B551 November 20 & 27, 2007 Indiana University Outline Introduction Bias and

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge

Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Innov High Educ (2009) 34:93 103 DOI 10.1007/s10755-009-9095-2 Maximizing Learning Through Course Alignment and Experience with Different Types of Knowledge Phyllis Blumberg Published online: 3 February

More information

STT 231 Test 1. Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point.

STT 231 Test 1. Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point. STT 231 Test 1 Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point. 1. A professor has kept records on grades that students have earned in his class. If he

More information

CS Machine Learning

CS Machine Learning CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing

More information

Probability and Statistics Curriculum Pacing Guide

Probability and Statistics Curriculum Pacing Guide Unit 1 Terms PS.SPMJ.3 PS.SPMJ.5 Plan and conduct a survey to answer a statistical question. Recognize how the plan addresses sampling technique, randomization, measurement of experimental error and methods

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

An Empirical Comparison of Supervised Ensemble Learning Approaches

An Empirical Comparison of Supervised Ensemble Learning Approaches An Empirical Comparison of Supervised Ensemble Learning Approaches Mohamed Bibimoune 1,2, Haytham Elghazel 1, Alex Aussem 1 1 Université de Lyon, CNRS Université Lyon 1, LIRIS UMR 5205, F-69622, France

More information

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

Twitter Sentiment Classification on Sanders Data using Hybrid Approach IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders

More information

Axiom 2013 Team Description Paper

Axiom 2013 Team Description Paper Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association

More information

Analysis of Hybrid Soft and Hard Computing Techniques for Forex Monitoring Systems

Analysis of Hybrid Soft and Hard Computing Techniques for Forex Monitoring Systems Analysis of Hybrid Soft and Hard Computing Techniques for Forex Monitoring Systems Ajith Abraham School of Business Systems, Monash University, Clayton, Victoria 3800, Australia. Email: ajith.abraham@ieee.org

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

Business Analytics and Information Tech COURSE NUMBER: 33:136:494 COURSE TITLE: Data Mining and Business Intelligence

Business Analytics and Information Tech COURSE NUMBER: 33:136:494 COURSE TITLE: Data Mining and Business Intelligence Business Analytics and Information Tech COURSE NUMBER: 33:136:494 COURSE TITLE: Data Mining and Business Intelligence COURSE DESCRIPTION This course presents computing tools and concepts for all stages

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

CS4491/CS 7265 BIG DATA ANALYTICS INTRODUCTION TO THE COURSE. Mingon Kang, PhD Computer Science, Kennesaw State University

CS4491/CS 7265 BIG DATA ANALYTICS INTRODUCTION TO THE COURSE. Mingon Kang, PhD Computer Science, Kennesaw State University CS4491/CS 7265 BIG DATA ANALYTICS INTRODUCTION TO THE COURSE Mingon Kang, PhD Computer Science, Kennesaw State University Self Introduction Mingon Kang, PhD Homepage: http://ksuweb.kennesaw.edu/~mkang9

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

A Decision Tree Analysis of the Transfer Student Emma Gunu, MS Research Analyst Robert M Roe, PhD Executive Director of Institutional Research and

A Decision Tree Analysis of the Transfer Student Emma Gunu, MS Research Analyst Robert M Roe, PhD Executive Director of Institutional Research and A Decision Tree Analysis of the Transfer Student Emma Gunu, MS Research Analyst Robert M Roe, PhD Executive Director of Institutional Research and Planning Overview Motivation for Analyses Analyses and

More information

Learning Methods in Multilingual Speech Recognition

Learning Methods in Multilingual Speech Recognition Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex

More information

Fragment Analysis and Test Case Generation using F- Measure for Adaptive Random Testing and Partitioned Block based Adaptive Random Testing

Fragment Analysis and Test Case Generation using F- Measure for Adaptive Random Testing and Partitioned Block based Adaptive Random Testing Fragment Analysis and Test Case Generation using F- Measure for Adaptive Random Testing and Partitioned Block based Adaptive Random Testing D. Indhumathi Research Scholar Department of Information Technology

More information

STA 225: Introductory Statistics (CT)

STA 225: Introductory Statistics (CT) Marshall University College of Science Mathematics Department STA 225: Introductory Statistics (CT) Course catalog description A critical thinking course in applied statistical reasoning covering basic

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction

CLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets

More information

learning collegiate assessment]

learning collegiate assessment] [ collegiate learning assessment] INSTITUTIONAL REPORT 2005 2006 Kalamazoo College council for aid to education 215 lexington avenue floor 21 new york new york 10016-6023 p 212.217.0700 f 212.661.9766

More information

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District

An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District Report Submitted June 20, 2012, to Willis D. Hawley, Ph.D., Special

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

stateorvalue to each variable in a given set. We use p(x = xjy = y) (or p(xjy) as a shorthand) to denote the probability that X = x given Y = y. We al

stateorvalue to each variable in a given set. We use p(x = xjy = y) (or p(xjy) as a shorthand) to denote the probability that X = x given Y = y. We al Dependency Networks for Collaborative Filtering and Data Visualization David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite, Carl Kadie Microsoft Research Redmond WA 98052-6399

More information

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics (L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes

More information

Probability estimates in a scenario tree

Probability estimates in a scenario tree 101 Chapter 11 Probability estimates in a scenario tree An expert is a person who has made all the mistakes that can be made in a very narrow field. Niels Bohr (1885 1962) Scenario trees require many numbers.

More information

Grade 2: Using a Number Line to Order and Compare Numbers Place Value Horizontal Content Strand

Grade 2: Using a Number Line to Order and Compare Numbers Place Value Horizontal Content Strand Grade 2: Using a Number Line to Order and Compare Numbers Place Value Horizontal Content Strand Texas Essential Knowledge and Skills (TEKS): (2.1) Number, operation, and quantitative reasoning. The student

More information

University of Groningen. Systemen, planning, netwerken Bosman, Aart

University of Groningen. Systemen, planning, netwerken Bosman, Aart University of Groningen Systemen, planning, netwerken Bosman, Aart IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

A. What is research? B. Types of research

A. What is research? B. Types of research A. What is research? Research = the process of finding solutions to a problem after a thorough study and analysis (Sekaran, 2006). Research = systematic inquiry that provides information to guide decision

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Multi-label classification via multi-target regression on data streams

Multi-label classification via multi-target regression on data streams Mach Learn (2017) 106:745 770 DOI 10.1007/s10994-016-5613-5 Multi-label classification via multi-target regression on data streams Aljaž Osojnik 1,2 Panče Panov 1 Sašo Džeroski 1,2,3 Received: 26 April

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard

Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA. 1. Introduction. Alta de Waal, Jacobus Venter and Etienne Barnard Chapter 10 APPLYING TOPIC MODELING TO FORENSIC DATA Alta de Waal, Jacobus Venter and Etienne Barnard Abstract Most actionable evidence is identified during the analysis phase of digital forensic investigations.

More information

Visit us at:

Visit us at: White Paper Integrating Six Sigma and Software Testing Process for Removal of Wastage & Optimizing Resource Utilization 24 October 2013 With resources working for extended hours and in a pressurized environment,

More information

Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning

Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning Hendrik Blockeel and Joaquin Vanschoren Computer Science Dept., K.U.Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium

More information

Learning By Asking: How Children Ask Questions To Achieve Efficient Search

Learning By Asking: How Children Ask Questions To Achieve Efficient Search Learning By Asking: How Children Ask Questions To Achieve Efficient Search Azzurra Ruggeri (a.ruggeri@berkeley.edu) Department of Psychology, University of California, Berkeley, USA Max Planck Institute

More information

Major Milestones, Team Activities, and Individual Deliverables

Major Milestones, Team Activities, and Individual Deliverables Major Milestones, Team Activities, and Individual Deliverables Milestone #1: Team Semester Proposal Your team should write a proposal that describes project objectives, existing relevant technology, engineering

More information

*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN

*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN From: AAAI Technical Report WS-98-08. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Recommender Systems: A GroupLens Perspective Joseph A. Konstan *t, John Riedl *t, AI Borchers,

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

A Bootstrapping Model of Frequency and Context Effects in Word Learning

A Bootstrapping Model of Frequency and Context Effects in Word Learning Cognitive Science 41 (2017) 590 622 Copyright 2016 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/cogs.12353 A Bootstrapping Model of Frequency

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

Computerized Adaptive Psychological Testing A Personalisation Perspective

Computerized Adaptive Psychological Testing A Personalisation Perspective Psychology and the internet: An European Perspective Computerized Adaptive Psychological Testing A Personalisation Perspective Mykola Pechenizkiy mpechen@cc.jyu.fi Introduction Mixed Model of IRT and ES

More information

AQUA: An Ontology-Driven Question Answering System

AQUA: An Ontology-Driven Question Answering System AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.

More information

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING Yong Sun, a * Colin Fidge b and Lin Ma a a CRC for Integrated Engineering Asset Management, School of Engineering Systems, Queensland

More information

Exploration. CS : Deep Reinforcement Learning Sergey Levine

Exploration. CS : Deep Reinforcement Learning Sergey Levine Exploration CS 294-112: Deep Reinforcement Learning Sergey Levine Class Notes 1. Homework 4 due on Wednesday 2. Project proposal feedback sent Today s Lecture 1. What is exploration? Why is it a problem?

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

Learning to Rank with Selection Bias in Personal Search

Learning to Rank with Selection Bias in Personal Search Learning to Rank with Selection Bias in Personal Search Xuanhui Wang, Michael Bendersky, Donald Metzler, Marc Najork Google Inc. Mountain View, CA 94043 {xuanhui, bemike, metzler, najork}@google.com ABSTRACT

More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

Algebra 2- Semester 2 Review

Algebra 2- Semester 2 Review Name Block Date Algebra 2- Semester 2 Review Non-Calculator 5.4 1. Consider the function f x 1 x 2. a) Describe the transformation of the graph of y 1 x. b) Identify the asymptotes. c) What is the domain

More information

A Version Space Approach to Learning Context-free Grammars

A Version Space Approach to Learning Context-free Grammars Machine Learning 2: 39~74, 1987 1987 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands A Version Space Approach to Learning Context-free Grammars KURT VANLEHN (VANLEHN@A.PSY.CMU.EDU)

More information

Becoming a Leader in Institutional Research

Becoming a Leader in Institutional Research Becoming a Leader in Institutional Research Slide 1 (Becoming a Leader in IR) California Association for Institutional Research 41st Annual Conference November 18, 2016 Los Angeles, California by Robert

More information

TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE. Pierre Foy

TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE. Pierre Foy TIMSS ADVANCED 2015 USER GUIDE FOR THE INTERNATIONAL DATABASE Pierre Foy TIMSS Advanced 2015 orks User Guide for the International Database Pierre Foy Contributors: Victoria A.S. Centurino, Kerry E. Cotter,

More information

Learning goal-oriented strategies in problem solving

Learning goal-oriented strategies in problem solving Learning goal-oriented strategies in problem solving Martin Možina, Timotej Lazar, Ivan Bratko Faculty of Computer and Information Science University of Ljubljana, Ljubljana, Slovenia Abstract The need

More information

arxiv: v1 [cs.lg] 15 Jun 2015

arxiv: v1 [cs.lg] 15 Jun 2015 Dual Memory Architectures for Fast Deep Learning of Stream Data via an Online-Incremental-Transfer Strategy arxiv:1506.04477v1 [cs.lg] 15 Jun 2015 Sang-Woo Lee Min-Oh Heo School of Computer Science and

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

A Comparison of Standard and Interval Association Rules

A Comparison of Standard and Interval Association Rules A Comparison of Standard and Association Rules Choh Man Teng cmteng@ai.uwf.edu Institute for Human and Machine Cognition University of West Florida 4 South Alcaniz Street, Pensacola FL 325, USA Abstract

More information

Handling Concept Drifts Using Dynamic Selection of Classifiers

Handling Concept Drifts Using Dynamic Selection of Classifiers Handling Concept Drifts Using Dynamic Selection of Classifiers Paulo R. Lisboa de Almeida, Luiz S. Oliveira, Alceu de Souza Britto Jr. and and Robert Sabourin Universidade Federal do Paraná, DInf, Curitiba,

More information

Term Weighting based on Document Revision History

Term Weighting based on Document Revision History Term Weighting based on Document Revision History Sérgio Nunes, Cristina Ribeiro, and Gabriel David INESC Porto, DEI, Faculdade de Engenharia, Universidade do Porto. Rua Dr. Roberto Frias, s/n. 4200-465

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

An Introduction to Simio for Beginners

An Introduction to Simio for Beginners An Introduction to Simio for Beginners C. Dennis Pegden, Ph.D. This white paper is intended to introduce Simio to a user new to simulation. It is intended for the manufacturing engineer, hospital quality

More information

Algebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview

Algebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview Algebra 1, Quarter 3, Unit 3.1 Line of Best Fit Overview Number of instructional days 6 (1 day assessment) (1 day = 45 minutes) Content to be learned Analyze scatter plots and construct the line of best

More information

On-the-Fly Customization of Automated Essay Scoring

On-the-Fly Customization of Automated Essay Scoring Research Report On-the-Fly Customization of Automated Essay Scoring Yigal Attali Research & Development December 2007 RR-07-42 On-the-Fly Customization of Automated Essay Scoring Yigal Attali ETS, Princeton,

More information

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming Data Mining VI 205 Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming C. Romero, S. Ventura, C. Hervás & P. González Universidad de Córdoba, Campus Universitario de

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

ABSTRACT. A major goal of human genetics is the discovery and validation of genetic polymorphisms

ABSTRACT. A major goal of human genetics is the discovery and validation of genetic polymorphisms ABSTRACT DEODHAR, SUSHAMNA DEODHAR. Using Grammatical Evolution Decision Trees for Detecting Gene-Gene Interactions in Genetic Epidemiology. (Under the direction of Dr. Alison Motsinger-Reif.) A major

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

Dublin City Schools Mathematics Graded Course of Study GRADE 4

Dublin City Schools Mathematics Graded Course of Study GRADE 4 I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported

More information

The SREB Leadership Initiative and its

The SREB Leadership Initiative and its SREB LEADERSHIP INITIATIVE SREB s Leadership Curriculum Modules Engage Leaders in Solving Real School Problems Every school has leadership that results in improved student performance and leadership begins

More information

Universidade do Minho Escola de Engenharia

Universidade do Minho Escola de Engenharia Universidade do Minho Escola de Engenharia Universidade do Minho Escola de Engenharia Dissertação de Mestrado Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially

More information

Task Types. Duration, Work and Units Prepared by

Task Types. Duration, Work and Units Prepared by Task Types Duration, Work and Units Prepared by 1 Introduction Microsoft Project allows tasks with fixed work, fixed duration, or fixed units. Many people ask questions about changes in these values when

More information

Seminar - Organic Computing

Seminar - Organic Computing Seminar - Organic Computing Self-Organisation of OC-Systems Markus Franke 25.01.2006 Typeset by FoilTEX Timetable 1. Overview 2. Characteristics of SO-Systems 3. Concern with Nature 4. Design-Concepts

More information

VOL. 3, NO. 5, May 2012 ISSN Journal of Emerging Trends in Computing and Information Sciences CIS Journal. All rights reserved.

VOL. 3, NO. 5, May 2012 ISSN Journal of Emerging Trends in Computing and Information Sciences CIS Journal. All rights reserved. Exploratory Study on Factors that Impact / Influence Success and failure of Students in the Foundation Computer Studies Course at the National University of Samoa 1 2 Elisapeta Mauai, Edna Temese 1 Computing

More information

Objectives. Chapter 2: The Representation of Knowledge. Expert Systems: Principles and Programming, Fourth Edition

Objectives. Chapter 2: The Representation of Knowledge. Expert Systems: Principles and Programming, Fourth Edition Chapter 2: The Representation of Knowledge Expert Systems: Principles and Programming, Fourth Edition Objectives Introduce the study of logic Learn the difference between formal logic and informal logic

More information

English for Specific Purposes World ISSN Issue 34, Volume 12, 2012 TITLE:

English for Specific Purposes World ISSN Issue 34, Volume 12, 2012 TITLE: TITLE: The English Language Needs of Computer Science Undergraduate Students at Putra University, Author: 1 Affiliation: Faculty Member Department of Languages College of Arts and Sciences International

More information

Softprop: Softmax Neural Network Backpropagation Learning

Softprop: Softmax Neural Network Backpropagation Learning Softprop: Softmax Neural Networ Bacpropagation Learning Michael Rimer Computer Science Department Brigham Young University Provo, UT 84602, USA E-mail: mrimer@axon.cs.byu.edu Tony Martinez Computer Science

More information

Introduction to Causal Inference. Problem Set 1. Required Problems

Introduction to Causal Inference. Problem Set 1. Required Problems Introduction to Causal Inference Problem Set 1 Professor: Teppei Yamamoto Due Friday, July 15 (at beginning of class) Only the required problems are due on the above date. The optional problems will not

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer

More information

Unit 7 Data analysis and design

Unit 7 Data analysis and design 2016 Suite Cambridge TECHNICALS LEVEL 3 IT Unit 7 Data analysis and design A/507/5007 Guided learning hours: 60 Version 2 - revised May 2016 *changes indicated by black vertical line ocr.org.uk/it LEVEL

More information

Applications of data mining algorithms to analysis of medical data

Applications of data mining algorithms to analysis of medical data Master Thesis Software Engineering Thesis no: MSE-2007:20 August 2007 Applications of data mining algorithms to analysis of medical data Dariusz Matyja School of Engineering Blekinge Institute of Technology

More information

Characterizing Mathematical Digital Literacy: A Preliminary Investigation. Todd Abel Appalachian State University

Characterizing Mathematical Digital Literacy: A Preliminary Investigation. Todd Abel Appalachian State University Characterizing Mathematical Digital Literacy: A Preliminary Investigation Todd Abel Appalachian State University Jeremy Brazas, Darryl Chamberlain Jr., Aubrey Kemp Georgia State University This preliminary

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Build on students informal understanding of sharing and proportionality to develop initial fraction concepts.

Build on students informal understanding of sharing and proportionality to develop initial fraction concepts. Recommendation 1 Build on students informal understanding of sharing and proportionality to develop initial fraction concepts. Students come to kindergarten with a rudimentary understanding of basic fraction

More information

Improving Simple Bayes. Abstract. The simple Bayesian classier (SBC), sometimes called

Improving Simple Bayes. Abstract. The simple Bayesian classier (SBC), sometimes called Improving Simple Bayes Ron Kohavi Barry Becker Dan Sommereld Data Mining and Visualization Group Silicon Graphics, Inc. 2011 N. Shoreline Blvd. Mountain View, CA 94043 fbecker,ronnyk,sommdag@engr.sgi.com

More information

Accessing Higher Education in Developing Countries: panel data analysis from India, Peru and Vietnam

Accessing Higher Education in Developing Countries: panel data analysis from India, Peru and Vietnam Accessing Higher Education in Developing Countries: panel data analysis from India, Peru and Vietnam Alan Sanchez (GRADE) y Abhijeet Singh (UCL) 12 de Agosto, 2017 Introduction Higher education in developing

More information

CS 1103 Computer Science I Honors. Fall Instructor Muller. Syllabus

CS 1103 Computer Science I Honors. Fall Instructor Muller. Syllabus CS 1103 Computer Science I Honors Fall 2016 Instructor Muller Syllabus Welcome to CS1103. This course is an introduction to the art and science of computer programming and to some of the fundamental concepts

More information

On the Combined Behavior of Autonomous Resource Management Agents

On the Combined Behavior of Autonomous Resource Management Agents On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science

More information