Using R for Statistics. Sarah Stowell



Using R for Statistics

Copyright © 2014 by Sarah Stowell

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk): 978-1-4842-0140-4
ISBN-13 (electronic): 978-1-4842-0139-8

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Publisher: Heinz Weinheimer
Lead Editor: Steve Anglin
Development Editor: Matthew Moodie and Chris Nelson
Technical Reviewers: Myron Hlynka, Wen Sui Liu, and Larry Pace
Editorial Board: Steve Anglin, Mark Beckner, Ewan Buckingham, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Jonathan Hassell, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Anamika Panchoo
Copy Editor: Laura Lawrie
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Cover Designer: Anna Ishchenko

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales eBook Licensing web page at www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this text is available to readers at www.apress.com/9781484201404. For detailed information about how to locate your book's source code, go to www.apress.com/source-code/.

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: R Fundamentals
Chapter 2: Working with Data Files
Chapter 3: Preparing and Manipulating Your Data
Chapter 4: Combining and Restructuring Datasets
Chapter 5: Summary Statistics for Continuous Variables
Chapter 6: Tabular Data
Chapter 7: Probability Distributions
Chapter 8: Creating Plots
Chapter 9: Customizing Your Plots
Chapter 10: Hypothesis Testing
Chapter 11: Regression and General Linear Models
Appendix A: Add-On Packages
Appendix B: Basic Programming with R
Appendix C: Datasets
Index

Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: R Fundamentals
  Downloading and Installing R
  Getting Orientated
  The R Console and Command Prompt
  Functions
  Objects
    Simple Objects
    Vectors
    Data Frames
  The Data Editor
  Workspaces
  Error Messages
  Script Files
  Summary

Chapter 2: Working with Data Files
  Entering Data Directly
  Importing Plain Text Files
    CSV and Tab-Delimited Files
    DIF Files
    Other Plain Text Files
  Importing Excel Files
  Importing Files from Other Software
  Using Relative File Paths
  Exporting Datasets
  Summary

Chapter 3: Preparing and Manipulating Your Data
  Variables
    Rearranging and Removing Variables
    Renaming Variables
    Variable Classes
  Calculating New Numeric Variables
  Dividing a Continuous Variable into Categories
  Working with Factor Variables
  Manipulating Character Variables
    Concatenating Character Strings
    Extracting a Substring
    Searching a Character Variable
  Working with Dates and Times
  Adding and Removing Observations
    Adding New Observations
    Removing Specific Observations
    Removing Duplicate Observations
  Selecting a Subset of the Data
    Selecting a Subset According to Selection Criteria
    Selecting a Random Sample from a Dataset
  Sorting a Dataset
  Summary

Chapter 4: Combining and Restructuring Datasets
  Appending Rows
  Appending Columns
  Merging Datasets by Common Variables
  Stacking and Unstacking a Dataset
    Stacking Data
    Unstacking Data
  Reshaping a Dataset
  Summary

Chapter 5: Summary Statistics for Continuous Variables
  Univariate Statistics
  Statistics by Group
  Measures of Association
    Covariance
    Pearson's Correlation Coefficient
    Spearman's Rank Correlation Coefficient
  Hypothesis Test of Correlation
  Comparing a Sample with a Specified Distribution
    Shapiro-Wilk Test
    Kolmogorov-Smirnov Test
  Confidence Intervals and Prediction Intervals
  Summary

Chapter 6: Tabular Data
  Frequency Tables
    Creating Tables
    Displaying Tables
    Creating Tables from Count Data
    Creating a Table Directly
  Chi-Square Goodness-of-Fit Test
  Tests of Association Between Categorical Variables
    Chi-Square Test of Association
    Fisher's Exact Test
  Proportions Test
  Summary

Chapter 7: Probability Distributions
  Probability Distributions in R
  Probability Density Functions and Probability Mass Functions
  Finding Probabilities
  Finding Quantiles
  Generating Random Numbers
  Summary

Chapter 8: Creating Plots
  Simple Plots
  Histograms
  Normal Probability Plots
  Stem-and-Leaf Plots
  Bar Charts
  Pie Charts
  Scatter Plots
  Scatterplot Matrices
  Box Plots
  Plotting a Function
  Exporting and Saving Plots
  Summary

Chapter 9: Customizing Your Plots
  Titles and Labels
  Axes
  Colors
  Plotting Symbols
  Plotting Lines
  Shaded Areas
  Adding Items to Plots
    Adding Straight Lines
    Adding a Mathematical Function Curve
    Adding Labels and Text
    Adding a Grid
    Adding Arrows
  Overlaying Plots
  Adding a Legend
  Multiple Plots in the Plotting Area
  Changing the Default Plot Settings
  Summary

Chapter 10: Hypothesis Testing
  Student's T-Tests
    One-Sample T-Test
    Two-Sample T-Test
    Paired T-Test
  Wilcoxon Rank-Sum Test
  Analysis of Variance
  Kruskal-Wallis Test
  Multiple Comparison Methods
    Tukey's HSD Test
    Other Pairwise T-Tests
    Pairwise Wilcoxon Rank-Sum Tests
  Hypothesis Tests for Variance
    F-Test
    Bartlett's Test
  Summary

Chapter 11: Regression and General Linear Models
  Building the Model
    Simple Linear Regression
    Multiple Linear Regression
    Interaction Terms
    Polynomial Terms
    Transformations
    The Intercept Term
    Including Factor Variables
    Updating a Model
    Stepwise Model Selection Procedures
  Assessing the Fit of the Model
  Coefficient Estimates
  Plotting the Line of Best Fit
  Model Diagnostics
    Residual Analysis
    Leverage
    Cook's Distances
  Making Predictions
  Summary

Appendix A: Add-On Packages
  Viewing a List of Available Add-On Packages
  Installing and Loading Add-On Packages
    Windows Users
    Mac Users
    Linux Users

Appendix B: Basic Programming with R
  Creating New Functions
  Conditional Statements
    Conditions
    If Statement
    If/else Statement
    The switch Function
  Loops
    For Loop
    While Loop
  Summary

Appendix C: Datasets
  apartments
  bigcats
  bottles
  brains
  CIAdata1, CIAdata2
  coffeeshop
  concrete
  CPIdata
  customers
  endangered
  fiveyearreport
  flights
  fruit
  grades1
  people
  people2
  powerplant
  pulserates
  resistance
  supermarkets
  vitalsigns
  WHOdata

Index

About the Author

Sarah Stowell is a contract statistician based in the UK who has previously worked with Mitsubishi Pharma Europe, MDSL International, and GlaxoSmithKline. She holds a Master of Science degree in Statistics.

About the Technical Reviewer

Dr. Larry Pace is a statistics author and educator as well as a consultant. He lives in the upstate area of South Carolina in the town of Anderson. He is a professor of statistics, mathematics, psychology, management, and leadership. He has programmed in a variety of languages and scripting languages including R, Visual Basic, JavaScript, C#, PHP, APL, and, in a long-ago world, Fortran IV. He writes books and tutorials on statistics, computers, and technology. He has also published many academic papers and given dozens of presentations and lectures. He has consulted with Compaq Computers, AT&T, Xerox Corporation, the U.S. Navy, and International Paper. He has taught at Keiser University, Argosy University, Capella University, Ashford University, Anderson University (where he was the chair of the behavioral sciences department), Clemson University, Louisiana Tech University, LSU in Shreveport, the University of Tennessee, Cornell University, Rochester Institute of Technology, Rensselaer Polytechnic Institute, and the University of Georgia.

Acknowledgments

First, I would like to thank the Apress team, in particular: Lead Editor Steve Anglin, for getting me on board and giving me the chance to work with Apress; Coordinating Editors Anamika Panchoo and Mark Powers for keeping me on track; Development Editor Chris Nelson for teaching me a lot about writing; Technical Editor Larry Pace for making many valuable suggestions to improve the quality of the book; and the many others whom I have not met but who I can see have done a great job helping to create the finished product.

I would also like to thank Andrés Barnett, James Sedgwick, and Therese Stukel for providing data for the examples, and my husband Timothy Baldock and friends Jemma-Kay Johnstone, Christopher Gilmour, Nina Farrell, Chris Brown, Artur Kyral, and Eddie Chung, who all helped with the project in its early stages.

Introduction

Welcome to Using R for Statistics. This book was written for anyone who wants to use R to analyze data and create statistical plots. It is suitable for those with little or no experience with R, and aims to get you up and running quickly without having to learn all the details of programming.

About R

R is a statistical analysis and graphics environment and also a programming language. It is command-driven and very similar to the commercially produced S-Plus software. R is known for its professional-looking graphics, which allow complete customization.

R is open-source software and free to install under the GNU General Public License. It is written and maintained by a group of volunteers known as the R Core Team.

The base software is supplemented by over 5,000 add-on packages developed by R users all over the world, many of whom belong to the academic community. These packages cover a broad range of statistical techniques, including some that were developed very recently and some that serve niche purposes. Anyone can contribute add-on packages, which are checked for quality before they are added to the collection.

At the time of writing, the current version of R is 3.1.0.

What You Will Learn

This book is designed to give straightforward, practical guidance for performing popular statistical methods in R. The programming aspect of R is explored only briefly.

After reading this book you will be able to:

  navigate the R system
  enter and import data
  manipulate datasets
  calculate summary statistics
  create statistical plots and customize their appearance
  perform hypothesis tests such as the t-test and analysis of variance
  build regression models
  access additional functionality with the use of add-on packages
  create your own functions
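To give a flavor of a few of the tasks listed above, here is a minimal sketch in R. It assumes the built-in `trees` dataset purely for illustration (the book's own examples may use different data), and `range_width` is a hypothetical helper name, not something taken from the book.

```r
# Illustration only: 'trees' ships with R; the book may use other datasets.

# Calculate summary statistics for one variable
summary(trees$Height)

# Perform a one-sample t-test against a hypothesized mean of 70
t.test(trees$Height, mu = 70)

# Build a simple linear regression model and inspect its coefficients
model <- lm(Volume ~ Girth, data = trees)
coef(model)

# Create your own function (hypothetical helper: width of the data range)
range_width <- function(x) max(x) - min(x)
range_width(trees$Height)
```

Each line can be typed directly at the R console; the later chapters cover these functions and their output in detail.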

Knowledge Assumed

Although this book does include some reminders about statistical methods and examples demonstrating their use, it is not intended to teach statistics. Therefore, you will require some previous knowledge. You should be able to select the most appropriate statistical method for your purpose and interpret the results. You should also be familiar with common statistical terms and concepts. If you are unsure about any of the methods that you are using, I recommend that you use this book in conjunction with a more detailed book on statistics.

No prior knowledge of R or of programming is assumed, making this book ideal if you are more accustomed to working with point-and-click style packages. Only general computer skills and a familiarity with your operating system are required.

Conventions Used in This Book

This book uses the following typographical conventions:

  Fixed width font is used to distinguish all R commands and output from the main text.
  Normal fixed width font is used for built-in R function names, argument names, syntax, specific dataset and variable names, and any other parts of the commands that can be copied verbatim.
  Slanted fixed width font is used for generic dataset and variable names and any other parts of the commands that should be replaced with the user's own values.

Often it has not been possible to fit a whole command into the width of the page. In these cases, the command is continued on the following line and indented. Where you see this, the command should still be entered into the console on a single line.

Text boxes, which are separate from the main text, contain reminders of statistical theory or methods.

Practical examples are presented in separate numbered sections.

Datasets Used in This Book

A large number of example datasets are included with R, and these are available to use as soon as you open the software. This book makes use of several of these datasets for demonstration purposes.
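As a brief illustration of the built-in datasets, the sketch below lists what ships with the standard `datasets` package and then uses one dataset by name. The choice of `iris` is an assumption made only for this example; the text does not say which built-in datasets the book relies on.

```r
# List the names of the datasets in the base 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# A built-in dataset can be used by name without importing anything.
# 'iris' is an illustrative choice, not necessarily one used in the book.
str(iris)
mean(iris$Sepal.Length)
```

Calling `data()` with no arguments at the console opens the same listing for every package currently loaded.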
There are also a number of additional datasets used throughout the book, details of which are given in Appendix C. They are available to download at www.apress.com/9781484201404.

Contact the Author

If you have any suggestions or feedback, I would love to hear from you. You can email me at s.stowell@instantr.com.