Predicting Game Outcomes and Spread with NFL Data. Rutgers University


Immanuel Williams
5/7/2015

Contents

Executive Summary
Introduction
Data Derivation & Summary
Analysis
    Prediction of Outcomes
    Prediction of Game Spread
Conclusion
References

Executive Summary

In football, sports analysts and fans are constantly trying to predict which team will win and by how much. Most sports analysts discuss key players, matchups and coaching when explaining why a team wins. However, little is said about using statistical models to predict victories or the point spread. The purpose of this report is to use past game statistics to predict whether a team wins or loses and to predict the spread of a game. The data used in this project was extracted from www.pro-football-reference.com. The data was then manipulated so that statistics from previous games, such as yards, points, point difference, turnovers and average wins, could be used to predict game outcomes and spreads. Of the statistical models used in this paper, k-nearest neighbors and quadratic discriminant analysis were the best-performing methods for predicting game outcomes. However, the methods used to predict game spreads did not perform well.

Introduction

Exploring what makes a team win matters not only to passionate fans but also to stakeholders who follow these games closely. With statistical models like the ones in this paper, team owners, general managers and coaches could anticipate the outcome of each game and make adjustments to pursue an upset or at least a closer game. The National Football League (NFL) could schedule games so that close games (small point spread) fall in primetime to maximize viewership. Cable companies could also use this information to decide which advertisements to run during certain games, because advertising during a close game should cost more than advertising during a blowout (large point spread). These techniques could also be applied to other sports to secure a certain level of viewership.

Multiple studies have examined predicting the probability that a team wins and by how much. One study analyzed the probability of a favored team beating an underdog team by p points (Stern, 1991). That work looked at only 5 years of data and did not use the techniques discussed in this paper. Other research evaluated how well a community of NFL fans can predict future game wins (Szalkowski & Nelson, 2012). Other work used Twitter as a source to predict wins (Sinha et al., 2013). However, little research has used the variables and statistical models discussed in this paper to predict wins, losses and point spread.

The next section describes the derivation of the data and summarizes it. The following section discusses the statistical models used to predict game outcomes and point spread. The final section reviews the findings and their implications and discusses future research.

Data Derivation & Summary

Once the data was extracted from the website, it was cleaned and organized. This included removing playoff and Super Bowl games, reformatting the data to cover the past 12 seasons (2002 to 2013), and manipulating the data so that the current and previous two seasons of game data are used to predict the last 6 games of each season. Playoff and Super Bowl games were excluded so the data was not inflated by non-random data. The reformatting and manipulation were done for three reasons:

1) Ensure that there was data for all 32 teams (before 2002 the NFL had 31 teams)
2) Use the current season's statistics and the previous 2 seasons' data in prediction
3) Create more variables (discussed below)

Once the formatting was done, 45 predictors were created from 5 variables: yards, points, turnovers, point difference and average wins. The 45 predictors were derived by splitting the current season and each of the previous 2 seasons into three groups, representing beginning-, middle- and end-of-season effects. The average of each of the 5 variables was calculated for each group, for each team, for the last 6 games of each season, which gives 3 seasons x 3 groups x 5 variables = 45 predictors. This was done to capture team streakiness (winning or losing games consecutively) and to create more variables based on a team's past performance.
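The derivation above can be sketched in a few lines of Python. This is an illustration only, not the code used in the report (which worked in R): the field names, the helper names `split_into_thirds`, `season_features` and `team_features`, the equal-thirds split, and the toy 9-game seasons are all assumptions made for the example.

```python
from statistics import mean

# The five base variables named in the text (field names are hypothetical).
VARIABLES = ["yards", "points", "turnovers", "point_diff", "win"]
SEGMENTS = ["begin", "middle", "end"]


def split_into_thirds(games):
    """Split a chronologically ordered list of games into three groups.

    Assumption: the groups are roughly equal in size; the report does not
    say exactly where the cut points were placed.
    """
    n = len(games)
    cut1, cut2 = n // 3, 2 * n // 3
    return [games[:cut1], games[cut1:cut2], games[cut2:]]


def season_features(games, season_label):
    """Average each base variable within each third of one season."""
    features = {}
    for segment, group in zip(SEGMENTS, split_into_thirds(games)):
        for var in VARIABLES:
            features[f"{season_label}_{segment}_{var}"] = mean(g[var] for g in group)
    return features


def team_features(seasons):
    """Combine the per-season features of one team into one feature dict."""
    features = {}
    for label, games in seasons.items():
        features.update(season_features(games, label))
    return features


def toy_season(offset):
    """Build a made-up 9-game season so the example runs on its own."""
    return [
        {"yards": 300 + offset + i, "points": 20 + i, "turnovers": i % 3,
         "point_diff": i - 4, "win": i % 2}
        for i in range(9)
    ]


seasons = {"current": toy_season(0), "prev1": toy_season(10), "prev2": toy_season(20)}
features = team_features(seasons)
print(len(features))  # 3 seasons x 3 groups x 5 variables = 45
```

Averaging the win indicator within a group yields the "average wins" variable, and keeping the three groups separate is what lets the model see early-, mid- and late-season form.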

A binomial distribution with probability 0.5 was used to randomly determine which game was going to be a win or a loss. The wins/losses are then used as the response variable. The wins/losses are also used to determine the point difference, which serves as the game spread response variable.

Because of the number of variables, only general descriptions of the data are given. The yards variables had a mean of around 330 and a standard deviation of around 50. The points variables generally had a mean of around 20 and a standard deviation of around 5. The turnover variables had a mean of about 2 and a standard deviation of about 0.6, whereas the point difference variables tended to have a small mean of around 0.5 and a standard deviation of around 8. This is because some games are close and some are blowouts, hence the small mean point difference and the large standard deviation. The average wins variables had a mean of 0.5 and a standard deviation of 0.24.

Analysis

Prediction of Outcomes

Before any of the statistical models were fit, the data was split at random into a training data set and a test data set using the sample function in R, so that the statistical methods could be checked on data not used for fitting. The training set contained 958 games and the test set contained 300 games. Once the data was split, several statistical models were used: ordinary least squares (OLS), logistic regression (LR), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machine (SVM), k-nearest neighbor (KNN), and Ridge and LASSO regression. The results can be seen in Figure 1.

Figure 1.
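The random split described above could look roughly like the following. The report states that the split was done with the sample function in R; this Python analogue, the function name `train_test_split` and the fixed seed are illustrative assumptions. Only the sizes 958 and 300 come from the text.

```python
import random


def train_test_split(n_games, n_test, seed=0):
    """Randomly pick n_test game indices for testing; the rest are for training.

    The seed is fixed only so the example is reproducible.
    """
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(n_games), n_test))  # sampling without replacement
    train_idx = [i for i in range(n_games) if i not in test_idx]
    return train_idx, sorted(test_idx)


# 958 training games + 300 test games, as reported in the text.
train_idx, test_idx = train_test_split(958 + 300, 300)
print(len(train_idx), len(test_idx))
```

Sampling indices without replacement guarantees that no game appears in both sets, which is what makes the test error an honest estimate.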

The misclassification rate was used to measure the performance of each model. The results show that KNN and QDA give the lowest test error, and that KNN, Ridge and LASSO give the lowest training error. The best SVM setting used a cost of 0.01 and a second tuning parameter of 0.022. For KNN, the best number of neighbors was N=3 on the training data set and N=22 on the test data set. For the Ridge and LASSO regressions, the best tuning parameters were λ=0.011 and λ=0.001, respectively.

Once this analysis was done, cross-validation was performed on the OLS and QDA methods with respect to the number of principal components. Figure 2 plots the error against the number of components. The points on the graph mark the minimum error for OLS and QDA, which are 0.394 with 3 components and 0.390 with 7 components, respectively.

Figure 2.

A basis expansion was then used to increase the number of variables. This was applied only to the current-season variables and increased the number of variables to 180. Figure 3 shows the error of the same statistical models used in the first analysis. Once again, QDA and KNN outperform the other methods on the test data set, while SVM produces the smallest training error, using a cost of 0.001 and a second tuning parameter of 0.0005. KNN used N=2 for the training data set and N=21 for the test data set. The λs for the Ridge and LASSO regressions were 0.0031 and 0.0051, respectively.
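To make the error measure concrete, here is a minimal sketch of a k-nearest-neighbor classifier scored by misclassification rate. It is not the report's implementation: the function names, the Euclidean distance, and the tiny two-feature data set are assumptions chosen so the example is easy to check by hand.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], x),  # Euclidean distance (an assumption)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def misclassification_rate(y_true, y_pred):
    """Fraction of games whose predicted outcome differs from the actual one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: label 1 = win, label 0 = loss (values are made up).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.5, 0.5), (5.5, 5.5), (4, 4)]
test_y = [0, 1, 0]

predictions = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
error = misclassification_rate(test_y, predictions)
print(predictions, error)  # the third test point is misclassified, so error is 1/3
```

Choosing the number of neighbors then amounts to repeating this calculation over a range of k values and comparing the resulting error rates. An odd k avoids tied votes when there are only two classes.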

Figure 3.

Prediction of Game Spread

As in the prediction of outcomes, the data set was randomly split into training and test data sets. Instead of all the statistical methods from the previous section, only OLS, Ridge and LASSO were used to predict game spread. In addition, forward selection and backward elimination, each tuned by the best Mallows's Cp and Bayesian information criterion (BIC), were also used. Figure 4 summarizes the results of these models.

Figure 4.

The mean squared error was used to measure the precision of each statistical model. Overall, none of the models performed well on either the training or the test data set. The best λ for both Ridge and LASSO regression was 0.001. For both backward elimination and forward selection, the best number of variables was 7 with respect to Cp and 4 with respect to BIC.
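For reference, the error measure and the two penalized regressions mentioned above are usually written as follows. These are the standard textbook forms, given here as a reminder rather than as the exact objective the report's software minimized; the symbols y_i (observed spread), \hat{y}_i (predicted spread), x_i (predictor vector), p (number of predictors) and λ (tuning parameter) are notation assumed for this block.

```latex
\begin{align*}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \\
\hat{\beta}^{\mathrm{ridge}} &= \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
  + \lambda\sum_{j=1}^{p}\beta_j^{2} \\
\hat{\beta}^{\mathrm{lasso}} &= \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
  + \lambda\sum_{j=1}^{p}\left|\beta_j\right|
\end{align*}
```

Some software rescales the squared-error term (for example by a factor involving n), so a tuning parameter value such as 0.001 is only meaningful relative to the implementation that produced it. With a value that small, both penalties are weak and the fits are likely to be close to OLS.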

Conclusion

Predicting game outcomes and spreads is an important but difficult task. In this line of research, not only do the statistical models have to be highly discriminative and predictive, but the data also has to be derived in such a way that the methods can be properly applied. For predicting game outcomes, KNN, QDA and SVM worked reasonably well in terms of misclassification on both the training and test data sets. On the other hand, game spread was not estimated well by any of the statistical models.

There are two probable reasons why these models could not predict game spreads well. The first stems from the derivation of the data: the variables used, and the way they were organized, may not have been appropriate for the statistical models. The second is that the absolute value of the game spread was not used in the response variable, hence the large mean squared error.

There are many ways to improve this study. One is to include more types of variables, such as the number of first downs, penalties and touchdowns per game. This would provide more information about how well a team performs during a game, which should lead to better predictions. Another is to take the absolute value of the game spread response variable, which should improve prediction accuracy. Lastly, once more diverse variables are incorporated into the data set, dimension reduction tools such as principal component analysis and Fisher discriminant analysis should be applied to ensure precision.

References

Sinha, S., Dyer, C., Gimpel, K., & Smith, N. A. (2013). Predicting the NFL Using Twitter.
Stern, H. (1991). On the Probability of Winning a Football Game.
Szalkowski, G., & Nelson, M. L. (2012). The Performance of Betting Lines for Predicting the Outcome of NFL Games.