Mocking the Draft: Predicting NFL Draft Picks and Career Success

Wesley Olmsted [wolmsted], Jeff Garnier [jeff1731], Tarek Abdelghany [tabdel]

1 Introduction

We set out to build a prediction model related to football because of the vast amount of available data and the interesting applications. The NFL draft is a pivotal time for pro teams, and it can make or break seasons to come. Instead of looking at mock drafts done by NFL experts, we decided to build a machine learning model that predicts the round in which a player gets drafted based on various college stats. Predicting the round alone does not, however, predict the player's performance in the league. We therefore added a second prediction target: pro-bowl status. Selection to the pro-bowl is based on votes from coaches, players, and fans. Based on previous selections, we consider pro-bowl selection a good measure of success in the NFL.

2 Task Definition

Our task is to predict pro-bowl status and draft round given a player's college performance. Our input is a set of features that we scraped from several different sources. We model the task as a machine learning problem and build a predictor function F that takes a feature vector for a given player and predicts whether or not that player made a pro-bowl and in which round they were drafted. Here is an example input/output pair:

Input:
{
  "player": "Andy Dalton",
  "position": "QB",
  "college": "TCU",
  "conference": "Big 12",
  "division": 1,
  "height": 6.1,
  "weight": 215,
  "draft age": 23,
  "passes completed": 895,
  "passes attempted": 1477,
  "passing yards": 10211,
  "passing tds": 69,
  "interceptions": 45,
  "rushing attempts": 129,
  "rushing yards": 391,
  "rushing tds": 5
}

Output:
{ "round": 2, "pro bowl selection": 1 }

Since different positions have different stats, we decided to focus only on classifying quarterbacks. The scope of this project is limited to predicting the draft round and NFL performance through pro-bowl status.
We did not attempt to predict actual NFL stats from this data because players change greatly when transitioning from college to the NFL. We also limited our scope to drafts from 1990-2013, since college data was more consistently available for this timeframe.

We evaluated success differently for the round prediction and the pro-bowl prediction. We evaluated the draft round prediction by computing the average round error of our predictor and the percentage of rounds we predicted correctly. Our goal was to get the lowest average round error while correctly predicting as many examples as we could. Very few players make a pro-bowl, which makes the pro-bowl data heavily skewed. Because of this, when evaluating our success we separated the players that made a pro-bowl from those that did not. We then calculated the percentage correctly classified in each category, as well as the total percentage correctly classified.

3 Infrastructure

To get the data we needed for our project, we scraped college football player stats from various websites. We aggregated all of the scraped features in one JSON file, and we split that file to get our training, test, and validation data sets. We used a test set with a size of 15% of the training set, and a validation set of 10% of the remaining training set. Since we also wanted to experiment with different features, we made it possible to create our training, test, and validation sets with a desired list of features from the original scraped data. This was useful because some features were redundant and we did not want them in our data sets, and because we found that some features were simply not useful for what we were trying to predict.

There were also some problems finding comparable data for all players. For example, some QBs went professional before playing 4 years in college. To standardize performance across players, we ended up aggregating the stats of only the last two seasons.
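The split and the feature selection described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `split_players`, its parameters, and the toy records are hypothetical, and it assumes the 15% test set is carved out first and the 10% validation set is then taken from what remains.

```python
import random

def split_players(players, test_frac=0.15, val_frac=0.10, keep=None, seed=0):
    """Shuffle player records, then carve out test and validation sets.

    players: list of dicts, one per player (hypothetical layout).
    keep: optional list of feature names to retain; others are dropped.
    """
    rows = list(players)
    if keep is not None:
        # Keep only the requested features for every player.
        rows = [{k: p[k] for k in keep if k in p} for p in rows]
    random.Random(seed).shuffle(rows)

    # Assumption: the test set is taken first, validation from the remainder.
    n_test = round(len(rows) * test_frac)
    test, rest = rows[:n_test], rows[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Toy data: 200 made-up players, one irrelevant feature ("receptions").
players = [{"player": f"QB{i}", "pass_yds": 1000 + i, "receptions": 0}
           for i in range(200)]
train, val, test = split_players(players, keep=["player", "pass_yds"])
print(len(train), len(val), len(test))  # 153 17 30
```

Fixing the seed keeps the three sets stable across runs, which makes comparisons between feature lists easier to interpret.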
Other important features include college and conference, since college stats from different conferences are not directly comparable. Some conferences have a lower caliber of teams, so QBs are able to put up better stats against those teams.

4 Approach

Baseline: For the baseline on the draft round, we used standard offensive features (rush yards, passes completed, rush yards per attempt, passing touchdowns, passer rating, college, passing yards, draft year) and performed linear regression using stochastic gradient descent with squared loss. The results were an average round error of 2.65, with 11.8% of rounds predicted exactly.
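The two draft-round measures reported here can be computed as in the sketch below. The exact definitions are not spelled out in the text, so this assumes that the average round error is the mean absolute difference between predicted and actual round, and that a prediction counts as correct when it rounds to the actual round. The function name and sample values are illustrative only.

```python
def round_metrics(predicted, actual):
    """Return (average absolute round error, fraction predicted exactly).

    predicted: raw regression outputs (floats), one per player.
    actual: true draft rounds (ints), same order.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    # Assumption: error is measured on the raw, unrounded prediction.
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    # Assumption: "correct" means the nearest integer round matches.
    exact = sum(1 for p, a in zip(predicted, actual) if round(p) == a)
    return sum(errors) / len(errors), exact / len(actual)

# Made-up predictions for four players.
avg_error, pct_correct = round_metrics([1.2, 3.6, 6.9, 2.0], [1, 3, 7, 4])
print(f"average round error: {avg_error:.3f}, correct: {pct_correct:.1%}")
```

The two numbers can move independently: a model may land close to the right round on average while rarely hitting it exactly, which is why both are reported.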

For the baseline on pro-bowl status, we used the same standard offensive features and performed linear classification using stochastic gradient descent with hinge loss. The results were 62.1% success given the player did not make the pro-bowl and 0% success given the player did make the pro-bowl. Overall, we got 52.9% total success.

Oracle: For the oracle on the draft round, we used the same features as the baseline and performed linear regression using stochastic gradient descent with squared loss, but trained on the baseline test set. The results were an average round error of 0.71, with 35.3% of rounds predicted exactly. For the oracle on pro-bowl status, we used the features of years starting and draft year. The years-starting feature would not normally be available to us, since this stat is only recorded after the player has been drafted. The results were 79.3% success given the player did not make the pro-bowl and 100.0% success given the player did make the pro-bowl. Overall, we got 82.4% total success.

Feature Extraction: The first step in our algorithm is feature extraction. We organized all of our scraped players and their features into sparse vectors whose entries are the corresponding numeric values. Numerical features were normalized with the formula

$$f_{i,\mathrm{norm}} = \frac{f_i - f_{\min}}{f_{\max} - f_{\min}}$$

to ensure that all values lie between 0 and 1. One of the most important quantities to normalize was the round. The NFL draft contained 12 rounds from 1990 to 1992, then 8 rounds in 1993, and finally 7 rounds from 1994 to the present. We normalized the round relative to each respective draft year to get the most out of our training data.

We needed to experiment with the set of features, since some features were redundant and could weight a player too heavily in a certain direction. Some features, like receptions for quarterbacks, were scraped but were not relevant to performance.
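A small sketch of the min-max normalization, including the per-year treatment of the draft round, is given below. The round counts per year come from the text; how exactly the round was rescaled is not stated, so the sketch assumes min-max scaling with round 1 as the minimum and that year's last round as the maximum. All function names are hypothetical.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1] using (f - f_min) / (f_max - f_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rounds_in_draft(year):
    """Number of draft rounds for years in the 1990-2013 scope of the project."""
    if year <= 1992:
        return 12
    if year == 1993:
        return 8
    return 7

def normalize_round(draft_round, year):
    """Assumption: round 1 maps to 0 and the final round of that year maps to 1."""
    last = rounds_in_draft(year)
    return (draft_round - 1) / (last - 1)

print(min_max([215, 230, 245]))      # [0.0, 0.5, 1.0]
print(normalize_round(7, 2000))      # 1.0 (last round of a 7-round draft)
print(normalize_round(7, 1991))      # a mid-draft pick in a 12-round draft
```

Under this scheme a seventh-round pick in 1991 and a seventh-round pick in 2000 get different targets, reflecting that the former was chosen in the middle of the draft and the latter at the very end.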
We ended up removing the college and conference features, since we realized that we were overfitting due to our limited training set. We also added height, weight, and Wonderlic scores, since we found that these features are considered important by NFL scouts. On top of this, we noticed that some players go pro before they have used all 4 years of NCAA eligibility. We reasoned that a player who goes pro early is more likely to be a good player and to get drafted earlier, so we added a feature indicating whether a player declared for the NFL draft early.

Algorithms: In order to maximize our chances of success, we tested multiple algorithms on our dataset of QBs from 1990 to 2013. The three approaches we took were (1) linear regression/classification using stochastic gradient descent (SGD) with squared loss and hinge loss, (2) neural networks, and (3) k-means combined with the linear regression/classification SGD.

To perform linear regression using SGD and squared loss, we used SGDRegressor from the sklearn package. For all instances of SGD, we used α = 0.0001 and 1000 iterations. To perform linear classification using SGD and hinge loss, we used SGDClassifier, also from the sklearn package.

For the neural network, we used MLPRegressor and MLPClassifier from sklearn. We used our validation set to tune the network's number of hidden layers and hidden neurons, and found that one hidden layer with 17 neurons worked best on our validation set.

Our reasoning behind clustering with k-means was that quarterbacks generally tend to fall into one of two categories: more traditional pocket-passers or mobile scramblers. Given this kind of variation, we thought it reasonable to cluster the players. For reference, here is an example of a pocket passer (Tom Brady) and a mobile QB (Russell Wilson).

Feature                 Russell Wilson   Tom Brady
pass int                            18          16
pass cmp                           533         380
rush yds                           773        -136
adj pass yds per att               8.6         7.3
pass td                             61          30
pass rating                      159.7       135.6
pass yds                          6738        4644
rush att                           222          88
pass cmp pct                      63.8        61.5
pass yds per att                   8.1         7.5
rush td                             15           3
pass att                           836         618
rush yds per att                   3.5        -1.5

Right away, we can tell that Russell Wilson has many more rushing yards and rushing touchdowns, while their passing stats are relatively similar. By clustering these kinds of QBs, we can get a better idea of how each style of quarterback gets drafted and performs in the NFL. For quarterbacks, we therefore implemented a k-means algorithm (K=2) that finds two centroids and fits a separate regressor and classifier for each of the two clusters.
We would then assign each player in the test set to the nearest centroid and apply the regressor and classifier that had been trained for that cluster. To make sure that the clusters corresponded to pocket-passers and mobile QBs, we used a separate feature extractor for clustering, which extracted only rush yards per attempt, rush yards, and draft year. Clusters based on these stats were therefore reflective of the grouping we wanted.
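The cluster-then-regress procedure can be sketched with sklearn as shown below. This is an illustration under stated assumptions rather than the project's code: the data is synthetic (random numbers, not real player stats), the column layout and variable names are hypothetical, and only the draft-round regressor is shown. SGDRegressor's default loss is squared error, matching the squared loss named in the text; the α and iteration count are the ones quoted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT real player stats): 60 QBs, 5 normalized features.
# Columns 0-2 stand in for rush yards per attempt, rush yards, and draft year.
n = 60
X = rng.random((n, 5))
X[: n // 2, :2] *= 0.3                          # low rushing values ("pocket passers")
X[n // 2 :, :2] = 0.7 + 0.3 * X[n // 2 :, :2]   # high rushing values ("mobile QBs")
y = rng.random(n)                               # normalized draft round in [0, 1]

order = rng.permutation(n)
X, y = X[order], y[order]
X_train, y_train, X_test = X[:48], y[:48], X[48:]

# Cluster on the rushing-related columns only, then fit one regressor per cluster
# using the full feature vector.
cluster_cols = [0, 1, 2]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
train_clusters = kmeans.fit_predict(X_train[:, cluster_cols])

regressors = {}
for c in range(2):
    in_cluster = train_clusters == c
    model = SGDRegressor(alpha=0.0001, max_iter=1000, random_state=0)
    regressors[c] = model.fit(X_train[in_cluster], y_train[in_cluster])

# Assign each test player to the nearest centroid and use that cluster's model.
test_clusters = kmeans.predict(X_test[:, cluster_cols])
predictions = np.empty(len(X_test))
for c in range(2):
    in_cluster = test_clusters == c
    if in_cluster.any():
        predictions[in_cluster] = regressors[c].predict(X_test[in_cluster])

print(predictions.round(2))
```

Clustering on a narrow feature subset while regressing on the full vector keeps the grouping tied to playing style, at the cost of giving each regressor fewer training examples.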

5 Error Analysis

Results for draft round

Algorithm                            Avg. Draft Round Error   Percent Correct
Baseline                                               2.65             11.8%
Oracle                                                 0.71             35.3%
Stochastic Gradient Descent                            2.03              8.8%
K-Means clustering (K=2) with SGD                      1.97             23.5%
Multi-Layered/Unit Neural Network                      1.47             35.3%

We got the best results using the neural network. It even matched the oracle's percentage of exactly correct rounds, but was off by more on its incorrect predictions. It is interesting to see that our intuition about clustering quarterbacks by rushing ability had an impact on our predictor's success: we predicted the round exactly 23.5% of the time when we clustered the players ahead of time, compared to 8.8% when we did not cluster.

Results for pro-bowl status

Algorithm                            Non Pro Bowl   Pro Bowl   Total Success
Baseline                                    62.1%       0.0%           52.9%
Oracle                                      79.3%     100.0%           82.4%
Stochastic Gradient Descent                 86.2%       0.0%           73.5%
K-Means clustering (K=2) with SGD           65.5%      20.0%           58.8%
Multi-Layered/Unit Neural Network           75.9%      20.0%           67.6%

For pro-bowl status, we had more trouble predicting the correct class for players who had made the pro-bowl. Our best result for this case was 20%. Our best overall results came from linear classification with SGD on our optimized features. Linear classification with SGD most likely performs better than the neural network because the neural network overfits to features that have little correlation with pro-bowl status. Predicting whether a player makes the pro-bowl appears to be a difficult task given the limitations of our stats. Many factors in a player's NFL success are unknown during their college days.
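The per-category percentages in the pro-bowl table can be computed as in the following sketch. The function name is hypothetical, and the example counts (29 non-pro-bowl players and 5 pro-bowl players) are an inference: the text does not state the test-set size, but these counts reproduce the baseline row's percentages.

```python
def per_class_accuracy(predicted, actual):
    """Accuracy among non-pro-bowl (0) players, pro-bowl (1) players, and overall."""
    result = {}
    for label, name in ((0, "non_pro_bowl"), (1, "pro_bowl")):
        pairs = [(p, a) for p, a in zip(predicted, actual) if a == label]
        # None signals that the class does not occur in this sample.
        result[name] = sum(p == a for p, a in pairs) / len(pairs) if pairs else None
    result["total"] = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
    return result

# Assumed counts (not stated in the text): 29 non-pro-bowl and 5 pro-bowl players.
actual = [0] * 29 + [1] * 5
# A classifier that gets 18 of the 29 negatives right and misses every positive.
predicted = [0] * 18 + [1] * 11 + [0] * 5

scores = per_class_accuracy(predicted, actual)
print({name: round(value, 3) for name, value in scores.items()})
# {'non_pro_bowl': 0.621, 'pro_bowl': 0.0, 'total': 0.529}
```

Because the classes are so unbalanced, the total is dominated by the non-pro-bowl category, so a model can score well overall while never identifying a pro-bowl player.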

The results show that both the neural net and k-means combined with the linear classifier using SGD predicted players to make the pro-bowl more aggressively. Essentially, we are predicting a player's career success with pro-bowl status, but when managers decide to draft a player in an earlier round, they are also predicting that player's success in the NFL. That we were more successful at predicting the draft round than at predicting pro-bowl status suggests that our model learned what general managers look for in QBs, but is largely ineffective at solving the difficult problem of predicting actual NFL career success.

6 Literature Review

We found a couple of projects with similar goals in mind. One project, created by Sean J. Taylor, attempted to calculate the probability that a player is drafted in the first round based on their NFL Combine data and previous draft data, using sparse regularized regression. Our project uses similar data, but attempts to predict the actual draft round rather than just the probability that a player goes in the first round. He was able to correctly predict 44% of the players that went in the first round of the 2015 NFL draft. This statistic is somewhat comparable to our 35.3% success rate for QBs through all rounds in a randomized training set over 1990 through 2013.

Another project, a thesis by Gary McKenzie, attempts to predict the performance of college players in the NFL using assisted learning, with the goal of making actual drafts more accurate. Our project, in comparison, focuses on predicting a player's round rather than fixing the system based on NFL potential.

7 Conclusion and Possible Improvements

Overall, we were able to predict the draft round for QBs with an average error of 1.47 and 35.3% success. Predicting pro-bowl status was more difficult, since there are many factors that we were unable to capture in our data.
The evolution of the quarterback position over time also posed some difficulties when trying to predict all the way back to 1990. We also faced the problem of limited data. Since only about 224 players are drafted each year, and only some of those are QBs, we had to work with limited training data.

We were also limited in the features we were able to scrape. We had hard statistics for each player's college career, but lacked information on intangible attributes. To improve on this project further, we could gather data such as scouting reports, which would be valuable in further classifying QBs. Using natural language processing, we could analyze each report's language and use the resulting features to improve this model. We could also have used some other measure of career success. For example, predicting the record of a starting QB in the NFL based on college stats and data could be very valuable to NFL scouts.

8 References

1. https://seanjtaylor.github.io/learning-the-draft/
2. https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/47027/research.pdf?sequence=2
3. http://www.sports-reference.com/cfb/players/
4. https://github.com/abresler/asb_datasets
5. http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
6. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html