Due to the Big Data movement of recent years, I’ve been exploring statistical analysis and machine learning through books, online courses, and tutorials. One book I found intriguing is “Exploring Everyday Things with R and Ruby” by Sau Sheong Chang. Sau Sheong’s book inspired me to apply its teachings to something I enjoy every Sunday morning: the NFL.
In his experiments, Sau Sheong uses Ruby to simulate, extract, or transform data into an input format that R can analyze. For those unfamiliar with R, it is a widely used programming language and software environment for statistical computing. In my experiment, following the paradigm described in the book, I analyzed various NFL statistics from Yahoo Sports to determine how the mix of running plays versus passing plays affects the outcome of the game for a given team.
I decided to focus on just four teams: the San Francisco 49ers, New Orleans Saints, Detroit Lions, and Minnesota Vikings. San Francisco and New Orleans have powerful all-around offenses, with San Francisco having the edge in the running game and New Orleans having the edge in the passing game. Detroit’s passing offense ranks highly, but their running game is lacking. Minnesota’s offense is the opposite: their running offense is arguably the best in the league, but their passing offense is horrible. We will see how the resulting analyses differ for each of these teams.
The NFL section of Yahoo Sports has a unique boxscore URL for each weekly game that shows the details of the plays and the progression of the score. For each of the four teams, I gathered the URLs and put them in a per-team text file. I then wrote a Ruby script that scrapes the URLs and outputs two types of CSV files. Each row in the first CSV file contains the plays of a drive during the game (plus more information for Part 2 of this blog). Each row in the second CSV file contains the scoring progression throughout the game. We only care about the final score for this post, so only the last row will be used.
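To make the output stage concrete, here is a hedged sketch of how the script might write the two CSV files. The column names and the sample records are assumptions for illustration only; the real script derives its data from the Yahoo Sports boxscore pages.

```ruby
require "csv"

# Hypothetical play records: one per play, grouped by drive.
plays = [
  { game: 1, drive: 1, play_type: "run",  yards: 7  },
  { game: 1, drive: 1, play_type: "pass", yards: 12 },
]

# Hypothetical scoring progression; the last row holds the final score.
scores = [
  { quarter: 1, team_score: 7,  opp_score: 0  },
  { quarter: 4, team_score: 24, opp_score: 17 },
]

# First CSV: the plays of each drive.
CSV.open("plays.csv", "w") do |csv|
  csv << %w[game drive play_type yards]
  plays.each { |p| csv << [p[:game], p[:drive], p[:play_type], p[:yards]] }
end

# Second CSV: the scoring progression throughout the game.
CSV.open("scores.csv", "w") do |csv|
  csv << %w[quarter team_score opp_score]
  scores.each { |s| csv << [s[:quarter], s[:team_score], s[:opp_score]] }
end
```

With this layout, reading `scores.csv` and taking the last row yields the final score, which is all the first part of the analysis needs.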
With the collection of CSV files in hand, I had to decide how to use them in R. My end goal was to determine the correlation between a team’s play choices and the game’s outcome. My initial thought was to calculate the correlation between the binary outcome of a win or loss and the number of running plays per game, but it seemed wrong to compare a classification value to a numeric one. After much pondering, it made more sense to take the correlation between the win margin (team score minus opponent score) and the delta between the number of running plays and passing plays (running plays minus passing plays) in a game. In R, reading a CSV input file creates a matrix-like object called a dataframe to represent the data. Once I had the dataframe objects at my disposal, it was easy to perform matrix operations to massage the data and calculate the correlation between the win margin and the run/pass delta. This blog isn’t a tutorial on R, so if you’re interested, take a look at the R code via the link at the end of the blog.
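The computation itself is small enough to sketch. The actual analysis uses R’s built-in correlation on dataframes, but the same idea in Ruby looks like this; the per-game numbers below are made up purely for illustration.

```ruby
# Pearson correlation coefficient of two equal-length numeric arrays.
def pearson(xs, ys)
  n   = xs.length.to_f
  mx  = xs.sum / n
  my  = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# Hypothetical per-game records: [team score, opponent score, runs, passes].
games = [
  [24, 17, 32, 28],
  [13, 27, 18, 41],
  [31, 20, 35, 25],
  [17, 23, 22, 36],
]

margins = games.map { |ts, os, _, _| ts - os } # win margin
deltas  = games.map { |_, _, r, p|  r - p }    # run/pass delta
puts pearson(deltas, margins)
```

A value near +1 would mean that leaning on the run tracks closely with winning big; a value near 0 would mean the run/pass balance tells us little about the final margin.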
Minnesota, the strong-running, weak-passing team, had a strong correlation. Detroit, the weak-running, strong-passing team, had a weaker correlation. San Francisco and New Orleans fell somewhere in the middle. The results matched intuition: the stronger a team’s running game is relative to its passing game, the more its use of the run influences the final outcome.
In Part 2 of this blog, I’ll try to dig deeper and hopefully discover other trends in the data.