Which states are over-represented in the NFL?

Southern states (and Iowa) are football factories

February 2, 2024

When it comes to state rankings of education, health, poverty, and many other measures, southern states inevitably end up on the wrong end of the list. My own home state of Mississippi is so good at being bad that other Southerners coined the phrase “Thank God for Mississippi.”

But no matter how bad our education, health, or income, there is one area where Southerners have a solid claim to excellence: the football field. I decided to check whether the South’s love of football in fact translates into an increased chance of professional success for its players. Using data from sports-reference.com, I calculated the total number of active NFL players per million residents from each state.

Figure 1. NFL players per capita.

The numbers don’t lie. The “Deep South” states of Louisiana, Mississippi, Alabama, and Georgia are football factories, claiming the top four spots in producing NFL talent on a per-capita basis. Louisiana slightly edges out my home state for the top spot but, honestly, it’s nice to not be on the bottom for once. Outside the South, Iowa is the only state with more than 10 players per million residents (something in that corn, I guess). The Northeast is particularly weak at producing NFL talent.

In the chart below, I rotate the axes by 45 degrees to emphasize the trend line. States falling below the line are home to fewer NFL players per capita than the national average (around 5 per million). States above the line are home to more than the average. Georgia has four times as many NFL players as New York, despite having half the population.

Figure 2. NFL players vs population
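
The comparison against the average rate can be sketched in a few lines of R. This is only an illustration of the arithmetic: the state names and counts below are hypothetical placeholders, not figures from the scraped data, and the column names are my own assumptions.

```r
# Hypothetical placeholder values, NOT the real counts behind the figures
toy <- data.frame(
  state      = c("State A", "State B", "State C"),
  players    = c(40, 10, 50),
  population = c(4e6, 5e6, 11e6)
)

# Players per million residents for each state
toy$per_million <- toy$players / (toy$population / 1e6)

# Rate for the toy data as a whole (total players over total population)
overall_rate <- sum(toy$players) / (sum(toy$population) / 1e6)

# A state sits above the line when its rate beats the overall rate
toy$above_line <- toy$per_million > overall_rate

print(toy)
```

Dividing total players by total population, rather than averaging the state rates, keeps small states from skewing the benchmark.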

The Process: Scraping tables

The biggest challenge with this project was accessing the player data. Sports-reference.com provides tables of career stats for all NFL players that include home state and birthplace. Here’s a quick peek at the table.

rank player pos city start end all_pro_yrs pro_bowl_yrs starter_yrs w_av games passed_cmp passes_att passing_yds passing_tds longest_pass int sacked sacked_lost_yds rush_att rush_yds rush_tds longest_rush rec rec_yds rec_tds longest_rec state
1 Untz Brewer HB Washington 1922 1922 0 0 0 0 8 NA NA NA 0 NA NA NA NA 0 0 1 0 NA NA 0 NA DC
2 Perry Dowrick FB Washington 1921 1921 0 0 0 0 2 NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA DC
3 Patsy Gerardi E Washington 1921 1921 0 0 0 0 1 NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA DC
4 Sam Kaplan E Washington 1921 1921 0 0 0 0 1 NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA DC
5 Cy McDonald G Washington 1921 1921 0 0 1 0 3 NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA DC
1 Yannick Ngakoue DE Washington 2016 2023 0 1 7 46 123 NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA DC

The excellent sports-reference website provides a download option for accessing the tables. Unfortunately, the tables are organized by state, so downloading them manually would mean navigating to the page for each of the 50 states and downloading each table in turn. This is where the rvest package for web scraping comes in handy. The rvest website provides an intro to scraping, so I’ll just go over the basic process of pulling a table from a web page.

Whenever I begin a web scraping project I start out by exploring the structure of the website using Google Chrome’s Developer Tools. Once you open the Developer Tools window, you can hover over each element in the code and the element will highlight in the browser. In the sports-reference page, we can see that the player data table is a “table node.”

Now that we know what we’re looking for, we can jump into R.

library(rvest)
library(dplyr)  # provides select()

# Build the URL
# I'm separating the root URL from the state code to set up a for-loop later
url <- paste0("https://www.pro-football-reference.com/friv/birthplaces.cgi?country=USA&state=", "AL")

# The read_html function loads the web page into R
html <- read_html(url)

# Grab the table node with html_node
node <- html_node(html, "table")

# The very useful html_table function automatically formats the output as a data frame
table <- html_table(node, header = TRUE)

# Append the state code to the table
table$State <- "AL"

# Select only the biographical data we need
# The column "To" indicates the most recent year the player was active
table <- select(table,
                Player,
                City,
                State,
                To)

We can see the first few rows of the output below.

Player City State To
Julio Jones Foley AL 2023
Carson Tinker Decatur AL 2023
Jordan Matthews Madison AL 2023
Nick Williams Birmingham AL 2023
C.J. Mosley Theodore AL 2023
Za'Darius Smith Montgomery AL 2023
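
To try the same steps without hitting the site, here is a minimal offline sketch. It assumes rvest 1.0 or later (for minimal_html), and the table contents are made-up stand-ins rather than real player records.

```r
library(rvest)

# A tiny hand-written page standing in for the real one (made-up rows)
page <- minimal_html('
  <table>
    <tr><th>Player</th><th>City</th><th>To</th></tr>
    <tr><td>Player One</td><td>Town A</td><td>2023</td></tr>
    <tr><td>Player Two</td><td>Town B</td><td>2021</td></tr>
  </table>
')

# Same steps as above: find the first table node, then convert it
node <- html_node(page, "table")
toy_table <- html_table(node, header = TRUE)

# Tag every row with a state code, as in the Alabama example
toy_table$State <- "AL"

print(toy_table)
```

Working against a small inline page like this is a handy way to check selectors before sending any requests.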

Now that we have the process for pulling data for one state, we can simply refactor our code into a for-loop and run through all 50 states. But there is one other issue we need to resolve first.

The table for Alabama returned 200 players. In reality though, Alabama has had 788 players in the NFL. It turns out that the site limits the table rows to 200 per page. We’ll need to incorporate this information into our for-loop.
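
The paging arithmetic can be sketched on its own before writing the loop. This assumes the 200-row page size and the 788-player Alabama total mentioned above; build_url is a hypothetical helper name of my own, not something from rvest or the site.

```r
page_size <- 200      # rows the site returns per page
total_players <- 788  # Alabama's total, from the text above

# Number of pages needed to cover every player
n_pages <- ceiling(total_players / page_size)

# Offset of each page: 0, 200, 400, ...
offsets <- seq(0, by = page_size, length.out = n_pages)

# Hypothetical helper that builds the paged URL for one state
build_url <- function(state, offset) {
  paste0("https://www.pro-football-reference.com/friv/birthplaces.cgi?country=USA&state=",
         state, "&offset=", offset)
}

urls <- build_url("AL", offsets)
print(urls)
```

Since the totals differ by state and are not known in advance, a fixed range of offsets large enough for the biggest state is a simple alternative to computing the page count per state.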

Here is the final nested for-loop. For each state, we request the first page of the table, then step through the remaining pages by appending “&offset=” plus the offset value to the URL, increasing the offset by 200 each time. Once we have grabbed all of the tables for a state, we move on to the next state in the list.

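library(rvest)
library(readr)  # provides read_csv()

# List of states and the two-letter codes used in the website URL
state_table <- read_csv("./state_iso2_codes.csv")
states <- state_table$code

# The website paginates after each 200 players
offsets <- seq(0, 2600, 200)

# Scrape every page for every state and stack the results into one table
get_cfb <- function(states, offsets) {
  # Start with an empty result and grow it one page at a time
  all_players <- NULL
  for (i in states) {
    for (x in offsets) {
      url <- paste0("https://www.pro-football-reference.com/friv/birthplaces.cgi?country=USA&state=", i, "&offset=", x)
      html <- read_html(url)
      node <- html_node(html, "table")
      # Assumption: an offset past the last page has no table, so move on to the next state
      if (inherits(node, "xml_missing")) break
      table <- html_table(node, header = TRUE)

      # Add the state code to the table
      table$state <- i

      # Append this page to the running table
      all_players <- rbind(all_players, table)
    }
  }
  all_players
}

all_players <- get_cfb(states, offsets)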

The loop above worked pretty well, but be aware that the sports-reference site will throttle bulk requests. I ended up splitting my calls into batches of three states at a time. For anyone who wants to replicate this process, I recently learned about the polite package, which will manage this for you.
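
The batching idea can be sketched in base R. This is one possible approach, not necessarily how I split the calls at the time: make_batches and run_in_batches are hypothetical helper names, the pause length is an assumption to tune yourself, and the fetch function here is a stand-in so the sketch runs offline.

```r
# Split a vector of state codes into batches of a given size
make_batches <- function(states, size = 3) {
  split(states, ceiling(seq_along(states) / size))
}

# Run `fetch` on each batch, pausing between batches to ease the load on the site
run_in_batches <- function(states, fetch, size = 3, pause = 0) {
  batches <- make_batches(states, size)
  results <- vector("list", length(batches))
  for (b in seq_along(batches)) {
    results[[b]] <- fetch(batches[[b]])
    # No need to wait after the final batch
    if (b < length(batches)) Sys.sleep(pause)
  }
  results
}

# Stand-in fetch function so the sketch runs offline; swap in a real scraper
demo <- run_in_batches(
  c("AL", "GA", "LA", "MS", "IA"),
  fetch = function(s) paste(s, collapse = "+")
)
print(demo)
```

For real scraping, pass a function that downloads the tables for a batch of states and set pause to a delay in seconds that the site tolerates.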
