Credit to Steven Morse (@thestevemo, or his blog at https://stmorse.github.io/index.html), whose code I used as the basis for my ProFootballReference scraping scripts. He is a good follow for math and analytics work as well.
The full code and supporting files for scraping team and player data from ProFootballReference are located in my GitHub repo here: https://github.com/greghartpa/scrape_pfr_data
Most of the analysis on this site uses data from a few sources, including ProFootballReference as a primary source for team, player, and draft data. Here I am sharing how to automate scraping the team data using Python.
Beyond being a team record and DVOA dataset, this output will also be used when scraping player data, which I will explain in a future post.
Setup and Dependencies
There are only a few package dependencies: pandas for storing and manipulating the data, requests for fetching the pages, BeautifulSoup for parsing the HTML, and NumPy for some filtering cleanup.
A range of yearly data can be pulled by setting the start and end years. I also pull in team DVOA data, which is stored in teams-dvoa.csv and was pulled manually (some day I will script this out, but it is a small dataset that changes infrequently). If DVOA is not needed, the two lines that read it in and later merge it into the final dataframe can be removed.
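As a rough sketch of this setup (not the exact code from the repo), the year range and the optional DVOA merge could look like the following. The column names Year, Tm, and DVOA, the placeholder team names, and the numbers are all assumptions for illustration. An in-memory string stands in for teams-dvoa.csv so the example runs on its own.

```python
import io

import pandas as pd

# Inclusive range of seasons to scrape; these years are only examples.
start_year, end_year = 2018, 2019
years = list(range(start_year, end_year + 1))

# Stand-in for teams-dvoa.csv. The real file's layout is not shown in this
# post, so these column names and values are made up for the example.
dvoa_csv = io.StringIO(
    "Year,Tm,DVOA\n"
    "2018,Team A,0.10\n"
    "2018,Team B,-0.05\n"
    "2019,Team A,0.20\n"
)
dvoa = pd.read_csv(dvoa_csv)

# Placeholder for the scraped team data built later in the script.
teams = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019],
    "Tm": ["Team A", "Team B", "Team A", "Team B"],
    "W": [10, 6, 11, 5],
})

# A left merge keeps every scraped team even when no DVOA row exists for it.
merged = teams.merge(dvoa, on=["Year", "Tm"], how="left")
print(merged)
```

With a left merge, a team-season missing from the DVOA file simply gets NaN in the DVOA column rather than being dropped.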
Scraping Team Data
The main section of code to scrape PFR team data loops through the range of years, makes an HTTP request for each year, and parses the resulting table data from PFR. A few points here:
- The URL is manually built and then pulled using the requests library.
- BeautifulSoup is used to parse the returned HTML and keys off of the class “sortable stats_table”. PFR breaks NFC and AFC team data into two tables which need to be separately pulled and stored.
- There is some data cleanup I perform. For example, PFR appends indicators to the team name to represent division winners (“*”) and wildcard teams (“+”). I move these into a separate column so I have clean team names for later matching and also to make analysis on playoff teams easier. PFR does not mark Super Bowl winners in these tables (which is odd to me), so I deal with that later.
- I have to check for ties: if a season happened to have no ties, the tie column doesn’t exist, which causes an issue later. So, if there are no ties, I just add a tie column “T” and populate it with zeros.
- I also add a column “Scraped” and populate with zeros – this will be used when I scrape player data to control which teams I need to or want to scrape.
- And lastly in this section, I store the team roster URL which will be used for the player data scraping so I don’t have to re-scrape the URL.
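The points above can be sketched roughly as follows. This is a simplified illustration rather than the code from the repo: SAMPLE_HTML is a made-up stand-in for one season page (placeholder teams, links, and numbers), parse_season is a hypothetical helper name, and the Playoffs and URL column names are my assumptions. The sample stores the team link as found in the table; how the roster URL is built from it is not shown here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# In a live run the HTML would come from something like
# requests.get(f"https://www.pro-football-reference.com/years/{year}/").text
# Here a small made-up page is used so the example runs offline.
SAMPLE_HTML = """
<table class="sortable stats_table" id="AFC">
  <thead><tr><th>Tm</th><th>W</th><th>L</th><th>PF</th><th>PA</th></tr></thead>
  <tbody>
    <tr><td colspan="5">AFC East</td></tr>
    <tr><th><a href="/teams/aaa/2019.htm">Team A</a>*</th><td>12</td><td>4</td><td>420</td><td>225</td></tr>
    <tr><th><a href="/teams/bbb/2019.htm">Team B</a>+</th><td>10</td><td>6</td><td>314</td><td>259</td></tr>
  </tbody>
</table>
<table class="sortable stats_table" id="NFC">
  <thead><tr><th>Tm</th><th>W</th><th>L</th><th>PF</th><th>PA</th></tr></thead>
  <tbody>
    <tr><td colspan="5">NFC East</td></tr>
    <tr><th><a href="/teams/ccc/2019.htm">Team C</a></th><td>7</td><td>9</td><td>300</td><td>330</td></tr>
  </tbody>
</table>
"""


def parse_season(html, year):
    """Parse both conference tables of one season page into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # One table per conference; both share the same class attribute.
    for table in soup.find_all("table", class_="sortable stats_table"):
        headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
        for tr in table.find("tbody").find_all("tr"):
            cells = tr.find_all(["th", "td"])
            if len(cells) != len(headers):
                continue  # skip division header rows
            row = dict(zip(headers, (c.get_text(strip=True) for c in cells)))
            raw_name = row["Tm"]
            row["Tm"] = raw_name.rstrip("*+")            # clean team name
            row["Playoffs"] = raw_name[len(row["Tm"]):]  # "*", "+", or ""
            link = cells[0].find("a")
            row["URL"] = link["href"] if link else None
            row["Year"] = year
            rows.append(row)
    df = pd.DataFrame(rows)
    if "T" not in df.columns:
        df["T"] = 0    # season had no ties, so the column is missing
    df["Scraped"] = 0  # flag used later when scraping player data
    return df


teams_2019 = parse_season(SAMPLE_HTML, 2019)
print(teams_2019[["Year", "Tm", "Playoffs", "W", "L", "T", "Scraped", "URL"]])
```

Skipping rows whose cell count does not match the header is one simple way to drop the division label rows; the real page may need a different check.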
Clean Up Data and Export
After the main loop to grab all years, I clean up the data with the following:
- Convert year, record, points for and against, and strength of schedule columns to numeric
- Calculate a win percentage
- I manually create a team abbreviation dataframe named tmabrevdf and merge it with the main output dataframe to normalize three-letter team abbreviations. This is done so I can handle team moves (the various moves of the Rams, Raiders, and Chargers) or team name changes (Washington Redskins to Washington Football Team).
- And lastly, I manually create a list of Super Bowl winners since PFR does not indicate that in the team tables. That information is stored in the page header, which I may go back and pull from in a future edit, but for now this data, like DVOA, is a small dataset and only changes annually.
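A minimal sketch of these cleanup steps, using made-up rows with placeholder teams, might look like this. Several details are assumptions on my part: the Abbr, WinPct, and SBWinner column names, the layout of tmabrevdf, the sb_winners dictionary, and counting a tie as half a win (the post does not say how ties enter the win percentage).

```python
import pandas as pd

# Made-up scraped rows; scraped values arrive as strings.
df = pd.DataFrame({
    "Year": ["2019", "2019", "2019"],
    "Tm": ["Team A", "Team B", "Team C"],
    "W": ["12", "10", "7"],
    "L": ["4", "5", "9"],
    "T": ["0", "1", "0"],
    "PF": ["420", "314", "300"],
    "PA": ["225", "259", "330"],
    "SoS": ["-1.2", "0.4", "1.1"],
})

# Convert year, record, points, and strength of schedule to numeric.
numeric_cols = ["Year", "W", "L", "T", "PF", "PA", "SoS"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Win percentage, counting a tie as half a win (an assumption).
df["WinPct"] = (df["W"] + 0.5 * df["T"]) / (df["W"] + df["L"] + df["T"])

# Hand-built abbreviation table: one row per historical team name, so a
# renamed or relocated franchise maps to a single three-letter code.
tmabrevdf = pd.DataFrame({
    "Tm": ["Team A", "Team B", "Team B Renamed", "Team C"],
    "Abbr": ["TMA", "TMB", "TMB", "TMC"],
})
df = df.merge(tmabrevdf, on="Tm", how="left")

# Hand-maintained Super Bowl winners keyed by season (placeholder entry).
sb_winners = {2019: "Team A"}
df["SBWinner"] = (df["Year"].map(sb_winners) == df["Tm"]).astype(int)

print(df[["Year", "Tm", "Abbr", "W", "L", "T", "WinPct", "SBWinner"]])
```

Passing errors="coerce" turns any unparseable cell into NaN instead of raising, which makes bad rows easy to spot afterwards.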