Peterman’s Ineptitude: Web Scraping with Pandas Read_HTML

4 minute read

Published:

On December 19th, the Oakland Raiders shocked the football community by signing Nathan Peterman to their practice squad after a terrible year with the Buffalo Bills. Advanced stats aren’t really needed to describe how poorly Peterman played. A 54.3% completion percentage, a 30.7 quarterback rating, and a 1/7 TD/INT ratio did just fine demonstrating how bad he was.

I was curious nonetheless to see how Peterman’s 2018 QBR of 6.5 ranked since 2006. Pro Football Reference has the stat by year for Peterman but no easy way to sort historical QBR in ascending order. ESPN’s website does allow for sorting but the limit for qualifying is too high for Peterman’s 106 Total QB Plays.

My goal was to combine the qualified and unqualified leaderboards in order to determine who had the worst QBR with at least 100 Total QB Plays. It’s an arbitrary cutoff, but it’s a nice round number and it will get players who played close to as much as Peterman without including RB and WR.

In order to complete the analysis, we’ll need to a) combine the qualified and unqualified leaderboards, b) filter for at least 100 plays, and c) sort by Total QBR in ascending order.

To do so, we’ll need to grab every page for both the qualified and unqualified leaderboards. There are 8 pages on the qualified leaderboard and 29 pages on the unqualified leaderboards. We will set up two for loops to loop through 1 through 8, and 1 through 29. Because python uses 0-based indexing, we’ll use the python range function to create an iterator that starts at 1 and ends at 8.

For each n in our iterator we will use the pandas read_html function to return all HTML formatted tables on each page. As there is only one, we will grab the first element [0] of that list and append to our dataframes (all_unq_qbr: unqualified, all_qual_qbr: qualified).

import pandas as pd

all_unq_qbr = pd.DataFrame()

for n in range(1,30):
    df = pd.read_html(f'http://www.espn.com/nfl/qbr/_/type/alltime-season/page/{n}/order/true/qualified/false')[0]
    all_unq_qbr = all_unq_qbr.append(df)
all_qual_qbr = pd.DataFrame()

for n in range(1,9):
    df = pd.read_html(f'http://www.espn.com/nfl/qbr/_/type/alltime-season/page/{n}/order/true/qualified/true')[0]
    all_qual_qbr = all_qual_qbr.append(df)

Next we’ll concatenate the two dataframes together and rename the columns.

all_qbr = pd.concat([all_unq_qbr, all_qual_qbr]).reset_index(drop=True)
all_qbr.columns = ['RK',
  'PLAYER',
  'YEAR',
  'PTS ADDED',
  'PASS',
  'RUN',
  'PENALTY',
  'TOTAL EPA',
  'QB_PLAYS',
  'RAW QBR',
  'TOTAL QBR']

We also need to remove the sub headers that are included in the tables by filtering to remove rows that include “YEAR” under the “YEAR” column.

all_qbr = all_qbr.query('YEAR!="YEAR"').reset_index(drop=True)

Because of the sub headers, all columns are formatted as strings. In order to fix this, we can use the astype method to reformat QB_PLAYS as an integer and TOTAL_QBR as a float.

all_qbr['QB_PLAYS'] = all_qbr['QB_PLAYS'].astype(float)
all_qbr['TOTAL QBR'] = all_qbr['TOTAL QBR'].astype(float)


Now we can answer the question!

By using method chaining we will

  1. Filter QB_PLAYS to only include players with greater than or equal to 100 QB Plays
  2. Drop players that are duplicated
  3. Sort values by Total QBR in ascending order
  4. Select the columns Player, Year, QB_Plays, and Total_QBR
  5. Reset the index
  6. Show the top 10
(all_qbr
.query("QB_PLAYS>=100")
.drop_duplicates(subset=['PLAYER','YEAR'])
[['PLAYER','YEAR','QB_PLAYS','TOTAL QBR']]
.sort_values('TOTAL QBR', ascending=True)
.reset_index(drop=True).head(10))

Per ESPN’s Total QBR, Nathan Peterman had the second-worst QB season since 2006 for all players with at least 100 QB Plays. Only Blaine Gabbert’s 2013 season ranks worse.

PLAYERYEARQB_PLAYSTOTAL_QBR
0Blaine Gabbert, JAX2013110.04.0
1Nathan Peterman, BUF/OAK2018106.08.7
2Ryan Lindley, ARI2012196.010.2
3Tarvaris Jackson, MIN2006110.011.5
4John Beck, MIA2007128.013.3
5Jimmy Clausen, CAR2010378.013.8
6Bryce Petty, NYJ2016161.015.9
7Kerry Collins, IND2011109.016.8
8Keith Null, LAR2009148.016.9
9Alex Smith, SF2007234.018.2

If we relax the QB plays filter to 75 plays, Peterman ranks third since 2006, with Byron Leftwich’s 2007 season with Atlanta at 6.6 slightly worse.

PLAYERYEARQB_PLAYSTOTAL_QBR
0Blaine Gabbert, JAX20131104.0
1Byron Leftwich, ATL2007756.6
2Nathan Peterman, BUF/OAK20181068.7
3Max Hall, ARI2010968.8
4Ryan Lindley, ARI201219610.2
5Tarvaris Jackson, MIN200611011.5
6John Beck, MIA200712813.3
7Jimmy Clausen, CAR201037813.8
8Bryce Petty, NYJ201616115.9
9Kerry Collins, IND201110916.8