Data Wrangling With Fantasy Football and Pandas

Posted on Wed 21 March 2018 in Data-Science • 14 min read

Data Wrangling with Pandas

A couple of simple examples using some of the powerful built in pandas functions. The data is taken from a league.

In [1]:
import urllib
from urllib.request import urlopen
import json
import pandas as pd
import numpy as np
from time import sleep

Data Retrieval

Retrieve data from a GraphQL endpoint. There's no authentication/token required to query the endpoint, so we just need to know the correct query, which in this case is encoded as a single string.

Data is stored as a list of dicts, which a Pandas dataframe is created from.

In [2]:
while gw <= latest_gw:
    #Let's be a friendly scraper and throw a sleep in.  3 seconds should do the trick.
        d={"operationName":"getHeadToHeadMatches","variables":{"leagueId":"ABWkR4hBw4GWmSp8q","gameweek":gw},"query":"query getHeadToHeadMatches($leagueId: String!, $gameweek: Int!) {\n  headToHeadMatches(leagueId: $leagueId, gameweek: $gameweek) {\n    _id\n    gameweek\n    leagueId\n    team1 {\n      _id\n      name\n      liveGameweekPoints(gameweek: $gameweek)\n      user {\n        profile {\n          firstName\n          lastName\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    team2 {\n      _id\n      name\n      liveGameweekPoints(gameweek: $gameweek)\n      user {\n        profile {\n          firstName\n          lastName\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n"}
        hdr = {'content-type': 'application/json'}
        req = urllib.request.Request(base_url,headers=hdr, data=bytes(json.dumps(d), encoding="utf-8"))
        site_data = urlopen(req)
        data = json.load(site_data)
    except Exception as e:
        print("error retrieving : {}".format(base_url))
In [3]:
df = pd.DataFrame(gw_data)
In [4]:
__typename _id gameweek leagueId team1 team2
0 HeadToHeadMatch 2siym5Z2sFKN3a7hi 1 ABWkR4hBw4GWmSp8q {'_id': 'wbed7EjRC9T3n6x9N', 'name': 'Carpet M... None
1 HeadToHeadMatch jSPLCJCGJZ6A4BbFY 1 ABWkR4hBw4GWmSp8q {'_id': 'Y9s5ai5jTk2azRDav', 'name': 'Draft Ex... {'_id': '6pn7hitEDGtRQmr24', 'name': 'Ben Fost...
2 HeadToHeadMatch oKGQCETdijgnspKhz 1 ABWkR4hBw4GWmSp8q {'_id': '6pdanoFD84n4zKLXb', 'name': 'The Wrec... {'_id': 'rgF7P3CeNZP7cRt4s', 'name': 'On This ...
3 HeadToHeadMatch WrsPi6Nvoi5AYNMik 1 ABWkR4hBw4GWmSp8q {'_id': '7opskg8Tc6SfsXAnf', 'name': 'Dr Wanke... {'_id': 'BBXHYEt6GPrZsNLbj', 'name': 'RA Galax...
4 HeadToHeadMatch KYpEWb9SL2aYqyCdA 2 ABWkR4hBw4GWmSp8q {'_id': 'wbed7EjRC9T3n6x9N', 'name': 'Carpet M... {'_id': '6pn7hitEDGtRQmr24', 'name': 'Ben Fost...

Data Wrangling

The data is created from a hierarchical JSON format which is reflected in the data frame. Team information is stored within a dict within the data frame. The first thing we need to do is expand the dict and flatten the data.

Our observations are based around each team's gameweek, so that's what each row should reflect.

Metling data

We want a single team colum. The melt makes our dataframe 'less wide' i.e. unpivots our team1/team2 column and preserves the remaining columns.

We can then drop a bunch of columns that we're not interested in and expand the dict in the resuling 'value' column to the values that we're interested in

In [5]:
df['match_id'] = df.index
df_1 = pd.melt(df, id_vars=["__typename", "_id", "gameweek","leagueId","match_id"], var_name="team")
In [6]:
df_2 = pd.concat([df_1.drop(['value','leagueId','team','__typename'], axis=1), df_1['value'].apply(pd.Series)], axis=1)
_id gameweek match_id __typename _id liveGameweekPoints name user
0 2siym5Z2sFKN3a7hi 1 0 Team wbed7EjRC9T3n6x9N 48.0 Carpet Munchers {'profile': {'firstName': 'Mike', 'lastName': ...
1 jSPLCJCGJZ6A4BbFY 1 1 Team Y9s5ai5jTk2azRDav 41.0 Draft Excluders {'profile': {'firstName': 'David', 'lastName':...
2 oKGQCETdijgnspKhz 1 2 Team 6pdanoFD84n4zKLXb 55.0 The Wrecking Balls {'profile': {'firstName': 'Tim', 'lastName': '...
3 WrsPi6Nvoi5AYNMik 1 3 Team 7opskg8Tc6SfsXAnf 38.0 Dr Wankenstein {'profile': {'firstName': 'James', 'lastName':...
4 KYpEWb9SL2aYqyCdA 2 4 Team wbed7EjRC9T3n6x9N 43.0 Carpet Munchers {'profile': {'firstName': 'Mike', 'lastName': ...
5 aymJHZcdCvktJ7bF2 2 5 NaN NaN NaN NaN NaN
6 dbwNpp4DcsfMGu6cJ 2 6 Team Y9s5ai5jTk2azRDav 40.0 Draft Excluders {'profile': {'firstName': 'David', 'lastName':...
7 RXeZMfSu9WHY55hKW 2 7 Team 6pdanoFD84n4zKLXb 43.0 The Wrecking Balls {'profile': {'firstName': 'Tim', 'lastName': '...

Handling Missing Average Data

The data we retrieved from the site is missing team information for the 'Average' team. These need to be calculated and filled in.

The name is trivial, we can just use the built in fillna function on the name column.

Calculating the average score is slightly less obvious, but is a good opportunity to use the transform function. We call transform on our groupby object which which returns a NDFrame. The method we call replaces the NaN in each group with the mean of the other values.

We want an int for our scope so apply np.round to the resulting score.

In [7]:
df_3 = df_2.drop(["_id","__typename","user"], axis=1).sort_values(by='match_id').reset_index(drop=True)
df_3[['name']] = df_3[['name']].fillna(value='Average')
In [8]:
df_3["liveGameweekPoints"] = df_3.groupby("gameweek")["liveGameweekPoints"].transform(lambda x: x.fillna((x.mean())))
df_3["liveGameweekPoints"] = df_3["liveGameweekPoints"].apply(np.round)
In [9]:
gameweek match_id liveGameweekPoints name
0 1 0 48.0 Carpet Munchers
1 1 0 45.0 Average
2 1 1 40.0 Ben Foster on Draft
3 1 1 41.0 Draft Excluders
4 1 2 55.0 The Wrecking Balls

Calculating Opponent Score and Resulting Points

We duplicate the gameweek points column and group by our "match_id" (i.e. the duplicate of the initial dataframe index). We then want to reverse each grouped object and return a dataframe - this is what the apply and reset_index functions do.

We then apply the logic to calculate score i.e. 3 points for a win, 1 for a draw, 0 for a loss.

The rolling score for each week as calculated using the cumsum function.

In [10]:
df_4 = df_3
df_4['opponent_score'] = df_4['liveGameweekPoints']
df_4['opponent_name'] = df_4['name']
df_4['opponent_score'] = df_4.groupby("match_id")['opponent_score'].apply(lambda x: x.iloc[::-1]).reset_index(drop=True)
df_4a['opponent_name'] = df_4.groupby("match_id")['opponent_name'].apply(lambda x: x.iloc[::-1]).reset_index(drop=True)
In [11]:
gameweek match_id liveGameweekPoints name opponent_score opponent_name
0 1 0 48.0 Carpet Munchers 45.0 Average
1 1 0 45.0 Average 48.0 Carpet Munchers
2 1 1 40.0 Ben Foster on Draft 41.0 Draft Excluders
3 1 1 41.0 Draft Excluders 40.0 Ben Foster on Draft
4 1 2 55.0 The Wrecking Balls 50.0 On This Sunday
In [12]:
df_4['score'] = df_4.apply(lambda x: 3 if x['liveGameweekPoints'] > x['opponent_score'] else 1 if x['liveGameweekPoints'] == x['opponent_score'] else 0, axis=1)
In [13]:

Flattening and Graphing the Data

Group by manager name and gameweek. Expand the dataframe using the unstack function.

Finally plot the data into a graph.

In [14]:
# use unstack()
df_6_actual = df_6[df_6['gameweek']<latest_gw]
df_6_projected = df_6[df_6['gameweek']>=latest_gw]
df_7_actual = df_6_actual.groupby(['gameweek','name']).first()['rolling_score'].unstack()
df_7_projected = df_6_projected.groupby(['gameweek','name']).first()['rolling_score'].unstack()
In [15]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import animation
# plot data
fig, ax = plt.subplots(figsize=(15,7))
<matplotlib.axes._subplots.AxesSubplot at 0x7f1bd334d320>