## The iterrows Experience

There are a number of ways to iterate over the rows of a Pandas data frame, and without doubt the worst is the iterrows method.

I didn’t know this until recently, but what happens is that every row is “boxed” into a Series, and the performance is abysmal. A colleague pointed me to itertuples, which is basically a version of iterrows that isn’t rubbish – i.e. a version where each row in the data frame is returned as a…well…tuple (a namedtuple, to be precise).
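To see the boxing for yourself, here’s a minimal sketch on a hypothetical two-column frame (not the power consumption data): iterrows hands back a full Series per row, and because a Series has a single dtype, the int column gets upcast to float along the way.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes: one int column, one float column.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# iterrows boxes each row into a Series; the Series has one dtype,
# so the int value in column 'a' is upcast to float.
idx, row = next(df.iterrows())
print(type(row))    # <class 'pandas.core.series.Series'>
print(row["a"])     # 1.0 -- the int was upcast

# itertuples returns a lightweight namedtuple instead, and each value
# keeps its original column dtype.
tup = next(df.itertuples())
print(tup.a)        # 1
```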

A downside is positional indexing: data columns are numbered starting from 1 because the index sits at position zero (duh). That’s fine if you write clean code but bad if you don’t know about the evils of magic numbers. (Strictly speaking, itertuples returns namedtuples, so columns whose names are valid Python identifiers can also be accessed as attributes.)
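Both access styles look like this on a tiny hypothetical frame (column names chosen to mirror the power dataset):

```python
import pandas as pd

# Hypothetical one-row frame for illustration only.
df = pd.DataFrame({"Global_active_power": [4.2], "Sub_metering_1": [1.0]})

for row in df.itertuples():
    # Position 0 is the index, so the first data column is at position 1.
    print(row[0], row[1])
    # Columns whose names are valid identifiers are also attributes,
    # which avoids magic numbers entirely.
    print(row.Global_active_power, row.Sub_metering_1)
```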

So here’s a quick benchmark showing the performance boost on roughly two million rows of the UCI household power consumption dataset.

## Scenario

We’ll calculate the average residual power consumption across all rows, which is defined in the data descriptor as “the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3”:

global_active_power * 1000 / 60 - sub_metering_1 - sub_metering_2 - sub_metering_3

We’ll do it once using iterrows and then using itertuples.
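As a one-row sanity check of the formula (the numbers below are made up, not from the dataset): Global_active_power is in kilowatts, so multiplying by 1000 gives watts, and dividing by 60 converts a per-minute reading into watt-hours, matching the units of the sub-meterings.

```python
# Hypothetical single reading.
gap = 4.216                          # Global_active_power, kilowatts
sub1, sub2, sub3 = 0.0, 1.0, 17.0    # sub-meterings, watt-hours

# kW -> W (*1000), then per-minute -> watt-hours (/60).
residual = gap * 1000 / 60 - sub1 - sub2 - sub3
print(residual)   # ~52.27 watt-hours
```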

A Jupyter notebook for this is available here.

```
import pandas as pd
import numpy as np
import time

df = pd.read_csv('data/household_power_consumption.zip',
                 sep=';', na_values=['?'])
df.head()
```

```
df = df.dropna(axis=0, how='any')
print('Total rows', len(df))
```

```
start = time.time()
tot_residual = 0
for idx, row in df.iterrows():
    gap = row['Global_active_power']
    tot_residual += (gap * 1000 / 60 - row['Sub_metering_1']
                     - row['Sub_metering_2'] - row['Sub_metering_3'])
avg_residual = tot_residual / len(df)
elapsed1 = time.time() - start
print(avg_residual, elapsed1)
# 9.314693306921193 141.24268531799316
```

```
IDX_GLOBAL = 3
IDX_SUB_METER_1 = 7
IDX_SUB_METER_2 = 8
IDX_SUB_METER_3 = 9

start = time.time()
tot_residual = 0
for row in df.itertuples():
    gap = row[IDX_GLOBAL]
    tot_residual += (gap * 1000 / 60 - row[IDX_SUB_METER_1]
                     - row[IDX_SUB_METER_2] - row[IDX_SUB_METER_3])
avg_residual = tot_residual / len(df)
elapsed2 = time.time() - start
print(avg_residual, elapsed2)
# 9.314693306921193 2.1055757999420166
```

```
100 * (elapsed1 - elapsed2) / elapsed1
# 98.50924966825607
```

So there it is – a roughly 67x speed-up (a 98.5% reduction in run time) simply by not using iterrows.

So don’t.

Ever.
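For completeness: neither loop is strictly necessary. A fully vectorized expression skips row iteration entirely and is typically faster still. Here’s a sketch on a tiny hypothetical frame (not benchmarked against the runs above):

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
df = pd.DataFrame({
    'Global_active_power': [4.216, 5.360],
    'Sub_metering_1':      [0.0, 0.0],
    'Sub_metering_2':      [1.0, 1.0],
    'Sub_metering_3':      [17.0, 16.0],
})

# The whole formula is computed column-wise, with no Python-level loop.
avg_residual = (df['Global_active_power'] * 1000 / 60
                - df['Sub_metering_1']
                - df['Sub_metering_2']
                - df['Sub_metering_3']).mean()
print(avg_residual)
```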

Nothing to see here really.
