
Adventures in Python #3

The iterrows Experience

There are a number of ways to iterate over the rows of a Pandas data frame, and without doubt the worst is the iterrows method.

I didn’t know this until recently, but what happens is that every row is “boxed” into a Series and the performance is abysmal. A colleague pointed me to itertuples, which is basically a version of iterrows that isn’t rubbish – i.e. a version where each row of the data frame is returned as a…well,…tuple.
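To see the difference concretely, here’s a minimal sketch on a toy frame (not the power data) showing what each method hands back per row:

import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# iterrows boxes every row into a Series (which can also silently upcast dtypes)
for idx, row in toy.iterrows():
    print(type(row))    # a pandas Series

# itertuples hands back each row as a lightweight namedtuple
for row in toy.itertuples():
    print(type(row))    # a namedtuple (named 'Pandas' by default)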

The downside is that the columns come back positionally rather than as a labelled Series – which is fine if you write clean code but bad if you don’t know about the evils of magic numbers. (Strictly speaking, itertuples returns namedtuples by default, so fields can still be read as attributes such as row.Sub_metering_1; here I’ll index by position.) Note that sadly, data columns are accessed using numbers starting from 1 because the data frame index sits at position zero (duh).
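If you do go positional, one way to avoid sprinkling magic numbers around is to derive the positions from the frame itself (df here being the frame loaded further down). A quick sketch – the + 1 accounts for the index sitting at position zero:

# build a column-name -> tuple-position lookup once, instead of hard-coding numbers
col_pos = {name: i + 1 for i, name in enumerate(df.columns)}

gap_idx = col_pos['Global_active_power']    # 3 in the full frame (index is 0, Date and Time are 1 and 2)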

So here’s a quick benchmark showing the performance boost on the ~2 million row UCI household power consumption data set.

Scenario

We’ll calculate the average residual power consumption across all rows, where the residual is defined in the data descriptor as “the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3”:

global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3

We’ll do it once using iterrows and then using itertuples.
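For reference, the same residual can also be written as a single vectorised expression over the whole frame – a sketch using the column names from the CSV (with df being the frame loaded below), shown only to make the formula concrete; the benchmark itself sticks to row iteration:

# column-wise version of the residual formula above
residual = (df['Global_active_power'] * 1000 / 60
            - df['Sub_metering_1']
            - df['Sub_metering_2']
            - df['Sub_metering_3'])
avg_residual = residual.mean()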

A Jupyter notebook for this is available here.

import pandas as pd
import numpy as np
import time
# read_csv reads the zip directly; the file is semicolon-separated and '?' marks missing values
df = pd.read_csv('data/household_power_consumption.zip',
                  sep=';', na_values=['?'])
df.head()
Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
4.216 0.418 234.84 18.4 0.0 1.0 17.0
5.360 0.436 233.63 23.0 0.0 1.0 16.0
5.374 0.498 233.29 23.0 0.0 2.0 17.0
5.388 0.502 233.74 23.0 0.0 1.0 17.0
3.666 0.528 235.68 15.8 0.0 1.0 17.0
df = df.dropna(axis=0, how='any')    # drop rows with any missing measurement
print('Total rows', len(df))
 Total rows 2049280
First, iterate over the data frame using the iterrows method and calculate the average residual active energy:
start = time.time()
tot_residual = 0
# each `row` here is boxed into a fresh Series for every one of the ~2M rows
for idx, row in df.iterrows():
    gap = row['Global_active_power']
    tot_residual += (gap * 1000/60 - row['Sub_metering_1'] 
                  - row['Sub_metering_2'] - row['Sub_metering_3'])
avg_residual = tot_residual / len(df)
elapsed1 = time.time() - start
print(avg_residual, elapsed1)

9.314693306921193 141.24268531799316
Ok – so a bit under 2.5 minutes.
Now calculate it again, this time using the itertuples method (note – these positions are counted against the full data frame, so they don’t match the visual above, where I removed some of the columns to make it fit in this blog entry):
# tuple positions in the full frame: position 0 is the index, Date and Time are 1 and 2
IDX_GLOBAL = 3
IDX_SUB_METER_1 = 7
IDX_SUB_METER_2 = 8
IDX_SUB_METER_3 = 9

start = time.time()
tot_residual = 0
# each `row` is a namedtuple; indexing here is positional
for row in df.itertuples():
    gap = row[IDX_GLOBAL]
    tot_residual += (gap * 1000/60 - row[IDX_SUB_METER_1] 
                 - row[IDX_SUB_METER_2] - row[IDX_SUB_METER_3])
avg_residual = tot_residual / len(df)
elapsed2 = time.time() - start
print(avg_residual, elapsed2)

9.314693306921193 2.1055757999420166
Calculate the % reduction in elapsed time:
100*(elapsed1-elapsed2)/elapsed1
98.50924966825607
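To turn that into a speed-up factor rather than a percentage reduction, just take the ratio of the two timings:

# how many times faster the itertuples run was
speedup = elapsed1 / elapsed2
print(round(speedup, 1))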

So there it is – roughly a 67x speed-up (a 98.5% reduction in run time) simply by not using iterrows.

So don’t.

Ever.

Nothing to see here really.
