## The iterrows Experience

There are a number of ways to iterate over the rows of a Pandas data frame, and without doubt the worst is the iterrows method.

I didn’t know this until recently, but what happens is that every row is “boxed” into a Series, and the performance is abysmal. A colleague pointed me to itertuples, which is basically a version of iterrows that isn’t rubbish – i.e. a version where each row in the data frame is returned as a…well…tuple (a namedtuple, to be precise).
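To see the boxing for yourself, here’s a minimal sketch on a hypothetical two-column frame (not the power consumption data): iterrows hands back a full Series per row, and because a Series has a single dtype, the int column gets upcast to float along the way.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes: one int column, one float column.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# iterrows boxes each row into a Series; the Series has one dtype,
# so the int value in column 'a' is upcast to float.
idx, row = next(df.iterrows())
print(type(row))    # <class 'pandas.core.series.Series'>
print(row["a"])     # 1.0 -- the int was upcast

# itertuples returns a lightweight namedtuple instead, and each value
# keeps its original column dtype.
tup = next(df.itertuples())
print(tup.a)        # 1
```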

A downside is positional indexing: data columns are numbered starting from 1 because the index sits at position zero (duh). That’s fine if you write clean code but bad if you don’t know about the evils of magic numbers. (Strictly speaking, itertuples returns namedtuples, so columns whose names are valid Python identifiers can also be accessed as attributes.)
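Both access styles look like this on a tiny hypothetical frame (column names chosen to mirror the power dataset):

```python
import pandas as pd

# Hypothetical one-row frame for illustration only.
df = pd.DataFrame({"Global_active_power": [4.2], "Sub_metering_1": [1.0]})

for row in df.itertuples():
    # Position 0 is the index, so the first data column is at position 1.
    print(row[0], row[1])
    # Columns whose names are valid identifiers are also attributes,
    # which avoids magic numbers entirely.
    print(row.Global_active_power, row.Sub_metering_1)
```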

So here’s a quick benchmark showing the performance boost on roughly two million rows of the UCI household power consumption dataset.

## Scenario

We’ll calculate the average residual power consumption across all rows, which is defined in the data descriptor as “the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3”:

global_active_power * 1000 / 60 - sub_metering_1 - sub_metering_2 - sub_metering_3

We’ll do it once using iterrows and then using itertuples.
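As a one-row sanity check of the formula (the numbers below are made up, not from the dataset): Global_active_power is in kilowatts, so multiplying by 1000 gives watts, and dividing by 60 converts a per-minute reading into watt-hours, matching the units of the sub-meterings.

```python
# Hypothetical single reading.
gap = 4.216                          # Global_active_power, kilowatts
sub1, sub2, sub3 = 0.0, 1.0, 17.0    # sub-meterings, watt-hours

# kW -> W (*1000), then per-minute -> watt-hours (/60).
residual = gap * 1000 / 60 - sub1 - sub2 - sub3
print(residual)   # ~52.27 watt-hours
```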

A Jupyter notebook for this is available here.

```
import pandas as pd
import numpy as np
import time

df = pd.read_csv('data/household_power_consumption.zip',
                 sep=';', na_values=['?'])
df.head()
```

```
df = df.dropna(axis=0, how='any')
print('Total rows', len(df))
```

```
start = time.time()
tot_residual = 0
for idx, row in df.iterrows():
    gap = row['Global_active_power']
    tot_residual += (gap * 1000 / 60 - row['Sub_metering_1']
                     - row['Sub_metering_2'] - row['Sub_metering_3'])
avg_residual = tot_residual / len(df)
elapsed1 = time.time() - start
print(avg_residual, elapsed1)
# 9.314693306921193 141.24268531799316
```

```
IDX_GLOBAL = 3
IDX_SUB_METER_1 = 7
IDX_SUB_METER_2 = 8
IDX_SUB_METER_3 = 9

start = time.time()
tot_residual = 0
for row in df.itertuples():
    gap = row[IDX_GLOBAL]
    tot_residual += (gap * 1000 / 60 - row[IDX_SUB_METER_1]
                     - row[IDX_SUB_METER_2] - row[IDX_SUB_METER_3])
avg_residual = tot_residual / len(df)
elapsed2 = time.time() - start
print(avg_residual, elapsed2)
# 9.314693306921193 2.1055757999420166
```

```
100 * (elapsed1 - elapsed2) / elapsed1
# 98.50924966825607
```

So there it is – a roughly 67x speed-up (a 98.5% reduction in run time) simply by not using iterrows.

So don’t.

Ever.
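For completeness: neither loop is strictly necessary. A fully vectorized expression skips row iteration entirely and is typically faster still. Here’s a sketch on a tiny hypothetical frame (not benchmarked against the runs above):

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
df = pd.DataFrame({
    'Global_active_power': [4.216, 5.360],
    'Sub_metering_1':      [0.0, 0.0],
    'Sub_metering_2':      [1.0, 1.0],
    'Sub_metering_3':      [17.0, 16.0],
})

# The whole formula is computed column-wise, with no Python-level loop.
avg_residual = (df['Global_active_power'] * 1000 / 60
                - df['Sub_metering_1']
                - df['Sub_metering_2']
                - df['Sub_metering_3']).mean()
print(avg_residual)
```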

Nothing to see here really.
