Zen of Python Pandas Optimization

Pandas is built on NumPy, which is designed for vectorized operations on whole arrays. Explicit Python loops over a DataFrame are therefore inefficient.

z = df[["x", "y"]].apply(lambda row: row["x"] + row["y"], axis=1)

The .apply function is generally more efficient than an explicit loop. It applies a function along a specified axis (rows or columns), but with axis=1 it still has to call that function once for every row, so it is not truly vectorized. .apply is reasonably efficient only when the applied function works on scalars (integers, floats). Applied functions that work on non-scalar values are very inefficient.

The best time to use .apply is when there is no way to vectorize a function.
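As a rough illustration of the difference between an explicit loop and .apply, here is a minimal sketch. The sample data is made up for the example; only the column names "x" and "y" come from the post.

```python
import pandas as pd

# Hypothetical sample data; only the column names "x" and "y" come from the post.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Explicit Python loop: iterrows() builds a new Series object for every row.
z_loop = pd.Series(
    [row["x"] + row["y"] for _, row in df.iterrows()],
    index=df.index,
)

# Row-wise apply: axis=1 passes each row to the function, one row at a time.
z_apply = df[["x", "y"]].apply(lambda row: row["x"] + row["y"], axis=1)

print(z_apply.tolist())  # [11, 22, 33]
```

Both versions still call Python code once per row; .apply just hides the loop and avoids some of the per-row bookkeeping.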

The next improvement is to vectorize:

z = df["x"] + df["y"]

Vectorization is the process of performing operations on whole arrays rather than on individual scalars.
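To make the scalar versus array distinction concrete, the sketch below computes the same sum both ways. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Scalar style: one Python-level addition per pair of values.
z_scalar = [a + b for a, b in zip(df["x"], df["y"])]

# Vectorized style: a single expression over the whole columns.
z_vec = df["x"] + df["y"]

print(z_vec.tolist())  # [11, 22, 33]
```

The vectorized form hands the whole column to compiled code in one call instead of running the Python interpreter for every element.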

Basic units of Pandas:

A Series is a one-dimensional array with axis labels.
A DataFrame is a two-dimensional array with labeled axes (rows and columns).
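A small sketch of the two structures, with made-up labels and values, may help show how they relate:

```python
import pandas as pd

# Hypothetical labels and values for illustration.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"], name="x")
df = pd.DataFrame(
    {"x": [1.5, 2.5, 3.5], "y": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

print(s.ndim, df.ndim)   # 1 2
print(s["b"])            # 2.5
print(df.loc["b", "y"])  # 5.0

# Selecting a single column of a DataFrame gives back a Series.
col = df["x"]
```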

The next significant improvement is to use NumPy arrays:

Why NumPy? NumPy operations are executed “under the hood” in optimized, pre-compiled C code on ndarrays. This removes the overhead incurred by operations on Pandas Series in Python (indexing, data type checking, etc.).

To perform a NumPy operation, use .values (or .to_numpy(), which the current Pandas documentation recommends instead):

z = df["x"].values + df["y"].values
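One thing worth noticing is that the result is now a plain ndarray without an index. The sketch below, on invented data, uses .to_numpy() in place of .values and shows one way to attach the index again if it is needed.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Pull out the underlying arrays; no index labels come along.
x = df["x"].to_numpy()
y = df["y"].to_numpy()

z = x + y
print(type(z).__name__)  # ndarray
print(z.tolist())        # [11, 22, 33]

# If the labels matter downstream, wrap the result back into a Series.
z_series = pd.Series(z, index=df.index)
```

Because the arrays carry no labels, the addition is purely positional. That is part of why it is faster, and also why it is only safe when both columns come from the same DataFrame or are otherwise known to be aligned.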

The next step is to use Cython to compile the code.

I won’t go in-depth into Cython. There are some points to note:

  1. As long as the Cython code is calling Python methods, you won’t see improvements.
  2. You have to replace Python/NumPy library calls with C-specific math libraries.

In summary, the Zen of Pandas optimization is as follows:

Avoid loops.

If you must loop, use apply, not iteration functions.

If you must apply, use Cython to make it faster.

Vectorization is usually better than scalar operations.

Vector operations on NumPy arrays are more efficient than on native Pandas Series.
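If you want to check these rules on your own machine, a rough timing harness along these lines can help. It is only a sketch: the data size, the helper names (with_loop, with_apply, with_pandas, with_numpy) and the use of timeit are my own choices, and the absolute numbers will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random data; 2,000 rows keeps the slow variants quick to run.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(2_000), "y": rng.random(2_000)})


def with_loop():
    # Explicit iteration over rows.
    return [row["x"] + row["y"] for _, row in df.iterrows()]


def with_apply():
    # Row-wise apply.
    return df.apply(lambda row: row["x"] + row["y"], axis=1)


def with_pandas():
    # Vectorized on Pandas Series.
    return df["x"] + df["y"]


def with_numpy():
    # Vectorized on the underlying NumPy arrays.
    return df["x"].to_numpy() + df["y"].to_numpy()


# Best of three single runs for each variant.
timings = {}
for fn in (with_loop, with_apply, with_pandas, with_numpy):
    timings[fn.__name__] = min(timeit.repeat(fn, number=1, repeat=3))

for name, seconds in timings.items():
    print(f"{name:12s} {seconds * 1e3:8.3f} ms")
```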

To conclude, here is a piece of advice:

“Premature optimization is the root of all evil.”

Get your functionality working first before improving on the performance of your program.

With these points in mind, you should have a better understanding of performance in Pandas.
