Handling row-wise operations with an offset

Hello community! Hope you’re all doing well. :slight_smile:

I am working on a cohort model for my job. Basically, I’ve got a table where:

  • each row is a month, and the first numbered value is the number of new customers acquired in that month
  • each column is also a month, and tells me how many of the customers acquired in the corresponding row’s month are still there and ordering.

With some sample data, it would look like the following.

import numpy as np
import pandas as pd

matrix = [
    [100, 75, 50, 40, 30],
    [np.nan, 100, 90, 70, 30],
    [np.nan, np.nan, 100, 70, 50],
    [np.nan, np.nan, np.nan, 100, 90],
    [np.nan, np.nan, np.nan, np.nan, 100],
]

temp_df = pd.DataFrame(columns=[x for x in range(0, 5)], data=matrix)

The output looks like this:

[screenshot of the resulting dataframe]

What I want is to calculate the retention rate relative to the first acquisition month. I find this a super trivial task in Excel, but it turned out to be surprisingly challenging in Python.
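
To make that concrete, with the sample above the output I’m after would look like this (each value divided by the first non-null value in its row, if I’m computing it right):

     0     1    2    3    4
0  1.0  0.75  0.5  0.4  0.3
1  NaN  1.00  0.9  0.7  0.3
2  NaN   NaN  1.0  0.7  0.5
3  NaN   NaN  NaN  1.0  0.9
4  NaN   NaN  NaN  NaN  1.0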

I came up with something by doing the following:

def retention(x, row):
    # find the first non-null value of the row and use it as the anchor
    anchor = 0
    for value in row:
        if pd.notnull(value):
            anchor = value
            break
    return x / anchor

first_row = temp_df.iloc[0, :]
new_row = first_row.apply(retention, row=first_row)

This works, but do I really need to apply it to every single row and collect each result into a new dataframe?
It looks super clunky. :-/

Thanks for your help :pray:
Nick

Hello again team,
I eventually managed to get something out of this, but again, I REALLY don’t like the solution.

storage = []

for i in range(len(temp_df)):
    row = temp_df.iloc[i, :]    # at each iteration, take the next row
    for value in row:           # look for the first non-null value, then break out
        if pd.isnull(value):
            continue
        anchor = value
        break

    new = row.apply(lambda x: x / anchor)
    storage.append(new)

transformed = pd.DataFrame(columns=[x for x in range(temp_df.shape[1])], data=storage)

I would be happy to get a review of how this should have been done in the proper Pythonic way :slight_smile:
Thanks again!

Hi @nlong

If I understand the task you are facing correctly, you can do it like this.

def coef_culc(row):
    # anchor on the first non-null value in the row
    start_value = next(item for item in row if not pd.isnull(item))
    return [item / start_value for item in row]

transformed = temp_df.apply(coef_culc, axis=1, result_type='expand')
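
A note on result_type='expand': by default, apply with a function that returns a plain list hands back a single Series of lists rather than a dataframe, roughly like this:

temp_df.apply(coef_culc, axis=1)
# 0    [1.0, 0.75, 0.5, 0.4, 0.3]
# 1    [nan, 1.0, 0.9, 0.7, 0.3]
# 2    [nan, nan, 1.0, 0.7, 0.5]
# 3    [nan, nan, nan, 1.0, 0.9]
# 4    [nan, nan, nan, nan, 1.0]
# dtype: object

result_type='expand' tells pandas to expand each returned list back into columns of the result.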

Hey, Nick.

I read your question a few days ago and had trouble understanding what you’re trying to do, so I saved it for when I had a chance to run your code and look at the output.

I notice that temp_df/100 equals transformed. Can you find another way to explain what you want? Even a Google Sheet would be helpful.
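
What I ran was essentially this comparison, using the transformed dataframe from your second post (it should print True with your sample data):

>>> transformed.equals(temp_df / 100)
True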

Hi @Bruno, sorry I’ve been away for some time, and thanks for taking the time to address this. I’m attaching a workbook of what I’m trying to do. Having everything in base 100 was not the optimal example, though; otherwise, as you noted, it would be enough to do temp_df/100.

Hopefully this time it’s clearer :slight_smile:

Let’s set up the objects:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> matrix = [
... [200, 150, 70, 60, 40],
... [np.nan, 100, 70, 30, 50],
... [np.nan, np.nan, 50, 40, 30],
... [np.nan, np.nan, np.nan, 170, 150],
... [np.nan, np.nan, np.nan, np.nan, 40]
... ]
>>> 
>>> temp_df = pd.DataFrame(columns= [x for x in range(0,5)], data = matrix)

The data behaves such that the first non-null value of each row is also the maximum of that row. In an interactive shell:

>>> temp_df.max(axis=1)
0    200.0
1    100.0
2     50.0
3    170.0
4     40.0
dtype: float64

So we want to divide each row by its corresponding maximum. We can achieve this with DataFrame.divide:

>>> temp_df.divide(temp_df.max(axis=1), axis=0)
     0     1     2    3         4
0  1.0  0.75  0.35  0.3  0.200000
1  NaN  1.00  0.70  0.3  0.500000
2  NaN   NaN  1.00  0.8  0.600000
3  NaN   NaN   NaN  1.0  0.882353
4  NaN   NaN   NaN  NaN  1.000000

I can envision, however, needing a similar calculation when the maximum trick no longer works and you actually need the first non-null value of each row.

I couldn’t find a one-step pandas solution for this, but here’s something close.

We’ll start with pandas.Series.combine_first, which fills the null values of the calling Series with the corresponding values from its argument. Here’s how it works:

>>> temp_df[0].combine_first(temp_df[1])
0    200.0
1    100.0
2      NaN
3      NaN
4      NaN
Name: 0, dtype: float64
>>> temp_df[0].combine_first(temp_df[1]).combine_first(temp_df[2])
0    200.0
1    100.0
2     50.0
3      NaN
4      NaN
Name: 0, dtype: float64
>>> temp_df[0].combine_first(temp_df[1]).combine_first(temp_df[2]).combine_first(temp_df[3])
0    200.0
1    100.0
2     50.0
3    170.0
4      NaN
Name: 0, dtype: float64
>>> temp_df[0].combine_first(temp_df[1]).combine_first(temp_df[2]).combine_first(temp_df[3]).combine_first(temp_df[4])
0    200.0
1    100.0
2     50.0
3    170.0
4     40.0
Name: 0, dtype: float64

We don’t want to write out such a long method chain by hand, though. To fold combine_first across all the columns, we can use functools.reduce:

>>> from functools import reduce
>>> col_series = [temp_df[col] for col in temp_df.columns]
>>> reduce(lambda previous, current: previous.combine_first(current),
... col_series
... )
0    200.0
1    100.0
2     50.0
3    170.0
4     40.0
Name: 0, dtype: float64
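
From there, finishing the retention calculation should just be a matter of dividing by that series, the same way as with the maxima. A sketch, reusing the objects above:

>>> first_values = reduce(lambda previous, current: previous.combine_first(current),
... col_series
... )
>>> temp_df.divide(first_values, axis=0)
     0     1     2    3         4
0  1.0  0.75  0.35  0.3  0.200000
1  NaN  1.00  0.70  0.3  0.500000
2  NaN   NaN  1.00  0.8  0.600000
3  NaN   NaN   NaN  1.0  0.882353
4  NaN   NaN   NaN  NaN  1.000000

With this sample data that’s the same table as the max-based version, since the first non-null value of each row happens to be its maximum.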

Thanks @Bruno, brilliant explanation. I didn’t know about reduce, and aiming at the max value of each row was a good perspective too.

@moriturus7 thanks to you as well. I haven’t tried this yet, but I wasn’t familiar with the result_type option as a kwarg for the apply method. Will test this too!

Thanks to both of you :pray:
Nic
