Lookup operations between pandas dataframes

Hi fellow dataquesters,
hope to find you all well.

It’s been a while since I last posted here, but that’s because after recently switching jobs I am swimming in data all day, so it’s harder for me to keep up the learning pace. :slight_smile:

Nevertheless, today I faced a data challenge that I managed to solve, but I fear my solution was very verbose and unclean, so I’d like to pose the question to you all, looking for a pythonic way of solving it.
Here we go:

I have a starting dataframe with just two columns: country and driving_distance. Each row is an observation. You can create a sample as follows:

import pandas as pd

sample = {'country': {0: 'FR', 1: 'FR', 2: 'GB', 3: 'GB', 4: 'FR'},
          'driving_distance': {0: 989.0, 1: 2541.0, 2: 12301.0, 3: 4620.0, 4: 18000.0}}
data = pd.DataFrame.from_dict(sample)

I need to scan the values of this table against a lookup table that I fetch from a MySQL database and that has the following shape:

cr = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6},
 'country': {0: 'FR', 1: 'FR', 2: 'FR', 3: 'GB', 4: 'GB', 5: 'GB'},
 'currency': {0: 'EUR', 1: 'EUR', 2: 'EUR', 3: 'PND', 4: 'PND', 5: 'PND'},
 'lower_bound': {0: 10000.0,
  1: 20000.0,
  2: 25000.0,
  3: 8000.0,
  4: 15000.0,
  5: 20000.0},
 'upper_bound': {0: 20000.0,
  1: 25000.0,
  2: 999000.0,
  3: 15000.0,
  4: 20000.0,
  5: 999000.0},
 'payout': {0: 2.0, 1: 3.0, 2: 5.0, 3: 5.0, 4: 7.0, 5: 10.0}}

lkp_table = pd.DataFrame.from_dict(cr)
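(For reference, in the real pipeline I pull this table with something along these lines; the connection string and table name below are placeholders, not my actual setup.)

# Hypothetical sketch of the MySQL fetch; credentials and table name are made up
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@host:3306/mydb')
lkp_table = pd.read_sql('SELECT * FROM payout_bands', engine)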

My desired result is basically to retrieve the payout information from the second table and merge it into the first one, so that for each driving_distance value I am sure that:

  • it looks at the right table partition (same country value)
  • it gets the payout value of the corresponding band, ensuring that driving_distance falls between lower_bound and upper_bound
  • if driving_distance is below the smallest lower_bound in the lookup table, it should return 0.

The end result would look like this:

result = {'country': {0: 'FR', 1: 'FR', 2: 'FR', 3: 'GB', 4: 'GB'},
 'driving_distance': {0: 989.0, 1: 2541.0, 2: 18000.0, 3: 12301.0, 4: 4620.0},
 'payout': {0: 0.0, 1: 0.0, 2: 2.0, 3: 5.0, 4: 0.0},
 'currency': {0: 'EUR', 1: 'EUR', 2: 'EUR', 3: 'PND', 4: 'PND'}}

end = pd.DataFrame.from_dict(result)

This is the output of my solution, which relies on a combination of iterrows(), nested loops and if conditions. It looks so-so to me but it works; the downside is that it rearranges the whole dataframe, putting all rows of one country first and then all the rows of the other country.
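For context, my approach was roughly along these lines (a simplified sketch from memory, not my exact code):

# Rough reconstruction of the verbose approach described above:
# loop over each country partition, then over its rows, checking every band
chunks = []
for country in data['country'].unique():
    subset = data[data['country'] == country].copy()
    bands = lkp_table[lkp_table['country'] == country]
    payouts = []
    for _, row in subset.iterrows():
        payout = 0.0
        for _, band in bands.iterrows():
            if band['lower_bound'] < row['driving_distance'] <= band['upper_bound']:
                payout = band['payout']
                break
        payouts.append(payout)
    subset['payout'] = payouts
    chunks.append(subset)

# concatenating country by country is what rearranges the row order
merged = pd.concat(chunks, ignore_index=True)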
I would be happy if you could share your thoughts on the topic with me!
Many thanks,
N

Gentle bump to top :slight_smile:

Hi!
I'm just a learner, so maybe my solution won't be the most effective one, but still it seems to be a working one.


import numpy as np

def lookup(country, distance):
    # filter the lookup table by country
    lkp_country = lkp_table[lkp_table['country'] == country]

    # check that distance lies within the overall range of the available intervals
    # (necessary for the correct functioning of .get_loc() later on)
    if lkp_country['lower_bound'].min() <= distance <= lkp_country['upper_bound'].max():
        # create an index of the intervals available for the country
        idx = pd.IntervalIndex.from_arrays(lkp_country['lower_bound'], lkp_country['upper_bound'])
        return lkp_country.iloc[idx.get_loc(distance)]['payout']
    else:
        return 0.0

data['payout'] = np.vectorize(lookup)(data['country'], data['driving_distance'])

idx.get_loc(distance) returns the numeric position of the interval in which the distance value falls; that's why .iloc[] is used for selecting the necessary row.
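For example, with the FR bands from the lookup table above:

# quick illustration using the FR bands from lkp_table
idx = pd.IntervalIndex.from_arrays([10000.0, 20000.0, 25000.0],
                                   [20000.0, 25000.0, 999000.0])
idx.get_loc(18000.0)  # returns 0: 18000.0 falls in the first interval, (10000.0, 20000.0]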

You can specify the closed parameter in pd.IntervalIndex.from_arrays() to make the intervals closed on the left side, right side, both or neither, as per the documentation. By default they are closed on the right side.
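A quick way to see the difference (a small sketch, not from the original data):

# default is closed='right': intervals are (lower, upper]
right = pd.IntervalIndex.from_arrays([0.0], [10.0])[0]
0.0 in right    # False: the lower bound is excluded
10.0 in right   # True

# closed='left' gives [lower, upper) instead
left = pd.IntervalIndex.from_arrays([0.0], [10.0], closed='left')[0]
0.0 in left     # True
10.0 in left    # False

One caveat with the default: a distance exactly equal to the smallest lower_bound passes the range check in the function above but doesn't fall inside any right-closed interval, so .get_loc() would raise a KeyError in that edge case.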

According to this and this StackOverflow thread, in theory it should have good performance on large data frames.

Let me know later if it works on the original data and if it's something you were looking for :slight_smile:


Hi Ksenia, thanks for taking a look at this and pointing out some references; this is a brilliant solution! I was not familiar with IntervalIndex. Out of curiosity, is there a specific reason why you went for np.vectorize rather than pandas apply?

And what would have been your solution if you had to go with apply?

Whenever I need to apply a function with multiple arguments (possibly coming from different sources) to a series, I always find it difficult to digest :frowning:

You are welcome! It's been a good practice exercise for me.

The same happens to me. Actually, I found the np.vectorize() method yesterday while looking for a way to apply a function that takes arguments from 2 different columns. Generally it's recommended to avoid the .apply() method when a vectorized operation is possible, because it's not the fastest. But if I were to use apply, it would look something like this:

data['payout'] = data.apply(lambda row: lookup(row['country'], row['driving_distance']), axis=1)

Check the last link in my previous message: they discuss various ways to apply functions with 2 different columns as arguments and compare their performance in terms of the time needed to complete the operation. np.vectorize() got one of the best results.
