Getting (not) all combinations of (sub lists)

Hi,

currently I am stuck on a challenge.

Imagine the following data frame:

 df1 = pd.DataFrame({"country":['DE', 'DE','DE', 'UK','UK','FR'],
 "miles":[0.5,1.5, 10,0.5,10, 1000], "hours":[100,7,200,10,20,1]})

In the end I want to split all miles by country into two different buckets (above or upon a treshold). As I don’t want to search manually for the best spot, I would like to get almost all possible combinations. I the beginning I tried a loop in a loop in a loop, but then I get something like:

  • DE 1, UK 1, UK 2, FR 1

This does not work as only 1 treshold at a time is allowed per country, so the target would be:

  • DE 1, UK 1, FR 1
  • DE 1, UK 1, FR 2
  • and so on

I tried a lot & my code looks quite messy now. I did not find a way with the loop strategy yet…

   df1 = pd.DataFrame({"country":['DE', 'DE','DE', 'UK','UK','FR'],
   "miles":[0.5,1.5, 10,0.5,10, 1000], "hours":[100,7,200,10,20,1]})

df_fin = pd.DataFrame({"code":['DUMMY'],
sum_upon":[1000], "sum_above":[1000], "count_upon":[1000], "count_above":[1000], "treshold":[1000]})

   ranges = 2
   
   #print(df1.head(4))
   for c in df1['country'].unique():
       temp_others = df1.loc[(df1['country']!=c)].reset_index()
       temp_select = df1.loc[(df1['country']==c)].reset_index()
       #print(temp_others)
       for i in range(ranges):
           treshold = i
           print(treshold)
           temp_sum_upon = 0
           temp_count_upon = 0
           temp_sum_above = 0
           temp_count_above = 0
           name = 'PASS_{0}_{1}_'.format(c,treshold)
       
       upon_treshold_sel = temp_select.loc[(temp_select['miles'] < treshold)]
       above_treshold_sel = temp_select.loc[(temp_select['miles'] >= treshold)]
       
       temp_sum_upon+=upon_treshold_sel['miles']
       temp_count_upon+=upon_treshold_sel['hours']
       temp_sum_above+=above_treshold_sel['miles']
       temp_count_above+=above_treshold_sel['hours']
       
       for t in temp_others['country'].unique():
           temp_sub_others = temp_others.loc[(temp_others['country']!=t)]
           for m in range(ranges):
                   others_treshold = m
                   upon_treshold_oth = temp_others[(temp_others['miles'] < others_treshold)]
                   above_treshold_oth = temp_others[(temp_others['miles'] >= others_treshold)]
           
                   temp_sum_upon+=upon_treshold_oth['miles']
                   temp_count_upon+=upon_treshold_oth['hours']
                   temp_sum_above+=above_treshold_oth['miles']
                   temp_count_above+=above_treshold_oth['hours']
                   name=name+'{country}_{sub_treshold} '.format(country=temp_others['country'], sub_treshold = others_treshold)
                   df_temp = pd.DataFrame({"code":[name],
                                   "sum_upon":[temp_sum_upon.sum()], "sum_above":[temp_sum_above.sum()], "count_upon":[temp_count_upon.sum()], "count_above":[temp_count_above.sum()], "treshold":[treshold]})
                     df_fin = df_fin.append(df_temp, ignore_index = True)
                     sorted_out.append(t)
             
             #print(c)
             #print(name[:20])
             #print(temp_sum_above.sum())
                     temp_sum_upon = 0
                     temp_count_upon = 0
                     temp_sum_above = 0
                     temp_count_above = 0

Alternatively I found this method:

 from itertools import combinations 
 
 names = []
 country_list = df1['country'].unique()
 for c in country_list:
     code = c
     country_list = country_list[country_list != c]
     for i in range(2):
         code+=str(i)
         names.append(code)
         
   
 def rSubset(arr, r): 
   
     # return list of all subsets of length r 
     # to deal with duplicate subsets use  
     # set(list(combinations(arr, r))) 
    return list(combinations(arr, r)) 
   
 
 arr = names 
 r = len(df1['country'].unique())
coms = rSubset(arr, r)
 coms_df = pd.DataFrame(coms) 
coms_df

Here I get all possible combinations, but I struggle with two things:

  • Clean up rows I do not need (i.e. 2 times UK or DE in it)
  • Map the above or upon treshold values, so I can apply them (as a sum) to all possible combinations.

In the end I want to find a minimum value in all possible combinations to identify the sweet spot.

Does someone have an idea here?

I am not sure what you are trying to do and not sure if I can help or not with this, but it would help if you -

  1. Provided a couple of examples of what kind of output you expect to have based on what you are trying to do. What exactly are “buckets” here? On what basis are you getting DE 1, UK 1... and why exactly is that not what you want? What are those numbers in front of the country initials corresponding to? What is that “sweet spot” you are referring to? Being explicit with your details and what kind of output you expect would be helpful to better understand your problem.

  2. Properly formatted the code as well. Currently, you are “blockquoting” the code which looks like -

def test_function():
print(“something”)

you instead want to use the tool </> to properly format code -

def test_function():
    print("something")

The above helps make it more readable as well. It can also be better to share the Github link to your code, if you are comfortable doing that.

Hi @the_doctor ,

thanks for the feedback: I reformatted the code.

So imagine:

  • every country would represent a target airport (biggest airport of this country)
  • hours represent working hours the airport needs to prepare the flight
  • and miles represent the way workers have to make for the preparations (like get food, fuel etc)

The workers are actually not employed by you, but external companies. They have two different pricing models. One company takes into account the total miles, the other company the avg working hours per worker. To get the best pricing you need to know, how to allocate the workers correctly.

One note here: The question is really about the technical solution, no the business question.

So what do I need here:
I want to combine every airport & every working hours allocation with each other under different tresholds. So if I set working hours treshold to 3, then I borrow every worker with less than three working hours from company A while I take every worker with more than 3 working hours from company B to optimize my pricing. I would like to know, what is the best spot for each treshold.

So I want to find all possible combinations (lets say for a treshold until 3). My desired output here is:

  • Each country + a treshold

So on a treshold range of 1 to 5 it would be like:

  • DE (treshold 1), UK (treshold 1), FR (treshold 1)
  • DE (treshold 1), UK (treshold 1), FR (treshold 2)
  • DE (treshold 1), UK (treshold 1), FR (treshold 3)
  • DE (treshold 1), UK (treshold 2), FR (treshold 1)
  • DE (treshold 1), UK (treshold 3), FR (treshold 1)
  • etc

For every of these combinations I want to know how many working hours & how many miles fall above or how many fall upon the treshold, so:

  • DE (treshold 1), UK (treshold 1), FR (treshold 1) / under_tres_h: 500, over_tres_h: 100, under_tres_m: 3000, over_tres_m: 100

The next step would be the calculation of company A & B costs based on every combination to identify the perfect pricing structure.

So far my issue was, that I got:

  • DE (treshold 1), UK (treshold 1), UK (treshold 2), FR (treshold 1)

which is not allowed here. Only one treshold per country/airport at a time.

I hope this makes more sense now.

Hey @thomas.meissner

The decision is pretty dirty. But so far, that’s all I came up with.

So, I’ll assume that your dateframe has Country and Threshold columns.

Step 1.
We get a unique combination of Country and Threshold pairs. And also how many unique countries you have. I’ll assume that this many values must be in one combination. If not, just set it manually.

code_list = data.groupby(['country', 'threshold']).size().reset_index()[['country', 'threshold']].values.tolist()
len_combine = len(data['country'].unique())

Step 2.
Let’s use the itertools library you already mentioned.
But we need to add a filter that shuts down all cases where the country repeats. And to do that, we’ll create a check function

from itertools import combinations 

comb_data = combinations(code_list, len_combine)

def check_function(code_combine, len_combine):
    return len(set([elem[0] for elem in code_combine])) == len_combine

result_list = [elem for elem in comb_data if check_function(elem, len_combine)]

Check function, extracts the country code. Then it deletes duplicates using the grid and checks the length of the resulting list. If it is below the set limit, it returns False

My result is on artificial data.

For the starting dateframe of the species

image

All possible combinations will.

[(['DE', '1'], ['FR', '1'], ['UK', '1']),
 (['DE', '1'], ['FR', '1'], ['UK', '2']),
 (['DE', '1'], ['FR', '2'], ['UK', '1']),
 (['DE', '1'], ['FR', '2'], ['UK', '2']),
 (['DE', '1'], ['FR', '3'], ['UK', '1']),
 (['DE', '1'], ['FR', '3'], ['UK', '2']),
 (['DE', '1'], ['FR', '5'], ['UK', '1']),
 (['DE', '1'], ['FR', '5'], ['UK', '2']),
 (['DE', '2'], ['FR', '1'], ['UK', '1']),
 (['DE', '2'], ['FR', '1'], ['UK', '2']),
 (['DE', '2'], ['FR', '2'], ['UK', '1']),
 (['DE', '2'], ['FR', '2'], ['UK', '2']),
 (['DE', '2'], ['FR', '3'], ['UK', '1']),
 (['DE', '2'], ['FR', '3'], ['UK', '2']),
 (['DE', '2'], ['FR', '5'], ['UK', '1']),
 (['DE', '2'], ['FR', '5'], ['UK', '2']),
 (['DE', '3'], ['FR', '1'], ['UK', '1']),
 (['DE', '3'], ['FR', '1'], ['UK', '2']),
 (['DE', '3'], ['FR', '2'], ['UK', '1']),
 (['DE', '3'], ['FR', '2'], ['UK', '2']),
 (['DE', '3'], ['FR', '3'], ['UK', '1']),
 (['DE', '3'], ['FR', '3'], ['UK', '2']),
 (['DE', '3'], ['FR', '5'], ['UK', '1']),
 (['DE', '3'], ['FR', '5'], ['UK', '2'])]

Many thanks @moriturus7 , that was a very good solution & explanation! You helped me a lot here :slight_smile: