Stuck on CPU Bound Programs

Hi!

So I’m working on the CPU Bound Programs mission and got to step 13 (Practicing writing efficient algorithms), where the instruction is to ‘Use pandas groupby to find the product_link with the highest relevance for each unique query.’ Course link

This is where I got stuck. How do I start thinking about solving this? I had initialized a dictionary and the loop for iterating through ‘query’ with ‘item’ but then? This is also suggested by the hint of the step. But how do I ‘track the highest relevance for each term and the associated link in a dictionary’? Eventually I checked the answer and I see stuff I haven’t come across before so that makes me think I could not have figured this out on my own without seriously Googling for this stuff.

The following stuff is new (and therefore unknown) to me. Can you point me to where I can learn it, perhaps in courses on Dataquest?

  • lambda
  • enumerate
  • loop with two iterators

I have been looking for courses with lambda in them and came across:

Both are courses that are not in the Data Engineer Path.

Appreciate a response. Thanks!

Hi @DataBuzzer,

It is true that the proposed solution uses some Python programming concepts that we haven’t taught yet. We are working on improving that.

However, it is possible to solve this question without them.

Lambda

A lambda function is like a regular function but defined with another syntax. The code:

def pandas_algo():
    get_max_relevance = lambda x : x.loc[x["relevance"].idxmax(), "product_link"]
    return data.groupby("query").apply(get_max_relevance)

Is the same as the following:

def get_max_relevance(x):
    return x.loc[x["relevance"].idxmax(), "product_link"]

def pandas_algo():
    return data.groupby("query").apply(get_max_relevance)

There is a course on lambda functions later in the DE path, but I agree that we should not be using them before teaching them.
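To see the same equivalence on something smaller than the mission's data, here is a minimal sketch. The names `square_def`, `square_lambda`, and `pairs` are made up for illustration and are not part of the mission.

```python
# Hypothetical names, for illustration only.
def square_def(x):
    return x * x

# The same function written as a lambda and bound to a name.
square_lambda = lambda x: x * x

# Lambdas are most often passed inline, for example as a sort key.
pairs = [("b", 2), ("a", 3), ("c", 1)]
by_number = sorted(pairs, key=lambda pair: pair[1])

print(square_def(4), square_lambda(4))
print(by_number)
```

Both versions behave identically when called; the lambda form is mainly convenient when a short function is needed only once, as with the `key` argument here.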

Enumerate

When you do a for loop using enumerate() you get access to both the index and the value rather than just the value.

For example, a simple for loop will iterate over the values:

for value in [5, 7, 3, 8]:
    print(value)
5
7
3
8

Using enumerate will iterate over the indexes and the values at the same time:

for index, value in enumerate([5, 7, 3, 8]):
    print(index, value)
0 5
1 7
2 3
3 8

The solution uses enumerate(), but the problem can be solved without it. The algo() function can be rewritten without enumerate() as follows:

def algo():
    links = {}
    for i in range(len(query)):
        row = query[i]
        if row not in links:
            links[row] = [0,""]
        if relevance[i] > links[row][0]:
            links[row] = [relevance[i], product_link[i]]
    return links
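As for the "loop with two iterators" from the original question, another way to write such a loop is the built-in zip(), which walks several lists in step. The sketch below is not the mission's solution. It assumes that query, relevance, and product_link are parallel lists of equal length, and the sample values and the name `algo_zip` are invented.

```python
# Invented sample data; in the mission these lists come from the dataset.
query = ["tv", "tv", "phone", "phone"]
relevance = [1.0, 3.0, 2.5, 2.0]
product_link = ["tv-a", "tv-b", "phone-a", "phone-b"]

def algo_zip():
    # Maps each query to [best relevance so far, link of that row].
    links = {}
    for q, rel, link in zip(query, relevance, product_link):
        # First time we see q, or a strictly better relevance: record it.
        if q not in links or rel > links[q][0]:
            links[q] = [rel, link]
    return links

print(algo_zip())
```

One small design difference: because the first row of each query is always recorded, this version does not depend on relevance values being greater than 0.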

Pandas concepts

The pandas_algo() function uses some methods that I am not sure we have taught before this mission.

The DataFrame.groupby() method groups the rows of the dataframe by a given column. Imagine that you have this dataset:

   Animal  Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

If we group by Animal then we have two groups each with two rows. For example:

import pandas as pd

df = pd.DataFrame({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
    'Speed': [380., 370., 24., 26.]
})
# Iterating over a groupby yields (group name, sub-DataFrame) tuples.
for group in df.groupby('Animal'):
    print(group)
('Falcon',    Animal  Speed
0  Falcon      380.0
1  Falcon      370.0)
('Parrot',    Animal  Speed
2  Parrot       24.0
3  Parrot       26.0)

We can use the apply() method to apply a function to each group. For example, we can compute the maximum speeds of the animals in each group by doing:

def max_speed(group):
    return group['Speed'].max()

print(df.groupby('Animal').apply(max_speed))
Animal
Falcon    380.0
Parrot     26.0
dtype: float64

In the context of this problem, we group by query and apply a function that returns the product_link of the row with maximum relevance within each group.
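For comparison, the same idea can be sketched without apply() by calling idxmax() on the grouped relevance column and then looking up the matching rows. This is an alternative, not the mission's solution. The DataFrame below is invented sample data that only borrows the mission's column names, and `best_rows` and `best` are illustrative names.

```python
import pandas as pd

# Invented sample data standing in for the mission's dataset.
data = pd.DataFrame({
    "query": ["tv", "tv", "phone", "phone"],
    "relevance": [1.0, 3.0, 2.5, 2.0],
    "product_link": ["tv-a", "tv-b", "phone-a", "phone-b"],
})

# For each query, the index label of the row with the highest relevance.
best_rows = data.groupby("query")["relevance"].idxmax()

# Look those rows up and build a Series of links indexed by query.
best = (
    data.loc[list(best_rows), ["query", "product_link"]]
    .set_index("query")["product_link"]
)
print(best)
```

The result has the same shape as the output of pandas_algo(): a Series of product links with query as its index.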

I hope this helps :slight_smile:

Hi @Francois,

Thanks for taking the time and effort for this great reply! Really appreciated.

After some practice and research I understand your approaches and how they work. I recognize that my skill level and intuition are just not there yet, but they’re coming. :slightly_smiling_face:

Gr. Ron

@Francois I appreciate the explanation, although I do think this question was really not well thought out.
The solution checker doesn’t ask for an output to validate either algorithm against.
It says to “develop an algorithm”, but doesn’t specify whether this should be a function and (if it is a function) what its inputs and outputs should be.
To that effect:
pandas_algo() will return product links in a pandas Series with query as its index.
algo() will return a dictionary whose keys are queries and whose values are lists containing the relevance and the product link. You could pull out the product link only with a dictionary comprehension:
links = {k: v[1] for k, v in links.items()}
Alternatively, you could track the query max relevance and the query product link in separate dictionaries. I feel like this looks slightly cleaner, and so far as I can tell it doesn’t affect order of complexity.

def algo():
    links = {}
    max_rel = {q: 0 for q in query}
    for i, q in enumerate(query):
        if relevance[i] > max_rel[q]:
            links[q] = product_link[i]
            max_rel[q] = relevance[i]
    return links

Additionally, one of the two algorithms needs a conversion step (Series to dict or vice-versa) in order for these two algorithms to have the same output.
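A minimal sketch of that conversion step, in both directions, might look like the following. The values are invented and the variable names are illustrative. It assumes the dictionary maps each query directly to its product link, as the algo() version just above does.

```python
import pandas as pd

# Invented results in the two shapes discussed above.
links_dict = {"tv": "tv-b", "phone": "phone-a"}                # dict, like algo()
links_series = pd.Series({"phone": "phone-a", "tv": "tv-b"})  # Series, like pandas_algo()

# Series -> dict
as_dict = links_series.to_dict()

# dict -> Series, sorted by index to match groupby's default sorted order
as_series = pd.Series(links_dict).sort_index()

print(as_dict == links_dict)
print(as_series.equals(links_series))
```

Sorting the index matters only for the Series comparison; dictionary equality ignores key order.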
