The solution seems more complicated than it needs to be?

Screen Link:

I find the provided solution to this mission a bit more complicated than it needs to be for finding the nearest cluster.

My Code:

def assign_to_cluster(row):
    cluster_distances = []
    for key, value in centroids_dict.items():
        distance = euclidean_distances(value, row[['ppg','atr']])
        cluster_distances.append(distance)
    return cluster_distances.index(min(cluster_distances))


point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)

The code above gives the exact same result as the provided solution below:

def assign_to_cluster(row):
    lowest_distance = -1
    closest_cluster = -1
    
    for cluster_id, centroid in centroids_dict.items():
        df_row = [row['ppg'], row['atr']]
        euclidean_distance = calculate_distance(centroid, df_row)
        
        if lowest_distance == -1:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id 
        elif euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
    return closest_cluster

point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)

Is there a particular reason we want to use the if elif solution to finding the minimum distance and cluster_id?

Some may feel that this if-else separation helps emphasize two distinct ideas: an initialization pass that overwrites the dummy lowest_distance of -1 with the first euclidean distance, and later iterations that update lowest_distance only when a euclidean distance lower than the current minimum is found.

This feels strange to me since both branches serve the same purpose of updating the lowest distance, and if-else is normally meant for controlling different actions under mutually exclusive conditions. I would have written if lowest_distance == -1 or euclidean_distance < lowest_distance:, and factored the two update blocks into a single block under that one conditional.
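
For concreteness, a minimal sketch of that single-conditional version, assuming the mission's centroids_dict and calculate_distance are already defined:

def assign_to_cluster(row):
    lowest_distance = -1
    closest_cluster = -1

    for cluster_id, centroid in centroids_dict.items():
        df_row = [row['ppg'], row['atr']]
        euclidean_distance = calculate_distance(centroid, df_row)

        # one condition covers both the first pass (the -1 sentinel) and later updates
        if lowest_distance == -1 or euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id

    return closest_cluster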

Let me explain some differences I see in the DQ solution:

  1. for key, value in vs for cluster_id, centroid in
    Never use key, value; it is not self-documenting and does not help the reader understand what the dictionary physically contains.

  2. Storing everything in cluster_distances = [] and taking min at the end vs a streaming update
    What if there are 10 million points? Creating a list to store every distance before calling min takes up space. The choice between a streaming update and create-store-post-process is an important consideration in data structures and algorithms, especially in dynamic programming with a lot of recursion, where condensing the solution with max/min at each stage before returning to the caller can save a lot of space (see the sketch after this list).

  3. cluster_distances.index vs cluster_id
    cluster_distances.index can only return integers from 0 to len(list) - 1, while cluster_id gives you the opportunity to use strings to name the clusters. That can be convenient in visualization, where you want cluster names to be readable — the sketch below shows this as well.
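
To make points 2 and 3 concrete, here is a small self-contained sketch (the centroid values and cluster names are made up for illustration): min() streams over the dictionary, keeping only the best candidate seen so far instead of building a list of all distances, and the cluster labels are readable strings rather than list positions.

import math

def calculate_distance(centroid, point):
    # plain Euclidean distance between two 2-D points
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(centroid, point)))

# hypothetical centroids keyed by readable names instead of 0..k-1 integers
centroids_dict = {
    'scorers': (22.1, 1.8),
    'facilitators': (12.4, 3.5),
    'balanced': (16.0, 2.6),
}

def assign_to_cluster(row):
    point = [row['ppg'], row['atr']]
    # min() keeps only the closest cluster seen so far; no list of all
    # distances is ever materialised, and the result is a readable label
    return min(centroids_dict,
               key=lambda cluster_id: calculate_distance(centroids_dict[cluster_id], point))

print(assign_to_cluster({'ppg': 21.0, 'atr': 2.0}))  # 'scorers'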

It’s an art to balance simple and concise with extensible and maintainable. Look to expand your spectrum so you have multiple ways to achieve the same result and can explain the trade-offs.


@hanqi thanks for the reply!

Good point, I will definitely go with the best practice from now on.

I understand the concern about saving space. In this particular case, though, I’m looping through the centroids, so the list length equals the number of centroids, which is unlikely to be a huge number. But again, I do see your point; it’s definitely valid in the case you described.

Another good point.

The reason I brought up this topic is that I saw, under this mission tag, a question about why the initial distance was set to -1. While it’s essential to learn the reason behind it if you don’t already know it, it can be a bit confusing; maybe the explanation could be added to the instructions. Also, I haven’t timed it, but I do wonder if there’s a processing-time difference between the two solutions.
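
On the -1 question: -1 is just a sentinel meaning "no distance seen yet", which is safe because a real euclidean distance can never be negative. A common alternative (not what the mission uses, only a sketch) is to start from infinity, so every real distance beats the initial value and no special first-pass branch is needed:

lowest_distance = float('inf')  # any real distance is smaller than this
for distance in (3.2, 1.7, 2.9):
    if distance < lowest_distance:
        lowest_distance = distance
print(lowest_distance)  # 1.7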

I completely agree with you on expanding the spectrum; that is why I posted this question. There are still lots of under-the-hood things for me to learn. :nerd_face: Thanks again for the great points you brought up!


Coming back after timing the two solutions: mine is considerably slower, with an execution time of 0.29 versus 0.01 for the DQ solution. So yeah, that does add up quickly with a large dataset.
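
For reference, that comparison was a rough wall-clock measurement along these lines (assuming point_guards and the assign_to_cluster variants from above are in scope); swapping in each variant gives the two numbers:

import time

start = time.perf_counter()
point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
print(f'elapsed: {time.perf_counter() - start:.2f}')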

My code was pretty similar to the OP’s, and this is some great constructive criticism that helps clarify the rationale behind DQ’s coding approach and spreads some general methodological wisdom. Thank you for posting!