Self Guided Project 1 - Single Loop Method

Hello, I’m currently working on the first guided project, specifically the portion where we clean duplicates. I’m trying to use the basic method where we create a dictionary of unique apps while also cleaning the full app list at the same time (in the same loop).

In the code below, my intention is to:

  • Create a temporary copy of the full data set
  • Loop through the full data set as we are guided and perform the basic procedure
  • When we add an item to the dictionary, instead of the value being the number of reviews, use a tuple to include the number of reviews, and the current index in the full data set.
  • When we detect a duplicate and need to update the dictionary, get the index of the previous entry in the full data set for that app from the tuple in the dictionary and delete it from the temporary list. Then update the dictionary as normal, adding a tuple with the new number of reviews and the new index.
  • Return the dictionary and cleaned data set (temporary list).

When I run this, I do see some rows deleted. However, I do not get the same list size as in the instructions. Specifically, I get a length of 10840, out of a total data size of 10486, and the instructions say we should see a total size of 9659.

I’m curious if anyone has feedback on this and where I may be going wrong or if this approach may not work well for this scenario. Thank you!

def ClearDuplicates(dataset=None):
        
    if(dataset != None):
        reviews_max = {}
        clean_dataset = dataset
        index = 0    

        for row in dataset:
            
            name = row[0].strip()
            n_reviews = float(row[3])
            
            if(name not in reviews_max):
                reviews_max[name] = (n_reviews, index)
                
            elif (name in reviews_max) and (reviews_max[name][0] < n_reviews):
                del clean_dataset[reviews_max[name][1]]
                reviews_max[name] = (n_reviews, index)
            
            index += 1
                
    return (reviews_max, clean_dataset)
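
A side note on the "temporary copy" step: in Python, clean_dataset = dataset binds a second name to the same list rather than copying it, so a deletion through one name also changes the list being looped over. A minimal sketch with made-up rows, assuming a shallow copy of the outer list is all that is needed here:

```python
# Toy rows standing in for the app data set (values are made up).
dataset = [
    ["App A", "x", "x", "10"],
    ["App B", "x", "x", "20"],
    ["App A", "x", "x", "30"],
]

alias = dataset               # second name for the SAME list object
shallow_copy = list(dataset)  # new outer list; the row objects are shared

del alias[0]                  # this also removes the row from `dataset`

print(alias is dataset)       # True
print(len(dataset))           # 2
print(len(shallow_copy))      # 3
```

dataset.copy() or dataset[:] would give the same kind of shallow copy as list(dataset).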

Hi @jjschweigert.persona, welcome to the community! This is an interesting idea for cleaning out the duplicates. I think the main reason it’s not working correctly has to do with storing the index value in the dictionary.

When the function hits its first duplicate, if the number of reviews is higher, it will go ahead and make its deletion and update the dictionary. However, after that point, the indices of the rows in clean_dataset will no longer match up with the index counter and the indices of the dataset being looped through.

Let’s say we have this list:

a_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

So far the number of the index matches up with the actual number (a_list[5] = 5).

If I delete an element, this all changes.

del a_list[5]

result:

a_list = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
a_list[5] = 6

So in the function, if we had an app stored in the dictionary with index 1234 and then later needed to delete it, that app may not be at index 1234 anymore because of other deletions. The wrong apps end up getting deleted.
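
To see this on a small scale, here is a rough sketch with made-up rows (just a name and a review count). It assumes the cleaned list is a real copy, so only the stored indices go stale:

```python
# Made-up rows: [name, n_reviews]. "A" appears three times.
rows = [["A", 1], ["B", 5], ["A", 2], ["C", 7], ["A", 3]]

cleaned = list(rows)  # separate outer list, so looping over `rows` is unaffected
best = {}             # name -> (n_reviews, index in the ORIGINAL list)

for index, row in enumerate(rows):
    name, n_reviews = row
    if name not in best:
        best[name] = (n_reviews, index)
    elif best[name][0] < n_reviews:
        # The stored index refers to `rows`, not to the shrinking `cleaned`.
        del cleaned[best[name][1]]
        best[name] = (n_reviews, index)

print(cleaned)
# [['B', 5], ['A', 2], ['A', 3]]
```

The first deletion removes the right row, but the second one removes "C" instead of the older "A" entry, so a unique app is lost and a duplicate survives.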

I’m not sure how there were more rows than were started with (I didn’t see that?), but I think this is a clue to what is going on.

Hey April,

Thank you so much for commenting! I definitely agree with you, and that makes sense to me as to why this may not be working. If that’s the case, I should still be able to refer back to the original data set using that index, right? Is there a way to remove a specific item from the cleaned list without using a direct index?

For instance, I could do cleaned_list.remove(full_list[index]), which should, in turn, remove it from the cleaned list. I believe the remove function on a list removes the first occurrence of that item. I’m curious whether that will work if the item is a list itself rather than something like an int. And if so, would a row have to match on every field to be considered an occurrence?

I’m not sure, but you can play with it and see. I think I’m having trouble visualizing it the way you’ve described. :thinking:
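
A quick experiment along those lines, using made-up two-column rows: list.remove() compares with ==, and two lists are equal only when every element matches, so it removes the first row whose fields all match.

```python
# Made-up rows; the first and third are identical in every field.
full_list = [["App A", "10"], ["App B", "20"], ["App A", "10"], ["App A", "30"]]
cleaned_list = list(full_list)

# Removes the FIRST row equal to full_list[2], which is the one at index 0.
cleaned_list.remove(full_list[2])

print(cleaned_list)
# [['App B', '20'], ['App A', '10'], ['App A', '30']]
```

Note that remove() raises a ValueError when no equal row is left in the list, and that it scans from the start each time, so it is slower than deleting by index on a large list.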

No problem! I’ll rework it later. In the meantime, I did it by having the tuple contain the full row of the app with the most reviews and then populating my list from that.

def GetNonDuplicateDataset(dataset=None, nameIndex = 0, reviewCountIndex = 3):
    """GetNonDuplicateDataset
    
    + Description
      -----------
      Takes the raw android app data set from the guided project and cleans
      duplicate rows based on the criteria described previously
    
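    + Arguments
      ----------
      dataset - The raw dataset as a list of lists where each row is a list
      containing details on a specific app.
      nameIndex - Column index of the app name in each row (default 0).
      reviewCountIndex - Column index of the review count in each row
      (default 3).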
    
    + Returns
      --------
      A tuple containing a dictionary with the app names with the most reviews
      and a cleaned version of the dataset from the arguments.
    
    """
    
    if dataset is not None:
        reviews_max = {}
        
        # Get a dictionary containing unique apps where each app is the app
        # from the data set with the maximum reviews
        
        for row in dataset:
            
            name = row[nameIndex].strip()
            n_reviews = float(row[reviewCountIndex])
            
            if(name not in reviews_max):
                reviews_max[name] = (n_reviews, row)
                
            elif reviews_max[name][0] < n_reviews:
                reviews_max[name] = (n_reviews, row)
        
        # Use the dictionary to clean the full data set
        
        android_clean = []
        
        for unique_app in reviews_max.values():
            android_clean.append(unique_app[1])
                
        return (reviews_max, android_clean)
    
    return (None, None)