Hello, I’m currently working on the first guided project, specifically the part where we remove duplicates. I’m trying to use the basic method where we build a dictionary of unique apps, while also cleaning the full app list at the same time (in the same loop).
In the code below, my intention is to:
- Create a temporary copy of the full data set
- Loop through the full data set as we are guided and perform the basic procedure
- When we add an item to the dictionary, instead of the value being the number of reviews, use a tuple to include the number of reviews, and the current index in the full data set.
- When we detect a duplicate and need to update the dictionary, get the index of that app's previous entry in the full data set from the tuple in the dictionary and delete that row from the temporary list. Then update the dictionary as normal, storing a tuple with the new number of reviews and the new index.
- Return the dictionary and cleaned data set (temporary list).
When I run this, I do see some rows deleted, but I do not get the same list size as in the instructions. Specifically, I get a length of 10486 out of a total data size of 10840, whereas the instructions say we should see a total size of 9659.
I’m curious if anyone has feedback on this and where I may be going wrong or if this approach may not work well for this scenario. Thank you!
def ClearDuplicates(dataset=None):
    if(dataset != None):
        reviews_max = {}
        clean_dataset = dataset
        index = 0
        for row in dataset:
            name = row[0].strip()
            n_reviews = float(row[3])
            if(name not in reviews_max):
                reviews_max[name] = (n_reviews, index)
            elif (name in reviews_max) and (reviews_max[name][0] < n_reviews):
                del clean_dataset[reviews_max[name][1]]
                reviews_max[name] = (n_reviews, index)
            index += 1
        return (reviews_max, clean_dataset)
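
A possible explanation, offered as a sketch rather than a confirmed diagnosis: `clean_dataset = dataset` binds a second name to the same list instead of copying it, each `del` shifts the positions of all later rows so the indices stored in the dictionary go stale, and a duplicate with fewer or equal reviews is never removed at all. The variant below only records indices during the loop and builds a new list afterwards. The names `clear_duplicates`, `name_col`, `reviews_col` and the sample rows are made up for illustration and are not from the guided project.

```python
def clear_duplicates(dataset, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> (max reviews seen so far, index of that row in dataset)
    for index, row in enumerate(dataset):
        name = row[name_col].strip()
        n_reviews = float(row[reviews_col])
        # Only record the index here; nothing is deleted while looping.
        if name not in best or n_reviews > best[name][0]:
            best[name] = (n_reviews, index)

    # Build a new list instead of mutating the input, so indices never shift.
    keep = sorted(index for _, index in best.values())
    clean_dataset = [dataset[i] for i in keep]
    reviews_max = {name: n for name, (n, _) in best.items()}
    return reviews_max, clean_dataset


# Tiny made-up sample (not the real data set) to exercise the function.
sample = [
    ["App A", "x", "x", "10"],
    ["App B", "x", "x", "5"],
    ["App A", "x", "x", "30"],
    ["App A", "x", "x", "20"],
    ["App B", "x", "x", "5"],
]
reviews_max, cleaned = clear_duplicates(sample)
print(len(cleaned))   # one row per unique app name
print(reviews_max)
```

Because the input list is left untouched, `len(dataset)` before and after the call can be compared directly with `len(cleaned)` to check the result against the size given in the instructions.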