5. Removing Duplicate Entries: Part Two

I'm walking through step 1 of part 5 of the first project and having some trouble.

  • Start by creating an empty dictionary named reviews_max.
  • Loop through the Google Play data set (make sure you don’t include the header row). For each iteration:
    • Assign the app name to a variable named name.
    • Convert the number of reviews to float. Assign it to a variable named n_reviews.
    • If name already exists as a key in the reviews_max dictionary and reviews_max[name] < n_reviews, update the number of reviews for that entry in the reviews_max dictionary.
    • If name is not in the reviews_max dictionary as a key, create a new entry in the dictionary where the key is the app name, and the value is the number of reviews. Make sure you don’t use an else clause here, otherwise the number of reviews will be incorrectly updated whenever reviews_max[name] < n_reviews evaluates to False.
  • Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary; remember that the expected length is 9,659 entries.
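
To see why the instructions warn against an else clause, here is a minimal sketch on a made-up three-row data set. The rows are hypothetical, and the review count is assumed to sit at index 1 here rather than at its real position in the Google Play data:

```python
# Hypothetical toy rows: [app name, number of reviews as a string].
rows = [
    ["App A", "100"],
    ["App A", "50"],   # duplicate entry with FEWER reviews
    ["App B", "7"],
]

# Recommended shape: two independent if statements.
reviews_max = {}
for app in rows:
    name = app[0]
    n_reviews = float(app[1])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

# Buggy shape: the else branch also runs when the name is already stored
# with a LARGER count, so the smaller count overwrites it.
reviews_else = {}
for app in rows:
    name = app[0]
    n_reviews = float(app[1])
    if name in reviews_else and reviews_else[name] < n_reviews:
        reviews_else[name] = n_reviews
    else:
        reviews_else[name] = n_reviews

print(reviews_max)   # {'App A': 100.0, 'App B': 7.0}
print(reviews_else)  # {'App A': 50.0, 'App B': 7.0}
```

In the second loop, the duplicate "App A" row with 50 reviews falls into the else branch and replaces the stored 100.0, which is exactly the incorrect update the instructions describe.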

Here’s what I have so far:
reviews_max = {}
for app in android[1:]:
    name = app[0]
    n_reviews = float(len(unique_apps))
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        n_reviews +=


When converting the number of reviews to a float, what do I put inside float()?

The number of reviews comes from another index in the row. The 'Reviews' column is where you’ll find the number of reviews for the app (here’s documentation in case you’re curious). You can see which index it’s at if you go back up to the cell where you inspected the Google data using the explore_data() function.
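
If you'd rather look the index up than count columns by eye, you can ask the header row for it. This is a minimal sketch, assuming the first row of the data set is the header and that it contains a 'Reviews' label. The two-row android list is a made-up stand-in, not the real data:

```python
# Made-up stand-in for the data set: header row first, then one app row.
android = [
    ["App", "Category", "Rating", "Reviews"],
    ["Example App", "TOOLS", "4.1", "159"],
]

header = android[0]
reviews_index = header.index("Reviews")  # position of the 'Reviews' column

first_app = android[1]
n_reviews = float(first_app[reviews_index])  # convert the string to a float
print(reviews_index, n_reviews)  # 3 159.0
```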

Hope that helps!

Thanks! That helps. Now I’m getting an error for the code below:

reviews_max = {}
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    if (name not in reviews_max):
        reviews_max[name] = n_reviews
print('Expected length' + len(android) - 1181)
print('Actual length' + len(reviews_max))

ValueError                                Traceback (most recent call last)
<ipython-input-17-6bfc25ac87c4> in <module>()
      2 for app in android[1:]:
      3     name = app[0]
----> 4     n_reviews = float(app[3])
      5     if (name in reviews_max) and (reviews_max[name] < n_reviews):
      6         reviews_max[name] = n_reviews

ValueError: could not convert string to float: '3.0M'

One of the rows in the android dataset has an error that has all the data shifted over so that it has '3.0M' in the reviews column. This post goes into more detail about it:

Thanks. Yes, I believe I already ran del android[10472]. I don’t want to run it again unless I have to.

I think it was actually index 10473. Since you’ve deleted a row, you can have a look at what android[10472] is now. Or check the row before or after. It’s the app named 'Life Made WI-Fi Touchscreen Photo Frame'.

Okay. Should I run something like this to check the value at the index?

for index, value in enumerate(android):
    index = 10472
    print(index, value)

You can just do print(android[10472]) and run the cell. If it turns out to be the right one, you can run the del command again.
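
If you don't want to depend on remembering whether the del already ran, another option is to search for the bad row by its symptom: a reviews value that float() can't convert. This is a rough sketch, assuming the reviews column is at index 3 as in the code above. The tiny android list and its "Shifted App" row are invented for illustration:

```python
# Invented stand-in data: header, one good row, one row with shifted columns.
android = [
    ["App", "Category", "Rating", "Reviews"],
    ["Good App", "TOOLS", "4.1", "159"],
    ["Shifted App", "1.9", "19", "3.0M"],
]

# Collect the index of every row whose reviews value is not a number.
bad_indexes = []
for index, row in enumerate(android[1:], start=1):  # start=1 skips the header
    try:
        float(row[3])
    except ValueError:
        bad_indexes.append(index)

print(bad_indexes)  # [2]

# Delete from the end so the earlier indexes stay valid.
for index in reversed(bad_indexes):
    del android[index]

print(len(android))  # 2
```

Because the search finds nothing once the row is gone, re-running this cell shouldn't delete a second, valid row.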

Update: never mind, I found it. I had + signs instead of commas in the print() calls below.

Cool - seemed to work. Now I’m getting this:
----> 9 print('Expected length' + len(android) - 1181)
TypeError: Can't convert 'int' object to str implicitly

reviews_max = {}
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    if (name not in reviews_max):
        reviews_max[name] = n_reviews
print('Expected length' + len(android) - 1181)
print('Actual length' + len(reviews_max))
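
The TypeError comes from using + to join a string and an integer. Here is a small sketch of two ways around it. The values of expected and actual are placeholders so the snippet runs on its own; in the notebook they would come from len(android) - 1181 and len(reviews_max):

```python
expected = 12 - 2   # placeholder for len(android) - 1181
actual = 10         # placeholder for len(reviews_max)

# Option 1: pass separate arguments. print() converts each one to text
# and joins them with a space.
print('Expected length:', expected)

# Option 2: convert the number yourself before concatenating.
message = 'Actual length: ' + str(actual)
print(message)
```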

Huzzah! Glad it’s working now and you got the last error worked out. :tada:

My expected and actual lengths differ by 1:

Expected length: 9659
Actual length: 9658

You deleted that extra row from before. :wink: I wouldn’t worry about it, it won’t affect the final results much. As you’re working through the project, I would only be concerned if your results are wildly different. With different projects you’ll end up making different choices during data cleaning and analysis, and you won’t always end up with the exact results from the solution notebook.

If it really bugs you, you can always restart the kernel and redo the deletion of the row in the dataset. :grin:

Ok then. Good to know. I suppose I can suspend OCD for the time being.

Please, I don't seem to understand the deletion of duplicates part. In the first part, I can't wrap my head around that if block and why elif was used, and in the second part, I still can't follow what was done. I need the simplest of explanations.

I think the part giving me issues is reviews_max[name] < n_reviews.
Say the key (name) of the app is Facebook. How can this string be compared to a float?

Hi @pius.emmanuel, welcome to the community!

The short answer to your question is that while the key is the name of the app, reviews_max[name] refers to the number of reviews stored for that app in our dictionary. name is a string, but reviews_max[name] is a float.
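
Here's a tiny sketch of that difference. The review counts are made up for illustration:

```python
# One entry: the key is the app name, the value is a (made-up) review count.
reviews_max = {"Facebook": 100.0}

name = "Facebook"
n_reviews = 250.0  # made-up count from a duplicate row

print(type(name))               # <class 'str'>
print(type(reviews_max[name]))  # <class 'float'>

# The comparison is float < float; the string is only used to look up the value.
print(reviews_max[name] < n_reviews)  # True
```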

I broke down how we use a loop to create a dictionary in this post, so feel free to read through that and see if it helps you understand a bit better. If you still have questions or need clarification, let me know!

Thanks so much for your explanation. The link to the post was helpful too.