Number of unique rows not matching

Screen Link: https://app.dataquest.io/m/350/guided-project%3A-profitable-app-profiles-for-the-app-store-and-google-play-markets/4/removing-duplicate-entries-part-one

My Code:

cleanset = googleset

for row in googleset[1:]:
    appname = row[0]
    ratingsnum = float(row[3])
    if ratingsnum != maxedreviews[appname]:
        cleanset.remove(row)

What I expected to happen: I started off with a new list, cleanset, with the exact same values as googleset (the Google Play apps, with duplicates still present). My value for maxedreviews was consistent with the solution. I expected every row (app) whose ratingsnum value did not match the maxedreviews number for that app to be deleted, leaving 9659 rows.

What actually happened: Instead, I got a total number of rows of 10055, an error which I don’t see that anyone else has posted. I’m not sure where I went wrong, as it looks like everything I’ve done so far is correct. After explore_data(cleanset), I got

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10055
Number of columns: 13

You haven’t shared the error you got.

Apart from the error, there is something important to know about Python lists. Look at the following code:


a = [1, 2, 3, 3, 4, 5, 6, 6, 7]
b = a
for item in b:
    if item%2 != 0:
        b.remove(item)
print(a)
print(b)

The above does something similar to your code. In the for loop, I check whether each number in the list is even. If it is odd, it is removed from b.

This is the output of the above:

[2, 3, 4, 6, 6]
[2, 3, 4, 6, 6]

Notice what’s happening -

  1. Even though we only removed items from b, a is missing the same items.
  2. A 3 is still present in both lists, even though it is odd.

The first point comes from how Python handles assignment. I won’t go into the details, but when you write b = a and a is a list, you are not creating a new copy of the list. Both b and a refer to the same list in memory, so any change you make through b also shows up in a.
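One way to check this for yourself, as a minimal sketch: the is operator reports whether two names refer to the same object, and list.copy() (or a[:]) makes a new, independent list. The name c is only used here for illustration.

```python
a = [1, 2, 3]
b = a           # no copy: b is just another name for the same list
c = a.copy()    # shallow copy: a new list holding the same elements

b.append(4)     # this change is visible through a as well
c.append(99)    # this change only affects c

print(b is a)   # True
print(c is a)   # False
print(a)        # [1, 2, 3, 4]
print(c)        # [1, 2, 3, 99]
```

In your code, cleanset = googleset behaves like b = a here, so removing rows from cleanset also removes them from googleset.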

That makes the second point important too, because b.remove(item) deletes elements from the very list the for loop is iterating over. When an element is removed, everything after it shifts one position to the left, but the loop still moves on to the next position, so the element that slid into the vacated slot is never checked. That’s why there is a 3 left over.
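To see the skipping directly, here is a small sketch that records which elements the loop actually looks at. The visited list is only added for illustration.

```python
b = [1, 2, 3, 3, 4, 5, 6, 6, 7]
visited = []

for item in b:
    visited.append(item)   # record every element the loop examines
    if item % 2 != 0:
        b.remove(item)     # later elements shift left by one position

print(visited)  # [1, 3, 4, 5, 6, 7]
print(b)        # [2, 3, 4, 6, 6]
```

The loop only examines six of the nine elements. The 2, one of the 3s, and one of the 6s are never checked, because each one moved into a position the loop had already passed.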

So, re-structure your code to avoid the above problems to begin with.
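As a rough sketch of two common ways to restructure the toy example (not the only options): either build a new list containing only the items you want to keep, or loop over one list while removing from a separate copy. The name evens is just for illustration.

```python
a = [1, 2, 3, 3, 4, 5, 6, 6, 7]

# Option 1: build a new list of the items to keep
evens = [item for item in a if item % 2 == 0]

# Option 2: iterate over the original, remove from an independent copy
b = a.copy()
for item in a:             # a itself is never modified
    if item % 2 != 0:
        b.remove(item)

print(a)      # [1, 2, 3, 3, 4, 5, 6, 6, 7]
print(evens)  # [2, 4, 6, 6]
print(b)      # [2, 4, 6, 6]
```

Option 1 is usually the simpler one to reason about, since nothing is ever deleted and the original list stays intact.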

I think several duplicate rows for the same app could share the same highest review count. If you only compare each row against ‘maxedreviews’, then every row that ties for that maximum passes the check, so none of them are removed from cleanset and the duplicates remain. Did that make any sense?
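A minimal sketch of one way to handle such ties, assuming you build cleanset as a new list and keep track of which app names you have already added. The tiny dataset below is made up for illustration (the app names and review counts are not from the real file), and already_added is a hypothetical helper name.

```python
# Made-up miniature dataset: [name, category, rating, reviews]
googleset = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App A', 'SOCIAL', '4.5', '250'],   # ties with the row above
    ['App B', 'TOOLS', '4.1', '40'],
]

# Highest review count seen for each app name
maxedreviews = {}
for row in googleset[1:]:
    appname = row[0]
    ratingsnum = float(row[3])
    if appname not in maxedreviews or ratingsnum > maxedreviews[appname]:
        maxedreviews[appname] = ratingsnum

# Keep one row per app: the first row that reaches the maximum
cleanset = []
already_added = set()
for row in googleset[1:]:
    appname = row[0]
    ratingsnum = float(row[3])
    if ratingsnum == maxedreviews[appname] and appname not in already_added:
        cleanset.append(row)
        already_added.add(appname)

print(len(cleanset))  # 2
```

Because cleanset is built by appending rather than by removing, this also avoids modifying googleset while looping over it. Note that cleanset here does not include the header row.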