Unable to calculate the duplicate app count correctly in the guided project for the Google Play Store data set

Screen Link: https://app.dataquest.io/m/350/guided-project%3A-profitable-app-profiles-for-the-app-store-and-google-play-markets/4/removing-duplicate-entries-part-one

duplicate_App_dict = {}
dup_App_Final = {}
for each_list in googlestore_list[1:]:
    name = each_list[0]
    if name in duplicate_App_dict:
        duplicate_App_dict[name] += 1
    else:
        duplicate_App_dict[name] = 1
#print(duplicate_App_dict)        
print(len(duplicate_App_dict))

for each_dict in duplicate_App_dict:
    value = duplicate_App_dict[each_dict]
    #print(each_dict)
    #print(value)
    if value > 1:
        dup_App_Final[each_dict] = value
print(len(dup_App_Final))

What I expected to happen: I expected the number of duplicate apps to be 1181.

What actually happened: I am getting a duplicate app count of only 798. I am not sure what I am missing. Please help me resolve this.

Hi, could anyone please take a look at the code to tell me where I went wrong?

Thanks in Advance.

Hi @venkat3056. I worked on your problem yesterday and didn’t get it, but today I finally understand where the discrepancy between 1181 and 798 comes from. Your code prints the length of a dictionary, while the solution code prints the length of a list. In short, len(dictionary) counts the number of keys in the dictionary, which here is the number of distinct app names that repeat, not the number of duplicate entries.

I’ll illustrate with this code example.

set1 = ['chicken', 'goat', 'hamster', 'giraffe', 'chicken', 'cow', 'goat', 'goat']
count_items = {}
for item in set1: 
    if item in count_items:
        count_items[item] += 1
    else:
        count_items[item] = 1
print('count_items dictionary:', count_items)
print('size of dictionary:', len(count_items))

Output:

count_items dictionary: {'chicken': 2, 'goat': 3, 'hamster': 1, 'giraffe': 1, 'cow': 1}
size of dictionary: 5

In the original list, there were 8 items, but only 5 unique items. Since it’s a small list, it’s easy to see that 3 of the items are duplicates (1 extra chicken and 2 extra goats).

So if we extend that to the 2nd piece of code, we get this result:

count_duplicates = {}
for each_thing in count_items:
    value = count_items[each_thing]
    if value > 1:
        count_duplicates[each_thing] = value 
print('count_duplicates dictionary:', count_duplicates)
print('size of duplicate dictionary:', len(count_duplicates))

Output:

count_duplicates dictionary: {'chicken': 2, 'goat': 3}
size of duplicate dictionary: 2

If we only look at the size of the dictionary, we would be led to believe that there are only 2 duplicates in the whole list. However, we know there were 3 “extra” items! This 2nd dictionary tells us how many distinct items were duplicated, but not how many extra copies exist in the list.
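To get the number of “extra” items out of the counts, one option is to add up `count - 1` for every item. This is only a sketch on the same toy list; the name `extra_copies` is mine and not from the guided project.

```python
set1 = ['chicken', 'goat', 'hamster', 'giraffe', 'chicken', 'cow', 'goat', 'goat']

# Build the same frequency dictionary as before.
count_items = {}
for item in set1:
    count_items[item] = count_items.get(item, 0) + 1

# Each item seen `count` times contributes `count - 1` extra copies.
extra_copies = sum(count - 1 for count in count_items.values() if count > 1)
print('extra copies:', extra_copies)  # extra copies: 3

# Equivalent check: total entries minus unique entries.
print('total minus unique:', len(set1) - len(count_items))  # total minus unique: 3
```

Both lines print 3 (1 extra chicken plus 2 extra goats), while the size of the duplicates dictionary is 2.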

So back to the Play Store example. Some of the apps are repeated more than 2 times (I think I saw one that was in there 6 times!). The dictionary counts all 6 entries as 1 duplicated app, while a list of duplicate entries would hold 5 of them. This is why you’re seeing the difference.
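Here is a rough sketch of list-based counting, assuming the solution works along these lines. The sample rows and the names `unique_apps` and `duplicate_apps` are made up for illustration and are not taken from the real data set.

```python
# Hypothetical rows standing in for googlestore_list[1:]; app name is in column 0.
sample_rows = [
    ['App A'], ['App B'], ['App A'], ['App C'],
    ['App A'], ['App B'], ['App A'], ['App A'], ['App A'],
]

unique_apps = []
duplicate_apps = []
for row in sample_rows:
    name = row[0]
    if name in unique_apps:
        # Every repeat after the first sighting is appended, so 'App A'
        # (6 rows) lands here 5 times.
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('duplicate entries:', len(duplicate_apps))         # duplicate entries: 6
print('apps that repeat:', len(set(duplicate_apps)))     # apps that repeat: 2
```

The first number is the kind of count a list gives you, and the second is what the length of your `dup_App_Final` dictionary measures.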

I hope that clears it up.

Thanks for the detailed explanation @april.g. That clarifies the doubt I had. I knew I was missing something!

Cheers!
Venkat