Project 1: Removing Duplicate Entries (4/14)

I am trying to find the number of duplicate apps in the Google Play Store file. I went in a roundabout manner with my logic and coding. I am not getting the right answer. Can you help me figure out where I made a mistake?

My code is as follows:

### Finding duplicate entries in Google Play Store data ###
# First finding if duplicates exist

list1 = []

for row in android:
    list1.append(row[0]) #pulled all the app names into a separate list

print("The total number of app names is:",len(list1))
    
apps_android = {}
apps_duplicate = [] #creating a list of all apps that have duplicate entries

for aname in list1:
    if aname in apps_android:
        apps_android[aname] += 1
    else:
        apps_android[aname] = 1
        
print("The total number of apps in apps_android is:",len(apps_android))
        
for each in apps_android:
    if apps_android[each] > 1:
        apps_duplicate.append(each) #for any app with more than one listing, name gets added to the list

print("The total number of duplicate apps are:",len(apps_duplicate))

I am facing an error with the numbers. The results I get are as follows:
The total number of app names is: 10841
The total number of apps in apps_android is: 9660
The total number of duplicate apps are: 798

I am unable to figure out where I am making an error. The number of entries in apps_android (9660) matches the number of unique apps in the actual solution. However, the unique apps plus the duplicate apps (9660 + 798 = 10458) do not add up to 10841.

Hey gagan,

Is it possible some apps are duplicated more than once (have 3 or more entries in the dataset)?

The code you posted only counts each duplicated app once. So you wouldn’t necessarily expect #(unique apps) + #(duplicate apps) = #(original list) since some duplicate apps may have contributed to 3 or more entries in the original list.

If you want to double check all the numbers add up, you could add up the counts for each duplicate app in apps_android for comparison.
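
A minimal sketch of that check, using a small made-up list rather than the real dataset (the names `names`, `counts`, `duplicate_names`, `rows_from_duplicates` and `rows_from_singles` are hypothetical, not from the project):

```python
# Hypothetical sample standing in for the app-name column of the dataset.
names = ["Maps", "Maps", "Chess", "Notes", "Notes", "Notes"]

# Frequency table: name -> number of rows with that name.
counts = {}
for name in names:
    counts[name] = counts.get(name, 0) + 1

# Names that appear more than once, and how many rows they occupy in total.
duplicate_names = [name for name, n in counts.items() if n > 1]
rows_from_duplicates = sum(counts[name] for name in duplicate_names)

# Names that appear exactly once contribute one row each.
rows_from_singles = len(counts) - len(duplicate_names)

# Every row belongs either to a duplicated name or to a single name.
print(rows_from_duplicates + rows_from_singles == len(names))
```

If this prints True on the real data, the counts are consistent and the mismatch comes only from apps that appear three or more times.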

You’re counting how many keys in your dictionary are duplicated; I guess you should look at how frequently they occur instead.

test_list = ['A','A','B','C','C','D','E','F','F','F']
print("Elements in list:",len(test_list))

[Out]: Elements in list: 10

counter_dict = {}
for element in test_list:
    if element in counter_dict:
        counter_dict[element] += 1
    else:
        counter_dict[element] = 1
print(counter_dict)
print("Unique elements:", len(counter_dict))    #dictionary keys are already unique

[Out]: {'A': 2, 'B': 1, 'C': 2, 'D': 1, 'E': 1, 'F': 3}
[Out]: Unique elements: 6

This is as far as your code gets:

duplicated = []
for element in counter_dict:
    if counter_dict[element] > 1:
        duplicated.append(element)
print(duplicated)
print("Duplicated elements:",len(duplicated))

[Out]: ['A', 'C', 'F']
[Out]: Duplicated elements: 3

And all you have to do now is add a counter:

counter_duplicated = 0
for key, value in counter_dict.items():
    if value > 1:
        counter_duplicated += value-1    #this makes sure you only count duplicates
print("Number of duplicated elements:", counter_duplicated)

[Out]: Number of duplicated elements: 4

Hope this is what you’re looking for!

I agree with Collin. Your code ignores the fact that an app can be duplicated more than once, and that is why the numbers do not match.
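
For what it's worth, the number of extra rows is just the total minus the unique count. With the numbers from the original post that would be 10841 - 9660 = 1181 extra rows, spread over the 798 apps that have duplicates. A small sketch of the same idea with `collections.Counter`, on a made-up list (the variable names here are hypothetical):

```python
from collections import Counter

# Hypothetical sample; in the project this would be the list of app names.
names = ["A", "A", "B", "C", "C", "D", "E", "F", "F", "F"]

counts = Counter(names)

# Extra rows = every occurrence beyond the first one of each name.
extra_rows = sum(n - 1 for n in counts.values())

# Same quantity, computed as total rows minus unique names.
assert extra_rows == len(names) - len(counts)
print("Extra (duplicate) rows:", extra_rows)
```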