Duplicates in Guided Project: Profitable App Profiles for the App Store and Google Play Markets

Screen Link: Learn data science with Python and R projects

My Code:

def find_duplicates(dataset, index_name):
    # Collect every value at position index_name that was already seen in an earlier row.
    duplicate_apps = []
    unique_apps = set()  # a set makes the "already seen?" check fast on large datasets

    for app in dataset:
        app_name = app[index_name]
        if app_name in unique_apps:
            duplicate_apps.append(app_name)
        else:
            unique_apps.add(app_name)

    if len(duplicate_apps) == 0:
        return "There are no duplicate apps in this dataset."
    else:
        print("Number of duplicate apps:", len(duplicate_apps))
        print("\n")
        print("Examples of duplicate apps:", duplicate_apps[:15])
        return duplicate_apps

What I expected to happen:
In the exercise “4. Removing Duplicate Entries: Part One”, I have to write a function that checks both datasets, applestore and googleplaystore, for duplicates.

With the code above, I got 1,181 duplicates in the googleplaystore dataset and 2 duplicates in the applestore dataset.

AppleStore

Number of duplicate apps: 2
Examples of duplicate apps: ['Mannequin Challenge', 'VR Roller Coaster']

GooglePlayStore:

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']

But the problem is:
The two duplicates in the applestore dataset aren’t duplicates.
There are two different apps with the same name.

So how do I know if the 1,181 duplicates from the googleplaystore dataset are all real duplicates? Maybe a few entries are just different apps with the same name?

So besides the name, what other indicator can I use to tell whether an entry is really a duplicate or just a different app?

Hi @1sp34k2r0b0ts ,

That depends on how you approach the problem.

You’re defining duplicates as apps with the same name, so that’s what your code is looking for.

However, when you get the results, you say that they are just different apps with the same name. But that’s exactly what your code is looking for. Don’t get me wrong, that’s a valid perspective. But if you see it this way, then you should count apps as duplicates only when they have the same ID number, or even the same value in every column.

If you tell your function to use index 0 to find duplicates, it will not return these apps as duplicates, because they have different ID numbers, as you can see below:

image

image
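To illustrate the stricter definition mentioned above, here is a minimal sketch that counts a row as a duplicate only when every field matches an earlier row. The name `find_identical_rows` and the sample rows are hypothetical and not part of the project; the sketch assumes each row is a list of strings, as you would get from `csv.reader`.

```python
def find_identical_rows(dataset):
    """Return rows that repeat an earlier row exactly, field for field."""
    seen = set()
    identical = []
    for row in dataset:
        key = tuple(row)  # lists are not hashable, so use a tuple as the set key
        if key in seen:
            identical.append(row)
        else:
            seen.add(key)
    return identical


# Hypothetical sample rows: [id, name, rating]
sample = [
    ["1", "Same Name", "4.5"],
    ["2", "Same Name", "3.0"],  # same name, different ID: a different app
    ["1", "Same Name", "4.5"],  # exact copy of the first row
]
identical_rows = find_identical_rows(sample)
print(identical_rows)
```

With this definition, the second row is not flagged even though it shares a name with the first, while the exact copy is.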

Hi @otavios.s ,
thank you very much for your answer.

I can take the ID from the App Store, but what about the Google Play Store? How exactly do I know it’s a duplicate and not another app with the same name?

I actually wanted to create a function that checks both the AppleStore dataset and the GooglePlayStore dataset for duplicates.

In the task, we look for the incorrect entries via the discussion platform and then remove them. But is there any way to automate it for both datasets, or do we always have to look at the datasets individually?

Best regards


It seems like we don’t have an ID column in the Google Play dataset. The way I see it, that means the name of the app works as the unique identifier, so if two entries have the same name, they’re duplicates.

In this dataset, the name is at index 0, just like the ID in the Apple dataset. Therefore, if you pass index 0 to your function, it will look for duplicated IDs in the Apple dataset and for duplicated names in the Google Play dataset.

You already have a function that checks both datasets.
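As a rough sketch of that idea, the same check can be written so that the caller picks which column (or combination of columns) identifies an app. The function name `find_duplicates_by_key`, the column layouts, and the sample rows below are all hypothetical, chosen only for illustration; the real datasets have more columns.

```python
def find_duplicates_by_key(dataset, key_indexes):
    """Return rows whose key (the values at key_indexes) appeared in an earlier row."""
    seen = set()
    duplicates = []
    for row in dataset:
        key = tuple(row[i] for i in key_indexes)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


# Hypothetical rows with the header already removed.
apple_rows = [  # [id, name]
    ["101", "Same Name"],
    ["102", "Same Name"],
]
google_rows = [  # [name, category, reviews]
    ["Some App", "BUSINESS", "100"],
    ["Some App", "BUSINESS", "120"],
    ["Other App", "TOOLS", "5"],
]

apple_dupes = find_duplicates_by_key(apple_rows, [0])        # key = ID
google_dupes = find_duplicates_by_key(google_rows, [0])      # key = name
google_strict = find_duplicates_by_key(google_rows, [0, 1])  # key = name + category

print(len(apple_dupes), len(google_dupes), len(google_strict))
```

Passing more than one index is one possible answer to the "what other indicator can I use" question: two rows are flagged only if they agree on all of the chosen columns. Which columns make sense is a judgment call about the data, not something the function can decide.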

Thank you very much :)
