Apple_Store duplicated data

Hello there,
I am performing the data cleaning in the first project of the course, and I found duplicate app names in the AppleStore dataset. From reading the discussion I know they are not actual duplicates, but I wanted to verify that myself.

I am now facing an obstacle: how do I find out in which rows of my dataset those 4 apps with duplicate names are located? Whenever I run a loop over the dataset it iterates over the columns, while I need to print the corresponding row so that I can read all the characteristics of that app. Does what I am after make sense?

Many thanks

You could try using pd.DataFrame.index to get the index of a specific slice (screenshot omitted).

Or, beforehand, fetch those specific rows (screenshot omitted).
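Since the screenshots may not load, here is a rough sketch of that pandas approach. It uses a tiny invented frame in place of the real AppleStore data (the column name `track_name` matches the dataset, but the values are made up):

```python
import pandas as pd

# Tiny stand-in for the AppleStore data (invented values).
ios = pd.DataFrame({
    "track_name": ["Facebook", "Mannequin Challenge", "Mannequin Challenge"],
    "price": [0.0, 0.0, 1.99],
})

# Boolean mask marking every row whose name occurs more than once.
mask = ios["track_name"].duplicated(keep=False)

# .index on the filtered slice gives the labels of those rows.
duplicate_indexes = ios[mask].index
print(list(duplicate_indexes))  # prints [1, 2]
```

With the row labels in hand you can inspect or drop those rows directly.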

Hello @greta.meroni, welcome to DQ community!

To find the duplicate app names in the AppleStore dataset, use the code below:

duplicate = []
unique = []

for app in ios:  # note: ios is the variable holding my AppleStore dataset
    name = app[1]
    if name not in unique:
        unique.append(name)
    else:
        duplicate.append(name)
        
print("Number of duplicate apps: ", len(duplicate))
print('\n')
print("Examples of duplicate apps: ", duplicate[:10])

Output:

Number of duplicate apps:  2

Examples of duplicate apps:  ['Mannequin Challenge', 'VR Roller Coaster']

To find out which rows in your dataset contain the apps with the duplicate name 'Mannequin Challenge':

for app in ios:  # note: ios is the variable holding my AppleStore dataset
    name = app[1]
    if name == "Mannequin Challenge":
        print(app)

Output:

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

You can do the same for 'VR Roller Coaster' by replacing 'Mannequin Challenge' in the code above with 'VR Roller Coaster'. You can then check through the rows and read all the characteristics of each app to verify that they are not actual duplicates.
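As a small variation (a sketch, reusing the `duplicate` list built by the earlier loop), you can collect the rows for every duplicated name in a single pass instead of editing the string each time. The rows below are invented stand-ins in the AppleStore layout:

```python
# Stand-in rows (id, track_name, size_bytes, ...); values invented.
ios = [
    ["1173990889", "Mannequin Challenge", "109705216"],
    ["444934666", "Facebook", "389879808"],
    ["1178454060", "Mannequin Challenge", "59572224"],
]
duplicate = ["Mannequin Challenge"]  # as produced by the earlier loop

# Group every row whose name is in the duplicate list, keyed by name.
duplicate_rows = {}
for app in ios:
    name = app[1]
    if name in duplicate:
        duplicate_rows.setdefault(name, []).append(app)

print(len(duplicate_rows["Mannequin Challenge"]))  # prints 2
```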

Let me know if this answers your questions.

Thank you for your reply! Also thank you @kakoori.

I have used the same code to find the duplicate app names (that is how I came to the realisation that there might be duplicate data).

I also printed the rows they belong to (with the extra information, like 'Size' for example).
Reading the dataset discussion and looking at the data, it seems they are not duplicates. However, I wondered: if I wanted to exclude them from the analysis, how do I get the index of their rows in the big dataset (which I believe you called ios)?

Every loop I tried reads column by column and not row by row.
For example, index 1 is the name, index 2 is the size, etc.
I wanted to read vertically, if that makes sense.

Thanks for your help both of you!


@greta.meroni,

To get the index of the rows with duplicate app name of 'Mannequin Challenge' in the big dataset:

for i, app in enumerate(ios):
    name = app[1]
    if name == "Mannequin Challenge":
        print("Row Index of: ", app, "is", i)

To get the column index and its corresponding value for each row with the duplicate name "Mannequin Challenge":

for app in ios:
    name = app[1]
    if name == "Mannequin Challenge":
        for column_index, value in enumerate(app):
            print(column_index,value)
        print("\n")
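If the header row of the dataset was kept in a separate variable (the guided project usually does this; the name `ios_header` here is my assumption), you can pair each value with its column name instead of a bare index:

```python
# Hypothetical header and one row; layout as in AppleStore.csv, values invented.
ios_header = ["id", "track_name", "size_bytes"]
app = ["1173990889", "Mannequin Challenge", "109705216"]

# zip pairs each column name with the value in the same position.
labelled = dict(zip(ios_header, app))
print(labelled["track_name"])  # prints Mannequin Challenge
```

This makes the printed output readable "vertically", as one labelled field per column.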

Let me know if this helps.

@greta.meroni

The .index attribute returns the indexes of the data it’s used on.

So, to exclude these data from your loop, one could change the iterator, e.g.

# Get all indexes
all_indexes = ios.index

# Get the indexes of some data (e.g. apps with duplicate names)

filtering_condition = ios["track_name"].isin(list_of_duplicate_names)

indexes_of_duplicates = ios[filtering_condition].index

# Get all indexes except for indexes of duplicates

no_duplicates_indexes = all_indexes.drop(indexes_of_duplicates)

The no_duplicates_indexes can be passed to a for loop like any other iterator.

no_duplicates_prices = []

for app_index in no_duplicates_indexes:
    # look up the price of the app at this index label
    # (.loc uses labels; .iloc with position 4 only coincides for a default RangeIndex)
    app_price = ios.loc[app_index, "price"]
    
    # append the app_price to the no_duplicates_prices list
    no_duplicates_prices.append(app_price)

In this case the column is indeed fixed ("price"), but the loop reads the data row-wise.
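For completeness, the loop above can usually be replaced by a single vectorized selection. This is a sketch with a tiny invented frame; in pandas, a negated `.isin` mask keeps every row whose name is not in the duplicate list:

```python
import pandas as pd

# Tiny invented stand-in for the AppleStore frame.
ios = pd.DataFrame({
    "track_name": ["Facebook", "Mannequin Challenge", "Instagram"],
    "price": [0.0, 0.0, 1.99],
})
list_of_duplicate_names = ["Mannequin Challenge"]

# Keep rows whose name is NOT in the duplicate list, then take the price column.
keep = ~ios["track_name"].isin(list_of_duplicate_names)
no_duplicates_prices = ios.loc[keep, "price"].tolist()
print(no_duplicates_prices)  # prints [0.0, 1.99]
```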

Could you describe the result you expect the program to produce?

Thank you!
I read and learnt about the enumerate built-in function, and it does the job for me.
@kakoori I think my current knowledge of Python is much more limited than yours and requires more studying and doing, but thank you for the quick answers! I hope I will get there sooner or later!


@greta.meroni,

That’s great! I wasn’t sure which particular index you needed from the dataset, which is why I included two different blocks of code in my reply.

Can you please let me know which one does the job for you, the first or the second block of code?

Also, kindly mark the reply as a solution if it helped solve your question.


The first one! enumerate worked!


Hello,
I noticed the same thing. I found two duplicate apps, and I believe the assertion that no duplicates exist was incorrect. This just shows we can never clean the data enough!

Here are the results from my project:

Number of unique apps in the store: 7195
Number of apps with at least one duplicate: 2
{'Mannequin Challenge': 1, 'VR Roller Coaster': 1}

Hi @fjpereny

That is a great start for the first guided project! But if you print the values connected with these names, you will find that the rest of the data connected with them differs. I agree that the names repeat, but I believe all the other values are different, including the app ID.

Probably that is why it was mentioned that there are no duplicates in the Apple data.


Thanks for your suggestion. You raised an excellent point: it might be that our criterion for identifying duplicates (mainly the app name) was incorrect.

This was an interesting first project and I learned a lot so far.

1 Like