Can anyone help explain the remove duplicate data code?

Hi all,

I don't really understand the code below, especially the last 2 lines. Why do we append app to android_clean and name to already_added? App is the name and name is the row… right?

android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Because you are looking for duplicated apps. You know there is more than one row with the same app name, so you use the already_added list to make sure you are not adding the same app to android_clean twice.

No, app is the row and name is the first item in the row, which in this case is the name of the app.

Notice that you only add the name of the app to already_added, while you add the whole row to android_clean. That's because android_clean is your new dataset from now on, and already_added is only used to check whether the app is already in the new dataset (remember, your data contains duplicates, so the for loop will find the same app more than once). That check is easier to do if the list contains only the names of the apps.
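To see both lists in action, here is a small self-contained sketch. The sample rows and the way reviews_max is built are assumptions made for illustration (the original post doesn't show them); only the final loop comes from the code in the question.

```python
# Hypothetical sample data: each row is [name, category, rating, n_reviews].
android = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Notes", "TOOLS", "4.1", "120"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
]

# Assumed preparation step: highest review count seen for each app name.
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

android_clean = []  # whole rows we keep (the new dataset)
already_added = []  # names only, used to detect repeats

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(android_clean)
print(already_added)
```

In this made-up data the second and fourth rows tie for the highest review count. Without the already_added check, both of them would end up in android_clean.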

Hi Otabios,

Thank you for replying! I am still struggling to understand the last 2 lines.

android_clean.append(app)
already_added.append(name)

In the loop "for app in android", will the computer read 'app' as a whole row? If yes, then I think it makes sense. Thank you again.

Yes, in "for app in android", app is a whole row, because android is a list of lists. Does it make sense to you now?
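If it helps, you can print what the loop variable holds on each pass. This is only a sketch with two made-up rows, and rows_seen and names_seen are hypothetical names used just for this illustration:

```python
# Hypothetical two-row dataset: a list of lists.
android = [
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Notes", "TOOLS", "4.1", "120"],
]

rows_seen = []
names_seen = []
for app in android:
    name = app[0]            # first item of the row: the app's name
    rows_seen.append(app)    # app is the whole row (itself a list)
    names_seen.append(name)
    print(type(app).__name__, app, "->", name)
```

Each pass shows that app is a list (the full row) and name is just the string at index 0.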

Hi Otavios! Thank you for your help!! I can understand it now …thanks!!!
