I'm not sure I understand what this is asking --> Duplicated and Indexing

Screen Link: https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/6/identifying-duplicates-values

I understand to do what they’ve asked, but I could use a more thorough explanation of two things:

  1. When I pass both column names to the df.duplicated() function, what does including both do? Does it exempt them from the duplicate check (e.g. the year), since we know it will be different?

  2. I didn’t understand the instruction to "Use dups to index combined". Seeing that it just needs to print made sense, but I’m not sure I understand “indexing” very well.

Can anyone explain further or point me to some helpful resources?

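I would recommend getting comfortable with the documentation for functions like this at this stage. The content in that Step links to the documentation for df.duplicated(), which should help clarify a few things for you. In short, passing both column names does not exempt them: a row is marked as a duplicate only when its values in both of those columns match an earlier row, and the other columns are ignored.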

As for indexing, it’s the same concept you saw in the Python introductory Missions.

Indexing is simply when you try to extract a certain value from a container.

In the introductory Python Missions, you used indexing to find items in a list (the container), for example -

a = [4, 5, 6]

You can access 6 from a using indexing - a[2]. The index for 6 is 2.

Similarly, you are trying to get values from your combined dataframe using dups as an index.

In pandas, indexing a dataframe offers more functionality than indexing a simple list. dups is a pandas Series which, for each row of combined, contains either True or False.

So, you can use dups as an index to access rows in combined corresponding to the rows in dups which are True.
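Here is a minimal sketch of that idea. The data and the column names (COUNTRY, YEAR, SCORE) are made up for illustration and are not the actual dataset from the Mission; only duplicated() and the boolean indexing are the point.

```python
import pandas as pd

# Hypothetical stand-in for `combined`; data and column names are assumptions.
combined = pd.DataFrame({
    "COUNTRY": ["NORWAY", "NORWAY", "NORWAY", "CHAD"],
    "YEAR": [2015, 2016, 2015, 2015],
    "SCORE": [7.5, 7.4, 7.5, 3.6],
})

# A row is flagged only if BOTH listed columns match an earlier row.
# Rows 0 and 1 share a country but not a year, so row 1 is not a duplicate.
dups = combined.duplicated(["COUNTRY", "YEAR"])
print(dups.tolist())  # [False, False, True, False]

# Boolean indexing: keep only the rows of `combined` where `dups` is True.
duplicate_rows = combined[dups]
print(duplicate_rows)
```

Only the third row comes back, because it repeats both the country and the year of the first row.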

It’s similar to the following example -

a = [True, False, True]
b = [2, 3, 4]

If I did b[a], I would get [2, 4].

(The above example is only for explanation and is not valid code: a plain Python list raises a TypeError when you index it with another list.)
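If you want to try the a and b example for real, here is one possible runnable version. It is only a sketch: the list comprehension is the plain-Python equivalent, and the second half converts the lists to pandas Series, which is my own addition to show the same filtering that combined[dups] does.

```python
import pandas as pd

a = [True, False, True]
b = [2, 3, 4]

# Plain lists: pair each value with its flag and keep the flagged ones.
kept = [value for value, flag in zip(b, a) if flag]
print(kept)  # [2, 4]

# pandas: a boolean Series can be used directly as an index.
mask = pd.Series(a)
values = pd.Series(b)
kept_series = values[mask]
print(kept_series.tolist())  # [2, 4]
```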
