Pandas Data Cleaning Practice Problem 7

Screen Link:

My Code:

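import pandas as pd

participants = pd.read_csv('participants.csv')

# Capitalize the first letter of every word in each name
participants['name'] = participants['name'].str.title()

# size_replacement_table is assumed to be the dict provided by the exercise,
# mapping each standard size to the values it should replace
def change_size(val):
    for key in size_replacement_table:
        if val in size_replacement_table[key]:
            return key
    # Falls through and returns None when val matches no entry

participants['t-shirt'] = participants['t-shirt'].apply(change_size)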

What I expected to happen:
I thought that the above code would yield the correct answer for the practice problem.

What actually happened:
Inspecting the variable editor suggests that the dataframe my code created and the one in the expected answer are identical. I would be very grateful if someone could point out the issue with my code. I have racked my brains but can’t seem to come up with a suitable answer. Thank you so much.


Sometimes, you have to be careful of edge cases that can come up when trying to clean data.

For example, you use str.title() to ensure that the first letters of the first and last names are capitalized. In theory, that seems fine.

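However, you might have a name like Parry Ben-aharon.

Your code will change it to Parry Ben-Aharon, because str.title() capitalizes the first letter after any non-letter character, including the hyphen. In the expected answer, that a stays lowercase.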

Those are the kinds of edge cases you have in this data. Some other examples to be careful with:

  • Markus O'growgane
  • Terri-jo Dobell
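To see the effect for yourself, here is a small sketch that runs Python's built-in str.title() on the three names above. It uses plain strings rather than the participants dataframe, on the understanding that pandas' .str.title() applies the same rule to each element; the variable names are only for illustration.

```python
# The three names exactly as written in this thread
names = ["Parry Ben-aharon", "Markus O'growgane", "Terri-jo Dobell"]

# str.title() starts a new "word" after every non-letter character,
# so the letter after a hyphen or an apostrophe is capitalized too
titled = [name.title() for name in names]

for before, after in zip(names, titled):
    print(f"{before!r} -> {after!r}")
```

Each name comes back with an extra capital letter (A, G and J respectively), which is why the result no longer matches the expected answer.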

Note that the above is based on Dataquest’s implementation, which, in my opinion, is incorrect.

Names like Markus O'growgane are normally written with that g capitalized. For example: https://en.wikipedia.org/wiki/Liam_O’Brien

In my opinion, the output you get is better than the one DQ expects right now. For the time being, though, that’s not what they expect here, so try to modify your implementation.
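One possible way to modify it, as a rough sketch rather than the official solution: split each name on whitespace and capitalize only the first letter of each part. The helper name capitalize_words is made up for this example, and I am assuming the expected answer only wants the first letter of the first and last name in uppercase, which matches the examples above.

```python
def capitalize_words(name: str) -> str:
    """Capitalize only the first letter of each whitespace-separated part.

    Hypothetical helper for illustration. str.capitalize() uppercases the
    first character and lowercases the rest, so letters after a hyphen or
    an apostrophe stay lowercase.
    """
    return " ".join(part.capitalize() for part in name.split())


print(capitalize_words("parry ben-aharon"))
print(capitalize_words("MARKUS O'GROWGANE"))
print(capitalize_words("terri-jo dobell"))
```

On the dataframe this could be applied with participants['name'].apply(capitalize_words). Note that name.split() also collapses repeated spaces into one, which may or may not be what you want for your data.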

@Sahil - I think this should be looked into, based on what’s considered the right way to write such names.


Thank you so much! I’ll try to keep edge cases in mind in my future analyses.
