Renaming Columns - Guided Project: Clean And Analyze Employee Exit Surveys

Screen Link: https://app.dataquest.io/m/348/guided-project%3A-clean-and-analyze-employee-exit-surveys/3/clean-column-names

Your Code:

                    'Gender. What is your Gender?':'gender', 'CurrentAge. Current Age': 'age',
                    'Employment Type. Employment Type':'employment_satus', 'Classification. Classification':'position',
                    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_survey_updated = tafe_survey_updated.rename(tafe_column_mapping, axis=1)

# Check tafe column names
tafe_survey_updated.columns```



What I expected to happen: 

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

What actually happened: 

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender.     What is your Gender?', 'CurrentAge.     Current Age',
       'Employment Type.     Employment Type',
       'Classification.     Classification', 'institute_service',
       'role_service'],
      dtype='object')

Other details:  Only some of the column names correctly changed. For example 'Gender. What is your Gender?' did not change to 'gender'. I am not sure why.
1 Like

Hi @chizjr. Just judging from your output, it looks like it added extra spaces (a tab space?) after the period for 'Gender. What is your Gender?', 'CurrentAge. Current Age', 'Employment Type. Employment Type', and 'Classification. Classification'. I’m not really sure how that would happen though. My first guess is an error with copying and pasting values into the cell for the replacement dictionary. You could try redoing this section by first copying and pasting the desired code into in a plain-text editor to make sure there aren’t any extra spaces, and then into Jupyter notebook.

If that doesn’t work, would you mind attaching a copy of your .lpynb file so we could have a closer look?

Appreciate the help. I realized it didn’t end up mattering for the end result. However, I’ll post my .lpynb file if it comes up again. Thanks again for the quick response!

Hi @april.g

I have downloaded the dataset file from the source linked in the guided project introduction section. The dataset itself has these extra spaces in these columns.

Based on another post, it seems like this dataset has been updated at the source itself, and the guided project workflow was based on datafile’s previous version.

Could you please explain what is that “replace(\s+” part exactly doing. My understanding is " \ " is the escape character, “s” is for space. What I don’t understand are

  • “+” - how it differentiates between “s”(1 space) and “sssss”(5 spaces).

  • why does this code doesn’t work for string example (what am I doing wrong here?), but works for dataframe.columns

      trial_str = "Gender.     What is your Gender?"
      print(len(trial_str), trial_str)
    
      trial_str_upd = trial_str.replace('\s+', ' ').strip()
      print(len(trial_str_upd), trial_str_upd)
    

Result for replace() method on trial_str:
image

print("before replace is applied:","\n",tafe_survey_updated.columns)
tafe_survey_updated.columns = tafe_survey_updated.columns.str.replace("\s+", " ").str.strip().str.lower()
print("after replace is applied:","\n",tafe_survey_updated.columns)

Result for replace() method on df.columns:

image

It might be that it was covered in the mission, however if you could still explain and also provide a better way to handle this kind of scenario.

Thanks & Regards

1 Like

Ah, thanks for clarifying about the dataset. I just downloaded and looked at it and I can see the extra spacing there. It looks like Dataquest did a bit of clean-up on the files we’re using on the platform, so you’re running into some issues that aren’t discussed in the mission. You can always download DQ’s version of the datasets to use so that your results more closely follow the lesson guide. On the other hand, working with the original dataset might be a fun challenge too! :slight_smile:

I didn’t immediately know the answer to your question so I tried to find some answers on the web. It doesn’t look like the Python str.replace() method accepts regular expressions, though I had trouble finding anything concrete that explicitly stated this. However, Series.str.replace() can accept regular expressions (documentation), so that would explain why it works in the dataframe and not in the string example. There’s likely some stuff under the hood for Pandas that makes it possible to use regex.

The + character means that there is one or more occurances, so it doesn’t distinguish between there being 1 space versus 5 spaces – all of them end up being replaced. The reason you still have spaces in your result is because you’re replacing the “one or more spaces grouped together” with a single space:

tafe_survey_updated.columns.str.replace("\s+", " ")

If you had put tafe_survey_updated.columns.str.replace("\s+", "") (no space between the quotation marks), then you would have ended up with the words smooshed together!

I’m not sure about best practices yet (I’m still learning too!). Because the headings on these columns were really long and had weird spacing issues, I would rather use the Dataframe.rename() method with a dictionary and get all the columns standardized at once. (They briefly talked about it on this screen of a previous mission.) I know of another guided project that comes up in a couple courses where you have to do some extensive cleaning of column headings like this one.

I hope that helps.

2 Likes

Hi @april.g

Thank you . That was helpful. :smile:

The space is intentional, to still have a space between the different words.

Regarding the rename function and using a dictionary map to replace column names, that step is also included in the guided project in addition to other replacement methods.

I guess I am cool with this diversified learning, coz for real-world work, it’s better to know of various scenarios and methods to solve them rather than a streamlined way of doing things.

Hence I asked for another way from you, assuming you may solve it in a different manner.

I am yet not at a basic level for regex concepts (it’s completely new for me - learning at my own pace).
So my last resort was to go “.csv” file and update the column names there itself. :stuck_out_tongue_winking_eye:

2 Likes