Dropping columns

Screen Link:

In this screen, we carried out some data cleaning by dropping unwanted columns from the datasets. I have been struggling to understand why we chose those specific columns: 28-48 for dete_survey and 17-65 for tafe_survey. Kindly help me figure out on what basis we chose them.

Thank you.
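
For anyone following along, here is a minimal sketch of what dropping "columns 28-48" by position looks like in pandas. The DataFrame below is a synthetic stand-in with placeholder column names (`col_0`, `col_1`, ...), not the real survey data, and `dete_survey_updated` is just an illustrative name.

```python
import pandas as pd

# Synthetic stand-in for dete_survey: one row, placeholder column names.
# The real column names are not reproduced here.
n_cols = 56
dete_survey = pd.DataFrame(
    [list(range(n_cols))],
    columns=[f"col_{i}" for i in range(n_cols)],
)

# Positions 28 through 48 inclusive; the slice end (49) is exclusive.
to_drop = dete_survey.columns[28:49]

# drop() returns a new DataFrame and leaves the original untouched.
dete_survey_updated = dete_survey.drop(columns=to_drop)

print(len(to_drop))               # how many columns were removed
print(dete_survey_updated.shape)  # shape after the drop
```

Printing `to_drop` before dropping is a quick way to see exactly which columns a positional range covers.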

One could write an endless stream of posts about column selection in this project. Here's my opinion:

  • Someone had to make a choice, and they made one that kept this project from expanding into a very BIG notebook.
  • You could drop fewer columns (and should try it if you have time). It can lead to some interesting conclusions, but it will take time.

Thank you, @adam.kubalica.

I will try experimenting with which columns to drop.

I felt the exact same way. My initial observations were that these might be merged. But when prompted to just drop them outright I was also confused as to why. Columns like Professional Development, Opportunities for promotion, Staff morale, Workplace issue, Physical environment ALL seem like very relevant reasons to move on from a job.

However, the DETE data for these columns appear to be specific abbreviations. I’m having a hard time tracking down what those abbreviations are code for. So I guess that’s why we’re dropping them.
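
One way to get a feel for what the abbreviations might mean is to count the distinct codes in each column. This is only a sketch: the sample data and the codes ("SA", "A", "N", "D", "SD") are made up for illustration and may not match the actual DETE values.

```python
import pandas as pd

# Hypothetical sample; column names come from the thread,
# but the codes themselves are assumptions.
sample = pd.DataFrame({
    "Staff morale": ["A", "SA", "N", None, "A"],
    "Workplace issue": ["D", "A", "A", "SD", None],
})

# Count each distinct code per column, keeping missing values in the tally.
code_counts = {
    col: sample[col].value_counts(dropna=False) for col in sample.columns
}

for col, counts in code_counts.items():
    print(col)
    print(counts)
```

If the same small set of codes shows up in every column, that hints at a shared rating scale rather than free-form answers.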

The dropped TAFE columns are a whole bunch of agree/disagree questions. These could be mapped to values of 1-5 if we thought the questions would be helpful. But maybe it's just too much to deal with for a beginner? Some of the other columns are straight up Y/N and some are just bools. I'm not really sure.
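
A rough sketch of the 1-5 idea, assuming the answers are stored as text labels. The label strings and the `likert_scale` mapping are my assumptions, not something taken from the TAFE data.

```python
import pandas as pd

# Assumed response labels and scores; adjust to whatever the data really holds.
likert_scale = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

# Made-up answers, including one label outside the scale and one missing value.
responses = pd.Series(
    ["Agree", "Strongly Agree", "Neutral", "Not Applicable", None]
)

# Series.map looks each value up in the dict; anything not in it becomes NaN.
scores = responses.map(likert_scale)

print(scores)
print(scores.mean())  # NaN entries are skipped by default
```

Leaving unmapped answers as NaN keeps them out of averages instead of forcing them onto the scale.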

I’m just gonna trust DQ on this one I guess. I don’t really like it though.

@capncrockett I guess DQ dropped the columns with the specific abbreviations because they could not find a meaning for them. But by the same argument, we could have decided not to drop the columns in TAFE. It is still a little confusing to me, so I guess we will just trust DQ for now.