In this screen, we were carrying out some data cleaning by dropping unwanted columns from the datasets. I have been struggling to understand why we chose those specific columns: 28-48 for dete_survey and 17-65 for tafe_survey. Kindly help me figure out on what basis we chose them.

One could write an endless stream of posts about column selection in this project. Here’s my opinion:

  • Someone had to make a choice, and they made one that kept this project from expanding into a very BIG notebook.
  • You could drop fewer columns (and should try it if you have time for it). It can lead to some interesting conclusions, but it will take time.
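
If you want to experiment, here is a rough sketch of both options on a toy DataFrame. The frame, its 56 placeholder columns, and the `keep` list are all made up for illustration; only the 28-48 range comes from the project. Note that a positional slice excludes its end, so columns 28 through 48 are `columns[28:49]`.

```python
import pandas as pd

# Toy stand-in for dete_survey: 56 placeholder columns, 3 rows (not the real data).
toy = pd.DataFrame(
    [[0] * 56 for _ in range(3)],
    columns=[f"col_{i}" for i in range(56)],
)

# Option 1: drop columns 28 through 48 by position (slice end is exclusive, hence 49).
dropped_all = toy.drop(columns=toy.columns[28:49])

# Option 2: drop fewer columns by keeping a few from that range.
keep = ["col_30", "col_35"]  # hypothetical picks, choose your own
to_drop = [c for c in toy.columns[28:49] if c not in keep]
dropped_fewer = toy.drop(columns=to_drop)

print(dropped_all.shape)    # (3, 35)
print(dropped_fewer.shape)  # (3, 37)
```

The same pattern should apply to tafe_survey with `columns[17:66]` for columns 17-65.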
Thank you @adam.kubalica,

I will try and experiment with column dropping.

I felt the exact same way. My initial observations were that these might be merged. But when prompted to just drop them outright I was also confused as to why. Columns like Professional Development, Opportunities for promotion, Staff morale, Workplace issue, Physical environment ALL seem like very relevant reasons to move on from a job.

However, the DETE data in these columns appears to be a set of specific abbreviations, and I’m having a hard time tracking down what those abbreviations stand for. So I guess that’s why we’re dropping them.

The TAFE dropped columns are a whole bunch of agree/disagree questions. These could have a value of 1-5 if we thought the questions would be helpful. But maybe it’s just too much to deal with for a beginner? Some of the other columns are straight up Y/N and some are just bools. I’m not really sure.
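
For anyone who wants to try the 1-5 idea, here is a minimal sketch on made-up data. The answer labels in `likert` and the column names `q1`/`q2` are assumptions, so check the actual values in your copy of the TAFE data (for example with `value_counts()`) before relying on them.

```python
import pandas as pd

# Assumed answer labels; verify against the real TAFE columns first.
likert = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

# Toy stand-in for a couple of agree/disagree columns (not the real data).
toy = pd.DataFrame({
    "q1": ["Agree", "Strongly Disagree", "Neutral", None],
    "q2": ["Strongly Agree", "Disagree", "Not Applicable", "Agree"],
})

# Map every column through the dictionary; anything not in it becomes NaN.
scored = toy.apply(lambda col: col.map(likert))

print(scored)
```

Missing answers and labels outside the dictionary (like "Not Applicable" here) come out as NaN, so they would still need a decision before any averaging.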

I’m just gonna trust DQ on this one I guess. I don’t really like it though.

@capncrockett I guess DQ dropped the columns with the abbreviations because they could not find a meaning for them. But by the same argument, we could have decided not to drop the TAFE columns. So it is still a little confusing for me, and I guess we will just trust DQ for now.