After completing the guided project on Dataquest, I set up a Jupyter Notebook on my local machine and started from scratch, so the workflow differs a bit from the DQ instructions. I would very much appreciate your feedback on what I could do better in my next project (aside from viz formatting: I know I did the bare minimum here!).
It would also be great to get the community’s feedback on the following:
What is your best-practice approach to updating dataframes in Jupyter notebooks? I struggled a bit to decide between:
Working “inplace” - meaning reassigning the same variable after each transformation. Benefit: you only have a handful of variables to keep track of. Drawback: rerunning a cell may raise errors (e.g. because a column’s datatype has already changed or the column was already dropped), so you need to rerun several cells above.
Making several updated copies like “dete_survey_update1”, “dete_survey_updated3” etc. whenever you transform or drop columns. Benefit / drawback: pretty much the opposite of working “inplace” (see above).
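The trade-off between the two approaches could be sketched like this (the frames and column names below are toy stand-ins, not the real survey data):

```python
import pandas as pd

# Toy frame standing in for the survey data (hypothetical columns)
dete_survey = pd.DataFrame(
    {"id": [1, 2], "separation_type": ["Resignation", "Retirement"]}
)

# Approach 1: work "inplace" by reassigning the same variable.
# Rerunning this cell a second time raises a KeyError,
# because the column is already gone.
dete_survey = dete_survey.drop(columns=["separation_type"])

# Approach 2: give each transformation step its own name,
# so every cell can be rerun safely.
tafe_survey = pd.DataFrame(
    {"id": [1, 2], "separation_type": ["Resignation", "Retirement"]}
)
tafe_survey_updated = tafe_survey.drop(columns=["separation_type"])
# tafe_survey itself stays intact
```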
I went with the second approach, but felt like I had to keep track of a lot of variables, maybe because I introduced more of them than was really necessary. I would really like some specific feedback on this part of the notebook and to hear your opinions.
Thanks a lot!
dataquest_data_cleaning_TAL freestyle_2020-04-26.ipynb (758.2 KB)
Hi, @tim1albers! I really liked your project; you asked some interesting questions (with surprising answers), like below-average dissatisfaction levels for involuntary exits.
Just a few things I noticed:
- Why did you decide not to combine all info from DETE and TAFE surveys and to work separately with both institutions?
- It would be great if you provided some insights for your first plots (like “Number of exits by length of tenure at institute” )
- Some comments for the last code blocks would make them easier to read :)
Regarding .copy(), I would say it depends on the case. For dropping or renaming columns, I would just reassign the result to the original variable (since you probably won’t use the dropped columns, and renaming is just renaming). The same applies to sorting columns into a desired order. I would use inplace to reset the index, for example.
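A minimal sketch of that pattern, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"SEPARATION TYPE": ["Retirement", "Resignation"]})

# Renaming: just reassign the result to the original variable
df = df.rename(columns={"SEPARATION TYPE": "separation_type"})

# Resetting the index after a sort is a case where
# inplace=True reads naturally
df = df.sort_values("separation_type")
df.reset_index(drop=True, inplace=True)
```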
Personally, in my project I used .copy() only when the SettingWithCopyWarning arose (like when I was working with only a selection of columns from a data set). I actually used it just once, if I’m not mistaken.
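For example, selecting a subset of rows and then assigning into it is the classic way to trigger that warning; an explicit .copy() avoids it. A sketch with toy data and hypothetical names:

```python
import pandas as pd

df = pd.DataFrame(
    {"institute": ["DETE", "TAFE", "DETE"],
     "dissatisfied": [True, False, True]}
)

# Without .copy(), this selection may be a view of df, and the
# assignment below would raise SettingWithCopyWarning.
dete_only = df[df["institute"] == "DETE"].copy()
dete_only["dissatisfied"] = False  # modifies only the copy

# The original frame is untouched.
```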
It may actually be quite difficult to keep track of all the copied data sets.
Many thanks for your feedback!
Re 1.: I only did a joint analysis of both datasets when investigating subgroups like retirees, because I thought it was generally very insightful to see both institutes side by side. This way one sees that there were similar trends in both institutes.
Re 2.+3.: I actually thought I went overboard in the earlier parts of the analysis. Good to hear that you would have liked some more text!
Good to hear your perspective on .copy(). I will experiment with it in my upcoming project.
Thanks again and stay healthy, all the best to Italy!
Yes, it was interesting to find out whether both institutions have similar results or whether there is some strong difference between them!
For me, it’s better to comment on almost everything (except obvious things), because it will be very helpful when you eventually review your project.
All the best to Germany!