I can’t help but challenge points that I don’t quite understand, since I feel it helps me learn better. I’m looking for some help in understanding the thought process in an early part of the solution notebook that doesn’t seem right to me at the moment. I’m probably misinterpreting something key here, so I’d be grateful for clarification from the teaching team or learning assistants.
Cells 3 and 4 in the solution drop null rows from the JobRoleInterest column, and then produce a bar chart of interest in web/mobile development vs other interests. Based on this, the notebook concludes that 86% of people in the survey are interested in web or mobile development, which suggests to me that 86% of people in the entire data set are interested in web/mobile development. The validity of the survey data as a representative sample seems to hang on this conclusion. However, when I look at the null rows that are dropped, they make up just over 60% of all the rows in the data set. When this much of the survey data is removed, is it really appropriate to word the conclusion in this way?
My approach has been to count the number of respondents interested in web or mobile development against the entire data set. This gives a significantly smaller percentage of people explicitly interested in web/mobile development, but I’m then suggesting that the subset of the full data set containing only those interested in web/mobile would make a suitable sample of the population of interest for exploring questions about new coders’ locations and their willingness to spend money on learning. Could this also be a good approach to take?
Nope. The author does not mean that 86% of all survey respondents expressed interest in web or mobile development.
This percentage was calculated after removing the rows with no response in the given column. It only describes the part of the survey data that answered something for this column.
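To make the difference between the two denominators concrete, here is a minimal sketch. The toy data and the `Web Developer|Mobile Developer` pattern are assumptions for illustration, not the actual survey file or necessarily the exact code in the solution notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey; None marks "no answer".
survey = pd.DataFrame({
    "JobRoleInterest": [
        "Full-Stack Web Developer",
        None,
        "Mobile Developer, Data Scientist",
        None,
        "Data Scientist",
        None,
        "Front-End Web Developer",
        None,
        None,
        None,
    ]
})

# Keep only the rows that answered the question.
answered = survey["JobRoleInterest"].dropna()

# Assumed matching rule: any answer mentioning web or mobile development.
is_web_mobile = answered.str.contains("Web Developer|Mobile Developer")

# Denominator = rows that answered (the solution's style of percentage).
share_of_respondents = is_web_mobile.mean()

# Denominator = every row, including the non-responses.
share_of_all_rows = is_web_mobile.sum() / len(survey)

print(f"Among those who answered: {share_of_respondents:.0%}")
print(f"Among all rows:           {share_of_all_rows:.0%}")
```

Both numbers are correct; they just answer different questions, so the wording of the conclusion should say which denominator was used.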
The conclusion may still be considered valid because, even after removing 60% of the rows (agreed, that’s huge!), we are still left with 6000+ records, which in this case has been treated as a sizeable sample.
(I may have answered a similar post about this part, but I can’t find it now.) I would like to highlight these questions first:
- What if all of the students who didn’t answer this question want to take only web and/or mobile dev. courses? Then counting against the full data set reports a much lower % of interested students than the true figure!
- What if none of the students who didn’t answer this question want to take web and/or mobile dev. courses at all? Then your approach would still give the correct answer.
- What if only some of the students who didn’t answer this question want to take web and/or mobile dev. courses? Then the % of interested students would still differ from the true figure!
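The three scenarios above can be turned into simple bounds on the true share. This is only a sketch: the function name is made up, and the inputs (86% interested among those who answered, about 60% of rows missing) are the approximate figures quoted in this thread.

```python
def interest_bounds(share_among_answered, missing_fraction):
    """Bound the share of interested people across ALL rows.

    Lower bound: nobody who skipped the question is interested.
    Upper bound: everybody who skipped the question is interested.
    """
    answered_fraction = 1.0 - missing_fraction
    known_interested = share_among_answered * answered_fraction
    lower = known_interested
    upper = known_interested + missing_fraction
    return lower, upper


# Approximate figures from the thread: 86% among respondents, ~60% missing.
lower, upper = interest_bounds(0.86, 0.60)
print(f"True share lies between {lower:.1%} and {upper:.1%}")
```

Counting against the entire data set gives the lower bound, which is exact only in the second scenario. The 86% figure sits between the bounds and is exact only if non-respondents are interested at the same rate as respondents.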
Thanks, that’s a really helpful response!
My intention is not to stop you from taking your approach. That’s the creative/analytical liberty that DQ allows… nope… encourages! But it has to be based on strong reasoning.
For example, this column is about personal choice, and there are multiple options to choose from, so we do not have enough information to guess someone’s response and substitute that guess for the missing value.
But let’s say we have 5 years of data for a retail store, and we lack a few entries for the discount % applied to some products in certain months. For example:
- 2nd year, Easter week: a few data points are missing
- 4th year, Christmas month: a few data points are missing
- 5th year, July: a few data points are missing
For these cases, we can fill in the missing discount % by interpolation or substitution (mean/median) based on all five years of data. We can reasonably argue that the discount % stays more or less the same from year to year.
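As a rough sketch of the substitution idea, the snippet below fills each missing discount with the mean for the same month across the other years. The column names and numbers are invented for illustration; swapping `"mean"` for `"median"` gives the median variant.

```python
import numpy as np
import pandas as pd

# Hypothetical retail data: one product, two months, five years.
# np.nan marks the missing discount entries.
sales = pd.DataFrame({
    "year": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "month": ["Jul"] * 5 + ["Dec"] * 5,
    "discount_pct": [10.0, 10.0, 12.0, 12.0, np.nan,
                     20.0, 22.0, 24.0, np.nan, 22.0],
})

# Mean discount for the same month across all years (NaN values are skipped),
# broadcast back to every row of that month.
same_month_mean = sales.groupby("month")["discount_pct"].transform("mean")

# Fill only the gaps; observed values are left untouched.
sales["discount_filled"] = sales["discount_pct"].fillna(same_month_mean)

print(sales)
```

This only makes sense because of the stated assumption that discounts are stable from year to year. No comparable assumption is available for the survey column, which is why the same trick does not carry over to it.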