I used two different approaches in the analysis and I would like to hear what people think:
- For the extreme monthly spending values, I replaced values above the 95th percentile with the 95th percentile value. My reasoning was that the extreme values may be incorrect somehow, but they are still an indication that the person spent a lot of money on their learning. By capping them at the 95th percentile, we still retain the signal that they are spending “a lot” of money but remove the possibility that they entered an unrealistic monthly spending amount. Since the course costs $59, my thought was that, for the purpose of our analysis, it doesn’t matter if someone pays $800, $1500, or $2500; it’s sufficient to record that they paid something much higher than the mean or median (namely, the 95th percentile).
I know there are multiple ways to handle outliers, but I am wondering whether this is a good method and whether it is “statistically correct” (i.e., doesn’t invalidate the answer). I was trying to make the handling of outliers as non-arbitrary as possible.
- I took the simple percentage of people who spend at least $59 out of the total number of respondents from that country. For example, I found that about 20% of the respondents from the US spent at least $59 per month. This way the extreme values become less impactful.
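To make the two approaches concrete, here is a minimal sketch on made-up data. The column names (`country`, `monthly_spend`) and the toy values are assumptions for illustration only, not the notebook's actual data; capping at a percentile like this is sometimes called (one-sided) winsorizing.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions,
# not taken from the notebook.
df = pd.DataFrame({
    "country": ["US"] * 6 + ["India"] * 4,
    "monthly_spend": [0.0, 20.0, 59.0, 100.0, 800.0, 2500.0,
                      0.0, 10.0, 59.0, 1500.0],
})

COURSE_PRICE = 59  # monthly course price quoted in the post

# Approach 1: cap values above the 95th percentile at the 95th percentile.
cap = df["monthly_spend"].quantile(0.95)
df["spend_capped"] = df["monthly_spend"].clip(upper=cap)
capped_mean = df.groupby("country")["spend_capped"].mean()

# Approach 2: share of respondents per country who spend at least
# the course price. Extreme values only count as "yes", so their
# magnitude has no effect on the result.
share_at_least_price = (
    df["monthly_spend"].ge(COURSE_PRICE).groupby(df["country"]).mean()
)

print(capped_mean)
print(share_at_least_price)
```

In this sketch the cap is computed over all respondents at once; computing it per country would be a different (and equally defensible) choice.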
Either way, it seems like both of these approaches are first steps toward further market analysis.
I’m interested to hear what people think about the validity of these approaches.
I also have a plt.pie formatting question:
In cell 29 of the notebook, both of my pie charts have white marks in the center, where the white borders of the slices seem to overlap. Why does the white overlap in the center where the slices meet, and how can I fix it?
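For anyone who wants to reproduce this without opening the notebook, here is a minimal sketch. It assumes the white borders come from `wedgeprops` with a white `edgecolor`; the data and labels are made up. One possible cause (a guess, not confirmed against cell 29) is that patch edges use mitered corners by default, so the sharp tip of each wedge at the center produces a spike of border color. Setting `joinstyle` to `"round"`, or lowering `linewidth`, may reduce the artifact.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical data; not the values from the notebook.
sizes = [45, 30, 15, 7, 3]
labels = ["A", "B", "C", "D", "E"]

fig, ax = plt.subplots()
wedges = ax.pie(
    sizes,
    labels=labels,
    # White borders between slices. "joinstyle": "round" replaces the
    # default mitered corners, which can spike at the narrow wedge tips.
    wedgeprops={"edgecolor": "white", "linewidth": 1, "joinstyle": "round"},
)[0]
fig.canvas.draw()  # render once to check that the settings are accepted
```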
Thank you all for your time!