This is not the feedback yet. We will come to that later.
We first need to discuss why the merge and concat methods behave differently. A simple representation is here:
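To make the difference concrete, here is a minimal sketch on two tiny made-up dataframes. The column names (`id`, `age`, `dept`) and values are my own assumptions for illustration, not taken from your project.

```python
import pandas as pd

# Two small, hypothetical frames that share an "id" column
left = pd.DataFrame({"id": [1, 2, 3], "age": [25, 41, 58]})
right = pd.DataFrame({"id": [2, 3, 4], "dept": ["HR", "IT", "Ops"]})

# concat stacks the frames on top of each other; rows are NOT matched on a key,
# and each frame keeps its own index labels
stacked = pd.concat([left, right])

# merge matches rows on the key column; an inner merge keeps only shared ids
merged = pd.merge(left, right, on="id", how="inner")

print(stacked)
print(merged)
```

With concat, every row of both frames survives and missing columns are filled with NaN. With merge, a row survives only if its key finds a match (for an inner merge), so the two outputs can differ in both row count and content.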
For a detailed understanding of the two methods, please refer to the official documentation here. Please don’t get overwhelmed by it. Break it into smaller sections and work through one section at a time, or one per day.
Why am I pushing you to do that? Below are two reasons (apologies if you have already noticed them):
The rows that are duplicated still show differences. The groups highlighted in blue are completely different from each other, and the differences within each group are shown by the values highlighted in red.
The concat method outputs a dataframe whose index starts at 5 instead of 0 and has gaps between the index values. And we are pretty sure that we haven’t shuffled the dataframe anywhere up to this point in the project, or have we?
So the resulting dataframe is missing a considerable amount of data, and that can strongly influence the analysis. We first need to identify why this is happening.
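One possible way to end up with an index that starts at 5 and has gaps is sketched below. It assumes the frames passed to concat were filtered earlier, which I cannot confirm from your notebook; the `value` column and the filters are made up.

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})

# Filtering keeps the original index labels of the surviving rows
part_a = df[df["value"].isin([5, 7])]
part_b = df[df["value"].isin([8, 9])]

# concat preserves those labels, so the result starts at 5 and has gaps
kept = pd.concat([part_a, part_b])

# ignore_index=True renumbers the result from 0
fresh = pd.concat([part_a, part_b], ignore_index=True)

# A quick check for lost rows: compare the lengths before and after
rows_match = len(kept) == len(part_a) + len(part_b)

print(kept.index.tolist())
print(fresh.index.tolist())
print(rows_match)
```

Comparing the row counts before and after the concat is a quick way to tell whether rows were actually dropped or whether only the index labels look odd.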
One factor that I can see in this project is the complicated calculation of Year: the bifurcation into first year, second year, and so on. I am confused about the exact logic behind this extraction and the subsequent classification of the service category. This is causing the duplication in merge:
Age group 41-45 is classified as both Veteran and Experienced (highlighted in the red box).
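Here is a rough sketch of how one key carrying two labels duplicates rows in a merge, and how to catch it. The frames, column names (`name`, `age_group`, `category`) and values are hypothetical and only mimic the 41-45 overlap; your actual classification logic may differ.

```python
import pandas as pd
from pandas.errors import MergeError

people = pd.DataFrame({"name": ["A", "B"], "age_group": ["36-40", "41-45"]})

# Hypothetical lookup table in which 41-45 carries two labels
lookup = pd.DataFrame({
    "age_group": ["36-40", "41-45", "41-45"],
    "category": ["Experienced", "Experienced", "Veteran"],
})

# Person B matches two lookup rows, so B appears twice in the result
merged = people.merge(lookup, on="age_group", how="left")

# List the keys that appear more than once in the lookup table
dupes = lookup[lookup.duplicated("age_group", keep=False)]

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique
try:
    people.merge(lookup, on="age_group", how="left", validate="many_to_one")
    overlap_detected = False
except MergeError:
    overlap_detected = True

print(merged)
print(dupes)
print(overlap_detected)
```

Passing `validate` is a cheap guard: the merge fails loudly instead of silently adding rows.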
I assume you haven’t yet compared your answer with the solution. If you haven’t, great! I encourage you to look at another student’s version of the same project and see how this step differs. It may or may not differ much from the actual solution, but it won’t give away the exact answer, and it will help you understand how the code can be tweaked to get a correct result.
Just to elaborate on why I am so confused by the Years dataframe, please take a look at the replication of the code in the attached notebook.
Merge_Concat_Years.ipynb (21.9 KB)