Employee exit surveys: sharing my project and asking a question

Hi everybody,

In this project I struggled with merging dataframes, specifically when I tried to merge the dataframe that resulted from extracting all years from a column with the initial dataframe. I tried to merge them on their common index, but the resulting dataframe had duplicated rows.

In the end I solved it by using the concat function instead of merge. Have you had a similar problem? Do you know why the merge function doesn’t work here?

You can check this issue in section C.2 of this project (there is a table of contents at the beginning), next to a markdown cell with red text.

Regarding my final conclusions for this project, do you agree? Did you reach the same conclusions?

Constructive feedback is welcome.

Many thanks



Employee Exit Surveys.ipynb (251.7 KB)

Click here to view the jupyter notebook file in a new tab

Hi @Daniel_H,

This is not feedback yet. We will come to that later.

We need to first discuss why the merge and concat methods behave differently. A simple representation is here
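To make the difference concrete, here is a minimal sketch with made-up data. It assumes one possible cause that I have not verified in the notebook: the index contains repeated labels, as can happen after stacking two dataframes without resetting the index. The names `left`, `right`, `service` and `years` are hypothetical.

```python
import pandas as pd

# Hypothetical data: the index labels 0 and 1 each appear twice.
left = pd.DataFrame({"service": ["1-2", "3-4", "5-6", "7-10"]}, index=[0, 1, 0, 1])
right = pd.DataFrame({"years": [1.0, 3.0, 5.0, 7.0]}, index=[0, 1, 0, 1])

# A quick check for repeated labels in the index.
print(left.index.is_unique)

# merge joins on the label: every left row is paired with every
# right row that carries the same label.
merged = left.merge(right, left_index=True, right_index=True)

# concat along the columns: the two indexes are identical here,
# so the rows are placed side by side.
stacked = pd.concat([left, right], axis=1)

print(len(merged), len(stacked))
```

With this toy data, merge returns more rows than either input because each repeated label is matched against all of its counterparts, while concat keeps the original row count. Whether this is what happens in the project depends on whether its index really has repeated labels.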

For a detailed understanding of the two methods, please refer to the official documentation here. Please don’t get overwhelmed by it. Break it into smaller sections and take one section at a time, perhaps one per day.

Why am I pushing you to do that? Below are two reasons (apologies if you have already noticed them):

  1. The duplicated rows still show differences. The groups highlighted in blue are completely different from each other, and the differences within each group are shown by the values highlighted in red.

  2. The concat method outputs a dataframe that starts from index 5 instead of 0 and has gaps between the indexes. And we are pretty sure that we haven’t shuffled the dataframe anywhere up to this point in the project. Or have we? :thinking:

So the resulting dataframe appears to be missing a considerable amount of data, and that can strongly influence the analysis. We first need to identify why this is happening.

One factor that I could see in this project is the complicated calculation of the years (the split into first year, second year, etc.). I am confused about the exact logic behind this extraction and the subsequent classification into service categories. This is causing the duplication in the merge: age group 41-45 is classified as both Veteran and Experienced (highlighted in the red box).
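For comparison, here is a rough sketch of an extraction that yields exactly one number and one category per row, so no row can end up in two categories. This is not the code from the project: the sample strings, the cut-off values and the labels "New" and "Established" are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical service-length strings.
service = pd.Series(
    ["Less than 1 year", "1-2", "3-4", "11-20", "More than 20 years", None]
)

# Keep only the first number found in each string: one value per row.
first_year = service.str.extract(r"(\d+)", expand=False).astype(float)

def to_category(years):
    """Map a number of years to a single category (cut-offs are assumed)."""
    if pd.isna(years):
        return np.nan
    if years < 3:
        return "New"
    if years < 7:
        return "Experienced"
    if years < 11:
        return "Established"
    return "Veteran"

categories = first_year.apply(to_category)
print(pd.DataFrame({"service": service, "years": first_year, "category": categories}))
```

Because the result has the same length and index as the input, it can be assigned back as a new column without any merge.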

I assume you haven’t yet compared your answer with the solution. If you haven’t, great! I would encourage you to look at another student’s version of the same project and see how this step differs. It may or may not be very different from the official solution, but it won’t give away the exact answer, and it can help you understand how to tweak the code to get a correct result.

Just to elaborate on why I am so confused by the years dataframe, please take a look at the replication of the code in the attached notebook.

Merge_Concat_Years.ipynb (21.9 KB)

Hi @Rucha,

Thanks again for your time :blush:. I know it takes a long time to check all the code above the cell where I struggled. I think the gaps between rows are fine, because the dataframe from which I wanted to extract the years (combined_service_cat) was a subset of the initial dataframe, and consequently so is the years dataframe. Before merging these two dataframes I checked their shapes, and I also checked that their indexes were the same, simply using the expression years.index == combined_service_cat.index, which gave me an array full of Trues :smile:
You can see it in the cells just below the merge of both dataframes.
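A small sketch of what I mean, using made-up data rather than the real survey (the names `full`, `subset` and the `service` column are hypothetical): a filtered subset keeps the labels of the original rows, so gaps in the index are expected, and anything derived from the subset shares the same index.

```python
import pandas as pd

# Hypothetical initial dataframe with some missing values.
full = pd.DataFrame({"service": ["1-2", None, "3-4", None, "11-20", "5-6"]})

# Filtering keeps the original row labels, so the index has gaps.
subset = full[full["service"].notna()]
print(subset.index.tolist())

# A dataframe derived from the subset inherits its index.
years = subset["service"].str.extract(r"(\d+)").astype(float)

# Element-wise comparison gives a boolean array...
print(years.index == subset.index)
# ...while Index.equals gives a single True/False.
print(years.index.equals(subset.index))

# If consecutive labels are wanted, the index can be renumbered.
renumbered = subset.reset_index(drop=True)
print(renumbered.index.tolist())
```

Note that equal indexes only tell us the labels match position by position. They do not tell us whether any label is repeated.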

I worked on your dummy example and modified it a little. I managed to extract all the years, as in my project, and I could also merge the dataframes correctly in your example… So I don’t know what happened at this step in my project. But after checking another student’s final results, it seems that I reach the same conclusions in the end…

Merge_Concat_Years(mod).ipynb (17.8 KB)

Click here to view the jupyter notebook file in a new tab