Non-unique values in value_counts()

Link to the mission
I’m working with the original dataset and trying to push the analysis a bit further by looking at how dissatisfaction varies across age groups.

I’m starting with a preliminary cleaning of the ‘age’ column.

combined_updated['age'] = (combined_updated['age']
    .str.replace('or', '-')
    .str.split(' ')
    .str.join('')
    .str.strip()
    )
print(color.BOLD + "Values in the `age` column : " + color.END)
combined_updated['age'].value_counts(dropna = False)

Output :

Values in the `age` column : 
51-55         71
NaN           55
41-45         48
41–45         45
46-50         42
36-40         41
46–50         39
26-30         35
21–25         33
36–40         32
26–30         32
31–35         32
56-older      29
31-35         29
21-25         29
56-60         26
61-older      23
20-younger    10
Name: age, dtype: int64

I don’t understand why some age ranges appear twice in the output. For example, the two 41-45 rows look identical, don’t they?
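
One way to check whether two labels that look the same are really the same string is to list their non-ASCII characters along with their Unicode names. This is only a minimal sketch on a hypothetical list of labels, not the actual column, and `describe_non_ascii` is a helper name made up for illustration:

```python
import unicodedata

# Hypothetical stand-in for the labels in the `age` column.
# The second label is written with an en dash (U+2013) instead of a hyphen.
labels = ['41-45', '41\u201345', '46-50']

def describe_non_ascii(label):
    """Return (character, Unicode name) pairs for each non-ASCII character."""
    return [(ch, unicodedata.name(ch, 'UNKNOWN')) for ch in label if ord(ch) > 127]

for label in labels:
    # repr() makes invisible or look-alike differences easier to spot
    print(repr(label), describe_non_ascii(label))
```

With the real data, the loop could run over `combined_updated['age'].dropna().unique()` instead of the hard-coded list. Any label that reports something like `EN DASH` is not using the plain keyboard hyphen.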


I solved it by copy-pasting the dashes from the output. I suspected that there might be different kinds of dashes in the data, and it worked:

combined_updated['age'] = (combined_updated['age']
    .str.replace('or', '-')
    .str.replace('-', '-')
    .str.replace('–', '-')  # longer dash copied from the output -> plain hyphen
    .str.split(' ')
    .str.join('')
    .str.strip()
    )

print(color.BOLD + "Values in the `age` column : " + color.END)
combined_updated['age'].value_counts(dropna = False)

Output :

Values in the `age` column : 
41-45         93
46-50         81
36-40         73
51-55         71
26-30         67
21-25         62
31-35         61
NaN           55
56-older      29
56-60         26
61-older      23
20-younger    10
Name: age, dtype: int64
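
Copy-pasting the dash works, but it only covers the characters that happened to be spotted. A slightly more general approach, sketched here under the assumption that any hyphen-like character should become a plain hyphen, is a regex character class. `DASH_PATTERN` and `normalize_dashes` are names invented for this example:

```python
import re

# Hyphen-like characters: U+2010 to U+2015 (hyphen, non-breaking hyphen,
# figure dash, en dash, em dash, horizontal bar) plus the minus sign U+2212.
DASH_PATTERN = '[\u2010-\u2015\u2212]'

def normalize_dashes(label):
    """Replace every hyphen-like character in `label` with a plain hyphen."""
    return re.sub(DASH_PATTERN, '-', label)

# Hypothetical labels; the second one uses an en dash.
labels = ['41-45', '41\u201345', '56-older']
print([normalize_dashes(x) for x in labels])

# On the column itself, the same pattern could be passed to pandas:
# combined_updated['age'].str.replace(DASH_PATTERN, '-', regex=True)
```

Passing `regex=True` explicitly avoids depending on the default, which differs between pandas versions.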