Pandas: importing a CSV with na_values also changes other data

Initially, I had imported my data like this:

import pandas as pd

dete_survey = pd.read_csv('dete_survey.csv')

Then this is what my data looked like:

Then, to replace the 'Not Stated' values with NaN (null), I wrote:

# Reimport DETE data, consider 'Not Stated' a missing value
dete_survey = pd.read_csv('dete_survey.csv', na_values = 'Not Stated')

# Check a sample
dete_survey[dete_relevant_columns].head(10)

So compared to the original code, the only part I added is na_values = 'Not Stated'.

What I expected to happen:
‘Not Stated’ replaced by NaN.

What actually happened:
That did happen, but other values changed as well: e.g. 1984 became 1984.0. See screenshot:

[Screenshot: DETE Start Date with na_values]

And it seems that when running dete_survey_updated.info(), I now have a float64 where I earlier had an object.
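Here is a minimal sketch (with made-up data, not the real survey file) that reproduces what I'm seeing:

```python
import io
import pandas as pd

# Hypothetical CSV data standing in for the real file
csv_data = "DETE Start Date\n1984\nNot Stated\n1990\n"

# Without na_values: the column stays object (strings), because
# 'Not Stated' prevents pandas from parsing the column as numeric.
raw = pd.read_csv(io.StringIO(csv_data))
print(raw["DETE Start Date"].dtype)  # object

# With na_values: 'Not Stated' becomes NaN, and because NaN is a float,
# the remaining integers are cast to float64 (1984 -> 1984.0).
clean = pd.read_csv(io.StringIO(csv_data), na_values="Not Stated")
print(clean["DETE Start Date"].dtype)  # float64
```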

Can anyone explain this? And do you have suggestions for what I should do? Apart from looking strange (1984.0), I'm not sure whether this will impact any analysis I want to do on this field later on.

Yes, this is what pandas does internally. As per Working with missing data:

Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support for integer NA for more)

There is a way to convert the column to integer datatype, but you should be able to work with the values as floats depending on what you need to do with them.
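As a rough sketch of that conversion (using pandas' nullable "Int64" dtype and made-up values, not your actual column):

```python
import pandas as pd
import numpy as np

# A float column as produced by read_csv with na_values
s = pd.Series([1984.0, np.nan, 1990.0])
print(s.dtype)  # float64

# The nullable integer dtype ("Int64", capital I) can hold both
# integers and missing values, so the float cast is avoided.
s_int = s.astype("Int64")
print(s_int.dtype)     # Int64
print(s_int.iloc[0])   # 1984, not 1984.0
```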


Thank you for the explanation and the links!

Taking it from there, I figured out that a piece of code like this:

dete_resignation['dete_start_date'].astype(pd.Int32Dtype()).value_counts().sort_index()
does return values like 1978 again instead of 1978.0.

(Although I later saw in the instructions that I am advised to use floats instead.)