Guided Project: Practice Optimizing Dataframes and Processing in Chunks

Hello everyone!

How should I approach this question, especially without visually inspecting the data?

Identify float columns that don’t contain any missing values, and that we can convert to the integer type because they represent whole numbers.

To identify float cols, I have used the following:
chunk.columns[chunk.dtypes == "float64"].tolist()

Is this the best way to retrieve float cols? What would be the most adequate way to do it?

As for knowing whether they can be converted to integers or not, I have no idea how to check this without a visual assessment. What is the proper way to do it?

I appreciate any input for this topic!
Thanks :relaxed:

Hi Nicolas,

I’m working through this project now as well. There is some helpful information on this part in the Optimizing Dataframe Memory Footprint mission, step 8 (Optimizing Integer Columns with Subtypes).

Identify float columns without missing values:
The method recommended in the mission is to use pd.DataFrame.select_dtypes:

print(chunk.select_dtypes(include=['float']).isnull().sum())

This prints a series that shows the float columns and a count of how many null values each contains. In the Dataquest example in the mission I referenced, it prints the following:

ExhibitionID              429
ExhibitionSortOrder         0
ConstituentID             514
ConstituentBeginDate     9268
ConstituentEndDate      14739
VIAFID                   7562
ULANID                  12870
dtype: int64
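
If you'd rather not read the printout by eye, you can filter that same series for the columns whose null count is zero. Here is a minimal sketch, assuming a small made-up dataframe standing in for `chunk` (the column names are borrowed from the output above, the values are invented, and `no_missing_float_cols` is just a name I picked):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one chunk; the values are made up for illustration.
chunk = pd.DataFrame({
    "ExhibitionID": [2.0, np.nan, 3.0],
    "ExhibitionSortOrder": [1.0, 2.0, 3.0],
    "Title": ["a", "b", "c"],
})

# Count nulls per float column, then keep only the columns with zero nulls.
null_counts = chunk.select_dtypes(include=["float"]).isnull().sum()
no_missing_float_cols = null_counts[null_counts == 0].index.tolist()

print(no_missing_float_cols)
```

With this toy data, only ExhibitionSortOrder should come back, since ExhibitionID has a NaN and Title is not a float column.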

Knowing if floats can be converted to integers:

You can only convert a column from float to a NumPy int type if it has no missing values, because the NumPy int types don't have a missing value object like NaN. So in the example above, only the ExhibitionSortOrder column could be converted to the integer type.

I’m not sure about that last “represent whole numbers” part of the question. What step of the guided project are you on? I see a similar question on step 2, but it doesn’t mention the whole numbers part.
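
If the whole-numbers check does turn out to be needed, one possible way to do it without looking at the data is to test whether every value has a zero remainder when divided by 1. This is only a sketch on invented data, not the method from the mission; the column names and the `convertible` list are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical chunk: one clean column, one with a fraction, one with a NaN.
chunk = pd.DataFrame({
    "whole": [1.0, 2.0, 3.0],
    "fractional": [1.5, 2.0, 3.0],
    "has_nan": [1.0, np.nan, 3.0],
})

float_cols = chunk.select_dtypes(include=["float"])

# A column qualifies if it has no missing values and every value is whole.
convertible = [
    col for col in float_cols.columns
    if float_cols[col].notnull().all() and (float_cols[col] % 1 == 0).all()
]
print(convertible)

# Convert only the qualifying columns; the others stay as floats.
chunk = chunk.astype({col: "int64" for col in convertible})
print(chunk.dtypes)
```

Here only the "whole" column should qualify. The NaN check is spelled out for clarity, although NaN % 1 is NaN and would fail the comparison anyway.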