Why do we call the 'first term of the dict', and is there another way to fill in missing values?

Screen Link:

https://app.dataquest.io/m/240/guided-project%3A-predicting-house-sale-prices/2/feature-engineering

My Code:

    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)

So the point of this part of the ‘transform_features()’ function (see the solution notebook at: https://github.com/dataquestio/solutions/blob/master/Mission240Solutions.ipynb) is to first separate the columns that have fewer than 5% of their values missing, but more than zero values missing. Once these columns have been separated, the idea is to fill in the missing values with each column's most frequently occurring value (the column mode).

The approach taken in the solution notebook is to separate the columns as described above and then to create a dictionary with each of the column names as a key and the respective column mode as the value.

What I do not fully grasp is the ‘[0]’ at the end of the line of code that creates the dictionary. If someone could please help explain this to me I would be forever grateful :slight_smile:

Also, I was wondering what other approaches one could use. Initially, seeing as we have already removed all columns with over 5% of their values missing (earlier in the same function), I assumed that all that was left would be numeric columns with 5% or less of their values missing, so I imagined that something like the following would work:

    num_missing = df.select_dtypes(include = ['int', 'float'])
    df[num_missing] = df[num_missing].fillna(df[num_missing].mode)

When I attempted something similar I noticed that some columns that appeared in the solution notebook were missing later on when I went onto choosing features based on correlation.

Finally, as another alternative to using a dictionary, I was thinking of making a list of the columns with missing values (5% or less), then looping through this list to calculate each column's mode and, in the same loop, applying that mode to the dataframe with the fillna function. So maybe something like this:

    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum().index

    for col in num_missing:
        col_mode = df[col].mode
        df[col] = df[col].fillna(col_mode)

If I am having a mare here I would appreciate any constructive and truthful criticism!

Please help :slight_smile:

Thanks!
John

df.to_dict(orient='records') returns a list of dictionaries, one per row. In each dictionary the column names are the keys and that row's values are the values. Here, .mode() returns a DataFrame (it can have more than one row if a column has tied modes), so in replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0] the [0] selects the first item in the list returned by .to_dict(), which is the dictionary built from the first row of modes.
Check the documentation to understand more.
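
To make this concrete, here is a minimal sketch on a made-up DataFrame. The name `toy`, the columns `a` and `b`, and their values are hypothetical and are not taken from the housing data. It only illustrates what each step of the one-liner produces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the housing dataset from the project.
toy = pd.DataFrame({
    "a": [1.0, 1.0, 2.0, np.nan],
    "b": [5.0, 6.0, 6.0, np.nan],
})

# .mode() returns a DataFrame with one column per input column.
modes = toy.mode()

# orient="records" gives a list with one dict per row of `modes`.
records = modes.to_dict(orient="records")

# [0] picks the dict for the first row: {column name: first mode}.
replacement_values = records[0]

# fillna with a dict fills each column with the value stored under its name.
filled = toy.fillna(replacement_values)

print(records)
print(filled)
```

Without the `[0]`, `fillna` would receive a list instead of a dict, which is not a form it accepts for per-column replacement values.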

Hi Victor!

Silly of me not to think of that. I have tried the code both with and without the indexing and have seen your explanation unfold perfectly.

I honestly appreciate your help and time, thank you!

Hi John. If Victor’s reply answered your question then please mark their reply as the Solution.

But Mr Victor, I still do not understand why the following code simply did not work:

    num_missing = df.select_dtypes(include = ['int', 'float'])
    df[num_missing] = df[num_missing].fillna(df[num_missing].mode)

I’m having the same problem. Were you able to understand why that snippet of code doesn’t work?
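
As far as I can tell, three things stand in the way of that snippet:

1. `num_missing` holds a whole DataFrame rather than a list of column names, so `df[num_missing]` does not select columns by name (pandas tries to treat a DataFrame key as a mask).
2. `.mode` without parentheses is the method itself, not its result.
3. Even with `.mode()`, `fillna` aligns a DataFrame argument on row labels as well as columns, so only rows whose index label also exists in the mode result (usually just row 0) get filled.

Below is a hedged sketch of one way around this, using a made-up DataFrame. The columns `x`, `y`, `label` and the names `num_cols`, `first_modes`, `partial`, `filled` and `looped` are hypothetical. It is not the solution notebook's code, and it skips the 5% filter from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame.
df = pd.DataFrame({
    "x": [1.0, 1.0, np.nan, 3.0],
    "y": [np.nan, 4.0, 4.0, 5.0],
    "label": ["p", "q", "r", "s"],
})

# Take the column names (an Index), not the DataFrame itself.
num_cols = df.select_dtypes(include=["int", "float"]).columns

# Passing the mode DataFrame directly aligns on row labels too,
# so the NaN in row 2 of "x" is left unfilled.
partial = df[num_cols].fillna(df[num_cols].mode())

# .iloc[0] takes the first row of modes as a Series indexed by column
# name, which fillna applies column by column to every row.
first_modes = df[num_cols].mode().iloc[0]
filled = df.copy()
filled[num_cols] = df[num_cols].fillna(first_modes)

# Loop alternative: call .mode() and take its first value per column.
looped = df.copy()
for col in num_cols:
    looped[col] = looped[col].fillna(looped[col].mode()[0])

print(partial)
print(filled)
```

The Series version and the loop should give the same result; the dictionary in the solution notebook is a third way of handing `fillna` one replacement value per column.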