num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')
df = df.fillna(replacement_values_dict)
The point of this part of the ‘transform_features()’ function (see the solution notebook at: https://github.com/dataquestio/solutions/blob/master/Mission240Solutions.ipynb) is first to separate out the columns that have more than zero but fewer than 5% of their values missing. Once these columns have been separated, the idea is to fill in the missing values with each column's most frequently occurring value (the column mode).
The approach taken in the solution notebook is to separate the columns as described above, then create a dictionary with each column as a key and that column's mode as the value.
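To make the dictionary step concrete, here is a minimal sketch on toy data (the frame and the names `modes`, `records` and `replacement_values` are made up for illustration, not taken from the notebook). It shows that `to_dict(orient='records')` returns a list with one dictionary per row of the mode table, so a plain dictionary has to be picked out of that list before it can be handed to `fillna`. Taking the first element is an assumption here about how one would do that.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the data set from the mission).
df = pd.DataFrame({
    "a": [1.0, 1.0, 2.0, np.nan],
    "b": [5.0, 7.0, 7.0, np.nan],
})

modes = df.mode()                          # DataFrame with one column per input column
records = modes.to_dict(orient="records")  # list holding one dict per row of `modes`
replacement_values = records[0]            # assumption: use the first row of modes

# fillna matches the dictionary keys to column names.
filled = df.fillna(replacement_values)
print(records)
print(filled)
```

`mode()` can return more than one row when a column has tied modes, which is why the result is a table rather than a single value per column.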
What I do not fully grasp is the ‘’ at the end of the line of code creating the dictionary. If someone could please help explain this to me, I would be forever grateful.
Also, I was wondering what other approaches one could use. Initially, seeing as we have already removed all columns with over 5% of their values missing (previously in the same function), I assumed that all that was left would be numeric columns with 5% or less of their values missing, so I imagined that simply using something like the following would work:
num_missing = df.select_dtypes(include=['int', 'float']).columns  # names of the numeric columns
df[num_missing] = df[num_missing].fillna(df[num_missing].mode().iloc[0])  # first row of mode() holds each column's mode
When I attempted something similar, I noticed that some columns that appeared in the solution notebook were missing later on, when I went on to choose features based on correlation.
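For comparison, a minimal sketch of the "fill every numeric column with its mode" idea on toy data (the frame and the names `numeric_cols`, `partial` and `col_modes` are hypothetical). It does not try to diagnose the missing columns. It only illustrates two details: passing the whole `mode()` table to `fillna` aligns on row labels as well as column names, so only rows that exist in the mode table get filled, whereas passing its first row fills each column throughout. Assigning back to the selected columns leaves the non-numeric columns in place.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); "label" is a non-numeric column that should survive.
df = pd.DataFrame({
    "x": [3.0, 3.0, np.nan, 4.0],
    "y": [np.nan, 2.0, 2.0, 9.0],
    "label": ["p", "q", "r", "s"],
})

numeric_cols = df.select_dtypes(include="number").columns

# Filling with the mode *table* aligns on the row index too:
# the table only has row 0, so the NaN in row 2 of "x" is left alone.
partial = df[numeric_cols].fillna(df[numeric_cols].mode())

# Filling with the first row of the table (a Series keyed by column name)
# applies each column's mode to every missing value in that column.
col_modes = df[numeric_cols].mode().iloc[0]
df[numeric_cols] = df[numeric_cols].fillna(col_modes)

print(partial)
print(df)
```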
Finally, as another alternative to using a dictionary, I was thinking of making a list of the columns with missing values (of 5% or less), then looping through this list to calculate each column's mode and, in the same loop, applying that mode to the dataframe with the fillna function. So maybe something like this:
num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum().index
for col in num_missing:
    col_mode = df[col].mode()[0]  # mode() returns a Series, so take its first value
    df[col] = df[col].fillna(col_mode)
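A minimal sketch of how the loop could be restricted to the columns with more than zero but fewer than 5% missing values, mirroring the filter quoted from the solution at the top of this post. The toy frame and the names `threshold` and `fixable_cols` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 40 rows, so 1 missing value is 2.5% and 3 are 7.5%.
n = 40
df = pd.DataFrame({
    "few_missing": [1.0] * 30 + [2.0] * 9 + [np.nan],
    "many_missing": [5.0] * 37 + [np.nan] * 3,
    "complete": list(range(n)),
})

num_missing = df.select_dtypes(include="number").isnull().sum()
threshold = len(df) / 20  # 5% of the rows

# Keep only columns with at least one, but fewer than 5%, missing values.
fixable_cols = num_missing[(num_missing > 0) & (num_missing < threshold)].index

for col in fixable_cols:
    col_mode = df[col].mode()[0]        # first (most frequent) value
    df[col] = df[col].fillna(col_mode)  # fill only this column

print(list(fixable_cols))
print(df.isnull().sum())
```

The loop and the dictionary version should fill the same cells. The loop is simply more explicit about which column is being handled at each step.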
If I am having a mare here I would appreciate any constructive and truthful criticism!