Why does this code not work?

Hi, I am working on the Guided Project: Predicting House Sale Prices. I defined a function, transform_features, to clean up the raw data. The function is supposed to remove columns with null values, but when I apply it to the data, the columns with null values still exist. I don't know why.

Screen Link:
https://app.dataquest.io/m/240/guided-project%3A-predicting-house-sale-prices/1/introduction

My Code:

def transform_features(df):
    
    nul_count = df.isnull().sum()
    drop_col = nul_count[nul_count > (len(df)/20)].sort_values() 
    df = df.drop(drop_col.index, axis=1)
    
    text_col = df.select_dtypes(include=['object'])
    text_nul_count = text_col.isnull().sum().sort_values(ascending=False)
    drop_txt_col = text_nul_count[text_nul_count > 0]
    df = df.drop(drop_txt_col.index, axis=1)
    
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)
      
    years_sold = df['Yr Sold'] - df['Year Built']
    drop_year_1 = years_sold[years_sold < 0].index
    
    year_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    drop_year_2 = year_since_remod[year_since_remod < 0].index
    
    df['years_blt_to_sale'] = years_sold
    df['year_since_remod'] = year_since_remod
    
    df = df.drop([1702, 2180, 2181], axis=0)
    df = df.drop(['Year Built', 'Year Remod/Add'], axis=1)
    
    df = data.drop(['PID','Order'],axis=1)
    df = data.drop(['Mo Sold','Yr Sold','Sale Type','Sale Condition'],axis=1)
    
    return df

data = pd.read_csv('AmesHousing.tsv', delimiter='\t')
transform_data = transform_features(data)
filtered_data = select_features(transform_data)
transform_data.isnull().sum()

What I expected to happen:
I expected there to be no null values in the dataframe.

What actually happened:

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       490
Lot Area             0
Street               0
Alley             2732
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        23
Mas Vnr Area        23
Exter Qual           0
Exter Cond           0
                  ... 
Bsmt Full Bath       2
Bsmt Half Bath       2
Full Bath            0
Half Bath            0
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu      1422
Garage Type        157
Garage Yr Blt      159
Garage Finish      159
Garage Cars          1
Garage Area          1
Garage Qual        159
Garage Cond        159
Paved Drive          0
Wood Deck SF         0
Open Porch SF        0
Enclosed Porch       0
3Ssn Porch           0
Screen Porch         0
Pool Area            0
Pool QC           2917
Fence             2358
Misc Feature      2824
Misc Val             0
SalePrice            0
Length: 78, dtype: int64

The issue is that at the end of your function you reassign df from data (the global raw dataframe), which discards every transformation made earlier in the function. If you change the last two lines before return to call df.drop instead of data.drop, it should work.
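To make the difference concrete, here is a minimal sketch on a toy dataframe. The names raw, transform_buggy, transform_fixed and the column names are made up for illustration and are not from the project; the only point is the contrast between dropping from the global frame and dropping from df.

```python
import pandas as pd

# Hypothetical toy data standing in for the raw AmesHousing frame.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "mostly_null": [None, None, 1.0],
    "price": [10, 20, 30],
})

def transform_buggy(df):
    # First step: drop a column from the frame passed in.
    df = df.drop(["mostly_null"], axis=1)
    # Bug: this starts over from the global raw frame,
    # so the drop above is thrown away.
    df = raw.drop(["id"], axis=1)
    return df

def transform_fixed(df):
    df = df.drop(["mostly_null"], axis=1)
    # Fix: keep chaining on df so earlier steps are preserved.
    df = df.drop(["id"], axis=1)
    return df

buggy_cols = list(transform_buggy(raw).columns)
fixed_cols = list(transform_fixed(raw).columns)
print(buggy_cols)  # 'mostly_null' is back
print(fixed_cols)  # only 'price' remains
```

In the buggy version the column dropped in the first step reappears, which matches the null counts you are seeing in your output.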

Also, it might be more efficient here to use df.dropna() with the thresh parameter instead of summing null values and dropping columns based on the number of rows.
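As a rough sketch of that idea, assuming the same 5% cutoff as in your function: thresh is the minimum number of non-null values a column needs in order to be kept, so it has to be derived from the row count. The toy frame and the names toy, manual, min_non_null and with_thresh are invented for this example.

```python
import numpy as np
import pandas as pd

n = 40
# Hypothetical toy data: 'a' has 0 nulls, 'b' has 2, 'c' has 3.
toy = pd.DataFrame({
    "a": np.arange(n, dtype=float),
    "b": [np.nan] * 2 + [1.0] * (n - 2),
    "c": [np.nan] * 3 + [1.0] * (n - 3),
})

# Approach from the question: count nulls, drop columns above 5% missing.
nul_count = toy.isnull().sum()
manual = toy.drop(nul_count[nul_count > len(toy) / 20].index, axis=1)

# dropna approach: keep columns with at least this many non-null values.
# A column survives the manual rule when nulls <= n // 20,
# i.e. when non-nulls >= n - n // 20.
min_non_null = len(toy) - len(toy) // 20
with_thresh = toy.dropna(axis=1, thresh=min_non_null)

print(list(manual.columns))
print(list(with_thresh.columns))
```

Both versions should keep the same columns on this toy data; it is worth checking that they also agree on the real dataset before swapping one for the other.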

Hope this helps.

Thanks! This is very helpful.