Guided Project: Predicting Car Prices; 2. Data Cleaning


numeric_cars.isna().sum()

returns:

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-size           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

Next, I ran:

numeric_cars = numeric_cars.dropna(subset=['price'])
numeric_cars.isna().sum()

and then:

numeric_cars = numeric_cars.fillna(numeric_cars.mean())
numeric_cars.isna().sum()

in an attempt to remove the NaN values, but I still get:

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-size           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

Notice that there’s no change to the NaN counts or locations aside from the price column.


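This fails because normalized-losses, bore, stroke, horsepower, and peak-rpm are stored as objects (strings). In the pandas version used here, mean() skips object columns, so fillna() has no value to fill them with and their NaN counts stay the same.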

numeric_cars.astype(float)
numeric_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 15 columns):
normalized-losses    164 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-size          205 non-null int64
bore                 201 non-null object
stroke               201 non-null object
compression-rate     205 non-null float64
horsepower           203 non-null object
peak-rpm             203 non-null object
city-mpg             205 non-null int64
highway-mpg          205 non-null int64
price                201 non-null object
dtypes: float64(5), int64(4), object(6)
memory usage: 24.1+ KB
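The effect of those object dtypes can be reproduced on a tiny frame. This is only a sketch: the values are made up for illustration, and numeric_only=True is passed explicitly so the behaviour is the same across pandas versions (newer versions raise an error instead of silently skipping string columns).

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the real dataset.
df = pd.DataFrame({
    # Numbers stored as strings, so the column has object dtype.
    "bore": pd.Series(["3.47", "2.68", np.nan], dtype=object),
    "length": [168.8, 171.2, 176.6],
})

# Only numeric columns get a mean; "bore" is left out of the result.
means = df.mean(numeric_only=True)

# fillna() can only fill columns that appear in `means`,
# so the NaN in "bore" is still there afterwards.
filled = df.fillna(means)

print(means.index.tolist())
print(filled.isna().sum())
```

Because "bore" never shows up in the means, its missing value survives the fillna() call, which matches the unchanged counts in the question.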

You need to convert those columns to numbers, either with astype() (to float or int) or with pd.to_numeric. Both return a new object rather than changing numeric_cars in place, so assign the result back to numeric_cars.
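A minimal sketch of that conversion followed by the mean fill, assuming the strings all parse as numbers. The small frame below is a made-up stand-in for numeric_cars, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for numeric_cars: numbers stored as strings, plus NaN.
numeric_cars = pd.DataFrame({
    "normalized-losses": pd.Series(["164", np.nan, "158"], dtype=object),
    "horsepower": pd.Series(["111", "154", np.nan], dtype=object),
    "price": pd.Series(["13495", "16500", "13950"], dtype=object),
})

# astype() returns a new DataFrame, so the result must be assigned back.
numeric_cars = numeric_cars.astype(float)

# Every column is numeric now, so mean() covers all of them
# and fillna() can replace each NaN with its column mean.
numeric_cars = numeric_cars.fillna(numeric_cars.mean())

print(numeric_cars.isna().sum())
```

If some strings might not parse as numbers, numeric_cars.apply(pd.to_numeric, errors="coerce") is an alternative that turns unparseable entries into NaN instead of raising.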

PS: It would also be good for you to provide us with the link to the mission.


I used astype to convert all columns to float, since the whole point is to use only the continuous numeric columns. By making them floats, I lose nothing and things work better. Thank you.