Unexpected Missing Values

Hello DataQuest Community,

I need help cleaning up "unexpected missing values". For instance, I have string fields that contain numbers in some places, and I want to insert NaN wherever there is a number. The code below handles one field; I want to know how I can do the same for more than one field that contains numbers:

# Detecting numbers entered in a string column and replacing them with NaN
import numpy as np
import pandas as pd

cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)  # succeeds only when the value is numeric
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        pass  # non-numeric values are left untouched
    cnt += 1

You can use vectorized methods throughout rather than dealing with the data row by row.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html

  1. Use this with errors="coerce": numeric strings are converted to numbers and everything else becomes NaN.
  2. Create a boolean mask with mask = coerced.notna() to find the values that survived the coercion (these are the numbers stored as strings, the ones your int() call catches).
  3. df.loc[mask, 'OWN_OCCUPIED'] = np.nan
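
Putting the three steps together, a minimal sketch (assuming the same df and OWN_OCCUPIED column as in your snippet):

import numpy as np
import pandas as pd

# Coerce: numeric-looking strings become numbers, everything else becomes NaN.
coerced = pd.to_numeric(df['OWN_OCCUPIED'], errors='coerce')

# Mask: True where the coercion produced a number, i.e. the bad entries.
mask = coerced.notna()

# Replace those bad entries with NaN in the original column.
df.loc[mask, 'OWN_OCCUPIED'] = np.nan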

Thank you very much for your comment, hanqi.

By the way, the sample code I posted works, but only for one column at a time. How do I use df.loc[mask, 'OWN_OCCUPIED'] = np.nan on multiple columns instead of just 'OWN_OCCUPIED'? I have three other columns that should contain names but have numbers erroneously entered in them.
Thanks

Great question, I see you are pushing your limits by thinking in more than one dimension now.
Soon you will be dealing with multi-dataframe considerations.

  1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
  2. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mask.html
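
The same pattern extends to several columns at once. Here is a sketch; the column names other than OWN_OCCUPIED are placeholders for your three name columns, so substitute your own:

import numpy as np
import pandas as pd

# Placeholder column names -- replace with your actual columns.
cols = ['OWN_OCCUPIED', 'FIRST_NAME', 'LAST_NAME', 'CITY']

# Coerce every selected column at once; apply runs pd.to_numeric column-wise.
coerced = df[cols].apply(pd.to_numeric, errors='coerce')

# DataFrame.notna gives a boolean frame: True wherever a number slipped in.
bad = coerced.notna()

# DataFrame.mask replaces values where the condition is True with NaN.
df[cols] = df[cols].mask(bad)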

Keep this pattern in mind for future operations, because it will be put to good use in many other data munging tasks too (e.g. type conversion, get_dummies, etc.).

Many of these DataFrame methods have NumPy equivalents. Go for NumPy if you are willing to sacrifice readable labels for speed.
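
For example, np.where is roughly the NumPy counterpart of DataFrame.mask. A sketch, reusing cols and the boolean frame bad from above:

import numpy as np

# Pick np.nan where the condition holds, otherwise keep the original value.
# Works on raw arrays, so it drops the index/column labels pandas carries.
df[cols] = np.where(bad.to_numpy(), np.nan, df[cols].to_numpy())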