I need help with how to clean “Unexpected Missing Values”. For instance, I have a string field that has numbers in some places, and I want to insert NaN wherever there is a number. I have the code below for dealing with one field, and I want to know how I can do the same on more than one field that has numbers:
import numpy as np  # df is assumed to be an already-loaded pandas DataFrame

# Detecting numbers: if a value parses as an int, it was entered
# erroneously into this string column, so replace it with NaN
cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        pass
    cnt += 1
Use pd.to_numeric with errors="coerce" to convert the genuine string values to NaN while keeping the numeric entries.
Then build a boolean mask with series.notna() on the coerced result to find the values that survived the coercion (these are the numbers entered in string form, i.e. the ones int() can convert).
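A minimal sketch of that coerce-and-mask approach on your example column (df is assumed to be your loaded DataFrame):

import numpy as np
import pandas as pd

# Non-numeric strings like 'Y'/'N' become NaN; numeric entries survive.
coerced = pd.to_numeric(df['OWN_OCCUPIED'], errors='coerce')

# Rows that survived the coercion are the erroneously entered numbers.
mask = coerced.notna()

# Overwrite just those rows in the original column with NaN.
df.loc[mask, 'OWN_OCCUPIED'] = np.nan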
By the way, the sample code I posted works, but only for one column at a time. How do I use "df.loc[mask, 'OWN_OCCUPIED'] = np.nan" on multiple columns instead of just the one "OWN_OCCUPIED"? I have three other columns which should contain names but have numbers erroneously entered.
Thanks
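One way to extend it to several columns is to loop the same coerce-and-mask step over a list of column names. A sketch, assuming df is your DataFrame; the extra column names here are placeholders for your three name columns:

import numpy as np
import pandas as pd

# Placeholder names; substitute your own columns.
string_cols = ['OWN_OCCUPIED', 'FIRST_NAME', 'LAST_NAME', 'STREET_NAME']

for col in string_cols:
    # Values that coerce to a number were entered erroneously in a string column.
    mask = pd.to_numeric(df[col], errors='coerce').notna()
    df.loc[mask, col] = np.nan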
Keep this coerce-and-mask pattern in mind, because it will be put to good use for many other data munging operations too (e.g. type conversion, get_dummies, etc.).
Many of these DataFrame methods have NumPy equivalents. Go for NumPy if you can sacrifice readable labels for speed.
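For instance, the masking step can be pushed down to NumPy with np.where on the column values; a rough sketch using the same df and placeholder column names as above:

import numpy as np
import pandas as pd

for col in ['OWN_OCCUPIED', 'FIRST_NAME', 'LAST_NAME', 'STREET_NAME']:
    # Coerce to a float array: genuine strings become NaN, numbers survive.
    coerced = pd.to_numeric(df[col], errors='coerce').to_numpy()
    # Keep the original value where coercion failed (a genuine string),
    # write NaN where a number was found.
    df[col] = np.where(np.isnan(coerced), df[col], np.nan)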