In case it saves you time…
The goal of this one is to eliminate manual inspection of columns. When a CSV is read in, we usually end up with a few numeric columns that are stored as strings on account of non-numeric entries - such as ‘n/a’ or ‘?’, or (say) price values that include the $ sign, etc…
This one looks at the columns that are non-numeric after read-in (i.e., dtypes will not a numeric type like float64) and (after eliminating space, $, and comma)
if number of numeric items is > 50% of total (use regex to match numbers)
and number of unique non-numeric elements is LESS than (3 OR total number of non-numeric elements)
then
report this column as numeric and which entries (values) need to be addressed…
Try it on the car-prices dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data) :
Cleaning done by removing $,space,comma prior to inspection and..
These columns should be considered numeric and subjected to cleaning :
Name : Entries to address
normalized-losses : ?
bore : ?
stroke : ?
horsepower : ?
peak-rpm : ?
price : ?
On the WNBA dataset (https://www.kaggle.com/jinxbe/wnba-player-stats-2017) :
Cleaning done by removing $,space,comma prior to inspection and..
These columns should be considered numeric and subjected to cleaning :
Name : Entries to address
Experience : R
Code :
def find_numeric_columns( df_in ) :
""" DataFrame --> list of strings (col names) and printed report"""
# operates on the columns which are not already recognized as numeric
# based on analysis, which columns should be considered numeric because
# the majority of entries are numbers? (If diversity of non-numeric values too high (> 20%),
# report as non-numeric )
df = df_in.copy()
numerics = {}
non_numerics = {}
candidates = df.select_dtypes( include='object').columns
for cand in candidates :
df[cand] = df[cand].str.replace('[$, ]', '') # hitting space, $ and comma only
non_num = df.loc[ ~df[cand].str.match( pat='^[+-]?(\d+|\d*\.\d+|\d+\.\d*)([eE][-+]?[0-9]+)?$'), cand]
l_nn = len( non_num )
ls_nn = len( set(non_num ) )
if l_nn < 0.5 * df.shape[0] : # enough numeric items to classify this column
if ls_nn < 3 or ls_nn < 0.2 * l_nn : # diversity low enough
numerics[cand] = set(non_num)
else :
non_numerics[cand] = set(non_num)
print("Cleaning done by removing $,space,comma prior to inspection and..\n")
if len( numerics ) > 0 :
print( "These columns should be considered numeric and subjected to cleaning :")
print( " Name : Entries to address")
for col in numerics.keys() :
print( "{} : {}".format( col, ','.join(numerics[col])))
if len( non_numerics ) > 0 :
print( "\nThese columns have too many unique non-numeric items and deserve a closer look :")
for col in non_numerics.keys() :
print( "{} : {}".format( col, ','.join(non_numerics[col])))
return list(numerics.keys())