Dear Support,
I am working on a dataset that has categorical data in some columns. Two columns have a large amount of missing data, more than 5% of the entire data, and I am thinking about filling them in because the gaps will significantly affect the analysis. I am not sure which method is best for filling the data, but I have thought of the following:
- Use the mode. This would give too much weight to one particular category.
- Divide the number of missing values in a column by the number of categories, and assign that many of the missing values to each category so they are spread equally.
- Randomly fill the missing values with values drawn from the column itself.
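To make the three options concrete, here is a rough sketch of how I think each one could be written in pandas. The function names and the small `toy` series are made up for illustration and are not part of my real data; the random seed is only there to make the result repeatable.

```python
import numpy as np
import pandas as pd


def fill_with_mode(s):
    """Option 1: replace every missing value with the most frequent category."""
    return s.fillna(s.mode().iloc[0])


def fill_equal_split(s, seed=0):
    """Option 2: spread the missing values as evenly as possible over all categories."""
    rng = np.random.default_rng(seed)
    out = s.copy()
    missing = out.isna()
    cats = out.dropna().unique().tolist()
    n = int(missing.sum())
    # Cycle through the categories until every gap has a value, then shuffle
    # so that no category is tied to a particular block of rows.
    fill = [cats[i % len(cats)] for i in range(n)]
    rng.shuffle(fill)
    out.loc[missing] = fill
    return out


def fill_random_observed(s, seed=0):
    """Option 3: draw each replacement at random from the observed values."""
    rng = np.random.default_rng(seed)
    out = s.copy()
    missing = out.isna()
    observed = out.dropna().to_numpy()
    out.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return out


# Hypothetical toy column: 4 observed values and 4 missing ones.
toy = pd.Series(
    ["Health", "Health", "Retail", "IT", None, None, None, None],
    name="category",
)

print(fill_with_mode(toy).value_counts())
print(fill_equal_split(toy).value_counts())
print(fill_random_observed(toy).value_counts())
```

As far as I can tell, the third option draws from the observed values, so it should keep the existing category proportions roughly intact on average, whereas the first inflates the largest category and the second inflates the small ones.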
Below are the frequencies of each category and the number of null values in the category column.
cust_demg['category'].value_counts()
Manufacturing 799
Financial Services 774
Health 602
Retail 358
Property 267
IT 223
Entertainment 136
Argiculture 113
Telecommunications 72
cust_demg.isnull().sum()
category 656
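To put the missing share in numbers, here is a small sketch that rebuilds the counts above by hand. The names `counts`, `n_missing` and `n_total` are just for illustration; on the real data, `cust_demg['category'].isnull().mean()` should give the same fraction directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
# (category spelling kept exactly as it appears in the data).
counts = pd.Series({
    "Manufacturing": 799,
    "Financial Services": 774,
    "Health": 602,
    "Retail": 358,
    "Property": 267,
    "IT": 223,
    "Entertainment": 136,
    "Argiculture": 113,
    "Telecommunications": 72,
})
n_missing = 656

# value_counts() skips nulls by default, so the row total is the sum of both.
n_total = int(counts.sum()) + n_missing

print(n_total)                        # 4000
print(f"{n_missing / n_total:.1%}")   # 16.4%
```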
Which of the options above is best, or is there another option I have not thought of?