Cleaning data set

Good day everyone. Can someone please explain the following to me? When we are given a data set, do we only have to look for duplicates and missing data, or is there something else we should look for? Secondly, how do you find missing data in a very large data set? Thirdly, in the guided project, is there missing data in the googleplay data or applestoredata as of today, 12 November? All the rows seem to be the same length.

Hi there!
I’d say it’s up to you how much you want to clean. Depending on the data set you may want to:

  • Check whether the data is logical (for example, if the dataset describes the past, there shouldn’t be any dates that are in the future);
  • Unify the format of the data (should text be all upper or lower case, should some of the data be converted to float, etc.);
  • Check if some of the values can be imputed.
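To make the three checks above a bit more concrete, here is a minimal sketch in pandas. The toy data, the column names (app, price, released) and the reference date are all made up for illustration, and filling with the median is just one possible imputation choice, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "app": [" Maps", "maps ", "Chess", "Notes"],
    "price": ["0", "1.99", "0.99", None],
    "released": ["2019-05-01", "2019-05-01", "2030-01-01", "2020-03-15"],
})

# Unify the format: trim whitespace and lower-case the text column.
df["app"] = df["app"].str.strip().str.lower()

# Convert price strings to float; values that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Logic check: flag release dates later than a reference date.
# The reference date is an assumption made for this example.
df["released"] = pd.to_datetime(df["released"])
reference_date = pd.Timestamp("2023-11-12")
future = df["released"] > reference_date
print(df[future])

# Impute: fill the missing price with the column median (one option of many).
df["price"] = df["price"].fillna(df["price"].median())
print(df)
```

Whether a flagged row should be corrected, dropped or kept depends on the dataset, so the sketch only prints the suspicious rows instead of changing them.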

To detect missing values I would use the pandas isnull() or isna() methods; they are aliases of each other, and both work on dataframes and on series. I'm not sure whether, or how, the size of the dataset changes how difficult they are to use.
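As a rough sketch of how that can look, the example below counts missing values per column and pulls out the rows that contain at least one. The dataframe and its column names are hypothetical. Because the result is a short summary instead of the whole table, the same calls stay readable on a large dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "app": ["maps", "chess", None, "notes"],
    "rating": [4.5, np.nan, 3.9, np.nan],
    "installs": [1000, 500, 250, 100],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Share of missing values in each column (between 0 and 1).
missing_share = df.isna().mean()
print(missing_share)

# Only the rows that have at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```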

Data cleaning, sometimes called data munging or data wrangling in the data science world, takes up a large part of a data analyst's or data scientist's time. It encompasses many different tasks, above and beyond dealing with duplicate and missing data. Here is a good website that explains some of the other tasks involved in data cleaning.

s.matelyte gave a good explanation of how to find missing data in a dataset. When the dataset is very large, it can be helpful to visualize where the missing data is. Seeing every missing value matters less than knowing whether the missing data follows a pattern and trying to figure out why the data is missing.
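Even before plotting anything, a pattern can be hinted at with plain pandas. The sketch below uses made-up data and column names, and an arbitrary block size of 4 rows. It summarizes how much is missing in each block of rows and checks whether columns tend to be missing together. The boolean mask it builds is also the kind of input a heatmap plot would take.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "rating" and "reviews" are missing together in later rows.
df = pd.DataFrame({
    "rating":  [4.1, 3.8, 4.5, 4.0, np.nan, np.nan, np.nan, 4.2],
    "reviews": [10, 25, 7, 12, np.nan, np.nan, np.nan, 30],
    "price":   [0.0, np.nan, 1.99, 0.0, 0.99, 0.0, 2.99, 0.0],
})

# True where a value is missing, False otherwise.
mask = df.isna()

# Share of missing values per column within each block of 4 rows.
# The block size is an arbitrary choice; use something larger on real data.
block = np.arange(len(df)) // 4
missing_by_block = mask.groupby(block).mean()
print(missing_by_block)

# Correlation between the missingness indicators of each pair of columns.
# Values near 1 mean the two columns tend to be missing in the same rows.
co_missing = mask.astype(int).corr()
print(co_missing)
```

In this toy data the second block shows most of the missing rating and reviews values, and the two columns are always missing in the same rows. That suggests a shared cause worth investigating, not random gaps.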

Here is a website that covers many of the topics of missing data visualization that DataQuest covers. Sometimes I find looking at different sources helps me to understand a topic better.

I can check the googleplay and applestoredata files for you if you let me know which guided project they are in.