Data cleaning of a million-row dataset

Having gone through some of the lessons on data cleaning, I was wondering how one can know what to clean in a dataset that consists of a million rows. How does one check for dirty data in that many rows?

There are some methods that allow the user to get information from a column no matter how many rows it has.

You can use Series.unique() to see all the unique values in a column.

Series.isnull().sum() to count the null values.

Series.value_counts() to count the number of times each value appears in the column.

These are just some examples. There are many others depending on what you are trying to do. Take a look at the documentation to see more.
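As a quick sketch, here are those three calls on a small illustrative DataFrame (the column names and values are made up for the example; the same calls scale to millions of rows because each one does a single pass over the column):

```python
import pandas as pd

# Tiny example frame; with a million rows these calls work the same way.
df = pd.DataFrame({
    "city": ["Lagos", "Abuja", "Lagos", None, "Kano"],
    "sales": [100, 250, 100, 80, None],
})

# All distinct values in the column, including NaN
print(df["city"].unique())

# How many nulls the column contains
print(df["city"].isnull().sum())

# How often each value appears (NaN is excluded by default)
print(df["city"].value_counts())
```

Running value_counts() on every column is often enough to spot typos, inconsistent capitalization, and suspicious outlier categories without ever scrolling through the raw rows.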

There’s also a newer library worth a look: sidetable.

It makes it very easy to get

  1. missing values and their percentages
  2. data summaries
  3. aggregations with thresholds, etc.
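sidetable registers an extra accessor on DataFrames after you install and import it (e.g. df.stb.missing() for the missing-value report). In case you don't have it installed, here is a hedged sketch of the equivalent missing-value summary in plain pandas; the column names "missing", "total", and "percent" below are my own choices, not sidetable's exact output format:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Lagos", "Abuja", None, None, "Kano"],
    "sales": [100, 250, 80, None, 90],
})

# Count and percentage of missing values per column,
# roughly what sidetable reports via df.stb.missing().
missing = pd.DataFrame({
    "missing": df.isnull().sum(),
    "total": len(df),
})
missing["percent"] = 100 * missing["missing"] / missing["total"]
print(missing)
```

A one-line table like this per column is a fast way to decide which columns need imputation or dropping before you dig into row-level cleaning.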



Wow, I’ll experiment with this.