[SQL - General] Database cleaning best practices/workflows?

Hello everyone,

I’m curious whether anyone would like to share a specific workflow or best practices in SQL for prepping/cleaning data for analysis. For example, what would you look for? Which functions are most important? How do you fix common problems?

I am a marketing manager with only fundamental SQL skills. I have applied for an analytics position and will be given a take-home assignment in the next few days. I strongly suspect I will be asked to explain how I would clean a messy dataset in SQL.

So, just in case, I was hoping you guys could point me in the right direction :slight_smile:

Thank you tons.

Hi @Tarmicle

I guess we are a bit late to respond to this post, so I am not sure whether you would still like to discuss this topic. If you have already worked out the application requirements, would you mind sharing what you learned so that it helps us learn too?

Thanks.

Hi @Rucha,

Uhm, no, actually I would still love to hear about this topic, as I haven’t got much knowledge about it yet.

So any advice is still welcome!

Hi @Tarmicle

Data cleaning and/or wrangling is a universal process that applies to just about any data we come across, because in real life we hardly ever find a dataset that is ready to work with. So it shouldn’t matter much whether we are working in SQL, Python, R, or C#.

A few things I could think of (in random order and regardless of the dataset):

  • Missing data

    • In the Oracle database there is a function called NVL. It substitutes a default value for NULLs - say 0 for a numeric field and “Unknown” for text/string columns. (There is a small sketch after this list.)
    • You will have to think through whether to keep a row that has missing information, or whether to impute the value - using the mean or mode, say.
  • Mismatched datatypes

    • A column like Age should be numeric (integer-based), no matter what the unit of measure - years, months, days, decades.
    • A column like Weight could be a float. Both Age and Weight are used in calculations, so storing them as strings won’t make much sense.
    • A column like Rating (Strongly Disagree to Strongly Agree) can be converted to a numerical scale, provided it doesn’t lose its meaning.
      • For instance, if Strongly Disagree is 0 and Strongly Agree is 6 and the average score comes out to 4.5, we can conclude that the sample generally agrees. We are interpreting the 4.5 on a scale of 0 to 6, not as an absolute quantity.
      • If a different column scores 9 on a scale of 0 to 10, then even though 9 is 2 × 4.5, we can’t say that sample agrees twice as much - these are ordinal scales, not ratio measurements.
  • Proper texts in the string/ char/ varchar columns

    • “Red”, “red”, “Red ”, “ RED ” and in some cases “Reds” should all mean the same thing. The extra spaces, the different cases, and stray characters need to be dealt with (TRIM and UPPER/LOWER cover most of this - see the sketch after this list).
    • You may commonly come across these issues with salutations like “MS.”, “Ms.”, “Ms”, “Mr.”, “Mr”, etc.
    • Sometimes you may come across values like “Dept.”, “Department”, “Division”, “Div.”, “Faculty”. Although some of these are quite different words, they may all mean the same thing - whether to merge them depends on what kind of analysis you would like to perform.
    • For free-text analysis we may also need to deal with stop/common words.
  • Columns that convey date-time information. They need to be stored as an actual date/time type before the various date-time functions can be applied (the sketch after this list includes a simple conversion).

  • Identifying and classifying inliers and outliers.

    • Once you are able to set a criterion, you will have to decide whether to include or exclude them (a simple flagging sketch follows this list).
    • Or you might divide the dataset based on the data range. This idea can generate its own questions and doubts - it occurred to me while learning/searching for something with the data I had at hand, but I haven’t really tried it myself. (I could write a complete book of APIs - Another Parked Idea!)
  • Repeated/ Duplicate data.

    • Entire-row duplication is easily taken care of; the issue comes when only one variable is different (see the de-duplication sketch after this list).
    • For example, two people can have the same name, but one may be married and the other unmarried. Also, names like Taylor are gender-neutral.
    • But if everything about two people is the same except for their address, it might mean it’s the same person who changed their address, but an INSERT statement was used instead of an UPDATE. (Assume for a minute that a primary key constraint was not applied here.)
    • In that scenario, the best course is to at least identify and highlight this data, and then let domain knowledge drive the discussion and the decision.
    • If outliers are included, you will have to look for algorithms and ML processes that are robust to their presence.
  • I am not sure whether scaling, normalization, and standardization count as part of the cleaning process, but they are often crucial for data preparation.
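
Since a concrete query sometimes helps, here is a rough sketch of how the first few points (null defaults, tidying text, fixing datatypes, and date conversion) might look in a single pass. The table and column names are made up purely for illustration, and the syntax is close to standard SQL, so exact function names can differ a little between engines (date parsing especially):

```sql
-- Hypothetical table/columns, just to illustrate the ideas above.
SELECT
    customer_id,

    -- Missing data: substitute a default value for NULLs...
    COALESCE(age, 0)                    AS age_filled,
    COALESCE(TRIM(city), 'Unknown')     AS city_filled,

    -- ...or impute with the overall mean (needs window-function support).
    COALESCE(age, AVG(age) OVER ())     AS age_imputed,

    -- Proper text: strip extra spaces and unify the case.
    UPPER(TRIM(colour))                 AS colour_clean,

    -- Mismatched datatypes: numbers stored as text can't be used in calculations.
    CAST(weight_text AS DECIMAL(6, 2))  AS weight_kg,

    -- Date-time info stored as text: convert it so date functions can be applied
    -- (Oracle would use TO_DATE, MySQL STR_TO_DATE, and so on).
    CAST(signup_date_text AS DATE)      AS signup_date
FROM raw_customers;
```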
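
For the inliers/outliers point, a first pass can simply flag values that fall outside a plausible range rather than deleting them, so the include/exclude decision can be made later with domain knowledge. Again, the names are invented:

```sql
-- Flag suspicious ages instead of dropping them outright.
SELECT
    customer_id,
    age,
    CASE
        WHEN age IS NULL          THEN 'missing'
        WHEN age < 0 OR age > 120 THEN 'outlier'
        ELSE 'ok'
    END AS age_flag
FROM raw_customers;
```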
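
And for the duplicates point: exact duplicates are the easy case, while the “same person, new address” case can be handled by keeping only the latest row per person, assuming there is some timestamp to order by. A sketch with invented names, using a window function:

```sql
-- Exact duplicate rows are easy to drop.
SELECT DISTINCT *
FROM raw_customers;

-- Near-duplicates (same person, different address): keep the latest row per
-- person; rows with rn > 1 can then be reviewed with the domain experts.
SELECT *
FROM (
    SELECT
        rc.*,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name, date_of_birth
            ORDER BY updated_at DESC
        ) AS rn
    FROM raw_customers rc
) ranked
WHERE rn = 1;
```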

If you google around, you may find more steps/procedures depending on the dataset at hand. And yes, this entire process is iterative: clean, prepare the data, apply an algorithm, observe the results, clean some more, apply an algorithm again (the same or a different one), observe the results… and so on.

And, as usual, Wikipedia to the rescue: Data cleansing - Wikipedia and Tidy data - Wikipedia.

If somehow you are still here and not yet too disappointed with the post, let me try to disappoint you now:

I don’t have a straight answer for which functions to use - it depends on the data at hand! (The data science world’s version of “no comment”, I guess.)

The functions are quite similar in nature across the various SQL engines, provided they are supported by that engine. You could pick them up topic by topic, for example - SQL NULL Functions - SQL Database Tutorials
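
To make the “same idea, different function name” point concrete with the NULL example from earlier (table and column invented):

```sql
-- All four replace a NULL city with 'Unknown'; COALESCE is the portable,
-- standard-SQL option, the others are engine-specific.
SELECT COALESCE(city, 'Unknown') AS city_filled FROM customers;  -- standard SQL
SELECT NVL(city, 'Unknown')      AS city_filled FROM customers;  -- Oracle
SELECT IFNULL(city, 'Unknown')   AS city_filled FROM customers;  -- MySQL / SQLite
SELECT ISNULL(city, 'Unknown')   AS city_filled FROM customers;  -- SQL Server
```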

Let me know your thoughts and in case this helped you.

Edit: A few more things to add:

In SQL terms, this is also where joining different tables comes into the picture - provided you take care of redundant data as well.
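
As a small illustration (tables and columns invented): if the lookup table can hold duplicate keys, de-duplicating it before the join keeps the main table’s row count from being inflated.

```sql
-- De-duplicate the address table before joining, so each order matches at
-- most one row (this assumes one city per customer after DISTINCT).
SELECT
    o.order_id,
    o.customer_id,
    c.city
FROM orders o
LEFT JOIN (
    SELECT DISTINCT customer_id, city
    FROM customer_addresses
) c
    ON o.customer_id = c.customer_id;
```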

Wow @Rucha! That is an extremely detailed answer! Thanks a lot - I appreciate that you went through all that trouble. I also like the fact that you kept the context open to all languages, as the principles remain the same whether you are using SQL or Python.

I guess my best bet now is to get familiar with the functions that will let me do the things you mentioned. I’m sure the SQL course here at DataQuest will cover that.

Thanks again!

Hey @Tarmicle
I am glad the post is helpful. :slight_smile:

Hey @the_doctor, let us know your thoughts as well. Thanks.