Data cleaning and wrangling is a universal process that applies to almost all the data we come across, because in real life we would hardly ever find a dataset that is ready to work with. So it shouldn’t matter much whether we are working in SQL, Python, R or C#.
A few things I could think of (in no particular order and regardless of the dataset):
- In the Oracle database we have something called the NVL function. It substitutes a default value for NULLs, say 0 for a numeric field and “Unknown” for text/string columns.
- For missing information you will have to think hard and decide whether to keep the row as-is, drop it, or impute the value, using the mean or mode, say.
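In Python, for instance, the same two ideas, NVL-style defaulting and mean/mode imputation, might look like this minimal sketch (the rows and column names are made up for illustration):

```python
from statistics import mean

# Toy rows with missing values; None plays the role of SQL NULL.
rows = [
    {"name": "Ann", "age": 34, "city": "Pune"},
    {"name": "Raj", "age": None, "city": None},
    {"name": "Mia", "age": 28, "city": "Pune"},
]

# NVL-style defaulting: substitute a fixed fallback for NULL text columns.
for r in rows:
    if r["city"] is None:
        r["city"] = "Unknown"

# Imputation: fill missing ages with the mean of the known ones
# (statistics.mode would be the analogue for categorical columns).
known = [r["age"] for r in rows if r["age"] is not None]
fill = mean(known)
for r in rows:
    if r["age"] is None:
        r["age"] = fill
```

The SQL equivalent of the first step is simply `NVL(city, 'Unknown')` in the SELECT list.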
- A column like Age should be numeric (integer-based) no matter what the unit of measure: years, months, days, decades.
- A column like Weight could be a float. Both Age and Weight are used in calculations, so storing them as strings makes little sense.
- Columns like Rating (Strongly Disagree to Strongly Agree) can be converted to numbers, provided the conversion doesn’t lose their meaning.
- As in: if Strongly Disagree is 0 and Strongly Agree is 6, and the average score comes out to 4.5, we can conclude that the sample generally agrees. We are placing the 4.5 on a scale of 0 to 6, not treating it as a true arithmetic quantity.
- Say a different column scores 9 on a scale of 0 to 10. Even though 9 is 2 × 4.5, we can’t say that sample agrees twice as much.
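A sketch of that rating conversion in Python (the labels, values and responses below are invented for illustration):

```python
# Map Likert-style labels to ordinal scores 0..6.
scale = {
    "Strongly Disagree": 0, "Disagree": 1, "Somewhat Disagree": 2,
    "Neutral": 3, "Somewhat Agree": 4, "Agree": 5, "Strongly Agree": 6,
}
responses = ["Agree", "Strongly Agree", "Somewhat Agree", "Agree"]
scores = [scale[r] for r in responses]
avg = sum(scores) / len(scores)

# avg lands near the "Agree" end, so the sample generally agrees.
# These numbers are ordinal, though: a 9/10 elsewhere is not "twice as
# much agreement" as a 4.5/6 here, only higher on its own scale.
```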
Proper text in string/char/varchar columns:
- “Red”, “red”, “Red ”, “ RED ” and in some cases “Reds” should all mean the same thing. The extra spaces, inconsistent casing and stray characters need to be dealt with.
- You will commonly come across these issues with salutations like “MS.”, “Ms.”, “Ms”, “Mr.”, “Mr” etc.
- Sometimes you may come across data like “Dept.”, “Department”, “Division”, “Div.”, “Faculty”. Although some of these are very different words, they may all mean the same thing for your purposes; whether to merge them depends on the kind of analysis you want to perform.
- For text-column analysis we may also need to deal with stop/common words.
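A minimal Python sketch of this kind of text normalization (the alias table is a made-up example; real mappings need domain review):

```python
def normalize(value: str) -> str:
    """Collapse whitespace and case so 'Red ', ' RED ', 'red' all match."""
    return " ".join(value.split()).casefold()

raw = ["Red", "red", "Red ", " RED "]
cleaned = {normalize(v) for v in raw}          # collapses to one value: "red"

# Variants like "Reds" or "MS." need an explicit alias table;
# what counts as "the same" is a domain decision.
aliases = {"reds": "red", "ms.": "ms", "mr.": "mr"}
canonical = [aliases.get(normalize(v), normalize(v))
             for v in ["Reds", "MS.", "Mr"]]
```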
Columns that convey date-time information need to be in a proper date-time format so that the various date-time functions can be applied.
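For example, in Python, only after parsing text into real datetimes does date arithmetic become possible (the dates and format string below are assumptions for illustration):

```python
from datetime import datetime

raw = ["2023-01-15", "2023-03-01"]            # dates stored as strings
dates = [datetime.strptime(s, "%Y-%m-%d") for s in raw]

# With real datetime objects we can subtract, bucket by month, etc.
gap_days = (dates[1] - dates[0]).days
```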
Identifying and classifying inliers and outliers.
- Once you are able to set a criterion, you will have to decide whether to include or exclude them.
- Or you could divide the dataset based on the data range. This idea might generate its own questions and doubts; it came to me while learning/searching for something with the data at hand, but I haven’t actually tried it myself. (I could write a complete book of APIs: Another Parked Idea!)
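One common criterion for the inlier/outlier split is the 1.5 × IQR rule; a minimal sketch in plain Python (toy numbers):

```python
from statistics import quantiles

values = [10, 12, 11, 13, 12, 95, 11, 12]

# Lower and upper quartiles, then the interquartile range.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if not lo <= v <= hi]
inliers = [v for v in values if lo <= v <= hi]
```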
Repeated/duplicate data.
- Entire-row duplication is easy to take care of. The trouble starts when only one column differs.
- For example, two people can have the same name but one may be married and the other unmarried. Also, names like Taylor are gender-neutral.
- But if everything about two people is the same except their address, it might indicate that the same person changed address, but an INSERT statement was used instead of an UPDATE. (Assume for a minute that no primary-key constraint was applied here.)
- In that scenario, the best course is to at least identify and flag this data, and then bring domain knowledge into the discussion and decision process.
- In case the outliers are included, you will have to look for algorithms and ML processes that are robust to their presence.
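The duplicate cases above can be sketched in Python like this (names and addresses are invented):

```python
# Each row: (name, marital_status, address).
rows = [
    ("Taylor", "married", "12 Oak St"),
    ("Taylor", "married", "12 Oak St"),      # exact duplicate: safe to drop
    ("Taylor", "unmarried", "12 Oak St"),    # could be a different person
    ("Taylor", "married", "9 Elm Rd"),       # same person, new address?
]

# Exact duplicates: collapse on the full tuple, preserving order.
deduped = list(dict.fromkeys(rows))

# Near-duplicates (exactly one field differs) are only flagged, not
# dropped; the final call belongs to domain knowledge.
suspects = [
    (a, b)
    for i, a in enumerate(deduped)
    for b in deduped[i + 1:]
    if sum(x != y for x, y in zip(a, b)) == 1
]
```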
I am not sure whether scaling, normalization and standardization belong in the cleaning process, but they are often crucial for data preparation.
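For reference, the two most common rescalings, sketched with toy numbers:

```python
from statistics import mean, stdev

values = [2.0, 4.0, 6.0, 8.0]

# Min-max scaling: squash everything into [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: mean 0, unit (sample) standard deviation.
mu, sigma = mean(values), stdev(values)
zscores = [(v - mu) / sigma for v in values]
```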
If you Google, you will find more steps/procedures depending on the dataset at hand. And yes, this entire process is iterative: clean, prepare the data, apply an algorithm, observe the results, clean some more, apply an algorithm again (the same or a different one), observe the results… and so on.
And as usual, Wikipedia to the rescue: Data cleansing - Wikipedia and Tidy data - Wikipedia.
If somehow you are still here and not too disappointed with the post, let me try to disappoint you now.
I don’t have a straight answer regarding functions. It depends on the data at hand! (“No comment” in the data science world, I guess.)
The functions are quite similar in nature across the various SQL engines, provided the database engine supports them. You could pick them up topic-wise, like this: SQL NULL Functions - SQL Database Tutorials.
Let me know your thoughts, and whether this post helped you.