A more thorough data cleanse

Hello everyone,

Wouldn’t it be better to have a list of “good” characters instead of a list of “bad” characters?
Consider: we have 17k rows in the moma list, so there could be tens, hundreds, maybe even thousands of potential “bad” characters.
On the other hand, the data we need consists only of digits and maybe dashes: good_chars = ["1","2","3","4","5","6","7","8","9","0","-"], and this would cover all dates.

So the current function would be:
def strip_characters(string):
    for char in string:
        if char in bad_chars:
            string = string.replace(char, "")
    return string

This works, but only if every possible bad character is known in advance, and as I mentioned above, there could be many more than the ones listed.
The more thorough function would go:
def strip_characters(string):
    for char in string:
        if char not in good_chars:
            string = string.replace(char, "")
    return string
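
The same "good characters" idea can also be sketched more compactly with a set and str.join. This is only a sketch: the names GOOD_CHARS and keep_good_characters are made up for illustration and are not from the course code.

```python
# Hypothetical names, for illustration only.
# A set gives fast membership checks for the allowed characters.
GOOD_CHARS = set("0123456789-")

def keep_good_characters(string):
    """Return string with every character outside GOOD_CHARS removed."""
    return "".join(char for char in string if char in GOOD_CHARS)

print(keep_good_characters("(c. 1917)"))  # digits survive, the rest is dropped
print(keep_good_characters("1976-77"))    # dashes are kept as well
```

Because it builds a new string in a single pass, it avoids calling replace once for every unwanted character.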

Eager to hear your thoughts!
Thank you!


Welcome to the community!

It depends.

If you have more bad characters than good ones, then it is fine to create a list of good characters.
In general, create the list for whichever group is smaller.

Please check out these guidelines for asking good questions.

Thank you for the reply!
I was referring to the situation at hand, with the moma list. Of course it will depend on the situation. In this particular case, though, the goal was to cleanse the row containing years, which we want to return either as an int or as a range of years. If we weren’t given all the characters that can appear in the cell, the easier way would be to define the good characters, since there are only 11 of them, while the number of bad ones is unknown.
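
To make that goal concrete, here is a rough sketch. It assumes a cleaned value is either a single year such as "1917" or two years joined by a dash such as "1912-1914". The helper name parse_years, returning a tuple for a range, and returning None for empty cells are all my own assumptions, not part of the course code.

```python
# Hypothetical helper, for illustration only.
GOOD_CHARS = set("0123456789-")

def parse_years(raw):
    """Keep only good characters, then return an int, a (start, end) tuple, or None."""
    cleaned = "".join(char for char in raw if char in GOOD_CHARS)
    # Split on dashes and drop empty pieces left by stray or doubled dashes.
    parts = [part for part in cleaned.split("-") if part]
    if not parts:
        return None              # nothing usable in the cell
    if len(parts) == 1:
        return int(parts[0])     # a single year
    return (int(parts[0]), int(parts[-1]))  # first and last year as a range

print(parse_years("(1912)"))
print(parse_years("c. 1912-1914"))
```

Abbreviated ranges like "1976-77" are not expanded here, so they would still need separate handling.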
Again, thank you for the reply!
I will mark your answer as Solved!

So thoughtful of you. Thank you!