How to identify test_data with bad characters

Hello Community. I am currently on this mission. It's clear, but…
But I am curious how Dataquest came up with this test_data. How can we identify the data with bad characters when working with voluminous data?
Below is the data.
My Code:
```
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]
```

Thank you.

Hi @eeiohanwe:
Basically the purpose of this exercise is to remove all the unwanted characters and cleanse the list so the end result looks something like this:
stripped_test_data = ["1912", "1929", "1913-1923", "1951", "1994", "1934", "1915", "1995", "1912", ...]

Thus bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "] is defined to let Python know which characters should be removed later on. You are supposed to create a function strip_characters that removes all the bad_chars from a string, and then append each cleaned string (without the bad_chars) to a new list, stripped_test_data.
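As a rough illustration, here is a minimal sketch of what such a function might look like. It assumes the test_data and bad_chars lists from the question and uses str.replace() to drop each unwanted character; the mission's reference solution may be written differently.

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    # Remove every occurrence of each bad character from the string.
    for char in bad_chars:
        string = string.replace(char, "")
    return string

stripped_test_data = []
for d in test_data:
    # Append the cleaned string to the new list.
    stripped_test_data.append(strip_characters(d))

print(stripped_test_data)
```

Note that the cleaned values are still strings, and the hyphen in ranges such as "1913-1923" is kept because it is not in bad_chars.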

Hope this helps.

Thank you. My confusion is as follows: if I hadn't been given the test_data in the visual editor by Dataquest, how would I identify the bad characters?

Because if I can identify the content of the test_data, I can then create the bad_chars list.

Ok @eeiohanwe: let's say you are given the raw data in a CSV file. You can open the CSV file and look at the uncleaned data, then, based on your requirements and how you imagine your clean data should look, come up with a bad_chars list containing the unwanted characters to be removed.
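If the file is too large to read by eye, one possible shortcut is to flag only the values that do not match the format you expect, and inspect those. The sketch below assumes the clean format is a four-digit year or a range like "1913-1923"; the pattern and the function name find_suspect_values are illustrative, not part of the mission.

```python
import re

# Assumed target format: a four-digit year, optionally followed by
# a hyphen and a second four-digit year (e.g. "1913-1923").
YEAR_PATTERN = re.compile(r"\d{4}(-\d{4})?")

def find_suspect_values(values):
    # Keep only the values that do not fully match the expected format.
    return [v for v in values if not YEAR_PATTERN.fullmatch(v)]

sample = ["1912", "1913-1923", "(1951)", "c. 1915", "c. 1970's"]
suspects = find_suspect_values(sample)
print(suspects)
```

Looking through the flagged values is usually much quicker than scanning the whole file, and it shows which characters need to go into bad_chars.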

Hope this answers your question.

Yes, it does answer my question. That means there is no function or automatic way of identifying them. Thanks for your support.

Yep @eeiohanwe. If my answer helped you, do you mind marking it as the solution? Thanks!

Hi @eeiohanwe,

When I was going through this mission, I never questioned the character set given! I just accepted that this is how it is done. I'm glad you asked this question, because now I am rethinking it.

As mentioned above, I think we have to go through the dataset to work out what the bad characters are. I've thought about it from that perspective, and I think in this case it is easier to write some specific code to find them. Depending on the case, we will have to come up with different logic to find these so-called bad characters.

After seeing this discussion, I tried to come up with some code. Here is what I thought might work in this case.

bad_char = []                                         # empty list to store bad characters

for string in test_data:                              # loop through each string
    for letter in string:                             # loop through each character
        # digits = ['1','2','3','4','5','6','7','8','9','0']  # option 1: list the accepted characters
        # if letter not in digits:                    # use this line if you are using the list
        if ord(letter) not in range(48, 58):          # option 2: ord() gives the ASCII value; digits 0 to 9 are 48 to 57
            if letter in bad_char:                    # check whether the character is already in the list
                continue                              # if it is, continue with the next iteration
            else:                                     # otherwise, append it to the list
                bad_char.append(letter)

Here you can try two different approaches: either type out a list of all accepted characters, or use the ord() function to check against the ASCII values of the digits 0 to 9, which are 48 to 57.
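The same idea can also be sketched with collections.Counter, which additionally shows how often each unexpected character occurs. The function name find_unexpected_chars and the allowed parameter are assumptions made for this example. Since a digits-only check also reports the hyphen in ranges like "1913-1923", the sketch lets you add characters you want to keep.

```python
from collections import Counter
import string

def find_unexpected_chars(values, allowed=string.digits):
    # Count every character that is not in the allowed set.
    counts = Counter()
    for value in values:
        for ch in value:
            if ch not in allowed:
                counts[ch] += 1
    return counts

sample = ["1912", "(1951)", "c. 1915", "1913-1923", "c. 1970's"]

# With digits only, the hyphen is reported along with the real bad characters.
unexpected = find_unexpected_chars(sample)
print(unexpected)

# Allowing the hyphen as well leaves only the characters worth removing.
candidates = find_unexpected_chars(sample, allowed=string.digits + "-")
print(sorted(candidates))
```

The result still needs a human decision: the code only lists what is unusual, and you decide which of those characters are actually bad for your data.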

Let me know if it helps.

Thank you @jithins123. I will try out this code.

Let me know if it worked out for you.

Wow this code is simple and easy to understand and the idea is just WoW!