
How to identify test_data with bad characters

Hello Community. I am currently on this mission. It's clear, but…
Screen Link: https://app.dataquest.io/m/351/cleaning-and-preparing-data-in-python/7/parsing-numbers-from-complex-strings-part-one

I am curious how Dataquest came up with this test_data. How can we identify the values with bad characters when working with a large volume of data?
The data is below.
My Code:
```
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]
```

thank you

Hi @eeiohanwe:
Basically the purpose of this exercise is to remove all the unwanted characters and cleanse the list so the end result looks something like this:
stripped_test_data = ["1912", "1929", "1913-1923", "1951", "1994", "1934", "1915", "1995", "1912", ...]

Here is the correct output for your reference:
[image: expected output]

Thus bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "] is defined to let Python know which characters should be removed later on. You are supposed to create a function strip_characters that removes all the bad_chars, and then append the cleaned data (without the bad_chars) to a new list, stripped_test_data.
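For reference, here is a minimal sketch of what such a function could look like. The exact signature the mission expects is not shown in this thread, so taking a single string and returning the cleaned string is an assumption.

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]


def strip_characters(string):
    """Return the string with every character listed in bad_chars removed.

    Assumed signature: one string in, one cleaned string out.
    """
    for char in bad_chars:
        string = string.replace(char, "")
    return string


# Clean each value and collect the results in a new list.
stripped_test_data = []
for value in test_data:
    stripped_test_data.append(strip_characters(value))

print(stripped_test_data)
```

Note that the cleaned values are still strings, and ranges such as "1913-1923" keep their hyphen because "-" is not in bad_chars.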

Hope this helps.

Thank you @masterryan.prof. My confusion is as follows: if Dataquest hadn't given me the test_data in the visual editor, how would I identify these values?

Because if I can identify the content of the test_data, I can then create the bad_chars list.

OK @eeiohanwe: let's say you are given the raw data in a CSV file. You can open the CSV file and look at the uncleaned data. Then, based on your requirements and how you imagine the clean data should look, come up with a bad_chars list containing the unwanted characters to be removed.
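When the file is too large to read by eye, one way to do the same inspection is to count every character that appears in the column. This is only a sketch: the column name "date" and the sample rows are made up for illustration, and io.StringIO stands in for a real file that you would open with open(..., newline="").

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for a real CSV file on disk.
raw_csv = io.StringIO(
    "title,date\n"
    "A,1912\n"
    "B,(1951)\n"
    "C,c. 1915\n"
    "D,1913-1923\n"
)

# Count how often each character occurs in the "date" column.
char_counts = Counter()
for row in csv.DictReader(raw_csv):
    char_counts.update(row["date"])

# Print the characters from most to least frequent; rare characters at the
# bottom of the list are good candidates to look at more closely.
for char, count in char_counts.most_common():
    print(repr(char), count)
```

You still have to decide which of the listed characters are unwanted, but you no longer need to scroll through every row to find them.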

Hope this answers your question.

Yes @masterryan.prof, it does answer my question. That means there is no function or automatic way of identifying them. Thanks for your support.

Yep @eeiohanwe. If my answer helped you, do you mind marking it as the solution? Thanks!

Hi @eeiohanwe,

When I was going through this mission, I never questioned the character set given! I just accepted that this is how it is done. I'm glad you asked this question, because now I am rethinking it.

As @masterryan.prof mentioned, I think we have to go through the dataset to work out which characters are bad. Thinking about it from that perspective, in this case it is easier to write some specific code to find the bad characters. Depending on the case, we have to come up with different logic to find these so-called bad characters.

After seeing this discussion, I tried to come up with some code. Here is what I thought might work in this case.

bad_char = []                                   # empty list to store the bad characters

for string in test_data:                        # loop through each string
    for letter in string:                       # loop through each character
        # digits = "0123456789"                 # option 1: spell out the accepted characters
        # if letter not in digits:              # use this line with option 1
        if ord(letter) not in range(48, 58):    # option 2: ord() gives the ASCII value; digits 0 to 9 are 48 to 57
            if letter not in bad_char:          # skip characters that are already in the list
                bad_char.append(letter)

print(bad_char)

You can do this in two different ways: either type out all the accepted characters, or use the ord() function, since the ASCII values of the digits 0 to 9 are 48 to 57.
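One thing to watch for: the loop above treats every non-digit as bad, so it also reports "-", which the mission's bad_chars leaves out so that ranges like "1913-1923" survive. A possible variation, sketched here with sets, is to state the characters you want to keep and report everything else. The choice of digits plus hyphen as the allowed set is an assumption based on this test_data.

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

# Assumption: digits and the hyphen used in year ranges are the only
# characters we want to keep.
allowed = set("0123456789-")

# Every distinct character in the data, minus the ones we allow.
unexpected = set("".join(test_data)) - allowed
print(sorted(unexpected))

# The values that contain at least one unexpected character.
flagged = [value for value in test_data if set(value) - allowed]
print(flagged)
```

For this data, the unexpected characters are exactly the eight entries of the mission's bad_chars list, and flagged shows which values need cleaning.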

Let me know if it helps.

Thank you @jithins123. I will try out this code.


Let me know if it worked out for you.

Wow this code is simple and easy to understand and the idea is just WoW!
