Python for Data Science: Intermediate -> Mission 8/10 from Cleaning and Preparing Data in Python

I’ve been stuck on this mission for a day now. My code works with the test data that was given, and I have implemented a solution that I think should satisfy the final requirement, but the output (moma, a list of lists) doesn’t seem to be affected by the final loop.

#takes a string and removes the bad characters
def strip_characters(string):
for char in bad_chars:
string = string.replace(char,"")
return string

#empty list made
processed_test_data = []

#input string will be split on “-” character
def process_date(string):

#2 items will be converted to ints and the average will be returned
if "-" in string:
    return int(round((int(string.split("-")[0]) + int(string.split("-")[1])) / 2))

# otherwise just convert string to int and return it
else:
    return int(string)

#for each item in the list, run it through process_date() and append to list processed_test_data
for item in stripped_test_data:
processed_test_data.append(process_date(item))

#for row in moma (list of lists)
for row in moma:

#date_one variable is assigned value of item at row index 6
date_one = row[6]

#row is then assigned back the value after it runs through strip_characters() and process_date()
#I did try assigning the cleaned value back to row[6] but that didn't work and gave more errors
row = process_date(strip_characters(date_one))

Hi @calvintirrell, I tried debugging your code by copy-pasting it into the DQ interpreter. Since the formatting was lost, I am not sure exactly which error you are receiving, but I might have spotted something.
This is the code I ran in the interpreter:

test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]


bad_chars = ["(",")","c","C",".","s","'", " "]

#takes a string and removes the bad characters
def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
        return string

#empty list made
processed_test_data = []

#input string will be split on “-” character
def process_date(string):
#2 items will be converted to ints and the average will be returned
    if "-" in string:
        return int(round(
                    (int(string.split("-")[0]) + int(string.split("-")[1])
                ) / 2))
# otherwise just convert string to int and return it
    else:
        return int(string)
    
#for each item in the list, run it through process_date() and append to list processed_test_data
for item in stripped_test_data:
    processed_test_data.append(process_date(item))

#for row in moma (list of lists)
for row in moma:
    date_one = row[6]
    stripped_date = strip_characters(date_one)
    processed_date = process_date(stripped_date)
    row = processed_date

The latest traceback of the error is telling you this:

<ipython-input-1-0371b29aaa10> in process_date(string)
     27 # otherwise just convert string to int and return it
     28     else:
---> 29         return int(string)
     30 
     31 #for each item in the list, run it through process_date() and append to list processed_test_data

ValueError: invalid literal for int() with base 10: '1941)'

This is the very first data entry that should be stripped. It looks like your stripping function worked only on the first bad character, the ‘(’, and left the second one alone, as if it had stopped working. That is exactly what happened: the return statement is indented one level too deep, inside the for loop:

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
        return string   #by setting return INSIDE the loop, you run through it only ONCE

Dedent the return statement so the function loops through all the bad characters before returning, and you’ll have this covered.

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
    return string   #by setting return OUTSIDE the loop, every bad character gets removed
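
To see the difference concretely, here is a minimal sketch that runs both versions on the value from the traceback. The name strip_characters_early_return is made up for this comparison, and "(1941)" is inferred from the '1941)' in the error message, so treat both as assumptions.

```python
bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

# Hypothetical name: the buggy version, with return inside the loop
def strip_characters_early_return(string):
    for char in bad_chars:
        string = string.replace(char, "")
        return string  # exits after handling only the first bad character

# The fixed version, with return after the loop has finished
def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char, "")
    return string

# Assumed input, inferred from the '1941)' shown in the traceback
raw_date = "(1941)"

buggy_result = strip_characters_early_return(raw_date)
fixed_result = strip_characters(raw_date)

print(repr(buggy_result))  # only "(" was removed, so int() would still fail
print(repr(fixed_result))  # all bad characters removed
```

The buggy version only ever removes "(", which is why int() later choked on the leftover ")".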

Two hints from my side:

  • Avoid overusing method chaining, for the sake of readability. It is not a bad thing to have more variables in the code and keep the reading vertical rather than horizontal. E.g. the line return int(round((int(string.split("-")[0]) + int(string.split("-")[1])) / 2)) is far too convoluted for me; I think it would be much simpler to spread it over several lines, such as:
date_to_list =  string.split("-")
date_part1 = int(date_to_list[0])
date_part2 = int(date_to_list[1])
final_date = round((date_part1 + date_part2)/2)
  • You do need to assign the cleaned value back to row[6], so do that to finish the exercise.
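
Both hints can be sketched together as below. This is only an illustration: the two moma rows are made-up stand-ins for the real dataset (only the column at index 6 matters here), and the variable names inside process_date are my own choice. It also shows why row = ... leaves moma untouched: it only rebinds the loop variable, while row[6] = ... changes the list that moma actually holds.

```python
bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char, "")
    return string

def process_date(string):
    # A range such as "1913-1923" becomes the rounded average of both years
    if "-" in string:
        date_parts = string.split("-")
        date_part1 = int(date_parts[0])
        date_part2 = int(date_parts[1])
        return round((date_part1 + date_part2) / 2)
    # Otherwise just convert the string to an int
    return int(string)

# Hypothetical rows standing in for the real moma data
moma = [
    ["Title A", "Artist A", "n/a", "n/a", "n/a", "n/a", "(1941)"],
    ["Title B", "Artist B", "n/a", "n/a", "n/a", "n/a", "c. 1913-1923"],
]

# Rebinding the loop variable: moma itself is not modified
for row in moma:
    row = process_date(strip_characters(row[6]))
dates_after_rebinding = [r[6] for r in moma]

# Assigning to row[6]: the row inside moma is modified in place
for row in moma:
    row[6] = process_date(strip_characters(row[6]))
dates_after_assignment = [r[6] for r in moma]

print(dates_after_rebinding)
print(dates_after_assignment)
```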

Hope it helped!


Hey, thanks for the in-depth feedback and help! I just realized last night that you can actually download the csv files and work on the code in a different IDE instead of working entirely online inside DQ’s editor. I’ll take a look at this more tonight!

I wanted to respond in more depth after my initial reply, since I was on the bus to work this morning.

I have attached a screenshot of my original code from yesterday because I realize now that the formatting got all messed up. I posted the code before going to sleep last night and didn’t check it after posting.

I did originally have the methods unchained but was curious about how much I could chain together and still have it work. Ok, good to know about assigning the value back to row[6] - guess there are some other errors to figure out. Thank you again!

Just checked my code again, added in the row[6] assignment, and it all works now. I guess something about the platform last night was off? Either way, moving on to the next mission! Thanks!

Yay this is good! I read the code block above and it did look ok - happy you solved it :)