Cleaning and Preparing Data in Python - Mission 8

Screen Link: https://app.dataquest.io/m/351/cleaning-and-preparing-data-in-python/8/parsing-numbers-from-complex-strings-part-two

My Code:

test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

bad_chars = ["(",")","c","C",".","s","'", " "]

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
    return string

stripped_test_data = ['1912', '1929', '1913-1923',
                      '1951', '1994', '1934',
                      '1915', '1995', '1912',
                      '1988', '2002', '1957-1959',
                      '1955', '1970', '1990-1999']

def process_date(date):
    if '-' in date :                   #Checks if the dash character (-) is in the string so we know if it's a range or not. 
        split_date = date.split("-")   #Splits the string into two strings, before and after the dash character
        date_one = split_date[0] 
        date_two = split_date[1]
        date = (int(date_one) + int(date_two)) / 2  #Converts the two numbers to the integer type and then average them by adding them together and dividing by two
        date = round(date) 
        
    else:
        date = int(date)   #Converts the value to an integer type
    return date

        
processed_test_data = []
for d in stripped_test_data:
    date = process_date(d)
    processed_test_data.append(date)

for row in moma:
    date = row[6]
    date = strip_characters(date)        #remove any bad characters.
    date = process_date(date)       #convert the date.
    row[6] = date



What I expected to happen:
Hi, when I do this exercise on Dataquest, I don't get any ValueError, but after transferring the code into Jupyter I get an error from int().

What actually happened:

ValueError                                 Traceback (most recent call last)
<ipython-input-24-662bbd224d8b> in <module>
     40     date = row[6]
     41     date = strip_characters(date)        #remove any bad characters.
---> 42     date = process_date(date)       #convert the date.
     43     row[6] = date

<ipython-input-24-662bbd224d8b> in process_date(date)
     28 
     29     else:
---> 30         date = int(date)   #Converts the value to an integer type
     31     return date
     32 

ValueError: invalid literal for int() with base 10: 'Date'

Also, I had this error in the 6th mission, and again the problem was converting a date to an integer.

How can I solve this in Jupyter? Thank you for your help!

Just tried to run your code in a Jupyter Notebook on my system.

The code runs without issue for me.

A couple of things:

  • Based on the error, the failure happens on the value Date. Somewhere in your data there is a string with the value Date, and that can’t be converted to int. I am not sure why you get this error while I don’t, unless you accidentally changed something somewhere.
    • Try printing row[6] for every row in moma to see where this Date value appears.
  • Share your Jupyter Notebook (through a GitHub link, or attach the file to your post here).
    • I can then look through the code to see whether something else you did might be causing it. Also mention the Python version you are using for your Jupyter Notebook.
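
The first suggestion can be sketched roughly as follows. This is only a minimal sketch: it assumes moma is a list of lists with the date in column 6, as in the question, and the sample rows and the helper name find_non_numeric_dates are made up for illustration. It also assumes the values have already been through strip_characters, since raw values such as "c. 1915" would be flagged too.

```python
# Hypothetical stand-in for the real moma data (a list of lists).
# Only column 6 matters here; the other values are placeholders.
moma = [
    ["Title", "Artist", "Nationality", "BeginDate", "EndDate", "Gender", "Date", "Department"],
    ["Work A", "Artist A", "American", "1900", "1980", "Male", "1912", "Prints"],
    ["Work B", "Artist B", "French", "1910", "1990", "Female", "1913-1923", "Drawings"],
]


def find_non_numeric_dates(rows, column=6):
    """Return (row_index, value) pairs whose value is not a year or a year range."""
    suspicious = []
    for index, row in enumerate(rows):
        value = row[column]
        # A cleaned date should contain only digits, apart from the dash in a range.
        if not value.replace("-", "").isdigit():
            suspicious.append((index, value))
    return suspicious


suspects = find_non_numeric_dates(moma)
for index, value in suspects:
    # The row index shows where in the data the odd value lives.
    print(index, repr(value))
```

Printing the row index alongside the value shows where in the data the unexpected string sits, which is usually more helpful than the value alone.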

Hi, thank you for your time. I checked the code and it is the same as the playground code, but I still get the same error. I am using Python 3.
GitHub link: https://github.com/hulyak/analyze-data-with-python/blob/master/MoMA.ipynb

You missed the instruction on this step:

Use list slicing to remove the column names (the first row) from the moma list of lists.

The first row in moma holds the column headers:

  • Title : The title of the artwork.
  • Artist : The name of the artist who created the artwork.
  • Nationality : The nationality of the artist.
  • BeginDate : The year in which the artist was born.
  • EndDate : The year in which the artist died.
  • Gender : The gender of the artist.
  • Date : The date that the artwork was created.
  • Department : The department inside MoMA to which the artwork belongs.

Notice the Date and the BeginDate?

That’s what the errors were pointing to -

ValueError: invalid literal for int() with base 10: 'BeginDate'

ValueError: invalid literal for int() with base 10: 'Date'

Once you remove the column names from moma you should be fine.
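
A minimal sketch of that step, using a small made-up list in place of the real moma data (the sample rows and the variable name header are assumptions for illustration, not part of the exercise):

```python
# Hypothetical stand-in for the real moma data; the first row holds the column names.
moma = [
    ["Title", "Artist", "Nationality", "BeginDate", "EndDate", "Gender", "Date", "Department"],
    ["Work A", "Artist A", "American", "1900", "1980", "Male", "1912", "Prints"],
    ["Work B", "Artist B", "French", "1910", "1990", "Female", "1994", "Drawings"],
]

header = moma[0]  # keep the column names in case they are needed later
moma = moma[1:]   # list slicing: every row except the first

# With the header gone, every value in column 6 can be converted.
dates = [int(row[6]) for row in moma]
print(dates)
```

In a notebook, running the slicing cell more than once drops one more data row each time, so it is safer to keep the slice in the same cell that loads the data.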

I will recommend that you start focusing more on how to debug such problems, however. Ask yourself questions. You printed out row[3] and row[4] and you got BeginDate and EndDate. Stop, think about this, and ask yourself questions.

  • Why are the birth date and death date those strings instead of actual years like 2000?
  • Where do those string values come from?
  • What else can I print to check where those values come from?
  • Should I print the entire row and see?

Asking such questions will start to lead you to the answers on your own, and that’s what you should focus on now as well.

Good luck!

Hi, sorry, I think I missed that step. Thank you for your feedback. However, now I am getting an error from the clean_and_convert and strip_characters functions. When the date is converted to int, I get:

AttributeError: 'int' object has no attribute 'replace'

You can check the GitHub code here:

I would first recommend that you follow the suggestion I made earlier:

I will recommend that you start focusing more on how to debug such problems, however. Ask yourself questions.

When you get this error, what questions can you ask about it? Break down the thought process first.
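
One place to start is the error message itself: replace is a string method, so a value that is not a string must have reached strip_characters. Below is a minimal sketch of the kind of check that answers "what type is this value?". The list dates is hypothetical and not taken from the notebook.

```python
# Hypothetical values for column 6: one still a string, one already an int.
dates = ["1913-1923", 1912]

for value in dates:
    # type() shows what kind of object this is; hasattr() shows whether
    # calling .replace() on it would work.
    print(repr(value), type(value).__name__, hasattr(value, "replace"))

results = [(type(value).__name__, hasattr(value, "replace")) for value in dates]
```

The next question to ask is how a value of that type ended up in the column in the first place.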

Please note that I don’t try to help people just by providing direct answers. I aim to help them learn how to approach problems themselves whenever possible. So I won’t provide a direct solution here unless you first put in the necessary effort towards debugging such issues. If you would prefer otherwise, I suggest asking a new, separate question, since your latest error is not the same as the one you originally asked about in this post.