Stuck on error: ValueError: invalid literal for int() with base 10: 'h'

Screen Link:
https://app.dataquest.io/c/62/m/356/guided-project%3A-exploring-hacker-news-posts/4/calculating-the-average-number-of-comments-for-ask-hn-and-show-hn-posts

My Code:

total_ask_comments = 0
for row in ask_posts:  
    ask_comments = int(row[4])
    total_ask_comments += ask_comments 
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    show_comments = int(row[4])
    total_show_comments += show_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

What I expected to happen: To be able to convert row[4] to an int

What actually happened:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-1a181f327856> in <module>()
      5       del row
      6 for row in ask_posts:
----> 7   ask_comments = int(row[4])
      8   total_ask_comments += ask_comments
      9 avg_ask_comments = total_ask_comments / len(ask_posts)

ValueError: invalid literal for int() with base 10: 'h'

I tried to figure out the problem by looping through each character in row[4]. I found that in some rows this value is the string ‘h’ rather than a number. What should I do to solve this issue?
By the way, my CSV file seems to differ from the one used in the Dataquest course, because the output of my first five rows is not the same. I took the file from the link in the course, just as I did for the previous Guided Project. Could someone explain why this happens?

Many thanks!

Dataquest sometimes modifies the original datasets to suit what they want us to learn. The lesson says so as well:

You can find the data set here, but note that we have reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn’t receive any comments and then randomly sampling from the remaining submissions.

So the data you have and the data they want us to use are different. If you want to download the dataset they use, see: Loading chinook database on Jupyter - #11 by the_doctor

Thanks to your guidance, I found the ‘hacker_news.csv’ file. I really appreciate that.
However, even after replacing the old CSV file with the new one, I get the same error. What can I do to fix it?

I haven’t run the code myself to be sure, but you may be including the header row in your for loop. The header holds column names, which are strings, and that would cause this error. You can also print row at each iteration to find where row[4] returns the character h. The error means your code is trying to convert a non-numeric string to int.
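One way to do that check without printing every row is to collect only the entries that fail the conversion. This is a minimal sketch: the helper name find_bad_rows and the sample data are made up for illustration and are not part of the course material.

```python
def find_bad_rows(rows, col=4):
    """Return (index, row) pairs whose value at `col` cannot be converted to int."""
    bad = []
    for i, row in enumerate(rows):
        try:
            int(row[col])
        except (ValueError, IndexError):
            bad.append((i, row))
    return bad


# Made-up sample: one proper row (a list of columns) and one entry that is
# only a title string, to show how each behaves.
sample_posts = [
    ["1", "Ask HN: example question", "", "2", "5", "user_a", "8/4/2016 11:52"],
    "ask hn: example question",
]

for i, row in find_bad_rows(sample_posts):
    # Printing the type shows whether the entry is a list (a full row) or a str.
    print(i, type(row).__name__, repr(row))
```

If the offending entries turn out to be of type str rather than list, the list was probably filled with something other than full rows.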

I made sure that the header row was removed in the earlier steps. In the end, I looped through each character in row[4] as you recommended and deleted the rows containing the ‘h’ character so that I could continue the project.
Thanks for your help!

I got this same error the first time through.

I realized that I had originally stored only the title of the post in ask_comments (and the other lists), instead of the entire row.
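Here is a small sketch of how that mistake can produce exactly this error. The thread does not show how the lists were built, so this assumes the lowercased title was appended instead of the row. The sample rows are made up and follow the column order id, title, url, num_points, num_comments, author, created_at.

```python
# Made-up rows standing in for the data set (header already removed).
hn = [
    ["1", "Ask HN: How do you learn Python?", "", "3", "7", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Favorite text editor?", "", "1", "3", "user_b", "8/5/2016 9:10"],
]

titles_only = []  # the mistake: stores just the title string
ask_posts = []    # the fix: stores the entire row

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        titles_only.append(title)
        ask_posts.append(row)

# Indexing a string gives a single character, so [4] of "ask hn: ..." is 'h',
# and int('h') raises the ValueError from the original post.
print(repr(titles_only[0][4]))  # 'h'

# With full rows, row[4] is the num_comments column and converts cleanly.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)  # 5.0
```

If this is the cause, deleting the rows that contain ‘h’ only hides the problem, since every entry in the list is a title rather than a row.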