356-8 Hacker News Project beyond guided steps

Hi there! For extra practice I wanted to continue onto the 3 objectives past the guided part of the guided project analyzing Hacker News posts. I am receiving this error code on my list of posts not categorized as ‘Show’ or ‘Ask’: ValueError: unconverted data remains: /12/2016 1:43.
I mimicked code used throughout the guided project for this step of print hour and average points per post.
My code:

date_format = "%m/%d/%Y %H:%M"
counts_other_hr = {}
points_other_hr = {}

other_points = []
for row in other_posts:
    other_points.append([row[6], int(row[3])])
for row in other_points:
    date = row[0]
    points = row[1]
    hr = dt.datetime.strptime(date, date_format).strftime("%H")
    if hr in counts_other_hr:
        counts_other_hr[hr] += 1
        points_other_hr[hr] += points
    else:
        counts_other_hr[hr] = 1
        points_other_hr[hr] = points

avg_other_pts = []
for hr in points_other_hr:
    avg_other_pts.append([hr, points_other_hr[hr] / counts_other_hr[hr]])
print(avg_other_pts)

swap_other_avgpts = []
for hr in avg_other_pts:
    swap_other_avgpts.append([row[1], row[0]])
sorted_other_pts = sorted(swap_other_avgpts, reverse=True)

for avg, hr in sorted_other_pts[:5]:
    hr = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{hr}: {avg:.2f} points per post".format(hr=hr, avg=avg))

This exact code worked for a different list. I am at a loss as to how to resolve the error. I thought about skimming through this list’s rows to find odd dates, but I am not sure what the most efficient route is. I appended all dates to a new list and printed it, but that was not very easy to skim through for anomalies. Any direction the community can offer to overcome this would be awesome.

Thanks! And apologies if anyone needs further clarification or my issue is not clear.

Hey, Clara. Very well-posed question!

It seems you realized that some dates aren’t in the correct format and that’s why you’re getting an error. You also seem to have thought of a strategy to start dealing with this: find the ill-formatted dates. But you’re having trouble with how to go about doing this.

This very much depends on the issues you face as you try dealing with the first few obstacles. It could possibly be that this is the only ill-formatted date (probably not), in which case you could just fix this in a text editor or something and move on.

I think my first option would be to use a “try except” block. Here’s an external resource on this, since by the time you encounter this project you still haven’t learned about this technique.
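
For reference, here is a minimal sketch of what a “try except” block could look like in this situation. The sample values in `dates` and the names `parsed` and `bad_dates` are made up for illustration; in the project the dates would come from your own list of posts.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Hypothetical sample values, not taken from the real dataset.
dates = ["8/4/2016 11:52", "/12/2016 1:43", "8", "1/9/2020 16:20"]

parsed = []
bad_dates = []
for date in dates:
    try:
        parsed.append(dt.datetime.strptime(date, date_format))
    except ValueError:
        # strptime raises ValueError when the string does not fit the format,
        # so the offending value is set aside instead of stopping the loop.
        bad_dates.append(date)

print(len(parsed), "dates parsed")
print("Values that failed:", bad_dates)
```

Printing `bad_dates` at the end gives you a short list to inspect instead of the whole column.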

An alternative, using the only techniques you’ve learned so far, is to try to programmatically find the problematic dates. For instance, it seems the one you hit starts with a slash. You can add a rule that says “if the first character is /, append date to the list problematic_dates and set date as 01/01/1900 00:00 (just an example), otherwise do business as usual”.
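
A rough sketch of that rule, assuming a small made-up list of dates (the `placeholder` value and the sample data are only examples, and I have not run this against the real dataset):

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
placeholder = "01/01/1900 00:00"  # arbitrary stand-in date, just an example

# Hypothetical sample values, not taken from the real dataset.
dates = ["8/4/2016 11:52", "/12/2016 1:43", "9/26/2016 3:26"]

problematic_dates = []
hours = []
for date in dates:
    if date.startswith("/"):
        # Set the bad value aside and carry on with the placeholder.
        problematic_dates.append(date)
        date = placeholder
    hours.append(dt.datetime.strptime(date, date_format).strftime("%H"))

print(problematic_dates)
print(hours)
```

Keep in mind that rows given the placeholder would be counted under hour 00, so you may prefer to skip them entirely rather than substitute a date.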

Most likely, after you implement this rule, you’ll find other formats that fail, and you can repeat this a few times. Hopefully this will allow you to at least bypass the problematic dates. If it turns out to be unfeasible, then more potent techniques will be needed. (Tip: get back to this project after you finish step 2. You’ll learn about very potent Python libraries that will allow you to deal with this in a much cleaner, easier and more elegant way.)

Hope this helps.


Bruno, that was super helpful and made me excited for trying to take these projects further.
So I started digging around and testing some str.startswith() functions to try to find the bad dates but the dates could not be found. I was really stumped and even tested a “try except” block in the code…nothing, everything was considered “bad.” But I figured it out!
The error in my code has asterisks around it:

swap_other_avgpts = []
for **hr** in avg_other_pts:
    swap_other_avgpts.append([**row**[1], **row**[0]])
sorted_other_pts = sorted(swap_other_avgpts, reverse=True)

for avg, hr in sorted_other_pts[:5]:
    hr = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{hr}: {avg:.2f} points per post".format(hr=hr, avg=avg))

So when I switched that “hr” to “row” I do not have an error, it prints beautifully.
Thanks for your help and for providing different routes to take when issues arise! This was a fun exercise, but I clearly need to practice my attention to detail skills :sweat_smile:
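
For anyone landing here later, a small sketch of the corrected loop with made-up averages. The last part is my best guess at why the original error message showed a piece of a date: with the loop variable named `hr`, `row` presumably still pointed at the last row of the earlier `for row in other_points` loop, so a full date string ended up being parsed with "%H". The values in `avg_other_pts` and `stale_row` are hypothetical.

```python
import datetime as dt

# Hypothetical averages in the same [hour, average] shape as avg_other_pts.
avg_other_pts = [["09", 12.5], ["15", 30.25], ["02", 7.0]]

swap_other_avgpts = []
for row in avg_other_pts:  # loop variable now matches the name used in the body
    swap_other_avgpts.append([row[1], row[0]])
sorted_other_pts = sorted(swap_other_avgpts, reverse=True)

for avg, hr in sorted_other_pts:
    hr = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{hr}: {avg:.2f} points per post".format(hr=hr, avg=avg))

# Assumed leftover row from an earlier loop: [date, points].
stale_row = ["9/12/2016 1:43", 3]
message = ""
try:
    # "%H" consumes the leading "9" and the rest of the string is left over.
    dt.datetime.strptime(stale_row[0], "%H")
except ValueError as err:
    message = str(err)
print(message)
```

This also explains why searching the list for bad dates found nothing: the dates themselves were fine, they were just being parsed with the wrong format.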


Oops! I share some of the responsibility there, I didn’t pay too much attention to your code as I thought I had already figured it out just from the error message.

Feel free to mark your own answer as a solution. You figured it out!

I think struggles like these carry tremendous learning value, nice job :slight_smile:


Hi - someone asked my specific question a few months ago but it got no response. I replied to it asking again, but I’m also asking here in case it’s quicker.

I received the error below when creating “counts_by_hour = {}” and “comments_by_hour = {}” in the same Guided Project:

ValueError: time data ‘8’ does not match format ‘%m/%d/%Y %H:%M’

I see your response above, Bruno, details a manual approach to identifying errors (“if the first character is / …”) but I’m not sure how to adopt a similar approach if a number isn’t a number (if I understand my value error correctly…).

Any assistance welcome.

Hey, Murphy.

The problem here seems to be that the value is 8, which doesn’t conform to the expected format.

You can try something like what I suggested above that catches, for example, all values that do not contain slashes. This will capture 8 and possibly others.

Thanks Bruno … the question I have - which I should have elaborated on the first time around - is how do I write:

‘if value != expected format’

in code?

I looked through your links and have searched elsewhere but I’m not even getting close to a solution.

Got it.

To do this properly, you’ll first need to learn about regular expressions. You’ll learn about them in the next step.

The alternative I’m suggesting is a different logic: “if the format of the value is [insert wrong format], then set it aside”, and you do this multiple times as you encounter incorrect formats.


Edit: @murphysimon My wording was very awkward, I’ve edited my post to make it clearer.

No worries!

But I’m still struggling as I don’t know how to identify which rows have incorrect formatting (so that I can then set it aside, as you say).

It seems to be a similar problem as the first guided project (Profitable App Profiles) but in that project the row with the error in question is specifically given, but it’s not here - the only clue I’m given is ‘8’.

And it seems there is added difficulty because there are 5 strftime codes (%m/%d/%Y %H:%M), and this ‘8’ could belong to any one of them, but maybe I’m overcomplicating things?

Values have formats.

Here the value is 8, it’s not in the desired format. It doesn’t concern any one specific component of the format, it’s simply a value that isn’t in the right format.

How can you tell that it isn’t in the right format? Well, one way is that it’s way too short; another is that it doesn’t have any slashes; another is that it doesn’t have a colon. Pick your favorite and go with it.

Then you try again. If you find another error, you employ a similar rule, only adjusted to whatever value you find in that instance.
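
As an untested-against-the-real-data sketch, those three checks could be written like this. The function name `looks_wrong` and the sample values are made up for illustration.

```python
def looks_wrong(value):
    """Return True if value fails any of the quick checks described above."""
    too_short = len(value) < len("1/1/2016 0:00")  # shortest well-formed date
    no_slash = "/" not in value
    no_colon = ":" not in value
    return too_short or no_slash or no_colon


# Hypothetical sample values.
samples = ["8/4/2016 11:52", "8", "/4/2016 11:52"]
for value in samples:
    print(repr(value), looks_wrong(value))
```

Note that "/4/2016 11:52" passes all three checks even though it is not a valid date, which is why you may need to add further rules as new errors show up.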

Maybe I’m not asking this the right way (and maybe I’m being stupid so apologies if the case…)

So…

All the rows in a list of lists called ‘result_list’ contain a column with information in the following format:

8/4/2016 11:52

I need to loop through the list to find the problem ‘8’ within this column.

So, if I do a for loop you’re saying I should look for a number that is too short.

Ok, but how do I write this?

for row in result_list:
if xxxx in

Can you solve for xxxx ?

Or have I misunderstood your response?

I’m speaking from memory here, and I didn’t test it, so it could be me who is missing something.

You say the value is 8/4/2016 11:52, but I say the value is 8, that’s it. There is a row whose date isn’t complete and the value is 8.

Let’s see an artificial example.

>>> from datetime import datetime as dt
>>> strp = dt.strptime
>>> date_format = "%m/%d/%Y %H:%M"
>>> data = [
... ["first_row", "1/9/2020 16:20"],
... ["second_row", "8"],
... ["third_row", "/4/2016 11:52"]
... ]
>>> for row in data:
...     print(strp(row[1], date_format))
... 
2020-01-09 16:20:00
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/bruno/anaconda3/lib/python3.7/_strptime.py", line 577, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/home/bruno/anaconda3/lib/python3.7/_strptime.py", line 359, in _strptime
    (data_string, format))
ValueError: time data '8' does not match format '%m/%d/%Y %H:%M'

Notice how it printed the result for the first row and it failed when it reached the second row.

The date on the third row also has an incorrect format. My suggestion is something like this:

  1. Run your code normally;
  2. Hit an error due to format issues with the value;
  3. Adjust your code to capture this value;

And then repeat the previous steps as necessary.

In this particular example you’d find an error in the second row, adjust your code to add rows whose date, for instance, lacks slashes to a list of problematic formats and move on.

Then you’d find another error with the third row, adjust your code to deal with this new problem and move on. And so on, and so forth.
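
Applied to the artificial data above, two rounds of that process might end up looking something like this. The names `problematic` and `parsed` are just illustrative, and the two rules are only the ones this toy data happens to need.

```python
from datetime import datetime as dt

date_format = "%m/%d/%Y %H:%M"
data = [
    ["first_row", "1/9/2020 16:20"],
    ["second_row", "8"],
    ["third_row", "/4/2016 11:52"],
]

problematic = []
parsed = []
for row in data:
    date = row[1]
    if "/" not in date:
        # Rule added after the first error (the value "8").
        problematic.append(row)
    elif date.startswith("/"):
        # Rule added after the second error (month missing before the slash).
        problematic.append(row)
    else:
        parsed.append(dt.strptime(date, date_format))

print(parsed)
print(problematic)
```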

Thanks Bruno - a couple of things:

  1. I still don’t know what you mean when you say “adjust my code to add rows…”

  2. I’ve run my code again, and your code instead in its place, and it seems there’s a problem with every single row (I’ve printed row[0], row[1], row[2] etc.), which suggests something is wrong with the formatting of the list that I can’t work out.

  3. I really appreciate your help but this has taken me 5 hours to not solve (!) so is there some way you can remote in, or is there someone I can speak to, or a help desk etc?

Again, assistance is very much appreciated.

Thanks again Bruno but I’ve now solved it.

The error was the creation of result_list that was done incorrectly.

To append the two columns as one row into this list I did this:

if time not in result_list:
    result_list.append(row[6])
    result_list.append(row[4])

And I should have done this:

if time not in result_list:
    result_list.append([row[6],row[4]])
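
A small sketch of the difference, using a made-up row (only the positions 6 and 4 come from the code above; the values are hypothetical). It shows one plausible way the lone '8' could appear: in the flat version each element is a plain string, so indexing it with [0] returns a single character.

```python
# Hypothetical row: index 6 holds the date, index 4 the other column.
row = ["1", "title", "url", "5", "8", "author", "8/4/2016 11:52"]

# Two separate appends give a flat list of strings.
flat_list = []
flat_list.append(row[6])
flat_list.append(row[4])

# One append of a list gives a list of [date, value] pairs.
nested_list = []
nested_list.append([row[6], row[4]])

first_flat = flat_list[0]
first_nested = nested_list[0]
print(first_flat[0])    # first character of the date string: 8
print(first_nested[0])  # the whole date string: 8/4/2016 11:52
```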

Is there any reason the solutions for this Guided Project aren’t included like the Apps project? It would have saved me hours of time here.

I’ve found the solution - the key icon at the top :upside_down_face:

And the solution shows the “startswith” code in one of the earlier exercises - which is where I couldn’t understand what you were saying … I assume you thought I knew that bit of code, which I didn’t.

Glad you solved it!

I actually didn’t know we were using str.startswith, I was just focused on the error and assumed that everything was correct prior to this, sorry.

Not your error at all… it’s mine.

Fyi my coding experience was literally zero before I started this course, so anything that’s even remotely outside what I’ve been taught here is a struggle at the moment.

Thanks again (I can’t say this enough)
