Hacker News Analysis

Hello everyone, I tried to apply pandas and NumPy to make the Hacker News analysis easier and cleaner, but I am stuck and would appreciate guidance on how to use pandas to complete these two steps:

  • Find the number of posts and comments created at a given time (by hour)
  • Calculate the average number of comments on Ask HN posts for each hour of the day

Hacker News Analysis.ipynb (36.7 KB)
Thank you, cheers


Great work, my brother, especially as you explored the dataset correctly: you showed the headers and the first 10 rows with hn.head(), and then described the dataset with hn.describe(). These two pandas methods are important in Exploratory Data Analysis. You just missed the shape attribute: hn.shape displays (number of rows, number of columns). I will follow your code and check what you need to do to conclude this project.
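As a rough illustration of those three exploration steps, here is a minimal sketch. The tiny DataFrame is a made-up stand-in for the real dataset (the notebook loads the full file instead), and the column names are assumptions.

```python
import pandas as pd

# Made-up stand-in for the Hacker News data; the real notebook loads a CSV file.
hn = pd.DataFrame({
    "title": ["Ask HN: How to learn pandas?", "Show HN: My side project", "Ask HN: Best editor?"],
    "num_comments": [6, 2, 10],
    "created_at": ["8/4/2016 11:52", "1/26/2016 19:30", "9/26/2016 11:05"],
})

print(hn.head())      # first rows (5 by default; pass a number to see more)
print(hn.shape)       # an attribute, so no parentheses: (rows, columns)
print(hn.describe())  # summary statistics of the numeric columns
```

Note that shape is written without parentheses because it is an attribute, while head() and describe() are methods.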


@brayanopiyo18, can you help me out here?

Sorry, I can only help you after 18h00, as our electricity will go off for 2 hours in 10 minutes from now. In the meantime, the best thing is to check the procedures and methods I suggested for completing your Guided Project and try them out for yourself. I will write them down before 20h00 tonight; your responsibility is to test whether those functions or methods produce the required outputs. Remember that I have to study your data flow and your own code first, so that the solution follows it to the end goal instead of being a carbon copy of the official solution.

@OlutokiJohn I am soon getting to your project.

Sorry for the delay; I will look at your Hacker News Posts project first today. You need to include hn.shape between hn.head() and hn.describe() to follow the flow of Exploratory Data Analysis (EDA). That is, hn.head() remains code cell 2, your 3rd code cell should be hn.shape (or hacker_news.shape, depending on the name you chose), and your 4th cell should be hn.describe(). Insert a new cell after your head() call, write hacker_news.shape and run it; it will output (20100, 7). Next, you need to import datetime and use a for loop to get the comments per hour. Then you calculate (1) the amount for Ask HN posts (that is, the total of those comments per hour) and (2) the average number of Ask HN comments (avg_by_hour), then swap avg_by_hour and finally print. Remember that both calculations require a for loop.


Yes, in the solution the number of posts and comments created by hour is found with a for loop that collects the comments per hour. Then you calculate (1) the amount for Ask HN posts (that is, the total of those comments per hour) and (2) the average number of Ask HN comments (avg_by_hour), then swap avg_by_hour and finally print. For this route you first have to import datetime. I hope this explanation is clear. Did you try adding up the num_comments in your Output [36] to get the total amount and calculate avg_by_hour?
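The for-loop route described above could be sketched roughly as follows. The sample rows and the date format "%m/%d/%Y %H:%M" are assumptions made for illustration; the names counts_by_hour and comments_by_hour match the ones used in the cells later in this thread.

```python
import datetime as dt

# Hypothetical sample rows in the form [created_at, num_comments]
ask_posts = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 13:20", 3],
]

counts_by_hour = {}    # hour -> number of Ask HN posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for created_at, num_comments in ask_posts:
    # Parse the timestamp (assumed format) and keep only the hour as a string like "09"
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + num_comments

print(counts_by_hour)
print(comments_by_hour)
```

Dividing comments_by_hour[hour] by counts_by_hour[hour] then gives the average number of comments per post for that hour.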

Hi @OlutokiJohn,
I went through your data flow, and your project's answers are OK (remember to insert hn.shape, i.e. hacker_news.shape, between your head() and describe() cells). Now you need to add the following last 4 blocks of code, one per cell (cells 37, 38, 39 and 40), and then run them:

In[37]
# find the hour with the highest total number of comments
max_by_comments = 0
max_comments = []
for row in comments_by_hour:
    if comments_by_hour[row] > max_by_comments:
        max_by_comments = comments_by_hour[row]
        max_comments = [row, comments_by_hour[row]]
print('With ' + str(max_comments[1]) + ' most of the comments were written around ' + str(max_comments[0]) + " o'clock.")

In[38]
# average comments per post for each hour
avg_by_hour = []
for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
for row in sorted(avg_by_hour):
    print('Hour: ' + str(row[0]) + ' Comments (avg): ' + str(row[1]))

In[39]
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # swap the two elements of the row and append it to the new list
print(swap_avg_by_hour)

In[40]
import datetime as dt  # needed for dt.datetime below, if not imported earlier

sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # sort the new list, highest average first
print('Top 5 Hours for Ask Posts Comments')
for average, hour in sorted_swap[:5]:
    hour_object = dt.datetime.strptime(hour, '%H')  # convert the hour string to a datetime object
    time = hour_object.strftime('%H:%M')  # format the datetime object
    print('{time}: {average:.2f} comments per post'.format(time=time, average=average))
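As a possible alternative to the swap step, the list can be sorted on its second element directly with a key function. This is only a sketch: the [hour, average] pairs below are made-up values in the same shape as avg_by_hour, with the hour stored as a string.

```python
import datetime as dt

# Made-up [hour, average] pairs in the same shape as avg_by_hour
avg_by_hour = [["09", 5.5], ["15", 38.5], ["02", 23.75], ["20", 21.5]]

# Sort on the average (second element) instead of swapping the columns first
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:3]

lines = []
for hour, average in top_hours:
    # "%H" parses the hour string; strftime renders it as HH:MM
    time = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    lines.append("{time}: {average:.2f} comments per post".format(time=time, average=average))

print("\n".join(lines))
```

Both routes give the same ordering; the key function simply avoids building the intermediate swapped list.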

It should become clearer to you now. If you see a star (*) in the code above, just delete it. Good luck with your studies! You can find me as @AlMokgalaka on Twitter and LinkedIn.

Hi @OlutokiJohn, I want to join @10903alm, this time providing the code and explaining it further.

To find the number of posts created per hour:

  • Access the h column
  • Use the value_counts() method on this column to get the number of posts created in each hour. Remember that every post records the time it was created, so every row describes one post: its title, the number of comments it received, the time it was created, the author and so on. Below is the code:
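from tabulate import tabulate

# title in blue and underlined (ANSI escape codes)
print('\033[94m \033[4m Number of posts created per hour:\033[0m\n')

# value_counts() counts the rows (posts) for each hour; dropna=False also counts missing hours
print(tabulate(ask['h'].value_counts(dropna=False).to_frame(),
              headers=['\033[31mTime(hour)\033[0m', '\033[31mNumber of posts created\033[0m'], tablefmt='fancy_grid'))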

Output: [image: johnp]
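The h column itself is not defined in this thread. One way it might be derived is sketched below; the sample rows, the created_at column name and the date format are assumptions, not taken from the notebook.

```python
import pandas as pd

# Made-up sample of Ask HN posts
ask = pd.DataFrame({
    "title": ["Ask HN: A?", "Ask HN: B?", "Ask HN: C?"],
    "num_comments": [6, 29, 3],
    "created_at": ["8/16/2016 9:55", "11/22/2015 13:43", "8/2/2016 13:20"],
})

# Parse the timestamps (assumed format), then extract the hour as an integer
ask["created_at"] = pd.to_datetime(ask["created_at"], format="%m/%d/%Y %H:%M")
ask["h"] = ask["created_at"].dt.hour

# Posts per hour, ordered by hour instead of by count
posts_per_hour = ask["h"].value_counts().sort_index()
print(posts_per_hour)
```

Calling sort_index() is optional; without it, value_counts() lists the busiest hours first.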

I think you have this in your workings already; check cell [36]. In this case you apply the groupby() method on the h column, just the way you did, and sum all the comments received in each hour. Maybe you can consider styling the output using the code below:

print('\033[94m \033[4m Number of comments created per hour:\033[0m\n')
# select num_comments before summing so that only this column is added up;
# the index of the result is the hour, which fills the first table column
dff = ask.groupby('h')['num_comments'].sum()
print(tabulate(dff.to_frame(),
              headers=['\033[31mTime(hour)\033[0m', '\033[31mNumber of comments created per hour\033[0m'],
               tablefmt='fancy_grid'))

Output: [image: johnc]

  • Get the number of comments created in each hour
  • Get the number of posts created in each hour
  • Divide the number of comments per hour by the number of posts in that hour to get the average number of comments for that hour. Have a look at the code:

# total comments ('sum') and number of posts ('size') per hour, computed in one groupby
# so that both columns stay aligned on the same hour
dff = ask.groupby('h')['num_comments'].agg(['sum', 'size']).reset_index()
dff.columns = ['h', 'num_comments', 'num_post_created']

# average comments per hour from 'num_comments' and 'num_post_created',
# rounded to a whole number and stored in a new column called aver_comm_per_hour
dff['aver_comm_per_hour'] = round(dff['num_comments'] / dff['num_post_created'])

# styling the output; the hour is set as index so it fills the first table column
print('\033[94m \033[4m Average Number of comments created per hour:\033[0m\n')
print(tabulate(dff.set_index('h')['aver_comm_per_hour'].to_frame(),
              headers=['\033[31mTime(hour)\033[0m', '\033[31mAverage Number of comments created per hour\033[0m'],
               tablefmt='fancy_grid'))

Output: [image: johna]

Note
My outputs will not necessarily look like yours, because the dataset I downloaded has not been cleaned like the one used by DQ, so expect smaller values in your case.
My last solution can be reached through several approaches that all arrive at the same output; I just decided to go the rather manual way :rofl:.

Waiting for the final project, and all the best as you wind up.
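One of those other approaches could look like the sketch below, which uses named aggregation to get the totals, the post counts and the averages in a single groupby. The sample data and the result column names are made up for illustration.

```python
import pandas as pd

# Made-up sample: hour of creation and number of comments for five Ask HN posts
ask = pd.DataFrame({
    "h": [9, 13, 13, 10, 9],
    "num_comments": [6, 29, 3, 1, 4],
})

# One pass per hour: total comments, number of posts, and average comments per post
per_hour = ask.groupby("h")["num_comments"].agg(
    total_comments="sum",
    num_posts="count",
    avg_comments="mean",
)

# Hours with the highest average first
top = per_hour.sort_values("avg_comments", ascending=False)
print(top)
```

Because everything comes from the same groupby, the three columns are guaranteed to refer to the same hour on each row.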


Hi @OlutokiJohn [Marked Final*]
I had supplied you with the solution to your problem: the 4 code cells, In[37] to In[40], in my earlier post above, for you to test and implement.


Thank you @10903alm and @brayanopiyo18, I really appreciate both of you guys.

I hope you saw the square brackets or the braces. Although these were my first 2 days in this community, I can tell you that I browsed fewer than 5 students' notebooks (ipynb), and out of those, yours was excellent. Thanks, brother; "we learn by helping others". I only come to study at DQ when they offer a free weekend.


@OlutokiJohn
I wish to know if you implemented my suggested solutions and whether you completed the Hacker News Posts Guided Project successfully?

Yes, I did. Thank you very much.

Hi @10903alm,

"I only come to study at DQ only if they offer free weekend"

Please kindly take a look at this post. Hope that everything is clear now :slightly_smiling_face:
