My Second Guided Project: Explore Hacker News Posts

Hello everyone! I just finished my second project :wink:. I would be happy to have your review of it.
I worked with the original dataset from Kaggle, so I might have reached a different conclusion.
Thanks in advance.

Exploring_Hacker_News_Posts.ipynb (21.3 KB)

5 Likes

Hi @Oksana, congratulations on completing your second project. Your work is well presented and organized. Using dateutil gave a tidy output in the part about the most popular time for posting. Just one point: in cells [8] and [9] I would have liked to see the output :wink:, and I hope you will consider showing it in the next projects you tackle. Otherwise, everything looks good to me. Happy coding!

2 Likes

Hi @Oksana,

This is great work. You already seem quite experienced, judging by how you used the dateutil library and how you wrote the code, comments, and so on.

While doing this project, I also used the original dataset. But then I applied logic similar to what DQ might have used, dropping the rows that didn’t have comments, points, etc. So my conclusion was fairly similar to that of the DQ solution.

But even your conclusions are quite close to the ‘official’ solution. It says the best time to post is 21 CET; you got 22 CET, with 21 just behind it.
So I think that time range is definitely the answer.
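For anyone comparing hours like this, here is a minimal sketch of how the average number of comments per posting hour could be computed with dateutil. The `created_at` strings, their `M/D/YYYY H:MM` format, and the sample values are assumptions made up for illustration, not rows from the notebook, and the standard-library fallback is only there in case dateutil is not installed.

```python
from collections import defaultdict

# dateutil is a third-party package; fall back to the standard library
# if it is not installed (assumes the M/D/YYYY H:MM format below).
try:
    from dateutil import parser
    parse_date = parser.parse
except ImportError:
    from datetime import datetime

    def parse_date(text):
        return datetime.strptime(text, "%m/%d/%Y %H:%M")


def avg_comments_by_hour(rows):
    """Return {hour: average number of comments} for (created_at, num_comments) pairs."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for created_at, num_comments in rows:
        hour = parse_date(created_at).hour
        totals[hour] += num_comments
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical sample rows: (created_at, num_comments)
sample_rows = [
    ("9/26/2016 3:26", 4),
    ("9/26/2016 3:50", 6),
    ("9/25/2016 21:10", 30),
]

averages = avg_comments_by_hour(sample_rows)
best_hour = max(averages, key=averages.get)
print(best_hour, averages[best_hour])
```

Note that the hour comes out in whatever time zone the dataset uses, so any conversion to CET would be a separate step.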

Thank you for sharing your findings with the complete data set. Keep up the good work. Happy learning.

2 Likes

Hi @brayanopiyo18,

Thanks a lot for your review and your comment about the output. I will keep the importance of this step in mind in my future projects. :slight_smile:

2 Likes

Hello @jithins123,

Thanks for your review :slight_smile: and for pointing out the difference in my results. This is a very interesting point.

When working on DQ’s dataset, I got the same results as the ‘official’ solution. It seems the author of the dataset may update it; I found this in the dataset documentation. Do you think this may be the reason for the discrepancy?

1 Like

Hi @Oksana,

Looks like the original dataset was updated 4 years ago.

Here is what DQ said at the beginning of the project:

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
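The reduction described in that note can be sketched roughly as follows. This is only an illustration of the idea (drop posts without comments, then sample at random), not DQ's actual procedure; the `num_comments` key, the `reduce_dataset` helper, and the seed are assumptions.

```python
import random


def reduce_dataset(rows, sample_size, seed=0):
    """Drop posts with no comments, then randomly sample from the rest."""
    commented = [row for row in rows if row["num_comments"] > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))


# Hypothetical data: 30 posts, every third one has zero comments
posts = [{"id": i, "num_comments": i % 3} for i in range(30)]
reduced = reduce_dataset(posts, sample_size=5)
print(len(reduced))
```

Because the sampling is random, a reduced dataset built this way would not be expected to match another sample row for row.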

In your project you had 293119 rows, which I believe confirms that the original dataset used by you, me and DQ is the same.

This is from my project
The number of rows in data set: 293119

In order to match the results, I then created a new list containing only the posts that have more than 10 comments and more than 10 points. And I got this:
Length of new data set is 25153
That is close to the 20,000 mentioned by DQ. I did my analysis on these 25k rows. I think that might be the reason for the difference in our answers.
What do you think?
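For reference, a minimal sketch of the filtering step described above. The `num_comments` and `num_points` keys and the sample posts are assumptions for illustration; the actual project may store rows differently (for example as lists read with `csv.reader`).

```python
def filter_popular(rows, min_comments=10, min_points=10):
    """Keep posts with more than min_comments comments AND more than min_points points."""
    return [
        row for row in rows
        if row["num_comments"] > min_comments and row["num_points"] > min_points
    ]


# Hypothetical sample posts
posts = [
    {"title": "A", "num_comments": 25, "num_points": 40},
    {"title": "B", "num_comments": 10, "num_points": 50},  # exactly 10 comments: dropped
    {"title": "C", "num_comments": 12, "num_points": 3},   # too few points: dropped
    {"title": "D", "num_comments": 11, "num_points": 11},
]

popular = filter_popular(posts)
print("Length of new data set is", len(popular))
```

The comparison is strict, so a post with exactly 10 comments or exactly 10 points is excluded.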

1 Like

Thank you very much! Your project helped me clean up my information and make it more readable. Your project was amazing!