Math block: To exclude or not to exclude zero-comment posts?

I have a bit of a math block (like a writer’s block, but for math). I’ve noticed generally in the shared projects that zero-comment submissions are filtered out of analyses.

You can find the data set here, but note that we have reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn’t receive any comments and then randomly sampling from the remaining submissions.
Part 1/8 of guided project

The above section of the guided project references removing the zero-comment submissions. The linked download, though, is the full kit and caboodle (~300,000 rows). I’m only about halfway through, but I’ve opted to do both: analyze the dataset with zero-comment submissions and without.
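For anyone who wants to go from the full download to something like the reduced set described in the quote, here is a rough sketch. I don't know how the project authors actually did their sampling, so treat this as one plausible approach: the function name `reduce_dataset`, the `num_comments` key, and the fixed seed are all my own assumptions, and the rows below are toy data.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop zero-comment rows, then draw a random sample without replacement."""
    commented = [row for row in rows if row["num_comments"] > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(commented))  # never ask for more rows than exist
    return rng.sample(commented, k)

# Toy rows for illustration only (not real Hacker News data).
rows = [{"id": i, "num_comments": i % 3} for i in range(30)]
sample = reduce_dataset(rows, sample_size=5)
print(len(sample), "rows kept, all with at least one comment")
```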

My results:

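| Post Type | Post Count (excludes zero-comment posts) | Average Comments (excludes zero-comment posts) | Post Count (includes zero-comment posts) | Average Comments (includes zero-comment posts) |
|---|---|---|---|---|
| Ask HN | 6,911 | 13.74 | 9,139 | 10.39 |
| Show HN | 5,059 | 9.81 | 10,158 | 4.89 |
| Others | 68,431 | 25.84 | 273,822 | 6.46 |
| Total | 80,401 | 23.79 | 293,119 | 6.53 |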
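One way to read the table: dropping zero-comment posts leaves the total number of comments unchanged and only shrinks the post count. So in each row, the average that includes zeros should equal the average that excludes zeros multiplied by the share of posts that got at least one comment. A minimal Python check using only the figures above (the variable and function names are mine):

```python
# Figures copied from the table above, per post type:
# (posts with comments, average excluding zeros, all posts, average including zeros)
table = {
    "Ask HN": (6_911, 13.74, 9_139, 10.39),
    "Show HN": (5_059, 9.81, 10_158, 4.89),
    "Others": (68_431, 25.84, 273_822, 6.46),
    "Total": (80_401, 23.79, 293_119, 6.53),
}

def predicted_avg_with_zeros(n_commented, avg_excl, n_all):
    """Average over all posts = average over commented posts * share commented."""
    return avg_excl * n_commented / n_all

for post_type, (n_commented, avg_excl, n_all, avg_incl) in table.items():
    share = n_commented / n_all
    predicted = predicted_avg_with_zeros(n_commented, avg_excl, n_all)
    print(f"{post_type:8} share with comments: {share:.1%}  "
          f"predicted: {predicted:.2f}  reported: {avg_incl:.2f}")
```

The predicted and reported averages agree to within rounding, which suggests the striking gap is mostly a matter of how many posts in each group got no comments at all: roughly 76% of Ask HN posts, 50% of Show HN posts, and 25% of other posts received at least one comment.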

I’m not the brightest knife in the crayon box. Some help, insight or something would be helpful.

Why would one remove zero-comment submissions? Why would one leave them in? How can one articulate the striking differences in averages? In what ways could this be characterized? What would I want to look into to compare these sorts of averages?

Any help is appreciated.

Well, I forged ahead without an answer to my “math block”. I conducted the analyses essentially eight times, using four nested lists:

  • Ask Posts
  • Show Posts
  • Other Posts
  • All Posts

Each of those was analyzed with the inclusion and the exclusion of zero-comment posts. It seemed beneficial to include both, even though this is still introductory stuff for data analysis.
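In case it helps anyone following along, the eight runs boil down to something like the sketch below. It is not my actual notebook code: the rows are made-up, the keys `title` and `num_comments` are assumed names, and the helper functions `post_type` and `summarize` are only for illustration.

```python
from statistics import mean

# Made-up sample rows for illustration only (not real Hacker News data).
posts = [
    {"title": "Ask HN: How do you learn statistics?", "num_comments": 12},
    {"title": "Ask HN: Anyone using Python 3?", "num_comments": 0},
    {"title": "Show HN: My weekend project", "num_comments": 0},
    {"title": "Show HN: A tiny CSV viewer", "num_comments": 4},
    {"title": "Why averages mislead", "num_comments": 30},
    {"title": "A post nobody noticed", "num_comments": 0},
]

def post_type(title):
    """Classify a post by the prefix of its title (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "Ask HN"
    if lowered.startswith("show hn"):
        return "Show HN"
    return "Others"

def summarize(rows, include_zero):
    """Return (post count, average comments), optionally dropping zero-comment posts."""
    counts = [r["num_comments"] for r in rows
              if include_zero or r["num_comments"] > 0]
    return len(counts), (mean(counts) if counts else 0.0)

# Four lists: one per post type, plus all posts together.
groups = {"Ask HN": [], "Show HN": [], "Others": [], "Total": posts}
for row in posts:
    groups[post_type(row["title"])].append(row)

# Four lists x two treatments of zero-comment posts = eight analyses.
results = {}
for name, rows in groups.items():
    for include_zero in (False, True):
        results[(name, include_zero)] = summarize(rows, include_zero)
        count, avg = results[(name, include_zero)]
        label = "includes zeros" if include_zero else "excludes zeros"
        print(f"{name:8} {label:15} count={count} avg={avg:.2f}")
```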

My presumption is that the kind of client who would want this type of analysis would be involved in some sort of marketing or advertising. A picture of what that engagement (comments, up/downvotes, times of the day) looks like seems more useful in that context than just identifying a type and time for high engagement. I hope I’m not off-base with that.

My results are on GitHub.

These are all great questions. Unfortunately, the (my?) answers aren’t straightforward. This is part of the vagueness of dealing with data in some respects.

I don’t know that this is the case, but it is possible that posts without comments are sufficiently different to merit a separate analysis. The question then shifts to “What does sufficiently different mean?”. This, too, isn’t straightforward. Cluster analysis gives us some answers. Don’t worry about this for now.

Remember, the project starts with a couple of questions. Often, you will start with questions. In those cases, answering them is your goal. When you don’t have a question, it’s a matter of exploring and seeing where it leads you. This ability will come with experience.

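Something that comes to mind in your table above is to compare the ratio of Ask HN to Show HN average comments in both cases. It’s noticeably larger when you don’t exclude zero-comment posts (about 2.1 versus about 1.4). This suggests that Show HN posts are much more likely to get no comments at all: roughly half of them have zero comments, versus about a quarter of Ask HN posts. Among posts that do get comments, Ask HN posts still average more comments than Show HN posts (13.74 versus 9.81).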
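To put numbers on that comparison, here is a small sketch that uses only the Ask HN and Show HN rows of the table. The dictionary layout and variable names are my own choices, nothing official.

```python
# Figures copied from the Ask HN and Show HN rows of the table.
ask = {"commented": 6_911, "all": 9_139, "avg_excl": 13.74, "avg_incl": 10.39}
show = {"commented": 5_059, "all": 10_158, "avg_excl": 9.81, "avg_incl": 4.89}

# Ratio of Ask HN to Show HN average comments, under each treatment of zeros.
ratio_excl = ask["avg_excl"] / show["avg_excl"]
ratio_incl = ask["avg_incl"] / show["avg_incl"]

# Share of posts in each group that received at least one comment.
ask_share = ask["commented"] / ask["all"]
show_share = show["commented"] / show["all"]
share_ratio = ask_share / show_share

print(f"Ask/Show ratio, zeros excluded: {ratio_excl:.2f}")
print(f"Ask/Show ratio, zeros included: {ratio_incl:.2f}")
print(f"Share with comments: Ask {ask_share:.1%}, Show {show_share:.1%}")
print(f"Excluded ratio x share ratio:   {ratio_excl * share_ratio:.2f}")
```

With these figures the ratio is about 1.4 when zero-comment posts are excluded and about 2.1 when they are included. The extra factor of roughly 1.5 is the ratio of the two groups' shares of commented posts, so the change in the ratio comes from how often each post type gets no comments.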

Thank you for the detailed answer!

I did wonder about that. I’m aware forums/social media do have a lot of information posted that doesn’t exactly merit or allow engagement. I may be trying to get too far ahead of myself for where I am in terms of Python and data analysis.

Teasing more detail out of this dataset may be worth the effort just for fun extras.

Focusing on those questions may be of benefit to me, then, until I’m more familiar with other methods, techniques and oddities of data science.

