Predicting Bike Rentals with Linear Regression, Decision Trees, and Random Forests: A Comparison of Performance

Hello Dq!

Please give feedback on my guided project using Random Forests, Decision Trees and Linear Regression.

Mainly I’d like to know if my conclusions are sound, but also feedback on my feature creation process and hyperparameter tuning process would be great.

20) Predicting Bike Rentals [Random Forests].ipynb (1.4 MB)


Hi @kevindarley2024

Well, I am pretty sure your ID is going to come up on my profile page with the “top most replied to” tag!

Here we go:

  • I would bring up the “Washington D.C…” section right after the intro, as it makes clear early on that this is 2 years of data recorded for every hour. Then just a simple “let’s begin” before importing files.
  • I love this sentence: “There are no apparent missing values.” :smile:
  • You tried plt.title but commented it out. Why? Did plt.title and plt.suptitle not work for cells 14 to 16?

Regarding correlations, it depends on the type of variables we are working with. Usually 0 to 0.3 is weak, 0.3 to 0.7 is moderate, and >= 0.7 is strong. If we go by this blog, almost all the variables in this dataset are measurable, so we can go by the above rule. Let me know your thoughts on this.
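A minimal sketch of the rule of thumb above, in case it helps to see it applied. It assumes the bands refer to the absolute value of Pearson's r and that a value of exactly 0.3 counts as moderate. The function names and the toy numbers are made up for illustration and are not taken from the notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_strength(r):
    """Label a correlation using the 0.3 / 0.7 rule of thumb.

    Assumption: the sign is ignored, and boundary values fall in the
    higher band (0.3 is "moderate", 0.7 is "strong").
    """
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical toy values, not from the bike rental dataset
temp = [0.2, 0.4, 0.5, 0.7, 0.9]
rentals = [20, 45, 50, 80, 95]

r = pearson_r(temp, rentals)
print(f"r = {r:.2f} -> {correlation_strength(r)}")
```

Taking the absolute value matters here: a feature with r = -0.8 is as informative to a linear model as one with r = 0.8.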

Some nitpicking:

  • Only one of the years is a leap year, so the expected number of data rows would be lower :rofl: (I have to resort to this just to annoy you!)
  • For cells 21-25, why did you skip the range function?

I did not understand this part after cell 41: “Our First iteration is surprising. It was expected that the sum of the predicted parts would be more accurate at predicting the total, however we see that the predictions on the total were more accurate.”

Indeed, the output of cell 38 does show almost the same features. Sorting the printed result would have been the :cherries: on top! The major difference was atemp for registered users instead of temp, and temp and atemp are highly correlated (almost 100%). I wonder if checking the assumptions of linear regression on this dataset would have helped improve the results further :thinking:

Also wondering whether my doubt even made sense :grimacing:

Moving on, the RMSE consistently improved for the most part. And now the best part of this project…
Especially the plots (except for cell 18) :+1: Kudos for the shooting range idea! :smile: Great project and submission again! Thanks for sharing :slight_smile:

Edit: I was late to post this. You were declared a champion already! Congrats.


Hi Rucha!
I like to think that’s because you like my projects so much! :blush:

Good call on the Washington D.C. callout – better to explain the data a little bit for context before giving the results.

If I weren’t studying data science, maybe I’d make it as a politician? Hahaha. Best to cover my butt because sometimes missing values are weirdly encoded and hard to find, right?

For the title and subtitles I think I was testing different ways to show the title and subtitle and ended up going with ax.text because it gave more flexibility in spacing and location – I must have forgotten to delete the old ones!

Regarding correlations, I’d say that you would want at least a moderate correlation. You could include weak correlations, but you risk adding noise to the model. I think of it this way: if a weak correlation is anything between 0 and .3, would you find value in a feature whose correlation is 0 but which is still categorized as weakly correlated? There’s definitely nuance. If you’re lacking features, maybe you can shift your window down to .2 and so on, but more features don’t necessarily lead to a more accurate model (or vice versa), especially with linear regression.

So that’s why there were fewer days haha, I was curious why and now I know!

What range function?

Cell 41: the cnt variable for a given hour is just registered + casual. In our analysis we see that registered and casual users behave differently with the bike share program. The hypothesis was that, because of these differences in behavior, we would be able to predict the registered and casual columns more accurately, and that the sum of their errors would be less than the error on the predicted cnt column. What was surprising was that this wasn’t the case: the predicted cnt had a lower error than the sum of the predicted registered and casual errors.
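One thing worth checking in a comparison like this is which quantity is being compared. A rough sketch with made-up numbers (none of these values or variable names come from the notebook): adding the two RMSE values is not the same as taking the RMSE of the summed predictions, and by the triangle inequality the first can never be smaller than the second, since errors on registered and casual can partly cancel.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical hourly values, for illustration only
registered = [30, 120, 200, 80]
casual = [5, 40, 60, 10]
cnt = [r + c for r, c in zip(registered, casual)]

# Hypothetical predictions from two separate models
pred_registered = [35, 110, 210, 75]
pred_casual = [3, 45, 55, 12]

# Option A: add the two predictions, then score against cnt
pred_cnt_from_parts = [r + c for r, c in zip(pred_registered, pred_casual)]
rmse_of_summed_predictions = rmse(cnt, pred_cnt_from_parts)

# Option B: score each part separately, then add the two RMSE values
sum_of_rmses = rmse(registered, pred_registered) + rmse(casual, pred_casual)

print(f"RMSE of summed predictions: {rmse_of_summed_predictions:.2f}")
print(f"Sum of the two RMSEs:       {sum_of_rmses:.2f}")
```

If the notebook used Option B, the comparison against the direct cnt model is tilted against the split approach; Option A is the like-for-like number.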

Those calculations in cell 38 were so expensive to run (like 12 minutes)! I think the output is sorted by priority, but a sorted output would have made them easier to compare. And totally, I think I was maybe a little lazy in not removing collinear features; the same is probably true for some of the time-of-day features.
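For what it's worth, a small sketch of how collinear pairs such as temp and atemp could be flagged before fitting. The 0.9 cutoff, the helper names, and the toy values are assumptions for illustration, not something taken from the notebook.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold.

    The threshold is an assumed cutoff; pick whatever suits the model.
    """
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical toy columns, not the real dataset
features = {
    "temp": [0.24, 0.22, 0.32, 0.40, 0.56, 0.70],
    "atemp": [0.29, 0.27, 0.33, 0.41, 0.53, 0.65],
    "windspeed": [0.30, 0.00, 0.19, 0.25, 0.10, 0.22],
}

for name_a, name_b, r in collinear_pairs(features):
    print(f"{name_a} and {name_b} are collinear (r = {r:.3f}); consider keeping only one")
```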

What’s wrong with cell 18? Is it just a bit bare and without an immediate insight? I was thinking maybe I could have done k-means or something here, too, instead of eyeballing it.

I was pretty proud of the shooting gallery, but you missed one big thing… The MAE and RMSE labels are flipped! Gotta fix that.
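A quick sanity check that can catch this kind of mix-up: on the same set of predictions, RMSE is never smaller than MAE, so if the value labeled RMSE is the smaller one, the labels are probably swapped. A minimal sketch with made-up numbers (not the notebook's results):

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical values, with one large miss to show how RMSE reacts to it
actual = [16, 40, 32, 13, 110]
predicted = [20, 35, 30, 15, 80]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)

# RMSE >= MAE always holds, so this guards against swapped labels
assert rmse_value >= mae_value

print(f"MAE:  {mae_value:.2f}")
print(f"RMSE: {rmse_value:.2f}")
```

The gap between the two also says something about the errors: the wider it is, the more the error is concentrated in a few large misses.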

Thanks so much for your time Rucha! Great feedback and it’s appreciated more than you know :blush:.