Random Forest Results - Guided Project 'Predicting the stock market'

Hi!
I am currently working on time series datasets, so I decided to revisit the Guided Project ‘Predicting the stock market’. In this project, we use Linear Regression to predict market prices. The error is not that bad, but it could be improved with further feature engineering or perhaps other algorithms. As mentioned at the end of the project, I tried an ensemble algorithm, a Random Forest Regressor, but the result wasn’t what I expected…
Indeed, the error is very high… So I decided to plot the predictions against the actual data, and I am having a hard time understanding the issue. There seems to be some kind of threshold in the predictions.
Is it because the stock price has never been that high before?
Is the algorithm not suitable for this kind of dataset?
Thanks for your help!

Hi @ClementD,

Welcome to the community! I recommend checking out this notebook on Kaggle, which uses both Linear Regression and Random Forest Regression. As in your case, the Random Forest does not perform well there when we plot the predictions.

While I don’t have an in-depth understanding of Time Series Analysis, I assume the Random Forest is failing here due to a lack of data points.

If we look at the value ranges in y_train:

pd.cut(y_train, bins=[0, 300, 900, 1200, 1500, 1800, 2100]).value_counts().sort_index()

(0, 300]        9418
(300, 900]      2430
(900, 1200]     1675
(1200, 1500]    1848
(1500, 1800]     115
(1800, 2100]       0
Name: Close, dtype: int64

The vast majority of the values lie between 0 and 1500. Looking at the plot, the Random Forest performs fairly well up to somewhere around 1550. However, since the training data has very few data points above 1500 (and none above 1800), it wasn’t able to predict accurately beyond that.

Hope this helps. 🙂


Ten minutes on these 3 helped me understand that regression trees cannot predict beyond the range of the training data.
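
To make that point concrete, here is a minimal sketch on synthetic data (a made-up rising series, not the stock data from the project). It assumes scikit-learn and NumPy are installed. Each tree predicts the average of the training targets in a leaf, so the forest should never predict above the highest target it was trained on, while Linear Regression keeps following the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, steadily rising series (an assumption for illustration only).
X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()           # training targets: 0 to 198
X_test = np.arange(100, 150, dtype=float).reshape(-1, 1)
y_test = 2.0 * X_test.ravel()             # test targets: 200 to 298, all unseen

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)

rf_pred = rf.predict(X_test)
lr_pred = lr.predict(X_test)

# The forest's predictions are averages of training targets, so they
# cannot exceed the largest target seen during training.
print("max target seen in training:", y_train.max())
print("max Random Forest prediction:", rf_pred.max())
print("max Linear Regression prediction:", lr_pred.max())
```

Plotting `rf_pred` against `y_test` should show the same kind of flat ceiling as in the original question.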


Maybe you would get better results with greater depth, which allows the tree to make more splits. The higher-valued points could then be grouped into a leaf by themselves, without the lower-valued points, so that averaging the group gives a high value.
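
As a rough check of the depth idea, the sketch below compares a shallow and a fully grown single tree on the same kind of synthetic rising series (again an assumption, not the project's data; scikit-learn is assumed to be installed). The deeper tree isolates the highest points in their own leaf, so its prediction for an unseen, larger input moves closer to the training maximum, but it still cannot go past it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic rising series (illustrative only): targets range from 0 to 198.
X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# An input well beyond anything seen in training.
X_new = np.array([[150.0]])

# Shallow tree: the top leaf averages many high and not-so-high points.
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, y_train)
# Fully grown tree: the top leaf holds only the highest training point.
deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_train, y_train)

shallow_pred = shallow.predict(X_new)[0]
deep_pred = deep.predict(X_new)[0]

print("shallow tree prediction:", shallow_pred)
print("deep tree prediction:", deep_pred)
print("max training target:", y_train.max())
```

So greater depth can raise the ceiling toward the highest training value, but for prices above that value the gap remains.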