I wanted to share my latest guided project, predicting the stock market with linear regression. I’d mainly like feedback on areas of improvement for optimizing the algorithm itself and the accuracy of the conclusions I draw from the results. But really, any feedback is greatly appreciated.
Predicting the Stock Market with Linear Regression.ipynb (2.3 MB)
The initial narrative was really cool. I sort of missed that in the later sections. Overall a cool & structured project to read.
The first plot makes it seem that the first few selected features can work well, and that when more features were added, the model deviated farther from the actual values. However, is there a specific reason why the last plot only has a one-year window?
I am also curious about filtering out the null data from the early history by index number rather than by date (1951-01-03, as suggested by DQ). Most of the world’s stock markets trade Monday to Friday. The S&P 500 might allow trading on Sunday as well, but I don’t think the dataset has that data. So filtering by index 365 may have dropped extra records from the analysis (Cell 22)?
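To make the index-versus-date point concrete, here is a minimal sketch. It does not use the notebook's data: `weekday_calendar` is a hypothetical stand-in for the dataset's date column that assumes a 1950-01-03 start and Monday to Friday trading with no holidays. Under those assumptions it compares cutting at row 365 with cutting at 1951-01-03.

```python
from datetime import date, timedelta

def weekday_calendar(start, end):
    """Hypothetical stand-in for the dataset's dates: Mon-Fri only, holidays ignored."""
    days = []
    d = start
    while d <= end:
        if d.weekday() < 5:  # 0-4 are Monday to Friday
            days.append(d)
        d += timedelta(days=1)
    return days

# Assumed start date; the real dataset may begin on a different day.
trading_days = weekday_calendar(date(1950, 1, 3), date(1952, 12, 31))

# Option A: drop the first 365 rows (rows are trading days, not calendar days).
kept_by_index = trading_days[365:]

# Option B: drop everything before the calendar cutoff.
cutoff = date(1951, 1, 3)
kept_by_date = [d for d in trading_days if d >= cutoff]

extra_rows_lost = len(kept_by_date) - len(kept_by_index)
print("first row kept by index:", kept_by_index[0])
print("first row kept by date: ", kept_by_date[0])
print("extra rows lost by index cutoff:", extra_rows_lost)
```

Because a calendar year holds only about 260 trading rows, row 365 lands several months after the one-year mark. With real market holidays removed as well, the gap would be somewhat larger.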
Do check your spelling, especially in the sub-headings.
The narrative feedback is a good thing for me to note. I feel like that’s a weak point overall for my work so far. Admittedly, as I progress through these projects I get lazy and focus only on the mechanics rather than building an interesting narrative. I guess I want to cruise on to the next module. That’s a habit I need to break!
For the one-year plot window, I was trying to highlight that despite what it may look like from the larger view of ~2003-2016, if you zoom in, the model is clearly not useful: the predictions trail the actual stock movement by days rather than leading it, which would be the ideal. Other projects had predictions for one day in advance, but I felt that showing a predicted time series was useful in demonstrating why the model won’t work. I am uncertain about my interpretation, so if you disagree I’d be interested in yours. [Edit: I will go in and adjust it so the chart comparisons are fair and it’s clear that the model isn’t less accurate because of overfitting from feature selection]
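One rough way to check the "predictions lag the actual prices" reading numerically, rather than by eye, is to shift the predictions back by k rows and see which shift gives the smallest error. The sketch below is only an illustration on synthetic data: `best_lag`, `mean_abs_error`, and the made-up `actual` and `predicted` series are all hypothetical and are not taken from the notebook.

```python
import math

def mean_abs_error(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def best_lag(actual, predicted, max_lag=5):
    """Find the lag k (in rows) at which predicted[t] best matches actual[t - k]."""
    errors = {}
    for k in range(max_lag + 1):
        # Pair predicted[t] with actual[t - k]; the first k predictions are dropped.
        errors[k] = mean_abs_error(actual[: len(actual) - k], predicted[k:])
    return min(errors, key=errors.get), errors

# Synthetic "prices": a trend plus a wave, purely for illustration.
actual = [100 + 0.5 * t + 10 * math.sin(t / 5) for t in range(60)]

# Synthetic "model" that simply echoes the price from two rows earlier.
predicted = [actual[0]] * 2 + actual[:-2]

lag, errors = best_lag(actual, predicted)
print("best-matching lag:", lag)
for k, err in errors.items():
    print(f"lag {k}: MAE {err:.3f}")
```

If the smallest error on real predictions sits at a lag of one or more rows instead of zero, that would support the interpretation that the model is mostly echoing recent prices rather than anticipating them.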
When I was doing the date cut-off, I think the rationale was that, because of holidays, weekends, and some incomplete years, there would be issues using datetime logic when calculating the metrics: the number of days with active stock trading would differ from calculation to calculation, influencing the results. Since the dataset included only active trading days, I felt a better solution would be to use a flat integer (365) and accept some data loss from the likely outdated 1950s stock data. I say outdated because the way people trade now is very different from how it was in 1950, which likely has a negative influence on predictions in the 2000s, when far more people are participating, and doing so with computers. You are right, it isn’t perfect, but that is my rationale.
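The concern about calendar windows holding a varying number of trading days can be illustrated with a small sketch. It again assumes a simplified, hypothetical calendar (Monday to Friday, no holidays, starting 1950-01-03) instead of the real dataset, and `rows_in_trailing_window` is an illustrative helper, not a function from the notebook.

```python
from bisect import bisect_left
from datetime import date, timedelta

def weekday_calendar(start, end):
    """Hypothetical stand-in for the dataset's dates: Mon-Fri only, holidays ignored."""
    days = []
    d = start
    while d <= end:
        if d.weekday() < 5:
            days.append(d)
        d += timedelta(days=1)
    return days

def rows_in_trailing_window(days, i, calendar_days=365):
    """Count rows before days[i] that fall within the trailing calendar window."""
    window_start = days[i] - timedelta(days=calendar_days)
    # Rows from the first date >= window_start up to, but excluding, row i.
    return i - bisect_left(days, window_start)

trading_days = weekday_calendar(date(1950, 1, 3), date(1952, 12, 31))
first_day = trading_days[0]

# Only look at rows that have a full 365 calendar days of history behind them.
counts = {
    rows_in_trailing_window(trading_days, i)
    for i, d in enumerate(trading_days)
    if d - timedelta(days=365) >= first_day
}
print("distinct row counts in a 365-calendar-day window:", sorted(counts))
print("rows in a fixed 365-row window: always 365")
```

Even on this idealized calendar the calendar-based window does not always hold the same number of rows, and real holidays would widen the spread. A fixed row count avoids that, at the cost of spanning well over a calendar year of history.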
I checked my spelling and couldn’t find anything, but sometimes that’s what happens when you check your own work. I need to download the Jupyter Notebook spell checking extension!
Thanks for putting in the time to review my project, it’s really appreciated! If you have anything else please don’t hesitate, I’m always trying to improve :).
Thank you for sharing this, it was fun to read through it and to see how you can build many modules with which to analyze data from just a few simple created functions.
As someone who has always been interested in data and quantifying things, it’s neat to see a project built upon an idea and then executed.
I do understand why and how the narrative gets lost. When there is data and it’s being presented, it may seem like the words are superfluous. However, for a new reader like me, even just a few sentences to help parse what the code was doing and why would have been awesome.