How to calculate a 365 days calculation?

Hello everyone,

I’m experimenting with this very interesting mission.
https://app.dataquest.io/m/65/guided-project%3A-predicting-the-stock-market/1/the-dataset
I have a doubt about how to calculate different averages for the days periods suggested in the instructions.

If I understand correctly, when calculating for example The average price from the past 5 days I shouldn’t take into consideration the actual date of each row, but instead select the 5 records prior to each row.
The same goes for when I do the calculations for a period of 30 days. It makes sense.

But when calculating the average price over a period of 365 days, is the idea to try to use records contained in a calendar year instead of the records from 365 rows??

The first option would make more sense to me. Especially considering that in the instructions there is a suggestion about ignoring the rows that belong to the first year of the data.
But if we need to use 365 rows to make each calculation we won’t have enough records to get the data we want (only for the first section of the dataset).

Also, over that long period of time (one year) I believe it makes more sense to only consider the data contained in the past calendar year instead of counting 365 trading days in the past.

I hope my explanation makes sense. Please, let me know what’s your take on this.
And also it’d be nice to hear some ideas of how to perform those calculations if we go for the calendar year measure. :grin:

Cheers!
Agus.

Hi Agus,

What mission is this? Is there a link to it?

Sorry!
I thought had added the link :sweat_smile:
I now edited the post.

Thanks, hanqi!

I haven’t done the mission and don’t have much domain knowledge too, so just speaking from common sense.

I feel it’s never a good idea to select 5 records prior ignoring the dates because that assumes that the previous 5 rows are correctly sorted in time, evenly spaced, not duplicated, did not get weekends filled in by some processing function (since weekends are not supposed to have data), and would be safer to select rows to analyze based on a fixed timedelta backwards from current time or some other historical time.

Selecting real date ranges makes it easier to communicate end points of the range, and filters out noise from non-relevant periods (if selected too much) and fills in missing information (if selected too little).

Selecting fixed number of previous rows may be more statistically correct since every analysis would then have the same number of samples.

Ideally these two could result in the same rows selected. Do you have any examples whether and why people insist on analyzing fixed number of previous rows?

For calculation, pandas has many date types. (point in time and range) https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Numpy has them too https://numpy.org/doc/stable/reference/arrays.datetime.html

1 Like

The decision about using 5 rows instead of calendar days is just part of the instructions of of the mission:

“Days” means “trading days” – so if you’re computing the average of the past 5 days, it should be the 5 most recent dates before the current one.