Why save a selection of data?

Screen Link:

While working through this course, I wondered why Dataquest saved a selection of data in this exercise: “We’ve already saved a selection of data from f500 to a dataframe named f500_sel.”

Why use f500_sel rather than f500?

Does the “saving a selection of data” technique help the code run faster?

If so, I would like to use this technique when working with a big dataframe.

When I work with a big dataframe and use a loop, it takes a long time to return the results.

Thank you!

I’m not an expert on this, and the following is only my limited understanding based on what I’ve read.

I think you’re right.

Another reason is probably to save resources. For example, processing a dataframe in memory is faster than processing it on disk, but memory is far more limited, so a smaller selection is easier to fit.

Plus, if you’re using a service that charges you based on memory usage or computational time, it is really important to lower those two values as much as you can to save costs.
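
To make the resource point a bit more concrete, here is a rough sketch of how you could compare the memory footprint of a full dataframe with that of a selection. The dataframe below is a made-up stand-in (its column names and values are my assumptions, not the real f500 data), so treat the numbers it prints as illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for f500: column names and values are invented
rng = np.random.default_rng(0)
f500 = pd.DataFrame({
    "company": [f"company_{i}" for i in range(500)],
    "revenues": rng.integers(1_000, 500_000, size=500),
    "profits": rng.integers(-5_000, 50_000, size=500),
    "country": rng.choice(["USA", "China", "Japan"], size=500),
})

# Keep only the rows and columns needed for the task.
# .loc slicing is label-based and inclusive, so :99 gives 100 rows here.
# .copy() makes the selection independent of the original dataframe.
f500_sel = f500.loc[:99, ["revenues", "profits"]].copy()

# deep=True also counts the memory used by Python string objects
full_bytes = f500.memory_usage(deep=True).sum()
sel_bytes = f500_sel.memory_usage(deep=True).sum()

print("full dataframe bytes:", full_bytes)
print("selection bytes:     ", sel_bytes)
```

Dropping the string columns usually accounts for most of the savings, since object columns tend to be the most expensive ones to hold in memory.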

Could be relevant:

If so, I would like to use this technique when working with a big dataframe.

Pandas relies on NumPy, and its operations are typically vectorised, so they should be faster than a normal Python loop. Plus, NumPy operations are mostly implemented in C, which is much faster than Python.
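
As a small illustration of the difference, the sketch below computes the same ratio twice, once with a row-by-row loop and once with a vectorised column operation. The column names and data are invented for the example, and the timings will vary from machine to machine, so I have not listed any expected numbers.

```python
import time

import numpy as np
import pandas as pd

# Made-up data: 10,000 rows with two numeric columns
df = pd.DataFrame({
    "revenues": np.arange(1, 10_001, dtype="float64"),
    "profits": np.arange(10_000, dtype="float64"),
})

# Row-by-row loop: each iteration builds a Series and runs Python-level code
start = time.perf_counter()
margins_loop = []
for _, row in df.iterrows():
    margins_loop.append(row["profits"] / row["revenues"])
loop_seconds = time.perf_counter() - start

# Vectorised: one expression applied to whole columns at once
start = time.perf_counter()
margins_vec = df["profits"] / df["revenues"]
vec_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorised: {vec_seconds:.4f} s")
```

Both approaches give the same values, so if a loop over a big dataframe is slow, rewriting it as a column operation is often worth trying before cutting the data down.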

Sometimes those performance optimizations are not enough, though, so you’ll have to use sampling instead, i.e. take a selection of the data. Sampling, from what I’ve read, is a complicated topic in its own right, and I’m not qualified to explain its intricacies.

Pandas has its own sampling methods:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
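
Here is a minimal sketch of how those two methods can be called, using a tiny invented dataframe (the column names are my assumptions). It only shows the mechanics and says nothing about which sampling strategy suits your data.

```python
import pandas as pd

# Tiny made-up dataframe: 6 rows for one country, 4 for another
df = pd.DataFrame({
    "country": ["USA"] * 6 + ["China"] * 4,
    "revenues": range(10),
})

# Simple random sample of a fixed number of rows;
# random_state makes the draw reproducible
simple = df.sample(n=3, random_state=1)

# Simple random sample of a fraction of the rows
half = df.sample(frac=0.5, random_state=1)

# Sample a fixed number of rows from each group (needs pandas 1.1 or later)
per_country = df.groupby("country").sample(n=2, random_state=1)

print(simple)
print(half)
print(per_country)
```

The grouped version can be handy when a plain random sample might leave out a small group entirely.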

Hi!

That makes sense!
Thank you very much for sharing the interesting article and the answer!

Plus, if you’re using a service that charges you based on memory usage or computational time, it is really important to lower those two values as much as you can to save costs.

This is good to know!

No worries.

Assuming that you’re subscribed, Dataquest does have a section on sampling, so you might want to look into that if you want to start sampling your dataset.

Cheers.

Amazing! I will have a look at this course!
