Random_state concept for series.sample()

I feel like I am being a little thick here. How does the random_state parameter, when using the series.sample() method, impact the sample? Is it only to control which samples are selected?

If that is the case, and you want a truly random sample, why would one want the random_state? Is it just so, when trying to run code multiple times, one always gets the same samples returned?

So… in the code (from the lesson in Statistics):

sample_avg_pts = []
for i in range(100):
    # Draw 10 values using seed i, then store the mean of that sample
    avg = wnba['PTS'].sample(n=10, random_state=i).mean()
    sample_avg_pts.append(avg)

It is generating a mean value (one value) based on a sample of 10 values in the series. It does this 100 times. I understand that.

The random_state, though, has no bearing on anything other than the repeatability? Do I have that correct?

Hi @marksegal,

Computers generate random numbers by relying on algorithms (predefined sets of rules). Because a computer follows these rules, it does not generate truly random numbers, but rather pseudo-random numbers that can be reproduced if one knows the seed value and the algorithm.
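To make the seed idea concrete, here is a minimal sketch using Python’s built-in random module rather than pandas. The seed 42 and the variable names are arbitrary choices for illustration only.

```python
import random

# Two independent generators started from the same (arbitrary) seed
gen_a = random.Random(42)
gen_b = random.Random(42)

seq_a = [gen_a.randint(1, 100) for _ in range(5)]
seq_b = [gen_b.randint(1, 100) for _ in range(5)]

# Same algorithm + same seed -> the same "random" sequence
print(seq_a == seq_b)  # True
```

Anyone who knows the algorithm and the seed can regenerate the sequence, which is what makes it pseudo-random.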

For most of what we do in data science, such as selecting a random sample from a dataset, these pseudo-random numbers are “good enough.” In fact, in data science and analytics you will often want to be able to reproduce your results, so having a way to create reproducible “randomness” is actually what you want.

Question
How does the random_state parameter, when using the series.sample() method, impact the sample? Is it only to control which samples are selected?
Answer
Yes. The random_state parameter is the seed (initial value) for the algorithm that produces the pseudo-random numbers used to select the “random” sample.
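As a quick illustration, the sketch below samples the same Series twice with the same random_state. The Series is a made-up stand-in for wnba['PTS'], not the lesson’s data, and the seed value 1 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for wnba['PTS']; the values are made up
points = pd.Series(range(100, 200), name="PTS")

first = points.sample(n=10, random_state=1)
second = points.sample(n=10, random_state=1)

# Same seed -> the same rows, in the same order
same_rows = first.equals(second)
print(same_rows)  # True
```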

Question
If that is the case, and you want a truly random sample, why would one want the random_state?
Answer
Even if you do not include a random state, the “random sample” still relies on an algorithm, which will not produce truly random values. I’m not positive, but I believe that without a seed, pandas falls back on NumPy’s global random number generator, which takes its initial value from the operating system’s entropy source (or from the clock if that is unavailable). Either way, the algorithm follows a set of rules to select an initial value when you don’t supply one, and then produces seemingly random (pseudo-random) values from it. So it may seem more random, because you are not getting the same results each time, but the results are still pseudo-random and could be predicted if you knew the seed value.
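One way to see that an unseeded sample is still driven by a generator: the sketch below assumes that, when random_state is left out, pandas draws from NumPy’s global random state, so reseeding that global state should reproduce the sample. The Series is made-up data, and this is an illustration of the idea rather than a statement about pandas internals.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for wnba['PTS']; the values are made up
points = pd.Series(range(100, 200), name="PTS")

# Assumption: with no random_state, sample() uses NumPy's global generator
np.random.seed(0)
first = points.sample(n=10)

np.random.seed(0)
second = points.sample(n=10)

# If the assumption holds, both draws pick the same rows
matches = first.equals(second)
print(matches)
```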

Question
Is it just so, when trying to run code multiple times, one always gets the same samples returned?
Answer
Yes, that is why we use the seed value, so we can reproduce our results. This can be convenient for many reasons, but for the purposes in DataQuest, it is perfect for answer checking.
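Here is a rough, self-contained version of the loop from the question that shows why this helps with answer checking. The Series is a made-up stand-in for wnba['PTS'], and sample_means is a hypothetical helper name, not something from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's wnba['PTS'] column (100 made-up values)
points = pd.Series(range(0, 500, 5), name="PTS")

def sample_means(series, n_samples=100, n=10):
    """Return n_samples sample means, seeding sample i with random_state=i."""
    return [series.sample(n=n, random_state=i).mean() for i in range(n_samples)]

run_1 = sample_means(points)
run_2 = sample_means(points)

print(len(run_1))      # 100
print(run_1 == run_2)  # True
```

Because each of the 100 samples has its own fixed seed, every run produces the same list of means, so the result can be compared against an expected answer.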

Question
The random_state, though, has no bearing on anything other than the repeatability? Do I have that correct?
Answer
Yes. The random state value you enter allows you to reproduce the same results. It is the initial value the random number generator algorithm uses to start producing the “random” numbers.

You are on the right track with your thinking! Hope this helps clarify things.

Bradon

p.s. Even though I said that computers generate pseudo-random numbers because they follow algorithms, there are ways that computers attempt to generate truly random values.

Thank you for that very detailed explanation! It is much appreciated.