
Why are my sample() and sample_n() results coming out differently?

On the second page of the Stratified Sampling and Cluster Sampling lesson, we are told that if we use sample_n(wnba, size = 10) and then calculate the mean of the PTS column in that sample, we will get the same result as when we calculate the mean of sample(wnba$PTS, size = 10). The lesson says that the mean in both cases will be 206.5.

However, I tried this in the console (with the code below) and I get two different results. I was also getting different results when learning about set.seed() in the previous lesson, so I’m very curious about this. Could someone explain whether I’ve done something wrong here? Or is the lesson incorrect, and do sample() and sample_n() actually return different results?

Screen Link:

https://app.dataquest.io/c/70/m/394/stratified-sampling-and-cluster-sampling/2/sampling-rows

My Code:

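library(dplyr)
set.seed(1)

# Sample 10 rows of the data frame, then average their PTS values
wnba_sampled <- sample_n(wnba, size = 10)
mean1 <- mean(wnba_sampled$PTS)

# Sample 10 values directly from the PTS column, then average them
pts_sampled <- sample(wnba$PTS, size = 10)
mean2 <- mean(pts_sampled)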

Result:

mean1
numeric (double)
[1] 171.4

mean2
numeric (double)
[1] 195.6

Hi @juliannekelso,

I’ve been looking into the problem and tried one thing that may help pin down where it comes from.

My first thought was that the data frame might not be the same one the lesson works with (although I’m sure you have the same one). Before concluding that, it occurred to me to try different seeds and see whether any of them reproduces the result DQ gives.

To check, you could loop over about 100 seeds, for example, and see whether any of the results matches DQ’s.

Here is the code snippet I have been testing with:

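lista <- list()

# Try seeds 1 to 100 and store the sample mean obtained with each one
for (a in 1:100) {
    set.seed(a)
    wnba_sampled <- sample(wnba$PTS, size = 10)
    aa <- mean(wnba_sampled)
    # append() returns a new list, so the result must be assigned back
    lista <- append(lista, aa)
}

# Print every stored mean
for (b in lista) {
    print(b)
}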

I have not been able to see the output of print in the DQ console, but as a starting point you can try it and see whether this number appears:

mean(wnba_sampled$PTS)
[1] 206.5
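The seed search described above can also be written so that it reports which seeds, if any, reproduce the target mean. This is only a sketch: `pts` is a made-up stand-in for wnba$PTS, and `seed_means`, `target` and `matching_seeds` are names chosen here for illustration.

```r
# Hypothetical stand-in for wnba$PTS (values are made up).
pts <- seq(10, 500, by = 10)

# Mean of a 10-element sample for each seed from 1 to 100.
seed_means <- vapply(1:100, function(s) {
  set.seed(s)
  mean(sample(pts, size = 10))
}, numeric(1))

# The value reported in the lesson.
target <- 206.5

# Compare with a small tolerance instead of == to avoid floating-point surprises.
matching_seeds <- which(abs(seed_means - target) < 1e-9)

# An empty result means no seed in 1:100 reproduced the target.
print(matching_seeds)
```

With the real data, replacing `pts` with wnba$PTS would show whether any seed between 1 and 100 gives 206.5.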

The next step would be to look at the contents of the data frame and confirm that it matches the lesson’s example; if it does and the results still differ, that would suggest DQ has an error.

# A tibble: 6 x 32
  Name  Team  Pos   Height Weight   BMI Birth_Place Birthdate   Age College Experience
  <chr> <chr> <chr>  <dbl>  <dbl> <dbl> <chr>       <chr>     <dbl> <chr>   <chr>     
1 Cour… CHI   G        173     66  22.1 US          August 2…    28 Gonzaga 6         
2 Érik… SAN   C        196     86  22.4 BR          Septembe…    34 Brazil  13        
3 Kia … NY    C        193     90  24.2 US          January …    30 Rutgers 9         
4 Sue … SEA   G        175     68  22.2 US          October …    36 Connec… 15        
5 Cand… LA    F/C      193     79  21.2 US          April 19…    31 Tennes… 10        
6 Shen… IND   G        180     78  24.1 US          Septembe…    26 Miami … 6         
# … with 21 more variables: Games_Played <dbl>, MIN <dbl>, FGM <dbl>, FGA <dbl>,
#   FG_perc <dbl>, Fifteen <dbl>, Three_PA <dbl>, Three_P_perc <dbl>, FTM <dbl>,
#   FTA <dbl>, FT_perc <dbl>, OREB <dbl>, DREB <dbl>, REB <dbl>, AST <dbl>,
#   STL <dbl>, BLK <dbl>, TO <dbl>, PTS <dbl>, DD_two <dbl>, TD_three <dbl>

On the other hand, if I am not mistaken, the difference between your two results is due to the order of the operations. I remember the same thing happening to me a while ago.
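To illustrate the order-of-operations point: after a single set.seed(1), the first sampling call advances the random number generator, so the second call starts from a different state. Below is a minimal sketch with a made-up data frame (`toy` and its values are hypothetical, not the wnba data), using base R row sampling as a stand-in for sample_n():

```r
# Hypothetical stand-in for the wnba data frame (values are made up).
toy <- data.frame(id = 1:50, PTS = seq(10, 500, by = 10))

# One seed, two draws: the second draw starts from an advanced RNG state.
set.seed(1)
rows_a <- toy[sample(nrow(toy), size = 10), ]  # base R analogue of sample_n()
pts_a  <- sample(toy$PTS, size = 10)
same_without_reseed <- identical(rows_a$PTS, pts_a)

# Reseeding before each draw: both calls start from the same RNG state.
set.seed(1)
rows_b <- toy[sample(nrow(toy), size = 10), ]
set.seed(1)
pts_b  <- sample(toy$PTS, size = 10)
same_with_reseed <- identical(rows_b$PTS, pts_b)

print(c(without_reseed = same_without_reseed, with_reseed = same_with_reseed))
```

If sample_n() draws its row indices the same way (which the lesson's matching 206.5 values suggest, but which I have not verified on DQ), then calling set.seed(1) immediately before each of the two sampling calls in the original code should make mean1 and mean2 agree.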

If you manage to make it work, please let me know.

I hope this has at least shed some light. :flashlight:

See you around.

A&E