Stratified sampling in R, SCREEN-5

Why is ylim set to (90, 310) for SRS before the values in the scatter plot are calculated? How do we know they will fall within 90-310?

The same goes for this exercise: why is ylim specified as (80, 320)?

Can't geom_point() do this automatically after the expression in mean_points is calculated?

Hi @sharathnandalike. This is done purely for aesthetic reasons so that the sample mean (blue line) is displayed in the center of the plot rather than being automatically adjusted by ggplot2. Thanks for the question. I’ll consider clarifying this in the course.

Hi Casey,

The answer did not address my query. How are the min. value of 90 and the max. value of 310 put in the code in advance, before the sample mean is calculated? And why did we not specify ylim(80, 320) here in the “learn” plot instead?

Kindly reply.

Hi @sharathnandalike. There is an opportunity for me to clarify this in the learn section of the course. In the case of screen 393.8, Dataquest provided the limits here to the student with this explanation:

Finally, let’s establish a high and low range for the y-axis of 90 and 310 respectively. We’ve selected this upper y-limit value to ensure that our maximum value of 302.1 is included in the plot.

The intent was that Dataquest figured out appropriate y-limits so that the student did not have to. Having a set y-limit of ylim(90, 310) is important on the following screen so that the y-axis is uniform across all plots, making them easier to compare.
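To make the "figured out in advance" step concrete, here is a minimal sketch of one way such limits could be derived: compute the values first, then round their range outward to a tidy number. This is not necessarily how Dataquest chose them. The helper round_limits() and the sample values are hypothetical; only the maximum of 302.1 comes from the course text.

```r
library(ggplot2)

# Hypothetical helper: widen the observed range outward to the nearest multiple of `step`.
round_limits <- function(x, step = 10) {
  c(floor(min(x) / step) * step, ceiling(max(x) / step) * step)
}

set.seed(1)
# Simulated stand-in for mean_points (NOT the course data).
# 95.4 is an assumed minimum; 302.1 is the maximum quoted in the course.
mean_points <- c(runif(98, min = 100, max = 300), 95.4, 302.1)

# Compute the limits once, after the sample means are known.
limits <- round_limits(mean_points)

sample_df <- data.frame(
  sample_number = seq_along(mean_points),
  mean_points = mean_points
)

# Reusing the same `limits` in every plot keeps the y-axis uniform across plots.
p <- ggplot(sample_df, aes(x = sample_number, y = mean_points)) +
  geom_point() +
  geom_hline(yintercept = mean(mean_points), color = "blue") +
  ylim(limits[1], limits[2])
```

In other words, the numbers are not known before the sample means exist; someone ran the calculation once and then hard-coded the result.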

Regarding screen 394.5, we use a different y-limit (ylim(80, 320)) because the range of the data was different with stratified random sampling. In that case, the max value of mean_points_season was 316.5, so if we had used the same y-limit as in the previous mission, it would have removed the max value from the plot.
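As a rough illustration of why the narrower limits would drop that point: ylim() sets scale limits, and ggplot2 treats values outside them as missing. The data below is a hypothetical stand-in for mean_points_season. Only the maximum of 316.5 comes from this thread; the other values and the sample_number column are assumptions for the sketch.

```r
library(ggplot2)

# Hypothetical stand-in for mean_points_season (only 316.5 is from the thread).
mean_points_season <- c(150, 205.3, 248.9, 316.5)
season_df <- data.frame(
  sample_number = seq_along(mean_points_season),
  mean_points_season = mean_points_season
)

base_plot <- ggplot(season_df, aes(x = sample_number, y = mean_points_season)) +
  geom_point()

p_narrow <- base_plot + ylim(90, 310)  # limits from the SRS screen
p_wide   <- base_plot + ylim(80, 320)  # limits from the stratified screen

# Values outside the scale limits show up as NA in the built layer data.
dropped_narrow <- sum(is.na(layer_data(p_narrow)$y))
dropped_wide   <- sum(is.na(layer_data(p_wide)$y))
```

If the goal were only to zoom the view without discarding points, coord_cartesian(ylim = c(80, 320)) is the usual alternative, since it changes the visible window rather than the scale limits.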

In both cases, Dataquest did the work to figure out y-limits so that the student did not have to. Thanks for asking about this. I may clarify it when I perform optimizations of the course in the future.