Statistics Questions
Sample Means Distribution
We are interested in exploring the nature of sampling. We often want to know about characteristics of a
population – the proportion who vote or the average height or age, for example. Since it is usually impossible to
collect data from the entire population, we use samples to estimate the quantities of interest. You might wonder how
good a job a sample statistic does of estimating a population parameter. Answering that question is the goal of this exploration.
In this activity you will study the ages of a large population (over 5000). We will compute the population mean
age and then generate a large number of samples, calculate the means of those samples, and see how close
those means come to the population mean. More importantly, we shall see that the distribution of those sample means
forms a familiar shape.
In addition, we will repeat the sampling procedure for larger sample sizes and see how this affects the distribution
of sample means.
Use this data set (ages of people arrested in Pittsburgh, PA) in StatCrunch to answer the following questions.
1. Make a histogram of the ages. Take a screenshot (insert it below) and comment on the shape of the distribution.
Comment about the shape:
2. Take random samples of the ages by going to the Data menu and selecting Sample. In the Sample window,
set the sample size to 50 and the number of samples to 500. Compute the samples. Notice that you will now
have 500 new columns in the table, representing the 500 samples you’ve taken.
3. Now go to the Stat menu and select Summary Stats -> Columns. Select all 500 new sample columns (click once
at the top of the list where it says Sample1(AGE), hold the SHIFT key, scroll down, and click the last
sample, Sample500(AGE)). In the Statistics window, select only Mean. Then select the Store in Data Table
checkbox at the bottom of the window and calculate.
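For readers who want to see steps 2 and 3 outside StatCrunch, here is a minimal Python sketch of the same idea: draw 500 samples of size 50 and store the mean of each. The population below is a made-up, right-skewed stand-in for the age data (the real data set is not reproduced here), and the function name `sample_means` and the seeds are hypothetical. It also assumes each sample is drawn without replacement, which may not match the option chosen in StatCrunch.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=None):
    """Draw num_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    # rng.sample draws without replacement within a single sample (assumption)
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(num_samples)]

# Hypothetical stand-in population: 5000 right-skewed "ages" (NOT the real data)
rng = random.Random(1)
ages = [min(18 + int(rng.expovariate(1 / 14)), 90) for _ in range(5000)]

# Steps 2 and 3: 500 samples of size 50, keeping only each sample's mean
means_50 = sample_means(ages, sample_size=50, num_samples=500, seed=2)

print("number of sample means:", len(means_50))
print("population mean:       ", round(statistics.fmean(ages), 2))
print("mean of sample means:  ", round(statistics.fmean(means_50), 2))
```

The list `means_50` plays the role of the Means column stored in the data table.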
4. Create a Histogram (Graph menu) of the Means column (take a screenshot and include it below) and note the
shape of the distribution of means. Compare this with the shape of the original distribution of ages
(you may want to paste that in above it for comparison purposes).
Compare the shapes of the original population
distribution and the distribution of sample
means:
5. Now find the summary statistics of the Means column and Age column (leave all the statistics selected). Take a
screenshot of your table and paste it in below. Compare the means and standard deviations of the two distributions.
a. Note (comment about this) that the means are about the same and the standard deviation of the sample
means is much smaller than the standard deviation of the population.
b. Divide the population standard deviation by √50 and note that this is about the size of the standard deviation
for sample means.
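The division in part b follows the standard relationship between the spread of a population and the spread of its sample means. The symbols here are introduced for illustration and do not appear in the worksheet: σ is the population standard deviation, n is the sample size, and σ with subscript x̄ is the standard deviation of the sample means.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\qquad\Longrightarrow\qquad
n = 50:\quad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{50}} \approx \frac{\sigma}{7.07} \approx 0.141\,\sigma
```

Because the Means column comes from only 500 random samples, its standard deviation should land near this value rather than match it exactly.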
Compare the means and standard deviations of the
original population distribution and the distribution of
sample means:
6. We are going to start over, but first select and copy the column titled Means: hover over the down arrow at the
top of the column and click (this should select the whole column), then copy (Command-C on a Mac; CTRL-C on a PC).
Next, click the refresh button on your browser. This should erase all of the new columns; if it does not, close the
window and come back to this page to follow the link in #2 again.
In the new window, paste the Means column from your previous window and retitle it Means 50 (if you forgot to copy
the Means column, skip this and keep going).
7. Repeat step #2, but this time set the sample size to 200: go to the Data menu and select Sample. In the
Sample window, set the sample size to 200 and the number of samples to 500. Compute the samples. Notice that
you will now have 500 new columns in the table.
8. Now go to the Stat menu and select Summary Stats -> Columns. Select all 500 new sample columns (click once
at the top of the list where it says Sample1(AGE), hold the SHIFT key, scroll down, and click the last
sample, Sample500(AGE)). In the Statistics window, select only Mean. Then select the Store in Data Table
checkbox at the bottom of the window and calculate.
9. Create a Histogram (Graph menu) of the Age column, the new Means column, and the Means 50
column. Set Columns per Page to 3 and compute. Take a screenshot (include it below) and note
the shape of the distribution of means. Compare this with the shape of the original distribution of ages
and the Means 50 distribution. Pay attention to spread.
Compare the shapes of the original population Ages
distribution, this distribution of sample means, and the
Means 50 distribution:
10. Now find the summary statistics of the new Means column and Age column (leave all the statistics selected). Take a
screenshot of your table (include it below) and compare the means and standard deviations of the two distributions.
a. Note (comment about this) that the means are about the same and the standard deviation of the sample
means is much smaller than the standard deviation of the population.
b. Divide the population standard deviation by √200 and note that this is about the size of the standard deviation
for sample means.
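Part b can be related back to the size-50 samples with a short worked equation. Here σ stands for the population standard deviation, a symbol introduced for illustration only. Since 200 = 4 × 50, dividing by √200 gives half of what dividing by √50 gives:

```latex
\frac{\sigma}{\sqrt{200}}
= \frac{\sigma}{\sqrt{4 \cdot 50}}
= \frac{1}{2} \cdot \frac{\sigma}{\sqrt{50}}
\approx \frac{\sigma}{14.14}
\approx 0.071\,\sigma
```

In other words, quadrupling the sample size should roughly halve the spread of the sample means, subject to the usual random variation from taking only 500 samples.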
Compare the means and standard deviations of the
original population distribution and the distribution of
sample means:
11. Finally, compare the results of your two sample distributions. Make dotplots of the two sample mean distributions
(select one column, then hold the Command key (Mac) or CTRL key (PC) and select the other). Take a screenshot and
include it below. Think about shape, center, and spread. How are they similar? How are they different? What does
this suggest to you about the effect of sample size in estimating the population mean?
Reese’s Pieces
What does it mean to be 95% confident? What does it mean to say
the confidence interval method is valid? We will turn to an applet
called Simulating Confidence Intervals to illustrate this.
Imagine using a random sample of Reese’s Pieces candies to
estimate the proportion of all such candies that are orange. The applet
will simulate taking a large number of random samples and generating
a confidence interval based on each sample.
Begin by setting the applet to the correct values:
● Statistic → Proportions
● Distribution → Binomial
● Method → Wald
● π → 0.45 (Hershey’s tells us this is the population parameter)
● Sample Size → 75
● Confidence Level → 95%
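To make the Method → Wald setting concrete, here is a minimal Python sketch of the standard Wald interval for a proportion: the sample proportion plus or minus a normal critical value times its estimated standard error. The worksheet does not show how the applet computes its intervals, so treat the match as an assumption; the function name `wald_interval` and the sample of 30 orange candies are hypothetical.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 30 orange candies out of 75
low, high = wald_interval(30, 75)
print(f"95% interval: ({low:.3f}, {high:.3f})")
print("captures 0.45:", low <= 0.45 <= high)
```

Each press of the applet's sampling button corresponds to one call like this with a fresh random sample.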
Warm-up:
Begin by pressing the button and observe that the applet takes a sample of 75 Reese’s Pieces and
creates a confidence interval from it.
Also notice the vertical line at 0.45 representing the true population parameter.
Keep pressing the button and observe how the different samples produce different confidence intervals.
Keep pressing until one of the intervals does not overlap the vertical line at 0.45. What do you notice about this
interval? Remember that.
a) As we take new samples, what do you notice about the intervals? Are they all the same? Are any colored red?
What does that denote?
b) Does the value of the population proportion change as we take new samples?
Now change the Number of intervals to 100 and click the button.
c) About what percentage of the intervals seem to be successful at capturing (overlapping) the population
proportion?
d) Use the button to sort the intervals, and comment on what the intervals that fail to capture the
population proportion have in common.
e) In practice, you only take one sample and construct one confidence interval. Can you be sure that the
confidence interval successfully captures the true (but unknown) value of the parameter? In what sense can
you be confident of this?
f) Now change the confidence level to 80%. Before pressing the button, what changes do you
expect to see? Then press the button. What two things change about the intervals?
g) Now change the sample size to 300 (return the confidence level to 95%). Does this produce a dramatically
higher percentage of successful intervals? What does change about the intervals?
h) Is it desirable to have larger or smaller confidence levels? Explain.
i) Is it desirable to have wider or narrower confidence intervals? Explain.
j) What’s a drawback of using a very high confidence level such as 99.9%?
k) What would it take to achieve a very high confidence and a very narrow confidence interval?
Why is this so difficult to achieve?
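The trade-off behind questions h) through k) can be put in numbers with the Wald margin of error. The sketch below is only an illustration under stated assumptions: it plugs in p = 0.45 as a planning value, the target margin of 0.02 is a made-up example, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(p, n, confidence):
    """Half-width of a Wald interval when the sample proportion equals p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(p, margin, confidence):
    """Smallest n whose Wald margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / margin) ** 2 * p * (1 - p))

# Hypothetical comparison: margin at n = 75, and the n needed for a margin of 0.02
for confidence in (0.80, 0.95, 0.999):
    moe = margin_of_error(0.45, 75, confidence)
    n_needed = required_sample_size(0.45, 0.02, confidence)
    print(f"{confidence:.1%} confidence: margin at n=75 is {moe:.3f}, "
          f"n needed for a margin of 0.02 is {n_needed}")
```

The second function simply solves the margin-of-error formula for n, which is why the required sample size grows with the square of the critical value and with the inverse square of the target margin.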