homework 2 for statistics

MTH 542/642 Fall 2022
Homework # 2
Due: Friday, September 23, 11:59 pm.
Homework Policy







This assignment is worth 25 points. If homework is turned in late, then 4 points are deducted for each day of
delay.
Please give all the details of your work. Clearly explain your results and the notations you used, for full
credit.
Your work must be neatly presented. As many as 5 points might be deducted otherwise.
You are welcome to type the solutions for all the problems, but handwriting is also permissible and strongly
encouraged for mathematical proofs.
The R code must be presented for each question along with the answers for that question and not separately
at the end of the homework. You may cut and paste from R into Word. You may use Courier New when
pasting from R to Word.
All the warning messages and errors must be erased when pasting the R code into Word. Otherwise up to 3
points could be deducted for each warning message not erased.
Scan all the pages of your work into a single PDF file, then upload it on Blackboard. Only PDF files are accepted.
Some useful R commands:
pnorm(x) returns the cumulative distribution function for the standard normal distribution
pt(x,df) returns the cumulative distribution function for the t distribution, P(t ≤ x)
qnorm(p) returns the p-quantile for the standard normal distribution – a value x so that P ( X ≤ x) = p
qt(p,df) returns the p-quantile for the t distribution with df degrees of freedom
Problem 1. Consider the data in Heights described on page 2 of the textbook. The two variables are
mother’s height mheight and daughter’s height dheight.
Part I.
a) Obtain side-by-side boxplots for the data in Heights.
First use boxplot(Heights) and see what R gives you. Copy and paste in your word file. Then comment on
the appearance of the boxplots and compare them. Next, refine the appearance of the boxplots: label each
boxplot, color them, label the vertical axis, give a title, for example “Inheritance of Height”.
R hints: to give a title use main; to give names to each boxplot use names = c("name1", "name2"); to give a name to the vertical axis use ylab; to color each boxplot use col = c("color1", "color2").
Find the exact values for the mean, quartiles, minimum and maximum values.
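For illustration, here is a minimal sketch of such a call (assuming the Heights data frame has been loaded, e.g., from the textbook's companion alr4 package; the names and colors are placeholders you should choose yourself):

boxplot(Heights$mheight, Heights$dheight,
        names = c("Mothers", "Daughters"),
        col = c("lightblue", "lightgreen"),
        ylab = "Height (inches)",
        main = "Inheritance of Height")
summary(Heights)   # exact means, quartiles, minimum and maximum values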
b) Obtain density histograms for mother’s heights and also for daughter’s heights.
Label each of the axes, give a title, use different cutoff points (use breaks within hist) than the ones given by default, and mark the mean on each histogram (here is an example: points(mean(dheight), 0, pch = 24, bg = "black")).
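A minimal sketch for the daughters' histogram (assuming Heights is loaded; pick break points that cover the observed range):

hist(Heights$dheight, freq = FALSE,
     breaks = seq(54, 76, by = 1),
     xlab = "Daughter's height (inches)", ylab = "Density",
     main = "Daughters' Heights")
points(mean(Heights$dheight), 0, pch = 24, bg = "black")   # mark the mean on the horizontal axis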
c) What percentage of the mother’s heights is within one standard deviation from the mean?
What percentage of the mother’s heights is within two standard deviations from the mean?
Answer the same questions for the daughter’s heights data.
Make sure you specify the mean, standard deviation, the two values that are one standard deviation from the mean, etc.,
for both mothers and daughters.
R Hint:
An example: To find the percentage of the mother’s heights which are between 58 and 62 inches tall:
Create a vector that contains all the mother's heights in the data between 58 and 62, then find the proportion of such heights.
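A minimal sketch of this hint and of the standard-deviation version (assuming Heights is loaded):

mh <- Heights$mheight
mean(mh >= 58 & mh <= 62)                  # proportion of mothers between 58 and 62 inches tall
m <- mean(mh); s <- sd(mh)
100 * mean(mh >= m - s & mh <= m + s)      # percentage within one standard deviation of the mean
100 * mean(mh >= m - 2*s & mh <= m + 2*s)  # percentage within two standard deviations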
Excerpt from the textbook, Applied Linear Regression, Fourth Edition, by Sanford Weisberg (© 2014 John Wiley & Sons, Inc.), Chapter 2, Simple Linear Regression:

Because the variance σ² > 0, the observed value of the ith response yi will
typically not equal its expected value E(Y|X = xi). To account for this difference between the observed data and the expected value, statisticians have
invented a quantity called a statistical error, or ei, for case i defined implicitly
by the equation yi = E(Y|X = xi) + ei or explicitly by ei = yi − E(Y|X = xi). The
errors ei depend on unknown parameters in the mean function and so are not
observable quantities. They are random variables and correspond to the vertical distance between the point yi and the mean function E(Y|X = xi). In the
heights data, Section 1.1, the errors are the differences between the heights of
particular daughters and the average height of all daughters with mothers of
a given fixed height.
We make two important assumptions concerning the errors. First, we assume
that E(ei|xi) = 0, so if we could draw a scatterplot of the ei versus the xi, we
would have a null scatterplot, with no patterns. The second assumption is that
the errors are all independent, meaning that the value of the error for one case
gives no information about the value of the error for another case. This is likely
to be true in the examples in Chapter 1, although this assumption will not hold
in all problems.

Figure 2.1 Graph of a straight line E(Y|X = x) = β0 + β1x. The intercept parameter β0 is the expected value of the response when the predictor x = 0. The slope parameter β1 gives the change in the expected value when the predictor x increases by 1 unit.
Errors are often assumed to be normally distributed, but normality is much
stronger than we need. In this book, the normality assumption is used primarily to obtain tests and confidence statements with small samples. If the errors
are thought to follow some different distribution, such as the Poisson or the
binomial, other methods besides ols may be more appropriate; we return to
this topic in Chapter 12.
2.1 ORDINARY LEAST SQUARES ESTIMATION
Many methods have been suggested for obtaining estimates of parameters in
a model. The method discussed here is called ordinary least squares, or ols, in
which parameter estimates are chosen to minimize a quantity called the residual sum of squares. A formal development of the least squares estimates is
given in Appendix A.3.
Parameters are unknown quantities that characterize a model. Estimates of
parameters are computable functions of data and are therefore statistics. To
keep this distinction clear, parameters are denoted by Greek letters like α, β,
γ, and σ, and estimates of parameters are denoted by putting a “hat” over the
corresponding Greek letter. For example, β̂1 (read “beta one hat”) is the estimator of β1, and σ̂ 2 is the estimator of σ2. The fitted value for case i is given
by Ê(Y|X = xi), for which we use the shorthand notation ŷi,
ŷi = Ê(Y|X = xi) = β̂0 + β̂1 xi    (2.2)
Table 2.1 Definitions of Symbols^a

Quantity  Definition                               Description
x̄         Σ xi / n                                 Sample average of x
ȳ         Σ yi / n                                 Sample average of y
SXX       Σ (xi − x̄)² = Σ (xi − x̄) xi              Sum of squares for the xs
SD²x      SXX/(n − 1)                              Sample variance of the xs
SDx       [SXX/(n − 1)]^(1/2)                      Sample standard deviation of the xs
SYY       Σ (yi − ȳ)² = Σ (yi − ȳ) yi              Sum of squares for the ys
SD²y      SYY/(n − 1)                              Sample variance of the ys
SDy       [SYY/(n − 1)]^(1/2)                      Sample standard deviation of the ys
SXY       Σ (xi − x̄)(yi − ȳ) = Σ (xi − x̄) yi       Sum of cross-products
sxy       SXY/(n − 1)                              Sample covariance
rxy       sxy/(SDx SDy)                            Sample correlation

^a In each equation, the symbol Σ means to add over all n values or pairs of values in the data.
Although the ei are random variables and not parameters, we shall use the
same hat notation to specify the residuals: the residual for the ith case, denoted
êi, is given by the equation
êi = yi − Ê(Y|X = xi) = yi − ŷi = yi − (β̂0 + β̂1 xi),  i = 1, …, n    (2.3)
which should be compared with the equation for the statistical errors,
ei = yi − (β0 + β1 xi),  i = 1, …, n
The computations that are needed for least squares for simple regression
depend only on averages of the variables and their sums of squares and sums
of cross-products. Definitions of these quantities are given in Table 2.1. Sums
of squares and cross-products are centered by subtracting the average from
each of the values before squaring or taking cross-products. Appropriate alternative formulas for computing the corrected sums of squares and cross products from uncorrected sums of squares and cross-products that are often given
in elementary textbooks are useful for mathematical proofs, but they can be
highly inaccurate when used on a computer and should be avoided.
Table 2.1 also lists definitions for the usual univariate and bivariate summary
statistics, the sample averages ( x , y ), sample variances (SD2x, SD2y ) , which are
the squares of the sample standard deviations, and the estimated covariance
and correlation (sxy, rxy).1 The “hat” rule described earlier would suggest
that different symbols should be used for these quantities; for example, ρ̂ xy
might be more appropriate for the sample correlation if the population correlation is ρxy. This inconsistency is deliberate since these sample quantities
estimate population values only if the data used are a random sample from a
population. The random sample condition is not required for regression calculations to make sense, and will often not hold in practice.

1 See Appendix A.2.2 for the definitions of the corresponding population quantities.
To illustrate computations, we will use Forbes’s data introduced in Section
1.1, for which n = 17. The data are given in the file Forbes. The response
given in the file is the base-ten logarithm of the atmospheric pressure,
lpres = 100 × log10(pres) rounded to two decimal digits, and the predictor
is the boiling point bp, rounded to the nearest 0.1°F. Neither multiplication by
100 nor the base of the logarithms has important effects on the analysis. Multiplication by 100 avoids using scientific notation for numbers we display in
the text, and changing the base of the logarithms merely multiplies the logarithms by a constant. For example, to convert from base-ten logarithms to
base-two logarithms, multiply by log(10)/log(2) = 3.3219. To convert natural
logarithms to base-two, multiply by 1.4427.
Forbes’s data were collected at 17 selected locations, so the sample variance
of boiling points, SD²x = 33.17, is not an estimate of any meaningful population
variance. Similarly, rxy depends as much on the method of sampling as it does
on the population value ρxy, should such a population value make sense. In the
heights example, Section 1.1, if the 1375 mother–daughter pairs can be viewed
as a sample from a population, then the sample correlation is an estimate of
a population correlation.
The usual sample statistics are often presented and used in place of the
corrected sums of squares and cross-products, so alternative formulas are
given using both sets of quantities.
2.2 LEAST SQUARES CRITERION
The criterion function for obtaining estimators is based on the residuals, which
are the vertical distances between the fitted line and the actual y-values, as
illustrated in Figure 2.2. The residuals reflect the inherent asymmetry in the
roles of the response and the predictor in regression problems.
The ols estimators are those values β0 and β1 that minimize the function2

RSS(β0, β1) = Σ_{i=1}^{n} [yi − (β0 + β1 xi)]²    (2.4)

When evaluated at (β̂0, β̂1), we call the quantity RSS(β̂0, β̂1) the residual sum of squares, or just RSS.
The least squares estimates can be derived in many ways, one of which is
outlined in Appendix A.3. They are given by the expressions
β̂1 = SXY/SXX = rxy (SDy/SDx) = rxy (SYY/SXX)^(1/2)

β̂0 = ȳ − β̂1 x̄    (2.5)

2 We occasionally abuse notation by using the symbol for a fixed though unknown quantity like β0 or β1 as if it were a variable argument. Thus, for example, RSS(β0, β1) is a function of 2 variables to be evaluated as its arguments β0 and β1 vary.
Figure 2.2 A schematic plot for ols fitting. Each data point is indicated by a small circle. The solid
line is the ols line. The vertical lines between the points and the solid line are the residuals. Points
below the line have negative residuals, while points above the line have positive residuals. The
true mean function shown as a dashed line for these simulated data is E(Y|X = x) = 0.7 + .8x.
The several forms for β̂1 are all equivalent.
We emphasize again that ols produces estimates of parameters but not the
actual values of the parameters. As a demonstration, the data in Figure 2.2
were created by setting the xi to be a random sample of 20 numbers from a
normal distribution with mean 2 and variance 1.5 and then computing
yi = 0.7 + 0.8xi + ei, where the errors were sampled from a normal distribution
with mean 0 and variance 1. The graph of the true mean function is shown in
Figure 2.2 as a dashed line, and it seems to match the data poorly compared
with ols, given by the solid line. Since ols minimizes (2.4), it will always fit at
least as well as, and generally better than, the true mean function.
Using Forbes’s data to illustrate computations, we will write x to be the
sample mean of bp and y to be the sample mean of lpres. The quantities
needed for computing the least squares estimators are
x̄ = 202.9529    SXX = 530.7824    SXY = 475.3122
ȳ = 139.6053    SYY = 427.7942    (2.6)
The quantity SYY, although not yet needed, is given for completeness. In the
rare instances that regression calculations are not done using statistical software, intermediate calculations such as these should be done as accurately as
possible, and rounding should be done only to final results. We will generally
display final results with two or three digits beyond the decimal point. Using
(2.6), we find
β̂1 = SXY/SXX = 0.895
β̂0 = ȳ − β̂1 x̄ = −42.138
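These numbers are easy to reproduce from the definitions in Table 2.1; a minimal sketch in R, assuming the Forbes data frame (with variables bp and lpres, e.g., from the textbook's companion alr4 package) is loaded:

x <- Forbes$bp; y <- Forbes$lpres
xbar <- mean(x); ybar <- mean(y)
SXX <- sum((x - xbar)^2)
SXY <- sum((x - xbar) * (y - ybar))
b1 <- SXY / SXX          # about 0.895
b0 <- ybar - b1 * xbar   # about -42.138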
The estimated intercept β̂0 = −42.138 is the estimated value of lpres
when bp = 0. Since the temperatures in the data are in the range from about
194°F to 212°F, this estimate does not have a useful physical interpretation.
The estimated slope of βˆ 1 = 0.895 is the change in lpres for a 1°F change in
bp.
The estimated line given by
Ê(lpres|bp) = −42.138 + 0.895 bp
was drawn in Figure 1.4a. The fit of this line to the data is excellent.
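The same fit can be obtained directly with lm(); a minimal sketch, again assuming Forbes is loaded:

m1 <- lm(lpres ~ bp, data = Forbes)
coef(m1)                        # intercept about -42.138, slope about 0.895
plot(lpres ~ bp, data = Forbes)
abline(m1)                      # add the fitted line, as in Figure 1.4a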
2.3 ESTIMATING THE VARIANCE σ²
Since the variance σ² is essentially the average squared size of the ei, we
should expect that its estimator σ̂ 2 is obtained by averaging the squared
residuals. Under the assumption that the errors are uncorrelated random
variables with 0 means and common variance σ 2, an unbiased estimate of
σ 2 is obtained by dividing RSS = ∑ êi2 by its degrees of freedom (df), where
residual df = number of cases minus the number of parameters in the mean
function. For simple regression, residual df = n − 2, so the estimate of σ2 is
given by
σ̂² = RSS/(n − 2)    (2.7)
This quantity is called the residual mean square. In general, any sum of squares
divided by its df is called a mean square. The residual sum of squares can be
computed by squaring the residuals and adding them up. It can also be computed from the formula (Problem 2.18)
RSS = SYY − SXY²/SXX = SYY − β̂1² SXX    (2.8)
Using the summaries for Forbes’s data given at (2.6), we find
RSS = 427.794 − (475.3122)²/530.7824 = 2.1549    (2.9)

σ̂² = 2.1549/(17 − 2) = 0.1436    (2.10)
The square root of σ̂², σ̂ = √0.1436 = 0.379, is called the standard error of
regression. It is in the same units as is the response variable.
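With the fit m1 from the sketch above, the same quantities can be computed in R:

RSS <- sum(residuals(m1)^2)     # about 2.155 (also deviance(m1))
n <- nrow(Forbes)               # 17
sigma2.hat <- RSS / (n - 2)     # about 0.1436
sqrt(sigma2.hat)                # about 0.379, the "Residual standard error" in summary(m1)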
If in addition to the assumptions made previously the ei are drawn from
a normal distribution, then the residual mean square will be distributed
as a multiple of a chi-squared random variable with df = n − 2, or in
symbols,
σ̂² ∼ [σ²/(n − 2)] χ²(n − 2)
This is proved in more advanced books on linear models and is used
to obtain the distribution of test statistics and also to make confidence statements concerning σ 2. In addition, since the mean of a χ 2 random variable with
m df is m,
E(σ̂²|X) = [σ²/(n − 2)] E[χ²(n − 2)] = [σ²/(n − 2)] (n − 2) = σ²
This shows that σ̂ 2 is an unbiased estimate of σ2 if the errors are normally
distributed, although normality is not required for this result to hold. Expectations throughout this chapter condition on X to remind us that X is treated as
fixed and the expectation is over the conditional distribution of Y|X, or equivalently of the conditional distribution of e|X.
2.4 PROPERTIES OF LEAST SQUARES ESTIMATES
The ols estimates depend on data only through the statistics given in Table
2.1. This is both an advantage, making computing easy, and a disadvantage,
since any two data sets for which these are identical give the same fitted regression, even if a straight-line model is appropriate for one but not the other, as
we have seen in the example from Anscombe (1973) in Section 1.4. The estimates β̂0 and β̂1 can both be written as linear combinations of y1, . . . , yn.
Writing ci = ( xi − x )/SXX (see Appendix A.3), then
β̂1 = Σ(xi − x̄)(yi − ȳ)/SXX = Σ [(xi − x̄)/SXX] yi = Σ ci yi

and

β̂0 = ȳ − β̂1 x̄ = Σ (1/n − ci x̄) yi = Σ di yi

with di = (1/n − ci x̄). A fitted value ŷi = β̂0 + β̂1 xi is equal to Σ (di + ci xi) yi, also
a linear combination of the yi.
The fitted value at x = x̄ is

Ê(Y|X = x̄) = ȳ − β̂1 x̄ + β̂1 x̄ = ȳ

so the fitted line passes through the point (x̄, ȳ), intuitively the center of the data. Finally, as long as the mean function includes an intercept, Σ êi = 0. Mean functions without an intercept may have Σ êi ≠ 0.
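Both facts are easy to check numerically with the Forbes fit m1 from above:

sum(residuals(m1))                                        # essentially 0, up to rounding error
predict(m1, newdata = data.frame(bp = mean(Forbes$bp)))   # equals mean(Forbes$lpres)
mean(Forbes$lpres)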
Since the estimates β̂ 0 and β̂1 depend on the random eis, the estimates are
also random variables. If all the ei have 0 mean and the mean function is
correct, then, as shown in Appendix A.4, the least squares estimates are
unbiased,
E(β̂0|X) = β0    E(β̂1|X) = β1
The variances of the estimators, assuming Var(ei|X) = σ 2, i = 1, . . . , n, and
Cov(ei, ej|X) = 0, i ≠ j, are from Appendix A.4,
Var(β̂1|X) = σ² (1/SXX)

Var(β̂0|X) = σ² (1/n + x̄²/SXX)    (2.11)
From (2.5), β̂0 depends on β̂1 through β̂0 = ȳ − β̂1 x̄, and so it is no surprise that the estimates are correlated, with

Cov(β̂0, β̂1|X) = Cov(ȳ − β̂1 x̄, β̂1|X)
             = Cov(ȳ, β̂1|X) − x̄ Var(β̂1|X)
             = −σ² x̄/SXX    (2.12)
The estimated slope and intercept are generally correlated unless the predictor is centered to have x̄ = 0 (Problem 2.8). The correlation between the intercept and slope estimates is

ρ(β̂0, β̂1|X) = −x̄ / (SXX/n + x̄²)^(1/2)

The correlation will be close to plus or minus 1 if the variation in the predictor reflected in SXX is small relative to x̄.
The Gauss–Markov theorem provides an optimality result for ols
estimates. Among all estimates that are linear combinations of the ys and
unbiased, the ols estimates have the smallest variance. These estimates are
called the best linear unbiased estimates, or blue. If one believes the assumptions and is interested in using linear unbiased estimates, the ols estimates are
the ones to use.
The means, variances, and covariances of the estimated regression coefficients do not require a distributional assumption concerning the errors. Since
the estimates are linear combinations of the yi, and hence linear combinations
of the errors ei, the central limit theorem shows that the coefficient estimates
will be approximately normally distributed if the sample size is large enough.3
For smaller samples, if the errors e = y − E(y|X = x) are independent and
normally distributed, written in symbols as
ei|X ∼ NID(0, σ²),  i = 1, …, n
then the regression estimates β̂ 0 and β̂1 will have a joint normal distribution
with means, variances, and covariances as given before. When the errors are
normally distributed, the ols estimates can be justified using a completely
different argument, since they are then also maximum likelihood estimates, as
discussed in any mathematical statistics text, for example, Casella and Berger
(2001).
2.5 ESTIMATED VARIANCES
Estimates of Var(β̂0|X) and Var(β̂1|X) are obtained by substituting σ̂² for σ² in (2.11). We use the symbol V̂ar( ) for an estimated variance. Thus

V̂ar(β̂1|X) = σ̂² (1/SXX)

V̂ar(β̂0|X) = σ̂² (1/n + x̄²/SXX)
The square root of an estimated variance is called a standard error, for which
we use the symbol se( ). The use of this notation is illustrated by
se(β̂1|X) = [V̂ar(β̂1|X)]^(1/2)
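In R the standard errors appear in the coefficient table of summary(); a sketch using m1 and the quantities xbar and SXX computed earlier:

sigma.hat <- sqrt(sum(residuals(m1)^2) / (nrow(Forbes) - 2))
se.b1 <- sigma.hat / sqrt(SXX)                           # about 0.0165
se.b0 <- sigma.hat * sqrt(1/nrow(Forbes) + xbar^2/SXX)   # about 3.340
summary(m1)$coefficients[, "Std. Error"]                 # the same two values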
The terms standard error and standard deviation are sometimes used interchangeably. In this book, an estimated standard deviation always refers to the
variability between values of an observable random variable like the response
yi or an unobservable random variable like the errors ei. The term standard error will always refer to the square root of the estimated variance of a statistic like a mean ȳ, or a regression coefficient β̂1.

3 The main requirement for all estimates to be normally distributed in large samples is that max_i [(xi − x̄)²/SXX] must get close to 0 as the sample size increases (Huber and Ronchetti, 2009, Proposition 7.1).
2.6 CONFIDENCE INTERVALS AND t-TESTS
Estimates of regression coefficients and fitted values are all subject to uncertainty, and assessing the amount of uncertainty is an important part of most
analyses. Confidence intervals result in interval estimates, while tests provide
methodology for making decisions concerning the value of a parameter or
fitted value.
When the errors are NID(0, σ 2), parameter estimates, fitted values, and
predictions will be normally distributed because all of these are linear combinations of the yi and hence of the ei. Confidence intervals and tests can be
based on a t-distribution, which is the appropriate distribution with normal
estimates but using σ̂ 2 to estimate the unknown variance σ 2. There are many
t-distributions, indexed by the number of df associated with σ̂. Suppose we let t(α/2, d) be the value that cuts off α/2 × 100% in the upper tail of the t-distribution with d df. These values can be computed in most statistical packages or spreadsheet software.4

4 Readily available functions include tinv in Microsoft Excel, and the function pt in R. Tables of the t distributions can be easily found by googling t table.
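For example, the multipliers used below can be computed with qt(), one of the useful R commands listed at the top of this document:

qt(0.95, df = 15)    # t(0.05, 15) = 1.753, multiplier for a 90% interval
qt(0.975, df = 15)   # t(0.025, 15) = 2.131, multiplier for a 95% interval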
2.6.1 The Intercept
The intercept is used to illustrate the general form of confidence intervals for
normally distributed estimates. The standard error of the intercept is
se(β̂0|X) = σ̂ (1/n + x̄²/SXX)^(1/2). Hence, a (1 − α) × 100% confidence interval for
the intercept is the set of points β0 in the interval
β̂0 − t(α/2, n − 2) se(β̂0|X) ≤ β0 ≤ β̂0 + t(α/2, n − 2) se(β̂0|X)

For Forbes's data, se(β̂0|X) = 0.379 (1/17 + (202.953)²/530.7824)^(1/2) = 3.340. For a 90% confidence interval, t(0.05, 15) = 1.753, and the interval is

−42.138 − 1.753(3.340) ≤ β0 ≤ −42.138 + 1.753(3.340)
−47.99 ≤ β0 ≤ −36.28
Ninety percent of such intervals will include the true value.
A hypothesis test of

NH: β0 = β0*,  β1 arbitrary
AH: β0 ≠ β0*,  β1 arbitrary

is obtained by computing the t-statistic

t = (β̂0 − β0*)/se(β̂0|X)    (2.13)

and referring this ratio to the t-distribution with df = n − 2, the number of df in the estimate of σ². For example, in Forbes's data, consider testing the NH β0 = −35 against the alternative that β0 ≠ −35. The statistic is

t = (−42.138 − (−35))/3.34 = −2.137

Since AH is two-sided, the p-value corresponds to the probability that a t(15) variable is less than −2.137 or greater than +2.137, which gives a p-value that rounds to 0.05, providing some evidence against NH. This hypothesis test for these data is not one that would occur to most investigators and is used only as an illustration.
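A sketch of this interval and test in R, reusing m1 and se.b0 from the earlier sketches (confint() is the built-in shortcut):

b0.hat <- coef(m1)[1]
b0.hat + c(-1, 1) * qt(0.95, df = 15) * se.b0   # 90% interval, about (-47.99, -36.28)
confint(m1, level = 0.90)                       # same interval in the "(Intercept)" row
t0 <- (b0.hat - (-35)) / se.b0                  # about -2.137
2 * pt(-abs(t0), df = 15)                       # two-sided p-value, about 0.05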
2.6.2 Slope
A 95% confidence interval for the slope, or for any of the partial slopes in
multiple regression, is the set of β1 such that
β̂1 − t(α/2, df) se(β̂1|X) ≤ β1 ≤ β̂1 + t(α/2, df) se(β̂1|X)    (2.14)

For simple regression, df = n − 2 and se(β̂1|X) = σ̂/(SXX)^(1/2). For Forbes's data, df = 15, se(β̂1|X) = 0.0165, and

0.895 − 2.131(0.0165) ≤ β1 ≤ 0.895 + 2.131(0.0165)
0.86 ≤ β1 ≤ 0.93
As an example of a test for slope equal to 0, consider the Ft. Collins
snowfall data in Section 1.1. One can show (Problem 2.5) that β̂1 = 0.203 and se(β̂1|X) = 0.131. The test of interest is of

NH: β1 = 0
AH: β1 ≠ 0    (2.15)

and t = (0.203 − 0)/0.131 = 1.553. To get a significance level for this test, compare
t with the t(91) distribution; the two-sided p-value is 0.124, suggesting no
evidence against the NH that Early and Late season snowfalls are
independent.
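The slope interval and test are obtained the same way in R; a sketch for the Forbes fit (the Ft. Collins test would be computed analogously once that data set is loaded):

summary(m1)$coefficients["bp", ]   # estimate, std. error, t value, and p-value for the slope
confint(m1, "bp", level = 0.95)    # 95% interval, about (0.86, 0.93)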
2.6.3 Prediction
The estimated mean function can be used to obtain values of the response for
given values of the predictor. The two important variants of this problem are
prediction and estimation of fitted values. Since prediction is more important,
we discuss it first.
In prediction we have a new case, possibly a future value, not one used to
estimate parameters, with observed value of the predictor x*. We would like
to know the value y*, the corresponding response, but it has not yet been
observed. If we assume that the data used to estimate the mean function are
relevant to the new case, then the model fitted to the observed data can be
used to predict for the new case. In the heights example, we would probably
be willing to apply the fitted mean function to mother–daughter pairs alive in
England at the end of the nineteenth century. Whether the prediction would
be reasonable for mother–daughter pairs in other countries or in other time
periods is much less clear. In Forbes’s problem, we would probably be willing
to apply the results for altitudes in the range he studied. Given this additional
assumption, a point prediction of y*, say ỹ*, is just
ỹ* = β̂0 + β̂1 x*
ỹ* predicts the as yet unobserved y*. Assuming the model is correct, then the
true value of y* is
y* = β 0 + β1 x* + e*
where e* is the random error attached to the future value, presumably with
variance σ 2. Thus, even if β0 and β1 were known exactly, predictions would not
match true values perfectly, but would be off by a random amount with standard deviation σ. In the more usual case where the coefficients are estimated,
the prediction error variability will have a second component that arises from
the uncertainty in the estimates of the coefficients. Combining these two
sources of variation and using Appendix A.4,
Var(ỹ*|x*) = σ² + σ² (1/n + (x* − x̄)²/SXX)    (2.16)
The first σ 2 on the right of (2.16) corresponds to the variability due to e*, and
the remaining term is the error for estimating coefficients. If x* is similar
to the xi used to estimate the coefficients, then the second term will generally
be much smaller than the first term. If x* is very different from the xi used in
estimation, the second term can dominate.
Taking square roots of both sides of (2.16) and estimating σ 2 by σ̂ 2 , we get
the standard error of prediction (sepred) at x*,
sepred(ỹ*|x*) = σ̂ (1 + 1/n + (x* − x̄)²/SXX)^(1/2)    (2.17)
A prediction interval uses multipliers from the t-distribution with df equal to
the df in estimating σ 2. For prediction of 100 × log(pres) for a location with
x* = 200, the point prediction is ỹ* = −42.138 + 0.895(200) = 136.961, with standard error of prediction

sepred(ỹ*|x* = 200) = 0.379 (1 + 1/17 + (200 − 202.9529)²/530.7824)^(1/2) = 0.393
Thus, a 99% predictive interval is the set of all y* such that
136.961 − 2.95(0.393) ≤ y* ≤ 136.961 + 2.95(0.393)
135.803 ≤ y* ≤ 138.119
More interesting would be a 99% prediction interval for pres, rather than for
100 × log(pres). A point prediction is just 10^(136.961/100) = 23.421 inches of Mercury. The prediction interval is found by exponentiating the end points of the interval in log scale. Dividing by 100 and then exponentiating, we get

10^(135.803/100) ≤ pres ≤ 10^(138.119/100)
22.805 ≤ pres ≤ 24.054
In the original scale, the prediction interval is not symmetric about the point
estimate.
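In R, predict() with interval = "prediction" reproduces these calculations; a sketch for x* = 200 using m1:

new <- data.frame(bp = 200)
pi99 <- predict(m1, newdata = new, interval = "prediction", level = 0.99)
pi99            # fit about 136.96, bounds about 135.80 and 138.12 (100*log10 scale)
10^(pi99/100)   # back-transformed to inches of mercury, roughly 22.8 to 24.1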
For the heights data, Figure 2.3 is a plot of the estimated mean function
given by the dashed line for the regression of dheight on mheight along
with curves at
β̂0 + β̂1 x* ± t(0.025, 1373) sepred(dheight*|mheight*)
The vertical distance between the two solid curves for any value of
mheight corresponds to a 95% prediction interval for daughter’s height given
mother’s height. Although not obvious from the graph because of the very
large sample size, the interval is wider for mothers who were either relatively
tall or short, as the curves bend outward from the narrowest point at the sample mean of mheight.

Figure 2.3 Prediction intervals (solid lines) and intervals for fitted values (dashed lines) for the heights data.
2.6.4 Fitted Values
In rare problems, one may be interested in obtaining an estimate of E(Y|X = x*).
In the heights data, this is like asking for the population mean height of all
daughters of mothers with a particular height. This quantity is estimated by
the fitted value ŷ = β̂0 + β̂1 x*, and its standard error is

sefit(ŷ|x*) = σ̂ (1/n + (x* − x̄)²/SXX)^(1/2)
To obtain confidence intervals, it is more usual to compute a simultaneous
interval for all possible values of x. This is the same as first computing a joint
confidence region for β0 and β1, and from these, computing the set of all possible mean functions with slope and intercept in the joint confidence set. The
confidence region for the mean function is the set of all y such that
(β̂0 + β̂1 x) − sefit(ŷ|x)[2F(α; 2, n − 2)]^(1/2) ≤ y ≤ (β̂0 + β̂1 x) + sefit(ŷ|x)[2F(α; 2, n − 2)]^(1/2)
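A sketch of the pointwise version of these intervals in R (predict() with interval = "confidence" gives intervals for fitted values; a simultaneous band as in the display above would instead use the [2F(α; 2, n − 2)]^(1/2) multiplier):

m2 <- lm(dheight ~ mheight, data = Heights)
new <- data.frame(mheight = c(58, 64, 68))
predict(m2, newdata = new, interval = "confidence", level = 0.95)   # intervals for E(dheight | mheight)
predict(m2, newdata = new, interval = "prediction", level = 0.95)   # wider intervals, as in Figure 2.3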
