
MTH 542/642 Fall 2022

Homework # 2

Due: Friday, September 23, 11:59 pm.

Homework Policy

• This assignment is worth 25 points. If homework is turned in late, then 4 points are deducted for each day of delay.

• Please give all the details of your work. Clearly explain your results and the notation you used, for full credit.

• Your work must be neatly presented. As many as 5 points may be deducted otherwise.

• You are welcome to type the solutions for all the problems, but handwriting is also permissible and strongly encouraged for mathematical proofs.

• The R code must be presented for each question along with the answers for that question, not separately at the end of the homework. You may cut and paste from R into Word. You may use Courier New when pasting from R to Word.

• All warning messages and errors must be erased when pasting the R code into Word. Otherwise, up to 3 points could be deducted for each warning message not erased.

• Scan all the pages of your work and make one pdf file, then upload it on Blackboard. Only pdf files are accepted.

Some useful R commands:

pnorm(x) returns the cumulative distribution function for the standard normal distribution, P(Z ≤ x)

pt(x, df) returns the cumulative distribution function for the t distribution with df degrees of freedom, P(t ≤ x)

qnorm(p) returns the p-quantile for the standard normal distribution – a value x so that P(X ≤ x) = p

qt(p, df) returns the p-quantile for the t distribution with df degrees of freedom
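As a quick illustration of these four commands (the particular values 1.96, 2.131, 0.975, and 15 df are ours, chosen because they recur later in the chapter):

```r
# Distribution and quantile functions for the normal and t distributions
pnorm(1.96)         # P(Z <= 1.96), about 0.975
pt(2.131, df = 15)  # P(t <= 2.131) with 15 df, about 0.975
qnorm(0.975)        # 0.975-quantile of the standard normal, about 1.96
qt(0.975, df = 15)  # 0.975-quantile of t with 15 df, about 2.131
```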

Problem 1. Consider the data in Heights described on page 2 of the textbook. The two variables are

mother’s height mheight and daughter’s height dheight.

Part I.

a) Obtain side-by-side boxplots for the data in Heights.

First use boxplot(Heights) and see what R gives you. Copy and paste in your word file. Then comment on

the appearance of the boxplots and compare them. Next, refine the appearance of the boxplots: label each

boxplot, color them, label the vertical axis, give a title, for example “Inheritance of Height”.

R hints: to give a title use main; to give names to each boxplot use names = c("name1", "name2"); to give a name to the vertical axis use ylab; to color each boxplot use col = c("color1", "color2").

Find the exact values for the mean, quartiles, minimum and maximum values.
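A minimal sketch of the refined call (the colors and labels are illustrative choices; the Heights data ship with the textbook's alr4 package, and a simulated stand-in is used here only so the sketch runs on its own):

```r
# Stand-in for the Heights data frame (with alr4 installed, use
# data(Heights, package = "alr4") instead of simulating)
set.seed(1)
Heights <- data.frame(mheight = rnorm(200, 62.5, 2.3),
                      dheight = rnorm(200, 63.8, 2.6))
boxplot(Heights,
        names = c("Mothers", "Daughters"),  # label each boxplot
        col = c("lightblue", "pink"),       # color each boxplot
        ylab = "Height (inches)",
        main = "Inheritance of Height")
summary(Heights)  # exact mean, quartiles, minimum, and maximum
```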

b) Obtain density histograms for the mother's heights and also for the daughter's heights.

Label each of the axes, give a title, use different cutoff points (use breaks within hist) than the ones given by default, and mark the mean on each histogram (here is an example:
points(mean(dheight), 0, pch = 24, bg = "black")).
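A sketch for one of the histograms (again using a simulated stand-in for dheight so the example is self-contained; with the real data use Heights$dheight):

```r
# Density histogram with custom breaks and the mean marked
set.seed(1)
dheight <- rnorm(300, mean = 63.8, sd = 2.6)  # stand-in data
hist(dheight, freq = FALSE, breaks = 15,      # density scale, custom cutoffs
     xlab = "Daughter's height (inches)",
     main = "Daughters' Heights")
points(mean(dheight), 0, pch = 24, bg = "black")  # mark the mean
```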


c) What percentage of the mother’s heights is within one standard deviation from the mean?

What percentage of the mother’s heights is within two standard deviations from the mean?

Answer the same questions for the daughter’s heights data.

Make sure you specify the mean, standard deviation, the two values that are one standard deviation from the mean, etc.,

for both mothers and daughters.

R Hint:

An example: to find the percentage of the mother's heights which are between 58 and 62 inches tall, create a vector which contains all the mother's heights in the data between 58 and 62, then find the proportion of such heights.
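A sketch of that hint (the vector name y1 is illustrative; mheight is simulated here so the example runs on its own, whereas with the real data you would use Heights$mheight):

```r
# Proportion of mother's heights between 58 and 62 inches
set.seed(1)
mheight <- rnorm(500, mean = 62.5, sd = 2.3)  # stand-in data
y1 <- mheight[mheight > 58 & mheight < 62]    # heights between 58 and 62
length(y1) / length(mheight)                  # proportion of such heights
```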

Since σ² > 0, the observed value of the ith response yi will

typically not equal its expected value E(Y|X = xi). To account for this difference between the observed data and the expected value, statisticians have

invented a quantity called a statistical error, or ei, for case i defined implicitly

by the equation yi = E(Y|X = xi) + ei or explicitly by ei = yi − E(Y|X = xi). The

errors ei depend on unknown parameters in the mean function and so are not

observable quantities. They are random variables and correspond to the vertical distance between the point yi and the mean function E(Y|X = xi). In the

heights data, Section 1.1, the errors are the differences between the heights of

particular daughters and the average height of all daughters with mothers of

a given fixed height.

We make two important assumptions concerning the errors. First, we assume

that E(ei|xi) = 0, so if we could draw a scatterplot of the ei versus the xi, we

would have a null scatterplot, with no patterns. The second assumption is that

the errors are all independent, meaning that the value of the error for one case

Applied Linear Regression, Fourth Edition. Sanford Weisberg.

© 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc.

21

22

chapter 2 simple linear regression

Figure 2.1 Graph of a straight line E(Y|X = x) = β0 + β1x. The intercept parameter β0 is the expected value of the response when the predictor x = 0. The slope parameter β1 gives the change in the expected value when the predictor x increases by 1 unit.

gives no information about the value of the error for another case. This is likely

to be true in the examples in Chapter 1, although this assumption will not hold

in all problems.

Errors are often assumed to be normally distributed, but normality is much

stronger than we need. In this book, the normality assumption is used primarily to obtain tests and confidence statements with small samples. If the errors

are thought to follow some different distribution, such as the Poisson or the

binomial, other methods besides ols may be more appropriate; we return to

this topic in Chapter 12.

2.1 ORDINARY LEAST SQUARES ESTIMATION

Many methods have been suggested for obtaining estimates of parameters in

a model. The method discussed here is called ordinary least squares, or ols, in

which parameter estimates are chosen to minimize a quantity called the residual sum of squares. A formal development of the least squares estimates is

given in Appendix A.3.

Parameters are unknown quantities that characterize a model. Estimates of

parameters are computable functions of data and are therefore statistics. To

keep this distinction clear, parameters are denoted by Greek letters like α, β,

γ, and σ, and estimates of parameters are denoted by putting a “hat” over the

corresponding Greek letter. For example, β̂1 (read “beta one hat”) is the estimator of β1, and σ̂ 2 is the estimator of σ2. The fitted value for case i is given

by Ê(Y|X = xi), for which we use the shorthand notation ŷi,

ŷi = Ê(Y|X = xi) = β̂0 + β̂1xi    (2.2)


Table 2.1 Definitions of Symbolsᵃ

Quantity   Definition                            Description
x̄          Σxi/n                                 Sample average of x
ȳ          Σyi/n                                 Sample average of y
SXX        Σ(xi − x̄)² = Σ(xi − x̄)xi              Sum of squares for the xs
SD²x       SXX/(n − 1)                           Sample variance of the xs
SDx        √(SXX/(n − 1))                        Sample standard deviation of the xs
SYY        Σ(yi − ȳ)² = Σ(yi − ȳ)yi              Sum of squares for the ys
SD²y       SYY/(n − 1)                           Sample variance of the ys
SDy        √(SYY/(n − 1))                        Sample standard deviation of the ys
SXY        Σ(xi − x̄)(yi − ȳ) = Σ(xi − x̄)yi       Sum of cross-products
sxy        SXY/(n − 1)                           Sample covariance
rxy        sxy/(SDxSDy)                          Sample correlation

ᵃ In each equation, the symbol Σ means to add over all n values or pairs of values in the data.
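As a sketch, the quantities in Table 2.1 can be computed directly in R and checked against the built-in functions (the data here are simulated for illustration):

```r
# Sums of squares, cross-products, and the derived summary statistics
set.seed(2)
x <- rnorm(20); y <- 1 + 2 * x + rnorm(20)
n   <- length(x)
SXX <- sum((x - mean(x))^2)                 # sum of squares for the xs
SYY <- sum((y - mean(y))^2)                 # sum of squares for the ys
SXY <- sum((x - mean(x)) * (y - mean(y)))   # sum of cross-products
sxy <- SXY / (n - 1)                        # sample covariance
rxy <- sxy / (sd(x) * sd(y))                # sample correlation
c(rxy, cor(x, y))                           # agrees with the built-in
```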

Although the ei are random variables and not parameters, we shall use the

same hat notation to specify the residuals: the residual for the ith case, denoted

êi, is given by the equation

êi = yi − Ê(Y|X = xi) = yi − ŷi = yi − (β̂0 + β̂1xi),  i = 1, …, n    (2.3)

which should be compared with the equation for the statistical errors,

ei = yi − (β0 + β1xi),  i = 1, …, n

The computations that are needed for least squares for simple regression

depend only on averages of the variables and their sums of squares and sums

of cross-products. Definitions of these quantities are given in Table 2.1. Sums

of squares and cross-products are centered by subtracting the average from

each of the values before squaring or taking cross-products. Appropriate alternative formulas for computing the corrected sums of squares and cross products from uncorrected sums of squares and cross-products that are often given

in elementary textbooks are useful for mathematical proofs, but they can be

highly inaccurate when used on a computer and should be avoided.

Table 2.1 also lists definitions for the usual univariate and bivariate summary

statistics, the sample averages ( x , y ), sample variances (SD2x, SD2y ) , which are

the squares of the sample standard deviations, and the estimated covariance

and correlation (sxy, rxy).1 The “hat” rule described earlier would suggest

that different symbols should be used for these quantities; for example, ρ̂ xy

might be more appropriate for the sample correlation if the population correlation is ρxy. This inconsistency is deliberate since these sample quantities

estimate population values only if the data used are a random sample from a

¹ See Appendix A.2.2 for the definitions of the corresponding population quantities.


population. The random sample condition is not required for regression calculations to make sense, and will often not hold in practice.

To illustrate computations, we will use Forbes’s data introduced in Section

1.1, for which n = 17. The data are given in the file Forbes. The response

given in the file is the base-ten logarithm of the atmospheric pressure,

lpres = 100 × log10(pres) rounded to two decimal digits, and the predictor

is the boiling point bp, rounded to the nearest 0.1°F. Neither multiplication by

100 nor the base of the logarithms has important effects on the analysis. Multiplication by 100 avoids using scientific notation for numbers we display in

the text, and changing the base of the logarithms merely multiplies the logarithms by a constant. For example, to convert from base-ten logarithms to

base-two logarithms, multiply by log(10)/log(2) = 3.3219. To convert natural

logarithms to base-two, multiply by 1.4427.
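The conversion constants quoted here are easy to verify:

```r
# Change-of-base multipliers for logarithms
log(10) / log(2)  # base-ten to base-two: 3.3219
1 / log(2)        # natural log to base-two: 1.4427
```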

Forbes’s data were collected at 17 selected locations, so the sample variance

of boiling points, SD2x = 33.17 , is not an estimate of any meaningful population

variance. Similarly, rxy depends as much on the method of sampling as it does

on the population value ρxy, should such a population value make sense. In the

heights example, Section 1.1, if the 1375 mother–daughter pairs can be viewed

as a sample from a population, then the sample correlation is an estimate of

a population correlation.

The usual sample statistics are often presented and used in place of the

corrected sums of squares and cross-products, so alternative formulas are

given using both sets of quantities.

2.2 LEAST SQUARES CRITERION

The criterion function for obtaining estimators is based on the residuals, which

are the vertical distances between the fitted line and the actual y-values, as

illustrated in Figure 2.2. The residuals reflect the inherent asymmetry in the

roles of the response and the predictor in regression problems.

The ols estimators are those values β0 and β1 that minimize the function²

RSS(β0, β1) = Σ [yi − (β0 + β1xi)]²    (2.4)

where the sum is over i = 1, …, n. When evaluated at (β̂0, β̂1), we call the quantity RSS(β̂0, β̂1) the residual sum of squares, or just RSS.

The least squares estimates can be derived in many ways, one of which is outlined in Appendix A.3. They are given by the expressions

β̂1 = SXY/SXX = rxy(SDy/SDx) = rxy(SYY/SXX)^(1/2)
β̂0 = ȳ − β̂1x̄    (2.5)

² We occasionally abuse notation by using the symbol for a fixed though unknown quantity like β0 or β1 as if it were a variable argument. Thus, for example, RSS(β0, β1) is a function of 2 variables to be evaluated as its arguments β0 and β1 vary.


Figure 2.2 A schematic plot for ols fitting; residuals are the signed lengths of the vertical lines. Each data point is indicated by a small circle. The solid line is the ols line. The vertical lines between the points and the solid line are the residuals. Points below the line have negative residuals, while points above the line have positive residuals. The true mean function, shown as a dashed line for these simulated data, is E(Y|X = x) = 0.7 + 0.8x.

The several forms for β̂1 are all equivalent.

We emphasize again that ols produces estimates of parameters but not the

actual values of the parameters. As a demonstration, the data in Figure 2.2

were created by setting the xi to be a random sample of 20 numbers from a

normal distribution with mean 2 and variance 1.5 and then computing

yi = 0.7 + 0.8xi + ei, where the errors were sampled from a normal distribution

with mean 0 and variance 1. The graph of the true mean function is shown in

Figure 2.2 as a dashed line, and it seems to match the data poorly compared

with ols, given by the solid line. Since ols minimizes (2.4), it will always fit at

least as well as, and generally better than, the true mean function.
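The simulation just described can be sketched as follows (the seed and the use of lm() are ours; the last line checks the point about RSS):

```r
# Data generated as for Figure 2.2: true mean function 0.7 + 0.8x
set.seed(3)
x <- rnorm(20, mean = 2, sd = sqrt(1.5))
y <- 0.7 + 0.8 * x + rnorm(20, mean = 0, sd = 1)
fit <- lm(y ~ x)   # ols fit
coef(fit)          # estimates of the parameters, not the true (0.7, 0.8)
# ols minimizes RSS, so it never fits worse than the true mean function:
sum(resid(fit)^2) <= sum((y - (0.7 + 0.8 * x))^2)
```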

Using Forbes’s data to illustrate computations, we will write x to be the

sample mean of bp and y to be the sample mean of lpres. The quantities

needed for computing the least squares estimators are

x = 202.9529 SXX = 530.7824 SXY = 475.3122

y = 139.6053 SYY = 427.7942

(2.6)

The quantity SYY, although not yet needed, is given for completeness. In the

rare instances that regression calculations are not done using statistical software, intermediate calculations such as these should be done as accurately as

possible, and rounding should be done only to final results. We will generally

display final results with two or three digits beyond the decimal point. Using

(2.6), we find

β̂1 = SXY/SXX = 0.895
β̂0 = ȳ − β̂1x̄ = −42.138
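A sketch checking these computations from the summaries in (2.6):

```r
# Least squares estimates for Forbes's data from the summary statistics
xbar <- 202.9529; ybar <- 139.6053
SXX <- 530.7824; SXY <- 475.3122
b1 <- SXY / SXX          # 0.895
b0 <- ybar - b1 * xbar   # -42.138
```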


The estimated intercept β̂0 = −42.138 is the estimated value of lpres

when bp = 0. Since the temperatures in the data are in the range from about

194°F to 212°F, this estimate does not have a useful physical interpretation.

The estimated slope of βˆ 1 = 0.895 is the change in lpres for a 1°F change in

bp.

The estimated line given by

Ê(lpres|bp) = −42.138 + 0.895 bp

was drawn in Figure 1.4a. The fit of this line to the data is excellent.

2.3 ESTIMATING THE VARIANCE σ²

Since the variance σ² is essentially the average squared size of the ei, we

should expect that its estimator σ̂ 2 is obtained by averaging the squared

residuals. Under the assumption that the errors are uncorrelated random

variables with 0 means and common variance σ 2, an unbiased estimate of

σ² is obtained by dividing RSS = Σê²i by its degrees of freedom (df), where

residual df = number of cases minus the number of parameters in the mean

function. For simple regression, residual df = n − 2, so the estimate of σ2 is

given by

σ̂² = RSS/(n − 2)    (2.7)

This quantity is called the residual mean square. In general, any sum of squares

divided by its df is called a mean square. The residual sum of squares can be

computed by squaring the residuals and adding them up. It can also be computed from the formula (Problem 2.18)

RSS = SYY − SXY²/SXX = SYY − β̂1²SXX    (2.8)

Using the summaries for Forbes’s data given at (2.6), we find

RSS = 427.794 − (475.3122)²/530.7824 = 2.1549    (2.9)

σ̂² = 2.1549/(17 − 2) = 0.1436    (2.10)

The square root of σ̂², σ̂ = √0.1436 = 0.379, is called the standard error of regression. It is in the same units as the response variable.
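A sketch of (2.8)–(2.10) with the Forbes summaries:

```r
# Residual sum of squares and the estimate of sigma^2 for Forbes's data
SXX <- 530.7824; SXY <- 475.3122; SYY <- 427.7942; n <- 17
RSS <- SYY - SXY^2 / SXX   # 2.1549, equation (2.8)
sigma2 <- RSS / (n - 2)    # 0.1436, equation (2.7)
sqrt(sigma2)               # 0.379, the standard error of regression
```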


If in addition to the assumptions made previously the ei are drawn from

a normal distribution, then the residual mean square will be distributed

as a multiple of a chi-squared random variable with df = n − 2, or in

symbols,

σ̂² ∼ (σ²/(n − 2)) χ²(n − 2)

This is proved in more advanced books on linear models and is used

to obtain the distribution of test statistics and also to make confidence statements concerning σ 2. In addition, since the mean of a χ 2 random variable with

m df is m,

E(σ̂²|X) = (σ²/(n − 2)) E[χ²(n − 2)] = (σ²/(n − 2))(n − 2) = σ²

This shows that σ̂ 2 is an unbiased estimate of σ2 if the errors are normally

distributed, although normality is not required for this result to hold. Expectations throughout this chapter condition on X to remind us that X is treated as

fixed and the expectation is over the conditional distribution of Y|X, or equivalently of the conditional distribution of e|X.

2.4 PROPERTIES OF LEAST SQUARES ESTIMATES

The ols estimates depend on data only through the statistics given in Table

2.1. This is both an advantage, making computing easy, and a disadvantage,

since any two data sets for which these are identical give the same fitted regression, even if a straight-line model is appropriate for one but not the other, as

we have seen in the example from Anscombe (1973) in Section 1.4. The estimates β̂0 and β̂1 can both be written as linear combinations of y1, . . . , yn.

Writing ci = (xi − x̄)/SXX (see Appendix A.3), then

β̂1 = Σ(xi − x̄)(yi − ȳ)/SXX = Σ[(xi − x̄)/SXX] yi = Σ ci yi

and

β̂0 = ȳ − β̂1x̄ = Σ(1/n − ci x̄) yi = Σ di yi

with di = (1/n − ci x̄). A fitted value ŷi = β̂0 + β̂1xi is equal to Σ(dj + cj xi) yj, also a linear combination of the yi.


The fitted value at x = x̄ is

Ê(Y|X = x̄) = ȳ − β̂1x̄ + β̂1x̄ = ȳ

so the fitted line passes through the point (x̄, ȳ), intuitively the center of the data. Finally, as long as the mean function includes an intercept, Σêi = 0. Mean functions without an intercept may have Σêi ≠ 0.

Since the estimates β̂ 0 and β̂1 depend on the random eis, the estimates are

also random variables. If all the ei have 0 mean and the mean function is

correct, then, as shown in Appendix A.4, the least squares estimates are

unbiased,

E(β̂0|X) = β0,  E(β̂1|X) = β1

The variances of the estimators, assuming Var(ei|X) = σ 2, i = 1, . . . , n, and

Cov(ei, ej|X) = 0, i ≠ j, are from Appendix A.4,

Var(β̂1|X) = σ²/SXX
Var(β̂0|X) = σ²(1/n + x̄²/SXX)    (2.11)

From (2.5), β̂0 depends on β̂1 through β̂0 = ȳ − β̂1x̄, and so it is no surprise that the estimates are correlated, with

Cov(β̂0, β̂1|X) = Cov(ȳ − β̂1x̄, β̂1|X)
             = Cov(ȳ, β̂1|X) − x̄ Var(β̂1|X)
             = −σ² x̄/SXX    (2.12)

since Cov(ȳ, β̂1|X) = 0 (because Σci = 0).

The estimated slope and intercept are generally correlated unless the predictor is centered to have x̄ = 0 (Problem 2.8). The correlation between the intercept and slope estimates is

ρ(β̂0, β̂1|X) = −x̄ / (SXX/n + x̄²)^(1/2)

The correlation will be close to plus or minus 1 if the variation in the predictor, reflected in SXX, is small relative to x̄.

The Gauss–Markov theorem provides an optimality result for ols

estimates. Among all estimates that are linear combinations of the ys and


unbiased, the ols estimates have the smallest variance. These estimates are

called the best linear unbiased estimates, or blue. If one believes the assumptions and is interested in using linear unbiased estimates, the ols estimates are

the ones to use.

The means, variances, and covariances of the estimated regression coefficients do not require a distributional assumption concerning the errors. Since

the estimates are linear combinations of the yi, and hence linear combinations

of the errors ei, the central limit theorem shows that the coefficient estimates

will be approximately normally distributed if the sample size is large enough.3

For smaller samples, if the errors e = y − E(y|X = x) are independent and

normally distributed, written in symbols as

ei|X ∼ NID(0, σ²),  i = 1, …, n

then the regression estimates β̂ 0 and β̂1 will have a joint normal distribution

with means, variances, and covariances as given before. When the errors are

normally distributed, the ols estimates can be justified using a completely

different argument, since they are then also maximum likelihood estimates, as

discussed in any mathematical statistics text, for example, Casella and Berger

(2001).

2.5 ESTIMATED VARIANCES

Estimates of Var(β̂0|X) and Var(β̂1|X) are obtained by substituting σ̂² for σ² in (2.11). We use the symbol V̂ar( ) for an estimated variance. Thus

V̂ar(β̂1|X) = σ̂²/SXX
V̂ar(β̂0|X) = σ̂²(1/n + x̄²/SXX)

The square root of an estimated variance is called a standard error, for which

we use the symbol se( ). The use of this notation is illustrated by

se(β̂1|X) = [V̂ar(β̂1|X)]^(1/2)
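A sketch verifying these formulas against R's lm() on simulated data (the data and seed are ours):

```r
# Hand-computed standard errors match the lm() summary table
set.seed(4)
x <- rnorm(30); y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x); n <- length(x)
sigma2 <- sum(resid(fit)^2) / (n - 2)             # estimate of sigma^2
SXX <- sum((x - mean(x))^2)
se_b1 <- sqrt(sigma2 / SXX)                        # se of the slope
se_b0 <- sqrt(sigma2 * (1 / n + mean(x)^2 / SXX))  # se of the intercept
summary(fit)$coefficients[, "Std. Error"]          # same two numbers
```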

The terms standard error and standard deviation are sometimes used interchangeably. In this book, an estimated standard deviation always refers to the

variability between values of an observable random variable like the response

³ The main requirement for all estimates to be normally distributed in large samples is that maxi[(xi − x̄)²/SXX] must get close to 0 as the sample size increases (Huber and Ronchetti, 2009, Proposition 7.1).


yi or an unobservable random variable like the errors ei. The term standard

error will always refer to the square root of the estimated variance of a statistic

like a mean y, or a regression coefficient β̂1 .

2.6 CONFIDENCE INTERVALS AND t-TESTS

Estimates of regression coefficients and fitted values are all subject to uncertainty, and assessing the amount of uncertainty is an important part of most

analyses. Confidence intervals result in interval estimates, while tests provide

methodology for making decisions concerning the value of a parameter or

fitted value.

When the errors are NID(0, σ 2), parameter estimates, fitted values, and

predictions will be normally distributed because all of these are linear combinations of the yi and hence of the ei. Confidence intervals and tests can be

based on a t-distribution, which is the appropriate distribution with normal

estimates but using σ̂ 2 to estimate the unknown variance σ 2. There are many

t-distributions, indexed by the number of df associated with σ̂ . Suppose we let

t(α/2, d) be the value that cuts off α/2 × 100% in the upper tail of the

t-distribution with d df. These values can be computed in most statistical packages or spreadsheet software.4

2.6.1 The Intercept

The intercept is used to illustrate the general form of confidence intervals for

normally distributed estimates. The standard error of the intercept is

se(β̂0|X) = σ̂(1/n + x̄²/SXX)^(1/2). Hence, a (1 − α) × 100% confidence interval for the intercept is the set of points β0 in the interval

β̂0 − t(α/2, n − 2) se(β̂0|X) ≤ β0 ≤ β̂0 + t(α/2, n − 2) se(β̂0|X)

For Forbes's data, se(β̂0|X) = 0.379[1/17 + (202.953)²/530.7824]^(1/2) = 3.340. For a 90% confidence interval, t(0.05, 15) = 1.753, and the interval is

−42.138 − 1.753(3.340) ≤ β0 ≤ −42.138 + 1.753(3.340)
−47.99 ≤ β0 ≤ −36.28
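A sketch reproducing this interval in R from the quantities already computed:

```r
# 90% confidence interval for the intercept in Forbes's data
b0 <- -42.138; sigma <- 0.379
n <- 17; xbar <- 202.9529; SXX <- 530.7824
se_b0 <- sigma * sqrt(1 / n + xbar^2 / SXX)  # 3.340
tcrit <- qt(0.95, df = n - 2)                # t(0.05, 15) = 1.753
b0 + c(-1, 1) * tcrit * se_b0                # (-47.99, -36.28)
```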

Ninety percent of such intervals will include the true value.

A hypothesis test of

NH: β 0 = β 0*, β1 arbitrary

AH: β 0 ≠ β 0*, β1 arbitrary

⁴ Readily available functions include tinv in Microsoft Excel and the function qt in R. Tables of the t distributions can be easily found by googling t table.


is obtained by computing the t-statistic

t = (β̂0 − β0*) / se(β̂0|X)    (2.13)

and referring this ratio to the t-distribution with df = n − 2, the number of df

in the estimate of σ 2. For example, in Forbes’s data, consider testing the NH

β0 = −35 against the alternative that β0 ≠ −35. The statistic is

t = (−42.138 − (−35)) / 3.34 = −2.137

Since AH is two-sided, the p-value corresponds to the probability that a t(15)

variable is less than −2.137 or greater than +2.137, which gives a p-value that

rounds to 0.05, providing some evidence against NH. This hypothesis test for

these data is not one that would occur to most investigators and is used only

as an illustration.
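A sketch of this test in R:

```r
# Two-sided t-test of NH: beta0 = -35 for Forbes's data
tstat <- (-42.138 - (-35)) / 3.34   # -2.137
2 * pt(-abs(tstat), df = 15)        # two-sided p-value, about 0.05
```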

2.6.2 Slope

A 95% confidence interval for the slope, or for any of the partial slopes in

multiple regression, is the set of β1 such that

β̂1 − t(α/2, df) se(β̂1|X) ≤ β1 ≤ β̂1 + t(α/2, df) se(β̂1|X)    (2.14)

For simple regression, df = n − 2 and se(β̂1|X) = σ̂/√SXX. For Forbes's data, df = 15, se(β̂1|X) = 0.0165, and

0.895 − 2.131(0.0165) ≤ β1 ≤ 0.895 + 2.131(0.0165)
0.86 ≤ β1 ≤ 0.93
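A sketch reproducing the slope interval:

```r
# 95% confidence interval for the slope in Forbes's data
b1 <- 0.895
se_b1 <- 0.379 / sqrt(530.7824)  # sigma-hat / sqrt(SXX), about 0.0165
tcrit <- qt(0.975, df = 15)      # about 2.131
b1 + c(-1, 1) * tcrit * se_b1    # (0.86, 0.93)
```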

As an example of a test for slope equal to 0, consider the Ft. Collins

snowfall data in Section 1.1. One can show (Problem 2.5) that β̂1 = 0.203 and se(β̂1|X) = 0.131. The test of interest is of

NH: β1 = 0
AH: β1 ≠ 0    (2.15)

and t = (0.203 − 0)/0.131 = 1.553. To get a significance level for this test, compare

t with the t(91) distribution; the two-sided p-value is 0.124, suggesting no

evidence against the NH that Early and Late season snowfalls are

independent.

2.6.3 Prediction

The estimated mean function can be used to obtain values of the response for

given values of the predictor. The two important variants of this problem are

prediction and estimation of fitted values. Since prediction is more important,

we discuss it first.

In prediction we have a new case, possibly a future value, not one used to

estimate parameters, with observed value of the predictor x*. We would like

to know the value y*, the corresponding response, but it has not yet been

observed. If we assume that the data used to estimate the mean function are

relevant to the new case, then the model fitted to the observed data can be

used to predict for the new case. In the heights example, we would probably

be willing to apply the fitted mean function to mother–daughter pairs alive in

England at the end of the nineteenth century. Whether the prediction would

be reasonable for mother–daughter pairs in other countries or in other time

periods is much less clear. In Forbes’s problem, we would probably be willing

to apply the results for altitudes in the range he studied. Given this additional

assumption, a point prediction of y*, say ỹ*, is just

ỹ* = β̂0 + β̂1x*

ỹ* predicts the as yet unobserved y*. Assuming the model is correct, then the

true value of y* is

y* = β 0 + β1 x* + e*

where e* is the random error attached to the future value, presumably with

variance σ 2. Thus, even if β0 and β1 were known exactly, predictions would not

match true values perfectly, but would be off by a random amount with standard deviation σ. In the more usual case where the coefficients are estimated,

the prediction error variability will have a second component that arises from

the uncertainty in the estimates of the coefficients. Combining these two

sources of variation and using Appendix A.4,

Var(y*|x*) = σ² + σ²[1/n + (x* − x̄)²/SXX]    (2.16)

The first σ 2 on the right of (2.16) corresponds to the variability due to e*, and

the remaining term is the error for estimating coefficients. If x* is similar

to the xi used to estimate the coefficients, then the second term will generally

be much smaller than the first term. If x* is very different from the xi used in

estimation, the second term can dominate.

Taking square roots of both sides of (2.16) and estimating σ 2 by σ̂ 2 , we get

the standard error of prediction (sepred) at x*,


sepred(y*|x*) = σ̂[1 + 1/n + (x* − x̄)²/SXX]^(1/2)    (2.17)

A prediction interval uses multipliers from the t-distribution with df equal to

the df in estimating σ 2. For prediction of 100 × log(pres) for a location with

x* = 200, the point prediction is ỹ* = −42.138 + 0.895(200) = 136.961, with standard error of prediction

sepred(y*|x* = 200) = 0.379[1 + 1/17 + (200 − 202.9529)²/530.7824]^(1/2) = 0.393

Thus, a 99% predictive interval is the set of all y* such that

136.961 − 2.95(0.393) ≤ y* ≤ 136.961 + 2.95(0.393)

135.803 ≤ y* ≤ 138.119
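A sketch of the prediction interval computation (the unrounded slope SXY/SXX is used, as in the text's point prediction):

```r
# 99% prediction interval at x* = 200 for Forbes's data
b0 <- -42.138; sigma <- 0.379
n <- 17; xbar <- 202.9529; SXX <- 530.7824; SXY <- 475.3122
b1 <- SXY / SXX                                            # unrounded slope
xstar <- 200
pred <- b0 + b1 * xstar                                    # 136.961
sepred <- sigma * sqrt(1 + 1/n + (xstar - xbar)^2 / SXX)   # 0.393
tcrit <- qt(0.995, df = n - 2)                             # about 2.95
pred + c(-1, 1) * tcrit * sepred                           # (135.80, 138.12)
```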

More interesting would be a 99% prediction interval for pres, rather than for 100 × log(pres). A point prediction is just 10^(136.961/100) = 23.421 inches of Mercury. The prediction interval is found by exponentiating the end points of the interval in log scale. Dividing by 100 and then exponentiating, we get

10^(135.803/100) ≤ pres ≤ 10^(138.119/100)
22.805 ≤ pres ≤ 24.054

In the original scale, the prediction interval is not symmetric about the point

estimate.

For the heights data, Figure 2.3 is a plot of the estimated mean function

given by the dashed line for the regression of dheight on mheight along

with curves at

β̂0 + β̂1x* ± t(0.025, 1373) sepred(dheight*|mheight*)

The vertical distance between the two solid curves for any value of

mheight corresponds to a 95% prediction interval for daughter’s height given

mother’s height. Although not obvious from the graph because of the very

large sample size, the interval is wider for mothers who were either relatively

tall or short, as the curves bend outward from the narrowest point at the sample mean of mheight.

2.6.4 Fitted Values

In rare problems, one may be interested in obtaining an estimate of E(Y|X = x*).

In the heights data, this is like asking for the population mean height of all


Figure 2.3 Prediction intervals (solid lines) and intervals for fitted values (dashed lines) for the heights data.

daughters of mothers with a particular height. This quantity is estimated by

the fitted value ŷ = β̂0 + β̂1x*, and its standard error is

sefit(ŷ|x*) = σ̂[1/n + (x* − x̄)²/SXX]^(1/2)

To obtain confidence intervals, it is more usual to compute a simultaneous

interval for all possible values of x. This is the same as first computing a joint

confidence region for β0 and β1, and from these, computing the set of all possible mean functions with slope and intercept in the joint confidence set. The

confidence region for the mean function is the set of all y such that

(β̂0 + β̂1x) − sefit(ŷ|x)[2F(α; 2, n − 2)]^(1/2) ≤ y ≤ (β̂0 + β̂1x) + sefit(ŷ|x)[2F(α; 2, n − 2)]^(1/2)