Problems students are to do -
- students registered for 400 (graduate) credit
please do problems 1 - 5 below
- students registered for 300 (undergraduate) credit
please do problems 1, 2 and 4 below
Directions - Students are asked to type or neatly write up their clear
and thorough answers
to these questions. Please do not hesitate to contact Dr. O'Brien
(via email, phone or in person)
should you have any questions! Students are required to use the
Minitab output which is now
provided at the end of each question. You do not need to attach
the output - just make
reference to it in your write-up of your solutions.
NB (for problem 5)- one definition of a model
as being "linear" is that the partial derivative
of the expectation of Y with respect to each
of the model parameters involves none of
the model parameters. Otherwise the model
is "nonlinear."
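The definition above can be explored numerically before doing the calculus by hand. The following is only a rough sketch: the two models (`straight_line` and `power_curve`), the parameter settings and the x values are illustrative assumptions and are not taken from the assignment, and a finite-difference check of this kind is a heuristic, not a substitute for the analytic partial derivatives requested in problem 5.

```python
def param_gradient(model, x, params, h=1e-5):
    """Central-difference partial derivatives of model(x, params)
    with respect to each parameter, evaluated at one x value."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h
        down[i] -= h
        grads.append((model(x, up) - model(x, down)) / (2 * h))
    return grads


def looks_linear(model, x_values, param_sets, tol=1e-6):
    """Return True if the parameter gradients are (numerically) the same
    at every parameter setting tried, i.e. they involve no parameters."""
    reference = None
    for params in param_sets:
        current = [param_gradient(model, x, params) for x in x_values]
        if reference is None:
            reference = current
        elif any(abs(a - b) > tol
                 for ref_row, cur_row in zip(reference, current)
                 for a, b in zip(ref_row, cur_row)):
            return False
    return True


# Hypothetical example models (not the ones in problem 5).
def straight_line(x, p):
    return p[0] + p[1] * x


def power_curve(x, p):
    return p[0] * x ** p[1]


X_VALUES = [0.5, 1.0, 2.0]
PARAM_SETS = [(1.0, 2.0), (3.0, 0.5), (-2.0, 1.5)]

print("straight line looks linear:", looks_linear(straight_line, X_VALUES, PARAM_SETS))
print("power curve looks linear:", looks_linear(power_curve, X_VALUES, PARAM_SETS))
```

For the straight line the derivatives are 1 and x whatever the parameter values, so the check passes. For the power curve the derivatives change as the parameters change, so it fails.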
1. The Peakflow dataset was gathered
to obtain a useful simulation model for peak water flow
from watersheds; the model was tested by comparing
measured peak flow (cfs) from 10 storms with
predictions of peak flow obtained from the simulation
model. Q0 and QP are the observed and
predicted peak flows (not reported here), respectively,
and the dependent variable is
Y = ln(Q0/QP), so the
dependent variable will have the value zero if the observed and predicted
peak flows agree. The independent variables
are the area of the watershed (x1, in mi^2), the average
slope of the watershed (x2, in percent),
the surface absorbency index (x3, = 0 for complete absorbency,
and =100 for no absorbency), and the peak intensity
of rainfall (in/hr) computed on half-hour time
intervals (x4).
(a) Use these data to regress Y on the four independent
variables (including the intercept term), to obtain
and interpret the parameter
estimates, and to comment on the fit.
(b) Test whether both x2 and x3 can
be dropped from the Full Model described in (a).
(c) Using the Reduced Model described in (b), test
the model assumptions by commenting on
any unusual outliers or
influential observations, and check for homoskedasticity and multicollinearity
(see
notes below). Check out this
Minitab output.
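For part (b) of problem 1, one common way to test whether several terms can be dropped together is the partial (extra sum of squares) F statistic. The sketch below assumes that approach. The function name and the SSE values in the example call are placeholders, not numbers from the Minitab output. The degrees of freedom follow from the problem statement: 10 storms and 5 parameters in the Full Model leave 5 error degrees of freedom, and 2 terms are dropped.

```python
def partial_f_statistic(sse_reduced, sse_full, n_dropped, df_error_full):
    """Partial F statistic for comparing a Reduced Model with a Full Model.

    sse_reduced   : error sum of squares of the Reduced Model
    sse_full      : error sum of squares of the Full Model
    n_dropped     : number of terms dropped to obtain the Reduced Model
    df_error_full : error degrees of freedom of the Full Model
    """
    if sse_reduced < sse_full:
        raise ValueError("The Reduced Model cannot have a smaller SSE than the Full Model.")
    numerator = (sse_reduced - sse_full) / n_dropped
    denominator = sse_full / df_error_full
    return numerator / denominator


# Placeholder SSE values for illustration only; read the real ones
# from the two regression outputs.
f_stat = partial_f_statistic(sse_reduced=3.0, sse_full=2.0,
                             n_dropped=2, df_error_full=5)
print("partial F statistic:", f_stat)
```

The statistic is then compared with the F distribution on (2, 5) degrees of freedom, using a table or software.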
2. The Ozone dataset was obtained by
Dr. A.S. Heagle of North Carolina State University to assess the
impact of ozone (in ppm) on crop yield.
(a) Use these data to regress Yield on Ozone, examine
the fit and the residuals plot, and comment on
whether the model assumptions
appear to be met.
(b) Create a term for the square of Ozone (OZ2),
and regress Yield on both Ozone and OZ2, again
commenting on the fit.
Does the quadratic model appear to fit the data better than the linear
one?
For this part of the exercise
(since these data are indeed noisy), use α =
10% in your test.
(c) Comment on the reasonableness of a quadratic
regression model for data of this sort. E.g., use the
model obtained in part (b)
to predict the Yield for an ozone level of 0.35 ppm.
Check out this
Minitab output.
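For part (c) of problem 2, it may help to recall that a fitted quadratic always has a turning point, at x = -b1/(2 b2), beyond which the predicted trend reverses direction. The sketch below evaluates a quadratic at a new x value and locates that turning point. The coefficient values are made-up placeholders and not the estimates from the Minitab output; substitute the fitted values from part (b).

```python
def quadratic_prediction(b0, b1, b2, x):
    """Predicted response from the quadratic model b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x ** 2


def turning_point(b1, b2):
    """x value at which the fitted quadratic stops falling (or rising)
    and reverses direction."""
    if b2 == 0:
        raise ValueError("b2 = 0 gives a straight line, which has no turning point.")
    return -b1 / (2 * b2)


# Placeholder coefficients for illustration only.
B0, B1, B2 = 100.0, -200.0, 150.0

print("prediction at 0.35:", quadratic_prediction(B0, B1, B2, 0.35))
print("turning point:", turning_point(B1, B2))
```

Comparing the new x value and the turning point with the range of ozone levels actually observed shows whether the prediction is an extrapolation and whether the curve has started to bend back.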
3. A health researcher interested in studying the relationship between
diastolic blood pressure and age among
healthy adult women 20 to 60 years old collected
data on 54 subjects. The data are saved here.
(a) Perform the necessary linear regression (OLS),
examine the residual plot and report on your findings.
(b) Perform the linear regression on a suitable
transformation of the data, and report the results.
Check out this
Minitab output. NB - "1000rec2" = 1000/(y^2).
4. In his PhD thesis, Rick Linthurst of NCSU examined this dataset
for the purpose of identifying the important soil
characteristics influencing aerial biomass production
of the marsh grass Spartina alterniflora in the Cape Fear
Estuary of North Carolina. Using salinity
(%), pH (acidity), K (potassium, ppm), Na (sodium, ppm) and
Zn (zinc, ppm) as potential explanatory variables,
obtain a good linear model to predict biomass. Comment on
the quality of the final fit. Hint: use a
stepwise regression approach. Check out this
Minitab output.
5. A researcher is trying to understand why statisticians call the model
function f(X) = b0 + b1 X
+ b2 X^2 "linear"
and the model function g(X) = b e^(qX)
"nonlinear" (the parameters of the second model are b
and q). Convince
the researcher that this is indeed the case by taking all
the necessary partial derivatives and arguing accordingly.
NB - homoskedasticity means "same" (homo) "variance"
(skedasticity), meaning you are to check
for constant variance. Multicollinearity
is a problem that sometimes occurs when we have strong
correlations and dependences in the x's and
the X'X matrix cannot be inverted. One measure of
multicollinearity is the variance inflation
factor (VIF), obtainable in Minitab -> Regression -> Options,
and a VIF above 5 or 10 indicates that
a variable is problematic and overly correlated with
the others.
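To see what the VIF measures outside of Minitab, here is a minimal sketch that uses the fact that the VIFs are the diagonal entries of the inverse of the correlation matrix of the x's. It assumes NumPy is available. The function name `vif` and the synthetic data are illustrative assumptions and have nothing to do with the assignment's datasets.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X.

    X is an (n observations) x (k predictors) array with no intercept column.
    The VIFs are the diagonal of the inverse of the predictors' correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic illustration: x3 is almost a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

demo_vifs = vif(np.column_stack([x1, x2, x3]))
print("VIFs for x1, x2, x3:", demo_vifs)
```

In this synthetic example x1 and x3 receive large VIFs because each is nearly determined by the other, while x2 stays close to 1.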