Homework # 2 – due on or before Thursday Sept. 18 at 5pm
Problems students are to do -
- students registered for 436 (graduate) credit – do problems 1 - 5 below
- students registered for 336 (undergraduate) credit – do problems 1, 2 and 4 below
Directions -
Students are asked to type or neatly write up their clear and thorough answers to these questions. Please do not hesitate to contact Dr. O'Brien (via email, phone or in person) should you have any questions! Students are required to use the Minitab output that is now provided at the end of each question. You do not need to attach the output - just make reference to it in your write-up of your solutions.
NB (for problem 5) - one definition of a model as being "linear" is that the partial derivative of the expectation of Y with respect to each of the model parameters involves none of the model parameters. Otherwise the model is "nonlinear."
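NB (illustration of the definition above) - the following sketch applies the definition to two hypothetical models that are not part of this assignment, assuming x > 0. The symbols \beta_0 and \beta_1 are chosen here for illustration only.

```latex
% Hypothetical model A: E[Y] = \beta_0 + \beta_1 \ln x
\[
\frac{\partial E[Y]}{\partial \beta_0} = 1, \qquad
\frac{\partial E[Y]}{\partial \beta_1} = \ln x
\]
% Neither derivative involves \beta_0 or \beta_1, so model A is "linear"
% under the definition, even though it is not a straight line in x.

% Hypothetical model B: E[Y] = \beta_0 x^{\beta_1}
\[
\frac{\partial E[Y]}{\partial \beta_0} = x^{\beta_1}, \qquad
\frac{\partial E[Y]}{\partial \beta_1} = \beta_0 \, x^{\beta_1} \ln x
\]
% Both derivatives involve at least one parameter, so model B is "nonlinear".
```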
1. The Peakflow dataset was gathered to obtain a useful simulation model for peak water flow from watersheds; the model was tested by comparing measured peak flow (cfs) from 10 storms with predictions of peak flow obtained from the simulation model. Q0 and QP are the observed and predicted peak flows (not reported here), respectively, and the dependent variable is Y = ln(Q0/QP), so the dependent variable will have the value zero if the observed and predicted peak flows agree. The independent variables are the area of the watershed (x1, in mi^2), the average slope of the watershed (x2, in percent), the surface absorbency index (x3, = 0 for complete absorbency, and = 100 for no absorbency), and the peak intensity of rainfall (in/hr) computed on half-hour time intervals (x4).
(a) Use these data to regress Y on the four independent variables (including the intercept term), to obtain and interpret the parameter estimates, and to comment on the fit.
(b) Test whether both x2 and x3 can be dropped from the Full Model described in (a).
(c) Using the Reduced Model described in (b), test the model assumptions by commenting on any unusual outliers or influential observations, and check for homoskedasticity and multicollinearity (see notes below). Check out this Minitab output.
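NB (for problem 1, part (b)) - one common way to test whether a group of terms can be dropped is the partial (extra sum of squares) F test, which compares the error sums of squares of the Full and Reduced Models. The sketch below is only an illustration: the function name and the SSE values are hypothetical placeholders, not numbers from the Peakflow output. The residual degrees of freedom follow from the problem statement (10 storms; five parameters in the Full Model, three in the Reduced Model).

```python
def partial_f_statistic(sse_reduced, sse_full, df_reduced, df_full):
    """Partial F statistic for comparing nested regression models.

    sse_*: error (residual) sums of squares of each model.
    df_*:  residual degrees of freedom of each model.
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual df")
    # Average increase in SSE per dropped term ...
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    # ... relative to the error mean square of the full model.
    denominator = sse_full / df_full
    return numerator / denominator


n = 10                 # number of storms (from the problem statement)
df_full = n - 5        # intercept + x1, x2, x3, x4
df_reduced = n - 3     # intercept + x1, x4

# Placeholder SSE values, NOT taken from the Minitab output.
example_f = partial_f_statistic(
    sse_reduced=2.0, sse_full=1.0, df_reduced=df_reduced, df_full=df_full
)
print(example_f)
```

The resulting statistic would be compared with an F distribution on (df_reduced - df_full, df_full) = (2, 5) degrees of freedom; the actual SSE values are to be read from the Minitab output.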
2. The Ozone dataset was obtained by Dr. A.S. Heagle of North Carolina State University so as to assess the impact of ozone (in ppm) on crop yield.
(a) Use these data to regress Yield on Ozone, examine the fit and the residuals plot, and comment on whether the model assumptions appear to be met.
(b) Create a term for the square of Ozone (OZ2), and regress Yield on both Ozone and OZ2, again commenting on the fit. Does the quadratic model appear to fit the data better than the linear one? For this part of the exercise (since these data are indeed noisy), use α = 10% in your test.
(c) Comment on the reasonableness of a quadratic regression model for data of this sort. E.g., use the model obtained in part (b) to predict the Yield for an ozone level of 0.35 ppm. Check out this Minitab output.
3. A health researcher interested in studying the relationship between diastolic blood pressure and age among healthy adult women 20 to 60 years old collected data on 54 subjects. The data are saved here.
(a) Perform the necessary linear regression (OLS), examine the residual plot and report on your findings.
(b) Perform the linear regression on a suitable transformation of the data, and report the results. Check out this Minitab output. NB - "1000rec2" = 1000/(y^2).
4. In his PhD thesis, Rick Linthurst of NCSU examined this dataset for the purpose of identifying the important soil characteristics influencing aerial biomass production of the marsh grass Spartina alterniflora in the Cape Fear Estuary of North Carolina. Using salinity (%), pH (acidity), K (potassium, ppm), Na (sodium in ppm) and Zn (zinc, ppm) as potential explanatory variables, obtain a good linear model to predict biomass. Comment on the quality of the final fit. Clearly interpret the parameter estimates. Hint: use a stepwise regression approach. Check out this Minitab output.
5. A researcher is trying to understand why statisticians call the model function f(x) = b0 + b1 x + b2 x^2 "linear" and the model function g(x) = b exp{-q x} "nonlinear" (the parameters of the second model are b and q). Convince the researcher that this is indeed the case by taking all the necessary partial derivatives and arguing accordingly.
NB - homoskedasticity means "same" (homo) "variance" (skedasticity), meaning you are to check for constant variance. Multicollinearity is a problem that sometimes occurs when there are strong correlations and dependences among the x's, so that the X'X matrix is singular or nearly singular and cannot be reliably inverted. One measure of multicollinearity is the variance inflation factor (VIF), obtainable in Minitab -> Regression -> Options; a VIF above 5 or 10 indicates that a variable is problematic and overly correlated with the others.
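NB (on the VIF) - the VIF for predictor j is 1/(1 - Rj^2), where Rj^2 is the R-squared from regressing predictor j on the other predictors. When a model has only two predictors, Rj^2 is just the squared correlation between them, so both predictors share the same VIF. The sketch below covers that two-predictor case only; the function names and the data are made up for illustration and are not from any of the homework datasets.

```python
import math


def pearson_r(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    if n != len(v) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def vif_two_predictors(u, v):
    """VIF shared by both predictors in a two-predictor model."""
    r_squared = pearson_r(u, v) ** 2
    if 1.0 - r_squared <= 1e-12:
        # Perfect (or numerically perfect) collinearity: X'X is singular.
        return math.inf
    return 1.0 / (1.0 - r_squared)


# Made-up predictor values, for illustration only.
x_a = [1, 2, 3, 4, 5]
x_b = [2, 1, 4, 3, 5]
print(pearson_r(x_a, x_b), vif_two_predictors(x_a, x_b))
```

With more than two predictors, each Rj^2 requires its own regression, which is what Minitab reports directly.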