Homework #2 - due Wednesday, 15th February 2006 at 4.15pm

Problems students are to do -
    - students registered for 400 (graduate) credit please do problems 1 - 5 below
    - students registered for 300 (undergraduate) credit please do problems 1, 2 and 4 below

Directions - Students are asked to type or neatly write up their clear and thorough answers
to these questions.  Please do not hesitate to contact Dr. O'Brien (via email, phone or in person)
should you have any questions!  Students are required to use the Minitab output which is now
provided at the end of each question.  You do not need to attach the output - just make
reference to it in your write-up of your solutions.

NB (for problem 5) - one definition of a model as being "linear" is that the partial derivative
of the expectation of Y with respect to each of the model parameters involves none of
the model parameters.  Otherwise the model is "nonlinear."
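
As a rough illustration of this definition (not a solution to problem 5), the Python sketch
below estimates each partial derivative by central differences and checks whether it changes
when the parameter values change.  The two models used, and the names partial, looks_linear,
log_model and saturation_model, are hypothetical examples that do not appear in this homework.
A check at one x value and two parameter settings is only a spot check, not a proof.

```python
import math

def partial(f, x, params, i, h=1e-6):
    """Central-difference estimate of d f / d params[i] at (x, params)."""
    up = list(params)
    dn = list(params)
    up[i] += h
    dn[i] -= h
    return (f(x, up) - f(x, dn)) / (2 * h)

def looks_linear(f, x, params_a, params_b, tol=1e-4):
    """True if every partial derivative is numerically the same at two
    different parameter settings, i.e. it does not involve the parameters."""
    return all(
        abs(partial(f, x, params_a, i) - partial(f, x, params_b, i)) < tol
        for i in range(len(params_a))
    )

# Hypothetical example models (assumptions, not the models of problem 5).
def log_model(x, p):
    # E[Y] = p0 + p1*ln(x): the partials are 1 and ln(x)
    return p[0] + p[1] * math.log(x)

def saturation_model(x, p):
    # E[Y] = p0*x / (p1 + x): the partials involve p0 and p1
    return p[0] * x / (p[1] + x)

x = 2.0
print(looks_linear(log_model, x, [1.0, 2.0], [3.0, -1.0]))        # True
print(looks_linear(saturation_model, x, [1.0, 2.0], [3.0, 5.0]))  # False
```

For the written answer the derivatives should of course be taken analytically; the numerical
check is only a way to confirm the algebra.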

1. The Peakflow dataset was gathered to obtain a useful simulation model for peak water flow
    from watersheds; the model was tested by comparing measured peak flow (cfs) from 10 storms with
    predictions of peak flow obtained from the simulation model.  Q0 and QP are the observed and
    predicted peak flows (not reported here), respectively, and the dependent variable is
    Y = ln(Q0/QP), so the dependent variable will have the value zero if the observed and predicted
    peak flows agree.  The independent variables are the area of the watershed (x1, in mi^2), the average
    slope of the watershed (x2, in percent), the surface absorbency index (x3, = 0 for complete absorbency,
    and =100 for no absorbency), and the peak intensity of rainfall (in/hr) computed on half-hour time
    intervals (x4).
    (a) Use these data to regress Y on the four independent variables (including the intercept term), to obtain
        and interpret the parameter estimates, and to comment on the fit.
    (b) Test whether both x2 and x3 can be dropped from the Full Model described in (a).
    (c) Using the Reduced Model described in (b), test the model assumptions by commenting on
        any unusual outliers or influential observations, and check for homoskedasticity and multicollinearity
        (see notes below).  Check out this Minitab output.
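
Note on 1(b): dropping two terms at once is usually tested with a partial (Full versus Reduced)
F statistic.  The Python sketch below shows only the arithmetic, assuming that approach; the
function name partial_f is made up for illustration, and the SSE values are placeholders, NOT
numbers from the Minitab output.  The degrees of freedom follow from the 10 storms and the
number of parameters in each model.

```python
def partial_f(sse_reduced, sse_full, df_reduced, df_full):
    """F = [(SSE_R - SSE_F) / (df_R - df_F)] / [SSE_F / df_F]."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

n = 10               # storms in the Peakflow data
df_full = n - 5      # Full Model: intercept + x1, x2, x3, x4
df_reduced = n - 3   # Reduced Model: intercept + x1, x4

# Placeholder error sums of squares (assumptions, not from the output).
sse_full = 2.0
sse_reduced = 3.0

f_stat = partial_f(sse_reduced, sse_full, df_reduced, df_full)
print(f_stat, "on", df_reduced - df_full, "and", df_full, "degrees of freedom")
```

The resulting statistic would be compared with the F distribution on 2 and 5 degrees of
freedom, using the error sums of squares read from the Full and Reduced Model output.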

2. The Ozone dataset was obtained by Dr. A.S. Heagle of North Carolina State Univ to assess the
    impact of ozone (in ppm) on crop yield.
    (a) Use these data to regress Yield on Ozone, examine the fit and the residuals plot, and comment on
        whether the model assumptions appear to be met.
    (b) Create a term for the square of Ozone (OZ2), and regress Yield on both Ozone and OZ2, again
        commenting on the fit.  Does the quadratic model appear to fit the data better than the linear one?
        For this part of the exercise (since these data are indeed noisy), use α = 10% in your test.
    (c) Comment on the reasonableness of a quadratic regression model for data of this sort.  For example,
        use the model obtained in part (b) to predict the Yield for an ozone level of 0.35 ppm.
    Check out this Minitab output.

3. A health researcher interested in studying the relationship between diastolic blood pressure and age among
    healthy adult women 20 to 60 years old collected data on 54 subjects. The data are saved here.
    (a) Perform the necessary linear regression (OLS), examine the residual plot and report on your findings.
    (b) Perform the linear regression on a suitable transformation of the data, and report the results.
    Check out this Minitab output.  NB - "1000rec2" = 1000/(y^2).

4. In his PhD thesis, Rick Linthurst of NCSU examined this dataset for the purpose of identifying the important soil
    characteristics influencing aerial biomass production of the marsh grass Spartina alterniflora in the Cape Fear
    Estuary of North Carolina.  Using salinity (%), pH (acidity), K (potassium, ppm), Na (sodium, ppm) and
    Zn (zinc, ppm) as potential explanatory variables, obtain a good linear model to predict biomass. Comment on
    the quality of the final fit.  Hint: use a stepwise regression approach. Check out this Minitab output.

5. A researcher is trying to understand why statisticians call the model function f(X) = b0 + b1*X + b2*X^2
    "linear" and the model function g(X) = b*e^(qX) "nonlinear" (the parameters of the second model are b
    and q).  Convince the researcher that this is indeed the case by taking all the necessary partial
    derivatives and arguing accordingly.

NB - homoskedasticity means "same" (homo) "variance" (skedasticity), meaning you are to check
for constant variance.  Multicollinearity is a problem that sometimes occurs when there are strong
correlations and dependences among the x's, so that the X'X matrix cannot be inverted (or is
nearly singular).  One measure of multicollinearity is the variance inflation factor (VIF),
obtainable in Minitab -> Regression -> Options; a VIF above 5 or 10 indicates that a variable
is problematic and overly correlated with the others.
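
For reference, the VIF of predictor j is 1/(1 - Rj^2), where Rj^2 is the R-squared from
regressing predictor j on the other predictors.  With only two predictors this reduces to
1/(1 - r^2), where r is their sample correlation.  The Python sketch below covers that
two-predictor case only; the function names and the data values are made up for illustration
and are not taken from any of the homework datasets.

```python
import math

def pearson_r(u, v):
    """Sample correlation between two equal-length lists of numbers."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def vif_two_predictors(u, v):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(u, v)
    return 1.0 / (1.0 - r * r)

# Made-up predictor values (assumptions for illustration only).
xa = [1.0, 2.0, 3.0, 4.0, 5.0]
xb = [1.1, 1.9, 3.2, 3.9, 5.1]    # nearly a copy of xa
xc = [2.0, -1.0, 0.0, -1.0, 2.0]  # uncorrelated with xa

print(vif_two_predictors(xa, xb))  # far above 10: problematic
print(vif_two_predictors(xa, xc))  # essentially 1: no inflation
```

With more than two predictors each VIF needs its own regression of that predictor on the
rest, which is what the Minitab option reports.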