Final Exam

Due date - by 2:00pm on Saturday 6th May 2006 in Dr. O'Brien's office
(Damen Hall, room 321).  Late final exams will be penalized 10% per hour late,
and not accepted after 10pm on 6th May.

Ground Rules - Do your own work!! Do not discuss anything related to the following
problems with your classmates or others. For clarifications and general questions, ask
Dr. O'Brien, who will be available during regular office hours plus
  * Thursday  4th May   10am - 5pm
  * Friday      5th May   10am - 5pm (except while proctoring an exam, attending a seminar, and attending part of the Dept. picnic)
  * Saturday  6th May    11am - 2pm.
You may, however, consult any text you'd like.

Directions - as always, you must list all assumptions at the start of each problem (even
when you are not explicitly asked to do so).  Also, answer all questions fully, remembering
to examine residuals and check any assumptions you can.  Your answers should discuss
both the bigger picture (and practical relevance of the results) and the technical
(statistical) details, and should be in the form of a concise report. Do not attach output;
sections of output can be referenced by page number.

Problems to do -
  * students registered for the undergraduate version - do problems 1-4 below
  * students registered for the graduate version - do problems 1-6 below

#1. Copenhagen housing satisfaction data is entered and analyzed here, and the
    explanation of the data and study is given at the top of that SAS program.
(a) the GLM procedure code is given, where the count is predicted as a function
    of housing satisfaction, contact with other residents, and housing type (tower
    block, apartment, house).  Interpret the results of this analysis, critiquing all
    assumptions, and removing unnecessary predictive terms (this is done for you in
    the second GLM run).  For individuals living in houses and who have a high
    degree of contact with other residents, use this model to predict the number
    of residents with a low, a medium, and a high degree of satisfaction.
(b) again using the same dataset, obtain and interpret a predictive model for the
    respective counts, but assuming that the counts follow a Poisson distribution.
    Use this new predictive model to again predict the number of residents with a low,
    a medium, and a high degree of satisfaction for individuals living in houses and
    who have a high degree of contact with other residents. List and critique all
    assumptions, and drop all unnecessary terms from your model. (Hint: the
    corresponding procedure has a "predicted" option for the model statement.)
(c) finally, fit the proportional odds model and again use this new predictive model
    to predict the number of residents with a low, a medium, and a high degree
    of satisfaction for individuals living in houses and who have a high degree of
    contact with other residents. List and critique all assumptions, and drop all
    unnecessary terms from your model.
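
A minimal sketch, in Python rather than SAS, of how fitted cumulative-logit (proportional odds) quantities can be turned into predicted counts for the three satisfaction categories. It assumes the parameterization P(Y <= j) = expit(cutpoint_j + eta); the sign and category-ordering conventions of your SAS procedure may differ and should be checked against the output. The cutpoints, linear predictor and group size below are hypothetical placeholders, not estimates from the Copenhagen data.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(cutpoints, eta):
    """Category probabilities under a cumulative-logit model.

    Assumes P(Y <= j) = expit(cutpoints[j] + eta), with cutpoints in
    increasing order; the last category takes the remaining probability.
    """
    cumulative = [expit(a + eta) for a in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


def expected_counts(cutpoints, eta, n):
    """Predicted counts: group size times each category probability."""
    return [n * p for p in category_probs(cutpoints, eta)]


# Hypothetical values for illustration only (not fitted estimates).
counts = expected_counts(cutpoints=[-0.5, 0.7], eta=-0.3, n=100)
print([round(c, 1) for c in counts])  # low, medium, high
```

The predicted counts always sum to the group size, so the total number of residents in the chosen housing/contact cell must be supplied from the data.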

#2. Goldin, Benditti, Humphreys & Dennis (1955, J. National Cancer Institute, p.129)
    provides a dataset useful to investigate the interactive effects of amethopterin and
    6-mercaptopurine in mice for the treatment of leukemia.  The data are dichotomous
    and represent a count of the dead animals treated by various drug combinations.
(a) relative potency - use nlmixed's #1 - #3 in this program/output to comment
    on the relative potency of these two drugs, commenting on any necessary assumptions
    and any tests of these.  Specify all hypotheses, test stats, pvalues, etc.
(b) synergy or otherwise - use nlmixed's #4 and #5 to comment on whether you feel these
    drugs enhance or detract from one another (that is, exhibit any form of interaction),
    listing all assumptions and providing any relevant tests.  Comment on the point of
    nlmixed #5 and its usefulness - be precise. Looking at the respective predicted
    ("Pred" from Fourth NLMixed and "Predb" from Fifth NLMixed) and
    actual ("rat") percentages, comment on the goodness of fit of the models.
    Suggest amendments to improve the model if possible.  Make the connection between
    the q8 parameter for nlmixed #3 and the q4 parameter for nlmixed #4.
(c) design - what type of design was used in this study (be precise)?  The design used
    58 x 8 =  464 mice.  Comment on the wisdom of the allocation of the 464 mice to
    the treatment combinations, and suggest an alternate allocation if possible.  Defend
    the choice of your suggested design.

#3. Survival data presented in an article by Box and Cox is analyzed here using several
    procedures in SAS.
(a) The first analysis (Proc GLM) mimics the analysis presented by Box and Cox to
    illustrate the use of transformations (at least 40 years ago when true survival data
    couldn't yet be analysed). Summarize the findings for this analysis and list all necessary
    assumptions.
(b) The next analyses (two Proc PHREGs) fit Cox's PH model to the original survival
    time data. Use these outputs to test whether the interaction between poison and
    treatment (6 degrees of freedom altogether) can be dropped.  Incidentally, do any of
    the (one-degree-of-freedom) components of the interaction term appear significant?
    Which?  List all assumptions for this (Cox PH) model, including the assumed model
    equation. Also, explain to what degree the Cox PH model methodology is both
    "parametric" and "nonparametric."
(c) The final analyses (two runs of Proc Lifetest) provide us with the KM estimates and
    perform the Logrank test.  Indicate which are the respective KM estimates (and
    what they are estimating), and interpret the results of the logrank tests, listing all
    necessary assumptions.
(d) Compare and contrast the previous three approaches.  For example, to what extent
    do the GLM results (e.g. test for interaction and SNK results) agree and disagree with
    the CoxPH and KM results? Also, relate the Cox PH (no-interaction) model estimates
    to the graphed KM estimates.
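
A minimal sketch of the arithmetic behind a likelihood-ratio comparison of two nested models, such as the Cox PH fits with and without the 6-df interaction in part (b). It assumes each fit reports a -2 log L value; the two values used below are hypothetical placeholders, not numbers from the PHREG output. The chi-square tail probability uses the closed form that holds for even degrees of freedom, so only the Python standard library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def lr_test(neg2logl_reduced, neg2logl_full, df):
    """LR statistic and p-value for nested models (reduced inside full)."""
    stat = neg2logl_reduced - neg2logl_full
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical -2 log L values for illustration only.
stat, p = lr_test(neg2logl_reduced=310.4, neg2logl_full=302.1, df=6)
print(f"LR chi-square = {stat:.2f} on 6 df, p = {p:.4f}")
```

A large p-value here would support dropping the extra terms; the same comparison applies to any pair of nested models fitted by maximum (or partial) likelihood.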

#4. Bradstreet (table 5, part 4) presents data for a crossover design (COD) involving multivariate data,
    where the measurements are y1 = AUC and y2 = CMax.  The data are analyzed
    here and are described here.  Using the program/output, thoroughly analyze these
    bivariate data answering questions like (supported with test stats, dfs, pvalues)
        * are there any sequence, gender, period effects?
        * are there any carryover effects (for the bivariate and univariate analyses)?
        * how do the treatments (dose levels) differ?
        * what is the point of the third GLM and subsequent plots?
        * what remaining analyses should be undertaken (if any) and/or can you suggest
            a better way to analyze these data?  Be specific.

#5. Nguyen and Amaratunga (2001, in Millard and Krause, Applied Statistics in the
    Pharmaceutical Industry, Chap. 5) present data corresponding to the concentration
    levels of M2000 (in ng/ml) in the plasma of 12 volunteers at 14 time points: immediately
    after and 10, 15, 30, 60 and 90 min and 2, 4, 6, 8, 10, 12, 16 and 24h after
    administration of the drug.  Demographic measurements (gender, age in years, height
    in cm and weight in kg) were also recorded for these volunteers.  These data are
    analyzed in this SAS program/output.  Thoroughly analyze these data using the output
    from the five NLMixed's, listing the necessary assumptions and findings associated with
    each, stating which you prefer, and noting any subsequent analyses you would like performed.
    Comment on the importance of the demographic measurements, and highlight the models
    used in each instance.  Remember to bring in residual plots and hypothesis tests (stating
    null, alternative, test statistic, pvalue and conclusion) to support your findings.

#6. Everitt (A Handbook of Statistical Analyses using SPlus, 1994, p.55) reports data
    from Lee (1980).  Fifty-one previously untreated adult patients with acute myeloblastic
    leukaemia were given a course of treatment, at the end of which they were assessed as
    having responded or not responded.  Six pre-treatment variables were recorded:
    1. age at diagnosis (age);
    2. smear differential percentage of blasts (smear);
    3. percentage of absolute marrow leukaemia infiltrate (infiltrt);
    4. percentage labelling index of the bone marrow leukaemia cells (label);
    5. absolute blasts (ablasts);
    6. highest temperature prior to treatment (hightemp).
    Three post-treatment variables are also recorded:
        "response" - 1 = Responds to treatment, 0 = Fails to respond
        "survtime" - survival time from diagnosis (in months)
        "status" - 0 = dead, 1 = still alive
    These data are entered and analyzed using this SAS program, which fits a CPH model
    and performs the KM/logrank analysis of the survival times, and then models Response (yes or no) by logistic regression.
(a) Discuss the CPH model portion of the output - how were the explanatory variables
    chosen, which were (was) chosen, and what is the relevance of the parameter estimate
    and ChiSq test? List all assumptions for this procedure.
(b) Now discuss the KM and LogRank portion of the output, again listing all necessary
    assumptions, and summarising the (test) findings.  Which individuals are considered
    "censored" here? Why?
(c) Finally, ignoring survival time, focus on the patients who responded versus those who
    didn't by discussing the Logistic output.  List the necessary assumptions for this analysis,
    discuss which (pre-treatment) variables are significant and which appear not to be, the
    relevance of the parameter estimates (especially their signs), and the Odds Ratios and
    95% CIs.
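
A minimal sketch of how a logistic-regression parameter estimate and its standard error relate to an odds ratio and a Wald-type 95% confidence interval. The estimate and standard error below are hypothetical placeholders, not values from the leukaemia output, and the interval reported by your SAS run may be computed differently (for example, by profile likelihood), so treat this only as an aid to interpretation.

```python
import math


def odds_ratio_ci(beta, se, z=1.959964):
    """Odds ratio exp(beta) with Wald limits exp(beta -/+ z * se)."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Hypothetical estimate and standard error for illustration only.
or_hat, lower, upper = odds_ratio_ci(beta=-0.06, se=0.025)
print(f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

A negative estimate gives an odds ratio below 1 (lower odds of the modelled response per unit increase in the predictor), and an interval that excludes 1 corresponds to a Wald test significant at the 5% level.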