Due date - by 2:00pm on Saturday 6th May 2006 in Dr. O'Brien's office (Damen Hall, room 321). Late final exams will be penalized 10% per hour late, and will not be accepted after 10pm on 6th May.
Ground Rules - Do your own work!! Do not discuss anything related to the following problems with your classmates or others. For clarifications and general questions, ask Dr. O'Brien, who will be available during regular office hours plus
* Thursday 4th May 10am - 5pm
* Friday 5th May 10am - 5pm (exceptions - to proctor an exam, attend a seminar and part of the Dept. picnic)
* Saturday 6th May 11am - 2pm.
You may, however, consult any text you'd like.
Directions - as always, you must list all assumptions at the start of each problem (even when you are not explicitly asked to do so). Also, answer all questions fully, remembering to examine residuals and check any assumptions you can. Your answers should discuss both the bigger picture (and practical relevance of the results) and the technical (statistical) details, and should be in the form of a concise report. Do not attach output; sections of output can be referenced by page number.
Problems to do -
* students registered for undergraduate version - do problems 1-4 below
* students registered for graduate version - do problems 1-6 below
#1. Copenhagen housing satisfaction data are entered and analyzed here, and the explanation of the data and study is given at the top of that SAS program.
(a) The GLM procedure code is given, where the count is predicted as a function of housing satisfaction, contact with other residents, and housing type (tower block, apartment, house). Interpret the results of this analysis, critiquing all assumptions, and removing unnecessary predictive terms (this is done for you in the second GLM run). For individuals who live in houses and have a high degree of contact with other residents, use this model to predict the number of residents with a low, a medium, and a high degree of satisfaction.
(b) Again using the same dataset, obtain and interpret a predictive model for the respective counts, but assuming that the counts follow a Poisson distribution. Use this new predictive model to again predict the number of residents with a low, a medium, and a high degree of satisfaction for individuals who live in houses and have a high degree of contact with other residents. List and critique all assumptions, and drop all unnecessary terms from your model. (Hint: the corresponding procedure has a "predicted" option for the model statement.)
(c) Finally, fit the proportional odds model and again use this new predictive model to predict the number of residents with a low, a medium, and a high degree of satisfaction for individuals who live in houses and have a high degree of contact with other residents. List and critique all assumptions, and drop all unnecessary terms from your model.
#2. Goldin, Benditti, Humphreys & Dennis (1955, J. National Cancer Institute, p.129) provides a dataset useful to investigate the interactive effects of amethopterin and 6-mercaptopurine in mice for the treatment of leukemia. The data are dichotomous and represent a count of the dead animals treated by various drug combinations.
(a) relative potency - use nlmixed's #1 - #3 in this program/output to comment on the relative potency of these two drugs, commenting on any necessary assumptions and any tests of these. Specify all hypotheses, test statistics, p-values, etc.
(b) synergy or otherwise - use nlmixed's #4 and #5 to comment on whether you feel these drugs enhance or detract from one another (that is, exhibit any form of interaction), listing all assumptions and providing any relevant tests. Comment on the point of nlmixed #5 and its usefulness - be precise. Looking at the respective predicted ("Pred" from Fourth NLMixed and "Predb" from Fifth NLMixed) and actual ("rat") percentages, comment on the goodness of fit of the models. Suggest amendments to improve the model if possible. Make the connection between the q8 parameter for nlmixed #3 and the q4 parameter for nlmixed #4.
(c) design - what type of design was used in this study (be precise)? The design used 58 x 8 = 464 mice. Comment on the wisdom of the allocation of the 464 mice to the treatment combinations, and suggest an alternate allocation if possible. Defend the choice of your suggested design.
#3. Survival data presented in an article by Box and Cox is analyzed here using several procedures in SAS.
(a) The first analysis (Proc GLM) mimics the analysis presented by Box and Cox to illustrate the use of transformations (at least 40 years ago, when true survival data couldn't yet be analysed). Summarize the findings for this analysis and list all necessary assumptions.
(b) The next analyses (two Proc PHREGs) fit Cox's PH model to the original survival time data. Use these outputs to test whether the interaction between poison and treatment (6 degrees of freedom altogether) can be dropped. Incidentally, do any of the (one-degree-of-freedom) components of the interaction term appear significant? Which? List all assumptions for this (Cox PH) model, including the assumed model equation. Also, explain to what degree the Cox PH model methodology is both "parametric" and "nonparametric."
(c) The final analyses (two runs of Proc Lifetest) provide us with the KM estimates and perform the Logrank test. Indicate which are the respective KM estimates (indicating, too, what they are estimating), and interpret the results of the logrank tests, listing all necessary assumptions.
(d) Compare and contrast the previous three approaches. For example, to what extent do the GLM results (e.g. test for interaction and SNK results) agree and disagree with the Cox PH and KM results? Also, relate the Cox PH (no-interaction) model estimates to the graphed KM estimates.
#4. Bradstreet (table 5, part 4) presents data for a COD involving multivariate data, where the measurements are y1 = AUC and y2 = CMax. The data are analyzed here and are described here.
Using the program/output, thoroughly analyze these bivariate data, answering questions such as the following (supported with test stats, dfs, pvalues):
* are there any sequence, gender, period effects?
* are there any carryover effects (for the bivariate and univariate analyses)?
* how do the treatments (dose levels) differ?
* what is the point of the third GLM and subsequent plots?
* what remaining analyses should be undertaken (if any) and/or can you suggest a better way to analyze these data? Be specific.
#5. Nguyen and Amaratunga (2001, in Millard and Krause, Applied Statistics in the Pharmaceutical Industry, Chap. 5) present data corresponding to the concentration levels of M2000 (in ng/ml) in the plasma of 12 volunteers at 14 time points: immediately after and 10, 15, 30, 60 and 90 min and 2, 4, 6, 8, 10, 12, 16 and 24h after administration of the drug. Demographic measurements (gender, age in years, height in cm and weight in kg) were also recorded for these volunteers. These data are analyzed in this SAS program/output.
Thoroughly analyze these data using the output from the five NLMixed's, listing necessary assumptions and findings associated with each, and which you prefer, as well as subsequent analyses you would like performed. Comment on the importance of the demographic measurements, and highlight the models used in each instance. Remember to bring in residual plots and hypothesis tests (stating null, alternative, test statistic, pvalue and conclusion) to support your findings.
#6. Everitt (A Handbook of Statistical Analyses using SPlus, 1994, p.55) reports data from Lee (1980). Fifty-one previously untreated adult patients with acute myeloblastic leukaemia were given a course of treatment, at the end of which they were assessed as having responded or not responded. Six pre-treatment variables were recorded:
1. age at diagnosis (age);
2. smear differential percentage of blasts (smear);
3. percentage of absolute marrow leukaemia infiltrate (infiltrt);
4. percentage labelling index of the bone marrow leukaemia cells (label);
5. absolute blasts (ablasts);
6. highest temperature prior to treatment (hightemp).
Three post-treatment variables were also recorded:
"response" - 1 = Responds to treatment, 0 = Fails to respond
"survtime" - survival time from diagnosis (in months)
"status" - 0 = dead, 1 = still alive
These data are entered and analyzed using this SAS program, which fits a CPH model and LR test regarding the survival times and then relates Response (yes or no).
(a) Discuss the CPH model portion of the output - how were the explanatory variables chosen, which were (was) chosen, and what is the relevance of the parameter estimate and ChiSq test? List all assumptions for this procedure.
(b) Now discuss the KM and LogRank portion of the output, again listing all necessary assumptions, and summarising the (test) findings. Which individuals are considered "censored" here? Why?
(c) Finally, ignoring survival time, focus on the patients who responded versus those who didn't by discussing the Logistic output. List the necessary assumptions for this analysis, and discuss which (pre-treatment) variables are significant and which appear not to be, the relevance of (especially the signs of) the parameter estimates, the Odds Ratios and 95% CIs.