Do not turn in raw computer output. You may cut and paste key graphs. All results must be summarized in your own words, not just presented without interpretation. You can use tables as needed to present results efficiently, but describe the results presented and interpret them.
Everyone needs to have at least one variable that started out as continuous or ordinal (even if you are deciding to dichotomize it), so that you can evaluate linearity in the logit for that variable.
1. Describe the study for which your data were collected (no more than ½ page total):
a. Study design
b. Number of observations
c. Main purpose of the original study
d. How the data were originally collected
2. What is the topic of the investigation you will pursue with these data? Describe in general terms, e.g. “The influence of reproductive history on the risk of breast cancer”. Use no more than 2 sentences.
3. What variables will you be using? Specify how they were defined in the data. How were the data collected? Are the variables continuous, ordinal, categorical? If categorical, what are the categories? Are any data missing?
a. Outcome
b. Main exposure
c. Covariates (you’ll need at least 4, but no more than 10) (they should not all be highly correlated among themselves)
4. In one sentence, state the hypothesis you will be examining with these data. This operational hypothesis is different from the answer to #2 above, because it reflects the specific analyses you will be conducting, and your specific expectations regarding any associations observed in the data. Example: “Early age at menarche increases the risk of breast cancer.” The hypothesis also takes into account the limits of your data, based on what specific variables are available (reproductive history may include age at menarche, number of pregnancies, number of live births, length of breastfeeding, inter-pregnancy intervals, use of oral contraceptives, etc.). Also, state separately any subsidiary hypotheses you plan to address, e.g. dose-response relationship, interaction with other variables, etc. (No more than one sentence per hypothesis)
5. Describe the univariate distribution of the outcome variable and the main exposure variable. For binary or other categorical variables, give proportions, and briefly describe in words. For continuous variables, provide histograms or boxplots, present means and standard deviations, and briefly describe in words (symmetric or skewed, approximately normal or not, uni- vs. bi- or multi-modal).
1. Present and describe a crude bivariate analysis of the main exposure and outcome. Show a contingency table and report odds ratio or relative risk, and the confidence interval. If the main exposure is not dichotomous, choose at least one dichotomization. (Everyone’s outcome must be dichotomous)
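As an illustration of the crude analysis requested above, the following is a minimal sketch in Python (the course software is SAS; Python is used here only to make the arithmetic explicit). The function name `crude_measures`, the cell labels a, b, c, d, and the counts are all hypothetical. The confidence intervals use the usual large-sample standard errors on the log scale. Note that the relative risk is only meaningful when the design allows risks to be estimated (it is not appropriate for a case-control study).

```python
import math

def crude_measures(a, b, c, d, crit=1.96):
    """Crude OR and RR from a 2x2 table (hypothetical helper).

    a = exposed with outcome,   b = exposed without outcome
    c = unexposed with outcome, d = unexposed without outcome
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    risk_ratio = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

    def interval(estimate, se):
        # Interval is built on the log scale, then exponentiated.
        return (math.exp(math.log(estimate) - crit * se),
                math.exp(math.log(estimate) + crit * se))

    return {"or": odds_ratio, "or_ci": interval(odds_ratio, se_log_or),
            "rr": risk_ratio, "rr_ci": interval(risk_ratio, se_log_rr)}

# Made-up counts, for illustration only.
crude = crude_measures(a=40, b=60, c=20, d=80)
print("OR = %.2f, 95%% CI %.2f to %.2f" % (crude["or"], *crude["or_ci"]))
print("RR = %.2f, 95%% CI %.2f to %.2f" % (crude["rr"], *crude["rr_ci"]))
```

In your write-up, the numbers alone are not enough: state in words what the odds ratio or relative risk means for your exposure and outcome.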
2. Show that you have at least three covariates that are not too highly correlated (i.e. collinear) with your main exposure variable. You don’t have to examine all covariates – just show that three satisfy this requirement. If the exposure and covariate are continuous variables, use correlation coefficients. Statistical significance is not relevant: the magnitude of the correlation is. A high correlation is greater than 0.7. A moderate one is greater than 0.4. If the exposure and covariate are binary, or one is binary and the other categorical, examine odds ratios. Show contingency tables and note the direction of any association. A strong association would be an odds ratio of 5 or greater. If one of the exposure or covariate is continuous and the other categorical, examine the magnitude of the difference in means, divided by the square root of the pooled variance. A strong association is shown if this is greater than 2 (i.e. the means differ by more than 2 standard deviations).
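The collinearity checks described above for continuous variables can be sketched as follows. This is an illustration in Python under stated assumptions, not the required method: the function names and the data are hypothetical, and only the cut-offs (0.4, 0.7, and 2 standard deviations) come from the text above. For two binary variables, the odds ratio is computed from the contingency table in the usual way.

```python
import math
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson correlation coefficient for two continuous variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def label_correlation(r):
    """Apply the cut-offs given in the text: > 0.7 high, > 0.4 moderate."""
    size = abs(r)
    if size > 0.7:
        return "high"
    if size > 0.4:
        return "moderate"
    return "low"

def standardized_mean_difference(group1, group0):
    """Difference in means divided by the square root of the pooled variance."""
    n1, n0 = len(group1), len(group0)
    pooled_var = ((n1 - 1) * variance(group1) +
                  (n0 - 1) * variance(group0)) / (n1 + n0 - 2)
    return (mean(group1) - mean(group0)) / math.sqrt(pooled_var)

# Made-up data, for illustration only.
age = [30, 35, 40, 45, 50, 55]
bmi = [22, 27, 24, 29, 25, 30]
r = pearson_r(age, bmi)
print("r = %.2f (%s)" % (r, label_correlation(r)))

age_exposed = [50, 55, 60, 65]
age_unexposed = [45, 50, 55, 60]
smd = standardized_mean_difference(age_exposed, age_unexposed)
print("standardized difference = %.2f (strong if above 2)" % smd)
```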
3. Restate your main exposure, outcome, and hypothesis.
4. Univariate descriptions:
a. Briefly describe the distribution of each covariate. For example, if it’s binary or categorical, give proportions; if it’s continuous normal give mean and standard deviation; if it’s continuous non-normal then provide a description of the overall shape and give critical percentiles. Boxplots or histograms may be appropriate for some variables.
b. Indicate if you have made any changes in coding choices since Project Part 1.
c. Based on the above, indicate what decisions you will make on coding of categorical and continuous variables (one sentence per variable). (These decisions are not final, because your bivariate distributions may force you to reconsider.)
5. Bivariate relationships: using between 4 and 8 covariates, look at and describe bivariate distributions as specified below. [For a and b below: if both variables are binary, then your description would be, e.g., “Smokers were 2.2 times as likely to be married as non-smokers”; if both variables are continuous, use both graphical techniques (scatterplot) and simple linear regression (do not rely on the p-value for the slope); if one is continuous and the other is binary or categorical, compare means, standard deviations, and histograms or boxplots.] Confidence intervals and p-values are important and should be included, but consider strength of effect (strength of association) as your primary finding of interest.
a. The main exposure by each covariate: among all subjects, and separately among controls if it’s a case-control study.
b. The outcome by each covariate: among all subjects, and separately among unexposed (if your main exposure is continuous, define a relatively unexposed group which is large enough for this purpose).
c. Do the distributions in a and b above change your choices of coding for any variable(s)? If so, which one(s) and why?
d. Summarize your findings from a and b above in one or two simple tables. For the purposes of this table, you can exclude the associations among controls only (from a) and among the unexposed only (from b).
e. Based on the two bivariate relationships (exposure/confounder and confounder/outcome), which covariates are strong, possible, or unlikely candidates for confounding?
f. Can the exposure/confounder and confounder/outcome relationships tell you anything about possible effect modification? Explain why or why not.
6. Provide the crude exposure/outcome relationship (the crude bivariate analysis from Project Part 1). Now categorize each covariate and examine the main hypothesis across covariate values for each covariate.
a. Make a table showing the effect measures (OR, RR, etc.) at each category of each covariate, and the effect measure adjusted for each covariate.
b. Based on this table as well as the Breslow-Day tests for heterogeneity (in proc freq), explain which variables appear to be strong/clear effect modifiers, weak effect modifiers, or not effect modifiers.
c. For each covariate, evaluate confounding by comparing your crude exposure/outcome relationship to the exposure/outcome relationship adjusting for that covariate. Provide interpretations for each covariate, taking into account whether that covariate also appears to be an effect modifier.
d. If you have one or more apparent effect modifiers, how will you code the variables and the interaction term to make sense in a multivariate model? If you don’t have an effect modifier, explain hypothetically how this would be accomplished if you did have an effect modifier.
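The comparison of crude, stratum-specific, and adjusted odds ratios described in a through c above can be sketched as follows. This is an illustration in Python with made-up counts; the function names are hypothetical. The adjusted estimate shown is the Mantel-Haenszel summary odds ratio. The Breslow-Day test itself is not implemented here; obtain it from your statistical software.

```python
def odds_ratio(a, b, c, d):
    """OR from one 2x2 table: a, b = exposed with/without outcome;
    c, d = unexposed with/without outcome."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary OR across a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

def collapse(strata):
    """Add the strata together to recover the crude 2x2 table."""
    return tuple(sum(cells) for cells in zip(*strata))

# Made-up counts for one binary covariate (for example, non-smokers and smokers).
strata = [(10, 90, 20, 380),
          (60, 140, 15, 85)]

stratum_ors = [odds_ratio(*table) for table in strata]
crude_or = odds_ratio(*collapse(strata))
adjusted_or = mantel_haenszel_or(strata)

print("stratum-specific ORs:", [round(x, 2) for x in stratum_ors])
print("crude OR = %.2f, adjusted OR = %.2f" % (crude_or, adjusted_or))
print("crude / adjusted = %.2f" % (crude_or / adjusted_or))
```

In this made-up example the stratum-specific odds ratios are similar to each other but both well below the crude odds ratio, which is the pattern expected for a confounder that is not an effect modifier.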
7. Decide which variables (including which interaction terms) you now feel you would want to include in a (hypothetical) multivariate model. Clearly justify your decision about each variable, considering all of the following:
a. Your answers to 4, 5, and 6 above (the univariate, bivariate, and stratified analyses)
b. The aim of your study and how the results might be put to use.
c. The existing literature and any discrepancies between the existing literature and your findings in 4, 5, and 6 above. (I am not looking for a comprehensive review of the literature, but you should spend some time searching on PubMed or Medline to identify published papers relevant to your question. You will need this background to interpret your results from Project Parts 2 and 3 in the context of what has been found before. Make a statement about which variables are usually considered to be confounders, how these appear to operate in your dataset, and whether that raises any discrepancies you need to think about.)
8. For each ordinal or continuous exposure or covariate, evaluate the assumption of linearity in the logit, using dummy variables for equal-width categories. Determine if the assumption is upheld or violated, and clearly state what the implication will be for how you choose to code the variable in a multivariate model.
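One way to look at linearity in the logit is sketched below in Python, under stated assumptions. When a logistic model contains only the dummy variables for one categorized variable, each dummy coefficient equals the difference in observed log odds between that category and the reference category, so the check can be done from category counts. The data, the bin limits, and the function name are hypothetical. In a model with other covariates you would fit the dummy variables in your statistical software instead.

```python
import math

# Made-up data: age (category midpoint) -> (events, non-events).
made_up_counts = {45: (10, 90), 55: (20, 80), 65: (35, 65), 75: (55, 45)}
records = []
for age, (events, non_events) in made_up_counts.items():
    records += [(age, 1)] * events + [(age, 0)] * non_events

def log_odds_by_category(records, lower, width, n_bins):
    """Observed log odds of the outcome within equal-width categories.
    Requires at least one event and one non-event in every category."""
    events = [0] * n_bins
    non_events = [0] * n_bins
    for value, outcome in records:
        k = min(int((value - lower) // width), n_bins - 1)
        if outcome == 1:
            events[k] += 1
        else:
            non_events[k] += 1
    return [math.log(e / n) for e, n in zip(events, non_events)]

log_odds = log_odds_by_category(records, lower=40, width=10, n_bins=4)

# Dummy-variable coefficients relative to the lowest category.
dummy_betas = [value - log_odds[0] for value in log_odds]

# Change in log odds from each category to the next.
steps = [later - earlier for earlier, later in zip(log_odds, log_odds[1:])]

print("dummy coefficients:", [round(b, 2) for b in dummy_betas])
print("steps between adjacent categories:", [round(s, 2) for s in steps])
```

If the steps between adjacent categories are roughly equal, a single continuous term is plausible. If they differ markedly in size or change sign, consider keeping the dummy variables or choosing another coding.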
9. List the variables you have selected for your full model based on Project Part 2. Note which are expected to be interaction terms and potentially confounding variables, and indicate coding for each variable. If there are pairs or sets of variables that pose a potential collinearity problem, describe these and state how you will address this problem.
10. Are there any outliers in your data that you need to watch carefully with regard to their influence on your model? If so, describe these observations and in what way they are outliers. Observations may function as outliers in the context of a single variable distribution, or in the context of multiple variables as revealed in a 2x2 table or in a multivariable model.
11. Fit the “full model” in logistic regression. This is the single model that includes all your covariates and interaction terms. A word of caution: be careful about fitting all your interaction terms at the same time … the model may crash if there are too many. If you get a message like “quasi-complete separation of data – maximum likelihood cannot be calculated”, or something like that, then check for a variable that may be causing problems (huge standard error), remove that variable, and try again until you get a “full model” that does not crash.
a. Provide a table showing the full model with all the beta coefficients and their standard errors.
b. For each variable that is not involved in interactions, the odds ratio and confidence interval will be shown in the SAS results. Report and interpret these. Be careful: if you have coded your own interaction term, SAS does not know that it is an interaction, so the “odds ratios” for the main effects and the interaction term will be included in the SAS output along with those for the variables that are not involved in interactions. (If you tell SAS to code the interaction automatically, like “model y=x z x*z”, then the main effects x and z, and the interaction term x*z, will not appear as odds ratios in the SAS output.) If the variable is not binary (dichotomous), state in words what the quantitative relation is between that variable and the risk of the outcome (e.g., for each ounce of alcohol, the odds of disease are multiplied by 1.6; or, for each increase of one unit on the scale from 1 to 5 of increasing job satisfaction, the OR for myocardial infarction is 0.67).
c. For your main exposure, interpret the beta coefficient. If it is involved in an interaction, provide an odds ratio (main exposure/outcome relationship) for the referent group (the reference stratum, i.e. the group who are unexposed on the effect modifier) and describe who is in the referent group.
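The arithmetic behind b and c above can be sketched as follows. This is an illustration in Python; the coefficient values, variances, covariance, and function names are all hypothetical, and in practice they come from your fitted model and its covariance matrix. With an interaction, the log odds ratio for the exposure at level z of the effect modifier is the exposure coefficient plus z times the interaction coefficient, and its variance includes the covariance between the two coefficients.

```python
import math

def or_per_units(beta, units=1):
    """OR for an increase of `units` in a continuous or ordinal variable."""
    return math.exp(beta * units)

def exposure_or_at_level(b_x, b_xz, var_x, var_xz, cov_x_xz, z_level, crit=1.96):
    """OR (with CI) for exposure x at level z of the effect modifier,
    from a model with terms x, z, and x*z."""
    log_or = b_x + b_xz * z_level
    se = math.sqrt(var_x + z_level ** 2 * var_xz + 2 * z_level * cov_x_xz)
    return (math.exp(log_or),
            math.exp(log_or - crit * se),
            math.exp(log_or + crit * se))

# A coefficient of log(1.6) per ounce of alcohol multiplies the odds by 1.6
# per ounce, and by 1.6 squared for two ounces.
beta_alcohol = math.log(1.6)
print("OR per ounce = %.2f, per two ounces = %.2f"
      % (or_per_units(beta_alcohol), or_per_units(beta_alcohol, 2)))

# Made-up estimates for an exposure with a binary effect modifier.
estimates = dict(b_x=0.50, b_xz=0.40, var_x=0.04, var_xz=0.09, cov_x_xz=-0.03)
referent = exposure_or_at_level(z_level=0, **estimates)
other = exposure_or_at_level(z_level=1, **estimates)
print("referent group (z = 0): OR = %.2f, 95%% CI %.2f to %.2f" % referent)
print("other group (z = 1):    OR = %.2f, 95%% CI %.2f to %.2f" % other)
```

Note that the exponentiated exposure coefficient on its own applies only to the referent group (z = 0), which is why you need to describe who is in that group.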
12. Based only on the full model (question #11 above), what conclusions can you draw with respect to the following (one or two sentences per variable):
a. Which variables (main exposures or confounders) do not contribute to the prediction of the outcome?
b. Are interactions present or absent on the multiplicative scale? (Remember that logistic regression fits a multiplicative model, so when you test an interaction term in this model, you are testing it on the multiplicative scale) (My advice: (i) do not disregard any interaction with p-value < 0.2 [unless you have more than two interactions – and be especially careful if you have a small sample size], and (ii) calculate ORs for different categories of the relevant variables to see how/whether they differ. This is a substantive, quantitative issue, and should not depend on the p-value. For example, if a dichotomous variable for alcohol shows an OR in males of 1.7 and an OR in females of 2.3, this may not be important even if the interaction p-value is 0.04. On the other hand, if the alcohol variable has 3 or more ordinal levels, so that the OR for level 3 versus level 1 in males is 2.8 but over 6 for females, then it may be worth reporting even if the interaction p-value is 0.25.)
13. Model building
a. Decide what your goal is in model-building: best overall predictive model; most parsimonious model with good predictivity; or most unbiased estimate of exposure-disease relationship (note that these may be contradictory goals).
b. Describe a strategy for model-building or paring down. Include evaluations for alternative ways to look at the relationships, if appropriate (e.g. continuous variables versus unstructured coding [dummy variables] to allow evaluation of dose-response, or particular subgroups for separate models, etc.). Important: you must not use a canned stepwise regression procedure. You must control the choices yourself at every step. Consider: stratifying on potential interaction variables; alternate codings, where appropriate; the decisions you previously made regarding potentially collinear sets of variables; what variables can clearly be dropped and what will be the basis for such decisions; and removing outliers or extreme observations to see what effect that has on coefficients of interest.
14. Fit the models that implement the strategy outlined in question #13 above. To facilitate comparisons among models, construct a table resembling the one below. (This example compares a limited set of models; yours will include more. I expect that you will have lots of comparisons to make between different models, and I want you to fully explore different ways to code the variables, different findings when you include different covariates, any interaction terms and confounders, etc. You might, for example, have a whole set of models dealing with specific interactions alone. However, please try not to show more than 20 models.) Fill in the table with the coefficients and standard errors, and indicate what level of “confidence” you have chosen (p-value).
15. Interpret the models you have fit.
a. For nested models, examine the likelihood ratio test statistics and draw conclusions. For example, comparing the “full model” with “model 2”: the difference in deviance = (-2 log L[model 2]) - (-2 log L[full model]) = 0.25, say, which is distributed as χ² with 2 degrees of freedom and is not significant; therefore, model 2 can be considered as adequate as the full model.
b. For both nested and non-nested models, compare ORs across models, with particular attention to the main exposure(s). What conclusions can you draw regarding confounding and interactions?
c. Compare the modeling results with your crude, adjusted, and stratified estimates of the associations between exposure and outcome (Project Parts 1 and 2). Did the multivariate analysis substantially alter your conclusions from the preliminary analyses?
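The likelihood ratio comparison in a above can be sketched as follows, in Python using only the standard library. The -2 log L values are made up to reproduce the statistic of 0.25 on 2 degrees of freedom from the example, and the function name is hypothetical. The p-value function uses closed forms that hold only for 1 degree of freedom or an even number of degrees of freedom; for other cases use your statistical software.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x).
    Closed forms for df = 1 and for even df only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df > 0 and df % 2 == 0:
        half = x / 2
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise ValueError("only df = 1 or even df are handled in this sketch")

# Made-up -2 log L values for two nested models.
neg2_log_l_full = 412.30      # full model
neg2_log_l_model2 = 412.55    # model 2: full model minus two terms
df = 2                        # number of terms dropped

lr_statistic = neg2_log_l_model2 - neg2_log_l_full
p_value = chi_square_upper_tail(lr_statistic, df)
print("LR statistic = %.2f on %d df, p = %.2f" % (lr_statistic, df, p_value))
```

A large p-value here means that dropping the two terms did not significantly worsen the fit. It does not by itself show that the dropped terms are unimportant for confounding, so still compare the ORs for the main exposure across the two models as asked in b.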
16. What final model(s) would you put in a published report? Show model(s) with odds ratio and confidence interval for each predictor not involved in an interaction, and odds ratio for each subgroup if you do have an interaction. Justify your choices of final model(s). Do not choose more than 2 final models. If you choose more than one final model, make a clear statement of why both are informative. (This question refers to the model(s) you would show in full in the publication. Remember that you could still report, without showing the entire model, other information such as “coding C3 as a continuous variable, or including an interaction with C1, had no effect on the coefficient of the main exposure and did not result in a substantially better fit.”)
17. Write a structured abstract summarizing your main results. Include four sections, with a heading showing where each section begins: Introduction, Methods, Results, and Conclusions. Consider including a brief background on the data (sample size, where and when collected), objective of the investigation, statistical methods, major findings, and conclusions, implications, or recommendations (no more than 350 words total). Focus the abstract on the substance of your question and your hypothesis, and the final analyses and conclusions you came to. Do not focus your abstract on the step-by-step process you went through for this class. Instead, write it as if you were writing the abstract for a published paper describing your final results.