# I have a stats data mining assignment I want done

It has to be done in R or SAS. The deliverable is the code and a write-up.

GLMs and Data Mining Final Project

Due: Friday of Finals week

You are tasked with finding a sufficiently complex data set, framing your research question(s), performing the appropriate analyses, and presenting the process and results. This project is also an opportunity to practice writing a whitepaper. For data analysts like us, a whitepaper is a professionally written technical document thoroughly describing the way we solved an analytical problem. A well-done whitepaper accomplishes the concept of reproducibility: someone else should be able to reproduce your exact results given the contents of a whitepaper. I’ve outlined the things that are necessary (read: these are the MINIMUM requirements) in this paper. Again, this should be a professionally written (typed) document that you would feel comfortable turning in to your boss. Turn in a PDF.

The Problem: You are tasked with finding a sufficiently complex data set, framing your research question(s), performing the appropriate analyses, and writing up the process and results. I must sign off on this data set to ensure sufficient complexity. See the project proposal assignment on Blackboard, due April 15 (but feel free to turn it in earlier). In short, this whitepaper should convince me that you understand both GLMs and at least one of (trees, forests, clustering) and are able to apply the concepts and articulate them in practice. Your use of GLMs and machine learning methods should make sense in the context of the data and goals.

Your writeup should clearly and concisely include the following:

- Introduce the data and the research question. This should include a quality graphical exploration of your data. (At least 1 page)
- A complete GLM analysis
  - Compare multiple fully specified models in an appropriate way.
  - Specify your final model (as you would any GLM, using statistical notation). This should be accompanied by
    - A justification of your random component. This should include a discussion and specification of other potential random components and why you didn’t choose them (e.g., AIC/BIC/mean-variance, etc.).
    - A justification of your link function. Same logic as above: you need to compare with alternatives and justify your choice.
    - A justification of your linear predictor (which explanatory variables you are including in your final model), e.g., did you use your own hypotheses to motivate the included x variables? AIC on a subset of models? BIC? Random forest? (At least 1 page)
  - Walk through the steps of a lack-of-fit test and any potential follow-up steps this requires.
  - Perform a residual analysis (if appropriate).
  - Interpret coefficients in a way that is meaningful to stakeholders and discuss any relevant findings. (At least 1 page)
- An implementation of a machine-learning method. Examples of context include, but are not limited to:
  - Comparing machine learning to a GLM in forecasting/predictive modeling using accuracy/sensitivity/specificity
  - Using random forests to select the linear predictor
  - Using clustering to reduce data dimension and using the clusters in a GLM as a categorical variable
- Include an appendix with any and all code used. You’re welcome to use SAS or R or both.

You may want to format your document so that it has sections such as: Introduction & Data, Methods, Results, Conclusions, Appendix.
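To give bidders a rough idea of the GLM steps the assignment lists (comparing candidate models, a lack-of-fit test with a follow-up, and stakeholder-friendly coefficients), here is a minimal R sketch. It is only an illustration under assumptions: the built-in `warpbreaks` data stands in for the real project data set, which is not chosen yet, the formula `breaks ~ wool + tension` is a placeholder, and the negative binomial follow-up assumes the `MASS` package is installed.

```r
# Sketch only: warpbreaks (built into R) is a stand-in for the project data.
data(warpbreaks)

# Placeholder linear predictor; the real one must be justified in the write-up.
form <- breaks ~ wool + tension

# Same random component (Poisson), two candidate link functions.
fits <- list(
  poisson_log  = glm(form, family = poisson(link = "log"),  data = warpbreaks),
  poisson_sqrt = glm(form, family = poisson(link = "sqrt"), data = warpbreaks)
)

# Compare the candidates on AIC and BIC (lower is better).
comparison <- data.frame(
  model = names(fits),
  AIC   = sapply(fits, AIC),
  BIC   = sapply(fits, BIC),
  row.names = NULL
)
print(comparison[order(comparison$AIC), ])

# Deviance lack-of-fit test: residual deviance against a chi-square
# distribution on the residual degrees of freedom.
final <- fits$poisson_log
lof_p <- pchisq(deviance(final), df.residual(final), lower.tail = FALSE)
print(lof_p)

# Pearson dispersion estimate; values well above 1 suggest overdispersion.
dispersion <- sum(residuals(final, type = "pearson")^2) / df.residual(final)
print(dispersion)

# One possible follow-up: try a different random component (negative binomial).
# Assumes MASS is available; skipped otherwise.
if (requireNamespace("MASS", quietly = TRUE)) {
  nb <- MASS::glm.nb(form, data = warpbreaks)
  print(c(poisson = AIC(final), negbin = AIC(nb)))
}

# Rate ratios with Wald 95% intervals, easier to explain than log-scale betas.
rate_ratios <- exp(cbind(estimate = coef(final), confint.default(final)))
print(round(rate_ratios, 3))
```

A small lack-of-fit p-value together with a dispersion estimate well above 1 is the kind of result that would trigger the "follow-up steps" in the assignment, such as revisiting the random component. The actual project would repeat this on the approved data set and add the machine-learning comparison.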

