Biostatistics for Health Data
19.577 - Spring 2012
Dr. Manuel Cifuentes
Wednesday, 6 to 9 pm
Credits: 3
Location: Kitson 208
Schedule: Wednesday, 6:00 to 8:50 p.m.
Office: Kitson Hall – 200-Q - Wednesdays 4 - 6 p.m. and per request.
Telephones: 978-934-3271 (Cifuentes) – (603) 274-9419 (Cell phone for emergencies)
Email: Manuel_Cifuentes@uml.edu (PLEASE put "19.577 Spring 2012" at the beginning of the subject line of any email about the course.)

Course Description

This is a graduate-level course in basic statistical techniques to be used in, but not limited to, health research. The purpose of this course is to train students to develop a consistent statistical approach to solving quantitative research problems of mild complexity. This approach begins with understanding the statistical logic (not necessarily the mathematics) underlying quantitative research questions, continues with becoming skillful in using statistical software to perform the mathematical work, and concludes with knowing how to interpret and explain the numerical results.

Using the type of variable measurement as the main criterion, the following statistical models will be studied: linear regression, ANOVA, Chi square, logistic regression, non-parametric methods, and general linear models. Emphasis will be placed on the interpretation of regression coefficients, odds ratios, and intercepts. There will be an introduction to effect modification and confounding using multivariate regression.

As with many other areas of knowledge, statistics is subject to the "use it or lose it" law. This law has deep neurophysiological correlates that can be used as excuses to forget almost everything about statistics. Therefore, rather than asking you to remember a formula or a computational process, you will be asked to organize yourself in a way that will allow you, at a later time, to relearn what you have forgotten about statistics and to learn new and interesting statistical tools.
Class organization

The class meets for three hours once a week. In case of severe weather or other disasters that might result in cancellation of class, please check your email for a message from Manuel Cifuentes and/or call Manuel Cifuentes at the phone number above; a voice mail message will provide information.

To achieve the course goals, we will simultaneously use the book, SPSS, lecturing, and discussion. Be prepared to show how much you do not know and to be fully respectful of others' need and desire to learn. Between classes you will read the book and use SPSS, and you will therefore come fully prepared for classes. Expect to need 2 to 3 hours of study for each hour of class, so you may need as much as 9 hours per week working by yourself or with study partners before and/or after classes.

Our classes, except the first one, will be divided into two blocks. One week of readings and exercises will be used for the first block of each class. The first block of every class will be group work. The instructor will form different groups of students every class. Each group will review and grade the other group members' homework due for that day. A perfect homework will have a score of 7. A perfect homework handed in after 6:05 PM will have a score of 6. Any minor problem or mistake will deduct one point; a major problem will deduct two points. There will be no half points. Participation and questioning are encouraged. You are also encouraged to appeal for a better grade, after you have been graded in class, using the best of your statistical knowledge.

It is absolutely forbidden to talk about this grading activity outside of class time. We are going to be as candid as we can with each other. Therefore, privacy and confidentiality must be protected. We will respect each other so that we can learn from our mistakes.
During the second block, usually after a 10-minute break, the instructor will introduce the new topic and give a lecture/demonstration. Some concepts will be explained to prepare you for a better understanding of the reading material. The general logic and the statistical assumptions of the statistical model will be discussed.

Student Responsibilities

Attendance at all classes is expected. If you will be absent, please notify the instructors in advance.

Homework assignments are due PRINTED or HANDWRITTEN by 6:00 PM on the dates noted on the class schedule.

Readings are assigned and are strongly recommended. Students are expected to learn the assigned textbook material, even if it is not explicitly presented in class.

“All students are advised that there is a University policy regarding dishonesty and cheating. It is the students’ responsibility to familiarize themselves with this policy.”

Students should notify the instructors in advance about any potential conflicts between their religious observance and course due dates/examinations. When a conflict occurs, the instructors and student will work out, in advance, a reasonable alternative.

Instructor’s Responsibilities

In addition to organizing and presenting material, preparing and re-grading homework, and preparing and grading exams, I am available for help sessions with students. If it is not possible to come during office hours, please call or email for an appointment. Email questions are encouraged, and I will get back to you as soon as I can. If you will be making a special trip to campus to see me, please phone ahead to reserve time.

I find that it is often useful to respond to emailed questions about course content by sending the response to the entire class. This way you can all benefit from the inquiries of other class members. Of course, if the question is a personal one, I will respond only to the questioner.
Software and Textbooks

Required: We have chosen a friendly statistical package and a friendly textbook as companions for this semester. The software is SPSS, which is widely available on the UMASS Lowell campus, and the book is Discovering Statistics Using SPSS by Andy Field (SAGE, 3rd edition, hardcover, 822 pages).

Supplementary: A Handbook of Statistical Analyses using SPSS by Sabine Landau and Brian S. Everitt (PDF provided by the instructor)

Grading

Attendance, class participation: 25%
HW: 25%
Midterm: 25%
Final: 25%

Graduate students are required to maintain a B average at all times, and in addition cannot receive more than two grades of B/C or C. Please pay attention to the drop dates, and talk to me before these dates are reached if you have any doubts about how you are doing in the class.

It may be useful for you to know our own interpretations of letter grades, so that you know what to expect by way of grading:

A: excellent. Student has mastered the material completely. There is essentially no improvement possible.
A-: very good. Student has mastered the material to a high degree. Only minor mistakes were made, or minor room for improvement is evident.
B+/B: good. Student has mastered the material to an acceptable degree. There is room for improvement, but all the essential mastery has been demonstrated.
B-/C: poor. Student has mastered only some of the material, and there are serious gaps in demonstrated mastery. This is not an acceptable level of performance for a continuing graduate student.
F: unacceptable. Student has not shown even a minimum level of learning.

Course goal

Using SPSS, the student should be able to adequately check assumptions for, perform, and interpret linear regression, logistic regression, ANOVA, Chi square, general linear, and non-parametric models. The student will master the interpretation of intercepts, regression coefficients, and odds ratios and will be able to control for confounding and effect modifiers.
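To give a flavor of what interpreting an odds ratio involves, here is a minimal sketch written in Python rather than SPSS (the course software). The 2x2 counts and the function names are hypothetical and chosen only for illustration. The sketch relies on the standard result that, for a single binary predictor, the logistic regression coefficient equals the natural logarithm of the odds ratio.

```python
import math


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from the four cells of a 2x2 table (hypothetical helper)."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


def coefficient_from_odds_ratio(or_value):
    """Logistic regression coefficient for one binary predictor: ln(OR)."""
    return math.log(or_value)


if __name__ == "__main__":
    # Made-up counts: 20 of 100 exposed and 10 of 100 unexposed have the outcome.
    or_value = odds_ratio(20, 80, 10, 90)
    b = coefficient_from_odds_ratio(or_value)
    print(f"odds ratio = {or_value:.2f}")   # prints "odds ratio = 2.25"
    print(f"coefficient = {b:.3f}")         # prints "coefficient = 0.811"
```

With these made-up counts, the odds of the outcome in the exposed group are 2.25 times the odds in the unexposed group. An odds ratio of 1 (coefficient of 0) would indicate no association.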
Course Objectives

At the end of the course, students will be able to:

1. Understand how to classify variables according to their measurement level (continuous, categorical)
2. Understand the concept of variable role: (predictor, independent variable, cause) and (outcome, dependent variable, effect)
3. Perform basic data cleaning and data management using SPSS
4. Find the best statistical model to study the association between two variables based on measurement level and variable role (previous point)
5. Use SPSS to check assumptions of and run correlations and simple linear regression
6. Use SPSS to check residuals in linear regression
7. Understand how to interpret correlation and linear regression results
8. Use SPSS to check assumptions of and run one-way ANOVA
9. Understand how to interpret one-way ANOVA results
10. Use SPSS to check assumptions of and run Chi square
11. Understand how to interpret Chi square results
12. Use SPSS to check assumptions of and run logistic regression
13. Understand how to interpret logistic regression results
14. Use SPSS to check assumptions of and run general linear models
15. Understand how to interpret general linear model results
16. Perform basic non-parametric tests using SPSS
17. Control for confounding and determine the presence of effect modification
18. Interpret intercepts
19. Interpret regression coefficients
20. Interpret odds ratios
21. Not be afraid of statistics

19.577 Spring 2012 Class Schedule

Homework is due by 6:05 PM on the day of the class under which it is listed. Homework numbers refer to the assignment list that follows the schedule.

Class 1. January 25*
Content: Review of the basics of statistics
Readings related to the class: Chapters 1 and 2
Homework due: No homework due.

Class 2. February 1
Content: Using SPSS
Readings related to the class: Chapters 3 and 4
Homework due: 0

Class 3. February 8
Content: Using SPSS
Readings related to the class: Chapter 5
Homework due: 1

Class 4. February 15
Content: Correlation
Readings related to the class: Chapter 6. Skip 6.5.5.
Homework due: 2

Class 5. February 22
Content: Linear regression
Readings related to the class: Chapter 7 – up to 7.6 complete
Homework due: 3

Class 6. February 29*
Content: Linear regression
Readings related to the class: Chapter 7 – 7.7 to 7.10
Homework due: 4

Class 7. March 7
Content: ANOVA
Readings related to the class: Chapter 9 (recommended), Chapter 10 (required)
Homework due: 5

March 14
Spring break. No classes.

Class 8. March 21
Content: Midterm exam

Class 9. March 28
Content: Logistic regression I
Readings related to the class: Chapter 8 – up to 8.5
Homework due: 6

Class 10. April 4
Content: Logistic regression II
Readings related to the class: Chapter 8 – complete
Homework due: 7

Class 11. April 11
Content: Introduction to General Linear Models and multivariate analysis I
Readings related to the class: Chapter 7 – 7.7, 7.8; Chapter 11
Homework due: 8

Class 12. April 18
Content: Introduction to General Linear Models and multivariate analysis II
Readings related to the class: TBA
Homework due: 9

Class 13. April 25
Content: Centering predictors; meaningful intercept. Review (questions and answers)
Readings related to the class: TBA
Homework due: 10

Class 14. May 2
Content: Non-parametric analysis and categorical data
Readings related to the class: Chapter 15; Chapter 18 – up to 18.5
Homework due: 11

Class 15. May 9
Content: Final exam

Homework assignments

0. Quizzes on pages 29 and 59-60.

1. Data management A. Use database “cars.sav” and describe with charts the following variables: mpg, weight, accel, year, origin, cylinder. Explain why you selected each chart and interpret each chart.

2. Data management B. Use database “General Social Survey.sav” and check normality (overall and by sex) and homogeneity of variance (by sex) in the following variables: age, educ, prestg80. Explore whether transformations produce normality and/or homogeneity of variance. Describe and interpret your findings.

3. Correlation. Use database “World95.sav” and compute the appropriate correlation (check assumptions!) among the following variables (all combinations): populatn, density, babymort, lifeexpm, lifeexpf, religion, and pop_incr. Interpret the results and their statistical significance.

4. Linear regression. Use database “Employee data.sav” and predict salary using salbegin. Check assumptions and interpret the intercept, regression coefficient, and model fit. Look for influential cases and outliers.

5. Linear regression. Use database “Employee data.sav” and predict salary using prevexp. Add a second predictor, salbegin; check assumptions again; and look for influential cases, outliers, and multicollinearity. Interpret the intercept, regression coefficients, and model fit. Compare residuals, regression coefficients, and model fit for both models. Would you prefer a model with one or two predictors? Which predictor(s)? Why?

6. ANOVA. Use database “University of Florida graduate salaries.sav” and compare the mean initial salary across genders, across colleges, and across graduation dates. Check assumptions. Describe and interpret your results.

7. Logistic regression. Use database “University of Florida graduate salaries.sav” and dichotomize salaries at the median. Predict above-median salary using gender, college, and graduation date, each in a separate model. Check assumptions. Describe and interpret the odds ratios.

8. Logistic regression. Use database “Employee data.sav” and dichotomize salary at the highest tertile. Predict highest-tertile salary with the predictors salbegin and prevexp, first separately (two regressions with one predictor each) and then simultaneously (one regression with two predictors). Check assumptions. Describe, interpret, and compare odds ratios across models.

9. Running General Linear Models. Use database “Employee data.sav” and predict salary using prevexp. Add a second predictor, salbegin. Interpret the intercept, regression coefficients, and model fit. Compare with the same analysis using linear regression (homework 5). Add a third predictor (job category). Interpret the intercept, regression coefficients, and model fit. How would you run the analysis with these three simultaneous predictors (prevexp, salbegin, and jobcat) using multivariate least squares linear regression?

10. Confounding and effect modification. Use database “World95.sav” and determine whether religion is a confounder, an effect modifier, or both for the association between the predictor lit_fema and the outcome lifeexpf. Describe and interpret the regression coefficients and the intercept.

11. Multivariate analysis and intercept interpretation. Use database “World95.sav” and determine whether religion is a confounder, an effect modifier, or both for the association between the predictor lit_fema and the outcome lifeexpf. Center the continuous predictor and run the analysis again. Compare the models with the non-centered and the centered continuous predictor (describe and interpret the regression coefficients, the intercept, and the model fit). Explain the differences, if there are any.
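As a rough illustration of why centering matters for the intercept, here is a minimal sketch in Python rather than SPSS (the course software). It fits a least-squares line with one predictor, then refits after subtracting the predictor's mean. The data values and the names `literacy`, `life_expectancy`, and `fit_line` are made up for illustration and are not taken from “World95.sav”. The homework model, which also involves religion, is more complex than this one-predictor case.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def center(x):
    """Subtract the mean so that the centered predictor has mean zero."""
    mean_x = sum(x) / len(x)
    return [xi - mean_x for xi in x]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    literacy = [20, 40, 60, 80, 100]         # percent literate
    life_expectancy = [50, 58, 63, 72, 77]   # years

    b0, b1 = fit_line(literacy, life_expectancy)
    c0, c1 = fit_line(center(literacy), life_expectancy)

    # Raw predictor: the intercept is the predicted outcome at literacy = 0,
    # a value outside the observed range.
    print(f"raw:      intercept = {b0:.2f}, slope = {b1:.2f}")
    # Centered predictor: the intercept is the predicted outcome at the mean
    # literacy, which equals the mean outcome; the slope does not change.
    print(f"centered: intercept = {c0:.2f}, slope = {c1:.2f}")
```

In this sketch the slope and the model fit are identical in both versions; only the intercept changes, from a prediction at a predictor value of zero to a prediction at the average predictor value.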