Course Overview
STT 864: Statistical Methods II

General information
Time: M-W, 12:40-2:00pm
Place: A220 Wells Hall
Instructor: Ping-Shou Zhong
Office Hours: M-W, 3:00-4:00pm at C418 Wells Hall, and by appointment
E-mail: pszhong@stt.msu.edu

References
- Faraway, J. (2005), Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Chapman and Hall/CRC.
- McCulloch, C. and Searle, S. (2001), Generalized, Linear, and Mixed Models, Wiley & Sons.
- Hardin, J. and Hilbe, J. (2007), Generalized Linear Models and Extensions, 2nd Edition, Stata Press.

Main topics
- Review of linear models
- Non-linear models
- Generalized linear models
- Linear mixed models
- Generalized linear mixed models

Laboratory
- Main purpose: demonstrate the practical implementation of the statistical methods and give you opportunities to analyze data in class.
- There will be four to five labs this semester.
- Lab location: B110G (tentative, TBA).

Homework
- Homework will typically be assigned on Wednesdays.
- You will need to use R for some computations, so you will need to install R on your personal computer.
- You may discuss the homework with your classmates, but you must complete it independently.
- You are encouraged to use R Markdown.

Grading
- Homework: 40%
- Course Project: 30%
- Final Exam: 30%

Course overview: Linear models
- Linear models are used to study the relationship between a set of predictors and a response.
- In general, $Y$ denotes the response variable, which is the variable we would like to predict. $X = (X_1, X_2, \dots, X_p)^T$ are the predictors (covariates) used to predict the response variable $Y$.
- Typically, the response variable and the predictors are observed for $n$ subjects. Here $n$ is called the sample size.

Example: Beverage study data set
- This data set comes from a study conducted by Baty et al. (2006).
- The original purpose of the study was to measure the influence of beverages on blood gene expression.
- The study aimed to explore the underlying mechanisms of the cardioprotective effects of beverages.

Some Fundamentals of Microarray Biology
[Figure: a cell, with labels for the chromosome, the nucleus, and DNA strands]

DNA contains genes that code for proteins.
DNA -> (transcription) -> RNA -> (translation) -> protein
Proteins perform essential biological functions.

Microarray technology
- Microarrays allow researchers to measure the abundance of thousands of mRNA transcripts in multiple biological samples.
- By understanding how transcript abundance changes across experimental conditions, researchers gain clues about gene function and learn how genes work together to carry out biological processes.

Example: Beverage study data set
- Six healthy individuals participated in the experiment.
- Four different beverages (500mL each: grape juice, red wine, 40g diluted ethanol, water) were evaluated in the study.
- Blood samples were taken after the participants drank the beverages.
- Gene expression data for 22,238 genes were measured from the blood samples.

Build a linear model for the beverage study data set
- What is the response?
- Gene expression is the response.
- What are the covariates/predictors?
- The types of beverage are the predictors/covariates.

Define response and predictors
- Response variable: $Y_i$ is the gene expression measurement for one particular gene obtained from the $i$-th individual.
- How should we define the covariates/predictors?
- The types of beverage are the predictors/covariates.
- Let $X_i$ denote the beverage taken by the $i$-th individual. Typically, we use dummy variables to represent categorical data. That is, $X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4})^T$.
- $X_{i1}$ is the dummy variable for grape juice: $X_{i1} = 1$ if the $i$-th individual drinks grape juice, and $X_{i1} = 0$ otherwise. $X_{i2}$ is the dummy variable for red wine; $X_{i3}$ is the dummy variable for diluted ethanol; $X_{i4}$ is the dummy variable for water.

Example: A linear model
A linear model for studying the relationship between beverages and gene expression is
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i, \quad i = 1, \dots, n,$$
where $\varepsilon_i$ is measurement error. The measurement error $\varepsilon_i$ is typically assumed to be normally distributed with mean 0 and unknown variance $\sigma^2$.
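
As a minimal sketch of the dummy coding and the mean of the linear model above, the following Python snippet encodes a beverage as $(X_{i1}, X_{i2}, X_{i3}, X_{i4})$ and evaluates $\beta_0 + \beta_1 X_{i1} + \dots + \beta_4 X_{i4}$. The function names and the coefficient values are hypothetical, chosen only for illustration; they are not estimates from the beverage study.

```python
# Beverage levels in the order used for the dummy variables X_i1, ..., X_i4.
BEVERAGES = ["grape juice", "red wine", "diluted ethanol", "water"]


def dummy_code(beverage):
    """Return [X_i1, X_i2, X_i3, X_i4] for the beverage taken by one individual."""
    if beverage not in BEVERAGES:
        raise ValueError(f"unknown beverage: {beverage!r}")
    return [1 if beverage == level else 0 for level in BEVERAGES]


def covariate_row(beverage):
    """Return [1, X_i1, ..., X_i4]; the leading 1 multiplies the intercept beta_0."""
    return [1] + dummy_code(beverage)


def mean_response(beta, beverage):
    """Evaluate beta_0 + beta_1*X_i1 + ... + beta_4*X_i4 for one beverage."""
    return sum(b * x for b, x in zip(beta, covariate_row(beverage)))


# Hypothetical coefficients (beta_0, ..., beta_4), for illustration only.
beta = [7.0, 0.1, 0.2, -0.1, 0.0]

for beverage in BEVERAGES:
    print(beverage, covariate_row(beverage), mean_response(beta, beverage))
```

Note that the four dummies always sum to 1, which equals the intercept entry, so the five coefficients are not uniquely determined by the data without a constraint; software typically drops one dummy or imposes a constraint when fitting.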
$\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$ are unknown parameters.

Linear models in a matrix form
- Let $Y = (Y_1, \dots, Y_n)^T$ be the $n \times 1$ response vector.
- Let $X_i = (1, X_{i1}, \dots, X_{i4})^T$ be the $5 \times 1$ predictor vector obtained from the $i$-th individual. Let $X = (X_1, \dots, X_n)^T$ be the $n \times 5$ design matrix.
- Let $\beta = (\beta_0, \dots, \beta_4)^T$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$.
- The linear model can then be written in matrix form as
$$Y = X\beta + \varepsilon.$$

Basic assumptions
- $Y \mid X$ is normally distributed;
- $E(Y \mid X) = X\beta$ is a linear function of $\beta$;
- $\varepsilon_1, \dots, \varepsilon_n$ are independent; that is, $Y_1, \dots, Y_n$ are independent.

Outline of the course
- Generalized linear models, which allow $Y \mid X$ to follow binary, count, and other distributions, and also allow $E(Y \mid X)$ to be a non-linear function of the unknown parameters.
- Non-linear models, which assume $E(Y \mid X)$ is a non-linear function of the unknown parameters.
- Linear mixed models and generalized linear mixed models. These models allow some dependence among the observations $Y_1, \dots, Y_n$.

Example for generalized linear models
Consider a breast cancer study conducted by Richardson et al. (2006). The study aims to provide insight into the molecular pathogenesis of sporadic basal-like cancers (BLC), a distinct class of human breast cancers. Forty-seven subjects participated in this study. For each patient, single nucleotide polymorphism (SNP) array data and microarray gene expression data were measured. The original data consist of 7 normal specimens, 2 BRCA-associated breast cancer specimens, 18 sporadic BLC specimens, and 20 non-BLC specimens.

Questions
If we would like to find out which genes are associated with BLC,
- What response should be used?
- What are the covariates?
- Can we fit them using linear models?

Questions
If we would like to find out which genes are associated with all four types of breast cancer specimens,
- What response should be used?
- What are the covariates?
- Can we fit them using linear models?

Example for non-linear models
Let us consider the beverage study in more detail. In the experiment, for each individual and each beverage, blood samples were taken at baseline (0 hours, before drinking the beverage) and at 1, 2, 4, and 12 hours after the drink, which was taken together with standardized nutrition. RNA from the 120 samples was hybridized on Affymetrix microarrays.

[Figure: Gene expression profile for gene 1 in Alcohol group. Horizontal axis: hours (0 to 12). Vertical axis: gene expression for gene 1 for individuals with Alcohol (approximately 6.8 to 7.2).]

Nonlinear relationship
- Consider time in hours as the covariate and the gene expression for gene 1 as the response.
- It appears that $E(Y \mid X)$ is not linear in $X$. That is, we cannot write $E(Y \mid X) = \beta^T X$.
- A non-linear regression may be more appropriate. That is, we assume $E(Y \mid X) = g(X; \beta)$, where $g(X; \beta)$ is a non-linear function of $X$ and $\beta$.

Example for linear mixed models
Let us now examine the data structure of the beverage study data set more carefully. The design of the experiment can be illustrated in the following plot:
[Figure: design of the beverage study experiment]

Dependence among observations
- Consider the gene expression measurement for the $j$-th gene at the $k$-th hour for the $i$-th individual. Denote it by $Y_{ijk}$.
- Observations of the same gene from the same individual at different hours are dependent. That is, $Y_{ijk}$, $k = 0, 1, 2, 4, 12$, are mutually dependent.
- Observations of different genes from the same individual are also dependent. That is, $Y_{ijk}$, $j = 1, 2, 3, 4, \dots$, are dependent.
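
The dependence among repeated measurements can be illustrated with a small simulation. This is only a sketch under an assumption the slides do not state: each individual contributes a shared random shift `b_i` to all of their measurements of one gene, plus independent noise at each hour. The function names, the standard deviations, and the mean level 7.0 are hypothetical.

```python
import random

HOURS = [0, 1, 2, 4, 12]  # sampling times from the beverage study


def simulate_individual(rng, sd_individual, sd_noise, mean=7.0):
    """Simulate one individual's measurements of one gene at every hour.

    Assumption: the same shift b_i is added at all hours, which is what
    makes measurements from the same individual dependent.
    """
    b_i = rng.gauss(0.0, sd_individual)
    return [mean + b_i + rng.gauss(0.0, sd_noise) for _ in HOURS]


def correlation(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def hour_correlation(sd_individual, sd_noise, n_individuals=5000, seed=864):
    """Correlation, across simulated individuals, between hour 1 and hour 12."""
    rng = random.Random(seed)
    profiles = [
        simulate_individual(rng, sd_individual, sd_noise)
        for _ in range(n_individuals)
    ]
    hour1 = [p[HOURS.index(1)] for p in profiles]
    hour12 = [p[HOURS.index(12)] for p in profiles]
    return correlation(hour1, hour12)


# With a shared individual effect, the two hours are strongly correlated.
r_dependent = hour_correlation(sd_individual=0.2, sd_noise=0.1)
# Without it (standard deviation 0), the measurements are independent.
r_independent = hour_correlation(sd_individual=0.0, sd_noise=0.1)

print("with shared individual effect:", round(r_dependent, 3))
print("without shared individual effect:", round(r_independent, 3))
```

Under this assumed structure the correlation between any two hours is the share of the total variance due to the individual effect, which is why a model that treats $Y_1, \dots, Y_n$ as independent would be inappropriate for such data.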