Course Overview
STT 864: Statistical Methods II
General information
Time: M-W, 12:40-2:00pm
Place: A220 Wells Hall
Instructor: Ping-Shou Zhong
Office Hours: M-W, 3:00-4:00pm at C418 Wells Hall, and by appointment
E-mail: pszhong@stt.msu.edu
References
- Faraway, J. (2005), Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Chapman and Hall/CRC.
- McCulloch, C. and Searle, S. (2001), Generalized, Linear, and Mixed Models, Wiley.
- Hardin, J. and Hilbe, J. (2007), Generalized Linear Models and Extensions, 2nd Edition, Stata Press.
Main topics
- Review of linear models
- Non-linear models
- Generalized linear models
- Linear mixed models
- Generalized linear mixed models
Laboratory
- Main purpose: demonstrate the practical implementation of the statistical methods and give you opportunities to analyze data in class.
- There will be four to five labs this semester.
- Lab location: B110G (tentative, TBA).
Homework
- Homework will typically be assigned on Wednesday.
- You will need R for some of the computation, so install R on your personal computer.
- You may discuss the homework with your classmates, but you must finish it independently.
- You are encouraged to use R Markdown.
Grading
- Homework: 40%
- Course Project: 30%
- Final Exam: 30%
Course overview: Linear models
- Linear models are used to study the relationship between a set of predictors and a response.
- In general, $Y$ denotes the response variable, which is the variable we would like to predict. $X = (X_1, X_2, \dots, X_p)^T$ are the predictors (covariates) used to predict $Y$.
- Typically, the response variable and the predictors are observed for $n$ subjects. Here $n$ is called the sample size.
Example: Beverage study data set
- This data set comes from a study conducted by Baty et al. (2006).
- The original purpose of the study was to measure the influence of beverages on blood gene expression, with the aim of exploring the mechanisms underlying the cardioprotective effects of beverages.
Some Fundamentals of Microarray Biology
[Figure: a cell, with the chromosome, nucleus, and DNA strands labeled]
DNA contains genes that code for proteins.
DNA → (transcription) → RNA → (translation) → protein
Proteins perform essential biological functions.
Microarray technology
- Microarrays allow researchers to measure the abundance of thousands of mRNA transcripts in multiple biological samples.
- By understanding how transcript abundance changes across experimental conditions, researchers gain clues about gene function and learn how genes work together to carry out biological processes.
Example: Beverage study data set
- Six healthy individuals participated in the experiment.
- Four different beverages (500 mL each) were evaluated in the study: grape juice, red wine, 40 g diluted ethanol, and water.
- Blood samples were taken after the individuals drank the beverages.
- Expression levels for 22,238 genes were measured from the blood samples.
Build a linear model for the beverage study data set
- What is the response?
- Gene expression is the response.
- What are the covariates/predictors?
- The types of beverage are the predictors/covariates.
Define response and predictors
- Response variable: $Y_i$ is the expression level of one particular gene, measured on the $i$-th individual.
- How should the covariates/predictors be defined?
- The types of beverage are the predictors/covariates.
- Let $X_i$ denote the beverage taken by the $i$-th individual. Categorical data are typically represented with dummy variables, that is, $X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4})^T$.
- $X_{i1}$ is the dummy variable for grape juice: $X_{i1} = 1$ if the $i$-th individual drinks grape juice and $X_{i1} = 0$ otherwise. Similarly, $X_{i2}$ is the dummy variable for red wine, $X_{i3}$ is the dummy variable for diluted ethanol, and $X_{i4}$ is the dummy variable for water.
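The dummy coding described above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code` and the assignment of beverages to individuals are hypothetical and are not taken from the study.

```python
# Beverage levels in the order used on the slide: X_i1, X_i2, X_i3, X_i4.
BEVERAGES = ["grape juice", "red wine", "diluted ethanol", "water"]

def dummy_code(beverage):
    """Return [X_i1, X_i2, X_i3, X_i4] for one individual's beverage."""
    if beverage not in BEVERAGES:
        raise ValueError(f"unknown beverage: {beverage!r}")
    return [1 if beverage == level else 0 for level in BEVERAGES]

# Hypothetical assignment of beverages to individuals, for illustration only.
assigned = ["grape juice", "water", "red wine", "diluted ethanol"]
X = [dummy_code(b) for b in assigned]

for beverage, row in zip(assigned, X):
    print(f"{beverage:16s} -> {row}")
```

Each row contains exactly one 1, which marks the beverage that individual received.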
Example: A linear model
A linear model for studying the relationship between beverages and gene expression is
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i, \quad i = 1, \dots, n,$$
where $\varepsilon_i$ is the measurement error, typically assumed to be normally distributed with mean 0 and unknown variance $\sigma^2$, and $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$ are unknown parameters.
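To see what the coefficients represent, recall that exactly one dummy variable equals 1 for each individual. Taking expectations in the model for two of the beverage groups gives the following (the conditioning notation is shorthand introduced here for illustration):

```latex
E(Y_i \mid \text{grape juice}) = \beta_0 + \beta_1, \qquad
E(Y_i \mid \text{water}) = \beta_0 + \beta_4,
\qquad\text{so}\qquad
E(Y_i \mid \text{grape juice}) - E(Y_i \mid \text{water}) = \beta_1 - \beta_4 .
```

Under this reading, differences between coefficients correspond to differences in mean expression between beverage groups.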
Linear models in a matrix form
- Let $Y = (Y_1, \dots, Y_n)^T$ be the $n \times 1$ response vector.
- Let $X_i = (1, X_{i1}, \dots, X_{i4})^T$ be the $5 \times 1$ predictor vector obtained from the $i$-th individual, and let $X = (X_1, \dots, X_n)^T$ be the $n \times 5$ design matrix.
- Let $\beta = (\beta_0, \dots, \beta_4)^T$ and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$.
- The linear model can then be written in matrix form as $Y = X\beta + \varepsilon$.
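As a rough sketch of the matrix form, the snippet below builds the $n \times 5$ design matrix for a small example and simulates $Y = X\beta + \varepsilon$ using only the Python standard library. The beverage assignment, the values of `beta` and `sigma`, and the helper names are all hypothetical and chosen purely for illustration.

```python
import random

BEVERAGES = ["grape juice", "red wine", "diluted ethanol", "water"]

def design_row(beverage):
    """Row X_i^T = (1, X_i1, X_i2, X_i3, X_i4) of the design matrix."""
    return [1] + [1 if beverage == level else 0 for level in BEVERAGES]

def mat_vec(X, beta):
    """Compute the matrix-vector product X beta with plain lists."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

# Hypothetical beverage assignment and parameter values (not from the study).
assigned = ["grape juice", "red wine", "diluted ethanol",
            "water", "water", "grape juice"]
beta = [7.0, 0.1, -0.2, 0.3, 0.0]
sigma = 0.05

X = [design_row(b) for b in assigned]      # n x 5 design matrix
mean = mat_vec(X, beta)                    # E(Y | X) = X beta

rng = random.Random(864)                   # fixed seed for reproducibility
eps = [rng.gauss(0.0, sigma) for _ in X]   # independent N(0, sigma^2) errors
Y = [m + e for m, e in zip(mean, eps)]     # Y = X beta + eps

# The four dummy columns add up to the intercept column in every row,
# so the columns of X are linearly dependent.
print(all(row[0] == sum(row[1:]) for row in X))
```

The final check illustrates a general point: with an intercept and a dummy for every level, the columns of $X$ are linearly dependent, so $\beta$ is not uniquely determined by $X\beta$. A common remedy is to drop one dummy (treating that beverage as a reference level) or to impose a constraint on the coefficients.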
Basic assumptions
- $Y \mid X$ is normally distributed;
- $E(Y \mid X) = X\beta$ is a linear function of $\beta$;
- $\varepsilon_1, \dots, \varepsilon_n$ are independent, that is, $Y_1, \dots, Y_n$ are independent.
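Taken together, and assuming the common error variance $\sigma^2$ from the linear model example, these assumptions can be summarized in a single distributional statement, where $I_n$ denotes the $n \times n$ identity matrix:

```latex
Y \mid X \sim N_n\!\left(X\beta,\ \sigma^2 I_n\right)
```

The diagonal covariance matrix $\sigma^2 I_n$ encodes both the common variance and the independence of $Y_1, \dots, Y_n$.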
Outline of the course
- Generalized linear models, which allow $Y \mid X$ to be binary, a count, or to follow other distributions, and which also allow $E(Y \mid X)$ to be a non-linear function of the unknown parameters.
- Non-linear models, which assume that $E(Y \mid X)$ is a non-linear function of the unknown parameters.
- Linear mixed models and generalized linear mixed models, which allow some dependence among the observations $Y_1, \dots, Y_n$.
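As one standard illustration of the first item, logistic regression for a binary response and log-linear (Poisson) regression for a count response both model a transformation of the mean as linear in $\beta$. Here $X$ denotes the predictor vector of a single observation; these two models are given as familiar examples, not as the specific models covered later in the course.

```latex
\text{Binary } Y:\quad
\log\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = X^T\beta
\quad\Longleftrightarrow\quad
E(Y \mid X) = \frac{\exp(X^T\beta)}{1 + \exp(X^T\beta)},
\\[6pt]
\text{Count } Y:\quad
\log E(Y \mid X) = X^T\beta
\quad\Longleftrightarrow\quad
E(Y \mid X) = \exp(X^T\beta).
```

In both cases $E(Y \mid X)$ is a non-linear function of $\beta$, even though the right-hand side $X^T\beta$ is linear.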
Example for generalized linear models
Consider a breast cancer study conducted by Richardson et al. (2006). The study aims to provide insight into the molecular pathogenesis of sporadic basal-like cancers (BLC), a distinct class of human breast cancers.
Forty-seven subjects participated in this study. For each patient, single nucleotide polymorphism (SNP) array data and microarray gene expression data were measured. The original data consist of 7 normal specimens, 2 BRCA-associated breast cancer specimens, 18 sporadic BLC specimens, and 20 non-BLC specimens.
Questions
If we would like to find out which genes are associated with BLC,
- what response should be used?
- what are the covariates?
- can we fit the relationship using a linear model?
Questions
If we would like to find out which genes are associated with all four types of breast cancer specimens,
- what response should be used?
- what are the covariates?
- can we fit the relationship using a linear model?
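One possible way to set up the responses for these two questions is sketched below. The specimen counts are those reported for the Richardson et al. (2006) data. The coding itself (an indicator for sporadic BLC, and an integer label for the four groups) is an assumption made for illustration, not necessarily the coding used in the course.

```python
# Specimen counts as reported for the Richardson et al. (2006) data.
counts = {
    "normal": 7,
    "BRCA-associated": 2,
    "sporadic BLC": 18,
    "non-BLC": 20,
}

# One label per specimen, in an arbitrary (hypothetical) order.
labels = [group for group, n in counts.items() for _ in range(n)]

# Assumed coding: Y_i = 1 for a sporadic BLC specimen, 0 otherwise.
y_binary = [1 if group == "sporadic BLC" else 0 for group in labels]

# Assumed coding for the four-group question: one category index per
# specimen. The integers are labels only, not numeric measurements.
levels = list(counts)
y_category = [levels.index(group) for group in labels]

print(len(labels), sum(y_binary), sorted(set(y_category)))
```

Under either coding the response takes only a few discrete values, so the normality assumption of the linear model cannot hold.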
Example for non-linear models
Let us consider the beverage study in more detail. In the experiment, for each individual and each beverage, blood samples were taken at baseline (0 hours, before drinking the beverage) and at 1, 2, 4, and 12 hours after the drink, which was consumed together with standardized nutrition. RNA from 120 samples (6 individuals × 4 beverages × 5 time points) was hybridized on Affymetrix microarrays.
[Figure: Gene expression profile for gene 1 in the alcohol group. Horizontal axis: hours (0 to 12). Vertical axis: gene expression for gene 1 for individuals with alcohol (roughly 6.8 to 7.2).]
Nonlinear relationship
- Consider time in hours as the covariate and the expression of gene 1 as the response.
- It appears that $E(Y \mid X)$ is not linear in $X$; that is, we cannot write $E(Y \mid X) = \beta^T X$.
- A non-linear regression may be more appropriate, that is, assume $E(Y \mid X) = g(X; \beta)$, where $g(X; \beta)$ is a non-linear function of $X$ and $\beta$.
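The idea of fitting $E(Y \mid X) = g(X; \beta)$ by least squares can be sketched as follows. The particular form of `g` is a hypothetical choice made for illustration, the responses are synthetic (generated from `g` itself with made-up parameters and no noise), and a crude grid search stands in for the iterative algorithms normally used in non-linear regression.

```python
import math
from itertools import product

def g(t, beta):
    """Hypothetical mean function, non-linear in both t and beta."""
    b0, b1, b2 = beta
    return b0 + b1 * t * math.exp(-b2 * t)

def rss(beta, times, ys):
    """Residual sum of squares for a candidate parameter vector."""
    return sum((y - g(t, beta)) ** 2 for t, y in zip(times, ys))

# Sampling times follow the study design; the responses are synthetic,
# generated from g with made-up parameters, purely for illustration.
times = [0, 1, 2, 4, 12]
true_beta = (7.0, 0.2, 0.5)
ys = [g(t, true_beta) for t in times]

# Crude least squares: evaluate the RSS on a small grid of candidates
# and keep the parameter vector with the smallest value.
grid = product([6.9, 7.0, 7.1], [0.1, 0.2, 0.3], [0.25, 0.5, 1.0])
best = min(grid, key=lambda beta: rss(beta, times, ys))
print(best)
```

Because the synthetic data are noise-free and the generating parameters lie on the grid, the search recovers them exactly. With real data the minimizer has no closed form in general, which is why iterative methods such as Gauss-Newton are used.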
Example for linear mixed models
Let us now examine the structure of the beverage study data set more carefully. The design of the experiment is illustrated in the following plot:
[Figure: design of the beverage study experiment]
Dependence among observations
- Consider the gene expression observed for the $j$-th gene at the $k$-th hour for the $i$-th individual, and denote it by $Y_{ijk}$.
- Observations of the same gene from the same individual at different hours are dependent; that is, $Y_{ijk}$, $k = 0, 1, 2, 4, 12$, are dependent on each other.
- Observations from the same individual for different genes are also dependent; that is, $Y_{ijk}$, $j = 1, 2, 3, 4, \dots$, are dependent.
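One standard way to build the first kind of dependence into a model is a random intercept for each individual and gene. The sketch below is illustrative: the symbols $x_{ijk}$, $\beta_j$, $b_{ij}$, and $\sigma_b^2$ are notation assumed here rather than taken from the slides, and $b_{ij}$ and $\varepsilon_{ijk}$ are assumed to be mutually independent.

```latex
Y_{ijk} = x_{ijk}^T \beta_j + b_{ij} + \varepsilon_{ijk},
\qquad b_{ij} \sim N(0, \sigma_b^2),
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2),
\\[6pt]
\mathrm{Cov}(Y_{ijk}, Y_{ijk'}) = \sigma_b^2 \quad \text{for } k \neq k'.
```

The shared term $b_{ij}$ makes repeated measurements of gene $j$ on individual $i$ positively correlated across hours, which is the kind of dependence a linear mixed model is designed to capture.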