Advanced Statistical Methods:
Beyond Linear Regression
John R. Stevens
Utah State University
Notes 1. Case Study Data Sets
Mathematics Educators Workshop
28 March 2009 http://www.stat.usu.edu/~jrstevens/pcmi
Why this workshop?
Me …
Outreach mission of USU
Recruitment – undergraduate & graduate
Too much fun
You …
Outline
Notes 1: Case Study Data sets
1. Challenger Explosion
2. Beetle Fumigation
3. T-cell Cancer
Notes 2: Statistical Methods I
Logistic Regression – incl. Separation of Points
EM Algorithm
Notes 3: Statistical Methods II
Tests for Differential Expression
Multiple hypothesis testing
Visualization
Machine Learning
Notes 4: Computer Implementation
(Notes 5): Bonus Material
Case Study 1: Challenger
January 28, 1986 explosion prompted the Presidential Commission on the Space Shuttle Challenger Accident
The Commission's 1986 report attributed the explosion to a burn-through of an O-ring seal at a field joint in one of the solid-fuel rocket boosters
After each of the previous 24 launches, the solid rocket boosters were inspected, and the presence or absence of damage to the field joint was noted
Challenger Data
Motivating question:
What was so different on the 25th launch?
Obs  Flight  Temp (°F)  Damage
1    STS1    66         NO
2    STS2    70         YES
3    STS3    69         NO
4    STS4    80         (missing)
5    STS5    68         NO
6    STS6    67         NO
7    STS7    72         NO
8    STS8    73         NO
9    STS9    70         NO
10   STS41B  57         YES
11   STS41C  63         YES
12   STS41D  70         YES
13   STS41G  78         NO
14   STS51A  67         NO
15   STS51C  53         YES
16   STS51D  67         NO
17   STS51B  75         NO
18   STS51G  70         NO
19   STS51F  81         NO
20   STS51I  76         NO
21   STS51J  79         NO
22   STS61A  75         YES
23   STS61B  76         NO
24   STS61C  58         YES
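As a first look at the motivating question, the flight records can be summarized by damage status. The following is a minimal Python sketch, not part of the original notes: the records are transcribed from the table (STS4 is left out because no damage entry is listed for it), and the function name `summarize` and the 65-degree `cutoff` are illustrative assumptions.

```python
from statistics import mean

# (flight, launch temperature in deg F, field-joint damage noted?)
# Transcribed from the Challenger table; STS4 is left out because
# no damage entry is listed for it.
flights = [
    ("STS1", 66, False), ("STS2", 70, True), ("STS3", 69, False),
    ("STS5", 68, False), ("STS6", 67, False), ("STS7", 72, False),
    ("STS8", 73, False), ("STS9", 70, False), ("STS41B", 57, True),
    ("STS41C", 63, True), ("STS41D", 70, True), ("STS41G", 78, False),
    ("STS51A", 67, False), ("STS51C", 53, True), ("STS51D", 67, False),
    ("STS51B", 75, False), ("STS51G", 70, False), ("STS51F", 81, False),
    ("STS51I", 76, False), ("STS51J", 79, False), ("STS61A", 75, True),
    ("STS61B", 76, False), ("STS61C", 58, True),
]


def summarize(records, cutoff=65):
    """Compare damaged and undamaged flights.

    `cutoff` is an arbitrary split point chosen for illustration only.
    """
    damaged = [temp for _, temp, dmg in records if dmg]
    undamaged = [temp for _, temp, dmg in records if not dmg]
    cold = [dmg for _, temp, dmg in records if temp < cutoff]
    warm = [dmg for _, temp, dmg in records if temp >= cutoff]
    return {
        "mean_temp_damaged": mean(damaged),
        "mean_temp_undamaged": mean(undamaged),
        "cold_damaged": sum(cold),
        "cold_total": len(cold),
        "warm_damaged": sum(warm),
        "warm_total": len(warm),
    }


summary = summarize(flights)
for key, value in summary.items():
    print(key, value)
```

With this split, all 4 flights launched below 65 °F show damage, compared with 3 of the 19 warmer flights.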
Case Study 2: Beetle Fumigation – Rhyzopertha dominica
(Image courtesy Clemson University – USDA Cooperative Extension Slide Series, www.insectimages.org)
Motivation
Beetle: lesser grain borer
A primary pest of stored grain
A year-round problem in moderate climates
Australian grain industry: $6–8 billion
Zero tolerance for insect-infested grain
Phosphine fumigant for control
Some beetles have developed resistance levels more than 235 times greater than normal
(UQ News Online, 18 Oct. 1999)
Experimental Background
Two DNA markers linked to resistance
rp6.79: two genotypes: –,+
rp5.11: three genotypes: B,H,A
Motivating question:
What contributes to the degree of resistance?
A mixture of the six beetle genotypes was exposed to various concentrations of fumigant (48 hours)
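The six genotypes are the combinations of the two rp6.79 genotypes with the three rp5.11 genotypes. A small Python sketch (the variable names are illustrative, not from the original notes) enumerates them in the order used for the data table's column labels:

```python
from itertools import product

rp679 = ["-", "+"]        # two genotypes at marker rp6.79
rp511 = ["B", "H", "A"]   # three genotypes at marker rp5.11

# Combine one genotype from each marker into a label such as "-/B".
genotypes = [f"{a}/{b}" for a, b in product(rp679, rp511)]
print(genotypes)  # ['-/B', '-/H', '-/A', '+/B', '+/H', '+/A']
```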
Experimental Data
Phosphine   Total       Total    Total       Observed survivors at genotype
dosage      receiving   deaths   survivors   -/B   -/H   -/A   +/B   +/H   +/A
(mg/L)      dosage
0           98          0        98          31    27    10    6     20    4
0.003       100         16       84          18    26    10    6     20    4
0.004       100         68       32          10    ?     ?     ?     ?     ?
0.005       100         78       22          1     ?     ?     ?     ?     ?
0.01        100         77       23          0     ?     ?     ?     ?     ?
0.05        300         270      30          0     0     0     5     20    5
0.1         400         383      17          0     0     0     0     10    7
0.2         750         740      10          0     0     0     0     0     10
0.3         500         490      10          0     0     0     0     0     10
0.4         500         492      8           0     0     0     0     0     8
1.0         7,850       7,806    44          0     0     0     0     0     44
Total       10,798      10,420   378
Cells marked ? cannot be aligned from the source layout. The unplaced counts are 4, 3, 7, 5, 2, 7, 6, 4, 4, 2 (rows 0.004 and 0.005) and 1, 9, 8, 5, 0 (row 0.01).
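The dose totals in the table can be turned into survival proportions, which is the quantity the motivating question is about. This is a minimal Python sketch, not part of the original notes: the pairing of deaths with doses is read from the table and agrees with the listed survivor counts, and the name `survival_by_dose` is an illustrative assumption.

```python
# (dose in mg/L, total receiving dosage, total deaths), read from the table
rows = [
    (0.0, 98, 0), (0.003, 100, 16), (0.004, 100, 68), (0.005, 100, 78),
    (0.01, 100, 77), (0.05, 300, 270), (0.1, 400, 383), (0.2, 750, 740),
    (0.3, 500, 490), (0.4, 500, 492), (1.0, 7850, 7806),
]


def survival_by_dose(records):
    """Return (dose, survivors, survival proportion) for each dose."""
    result = []
    for dose, n_exposed, n_deaths in records:
        survivors = n_exposed - n_deaths
        result.append((dose, survivors, survivors / n_exposed))
    return result


survival = survival_by_dose(rows)
for dose, survivors, proportion in survival:
    print(f"{dose:>6} mg/L  survivors={survivors:>3}  proportion={proportion:.4f}")
```

The survivor counts computed this way match the Total Survivors column, and they sum to the table total of 378.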
Practical Considerations in Choosing Dosage
Clearly a high dosage would kill all beetles, regardless of genotype
Time is more important than concentration
Expense: more time with lower dose
Technical limitations: maintaining concentration in silos
Safety: spontaneous combustion at high concentrations
Case Study 3: T-cell Cancer
Acute lymphoblastic leukemia (ALL)
leukemia – cancer of white blood cells
ALL – excess of lymphoblasts (immature cells that become white blood cells)
Two types of interest here:
T-cell – manage cell-mediated immune response
(activation of cells, release of cytokines)
B-cell – manage humoral immune response
(secretion of antibodies)
Researchers used gene expression technology
Central Dogma of Molecular Biology
General assumption of microarray technology
Use mRNA transcript abundance level as a measure of the level of “expression” for the corresponding gene
Proportional to degree of gene expression
How to measure mRNA abundance?
Several different approaches with similar themes:
Affymetrix GeneChip (oligonucleotide array)
Nimblegen array (oligonucleotide array)
Two-color cDNA array
… and more
Representation of genes on slide:
Small portion of gene
Larger sequence of gene
Affymetrix probes: 25 bp
(Images courtesy Affymetrix, www.affymetrix.com)
Affymetrix Technology – GeneChip
Each spot on array represents a single probe sequence
(with millions of copies)
Perfect match
Mismatch
Each gene is represented by a unique set of probe pairs (usually 12-20 probe pairs per probe set)
These probes are fixed to the array
(Image courtesy Affymetrix, www.affymetrix.com)
Affymetrix Technology – Expression
A tissue sample is prepared so that its mRNA has fluorescent tags; wait for hybridization
(Images courtesy Affymetrix, www.affymetrix.com)
Affymetrix GeneChip
(Image courtesy Affymetrix, www.affymetrix.com)
Cartoon Representations
Animation 1: GeneChip structure
(1 min.)
Animation 2: Measuring gene expression
(2.5 min)
Data: Spot Intensities
Figure panels: Full Array Image; Close-up of Array Image
Images courtesy Affymetrix, www.affymetrix.com
Basic goal of microarray technology
“Observe” gene expression in different conditions – healthy vs. diseased, e.g.
Decide which genes’ expression levels are changing significantly between conditions
Target those genes – to halt disease, e.g.
Study those genes – to better understand differences at the genetic level
ALL Data
“Preprocessed” gene expression data
12625 genes (hgu95av2 Affymetrix GeneChip)
128 samples (arrays)
a matrix of “expression values” – 128 cols, 12625 rows
phenotypic data on all 128 patients, including:
95 B-cell cancer
33 T-cell cancer
Motivating question: Which genes are changing expression values systematically between B-cell and T-cell groups?
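One simple way to approach this question is to compute a two-sample statistic for every row of the expression matrix. The Python sketch below uses simulated values rather than the real ALL data, and a Welch t statistic chosen here only for illustration (it is not necessarily the method developed later in these notes). The group sizes 95 and 33 come from the slide. The gene count of 200, the expression scale, and the shifted gene are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t statistic for two lists of expression values."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


random.seed(1)
n_b, n_t = 95, 33   # group sizes from the ALL data
n_genes = 200       # hypothetical; the real matrix has 12625 rows
labels = ["B"] * n_b + ["T"] * n_t

# Simulated expression matrix: one row per gene, one column per sample.
# Gene 0 gets a shifted mean in the T-cell group so that it should stand out.
matrix = []
for g in range(n_genes):
    shift = 2.0 if g == 0 else 0.0
    row = [random.gauss(7.0, 1.0) + (shift if lab == "T" else 0.0)
           for lab in labels]
    matrix.append(row)

# One t statistic per gene, comparing T-cell samples with B-cell samples.
t_stats = []
for row in matrix:
    b_vals = [v for v, lab in zip(row, labels) if lab == "B"]
    t_vals = [v for v, lab in zip(row, labels) if lab == "T"]
    t_stats.append(welch_t(t_vals, b_vals))

# Index of the gene with the largest absolute statistic.
top_gene = max(range(n_genes), key=lambda g: abs(t_stats[g]))
print("gene with largest |t|:", top_gene)
```

With thousands of genes tested at once, some large statistics will appear by chance alone, which is why the outline pairs tests for differential expression with multiple hypothesis testing.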
Next …
Analysis for these case studies
Build on known statistical methods
Notice huge potential for additional methods