Data Analysis and Statistics in the Lab

Southern California
Bioinformatics Summer Institute
Richard Johnston
Pasadena City College
UCLA School of Medicine
rmjohnston@mac.com







Introduction
Data
Displaying Data
Descriptive Statistics
Inferential Statistics
Q&A
Wrap-up
©Richard Johnston




Introduce commonly used statistical concepts and
measures
Show how to compute statistical measures using Excel
and R
Minimize discussion of statistical theory.
Provide sample calculations in the downloaded materials






Could I have gotten these results by chance?
Is there a significant difference between these two
samples?
How can I describe my results?
What can I say about the average of these
measurements?
What should I do with these outliers?
… etc.
Excel 2007
R (???)






Mostly, the look and feel have changed.
Up to 1,048,576 rows and 16,384 columns in a single
worksheet
Up to 32,767 characters in a single cell
More sorting options
Enhanced data importing
Improved PivotTables





More options for conditional formatting of cells
Multithreaded calculation of formulae, to speed up large
calculations, especially on multi-core/multi-processor
systems.
Improved filtering
New charting features
… Other changes to make Excel and other Office
programs more Vista-like





R is a language and environment for statistical
computing and graphics widely used in research
institutions.
R provides a wide variety of statistical and graphical
techniques, and is highly extensible.
One of R's strengths is the ease with which publication-quality plots can be produced.
R is available as free software, and runs on Windows, MacOS, and a wide variety of UNIX
platforms.



Primary (collected by you) or Secondary
(obtained from another source)
Observational or Experimental
Quantitative or Qualitative
 Quantitative data uses numerical values to describe
something (e.g., weight, temperature)
 Qualitative data uses descriptive terms to classify
something (e.g., gender, color)

Nominal (Qualitative)
 Examples are gender, color, species name, …

Ordinal (Qualitative or quantitative)
 Allows rank ordering of values
 Examples:
 Grades A-F
 Rating Level 1 through 5
 “Slow”, “Medium”, “Fast”

Interval (Quantitative)
 Allows addition and subtraction, but not
multiplication and division
 No real zero point
 Example: Temperature measurement in degrees
Fahrenheit
 100 degrees F is 50 more than 50 degrees F
 100 degrees F is not twice as hot as 50 degrees F

Ratio (Quantitative)
 Allows addition, subtraction, multiplication and division
 Has a true zero point
 A value zero means the absence of the measured quantity
 Examples: Weight, age, or speed
 To decide if a measurement is Interval or Ratio, see if the
phrase “Twice as…” makes sense:
 e.g., Twice as (heavy, old, fast)
 Charts
 Pie
 Column
 Line
 Scatter
 Histograms
Tip:
Pivot Charts can be used to quickly generate a wide variety of charts.
Open AspirinStudyData.xlsx or .xls in Excel and select “Pie” tab.
Open AspirinStudyData.xlsx or .xls in Excel and select “Bar” tab.
Open AspirinStudyData.xlsx or .xls in Excel and select “XY Line” tab.
Open AspirinStudyData.xlsx or .xls in Excel and select “Scatter” tab.








Open AspirinStudyData.xlsx or .xls
Create bin values 20,25,… ,100
Select Data Analysis
Select Histogram
Click Input Range and select the Age data
Click on Bin Values and select the bin values.
Check Labels and Chart Output
Click OK




Start R
Select File>change dir…
Browse to default directory
Type the following (including capital letters):
> aspirin = read.csv("AspirinStudyData.csv", header=T)
> attach(aspirin)
> hist(Age)

Measures of Central Tendency
  Mean
  Median
  Mode

Measures of Dispersion
  Range
  Variance
  Standard Deviation
  Quartiles
  Interquartile Range








Use built-in functions to perform basic analyses:
  Formulas | More Functions | Statistical

Use the Data Analysis Add-in for more complex analyses:
  Open AspirinStudyData.xlsx
  Select Data | Data Analysis
  Select Descriptive Statistics
  Select the Age data
  Check Labels in first row
  Check Summary Statistics
©Richard Johnston
24

Type the following:

> summary(Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34.00   54.00   59.00   59.09   65.00   82.00
> sd(Age)
[1] 8.173196
> var(Age)
[1] 66.80114
… (etc.)

Mean
 Arithmetic average of the values

Median
 Midpoint of the values (half are higher and half are lower)
If there is an even number of values, the median is the
average of the two middle values.

Mode
 The value that occurs most frequently.
 There may be more than one mode.
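All three measures can be computed in R. Note that R's built-in mode() reports an object's storage mode, not the statistical mode, so a common table-based idiom is shown instead (the vector x below is made up for illustration):

```r
x <- c(2, 3, 3, 5, 7, 3, 5)               # made-up sample values
mean(x)                                   # arithmetic average: 4
median(x)                                 # middle value: 3
# count occurrences to find the statistical mode(s):
tab <- table(x)
as.numeric(names(tab)[tab == max(tab)])   # most frequent value: 3
```

If two values tie for the highest count, this idiom returns both, which matches the slide's point that there may be more than one mode.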

Range
 Gives an idea of the spread of values, but depends only on
two of them – the largest and the smallest.

Variance
 Averages the squared deviations of each value from the
mean.

Standard Deviation
 Calculated by taking the square root of the variance
 More useful than the variance since it’s in the same
measurement units as the data.

Quartiles
  Divide the data into four equal segments.

In Excel:
  Q1: 54   =QUARTILE(Age,1)
  Q2: 59   =QUARTILE(Age,2)
  Q3: 65   =QUARTILE(Age,3)
  Q4: 82   =QUARTILE(Age,4)

In R:
> quantile(Age)
  0%  25%  50%  75% 100%
  34   54   59   65   82

Approximately 25% of the values are less than Q1, 50% are less
than Q2, 75% are less than Q3, and all values are less than or
equal to Q4.

Interquartile Range (IQR)
 Measures the spread of the center 50% of the data
 IQR = Q3 – Q1
 Used to help identify outliers (more on this later)
General “Rule”:
Consider discarding values
less than Q1 – 1.5 x IQR or
greater than Q3+1.5 x IQR
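The 1.5 × IQR rule can be applied directly in R. The vector below is made up, since the aspirin data set itself isn't embedded in these notes:

```r
x <- c(34, 50, 54, 59, 59, 65, 70, 82, 120)    # made-up ages, one extreme value
q <- quantile(x, c(0.25, 0.75))                # Q1 and Q3
fence <- 1.5 * IQR(x)                          # IQR(x) equals Q3 - Q1
outliers <- x[x < q[1] - fence | x > q[2] + fence]
outliers                                       # flags the 120
```

Flagged values are candidates for the careful attention discussed on the outliers slide, not automatic deletions.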
 Predicting the distribution of values
 Empirical rule for “Bell Shaped” curves:
Approximately
 68% of the values will fall within 1 SD of the
mean,
 95% will fall within 2SD of the mean, and
 99.7% will fall within 3SD of the mean
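The empirical-rule percentages can be verified against the normal cumulative distribution function in R:

```r
pnorm(1) - pnorm(-1)   # about 0.6827 (within 1 SD)
pnorm(2) - pnorm(-2)   # about 0.9545 (within 2 SD)
pnorm(3) - pnorm(-3)   # about 0.9973 (within 3 SD)
```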
From the example:
  ±1 SD: 71.4%
  ±2 SD: 95.4%
  ±3 SD: 99.4%






Outliers are values that are (or seem to be) out of line with
the rest of the observations.
Outliers can distort statistical measures.
They may be indicative of transient errors in equipment or
errors in transcription.
They may also indicate a flaw in experimental assumptions.
As we've just shown, 3 observations out of 1000 can be
expected to be over 3 SD from the mean.
Outliers that can't be readily explained should receive careful
attention.


Quartiles and boxplots can help identify outliers.

In R, type
> boxplot(Age)

The middle box spans the IQR.
The horizontal line is the median.
The whiskers extend to the most extreme values within 1.5 × IQR of the box.
In this example, four values are outliers.
Age histogram with a normal curve overlaid, using the same mean
and standard deviation as the data.

In R:
> h = hist(Age, plot=F)
> plot(h)
> s = sd(Age)
> m = mean(Age)
> ylim = range(0, h$density, dnorm(0, sd=s))
> hist(Age, freq=F, ylim=ylim)
> curve(dnorm(x, m, s), add=T)








The mean, median, and mode are the same value
The distribution is bell shaped and symmetrical around
the mean
The total area under the curve is equal to 1
The left and right sides extend indefinitely
x = the normally distributed random variable of interest
μ = the mean of the normal distribution
σ = the standard deviation of the normal distribution
z = the number of standard deviations between x and μ,
otherwise known as the standard z-score
z is calculated using the formula

  z = (x − μ)/σ

For the Age data,
  μ = 59.09
  σ = 8.17

For x = 82 (the oldest subject),
  z = (82 − 59.09)/8.17 = 2.80

The 82-year-old is 2.80 SD away from the mean of the population.
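The same arithmetic in R, using the rounded mean and SD from the slide:

```r
m <- 59.09      # mean of Age, rounded as on the slide
s <- 8.17       # standard deviation of Age, rounded
z <- (82 - m) / s
round(z, 2)     # 2.8
```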

The standard normal distribution is a normal
distribution with
μ=0
σ =1

The total area under the standard normal curve is
equal to 1.
The shaded area of the standard normal curve represents the
probability that x is within 1 SD of the mean. In Excel:

  68% = NORMDIST(1,0,1,1) - NORMDIST(-1,0,1,1)

The z-score for a 5% probability that x is less than z is
about -1.64:

  -1.64 = NORMSINV(.05)
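The R equivalents of these Excel calls are pnorm (cumulative area up to z) and qnorm (its inverse):

```r
pnorm(1) - pnorm(-1)   # area within 1 SD of the mean, about 0.683
qnorm(0.05)            # z with 5% of the area to its left, about -1.645
```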




Sampling
Sampling Distributions
Confidence Intervals
Hypothesis Testing



The term ”population” in statistics represents all
possible outcomes or measurements of interest
in a particular study.
A “sample” is a subset of the population that is
representative of the whole population.
Analysis of a sample allows us to infer
characteristics of the entire population with a
quantifiable degree of certainty.



In the 1980’s Harvard did a study of the effectiveness of
aspirin in the prevention of heart attacks. They followed over
22,000 physicians for five years. Half of the physicians were
given a daily dose of aspirin, and half were given a placebo.
Neither the subjects nor the investigators knew which was
being administered. (More on this later.)
A coin is flipped 20 times in each of 20 trials to determine
whether it is “fair”.
Seeds are divided randomly into two groups and planted.
One group receives fertilizer A and the other group receives
fertilizer B. All other factors (light, water, etc.) are kept the
same.

Patients at 6 US hospitals were randomly assigned to 1 of 3
groups: 604 received intercessory prayer after being
informed that they may or may not receive prayer; 597 did
not receive intercessory prayer also after being informed that
they may or may not receive prayer; and 601 received
intercessory prayer after being informed they would receive
prayer. Intercessory prayer was provided for 14 days, starting
the night before coronary artery bypass graft surgery (CABG).
The primary outcome was presence of any complication
within 30 days of CABG. Secondary outcomes were any major
event and mortality. (American Heart Journal, 2006)

Several factors contribute to the determination of the sample
size needed for a particular study:
 Desired confidence level (95%, 99%)
 Margin of error (5%,3%)
 Population size (Results don’t change much for populations of 20,000
or more)
 Expected proportion (p=q=.5 is conservative)



For example, a 99% confidence level with a margin of error of
6% would require about 450 samples.
Formulas for sample size vary, and are not presented here
Several online tools are available for determining sample sizes
(e.g., http://www.raosoft.com/samplesize.html)
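The slide's example can be reproduced with the usual proportion-based formula n = z²·p(1−p)/E². The slides do not give a formula, so this is the common conservative version rather than the exact one used for the quoted figure:

```r
z <- qnorm(1 - 0.01/2)                 # 99% confidence: z is about 2.576
E <- 0.06                              # 6% margin of error
p <- 0.5                               # conservative expected proportion
n <- ceiling(z^2 * p * (1 - p) / E^2)
n                                      # 461, in line with "about 450" above
```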




Your company has just completed a five year study on the
effectiveness of aspirin in preventing heart attacks.
Five hundred physician volunteers were divided randomly
into two groups. One group received 325mg of aspirin every
other day. The other group received a placebo instead of
aspirin.
Neither the subjects nor the test administrators knew
whether aspirin or placebo was being administered.
The subjects were monitored for five years to determine
whether or not they experienced a heart attack
The results of the study are provided in tabular format. The
table contains the following information:

  Field Name    Contents
  Subject       Subject identification number
  Age           Age of the subject at the start of the experiment
  Sex           Sex of the subject
  Group         Group the subject was assigned to (placebo or aspirin)
  Smoker        Smoker/Non-Smoker status
  Attack        Attack: The subject had a heart attack during the study
                No Attack: The subject did not have a heart attack during the study
  AttackDate    Date the heart attack occurred
  Ulcer         Ulcer: The subject developed an ulcer during the study
                No Ulcer: The subject did not develop an ulcer during the study
  Transfusion   Trans: The subject required a transfusion during the study
                No Trans: The subject did not require a transfusion during the study
The capabilities of Excel can be used to summarize the
data in various ways. One way to summarize the data is
with a PivotTable report.
Open AspirinStudyData.xlsx or .xls and click on tab “Study
Summary”
The data seem to indicate that aspirin helps to prevent
heart attacks. Your task is to determine the statistical
significance of the results.




If we perform an experiment such as flipping a coin we can
count the number of “successes” (i.e., heads) in a number of
trials to get an estimate of the underlying probability that a flip
of the coin will result in heads.
The larger the number of trials, the more confident we are that
the true probability is within a certain range, or confidence
interval.
The numbers of successes in repeated experiments such as this
form the familiar normal (bell-shaped) curve.
Assuming a normal distribution of results allows us to calculate
statistical characteristics for a wide variety of experiments,
including clinical trials.
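The coin-flip picture can be simulated in R to watch the bell shape emerge. This is only a sketch; the seed and the number of experiments are arbitrary:

```r
set.seed(1)                                     # arbitrary seed, for reproducibility
heads <- rbinom(10000, size = 20, prob = 0.5)   # 10,000 experiments of 20 fair flips
mean(heads)                                     # close to np = 10
sd(heads)                                       # close to sqrt(np(1-p)), about 2.24
hist(heads)                                     # roughly bell-shaped, centered at 10
```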


The “true” chance of attack in the Placebo Group
is referred to as p1. Similarly, the “true” chance
of attack in the Aspirin Group is referred to as p2.
Our objective is to estimate the true difference of
p1 and p2 using the results of this study.
Note: The details of the computation are provided in
the Aspirin Study PDF document and Excel Workbook
We compute estimates of p1, p2 and the difference p1 - p2
using the information in Table 1 as follows:

  Estimate of p1:      p̂1 = 55/250 = .220
  Estimate of p2:      p̂2 = 31/250 = .124
  Estimate of p1 - p2: p̂1 - p̂2 = .220 - .124 = .096

(Note: Statisticians use the caret or "hat" to indicate that
a value is an estimate of the true value for that measure.)
The computation of a "confidence interval" allows us to
specify the probability that any given confidence interval
from a random sample will contain the true population
parameter. Typically, a 95% confidence interval is used.

The formula for computing the confidence interval for the true difference is:

  p1 - p2 = (p̂1 - p̂2) ± z_(α/2) · SE(p̂1 - p̂2)

where
  p1 - p2 is the true difference,
  p̂1 - p̂2 is the observed difference,
  z_(α/2) is the critical value for 95% confidence (see diagram), and
  SE(p̂1 - p̂2) is the standard error of p̂1 - p̂2
  (the standard deviation of the sample proportion difference).
The 95% confidence interval for this study is

  p1 - p2 = .096 ± (1.96)(.033)
          = .096 ± .066

In other words, we are 95% confident that the true difference
in the heart attack rates is between .030 and .162. Since the
lower number is still positive, we are 95% confident that aspirin
has a beneficial effect in preventing heart attacks.
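The interval can be reproduced in R from the raw counts, using the unpooled standard error appropriate for confidence intervals:

```r
p1 <- 55/250                                    # observed attack rate, placebo group
p2 <- 31/250                                    # observed attack rate, aspirin group
se <- sqrt(p1*(1 - p1)/250 + p2*(1 - p2)/250)   # unpooled SE, about .033
ci <- (p1 - p2) + c(-1, 1) * qnorm(0.975) * se
ci                                              # about (0.030, 0.162)
```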
(See documentation for details of computation)




A confidence interval is a range of values used to
estimate a population parameter such as the mean.
The confidence level is the probability that the interval
estimate will include the parameter.
Increasing the confidence level makes the interval
wider (less precise).
Increasing the sample size reduces the width of the
interval (more precise).
We now address the question:
If aspirin had no effect, what is the probability
that the observed results occurred by chance?
In order to answer this question, we formulate two
hypotheses H0 and Ha.
H0 is called the “Null Hypothesis”, and Ha is called the
“Alternate Hypothesis”
For this study H0 and Ha can be stated as follows:
Null hypothesis
H0:
Aspirin has no effect, and p1 = p2
Alternate Hypothesis
Ha:
Aspirin does reduce heart attacks, and p1 > p2
Under the Null Hypothesis, the probability of attack with
or without aspirin therapy is the same, and the observed
difference is due to random chance.

Since we are assuming p1 = p2, we can pool the results to
get an estimate of the probability of an attack under the
Null Hypothesis:

  p̂ = (x1 + x2)/(n1 + n2) = (55 + 31)/(250 + 250) = 86/500 = .172
The Standard Error under the Null Hypothesis is given by

  SE₀(p̂1 - p̂2) = sqrt( p̂(1 - p̂)(1/n1 + 1/n2) )
              = sqrt( .172(1 - .172)(1/250 + 1/250) )
              = .0338
We can now compute the test statistic z for the
observed results under the Null Hypothesis:

  z_OBS = (p̂1 - p̂2) / SE₀(p̂1 - p̂2)
        = (.220 - .124)/.0338
        = 2.840

The value for z_OBS is almost three standard deviations from zero,
indicating that it is highly unlikely we would get the observed
results from random chance.
In order to determine the probability of getting this result
under the Null Hypothesis, we compute the "p-value".
The p-value in this case is the probability that the test
statistic z is greater than or equal to z_OBS:

  p-value = Pr(z ≥ z_OBS) = Pr(z ≥ 2.840)
We can use a built-in Excel function to calculate the p-value. In
this case we use the function NORMSDIST, which returns the
area under the bell curve from minus infinity up to the specified z
value.

Since we are interested in the area under the right-hand tail of
the bell curve, we use the following calculation in Excel:

  p-value = 1 - NORMSDIST(z_OBS)
          = 1 - NORMSDIST(2.840)
          = .002
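The same test statistic and p-value can also be computed step by step in R, matching the prop.test output up to rounding:

```r
p1 <- 55/250; p2 <- 31/250
p_pool <- (55 + 31) / (250 + 250)                     # pooled estimate, .172
se0 <- sqrt(p_pool * (1 - p_pool) * (1/250 + 1/250))  # SE under H0, about .0338
z_obs <- (p1 - p2) / se0                              # about 2.84
p_value <- 1 - pnorm(z_obs)                           # about .002
```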
In R:
> attack=c(31,55)
> total=c(250,250)
> prop.test(attack,total,alternative="less",correct=F)
2-sample test for equality of proportions
without continuity correction
data: attack out of total
X-squared = 8.089, df = 1, p-value = 0.002227
alternative hypothesis: less
sample estimates:
prop 1 prop 2
0.124 0.220
Illustration of the p-value for z_OBS = 2.840: 99.8% of the area
under the standard normal curve lies below z_OBS, leaving 0.2%
(the p-value) in the right tail.
Conclusion (Finally!)
Given this p-value, we can reject the Null Hypothesis H0
with a 99.8% level of confidence, and accept the Alternate
Hypothesis Ha that the probability of heart attack is less
when aspirin is taken regularly.
There are three versions of the Alternate Hypothesis Ha:

  Two-sided: Ha: p1 ≠ p2   p-value = Pr(|z| ≥ |z_OBS|)
  Right:     Ha: p1 > p2   p-value = Pr(z ≥ z_OBS)
  Left:      Ha: p1 < p2   p-value = Pr(z ≤ z_OBS)

This tutorial used the "Right" version, since we were
interested in the right-hand tail of the bell curve. The
determination of the p-value is slightly different in each
case.
Histogram comparing two samples of interval or ratio data
(Sample A and Sample B), with frequency plotted against bins
running from 0 to 120.

Use one of Excel’s t-Tests in the Data Analysis Add-in
 In this case, Two Samples with Unequal Variances
t-Test: Two-Sample Assuming Unequal Variances

                                 Sample A Length   Sample B Length
  Mean                           31.811            59.250
  Variance                       245.188           408.749
  Observations                   50                50
  Hypothesized Mean Difference   0
  df                             92
  t Stat                         -7.587
  P(T<=t) one-tail               1.28745E-11
  t Critical one-tail            1.6616
  P(T<=t) two-tail               2.57491E-11
  t Critical two-tail            1.9861

Since the p-value is less than .05, we reject the null hypothesis.
In R:
>t.test(Sample_A,Sample_B)
Welch Two Sample t-test
data: Sample_A and Sample_B
t = -7.5874, df = 92.23, p-value = 2.544e-11
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
-34.62166 -20.25696
sample estimates:
mean of x mean of y
31.81100 59.25032



Allows hypothesis testing of nominal and ordinal
data.
Used to test whether a frequency distribution fits
a predicted distribution.
Hypotheses:
 H0: The actual distribution can be described by the
expected distribution
 Ha: The actual distribution differs from the expected
distribution
Suppose the expected distribution of colors of flowers is:

  Color    Expected percentage
  White    40%
  Yellow   30%
  Orange   20%
  Blue     5%
  Purple   5%
  Total    100%
The observed distribution of an experimental sample is:

  Color    Number
  White    145
  Yellow   128
  Orange   73
  Blue     32
  Purple   22
  Total    400

Can we conclude that the expected distribution is
"true" based on the observations?
The Chi-Square statistic is calculated from:

  X² = Σ (O - E)² / E

where
  O = Number observed in each category
  E = Number expected in each category

For this example, X² = 9.95
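The statistic can be checked in R, either by hand or with the built-in chisq.test:

```r
observed   <- c(White = 145, Yellow = 128, Orange = 73, Blue = 32, Purple = 22)
p_expected <- c(0.40, 0.30, 0.20, 0.05, 0.05)
expected   <- sum(observed) * p_expected        # 160 120 80 20 20
x2 <- sum((observed - expected)^2 / expected)
x2                                              # about 9.95
chisq.test(observed, p = p_expected)            # same statistic, with its p-value
```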
The critical Chi-Square score Xc² depends on the number of
degrees of freedom. In this case:

  d.f. = k - 1, where k is the number of categories.
  Here k = 5, so d.f. = 4.

In Excel, we can use the CHIINV function to get the critical chi-square score:

  CHIINV(probability, deg_freedom)

For α = 0.10 and d.f. = 4:

  CHIINV(0.1, 4) = 7.77944

Since X² = 9.95 is greater than the critical chi-square value, we
conclude that the observed distribution differs from the expected
distribution.
In Excel, we can use the CHITEST function to calculate the
probability of the observed chi-square score:
CHITEST(actual_range,expected_range)
CHITEST returns the probability that a value of the χ2 statistic at
least as high as the value calculated could have happened by
chance.
In this case,
CHITEST(actual_range,expected_range) = .041
This means that the probability of the observed results is less than
the 10% probability we chose for α, so we conclude that the
observed distribution differs from the expected distribution.
Use Excel’s Help resources to explore the various
types of tests and statistics
 A list of useful books and websites is provided
with the handouts.
Most importantly Have a statistician look at your results before
publishing them

©Richard Johnston
76
1. Excellent electronic statistics textbook:
http://www.statsoft.com/textbook/stathome.html
2. UCLA’s Statistics Advisory site: http://www.ats.ucla.edu/stat/
3. Choosing the correct statistic:
http://bama.ua.edu/~jleeper/627/choosestat.html
4. Handy R reference:
http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html
5. Discussions of statistical tests with examples:
http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm#hsb
6. List of sites for learning and using R:
http://www.ats.ucla.edu/stat/R/
7. Wikipedia has informative discussions of topics in statistics,
with links to primary references
Material and information from the following references
were used in this presentation
1. Introductory Statistics with R by Peter Dalgaard
2. The Complete Idiot's Guide to Statistics by Robert
A. Donnelly Jr.
3. Cartoon Guide to Statistics by Larry Gonick and
Woollcott Smith






SoCal BSI
Dr. Momand
Dr. Johnston
SoCal BSI Core Instructors
Ronnie Cheng
All of you

No statisticians were harmed during
the making of this presentation