UCLA STAT 233
Statistical Methods in Biomedical Imaging

Instructor: Ivo Dinov, Asst. Prof. in Statistics and Neurology
University of California, Los Angeles, Spring 2004
http://www.stat.ucla.edu/~dinov/
General Statistical Techniques

• Intro to stats, vocabulary
• Displaying data
• Central tendency and variability
• Normal z-scores, standardized distribution
• Probability, samples & sampling error
• Type I and Type II errors; power of a test
• Intro to hypothesis testing
• One-sample tests & two independent-samples tests
• Two-sample tests for dependent samples & estimation
• Correlation and regression techniques
• Non-parametric statistical tests

General Statistical Techniques (cont.)

• Applications of the Central Limit Theorem, Law of Large Numbers
• Design of studies and experiments
• Fisher's F-test & Analysis of Variance (ANOVA, 1- or 2-way)
• Principal Component Analysis (PCA)
• χ² (Chi-square) goodness-of-fit test
• Multiple linear regression
• General Linear Model
• Bootstrapping and resampling

Errors in Samples …

• Selection bias: the sampled population is not a representative subgroup of the population really being investigated.
• Non-response bias: if a particular subgroup of the population studied does not respond, the resulting responses may be skewed.
• Question effects: survey questions may be slanted or loaded to influence the result of the sampling.
• Is quota sampling reliable? Each interviewer is assigned a fixed quota of subjects (district, sex, age, and income exactly specified), so the investigator can select those people as they like.

• Target population – the entire group of individuals, objects, or units we study.
• Study population – a subset of the target population containing all "units" which could possibly be used in the study.
• Sampling protocol – the procedure used to select the sample.
• Sample – the subset of "units" about which we actually collect information.
More terminology …

• Census – an attempt to sample the entire population.
• Parameter – a numerical characteristic of the population, e.g., income, age, etc. Often we want to estimate population parameters.
• Statistic – a numerical characteristic of the sample. A (sample) statistic is used to estimate a corresponding population parameter.
• Why do we sample at random? We draw "units" from the study population at random to avoid bias: every unit in the study population is equally likely to be selected. Random sampling also allows us to calculate the likely size of the error in our sample estimates.
Terminology …

• Random allocation – randomly assigning treatments to units; this leads to a representative sample only if we have a large number of experimental units.
• Completely randomized design – the simplest experimental design; it allows comparisons that are unbiased (though not necessarily fair). Randomly allocate treatments to all experimental units, so that every treatment is applied to the same number of units. E.g., if we have 12 units and 3 treatments, and we study treatment efficacy, we randomly assign each of the 3 treatments to exactly 4 units.
• Blocking – grouping units into blocks of similar units and making treatment-effect comparisons only within individual blocks. E.g., in a study of human life expectancy, where income is clearly a factor, we can form high- and low-income blocks and compare, say, gender differences within these blocks separately.
Terminology …

• Why should we try to "blind" the investigator in an experiment?
• Why should we try to "blind" human experimental subjects?
• The basic rule of the experimenter: "Block what you can and randomize what you cannot."

Experiments vs. observational studies for comparing the effects of treatments

• In an experiment, the experimenter determines which units receive which treatments (ideally using some form of random allocation).
• An observational study is useful when we can't design a controlled randomized study: we compare units that happen to have received each of the treatments.
  - Ideal for describing relationships between different characteristics in a population.
  - Often useful for identifying possible causes of effects, but cannot reliably establish causation.
• Only properly designed and executed experiments can reliably demonstrate causation.
Types of Variables

• Quantitative variables are measurements and counts.
  - Variables with few repeated values are treated as continuous.
  - Variables with many repeated values are treated as discrete.
• Qualitative variables (a.k.a. factors or class variables) describe group membership.

Distinguishing between types of variable:

  Quantitative (measurements and counts)
    - Continuous (few repeated values)
    - Discrete (many repeated values)
  Qualitative (define groups)
    - Categorical (no idea of order)
    - Ordinal (fall in natural order)
Histograms vs. Stem-and-leaf plots …

• What advantages does a stem-and-leaf plot have over a histogram? (Stem-and-leaf plots retain information on the individual values, are quick to produce by hand, and provide a data-sorting mechanism. But histograms are more attractive and more understandable.)
• The shape of a histogram can be quite drastically altered by choosing different class-interval boundaries. What type of plot does not have this problem? (A density trace.) What other factor affects the shape of a histogram? (The bin size.)
• What was another reason given for plotting data on a variable, apart from interest in how the data on that variable behave? (Plots reveal features — clusters/gaps and outliers — as well as trends.)

Interpreting Stem-plots and Histograms
[Figure: nine example distribution shapes — (a) unimodal, (b) bimodal, (c) trimodal, (d) symmetric, (e) positively skewed (long upper tail), (f) negatively skewed (long lower tail), (g) symmetric, (h) bimodal with gap, (i) exponential shape.]
Skewness & Kurtosis

• What do we mean by symmetry and positive and negative skewness? Kurtosis? Properties?

    Skewness = Σ_{k=1}^{N} (Y_k − Ȳ)³ / [(N − 1)·SD³];   Kurtosis = Σ_{k=1}^{N} (Y_k − Ȳ)⁴ / [(N − 1)·SD⁴]

• Skewness is a measure of asymmetry.
• Skewness is linearly invariant: Sk(aX + b) = Sk(X) (for a > 0).
• Kurtosis is (also linearly invariant) a measure of flatness.
• Both are used to quantify departures from the standard Normal:
  Skewness(StdNorm) = 0; Kurtosis(StdNorm) = 3.

Descriptive statistics from computer programs like STATA

Minitab/STATA output (Descriptive Statistics):

  Variable   N    Mean     Median   TrMean   StDev   SE Mean
  age        45   50.133   51.000   50.366   6.092   0.908

  Variable   Minimum   Maximum   Q1 (lower quartile)   Q3 (upper quartile)
  age        36.000    59.000    46.500                56.000
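As a quick numeric companion to the formulas above, here is a minimal sketch with hypothetical data. Note the slide's (N − 1)·SD³ and (N − 1)·SD⁴ denominators differ slightly from the default conventions in scipy.stats.skew/kurtosis, so the formulas are coded directly:

```python
# Skewness and kurtosis per the slide's (N-1) convention; data are hypothetical ages.
import numpy as np

y = np.array([36, 42, 45, 47, 51, 51, 53, 56, 58, 59], dtype=float)
n, ybar, sd = len(y), y.mean(), y.std(ddof=1)   # SD with N-1, as on the slide

skewness = np.sum((y - ybar) ** 3) / ((n - 1) * sd ** 3)
kurtosis = np.sum((y - ybar) ** 4) / ((n - 1) * sd ** 4)  # equals 3 for a Normal, not 0
print(skewness, kurtosis)
```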
Box plot compared to dot plot; construction of a box plot

• The box spans the quartiles Q1 to Q3, with a line at the median.
• Whiskers extend 1.5 IQR beyond each quartile, then are pulled back until they hit an observation.

[Figure 2.4.3: box plot for SYSVOL, with scale 50–200, shown above a dot plot of the same data. Figure 2.4.4: construction of a box plot.]
Describing data with pictures and two numbers

• Random number generation: frequency histogram.
• Descriptive statistics:
  - central tendency (mode, median, mean);
  - variability (variance, standard deviation).

TABLE 2.5.1: Word Lengths for the First 100 Words on a Randomly Chosen Page

  2  3  6  5  7  2  4  9  9  5
  4  5  6  2 10  4  2  3  3  3
  4  9  2  7  2  3  5  3 11  3
  9  8  4  2  9  9  3  4  3  4
  3  2  4  6  5  6  4  2  4  5
  2  5  2  4  4  3  2  4  7  4
  3  2  3  2  7  2  4  2  6  3
  3  1  3  6  5  4  4  7 10  2
  6  2  4  4  5  5  5  2  3  2
  3  2  6  5  4  4  5  4  7  2

Frequency Table

  Value u_j:     1   2   3   4   5   6   7   8   9  10  11
  Frequency f_j: 1  22  18  22  13   8   6   1   6   2   1
Histogram of random numbers generated

Notes:
1. A histogram allows you to see the distribution — estimate the middle and the spread.
2. We sacrifice some information (within intervals) to gain this perspective.
3. How can we further reduce this information to a single number for the middle and another for the spread?

[Figure: frequency histogram of the generated random numbers, with intervals 0–19, 20–39, 40–59, 60–79, …, 300–319 on the horizontal axis and frequency (0–10) on the vertical axis.]

Central tendency: the middle in a single number

• Mode: the most frequent score in the distribution.
• Median: the centermost score if there is an odd number of scores, or the average of the two centermost scores if there is an even number of scores.
• Mean: the sum of the scores divided by the number of scores.
Five-number summary

The five-number summary = (Min, Q1, Med, Q3, Max).

Neat things about the mean

• If you add/subtract a constant to/from each score, you change the mean by adding/subtracting that constant.
• If you multiply/divide each score by a constant, you change the mean by multiplying/dividing it by that constant.
• Summed deviations from the mean equal zero: Σ(xᵢ − x̄) = 0.
• Very sensitive to extreme scores (outliers).
• The sum of squared deviations from the mean (SS) is minimized:
    Σ(xᵢ − x̄)² = minimum, equivalently Σx² − (Σx)²/N = minimum.

Conceptual formula vs. computational formula for SS

• Conceptual SS: Σ(x − x̄)² — "the sum of each x's squared difference from the mean."
• Computational SS: Σx² − (Σx)²/N — "the sum of the squared x's, minus the square of the sum of all the x's divided by the total number of x's."

Example (x = 1, 2, 3):

  Xi    (Xi − X̄)   (Xi − X̄)²
  1       −1          1
  2        0          0
  3        1          1
  Sum = 6, Mean = 2;  SS = 2

  SS = Σx² − (Σx)²/N
     = (1² + 2² + 3²) − (1 + 2 + 3)²/3
     = (1 + 4 + 9) − 6²/3
     = 14 − 36/3 = 14 − 12 = 2
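A two-line check that the conceptual and computational SS formulas above agree, using the same x = 1, 2, 3 example:

```python
# Conceptual vs. computational sum of squares (SS).
import numpy as np

x = np.array([1.0, 2.0, 3.0])
ss_conceptual = np.sum((x - x.mean()) ** 2)             # Σ(x − x̄)²
ss_computational = np.sum(x**2) - x.sum()**2 / len(x)   # Σx² − (Σx)²/N
print(ss_conceptual, ss_computational)                  # both 2.0
```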
Questions about measures of central tendency

• Why is the mean our preferred measure of central tendency?
  - It is adjusted for the number of scores.
  - The sum of squared deviations from the mean (SS) is minimized: Σ(x − x̄)² = Σx² − (Σx)²/N = minimum.
  - It takes into account the numerical "weight" of each score: as scores of greater magnitude are added, the mean increases; as scores of lesser magnitude are added, the mean decreases.
Variability

• We are not only interested in a distribution's middle; we are also interested in its spread (or variability).
• Define distributions by: central tendency and variability.
• How do these distributions differ? How can we describe variability with a single number?

[Figure: two frequency histograms of heights in inches (60–70), both with mean x̄ = 64.3 — "Frequency histogram for women in 214" and "Another frequency histogram" — one concentrated near the mean, the other more spread out.]

Variability (cont.)

• One distribution is more spread out (greater variability) than the other. Spread out around what? We describe variability around the mean with one number.
• We want to adjust for the number of scores.
• We want to take into account the numerical "weight" of each score:
  - as scores are farther from the mean, the index of variability should increase;
  - as scores are closer to the mean, the index of variability should decrease.
• Suppose we measured each score's distance from its mean, and then used the average distance as our measure?
  - Measuring the distance from the mean tells us how spread out each score is relative to the mean.
  - Using the average distance adjusts for the number of scores.

Generally

• Population variance (sigma squared, σ²) is the average of the squared deviations from the mean:
    σ² = Σ(x − µ)²/N        (also written SS/N, or [Σx² − (Σx)²/N]/N)
• Population standard deviation (sigma, σ) is the square root of the average of the squared deviations from the mean:
    σ = √[ Σ(x − µ)²/N ]
• Sample variance (s²) is the (corrected) average of the squared deviations from the mean:
    s² = Σ(x − x̄)²/(N − 1)   (also written SS/(N − 1), or [Σx² − (Σx)²/N]/(N − 1))
• Sample standard deviation (s) is the square root of the (corrected) average of the squared deviations from the mean:
    s = √[ Σ(x − x̄)²/(N − 1) ]
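A minimal numpy sketch of the population vs. sample conventions above (the scores are hypothetical); the only difference is the N vs. N − 1 divisor, controlled by ddof:

```python
# Population (ddof=0) vs. sample (ddof=1) variance and SD.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical scores
pop_var = x.var(ddof=0)     # Σ(x − µ)²/N
samp_var = x.var(ddof=1)    # Σ(x − x̄)²/(N − 1), the corrected version
print(pop_var, samp_var, np.sqrt(samp_var))
```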
The Normal distribution density curve

• Symmetric about the mean; bell-shaped and unimodal.
• Mean = Median (50% of the area lies on each side of the mean).
• The mean is a measure of central tendency; the standard deviation is a measure of variability/spread.

Effects of µ and σ

(a) Changing µ shifts the curve along the axis.
(b) Increasing σ increases the spread and flattens the curve.

[Figure: (a) two Normal curves with the same σ = 6 and means µ1 = 160, µ2 = 174; (b) two Normal curves with the same mean and σ1 = 6 vs. σ2 = 12; both panels drawn on an axis from 140 to 200.]
Understanding the standard deviation: σ

[Figure: probabilities and numbers of standard deviations for the Normal distribution — shaded area 0.683 within µ ± σ, 0.954 within µ ± 2σ, 0.997 within µ ± 3σ.]

• 68% chance of falling between µ − σ and µ + σ.
• 95% chance of falling between µ − 2σ and µ + 2σ.
• 99.7% chance of falling between µ − 3σ and µ + 3σ.

Chebyshev's Theorem

• Applies to all distributions.
• Pafnuty Chebyshev (Пафнутий Чебышёв), 1821–1894; also transliterated Chebyshov, Tchebycheff, or Tschebyscheff.
• The theorem gives a lower bound for the probability that a value of a random variable with finite variance lies within a certain distance from the variable's mean; equivalently, it gives an upper bound for the probability that values lie outside the same distance from the mean. It applies even to non-"bell-shaped" distributions and puts bounds on how much of the data is or is not "in the middle".
• Let X be a random variable with mean µ and finite variance σ². Then for any real number k > 0,

    P(µ − kσ < X < µ + kσ) ≥ 1 − 1/k²,  equivalently  P(|X − µ|/σ ≥ k) ≤ 1/k².

• Only the cases k > 1 provide useful information. Why? (For k ≤ 1 the lower bound 1 − 1/k² ≤ 0, so the statement is trivially true.)
Markov's & Chebyshev's Inequalities

• Markov's inequality (Markov was a student of Chebyshev): if Y ≥ 0 and d > 0, then

    P(Y ≥ d) ≤ E(Y)/d.

  Proof sketch: let Z = d if Y ≥ d and Z = 0 otherwise. Since Y ≥ 0, we have Y ≥ Z ≥ 0, so
  E(Y) ≥ E(Z) = d × P(Y ≥ d).

• Chebyshev's inequality follows: let Y = |X − E(X)|² and d = k² with k > 0. Then

    P(|X − E(X)| ≥ k) = P(|X − E(X)|² ≥ k²) ≤ E|X − E(X)|²/k² = Var(X)/k² = σ²/k².

  Substituting k = k′σ gives P(|X − E(X)| ≥ k′σ) ≤ 1/k′².
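A quick Monte Carlo check of the bound, in a sketch using an exponential distribution (my choice here, as a deliberately skewed, non-bell-shaped example):

```python
# Empirical tail probability P(|X − µ| ≥ kσ) never exceeds Chebyshev's 1/k².
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # Exp(1): mean = 1, sd = 1
mu, sigma = 1.0, 1.0
for k in (1.5, 2, 3, 4):
    tail = np.mean(np.abs(x - mu) >= k * sigma)
    print(k, tail, 1 / k**2)                   # empirical tail ≤ bound in each row
```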
Chebyshev's Theorem

Applies to all distributions:

  Number of SDs   Distance from the mean   Minimum proportion of values within that distance
  k = 2           µ ± 2σ                   1 − 1/2² = 0.75
  k = 3           µ ± 3σ                   1 − 1/3² ≈ 0.89
  k = 4           µ ± 4σ                   1 − 1/4² ≈ 0.94

Coefficient of Variation

• The ratio of the standard deviation to the mean, expressed as a percentage — a measure of relative dispersion:

    C.V. = (σ/µ) × 100
Coefficient of Variation – an example

  Sample 1: µ1 = 29, σ1 = 4.6  →  C.V. = (σ1/µ1)(100) = (4.6/29)(100) = 15.86
  Sample 2: µ2 = 84, σ2 = 10   →  C.V. = (σ2/µ2)(100) = (10/84)(100) = 11.90

The inverse problem – Percentiles/quantiles (SOCR)

• (a) p-quantile: programs supply x_p, the x-value for which P(X ≤ x_p) = p.
• (b) 80th percentile (0.8-quantile) of women's heights, X ~ Normal(µ = 162.7, σ = 6.2): the program returns x_0.8 = 167.9, thus 80% of women have height below 167.9 cm. This is equivalent to saying there is an 80% chance that a random observation from the distribution will fall below the 80th percentile.
• So far we studied: given the height value, what is the corresponding percentile? The inverse problem is: what is the height for the 80th percentile/quantile?

(c) Further percentiles of women's heights:

  Percent:          1%    5%    10%   20%   30%   70%   80%   90%   95%   99%
  Proportion:       0.01  0.05  0.1   0.2   0.3   0.7   0.8   0.9   0.95  0.99
  Percentile (cm):  148.3 152.5 154.8 157.5 159.4 166.0 167.9 170.6 172.9 177.1
  (roughly 4'10" up to 5'10" in feet and inches)
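A minimal scipy sketch of the inverse problem above — norm.ppf is the inverse of norm.cdf:

```python
# 80th percentile of women's heights, X ~ Normal(162.7, 6.2).
from scipy.stats import norm

x80 = norm.ppf(0.8, loc=162.7, scale=6.2)
print(x80)                          # ≈ 167.9 cm, matching the table
print(norm.cdf(x80, 162.7, 6.2))    # forward problem: back to 0.8
```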
General Normal Curve

• The general normal curve is defined by:

    y = (1/(√(2π)·σ)) · e^(−(x − µ)²/(2σ²))

  where µ is the mean of the (symmetric) normal curve, and σ is the standard deviation (the spread of the distribution).
• Why worry about both a standard and a general normal curve? How do we convert between the two curves?

Standard Normal Approximation

• The standard normal curve can be used to estimate the percentage of entries in an interval for any process. The protocol for this approximation:
  - Convert the interval (whose percentage of entries we need to assess) to standard units — we saw the algorithm already.
  - Find the corresponding area under the normal curve (from tables or online databases).
  - Report back the percentage.
  (Data → transform to standard units → compute % → report back %.)

[Figure: what percentage of the density-scale histogram is shown on this graph (the interval from 12 to 22)?]
Continuous Variables and Density Curves

• There are no gaps between the values a continuous random variable can take.

Summary

• Random observations arise in two main ways: (i) by sampling populations; and (ii) by observing processes.

(Sampling Distribution Simulation applet: file:///C:/Ivo.dir/UCLA_Classes/Winter2002/AdditionalInstructorAids/SamplingDistributionApplet.html; see also QuincunxApplet.html)

The density curve

• The probability distribution of a continuous variable is represented by a density curve.
  - Probabilities are represented by areas under the curve: the probability that a random observation falls between a and b equals the area under the density curve between a and b.
  - The total area under the curve equals 1.
  - The population (or distribution) mean µX = E(X) is where the density curve balances.
  - When we calculate probabilities for a continuous random variable, it does not matter whether interval endpoints are included or excluded.

For any random variable X

• E(aX + b) = a E(X) + b and SD(aX + b) = |a| SD(X)
Standard Units

The z-score of a value a is:
• the number of standard deviations a is away from the mean;
• positive if a is above the mean and negative if a is below the mean.

The standard Normal distribution has µ = 0 and σ = 1. We usually use Z to represent a random variable with a standard Normal distribution.

Ranges, extremes and z-scores

Central ranges:
• P(−z ≤ Z ≤ z) is the same as the probability that a random observation from an arbitrary Normal distribution falls within z SDs either side of the mean.

Extremes:
• P(Z ≥ z) is the same as the probability that a random observation from an arbitrary Normal distribution falls more than z standard deviations above the mean.
• P(Z ≤ −z) is the same as the probability that a random observation from an arbitrary Normal distribution falls more than z standard deviations below the mean.
Combining Random Quantities

Variation and independence:
• No two animals, organisms, natural or man-made objects are ever identical.
• There is always variation. The only question is whether it is large enough to have a practical impact on what you are trying to achieve.
• Variation in component parts leads to even greater variation in the whole.

Independence

We model variables as being independent:
• if we think they relate to physically independent processes, and
• if we have no data that suggests they are related.

Both sums and differences of independent random variables are more variable than any of the component random variables.

Formulas

• For a constant number a: E(aX) = a E(X) and SD(aX) = |a| SD(X).
• For constant numbers a & b: E(aX + b) = a E(X) + b and SD(aX + b) = |a| SD(X).
• Means of sums and differences of random variables act in an obvious way:
    E(X1 + X2) = E(X1) + E(X2)
  (the mean of the sum is the sum of the means; the mean of the difference is the difference of the means).
• For independent random variables (cf. the Pythagorean theorem):
    SD(X1 + X2) = SD(X1 − X2) = √( SD²(X1) + SD²(X2) )
  [ASIDE: sums and differences of independent Normally distributed random variables are also Normally distributed.]
• For dependent variables this fails. Example: E(X) = 1, SD(X) = 3.
  - Let Y = 2X − 1, so E(Y) = 1 and SD(Y) = 6.
  - Then SD(X + Y) = SD(3X − 1) = 9, NOT √( SD²(X) + SD²(Y) ) = √45 ≈ 7.
  - Defense vs. prosecution arguments may be different for an X + Y value of 18, say.

Example

• The mean rest-state fMRI response for one subject in the left OFC is 125 ± 12.
  - Signals exceeding 150 are thought to indicate imaging effects of cognitive stimulation. Approximately what proportion of the OFC regions would be expected to light up (exceed 150) on average?
  - About what percentage of all voxels will have mean intensities between 111 and 130?
  - Assumptions?

Areas under Standard Normal Curve – Normal Approximation

[Figure: areas under the standard Normal curve used for the Normal approximation.]
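The two Example questions can be answered with a short sketch, under the assumption (the "Assumptions?" prompt) that voxel mean intensities are Normal with µ = 125, σ = 12, converting to standard units z = (x − µ)/σ:

```python
# Normal-approximation answers to the fMRI example.
from scipy.stats import norm

mu, sigma = 125.0, 12.0
p_light_up = 1 - norm.cdf((150 - mu) / sigma)                         # P(X > 150)
p_between = norm.cdf((130 - mu) / sigma) - norm.cdf((111 - mu) / sigma)
print(p_light_up, p_between)   # ≈ 0.019 (light up) and ≈ 0.54 (between 111 and 130)
```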
Conditional Probability

The conditional probability of A occurring given that B occurs is given by

    pr(A | B) = pr(A and B) / pr(B)

Suppose we select one of the 400 patients in the study and we want the probability that the cancer is on the extremities given that it is of type nodular:

    P(cancer on extremities | nodular) = (# nodular patients with cancer on extremities)/(# nodular patients) = 73/125

Melanoma – a type of skin cancer – an example of the laws of conditional probability

    pr(A and B) = pr(A | B) pr(B) = pr(B | A) pr(A);   pr(Ā) = 1 − pr(A)

TABLE 4.6.1: 400 Melanoma Patients by Type and Site

  Type                             Head and Neck   Trunk   Extremities   Row totals
  Hutchinson's melanomic freckle   22              2       10            34
  Superficial                      16              54      115           185
  Nodular                          19              33      73            125
  Indeterminate                    11              17      28            56
  Column totals                    68              106     226           400

Contingency table based on melanoma histological type and its location.

Review

1. Proportions (a partial description of a real population) and probabilities (giving the chance of something happening in a random experiment) may be identical — under the experiment "choose a unit at random".
2. Properties of probabilities: {p_k} define probabilities ⇔ p_k ≥ 0 and Σ_{k=1}^{N} p_k = 1.
Conditional probabilities and 2-way tables

• Many problems involving conditional probabilities can be solved by constructing two-way tables.
• This includes reversing the order of conditioning:

    P(A & B) = P(A | B) × P(B) = P(B | A) × P(A)

TABLE 4.6.5: Number of Individuals Having a Given Mean Absorbance Ratio (MAR) in the ELISA for HIV Antibodies (adapted from Weiss et al. [1985])

  MAR         Healthy donors   HIV patients
  < 2         202              0
  2 – 2.99    73               2
  3 – 3.99    15               7
  4 – 4.99    3                7
  5 – 5.99    2                15
  6 – 11.99   2                36
  12+         0                21
  Total       297              88

With the test cut-off drawn above the 2–2.99 band, the healthy donors above the cut-off are false positives, and the 2 HIV patients below it are false negatives (FNE). The power of the test is

    1 − P(FNE) = 1 − P(Neg | HIV) = 1 − 2/88 ≈ 0.977.
Sensitivity vs. Specificity of a Test

An ELISA is developed to diagnose HIV infections. Serum from 10,000 patients who were positive by Western Blot (the gold-standard assay) was tested, and 9990 were found to be positive by the new ELISA. The manufacturers then used the ELISA to test serum from 10,000 nuns who denied risk factors for HIV infection; 9990 were negative, and the 10 positive results were negative by Western Blot.

                   HIV infected (true case)
  ELISA test       +                  −
  +                9990 (TP)          10 (FP, α)
  −                10 (FN, β)         9990 (TN)
  Total            10,000 (TP + FN)   10,000 (FP + TN)

  Sensitivity = TP/(TP + FN) = 9990/(9990 + 10) = 0.999
  Specificity = TN/(FP + TN) = 9990/(9990 + 10) = 0.999

4 Factors affecting the power

Power is larger for:
• larger sample size (positive effect);
• smaller sample variance (variance has a negative effect);
• larger effect size (positive effect);
• a larger chosen level for α (positive effect).
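The four factors can be made concrete with a rough Monte Carlo sketch (all settings below are hypothetical, and the simulation is illustrative rather than a formal power calculation):

```python
# Estimated power of a two-sample t-test as n, effect size, variance, and alpha vary.
import numpy as np
from scipy import stats

def power(n, effect, sd, alpha, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        a = rng.normal(0.0, sd, n)
        b = rng.normal(effect, sd, n)       # true mean difference = effect
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / reps

print(power(n=20, effect=1.0, sd=2.0, alpha=0.05))   # baseline
print(power(n=80, effect=1.0, sd=2.0, alpha=0.05))   # larger n   -> higher power
print(power(n=20, effect=2.0, sd=2.0, alpha=0.05))   # larger effect -> higher power
print(power(n=20, effect=1.0, sd=4.0, alpha=0.05))   # more variance -> lower power
print(power(n=20, effect=1.0, sd=2.0, alpha=0.20))   # larger alpha  -> higher power
```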
Hypotheses

Guiding principles:
• We cannot rule in a hypothesized value for a parameter; we can only determine whether there is evidence to rule out a hypothesized value.
• The null hypothesis tested is typically a skeptical reaction to a research hypothesis.

Comments

• Why can't we (rule in) prove that a hypothesized value of a parameter is exactly true? (Because estimates constructed from data always involve sampling — and possibly non-sampling — errors, which affect the resulting estimate. Even if we do 60,000 ESP tests, as we saw earlier, repeated estimates will come out like 0.2, 0.200001, 0.199999, etc. — none of which need be exactly the theoretically correct 0.2.)
• Why use the rule-out principle? (Since we can't use the rule-in method, we instead look for compelling evidence against the hypothesized value — to reject it.)
• Why are the null hypothesis & significance testing typically used? (H0 is the skeptical reaction to a research hypothesis; significance testing is used to check whether differences or effects seen in the data can be explained simply in terms of sampling variation.)
The t-test

Using θ̂ to test H0: θ = θ0 versus some alternative H1.

STEP 1: Calculate the test statistic

    t0 = (θ̂ − θ0)/se(θ̂) = (estimate − hypothesized value)/(standard error)

[This tells us how many standard errors the estimate is above the hypothesized value (t0 positive) or below the hypothesized value (t0 negative).]

STEP 2: Calculate the P-value using the following table.
STEP 3: Interpret the P-value in the context of the data.

  Alternative hypothesis   Evidence against H0 provided by                    P-value
  H1: θ > θ0               θ̂ too much bigger than θ0 (θ̂ − θ0 too large)       P = pr(T ≥ t0)
  H1: θ < θ0               θ̂ too much smaller than θ0 (θ̂ − θ0 too negative)   P = pr(T ≤ t0)
  H1: θ ≠ θ0               θ̂ too far from θ0 (|θ̂ − θ0| too large)             P = 2 pr(T ≥ |t0|)

  where T ~ Student(df).

Comparing Proportions – matching groups

• The samples are large enough to use the Normal approximation. Since the two proportions come from totally different mothers, they are independent, so

    t0 = (estimate − hypothesized value)/SE = (p̂1 − p̂2 − 0)/SE(p̂1 − p̂2),
    SE(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ),
    P-value = Pr(T ≥ t0).

Statistical vs. Practical Significance – CIs

• Computed by:

    estimate ± z × SE = p̂1 − p̂2 ± 1.96 × SE(p̂1 − p̂2)
                      = p̂1 − p̂2 ± 1.96 × √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
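A minimal sketch of the two-proportion test and 95% CI above; the counts x1/n1 and x2/n2 are hypothetical, not from the lecture:

```python
# Two-proportion z-test (Normal approximation) and 95% confidence interval.
import numpy as np
from scipy.stats import norm

x1, n1, x2, n2 = 45, 120, 30, 130      # hypothetical successes / sample sizes
p1, p2 = x1 / n1, x2 / n2
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
t0 = (p1 - p2 - 0) / se                # (estimate − hypothesized value)/SE
p_value = 2 * (1 - norm.cdf(abs(t0)))  # two-sided
ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
print(t0, p_value, ci)
```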
Analysis of two independent samples – urinary androsterone levels

Urinary androsterone levels – data, dot plots, and 95% CI. Relations between hormonal levels and homosexuality, Margolese [1970]. Hormonal levels are lower for homosexuals. The samples are independent, as the subjects are unrelated. Results: t-test P-value 0.004, with CI(µHet − µHom) = [0.4, 1.7]. Is the Normality assumption satisfied? Skewed?

TABLE 10.2.1: Urinary Androsterone Levels (mg/24 hr)

  Homosexual:   2.5, 1.8, 1.6, 3.1, 3.4, 1.3, 2.3, 1.6, 2.5, 3.4, 1.6, 4.3, 2.0, 2.9, 3.2
  Heterosexual: 3.9, 4.0, 3.9, 3.8, 3.9, 2.2, 4.6, 4.3, 3.1, 2.7, 2.3

[Figure 10.2.1: dot plots of the androsterone data (with 95% CIs), on a scale of 1–5 mg/24 hrs.]

Figure 10.2.3: Minitab two-sample t-output for the androsterone data:

  Two Sample T-Test and Confidence Interval

  Two sample T for androsterone
            N     Mean    StDev   SE Mean
  hetero    11    3.518   0.721   0.22
  homose    15    2.500   0.923   0.24

  95% CI for mu (hetero) - mu (homose): (0.35, 1.69)
  T-Test mu (hetero) = mu (homose) (vs not =): T = 3.16  P = 0.0044  DF = 23

Source: Margolese [1970]. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.
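As a sketch, the Minitab output can be reproduced with scipy's Welch (unequal-variance) t-test; the group lists follow the table above (whose grouping matches the reported sample sizes and means):

```python
# Welch two-sample t-test on the androsterone data.
from scipy import stats

hetero = [3.9, 4.0, 3.9, 3.8, 3.9, 2.2, 4.6, 4.3, 3.1, 2.7, 2.3]
homo = [2.5, 1.8, 1.6, 3.1, 3.4, 1.3, 2.3, 1.6, 2.5, 3.4, 1.6, 4.3, 2.0, 2.9, 3.2]

t, p = stats.ttest_ind(hetero, homo, equal_var=False)   # Welch, unequal variances
print(t, p)   # t ≈ 3.16, P ≈ 0.004, matching the Minitab output (DF ≈ 23)
```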
Comparing two means for independent samples

Suppose we have two samples/means/distributions as follows: {x̄1, N(µ1, σ1)} and {x̄2, N(µ2, σ2)}. We have seen before that to make inferences about µ1 − µ2 we can use a t-test for H0: µ1 − µ2 = 0, with

    t0 = ((x̄1 − x̄2) − 0)/SE(x̄1 − x̄2)

and

    CI(µ1 − µ2) = (x̄1 − x̄2) ± t × SE(x̄1 − x̄2).

If the two samples are independent, we use the SE formula

    SE = √( s1²/n1 + s2²/n2 ),  with df = min(n1 − 1, n2 − 1).

This gives a conservative approach for hand calculation of an approximation to what is known as the Welch procedure, which has a complicated exact formula.
Comparing two means for independent samples (cont.)

• How sensitive is the two-sample t-test to non-Normality in the data? (The two-sample t-tests and CIs are even more robust than the one-sample tests against non-Normality, particularly when the shapes of the two distributions are similar and n1 = n2 = n, even for small n; remember df = n1 + n2 − 2.)
• Are there nonparametric alternatives to the two-sample t-test? (The Wilcoxon rank-sum test and the Mann-Whitney test — equivalent tests, same P-values.)
• What difference is there between the quantities tested and estimated by the two-sample t-procedures and the nonparametric equivalent? (Nonparametric tests are based on the ordering, not the size, of the data and hence use the median, not the mean, as the measure of center. The equality of the two medians is tested, and CI(µ̃1 − µ̃2) is estimated.)

Means for independent samples – equal or unequal variances?

The pooled t-test is used for samples with assumed equal variances. Under the Normality assumption and equal variances,

    (x̄1 − x̄2 − 0)/SE(x̄1 − x̄2),  where SE = s_p √(1/n1 + 1/n2),
    s_p = √( [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2) ),

is exactly Student's t distributed with df = n1 + n2 − 2. Here s_p is called the pooled estimate of the variance, since it pools information from the two samples to form a combined estimate of the single variance σ1² = σ2² = σ². The book recommends routine use of the Welch unequal-variance method.

Paired Comparisons

• An fMRI study of N subjects: the point in the time course of maximal activation in the rostral and caudal medial premotor cortex was identified, and the percentage changes in response to the go and no-go tasks from the rest state were measured. Similarly, the points of maximal activity during the go and no-go tasks were identified in the primary motor cortex. Paired t-test comparisons between the go and no-go percentage changes were performed across subjects for these regions of maximal activity.
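A minimal paired-comparison sketch in the spirit of the go/no-go example; the numbers are hypothetical percentage signal changes, not study data:

```python
# Paired t-test: analyzing the within-subject differences (a one-sample problem).
from scipy import stats

go =   [1.8, 2.1, 1.6, 2.4, 2.0, 1.9, 2.2, 1.7]   # hypothetical % change, go task
nogo = [1.5, 1.9, 1.4, 2.0, 1.8, 1.6, 2.1, 1.5]   # same subjects, no-go task

t, p = stats.ttest_rel(go, nogo)
print(t, p)
```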
2-sample t-tests and intervals for differences between means µ1 − µ2

Assume:
• statistically independent random samples from the two populations of interest;
• both samples come from Normal distributions.
The pooled method also assumes that σ1 = σ2; the Welch (unpooled) method does not.

Two-sample t-methods are:
• remarkably robust against non-Normality;
• possibly sensitive to the presence of outliers in small to moderate-sized samples;
• (for one-sided tests) reasonably sensitive to skewness.
The Wilcoxon or Mann-Whitney test is a nonparametric alternative to the two-sample t-test.

Paired data

• We have to distinguish between independent and related samples because they require different methods of analysis. Paired data are an example of related data.
• With paired data, we analyze the differences — this converts the initial problem into a one-sample problem.
• The sign test and the Wilcoxon signed-rank test are nonparametric alternatives to the one-sample and paired t-tests.
We know how to analyze 1- & 2-sample data. What if we have more than 2 samples? One-way ANOVA, F-test

One-way ANOVA refers to the situation of having one factor (or categorical variable) which defines group membership — e.g., comparing 4 fMRI stimuli in a block design.

Hypotheses for the one-way analysis-of-variance F-test:
• Null hypothesis: all of the underlying true means are identical.
• Alternative: differences exist between some of the true means.

Form of a typical ANOVA table

  Source    Sum of squares            df          Mean sum of squares   F-statistic       P-value
  Between   Σ_i n_i (x̄_i. − x̄..)²     k − 1       s_B²                  f0 = s_B²/s_W²    pr(F ≥ f0)
  Within    Σ_i (n_i − 1) s_i²        n_tot − k   s_W²
  Total     Σ_i Σ_j (x_ij − x̄..)²     n_tot − 1

  Mean sum of squares = (sum of squares)/df.

Interpreting the P-value from the F-test

(The null hypothesis is that all underlying true means are identical.)
• A large P-value indicates that the differences seen between the sample means could be explained simply in terms of sampling variation.
• A small P-value indicates evidence that real differences exist between at least some of the true means, but gives no indication of where the differences are or how big they are.
• To find out how big any differences are, we need confidence intervals.

Where did the F-statistic come from?

• The F-test statistic f0 applies when we have independent samples, each from one of k Normal populations N(µ_i, σ); note that the same variance is assumed.
• How do we obtain intuitive evidence against H0? Well-separated sample means — i.e., differences of sample means that are large compared to their internal (within-group) variability. Which of the following examples indicates that group differences are "large"?

[Figure: three examples of dot plots for Gp 1, Gp 2, Gp 3 with varying separation between the groups relative to the within-group spread.]
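A short sketch of the F-test with scipy; the four groups are hypothetical (e.g., responses to 4 fMRI stimuli in a block design):

```python
# One-way ANOVA: f0 = s_B^2 / s_W^2, with P-value pr(F >= f0).
from scipy import stats

g1 = [12.1, 13.4, 11.8, 12.9]
g2 = [14.2, 15.1, 13.8, 14.6]
g3 = [12.5, 12.0, 13.1, 12.7]
g4 = [15.0, 14.4, 15.6, 14.9]

f0, p = stats.f_oneway(g1, g2, g3, g4)
print(f0, p)
```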
More about the F-test

• s_B² = Σ n_i (x̄_i. − x̄..)² / (k − 1) is a measure of the variability of the sample means — how far apart they are.
• s_W² = Σ (n_i − 1) s_i² / (n_tot − k) reflects the average internal variability within the samples.
• The F-test statistic f0 tests H0 by comparing the variability of the sample means (numerator) with the variability within the samples (denominator).
• Evidence against H0 is provided by values of f0 which would be unusually large if H0 were true.

F-test assumptions

1. Samples are independent — physically independent subjects, units, or objects are being studied.
2. Samples come from Normal distributions, N(µ_i, σ); this assumption is especially sensitive for small numbers of observations n_i.
3. Standard deviations should be equal within all samples: σ1 = σ2 = … = σk = σ (rule of thumb: 1/2 ≤ σk/σj ≤ 2).

How do we check/validate these assumptions for our data? For the reading-score improvement data: independence is clear, since different groups of students are used; dot plots of the group data show no evidence of non-Normality; and the sample SDs are very similar, hence we assume the population SDs are similar.
F-test assumptions (cont.)

1. Whether the two samples are independent of each other is generally determined by the structure of the experiment from which they arise. Obviously correlated samples, such as a set of pre- and post-test observations on the same subjects, are not independent, and such data would be more appropriately tested by a one-sample test on the paired differences.
2. Two random variables are independent if their joint probability density is the product of their individual (marginal) probability densities. Less technically, if two random variables A and B are independent, then the probability of any given value of A is unchanged by knowledge of the value of B. A sample of mutually independent random variables is an independent sample.
Approaches for modeling data relationships: Regression and Correlation

• There are random and nonrandom variables.
• Correlation applies if both variables (X and Y) are random (e.g., we saw a previous example: systolic vs. diastolic blood pressure, SYSVOL/DIAVOL) and are treated symmetrically.
• Regression applies when we want to single out one of the variables (the response variable, Y) and use the other variable as a predictor (the explanatory variable, X), which explains the behavior of the response variable Y.
• Regression is a way of studying relationships between variables (random/nonrandom) for predicting or explaining the behavior of one variable (the response) in terms of others (the explanatory variables, or predictors).

Regression relationship = trend + residual scatter

[Figure: (a) sales/income scatter plots against disposable income ($9000–$12000), showing a trend plus residual scatter. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 1999.]

Correlation Coefficient

The correlation coefficient (−1 ≤ R ≤ 1) is a measure of linear association — of the clustering around a line — of multivariate data. The relationship between two variables (X, Y) can be summarized by (µX, σX), (µY, σY), and the correlation coefficient R. R = 1: perfect positive correlation (straight-line relationship); R = 0: no correlation (random cloud scatter); R = −1: perfect negative correlation.

Computing R(X, Y): standardize, multiply, average —

    R(X, Y) = (1/(N − 1)) Σ_{k=1}^{N} ((x_k − µX)/σX)((y_k − µY)/σY)

where X = {x1, x2, …, xN}, Y = {y1, y2, …, yN}, and (µX, σX), (µY, σY) are the sample means/SDs.
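The "standardize, multiply, average" recipe above in a short sketch (hypothetical data), checked against numpy's built-in correlation:

```python
# Correlation coefficient two ways: by the recipe, and via np.corrcoef.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

zx = (x - x.mean()) / x.std(ddof=1)     # standardize
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (len(x) - 1)      # multiply, then average with 1/(N-1)
print(r, np.corrcoef(x, y)[0, 1])       # the two agree
```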
Correlation Coefficient – Properties

• Correlation is pseudo-invariant w.r.t. linear transformations of X or Y: R(aX + b, cY + d) = ±R(X, Y), since standardizing removes the shift and scale:

    (a·x_k + b − µ_{aX+b})/σ_{aX+b} = (a·x_k + b − (a·µX + b))/(|a|·σX) = sign(a)·(x_k − µX)/σX,

  so the sign flips only if a (or c) is negative.
• Correlation is symmetric: R(X, Y) = R(Y, X), since the summand treats the standardized x's and y's identically.
• Correlation measures linear association, NOT association in general! So Corr(X, Y) can be misleading for X and Y related in a nonlinear fashion.
Correlation Coefficient – Properties (cont.)

• R measures the extent of linear association between two continuous variables.
• Association does not imply causation — both variables may be affected by a third variable (e.g., age as a confounding variable).

Least squares criterion

Least squares criterion: choose the values of the parameters to minimize the sum of squared prediction errors (the sum of squared residuals),

    Σ_{i=1}^{n} (y_i − ŷ_i)²

[Figure (c): prediction errors — the ith data point (x_i, y_i) has prediction error y_i − ŷ_i; choose the line with the smallest sum of squared prediction errors, min Σ (y_i − ŷ_i)².]

The least squares line

Its parameters are denoted: intercept β̂0 and slope β̂1. The least-squares line is ŷ = β̂0 + β̂1·x, with

    β̂1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²,   β̂0 = ȳ − β̂1·x̄.
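A sketch implementing the slope/intercept formulas directly (hypothetical data), checked against np.polyfit:

```python
# Least-squares line by the slide formulas, with a library cross-check.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.2, 5.1, 6.9, 9.4, 10.8, 13.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)
print(np.polyfit(x, y, 1))   # returns (slope, intercept) — same values
```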
The simple linear model

(a) The simple linear model: when X = x,

    Y ~ Normal(µY, σ), where µY = β0 + β1·x,

or equivalently

    Y = β0 + β1·x + U, where the random error U ~ Normal(0, σ).

(b) Data sampled from the model.

[Figure: the true line µY = β0 + β1·x with Normal error distributions centered on it at x1, x2, x3, x4.]

Data generated from Y = 6 + 2x + error (U)

The dotted line is the true line and the solid line is the data-estimated LS line. Note the differences between the true values β0 = 6, β1 = 2 and their estimates β̂0 and β̂1.

[Figure: five simulated samples, each with its fitted LS line; the estimates vary noticeably from sample to sample (intercept estimates ranging from about 3.6 to 9.1 and slope estimates from about 1.1 to 2.3), with combined estimates β̂0 = 7.44, β̂1 = 1.70.]

Histograms of least-squares estimates from 1,000 data sets (Figure 12.4.3)

Data generated from the model Y = 6 + 2x + U, where U ~ Normal(µ = 0, σ = 3):
• Estimates of the intercept β̂0: mean = 6.05, std dev = 2.34 (centered on the true value 6).
• Estimates of the slope β̂1: mean = 1.98, std dev = 0.46 (centered on the true value 2).
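The 1,000-data-set experiment can be re-run in a few lines (the x design points are an assumption, since the figure's are not recoverable); the estimates come out centered on the true β0 = 6 and β1 = 2, illustrating unbiasedness:

```python
# Sampling distribution of LS estimates for Y = 6 + 2x + U, U ~ Normal(0, 3).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 8, 12)                 # fixed design (assumed)
b0s, b1s = [], []
for _ in range(1000):
    y = 6 + 2 * x + rng.normal(0, 3, x.size)
    b1, b0 = np.polyfit(x, y, 1)          # (slope, intercept)
    b0s.append(b0)
    b1s.append(b1)
print(np.mean(b0s), np.std(b0s))          # mean ≈ 6: unbiased intercept
print(np.mean(b1s), np.std(b1s))          # mean ≈ 2: unbiased slope
```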
R-Square

R² is a term describing how much of the variation in the data is explained by the predictor X:

    R² = 1 − SS(residual)/SS(total) = 1 − Σ(data_i − fit_i)² / Σ(data_i − data_mean)²

where SS(residual) is the sum of squared prediction errors and SS(total) measures the total variation in the data. This should be intuitive: R² approaches 1 as the regression approaches a perfect fit. The R-squared value is the fraction of the variance (not 'variation') in the data that is explained by a regression model (which does not have to be simple linear).

Summary

For the simple linear model, least-squares estimates are unbiased [E(β̂) = β] and Normally distributed. Noisier data produce more-variable least-squares estimates.
Summary (cont.)

1. Before considering using the simple linear model, what sort of pattern would you be looking for in the scatter plot? (A linear trend, with constant scatter spread across the range of X.)
2. What assumptions are made by the simple linear model (SLM)? (X is linearly related to the mean value of the Y observations at each X, µY = β0 + β1·x, where β0 and β1 are the true values of the intercept and slope of the SLM; the LS estimates β̂0 and β̂1 estimate the true values of β0 and β1; and the random errors U = Y − µY ~ N(0, σ).)
3. If the simple linear model holds, what do you know about the sampling distributions of the least-squares estimates? (They are unbiased and Normally distributed.)
4. In the simple linear model, what behavior is governed by σ? (The spread of the scatter of the data around the trend.)
5. Our estimate of σ can be thought of as a sample standard deviation for the set of prediction errors from the least-squares line.
RMS Error for regression

• Error = actual value − predicted value.
• The RMS error for the regression line y = β̂0 + β̂1·x fitted to points (x_k, y_k) (written out here for 5 points, as on the slide) is

    RMS = √( [ (y1 − ŷ1)² + (y2 − ŷ2)² + (y3 − ŷ3)² + (y4 − ŷ4)² + (y5 − ŷ5)² ] / (5 − 1) ),
    where ŷ_k = β̂0 + β̂1·x_k, 1 ≤ k ≤ 5.

Compute the RMS Error for this regression line

  X: 1   2   3   4   5   6   7   8
  Y: 9  15  12  19  11  20  22  18

• First compute the LS linear fit (estimate β̂0 and β̂1).
• Then compute the individual errors y_k − ŷ_k, where ŷ_k = β̂0 + β̂1·x_k, 1 ≤ k ≤ 8.
• Finally compute the cumulative RMS measure √( Σ_k (y_k − ŷ_k)² / (n − 1) ).
• Note the correlation coefficient formula, which reappears in the slope of the LS line:

    R(X, Y) = (1/(N − 1)) Σ_{k=1}^{N} ((x_k − µX)/σX)((y_k − µY)/σY),

  with X = {x1, x2, …, xN}, Y = {y1, y2, …, yN}, and (µX, σX), (µY, σY) the sample means/SDs.
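A worked sketch of the exercise above (it also computes R² from the earlier R-Square slide):

```python
# LS fit, individual errors, RMS error, and R² for the 8-point exercise.
import numpy as np

x = np.arange(1.0, 9.0)                                   # X: 1..8
y = np.array([9, 15, 12, 19, 11, 20, 22, 18], dtype=float)

b1, b0 = np.polyfit(x, y, 1)                              # (slope, intercept)
yhat = b0 + b1 * x
errors = y - yhat                                         # actual − predicted
rms = np.sqrt(np.sum(errors**2) / (len(x) - 1))           # n−1 denominator, per slide
r2 = 1 - np.sum(errors**2) / np.sum((y - y.mean())**2)
print(b0, b1, rms, r2)
```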
Recall the correlation coefficient …

Another form for the correlation coefficient is:

    R(X; Y) = Corr(X; Y) = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² × Σ_i (y_i − ȳ)² )

and since Σ_i (x_i − x̄)(y_i − ȳ) = Σ_i (x_i y_i − x̄ y_i − x_i ȳ + x̄ ȳ) = Σ_i x_i y_i − n·x̄·ȳ,

    R(X; Y) = ( Σ_i x_i y_i − n·x̄·ȳ ) / √( Σ_i (x_i − x̄)² × Σ_i (y_i − ȳ)² ).

Misuse of the correlation coefficient

[Figure: some patterns with r = 0 — panels (a), (b), (c) show clearly structured but nonlinear relationships, all of which have r = 0.]

Linear Regression

• Regression relationship = trend + residual scatter:
    y = β̂0 + β̂1·x + Err
• Trend = the best linear-fit (LS) line ŷ = β̂0 + β̂1·x, with
    β̂1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,  β̂0 = ȳ − β̂1·x̄.
• Scatter = residual (prediction) error, Err = Obs − Pred:
    Σ_i (y_i − ŷ_i)² = (y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)².

Another notation for the slope of the LS line

Note that there is a slightly different (equivalent) form for the slope of the least-squares best-linear-fit line:

    β̂1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)² = Corr(X; Y) × SD(Y)/SD(X),  β̂0 = ȳ − β̂1·x̄.
Definition of the population median

1. The population median is defined as the number in the middle of the distribution of the RV; i.e., 50% of the data lies below and 50% above the median.
2. Under what circumstances is the population median the same as the population mean? (When the distribution is symmetric.)
3. Why do we use the population median rather than the population mean in the sign test? (For a skewed distribution the mean may not be representative, and it may be heavily influenced by outliers.)
4. Why is the model for the sign test like tossing a fair coin? (In the sign test we test H0: µ̃ = 0. Under H0 a random observation is as likely to be below µ̃ as above µ̃, so each observation gets a + or − sign with the same probability — hence the coin-toss model: a distribution-free, nonparametric approach. Testing H0 is just like testing whether a coin is biased or unbiased.)

Why Use Nonparametric Statistics?

• Parametric tests are based upon assumptions that may include the following:
  - The data have the same variance, regardless of the treatments or conditions in the experiment.
  - The data are normally distributed for each of the treatments or conditions in the experiment.
• What happens when we are not sure that these assumptions have been satisfied?
How Do Nonparametric Tests Compare with the Usual z, t, and F Tests?

• Studies have shown that when the usual assumptions are satisfied, nonparametric tests are about 95% efficient when compared to their parametric counterparts.
• When normality and common variance are not satisfied, the nonparametric procedures can be much more efficient than their parametric equivalents.

The Wilcoxon Rank Sum Test

• Suppose we wish to test the hypothesis that two distributions have the same center.
• We select two independent random samples, one from each population. Designate each of the observations from population 1 as an "A" and each of the observations from population 2 as a "B".
• If H0 is true, and the two samples have been drawn from the same population, then when we rank the values in both samples from small to large, the A's and B's should be randomly mixed in the rankings.

What happens when H0 is true?

• Suppose we had 5 measurements (A) from population 1 and 6 measurements (B) from population 2.
• If they were drawn from the same population, the rankings might look like this:
    A B A B B A B A B B A
• In this case, if we summed the ranks of the A measurements and the ranks of the B measurements, the sums would be similar.

What happens if H0 is not true?

• If the observations come from two different populations, perhaps with population 1 lying to the left of population 2, the ranking of the observations might take the following ordering:
    A A A B A B A B B B
• In this case the sum of the ranks of the B observations would be larger than that for the A observations.

How to Implement Wilcoxon's Rank Test

• Rank the combined sample from smallest to largest.
• Let T1 represent the sum of the ranks of the first sample (the A's).
• Then T1*, defined below, is the sum of the ranks that the A's would have had if the observations were ranked from large to small:
    T1* = n1(n1 + n2 + 1) − T1
The Wilcoxon Rank Sum Test – aka the Mann-Whitney U test

H0: the two population distributions are the same.
Ha: the two populations are in some way different.

• The test statistic is the smaller of T1 and T1*, where T1* = n1(n1 + n2 + 1) − T1.
• Reject H0 if the test statistic is less than the critical value found in the Wilcoxon Rank Sum Test table.
Example: bee wing strokes

The wing-stroke frequencies of two species of bees were recorded for a sample of n1 = 4 from species 1 and n2 = 6 from species 2. Can you conclude that the distributions of wing strokes differ for these two species? Use α = .05.

H0: the two species are the same.
Ha: the two species are in some way different.

Data (ranks in parentheses):

  Species 1: 235 (10), 225 (9), 190 (8), 188 (7)
  Species 2: 180 (3.5), 169 (1), 180 (3.5), 185 (6), 178 (2), 182 (5)

If several measurements are tied, each gets the average of the ranks they would have had if they were not tied (see x = 180).

1. The sample with the smaller sample size is called sample 1. We rank the 10 observations from smallest to largest, shown in parentheses in the table. T1 = 7 + 8 + 9 + 10 = 34 and T1* = 4(4 + 6 + 1) − 34 = 10, so the test statistic is T = 10.
2. The critical value of T from Table 7(b) for a two-tailed test with α/2 = .025 is T = 12; H0 is rejected if T ≤ 12.

Since 10 ≤ 12, reject H0: the data provide sufficient evidence indicating a difference in the distributions of wing-stroke frequencies.
Minitab output

  Mann-Whitney Test and CI: Species1, Species2
  Species1   N = 4   Median = 207.50
  Species2   N = 6   Median = 180.00
  Point estimate for ETA1-ETA2 is 30.50
  95.7 Percent CI for ETA1-ETA2 is (5.99, 56.01)
  W = 34.0
  Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.0142
  The test is significant at 0.0139 (adjusted for ties)

Minitab calls the procedure the Mann-Whitney U Test; it is equivalent to the Wilcoxon Rank Sum Test. The test statistic is W = T1 = 34, with p-value = .0142, so H0 is rejected at α = .05 (though it would not be rejected at α = .01).

Large Sample Approximation: Wilcoxon Rank Sum Test

Recall T1 = 34 and T1* = 10, where T1 is the sum of the ranks of sample 1 (the A's) and T1* = n1(n1 + n2 + 1) − T1 = 4(4 + 6 + 1) − 34 = 10.

When n1 and n2 are large (greater than 10 is large enough), a Normal approximation can be used to approximate the critical values:
1. Calculate T1 and T1*. Let T = min(T1, T1*).
2. The statistic z = (T − µT)/σT has an approximate standard Normal distribution, with

    µT = n1(n1 + n2 + 1)/2  and  σT² = n1·n2·(n1 + n2 + 1)/12.
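A sketch of the bee example with scipy; note scipy reports the Mann-Whitney U statistic (U1 = T1 − n1(n1 + 1)/2 = 34 − 10 = 24), which is equivalent to the rank-sum statistic:

```python
# Mann-Whitney U / Wilcoxon rank-sum test on the bee wing-stroke data.
from scipy.stats import mannwhitneyu

species1 = [235, 225, 190, 188]
species2 = [180, 169, 180, 185, 178, 182]

u, p = mannwhitneyu(species1, species2, alternative="two-sided")
print(u, p)   # U = 24; p in the neighborhood of Minitab's 0.014
```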
Wilcoxon Rank Sum test ↔ two-sample t-test

When should you use the Wilcoxon Rank Sum test instead of the two-sample t-test for independent samples?
• when the responses can only be ranked and not quantified (e.g., ordinal qualitative data);
• when the F-test or the rule of thumb shows a problem with the equality of variances;
• when a normality plot shows a violation of the normality assumption.

The Sign Test

• The sign test is a fairly simple procedure that can be used to compare two populations when the samples consist of paired observations.
• It can be used:
  - when the assumptions required for the paired-difference test are not valid, or
  - when the responses can only be ranked as "one better than the other" but cannot be quantified.
The Sign Test

• For each pair, measure whether the first response — say, A — exceeds the second response — say, B.
• The test statistic is x, the number of times that A exceeds B in the n pairs of observations.
• Only pairs without ties are included in the test.
• Critical values for the rejection region or exact p-values can be found using the cumulative binomial distribution (SOCR resource online).

The Sign Test (hypotheses)

H0: the two populations are identical, versus Ha: a one- or two-tailed alternative.

This is equivalent to:

H0: p = P(A exceeds B) = .5, versus Ha: p (≠, <, or >) .5.

Test statistic: x = number of plus signs. Rejection region and p-values come from Bin(n = size, p).
Example: the gourmet chefs

Two gourmet chefs each tasted and rated eight different meals from 1 to 10. Does it appear that one of the chefs tends to give higher ratings than the other? Use α = .01.

  Meal:    1  2  3  4  5  6  7  8
  Chef A:  6  4  7  8  2  4  9  7
  Chef B:  8  5  4  7  3  7  9  8
  Sign:    −  −  +  +  −  −  0  −

H0: the rating distributions are the same (p = .5).
Ha: the ratings are different (p ≠ .5).
Test statistic: x = number of plus signs = 2, with n = 7 (omit the tied pair).

Use the cumulative binomial table with n = 7 and p = .5:

  k:        0     1     2     3     4     5     6     7
  P(x ≤ k): .008  .062  .227  .500  .773  .938  .992  1.000

p-value = P(observe x = 2 or something equally as unlikely)
        = P(x ≤ 2) + P(x ≥ 5) = 2(.227) = .454.

The p-value of .454 is too large to reject H0: there is insufficient evidence to indicate that one chef tends to rate meals higher than the other.

Large Sample Approximation: The Sign Test

When n ≥ 25, a Normal approximation can be used to approximate the critical values of the Binomial distribution (Y ~ Bin(n, p) ⇒ E(Y) = np, Var(Y) = np(1 − p); under H0, p = .5):
1. Calculate x = number of plus signs.
2. The statistic z = (x − .5n)/(.5√n) has an approximate standard Normal distribution.
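The table lookup above can be verified exactly with scipy (binomtest requires scipy ≥ 1.7; older versions provided scipy.stats.binom_test):

```python
# Exact sign test for the chef ratings: x = 2 plus signs among n = 7 untied pairs.
from scipy.stats import binomtest

result = binomtest(2, n=7, p=0.5, alternative="two-sided")
print(result.pvalue)   # 0.453125 ≈ .454, matching the binomial-table calculation
```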
Example
You record the number of accidents per day at a large manufacturing plant for both the day and evening shifts for n = 100 days. You find that the number of accidents per day for the evening shift xE exceeded the corresponding number of accidents in the day shift xD on 63 of the 100 days. Do these results provide sufficient evidence to indicate that more accidents tend to occur on one shift than on the other?

H0: the distributions (# of accidents) are the same (p = .5)
Ha: the distributions are different (p ≠ .5)

Test statistic: z = (x − .5n) / (.5√n) = (63 − .5(100)) / (.5√100) = 2.60

For a two-tailed test, we reject H0 if |z| > 1.96 (5% level). H0 is rejected. There is evidence of a difference between the day and night shifts.
Slide 126
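The same arithmetic in code, a sketch assuming scipy for the normal tail probability:

    from math import sqrt
    from scipy.stats import norm

    n, x = 100, 63                       # days observed; days evening exceeded day shift
    z = (x - 0.5 * n) / (0.5 * sqrt(n))  # large-sample sign-test statistic
    print(z, 2 * norm.sf(abs(z)))        # z = 2.60, two-tailed p ~ 0.009 < 0.05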
Which test should you use?
z We compare statistical tests using their power.
   Definition: Power = 1 − β = P(reject H0 when Ha is true)
z The power of the test is the probability of rejecting the null hypothesis when it is false and some specified alternative is true.
z The power is the probability that the test will do what it was designed to do—that is, detect a departure from the null hypothesis when a departure exists.
Slide 127

Which test should you use?
z If all parametric assumptions have been met, the parametric test will be the most powerful.
z If not, a nonparametric test may be more powerful.
z If you can reject H0 with a less powerful nonparametric test, you will not have to worry about parametric assumptions.
z If not, you might try
„ a more powerful nonparametric test, or
„ increasing the sample size to gain more power.
Slide 128
The Wilcoxon Signed-Rank Test
– different from the Wilcoxon Rank Sum Test
z The Wilcoxon Signed-Rank Test is a more powerful nonparametric procedure that can be used to compare two populations when the samples consist of paired observations.
z It uses the ranks of the differences, d = x1 − x2, that we used in the paired-difference test.
Slide 129
The Wilcoxon Signed-Rank Test
– different from the Wilcoxon Rank Sum Test
9 For each pair, calculate the difference d = x1 − x2. Eliminate zero differences.
9 Rank the absolute values of the differences from 1 to n. Tied observations are assigned the average of the ranks they would have gotten if not tied.
ƒ T+ = rank sum for positive differences
ƒ T− = rank sum for negative differences
9 If the two populations are the same, T+ and T− should be nearly equal. If either T+ or T− is unusually large, this provides evidence against the null hypothesis.
Slide 130
The Wilcoxon Signed-Rank Test
H0: the two populations are identical versus
Ha: one- or two-tailed alternative
Test statistic: T = min(T+, T−)
Critical values for a one- or two-tailed rejection region can be found using the Wilcoxon Signed-Rank Test Table.
Slide 131
Example: Cake Densities
To compare the densities of cakes using mixes A and B, six pairs of pans (A and B) were baked side-by-side in six different oven locations. Is there evidence of a difference in density for the two cake mixes?

Location    1      2      3      4      5      6
Cake Mix A  .135   .102   .098   .141   .131   .144
Cake Mix B  .129   .120   .112   .152   .135   .163
d = xA−xB   .006   −.018  −.014  −.011  −.004  −.019

H0: the density distributions are the same
Ha: the density distributions are different
Slide 132
Cake Densities

Location    1      2      3      4      5      6
Cake Mix A  .135   .102   .098   .141   .131   .144
Cake Mix B  .129   .120   .112   .152   .135   .163
d = xA−xB   .006   −.018  −.014  −.011  −.004  −.019
Rank        2      5      4      3      1      6

Rank the 6 differences, without regard to sign.
Calculate: T+ = 2 and T− = 5 + 4 + 3 + 1 + 6 = 19.
The test statistic is T = min(T+, T−) = 2.
Rejection region: Use Table 8. For a two-tailed test with α = .05, reject H0 if T ≤ 1.

Do not reject H0. There is insufficient evidence to indicate that there is a difference in densities for the two cake mixes.
Slide 133
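For comparison, the cake-density test can be run directly with scipy.stats.wilcoxon (an assumption; any signed-rank implementation would do). For a two-sided alternative scipy reports the statistic T = min(T+, T−):

    from scipy.stats import wilcoxon

    mix_a = [.135, .102, .098, .141, .131, .144]
    mix_b = [.129, .120, .112, .152, .135, .163]

    T, p = wilcoxon(mix_a, mix_b, alternative="two-sided")
    print(T, p)  # T = 2.0, exact p ~ 0.094 > .05 -> do not reject H0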
Large Sample Approximation: The Signed-Rank Test
When n ≥ 25, a normal approximation can be used to approximate the critical values in Table 8.
1. Calculate T+ and T−. Let T = min(T+, T−).
2. The statistic
   z = (T − µT) / σT
has an approximate standard normal distribution, with
   µT = n(n + 1)/4  and  σT² = n(n + 1)(2n + 1)/24.
Slide 134
The Kruskal-Wallis H Test
z The Kruskal-Wallis H Test is a nonparametric procedure that can be used to compare more than two populations in a completely randomized design.
z Non-parametric equivalent to the ANOVA F-test!
z All n = n1 + n2 + … + nk measurements are jointly ranked.
z We use the sums of the ranks of the k samples to compare the distributions.
Slide 135
The Kruskal-Wallis H Test
9 Rank the total measurements in all k samples from 1 to n. Tied observations are assigned the average of the ranks they would have gotten if not tied.
9 Calculate
ƒ Ti = rank sum for the ith sample, i = 1, 2, …, k
ƒ n = n1 + n2 + … + nk
9 And the test statistic H is (analog to: F = MSST/MSSE)
   H = [12 / (n(n + 1))] Σi=1..k (Ti² / ni) − 3(n + 1)
Slide 136
The Kruskal-Wallis H Test
H0: the k distributions are identical versus
Ha: at least one distribution is different
Test statistic: Kruskal-Wallis H
When H0 is true, the test statistic H has an approximate χ2 distribution with df = k − 1.
Use a right-tailed rejection region or p-value based on the Chi-square distribution.
Slide 137

Example
Four groups of students were randomly
assigned to be taught with four different
techniques, and their achievement test scores
were recorded. Are the distributions of test
scores the same, or do they differ in location?
Method:  1   2   3   4
         65  75  59  94
         87  69  78  89
         73  83  67  80
         79  81  62  88
Slide 138
Teaching Methods
Rank the 16 measurements from 1 to 16, and calculate the four rank sums.

Method:  1        2        3        4
         65 (3)   75 (7)   59 (1)   94 (16)
         87 (13)  69 (5)   78 (8)   89 (15)
         73 (6)   83 (12)  67 (4)   80 (10)
         79 (9)   81 (11)  62 (2)   88 (14)
Ti:      31       35       15       55

H0: the distributions of scores are the same
Ha: the distributions differ in location

Test statistic:
   H = [12 / (n(n + 1))] Σ (Ti² / ni) − 3(n + 1)
     = [12 / (16·17)] [(31² + 35² + 15² + 55²) / 4] − 3(17) = 8.96
Slide 139

Teaching Methods
H0: the distributions of scores are the same
Ha: the distributions differ in location
Test statistic: H = [12 / (16·17)] [(31² + 35² + 15² + 55²) / 4] − 3(17) = 8.96
Rejection region: Use Table 5. For a right-tailed chi-square test with α = .05 and df = 4 − 1 = 3, reject H0 if H ≥ 7.81.
Since H = 8.96 ≥ 7.81, reject H0. There is sufficient evidence to indicate that there is a difference in test scores for the four teaching techniques.
Slide 140
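The H statistic and its chi-square p-value can be checked with scipy.stats.kruskal (an assumption), which reproduces the hand computation:

    from scipy.stats import chi2, kruskal

    g1 = [65, 87, 73, 79]
    g2 = [75, 69, 83, 81]
    g3 = [59, 78, 67, 62]
    g4 = [94, 89, 80, 88]

    H, p = kruskal(g1, g2, g3, g4)
    print(H, p)                  # H ~ 8.96, p ~ 0.030
    print(chi2.ppf(0.95, df=3))  # rejection cutoff ~ 7.81, so reject H0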
Key Concepts
I. Nonparametric Methods
1. These methods can be used when the data cannot be measured on a quantitative scale, or when
2. The numerical scale of measurement is arbitrarily set by the researcher, or when
3. The parametric assumptions such as normality or constant variance are seriously violated.
Slide 153

Key Concepts
II. Wilcoxon Rank Sum Test: Independent Random Samples
1. Jointly rank the two samples: Designate the smaller sample as sample 1. Then
   T1 = rank sum of sample 1 ⇒ T1* = n1(n1 + n2 + 1) − T1
2. Use T1 to test for population 1 to the left of population 2.
   Use T1* to test for population 1 to the right of population 2.
   Use the smaller of T1 and T1* to test for a difference in the locations of the two populations.
3. Table 7 of Appendix I has critical values for the rejection of H0.
4. When the sample sizes are large, use the normal approximation:
   z = (T − µT) / σT, where µT = n1(n1 + n2 + 1)/2 and σT² = n1 n2 (n1 + n2 + 1)/12
Slide 154

Key Concepts
III. Wilcoxon Signed-Rank Test: Paired Experiment
1. Calculate the differences in the paired observations. Rank the absolute values of the differences. Calculate the rank sums T+ and T− for the positive and negative differences, respectively. The test statistic T is the smaller of the two rank sums.
2. Table 8 of Appendix I has critical values for the rejection of H0 for both one- and two-tailed tests.
3. When the sample sizes are large, use the normal approximation:
   z = (T − [n(n + 1)/4]) / √([n(n + 1)(2n + 1)]/24)
Slide 155

Key Concepts
IV. Sign Test for a Paired Experiment
1. Find x, the number of times that observation A exceeds observation B for a given pair.
2. To test for a difference in two populations, test H0: p = 0.5 versus a one- or two-tailed alternative.
3. Use Table 1 of Appendix I to calculate the p-value for the test.
4. When the sample sizes are large, use the normal approximation:
   z = (x − .5n) / (.5√n)
Slide 156
Key Concepts
V. Kruskal-Wallis H Test: Completely Randomized Design
1. Jointly rank the n observations in the k samples. Calculate the rank sums, Ti = rank sum of sample i, and the test statistic
   H = [12 / (n(n + 1))] Σ (Ti² / ni) − 3(n + 1)
2. If the null hypothesis of equality of distributions is false, H will be unusually large, resulting in a one-tailed test.
3. For sample sizes of five or greater, the rejection region for H is based on the chi-square distribution with (k − 1) degrees of freedom.
Slide 157

Key Concepts
VI. The Friedman Fr Test: Randomized Block Design
1. Rank the responses within each block from 1 to k. Calculate the rank sums T1, T2, …, Tk, and the test statistic
   Fr = [12 / (bk(k + 1))] Σ Ti² − 3b(k + 1)
2. If the null hypothesis of equality of treatment distributions is false, Fr will be unusually large, resulting in a one-tailed test.
3. For block sizes of five or greater, the rejection region for Fr is based on the chi-square distribution with (k − 1) degrees of freedom.
Slide 158
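A quick sketch of the Friedman test on made-up block data (the treatments, blocks, and values below are hypothetical, and scipy.stats.friedmanchisquare is assumed):

    from scipy.stats import friedmanchisquare

    # Hypothetical randomized block design: 6 blocks, k = 3 treatments.
    t1 = [8.2, 7.9, 6.5, 9.1, 8.0, 7.7]
    t2 = [7.1, 6.8, 6.0, 8.2, 7.3, 6.9]
    t3 = [6.5, 6.9, 5.8, 7.9, 6.8, 6.4]

    Fr, p = friedmanchisquare(t1, t2, t3)
    print(Fr, p)  # Fr is referred to a chi-square with k - 1 = 2 df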
Key Concepts
VII. Spearman's Rank Correlation Coefficient
1. Rank the responses for the two variables from smallest to largest.
2. Calculate the correlation coefficient for the ranked observations:
   rs = Sxy / √(Sxx Syy),  or  rs = 1 − 6 Σ di² / (n(n² − 1)) if there are no ties
3. Table 9 in Appendix I gives critical values for rank correlations significantly different from 0.
4. The rank correlation coefficient detects not only significant linear correlation but also any other monotonic relationship between the two variables.
Slide 159

Sensitivity vs. Specificity
z Sensitivity is a measure of the fraction of gold-standard known (positive) examples that are correctly classified/identified: Sensitivity = TP/(TP+FN)
z Specificity is a measure of the fraction of negative examples that are correctly classified: Specificity = TN/(TN+FP)
z TP = True Positives; TN = True Negatives; FN = False Negatives; FP = False Positives

Ho: no effects (µ = 0)
Test Results   | True Reality: Ho true | True Reality: Ho false
Can’t reject   | TN                    | FN
Reject Ho      | FP                    | TP
Slide 160
Receiver-Operating Characteristic curve
z Receiver Operating Characteristic (ROC) curve: plot the fraction of true positives (sensitivity) against the fraction of false positives (1 − specificity) as the test cut-point varies.
[Figure: two ROC curves over the unit square; a “better test” bows toward the upper-left corner, a “worse test” lies nearer the 45-degree diagonal.]
z Bigger area Î higher test accuracy and better discrimination, i.e., the ability of the test to correctly classify those with and without the disease.
Slide 161

The ROC Curve
z The ROC curve demonstrates several things:
ƒ It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
ƒ The closer the curve follows the left border and then the top border of the ROC space, the more accurate the test.
ƒ The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
ƒ The slope of the tangent line at a cut-point gives the likelihood ratio (LR) for that value of the test. You can check this out on the graph above.
ƒ The area under the curve is a measure of test accuracy.
Slide 162
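A tiny numerical illustration (hypothetical scores, numpy assumed): sweeping the cut-point over the observed scores traces the ROC curve, and the trapezoid rule gives its area.

    import numpy as np

    # Hypothetical diagnostic scores; higher score suggests disease.
    diseased = np.array([4.2, 3.7, 5.1, 2.9, 4.8, 3.3])
    healthy  = np.array([2.1, 3.0, 1.8, 2.6, 3.4, 2.2])

    cuts = np.sort(np.concatenate([diseased, healthy]))[::-1]
    tpr = [0.0] + [(diseased >= c).mean() for c in cuts]  # sensitivity per cut-point
    fpr = [0.0] + [(healthy  >= c).mean() for c in cuts]  # 1 - specificity

    print(np.trapz(tpr, fpr))  # AUC (1.0 = perfect discrimination, 0.5 = useless)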
Bias and Precision
z The bias in an estimator is the distance between the center of the sampling distribution of the estimator and the true value of the parameter being estimated. In math terms, bias = E(Θ̂) − θ, where Θ̂ is the estimator, as a RV, of the true (unknown) parameter θ.
z The precision of an estimator is a measure of how variable the estimator is in repeated sampling.
[Figure: four sampling distributions relative to the value of the parameter: (a) no bias, high precision; (b) no bias, low precision; (c) biased, high precision; (d) biased, low precision.]
z Example: Why is the sample mean an unbiased estimate of the population mean? How about 3/4 of the sample mean?
   E(Θ̂) − µ = E[(1/n) Σk=1..n Xk] − µ = 0
   E(Θ̂) − µ = E[(3/(4n)) Σk=1..n Xk] − µ = (3/4)µ − µ = −µ/4 ≠ 0, in general.
Slide 163
Bias and Precision – 2 unbiased estimators of µ
z Suppose {Y1, Y2, …, Y4N} are IID (µ, σ).
z Let Ȳ(1) = (Ȳa + Ȳb)/2, where Ȳa = (1/(3N)) Σk=1..3N Yk and Ȳb = (1/N) Σk=3N+1..4N Yk;
z And let Ȳ(2) = (1/(4N)) Σk=1..4N Yk.
z Which estimator is better, Ȳ(1) or Ȳ(2)?
   E(Ȳ(1)) = E(Ȳ(2)) = µ, but Var(Ȳ(2)) = σ²/(4N) < σ²/(3N) = Var(Ȳ(1)).
z Both are unbiased, but the variance of the second one is smaller Î that estimator is more precise!!!
Slide 164

Continuous Distributions – χ2 [Chi-Square]
z χ2 [Chi-Square] goodness-of-fit test:
„ Let {X1, X2, …, XN} be IID N(0, 1)
„ W = X1² + X2² + X3² + … + XN²
„ W ~ χ2(df = N); E(W) = N, Var(W) = 2N
„ Note: If {Y1, Y2, …, YN} are IID N(µ, σ2), with SD2(Y) = (1/(N − 1)) Σk=1..N (Yk − Ȳ)², then the statistic W = (N − 1) SD2(Y) / σ2 ~ χ2(df = N − 1).
Slide 165
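A simulation sketch of the comparison (numpy assumed; the split into a first-3N average and a last-N average follows the reconstruction above):

    import numpy as np

    rng = np.random.default_rng(0)
    N, reps = 50, 20_000
    y = rng.normal(loc=10.0, scale=2.0, size=(reps, 4 * N))  # {Y1,...,Y4N} IID

    est1 = 0.5 * (y[:, :3 * N].mean(axis=1) + y[:, 3 * N:].mean(axis=1))
    est2 = y.mean(axis=1)                                    # grand mean of all 4N

    print(est1.mean(), est2.mean())  # both ~ 10: unbiased
    print(est1.var(), est2.var())    # ~ sigma^2/(3N) vs ~ sigma^2/(4N): est2 wins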
Continuous Distributions – F-distribution
z The F-distribution is the ratio of two χ2 random variables.
z Snedecor's F distribution is most commonly used in tests of variance (e.g., ANOVA). The ratio of two chi-squares divided by their respective degrees of freedom is said to follow an F distribution:
   SD2(Y) = (1/(N − 1)) Σk=1..N (Yk − Ȳ)²;   SD2(X) = (1/(M − 1)) Σl=1..M (Xl − X̄)²
   WY = (N − 1) SD2(Y) / σY²;   WX = (M − 1) SD2(X) / σX²
   Fo = [WY / (N − 1)] / [WX / (M − 1)] ~ F(df1 = N − 1, df2 = M − 1)
Slide 168
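The “ratio of chi-squares over their dfs” characterization is easy to check by simulation (numpy/scipy assumed):

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(1)
    N, M, reps = 8, 12, 100_000

    wy = rng.chisquare(N - 1, size=reps) / (N - 1)  # chi-square over its df
    wx = rng.chisquare(M - 1, size=reps) / (M - 1)
    fo = wy / wx                                    # should follow F(N-1, M-1)

    print((fo > 2.0).mean(), f.sf(2.0, N - 1, M - 1))  # simulated vs exact tail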
z Setting for the ANOVA F-test (k independent normal samples):
„ {Y1;1, Y1;2, …, Y1;N1} IID from a Normal(µ1; σ1)
„ {Y2;1, Y2;2, …, Y2;N2} IID from a Normal(µ2; σ2)
„ …
„ {Yk;1, Yk;2, …, Yk;Nk} IID from a Normal(µk; σk)
„ σ1 = σ2 = σ3 = … = σk = σ (in practice, 1/2 ≤ σk/σj ≤ 2)
„ Samples are independent!
Slide 169
Continuous Distributions – Cauchy’s
z Cauchy’s distribution, X ~ Cauchy(t, s), t = location; s = scale
z PDF(X): f(x) = 1 / (sπ(1 + ((x − t)/s)²)), x ∈ R (reals)
z PDF(Std Cauchy(0, 1)): f(x) = 1 / (π(1 + x²))
z The Cauchy distribution is (theoretically) important as an example of a pathological case. Cauchy distributions look similar to a normal distribution. However, they have much heavier tails. When studying hypothesis tests that assume normality, seeing how the tests perform on data from a Cauchy distribution is a good indicator of how sensitive the tests are to heavy-tail departures from normality. The mean and standard deviation of the Cauchy distribution are undefined!!! The practical meaning of this is that collecting 1,000 data points gives no more accurate an estimate of the mean and standard deviation than does a single point (Cauchy Æ Tdf Æ Normal).
Slide 171

Continuous Distributions – Review
z Uniform, (General/Std) Normal, Student’s T, F, χ2, Cauchy distributions. By-hand vs. ProbCalc.htm
z It remains to see a good ANOVA (F-distribution) example:
z SYSTAT Æ File Æ Load (Data) Æ C:\Ivo.dir\Research\Data.dir\WM_GM_CSF_tissueMaps.dir\ATLAS_IVO_WM_GM.xls Æ Statistics Æ ANOVA Æ Est.Model Æ Dependent(Value) Æ Factors(Method, Hemi, TissueType) [For 1/2/3-Way ANOVA]
Slide 172
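The “no amount of data pins down the mean” point can be seen in a few lines (numpy assumed), watching the running mean of Cauchy samples fail to settle:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_cauchy(100_000)
    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    print(running_mean[[99, 999, 9_999, 99_999]])  # still jumping at n = 100,000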
Continuous Distributions – Exponential
z Exponential distribution, X ~ Exponential(λ):
   f(x) = λ e^(−λx), x ≥ 0
z E(X) = 1/λ; Var(X) = 1/λ²
z The exponential model, with only one unknown parameter, is the simplest of all life distribution models.
z Another name for the exponential mean is the Mean Time To Fail or MTTF, and we have MTTF = 1/λ.
z If X is the time between occurrences of rare events that happen on the average with a rate λ per unit of time, then X is distributed exponentially with parameter λ. Thus, the exponential distribution is frequently used to model the time interval between successive random events. Examples of variables distributed in this manner would be the gap length between cars crossing an intersection, life-times of electronic devices, or arrivals of customers at the check-out counter in a grocery store.
Slide 173
Continuous Distributions – Exponential Example
z On weeknight shifts between 6 pm and 10 pm, there are an average of 5.2 calls to the UCLA medical emergency number. Let X measure the time needed for the first call on such a shift. Find the probability that the first call arrives (a) between 6:15 and 6:45, (b) before 6:30. Also find the median time needed for the first call. (Answers: 34.578%; 72.865%.)
„ We must first determine the correct average of this exponential distribution. If we consider the time interval to be 4×60 = 240 minutes, then on average there is a call every 240/5.2 (or 46.15) minutes. Then X ~ Exp(1/46), [E(X) = 46], measures the time in minutes after 6:00 pm until the first call.
Slide 174
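The requested quantities are one-line exponential tail computations; a sketch (part (a) reproduces the quoted 34.578%):

    from math import exp, log

    rate = 1 / 46  # X ~ Exp(1/46), time in minutes after 6:00 pm

    p_a = exp(-15 * rate) - exp(-45 * rate)  # (a) first call between 6:15 and 6:45
    p_b = 1 - exp(-30 * rate)                # (b) first call before 6:30
    median = log(2) / rate                   # median time to the first call
    print(p_a, p_b, median)                  # ~0.3458, ~0.4790, ~31.9 minutes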
Continuous Distributions – Exponential Examples
z Customers arrive at a certain store at an average of 15 per hour. What is the probability that the manager must wait at least 5 minutes for the first customer?
z The exponential distribution is often used in probability to model (remaining) lifetimes of mechanical objects for which the average lifetime is known and for which the probability distribution is assumed to decay exponentially.
z Suppose after the first 6 hours, the average remaining lifetime of batteries for a portable compact disc player is 8 hours. Find the probability that a set of batteries lasts between 12 and 16 hours.
Solutions:
z Here the average waiting time is 60/15 = 4 minutes. Thus X ~ Exp(1/4), E(X) = 4. Now we want P(X > 5) = 1 − P(X ≤ 5). We obtain a right-tail value of .2865. So around 28.65% of the time, the store must wait at least 5 minutes for the first customer.
z Here the remaining lifetime can be assumed to be X ~ Exp(1/8), E(X) = 8. For the total lifetime to be from 12 to 16, the remaining lifetime is from 6 to 10. We find that P(6 ≤ X ≤ 10) = .1859.
Slide 175

Recall the example of Poisson approx to Binomial
z Suppose P(defective chip) = 0.0001 = 10^(−4). Find the probability that a lot of 25,000 chips has > 2 defective!
z Y ~ Binomial(25,000, 0.0001); find P(Y > 2). Note that Z ~ Poisson(λ = np = 25,000 × 0.0001 = 2.5), so
   P(Z > 2) = 1 − P(Z ≤ 2) = 1 − Σz=0..2 (2.5^z / z!) e^(−2.5)
            = 1 − [e^(−2.5) + 2.5 e^(−2.5) + (2.5²/2!) e^(−2.5)] = 0.456
Slide 176
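Checking the approximation against the exact binomial tail (scipy assumed):

    from scipy.stats import binom, poisson

    n, p = 25_000, 1e-4
    print(binom.sf(2, n, p))     # exact P(Y > 2) under Binomial(25000, 0.0001)
    print(poisson.sf(2, n * p))  # Poisson(2.5) approximation, ~0.456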
Normal approximation to Binomial
z Suppose Y ~ Binomial(n, p)
z Then Y = Y1 + Y2 + Y3 + … + Yn, where
„ Yk ~ Bernoulli(p), E(Yk) = p & Var(Yk) = p(1 − p) Î
„ E(Y) = np & Var(Y) = np(1 − p), SD(Y) = (np(1 − p))^1/2
„ Standardize Y:
‰ Z = (Y − np) / (np(1 − p))^1/2
‰ By CLT Î Z ~ N(0, 1). So, Y ~ N[np, (np(1 − p))^1/2]
z Normal approx to Binomial is reasonable when np ≥ 10 & n(1 − p) > 10 (p & (1 − p) are NOT too small relative to n).
Slide 177

Normal approximation to Binomial – Example
z Roulette wheel investigation: compute P(Y ≥ 58), where Y ~ Binomial(100, 0.47) –
„ the proportion of the Binomial(100, 0.47) population having more than 58 reds (successes) out of 100 roulette spins (trials). (Roulette has 38 slots: 18 red, 18 black, 2 neutral.)
„ Since np = 47 ≥ 10 & n(1 − p) = 53 > 10, the Normal approx is justified.
z Z = (Y − np)/√(np(1 − p)) = (58 − 100×0.47)/√(100×0.47×0.53) = 2.2
z P(Y ≥ 58) Í Î P(Z ≥ 2.2) = 0.0139
z True P(Y ≥ 58) = 0.0177, using SOCR (demo!)
z The Normal approx is useful when no access to SOCR (exact Binomial) is available.
Slide 178
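Both the normal approximation and the exact tail are quick to compute (scipy assumed):

    from math import sqrt
    from scipy.stats import binom, norm

    n, p, y = 100, 0.47, 58
    z = (y - n * p) / sqrt(n * p * (1 - p))
    print(norm.sf(z))             # normal approximation: ~0.0139
    print(binom.sf(y - 1, n, p))  # exact P(Y >= 58), close to the 0.0177 above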
Normal approximation to Poisson
z Let X1 ~ Poisson(λ) & X2 ~ Poisson(µ) Æ X1 + X2 ~ Poisson(λ + µ)
z Let X1, X2, X3, …, Xk ~ Poisson(λ), and independent,
z Yk = X1 + X2 + ··· + Xk ~ Poisson(kλ), E(Yk) = Var(Yk) = kλ.
z The random variables in the sum on the right are independent and each has the Poisson distribution with parameter λ.
z By CLT the distribution of the standardized variable (Yk − kλ) / (kλ)^1/2 Î N(0, 1), as k increases to infinity.
z So, for kλ ≥ 100, Zk = {(Yk − kλ) / (kλ)^1/2} ~ N(0, 1). Î Yk ~ N(kλ, (kλ)^1/2).
Slide 179

Normal approximation to Poisson – example
z Let X1, X2, X3, …, X200 ~ Poisson(2), and independent,
z Yk = X1 + X2 + ··· + X200 ~ Poisson(400), E(Yk) = Var(Yk) = 400.
z By CLT the distribution of the standardized variable (Yk − 400) / (400)^1/2 Î N(0, 1).
z Zk = (Yk − 400) / 20 ~ N(0, 1) Î Yk ~ N(400, SD = 20).
z P(2 < Yk < 400) = (standardize 2 & 400) = P((2 − 400)/20 < Zk < (400 − 400)/20) = P(−20 < Zk < 0) ≈ 0.5
Slide 180
Poisson or Normal approximation to Binomial?
z Poisson approximation (Binomial(n, pn) Æ Poisson(λ)): WHY?
   (n choose y) pn^y (1 − pn)^(n−y) → (λ^y / y!) e^(−λ), as n → ∞ with n × pn → λ
„ n ≥ 100 & p ≤ 0.01 & λ = np ≤ 20
z Normal approximation (Binomial(n, p) Æ N(np, (np(1 − p))^1/2)):
„ np ≥ 10 & n(1 − p) > 10
Slide 181
Exponential family and arrival numbers/times
z First, let Tk denote the time of the k'th arrival for k = 1, 2, … The gamma experiment is to run the process until the k'th arrival occurs and note the time of this arrival.
z Next, let Nt denote the number of arrivals in the time interval (0, t] for t ≥ 0. The Poisson experiment is to run the process until time t and note the number of arrivals.
z How are Tk & Nt related?  Nt ≥ k Í Î Tk ≤ t
z The density function of the k'th arrival time is
   fk(t) = (rt)^(k−1) r e^(−rt) / (k − 1)!, t > 0.
This distribution is the gamma distribution with shape parameter k and rate parameter r. Again, 1/r is known as the scale parameter. A more general version of the gamma distribution allows non-integer k.
Slide 182
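The duality Nt ≥ k Í Î Tk ≤ t can be verified numerically (scipy assumed; the rate and horizon below are hypothetical):

    from scipy.stats import gamma, poisson

    r, k, t = 1.5, 3, 2.0                  # hypothetical rate, arrival index, horizon
    print(gamma.cdf(t, a=k, scale=1 / r))  # P(T_k <= t): k-th arrival by time t
    print(poisson.sf(k - 1, r * t))        # P(N_t >= k): the same number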
Independence of continuous RVs
z The RVs {Y1, Y2, Y3, …, Yn} are independent if for any n-tuple {y1, y2, y3, …, yn}:
   P({Y1 ≤ y1} ∩ {Y2 ≤ y2} ∩ {Y3 ≤ y3} ∩ … ∩ {Yn ≤ yn}) = P(Y1 ≤ y1) × P(Y2 ≤ y2) × P(Y3 ≤ y3) × … × P(Yn ≤ yn)
Slide 183

Bayesian Estimation
z MLE
z MLE(mean) for N(µ, σ2) and Poisson(λ)
Slide 184
Law of Total Probability
z If {A1, A2, …, An} are a partition of the sample space (mutually exclusive and UAi = S), then for any event B:
   P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + … + P(B|An)P(An)
[Figure: Venn diagram of the sample space S split into A1 and A2, with the event B straddling both; P(B) = P(B|A1)P(A1) + P(B|A2)P(A2).]
Slide 185

Bayesian Rule
z If {A1, A2, …, An} are a non-trivial partition of the sample space (mutually exclusive and UAi = S, P(Ai) > 0), then for any non-trivial event B (P(B) > 0):
   P(Ai | B) = P(Ai ∩ B) / P(B) = [P(B | Ai) × P(Ai)] / P(B)
             = P(B | Ai) × P(Ai) / Σk=1..n P(B | Ak) P(Ak)
Slide 186
Bayesian Rule
Ex: (Laboratory blood test.) Let D = the test person has the disease, and T = the test result is positive. Assume:
   P(positive Test | Disease) = 0.95
   P(positive Test | no Disease) = 0.01
   P(Disease) = 0.005
Find: P(Disease | positive Test) = P(D | T) = ?

   P(D | T) = P(D ∩ T) / P(T) = P(T | D) × P(D) / [P(T | D) × P(D) + P(T | Dc) × P(Dc)]
            = (0.95 × 0.005) / (0.95 × 0.005 + 0.01 × 0.995) = 0.00475 / 0.02465 = 0.193
Slide 187
Classes vs. Evidence Conditioning
z Classes: healthy (NC), cancer
z Evidence: positive mammogram (pos), negative mammogram (neg)
z If a woman has a positive mammogram result, what is the probability that she has breast cancer?
   P(class | evidence) = P(evidence | class) × P(class) / Σclasses P(evidence | class) × P(class)
z Assume P(cancer) = 0.01, P(pos | cancer) = 0.8, P(pos | healthy) = 0.1. Then
   P(cancer | pos) = P(C|P) = P(P|C)×P(C) / [P(P|C)×P(C) + P(P|H)×P(H)]
                   = 0.8×0.01 / [0.8×0.01 + 0.1×0.99] = ?
Slide 188
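Both diagnostic examples run through the same two-line Bayes computation; a sketch:

    def posterior(prior, sens, false_pos):
        # P(class | positive evidence) via Bayes' rule for a binary test
        evidence = sens * prior + false_pos * (1 - prior)
        return sens * prior / evidence

    print(posterior(0.005, 0.95, 0.01))  # blood test: ~0.193
    print(posterior(0.01, 0.80, 0.10))   # mammogram: ~0.075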
Bayesian Rule – Test Results vs. True Disease State

Test Results  | No Disease                  | Disease                       | Total
Negative      | OK (TN = 0.98505)           | False Negative II (0.00025)   | 0.9853
Positive      | False Positive I (0.00995)  | OK (TP = 0.00475)             | 0.0147
Total         | 0.995                       | 0.005                         | 1.0

P(T ∩ Dc) = P(T | Dc) × P(Dc) = 0.01 × 0.995 = 0.00995
Power of Test = 1 − P(Tc | D) = 0.95
Sensitivity: TP/(TP+FN) = 0.00475/(0.00475 + 0.00025) = 0.95
Specificity: TN/(TN+FP) = 0.98505/(0.98505 + 0.00995) = 0.99
Slide 189

(Log)Likelihood Function
z Suppose we have a sample {X1, …, Xn} IID D(θ) with probability density function p = p(X | θ). Then the joint density p({X1, …, Xn} | θ) is a function of the (unknown) parameter θ.
z Likelihood function: l(θ | {X1, …, Xn}) = p({X1, …, Xn} | θ)
z Log-likelihood: L(θ | {X1, …, Xn}) = Loge l(θ | {X1, …, Xn})
z Maximum-likelihood estimation (MLE):
z Suppose {X1, …, Xn} IID N(µ, σ2), µ is unknown. We estimate it by: MLE(µ) = µ^ = ArgMaxµ L(µ | {X1, …, Xn})
Slide 190
(Log)Likelihood Function
z Suppose {X1, …, Xn} IID N(µ, σ2), µ is unknown. We estimate it by MLE(µ) = µ^ = ArgMaxµ L(µ | {X1, …, Xn}):
   L(µ) = Log [ Πi=1..n ( e^(−(xi − µ)² / (2σ²)) / √(2πσ²) ) ]
   0 = L′(µ^) ⇔ 0 = 2 Σi=1..n (xi − µ^) ⇔ µ^ = (Σi=1..n xi) / n
z Similarly one can show that MLE(σ2) = σ^2 = Σi=1..n (xi − µ^)² / n; dividing by n − 1 instead gives the familiar unbiased sample variance.
Slide 191

(Log)Likelihood Function
z Suppose {X1, …, Xn} IID Poisson(λ), λ is unknown. Estimate λ by MLE(λ) = λ^ = ArgMaxλ L(λ | {X1, …, Xn}):
   L(λ) = Log [ Πi=1..n ( e^(−λ) λ^(xi) / (xi)! ) ] = Log [ e^(−nλ) λ^(Σi xi) / Πi (xi)! ]
   0 = L′(λ^) = ∂/∂λ ( −nλ + Log(λ) Σi=1..n xi − Log Πi (xi)! ) = −n + (1/λ^) Σi=1..n xi ⇔ λ^ = (Σi=1..n xi) / n
Slide 192
Hypothesis Testing – the Likelihood Ratio Principle
z Let {X1, …, Xn} be a random sample from a density f(x; p), where p is some population parameter. Then the joint density is f(x1, …, xn; p) = f(x1; p) × … × f(xn; p).
z The likelihood function L(p) = f(X1, …, Xn; p)
z Testing: Ho: p is in Ω vs Ha: p is in Ωa, where Ω ∩ Ωa = ∅
z Find the max of L(p) in Ω.
z Find the max of L(p) in Ωa.
z Find the likelihood ratio
   λ(x1, …, xn) = maxp∈Ω L(p) / maxp∈Ωa L(p)
z Reject Ho if the likelihood-ratio statistic λ is small (λ < k)
Slide 193
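The ArgMax derivations can be sanity-checked numerically: maximize the normal log-likelihood in µ and compare with the closed form µ^ = sample mean (numpy/scipy assumed, simulated data):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(3)
    x = rng.normal(loc=5.0, scale=2.0, size=200)

    def nll(mu, sig2=x.var()):
        # negative log-likelihood for N(mu, sig2), sig2 held fixed
        return 0.5 * np.sum((x - mu) ** 2) / sig2 + 0.5 * x.size * np.log(2 * np.pi * sig2)

    res = minimize_scalar(nll, bounds=(0, 10), method="bounded")
    print(res.x, x.mean())  # numeric ArgMax matches the closed form mu^ = sample mean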
Hypothesis Testing – the Likelihood Ratio Principle Example
z Let {X1, …, Xn} = {0.5, 0.3, 0.6, 0.1, 0.2} be IID N(µ, 1) Î f(x; µ) = (1/√(2π)) e^(−(x − µ)²/2). The joint density is f(x1, …, xn; µ) = f(x1; µ) × … × f(xn; µ).
z The likelihood function L(µ) = f(X1, …, Xn; µ)
z Testing: Ho: µ > 0 (µ in Ω) vs Ha: µ ≤ 0 (µ in Ωa). Reject Ho if the likelihood-ratio statistic λ is small (λ < k).
   λ0 = λ(x1, …, xn) = maxµ>0 exp(−[(0.5−µ)² + (0.3−µ)² + (0.6−µ)² + (0.1−µ)² + (0.2−µ)²]/2)
                      / maxµ≤0 exp(−[(0.5−µ)² + (0.3−µ)² + (0.6−µ)² + (0.1−µ)² + (0.2−µ)²]/2)
z ln(numerator) and ln(denominator) are each quadratic in µ! Maximize both Æ find the ratio.
z Let P(Type I) = α; then to ~ 1/λo ~ tα,df=4 Î one-sample T-test.
Slide 194
Hypothesis Testing – the Likelihood Ratio Principle Example
z Testing: Ho: µ > 0 vs Ha: µ ≤ 0. Reject Ho if the likelihood-ratio statistic λ is small (λ < k).
   λ0 = λ(x1, …, xn) = maxµ>0 L(µ) / maxµ≤0 L(µ)
z The numerator is maximized at µ = 0.34 (the sample mean of {0.5, 0.3, 0.6, 0.1, 0.2}); the denominator, over µ ≤ 0, at µ = 0:
   λ0 = exp(−[(0.5−0.34)² + (0.3−0.34)² + (0.6−0.34)² + (0.1−0.34)² + (0.2−0.34)²]/2)
      / exp(−[(0.5)² + (0.3)² + (0.6)² + (0.1)² + (0.2)²]/2)
      = e^(−0.086) / e^(−0.375)
z Let P(Type I) = α; then to ~ 1/λo ~ tα,df=4 Î one-sample T-test.
Slide 195

Test Procedure
1. Calculate a Test Statistic (Example: zo).
2. Specify a Rejection Region (Example: zo > zα).
3. The null hypothesis is rejected iff the computed value for the statistic falls in the rejection region.
Slide 196
Type I and Type II Errors
α = Pr{Reject H0 | H0 is true}
β = Pr{Fail to Reject H0 | H0 is False}
• The value of α is specified by the experimenter.
• The value of β is a function of α, n, and δ (the difference between the null hypothesized mean and the true mean). For a two-sided hypothesis test of a normally distributed population:
   β = Φ(Zα/2 + δ√n/σ) − Φ(−Zα/2 + δ√n/σ)
• It is not true that α = 1 − β (the RHS is the test power!)
Slide 197

Type I, Type II Errors & Power of Tests
z Suppose the true MMSE score for AD subjects is ~ N(23, 1²).
z A new cognitive test is proposed, and it’s assumed that its values are ~ N(25, 1²). A sample of 10 AD subjects take the new test.
z Hypotheses are: Ho: µtest = 25 vs. Ha: µtest < 25 (one-sided, more power)
z α = P(false-positive, Type I, error) = 0.05.
z The critical value for α is Zscore = −1.64. Thus, Xavgcritical = Zcritical×σ + µ = 25 − 1.64 = 23.4. Our conclusion, from {X1, …, X10} which yields Xavg, will be: reject Ho if Xavg < 23.4.
z β = P(fail to reject Ho | Ho is false) = P(Xavg ≥ 23.4 | Xavg ~ N(23, 1²/10))
z Note: Xavg ~ N(23, 1²/10) when it’s given that X ~ N(23, 1²)
z Standardize: Z = (23.4 − 23)/(1/10) = 4.0
Slide 198
Type I, Type II Errors & Power of Tests
z β = P(fail to reject Ho | Ho is false) = P(Z > 4.0) = 0.00003
z Î Power (New Test) = 1 − 0.00003 = 0.99997
z How does Power(Test) depend on: (∃ a different β for each different α, true mean µ, alternative Ha)
„ Sample size, n = 10: n increase Î power increase
„ Size of studied effect: effect-size increase Î power increase
„ Type of alternative hypothesis: 1-sided tests are more powerful
Slide 199

Another Example – Type I and II Errors & Power
z About 75% of all 80-year-old humans are free of amyloid plaques and tangles, markers of AD. A new AD vaccine is proposed that is supposed to increase this proportion. Let p be the new proportion of subjects with no AD characteristics following vaccination.
   Ho: p = 0.75, H1: p > 0.75.
z X = number of AD tests with no pathology findings in n = 20 80-y/o vaccinated subjects. Under Ho we expect to get about n×p = 15 no-AD results. Suppose we’d invest in the new vaccine if we get ≥ 18 no-AD tests Î rejection region R = {18, 19, 20}.
z Find α and β. How powerful is this test?
Slide 200
Another Example – Type I and Type II Errors
z Ho: p = 0.75, H1: p > 0.75. X = number of tests with no AD findings in n = 20 experiments.
z X ~ Binomial(20, 0.75). Rejection region R = {18, 19, 20}.
z Find α = P(Type I) = P(X ≥ 18 when X ~ Binomial(20, 0.75)).
„ Use SOCR resource Î α = 1 − 0.91 = 0.09. How does Power(Test) depend on n, effect-size?
z Find β(p = 0.85) = P(Type II) =
„ P(fail to reject Ho | X ~ Binomial(20, 0.85)) = P(X < 18 | X ~ Binomial(20, 0.85))
„ Use SOCR resource Î β = 0.595 Î Power of test = 1 − β = 0.405
z Find β(p = 0.95) = P(Type II) =
„ P(fail to reject Ho | X ~ Binomial(20, 0.95)) = P(X < 18 | X ~ Binomial(20, 0.95))
„ Use SOCR resource Î β = 0.076 Î Power of test = 1 − β = 0.924
Slide 201
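The α and β computations above are binomial tail sums; a sketch reproducing the SOCR numbers (scipy assumed):

    from scipy.stats import binom

    n, cutoff = 20, 18                           # rejection region R = {18, 19, 20}
    alpha = binom.sf(cutoff - 1, n, 0.75)        # P(X >= 18 | p = .75) ~ 0.09
    for p_true in (0.85, 0.95):
        beta = binom.cdf(cutoff - 1, n, p_true)  # P(X < 18 | p_true)
        print(p_true, beta, 1 - beta)            # beta ~ 0.595 / 0.076; power ~ 0.405 / 0.924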
Distribution Theory
z Discrete Distributions
z Continuous distributions
z Normal Approximation to Binomial and Poisson
z Chi-Square
z Fisher’s F distributions
Slide 202