Basic Statistical Concepts - LSUHSC School of Public Health

Basic Statistical Concepts
Donald E. Mercante, Ph.D.
Biostatistics
School of Public Health
LSU-HSC
Two Broad Areas of Statistics
Descriptive Statistics
- Numerical descriptors
- Graphical devices
- Tabular displays
Inferential Statistics
- Hypothesis testing
- Confidence intervals
- Model building/selection
Descriptive Statistics
When computed for a population of values,
numerical descriptors are called
Parameters
When computed for a sample of values,
numerical descriptors are called
Statistics
Descriptive Statistics
Two important aspects of any
population
Magnitude of the responses
Spread among population members
Descriptive Statistics
Measures of Central Tendency
(magnitude)
Mean
- most widely used
- uses all the data
- best statistical properties
- susceptible to outliers
Median
- does not use all the data
- resistant to outliers
Descriptive Statistics
Measures of Spread (variability)
range
- simple to compute
- does not use all the data
variance
- uses all the data
- best statistical properties
- measures average squared distance of values from a reference point (the mean)
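As a rough illustration of the trade-offs listed above, the sketch below computes the mean, median, range, and sample variance with and without a single outlier. The data values are hypothetical and chosen only to make the effect visible; the helper name `describe` is an assumption, not something from the slides.

```python
import statistics

# Hypothetical sample; the last value is a deliberate outlier.
values = [118, 122, 125, 127, 130, 134, 210]
trimmed = values[:-1]  # the same sample without the outlier


def describe(data):
    """Return the mean, median, range, and sample variance of a list."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance, divides by n - 1
    }


with_outlier = describe(values)
without_outlier = describe(trimmed)

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

In this sample the outlier moves the mean by 12 units and the median by only 1, which is the sense in which the median is resistant to outliers.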
Properties of Statistics
• Unbiasedness - On target
• Minimum variance - Most reliable
• If an estimator possesses both properties then it is an MVUE = Minimum Variance Unbiased Estimator
• Sample Mean and Variance are UMVUE = Uniformly Minimum Variance Unbiased Estimators
Inferential Statistics
- Hypothesis Testing
- Interval Estimation
Hypothesis Testing
Specifying hypotheses:
H0: “null” or no effect hypothesis
H1: research or alternative
hypothesis
Note: Only H0 (null) is tested.
Errors in Hypothesis Testing
Decision             Reality: H0 True    Reality: H0 False
Fail to Reject H0    Correct decision    Type II error
Reject H0            Type I error        Correct decision
Hypothesis Testing
In parametric tests, actual parameter
values are specified for H0 and H1.
H0: µ ≤ 120
H1: µ > 120
Hypothesis Testing
Another example of explicitly
specifying H0 and H1.
H0: µ = 0
H1: µ ≠ 0
Hypothesis Testing
General framework:
• Specify null & alternative hypotheses
• Specify test statistic
• State rejection rule (RR)
• Compute test statistic and compare to
RR
• State conclusion
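The five steps above can be sketched for a one-sample t-test. This is only an illustration under stated assumptions: it borrows the standing SBP summary from the interval estimation example in these slides (mean 140.8, s.d. 9.5, N = 12), assumes the one-sided hypotheses H0: µ ≤ 120 versus H1: µ > 120 at α = 0.05, and uses the tabled critical value 1.796 for 11 degrees of freedom. The function name `one_sample_t` is hypothetical.

```python
import math

# Step 1: hypotheses (assumed for illustration): H0: mu <= 120, H1: mu > 120.
mu0 = 120.0

# Summary statistics borrowed from the standing SBP example in these slides.
xbar, s, n = 140.8, 9.5, 12


# Step 2: test statistic, t = (xbar - mu0) / (s / sqrt(n)).
def one_sample_t(xbar, s, n, mu0):
    return (xbar - mu0) / (s / math.sqrt(n))


# Step 3: rejection rule. Reject H0 if t exceeds the upper 5% point of the
# t distribution with n - 1 = 11 degrees of freedom (1.796 from a t table).
t_crit = 1.796

# Step 4: compute the test statistic and compare it to the rejection rule.
t_stat = one_sample_t(xbar, s, n, mu0)
reject = t_stat > t_crit

# Step 5: state the conclusion.
print(f"t = {t_stat:.3f}, critical value = {t_crit}")
print("Reject H0" if reject else "Fail to reject H0")
```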
Common Statistical Tests
Test Name                Purpose
One-sample (z) t-test    Test value of a mean
Two-sample (z) t-test    Compare two means
Paired t-test            Compare difference in means (compare related means)
ANOVA                    Test for differences in 2 or more means
Common Statistical Tests (cont.)
Test                                  Purpose
Test on binomial proportion(s)        Test whether binomial proportions equal a hypothesized value, or each other
Test on correlation coefficient(s)    Test whether correlation coefficients equal 0, or each other
Regression                            Test whether slope = 0
RxC contingency table analysis        Test whether two categorical variables are related
Advanced Topics
Test                                 Purpose
Multivariate Tests (e.g., MANOVA)    Test value of several parameters simultaneously
Repeated Measures / Crossovers       Test means when subjects are repeatedly measured
Survival Analysis                    Estimate and compare survival probabilities for one or more groups
Nonparametric Tests                  Many are analogous to standard parametric tests
P-Values
p = Probability of obtaining a result at least this extreme, given that the null hypothesis is true.
• P-values are probabilities
• 0 ≤ p ≤ 1
• Computed from the distribution of the test statistic
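As a small sketch of "computed from the distribution of the test statistic", the code below turns a z statistic into a two-sided p-value using the standard normal distribution from Python's standard library. The choice of a z statistic and a two-sided alternative is an assumption made for illustration; the function name is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_from_z(z):
    """Return P(|Z| >= |z|) for a standard normal Z, i.e. the two-sided p-value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


p = two_sided_p_from_z(1.96)
print(f"z = 1.96 gives p = {p:.3f}")
```

A z statistic of 1.96 sits right at the conventional 5% boundary, and a statistic of 0 gives the largest possible p-value of 1.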
Epidemiological Concepts
Rate: a proportion, specifically a fraction in which the numerator, c, is included in the denominator:
$$\text{rate} = \frac{c}{c + d}$$
- Useful for comparing groups of unequal size
Example:
$$\text{neonatal mortality rate} = \frac{\text{number of deaths at} < 28 \text{ days old}}{\text{total number of live births}}$$
Epidemiological Concepts
Measures of Morbidity:
Incidence Rate: # new cases occurring during a
given time interval divided by population at risk
at the beginning of that period.
Prevalence Rate: total # cases at a given time
divided by population at risk at that time.
Epidemiological Concepts
Most people think in terms of probability (p) of
an event as a natural way to quantify the chance
an event will occur =>
0<=p<=1
0 = event will certainly not occur
1 = event certain to occur
But there are other ways of quantifying the
chances that an event will occur….
Epidemiological Concepts
Odds and Odds Ratio:
$$\text{Odds of an event} = O = \frac{\text{expected number of times the event will occur}}{\text{expected number of times the event will not occur}}$$
For example, O = 4 means we expect 4 times as
many occurrences as non-occurrences of an event.
In gambling, we say, the odds are 5 to 2. This
corresponds to the single number 5/2 = Odds.
Epidemiological Concepts
The relationship between probability & odds
$$O = \frac{p}{1 - p} = \frac{\text{prob of event}}{\text{prob of no event}}$$
$$p = \frac{O}{1 + O}$$
Epidemiological Concepts
Probability    Odds
.1             .11
.2             .25
.3             .43
.4             .67
.5             1.00
.6             1.50
.7             2.33
.8             4.00
.9             9.00
Odds < 1 correspond to probabilities < 0.5
0 < Odds < ∞
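The probability-to-odds table can be reproduced with a short sketch of the two conversions, O = p / (1 - p) and p = O / (1 + O). The function names are hypothetical.

```python
def odds_from_prob(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def prob_from_odds(o):
    """Convert odds (o >= 0) back to a probability."""
    return o / (1.0 + o)


# Print odds for the same probabilities used in the table.
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"p = {p:.1f}   odds = {odds_from_prob(p):.2f}")
```

Converting to odds and back returns the original probability, so the two scales carry the same information.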
Example 1: Odds Ratio
Death sentence by race of defendant in 147 trials
         Blacks    Nonblacks    Total
Death    28        22           50
Life     45        52           97
Total    73        74           147
Example 2: Odds Ratio
Odds of death sentence = 50/97 = 0.52
For Blacks: O = 28/45 = 0.62
For Nonblacks: O = 22/52 = 0.42
Ratio of Black Odds to Nonblack Odds = 1.47
This is called the Odds Ratio
$$OR = \frac{28/45}{22/52} = \frac{28 \times 52}{22 \times 45} = \frac{1456}{990} = 1.47$$
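The same arithmetic can be sketched in code, using the counts from the death sentence table (28 death and 45 life sentences for Blacks, 22 and 52 for Nonblacks). The function names are hypothetical.

```python
def odds(events, non_events):
    """Odds of an event: occurrences divided by non-occurrences."""
    return events / non_events


def odds_ratio(events_1, non_events_1, events_2, non_events_2):
    """Odds ratio of group 1 to group 2, using the cross-product form."""
    return (events_1 * non_events_2) / (events_2 * non_events_1)


odds_black = odds(28, 45)      # death vs. life sentences, Black defendants
odds_nonblack = odds(22, 52)   # death vs. life sentences, Nonblack defendants
or_value = odds_ratio(28, 45, 22, 52)

print(f"odds (Blacks)    = {odds_black:.2f}")
print(f"odds (Nonblacks) = {odds_nonblack:.2f}")
print(f"odds ratio       = {or_value:.2f}")
```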
Logistic Regression
Odds ratios are directly related to the parameters of
the logit (logistic regression) model.
Logistic Regression is a statistical method that
models binary (e.g., Yes/No; T/F; Success/Failure)
data as a function of one or more explanatory
variables.
We would like a model that predicts the probability
of a success, i.e., P(Y=1), using a linear function.
Logistic Regression
Problem: Probabilities are bounded by 0 and 1.
But linear functions are inherently unbounded.
Solution: Transform P(Y=1) = p to an odds, which removes
the upper bound. If we take the log of the odds, the
lower bound is also removed.
Setting this result equal to a linear function of the
explanatory variables gives us the logit model.
Logistic Regression
Logit or Logistic Regression Model
$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$$
Where pi is the probability that yi = 1.
The expression on the left is called the logit or log
odds.
Logistic Regression
Probability of success:
$$p_i = P(y_i = 1) = \frac{1}{1 + e^{-(\alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik})}}$$
Odds Ratio for Each Explanatory Variable:
$$\text{OR for } X_j = e^{\beta_j}, \quad j = 1, \ldots, k$$
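A minimal sketch of how the pieces fit together, assuming an intercept written as alpha and slopes written as beta. The coefficient and covariate values below are hypothetical and not fitted to any data; they only show that the probability formula inverts the logit and that raising one explanatory variable by one unit multiplies the odds by e raised to its coefficient.

```python
import math


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Probability corresponding to a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(alpha, betas, x):
    """alpha + beta_1 * x_1 + ... + beta_k * x_k."""
    return alpha + sum(b * xi for b, xi in zip(betas, x))


# Hypothetical coefficients and covariate values, for illustration only.
alpha = -1.0
betas = [0.5, -0.25]
x = [2.0, 4.0]

eta = linear_predictor(alpha, betas, x)
p = inv_logit(eta)

# Odds ratio for each explanatory variable.
odds_ratios = [math.exp(b) for b in betas]

# Check: raising the first variable by one unit multiplies the odds by exp(beta_1).
p_plus = inv_logit(linear_predictor(alpha, betas, [x[0] + 1.0, x[1]]))
odds_multiplier = (p_plus / (1.0 - p_plus)) / (p / (1.0 - p))

print(f"P(Y = 1) = {p:.4f}")
print("odds ratios:", [round(v, 3) for v in odds_ratios])
print(f"observed odds multiplier for X1: {odds_multiplier:.3f}")
```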
Screening Tests
Suppose a new screening test for herpes virus has
been developed and the following summary for 1000
individuals has been compiled:
                     Has Herpes    Does Not Have Herpes
Screened Positive    45            10
Screened Negative    5             940
Screening Tests
How do we evaluate the usefulness of such a test?
Diagnostics:
sensitivity
specificity
False positive rate
False negative rate
predictive value positive
predictive value negative
Screening Tests
Generic Screening Test Table
                     With Disease    Without Disease    Total
Screened Positive    a               b                  a+b
Screened Negative    c               d                  c+d
Total                a+c             b+d                N
Screening Tests
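$$\text{prevalence} = \frac{a + c}{N}$$
$$\text{Sensitivity} = \frac{a}{a + c} \qquad \text{Specificity} = \frac{d}{b + d}$$
$$\text{False positive rate} = \frac{b}{b + d}$$
$$\text{False negative rate} = \frac{c}{a + c}$$
$$\text{Yield or predictive value}^{+} = \frac{a}{a + b}$$
$$\text{Yield or predictive value}^{-} = \frac{d}{c + d}$$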
Screening Tests
prevalence = 50/1000 = 5%
Sensitivity = 45/50 = 90%
Specificity = 940/950 = 98.95%
False positive rate = 10/950 = 1.05%
False negative rate = 5/50 = 10%
Yield or predictive value (+) = 45/55 = 81.82%
Yield or predictive value (-) = 940/945 = 99.47%
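The diagnostics can be recomputed from the four cell counts of the herpes screening table (a = 45, b = 10, c = 5, d = 940). The sketch below follows the generic table layout; the function and key names are hypothetical. The negative predictive value uses the screened-negative row total, d / (c + d) = 940/945.

```python
def screening_measures(a, b, c, d):
    """Diagnostics for a 2x2 screening table.

    a: screened positive, with disease      b: screened positive, without disease
    c: screened negative, with disease      d: screened negative, without disease
    """
    n = a + b + c + d
    return {
        "prevalence": (a + c) / n,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_positive_rate": b / (b + d),
        "false_negative_rate": c / (a + c),
        "predictive_value_positive": a / (a + b),
        "predictive_value_negative": d / (c + d),
    }


measures = screening_measures(a=45, b=10, c=5, d=940)
for name, value in measures.items():
    print(f"{name:28s} {100 * value:6.2f}%")
```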
Interval Estimation
Statistics such as the sample mean,
median, variance, etc., are called
point estimates
-vary from sample to sample
-do not incorporate precision
Interval Estimation
Take as an example the sample mean:
X̄ --estimates--> µ (popn mean)
Or the sample variance:
S² --estimates--> σ² (popn variance)
Interval Estimation
Recall Example 1, a one-sample t-test on
the population mean. The test statistic
was
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
This can be rewritten to yield:
Interval Estimation
$$P\left(-t_{1-\alpha/2} \le \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \le t_{1-\alpha/2}\right) = 1 - \alpha$$
Which can be rearranged to give a (1-α)100% Confidence Interval for µ:
$$\bar{x} \pm t_{1-\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$$
Form: Estimate ± Multiple of Std Error of the Est.
Interval Estimation
Example 1: Standing SBP
Mean = 140.8, s.d. = 9.5, N = 12
95% CI for µ:
140.8 ± 2.201 (9.5/sqrt(12))
140.8 ± 6.036
(134.8, 146.8)
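The interval above can be checked with a short sketch. It takes the critical value 2.201 (the 97.5th percentile of the t distribution with 11 degrees of freedom) from the slide as given instead of computing it; the function name is hypothetical.

```python
import math


def t_confidence_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for xbar +/- t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Standing SBP example: mean 140.8, s.d. 9.5, N = 12, t critical value 2.201.
lower, upper = t_confidence_interval(xbar=140.8, s=9.5, n=12, t_crit=2.201)
half_width = (upper - lower) / 2.0

print(f"half-width = {half_width:.3f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f})")
```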