DATA ANALYSIS & BASIC STATISTICS
XIAO WU
XIAO.WU@YALE.EDU
PURPOSE OF THIS WORKSHOP
• Statistics as a useful tool to analyze results
• Basic terminology and most commonly used tests
• Exposure to more advanced statistical tools
WHY DO WE NEED STATISTICS?
• Summary
• Classification
• Interpretation
• Pattern searching
• Abnormality identification
• Prediction
• Interpolation
• Extrapolation
SUMMARY
• Mean, median, mode
• Variance, standard deviation
• Max, min values and range
• Quartiles
http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
EXAMPLE
Firm A
• Mean: $5,800
Firm B
• Mean: $5,000
EXAMPLE
Firm A
• Mean: $5,800
• Median: $4,000
• SD: $7,270
• 3rd Quartile: $4,000
• 1st Quartile: $500
Firm B
• Mean: $5,000
• Median: $5,000
• SD: $216
• 3rd Quartile: $5,175
• 1st Quartile: $4,825
EXAMPLE
Firm A              Firm B
#    Salary ($)     #    Salary ($)
1    20000          1    4650
2    4000           2    4700
3    4000           3    4750
4    500            4    4800
5    500            5    4850
                    6    4900
                    7    4950
                    8    5000
                    9    5050
                    10   5100
                    11   5150
                    12   5200
                    13   5250
                    14   5300
                    15   5350
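The summary statistics quoted for the two firms can be recomputed from the salary table. The sketch below is one way to do it with Python's standard library. It assumes the population standard deviation (dividing by n), which reproduces Firm A's $7,270, and the "inclusive" quartile method, which reproduces the quartiles listed for both firms. With those assumptions, Firm B's standard deviation from the table as listed comes out near $216. The function name `summarize` is illustrative only.

```python
import statistics

# Salaries copied from the table above.
firm_a = [20000, 4000, 4000, 500, 500]
firm_b = list(range(4650, 5351, 50))  # 4650, 4700, ..., 5350 (15 values)


def summarize(salaries):
    """Return the descriptive statistics listed on the SUMMARY slide."""
    q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return {
        "mean": statistics.mean(salaries),
        "median": median,
        "sd": statistics.pstdev(salaries),  # population SD (divide by n)
        "q1": q1,
        "q3": q3,
        "min": min(salaries),
        "max": max(salaries),
    }


for name, salaries in (("Firm A", firm_a), ("Firm B", firm_b)):
    stats = summarize(salaries)
    print(name, {key: round(value) for key, value in stats.items()})
```

The two firms have similar means, but the median, quartiles, and standard deviation show that Firm A's mean is driven by a single very high salary.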
CLASSIFICATION
Identification of variables
• Independent vs. dependent
• Numeric vs. categorical
Variable types
• Categorical: nominal, ordinal
• Numeric: continuous, discrete
PATTERN SEARCHING
• Distribution of data
• Some commonly used distributions: uniform, binomial, Poisson, …
• Central limit theorem
http://www.mathwave.com/img/art/graphs_pdf2.gif
UNIFORM
• Every outcome has an equal chance
• Example:
• Flipping a coin
• Rolling a die
• What if you need to flip multiple times?
BINOMIAL
• Two outcomes, with probabilities p and 1 - p
• Multiple trials: n
• Example:
• Flipping a coin 100 times
• Germination of multiple seeds
https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
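For the coin example, the binomial probability of exactly k successes in n trials is C(n, k) · p^k · (1 - p)^(n - k). A minimal sketch, where the function name `binomial_pmf` is illustrative rather than taken from the workshop:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Fair coin flipped 100 times: chance of exactly 50 heads.
print(round(binomial_pmf(50, 100, 0.5), 4))  # → 0.0796
```

Even the single most likely outcome (50 heads) occurs only about 8% of the time, because the probability is spread across many nearby counts.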
POISSON
• Counts of rare, independent events
• Each with a probability, or average rate, p
• Example: radioactive decay
http://kaffee.50webs.com/Science/images/alpha_decay.gif
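For a Poisson count with average rate p per interval (often written λ elsewhere), the probability of observing exactly k events is p^k · e^(-p) / k!. The sketch below assumes a hypothetical rate of 2 decays per interval; that number is chosen only for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count is `rate`."""
    return rate**k * exp(-rate) / factorial(k)


rate = 2  # hypothetical average number of decays per interval
for k in range(5):
    print(k, round(poisson_pmf(k, rate), 4))
```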
THE MOST IMPORTANT DISTRIBUTION
NORMAL DISTRIBUTION
• Central limit theorem
• The mean of many independent draws from a distribution with finite variance is approximately normally distributed
• Large sample size → distribution of the sample mean approaches a normal distribution
Parameters:
• mean
• standard deviation
https://www.mathsisfun.com/data/images/normal-distrubution-large.gif
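A small simulation can make the central limit theorem concrete. The sketch below rolls a fair die (a uniform, clearly non-normal distribution) and looks at the means of repeated samples. The sample sizes, the number of repeats, and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(sample_size, repeats=5000):
    """Means of `repeats` samples, each made of `sample_size` fair die rolls."""
    return [
        statistics.mean([random.randint(1, 6) for _ in range(sample_size)])
        for _ in range(repeats)
    ]


for size in (1, 5, 30):
    means = sample_means(size)
    print(
        f"n={size:>2}  mean of means={statistics.mean(means):.2f}  "
        f"SD of means={statistics.pstdev(means):.2f}"
    )
```

The means stay centered near 3.5 while their spread shrinks roughly as 1/sqrt(n). A histogram of `means` for n = 30 would look close to a bell curve even though single rolls are uniform.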
PATTERN SEARCHING
Hypothesis testing
• Difference between two populations
• Z-test or t-test?
• What does p-value mean?
• Family-wise error – Bonferroni correction
• More than two possibilities
• Chi square test
• Fisher’s exact test
• More than two variables
• ANOVA
EXAMPLE 1
SAT score is related to gender
• Null hypothesis
• Alternative hypothesis (3 possibilities)
• One or two tail?
• Z or T test?
• p=0.07, conclusion?
EXAMPLE 2
Predictors of stroke
• Age
• Hypertension
• Gender
• …
EXAMPLE 3
Genome-wide association studies
• Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
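Scanning many markers means running many tests, which is where the family-wise error and the Bonferroni correction come in: with m tests, each one is compared against α / m instead of α. A minimal sketch with hypothetical p-values (the numbers are made up for illustration):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


p_values = [0.001, 0.02, 0.04, 0.30]  # hypothetical results of four tests
threshold = bonferroni_threshold(0.05, len(p_values))
significant = [p for p in p_values if p < threshold]

print(threshold)    # 0.05 / 4 = 0.0125
print(significant)  # only 0.001 passes the corrected threshold
```

With about one million tests, the same rule gives 0.05 / 1,000,000 = 5 × 10^-8, the threshold conventionally used in genome-wide association studies.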
PATTERN SEARCHING
Hypothesis testing
• One variable
• Z-test or t-test?
• What does p-value mean?
• Family-wise error – Bonferroni correction
• Compare two categorical variables
• Chi square test
• Fisher’s exact test
• More than two variables
• ANOVA
CHI SQUARE
Punnett Square
• A cross between two pea plants yields 880 plants: 639 green, 241 yellow
• Hypothesis: The green allele is dominant and both parents are heterozygous.
http://www2.lv.psu.edu/jxm57/irp/chisquar.html
CHI SQUARE
       G             g
G      GG (green)    Gg (green)
g      Gg (green)    gg (yellow)
• 75% green
• 25% yellow
CHI SQUARE
                            Green    Yellow
Observed (o)                639      241
Expected (e)                660      220
Deviation (d = o - e)       -21      21
Deviation squared (d^2)     441      441
d^2/e                       0.668    2.005
Sum (chi-square)            2.673
Degrees of freedom: number of categories - 1 = 1
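The calculation in the table can be reproduced in a few lines. The sketch below also converts the statistic to a p-value, using the fact that with 1 degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)); that shortcut does not apply to other degrees of freedom.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}
expected_ratio = {"green": 0.75, "yellow": 0.25}  # 3:1 from the Punnett square

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in expected_ratio.items()}

chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color] for color in observed
)
p_value = erfc(sqrt(chi_square / 2))  # valid only for 1 degree of freedom

print(round(chi_square, 3), round(p_value, 3))
```

The statistic is about 2.67, below the 5% critical value of 3.841 for 1 degree of freedom (p is roughly 0.10), so the data give no reason to reject the 3:1 hypothesis.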
PREDICTION
• Regression
• Linear regression
• Multiple linear regression
• Accuracy vs. simplicity
• Validation
• leave-k-out
http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXvU/s1600/actnactn+1.png
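As a rough illustration of regression plus validation, the sketch below fits a one-variable least-squares line and scores it by holding points out. The data are hypothetical, and the hold-out scheme is a simplification: it leaves out consecutive blocks of k points rather than every possible subset of size k.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_k_out_mse(xs, ys, k=1):
    """Hold out blocks of k points, fit on the rest, average the squared errors."""
    squared_errors = []
    for start in range(0, len(xs), k):
        held = range(start, min(start + k, len(xs)))
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        intercept, slope = fit_line(train_x, train_y)
        squared_errors += [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data lying roughly on y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]

print(fit_line(xs, ys))
print(leave_k_out_mse(xs, ys, k=1), leave_k_out_mse(xs, ys, k=2))
```

Because each prediction is made for points the model never saw, the held-out error is a fairer estimate of accuracy than the error on the training data.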
EXAMPLE
• Use brain structural measurements to predict a subject’s performance on a picture vocabulary test
• 144 total structural measurements
• 521 subjects
• First step: eliminate unnecessary variables
• All zeros?
• Highly correlated pairs
• Variables that do not correlate well with performance score
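The three screening steps above could be sketched as follows. This is an assumption-laden illustration, not the procedure actually used in the study: the cutoffs (0.95 and 0.1), the function names, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def screen(features, score, pair_cutoff=0.95, score_cutoff=0.1):
    """Return the names of variables that survive all three filters."""
    kept = []
    for name, values in features.items():
        if not any(values):
            continue  # all zeros
        if any(abs(pearson(values, features[k])) > pair_cutoff for k in kept):
            continue  # highly correlated with a variable already kept
        if abs(pearson(values, score)) < score_cutoff:
            continue  # barely correlated with the performance score
        kept.append(name)
    return kept


# Hypothetical measurements for five subjects.
features = {
    "region_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "region_a_copy": [2.0, 4.0, 6.0, 8.0, 10.1],
    "all_zero": [0, 0, 0, 0, 0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],
}
score = [10, 12, 15, 19, 20]

print(screen(features, score))
```

In this toy data only `region_a` survives: the copy is nearly identical to it, one column is all zeros, and the last barely tracks the score.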
EXAMPLE
• Run regression
• Validation: leave 1 out and leave 10 out
• Principal component analysis
• …
PREDICTION
More complicated models:
• Bayesian approach
• Use prior knowledge to update prediction
• Diffusion weights
• Use local structure to predict neighboring values
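To illustrate the idea of using prior knowledge to update a prediction, here is one of the simplest Bayesian updates: a Beta prior on a success probability combined with binomial data (for instance, seed germination). The prior parameters and the counts are hypothetical, and this is only one of many possible Bayesian models.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


# Hypothetical prior: Beta(2, 2), a mild belief that the rate is near 0.5.
# Hypothetical data: 8 of 10 seeds germinate.
a, b = update_beta(2, 2, successes=8, failures=2)
posterior_mean = a / (a + b)

print(a, b, round(posterior_mean, 3))  # Beta(10, 4), mean 10/14
```

The posterior mean (about 0.71) sits between the prior mean (0.5) and the observed rate (0.8), and moves toward the data as more observations arrive.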
STATISTICAL TOOLS
• Excel
• MATLAB
• R
• Minitab
• …
QUESTIONS?
MY OWN RESEARCH
• Cost-effectiveness analysis
• Mathematical modeling in medicine
• Simulate iterations rather than actual patients
RECENT RESULTS
RESULTS
GROUP EXERCISE