Introduction to Statistics - The Department of Statistics and Applied

ST1232
Statistics in the Life Sciences
YY Teo
Associate Professor
Saw Swee Hock School of Public Health, NUS
Department of Statistics & Applied Probability, NUS
Life Sciences Institute, NUS
Genome Institute of Singapore, A*STAR
Lesson Structure
• 13 weeks of 2 lectures (of 2 hours) per week
• Practically, 17-18 lectures planned,
newspaper statistics, conferences, etc.
• Tutorials in computer labs from week 3
onwards (11 weeks of tutorials)
• Consultation (Fridays 2pm – 3.30pm)
• 3 assessments:
– tutorial participation (10%)
– mid-term quiz (30%)
– end-of-term exam (60%)
Resources
• Lectures, slides, tutorials
• Fred Ramsey and Dan Schafer (2001) The
Statistical Sleuth. 2nd edition, Duxbury Press
• Julie Pallant. SPSS Survival Manual: A Step-by-Step Guide to Data Analysis Using SPSS for
Windows. 3rd edition, Open University Press
• http://www.statistics.nus.edu.sg/~statyy/ST1232
Tutorials
• Note the available time slots and sign up at the CORS system:
http://www.nus.edu.sg/cors/. The tutorial will be at S16-05-102 (Com lab 2)
• T1: Mondays (8am – 9am)
• T2: Mondays (9am – 10am)
• T3: Tuesdays (8am – 9am)
• T4: Tuesdays (9am – 10am)
• T5: Wednesdays (9am – 10am)
• T6: Wednesdays (10am – 11am)
• T7: Wednesdays (11am – 12pm)
• T8: Thursdays (9am – 10am)
• T9: Thursdays (10am – 11am)
• T10: Thursdays (11am – 12pm)
• T11: Fridays (8am – 9am)
• T12: Fridays (9am – 10am)
• T13: Fridays (10am – 11am)
Medical Statistics
• Quantitative basis to human diseases and traits
• Progression from observational science!
• Statistics and mathematics required for this
advancement, from observational to quantitative
Statistics in medical research
Medical statistics
• Identification of risk factors
– Association with genes and environment
– Disease risk modeling and prediction
• Disease prevention / treatment
– Pharmaceutical developments / clinical trials
• Understand inter-population risks to diseases
• Establish population-specific risk architecture
• Relevance of international trials and findings
• Applications in a multi-ethnic setting
Pregnancy Test Kit
A woman buys a pregnancy test kit, and is interested to find
out whether she is pregnant.
One hypothesis in this case (status quo), is that she is not
pregnant.
The other hypothesis (hypothesis of interest), is that she is
pregnant.
Test kit may show:
+ve: indicating there is evidence to suggest pregnancy
–ve: indicating lack of evidence to suggest pregnancy
Pregnancy Test Kit
The test kit may either be accurate, or inaccurate.
                      Actually pregnant            Actually not pregnant
Test kit shows +ve    Correct +ve diagnosis        Incorrect +ve diagnosis
                      (Sensitivity, or Power)
Test kit shows –ve    Incorrect –ve diagnosis      Correct –ve diagnosis
                                                   (Specificity)
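As a minimal sketch of how the two labelled cells turn into numbers: sensitivity is the share of actually-pregnant women the kit flags as +ve, and specificity is the share of actually-not-pregnant women it reports as –ve. The function names and the counts below are hypothetical, invented only to illustrate the arithmetic.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actually-positive cases that the test reports as +ve."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of actually-negative cases that the test reports as -ve."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts (not from the lecture): 100 pregnant, 200 not pregnant.
true_pos, false_neg = 90, 10    # pregnant: kit shows +ve / kit shows -ve
true_neg, false_pos = 180, 20   # not pregnant: kit shows -ve / kit shows +ve

kit_sensitivity = sensitivity(true_pos, false_neg)
kit_specificity = specificity(true_neg, false_pos)
print(f"sensitivity = {kit_sensitivity:.2f}")
print(f"specificity = {kit_specificity:.2f}")
```

Each measure uses only one column of the table, so neither depends on how common pregnancy is among the women tested.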
Sensitivity or Specificity?
• Objective of the experiment
• HIV diagnostic kit, 99.9% sensitive and 99.5% specific
– Sensitivity: correct identification of HIV +ves
– Specificity: correct identification of HIV –ves
• Tests on immigrants, assume 1,001,000 applications
each month, of which 1000 are truly HIV-positive
• HIV +ve x 1000: on average, 999 correctly identified,
1 incorrectly diagnosed as HIV –ve
• HIV –ve x 1,000,000: on average, 995,000 correctly identified
as HIV –ve, 5000 incorrectly diagnosed as HIV +ve
• In total, 995,001 identified as HIV –ve and 5,999 identified as HIV +ve
• BUT… about 5 in 6 of those identified as HIV +ve (5000 of 5,999) are FALSE!
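The arithmetic behind the "5 in 6" figure can be checked with a short sketch. It uses only the numbers on the slide (1000 true positives, 1,000,000 true negatives, 99.9% sensitivity, 99.5% specificity); the function and variable names are illustrative assumptions.

```python
def screening_counts(n_pos, n_neg, sens, spec):
    """Expected counts when n_pos infected and n_neg uninfected people are tested."""
    true_pos = n_pos * sens        # infected, correctly flagged +ve
    false_neg = n_pos - true_pos   # infected, missed
    true_neg = n_neg * spec        # uninfected, correctly cleared
    false_pos = n_neg - true_neg   # uninfected, wrongly flagged +ve
    return true_pos, false_neg, true_neg, false_pos


tp, fn, tn, fp = screening_counts(1000, 1_000_000, sens=0.999, spec=0.995)

flagged_positive = tp + fp             # everyone the kit calls +ve
share_true = tp / flagged_positive     # positive predictive value
share_false = fp / flagged_positive    # +ve results that are wrong

print(f"identified as +ve: {flagged_positive:.0f}")
print(f"share of +ve results that are true:  {share_true:.3f}")   # 0.167
print(f"share of +ve results that are false: {share_false:.3f}")  # 0.833
```

The kit itself is highly accurate; the false positives dominate only because truly HIV-negative applicants outnumber HIV-positive ones by 1000 to 1.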
[Figure: Height and Weight. Scatter plot of Weight (axis 40 to 100) against Height (axis 140 to 200), with separate markers for Male and Female.]
Scientific Process
Research hypothesis:
- What is your scientific question?
- What are you trying to achieve?
Human Diversity
• Even within the human race, variation exists between people
of different ethnicities, cultures and populations
• Genetic basis to a substantial fraction of such variation
• Observable differences – physical appearances, build,
weight
• Variation in susceptibility to diseases
• Influenced by evolutionary processes, over many
generations
• Cross-sectional observation of adaptation and natural
selection
Target population
• Depends entirely on your research hypothesis!
Target population:
- Everyone in Singapore?
- Every female individual in Singapore?
- Every female individual of a certain
age in Singapore?
- Every female individual of a certain age
in Singapore, and who could be
pregnant?
Target population:
- Everyone in Singapore?
- Everyone of a certain age in Singapore?
- Everyone of a certain age in NUS?
- Everyone of a certain age from a specific
population group in Singapore
Target populations
• Depends entirely on your research hypothesis!
• Example: Interest in investigating the genetic factors that
increase the risk of type 2 diabetes in Chinese adults in
Singapore.
• Target population(s):
– Every Chinese adult in Singapore who is affected by type 2
diabetes
– Normal Chinese adults (unaffected by type 2 diabetes) of the
same age band
– Classic case-control design in medical epidemiology.
But, is this sufficient???
Samples versus Population
• Obviously not possible to perform an experiment on
every diabetic Chinese adult in Singapore
• Select a representative set of individuals from the
appropriate population to perform the experiment on
• This set of individuals is known as your samples.
All diabetic Chinese adults in Singapore
Selected samples in research
What is your intuition?
• A pharmaceutical firm is developing a medical drug, that
purportedly treats severe headache.
• During the clinical trials (testing the efficacy and safety of
the drug), it was tested on 10 people, of which 7 reported
that it worked to reduce headaches, while 3 claimed it
had no effect.
• Another pharmaceutical firm also developed a competing treatment,
but tested it on 1000 people, of which 704 reported it
helped to reduce headaches, while 294 claimed it had no
effect, and 2 people claimed their headaches worsened.
Which setting do you think gives you more
information about the developed drug? And
why?
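One way to make the intuition concrete is to put an uncertainty range around each observed success rate. The sketch below is an assumption-laden illustration, not the course's method: it uses the simple normal-approximation (Wald) interval with z = 1.96 for roughly 95% coverage, which is known to be crude for a sample as small as 10. The function name is hypothetical.

```python
import math


def proportion_interval(successes: int, n: int, z: float = 1.96):
    """Observed proportion with a normal-approximation interval, clipped to [0, 1]."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error shrinks as n grows
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


small_trial = proportion_interval(7, 10)       # 7 of 10 reported it worked
large_trial = proportion_interval(704, 1000)   # 704 of 1000 reported it helped

for label, (p, low, high) in [("10 people", small_trial), ("1000 people", large_trial)]:
    print(f"{label}: estimate {p:.3f}, interval {low:.3f} to {high:.3f}")
```

Both trials estimate a success rate near 70%, but the interval from 10 people is about ten times wider than the one from 1000 people, so the larger trial pins the rate down far more tightly.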
Sample Size Determination
• Types of effects that can be detected depend entirely
on sample sizes.
• RR = 2.5: 200 cases and 200 controls
• RR = 1.8: 1000 cases and 1000 controls
• RR = 1.2: 4000 cases and 4000 controls (for complex diseases!)
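The pattern above (smaller effects need larger samples) can be sketched with the textbook sample-size formula for comparing two proportions. Everything specific here is an assumption made for illustration: the baseline proportion of 0.1, the 5% two-sided significance level, the 80% power, and the function name. It is a simplified two-group comparison, not the calculation behind the case and control numbers on the slide, so the outputs are not expected to match them.

```python
import math
from statistics import NormalDist


def n_per_group(p_baseline: float, relative_risk: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate group size needed to detect p_baseline vs relative_risk * p_baseline."""
    p_other = relative_risk * p_baseline
    if not 0 < p_other < 1:
        raise ValueError("relative_risk * p_baseline must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    p_bar = (p_baseline + p_other) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_baseline * (1 - p_baseline)
                                      + p_other * (1 - p_other))) ** 2
    return math.ceil(numerator / (p_other - p_baseline) ** 2)


# Assumed baseline proportion of 0.1 (hypothetical, not from the lecture).
sizes = {rr: n_per_group(0.1, rr) for rr in (2.5, 1.8, 1.2)}
for rr, n in sizes.items():
    print(f"RR = {rr}: about {n} per group")
```

Under these assumptions the required group size rises steeply as the relative risk approaches 1, which is the qualitative point of the slide.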
Sample Size Determination
• An issue commonly discussed in medical research!
• Power calculations, sample size, effect sizes, statistical
significance?
Power calculations
Sample size
Effect sizes
Statistical Significance
Recall: Your ability to identify a true pregnancy
Requiring more evidence
means lower power
What level of statistical evidence do you
consider “believable”?
Sample Selection
• Simple Random Sample
– Every individual in the population has an equal chance of being
selected (e.g. phonebook sampling)
• Stratified Sample
– Every individual in the population belongs uniquely to a specific
category (e.g. gender), and samples are drawn from within each category
• Cluster Sampling
– Each cluster has the characteristics of the population, and
sampling is performed within the cluster rather than in the
population (e.g. diabetic patients in one hospital in Singapore,
compared to all diabetic patients in Singapore)
• Multistage Sampling
– A combination of different sampling schemes
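The first two schemes above can be sketched in a few lines. This is an illustrative sketch only: the population of 100 people (60 female, 40 male), the helper names, and the fixed random seed are all hypothetical choices made for the example.

```python
import random


def simple_random_sample(population, k, rng):
    """Every individual has the same chance of being selected."""
    return rng.sample(population, k)


def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Group individuals by category, then sample separately within each category."""
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return picked


# Hypothetical population: (id, gender) pairs, 60 female and 40 male.
population = [(i, "F" if i < 60 else "M") for i in range(100)]
rng = random.Random(2024)   # fixed seed so the sketch is repeatable

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda person: person[1], 5, rng)

print("simple random sample, females:", sum(1 for _, g in srs if g == "F"))
print("stratified sample, females:   ", sum(1 for _, g in strat if g == "F"))
```

The simple random sample can end up with any gender split by chance, while the stratified sample always contains exactly 5 females and 5 males.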
Data exploration and Statistical analysis
1. Exploratory data analysis
2. Probability and Bayes' Theorem
3. Theoretical distributions (Uniform, Bernoulli, Binomial,
Poisson, Normal)
4. Confidence Interval
5. Hypothesis testing (t-test, ANOVA, test of proportions,
Chi-square tests)
6. Non-parametric tests
7. Linear regression and correlation
8. Logistic regression
GIBBERISH?!
Data exploration and Statistical analysis
1. Data checking, identifying problems and characteristics
2. Understanding chance and uncertainty
3. How will the data for one attribute behave, in a
theoretical framework?
4. Theoretical framework assumes complete information,
need to address uncertainties in real data
5. Testing your beliefs, do the data support what you think
is true?
6. What happens when the assumptions of the theoretical
framework are not valid?
7. Modeling relationships between multiple outcomes and
a numerical response
8. Ditto, but with a two-state outcome.
Data
• Data exploration, categorical / numerical outcomes
• Model relationships between different outcomes
– Estimation of parameters, quantifying uncertainty
– Linear regression (numerical response)
– Logistic regression (categorical response)
• Model each outcome with a theoretical distribution
– Estimation of parameters, quantifying uncertainty
– Confidence intervals, to quantify uncertainty
– Hypothesis testing: parametric tests (t-tests, ANOVA, test of proportions) and non-parametric tests (Wilcoxon, Kruskal-Wallis, rank test)
Statistics – Truths or Lies
• 21st century – age of information
• Responsible for driving scientific progress in multiple
disciplines
• Core skills for data analysis
• Ability and knowledge to ingest and digest information is
at a premium
Computers and Statistics
• Excel, SPSS, Minitab, Stata, MATLAB, R, etc…
• RExcel for this course:
http://www.stat.nus.edu.sg/~statyy/ST1232/bin/RExcel_installation.docx
Advantages
• Speed, accuracy, ease of data manipulation
• Easy to produce plots, cross-tabulation tables,
summary statistics
Disadvantages
• Inappropriate analysis / use of wrong tests
• Data dredging
Brief introduction to RExcel and SPSS
Features
• RExcel and SPSS – extremely similar in terms of data entry
and usage
• Spreadsheet-based data entry system
– In RExcel, link data in Excel to R
• Remember: a unique individual/entry per row!
• Drop-down menu option for data analysis
– In RExcel, output is in the R Commander tab
• While both are extremely intuitive, SPSS is slightly more user-friendly, in terms of defining variables and format of output
• Details will be given in the subsequent lectures
It is important to know the usage and interpretation of both SPSS
and RExcel well: both are examinable and practically important!
Reminders
Book your tutorial slots!
Work on your tutorials before going to
the classes!