What is statistics?

MD 5108
Biostatistics for Basic Research
Lecturer: Dr K. Mukherjee
Office: S16-06-100
Tel: 874 2764
Email: stamk@nus.edu.sg
Objectives
To train practitioners of the biomedical sciences in
the use and interpretation of statistical data analysis.
• explore and present data using tables, charts and graphs
• do simple statistical calculations with a calculator
• carry out data analysis using a statistical package such as SPSS
• pick the right procedure for analysing a set of data
• interpret results correctly and report findings
• avoid misuse and abuse of statistics
• understand the statistical content of papers in medical journals
• judge claims and statements critically
• discuss and communicate ideas in a quantitative manner
Teaching approach
• nonmathematical introduction
• explanation of concepts rather than proofs
• emphasis on methodology and procedures
• emphasis on use of a statistical package rather
than manual calculation
• emphasis on choosing the right procedure
• emphasis on correct interpretation of results
• examples from clinical research literature
Topic 1: What is statistics?
“A branch of mathematics dealing with the analysis
and interpretation of masses of numerical data”
Merriam-Webster Dictionary
“The field of study that involves the collection and
analysis of numerical facts or data of any kind” Oxford
Dictionary
“The study of how information should be employed to
reflect on, and give guidance for action, in a practical
situation involving uncertainty” Vic Barnett
Biostatistics: Application of statistical methods
to the biological, medical and health sciences
Why the need for statistics in biomedicine?
Two main reasons:
• Variation
– attributes differ not only among individuals but also
within the same individual over time
• Sampling
– biomedical research projects are mostly carried out on
small numbers of study subjects
– it is challenging to project results from small-sample
studies to individuals at large
Biological Variation
Necessitates the use of statistical
methods in biomedicine to put numerical
data into a context by which we can
better judge their meaning
From sample to population
Statistical methods are used to produce statistical inferences
about a population based on information from a sample derived
from that population
[Diagram: sample → inductive statistical methods → population]
Altman (1991) Practical Statistics for Medical Research,
Chapman and Hall.
Bailar & Mosteller (1986) Medical Uses of Statistics, NEJM Books.
Many studies
have been done
on misuse of
statistics in medicine
From Altman (1991)
Schor and Karten (1966, J. Am. Med. Assoc.):
• 149 papers classed as “analytical studies” in
3 issues of 11 most frequently read medical
journals
• assessment criteria:
Validity with respect to:
• Design of experiment?
• Type of analysis performed?
• Applicability of statistical test used?
Findings of Schor and Karten:
• 28% of papers acceptable
• 68% deficient but acceptable if revised
• 4% unsalvageable
Lesson:
CARE
must be exercised when
reading scientific papers in
biomedical journals!
Knowledge of basic
biostatistics is required
“There are three kinds of lies: lies, damned
lies and statistics” Benjamin Disraeli
“It is easy to lie with statistics, but it is easier
to lie without them” Frederick Mosteller
“Statistical thinking will one day be as
necessary for efficient citizenship as the
ability to read and write.” H.G. Wells
Types of statistical methods
1. Descriptive statistical methods
• data collection and organization
• summarizing data and describing its characteristics
• presentation and publication
2. Exploratory data analysis
• play around and get a feel for the data
• preliminary analysis, often graphical
• looking for patterns and possible relationships
• are assumptions satisfied?
• which model and procedure to use?
3. Inductive (inferential) statistical methods
Statistical inferences about a population based on information
from a sample derived from that population
• estimation, confidence intervals
• hypothesis testing
• prediction, forecasting
• classification
Topic 2: Types of data
Sources of data, the raw materials of statistics
• Routinely kept records, e.g., hospital medical records
• Surveys
• Experiments
• Clinical trials
• Databases
• Published reports
Any characteristic that can be measured or classified
into categories is called a variable
Types of variables
(1) Qualitative variables
• cannot be measured numerically
• categorical in nature, e.g., gender
• categories must not overlap and must cover all possibilities
• Nominal variables (no inherent ordering of categories)
  – M/F, Yes/No
  – Blood group (A, B, AB, O)
  – Ethnic group (Chinese, Malay, Indian, Others)
• Ordinal variables (categories are ordered in some sense)
  – response to treatment: unimproved, improved, much improved
  – pain severity: no pain, slight pain, moderate pain, severe pain
(2) Quantitative variables
• can be measured numerically, e.g., weight, height, concentration
• can be continuous or discrete
  – a continuous variable can take on any value (subject to the
    precision of the measuring instrument) within some range or
    interval, e.g., weight, height, blood pressure, cholesterol level
  – a discrete variable is usually a count of something and hence
    takes on integer values only, e.g., number of admissions to NUH
Variable types and measurement types
• have implications for how data should be displayed or summarized
• determine the kind of statistical procedures that should be used
SUMMARY: Types of variables and measurement scales
Variable
• Qualitative or categorical
  – Nominal (not ordered), e.g. ethnic group
  – Ordinal (ordered), e.g. response to treatment
• Quantitative (measurement)
  – Discrete (count data), e.g. number of admissions
  – Continuous (real-valued), e.g. height
Topic 3: Presenting data graphically
Advantages of graphical data display
• Let the data speak for themselves
• Get a good feel for the data before formal analysis
• Graphs and plots are easier to understand and interpret
• Reveal patterns in the data which may shed light on the
appropriate model/analysis to use, e.g.,
  – skewed or symmetric distribution
  – multiple peaks / modes
  – outliers
  – relationship between variables
Graphs for categorical data
[Figure: Bar chart for world pharmaceutical spending, 1997. x-axis: Region; y-axis: % of world spending, 0 to 35]
[Figure: Pie chart for world pharmaceutical spending, 1997, with the following slice labels]
Region             % of world spending
Africa                  1.0
Australasia             1.0
Canada                  2.0
Europe                 29.0
Japan                  16.0
Latin America           8.0
Middle East             2.0
SE Asia & China         7.0
USA                    34.0
[Figure: Segmented bar chart for world pharmaceutical spending, 1997. A single bar divided by region; y-axis: % of world spending, 0 to 100]
[Figure: The bar chart, pie chart and segmented bar chart for world pharmaceutical spending, 1997, shown together]
Comparison of methods
• Bar charts can be read more accurately and distinguish better
between values that are close together
• Pie charts are especially useful for showing a percentage
distribution
• Pie charts can display large and small percentages simultaneously
without a scale break
• A single bar chart is preferable to a single segmented bar chart
• A series of segmented bar charts is easier to read than a
series of pie charts or ordinary bar charts
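As a rough illustration of how a segmented bar chart relates to the same percentages shown in a bar or pie chart, the sketch below stacks the regional percentages read from the pie chart labels into one bar running from 0 to 100. The function name `segment_bounds` is made up for this example, and the assignment of 1% to Africa and 29% to Europe is an assumption based on the order of the labels.

```python
# Percentage of world pharmaceutical spending by region, 1997, as read
# from the pie chart labels (Africa = 1 and Europe = 29 are assumed).
spending = {
    "Africa": 1, "Australasia": 1, "Canada": 2, "Europe": 29, "Japan": 16,
    "Latin America": 8, "Middle East": 2, "SE Asia & China": 7, "USA": 34,
}


def segment_bounds(percentages):
    """Return (label, lower, upper) for each segment of a segmented bar."""
    bounds = []
    lower = 0
    for label, pct in percentages.items():
        upper = lower + pct          # each segment starts where the last ended
        bounds.append((label, lower, upper))
        lower = upper
    return bounds


bounds = segment_bounds(spending)
for label, lower, upper in bounds:
    # A plain bar chart would use only (upper - lower) as the bar height.
    print(f"{label:16s} from {lower:3d}% to {upper:3d}%")
```

Only the first segment starts at zero, which is one way to see why a single bar chart is easier to read accurately than a single segmented bar.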
[Figure: Bar chart for number of health professionals. x-axis: Profession (Dentists, Doctors, Nurses, Pharmacists); y-axis: Number of workers, 0 to 6000]
Variation of the basic bar chart
[Figure: Stacked bar chart for number of health professionals. x-axis: Profession (Dentists, Doctors, Nurses, Pharmacists); y-axis: Number of workers, 0 to 6000; each bar split by sector (Private, Public)]
[Figure: Clustered bar chart for number of health professionals. x-axis: Profession; y-axis: Number of workers, 0 to 4000; separate Private and Public bars for each profession]
[Figure: Segmented bar charts by profession. x-axis: Profession; y-axis: Percent by sector (Private, Public), 0 to 100]
Plotting by sector rather than by profession
• Look at the data from a different angle
• Highlight different aspects of the data
[Figure: Clustered bar charts of number of health professionals. x-axis: Sector (Private, Public); y-axis: Number of workers, 0 to 4000; separate bars for Dentists, Doctors, Nurses and Pharmacists]
[Figure: Stacked bar charts by sector. x-axis: Sector (Private, Public); y-axis: Number of workers, 0 to 6000; each bar split by profession]
[Figure: Percentage bar charts by sector. x-axis: Sector (Private, Public); y-axis: Percent within sector, 0 to 100; separate bars for each profession]
[Figure: Segmented bar charts by sector. x-axis: Sector (Private, Public); y-axis: Percent within sector, 0 to 100; each bar split by profession]
A back to back bar chart
Source: JAMA, 1978, vol 239, no 21
Comparison of methods
• A stacked bar chart is also a bar chart for the combined data
• Some of the bars in a stacked bar chart are not aligned
• Bars in clustered bar charts are aligned, but it is harder to
visualize how the component bars would stack up
• Back-to-back bar charts are applicable only when there are
2 groups; the aggregated bars are not aligned
• A series of stacked or segmented bar charts is useful for
showing a time trend
Time Trend
Starting the vertical axis at 8 rather than 0 visually exaggerates
the increase in the number of prescriptions written per person
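The size of this visual effect can be sketched with a small calculation. The figures below are hypothetical and are not taken from the chart in the slides; `apparent_ratio` is a helper name invented for this illustration. It compares the ratio of the drawn bar heights when the axis starts at 0 with the ratio when it starts at 8.

```python
def apparent_ratio(first, last, baseline=0.0):
    """Ratio of drawn bar heights when the vertical axis starts at `baseline`."""
    if baseline >= min(first, last):
        raise ValueError("baseline must lie below both values")
    return (last - baseline) / (first - baseline)


# Hypothetical prescriptions per person in the first and last year.
first_year, last_year = 9.0, 11.0

honest = apparent_ratio(first_year, last_year, baseline=0.0)
truncated = apparent_ratio(first_year, last_year, baseline=8.0)

print(f"axis from 0: last bar is {honest:.2f} times as tall as the first")
print(f"axis from 8: last bar is {truncated:.2f} times as tall as the first")
```

With these made-up values, an increase of about 22% is drawn as a bar three times as tall once the axis starts at 8.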
Stacked bar chart of yearly mortality rate per 1000 births
Pagano & Gauvreau (1999) Principles of Biostatistics, Duxbury.
Response under two treatments
Response to treatment    Treatment A    Treatment B
None                          3              2
Partial                      15             22
Complete                      9             30
Total                        27             54
A misleading bar chart
[Figure: Bar chart of frequency by response to treatment (None, Partial, Complete), with separate bars for treatments A and B; y-axis: Frequency, 0 to 30]
By design, there are twice as many patients receiving treatment B
Within treatment percentage
Can compare the response type percentages
for the two treatments
[Figure: Bar chart of within-treatment percentage for each response to treatment (None, Partial, Complete); x-axis: Treatment (A, B); y-axis: 0 to 100]
Within treatment percentage
Stacked bar charts for percentage figures
[Figure: Stacked bar charts of within-treatment percentage, split by response to treatment (None, Partial, Complete); x-axis: Treatment (A, B); y-axis: 0 to 100]
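The within-treatment percentages behind these charts can be recomputed from the counts in the table of responses under the two treatments. The following is a minimal sketch; the function name `within_group_percentages` is an illustrative choice, not something from the course material.

```python
# Counts from the "Response under two treatments" table.
counts = {
    "A": {"None": 3, "Partial": 15, "Complete": 9},
    "B": {"None": 2, "Partial": 22, "Complete": 30},
}


def within_group_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {k: 100.0 * v / total for k, v in row.items()}
    return result


percentages = within_group_percentages(counts)
for treatment, row in percentages.items():
    print(treatment, {k: round(v, 1) for k, v in row.items()})
# A {'None': 11.1, 'Partial': 55.6, 'Complete': 33.3}
# B {'None': 3.7, 'Partial': 40.7, 'Complete': 55.6}
```

The raw counts of complete responses (9 against 30) overstate the difference because treatment B has twice as many patients; the percentages (33.3% against 55.6%) put the two treatments on the same footing.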
Graphs for quantitative data
• Histogram
• Frequency polygon
• Box plot
Histogram
Divide the range of the data into a suitably chosen
number of intervals/bins, all of the same width
The number of observations that fall within each
interval is plotted
Relative frequency histogram
Plot the proportions of observations that fall within
the class intervals
Wild & Seber (2000) Chance Encounters, Wiley.
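The binning step described above can be sketched in a few lines. The measurements are hypothetical, the function names are invented for this example, and placing the largest observation in the last bin is one common convention rather than the only one.

```python
def histogram_counts(values, n_bins):
    """Count observations in n_bins equal-width intervals spanning the data."""
    low, high = min(values), max(values)
    if n_bins < 1 or high == low:
        raise ValueError("need at least one bin and some spread in the data")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum falls on the last edge, so it is put in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts


def relative_frequencies(counts):
    """Proportion of observations in each bin (relative frequency histogram)."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical measurements, for illustration only.
values = [42, 55, 61, 64, 70, 73, 75, 78, 82, 90, 95, 101, 118, 140, 162]
edges, counts = histogram_counts(values, n_bins=4)
proportions = relative_frequencies(counts)

for left, right, c, p in zip(edges, edges[1:], counts, proportions):
    print(f"{left:6.1f} to {right:6.1f}: {c:2d} observations ({100 * p:.1f}%)")
```

The frequency histogram plots `counts` and the relative frequency histogram plots `proportions`; the shape is the same and only the vertical scale changes.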
Histogram of End-Systolic Volume for 45 Male Heart Attack Patients
[Figure: histogram; x-axis: SysVol, 40 to 220; y-axis: Frequency, 0 to 20]
Relative frequency polygon for SysVol
[Figure: relative frequency polygon; x-axis: SysVol, 40 to 220; y-axis: Percent, 0 to 40]
Comparison of methods
Histogram
• good at revealing distributional shape such as symmetry,
skewness, number of peaks, etc.
• difficult to superimpose or draw side by side
Frequency polygons
• can be superimposed for easy comparison
[Figures: examples of frequency polygons that can be superimposed. Sources: Wild & Seber (2000, p.59); Pagano & Gauvreau (1999); Wild & Seber (2000)]
Median and quartiles
• Sort the data in increasing order
• The median is the middle value (if n is odd) or the
average of the two middle values (if n is even); it is
a measure of the “center” of the data
• Quartiles divide the set of ordered values into 4
equal parts; Q2 = second quartile = median
first 25% | Q1 | second 25% | Q2 | third 25% | Q3 | fourth 25%
• IQR = Interquartile range = Q3 − Q1
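The definitions above can be tried out with Python's standard `statistics` module. This is only a sketch on hypothetical data: statistical packages use several slightly different conventions for quartiles, and the "inclusive" interpolation method chosen here is an assumption, not necessarily the rule used in the course or in SPSS.

```python
import statistics


def quartile_summary(values):
    """Median, quartiles and IQR of a data set (inclusive interpolation)."""
    ordered = sorted(values)                     # step 1: sort the data
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "Q1": q1,
        "median": statistics.median(ordered),    # Q2
        "Q3": q3,
        "IQR": q3 - q1,
    }


# Hypothetical observations, for illustration only.
data = [7, 1, 3, 9, 5, 11, 13]
summary = quartile_summary(data)
print(summary)

# With an even number of observations the median is the average
# of the two middle values.
even_median = statistics.median([1, 3, 5, 7])
print(even_median)
```

For the seven sorted values 1, 3, 5, 7, 9, 11, 13 this gives Q1 = 4, median = 7, Q3 = 10 and IQR = 6; other quartile conventions may give slightly different Q1 and Q3 for the same data.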
Box plot
• Draw a box from the lower quartile to the upper quartile,
with a line inside it to mark the position of the median
• Extend a line from each edge of the box by 1.5 × IQR,
then pull each line back until it hits an observation
• Observations more than 1.5 × IQR away from the lower or
upper quartile are marked as outside values for further
investigation and checking
How a boxplot is constructed (Wild & Seber, 2000, p.73)
5-Number Summary: min, lower quartile, median, upper quartile, max
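The construction can be followed numerically with a short sketch. The data are hypothetical, `boxplot_summary` is a name invented for this example, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is an assumption; a different quartile rule would shift the fences slightly.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Five-number summary, whisker ends and outside values for a box plot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr       # furthest a whisker may reach
    upper_fence = q3 + k * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outside = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
        # Whiskers are pulled back to the most extreme observations
        # that still lie within the fences.
        "whiskers": (inside[0], inside[-1]),
        "outside": outside,
    }


# Hypothetical observations, for illustration only.
data = [1, 3, 5, 7, 9, 11, 13, 40]
result = boxplot_summary(data)
print(result)
```

Here the box runs from 4.5 to 11.5, the upper fence is at 22, the upper whisker stops at 13 and the value 40 is flagged as an outside value for checking.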
Dotplot for SysVol = End-systolic volume,
a measure of the size of the heart
[Figure: dot plot; x-axis: SysVol, 50 to 200]
Boxplot for SysVol
[Figure: box plot; x-axis: SysVol, 20 to 220]
Advantages of box plot
• quick visual summary of a data set
• captures prominent features like location,
spread, skewness and outliers
• can easily draw a series of box plots side by
side; not so for histograms
Dataset: Hotdogs
Brand name                    Type      Taste      $/oz   $/lbProt   Cal   Sod   Prot/Fat
Happy Hill Supers             Beef      Bland      0.11   14.23      186   495   1
Georgies Skinless Beef        Beef      Bland      0.17   21.7       181   477   2
Special Market's Premium B    Beef      Bland      0.11   14.49      176   425   1
Spike's Beef                  Beef      Medium     0.15   20.49      149   322   1
Hungry Hugh's Jumbo Beef      Beef      Medium     0.1    14.47      184   482   1
Great Dinner Beef             Beef      Medium     0.11   15.45      190   587   1
RJB Kosher Beef               Beef      Medium     0.21   25.25      158   370   2
Wonder Kosher Skinless Bee    Beef      Medium     0.2    24.02      139   322   2
Happy Fats Jumbo Beef         Beef      Medium     0.14   18.86      175   479   1
Midwest Beef                  Beef      Medium     0.14   18.86      148   375   1
General Kosher Beef           Beef      Medium     0.23   30.65      152   330   1
Wall's Kosher Beef Lower F    Beef      Medium     0.25   25.62      111   300   3
Hickory Natural Smoke         Beef      Medium     0.07   8.12       141   386   2
Smith Beef                    Beef      Medium     0.09   12.74      153   401   1
Premium Beef                  Beef      Medium     0.1    14.21      190   645   1
Family Store Skinless Beef    Beef      Medium     0.1    13.39      157   440   1
Sam's Kosher Beef             Beef      Medium     0.19   22.31      131   317   2
Hammer Beef                   Beef      Medium     0.11   19.95      149   319   1
Athens Beef                   Beef      Medium     0.19   22.9       135   298   2
Regents Kosher Beef           Beef      Scrumpt.   0.17   19.78      132   253   2
Really Big                    Meat      Bland      0.12   14.86      173   458   2
Biggest Jumbo                 Meat      Bland      0.12   17.32      191   506   1
Home Made                     Meat      Bland      0.12   15.2       182   473   1
Martha's Jumbo Dinner         Meat      Bland      0.1    14.01      190   545   1
Hammer Premium                Meat      Bland      0.11   13.92      172   496   2
Willie's Wieners              Meat      Bland      0.13   18.24      147   360   1
Premium Hot Dogs              Meat      Medium     0.1    14.12      146   387   1
Airport Wieners               Meat      Medium     0.09   11.83      139   386   2
Judy's Favorite Jumbos        Meat      Medium     0.11   15.41      175   507   1
Stick Lean Supreme Jumbo      Meat      Medium     0.15   17.4       136   393   3
Stick Jumbo                   Meat      Medium     0.13   17.32      179   405   1
Fat Jack Jumbo                Meat      Medium     0.1    15.61      153   372   1
Thin Jack Veal                Meat      Medium     0.18   20.4       107   144   3
Top Grade Hot Dogs            Meat      Medium     0.09   12.65      195   511   1
Blended w/Chicken&Beef        Meat      Scrumpt.   0.07   11.17      135   405   1
Heaven Made                   Meat      Scrumpt.   0.08   11.75      140   428   1
Baked and Smoked              Meat      Scrumpt.   0.06   9.49       138   339   1
Smart Person Chicken          Poultry   Bland      0.08   10.21      129   430   2
Woods Park Chicken            Poultry   Medium     0.05   6.37       132   375   2
Tony Turkey                   Poultry   Medium     0.07   8.42       102   396   3
Rose Garden Turkey            Poultry   Medium     0.08   9.37       106   383   3
Low Fat Turkey                Poultry   Medium     0.08   9          94    387   4
Special Market's Turkey       Poultry   Medium     0.07   8.07       102   542   5
Caloryless Turkey             Poultry   Medium     0.09   9.39       90    359   5
Heaven Made Lower Fat         Poultry   Medium     0.06   6.59       99    357   4
McDowell's Jumbo Chicken      Poultry   Medium     0.07   8.43       107   528   2
Graphical Analysis of
the “Hotdogs” data.
Parallel Box plots Can Be Quite Revealing
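As a numerical counterpart to parallel box plots, the sketch below computes a five-number summary of the calorie (Cal) values of the Hotdogs data separately for each Type. The grouping of the 46 calorie values into 20 Beef, 17 Meat and 9 Poultry rows follows the order of the dataset listing, and the quartiles use the "inclusive" method of `statistics.quantiles`; both are assumptions of this illustration.

```python
import statistics

# Calorie values from the Hotdogs dataset, grouped by Type.
calories = {
    "Beef": [186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
             152, 111, 141, 153, 190, 157, 131, 149, 135, 132],
    "Meat": [173, 191, 182, 190, 172, 147, 146, 139, 175, 136,
             179, 153, 107, 195, 135, 140, 138],
    "Poultry": [129, 132, 102, 106, 94, 102, 90, 99, 107],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for one group."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


summaries = {kind: five_number_summary(v) for kind, v in calories.items()}
for kind, (low, q1, median, q3, high) in summaries.items():
    print(f"{kind:8s} min={low} Q1={q1} median={median} Q3={q3} max={high}")
```

The medians come out at 152.5 for Beef, 153 for Meat and 102 for Poultry, which is the kind of difference between groups that side-by-side box plots make visible at a glance.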
[Figure: parallel box plots of concentration over time, 1969 to 1972. Source: Rice (1995) Mathematical Statistics & Data Analysis, Duxbury Press.]
• Reduction in concentration through time
• Higher during winter months
• Skewed toward higher values
• Spread increases with level
(Parallel histograms would be much harder to visualise)