Introduction to Biostatistics: Data Collection. Descriptive Statistics

Introduction to Research Methods
In the Internet Era
Introduction to Biostatistics
Data Collection
Descriptive Statistics
Thomas Songer, PhD
with acknowledgment to several slides provided by
M Rahbar and Moataza Mahmoud Abdel Wahab
Key Lecture Concepts
• Distinguish between different strategies
for obtaining a sample from a population
• Distinguish between different forms of
data collection
• Identify key approaches to organize and
portray your data
• Understand the measures of central
tendency and variability in your data
Descriptive & Inferential Statistics
Descriptive Statistics deal with the
enumeration, organization and graphical
representation of data from a sample
Inferential Statistics deal with reaching
conclusions from incomplete information, that
is, generalizing from the specific sample
Inferential statistics use available information in
a sample to draw inferences about the population
from which the sample was selected
Rahbar
Epidemiology is…
• The study of disease and its treatment,
control, and prevention in a population of
individuals.
• Whole populations may be examined, but…
• More frequently, samples of the population
may be examined. Samples that are studied
must be representative of the population for
the results to be generalized to the total
population.
Torrence 1997
Hypothetical Population
Sample 1:
Representative? Y N
Sample 2:
Representative? Y N
Sample 3:
Representative? Y N
Sampling Approaches
• Convenience Sampling: select the most
accessible and available subjects in the target
population. Inexpensive and less time consuming,
but the sample is nearly always non-representative
of the target population.
• Random Sampling (Simple): select subjects at
random from the target population. All members of
the target population must be identified first.
Frequently provides a representative sample.
Sampling Approaches
• Systematic Sampling: identify all members of the
target population, and select every xth person as a
subject.
• Stratified Sampling: identify important subgroups
in your target population. Sample from these groups
randomly or by convenience. Ensures that important
subgroups are included in the sample. May not be
representative.
• More complex sampling
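The simple random, systematic, and stratified approaches above can be sketched in a few lines of Python. This is a minimal illustration, not a survey-sampling library: the population of 100 subject IDs, the "rural"/"urban" strata, and all function names are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n subjects at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, step, seed=0):
    """Pick a random start among the first `step` subjects, then every step-th subject."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, stratum_of, n_per_stratum, seed=0):
    """Group subjects by stratum, then draw a simple random sample within each group."""
    rng = random.Random(seed)
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    return {name: rng.sample(members, min(n_per_stratum, len(members)))
            for name, members in strata.items()}

# Hypothetical target population: subject IDs 1..100; IDs 1..30 are "rural".
population = list(range(1, 101))

def stratum_of(subject_id):
    return "rural" if subject_id <= 30 else "urban"

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, step=10)
stratified = stratified_sample(population, stratum_of, n_per_stratum=5)

print(srs)
print(systematic)
print(stratified)
```

All three functions need the full list of the target population up front, which is the practical cost noted on the slides. Convenience sampling is left out because it has no random mechanism to code.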
Sampling Error
• The discrepancy between the true population
parameter and the sample statistic
• Sampling error likely exists in most studies,
but can be reduced by using larger sample
sizes
• Sampling error shrinks roughly in proportion to 1 / √n,
where n is the sample size
• Note that larger sample sizes also require time
and expense to obtain, and that large sample
sizes do not eliminate sampling error
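To make the 1 / √n relationship concrete, the sketch below computes the standard error of a sample mean for several sample sizes. It assumes sampling error is measured by the standard error (SD / √n), and the population standard deviation of 10 is a made-up value.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical population standard deviation
errors = {n: standard_error(sd, n) for n in (25, 100, 400)}

for n, se in errors.items():
    print(n, se)
# 25 2.0
# 100 1.0
# 400 0.5
```

Quadrupling the sample size only halves the error, which is why larger samples reduce sampling error but never eliminate it.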
Research Process
Research question
Hypothesis
Identify research design
Data collection
Presentation of data
Data analysis
Interpretation of data
Polgar, Thomas
Types of Data Collection
• Surveys/Questionnaires
– Self-report
– Interviewer-administered
– Proxy
• Direct medical examination
• Direct measurement (e.g. blood draws)
• Administrative records
Understanding and Presenting
Data
Types of Data
1. Categorical: (e.g., Sex, Marital Status,
income category)
2. Continuous: (e.g., Age, income, weight,
height, time to achieve an outcome)
3. Discrete: (e.g., number of children in a
family)
4. Binary or Dichotomous: (e.g., responses to
yes/no questions)
Brain Size and IQ
What types of data do these variables represent?
Gender   FSIQ   VIQ   PIQ   Weight   Height   MRI Count
Female   133    132   124   118      64.5     816932
Male     140    150   124   124      72.5     1001121
Male     139    123   150   143      73.3     1038437
Male     133    129   128   172      68.8     965353
Female   137    132   134   147      65       951545
Female   99     90    110   146      69       928799
Female   138    136   131   138      64.5     991305
Female   92     90    98    175      66       854258
Male     89     93    84    134      66.3     904858
Male     133    114   147   172      68.8     955466
Female   132    129   124   118      64.5     833868
Scale of Data
1. Nominal: These data do not represent an amount or
quantity (e.g., Marital Status, Sex)
2. Ordinal: These data represent an ordered series of
categories (e.g., level of education)
3. Interval: These data are measured on an interval scale
having equal units but an arbitrary zero point (e.g.,
temperature in Fahrenheit)
4. Ratio: These data have equal units and a true zero point,
so one value can be compared meaningfully with another
as a ratio (e.g., weight: 100 kg is twice 50 kg)
Organizing Data and Presentation
• Frequency Table
• Frequency Histogram
• Relative Frequency Histogram
• Frequency polygon
• Relative Frequency polygon
• Bar chart
• Pie chart
• Box plot
Frequency Table
• Generally, the first approach to examining
your data.
• Identifies distribution of variables overall
• Identifies potential outliers
– Investigate outliers as possible data entry
errors
– Investigate a sample of others for data entry
errors
Frequency Table
A research study has been conducted examining
the number of children in the families living in a
community. The following data has been
collected based on a random sample of n = 30
families from the community.
2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0,
5, 8, 6, 5, 4, 2, 4, 4, 7, 6
Organize this data in a Frequency Table!
X = No. of Children   Count (Frequency)   Relative Freq.
0                     2                   2/30 = 0.067
1                     3                   3/30 = 0.100
2                     5                   5/30 = 0.167
3                     5                   5/30 = 0.167
4                     6                   6/30 = 0.200
5                     4                   4/30 = 0.133
6                     2                   2/30 = 0.067
7                     2                   2/30 = 0.067
8                     1                   1/30 = 0.033
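The same frequency table can be built programmatically. The sketch below uses Python's standard `collections.Counter` on the 30 family sizes from the exercise; the variable names are illustrative only.

```python
from collections import Counter

# Number of children in each of the n = 30 sampled families.
children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0,
            5, 8, 6, 5, 4, 2, 4, 4, 7, 6]

counts = Counter(children)
n = len(children)

# One row per distinct value: (value, frequency, relative frequency).
table = [(x, counts[x], counts[x] / n) for x in sorted(counts)]

print("X  Count  Rel.Freq")
for x, freq, rel in table:
    print(f"{x}  {freq:>5}  {rel:.3f}")
```

A quick check on any frequency table is that the counts sum to n and the relative frequencies sum to 1.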
Frequency Table
Now, construct a similar frequency table for the
age of patients with heart-related problems in a
clinic.
The following data have been collected based on a
random sample of n = 30 patients who went to the
emergency room of the clinic for heart-related
problems.
The measurements are: 42, 38, 51, 53, 40, 68, 62,
36, 32, 45, 51, 67, 53, 59, 47, 63, 52, 64, 61, 43, 56,
58, 66, 54, 56, 52, 40, 55, 72, 69.
Age Groups   Frequency   Relative Frequency
32-36 yr     2           2/30 = 0.067
37-41 yr     3           3/30 = 0.100
42-46 yr     3           3/30 = 0.100
47-51 yr     3           3/30 = 0.100
52-56 yr     8           8/30 = 0.267
57-61 yr     3           3/30 = 0.100
62-66 yr     4           4/30 = 0.133
67-72 yr     4           4/30 = 0.133
Total        n = 30
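For continuous data such as age, each measurement is first assigned to a group and the groups are then counted. A minimal sketch, tallying the 30 ages listed in the exercise into the age groups used in the table (the variable names are hypothetical):

```python
# Ages of the n = 30 sampled patients, as listed in the exercise.
ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

# Inclusive (low, high) bounds of each age group; the last group is one year wider.
groups = [(32, 36), (37, 41), (42, 46), (47, 51),
          (52, 56), (57, 61), (62, 66), (67, 72)]

n = len(ages)
frequency = {f"{low}-{high} yr": sum(low <= age <= high for age in ages)
             for low, high in groups}

for label, count in frequency.items():
    print(f"{label:<9} {count:>2}  {count / n:.3f}")
print(f"Total     {sum(frequency.values())}")
```

Checking that the group counts add up to n = 30 is a simple way to catch tallying errors.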
Frequency Polygon
• Use to identify the distribution of your data
[Figure: frequency polygon with separate lines for females and males. Y-axis: Frequency (0 to 9). X-axis: Age in years (20-, 30-, 40-, 50-, 60-69).]
Table 1 in a paper
Describe your study population in a frequency table
Table Title
Name of variable (units of variable)   Frequency (n)   %   Mean (SD)
- Categories
Total
Measures of Central Tendency
Where is the heart of the distribution?
1. Mean
2. Median
3. Mode
Sample Mean
The arithmetic mean (or, simply, mean) is
computed by summing all the observations in the
sample and dividing the sum by the number of
observations.
For a sample of five household incomes, 6,000,
10,000, 10,000, 14,000, 50,000, the sample mean is
$$\bar{X} = \frac{6000 + 10000 + 10000 + 14000 + 50000}{5} = 18000$$
Median
In a list ranked from the smallest
measurement to the largest, the median is
the middle value
In our example of five household incomes,
first we rank the measurements:
6,000  10,000  10,000  14,000  50,000
The sample median is 10,000
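The mean and median of the five household incomes can be checked with Python's standard `statistics` module. This is a small sketch and the variable names are illustrative.

```python
import statistics

incomes = [6000, 10000, 10000, 14000, 50000]

mean_income = statistics.mean(incomes)      # sum of observations / number of observations
median_income = statistics.median(incomes)  # middle value of the ranked list

print(mean_income)    # 18000
print(median_income)  # 10000
```

The single income of 50,000 pulls the mean well above the median, which is why the median is often preferred for skewed data such as income.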
Mode
• The value which occurs with the greatest
frequency
• Can be used with nominal data as well as
numeric data
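Because the mode only requires counting, it works for categories as well as numbers. The sketch below uses the Gender column from the Brain Size and IQ table and the number-of-children data from the frequency table exercise; `statistics.mode` and `statistics.multimode` are standard-library functions.

```python
import statistics

# Nominal data: Gender column of the Brain Size and IQ table.
genders = ["Female", "Male", "Male", "Male", "Female", "Female",
           "Female", "Female", "Male", "Male", "Female"]

# Discrete data: number of children in the 30 sampled families.
children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0,
            5, 8, 6, 5, 4, 2, 4, 4, 7, 6]

gender_mode = statistics.mode(genders)
children_modes = statistics.multimode(children)  # list, in case of ties

print(gender_mode)     # Female
print(children_modes)  # [4]
```

`multimode` returns every value tied for the highest frequency, which is useful when a distribution has more than one mode.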
Measures of Non-central Location
• Quartiles
• Quintiles
• Percentiles
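Quartiles, quintiles, and percentiles are cut points that divide the ranked data into 4, 5, and 100 equal-sized groups. A minimal sketch using `statistics.quantiles` on the 30 patient ages from the frequency table exercise follows. Statistical packages use slightly different interpolation rules, so other software may give somewhat different values; this example relies on Python's default ("exclusive") method.

```python
import statistics

# Ages of the n = 30 sampled patients from the frequency table exercise.
ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

quartiles = statistics.quantiles(ages, n=4)      # 3 cut points: Q1, Q2 (median), Q3
quintiles = statistics.quantiles(ages, n=5)      # 4 cut points
percentiles = statistics.quantiles(ages, n=100)  # 99 cut points

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
print("90th percentile:", percentiles[89])
```

The second quartile and the 50th percentile both coincide with the median.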
Measures of Dispersion or Variability
• Range (report the highest and lowest values
in a distribution; the difference between
these values is the range)
• Variance
• Standard deviation (the square root of
the variance)
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
s = standard deviation
(square root of variance)
Calculation of Variance and
Standard deviation
$$s^2 = \frac{(6000-18000)^2 + (10000-18000)^2 + (10000-18000)^2 + (14000-18000)^2 + (50000-18000)^2}{5-1} = 328{,}000{,}000$$
$$s = \sqrt{328{,}000{,}000} \approx 18110.77$$
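The calculation above can be reproduced step by step in Python. The sketch computes the sample variance by hand with the n - 1 denominator and then cross-checks it against the standard `statistics` module; the variable names are illustrative.

```python
import math
import statistics

incomes = [6000, 10000, 10000, 14000, 50000]

n = len(incomes)
mean = sum(incomes) / n

# Sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in incomes]
variance = sum(squared_deviations) / (n - 1)
sd = math.sqrt(variance)

print(variance)      # 328000000.0
print(round(sd, 2))  # 18110.77

# Cross-check: the library functions also use the n - 1 denominator.
assert math.isclose(variance, statistics.variance(incomes))
assert math.isclose(sd, statistics.stdev(incomes))
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead and are meant for a whole population rather than a sample.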
Mean and Standard deviation (SD)
Data: 7, 7, 7, 7, 7, 7
Mean = 7
SD = 0
Data: 7, 8, 7, 7, 7, 6
Mean = 7
SD = 0.63
Data: 3, 2, 7, 8, 13, 9
Mean = 7
SD = 4.05
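The point of this slide is that data sets with the same mean can differ greatly in spread. A short sketch comparing the three data sets follows; the dictionary labels are hypothetical names added for illustration.

```python
import statistics

# Three data sets of six values each, all with a mean of 7.
datasets = {
    "identical": [7, 7, 7, 7, 7, 7],
    "tight": [7, 8, 7, 7, 7, 6],
    "spread": [3, 2, 7, 8, 13, 9],
}

summary = {name: (statistics.mean(values), statistics.stdev(values))
           for name, values in datasets.items()}

for name, (mean, sd) in summary.items():
    print(f"{name:<10} mean = {mean}  SD = {sd:.2f}")
```

The mean alone cannot distinguish these data sets, so a measure of variability should be reported alongside it.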
Empirical Rule
For a Normal distribution approximately,
a) 68% of the measurements fall within one
standard deviation around the mean
b) 95% of the measurements fall within two
standard deviations around the mean
c) 99.7% of the measurements fall within three
standard deviations around the mean
Suppose the reaction time of a particular drug
has a Normal distribution with a mean of 10
minutes and a standard deviation of 2 minutes
Approximately,
a) 68% of the subjects taking the drug will have
reaction time between 8 and 12 minutes
b) 95% of the subjects taking the drug will have
reaction time between 6 and 14 minutes
c) 99.7% of the subjects taking the drug will have
reaction time between 4 and 16 minutes
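The percentages in the empirical rule are rounded values of exact Normal probabilities. A minimal sketch using `statistics.NormalDist` from the Python standard library, with the mean of 10 minutes and standard deviation of 2 minutes from the drug example (the helper function name is hypothetical):

```python
from statistics import NormalDist

# Reaction time in minutes: Normal with mean 10 and standard deviation 2.
reaction_time = NormalDist(mu=10, sigma=2)

def share_within(dist, k):
    """Return (low, high, probability) for the interval mean +/- k standard deviations."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high, share = share_within(reaction_time, k)
    print(f"within {k} SD: {low:g} to {high:g} minutes, {share:.1%}")
```

The exact shares are about 68.3%, 95.4%, and 99.7%, which the empirical rule rounds to 68%, 95%, and 99.7%.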