frequency distribution

hss2381A – stats... or whatever
Univariate Analysis, part 1
Descriptive Statistics
• Evidence-based practice
(EBP): Use of best
clinical evidence in
making patient care
decisions
• Best source of evidence:
Systematic research
Evidence-Based Medicine (EBM) or Evidence-Based Practice (EBP) Questions:
• How reliable is the evidence?
• What is the magnitude of effects?
• How precise is the estimate of effects?
• Answering these questions requires an
understanding of statistics
Data and Data Analysis
• In the context of a study, the information
gathered to address research questions is data
• In quantitative research, data are usually
quantitative (numbers)
• Quantitative data are subjected to statistical
analysis
Examples of Independent and
Dependent Variables
• Independent variable (IV): Smoking
• Dependent variable (DV): Lung cancer
IV → DV?
Research Question
• Research questions are the queries
researchers seek to answer through the
collection and analysis of data
• Research questions communicate the
research variables and the population (the
entire group of interest)
– Example: In hospitalized children (population),
does music (IV) reduce stress (DV)?
Defining a Variable
• Two phases:
– Conceptual
– Operational
Defining a Variable
• In studies, variables need to be defined
• Conceptual definition: The theoretical
meaning of the underlying concept
• Operational definition: The precise set of
operations and procedures used to measure
the variable
Example:
Concept = How long have you been on this planet?
Operation = Which age group, in years, are you in?
Descriptive Statistics
• Researchers collect their data from a sample
of study participants—a subset of the
population of interest
• Descriptive statistics describe and summarize
data about the sample
– Examples: Percent female in the sample, average
weight of participants
Inferential Statistics
• Researchers obtain data from a sample but often
want to draw conclusions about a population
• Parameter: A descriptive index for a population
– Example: Average daily caloric intake of all 10-year-old
children in New York
• Statistic: A descriptive index for a sample
– Example: Average daily caloric intake of 300 10-year-old
children from three particular NY schools
SPSS and Statistical Analysis
• SPSS (Statistical Package for the Social
Sciences) is among the most popular statistical
software packages for analyzing research data
• It is user friendly and menu driven
• The datasets offered with this textbook are set
up as SPSS files
The Data Editor in SPSS
• The data editor in SPSS offers a convenient
spreadsheet-like method of creating, editing,
and viewing data
• There are two “views” within the data editor:
– Data View: Shows the actual data values
– Variable View: Shows variable information for all
variables
Data View in the Data Editor
• The columns represent one
variable each; unique
variable names (no more
than eight characters long)
are shown at the top of
each column
• Each row is a case,
representing an individual
participant
• The Data View tab is at the
bottom
Variable View in the Data Editor
• Variable View shows a
wealth of information
about how variables
are coded, how they
will be labeled in
output, level of
measurement, and so
on
• The Variable View tab
is at the bottom
Versions of SPSS
• New versions of SPSS are created regularly, to
offer improved options for analysis and
presentation
• Examples in this book were created in SPSS
Version 16.0
• The student version of SPSS is available for
analyzing relatively small datasets (no more
than 50 variables and no more than 1,500
cases)
What is this?
FREQUENCY DISTRIBUTION
Same as Histogram?
Frequency Distributions
• A frequency distribution is a systematic
arrangement of data values, with a count of
how many times each value occurred in a
dataset
• You can portray this as a table or as a graph
Constructing a Frequency
Distribution
• List each data value in a sequence (usually,
ascending order) 1, 2, 3, 4, 5…
• Tally each occurrence of the value
• Total the frequencies for each value (f)
• The sum of fs for all data values must equal
the sample size:
Σf = N
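The steps above can be sketched in Python. This is only a minimal illustration, not part of the course materials: the function name `frequency_distribution` and the `scores` list are made up for the example.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, f) pairs, listed in ascending order of value."""
    counts = Counter(values)        # tally each occurrence of each value
    return sorted(counts.items())   # list the data values in ascending order

# Hypothetical scores, invented for illustration only
scores = [3, 1, 2, 3, 5, 3, 2, 4, 1, 3]

table = frequency_distribution(scores)
for value, f in table:
    print(value, f)

# Check: the frequencies must sum to the sample size (sum of f = N)
total_f = sum(f for _, f in table)
print("sum of f =", total_f, "N =", len(scores))
```

If the final check fails, a value was missed or counted twice during the tally.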
Elements of a Typical Frequency
Distribution
• Data values
• Absolute frequencies (counts)
• Relative frequencies (percentages)
• Cumulative relative frequencies (the
percentage for a given score value, combined
with percentages for all preceding values)
Example...
• Let’s say we have 10 people of varying ages:
– Ages: 17, 26, 33, 35, 14, 55, 67, 35, 21, 19
• Let’s construct the frequency distribution of
the age GROUPS: 0-25 yrs, 26-45 yrs, >45 yrs
Age group   Frequency   Relative Freq.   Cumulative Rel. Freq.
0-25        4           4/10 = 40%       40%
26-45       4           4/10 = 40%       40% + 40% = 80%
>45         2           2/10 = 20%       80% + 20% = 100%
Summary of Our Example
Data Value   Frequency (f)   Percentage (%)   Cumulative Percentage
0-25         4               40.0             40.0
26-45        4               40.0             80.0
>45          2               20.0             100.0
TOTAL        10              100.0
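The age-group example can be reproduced with a short Python sketch. The ten ages and the three groups come from the example; the helper name `age_group` and the variable names are assumptions made for illustration.

```python
ages = [17, 26, 33, 35, 14, 55, 67, 35, 21, 19]  # the ten ages from the example

def age_group(age):
    """Assign an age to one of the three groups used in the example."""
    if age <= 25:
        return "0-25"
    if age <= 45:
        return "26-45"
    return ">45"

groups = ["0-25", "26-45", ">45"]
counts = {g: 0 for g in groups}
for age in ages:
    counts[age_group(age)] += 1      # absolute frequency (f)

n = len(ages)
rows = []
cumulative = 0.0
for g in groups:
    f = counts[g]
    pct = 100 * f / n                # relative frequency, in percent
    cumulative += pct                # cumulative relative frequency
    rows.append((g, f, pct, cumulative))

for g, f, pct, cum in rows:
    print(f"{g:>6} {f:>3} {pct:6.1f} {cum:6.1f}")
```

The last cumulative percentage should always come out to 100, which is a quick check that no case was dropped.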
Frequency Distributions and
Measurement Levels
• Remember “measurement levels”?
– Nominal, ordinal, interval, ratio...
• Frequency distributions can be constructed for
variables measured at any level of
measurement
• BUT…for categorical (nominal-level) variables,
cumulative frequencies do not make sense
• Also...
Frequency Distributions for
Variables with Many Values
• When a variable has
many possible
values, a regular
frequency
distribution may be
unwieldy
– For example, weight
values (here, in
pounds)
Weight   f
98       1
99       1
100      1
101      0
102      2
103      1
104      0
105      2
106      1
…        …
(Etc. to 285 lb)
Which is Why We Used “Age Group”
instead of “Age”
• This is sometimes called a “grouped frequency
distribution”
• In a grouped frequency distribution,
contiguous values are grouped into sets (class
intervals)
• Typically, we use groupings that are
psychologically appealing (e.g., 10-25 years,
etc., not 7-13 years, etc.)
Weight Interval   f
75 - 100          6
101 - 125         15
126 - 150         33
151 - 175         26
176 - 200         24
201 - 225         14
226 - 250         9
251 - 275         6
276 - 300         2
This grouping communicates information more
conveniently than individual weights
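Grouping values into class intervals can be sketched as follows. This is an illustration under stated assumptions: the function name `grouped_frequencies` is made up, only the first three class intervals from the table are used, and the `weights` list is hypothetical (its first nine entries follow the ungrouped weight table through 106 lb; the last two are invented).

```python
def grouped_frequencies(values, intervals):
    """Count how many values fall in each (low, high) class interval, inclusive."""
    counts = {interval: 0 for interval in intervals}
    for v in values:
        for low, high in intervals:
            if low <= v <= high:
                counts[(low, high)] += 1
                break                 # each value belongs to one interval only
    return counts

# First three class intervals from the grouped table
intervals = [(75, 100), (101, 125), (126, 150)]

# Hypothetical weights in pounds (see the note above)
weights = [98, 99, 100, 102, 102, 103, 105, 105, 106, 130, 148]

result = grouped_frequencies(weights, intervals)
for (low, high), f in result.items():
    print(f"{low} - {high}: {f}")
```

The intervals must not overlap and must cover every value, otherwise the frequencies will not sum to N.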
Reporting Frequency Information
• Can be reported narratively in text (e.g., “83%
of study participants were male”)
• In a frequency distribution table (multiple
variables often presented in a single table)
• In a graph: Different graphs used for different
types of data
Bar Graphs
• Bar graphs: Used for nominal (and many
ordinal) level variables
• Bar graphs have a horizontal dimension (X
axis) that specifies categories (i.e., data
values)
• The vertical dimension (Y axis) specifies either
frequencies or percentages
• Bars for each category are drawn to the height
that indicates the frequency or %
Bar Graphs
• Example of
a bar graph
• Note the
bars do not
touch each
other
Pie Chart
• Pie Charts: Also used for nominal (and many
ordinal) level variables
• Circle is divided into pie-shaped wedges
corresponding to percentages for a given
category or data value
• All pieces add up to 100%
• Place wedges in order, with biggest wedge
starting at “12 o’clock”
Pie Chart
• Example of
a pie chart,
for same
marital
status data
Histograms
• Histograms: Used for interval- and ratio-level
data
• Similar to a bar graph, with an X and Y axis—
but adjacent values are on a continuum so
bars touch one another
• Data values on X axis are arranged from
lowest to highest
• Bars are drawn to height to show frequency or
percentage (Y axis)
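A rough text-only sketch of a histogram, to show the "continuum" idea: every value between the lowest and the highest gets a position, even if its frequency is 0. The heart rates below are hypothetical, the function name `text_histogram` is made up, and the bars are drawn sideways (one row per value) rather than upright as in a real histogram.

```python
from collections import Counter

def text_histogram(values):
    """Return one line per integer value from min to max, with a bar of '#' marks."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        # counts[v] is 0 for values that never occur, so the row stays empty
        lines.append(f"{v:>3} | {'#' * counts[v]}")
    return lines

# Hypothetical heart rates in bpm, invented for illustration only
heart_rates = [55, 57, 57, 58, 58, 58, 60, 60, 61]

for line in text_histogram(heart_rates):
    print(line)
```

A bar graph for nominal data would skip this step: categories have no order or spacing, so there are no "missing" positions to fill in.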
Histograms
Example of a histogram: Heart rate data
[Figure: histogram; X axis: heart rate in bpm (55 to 74); Y axis: f (0 to 12)]
Frequency Polygons
• Frequency polygons: Also used for interval-
and ratio-level data
• Similar to histograms, but instead of bars, a
dot is used above score values to designate
frequency/percentage
• Better than histograms for showing the shape of
the distribution of scores, and usually preferred
if the variable is continuous
• Example of a frequency polygon (created in SPSS)
• Note that the line is brought down to zero for
the score below the lowest data point (54) and
above the highest data point (75)
[Figure: Frequency Polygon, Heart Rate; X axis: Heart Rate in bpm (54 to 74); Y axis: Frequency (0 to 12)]
Shapes of Distributions
• Distributions of data values can be described
in terms of:
– Modality
– Symmetry
– Kurtosis
Modality
• Modality concerns how many peaks (values
with high frequencies) there are
• Unimodal = 1 peak
• Bimodal = 2 peaks
• Multimodal = multiple peaks
Unimodal:
Bimodal:
How is this useful?
Example: Tuberculosis
• What is it?
• We apply the tuberculin skin test (also called
the PPD, or purified protein derivative, test)
• Positive response is an “induration”
– a hard, raised area with clearly defined margins at
and around the injection site
What type of curve is this?
Distribution of systolic blood pressure for men
(unimodal distribution)
Symmetry
• Symmetric Distribution: the two halves of the
distribution, folded over in the middle, are
identical
Symmetry
• Asymmetric (Skewed) Distribution: Peaks are
“off center” and there is a tail trailing off for
data values with low frequency
– Positive skew: Longer tail trails off to right (fewer
people with high values, like for income)
– Negative skew: Longer tail trails off to left (fewer
people with low values, like age at death)
Direction of Skew
• Examples of distributions with different
skews:
Skewness Index
• Indexes have been developed to quantify the
degree of skewness
• One skewness index (e.g., in SPSS) has:
– Negative values, for a negative skew
– 0, for no skew
– Positive values, for a positive skew
• If the absolute value of the skewness index is
less than twice the value of its standard error
(to be explained later), the distribution can be
treated as not skewed
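For the curious, here is a hedged sketch of how such an index can be computed. It uses the common sample-adjusted skewness formula and the usual standard error formula that depends only on N; to my understanding this is what SPSS reports, but treat that as an assumption rather than a statement about SPSS. The function names and the data are made up for illustration.

```python
import math
import statistics

def skewness_index(values):
    """Sample-adjusted skewness: n / ((n-1)(n-2)) * sum of cubed z-scores."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)      # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

def skewness_standard_error(n):
    """Standard error of the skewness index; depends only on the sample size."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

# Hypothetical data, invented for illustration only
positively_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9]   # long tail to the right
symmetric = [1, 2, 3, 4, 5]

print(skewness_index(positively_skewed))   # positive value
print(skewness_index(symmetric))           # essentially 0
print(skewness_standard_error(50))         # about 0.34 for N = 50
```

Mirroring the skewed data (a long tail to the left) flips the sign of the index.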
Skewness Index Examples
• Positive skew (variable POSSKEW):
– Skewness index = 0.80
– Standard error = 0.33
– Mean = 4.3, Std. Dev = 2.74, N = 50
• Negative skew (variable NEGSKEW):
– Skewness index = -0.72
– Standard error = 0.34
– Mean = 8.6, Std. Dev = 2.96, N = 50
[Figures: histograms of POSSKEW and NEGSKEW; X axis: values up to 12.0; Y axis: frequency up to 20]
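The "twice the standard error" rule of thumb can be applied to these two examples with a small sketch. The function name `can_treat_as_not_skewed` is made up, and the rule is applied to the size (absolute value) of the index so that it also works for negative skew.

```python
def can_treat_as_not_skewed(skewness_index, standard_error):
    """Rule of thumb: not skewed if |index| is less than twice its standard error."""
    return abs(skewness_index) < 2 * standard_error

# (skewness index, standard error) for the two examples
examples = {
    "POSSKEW": (0.80, 0.33),
    "NEGSKEW": (-0.72, 0.34),
}

results = {}
for name, (index, se) in examples.items():
    results[name] = can_treat_as_not_skewed(index, se)
    print(name, "index =", index, "2 x SE =", round(2 * se, 2),
          "treat as not skewed:", results[name])
```

In both examples the size of the index (0.80 and 0.72) is larger than twice the standard error (0.66 and 0.68), so neither distribution would be treated as not skewed.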
Kurtosis
• Kurtosis: Degree of pointedness or flatness of
the distribution’s peak
• Leptokurtic: Very thin, sharp peak
• Platykurtic: Flat peak
• Mesokurtic: Neither pointy nor flat
• Like skewness, there is an index of kurtosis
– Positive values: Greater peakedness
– Negative values: Greater flatness
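A simple way to see the sign of a kurtosis index is the moment-based "excess kurtosis" sketched below. This is an assumption made for illustration: SPSS reports a sample-adjusted version, so its numbers will differ somewhat, although the sign is interpreted the same way. The function name and both data sets are made up.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (roughly 0 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data, invented for illustration only
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 1, 9]     # most cases piled on one value
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # every value equally frequent

print(excess_kurtosis(peaked))   # positive: leptokurtic
print(excess_kurtosis(flat))     # negative: platykurtic
```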
Kurtosis Examples
Leptokurtic (+ index)
Platykurtic (– index)
Normal Distribution
What is this curve called?
Normal Distribution
• A normal distribution (aka normal curve, bell
curve, Gaussian distribution, etc.) is:
– Unimodal
– Symmetric
– Neither peaked nor flat
• Plays an important role in inferential
statistics
We will re-visit the Normal Distribution in more depth in the future
Some human characteristics are normally
distributed (approximately), like height
1 short person, 3 medium persons, 1 tall person
Uses of Frequency Distributions in
Data Analysis
• First step in understanding your data!
– Begin by looking at the frequency distributions for
all or most variables, to “get a feel” for the data
– Through inspection of frequency distributions,
you can begin to assess how “clean” the data are
• (will discuss next time)
Central Tendency
• “Central Tendency” is a characteristic of a
distribution
– Describes how data are clustered around some value
– In other words, it's a way of summarizing your data by
identifying the one value that is most
representative of the set
– There are several indices of central tendency, but 3
are the most important:
• Mode
• Median
• Mean
Next class, we’ll get into
these in more depth
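As a quick preview, the three indices can be computed with Python's standard `statistics` module. The ages are the ten values from the age-group example earlier in these slides; the variable names are made up for illustration.

```python
import statistics

ages = [17, 26, 33, 35, 14, 55, 67, 35, 21, 19]  # ages from the earlier example

mode = statistics.mode(ages)       # most frequent value
median = statistics.median(ages)   # middle value of the sorted data
mean = statistics.mean(ages)       # arithmetic average

print("Mode:", mode)
print("Median:", median)
print("Mean:", mean)
```

Here the mode is 35 (the only age that occurs twice), the median is 29.5 (halfway between 26 and 33, the two middle values), and the mean is 32.2 (322 / 10).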
Homework!
• P.17: A1-A4
• P.36: A1-A5