Statistics as a Tool in Scientific Research

advertisement
Statistics for Everyone Workshop
Fall 2010
Part 1
Statistics as a Tool in Scientific Research:
Summarizing & Graphically Representing Data
Workshop presented by Linda Henkel and Laura McSweeney of Fairfield University
Funded by the Core Integration Initiative and the Center for Academic Excellence at
Fairfield University
Statistics as a Tool in Scientific Research
Types of Research Questions
• Descriptive (What does X look like?)
• Correlational (Is there an association between
X and Y? As X increases, what does Y do?)
•
Experimental (Do changes in X cause changes
in Y?)
Different statistical procedures allow us to
answer the different kinds of research
questions
Statistics as a Tool in Scientific Research
 Start with the science and use statistics as
a tool to answer the research question
Get your students to formulate a research
question first:
• How often does this happen?
• Did all plants/people/chemicals act the
same?
• What happens when I add more sunlight,
give more praise, pour in more water?
Statistics as a Tool in Scientific Research
Can collect data in class
Can use already collected data (yours or database)
Helping students to formulate research question:
Ask them to think about what would be interesting
to know. What do they want to find out? What do
they expect?
For example, what research questions might you
ask from the survey?
Descriptive, correlational, experimental
Types of Data: Measurement Scales
Categorical: male/female
blood type (A, B, AB, O)
Stage 1, Stage 2, Stage 3
melanoma
Numerical:
weight
# of white blood cells
mpg
Types of Data: Measurement Scales
Categorical:
• Nominal (name/label)
• Ordinal (rank order)
Numerical:
• Interval (equal intervals)
• Ratio (equal intervals and absolute
zero)
Types of Data: Measurement Scales
Nominal: numbers are arbitrary; 1= male, 2 = female
Ordinal: numbers have order (i.e., more or less) but
you do not know how much more or less; 1st place
runner was faster but you do not know how much
faster than 2nd place runner
Interval: numbers have order and equal intervals so
you know how much more or less; A temperature
of 102 is 2 points higher than one of 100
Ratio: same as interval but because there is an
absolute zero you can talk meaningfully about
twice as much and half as much; Weighing 200
pounds is twice as heavy as 100 pounds
Types of Data on Questionnaire
1. What College are you from?
CAS
Business
Engineering
Nursing
Other
2. How many years have you been teaching at Fairfield University?
3.
How important do you think it is to integrate statistics into your
courses?
Not at all
Somewhat Important Important Very Important
4. How excited are you about integrating statistics into your courses?
1
Not at all
excited
2
3
4
5
6
7
Extremely
excited
Types of Data on Questionnaire
5.
Are you male or female?
6. How many hours a week do you watch television on average?
7. How many hours a day do you spend on the internet on average?
8. Of the following reality/game shows, which one would you most
like to be on?
(a) Dancing With the Stars
(b) American Idol
(c) Bachelor/Bachelorette
(d) The Apprentice
9. Can you roll your tongue? Yes
10. How many siblings do you have?
No
Types of Statistical Procedures
Descriptive: Organize and summarize
data
Inferential: Draw inferences about the
relations between variables; use
samples to generalize to population
Descriptive Statistics
The first step is ALWAYS getting to know
your data
 Summarize and visualize your data
It is a big mistake to just throw numbers into
the computer and look at the output of a
statistical test without any idea what those
numbers are trying to tell you or without
checking if the assumptions for the test
are met.
Descriptive Statistics
Numerical Summaries:
• Frequencies
• Contingency tables
• Measures of central tendency
• Measures of variability
• Representing numerical summaries in tables
Graphical Summaries:
• Bar graphs or Pie graphs
• Histograms
• Scatterplots
• Time series plot
Summarizing and Reporting Categorical Data
Frequency = number of times each
score occurs in a set of data
Relative Frequency = percent or
proportion of times each score occurs
in a set of data
Frequency Table
Marital Status
Frequency
(f)
Relative
Frequency
(rel f)
Married
34
.14
Widowed
129
.54
Divorced
35
.15
Separated
30
.12
Never Married
13
.05
Total
241
1.00
Contingency Table
A display to summarize two categorical
variables in a table.
Each entry in the table represents the
number of observations in a sample with a
certain outcome for the 2 variables.
Contingency Tables
Gender
Binge
Drinker
Non-binge
Drinker
Total
Male
1908
2017
3925
Female
2854
4125
6979
Total
4762
6142
10904
Contingency Tables
Gender
Binge
Drinker
Non-binge
Drinker
Total
Male
49%
51%
3925
Female
41%
59%
6979
Total
4762
6142
10904
Choosing the Appropriate Type of Graph
One categorical variable (e.g., Political party): Bar
Chart or Pie Graph
Two categorical variables (e.g., Political party vs.
Gender): Side-by-side Bar Chart
*Notice with 2 variables, one variable may be treated as the
dependent variable and one variable may be treated as the
independent variable.
Choosing the Appropriate Type of Graph
One numerical variable (e.g., Height): Histogram
One numerical variable and one categorical variable
(e.g., Height vs. Gender): Side-by-side Histograms
Two paired numerical variables (e.g., Weight vs.
Exercise per week): Scatterplot
One numerical variable over time (e.g., Number of Cells
vs. Minutes): Time Series Plot
*Notice with 2 variables, one variable may be treated as the
dependent variable and one variable may be treated as the
independent variable.
Bar Graph (Frequency)
Example of Bar Chart with Frequencies
140
120
No. of Patients
100
80
60
40
20
0
Widowed
Divorced
Separated
Marital Status
Never Married
Bar Graph (Relative Frequency)
Example of Bar Chart with
Relative Frequencies
0.6
Proportion of Patients
0.5
0.4
0.3
0.2
0.1
0
Married
Widowed
Divorced
Marital Status
Separated
Never Married
Pie Chart
Marital Status
5%
14%
12%
Married
Widowed
Divorced
15%
Separated
Never Married
54%
Simple Frequency Tables and Bar Graphs
Side by Side Bar Charts
Conditioned
Proportions
Male
Female
Binge
Drinker
0.49
0.41
Nonbinge
Drinker
0.51
0.59
0.70
Proportions
0.60
0.50
0.40
M
0.30
F
0.20
0.10
0.00
Binge
Non-binge
Histogram of Simple Frequency Data
Side by Side Histograms
Size Frequency Distribution for Female Crabs
4.5
4
3.5
Counts
3
2.5
2
1.5
1
0.5
0
11.0015.00
15.0019.00
19.0023.00
23.0027.00
27.0031.00
31.0035.00
35.0039.00
39.0043.00
43.0047.00
47.0051.00
43.0047.00
47.0051.00
CW (mm)
Size Frequency Distribution for Male Crabs
14
12
Counts
10
8
6
4
2
0
11.0015.00
15.0019.00
19.0023.00
23.0027.00
27.0031.00
31.0035.00
CW (mm)
35.0039.00
39.0043.00
Scatterplot
Percent
Overweight
Threshold
of Pain
89
2
16
90
3
14
Example of Scatterplot
4
Threshold of Pain
75
12
10
30
4.5
51
5.5
75
7
4
62
9
2
45
13
90
15
20
14
8
6
0
0
20
40
60
Percent Overweight
80
100
Time Series Plot
Software File Updates
Number of Updates
Month Number of
Updates
1
323
2
268
3
290
4
405
5
383
6
368
...
...
12
75
500
400
300
200
100
0
0
2
4
6
8
Month Number
10
12
14
Shapes of Distributions
• Normal (approximately symmetric)
• Skewed
• Unimodal/Bimodal/Uniform/Other
• Outliers
The Normal Curve
“Bell-shaped”
Most scores in center, tapering off symmetrically
in both tails
Amount of peakedness (kurtosis) can vary
Variations to Normal Distribution
Skew = Asymmetrical distribution
• Positive/right skew = greater frequency of low scores than
high scores (longer tail on high end/right)
• Negative/left skew = greater frequency of high scores
than low scores (longer tail on low end/left)
Histogram Showing Positive (Right) Skew
Variations to Normal Distribution
Bimodal distribution: two peaks
Rectangular/Uniform: all scores occur with equal
frequency
Potential Outlier: An observation that is well above
or below the overall bulk of the data
Important to determine normality (look at the
histogram of the data) so you can choose
appropriate measures of central tendency and
variability
Variations to Normal Distribution
Bimodal Distribution (two peaks)
Number of People
25
20
15
10
Rectangular/Uniform Distribution
(equal # highs and lows)
5
0
16
Weight
14
12
10
8
Number of People
100-119 120-139 140-159 160-179 180-199 200-219 220-239
6
4
2
0
100-119 120-139 140-159 160-179 180-199 200-219 220-239
Potential Outlier in Distribution
Number of People
Weight
16
14
12
10
8
6
4
2
0
100119
120139
140159
160179
180199
200219
Weight
220239
240259
260279
280299
Examples of Bad Graphs:
What is Wrong With the Picture?
8.4
Pollen Count
8.3
8.2
8.1
8
7.9
7.8
7.7
On campus
In town
Location
Pollen Count
Examples of Bad Graphs:
What is Wrong With the Picture?
50
45
40
35
30
25
20
15
10
5
0
On campus
In town
Location
Pollen Count
A BETTER Graph
10
9
8
7
6
5
4
3
2
1
0
On campus
In town
Location
Average Pollen Count
5
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
2006
2007
6
Year
Pollen Count
5
4
3
2
1
0
Pollen Count
2000
2001
2002
2003
2004
Year
20
18
16
14
12
10
8
6
4
2
0
2000
2001
2002
2003
2004
Year
2005
2006
2007
2008
2005
2006
2007
2008
Example of a Bad Graph
First Frequency
Digit
Graph the distribution of
the first digits.
140
120
100
80
FIRST
60
NUMBER
40
20
0
1
2
3
4
5
6
7
8
9
1
2
3
109
75
77
4
99
5
6
7
8
72
117
89
62
9
43
Example of a Bad Graph
First Frequency
Digit
Graph the distribution of
the first digits.
140
120
100
80
NUMBER
60
40
20
0
1
2
3
4
5
6
7
8
9
1
2
3
109
75
77
4
99
5
6
7
8
72
117
89
62
9
43
Example of a GOOD Graph
First Frequency
Digit
Graph the distribution of
the first digits.
Number of Occurances
Distribution of the First Digits
140
120
100
80
60
40
20
0
1
2
3
4
5
First Digit
6
7
8
9
1
2
3
109
75
77
4
99
5
6
7
8
72
117
89
62
9
43
Example of a Bad Graph
Graph the distribution of the
number of bacteria in the cultures
sampled.
Bacteria
60
50
40
BAC
30
20
10
0
1
2
3
4
5
6
7
8
9
10
Number of
Bacteria
41
33
43
52
46
37
44
49
53
30
Example of a Bad Graph
Graph the distribution of the
number of bacteria in the cultures
sampled.
Number of
Bacteria
41
33
43
52
Distribution of Bacteria
Number of Samples
2.5
2
1.5
1
0.5
0
30.0033.00
33.0036.00
36.0039.00
39.0042.00
42.0045.00
45.0048.00
Number of Bacteria
48.0051.00
51.0054.00
54.0057.00
57.0060.00
46
37
44
49
53
30
Example of a GOOD Graph
Graph the distribution of the
number of bacteria in the cultures
sampled.
Number of
Bacteria
41
33
43
52
Distribution of Bacteria
Number of Samples
6
5
4
3
2
1
0
30.0040.00
40.0050.00
50.0060.00
60.0070.00
70.0080.00
80.0090.00
Number of Bacteria
90.00100.00
100.00110.00
110.00120.00
120.00130.00
46
37
44
49
53
30
Guidelines for Good Graphs
• Label both axes and provide a heading to make clear what the graph
is representing.
• Vertical axes should usually start at 0 to help our eyes compare
relative sizes.
• Remove any clutter that isn’t needed or is distracting
• The axes may need to be resized to remove extra white space
• Be careful in using unusual bars since it can be easy to get the
relative percentages that the figures represent incorrect.
• Sometimes displaying information for more than one group on the
same graph can be difficult especially when the values differ greatly.
Consider using relative frequencies or separate graphs instead.
Other Guidelines to Making Graphs
• Y axis should be ¾ as tall as X axis
• When the number of score values on X axis is large, scores
should be collapsed so there are at least 5 intervals but no
more than 12
• The width of each interval on the X axis should be equal
• Frequency on the Y axis must be continuous and regular
• Range on the Y axis and X axis must neither unduly
compress nor unduly stretch the data
Time to Practice
• Looking at shape of distribution
• Making graphs
Teaching tips:
• Hands-on practice is important for your
students
• Sometimes working with a partner helps
Download