Chapter 3 Slides (PPT) - Updated 1/4/2016

Chapter 3
Summarizing Data
Graphical Methods - 1 Variable
• After data are collected, they are sorted into categories/ranges
of values so that each individual observation falls in
exactly one category/range
– Numeric Responses: Break “range” of values into nonoverlapping bins and count number of units in each bin
– Categorical Responses: List all possible categories (with
“Other” if needed), and count numbers of units in each
• Pie Chart: Displays percent in each category/range
• Bar Chart: Displays frequency/percent per category
• Histogram: Displays frequency/percent per “range”
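The binning step for numeric responses can be sketched as below. This is a minimal illustration, not the method used to build the slides: the helper name `bin_label`, the bin width of 100, and the open-ended top bin are assumptions chosen to match the rainfall bins, and the eight sample values are the smallest and largest rainfall amounts listed in the chapter's quantile table.

```python
from collections import Counter

def bin_label(y, width=100, top=1000):
    """Return the label of the single bin containing a non-negative integer y."""
    if y >= top:
        return f"{top}+"                 # open-ended top bin
    lo = (y // width) * width            # lower edge of the bin
    return f"{lo}-{lo + width - 1}"      # e.g. 339 -> "300-399"

# Four smallest and four largest monthly rainfall amounts (1/100 inch)
sample = [19, 25, 26, 55, 1005, 1102, 1180, 1582]

# Each observation lands in exactly one bin, so the counts sum to len(sample)
counts = Counter(bin_label(y) for y in sample)
for label, count in counts.items():
    print(label, count)
```

Because the bins are defined by integer division, no value can fall in two bins or on a boundary between them.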
Constructing Pie Charts
• Select a small number of categories (say 5 or 6 at
most) to avoid many narrow “slivers”
• If possible, arrange categories in ascending or
descending order for categorical variables
Monthly Philly Rainfall 1825-1869 (1/100 inches)
[Figure: pie chart of the 11 rainfall categories]
Category   Range      Count
1          <100       17
2          100-199    78
3          200-299    132
4          300-399    115
5          400-499    86
6          500-599    55
7          600-699    27
8          700-799    17
9          800-899    6
10         900-999    3
11         ≥1000      4
Constructing Bar Charts
• Put frequencies on one axis (typically vertical, unless
many categories) and categories on other
• Draw rectangles over categories with height=frequency
• Leave spaces between categories
Constructing Histograms
• Used for numeric variables, so need Class Intervals
– Let Range = Largest - Smallest Measurement
– Break range into (say) 5-20 intervals depending on sample size
– Make the width of the subintervals a convenient unit, and make
“break points” so that no observations fall on them
– Obtain Class Frequencies, the number in each subinterval
– Obtain Relative Frequencies, proportion in each subinterval
• Construct Histogram
– Draw bars over each subinterval with height representing class
frequency or relative frequency (shape will be the same)
– Leave no space between bars to imply adjacency of class
intervals
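As a rough sketch of the class-frequency and relative-frequency steps, the snippet below starts from the class counts in the Philadelphia rainfall table and prints a crude text histogram. The scaling of one `#` per four observations is an arbitrary choice for display.

```python
# Class intervals and class frequencies from the monthly rainfall table
labels = ["<100", "100-199", "200-299", "300-399", "400-499", "500-599",
          "600-699", "700-799", "800-899", "900-999", "1000+"]
counts = [17, 78, 132, 115, 86, 55, 27, 17, 6, 3, 4]

n = sum(counts)                        # total number of months
rel_freq = [c / n for c in counts]     # proportion in each class interval

for label, c, r in zip(labels, counts, rel_freq):
    # Bar length is proportional to the class frequency, so the shape is
    # the same whether frequency or relative frequency is plotted
    print(f"{label:>8} {c:4d} {r:6.3f} {'#' * (c // 4)}")
```

Relative frequencies are the class frequencies divided by a constant, which is why both versions of the histogram have the same shape.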
Histogram
[Figure: histogram of rain100 (rainfall in 1/100 inches); vertical axis Frequency, 0 to 140; horizontal axis bins labeled 100, 200, …, 1200, More]
Interpreting Histograms
• Probability: Heights of bars over the class intervals
are proportional to the “chances” an individual
chosen at random would fall in the interval
• Unimodal: A histogram with a single major peak
• Bimodal: Histogram with two distinct peaks (often
evidence of two distinct groups of units)
• Uniform: Interval heights are approximately equal
• Symmetric: Right and Left portions are same shape
• Right-Skewed: Right-hand side extends further
• Left-Skewed: Left-hand side extends further
Stem-and-Leaf Plots
• Simple, crude approach to obtaining shape of
distribution without losing individual measurements to
class intervals. Procedure:
– Split each measurement into 2 sets of digits (stem and leaf)
– List stems from smallest to largest
– List the corresponding leaves beside each stem, ordered from
smallest to largest
– If too cramped/narrow, break stems into two groups: low
with leaves 0-4 and high with leaves 5-9
– When numbers have many digits, trim off right-most (less
significant) digits. Leaves should always be a single digit.
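The procedure above can be sketched in a few lines. This is an illustrative implementation under simple assumptions (non-negative integer data, leaf = units digit, no splitting of stems into low/high groups); the function name `stem_and_leaf` is hypothetical. The data are the 13 excess-death percentages from the France heat wave example later in the chapter.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display (leaf = units digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)   # split into stem and single-digit leaf
        groups[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, including empty ones,
    # so gaps in the distribution stay visible
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(d) for d in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]
print("\n".join(stem_and_leaf(pct_change)))
```

Keeping the empty stems makes the isolated value 142 stand out, while every individual measurement remains readable from the display.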
Time Series Plots
• Many datasets represent a single variable measured on
a single unit at different time points
• When measurements are made at equally spaced time
points, goal is often to describe temporal variation
• Annual measurements can reveal long-term trends
• Sub-annual (weekly, monthly, quarterly) measurements
can reveal long-term trends as well as seasonal
fluctuations
• Plots generally have measurement on vertical axis and
time period on horizontal.
• Some plots include bars around points to represent
fluctuations within that time period
Philly Rainfall 1/1825-12/1869
[Figure: time series plot; vertical axis Rainfall (1/100th inches), 0 to 2000; horizontal axis Month]
Numerical Descriptive Measures
• Numeric summaries of a set of measurements
• Measures of Central Tendency describe the
“location” or center of a set of measurements
• Measures of Variability describe the “spread” or
dispersion of a set of measurements
• Parameters: Numeric descriptive measures based on
Populations of measurements
• Statistics: Numeric descriptive measures based on
Samples of measurements
Measures of Central Tendency - I
• Mode: Most often occurring outcome (typically only of
interest for variables taking on only “discrete” values)
• Median: Middle value when measurements ordered
from smallest to largest
• Mean: Sum of all measurements, divided by total
number of measurements (equal distribution of total)
Population (N elements): $\mu = \frac{\sum_{i=1}^{N} y_i}{N}$
Sample (n elements): $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
In practice, we only observe the sample, and use $\bar{y}$ to estimate $\mu$
Example - Philadelphia Rainfall
$N = 540$ months (treating as population)
$\sum_{i=1}^{540} y_i = 198547 \quad\Rightarrow\quad \mu = \frac{198547}{540} = 367.68$
Ordered amounts: $y_{(270)} = 339$, $y_{(271)} = 341 \quad\Rightarrow\quad M = 340$
Note: The mean is higher than median as a few very large
amounts were observed.
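The arithmetic in this example can be checked directly. The sketch below uses only the totals quoted on the slide (540 months, a sum of 198547, and the two middle ordered values); the variable names are illustrative.

```python
N = 540                 # number of months, treated as the population
total = 198547          # sum of all monthly amounts (1/100 inch)
mu = total / N          # population mean

# With N even, the median averages the two middle ordered values
y_270, y_271 = 339, 341
median = (y_270 + y_271) / 2

print(round(mu, 2), median)
```

The mean exceeds the median by roughly 28 hundredths of an inch, which is the gap the note attributes to a few very large amounts.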
Measures of Central Tendency - II
• Outlier: Individual measurement(s) falling far away from
others. Can have large effect on mean, not median
• Trimmed Mean (TM): Mean that is based on center
measurements (deleting extreme measurements).
• Mode: For continuous (smooth) distributions, mode is
value corresponding to the peak of the frequency curve
• Skewness: Shape of the distribution:
– Mound-Shaped Distributions: Mode ≈ Median ≈ Mean ≈ TM
– Right-Skewed Distributions: Mode < Median < TM < Mean
– Left-Skewed Distributions: Mean < TM < Median < Mode
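A small sketch of the trimmed mean follows. The slide does not say how many measurements to delete, so trimming `k` values from each end is an assumption, and `trimmed_mean` is a hypothetical helper. The data are the 13 excess-death percentages from the France heat wave example later in the chapter, where one city is far above the rest.

```python
import statistics

def trimmed_mean(values, k):
    """Mean after deleting the k smallest and k largest measurements."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away every measurement")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

mean = statistics.mean(pct_change)
median = statistics.median(pct_change)
tm = trimmed_mean(pct_change, k=1)     # drop the single lowest and highest

print(median, round(tm, 2), round(mean, 2))
```

For these data the ordering comes out Median < TM < Mean, the pattern listed above for right-skewed distributions.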
Measures of Variability - I
• Variability: Magnitude of dispersion in data.
• Range: Difference between largest and smallest
measurements in a set.
• pth-Percentile: Value that has at most p% of measurements
below it and at most (100-p)% above it (0<p<100)
– Lower Quartile = 25th Percentile (Q1)
– Median = 50th Percentile (Q2)
– Upper Quartile = 75th Percentile (Q3)
• Interquartile Range: Difference between the upper and
lower quartiles (measures the amount of spread in the
middle 50% of ordered measurements). IQR = Q3-Q1
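Quartiles and the IQR can be computed with Python's standard library, as sketched below. Software packages use several slightly different percentile conventions, so the choice of `method="inclusive"` here is an assumption and other tools may return slightly different quartiles for the same data. The data are the excess-death percentages from the France heat wave example.

```python
import statistics

pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(pct_change, n=4, method="inclusive")
iqr = q3 - q1                      # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```

The range of these data (142 - 4 = 138) is driven by the two extreme cities, while the IQR depends only on the middle half.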
Quantile Plot
• Quantile: Q(u) ≡ Number that divides a dataset such that
the fraction of observations below Q(u) = u and the
fraction above Q(u) = 1-u
• Quantile plot – Plot of Q(u) on vertical axis versus u on
horizontal axis
– Place scale on horizontal axis ranging over 0 to 1
– Order data: y(1) ≤ y(2) ≤ … ≤ y(n) and scale vertical axis to
include full range of y-values
– Plot y(i) versus u_i = (i – 0.5)/n for i = 1,2,…,n
Quantile Plot
i      u_i        y_(i)
1      0.000926   19
2      0.002778   25
3      0.00463    26
4      0.006481   55
…      …          …
537    0.993519   1005
538    0.99537    1102
539    0.997222   1180
540    0.999074   1582
[Figure: quantile plot of Q(u) (vertical axis, 0 to 1800) versus u (horizontal axis, 0 to 1)]
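The plotting positions u_i = (i - 0.5)/n are easy to generate. The sketch below (with a hypothetical helper `quantile_points`) builds the (u_i, y_(i)) pairs for the seven math scores from the LSD example later in the chapter, and separately checks the first and last plotting positions for n = 540 against the rainfall table.

```python
def quantile_points(values):
    """Return (u_i, y_(i)) pairs with u_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, y) for i, y in enumerate(ordered, start=1)]

scores = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
points = quantile_points(scores)
for u, y in points:
    print(f"{u:.4f} {y}")

# Plotting positions for the rainfall data (n = 540)
n = 540
u_first = (1 - 0.5) / n
u_last = (n - 0.5) / n
print(round(u_first, 6), round(u_last, 6))
```

The half-unit offset keeps every u_i strictly between 0 and 1, so neither the smallest nor the largest observation is plotted at the edge of the horizontal axis.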
Measures of Variability - II
• Deviation: Distance between an individual
measurement and the group mean: $y_i - \bar{y}$
• Variance: “Average” squared deviation
• Standard Deviation: Square root of the variance (in the data’s units)
Population (N elements): Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$   Std. Dev.: $\sigma = \sqrt{\sigma^2}$
Sample (n elements): Variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$   Std. Dev.: $s = \sqrt{s^2}$
Empirical rule (measurements with mound-shaped histogram)
Approximately 68% of measurements lie within 1 SD of mean
Approximately 95% of measurements lie within 2 SD of mean
Virtually all of measurements lie within 3 SD of mean
Example - Philadelphia Rainfall (Population)
25th Percentile: 232.75
75th Percentile: 468
Inter-Quartile Range: IQR = 468 - 232.75 = 235.25
$\sum_{i=1}^{540} (y_i - \mu)^2 = 19822752$
$\sigma^2 = \frac{19822752}{540} = 36708.8$
$\sigma = \sqrt{36708.8} = 191.6$
$\mu \pm \sigma = 367.7 \pm 191.6 = (176.1, 559.3)$
$\mu \pm 2\sigma = 367.7 \pm 383.2 = (0^*, 750.9)$
* The computed lower limit, 367.7 - 383.2 = -15.5, is negative and is shown as 0.
Note: 383 (71%) months lie within $1\sigma$ of $\mu$ and 518 (96%) within $2\sigma$
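The numbers in this example can be reproduced from the quoted sum of squared deviations. This is a sketch using the slide's rounded mean of 367.7; treating the starred lower limit as a truncation at 0 (rainfall cannot be negative) is an assumption about what the slide intends.

```python
import math

N = 540
mu = 367.7                    # population mean, as rounded on the slide
ss = 19822752                 # sum of squared deviations from the mean

var_pop = ss / N              # population variance (divide by N)
sd_pop = math.sqrt(var_pop)   # population standard deviation

one_sd = (mu - sd_pop, mu + sd_pop)
# Assumption: a negative lower limit is truncated at 0
two_sd = (max(0.0, mu - 2 * sd_pop), mu + 2 * sd_pop)

# Observed coverage quoted on the slide, compared with the empirical rule
within_1 = 100 * 383 / N
within_2 = 100 * 518 / N

print(round(var_pop, 1), round(sd_pop, 1))
print([round(v, 1) for v in one_sd], [round(v, 1) for v in two_sd])
print(round(within_1), round(within_2))
```

The observed 71% and 96% are close to the 68% and 95% suggested by the empirical rule, even though the rainfall distribution is somewhat right-skewed.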
Other Measures of Variation
• Median Absolute Deviation (MAD) – Median of
the absolute values of differences between
observed data values and the sample median,
divided by 0.6745 (due to properties of the normal
distribution, this provides an estimate of σ)
• Coefficient of Variation (CV) – Standard
deviation as a fraction of the mean (assuming μ ≠ 0).
Often reported as a percentage: $CV = 100\left(\frac{s}{\bar{y}}\right)\%$
MAD        CV(%)
177.9096   52.15765
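Both measures are short to compute. In the sketch below, `mad_sigma` and `cv_percent` are hypothetical helper names. The MAD is illustrated on the excess-death percentages from the France heat wave example, because the raw rainfall values are not listed in the slides. The CV is checked against the rainfall sums quoted earlier, under the assumption that the reported value uses the sample standard deviation (divisor n - 1).

```python
import math
import statistics

def mad_sigma(values):
    """Median absolute deviation from the median, divided by 0.6745."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    return statistics.median(abs_dev) / 0.6745

def cv_percent(sd, mean):
    """Standard deviation as a percentage of the mean (mean must be nonzero)."""
    return 100 * sd / mean

pct_change = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]
mad = mad_sigma(pct_change)

# Rainfall CV from the quoted totals: sum of squared deviations and sum of y
s_rain = math.sqrt(19822752 / (540 - 1))    # sample standard deviation
cv_rain = cv_percent(s_rain, 198547 / 540)

print(round(mad, 2), round(cv_rain, 2))
```

Because the MAD is built from medians, a single extreme value such as 142 changes it far less than it changes the standard deviation.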
Boxplots
• Graph highlighting spread of set of measurements,
highlighting quartiles and outliers.
• Constructing a boxplot:
– Draw box with top at Q3, bottom at Q1, and line crossing at
median (Q2). Height of box is IQR = Q3 - Q1
– Compute “lower inner fence” = Q1-1.5(IQR) = LIF
– Compute “upper inner fence” = Q3+1.5(IQR) = UIF
– Compute “lower outer fence” = Q1-3.0(IQR) = LOF
– Compute “upper outer fence” = Q3+3.0(IQR) = UOF
– Draw line from Q3 to min(UIF, largest y value). Place ‘*’ for
any y values between UIF and UOF, ‘o’ for any above UOF
– Draw line from Q1 to max(LIF, smallest y value). Place ‘*’ for
any y values between LIF and LOF, ‘o’ for any below LOF
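The fence calculations can be sketched as follows, using Q1 = 232.75 and Q3 = 468 from the Philadelphia rainfall example. The helper names `fences` and `classify` are hypothetical, and the four test values are the largest rainfall amounts listed in the quantile table plus the median.

```python
def fences(q1, q3):
    """Inner and outer fences computed from the quartiles."""
    iqr = q3 - q1
    return {
        "LOF": q1 - 3.0 * iqr,   # lower outer fence
        "LIF": q1 - 1.5 * iqr,   # lower inner fence
        "UIF": q3 + 1.5 * iqr,   # upper inner fence
        "UOF": q3 + 3.0 * iqr,   # upper outer fence
    }

def classify(y, f):
    """Return 'o' beyond an outer fence, '*' between inner and outer, else ''."""
    if y < f["LOF"] or y > f["UOF"]:
        return "o"
    if y < f["LIF"] or y > f["UIF"]:
        return "*"
    return ""

f = fences(232.75, 468)
for y in (340, 1005, 1102, 1180, 1582):
    print(y, repr(classify(y, f)))
```

Both lower fences are negative for these data, so no rainfall amount can be flagged on the low side.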
BoxPlot
[Figure: boxplot of the monthly rainfall amounts; axis from 0 to 2000]
UIF = 468+1.5(235.25) = 820.875
UOF = 468+3(235.25) = 1173.75
Summarizing Data of More than One Variable
• Contingency Table: Cross-tabulation of units based on
measurements of two qualitative variables simultaneously
• Stacked Bar Graph: Bar chart with one variable
represented on the horizontal axis, second variable as
subcategories within bars
• Cluster Bar Graph: Bar chart with one variable forming
“major groupings” on horizontal axis, second variable
used to make side-by-side comparisons within major
groupings (displays all combinations in a factorial experiment)
• Scatterplot: Plot with quantitative variables y and x
plotted against each other for each unit
• Side-by-Side Boxplot: Compares distributions by groups
Example - Ginkgo and Acetazolamide for Acute
Mountain Sickness (AMS) Among Himalayan Trekkers
Contingency Table (Counts)
Outcome   Placebo   Acet   Ginkgo   Acc+Gi   Total
AMS       40        14     43       18       115
No AMS    79        104    81       108      372
Total     119       118    124      126      487
Percent Outcome by Treatment
Outcome   Placebo   Acet    Ginkgo   Acc+Gi
AMS       33.61     11.86   34.68    14.29
No AMS    66.39     88.14   65.32    85.71
Total     100       100     100      100
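The percentages are computed within each treatment, so each treatment sums to 100. A minimal sketch from the counts in the contingency table (the dictionary layout and the helper `percent_within` are illustrative choices):

```python
# Counts of trekkers by treatment and outcome
counts = {
    "Placebo": {"AMS": 40, "No AMS": 79},
    "Acet":    {"AMS": 14, "No AMS": 104},
    "Ginkgo":  {"AMS": 43, "No AMS": 81},
    "Acc+Gi":  {"AMS": 18, "No AMS": 108},
}

def percent_within(group):
    """Convert one treatment's counts to percentages of that treatment's total."""
    total = sum(group.values())
    return {outcome: round(100 * c / total, 2) for outcome, c in group.items()}

percents = {trt: percent_within(g) for trt, g in counts.items()}
grand_total = sum(sum(g.values()) for g in counts.values())

for trt, p in percents.items():
    print(trt, p)
print("n =", grand_total)
```

Conditioning on treatment is what makes the four groups comparable, since the group sizes differ slightly (118 to 126).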
Stacked Bar Graph of AMS Incidence (Percent)
[Figure: stacked bar graph; vertical axis 0% to 100%; horizontal axis Treatment (Placebo, Acet, Ginkgo, Acc+Gi); bars split into AMS and No AMS]
Cluster Bar Graph of AMS Incidence (Counts)
[Figure: cluster bar graph; vertical axis Frequency, 0 to 120; horizontal axis Treatment (Placebo, Acet, Ginkgo, Acc+Gi); side-by-side bars for AMS and No AMS]
3-D Barchart of Incidence of AMS
[Figure: 3-D bar chart; vertical axis Percent within Treatment, 0 to 100; axes Treatment (Placebo, Acet, Ginkgo, Acc+Gi) and Outcome (AMS, No AMS)]
Scatterplots
• Identify the explanatory and response variables of
interest, and label them as x and y
• Obtain a set of individuals and observe the pair
(x_i, y_i) for each individual. There will be n pairs.
• Statistical convention has the response variable (y)
placed on the vertical (up/down) axis and the
explanatory variable (x) placed on the horizontal
(left/right) axis. (Note: economists reverse axes in
price/quantity demand plots)
• Plot the n pairs of points (x,y) on the graph
France August, 2003 Heat Wave Deaths
• Individuals: 13 cities in France
• Response: Excess Deaths (%) Aug 1-19, 2003 vs 1999-2002
• Explanatory Variable: Change in Mean Temp in period (°C)
• Data:
City         Dth03   Dth9902   %chng (y)   Degchg (x)
Lille        200     192.3     4           4.0
Marseilles   571     456.8     25          4.3
Grenoble     148     115.6     28          6.3
Rennes       156     114.7     36          5.6
Toulouse     315     231.6     36          6.6
Bordeaux     318     222.4     43          6.2
Strasbourg   253     167.5     51          5.9
Nice         341     222.9     53          4.3
Poitiers     184     102.8     79          7.3
Lyon         447     248.3     80          6.8
Le Mans      204     112.1     82          7.0
Dijon        168     87.0      93          7.4
Paris        1854    766.1     142         6.7
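The response column appears to be the percent change in deaths relative to the 1999-2002 level. That formula is inferred from the tabled values and is not stated on the slide, so treat the sketch below as an assumption that happens to reproduce the listed percentages. The helper name `excess_pct` is hypothetical.

```python
def excess_pct(deaths_2003, deaths_baseline):
    """Percent change in deaths relative to the 1999-2002 baseline."""
    return 100 * (deaths_2003 - deaths_baseline) / deaths_baseline

# (Dth03, Dth9902) for three of the 13 cities
cities = {
    "Marseilles": (571, 456.8),
    "Dijon": (168, 87.0),
    "Paris": (1854, 766.1),
}

for city, (d03, d9902) in cities.items():
    print(city, round(excess_pct(d03, d9902)))
```

Expressing the response as a percentage puts a very large city such as Paris on the same scale as the smaller cities, which raw death counts would not.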
France August, 2003 Heat Wave Deaths
[Figure: scatterplot titled “2003 France Heat Wave Mortality”; vertical axis Excess Mortality (%), 0 to 160; horizontal axis Change in Mean Temp (Celsius), 3 to 8; one point is labeled “Possible Outlier”]
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD concentration:
LSD Conc (x)   Score (y)
1.17           78.93
2.97           58.20
3.26           67.47
4.69           37.47
5.83           45.65
6.00           32.92
6.41           29.97
[Figure: scatterplot of SCORE (vertical axis, 20 to 80) versus LSD_CONC (horizontal axis, 1 to 7)]
Source: Wagner et al. (1968)
Manufacturer Production/Cost Relation
x = Amount Produced, y = Total Cost; n=48 months (not in order)
Month  Prod    Cost    |  Month  Prod    Cost    |  Month  Prod    Cost
1      46.75   92.64   |  17     36.54   91.56   |  33     32.26   66.71
2      42.18   88.81   |  18     37.03   84.12   |  34     30.97   64.37
3      41.86   86.44   |  19     36.60   81.22   |  35     28.20   56.09
4      43.29   88.80   |  20     37.58   83.35   |  36     24.58   50.25
5      42.12   86.38   |  21     36.48   82.29   |  37     20.25   43.65
6      41.78   89.87   |  22     38.25   80.92   |  38     17.09   38.01
7      41.47   88.53   |  23     37.26   76.92   |  39     14.35   31.40
8      42.21   91.11   |  24     38.59   78.35   |  40     13.11   29.45
9      41.03   81.22   |  25     40.89   74.57   |  41     9.50    29.02
10     39.84   83.72   |  26     37.66   71.60   |  42     9.74    19.05
11     39.15   84.54   |  27     38.79   65.64   |  43     9.34    20.36
12     39.20   85.66   |  28     38.78   62.09   |  44     7.51    17.68
13     39.52   85.87   |  29     36.70   61.66   |  45     8.35    19.23
14     38.05   85.23   |  30     35.10   77.14   |  46     6.25    14.92
15     39.16   87.75   |  31     33.75   75.47   |  47     5.45    11.44
16     38.59   92.62   |  32     34.29   70.37   |  48     3.79    12.69
Manufacturer Production/Cost Relation
[Figure: scatterplot titled “Production (x) / Cost (y) Relation”; vertical axis Total Cost, 0 to 100; horizontal axis Total Production, 0 to 50]