90-786 Lecture 3

Lecture 3. Data Compression for
Two Variables: Scatterplots, Cross-Tabulations, and Correlation
David R. Merrell
90-786 Intermediate Empirical
Methods for Public Policy and
Management
Lecture 3: Agenda






Review of Lecture 2
Cross-Tabulations
Comparison Bar Charts
Parallel Box Plots
Scatterplots
Correlation Coefficients
Review of Lecture 2


Mean or Median
Models for Data
Mean or Median




Complaints have reached the city manager
that Tardy City is taking too long to pay its
bills.
Data are days taken to pay seven bills: 34, 27, 64, 31, 30, 26, 35
Calculate the mean and median.
What do you conclude?
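The calculation can be checked with a short Python sketch. It uses only the seven values from the slide; the variable names are illustrative. The single 64-day bill pulls the mean above the median, which is the contrast the slide asks about.

```python
from statistics import mean, median

# Days taken to pay the seven bills (values from the slide).
days = [34, 27, 64, 31, 30, 26, 35]

mean_days = mean(days)      # sensitive to the one 64-day bill
median_days = median(days)  # middle value of the sorted data

print(f"mean   = {mean_days:.1f} days")
print(f"median = {median_days} days")
```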
Models for Data


Data = Fit + Residual
Fit as a Center




Mean
Median
Mode
Example: Number of Stat Courses
Taken by Students in 90-786
Bin     Frequency   Cumulative %
0            1           5.26%
1           15          84.21%
2            2          94.74%
3            1         100.00%
More         0         100.00%
[Figure: Histogram. Frequency by bin (0, 1, 2, 3, More) with cumulative % overlaid.]
[Figure: Histogram. Frequency of C1 for values 0 to 3.]
Summary Statistics (Excel)
Mean                       1.157894737
Standard Error             0.138140489
Median                     1
Mode                       1
Standard Deviation         0.602140432
Sample Variance            0.362573099
Kurtosis                   4.885489992
Skewness                   1.659166502
Range                      3
Minimum                    0
Maximum                    3
Sum                        22
Count                      19
Confidence Level(95.0%)    0.290222623
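Most of these Excel figures can be reproduced from the frequency table for the 19 students. The sketch below rebuilds the raw observations and uses Python's standard `statistics` module. It is an illustration rather than Excel's own procedure, and it omits kurtosis, skewness, and the confidence level.

```python
from math import sqrt
from statistics import mean, median, mode, stdev, variance

# Rebuild the 19 observations from the frequency table:
# one student with 0 courses, fifteen with 1, two with 2, one with 3.
frequencies = {0: 1, 1: 15, 2: 2, 3: 1}
courses = [value for value, count in frequencies.items() for _ in range(count)]

summary = {
    "Mean": mean(courses),
    "Standard Error": stdev(courses) / sqrt(len(courses)),
    "Median": median(courses),
    "Mode": mode(courses),
    "Standard Deviation": stdev(courses),  # n - 1 divisor
    "Sample Variance": variance(courses),  # n - 1 divisor
    "Range": max(courses) - min(courses),
    "Minimum": min(courses),
    "Maximum": max(courses),
    "Sum": sum(courses),
    "Count": len(courses),
}

for name, value in summary.items():
    print(f"{name:20s} {value:.6g}")
```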
Summary Statistics (Minitab)
Descriptive Statistics

Variable     N     Mean   Median   TrMean    StDev   SE Mean
C1          19    1.158    1.000    1.118    0.602     0.138

Variable      Min      Max       Q1       Q3
C1          0.000    3.000    1.000    1.000
Measures of Error
Sum of Squared Residuals: $\sum_i (X_i - a)^2$
Sum of Absolute Residuals: $\sum_i |X_i - a|$

Fit (a)   Sum Squared Residuals   Sum Absolute Residuals   Percent Misses
Mean               6.50                    7.05                100.00
Median             7.00                    5.00                 21.05
Mode               7.00                    5.00                 21.05
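The three error measures can be sketched in Python for any candidate center a, following Data = Fit + Residual. The data are the 19 course counts from the frequency table. The function name `error_measures` is illustrative, and reading "Percent Misses" as the share of observations not exactly equal to the fit is an assumption inferred from the table's values. Results may differ slightly from the slide where it rounds.

```python
from statistics import mean, median, mode

# The 19 observations: one 0, fifteen 1s, two 2s, one 3.
frequencies = {0: 1, 1: 15, 2: 2, 3: 1}
courses = [value for value, count in frequencies.items() for _ in range(count)]

def error_measures(data, a):
    """Return (sum of squared residuals, sum of absolute residuals, percent misses)."""
    residuals = [x - a for x in data]  # Residual = Data - Fit
    ssr = sum(e * e for e in residuals)
    sar = sum(abs(e) for e in residuals)
    # Assumption: a "miss" is any observation not exactly equal to the fit.
    misses = 100.0 * sum(1 for e in residuals if e != 0) / len(data)
    return ssr, sar, misses

centers = [("Mean", mean(courses)), ("Median", median(courses)), ("Mode", mode(courses))]
for name, center in centers:
    ssr, sar, misses = error_measures(courses, center)
    print(f"{name:7s} {ssr:6.2f} {sar:6.2f} {misses:7.2f}")
```

The mean gives the smallest sum of squared residuals, while the median gives the smallest sum of absolute residuals.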
Data Compression for
Two Variables...And More







Two-Variable Description
Cross-Tabulations
Comparison Bar Charts
Parallel Box Plots
Scatterplots
Scatterplot Matrix
Correlation Coefficients
Two-Variable Description
                          Independent Variable Level of Measurement
Dependent Variable
Level of Measurement      Nominal or Ordinal     Interval
Nominal or Ordinal        Cross-tabulation       Cross-tabulation (group interval data)
Interval                  Table or chart         Scatterplot
Structure of a Cross-Tabulation
Dependent           Independent Variable
Variable          Group 1     Group 2     Row Total
0                    a           b        a+b
1                    c           d        c+d
2                    e           f        e+f
Column Total       a+c+e       b+d+f      a+b+c+d+e+f
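A minimal Python sketch of this layout follows. The records are hypothetical, made up only to fill the cells a through f, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical records: (dependent value, independent group).
records = [
    (0, "Group 1"), (0, "Group 2"), (1, "Group 1"), (1, "Group 1"),
    (2, "Group 2"), (1, "Group 2"), (0, "Group 1"), (2, "Group 2"),
]

cells = Counter(records)  # cell counts; a missing cell counts as 0
rows = sorted({dep for dep, _ in records})
cols = sorted({grp for _, grp in records})

row_totals = {r: sum(cells[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(cells[(r, c)] for r in rows) for c in cols}
grand_total = sum(cells.values())

header = f"{'':14s}" + "".join(f"{c:>10s}" for c in cols) + f"{'Row Total':>12s}"
print(header)
for r in rows:
    body = "".join(f"{cells[(r, c)]:>10d}" for c in cols)
    print(f"{r!s:14s}" + body + f"{row_totals[r]:>12d}")
footer = "".join(f"{col_totals[c]:>10d}" for c in cols)
print(f"{'Column Total':14s}" + footer + f"{grand_total:>12d}")
```

The row totals and the column totals each sum to the grand total in the bottom-right cell.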
Street Repair Practices


Study street repair practices of local
government
Cities and counties handle street
repairs:



by using their own public employees exclusively
by contracting out part of the work
by contracting out all the work
Table 1. Street Repair: Counts
Street Repair Practices by Type of Government: Public Employees
and Contracting by Cities and Counties in the United States

                                 Type of Local Government
Street Repair Practice           City No.    County No.    Total
Only Public                          966         172       1,138
Public and Contracting out           396          61         457
Only Contracting out                  36           8          44
Total                              1,398         241       1,639
Table 2. Street Repair: Percents
Street Repair Practices by Type of Government: Public Employees
and Contracting by Cities and Counties in the United States

                                 Type of Local Government
Street Repair Practice           City %    County %    Total %    Total Number
Only Public                       69.1%      71.4%      69.4%        1,138
Public and Contracting out        28.3%      25.3%      27.9%          457
Only Contracting out               2.6%       3.3%       2.7%           44
Total %                            100%       100%       100%
Total Number                      1,398        241      1,639
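The percentages are column percentages: each count is divided by the total for its type of government, so each column sums to 100%. A minimal Python sketch using the counts from Table 1 (variable names are illustrative):

```python
# Counts from Table 1 (street repair practice by type of government).
counts = {
    "Only Public": {"City": 966, "County": 172},
    "Public and Contracting out": {"City": 396, "County": 61},
    "Only Contracting out": {"City": 36, "County": 8},
}
groups = ["City", "County"]

column_totals = {g: sum(row[g] for row in counts.values()) for g in groups}
grand_total = sum(column_totals.values())

# Column percentages: each cell divided by its column total.
percents = {
    practice: {g: 100.0 * row[g] / column_totals[g] for g in groups}
    for practice, row in counts.items()
}
# Percentages for the Total column: row total divided by grand total.
total_percents = {
    practice: 100.0 * sum(row.values()) / grand_total
    for practice, row in counts.items()
}

for practice in counts:
    cells = "".join(f"{percents[practice][g]:>9.1f}%" for g in groups)
    print(f"{practice:28s}{cells}{total_percents[practice]:>9.1f}%")
```

Comparing percentages within columns, rather than raw counts, is what allows the 1,398 cities and 241 counties to be compared directly.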
Educational Achievement



Residents of Allegheny County who are
in the labor force
Random sample survey of Allegheny
County residents in the labor force in 199?
Variables: gender and highest
educational achievement
Educational Achievement:
Coding of Ordinal Variables








1 if grade 4 or less
2 if grades 5-7
3 if grade 8
4 if high school incomplete (9-11)
5 if high school graduate (12)
6 if technical, trade, or business after high school
7 if college/university incomplete
8 if college/university graduate or more
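The coding scheme can be written as a lookup table. This is a minimal Python sketch; the names `EDUCATION_CODES` and `encode` and the three sample responses are hypothetical. Because the variable is ordinal, the codes preserve the ordering of the categories but not the distances between them.

```python
# Ordinal codes for highest educational achievement (from the slide).
EDUCATION_CODES = {
    1: "grade 4 or less",
    2: "grades 5-7",
    3: "grade 8",
    4: "high school incomplete (9-11)",
    5: "high school graduate (12)",
    6: "technical, trade, or business after high school",
    7: "college/university incomplete",
    8: "college/university graduate or more",
}
LABEL_TO_CODE = {label: code for code, label in EDUCATION_CODES.items()}

def encode(responses):
    """Map survey response labels to their ordinal codes."""
    return [LABEL_TO_CODE[label] for label in responses]

# Hypothetical responses, for illustration only.
sample = [
    "high school graduate (12)",
    "grade 8",
    "college/university graduate or more",
]
codes = encode(sample)
print(codes)
print(sorted(codes))  # sorting by code orders respondents by education level
```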
Educational Achievement Table
                Female               Male                Total
Education     No.       %         No.       %         No.       %
3               1     0.21%         1     0.21%         2     0.21%
4              25     5.27%        29     6.00%        54     5.64%
5             173    36.50%       137    28.36%       310    32.39%
6              49    10.34%        32     6.63%        81     8.46%
7              76    16.03%        88    18.22%       164    17.14%
8             150    31.65%       196    40.58%       346    36.15%
Total         474   100.00%       483   100.00%       957   100.00%
Bar Chart
[Figure: Comparison bar chart. Percent of female and male respondents at each education level (3 to 8); vertical axis from 0% to 45%.]
Job Satisfaction and Income
for Postal Employees
Job                          Income
Satisfaction      Low        Medium      High
Low               50%        20%         13.3%
Medium            30%        53.3%       20%
High              20%        26.7%       66.7%
Total             100%       100%        100%
                  (n=200)    (n=150)     (n=75)
Five Number Summary
Age of Allegheny County residents by location:
individuals in labor force in 199?.

                             Location
Age                Mon Valley    Pittsburgh    Other
Maximum                69            71          77
Upper quartile         45            43.5        47
Median                 36            33          37
Lower quartile         27            26          29
Minimum                17            16          16
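A five-number summary can be sketched in Python as follows. The survey ages are not reproduced in the slides, so the example reuses the seven Tardy City payment times from the Lecture 2 review. Quartile conventions differ between packages; `method="inclusive"` is an assumption and may not match the software used for the table above.

```python
from statistics import quantiles

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    # Assumption: "inclusive" quartiles; other conventions give other values.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "Maximum": max(data),
        "Upper quartile": q3,
        "Median": q2,
        "Lower quartile": q1,
        "Minimum": min(data),
    }

# Days taken to pay seven bills (Tardy City example).
days = [34, 27, 64, 31, 30, 26, 35]
summary = five_number_summary(days)
for name, value in summary.items():
    print(f"{name:15s} {value}")
```

These five values are what a box plot draws, which is why one box per group gives a compact side-by-side comparison.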
Parallel Box Plots
[Figure: Parallel box plots of age (vertical axis from 10 to 80) for the Mon Valley, Pittsburgh, and Other; a few high outliers are marked with "o".]
Scatterplots



Creating via Excel ChartWizard
Transformation of Variables
Scatterplot Matrices
Scatterplot 1
[Figure: Salary ($0 to $100,000) plotted against years employed (0 to 30).]
Scatterplot 2
[Figure: Salary ($15,000 to $45,000) plotted against years employed (0 to 30).]
Scatterplot 3
[Figure: Salary ($15,000 to $45,000) plotted against years employed (0 to 30), with separate markers for female and male employees.]
Scatterplot Matrix
[Figure: Scatterplot matrix of Years, Salary, and Age Hired.]
Correlation Coefficient, r
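$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{S_X S_Y}$$
where $S_X$ and $S_Y$ are the sample standard deviations of X and Y.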
Properties of r
1  r  1
r  1  data all on negatively sloping straight line
r = 0  data in "shot - gun" pattern
r = +1  data all on positively sloping straight line
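The two extreme cases can be checked with a direct implementation of the correlation coefficient. This is a minimal sketch using sample standard deviations (n - 1 divisor); the function name `pearson_r` and the small data sets are illustrative, not taken from the lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation using sample standard deviations (n - 1 divisor)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # points on a positively sloping line
falling = [10 - 3 * x for x in xs]  # points on a negatively sloping line

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
print(f"rising line:  r = {r_up:+.3f}")
print(f"falling line: r = {r_down:+.3f}")
```

The steepness of the line does not matter; only whether the points fall exactly on it and the sign of its slope.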
International Adoption Visas:
1991 vs 1988

r:/academic/90-786/Chatterjee/Adopt.dat
International Adoption Visas
Country      1988    1991    1992
Africa         28      41      63
Belize          6       4       8
Bolivia        21      51      74
Brazil        164     178     139
Cambodia        0      59      16
Canada         12      12       6
Chile         252     263     176
China          52      62     201
Etc.
[Figure: International Adoption Visas. Scatterplot of log 1992 (vertical axis, 0 to 3.5) against log 1988 (horizontal axis, 0 to 4).]
[Figure: Scatterplot of log 1992 (vertical axis, 2 to 8) against log 1988 (horizontal axis, 0 to 9).]
Excel Calculation of r






Use the statistical function CORREL
Eliminate missing data values
Identify X data
Identify Y data
Finish
Value: r = 0.879098 (.88)
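The same steps can be sketched in Python. Only the eight rows shown on the earlier slide are used, so the result will not reproduce the value computed from the full data file. Treating a count of 0 as missing is an assumption (its logarithm is undefined), and the function name `correl` simply mirrors the Excel function. The base of the logarithm does not affect r, so base 10 is used here.

```python
from math import log10, sqrt

# The eight rows shown on the slide; the full data file has more countries.
visas_1988 = {"Africa": 28, "Belize": 6, "Bolivia": 21, "Brazil": 164,
              "Cambodia": 0, "Canada": 12, "Chile": 252, "China": 52}
visas_1992 = {"Africa": 63, "Belize": 8, "Bolivia": 74, "Brazil": 139,
              "Cambodia": 16, "Canada": 6, "Chile": 176, "China": 201}

# Eliminate missing values. Assumption: a count of 0 is dropped as well,
# because its logarithm is undefined.
kept = [c for c in visas_1988 if visas_1988[c] > 0 and visas_1992[c] > 0]
log_1988 = [log10(visas_1988[c]) for c in kept]  # X data
log_1992 = [log10(visas_1992[c]) for c in kept]  # Y data

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

r = correl(log_1988, log_1992)
print(f"countries used: {len(kept)}")
print(f"r = {r:.3f}")
```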
Minitab Calculation of r
Correlations (Pearson)
Correlation of log 1988 and
log 1992 = 0.873
Next Time ...
Ethics and the Value of Data




Social Value of Data
Privacy Issues
Confidentiality
Applications in Health Care