Lecture 4: Confidence intervals, case selection, T

SPS 580 Lecture 4
CI Case selection outliers T
I. PRECISION OF DATA -- The “plus or minus”
Research is all about analyzing the mean or the percentage. When you take a random sample for a
survey, how much confidence do you have in the mean or the percentage? The level of
confidence is expressed as the “plus or minus” that goes along with the result, also called the
“margin of error”.
Illustration: I designed a questionnaire with a scale of 0-10 to
determine whether the students in this class agree or disagree with
<something really important>. A score of 5 is neutral; above that =
favorable opinion; below that = unfavorable opinion. I’d like to
know the average score in the class (aka the TRUE MEAN).
• Universe = the 26 students in SPS 580. “Scale Score” shows
what each person would say if they were asked the question. There
is a “TRUE MEAN” -- I don’t know what it is, that’s why I’m
doing the survey.

ID   NAME                  SCALE SCORE
 1   JACOB ADAMS            3.0
 2   CYNTHIA AVERY          6.0
 3   MARK BIERY             3.0
 4   ABRAHAM BOL            9.0
 5   PRASID DHITAL          0.0
 6   PAUL DOMBROSKI         5.0
 7   KATHRYN DUVAL          1.0
 8   BROOKE EISENMENGER     0.0
 9   MOLLY FANNIN           3.0
10   CRYSTAL GARDNER        2.0
11   SIMONE GOURGUECHON     2.0
12   SHAWN JANZEN           2.0
13   AMANDA MAHONEY         1.0
14   CLAIRE MARCH           7.0
15   ELLEN MCELLIGOTT       6.0
16   WILMAR MOLINA          2.0
17   MEGAN MORROW           1.0
18   KATERI NELIS           5.0
19   JOANNA OUBRE           4.0
20   PATRICK POUNTNEY       6.0
21   AMANDA RIORDAN         9.0
22   ANNE SAWKIW           10.0
23   REBECCA WILSON         4.0
24   JONATHAN WITTIG        0.0
25   JADA WOLLENZIEN        3.0
26   EDWARD ZEHME           0.0

I didn’t have enough money to survey everyone in the course (i.e.,
conduct a census). I only have enough money to survey 4 people.
So I randomly selected 4 people and interviewed them.
• These are the answers I have for my survey.

SAMPLE 1              SCALE SCORE
SHAWN JANZEN           2.0
CLAIRE MARCH           7.0
REBECCA WILSON         4.0
EDWARD ZEHME           0.0

Average           3.25
variance (N-1)    8.92
N                 4.00
SEM               1.49
+/-               2.93
low               0.32
high              6.18

• And here are the results from my data analysis: the observed mean is 3.25
STATISTICS ALERT . . .
MEAN = Sum(x) / n = 13/4 = 3.25
The 95% confidence interval equals “Observed Mean” +/- 2.93 … which
means I am 95% certain that the TRUE MEAN, i.e., the average score for all
students in the class (for the UNIVERSE), is between 0.32 and 6.18
STATISTICS ALERT . . .
95% Confidence Interval = +/- 1.96 * Standard Error of the Mean = +/- 1.96 * 1.49
Standard Error of the Mean = Square Root( Variance / n ) = Sqrt (8.92/4) = 1.49
Variance = Sum of ( (individual score - MEAN)^2 ) / (n-1) = Sum ( (x - 3.25)^2 ) / 3 = 8.92
x       (x-mean)    (x-mean)^2
2.0     -1.25        1.5625
7.0      3.75       14.0625
4.0      0.75        0.5625
0.0     -3.25       10.5625
SUM                 26.75
SUM/(n-1)            8.92
• The variance has to do with the amount of VARIETY in the
scores. It bounces around the same value regardless of how many
people you interview.
• The standard error of the mean has to do with the variance and the
SAMPLE SIZE; it gets smaller if the sample gets larger.
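PYTHON SKETCH . . . If you want to check the Sample 1 arithmetic by hand, here is a minimal sketch. Python is my assumption (the lecture works in SPSS), and the function name mean_ci is made up for illustration. It follows the same steps as the STATISTICS ALERT formulas: mean, variance with (n-1), SEM, then 1.96 * SEM.

```python
import math

def mean_ci(scores, z=1.96):
    """Return the mean, (n-1) variance, SEM, and 95% CI for a list of scores.

    Hypothetical helper for illustration; z=1.96 is the multiplier
    used in the lecture.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Variance = sum of squared deviations from the mean, divided by (n - 1)
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    # Standard error of the mean = sqrt(variance / n)
    sem = math.sqrt(variance / n)
    margin = z * sem  # the "plus or minus"
    return {
        "mean": mean,
        "variance": variance,
        "sem": sem,
        "margin": margin,
        "low": mean - margin,
        "high": mean + margin,
    }

# Sample 1 scores: JANZEN, MARCH, WILSON, ZEHME
sample_1 = [2.0, 7.0, 4.0, 0.0]
result = mean_ci(sample_1)

for name, value in result.items():
    print(f"{name:>8}: {value:.2f}")
```

Rounded to two decimals this should match the Sample 1 summary: mean 3.25, variance 8.92, SEM 1.49, +/- 2.93, interval 0.32 to 6.18.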
A. The meaning of the 95% confidence interval . . . The 95% CI is a way of saying we are
95% certain that the “REAL MEAN” – i.e., the one we would get if we surveyed everybody - is within the interval . . . “Observed Mean” +/- 1.96 * SEM
[Figure: WHERE IS THE "REAL MEAN"? Scale from 0 to 10. Observed mean = 3.25; 95% CI = 0.32 to 6.18]
95% of what? Well, if we did 100 surveys with the same sample size, then 95% of the time –
i.e. 95 times out of 100, the 95% confidence interval will contain the “TRUE MEAN”
• To test this, I did four more surveys, each based
on a random sample of the same size, from the
same universe.
• These are the results of surveys 2, 3, 4, and 5.

SAMPLE 2                Score     Results
SIMONE GOURGUECHON       2.0      Avg   2.75
AMANDA MAHONEY           1.0      SEM   0.85
KATERI NELIS             5.0      +/-   1.67
JADA WOLLENZIEN          3.0

SAMPLE 3                Score     Results
CRYSTAL GARDNER          2.0      Avg   2.50
CLAIRE MARCH             7.0      SEM   1.55
MEGAN MORROW             1.0      +/-   3.05
JONATHAN WITTIG          0.0

SAMPLE 4                Score     Results
ABRAHAM BOL              9.0      Avg   5.25
PRASID DHITAL            0.0      SEM   2.25
MOLLY FANNIN             3.0      +/-   4.41
AMANDA RIORDAN           9.0

SAMPLE 5                Score     Results
JACOB ADAMS              3.0      Avg   4.75
PAUL DOMBROSKI           5.0      SEM   0.85
CLAIRE MARCH             7.0      +/-   1.67
REBECCA WILSON           4.0

Here are the mean and 95% CI for each of the 5
samples . . .

Mean and 95% CI for 5 samples from same Universe
            95% Low    Mean    95% High
Sample 1     0.32      3.25     6.18
Sample 2     1.08      2.75     4.42
Sample 3    -0.55      2.50     5.55
Sample 4     0.84      5.25     9.66
Sample 5     3.08      4.75     6.42

From the TOTAL data base we calculate that
the “TRUE MEAN” is 3.60
However, in a research setting you don’t know
this, you just have an observed sample mean
and a 95% confidence interval
• In my 5 samples the 95% CI included the
“TRUE MEAN” every time. If I had done 100
samples, I would expect the 95% CI to
include the true mean about 95 times

[Chart: 95% Confidence Interval for Samples 1 to 5, plotting 95% Low, Mean, and 95% High for each sample against TRUE MEAN = 3.60; vertical axis from -2.00 to 10.00]
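PYTHON SKETCH . . . A quick way to check the "included the TRUE MEAN every time" claim is to recompute each interval and test whether 3.60 falls inside it. This is only an illustration: Python is my choice, ci95 is a made-up helper name, and the scores are the ones listed for the five samples (with WOLLENZIEN = 3.0 and ZEHME = 0.0, which are the values implied by the sample averages).

```python
import math

TRUE_MEAN = 3.60  # value stated in the lecture for the whole class

# Scores for the five samples of 4 students each
samples = {
    "Sample 1": [2.0, 7.0, 4.0, 0.0],
    "Sample 2": [2.0, 1.0, 5.0, 3.0],
    "Sample 3": [2.0, 7.0, 1.0, 0.0],
    "Sample 4": [9.0, 0.0, 3.0, 9.0],
    "Sample 5": [3.0, 5.0, 7.0, 4.0],
}

def ci95(scores):
    """Return (low, mean, high) using mean +/- 1.96 * SEM."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    margin = 1.96 * math.sqrt(variance / n)
    return mean - margin, mean, mean + margin

contains_true_mean = {}
for name, scores in samples.items():
    low, mean, high = ci95(scores)
    contains_true_mean[name] = low <= TRUE_MEAN <= high
    print(f"{name}: {low:6.2f} {mean:6.2f} {high:6.2f}  "
          f"contains true mean: {contains_true_mean[name]}")
```

Five samples is far too few to verify the "95 out of 100" statement itself; the sketch only confirms that none of these five intervals happened to miss.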
B. Interpreting the 95% confidence interval
In my survey I found out that the “True Opinion” is likely (95%) to be between .32 and 6.18
[Figure: WHERE IS THE "REAL MEAN"? Scale from 0 to 10. Observed mean = 3.25; 95% CI = 0.32 to 6.18]
Q: So how helpful was my survey? A: Not very – Another way to say it is that I’m 95% certain
that the “True opinion” is either negative (< 5), neutral (5), or positive (>5).
Q: How could I do a survey that is more helpful?
A: Increase the sample size.
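PYTHON SKETCH . . . To see why a bigger sample helps, hold the variance at the Sample 1 value (8.92) and vary n. That fixed variance is an assumption for illustration (a new sample would give a somewhat different variance), and the sample sizes below are hypothetical, imagining a universe much larger than this class.

```python
import math

VARIANCE = 8.92  # Sample 1 variance from the lecture, held fixed (assumption)

def margin_of_error(variance, n, z=1.96):
    """The 'plus or minus': z * sqrt(variance / n)."""
    return z * math.sqrt(variance / n)

# Hypothetical sample sizes, each four times the previous one
margins = {n: margin_of_error(VARIANCE, n) for n in (4, 16, 64, 256)}

for n, margin in margins.items():
    print(f"n = {n:3d}   +/- {margin:.2f}")
```

Because n sits under a square root, you need four times the sample to cut the plus-or-minus in half.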
II. PRECISION OF PERCENTAGES
When Y is a dichotomy coded (0,1) , the mean is the proportion in category 1.
Don’t believe me . . .. add up these 10 responses and divide by 10 to get the average (mean)
person    1   2   3   4   5   6   7   8   9   10
score     0   1   1   1   1   0   1   0   0   1

sum of scores                    6
n                                10
average = proportion(1) = p      0.60
variance = p(1-p)                0.24
SEM = sqrt ( variance / n )      0.155
+/- = 1.96 * SEM                 0.304    30.4%

95% CI = p +/- 1.96 * SEM
low      0.296    29.6%
p        0.600    60.0%
high     0.904    90.4%

• mean = proportion coded (1)
• also can be expressed as % coded (1)
• formula for variance is simpler
• formula for SEM is THE SAME
• formula for +/- is THE SAME
• formula for 95% CI is THE SAME
• interpretation of results is THE SAME

[Figure: WHERE IS THE "REAL MEAN"? Scale from 0% to 100%. Observed mean = 60.0%; 95% CI = 29.6% to 90.4%]

Q: Is the majority opinion in the class above or below 50% ?
A: I don’t know, but I’m 95% sure it’s between 29.6% and 90.4% !!!
Q: What can you do to make this more precise? A:
• LOOK AT ASSGT 4 Part 1
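PYTHON SKETCH . . . The same steps for a (0,1) variable, as a minimal sketch in Python (my choice of language; proportion_ci is a made-up name). It uses the 10 responses from the table and the simpler variance formula p(1-p).

```python
import math

def proportion_ci(responses, z=1.96):
    """95% CI for the proportion coded 1 in a list of 0/1 responses."""
    n = len(responses)
    p = sum(responses) / n        # the mean of a 0/1 variable is the proportion
    variance = p * (1 - p)        # simpler variance formula for a dichotomy
    sem = math.sqrt(variance / n)
    margin = z * sem
    return {"p": p, "variance": variance, "sem": sem,
            "margin": margin, "low": p - margin, "high": p + margin}

responses = [0, 1, 1, 1, 1, 0, 1, 0, 0, 1]  # the 10 scores from the table
ci = proportion_ci(responses)

print(f"p = {ci['p']:.1%}, +/- {ci['margin']:.1%}, "
      f"95% CI = {ci['low']:.1%} to {ci['high']:.1%}")
```

This should print p = 60.0%, +/- 30.4%, and an interval of 29.6% to 90.4%, matching the table.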
III. EXPLORING Confidence Intervals with Live Data
WBEZ marketing committee wants to know how to increase revenue from its younger audience
• Listenership, familiarity w/WBEZ
• Membership in NFPs
• Usual payment for membership
A. what % listen to WBEZ radio station?
FREQUENCIES VARIABLES= wbezrng /ORDER=ANALYSIS.
wbezrng Amount of Time Listening to WBEZ
                                      Frequency
Valid     1 Know It, Don't Listen        408
          2 Listen < 1 hr/day            462
          3 Listen > 1hr/day             299
          4 Not Familiar                1809
          5 Don't Listen to Radio        247
          8 Don't Know                    16
          Total                         3241
Missing   System                       33449
Total                                  36690

WBEZ familiarity, listenership, adult population
Don’t listen to radio            247
Not familiar with WBEZ         1,809
Familiar, don't/DK listen        424
Listen to WBEZ                   761
Total                          3,241
A. NOTE: the marketing committee wants to target its research on population 45 and under.
1. Define a selection variable . . .
RECODE age01 (10 thru 45=1) (46 thru 98=2) (ELSE=9) INTO AGE2.
VARIABLE LABELS AGE2 'age2 '.
VALUE LABELS age2 1 '18- 45' 2 '46+' 9 'not valid'.
MISSING VALUES age2 (9) .
2. Select that data only for analysis . . .
DATA / SELECT CASES / IF CONDITION IS SATISFIED / IF age2=1 OK/PASTE
Other selection variables mentioned in SPS570/580. . . transit riders, low income
Univariate selection variables (for now) . . . typology construction later
USE ALL.
COMPUTE filter_$=(AGE2 = 1).
VARIABLE LABEL filter_$ 'AGE2 = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
NEW DATA FOR AGE 18-45 ONLY
WBEZ familiarity, listenership, population 18-45
                                                       N      %     +/-
Don’t listen to radio                                 110
Not familiar with WBEZ                              1,125
Familiar, don't/DK listen (potential listeners)       261    13%    1.5%
Listen to WBEZ (current listeners)                    447    23%    1.9%
Total                                               1,943
Potential + current listeners                                36%    2.1%
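PYTHON SKETCH . . . The +/- column for the 18-45 table can be reproduced from the counts alone. A minimal sketch, assuming the p(1-p) variance formula from Part II; the helper name and the group labels are mine, and 708 is just 261 + 447.

```python
import math

N_TOTAL = 1943  # respondents aged 18-45

# Counts taken from the 18-45 table; labels are for illustration only
counts = {
    "Familiar, don't/DK listen": 261,
    "Listen to WBEZ": 447,
    "Familiar or listen (261 + 447)": 708,
}

def percent_margin(count, n, z=1.96):
    """Return (p, plus-or-minus) for a count out of n, both as proportions."""
    p = count / n
    sem = math.sqrt(p * (1 - p) / n)
    return p, z * sem

margins = {}
for label, count in counts.items():
    p, margin = percent_margin(count, N_TOTAL)
    margins[label] = (p, margin)
    print(f"{label:32s} {p:5.0%}  +/- {margin:.1%}")
```

With nearly 2,000 cases the plus-or-minus is down to about 2 points, compared with 30 points for the 10-person example.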
B. What % belong to non-profit arts or cultural organizations?
RECODE mems1 mems10 mems11 mems12 mems13 mems2 mems3 mems4 mems5
mems6 mems7 mems8 mems9 (1=1) (2=0) (8=0).
VALUE LABELS mems1 mems10 mems11 mems12 mems13 mems2 mems3 mems4 mems5
mems6 mems7 mems8 mems9 0 ' no DK ' 1 ' yes member'.
COMPUTE ARTCULTMEMBERSHIPS = mems1+ mems2+ mems3+ mems4+ mems5
+ mems6 + mems7 + mems8 + mems9+ mems10+ mems11+ mems12+ mems13.
KEEP THE SELECTION VARIABLE OPERATING
FREQUENCIES VARIABLES=ARTCULTMEMBERSHIPS /ORDER=ANALYSIS.
ARTCULTMEMBERSHIPS
          Value    Frequency    Valid Percent
Valid     .00        1434          81.8
          1.00        245          14.0
          2.00         57           3.2
          3.00         13            .7
          4.00          4            .2
          6.00          1            .1
          Total      1754         100.0
Inspection => useful only as a (0,1) variable: no memberships (0) vs. one or more (1)
One or more memberships (1 - 6)
= 18% +/- 1.8%
Note: you can get different data points
on the same population even though
memberships and WBEZ listenership
are not asked the same year
C. How much are people willing to pay for a membership in an arts/cultural organization?
MEMSPED How much would you be willing to spend for a one-year family membership in one of these types of
organizations?
memsped How Much for One-Year Family Membership
dollars   frequency      dollars   frequency      dollars   frequency
$0          174          $65          7           $225         1
$1            4          $67          1           $250         5
$2            1          $68          1           $300        11
$5           24          $70         10           $350         2
$7            1          $75         43           $360         1
$10          35          $80          4           $400         4
$15          13          $85          2           $500        10
$20          93          $89          1           $501         1
$25         106          $90          1           $745         1
$29           1          $100       251           $1,000       4
$30          54          $110         1           $1,200       1
$35          25          $115         1           $1,500       1
$40          41          $120         4           $2,000       2
$45           7          $125         3           $5,000       1
$50         346          $150        26           $5,200       1
$55           4          $175         1           $6,000       1
$60          33          $200        58           9998 DK    324
MEAN = $82
+/- $14
• anybody see any problems here?
• Anything above $500 is quite possibly
a mistake
• Anything above $200 is an OUTLIER
Rule of thumb: values more than 150% of the mean
above the mean (here $82 + 1.5 * $82 = about $205)
are going to cause trouble
D. Deal with OUTLIERS one of two ways
a. Work with the median (as opposed to the mean). . . Between $45 and $50 -- where
the 50th percentile falls
b. But the median can’t be used for many statistical procedures
c. So we RECODE OUTLIERS to an acceptable maximum value and then re-calculate
the mean
RECODE memsped (225 thru 6000=200) (ELSE=Copy) INTO memsped2.
VARIABLE LABELS memsped2 'money for membership'.
MISSING VALUES MEMSPED2 (9997, 9998) .
FREQUENCIES VARIABLES= MEMSPED2 /ORDER=ANALYSIS.
DESCRIPTIVES VARIABLES=MEMSPED2
/STATISTICS=MEAN STDDEV VARIANCE MIN MAX SEMEAN.
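PYTHON SKETCH . . . Here is roughly what the RECODE does to the mean, worked from the frequency table instead of the raw data file. This is an illustration in Python, not the SPSS procedure itself; the helper names are made up, the DK (9998) cases are left out, and the $225 frequency of 1 is an assumption inferred from the counts.

```python
# dollars -> number of respondents, copied from the memsped frequency table.
# DK (9998) is excluded; the count of 1 for $225 is inferred, not printed.
memsped_freq = {
    0: 174, 1: 4, 2: 1, 5: 24, 7: 1, 10: 35, 15: 13, 20: 93, 25: 106,
    29: 1, 30: 54, 35: 25, 40: 41, 45: 7, 50: 346, 55: 4, 60: 33,
    65: 7, 67: 1, 68: 1, 70: 10, 75: 43, 80: 4, 85: 2, 89: 1, 90: 1,
    100: 251, 110: 1, 115: 1, 120: 4, 125: 3, 150: 26, 175: 1, 200: 58,
    225: 1, 250: 5, 300: 11, 350: 2, 360: 1, 400: 4, 500: 10, 501: 1,
    745: 1, 1000: 4, 1200: 1, 1500: 1, 2000: 2, 5000: 1, 5200: 1, 6000: 1,
}
CAP = 200  # the "acceptable maximum value" used in the RECODE

def weighted_mean(freq):
    """Mean of a frequency table given as {value: count}."""
    n = sum(freq.values())
    return sum(value * count for value, count in freq.items()) / n

def cap_values(freq, cap):
    """Recode every value above the cap down to the cap; cases are kept."""
    capped = {}
    for value, count in freq.items():
        new_value = min(value, cap)
        capped[new_value] = capped.get(new_value, 0) + count
    return capped

raw_mean = weighted_mean(memsped_freq)
trimmed_freq = cap_values(memsped_freq, CAP)
trimmed_mean = weighted_mean(trimmed_freq)

print(f"valid n      = {sum(memsped_freq.values())}")
print(f"raw mean     = ${raw_mean:.2f}")
print(f"trimmed mean = ${trimmed_mean:.2f}")
print(f"cases at ${CAP} after capping = {trimmed_freq[CAP]}")
```

Note that capping keeps all the cases (the n does not change); only the extreme values are pulled in to $200.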
How Much for One-Year Family Membership w/o OUTLIERS
dollars   frequency      dollars   frequency
$0          174          $65          7
$1            4          $67          1
$2            1          $68          1
$5           24          $70         10
$7            1          $75         43
$10          35          $80          4
$15          13          $85          2
$20          93          $89          1
$25         106          $90          1
$29           1          $100       251
$30          54          $110         1
$35          25          $115         1
$40          41          $120         4
$45           7          $125         3
$50         346          $150        26
$55           4          $175         1
$60          33          $200       105
MEAN = $60
+/- $3
Look at the impact of trimming the
outliers: $82 vs. $60
Rule of thumb: the reason you trim
outliers is that means (and lots of
other really important statistics) are
VERY STRONGLY
INFLUENCED by extreme values.
You get a more stable, and
therefore more accurate picture of
the REAL WORLD by trimming
(not eliminating) outliers.
IV. Should WBEZ market differently in the city vs. suburbs?
* Suburbs are coded 0 here to match the '0 Suburbs' rows in the output tables.
RECODE region (1=1) (2 thru 7=0) (ELSE=9) INTO region2.
VARIABLE LABELS region2 'region recoded'.
VALUE LABELS region2 1 'Chicago' 0 'Suburbs'.
MISSING VALUES region2 (9).
KEEP THE SELECTION VARIABLE OPERATING
A. Is WBEZ listenership higher in the city or in the suburbs?
             1. Don’t listen   2. Not familiar   3. Familiar,        4. Listen to
             to radio          with WBEZ         don't/DK listen     WBEZ            Total
1 Chicago        11%               49%               12%                29%            766
0 Suburbs         2%               64%               15%                19%           1177
Total             6%               58%               13%                23%           1943
Chi Sq(3) = 98
Phi = .22
But chi square, phi are blanket tests, WBEZ wants to know specifically about listenership
Does Place Predict WBEZ Listenership ?
Place          Mean    Std Err    +/-       CI(Low)    Mean    CI(High)
1 Chicago      0.29    0.0163     0.0320    0.25       0.29    0.32
0 Suburbs      0.19    0.0115     0.0226    0.17       0.19    0.22
Difference =   0.09
What is the CONFIDENCE INTERVAL for the difference of means
If it includes the value ZERO then the difference of the means is NOT SIGNIFICANT
STATISTICAL THEORY ALERT . . .
Mean 1 has its uncertainty (SEM1)
Mean 2 has its uncertainty (SEM2)
Logical conclusion -> Wouldn’t it make sense that the uncertainty of the difference is equal to
the sum of the two uncertainties? Well it is, sort of: the squared standard errors add, not the
standard errors themselves . . .
STATISTICS ALERT . . .
STD ERROR of DIFFERENCE OF 2 Means (SEDiff) = SQRT ( SEM1^2 + SEM2^2 )
95% CONFIDENCE INTERVAL for the difference of means = (Mean1 - Mean2) +/- 1.96 * SEDiff
                       CI(Low)    Mean    CI(High)
1 Chicago                25%      29%       32%
0 Suburbs                17%      19%       22%
DIFFERENCE of Means       5%       9%       13%

T-Test = DIFFERENCE of Means / STD ERROR of DIFFERENCE

-> the CI(Diff) does NOT include ZERO, so we
conclude that there is a SIGNIFICANT
DIFFERENCE in listenership by place . . .
In the city, listenership is about 9 percentage points
higher than in the suburbs
Another way to look at this is that the difference is
significant if the t-test > 1.96 (or < -1.96 for
negative differences), df = INFINITE
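PYTHON SKETCH . . . The difference-of-means steps can be checked with a few lines. A minimal sketch, assuming the two standard errors from the listenership table (0.0163 and 0.0115) and the difference of 0.09 shown there (the rounded means 0.29 and 0.19 would give 0.10, so the table's figure presumably comes from unrounded means). The function name is made up.

```python
import math

def difference_test(diff, sem1, sem2, z=1.96):
    """CI and t statistic for a difference of two independent means."""
    # Squared standard errors add; then take the square root
    se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)
    margin = z * se_diff
    t = diff / se_diff
    return {"se_diff": se_diff, "low": diff - margin,
            "high": diff + margin, "t": t,
            "significant": abs(t) > z}

# Chicago vs. suburbs listenership, values as printed in the lecture table
listen = difference_test(diff=0.09, sem1=0.0163, sem2=0.0115)

print(f"SE of difference = {listen['se_diff']:.4f}")
print(f"95% CI = {listen['low']:.0%} to {listen['high']:.0%}")
print(f"t = {listen['t']:.2f}, significant: {listen['significant']}")
```

The two views agree: the interval excludes zero exactly when the t value is beyond +/- 1.96.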
E. Is the percent who belong to nfp arts/cultural organizations higher in the city?
KEEP THE SELECTION VARIABLE OPERATING
region2 region recoded * ARTCULTMEMBERSHIPS Crosstabulation
% within region2 region recoded
                          ARTCULTMEMBERSHIPS
region2 region recoded    .00      1.00     2.00    3.00    4.00    6.00    Total
.00 Suburbs               80.5%    15.2%    3.5%     .4%     .3%            100.0%
1.00 Chicago              83.2%    12.5%    3.0%    1.1%     .1%     .1%    100.0%
Total                     81.8%    14.0%    3.2%     .7%     .2%     .1%    100.0%
Chi square (5) = 7.7
phi = .077
p > .05

grouping together to focus on any memberships vs. none
                       Any memberships    N       p*(1-p)/n
1 Chicago                   0.17          808     0.0001733
0 Suburbs                   0.19          946     0.0001656
                                          sum -->   0.0003389
                                          sqrt -->  0.0184083
Difference                 -0.03
SE(Diff)                    0.02
T                          -1.42
critical value of T        -1.96
Not significant
-> Not Sig, answer is NO, no significant difference in the % who
belong between the city and the suburbs
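PYTHON SKETCH . . . For two percentages, each standard error comes from p(1-p)/n, and the rest is the same difference test. A minimal sketch in Python (my choice; the function name is made up). The proportions 0.168 and 0.195 are my reading of the crosstab (100% minus the .00 column), so the t value lands near, but not exactly on, the -1.42 in the table, which was presumably computed from unrounded figures.

```python
import math

def proportion_difference_test(p1, n1, p2, n2, z=1.96):
    """t statistic for the difference between two independent proportions."""
    var1 = p1 * (1 - p1) / n1    # squared SEM for group 1
    var2 = p2 * (1 - p2) / n2    # squared SEM for group 2
    se_diff = math.sqrt(var1 + var2)
    diff = p1 - p2
    t = diff / se_diff
    return {"var1": var1, "var2": var2, "se_diff": se_diff,
            "diff": diff, "t": t, "significant": abs(t) > z}

# Any membership: Chicago 1 - 0.832 (n = 808), suburbs 1 - 0.805 (n = 946)
members = proportion_difference_test(p1=0.168, n1=808, p2=0.195, n2=946)

print(f"p(1-p)/n: {members['var1']:.7f} and {members['var2']:.7f}")
print(f"SE(Diff) = {members['se_diff']:.4f}")
print(f"T = {members['t']:.2f}, significant: {members['significant']}")
```

Either way the t value is well inside +/- 1.96, so the conclusion does not depend on the rounding.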
F. Is the amount people are willing to pay for a membership higher in the suburbs?
ANALYZE / COMPARE MEANS / MEANS / Dependent Memsped2 / Independent Region2 / OPTIONS Mean, Std Error of Mean

Report
memsped2 money for membership
region2 region recoded    Mean       Std. Error of Mean
.00 Suburbs               59.8833    1.81970
1.00 Chicago              60.3991    2.16942
Total                     60.1166    1.39812

                   Avg. spend for memberships    SEM        SEM^2
.00 Suburbs              $59.88                  $1.8197    3.3113
1.00 Chicago             $60.39                  $2.1694    4.7064
                                                 SUM -->    8.0177
                                                 SQRT -->   $2.8316
Difference               -$0.51
SEDifference              2.8316
T                        -0.18
critical value           -1.96
Not significant
-> Not Sig, answer is NO, no significant difference between what people
will pay in the suburbs and in the city
Assignment 4:
Part 1: The Excel spreadsheet for Assignment 4 contains a list of the students in SPS 580 and
their opinions on two really important issues. Opinion Item 2 is measured on a (1,10) scale.
Opinion Item 3 is measured as a (0,1) dichotomy.
1. Randomly select 10 people from the list; analyze the scores for the answers they gave to
Opinion Item 2 and Opinion Item 3.
2. For Opinion Item #2: What is the observed mean, the variance, the SEM, the 95% CI,
what do you conclude from your survey?
3. For Opinion Item #3: ditto
Part 2: Define a policy research problem on a TARGETED POPULATION using a univariate
selection variable (recoded, but no typologies)
1. TARGET POPULATION: Use PASW to select the targeted population, describe how
this is done
2. DEPENDENT VARIABLES . . . Define one dichotomous (0,1) outcome variable (Y1),
and one interval scale outcome variable (Y2) -- can be a scale you compute or an interval
variable on the data set
a. For Y1 what is the 95% Confidence Interval for the percent?
b. For Y2 . . .
i. Is there a need to trim outliers? Take the necessary action and explain it.
ii. What are the low/high (trimmed) values, what is the mean and the 95%
Confidence Interval for the mean?
3. INDEPENDENT VARIABLE: Define a (0,1) dichotomous independent variable (X1)
that classifies the target population according to a characteristic of policy interest, explain
the variable and categories
a. What is the theory being tested for X1 -> Y1?
b. Crosstabulate X1 and Y1, show a PQ table of percents, with added columns/rows
as needed to show the steps in calculating a T-test for the difference in
percentages
c. What do you conclude from the data and the T-test?
d. What is the theory being tested for X1 -> Y2?
e. Calculate a table of means for X1 and Y2, show a PQ table of means, with added
columns/rows as needed to show the steps in calculating a T-test for the
difference in means
f. What do you conclude from the data and the T-test?