Lecture 7

Chapter 7
Analyzing categorical data
What is a categorical variable?
Examples:
• Gender (“Male”,“Female”)
• Sick or well
• Success or failure
• Age group (“Below 20”, “20 to below 40”, “40 to below 60”,
“60 and above”)
Common techniques used to analyze categorical data
• Frequency tables
• Contingency tables
• Charts
• Test of proportion
• Chi-square test
Questionnaire design and analysis
• A questionnaire is the most common way to collect certain types
of data.
• Data that are not collected by computer or online can be entered
into the computer manually.
SAS: proc freq
data ex7_1;
input @1 id $3. @4 age 2.0 @6 gender $1. @7 race $1.
      @8 marital $1. @9 education $1. @10 subsi 1.0;
* Adding labels to the variables;
label marital="Marital Status"
      education="Education Level"
      subsi="Baby Subsidy";
datalines;
0012911113
0024522222
0033513244
0042711112
0056821323
0066512432
;
SAS: proc freq
proc freq data=ex7_1;
title "Frequency Counts for Categorical Variables";
tables gender race marital education subsi;
/* Alternatively, list the variables by their position in the data set:
tables gender--subsi; */
run;
SAS output: proc freq
SAS: Adding “Value Labels” (Format)
proc format;
value $sexfmt "1"="Male"
              "2"="Female"
              other="Miscoded";
value $race "1"="Chinese"
            "2"="Malay"
            "3"="Indian"
            "4"="Others";
value $mari "1"="Single"
            "2"="Married"
            "3"="Widowed"
            "4"="Divorced";
value $educ "1"="O-Level or Less"
            "2"="A-Level or Poly"
            "3"="Bachelor degree"
            "4"="Postgraduate degree";
value agree 1="Strongly Disagree"
            2="Disagree"
            3="No Opinion"
            4="Agree"
            5="Strongly Agree";
run;

data ex7_1label;
input @1 id $3. @4 age 2.0 @6 gender $1. @7 race $1.
      @8 marital $1. @9 education $1. @10 subsi 1.0;
label marital="Marital Status"
      education="Education Level"
      subsi="Baby Subsidy";
format gender $sexfmt.
       race $race.
       marital $mari.
       education $educ.
       subsi agree.;
datalines;
0012911113
0024522222
0033513244
0042711112
0056821323
0066512432
;
proc freq data=ex7_1label;
title "Frequency Counts for Categorical Variables";
tables gender race marital education subsi;
run;
SAS output: proc freq
SAS: Using a format to recode a variable
proc format;
value agegp low-20="0-20"
            21-40="21-40"
            41-60="41-60"
            61-high="Greater than 60"
            .="Did not Answer"
            other="Out of Range";
run;
proc freq data=ex7_1label;
title "Using a Format to Group a Numeric Variable";
tables age;
format age agegp.;
run;
SAS output: Using a format to recode a variable
R: Adding value labels
>ex7.1=read.fwf("D:/ST2137/ex7_1.txt",header=FALSE,
+widths=c(3,2,1,1,1,1,1))
>names(ex7.1)=c("id","age","gender","race","marital",
+"education","subsi")
>attach(ex7.1)
>gendername=c("Male","Female")
>gendergp=gendername[gender]
>gender
[1] 1 2 1 1 2 1
>gendergp
[1] "Male"   "Female" "Male"   "Male"   "Female" "Male"
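
A possible alternative to indexing a vector of names is to store the labels in a factor. The sketch below is an illustration only: it assumes the six gender codes printed above and types them in directly rather than reading the file. Because factor() keeps the level order given in levels and labels, table() then lists Male before Female instead of sorting the labels alphabetically.

```r
# Gender codes of the six records, typed in so the sketch is self-contained
gender <- c(1, 2, 1, 1, 2, 1)

# Attach value labels: code 1 becomes "Male", code 2 becomes "Female"
gendergp <- factor(gender, levels = c(1, 2), labels = c("Male", "Female"))

# Counts are reported in level order (Male, Female)
table(gendergp)
```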
R: Recode a variable
>agegpname=c("low-20","21-40","41-60","61-80","over 80")
>agegp=agegpname[ceiling(age/20)]
>age
[1] 29 45 35 27 68 65
>agegp
[1] "21-40" "41-60" "21-40" "21-40" "61-80" "61-80"
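
The same grouping can also be sketched with cut(), which assigns each value to an interval. This is a hedged alternative, not the method used on the slide; the ages are typed in from the output above, and the break points are chosen so that the right-closed intervals (0,20], (20,40], and so on reproduce the ceiling(age/20) grouping.

```r
# Ages of the six records, typed in so the sketch is self-contained
age <- c(29, 45, 35, 27, 68, 65)

# Right-closed intervals: (0,20], (20,40], (40,60], (60,80], (80,Inf]
agegp <- cut(age,
             breaks = c(0, 20, 40, 60, 80, Inf),
             labels = c("low-20", "21-40", "41-60", "61-80", "over 80"))

# cut() returns a factor, so empty groups are listed with a count of 0
table(agegp)
```

Unlike the character vector built by indexing, the factor keeps all five groups in their natural order, including the groups that have no observations.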
R: Table
>gendername=c("Male","Female")
>gendergp=gendername[gender]
>table(gendergp)
gendergp
Female   Male
     2      4

>agegpname=c("low-20","21-40","41-60","61-80","over 80")
>agegp=agegpname[ceiling(age/20)]
>table(agegp)
agegp
21-40 41-60 61-80
    3     1     2

>racegpname=c("Chinese","Malay","Indian","Others")
>racegp=racegpname[race]
>table(racegp)
racegp
Chinese  Indian   Malay
      3       1       2

>marigpname=c("Single","Married","Widowed","Divorced")
>marigp=marigpname[marital]
>table(marigp)
marigp
Divorced  Married   Single  Widowed
       1        2        2        1

>educgpname=c("(1)High Sch or Less","(2)A-Level or Poly",
+"(3)Bachelor degree","(4)Postgraduate degree")
>educgp=educgpname[education]
>table(educgp)
educgp
   (1)High Sch or Less     (2)A-Level or Poly     (3)Bachelor degree (4)Postgraduate degree
                     2                      2                      1                      1

>likegpname=c("(1)Strongly Disagree","(2)Disagree",
+"(3)No Opinion","(4)Agree","(5)Strongly Agree")
>subsigp=likegpname[subsi]
>table(subsigp)
subsigp
  (2)Disagree (3)No Opinion      (4)Agree
            3             2             1
SPSS: Frequency tables
• Suppose the data set from the SAS proc freq example has been
imported into SPSS.
• “Analyze” → “Descriptive Statistics” → “Frequencies...”
• Move the variables to the “Variable(s)” panel → “OK”
SPSS output: Frequency tables
Two-way frequency tables
A two-way frequency table counts the occurrences of each level of
one variable at each level of another variable.
For example:
We would like to know
1. How many males and females were there in the sample?
2. How many respondents were for Candidate A and how many
were for Candidate B?
3. How many males and females were for Candidate A and B,
respectively?
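
The three questions correspond to the row totals, the column totals, and the cells of a two-way table. The sketch below illustrates this in R under one assumption: the counts are typed in by hand from the gender-by-candidate table that appears in the R example later in this lecture, so no data file is needed.

```r
# Cell counts typed in from the gender-by-candidate example (rows: F, M)
tab <- matrix(c(70, 30,
                40, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("F", "M"), candid = c("A", "B")))

margin.table(tab, 1)   # question 1: number of respondents of each gender
margin.table(tab, 2)   # question 2: number of respondents for each candidate
tab                    # question 3: gender by candidate counts
addmargins(tab)        # the same table with row and column totals appended
```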
Two-way frequency tables: SAS
proc format;
value $genfmt "M"="Male"
              "F"="Female"
              other="Miscoded";
value $candfmt "A"="Candidate A"
               "B"="Candidate B";
run;
data ex7_2;
infile "D:\ST2137\ex7_2.txt";
input gender $ candid $;
label gender="Gender"
      candid="Candidate";
format gender $genfmt.
       candid $candfmt.;
run;
proc freq data=ex7_2;
tables gender*candid/chisq;
run;
Two-way frequency tables: SAS output
Computing Chi-square from frequency counts: SAS
/*Computing Chi-square from frequency counts*/
data ex7_2c;
input group $ outcome $ count;
datalines;
drug alive 90
drug dead 10
placebo alive 80
placebo dead 20
;
proc freq data=ex7_2c;
tables group*outcome/chisq;
weight count;
run;
Two-way frequency tables: SAS output
Two-way frequency tables: R
>ex7.2=read.table("D:/ST2137/ex7_2.txt",header=FALSE)
>names(ex7.2)=c("gender","candid")
>table(ex7.2)
      candid
gender  A  B
     F 70 30
     M 40 40
Two-way frequency tables: R
>chisq.test(table(ex7.2))
Pearson's Chi-squared test with Yates' continuity correction
data: table(ex7.2)
X-squared = 6.6626, df = 1, p-value = 0.009846

Computing chi-square from the frequency counts: R
>v=matrix(c(90,10,80,20),ncol=2,byrow=TRUE)
>v=data.frame(v)
>names(v)=c("Alive","Dead")
>row.names(v)=c("Drug","Control")
>chisq.test(v)
Pearson's Chi-squared test with Yates' continuity correction
data: v
X-squared = 3.1765, df = 1, p-value = 0.0747
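
To see where the reported statistic comes from, the continuity-corrected chi-square can be recomputed by hand. The sketch below is a minimal illustration, assuming the gender-by-candidate counts printed by table(ex7.2) in the R example above; the variable names are illustrative and the counts are typed in rather than read from the file.

```r
# Observed counts typed in from the gender-by-candidate table (rows: F, M)
obs <- matrix(c(70, 30,
                40, 40), nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' correction subtracts 0.5 from each |observed - expected|
x2_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Uncorrected Pearson statistic, for comparison
x2_plain <- sum((obs - expected)^2 / expected)

# Upper-tail probability of a chi-square variable with 1 degree of freedom
p_yates <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

round(x2_yates, 4)   # 6.6626, the value reported by chisq.test above
```

The uncorrected statistic x2_plain is what chisq.test returns when called with correct=FALSE.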
Two-way frequency tables: SPSS
• “Analyze” → “Descriptive Statistics” → “Crosstabs...”
• Move one of the variables to the “Row” window and the second
variable to the “Column(s)” window.
• Click on “Statistics...”
• Choose “Chi-square” or some other statistics →
“Continue” → “OK”
Computing Chi-square from frequency tables: SPSS
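• Enter the data with one row per combination of categories and
the frequency in a variable “Count”.
• “Data” → “Weight Cases...”
• Select the “Weight cases by” option and move the variable
“Count” to the “Frequency Variable” panel.
• Proceed as for the two-way frequency tables in SPSS:
“Crosstabs...”, then “Statistics...”, then “Chi-square”.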
Paired Data
• Paired data arise when the subjects are responding to a
question under two different conditions (e.g. before and after
treatment).
• Paired designs are also used when a specific person is matched
on some criteria, such as age and gender, to another person for
the purpose of analysis.
McNemar’s test for paired data: SAS
proc format;
value $opin "p"="Positive" "n"="Negative";
run;
data ex7_3;
length before after $ 1;
infile "D:\ST2137\ex7_3.txt";
input subject before $ after $;
format before after $opin.;
run;
proc freq data=ex7_3;
title "McNemar's Test for Paired Samples";
tables before*after/agree;
run;
McNemar’s test for paired data: SAS output
McNemar’s test for frequency counts: SAS
proc format;
value $opin "p"="Positive" "n"="Negative";
run;
data ex7_3c;
length before after $ 1;
input after $ before $ count;
format before after $opin.;
datalines;
n n 32
n p 30
p n 15
p p 23
;

proc freq data=ex7_3c;
title "McNemar's Test for Paired Samples";
tables before*after/agree;
weight count;
run;
McNemar’s test: R
#Example 7.3
>ex7.3=read.table("D:/ST2137/ex7_3.txt",header=FALSE)
>names(ex7.3)=c("ID","Before","After")
>attach(ex7.3)
>mcnemar.test(table(ex7.3[,2:3]))
McNemar's Chi-squared test with continuity correction
data: table(ex7.3[, 2:3])
McNemar's chi-squared = 4.3556, df = 1, p-value = 0.03689
McNemar’s test for Frequency Counts: R
#Example 7.3c: Handling frequency counts
>ex7.3c=matrix(c(32,15,30,23),nrow=2,byrow=TRUE,
+dimnames=list("Before"=c("No","Yes"),"After"=c("No","Yes")))
>ex7.3c
      After
Before No Yes
   No  32  15
   Yes 30  23
>mcnemar.test(ex7.3c)
McNemar's Chi-squared test with continuity correction
data: ex7.3c
McNemar's chi-squared = 4.3556, df = 1, p-value = 0.03689
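
McNemar's statistic depends only on the two discordant cells, that is, the pairs whose answer changed between Before and After. The sketch below recomputes the continuity-corrected statistic by hand, assuming the counts in the Before/After table above; the variable names are illustrative.

```r
# Discordant pairs typed in from the Before/After table
n_no_yes <- 15   # Before = No,  After = Yes
n_yes_no <- 30   # Before = Yes, After = No

# Continuity-corrected McNemar statistic: (|b - c| - 1)^2 / (b + c)
x2 <- (abs(n_no_yes - n_yes_no) - 1)^2 / (n_no_yes + n_yes_no)

# Upper-tail probability of a chi-square variable with 1 degree of freedom
p <- pchisq(x2, df = 1, lower.tail = FALSE)

round(c(statistic = x2, p.value = p), 5)
```

The concordant cells (32 and 23) do not enter the calculation, which is why the test measures change rather than overall agreement.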
McNemar’s Test: SPSS
• “Analyze”→ “Descriptive Statistics” →“Crosstabs...”
• Move “Before” to the “Row” window and “After” to
“Column(s)” window.
• Click on “Statistics...” and choose “McNemar”
• “Continue”→“OK”
If frequency counts are available instead of the raw data, then we
can weight the data in the following way.
“Data” → “Weight Cases...”