2D: Contingency table / Cross-classification

Boris Mirkin
Computational Intelligence and Data Visualization
A unique course looking at data from inside rather than
from outside as is common to CS
Lecture 25/1/10
Current web site:
www.dcs.bbk.ac.uk/~mirkin/advanced/
Literature:
The closest:
Files corrm.doc and summm.doc on the web site are parts of
a textbook draft of mine; they cover everything, including coursework.
Computational support:
MatLab is on all PCs in Lab room 128; an elementary
introduction is in file Matlab.doc, same web site.
Any other computer language may be used as well.
Subjects for today:
III. Bivariate analysis: summarisation,
correlation, visualization
TWO features
 III.2 Nominal/Quantitative
- box-plot
- tabular (piece-wise) regression
- correlation ratio (determination coefficient)
 III.3. Nominal/Nominal
- contingency table
- Quetelet index
- Chi-squared coefficient
 III.4. Quantitative/Nominal
- k-NN rule
- reject option
 III.2 Nominal/Quantitative
- box-plot
- tabular (piece-wise) regression
- correlation ratio (determination coefficient)
In Student data:
Occupation is categorical (nominal),
Age and OOProg are quantitative (continuous-valued)
III.2.1 Box-plot: Within-category distributions (real)
[Figure: box-plots of Age (axis from 20 to 51) within the Occupation categories IT, BA, AN]
Within-category distributions (ideal: no variance within categories; a would-be exact prediction Occupation → Age)
[Figure: Age (axis from 20 to 51) within the Occupation categories IT, BA, AN]
Within-category distributions (anti-ideal: full variance within categories; a would-be case when knowledge of Occupation gives no information on Age)
[Figure: Age (axis from 20 to 51) within the Occupation categories IT, BA, AN]
III.2.2. Piece-wise, or tabular, regression
(within-category averages and standard deviations).
Its fit to the data: within-group variance.
[Figure: box-plot of Age (axis from 20 to 51) by Occupation (IT, BA, AN)]
Tabular (piece-wise) regression of y over x:
A table: Category of x / Proportion/Mean/Std
of y
Tabular regression Occupation/Age
Occupation     #   Age Mean   Age StD
IT            35       28.2       5.6
BA            34       39.3       7.3
AN            31       33.7       8.7
Total        100       33.7       8.5
Tabular regression Occupation/OOProg
Occupation     #   OOP Mean   OOP StD
IT            35       76.1      12.9
BA            34       56.7      12.3
AN            31       50.7      12.4
Total        100       61.6      16.5
In which table is the correlation greater?
Correlation ratio (determination coefficient for piece-wise/tabular regression)
Average within-category variance:
$\sigma_w^2 = \sum_k p_k \sigma_k^2$
($k$ is a category, $p_k$ its proportion, $\sigma_k^2$ the variance of y within it)
Correlation ratio shows the relative drop of the variance of y when x becomes known, from $\sigma^2$ to $\sigma_w^2$:
$\eta^2 = 1 - \sigma_w^2/\sigma^2$
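Properties:
- The range is between 0 and 1
- Correlation ratio is 1 when all $\sigma_k^2$ are zero (y is constant within groups)
- Correlation ratio is 0 when the average within-category variance $\sigma_w^2$ equals $\sigma^2$ (all $\sigma_k^2$ are of the order of $\sigma^2$)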
Tabular regression   Occupation/Age   Occupation/OOProg
Correlation ratio             28.1%               42.3%
The second table shows a greater correlation.
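As a rough check, the correlation ratio can be recomputed from the two tabular regressions. The sketch below is in Python (the course allows any language); the function name `correlation_ratio` is made up for illustration. Because it starts from the rounded means and standard deviations printed in the tables, it gives roughly 27% and 42%, close to but not exactly the 28.1% and 42.3% quoted.

```python
def correlation_ratio(counts, stds, total_std):
    """Return eta^2 = 1 - sigma_w^2 / sigma^2 for one tabular regression."""
    n = sum(counts)
    # Average within-category variance, weighted by the proportions p_k.
    within = sum((c / n) * s ** 2 for c, s in zip(counts, stds))
    return 1.0 - within / total_std ** 2

counts = [35, 34, 31]  # IT, BA, AN
eta2_age = correlation_ratio(counts, [5.6, 7.3, 8.7], 8.5)
eta2_oop = correlation_ratio(counts, [12.9, 12.3, 12.4], 16.5)

print(f"Occupation/Age:    {100 * eta2_age:.1f}%")
print(f"Occupation/OOProg: {100 * eta2_oop:.1f}%")
```

Either way, the OOProg table comes out with the larger ratio.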
III.3 Contingency (flow) data
Two sets of categories and their co-occurrence counts
Table 1. Protocol/Attack contingency table for Intrusion
data of 100 attacks (in file corrm.doc, p. 19-20).
Category   Tcp   Udp   Icmp   Total
apache      23     0      0      23
saint       11     0      0      11
surf         0     0     10      10
norm        30    26      0      56
Total       64    26     10     100
The contingency table is a useful tool for assessing correlation between two category sets. According to the data, because of the many zeros:
Udp → norm
Icmp → surf attack
More subtle example:
Partition the Market town set into four classes according
to the number of Banks and Building Societies (Ba):
Ba ≥ 10, 10 > Ba ≥ 4, 4 > Ba ≥ 2, Ba = 0/1.
Cross-classify this with Farmers Market (Yes/No).
Cross-classification of the Ba-related partition with FM
(columns: number of banks/build. societies)
FMarket   10+    4+    2+    1-   Total
Yes         2     5     1     1       9
No          4     7    13    12      36
Total       6    12    14    13      45
Normalise:
BA/FM cross-classification frequencies, per cent
FMarket     10+      4+      2+      1-   Total
Yes        4.44   11.11    2.22    2.22      20
No         8.89   15.56   28.89   26.67      80
Total     13.33   26.67   31.11   28.89     100
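The normalisation step is just a division of every count and every marginal by the grand total N = 45. A minimal Python sketch, with variable names chosen here for illustration only:

```python
# BA/FM counts; columns are the Ba classes 10+, 4+, 2+, 1-.
counts = {"Yes": [2, 5, 1, 1], "No": [4, 7, 13, 12]}

n = sum(sum(row) for row in counts.values())  # grand total of towns

# Cell frequencies and marginals, in per cent of the grand total.
percent = {fm: [round(100 * c / n, 2) for c in row] for fm, row in counts.items()}
row_totals = {fm: round(100 * sum(row) / n, 2) for fm, row in counts.items()}
col_totals = [round(100 * sum(col) / n, 2) for col in zip(*counts.values())]

for fm in counts:
    print(fm, percent[fm], row_totals[fm])
print("Total", col_totals)
```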
Any pattern? Let us remove the small within-row
counts to make more zeros.
BA/FM cross-classification: cleaned to sharpen
(columns: number of banks/build. societies)
FMarket   10+    4+    2+    1-   Total
Yes         2     5     0     0       7
No          0     0    13    12      25
Total       2     5    13    12      32
Doctoring data? This borders on forgery. Try facts.
Quetelet index (1832):
association between categories k and l
BA/FM cross-classification frequencies, per cent
FMarket     10+      4+      2+      1-   Total
Yes        4.44   11.11    2.22    2.22      20
No         8.89   15.56   28.89   26.67      80
Total     13.33   26.67   31.11   28.89     100
P(Ba=10+ & FM=Yes)=4.44% (joint probability/rate)
P(Ba=10+ / FM=Yes) =0.0444/0.20=0.222=22.2%
(conditional probability/rate)
Is this high? Low?
Quetelet A. (1832): compare with P(Ba=10+) (average rate)!
q(Ba=10+/ FM=Yes) =
[P(Ba=10+/FM=Yes) - P(Ba=10+)] / P(Ba=10+) =
[0.2222 – 0.1333] / 0.1333= 0.6667 = 66.7%
Condition FM=Yes raises P(Ba=10+) by 66.7%
BA/FM cross-classification Quetelet coefficients [1], %
FMarket      10+       4+       2+       1-
Yes        66.67   108.33   -64.29   -61.54
No        -16.67   -27.08    16.07    15.38
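The whole table of Quetelet coefficients can be recomputed from the BA/FM counts. This is a small Python sketch; the helper name `quetelet` is an assumption made here, and the code assumes no row or column total is zero.

```python
def quetelet(counts):
    """Relative Quetelet coefficients p_km / (p_k+ * p_+m) - 1 for a table of counts."""
    n = sum(map(sum, counts))
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    # With counts instead of frequencies, p_km / (p_k+ p_+m) equals N * N_km / (N_k+ N_+m).
    return [[n * c / (r * s) - 1 for c, s in zip(row, col_sums)]
            for row, r in zip(counts, row_sums)]

counts = [[2, 5, 1, 1],     # FMarket = Yes; columns 10+, 4+, 2+, 1-
          [4, 7, 13, 12]]   # FMarket = No
q = quetelet(counts)

for label, row in zip(("Yes", "No"), q):
    print(label, " ".join(f"{100 * v:8.2f}" for v in row))
```

The first entry reproduces the hand computation: the condition FM=Yes raises the rate of Ba=10+ by about two thirds.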
Formulation: Contingency (flow data) table
(disjoint categories only)
Two sets of disjoint categories: $l = 1, \dots, L$ and $k = 1, \dots, K$.
$N_{kl}$ are $(k,l)$ co-occurrence counts; they sum up to $N$.
Contingency table:
counts $N_{kl}$ or frequencies $p_{kl} = N_{kl}/N$.
TOTALS: Within-row sums $N_{k+} = \sum_l N_{kl}$ and within-column
sums $N_{+l} = \sum_k N_{kl}$ (and their frequency counterparts) are referred
to as marginals, being located on the margins.
[1] Bold font for the increase.
Quetelet (relative) index: general definition
$q(m/k) = [P(m/k) - P(m)]/P(m)$
(little algebra)
$= [P(m \& k)/P(k) - P(m)]/P(m) = P(m \& k)/[P(k) P(m)] - 1$
$= p_{mk}/(p_{m+} p_{+k}) - 1$
Quetelet (absolute) index: general definition
$a(m/k) = P(m/k) - P(m)$
(little algebra)
$= P(m \& k)/P(k) - P(m)$
$= p_{mk}/p_{+k} - p_{m+}$
Either of these can be considered a standardization of
the data.
Newman (2007) reports good results in
clustering networks with a symmetric standardization
$s(m, k) = p_{mk} - p_{+k} p_{m+}$
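The three standardizations differ only by scaling, which a single cell makes easy to see. The Python sketch below uses the BA/FM cell m = (Ba=10+), k = (FM=Yes), with 2 towns out of 45 in the cell, 9 with a Farmers Market and 6 with ten or more banks; the variable names are illustrative.

```python
# Joint and marginal frequencies for the cell (Ba=10+, FM=Yes).
p_mk = 2 / 45   # P(m & k)
p_k = 9 / 45    # P(k)
p_m = 6 / 45    # P(m)

q = p_mk / (p_k * p_m) - 1   # relative Quetelet index
a = p_mk / p_k - p_m         # absolute Quetelet index
s = p_mk - p_k * p_m         # symmetric standardization

print(f"q = {q:.4f}, a = {a:.4f}, s = {s:.4f}")
# The same deviation s, rescaled: a = s / P(k) and q = s / (P(k) P(m)).
print(abs(a - s / p_k) < 1e-12, abs(q - s / (p_k * p_m)) < 1e-12)
```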
Summary Quetelet association index: Sum them up
$Q = \sum_{k,m} p_{km} q_{km} = \sum_{k,m} p_{km}^2/(p_{k+} p_{+m}) - 1$
with weights proportional to frequencies.
Q is proportional to the celebrated Pearson chi-squared
association coefficient, which measures deviation from statistical
independence (K. Pearson, 1901):
$X^2 = N \sum_{k,m} (p_{km} - p_{k+} p_{+m})^2/(p_{k+} p_{+m}) = NQ$
Statistical independence: for all k and m
$p_{km} = p_{k+} p_{+m}$
Chi-squared divided by N is a major index of association in
contingency tables. Good properties:
In the case K ≤ L, it ranges between:
0, if there is statistical independence at all entries, that is,
all $q_{km} = 0$, and
K - 1, if each column m contains only one non-zero
entry $p_{k(m)m}$, which is thus equal to $p_{+m}$; this can be
interpreted as a logical implication m → k(m)
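Both ends of this range can be checked on small tables. In the Python sketch below, the function name `summary_quetelet` and the two count tables are made up for illustration: `independent` has proportional rows, and `implication` has exactly one non-zero entry in each column, with K = 2 rows and L = 3 columns.

```python
def summary_quetelet(counts):
    """Q = sum over cells of p_km^2 / (p_k+ * p_+m), minus 1 (chi-squared divided by N)."""
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    # p_km^2 / (p_k+ p_+m) equals N_km^2 / (N_k+ N_+m), so N cancels out.
    return sum(c * c / (r * s)
               for row, r in zip(counts, row_sums)
               for c, s in zip(row, col_sums)) - 1

independent = [[2, 4, 6], [3, 6, 9]]   # every cell equals row total * column total / N
implication = [[5, 0, 7], [0, 8, 0]]   # each column determines its row

print(round(summary_quetelet(independent), 9))   # lower end of the range
print(round(summary_quetelet(implication), 9))   # upper end, K - 1 with K = 2
```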
Quetelet decomposition of $X^2$ gives a data mining
perspective to this coefficient: presenting $NQ = X^2$ as the
sum of $N p_{km} q_{km}$ (Mirkin, 2001) allows for
visualization of the association components.
MatLab:
% P holds the frequencies p_km, Q the Quetelet coefficients q_km; sum() of a matrix works column-wise, hence the double sum
ChiSq=N*sum(sum(P.*Q));
That is, entries in N*P.*Q are contributions of pairs of the
categories to the chi-squared.
BA/FM chi-squared (NQ = 6.86) and its cross decomposition [2]
FMarket     10+      4+      2+      1-   Total
Yes        1.33    5.41   -0.64   -0.62    5.49
No        -0.67   -1.90    2.09    1.85    1.37
Total      0.67    3.51    1.45    1.23    6.86
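The decomposition can be reproduced from the raw BA/FM counts. The following is a Python sketch with illustrative variable names; it relies on the contribution of a cell, $N p_{km} q_{km}$, being the count $N_{km}$ times its Quetelet coefficient.

```python
counts = [[2, 5, 1, 1],     # FMarket = Yes; columns 10+, 4+, 2+, 1-
          [4, 7, 13, 12]]   # FMarket = No
n = sum(map(sum, counts))
row_sums = [sum(row) for row in counts]
col_sums = [sum(col) for col in zip(*counts)]

# Contribution of cell (k, m): N * p_km * q_km = N_km * q_km.
contrib = [[c * (n * c / (r * s) - 1) for c, s in zip(row, col_sums)]
           for row, r in zip(counts, row_sums)]
chi2 = sum(map(sum, contrib))

for label, row in zip(("Yes", "No"), contrib):
    print(label, " ".join(f"{v:6.2f}" for v in row))
print(f"chi-squared = {chi2:.2f}")
```

Negative contributions mark pairs of categories that co-occur less often than under independence.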
Bear in mind that the conventional decomposition of
chi-squared is based on the conventional formula, so
that the contribution of the (k,m) pair is considered to be equal
to
$N (p_{mk} - p_{+k} p_{m+})^2/(p_{+k} p_{m+})$
or, heuristically, its square root, taken so as to bear
the sign of the relation, which comes naturally
with Quetelet coefficients.
[2] Bold font for the positive items; red for exceptional contributions.
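For comparison, the conventional contributions can be computed on the same BA/FM counts. In the Python sketch below, the expected count under independence is $N p_{k+} p_{+m}$, so the conventional term reduces to (observed - expected)^2 / expected. Reading the "square root" as the signed quantity (observed - expected) / sqrt(expected) is an assumption made here.

```python
import math

counts = [[2, 5, 1, 1],     # FMarket = Yes; columns 10+, 4+, 2+, 1-
          [4, 7, 13, 12]]   # FMarket = No
n = sum(map(sum, counts))
row_sums = [sum(row) for row in counts]
col_sums = [sum(col) for col in zip(*counts)]

# Expected counts under statistical independence.
expected = [[r * s / n for s in col_sums] for r in row_sums]

# Conventional contributions: always non-negative.
conventional = [[(c - e) ** 2 / e for c, e in zip(row, erow)]
                for row, erow in zip(counts, expected)]
# Signed square roots: carry the direction of the deviation.
signed_root = [[(c - e) / math.sqrt(e) for c, e in zip(row, erow)]
               for row, erow in zip(counts, expected)]

chi2 = sum(map(sum, conventional))
print(f"chi-squared = {chi2:.2f}")
```

The total is the same chi-squared, but the individual terms lose the sign that the Quetelet-based contributions keep.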
A real-life example of categorical data
Ceefax 31/10/2007: Stop’n’Search bias: In 2005/6
there were 878,153 stops and searches in England and
Wales. Blacks were stopped and searched 7 times, and Asians 2 times, more
frequently than Whites.
What does that mean? There are three races
mentioned, B/A/W, and proportions of “S’n’S” are
7/2/1? That would lead to this distribution:
Hypothetical S’n’S frequency:
Race        Abs   Per cent
B       614,707         70
A       175,631         20
W        87,815         10
Total   878,153        100
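The hypothetical table follows from splitting the 878,153 stops in the proportion 7 : 2 : 1. A short Python sketch of that arithmetic, with illustrative names:

```python
total_stops = 878_153
weights = {"B": 7, "A": 2, "W": 1}   # the naive reading of "7 times" and "2 times"
weight_sum = sum(weights.values())

# Split the total in proportion to the weights, rounding to whole stops.
hypothetical = {race: round(total_stops * w / weight_sum) for race, w in weights.items()}
shares = {race: 100 * w / weight_sum for race, w in weights.items()}

for race in weights:
    print(race, f"{hypothetical[race]:,}", f"{shares[race]:.0f}%")
```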
Example continued: NO. Our guess is wrong. The
Ceefax data refer to the frequencies relative to the race
distribution in the population (see
http://news.bbc.co.uk/1/hi/uk/7069791.stm of
29/10/07).
Real S’n’S frequency:
Race       Hypo      Real   Per cent
B       614,707   131,723         15
A       175,631    70,252          8
W        87,815   676,178         77
Total   878,153   878,153        100
Whites are an overwhelming majority – but their
proportion in the overall population is even greater:
Race in England and Wales (2001 census):
Race               B           A            W        Total
Pop        1,509,216   3,018,431   47,514,269   52,041,916
Per cent         2.9         5.8         91.3          100
2D: Contingency table /Cross-classification
Example continued: Joint distribution of S’n’S_Race
and Pop_Race
Entity set? Population
Two features? Race (B, A, W) and Search (Yes, No)
Race      S’n’S    Not S’n’S        Total
Black   131,723    1,377,493    1,509,216
Asian    70,252    2,948,179    3,018,431
White   676,178   46,838,091   47,514,269
Total   878,153   51,163,763   52,041,916
Table 1: Cross-classification of Race and S’n’S in E/W
(absolute values)
Race       S’n’S    Not S’n’S   Total
Black   0.25/8.7    2.65/91.3     2.9
Asian   0.14/2.3    5.66/97.7     5.8
White   1.30/1.4   90.00/98.6    91.3
Total       1.69        98.31   100.0
Table 1: Cross-classification of Race and S’n’S in E/W
(relative values: per cent of total / per cent of race category)
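The two kinds of percentages in the S’n’S column can be recomputed from the absolute counts. This Python sketch uses illustrative variable names; a value may differ from the table in the last digit because of rounding.

```python
searched = {"Black": 131_723, "Asian": 70_252, "White": 676_178}
population = {"Black": 1_509_216, "Asian": 3_018_431, "White": 47_514_269}
total_pop = sum(population.values())

# Per cent of the whole population (joint rate) and of the race category (conditional rate).
share_of_total = {r: 100 * searched[r] / total_pop for r in searched}
rate_within_race = {r: 100 * searched[r] / population[r] for r in searched}
overall_rate = 100 * sum(searched.values()) / total_pop

for r in searched:
    print(f"{r}: {share_of_total[r]:.2f} / {rate_within_race[r]:.1f}")
print(f"Total: {overall_rate:.2f}")
```

Comparing each conditional rate with the overall rate is the same Quetelet-style comparison used for the BA/FM table.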
A bias indeed. Is it racist? I am not sure, because
the S’n’S race distribution mimics the race
distribution of prison inmates. Whether the latter
reflects any bias, I do not know.
Task: Find a similar decomposition of chi-squared for
the Race/S’n’S example, and
for OOPmarks/Occupation in Student data. (Hint:
First, develop OOPmarks categories.)