Boris Mirkin
Computational Intelligence and Data Visualization
A unique course looking at data from inside rather than from outside, as is common in CS

Lecture 25/1/10

Current web site: www.dcs.bbk.ac.uk/~mirkin/advanced/

Literature: The closest are the files corrm.doc and summm.doc on the web site. They are parts of a draft textbook of mine and cover everything, including coursework.

Computational support: MatLab is on all PCs in Lab room 128; an elementary introduction is in the file Matlab.doc on the same web site. Any other computer language may be used as well.

Subjects for today:
III. Bivariate analysis: summarisation, correlation, visualization of TWO features
III.2 Nominal/Quantitative
- box-plot
- tabular (piece-wise) regression
- correlation ratio (determination coefficient)
III.3 Nominal/Nominal
- contingency table
- Quetelet index
- Chi-squared coefficient
III.4 Quantitative/Nominal
- k-NN rule
- reject option

III.2 Nominal/Quantitative

In Student data: Occupation is categorical (nominal); Age and OOProg are quantitative (continuous-valued).

III.2.1 Box-plot

[Figure: Within-category distributions (real). Age (vertical axis, 20 to 51) over Occupation (IT, BA, AN).]

[Figure: Within-category distributions (ideal). No variance within categories: Occupation would predict Age exactly. Age (20 to 51) over Occupation (IT, BA, AN).]

[Figure: Within-category distributions (anti-ideal). Full variance within categories: a would-be case when knowledge of Occupation gives no information about Age. Age (20 to 51) over Occupation (IT, BA, AN).]

[Figure: Box-plot of Age (20 to 51) over Occupation (IT, BA, AN).]

III.2.2. Piece-wise, or tabular, regression (within-category averages and standard deviations). Its fit to the data: within-group variance

Tabular (piece-wise) regression of y over x is a table: Category of x / Proportion / Mean / Std of y.

Tabular regression Occupation/Age
Occupation   IT     BA     AN     Total
#            35     34     31     100
Age Mean     28.2   39.3   33.7   33.7
Age StD      5.6    7.3    8.7    8.5

Tabular regression Occupation/OOProg
Occupation   IT     BA     AN     Total
#            35     34     31     100
OOP Mean     76.1   56.7   50.7   61.6
OOP StD      12.9   12.3   12.4   16.5

In which table is the correlation greater?

Correlation ratio (determination coefficient for piece-wise/tabular regression)
- Average within-category variance: $\sigma_w^2 = \sum_k p_k \sigma_k^2$ (k is a category, $p_k$ its proportion, $\sigma_k^2$ the variance of y within k)
- The correlation ratio shows the relative drop of the variance of y when x becomes known, from $\sigma^2$ to $\sigma_w^2$: $\eta^2 = 1 - \sigma_w^2/\sigma^2$

Properties:
- The range is between 0 and 1
- Correlation ratio is 1 when all $\sigma_k^2$ are zero (y is constant within groups)
- Correlation ratio is 0 when all $\sigma_k^2$ are of the order of $\sigma^2$

                    Occupation/Age   Occupation/OOProg
Correlation ratio   28.1%            42.3%

The second table shows a greater correlation.

III.3 Contingency (flow) data

Two sets of categories and their co-occurrence counts.

Table 1. Protocol/Attack contingency table for Intrusion data of 100 attacks (in file corrm.doc, p. 19-20).
Category   Tcp   Udp   Icmp   Total
apache     23    0     0      23
saint      11    0     0      11
surf       0     0     10     10
norm       30    26    0      56
Total      64    26    10     100

This is a useful tool for assessing correlation between two category sets. According to the data, Udp implies norm and Icmp implies the surf attack, because of the many zeros.

More subtle example: Partition the Market town set into four classes according to the number of Banks and Building Societies (Ba): Ba ≥ 10, 10 > Ba ≥ 4, 4 > Ba ≥ 2, Ba = 0/1. Cross classify this with Farmers Market (Yes/No).

Cross classification of the Ba-related partition with FM (number of banks/building societies)
FMarket   10+   4+   2+   1-   Total
Yes       2     5    1    1    9
No        4     7    13   12   36
Total     6     12   14   13   45

Normalise:

BA/FM cross classification frequencies, per cent
FMarket   10+     4+      2+      1-      Total
Yes       4.44    11.11   2.22    2.22    20
No        8.89    15.56   28.89   26.67   80
Total     13.33   26.67   31.11   28.89   100

Any pattern? Let us remove the small within-row numbers to make more zeros.

BA/FM cross classification: cleaned to sharpen (number of banks/building societies)
FMarket   10+   4+   2+   1-   Total
Yes       2     5    0    0    7
No        0     0    13   12   25
Total     2     5    13   12   32

Doctoring data? This borders on forgery. Try facts.

Quetelet index (1832): association between categories k and l

From the BA/FM frequency table (per cent):
P(Ba=10+ & FM=Yes) = 4.44% (joint probability/rate)
P(Ba=10+ / FM=Yes) = 0.0444/0.20 = 0.222 = 22.2% (conditional probability/rate)

Is this high? Low? Quetelet A. (1832): compare with P(Ba=10+), the average rate!
q(Ba=10+ / FM=Yes) = [P(Ba=10+ / FM=Yes) - P(Ba=10+)] / P(Ba=10+) = [0.2222 - 0.1333] / 0.1333 = 0.6667 = 66.7%
Condition FM=Yes raises P(Ba=10+) by 66.7%.

BA/FM cross classification: Quetelet coefficients, % (positive values indicate an increase)
FMarket   10+      4+       2+       1-
Yes       66.67    108.33   -64.29   -61.54
No        -16.67   -27.08   16.07    15.38

Formulation: Contingency (flow data) table (disjoint categories only)
Two sets of disjoint categories: l = 1,…,L and k = 1,…,K.
$N_{kl}$ are the (k,l) co-occurrence counts; they sum to N.
Contingency table: counts $N_{kl}$ or frequencies $p_{kl} = N_{kl}/N$.
TOTALS: Within-row sums $N_{k+} = \sum_l N_{kl}$ and within-column sums $N_{+l} = \sum_k N_{kl}$ (and their frequency counterparts) are referred to as marginals, because they are located on the margins.
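The Quetelet coefficients of the BA/FM example can be recomputed directly from the counts. The following is a minimal sketch in Python (any language would do, as noted at the start of the lecture); the function name quetelet_relative and the variable names are assumptions made for this illustration, not part of the course material.

```python
# Counts from the BA/FM cross classification.
# Rows: Farmers Market (Yes, No); columns: Ba classes (10+, 4+, 2+, 1-).
counts = [
    [2, 5, 1, 1],
    [4, 7, 13, 12],
]


def quetelet_relative(counts):
    """Return q[k][m] = p_km / (p_k+ * p_+m) - 1 for a table of counts.

    In terms of counts this equals N_km * N / (N_k+ * N_+m) - 1.
    """
    n = sum(sum(row) for row in counts)           # grand total N
    row_tot = [sum(row) for row in counts]        # marginals N_k+
    col_tot = [sum(col) for col in zip(*counts)]  # marginals N_+m
    return [
        [n_km * n / (row_tot[k] * col_tot[m]) - 1 for m, n_km in enumerate(row)]
        for k, row in enumerate(counts)
    ]


q = quetelet_relative(counts)
for label, row in zip(["Yes", "No"], q):
    # Print the coefficients in per cent, two decimals.
    print(label, [round(100 * v, 2) for v in row])
```

For the Yes row this yields 66.67, 108.33, -64.29 and -61.54 per cent for the classes 10+, 4+, 2+ and 1-, and for the No row -16.67, -27.08, 16.07 and 15.38 per cent.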
Quetelet (relative) index: general definition, with k indexing the rows and m the columns of the contingency table:
$q(m/k) = \frac{P(m/k) - P(m)}{P(m)} = \frac{P(m \& k)/P(k) - P(m)}{P(m)} = \frac{P(m \& k)}{P(k) P(m)} - 1 = \frac{p_{km}}{p_{k+} p_{+m}} - 1$

Quetelet (absolute) index: general definition
$a(m/k) = P(m/k) - P(m) = \frac{P(m \& k)}{P(k)} - P(m) = \frac{p_{km}}{p_{k+}} - p_{+m}$

Either of these can be considered a standardization of the data. Newman (2007) reports good results in clustering networks with a symmetric standardization
$s(m,k) = p_{km} - p_{k+} p_{+m}$

Summary Quetelet association index: sum the $q_{km}$ up with weights proportional to frequencies:
$Q = \sum_{k,m} p_{km} q_{km} = \sum_{k,m} \frac{p_{km}^2}{p_{k+} p_{+m}} - 1$

Q is proportional to the celebrated Pearson chi-squared association coefficient, the deviation from statistical independence (K. Pearson, 1901):
$X^2 = N \sum_{k,m} \frac{(p_{km} - p_{k+} p_{+m})^2}{p_{k+} p_{+m}} = NQ$

Statistical independence: $p_{km} = p_{k+} p_{+m}$ for all k and m.

Chi-squared divided by N is a major index of association in contingency tables.

Good properties: in the case K ≤ L (L being the number of columns m), it ranges between
- 0, if there is statistical independence at all entries, that is, all $q_{km} = 0$, and
- K - 1, if each column m contains only one non-zero entry $p_{k(m)m}$, which is thus equal to $p_{+m}$; this can be interpreted as a logical implication m → k(m).

The Quetelet decomposition of $X^2$ gives a data mining perspective on this coefficient: presenting $NQ = X^2$ as the sum of the items $N p_{km} q_{km}$ (Mirkin, 2001) allows for visualization of the association components.

MatLab: ChiSq=N*sum(sum(P.*Q)); % P.*Q is a matrix, so the sum is taken over both dimensions

That is, the entries of N*P.*Q are the contributions of the pairs of categories to the chi-squared.

BA/FM chi-squared (NQ = 6.86) and its cross decomposition
FMarket   10+     4+      2+      1-      Total
Yes       1.33    5.41    -0.64   -0.62   5.49
No        -0.67   -1.90   2.09    1.85    1.37
Total     0.67    3.51    1.45    1.23    6.86

Bear in mind that the conventional decomposition of chi-squared is based on the conventional formula, so that the contribution of the (k,m) pair is taken to be $N (p_{km} - p_{k+} p_{+m})^2 / (p_{k+} p_{+m})$ or, heuristically, its square root, so as to bear the sign of the relation. With Quetelet coefficients the sign comes naturally.

A real-life example of categorical data

Ceefax 31/10/2007: Stop’n’Search bias. In 2005/6 there were 878,153 stops and searches in England and Wales: Blacks 7 times and Asians 2 times more frequently than Whites.

What does that mean? There are three races mentioned, B/A/W, and the proportions of “S’n’S” are 7/2/1? That would lead to this distribution:

Hypothetical S’n’S frequency:
        Abs       Per cent
B       614,707   70
A       175,631   20
W       87,815    10
Total   878,153   100

Example continued: NO. Our guess is wrong. The Ceefax data refer to frequencies relative to the race distribution in the population (see http://news.bbc.co.uk/1/hi/uk/7069791.stm of 29/10/07).

Real S’n’S frequency:
        Hypo      Real      Per cent
B       614,707   131,723   15
A       175,631   70,252    8
W       87,815    676,178   77
Total   878,153   878,153   100

Whites are an overwhelming majority among those stopped, but their proportion in the overall population is even greater.

Race in England and Wales (2001 census):
           B           A           W            Total
Pop        1,509,216   3,018,431   47,514,269   52,041,916
Per cent   2.9         5.8         91.3         100

2D: Contingency table / Cross-classification

Example continued: Joint distribution of S’n’S_Race and Pop_Race
Entity set? Population
Two features? Race (B, A, W) and Search (Yes, No)

        S’n’S     Not S’n’S    Total
Black   131,723   1,377,493    1,509,216
Asian   70,252    2,948,179    3,018,431
White   676,178   46,838,091   47,514,269
Total   878,153   51,163,763   52,041,916
Table 2: Cross-classification of Race and S’n’S in E/W (absolute values)

        S’n’S      Not S’n’S    Total
Black   0.25/8.7   2.65/91.3    2.9
Asian   0.14/2.3   5.66/97.7    5.8
White   1.30/1.4   90.00/98.6   91.3
Total   1.69       98.31        100.0
Table 3: Cross-classification of Race and S’n’S in E/W (relative values: per cent of total / per cent of race category)

A bias indeed. Is it racist? I am not sure, because the S’n’S race distribution mimics the race distribution of prison inmates. Whether the latter reflects any bias, I do not know.
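As a point of reference for the task that follows, the Quetelet decomposition of chi-squared for the BA/FM table given earlier can be sketched in Python. This is an illustrative sketch only: the function names chi2_quetelet_decomposition and chi2_conventional are assumptions introduced here, and the second function is included just to check that the two formulas agree on the total.

```python
# Counts from the BA/FM cross classification.
# Rows: Farmers Market (Yes, No); columns: Ba classes (10+, 4+, 2+, 1-).
counts = [
    [2, 5, 1, 1],
    [4, 7, 13, 12],
]


def marginals(counts):
    """Return the grand total N, the row sums N_k+ and the column sums N_+m."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    return n, row_tot, col_tot


def chi2_quetelet_decomposition(counts):
    """Return (items, total), where items[k][m] = N * p_km * q_km = N_km * q_km."""
    n, row_tot, col_tot = marginals(counts)
    items = [
        [n_km * (n_km * n / (row_tot[k] * col_tot[m]) - 1)
         for m, n_km in enumerate(row)]
        for k, row in enumerate(counts)
    ]
    return items, sum(sum(row) for row in items)


def chi2_conventional(counts):
    """Pearson chi-squared as the sum of (observed - expected)^2 / expected."""
    n, row_tot, col_tot = marginals(counts)
    total = 0.0
    for k, row in enumerate(counts):
        for m, n_km in enumerate(row):
            expected = row_tot[k] * col_tot[m] / n
            total += (n_km - expected) ** 2 / expected
    return total


items, chi2 = chi2_quetelet_decomposition(counts)
for label, row in zip(["Yes", "No"], items):
    print(label, [round(v, 2) for v in row], round(sum(row), 2))
print("chi-squared:", round(chi2, 2), "conventional:", round(chi2_conventional(counts), 2))
```

Both formulas give the same total, 6.86 for this table, but only the Quetelet items carry a sign: negative items mark pairs of categories that co-occur less often than under independence.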
Task: Find a similar decomposition of chi-squared for the Race/S’n’S example; for OOPmarks/Occupation in Student data. (Hint: First, develop OOPmarks categories.)
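The hint can be read as follows: a quantitative feature has to be cut into intervals before a contingency table can be built. Below is a minimal Python sketch of that step only. The thresholds 50 and 70, the six marks and the function names are all hypothetical, chosen purely for illustration; they are not taken from the Student data.

```python
from bisect import bisect_right


def categorise(value, thresholds):
    """Return the index of the interval into which value falls.

    With thresholds [50, 70] the categories are: 0 for value < 50,
    1 for 50 <= value < 70, 2 for value >= 70.
    """
    return bisect_right(thresholds, value)


def contingency(row_labels, col_indices, row_order, n_cols):
    """Count co-occurrences of a nominal feature and a categorised feature."""
    table = [[0] * n_cols for _ in row_order]
    for label, col in zip(row_labels, col_indices):
        table[row_order.index(label)][col] += 1
    return table


# Hypothetical toy records (NOT the Student data): occupation and a mark.
occupations = ["IT", "BA", "AN", "IT", "BA", "AN"]
marks = [82, 55, 48, 71, 64, 39]
thresholds = [50, 70]  # hypothetical cut points, an assumption of this sketch

mark_classes = [categorise(v, thresholds) for v in marks]
table = contingency(occupations, mark_classes, ["IT", "BA", "AN"], len(thresholds) + 1)
for label, row in zip(["IT", "BA", "AN"], table):
    print(label, row)
```

The resulting table of counts can then be treated like any other contingency table. How many intervals to use and where to cut is a modelling choice left to the task.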