Limn, to present an image or lifelike imitation of;
Limit-n, to explore any data set to its limits

Di Cook
Statistics, Iowa State University
Joint with Peter Sutherland, Manuel Suarez, Vasant Honavar, Doina Caragea, Les Miller

Outline
    What is the Role of Viz in Nonparametrics? (especially Classification)
    Movie Technology for Large Data
    Case studies
        FBI Bullet Data: feed-forward back-propagation neural network. (Course project of Jennifer Schumi.)
        Crabs data: CART tree, Support Vector Machines.

FBI Forensic Bullet Data
    4 manufacturers, 50 bullets from each, 5 trace elements measured on each: Copper, Arsenic, Bismuth, Silver, Antimony.
    Can we recognize the manufacturer based on the concentrations of trace elements?
    Note: With graphics it is easy to build a classification rule for all 4 classes, but with CART, LDA and feed-forward neural networks it was not possible to perfectly classify all 4 groups. There is confusion between two groups.
    Data provided to Hal Stern and Alicia Carriquiry by the FBI. This analysis was conducted by Jennifer Schumi as a class project.

Feed-forward neural networks (Ripley's S code)

Where's the boundary?
    [Figure: plots of the bullet data; axes and labels include Copper, Arsenic, Bismuth, Silver and Antimony.]

Case Study: Australian Leptograpsus Crabs
    50 specimens of each of 4 classes (species Orange or Blue, by sex M or F).
    Preserved specimens lose their color, so classify on morphological measurements.
    Source: Campbell and Mahon, 1974.

CART
    Classification tree:
        tree(formula = factor(crabsg) ~ ., data = data.frame(crabsd))
    Number of terminal nodes: 28
    Residual mean deviance: 0.5903 = 101.5 / 172
    Misclassification error rate: 0.125 = 25 / 200
    [Figure: the fitted classification tree, with splits on FL, RW, CW, CL and BD; the first split listed is FL<17.45.]

Graphics
    [Figure: plots of the standardized measurements (FL-m)/s, (RW-m)/s, (CL-m)/s, (CW-m)/s and (BD-m)/s, with axes labeled FL, RW, CL, CW and BD.]

Graphics
    [Figure: scatterplots of the raw measurements; labeled axes include FL, RW, CL and CW.]

Graphics
    Strong correlation between variables.
    Strong separation between species.
    Some confusion in sexes of smaller crabs.
    Males are separated from females in plots of CL vs RW (males have a higher CL:RW ratio), BD vs RW, and CW vs RW.
    The two species can be separated in the plots of FL vs CW, and BD vs CW.

Support Vector Machines
    Method      Species   Sexes   Error Rate
    NN (SVM)    0         2       0.01
    [Figure: plot with axes labeled RW, CW, FL and BD.]

Large amounts of data?
    Reduce data or scale up methods? We want to look at the full resolution!
    Scaling up tour methods: the tour algorithm is linear in n, so we can exploit technological advances.

Using Video Technology
    Two steps:
    1. Create an animation sequence, and save it as a QuickTime movie or as a series of JPEG images.
    2. View the animation, and interact with it by brushing small subsets, and highlighting the brushed points with overlays on the images.

Approaches to Generating the Animations
    Little Tour: interpolation path between all pairs of variables.
    Grand Tour: interpolation between random orthonormal bases.
    From File: interpolation path between bases read from file; takes advantage of work by Neil Sloane on fixed size tours.
    PP Guidance: generates indices of interest for each projection, which could be used later to subset the tour space to interesting subspaces.

Example: Tropical Atmosphere Ocean Array of Buoys
    Monitoring El Niño, 70 buoys in Pacific, 5 variables, time (1980-1994) and space components.

Example
    Tour movie of the entire data set: 178080 points from 1980-1998, on variables zonal winds, meridional winds, humidity, air temperature and sea surface temperature.
    Overlays: Dec 1993 (normal), Dec 1997 (El Niño).

Discussion
    Overlays on video: binned resolutions, intelligently selected subsets, models.
    How do we link between multiple views in real time: create indexing when plots are constructed?
    How do we visually represent weighted data? Glyph size, color?

Summary
    Visualization can help understand and simplify results of non-parametric methods.
    Video technology provides some potential for scaling methods up.
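
Sketch: generating tour frames
    The Little Tour and Grand Tour approaches listed earlier can be illustrated with a minimal Python sketch. All function names here (orthonormalize, random_basis, little_tour_bases, interpolate, project) are hypothetical and are not taken from the software behind these slides. The interpolation is a simplification: a linear blend of two bases followed by Gram-Schmidt re-orthonormalization, which stands in for whatever interpolation the original tour code uses. Each frame costs one pass over the points, which is consistent with the tour being linear in n.

```python
import math
import random


def orthonormalize(vectors):
    """Gram-Schmidt: turn d vectors (each of length p) into an orthonormal frame."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


def random_basis(p, d=2, rng=random):
    """Random d-dimensional orthonormal frame in p dimensions (a grand tour target)."""
    return orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(d)])


def little_tour_bases(p):
    """Axis-aligned 2-D frames, one per pair of variables (little tour targets)."""
    bases = []
    for i in range(p):
        for j in range(i + 1, p):
            e_i = [1.0 if k == i else 0.0 for k in range(p)]
            e_j = [1.0 if k == j else 0.0 for k in range(p)]
            bases.append([e_i, e_j])
    return bases


def interpolate(start, end, steps):
    """Frames from start to end: linear blend, then re-orthonormalize (an assumption)."""
    frames = []
    for s in range(steps + 1):
        t = s / steps
        blend = [[(1 - t) * a + t * b for a, b in zip(u, v)]
                 for u, v in zip(start, end)]
        frames.append(orthonormalize(blend))
    return frames


def project(points, frame):
    """Project each p-dimensional point onto the frame; one pass over the n points."""
    return [[sum(x * b for x, b in zip(pt, axis)) for axis in frame] for pt in points]


# Usage: 100 synthetic 5-variable points, one grand tour segment of 11 frames.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
frames = interpolate(random_basis(5, rng=rng), random_basis(5, rng=rng), steps=10)
movie = [project(points, f) for f in frames]
```

    Each entry of movie is one frame of 2-D coordinates that could be rendered and saved as an image in the animation sequence. A little tour would instead chain interpolate over consecutive entries of little_tour_bases(5).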