Data, Data, Data! 1 Data, Data, Data! 2 $ ' ' $ Thanks to ... USNA Michelson Lecture 2001 • Office of Naval Research: – Reza Malek Madani – Wen Masters – Carey Schwartz • Will Glaser, Savage Beast • Jeffrey Klein, Xamplify Data, Data, Data! Challenges and Opportunities of the coming Data Deluge • John Lavery, Army Research Office • Tracy McSheery, PhaseSpace • Peter Weinberger, Renaissance Technologies David Donoho, Statistics Department, Stanford University • Shankar Sastry, UC Berkeley & Data, Data, Data! ' % & 3 % Data, Data, Data! 4 $ ' $ Agenda I. The Era of Data II. Its Characteristics III. Its Critics IV. Mathematics: The Long View V. Mathematical Challenge: Navigating High Dimensional Space & % & % Data, Data, Data! 5 Data, Data, Data! 6 $ ' ' $ I. The other World Hunger Problem The coming Data Era • Since beginning of history: unquenched desire for data "Data, Data, Data! I can't make bricks without clay!" -- The Adventure of the • Coming years, problem will be solved • Prodigious efforts of best and brightest • Massive constructions: ambitious and visionary • Will define an Era Copper Beeches (a) Sherlock • Will permeate everything (b) Quotation & % & Data, Data, Data! 7 % Data, Data, Data! 8 $ ' ' $ The Era of Faith The coming Data Era In place of cathedrals: massive investment for • Ubiquitous sensors • Pervasive networking • Gargantuan databases (a) Sistine Chapel & (b) Notre Dame • Data-rich discourse and decision % & % Data, Data, Data! 9 Data, Data, Data! 10 $ ' ' $ I.1 Data and Notions of Fate The Data Era Touches Everything... Touches even the most spiritual places • Our Fate • Our Creation (a) AB 3700 & (b) Celera Genomics % & Data, Data, Data! 11 % Data, Data, Data! 12 $ ' ' $ Decoding the Genome Genomic Data Availability • 109 Base pairs • 30,000 Genes • On-line 24*7 • Ubiquitous Software • Internet Accessibility (a) Nature & (b) Science % & % Data, Data, Data! 13 Data, Data, Data! 14 $ ' ' $ I.2 Data and Notions of Creation Genomic Data will be touching your fate • Will you (your favorite phobia here ...) – Will you die young? – Will your offspring share your traits? – Will you react poorly to stress? • Will we ... conquer disease, defects? & % & Data, Data, Data! 15 % Data, Data, Data! 16 $ ' ' $ Sloan Digital Sky Survey Sloan Facts (from www.sdss.org) • Catalog larger by orders of magnitude tha perviou • Data available on-line 24*7 • SDSS logo – ”Cosmic Genome Project” (a) Sloan Instrument & (b) Las Campanas Survey % & % Data, Data, Data! 17 Data, Data, Data! 18 $ ' ' $ Data Have Revolutionary Impact • Today: one-time transition in human affairs Data Touch our Idea of Creation • You will be participants and shapers Comparable Transition: Printing Press • Sensors seeing back into the earliest moments • 50 million books in century 1450-1550 • Processing detects the first wrinkles in space/time • Mass Communication, Art, Expression • Exploration, Conquest, Empire • Religious conflict, Total War & % & Data, Data, Data! 19 % Data, Data, Data! $ ' ' 20 $ II. Symbols of the Modern Era? True characteristics of the Modern Era Ubiquitous, Pervasive Combination of • Gathering Data • Transporting • Archiving • Publishing • Exploiting (a) DataCenter & (b) Network % & % Data, Data, Data! 21 Data, Data, Data! 22 $ ' ' $ Electronic Nose (N. Lewis, Caltech) II.1 Gathering Data Everything Can and Will Become Data • Smells • Chemical Vision • Motions, Gestures • Preferences • Essences • Human Identity • Behavior & % & Data, Data, Data! 23 Data, Data, Data! 24 $ ' ' Electronic Nose Results (a) & % $ Hyperspectral Photography (a) Eye (b) % & (b) Spectrum % Data, Data, Data! 25 Data, Data, Data! 26 $ ' ' Human Motion Tracking – HiBall Tracker Hyperspectral Photography – Derived Image (a) Figure 11: Derived Image & $ (b) % & Data, Data, Data! 27 % Data, Data, Data! 28 $ ' ' $ SAR/Lidar Sensor Fusion New types of data: endless • Cortical response to stimuli (fMRI) • Internet Browsing Behavior • Gait, Posture, Gesture • NBA: mining videos for player moves & % & % Data, Data, Data! 29 Data, Data, Data! 30 $ ' ' $ II.3 Ubiquitous and Pervasive Archiving Data II.2 Ubiquitous and Pervasive Data Transport • Earthquakes Netcentric Communications • Galaxies • Internet • Genome • Wireless • ... • Bluetooth • Massive Data Swamps Everything – Voice, Media etc. • Comprehensive • Complete & % & Data, Data, Data! 31 % Data, Data, Data! 32 $ ' ' $ Earthquake Catalogs II.4 Ubiquitous and Pervasive Publication of Data (a) Forbidden City & % & (b) Detained EP3 on CNN % Data, Data, Data! 33 Data, Data, Data! 34 $ ' ' Publication of High-Resolution Data (a) 1.0 Meter Res $ Publication of Financial Data (b) 0.5 Meter Res & Data, Data, Data! ' % & 35 % Data, Data, Data! 36 $ ' $ Military need for Data II.5 Pervasive and Ubiquitous Exploitation • Scientific • Military • Industry • Investment Community (a) & % & (b) % Data, Data, Data! 37 Data, Data, Data! 38 $ ' ' $ Modernizing War Military Contributions to Data Era • Computers • Communications • Networking • Storage • Display (a) Publ Docs & % & Data, Data, Data! 39 % Data, Data, Data! 40 $ ' ' $ Unmanned Air Vehicles Navy Vision & (b) Army Docs % & % Data, Data, Data! 41 Data, Data, Data! 42 $ ' ' $ Unmanned Ground Sensors (Concept) MicroMotes (Kris Pister, David Culler et al.) (a) Mote & % & Data, Data, Data! 43 Data, Data, Data! $ ' ' (b) Network % 44 $ Corporate Organization = Information Organization US Corporations and Data Era • Size of IT marketplace $750B/year • Size of IT budgets nearing 50% of total. • Merchandising • Airlines • Communications • Shipping • Manufacturing & % & % Data, Data, Data! 45 Data, Data, Data! 46 $ ' ' $ WalMart: 7 Terabyte Transactions Database Investment Community and Data Era Massive Investment in startups for ... • Network Infrastructure • Genomic Databases • Consumer Behavior Tracking • Mass Personalization (a) WalMart (b) Database & % & Data, Data, Data! 47 % Data, Data, Data! 48 $ ' ' $ Compiling Musical Genome Example: Savage Beast (Oakland, CA) (a) Query & (b) Approach % & % Data, Data, Data! 49 Data, Data, Data! 50 $ ' ' $ III Critics of the Data Era, 1 (a) Ralph Nader & (b) Pope John % & Data, Data, Data! 51 % Data, Data, Data! 52 $ ' ' $ III Critics of the Data Era, 2 First Law of Data Smog Information, once rare and cherished like caviar, is now plentiful and take for granted like potatoes. & % & % Data, Data, Data! 53 Data, Data, Data! 54 $ ' ' $ My Reaction Typical Quotation from Data Smog • Beginning of telephone era: people reported shock lasting days after a phone call Tens of thousands of words daily pulse through our beleaguered brains. Accompanied by a massive amount of other auditory and visual stimuli. In every moment of the audio-visual orgy of our highly-informed days, the brain handles a massive amount of electrical traffic. No wonder we feel burnt • Much opposition is simply romanticism On the Other hand ... • Serious looming problems in Data era; • Typical critics ill-equipped to see, understand, or articulate them. – Philosopher Philip Novak & % & Data, Data, Data! 55 % Data, Data, Data! 56 $ ' ' $ A Common View of Technology IV. The Long View – Mathematics "Any sufficiently advanced technology is indistinguishable from Magic" • Society is making massive bets ... • Are these wise? • Will the investment pay off? • Where do you come in? -- Arthur C. Clarke (a) 2001 & % & (b) Quote % Data, Data, Data! 57 Data, Data, Data! 58 $ ' ' $ USNA Responsibility • You are Leaders of Tomorrow Undergraduate/Graduate Courses Relevant • You must be Tough Customers • Demand Good Investments • Linear Algebra/Vector Analysis • Make Society’s Investments Pay Off • Statistics/Data Analysis Mathematics Tells You • Machine Learning/Data Mining • Information Theory • Information technology is not magic • Signal Processing • Extracting Information from data is not a sure thing • Specific hard work on a case-by-case basis • You can learn what must be done & Data, Data, Data! ' % & 59 % Data, Data, Data! 60 $ ' $ Statistical Dataset • N Individuals, p variables • X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p Mathematical Viewpoint about Database • Datasets == arrays of numbers • Array of numbers == points clouds in space • Exploitation of Data == Recognizing patterns in point clouds Example Mathematics Student Exams • N = 25 Students, p = 5 measured variables • X(i, 1): Diff. Geometry score of student i • X(i, 2): Complex Analysis score of student i • X(i, 3): Algebra score of student i • X(i, 4): Real Analysis score of student i & • X(i, 5): Statistics score of student i % & % Data, Data, Data! 61 Data, Data, Data! 62 $ ' ' $ Dataset as a point cloud in 2-d ID # Diff. Complex Algebra Real Statistics Graduate Student Exam Scores 90 Analysis Analysis 1 36 58 43 36 37 2 62 54 50 46 52 3 31 42 41 40 29 4 76 78 69 66 81 5 46 56 52 56 40 6 12 42 38 38 28 7 39 46 51 54 41 ... ... ... ... ... ... 80 70 60 Statistics Scores Geom. 50 40 30 20 10 & 0 10 20 30 40 50 Differential Geometry Scores 60 70 80 % & Data, Data, Data! 63 % Data, Data, Data! 64 $ ' ' Dataset as a point cloud in 3-d $ Many other Statistical Datasets Graduate Student Exam Scores • N Individuals, p variables • X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p 70 65 Examples 60 Algebra 55 50 • Medical: Individuals = Patients, Variables = Age, Blood Pressure, ... 45 40 35 30 0 100 80 20 60 40 40 60 20 80 Differential Geometry & • Financial: Individuals = Stocks, Variables = P/E Ratio, Earnings Growth (APR), ... 0 Statistics • Sports: Individuals = Athletes, Variables = AB, H, W, BA, RBI, HR, ... % & % Data, Data, Data! 65 Data, Data, Data! 66 $ ' ' $ Interlude: MacSpin Some Common Patterns in Point Clouds • Planes • Clusters • Outliers • Filaments Data Analysis: Finding and Interpreting such Patterns & % & Data, Data, Data! 67 Data, Data, Data! $ ' ' % 68 $ V. The True Challenge: High Dimensionality Society’s Bets Modern Data have high volumes • Socially useful datasets == point clouds in space • A Sound: > 3KHz • Exploitation of Data == Recognizing patterns in point clouds • An Image: >1M Pixels • We will recognize patterns in point clouds which allow exploitation Each ’Data Point’ in a database is a highly multidimensional object • An Utterance: >50K dimensions • A Face: > 200K dimensions & % & % Data, Data, Data! 69 Data, Data, Data! 70 $ ' ' $ Example Modern Databases Human Identity Needs with High Dimensional data • For each person, a 256 ∗ 256 image • N = 106 individuals (points) • Visualizing high dimensional data • p = 256 ∗ 256 = 65536 variables (dimensions) • Navigating high dimensional space Hyperspectral Image • Discovering high dimensional patterns • For each chemical, a 1024-long spectrum • N = 5000 compounds (points) • p = 1024 variables (dimensions) & % & Data, Data, Data! ' 71 Data, Data, Data! % 72 $ ' $ % & % Can We Visualize High Dimensions? & Data, Data, Data! 73 Data, Data, Data! 74 ' $ ' $ & % & % Data, Data, Data! 75 Data, Data, Data! 76 ' $ ' $ & % & % Data, Data, Data! 77 Data, Data, Data! 78 ' $ ' $ & % & % Data, Data, Data! 79 Data, Data, Data! 80 ' $ ' $ & % & % Data, Data, Data! ' 81 Data, Data, Data! 82 $ ' $ We must Reduce Dimensionality!! Idea: Project onto Subspace & % & Data, Data, Data! ' 83 % Data, Data, Data! 84 $ ' $ Illustration of Rotate/Drop Coordinates Approaches to Reduce Dimensionality • Ideas from Linear Algebra and Statistics 20 • Choose new coordinate system (basis) 10 0 • Rotate into that basis -10 • Use only a few of the new coordinates 20 10 -20 20 0 10 0 -10 -10 -20 & % & -20 % Data, Data, Data! 85 Data, Data, Data! 86 $ ' ' $ Dataset as a point cloud in 2-d Graduate Student Exam Scores 90 80 Approaches to Reduce Dimensionality 70 Statistics Scores 60 • V.I Principal Component Analysis • V.II Multiresolution Analysis 50 40 30 20 10 & 0 10 20 30 40 50 Differential Geometry Scores 60 % & Data, Data, Data! 87 80 % Data, Data, Data! 88 $ ' ' 70 $ Concentration Ellipsoid and Principal Axes Graduate Student Exam Scores Principal Components 90 80 • Concentration Ellipsoid Statistics Scores 70 60 • Principal Axes 50 • Longer Axes – more variation 40 Sometimes 30 • Few Axes – most variation 20 • Summarize many variables by a few. 10 0 -20 & 0 20 40 60 Differential Geometry Scores 80 100 % & % Data, Data, Data! 89 Data, Data, Data! 90 $ ' ' $ PCA Output on eNose PCA Relies on ... • Linear Algebra (eigenvalues of positive definite matrices) • Multivariate Statistics (covariance matrices) • Numerical Computing (eigenvalue solvers) Used throughout Science and Technology • Chemometrics • Hyperspectral Imagery • Human Identification & Data, Data, Data! ' % & 91 % Data, Data, Data! 92 $ ' $ Eigenfaces PCA Output on Hyperspectral Imagery (a) UBC Students & % & (b) Principal faces % Data, Data, Data! 93 Data, Data, Data! 94 $ ' ' $ Example: Facial Gestures Can we discover ... • ... from data • ... the true cooordinate system • Recent approach: ICA • Uses different statistical principles & % & Data, Data, Data! 95 Data, Data, Data! $ ' ' % 96 $ V.II Dimensionality Reduction by Multiresolution Anal • New coordinate system: not pixels but (scale,location) • Break Data into coarse/medium/fine scales • Select information by scale/location (a) PCA (b) ICA • Alter egos: wavelets, multiscale methods Sejnowski Lab, Salk Inst. & % & % Data, Data, Data! 97 Data, Data, Data! 98 $ ' ' $ Bandpass BigMac, s=1 BigMac Bandpass, s=2 Partitioned, s=2 Ridgelet Coeff, s=2 Bandpass, s=3 Partitioned, s=3 Ridgelet Coeff, s=2 Figure 26: Fingerprint Movie. 64W + 0/256/768/3072 Curvelets Figure 25: BigMac Image, and stages of Curvelet Analysis. & Data, Data, Data! ' % & 99 % Data, Data, Data! 100 $ ' $ V.III Nonlinear Structures in High-D Space Figure 27: Lichtenstein, ‘In The Car’; 64 Wavelets+ 256 Curvelets & (a) Scatter % & (b) Manifold % Data, Data, Data! ' 101 Data, Data, Data! 102 $ ' $ Example of Nonlinear Structure: Images Can We Discover ... • ... from data • ... an underlying nonlinear coordinate system Relevant Courses • Linear Algebra • Multivariate Statistics • Differential Geometry of Curves and Surfaces & % & Data, Data, Data! ' 103 Data, Data, Data! $ ' % 104 $ ISOMAP (Josh Tenenbaum, Stanford) Many Reasons to Navigate High Dimensions • Locate Best Matches in Multimedia Databases • Steganography • Understand tactical tendencies of opponent • Maneuver planning Many Methods to Navigate High Dimensions • Sonification • Diverse Mathematical Ideas & % & % Data, Data, Data! 105 $ ' Summary • Entering Era of Data • Serious Challenges: Data Smog, Data Glut • Mathematical Challenges: navigate, explore high dimensional space • Only scratched surface in this lecture • Progress important for Navy Goals and Society in General • You can contribute & %