Thanks to ... USNA Michelson Lecture 2001 Data, Data, Data!

advertisement
Data, Data, Data!
1
Data, Data, Data!
2
$ '
'
$
Thanks to ...
USNA Michelson Lecture 2001
• Office of Naval Research:
– Reza Malek Madani
– Wen Masters
– Carey Schwartz
• Will Glaser, Savage Beast
• Jeffrey Klein, Xamplify
Data, Data, Data!
Challenges and Opportunities of
the coming Data Deluge
• John Lavery, Army Research Office
• Tracy McSheery, PhaseSpace
• Peter Weinberger, Renaissance Technologies
David Donoho, Statistics Department, Stanford University
• Shankar Sastry, UC Berkeley
&
Data, Data, Data!
'
% &
3
%
Data, Data, Data!
4
$ '
$
Agenda
I. The Era of Data
II. Its Characteristics
III. Its Critics
IV. Mathematics: The Long View
V. Mathematical Challenge: Navigating High Dimensional Space
&
% &
%
Data, Data, Data!
5
Data, Data, Data!
6
$ '
'
$
I. The other World Hunger Problem
The coming Data Era
• Since beginning of history: unquenched desire for data
"Data, Data, Data!
I can't make bricks
without clay!"
-- The Adventure of the
• Coming years, problem will be solved
• Prodigious efforts of best and brightest
• Massive constructions: ambitious and visionary
• Will define an Era
Copper Beeches
(a) Sherlock
• Will permeate everything
(b) Quotation
&
% &
Data, Data, Data!
7
%
Data, Data, Data!
8
$ '
'
$
The Era of Faith
The coming Data Era
In place of cathedrals: massive investment for
• Ubiquitous sensors
• Pervasive networking
• Gargantuan databases
(a) Sistine Chapel
&
(b) Notre Dame
• Data-rich discourse and decision
% &
%
Data, Data, Data!
9
Data, Data, Data!
10
$ '
'
$
I.1 Data and Notions of Fate
The Data Era Touches Everything...
Touches even the most spiritual places
• Our Fate
• Our Creation
(a) AB 3700
&
(b) Celera Genomics
% &
Data, Data, Data!
11
%
Data, Data, Data!
12
$ '
'
$
Decoding the Genome
Genomic Data Availability
• 109 Base pairs
• 30,000 Genes
• On-line 24*7
• Ubiquitous Software
• Internet Accessibility
(a) Nature
&
(b) Science
% &
%
Data, Data, Data!
13
Data, Data, Data!
14
$ '
'
$
I.2 Data and Notions of Creation
Genomic Data will be touching your fate
• Will you (your favorite phobia here ...)
– Will you die young?
– Will your offspring share your traits?
– Will you react poorly to stress?
• Will we ... conquer disease, defects?
&
% &
Data, Data, Data!
15
%
Data, Data, Data!
16
$ '
'
$
Sloan Digital Sky Survey
Sloan Facts (from www.sdss.org)
• Catalog larger by orders of magnitude tha perviou
• Data available on-line 24*7
• SDSS logo – ”Cosmic Genome Project”
(a) Sloan Instrument
&
(b) Las Campanas Survey
% &
%
Data, Data, Data!
17
Data, Data, Data!
18
$ '
'
$
Data Have Revolutionary Impact
• Today: one-time transition in human affairs
Data Touch our Idea of Creation
• You will be participants and shapers
Comparable Transition: Printing Press
• Sensors seeing back into the earliest moments
• 50 million books in century 1450-1550
• Processing detects the first wrinkles in space/time
• Mass Communication, Art, Expression
• Exploration, Conquest, Empire
• Religious conflict, Total War
&
% &
Data, Data, Data!
19
%
Data, Data, Data!
$ '
'
20
$
II. Symbols of the Modern Era?
True characteristics of the Modern Era
Ubiquitous, Pervasive Combination of
• Gathering Data
• Transporting
• Archiving
• Publishing
• Exploiting
(a) DataCenter
&
(b) Network
% &
%
Data, Data, Data!
21
Data, Data, Data!
22
$ '
'
$
Electronic Nose (N. Lewis, Caltech)
II.1 Gathering Data
Everything Can and Will Become Data
• Smells
• Chemical Vision
• Motions, Gestures
• Preferences
• Essences
• Human Identity
• Behavior
&
% &
Data, Data, Data!
23
Data, Data, Data!
24
$ '
'
Electronic Nose Results
(a)
&
%
$
Hyperspectral Photography
(a) Eye
(b)
% &
(b) Spectrum
%
Data, Data, Data!
25
Data, Data, Data!
26
$ '
'
Human Motion Tracking – HiBall Tracker
Hyperspectral Photography – Derived Image
(a)
Figure 11: Derived Image
&
$
(b)
% &
Data, Data, Data!
27
%
Data, Data, Data!
28
$ '
'
$
SAR/Lidar Sensor Fusion
New types of data: endless
• Cortical response to stimuli (fMRI)
• Internet Browsing Behavior
• Gait, Posture, Gesture
• NBA: mining videos for player moves
&
% &
%
Data, Data, Data!
29
Data, Data, Data!
30
$ '
'
$
II.3 Ubiquitous and Pervasive Archiving Data
II.2 Ubiquitous and Pervasive Data Transport
• Earthquakes
Netcentric Communications
• Galaxies
• Internet
• Genome
• Wireless
• ...
• Bluetooth
• Massive
Data Swamps Everything – Voice, Media etc.
• Comprehensive
• Complete
&
% &
Data, Data, Data!
31
%
Data, Data, Data!
32
$ '
'
$
Earthquake Catalogs
II.4 Ubiquitous and Pervasive Publication of Data
(a) Forbidden City
&
% &
(b) Detained EP3 on CNN
%
Data, Data, Data!
33
Data, Data, Data!
34
$ '
'
Publication of High-Resolution Data
(a) 1.0 Meter Res
$
Publication of Financial Data
(b) 0.5 Meter Res
&
Data, Data, Data!
'
% &
35
%
Data, Data, Data!
36
$ '
$
Military need for Data
II.5 Pervasive and Ubiquitous Exploitation
• Scientific
• Military
• Industry
• Investment Community
(a)
&
% &
(b)
%
Data, Data, Data!
37
Data, Data, Data!
38
$ '
'
$
Modernizing War
Military Contributions to Data Era
• Computers
• Communications
• Networking
• Storage
• Display
(a) Publ Docs
&
% &
Data, Data, Data!
39
%
Data, Data, Data!
40
$ '
'
$
Unmanned Air Vehicles
Navy Vision
&
(b) Army Docs
% &
%
Data, Data, Data!
41
Data, Data, Data!
42
$ '
'
$
Unmanned Ground Sensors (Concept)
MicroMotes (Kris Pister, David Culler et al.)
(a) Mote
&
% &
Data, Data, Data!
43
Data, Data, Data!
$ '
'
(b) Network
%
44
$
Corporate Organization = Information Organization
US Corporations and Data Era
• Size of IT marketplace $750B/year
• Size of IT budgets nearing 50% of total.
• Merchandising
• Airlines
• Communications
• Shipping
• Manufacturing
&
% &
%
Data, Data, Data!
45
Data, Data, Data!
46
$ '
'
$
WalMart: 7 Terabyte Transactions Database
Investment Community and Data Era
Massive Investment in startups for ...
• Network Infrastructure
• Genomic Databases
• Consumer Behavior Tracking
• Mass Personalization
(a) WalMart
(b) Database
&
% &
Data, Data, Data!
47
%
Data, Data, Data!
48
$ '
'
$
Compiling Musical Genome
Example: Savage Beast (Oakland, CA)
(a) Query
&
(b) Approach
% &
%
Data, Data, Data!
49
Data, Data, Data!
50
$ '
'
$
III Critics of the Data Era, 1
(a) Ralph Nader
&
(b) Pope John
% &
Data, Data, Data!
51
%
Data, Data, Data!
52
$ '
'
$
III Critics of the Data Era, 2
First Law of Data Smog
Information, once rare and cherished like caviar, is now
plentiful and take for granted like potatoes.
&
% &
%
Data, Data, Data!
53
Data, Data, Data!
54
$ '
'
$
My Reaction
Typical Quotation from Data Smog
• Beginning of telephone era: people reported shock lasting days
after a phone call
Tens of thousands of words daily pulse through our
beleaguered brains. Accompanied by a massive amount of
other auditory and visual stimuli. In every moment of the
audio-visual orgy of our highly-informed days, the brain
handles a massive amount of electrical traffic. No wonder
we feel burnt
• Much opposition is simply romanticism
On the Other hand ...
• Serious looming problems in Data era;
• Typical critics ill-equipped to see, understand, or articulate
them.
– Philosopher Philip Novak
&
% &
Data, Data, Data!
55
%
Data, Data, Data!
56
$ '
'
$
A Common View of Technology
IV. The Long View – Mathematics
"Any sufficiently
advanced technology is
indistinguishable
from Magic"
• Society is making massive bets ...
• Are these wise?
• Will the investment pay off?
• Where do you come in?
-- Arthur C. Clarke
(a) 2001
&
% &
(b) Quote
%
Data, Data, Data!
57
Data, Data, Data!
58
$ '
'
$
USNA Responsibility
• You are Leaders of Tomorrow
Undergraduate/Graduate Courses Relevant
• You must be Tough Customers
• Demand Good Investments
• Linear Algebra/Vector Analysis
• Make Society’s Investments Pay Off
• Statistics/Data Analysis
Mathematics Tells You
• Machine Learning/Data Mining
• Information Theory
• Information technology is not magic
• Signal Processing
• Extracting Information from data is not a sure thing
• Specific hard work on a case-by-case basis
• You can learn what must be done
&
Data, Data, Data!
'
% &
59
%
Data, Data, Data!
60
$ '
$
Statistical Dataset
• N Individuals, p variables
• X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p
Mathematical Viewpoint about Database
• Datasets == arrays of numbers
• Array of numbers == points clouds in space
• Exploitation of Data == Recognizing patterns in point clouds
Example Mathematics Student Exams
• N = 25 Students, p = 5 measured variables
• X(i, 1): Diff. Geometry score of student i
• X(i, 2): Complex Analysis score of student i
• X(i, 3): Algebra score of student i
• X(i, 4): Real Analysis score of student i
&
• X(i, 5): Statistics score of student i
% &
%
Data, Data, Data!
61
Data, Data, Data!
62
$ '
'
$
Dataset as a point cloud in 2-d
ID #
Diff.
Complex
Algebra
Real
Statistics
Graduate Student Exam Scores
90
Analysis
Analysis
1
36
58
43
36
37
2
62
54
50
46
52
3
31
42
41
40
29
4
76
78
69
66
81
5
46
56
52
56
40
6
12
42
38
38
28
7
39
46
51
54
41
...
...
...
...
...
...
80
70
60
Statistics Scores
Geom.
50
40
30
20
10
&
0
10
20
30
40
50
Differential Geometry Scores
60
70
80
% &
Data, Data, Data!
63
%
Data, Data, Data!
64
$ '
'
Dataset as a point cloud in 3-d
$
Many other Statistical Datasets
Graduate Student Exam Scores
• N Individuals, p variables
• X(i, j) data 1 ≤ i ≤ N , 1 ≤ j ≤ p
70
65
Examples
60
Algebra
55
50
• Medical: Individuals = Patients, Variables = Age, Blood
Pressure, ...
45
40
35
30
0
100
80
20
60
40
40
60
20
80
Differential Geometry
&
• Financial: Individuals = Stocks, Variables = P/E Ratio,
Earnings Growth (APR), ...
0
Statistics
• Sports: Individuals = Athletes, Variables = AB, H, W, BA,
RBI, HR, ...
% &
%
Data, Data, Data!
65
Data, Data, Data!
66
$ '
'
$
Interlude: MacSpin
Some Common Patterns in Point Clouds
• Planes
• Clusters
• Outliers
• Filaments
Data Analysis: Finding and Interpreting such Patterns
&
% &
Data, Data, Data!
67
Data, Data, Data!
$ '
'
%
68
$
V. The True Challenge: High Dimensionality
Society’s Bets
Modern Data have high volumes
• Socially useful datasets == point clouds in space
• A Sound: > 3KHz
• Exploitation of Data == Recognizing patterns in point clouds
• An Image: >1M Pixels
• We will recognize patterns in point clouds which allow
exploitation
Each ’Data Point’ in a database is a highly multidimensional object
• An Utterance: >50K dimensions
• A Face: > 200K dimensions
&
% &
%
Data, Data, Data!
69
Data, Data, Data!
70
$ '
'
$
Example Modern Databases
Human Identity
Needs with High Dimensional data
• For each person, a 256 ∗ 256 image
• N = 106 individuals (points)
• Visualizing high dimensional data
• p = 256 ∗ 256 = 65536 variables (dimensions)
• Navigating high dimensional space
Hyperspectral Image
• Discovering high dimensional patterns
• For each chemical, a 1024-long spectrum
• N = 5000 compounds (points)
• p = 1024 variables (dimensions)
&
% &
Data, Data, Data!
'
71
Data, Data, Data!
%
72
$ '
$
% &
%
Can We Visualize High Dimensions?
&
Data, Data, Data!
73
Data, Data, Data!
74
'
$ '
$
&
% &
%
Data, Data, Data!
75
Data, Data, Data!
76
'
$ '
$
&
% &
%
Data, Data, Data!
77
Data, Data, Data!
78
'
$ '
$
&
% &
%
Data, Data, Data!
79
Data, Data, Data!
80
'
$ '
$
&
% &
%
Data, Data, Data!
'
81
Data, Data, Data!
82
$ '
$
We must Reduce Dimensionality!!
Idea: Project onto Subspace
&
% &
Data, Data, Data!
'
83
%
Data, Data, Data!
84
$ '
$
Illustration of Rotate/Drop Coordinates
Approaches to Reduce Dimensionality
• Ideas from Linear Algebra and Statistics
20
• Choose new coordinate system (basis)
10
0
• Rotate into that basis
-10
• Use only a few of the new coordinates
20
10
-20
20
0
10
0
-10
-10
-20
&
% &
-20
%
Data, Data, Data!
85
Data, Data, Data!
86
$ '
'
$
Dataset as a point cloud in 2-d
Graduate Student Exam Scores
90
80
Approaches to Reduce Dimensionality
70
Statistics Scores
60
• V.I Principal Component Analysis
• V.II Multiresolution Analysis
50
40
30
20
10
&
0
10
20
30
40
50
Differential Geometry Scores
60
% &
Data, Data, Data!
87
80
%
Data, Data, Data!
88
$ '
'
70
$
Concentration Ellipsoid and Principal Axes
Graduate Student Exam Scores
Principal Components
90
80
• Concentration Ellipsoid
Statistics Scores
70
60
• Principal Axes
50
• Longer Axes – more variation
40
Sometimes
30
• Few Axes – most variation
20
• Summarize many variables by a few.
10
0
-20
&
0
20
40
60
Differential Geometry Scores
80
100
% &
%
Data, Data, Data!
89
Data, Data, Data!
90
$ '
'
$
PCA Output on eNose
PCA Relies on ...
• Linear Algebra (eigenvalues of positive definite matrices)
• Multivariate Statistics (covariance matrices)
• Numerical Computing (eigenvalue solvers)
Used throughout Science and Technology
• Chemometrics
• Hyperspectral Imagery
• Human Identification
&
Data, Data, Data!
'
% &
91
%
Data, Data, Data!
92
$ '
$
Eigenfaces
PCA Output on Hyperspectral Imagery
(a) UBC Students
&
% &
(b) Principal faces
%
Data, Data, Data!
93
Data, Data, Data!
94
$ '
'
$
Example: Facial Gestures
Can we discover ...
• ... from data
• ... the true cooordinate system
• Recent approach: ICA
• Uses different statistical principles
&
% &
Data, Data, Data!
95
Data, Data, Data!
$ '
'
%
96
$
V.II Dimensionality Reduction by Multiresolution Anal
• New coordinate system: not pixels but (scale,location)
• Break Data into coarse/medium/fine scales
• Select information by scale/location
(a) PCA
(b) ICA
• Alter egos: wavelets, multiscale methods
Sejnowski Lab, Salk Inst.
&
% &
%
Data, Data, Data!
97
Data, Data, Data!
98
$ '
'
$
Bandpass BigMac, s=1
BigMac
Bandpass, s=2
Partitioned, s=2
Ridgelet Coeff, s=2
Bandpass, s=3
Partitioned, s=3
Ridgelet Coeff, s=2
Figure 26: Fingerprint Movie. 64W + 0/256/768/3072 Curvelets
Figure 25: BigMac Image, and stages of Curvelet Analysis.
&
Data, Data, Data!
'
% &
99
%
Data, Data, Data!
100
$ '
$
V.III Nonlinear Structures in High-D Space
Figure 27: Lichtenstein, ‘In The Car’; 64 Wavelets+ 256 Curvelets
&
(a) Scatter
% &
(b) Manifold
%
Data, Data, Data!
'
101
Data, Data, Data!
102
$ '
$
Example of Nonlinear Structure: Images
Can We Discover ...
• ... from data
• ... an underlying nonlinear coordinate system
Relevant Courses
• Linear Algebra
• Multivariate Statistics
• Differential Geometry of Curves and Surfaces
&
% &
Data, Data, Data!
'
103
Data, Data, Data!
$ '
%
104
$
ISOMAP (Josh Tenenbaum, Stanford)
Many Reasons to Navigate High Dimensions
• Locate Best Matches in Multimedia Databases
• Steganography
• Understand tactical tendencies of opponent
• Maneuver planning
Many Methods to Navigate High Dimensions
• Sonification
• Diverse Mathematical Ideas
&
% &
%
Data, Data, Data!
105
$
'
Summary
• Entering Era of Data
• Serious Challenges: Data Smog, Data Glut
• Mathematical Challenges:
navigate, explore high dimensional space
• Only scratched surface in this lecture
• Progress important for Navy Goals and Society in General
• You can contribute
&
%
Download