Efficient Statistical Learning from “Big Science” Data

Andrew Moore (Auton Lab), Andrew Connolly (U. Pittsburgh), Jeremy Kubica (Auton Lab), Ting Liu (Auton Lab)
The Auton Lab
School of Computer Science
Carnegie Mellon University
www.autonlab.org
The Auton Lab
Faculty: Andrew Moore (Prof.), Jeff Schneider (Research Scientist), Artur Dubrawski (Systems Scientist)
Postdoctoral Fellows: Brigham Anderson, Alexander Gray, Paul Komarek, Dan Pelleg
Graduate Students: Brent Bryan, Kaustav Das, Khalid El-Arini, Anna Goldenberg, Jeremy Kubica, Ting Liu, Daniel Neill, Sajid Siddiqi, Purna Sarkar, Ajit Singh, Weng-Keen Wong
Head of Software Development: Jeanie Komarek
Programmers: Patrick Choi, Adam Goode, Pat Gunn, Joey Liang, John Ostlund, Robin Sabhnani, Rahul Sankathar
Executive Assistant: Kristen Schrauder
Head of Sys. Admin: Jacob Joseph
Undergraduate and Masters Interns: Kenny Daniel, Sandy Hsu, Dongryeol Lee, Jennifer Lee, Avilay Parekh, Chris Rotella, Jonathan Terleski
Recent Alumni: Drew Bagnell (RI faculty), Scott Davies (Google), David Cohn (Google), Geoff Gordon (CMU), Paul Hsiung (USC), Marina Meila (U. Washington), Remi Munos (Ecole Polytechnique), Malcolm Strens (QinetiQ)
Current Sponsors
• National Science Foundation (NSF)
• NASA
• Defense Advanced Research Projects Agency (DARPA)
• Department of Homeland Security (DHS)
• Homeland Security Advanced Research Projects Agency (HSARPA)
• United States Department of Agriculture (USDA)
• State of Pennsylvania
• Pfizer Corporation
• Caterpillar Corporation
• British Petroleum
• Psychogenics Corporation
• Transform Pharmaceuticals
• Health Canada
Collaborators…
Ting Liu, Paul Hsiung, Andy Connolly, Bob Nichol, Jeff Schneider, Istvan Szapudi, Chris Genovese, Alex Szalay, Alex Gray, Brigham Anderson, Anna Goldenberg, Kan Deng, Greg Cooper, Mike Wagner, Ajit Singh, Andrew Lawson, Daniel Neill, Scott Davies, Rich Tsui, Jeremy Kubica, Sajid Siddiqi, Paul Komarek, Drew Bagnell, Weng-Keen Wong, Dan Pelleg, Larry Wasserman
Auton/SPR Deployments, 1996–2003
[Timeline figure; the year-by-year layout is summarized as a list.]
• M&M Mars (line control; manufacturing; scheduling with uncertainty)
• Adrenaline (NOx minimization)
• Kodak (image stabilization)
• Digital Equipment (pregnancy monitoring)
• NASA/NSF (astrophysics mining)
• 3M (textile tension control; adhesive design; secret)
• Caterpillar (spare parts; emissions; self-optimizing engines)
• US Army (biotoxin detection)
• DigitalMC (music tastes)
• SmartMoney (anomalies)
• Unilever (brand management; targeted marketing)
• Phillips Petroleum (workforce optimization)
• Cellomics (screened anomaly detection; proteomics screen)
• Biometrics company (health monitor)
• Boeing (intrusion)
• Masterfoods (new product development; biochemistry)
• ABB (circuit-breaker supply chain)
• SwissAir (flight delays)
• Washington Public Hospital System (ER delays)
• NASA (National Virtual Observatory)
• NSF (astrostatistics software; biosurveillance algorithms)
• DARPA (national disease monitor; detecting patterns in links)
• Pfizer (high-throughput screen)
• Beverage company (ingredients/manufacturing/marketing/sales Bayes net)
• Transform Pharma (massive autonomous experiment design)
• Census Bureau (privacy protection)
• Psychogenics Inc. (effects of psychotropic drugs on rats)
• State of PA (national disease monitor, with Mike Wagner of U. Pitt; anti-cancer, in collaboration with CMU Biology)
• Other government departments (identifying dangerous people, potential collaborators, and aliases; detecting a class of clusters)
• Other pharma research co. (life-science-specific data mining)
• United States Department of Agriculture (early warning system for food terrorism)
Our 5 biggest applications in 2004
• Biomedical security (with Mike Wagner, University of Pittsburgh)
• Autonomous self-tweaking engines
• Drug screening
• Intelligence data
• Big-astrophysics automated science
Big Noisy Database of Links
Record #456621: "Doe and Smith were booked in the same hotel on 3/6/93."
Analyst: "There must be something useful here… but I'm drowning in noise."

Big Noisy Database of Links
Potential collaborators of Doe: "The pattern of travel among the group consisting of (Doe, Smith, Jones, Moore) during 93–96 can't be explained by coincidence."
Analyst: "This is more useful, objective information."
Cached Sufficient Statistics
Kd-trees and Ball Trees
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Data Analysis: The old days
[Diagram: an analyst poses a question directly against a small table of records (Size, Ellipticity, Color) and reads off the answer.]
Data Analysis: The new days
[Diagram: the same question posed against a massive database; the answer arrives seventeen months later…]
Cached Sufficient Statistics
[Diagram: the question is answered from a pre-built tree of cached sums (∑) instead of a scan over the raw records, so the answer comes back quickly.]
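The idea in miniature, as a hedged Python sketch (illustrative names, not the Auton Lab API): cache a count, a sum, and a sum of squares once, and many statistical queries are answered from the cache rather than by re-scanning the records.

```python
# Illustrative sketch of the cached-sufficient-statistics idea.
import numpy as np

class StatCache:
    def __init__(self, points):
        self.n = len(points)
        self.s = points.sum(axis=0)            # sum of the vectors
        self.ss = (points ** 2).sum(axis=0)    # per-dimension sum of squares

    def mean(self):
        return self.s / self.n

    def variance(self):
        return self.ss / self.n - self.mean() ** 2

records = np.random.rand(1_000_000, 3)
cache = StatCache(records)               # built once, offline
print(cache.mean(), cache.variance())    # answered from three cached arrays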
[Concept map, built up over several slides: the lineage of tree-based cached-statistics methods and who contributed what.]
• Kd-trees (Friedman, Bentley and Finkel, 1977)
• BSPs (and, via BSPs, Doom)
• Fast LOESS (Cleveland and Devlin, 1988)
• Variable-resolution reinforcement learning (Simons, Van Brussel, De Schutter and Verhaert, 1982; Chapman and Kaelbling, 1991)
• Multipole methods (Barnes and Hut, 1986; Greengard)
• WSPDs (Callahan and Kosaraju, 1993)
• Cached counts (Kan Deng)
• K-means, single-kernel regressions, and single kernel density estimations (Dan Pelleg)
• Cached sufficient statistics: Mahalanobis maximization, fast LWR, mixture log-likelihoods and mixture EM steps (Moore, 1998)
• Pipelined kernel operations, dual-tree cached stats, 2-point correlation, and multi-tree n-point correlation (Alex Gray)
• Non-parametric cluster / clutter / cluster counts (Weng-Keen Wong)
• Ball trees (via NIPS) and metric trees (via DBMS)
• High-dimensional versions (Ting Liu)
• Massively multiple object tracking (Jeremy Kubica)
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Structure of kd-trees
In memory, the root node records: "I contain 300 points and have center of mass (55, 52) and covariance
$\begin{pmatrix} 2308 & -45 \\ -45 & 1965 \end{pmatrix}$."

Its children record: "I contain 130 points and have center of mass (23, 49) and covariance
$\begin{pmatrix} 754 & 124 \\ 124 & 1965 \end{pmatrix}$" and "I contain 170 points and have center of mass (73, 58) and covariance
$\begin{pmatrix} 654 & 24 \\ 24 & 1885 \end{pmatrix}$."

<more levels here>
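A minimal sketch of such a node in Python (hypothetical names; the real kd-tree code at www.autonlab.org is richer): each node caches the count, center of mass, and covariance shown above, plus a bounding box, and splits on its widest dimension.

```python
import numpy as np

class KDNode:
    """A kd-tree node caching the statistics shown on the slide: point
    count, center of mass, covariance, plus a bounding box for search."""
    def __init__(self, pts, leaf_size=32):
        self.n = len(pts)
        self.center = pts.mean(axis=0)
        self.cov = (np.cov(pts.T) if self.n > 1
                    else np.zeros((pts.shape[1], pts.shape[1])))
        self.lo, self.hi = pts.min(axis=0), pts.max(axis=0)
        self.left = self.right = None
        if self.n > leaf_size:
            d = np.argmax(self.hi - self.lo)     # split the widest dimension
            order = np.argsort(pts[:, d])
            half = self.n // 2
            self.left = KDNode(pts[order[:half]], leaf_size)
            self.right = KDNode(pts[order[half:]], leaf_size)
        else:
            self.points = pts                    # raw points only at leaves
```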
Range Search
Range Count
[Animation over many slides: the tree is searched recursively. Nodes entirely inside the query range contribute their cached counts without further descent, nodes entirely outside are pruned, and only nodes straddling the boundary are opened.]
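A sketch of range counting against the KDNode above, assuming axis-aligned query boxes: prune nodes entirely outside, answer from the cached count when a node is entirely inside, and open only boundary nodes.

```python
import numpy as np

def range_count(node, lo, hi):
    """Count the points of a KDNode tree inside the box [lo, hi]."""
    if np.any(node.hi < lo) or np.any(node.lo > hi):
        return 0                             # prune: entirely outside the box
    if np.all(node.lo >= lo) and np.all(node.hi <= hi):
        return node.n                        # entirely inside: cached count
    if node.left is None:                    # leaf straddling the boundary
        inside = np.all((node.points >= lo) & (node.points <= hi), axis=1)
        return int(inside.sum())
    return range_count(node.left, lo, hi) + range_count(node.right, lo, hi)
```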
A set of points in a metric space
Ball Tree root node
A Ball Tree
[Animation: the ball tree is built up level by level.]
• J. Uhlmann, 1991
• S. Omohundro, NIPS 1991
Ball-trees: properties
Let Q be any query point and let x be a point inside ball B. Then
|x − Q| ≥ |Q − B.center| − B.radius
|x − Q| ≤ |Q − B.center| + B.radius
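The two bounds as a small Python helper (the clamp at 0 is a detail the slide's inequality leaves implicit: it covers the case where Q lies inside the ball).

```python
import numpy as np

def ball_bounds(Q, center, radius):
    """Lower and upper bounds on |x - Q| for any x inside ball B,
    straight from the slide's two inequalities."""
    d = np.linalg.norm(Q - center)
    return max(d - radius, 0.0), d + radius
```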
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Goal: find the 2-nearest neighbors of Q.
[Animation: depth-first search of the ball tree, starting at the root and visiting the nearer child first.]
When we hit a leaf node, we explicitly look at the points in the node.
The two nearest neighbors found so far are shown in pink (remember we have yet to search the blue balls).
[The search continues until no unvisited ball could possibly contain a closer point.]
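A toy ball-tree k-NN search in Python (illustrative, not the published KNS1 code): depth-first, nearer child first, pruning any ball whose lower bound already exceeds the current k-th best distance.

```python
import heapq
import numpy as np

class Ball:
    """Toy ball-tree node (illustrative sketch, not the Auton Lab code)."""
    def __init__(self, pts, leaf_size=16):
        self.center = pts.mean(axis=0)
        self.radius = np.linalg.norm(pts - self.center, axis=1).max()
        self.points = pts                    # kept at every node for clarity
        self.left = self.right = None
        if len(pts) > leaf_size:
            d = np.argmax(pts.max(axis=0) - pts.min(axis=0))   # widest dim
            order = np.argsort(pts[:, d])
            half = len(pts) // 2
            self.left = Ball(pts[order[:half]], leaf_size)
            self.right = Ball(pts[order[half:]], leaf_size)

def knn(node, Q, k, heap=None):
    if heap is None:
        heap = []                            # max-heap via negated distances
    lower = max(np.linalg.norm(Q - node.center) - node.radius, 0.0)
    if len(heap) == k and lower >= -heap[0][0]:
        return heap                          # prune: ball cannot improve k-NN
    if node.left is None:                    # leaf: look at the points
        for p in node.points:
            d = np.linalg.norm(p - Q)
            if len(heap) < k:
                heapq.heappush(heap, (-d, tuple(p)))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, tuple(p)))
        return heap
    near, far = sorted((node.left, node.right),
                       key=lambda c: np.linalg.norm(Q - c.center))
    knn(near, Q, k, heap)
    knn(far, Q, k, heap)
    return heap
```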
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
KNS2
• Assume binary output.
• Assume the positive class is much less frequent than the negative class.
• Assume we want more than a "positive/negative" prediction: we want to know exactly how many of the K-NN are from the +ve class.
KNS2 does this without explicitly finding the K-NN.
Assume we have a set of data points. Some are +ve points; the vast majority are –ve points.

Preprocessing: put all your +ve points in a small ball tree, and all your –ve points in a separate large ball tree.

Goal: find out how many of the 5-nearest neighbors of Q are positive.

Step One: find the five nearest +ve points using KNS1. (We're assuming there are far fewer +ves than –ves, so this is not the dominant cost.)
Step Two: search the ball tree of –ve points, starting at the root.

During the search we maintain a set of counters. By the end of the search, the first will contain the number of negative points closer to the query than the closest +ve point; the next, the number of negative points whose distance to the query is between the distances of the closest and 2nd-closest +ve points; and so on, up to the interval between the 4th- and 5th-closest +ve points. But the counters are only maintained insofar as they are relevant to the 5-NN query!

[Animation: at each ball, the search compares the maximum possible distance from Q to any –ve point in the ball against the distances from Q to the 5th-, 4th-, 3rd-, and 2nd-closest +ve points. When the comparison settles where every point in the ball must fall ("…and there are eight points in the ball"), the ball's cached count is added to the right counters and the ball is pruned. When it doesn't ("No prune!"), the search recurses. At a leaf, we explore the points explicitly ("I contain exactly one point closer than the 1st-closest +ve point," says the ball), then return and try the other sibling. Balls that can't possibly have any interesting points are skipped.]

We're done. There's one –ve point closer than the closest +ve point, and there are more than 3 –ve points closer than the 2nd-closest +ve point => exactly 1 of the 5 nearest neighbors is +ve.

[Figures: the balls visited during the search, and another example.]
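A simplified rendering of the KNS2 counting idea in Python, reusing the toy Ball class above (this keeps only the two easiest prunes; the published algorithm maintains and prunes on the finer interval counts). It assumes at least k positive points.

```python
import numpy as np

def kns2_count(Q, pos_pts, neg_root, k=5):
    # Step One: exact distances to the k nearest +ve points
    # (cheap, because there are few positives).
    d_pos = np.sort(np.linalg.norm(pos_pts - Q, axis=1))[:k]
    # counts[i] will hold the number of -ve points closer to Q than the
    # (i+1)-th closest +ve point.
    counts = np.zeros(k, dtype=int)

    def search(node):
        d = np.linalg.norm(Q - node.center)
        lo, hi = max(d - node.radius, 0.0), d + node.radius
        if lo >= d_pos[-1]:
            return                           # irrelevant to the k-NN query
        if hi < d_pos[0]:
            counts[:] += len(node.points)    # whole ball beats every +ve
            return
        if node.left is None:                # leaf: count explicitly
            dists = np.linalg.norm(node.points - Q, axis=1)
            for i in range(k):
                counts[i] += int((dists < d_pos[i]).sum())
            return
        search(node.left)
        search(node.right)

    search(neg_root)
    # The i-th closest +ve point is among the k-NN iff the (i-1) closer
    # positives plus counts[i-1] closer negatives still leave it in the
    # top k. On the slide's example this returns 1.
    return int(np.sum(counts + np.arange(1, k + 1) <= k))
```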
Experimental results
[Bar charts: speedup for K-NN in number of distance computations, and wall-clock-time speedup for K-NN, comparing Naive, KNS1, KNS2, and KNS3 on the datasets ds1, ds2, J_Lee, Blanc, letter, and Movie.]
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Data Structures for Fast K-means
The Auton Lab
Carnegie Mellon University
www.autonlab.org
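One way the cached (count, center-of-mass) pairs pay off, sketched in Python against the KDNode above, in the spirit of Pelleg and Moore (a conservative min/max ownership test stands in for the published one): if a single centroid certainly owns every point in a node, the node contributes its cached sum in O(1).

```python
import numpy as np

def min_dist(lo, hi, c):
    """Closest possible distance from any point in box [lo, hi] to c."""
    return np.linalg.norm(np.maximum(0.0, np.maximum(lo - c, c - hi)))

def max_dist(lo, hi, c):
    """Farthest possible distance from any point in box [lo, hi] to c."""
    return np.linalg.norm(np.maximum(np.abs(c - lo), np.abs(c - hi)))

def assign(node, centroids, sums, counts):
    """Accumulate per-centroid sums and counts for one k-means step."""
    dmax = [max_dist(node.lo, node.hi, c) for c in centroids]
    best = int(np.argmin(dmax))
    if all(dmax[best] < min_dist(node.lo, node.hi, centroids[j])
           for j in range(len(centroids)) if j != best):
        # One centroid certainly owns every point in this node, so we
        # use the cached statistics instead of touching the points.
        counts[best] += node.n
        sums[best] += node.n * node.center   # cached sum = n * center of mass
        return
    if node.left is None:                    # leaf: assign points directly
        d = np.linalg.norm(node.points[:, None, :] - centroids[None], axis=2)
        owner = d.argmin(axis=1)
        for j in range(len(centroids)):
            counts[j] += int((owner == j).sum())
            sums[j] += node.points[owner == j].sum(axis=0)
        return
    assign(node.left, centroids, sums, counts)
    assign(node.right, centroids, sums, counts)

def kmeans_step(root, centroids):
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    assign(root, centroids, sums, counts)
    return sums / np.maximum(counts, 1)[:, None]   # updated centroids
```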
[Figures: some bio-assay data; a GMM clustering of the assay data; the resulting density estimator.]

Three classes of assay, each learned with its own mixture model. (Sorry, this will again be semi-useless in black and white.)

[Figures: the resulting Bayes classifier; the same classifier using posterior probabilities to alert about ambiguity and anomalousness. Yellow means anomalous; cyan means ambiguous.]

[Moore, 1999], [Pelleg and Moore, 2002]
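A hedged sketch of the classify-with-alerts idea using scikit-learn (the thresholds and function names are illustrative assumptions, not the Auton implementation): fit one mixture per class, classify by posterior, and flag small posterior gaps as ambiguous and low densities as anomalous.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X_by_class, n_components=3):
    """One mixture model per class, as on the slide."""
    return [GaussianMixture(n_components).fit(X) for X in X_by_class]

def classify(models, priors, X, ambig_gap=0.1, anom_logdens=-10.0):
    # Per-class log density plus log prior -> posterior over classes.
    logp = np.stack([m.score_samples(X) + np.log(w)
                     for m, w in zip(models, priors)], axis=1)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    label = post.argmax(axis=1)
    top2 = np.sort(post, axis=1)[:, -2:]
    ambiguous = (top2[:, 1] - top2[:, 0]) < ambig_gap   # "cyan"
    anomalous = logp.max(axis=1) < anom_logdens          # "yellow"
    return label, ambiguous, anomalous
```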
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Detecting overdensities

Finding the overdense regions
• Many possible approaches. Examples: Dasgupta and Raftery, 1998; Byers and Raftery, 1998; Cuevas et al., 2000; Reichart et al., 1999.
• All share the same computational primitives.

• Step One: identify the high-density points.
  CFF assumes kernel density estimation. Can be done efficiently with a 2-point-style search.
• Step Two: delete the rest.
• Step Three: find the connected components.
  CFF assumes A and B are in the same component if there's a path between them with all small steps. Can be done efficiently with a 2-point-style search plus an extra trick.
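The three steps in a deliberately slow but runnable Python sketch (scipy/scikit-learn stand in for the fast tree-based searches; the quantile and step size are illustrative assumptions):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph

def overdense_regions(X, density_quantile=0.9, step=0.1):
    # Step One: kernel density estimate at every point.
    dens = gaussian_kde(X.T)(X.T)
    # Step Two: delete all but the high-density points.
    H = X[dens >= np.quantile(dens, density_quantile)]
    # Step Three: connected components, where two points are connected
    # if a path of "all small steps" (<= step) joins them.
    G = radius_neighbors_graph(H, radius=step)
    n, labels = connected_components(G, directed=False)
    return H, labels, n
```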
Results
[Figures: overdense regions found in 4-dimensional Sloan astrophysics color-space data.]
[Wong and Moore, 2002]
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Active Learning of Anomalies
The loop: take a set of records → build a model from the data and the labels so far → run all the data through the model → spot the "important" records → ask the expert to classify them → repeat.
Dan Pelleg
[Pelleg, Moore and Connolly 2004]

[Figures: the anomaly GUI, and anomaly-hunting performance.]
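The loop, sketched in Python (the model and the "importance" score here are simple stand-ins; the 2004 system's are richer and also feed the expert's labels back into the model):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def anomaly_hunt(X, ask_expert, rounds=10, batch=5):
    labels = {}                                   # record index -> label
    for _ in range(rounds):
        model = GaussianMixture(n_components=10).fit(X)   # build model
        score = model.score_samples(X)            # run all data through it
        order = np.argsort(score)                 # least likely first
        important = [int(i) for i in order if int(i) not in labels][:batch]
        for i in important:                       # ask expert to classify
            labels[i] = ask_expert(i)
    return labels
```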
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
GMorph: Fast Galaxy Morphology
• How do you perform 10⁷ large nonlinear optimizations in practical time?
• How do you avoid local optima?
• Idea: pre-cache a "library" of solutions. Use efficient nearest neighbor to match new problems to library entries as seeds.
• Early tests bring galaxy morphology fits down from minutes to sub-seconds.
Brigham Anderson
[Anderson, Bernardi, Connolly, Moore, Nichol, 2004]
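A sketch of the library-seeding idea in Python, assuming a user-supplied render(theta) model and a least-squares fit (hypothetical names throughout; a k-d tree over the rendered images does the fast nearest-neighbor match):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import cKDTree

def build_library(render, thetas):
    """Pre-cache: flatten each rendered model image into a feature vector."""
    feats = np.array([render(t).ravel() for t in thetas])
    return cKDTree(feats), np.asarray(thetas)

def fit_galaxy(image, render, library):
    tree, thetas = library
    _, idx = tree.query(image.ravel())       # nearest library entry
    seed = thetas[idx]                       # seed the local optimizer
    loss = lambda t: np.sum((render(t) - image) ** 2)
    return minimize(loss, seed, method="Nelder-Mead").x
```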
Cached Sufficient Statistics
Ball Trees (= Metric Trees)
K-nearest neighbor with ball trees
Very fast non-parametric classification (skewed binary, general binary, and multi-class outputs)
Very fast kernel-based statistics (n-point computations, clustering, non-parametric clustering / overdensity hunting)
Active learning for anomaly hunting
GMorph: efficient galaxy morphology fitting
Other Auton topics
Other Relevant Auton Topics
• Bayesian networks [Moore and Lee, 1998], [Moore and Wong, 2003]
• "What's strange about recent events?" (Weng-Keen Wong) [Wong, Moore, Cooper and Wagner 2003]
• Spatial scan statistics (Daniel Neill) [Neill, Moore and Wagner, 2004]
• Massively multiple target tracking (Jeremy Kubica)
Conclusions
• Geometry can help the tractability of massive statistical data analysis.
• Cached sufficient statistics are one approach.
• Papers, tutorials, software, examples: www.autonlab.org