Nonlinear Data Transformation with Diffusion Map

advertisement
Nonlinear Data Transformation
Applications
with Diffusion Map
Peter Freeman
Ann Lee
Joey Richards*
Chad Schafer
0.16
!11
x 10
5
0.14
0
!5
0.1
t
3
! "3
0.12
Department of Statistics
Carnegie Mellon University
www.incagroup.org
!10
0.08
!15
0.06
5
0
* now at U.C. Berkeley
!5
!10
!12
x 10
!15
!20
!t
2
"
!25
!6
!4
2
0
!2
4
6
0.04
!7
x 10
t
1
! "
2
1
Richards et al. (2009; ApJ 691, 32)
Predict redshift and identify outliers (Richards, Freeman, Lee, Schafer (2009
93
Data Transformation: Why Do It?
✤
The Problem: Astronomical data that inhabit complex
structures in (high-)dimensional spaces are difficult to analyze
using standard statistical methods. For instance, we may want
to:
✤
✤
✤
✤
Estimate photometric redshifts from galaxy colors
Estimate galaxy parameters (age, metallicity, etc.) from galaxy spectra
Classify supernovae using irregularly spaced photometric observations
The Solution: If these data possess a simpler underlying
geometry in the original data space, we transform the data so
as to capture and exploit that geometry.
✤
✤
Usually (but not always), transforming the data affects dimensionality
reduction, mitigating the “curse of dimensionality.”
We seek to transform the data in such a way as to preserve relevant
physical information whose variation is apparent in the original data.
Data Transformation: Example
cifying the Distances
Note that these may be non-standard
ear the manifold,
with each
n = 500
and
errormay
SD σrepresent
= 0.5.
data (e.g.,
data
point
a vector of values,69like a spectrum).
These data inhabit a
one-dimensional manifold
in a two-dimensional space.
Perhaps a physical parameter
of interest (e.g., redshift) varies
smoothly along the manifold.
We want to transform
the data in such a way that
we can employ simple statistics
(e.g., linear regression) to model
the variation of that physical
parameter. (Accurately.)
The Classic Choice: PCA
cifying the Distances
Principal components analysis
will do a terrible job (at dimension
reduction) in this instance because
it is a linear transformer.
ear the manifold, with n = 500 and error SD σ = 0.5.
69
The Classic Choice: PCA
cifying the Distances
Principal components analysis
will do a terrible job (at dimension
reduction) in this instance because
it is a linear transformer.
In PCA, high-dimensional
data are projected onto
hyperplanes. Physical
information may not be
well-preserved in the transformation.
ear the manifold, with n = 500 and error SD σ = 0.5.
69
Nonlinear Data Transformation
✤
There are many available methods for nonlinear data
transformation which have yet to be widely applied to
astronomical data:
✤
✤
✤
Local linear embedding (LLE; see, e.g., Vanderplas & Connolly 2009)
Others: Laplacian eigenmaps, Hessian eigenmaps, LTSA
We apply the diffusion map (Coifman & Lafon 2006,
Lafon & Lee 2006; see diffusionMap R package).
✤
✤
The Idea: to estimate the “true” distance between two data points via a
fictive diffusion (i.e., Markov random walk) process.
The Advantage: The Euclidean distance between points x and y in the
space of transformed data is approximately the diffusion distance
between those points in the original data space. Thus variations of
physical parameters along the original manifold are approximately
preserved in the new data space.
Diffusion Map: Intuition
on Distances
Pick location...
*
*
t=1
t=2
...set up a kernel...
on the noisy spiral
t = 25
*
*
*
on Distances
ntered on one point
Diffusion Distances
Diffusion Distances
Diffusion Distances
...
*
*
*
72
*
t = 25
t=2
t=1
...and map out the random walk.
Distribution after the 25th step
Distribution after the second step
Yields distribution over points after first step
74
75
76
ize
diffusion
map and map
the and the
Acoordinate
key feature
of SCA
is that
the
choice
ofthe
s(x,
y) is not
crucial, as
, wetheutilize
the diffusion
propose
a new empirical
method for photometric
system
for
data
that
was
absent
in
original
represenit is oftenitsimple
to simple
determine
whether orwhether
not twoordata
points
arepoints are
ermine
optimal
bases
of
simple
is
often
to
determine
not
two
data
based
on the diffusion
mapbases
(Coifman
& ‘similar’.
Lafon
tation. For instance, imagine data in two dimensions that to the eye
hm
to determine
optimal
of simple
ee to
estimate
the
formation
‘similar’.
2006),
which
is star
an approach
spectral conclearly exhibit spiral structure (e.g. fig. 1 of Richards et al. 2009a).
hat
we use
to estimate
the startoformation
We remove
extreme outliers from our data set, not because of their
Wesuch
remove
outliers
from our
data set,
because
SCA). SCA is a suite of established non-linear
For
data,extreme
the Euclidean
distance
between
datanot
points
x and of
y their
effect
on
diffusion
map
construction
(a
hallmark
of
the
diffusion
ics capture
of our the
diffusion
map
and of data by
underlying
geometry
wouldon
notdiffusion
be an optimal
of the
distance
effect
mapdescription
construction
(a‘true’
hallmark
of between
the diffusion
whatthe
basics of
our diffusion
map and
map is its robustness
inthethe
presence
of outliers),
but rather
because
neighbourhood
information
them
along
spiral.
Diffusion
map is of
a leading
example
of an apce
a new component:
thethrough
ap- a Markov
map
is
its
robustness
in
the
presence
outliers),
but
rather
because
dallows
introduce
a
new
component:
the
apthey can bias
the to
coefficients
of
the linear
regression the
model
onePress
to findeta al.
natural
coordinate
system
proach
SCA.the
In the
diffusion
map
framework,
‘true’(see
distance
(see e.g.
1992),
a
they
can
bias
coefficients
of
the
linear
regression
model (see
extension
(see
e.g.
Press
et
al.
1992),
a
Section
2.2)
and
because
we
find
that
individual
predictions
made
tometric
colours
original
parametrization
is estimated via a fictive diffusion process over the data, with one
te
technique
forwhose
estimating
difSection
2.2) and because we find that individual predictions made
nd accurate
technique
for estimating
diffor
these
objects
are
highly
We compute
available
statistical
techniques.
In
Richards
et
al.
proceeding
from x inaccurate.
to y via a random
walk alongthe
theempirical
spiral.
given those of the training set.
for
these
objects
are
highly
inaccurate.
We
compute
the empirical
wrds
objects
given those
of thethetraining
set.
et al. (2009b),
we apply
diffusion
map
We
construct
diffusion
maps
as
follows.
distributions of Euclidean distances in colour space from each object
ork to SDSS data, specifically
distributions
of
Euclidean
distances
iny)colour
space fromrelates
each object
problems.
In Richards
et al. (2009a),
We define
a similarity
measure
s(x,
that quantitatively
rronomical
framework
to SDSS
data, specifically
to
its
n
th
nearest
neighbour,
where
n
∈
[1,
10].
These
distributions
uminous red galaxies
(LRGs),
totwo
its ndata
th nearest
where
n ∈a [1,
These
mework
map and
adaptive
points
xneighbour,
and y.with
In this
work,
data10].
‘point’
is adistributions
vector
Gs) ✤andcombining
luminousdiffusion
red galaxies
(LRGs),
are
well
described
as
exponential,
estimated
mean
and
standard
uracy
onit par
with Digital
that ofSky
more
are
well
described
mean and
and standard
. . .as
, cexponential,
ofmedian
length value
pwith
for estimated
ax̃ single
galaxy,
the
nd
apply
to
Sloan
Survey
(SDSS)
of=
colours
{c/
1 , log(2)
p}
chieve accuracy on par with that
of more
σ̂
=
x̃
for
.
We
exclude
all
deviation
µ̂
n
n
n
n
s.demonstrating
We also apply
our
similarityµ̂measure
apply isfor
themedian
Euclidean
distance
how
it framework
may be used to reduce the
= x̃we
value
x̃n . We exclude all
deviation
n = σ̂n that
n / log(2)
echniques. We also apply
our framework
+5
σ̂
= 6σ̂n ,
data
whose
n
th
nearest
neighbour
is
at
a
distance
>
µ̂
n
n
!
shift
Survey
is predict,
matchedfortoexample, redhe
data
space that
and to
data of
whose
n thp10].
nearest
neighbour
is atper
a distance
>
µ̂n +5σ̂n = 6σ̂n ,
"
alaxy
Redshift
Survey
that
is
matched
to
for
any
value
n
∈
[1,
We
find
that
≈80
cent
of
extreme
$
"
&2
ationally
efficient
manner. Legacy
We also demonstrate
nce–Hawaii
Telescope
# of n%c∈
for
any
value
[1,
10].
We
per cent
of extreme
−
c
. find that ≈80
s(x,
y)
=
x,i
y,i
nada–France–Hawaii
Telescope
Legacy
outliers
are
removed
with
the
first
nearest-neighbour
cut
alone,
with
he
diffusion map
components
analydemonstrate
thattoitprincipal
provides
aci=1
outliers
areremoved
removed
with the
first
nearest-neighbour cut alone, with
008)
and
demonstrate
that
it
provides
acthe
fraction
of
those
falling
as
n
increases.
d,
much
more
commonly
used
linear
technique.
0.75 given four colours alone.
the
of of
those
removed
as
ngraph
increases.
A fraction
keyremoved,
feature
SCA
is that theafalling
choice
of s(x,
y) iswhere
not crucial,
as
With
outliers
we
construct
weighted
the
fts
to
z
≈
0.75
given
four
colours
alone.
(2009b),
we
utilize
the
diffusion
map
and
the
stributions
of photometric and
✤
itWith
isobserved
often
simple
topoints:
determine
whether oranot
two datagraph
pointswhere
are the
outliers
removed,
we construct
weighted
gvariate
algorithm
to determineof
optimal
bases of and
simple
nodes
are
the
data
distributions
photometric
nd DEEP2 are affected by at‘similar’.
nodes
are
( data points:
' the observed
pectra
that
we DEEP2
use to estimate
the star formation
2
SDSS
and
are
affected
by
ats(x,extreme
y)
surement error in the predictor
( our data set, not because of their
' outliers2 from
We remove
s.
,
(1)
w(x,
y)
=
exp
−
s(x,construction
y)
y of
measurement
error
insumthe predictor
s.
Last,
in
Section
4,
we
effect
on
diffusion
map
(a
hallmark
of
the
diffusion
" −
,
(1)
w(x, y) = exp
e review the basics of our diffusion map and
ar
models.
Last,
in
Section
4,
we
sum"
is its robustness in the presence of outliers), but rather because
we can
oura framework
ork,
and extend
introduce
new component: the
ap- " is map
where
a tuning
parameter that should be small enough that
cuss
how
we
can
extend
our
framework
they
can
bias
the coefficients
of the
linear
regression
model
(see that
pectroscopic
coverage
where
" is ax tuning
parameter
that
should
be small
enough
yström
extension
(see e.g.will
Pressbeet al. 1992),
a y) ≈
w(x,
0
unless
and
y
are
similar,
but
large
enough
such
✤
Section
2.2)
and
because
we find
thatsimilar,
individual
predictions
made such
eficient
where
coverage
will bedifandspectroscopic
accurate technique
for estimating
w(x,
y)
≈
0
unless
x
and
y
are
but
large
enough
that the graph
is fully
connected.
(We
discussWehow
we estimate
for
these
objects
are
highly
inaccurate.
compute
the empirical
for new objects given those of the training
set.
that
the
graph
is
fully
connected.
(We
discuss
how
we estimate
" in Sectiondistributions
2.2.) The of
probability
of
stepping
from
x
to
y
in
one
Euclidean
distances
in
colour
space
from
each
object
)probability of stepping from x to
pply our framework to SDSS data, specifically
"
in
Section
2.2.)
The
y in one
y) n =
w(x, neighbour,
y)/ z w(x,
z).
We
store
the one-step
step is p1 (x,
to
its
th
nearest
where
n
∈
[1,
10].
These
distributions
)
ies (MSGs) and luminous red galaxies (LRGs),
y)n data
=
w(x,
We
the one-step
step
is pdescribed
1 (x,all
pointsy)/
in with
anz nw(x,
× nz).
matrix
P;and
then,
arebetween
well
as exponential,
estimated
meanstore
standard
at we achieve accuracy on par with that ofprobabilities
more
probabilities
nprobability
data
an nx̃n×
nfrom
matrix
P; then,
= chains,
σ̂n = x̃nall
/the
log(2)
forpoints
median
. We
exclude
deviation
µ̂n between
by the theory
of Markov
ofinvalue
stepping
x all
tensive techniques. We also apply our framework
t σ̂n = 6σ̂nfrom
✤
bydata
the
theory
ofthe
Markov
chains,
probability
x
,
nby
th
nearest
neighbour
at aof
distance
>µ̂nof+5
y)
the matrix
Pstepping
. The
to ytoin t steps
is whose
given
element
p t (x,isthe
1 Survey that is matched
EEP2 Galaxy Redshift
any
value is
ofgiven
nx∈and
[1,by
10].
find
per
extreme
(x, y)
theofmatrix
Pt . The
tofor
y in
t steps
element
p t≈80
cs
ofCanada–France–Hawaii
diffusion map construcdiffusion distance
between
y the
at We
time
t isthat
defined
asofcent
f the
Telescope Legacy
outliers
are
removed
with the xfirst
nearest-neighbour
cut alone,
the basics
of
diffusion
map
diffusion
distance
between
and
y at time t is defined
as with
details,
weand
refer
the reader
toitconstrucGwyn
2008)
demonstrate
that
provides ac∞
$
Diffusion Map: The Math (Part I)
Define similarity measure between two points x and y, e.g., the
Euclidean distance:
Construct a weighted graph:
Row-normalize to compute “one-step” probabilities:
Use p (x,y) to populate n x n matrix P of one-step probabilities.
observed data points:"
Section
4, we sumed
by atcoverage
willnodes
be are thew(x,
' y) ≈ 20(unless x and y are similar, but large enough such
end
our
framework
y)
predictor
where " iss(x,
a tuning
parameter that should be small enough that
,
(1)
w(x,
y)
=
exp
−
that
the
graph
is
fully
connected.
(We
discuss
cwe
coverage
will
be
sumw(x, y) ≈ 0 "unless x and y are similar, but large enoughhow
such we estimate
in Section
2.2.)
The
probability
of stepping
from x to y in one
the
graph
is fully
connected.
discuss
howthat
we estimate
where " isthat
a"tuning
parameter
that
should
be(We
small
enough
)
Section
stepping
from
x We
to y store
in one the one-step
(x,yThe
y)areprobability
=
w(x,
y)/
w(x,
z).
is xp2.2.)
1and
w(x, y) ≈" in
0step
unless
similar,
large
such
zenough
)butof
y) =
w(x,
y)/
w(x,
z).
We estimate
store
stepprobabilities
isis pfully
1 (x, connected.
between
all
nz data
points
in anthe
n one-step
× n matrix P; then,
that the graph
(We
discuss
how
we
probabilities
between
n data
points
in the
an
×y n
" in Section
2.2.)the
The
probability
of
stepping
from
xn
toprobability
in matrix
one P;
by
theory
of)all
Markov
chains,
ofthen,
stepping from x
by the
theory
of Markov
chains,
the
probability
of stepping from x
y)
=
w(x,
y)/
w(x,
z).
We
store
the
one-step
step is p1 (x,
t
z
(x,
y)
of
the
matrix
P
.
The
to
y
in
t
steps
is
given
by
the
element
p
t
t
y) ofP;
thethen,
matrix P . The
to ybetween
in t stepsallisngiven
by the in
element
probabilities
data points
an n ×p tn(x,
matrix
t as
on
map
construcdiffusion
distance
between
x
and
y
at
time
t
is
defined
ion ✤map
construcThe
probability
of
stepping
from
x
to
y
in
t
steps
is
P
diffusion
distance
between
x and y at of
time
t is defined
by the theory
of Markov
chains,
the probability
stepping
from as
x
efer
reader
t
∞
refer the
the reader
toytoin t steps is given by
∞the$
(x,
y)
of
the
matrix
P
. The
to
element
p
t
$
✤
2t
2
2
andRichards
Richards etet
al.al.
nd
construc(x,t2 (x,
y)between
=y) =λx2t
(ψ
(x)
−
ψ
(
y))
,
Dt2D
diffusion
distance
and
y
at
time
t
is
defined
as
λ
(ψ
(x)
−
ψ
(
y))
,
j
j
j
j
j
j
trast
thetouse
use of
reader
j =1 j =1
∞
ast
the
ofdifdif$
th
λ
j = j largest eigenvalue of P
2
2t
2
ards
et al.
technique,
PCA,
ψ j ( y)) ,eigenvectors thand eigenvalues of P, reDt (x, y) =
j (x)
where λψj j(ψ
and
λj−represent
rartechnique,
PCA,
of P of P, rej = j (right) eigenvector
where ψ j and λj represent ψeigenvectors
and eigenvalues
se of difmaps
in predicting
j =1
spectively. By retaining the m eigenmodes corresponding to the m
aps
in
predicting
ue,
PCA,
spectively.
retainingand
theby
mintroducing
eigenmodes
corresponding
to the m
alaxy
spectra. where ψ largest
representBy
eigenvectors
and
eigenvalues
ofthe
P, diffusion
reeigenvalues
map
✤
j and λj non-trivial
laxy
redicting
utilizespectra.
a local disnon-trivial
the diffusion map
spectively. Bylargest
retaining
corresponding
to the m
t the m eigenmodes
t eigenvalues
t and by introducing
(2)
" t : x &→ [λ1 ψ 1 (x), λ2 ψ 2 (x), . . . , λm ψ m (x)]
ctra.
tilize
a local largest
dis- non-trivial
s.
The eigenmodes
eigenvalues
the diffusion
map
t and by introducing
t
t
:
x
→
&
[λ
ψ
(x),
λ
ψ
(x),
.
.
.
,
λ
ψ
(x)]
(2)
"
ocal
tRp to Rm , we
1
2
m
1
2
m
Thediseigenmodes
have
that
from
t
t
t
:
x
→
&
[λ
ψ
(x),
λ
ψ
(x),
.
.
.
,
λ
(2)
"
t
1
2
1
m ψ m (x)]
enmodes
p2$
m
m
from
R to R2t , we have that2
p
m
2
2
,
we
have
that
from R toDR
by Lee & Wasserman
(x,
y)
(
λ
(ψ
(x)
−
ψ
(
y))
=
||"
(x)
−
"
(
y)||
,
t
t
j
t
jm j
$
m
j
=1
$
2 2t
2t 2
2 2
2
2
y Lee & Wasserman
(x,
y)
(
λ
(ψ
(x)
−
ψ
(
y))
=
||"
(x)
−
"
(
y)||
,
D
Wasserman
t
t
=j ||" t (x) −
Dt (x, y) (
j " t ( y)|| ,
t λj (ψ j (x) − ψ j ( y))
j
✤
amework
e will be
Diffusion Map: The Math (Part II)
.
The diffusion distance between x and y at time t is
Retain the “top” m eigenmodes to create diffusion map:
The tuning parametersj =1ε and m are determined by
AS, MNRAS 398, 2012–2021
minimizing predictive risk (a topic I will skip over in the
S 398, 2012–2021
interests
of time). The choice of t generally does not matter.
S, MNRAS
398, 2012–2021
j =1
The Spiral, Redux
dinate Functions
Coordinate Functions
The first diffusion coordinate
ordinate plot for diffusion map
91
The second diffusion coordinate
Second coordinate plot for diffusion map
92
2014
P. E. Freeman et al.
Application I
sets of !104 galaxies. (However, see Budavári et
propose an incremental methodology for computing
Photometric data sets can, of course, be much large
require a computationally efficient scheme for esti
vectors for new galaxies given those computed for
True Spectroscopic Redshift
galaxies used to train the regression model. A stand
applied mathematics for ‘extending’ a set of eigen
Nyström extension.
The implementation is simple: determine the dist
x 10
0.16
space
from
each
new
galaxy
to
its
nearest
neighbours
2.2 Regression
5
set, then take a weighted average of those neighbours
0.14
As in Richards et al. (2009a), we perform linear regression to predict
Let X represent the n × k matrix containing the col
the function z = r(! t ), where z is true redshift and ! t is a0 vector of
training set, where n and k are the number
of objec
0.12
diffusion coordinates in Rm , representing a vector of photometric
respectively. Let X’ represent a similar n" × k mat
!5
colours x in Rp :
set. The fi
colour data for n" objects in the validation0.1
m
"
Nyström extension is to compute the n" × n weight m
!10
! =
!j "t,j (x)
!
r (! t ) = ! t β
β
elements equivalent to those shown in equation
(1)
0.08
j =1
that there, x and y are both members of the training s
!15
m
m
0.06 set). W
x is a new point while y belongs to the training
"
"
t
"
5
!j λj ψj (x) =
!j ψj (x).
=
β
β
same value !
% as was selected during
0
6 diffusion map
4
!5
0.04of gala
j =1
j =1
since the training set is0 a 2random sample
!10
x 10
!15
!2
x 10 set to be sam
!20
original set, we!4expect
the validation
We see that the choice of the parameter t is unimportant, as changing
!25
!6
t
!j" . We present
!j , with no change in β
t
same
underlying
probability
distribution. We)row-no
it simply leads to a rescaling in β
! "2
! "
2
1 1
dividing by each element
in row i" by ρi " = i Wi " ,i
relevant regression formulae in Appendix A.
(2009;
691, 32) with
Let " beRichards
the n × et
m al.
matrix
ofApJ
eigenvectors
We determine optimal values of the tuning parameters (%, m)
Predictrisk,
redshift
identify outliers
(Richards, Freeman,
Lee, the
Schafer
(2009
vector of eigenvalues
λ. To estimate
eigenvecto
by minimizing estimates of the prediction
R(%, m)and
= E(L),
galaxies, we compute the n" × m matrix ! " :
where E(L) is the expected value of a loss function L over all
possible realizations of the data [one example of L is the so-called
93
"
i.e. the Euclidean distance in the m-dimensional embedding defined
by equation (2) approximates diffusion distance. (We discuss how
we estimate m in Section 2.2, and show that, in this work, the choice
of t is unimportant.) We stress that the diffusion map reparametrizes
the data into a coordinate system that reflects the connectivity of
the data and does not necessarily affect dimension reduction. If the
original parametrization in Rp is sufficiently complex, then it may
be the case that m ! p.
Applications
!11
t
3
✤
Spectroscopic redshift
estimation and outlier
detection using SDSS
galaxy spectra.
Estimation via adaptive
regression:
! "3
✤
!12
!7
Application II
Applications
✤
12
10
3
6
3
8
4
3
2
0
!2
!4
!6
0
!
/(1
!50
2
!!
Estimating properties of
SDSS galaxies (age,
metallicity, etc.) using a
subset of the Bruzual &
Charlot (2003) dictionary
of theoretical galaxy
spectra.
Selection of prototype
spectra made through
diffusion K-means.
! /(1!! ) "
✤
)"
2
!100
2
For estimating properties of
!100
!80
!60
!40
!20
0
! 1/(1!! 1) " 1
Richards et al. (2009; MNRAS 399, 1044)
galaxies (Richards, Freeman, Lee, Schafer (200
95
2015
0.30
0.30
0.20
Freeman et al. (2009; MNRAS 398, 2012)
0.00
0.5
0.4
0.3
0.2
Photometric Redshift Estimate ( z )
0.05
0.10
0.15
0.20
0.25
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
0.00
R
Petro
Photometric Redshift Estimate ( z )
Photometric redshift
estimation for SDSS
Main Sample Galaxies.
✤ Uses Nyström Extension
laxies for quickly predicting
le, we photometric
extract those 360 122redshifts
galaxies with of
> 77.983;
Strauss
nitude test
<17.77set
(or Fdata,
given
the
ur main sample galaxy or MSG sample. We
diffusion
coordinates
0 galaxies
from this sample
to train our regres-of
training
setalgorithm
data.described
on of the
outlier-removal
o the✤ removal of 251 galaxies from this set.
Displays effect of flux
e algorithm outlined in Sections 2.1 and 2.2
measurement
error
upon
! = (0.05, 150),
er estimates
(!
! , m)
i.e. in order
e appropriate,
the 16-dimensional
colour data
predictions:
attenuation
o 150-dimensional
bias. space.
0.35
or more elements of the set {F ! } < 0, we
m analysis. The flux units are nanomaggies;
✤
!
to magnitude m! is m! = 22.5 − 2.5 log10 F ! ,
!
= 2.5 log10 (F !j /F !i ).
o colours is ci−j
f galaxies in our sample is 417 224.
0.10
Applications
0.00
Application III
Photometric Redshift Estimate ( z )
Photometric redshift estimation using SCA
0.05 0.20.10
0.3
0.15
0.4
0.20
0.25 0.5 0.30
0.35
Spectroscopic Redshift (Z)
genvector estimates are independent of those
apply the Nyström extension to validation
Figure 1. Top: predictions for 10 000 randomly selected objects in the MSG
redshift estimation
(Freeman, Newman, Lee, Richards, Sc
at a time, then concatenate the resultingPhotometric
pre! !
Application IV
The Supernova Classification Challenge
✤
✤
Classifying SNe in the
Supernova Photometric
Classification Challenge
(Kessler et al. arXiv:
1001.5210)
See talk by Joey Richards
for more details!
From the training set, Type Ia.
g
i
r
z
Richards et al. (2010; in preparation)
53
Future Application
The Big Picture
Encoding Space
Distribution Space
Data Space
Component 3
Confidence/Credible
Region
Compone
nt 2
Physically Possible Distributions
Com
pon
ent
1
Transform observed light curves and
Once
represented
low-dimensional
space encoding space,
theoretical
lightincurves
to a low-dimensional
nonparametric
density
estimation
useful
comparing
encoding space,
where
they may
beforcompared
observations
and theory density estimation.
using nonparametric
Supernova light curves
Diffusion Map: Challenges
✤
Computational Challenge I: efficient construction of weighted
graph w.
✤
✤
✤
Distance computation slow for high-dimensional data.
Graph may be sparse: can we short-circuit the distance computation?
Computational Challenge II: execution time and memory
requirements for eigen-decomposition of the one-step
probability matrix P.
✤
✤
✤
✤
SVD limited to approximately 10,000 x 10,000 matrices on typical
desktop computers.
Slow: we only need the top n% of eigenvalues and eigenvectors, but
typical SVD implementations compute all of them.
P may be sparse: efficient sparse SVD algorithms?
Would algorithm of Budavári et al. (2009; MNRAS 394, 1496) help?
Diffusion Map: Challenges
✤
Computational Challenge III: efficient implementation of the
Nyström Extension to apply training set results to far larger
test sets.
✤
Predictions for 350,000 SDSS MSGs computed in 10 CPU hours...is
this too slow in the era of LSST?
And One Statistical Challenge...
0.30
0.20
2016
P. E. Freeman et al.
0.25 0.5 0.30
Spectroscopic Redshift (Z)
ose
ion
Figure 1. Top: predictions for 10 000 randomly selected objects in the MSG
pre- ✤ validation
hotometric
redshift
Newman,
Richards,
! ) = (0.05,(Freeman,
set, forestimation
(!
!, m
150). Bottom:
same as Lee,
top, for
the LRGSchafer (2009))
are
! ) = (0.012, 200). In both cases, we remove 5σ
validation set, with (!
!, m
om- ✤ outliers from the sample prior to plotting,
94 thus the actual number of plotted
ing
points is 9740 (top) and 9579 (bottom).
our
0.10
0.15
0.20
0.25
0.05
Spectroscopic Redshift (Z)
Spe
0.02
0.04
Can attenuation bias be effectively mitigated? TBD.
This is not diffusion map specific...
ias
0.020
0.000
0.05
0.35
0.030
0.4
0.20
0.020
0.3
0.15
d Deviation
0.05 0.20.10
0.010
Sample Standard Deviation
0.04
0.02
0.00
−0.04
−0.02
Sample Bias
0.5
0.4
0.3
0.2
Photometric Redshift Estimate ( z )
0.00
0.05
0.10
0.15
0.20
0.25
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
0.00
0.030
0.10
0.00
0.30
0.35
Flux measurement error causes attenuation bias:
Photometric Redshift Estimate ( z )
with
uss
We
resbed
set.
2.2
der
data
Photometric Redshift Estimate ( z )
Applications
we ✤
ies;
F !,
2015
Photometric redshift estimation using SCA
And One Statistical Challenge...
No. 1, 2008
MACHINE LEARNING FOR PROBABI
five-band
ollow-up
of ANNz,
t training
ic survey
talog and
al. 2001).
! 17.77)
2), while
y uniform
me limited
tion).
ANNz
(Collister & Lahav 2004; PASP 116, 345)
Fig. 2.—Spectroscopic vs. photometric redshifts for ANNz applied to 10,000
galaxies randomly selected from the SDSS EDR.
tecture was 5 : 10 : 10 : 1. A committee of five such networks
was trained on the training and validation sets, then applied
to the evaluation set. Figure 2 shows the ANNz photometric
kNN (Ball et al. 2008; ApJ 683, 12)
Fig. 6.— Photometric vs. spectroscopic redshift for the 82,672 SDSS DR5
main sample galaxies of the blind testing set (20% of the sample). Here, zphot is the
mean photometric redshift from the PDF for each object. The result from a single
split (of the 10 used for validation) of the data into training and blind testing data
is shown. Here, ! is the RMS dispersion between zphot and zspec . [See the electronic edition of the Journal for a color version of this figure.]
tro
Summary
✤
✤
✤
Methods of nonlinear data transformation such as diffusion
map can help make statistical analyses of complex (and perhaps
high-dimensional) data tractable.
Analyses with diffusion map generally outperform (i.e., result
in a lower predictive risk) similar analyses with PCA, a linear
technique.
Nonlinear techniques have great promise in the era of LSST, so
long as certain computational challenges are overcome. We
seek
✤
✤
✤
✤
Optimal construction of weighted graphs
Optimal implementations of SVD (memory, execution time, sparsity)
Optimal implementation of the Nyström Extension
Regardless of whether the challenges are overcome, the
accuracy of our results may be limited by measurement error.
Predictive Risk: an Algorithm
✤
✤
✤
Pick tuning parameter values ε and m.
Transform the data into diffusion space.
Perform k-fold cross-validation on the transformed data:
✤
✤
✤
✤
✤
Assign each datum to one of k groups.
Fit model (e.g., linear regression) to the data in k-1 groups (i.e., leave the
data of the kth group out of the fit).
Given best-fit model, compute estimate ŷi for all data in the kth group.
Repeat process until all k groups have been held out.
1
Assuming the L2 (squared-error) loss function, our estimate of
the predictive risk is generally
n
✤
1"
!
R(!, m) =
[!
yj (!, m) − Yj ]2
n j=1
We vary ε and m until the predictive risk estimate is minimized.
Nyström Extension
✤
✤
The basic idea: compute the similarity of a test set datum to
the training set data, and use that similarity to determine the
diffusion coordinate for that
interpolation,
4 datum
P. E.via
Freeman
et al. with no
eigen-decomposition.
Mathematically:
Ψ! :
!
Ψ = WΨΛ ,
✤
where Λ is a m × m diagonal matrix with entries
W is the matrix of similarities
between
the testfor
setthe
data
theare
the redshift
predictions
n! and
objects
training set data, while Λwhere
is a diagonal
with entries
1/λi. ge
βb are the matrix
linear regression
coefficients
the original training set.
3
APPLICATION TO SDSS AND DEEP
Download