Nonlinear Data Transformation: Applications with Diffusion Map

Peter Freeman, Ann Lee, Joey Richards*, Chad Schafer
Department of Statistics, Carnegie Mellon University
www.incagroup.org
* now at U.C. Berkeley

[Figure: SDSS galaxy spectra embedded in the first three diffusion coordinates (\lambda_1^t \psi_1, \lambda_2^t \psi_2, \lambda_3^t \psi_3), colour-coded by redshift. Predict redshift and identify outliers: Richards et al. (2009; ApJ 691, 32).]

Data Transformation: Why Do It?

✤ The Problem: Astronomical data that inhabit complex structures in (high-)dimensional spaces are difficult to analyze using standard statistical methods. For instance, we may want to:
  ✤ estimate photometric redshifts from galaxy colors;
  ✤ estimate galaxy parameters (age, metallicity, etc.) from galaxy spectra;
  ✤ classify supernovae using irregularly spaced photometric observations.
✤ The Solution: If these data possess a simpler underlying geometry in the original data space, we transform the data so as to capture and exploit that geometry.
  ✤ Usually (but not always), transforming the data effects dimension reduction, mitigating the "curse of dimensionality."
  ✤ We seek to transform the data in such a way as to preserve relevant physical information whose variation is apparent in the original data.

Data Transformation: Example

[Figure: noisy spiral — data scattered near a one-dimensional manifold, with n = 500 and error SD σ = 0.5.]

These data inhabit a one-dimensional manifold in a two-dimensional space. Note that such data may be non-standard (e.g., each data point may represent a vector of values, like a spectrum). Perhaps a physical parameter of interest (e.g., redshift) varies smoothly along the manifold. We want to transform the data in such a way that we can employ simple statistics (e.g., linear regression) to accurately model the variation of that physical parameter.

The Classic Choice: PCA

Principal components analysis will do a terrible job (at dimension reduction) in this instance because it is a linear method: in PCA, high-dimensional data are projected onto hyperplanes, and physical information may not be well preserved in the transformation.

Nonlinear Data Transformation

✤ There are many available methods for nonlinear data transformation that have yet to be widely applied to astronomical data:
  ✤ local linear embedding (LLE; see, e.g., Vanderplas & Connolly 2009);
  ✤ others: Laplacian eigenmaps, Hessian eigenmaps, LTSA.
✤ We apply the diffusion map (Coifman & Lafon 2006, Lafon & Lee 2006; see the diffusionMap R package).
  ✤ The Idea: estimate the "true" distance between two data points via a fictive diffusion (i.e., Markov random walk) process.
  ✤ The Advantage: the Euclidean distance between points x and y in the space of transformed data is approximately the diffusion distance between those points in the original data space. Thus variations of physical parameters along the original manifold are approximately preserved in the new data space.

Diffusion Map: Intuition on Distances

Pick a location on the noisy spiral... set up a kernel centered on that point... and map out the random walk (t = 1, 2, 25).
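To make the running example reproducible, here is a minimal Python/NumPy sketch of data of this kind; the slide specifies only n = 500 and error SD σ = 0.5, so the spiral's extent and winding rate below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy one-dimensional spiral in 2-D: n = 500 points, error SD sigma = 0.5,
# as in the figure; the winding rate (theta up to 3*pi) is an assumption.
n, sigma = 500, 0.5
theta = rng.uniform(0.5, 3 * np.pi, size=n)      # parameter along the manifold
X = np.column_stack([theta * np.cos(theta),
                     theta * np.sin(theta)])
X += rng.normal(scale=sigma, size=X.shape)       # off-manifold scatter

# A physical parameter (e.g., redshift) varying smoothly along the manifold:
z = (theta - theta.min()) / (theta.max() - theta.min())
```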
[Figure: the random walk's distribution over points after the first step, after the second step, and after the 25th step.]

Background, adapted from Freeman et al. (2009; MNRAS 398, 2012): We propose a new empirical method for photometric redshift estimation based on the diffusion map (Coifman & Lafon 2006), an approach to spectral connectivity analysis (SCA), a suite of established nonlinear techniques for capturing the underlying geometry of data. For instance, imagine data in two dimensions that to the eye clearly exhibit spiral structure (e.g., fig. 1 of Richards et al. 2009a). For such data, the Euclidean distance between data points x and y would not be an optimal description of the 'true' distance between them along the spiral. Diffusion map is a leading example of an approach to SCA: in the diffusion map framework, the 'true' distance is estimated via a fictive diffusion process over the data, with one proceeding from x to y via a random walk along the spiral, which allows one to find a natural coordinate system for data whose original parametrization obscures such structure. The diffusion map has been applied to astronomical problems in Richards et al. (2009a,b), for example to determine optimal bases of simple spectra with which to estimate galaxy star formation histories. In this work, we apply the framework to SDSS data, specifically main-sample galaxies (MSGs) and luminous red galaxies (LRGs), and demonstrate that we achieve accuracy on par with that of more computationally intensive techniques. We also apply the framework to DEEP2 Galaxy Redshift Survey data matched to Canada-France-Hawaii Telescope Legacy Survey photometry (Gwyn 2008), demonstrating that it provides accurate redshifts to z ≈ 0.75 given four colours alone; we examine how the distributions of photometric colours observed by SDSS and DEEP2 are affected by measurement error in the predictors; and we discuss how the framework can be extended to regions where spectroscopic coverage will be deficient, via the Nyström extension (see, e.g., Press et al. 1992), a fast and accurate technique for estimating diffusion coordinates for new objects given those of the training set.

We construct diffusion maps as follows. We define a similarity measure s(x, y) that quantitatively relates two data points x and y. A key feature of SCA is that the choice of s(x, y) is not crucial, as it is often simple to determine whether or not two data points are 'similar'.
In this work, a data 'point' x is a vector of colours {c_1, ..., c_p} of length p for a single galaxy, and the similarity measure we apply is the Euclidean distance

  s(x, y) = \left[ \sum_{i=1}^{p} (c_{x,i} - c_{y,i})^2 \right]^{1/2}.

We remove extreme outliers from our data set, not because of their effect on diffusion map construction (a hallmark of the diffusion map is its robustness in the presence of outliers), but rather because they can bias the coefficients of the linear regression model (see Section 2.2) and because we find that individual predictions made for these objects are highly inaccurate. We compute the empirical distributions of Euclidean distances in colour space from each object to its nth nearest neighbour, where n ∈ [1, 10]. These distributions are well described as exponential, with estimated mean and standard deviation \hat{\mu}_n = \hat{\sigma}_n = \tilde{x}_n / \log(2), where \tilde{x}_n is the median value. We exclude all data whose nth nearest neighbour lies at a distance > \hat{\mu}_n + 5\hat{\sigma}_n = 6\hat{\sigma}_n for any value of n ∈ [1, 10]. We find that ≈80 per cent of extreme outliers are removed with the first nearest-neighbour cut alone, with the fraction of those removed falling as n increases.

With outliers removed, we construct a weighted graph whose nodes are the observed data points:

  w(x, y) = \exp\left( -\frac{s(x, y)^2}{\epsilon} \right),    (1)

where \epsilon is a tuning parameter that should be small enough that w(x, y) ≈ 0 unless x and y are similar, but large enough that the graph is fully connected. (We discuss how we estimate \epsilon in Section 2.2.)
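A sketch of this nearest-neighbour outlier cut follows, assuming a brute-force pairwise distance matrix is affordable (reasonable at n ≈ 10^4); outlier_mask is a hypothetical helper name, not code from the paper.

```python
import numpy as np

def outlier_mask(X, n_max=10):
    """Flag extreme outliers via nearest-neighbour distances: model each
    n-th NN distance distribution as exponential with median x_n, so that
    mu_n = sigma_n = x_n / log(2), and cut at mu_n + 5*sigma_n = 6*sigma_n."""
    # Brute-force pairwise Euclidean distances (fine for ~10^4 objects).
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)                        # column j = distance to j-th NN (col 0 = self)
    keep = np.ones(len(X), dtype=bool)
    for n in range(1, n_max + 1):
        sigma_n = np.median(d[:, n]) / np.log(2)
        keep &= d[:, n] <= 6 * sigma_n    # exclude if beyond 6*sigma_n for any n
    return keep
```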
The probability of stepping from x to y in one step is p_1(x, y) = w(x, y) / \sum_z w(x, z). We store the one-step probabilities between all n data points in an n × n matrix P; then, by the theory of Markov chains, the probability of stepping from x to y in t steps is given by the element p_t(x, y) of the matrix P^t.

Diffusion Map: The Math (Part I)

✤ Define a similarity measure between two points x and y, e.g., the Euclidean distance s(x, y).
✤ Construct a weighted graph: w(x, y) = \exp(-s(x, y)^2 / \epsilon).
✤ Row-normalize to compute "one-step" probabilities: p_1(x, y) = w(x, y) / \sum_z w(x, z).
✤ Use p_1(x, y) to populate the n × n matrix P of one-step probabilities.
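A minimal sketch of these graph-construction steps with dense NumPy arrays; transition_matrix is a hypothetical name, and ε is left as an input since its estimation is discussed later.

```python
import numpy as np

def transition_matrix(X, eps):
    """One-step Markov matrix P for data X (rows = points), per equation (1)."""
    s2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # s(x, y)^2
    W = np.exp(-s2 / eps)                                  # weighted graph, eq. (1)
    return W / W.sum(axis=1, keepdims=True)                # p1(x, y) = w / sum_z w(x, z)
```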
The diffusion distance between x and y at time t is defined as

  D_t^2(x, y) = \sum_{j=1}^{\infty} \lambda_j^{2t} \left[ \psi_j(x) - \psi_j(y) \right]^2,

where \lambda_j is the jth largest eigenvalue of P and \psi_j the corresponding (right) eigenvector. By retaining the m eigenmodes corresponding to the m largest non-trivial eigenvalues, and by introducing the diffusion map

  \Psi_t : x \mapsto \left[ \lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_m^t \psi_m(x) \right],    (2)

from R^p to R^m, we have that (Lee & Wasserman)

  D_t^2(x, y) \approx \sum_{j=1}^{m} \lambda_j^{2t} \left[ \psi_j(x) - \psi_j(y) \right]^2 = \| \Psi_t(x) - \Psi_t(y) \|^2,

i.e., the Euclidean distance in the m-dimensional embedding defined by equation (2) approximates the diffusion distance. (We discuss how we estimate m in Section 2.2, and show that, in this work, the choice of t is unimportant.) We stress that the diffusion map reparametrizes the data into a coordinate system that reflects the connectivity of the data and does not necessarily effect dimension reduction: if the original parametrization in R^p is sufficiently complex, it may be the case that m > p.

Diffusion Map: The Math (Part II)

✤ The diffusion distance between x and y at time t is D_t^2(x, y) = \sum_j \lambda_j^{2t} [\psi_j(x) - \psi_j(y)]^2.
✤ Retain the "top" m eigenmodes to create the diffusion map \Psi_t.
✤ The tuning parameters \epsilon and m are determined by minimizing predictive risk (a topic I will skip over in the interests of time; see the backup slides). The choice of t generally does not matter.

The Spiral, Redux

[Figure: the first and second diffusion coordinates plotted along the spiral.]
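A sketch of the corresponding eigendecomposition step; a production code would exploit the structure of P (see the Challenges slides), but numpy.linalg.eig suffices for illustration, and diffusion_map is a hypothetical helper name.

```python
import numpy as np

def diffusion_map(P, m, t=1):
    """Top-m diffusion coordinates (equation 2) from the Markov matrix P."""
    lam, psi = np.linalg.eig(P)               # right eigenvectors of P
    order = np.argsort(-np.abs(lam))
    lam = lam[order].real                     # eigenvalues are real for this P
    psi = psi[:, order].real
    # Drop the trivial mode (lambda = 1, constant eigenvector); keep the next m.
    lam, psi = lam[1:m + 1], psi[:, 1:m + 1]
    coords = psi * lam**t                     # Psi_t(x) = [lam_j^t psi_j(x)]
    return coords, lam, psi
```

Euclidean distances between rows of the returned coordinates approximate the diffusion distance D_t; the first two columns are the coordinates plotted in "The Spiral, Redux."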
Applications

Application I

✤ Spectroscopic redshift estimation and outlier detection using SDSS galaxy spectra. Estimation via adaptive regression.

[Figure: SDSS galaxy spectra embedded in the first three diffusion coordinates, colour-coded by true spectroscopic redshift. Predict redshift and identify outliers: Richards et al. (2009; ApJ 691, 32).]

2.2 Regression (from Freeman et al. 2009)

As in Richards et al. (2009a), we perform linear regression to predict the function z = r(\Psi_t), where z is the true redshift and \Psi_t is a vector of diffusion coordinates in R^m, representing a vector of photometric colours x in R^p:

  \hat{r}(\Psi_t) = \Psi_t \hat{\beta} = \sum_{j=1}^{m} \hat{\beta}_j \lambda_j^t \psi_j(x) = \sum_{j=1}^{m} \tilde{\beta}_j \psi_j(x).

We see that the choice of the parameter t is unimportant, as changing t simply leads to a rescaling of \hat{\beta}_j, with no change in \tilde{\beta}_j. We present the relevant regression formulae in Appendix A. We determine optimal values of the tuning parameters (\epsilon, m) by minimizing estimates of the prediction risk, R(\epsilon, m) = E(L), where E(L) is the expected value of a loss function L over all possible realizations of the data (one example of L is the so-called L2, or squared-error, loss).
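In code, the regression step is ordinary least squares on the diffusion coordinates. A sketch, reusing diffusion_map from above; the intercept column is an assumption (the paper's exact design matrix is given in its Appendix A), and both function names are hypothetical.

```python
import numpy as np

def fit_redshift(coords, z):
    """Least-squares fit of z = r(Psi_t); an intercept column is included
    here, which is an assumption rather than the paper's exact design."""
    A = np.column_stack([np.ones(len(coords)), coords])
    beta, *_ = np.linalg.lstsq(A, z, rcond=None)
    return beta

def predict_redshift(coords, beta):
    """Evaluate the fitted linear model at (new) diffusion coordinates."""
    return np.column_stack([np.ones(len(coords)), coords]) @ beta
```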
Computing eigenvectors directly is feasible for sets of ~10^4 galaxies (however, see Budavári et al., who propose an incremental methodology for computing eigenvectors). Photometric data sets can, of course, be much larger, so we require a computationally efficient scheme for estimating eigenvectors for new galaxies given those computed for the galaxies used to train the regression model. A standard method in applied mathematics for 'extending' a set of eigenvectors is the Nyström extension. The implementation is simple: determine the distance in colour space from each new galaxy to its nearest neighbours in the training set, then take a weighted average of those neighbours' diffusion coordinates. Let X represent the n × k matrix containing the colour data for the training set, where n and k are the numbers of objects and colours, respectively, and let X' represent a similar n' × k matrix for the validation set. The first step of the Nyström extension is to compute the n' × n weight matrix W', with elements equivalent to those shown in equation (1) (except that there, x and y are both members of the training set, while here x is a new point and y belongs to the training set). We use the same value of \epsilon as was selected during diffusion map construction; since the training set is a random sample of galaxies from the original set, we expect the validation set to be sampled from the same underlying probability distribution. We row-normalize W' by dividing each element in row i' by \rho_{i'} = \sum_i W'_{i',i}. Let \Psi be the n × m matrix of eigenvectors, with vector of eigenvalues \lambda; to estimate the eigenvectors for new galaxies, we compute the n' × m matrix \Psi' (see the Nyström Extension slide below). Because these eigenvector estimates are independent of one another, we can apply the extension to validation objects in batches, then concatenate the resulting predictions.

Application II

✤ Estimating properties of SDSS galaxies (age, metallicity, etc.) using a subset of the Bruzual & Charlot (2003) dictionary of theoretical galaxy spectra. Selection of prototype spectra made through diffusion K-means.

[Figure: prototype spectra embedded in rescaled diffusion coordinates \lambda_j \psi_j / (1 - \lambda_j). Estimating the properties of galaxies: Richards et al. (2009; MNRAS 399, 1044).]

Application III

✤ Photometric redshift estimation for SDSS Main Sample Galaxies, using SCA.
✤ Uses the Nyström Extension for quickly predicting photometric redshifts of test-set data, given the diffusion coordinates of training-set data.
✤ Displays the effect of flux measurement error upon predictions: attenuation bias.

From Freeman et al. (2009; MNRAS 398, 2012): We extract those 360 122 galaxies with Petrosian r-band magnitude < 17.77 (or flux F > 77.983; Strauss et al.) — our main sample galaxy, or MSG, sample. If one or more elements of the set {F_\ell} are < 0, we remove the galaxy from the analysis. The flux units are nanomaggies; the conversion from flux F_\ell to magnitude m_\ell is m_\ell = 22.5 - 2.5 \log_{10} F_\ell, and the conversion to colours is c_{i-j} = 2.5 \log_{10}(F_j / F_i). The number of galaxies in our sample is 417 224. We use a subset of galaxies from this sample to train our regression; application of the outlier-removal algorithm described above leads to the removal of 251 galaxies from this training set. The algorithm outlined in Sections 2.1 and 2.2 yields tuning-parameter estimates (\hat{\epsilon}, \hat{m}) = (0.05, 150), i.e., the 16-dimensional colour data are mapped to a 150-dimensional diffusion space.

[Figure 1: Top: predictions for 10 000 randomly selected objects in the MSG validation set, for (\hat{\epsilon}, \hat{m}) = (0.05, 150). Bottom: same as top, for the LRG validation set, with (\hat{\epsilon}, \hat{m}) = (0.012, 200). In both cases, we remove 5σ outliers from the sample prior to plotting; thus the actual numbers of plotted points are 9740 (top) and 9579 (bottom).]

Application IV: The Supernova Classification Challenge

✤ Classifying SNe in the Supernova Photometric Classification Challenge (Kessler et al., arXiv:1001.5210).
✤ See the talk by Joey Richards for more details!

[Figure: g, r, i and z light curves of a Type Ia supernova from the training set. Richards et al. (2010; in preparation).]

Future Application: The Big Picture

Transform observed and theoretical supernova light curves to a low-dimensional encoding space; once both are represented there, they may be compared using nonparametric density estimation.

[Schematic: Data Space (supernova light curves) → Encoding Space (components 1-3) → Distribution Space, showing the physically possible distributions and a confidence/credible region.]

Diffusion Map: Challenges

✤ Computational Challenge I: efficient construction of the weighted graph w.
  ✤ Distance computation is slow for high-dimensional data.
  ✤ The graph may be sparse: can we short-circuit the distance computation?
✤ Computational Challenge II: execution time and memory requirements for the eigendecomposition of the one-step probability matrix P (see the sketch after this list).
  ✤ SVD is limited to approximately 10,000 x 10,000 matrices on typical desktop computers.
  ✤ Slow: we only need the top n% of eigenvalues and eigenvectors, but typical SVD implementations compute all of them.
  ✤ P may be sparse: efficient sparse SVD algorithms?
  ✤ Would the algorithm of Budavári et al. (2009; MNRAS 394, 1496) help?
✤ Computational Challenge III: efficient implementation of the Nyström Extension, to apply training-set results to far larger test sets.
  ✤ Predictions for 350,000 SDSS MSGs computed in 10 CPU hours...is this too slow in the era of LSST?
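Regarding Computational Challenge II, partial eigensolvers compute only the leading eigenmodes rather than the full decomposition; a sketch using SciPy's ARPACK wrapper scipy.sparse.linalg.eigs (top_eigenmodes is a hypothetical name, and whether this scales to LSST-sized samples is exactly the open question).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def top_eigenmodes(W, m):
    """Leading m + 1 eigenpairs of the row-normalized matrix P = D^-1 W,
    computed with ARPACK instead of a full decomposition; W may be sparse."""
    W = sp.csr_matrix(W)
    inv_deg = 1.0 / np.asarray(W.sum(axis=1)).ravel()
    P = sp.diags(inv_deg) @ W                     # row-stochastic P
    lam, psi = spla.eigs(P, k=m + 1, which='LM')  # largest-magnitude modes only
    return lam.real, psi.real
```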
And One Statistical Challenge...

✤ Flux measurement error causes attenuation bias (measurement error in the predictors biases the regression coefficients toward zero).

[Figure: photometric vs. spectroscopic redshift, and the sample bias and sample standard deviation of the estimates as functions of redshift, for the MSG validation set (Freeman et al. 2009; MNRAS 398, 2012).]

✤ Can attenuation bias be effectively mitigated? TBD.
✤ This is not diffusion-map specific: other machine-learning photo-z estimators exhibit the same behaviour.

[Figure: spectroscopic vs. photometric redshifts for ANNz (Collister & Lahav 2004; PASP 116, 345) — a committee of five 5:10:10:1 networks — applied to 10,000 galaxies randomly selected from the SDSS EDR; and photometric vs. spectroscopic redshift for kNN (Ball et al. 2008; ApJ 683, 12) for the 82,672 SDSS DR5 main-sample galaxies of the blind testing set, where z_phot is the mean photometric redshift from the PDF for each object.]

Summary

✤ Methods of nonlinear data transformation such as the diffusion map can help make statistical analyses of complex (and perhaps high-dimensional) data tractable.
✤ Analyses with the diffusion map generally outperform (i.e., result in a lower predictive risk than) similar analyses with PCA, a linear technique.
✤ Nonlinear techniques hold great promise in the era of LSST, so long as certain computational challenges are overcome. We seek:
  ✤ optimal construction of weighted graphs;
  ✤ optimal implementations of SVD (memory, execution time, sparsity);
  ✤ optimal implementation of the Nyström Extension.
✤ Regardless of whether these challenges are overcome, the accuracy of our results may be limited by measurement error.

Predictive Risk: an Algorithm

✤ Pick tuning parameter values \epsilon and m, and transform the data into diffusion space.
✤ Perform k-fold cross-validation on the transformed data (a code sketch follows this slide):
  ✤ Assign each datum to one of k groups.
  ✤ Fit the model (e.g., linear regression) to the data in k − 1 groups (i.e., leave the data of the kth group out of the fit).
  ✤ Given the best-fit model, compute the estimate \hat{y}_i for all data in the held-out group.
  ✤ Repeat until each of the k groups has been held out once.
✤ Assuming the L2 (squared-error) loss function, our estimate of the predictive risk is

  \hat{R}(\epsilon, m) = \frac{1}{n} \sum_{j=1}^{n} \left[ \hat{y}_j(\epsilon, m) - Y_j \right]^2.

✤ We vary \epsilon and m until the predictive risk estimate is minimized.
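A near-direct transcription of this algorithm, assuming the transition_matrix, diffusion_map, fit_redshift and predict_redshift sketches from the earlier slides; predictive_risk is a hypothetical name.

```python
import numpy as np

def predictive_risk(X, y, eps, m, k=10, seed=0):
    """k-fold cross-validation estimate of R(eps, m) under L2 loss, using
    the helper sketches defined on earlier slides."""
    coords, _, _ = diffusion_map(transition_matrix(X, eps), m)
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    sse = 0.0
    for hold in folds:                                   # leave one group out
        train = np.setdiff1d(np.arange(len(X)), hold)
        beta = fit_redshift(coords[train], y[train])
        sse += ((predict_redshift(coords[hold], beta) - y[hold]) ** 2).sum()
    return sse / len(X)                                  # R_hat(eps, m)

# Vary eps and m over a grid and keep the pair that minimizes predictive_risk.
```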
Nyström Extension

✤ The basic idea: compute the similarity of a test-set datum to the training-set data, and use that similarity to determine the diffusion coordinates for that datum via interpolation, with no eigendecomposition.
✤ Mathematically:

  \Psi' = W' \Psi \Lambda,

where W' is the (row-normalized) matrix of similarities between the test-set data and the training-set data, \Psi is the n × m matrix of training-set eigenvectors, and \Lambda is the m × m diagonal matrix with entries 1/\lambda_i. The redshift predictions for the n' new objects are then \hat{z} = \Psi' \hat{\beta}, where \hat{\beta} are the linear regression coefficients fit to the original training set.
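A sketch of the extension as stated on this slide, using the same kernel and ε as in training; nystrom_extend is a hypothetical name.

```python
import numpy as np

def nystrom_extend(X_new, X_train, psi, lam, eps):
    """Psi' = W' Psi Lambda, with Lambda = diag(1 / lambda_i) and W'
    row-normalized: eigenvector estimates for new points by interpolation."""
    s2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    W = np.exp(-s2 / eps)                     # similarities to the training set
    W /= W.sum(axis=1, keepdims=True)         # divide row i' by rho_i'
    return W @ psi @ np.diag(1.0 / lam)       # no eigendecomposition needed
```

Scaling column j of the result by lam[j]**t then gives the new diffusion coordinates, and multiplying by the training-set regression coefficients (prepending an intercept column if one was used in the fit) yields the redshift predictions \hat{z}.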