KL Paper Draft

Cleansing Noise from Discretized Data Sets Using the Karhunen-Loève Transform
1. INTRODUCTION
Over the past thirty years, advances in digital technology (e.g. CCD cameras, high capacity data
servers) have transformed the field of astronomy into a true precision science. As scientific
capability improves, errors once thought too small to be dealt with are emerging as the largest
sources of uncertainty left to be removed from data sets.
Among these types of errors are the photometric calibration errors associated with the Sloan
Digital Sky Survey telescope (hereafter SDSS)…
In Section 2, I’ll ….
2. The Karhunen-Loève Method
The Karhunen-Loève transform is a unique linear algebra tool that transforms one's given basis
into a new basis whose vectors are ordered by how strongly they capture the correlations between
data elements. In the context of CMB measurements,
one’s given basis would likely be the individual pixels that measure temperature. In galaxy
surveys, the given basis is more naturally chosen to be distinct sections of space, or cells, inside
which one measures some quantity such as galaxy counts. In any case, the given basis can easily
extend into tens of thousands of dimensions, making the removal of errors from that many data
elements a challenging task without the proper method.
The Karhunen-Loève method first requires that the user assemble an N-dimensional noise
vector with each dimension corresponding to a distinct cell of the discretized data set. In this
description, let 𝜹 represent the column vector containing the errors one wishes to eliminate.
Details on constructing such a vector will be provided in Section 6.
In many cases, the error one seeks to remove is the result of a random process, making 𝜹 itself a
random variable. Under these circumstances there will not exist a single error vector, but in
principle, an infinite number of possible vectors. To fully encapsulate the distribution from
which a given error originates, one needs to perform an averaging over many realizations. The
Karhunen-Loève method requires that this be done by creating “noise matrices”, 𝜼, by taking the
outer products of the 𝜹 vectors. Letting 𝛿𝑖𝛼 represent the error in the ith cell during the αth
realization, the αth noise matrix may be represented thusly:
\[
\boldsymbol{\eta}^{\alpha} = \boldsymbol{\delta}^{\alpha}\,\boldsymbol{\delta}^{\alpha *T} =
\begin{pmatrix}
(\delta_{1}^{\alpha})^{2} & \delta_{1}^{\alpha}\delta_{2}^{\alpha} & \cdots & \delta_{1}^{\alpha}\delta_{N}^{\alpha} \\
\delta_{2}^{\alpha}\delta_{1}^{\alpha} & (\delta_{2}^{\alpha})^{2} & \cdots & \delta_{2}^{\alpha}\delta_{N}^{\alpha} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{N}^{\alpha}\delta_{1}^{\alpha} & \delta_{N}^{\alpha}\delta_{2}^{\alpha} & \cdots & (\delta_{N}^{\alpha})^{2}
\end{pmatrix}
\]
By summing these matrices and dividing by the number of realizations, M, one effectively takes
the average, or expectation value, of the noise matrix:
\[
\text{Average Noise Matrix} \equiv \boldsymbol{\varepsilon} = E\left[\boldsymbol{\eta}^{\alpha}\right] = \frac{1}{M}\sum_{\alpha} \boldsymbol{\eta}^{\alpha}
\]
By definition, the expectation value of the product of a real column vector with its transpose is a
correlation matrix. If we represent e as an eigenvector and λ an eigenvalue then
𝜺𝒆 = 𝜆𝒆
Correlation matrices are Hermitian, positive semi-definite, and perhaps most importantly,
their eigenvectors can be chosen to be orthonormal. Therefore,
\[
\boldsymbol{e}_{i}^{*T}\,\boldsymbol{\varepsilon}\,\boldsymbol{e}_{j} = \lambda_{j}\,\boldsymbol{e}_{i}^{*T}\boldsymbol{e}_{j} =
\begin{cases}
\lambda_{j}, & i = j \\
0, & i \neq j
\end{cases}
\]
A matrix representation of this relationship would be
\[
\begin{pmatrix}
\boldsymbol{e}_{1}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{1}] & \boldsymbol{e}_{1}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{2}] & \cdots & \boldsymbol{e}_{1}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{N}] \\
\boldsymbol{e}_{2}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{1}] & \boldsymbol{e}_{2}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{2}] & \cdots & \boldsymbol{e}_{2}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{N}] \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{e}_{N}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{1}] & \boldsymbol{e}_{N}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{2}] & \cdots & \boldsymbol{e}_{N}^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_{N}]
\end{pmatrix}
=
\begin{pmatrix}
\lambda_{1} & 0 & \cdots & 0 \\
0 & \lambda_{2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_{N}
\end{pmatrix}
\]
If
\[
\boldsymbol{E} =
\begin{pmatrix}
| & | & & | \\
\boldsymbol{e}_{1} & \boldsymbol{e}_{2} & \cdots & \boldsymbol{e}_{N} \\
| & | & & |
\end{pmatrix}
\]
then this transform may be rewritten as
\[
\boldsymbol{E}^{*T}\boldsymbol{\varepsilon}\,\boldsymbol{E} =
\begin{pmatrix}
\lambda_{1} & 0 & \cdots & 0 \\
0 & \lambda_{2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_{N}
\end{pmatrix}
\]
These eigenvectors stored in E provide an attractive opportunity. Collectively they form a
complete, orthonormal set of basis vectors, allowing one's signal data to be projected onto these
new vectors without any overlap between modes. If it were possible to identify the
eigenmodes of the average noise matrix most chiefly responsible for errors, then one could
identify the amount of signal that fell onto each “large” noise mode and subtract it out of the
signal itself.
Let s represent the N-dimensional signal from a discretized data set that one wishes to clean.
This signal can be expanded in terms of the orthogonal set of basis vectors from the KL
transform:
𝒔 = 𝜅1 𝒆1 + 𝜅2 𝒆2 + ⋯ + 𝜅𝑁 𝒆𝑁
Because the eigenvectors 𝒆1 , 𝒆2 , etc. are orthonormal, the coefficients are found to be the inner
product of each noise eigenvector with the signal.
\[
\kappa_{i} = \sum_{n=1}^{N} e_{i}^{*}[n]\, s[n]
\]
A unique feature of the KL transform is its ability to optimally represent a truncated expansion of
a signal s. Instead of using all N eigenvectors and coefficients, the signal can be most faithfully
reproduced using M < N modes if those M modes correspond to the M largest eigenvalues, λ.
If 𝑀′ is the number of significant noise modes (i.e. modes with large eigenvalues), then the noise
itself can be represented by a truncated basis with dimensionality much less than N.
𝑠𝑛𝑜𝑖𝑠𝑒 = 𝜅1 𝒆1 + 𝜅2 𝒆2 + ⋯ + 𝜅𝑀′ 𝒆𝑀′
One method of eliminating the noise would be to simply ignore the modes corresponding to
those errors:
\[
\boldsymbol{s}_{clean} = \boldsymbol{s} - \sum_{i=1}^{M'} \kappa_{i}\,\boldsymbol{e}_{i}
\]
After cleansing, the user should verify that the power of the subtracted noise is not large
compared to the power of the signal itself. Under these conditions we can assume that the part of
the signal removed is primarily noise and does not carry much useful signal information with it.
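To make the procedure concrete, the following minimal sketch (Python with NumPy; the function name, array layout, and variable names are illustrative assumptions rather than code from this work) builds the average noise matrix from a stack of simulated error vectors, diagonalizes it, and subtracts the signal's projection onto the M′ largest noise modes:

import numpy as np

def kl_clean(signal, noise_realizations, n_modes):
    """Sketch of the KL cleansing described above.

    signal             : (N,) data vector (e.g. cell overdensities)
    noise_realizations : (M, N) array, one simulated error vector delta^alpha per row
    n_modes            : M' -- number of large noise eigenmodes to project out
    """
    # Average noise matrix: epsilon = (1/M) * sum_alpha delta^alpha delta^alpha^T
    M = noise_realizations.shape[0]
    epsilon = noise_realizations.T @ noise_realizations / M

    # Diagonalize; eigh returns eigenvalues in ascending order for a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(epsilon)
    top = eigvecs[:, -n_modes:]        # eigenvectors of the M' largest eigenvalues

    # Project the signal onto each large noise mode and subtract the result
    kappa = top.T @ signal             # expansion coefficients kappa_i
    return signal - top @ kappa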
In subsequent sections, this technique will be applied to the photometric calibration errors
present in the Sloan Digital Sky Survey.
3. The Sloan Digital Sky Survey
Data for the Sloan Digital Sky Survey is provided by a dedicated 2.5 meter, 3º wide-field
telescope located at Apache Point Observatory in New Mexico. Photometric information is
gathered using five, essentially nonoverlapping bandpass filters (u’, g’, r’, i', z’) covering a range
from the ultraviolet limit for our atmosphere at 3000 Å to the sensitivity limit for silicon CCDs at
11000 Å (Fukugita, 1996). The set of five filter passbands is positioned in each of six adjacent
camera columns (camcols) for a total of thirty 2048 × 2048 pixel CCDs (Gunn, 1998).
The Sloan telescope operates through drift scanning over great circles on the sky. A full stretch
of observational area along such a circle is defined to be a “stripe”, bounded on east and west by
lines of constant “lambda” and on the north and south by lines of constant “eta”, 2.5 º apart.
“Lambda” and “eta” are generalized spherical survey coordinates rotated from the standard RA
and Declination coordinate system.
Due to the physical separation of each camera column, a full stripe can be observed only through
summing two offset, but partially overlapping “strips”. Each strip is described by the region
covered by its six camera columns, each one of which is called a “scanline”. In short, six
scanlines comprise one strip. Two strips form a full stripe.
At times, it may not be possible for SDSS to observe one full stripe during a single, contiguous
observing pass due to degradations in observing conditions. In such cases, each stripe is split
into two or more adjacent “runs”. Each scanline in a run is called a SEGMENT.
Photometric calibration must be done for each camcol for every run (i.e. for every SEGMENT).
It is a multistep process facilitated by three telescopes: Apache Point’s 20-inch Photometric
Telescope (PT), the United States Naval Observatory's (USNO's) 40-inch telescope in Flagstaff,
Arizona (Smith, 2002), and the SDSS main telescope. The USNO telescope has collected and
calibrated observations of a network of 158 bright primary stars that tend to saturate the SDSS
main telescope along with a set of secondary patches of sky (Tucker, 2006). The PT observes
the primary stars along with the sky patches that overlap with the SDSS 2.5 meter’s scanning, so
that primary star calibration may be tied to fainter stars in the secondary patches. To quantify the
atmospheric extinction, the PT must observe the primary stars through different air masses.
While there are differences in both observing conditions (USNO: ambient, PT: dry air, SDSS 2.5
meter: vacuum) and filter responses at the three telescopes, these have been well accounted for
during data reduction. The errors that accumulate throughout the steps of this process lead to a
photometric uncertainty of approximately 2% rms in r (Ivezić, 2004). It is this uncertainty that
we attempt to model and eliminate.
In this paper we will only consider galaxies that fall into SDSS’s main galaxy sample (MGS;
Strauss et al. 2002). All MGS galaxies are required to have Petrosian corrected r-band
magnitudes brighter than 17.77. Failure to exactly calibrate the r-band in each camcol may lead
to the exclusion of “real” MGS galaxies or the inclusion of galaxies too dim to meet the strict
MGS criteria. In effect, photometric calibration errors may lead to the overcounting or
undercounting of MGS galaxies in sections of space.
We limit the MGS sample in a few ways. First, we exclude galaxies with Petrosian corrected r-band magnitudes brighter than 15, redshifts greater than 0.3, and redshift confidences less than
0.9. Because the depth of each galaxy is important in this analysis, objects without redshift
measurements or for which redshift measurements failed are not included. We additionally
exclude galaxies whose redshifts were taken “by hand” with low confidence as well as those
whose cross-correlations and emz’s are inconsistent. (THESE ARE SKY SERVER’S
DEFINITIONS OF MY MGS CONDITIONS)
The reason Petrosian magnitudes are used is that, in the absence of seeing, they measure a
constant fraction of a given galaxy’s light regardless of distance or size. The Petrosian flux is
defined as the flux within a circular aperture with a radius equal to $N_{P}\,\theta_{P}$, where $N_{P} = 2$ for the
SDSS and where $\theta_{P}$ is the radius at which the "Petrosian ratio" (for some angle θ, the ratio of the
local surface brightness averaged over an annulus at θ to the mean surface brightness within
θ) falls below a threshold value set to be 0.2 for the SDSS (Blanton et al., 2003). I'M NOT
SURE I WROTE THIS, OR IF I NEED IT.
Once a large number of objects have been observed photometrically, these chunks are fed
through a pipeline to establish whether or not they are candidates for spectroscopic observation.
Any object that passes the criteria for spectroscopy becomes part of the sky’s target version.
The process of determining which targeted objects will be observed spectroscopically is called
tiling and the details are described in (Blanton, 2003a).
Spectroscopic fibers are physically drilled into a one meter round aluminum plate called a tile.
At most, 640 fibers may be fed into a single tile, and due to drilling constraints, no two fibers can
be closer together than 55″. During each Target Version, the targeting software lays down tiles
in such a way as to maximize the number of useful spectra gained. This means that densely
populated areas of the sky may be covered by multiple overlapping tiles. The number of tiles
covering a particular region of sky is referred to as the depth of that region.
Tiling may be further complicated by the existence of a tiling boundary, which is usually a
rectangular area on the surface of a sphere defined to be the only permissible area inside which
to assign spectroscopic targets, and of tiling masks, which are areas where no fibers are to be
placed, even if targets exist there. The combination of these different spectroscopic descriptions
leads to the definition of sectors (Blanton, 2005; Hamilton & Tegmark, 2004), each of which has, in
general, its own spectroscopic coverage properties. Sectors are individual to each data release,
with DR6 possessing 9464 of them.
To fully describe a sector, it is important to know its ratio of spectroscopically observed galaxies
to the number of targeted galaxies. In Section 5, we will describe our method of estimating this
percentage, which we term the redshift completeness of each sector.
DISCUSS GENERATING THE REDSHIFT DISTRIBUTION OF MGS GALAXIES?
4. Modeling Photometric Calibration Errors
Since photometric calibration errors are a random process, I employ a Monte Carlo method to
best describe their average effect over many realizations. This problem essentially reduces to
counting the percent change in the number of galaxies in a section of sky given a particular
realization of photometric errors. The first step in being able to quantify these changes is the
creation of a set of “countable galaxies” using randomly distributed points.
4.1 Angular Randoms
We begin by constructing a collection of random points with RA and Declination information
(hereafter “angular randoms”) distributed uniformly over three-quarters of the sky. Ideally, we
would create an arbitrarily large number of points to ensure high resolution results. In reality,
increasing the quantity of angular randoms beyond a certain number yields diminishing returns
while slowing queries over them. The middle ground we select is to place an average of 1000
angular randoms in every square degree of the sky, ensuring full coverage of the Sloan DR6
footprint.
Counting angular randoms over geometric sections of the sky must be an efficient operation.
Since the positional information of these randoms is fixed, there will be a need to routinely
access data about a given point and to determine whether or not it meets specific criteria. The
nature of this problem invites the use of databases accessed using the Structured Query Language
(SQL).
With tens of millions of angular randoms in our simulation, a naïve comparison of every point’s
spatial information with geometric requirements (e.g. that a point lies within a certain angular
distance of another object) would be needlessly time consuming. An ideal system would be
optimized for queries on spherical surfaces while minimizing computational expenditure, storing
objects with angular proximity next to each other on disk so that properly ordered ("indexed")
data can be searched quickly.
For this reason, all angular randoms are assigned identification numbers called “htmIDs”,
where “htm” stands for Hierarchical Triangular Mesh (Szalay, 2005 MIGHT NEED BETTER
REFERENCE). The HTM infrastructure is constructed by drawing successively smaller
triangles on a sphere. First, each hemisphere of a sphere is split into four equal area triangles by
connecting the poles to the equator. Each smaller triangle is then split in four again by drawing
lines between each side’s midpoint. This process is repeated until the desired resolution is
reached, usually around ~20? such splits for SDSS. Objects in a given triangle are stored near
each other on disk, greatly reducing the time needed for searches provided the data set is ordered
("indexed") properly by htmID.
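As a rough sense of scale (a back-of-the-envelope sketch in Python; the depth of 20 simply mirrors the approximate figure quoted above), each split multiplies the number of triangles ("trixels") by four, so the mesh becomes extremely fine after twenty levels:

def htm_trixel_stats(depth, sky_area_deg2=41253.0):
    """Number of HTM trixels and their mean area at a given subdivision depth."""
    n_trixels = 8 * 4 ** depth                 # 8 root triangles, each split in 4 per level
    mean_area_arcsec2 = sky_area_deg2 * 3600.0 ** 2 / n_trixels
    return n_trixels, mean_area_arcsec2

n, area = htm_trixel_stats(20)
print(f"depth 20: {n:.2e} trixels, roughly {area:.1f} square arcseconds each")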
Additional position information is then assigned to each angular random using the Sloan
geometry. Given an (RA, Dec) pair, SDSS geometry fully determines lambda and eta (Sloan
survey coordinates), which determines stripe number, which determines mu and nu (coordinates
in the SDSS great circle coordinate system). The strip and camcol assignments are generated
geometrically from the nu value using simple relationships.
As described in Section 3, not all areas on the SDSS photometric footprint are observed with the
same completeness. We would like to quantitatively compare the fraction of galaxies that are
observed spectroscopically against those that are merely targeted for spectroscopy. Sections of the
footprint where this fraction is too small will not be considered in this analysis. Since not all
‘SECTORS’ in SDSS are observed at the same completeness, we must add ‘SECTOR’
information to each angular random as well.
Performing this assignment quickly again requires the use of the HTM functions. The geometric
area of each of DR6’s 9464 SECTORs is stored in the Region table under the regionString
heading. The SDSS function fHtmCoverRegion translates this area into a list of contiguous
htmIDs. While this list of htmID ranges completely covers the SECTOR, there may remain
some htmIDs that lie outside the SECTOR. A second routine joins these results with the SDSS
Halfspace table to identify angular randoms that reside fully inside the SECTOR. After all
SECTORS have been examined in this manner, all angular randoms which have not yet been
linked to a SECTOR are assumed to lie outside the spectroscopic footprint and are assigned
SECTOR = 0.
Finally, the SEGMENT in which each angular random resides can be found. The geometric area
for each of DR6’s 2052 SEGMENTs is also described in the Region table, allowing SEGMENT
assignments to be made in a manner very similar to that used for SECTOR assignments.
4.2 Selection Function Parameterization
While the angular randoms will ultimately stand in for galaxies when the Monte Carlo
simulations are performed, they are distributed on the surface of the Celestial Sphere with only
two dimensions of position information specified. Accurately reflecting the percentage of
angular randoms that would exist between any two redshift ranges requires detailed knowledge
of the MGS’s distribution as a function of redshift.
To model this distribution I start with the Schechter luminosity function:
\[
\Phi(L)\,dL = \Phi^{*}\left(\frac{L}{L^{*}}\right)^{\alpha} e^{-L/L^{*}}\,\frac{dL}{L^{*}}
\]
This equation has three parameters, Φ* (a normalization parameter), L* (a characteristic
luminosity), and α (a faint-end slope). Since I will only be considering derivatives of this
function, the normalization parameter Φ* does not need to be determined. While this function
can successfully be used to model the MGS distribution, I have found a four-parameter evolving
Schechter luminosity function to provide a better fit to the data. The form of the evolving
function is the same with one exception; L* is permitted to vary with redshift:
\[
L^{*} \rightarrow L^{*}\left(\frac{1+z}{1+z_{0}}\right)^{B}
\]
L* still needs to be parameterized, of course, but now the parameter B must be as well. z0 is set
to 0.1, the median redshift of the galaxies under consideration.
The selection function, which gives the probability that a galaxy at distance x from the observer
is included in the catalogue, is given by
\[
\phi(x) = \int_{L_{min}(x)}^{\infty} \Phi(L)\,dL
\]
Despite the fact that we ultimately need the integral’s limits to be in terms of the absolute
limiting magnitude M, it's often easier to do integrals of the Schechter function in terms of L
because they can be cast in terms of incomplete gamma functions. If I were to make the
following substitution,
\[
u = \frac{L}{L^{*}\left(\dfrac{1+z}{1+z_{0}}\right)^{B}}
\]
then the selection function can be rewritten as
\[
\phi(x) = \Phi^{*} \int_{L_{min}(x)\,/\,L^{*}\left(\frac{1+z}{1+z_{0}}\right)^{B}}^{\infty} u^{\alpha}\, e^{-u}\, du
\]
The form of the incomplete Gamma function is
\[
\Gamma(a, z) = \int_{z}^{\infty} u^{a-1} e^{-u}\, du
\]
so the selection function may then be recast in the following manner.
\[
\phi(z) = \Phi^{*}\,\Gamma\!\left(\alpha + 1,\; \frac{L_{min}(z)}{L^{*}\left(\dfrac{1+z}{1+z_{0}}\right)^{B}}\right)
\]
The luminosity limit above needs to be recast in terms of a magnitude limit via the distance
modulus. The distance modulus is (Efstathiou, 1993)
\[
M_{lim} = m_{lim} - 5\log_{10} d_{L}(z) - 25 - k(z)
\]
Here 𝑘(𝑧), the K-correction at redshift z, adjusts the magnitude of each object so that they may
be compared with a common restframe filter.
Typically, K-corrections are assigned individually to each galaxy by fitting its five ugriz
magnitudes with a set of thirty galaxy spectral energy distribution templates as developed
through stellar population synthesis models (Bruzual & Charlot, 2003). Tamas Budavari has
developed a code that performs these fittings for SDSS galaxies and stores the results to a
database.
Because angular randoms possess no intrinsic brightness information, assignment of exact K-corrections will be impossible. Instead, we sort MGS galaxies into a total of 300 redshift bins of
size 0.001 between redshifts 0 and 0.3 since K-corrections are a generally increasing function of
redshift. The average correction for MGS galaxies in each bin is then used to calculate the
limiting absolute magnitude in a quasi-discretized fashion.
The lower limit of the selection function integral, Llim(z), can then be related to the distance
modulus with
\[
L_{lim} = L_{0}\,10^{-0.4\,(M_{lim} - M_{0})}
\]
Since the limiting apparent magnitude, mlim , in the r-band is 17.77, the final value requiring evaluation is
the luminosity distance as a function of redshift. First, following Peebles (1993) I introduce the
following function and assume a flat universe:
E z    m 1  z    k 1  z       m 1  z    
3
2
3
The total line-of-sight comoving distance to an object at redshift z is given by
\[
D_{C} = \frac{c}{H_{0}} \int_{0}^{z} \frac{dz'}{E(z')}
\]
The luminosity distance is then
𝐷𝐿 = 𝐷𝐶 (𝑡0 )(1 + 𝑧)
which can be evaluated through numerical integration.
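The numerical integration is straightforward; a sketch along the following lines (Python with SciPy; the parameter values mirror the cosmology table below, and the function name is an assumption) evaluates the comoving and luminosity distances for a flat universe:

import numpy as np
from scipy.integrate import quad

C_KM_S = 299792.458                    # speed of light in km/s

def distances(z, omega_m=0.3, omega_l=0.7, h=1.0):
    """Line-of-sight comoving distance D_C and luminosity distance D_L (flat universe), in Mpc."""
    H0 = 100.0 * h                     # km/s/Mpc
    E = lambda zp: np.sqrt(omega_m * (1.0 + zp) ** 3 + omega_l)
    integral, _ = quad(lambda zp: 1.0 / E(zp), 0.0, z)
    d_c = (C_KM_S / H0) * integral     # D_C = (c/H0) * int_0^z dz'/E(z')
    return d_c, d_c * (1.0 + z)        # D_L = D_C * (1 + z)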
The expected number distribution of galaxies as a function of redshift depends on the selection function
(see 3.3 Adrian’s thesis):
\[
n_{exp}(z)\,dz = \phi(z)\,dV(z) = \phi(z)\,D_{C}^{2}\,\frac{c}{H_{0}\sqrt{\Omega_{m}(1+z)^{3} + \Omega_{\Lambda}}}\,dz
\]
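For completeness, the incomplete-gamma form of the selection function can be evaluated numerically with a one-dimensional quadrature. The sketch below (Python with SciPy; the argument names are placeholders, and the limiting luminosity L_min(z) is assumed to have already been computed from the distance modulus and binned K-corrections described above) leaves the normalization Φ* at unity, since only derivatives of φ(z) are ultimately needed:

import numpy as np
from scipy.integrate import quad

def selection_function(L_min, L_star, alpha, B, z, z0=0.1, phi_star=1.0):
    """phi(z) = phi_star * integral_{u_min}^{inf} u^alpha e^{-u} du for the evolving Schechter function."""
    u_min = L_min / (L_star * ((1.0 + z) / (1.0 + z0)) ** B)   # lower limit after the substitution for u
    value, _ = quad(lambda u: u ** alpha * np.exp(-u), u_min, np.inf)
    return phi_star * value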
Working with these functions requires choosing a cosmological model whose parameters I list
below along with others I use later in this section.
Variable        | Value | Description
Ω_m             | 0.3   | Mass density parameter
Ω_Λ             | 0.7   | Dark Energy density parameter
m_lim           | 17.77 | Limiting apparent magnitude in the survey
m_max           | 15.00 | Maximum apparent magnitude you are allowed to see in the survey
M_max           | -20   | The upper luminosity limit of the selection function (if it's not infinity) is dependent upon M_max(z), which is, in turn, dependent upon m_max
M_lim           | -17   | This is the same thing as UpperMlimit, except this is set for M_lim(z)
incr            | 0.02  | Size of step taken to calculate the derivative of the selection function
(m_lim + incr)  | 17.79 | Upper limit of magnitude used to calculate the derivative of the selection function
h               | 1.0   | Hubble parameter set to match Blanton et al. (2003) and his Schechter parameters
The goal is to parameterize the evolving Schechter function to best-fit the actual distribution of
MGS galaxies in SDSS. To do so, I first begin with a parameterized form of the galaxy
distribution:
f z   z e
b
g   z / zc 
in which z is redshift and the parameters g, zc, and b are to be determined. Fitting this function
leads to the following best fits:
Parameter | DR4 Best-Fit Value | DR5 Best-Fit Value | DR6 Best-Fit Value
g         | 1.658              | 1.617              | 1.543
b         | 1.720              | 1.718              | 1.720
z_c       | 0.0923             | 0.0933             | 0.0948
The following charts illustrate the level of agreement for DR5 and DR6:
I find that selecting a set of three parameters based on all of the histogram’s data points leads to a
curve that does not match the histogram well at larger redshifts. The primary cause is the Great
Wall structure which will dominate in the least squares fitting and which is not well represented
by the selection function. Removing the values in the redshift range 0.0575 to 0.1225 (14 of 60
data points) achieves tighter fits.
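A fit of this kind can be performed with a standard nonlinear least-squares routine. The sketch below (Python with SciPy; the array names, the nuisance amplitude A, and the starting guesses are illustrative assumptions) fits the three shape parameters to a binned redshift histogram while masking the Great Wall range noted above:

import numpy as np
from scipy.optimize import curve_fit

def f_model(z, A, g, b, zc):
    """A * z^g * exp(-(z/zc)^b); A is only a nuisance normalization for the histogram."""
    return A * z ** g * np.exp(-((z / zc) ** b))

def fit_distribution(z_centers, counts):
    # Mask the Great Wall range (0.0575 < z < 0.1225) so it does not dominate the least-squares fit
    keep = (z_centers < 0.0575) | (z_centers > 0.1225)
    shape0 = f_model(z_centers[keep], 1.0, 1.6, 1.7, 0.09)
    p0 = (counts[keep].max() / shape0.max(), 1.6, 1.7, 0.09)
    popt, _ = curve_fit(f_model, z_centers[keep], counts[keep], p0=p0)
    return popt[1:]                     # best-fit (g, b, zc)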
To parameterize the evolving Schechter function, I use this three parameter fit as a guide for a
couple reasons. First, I look to match the shape of the selection function as well as possible.
Since my current curves match very well, especially at high redshift where it’s most important,
trying to replicate this shape seems more sensible than actively modeling in the noise of the
histogram. Second, while my histogram only has a limited number of available data points, my
curve is created with a function, allowing it to be fit to any number of positions that I choose.
The charts below reveal the best-fit Schechter parameters and their visual levels of agreement
with the SDSS MGS distribution.
4.3 Galaxy Count Dependence on Photometric Calibration Errors
The primary justification for working with the evolving Schechter function as opposed to the
parameterized functional fit is that the former has magnitude dependence. One could modulate
the limiting magnitude slightly and determine the fractional change in the number of galaxies
expected by propagating that change through to the selection function.
If we let 𝑁𝑡𝑜𝑡𝑎𝑙 and 𝑁 represent the number of galaxies counted in a section of space with and
without photometric calibration errors respectively, then
\[
N_{total} = N + \Delta N
\]
where Δ𝑁 is the change in number count due to those errors. Since the present goal is to
represent Δ𝑁 as a function of varying apparent magnitude, I introduce a function 𝑓(𝑧) which
represents the fractional change in the number of galaxies per unit limiting apparent magnitude.
ΔN may then be expressed as:
\[
\Delta N = N\, f(z)\, dm
\]
where 𝑓(𝑧) is related to the change in the selection function.
\[
f(z) = \frac{1}{\phi(z)}\,\frac{\partial \phi(z)}{\partial m} = \frac{\partial \ln \phi(z)}{\partial m}
\]
The total number of galaxies counted in a section of space is given thusly
\[
N_{total} = N\left[1 + f(z)\, dm\right]
\]
The above derivative is best approximated numerically by taking the difference between two
selection functions with slightly different limiting apparent magnitudes.
\[
f(z) \approx \frac{1}{\phi(z)}\,\frac{\phi(z)_{+} - \phi(z)}{incr}
\]
As listed in the table above, the apparent magnitude increment is chosen to be 0.02, which makes
𝜙(𝑧)+ the selection function at a limiting apparent magnitude of 17.77 + 0.02 = 17.79. The
shape of this function is shown in the figure below.
[Figure: the fractional change in galaxy counts per unit limiting magnitude plotted against redshift over the range z ≈ 0.05 to 0.25.]
Clearly, this figure indicates that photometric calibration errors should have a greater effect on
galaxy counts at larger redshifts.
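The finite difference translates directly into code. A short sketch (Python; selection_function refers to the illustrative routine from Section 4.2, and L_min_at is a hypothetical helper that converts a limiting apparent magnitude into a limiting luminosity at redshift z via the distance modulus and binned K-corrections) might read:

def fractional_change_per_mag(z, L_min_at, L_star, alpha, B, m_lim=17.77, incr=0.02):
    """f(z) ~ [phi_+(z) - phi(z)] / (phi(z) * incr), where phi_+ uses the limit m_lim + incr."""
    phi      = selection_function(L_min_at(m_lim, z), L_star, alpha, B, z)
    phi_plus = selection_function(L_min_at(m_lim + incr, z), L_star, alpha, B, z)
    return (phi_plus - phi) / (phi * incr)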
5. Discretizing the SDSS into Survey Cells
Application of the Karhunen-Loève transform requires that one’s data be discretized. In galaxy
surveys, a customary discrete unit is a cell of some fixed size in redshift space. Future analysis
will use these cells to compute a cell-averaged correlation matrix. Evaluation of that matrix
requires an integration that is simplified if the cells are spherical in nature, which justifies their
shape in this analysis. These cells should be numerous enough to capture the true distribution of
galaxies without oversmoothing, but not so numerous as to make calculations over them
computationally expensive.
5.1 Creation and Placement of Survey Cells
As a compromise between having a very large number of cells and being able to perform
statistics without making the problem too time intensive computationally, we choose to lay down
identical spherical cells of size 6 ℎ−1 𝑀𝑝𝑐. Choosing the Hubble parameter to be ℎ = 1 has the
added benefit of making the cells equally sized in both redshift and actual space and simplifies
the calculation of each cell’s angular extent. The spheres are created to fill all of space in the
redshift range of ~ 0.047 to 0.172 which corresponds to the location of more than three-quarters
of all MGS galaxies in DR6.
The most efficient method to maximize the volume of space filled by the spherical cells without
overlapping is the Hexagonal Closest Packing (HCP) arrangement, which in pure Euclidean
geometry fills about 74% of the available volume. The HCP operates by placing spheres at
every linear combination of three unit vectors:
\[
\boldsymbol{A} = \left(1,\; 0,\; 0\right), \qquad
\boldsymbol{B} = \left(\frac{1}{2},\; \frac{\sqrt{3}}{2},\; 0\right), \qquad
\boldsymbol{C} = \left(\frac{1}{2},\; \frac{\sqrt{3}}{6},\; \frac{\sqrt{6}}{3}\right)
\]

To account for the size of the spheres, each of these vectors is scaled by the separation between
the cell centers, or twice the radius.
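A sketch of this placement (Python with NumPy; the range of integer coefficients and the function name are illustrative assumptions, while the 6 h^-1 Mpc radius comes from the text) generates candidate cell centers as integer combinations of the scaled basis vectors:

import numpy as np

RADIUS = 6.0                                   # cell radius in h^-1 Mpc
# Close-packing unit vectors A, B, C scaled by the center-to-center separation (twice the radius)
BASIS = 2.0 * RADIUS * np.array([
    [1.0, 0.0, 0.0],
    [0.5, np.sqrt(3.0) / 2.0, 0.0],
    [0.5, np.sqrt(3.0) / 6.0, np.sqrt(6.0) / 3.0],
])

def hcp_centers(n):
    """Cell centers at every integer combination i*A + j*B + k*C with |i|, |j|, |k| <= n."""
    idx = np.arange(-n, n + 1)
    i, j, k = np.meshgrid(idx, idx, idx, indexing="ij")
    coeffs = np.stack([i.ravel(), j.ravel(), k.ravel()], axis=1)   # (n_cells, 3) integer coefficients
    return coeffs @ BASIS                                          # Cartesian positions of cell centers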
Simple relationships can be used to transform Cartesian positions to Right Ascension,
Declination, and depth in ℎ−1 𝑀𝑝𝑐. From here, the sample is whittled down further by
eliminating all cells that fail to reach inside the photometric footprint. To do this, I invoke
fFootprintEq, an SDSS routine that reveals whether an angular extent around a given position
falls within the regions CHUNK, PRIMARY, TILE, or SECTOR. If none of these regions fall
within a radius of the cell's center, then I can safely remove that sphere from consideration.
Additional information may be ascribed to each cell. Using the cosmological parameters in the
table above, a cell’s depth may be used to determine the redshift of its center. From there,
angular diameter distance can be calculated. This is particularly crucial since the number of
angular randoms in each cell will depend on the angular radius, 𝜃, of each cell when projected
onto the sky. Since the physical size and depth of each cell is specified beforehand, the angular
radii of the spheres are found by the following equation.
\[
\theta = \frac{6\ \mathrm{Mpc}}{D_{A}}
\]
The angular randoms are distributed on a two-dimensional sphere, while the cells are
three-dimensional. The number of angular randoms counted inside a cell's angular radius
therefore represents the total number of angular randoms in a cone of the given angular extent,
and I must employ two weighting mechanisms that transform the number count in a cone into a
number count in a sphere at a given redshift.
The first weighting mechanism constrains the number count in redshift space. The depth and radius of the
spheres can be used to define two redshifts, 𝑧1 and 𝑧2 , corresponding to the front and back edges of the
sphere as viewed along the line of sight. Integrating the normalized MGS galaxy distribution function
between those same limits will produce the percentage of galaxies, or in this case, angular randoms,
within the sphere’s range. Each angular random within the cell’s angular radius then receives a weight
equal to this fraction to account for depth.
The second weighting is needed to account for each cell’s spherical geometry. I accomplish this
by weighting number counts as a function of viewing angles, θ and φ, the angular distances in
RA and Dec respectively between random points and the center of the sphere. At the limits,
100% of angular randoms along a line through the center of the sphere will lie inside the cell.
Conversely, none of the angular randoms along a line tangent to the sphere will lie inside the
cell. Clearly, angles in between these two extremes will yield different percentages. If the
distance to the center of the sphere is d, and the radius of the sphere is r, then the percentage of
angular randoms inside a cell as a function of angle may be approximated to be
\[
w_{a}(\theta, \phi) = \frac{d^{2} - \dfrac{d^{2} - r^{2}}{1 - \sin^{2}\theta - \sin^{2}\phi}}{r^{2}\left(1 - \sin^{2}\theta - \sin^{2}\phi\right)}
\]
This derivation is for a flat-space geometry in a non-expanding universe. To apply it to this
simulation, the depth value d needs to be modified from its value in redshift space. I choose to
modify d since the angular radius, which determines the number of angular randoms in each cell,
clearly cannot be altered without affecting the number count, and the radius r has been
established as a fixed quantity.
In a flat geometry, the distance to an object of angular radius θ and radius r is 𝑑 = 𝑟/ sin 𝜃, a
value that is different than the actual redshift distance due to cosmological effects. Substituting
𝑟/ sin 𝜃 for d in the above equation, I establish the distance to a cell assuming space is totally
flat, allowing for the use of the above angular weighting formula.
Ultimately, each angular random within a cell’s angular radius will receive a weight value equal
to the product of its redshift weight, 𝑤𝑟 , and its angular weight, 𝑤𝑎 . Every point within a cell
will have the same redshift weighting since each point shares the same probability of being
included in the cell’s specified redshift range. Alternatively, each point will receive its own
angular weighting since each point is a unique angular distance away from the center of its cell.
From here, it can be established that the total number of angular randoms within a particular cell
is
\[
N = \sum_{i=1}^{N_{c}} w_{r}^{i}\, w_{a}^{i}
\]
where 𝑁𝑐 is the number of angular randoms within the cell’s angular radius.
With the angular randoms now completely specified relative to our survey geometry, all that
remains is to count the number in each cell. More specifically, since the photometric calibration
errors are to be applied on each SEGMENT, one needs to count the total number of angular
randoms in each SEGMENT in each cell. Fortunately, the nature of the number count equation
(PROVIDE NUMBER) requires that this counting be performed only once.
With millions of randoms requiring weighting factors for tens of thousands of cells each,
intelligent counting techniques are a necessity. For each cell, one begins using the unit vector to
the cell’s center and its angular radius to create a “cell description” which may then be translated
into htmID ranges using the SDSS fHtmCoverRegion function. As long as the angular randoms
are indexed according to their htmIDs, parsing the millions of randoms down into a set that
exists only within the proximity of the cell itself should not be overly time intensive. At this
point, a simple
(𝑥𝑖 − 𝑥)2 + (𝑦𝑖 − 𝑦)2 + (𝑧𝑖 − 𝑧)2 < 𝑟 2
algorithm may be run on each angular random to ensure it lies within the cell’s circular
boundary, where the ith angular random is at position (𝑥𝑖 , 𝑦𝑖 , 𝑧𝑖 ) and the center of the cell of
radius r is at position (𝑥, 𝑦, 𝑧). Finally, weights are applied to each point, sums are taken over
each SEGMENT, and results are used to modulate the number counts as described in Section 6.
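The counting loop just described might be sketched as follows (Python with NumPy; the htmID prefilter is represented here simply by the assumption that the input positions have already been narrowed to the cell's neighborhood, and the weight arguments stand in for the redshift and angular weights defined earlier):

import numpy as np

def weighted_count_in_cell(points_xyz, cell_center, r, w_r, w_a):
    """Weighted number of angular randoms inside one spherical cell.

    points_xyz  : (n, 3) Cartesian positions of the htmID-prefiltered randoms
    cell_center : (3,) Cartesian position of the cell center
    r           : cell radius (6 h^-1 Mpc here)
    w_r         : scalar redshift weight shared by every point in the cell
    w_a         : (n,) per-point angular weights
    """
    # Keep only points satisfying (x_i - x)^2 + (y_i - y)^2 + (z_i - z)^2 < r^2
    inside = np.sum((points_xyz - cell_center) ** 2, axis=1) < r ** 2
    return np.sum(w_r * w_a[inside])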
5.2 Determining a Cell’s SECTOR Completeness
To understand what percentage of galaxies targeted for spectroscopy in a particular cell actually
have spectra taken, one must begin by first examining SDSS’s SECTORs. Each SECTOR is
observed during a particular “TargetVersion” by up to five overlapping spectroscopic tiles. The
geometric description of each SECTOR is used to identify which MGS galaxies exist within its
boundaries.
All MGS galaxies are viable targets for spectroscopy. In particularly crowded areas of the
sky, however, fibers cannot be placed close enough to each other on the tile to obtain spectra for
all desired objects. We call the percentage of those galaxies for which spectroscopy is
successfully taken the completeness. All spectroscopically observed objects are given a
"specObjID" in SDSS. Therefore, SECTOR completeness can be established by comparing the
total number of galaxies in a SECTOR fitting the MGS criteria to the number of those objects
with “specObjIDs”.
Ultimately, our goal is to determine the MGS angular completeness of each sphere. Spheres
contain multiple sectors, each of which has its own angular completeness. Naively, we could
take the ratio of NSpecObj/Targeted as the angular completeness, but for sectors with a small
number of objects this will be either inaccurate or impossible. Instead, we make the assumption
that the angular completeness is the same for all sectors with the same depth taken during the
same TargetVersion. Since tiles are placed uniquely for each TargetVersion, sectors of the sky
that end up with the same number of tiles covering them should be similar.
Out of 9464 DR6 sectors, there are only 90 distinct TargetVersion/nTiles pairs (hereafter
“completeness version”). Of these 90, seven contain no Targeted objects. I address the
properties of these seven completeness versions below.
TargetVersion     | nTiles | Targeted | NSpecObj | DR6 Sector | Number of Angular Randoms Contained Therein
+v2_5 +v3_1_0     | 3      | 0        | 0        | 41485      | 0
+v2_5 +v3_1_0     | 2      | 0        | 0        | 34697      | 1
+v2_11            | 4      | 0        | 0        | 37070      | 3
+v2_13_5 +v3_1_0  | 2      | 0        | 0        | 37927      | 0
+v4_5             | 5      | 0        | 0        | 40222      | 4
+v2_13_7          | 4      | 0        | 0        | 37730      | 0
+v6_4             | 4      | 0        | 0        | 36461      | 1
The 83 completeness versions with objects in them correspond to an area on the sky of 6807.93
square degrees. The seven completeness versions without objects correspond to an area of only
0.00834 square degrees on the sky. Also, as can be seen in the above table, it turns out that the
seven completeness versions without any objects correspond to a negligible nine angular
randoms out of more than thirty million. Therefore, we simply remove those random points
from our DR6 analysis.
Of the remaining 83 completeness versions, two have multiple TargetVersions:
TargetVersion | nTiles | Targeted | NSpecObj | DR5 Sector
+v2_2 +v2_5   | 3      | 19       | 19       | 39801
+v2_2 +v2_5   | 2      | 44       | 44       | 39803
These two completeness versions amount to about 0.811 square degrees on the sky. Below are
the individual completeness versions that combine to form these two:
TargetVersion | nTiles | Targeted | NSpecObj | Completeness Fraction
+v2_2         | 3      | 1361     | 1338     | 0.9831
+v2_5         | 3      | 69       | 67       | 0.9710
+v2_2         | 2      | 10505    | 10147    | 0.9659
+v2_5         | 2      | 4636     | 4429     | 0.9554
The angular completenesses are not terribly far apart for these completeness versions. Therefore,
taking one or the other for the angular completeness of the “multiple TargetVersion” entries will
probably be a reasonable approximation. Lacking a better criterion with which to decide
between the two, we will simply take the fraction determined from a greater number of objects.
For example, the angular completeness of completeness version (+v2_2 +v2_5, 3) will be
0.9831.
Once the above procedure is applied, each of the remaining 83 completeness versions will have
an angular completeness associated with it. Since each SECTOR uniquely maps to one of these
completeness versions, this means that each SECTOR will now have an angular completeness
value.
Letting S represent the number of SECTORs present in a given cell, the total angular
completeness of a cell is
\[
\frac{\sum_{i=1}^{S} N_{i}\, c_{i}}{\sum_{i=1}^{S} N_{i}}
\]
where 𝑁𝑖 is the number of angular randoms in the ith SECTOR, and 𝑐𝑖 is the angular
completeness of the ith SECTOR.
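In code, this is just a weighted average, e.g. (Python with NumPy; the function and argument names are assumptions):

import numpy as np

def cell_completeness(n_randoms, sector_completeness):
    """Angular completeness of a cell: sum_i(N_i * c_i) / sum_i(N_i) over its SECTORs."""
    n_randoms = np.asarray(n_randoms, dtype=float)
    return np.sum(n_randoms * np.asarray(sector_completeness)) / np.sum(n_randoms)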
Finally, we will include in our noise simulation only those spheres whose angular completenesses are greater
than or equal to 65%. This ensures that we do not attempt to model noise statistics for regions of the sky
that only slightly reach into the spectroscopic footprint or for which there is little spectroscopic coverage.
There are 54896 spheres that meet the 65% criterion for DR5 and 66113 that meet the criterion for DR6.
6. Data
6.1 Error Realizations and Construction of the Average Noise Matrix
We assume that photometric calibration errors follow a Gaussian distribution with mean zero and
standard deviation of 0.02 magnitudes. During each realization, a set of errors from this
distribution is generated and applied to each SEGMENT in the survey. These errors will affect
the number count of angular randoms in each SEGMENT according to INSERT EQUATION
FROM ABOVE. After all errors have been applied, the total change in number count in each
cell may be found by summing over the changes in the cell’s individual SEGMENTs.
IN THE BEGINNING OF SECTION 2 we describe the creation of a noise matrix from the 𝛼 th
realization of the noise. The matrix is the outer product of the 𝜹 vector with itself. The ith value
in 𝜹 contains a measure of the noise in the survey's ith cell as given by the overdensity:
\[
\delta_{i}^{\alpha} = \frac{g_{i}^{\alpha}}{n_{i}} - 1
\]
where $\delta_{i}^{\alpha}$ is a measure of the overdensity for the ith sphere in realization α, $n_{i}$ is the number
count when there are no photometric calibration errors, and $g_{i}^{\alpha}$ is the number count for the ith
sphere in realization α.
Taking the outer product of the 𝜹 vector is computationally intensive. With 66,113 elements in
each vector for DR6, the outer product will have that many elements squared, or approximately
4.4 × 109 elements. If each element requires 8 bytes of memory, storage of the full matrix can
cost upwards of 32 GB. However, outer products are always symmetric, so only half of these
values are unique, reducing our memory requirements by a factor of 2.
If memory is a concern, the unique values of the outer product can be evaluated on their own “by
hand”. However, if large memory systems are available, there exist numerical recipes optimized
for this sort of problem. One could use BLAS (Basic Linear Algebra Subprograms) routines
which may be found in the larger LAPACK (Linear Algebra PACKage) libraries. For example,
BLAS’s “dsyr” function can aggregate outer products of many different source vectors in a
manner required for the evaluation of INSERT EQUATION NUMBER.
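As one concrete possibility (a sketch in Python using SciPy's BLAS bindings rather than the Fortran interface, with assumed function and variable names), dsyr can accumulate the rank-one outer products into only the lower triangle of the running sum:

import numpy as np
from scipy.linalg.blas import dsyr

def average_noise_matrix(realizations):
    """epsilon = (1/M) * sum_alpha delta^alpha delta^alpha^T, accumulated with symmetric rank-1 updates."""
    deltas = [np.asarray(d, dtype=float) for d in realizations]
    n = deltas[0].size
    eps = np.zeros((n, n), order="F")      # Fortran order; dsyr fills only the lower triangle
    for delta in deltas:
        eps = dsyr(1.0, delta, a=eps, lower=1, overwrite_a=1)
    return eps / len(deltas)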
The final average noise matrix is only an approximation, one that will approach the true
correlation matrix as the number of realizations approaches infinity. In principle, we only need
several thousand to ensure that deviations between successive matrices become small.
To quantify how the matrices are changing as a function of the number of realizations, we
evaluate a set of statistics for them, including the average root mean square magnitude of a
matrix element and the quadratic deviation between two matrices, which may be found by
summing the squares of the differences between the elements in two matrices then dividing by
the number of elements. Finally, we examine the square root of the ratio of the quadratic
deviation to the norm of the matrix with more realizations, or the “relative delta” of two
matrices. The comparisons in the charts below are between a DR6 average noise matrix created
with 16000 realizations and those from a lesser number of realizations.
While the exact number of realizations used should depend on the accuracy one seeks to achieve, it
is also clear there are diminishing returns beyond 8000 realizations for this particular problem.
In this analysis, we generate our average noise matrix using the largest number of realizations we
ran for DR6, 16,000.
6.2 Evaluation of Noise Eigenmodes
Aside from running the noise realizations, which can take arbitrarily long depending on the
accuracy one wishes to achieve, the most time intensive step in our noise elimination method is
the Karhunen-Loève transform itself. Conceptually, the KL transform is a straightforward
change of basis facilitated by diagonalizing the average noise matrix. In practice, a proper
diagonalization routine must be selected with special consideration given to the memory
limitations of one's system.
We chose to utilize the eigenproblem algorithms in LAPACK to perform the KL transform.
Prior to 2007, Intel’s double-precision diagonalization algorithms required that slightly more
than an entire 𝑁 × 𝑁 matrix’s worth of memory be allocated. For problems of the scale
described in this paper, this would require upwards of 40GB of memory, a strict limitation for
those without access to hefty computing resources. In 2007, however, Intel released an updated
version of its LAPACK libraries, version 9.1, which corrected an inability to access vector
addresses whose indices were larger than the four-byte integer limit. In practice, this allows
symmetric matrices to be stored as one-dimensional arrays containing only the matrices’ unique
elements. The memory requirements are effectively halved, and the problem becomes solvable
on systems with fewer resources.
LAPACK eigenproblem solvers allow the user to choose which eigenvalues and/or eigenvectors
are desired as output. Further, the user may dictate how many eigenvectors are to be output,
either by their corresponding eigenvalues’ range (e.g. from the ith largest eigenvalue to the jth
largest) or their eigenvalues’ magnitudes. Reporting all N eigenvectors would double the
memory requirements of the problem. Since we are only interested in the largest eigenmodes,
i.e. those most chiefly responsible for the noise, we may request that only the largest 𝑀′ be returned.
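With a modern LAPACK wrapper, the request for only the largest M′ eigenpairs can be made explicitly; for example, a sketch with SciPy (Python; this illustrates the idea rather than the specific LAPACK driver used in this work) would be:

import numpy as np
from scipy.linalg import eigh

def largest_noise_modes(epsilon, m_prime):
    """Return the m_prime largest eigenvalues and eigenvectors of the average noise matrix."""
    n = epsilon.shape[0]
    # subset_by_index selects eigenvalues by position in ascending order;
    # eigh reads the lower triangle by default, matching the dsyr accumulation above
    vals, vecs = eigh(epsilon, subset_by_index=[n - m_prime, n - 1])
    return vals[::-1], vecs[:, ::-1]       # reorder so the largest eigenvalue comes first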
The nature of one’s noise provides a good estimate for 𝑀′ . Because our noise realizations treat
SDSS SEGMENTs as our discrete noise units, the number of SEGMENTs that reach into the
area covered by our DR6 spherical cells should provide an upper limit to the number of
significant eigenvalues returned. The actual number of significant modes may be less if certain
SEGMENTs fail to contribute much to photometric calibration noise, but a greater number of
modes would indicate an additional source of noise in our simulation possibly unrelated to the
SEGMENTs themselves.
For DR6, we find 1839 unique SEGMENTs reach into our cells’ footprint at least to some
degree. The chart below shows the ordered eigenvalues from the diagonalization of DR6’s
average noise matrix. Notice that the eigenvalues fall off by four orders of magnitude almost
instantly beyond eigenvalue 1833.
The eigenvectors corresponding to these largest modes contain information about the structure of
noise within our simulation and are directly related to the geometry of the SEGMENTs. In the
table below, I visually depict the first four eigenmodes in terms of our original basis, the 66,113
DR6 spherical cells. The modes identify the SEGMENTs over which the noise was modulated.
[Figure: the first four noise eigenmodes (Eigenvector 1 through Eigenvector 4) mapped onto the DR6 spherical cells.]
6.3 DR6 Galaxy Overdensities
The noise modes described in the previous section reference the overdensity of number counts
due to photometric calibration errors. If we are to remove these errors from the primary signal,
then the signal itself must be represented in terms of galaxy overdensity.
\[
\delta'_{i} = \frac{n_{i}}{\langle n_{i} \rangle} - 1
\]
Here, 𝑛𝑖 is the number of MGS galaxies inside the boundaries of Cell i. This identification is
performed in a manner similar to that used to count angular randoms in each cell, with the
obvious difference that now no weights need be applied since all galaxies have real
three-dimensional positions courtesy of their redshifts.
The following procedure winnows the full set of MGS galaxies down into the number present in
a given cell:
1. Using the htmIDs, determine the galaxies that fall within the cell’s angular radius.
2. Using the front and back redshifts of the cell, keep only the galaxies that fall within the
cell’s redshift range.
3. Determine the Cartesian components of the galaxies (X,Y,Z) and of the cell’s center
(x,y,z) and require that
\[
(x - X)^{2} + (y - Y)^{2} + (z - Z)^{2} < R^{2}
\]
The expected number of MGS galaxies in Cell i, ⟨𝑛𝑖 ⟩, is the number one would anticipate in the
absence of clustering. It may be determined by using the parameterized evolving selection
function DESCRIBED IN SECTION X. Because each cell is modeled as having a front and a
back distance given in h-1 Mpc, those two radial distances define a spherical shell of a given
volume centered on Earth and extending from an inner radius redshift of z1 to outer radius z2. A
query over DR6 reveals the total number of MGS galaxies meeting our redshift quality criteria,
𝑁𝑀𝐺𝑆 = 498,867, while the selection function describes their distribution in redshift. A simple
integration of the selection function reveals the percentage of MGS galaxies, f, between
redshifts 𝑧1 and 𝑧2 . Note that f is independent of h since all distances are given in h-1 Mpc in
comoving space. Thus, the number of galaxies in a particular spherical shell is
𝑁𝑠ℎ𝑒𝑙𝑙 = 𝑓𝑁𝑀𝐺𝑆
From here one can determine the total number of MGS galaxies in each spherical shell
surrounding a given cell. The result would be a simple ratio,
\[
N_{cell} \rightarrow N_{shell}\left(\frac{V_{cell}}{V_{shell}}\right)
\]
but the surveyed volume is only a fraction of the spherical shell's total volume. The
spectroscopic coverage of DR6 is 6860 square degrees (http://cas.sdss.org/dr6/en/sdss/release),
independent of redshift since SDSS’s drift scanning observes all radial distances within the same
angular boundaries. If 𝑠 is made to represent the fraction of the full sphere covered by the DR6
spectroscopic footprint (i.e. 6860/41253), then the expected number of galaxies in Cell i will be
\[
\langle n_{i} \rangle = f_{i}\, N_{MGS}\left(\frac{V_{cell}}{s\, V_{shell}}\right) = f_{i}\, N_{MGS}\left(\frac{r_{cell}^{3}}{s\left(\chi_{i,\,outer}^{3} - \chi_{i,\,inner}^{3}\right)}\right)
\]
where the inner and outer radii are given as comoving distances corresponding to the cell’s
radius. The figure below illustrates the distribution of galaxy overdensities in DR6. Note that
the overdensity will equal negative one for cells which contain no MGS galaxies.
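Put together, the expected count and the overdensity per cell reduce to a few lines (Python; the names are assumptions, and s = 6860/41253 is the DR6 sky fraction quoted above):

def expected_count(f_i, n_mgs, r_cell, chi_inner, chi_outer, sky_fraction=6860.0 / 41253.0):
    """<n_i> = f_i * N_MGS * r_cell^3 / (s * (chi_outer^3 - chi_inner^3))."""
    return f_i * n_mgs * r_cell ** 3 / (sky_fraction * (chi_outer ** 3 - chi_inner ** 3))

def overdensity(n_galaxies, n_expected):
    """delta'_i = n_i / <n_i> - 1; equals -1 for a cell containing no MGS galaxies."""
    return n_galaxies / n_expected - 1.0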
7. Analysis
As described in Section 2, the removal of noise from the DR6 MGS data set first requires taking
the inner product of the overdensity data vector with each significant noise vector. The resulting
product of the data with noise mode i is given the symbol 𝜅𝑖 . From Section 6, we discovered that
there are 1833 such values of significance for DR6.
We now establish a new vector, 𝜹𝑐𝑙𝑒𝑎𝑛 , that is the difference between the cleansed overdensities
and the raw overdensities. We expect that for most cells, photometric calibration errors would
have only a small effect on the overdensity as the following figure verifies.
The change in the overdensity can be used to determine the change in the number of galaxies
counted.
𝑛𝑖𝑐𝑙𝑒𝑎𝑛 = ⟨𝑛𝑖 ⟩(𝛿𝑖𝑐𝑙𝑒𝑎𝑛 + 1)
The following figures capture the percentage change in the number counts as determined from
the cleaned overdensities. Note that I have excluded plotting cells with no MGS galaxies, since
by definition the percentage change would be undefined. In some cases, the relative change in
number count would take the “cleaned” number of galaxies below zero. In the following plots,
those changes are left in as is.
One concern is that in the process of removing the photometric noise, one might also be
removing significant portions of the signal as well. While this is unlikely to be the case for most
cells, it will almost certainly be the case for a few. Under such conditions, the relative change in
the number of galaxies will be much larger than photometric calibration errors alone could
account for.
For this reason, the data in the figures below have been preprocessed to remove outliers as
defined through the Median Absolute Deviation (MAD). Here, we exclude any cell whose
relative change in the number of galaxies, 𝑟𝑘𝑖 , satisfies the following condition:
\[
\left|\frac{r_{k}^{i} - \mathrm{median}(r_{k})}{1.4826 \cdot MAD}\right| > 3.5
\]
where i represents the ith data point and k represents the kth subset of the data. In the figure
below, we have split the data into 26 subsets, equally spaced in redshift. We note that the
processed data demonstrate the trend we expected, namely that photometric calibration
errors affect the number counts of galaxies to a greater extent at larger redshifts.
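The MAD cut above can be applied independently to each redshift subset with a few lines of NumPy (a sketch; the names and the per-subset application are illustrative assumptions):

import numpy as np

def mad_outlier_mask(r, threshold=3.5):
    """True where |r_i - median(r)| / (1.4826 * MAD) exceeds the threshold."""
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    return np.abs(r - med) / (1.4826 * mad) > threshold

# Applied per redshift subset: keep only the cells not flagged as outliers
# cleaned_subset = subset[~mad_outlier_mask(subset)]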
8. Conclusions