Cleansing Noise from Discretized Data Sets Using the Karhunen-Loève Transform

1. INTRODUCTION

Over the past thirty years, advances in digital technology (e.g. CCD cameras, high-capacity data servers) have transformed the field of astronomy into a true precision science. As scientific capability improves, errors once thought too small to be dealt with are emerging as the largest sources of uncertainty left to be removed from data sets. Among these are the photometric calibration errors associated with the Sloan Digital Sky Survey telescope (hereafter SDSS). In Section 2, I describe the Karhunen-Loève method used to remove them.

2. The Karhunen-Loève Method

The Karhunen-Loève (KL) transform is a linear algebra tool that transforms one's given basis into one that prioritizes the correlations between features. In the context of CMB measurements, the given basis would likely be the individual pixels that measure temperature. In galaxy surveys, the given basis is more naturally chosen to be distinct sections of space, or cells, inside which one measures some quantity such as galaxy counts. In any case, the given basis can easily extend into tens of thousands of dimensions, making it challenging to cleanse errors from that many data elements without the proper method.

The Karhunen-Loève method first requires the user to assemble an N-dimensional noise vector, with each dimension corresponding to a distinct cell of the discretized data set. In this description, let $\boldsymbol{\delta}$ represent the column vector containing the errors one wishes to eliminate. Details on constructing such a vector will be provided in Section 6. In many cases, the error one seeks to remove is the result of a random process, making $\boldsymbol{\delta}$ itself a random variable. Under these circumstances there will not exist a single error vector but, in principle, an infinite number of possible vectors. To fully encapsulate the distribution from which a given error originates, one needs to average over many realizations. The Karhunen-Loève method requires that this be done by creating "noise matrices", $\boldsymbol{\eta}$, from the outer products of the $\boldsymbol{\delta}$ vectors. Letting $\delta_i^\alpha$ represent the error in the ith cell during the αth realization, the αth noise matrix may be represented thusly:

$$
\boldsymbol{\eta}^\alpha = \boldsymbol{\delta}^\alpha \boldsymbol{\delta}^{\alpha *T} =
\begin{pmatrix}
(\delta_1^\alpha)^2 & \delta_1^\alpha \delta_2^\alpha & \cdots & \delta_1^\alpha \delta_N^\alpha \\
\delta_2^\alpha \delta_1^\alpha & (\delta_2^\alpha)^2 & \cdots & \delta_2^\alpha \delta_N^\alpha \\
\vdots & \vdots & \ddots & \vdots \\
\delta_N^\alpha \delta_1^\alpha & \delta_N^\alpha \delta_2^\alpha & \cdots & (\delta_N^\alpha)^2
\end{pmatrix}
$$

By summing these matrices and dividing by the number of realizations, M, one effectively takes the average, or expectation value, of the noise matrix:

$$
\text{Average Noise Matrix} \equiv \boldsymbol{\varepsilon} = E\left[\boldsymbol{\eta}^\alpha\right] = \frac{1}{M} \sum_\alpha \boldsymbol{\eta}^\alpha
$$

By definition, the expectation value of the product of a real column vector with its transpose is a correlation matrix. If we let $\boldsymbol{e}$ represent an eigenvector and $\lambda$ an eigenvalue, then

$$\boldsymbol{\varepsilon}\boldsymbol{e} = \lambda \boldsymbol{e}$$

Correlation matrices are Hermitian, positive semi-definite, and, perhaps most importantly, their eigenvectors may be chosen to be orthonormal. Therefore,

$$
\boldsymbol{e}_i^{*T} \boldsymbol{\varepsilon}\, \boldsymbol{e}_j = \lambda_j\, \boldsymbol{e}_i^{*T} \boldsymbol{e}_j =
\begin{cases} \lambda_j, & i = j \\ 0, & i \neq j \end{cases}
$$

A matrix representation of this relationship is

$$
\begin{pmatrix}
\boldsymbol{e}_1^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_1] & \boldsymbol{e}_1^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_2] & \cdots & \boldsymbol{e}_1^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_N] \\
\boldsymbol{e}_2^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_1] & \boldsymbol{e}_2^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_2] & \cdots & \boldsymbol{e}_2^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_N] \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{e}_N^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_1] & \boldsymbol{e}_N^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_2] & \cdots & \boldsymbol{e}_N^{*T}[\boldsymbol{\varepsilon}\boldsymbol{e}_N]
\end{pmatrix}
=
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_N
\end{pmatrix}
$$

If

$$
\boldsymbol{E} = \begin{pmatrix} | & | & & | \\ \boldsymbol{e}_1 & \boldsymbol{e}_2 & \cdots & \boldsymbol{e}_N \\ | & | & & | \end{pmatrix}
$$

then this transform may be rewritten as

$$
\boldsymbol{E}^{*T} \boldsymbol{\varepsilon}\, \boldsymbol{E} =
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_N
\end{pmatrix}
$$

The eigenvectors stored in E provide an attractive opportunity. Collectively, they serve as a complete orthogonal set of basis vectors. This allows one's signal data to be projected onto these new vectors without any worry of overlap.
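As a concrete illustration of the procedure above, the following sketch (Python with NumPy) builds the average noise matrix from simulated realizations, diagonalizes it, and subtracts the largest noise modes from a signal vector, anticipating the cleaning step described next. The array sizes, the simulated realizations, and the M′ = 2 cutoff are illustrative assumptions, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 500, 2000                  # cells, noise realizations (illustrative sizes)

# Simulated error realizations: each row is one delta^alpha vector (stand-ins only).
deltas = rng.normal(size=(M, N)) @ np.diag(1.0 + rng.random(N))

# Average noise matrix: eps = (1/M) * sum_alpha delta delta^T
eps = deltas.T @ deltas / M

# KL transform: eigh handles symmetric matrices, eigenvalues come back ascending.
lams, E = np.linalg.eigh(eps)
lams, E = lams[::-1], E[:, ::-1]          # reorder from largest to smallest

# Project a signal onto the M' largest noise modes and subtract them out.
M_prime = 2                                # number of significant modes (assumed)
signal = rng.normal(size=N)
kappa = E[:, :M_prime].T @ signal          # kappa_i = e_i . s
signal_clean = signal - E[:, :M_prime] @ kappa
```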
If it were possible to identify the eigenmodes of the average noise matrix most responsible for the errors, then one could identify the amount of signal that falls onto each "large" noise mode and subtract it out of the signal itself. Let s represent the N-dimensional signal from a discretized data set that one wishes to clean. This signal can be expanded in terms of the orthogonal basis vectors from the KL transform:

$$\boldsymbol{s} = \kappa_1 \boldsymbol{e}_1 + \kappa_2 \boldsymbol{e}_2 + \cdots + \kappa_N \boldsymbol{e}_N$$

Because the eigenvectors $\boldsymbol{e}_1$, $\boldsymbol{e}_2$, etc. are orthonormal, the coefficients are the inner products of each noise eigenvector with the signal:

$$\kappa_i = \sum_{n=1}^{N} e_i^*[n]\, s[n]$$

A unique feature of the KL transform is its ability to optimally represent a truncated expansion of a signal s. Instead of using all N eigenvectors and coefficients, the signal is most faithfully reproduced using M < N modes when those M modes correspond to the M largest eigenvalues, λ. If M′ is the number of significant noise modes (i.e. modes with large eigenvalues), then the noise itself can be represented by a truncated basis with dimensionality much less than N:

$$\boldsymbol{s}_{noise} = \kappa_1 \boldsymbol{e}_1 + \kappa_2 \boldsymbol{e}_2 + \cdots + \kappa_{M'} \boldsymbol{e}_{M'}$$

One method of eliminating the noise is simply to discard the modes corresponding to those errors:

$$\boldsymbol{s}_{clean} = \boldsymbol{s} - \sum_{i=1}^{M'} \kappa_i \boldsymbol{e}_i$$

After cleansing, the user should verify that the power of the subtracted noise is not large compared to the power of the signal itself. Under these conditions we can assume that the part of the signal removed is primarily noise and does not carry much useful signal information with it. In subsequent sections, this technique will be applied to the photometric calibration errors present in the Sloan Digital Sky Survey.

3. The Sloan Digital Sky Survey

Data for the Sloan Digital Sky Survey are provided by a dedicated 2.5-meter, 3° wide-field telescope located at Apache Point Observatory in New Mexico. Photometric information is gathered using five essentially non-overlapping bandpass filters (u′, g′, r′, i′, z′) covering a range from the ultraviolet limit of our atmosphere at 3000 Å to the sensitivity limit of silicon CCDs at 11000 Å (Fukugita, 1996). The set of five filter passbands is positioned in each of six adjacent camera columns (camcols), for a total of thirty 2048 × 2048 pixel CCDs (Gunn, 1998).

The Sloan telescope operates by drift scanning over great circles on the sky. A full stretch of observational area along such a circle is defined to be a "stripe", bounded on the east and west by lines of constant "lambda" and on the north and south by lines of constant "eta", 2.5° apart. "Lambda" and "eta" are generalized spherical survey coordinates rotated from the standard RA and Declination coordinate system. Due to the physical separation of the camera columns, a full stripe can be observed only by summing two offset but partially overlapping "strips". Each strip is described by the region covered by its six camera columns, each of which is called a "scanline". In short, six scanlines comprise one strip, and two strips form a full stripe. At times, it may not be possible for SDSS to observe one full stripe during a single, contiguous observing pass due to degradations in observing conditions. In such cases, the stripe is split into two or more adjacent "runs". Each scanline in a run is called a SEGMENT. Photometric calibration must be done for each camcol for every run (i.e. for every SEGMENT).
Photometric calibration is a multistep process facilitated by three telescopes: Apache Point's 20-inch Photometric Telescope (PT), the United States Naval Observatory's (USNO's) 40-inch telescope in Flagstaff, Arizona (Smith, 2002), and the SDSS main telescope. The USNO telescope has collected and calibrated observations of a network of 158 bright primary standard stars, which tend to saturate the SDSS main telescope, along with a set of secondary patches of sky (Tucker, 2006). The PT observes the primary stars along with the sky patches that overlap the SDSS 2.5-meter's scanning, so that the primary star calibration may be tied to fainter stars in the secondary patches. To quantify the atmospheric extinction, the PT must observe the primary stars through different air masses. While there are differences in both observing conditions (USNO: ambient, PT: dry air, SDSS 2.5-meter: vacuum) and filter responses at the three telescopes, these have been well accounted for during data reduction. The errors that accumulate throughout this process lead to a photometric uncertainty of approximately 2% rms in r (Ivezić, 2004). It is this uncertainty that we attempt to model and eliminate.

In this paper we will only consider galaxies that fall into SDSS's main galaxy sample (MGS; Strauss et al. 2002). All MGS galaxies are required to have Petrosian-corrected r-band magnitudes brighter than 17.77. Failure to exactly calibrate the r-band in each camcol may lead to the exclusion of "real" MGS galaxies or the inclusion of galaxies too dim to meet the strict MGS criteria. In effect, photometric calibration errors may lead to the overcounting or undercounting of MGS galaxies in sections of space.

We limit the MGS sample in a few ways. First, we exclude galaxies with Petrosian-corrected r-band magnitudes brighter than 15, redshifts greater than 0.3, and redshift confidences less than 0.9. Because the depth of each galaxy is important in this analysis, objects without redshift measurements or for which redshift measurements failed are not included. We additionally exclude galaxies whose redshifts were taken "by hand" with low confidence, as well as those whose cross-correlation and emission-line redshifts are inconsistent. These conditions follow SkyServer's definitions of the MGS criteria.

Petrosian magnitudes are used because, in the absence of seeing, they measure a constant fraction of a given galaxy's light regardless of distance or size. The Petrosian flux is defined as the flux within a circular aperture of radius $N_P\,\theta_P$, where $N_P = 2$ for the SDSS and $\theta_P$ is the radius at which the "Petrosian ratio" (the ratio of the local surface brightness averaged over an annulus at θ to the mean surface brightness within θ) falls below a threshold value, set to 0.2 for the SDSS (Blanton et al., 2003).

Once a large number of objects have been observed photometrically, these chunks are fed through a pipeline to establish whether or not they are candidates for spectroscopic observation. Any object that passes the criteria for spectroscopy becomes part of the sky's target version. The process of determining which targeted objects will be observed spectroscopically is called tiling; the details are described in Blanton (2003a). Spectroscopic fibers are physically drilled into a one-meter round aluminum plate called a tile. At most 640 fibers may be fed into a single tile, and due to drilling constraints, no two fibers can be closer together than 55′′.
During each target version, the targeting software lays down tiles in such a way as to maximize the number of useful spectra gained. This means that densely populated areas of the sky may be covered by multiple overlapping tiles. The number of tiles covering a particular region of sky is referred to as the depth of that region. Tiling may be further complicated by the existence of a tiling boundary, usually a rectangular area on the surface of a sphere defined to be the only permissible area inside which to assign spectroscopic targets, and of tiling masks, areas where no fibers are to be placed even if targets exist there. The combination of these different spectroscopic descriptions leads to the definition of sectors (Blanton, 2005; Hamilton & Tegmark, 2004), each of which has, in general, its own spectroscopic coverage properties. Sectors are individual to each data release, with DR6 possessing 9464 of them. To fully describe a sector, it is important to know its ratio of spectroscopically observed galaxies to targeted galaxies. In Section 5, we will describe our method of estimating this percentage, which we term the redshift completeness of each sector.

4. Modeling Photometric Calibration Errors

Since photometric calibration errors are a random process, I employ a Monte Carlo method to best describe their average effect over many realizations. This problem essentially reduces to counting the percent change in the number of galaxies in a section of sky given a particular realization of photometric errors. The first step in quantifying these changes is the creation of a set of "countable galaxies" using randomly distributed points.

4.1 Angular Randoms

We begin by constructing a collection of random points with RA and Declination information (hereafter "angular randoms") distributed uniformly over three-quarters of the sky; a generation sketch follows below. Ideally, we would create an arbitrarily large number of points to ensure high-resolution results. In reality, increasing the quantity of angular randoms beyond a certain number yields diminishing returns while slowing queries over them. The middle ground we select is to place an average of 1000 angular randoms in every square degree of the sky, ensuring full coverage of the Sloan DR6 footprint.

Counting angular randoms over geometric sections of the sky must be an efficient operation. Since the positional information of these randoms is fixed, there will be a need to routinely access data about a given point and to determine whether or not it meets specific criteria. The nature of this problem invites the use of databases accessed through the Structured Query Language (SQL). With tens of millions of angular randoms in our simulation, a naïve comparison of every point's spatial information with geometric requirements (e.g. that a point lies within a certain angular distance of another object) would be needlessly time consuming. An ideal system would be optimized for queries on spherical surfaces while minimizing computational expenditure. For this reason, all angular randoms are assigned identification numbers called "htmIDs", where "htm" stands for Hierarchical Triangular Mesh (Szalay, 2005).
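As a sketch of how such angular randoms might be generated (the density target of 1000 points per square degree is from the text; the inverse-sine sampling in Declination is the standard way to obtain uniform surface density on a sphere, and the footprint trimming is assumed to happen later in the database):

```python
import numpy as np

rng = np.random.default_rng(42)
density = 1000                                # points per square degree (from the text)
sky_area = 41253.0                            # full-sphere area in square degrees
n_points = int(0.75 * sky_area * density)     # three-quarters of the sky

# Uniform on the sphere: RA uniform in [0, 360); Dec drawn so that sin(Dec)
# is uniform in [-1, 1], which keeps the surface density constant.
ra = rng.uniform(0.0, 360.0, n_points)
dec = np.degrees(np.arcsin(rng.uniform(-1.0, 1.0, n_points)))

# In practice these (ra, dec) pairs would then be loaded into the database,
# assigned htmIDs, and trimmed to the DR6 footprint.
```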
The HTM infrastructure is constructed by drawing successively smaller triangles on a sphere. First, each hemisphere of the sphere is split into four equal-area triangles by connecting the poles to the equator. Each triangle is then split in four again by drawing lines between its sides' midpoints. This process is repeated until the desired resolution is reached, around 20 such splits for SDSS. Objects in a given triangle are stored near each other on disk, greatly reducing the time needed for searches provided the data set is ordered ("indexed") properly in htmID.

Additional position information is then assigned to each angular random using the Sloan geometry. Given an (RA, Dec) pair, SDSS geometry fully determines lambda and eta (Sloan survey coordinates), which determine the stripe number, which in turn determines mu and nu (coordinates in the SDSS great circle coordinate system). The strip and camcol assignments are generated geometrically from the nu value using simple relationships.

As described in Section 3, not all areas of the SDSS photometric footprint are observed with the same completeness. We would like to quantitatively compare the number of galaxies observed spectroscopically with the number merely targeted for spectroscopy; sections of the footprint where this fraction is too small will not be considered in this analysis. Since not all SECTORs in SDSS are observed at the same completeness, we must add SECTOR information to each angular random as well. Performing this assignment quickly again requires the use of the HTM functions. The geometric area of each of DR6's 9464 SECTORs is stored in the Region table under the regionString heading. The SDSS function fHtmCoverRegion translates this area into a list of contiguous htmIDs. While this list of htmID ranges completely covers the SECTOR, some htmIDs in the list may lie outside the SECTOR. A second routine joins these results with the SDSS Halfspace table to identify the angular randoms that reside fully inside the SECTOR. After all SECTORs have been examined in this manner, any angular random that has not been linked to a SECTOR is assumed to lie outside the spectroscopic footprint and is assigned SECTOR = 0. Finally, the SEGMENT in which each angular random resides can be found. The geometric area of each of DR6's 2052 SEGMENTs is also described in the Region table, allowing SEGMENT assignments to be made in a manner very similar to that used for SECTOR assignments.

4.2 Selection Function Parameterization

While the angular randoms will ultimately stand in for galaxies when the Monte Carlo simulations are performed, they are distributed on the surface of the celestial sphere with only two dimensions of position information specified. Accurately reflecting the percentage of angular randoms that would exist between any two redshifts requires detailed knowledge of the MGS's distribution as a function of redshift. To model this distribution I start with the Schechter luminosity function:

$$\Phi(L)\,dL = \Phi^* \left(\frac{L}{L^*}\right)^{\alpha} e^{-(L/L^*)}\, \frac{dL}{L^*}$$

This equation has three parameters: Φ* (a normalization parameter), L* (a characteristic luminosity), and α (a faint-end slope). Since I will only be considering derivatives of this function, the normalization parameter Φ* does not need to be determined. While this function can successfully be used to model the MGS distribution, I have found a four-parameter evolving Schechter luminosity function to provide a better fit to the data.
The form of the evolving function is the same, with one exception: L* is permitted to vary with redshift,

$$L^*(z) = L^*_0 \left(\frac{1+z}{1+z_0}\right)^{B}$$

$L^*_0$ still needs to be parameterized, of course, but now the parameter B must be as well. $z_0$ is set to 0.1, the median redshift of the galaxies under consideration. The selection function, which gives the probability that a galaxy at distance x from the observer is included in the catalogue, is given by

$$\phi(x) = \int_{L_{min}(x)}^{\infty} \Phi(L)\, dL$$

Although we ultimately need the integral's limits in terms of the absolute limiting magnitude M, it is often easier to do integrals of the Schechter function in terms of L, because they can be cast in terms of incomplete gamma functions. Making the substitution

$$u = \frac{L}{L^*_0 \left(\frac{1+z}{1+z_0}\right)^{B}}$$

the selection function can be rewritten as

$$\phi(x) = \Phi^* \int_{u_{min}(x)}^{\infty} u^{\alpha} e^{-u}\, du$$

The form of the incomplete gamma function is

$$\Gamma(a, z) = \int_{z}^{\infty} u^{a-1} e^{-u}\, du$$

so the selection function may be recast in the following manner:

$$\phi(z) = \Phi^*\, \Gamma\!\left(\alpha + 1,\; \frac{L_{min}(z)}{L^*_0 \left(\frac{1+z}{1+z_0}\right)^{B}}\right)$$

The luminosity limit above needs to be recast in terms of a magnitude limit via the distance modulus (Efstathiou, 1993):

$$M_{lim} = m_{lim} - 5 \log_{10} d_L(z) - 25 - k(z)$$

Here k(z), the K-correction at redshift z, adjusts the magnitude of each object so that all objects may be compared in a common rest-frame filter. Typically, K-corrections are assigned individually to each galaxy by fitting its five ugriz magnitudes with a set of thirty galaxy spectral energy distribution templates developed through stellar population synthesis models (Bruzual & Charlot, 2003). Tamas Budavari has developed a code that performs these fittings for SDSS galaxies and stores the results in a database. Because angular randoms possess no intrinsic brightness information, assignment of exact K-corrections is impossible. Instead, since K-corrections are a generally increasing function of redshift, we sort MGS galaxies into a total of 300 redshift bins of size 0.001 between redshifts 0 and 0.3. The average correction for MGS galaxies in each bin is then used to calculate the limiting absolute magnitude in a quasi-discretized fashion. The lower limit of the selection function integral, $L_{lim}(z)$, can then be related to the distance modulus through

$$L_{lim} = L_0\, 10^{-0.4\,(M_{lim} - M_0)}$$

Since the limiting apparent magnitude, $m_{lim}$, in the r-band is 17.77, the final quantity requiring evaluation is the luminosity distance as a function of redshift. Following Peebles (1993), I introduce the function

$$E(z) \equiv \sqrt{\Omega_m (1+z)^3 + \Omega_k (1+z)^2 + \Omega_\Lambda}$$

and assume a flat universe ($\Omega_k = 0$). The total line-of-sight comoving distance to an object at redshift z is given by

$$D_C = \frac{c}{H_0} \int_0^z \frac{dz'}{E(z')}$$

The luminosity distance is then

$$D_L = D_C(t_0)\,(1+z)$$

which can be evaluated through numerical integration. The expected number distribution of galaxies as a function of redshift depends on the selection function (see Section 3.3 of Adrian's thesis):

$$n_{exp}(z)\, dz = \phi(z)\, \frac{dV(z)}{dz}\, dz \propto \phi(z)\, D_C^2\, \frac{c}{H_0}\, \frac{dz}{\sqrt{\Omega_m (1+z)^3 + \Omega_\Lambda}}$$

Working with these functions requires choosing a cosmological model, whose parameters I list below along with others used later in this section.

Variable        Value   Description
Omega_m         0.3     Mass density parameter
Omega_Lambda    0.7     Dark energy density parameter
h               1.0     Hubble parameter, set to match Blanton et al. (2003) and his Schechter parameters
m_lim           17.77   Limiting apparent magnitude in the survey
m_max           15.00   Maximum (brightest) apparent magnitude allowed in the survey
M_max           -20     Upper luminosity limit of the selection function (if not infinity), dependent on M_max(z), which is in turn dependent on m_max
M_lim           -17     The analogous limit, set for M_lim(z)
incr            0.02    Size of the step taken to calculate the derivative of the selection function
m_lim + incr    17.79   Upper limit of magnitude used to calculate the derivative of the selection function
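The following sketch (Python with NumPy/SciPy) illustrates how the selection function, the luminosity distance, and the magnitude derivative used in Section 4.3 can be evaluated numerically. The Schechter parameters shown are placeholders, not the fitted values of this paper. One practical wrinkle: SciPy's regularized upper incomplete gamma function `gammaincc(a, x)` requires a > 0, so for a faint-end slope near α ≈ −1 one applies the recurrence Γ(a+1, x) = a Γ(a, x) + x^a e^(−x) to reach a positive first argument.

```python
import numpy as np
from scipy.special import gammaincc, gamma
from scipy.integrate import quad

# Illustrative cosmology and Schechter parameters (placeholders, not this paper's fits).
OMEGA_M, OMEGA_L, H0, C = 0.3, 0.7, 100.0, 299792.458   # h = 1 => H0 = 100 km/s/Mpc
ALPHA, M_STAR, B, Z0 = -1.05, -20.44, 1.0, 0.1          # alpha, M*, evolution B, pivot z0
M_LIM_APP, INCR = 17.77, 0.02

def upper_gamma(a, x):
    """Unregularized upper incomplete gamma, extended to a <= 0 by recurrence."""
    if a > 0:
        return gammaincc(a, x) * gamma(a)
    # Gamma(a, x) = (Gamma(a + 1, x) - x**a * exp(-x)) / a
    return (upper_gamma(a + 1, x) - x**a * np.exp(-x)) / a

def E(z):
    return np.sqrt(OMEGA_M * (1 + z)**3 + OMEGA_L)       # flat universe

def d_lum(z):
    """Luminosity distance in h^-1 Mpc via numerical integration."""
    d_c = (C / H0) * quad(lambda zp: 1.0 / E(zp), 0.0, z)[0]
    return d_c * (1 + z)

def selection(z, m_lim=M_LIM_APP, k_corr=0.0):
    """phi(z) up to the Phi* normalization, which cancels in the ratios used here."""
    m_abs_lim = m_lim - 5 * np.log10(d_lum(z) * 1e6 / 10) - k_corr   # distance modulus
    m_star_z = M_STAR - 2.5 * B * np.log10((1 + z) / (1 + Z0))       # evolving L*
    u_min = 10**(-0.4 * (m_abs_lim - m_star_z))                      # L_lim / L*(z)
    return upper_gamma(ALPHA + 1, u_min)

def f_frac(z):
    """Fractional change in counts per unit limiting magnitude (see Section 4.3)."""
    return (selection(z, M_LIM_APP + INCR) - selection(z)) / (INCR * selection(z))
```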
The goal is to parameterize the evolving Schechter function to best fit the actual distribution of MGS galaxies in SDSS. To do so, I first begin with a parameterized form of the galaxy distribution:

$$f(z) \propto z^{g}\, e^{-(z/z_c)^{b}}$$

in which z is redshift and the parameters g, $z_c$, and b are to be determined. Fitting this function leads to the following best fits:

Parameter   DR4 Best-Fit Value   DR5 Best-Fit Value   DR6 Best-Fit Value
g           1.658                1.617                1.543
b           1.720                1.718                1.720
z_c         0.0923               0.0933               0.0948

The charts below illustrate the level of agreement for DR5 and DR6. I find that selecting a set of three parameters based on all of the histogram's data points leads to a curve that does not match the histogram well at larger redshifts. The primary cause is the Great Wall structure, which dominates the least-squares fitting and which is not well represented by the selection function. Removing the values in the redshift range 0.0575 to 0.1225 (14 of 60 data points) achieves tighter fits.

To parameterize the evolving Schechter function, I use this three-parameter fit as a guide for a couple of reasons. First, I look to match the shape of the selection function as well as possible. Since my current curves match very well, especially at high redshift where it is most important, replicating this shape is more sensible than actively modeling the noise of the histogram. Second, while my histogram has only a limited number of available data points, my curve is created from a function, allowing it to be fit at any number of positions I choose. The charts below show the best-fit Schechter parameters and their visual levels of agreement with the SDSS MGS distribution.

4.3 Galaxy Count Dependence on Photometric Calibration Errors

The primary justification for working with the evolving Schechter function as opposed to the parameterized functional fit is that the former has magnitude dependence. One can modulate the limiting magnitude slightly and determine the fractional change in the number of galaxies expected by propagating that change through the selection function. If we let $N_{total}$ and $N$ represent the number of galaxies counted in a section of space with and without photometric calibration errors respectively, then

$$N_{total} = N + \Delta N$$

where ΔN is the change in number count due to those errors. Since the present goal is to represent ΔN as a function of varying apparent magnitude, I introduce a function f(z), the fractional change in the number of galaxies per unit limiting apparent magnitude. ΔN may then be expressed as

$$\Delta N = N\, f(z)\, dm$$

where f(z) is related to the change in the selection function:

$$f(z) = \frac{1}{\phi(z)}\, \frac{\partial \phi(z)}{\partial m} = \frac{\partial \ln \phi(z)}{\partial m}$$

The total number of galaxies counted in a section of space is then

$$N_{total} = N \left[1 + f(z)\, dm\right]$$

The above derivative is best approximated numerically by taking the difference between two selection functions with slightly different limiting apparent magnitudes:

$$f(z) \approx \frac{1}{\phi(z)}\, \frac{\phi_+(z) - \phi(z)}{incr}$$

As listed in the table above, the apparent magnitude increment is chosen to be 0.02, which makes $\phi_+(z)$ the selection function at a limiting apparent magnitude of 17.77 + 0.02 = 17.79. The shape of this function is shown in the figure below.
[Figure: f(z) as a function of redshift over 0 < z < 0.25; the curve rises steeply with redshift.]

This figure indicates that photometric calibration errors should have a greater effect on galaxy counts at larger redshifts.

5. Discretizing the SDSS into Survey Cells

Application of the Karhunen-Loève transform requires that one's data be discretized. In galaxy surveys, a customary discrete unit is a cell of some fixed size in redshift space. Future analysis will use these cells to compute a cell-averaged correlation matrix. Evaluation of that matrix requires an integration that is simplified if the cells are spherical, which justifies their shape in this analysis. These cells should be numerous enough to capture the true distribution of galaxies without oversmoothing, but not so numerous as to make calculations over them computationally expensive.

5.1 Creation and Placement of Survey Cells

As a compromise between having a very large number of cells and being able to perform statistics without making the problem too time intensive computationally, we choose to lay down identical spherical cells of radius $6\ h^{-1}\,\text{Mpc}$. Choosing the Hubble parameter to be h = 1 has the added benefit of making the cells equally sized in both redshift space and actual space, and it simplifies the calculation of each cell's angular extent. The spheres are created to fill all of space in the redshift range ~0.047 to 0.172, which corresponds to the location of more than three-quarters of all MGS galaxies in DR6.

The most efficient way to maximize the volume of space filled by non-overlapping spherical cells is the Hexagonal Closest Packing (HCP) arrangement, which in pure Euclidean geometry fills about 74% of the available volume. The HCP operates by placing spheres at every integer linear combination of three unit vectors:

$$\boldsymbol{A} = (1,\, 0,\, 0) \qquad \boldsymbol{B} = \left(\tfrac{1}{2},\, \tfrac{\sqrt{3}}{2},\, 0\right) \qquad \boldsymbol{C} = \left(\tfrac{1}{2},\, \tfrac{\sqrt{3}}{6},\, \tfrac{\sqrt{6}}{3}\right)$$

To account for the size of the spheres, each of these vectors is scaled by the separation between cell centers, i.e. twice the radius. Simple relationships transform the Cartesian positions to Right Ascension, Declination, and depth in $h^{-1}\,\text{Mpc}$. From here, the sample is whittled down further by eliminating all cells that fail to reach inside the photometric footprint. To do this, I invoke fFootprintEq, an SDSS routine that reveals whether an angular extent around a given position falls within the regions CHUNK, PRIMARY, TILE, or SECTOR. If none of these regions falls within a cell radius of the cell's center, then I can safely remove that sphere from consideration.

Additional information may be ascribed to each cell. Using the cosmological parameters in the table above, a cell's depth may be used to determine the redshift of its center, from which the angular diameter distance can be calculated. This is particularly crucial since the number of angular randoms in each cell will depend on the angular radius, θ, of each cell when projected onto the sky. Since the physical size and depth of each cell are specified beforehand, the angular radii of the spheres are found from

$$\theta = \frac{6\ h^{-1}\,\text{Mpc}}{D_A}$$

The angular randoms are distributed on a two-dimensional sphere, while the cells are three-dimensional. The number of angular randoms counted inside a cell's angular radius therefore represents the total number of angular randoms in a cone of the given angular extent. Consequently, I must employ two weighting mechanisms that transform the number count in a cone into a number count in a sphere at a given redshift. The first weighting mechanism constrains the number count in redshift space.
The depth and radius of the spheres can be used to define two redshifts, $z_1$ and $z_2$, corresponding to the front and back edges of the sphere as viewed along the line of sight. Integrating the normalized MGS galaxy distribution function between those limits produces the percentage of galaxies, or in this case angular randoms, within the sphere's redshift range. Each angular random within the cell's angular radius then receives a weight equal to this fraction to account for depth.

The second weighting accounts for each cell's spherical geometry. I accomplish this by weighting number counts as a function of the viewing angles θ and φ, the angular distances in RA and Dec respectively between a random point and the center of the sphere. At the limits, 100% of the angular randoms along a line through the center of the sphere lie inside the cell, while none of the angular randoms along a line tangent to the sphere lie inside it. Angles between these two extremes yield intermediate percentages. If the distance to the center of the sphere is d, and the radius of the sphere is r, then the percentage of angular randoms inside a cell as a function of angle may be approximated by

$$w_a(\theta, \phi) = \sqrt{\frac{r^2 - d^2\left[1 - (1 - \sin^2\theta)(1 - \sin^2\phi)\right]}{r^2}}$$

This derivation is for a flat-space geometry in a non-expanding universe. To apply it to this simulation, the depth value d needs to be modified from its value in redshift space. I choose to modify d because the angular radius, which determines the number of angular randoms in each cell, clearly cannot be altered without affecting the number count, and because the radius r has been established as a fixed quantity. In a flat geometry, the distance to an object of angular radius θ and physical radius r is $d = r / \sin\theta$, a value that differs from the actual redshift distance due to cosmological effects. Substituting $r / \sin\theta$ for d in the above equation, I establish the distance to a cell assuming space is totally flat, allowing the angular weighting formula above to be used.

Ultimately, each angular random within a cell's angular radius receives a weight equal to the product of its redshift weight, $w_r$, and its angular weight, $w_a$. Every point within a cell has the same redshift weighting, since each point shares the same probability of being included in the cell's specified redshift range. In contrast, each point receives its own angular weighting, since each point lies at a unique angular distance from the center of its cell. From here, the total number of angular randoms within a particular cell is

$$N = \sum_{i=1}^{N_c} w_r^i\, w_a^i$$

where $N_c$ is the number of angular randoms within the cell's angular radius.

With the angular randoms now completely specified relative to our survey geometry, all that remains is to count the number in each cell. More specifically, since the photometric calibration errors are to be applied per SEGMENT, one needs to count the total number of angular randoms in each SEGMENT in each cell. Fortunately, the nature of the number-count equation above requires that this counting be performed only once. With millions of randoms requiring weighting factors for tens of thousands of cells each, intelligent counting techniques are a necessity. For each cell, one begins by using the unit vector to the cell's center and the cell's angular radius to create a "cell description", which may then be translated into htmID ranges using the SDSS fHtmCoverRegion function. A sketch of the weighting and counting steps follows.
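A minimal sketch of the two weights and the weighted count, in Python with NumPy. The angular-weight formula follows the reconstruction given above (a line-of-sight chord fraction under the stated flat-space assumptions); the function and variable names are illustrative, and the htmID-based preselection of points near the cell is assumed to have already happened.

```python
import numpy as np

def angular_weight(theta, phi, d, r):
    """Fraction of a line of sight lying inside a sphere of radius r at distance d,
    for angular offsets theta (RA) and phi (Dec) from the sphere's center."""
    sin2_psi = 1.0 - (1.0 - np.sin(theta)**2) * (1.0 - np.sin(phi)**2)
    inside = r**2 - d**2 * sin2_psi
    return np.sqrt(np.clip(inside, 0.0, None)) / r   # clip: tangent and beyond -> 0

def cell_count(thetas, phis, w_r, ang_radius, r_cell):
    """Weighted number of angular randoms in one cell: N = sum_i w_r * w_a^i.
    w_r is the single redshift weight shared by every point in the cell."""
    d_flat = r_cell / np.sin(ang_radius)             # flat-space distance, d = r / sin(theta)
    w_a = angular_weight(thetas, phis, d_flat, r_cell)
    return w_r * np.sum(w_a)
```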
As long as the angular randoms are indexed by their htmIDs, parsing the millions of randoms down to the set in the proximity of a given cell is not overly time intensive. At this point, a simple test of the form

$$(x_i - x)^2 + (y_i - y)^2 + (z_i - z)^2 < r^2$$

may be run on each angular random to ensure that it lies within the cell's spherical boundary, where the ith angular random is at position $(x_i, y_i, z_i)$ and the center of the cell of radius r is at position $(x, y, z)$. Finally, weights are applied to each point, sums are taken over each SEGMENT, and the results are used to modulate the number counts as described in Section 6.

5.2 Determining a Cell's SECTOR Completeness

To understand what percentage of galaxies targeted for spectroscopy in a particular cell actually have spectra taken, one must begin by examining SDSS's SECTORs. Each SECTOR is observed during a particular "TargetVersion" by up to five overlapping spectroscopic tiles. The geometric description of each SECTOR is used to identify which MGS galaxies exist within its boundaries. All MGS galaxies are viable targets for spectroscopy. In particularly crowded areas of the sky, however, fibers cannot be placed close enough to each other on the tile to obtain spectra for all desired objects. We call the percentage of those galaxies for which spectroscopy is successfully taken the completeness. Every spectroscopically observed object is given a "specObjID" in SDSS, so SECTOR completeness can be established by comparing the total number of galaxies in a SECTOR fitting the MGS criteria to the number of those objects with specObjIDs.

Ultimately, our goal is to determine the MGS angular completeness of each sphere. Spheres contain multiple sectors, each of which has its own angular completeness. Naively, we could take the ratio NSpecObj/Targeted as the angular completeness, but for sectors with a small number of objects this is either inaccurate or impossible. Instead, we assume that the angular completeness is the same for all sectors of the same depth taken during the same TargetVersion. Since tiles are placed uniquely for each TargetVersion, sectors of the sky that end up covered by the same number of tiles should be similar. Out of 9464 DR6 sectors, there are only 90 distinct TargetVersion/nTiles pairs (hereafter "completeness versions"). Of these 90, seven contain no targeted objects. Their properties are listed below.

TargetVersion       nTiles   Targeted   NSpecObj   DR6 Sector   Angular Randoms Contained
+v2_5 +v3_1_0       3        0          0          41485        0
+v2_5 +v3_1_0       2        0          0          34697        1
+v2_11              4        0          0          37070        3
+v2_13_5 +v3_1_0    2        0          0          37927        0
+v4_5               5        0          0          40222        4
+v2_13_7            4        0          0          37730        0
+v6_4               4        0          0          36461        1

The 83 completeness versions with objects in them correspond to an area on the sky of 6807.93 square degrees; the seven without objects correspond to only 0.00834 square degrees. As the table shows, the seven completeness versions without any objects also contain a negligible nine angular randoms out of more than thirty million. Therefore, we simply remove those random points from our DR6 analysis. Of the remaining 83 completeness versions, two have multiple TargetVersions:

TargetVersion   nTiles   Targeted   NSpecObj   DR5 Sector
+v2_2 +v2_5     3        19         19         39801
+v2_2 +v2_5     2        44         44         39803

These two completeness versions amount to about 0.811 square degrees on the sky.
Below are the individual completeness versions that combine to form these two:

TargetVersion   nTiles   Targeted   NSpecObj   Completeness Fraction
+v2_2           3        1361       1338       0.9831
+v2_5           3        69         67         0.9710
+v2_2           2        10505      10147      0.9659
+v2_5           2        4636       4429       0.9554

The angular completenesses are not far apart for these completeness versions, so taking one or the other for the angular completeness of the "multiple TargetVersion" entries is a reasonable approximation. Lacking a better criterion with which to decide between the two, we simply take the fraction determined from the greater number of objects. For example, the angular completeness of completeness version (+v2_2 +v2_5, 3) will be 0.9831.

Once the above procedure is applied, each of the remaining 83 completeness versions has an angular completeness associated with it. Since each SECTOR maps uniquely to one of these completeness versions, each SECTOR now has an angular completeness value. Letting S represent the number of SECTORs present in a given cell, the total angular completeness of the cell is

$$\frac{\sum_{i=1}^{S} N_i\, c_i}{\sum_{i=1}^{S} N_i}$$

where $N_i$ is the number of angular randoms in the ith SECTOR and $c_i$ is the angular completeness of the ith SECTOR. Finally, we include in our noise simulation only spheres whose angular completenesses are greater than or equal to 65%. This ensures that we do not attempt to model noise statistics for regions of the sky that only slightly reach into the spectroscopic footprint or that have little spectroscopic coverage. There are 54896 spheres that meet the 65% criterion for DR5 and 66113 that meet it for DR6.

6. Data

6.1 Error Realizations and Construction of the Average Noise Matrix

We assume that photometric calibration errors follow a Gaussian distribution with mean zero and standard deviation 0.02 magnitudes. During each realization, a set of errors from this distribution is generated and applied to each SEGMENT in the survey. These errors affect the number count of angular randoms in each SEGMENT according to the number-count relation of Section 4.3. After all errors have been applied, the total change in number count in each cell may be found by summing over the changes in the cell's individual SEGMENTs.

At the beginning of Section 2 we described the creation of a noise matrix from the αth realization of the noise. The matrix is the outer product of the δ vector with itself. The ith value in δ contains a measure of the noise in the survey's ith cell as given by the overdensity

$$\delta_i^\alpha = \frac{g_i^\alpha}{n_i} - 1$$

where $\delta_i^\alpha$ is the overdensity of the ith sphere in realization α, $n_i$ is the number count when there are no photometric calibration errors, and $g_i^\alpha$ is the number count of the ith sphere in realization α.

Taking the outer product of the δ vector is computationally intensive. With 66,113 elements in each vector for DR6, the outer product has that many elements squared, or approximately $4.4 \times 10^9$ elements. At 8 bytes per element, storage of the full matrix costs upwards of 32 GB. However, outer products are always symmetric, so only half of these values are unique, reducing the memory requirements by a factor of two. If memory is a concern, the unique values of the outer product can be evaluated on their own "by hand". However, if large memory systems are available, there exist numerical recipes optimized for this sort of problem.
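A minimal sketch of one realization loop, in Python with NumPy. The segment-to-cell count matrix and the per-cell f(z) values are stand-ins for the weighted counts of Section 5 and the derivative of Section 4.3; the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_segments, n_real = 1000, 200, 100   # illustrative sizes only

# counts[i, j]: weighted number of angular randoms of cell i lying in SEGMENT j
# (a stand-in for the per-SEGMENT counts computed in Section 5).
counts = rng.random((n_cells, n_segments))
n_clean = counts.sum(axis=1)                   # error-free count per cell
f_of_z = rng.random(n_cells)                   # stand-in for f(z) at each cell's redshift

eps = np.zeros((n_cells, n_cells))             # accumulates the average noise matrix
for _ in range(n_real):
    dm = rng.normal(0.0, 0.02, n_segments)     # Gaussian calibration error per SEGMENT
    g = n_clean + f_of_z * (counts @ dm)       # N_total = N [1 + f(z) dm], summed per cell
    delta = g / n_clean - 1.0
    eps += np.outer(delta, delta)              # full outer product for simplicity here
eps /= n_real
```

In production one would update only the unique upper-triangle elements rather than the full outer product, which is exactly what the BLAS routine discussed next is designed to do.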
One could use BLAS (Basic Linear Algebra Subprograms) routines, which may be found in the larger LAPACK (Linear Algebra PACKage) libraries. For example, BLAS's "dsyr" function can aggregate the outer products of many different source vectors in the manner required to evaluate the average noise matrix above. The final average noise matrix is only an approximation, one that approaches the true correlation matrix as the number of realizations approaches infinity. In principle, we need only several thousand realizations to ensure that the deviations between successive matrices become small. To quantify how the matrices change as a function of the number of realizations, we evaluate a set of statistics for them, including the average root mean square magnitude of a matrix element and the quadratic deviation between two matrices, found by summing the squares of the differences between the elements of the two matrices and dividing by the number of elements. Finally, we examine the square root of the ratio of the quadratic deviation to the norm of the matrix with more realizations, or the "relative delta" of the two matrices. The comparisons in the charts below are between a DR6 average noise matrix created with 16,000 realizations and those created with fewer realizations. While the exact number of realizations should depend on the accuracy one seeks to achieve, it is clear that there are diminishing returns beyond 8000 realizations for this particular problem. In this analysis, we generate our average noise matrix using the largest number of realizations we ran for DR6, 16,000.

6.2 Evaluation of Noise Eigenmodes

Aside from running the noise realizations, which can take arbitrarily long depending on the accuracy one wishes to achieve, the most time intensive step in our noise elimination method is the Karhunen-Loève transform itself. Conceptually, the KL transform is a straightforward change of basis facilitated by diagonalizing the average noise matrix. In practice, a proper diagonalization routine must be selected with special consideration given to the memory limitations of one's system. We chose to utilize the eigenproblem algorithms in LAPACK to perform the KL transform. Prior to 2007, Intel's double-precision diagonalization algorithms required that slightly more than an entire N × N matrix's worth of memory be allocated. For problems of the scale described in this paper, this would require upwards of 40 GB of memory, a strict limitation for those without access to hefty computing resources. In 2007, however, Intel released an updated version of its LAPACK libraries, version 9.1, which corrected an inability to access vector addresses whose indices exceed the four-byte integer limit. In practice, this allows symmetric matrices to be stored as one-dimensional arrays containing only the matrices' unique elements. The memory requirements are effectively halved, and the problem becomes solvable on systems with fewer resources.

LAPACK eigenproblem solvers allow the user to choose which eigenvalues and/or eigenvectors are desired as output. Further, the user may dictate how many eigenvectors are to be output, either by their corresponding eigenvalues' rank (e.g. from the ith largest eigenvalue to the jth largest) or by their eigenvalues' magnitudes. Reporting all N eigenvectors would double the memory requirements of the problem. Since we are only interested in the largest eigenmodes, i.e. those most responsible for the noise, we may request that only the largest M′ be returned.
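In a NumPy/SciPy environment, the same subset request can be made through scipy.linalg.eigh, which wraps LAPACK's symmetric eigensolvers. A brief sketch (the matrix and the value of M′ are placeholders):

```python
import numpy as np
from scipy.linalg import eigh

N, M_prime = 2000, 50                    # illustrative dimensions only
A = np.random.default_rng(2).normal(size=(N, N))
eps = A @ A.T / N                        # stand-in symmetric positive semi-definite matrix

# Request only the M' largest eigenpairs by index range; SciPy dispatches
# to a range-capable LAPACK driver (syevr) for subset requests.
lams, vecs = eigh(eps, subset_by_index=[N - M_prime, N - 1])
lams, vecs = lams[::-1], vecs[:, ::-1]   # reorder from largest to smallest
```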
The nature of one's noise provides a good estimate for M′. Because our noise realizations treat SDSS SEGMENTs as the discrete noise units, the number of SEGMENTs that reach into the area covered by our DR6 spherical cells should provide an upper limit on the number of significant eigenvalues returned. The actual number of significant modes may be smaller if certain SEGMENTs contribute little photometric calibration noise; a greater number of modes would indicate an additional source of noise in our simulation, possibly unrelated to the SEGMENTs themselves. For DR6, we find that 1839 unique SEGMENTs reach into our cells' footprint at least to some degree. The chart below shows the ordered eigenvalues from the diagonalization of DR6's average noise matrix. Notice that the eigenvalues fall off by four orders of magnitude almost immediately beyond eigenvalue 1833. The eigenvectors corresponding to these largest modes contain information about the structure of the noise within our simulation and are directly related to the geometry of the SEGMENTs. The table below visually depicts the first four eigenmodes in terms of our original basis, the 66,113 DR6 spherical cells. The modes identify the SEGMENTs over which the noise was modulated.

[Figures: Eigenvector 1, Eigenvector 2, Eigenvector 3, Eigenvector 4]

6.3 DR6 Galaxy Overdensities

The noise modes described in the previous section reference the overdensity of number counts due to photometric calibration errors. If we are to remove these errors from the primary signal, then the signal itself must be represented in terms of galaxy overdensity:

$$\delta'_i = \frac{n_i}{\langle n_i \rangle} - 1$$

Here, $n_i$ is the number of MGS galaxies inside the boundaries of cell i. This identification is performed in a manner similar to that used to count angular randoms in each cell, with the obvious difference that no weights need be applied, since all galaxies have real three-dimensional positions courtesy of their redshifts. The following procedure winnows the full set of MGS galaxies down to the number present in a given cell:

1. Using the htmIDs, determine the galaxies that fall within the cell's angular radius.
2. Using the front and back redshifts of the cell, keep only the galaxies that fall within the cell's redshift range.
3. Determine the Cartesian components of each galaxy, (X, Y, Z), and of the cell's center, (x, y, z), and require that

$$(x - X)^2 + (y - Y)^2 + (z - Z)^2 \le R^2$$

The expected number of MGS galaxies in cell i, $\langle n_i \rangle$, is the number one would anticipate in the absence of clustering. It may be determined using the parameterized evolving selection function described in Section 4.2. Because each cell is modeled as having a front and a back distance given in $h^{-1}\,\text{Mpc}$, those two radial distances define a spherical shell of a given volume centered on Earth, extending from an inner radius at redshift $z_1$ to an outer radius at $z_2$. A query over DR6 reveals the total number of MGS galaxies meeting our redshift quality criteria, $N_{MGS} = 498{,}867$, while the selection function describes their distribution in redshift. A simple integration of the selection function gives the percentage of MGS galaxies, f, between redshifts $z_1$ and $z_2$. Note that f is independent of h, since all distances are given in $h^{-1}\,\text{Mpc}$ in comoving space. Thus, the number of galaxies in a particular spherical shell is

$$N_{shell} = f\, N_{MGS}$$

From here one can determine the total number of MGS galaxies in each spherical shell surrounding a given cell.
Naively, the expected count would follow from the simple ratio

$$\langle n_i \rangle \rightarrow N_{shell} \left(\frac{V_{cell}}{V_{shell}}\right)$$

but the observed volume is only a fraction of the spherical shell's total volume. The spectroscopic coverage of DR6 is 6860 square degrees (http://cas.sdss.org/dr6/en/sdss/release), independent of redshift, since SDSS's drift scanning observes all radial distances within the same angular boundaries. If s represents the fraction of the full sphere covered by the DR6 spectroscopic footprint (i.e. 6860/41253), then the expected number of galaxies in cell i is

$$\langle n_i \rangle = f_i\, N_{MGS} \left(\frac{V_{cell}}{s\, V_{shell}}\right) = f_i\, N_{MGS}\, \frac{r^3_{cell}}{s \left(\chi^3_{i,\,outer} - \chi^3_{i,\,inner}\right)}$$

where the inner and outer radii are given as comoving distances corresponding to the cell's radius. The figure below illustrates the distribution of galaxy overdensities in DR6. Note that the overdensity equals negative one for cells containing no MGS galaxies.

7. Analysis

As described in Section 2, the removal of noise from the DR6 MGS data set first requires taking the inner product of the overdensity data vector with each significant noise vector. The resulting product of the data with noise mode i is given the symbol $\kappa_i$. In Section 6 we found that there are 1833 such significant modes for DR6. We now establish a new vector, $\boldsymbol{\delta}_{clean}$, that is the difference between the cleansed overdensities and the raw overdensities. We expect that for most cells, photometric calibration errors have only a small effect on the overdensity, as the following figure verifies. The change in the overdensity can be used to determine the change in the number of galaxies counted:

$$n_i^{clean} = \langle n_i \rangle \left(\delta_i^{clean} + 1\right)$$

The following figures capture the percentage change in the number counts as determined from the cleaned overdensities. Note that cells with no MGS galaxies are excluded from the plots, since by definition the percentage change there would be undefined. In some cases, the relative change in number count would take the "cleaned" number of galaxies below zero; in the following plots, those changes are left as is.

One concern is that in the process of removing the photometric noise, one might also remove significant portions of the signal. While this is unlikely to be the case for most cells, it will almost certainly be the case for a few. Under such conditions, the relative change in the number of galaxies will be much larger than photometric calibration errors alone could account for. For this reason, the data in the figures below have been preprocessed to remove outliers as defined through the Median Absolute Deviation (MAD). Here, we exclude any cell whose relative change in the number of galaxies, $r_k^i$, satisfies

$$\left| \frac{r_k^i - \text{median}(r_k)}{1.4826 \cdot \text{MAD}} \right| > 3.5$$

where i represents the ith data point and k represents the kth subset of the data. In the figure below, we have split the data into 26 subsets, equally spaced in redshift. We note that the processed data demonstrate the trend we expected, namely that photometric calibration errors affect the number counts of galaxies to a greater extent at larger redshifts.

8. Conclusions