Correlation in the MWA
US VLBI Technical Coordination Meeting
Roger Cappallo
2007.5.14

MWA Introduction
• Low frequency (80-300 MHz) array in outback of Western Australia
• 500 dual-pol. dipole tiles spread across 1.5 km of desert
• Analog signals sampled at 640 MHz, broken into coarse (~1.3 MHz) channels, of which 31 MHz are transmitted to correlator
[Block diagram labels: coarse F; fine F; PFB: fine channels, reorder, route; X, short med sum; rotate, long sum]

Large-N Considerations
• Correlator design based on ideas developed independently at CSIRO and MIT
• Joint white paper (Bunton, Cappallo, Morales, 2005) applying ideas to SKAMP and MWA
• Large-N correlator has two central problems: computation and routing
• How to bring >500K signal pairs together for multiply and add?
• Answer: Replicate (500x!) signals at as low a level as possible, in hierarchical fashion

Signal Replication
In order of increasing cost:
• local traces in FPGA (x16)
• chip-wide traces in FPGA (x8)
• multiple traces on PC board (x4)
• across a backplane
• off-board signals (e.g. multicast packets)

Temporal Mismatch
• FPGA multiplier rate ~250 MHz
• Data sample rate is 10 kHz
• Factor of 25000 mismatch!
• Answer: Time multiplex over multiple (256) station pairs, and multiple (96) frequency channels

Correlator Requirements
• Complex cross-multiply and accumulate data from 524800 signal pairs
• Each pair comprises 3072 channels with 10 kHz bandwidth
• 10 kHz bandwidth → 30 km wavelength, thus array is λ / 20 within a channel, regardless of direction!
• Max fringe-rate of 0.109 Hz for 1.5 km baseline at 300 MHz would allow dump interval of 2 s (which is v.1 integration period); 0.5 s used for solar & transients, longer baselines, and to minimize coherence loss
• No fringe rotation or delay compensation necessary in hardware!
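The figures quoted on the preceding slides can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions not stated on the slides: 1024 signal inputs (512 dual-pol tiles, consistent with the "1024 antennas" quoted for the beamformer), autocorrelations counted among the signal pairs, and the maximum fringe rate taken as Earth's rotation rate times the baseline length in wavelengths. All function and variable names are hypothetical.

```python
# Cross-check of the slide numerology (a sketch under the assumptions above).
C = 299_792_458.0        # speed of light, m/s
OMEGA_EARTH = 7.2921e-5  # Earth's sidereal rotation rate, rad/s


def signal_pairs(n_inputs: int) -> int:
    """Number of input pairs, counting each input paired with itself."""
    return n_inputs * (n_inputs + 1) // 2


def max_fringe_rate_hz(baseline_m: float, freq_hz: float) -> float:
    """Upper bound on fringe rate: rotation rate times baseline in wavelengths."""
    wavelength_m = C / freq_hz
    return OMEGA_EARTH * baseline_m / wavelength_m


# Assumption: 1024 inputs = 512 tiles x 2 polarizations.
pairs = signal_pairs(1024)

# Wavelength corresponding to a 10 kHz channel bandwidth, in km,
# and the 1.5 km array extent expressed as a fraction of it.
bandwidth_wavelength_km = C / 10e3 / 1e3
array_fraction = 1.5 / bandwidth_wavelength_km

fringe_hz = max_fringe_rate_hz(baseline_m=1500.0, freq_hz=300e6)

# Multiplier clock versus per-channel sample rate, and the
# time-multiplexing factor proposed to absorb the mismatch.
mismatch = 250e6 / 10e3
multiplex = 256 * 96

# Hierarchical replication: local x chip-wide x board-level traces.
replication = 16 * 8 * 4

print(f"signal pairs        : {pairs}")
print(f"bandwidth wavelength: {bandwidth_wavelength_km:.1f} km")
print(f"array / wavelength  : 1/{1 / array_fraction:.0f}")
print(f"max fringe rate     : {fringe_hz:.3f} Hz")
print(f"rate mismatch       : {mismatch:.0f} vs multiplex {multiplex}")
print(f"replication factor  : {replication}")
```

Under these assumptions the first three replication levels alone account for the "500x" figure (16 × 8 × 4 = 512), and the 256 × 96 multiplexing factor comes close to the factor-of-25000 rate mismatch.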
System Dataflow

Numerology
• 1 correlator board has 8 SX-35’s, each with 136 cells, which can process a total of 278528 signal pairs
• Each pair of correlator boards processes 96 channels (0.967 MHz)
• 32 board pairs required for all 3072 channels (30.94 MHz)
• Requires five 23” or six 19” AdvancedTCA shelves

Two boards cover all baselines
• 2 boards: m and n
• CMAC chips 0..7
• axis of symmetry along hypotenuse
• reverse input order to get lower diagonal half

Cells mapped onto SX-35 chip
• Separate groups of 256 antennas to X and to Y
• Uses 136 of 192 available DSP slices

Correlator cell
• 16 X & 16 Y 8-bit input values in distributed RAM, for a single point in time
• complex 4+4 bit multiply encoded into single 36-bit hardware multiply
• 18-bit adder implemented in local fabric
• short-term sums ping-pong in block RAM: 2 comp x 18 bit x 256 prod x 2 buffers = 18 Kb

Data Ordering - tnf96t512a1024
• Within a cell, 2 sets of 16 antenna samples for each time point are cross-multiplied and added into 256 short-term accumulators
• Process is repeated for 512 t points, after which the accumulator is dumped to the LTA and cleared
• Above done for each of 96 channels, then repeated for next time block, etc.

Voltage Beamformer
• Not needed for 32T system, so detailed development has not yet begun
• For each frequency channel, need to form a linear combination of all 1024 antennas: V = Σ_n a_n × g_n
• 16 dual-polarization beams are formed (but treated internally as 32 single-polarization beams)
• Computational load ~1.0 TCMAC/s, only 6% of correlator load of 16.2 TCMAC/s
• possibly will be done in routing chips (FX-20’s)

Beamformer Gains
• If every beam, baseline, and channel gain were independently specified, it would require 1 TB/s of coefficient data!
But…
• Complex gains g_n are a product of:
  - instrumental gain - slow t variation, g_i(θ,φ)
  - ionospheric phase - medium t variation, g_s(θ,φ), f^-2 freq dependence
  - geometric phase - rapid f variation (linear), g_g(θ,φ)
• Gain flow from RTS to beamformer must be efficient, taking advantage of characteristics of each term
• e.g. geometric phase could be specified as 2 numbers: phase at low f end, and increment per channel
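The two-number description of the geometric phase can be illustrated with a minimal sketch. It is not the MWA design, whose beamformer development had not begun at the time of the talk. It only shows how a phase at the low-frequency end plus a per-channel increment could be expanded into per-channel complex gains, and how those gains enter the linear combination V = Σ_n a_n × g_n. The function names, the unit-amplitude phasors, and the sign convention are all assumptions made for illustration.

```python
import cmath


def geometric_gains(phase_low: float, phase_step: float, n_channels: int) -> list[complex]:
    """Expand (phase at lowest channel, increment per channel) into unit phasors.

    Channel k receives phase phase_low + k * phase_step (radians), i.e. a
    phase that is linear in frequency across the band.
    """
    return [cmath.exp(1j * (phase_low + k * phase_step)) for k in range(n_channels)]


def beamform(samples: list[complex], gains: list[complex]) -> complex:
    """Linear combination V = sum over antennas of a_n * g_n.

    Both lists are indexed by antenna, for one channel at one time point.
    """
    return sum(a * g for a, g in zip(samples, gains))


# Two numbers per antenna and beam stand in for one coefficient per channel.
gains = geometric_gains(phase_low=0.0, phase_step=cmath.pi / 2, n_channels=4)

# Toy beam: two antennas with opposite phases cancel exactly.
cancelled = beamform([1 + 0j, 1 + 0j], [1j, -1j])

for k, g in enumerate(gains):
    print(f"channel {k}: {g.real:+.3f}{g.imag:+.3f}j")
print(f"beam sum: {abs(cancelled):.3f}")
```

With 96 channels handled per board pair, this form would replace 96 coefficients with 2 for the geometric term. The instrumental and ionospheric terms vary differently and would need their own compact descriptions.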