HPCS Productivity Benchmarks Working Group
SSCA #3
Sensor Processing Knowledge Formation and Data I/O
Serial v1.0
MIT Lincoln Laboratory
January 4, 2007
999999-1 XYZ 4/8/2015 MIT Lincoln Laboratory

Outline
• Scalable Synthetic Compact Applications
• SSCA #3
  – Overview
  – Quick Recipe Data I/O Mode
• Implementation and Results

Scalable Synthetic Compact Applications
Building on a motivation slide from Fred Johnson (15 January 2004)
Goals
• Identify which dimensions must be examined at full complexity and which can be examined at reduced scale, while providing understanding of both full applications today and future applications
[Figure: application size/complexity versus system size/complexity, with regions labeled Micro BMKs, HPCS Compact Apps, Full Apps, and NextGen Apps]

HPCS Benchmark Spectrum
[Figure: HPCS benchmark spectrum. Panels: HPCchallenge Benchmarks (Local: DGEMM, STREAM, RandomAccess, 1D FFT; Global: Linpack, PTRANS, RandomAccess, 1D FFT); HPCS Spanning Set of Kernels (Discrete Math, Graph Analysis, Linear Solvers, Signal Processing, Simulation, I/O); Scalable Synthetic Compact Applications, each drawn as a Data Generator plus numbered Kernels, including SSCA #3 (1. Image Formation, 2. Image Storage, 3. Image Retrieval, 4. Target ID); Mission Partner Application Benchmarks covering Existing Applications (Current: UM2000, GAMESS, OVERFLOW, LBMHD, RFCTH, HYCOM; Near-Future: NWChem, ALEGRA, CCSM), Emerging Applications, and Future Applications. Other labels: System Bounds, Execution Performance Bounds, Execution Performance Indicators, Execution and Development Performance Indicators, Micro & Commercial Applications Kernel Benchmarks, Medical Imaging, Astronomical Image Processing, Environmental Monitoring, Multi-Physics, Reconnaissance, Intelligence, Signal Processing Knowledge Formation.]

Overview
• SSCA #3 focuses on two stages:
  – Front-end image processing and storage (Stage 1)
  – Back-end image retrieval and knowledge formation (Stage 2)
• It is representative of many areas:
  – Medical imaging (e.g., tumor growth): image many patients daily; later compare images of the same patient over time
  – Astronomical image processing (e.g., monitoring supernovae): image many regions of the sky daily; later compare images of a region over time
  – Reconnaissance monitoring (e.g., enemy movement): image many areas daily; later compare images of a given region over time

Overview
• Benchmark stresses computation, communication, and data I/O
• Can be run in 3 modes:
  – System Mode: a combination of Compute and Data I/O Modes
  – Compute Mode (minimized Data I/O Mode)
  – Data I/O Mode (minimized Compute Mode)
• Principal performance goal is throughput
  – Maximize the rate at which answers are generated
  – May overlap operation of data I/O and compute kernels
  – Data I/O and compute kernels may run on different systems
  – Some data is required to be contiguous

SSCA #3 – System Mode
[Figure: System Mode data flow. Stage 1, Front-End Sensor Processing: Scalable Data and Template Generator, Kernel #1 Data Read and Image Formation, Template Insertion, Kernel #2 Image Storage. Stage 2, Back-End Knowledge Formation: Kernel #3 Image Retrieval, Kernel #4 Detection, Validation. Data items: raw complex data, coefficients, groups of templates, template positional indices, SAR images, grid of images, image pairs, detection sub-images, detections. The blocks are divided into Computation and Data I/O, annotated: "Community has traditionally focused on Computation … but Data I/O performance is increasingly important".]

SSCA #3 – Compute Mode
[Figure: Compute Mode data flow. Sensor Processing: Scalable Data and Template Generator, Kernel #1 Image Formation, Template Insertion, Kernel #2 Image Storage. Knowledge Formation: Kernel #3 Image Retrieval, Kernel #4 Detection, Validation. Data exchanged through files: raw SAR data files, groups of template files, SAR image files, sub-image detection files.]

SSCA #3: Compute Mode Challenges
[Figure: Front-End Sensor Processing (Scalable Data and Template Generator, Kernel #1 Image Formation, Template Insertion) and Back-End Knowledge Formation (Kernel #4 Detection, Validation), annotated with the challenges below]
• Scalable synthetic data generation
• Pulse compression
• Polar interpolation
• FFT, IFFT (corner turn)
• Sequential store
• Non-sequential retrieve
• Large & small I/O
• Large image difference & threshold
• Many small correlations on selected pieces of a large image

SSCA #3 – Data I/O Mode
[Figure: Data I/O Mode data flow. Stage 1, Front-End: Scalable Data and Template Generator, Kernel #1 Data Read and Image Formation, Kernel #2 Image Storage. Stage 2, Back-End: Kernel #3 Image Retrieval, Kernel #4. Data items: large complex data, groups of small data, images, grid of images, image pairs, sub-images.]

Ingredients
To run Data I/O Mode, the user only needs to set: 1) SCALE, 2) N_SDG_GROUPS, and 3) the grid dimensions, where:
• SCALE = a parameter that sets the size of the raw input data and of the image. It should be set so that these are a significant fraction of a single processor’s memory.
• N_SDG_GROUPS = number of groups of raw input data and templates. It should be set large enough to avoid disk cache effects.
• The number of images in the grid is GRID_SIDE_SIZE x GRID_SIDE_SIZE x AV_GRID_DEPTH.
[Figure: image grid with sides GRID_SIDE_SIZE x GRID_SIDE_SIZE]

Ingredients
Parameters to Code:
• PICTURE_SIZE = GRID_SIDE_SIZE^2 is the number of images in a picture
• EST_TOT_GRID_SIZE = PICTURE_SIZE x AV_GRID_DEPTH is the total number of times that the input data will be retrieved, and the total number of images stored to the grid
• mc x n is the size of the raw complex-valued input data
    mc = 2 x ceil(80 x SCALE)
    n = 2 x ceil(158.496 x SCALE + 60)
• ROTATION_STEP is the templates’ rotation angle increment in degrees
• nDistinctLetters x nDistinctRotations is the total number of pixelated templates
    nDistinctLetters = number of least correlated letters in the alphabet (21)
    nDistinctRotations = number of ROTATION_STEP angles between 0 and 360 degrees
• FONT_SIZE x FONT_SIZE = size of a single template in pixels

Ingredients
Parameters to Code (Cont.):
• m x nx = size of an image
    m = 2 x ceil(mc / 0.8405246)
    k1n = 8.3776 x (1.5 - 1/n)
    kxmin = sqrt(70.1841812 - 6.3165469 x (m/mc)^2)
    kxmax = sqrt(4 x k1n^2 - 25.2661877 x (1/mc)^2)
    nx = 2 x ceil(20 x SCALE x (kxmax - kxmin) / pi) + 20
• nSubImages = floor(pOccupancy x p2ndNot1st x (m / (SARLOBE_DISTANCE x FONT_SIZE)) x (nx / (SARLOBE_DISTANCE x FONT_SIZE))) is the number of smaller images to be stored (by the last kernel), where:
    pOccupancy = 0.5 is the probability of template occupancy
    p2ndNot1st = 0.5 is the probability that a template appears in the second image but not in the first
• Total memory required, in bytes =
    N_SDG_GROUPS x (8 x mc x n + 4 x nDistinctLetters x nDistinctRotations x FONT_SIZE^2)
    + EST_TOT_GRID_SIZE x (4 x m x nx + 4 x nSubImages x (4 x FONT_SIZE)^2)
    + (coefficients, support and verification parameters; stored once)
• Grows with SCALE^2

Directions
SDG
• Create a group
  – Create a random single precision complex valued (large) mc x n matrix
  – Store the data
  – Create a random real valued (small) FONT_SIZE x FONT_SIZE matrix
  – Store the small matrix nDistinctLetters x nDistinctRotations times
• Copy the above group N_SDG_GROUPS times
STAGE 1
for iImage = 1 to EST_TOT_GRID_SIZE
  KERNEL 1
  – Randomly pick and retrieve one of the N_SDG_GROUPS groups
  – Create a random single precision real valued m x nx matrix
  KERNEL 2
  – Randomly select i and j values in the range [1, GRID_SIDE_SIZE] and use these to create a filename
  – Store the image matrix
end

Directions
STAGE 2
for iImageSeq = 1 to PICTURE_SIZE
  – Randomly select i and j values in the range [1, GRID_SIDE_SIZE]
  – Find the grid depth at this particular point
  for k = 1 to gridPointDepth-2
    KERNEL 3
    – Retrieve a pair of images, and an SDG group of templates
    KERNEL 4
    for l = 1 to nSubImages
      – Create a random (4 x FONT_SIZE) x (4 x FONT_SIZE) matrix
      – Store the sub image
    end
  end
end

SSCA #3 Serial Release v1.0
Types of Data I/O Implemented:
• FWRITE: binary, IEEE floating point with appropriate big- or little-endian byte ordering and 32-bit data type
• HDF5: HDF5 32-bit float format
Modes:
• System Mode
  – Includes both Compute (SAR Processing) and Data I/O Modes.
• Compute Mode
  – Dials the smallest possible Grid of 2 images, thus minimizing data I/O.
• Data I/O Mode
  – Generates random data, thus forgoing SAR processing.
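The sizing formulas on the Ingredients slides can be evaluated directly before a run, to check how much storage a given choice of SCALE and grid will need. The following is a minimal Python sketch (the release itself is MATLAB), not the release's own code. The function name, argument names, and returned dictionary are illustrative assumptions. FONT_SIZE, SARLOBE_DISTANCE, and the number of distinct rotations are passed in because the slides do not give their values. The "stored once" term for coefficients, support, and verification parameters is left out because the slides do not quantify it.

```python
import math


def ssca3_data_io_sizes(scale, n_sdg_groups, grid_side_size, av_grid_depth,
                        font_size, sarlobe_distance, n_distinct_rotations,
                        n_distinct_letters=21, p_occupancy=0.5,
                        p_2nd_not_1st=0.5):
    """Evaluate the Data I/O Mode sizing formulas (hypothetical helper)."""
    picture_size = grid_side_size ** 2
    est_tot_grid_size = picture_size * av_grid_depth

    # Raw complex-valued input data: mc x n
    mc = 2 * math.ceil(80 * scale)
    n = 2 * math.ceil(158.496 * scale + 60)

    # Image: m x nx
    m = 2 * math.ceil(mc / 0.8405246)
    k1n = 8.3776 * (1.5 - 1 / n)
    kxmin = math.sqrt(70.1841812 - 6.3165469 * (m / mc) ** 2)
    kxmax = math.sqrt(4 * k1n ** 2 - 25.2661877 * (1 / mc) ** 2)
    nx = 2 * math.ceil(20 * scale * (kxmax - kxmin) / math.pi) + 20

    # Number of sub-images stored by the last kernel
    cell = sarlobe_distance * font_size
    n_sub_images = math.floor(
        p_occupancy * p_2nd_not_1st * (m / cell) * (nx / cell))

    # Bytes: 8 per single-precision complex value, 4 per single-precision real
    sdg_bytes = n_sdg_groups * (
        8 * mc * n
        + 4 * n_distinct_letters * n_distinct_rotations * font_size ** 2)
    grid_bytes = est_tot_grid_size * (
        4 * m * nx + 4 * n_sub_images * (4 * font_size) ** 2)

    return {
        "mc": mc, "n": n, "m": m, "nx": nx,
        "n_sub_images": n_sub_images,
        # Excludes the once-stored coefficients/support/verification term.
        "total_bytes": sdg_bytes + grid_bytes,
    }
```

Calling `ssca3_data_io_sizes(scale=1, n_sdg_groups=2, grid_side_size=2, av_grid_depth=3, font_size=16, sarlobe_distance=4, n_distinct_rotations=12)` evaluates the formulas for a small grid. The last three argument values are placeholders chosen for illustration, not values from the specification.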
Outputs metrics at each level in the system’s hierarchy:
  – Levels: Kernels, Stages, and Overall SSCA #3
  – Metrics: bytes, seconds, bandwidth (bytes/sec)

SSCA #3 Serial Release v1.0
• One of many possible implementations
• Over 2200 lines of well-commented MATLAB code, with carefully picked functional breakdown, data structures, variable names, and comments
• Coding standard: modified “Programming in C++, Rules and Recommendations” by Mats Henricson and Erik Nyquist of Ellemtel Telecommunication System Laboratories, 1990-1992
• Development tools used
  – MATLAB Version 7.1.0.246 (R14) Service Pack 3 (version required)
  – Octave Version 2.9.5
  – Pentium® 4 2.66 GHz CPU with 1.00 GB of RAM and 2.5 GB of virtual RAM, running MS Windows XP Professional Version 2002 Service Pack 1
  – A dedicated dual-processor hyperthreaded P4 Xeon, 2.8 GHz, ½ MB cache, GNU/Linux 2.4.20-28.9 (Red Hat 9)
• Accompanying documentation:
  – Written Specification, and these slides
  – MANIFEST.txt: list of files with brief description
  – README.txt: installation and run time instructions; code overview
  – RELEASE_NOTES.txt: known outstanding issues in current release

SSCA #3 Release v1.0a

Summary
Challenges:
• Large scale parallel two-dimensional (2D) Inverse Fast Fourier Transform (IFFT); may require a ‘corner turn’ or a ‘gather scatter’ (depending on architecture), with large quantities of data. Polar interpolation is known to be even more computationally intense than IFFT (Kernel 1).
• Streaming image data storage to a data I/O device (write) may involve large block data transfers, storing one large image after another (Kernel 2).
• Random location image sequence retrieval from a data I/O device (read), also involving large quantities of data, with possibly stressful spatial or temporal memory access patterns, and locality issues (Kernel 3).
• Small data I/O in all four kernels. Large data I/O in three of the four kernels.
• Many small convolutions on random pieces of a large image (Kernel 4).
Status:
• Written and MATLAB Executable Specification v1.0 released June 22, 2006
• Architecture of Data I/O Mode: Martha Bancroft of Shomo Tech Systems, and Jeremy Kepner
• Works with Octave 2.9.5
• Written Specification, SAR Editor: Glenn Schrader, MIT Lincoln Laboratory
• C version based on release v1.0a (unofficial): Meng-Ju of UMD, and Janice Onanian McMahon of USC/ISI

SSCA #3 Backup Slides

SSCA #3 Specification
• Intent
• Overview
• Compute Mode Main Components
  – Synthetic Scalable Data Generator
  – Kernel 1: SAR Image Formation
  – Template Insertion
  – Kernel 4: Detection
  – Validation
• Data I/O Mode Main Components
  – Kernel 1: Large & Small Data Retrieval
  – Image Grid
  – Kernel 2: Image Storage
  – Kernel 3: Image Retrieval
  – Kernel 4: Small Image Storage

The Vision: Scalable Synthetic Compact Applications
• Bridge the gap between scalable synthetic kernel benchmarks and (non-scalable) real applications, and become an important benchmarking tool
• Is representative of real application workloads while not being numerically rigorous
  – memory access characteristics
  – communications characteristics
  – I/O characteristics
• Multi-processor compact application, designed to be easily scalable and verifiable
• No limits on the distribution to vendors and universities
• SSCAs represent a wide spectrum of potential HPCS Mission Partner applications

Executable Specification
What is an Executable Specification:
• It implements the Written Specification, illustrating all specified properties; it is just one of many possible implementations
• It provides developers further insight into the corresponding Written Specification
• It is a tool with which developers can validate their own work
• It includes a serial version, and may include one or more approaches to a parallel version
• It must be easily readable and intelligible, through its choice of functional structure, variable names, comments, and supporting documentation
Structure:
• Scalable Data Generator: creates synthetic data that can be scaled to stress any computer from a single workstation to a petascale multiprocessor
• Kernels: timed computational algorithms
• Verification: checks the correctness of select results
• Validation: validates the resulting solution

SSCA #3 – Compute Only Mode
[Figure: Compute Only Mode data flow. Sensor Processing: Scalable Data and Template Generator, Kernel #1 Image Formation, Template Insertion, Kernel #2 Image Storage. Knowledge Formation: Kernel #3 Image Retrieval, Kernel #4 Detection, Validation. Data exchanged through files: raw SAR data files, groups of template files, SAR image files, sub-image detection files.]

Spotlight SAR

Compute Mode – SAR Overview
• Radar captures echo returns from a ‘swath’ on the ground
• Notional linear FM chirp pulse train, plus two ideally non-overlapping echoes returned from different positions on the swath
[Figure labels: Synthetic Aperture, L; Fixed to Broadside]
• Summation and scaling of echo returns realizes a challengingly long antenna aperture along the flight path
[Figure: spotlight SAR geometry with axes Range, X = 2X0, and Cross-Range, Y = 2Y0, showing the pulses and the swath. The annotated equation gives the received ‘raw’ SAR signal s(t,u) as a sum over swath positions (n,m) of delayed copies of the transmitted SAR waveform, each weighted by a reflection coefficient scale factor that differs for each return from the swath.]

Scalable Synthetic Data Generator
• Generates synthetic raw SAR complex data
• Data size is scalable to enable rigorous testing of high performance computing systems
  – User defined scale factor determines the size of images generated
• Generates ‘templates’ that consist of rotated and pixelated capitalized letters
[Figure: Spotlight SAR Returns, with axes Range and Cross-Range]

Kernel 1 – SAR Image Formation
[Figure: Spatial Frequency Domain Interpolation. Processing chain: s(t,u) → Fourier Transform → s(w,ku) → Matched Filtering with s0*(w,ku) → Interpolation (kx = sqrt(4k^2 - ku^2), ky = ku) → F(kx,ky) → Inverse Fourier Transform → f(x,y). Inset: received samples fit a polar swath in the (kx, ky) plane; processed samples fit a rectangular swath. Panel: Spotlight SAR Reconstruction, with axes Range, Pixels and Cross-Range, Pixels.]

Template Insertion (not timed)
• Inserts rotated pixelated capital letter templates into each SAR image
  – Non-overlapping locations and rotations
  – Randomly selects 50%
  – Used as ideal detection targets in Kernel 4
[Figure: two panels with axes X Pixels and Y Pixels: Hypothetical 100% Insertion of Templates; Image Inserted with only 50% Random Templates]

Kernel 4 – Detection
• Detects targets in SAR images
  1. Image difference
  2. Threshold
  3. Sub-regions
  4. Correlate with every template; max is target ID
• Computationally difficult
  – Many small correlations over random pieces of a large image
• Requires 100% recognition and no false alarms, including objects that cross distributed memory boundaries
[Figure: Image A, Image B, Image Difference, Thresholded, Sub-region, Correlated]

Scalable Synthetic Data Generator
• Generates large complex data, and groups of small data.
• Writes a ‘dialed’ number of large complex data to external memory.
• For each large data, it writes a group of small data to external memory.
• Single precision
• Not timed
[Figure: Scalable Data Generator producing Large Complex Data and Associated Groups of Small Data, which feed Kernel #1]

Kernel 1 – Data Retrieval
• Reads one randomly chosen large complex data from external memory at each Stage 1 pass.
• Also reads the associated group of small data from external memory at each Stage 1 pass.
• Generates a single precision random image (of the size dialed by SCALE).
• I/O is timed
[Figure: Stage 1, Front-End. Kernel #1 Data Read takes Large Complex Data and Associated Groups of Small Data and outputs an Image]

Image Grid
• The external memory image Grid is accessed by Kernels 2 & 3.
• It is scalable by image size and number of images.
• Image size requires a non-trivial amount of memory.
• Intended for dealing with enormous quantities of data, with simultaneous reads and writes.
[Figure: Image Grid with sides GRID_SIDE_SIZE x GRID_SIDE_SIZE; image grid, shown scaled to 80 images]

Kernel 2 – Image Storage
• Writes a different image to a random location in the external memory on the Grid at each Stage 1 pass.
• Images may be stored together, or in separate pieces (to allow simultaneous reading/writing of the same image).
• I/O is timed
• Computes filenames and addresses, and writes streaming data to random locations on the Grid at each Stage 1 Front-End processing pass.
[Figure: Stage 1, Front-End. Kernel #2 Image Storage writes an Image into the Images in Grid]

Kernel 3 – Image Retrieval
[Figure: Stage 2, Back-End. Kernel #3 Image Retrieval reads Images in Grid (N_grid x N_grid images, each N_image x N_image) and a group of small data (Templates), and outputs an Image Pair]
• From a random location in the Grid, it computes the address of an image sequence and reads that sequence a pair of images at a time until it reaches the sequence’s full depth, at each Stage 2 pass.
• An image sequence is read through the Grid’s entire depth.
• Also reads a group of small data at each Stage 2 pass.
• I/O is timed

Kernels 2 and 3
Additional notes:
• A scheme that is optimal for data storage may not be optimal for data retrieval, and vice versa.
• “Read behind Write” is allowed.
[Figure: Kernel 2 Image Output; Kernel 3 Image Pair Input]

Kernel 4 – Small Image
• Writes labeled sub-images. This is repeated for each image pair, at each grid point, at each Stage 2 pass.
• I/O is timed
[Figure: Stage 2, Back-End. Kernel #4 Small Image Output takes an Image pair and writes Sub-Images]
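The Data I/O Mode recipe from the Directions slides (SDG, then Stage 1 with Kernels 1 and 2, then Stage 2 with Kernels 3 and 4) can be sketched end to end as below. This is a minimal Python illustration under stated assumptions, not the release's MATLAB code. The names `IoMeter`, `rand_f32`, and `run_data_io_mode` are invented for the sketch, as are the file naming scheme, the choice to keep each group's templates in one file, and the use of native byte order. Sizes such as `mc`, `n`, `m`, `nx`, and `n_sub_images` are passed in directly.

```python
import os
import random
import shutil
import time
from array import array


class IoMeter:
    """Accumulates bytes moved and seconds spent in timed I/O calls."""

    def __init__(self):
        self.nbytes = 0
        self.seconds = 0.0

    def write(self, path, data):
        start = time.perf_counter()
        with open(path, "wb") as f:
            data.tofile(f)  # raw 32-bit floats, native byte order (assumption)
        self.seconds += time.perf_counter() - start
        self.nbytes += len(data) * data.itemsize

    def read(self, path):
        data = array("f")
        start = time.perf_counter()
        with open(path, "rb") as f:
            data.frombytes(f.read())
        self.seconds += time.perf_counter() - start
        self.nbytes += len(data) * data.itemsize
        return data

    def bandwidth(self):
        """Bytes per second over all calls timed so far."""
        return self.nbytes / self.seconds if self.seconds > 0 else float("inf")


def rand_f32(count, rng):
    """Random single-precision values standing in for real data."""
    return array("f", (rng.random() for _ in range(count)))


def run_data_io_mode(root, mc, n, m, nx, font_size, n_templates, n_sub_images,
                     n_sdg_groups, grid_side_size, av_grid_depth, seed=0):
    rng = random.Random(seed)
    meters = {k: IoMeter() for k in ("kernel1", "kernel2", "kernel3", "kernel4")}
    path = lambda name: os.path.join(root, name)  # hypothetical flat layout

    # SDG (not timed): create one group, then copy it for the other groups.
    sdg = IoMeter()
    sdg.write(path("raw_0.bin"), rand_f32(2 * mc * n, rng))  # complex = 2 floats
    sdg.write(path("tpl_0.bin"), rand_f32(n_templates * font_size ** 2, rng))
    for g in range(1, n_sdg_groups):
        shutil.copyfile(path("raw_0.bin"), path(f"raw_{g}.bin"))
        shutil.copyfile(path("tpl_0.bin"), path(f"tpl_{g}.bin"))

    # STAGE 1: EST_TOT_GRID_SIZE passes.
    depth = {}  # images stored so far at each grid point (i, j)
    for _ in range(grid_side_size ** 2 * av_grid_depth):
        g = rng.randrange(n_sdg_groups)            # KERNEL 1: retrieve a group
        meters["kernel1"].read(path(f"raw_{g}.bin"))
        meters["kernel1"].read(path(f"tpl_{g}.bin"))
        image = rand_f32(m * nx, rng)              # random m x nx image
        i = rng.randint(1, grid_side_size)         # KERNEL 2: store the image
        j = rng.randint(1, grid_side_size)
        depth[(i, j)] = depth.get((i, j), 0) + 1
        meters["kernel2"].write(path(f"img_{i}_{j}_{depth[(i, j)]}.bin"), image)

    # STAGE 2: PICTURE_SIZE passes.
    for _ in range(grid_side_size ** 2):
        i = rng.randint(1, grid_side_size)
        j = rng.randint(1, grid_side_size)
        # Loop bound k = 1 .. gridPointDepth-2, as written on the slide.
        for k in range(1, depth.get((i, j), 0) - 1):
            meters["kernel3"].read(path(f"img_{i}_{j}_{k}.bin"))      # KERNEL 3
            meters["kernel3"].read(path(f"img_{i}_{j}_{k + 1}.bin"))
            meters["kernel3"].read(path(f"tpl_{rng.randrange(n_sdg_groups)}.bin"))
            for l in range(1, n_sub_images + 1):                      # KERNEL 4
                sub = rand_f32((4 * font_size) ** 2, rng)
                meters["kernel4"].write(path(f"sub_{i}_{j}_{k}_{l}.bin"), sub)
    return meters
```

Each returned meter exposes `nbytes`, `seconds`, and `bandwidth()`, matching the bytes, seconds, and bandwidth metrics the release reports per kernel. Stage-level and overall figures could be obtained by summing the meters. To avoid disk cache effects, a real run would need sizes far larger than a quick test would use.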