A Framework for Evaluating Programming Models for Embedded CMP Systems
Niraj Shah, Mel Tsai
CS252 Final Project, 4/27/2000

Overview
- Motivation
- Target Architectures
- Programming Model
- Software Environment
- Applications
- Preliminary Results and Conclusions
- Future Work

CS252 Final Project (Spring 2000)

Motivation
- Embedded multiprocessor systems are different from their general-purpose (GP) counterparts
- Interprocess communication can be very cheap
- Communication architecture is tailored to the application
- Desirable not to have a heavy OS or a large library to handle communication
- Efficiently programming these systems in a high-level language (HLL) is an absolute necessity
- How do we evaluate the machine abstraction that is presented to the programmer?

Target Architectures
[Figure: several processing elements (PEs) and a memory system; PE detail showing a register file, FUs, SFUs, and an instruction cache. Thanks, Scott.]
- Instructions to perform communication operations
- Some simplifying assumptions

Programming Model
- Language specification is a simplified subset of MPI
- Single Program Multiple Data (SPMD) execution model
- Separate address spaces for each process
- Bind each process to a distinct PE
- Communication primitives: blocking/non-blocking sends and receives
- MPI programming model:
  MPI_Send(*data_location, data_length, type, destination_PE, tag_identifier, MPI_COMM_WORLD);
- Mescal programming model:
  Mescal_Send(data_length, *data_location, destination_PE);
- How do we evaluate the programming model?
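
The send/receive semantics listed on the Programming Model slide can be illustrated with a small, single-threaded toy model. This is only a sketch under stated assumptions, not the MESCAL implementation: `toy_send`, `toy_recv`, `Mailbox`, `NUM_PES`, and `MAX_BYTES` are hypothetical names, each PE is given a one-slot mailbox, and the point at which a real blocking call would wait is reported with a `-1` return value instead. The send keeps the three-argument shape of `Mescal_Send` (length, location, destination PE) and copies the payload, since processes in separate address spaces cannot share pointers.

```c
#include <string.h>

#define NUM_PES   4    /* hypothetical number of processing elements */
#define MAX_BYTES 64   /* hypothetical per-message capacity */

/* One single-slot mailbox per PE, standing in for the communication hardware. */
typedef struct {
    int full;                       /* 1 while an unread message is waiting */
    int length;                     /* payload size in bytes */
    unsigned char data[MAX_BYTES];  /* private copy of the payload */
} Mailbox;

static Mailbox mailbox[NUM_PES];    /* zero-initialized: all slots start empty */

/* Hypothetical send with the slide's three-argument shape.
 * Returns 0 on success, or -1 for bad arguments or when the slot is
 * occupied (the point where a real blocking send would wait). */
int toy_send(int data_length, const void *data_location, int destination_pe)
{
    if (destination_pe < 0 || destination_pe >= NUM_PES) return -1;
    if (data_length < 0 || data_length > MAX_BYTES) return -1;

    Mailbox *m = &mailbox[destination_pe];
    if (m->full) return -1;

    /* Copy the bytes: with separate address spaces, no pointers are shared. */
    memcpy(m->data, data_location, (size_t)data_length);
    m->length = data_length;
    m->full = 1;
    return 0;
}

/* Hypothetical matching receive for PE `my_pe`.
 * Returns the number of bytes delivered, or -1 for bad arguments, an empty
 * slot (where a real blocking receive would wait), or a too-small buffer. */
int toy_recv(int max_length, void *data_location, int my_pe)
{
    if (my_pe < 0 || my_pe >= NUM_PES) return -1;

    Mailbox *m = &mailbox[my_pe];
    if (!m->full || m->length > max_length) return -1;

    memcpy(data_location, m->data, (size_t)m->length);
    m->full = 0;                    /* slot is free for the next send */
    return m->length;
}
```

In this model, a process bound to one PE would call `toy_send(sizeof value, &value, 2)` and the process bound to PE 2 would call `toy_recv(sizeof out, &out, 2)`. Compared with the six-argument MPI call, the datatype, tag, and communicator are fixed by convention, which is one way a "simplified subset" could reduce per-message overhead.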

Software Environment
- Augmented the IMPACT framework (single PE) to target CMPs
- Compiler
  - Generates optimized code for each PE
  - Understands our programming model
  - Generates code to use our hardware

Trace Simulator
[Figure: tool flow. Recoverable labels: MPI C, *.c + MPI, compiler, *.X_im_p, emulator generator, *.c + probes, gcc, "probed" executable, input data, trace data, machine description, MP simulator, simulation data.]

Application - JPEG
[Figure: JPEG encode/decode across five processes: a splitter (process 1), encode/decode (processes 2, 3, and 4), and a combiner (process 5).]

Application - Network Routing
- Based on the MIT Click Modular Router
- Translated to C (from C++) by the MESCAL team
- CRACK (Click Rapidly Adapted to C-Kode)
- Built router kernel from CRACK "Elements"

Parallelized CRACK
[Figure: five processes: InfiniteSource (process 1), CheckIPHeader (process 2), GetIPAddress (process 3), LookupIPRoute (process 4), and Port 0, Port 1, …, Port n (process 5). Idle cycle times annotated on the figure: 12.1%, 48.8%, 49.3%, 88%, 0%.]

Preliminary Conclusions
- Need a scheme to better parallelize (load-balance) applications
- Need a way of overlapping computation and communication (i.e., non-blocking)
- Extensible framework is useful for exploring different programming models
- Allows for quantitative analysis of the effect of communication primitives

Future Work
- Get more detailed numbers from parallelized CRACK
- Implement non-blocking sends and receives
- Map multiple processes to a single PE
- Performance evaluation of different programming models for an application set
- Support dynamic process creation
- Incorporate microarchitectural simulation of communication instructions
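
The case for non-blocking sends and receives can be made concrete with a back-of-the-envelope cycle model. The sketch below is an idealization, not output from the trace simulator: the function names and the per-item costs are hypothetical, each item is assumed to need a fixed number of compute cycles followed by a fixed number of communication cycles, and contention and startup overheads are ignored. With blocking sends the two costs add for every item; with an idealized non-blocking send, the transfer of one item overlaps the computation of the next, so the slower of the two activities paces the loop.

```c
/* Cycles to process n items when each send blocks until it completes:
 * computation and communication are strictly serialized. */
long blocking_cycles(long n, long compute, long comm)
{
    if (n <= 0) return 0;
    return n * (compute + comm);
}

/* Cycles when the send of item i overlaps the computation of item i+1
 * (idealized non-blocking send, one transfer in flight at a time). */
long overlapped_cycles(long n, long compute, long comm)
{
    if (n <= 0) return 0;
    long step = compute > comm ? compute : comm;  /* slower activity sets the pace */
    /* First item's compute, then n-1 overlapped steps, then the last send. */
    return compute + (n - 1) * step + comm;
}
```

For example, with hypothetical costs of 10 compute cycles and 6 communication cycles per item, four items take `blocking_cycles(4, 10, 6)` = 64 cycles but `overlapped_cycles(4, 10, 6)` = 46 cycles. For a single item the two models agree, since there is nothing to overlap.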