The SHARC
Super Harvard Architecture Computer
Clare Smtih SHARC Presentation

The SHARC
• Developed by Analog Devices.
• Optimized for demanding DSP and imaging applications.
• 32-bit floating point, with 40-bit extended floating point capabilities.
• Large on-chip memory.
• Ideal for scalable multi-processing applications.

Harvard Architecture
• Program memory can store data.
• Able to simultaneously read or write data at one location and fetch instructions from another place in memory.
• 2 buses:
  1. Data memory bus.
  2. Program bus.
• Either two separate memories or a single dual-port memory.

Super Harvard Architecture
• Many processors employ a Harvard Architecture by having two separate memories or caches integrated into the processor chip.
• The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data. This is what makes it SUPER!

DSP
• Digital Signal Processor.
• Requires high-speed, low-overhead data movement and rapid computations.
• Usually has a small on-board ROM, RAM and a single-cycle multiply.
• Designed to run single-line, serial-in, serial-out signal processing applications very fast.

DSP Computations
• The inner product of two vectors is a common computation for determining energy or correlation.
• The following C code is an example:
    for (n = 0; n < length; n++)
        result += x[n] * y[n];
• The processor with the lowest instruction time for this loop will have the best performance.

SHARC DSP
• The SHARC incorporates features aimed at optimizing such loops:
  • High-speed floating point capability
  • Extended floating point
• These features are DSP specific: when applied to a non-DSP application, performance may not be as good.

Floating Point and Extended Floating Point
• The SHARC supports floating point, extended floating point and non-floating point.
• No additional clock cycles for floating point computations.
• Data is automatically truncated and zero padded when moved between 32-bit memory and internal registers.
• Not accurate enough for scientific algorithms, but the signal-to-noise ratio is excellent.

SHARC’s Internal Memory
• Makes the SHARC unique.
• Size
  • Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.
  • Memory size is significantly larger than that of most other high-speed computational devices.
• Dual-block, dual-port
  • Optimizes the Harvard Architecture by allowing instructions to be fetched while data memory accesses are performed.

Multiply and Accumulate Instructions on the SHARC
• Like most DSPs, the SHARC can compute a product and add it to a running total in a single clock cycle.
• The SHARC’s super instruction can multiply and accumulate while adding, subtracting, or averaging data in two other registers.
• These instructions give the SHARC its 120 megaflop rating.

Zero Overhead Looping on the SHARC
• A single instruction outside the loop performs loop set-up, informing the SHARC that a loop is approaching.
• The instruction also includes the iteration count and termination condition.
• This keeps the pipeline full during loop execution and allows the termination condition to be tested in parallel.

DAGs on the SHARC
• Data Address Generators (DAGs) are integer computation units that manage the indexing of registers.
• A DAG allows the SHARC to fetch a value and update the index value.
• If the updated value exceeds a limit, the DAG adjusts the index so that it wraps.
• This occurs in the same clock cycle as the read or write.

DAG Capabilities
• Circular Buffering
  • Rather than actually moving data in and out of a vector, circular buffers are used.
  • By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.
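The wrap-around index update described above can be sketched in portable C. This is only an illustration under stated assumptions: the `circ_buf` type and the `circ_push` and `circ_get` functions are hypothetical names, not part of any SHARC toolchain, and the modulo that the DAG performs in hardware during the access is written out explicitly here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical delay-line type; the names are illustrative only. */
typedef struct {
    float  *data;    /* backing storage holding `length` samples */
    size_t  length;  /* buffer length (the limit at which the index wraps) */
    size_t  index;   /* next write position, i.e. the oldest sample */
} circ_buf;

/* Replace the oldest sample with the newest one, then advance the
 * index modulo the buffer length. No other samples are moved. */
void circ_push(circ_buf *cb, float sample)
{
    assert(cb->length > 0);
    cb->data[cb->index] = sample;
    cb->index = (cb->index + 1) % cb->length;
}

/* Read the sample written `age` pushes ago (0 = newest). */
float circ_get(const circ_buf *cb, size_t age)
{
    assert(age < cb->length);
    return cb->data[(cb->index + cb->length - 1 - age) % cb->length];
}
```

A FIR filter, for example, would call `circ_push` once per incoming sample and then read the last N samples with `circ_get`. On a general-purpose processor the modulo costs extra instructions on every access, which is the overhead the DAG is said to remove.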
• Bit Reverse Addressing
  • The bit pattern of a vector index is reversed.
  • Done automatically by the SHARC.
  • Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.

SHARC DSP
• What makes the SHARC unique?
  – It also has some features not directly related to optimizing numeric computations:
    • Pipelining
    • Handling branches
• Why has this not emerged sooner?
  – The technology that makes it economical to integrate general single computing devices has only recently become available.

SHARC’s Pipeline
• 3 stages:
  1. Instruction fetch
  2. Decode
  3. Execution
• It takes three clock cycles for an instruction to propagate through the pipeline.
• The processor's execution speed is one instruction per clock cycle, even though each instruction requires three clock cycles.

SHARC’s Handling of Branches: Delayed Branching
• When a branch instruction is encountered, the two instructions that have already been loaded and decoded are executed before the branch is taken.
• This keeps the pipeline full and avoids discarding those two instructions and reloading the pipeline.
• Beneficial in situations such as loops of only a few instructions, where the ratio of wasted clock cycles to instructions is significant.

SHARC’s Handling of Branches: Non-delayed Branching
• Traditional branching.
• If the code cannot be reordered to use delayed branching, non-delayed branching saves space.
• Uses only one word of storage.
• However, it takes three cycles because the pipeline must be reloaded.

Multi-processing
• The SHARC is uniquely equipped for multiprocessing.
• Its link ports provide very powerful multiprocessing capabilities.
• Two main program models, depending on the application.
• Adapts well to different multi-processing architectures.

Multi-processing: SHARC Links
• The SHARC has 6 link ports that can transport data at rates of up to 40 Mbytes/sec.
• Links are designed for point-to-point connections.
• Data can be transmitted in either direction, but not in both simultaneously.

Multi-processing Program Model: MIMD
• Multiple instruction, multiple data.
• Good for applications that require multiple instruction threads to execute concurrently.
• Processors operate individually.
• Each processor executes different code.
• Typically used for image reconstruction and multi-channel DSP.

Multi-processing Program Model: SIMD
• Single instruction, multiple data.
• Works best when all processors execute identical instruction sequences.
• Does not require overhead for inter-processor synchronization.
• Typically used for synthetic aperture radar and automatic target recognition.

Multi-processing Architectures: Cluster Design
• Groups of up to 6 SHARCs in a cluster.
• The most common way of joining multiple SHARCs.
• All processors, global I/O and global memory are connected to a common “cluster bus.”
• Each SHARC can “drive” the bus.

Multi-processing Architectures: Mesh Design
• All SHARCs are joined by their link ports and connected to a common bus.
• In SIMD mode, a single master SHARC drives the bus.
• In MIMD mode, the mesh architecture cannot function if the data is larger than the available on-chip memory.
• Scales advantageously over a wider range of applications.

Summary of What Makes the SHARC Super
• It performs excellently for DSP applications.
• It employs a Harvard Architecture with very large on-chip memory.
• Respectable megaflop rating.
• Its multiprocessing capabilities.

How Optimal Is the SHARC for Non-DSP Applications?
• It is clearly geared toward DSP applications.
• While it may fare better than other processors, it is still behind those designed specifically for non-DSP applications.
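Backup: Bit Reverse Addressing in Software
As a closing illustration of the bit reverse addressing listed under DAG Capabilities, the sketch below shows what a processor without that hardware support would have to compute for an FFT of 2^bits points. It is written in portable C under stated assumptions: `bit_reverse` and `bit_reverse_permute` are hypothetical helper names, not SHARC instructions or library calls.

```c
#include <assert.h>

/* Reverse the low `bits` bits of `index` (with 3 bits: 001 -> 100). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Reorder a vector of 2^bits samples into bit-reversed order in place.
 * Each pair (i, j) is swapped only once, when i < j. */
void bit_reverse_permute(float *x, unsigned bits)
{
    assert(bits < 8 * sizeof(unsigned));
    unsigned n = 1u << bits;
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, bits);
        if (i < j) {
            float tmp = x[i];
            x[i] = x[j];
            x[j] = tmp;
        }
    }
}
```

With 3 bits, the indices 0 to 7 map to 0, 4, 2, 6, 1, 5, 3, 7. The software loop spends several instructions per index, whereas the slides describe the SHARC producing the reversed address automatically.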