Full Design

DESIGN CONCEPTS
• The main idea behind this design was to create an architecture capable of run-time load balancing, so that N-body type problems can be processed more efficiently.
• The design consists of the following units:
– Feeder FIFO
– DC Units
– LJPC Units
– FLEX Units
– Inverter Units
– Main Controller
• All of these units are discussed in this presentation. The primary units responsible for implementing the load balancing are the FLEX units, which are managed by the main controller. The FLEX units are special-purpose processing elements capable of performing either the DC or the LJPC computations. This allows them to be used in whichever mode currently needs the most assistance.
• The main controller works with the Feeder FIFO and the Inverter Units to determine the most efficient use of the FLEX processors. It does this by monitoring FIFO fill levels, which give the controller an indication of the congestion in the current area. Since LJPC calculations are only performed on atoms within a specified radius, if the majority of atoms lie outside this radius, most of the computation time will be spent on DC calculations and the Inverter's FIFO fill level will be small, and vice versa.

DC UNIT DESIGN
• This unit is responsible for computing the distance squared between two atoms.
• X1, Y1, and Z1 represent the 3-dimensional coordinate vector of the source atom. X2, Y2, and Z2 represent the 3-dimensional coordinate vector of each of the other atoms in turn. This value changes every clock cycle, whereas the source atom remains constant.
• Once the distance squared has been calculated, it is compared with a cut-off radius (CR). If the radius is within tolerance, the data is sent through to the inverter. Otherwise, it is discarded.
• The unit is represented in hardware by a 1-to-1 mapping, allowing for maximum throughput.
Since not all atoms will be within tolerance, it is important to perform as many distance calculations as possible in as short a time as possible in order to improve the efficiency of the LJPC units.

LJPC UNIT DESIGN
• This unit is responsible for performing a portion of the Lennard-Jones Potential Calculation.
• 1/R² represents the inverted radius squared that is received from the inverter.
• DX, DY, and DZ are the results of the initial subtraction in the DC Unit and are passed along with the radius squared.
• FX, FY, and FZ represent the accumulated 3-dimensional force vector. POTENTIAL ENERGY represents the accumulated potential energy.
• The hardware design is not a 1-to-1 mapping as was done with the DC Unit. Instead, the design is broken into two portions, each containing three operational steps, plus a separate accumulation step. This allows the unit to receive new data after completing only three operational steps rather than all seven.
• Data may be received into this unit for up to six clock cycles at a time. If the two corresponding FLEX units are also in LJPC mode, they are fed while this unit completes its first 3-step cycle. This minimizes the delay in receiving data from the inverter.

FEEDER FIFO
• This unit is responsible for fetching data from the onboard block RAM modules and feeding this data to each of the units, including the FLEX units if they are operating in DC mode.
• Load balancing is also achieved within this unit. All six block RAM modules send their data into the controller, which then determines which unit to send the data to. Since the DC units will always empty their data units first, they may then be used to assist in emptying any remaining units. If a FLEX unit has spent a large portion of its time in LJPC mode, a large amount of data may be left in its block RAM module. This data may then be fed to either of the DC units, allowing both sides to balance out.
• The determination of which block RAM will be taken over by a DC unit is made by the main controller and is based on the amount of data remaining in each unit. Whichever unit is currently fullest is the best candidate. Therefore, if one side has produced more atoms within the radius tolerance, the two DC units may end up working on both available block RAMs on one side, even though data may be available on both sides.

INVERTER
• This unit is responsible for retrieving data from the DC units and from FLEX units working in DC mode, inverting the radius received, and then sending that data on to all available LJPC units and FLEX units operating in LJPC mode.
• The controller uses a polling routine to check each of the three FIFOs, each 1K in size, to determine whether data is available. This is done on a priority basis, with the DC unit having the highest priority. Once valid data has been received by any of the 3 units, it is taken in and sent to the inverter.
• Once the radius has been inverted, the data is sent to the LJPC FIFO. Data accumulates here until it reaches a specified level, which is currently set at 128 out of 1K. Since the LJPC unit cannot accept data at all times, if the level rises above this point, the data is transferred to the next FLEX FIFO. The level of each FLEX FIFO is read every clock cycle to determine whether the corresponding FLEX unit should switch modes from DC to LJPC. The unit will not switch back until the FIFO has reached another specified level. Currently, the FLEX unit switches from DC to LJPC when the level is greater than 128, but does not switch back until the level is below 32.
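
The two-threshold switching rule described above is a form of hysteresis. The following Python sketch models only that rule and is not the controller's actual logic: the names `Mode`, `FlexModeController`, `update`, and `trace_modes`, and the one-call-per-clock-cycle convention, are illustrative assumptions. The thresholds of 128 and 32 are the values quoted in the text.

```python
from enum import Enum


class Mode(Enum):
    DC = "DC"
    LJPC = "LJPC"


class FlexModeController:
    """Hypothetical software model of one FLEX unit's mode-switch rule."""

    def __init__(self, high=128, low=32):
        self.high = high      # switch DC -> LJPC when level > high
        self.low = low        # switch LJPC -> DC when level < low
        self.mode = Mode.DC   # assumption: units start in DC mode

    def update(self, fifo_level):
        """Call once per clock cycle with the current FLEX FIFO level."""
        if self.mode is Mode.DC and fifo_level > self.high:
            self.mode = Mode.LJPC
        elif self.mode is Mode.LJPC and fifo_level < self.low:
            self.mode = Mode.DC
        # Between the two thresholds the previous mode is kept.
        return self.mode


def trace_modes(levels, high=128, low=32):
    """Return the mode name chosen for each FIFO level in sequence."""
    ctrl = FlexModeController(high, low)
    return [ctrl.update(level).value for level in levels]
```

Keeping the two thresholds apart prevents a unit from toggling modes on every cycle when the FIFO level hovers around a single value. The sketch ignores the BUSY signal, which prevents a FLEX unit from switching to DC mode while it is busy in LJPC mode.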

FLEX
• Flexible unit that can operate in either DC or LJPC mode
• Using Force-Directed scheduling, the architecture shown was derived
– Pure data flow architecture: no conditional branching necessary
– No register files
– Instructions control MUX select lines and adder/subtractor modes
– Delays of 1, 2, 5, 6, 9, 10, 11, and 35 clock cycles are implemented where needed
• There is a one-to-one relationship between arithmetic units in the DC module and in the FLEX module, allowing for identical behavior
• In LJPC mode, the FLEX processor must reuse resources, causing it to cycle through periods of accepting inputs and of processing those inputs

FLEX
The FLEX instruction word takes care of
– Control of the MUXes that feed data to the arithmetic units at each clock cycle
– Control of the mode of each adder/subtractor at each clock cycle
– 4 control signals
  – FW_WE allows data to be written to the force accumulation registers
  – PE_WE allows data to be written to the potential energy accumulation register
  – JUMP causes the program counter to be reset to 0
  – BUSY indicates the program is busy in LJPC mode and cannot be switched to DC mode

Preliminary results show a roughly linear relationship between:
1. Atoms processed and time taken
2. Percentage of atoms within the cutoff radius and time taken
An inverse relationship occurs between time taken and the FIFO threshold.
Average data levels on both input and output FIFOs follow an exponential curve with respect to the percentage of atoms within the cutoff radius.
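
To make the two computations that a FLEX unit switches between more concrete, the following Python sketch models them in software. It is an illustration under stated assumptions rather than the hardware data path: the function names, the subtraction order (source minus other), the comparison against the squared cut-off radius, the skipping of coincident atoms, and the parameters `epsilon` and `sigma` are not given in the slides, and the standard 12-6 form of the Lennard-Jones potential is assumed.

```python
def dc_step(source, other, cutoff_radius):
    """DC mode: distance squared followed by the cut-off test.

    Returns (dx, dy, dz, r2) when the pair is within the cut-off,
    otherwise None (the data is discarded).
    """
    dx = source[0] - other[0]   # assumption: source minus other
    dy = source[1] - other[1]
    dz = source[2] - other[2]
    r2 = dx * dx + dy * dy + dz * dz
    # Assumption: r2 is compared with CR squared, and coincident atoms
    # are skipped so the inverter never divides by zero.
    if r2 == 0.0 or r2 > cutoff_radius * cutoff_radius:
        return None
    return dx, dy, dz, r2


def ljpc_step(inv_r2, dx, dy, dz, acc, epsilon=1.0, sigma=1.0):
    """LJPC mode: accumulate force and potential energy from 1/R^2."""
    s2 = sigma * sigma * inv_r2   # (sigma/r)^2
    s6 = s2 * s2 * s2             # (sigma/r)^6
    s12 = s6 * s6                 # (sigma/r)^12
    scale = 24.0 * epsilon * (2.0 * s12 - s6) * inv_r2
    acc["fx"] += scale * dx
    acc["fy"] += scale * dy
    acc["fz"] += scale * dz
    acc["pe"] += 4.0 * epsilon * (s12 - s6)
    return acc


def process_source(source, others, cutoff_radius, epsilon=1.0, sigma=1.0):
    """Run DC -> inverter -> LJPC for one source atom against all others."""
    acc = {"fx": 0.0, "fy": 0.0, "fz": 0.0, "pe": 0.0}
    for other in others:
        hit = dc_step(source, other, cutoff_radius)
        if hit is None:
            continue
        dx, dy, dz, r2 = hit
        inv_r2 = 1.0 / r2         # the inverter stage
        ljpc_step(inv_r2, dx, dy, dz, acc, epsilon, sigma)
    return acc
```

Note that the LJPC step needs only 1/R² and the three deltas, never a square root, which is consistent with the DC unit passing the distance squared rather than the distance.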

FUTURE WORK
• System performance can be enhanced through several optimizations, including
– Redesigning the FIFOs to pipe data to all LJPC and applicable FLEX units more efficiently
– Improving the logic used by the controller to determine the mode of each FLEX unit
– Porting the controller logic to software for use on a PowerPC (there are 2 PowerPC cores available on our Xilinx Virtex-4 FPGA)
– Modifying the FLEX processor architecture to ensure an optimal number of computational units and optimal usage of these units