An Asynchronous 4-to-4 AER mapper H. Kolle Riis and Ph. Häfliger Department of Informatics, University of Oslo, Norway. Abstract. In this paper, a fully functional prototype of an asynchronous 4-to-4 Address Event Representation (AER) mapper is presented. AER is an event driven communication protocol originally used in VLSI implementations of neural networks to transfer action potentials between neurons. Often, this protocol is used for direct inter-chip communication between neuromorphic chips containing assemblies of neurons. Without an active device between two such chips, the network connections between them are hard-wired in the chip design. More flexibility can be achieved by communicating through an AER mapper: The network can freely be configured and, furthermore, several AER busses can be merged and split to form a complex network structure. We present here an asynchronous AER mapper which offers an easy and versatile solution. The AER mapper receives input from four different AER busses and redirects the input AE to four output AER busses. The control circuitry is implemented on an FPGA and is fully asynchronous, and pipelining is used to maximize throughput. The mapping is performed according to preprogrammed lookup tables, which is stored on external RAM. The mapper can emulate a network of up to 219 direct connections and test results show that the mapper can handle as much as 30 × 106 events/second. 1 Introduction The AER protocol [7] is a popular tool in neuromorphic applications. It is used to emulate direct connections in for example neural networks. It uses a high speed digital bus which events are asynchronously multiplexed onto. A unique address identifies each sender, e.g. a neuron, and the receiver is responsible for distributing this event to the correct location. Since the speed of the bus exceeds the frequency of events, very few collision occur and can be handled with a minimal delay. However, the virtual connection must be designed in hardware in advance and can not be changed during operation. So to use the AER protocol with multiple PCB’s, one needs to carefully plan the interconnections. In many applications, this is difficult, in some cases impossible. One example is evolutionary hardware, where a genetic algorithm is used to determine the configuration directly in hardware. Thus, the ability to change the connectivity of several components is crucial to test a system in real time and the need to be able to program the interconnections “on the fly” is apparent. Two research groups in Rome and Sevilla have developed devices which address this issue. The Supported by the EU 5th Framework Programme IST project CAVIAR. 2 1 0 0 1 nWE 1 0 nOE 1 0 1 0 0 1 1 0 1 0 1 0 AER_IREQ[3..0] 1 0 R2ADR[18..0] 1 0 nOE 1 0 0 1 nCE 1 0 nWE 1 0 nOE 1 0 nR2CE nCA[7] nR2OE 0 1 nCE 1 0 nWE 1 0 nOE 1 0 1 0 1 0 0 1 1 0 1 0 1 0 nR2CE nR2OE 1 0 AER_OREQ[0] 1 0 AER_OACK[0] AER_OA0[15..0] 1 0 1 0 oRAM[0] nPGRM 3−state R1ADR[15..0] 0 1 nWE 1 0 nOE 1 0 iRAM[3] nCA[11] nR1OE[3] nR2CE nCA[7] nR2OE oRAM[1] nOE 0 1 1 0 LE 1 0 nCA[12..7] nR1OE[3..0] 1 0 iLatch[3] 1 0 PGRM R1GATE[3] 0 1 nWE 1 0 nOE 1 0 iRAM[3] 1 0 nCA[11] nR1OE[3] AER_OREQ[3..0] AER_OACK[3..0] 0 1 R2DATA[15..0] AER_IACK[3..0] PGRM R1GATE[3..0] R1ADR[15..0] R1DATA[31..0] 1 0 1 0 R2DEST[7..0] nCA[8] nR1OE[0] iRAM[0] nOE 0 1 1 0 LE 1 0 AER_IREQ[3] AER_IACK[3] 0 1 AER_IA3[15..0] 1 0 FLEX10k30A−EPF208PQFP 1 0 PGRM R1GATE[0] 0 1 nWE 1 0 nOE 1 0 iRAM[0] nCA[8] nR1OE[0] 1 0 iLatch[0] AER_IREQ[0] AER_IACK[0] 0 1 AER_IA0[15..0] 0 1 0 1 nCE 0 1 nWE 1 0 nOE 1 0 R2ADR[18..0] nR2CE nCA[12] nR2OE oRAM[2] nPGRM 3−state 0 1 0 1 nOE 0 1 R1ADR[15..0] 1 0 1 0 AER_OREQ[3] 0 1 AER_OACK[3] AER_OA3[15..0] 1 0 Fig. 1. The AER mapper Rome-board [1] is a PCI-AER board and is mainly constructed to be an interface between a PC and boards that communicate with the AER protocol. It can work as an AER mapper, it can monitor the communication on an AER bus or it can send sequences of events to an AER bus to emulate a real stimuli. Though it has many nice features, it is fairly slow (5 × 106 events/sec) and it needs to be connected to a PC to operate. Like the Sevilla-board, which is a simpler, dedicated and faster AER mapper, the design is synchronous. (Both boards are under development and most of the information is based on private communication since there exists no publications to refer to.) And since the time domain is of importance in the AER protocol, e.g. information lies in the timing between successive events [6], to approach the problem from an asynchronous point of view may seem preferable. This is due to the fact that synchronous devices quantizise information in the time domain, thus vital timing information can be lost. Therefore, we introduce the asynchronous 4-to-4 AER mapper, which can easily be programmed to emulate any network of up to eight individual components. 2 Model The AER mapper is a four-to-four mapper. It receives input from four different AER input busses and redirects these addresses to four output AER busses. In this way, the device is capable of interconnecting up to eight individual chips or circuit boards, thus emulating a huge network of connections. The total amount of direct connections which can be emulated is 219 , over 0.5 million connections. The mapping is performed according to preprogrammed lookup tables, which we 3 store on external RAM. This mapping can be changed during normal operation by a separate AER cable. The process of programming is covered in section 4. Each input bus takes as input a 16 bit address and sends out an arbitrarily amount of output AE’s of the same length. The output AE’s can be sent on any of the four output busses, the output bus does not need to be the same for each AE and an output AE can be sent on several output busses at the same time. Since there are four input busses with an address space of 216 and a total of 219 possible direct connections, one input event can on average cause two output events if the mapper operates at the limit of its input capacity. However, individual inputs can cause up to 213 outputs, and inputs that cause the same set of outputs as previously programmed input, need not consume extra mapping capacity. This is possible, because the mapper is constructed with two blocks of RAM. The first block of RAM (iRAM), has a 16 bit address space (64k) of 32 bit words, where 13 bits are used to denote the number of output AE’s and 19 bits are used as a pointer to the second block of RAM (oRAM). The second block of RAM has a 19 bit address space (512k) of 24 bit words, where 16 bits are used as the output address and 4 bits for selecting output bus (4 bits not used). The mapping is performed in two main stages. A schematic of the AER mapper and the control logic on the FPGA are shown in figure 1 and 2 respectively. First, when an event is received on one of the four input AER busses, e.g. bus number 3, a request is sent to the FPGA. The request is processed and triggers LE[3] such that the incoming address is stored on an external latch (iLatch[3]). The request is then acknowledged. The input AER bus is again ready to receive an event. At the same time a request is sent to the next stage on the FPGA which grants access to iRAM. A full non-greedy arbitration is performed for all four inputs such that collisions are avoided. When the event is granted access, nR1OE[3] goes active and the data is loaded from iRAM[3] and sent to the FPGA, where it is latched and acknowledged. The first stage is now complete and a new address can be stored on iLatch[3]. The second stage uses the data from iRAM to determine which addresses to send to oRAM. A simple example illustrates the process. If the 19 bit pointer to oRAM is 1000 (DEC) and the 13 bit number 10 (DEC), the mapper will send 10 successive addresses, i.e. 1009, 1008, .. , 1001, 1000. The first address (1009) is sent to a new internal latch (mLatch) along with a request, where it is stored and acknowledged. The next address (1008) can then be calculated. The first address is then granted access to oRAM by nR2OE and the data is loaded. The data is 20 bits wide and contains the 16 bit output AER address and a 4 bit number. The number is sent back to the FPGA and stored on a latch (oLatch) and acknowledged. Thus, the second address can be latched by mLatch. The latched number determines which of the four output AER busses the output AER address is to be sent to. Thus, if the number is 0101, the AER output address is sent on AER bus 1 and 3. When this output AER address is acknowledged, the second address is granted access to oRAM and new data is loaded. This process continues until all addresses are processed. 4 R1GATE[3..0] 1 0 AER_IREQ[0] AER_IACK[0] 0 1 ireq 1 0 iack OE oreq oack 0 1 sel00 1 0 1 0 1 non 0 map 1. stage 0 greedy 1 1 0 arbcell 1 0 0 1 1 0 AER_IREQ[3] AER_IACK[3] 0 1 ireq 1 0 iack OE oreq 0 1 oack sel01 1 0 sel10 1 0 0 non 1 0 1 1 0 0 greedy 1 0 1 arbcell sel20 1 0 11 non 00 11 greedy 00 11 00 arbcell 0 1 ireq 1 0 iack oreq oack 11 00 sel21 1 0 0 1 ireq 1 0 iack 1 0 oreq oack 0 1 nCE nOE map 2. stage nCA[7] nCA[12] sel11 1 0 116ns delay 0 oreq[15..0] iack oack[15..0] 11 00 00 11 ireq 11 00 R2ADR[18..0] map midstage map 1. stage CDATA[11..8] CREQ CACK 0 1 R1DATA[31..0] nCA[15..0] 1 0 nPGRM sel00 sel20 nPGRM sel01 sel20 nPGRM sel10 sel21 nPGRM sel11 sel21 R2DEST[3..0] 0 1 ireq 1 0 oreq[3..0] iack oack[3..0]0 1 1AER_OREQ[3..0] 0 AER_OACK[3..0] hs distr4to4 nR2CE 0 1 nR2CE 1 0 nCA[8] nR1OE[0] 1 0 nCA[9] nR1OE[1] 1 0 nCA[10] nR1OE[2] 1 0 nCA[11] nR1OE[3] 1 0 hs distr4to16 CDATA[8] CDATA[9] CDATA[10] CDATA[11] CREQ 11 00 11 00 PGRM Fig. 2. FPGA schematic 3 FPGA implementation The control logic of the AER mapper is implemented on an FPGA. We have used an Altera Flex10K30A FPGA which has 208 I/O pins. It operates on a 3.3V supply. The FPGA is programmed using a MasterBlaster serial communication cable, which allows us to program the device from a PC or UNIX workstation. We have also chosen to include a second configuration device (EPC1PDIP3), which is a ROM where the final version of the FPGA design can be programmed and loaded at startup. An alternative solution in future implementations could be to use a flash card with a USB interface to a PC or UNIX workstation. The control logic of the mapper is a fully asynchronous design and is based on the work by Häfliger in [3]. Since asynchronous design is not a very common and preferable design method in FPGA implementations, and commercial FPGA’s are solely constructed for synchronous design [5], there exist no supported timing or delay elements which can be used in commercial FPGA design. Both specialized FPGA’s [4] and different methods have been developed for asynchronous implementations [2], but these remain expensive and cumbersome to use, and the extra effort and money would probably not justify the choice of asynchronous over synchronous. And since timing is a crucial part of asynchronous design, we therefore had to find a method to set a more or less fixed delay on the control signals on a common cheap off-the-shelf FPGA. The solution to the problem was to use a LCELL primitive supported by the Altera Quartus software package. The primitive is basically a buffer, with a simulated delay of approximately 2ns for one LCELL. Test results also showed that the delay was as expected with only minimal variations. Thus, we were able to create fairly accurate delay ele- 5 PGRM 0 1 clk 1 0 CDATA[7..0] nCA[5] 0 1 clk 1 0 wadr[15..8] 1 0 0 1 wadr[15..0] oe CDATA[7..0] nCA[4] wadr[7..0] wadr[18..0] 1 0 11 00 000000 111111 R1ADR[15..0] 1 0 PGRM 0 1 clk 1 0 wadr[18..16] 11 00 oe CDATA[2..0] nCA[6] R2ADR[18..0] 1 0 Fig. 3. The internal programming latches. Two tri-state busses separates the two different physical lines R1ADR[15..0] and R2ADR[18..0] ments throughout the design. Though not a failsafe way of designing, it proved to be a powerful tool in easily creating the control logic of the asynchronous AER mapper. 4 Programming A separate AER cable is used to program the RAM. It takes as input a 12 bit address (CDATA[11..0]), where 8 bits are data (CDATA[7..0]) and the remaining 4 bits (CDATA[11..8]) are used to determine what the data is to represent. CDATA[11..8] is demultiplexed by HSdist4to16, and the resulting signals (CA[12..0]) are used as clock input to internal latches (CA[6..0]) and to control the write enable of the external RAM (nCA[12..7]). The programming is not done directly, but the data is first stored on several internal latches. This is done such that the whole address and data for one entry can be stored before the RAM itself is programmed. For example, to program a set of data at one specific address in iRAM, one need to latch 48 bits internally, i.e. 16 bit address and 32 bit data, before the data can be written to the RAM. Which latch the data is stored on, determines what the data is used for. We have four internal latches for the RAM data and three more for the RAM addresses. Several tri-state buffers are used to separate the different physical lines, such that the same data can be used for both iRAM and oRAM. The programming latches for the address part are shown in figure 3. The programming is executed in MATLAB and some scripts are written to facilitate the operation. The main function is mapping=pgrmMapping(in,out,inbus,outbus); where in is the input address, out is an array with the output addresses, inbus the input bus number and outbus an array with the the output bus numbers. It returns a matrix mapping with all the mappings executed so far. The main function consists of three sub functions [r2adr,mapping]=getR2adr(in,out,inbus,outbus); 6 ta tb Fig. 4. Measurement 1: The timing of input request/acknowledge and the output request is highlighted to the right. ta is 25ns while tb is 225ns. [r,c]=size(out); pgrmRAM1(in,r2adr+r*2^19,inbus); for i=1:r pgrmRAM2(r2adr+(i-1),out(i,1),outbus(i,1)); end; First, the main function checks if the inbus and outbus are correctly specified, i.e. if they are a number from 0 to 3. Then, the first sub function, getR2adr, is called. This function loads a mapping database, mapping, where all previous mappings are stored. The database is constructed such that the column of one entry denotes which oRAM address the output AER address is stored. Each entry holds the information given to the main functions. Therefore, the mapping can easily be retrieved, it prevents the user from overwriting entries in the oRAM and, furthermore, the user do not need to be concerned with both the iRAM data and the oRAM address when programming a mapping. Based on the information retrieved from the first sub function, the two next functions are called. pgrmRAM1 takes as input the input address and the input bus number directly from the main function. The data to be stored (r2adr+r*2^19) is a pointer to oRAM (first 19 bits) plus a number r (last 13 bits) which denotes how many output AER addresses which are to be generated. pgrmRAM2 is then called for each of the output AER addresses. It takes as input the computed oRAM address, which is increased by one for each call, the output AER address and its corresponding output bus number. The programming is now complete for one input AE. 5 Results The circuit was tested with a National Instrument (NI) DAQ board (PCI-6534 High Speed Digital I/O PCI module) connected to a PC workstation. In addition, a PCI-to-AER and a 5V-to-3.3V PCB was used to connect the PCI-bus to the 7 ta tb Fig. 5. Measurement 2: The transition from different input AE is highlighted to the right. ta is 45ns and tb is 20ns. mapper board. All signals from the PCI-bus have a length of 1.6µs, independent of when the acknowledge is received. We used a Hewlett Packard 16500C Logic Analysis System to sample and plot the test data. The sampling period of the 16500C is 4ns, which is sufficient for our test measurements. For more accurate timing measurements, e.g. measuring the delay of the LCELL primitive, we used a Agilent 54624A digital oscilloscope. In figure 4, a simple mapping experiment is plotted where IREQ is the input request from the PCI-board, IACK the corresponding request, OREQ[3..0] the output requests of the four output busses and AE_OUT the output address. Three successive AE’s are received on input bus 0 and redirected to output bus 0, 3 and 2 respectively. The timing of the initial handshake is highlighted to the right. From the figure, one can see that it takes about 25ns (ta ) from the input request is acknowledged and approximately 225ns (tb ) before the output request is sent. Since the input request from the PCI-board has a fixed period of 3.2µs, we are not able to test the speed of such a one-to-one mappings directly. Thus, the effect of pipelining is not taken advantage of and the full potential of the mapper is not shown. However, if we perform a one-to-multiple mapping experiment, i.e. if one input AE results in the sending of multiple output AE’s, several features of the AER mapper can be utilized. Such an experiment is plotted in figure 5. We plot the same signals as in the previous experiment and three input AE’s are programmed to send out 120 AE’s each on different output busses. The temporal resolution of the Logic Analyzer is not high enough to plot the individual spikes, so the output AE’s are shown as black solid lines. The whole process takes roughly 13µs which means that the mapper can send just under 30×106 events/second. The transition between two of the incoming AE’s is again plotted to the right. We see that the period ta of the 360 output AE’s is approximately 45ns and that the delay tb between AE’s from the two different inputs is not more than 20 ns. This suggests that one-to-one mappings also can be processed at a 8 similar rate. It must be noted that the global delay of the signals still is about 225ns, but since this is a fixed delay forced on any of the incoming events, the timing of events are preserved throughout the system. From the same figure, we see that the output address is valid about 5-10ns before and after the output request is set. This is a safety margin which is set by using the previously described delay elements. This margin may be reduces to improve the performance of the system. If only reduced by a total of 4-5 ns, thus reducing the AE period to 40ns, the overall performance is improved by nearly 10% . 6 Conclusion An asynchronous AER mapper has been presented. It can be used as a passive or active device in a multi-node network that uses the AER protocol for inter-chip communication. Its ability to emulate complex networks structures combined with speed and robustness makes it a powerful tool for interconnecting relatively large systems. The asynchronous FPGA implementation is at the present moment not optimized for speed, but instead we have concentrated on making the control circuitry fail proof without any glitches. This is very important in asynchronous design, because any ill timed signal may set the system in a non valid state. However, several improvements are in progress which can significantly speed up the communication without compromising on robustness. Also, some improvements in handling non valid addresses are currently tested, i.e. addresses which are not programmed to be redirected. The final result of the mapper should be an easy and “plug-and-play”-like device, such that anyone who is interested in using the mapper, needs only minimal knowledge of the circuit design and would only need a higher level software script to program and prepare the mapper for operation. References 1. V. Dante and P. Del Giudice. The pci-aer interface board. 2001 Telluride Workshop on Neuromorphic Engineering Report, http://www.ini.unizh.ch/telluride/previous/report01.pdf, pp. 99-103. 2. S.Y. Tan, S.B. Furber and Wen-Fang Yen. The design of an asynchronous VHDL synthesizer. Proceedings - Design, Automation and Test in Europe, pp. 44-51, 1998. 3. P. Häfliger. Asynchronous Event Redirecting in Bio-Inspired Communication. International Conference on Electronics, Circuits and Systems, vol. 1:pp. 87-90, 2001. 4. S. Hauck and S. Burns, G. Borriello and C. Ebelingw. An FPGA for implementing asynchronous circuits. IEEE Design & Test of Computers, vol.11:pp. 60, 1994. 5. R. Payne. Asynchronous FPGA architectures. IEE Proceedings - Computers and Digital Techniques, vol. 143:pp. 282-286, 1996. 6. Wulfram Gertsner. Basic Concepts and Models: Spiking Neurons, Chapter 1. Bradford Books, 1998. 7. M. Mahowald. The Silicon Optic Nerve from An Analog VLSI System for Stereoscopic Vision, Chapter 3. Kluwer Academic Publishers, 1994.