An Asynchronous 4-to-4 AER mapper
H. Kolle Riis and Ph. Häfliger
Department of Informatics, University of Oslo, Norway.
Abstract. In this paper, a fully functional prototype of an asynchronous
4-to-4 Address Event Representation (AER) mapper is presented. AER
is an event driven communication protocol originally used in VLSI implementations of neural networks to transfer action potentials between
neurons. Often, this protocol is used for direct inter-chip communication
between neuromorphic chips containing assemblies of neurons. Without
an active device between two such chips, the network connections between them are hard-wired in the chip design. More flexibility can be
achieved by communicating through an AER mapper: The network can
freely be configured and, furthermore, several AER busses can be merged
and split to form a complex network structure. We present here an asynchronous AER mapper which offers an easy and versatile solution. The
AER mapper receives input from four different AER busses and redirects
the input AE to four output AER busses. The control circuitry is implemented on an FPGA and is fully asynchronous, and pipelining is used to
maximize throughput. The mapping is performed according to preprogrammed lookup tables, which are stored on external RAM. The mapper
can emulate a network of up to 2^19 direct connections, and test results
show that the mapper can handle as many as 30 × 10^6 events/second.
1 Introduction
The AER protocol [7] is a popular tool in neuromorphic applications. It is used
to emulate direct connections in, for example, neural networks. It uses a high
speed digital bus onto which events are asynchronously multiplexed. A unique
address identifies each sender, e.g. a neuron, and the receiver is responsible for
distributing this event to the correct location. Since the speed of the bus exceeds the frequency of events, very few collisions occur, and those that do can be handled with
a minimal delay. However, the virtual connections must be designed in hardware in advance and cannot be changed during operation. So to use the AER
protocol with multiple PCB’s, one needs to carefully plan the interconnections.
In many applications this is difficult, and in some cases impossible. One example
is evolutionary hardware, where a genetic algorithm is used to determine the
configuration directly in hardware. Thus, the ability to change the connectivity
of several components is crucial for testing a system in real time, and the need to
be able to program the interconnections “on the fly” is apparent. Two research
groups in Rome and Sevilla have developed devices which address this issue. The
Rome-board [1] is a PCI-AER board and is mainly constructed to be an interface between a PC and boards that communicate with the AER protocol. It can
work as an AER mapper, it can monitor the communication on an AER bus, or
it can send sequences of events to an AER bus to emulate real stimuli. Though
it has many nice features, it is fairly slow (5 × 10^6 events/sec) and it needs to
be connected to a PC to operate. Like the Sevilla-board, which is a simpler,
dedicated and faster AER mapper, the design is synchronous. (Both boards are
under development and most of the information is based on private communication, since there are no publications to refer to.) Since the time domain is
of importance in the AER protocol, e.g. information lies in the timing between
successive events [6], approaching the problem from an asynchronous point of
view may seem preferable: synchronous devices quantize information in the time domain, so vital timing information can be lost.
Therefore, we introduce the asynchronous 4-to-4 AER mapper, which can easily
be programmed to emulate any network of up to eight individual components.

Fig. 1. The AER mapper. [Schematic; labelled blocks and signals include the four input busses (AER_IREQ[3..0], AER_IACK[3..0], AER_IA0[15..0] to AER_IA3[15..0]), iLatch[3..0], iRAM[3..0], the FLEX10k30A FPGA, the oRAM blocks (R2ADR[18..0], R2DATA[15..0], R2DEST[7..0]) and the four output busses (AER_OREQ[3..0], AER_OACK[3..0], AER_OA0[15..0] to AER_OA3[15..0]).]

Supported by the EU 5th Framework Programme IST project CAVIAR.
2 Model
The AER mapper is a four-to-four mapper. It receives input from four different
AER input busses and redirects these addresses to four output AER busses. In
this way, the device is capable of interconnecting up to eight individual chips or
circuit boards, thus emulating a huge network of connections. The total number
of direct connections which can be emulated is 2^19, i.e. over 0.5 million connections.
The mapping is performed according to preprogrammed lookup tables, which we
store on external RAM. This mapping can be changed during normal operation
via a separate AER cable. The process of programming is covered in section 4.
Each input bus takes as input a 16 bit address and sends out an arbitrary
number of output AE’s of the same length. The output AE’s can be sent on
any of the four output busses, the output bus does not need to be the same for
each AE, and an output AE can be sent on several output busses at the same
time. Since there are four input busses with an address space of 2^16 each (2^18 input addresses in total) and a total
of 2^19 possible direct connections, one input event can on average cause two
output events if the mapper operates at the limit of its input capacity. However,
individual inputs can cause up to 2^13 outputs, and inputs that cause the same set
of outputs as a previously programmed input need not consume extra mapping
capacity. This is possible because the mapper is constructed with two blocks
of RAM. The first block of RAM (iRAM) has a 16 bit address space (64k) of
32 bit words, where 13 bits are used to denote the number of output AE’s and
19 bits are used as a pointer to the second block of RAM (oRAM). The second
block of RAM has a 19 bit address space (512k) of 24 bit words, where 16 bits
are used as the output address and 4 bits for selecting the output bus (4 bits are not
used).
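The two word formats can be illustrated with a short Python sketch. This is not the authors' code. The iRAM layout (pointer in the lower 19 bits, count in the upper 13 bits) follows the expression r2adr+r*2^19 used in section 4; the placement of the fields inside the 24 bit oRAM word is not stated in the text and is purely an assumption here, as are all function names.

```python
from typing import Tuple

POINTER_BITS = 19  # width of the oRAM pointer stored in iRAM
COUNT_BITS = 13    # width of the output-count field stored in iRAM
ADDR_BITS = 16     # width of an output AER address
BUS_BITS = 4       # one select bit per output bus


def pack_iram_word(pointer: int, count: int) -> int:
    """Build a 32 bit iRAM word as pointer + count * 2**19."""
    if not 0 <= pointer < (1 << POINTER_BITS):
        raise ValueError("pointer does not fit in 19 bits")
    if not 0 <= count < (1 << COUNT_BITS):
        raise ValueError("count does not fit in 13 bits")
    return pointer | (count << POINTER_BITS)


def unpack_iram_word(word: int) -> Tuple[int, int]:
    """Return (pointer, count) from a 32 bit iRAM word."""
    pointer = word & ((1 << POINTER_BITS) - 1)
    count = (word >> POINTER_BITS) & ((1 << COUNT_BITS) - 1)
    return pointer, count


def pack_oram_word(address: int, bus_mask: int) -> int:
    """Build a 24 bit oRAM word.

    ASSUMPTION: address in bits 15..0, bus select mask in bits 19..16,
    the four unused bits (23..20) left at zero.
    """
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 16 bits")
    if not 0 <= bus_mask < (1 << BUS_BITS):
        raise ValueError("bus mask does not fit in 4 bits")
    return address | (bus_mask << ADDR_BITS)


def unpack_oram_word(word: int) -> Tuple[int, int]:
    """Return (address, bus_mask) under the same assumed layout."""
    return word & 0xFFFF, (word >> ADDR_BITS) & 0xF
```

For instance, a pointer of 1000 with 10 outputs packs to 1000 + 10 * 2^19, and unpacking that word returns the pair (1000, 10).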
The mapping is performed in two main stages. A schematic of the AER mapper and the control logic on the FPGA are shown in figures 1 and 2, respectively.
First, when an event is received on one of the four input AER busses, e.g. bus
number 3, a request is sent to the FPGA. The request is processed and triggers
LE[3] such that the incoming address is stored on an external latch (iLatch[3]).
The request is then acknowledged. The input AER bus is again ready to receive
an event. At the same time a request is sent to the next stage on the FPGA
which grants access to iRAM. A full non-greedy arbitration is performed for all
four inputs such that collisions are avoided. When the event is granted access,
nR1OE[3] goes active and the data is loaded from iRAM[3] and sent to the
FPGA, where it is latched and acknowledged. The first stage is now complete
and a new address can be stored on iLatch[3].
The second stage uses the data from iRAM to determine which addresses to
send to oRAM. A simple example illustrates the process. If the 19 bit pointer
to oRAM is 1000 (DEC) and the 13 bit number is 10 (DEC), the mapper will
send 10 successive addresses, i.e. 1009, 1008, ..., 1001, 1000. The first address
(1009) is sent to a new internal latch (mLatch) along with a request, where it
is stored and acknowledged. The next address (1008) can then be calculated.
The first address is then granted access to oRAM by nR2OE and the data is
loaded. The data is 20 bits wide and contains the 16 bit output AER address
and a 4 bit number. The number is sent back to the FPGA and stored on a
latch (oLatch) and acknowledged. Thus, the second address can be latched by
mLatch. The latched number determines which of the four output AER busses
the output AER address is to be sent to. Thus, if the number is 0101, the AER
output address is sent on AER bus 1 and 3. When this output AER address is
acknowledged, the second address is granted access to oRAM and new data is
loaded. This process continues until all addresses are processed.
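The two-stage lookup described above can be modelled in software as follows. This Python sketch only mirrors the data flow (iRAM word, descending oRAM addresses, one output per oRAM entry) and ignores the handshaking, latching and arbitration entirely. The dictionaries standing in for the RAM blocks, the function name and the choice to return nothing for an unprogrammed input are all assumptions made for illustration. The 4 bit bus number is passed through undecoded, since the sketch does not rely on a particular bit-to-bus assignment.

```python
from typing import Dict, List, Tuple

POINTER_MASK = (1 << 19) - 1  # lower 19 bits of the iRAM word


def map_event(
    in_bus: int,
    in_addr: int,
    iram: Dict[Tuple[int, int], int],
    oram: Dict[int, Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return the output events for one input event, in emission order.

    iram maps (input bus, 16 bit input address) to a 32 bit word
    (pointer + count * 2**19); oram maps a 19 bit address to a pair
    (16 bit output address, 4 bit bus number).
    """
    word = iram.get((in_bus, in_addr))
    if word is None:
        # ASSUMPTION: an unprogrammed input produces no output.
        return []
    pointer = word & POINTER_MASK
    count = word >> 19
    events = []
    # First stage done; the second stage walks the oRAM block from
    # pointer + count - 1 down to pointer, as in the 1009..1000 example.
    for oram_addr in range(pointer + count - 1, pointer - 1, -1):
        events.append(oram[oram_addr])
    return events
```

With the pointer 1000 and the count 10 from the example, the function reads oRAM addresses 1009 down to 1000 and returns ten events in that order.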
Fig. 2. FPGA schematic. [Schematic of the FPGA control logic; labelled blocks include the map 1. stage blocks, the non-greedy arbiter cells (arbcell), map midstage, a 116ns delay element, map 2. stage, and the handshake distributors hs distr4to4 and hs distr4to16.]
3 FPGA implementation
The control logic of the AER mapper is implemented on an FPGA. We have
used an Altera Flex10K30A FPGA which has 208 I/O pins. It operates on a 3.3V
supply. The FPGA is programmed using a MasterBlaster serial communication
cable, which allows us to program the device from a PC or UNIX workstation. We
have also chosen to include a second configuration device (EPC1PDIP3), which
is a ROM where the final version of the FPGA design can be programmed and
loaded at startup. An alternative solution in future implementations could be to
use a flash card with a USB interface to a PC or UNIX workstation.
The control logic of the mapper is a fully asynchronous design and is based on
the work by Häfliger in [3]. Asynchronous design is not a common or preferred design method in FPGA implementations, and commercial FPGA’s
are constructed solely for synchronous design [5], so there exist no supported timing
or delay elements which can be used in commercial FPGA design. Both specialized FPGA’s [4] and different methods have been developed for asynchronous
implementations [2], but these remain expensive and cumbersome to use, and the
extra effort and money would probably not justify the choice of asynchronous
over synchronous. Since timing is a crucial part of asynchronous design, we
had to find a method to set a more or less fixed delay on the control
signals on a common, cheap, off-the-shelf FPGA. The solution to the problem was
to use the LCELL primitive supported by the Altera Quartus software package.
The primitive is basically a buffer, with a simulated delay of approximately 2 ns
for one LCELL. Test results also showed that the delay was as expected, with
only minimal variations. Thus, we were able to create fairly accurate delay elements throughout the design. Though not a failsafe way of designing, it proved
to be a powerful tool in easily creating the control logic of the asynchronous
AER mapper.

Fig. 3. The internal programming latches. Two tri-state busses separate the two different physical lines R1ADR[15..0] and R2ADR[18..0]. [Schematic; signal labels include PGRM, CDATA[7..0], CDATA[2..0], nCA[4], nCA[5], nCA[6], wadr[7..0], wadr[15..8], wadr[18..16], R1ADR[15..0] and R2ADR[18..0].]
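As a rough illustration of how the LCELL delay elements described in this section could be sized, the Python sketch below converts a target delay into a number of chained LCELL buffers. It assumes that the delays of chained cells simply add and that routing delay can be neglected; neither is stated in the text, and the function names are hypothetical. Only the figure of approximately 2 ns per LCELL comes from the paper.

```python
import math

LCELL_DELAY_NS = 2.0  # approximate delay of one LCELL, as reported in the text


def lcells_for_delay(target_ns: float, per_cell_ns: float = LCELL_DELAY_NS) -> int:
    """Smallest number of chained LCELLs whose nominal delay reaches target_ns.

    ASSUMPTION: cell delays add linearly and routing delay is ignored.
    """
    if target_ns < 0 or per_cell_ns <= 0:
        raise ValueError("delays must be non-negative and per-cell delay positive")
    return math.ceil(target_ns / per_cell_ns)


def chain_delay_ns(n_cells: int, per_cell_ns: float = LCELL_DELAY_NS) -> float:
    """Nominal delay of a chain of n_cells LCELLs under the same assumption."""
    return n_cells * per_cell_ns
```

Under these assumptions a 10 ns margin would need five cells. In practice the achieved delay would have to be checked by measurement, as the authors did with an oscilloscope.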
4 Programming
A separate AER cable is used to program the RAM. It takes as input a 12
bit address (CDATA[11..0]), where 8 bits are data (CDATA[7..0]) and the remaining 4 bits (CDATA[11..8]) are used to determine what the data is to represent. CDATA[11..8] is demultiplexed by HSdist4to16, and the resulting signals
(CA[12..0]) are used as clock input to internal latches (CA[6..0]) and to control
the write enable of the external RAM (nCA[12..7]).
The programming is not done directly, but the data is first stored on several
internal latches. This is done such that the whole address and data for one entry
can be stored before the RAM itself is programmed. For example, to program a
set of data at one specific address in iRAM, one needs to latch 48 bits internally,
i.e. a 16 bit address and 32 bit data, before the data can be written to the RAM.
Which latch the data is stored on, determines what the data is used for. We have
four internal latches for the RAM data and three more for the RAM addresses.
Several tri-state buffers are used to separate the different physical lines, such
that the same data can be used for both iRAM and oRAM. The programming
latches for the address part are shown in figure 3.
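The byte-wise programming scheme can be sketched in Python as below: each 12 bit programming word carries a 4 bit selector and one data byte, and six bytes (48 bits) are latched before a final word triggers the write. The text does not give the selector assignment, so the values used here are assumptions loosely based on the signal names in the figures (nCA[4] and nCA[5] near the address latches, nCA[8] to nCA[11] near iRAM[0] to iRAM[3]). The byte order, the data value sent with the write strobe and all function names are likewise hypothetical.

```python
from typing import List

# ASSUMED selector values; the paper does not list them explicitly.
SEL_DATA = (0, 1, 2, 3)   # four data latches, least significant byte first
SEL_ADDR_LO = 4           # address latch for bits 7..0
SEL_ADDR_HI = 5           # address latch for bits 15..8
SEL_WRITE_IRAM_BASE = 8   # write strobe for iRAM[n] is 8 + n


def cdata_word(selector: int, data: int) -> int:
    """Pack CDATA[11..8] (selector) and CDATA[7..0] (data) into 12 bits."""
    if not 0 <= selector < 16:
        raise ValueError("selector must fit in 4 bits")
    if not 0 <= data < 256:
        raise ValueError("data must fit in 8 bits")
    return (selector << 8) | data


def iram_program_sequence(in_bus: int, address: int, word: int) -> List[int]:
    """Programming words that latch a 16 bit address and 32 bit data, then write."""
    if not 0 <= in_bus <= 3:
        raise ValueError("input bus must be 0..3")
    if not 0 <= address < (1 << 16) or not 0 <= word < (1 << 32):
        raise ValueError("address or data out of range")
    seq = [
        cdata_word(SEL_ADDR_LO, address & 0xFF),
        cdata_word(SEL_ADDR_HI, (address >> 8) & 0xFF),
    ]
    for i, sel in enumerate(SEL_DATA):
        seq.append(cdata_word(sel, (word >> (8 * i)) & 0xFF))
    # ASSUMPTION: the data byte accompanying the write strobe is ignored.
    seq.append(cdata_word(SEL_WRITE_IRAM_BASE + in_bus, 0))
    return seq
```

One iRAM entry thus costs seven programming events in this sketch: two for the address, four for the data and one to trigger the write.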
The programming is executed in MATLAB, and some scripts are written to
facilitate the operation. The main function is
mapping=pgrmMapping(in,out,inbus,outbus);
where in is the input address, out is an array with the output addresses, inbus
the input bus number and outbus an array with the output bus numbers.
It returns a matrix mapping with all the mappings executed so far.
The main function consists of three sub functions:
% look up the mapping database and obtain the oRAM address r2adr
[r2adr,mapping]=getR2adr(in,out,inbus,outbus);
% r is the number of output AE's
[r,c]=size(out);
% iRAM data: oRAM pointer plus the number of outputs shifted by 19 bits
pgrmRAM1(in,r2adr+r*2^19,inbus);
% one oRAM entry per output AE, at consecutive addresses
for i=1:r
  pgrmRAM2(r2adr+(i-1),out(i,1),outbus(i,1));
end;

Fig. 4. Measurement 1: The timing of input request/acknowledge and the output request is highlighted to the right. ta is 25 ns while tb is 225 ns.
First, the main function checks if inbus and outbus are correctly specified,
i.e. if they are numbers from 0 to 3. Then, the first sub function, getR2adr,
is called. This function loads a mapping database, mapping, where all previous
mappings are stored. The database is constructed such that the column of an
entry denotes the oRAM address at which the output AER address is stored. Each
entry holds the information given to the main function. Therefore, the mapping
can easily be retrieved, the user is prevented from overwriting entries in the oRAM
and, furthermore, the user does not need to be concerned with both the iRAM
data and the oRAM address when programming a mapping.
Based on the information retrieved from the first sub function, the next two
functions are called. pgrmRAM1 takes as input the input address and the input bus
number directly from the main function. The data to be stored (r2adr+r*2^19)
is a pointer to oRAM (the lower 19 bits) plus a number r (the upper 13 bits) which denotes
how many output AER addresses are to be generated.
pgrmRAM2 is then called for each of the output AER addresses. It takes as
input the computed oRAM address, which is increased by one for each call, the
output AER address and its corresponding output bus number. The programming is now complete for one input AE.
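The bookkeeping performed by the programming scripts can be approximated by the Python sketch below. It is not a translation of the authors' MATLAB code: the class MappingTable, the sequential allocation of oRAM addresses and the returned write lists are assumptions chosen to show how a database can keep new mappings from overwriting earlier oRAM entries. Only the iRAM data expression (pointer plus count times 2^19) and the consecutive oRAM addresses are taken from the text.

```python
from typing import List, Tuple

ORAM_SIZE = 1 << 19   # number of oRAM entries
MAX_COUNT = (1 << 13) - 1  # largest value of the 13 bit count field


class MappingTable:
    """Hypothetical stand-in for the mapping database kept by getR2adr."""

    def __init__(self) -> None:
        self.next_free = 0  # ASSUMPTION: oRAM is filled sequentially from 0
        self.entries: List[Tuple[int, int, int, int]] = []

    def allocate(self, in_bus: int, in_addr: int, count: int) -> int:
        """Reserve count consecutive oRAM addresses and return the first."""
        if not 0 <= in_bus <= 3:
            raise ValueError("input bus must be 0..3")
        if not 0 < count <= MAX_COUNT:
            raise ValueError("count must fit in 13 bits")
        if self.next_free + count > ORAM_SIZE:
            raise ValueError("oRAM is full")
        r2adr = self.next_free
        self.next_free += count
        self.entries.append((in_bus, in_addr, r2adr, count))
        return r2adr


def program_mapping(
    table: MappingTable,
    in_addr: int,
    outs: List[int],
    in_bus: int,
    out_buses: List[int],
) -> Tuple[Tuple[int, int, int], List[Tuple[int, int, int]]]:
    """Return the iRAM write and the list of oRAM writes for one input AE."""
    if len(outs) != len(out_buses):
        raise ValueError("outs and out_buses must have the same length")
    if any(not 0 <= b <= 3 for b in out_buses):
        raise ValueError("output bus numbers must be 0..3")
    r = len(outs)
    r2adr = table.allocate(in_bus, in_addr, r)
    iram_write = (in_bus, in_addr, r2adr + r * 2**19)
    oram_writes = [(r2adr + i, out, bus)
                   for i, (out, bus) in enumerate(zip(outs, out_buses))]
    return iram_write, oram_writes
```

Because each call advances next_free by the number of outputs, a second mapping starts directly after the first and cannot overlap it.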
5 Results
The circuit was tested with a National Instruments (NI) DAQ board (PCI-6534
High Speed Digital I/O PCI module) connected to a PC workstation. In addition,
a PCI-to-AER and a 5V-to-3.3V PCB were used to connect the PCI-bus to the
mapper board. All signals from the PCI-bus have a length of 1.6 µs, independent
of when the acknowledge is received. We used a Hewlett Packard 16500C Logic
Analysis System to sample and plot the test data. The sampling period of the
16500C is 4 ns, which is sufficient for our test measurements. For more accurate
timing measurements, e.g. measuring the delay of the LCELL primitive, we used
an Agilent 54624A digital oscilloscope.

Fig. 5. Measurement 2: The transition between different input AE's is highlighted to the
right. ta is 45 ns and tb is 20 ns.
In figure 4, a simple mapping experiment is plotted, where IREQ is the input
request from the PCI-board, IACK the corresponding acknowledge, OREQ[3..0] the
output requests of the four output busses and AE_OUT the output address. Three
successive AE’s are received on input bus 0 and redirected to output bus 0,
3 and 2 respectively. The timing of the initial handshake is highlighted to the
right. From the figure, one can see that it takes about 25 ns (ta) from the input
request until it is acknowledged and approximately 225 ns (tb) before the output request
is sent. Since the input request from the PCI-board has a fixed period of 3.2 µs,
we are not able to test the speed of such one-to-one mappings directly. Thus,
the effect of pipelining is not taken advantage of and the full potential of the
mapper is not shown.
However, if we perform a one-to-multiple mapping experiment, i.e. if one input AE results in the sending of multiple output AE’s, several features of the
AER mapper can be utilized. Such an experiment is plotted in figure 5. We
plot the same signals as in the previous experiment, and three input AE’s are
programmed to send out 120 AE’s each on different output busses. The temporal resolution of the Logic Analyzer is not high enough to plot the individual
spikes, so the output AE’s are shown as black solid lines. The whole process
takes roughly 13 µs, which means that the mapper can send just under 30 × 10^6
events/second.
The transition between two of the incoming AE’s is again plotted to the
right. We see that the period ta of the 360 output AE’s is approximately 45 ns
and that the delay tb between AE’s from the two different inputs is not more
than 20 ns. This suggests that one-to-one mappings also can be processed at a
similar rate. It must be noted that the global delay of the signals still is about
225 ns, but since this is a fixed delay imposed on all incoming events, the
timing of events is preserved throughout the system.
From the same figure, we see that the output address is valid about 5-10 ns
before and after the output request is set. This is a safety margin which is set
by using the previously described delay elements. This margin may be reduced
to improve the performance of the system. If it is reduced by a total of only 4-5 ns,
thus reducing the AE period to 40 ns, the overall performance is improved by
nearly 10%.
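The arithmetic behind these figures can be reproduced with a few lines of Python. The sketch uses only numbers quoted in this section (3 × 120 output events in roughly 13 µs, and an AE period reduced from 45 ns to 40 ns); the function names are hypothetical, and since the measured values are approximate the results are rough.

```python
def events_per_second(n_events: int, duration_s: float) -> float:
    """Average event rate over a burst of n_events lasting duration_s seconds."""
    return n_events / duration_s


def relative_rate_gain(old_period_ns: float, new_period_ns: float) -> float:
    """Fractional increase in event rate when the period is shortened."""
    return old_period_ns / new_period_ns - 1.0


def relative_period_reduction(old_period_ns: float, new_period_ns: float) -> float:
    """Fractional reduction of the period itself."""
    return (old_period_ns - new_period_ns) / old_period_ns


burst_rate = events_per_second(3 * 120, 13e-6)        # about 27.7e6 events/s
rate_gain = relative_rate_gain(45.0, 40.0)            # 0.125
period_cut = relative_period_reduction(45.0, 40.0)    # about 0.111
```

The burst works out to roughly 27.7 × 10^6 events/second, consistent with "just under 30 × 10^6". Shortening the period from 45 ns to 40 ns cuts it by about 11%, which is of the same order as the improvement quoted above.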
6 Conclusion
An asynchronous AER mapper has been presented. It can be used as a passive or
active device in a multi-node network that uses the AER protocol for inter-chip
communication. Its ability to emulate complex network structures, combined
with speed and robustness, makes it a powerful tool for interconnecting relatively
large systems.
The asynchronous FPGA implementation is at present not optimized for speed; instead, we have concentrated on making the control circuitry fail proof and free of glitches. This is very important in asynchronous
design, because any ill-timed signal may put the system in an invalid state.
However, several improvements are in progress which can significantly speed up
the communication without compromising robustness. Also, some improvements in handling invalid addresses, i.e. addresses which
are not programmed to be redirected, are currently being tested. The final result should be
an easy-to-use, “plug-and-play”-like device, such that anyone who is interested in
using the mapper needs only minimal knowledge of the circuit design and
only a higher level software script to program and prepare the mapper for
operation.
References
1. V. Dante and P. Del Giudice. The PCI-AER interface board. 2001 Telluride Workshop on Neuromorphic Engineering Report,
http://www.ini.unizh.ch/telluride/previous/report01.pdf, pp. 99-103.
2. S.Y. Tan, S.B. Furber and Wen-Fang Yen. The design of an asynchronous VHDL
synthesizer. Proceedings - Design, Automation and Test in Europe, pp. 44-51, 1998.
3. P. Häfliger. Asynchronous Event Redirecting in Bio-Inspired Communication. International Conference on Electronics, Circuits and Systems, vol. 1:pp. 87-90, 2001.
4. S. Hauck, S. Burns, G. Borriello and C. Ebeling. An FPGA for implementing
asynchronous circuits. IEEE Design & Test of Computers, vol. 11:pp. 60, 1994.
5. R. Payne. Asynchronous FPGA architectures. IEE Proceedings - Computers and
Digital Techniques, vol. 143:pp. 282-286, 1996.
6. Wulfram Gerstner. Basic Concepts and Models: Spiking Neurons, Chapter 1. Bradford Books, 1998.
7. M. Mahowald. The Silicon Optic Nerve from An Analog VLSI System for Stereoscopic Vision, Chapter 3. Kluwer Academic Publishers, 1994.