Block RAM

COE 405
Programmable Logic and
Storage Devices
Dr. Aiman H. El-Maleh
Computer Engineering Department
King Fahd University of Petroleum & Minerals
Outline

History of Computational Fabrics

ASIC vs. FPGA

Reconfigurable Logic

Anti-Fuse-Based Approach (Actel)

RAM Based Field Programmable Logic (Xilinx)

CLBs

Carry & Control Logic

FPGA Memory Implementation
1-2
History of Computational Fabrics

Discrete devices: relays, transistors (1940s-50s)

Discrete logic gates (1950s-60s)

Integrated circuits (1960s-70s)
• e.g. TTL packages: Data Book for 100s of different parts

Gate Arrays (IBM 1970s)
• Transistors are pre-placed on the chip & Place and Route
software puts the chip together automatically; only the
interconnect is programmed (mask programming)

Software Based Schemes (1970s to present)
• Run instructions on a general purpose core
1-3
History of Computational Fabrics

ASIC Design (1980s to present)
• Turn Verilog directly into layout using a library of standard
cells
• Effective for high-volume and efficient use of silicon area

Programmable Logic (1980s to present)
• A chip that can be programmed after it has been fabricated
• Examples: PALs, PLAs, EPROM, EEPROM, PLDs, FPGAs
• Excellent support for mapping from Verilog
1-4
What is an FPGA?

A field-programmable gate array (FPGA) is a
reprogrammable silicon chip.

Using prebuilt logic blocks and programmable routing
resources, you can configure these chips to implement
custom hardware functionality without ever having to
pick up a breadboard or soldering iron.

You develop digital computing tasks in software and
compile them down to a configuration file or bitstream
that contains information on how the components
should be wired together.
1-5
ASIC vs. FPGA
ASIC: Application Specific Integrated Circuit
• designs must be sent for expensive and time
consuming fabrication in semiconductor foundry
• designed all the way from behavioral description
to physical layout

FPGA: Field Programmable Gate Array
• bought off the shelf and reconfigured by
designers themselves
• no physical layout design; design ends with
a bitstream used to configure a device
1-6
ASIC vs. FPGA
ASICs
• High performance
• Low power
• Low cost in high volumes

FPGAs
• Off-the-shelf
• Low development cost
• Short time to market
• Reconfigurability
1-7
Other FPGA Advantages

Manufacturing cycle for ASIC is very costly, lengthy
and engages lots of manpower
• Mistakes not detected at design time have large impact on
development time and cost
• FPGAs are perfect for rapid prototyping of digital circuits

Easy upgrades, as with software

FPGAs provide a flexible platform for implementing
digital computing

A rich set of macros and I/Os supported (multipliers,
block RAMs, ROMs, high-speed I/O)

A wide range of applications from prototyping (to
validate a design before ASIC mapping) to high
performance spatial computing
1-8
How are FPGAs Used?

Prototyping
• Ensemble of gate arrays used to
emulate a circuit to be manufactured
• Get more/better/faster debugging
done than with simulation

Reconfigurable hardware
• One hardware block used to
implement more than one function

Special-purpose computation
engines
• Hardware dedicated to solving one
problem (or class of problems)
• Accelerators attached to general-purpose
computers (e.g., in a cell phone!)
1-9
Major FPGA Vendors
SRAM-based FPGAs

Xilinx, Inc.

Altera Corp.

Atmel

Lattice Semiconductor
Share over 60% of the market
Flash & antifuse FPGAs

Actel Corp.

QuickLogic Corp.
1-10
Reconfigurable Logic
1-11
Anti-Fuse-Based Approach (Actel)
1-12
Actel Logic Module
Example Gate Mapping
Combinational Block
S-R Latch
1-13
Actel Routing & Programming
1-14
RAM Based Field Programmable
Logic - Xilinx
1-15
Xilinx FPGA Families



Old families
• XC3000, XC4000, XC5200
• Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for
modern designs.

High-performance families
• Virtex (0.22µm)
• Virtex-E, Virtex-EM (0.18µm)
• Virtex-II, Virtex-II Pro (0.13µm)
• Virtex-4 (0.09µm)

Low Cost Family
• Spartan/XL: derived from XC4000
• Spartan-II: derived from Virtex
• Spartan-IIE: derived from Virtex-E
• Spartan-3
1-16
FPGA Nomenclature
1-17
Device Part Marking
1-18
The Xilinx 4000 CLB
1-19
Two 4-input Functions, Registered
Output and a Two Input Function
1-20
5-input Function, Combinational
Output
1-21
5-Input Functions implemented using
two LUTs
1-22
LUT Mapping

N-LUT: direct implementation of a truth table, i.e. any
function of N inputs.

N-LUT requires 2^N storage elements (latches)

N inputs select one latch location (like a memory)
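The truth-table view of a LUT can be sketched in Verilog. This is a behavioral model for illustration only, not a vendor primitive; the module name lut4 and the INIT parameter are assumptions of this example. For N = 4 the table holds 2^4 = 16 bits, and the four inputs index one of them.

```verilog
// Behavioral model of a 4-input LUT (hypothetical module, for illustration).
// INIT holds the 16-entry truth table; bit i is the output for input value i.
module lut4 #(parameter [15:0] INIT = 16'h0000) (
  input  wire [3:0] in,    // the N = 4 inputs act as a memory address
  output wire       out
);
  wire [15:0] truth = INIT;   // the 2^4 stored bits
  assign out = truth[in];     // inputs select one stored bit
endmodule
```

With this convention, INIT = 16'h8000 would give a 4-input AND (only entry 15 stores a 1) and INIT = 16'h6996 a 4-input XOR.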
1-23
Configuring the CLB as a RAM
1-24
Xilinx 4000 Interconnect
1-25
Xilinx 4000 Interconnect Details
1-26
Xilinx 4000 Flexible IOB
1-27
Basic I/O Block Structure
1-28
IOB Functionality

IOB provides interface between the package pins and
CLBs

Each IOB can work as uni- or bi-directional I/O

Outputs can be forced into High Impedance

Inputs and outputs can be registered

Inputs can be delayed
• advised for high-performance I/O
1-29
Additional Features in Modern FPGAs
1-30
Spartan-3 Xilinx FPGA Block Diagram
1-31
CLB Structure
1-32
CLB Slice Structure

Each slice contains two sets of the following:
• Four-input LUT
  • Any 4-input logic function,
  • or 16-bit x 1 sync RAM
  • or 16-bit shift register
• Carry & Control
  • Fast arithmetic logic
  • Multiplier logic
  • Multiplexer logic
• Storage element
  • Latch or flip-flop
  • Set and reset
  • True or inverted inputs
  • Sync. or async. control
1-33
Xilinx Multipurpose LUT (MLUT)
[Figure: a 4-input LUT can serve as a 16 x 1 ROM (logic), a 16 x 1 RAM, or a 16-bit shift register (SR)]
1-34
5-Input Functions implemented using
two LUTs

One CLB slice can implement any function of 5 inputs

Logic function is partitioned between two LUTs

F5 multiplexer selects LUT
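The partitioning can be sketched as follows. This is a behavioral illustration, not the actual slice netlist; the module name func5 and the parameters F_INIT and G_INIT are assumptions. Each LUT holds the truth table for one value of the fifth input, and the multiplexer picks between them.

```verilog
// Sketch: a 5-input function built from two 4-input LUTs and a 2:1 mux.
// F_INIT covers in[4] = 0, G_INIT covers in[4] = 1 (names are illustrative).
module func5 #(
  parameter [15:0] F_INIT = 16'h0000,
  parameter [15:0] G_INIT = 16'h0000
) (
  input  wire [4:0] in,
  output wire       out
);
  wire [15:0] f_table = F_INIT;
  wire [15:0] g_table = G_INIT;
  wire f_out = f_table[in[3:0]];        // LUT F: truth table for in[4] = 0
  wire g_out = g_table[in[3:0]];        // LUT G: truth table for in[4] = 1
  assign out = in[4] ? g_out : f_out;   // F5 multiplexer selects one LUT
endmodule
```

For example, a 5-input XOR would use F_INIT = 16'h6996 and G_INIT = 16'h9669, since the second half of the table is the complement of the first.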
1-35
Distributed RAM

CLB LUT configurable as
Distributed RAM
• A LUT equals 16x1 RAM
• Implements Single and Dual-Ports
• Cascade LUTs to increase RAM size

Synchronous write

Synchronous/Asynchronous read
• Accompanying flip-flops used for
synchronous read

Two LUTs can make
• 32 x 1 single-port RAM
• 16 x 2 single-port RAM
• 16 x 1 dual-port RAM
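As a rough illustration, a 16 x 1 dual-port memory with synchronous write and asynchronous read might be described as below. The module and port names are assumptions of this sketch; whether a given synthesis tool maps it onto LUT RAM depends on the tool and its settings.

```verilog
// Sketch of a 16 x 1 dual-port RAM in the distributed-RAM style:
// synchronous write, asynchronous read. Port names are illustrative.
module dist_ram16x1d (
  input  wire       clk,
  input  wire       we,
  input  wire [3:0] a,      // read/write address
  input  wire [3:0] dpra,   // second, read-only address
  input  wire       d,
  output wire       spo,    // data at address a
  output wire       dpo     // data at address dpra
);
  reg mem [0:15];
  always @(posedge clk)
    if (we) mem[a] <= d;    // write happens on the clock edge
  assign spo = mem[a];      // reads are combinational (asynchronous)
  assign dpo = mem[dpra];
endmodule
```

Registering spo or dpo in a flip-flop after the read would turn the asynchronous read into a synchronous one.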
1-36
Shift Register

Each LUT can be configured as
shift register
• Serial in, serial out

Dynamically addressable delay
up to 16 cycles

For programmable pipeline

Cascade for greater cycle
delays

Use CLB flip-flops to add
depth
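A behavioral sketch of such a LUT-based shift register is given below. It is similar in spirit to a 16-bit shift-register primitive, but the module name, port names, and zero initial contents are assumptions of this example.

```verilog
// Behavioral sketch of a LUT used as a 16-bit shift register with a
// dynamically addressable tap. Names are illustrative.
module srl16_sketch (
  input  wire       clk,
  input  wire       ce,    // shift enable
  input  wire       din,   // serial in
  input  wire [3:0] tap,   // delay select: dout is din delayed by tap+1 shifts
  output wire       dout   // serial out
);
  reg [15:0] sr = 16'h0000;
  always @(posedge clk)
    if (ce) sr <= {sr[14:0], din};   // shift one position per enabled clock
  assign dout = sr[tap];             // tap = 0 gives 1 cycle, tap = 15 gives 16
endmodule
```

Changing tap at run time changes the delay without reloading the register contents, which is what makes the pipeline depth programmable.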
1-37
Shift Register

Register-rich FPGA
• Allows for addition of pipeline stages to increase
throughput

Data paths must be balanced to keep desired
functionality
1-38
Carry & Control Logic
1-39
Fast Carry Logic

Each CLB contains separate logic
and routing for the fast generation
of sum & carry signals
• Increases efficiency and performance
of adders, subtractors, accumulators,
comparators, and counters

Carry logic is independent of
normal logic and routing resources

All major synthesis tools can infer
carry logic for arithmetic functions
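As an illustration of inference, an adder written with the "+" operator is the kind of description a synthesis tool can typically map onto the dedicated carry logic. The module name and the WIDTH parameter are assumptions of this sketch, and the actual mapping depends on the tool.

```verilog
// Sketch: an adder written with "+" so that synthesis can use the carry chain.
// WIDTH is an illustrative parameter.
module add_carry #(parameter WIDTH = 8) (
  input  wire [WIDTH-1:0] a,
  input  wire [WIDTH-1:0] b,
  input  wire             cin,
  output wire [WIDTH-1:0] sum,
  output wire             cout
);
  // The concatenation captures the carry out of the most significant bit.
  assign {cout, sum} = a + b + cin;
endmodule
```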
1-40
The Virtex II CLB (Half Slice Shown)
1-41
Adder Implementation
1-42
Carry Chain
1-43
New 18 x 18 Embedded Multiplier

Embedded 18-bit x 18-bit
multiplier
• 2’s complement signed operation

Multipliers are organized in
columns

Fast arithmetic functions
• Optimized to implement multiply /
accumulate modules
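A multiply/accumulate module around one 18 x 18 signed multiplier might look like the sketch below. The 48-bit accumulator width, the synchronous clear, and all names are assumptions made for this example; the accumulator itself would be built from CLB logic rather than from the multiplier block.

```verilog
// Sketch: signed 18 x 18 multiply-accumulate. Accumulator width (48 bits)
// and port names are assumptions of this example.
module mac18 (
  input  wire               clk,
  input  wire               clr,   // synchronous clear of the accumulator
  input  wire signed [17:0] a,
  input  wire signed [17:0] b,
  output reg  signed [47:0] acc
);
  wire signed [35:0] prod = a * b;   // 2's complement product, 36 bits
  always @(posedge clk)
    if (clr) acc <= 48'sd0;
    else     acc <= acc + prod;      // prod is sign-extended to 48 bits
endmodule
```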
1-44
Design Flow - Mapping

Technology Mapping: Schematic/HDL to Physical
Logic units
• Compile functions into basic LUT-based groups (function of
target architecture)
1-45
Design Flow – Placement & Route

Placement – assign logic
location on a particular
device

Routing: iterative process
to connect CLB
inputs/outputs and IOBs.
Optimizes critical path delay;
can take hours or days for
large, dense designs

Challenge! Cannot use full
chip for reasonable speeds
(wires are not ideal).
Typically no more than 50%
utilization.
1-46
Example: Verilog to FPGA
1-47
Memory Types
1-48
FPGA Memory Implementation

Regular registers in logic blocks

[Xilinx Virtex-II] use the LUTs:
• Piggy use of resources, but convenient & fast if small
• Single port: 16x(1,2,4,8), 32x(1,2,4,8), 64x(1,2), 128x1
• Dual port (1 R/W, 1R): 16x1, 32x1, 64x1
• Can fake extra read ports by cloning memory: all clones are
written with the same addr/data, but each clone can have a
different read address
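The cloning trick can be sketched as follows: two copies of a small memory receive the same write, and each copy is read at its own address. The module name, parameters, and port names are assumptions of this example.

```verilog
// Sketch: "faking" two read ports by keeping two copies of a small memory.
// Both copies see the same write; each copy has its own read address.
module ram_2r1w #(parameter AW = 4, parameter DW = 8) (
  input  wire          clk,
  input  wire          we,
  input  wire [AW-1:0] waddr,
  input  wire [DW-1:0] wdata,
  input  wire [AW-1:0] raddr0,
  input  wire [AW-1:0] raddr1,
  output wire [DW-1:0] rdata0,
  output wire [DW-1:0] rdata1
);
  reg [DW-1:0] copy0 [0:(1<<AW)-1];
  reg [DW-1:0] copy1 [0:(1<<AW)-1];
  always @(posedge clk)
    if (we) begin
      copy0[waddr] <= wdata;   // identical addr/data written to every clone
      copy1[waddr] <= wdata;
    end
  assign rdata0 = copy0[raddr0];   // clone 0 serves read port 0
  assign rdata1 = copy1[raddr1];   // clone 1 serves read port 1
endmodule
```

The cost is one full copy of the memory per extra read port, which is why this is only attractive for small memories.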

[Xilinx Virtex-II] use block RAM:
• 18K bits: 16Kx1, 8Kx2, 4Kx4
• with parity: 2Kx(8+1), 1Kx(16+2), 512x(32+4)
• Single or dual port
• Pipelined (clocked) operations
1-49
LUT-Based RAMS
1-50
LUT-Based RAMS
1-51
LUT-Based RAM Modules
1-52
LUT-Based RAM Modules
// instantiate a LUT-based RAM module (16 x 1, single port)
RAM16X1S mymem (
  .D(din), .O(dout), .WE(we), .WCLK(clock_27mhz),
  .A0(a[0]), .A1(a[1]), .A2(a[2]), .A3(a[3])
);
// initial contents: 16 bits, msb first (INIT[15] down to INIT[0])
defparam mymem.INIT = 16'b1111_0011_0101_1100;
1-53
Example of Inferred Memory
1-54
Block RAM

Most efficient memory implementation

Ideal for most memory requirements
• Dedicated blocks of memory
• 4 to 104 memory blocks
• 18 kbits = 18,432 bits per block (16 k without
parity bits)
• Use multiple blocks for larger memories

Builds both single and true dual-port
RAMs

Synchronous write and read (different
from distributed RAM)
1-55
Block RAM

Support of two independent 9 Kb blocks, or a single 18
Kb block RAM.

Each 9 Kb block RAM can be set to simple dual-port
mode, doubling data width of the block RAM to a
maximum of 36 bits.

Simple dual-port mode is defined as having one read-only
port and one write-only port with independent
clocks.

18 or 36-bit wide ports can have an individual write
enable per byte. This feature is popular for interfacing
to an on-chip microprocessor.
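A per-byte write enable can be sketched behaviorally as below, using a 36-bit port split into four 9-bit lanes (8 data bits plus 1 parity bit each). The depth of 512 words and all names are assumptions of this example, not a vendor template.

```verilog
// Sketch: 36-bit wide memory with one write enable per 9-bit byte lane.
// Depth and names are illustrative.
module bram_bytewe (
  input  wire        clk,
  input  wire [3:0]  we,     // we[i] enables byte lane i
  input  wire [8:0]  addr,   // 512 x 36
  input  wire [35:0] din,
  output reg  [35:0] dout
);
  reg [35:0] mem [0:511];
  integer i;
  always @(posedge clk) begin
    for (i = 0; i < 4; i = i + 1)
      if (we[i]) mem[addr][9*i +: 9] <= din[9*i +: 9];   // update enabled lanes only
    dout <= mem[addr];   // synchronous read; returns the old word during a write
  end
endmodule
```

A processor storing a single byte would assert just one bit of we, leaving the other three lanes of the word untouched.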

All inputs are registered with the port clock and have a
setup-to-clock timing specification.
1-56
Block RAM

A write operation requires one clock edge.

A read operation requires one clock edge.

All output ports are latched. The state of the output
port does not change until the port executes another
read or write operation. The default block RAM output
is latch mode.

The output data path has an optional internal pipeline
register. Using the register mode is strongly
recommended: it allows a higher clock rate, at the
cost of one additional clock cycle of latency.
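The latency difference between latch mode and register mode can be seen in a small behavioral model. The OUT_REG parameter, the 1K x 18 organization, and the names are assumptions of this sketch, not the vendor's attribute names.

```verilog
// Sketch: synchronous-read memory with an optional output pipeline register.
// OUT_REG = 0 models latch mode (data after 1 clock edge); OUT_REG = 1 models
// register mode (data after 2 clock edges). Names are illustrative.
module bram_outreg #(parameter OUT_REG = 1) (
  input  wire        clk,
  input  wire        we,
  input  wire [9:0]  addr,
  input  wire [17:0] din,
  output wire [17:0] dout
);
  reg [17:0] mem [0:1023];
  reg [17:0] latch_q;   // output latches, loaded on every clocked access
  reg [17:0] pipe_q;    // optional extra pipeline register
  always @(posedge clk) begin
    if (we) mem[addr] <= din;
    latch_q <= mem[addr];
    pipe_q  <= latch_q;
  end
  assign dout = (OUT_REG != 0) ? pipe_q : latch_q;
endmodule
```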
1-57
Block RAM
1-58
Block RAM Logic Diagram
1-59
Block RAM Data Combinations and
ADDR Locations
1-60
Block RAM Port Aspect Ratios
1-61
Dual-Port Bus Flexibility

Each port can be configured with a different data bus
width

Provides easy data width conversion without any
additional logic
1-62
Simple Dual-Port Mode Allowed
Combinations for 9 Kb Block RAM
1-63
True Dual-Port Mode Allowed
Combinations for 9 Kb Block RAM
1-64
18 Kb Block RAM—True Dual-Port
Operation
1-65
Read & Write Operations

Read Operation
• In latch mode, the read operation uses one clock edge. The
read address is registered on the read port, and the stored
data is loaded into the output latches after the RAM access
time.
• When using the output register, the read operation will take
one extra latency cycle to arrive at the output.

Write Operation
• A write operation is a single clock-edge operation. The write
address is registered on the write port, and the data input is
stored in memory.
1-66
Write Modes

The write mode has three settings that determine the
behavior of the data available on the output latches
after a write clock edge: WRITE_FIRST, READ_FIRST,
and NO_CHANGE.

The Write mode attribute can be individually selected
for each port. The default mode is WRITE_FIRST.

WRITE_FIRST outputs the newly written data onto the
output bus.

READ_FIRST outputs the previously stored data while
new data is being written.

NO_CHANGE maintains the output previously
generated by a read operation.
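The three behaviors can be compared in a single behavioral model. The integer MODE encoding, the 256 x 8 organization, and the names are assumptions of this sketch; real designs select the mode through a vendor attribute or rely on the coding style a synthesis tool recognizes.

```verilog
// Sketch: single-port synchronous RAM with a selectable write mode.
// MODE: 0 = WRITE_FIRST, 1 = READ_FIRST, 2 = NO_CHANGE (encoding is an
// assumption of this example, not a vendor attribute format).
module ram_wmode #(parameter MODE = 0) (
  input  wire       clk,
  input  wire       en,
  input  wire       we,
  input  wire [7:0] addr,
  input  wire [7:0] din,
  output reg  [7:0] dout
);
  reg [7:0] mem [0:255];
  always @(posedge clk)
    if (en) begin
      if (we) mem[addr] <= din;
      if (!we)            dout <= mem[addr];   // plain read
      else if (MODE == 0) dout <= din;         // WRITE_FIRST: new data
      else if (MODE == 1) dout <= mem[addr];   // READ_FIRST: old data
      // NO_CHANGE: dout keeps its previous value during a write
    end
endmodule
```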
1-67
WRITE_FIRST or Transparent Mode
(Default)

In WRITE_FIRST mode, the input data is
simultaneously written into memory and stored in the
data output (transparent write).
1-68
READ_FIRST or Read-Before-Write
Mode

In READ_FIRST mode, data previously stored at the
write address appears on the output latches, while the
input data is being stored in memory (read before
write).
1-69
NO_CHANGE Mode

In NO_CHANGE mode, the output latches remain
unchanged during a write operation.
1-70
Conflict Avoidance

Block RAM memory is a true dual-port RAM where both
ports can access any memory location at any time.

When accessing the same memory location from both
ports, the user must, however, observe certain
restrictions.

There are no timing restrictions when both ports
perform a read operation.

When one port performs a write operation, the other
port must not read from or write to the exact same
memory location.
1-71
Spartan-3 Block RAM Amounts
1-72
Spartan-3 FPGA Family Members
1-73
Virtex-II 1.5V Architecture
1-74
Virtex-II 1.5V
Device     CLB Array  Slices  Maximum I/O  BlockRAM (18kb)  Multiplier Blocks  Distributed RAM bits
XC2V40     8x8        256     88           4                4                  8,192
XC2V80     16x8       512     120          8                8                  16,384
XC2V250    24x16      1,536   200          24               24                 49,152
XC2V500    32x24      3,072   264          32               32                 98,304
XC2V1000   40x32      5,120   432          40               40                 163,840
XC2V1500   48x40      7,680   528          48               48                 245,760
XC2V2000   56x48      10,752  624          56               56                 344,064
XC2V3000   64x56      14,336  720          96               96                 458,752
XC2V4000   80x72      23,040  912          120              120                737,280
XC2V6000   96x88      33,792  1,104        144              144                1,081,344
XC2V8000   112x104    46,592  1,108        168              168                1,490,944
1-75
Using Core Generator
1-76
Single Port BRAM
1-77
Single Port BRAM
1-78
Single Port BRAM
1-79
Single Port BRAM
1-80
Dual Port BRAM
1-81
Dual Port BRAM
1-82
Dual Port BRAM
1-83
Distributed RAM
1-84
Distributed RAM
1-85
Distributed RAM
1-86