Field programmable gate arrays (FPGAs) are powerful devices for
implementing complex digital systems. They are used most effectively
when the designer understands the key differences between FPGAs and
earlier logic technologies such as PLDs and SSI/MSI.
Understanding these differences and using design techniques
appropriate for FPGAs can improve speed and density by 50% to 100%
compared with design styles that treat FPGAs the same as PLDs or
SSI/MSI. This application note identifies the key
architectural differences between FPGAs, PLDs, and SSI/MSI,
explains design methodologies that result from understanding these
differences, and gives some simple examples illustrating these
techniques in real applications.
FPGAs Compared to PLDs
PLDs are array-oriented devices that typically have an AND-OR
structure with wide-input AND gates feeding a narrower OR
gate. A register is typically available at the output of each
OR. This architecture is termed "logic rich"
because there are typically many more logic gates than
registers; the ratio can be as much as 5 to 1. PLDs pay a
significant speed penalty when multiple levels of logic are
required because of the large delay through the wider logic
modules. Speeds tend to be more predictable in PLDs because of the
larger "speed quantum" (the delay added by each level of logic).
FPGAs, on the other hand, are register rich, with a logic to
register ratio closer to 2 to 1. This ratio is equivalent to
traditional gate array usage ratios and tends to be related to the
fact that high density designs (more than 1K gates) need more
registers than the traditional "glue" oriented low
density (less than 1K gates) applications. FPGA logic structures
are optimized for narrower functions than those in PLDs. FPGAs have
smaller speed quanta than PLDs, so logic functions can be
incremented in complexity while incrementing the delay only a
little each time. Additionally, signals that need to be fast
can be sourced near the bottom of the logic tree, minimizing the
number of logic levels required, while slow signals can be sourced
at the top of the logic tree, where more logic levels are required.
FPGAs Compared to SSI/MSI
SSI/MSI building blocks are created by optimizing the number of
pins on popular functions to fit in the small packages available.
Logic functions are typically constructed of a few hundred popular
building blocks like counters, multiplexers, shift registers, and
comparators. The typical design is optimized to reduce package
count, and "tricks" have evolved to make the most use of
a device. For example, simple state machines are constructed from
counters and decoders with appropriate pins tied to one or zero.
This technique minimizes package count compared to a
package-intensive gate-for-gate design. The interconnections in
these designs are done on the PC board, causing insignificant
FPGAs have logic building blocks that are closer in function to
SSI than MSI, so SSI-oriented FPGA designs are generally more
efficient. MSI designs, which utilize common tricks (like the
counter decoder based state machine mentioned above), do not make
efficient use of the FPGA architecture, since MSI building blocks
were not developed with FPGA architectures in mind. Even though
most FPGA soft macro libraries contain popular MSI functions (like
decoders and counters), you should refrain from using the popular
MSI-oriented design tricks, since they will result in inefficient designs.
FPGA Design Techniques
Understanding the main differences between FPGAs, PLDs, and SSI/MSI
devices is the first step towards creating efficient FPGA designs.
The next step is to understand efficient FPGA design techniques.
The techniques are grouped below into state machine, data path,
and random logic categories.
State Machine-Oriented Techniques
The traditional PLD design techniques for implementing state
machines are geared toward the logic rich and register lean
architecture of the standard PLD. A small number of state registers
are used (usually the theoretical minimum), since registers are
scarce. This requires a larger amount of combinatorial logic to
decode the state, but PLDs usually are able to provide enough
combinatorial logic to do this effectively. Using this technique
in FPGAs would not make efficient use of the FPGA's strengths: numerous
registers and fast, narrow logic gates. A bit-per-state approach to
state machine design, where each state uses a separate register
instead of encoding states in multiple registers, results in faster
and more efficient state machines in FPGAs. In many cases,
speed improves by 50 percent to 100 percent compared to the
PLD-oriented methodology of an encoded state machine.
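The bit-per-state idea can be illustrated with a small behavioral model. The following Python sketch is not taken from this application note: the four-state controller (IDLE, LOAD, RUN, DONE) and its `start` and `finished` inputs are hypothetical and were chosen only to show that each state bit's next-state equation depends on just a few signals.

```python
from dataclasses import dataclass

# Hypothetical 4-state controller, used only for illustration:
# IDLE -> LOAD (on start) -> RUN -> DONE (on finished) -> IDLE
IDLE, LOAD, RUN, DONE = range(4)


@dataclass
class OneHotFsm:
    """Bit-per-state machine: one flip-flop per state, exactly one is set."""
    bits: tuple = (True, False, False, False)  # reset into IDLE

    def step(self, start: bool, finished: bool) -> None:
        idle, load, run, done = self.bits
        # Each next-state equation is a narrow function of a few signals,
        # which is what maps well onto small FPGA logic modules.
        next_idle = (idle and not start) or done
        next_load = idle and start
        next_run = load or (run and not finished)
        next_done = run and finished
        # Clock edge: all state registers update together.
        self.bits = (next_idle, next_load, next_run, next_done)

    def state(self) -> int:
        assert sum(self.bits) == 1, "one-hot invariant violated"
        return self.bits.index(True)


def run_trace(inputs):
    """Clock the machine through (start, finished) pairs; return visited states."""
    fsm = OneHotFsm()
    trace = [fsm.state()]
    for start, finished in inputs:
        fsm.step(start, finished)
        trace.append(fsm.state())
    return trace
```

An encoded version of the same machine would need only two registers, but every next-state bit would then have to decode the full state code as well as the inputs.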
In PLD-oriented designs, logic is typically used to develop
outputs from state machines. Usually this requires an additional
level of logic after the state register and adds delay. In FPGAs,
this level of logic can be eliminated in many cases by combining
the logic in front of the state bits in which an output is active.
For example, if the CE output from a state machine needs to be
active in states 3 and 5, the logic feeding state bits 3 and 5 can
be ORed together and registered to create the CE output without
incurring a logic delay after the register. Since the logic in
front of state bits is simple, usually no additional delay or logic
resources are required in front of the new register.
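The CE example can be checked with a short simulation. This Python sketch assumes a hypothetical six-state one-hot machine that simply advances one state per clock; only the states-3-and-5 requirement comes from the text. It compares the registered CE against a conventional decode placed after the state registers.

```python
def simulate_ce(cycles: int):
    """Hypothetical six-state one-hot ring with a registered CE output.

    CE must be active in states 3 and 5. Instead of ORing the state-register
    outputs (logic after the registers), the D inputs of state bits 3 and 5
    are ORed and captured in an extra register on the same clock edge.
    """
    n_states = 6
    state = [i == 0 for i in range(n_states)]  # reset: only state bit 0 set
    ce_reg = False                             # registered CE, reset inactive
    history = []
    for _ in range(cycles):
        # D inputs of the state registers: a simple ring, state i-1 -> i.
        next_state = [state[(i - 1) % n_states] for i in range(n_states)]
        # D input of the CE register: OR of the logic feeding bits 3 and 5.
        next_ce = next_state[3] or next_state[5]
        # Clock edge: all registers update together.
        state, ce_reg = next_state, next_ce
        # Reference decode placed after the registers, for comparison only.
        decoded_ce = state[3] or state[5]
        history.append((state.index(True), ce_reg, decoded_ce))
    return history
```

In this model the registered CE matches the post-register decode on every cycle, so the output is available directly from a register with no logic level behind it.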
Another popular state machine design technique for PLDs uses
counters to generate a sequence of wait states. For example, a
state machine may need to wait for 16 cycles until a data transfer
can begin. A four-bit counter can be used to generate the required
state sequence. This is fairly efficient in PLD architectures
because of the logic rich and register lean characteristics of the
count function. It is not as good a fit for FPGAs, however. In
FPGAs, registers are plentiful, and a shift register is more efficient
and faster than a counter. A normal shift register requires one
register per wait state. If very large delays are required, a
feedback shift register can be used. It provides only one state
fewer than a counter of the same width, but requires much less logic and is faster.
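As a rough illustration of the feedback shift register, the Python sketch below models a 4-bit linear feedback shift register. The feedback polynomial x^4 + x^3 + 1 and the function names are assumptions made for this example and do not come from this application note. The register steps through 15 states, one fewer than a 4-bit counter, using a single XOR gate as its only logic.

```python
def lfsr4_step(state: int) -> int:
    """Advance a 4-bit Fibonacci LFSR (feedback polynomial x^4 + x^3 + 1)."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1  # the only logic: one XOR
    return ((state << 1) | feedback) & 0xF


def lfsr4_sequence(seed: int = 0b0001):
    """Return the states visited before the LFSR returns to the seed."""
    if not 0 < seed <= 0xF:
        raise ValueError("seed must be a non-zero 4-bit value")
    states = [seed]
    state = lfsr4_step(seed)
    while state != seed:
        states.append(state)
        state = lfsr4_step(state)
    return states


def wait_cycles(seed: int, terminal: int) -> int:
    """Number of clocks from the seed until the terminal pattern appears."""
    return lfsr4_sequence(seed).index(terminal)
```

A wait of a given length is obtained by loading the seed and detecting the pattern that appears the required number of clocks later. The all-zero state is excluded because an XOR-feedback register would never leave it.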
Another useful state machine design technique for FPGAs is state
splitting. Sometimes the overall performance of a state machine is
limited by a few complex states that require more levels of
logic than the other states. If these states could be
simplified, the overall speed of the machine could be significantly
increased. These states can be simplified by "splitting"
the complex state into two or more simple states. This may require
an additional clock period to complete the function associated with
the original state, but this time may be insignificant compared to
the time gained by speeding up the entire machine.
A common state machine design technique with MSI uses a loadable
counter to implement a state machine. Load inputs are tied to a
jump address (sometimes logic is used if more than one jump address
is needed). The counter either counts (to advance to the next
state) or loads (to jump to a different state). This is efficient
in MSI since it requires only a couple of packages to implement a
simple state machine. While this design technique reduces MSI
package count, it results in inefficient logic usage for FPGAs. The
bit-per-state technique is much more efficient and easier to design
when using FPGAs.
Another common inefficient FPGA design technique uses a single
large state machine instead of multiple communicating small state
machines. In MSI, sometimes a single microcoded state machine
controls a complex data path. This works well since large
registered PROMs are available to implement a design in a small
number of packages. These designs are complicated, however, because
each state could have several activities occurring simultaneously,
and the interactions between each activity need to be checked in
every state. In FPGAs, multiple communicating state machines are
easier to design, since most of the communication is local, and
only a few activities need to be communicated between different
state machines. The distributed machines tend to have much simpler
logic requirements that also fit better with the FPGA's register-rich
architecture and small logic building blocks. This approach is
also better for FPGA routing because the routing resource
requirements are more spread out. Thus, less congestion will occur,
making routing more efficient.
Data Path-Oriented Techniques
One of the big strengths of FPGAs is in implementing data path
functions efficiently and at very high speed. The register-rich
architecture, combined with the ability to implement multiplexers
efficiently, makes FPGAs ideal devices for implementing data path functions.
Pipelining adders, multipliers, and other complex data functions
can improve performance considerably over non-pipelined versions.
Since registers are readily available at the output of logic, this
important technique requires virtually no extra resources. Even
functions that wouldn't normally be considered candidates for
pipelining should be considered. For example, pipelining can be
used in counting, comparing, and code translation when latency
isn't an issue. Typically, any portion of the data path where
latency can be introduced is a candidate for pipelining.
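The trade between latency and logic depth can be sketched with a behavioral model. The Python example below assumes a 16-bit adder split into two 8-bit stages; the split point and the function name are illustrative choices and do not come from this application note. Each stage contains only an 8-bit carry chain, a new operand pair is accepted on every clock, and each result appears two clocks after its operands.

```python
def pipelined_add16(pairs):
    """Model a two-stage pipelined 16-bit adder, one operand pair per clock.

    Stage 1 adds the low bytes and registers the carry with the high bytes.
    Stage 2 adds the high bytes plus the registered carry.
    Results wrap modulo 2**16, like a 16-bit hardware adder.
    """
    stage1 = None   # pipeline register between the two halves
    stage2 = None   # output register
    results = []
    # One extra idle clock flushes the last result out of the pipeline.
    for pair in list(pairs) + [None]:
        # Combinatorial logic in front of the output register.
        if stage1 is None:
            next_stage2 = None
        else:
            low_sum, carry, a_hi, b_hi = stage1
            high_sum = (a_hi + b_hi + carry) & 0xFF
            next_stage2 = (high_sum << 8) | low_sum
        # Combinatorial logic in front of the pipeline register.
        if pair is None:
            next_stage1 = None
        else:
            a, b = pair
            low = (a & 0xFF) + (b & 0xFF)
            next_stage1 = (low & 0xFF, low >> 8, (a >> 8) & 0xFF, (b >> 8) & 0xFF)
        # Clock edge: both registers update together.
        stage1, stage2 = next_stage1, next_stage2
        if stage2 is not None:
            results.append(stage2)
    return results
```

The model returns the same sums as a single-cycle adder, in the same order, which is why pipelining is attractive wherever the added latency can be tolerated.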
Some data path functions are less efficient to implement in
FPGAs than others, and you should use the more efficient functions
as much as possible. As discussed previously, shift registers and
feedback shift registers are easily implemented in FPGAs and should
be considered for non-traditional applications such as address
generation for FIFO or buffer memories, positioning in waveform
generation, or high-speed event timing or counting.
If counters that are only loaded occasionally are required,
prescaling techniques can be used to improve operating frequency.
However, these techniques also result in slower load capability.
Applications that need to generate long address sequences (for
example, for memory access) can use this load-latency counter very
effectively and operate at a higher speed than a nonlatency counter.
Nonlatency counters have better performance than MSI equivalents
when they are designed using look ahead techniques. These
techniques do not impose any additional constraints on the
application like the load latency counter, but they take advantage
of the register rich nature of FPGAs in implementing counter
functions. For example, in a 16-bit down counter, each register
should "roll over" after the counter reaches all zeros.
Instead of detecting the all zero case by placing combinatorial
logic after the counter registers, logic can be placed in front of
a register to detect the case when the counter contains a one and
is counting down. The register will then be active on the same
cycle in which the counter contains all zeros, saving the
combinatorial delay associated with the all zero detection.
Vendor-supplied soft macros should use this technique to provide
users with the fastest possible nonlatency counters.
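The look-ahead detection described above can be modeled in a few lines. This Python sketch assumes a free-running down counter with no load or enable inputs, and the function name and the `width` parameter are illustrative. The flag register is fed by a "counter equals one" detector, so the flag is active in exactly the cycle in which the counter holds zero.

```python
def down_counter_trace(start: int, cycles: int, width: int = 16):
    """Free-running down counter with a look-ahead "all zeros" flag.

    Returns a list of (count, zero_flag) pairs, one per clock edge.
    """
    mask = (1 << width) - 1
    count = start & mask
    zero_flag = count == 0  # flag register initialized to match the start value
    trace = []
    for _ in range(cycles):
        # Logic in front of the flag register: the counter holds one and is
        # counting down, so it will hold zero after this clock edge.
        next_zero = count == 1
        next_count = (count - 1) & mask  # rolls over after reaching zero
        # Clock edge: counter and flag registers update together.
        count, zero_flag = next_count, next_zero
        trace.append((count, zero_flag))
    return trace
```

Because the flag comes straight from a register, logic that uses it does not wait for a wide all-zero detect on the counter outputs.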
Adders are another data path element commonly used in
FPGAs. When pipelining isn't possible, the carry select
technique is fastest for implementing combinatorial adders. This
technique uses additional logic to produce the two possible results
of an addition operation. One result assumes the carry into a
particular bit is active and the other assumes that the carry into
a particular bit is inactive. The actual carry is developed in
parallel and is used as the input to a multiplexer that selects the
actual result. This utilizes the multiplexer capability of FPGAs to
its fullest and implements adders in the smallest number of levels
of delay possible. This technique can be generalized to other
complex data path functions and shows that logic can be placed in
parallel to increase performance. It requires more logic modules
than serial approaches, but for speed-critical paths it is
an excellent technique.
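The carry select scheme can be sketched behaviorally as follows. This Python example assumes the selection is made per 4-bit block, although the text describes it per bit position. The block size and the function name are assumptions for illustration only. Both candidate sums for each block are formed without waiting for the incoming carry, which then only steers a multiplexer.

```python
def carry_select_add(a: int, b: int, width: int = 16, block: int = 4):
    """Carry-select addition of two unsigned values; returns (sum, carry_out)."""
    if width % block != 0:
        raise ValueError("width must be a multiple of the block size")
    block_mask = (1 << block) - 1
    carry = 0
    total = 0
    for shift in range(0, width, block):
        a_blk = (a >> shift) & block_mask
        b_blk = (b >> shift) & block_mask
        # Both candidates are computed in parallel in hardware; neither
        # depends on the carry coming from the lower blocks.
        sum_if_carry0 = a_blk + b_blk
        sum_if_carry1 = a_blk + b_blk + 1
        # The actual carry only drives the select input of a multiplexer.
        selected = sum_if_carry1 if carry else sum_if_carry0
        total |= (selected & block_mask) << shift
        carry = selected >> block
    return total, carry
```

The loop runs the blocks in sequence only because this is software. In hardware all candidate sums exist at the same time and the critical path is one block adder followed by a chain of multiplexers.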
FPGAs can be very efficient at implementing small multiport
memories, for example, in algorithms that require scratch pad data
storage. Since every register input and output is simultaneously
accessible, unlike addressable memories where only a single value can
be accessed at a time, algorithms that need to access several variables
simultaneously can do so in a single cycle. If this is the
critical portion of a complex algorithm, performance can be
increased dramatically over more serial approaches.
Random Logic-Oriented Techniques
Many of the techniques mentioned so far apply to random logic
oriented functions. Using parallel logic, feeding the critical path
near the end of a logic tree to use the minimum number of logic
levels, using registers to predecode and pipeline, and using shift
registers instead of counters are all useful techniques for
optimizing random logic designs.
Management of fanout is perhaps the most important aspect of
implementing high-speed random logic designs. As fanout increases,
interconnect delays also increase, slowing performance. Keeping
fanout low will almost always result in better performance. The two
best techniques for managing fanout are buffering and duplicating logic.
Buffering a design is simply the process of adding buffers to
reduce the fanout of a large net. Typically, the additional delay
of a buffer is less than the additional delay associated with a
heavily loaded net. Thus, buffering results in a faster signal
overall. Buffering is also useful when a logic signal is needed at
two different levels of a logic tree. The signal closest to the
output of the logic tree can arrive later than the signal going
into the higher level of the tree. A buffer can be used to isolate
the portion of the signal that can arrive later and ensure that the
critical signal is as fast as possible. Buffering is also useful if
a signal has a local component and a component that needs to travel
to a more distant portion of the device. The buffer can isolate the
local portion of the signal from the more distant portion so that
the local portion does not suffer any speed degradation because of
the possible long routing delay for the long interconnect.
Since a buffer uses the same logic module as any other logic
function, in some cases it is more effective to duplicate logic
instead of buffering. Consider, for example, a three-input OR gate
that drives a fanout of four. Duplicating the gate splits the load
between the two copies. If a particular destination is critical,
the load could be split unevenly, with the critical signal given the
lower fanout. The inputs to the OR gate now each drive one extra
load, but the additional delay associated with a single extra load
is, in almost all cases, smaller than the speed gain from
the logic duplication. This technique is particularly
effective when high fanout registers are duplicated, since
registers are an abundant resource in FPGAs.
Many of the SSI-oriented tricks designers use for random logic
translate directly into FPGA devices because of the similarity of
the basic building blocks. Keep in mind, however, that routing
resources are limited inside FPGAs, whereas routing resources in
SSI designs on PC boards are virtually inexhaustible.
Sections of logic that use too many different clock sources and
high fan-in may overly constrain routing. For example, it is
usually more efficient to use a single common clock source with
synchronous enables instead of a large number of individual clock
signals to load individually selected data bits into registers
because synchronous enable signals have more routing flexibility
than clock signals.
Since FPGAs are most efficient at implementing logic at the
input of registers, a good rule of thumb in implementing random
logic is to use logic at the input of registers instead of the
outputs wherever possible. For example, multiplexing a signal
prior to a register is more efficient than multiplexing the signal
after the register.
This section has shown several design techniques that should help
you use FPGAs more efficiently. You should learn and use these
techniques to improve the efficiency of your FPGA designs.
Last Modified: 5/27/2005