A Quick Guide to High-Speed I/O for SOC Function Blocks
The two bottlenecks in high-speed SOC block
design are I/O performance and computational
performance. Traditionally, the main bus of a
processor core represents a major I/O bottleneck.
All data into and out of the processor must pass
over this main bus. Consequently, two factors
constrain I/O traffic in and out of the processor.
First, a bus can only perform one transfer at
a time so other pending transfers must wait for
the current transfer to clear. Second, because
processor main buses are designed to accommodate
many system configurations, they tend to require
multiple cycles to effect bus transactions. As
a result of these limitations, processor cores
have lacked the I/O bandwidth required by many
tasks performed in SOCs.
Tensilica’s Xtensa
LX2 processor incorporates several features that
improve I/O bandwidth. In fact, these features
allow the Xtensa LX2 processor to deliver I/O
transfer rates that can match those of hardwired
RTL blocks. Unlike hand-built RTL, however, the
Xtensa LX2 processor achieves those data rates
with an automatically generated, pre-verified
hardware core that greatly reduces the time
required to develop the SOC’s hardware. In
addition, the resulting function
is firmware-programmable, which means that it
can be changed at a later date to accommodate
a new or revised industry standard, to add a
feature, or to fix a bug in the system design
without changing the silicon.
The key features that allow the Xtensa LX2 processor
to achieve these high data-transfer rates are
its XLMI local-memory bus (a feature carried
forward from Tensilica’s Xtensa V processor)
and TIE queues. (Note: TIE is Tensilica’s
Instruction Extension language, which allows
designers to extend the processor’s abilities
using a Verilog-like language to describe the
features of the new abilities without the need
to describe the structure of the hardware that
implements these abilities.) The XLMI bus is
a simple, fast, single-cycle bus that can perform
transfers much faster than the Xtensa processor’s
main bus (the PIF).
A new feature introduced with the Xtensa LX2
processor, called TIE ports and queues, allows
designers to add many new input and output ports
that lead directly into and out of the processor’s
execution unit. These ports and queues can be
directly invoked by new instruction extensions,
also written in TIE, so that input and output
operations become implicit in the execution of
a computation. This approach delivers I/O bandwidth
comparable to that of function blocks hand-coded
in RTL but requires
much less effort from the SOC development team
to design and verify the hardware because the
Xtensa LX2 processor is generated automatically
by Tensilica’s Xtensa Processor Generator.
The 1-Bus Bottleneck
Figure 1 shows the microprocessor
core configuration typically found in SOC designs.
The sole data highway into and out of the processor
is its main bus. Because processors often interact
with other types of bus masters including
other processors and DMA controllers, their main
buses have sophisticated transaction protocols
and arbitration mechanisms for sharing the
bus among masters. These extra mechanisms
result in bus transactions that occur over several
clock cycles.

Figure 1. The 1-bus bottleneck.
An Xtensa LX2 PIF read transaction takes a minimum
of 6 cycles and a write transaction takes at
least 1 cycle, depending on the speed of the
target device connected to the PIF. From these
transaction timings, we can calculate the minimum
number of cycles needed to perform a simple flow-through
computation, where two numbers are loaded from
memory, added, and stored back into memory. The
assembly code to perform this computation might
look like this:
L32I reg_A, Addr_A ; Load the first operand
L32I reg_B, Addr_B ; Load the second operand
ADD reg_C, reg_A, reg_B ; Add the two operands
S32I reg_C, Addr_C ; Store the result
To simplify this code, we assume that pointers
to memory locations storing values A, B, and
C are already initialized in registers Addr_A,
Addr_B, and Addr_C. If not, then more time will
be needed for this computation.
The minimum cycle count required to perform
this computation is:
L32I reg_A, Addr_A: 6 cycles
L32I reg_B, Addr_B: 6 cycles
ADD reg_C, reg_A, reg_B: 1 cycle
S32I reg_C, Addr_C: 1 cycle
Total: 14 cycles
(Note: This cycle count is a minimum number.
Because the Xtensa LX2 processor is pipelined,
the total number of cycles will be slightly larger
than 14 but the additional cycles will be overlapped
with the execution of other instructions. If
this code sequence sits within a zero-overhead
loop, the cycle count for each loop iteration
is 14 cycles.)
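The tally above can be reproduced with a small Python model. This is only a sketch under stated assumptions: the per-instruction costs are the minimum PIF figures quoted in this section, pipeline effects are ignored, and the names PIF_COST, SEQUENCE, and sequence_cycles are hypothetical, introduced purely for illustration.

```python
# Minimum cycle cost per instruction over the PIF, as quoted in the text.
# These names are illustrative and are not part of any Tensilica tool.
PIF_COST = {"L32I": 6, "ADD": 1, "S32I": 1}

# The flow-through computation: two loads, one add, one store.
SEQUENCE = ["L32I", "L32I", "ADD", "S32I"]


def sequence_cycles(sequence, cost):
    """Sum the minimum cycle cost of each instruction in the sequence."""
    return sum(cost[op] for op in sequence)


pif_total = sequence_cycles(SEQUENCE, PIF_COST)
print(pif_total)  # 14
```

Changing an entry in the cost table shows directly how much of the total is due to the two 6-cycle loads.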
Note that load (L32I) instructions consume 6
cycles. This cycle count is the minimum required
to return the requested information over the
processor’s main bus (the PIF). Loads must
complete before the next instruction that uses
the resulting data executes. Also note that the
store (S32I) instruction consumes only one cycle
because the stored value is immediately placed
in a store buffer. Once the value enters the
processor’s store buffer, the store instruction
completes. The processor’s bus-control
logic subsequently moves the stored value to
the target location.
When high-speed data must flow through a function
block, the large number of cycles required for
this flow-through operation is often a major
factor in the decision to design a purpose-built
block of RTL for the task, because a conventional
processor would be too slow.
Break the Bottleneck with a Faster Bus
One way
to solve this problem is to conduct the load
and store transactions over a faster bus to improve
the overall I/O bandwidth. Xtensa processors
have a local-memory bus interface called XLMI
that implements a simpler transaction protocol
than the processor’s main PIF bus, so load and
store operations can complete in as little as
one cycle. By conducting
loads and stores over the XLMI bus instead
of the PIF, the above computation timing becomes:
L32I reg_A, Addr_A: 1 cycle
L32I reg_B, Addr_B: 1 cycle
ADD reg_C, reg_A, reg_B: 1 cycle
S32I reg_C, Addr_C: 1 cycle
Total: 4 cycles (with the same caveat regarding
the processor pipeline)
This result represents a 3.5x improvement in
the function’s cycle count, which may mean
the difference between acceptable and unacceptable
performance for a particular task. However, even
with the performance improvement gained from
faster bus transactions, the XLMI bus still conducts
only one transaction at a time, so loads and
stores occur serially.
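The improvement factor, and what it could mean for sustained throughput, can be checked with a few lines of Python. The cycle counts come from the text. The 300 MHz clock is purely an assumed figure for illustration and is not a specification of any Xtensa configuration, and the function names are hypothetical.

```python
PIF_CYCLES = 14   # two 6-cycle loads + add + store over the PIF (from the text)
XLMI_CYCLES = 4   # four single-cycle instructions over XLMI (from the text)

# Hypothetical clock rate, chosen only to make the arithmetic concrete.
ASSUMED_CLOCK_HZ = 300e6


def speedup(cycles_before, cycles_after):
    """Ratio of the old cycle count to the new cycle count."""
    return cycles_before / cycles_after


def results_per_second(clock_hz, cycles_per_result):
    """Sustained results per second if each result costs a fixed cycle count."""
    return clock_hz / cycles_per_result


print(speedup(PIF_CYCLES, XLMI_CYCLES))                    # 3.5
print(results_per_second(ASSUMED_CLOCK_HZ, XLMI_CYCLES))   # 75000000.0
```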
Attaining Ultimate Bandwidth
Even the 4-cycle
count achieved with single-cycle loads and
stores over a fast local bus can be too slow
for certain SOC tasks. Because of this, Tensilica
has significantly boosted the I/O bandwidth of
the Xtensa LX2 processor with a feature called
TIE ports and queues.
The XLMI bus runs faster than the PIF because
it implements simpler transaction mechanisms.
However, the XLMI port is still a bus. It can
communicate with several attached memories and
devices, but with only one of these at a time
because of the nature of bus-oriented I/O
transactions.
Ports and queues are very simple, direct communication
structures. Like XLMI transactions, transactions
over ports and queues occur in one cycle. However,
transactions conducted over ports and queues
are not activated by addresses supplied by the
processor. These simpler structures are activated
by specially written processor instructions that
implicitly initiate port and queue transactions.
Consequently, one designer-defined instruction
can initiate transactions on several ports and
queues at the same time, which boosts I/O bandwidth.
Using TIE, it’s possible to create queues
especially for the example problem discussed
in this white paper. Three queues are needed:
two input queues for the input operands and one
output queue for the result. With these three
queues defined, it’s then possible to define
an addition instruction that
- Implicitly draws input operands A and B from
their respective input queues.
- Adds A and B together.
- Outputs the result of the addition (C) on the
output queue.
The TIE code to declare the three queues is:
queue InQ_A 32 in
queue InQ_B 32 in
queue OutQ_C 32 out
The first two statements declare input queues
named InQ_A and InQ_B that are 32 bits wide.
The third statement declares an output queue
named OutQ_C that is also 32 bits wide. Declaring
a TIE queue causes the Xtensa Processor Generator
to create an additional I/O port with the handshake
lines needed to connect to a FIFO that’s
external to the processor. A TIE instruction
can implicitly read from several input queues
and write to several output queues during one
clock cycle. Input queues can operate like any
other input operand to an instruction and output
queues can be assigned values just like any other
output operand. (Note that input and output ports
and queues need not be 32 bits wide. They can
be as narrow as 1 bit or as wide as 1024 bits
in the current implementation of TIE.)
The following example describes a TIE instruction
called ADD_XFER that reads operands from each
of the input queues, adds them together, and
writes the result to an output queue.
operation ADD_XFER {in AR val} {in InQ_A, in InQ_B, out OutQ_C} {
    assign OutQ_C = InQ_A + InQ_B;
}
With this new instruction, the example problem
reduces to one instruction:
ADD_XFER
This instruction takes five cycles to run through
the Xtensa LX2 processor’s 5-stage pipeline,
but it has a latency of only one clock cycle.
By placing this instruction within a zero-overhead
loop, the processor can deliver an effective
throughput of one instruction per clock cycle.
Thus the computation and data movement occur
in the absolute minimum number of clock cycles,
namely one. Even hand-coded, hardwired RTL cannot
perform this function any faster than the properly
equipped processor.
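The behavior of ADD_XFER inside a loop can be approximated in software. The following Python sketch is a behavioral model only, not TIE and not a cycle-accurate simulation. It assumes each issue of the instruction counts as one cycle, that the 32-bit sum wraps on overflow, and that the loop simply stops when an input queue is empty (real hardware would use the queue handshake lines instead). The function names add_xfer and run_loop are hypothetical.

```python
from collections import deque

MASK32 = 0xFFFFFFFF  # the example queues are 32 bits wide


def add_xfer(in_q_a, in_q_b, out_q_c):
    """Model one ADD_XFER: pop one operand from each input queue, push the sum."""
    a = in_q_a.popleft()
    b = in_q_b.popleft()
    out_q_c.append((a + b) & MASK32)  # assumed to wrap to 32 bits


def run_loop(in_q_a, in_q_b, out_q_c):
    """Issue ADD_XFER once per modeled cycle until an input queue runs dry."""
    cycles = 0
    while in_q_a and in_q_b:
        add_xfer(in_q_a, in_q_b, out_q_c)
        cycles += 1
    return cycles


in_a = deque([1, 2, 3, 0xFFFFFFFF])
in_b = deque([10, 20, 30, 1])
out_c = deque()

cycles = run_loop(in_a, in_b, out_c)
print(cycles, list(out_c))  # 4 [11, 22, 33, 0]
```

Four operand pairs produce four results in four modeled cycles, compared with 14 cycles per result over the PIF and 4 cycles per result over XLMI.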
Boost Throughput Further with Multiple Computations Per Cycle
The Xtensa LX2 processor is not limited
to performing one computation per cycle. TIE
provides two ways to perform two or more calculations
at a time. The designer can create a single-cycle
TIE instruction that draws operands from several
input queues, performs multiple operations on
these operands concurrently, and then outputs
the results on several output queues. Alternatively,
the designer can use the Xtensa LX2 processor’s
FLIX (flexible length instruction extension)
technology to develop wide, multi-operation instructions.
FLIX instructions execute these multiple operations
concurrently. The advantage of the first approach
is simplicity. If all the computations are always
performed concurrently in the same manner, then
they can be combined in one instruction. The
advantage of the FLIX approach is flexibility.
If some of the operations are only performed
some of the time or if the operations are performed
in various combinations, the FLIX instruction
provides the ability to easily create many combinations
of concurrent instructions. In some applications,
either method will work equally well. In other
applications, the flexibility of FLIX instructions
will make function-block development much easier.
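The first approach, a single instruction that performs several operations at once, can be illustrated with another behavioral Python sketch. Everything here is an assumption made for illustration: the instruction name dual_add_xfer, the two "lanes" of queues, and the choice of two independent 32-bit additions per issue. It is not TIE or FLIX code and says nothing about how such an instruction would be encoded.

```python
from collections import deque

MASK32 = 0xFFFFFFFF  # assume 32-bit queues, as in the ADD_XFER example


def dual_add_xfer(lane0, lane1):
    """Model one issue of a hypothetical fused instruction.

    Each lane is a tuple (in_a, in_b, out_c) of deques. Both additions
    are treated as happening in the same modeled cycle.
    """
    for in_a, in_b, out_c in (lane0, lane1):
        out_c.append((in_a.popleft() + in_b.popleft()) & MASK32)


lane0 = (deque([1, 2, 3]), deque([4, 5, 6]), deque())
lane1 = (deque([10, 20, 30]), deque([40, 50, 60]), deque())

cycles = 0
while lane0[0] and lane1[0]:
    dual_add_xfer(lane0, lane1)
    cycles += 1

results = len(lane0[2]) + len(lane1[2])
print(cycles, results / cycles)         # 3 2.0
print(list(lane0[2]), list(lane1[2]))   # [5, 7, 9] [50, 70, 90]
```

In this model every issue always performs both additions, which mirrors the trade-off described above: simple when the operations always occur together, less suitable when they occur in varying combinations.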