Queue- and Wire-based Input and Output Ports
Allow Processors to Achieve RTL-Like I/O Speeds
on SOCs
Mapping all of a processor’s input
and output interfaces to memory addresses is
neither necessary nor efficient for many tasks
in an SOC design. Sometimes the mapping of input/output
interfaces to memory addresses permits the programmer
or compiler to dynamically choose among several
sources and destinations for computations. However,
if dynamic addressability is not important, direct
connections from external signals to processor
execution units can further accelerate performance
and reduce complexity. Wire-based interfaces
are also more familiar to many RTL designers
and often allow processors to substitute for
hardware blocks without even changing the block-interface
(“pin”) definitions of existing RTL
blocks.
Tensilica’s Xtensa LX2 processor can bring
signals directly into its execution units from
other blocks of logic on the SOC and can output
signals directly to other SOC blocks without
using its traditional buses. Consequently, data
movement bypasses the traditional load and store
instructions so that the I/O performed incurs
no overhead. These additional ports into and
out of the processor are created with Tensilica’s
Instruction Extension (TIE) language using two
features that are new to the Xtensa LX2 processor:
TIE ports and queues.
Two basic styles of interface handshake serve
the different input and output environments for
direct connection of processors to external signals:
- Import of values and export of states through
ports (GPIOs)
- Input and output queues (FIFOs)
Consider the simple hardware function shown in
Figure 1. The primary inputs and outputs of this
function can be simplified to wires at the boundary
of the block.

Figure 1. Fully pipelined instruction implementation
with direct input/output from block
Input and output ports may serve as source and
destination operands for configurable-processor
operations, enabling fast and flexible interfaces.
Figure 2 shows a sample implementation of the
function block in Figure 1 written in TIE. This
example uses queues to move data into and out
of the block.
1: state state1 24 add_read_write
2: state state2 24 add_read_write
3: state lastinput1 24
4: state nextoutput1 24
5: queue input1 24 in
6: queue input2 24 in
7: queue output1 24 out
8: operation lookup.mul.mul {} {in input1, in input2, in state1, in state2,
     inout lastinput1, out output1, inout nextoutput1, out VAddr, in MemDataIn32} {
9:   assign VAddr = {8'h0, lastinput1 + state1};
10:  assign lastinput1 = input1;
11:  wire [23:0] mulout = MemDataIn32[23:0] * input2;
12:  assign output1 = nextoutput1;
13:  assign nextoutput1 = mulout * state2;}
14: schedule inst_sched {lookup.mul.mul} {use state2 4; use nextoutput1 3;
     use input1 2; use input2 2; def lastinput1 3; def nextoutput1 4;
     def mulout 3; def output1 3;}
Figure 2. Datapath with input and output queues
TIE example.
In this listing, the “use” and “def” arguments
in the schedule statement on line 14 specify
the pipeline stage where the Xtensa processor’s
input queue interface has data (use) and where
the processor’s output queue interface
accepts the output data from the pipeline (def).
The input queue interface has data available
in the pipeline’s Memory stage (stage 2)
and the output queue interface accepts data during
the pipeline’s Write-Back stage (stage
3). The states lastinput1 and nextoutput1 allow
the late input queue value to be used in the
following instruction and the previous instruction’s
late computational result to be sent to the output
queue.
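The one-instruction delay introduced by lastinput1 and nextoutput1 can be illustrated with a small behavioral model. The Python sketch below is not TIE and says nothing about cycle timing; it only mirrors the assignments in Figure 2. The class name QueueDatapath, the dictionary used as the lookup table, and the sample values are assumptions made for illustration.

```python
from collections import deque

MASK24 = (1 << 24) - 1  # all states and queues in Figure 2 are 24 bits wide


class QueueDatapath:
    """Behavioral sketch of lookup.mul.mul from Figure 2 (hypothetical model)."""

    def __init__(self, table, state1, state2):
        self.table = table            # assumed stand-in for the memory behind VAddr
        self.state1 = state1 & MASK24
        self.state2 = state2 & MASK24
        self.lastinput1 = 0           # input1 value popped by the previous execution
        self.nextoutput1 = 0          # result computed by the previous execution

    def lookup_mul_mul(self, input1, input2, output1):
        # The address uses the input1 value saved by the previous execution.
        vaddr = (self.lastinput1 + self.state1) & MASK24
        mem = self.table.get(vaddr, 0) & MASK24
        # Pop both input queues; input1 is only saved for the next execution.
        self.lastinput1 = input1.popleft() & MASK24
        mulout = (mem * input2.popleft()) & MASK24
        # Push the previous result, then hold the new one for the next push.
        output1.append(self.nextoutput1)
        self.nextoutput1 = (mulout * self.state2) & MASK24


table = {i: i + 1 for i in range(16)}
dp = QueueDatapath(table, state1=0, state2=2)
in1, in2, out = deque([3, 5, 7]), deque([10, 20, 30]), deque()
for _ in range(3):
    dp.lookup_mul_mul(in1, in2, out)
print(list(out))  # [0, 20, 160]
```

The first value pushed is the reset value of nextoutput1, and each later value lags the inputs that produced it by one execution, which is the effect the two extra states are there to create.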
Queue inputs and outputs use direct connection
of wire structures. Accesses to the corresponding
queue structures automatically pop data from
the input queues and push data into the output
queue. The queue-control mechanism is aware of
instruction cancellation (which can happen because
of a variety of other events) and ensures that
no excess data is popped or pushed even if the processor
encounters unexpected error conditions.
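One way to picture this cancellation-aware behavior is as a read or write that stays tentative until the instruction completes. The sketch below is a software analogy only, not a description of the actual Xtensa queue-control logic; the class and method names (SpeculativeInputQueue, SpeculativeOutputQueue, read, write, commit, cancel) are hypothetical.

```python
from collections import deque


class SpeculativeInputQueue:
    """Input queue whose pop takes effect only when the instruction commits."""

    def __init__(self, values):
        self.fifo = deque(values)
        self._pending = False

    def read(self):
        # Look at the head entry without removing it yet.
        self._pending = True
        return self.fifo[0]

    def commit(self):
        # The instruction completed, so the pop becomes real.
        if self._pending:
            self.fifo.popleft()
            self._pending = False

    def cancel(self):
        # The instruction was killed; the head entry stays in place.
        self._pending = False


class SpeculativeOutputQueue:
    """Output queue whose push takes effect only when the instruction commits."""

    def __init__(self):
        self.fifo = deque()
        self._pending = None

    def write(self, value):
        self._pending = value

    def commit(self):
        if self._pending is not None:
            self.fifo.append(self._pending)
            self._pending = None

    def cancel(self):
        self._pending = None
```

A cancelled instruction followed by its re-execution therefore sees the same input value again and produces exactly one output entry, which matches the guarantee described above.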
Queues are one form of instruction-mapped connection.
They are particularly appropriate for streaming
operand data through an application-specific
processor because the request/acknowledge handshaking
is already part of the queue interface, as shown
in Figure 3.
Figure 3. Basic Handshake for Direct Processor
Connections
Queue inputs represent a stream of data values
to be consumed by the application running on
a processor. Sequential executions of the consuming
instruction should see sequential values. Similarly,
queue outputs represent a sequence of values
being produced by the application processor.
The consumption and production of values can
be managed in hardware so that all of the effects
of possible instruction execution cancellation
are hidden and no explicit request-acknowledge
handshake is needed. The processor consuming
these operands stalls if not enough data has
been produced. The processor producing the operands
will stall if the consuming processor falls behind
and allows the input buffer to fill. These queues
form a highly efficient data-streaming connection
between processors, especially where several
processors comprise a large-scale computational
pipeline.
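The stall behavior described above can be sketched with a small cycle-stepped simulation of two processors joined by a bounded FIFO. This is an illustrative model under simplifying assumptions: one queue access per processor per cycle, the consumer acting first within a cycle, and a fixed number of cycles between accesses. The function name simulate_stream and its parameters are hypothetical.

```python
from collections import deque


def simulate_stream(n_items, depth, produce_cycles, consume_cycles):
    """Return (received values, producer stall cycles, consumer stall cycles)."""
    fifo = deque()
    received = []
    produced = 0
    producer_ready = consumer_ready = 0   # earliest cycle of the next access
    producer_stalls = consumer_stalls = 0
    cycle = 0
    while len(received) < n_items:
        # Consumer: stalls while the queue is empty.
        if cycle >= consumer_ready:
            if fifo:
                received.append(fifo.popleft())
                consumer_ready = cycle + consume_cycles
            else:
                consumer_stalls += 1
        # Producer: stalls while the queue is full.
        if produced < n_items and cycle >= producer_ready:
            if len(fifo) < depth:
                fifo.append(produced)
                produced += 1
                producer_ready = cycle + produce_cycles
            else:
                producer_stalls += 1
        cycle += 1
    return received, producer_stalls, consumer_stalls


# Fast producer, slow consumer: the producer stalls once the FIFO fills.
print(simulate_stream(4, depth=2, produce_cycles=1, consume_cycles=3))
# Slow producer, fast consumer: the consumer stalls waiting for data.
print(simulate_stream(3, depth=2, produce_cycles=3, consume_cycles=1))
```

In both cases the values arrive in order and neither side runs any handshake code; the only visible effect of a rate mismatch is stall cycles on one side.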
The second form of the direct interface is based
on ports: import of values and export of states
on a set of wires. These ports are especially
useful for tasks that test external status or
condition information or control other logic
functions.
Application-specific instructions use imported
values on an input port just like other input
operands. The wires are implicit operands of the
instruction, so they do not consume instruction-encoding
bits as register-address specifiers would. When
the corresponding instruction executes, the instruction
simply senses the value on the associated wire.
The processor provides no external indication
that the corresponding instruction is executing
and there is no acknowledgement that the value on
the wire has been used. If the application must
signal that an input value has been used, it
does so explicitly via a store to an external
address or by executing an instruction that writes
to an output queue or to an exported state (on
another wire).
Exporting states creates output ports that deliver
information from instructions to external logic
or to other processors. The output signals do
not change until the processor executes an instruction
that explicitly modifies those signals. The normal
implementation of this type of signal hides the
speculative nature of modern processor pipelines.
For pipelined processors, error conditions, conditional
branches, cache misses, and other unexpected
conditions may cause the premature termination
of some instructions that the processor has started
to execute speculatively. Speculative execution
is a performance-enhancing processor-design technique,
but it requires that the system be able to tolerate
early instruction termination. To maintain the simple programmer’s
model of atomic, in-order instruction execution
and to avoid unintended output glitches that
the processor’s micro-architectural operations might
otherwise cause, the processor hardware prevents externally visible
state changes from occurring for instructions
that do not complete.
The example shown above can also be implemented
with imported values and exported states, as
shown below in Figure 4. This implementation uses an
explicit output signal, next_data, to indicate
that new values for input1 and input2 are required
and that a new output value is available on the
wire output1. It is the developer’s responsibility
to ensure that external logic has sufficient
time to respond to the exported next_data signal
before the next use of the input1 and input2
ports. This guarantee is easily achieved for
moderate-performance applications with tens of
cycles of latency between one input set and the
next. However, queues are generally faster and
simpler for very high input rates. Note that
the instruction set and the program must explicitly
assert the next_data wire to request new input
data and to signal the availability of new output
data. Also note that the current Xtensa implementation
allows the value of imported wires to be used
as early as the ALU stage (“use” as
early as schedule stage 1), though exported states
must be defined by the Write-Back stage (no “def” after
schedule stage 3).
state state1 24 add_read_write
state state2 24 add_read_write
state nextoutput1 24
// Exported states drive output wires that external logic can observe.
state output1 24 24'b0 add_read_write export
state next_data 1 1'b0 add_read_write export
// Imported wires are sampled directly, with no handshake.
import_wire input1 24
import_wire input2 24
operation lookup.mul.mul {} {in input1, in input2, in state1, in state2,
    out output1, inout nextoutput1, out next_data, out VAddr, in MemDataIn32} {
  assign VAddr = {8'h0, input1 + state1};
  wire [23:0] mulout = MemDataIn32[23:0] * input2;
  // Export the previous result and hold the new one for the next execution.
  assign output1 = nextoutput1;
  assign nextoutput1 = mulout * state2;
  assign next_data = 1'h0;}
operation assert_next_data {} {out next_data} {assign next_data = 1'h1;}
schedule inst_sched {lookup.mul.mul} {use state2 4; def output1 3; def nextoutput1 4;}
Figure 4. Datapath with import wire/export
state TIE example.
This example also compensates for the fact that
the computational result is available in a pipeline
stage later than that required for state export.
The operation therefore exports the result of
the previous operation and saves the new computation
result in nextoutput1 for export during the next
cycle.
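The import-wire and export-state variant can be modeled in the same spirit as a plain software sketch. The Python below mirrors the assignments in Figure 4 and adds a hypothetical driver loop, run, that plays the role of both the program and the external logic. It assumes the external logic always responds to next_data before the next use of the wires, which the real design must guarantee by construction. The class name ImportExportDatapath, the dictionary lookup table, and the sample values are illustrative assumptions.

```python
MASK24 = (1 << 24) - 1  # states and wires in Figure 4 are 24 bits wide


class ImportExportDatapath:
    """Behavioral sketch of the Figure 4 operations (hypothetical model)."""

    def __init__(self, table, state1, state2):
        self.table = table            # assumed stand-in for the memory behind VAddr
        self.state1 = state1 & MASK24
        self.state2 = state2 & MASK24
        self.nextoutput1 = 0          # internal state holding the late result
        self.output1 = 0              # exported state (output wire)
        self.next_data = 0            # exported state (request/valid wire)

    def lookup_mul_mul(self, input1, input2):
        # Imported wires are simply sampled; nothing is acknowledged.
        vaddr = (input1 + self.state1) & MASK24
        mem = self.table.get(vaddr, 0) & MASK24
        mulout = (mem * input2) & MASK24
        self.output1 = self.nextoutput1
        self.nextoutput1 = (mulout * self.state2) & MASK24
        self.next_data = 0

    def assert_next_data(self):
        self.next_data = 1


def run(dp, samples):
    """Drive the datapath; external logic captures output1 when next_data is 1."""
    captured = []
    for input1, input2 in samples:       # external logic presents a new pair
        dp.lookup_mul_mul(input1, input2)
        dp.assert_next_data()            # the program signals explicitly
        if dp.next_data:
            captured.append(dp.output1)
    return captured


table = {i: i + 1 for i in range(16)}
dp = ImportExportDatapath(table, state1=0, state2=2)
print(run(dp, [(3, 10), (5, 20), (7, 30)]))  # [0, 80, 240]
```

Unlike the queue version, the lookup address here uses the current input1 value directly, since imported wires are available early in the pipeline; only the output side still lags by one execution.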