Think Outside the Bus: High-Speed I/O Alternatives
for Inter-Processor Communications on SOCs
The
choice of hardware-interconnection mechanisms
among processor blocks in an SOC affects communication
performance and silicon cost. Message-passing
software communications have a natural correspondence
to data queues, but message passing can be implemented
using other types of hardware such as bus-based
hardware with global memory. Similarly, the shared-memory
software-communications mode has a natural correspondence
to bus-based hardware, but shared-memory protocols
can be physically implemented even when no globally
accessible physical memory exists. This implementation
flexibility allows chip designers to implement
a spectrum of different task-to-task connections
in ways that optimize performance, power, and
cost together.
This white paper provides short descriptions
of the most common hardware mechanisms—buses,
direct connections, and data queues—used
to interconnect processor cores on SOCs.
Except where explicitly noted, this paper
assumes a one-to-one correspondence between
tasks and processors. In practice, multiple tasks
can be mapped onto one time-sliced processor,
and tasks can also be implemented by non-programmable
hardware blocks.
Note: In many cases, the task-to-task
connection is not made directly, but between
the task and an attached memory. If that
memory can be reached by more than one
task, then communication between the tasks
becomes possible. Memory sharing may be
hidden from each task’s
software developer by a software layer, so
the presence of shared memory in hardware
is not necessarily equivalent to a shared-memory
communication mode.
Processor Buses
A bus is a shared-access
hardware mechanism allowing one or more processors
to communicate with slave memories and input/output
interfaces. In the simplest case, each slave
is accessible only from one bus, so the processor
that owns the bus also owns the slaves. Different
processors must arbitrate for the bus,
but this is the sole arbitration mechanism.
Processors and slaves may have a range
of bus-transfer requirements, based on
hardware limitations (e.g. an 8-bit UART
slave device may not allow any 16- or 32-bit
transfers) or traffic patterns (the processor
may maximize performance with cache-line-sized
block transfers—16 bytes or more).
Moreover, some transfers may be quite sensitive
to latency (the task doesn’t need
much data, but it needs that data immediately)
and others may be more sensitive to bandwidth
(the task must get some average sustained
bandwidth, but the latency of any one transfer
is inconsequential).
Bus design tradeoffs
Bus designs may use
a range of strategies to satisfy conflicting
goals among the processors, memories, and
other devices they connect. Three classes
of design decision stand out:
Bus Width and Clock
Rate: The bus width
and clock rate determine the peak transfer
rate over the bus. These factors affect cost,
power, and technology requirements.
Arbitration: The arbitration mechanism affects
trade-offs between total bus utilization
and the latency seen by any one bus master.
Round-robin arbitration gives all masters
equal access to the bus, but even the most
important requests may face long contention
delay. Round-robin arbitration is fair, in
that all masters have an equal chance to
get bandwidth, and it is efficient, in that
bus cycles are utilized if any master needs
them. Strict-priority arbitration gives the
most critical bus master preferential treatment
all the time so that it sees minimum contention
latency. Reserved-bandwidth arbitration gives
a bus master a minimum guaranteed bandwidth
over a time interval, but the master can
also compete for additional bandwidth on
a round-robin basis. The choice of arbitration
mechanism is driven by the system bandwidth
and latency requirements, but may be constrained
by a pre-defined bus protocol.
Transfer Types: Simple buses may implement
just a few transfer types such as 8-, 16-,
and 32-bit reads and writes. More complex
buses may implement any of a number of more
advanced transfer types:
- Fixed-block transfers: Power-of-two sized
blocks, often used for cache-line refills
and write-backs.
- Variable-block transfers:
Arbitrary-length transfers, often used
to move data in streams with application-dependent
block sizes.
- Split transactions: The decomposition
of a bus request (usually a read) into
two transfers: one to convey an address
from the master to the slave, and a second
to return a response data block from the
slave to the master. The bus is relinquished
to other masters during the interval
between the request and response. Split
transactions are particularly important
for maintaining high bus bandwidth with
long memory device latencies and multiple
bus masters.
- Atomic
transactions: When two or more masters
are competing for access to a shared
resource, some locking mechanism is required
to support arbitration mechanisms. Sometimes
this mechanism is implemented as a bus
lock, in which certain read operations
retain bus mastership after the read data
is returned, so that the processor can
perform a write without risk that another
processor may read the same location. Bus
locking is not efficient, however, in a
system with many processors, many separate
memories, and frequent locking operations.
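The arbitration policies described above can be compared with a small model. The following is a minimal sketch, not a description of any particular bus protocol: the function names, the three-master setup, and the assumption that every master requests the bus on every cycle are all illustrative. Reserved-bandwidth arbitration is not modeled.

```python
def round_robin(requests, last_grant, n_masters):
    """Grant the first requesting master after last_grant, wrapping around."""
    for offset in range(1, n_masters + 1):
        candidate = (last_grant + offset) % n_masters
        if candidate in requests:
            return candidate
    return None  # no master is requesting; the bus idles this cycle


def strict_priority(requests):
    """Grant the lowest-numbered requester (master 0 is assumed most critical)."""
    return min(requests) if requests else None


def simulate(policy, cycles, n_masters=3):
    """Count grants per master when all masters request on every cycle."""
    grants = [0] * n_masters
    last = n_masters - 1              # so that master 0 is considered first
    requests = set(range(n_masters))  # worst case: constant contention
    for _ in range(cycles):
        if policy == "round_robin":
            winner = round_robin(requests, last, n_masters)
        else:
            winner = strict_priority(requests)
        grants[winner] += 1
        last = winner
    return grants


rr_grants = simulate("round_robin", 9)
sp_grants = simulate("strict_priority", 9)
```

Under constant contention, the round-robin model divides the nine cycles evenly among the three masters, while the strict-priority model gives every cycle to master 0 and starves the others. This is the fairness-versus-latency trade-off in its most extreme form.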
Bus implementation
with configurable processors
Configurable
processors offer significant flexibility
in supporting arbitrated access to shared
devices and memory. The basic topologies
for shared memory buses are:
1. Remote global memory accessed over a
general processor bus:
The processor implements a general-purpose
interface that allows a wide variety of bus
transactions. If the processor determines
that the corresponding data is not local
during a read (based on the address or due
to a cache miss), the processor must make
a non-local reference. The processor requests
control of the bus, and when control is granted,
sends the target read address over the bus.
The appropriate device (for example, memory
or input/output interface) decodes that address
and supplies the requested data back over
the bus to the processor, as shown in Figure
1.
Figure 1. Two processors access shared memory
over bus
When two processors are communicating through
global shared memory on the bus, one must
acquire bus control to write the data; the
other processor must later acquire bus control
to read it. Each word transferred in this
fashion requires two bus transactions. This
approach requires modest hardware and maintains
high flexibility, because the global memories
and input/output interfaces are accessible
over a common bus. However, the use of global
memory does not scale well with the number
of processors and devices, because bus traffic
leads to long and unpredictable contention
latency.
2. Local processor memory accessed over
a general processor bus:
Configurable processors may allow local
data memories to participate in general-purpose
bus transactions. These data memories are
primarily used by the processor to which
they are closely coupled. However, the processor
controlling the local data memory can serve
as a bus slave and respond to requests on
the general-purpose bus, as shown in Figure
2.
Figure 2. One processor accesses the local data
memory of a second processor over the bus
In this case, the read by Processor 1 may
require access arbitration at two levels:
first when Processor 1 requests access to
the general-purpose bus, and second when
the read request reaches Processor 2. The
read request from Processor 1 arrives over
Processor 2’s processor interface and
may contend with other requests for local
data-memory access from tasks running on
Processor 2. Two arbitration levels may increase
the access latency seen by Processor 1, but
Processor 2 avoids access latency almost
entirely, because latency to local data memory
is short (usually one or two cycles).
This latency asymmetry between Processor
1 and Processor 2 encourages push communication:
when Processor 1 sends data to Processor
2, it writes the data over the bus into Processor
2’s local data memory. If the write
is buffered, Processor 1 can continue execution
without waiting for the write to complete.
Thus the long latency of data transfer to
Processor 2 is hidden. Processor 2 sees minimal
latency when it reads the data, because the
data is local. Similarly, when Processor
2 wants to send data back to Processor 1,
it writes the data into Processor 1’s
local data memory.
3. Multi-ported local memory accessed over
local bus:
When data flows in both directions between
processors and latency is critical, a locally
shared data memory is often the best choice
for inter-task communications. Each processor
uses its local data memory interface to access
a shared memory, as shown in Figure 3. This
memory could have two physical access ports
(two memory references satisfied each cycle)
or could be controlled by a simple arbiter,
where one processor’s access is held
off for a cycle if the other processor is
using the single physical access port.
Figure 3. Two processors share access to
local data memory
Arbitration for a single port is preferred
in area- and cost-sensitive applications,
especially when shared-memory utilization
is modest, because a true dual-ported memory
is about twice as big per bit when compared
to single-ported RAM. However, a true dual-ported
memory may be the better choice when the
shared memory is very small or when absolute
determinism of access latency is required.
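The cost of arbitrating for a single port can be estimated with a simple cycle count. The sketch below is illustrative only: the function name is hypothetical, and it assumes a fixed-priority arbiter that always favors Processor 1, so only Processor 2 is ever held off. A true dual-ported memory would show zero hold-off cycles for the same access pattern.

```python
def single_port_stalls(p1_cycles, p2_cycles):
    """Count hold-off cycles seen by Processor 2 on a shared single port.

    p1_cycles and p2_cycles list the cycles in which each processor wants
    to access the memory. Assumption: the arbiter always grants Processor 1,
    so Processor 2 waits whenever the port is taken.
    """
    busy = set(p1_cycles)   # cycles in which Processor 1 owns the port
    stalls = 0
    next_free = 0           # earliest cycle Processor 2 may issue again
    for want in sorted(p2_cycles):
        t = max(want, next_free)
        while t in busy:    # port is in use by Processor 1; wait a cycle
            t += 1
        stalls += t - want
        next_free = t + 1
    return stalls


# Two collisions (cycles 2 and 6) cost Processor 2 one cycle each.
modest_sharing = single_port_stalls([0, 2, 4, 6], [1, 2, 5, 6])
# No overlapping accesses, so no hold-off at all.
no_overlap = single_port_stalls([0, 2], [1, 3])
```

When utilization is modest, as in the first pattern, the hold-off cost is small relative to the area saved by avoiding a dual-ported memory.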
Direct Connect Ports
Direct processor-to-processor
connections reduce cost and latency for communication.
They allow data to move directly from one
processor’s registers to the registers
and execution units of another processor, in the manner of a GPIO port.
A simple example of direct connection is
shown in Figure 4. This example takes advantage
of exportation of state registers and importation
of wire values (features found in some
extensible processors) to create an additional
dedicated interface within each processor
and to directly connect them.
Whenever Processor 1 writes a value
to the output register, usually as part of
some computation, that value automatically
appears on the output pins of the processor.
That same value is immediately available
as an input value to operations in Processor
2. Wire connections can be arbitrarily wide,
allowing large and non-power-of-two-sized
operands to be transferred easily and quickly.
Figure 4. Direct processor-to-processor
ports
Note: Tensilica’s Xtensa LX2 processor
allows you to create registers with exported
state, operations that write these states,
and other operations that use these new input
values from other exported states.
The operation that produces the data for
the output state register may be as simple
as a register-to-register transfer or it
may be a complex logic function based on
many other processor state values. Similarly,
the input value can simply be transferred
to another processor state within Processor
2 (register or memory), or it could be used
as one input to a complex logic function.
This form of direct connection still requires
some handshake between the two processors.
The consumer of data may need to signal to
the producer that the data in the register
has been used, so that the producer can write
the next data value. The producer may need
to signal the consumer that new data is available.
This signaling can be done in several ways,
including:
Consumer-to-producer
port: An architect
can make two additional port connections,
each just one bit wide, one from consumer
processor back to the producer processor,
and one from producer to consumer. The consumer
asserts its “acknowledge” output
when the data has been used. The producer
uses this signal as part of the decision
in the code to generate the next output value.
The producer asserts its “data-ready” handshake
output when the next data value is available.
The consumer should negate its “acknowledge” signal,
in preparation for the next assertion when
the next data word has been processed. The
handshake is shown in the timing diagram
in Figure 5. Because this transaction requires
at least one full instruction execution per
signal transition, this method consumes at
least a dozen cycles per data word transferred.
Figure 5. Two wire handshake
A variant of data queues creates producer-consumer
handshake signals automatically. The “data
ready” signal is equivalent to the
push into the tail of a queue, and the “acknowledge” signal
is equivalent to the pop from the head of
a queue. A flag bit, set by “data ready” and
cleared by “acknowledge,” coordinates
the two tasks.
Interrupt-driven handshake: The data transfer
can also be controlled by interrupts between
the two processors. When the producer processor
has created the data and placed it on its
output port, it also asserts a signal on
an output wire connected to an interrupt
input of the consumer processor. The consumer
processor handles the interrupt as soon as
it can (after any higher priority interrupts
are handled), and accepts the data from the
input port within the interrupt handler.
The consumer’s interrupt handler then
asserts its own output signal, which is connected
to an interrupt input on the producer processor.
The producer interrupt handler can then drive
new data to the consumer. The basic structure
of the interrupt-driven handshake is shown
in Figure 6.
Figure 6. Interrupt-driven handshake
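The two-wire software handshake described above can be modeled as a pair of small state machines. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, each step stands for one instruction that may change one handshake wire, and the producer is assumed to negate “data ready” once it sees “acknowledge” (a conventional four-phase handshake, which the text above implies but does not spell out).

```python
class Wires:
    """Direct-connect wires between the two processors (names are illustrative)."""
    def __init__(self):
        self.data = None
        self.data_ready = False    # driven by the producer
        self.acknowledge = False   # driven by the consumer


def producer_step(w, pending):
    """One producer instruction; returns True if it toggled data_ready."""
    if not w.data_ready and not w.acknowledge and pending:
        w.data = pending.pop(0)    # drive the next value onto the output port
        w.data_ready = True
        return True
    if w.data_ready and w.acknowledge:
        w.data_ready = False       # the consumer has used the value
        return True
    return False


def consumer_step(w, received):
    """One consumer instruction; returns True if it toggled acknowledge."""
    if w.data_ready and not w.acknowledge:
        received.append(w.data)    # sample the input port
        w.acknowledge = True
        return True
    if not w.data_ready and w.acknowledge:
        w.acknowledge = False      # re-arm for the next data word
        return True
    return False


def transfer(values):
    """Run both sides until every value has been handed over."""
    w, pending, received = Wires(), list(values), []
    transitions = 0
    while pending or w.data_ready or w.acknowledge:
        if producer_step(w, pending):
            transitions += 1
        if consumer_step(w, received):
            transitions += 1
    return received, transitions


received, transitions = transfer([10, 20, 30])
```

In this model each data word costs four signal transitions, two per processor, which is why a software handshake spends several instructions on every word transferred.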
Data Queues
The highest-bandwidth mechanism
for task-to-task communication is a hardware
implementation of data queues (FIFOs). One data queue
can sustain data rates as high as one transfer
every cycle, or more than 10 Gbytes per second
for wide operands (tens of bytes per operand
at a clock rate of hundreds of MHz), because
queue widths need not be tied to a processor’s
bus width or general-register width. The
handshake between producer and consumer
is implicit in the interfaces between the
processors and the queue’s head and
tail.
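The bandwidth figure quoted above follows from simple arithmetic. As a rough illustration, the sketch below uses assumed example values (a 32-byte queue and a 400 MHz clock) that are chosen only to fall within the "tens of bytes" and "hundreds of MHz" ranges; they are not taken from any specific design.

```python
def peak_queue_bandwidth(width_bytes, clock_hz, transfers_per_cycle=1):
    """Peak bytes per second for a queue moving one operand per transfer."""
    return width_bytes * clock_hz * transfers_per_cycle


# Assumed example: a 32-byte (256-bit) queue clocked at 400 MHz.
wide = peak_queue_bandwidth(32, 400_000_000)
# For comparison: a 4-byte (32-bit) path at the same clock rate.
narrow = peak_queue_bandwidth(4, 400_000_000)

print(f"wide queue:  {wide / 1e9:.1f} Gbytes/s")
print(f"narrow path: {narrow / 1e9:.1f} Gbytes/s")
```

With these assumed values the wide queue peaks at 12.8 Gbytes per second, eight times the 1.6 Gbytes per second of the 32-bit path, purely because of the wider operand.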
When the data producer has created the data,
it pushes it into the tail of the queue,
assuming the queue is not full. If the queue
is full, the producer stalls. When the data
consumer is ready for new data, it pops it
from the head of the queue, assuming the
queue is not empty. If the queue is empty,
the consumer stalls.
Queues can also be configured to provide
non-blocking push and pop operations, where
the producer can explicitly check for a full
queue before attempting a push and the consumer
can explicitly check for an empty queue before
attempting a pop. This mechanism allows the
producer or consumer task to move to other
work in lieu of stalling.
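The non-blocking interface can be sketched as a bounded queue whose push and pop report failure rather than stalling. The class and method names below are hypothetical and stand in for whatever status checks a given processor exposes; a blocking interface would stall the processor wherever this model returns a failure indication.

```python
from collections import deque


class HardwareQueue:
    """Software model of a fixed-depth hardware queue."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.depth

    def is_empty(self):
        return not self.entries

    def try_push(self, value):
        """Non-blocking push: returns False instead of stalling when full."""
        if self.is_full():
            return False
        self.entries.append(value)   # new entries enter at the tail
        return True

    def try_pop(self):
        """Non-blocking pop: returns (ok, value); ok is False when empty."""
        if self.is_empty():
            return False, None
        return True, self.entries.popleft()   # oldest entry leaves at the head


q = HardwareQueue(depth=2)
pushed = [q.try_push(v) for v in (1, 2, 3)]   # third push finds the queue full
first = q.try_pop()
```

A producer that sees a failed push can switch to other work and retry later, which is the behavior the non-blocking configuration is meant to enable.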
Application-specific processors allow direct
implementation of queues as part of their
instruction-set extensions. An instruction
can specify a queue as one of the destinations
for result values or use an incoming queue
value as one source. This form of queue interface,
shown in Figure 7, allows a new data value
to be created or used each cycle on each
queue interface. A complex processor extension
could perform multiple queue operations per
cycle, perhaps combining inputs from two
input queues with local data and sending
values to two output queues. The high aggregate
bandwidth and low control overhead of queues
allow application-specific processors to
serve applications with very high data
rates that processors with conventional
bus or memory interfaces cannot sustain.

Figure 7. Hardware data queue mechanism
Queues decouple the performance of one task
from another. If the rates of data production
and data consumption are quite uniform, the
queue can be shallow. If either production
or consumption rates are highly variable,
a deep queue can mask this mismatch and ensure
throughput at the average rate of producer
and consumer, rather than at the minimum
rate of the producer or the minimum rate
of the consumer. Sizing the queues is an
important optimization driven by good system-level
simulation. If the queue is too shallow,
the processor at one end of the communication
channel may stall when the other processor
slows for some reason. If the queue is too
deep, the silicon cost will be excessive.
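A toy cycle-level model shows how queue depth affects throughput when the producer is bursty. This sketch is far simpler than a real system-level simulation, and its traffic patterns are assumptions chosen for illustration: the producer can push only during the first four cycles of every eight, the consumer can pop only on even cycles, and a side that cannot proceed simply loses that cycle. Both sides average one item every two cycles.

```python
def run(depth, cycles, producer_ready, consumer_ready):
    """Return the number of items delivered through a queue of given depth.

    producer_ready(t) and consumer_ready(t) say whether each side is able
    to push or pop in cycle t. A pop is served before a push in the same
    cycle, so a slot freed by the consumer can be refilled immediately.
    """
    occupancy = 0
    delivered = 0
    for t in range(cycles):
        if consumer_ready(t) and occupancy > 0:
            occupancy -= 1
            delivered += 1
        if producer_ready(t) and occupancy < depth:
            occupancy += 1
        # Otherwise the ready side stalls: queue full (producer) or empty (consumer).
    return delivered


def bursty_producer(t):
    return t % 8 < 4      # four busy cycles, then four idle cycles


def steady_consumer(t):
    return t % 2 == 0     # one pop opportunity every other cycle


delivered = {d: run(d, 80, bursty_producer, steady_consumer) for d in (1, 2, 3, 4)}
```

Over 80 cycles the model delivers 20 items at depth 1, 30 at depth 2, and 39 at depths 3 and 4, against a ceiling of 40 set by the average rates. Beyond depth 3 the extra entries add silicon cost without adding throughput for this traffic pattern.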
One processor can employ queue communications
with multiple partners. When the queue operations
are directly incorporated into the instruction
set, the code sequence entirely determines
which queue is written or read. Sometimes,
less direct mapping is desirable, so the
code sequence that produces or consumes data
can be separated from the selection of the
source or destination queue.
Two methods for flexible queue selection
are possible. First, the ultimate data destination
can be included in the data transfer. This
destination information is pushed into a
common queue with the data. This queue feeds
other queues, where simple logic pops the
destination identifier and uses it to choose
the correct destination-specific queue in
which to push the corresponding data. Full
flexibility of queue width makes this approach
economical. For example, a 2-bit destination
specifier and a 32-bit data word would be
combined in a 34-bit common queue, perhaps
feeding a set of four 32-bit queues, as shown
in Figure 8.
Figure 8. Producer enqueues destination
with data
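The 34-bit common-queue example can be sketched in a few lines. The placement of the destination specifier in the two most significant bits is an assumption made for this illustration, as are the function names; the text above specifies only the field widths.

```python
from collections import deque

DEST_BITS = 2
DATA_BITS = 32
DATA_MASK = (1 << DATA_BITS) - 1


def pack(dest, data):
    """Form a 34-bit common-queue entry (destination assumed in the top 2 bits)."""
    assert 0 <= dest < (1 << DEST_BITS)
    return (dest << DATA_BITS) | (data & DATA_MASK)


def route(common, outputs):
    """Pop each entry from the common queue and push its 32-bit data word
    into the destination-specific queue selected by the specifier."""
    while common:
        entry = common.popleft()
        dest = entry >> DATA_BITS
        outputs[dest].append(entry & DATA_MASK)


common = deque([pack(2, 0xDEADBEEF), pack(0, 7), pack(2, 1)])
outputs = [deque() for _ in range(1 << DEST_BITS)]
route(common, outputs)
```

After routing, queue 2 holds its two words in the order they were produced and queue 0 holds one word, so ordering is preserved per destination even though all traffic shares one producer-side queue.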
Second, the queue head and tail can be mapped
into memory, so that a processor store is
used to push a value and a processor load
is used to pop a value. These operations
can be blocking (producing a stall if the
queue is full or empty) or non-blocking (processor
may test the state of the queue before attempting
the push or pop). Figure 9 shows a simple
system with one producer and two consumers.
The queues are mapped into the address spaces
of the processors (here shown using the local-memory
space with a 1-cycle access time), so that
any store to the address of the queue tail
causes a push and any load from the address of
the queue head causes a pop.
Figure 9. One producer serves two consumers
through memory-mapped queues
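Memory-mapped queue access can be modeled as an address decoder that converts stores into pushes and loads into pops. The addresses, class name, and method names below are hypothetical, and the model uses one address per queue for both its tail and its head for brevity. Where a blocking hardware interface would stall the processor, this sketch raises an exception so that the condition is visible.

```python
from collections import deque

QUEUE_TO_CONSUMER_1 = 0x3FFF0000   # hypothetical local-memory addresses
QUEUE_TO_CONSUMER_2 = 0x3FFF0004


class QueueMap:
    """Decodes loads and stores that target memory-mapped queues."""

    def __init__(self, addresses, depth):
        self.depth = depth
        self.queues = {addr: deque() for addr in addresses}

    def can_store(self, addr):
        """Non-blocking status check: is there room at this queue's tail?"""
        return len(self.queues[addr]) < self.depth

    def can_load(self, addr):
        """Non-blocking status check: is there data at this queue's head?"""
        return len(self.queues[addr]) > 0

    def store(self, addr, value):
        """A store to the tail address pushes the value."""
        if not self.can_store(addr):
            raise RuntimeError("queue full: a blocking store would stall here")
        self.queues[addr].append(value)

    def load(self, addr):
        """A load from the head address pops the oldest value."""
        if not self.can_load(addr):
            raise RuntimeError("queue empty: a blocking load would stall here")
        return self.queues[addr].popleft()


qmap = QueueMap([QUEUE_TO_CONSUMER_1, QUEUE_TO_CONSUMER_2], depth=4)
qmap.store(QUEUE_TO_CONSUMER_1, 11)    # producer selects a consumer by address
qmap.store(QUEUE_TO_CONSUMER_2, 22)
qmap.store(QUEUE_TO_CONSUMER_1, 33)
from_q1 = [qmap.load(QUEUE_TO_CONSUMER_1), qmap.load(QUEUE_TO_CONSUMER_1)]
from_q2 = qmap.load(QUEUE_TO_CONSUMER_2)
```

Because the destination is chosen by the store address rather than by the instruction opcode, the same producer code can serve either consumer by changing a pointer.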
When the data rate is relatively low, the
queue depth can be reduced, even to a single
entry—a register that is written by
the producer and read by the consumer. This
mailbox register serves as a simple and convenient
path between producer and consumer. A memory-mapped
set of mailbox registers is shown in Figure
10. When the two tasks pass data back and
forth, the same register can be used for
transfers in either direction.
Figure 10. Memory-mapped mailbox registers
Memory-mapped and instruction-mapped queues
serve a wide range of processor communication
uses. They work especially well at high data
rates with relatively shallow buffering.
At lower data rates, buses provide ample
communications bandwidth. For applications
with very deep buffering requirements, queues
must be implemented in RAM or replaced with
a shared-memory communication mechanism.