
Think Outside the Bus: High-Speed I/O Alternatives for Inter-Processor Communications on SOCs

The choice of hardware-interconnection mechanisms among processor blocks in an SOC affects communication performance and silicon cost. Message-passing software communications have a natural correspondence to data queues, but message passing can be implemented using other types of hardware such as bus-based hardware with global memory. Similarly, the shared-memory software-communications mode has a natural correspondence to bus-based hardware, but shared-memory protocols can be physically implemented even when no globally accessible physical memory exists. This implementation flexibility allows chip designers to implement a spectrum of different task-to-task connections in ways that optimize performance, power, and cost together.

This white paper briefly describes the most common hardware mechanisms (buses, direct connections, and data queues) used to interconnect processor cores on SOCs. Except where explicitly noted, this paper assumes a one-to-one correspondence between tasks and processors. In practice, multiple tasks can be mapped onto one time-sliced processor, and tasks can also be implemented by non-programmable hardware blocks.

Note: In many cases, the task-to-task connection is not made directly, but between the task and an attached memory. If that memory can be reached by more than one task, then communication between the tasks becomes possible. Memory sharing may be hidden from each task’s software developer by a software layer, so the presence of shared memory in hardware is equivalent to a shared-memory communication mode.

Processor Buses

A bus is a shared-access hardware mechanism allowing one or more processors to communicate with slave memories and input/output interfaces. In the simplest case, each slave is accessible only from one bus, so the processor that owns the bus also owns the slaves. Different processors must arbitrate for the bus, but this is the sole arbitration mechanism. Processors and slaves may have a range of bus-transfer requirements, based on hardware limitations (e.g. an 8-bit UART slave device may not allow any 16- or 32-bit transfers) or traffic patterns (the processor may maximize performance with cache-line-sized block transfers—16 bytes or more). Moreover, some transfers may be quite sensitive to latency (the task doesn’t need much data, but it needs that data immediately) and others may be more sensitive to bandwidth (the task must get some average sustained bandwidth, but the latency of any one transfer is inconsequential).

Bus design tradeoffs

Bus designs may use a range of strategies to satisfy conflicting goals among the processors, memories, and other devices they connect. Three classes of design decision stand out:

Bus Width and Clock Rate: The bus width and clock rate determine the peak transfer rate over the bus. These factors affect cost, power, and technology requirements.

Arbitration: The arbitration mechanism affects trade-offs between total bus utilization and the latency seen by any one bus master. Round-robin arbitration gives all masters equal access to the bus, but even the most important requests may face long contention delay. Round-robin arbitration is fair, in that all masters have an equal chance to get bandwidth, and it is efficient, in that bus cycles are utilized if any master needs them. Strict-priority arbitration gives the most critical bus master preferential treatment all the time so that it sees minimum contention latency. Reserved-bandwidth arbitration gives a bus master a minimum guaranteed bandwidth over a time interval, but the master can also compete for additional bandwidth on a round-robin basis. The choice of arbitration mechanism is driven by the system bandwidth and latency requirements, but may be constrained by a pre-defined bus protocol.

Transfer Types: Simple buses may implement just a few transfer types such as 8-, 16-, and 32-bit reads and writes. More complex buses may implement any of a number of more advanced transfer types:

  • Fixed-block transfers: Power-of-two sized blocks, often used for cache-line refills and write-backs.
  • Variable-block transfers: Arbitrary-length transfers, often used to move data in streams with application-dependent block sizes.
  • Split transactions: The decomposition of a bus request (usually a read) into two transfers: one to convey an address from the master to the slave, and a second to return a response data block from the slave to the master. The bus is relinquished to other masters during the interval between the request and response. Split transactions are particularly important for maintaining high bus bandwidth with long memory device latencies and multiple bus masters.
  • Atomic transactions: When two or more masters are competing for access to a shared resource, some locking mechanism is required to support arbitration mechanisms. Sometimes this mechanism is implemented as a bus lock, in which certain read operations retain bus mastership after the read data is returned, so that the processor can perform a write without risk that another processor may read the same location. Bus locking is not efficient, however, in a system with many processors, many separate memories, and frequent locking operations.
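
The difference between round-robin and strict-priority arbitration can be illustrated with a small model. This is only a sketch under simple assumptions: three bus masters that all request the bus on every cycle, with master 0 treated as the most critical. The function names are invented for this illustration and do not correspond to any particular bus protocol.

```python
def round_robin(requests, last_granted, n_masters):
    """Grant the next requesting master after last_granted, wrapping around."""
    for offset in range(1, n_masters + 1):
        candidate = (last_granted + offset) % n_masters
        if candidate in requests:
            return candidate
    return None  # no master is requesting, so the bus is idle


def strict_priority(requests):
    """Grant the lowest-numbered (assumed highest-priority) requesting master."""
    return min(requests) if requests else None


def simulate(policy, n_masters=3, cycles=9):
    """Assume every master requests on every cycle; return the grant sequence."""
    grants = []
    last = n_masters - 1  # so that master 0 is considered first
    requests = set(range(n_masters))
    for _ in range(cycles):
        if policy == "round_robin":
            grant = round_robin(requests, last, n_masters)
            last = grant
        else:
            grant = strict_priority(requests)
        grants.append(grant)
    return grants


rr_grants = simulate("round_robin")
sp_grants = simulate("strict_priority")
```

Under this saturated load, the round-robin grants rotate through all three masters, while strict priority grants every cycle to master 0 and the other masters wait indefinitely. That is the trade-off between fairness and minimum contention latency described above.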

Bus implementation with configurable processors

Configurable processors offer significant flexibility in supporting arbitrated access to shared devices and memory. The basic topologies for shared memory buses are:

1. Remote global memory accessed over a general processor bus:

The processor implements a general-purpose interface that allows a wide variety of bus transactions. If the processor determines that the corresponding data is not local during a read (based on the address or due to a cache miss), the processor must make a non-local reference. The processor requests control of the bus, and when control is granted, sends the target read address over the bus. The appropriate device (for example, memory or input/output interface) decodes that address and supplies the requested data back over the bus to the processor, as shown in Figure 1.

Figure 1. Two processors access shared memory over bus

When two processors are communicating through global shared memory on the bus, one must acquire bus control to write the data; the other processor must later acquire bus control to read it. Each word transferred in this fashion requires two bus transactions. This approach requires modest hardware and maintains high flexibility, because the global memories and input/output interfaces are accessible over a common bus. However, the use of global memory does not scale well with the number of processors and devices, because bus traffic leads to long and unpredictable contention latency.

2. Local processor memory accessed over a general processor bus:

Configurable processors may allow local data memories to participate in general-purpose bus transactions. These data memories are primarily used by the processor to which they are closely coupled. However, the processor controlling the local data memory can serve as a bus slave and respond to requests on the general-purpose bus, as shown in Figure 2.

Figure 2. One processor accesses the local data memory of a second processor over the bus

In this case, the read by Processor 1 may require access arbitration at two levels: first when Processor 1 requests access to the general-purpose bus, and second when the read request reaches Processor 2. The read request from Processor 1 arrives over Processor 2’s processor interface and may contend with other requests for local data-memory access from tasks running on Processor 2. Two arbitration levels may increase the access latency seen by Processor 1 but Processor 2 avoids access latency almost entirely, because latency to local data memory is short (usually one or two cycles).

This latency asymmetry between Processor 1 and Processor 2 encourages push communication: when Processor 1 sends data to Processor 2, it writes the data over the bus into Processor 2’s local data memory. If the write is buffered, Processor 1 can continue execution without waiting for the write to complete. Thus the long latency of data transfer to Processor 2 is hidden. Processor 2 sees minimal latency when it reads the data, because the data is local. Similarly, when Processor 2 wants to send data back to Processor 1, it writes the data into Processor 1’s local data memory.
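
The benefit of push communication can be made concrete with a rough stall-cycle count. The latencies below are hypothetical placeholders chosen for illustration (the text only states that local data memory takes one or two cycles), and the model assumes that issuing a buffered write costs the producer a single cycle.

```python
BUS_LATENCY = 10    # hypothetical cycles for one transaction across the shared bus
LOCAL_LATENCY = 1   # local data-memory access; the text cites one or two cycles


def pull_cost(words):
    """Producer writes locally; consumer reads each word over the bus.

    Returns (producer_cycles, consumer_cycles).
    """
    return words * LOCAL_LATENCY, words * BUS_LATENCY


def push_cost(words, buffered=True):
    """Producer writes each word over the bus into the consumer's local memory.

    With a write buffer the producer is assumed not to wait for completion.
    Returns (producer_cycles, consumer_cycles).
    """
    producer = words * (LOCAL_LATENCY if buffered else BUS_LATENCY)
    consumer = words * LOCAL_LATENCY
    return producer, consumer


pull = pull_cost(4)
push = push_cost(4)
push_unbuffered = push_cost(4, buffered=False)
```

With these assumed numbers, pulling four words costs the consumer 40 cycles of bus latency, while a buffered push leaves both sides with only local-access cost. Without write buffering, the bus latency moves to the producer instead of disappearing.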

3. Multi-ported local memory accessed over local bus:

When data flows in both directions between processors and latency is critical, a locally shared data memory is often the best choice for inter-task communications. Each processor uses its local data memory interface to access a shared memory, as shown in Figure 3. This memory could have two physical access ports (two memory references satisfied each cycle) or could be controlled by a simple arbiter, where one processor’s access is held off for a cycle if the other processor is using the single physical access port.

Figure 3. Two processors share access to local data memory

Arbitration for a single port is preferred in area- and cost-sensitive applications, especially when shared-memory utilization is modest, because a true dual-ported memory is about twice as big per bit when compared to single-ported RAM. However, a true dual-ported memory may be the better choice when the shared memory is very small or when absolute determinism of access latency is required.

Direct Connect Ports

Direct processor-to-processor connections reduce cost and latency for communication. They allow data to move directly from one processor’s registers to the registers and execution units of another processor, much like a GPIO connection. A simple example of direct connection is shown in Figure 4. This example takes advantage of exportation of state registers and importation of wire values (features found in some extensible processors) to create an additional dedicated interface within each processor and to directly connect them.

Whenever Processor 1 writes a value to the output register, usually as part of some computation, that value automatically appears on the output pins of the processor. That same value is immediately available as an input value to operations in Processor 2. Wire connections can be arbitrarily wide, allowing large and non-power-of-two-sized operands to be transferred easily and quickly.

Figure 4. Direct processor-to-processor ports

Note: Tensilica’s Xtensa LX2 processor allows you to create registers with exported state, operations that write these states, and other operations that use these new input values from other exported states.

The operation that produces the data for the output state register may be as simple as a register-to-register transfer or it may be a complex logic function based on many other processor state values. Similarly, the input value can simply be transferred to another processor state within Processor 2 (register or memory), or it could be used as one input to a complex logic function.

This form of direct connection still requires some handshake between the two processors. The consumer of data may need to signal to the producer that the data in the register has been used, so that the producer can write the next data value. The producer may need to signal the consumer that new data is available. This signaling can be done in several ways, including:

Consumer-to-producer port: An architect can make two additional port connections, each just one bit wide: one from the consumer processor back to the producer processor, and one from the producer to the consumer. The producer asserts its “data-ready” handshake output when the next data value is available. The consumer asserts its “acknowledge” output when the data has been used. The producer uses this signal as part of the decision in the code to generate the next output value. The consumer should then negate its “acknowledge” signal, in preparation for the next assertion when the next data word has been processed. The handshake is shown in the timing diagram in Figure 5. Because this transaction requires at least one full instruction execution per signal transition, this method consumes at least a dozen cycles per data word transferred.

Figure 5. Two wire handshake

A variant of data queues creates producer-consumer handshake signals automatically. The “data ready” signal is equivalent to the push into the tail of a queue, and the “acknowledge” signal is equivalent to the pop from the head of a queue. A flag bit, set by “data ready” and cleared by “acknowledge” coordinates the two tasks.

Interrupt-driven handshake: The data transfer can also be controlled by interrupts between the two processors. When the producer processor has created the data and placed it on its output port, it also asserts a signal on an output wire connected to an interrupt input of the consumer processor. The consumer processor handles the interrupt as soon as it can (after any higher priority interrupts are handled), and accepts the data from the input port within the interrupt handler. The consumer’s interrupt handler then asserts its own output signal, which is connected to an interrupt input on the producer processor. The producer interrupt handler can then drive new data to the consumer. The basic structure of the interrupt-driven handshake is shown in Figure 6.

Figure 6. Interrupt-driven handshake
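
The two-wire handshake described above can be sketched as a pair of small state machines. This is a simplified software model, not processor code: the class and function names are invented for illustration, and each call to a step function stands in for the instructions a processor would execute to test one wire and drive another.

```python
class Wires:
    """The exported data register plus the two one-bit handshake wires."""

    def __init__(self):
        self.data = None
        self.data_ready = False    # driven by the producer
        self.acknowledge = False   # driven by the consumer


def producer_step(w, pending):
    if not w.data_ready and not w.acknowledge and pending:
        w.data = pending.pop(0)    # drive the next word onto the output port
        w.data_ready = True
    elif w.data_ready and w.acknowledge:
        w.data_ready = False       # the consumer has used the word


def consumer_step(w, received):
    if w.data_ready and not w.acknowledge:
        received.append(w.data)    # use the word on the input port
        w.acknowledge = True
    elif w.acknowledge and not w.data_ready:
        w.acknowledge = False      # re-arm for the next word


def transfer(words):
    """Run both sides in lockstep until every word has been handed over."""
    w, pending, received, steps = Wires(), list(words), [], 0
    while len(received) < len(words) or w.data_ready or w.acknowledge:
        producer_step(w, pending)
        consumer_step(w, received)
        steps += 1
    return received, steps


received, steps = transfer([7, 8, 9])
```

Each word needs four signal transitions (assert and negate on both wires) before the next word can be driven. This is why the handshake cost per word is much higher than the cost of the data transfer itself.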

Data Queues

The highest-bandwidth mechanism for task-to-task communication is hardware implementation of data queues, which are like FIFOs. One data queue can sustain data rates as high as one transfer every cycle, or more than 10 Gbytes per second for wide operands (tens of bytes per operand at a clock rate of hundreds of MHz), because queue widths need not be tied to a processor’s bus width or general-register width. The handshake between producer and consumer is implicit in the interfaces between the processors and the queue’s head and tail.
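
The bandwidth figure follows from multiplying operand width by clock rate. As a worked example, the sketch below assumes a 32-byte operand and a 350 MHz clock; both values are illustrative choices within the ranges the text gives, not figures for a specific processor.

```python
def peak_queue_bandwidth(width_bytes, clock_hz):
    """Peak bytes per second for one queue moving one operand per cycle."""
    return width_bytes * clock_hz


# Assumed example values: 32-byte operands at 350 MHz.
wide_queue = peak_queue_bandwidth(32, 350_000_000)
# For comparison, a 32-bit (4-byte) data path at the same clock rate.
narrow_path = peak_queue_bandwidth(4, 350_000_000)
```

Under these assumptions the wide queue peaks at 11.2 Gbytes per second, eight times the 1.4 Gbytes per second of the 4-byte path.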

When the data producer has created the data, it pushes it into the tail of the queue, assuming the queue is not full. If the queue is full, the producer stalls. When the data consumer is ready for new data, it pops it from the head of the queue, assuming the queue is not empty. If the queue is empty, the consumer stalls.

Queues can also be configured to provide non-blocking push and pop operations, where the producer can explicitly check for a full queue before attempting a push and the consumer can explicitly check for an empty queue before attempting a pop. This mechanism allows the producer or consumer task to move to other work in lieu of stalling.
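
A software model of the non-blocking interface might look like the following. The class and method names are hypothetical. In hardware, the full and empty conditions would be status signals tested by the processor, and a blocking operation would stall the pipeline rather than return a flag.

```python
from collections import deque


class HardwareQueue:
    """Fixed-depth FIFO with non-blocking push and pop."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    def full(self):
        return len(self.entries) == self.depth

    def empty(self):
        return not self.entries

    def try_push(self, value):
        """Push at the tail; return False where a blocking push would stall."""
        if self.full():
            return False
        self.entries.append(value)
        return True

    def try_pop(self):
        """Pop from the head; return (False, None) where a blocking pop would stall."""
        if self.empty():
            return False, None
        return True, self.entries.popleft()


q = HardwareQueue(depth=2)
push_results = [q.try_push(v) for v in (10, 20, 30)]  # third push finds the queue full
ok, head = q.try_pop()
```

When `try_push` or `try_pop` reports failure, the calling task can do other work and retry later instead of stalling.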

Application-specific processors allow direct implementation of queues as part of their instruction-set extensions. An instruction can specify a queue as one of the destinations for result values or use an incoming queue value as one source. This form of queue interface, shown in Figure 7, allows a new data value to be created or used each cycle on each queue interface. A complex processor extension could perform multiple queue operations per cycle, perhaps combining inputs from two input queues with local data and sending values to two output queues. The high aggregate bandwidth and low control overhead of queues allow application-specific processors to be used for applications with very high data rates that processors with conventional bus or memory interfaces cannot handle.


Figure 7. Hardware data queue mechanism

Queues decouple the performance of one task from another. If the rates of data production and data consumption are quite uniform, the queue can be shallow. If either production or consumption rates are highly variable, a deep queue can mask this mismatch and ensure throughput at the average rate of producer and consumer, rather than at the minimum rate of the producer or the minimum rate of the consumer. Sizing the queues is an important optimization driven by good system-level simulation. If the queue is too shallow, the processor at one end of the communication channel may stall when the other processor slows for some reason. If the queue is too deep, the silicon cost will be excessive.
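
A toy simulation shows how queue depth affects throughput when the two sides have the same average rate but different patterns. The traffic patterns here are assumptions made up for illustration: the producer is bursty (ready for four cycles out of every eight) and the consumer is steady (ready every second cycle), so both average one transfer per two cycles. A real sizing study would use system-level simulation with measured traffic.

```python
def transfers_completed(depth, cycles=80):
    """Count words popped by the consumer over the given number of cycles."""
    occupancy = 0
    completed = 0
    for cycle in range(cycles):
        consumer_ready = cycle % 2 == 0   # steady: every second cycle
        producer_ready = cycle % 8 < 4    # bursty: four cycles on, four off
        # Pop first, so a slot freed this cycle can be refilled this cycle.
        if consumer_ready and occupancy > 0:
            occupancy -= 1
            completed += 1
        # A push attempted while the queue is full is a lost (stalled) cycle.
        if producer_ready and occupancy < depth:
            occupancy += 1
    return completed


results = {depth: transfers_completed(depth) for depth in (1, 2, 4)}
```

In this model a single-entry queue completes 20 transfers in 80 cycles, a two-entry queue completes 30, and a four-entry queue completes 39, close to the 40 that the average rate allows. Extra depth beyond what absorbs the burst adds cost without adding throughput.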

One processor can employ queue communications with multiple partners. When the queue operations are directly incorporated into the instruction set, the code sequence entirely determines which queue is written or read. Sometimes, less direct mapping is desirable, so the code sequence that produces or consumes data can be separated from the selection of the source or destination queue.

Two methods for flexible queue selection are possible. First, the ultimate data destination can be included in the data transfer. This destination information is pushed into a common queue with the data. This queue feeds other queues, where simple logic pops the destination identifier and uses it to choose the correct destination-specific queue in which to push the corresponding data. Full flexibility of queue width makes this approach economical. For example, a 2-bit destination specifier and a 32-bit data word would be combined in a 34-bit common queue, perhaps feeding a set of four 32-bit queues, as shown in Figure 8.

Figure 8. Producer enqueues destination with data
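
The 34-bit example can be sketched as follows. The bit layout (destination in the two most significant bits) and all names are assumptions for illustration; the text does not specify where the destination field sits within the common queue entry.

```python
from collections import deque

DEST_BITS = 2
DATA_BITS = 32
DATA_MASK = (1 << DATA_BITS) - 1


def pack(dest, data):
    """Combine a 2-bit destination and a 32-bit word into one 34-bit entry."""
    if not 0 <= dest < (1 << DEST_BITS):
        raise ValueError("destination does not fit in 2 bits")
    return (dest << DATA_BITS) | (data & DATA_MASK)


def unpack(entry):
    """Split a 34-bit entry back into (destination, data)."""
    return entry >> DATA_BITS, entry & DATA_MASK


def route(common_queue, dest_queues):
    """Pop each entry from the common queue and push its data to the chosen queue."""
    while common_queue:
        dest, data = unpack(common_queue.popleft())
        dest_queues[dest].append(data)


common = deque(pack(d, v) for d, v in [(0, 0xAAAA), (3, 0xBBBB), (0, 0xCCCC)])
outputs = [deque() for _ in range(1 << DEST_BITS)]
route(common, outputs)
```

After routing, the data words carry no destination bits, so the four destination queues need only be 32 bits wide.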

Second, the queue head and tail can be mapped into memory, so that a processor store is used to push a value and a processor load is used to pop a value. These operations can be blocking (producing a stall if the queue is full or empty) or non-blocking (the processor may test the state of the queue before attempting the push or pop). Figure 9 shows a simple system with one producer and two consumers. The queues are mapped into the address spaces of the processors (here shown using the local-memory space with a 1-cycle access time), so that any store to the address of the queue tail causes a push and any load from the address of the queue head causes a pop.

Figure 9. One producer serves two consumers through memory-mapped queues
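
The address decoding behind memory-mapped queues can be modeled in a few lines. The addresses and class name are hypothetical, and the model covers only the one-producer, two-consumer arrangement of Figure 9. It also omits stalling: a load from an empty queue raises an error here, where blocking hardware would stall the consumer.

```python
from collections import deque

QUEUE_A_ADDR = 0x1000  # hypothetical address of queue A in local-memory space
QUEUE_B_ADDR = 0x1004  # hypothetical address of queue B


class LocalMemorySpace:
    """One processor's local address space with some addresses mapped to queues."""

    def __init__(self, queues):
        self.ram = {}
        self.queues = queues  # maps address -> shared deque

    def store(self, addr, value):
        if addr in self.queues:
            self.queues[addr].append(value)     # store to the tail address: push
        else:
            self.ram[addr] = value              # ordinary memory write

    def load(self, addr):
        if addr in self.queues:
            return self.queues[addr].popleft()  # load from the head address: pop
        return self.ram.get(addr, 0)            # ordinary memory read


queue_a, queue_b = deque(), deque()
producer = LocalMemorySpace({QUEUE_A_ADDR: queue_a, QUEUE_B_ADDR: queue_b})
consumer_a = LocalMemorySpace({QUEUE_A_ADDR: queue_a})
consumer_b = LocalMemorySpace({QUEUE_B_ADDR: queue_b})

producer.store(QUEUE_A_ADDR, 11)
producer.store(QUEUE_B_ADDR, 22)
producer.store(QUEUE_A_ADDR, 33)
first_for_a = consumer_a.load(QUEUE_A_ADDR)
```

Because the queue is selected by address rather than by opcode, the producer's code can choose its destination at run time simply by changing the store address.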

When the data rate is relatively low, the queue depth can be reduced, even to a single entry—a register that is written by the producer and read by the consumer. This mailbox register serves as a simple and convenient path between producer and consumer. A memory-mapped set of mailbox registers is shown in Figure 10. When the two tasks pass data back and forth, the same register can be used for transfers in either direction.

Figure 10. Memory-mapped mailbox registers

Memory-mapped and instruction-mapped queues serve a wide range of processor communication uses. They work especially well at high data rates with relatively shallow buffering. At lower data rates, buses provide ample communications bandwidth. For applications with very deep buffering requirements, queues must be implemented in RAM or replaced with a shared-memory communication mechanism.
