A Quick Guide to High-Speed I/O for SOC Function Blocks

The two bottlenecks in high-speed SOC block design are I/O performance and computational performance. Traditionally, the main bus of a processor core represents a major I/O bottleneck. All data into and out of the processor must pass over this main bus. Consequently, two factors constrain I/O traffic in and out of the processor. First, a bus can only perform one transfer at a time so other pending transfers must wait for the current transfer to clear. Second, because processor main buses are designed to accommodate many system configurations, they tend to require multiple cycles to effect bus transactions. As a result of these limitations, processor cores have lacked the I/O bandwidth required by many tasks performed in SOCs.

Tensilica’s Xtensa LX2 processor incorporates several features that improve I/O bandwidth. In fact, these features allow the Xtensa LX2 processor to deliver I/O transfer rates that can match those of hardwired RTL blocks. Unlike hand-built RTL, however, the Xtensa LX2 processor achieves those high data rates with an automatically generated, pre-verified hardware core that greatly reduces the time required to develop the SOC’s hardware. In addition, the resulting function is firmware-programmable, which means that it can be changed at a later date to accommodate a new or revised industry standard, to add a feature, or to fix a bug in the system design without changing the silicon.

The key features that allow the Xtensa LX2 processor to achieve these high data-transfer rates are its XLMI local-memory bus (a feature carried forward from Tensilica’s Xtensa V processor) and TIE queues. (Note: TIE is Tensilica’s Instruction Extension language, which allows designers to extend the processor’s abilities using a Verilog-like language to describe the features of the new abilities without the need to describe the structure of the hardware that implements these abilities.) The XLMI bus is a simple, fast, single-cycle bus that can perform transfers much faster than the Xtensa processor’s main bus (the PIF).

A new feature introduced with the Xtensa LX2 processor, called TIE ports and queues, allows designers to add many new input and output ports that lead directly into and out of the processor’s execution unit. These ports and queues can be directly invoked by new instruction extensions, also written in TIE, so that input and output operations become implicit in the execution of a computation. This approach raises I/O bandwidth to a level similar to that achieved by hand-coding function blocks in RTL, but it requires much less effort from the SOC development team to design and verify the hardware because the Xtensa LX2 processor is generated automatically by Tensilica’s Xtensa Processor Generator.

The 1-Bus Bottleneck

Figure 1 shows the microprocessor core configuration typically found in SOC designs. The sole data highway into and out of the processor is its main bus. Because processors often interact with other types of bus masters including other processors and DMA controllers, their main buses have sophisticated transaction protocols and arbitration mechanisms for sharing the bus among masters. These extra mechanisms result in bus transactions that occur over several clock cycles.


Figure 1. The 1-bus bottleneck.

Xtensa LX2 PIF read transactions take a minimum of 6 cycles and write transactions take at least 1 cycle, depending on the speed of the target device connected to the PIF. From these transaction timings, we can calculate the minimum number of cycles needed to perform a simple flow-through computation, where two numbers are loaded from memory, added, and stored back into memory. The assembly code to perform this computation might look like this:

L32I reg_A, Addr_A ; Load the first operand
L32I reg_B, Addr_B ; Load the second operand
ADD reg_C, reg_A, reg_B ; Add the two operands
S32I reg_C, Addr_C ; Store the result

To simplify this code, we assume that pointers to memory locations storing values A, B, and C are already initialized in registers Addr_A, Addr_B, and Addr_C. If not, then more time will be needed for this computation.

The minimum cycle count required to perform this computation is:

L32I reg_A, Addr_A: 6 cycles
L32I reg_B, Addr_B: 6 cycles
ADD reg_C, reg_A, reg_B: 1 cycle
S32I reg_C, Addr_C: 1 cycle

Total: 14 cycles

(Note: This cycle count is a minimum number. Because the Xtensa LX2 processor is pipelined, the total number of cycles will be slightly larger than 14 but the additional cycles will be overlapped with the execution of other instructions. If this code sequence sits within a zero-overhead loop, the cycle count for each loop iteration is 14 cycles.)
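The tally above can be reproduced with a few lines of Python. This is only an arithmetic sketch that uses the minimum cycle counts quoted in the text; the names `PIF_SEQUENCE`, `sequence_cycles`, and `loop_cycles` are illustrative and are not part of any Tensilica tool.

```python
# Minimum per-instruction cycle counts quoted in the text for PIF accesses.
PIF_SEQUENCE = [
    ("L32I reg_A, Addr_A", 6),
    ("L32I reg_B, Addr_B", 6),
    ("ADD reg_C, reg_A, reg_B", 1),
    ("S32I reg_C, Addr_C", 1),
]


def sequence_cycles(sequence):
    """Sum the minimum cycle counts of a straight-line instruction sequence."""
    return sum(cycles for _, cycles in sequence)


def loop_cycles(sequence, iterations):
    """Cycles for a zero-overhead loop: the loop adds no cycles per iteration."""
    return sequence_cycles(sequence) * iterations


if __name__ == "__main__":
    print(sequence_cycles(PIF_SEQUENCE))     # cycles for one pass
    print(loop_cycles(PIF_SEQUENCE, 1000))   # cycles for 1000 loop iterations
```

The model ignores pipeline fill and slower target devices, both of which can only add cycles to these minimum figures.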

Note that load (L32I) instructions consume 6 cycles. This cycle count is the minimum required to return the requested information over the processor’s main bus (the PIF). Loads must complete before the next instruction that uses the resulting data executes. Also note that the store (S32I) instruction consumes only one cycle because the stored value is immediately placed in a store buffer. Once the value enters the processor’s store buffer, the store instruction completes. The processor’s bus-control logic subsequently moves the stored value to the target location.

When high-speed data must flow through a function block, the large number of cycles that this flow-through operation requires is often a major reason for designing a purpose-built block of RTL to perform the task, because a conventional processor would be too slow.

Break the Bottleneck with a Faster Bus

One way to solve this problem is to conduct the load and store transactions over a faster bus to improve the overall I/O bandwidth. Xtensa processors have a local-memory bus interface called XLMI that implements a simpler transaction protocol than the processor’s main PIF bus, so load and store operations can occur in as little as one cycle. By conducting loads and stores over the XLMI bus instead of the PIF, the above computation timing becomes:

L32I reg_A, Addr_A: 1 cycle
L32I reg_B, Addr_B: 1 cycle
ADD reg_C, reg_A, reg_B: 1 cycle
S32I reg_C, Addr_C: 1 cycle

Total: 4 cycles (with the same caveat regarding the processor pipeline)

This result represents a 3.5x improvement in the function’s cycle count, which may mean the difference between acceptable and unacceptable performance for a particular task. However, even with the performance improvement gained from faster bus transactions, the XLMI bus still conducts only one transaction at a time, so loads and stores occur serially.
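As a quick check of the 3.5x figure, the following Python sketch compares the two minimum cycle counts and converts them into results per second. The clock frequency `ASSUMED_CLOCK_HZ` is a hypothetical value chosen only for illustration; the text does not state a clock rate, and the function names are likewise illustrative.

```python
# Minimum cycle counts from the text: two loads, one add, one store.
PIF_CYCLES = [6, 6, 1, 1]
XLMI_CYCLES = [1, 1, 1, 1]

# Hypothetical clock rate, assumed only to make the throughput concrete.
ASSUMED_CLOCK_HZ = 300_000_000


def speedup(baseline, improved):
    """Ratio of total baseline cycles to total improved cycles."""
    return sum(baseline) / sum(improved)


def results_per_second(cycles_per_result, clock_hz):
    """Results produced per second when each result costs a fixed cycle count."""
    return clock_hz / cycles_per_result


if __name__ == "__main__":
    print(speedup(PIF_CYCLES, XLMI_CYCLES))
    print(results_per_second(sum(PIF_CYCLES), ASSUMED_CLOCK_HZ))
    print(results_per_second(sum(XLMI_CYCLES), ASSUMED_CLOCK_HZ))
```

Whatever clock rate is assumed, the ratio between the two throughput figures stays at 3.5 because it depends only on the cycle counts.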

Attaining Ultimate Bandwidth

Even the 4-cycle count achieved with single-cycle loads and stores over a fast local bus can be too slow for certain SOC tasks. Because of this, Tensilica has significantly boosted the I/O bandwidth of the Xtensa LX2 processor with a feature called TIE ports and queues.

The XLMI bus runs faster than the PIF because it implements simpler transaction mechanisms. However, the XLMI port is still a bus, which is capable of communicating with several attached memories and devices—but only one of these at a time because of the nature of bus-oriented I/O transactions.

Ports and queues are very simple, direct communication structures. Like XLMI transactions, transactions over ports and queues occur in one cycle. However, transactions conducted over ports and queues are not activated by addresses supplied by the processor. These simpler structures are activated by specially written processor instructions that implicitly initiate port and queue transactions. Consequently, one designer-defined instruction can initiate transactions on several ports and queues at the same time, which boosts I/O bandwidth.

Using TIE, it’s possible to create queues especially for the example problem discussed in this white paper. Three queues are needed: two input queues for the input operands and one output queue for the result. With these three queues defined, it’s then possible to define an addition instruction that

  1. Implicitly draws input operands A and B from their respective input queues.
  2. Adds A and B together.
  3. Outputs the result of the addition (C) on the output queue.

The TIE code to declare the three queues is:

queue InQ_A 32 in
queue InQ_B 32 in
queue OutQ_C 32 out

The first two statements declare input queues named InQ_A and InQ_B that are 32 bits wide. The third statement declares an output queue named OutQ_C that is also 32 bits wide. Declaring a TIE queue causes the Xtensa Processor Generator to create an additional I/O port with the handshake lines needed to connect to a FIFO that’s external to the processor. A TIE instruction can implicitly read from several input queues and write to several output queues during one clock cycle. Input queues can operate like any other input operand to an instruction and output queues can be assigned values just like any other output operand. (Note that input and output ports and queues need not be 32 bits wide. They can be as narrow as 1 bit or as wide as 1024 bits in the current implementation of TIE.)

The following example describes a TIE instruction called ADD_XFER that reads operands from each of the input queues, adds them together, and writes the result to an output queue.

operation ADD_XFER {in AR val} {in InQ_A, in InQ_B, out OutQ_C} {
assign OutQ_C = InQ_A + InQ_B;
}

With this new instruction, the example problem reduces to one instruction:

ADD_XFER

Although this instruction takes five cycles to run through the Xtensa LX2 processor’s 5-stage pipeline, it has a latency of only one clock cycle. By placing this instruction within a zero-overhead loop, the processor can deliver an effective throughput of one instruction per clock cycle. Thus the computation and data movement occur in the absolute minimum number of clock cycles, namely one. Even hand-coded, hardwired RTL cannot perform this function any faster than the properly equipped processor.
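The behavior of ADD_XFER in a loop can be approximated in software. The Python sketch below is a behavioral model only, not generated hardware or Tensilica tooling: it represents the three external FIFOs as `deque` objects and issues one modeled instruction per cycle. It assumes that the 32-bit result is truncated to the queue width and that the loop simply stops when an input queue is empty; the names `add_xfer` and `run_loop` are hypothetical.

```python
from collections import deque

MASK32 = 0xFFFFFFFF  # the example queues are 32 bits wide


def add_xfer(in_q_a, in_q_b, out_q_c):
    """Model one ADD_XFER: pop A and B, push (A + B) truncated to 32 bits.

    Returns True if the instruction executed, or False if an input queue
    was empty (a modeling assumption standing in for a wait on the FIFO).
    """
    if not in_q_a or not in_q_b:
        return False
    a = in_q_a.popleft()
    b = in_q_b.popleft()
    out_q_c.append((a + b) & MASK32)
    return True


def run_loop(in_q_a, in_q_b, out_q_c):
    """Issue ADD_XFER once per modeled cycle until an input queue runs dry."""
    cycles = 0
    while add_xfer(in_q_a, in_q_b, out_q_c):
        cycles += 1
    return cycles


if __name__ == "__main__":
    q_a = deque([1, 2, 0xFFFFFFFF])
    q_b = deque([10, 20, 1])
    q_c = deque()
    print(run_loop(q_a, q_b, q_c), list(q_c))
```

In this model each result costs one cycle, which mirrors the one-instruction-per-cycle throughput described above for a zero-overhead loop.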

Boost Throughput Further with Multiple Computations Per Cycle

The Xtensa LX2 processor is not limited to performing one computation per cycle. TIE provides two ways to perform two or more calculations at a time. The designer can create a single-cycle TIE instruction that draws operands from several input queues, performs multiple operations on these operands concurrently, and then outputs the results on several output queues. Alternatively, the designer can use the Xtensa LX2 processor’s FLIX (flexible length instruction extension) technology to develop wide, multi-operation instructions. FLIX instructions execute these multiple operations concurrently. The advantage of the first approach is simplicity. If all the computations are always performed concurrently in the same manner, then they can be combined in one instruction. The advantage of the FLIX approach is flexibility. If some of the operations are only performed some of the time or if the operations are performed in various combinations, the FLIX instruction provides the ability to easily create many combinations of concurrent instructions. In some applications, either method will work equally well. In other applications, the flexibility of FLIX instructions will make function-block development much easier.
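The difference between the two approaches can be illustrated with a small behavioral model in Python. This sketch is an assumption-laden illustration, not TIE or FLIX code: the operations `op_add` and `op_scale`, the fused instruction `fused_add_scale`, and the bundle function `issue_bundle` are all hypothetical, and each call is simply counted as one modeled cycle.

```python
from collections import deque


def op_add(in_a, in_b, out):
    """Hypothetical operation: add the heads of two input queues."""
    out.append(in_a.popleft() + in_b.popleft())


def op_scale(in_x, out, factor=2):
    """Hypothetical operation: scale the head of one input queue."""
    out.append(in_x.popleft() * factor)


def fused_add_scale(in_a, in_b, in_x, out_sum, out_scaled):
    """Approach 1: one instruction that always performs both operations."""
    op_add(in_a, in_b, out_sum)
    op_scale(in_x, out_scaled)
    return 1  # modeled cycles consumed


def issue_bundle(slots):
    """Approach 2: a FLIX-style bundle whose slot contents vary per cycle.

    `slots` is a list of (operation, argument-tuple) pairs.
    """
    for op, args in slots:
        op(*args)
    return 1  # modeled cycles consumed


if __name__ == "__main__":
    a, b, x = deque([1, 2]), deque([10, 20]), deque([7, 8])
    sums, scaled = deque(), deque()
    cycles = fused_add_scale(a, b, x, sums, scaled)   # both operations
    cycles += issue_bundle([(op_add, (a, b, sums))])  # only the add this cycle
    print(cycles, list(sums), list(scaled), list(x))
```

The fused function always consumes from all three input queues, whereas the bundle lets each cycle carry a different combination of operations, which is the flexibility attributed to FLIX in the text.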
