White Papers

Queue- and Wire-based Input and Output Ports Allow Processors to Achieve RTL-Like I/O Speeds on SOCs

Mapping all of a processor’s input and output interfaces to memory addresses is neither necessary nor efficient for many tasks in an SOC design. Sometimes the mapping of input/output interfaces to memory addresses permits the programmer or compiler to dynamically choose among several sources and destinations for computations. However, if dynamic addressability is not important, direct connections from external signals to processor execution units can further accelerate performance and reduce complexity. Wire-based interfaces are also more familiar to many RTL designers and often allow processors to substitute for hardware blocks without even changing the block-interface (“pin”) definitions of existing RTL blocks.

Tensilica’s Xtensa LX2 processor can bring signals directly into its execution units from other blocks of logic on the SOC and can output signals directly to other SOC blocks without using its traditional buses. Consequently, data movement bypasses the traditional load and store instructions so that the I/O performed incurs no overhead. These additional ports into and out of the processor are created with Tensilica’s Instruction Extension (TIE) language using two features that are new to the Xtensa LX2 processor: TIE ports and queues.

Two basic styles of interface handshake serve the different input and output environments for direct connection of processors to external signals:

  • Import of values and export of states through ports (GPIOs)
  • Input and output queues (FIFOs)

Consider a simple hardware function shown in Figure 1. The primary inputs and outputs of this function can be simplified to wires at the boundary of the block.


Figure 1. Fully pipelined instruction implementation with direct input/output from block

Input and output ports may serve as source and destination operands for configurable-processor operations, enabling fast and flexible interfaces. Figure 2 shows a sample implementation of the function block in Figure 1 written in TIE. This example uses queues to move data into and out of the block.

1: state state1 24 add_read_write
2: state state2 24 add_read_write
3: state lastinput1 24
4: state nextoutput1 24
5: queue input1 24 in
6: queue input2 24 in
7: queue output1 24 out
8: operation lookup.mul.mul {} {in input1, in input2, in state1, in state2, inout lastinput1, out output1, inout nextoutput1, out VAddr, in MemDataIn32} {
9: assign VAddr = {8'h0, lastinput1 + state1};
10: assign lastinput1 = input1;
11: wire [23:0] mulout = MemDataIn32[23:0] * input2;
12: assign output1 = nextoutput1;
13: assign nextoutput1 = mulout * state2;}
14: schedule inst_sched {lookup.mul.mul} {use state2 4; use nextoutput1 3; use input1 2; use input2 2; def lastinput1 3; def nextoutput1 4; def mulout 3; def output1 3; }

Figure 2. Datapath with input and output queues TIE example.

In this listing, the “use” and “def” arguments in the schedule statement on line 14 specify the pipeline stage in which each operand is read (use) and each result is produced (def). For the queues, these are the stages where the Xtensa processor’s input queue interface has data and where the processor’s output queue interface accepts the output data from the pipeline. The input queue interface has data available in the pipeline’s Memory stage (stage 2), and the output queue interface accepts data during the pipeline’s Write-Back stage (stage 3). Because the input queue value arrives late and the computational result is produced late, the states lastinput1 and nextoutput1 carry values between instructions: lastinput1 lets the following instruction use the late input queue value, and nextoutput1 lets the previous instruction’s late result be sent to the output queue.
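The one-instruction delay introduced by lastinput1 and nextoutput1 can be sketched with a small, untimed behavioral model. This is an illustration only, not Tensilica tooling: the class name LookupMulMul, the dictionary used as a stand-in for data memory, the zero reset values, and the truncation of every product to 24 bits are all assumptions made for the sketch.

```python
from collections import deque

MASK24 = (1 << 24) - 1  # the listing's states, queues, and wires are 24 bits wide


class LookupMulMul:
    """Untimed behavioral model of one lookup.mul.mul execution (illustrative only)."""

    def __init__(self, table, state1, state2):
        self.table = table          # hypothetical stand-in for the data memory
        self.state1 = state1 & MASK24
        self.state2 = state2 & MASK24
        self.lastinput1 = 0         # assumed reset value
        self.nextoutput1 = 0        # assumed reset value

    def execute(self, input1, input2, output1):
        # The address uses the input1 value popped by the PREVIOUS execution,
        # because the queue value arrives too late to form this address.
        vaddr = (self.lastinput1 + self.state1) & MASK24
        mem_data = self.table.get(vaddr, 0) & MASK24
        in1 = input1.popleft()      # pop input queue 1
        in2 = input2.popleft()      # pop input queue 2
        mulout = (mem_data * in2) & MASK24
        # The output queue receives the result computed by the previous execution.
        output1.append(self.nextoutput1)
        self.lastinput1 = in1
        self.nextoutput1 = (mulout * self.state2) & MASK24


input1, input2, output1 = deque([1, 0]), deque([4, 6]), deque()
dp = LookupMulMul(table={10: 3, 11: 5}, state1=10, state2=2)
dp.execute(input1, input2, output1)
dp.execute(input1, input2, output1)
print(list(output1))  # [0, 24]
```

The first execution pushes only the reset value of nextoutput1; the product formed in that execution (3 * 4 * 2 = 24) reaches the output queue one execution later.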

Queue inputs and outputs connect directly through wire structures. Accesses to the corresponding queue structures automatically pop data from the input queues and push data into the output queue. The queue-control mechanism is aware of instruction cancellation (which can happen because of a variety of other events) and ensures that no excess data is popped or pushed even if the processor encounters unexpected error conditions.
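One way to picture cancellation-aware popping is a queue whose reads are tentative until the reading instruction completes. The sketch below is a conceptual model under that assumption; the class SpeculativeInputQueue and its method names are hypothetical and do not describe the actual Xtensa queue-control hardware.

```python
from collections import deque


class SpeculativeInputQueue:
    """Conceptual model: a pop takes effect only when its instruction completes."""

    def __init__(self, items):
        self._fifo = deque(items)
        self._pending = 0  # entries read by in-flight (not yet completed) instructions

    def speculative_read(self):
        # An in-flight instruction looks at the next unread entry without removing it.
        # (Real hardware would stall on an empty queue; this sketch raises IndexError.)
        value = self._fifo[self._pending]
        self._pending += 1
        return value

    def commit(self):
        # The oldest in-flight instruction completed: its entry is really popped.
        self._fifo.popleft()
        self._pending -= 1

    def cancel_all(self):
        # In-flight instructions were cancelled: nothing they read is consumed.
        self._pending = 0

    def __len__(self):
        return len(self._fifo)


q = SpeculativeInputQueue([1, 2, 3])
first = q.speculative_read()     # reads 1
q.commit()                       # instruction completed, 1 is consumed
killed = q.speculative_read()    # reads 2, but the instruction is cancelled
q.cancel_all()
replayed = q.speculative_read()  # the replayed instruction sees 2 again
```

Because the cancelled read never commits, the replayed instruction sees the same value and no data is lost from the stream.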

Queues are one form of instruction-mapped connection. They are particularly appropriate for streaming operand data through an application-specific processor because the request/acknowledge handshaking is already part of the queue interface, as shown in Figure 3.

Figure 3. Basic Handshake for Direct Processor Connections

Queue inputs represent a stream of data values to be consumed by the application running on a processor. Sequential executions of the consuming instruction should see sequential values. Similarly, queue outputs represent a sequence of values being produced by the application processor. The consumption and production of values can be managed in hardware so that all of the effects of possible instruction execution cancellation are hidden and no explicit request-acknowledge handshake is needed. The processor consuming these operands stalls if not enough data has been produced. The processor producing the operands stalls if the consuming processor falls behind and allows the input buffer to fill. These queues form a highly efficient data-streaming connection between processors, especially where several processors make up a large-scale computational pipeline.
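The stall behavior on both sides of a queue can be illustrated with a simple cycle-by-cycle simulation. This is a sketch under stated assumptions, not a model of any particular Xtensa configuration: the function simulate, its parameters, and the idea of giving each side a fixed number of cycles between accesses are all hypothetical.

```python
from collections import deque


def simulate(n_items, depth, producer_period, consumer_period):
    """Stream n_items through a FIFO of the given depth and count stall cycles."""
    fifo = deque()
    received = []
    next_value = 0
    producer_wait = consumer_wait = 0
    producer_stalls = consumer_stalls = 0
    while len(received) < n_items:
        # Consumer: busy, or pops a value, or stalls on an empty queue.
        if consumer_wait > 0:
            consumer_wait -= 1
        elif fifo:
            received.append(fifo.popleft())
            consumer_wait = consumer_period - 1
        else:
            consumer_stalls += 1
        # Producer: busy, or pushes a value, or stalls on a full queue.
        if next_value < n_items:
            if producer_wait > 0:
                producer_wait -= 1
            elif len(fifo) < depth:
                fifo.append(next_value)
                next_value += 1
                producer_wait = producer_period - 1
            else:
                producer_stalls += 1
    return received, producer_stalls, consumer_stalls


# Fast producer, slow consumer: the queue fills and the producer stalls.
values, p_stalls, c_stalls = simulate(8, depth=2, producer_period=1, consumer_period=3)
print(values, p_stalls, c_stalls)
```

In both the fast-producer and the slow-producer case the consumer receives the values in order; only the side that runs ahead accumulates stall cycles.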

The second form of the direct interface is based on ports: import of values and export of states on a set of wires. These ports are especially useful for tasks that test external status or condition information or control other logic functions.

Application-specific instructions use imported values on an input port just like other input operands. The wires are implicit operands of the instruction: they do not need to be explicitly specified, so they do not consume instruction-encoding bits as register-address specifiers would. When the corresponding instruction executes, it simply senses the value on the associated wire. The processor provides no external indication that the instruction is executing, and there is no acknowledgement that the value on the wire has been used. If the application must signal that an input value has been used, it does so explicitly via a store to an external address or by executing an instruction that writes to an output queue or to an exported state (on another wire).

Exporting states creates output ports that deliver information from instructions to external logic or to other processors. The output signals do not change until the processor executes an instruction that explicitly modifies those signals. The normal implementation of this type of signal hides the speculative nature of modern processor pipelines. For pipelined processors, error conditions, conditional branches, cache misses, and other unexpected conditions may cause the premature termination of some instructions that the processor has started to execute speculatively. Speculative execution is a performance-enhancing processor-design technique, but it requires that the system be able to tolerate early instruction termination. To maintain the simple programmer’s model of atomic, in-order instruction execution and to avoid unintended output glitches that the processor’s micro-architectural operations might otherwise cause, the processor hardware prevents externally visible state changes for instructions that do not complete.
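The glitch-free behavior of an exported state can be pictured as a pin that changes only when the writing instruction completes. The following is a conceptual sketch; the class ExportedState and its methods are hypothetical names, and the buffering shown is an assumption about how such behavior could be modeled, not a description of the Xtensa pipeline.

```python
class ExportedState:
    """Conceptual model: the external pin changes only for completed instructions."""

    def __init__(self, reset_value=0):
        self.pin = reset_value  # value visible to external logic
        self._pending = []      # writes from in-flight instructions, oldest first

    def speculative_write(self, value):
        # An in-flight instruction computes a new value; the pin does not move yet.
        self._pending.append(value)

    def commit_oldest(self):
        # The oldest in-flight instruction completed: its write becomes visible.
        if self._pending:
            self.pin = self._pending.pop(0)

    def flush(self):
        # In-flight instructions were terminated early: their writes are discarded.
        self._pending.clear()


status = ExportedState(reset_value=0)
status.speculative_write(5)
status.flush()                # instruction terminated early
after_flush = status.pin      # still 0: external logic never sees the 5
status.speculative_write(7)
status.commit_oldest()        # instruction completed
after_commit = status.pin     # 7
```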

The example shown above can also be implemented with imported values and exported states, as shown below in Figure 4. This implementation uses an explicit output signal, next_data, to indicate that new values for input1 and input2 are required and that a new output value is available on the wire output1. It is the developer’s responsibility to ensure that external logic has sufficient time to respond to the exported next_data signal before the next use of the input1 and input2 ports. This guarantee is easily achieved for moderate-performance applications with tens of cycles of latency between one input set and the next. However, queues are generally faster and simpler for very high input rates. Note that the instruction set and the program must explicitly assert the next_data wire to request new input data and to signal the availability of new output data. Also note that the current Xtensa implementation allows the value of imported wires to be used as early as the ALU stage (“use” as early as schedule stage 1), though exported states must be defined by the Write-Back stage (no “def” after schedule stage 3).

state state1 24 add_read_write
state state2 24 add_read_write
state nextoutput1 24
state output1 24 24'b0 add_read_write export
state next_data 1 1'b0 add_read_write export
import_wire input1 24
import_wire input2 24
operation lookup.mul.mul {} {in input1, in input2, in state1, in state2, out output1, inout nextoutput1, out next_data, out VAddr, in MemDataIn32} {
assign VAddr = {8'h0, input1 + state1};
wire [23:0] mulout = MemDataIn32[23:0] * input2;
assign output1 = nextoutput1;
assign nextoutput1 = mulout * state2;
assign next_data = 1'h0;}
operation assert_next_data {} {out next_data} {assign next_data = 1'h1;}
schedule inst_sched {lookup.mul.mul} {use state2 4; def output1 3; def nextoutput1 4;}

Figure 4. Datapath with import wire/export state TIE example.

This example also compensates for the fact that the computational result is available in a pipeline stage later than that required for state export. The operation therefore exports the result of the previous operation and saves the new computation result in nextoutput1 for export during the next cycle.
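The explicit handshake through next_data can be sketched as a behavioral model in which external logic reacts to the exported signal. This is an illustration under assumptions: the class PortDatapath, the function run, the dictionary standing in for data memory, the zero reset values, and the truncation of products to 24 bits are all hypothetical, and the external logic is assumed to respond before the next use of the input wires.

```python
MASK24 = (1 << 24) - 1  # the listing's states and wires are 24 bits wide


class PortDatapath:
    """Untimed model of the import-wire/export-state operations (illustrative only)."""

    def __init__(self, table, state1, state2):
        self.table = table       # hypothetical stand-in for the data memory
        self.state1 = state1 & MASK24
        self.state2 = state2 & MASK24
        self.nextoutput1 = 0     # internal state, assumed reset value
        self.output1 = 0         # exported state
        self.next_data = 0       # exported state

    def lookup_mul_mul(self, input1, input2):
        vaddr = (input1 + self.state1) & MASK24
        mem_data = self.table.get(vaddr, 0) & MASK24
        mulout = (mem_data * input2) & MASK24
        self.output1 = self.nextoutput1  # export the previous result
        self.nextoutput1 = (mulout * self.state2) & MASK24
        self.next_data = 0

    def assert_next_data(self):
        self.next_data = 1


def run(dp, samples):
    """Drive the wires as external logic would, sampling output1 on next_data."""
    pending = iter(samples)
    wires = next(pending)
    collected = []
    for _ in range(len(samples)):
        dp.lookup_mul_mul(*wires)
        dp.assert_next_data()
        if dp.next_data:                  # external logic sees the request
            collected.append(dp.output1)  # takes the available output
            wires = next(pending, wires)  # and drives the next input pair
    return collected


dp = PortDatapath(table={10: 3, 11: 5}, state1=10, state2=2)
print(run(dp, [(0, 4), (1, 6)]))  # [0, 24]
```

As in the listing, each execution exports the result of the previous one, so the first value seen by external logic is the reset value of nextoutput1.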
