Increasing Computational Performance Through FLIX (Flexible Length Instruction Extensions)

The Xtensa LX2 processor uses Tensilica’s innovative FLIX (Flexible Length Instruction eXtensions) architecture – a highly efficient implementation of the Xtensa instruction set architecture (ISA) that gives designers more options for cost/performance tradeoffs. FLIX technology provides the flexibility to freely and modelessly intermix single-operation RISC instructions, simple- and compound-operation TIE instructions, and multiple-operation FLIX instructions. By packing multiple operations into a wide 32- or 64-bit instruction word, FLIX technology allows designers to accelerate a broader class of “hot spots” in embedded applications while eliminating the performance and code-size drawbacks of VLIW processor architectures. This white paper provides detailed technical information on Tensilica’s FLIX technology for SOC developers who need more processing performance from their designs.

Instruction-set performance relates to the number of useful operations that can be executed per unit of time or per clock. High performance does not guarantee good flexibility, however. Instruction-set flexibility relates to the diversity of applications whose computations can be efficiently encoded in the instruction stream. A longer instruction word generally allows a greater number and diversity of operations and operand specifiers to be encoded in each word.

RISC architectures generally encode one primitive operation per instruction. Long-instruction-word architectures encode a number of independent sub-instructions per instruction, with operation and operand specifiers for each sub-instruction. The sub-instructions may be primitive generic operations similar to RISC instructions or they may each be more sophisticated, application-specific operations such as those described previously in this chapter as processor extensions. Making the instruction word longer, for any given number of operands and operations, makes instruction encoding simpler and more orthogonal.

(Note: Long-instruction-word processors are not always faster than RISC processors. Sometimes the simplicity of RISC execution units boosts maximum clock frequency, and the execution of several distinct RISC instructions per cycle can compensate for the relative austerity of RISC instruction sets. Nevertheless, when RISC instruction sets are used for the most demanding data-intensive tasks, they are typically built as superscalar implementations that attempt to execute multiple instructions per cycle, mimicking the greater intrinsic operational parallelism of long instruction words.)

Figure 1 shows a basic example of long-instruction-word encoding. The figure lays out a 64-bit instruction word with three independent sub-instruction slots, each of which specifies an operation and operands. The first sub-instruction (sub-instruction 0) has an opcode and four operand specifiers: two source registers, an immediate field, and one destination register. The second and third sub-instructions (sub-instructions 1 and 2) each have an opcode and three operand specifiers: two source registers and one source/destination register. The 2-bit format field on the left designates this particular grouping of sub-instructions. It may also designate the overall length of the instruction if the processor supports variable-length encoding.

Figure 1. Example of Long Instruction Word Encoding.
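
The layout in Figure 1 can be mimicked with a few shifts and masks. The sketch below is illustrative only: the text gives the 2-bit format field and the operand mix of each slot, but not the individual field widths, so the widths used here (a 22-bit slot 0 and two 20-bit slots) are assumptions chosen so that the fields add up to 64 bits. The names FIELDS, pack_word, and unpack_word are hypothetical.

```python
# Hypothetical field layout for a 64-bit long instruction word.
# Only the 2-bit format field and the three-slot structure come from the
# text; the slot widths below are assumed for illustration.
FIELDS = [            # (name, width in bits), most significant field first
    ("format", 2),
    ("slot0", 22),
    ("slot1", 20),
    ("slot2", 20),
]
WORD_BITS = sum(width for _, width in FIELDS)  # 64 with the widths above


def pack_word(values):
    """Pack a dict of field values into one integer instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_word(word):
    """Split an instruction word back into its named fields."""
    values = {}
    shift = WORD_BITS
    for name, width in FIELDS:
        shift -= width
        values[name] = (word >> shift) & ((1 << width) - 1)
    return values


# Arbitrary example contents for each field.
example = {"format": 0b10, "slot0": 0x2ABCDE, "slot1": 0x12345, "slot2": 0xFEDCB}
word = pack_word(example)
```

Decoding hardware performs the equivalent of unpack_word in parallel for every slot, which is one reason wider words cost more decode logic.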

Clearly there is a hardware cost associated with long instruction words. Instruction memory is wider, decode logic is bigger, and a larger number of execution units and register files (or register file ports) must be implemented to deliver instruction parallelism. Larger numbers of bigger logic blocks are incrementally harder to optimize, so maximum clock frequency can drop compared to simpler, narrower instruction encodings such as RISC. Nevertheless, the performance and flexibility benefits can be substantial, particularly for data-intensive applications with high inherent parallelism.

In some long-instruction-word architectures, each sub-instruction has almost completely independent resources: dedicated execution units, dedicated register files, and dedicated data memories. In other architectures, the sub-instructions share common register files and data memories and require a number of ports into common storage structures to allow effective and efficient data sharing.

Long-instruction-word architectures also vary widely on the question: How “long” is a long instruction? For high-end computer-system processors such as Intel’s Itanium family and for high-end embedded processors such as Texas Instruments’ TMS320C6400 DSP family, the instruction word is very “long” indeed—hundreds of bits. For more cost- and power-sensitive embedded applications, “long” may be just 64 bits. The essential processor architecture principles are largely the same, however, once multiple independent sub-instructions are packed into each instruction word.

Code Size and Long Instructions

One common liability of long-instruction-word architectures is large code size, compared to architectures that encode one independent operation per instruction. This is a common problem for VLIW architectures, but it is an especially important one for SOC designs where instruction memories may consume a significant fraction of total silicon area. Compared to code compiled for code-efficient architectures, VLIW code can often require two to five times more code storage. Figure 2 compares the total code size of a VLIW DSP (TI TMS320C6203) with Tensilica’s Xtensa processor for the EEMBC Telecom suite, with both straight compilation from unmodified C and with optimized C code. No assembly code was used.

Figure 2. EEMBC Telecom Code Size Comparison

Similarly, Figure 3 compares the total code size of a VLIW media processor (Philips Trimedia TM1300) with Tensilica’s Xtensa processor for the EEMBC Consumer suite, with both straight compilation from unmodified C and with full optimization of the C. No hand-written assembly code was created for the optimized Tensilica processor.

Figure 3. EEMBC Consumer Code Size Comparison.

Code bloat stems, in part, from instruction-length inflexibility. If, for example, the compiler can find only one operation whose source operands and execution units are ready, it may be forced to encode several sub-instruction fields as NOPs (no operation). Instruction storage is already a major portion of embedded SOC silicon area, so code expansion translates into higher cost, poorer instruction-cache performance, or both.

A second source of VLIW code bloat is the loose encoding of frequent operations commonly found in VLIW processors. The TI TMS320C6203 DSP, for example, requires 32 bits of instruction to specify a 16-bit multiplication and 32 bits to specify a 16-bit add, so the common multiply/accumulate (MAC) combination takes at least 64 bits. If a loop containing many MACs is unrolled four times (to amortize the cost of branch and address calculations), the resulting eight MAC operations require 512 bits of instruction storage, not counting the additional bits for any loads, stores, branches, or address-calculation instructions.

However, long instructions do not necessarily lead to VLIW code bloat. A long-instruction-word implementation of Tensilica’s Vectra LX DSP architecture needs about 20 bits within the instruction stream to specify eight 16-bit MACs executing in SIMD fashion, not counting the additional bits for any loads, stores, branches, or address-calculation instructions.
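
The arithmetic behind this comparison is short enough to write out. The sketch below uses only the figures quoted above; the variable names are ours, and the "about 20 bits" figure is approximate, so the resulting ratio is a rough indication rather than a measured result.

```python
# Instruction bits spent on the MAC operations alone, using the figures
# quoted in the text (loads, stores, branches and address arithmetic are
# excluded on both sides).
MUL_BITS = 32   # TMS320C6203: bits to specify one 16-bit multiply
ADD_BITS = 32   # TMS320C6203: bits to specify one 16-bit add
MACS = 8        # MAC operations in the unrolled loop body

vliw_bits = MACS * (MUL_BITS + ADD_BITS)   # 8 * 64 = 512 bits
simd_bits = 20                             # "about 20 bits" for eight SIMD MACs

ratio = vliw_bits / simd_bits              # roughly 25x fewer bits for the MACs
```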

One attractive solution for long-instruction-word code bloat is to use a more flexible range of instruction lengths. If the processor allows multiple instruction lengths, including short instructions that encode a single operation, the compiler can achieve significantly better code size and instruction storage efficiency compared to traditional VLIW processor designs with fixed-length instruction words. Reducing code size for long-instruction-word processors also tends to decrease bus-bandwidth requirements and reduces the power dissipation associated with instruction fetches. Tensilica’s Xtensa LX2 processor, for example, incorporates flexible-length instruction extensions (FLIX). This architectural approach addresses the code size challenge by offering 16-bit, 24-bit, and a choice of either 32- or 64-bit instruction lengths. Designer-defined instructions can use the 24-, 32-, and 64-bit instruction formats.
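
A toy model makes the code-size effect concrete. In the sketch below, a program is a list of issue groups, each holding the number of operations the compiler managed to schedule together. The workload is invented for illustration, and the model simplifies by assuming that every lone operation fits a 24-bit instruction and every multi-operation group needs a full 64-bit word. Function names are hypothetical.

```python
def fixed_length_bits(groups, word_bits=64):
    """Fixed-length VLIW: every issue group occupies a full word,
    with unused slots padded by NOPs."""
    return len(groups) * word_bits


def flexible_length_bits(groups, short_bits=24, long_bits=64):
    """Flexible-length encoding: a lone operation uses a short
    instruction; only multi-operation groups pay for a long word."""
    return sum(short_bits if ops == 1 else long_bits for ops in groups)


# Invented schedule: operations found per issue group (each group >= 1 op).
groups = [1, 1, 3, 1, 2, 1, 1, 3]

fixed = fixed_length_bits(groups)        # 8 groups * 64 bits
flexible = flexible_length_bits(groups)  # 5 * 24 + 3 * 64 bits
```

With this made-up schedule, the flexible encoding needs 312 bits against 512 for the fixed-length one. Real savings depend entirely on how much parallelism the compiler finds.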

Long instructions allow more encoding freedom, where a large number of sub-instruction or operation slots can be defined (although three to six independent slots are typical) depending on the operational richness required in each slot. The operation slots need not be equally sized. Big slots (20-30 bits) accommodate a wide variety of opcodes, relatively deep register files (16-32 entries), and three or four register-operand specifiers. Developers should consider creating processors with big operation slots for applications with modest degrees of parallelism but a strong need for flexibility and generality within the application domain.

Small slots (8-16 bits) lend themselves to direct specification of movement among small register sets and allow a large number of independent slots to be packed into a long instruction word. Each of the larger number of slots offers a more limited range of operations, fewer specifiers (or more implied operands), and shallower register files. Developers should consider creating processors with many small slots for applications with a high degree of parallelism among many specialized function units.
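
The trade-off between big and small slots follows from simple bit counting. The estimate below is a sketch under stated assumptions: it charges each register operand ceil(log2(depth)) bits and the opcode ceil(log2(count)) bits, and it ignores immediates and encoding overhead. The opcode counts (128 and 16) are hypothetical values picked to land inside the ranges quoted above.

```python
def bits_for(choices):
    """Smallest number of bits that can distinguish `choices` values."""
    return (choices - 1).bit_length()


def slot_bits(num_opcodes, regfile_depth, register_operands):
    """Rough width of one slot: opcode field plus register specifiers."""
    return bits_for(num_opcodes) + register_operands * bits_for(regfile_depth)


# "Big" slot: many opcodes, a 32-entry register file, three operands.
big = slot_bits(num_opcodes=128, regfile_depth=32, register_operands=3)   # 7 + 15

# "Small" slot: few opcodes, a 4-entry register file, two operands.
small = slot_bits(num_opcodes=16, regfile_depth=4, register_operands=2)   # 4 + 4
```

Under these assumptions, a 64-bit word has room for two or three big slots, or for several small ones.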

Long Instruction Words and Automatic Processor Generation

Long-instruction-word architectures fit very well with automatic generation of processor hardware and software. High-level instruction descriptions can specify the set of sub-instructions that fit into each slot. From these descriptions, the processor generator determines the encoding requirements for each field in each slot, assigns opcodes, and creates instruction-decoding hardware for all necessary instruction formats. The processor generator can also create the corresponding compiler and assembler for the long-word processor. For long-instruction-word architectures, packing of sub-instructions into long instructions is a very complex task. The assembler can handle this packing, so assembly source code programs written by programmers need only specify the operations or sub-instructions, giving less attention to packing constraints. The compiler generates code with instruction-slot availability in mind to maximize performance and minimize code size, so it generally does its own packing of operations into long instructions.

Figure 4 shows a short but complete example of a very simple long-instruction word processor described in TIE with FLIX technology. It relies entirely on built-in definitions of 32-bit integer operations, and defines no new operations. It creates a processor with a high degree of potential parallelism even for applications written purely in terms of standard C integer operations and data-types. The first of three slots supports all the commonly used integer operations, including ALU operations, loads, stores, jumps and branches. The second slot offers loads and stores, plus the most common ALU operations. The third slot offers a full complement of ALU operations, but no loads and stores.

1: length ml64 64 {InstBuf[3:0] == 15}
2: format format1 ml64 {base_slot, ldst_slot, alu_slot}
3: slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BEQI, BNEI, BGEI, BNEI, BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI, L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N }
4: slot_opcodes ldst_slot { ADD.N, SUB, ADDI.N, L32I.N, L32R, L16UI, L16SI, L8UI, S32I.N, S16I, S8I, MOVI.N }
5: slot_opcodes alu_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N, AND, OR, XOR, SLLI, SRLI, SRAI, MOVI.N }

Figure 4. Simple 32-bit Multi-Slot Architecture Description

The first line of the example declares a new instruction length (64 bits) and specifies the encoding of the first 4 bits of the instruction that determine the length. The second line declares a format for that instruction length, format1, and names the three slots within the new format: base_slot, ldst_slot, and alu_slot. The third line lists all the TIE instructions that can be packed into the first of those slots, base_slot. In this case, all the instructions happen to be pre-defined Xtensa LX instructions, but new instructions could also be included in this slot. The processor generator also creates a NOP (no operation) for each slot, so the software tools can always create a complete instruction, even when no other operations for that slot are available for packing into a long instruction. Lines 4 and 5 designate the subset of instructions that can go into the other two slots.
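
To give a feel for what the packing step involves, here is a minimal sketch that places up to three operations into the slots of Figure 4 and fills any unused slot with NOP. It is not the algorithm used by Tensilica's assembler or compiler. It checks slot legality only and assumes the caller has already established that the operations are independent. The opcode sets are copied from the figure; the pack function is hypothetical.

```python
from itertools import permutations

# Opcodes allowed in each slot, taken from Figure 4.
SLOTS = {
    "base_slot": set(
        "ADD.N ADDX2 ADDX4 SUB SUBX2 SUBX4 ADDI.N AND OR XOR BEQZ.N BNEZ.N "
        "BGEZ BEQI BNEI BGEI BLTI BEQ BNE BGE BLT BGEU BLTU L32I.N L32R "
        "L16UI L16SI L8UI S32I.N S16I S8I SLLI SRLI SRAI J JX MOVI.N".split()
    ),
    "ldst_slot": set(
        "ADD.N SUB ADDI.N L32I.N L32R L16UI L16SI L8UI S32I.N S16I S8I "
        "MOVI.N".split()
    ),
    "alu_slot": set(
        "ADD.N ADDX2 ADDX4 SUB SUBX2 SUBX4 ADDI.N AND OR XOR SLLI SRLI SRAI "
        "MOVI.N".split()
    ),
}
SLOT_ORDER = ["base_slot", "ldst_slot", "alu_slot"]


def pack(ops):
    """Return {slot: opcode} for one long instruction, or None when the
    operations cannot all be placed in distinct legal slots."""
    if len(ops) > len(SLOT_ORDER):
        return None
    # Try every assignment of the operations to distinct slots.
    for chosen in permutations(SLOT_ORDER, len(ops)):
        if all(op in SLOTS[slot] for op, slot in zip(ops, chosen)):
            bundle = dict.fromkeys(SLOT_ORDER, "NOP")  # unused slots get NOP
            bundle.update(zip(chosen, ops))
            return bundle
    return None
```

For example, pack(["ADD.N", "BEQ"]) places BEQ in base_slot (the only slot that accepts branches) and moves ADD.N to ldst_slot, while pack(["J", "JX"]) returns None because both jumps compete for base_slot.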

Figure 5 defines a long-instruction-word architecture with a mix of built-in 32-bit operations and new 128-bit operations. It defines one 64-bit instruction format with three sub-instruction slots (base_slot, ldst_slot, and alu_slot). The description takes advantage of the Xtensa processor’s predefined RISC instructions, but also defines a large new register file and three new ALU operations on the new register file:

1: length ml64 64 {InstBuf[3:0] == 15}
2: format format1 ml64 {base_slot, ldst_slot, alu_slot}
3: slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4, ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BEQI, BNEI, BGEI, BNEI, BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI, L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N }
4: regfile x 128 32 x
5: slot_opcodes ldst_slot {loadx, storex} /* slot does 128b load/store*/
6: immediate_range sim8 -128 127 1 /*8 bit signed offset field */
7: operation loadx {in x *a, in sim8 off, out x d} {out VAddr, in MemDataIn128}{
8: assign VAddr = a + off; assign d = MemDataIn128;}
9: operation storex {in x *a, in sim8 off, in x s} {out VAddr,out MemDataOut128}{
10: assign VAddr = a + off; assign MemDataOut128 = s;}
11: slot_opcodes alu_slot {addx, andx, orx} /* three new ALU operations on x regs */
12: operation addx {in x a, in x b, out x c} {} {assign c = a + b;}
13: operation andx {in x a, in x b, out x c} {} { assign c = a & b;}
14: operation orx {in x a, in x b, out x c} {} { assign c = a | b;}

Figure 5. Mixed 32-bit/128-bit Multi-slot Architecture Description

The first three lines are identical to those of Figure 4. The fourth line declares a new register file 128 bits wide and 32 entries deep. The fifth line lists the two load and store instructions for the new wide register file, which can be found in the second slot of the long instruction word. The sixth line defines a new immediate range, an 8-bit signed value, to be used as the offset range for the new 128-bit load and store instructions. Lines 7-10 fully define the new load and store instructions in terms of the basic interface signals VAddr (the address used to access local data memory), MemDataIn128 (the data being returned from local data memory), and MemDataOut128 (the data to be sent to the local data memory). The use of 128-bit memory data signals also guarantees that the local data memory will be at least 128 bits wide. Line 11 lists the three new ALU operations that can be put in the third slot of the long instruction word. Lines 12-14 fully define those operations on the 128-bit wide register file: add, bit-wise AND, and bit-wise OR.

With this example, any combination of the 39 instructions (including NOP) in the first slot, three instructions in the second slot (loadx, storex, and NOP), and four instructions in the third slot can be combined to form legal instructions, for a total of 468 combinations. This simplified example specifies almost enough instructions to densely populate a long instruction word. The first slot needs about 21 bits, the second slot needs only about 19 bits, the third slot needs about 17 bits, and the format/length field requires four bits, for a total of roughly 62 bits. This example shows the potential to independently specify operations to enable instruction-level parallelism. Moreover, all of the techniques for improving the performance of individual instructions, especially fusion and SIMD, are readily applied to the operations encoded in each sub-instruction.
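
The combination count is simply the product of the choices available in each slot. The short sketch below reproduces it from the per-slot counts given in the text and also shows how many bits are needed just to select the opcode in each slot. The remaining bits of each slot go to register and immediate specifiers, which this sketch does not model.

```python
from math import prod

# Legal choices per slot, NOP included, as counted in the text.
choices_per_slot = {"base_slot": 39, "ldst_slot": 3, "alu_slot": 4}

# Any choice in one slot can be paired with any choice in the others.
total_combinations = prod(choices_per_slot.values())

# Bits needed only to distinguish the opcodes within each slot.
opcode_bits = {slot: (n - 1).bit_length() for slot, n in choices_per_slot.items()}
```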

The compound operation technique, as described in Figure 6, can be applied within sub-instructions, but long instruction words also encourage the encoding of independent operations in different slots.

1: length ml32 32 {InstBuf[3:0] == 15}
2: format pair ml32{shift, logic}
3: regfile X 128 4 x
4: slot_opcodes shift {xr_srl, xr_sll }
5: operation xr_sll {in AR a,inout AR b} {} {assign b=b<<{a[3:0],3'h0};}
6: operation xr_srl {in AR a,inout AR b} {} {assign b=b>>{a[3:0],3'h0};}
7: slot_opcodes logic { xr_or, xr_and }
8: operation xr_and {in X c,inout X d} {} {assign d=d & c;}
9: operation xr_or {in X c,inout X d} {} {assign d=d | c;}

Figure 6. Compound Operation TIE Example—Revisited.

The first two lines define a 32-bit wide instruction, a new format, and the two slots within that format. The next line declares a new wide register file. Lines 4-6 define the instructions (byte shifts) that can occupy the first slot. Lines 7-9 define the instructions (bit-wise AND and bit-wise OR) that can occupy the second slot. Altogether this TIE example defines four operations, which can be paired in four combinations. If these were the only instructions, the processor generator would discover that this format requires only 16 bits to encode: 10 bits for the “shift” slot (two four-bit specifiers for the two AR register entries, plus one bit to differentiate shift left from shift right) and 6 bits for the “logic” slot (two two-bit specifiers for the two X register entries, plus one bit to differentiate AND from OR).
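
The slot widths can be checked with a small bit-counting sketch. Counting only the two named operations per slot gives 9 and 5 bits. The quoted 10 and 6 bits are reproduced if each slot also reserves an opcode value for the automatically generated NOP. That reading is our assumption and is not stated in the text. The helper names are hypothetical.

```python
def bits_for(choices):
    """Smallest number of bits that can distinguish `choices` values."""
    return (choices - 1).bit_length()


def slot_width(register_operands, regfile_entries, operations, include_nop):
    """Register specifier bits plus opcode bits for one slot."""
    opcodes = operations + (1 if include_nop else 0)
    return register_operands * bits_for(regfile_entries) + bits_for(opcodes)


# "shift" slot: two AR operands (four-bit specifiers, so 16 entries), 2 ops.
# "logic" slot: two X operands (4-entry register file), 2 ops.
without_nop = (slot_width(2, 16, 2, False), slot_width(2, 4, 2, False))  # (9, 5)
with_nop = (slot_width(2, 16, 2, True), slot_width(2, 4, 2, True))       # (10, 6)

format_bits = sum(with_nop)  # 16, matching the total quoted in the text
```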
