Increasing Computational Performance Through
FLIX (Flexible Length Instruction Extensions)
The Xtensa LX2 processor uses Tensilica’s
innovative FLIX (Flexible Length Instruction
eXtensions) architecture – a highly efficient
implementation of the Xtensa instruction set
architecture (ISA) that gives designers more
options for cost/performance tradeoffs. FLIX
technology provides the flexibility to freely
and modelessly intermix single-operation RISC
instructions, simple- and compound-operation
TIE instructions, and multiple-operation FLIX
instructions. By packing multiple operations
into a wide 32- or 64-bit instruction word, FLIX
technology allows designers to accelerate a broader
class of “hot spots” in embedded
applications while eliminating the performance
and code-size drawbacks of VLIW processor architectures.
This white paper provides detailed technical
information on Tensilica’s FLIX technology
for SOC developers who need more processing performance
from their designs.
Instruction-set performance relates to the number
of useful operations that can be executed per
unit of time or per clock. High performance does
not guarantee good flexibility, however. Instruction-set
flexibility relates to the diversity of
applications whose computations can
be efficiently encoded in the instruction stream.
A longer instruction word generally allows a
greater number and diversity of operations and
operand specifiers to be encoded in each word.
RISC architectures generally encode one primitive
operation per instruction. Long-instruction-word
architectures encode a number of independent
sub-instructions per instruction, with operation
and operand specifiers for each sub-instruction.
The sub-instructions may be primitive generic
operations similar to RISC instructions or they
may each be more sophisticated, application-specific
operations such as those described previously
in this chapter as processor extensions. Making
the instruction word longer, for any given number
of operands and operations, makes instruction
encoding simpler and more orthogonal.
(Note: Long-instruction-word processors are
not always faster than RISC processors. Sometimes
the benefit of RISC execution-unit simplicity
boosts maximum clock frequency and the execution
of several distinct RISC instructions per cycle
can compensate for the relative austerity of
RISC instruction sets. Nevertheless, when RISC
instruction sets are used for the most demanding
data-intensive tasks, they are typically realized as
super-scalar designs that attempt to
execute multiple instructions per cycle, mimicking
the greater intrinsic operational parallelism
of long-instruction words.)
Figure 1 shows a basic example of long-instruction
operation encoding. The figure lays out
a 64-bit instruction word with three independent
sub-instruction slots, each of which specifies
an operation and operands. The first sub-instruction
(sub-instruction 0) has an opcode and four operand
specifiers—two source registers, an immediate
field, and one destination register. The second
and third sub-instructions (sub-instructions
1 and 2) have an opcode and three operand specifiers—two
source registers and one source/destination register.
The 2-bit format field on the left designates
this particular grouping of sub-instructions.
It may also designate the overall length of the
instruction if the processor supports variable-length
encoding.
Figure 1. Example of Long Instruction Word Encoding.
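As a rough illustration of the layout described for Figure 1, the following Python sketch packs and unpacks a 64-bit word made of a 2-bit format field and three slots. The individual field widths (a 6-bit opcode and 4-bit specifiers in slot 0, an 8-bit opcode in slots 1 and 2) and the names `encode` and `decode` are assumptions chosen only so that the fields sum to 64 bits; the text does not state them.

```python
# Illustrative layout only: the text fixes the 2-bit format field and the
# number of operand specifiers per slot, not the individual field widths.
FORMAT_BITS = 2
SLOT0 = [("opcode", 6), ("src1", 4), ("src2", 4), ("imm", 4), ("dst", 4)]  # 22 bits
SLOT12 = [("opcode", 8), ("src1", 4), ("src2", 4), ("srcdst", 4)]          # 20 bits
LAYOUT = [("slot0", SLOT0), ("slot1", SLOT12), ("slot2", SLOT12)]


def slot_width(fields):
    """Total number of bits occupied by one slot."""
    return sum(width for _, width in fields)


def encode(fmt, slots):
    """Pack the format field (leftmost) and the three slots into one integer."""
    if not 0 <= fmt < (1 << FORMAT_BITS):
        raise ValueError("format field out of range")
    word = fmt
    for slot_name, fields in LAYOUT:
        for name, width in fields:
            value = slots[slot_name][name]
            if not 0 <= value < (1 << width):
                raise ValueError(f"{slot_name}.{name} does not fit in {width} bits")
            word = (word << width) | value
    return word


def decode(word):
    """Inverse of encode: peel fields off from the least significant end."""
    slots = {}
    for slot_name, fields in reversed(LAYOUT):
        values = {}
        for name, width in reversed(fields):
            values[name] = word & ((1 << width) - 1)
            word >>= width
        slots[slot_name] = values
    return word, slots  # what remains is the format field


# The assumed widths fill the 64-bit word exactly.
assert FORMAT_BITS + sum(slot_width(f) for _, f in LAYOUT) == 64
```

A decoder built this way can examine the format field first and then interpret the remaining bits according to the slot grouping that the format designates.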
Clearly there is a hardware cost associated
with long instruction words. Instruction memory
is wider, decode logic is bigger, and a larger
number of execution units and register files
(or register file ports) must be implemented
to deliver instruction parallelism. Larger numbers
of bigger logic blocks are incrementally harder
to optimize, so maximum clock frequency can drop
compared to simpler, narrower instruction encodings
such as RISC. Nevertheless, the performance and
flexibility benefits can be substantial, particularly
for data-intensive applications with high inherent
parallelism.
In some long-instruction-word architectures,
each sub-instruction has almost completely independent
resources: dedicated execution units, dedicated
register files, and dedicated data memories.
In other architectures, the sub-instructions
share common register files and data memories
and require a number of ports into common storage
structures to allow effective and efficient data
sharing.
Long-instruction-word architectures also vary
widely on the question: How “long” is
a long instruction? For high-end computer-system
processors such as Intel’s Itanium family
and for high-end embedded processors such as
Texas Instruments’ TMS320C6400 DSP family,
the instruction word is very “long” indeed—hundreds
of bits. For more cost- and power-sensitive embedded
applications, “long” may be just
64 bits. The essential processor architecture
principles are largely the same, however, once
multiple independent sub-instructions are packed
into each instruction word.
Code Size and Long Instructions
One common liability
of long-instruction-word architectures is large
code size, compared to architectures that encode
one independent operation per instruction. This
is a common problem for VLIW architectures, but
it is an especially important one for SOC designs
where instruction memories may consume a significant
fraction of total silicon area. Compared to
code compiled for code-efficient architectures,
VLIW code can often require two to five times
more code storage. Figure 2 compares the total
code size of a VLIW DSP (TI TMS320C6203) with
Tensilica’s Xtensa processor for the
EEMBC Telecom suite, with both straight compilation
from unmodified C and with optimized C code.
No assembly code was used.
Figure 2. EEMBC Telecom Code Size Comparison
Similarly, Figure 3 compares the total code
size of a VLIW media processor (Philips Trimedia
TM1300) with Tensilica’s Xtensa processor
for the EEMBC Consumer suite, with both straight
compilation from unmodified C and with full optimization
of the C. No hand-written assembly code was created
for the optimized Tensilica processor.
Figure 3. EEMBC Consumer Code Size Comparison.
Code bloat stems, in part, from instruction-length
inflexibility. If, for example, the compiler
can find only one operation whose source operands
and execution units are ready, it may be forced
to encode several sub-instruction fields as NOPs
(no operation). Instruction storage is already
a major portion of embedded SOC silicon area,
so code expansion translates into higher cost,
poorer instruction-cache performance, or both.
A second source of VLIW code bloat is the loose
encoding of frequent operations commonly found
in VLIW processors. The TI TMS320C6203 DSP, for
example, requires 32 bits of instruction to specify
a 16-bit multiplication and 32 bits to specify
a 16-bit add, so the common multiply/accumulate
(MAC) combination takes at least 64 bits. If
a loop containing many MACs is unrolled four
times (to amortize the cost of branch and address
calculations), the resulting eight MAC operations
require 512 bits of instruction storage, not
counting the additional bits for any loads, stores,
branches, or address-calculation instructions.
However, long instructions do not necessarily
lead to VLIW code bloat. A long-instruction-word
implementation of Tensilica’s Vectra LX
DSP architecture needs about 20 bits within the
instruction stream to specify eight 16-bit MACs
executing in SIMD fashion, not counting the additional
bits for any loads, stores, branches, or address-calculation
instructions.
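The arithmetic behind this comparison can be checked with a short Python sketch. The 32-bit multiply, 32-bit add, and roughly 20-bit SIMD figures are taken from the text above; the function name and the resulting ratio are illustrative only and ignore loads, stores, branches, and address calculations, as the text does.

```python
def separate_mac_bits(n_macs, mul_bits=32, add_bits=32):
    """Instruction bits when every MAC is a separate multiply plus add."""
    return n_macs * (mul_bits + add_bits)


N_MACS = 8
SIMD_BITS = 20  # approximate figure quoted for eight 16-bit MACs in SIMD form

vliw_bits = separate_mac_bits(N_MACS)
print(vliw_bits)              # 512
print(vliw_bits / SIMD_BITS)  # 25.6
```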
One attractive solution for long-instruction-word
code bloat is to use a more flexible range of
instruction lengths. If the processor allows
multiple instruction lengths, including short
instructions that encode a single operation,
the compiler can achieve significantly better
code size and instruction storage efficiency
compared to traditional VLIW processor designs
with fixed-length instruction words. Reducing
code size for long-instruction-word processors
also tends to decrease bus-bandwidth requirements
and reduces the power dissipation associated
with instruction fetches. Tensilica’s Xtensa
LX2 processor, for example, incorporates flexible-length
instruction extensions (FLIX). This architectural
approach addresses the code size challenge by
offering 16-bit, 24-bit, and a choice of either
32- or 64-bit instruction lengths. Designer-defined
instructions can use the 24-, 32-, and 64-bit
instruction formats.
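A minimal Python sketch of why mixed instruction lengths help is shown here. It assumes a hypothetical schedule (the number of useful operations the compiler found for each issued instruction), assumes that any single operation fits a 24-bit instruction, and assumes that anything wider uses the 64-bit format. Real code would also use 16-bit instructions where possible, so this is only a rough comparison.

```python
def fixed_length_bits(ops_per_instruction, word_bits=64):
    """Every instruction occupies a full long word; unused slots hold NOPs."""
    return len(ops_per_instruction) * word_bits


def flexible_length_bits(ops_per_instruction, short_bits=24, long_bits=64):
    """Single operations use a short instruction; only real bundles go long."""
    return sum(short_bits if n == 1 else long_bits for n in ops_per_instruction)


# Hypothetical schedule: useful operations found for each issued instruction.
schedule = [1, 3, 1, 1, 2, 1]

print(fixed_length_bits(schedule))     # 384
print(flexible_length_bits(schedule))  # 224
```

The more often the compiler finds only one ready operation, the larger the gap between the two totals.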
Long instructions allow more encoding freedom,
where a large number of sub-instruction or operation
slots can be defined (although three to six independent
slots are typical) depending on the operational
richness required in each slot. The operation
slots need not be equally sized. Big slots (20-30
bits) accommodate a wide variety of opcodes,
relatively deep register files (16-32 entries),
and three or four register-operand specifiers.
Developers should consider creating processors
with big operation slots for applications with
modest degrees of parallelism but a strong need
for flexibility and generality within the application
domain.
Small slots (8-16 bits) lend themselves to direct
specification of movement among small register
sets and allow a large number of independent
slots to be packed into a long instruction word.
Each of the larger number of slots offers a more
limited range of operations, fewer specifiers
(or more implied operands) and shallower register
files. Developers should consider creating processors
with many small slots for applications with a
high degree of parallelism among many specialized
function units.
Long Instruction Words and Automatic Processor
Generation
Long-instruction-word architectures
fit very well with automatic generation of
processor hardware and software. High-level instruction
descriptions can specify the set of sub-instructions
that fit into each slot. From these descriptions,
the processor generator determines the encoding
requirements for each field in each slot, assigns
opcodes, and creates instruction-decoding hardware
for all necessary instruction formats. The
processor generator can also create the corresponding
compiler and assembler for the long-word processor.
For long-instruction-word architectures, packing
of sub-instructions into long instructions
is a very complex task. The assembler can handle
this packing, so assembly source code programs
written by programmers need only specify the
operations or sub-instructions, giving less
attention to packing constraints. The compiler
generates code with instruction-slot availability
in mind to maximize performance and minimize
code size, so it generally does its own packing
of operations into long instructions.
Figure 4 shows a short but complete example
of a very simple long-instruction word processor
described in TIE with FLIX technology. It relies
entirely on built-in definitions of 32-bit integer
operations, and defines no new operations. It
creates a processor with a high degree of potential
parallelism even for applications written purely
in terms of standard C integer operations and
data-types. The first of three slots supports
all the commonly used integer operations, including
ALU operations, loads, stores, jumps and branches.
The second slot offers loads and stores, plus
the most common ALU operations. The third slot
offers a full complement of ALU operations, but
no loads and stores.
1: length ml64 64 {InstBuf[3:0] == 15}
2: format format1 ml64 {base_slot, ldst_slot, alu_slot}
3: slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4,
   ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BEQI, BNEI, BGEI, BNEI,
   BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI,
   L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N}
4: slot_opcodes ldst_slot {ADD.N, SUB, ADDI.N, L32I.N, L32R, L16UI,
   L16SI, L8UI, S32I.N, S16I, S8I, MOVI.N}
5: slot_opcodes alu_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4,
   ADDI.N, AND, OR, XOR, SLLI, SRLI, SRAI, MOVI.N}
Figure 4. Simple 32-bit Multi-Slot Architecture
Description
The first line of the example declares a new
instruction length (64 bits) and specifies the
encoding of the first 4 bits of the instruction
that determine the length. The second line declares
a format for that instruction length, format1,
and names its three slots: base_slot, ldst_slot,
and alu_slot. The third line lists all the
TIE instructions that can be packed into the
first of those slots, base_slot. In this case,
all the instructions happen to be pre-defined
Xtensa LX instructions, but new instructions could
also be included in this slot. The processor
generator also creates a NOP (no operation) for
each slot, so the software tools can always create
a complete instruction, even when no other operations
for that slot are available for packing into
a long instruction. Lines 4 and 5 designate the
subset of instructions that can go into the other
two slots.
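To make the packing step concrete, here is a deliberately simplified Python sketch of a greedy packer for the three slots of Figure 4. It is not the algorithm used by the Tensilica assembler or compiler. It assumes that all operations are independent (no data dependencies or branch semantics are checked), and the names `find_slot` and `pack` as well as the slot-preference order are hypothetical. The opcode sets are copied from Figure 4.

```python
SLOT_OPCODES = {
    "base_slot": {
        "ADD.N", "ADDX2", "ADDX4", "SUB", "SUBX2", "SUBX4", "ADDI.N", "AND",
        "OR", "XOR", "BEQZ.N", "BNEZ.N", "BGEZ", "BEQI", "BNEI", "BGEI",
        "BLTI", "BEQ", "BNE", "BGE", "BLT", "BGEU", "BLTU", "L32I.N", "L32R",
        "L16UI", "L16SI", "L8UI", "S32I.N", "S16I", "S8I", "SLLI", "SRLI",
        "SRAI", "J", "JX", "MOVI.N",
    },
    "ldst_slot": {
        "ADD.N", "SUB", "ADDI.N", "L32I.N", "L32R", "L16UI", "L16SI", "L8UI",
        "S32I.N", "S16I", "S8I", "MOVI.N",
    },
    "alu_slot": {
        "ADD.N", "ADDX2", "ADDX4", "SUB", "SUBX2", "SUBX4", "ADDI.N", "AND",
        "OR", "XOR", "SLLI", "SRLI", "SRAI", "MOVI.N",
    },
}
SLOT_ORDER = ["base_slot", "ldst_slot", "alu_slot"]  # order within format1
TRY_ORDER = ["alu_slot", "ldst_slot", "base_slot"]   # heuristic: keep base_slot free


def find_slot(op, used):
    """Return the first free slot that accepts op, or None."""
    for slot in TRY_ORDER:
        if slot not in used and op in SLOT_OPCODES[slot]:
            return slot
    return None


def pack(ops):
    """Greedily pack independent operations into 3-slot instructions."""
    bundles, current = [], {}
    for op in ops:
        if not any(op in legal for legal in SLOT_OPCODES.values()):
            raise ValueError(f"{op} is not legal in any slot")
        slot = find_slot(op, current)
        if slot is None:              # no room left: start a new instruction
            bundles.append(current)
            current = {}
            slot = find_slot(op, current)
        current[slot] = op
    if current:
        bundles.append(current)
    # Fill every unused slot with the NOP the generator creates for it.
    return [[b.get(s, "NOP") for s in SLOT_ORDER] for b in bundles]


for bundle in pack(["ADD.N", "L32I.N", "BNE", "SLLI", "XOR", "S32I.N", "J"]):
    print(bundle)
# ['BNE', 'L32I.N', 'ADD.N']
# ['XOR', 'S32I.N', 'SLLI']
# ['J', 'NOP', 'NOP']
```

The last instruction shows the role of the per-slot NOP: only one operation was left, so the other two slots are padded.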
Figure 5 defines a long-instruction-word architecture
with a mix of built-in 32-bit operations and
new 128-bit operations. It defines one 64-bit
instruction format with three sub-instruction
slots (base_slot, ldst_slot, and alu_slot). The
description takes advantage of the Xtensa processor’s
predefined RISC instructions, but also defines
a large new register file and three new ALU operations
on the new register file:
1: length ml64 64 {InstBuf[3:0] == 15}
2: format format1 ml64 {base_slot, ldst_slot, alu_slot}
3: slot_opcodes base_slot {ADD.N, ADDX2, ADDX4, SUB, SUBX2, SUBX4,
   ADDI.N, AND, OR, XOR, BEQZ.N, BNEZ.N, BGEZ, BEQI, BNEI, BGEI, BNEI,
   BLTI, BEQ, BNE, BGE, BLT, BGEU, BLTU, L32I.N, L32R, L16UI, L16SI,
   L8UI, S32I.N, S16I, S8I, SLLI, SRLI, SRAI, J, JX, MOVI.N}
4: regfile x 128 32 x
5: slot_opcodes ldst_slot {loadx, storex} /* slot does 128b load/store */
6: immediate_range sim8 -128 127 1 /* 8 bit signed offset field */
7: operation loadx {in x *a, in sim8 off, out x d} {out VAddr, in MemDataIn128} {
8:   assign VAddr = a + off; assign d = MemDataIn128;}
9: operation storex {in x *a, in sim8 off, in x s} {out VAddr, out MemDataOut128} {
10:   assign VAddr = a + off; assign MemDataOut128 = s;}
11: slot_opcodes alu_slot {addx, andx, orx} /* three new ALU operations on x regs */
12: operation addx {in x a, in x b, out x c} {} {assign c = a + b;}
13: operation andx {in x a, in x b, out x c} {} {assign c = a & b;}
14: operation orx {in x a, in x b, out x c} {} {assign c = a | b;}
Figure 5. Mixed 32-bit/128-bit Multi-slot Architecture
Description
The first three lines are identical to those
of Figure 4. The fourth line declares a new register
file 128 bits wide and 32 entries deep. The fifth
line lists the two load and store instructions
for the new wide register file, which can be
found in the second slot of the long instruction
word. The sixth line defines a new immediate
range, an 8-bit signed value, to be used as the
offset range for the new 128-bit load and store
instructions. Lines 7-10 fully define the new
load and store instructions, in terms of basic
interface signals VAddr (the address used to
access local data memory), MemDataIn128 (the
data being returned from local data memory),
and MemDataOut128 (the data to be sent to the
local data memory). The use of 128-bit memory
data signals also guarantees that the local data
memory will be at least 128 bits wide. Line 11
lists the three new ALU operations that can be
put in the third slot of the long instruction
word. Lines 12-14 fully define those operations
on the 128-bit wide register file: add, bit-wise
AND, and bit-wise OR.
With this example, any combination of the 39
instructions (including NOP) in the first slot,
three instructions in the second slot (loadx,
storex, and NOP), and four instructions in the
third slot can be combined to form legal instructions—a
total of 468 combinations. This simplified example
specifies almost enough instructions to densely
populate a long instruction word. The first slot
needs about 21 bits, the second slot only needs
about 19 bits, the third slot needs about 17
bits, and the format/length field requires four
bits, for a total of roughly 61 bits. This
example shows the potential to independently
specify operations to enable instruction-level
parallelism. Moreover, all of the techniques
for improving the performance of individual instructions—especially
fusion and SIMD—are readily applied to
the operations encoded in each sub-instruction.
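The combination count and the fit within a 64-bit word can be verified with a few lines of Python. The per-slot operation counts and approximate bit widths are the ones quoted in the text; the dictionary names are illustrative.

```python
from math import prod

# Operations available in each slot, including the generated NOP.
slot_choices = {"base_slot": 39, "ldst_slot": 3, "alu_slot": 4}
combinations = prod(slot_choices.values())
print(combinations)  # 468

# Approximate encoding widths quoted in the text, in bits.
approx_bits = {"base_slot": 21, "ldst_slot": 19, "alu_slot": 17, "format/length": 4}
fits = sum(approx_bits.values()) <= 64
print(fits)  # True
```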
The compound operation technique, as described
in Figure 6, can be applied within sub-instructions,
but long instruction words also encourage the
encoding of independent operations in different
slots.
1: length ml32 32 {InstBuf[3:0] == 15}
2: format pair ml32 {shift, logic}
3: regfile X 128 4 x
4: slot_opcodes shift {xr_srl, xr_sll}
5: operation xr_sll {in AR a, inout AR b} {} {assign b = b << {a[3:0], 3'h0};}
6: operation xr_srl {in AR a, inout AR b} {} {assign b = b >> {a[3:0], 3'h0};}
7: slot_opcodes logic {xr_or, xr_and}
8: operation xr_and {in X c, inout X d} {} {assign d = d & c;}
9: operation xr_or {in X c, inout X d} {} {assign d = d | c;}
Figure 6. Compound Operation TIE Example—Revisited.
The first two lines define a 32-bit wide instruction,
a new format, and the two slots within that format.
The next line declares a new wide register file.
Lines 4-6 define the instructions (byte shifts)
that can occupy the first slot. Lines 7-9 define
the instructions (bit-wise AND and bit-wise OR)
that can occupy the second slot.
Altogether this TIE example defines four instructions,
representing the four combinations. If these
were the only instructions, the processor generator
would discover that this format requires only
16 bits to encode: 10 bits for the “shift” slot
(two four-bit specifiers for the two AR register
entries, plus one bit to differentiate shift
left from shift right) and 6 bits for the “logic” slot
(two two-bit specifiers for the two X register
entries, plus one bit to differentiate AND from
OR).
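One way to reproduce the 10-bit and 6-bit slot widths is sketched below in Python. It assumes that each slot's opcode field must also distinguish the NOP that the processor generator creates for every slot, so two explicit operations plus NOP need two opcode bits. That assumption is not stated at this point in the text, and the helper names are hypothetical. The same rule also gives about 17 bits for the alu_slot of Figure 5.

```python
def bits_for(n_choices):
    """Minimum number of bits needed to distinguish n_choices values."""
    return max(n_choices - 1, 0).bit_length()


def slot_bits(register_file_sizes, n_operations, include_nop=True):
    """Estimated slot width: one specifier per register operand plus an opcode."""
    opcode_bits = bits_for(n_operations + (1 if include_nop else 0))
    return opcode_bits + sum(bits_for(n) for n in register_file_sizes)


# Figure 6: two AR operands (4-bit specifiers, so 16 entries assumed) per shift,
# two operands from the 4-entry X register file per logic operation.
shift = slot_bits([16, 16], n_operations=2)
logic = slot_bits([4, 4], n_operations=2)
print(shift, logic, shift + logic)  # 10 6 16

# Without the NOP encoding, the same operands would need one bit less per slot.
print(slot_bits([16, 16], 2, include_nop=False))  # 9
print(slot_bits([4, 4], 2, include_nop=False))    # 5

# Figure 5 alu_slot: three operands from a 32-entry file, three operations.
print(slot_bits([32, 32, 32], n_operations=3))    # 17
```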