TIE - The Fast Path to High Performance Embedded SOC Processing

"All processors are created equal" - Not! Few ASIC, SOC, and system designers believe this statement but most really don't think through the implications of that disbelief. Consider this pair of sentences:

"A key differentiator among processor architectures is how many instructions are issued and executed per clock cycle. The number of instructions executed in parallel, and the amount of work accomplished by each, directly affect the processor's level of parallelism, which in turn affects the processor's speed."

BDTI's Jennifer Eyre published those words in the June 2001 issue of IEEE Spectrum and they are most certainly true. Applied parallelism produces efficiency that allows a processor either to complete more work at a given clock rate or to complete a given amount of work at a reduced clock rate. Most tasks contain a certain amount of parallelism, primarily in the form of data-level and instruction-level parallelism.

Data-level parallelism, which allows the processor to apply the same operations to individual pieces of independent data, can easily be exploited by appropriate SIMD (single-instruction, multiple-data) function units. Instruction-level parallelism can be exploited through several architectural schemes that permit the processor to execute multiple operations at the same time. One way to execute multiple independent instructions simultaneously is to use a VLIW (very long instruction word) architecture that permits a compiler to encode the multiple operations into one long instruction. Another way to execute multiple operations (usually dependent operations) in one clock cycle is to fuse the operations into one instruction. A MAC (multiply/accumulate), which fuses a multiplication operation with an addition operation, is a very common version of a fused DSP operation.
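As a rough illustration of the fused operation just described, the following C sketch models a MAC step and uses it in a dot product. The names mac16 and dot16 are hypothetical and chosen only for this example; the code models the arithmetic, not any particular processor's MAC unit.

```c
#include <stdint.h>
#include <stddef.h>

/* One multiply/accumulate step: a multiplication fused with an addition.
   A processor with a MAC unit can perform this as a single operation.
   (mac16 is a hypothetical name used only for illustration.) */
static int32_t mac16(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* Dot product built from repeated MAC steps. The caller is assumed to
   keep the running sum within the range of a 32-bit accumulator. */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac16(acc, a[i], b[i]);
    return acc;
}
```

Without a fused operation, each loop iteration needs a separate multiply and a separate add; with one, the two dependent operations collapse into a single step.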

Both computational and I/O operations benefit when designers exploit the parallelism inherent in most embedded tasks. Combining computational and I/O operations further exploits the available parallelism. RTL hardware designers routinely exploit task parallelism in their hardware designs, but throw in a processor and most ASIC and SOC designers give up. Why? Because they regard processor cores as unchangeable, inviolate blocks. There's no mystery as to why this attitude exists. For the first 40 years or so of the microprocessor's existence, most designers have had to use processor architectures as delivered. Only a few specially trained designers could custom-build processors tailored to run targeted applications efficiently.

However, the advent of the customizable processor core changes the game. A much broader range of designers, who need not be processor designers, can now develop processor cores that are tailored to specific tasks. These processor cores deliver the same or nearly the same level of performance as custom-built RTL blocks but this high level of performance is accompanied by several significant advantages such as:

  • Firmware programmability to accommodate changes in requirements, specifications, and standards.
  • Correct-by-construction hardware assembly, which greatly reduces the need for slow hardware verification.
  • Fast, high-level system simulation through instruction-set simulators running the actual C application code.

These advantages deliver a fast, programmable piece of hardware that quickly executes target tasks while providing retargetable flexibility through firmware programmability.

Tensilica offers a family of customizable Xtensa processor cores that allow ASIC and SOC designers to quickly develop extremely efficient function blocks for their SOC designs. Designers can customize existing Xtensa processor cores with a wide range of click-box options including hardware multipliers, dividers, and DSP function units. Checking a click box instantly adds the desired functions and makes the appropriate changes to the software-development tools delivered with the processor hardware RTL.

However, there's an even more powerful way to optimize an Xtensa processor: through a processor-description language called TIE, Tensilica's Instruction Extension language. TIE is a simple way to make Xtensa processor cores faster and more efficient by adding new task-optimized instructions and I/O interfaces.

This short White Paper introduces the core ideas behind TIE. You'll see that TIE looks a lot like Verilog, but anyone can learn the basics of TIE in a few minutes whether they already know how to write Verilog descriptions or not. Just a few lines of TIE can make a dramatic difference in an Xtensa processor's performance and flexibility for targeted tasks. Xtensa processors with TIE customizations can compute and move data tens or hundreds of times faster than conventional processor cores. As a result, your SOC gets smaller, cheaper, and faster and it will consume less power.

The TIE Operation

The basic statement in TIE is the operation, which defines a new processor instruction by specifying a computation on a set of operands. Each TIE operation has a name, operand lists, and a block of combinational Verilog code that defines how to compute output operands from the input operands.

For example, this operation statement adds the corresponding upper and lower 16-bit halves of two 32-bit register values and returns the two 16-bit sums, packed into one 32-bit value, to a register:

operation add2x16 {out AR z, in AR x, in AR y } {} {
assign z = {x[31:16] + y[31:16], x[15:0] + y[15:0]};}

Let's see what this code does. It treats each of two 32-bit register values as a pair of 16-bit values and adds the corresponding halves, so one instruction adds four 16-bit values and produces two 16-bit sums. This operation effectively doubles the processor's performance when loading, adding, and storing 16-bit values. The name of the operation is add2x16. The two lines of TIE code shown above add a new instruction, add2x16, to the processor hardware and to the associated assembler, and add a new built-in function, add2x16(), to the processor's C programming model. Note that you need not know how to design a processor or its function units and you need not know how to alter an assembler or compiler to add this function to the processor's instruction set. You only need to know the operation that you want to add.
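To make the arithmetic concrete, here is a portable C sketch of what the add2x16 operation computes. The function name add2x16_model is hypothetical; it is not the built-in that the Xtensa tools generate. The sketch assumes that TIE follows Verilog's rule that each sum inside a concatenation is only as wide as its 16-bit operands, so a carry out of the lower sum does not spill into the upper sum.

```c
#include <stdint.h>

/* Reference model (hypothetical name) of the add2x16 operation.
   Each 16-bit lane is added independently. A carry out of either lane
   is discarded, under the assumption stated in the lead-in. */
uint32_t add2x16_model(uint32_t x, uint32_t y)
{
    /* x[31:16] + y[31:16], truncated to 16 bits */
    uint32_t hi = ((x >> 16) + (y >> 16)) & 0xFFFFu;
    /* x[15:0] + y[15:0], truncated to 16 bits */
    uint32_t lo = ((x & 0xFFFFu) + (y & 0xFFFFu)) & 0xFFFFu;
    /* Concatenate the two sums, upper lane first */
    return (hi << 16) | lo;
}
```

For example, add2x16_model(0x00010002, 0x00030004) yields 0x00040006: the upper halves 1 and 3 sum to 4, and the lower halves 2 and 4 sum to 6.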

The operation name - add2x16 - is followed by a list of operands: {out AR z, in AR x, in AR y}. This list contains the explicit operands, the ones that a programmer will need to use when writing C or assembly code. In this case, there are two input operands (x and y) to be read from the processor's general-purpose AR register file, and one output operand (z) that is written back into the AR register file after the computation is made. A list of implicit operands ({}) follows this first list of input and output operands. In the above example, this second operand list is empty because this simple new operation doesn't employ any additional state registers or interfaces.

The actual computation description

assign z = {x[31:16] + y[31:16], x[15:0] + y[15:0]};

follows the explicit and implicit operand lists. As you can see in this example, you simply assign a computed value to each output operand, using the input operands. You can also define intermediate results as wires (a concept borrowed from Verilog) and compute them with any of Verilog's logical, arithmetic, and bit-manipulation operators.

You can do a lot with TIE with just the operation statement because the Xtensa processor generator does so much of the work for you. It adds the newly defined operations to the processor's hardware description including the automatic addition of logic for decoding the instruction and implementing the new function. It also extends the C compiler, assembler and debugger, so that the new operation becomes native to all the software tools.
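The following C sketch suggests how application code might use such an operation to process two 16-bit elements per loop iteration. Because the generated built-in is only available with the Xtensa tools, the sketch substitutes a software stand-in, add2x16_sw, which is a hypothetical name; the helper add_packed16 and the packed array layout are likewise assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Software stand-in (hypothetical) for the add2x16 built-in.
   It adds the low 15 bits of each lane, which cannot carry across the
   lane boundary, then restores bit 15 of each lane with an XOR. */
static uint32_t add2x16_sw(uint32_t x, uint32_t y)
{
    uint32_t low15 = (x & 0x7FFF7FFFu) + (y & 0x7FFF7FFFu);
    return low15 ^ ((x ^ y) & 0x80008000u);
}

/* Adds two arrays of 16-bit values stored two per 32-bit word.
   Each iteration handles two elements, so the loop runs half as many
   times as an element-by-element version. */
void add_packed16(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                  size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = add2x16_sw(a[i], b[i]);
}
```

The stand-in needs several conventional instructions per word, whereas the custom instruction performs the same packed addition in one operation.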
