Configurable Processors:
What, Why, How?

A New Kind of Processor for Complex SOC Designs

A new type of processor core has been getting a lot of attention lately – a processor you can tailor for a specific application. Configurable processors are much faster and can do much more than standard embedded microprocessors. Some can even replace hand-coded RTL in ASICs and SOCs.

What is a configurable processor? What can configurable processors do? Why would anyone want to use this type of processor? How can a configurable processor replace RTL coding? How does an engineer design with configurable processors? These questions and more are answered in this article.

Standard embedded processor cores

First, consider standard, fixed-ISA (instruction set architecture), embedded microprocessor and DSP cores. Popular fixed-ISA embedded microprocessor architectures, including the ARM, MIPS and PowerPC processors, were originally designed as stand-alone chips in the 1980s. Over time, they’ve become faster and additional computational resources have been added so the processor can perform more work per clock. These architectures are good at executing a wide range of algorithms, but designers often have to speed up critical portions of the design in hardware. Even DSP architectures must be designed to provide adequate performance when executing a wide range of algorithms, so they can’t match the speed of a custom-tailored solution.

Hand-coding RTL to speed up designs

Because many applications, especially demanding multimedia and communications applications, just don’t run fast enough on standard embedded microprocessors even with the extra performance boost of an embedded DSP, engineering teams hand-code parts of the design in Verilog or VHDL to achieve the performance they need. However, custom RTL logic for complex functions takes a long time to design and verify. In addition, hand-coded RTL blocks are often too rigid to change once they’re designed, yet changes are often needed to accommodate changing standards or new product features.

A closer look at the makeup of a typical RTL block gives insight into this paradox. Figure 1 shows the RTL datapath on the left and the block’s state machine on the right.


Figure 1. Hardwired RTL = Datapath plus State Machine

In most RTL designs, the datapath consumes the vast majority of the gates in the logic block. A typical datapath may be as narrow as 16 or 32 bits, or hundreds of bits wide. The datapath will typically contain many data registers and will often have significant blocks of RAM or interfaces to RAM that is shared with other RTL blocks.

By contrast, the RTL logic block’s finite state machine contains nothing but control details. All the nuances of the sequencing of data through the datapath, all the exception and error conditions, and all the handshakes with other blocks are captured in this subsystem of the RTL logic block. Because of its complexity, this state machine embodies most of the design and verification risk.

A late design change made to an RTL block is much more likely to affect the state machine than the structure of the datapath. Configurable, extensible processors (a fundamentally new form of microprocessor) provide a way of reducing the risk of state-machine design by replacing hard-to-design, hard-to-verify state-machine logic blocks with pre-designed, pre-verified processor cores and application firmware.

The Promise of Configurable Processors

The growth in the use of many large RTL blocks for SOC designs causes the well-recognized “SOC design gap” to widen every year. This gap arises between the explosive growth in chip complexity and the somewhat slower growth in designer productivity. The trend towards high-performance, low-power systems (e.g. long-battery-life cell-phones, four-mega-pixel digital cameras, fast and inexpensive color printers, digital HDTVs, and 3D video games) is increasing the size of SOC designs as well as the SOC design gap.

Hardwired RTL design has many attractive characteristics—small die area, low power, and high-throughput. However, the liabilities of RTL (difficult design, slow and difficult verification, and poor scalability to complex problems) are starting to dominate as chip gate counts become enormous. Configurable processors are now a viable replacement for complex RTL.

What is a Configurable Processor?

A full-featured configurable processor toolkit consists of a pre-defined processor core and a design-tool environment that permits significant adaptation of that base processor design for specific application requirements. Typical forms of configurability include additions, deletions, and modifications to memories, external bus widths and handshake protocols, and commonly used processor peripherals.

Extensible processors, an important superset of configurable processors, provide system designers with the ability to add instructions to the processor that may have never been considered or imagined by designers of the original architecture. The addition of highly customized instructions matched perfectly to a specific application gives configurable processors the ability to deliver performance levels rivaling RTL while gaining the benefits of pre-verified IP (intellectual property). Configurable processors are delivered as RTL code that is synthesized into an FPGA or SOC design. The best configurable processors also come with matching software development tools that reflect the hardware instructions added through designer-defined architectural extensions.

A configurable processor can implement datapath operations that closely match those of RTL functions. The equivalent datapaths are implemented using the integer pipeline of the base processor, plus additional execution units, registers, and other functions added by the chip architect for a specific application.

The Tensilica Instruction Extension (TIE) language, a simplified version of Verilog, is one such design tool: it allows system developers to extend Tensilica’s Xtensa 32-bit processor architecture for specific applications. TIE is optimized for high-level specification of datapath functions in the form of instruction semantics and encoding. A TIE description is both simpler and much more concise than RTL because it omits all sequential logic descriptions, including state machine descriptions, pipeline registers, and initialization sequences. These complex items are instead developed in firmware.

The new processor instructions and registers described in TIE are available to the firmware programmer via the same compiler and assembler that target the processor’s base instructions and register set. All operation sequencing within the processor’s datapaths is controlled by firmware, through the processor’s existing instruction-fetch, decode, and execution mechanisms. State-machine firmware can usually be written in a high-level language such as C or C++ because of the high performance provided by tailored microprocessor architectures.
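To make the idea of state-machine firmware concrete, here is a minimal sketch in C. It is illustrative only: `custom_accumulate` is a hypothetical stand-in for a designer-defined instruction (written as ordinary C so the sketch compiles anywhere), and the states, the header-then-payload framing, and the `block_ctx` layout are assumptions made for this example, not anything taken from an Xtensa design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a designer-defined datapath instruction.
 * On an extended processor the compiler would emit a custom opcode;
 * here it is plain C so the example runs on any host. */
uint32_t custom_accumulate(uint32_t acc, uint32_t word)
{
    return acc + word;
}

typedef enum { ST_IDLE, ST_PAYLOAD, ST_DONE, ST_ERROR } state_t;

typedef struct {
    state_t  state;
    size_t   remaining;  /* payload words still expected */
    uint32_t acc;        /* running result of the datapath operation */
} block_ctx;

/* One control step. Sequencing lives in a switch statement that can be
 * changed by recompiling, rather than in hardwired state transitions. */
void block_step(block_ctx *ctx, uint32_t word)
{
    switch (ctx->state) {
    case ST_IDLE:                    /* word is a header: payload length */
        if (word == 0) {
            ctx->state = ST_ERROR;   /* error handling is just a branch */
        } else {
            ctx->remaining = word;
            ctx->acc = 0;
            ctx->state = ST_PAYLOAD;
        }
        break;
    case ST_PAYLOAD:                 /* word is data for the datapath */
        ctx->acc = custom_accumulate(ctx->acc, word);
        if (--ctx->remaining == 0)
            ctx->state = ST_DONE;
        break;
    case ST_DONE:
    case ST_ERROR:
        break;                       /* hold until the context is reset */
    }
}

/* Reset the context, feed n words through the state machine, and
 * return the final state. */
state_t block_run(block_ctx *ctx, const uint32_t *words, size_t n)
{
    assert(ctx != NULL && (words != NULL || n == 0));
    ctx->state = ST_IDLE;
    ctx->remaining = 0;
    ctx->acc = 0;
    for (size_t i = 0; i < n; i++)
        block_step(ctx, words[i]);
    return ctx->state;
}
```

Calling `block_run` on the words `{3, 10, 20, 30}` ends in `ST_DONE` with `acc` equal to 60. A late change to the framing or error handling would touch only the `switch` statement, not the datapath operation.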

Configurable Processors as RTL Alternatives

Configurable processors used as RTL replacements routinely use the same datapath structures as traditional RTL blocks: deep pipelines, parallel execution units, task-specific state registers, and wide data buses to local and global memories. These extended processors can sustain the same high computation throughput and support the same data interfaces as typical RTL designs.

Control of configurable-processor datapaths, however, is very different from that of their RTL counterparts. Cycle-by-cycle control of a processor’s datapaths is not fixed in hardwired state transitions but is embodied in firmware executed by the processor (shown in Figure 2). Control-flow decisions occur in branches; memory references are explicit in load and store operations; computations are explicit sequences of general-purpose and application-specific computational operations.


Figure 2. Programmable hardware function: datapath + processor + software

The migration from RTL hardwired state machine to configurable processors with firmware control has many important implications:

Flexibility: Chip developers and system builders can change a block’s function just by changing the firmware, even after the product has shipped.

Software-based development: Developers use relatively fast and low-cost software tools to implement most chip features.

Faster, more complete system modeling: For a 10-megagate design, even the fastest software-based logic simulator may not exceed a few cycles per second. By contrast, firmware simulations for extended processors run on instruction-set simulators at hundreds of thousands or millions of cycles per second.

Unification of control and data: No modern system consists solely of hardwired logic. There’s always a processor running software. Moving RTL-based functions into a processor removes the artificial separation between control and data processing.

Time-to-market: Moving critical functions from RTL to configurable processors simplifies SOC design, accelerates system modeling, and speeds hardware finalization. Firmware-based state machines easily accommodate changes to standards because implementation details aren’t “cast in stone.”

Designer Productivity: Most importantly, migration from RTL-based design to application-specific processors boosts the engineering team’s productivity by reducing the engineering manpower needed for both RTL development and verification. A processor-based SOC design approach cuts the risk of fatal logic bugs and permits graceful recovery when (not if) a bug is discovered.

The benefit of being able to make changes in software rather than hardware with a processor-based approach cannot be overstated. Configurable processors reduce the risk of state-machine design by replacing hard-to-design, hard-to-verify state-machine logic blocks with pre-designed, pre-verified processor cores and application firmware.

The Key: Automatic Hardware and Software Generation

The first configurable processors were introduced in the mid-1990s and had one important drawback: once instructions were added to the processor, there was no automatic way to make sure the software-development tools could use those instructions. So companies that chose to use configurable processors had to somehow modify the software-development tools by hand.

In early 1999, Tensilica introduced its first Xtensa processor with a major innovation – automatic hardware and software generation. Designers could specify configuration options using an Internet-based browser approach. New, designer-defined instructions were automatically integrated via the Xtensa Processor Generator, which produces a verified hardware implementation as well as tailored versions of all necessary software-development tools including compilers, debuggers, instruction-set simulators, and much more. The software tools are matched perfectly to the configuration, and no extra work is required to match tools and processor.

The response to the Tensilica Xtensa processor and its ability to automatically generate the hardware and software has been strong. Over 60 companies are designing SOCs using Tensilica’s Xtensa processors. Many of these companies use multiple Xtensa processors: some designs employ multiple copies of the processor performing the same tasks, and other designs use different tailored versions of the Xtensa processor to perform a variety of on-chip tasks.

Use Configurable Processors Within the Software Development Process

Now, let’s look at the software development process currently used to develop embedded applications. Figure 3 illustrates a typical flow for developing embedded application software. Design work starts not with the processor but with the algorithm. Application developers generally start with high-level design tools and languages such as C or C++, and they may purchase algorithms that are already developed using those languages. High-level programming languages and other types of development packages allow developers to create, test, and validate primary algorithmic ideas and smaller independent algorithms and sub-algorithms using tools that deal with the algorithm in a state that’s divorced from a particular processor architecture.


Next, the developers translate the main algorithm and sub-algorithms in C to create a portable, processor-independent application code base. C-level simulation, performed on a PC or workstation, then proves that the recoded algorithms perform as expected. After integrating the sub-algorithms and other application software modules into a coherent whole, the entire program (now written in C or C++) is recompiled for a target processor and the resulting application code is tested and profiled.

If the development team is extremely fortunate, the compiled algorithm code executes with the desired speed. More often, once a fixed-ISA processor is selected, application software teams must convert critical sections of code into hand-tuned assembly code to meet project performance goals. The software development team must generally try to closely map the assembly code to the processor by hand. Otherwise, the selected processor will probably end up being too expensive, too fast, or too power hungry for the intended embedded application. Assembly code developers must carefully dovetail their variables into the available registers because there’s no way to add more registers to a fixed-ISA processor if the existing register set proves inadequate.

Fit the Processor to the Algorithm

Configurable processors allow embedded-system developers to create processors specifically tailored to the target algorithms – producing a much better fit between processor and algorithm. Designers can add special-purpose, variable-width registers; specialized execution units; and wide data buses to reach an optimum processor configuration for specific algorithms. These features allow developers to mold the processor’s characteristics to the algorithm instead of trying to force-fit the 10-pound algorithm into the resources available in a 5-pound, fixed-ISA processor or DSP. Consequently, application developers can more rapidly develop systems that meet all performance specifications using configurable and extensible processors than by using off-the-shelf, fixed-ISA microprocessors and DSPs.

As with hand-tuned assembly language, optimization points for a configurable and extensible processor implementation become apparent through code profiling. Optimization targets typically reside within the innermost software loops that execute many thousands or millions of times per second. Reducing the instruction count of the object code inside of these loops produces a huge and positive effect on system performance. The following three examples illustrate the sort of performance improvements algorithm developers can expect when using configurable and extensible processors. (All of the following examples are based on Tensilica’s Xtensa microprocessor.)

Accelerating the FFT

The heart of the decimation-in-frequency FFT algorithm is an operation called the “butterfly,” which resides at the innermost loop of the FFT. Each butterfly operation requires six additions and four multiplications to compute the real and imaginary components of a radix-2 butterfly result. Using the TIE language, it’s possible for a design team to augment the Xtensa processor’s pipeline with four adders and two multipliers so that half of an FFT butterfly can be computed in one cycle.
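The operation count quoted above can be seen in a plain C version of a radix-2 decimation-in-frequency butterfly. This is a minimal sketch, not the TIE implementation: it uses `double` for readability where the hardware described here works on integers, and the `cplx` type and the function name are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } cplx;

/* Radix-2 decimation-in-frequency butterfly:
 *   x = a + b
 *   y = (a - b) * w      (w is the twiddle factor)
 * Cost: 6 additions/subtractions and 4 multiplications. */
void dif_butterfly(cplx a, cplx b, cplx w, cplx *x, cplx *y)
{
    assert(x != NULL && y != NULL);

    /* Sum output: 2 additions */
    x->re = a.re + b.re;
    x->im = a.im + b.im;

    /* Difference terms: 2 subtractions */
    double dr = a.re - b.re;
    double di = a.im - b.im;

    /* Complex multiply by the twiddle: 4 multiplications, 2 additions */
    y->re = dr * w.re - di * w.im;
    y->im = dr * w.im + di * w.re;
}
```

With a = 1 + 2j, b = 3 + 4j, and w = -j, the function yields x = 4 + 6j and y = -2 + 2j. The four multiplications and six additions in the body are the work that the added multipliers and adders spread across pipeline cycles.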

The Xtensa processor’s configurable data-bus interface can be defined to be as wide as 128 bits so that all four real and imaginary integer input terms of each butterfly can be loaded into special-purpose FFT input registers in one cycle. All four computed output components can be stored into memory in one cycle as well. Because the load and store operations for each FFT butterfly require a cycle each, the most cost-efficient approach to the FFT computation is to stretch each FFT half-butterfly computation across two cycles, to occur in parallel with a load operation for a subsequent butterfly and a store operation for a prior butterfly. This approach saves hardware and matches the computational and data-transfer resources.

Practically speaking, it’s very hard to create single-cycle, synthesizable multipliers for SOCs that operate at clock rates of several hundred Megahertz. Although it’s possible to create hard-macro IP multipliers that operate in one clock cycle, SOC designers prefer to use synthesizable IP components whenever possible because such components allow maximum freedom in selecting semiconductor manufacturing processes and vendors. Consequently, it’s much better for the overall chip design to stretch the multiplication across two cycles so that the multiplier is not the critical timing element on the SOC. The additional multiplier latency does not affect throughput in this example and, if necessary, even longer latencies can be accommodated through additional state storage in the butterfly execution unit.

This approach to computing the FFT butterfly adds a SIMD (single-instruction, multiple data) butterfly computation unit to the processor (using fewer than 35,000 gates including the two 24x24-bit multipliers). The performance improvements achieved by using this approach over straight C code, and C code augmented with just the addition of a hardware multiplier (like a traditional DSP), appear in Table 1. The table also shows the code size of the FFT programs with and without the TIE extensions.

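| | C (with software multiplication) | C (with hardware multiplier) | C (with FFT butterfly TIE instructions) | Performance improvement (TIE vs. software multiplication) |
|---|---|---|---|---|
| Code size (bytes) | 430 + libraries | 430 | 158 | |
| Performance (cycles), 128-point FFT | 763,548 | 169,739 | 2,269 | 337x |
| Performance (cycles), 256-point FFT | 1,787,645 | 386,498 | 4,711 | 379x |
| Performance (cycles), 512-point FFT | 3,975,245 | 867,133 | 9,841 | 404x |
| Performance (cycles), 1024-point FFT | 9,241,893 | 1,922,644 | 20,603 | 449x |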

Table 1. Acceleration results from processor augmentation with FFT instructions

Accelerating Viterbi Code

A different signal-processing example, Viterbi decoding, comes from GSM cellular telephony. GSM employs Viterbi decoding to pull information symbols out of a noisy communication channel. This decoding scheme employs “Viterbi butterfly” operations consisting of 8 logical operations (4 additions, 2 comparisons, and 2 selections) and performs 8 Viterbi butterfly operations to decode each symbol in the received digital information stream.
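The butterfly described above is an add-compare-select step, and a scalar C version makes the eight operations explicit. This is a sketch under stated assumptions: the smaller metric is taken to win (a distance-style metric), the two branch metrics are shared across the butterfly in the usual symmetric pattern, and the `acs_result` type and function name are invented for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t metric[2];    /* updated path metrics of the two successor states */
    uint8_t  decision[2];  /* 0 = survivor came from predecessor 0, 1 = predecessor 1 */
} acs_result;

/* One Viterbi butterfly: 4 additions, 2 comparisons, 2 selections.
 * pm0/pm1 are the path metrics of the two predecessor states,
 * bm0/bm1 the branch metrics for the two possible transitions. */
acs_result viterbi_butterfly(uint32_t pm0, uint32_t pm1,
                             uint32_t bm0, uint32_t bm1)
{
    acs_result r;

    /* 4 additions: candidate metrics for each successor */
    uint32_t c00 = pm0 + bm0;  /* predecessor 0 -> successor 0 */
    uint32_t c10 = pm1 + bm1;  /* predecessor 1 -> successor 0 */
    uint32_t c01 = pm0 + bm1;  /* predecessor 0 -> successor 1 */
    uint32_t c11 = pm1 + bm0;  /* predecessor 1 -> successor 1 */

    /* 2 comparisons: which predecessor gives the smaller metric? */
    r.decision[0] = (uint8_t)(c10 < c00);
    r.decision[1] = (uint8_t)(c11 < c01);

    /* 2 selections: keep the surviving metric */
    r.metric[0] = r.decision[0] ? c10 : c00;
    r.metric[1] = r.decision[1] ? c11 : c01;

    return r;
}
```

For `viterbi_butterfly(5, 3, 1, 4)` the candidates are 6 and 7 for the first successor and 9 and 4 for the second, so the surviving metrics are 6 and 4. On a general-purpose processor each of these steps, plus the surrounding loads and stores, costs separate instructions, which is why a single fused butterfly instruction helps so much.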

Typically, RISC processors need 50 to 80 instruction cycles to execute one Viterbi butterfly. A high-end VLIW DSP (TI’s 320C64xx) requires only 1.75 cycles to compute each Viterbi butterfly. The TIE language allows a designer to add a Viterbi butterfly instruction to the Xtensa processor’s ISA. This design uses the processor’s configurable 128-bit I/O bus to load 8 symbols at a time, adds the pipeline hardware shown in Figure 4, and results in an average butterfly execution time of 0.16 cycles per butterfly. An unaugmented Xtensa processor executes each Viterbi butterfly in 42 cycles, so the butterfly execution hardware (approximately 11,000 added gates) achieves a speed improvement of roughly 250x over the out-of-the-box Xtensa processor.

Figure 4. Detail of Viterbi butterfly augmentation

Accelerating an MPEG-4 Decoder

The third example of achieving performance through instruction extension and parallel operation execution, MPEG-4, comes from the video world. One of the most difficult parts of encoding MPEG-4 video data is motion estimation, which requires the ability to search adjacent video frames for similar pixel blocks. The search algorithm’s inner loop contains a SAD (sum of absolute differences) operation consisting of a subtraction, an absolute value, and the addition of the resulting value to the previously computed value.

For a QCIF (quarter common image format) image frame, a 15 frames/s image rate, and an exhaustive-search motion-estimation scheme, SAD computation requires slightly more than 641 million operations/s. As shown in Figure 5, it’s possible to use TIE to add SIMD (single instruction, multiple data) SAD hardware that executes a SAD instruction on 16 pixels per cycle. (Note: Using the Xtensa processor’s 128-bit maximum bus width, it’s possible to load 16 pixels’ worth of data in one instruction.)
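The following C sketch contrasts the scalar SAD inner loop with a software model of a 16-pixel-wide SAD step. It is illustrative only: `sad16` models in ordinary C what a single SIMD instruction would compute, and the function names, the 8-bit pixel type, and the `SAD_LANES` constant are assumptions made for this example rather than details of the Tensilica design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAD_LANES 16  /* pixels handled per modeled SIMD step */

/* Scalar inner loop: three operations per pixel. */
uint32_t sad_scalar(const uint8_t *cur, const uint8_t *ref, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)cur[i] - (int)ref[i];  /* subtraction */
        if (diff < 0)
            diff = -diff;                      /* absolute value */
        sum += (uint32_t)diff;                 /* accumulation */
    }
    return sum;
}

/* Software model of one 16-wide SAD step. The added hardware would do
 * all 16 lanes, and all three operations per lane, in a single cycle;
 * here the lanes are simply looped over. */
uint32_t sad16(uint32_t acc, const uint8_t *cur, const uint8_t *ref)
{
    for (int lane = 0; lane < SAD_LANES; lane++) {
        int d = (int)cur[lane] - (int)ref[lane];
        acc += (uint32_t)(d < 0 ? -d : d);
    }
    return acc;
}

/* Block SAD built from 16-wide steps: one modeled step per 16 pixels. */
uint32_t sad_block(const uint8_t *cur, const uint8_t *ref, size_t n)
{
    assert(n % SAD_LANES == 0);  /* sketch handles whole 16-pixel groups only */
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i += SAD_LANES)
        acc = sad16(acc, cur + i, ref + i);
    return acc;
}
```

Both `sad_scalar` and `sad_block` return the same sum for the same input. The difference is in the loop trip count: the scalar version iterates once per pixel with three operations each, while the 16-wide version iterates once per 16 pixels.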

The combination of executing all three SAD component operations (subtraction, absolute value, addition) in one cycle and the SIMD operation that computes the values for all 16 pixels in one clock cycle reduces the 641 million operations/s requirement to 14 million instructions/s, a substantial reduction. This MPEG-4 motion-estimation accelerator is part of an entire MPEG-4 decoder demonstration vehicle developed by Tensilica using Xtensa technology. The MPEG-4 decoder adds approximately 92,000 to 112,000 gates to the base Xtensa processor and implements a 2-way QCIF video codec operating at 15 frames/s or QCIF MPEG-4 decoding at 30 frames/s, using approximately 30 MIPS for either operational mode.
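As a rough check on the reduction quoted above, assume it comes only from fusing the three SAD component operations into one instruction and processing 16 pixels per instruction, with loop and load overhead ignored. Under that assumption the arithmetic is:

```latex
\[
\frac{641 \times 10^{6}\ \text{operations/s}}{3 \times 16}
\approx 13.4 \times 10^{6}\ \text{instructions/s}
\]
```

This is consistent with the figure of roughly 14 million instructions/s.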

Figure 5. MPEG4 SIMD SAD (sum of absolute differences) instruction execution hardware

Motion estimation is not the only algorithm within the MPEG4 decoder that can benefit from acceleration. Other algorithms that can be accelerated include variable-length decoding, iDCT, bitstream processing, dequantization, AC/DC prediction, color conversion, and post filtering. When instructions are added to accelerate all of these MPEG-4 decoding tasks, creating an MPEG-4 SIMD (single-instruction, multiple-data) engine within the tailored processor, the results can be quite surprising, as shown in Table 2.

| Video clip | Original MPEG-4 decoder performance | Optimized MPEG-4 decoder performance | Clock frequency (15 frames/s) | TIE speedup |
|---|---|---|---|---|
| Miss America | 3.126 G cycles | 76.81 M cycles | 7.7 MHz | 40.1x |
| Suzie | 3.389 G cycles | 102.19 M cycles | 10.3 MHz | 33.2x |
| Foreman | 10.045 G cycles | 359.5 M cycles | 13.5 MHz | 27.9x |
| Car Phone | 9.222 G cycles | 308.7 M cycles | 12.2 MHz | 29.9x |
| Monsters Inc. | 29.327 G cycles | 822.8 M cycles | 8.6 MHz | 35.6x |

Table 2. MPEG-4 decoder acceleration results from processor augmentation with MPEG-4 SIMD instructions

As Table 2 shows, the resulting SIMD engine acceleration drops the number of cycles required to decode the MPEG-4 video clips from billions to millions and the required processor operating frequency by roughly 30x to around 10MHz. Without the additional acceleration instructions, the processor would need to run at roughly 300MHz to perform the MPEG4 decoding. There is a substantial difference in power dissipation and process technology cost between a 10MHz processor and one that runs at 300MHz. In addition, it’s unlikely that any amount of assembly language coding could produce similarly large drops in the clock rate.

As the three examples above show, configurable and extensible microprocessor cores make it possible to accelerate embedded algorithms by tailoring the processor to the specific algorithm instead of resorting to assembly language coding or RTL hardware design. The advantage of using extensible processors is that designers can add precisely the resources (special-purpose registers, execution units, and wide data buses) required to achieve the desired algorithmic performance instead of attempting to shoehorn algorithms into the computational assets of a fixed-ISA processor.

This design approach does not require that the members of the design team become processor designers. It only requires that the design team be able to profile existing algorithm code and to find the critical inner loops in that profiled code (two tasks they already do), and then define new processor instructions that will accelerate these critical loops. Only the last task differs from the existing software development process currently employed by many embedded system developers.

The result of this new approach is to greatly accelerate algorithm performance, often far beyond the abilities of today’s most advanced fixed-ISA microprocessors and DSP cores. In most cases, designers can replace entire RTL blocks with configurable processors tuned for the exact application, saving valuable design and verification time and adding an extra level of flexibility because of the inherent programmability of this approach.
