
Build Your Own Solution

The ability to process audio and video streams quickly and efficiently has become extremely important in the past few years, as portable consumer electronics demand more and more multimedia features. Standard RISC processor cores cannot process these streams efficiently, so system designers have been forced to choose between hard-wired macro blocks and specialized, programmable video-centric DSPs. Now there is a better way: Tensilica's Xtensa processors.

Building Audio and Video Engines using Xtensa Processors

The Xtensa processor has been used to build audio and video engines that are shipping in volume today. Beyond the programmability that any processor-based multimedia platform offers, the Xtensa processor has a unique advantage: designers can add audio- and video-specific instructions, execution units, and register files.

Advantages of Xtensa-based Audio and Video Engines

Xtensa technology is ideal for creating multi-standard audio and video engines for three main reasons:

1. Programmability: Programmable audio and video engines offer the flexibility not only to support a range of audio and video standards, but also to fix bugs post-silicon, deliver software updates as standards evolve, port new standards as they are published, and port proprietary algorithms for functions such as motion estimation and DRM.

2. Flexibility of Tensilica technology: The base Xtensa RISC processor core can be configured and extended into a multi-issue VLIW DSP with audio- and video-specific instructions. Designers can create complex instructions that fuse multiple operations (such as a multiply-add-saturate) and can create SIMD execution units that operate on multiple audio/video data elements at the same time. Designers can also instantiate new input and output ports (GPIO) and queues (FIFO interfaces) that dramatically increase I/O bandwidth through the processor and remove the memory bottleneck created by the system bus.

3. Extensibility: Designers can create video-specific instructions and execution units that can be used by multiple codecs. Additionally, these instructions and execution units can be used to accelerate control-dominated code (e.g., CABAC) and data-dominated code (e.g., SAD, motion estimation). Similarly, the designer can accelerate novel algorithms by creating new custom instructions for them.
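As an illustration of the fused operations mentioned in item 2, the C function below models what a multiply-add-saturate instruction computes in a single step. It is a reference sketch only, not TIE: the operand widths (16-bit inputs, 32-bit saturating accumulator) and the name `mac_sat32` are assumptions chosen for the example.

```c
#include <stdint.h>

/* Hypothetical reference model of a fused multiply-add-saturate operation:
 * returns acc + a*b, clamped to the signed 32-bit range.
 * In hardware this is one instruction; here it is plain C. */
static int32_t mac_sat32(int32_t acc, int16_t a, int16_t b)
{
    /* Widen to 64 bits so neither the product nor the sum can overflow. */
    int64_t sum = (int64_t)acc + (int64_t)a * (int64_t)b;

    /* Saturate instead of wrapping around. */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

On a plain RISC core, the multiply, add, and two clamp comparisons would be separate instructions; fusing them is what lets one instruction replace the whole function body.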

The TIE (Tensilica Instruction Extension) language is a cross between C and Verilog and is a very powerful mechanism for creating new execution units and the corresponding instructions. Designers only have to verify the functionality of the new operation specified in TIE; Tensilica tools generate pre-verified RTL for the processor that is fully pipelined and interlocked. Implementing the same video units directly in RTL would require considerably more design and verification effort and time. Tensilica tools also automatically generate a new software tool chain that incorporates the new instructions: the compiler schedules the new instructions and allocates the designer-defined registers and register files, the debugger displays the designer-defined registers, and the cycle-accurate instruction set simulator simulates the new instructions.

Tensilica Xtensa LX2 processors can simplify the design of a video stream encoder or decoder by providing wide 128-bit buses, multiple instruction issue per cycle (FLIX), multiple-processor support, and other features. Xtensa processors can also process video streams in a pipelined fashion, with one processor dedicated to each “stage”, such as motion estimation or quantization. This imaging pipeline is similar to a RISC CPU pipeline: each stage (one Xtensa processor) performs the same video-processing sub-task on every portion of the stream that passes through it, while the other stages work on other portions in parallel.
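The stage-per-processor arrangement can be modeled in C as a minimal sketch. Two hypothetical stages, `stage1_residual` and `stage2_quantize` (placeholder arithmetic, not real codec stages), stand in for two processors connected by a bounded FIFO queue, and the loop advances both stages once per simulated cycle. All names, the FIFO depth, and the stage arithmetic are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4   /* assumed queue depth between the two "processors" */

typedef struct {
    int32_t data[FIFO_DEPTH];
    size_t head, count;
} fifo_t;

/* Returns 0 when full: the producing stage would stall. */
static int fifo_push(fifo_t *q, int32_t v)
{
    if (q->count == FIFO_DEPTH) return 0;
    q->data[(q->head + q->count) % FIFO_DEPTH] = v;
    q->count++;
    return 1;
}

/* Returns 0 when empty: the consuming stage would stall. */
static int fifo_pop(fifo_t *q, int32_t *v)
{
    if (q->count == 0) return 0;
    *v = q->data[q->head];
    q->head = (q->head + 1) % FIFO_DEPTH;
    q->count--;
    return 1;
}

/* Placeholder stage operations (illustrative only). */
static int32_t stage1_residual(int32_t pixel, int32_t prediction) { return pixel - prediction; }
static int32_t stage2_quantize(int32_t residual) { return residual / 4; }

/* Simulate the two-stage pipeline and return the number of cycles used.
 * Within a cycle, stage 2 runs first, so it only sees values that stage 1
 * produced in earlier cycles, as with real pipeline registers. */
static size_t run_pipeline(const int32_t *in, size_t n, int32_t prediction, int32_t *out)
{
    fifo_t q = {{0}, 0, 0};
    size_t produced = 0, consumed = 0, cycles = 0;

    while (consumed < n) {
        int32_t r;
        if (fifo_pop(&q, &r))
            out[consumed++] = stage2_quantize(r);
        if (produced < n && fifo_push(&q, stage1_residual(in[produced], prediction)))
            produced++;
        cycles++;
    }
    return cycles;
}
```

Once the pipeline fills, one result emerges per cycle, so in this two-stage model n inputs take n + 1 cycles rather than the 2n a single processor running both steps back to back would need.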

Creating Audio & Video Specific Data Paths in TIE

A general description of using TIE to create complex, multi-cycle SIMD functional units and custom register files is shown here. The ability to create SIMD functional units is particularly useful for accelerating audio, video and image processing algorithms. The following are further examples of how TIE can be used.

Creating a Complex, Multi-cycle, Multi-operation Instruction using TIE

TIE can be used to create complex multi-cycle instructions that perform multiple operations at the same time. For example, consider the conceptual model of an instruction that accelerates FIR filtering shown in the figure below:

Using TIE for FIR Filtering: 16-bit MAC with Load Address Update

Here, we created a 2-cycle pipelined multiply-accumulate (MAC) instruction that also initiates a load of two 16-bit values and updates the load address, all in one instruction. This meets the FIR filtering requirement of one tap per cycle, and the accompanying load and load-address update ensure that the MAC operation is never starved of data.
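The C sketch below models the fused FIR instruction in software, assuming the semantics described above: each call to the hypothetical `mac_load_update` multiply-accumulates the operands fetched by the previous call, fetches the next sample/coefficient pair, and updates both load addresses. A prologue performs the first load and an epilogue performs the last MAC, so a T-tap output takes T - 1 fused steps plus one final MAC. It illustrates the data flow only and does not model the 2-cycle pipeline timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the fused MAC + load + address-update instruction. */
typedef struct {
    int64_t acc;        /* MAC accumulator */
    int16_t x, h;       /* sample and coefficient fetched by the previous step */
    ptrdiff_t xi, hi;   /* load addresses, kept as array indices */
} fir_state_t;

/* One "instruction": MAC on last cycle's operands, load the next pair,
 * and update both load addresses. */
static void mac_load_update(fir_state_t *s, const int16_t *x, const int16_t *h)
{
    s->acc += (int32_t)s->x * s->h;
    s->x = x[s->xi];
    s->h = h[s->hi];
    s->xi -= 1;   /* samples walked backwards: x[n], x[n-1], ... */
    s->hi += 1;   /* coefficients walked forwards: h[0], h[1], ... */
}

/* y[n] = sum_{k=0}^{taps-1} h[k] * x[n-k]; requires taps >= 1 and n >= taps - 1. */
static int64_t fir_output(const int16_t *x, size_t n, const int16_t *h, size_t taps)
{
    fir_state_t s = {0, x[n], h[0], (ptrdiff_t)n - 1, 1};   /* prologue: first load */
    for (size_t k = 1; k < taps; k++)
        mac_load_update(&s, x, h);                        /* one tap per step */
    s.acc += (int32_t)s.x * s.h;                           /* epilogue: final MAC */
    return s.acc;
}
```

For example, with `x = {1, 2, 3}` and `h = {10, 20, 30}`, `fir_output(x, 2, h, 3)` computes 10·3 + 20·2 + 30·1 = 100.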

Creating a Simple SIMD TIE Instruction

A simple SIMD TIE adder might look like the one shown in the figure below. This SIMD adder treats each of two 32-bit registers as a pair of 16-bit values, adds the corresponding 16-bit values, and stores the two 16-bit sums in a 32-bit register. This is a useful operation in the DCT (discrete cosine transform) algorithm.

Using SIMD TIE Operations for DCT Acceleration
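A software model of the 2-lane adder shows why a dedicated SIMD unit helps: each 16-bit lane must wrap independently, so a carry out of the low lane must not leak into the high lane as it would in an ordinary 32-bit add. The function name `add2x16` and the wrap-around (rather than saturating) lane behavior are assumptions for this sketch.

```c
#include <stdint.h>

/* Hypothetical model of a 2 x 16-bit SIMD add on packed 32-bit words.
 * Each lane is added independently and wraps modulo 2^16. */
static uint32_t add2x16(uint32_t a, uint32_t b)
{
    /* Low lane: the low 16 bits of a + b depend only on the low lanes;
     * masking discards the carry that would spill into the high lane. */
    uint32_t lo = (a + b) & 0x0000FFFFu;

    /* High lane: add the upper halves separately; shifting left by 16
     * in 32-bit unsigned arithmetic drops that lane's own carry-out. */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;

    return hi | lo;
}
```

Here `add2x16(0xFFFF0001u, 0x0001FFFFu)` yields `0x00000000`, whereas a plain 32-bit add gives `0x00010000` because of the carry between lanes. Since two's-complement addition wraps the same way, the function works unchanged for signed 16-bit lanes.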

Creating a Multi-cycle SIMD TIE Instruction

Now suppose we want to accelerate the sum of absolute differences (SAD) algorithm widely used for motion estimation. The outline of this algorithm is shown below.

We first create an operation that computes the subtract-absolute-add in one cycle. Since the SAD algorithm operates on 8-bit pixels, we could create a SIMD functional unit that operates on 16 pixels in parallel. Thus the hardware for this would look like:

SIMD implementation of Sum of Absolute Differences (SAD) for Motion Estimation

All of this can be done with a few lines of TIE code. More details on creating SIMD instructions using TIE are shown here.
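For reference, the computation that such a 16-lane unit performs can be written in C. The sketch assumes a 16x16-pixel block and row strides given in bytes; `sad16_step` is a hypothetical name modeling one invocation of the subtract-absolute-add instruction, which the hardware would complete across all 16 lanes at once rather than in a loop.

```c
#include <stdint.h>
#include <stdlib.h>

#define SAD_LANES 16   /* pixels processed per SIMD step */

/* Model of one 16-lane subtract-absolute-add step: adds |cur[i] - ref[i]|
 * for 16 pixels to the running total. The loop stands in for parallel lanes. */
static uint32_t sad16_step(uint32_t acc, const uint8_t *cur, const uint8_t *ref)
{
    for (int i = 0; i < SAD_LANES; i++)
        acc += (uint32_t)abs((int)cur[i] - (int)ref[i]);
    return acc;
}

/* SAD of a 16x16 block against a candidate reference block:
 * one SIMD step per row, 16 steps in total. */
static uint32_t sad_16x16(const uint8_t *cur, int cur_stride,
                          const uint8_t *ref, int ref_stride)
{
    uint32_t acc = 0;
    for (int row = 0; row < 16; row++)
        acc = sad16_step(acc, cur + row * cur_stride, ref + row * ref_stride);
    return acc;
}
```

A motion-estimation search calls `sad_16x16` once per candidate position and keeps the position with the smallest result, which is why this inner operation dominates the run time.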

Accelerating Control-Dominated Code using TIE

So far, these examples have focused on accelerating data-intensive operations. However, collapsing multiple operations into one instruction is a powerful way to accelerate control code as well. It is particularly useful when the code branches (if-then-else) on state or data values.

For example, consider the CABAC (context-adaptive binary arithmetic coding) algorithm used in the Main Profile of the H.264 standard. CABAC is computationally very expensive and requires up to 800 Mcycles on a RISC processor for a D1 H.264 stream. A flow chart of decoding one decision bin (regular coding mode) in CABAC is shown below.

Accelerating Control-Intensive Code: CABAC

The entire decode of one decision bin can be implemented as a single-cycle TIE instruction. This, coupled with other CABAC-specific TIE instructions and a two-issue VLIW Xtensa configuration, enables a peak decode performance of one bin per cycle for the CABAC algorithm.
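The regular-mode decision-bin decode can be sketched in C using the variable names of the H.264 standard (codIRange, codIOffset, pStateIdx, valMPS, rangeTabLPS, transIdxLPS). The sketch makes visible the data-dependent branches and the renormalization loop that a single TIE instruction collapses into one step. The bit reader and all function and type names are hypothetical, and the two H.264 lookup tables are passed in as parameters rather than reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader; returns 0 past the end of the buffer. */
typedef struct {
    const uint8_t *buf;
    size_t nbits, pos;
} bitreader_t;

static unsigned read_bit(bitreader_t *br)
{
    if (br->pos >= br->nbits) return 0;
    unsigned bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u;
    br->pos++;
    return bit;
}

typedef struct {
    uint32_t range;   /* codIRange */
    uint32_t offset;  /* codIOffset */
} cabac_engine_t;

typedef struct {
    uint8_t state;    /* pStateIdx, 0..63 */
    uint8_t mps;      /* valMPS, 0 or 1 */
} cabac_ctx_t;

/* Decode one decision bin in regular mode.
 * range_lps_tab: rangeTabLPS from the standard, 64 x 4 entries, row-major.
 * trans_idx_lps: transIdxLPS from the standard, 64 entries. */
static unsigned decode_decision(cabac_engine_t *e, cabac_ctx_t *c,
                                const uint8_t *range_lps_tab,
                                const uint8_t *trans_idx_lps,
                                bitreader_t *br)
{
    unsigned bin;
    uint32_t q = (e->range >> 6) & 3;                  /* qCodIRangeIdx */
    uint32_t range_lps = range_lps_tab[c->state * 4 + q];

    e->range -= range_lps;
    if (e->offset >= e->range) {                       /* least probable symbol */
        bin = !c->mps;
        e->offset -= e->range;
        e->range = range_lps;
        if (c->state == 0)
            c->mps = 1 - c->mps;                       /* swap MPS at state 0 */
        c->state = trans_idx_lps[c->state];
    } else {                                           /* most probable symbol */
        bin = c->mps;
        if (c->state < 62)
            c->state++;                                /* transIdxMPS */
    }

    while (e->range < 256) {                           /* renormalization */
        e->range <<= 1;
        e->offset = (e->offset << 1) | read_bit(br);
    }
    return bin;
}
```

A decoder initializes codIRange to 510 and codIOffset from the first 9 bits of the CABAC-coded slice data before the first call. Every step above, including the two-way branch and the variable-length renormalization loop, is sequential, data-dependent work on a RISC core, which is why folding it into one instruction pays off.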
