Microprocessor Report's review of Xtensa LX3
Processor Ports and Queues: Easily Overcome I/O Bandwidth Obstacles in Your Next ASIC or SOC Design
How to Increase ASIC and SOC Computational Performance with Long-Word Processors
Minimize Energy Consumption While Maximizing ASIC and SOC Performance
See Building a Multi-Issue Vector DSP with Configurable-Processor Technology from GSPx 2004.
See BDTI's independent analysis of the Xtensa LX processor with Vectra LX.
See Microprocessor Report's article Applications Define DSP Speed.
The Xtensa LX4 processor excels at traditional CPU and DSP tasks in embedded SOCs, as demonstrated by industry-leading results on the BDTI Benchmarks™ by Berkeley Design Technology, Inc. (BDTI). The Xtensa LX configurable processor core achieved the highest score recorded to date for a licensable processor core: the Xtensa LX BDTIsimMark2000 score of 6150 at 370 MHz is 70% faster than the score for the next-fastest licensable core benchmarked by BDTI, the CEVA-X1620.*
Tensilica’s customers today are already using the Xtensa processor core for a variety of DSP tasks including audio processing, image processing, video processing, and communications channel processing. Additionally, Tensilica offers specialized pre-configured, optimized DSPs for audio, communications and video processing.
The Xtensa LX processor can be used in such a wide variety of applications because Tensilica offers several means of accelerating DSP computations.
For moderate-intensity signal processing applications, a 16-bit multiply-accumulate engine can be added to the base Xtensa LX processor core with just a click of a configuration button in the Xtensa processor generator. The MAC16 option adds a full suite of multiply-accumulate instructions, including auto-incrementing loads and combined multiply-accumulate-load instructions for high-performance computation. These DSP instructions are also 100% compiler supported.
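As a rough illustration of the kind of loop the MAC16 option targets, the C sketch below computes a dot product of 16-bit samples. Each iteration is one 16x16 multiply-accumulate fed by two loads through post-incremented pointers. The function name `dot16` and the 64-bit accumulator are assumptions made for portability, and whether a given compiler maps this loop onto MAC instructions is not guaranteed by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dot product of two 16-bit sample arrays.
 * The loop body is one multiply-accumulate plus two loads whose
 * pointers advance by one element each iteration. */
int64_t dot16(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    int64_t acc = 0;  /* wide accumulator (an assumption) so partial sums cannot overflow */
    while (n--) {
        /* a 16x16 product always fits in 32 bits */
        acc += (int32_t)(*x++) * (int32_t)(*h++);
    }
    return acc;
}
```

Called with the samples of a signal and the coefficients of a filter, the same loop yields one output of an FIR filter.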
See our Embedded Processor Forum 2004 presentation on Vectra LX titled "A Second-Generation High-performance DSP Engine."
The ConnX Vectra LX communications DSP engine can be added to the base Xtensa LX processor core with just a click of a configuration button in the Xtensa LX processor generator. The ConnX Vectra LX engine takes advantage of the FLIX architecture and uses 64-bit instruction words containing three issue slots for ALU, multiply-accumulate, and load/store operations. Design teams interested in modifying the ConnX Vectra LX DSP engine for specific configurations should contact Tensilica. The ConnX Vectra LX engine is fully supported by the entire Tensilica software environment, including advanced auto-vectorization capabilities in the Xtensa C/C++ Compiler (XCC). XCC enables ConnX Vectra LX engine users to reap the benefits of vector processing on a SIMD engine without manual assembly-level coding.
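Auto-vectorizing compilers generally do best with plain C loops that have a simple trip count, unit-stride accesses, and no aliasing between inputs and outputs. The sketch below shows such a loop. The function name `average16` is hypothetical, and the source does not state whether XCC vectorizes this exact loop, so treat it as an example of the coding style rather than a guaranteed result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: element-wise average of two 16-bit sample buffers.
 * `restrict` tells the compiler the buffers do not overlap, which removes
 * one common obstacle to vectorization. */
void average16(int16_t *restrict out,
               const int16_t *restrict a,
               const int16_t *restrict b,
               int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        /* the sum is formed in int, so it cannot overflow; the halved
         * value always fits back into 16 bits */
        out[i] = (int16_t)((a[i] + b[i]) / 2);
    }
}
```

Each iteration is independent of the others, so a SIMD engine can in principle process several elements per instruction.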
| Configuration | Description | Cycle count |
| --- | --- | --- |
| Simple RISC Engine | Minimal configuration Xtensa LX using software multiply | 155,389 cycles |
| Scalar Performance | Base Xtensa LX processor with MUL32 option | 23,633 cycles |
| FLIX Performance | Xtensa LX with Vectra LX option | 994 cycles |
Vectra LX DSP engine really accelerates FFT performance
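The cycle counts in the table translate into speedup ratios with simple division. The sketch below uses a hypothetical helper, `speedup`, for that arithmetic. It compares cycle counts only, so it reflects execution time only under the assumption that all three configurations run at the same clock frequency.

```c
#include <assert.h>

/* Hypothetical helper: ratio of baseline cycles to optimized cycles. */
double speedup(long baseline_cycles, long optimized_cycles)
{
    assert(baseline_cycles > 0 && optimized_cycles > 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}
```

With the table's figures, `speedup(155389, 23633)` is roughly 6.6, `speedup(23633, 994)` is roughly 23.8, and `speedup(155389, 994)` is roughly 156.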
The ConnX Vectra DSP engine is available with one or two Load/Store units.
The ConnX D2 DSP engine is a click-box option for Xtensa LX4 that is ideal for 16-bit communications DSP functions. The ConnX D2 DSP engine delivers outstanding performance from C code. See the ConnX D2 DSP engine pages.
The ConnX Vectra VMB option (Viterbi, 8x20-bit multiply-accumulate, and bit unpacking) is for baseband communications acceleration. It includes instructions targeted at FIR and IIR filtering, bit stream unpacking, and Viterbi trellis decode operations. It is available only when the ConnX Vectra DSP engine is also selected.
The ConnX BBE16 is built around a core vector pipeline made of sixteen 18-bit x 18-bit MACs. These multipliers and the associated adder and multiplexer trees enable operations such as FFT butterflies, parallel complex multiply operations, and signal filter structures.
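To see why a bank of real multipliers suits complex arithmetic, note that one complex multiply decomposes into four real multiplies and two additions. The C model below is only a scalar reference for that arithmetic. The type names, the function name `cmul18`, and the 64-bit result fields are assumptions for illustration, and the sketch says nothing about how the BBE16 actually schedules these operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types: operands are assumed to lie in the signed 18-bit range. */
typedef struct { int32_t re; int32_t im; } cplx18;
typedef struct { int64_t re; int64_t im; } cplx_acc;

static int fits_18bit(int32_t v)
{
    return v >= -131072 && v <= 131071;
}

/* (a.re + j*a.im) * (b.re + j*b.im) using four real multiplies. */
cplx_acc cmul18(cplx18 a, cplx18 b)
{
    assert(fits_18bit(a.re) && fits_18bit(a.im));
    assert(fits_18bit(b.re) && fits_18bit(b.im));
    cplx_acc p;
    p.re = (int64_t)a.re * b.re - (int64_t)a.im * b.im;
    p.im = (int64_t)a.re * b.im + (int64_t)a.im * b.re;
    return p;
}
```

For example, multiplying 3 + 4j by 1 - 2j gives 11 - 2j.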
The ConnX BBE64 was designed for the computationally intensive tasks required for LTE-Advanced communications. It is a high-performance DSP with 64 or 128 simultaneous 18x18-bit MACs/cycle supporting more than 100 billion high-precision multiply/accumulate operations per second.
For designs with one or more signal processing tasks that require acceleration beyond the base features and predefined options of the Xtensa LX, the designer can quickly add instructions and hardware execution units tailored to a specific algorithm.
For example, the "butterfly" operation used in Convolutional Coding / Viterbi Decoding applications is a series of combinational Add-Compare-Select (ACS) operations. If the data in question consists of 8-bit values packed in the standard 32-bit registers of Xtensa LX, a designer can easily add an ACS instruction to the Xtensa LX processor with a small incremental block of execution unit hardware to greatly speed up Viterbi decoding for communications applications.
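The behavior of such an instruction can be modeled in plain C. The sketch below is a reference model only, not TIE code and not Tensilica's implementation. It treats each 32-bit word as four 8-bit lanes, adds a branch metric to a path metric for two competing paths, compares the sums, and keeps the smaller one. The function name `acs4x8`, the saturating unsigned add, and the choice to keep the minimum are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating unsigned 8-bit add (an assumed convention for path metrics). */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical model of a 4-lane Add-Compare-Select on packed 8-bit values.
 * Returns the four surviving metrics packed in one word; bit i of
 * *decisions is set when lane i selected path B. */
uint32_t acs4x8(uint32_t pm_a, uint32_t bm_a,
                uint32_t pm_b, uint32_t bm_b,
                uint8_t *decisions)
{
    assert(decisions != NULL);
    uint32_t survivors = 0;
    uint8_t bits = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        /* Add: candidate metric for each competing path */
        uint8_t cand_a = sat_add_u8((uint8_t)(pm_a >> shift), (uint8_t)(bm_a >> shift));
        uint8_t cand_b = sat_add_u8((uint8_t)(pm_b >> shift), (uint8_t)(bm_b >> shift));
        /* Compare and Select: keep the smaller metric, record the choice */
        int pick_b = cand_b < cand_a;
        uint8_t best = pick_b ? cand_b : cand_a;
        survivors |= (uint32_t)best << shift;
        bits |= (uint8_t)(pick_b << lane);
    }
    *decisions = bits;
    return survivors;
}
```

In software this takes a loop of shifts, adds, and compares per word. A custom instruction would perform the four lanes in parallel in hardware, which is where the speedup described above comes from.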
See our for examples of using TIE for DSP functions.
For applications with well-defined, very high-performance signal processing demands, the TIE language provides a fast means of developing extremely powerful DSP extensions. Add custom registers and register files for unique data types. Create complex multiple-operation instructions and automatically pipeline them into multi-cycle instructions with a single one-line directive in the TIE description. Create SIMD (single instruction, multiple data) instructions to tackle algorithms with native data parallelism. Use software-pipelining techniques to create combined compute-and-load and compute-and-store instructions for high data-rate applications, enabling continuous computation without the performance overhead of separate processor load and store cycles.
Tensilica improved compute performance in the Xtensa LX processor through its innovative FLIX (Flexible Length Instruction Xtensions) architecture. The FLIX architecture is a highly efficient implementation of the Xtensa instruction set architecture (ISA) that gives designers more options for cost/performance tradeoffs. The FLIX technology provides the flexibility to freely and modelessly intermix instructions of various lengths (16-, 24-, or 32-/64-bit). By packing multiple operations into a wide 32- or 64-bit instruction word, FLIX technology allows designers to accelerate a broader class of application "hot spots". FLIX eliminates the performance and code-size drawbacks that can occur when using a one-size-fits-all instruction length. Compared to rigid, high-performance processor designs that either encode only one RISC operation per instruction or use ultra-wide 64-/128-/256-bit VLIW (very long instruction word) formats, FLIX delivers high-performance concurrent execution exactly and only when needed, yet preserves the industry-leading code density advantages of the Xtensa processor's native 16-/24-bit base architecture instruction formats.
* The BDTIsimMark2000™ provides a summary measure of DSP speed. For more information and scores see www.BDTI.com. Scores © 2004 BDTI.