Xtensa LX4
Xtensa LX4 - Customizable Dataplane Cores Tailored for High-Performance DSP with Flexible I/Os
Processors for the SOC Dataplane
In today's complex SOC designs, processors can be found in many places throughout the chip to add programmability for design flexibility. While most processors do a good job with control functions, they often fail miserably at the complex dataplane processing tasks. That's why designers often turn to RTL blocks for the complex "heavy lifting" SOC tasks. The problem with those RTL blocks is that they take too long to design, take even longer to verify, and are not programmable.
What's needed are processors that can be customized for the task at hand with just the functions, registers and datapath required. In order to provide enough data bandwidth to and from other system blocks, the processor must provide direct connectivity with arbitrary widths and predictable latency without using the system bus.
Xtensa LX4: The Basic Building Block for SOC Design
Tensilica's Xtensa LX4 DPU (dataplane processing unit) was designed from the start to be a basic building block in SOC designs. It is ideal for handling complex compute-intensive DSP applications where an RTL implementation may be the only other option. Xtensa LX4 DPUs are configurable and extensible to meet application requirements exactly.
Configurable: You are offered a menu of pre-verified checkbox and drop-down options ranging from memory size and width to complex DSP functions.
Extensible: You can use our TIE (Tensilica Instruction Extension) methodology, based on a Verilog-like language, to implement the datapath elements in the processor pipeline. The control FSM (finite state machine) can be implemented as software running on the processor. Just specify the functional behavior of the new datapath and the RTL is automatically generated, along with the full matching software toolchain and models.

Xtensa LX4: Flexible Direct Connections Allow RTL-like Throughput
Feature Overview
Backwards-compatible ISA since 1999
- Fundamentally architected for extensibility
- Base instruction set of 80 instructions
- All optional blocks are still available
- Any differentiating designer-defined instructions written in 1999 can be re-used today
Optional pre-defined execution units
- 32-bit multiplier and/or 16-bit multiplier and MAC
- Integer divider
- ConnX Vectra LX DSP with 4-way SIMD and one or two Load/Store units
- ConnX D2 DSP
- ConnX BBE16 baseband DSP
- ConnX BBE64 baseband DSP
- HiFi 2 and HiFi EP Audio DSPs
- Single-precision floating point unit
- Double-precision floating point acceleration
Differentiate with designer-defined instructions
- Make your specific algorithm run even more efficiently by adding the instructions it needs
- Development tools automatically adapt for full support
Natural connectivity with RTL blocks
- Multiple custom width I/O ports for peripheral control and monitoring
- Multiple custom width queue interfaces to FIFOs for data streaming into and out of the processor
- Co-simulation with RTL down to the pin level in SystemC
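The queue interfaces listed above behave like bounded FIFOs sitting between the processor and a neighbouring block. The following is only a behavioral sketch in plain Python, not TIE and not any Tensilica API. The class name `BoundedFifo`, the helper `stream`, the depth of 2 and the 24-bit width are all illustrative assumptions. It shows the back-pressure a producer sees when the FIFO is full.

```python
from collections import deque


class BoundedFifo:
    """Behavioral model of a fixed-depth, fixed-width FIFO (hypothetical)."""

    def __init__(self, depth: int, width_bits: int):
        self.depth = depth
        self.mask = (1 << width_bits) - 1  # arbitrary width, e.g. 24 bits
        self._items = deque()

    def can_push(self) -> bool:
        return len(self._items) < self.depth

    def can_pop(self) -> bool:
        return len(self._items) > 0

    def push(self, value: int) -> bool:
        """Return False when the FIFO is full, meaning the producer must stall."""
        if not self.can_push():
            return False
        self._items.append(value & self.mask)  # truncate to the port width
        return True

    def pop(self):
        """Return None when the FIFO is empty, meaning the consumer must stall."""
        return self._items.popleft() if self._items else None


def stream(samples, fifo):
    """Push samples in order, draining one word whenever the FIFO is full."""
    received = []
    for sample in samples:
        while not fifo.push(sample):
            received.append(fifo.pop())
    while fifo.can_pop():
        received.append(fifo.pop())
    return received
```

In this model the data never touches a shared bus: the producer either hands over a word or stalls, which is the property that makes latency predictable.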
Highly configurable interfaces
- Optional processor interface (PIF) to system bus, choice of 32-, 64-, or 128-bit width with in-bound slave DMA option
- Instructions up to 128 bits wide, up to two 512-bit-wide load/stores, and a hardware prefetch unit
- Write buffer: selectable from 1-32 entries
- Optional second data Load/Store unit
- Optional AMBA AXI and AHB-Lite bridges with synchronous or asynchronous clocking
- Choice of 1-, 2- or 4-way cache and/or local memories
- Up to 32 interrupts
Multi-core design style support
- Multi-core system creation, modeling and SystemC co-simulation out-of-the-box, fully supported within the Xtensa Xplorer IDE
- Homogeneous and heterogeneous subsystems supported
- Inter-core on-chip debug with break-in/out control
- Optional 16-bit processor ID
- Conditional store option and synchronization library provide shared memory semaphore operations and the "release consistency model" of memory access ordering
Complete hardware implementation and verification flow support
- Automatic generation of RTL and tailored EDA scripts for leading-edge process technologies, including physical synthesis and 3D extraction tools
- Auto-insertion of fine-grained clock gating for ultra-low power
- Hardware emulation support including automated FPGA netlist generation
- Comprehensive diagnostic test bench
- Formal verification support for designer-defined instructions
High-speed, high-accuracy system simulation models automatically created
- High-speed instruction-accurate simulator for software development
- Pipeline-modeling, cycle-accurate Xtensa instruction set simulator
- Xtensa SystemC (XTSC) transaction-level modeling (TLM) support, including out-of-the-box multi-core simulation
- Hardware co-simulation with RTL in SystemC with Tensilica's pin-level XTSC
Integrated design environment
- Create, simulate, debug and profile whole designs in one tool: Xtensa Xplorer is a high-productivity IDE
- Ninth-generation software development tools target each processor. The advanced Xtensa C/C++ compiler (XCC) includes optimizations for base, optional and designer-defined instructions
- New Vectorization Assistant directs the programmer to areas of the application that can benefit most from modifications to enable better vectorization
- Increase productivity with multi-core subsystem design and simulation support
- Custom data display formatting for easy debug of vector and fixed-point data types as well as bit-mapped status and control
Robust operating system support
- Use Mentor Graphics' Nucleus+, Express Logic's ThreadX, Micrium's uC/OS-II, Tata Elxsi's Ro-SES, or Linux
Unrivaled Performance
The Xtensa LX-based ConnX BBE64-128 DSP can sustain 128 MACs/cycle. Targeted for use in LTE and LTE Advanced, this equates to over 100 Giga-MACs per second in 28nm high-performance process technology.
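The throughput claim above is simple arithmetic: MACs per second equal MACs per cycle times clock rate. The sketch below works out the clock rate implied by "over 100 Giga-MACs per second" at 128 MACs/cycle. The function names are hypothetical, and the 800 MHz figure in the usage note is an assumed example clock, not a number from this page.

```python
def min_clock_mhz(target_gmacs: float, macs_per_cycle: int) -> float:
    """Clock rate in MHz needed to sustain target_gmacs at macs_per_cycle."""
    macs_per_second = target_gmacs * 1e9
    return macs_per_second / macs_per_cycle / 1e6


def sustained_gmacs(macs_per_cycle: int, clock_mhz: float) -> float:
    """Sustained throughput in Giga-MACs per second."""
    return macs_per_cycle * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    print(min_clock_mhz(100, 128))    # clock implied by the 100 GMAC/s claim
    print(sustained_gmacs(128, 800))  # throughput at an assumed 800 MHz clock
```

At 128 MACs/cycle, any clock above 781.25 MHz exceeds 100 Giga-MACs per second; an assumed 800 MHz clock would give about 102.4.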
Imagine what you can do with Xtensa LX4. Here are some representative performance numbers.
| Configuration | Post-Route Area (mm²) | Clock Rate (MHz) | Power Dissipation (mW/MHz) |
| --- | --- | --- | --- |
| Smallest* - Synopsys library, TSMC 40LP, low-power flow | 0.024 | 60 | 0.012 |
| Smallest* - Synopsys library, TSMC 40LP, high-speed flow | 0.044 | 670 | 0.018 |
| Smallest* - Synopsys library, TSMC 45GS, low-power flow | 0.024 | 62 | 0.009 |
| Smallest* - Synopsys library, TSMC 45GS, high-speed flow | 0.044 | 1032 | 0.014 |
| 570T** - Synopsys library, TSMC 40LP, low-power flow | 0.163 | 57 | 0.046 |
| 570T** - Synopsys library, TSMC 40LP, high-speed flow | 0.295 | 493 | 0.093 |
| 570T** - Synopsys library, TSMC 45GS, low-power flow | 0.158 | 58 | 0.034 |
| 570T** - Synopsys library, TSMC 45GS, high-speed flow | 0.283 | 780 | 0.066 |
*Smallest - smallest configuration used by customers with only local instruction and data RAM interfaces and full clock gating.
**570T - similar to Tensilica's Diamond Standard 570T.
See a quick comparison to the Xtensa 8 processor.
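Because power is quoted in mW/MHz, an absolute figure needs one multiplication by the clock rate. The sketch below does this for the four TSMC 40LP rows of the table, with values copied from it. It assumes the mW/MHz figure holds at the listed clock rate, which the page does not state explicitly. The names `ROWS_40LP`, `power_mw` and `power_at_listed_clock` are hypothetical.

```python
# (configuration, post-route area in mm^2, clock in MHz, power in mW/MHz)
# Values copied from the TSMC 40LP rows of the table above.
ROWS_40LP = [
    ("Smallest, low-power flow", 0.024, 60, 0.012),
    ("Smallest, high-speed flow", 0.044, 670, 0.018),
    ("570T, low-power flow", 0.163, 57, 0.046),
    ("570T, high-speed flow", 0.295, 493, 0.093),
]


def power_mw(mw_per_mhz: float, clock_mhz: float) -> float:
    """Absolute power in mW, assuming mW/MHz scales linearly with clock."""
    return mw_per_mhz * clock_mhz


def power_at_listed_clock(rows):
    """Return (configuration, mW) pairs for each row at its listed clock rate."""
    return [(name, power_mw(mw_per_mhz, mhz)) for name, _area, mhz, mw_per_mhz in rows]


if __name__ == "__main__":
    for name, mw in power_at_listed_clock(ROWS_40LP):
        print(f"{name}: {mw:.3f} mW")
```

Under that assumption the smallest 40LP configuration in the high-speed flow comes to roughly 12 mW at 670 MHz, and the 570T-like configuration to roughly 46 mW at 493 MHz.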
Xtensa LX4 Use Models
| Use model | Benefits |
| --- | --- |
| Configurability | Configure your processor to fit your application. Get the options you want and not the ones you don't want. |
| | Choose from a menu of common, pre-optimized data path elements like multipliers and shifters. |
| Extensibility | Add application-specific instructions to accelerate the hot spots in your application. |
| | Add multi-cycle execution units, registers, register files, and SIMD units to create the same datapath as you would in RTL. |
| Designer-defined I/O interfaces | Use TIE Ports and Queues to avoid the bottlenecks of the system bus. |
| | Interface to other RTL blocks and processors using direct wires and FIFOs, as you would if you were using RTL. |
| Lower power | Use application-specific extensions to create a higher-performance processor without increasing frequency and power. |
| | Fine-grained clock gating is generated automatically by the Xtensa Processor Generator. Power savings are higher than with EDA-generated clock gating of manually produced RTL because clock nets are automatically gated off cycle by cycle under program flow execution. There is no risk of introducing bugs while adding clock gating. |
| Lower verification effort | Automatic pre-verified RTL generation, including control logic, bypass logic, and data path elements. |
| | You only have to verify the functional specification of custom instructions and execution units, a significantly lower verification effort than RTL. |
| Flexibility | Extending the processor gives headroom to map more tasks as requirements and standards change, unlike fixed processors that rely on increasing frequency (MHz) to increase capability. |
| | Programmability of the processor means that multiple applications can be mapped to the same SOC, software can be updated as algorithms change, and bugs can be fixed post-silicon. |
| Faster time to market | Spend less time optimizing software or, on the backend, trying to increase frequency. Instead, just accelerate the application using designer-defined instructions. |
| | Lower verification effort and easy scalability by adding more task-optimized processors. |
| Smaller core area and memory area | The base processor configuration is around 15K gates. Also, the 24-bit ISA with 16-bit narrow encodings gives higher code density than conventional RISC and DSP cores and, therefore, a smaller memory area. |
| | Create optimized task engines with little or no area overhead for the processor. |
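The code-density point in the last use model can be made concrete with a back-of-the-envelope estimate. The sketch below compares a mix of 24-bit and 16-bit encodings against a fixed 32-bit encoding. It assumes both encodings need the same number of instructions for a given program, which real code does not guarantee. The 40% narrow-instruction share in the usage note is purely illustrative, and the function names are hypothetical.

```python
def mixed_code_bytes(num_instructions: int, narrow_fraction: float) -> int:
    """Code size with 3-byte standard and 2-byte narrow encodings."""
    narrow = round(num_instructions * narrow_fraction)
    standard = num_instructions - narrow
    return narrow * 2 + standard * 3


def fixed32_code_bytes(num_instructions: int) -> int:
    """Code size when every instruction takes 4 bytes."""
    return num_instructions * 4


def saving_percent(num_instructions: int, narrow_fraction: float) -> float:
    """Relative size reduction versus the fixed 32-bit encoding, in percent."""
    mixed = mixed_code_bytes(num_instructions, narrow_fraction)
    fixed = fixed32_code_bytes(num_instructions)
    return 100.0 * (fixed - mixed) / fixed


if __name__ == "__main__":
    print(mixed_code_bytes(10_000, 0.4), fixed32_code_bytes(10_000))
    print(saving_percent(10_000, 0.4))
```

For 10,000 instructions with an assumed 40% narrow share, the estimate is 26,000 bytes against 40,000 bytes, or 35% smaller. With no narrow instructions at all, the 24-bit encoding alone would still be 25% smaller under the same assumption.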
Xtensa LX4 Solution Diagram

Tensilica offers a complete solution for the Xtensa LX4 processor including automatically generated RTL and EDA scripts, system modeling and design support, Xtensa tools, and the software to optimize the processor for your application.