Tensilica’s Xtensa LX processor achieved the highest score ever reported on the Networking Version 2.0 benchmark suite of the Embedded Microprocessor Benchmark Consortium (EEMBC). Tensilica’s Xtensa LX processor was the first licensable processor core to complete certification on this challenging benchmark suite.
EEMBC benchmark scores, based on simulation, show that an optimized Xtensa LX processor core is significantly faster on a per-MHz basis than the only two other processors certified to date, the 1 GHz PowerPC® 750GX and 1.4 GHz PowerPC MPC7447A, both of which are full-chip, standard product processors. The Xtensa LX processor delivers this outstanding performance while simultaneously delivering a 4X code density advantage and more than a 100X advantage in both die area and power dissipation.
All of today’s leading-edge ASSP and ASIC designs, and a growing number of general-purpose processor designs, employ multiple specialized processing engines on chip, particularly in networking applications and now even in consumer designs. Examples range from Cisco’s performance-leading CRS-1 terabit router, which relies upon the innovative Cisco-designed Silicon Packet Processor built with 188 Tensilica Xtensa processor cores, to the recently announced PlayStation Cell processor and the emerging “dual-core” war in the desktop PC market.
The key attributes needed in a processor core used in a multi-core architecture are: small physical size and low-power (to maximize the number of cores per chip); excellent code density (to minimize the area needed for local instruction and data memories attached to each processor core); communication infrastructure and capabilities (to quickly transfer data); and outstanding application-specific or function-specific performance (so that each core in the design can be dedicated to a specific type of task).
The EEMBC Networking V2 results demonstrate that the Xtensa LX core excels in all four key attributes. [Note that Tensilica’s results are for a single Xtensa LX processor core in a configuration that is representative of how it could be used in an SOC design for a networking application.]
Size & Power: The Xtensa LX processor configuration occupies a mere 1.2 square mm in a reference high-performance 130 nm process technology, using conventional standard-cell implementation techniques (excluding memory area). This core is projected to consume an estimated 115 milliwatts of power when operated at its maximum 304 MHz operating frequency. Contrast that miserly power figure with that of the leading full-chip processor certified by EEMBC Certification Labs (ECL), the Freescale MPC7447A. This full-chip processor consumes 21 W (typical) of power [Freescale website, April 2005]. The 7447A PowerPC chip includes area and power for integrated memories and I/Os that contribute to the 184X greater power dissipation. Even after attributing a generous 40% of the chip area and power to these memories and I/Os, the Xtensa LX processor enjoys a more than 100X advantage in both area and power consumption.
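The power comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (115 mW, 21 W, and a 40% allowance for memories and I/O); the variable names are illustrative. The quotient of the rounded figures comes out at about 183X, slightly under the 184X quoted, which may reflect rounding in the published numbers. Area is not computed because the PowerPC die area is not given here.

```python
# Figures quoted in the text above.
XTENSA_POWER_W = 0.115     # 115 mW projected at 304 MHz (core only)
MPC7447A_POWER_W = 21.0    # typical, full chip
MEM_IO_SHARE = 0.40        # share of the PowerPC chip attributed to memories and I/O

# Ratio of full-chip PowerPC power to Xtensa LX core power.
raw_ratio = MPC7447A_POWER_W / XTENSA_POWER_W

# Ratio after discounting the memory and I/O share from the PowerPC figure.
core_only_ratio = MPC7447A_POWER_W * (1.0 - MEM_IO_SHARE) / XTENSA_POWER_W

print(f"full-chip ratio: {raw_ratio:.0f}X")
print(f"after removing {MEM_IO_SHARE:.0%} for memories and I/O: {core_only_ratio:.0f}X")
```

Even with 40% of the PowerPC power discounted, the remaining ratio stays above 100X, which is the claim made in the text.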
Code Density: The Xtensa LX code size for the EEMBC Networking V2 benchmark has been certified by ECL at 65,208 bytes. The Freescale MPC7447A code size is certified at 280,984 bytes. Tensilica’s Xtensa LX therefore has a better than 4X advantage in code size.
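As a quick check of the code-size ratio, using only the two certified byte counts quoted above (variable names are illustrative):

```python
# Certified total code sizes quoted in the text, in bytes.
XTENSA_LX_BYTES = 65_208
MPC7447A_BYTES = 280_984

code_size_ratio = MPC7447A_BYTES / XTENSA_LX_BYTES
print(f"MPC7447A code is {code_size_ratio:.2f}X larger")
```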
Communication Capabilities: The Xtensa LX processor has unique Queues that allow the designer to bypass the bus entirely, thereby increasing throughput (see discussion of Queues below).
Performance: On a per-MHz basis, the Xtensa LX outperforms the closest competitors – Freescale MPC7447A on the TCPmark of the EEMBC benchmark and the IBM 750GX on the IPmark – by nearly a 3X margin.
The normalized (per MHz) EEMBC TCPmark test scores are:
The normalized (by MHz) EEMBC IPmark test scores are:
(Because EEMBC scores for licensable synthesizable processors, such as the Xtensa LX, are expressed on a “per-MHz” basis, the PowerPC results were normalized to a “per-MHz” basis for this comparison.)
With the Networking 2.0 benchmark, EEMBC simulates real-world networking performance with many different users and differing traffic types. The TCPmark represents processor performance in Internet-enabled, client-side devices. The IPmark represents processor performance in network routers, gateways and switches.
The total code sizes (aggregate bytes of object code) for all twelve benchmark kernels in the Networking Version 2 suite are:
Tensilica made extensive use of custom FLIX (Flexible Length Instruction Xtensions) instructions in the processor configuration tested by ECL. The tested configuration included seven different 64-bit instruction word formats with up to eight parallel operation slots. FLIX is a technology introduced with the Xtensa LX processor that delivers VLIW-style parallel execution without the “code bloat” typically incurred by VLIW-style processors. In fact, the dramatic 4X to 5X speedup achieved by the Optimized Xtensa LX score versus the Out of the Box Xtensa LX score was accompanied by a decrease in total code size of nearly 2%.
In addition to the benefits of FLIX parallelization, which provided application acceleration across all of the 12 benchmark kernels in the EEMBC Networking Version 2 suite of benchmarks, Tensilica selectively employed user-defined TIE (Tensilica Instruction Extension) Queues to dramatically accelerate the IP packet check kernels.
Tensilica’s unique user-defined Queue capability allows SOC designers to bypass the standard processor bus and directly import data into the execution units of an Xtensa LX processor, much in the same way that a dedicated hardware accelerator block would process data in an SOC design. Whereas conventional processors are limited to a maximum data throughput of one 32-bit or 64-bit data read or write every clock cycle [and hence a typical maximum sustainable throughput on streaming network data of one third or less of the peak transfer rate, assuming a read-compute-write-repeat sequence], Xtensa processors with Queues can sustain data rates of one transfer every clock cycle for every Queue port, and with a user-defined bandwidth of up to 1024 bits per cycle. And Tensilica’s patented processor generator technology automatically delivers full C compiler and Instruction Set Simulator support for user-defined Queues.
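The one-third figure above follows from a simple cycle count. The sketch below is an idealized model, not a cycle-accurate simulation: it assumes one bus transfer per cycle, one compute cycle per word, and queues that never stall on empty or full. The function names are hypothetical and only serve the illustration.

```python
def bus_loop_cycles(words: int, compute_cycles: int = 1) -> int:
    """Cycles for a read / compute / write loop sharing a single bus port."""
    return words * (1 + compute_cycles + 1)


def queue_loop_cycles(words: int) -> int:
    """Cycles when each instruction pops an input queue, computes, and
    pushes an output queue in the same cycle (idealized, no stalls)."""
    return words


def sustained_bits_per_cycle(words: int, word_bits: int, cycles: int) -> float:
    """Average number of input bits consumed per clock cycle."""
    return words * word_bits / cycles


WORDS = 1000
bus_32 = sustained_bits_per_cycle(WORDS, 32, bus_loop_cycles(WORDS))
queue_32 = sustained_bits_per_cycle(WORDS, 32, queue_loop_cycles(WORDS))
queue_1024 = sustained_bits_per_cycle(WORDS, 1024, queue_loop_cycles(WORDS))

print(f"32-bit bus loop:   {bus_32:.1f} bits/cycle ({bus_32 / 32:.0%} of peak)")
print(f"32-bit queue:      {queue_32:.1f} bits/cycle")
print(f"1024-bit queue:    {queue_1024:.1f} bits/cycle")
```

Under these assumptions the bus-based loop sustains one third of its 32-bit peak, while a queue of the same width sustains the full rate and a 1024-bit queue scales that by the wider port.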
Custom instructions in an Xtensa LX processor can perform multiple queue operations per cycle, perhaps combining inputs from two input queues with local data and sending the computed values to two output queues. The high bandwidth and low control overhead of Queues allow the Xtensa LX processor to be used in applications with extreme data rates. IP packet manipulation in embedded networking devices is a prime example of such a use of TIE Queues. In an SOC design, a network engineer would normally design custom packet header inspection hardware in order to achieve high-throughput processing of packets. A conventional processor needs too many clock cycles to first read in a full packet and then perform the required header inspection and checksum calculations, so it cannot sustain the throughput rates required of Gigabit and 10 Gigabit systems. Thus custom “accelerator” or “dataplane” hardware is designed to offload the conventional control processor.
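The idea of one instruction touching several queues per cycle can be modeled in software. The following is a behavioral sketch in Python, not TIE code: `fused_op`, the queue names, and the arithmetic it performs are all hypothetical, chosen only to show two input pops, one piece of local state, and two output pushes happening in a single step.

```python
from collections import deque


def fused_op(in_a: deque, in_b: deque, out_sum: deque, out_diff: deque, bias: int) -> None:
    """Model of one hypothetical custom instruction: pop both input queues,
    combine the values with local state, and push to both output queues."""
    a = in_a.popleft()
    b = in_b.popleft()
    out_sum.append((a + b + bias) & 0xFFFFFFFF)   # keep results to 32 bits
    out_diff.append((a - b) & 0xFFFFFFFF)


in_a = deque([10, 20, 30])
in_b = deque([1, 2, 3])
out_sum, out_diff = deque(), deque()

cycles = 0
while in_a and in_b:          # hardware would stall here; the model just stops
    fused_op(in_a, in_b, out_sum, out_diff, bias=100)
    cycles += 1               # one modeled instruction per cycle

print(list(out_sum), list(out_diff), cycles)
```

Each modeled cycle performs four queue transfers with no separate load or store instructions, which is the property the paragraph above describes.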
But with Xtensa LX processors, the custom packet-processing hardware and the control interfaces to ingress and egress channel packet-buffer queues can be integrated into the processor. The result: a stunning 33X speedup of the Xtensa LX on the IP Packet Check portion of the benchmark. To equal the level of performance of the 304 MHz Xtensa LX on the 1MB packet size kernel, the PowerPC would have to run at 6.4 GHz. This processor-based design approach is also far less work for the SOC hardware team. With Tensilica’s patented technology, the Queue interfaces and custom packet-header inspection instructions can be added to a processor within hours, complete with fully verified RTL and software tools and models. Conventional RTL hardware design requires weeks of RTL design followed by months of verification.
Tensilica’s Xtensa LX processor is the only processor that allows designers to bypass the conventional processor bus bottleneck in this way. Every other processor requires that data be “fed” to it over a bus, which is inherently much slower. Xtensa Queues provide a high-speed mechanism to transfer streaming data. From the programmer’s viewpoint, input queues and output queues behave like traditional processor registers, with the notable exception that data is always available without the need to load or store the data before and after computation.
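The 6.4 GHz figure can be read as a per-clock ratio on that kernel. The short calculation below uses only the two clock rates quoted above and assumes performance scales linearly with clock frequency, which is the same assumption behind per-MHz normalization.

```python
XTENSA_MHZ = 304.0               # tested Xtensa LX clock rate
POWERPC_EQUIVALENT_MHZ = 6400.0  # clock the PowerPC would need, per the text

# Implied per-clock advantage on the 1MB packet size kernel.
per_clock_advantage = POWERPC_EQUIVALENT_MHZ / XTENSA_MHZ
print(f"implied per-clock advantage: {per_clock_advantage:.1f}X")
```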
EEMBC, the Embedded Microprocessor Benchmark Consortium, develops and certifies real-world benchmarks and benchmark scores to help designers select the right embedded processors for their systems. Every processor submitted for EEMBC benchmarking is tested for parameters representing different workloads and capabilities in communications, networking, consumer, office automation, automotive/industrial, embedded Java, and microcontroller-related applications. With members including leading semiconductor, intellectual property, and compiler companies, EEMBC establishes benchmark standards and provides certified benchmarking results.