Why High MHz Does Not Mean High Performance

Over the past several years, AMD has successfully fought the PR battle with Intel to convince OEMs, PC manufacturers, and consumers that frequency, or megahertz (MHz), is not the right metric for evaluating processor performance. The essence of the argument is that processor designers can use architectural design techniques that improve the performance of applications running on the processor without increasing its operating clock rate. In fact, processors with high clock rates dissipate much more power, since power dissipation (P) is proportional to operating frequency (f). In the ultimate acknowledgement of this argument, Intel today ships dual-core chips that run at lower MHz than older-generation processors.

The same argument holds true for embedded processors. In fact, embedded products operate under very tight power and energy budgets. This is well known for portable devices, where extending battery life is an important requirement, but it is also true for non-portable devices: consumers do not want a (noisy) cooling fan running in their set-top box or display projector, and IT managers want routers and switches that minimize energy consumption in the data center.

Consumers want more performance and longer battery life

Embedded products, particularly portable consumer devices, are continuously incorporating more features and are thus placing increasing computational demands on application and multimedia chipsets. Consumers can now listen to music, watch videos, take calls, write emails, and even edit documents on their portable media players (PMPs), cell phones, and PDAs.

Conventional RISC processor core IP providers have, over the past decade, responded to this increasing computational demand by designing processors with deeper pipelines that operate at ever higher frequencies. This brute-force method means that the speeds of processor cores have, in the past five years, increased faster than the underlying process technology has sped up over the same period. Although this meets the sheer MIPS requirements of general-purpose applications, it is not well suited to DSP applications such as multimedia and baseband. Higher MHz also comes with a number of disadvantages, such as higher power, larger area, and, in many cases, worse performance, as discussed below.

The problem with longer pipelines and higher MHz

To achieve higher MHz, conventional RISC processors have to use deeper pipelines. Deeper pipelines come with several disadvantages, including (a) a very high penalty for branch delays and branch mispredictions, (b) high area overhead to support the data forwarding and control logic required for the deeper pipeline, and (c) additional area-expensive units, such as branch prediction units, to alleviate branch penalties. These disadvantages reduce architectural efficiency: the performance lost to each of these factors offsets part of the benefit gained from the higher frequency.
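
The branch-penalty effect can be illustrated with a simple cycles-per-instruction (CPI) model. This is only a sketch: the two cores, their clock rates, the branch frequency, the misprediction rate, and the penalty cycles are all hypothetical values chosen for illustration, not measurements of any real processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once mispredicted branches are charged."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def million_instr_per_sec(freq_mhz, cpi):
    """Throughput in millions of instructions per second."""
    return freq_mhz / cpi


# Hypothetical workload: 20% of instructions are branches, 10% of those mispredict.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.10

# Hypothetical shallow pipeline: 300 MHz, 2-cycle misprediction penalty.
shallow = million_instr_per_sec(
    300, effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 2))

# Hypothetical deep pipeline: 600 MHz, 8-cycle misprediction penalty.
deep = million_instr_per_sec(
    600, effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 8))

throughput_ratio = deep / shallow
print(f"clock ratio 2.00x, throughput ratio {throughput_ratio:.2f}x")
```

Under these assumed numbers the clock rate doubles but throughput rises by less than 2x, because the larger misprediction penalty raises the effective CPI of the deeper pipeline.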

But by far the biggest penalty of deeper pipelines and higher frequencies is that the power consumption of the processor core shoots up tremendously. In the best case, power increases proportionally with frequency. In reality, the area overhead for the deeper pipelines increases power consumption even more.
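
The relationship between frequency, area, and power can be sketched with the standard first-order model of CMOS dynamic power, P = a * C * V^2 * f (activity factor, switched capacitance, supply voltage, clock frequency). The capacitance, voltage, and frequency values below are hypothetical, and the 30% capacitance increase standing in for the extra pipeline area is an assumption made only for illustration.

```python
def dynamic_power_mw(switched_cap_nf, vdd_volts, freq_mhz, activity=1.0):
    """First-order dynamic power in mW: P = activity * C * V^2 * f.

    With C in nanofarads and f in megahertz, the product is in milliwatts
    (1e-9 F * 1e6 Hz = 1e-3 W per V^2).
    """
    return activity * switched_cap_nf * vdd_volts ** 2 * freq_mhz


# Hypothetical baseline core: 0.5 nF switched capacitance, 1.2 V, 300 MHz.
base = dynamic_power_mw(0.5, 1.2, 300)

# Best case: same capacitance and voltage, clock doubled.
faster = dynamic_power_mw(0.5, 1.2, 600)

# Assumed: the deeper pipeline adds 30% switched capacitance (more area).
faster_and_larger = dynamic_power_mw(0.65, 1.2, 600)

print(f"best case:      {faster / base:.2f}x power")
print(f"with area cost: {faster_and_larger / base:.2f}x power")
```

The best case tracks frequency exactly; any added area pushes power above that line. The model ignores leakage and any supply-voltage increase needed to reach the higher clock, both of which would make the comparison worse for the faster core.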

So, although using deeper pipelines is a valid approach to meeting higher computational demands, it is also a sure way to decrease battery life. And battery life is a key decision metric for consumers when they consider PMPs, cell phones, and PDAs. Power has therefore become a first-order consideration for SOC designers, along with area and performance.

Increasing the frequency of embedded processor cores just does not cut it anymore. High-frequency cores consume too much power and require large memories (which are even more power hungry) to support them.

This raises the question: is it possible to get higher-performance embedded processors without increasing the frequency?

Higher performance without higher MHz

If the applications targeted at the processor are known, then an application-optimized processor can be created. Such a processor has instructions and functional units that accelerate a particular application or a class of applications. This can be done easily using an advanced configurable and extensible processor such as the Xtensa processor from Tensilica by creating instruction extensions.

If, however, the embedded processor is expected to execute a general set of applications, then using a processor with a VLIW (very long instruction word) architecture can serve as a high performance, yet low MHz solution. For example, the Diamond Standard 570T from Tensilica, a 3-issue VLIW processor, has been rated the highest performance embedded processor (based on EEMBC benchmarks) even when running at 200-250 MHz, outperforming competing single-issue cores running at clock rates up to twice as fast.

The EEMBC benchmark suites serve as a useful data point for comparison among the various processor cores because of the range of the applications in the benchmark suites and because these applications are representative of the various application domains that are interesting in the embedded SOC domain. As shown in Figure 1 below, Diamond Standard 570T outperforms processors such as ARM11 and MIPS 24K on each of the EEMBC benchmark suites.

Figure 1. Comparison of Tensilica's Diamond 570T against ARM11 and MIPS 20K on the EEMBC Benchmark Suite. Note that MIPS 20K is a dual-issue processor and is, therefore, higher performance than a MIPS 24K on a per-MHz basis.

All scores are simulations of licensable cores.
All scores are EEMBC/ECL certified. All scores are “out of the box.”
Per-MHz certified benchmark scores are normalized to ARM = unit score of 1 for suitability in graphing.
Competitive data as of June 2006. Source: www.eembc.org


In a VLIW architecture, the processor issues more than one operation per instruction (i.e., per cycle). A 4-issue VLIW processor, for example, issues up to four operations per instruction and attempts to increase application performance by executing more operations per cycle than a classic RISC pipeline. Thus, in the ideal case, a 4-issue VLIW provides 4 times the performance of a single-issue RISC processor running at the same clock rate.
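
The ideal-case arithmetic, and how it degrades when the compiler cannot fill every issue slot, can be sketched as follows. The slot utilization figures are hypothetical and are not measured values for the Diamond 570T or any other core; the model also assumes both processors execute the same number of operations.

```python
def vliw_speedup(issue_width, slot_utilization):
    """Speedup over a single-issue core at the same clock rate.

    slot_utilization is the average fraction of issue slots that hold
    useful operations (1.0 is the ideal case).
    """
    if not 0.0 < slot_utilization <= 1.0:
        raise ValueError("slot_utilization must be in (0, 1]")
    return issue_width * slot_utilization


def equivalent_single_issue_mhz(freq_mhz, issue_width, slot_utilization):
    """Clock rate a single-issue core would need to match the VLIW core."""
    return freq_mhz * vliw_speedup(issue_width, slot_utilization)


# Ideal case from the text: a 4-issue VLIW with every slot filled.
ideal = vliw_speedup(4, 1.0)

# Hypothetical 3-issue core at 250 MHz with two of three slots filled on average.
needed_mhz = equivalent_single_issue_mhz(250, 3, 2 / 3)

print(f"ideal 4-issue speedup: {ideal:.1f}x")
print(f"single-issue clock needed to match: {needed_mhz:.0f} MHz")
```

In practice the achievable utilization depends on how much instruction-level parallelism the compiler can extract from the application code.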

In fact, the DSP processor space has also evolved to using VLIW architectures to increase performance. This is evidenced by Texas Instruments’ decision to adopt a VLIW architecture for their highest performance DSP product line, the C6x series.

Diamond Standard 570T: The highest performance embedded CPU

The Diamond Standard 570T is a RISC-based 5-stage VLIW processor core that uses the Xtensa instruction set architecture (ISA). The Xtensa ISA uses 24-bit instructions with 16-bit narrow encodings. The VLIW instructions in the Diamond 570T are encoded using 64 bits and the processor modelessly issues and executes the 64-, 24-, and 16-bit instructions.

The software development toolkit (SDK) for the Diamond 570T includes the Xtensa C/C++ Compiler (XCC), along with a complete GNU-based tool-chain that includes the debugger, assembler, linker, and profiler. XCC is an advanced optimizing compiler that automatically extracts instruction-level parallelism from C/C++ code and bundles concurrent operations into VLIW instructions. The SDK also includes a cycle-accurate instruction-set simulator (ISS), a fast functional compiled simulator (TurboXim), and system models (SystemC and a C-based model) to enable easy and fast modeling of the processor and the system around it.

Higher performance without higher area equals lower power

One of the benefits of using a lower-frequency, shallow-pipeline processor with a VLIW architecture, such as the Diamond Standard 570T, over an 8- or 9-stage high-frequency RISC processor is that the area of the Diamond Standard 570T is much smaller than that of the other processors.

Figure 2: Area and Power Comparisons between ARM11, MIPS 24K, and Diamond 570T in 0.13G.

MIPS and ARM area and power numbers are based on the latest data published on their websites as of March 2007 (MIPS does not report 90nm data).

Figure 2 compares the area and power of the ARM11, MIPS 24K, and Diamond Standard 570T. Even though the Diamond 570T offers on average 2.5x higher performance than an ARM11 and about 2.2x higher performance than a MIPS 24K (based on EEMBC benchmarks), it is much smaller (almost 45% smaller) than both of these processors. This is also reflected in its power/MHz and in its absolute power at the same frequency: a Diamond 570T dissipates almost 1/6th the power (43mW) that an ARM11 dissipates at 400MHz (240mW).
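
The figures quoted above can be combined into a rough performance-per-milliwatt comparison. This is a back-of-the-envelope sketch: it assumes that both power numbers apply at the same 400MHz clock and that the 2.5x average EEMBC advantage holds at that clock. The variable names are illustrative only.

```python
# Figures quoted in the text above.
ARM11_POWER_MW = 240.0      # ARM11 at 400MHz
D570T_POWER_MW = 43.0       # Diamond 570T at the same frequency
PERF_VS_ARM11 = 2.5         # average EEMBC performance advantage

# Fraction of the ARM11's power that the Diamond 570T dissipates.
power_ratio = D570T_POWER_MW / ARM11_POWER_MW

# Assumed combination: performance advantage divided by relative power.
perf_per_mw_gain = PERF_VS_ARM11 / power_ratio

print(f"power ratio: {power_ratio:.3f} (about 1/{1 / power_ratio:.1f})")
print(f"performance per mW advantage: {perf_per_mw_gain:.1f}x")
```

With these inputs the power ratio is about 0.18, and the combined performance-per-milliwatt advantage comes to roughly 14x.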

Summary

The most important metrics for SOC designers choosing a processor core are area, performance, power, and price. Traditionally, performance has been associated with higher frequency. The Diamond 570T shows that higher performance can be achieved even while running the processor at a lower frequency. This leads not only to lower power because of the lower frequency, but also to better architectural efficiency and lower area. The lower area in turn leads to even more power savings when compared to traditional deep-pipeline RISC processors.

As consumer demands continue to grow, power has become a dominant metric in choosing the underlying processor core. Furthermore, the use of multiple specialized processors for tasks such as video, audio, and baseband places an even higher demand on high performance without compromising area and power. We believe that this will fuel the movement towards application-customized processors such as Tensilica’s Xtensa configurable processor and general purpose processors such as the Diamond 570T that achieve high performance at a low frequency.
