White Papers

Optimizing Energy in Processor-Memory Subsystems during SOC Design

Abstract:

System-level architectural decisions made before any RTL code has been written have a much larger impact on overall system energy than RTL-level, gate-level, or circuit-level tweaks. The Xenergy tool from Tensilica estimates energy for a processor subsystem (processor, caches, local memories) based on the application code that will run on that subsystem. Designers can thus tune the software and optimize their Xtensa configurable processors and the associated memory subsystems for energy.

A focus on total energy consumption is key. Too often, designers will focus merely on the mW/MHz power figure for processor core logic, but ignore the total energy consumption per unit of workload. An increase in power-per-clock of 20%, for example, might be offset by a 3X speedup in application execution. The mW/MHz number increases 20%, but total energy consumption is actually reduced by 60%.

Sometimes applications can be sped up by increasing accesses to local memories. While performance on the processor increases, total energy usage can increase significantly since memory accesses dissipate more energy than processor activity. The Xenergy tool can help the designer make informed trade-offs between performance and energy consumption.

Main Text:

Power has become a first-order concern, right next to performance and area, for SOC designers, whether they are designing for portable mobile devices or for networking boxes. Optimizing for energy at an application and system level has the potential to improve energy efficiency by an order of magnitude, compared to lower levels of abstraction (RTL and lower) where the best case improvements are 2x. Additionally, by iterating early in the design cycle, SOC designers can save months of time-consuming power optimizations much later in the design cycle to meet energy goals.

Several low-power EDA design methodologies target clock gating, voltage and frequency reduction, gate sizing and logic optimization, leakage reduction techniques, and low-power libraries and technology processes. In contrast, system-level architectural decisions, such as the number and size of local (tightly coupled) memories and caches or the data flow interconnects for streaming data sources, have a much larger impact on overall system energy. A lot of emphasis has been placed on guiding an SOC designer towards a performance- and/or area-optimized architecture when making system choices such as memory subsystem design (banked memories versus a single large memory), interconnect (single bus versus a hierarchy of buses versus point-to-point interconnects), and caches. Little has been done to guide designers towards an energy-efficient solution.

Tensilica’s Xenergy tool addresses this need by providing SOC designers with an early estimate of the energy consumed by the processor subsystem. The Xenergy tool estimates energy for a Tensilica processor subsystem (processor, caches, local memories) based on the application code that will run on that subsystem. Energy estimates take minutes, versus hours or days for RTL-based power analysis. Using this data, an SOC architect can optimize Xtensa processors and software applications for energy.

Xenergy is the first tool that provides a realistic way to estimate the overall energy impact of different processor configurations and extensions. It also is the first tool that helps in energy-driven tuning of application code on the overall processor plus memory subsystem. Coupled with traditional software tool chains that focus on guiding application code development to improve performance, Tensilica’s Xenergy energy estimation tool guides designers in choosing between performance-energy-area trade-offs during application code development and processor-memory subsystem tuning.

Xenergy: Optimizing Processor and Memory Energy

The Xenergy tool executes a software application binary on an Xtensa configurable processor or a Diamond Standard processor and generates a quick and early estimate of the power and energy consumed by the processor, caches, and local (tightly coupled) memories. The designer can then tune the application software or hardware Xtensa configuration to optimize the energy.

Input to the Xenergy tool includes a software binary, information about which processor the binary is targeting, and information about the process technology and operating conditions. Xenergy then executes the binary on the Tensilica instruction set simulator (ISS) and generates a processor core and memory power and energy report. This energy report includes a breakdown of dynamic, leakage and total power and energy consumed by the processor core and the memories connected to the local memory interfaces (i.e., tightly coupled memories). This flow is depicted in Figure 1.

Figure 1: Using the Xenergy Energy Estimator to estimate energy for an application running on a Tensilica Xtensa configurable or Diamond Standard Processor

There are two use models for the Xenergy tool: (i) designers can use it to tune their application software to reduce the processor and memory energy (for example, by reducing the number of memory accesses), or (ii) designers can tune their hardware for energy, in this case the Xtensa configurable processor and the associated memories, by selecting different configuration options, adding instruction extensions, register files, new execution units, and changing the number and size of local memories and caches.

A focus on total energy consumption is key. Too often, designers will focus on a static milliwatts per megahertz (mW/MHz) power figure, but ignore the total energy consumption per unit of workload. For example, a designer may add a set of application-specific instructions to a processor that increases the total size of the processor core and, thereby, increases the average power per clock cycle (increases the mW/MHz). But if that new instruction set addition dramatically lowers the total clock cycles (and therefore the milliseconds) required to perform a given functional workload (a target application), then the total energy consumed (average power multiplied by total execution time) can be reduced. For example, an increase in power-per-clock of 20% might be offset by a 3X speedup in application execution. The mW/MHz number increases 20%, but total energy consumption falls to 1.2/3 = 0.4 of the original, a reduction of 60%.
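The arithmetic behind this example can be checked with a minimal sketch. It assumes the clock frequency is unchanged, so that execution time scales as 1/speedup; the function name `relative_energy` is ours for illustration and is not part of any Tensilica tool.

```python
def relative_energy(power_ratio: float, speedup: float) -> float:
    """Energy of a modified design relative to the baseline.

    Energy = average power x execution time. Assuming the same clock
    frequency, execution time scales as 1 / speedup, so the energy
    ratio is power_ratio / speedup.
    """
    if power_ratio <= 0 or speedup <= 0:
        raise ValueError("power_ratio and speedup must be positive")
    return power_ratio / speedup


# The example from the text: 20% more power per clock, 3X speedup.
ratio = relative_energy(power_ratio=1.20, speedup=3.0)
print(f"relative energy:  {ratio:.2f}")      # 0.40
print(f"energy reduction: {1 - ratio:.0%}")  # 60%
```

The same function shows the break-even point: an extension saves energy only while its power increase factor stays below its speedup factor.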

The Xenergy tool is designed to be used iteratively, first as the processor designer is selecting the configuration options and adding new instruction extensions, and then by the software application developer as the application is tuned. Before the availability of the Xenergy tool, the hardware and software developers only had performance and area analysis tools to guide them through the hardware-software tuning process. Xenergy now provides them with early energy guidance as well.

Energy Modeling Strategy

The Xenergy tool uses statistical models for energy-per-memory-access (read and write) and energy-per-instruction, including energy estimates for designer-defined instruction extensions specified in the Tensilica Instruction Extension (TIE) language. These statistical models were developed by doing detailed synthesis and RTL and gate-level simulation on a very wide range of processor configurations on a variety of different technology process nodes.

For each designer-defined instruction extension in an Xtensa processor, Xenergy creates an energy estimate for the newly created instruction, including modeling the energy consumed by all locally attached memories that are active for the given instruction. The Xenergy tool then simulates the application on Tensilica’s cycle-accurate ISS, which gives detailed profiling information about each instruction executed and every memory access made. Based on this profiling information, Xenergy uses its statistical models to give an estimate of the dynamic, leakage and total energy dissipated by the processor, the caches, and the local (tightly coupled) instruction and data memories.
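The general shape of such a profile-driven estimate can be illustrated with a short sketch. This is not Xenergy's actual model: the per-instruction and per-access energies, the operation names, and the function `estimate_energy_pj` are all hypothetical placeholders, and the real statistical models are derived from synthesis and gate-level simulation as described above.

```python
# Hypothetical per-event dynamic energies in picojoules (placeholders only).
ENERGY_PER_INSTRUCTION_PJ = {"add": 10.0, "load": 14.0, "store": 14.0, "fused_mac": 18.0}
ENERGY_PER_MEMORY_ACCESS_PJ = {"read": 25.0, "write": 30.0}


def estimate_energy_pj(instr_counts, mem_counts, leakage_mw, cycles, clock_mhz):
    """Combine profile counts with per-event energies (all values assumed).

    instr_counts / mem_counts: event name -> count, as a simulator profile
    might report them. Leakage is modeled as constant power over the run.
    """
    core_dynamic = sum(ENERGY_PER_INSTRUCTION_PJ[op] * n for op, n in instr_counts.items())
    mem_dynamic = sum(ENERGY_PER_MEMORY_ACCESS_PJ[kind] * n for kind, n in mem_counts.items())
    run_time_us = cycles / clock_mhz              # cycles / MHz = microseconds
    leakage = leakage_mw * run_time_us * 1000.0   # mW x us = nJ = 1000 pJ
    return {
        "core_dynamic_pj": core_dynamic,
        "memory_dynamic_pj": mem_dynamic,
        "leakage_pj": leakage,
        "total_pj": core_dynamic + mem_dynamic + leakage,
    }


report = estimate_energy_pj(
    instr_counts={"add": 1000, "load": 400, "store": 100, "fused_mac": 500},
    mem_counts={"read": 400, "write": 100},
    leakage_mw=0.5,
    cycles=2000,
    clock_mhz=200.0,
)
for name, value in report.items():
    print(f"{name}: {value:.0f}")
```

The point of the sketch is the structure: dynamic energy follows from what the application executes and accesses, while leakage depends on how long it runs.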

Energy as another variable in the design decision matrix

The RGB-to-YUV color conversion benchmark from EEMBC (the Embedded Microprocessor Benchmark Consortium at www.eembc.org) can be used to illustrate the use of the Xenergy tool. This kernel converts pixel color information from RGB to YUV for a 32x32 sized image.

Tensilica’s XPRES (Xtensa PRocessor Extension Synthesis) compiler was used on the color conversion benchmark. The XPRES compiler takes as input the application software specified in C or C++ and generates processor extensions in the TIE language. The XPRES compiler explores the design space in an attempt to find the highest performance solution. A designer can control XPRES’s search strategies by placing constraints on the area overhead and the amount of performance improvement required.

We directed the XPRES compiler to generate three solutions based on three optimizations for instruction extensions for the Xtensa processor.

  1. First, XPRES was asked to only generate TIE instructions that were operation fusions. A fusion operation is a combination of multiple operators into a single, complex operation.
  2. Second, XPRES was directed to also generate SIMD (single instruction multiple data) functional units (and corresponding instructions) that are vector operations, which apply the same operator on multiple data elements.
  3. Finally, XPRES also extended the Xtensa processor into a VLIW (very long instruction word) architecture using Tensilica’s FLIX (Flexible Length Instruction eXtensions) technology. In this approach, XPRES creates a multi-issue datapath in which a VLIW instruction contains several operations. The compiler automatically extracts parallelism from the application C/C++ code and packs multiple operations into a single VLIW instruction bundle.

Figure 2: Performance, Energy, and Area Trade-offs for Different Xtensa Processor Extensions

The results for performance (cycle count), energy (uJ), and area (gate count), each normalized to the largest value in its data set, are shown in Figure 2. The cycle count was determined by executing the color conversion application on the ISS. The energy estimates for processor, memory, and the total of processor and memory energy were generated by the Xenergy tool. The gate count is estimated by the TIE Compiler.
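Normalizing each metric to its largest value puts cycle counts, energy, and gate counts on a common 0-to-1 scale so they can share one chart. A minimal sketch of that step follows; the configuration names and cycle counts are made-up placeholders, not the measured values behind Figure 2.

```python
from typing import Dict


def normalize_to_max(values: Dict[str, float]) -> Dict[str, float]:
    """Scale every value by the largest one, so the maximum becomes 1.0."""
    peak = max(values.values())
    if peak <= 0:
        raise ValueError("largest value must be positive")
    return {name: value / peak for name, value in values.items()}


# Hypothetical cycle counts for three design points (placeholders only).
cycles = {"config_a": 40000.0, "config_b": 10000.0, "config_c": 8000.0}
normalized = normalize_to_max(cycles)
for name, value in normalized.items():
    print(f"{name}: {value:.2f}")
```

Because each metric is scaled independently, the chart compares trends across design points rather than absolute magnitudes across metrics.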

This figure demonstrates two things:

  • When XPRES generated SIMD operations in addition to fusion operations, performance improved significantly, by about 3.8x, while the gate count became almost 5x larger. Energy for the processor and memory tracks performance quite well.
  • When XPRES generated the VLIW (FLIX) architecture, it improved performance by roughly 20%. However, gate count doubled. In this case, even though performance improved, energy became worse, particularly the energy dissipated by the processor.

These results show that the performance improvements due to the SIMD operations lead to large energy reductions that clearly outweigh the power/energy increases due to the increase in area (gate count). In the VLIW case, the energy increase due to the increase in area outweighs the decrease in energy due to the performance improvements.

This example demonstrates that the Xenergy energy estimation tool is indispensable to SOC designers for evaluating complex, non-obvious trade-offs between performance, area, and energy.

The Effect of Memories and Application Code

The inclusion of memory power consumption is a key aspect of the Xenergy tool. Imagine a scenario where a custom TIE instruction improves application performance, but also significantly increases accesses to memory. Even though the application may finish faster and, therefore, consume less energy on the processor, the extra memory accesses will increase the energy consumption. Similarly, a designer can modify the cache configuration (size, associativity) to optimize for energy.

If designers do not pay attention to this increase in memory energy consumption, the new TIE instruction may lead to a less energy-efficient solution. The Xenergy program will point out this energy increase, making it easy for the designer to understand the impact of these changes on the total processor with memory energy early in the processor configuration process.
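The scenario above can be made concrete with a small sketch that compares two energy reports. The `EnergyReport` class, the `compare` helper, and all of the numbers are hypothetical illustrations; they do not reflect the format of an actual Xenergy report.

```python
from dataclasses import dataclass


@dataclass
class EnergyReport:
    """Processor and memory energy for one run, in microjoules (assumed units)."""
    processor_uj: float
    memory_uj: float

    @property
    def total_uj(self) -> float:
        return self.processor_uj + self.memory_uj


def compare(baseline: EnergyReport, candidate: EnergyReport) -> dict:
    """Per-component energy change; negative values are savings."""
    return {
        "processor_delta_uj": candidate.processor_uj - baseline.processor_uj,
        "memory_delta_uj": candidate.memory_uj - baseline.memory_uj,
        "total_delta_uj": candidate.total_uj - baseline.total_uj,
    }


# Hypothetical numbers: the new instruction finishes sooner (less processor
# energy) but issues many more memory accesses (more memory energy).
baseline = EnergyReport(processor_uj=10.0, memory_uj=6.0)
with_new_instruction = EnergyReport(processor_uj=7.0, memory_uj=11.0)

delta = compare(baseline, with_new_instruction)
for name, value in delta.items():
    print(f"{name}: {value:+.1f}")
```

With these placeholder values the processor saves 3 uJ but the memories spend 5 uJ more, so the total rises by 2 uJ. Looking at processor energy alone would have suggested the opposite conclusion.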

Similarly, a software programmer developing application code for a Tensilica Xtensa or Diamond Standard processor traditionally would tune the application for either performance or code size. The Xenergy tool helps the developer tune the application to reduce energy dissipation by the processor and its memories. For example, a developer may restructure the data structures in the application to reduce accesses to memories by exploiting temporal locality of the data. Intuitively, this should not only lead to better application code performance, but should also reduce energy. Tensilica’s standard software profiling tools will demonstrate if application performance improves and Xenergy will demonstrate if this code tuning reduces energy as well.

Summary

Xenergy is a powerful tool that gives an early estimate of the energy of the processor-memory subsystem. Designers can immediately see the impact on total energy consumption of their selection of Xtensa configuration options (multipliers, DSP engines, a floating point unit, etc.), the addition of TIE instruction extensions that they write, and their choice of the number and size of local memories and caches.

The ability of the Xenergy tool to model designer-defined TIE instruction extensions is critical to designers who use the Xtensa processor as an RTL alternative while designing the data plane of their SOC. These users write a significant amount of TIE to create the same hardware structures they would have built if they were implementing the architecture in hardwired RTL. Being able to get an early estimate of the energy impact of their custom TIE instructions is as important to most designers as the area estimation and performance profiling tools.
