Xenergy Energy Estimator Tool White Paper
Power has become a first-order concern, right next to performance and area, for SOC designers, whether they are designing for portable mobile devices or for networking boxes. Optimizing for energy at an application and system level has the potential to cut processor and local memory energy requirements by up to half in most cases by making intelligent design trade-offs. Any power savings made at this early architectural level far outweigh any power savings made later at the RTL or physical design levels.
Currently, there are several low-power EDA design methodologies such as clock gating, voltage and frequency reduction, gate sizing and logic optimization, leakage reduction techniques, and low-power libraries and technology processes. These low-power methodologies can take months to implement and still not have the impact that system-level architectural decisions can have on energy efficiency when the architectural decisions are made before any RTL code has been written.
A lot of emphasis has been placed on guiding an SOC designer towards a performance- and/or area-optimized architecture while making system choices such as memory sub-system design (banked memories versus a single large memory), interconnect (single bus versus a hierarchy of buses versus point-to-point interconnects), caches, etc. However, little has been done to guide designers towards an energy-efficient solution.
The Xenergy tool is the first tool in the industry that provides a realistic way to estimate the overall energy impact of different processor configurations and extensions. It also helps software developers with energy-driven application code tuning on the overall processor plus memory subsystem. Whereas most software tool chains in the past have focused on guiding application code development to improve performance, Tensilica’s Xenergy energy estimation tool can guide designers towards a more energy-efficient processor-memory sub-system configuration.
Designers can use the Xenergy tool to execute a software application binary on an Xtensa configuration or a Diamond Standard processor to get a quick and early estimate of the power and energy consumed by the processor, caches, and local (tightly coupled) memories. The designer can then modify their Xtensa configuration, add instruction extensions, register files, complex execution units, or simply tune their application code, with the objective of reducing overall processor and memory energy requirements.
Focus on Total Energy Consumption
A focus on total energy consumption is key. Too often, designers focus on a static milliwatts-per-megahertz (mW/MHz) power figure but ignore the real total energy consumed per unit of workload. For example, a designer may add a set of custom instructions that increases the total size of the processor core and thereby increases the average power per clock cycle (that is, the mW/MHz figure). But if those custom instructions dramatically lower the total number of clock cycles required to perform a given functional workload (a target C code application), then the total energy consumed (power per cycle multiplied by the total number of cycles) can still fall. For instance, suppose a 20% increase in power per clock is offset by a 3x speedup in application execution: the mW/MHz number rises by 20%, but total energy consumption drops to 1.2/3 = 0.4 of the baseline, a 60% reduction.
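The arithmetic in this example can be checked with a short sketch. The function name and its normalized inputs are illustrative assumptions, not part of the Xenergy tool.

```python
def relative_energy(power_ratio: float, speedup: float) -> float:
    """Energy relative to the baseline, taking energy = power-per-cycle x cycles.

    power_ratio: new power per cycle divided by baseline power per cycle
    speedup:     baseline cycle count divided by new cycle count
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return power_ratio / speedup


# Figures from the text: 20% more power per clock, 3x fewer cycles.
ratio = relative_energy(power_ratio=1.20, speedup=3.0)
print(f"relative energy:  {ratio:.2f}")       # 0.40 of the baseline
print(f"energy reduction: {1 - ratio:.0%}")   # 60%
```

The same function shows where an extension stops paying off: any change whose power ratio exceeds its speedup returns a value above 1.0, meaning more total energy than the baseline.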
Cycle-by-Cycle Energy Consumption Estimation
The Xenergy energy estimator works by computing a per-cycle power-consumption estimate for each distinct instruction of an Xtensa configurable or Diamond Standard processor. For each designer-defined instruction extension in an Xtensa processor, created using Tensilica’s powerful TIE (Tensilica Instruction Extension) language, the Xenergy tool uses statistical models for energy per instruction and energy per memory access (read and write).
The Xenergy tool then simulates the application on Tensilica’s cycle-accurate instruction set simulator (ISS), which gives detailed profiling information about each instruction executed and every memory access made. Based on this profiling information, the processor configuration, the TIE instructions, and the process technology information, Xenergy uses its statistical models to give an estimate of the dynamic, leakage and total energy dissipated by the processor, the caches, and the local (tightly coupled) instruction and data memories.
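As a rough illustration of how profile counts and per-event energy figures could combine into dynamic, leakage, and total energy, consider the sketch below. It is not Xenergy's actual model: the `EnergyModel` class, the function name, and every number are hypothetical placeholders chosen only to show the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class EnergyModel:
    """Hypothetical per-event energy table; all values are placeholders."""
    instr_energy_pj: dict          # dynamic energy per executed instruction, by opcode
    mem_read_pj: float             # dynamic energy per local-memory read
    mem_write_pj: float            # dynamic energy per local-memory write
    leakage_pj_per_cycle: float    # leakage energy charged on every cycle


def estimate_energy_pj(model, instr_counts, mem_reads, mem_writes, cycles):
    """Return (dynamic, leakage, total) energy in picojoules."""
    dynamic = sum(model.instr_energy_pj[op] * n for op, n in instr_counts.items())
    dynamic += mem_reads * model.mem_read_pj + mem_writes * model.mem_write_pj
    leakage = cycles * model.leakage_pj_per_cycle
    return dynamic, leakage, dynamic + leakage


# Made-up model and profile, standing in for simulator output.
model = EnergyModel(
    instr_energy_pj={"add": 10.0, "load": 15.0, "mac": 30.0},
    mem_read_pj=20.0,
    mem_write_pj=25.0,
    leakage_pj_per_cycle=2.0,
)
profile = {"add": 1000, "load": 500, "mac": 200}
dynamic, leakage, total = estimate_energy_pj(
    model, profile, mem_reads=500, mem_writes=100, cycles=2000
)
print(f"dynamic={dynamic:.0f} pJ  leakage={leakage:.0f} pJ  total={total:.0f} pJ")
```

Because leakage is charged per cycle in this sketch, shortening the run reduces leakage energy even when the dynamic energy per instruction is unchanged.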
The Xenergy tool builds on Tensilica’s existing energy estimation tools. Xtensa processor configuration tools already provide the designer with a dynamic estimate of the area, clock frequency (MHz), and power of a processor core as the designer configures the processor. The Xenergy tool takes this a step further by giving the designer an estimate of the energy dissipated by the processor and its local memory subsystem when a particular application is executed on it.
The Xenergy tool is designed to be used iteratively, first by the processor designer while selecting configuration options and adding new instruction extensions, and then by the software application developer while tuning the application. In reality, this hardware-software co-design process is itself iterative. Before the Xenergy tool was available, hardware and software developers had only performance and area analysis tools to guide them through the hardware-software tuning process. Xenergy now provides them with early energy guidance as well.
Impact on Processor Design
Total energy to complete a task (power dissipated multiplied by the time taken for the task to complete) can be dramatically reduced by customizing an Xtensa processor, as shown in the table below.
| Configuration | Metric | Dot Product | AES | Viterbi | FFT |
|---|---|---|---|---|---|
| Baseline Xtensa Processor | K Cycles | 12 | 283 | 280 | 326 |
| Baseline Xtensa Processor | Energy (μJ) | 3.3 | 61.1 | 65.7 | 56.6 |
| Optimized Xtensa Processor | K Cycles | 5.9 | 2.8 | 7.6 | 13.8 |
| Optimized Xtensa Processor | Energy (μJ) | 1.6 | 0.7 | 2.0 | 2.5 |
| Energy Improvement | | 2x | 82x | 33x | 22x |
The table above assumes no changes in the basic software algorithm (except for the use of custom-instruction C intrinsics) and identical memory cache sizes in the baseline and optimized processors.
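The figures in the table above can be cross-checked with a few lines of arithmetic. The sketch below copies the published values and derives the cycle speedup, the energy gain, and the average energy per cycle; the dictionary layout and the `summarize` helper are illustrative assumptions. Ratios recomputed from the rounded table entries can differ somewhat from the published improvement row (AES in particular), presumably because that row was computed from unrounded data.

```python
# Values copied from the table above: (K cycles, energy in uJ).
baseline = {"Dot Product": (12, 3.3), "AES": (283, 61.1),
            "Viterbi": (280, 65.7), "FFT": (326, 56.6)}
optimized = {"Dot Product": (5.9, 1.6), "AES": (2.8, 0.7),
             "Viterbi": (7.6, 2.0), "FFT": (13.8, 2.5)}


def summarize(name):
    """Derive ratios for one benchmark from the two table rows."""
    (c0, e0), (c1, e1) = baseline[name], optimized[name]
    return {
        "speedup": c0 / c1,              # baseline cycles / optimized cycles
        "energy_gain": e0 / e1,          # baseline energy / optimized energy
        "nj_per_cycle_base": e0 / c0,    # uJ per K cycles equals nJ per cycle
        "nj_per_cycle_opt": e1 / c1,
    }


for name in baseline:
    s = summarize(name)
    print(f"{name:12s} speedup {s['speedup']:6.1f}x  "
          f"energy gain {s['energy_gain']:5.1f}x  "
          f"nJ/cycle {s['nj_per_cycle_base']:.3f} -> {s['nj_per_cycle_opt']:.3f}")
```

In this data the average energy per cycle changes only modestly between the baseline and optimized configurations, so nearly all of the energy saving comes from the reduction in cycle count.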
The Xenergy tool can be used during the process of configuring an Xtensa processor. Designers can immediately see the effect on total energy consumption when they select configuration options (multipliers, DSP engines, a floating point unit, and many additional configuration choices) or when they add designer-defined instructions. They can see the effect of different interface options as well as memory subsystem options.
The ability of the Xenergy tool to model designer-defined TIE instruction extensions is critical to designers who use the Xtensa processor as an RTL alternative when designing the data plane of their SOC. These designers write a significant amount of TIE to create the same hardware structures they would have built had they implemented the architecture in hardwired RTL. For most of these users, an early estimate of the energy impact of their custom TIE instructions is therefore as important as the area estimation and performance profiling tools.
The inclusion of memory power consumption is another key aspect of the new Xenergy tool. Imagine a scenario where designer-defined processor extensions are used to create custom state registers and register files within an Xtensa processor core, not to appreciably improve execution performance but to significantly decrease accesses to local memory and thus decrease overall energy. The Xenergy tool reports this energy decrease, making it easy for the designer to weigh area, performance, and power trade-offs early in the processor configuration process.
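The scenario above can be sketched numerically. The example assumes, purely for illustration, that a register access costs less energy than a local-memory access; the per-access energies, the access count, and the function name are all hypothetical placeholders rather than Xenergy data.

```python
def access_energy_pj(n_accesses, mem_fraction, mem_pj, reg_pj):
    """Energy of n data accesses when a fraction mem_fraction goes to local
    memory and the remainder is served from processor registers."""
    if not 0.0 <= mem_fraction <= 1.0:
        raise ValueError("mem_fraction must be between 0 and 1")
    mem_accesses = n_accesses * mem_fraction
    reg_accesses = n_accesses - mem_accesses
    return mem_accesses * mem_pj + reg_accesses * reg_pj


N = 1_000_000            # hypothetical number of data accesses in the workload
MEM_PJ, REG_PJ = 20.0, 2.0   # placeholder energies per access, in pJ

before = access_energy_pj(N, mem_fraction=1.0, mem_pj=MEM_PJ, reg_pj=REG_PJ)
after = access_energy_pj(N, mem_fraction=0.25, mem_pj=MEM_PJ, reg_pj=REG_PJ)
print(f"before: {before / 1e6:.1f} uJ, after: {after / 1e6:.1f} uJ")
```

A fuller comparison would also charge the added register file for its own leakage and area, which is the kind of trade-off the text describes weighing early in configuration.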
Impact on Software Design
Similarly, a software programmer developing application code for a Tensilica Xtensa or Diamond Standard processor would traditionally tune the application for either performance or code size. The Xenergy tool now gives the developer a way to tune the application to reduce the energy dissipated by the processor and its memories. For example, one may restructure the data structures in the application to reduce accesses to memories by exploiting temporal locality of the data. Intuitively, this should not only lead to better application code performance but also reduce energy. Tensilica’s standard software profiling tools will show whether application performance improves, and Xenergy will show whether this code tuning does indeed reduce energy as well.
Starting with a Low Power Architecture
The base Xtensa instruction set architecture, common to both the Xtensa processors and the Diamond Standard processor cores, provides the industry’s lowest power and highest performance when compared to legacy fixed-architecture cores. For example, a high-performance version of the Xtensa LX2 processor uses less than half the die area and power of the equivalent ARM 1136J-S. Note: This is not the base Xtensa LX2 processor. Rather, this version of Xtensa LX2 has been configured to be a high-performance, general-purpose CPU, equivalent to the ARM 1136J-S. Performance analysis on EEMBC benchmarks for this Xtensa configuration has shown it to be an average of 2.5x higher performance than the ARM11 core.
| Processor | Equivalent Frequency (0.13u G Worst Case) | Power - mW per MHz (0.13u G) | Dhrystone MIPS/mW |
|---|---|---|---|
| ARM 1136J-S | 333 MHz (single issue) | 0.60 | 1.98 |
| Xtensa LX2 3-way FLIX performance configuration | 700 equivalent MHz (three issue) | 0.17 | 10.4 |
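The two power columns in the table above are related by simple unit arithmetic: multiplying Dhrystone MIPS per mW by mW per MHz yields Dhrystone MIPS per MHz. The sketch below applies this to the published figures; the dictionary layout and function name are illustrative assumptions.

```python
# (mW per MHz, Dhrystone MIPS per mW), copied from the table above.
cores = {
    "ARM 1136J-S": (0.60, 1.98),
    "Xtensa LX2 3-way FLIX": (0.17, 10.4),
}


def dmips_per_mhz(mw_per_mhz, dmips_per_mw):
    """(DMIPS / mW) * (mW / MHz) = DMIPS / MHz."""
    return dmips_per_mw * mw_per_mhz


for name, (mw_per_mhz, dmips_per_mw) in cores.items():
    print(f"{name:24s} {dmips_per_mhz(mw_per_mhz, dmips_per_mw):.2f} DMIPS/MHz")

arm, xtensa = cores["ARM 1136J-S"], cores["Xtensa LX2 3-way FLIX"]
print(f"power-per-MHz ratio (ARM / Xtensa): {arm[0] / xtensa[0]:.2f}")
print(f"DMIPS-per-mW ratio (Xtensa / ARM):  {xtensa[1] / arm[1]:.2f}")
```

The efficiency ratio exceeds the power-per-MHz ratio because the derived work per clock also differs between the two cores.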
Tensilica has augmented its more modern Xtensa architecture with several energy saving features. The architecture implements power-down modes that lower overall system power, including power-down modes for local memory accesses and external power-down of the trace port control and on-chip debug modules. The architecture also implements automatic fine-grain clock gating for every functional element of the Xtensa processor including TIE functions added by designers.
Power consumption for the minimum Xtensa configuration is:
- 38 μW/MHz in 130 nm LV process, speed-optimized netlist, typical operating conditions.
- 48 μW/MHz in 90 nm GT process, speed-optimized netlist, typical operating conditions.
Summary
The Xenergy tool estimates energy for an Xtensa subsystem (processor, caches, local memories) or Diamond Standard processor based on the application code that will run on that subsystem. Energy estimates take minutes versus hours or days for RTL-based power analysis. The Xenergy tool enables architects to optimize Xtensa processors and software applications for energy.