White Paper

How to Optimize SOC Performance Through Memory Tuning

The purpose of memory tuning is to explore the target application’s sensitivity to memory system parameters and to choose parameters for each processor that balance performance, processor cost, and size appropriately. The processor simulator plays a central role in this assessment, because the simulator models and reports the expected performance with a breakdown of memory-related stalls.

Memory system configuration consists of two phases. First, the designer must establish the strategy for each major segment of instruction and data storage, answering basic questions such as:

  • Will the processor execute its initialization code from off-chip, read-only, or flash memory; or will that code be preloaded into a local on-chip instruction RAM? This question is important because the boot-code location helps determine the kind of memory-bus interface that is required. If the processor never needs access to remote memory except for initialization code, a simpler and smaller memory interface may be appropriate.
  • Is the performance-sensitive application code small enough to fit entirely in instruction RAM local to the processor or is an instruction cache necessary?
  • How does the processor load application input data? Is it loaded from remote input buffers or memories, is it pushed into the processor’s local data RAM by an outside agent such as a DMA controller or another processor, or does the processor directly load the data using input or load instructions? These questions are important because data I/O references also help determine the kind of memory-bus interface that is required.
  • How does the application result data exit the processor? Is it sent to remote output buffers or memories, is it pulled from the processor’s local data RAM by an outside agent such as a DMA controller or another processor, or does the processor transfer the data through direct output operations using output or store instructions? These questions are important (like the previous set) because data I/O references also help determine the kind of memory-bus interface that is required.
  • Can all of the performance-sensitive data (including application data, maximum-sized stack, and other variable data storage) fit entirely in data RAM local to the processor or will it be necessary to emulate a large local data RAM using a local data cache and a large external memory?

Sometimes the instruction or data segments are so large that instruction and data caches are necessary for good overall performance. Unfortunately, the complex interactions among memory references may make cache behavior appear non-deterministic. This trait presents a significant problem for some embedded applications, where certain instruction sequences or data regions must be accessible with a small, constant latency.

This need can be addressed either by closely coupling both a cache and a local RAM to the processor (with time-critical instructions or data allocated to addresses mapped to the local RAM) or by using cache locking to prevent certain data or instruction lines from being evicted from the cache once the correct contents are loaded. Cache locking temporarily makes selected cache regions (on a line-by-line basis) act as local memory.
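
The effect of cache locking can be illustrated with a small model. The following is a minimal sketch, not a description of any particular processor's locking mechanism: it assumes a toy fully associative cache with LRU replacement, and the class name `LockableCache` and its `access` method are hypothetical names introduced here for illustration.

```python
from collections import OrderedDict


class LockableCache:
    """Toy fully associative LRU cache whose lines can be locked (illustrative only)."""

    def __init__(self, num_lines, line_bytes):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line tag -> locked flag, least recently used first

    def access(self, addr, lock=False):
        """Return True on a hit, False on a miss; optionally lock the accessed line."""
        tag = addr // self.line_bytes
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            self.lines[tag] = self.lines[tag] or lock
            return True
        if len(self.lines) >= self.num_lines:
            # Evict the least recently used line that is not locked.
            victim = next((t for t, locked in self.lines.items() if not locked), None)
            if victim is None:
                return False  # every line is locked: the access bypasses the cache
            del self.lines[victim]
        self.lines[tag] = lock
        return False


def survives_streaming(lock):
    """Load one critical line, stream other data through, then re-access the line."""
    cache = LockableCache(num_lines=2, line_bytes=16)
    cache.access(0x100, lock=lock)      # time-critical data
    for addr in range(0x200, 0x280, 16):
        cache.access(addr)              # unrelated streaming references
    return cache.access(0x100)          # True if the critical line is still resident


print("locked line still hits:  ", survives_streaming(lock=True))
print("unlocked line still hits:", survives_streaming(lock=False))
```

In this model the locked line behaves like local RAM: its access latency no longer depends on what other references pass through the cache, at the price of reducing the capacity available to everything else.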

Detailed Memory System Tuning

Once the basic organization of instruction, data, and cache memory has been established, detailed memory-system tuning can begin. Analysis of the memory stalls associated with each candidate memory-system configuration drives this process. Figure 1 below shows a typical application performance profile for an MPEG-4 encoding application, including instruction-cache-miss stalls, data-cache-miss stalls, store-buffer stalls (for a write-through cache), and other instruction-execution delays (exceptions, source interlocks, and branch-taken delays).

Figure 1. Memory System Profile and Parameters

Two perspectives guide detailed memory-system tuning:

  • The macro view of the memory system’s aggregate performance across all instruction and data references in a complete application.
  • The micro view of the memory system’s behavior, especially data references, in the key application hot spots or inner loops.

Aggregate Memory System Performance

The macro view is driven by the accumulated dynamic statistics of all program-memory references. The cumulative statistics often give little insight into why, for example, cache misses occur, but they serve as the foundation for tuning overall application throughput. Simulating the application with a range of different memory-system parameters establishes the trade-offs between application performance and memory-system implementation cost.
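
A parameter sweep of this kind can be sketched in a few lines. The example below is a simplified stand-in for a real processor simulator, under stated assumptions: it models only a set-associative cache with LRU replacement, the function name `miss_rate` is hypothetical, and the address trace is synthetic rather than taken from any real application.

```python
def miss_rate(trace, cache_bytes, line_bytes, ways):
    """Fraction of accesses that miss in a set-associative LRU cache."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [[] for _ in range(num_sets)]  # per set: resident tags, least recent first
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        resident = sets[line % num_sets]
        tag = line // num_sets
        if tag in resident:
            resident.remove(tag)          # hit: re-appended below as most recent
        else:
            misses += 1
            if len(resident) == ways:
                resident.pop(0)           # evict the least recently used tag
        resident.append(tag)
    return misses / len(trace)


# Assumed synthetic trace: two sequential passes over an 8 KB array of 4-byte words.
trace = [addr for _ in range(2) for addr in range(0, 8192, 4)]

for cache_kb in (4, 8, 16):
    for line_bytes in (16, 64):
        for ways in (1, 2):
            rate = miss_rate(trace, cache_kb * 1024, line_bytes, ways)
            print(f"{cache_kb:2d} KB, {line_bytes:2d}-byte line, {ways}-way: {rate:.3f}")
```

Because this trace is a purely sequential sweep, longer lines and a cache large enough to hold the whole array both help, while associativity does not. A real application trace would generally give a different picture, which is the reason for simulating rather than guessing.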

Typically, the behavioral patterns of an application’s instruction references and its data references are uncorrelated, so the statistics of each should be examined separately to determine the appropriate configuration. The impact of various parameters is summarized below:

  • Memory-access latency (first word): Reducing the memory latency for the first word of a cache-refill block always reduces the penalty for cache misses and increases application performance.
  • Memory-access time (each additional word of the block): Reducing the incremental delay for each subsequent word in a block reduces the penalty for cache misses (when the cache-line size is larger than one memory word) and increases application performance.
  • Cache size: Increasing the size of the cache almost always reduces the number of cache misses and thus increases application performance.
  • Cache set associativity: Increasing the number of ways in the cache almost always reduces the number of cache misses and increases application performance.
  • Cache-line size: Increasing the cache-line size may increase or decrease application performance, depending on the pattern of memory references, because longer cache lines take more time to load from or write back to main memory and increase the risk of interference among lines (longer cache lines mean fewer total lines in a given size cache). When all the words in the cache line are used before a line is evicted from the cache, a longer line size generally improves application performance, particularly when the latency to access the first memory word is long and the incremental time for additional words is short. Conversely, when the application uses little of each cache line, a shorter cache line may improve performance, particularly if the memory latency is short or the cost of each incremental word of the cache line is large. This uncertainty is why accurate simulation is so important.
  • Refill width: Increasing the number of bits that can be transferred into or out of the cache on each cycle generally reduces the delay for reading or writing a cache line and improves application performance, especially for long cache lines. The refill width often corresponds to the width of the bus connecting the processor to main memory, so instruction- and data-refill widths are almost always the same, though this is not mandatory.
  • Write-back vs. write-through data cache: Choosing a write-back cache generally reduces the number of write operations to the remote (on-chip or off-chip) memory associated with the data cache, because several processor stores to locations within one cache line result in just one write operation on the processor-to-main-memory bus. If the memory write bandwidth or the bus bandwidth to the memory is relatively limited, choosing a write-back policy may increase application performance. If write bandwidth is not a problem, a write-through cache can reduce the delay incurred by data-cache misses and increase application performance. The data-cache miss delay is often lower for write-through caches because making room for a new cache line (by evicting a victim line) never requires the write-back of a dirty cache line. In addition, write-back caches may stall the processor during a write operation to allow the cache to load all of the other words in the target cache line before the write occurs. This type of stall doesn’t happen with write-through caches.
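
Two of these trade-offs lend themselves to back-of-the-envelope arithmetic. The sketch below rests on simplifying assumptions that are not taken from the text: the miss penalty is modeled as the first-transfer latency plus a fixed cost for each additional refill transfer, and write-back traffic is modeled as one bus write per distinct dirty line. The function names and all numeric values are hypothetical.

```python
def miss_penalty(line_bytes, refill_bytes, first_cycles, next_cycles):
    """Cycles to refill one line: first transfer latency plus each later transfer."""
    transfers = -(-line_bytes // refill_bytes)  # ceiling division
    return first_cycles + (transfers - 1) * next_cycles


def bus_writes(store_addrs, line_bytes, write_back):
    """Bus write operations caused by a sequence of stores that hit in the cache.

    Write-through: every store goes to the bus.
    Write-back: assumes each dirty line is written back exactly once.
    """
    if not write_back:
        return len(store_addrs)
    return len({addr // line_bytes for addr in store_addrs})


# Assumed timing: 20 cycles to the first transfer, 1 cycle per later transfer,
# over a 4-byte refill path.
for line_bytes in (16, 64):
    penalty = miss_penalty(line_bytes, 4, 20, 1)
    print(f"{line_bytes}-byte line: {penalty} cycles per miss, "
          f"{penalty / line_bytes:.2f} cycles per byte if the whole line is used")

# Sixteen consecutive word stores covering two 32-byte lines.
stores = list(range(0, 64, 4))
print("write-through bus writes:", bus_writes(stores, 32, write_back=False))
print("write-back bus writes:   ", bus_writes(stores, 32, write_back=True))
```

Under these assumptions the longer line costs more per miss but less per byte when the whole line is consumed. If only one word of each line were used, the shorter line's lower per-miss penalty would win instead.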

Each of the memory-system configuration choices affects the silicon cost of the processor implementation. Larger caches, wider refill width, shorter cache-line size, and greater set associativity all increase the silicon area for RAM, logic, or both. Of these choices, increasing cache capacity and decreasing the cache-line size generally cause the biggest increases in silicon area.

Remote-memory latency is often a central concern in processor configuration. Caches reduce that sensitivity, sometimes dramatically. In many cases, however, optimizing the system’s main memory, independent of the processor’s local-memory system, may be critically important to improving overall performance. Accessing off-chip dynamic RAM, for example, may require 50-100 processor cycles per access, particularly if the path from processor to memory winds through several different on-chip buses or if the RAM interface is not optimized for rapid access times.

Figure 2 shows a memory hierarchy with caches and RAM local to the processor, global on-chip RAM, off-chip RAM, and off-chip flash memories. The access latency from the processor may increase ten-fold for each level in the hierarchy:

  • Local memory access latency: 1 cycle
  • Global on-chip RAM access latency: 10 cycles
  • Off-chip RAM access latency: 100 cycles

Figure 2. SOC Memory Hierarchy.
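
To see how strongly the slower levels can dominate, the latencies listed above can be combined into an average access time. This is a rough sketch: the 1, 10, and 100 cycle latencies come from the list above, but the split of references across levels (95%, 4%, 1%) is an assumption chosen only for illustration, and the function name is hypothetical.

```python
def average_access_cycles(levels):
    """Expected latency given (latency_cycles, fraction_of_references) per level."""
    total_fraction = sum(fraction for _, fraction in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(latency * fraction for latency, fraction in levels)


# Latencies from the hierarchy above; the reference fractions are assumed.
hierarchy = [
    (1, 0.95),    # local cache or RAM
    (10, 0.04),   # global on-chip RAM
    (100, 0.01),  # off-chip RAM
]

average = average_access_cycles(hierarchy)
off_chip_share = (100 * 0.01) / average
print(f"average access latency: {average:.2f} cycles")
print(f"share of access time spent off-chip: {off_chip_share:.0%}")
```

With this assumed split, the 1% of references that go off-chip account for more than 40% of the total access time, which is why tuning main memory can matter as much as tuning the local memory system.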

Configurable processors allow the designer to tweak many different memory-system parameters. Understanding the performance tradeoffs and choosing the optimal configuration may be a complex process. Charts of application performance as a function of key parameters can help the designer visualize those tradeoffs and finalize the memory system more quickly. Figure 3 below shows simulated data-cache performance results for an optimized JPEG encoding application. It plots the total number of execution cycles as a function of cache size, cache-line size, and set associativity.

Figure 3. Data Memory Stalls Graph

For this application, data-cache behavior is an important design consideration. The simplest cache (4 Kbytes, direct mapped, 16-byte line) has a load miss rate of 13.4%, while the most complex (32 Kbytes, 4-way set associative, 64-byte line) has a load miss rate of 1.9%. This chart clearly shows that larger cache sizes are better, but it also suggests diminishing returns above 16 Kbytes.
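
Miss rate alone does not determine stall time, because a longer line also takes longer to refill. The following sketch converts the two miss rates quoted above into stall cycles under assumed timing. Only the miss rates and line sizes come from the text; the 10-cycle first-word latency, the 1-cycle incremental cost, the 4-byte word size, and the assumption that the processor stalls for the whole refill are illustrative.

```python
# Load miss rates and line sizes quoted in the text for the two extreme configurations.
configs = {
    "4 KB, direct mapped, 16-byte line": (0.134, 16),
    "32 KB, 4-way, 64-byte line": (0.019, 64),
}

# Assumed timing (not from the source).
FIRST_WORD_CYCLES = 10
NEXT_WORD_CYCLES = 1
WORD_BYTES = 4


def stall_cycles_per_1000_loads(miss_rate, line_bytes):
    """Load-miss stall cycles per 1000 loads, stalling for the full line refill."""
    penalty = FIRST_WORD_CYCLES + (line_bytes // WORD_BYTES - 1) * NEXT_WORD_CYCLES
    return 1000 * miss_rate * penalty


for name, (rate, line_bytes) in configs.items():
    stalls = stall_cycles_per_1000_loads(rate, line_bytes)
    print(f"{name}: about {stalls:.0f} stall cycles per 1000 loads")
```

Under these assumptions the larger configuration still comes out well ahead, but by a smaller factor than the ratio of miss rates suggests, because each of its misses costs roughly twice as many cycles.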

Line size (the number of bytes brought in on each cache miss) is also a significant factor. Moving from 16- to 64-byte cache lines creates more performance benefit than doubling the cache size. This result is notable because longer cache lines actually reduce silicon cost, while doubling cache size can be expensive (in 0.13 µm technology, a 16-Kbyte RAM array requires roughly 1 mm²). The figure also shows that 2-way set associativity is clearly better than direct-mapped (one-way) cache, but going from 2-way to 4-way set associativity has less dramatic benefits.

This simulation is based on a single-processor design, without consideration of other processors that may be contending for access to shared, non-local memory. The single-processor behavior may change depending on the memory design, the interconnect structure between processors and memory, and the pattern of memory references by other processors. In particular, the effective memory-access latency may increase due to memory or bus contention. When the set of processors is known, re-simulation with more accurate modeling of processor-to-memory contention may give more exact predictions of final system performance. This refined simulation may suggest further improvements to the memory-system configuration of some processors, perhaps including increased cache size or set associativity, a change to write-back policy, or an increase in the cache-refill block size.
