How to Optimize SOC Performance Through Memory Tuning
The purpose of memory tuning is to explore
the target application’s sensitivity to
memory system parameters and to choose parameters
for each processor that balance performance,
processor cost, and size appropriately. The processor
simulator plays a central role in this assessment,
because the simulator models and reports the
expected performance with a breakdown of memory-related
stalls.
Memory system configuration consists of two
phases. First, the designer must establish the
strategy for each major segment of instruction
and data storage, answering basic questions such
as:
- Will the processor execute its initialization
code from off-chip read-only or flash memory,
or will that code be preloaded into a local
on-chip instruction RAM? This question is important
because the boot-code location helps determine
the kind of memory-bus interface that is required.
If the processor never needs access to remote
memory except for initialization code, a simpler
and smaller memory interface may be appropriate.
- Is the performance-sensitive application
code small enough to fit entirely in instruction
RAM local to the processor or is an instruction
cache necessary?
- How does the processor load application
input data? Is it loaded from remote input
buffers or memories, is it pushed into the
processor’s
local data RAM by an outside agent such as
a DMA controller or another processor, or does
the processor directly load the data using
input or load instructions? These questions
are important because data I/O references also
help determine the kind of memory-bus interface
that is required.
- How does the application result data exit
the processor? Is it sent to remote output
buffers or memories, is it pulled from the
processor’s
local data RAM by an outside agent such as
a DMA controller or another processor, or does
the processor transfer the data through direct
output operations using output or store instructions?
These questions are important (like the previous
set) because data I/O references also help
determine the kind of memory-bus interface
that is required.
- Can all of the performance-sensitive
data (including application data, maximum-sized
stack, and other variable data storage) fit
entirely in data RAM local to the processor
or will it be necessary to emulate a large
local data RAM using a local data cache and
a large external memory?
Sometimes
the instruction or data segments are so
large that instruction and data caches are
necessary for good overall performance. Unfortunately,
the complex interactions among memory references
may make cache behavior appear non-deterministic.
This trait presents a significant problem
for some embedded applications, where certain
instruction sequences or data regions must
be accessible with a small, constant latency.
This need can
be addressed either by closely coupling both
cache and local RAM to the processor (with
time-critical instructions or data allocated
to addresses mapped to the local RAM) or by using
cache locking to prevent certain data or instruction
lines from being evicted from the cache once
the correct contents are loaded. Cache locking
temporarily makes selected cache regions (on
a line-by-line basis) act as local memory.
Detailed Memory System Tuning
Once the basic organization of instruction,
data, and cache memory has been established,
detailed memory-system tuning can begin. This process
is driven by analysis of the memory stalls that each
candidate memory-system configuration produces.
Figure 1 below shows a typical application performance
profile for an MPEG-4 encoding application, including
instruction-cache-miss stalls, data-cache-miss
stalls, store-buffer stalls (for a write-through
cache), and other instruction-execution delays
(exceptions, source interlocks and branch-taken
delays).
Figure 1. Memory System Profile and Parameters
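A profile like this can be reduced to a per-category share of total cycles, which shows where tuning effort is likely to pay off. The sketch below assumes a simulator report that has already been parsed into cycle counts per category. The category names follow the text, but every number is hypothetical and is not taken from Figure 1.

```python
# Hypothetical cycle counts standing in for a simulator's profile report.
# These values are illustrative only; they are not the data behind Figure 1.
profile = {
    "instruction execution": 600_000,
    "instruction-cache-miss stalls": 120_000,
    "data-cache-miss stalls": 200_000,
    "store-buffer stalls": 30_000,
    "other delays (exceptions, interlocks, taken branches)": 50_000,
}


def stall_breakdown(cycles_by_category):
    """Return each category's share of total cycles as a fraction."""
    total = sum(cycles_by_category.values())
    return {name: cycles / total for name, cycles in cycles_by_category.items()}


shares = stall_breakdown(profile)

# List the categories from largest to smallest share of execution time.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:55s} {share:6.1%}")
```

With these made-up numbers, data-cache-miss stalls are the largest memory-related category, so the data-cache parameters would be the first place to look.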
Two perspectives guide detailed memory-system
tuning:
- The macro view of the memory system’s
aggregate performance across all instruction
and data references in a complete application.
- The micro view of the memory system’s behavior,
especially data references, in the key application
hot spots or inner loops.
Aggregate Memory System Performance
The macro view is driven by the accumulated dynamic
statistics of all program-memory references.
The cumulative statistics often give little insight
into why, for example, cache misses occur, but
they serve as the foundation for tuning overall
application throughput. Simulating the application
with a range of different memory-system parameters
establishes the trade-offs between application
performance and memory-system implementation
cost.
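The kind of parameter sweep described here can be illustrated with a toy cache model. The sketch below is not the processor simulator the article refers to. It is a minimal set-associative cache with LRU replacement (an assumed policy), driven by a hypothetical address trace of two sequential passes over an 8-Kbyte array. The function name `miss_rate` and the trace are illustrative.

```python
from collections import OrderedDict
from itertools import product


def miss_rate(addresses, cache_bytes, line_bytes, ways):
    """Simulate a set-associative LRU cache and return its miss rate."""
    num_sets = cache_bytes // (line_bytes * ways)
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses / len(addresses)


# Hypothetical trace: two sequential passes over an 8-Kbyte array of 4-byte words.
trace = [addr for _ in range(2) for addr in range(0, 8192, 4)]

# Sweep cache size, line size, and associativity over the same trace.
for cache_bytes, line_bytes, ways in product((4096, 16384), (16, 64), (1, 2)):
    rate = miss_rate(trace, cache_bytes, line_bytes, ways)
    print(f"{cache_bytes // 1024:2d} KB, {line_bytes:2d}-byte line, "
          f"{ways}-way: miss rate {rate:.2%}")
```

For this toy trace the 4-Kbyte configurations miss on every line in both passes, because the array is larger than the cache. The 16-Kbyte configurations miss only during the first pass. Real application traces behave far less regularly, which is why the text relies on full simulation.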
Typically, the behavioral patterns of an application’s
instruction references and its data references
are uncorrelated, so the statistics of each
should be examined separately to determine the appropriate
configuration. The impact of various parameters
is summarized below:
- Memory-access latency
(first word): Reducing
the memory latency for the first word of
a cache-refill block always reduces the penalty
for cache misses and increases application
performance.
- Memory-access
time (each additional word of the block): Reducing the incremental delay for each subsequent
word in a block reduces the penalty for cache
misses (when the cache-line size is larger
than one memory word) and increases application
performance.
- Cache size: Increasing the size of the cache
almost always reduces the number of cache
misses and thus increases application performance.
- Cache set associativity: Increasing the number
of ways in the cache almost always reduces
the number of cache misses and increases
application performance.
- Cache-line size: Increasing the
cache-line size may increase or decrease
application performance, depending on the pattern
of memory references, because longer cache
lines take more time to load from or write
back to main memory and increase the risk of
interference among lines (longer cache lines
mean fewer total lines in a given size cache).
When all the words in the cache line are used
before a line is evicted from the cache, a
longer line size generally improves application
performance, particularly when the latency
to access the first memory word is long and
the incremental time for additional words is
short. Conversely, when the application uses
little of each cache line, a shorter cache line
may improve performance, particularly if the
memory latency is short or the cost of each
incremental word of the cache line is large.
This uncertainty is why accurate simulation
is so important.
- Refill
width: Increasing the number of bits that
can be transferred into or out of the cache
on each cycle generally reduces the delay for
reading or writing a cache line and improves
application performance, especially for long
cache lines. The refill width often corresponds
to the width of the bus connecting the processor
to main memory, so instruction- and data-refill
widths are almost always the same, though this
is not mandatory.
- Write-back vs. write-through
data cache: Choosing
a write-back cache generally reduces the number
of write operations to the remote (on-chip or
off-chip) memory associated with the data cache,
because several processor stores to locations
within one cache line result in just one write
operation on the processor-to-main-memory bus.
If the memory write bandwidth or the bus bandwidth
to the memory is relatively narrow, choosing
a write-back policy may increase application
performance. If write bandwidth is not a problem,
a write-through cache can reduce the delay incurred
by data-cache misses and increase application
performance. The data-cache miss delay is often
lower for write-through caches because making
room for a new cache line (by evicting a victim
line) never requires the write-back of a dirty
cache line. In addition, write-back caches may
stall the processor during a write operation
to allow the cache to load all of the other words
in the target cache line before the write occurs.
This type of stall doesn’t happen with
write-through caches.
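The first-word latency, per-word time, line-size, and refill-width items above combine into a simple estimate of the cost of one cache miss. The sketch below assumes that a line is refilled in fixed-width transfers, that the first transfer costs the full memory latency, and that each later transfer costs a fixed increment. The function name and the cycle counts (20 and 2) are illustrative assumptions, not figures from the article.

```python
def miss_penalty_cycles(line_bytes, refill_bytes, first_latency, per_transfer):
    """Estimate the cycles needed to refill one cache line.

    The first transfer costs first_latency cycles; each additional
    transfer of refill_bytes costs per_transfer cycles.
    """
    transfers = -(-line_bytes // refill_bytes)  # ceiling division
    return first_latency + (transfers - 1) * per_transfer


# Assumed memory timing, for illustration only.
FIRST_LATENCY = 20  # cycles until the first transfer arrives
PER_TRANSFER = 2    # cycles for each additional transfer

for line_bytes in (16, 64):
    for refill_bytes in (4, 8):
        penalty = miss_penalty_cycles(line_bytes, refill_bytes,
                                      FIRST_LATENCY, PER_TRANSFER)
        print(f"{line_bytes:2d}-byte line, {refill_bytes}-byte refill: "
              f"{penalty:2d} cycles per miss, "
              f"{penalty / line_bytes:.2f} cycles per byte fetched")
```

Under these assumptions, a 64-byte line over a 4-byte refill path costs 50 cycles per miss against 26 for a 16-byte line. The longer line is cheaper per byte when the whole line is used, but costs nearly twice as much per miss when the application touches only one word of it.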
Each of the memory-system configuration choices
affects the silicon cost of the processor implementation.
Larger caches, wider refill widths, shorter cache lines,
and greater set associativity all increase
the silicon area for RAM, logic, or both. Of
these choices, increasing cache capacity and
decreasing the cache-line size generally cause
the biggest increases in silicon area.
Remote-memory latency is often a central concern
in processor configuration. Caches reduce that
sensitivity, sometimes dramatically. In many
cases, however, optimizing the system’s
main memory, independent of the processor’s
local-memory system, may be critically important
to improving overall performance. Accessing off-chip
dynamic RAM, for example, may require 50-100
processor cycles per access, particularly if
the path from processor to memory winds through
several different on-chip buses or if the RAM
interface is not optimized for rapid access times.
Figure 2 shows a memory hierarchy with caches
and RAM local to the processor, global on-chip
RAM, off-chip RAM, and off-chip flash memories.
The access latency from the processor may increase
ten-fold for each level in the hierarchy:
- Local memory access latency: 1 cycle
- Global on-chip RAM access latency: 10 cycles
- Off-chip RAM access latency: 100 cycles

Figure 2. SOC Memory Hierarchy.
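The tenfold steps in this hierarchy mean that a small fraction of remote references can dominate the average access time. The sketch below uses the 1, 10, and 100 cycle latencies from the list above. The reference mix is hypothetical, and the model makes a simplifying assumption: each reference is charged only the latency of the level that serves it.

```python
# Latencies per level, taken from the list above (in processor cycles).
LATENCY_CYCLES = {
    "local": 1,
    "global on-chip RAM": 10,
    "off-chip RAM": 100,
}


def average_access_cycles(fractions):
    """Return the average latency for a mix of references.

    fractions maps each level name to the fraction of references it serves.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * share for level, share in fractions.items())


# Hypothetical mix: 95% local, 4% global on-chip, 1% off-chip.
mix = {"local": 0.95, "global on-chip RAM": 0.04, "off-chip RAM": 0.01}
print(f"average access latency: {average_access_cycles(mix):.2f} cycles")
```

With this assumed mix, the 1% of references that go off-chip contribute slightly more cycles than the 95% served locally.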
Configurable processors allow the designer to
tweak many different memory-system parameters.
Understanding the performance tradeoffs and choosing
the optimal configuration may be a complex process.
Charts of application performance as a function
of key parameters can help the designer visualize
those tradeoffs and finalize the memory system
more quickly. Figure 3 below shows simulated
data-cache performance results for an optimized
JPEG encoding application. It plots the total
number of execution cycles as a function of cache
size, cache-line size, and set associativity.
Figure 3. Data Memory Stalls Graph
For this application, data-cache behavior is
an important design consideration. The simplest
cache (4 Kbytes, direct mapped, 16-byte line)
has a load miss rate of 13.4%, while the most
complex (32 Kbytes, 4-way set associative, 64-byte
line) has a load miss rate of 1.9%. This chart
clearly shows that larger cache sizes are better,
but it also suggests diminishing returns above
16 Kbytes.
Line size—the number of bytes brought
in on each cache miss—is also a significant
factor. Moving from 16- to 64-byte cache lines
creates more performance benefit than doubling
the cache size. This result is notable because
longer cache lines actually reduce silicon cost,
while doubling cache size can be expensive (in
0.13-µm technology, a 16-Kbyte RAM array requires
roughly 1 mm²). The figure also shows that 2-way
set associativity is clearly better than direct-mapped
(one-way) cache, but going from 2-way to 4-way
set associativity has less dramatic benefits.
This simulation is based on a single-processor
design, without consideration of other processors
that may be contending for access to shared,
non-local memory. The single-processor behavior
may change depending on the memory design, the
interconnect structure between processors and
memory, and the pattern of memory references
by other processors. In particular, the effective
memory-access latency may increase due to memory
or bus contention. When the set of processors
is known, re-simulation with more accurate modeling
of processor-to-memory contention may give more
exact predictions of final system performance.
This refined simulation may suggest further improvements
to the memory-system configuration of some processors,
perhaps including increased cache size or set
associativity, a change to write-back policy,
or an increase in the cache-refill block size.
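Before a full multiprocessor simulation is available, a rough queueing estimate can indicate how much contention might inflate the effective latency. The sketch below is not the article's method. It borrows the textbook M/M/1 waiting-time formula as a stand-in, and the base latency, service time, and utilization values are all hypothetical.

```python
def effective_latency(base_latency, service_cycles, utilization):
    """Estimate memory latency including queueing delay.

    Assumes an M/M/1 queue (an illustrative model, not from the article):
    mean wait = service_cycles * utilization / (1 - utilization).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    wait = service_cycles * utilization / (1 - utilization)
    return base_latency + wait


# Hypothetical shared memory: 100-cycle base latency and 10 cycles of
# service time per request, at rising levels of load from other processors.
for utilization in (0.0, 0.5, 0.8, 0.9):
    latency = effective_latency(100, 10, utilization)
    print(f"utilization {utilization:.0%}: about {latency:.0f} cycles")
```

The estimate grows sharply as the shared memory approaches saturation, which is consistent with the text's point that re-simulation with the full set of processors may change the preferred cache configuration.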