A New Kind of Processor for Complex SOC Designs
A new type of processor core has been getting
a lot of attention lately – a processor
you can tailor for a specific application. Configurable
processors are much faster and can do much more
than standard embedded microprocessors. Some
can even replace hand-coded RTL in ASICs and
SOCs.
What is a configurable processor? What can configurable
processors do? Why would anyone want to use this
type of processor? How can a configurable processor
replace RTL coding? How does an engineer design
with configurable processors? These questions
and more are answered in this article.
Standard embedded processor cores
First, consider standard, fixed-ISA (instruction
set architecture), embedded microprocessor and
DSP cores. Popular fixed-ISA embedded microprocessor
architectures, including the ARM, MIPS and PowerPC
processors, were originally designed as stand-alone
chips in the 1980s. Over time, they’ve
become faster and additional computational resources
have been added so the processor can perform
more work per clock. These architectures are
good at executing a wide range of algorithms,
but designers often have to speed up critical
portions of the design in hardware. Even DSP
architectures must be designed to provide adequate
performance when executing a wide range of algorithms,
so they can’t match the speed of a custom-tailored
solution.
Hand-coding RTL to speed up designs
Because many applications, especially demanding
multimedia and communications applications, just
don’t run fast enough on standard embedded
microprocessors even with the extra performance
boost of an embedded DSP, engineering teams hand-code
parts of the design in Verilog or VHDL to achieve
the performance they need. However, custom RTL
logic for complex functions takes a long time
to design and verify. In addition, hand-coded
RTL blocks are often too rigid to change once
they’re designed, yet changes are often
needed to accommodate changing standards or new
product features.
A closer look at the makeup of the typical
RTL block gives insight into this paradox. Figure
1 shows the RTL datapath on the left and the
block’s state machine on the right.

Figure 1. Hardwired RTL = Datapath plus State
Machine
In most RTL designs, the datapath consumes the
vast majority of the gates in the logic block.
A typical datapath may be as narrow as 16 or
32 bits, or hundreds of bits wide. The datapath
will typically contain many data registers and
will often have significant blocks of RAM or
interfaces to RAM that is shared with other RTL
blocks.
By contrast, the RTL logic block’s finite
state machine contains nothing but control details.
All the nuances of the sequencing of data through
the datapath, all the exception and error conditions,
and all the handshakes with other blocks are
captured in this subsystem of the RTL logic block.
This state machine embodies most of the design
and verification risk due to its complexity.
A late design change made to an RTL block is
much more likely to affect the state machine
than the structure of the datapath. Configurable,
extensible processors (a fundamentally new form
of microprocessor) provide a way of reducing
the risk of state-machine design by replacing
hard-to-design, hard-to-verify state-machine
logic blocks with pre-designed, pre-verified
processor cores and application firmware.
The Promise of Configurable Processors
The growth in the use of many large RTL blocks
for SOC designs causes the well-recognized “SOC
design gap” to widen every year. This gap
arises between the explosive growth in chip complexity
and the somewhat slower growth in designer productivity.
The trend towards high-performance, low-power
systems (e.g. long-battery-life cell-phones,
four-mega-pixel digital cameras, fast and inexpensive
color printers, digital HDTVs, and 3D video games)
is increasing the size of SOC designs as well
as the SOC design gap.
Hardwired RTL design has many attractive characteristics—small
die area, low power, and high-throughput. However,
the liabilities of RTL (difficult design, slow
and difficult verification, and poor scalability
to complex problems) are starting to dominate
as chip gate counts become enormous. Configurable
processors are now a viable replacement for complex
RTL.
What is a Configurable Processor?
A full-featured configurable processor toolkit
consists of a pre-defined processor core and
a design-tool environment that permits significant
adaptation of that base processor design for
specific application requirements. Typical forms
of configurability include additions, deletions,
and modifications to memories, external bus widths
and handshake protocols, and commonly used processor
peripherals.
Extensible processors, an important superset
of configurable processors, provide system designers
with the ability to add instructions to the processor
that may have never been considered or imagined
by designers of the original architecture. The
addition of highly customized instructions matched
perfectly to a specific application gives configurable
processors the ability to deliver performance
levels rivaling RTL while gaining the benefits
of pre-verified IP (intellectual property). Configurable
processors are delivered as RTL code that is
synthesized into an FPGA or SOC design. The best
configurable processors also come with matching
software development tools that reflect the hardware
instructions added through designer-defined architectural
extensions.
A configurable processor can implement datapath
operations that closely match those of RTL functions.
The equivalent datapaths are implemented using
the integer pipeline of the base processor, plus
additional execution units, registers, and other
functions added by the chip architect for a specific
application.
For example, the Tensilica Instruction Extension
language (TIE, a simplified version of Verilog)
is an example of a design tool that allows system
developers to extend Tensilica’s Xtensa
32-bit processor architecture for specific applications.
TIE is optimized for high-level specification
of datapath functions in the form of instruction
semantics and encoding. A TIE description is
both simpler and much more concise than RTL because
it omits all sequential logic descriptions, including
state machine descriptions, pipeline registers,
and initialization sequences. These complex items
are actually developed in firmware.
The new processor instructions and registers
described in TIE are available to the firmware
programmer via the same compiler and assembler
that target the processor’s base instructions
and register set. All operation sequencing within
the processor’s datapaths is controlled
by firmware, through the processor’s existing
instruction-fetch, decode, and execution mechanisms.
State-machine firmware can usually be written
in a high-level language such as C or C++ because
of the high performance provided by tailored
microprocessor architectures.
Configurable Processors as RTL Alternatives
Configurable processors used as RTL replacements
routinely use the same datapath structures as
traditional RTL blocks: deep pipelines, parallel
execution units, task-specific state registers,
and wide data buses to local and global memories.
These extended processors can sustain the same
high computation throughput and support the same
data interfaces as typical RTL designs.
Control of configurable-processor datapaths,
however, is very different from that of their RTL counterparts.
Cycle-by-cycle control of a processor’s
datapaths is not fixed in hardwired state transitions
but is embodied in firmware executed by the processor
(shown in Figure 2). Control-flow decisions occur
in branches; memory references are explicit in
load and store operations; computations are explicit
sequences of general-purpose and application-specific
computational operations.

Figure 2. Programmable hardware function: datapath
+ processor + software
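To make the contrast with a hardwired state machine concrete, here is a minimal Python sketch of firmware-style control. It is illustrative only: `State`, `mac_step`, and `run_block` are hypothetical names, and `mac_step` merely stands in for whatever designer-defined datapath instruction a real block would use. Python serves here as a readable modeling language, not as the firmware language itself.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ACCUMULATE = auto()
    DONE = auto()


def mac_step(acc, sample, coeff):
    """Hypothetical stand-in for a custom datapath instruction (multiply-accumulate)."""
    return acc + sample * coeff


def run_block(samples, coeffs):
    """Firmware-style control loop: branches sequence the datapath operations."""
    state = State.IDLE
    acc = 0
    index = 0
    while state is not State.DONE:
        if state is State.IDLE:
            # Control-flow decision expressed as a branch.
            state = State.ACCUMULATE if samples else State.DONE
        elif state is State.ACCUMULATE:
            # Explicit loads from memory, then the datapath operation.
            sample = samples[index]
            coeff = coeffs[index]
            acc = mac_step(acc, sample, coeff)
            index += 1
            if index == len(samples):
                state = State.DONE
    return acc


if __name__ == "__main__":
    print(run_block([1, 2, 3], [4, 5, 6]))
```

In this model, changing the sequencing (for example, adding an error state) means editing `run_block`; the datapath operation itself is untouched.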
The migration from RTL hardwired state machine
to configurable processors with firmware control
has many important implications:
Flexibility: Chip developers and system builders
can change a block’s function just by changing
the firmware, even after the product has shipped.
Software-based development: Developers use relatively
fast and low-cost software tools to implement
most chip features.
Faster, more complete system modeling: For a
10-megagate design, even the fastest software-based
logic simulator may not exceed a few cycles per
second. By contrast, firmware simulations for
extended processors run on instruction-set simulators
at hundreds of thousands or millions of cycles
per second.
Unification of control and data: No modern system
consists solely of hardwired logic. There’s
always a processor running software. Moving RTL-based
functions into a processor removes the artificial
separation between control and data processing.
Time-to-market: Moving critical functions from
RTL to configurable processors simplifies SOC
design, accelerates system modeling, and speeds
hardware finalization. Firmware-based state machines
easily accommodate changes to standards because
implementation details aren’t “cast
in stone.”
Designer Productivity: Most importantly, migration
from RTL-based design to application-specific
processors boosts the engineering team’s
productivity by reducing the engineering manpower
needed for both RTL development and verification.
A processor-based SOC design approach cuts risks
of fatal logic bugs and permits graceful recovery
when (not if) a bug is discovered.
The benefit of being able to make changes in
software rather than hardware with a processor-based
approach cannot be overstated. Configurable
processors reduce the risk of state-machine design
by replacing hard-to-design, hard-to-verify state-machine
logic blocks with pre-designed, pre-verified
processor cores and application firmware.
The Key: Automatic Hardware and Software Generation
The first configurable processors were introduced
in the mid-1990s and had one important drawback:
once instructions were added to the processor,
there was no automatic way to make sure the software-development
tools could use those instructions. So companies
that chose to use configurable processors had
to somehow modify the software-development tools
by hand.
In early 1999, Tensilica introduced its first
Xtensa processor with a major innovation – automatic
hardware and software generation. Designers
could specify configuration options using an
Internet-based browser approach. New, designer-defined
instructions were automatically integrated via
the Xtensa Processor Generator, which produces
a verified hardware implementation as well as
tailored versions of all necessary software-development
tools including compilers, debuggers, instruction-set
simulators, and much more. The software tools
are matched perfectly to the configuration, and
no extra work is required to match tools and
processor.
The response to the Tensilica Xtensa processor
and its ability to automatically generate the
hardware and software has been strong. Over 60
companies are designing SOCs using Tensilica’s
Xtensa processors. Many of these companies use
multiple Xtensa processors – some designs
employ multiple copies of the processor performing
the same tasks and other designs use different
tailored versions of the Xtensa processor to
perform a variety of on-chip tasks.
Use Configurable Processors Within the Software
Development Process
Now, let’s look at the software development
process currently used to develop embedded applications.
Figure 3 illustrates a typical flow for developing
embedded application software. Design work starts
not with the processor but with the algorithm.
Application developers generally start with high-level
design tools and languages such as C or C++ and
they may purchase algorithms that are already
developed using those languages. High-level programming
languages and other types of development packages
allow developers to create, test, and validate
primary algorithmic ideas and smaller independent
algorithms and sub-algorithms using tools that
deal with the algorithm in a state that’s
divorced from a particular processor architecture.

Next, the developers translate the main algorithm
and sub-algorithms into C to create a portable,
processor-independent application code base.
C-level simulation, performed on a PC or workstation,
then proves that the recoded algorithms perform
as expected. After integrating the sub-algorithms
and other application software modules into a
coherent whole, the entire program (now written
in C or C++) is recompiled for a target processor
and the resulting application code is tested
and profiled.
If the development team is extremely fortunate,
the compiled algorithm code executes with the
desired speed. However, often, to meet project
performance goals, application software teams
must convert critical sections of code into hand-tuned
assembly code once a fixed-ISA processor is selected.
The software development team must generally
try to closely map the assembly code to the processor
by hand. Otherwise, the selected processor will
probably end up being too expensive, too fast,
or too power hungry for the intended embedded
application. Assembly code developers must carefully
dovetail their variables into the available registers
because there’s no way to add more registers
to a fixed-ISA processor if the existing register
set proves inadequate.
Fit the Processor to the Algorithm
Configurable processors allow embedded-system
developers to create processors specifically
tailored to the target algorithms – producing
a much better fit between processor and algorithm.
Designers can add special-purpose, variable-width
registers; specialized execution units; and wide
data buses to reach an optimum processor configuration
for specific algorithms. These features allow
developers to mold the processor’s characteristics
to the algorithm instead of trying to force-fit
the 10-pound algorithm into the resources available
in a 5-pound, fixed-ISA processor or DSP. Consequently,
application developers can more rapidly develop
systems that meet all performance specifications
using configurable and extensible processors
than by using off-the-shelf, fixed-ISA microprocessors
and DSPs.
As with hand-tuned assembly language, optimization
points for a configurable and extensible processor
implementation become apparent through code profiling.
Optimization targets typically reside within
the innermost software loops that execute many
thousands or millions of times per second. Reducing
the instruction count of the object code inside
of these loops produces a huge and positive effect
on system performance. The following three examples
illustrate the sort of performance improvements
algorithm developers can expect when using configurable
and extensible processors. (All of the following
examples are based on Tensilica’s Xtensa
microprocessor.)
Accelerating the FFT
The heart of the decimation-in-frequency FFT
algorithm is an operation called the “butterfly,” which
resides at the innermost loop of the FFT. Each
butterfly operation requires six additions and
four multiplications to compute the real and
imaginary components of a radix-2 butterfly result.
Using the TIE language, it’s possible for
a design team to augment the Xtensa processor’s
pipeline with four adders and two multipliers
so that half of an FFT butterfly can be computed
in one cycle.
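The operation count quoted above can be checked against a plain reference model of the radix-2 decimation-in-frequency butterfly. The sketch below is only a Python model of the arithmetic, with a hypothetical function name; it says nothing about how a TIE implementation divides the work between its two half-butterfly steps.

```python
def dif_butterfly(ar, ai, br, bi, wr, wi):
    """Radix-2 decimation-in-frequency butterfly on real/imaginary parts.

    Inputs a = ar + j*ai, b = br + j*bi, twiddle factor w = wr + j*wi.
    Returns (sum_r, sum_i, prod_r, prod_i) where
    sum = a + b and prod = (a - b) * w.
    """
    # Additions 1-2: upper output a + b.
    sum_r = ar + br
    sum_i = ai + bi
    # Additions 3-4: difference a - b (subtractions count as additions).
    dr = ar - br
    di = ai - bi
    # Multiplications 1-4 and additions 5-6: complex product (a - b) * w.
    prod_r = dr * wr - di * wi
    prod_i = dr * wi + di * wr
    return sum_r, sum_i, prod_r, prod_i


if __name__ == "__main__":
    # a = 3 + 4j, b = 1 - 2j, w = -j
    print(dif_butterfly(3, 4, 1, -2, 0, -1))
```

Counting the operators in the function body gives six additions or subtractions and four multiplications, which matches the figures in the text.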
The Xtensa processor’s configurable data-bus
interface can be defined to be as wide as 128
bits so that all four real and imaginary integer
input terms of each butterfly can be loaded into
special-purpose FFT input registers in one cycle.
All four computed output components can be stored
into memory in one cycle as well. Because the
load and store operations for each FFT butterfly
require a cycle each, the most cost-efficient
approach to the FFT computation is to stretch
each FFT half-butterfly computation across two
cycles, to occur in parallel with a load operation
for a subsequent butterfly and a store operation
for a prior butterfly. This approach saves hardware
and matches the computational and data-transfer
resources.
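A small model helps show why two cycles per butterfly balances computation against memory traffic. The schedule below is an assumption made for illustration (one memory operation per cycle, with loads and stores alternating); the function names are hypothetical, and the model ignores loop control, twiddle-factor fetches, and data reordering, so it gives a lower bound rather than a prediction of measured cycle counts.

```python
import math


def butterfly_count(n_points):
    """Number of radix-2 butterflies in an n-point FFT: (n / 2) * log2(n)."""
    stages = int(math.log2(n_points))
    if 2 ** stages != n_points:
        raise ValueError("n_points must be a power of two")
    return (n_points // 2) * stages


def schedule(n_butterflies):
    """Assumed steady-state schedule: two cycles per butterfly.

    In each two-cycle slot the unit computes both halves of butterfly k,
    loads the inputs of butterfly k + 1, and stores the outputs of
    butterfly k - 1. Returns a list of (cycle, memory_op, compute_op).
    """
    rows = []
    # Slots run from -1 (fill: first load) to n_butterflies (drain: last store).
    for k in range(-1, n_butterflies + 1):
        cycle = 2 * (k + 1)
        computing = 0 <= k < n_butterflies
        load = f"load {k + 1}" if k + 1 < n_butterflies else "idle"
        store = f"store {k - 1}" if k - 1 >= 0 else "idle"
        rows.append((cycle, load, f"half-1 of {k}" if computing else "idle"))
        rows.append((cycle + 1, store, f"half-2 of {k}" if computing else "idle"))
    return rows


def ideal_cycles(n_points):
    """Cycle count of the assumed schedule, including pipeline fill and drain."""
    return len(schedule(butterfly_count(n_points)))


if __name__ == "__main__":
    for row in schedule(3):
        print(row)
    print(ideal_cycles(128))
```

Under these assumptions the memory port and the butterfly unit are both busy in every steady-state cycle, which is the matching of computational and data-transfer resources described above.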
Practically speaking, it’s very hard to
create single-cycle, synthesizable multipliers
for SOCs that operate at clock rates of several
hundred megahertz. Although it’s possible
to create hard-macro IP multipliers that operate
in one clock cycle, SOC designers prefer to use
synthesizable IP components whenever possible
because such components allow maximum freedom
in selecting semiconductor manufacturing processes
and vendors. Consequently, it’s much better
for the overall chip design to stretch the multiplication
across two cycles so that the multiplier is not
the critical timing element on the SOC. The additional
multiplier latency does not affect throughput
in this example and, if necessary, even longer
latencies can be accommodated through additional
state storage in the butterfly execution unit.
This approach to computing the FFT butterfly
adds a SIMD (single-instruction, multiple data)
butterfly computation unit to the processor (using
fewer than 35,000 gates including the two 24x24-bit
multipliers). The performance improvements achieved
by using this approach over straight C code,
and C code augmented with just the addition of
a hardware multiplier (like a traditional DSP),
appear in Table 1. The table also shows the code
size of the FFT programs with and without the
TIE extensions.
|                      | C code only     | C code + hardware multiplier | C code + TIE FFT instructions | Speedup factor (C only / TIE) |
| Code size (bytes)    | 430 + libraries | 430                          | 158                           |                               |
| Performance (cycles) |                 |                              |                               |                               |
| 128-point FFT        | 763,548         | 169,739                      | 2,269                         | 337                           |
| 256-point FFT        | 1,787,645       | 386,498                      | 4,711                         | 379                           |
| 512-point FFT        | 3,975,245       | 867,133                      | 9,841                         | 404                           |
| 1024-point FFT       | 9,241,893       | 1,922,644                    | 20,603                        | 449                           |
Table 1. Acceleration results from processor
augmentation with FFT instructions
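The speedups implied by Table 1 can be recomputed directly from the cycle counts. The sketch below assumes the three cycle-count columns are, in order, straight C code, C code with a hardware multiplier, and C code with the TIE FFT instructions; that ordering is inferred from the surrounding text. The names are hypothetical.

```python
# Cycle counts copied from Table 1, keyed by FFT length.
# Assumed column order: (C only, C + hardware multiplier, C + TIE FFT instructions).
TABLE_1_CYCLES = {
    128: (763_548, 169_739, 2_269),
    256: (1_787_645, 386_498, 4_711),
    512: (3_975_245, 867_133, 9_841),
    1024: (9_241_893, 1_922_644, 20_603),
}


def speedups(n_points):
    """Return (speedup from multiplier alone, speedup from TIE), both vs. C only."""
    c_only, with_multiplier, with_tie = TABLE_1_CYCLES[n_points]
    return c_only / with_multiplier, c_only / with_tie


if __name__ == "__main__":
    for n in TABLE_1_CYCLES:
        mult, tie = speedups(n)
        print(f"{n}-point: multiplier {mult:.1f}x, TIE {tie:.0f}x")
```

Rounded to the nearest integer, the C-only to TIE ratios reproduce the last column of the table, while the hardware multiplier alone accounts for a factor of about 4.5 to 4.8.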
Accelerating Viterbi Code
A different signal-processing example, Viterbi
decoding, comes from GSM cellular telephony.
GSM employs Viterbi decoding to pull information
symbols out of a noisy communication channel.
This decoding scheme employs “Viterbi butterfly” operations
consisting of 8 logical operations (4 additions,
2 comparisons, and 2 selections) and performs
8 Viterbi butterfly operations to decode each
symbol in the received digital information stream.
Typically, RISC processors need 50 to 80 instruction
cycles to execute one Viterbi butterfly. A high-end
VLIW DSP (TI’s 320C64xx) requires only
1.75 cycles to compute each Viterbi butterfly.
The TIE (Tensilica Instruction Extension) language
allows a designer to add a Viterbi butterfly
instruction to the Xtensa processor’s ISA.
This design uses the processor’s configurable
128-bit I/O bus to load 8 symbols at a time,
adds the pipeline hardware shown in Figure 4,
and results in an average butterfly execution
time of 0.16 cycles per butterfly. An unaugmented
Xtensa processor executes Viterbi butterflies
in 42 cycles, so the butterfly execution hardware
(approximately 11,000 added gates) achieves a
250x speed improvement over the out-of-the-box
Xtensa processor.

Figure 4. Detail of Viterbi butterfly augmentation
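The Viterbi butterfly described above is an add-compare-select step, and its operation count can be seen in a short reference model. This is a sketch under stated assumptions: the function name is hypothetical, the two branches of each pair are taken to have branch metrics of equal magnitude and opposite sign, and the larger candidate is kept as the survivor. Real decoders differ in these conventions.

```python
def viterbi_butterfly(pm0, pm1, bm):
    """Add-compare-select for one Viterbi butterfly.

    pm0, pm1: path metrics of the two predecessor states.
    bm: branch metric; the complementary branches use -bm (an assumption
        chosen here for brevity).
    Returns ((metric_a, decision_a), (metric_b, decision_b)), where each
    decision bit records which predecessor survived.
    """
    # 4 additions (subtractions counted as additions).
    a0 = pm0 + bm
    a1 = pm1 - bm
    b0 = pm0 - bm
    b1 = pm1 + bm
    # 2 comparisons.
    take_a1 = a1 > a0
    take_b1 = b1 > b0
    # 2 selections.
    metric_a = a1 if take_a1 else a0
    metric_b = b1 if take_b1 else b0
    return (metric_a, int(take_a1)), (metric_b, int(take_b1))


if __name__ == "__main__":
    print(viterbi_butterfly(10, 7, 2))
```

The eight operations are independent enough to run side by side in hardware, which is what makes a single butterfly instruction attractive.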
Accelerating an MPEG-4 Decoder
MPEG-4, the third example of achieving performance
through instruction extension and parallel operation
execution, is from the video world. One of the
most difficult parts of encoding MPEG4 video
data is motion estimation, which requires the
ability to search adjacent video frames for similar
pixel blocks. The search algorithm’s inner
loop contains a SAD (sum of absolute differences)
operation consisting of a subtraction, an absolute
value, and the addition of the resulting value
with the previously computed value.
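As a reference point, the SAD inner loop and an exhaustive block search can be modeled in a few lines of Python. The function names and the list-of-rows frame layout are assumptions made for this sketch, not part of any codec API.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        for pa, pb in zip(row_a, row_b):
            diff = pa - pb   # subtraction
            mag = abs(diff)  # absolute value
            total += mag     # addition to the previously computed value
    return total


def extract(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(reference, frame, size):
    """Exhaustive search: try every block position and keep the lowest SAD.

    Returns (score, top, left) for the best-matching position.
    """
    best = None
    for top in range(len(frame) - size + 1):
        for left in range(len(frame[0]) - size + 1):
            score = sad(reference, extract(frame, top, left, size))
            if best is None or score < best[0]:
                best = (score, top, left)
    return best


if __name__ == "__main__":
    frame = [
        [0, 0, 0, 0],
        [0, 5, 6, 0],
        [0, 7, 8, 0],
        [0, 0, 0, 0],
    ]
    print(best_match([[5, 6], [7, 8]], frame, 2))
```

The three commented lines inside `sad` execute once per pixel per candidate position, which is why this loop dominates the cost of motion estimation.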
For a QCIF (quarter common intermediate format) image
frame, a 15 frames/s image rate, and an exhaustive-search
motion-estimation scheme, SAD operations require
slightly more than 641 million operations/s.
As shown in Figure 5, it’s possible to use TIE
to add SIMD (single instruction, multiple data)
SAD hardware that processes 16 pixels with each
SAD instruction, one instruction per cycle. (Note:
Using the Xtensa processor’s 128-bit maximum
bus width, it’s possible to load 16 pixels’
worth of data in one instruction.)
The combination of executing all three SAD component
operations (subtraction, absolute value, addition)
in one cycle and the SIMD operation that computes
the values for all 16 pixels in one clock cycle
reduces the 641 million operations/s requirement
to 14 million instructions/s, a substantial reduction.
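The reduction follows from simple arithmetic: fusing three operations into one instruction and processing 16 pixels per instruction divides the operation count by 48. The sketch below models the fused instruction and repeats that arithmetic; `sad16` and the constant names are hypothetical, and the 641 million figure is taken from the text.

```python
def sad16(acc, pixels_a, pixels_b):
    """Model of one 16-pixel SIMD SAD instruction: 16 lanes, one accumulated sum."""
    if len(pixels_a) != 16 or len(pixels_b) != 16:
        raise ValueError("sad16 expects exactly 16 pixels per operand")
    return acc + sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))


SCALAR_OPS_PER_SECOND = 641e6  # figure quoted in the text
OPS_PER_PIXEL = 3              # subtract, absolute value, add
PIXELS_PER_INSTRUCTION = 16    # SIMD width


def simd_instructions_per_second():
    """Instruction rate after fusing 3 operations and processing 16 pixels at once."""
    return SCALAR_OPS_PER_SECOND / (OPS_PER_PIXEL * PIXELS_PER_INSTRUCTION)


if __name__ == "__main__":
    print(sad16(0, list(range(16)), [0] * 16))
    print(f"{simd_instructions_per_second() / 1e6:.1f} million instructions/s")
```

This back-of-the-envelope estimate comes to a little over 13 million instructions/s, close to the roughly 14 million quoted; the text does not say what accounts for the remainder.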
This MPEG-4 motion-estimation accelerator is
part of an entire MPEG-4 decoder demonstration
vehicle developed by Tensilica using Xtensa technology.
The MPEG-4 decoder adds approximately 92,000
to 112,000 gates to the base Xtensa processor
and implements a 2-way QCIF video codec
operating at 15 frames/s or QCIF MPEG4 decoding
at 30 frames/s using approximately 30 MIPS for
either operational mode.
Figure 5. MPEG4 SIMD SAD (sum of absolute differences)
instruction execution hardware
Motion estimation is not the only algorithm
within the MPEG4 decoder that can benefit from
acceleration. Other algorithms that can be accelerated
include variable-length decoding, iDCT, bitstream
processing, dequantization, AC/DC prediction,
color conversion, and post filtering. When instructions
are added to accelerate all of these MPEG-4 decoding
tasks, creating an MPEG-4 SIMD (single-instruction,
multiple-data) engine within the tailored processor,
the results can be quite surprising, as shown
in Table 2.
| Video clip    | Cycles without acceleration | Cycles with acceleration | Clock rate with acceleration | Speedup |
| Miss America  | 3.126 G cycles              | 76.81 M cycles           | 7.7 MHz                      | 40.1x   |
| Suzie         | 3.389 G cycles              | 102.19 M cycles          | 10.3 MHz                     | 33.2x   |
| Foreman       | 10.045 G cycles             | 359.5 M cycles           | 13.5 MHz                     | 27.9x   |
| Car Phone     | 9.222 G cycles              | 308.7 M cycles           | 12.2 MHz                     | 29.9x   |
| Monsters Inc. | 29.327 G cycles             | 822.8 M cycles           | 8.6 MHz                      | 35.6x   |
Table 2. MPEG4 Decoder Acceleration Results
from processor augmentation with MPEG-4 SIMD instructions
As Table 2 shows, the resulting SIMD engine
acceleration drops the number of cycles required
to decode the MPEG-4 video clips from billions
to millions and the required processor operating
frequency by roughly 30x to around 10MHz. Without
the additional acceleration instructions, the
processor would need to run at roughly 300MHz
to perform the MPEG4 decoding. There is a substantial
difference in power dissipation and process technology
cost between a 10MHz processor and one that runs
at 300MHz. In addition, it’s unlikely that
any amount of assembly language coding could
produce similarly large drops in the clock rate.
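The clock rates discussed above can be cross-checked against Table 2. The sketch below assumes the table's columns are, in order, cycles without acceleration, cycles with acceleration, required clock rate with acceleration, and speedup; those meanings are inferred from the surrounding text. The names are hypothetical.

```python
# Rows copied from Table 2. Assumed column meanings:
# (clip, cycles without acceleration, cycles with acceleration,
#  clock rate with acceleration in MHz, speedup as printed)
TABLE_2 = [
    ("Miss America", 3.126e9, 76.81e6, 7.7, 40.1),
    ("Suzie", 3.389e9, 102.19e6, 10.3, 33.2),
    ("Foreman", 10.045e9, 359.5e6, 13.5, 27.9),
    ("Car Phone", 9.222e9, 308.7e6, 12.2, 29.9),
    ("Monsters Inc.", 29.327e9, 822.8e6, 8.6, 35.6),
]


def implied_unaccelerated_mhz(accelerated_mhz, speedup):
    """Clock rate the unaccelerated processor would need for the same job."""
    return accelerated_mhz * speedup


def cycle_ratio(cycles_without, cycles_with):
    """Speedup recomputed from the two cycle-count columns."""
    return cycles_without / cycles_with


if __name__ == "__main__":
    for clip, base, accel, mhz, printed in TABLE_2:
        print(
            f"{clip}: {implied_unaccelerated_mhz(mhz, printed):.0f} MHz unaccelerated, "
            f"cycle ratio {cycle_ratio(base, accel):.1f}x (printed {printed}x)"
        )
```

Multiplying each accelerated clock rate by its speedup gives between roughly 300 and 380 MHz, consistent with the "roughly 300MHz" figure. The recomputed cycle ratios agree with the printed speedups to within about 0.1 for four of the clips; for Miss America the ratio of the cycle counts comes out near 40.7 rather than the printed 40.1.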
As shown by the above three examples, it’s
possible to accelerate the performance of embedded
algorithms using configurable and extensible
microprocessor cores to create processors that
are tailored to the specific algorithm instead
of resorting to assembly language coding or resorting
to RTL hardware design. The advantage of using
extensible processors is that designers can add
precisely the resources (special-purpose registers,
execution units, and wide data buses) required
to achieve the desired algorithmic performance
instead of attempting to shoehorn algorithms
into the computational assets of a fixed-ISA
processor.
This design approach does not require that the
members of the design team become processor designers.
It only requires that the design team be able
to profile existing algorithm code and to find
the critical inner loops in that profiled code
(two tasks they already do), and then define
new processor instructions that will accelerate
these critical loops. Only the last task differs
from the existing software development process
currently employed by many embedded system developers.
The result of this new approach is to greatly
accelerate algorithm performance, often far beyond
the abilities of today’s most advanced
fixed-ISA microprocessors and DSP cores. In most
cases, designers can replace entire RTL blocks
with configurable processors tuned for the exact
application, saving valuable design and verification
time and adding an extra level of flexibility
because of the inherent programmability of this
approach.