How to Quickly Simulate Entire SOCs to Explore
and Optimize Architectural Performance
Building a network-router chip is not just a question
of how to forward a series of IPv4 packet headers,
and building a handheld, high-resolution digital
television IC is not just a question of how
to decode an MPEG-4 video stream. These complex
SOC designs must deal with the system-level issues
in addition to individual subsystems (such
as video decode, audio encode, network-packet
forwarding, and DES encryption). An effective
system-design methodology for these complex SOC
designs must allow designers to more easily,
quickly, and cheaply pull various subsystems
together into a whole design and to reliably
verify that the assembly is correct. After all,
the general SOC design problem is
to get each of the subsystems to do the right
thing and to get all of the subsystems to work
together effectively.
At the conceptual level, an entire system can
be treated as a constellation of concurrent,
interacting subsystems. Each subsystem implements
a set of interfaces through which it communicates
with other subsystems and shares common resources
(memory, data structures, and network ports).
Any modern electronic system will contain
at least one software-based task (hence
at least one block of processor hardware plus
data and instruction memories) and at least
one input and one output interface device. In
practice, a system can be viewed as dozens or
even hundreds of interacting tasks or blocks.
Software tasks communicate with other software
tasks through software abstractions such as application
programming interfaces (abstracted into messages
or synchronized access to shared memory). Hardware
blocks communicate with other hardware blocks
over wires (often abstracted into buses). Software
blocks typically communicate with hardware blocks
through memory-mapped control registers (abstracted
into device drivers). A hypothetical system is
shown in figure 1, with each subsystem mapped
to either a software or a hardware block. Moreover,
the blocks are placed on two scales: relative
complexity and relative computational throughput
demand.
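
To make the idea of memory-mapped control registers hidden behind a device driver concrete, here is a minimal sketch in Python. It is a software model only: the register offsets, bit positions, and the "double the operand" behavior are hypothetical and chosen purely for illustration, not taken from any real device.

```python
# Hypothetical register map for an imaginary hardware block.
# Offsets and bit positions are illustrative assumptions only.
CTRL_REG = 0x00     # writing bit 0 starts an operation
STATUS_REG = 0x04   # bit 0 reads as 1 once the operation is done
DATA_REG = 0x08     # operand in, result out

START_BIT = 0x1
DONE_BIT = 0x1


class HardwareBlockModel:
    """Software stand-in for a hardware block's memory-mapped registers."""

    def __init__(self):
        self.regs = {CTRL_REG: 0, STATUS_REG: 0, DATA_REG: 0}

    def write(self, offset, value):
        self.regs[offset] = value & 0xFFFFFFFF
        if offset == CTRL_REG and (value & START_BIT):
            # Pretend the hardware completes at once: double the operand.
            self.regs[DATA_REG] = (self.regs[DATA_REG] * 2) & 0xFFFFFFFF
            self.regs[STATUS_REG] |= DONE_BIT

    def read(self, offset):
        return self.regs[offset]


class Driver:
    """Device driver: hides register offsets behind one function call."""

    def __init__(self, device):
        self.device = device

    def process(self, operand):
        self.device.write(STATUS_REG, 0)        # clear stale status
        self.device.write(DATA_REG, operand)    # load the operand
        self.device.write(CTRL_REG, START_BIT)  # kick off the operation
        while not (self.device.read(STATUS_REG) & DONE_BIT):
            pass                                # poll until done
        return self.device.read(DATA_REG)
```

Application software calls only `Driver.process()`. If the register layout changes, the driver changes but its callers do not, which is the point of the abstraction.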

Complex but undemanding or infrequently executed
tasks naturally gravitate towards software implementation.
Simple but high-throughput functions, especially
heavily used functions at the heart of the system
application, naturally gravitate towards hardware
implementation. The hard design decisions revolve
around two questions:
- What is the right implementation (hardware
vs. software) for each block?
- What is the best implementation
for the interfaces between a block and
those with which it communicates?
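
The gravitation described above can be written down as a rough first-pass rule. The sketch below is only an illustration: the 0.0 to 1.0 scoring of complexity and throughput, the 0.5 threshold, and the function name are all assumptions introduced here, and a real partitioning decision weighs far more than two numbers.

```python
def suggest_implementation(complexity, throughput, threshold=0.5):
    """Rough first guess for one block.

    `complexity` and `throughput` are scores from 0.0 to 1.0; both the
    scoring scheme and the default threshold are illustrative assumptions.
    """
    is_complex = complexity >= threshold
    is_demanding = throughput >= threshold
    if is_complex and not is_demanding:
        return "software"        # complex but undemanding
    if is_demanding and not is_complex:
        return "hardware"        # simple but high-throughput
    if is_complex and is_demanding:
        return "hard decision"   # complex and demanding
    return "either"              # simple and undemanding
```

The interesting output is "hard decision": blocks that are both complex and demanding are the ones for which neither traditional answer is comfortable.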
In figure 1, few issues arise over software blocks A,
B, C, and D and the communications among them.
Standard software methods for developing task-to-task
communication are probably adequate. The performance
demands are modest, so a traditional processor
core may serve admirably. Similarly, hardware
blocks G, J, and K are simple, so hardware
design and verification may not be difficult
and changes are unlikely. Communication among
simple blocks is probably also simple. Other
blocks (H and I and especially E and F) present
bigger challenges to traditional methodology.
Here the combination of complexity and performance
demands may both increase the effort required inside
each block and complicate the interfaces
among the blocks.
The hardware/software interfaces in figure 1 (C:E, D:F,
D:H, and D:G) also present challenges. Matching
the programming model of the interface, as seen
by software, and the wire implementation in hardware
is intrinsically complex and error-prone. Two
representations must be kept synchronized: one,
written in C for example, describes sequential
operations on data structures; the other, written
in Verilog for example, describes parallel
operations on signals.
A small sub-industry focused on hardware-software
co-verification has emerged just to address this
deep-seated incompatibility.
As system requirements evolve over the course
of a project and from product generation to product
generation, both complexity and throughput inevitably
grow. More blocks are added to the system and
most blocks move up and to the right.
The introduction of configurable, extensible
processors changes the SOC design equation. Essentially,
these tailorable, application-specific processors
significantly increase the potential subsystem-design
space that can be covered by processors. Even
very small processors can now deliver very high
performance. By tailoring the processor for the
intended application class, and by leaving out
hardware not needed by the application class,
processor efficiency improves dramatically. The
performance per gate, performance per square
millimeter of silicon, performance per watt,
and performance per clock of these processors
can often rival those of the hard-wired
logic blocks that they replace.
Efficient application-specific processors open
up a world in which all but a handful of the
SOC subsystems and functions can be implemented
in software. In this scenario, several different
functions can often share a single processor,
effectively time-slicing it. In other cases,
different tasks will require dedicated application-specific
processors. The distribution of processors in
an SOC becomes just a function of system partitioning,
and most SOCs will employ many processors to
implement the majority of their subsystems.
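
As a minimal sketch of several functions time-slicing one processor, the Python below runs each function as a generator and gives each one slice in turn (simple round-robin). The task names and slice counts are hypothetical, and a real system would typically use an RTOS scheduler with priorities rather than this bare loop.

```python
from collections import deque


def task(name, slices):
    """A function that needs `slices` time slices; yields after each one."""
    for i in range(slices):
        yield f"{name}:{i}"


def run_time_sliced(tasks):
    """Round-robin the tasks on one simulated processor; return the trace."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run one time slice
            ready.append(current)        # not finished: back of the queue
        except StopIteration:
            pass                         # finished: drop the task
    return trace
```

For example, `run_time_sliced([task("audio", 2), task("video", 3)])` interleaves the two functions until each has received all of its slices.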
The leverage of the complex SOC design methodology
on the partitioning problem is particularly important
to understand. Previously, when a system designer
looked at SOC design partitioning, it was important
to settle on a partition between hardware and
software early in the project. Once the partition
was established, the task of backtracking (of
saying “gosh, I was wrong”) became
complex and difficult. Designers sometimes discovered
that planned hardware subsystems were too complex
and had to be implemented in software to take
advantage of software’s better ability
to manage complexity.
Conversely, tasks slated to be implemented in
software sometimes required more performance
than the general-purpose processor could provide,
so designers had to figure out how to move the
function into hardware. Each change between hardware
and software implementation would necessitate
a change in all the interfaces to that function,
so every hardware or software function that interacted
with the modified function would need redesign
and re-verification. Often many iterations would
be required to meet the system’s performance
and functionality goals.
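
The ripple effect of moving one function across the hardware/software boundary can be sketched as a lookup over the system's interfaces. The list below holds only the interface pairs this paper names for its example system; the full figure presumably contains more, and the function name is an assumption made for illustration.

```python
# Interface pairs named in this paper for the example system.
# This is a partial list; the full figure is not reproduced here.
INTERFACES = [("C", "E"), ("D", "F"), ("D", "H"), ("D", "G"), ("H", "J")]


def blocks_to_rework(interfaces, moved_block):
    """Return, sorted, every block sharing an interface with `moved_block`.

    Each of these blocks needs redesign and re-verification if
    `moved_block` changes between hardware and software.
    """
    neighbors = set()
    for a, b in interfaces:
        if a == moved_block:
            neighbors.add(b)
        elif b == moved_block:
            neighbors.add(a)
    return sorted(neighbors)
```

With this partial list, moving block H touches D and J, while moving block D touches F, G, and H. Each further iteration of the partition repeats the cost.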
Making these difficult partitioning choices is a central
and critical task in the current
method of SOC design. The tools and design methods
available to help designers make these partitioning
choices and changes have been quite limited.
Migrating a task between hardware and software
has been very painful, especially because the
hardware and software task representations are
so different (high-level languages versus hardware-description
languages). Further, it’s more painful
to verify the proper interaction between the
SOC’s hardware and software prior to building
the chip. It’s still more painful to find
out that something’s wrong with the design
after the chip has been built.
Figure 2 revisits the system partitioning example
of figure 1. Blocks E, F, H, and I are implementable
as application-specific processors. This means
that inter-task communications are implementable
in software and can evolve easily and inexpensively,
even after the chip is built. Not all hardware
blocks are eliminated, of course, but the number
of hardware-software interfaces, especially complex
interfaces, is reduced. In addition, configurable
processors can include optimized application-specific
interfaces, allowing key interfaces such as H:J
to be implemented simply and directly as a native
part of the processor’s definition. The
low-throughput tasks A, B, C, and D also map
efficiently onto configurable processors, so
all software can run on a single family of processors
with a common set of tools, models and development
methods.

So, the true leverage of the application-specific
processor really arises from the way it enables
the designer to do more of the total work in
a software-friendly form and to move more easily
between the hardware and software worlds. When
a much wider variety of subsystems all fit within
the capabilities of a processor, the effort to
move a software task running on a generic processor
to an application-specific processor is very
low, because the functional specification remains
primarily the software, generally written in
a high-level language such as C or C++.
As SOC designers seek even more subsystem performance,
they need make only minor changes to the definition
of the affected application-specific processor
(adding facilities to improve execution speed
and efficiency) and minor changes to the program
running on that processor to take advantage of
the processor’s new enhancements. Thus
the effort needed to move a function onto or
off of a particular processor, to split a function,
or to combine functions, is much lower than the
Herculean effort required to move a task from
a software representation to a hardware representation,
an effort that requires fundamentally rethinking
the design and completely rewriting that function
using, for example, Verilog instead of C.
The advanced SOC design methodology also affects
simulation and validation of the individual system
functions and combinations of these functions.
The world of electronic design already offers
many facilities for modeling a piece of embedded
software running on a processor. The program
can either run on a hardware prototype of the
processor or on an instruction-set simulator
(ISS) for that processor. Software simulation
is now so efficient (around one million
simulation cycles/second versus hundreds per
second for gate-level hardware simulation) that
in many cases it’s perfectly adequate to
prototype significant pieces of embedded software
using an ISS. Moreover, new modeling tools enable
rapid description of tightly communicating groups
of processors, memories, and other blocks. These
tools make design of complex multiple-processor
systems fast and simple. Better modeling removes
the need for hardware prototyping: the software
may never run on real hardware until the SOC
prototype is powered on.
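
To show what an instruction-set simulator does at its core, here is a toy sketch in Python. The four-instruction accumulator machine, its opcode names, and the one-cycle-per-instruction cost are all hypothetical. A production ISS models a real instruction set, registers, memory, and timing in far more detail.

```python
def run_iss(program, max_cycles=1_000_000):
    """Interpret `program` on a hypothetical accumulator machine.

    Each instruction is an (opcode, operand) tuple and is assumed to
    take one cycle. Returns (accumulator, cycles).
    """
    acc = 0
    pc = 0
    cycles = 0
    while pc < len(program) and cycles < max_cycles:
        op, arg = program[pc]
        cycles += 1
        if op == "LOADI":        # load an immediate into the accumulator
            acc = arg
        elif op == "ADDI":       # add an immediate to the accumulator
            acc += arg
        elif op == "JNZ":        # jump to instruction `arg` if acc != 0
            if acc != 0:
                pc = arg
                continue
        elif op == "HALT":       # stop the simulation
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return acc, cycles


# A small countdown loop: load 3, decrement until zero, then halt.
COUNTDOWN = [("LOADI", 3), ("ADDI", -1), ("JNZ", 1), ("HALT", 0)]
```

Calling `run_iss(COUNTDOWN)` executes the loop and reports the final accumulator value together with the cycle count, which is the kind of figure a designer uses to compare software alternatives before any hardware exists.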
Note: This Tensilica White Paper is based on
the book Engineering the Complex SOC by Chris
Rowen, published in June 2004 by Prentice Hall.