White Papers

How to Quickly Simulate Entire SOCs to Explore and Optimize Architectural Performance

Building a network-router chip is not just a question of how to forward a series of IPv4 packet headers, and building a handheld, high-resolution digital television IC is not just a question of how to decode an MPEG-4 video stream. These complex SOC designs must deal with system-level issues in addition to individual subsystems (such as video decode, audio encode, network-packet forwarding, and DES encryption). An effective system-design methodology for these complex SOC designs must allow designers to pull various subsystems together into a whole design more easily, quickly, and cheaply, and to reliably verify that the assembly is correct. After all, the general SOC design problem is to get each of the subsystems to do the right thing and to get all of the subsystems to work together effectively.

At the conceptual level, an entire system can be treated as a constellation of concurrent, interacting subsystems. Each subsystem implements a set of interfaces through which it communicates with other subsystems and shares common resources (memory, data structures, and network ports). At a minimum, any modern electronic system contains one software-based task (hence at least one block of processor hardware plus data and instruction memories), one input interface device, and one output interface device. In practice, a system can be viewed as dozens or even hundreds of interacting tasks or blocks.

Software tasks communicate with other software tasks through software abstractions such as application programming interfaces (abstracted into messages or synchronized access to shared memory). Hardware blocks communicate with other hardware blocks over wires (often abstracted into buses). Software blocks typically communicate with hardware blocks through memory-mapped control registers (abstracted into device drivers). A hypothetical system is shown in figure 1, with each subsystem mapped to either a software block or a hardware block. Moreover, the blocks are mapped onto both a scale of relative complexity and a scale of relative computational throughput demands.
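The software-to-hardware path through memory-mapped control registers can be sketched in C. This is a minimal illustration, not code from any real device: the register layout `blk_regs_t`, the bit definitions, and the driver functions `blk_start` and `blk_is_done` are all hypothetical names invented for the example. On real silicon the register pointer would be a fixed bus address; here it is passed in so the driver can also be exercised against an ordinary struct on a host machine.

```c
#include <stdint.h>

/* Hypothetical register layout of an illustrative hardware block.
 * 'volatile' keeps the compiler from caching or reordering accesses
 * that the hardware may observe or change. */
typedef struct {
    volatile uint32_t ctrl;    /* bit 0: start the operation      */
    volatile uint32_t status;  /* bit 0: operation has completed  */
    volatile uint32_t data;    /* operand written by software     */
} blk_regs_t;

#define BLK_CTRL_START  (1u << 0)
#define BLK_STATUS_DONE (1u << 0)

/* Device-driver layer: software tasks call these functions and never
 * touch the register bits directly. */
static void blk_start(blk_regs_t *regs, uint32_t operand)
{
    regs->data = operand;           /* hand the operand to the block */
    regs->ctrl |= BLK_CTRL_START;   /* then request the operation    */
}

static int blk_is_done(const blk_regs_t *regs)
{
    return (regs->status & BLK_STATUS_DONE) != 0;
}
```

A software task would call `blk_start(regs, value)` and then poll `blk_is_done(regs)`. The driver is the only code that needs to change if the register layout changes.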


Complex but undemanding or infrequently-executed tasks naturally gravitate towards software implementation. Simple but high-throughput functions, especially heavily used functions at the heart of the system application, naturally gravitate towards hardware implementation. The hard design decisions revolve around two questions:

  • What is the right implementation (hardware vs. software) for each block?
  • What is the best implementation for the interfaces between a block and those with which it communicates?

In figure 1, few issues arise over software blocks A, B, C, and D and the communications among them. Standard software methods for developing task-to-task communication are probably adequate. The performance demands are modest, so a traditional processor core may serve admirably. Similarly, hardware blocks G, J, and K are simple, so hardware design and verification may not be difficult and changes are unlikely. Communication among simple blocks is probably also simple. Other blocks (H and I and especially E and F) present bigger challenges to traditional methodology. Here the combination of complexity and performance may both increase the effort required inside each block and complicate the interfaces among them.

The hardware/software interfaces in figure 1 (C:E, D:F, D:H, and D:G) also present challenges. Matching the programming model of the interface, as seen by software, with the wire implementation in hardware is intrinsically complex and error-prone. Two representations must be kept synchronized: one written in C, for example, describing sequential operations on data structures, and one written in Verilog, for example, describing parallel operations on signals. A small sub-industry focused on hardware-software co-verification has emerged just to address this deep-seated incompatibility.

As system requirements evolve over the course of a project and from product generation to product generation, both complexity and throughput inevitably grow. More blocks are added to the system and most blocks move up and to the right.

The introduction of configurable, extensible processors changes the SOC design equation. Essentially, these tailorable, application-specific processors significantly increase the potential subsystem-design space that can be covered by processors. Even very small processors can now deliver very high performance. By tailoring the processor for the intended application class, and by leaving out hardware not needed by the application class, processor efficiency improves dramatically. The performance per gate, performance per square millimeter of silicon, performance per watt, and performance per clock of these processors can often rival the performance of hard-wired logic blocks that they replace.

Efficient application-specific processors open up a world in which all but a handful of the SOC subsystems and functions can be implemented in software. In this scenario, several different functions can often share a single processor, effectively time-slicing it. In other cases, different tasks will require dedicated application-specific processors. The distribution of processors in an SOC becomes just a function of system partitioning and most SOCs will employ many processors to implement the majority of the SOC’s subsystems.
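The idea of several functions time-slicing a single processor can be sketched as a cooperative round-robin loop. This is only an illustrative sketch under simple assumptions: each function is written so that one call performs one bounded slice of work, and the names `task_t`, `run_round_robin`, and `count_slice` are hypothetical. A real system would more likely use an RTOS scheduler or interrupt-driven dispatch.

```c
#include <stddef.h>

/* One slice of work for a task; 'state' carries the task's own data. */
typedef void (*task_fn)(void *state);

typedef struct {
    task_fn run;    /* performs one bounded slice of the function */
    void   *state;  /* private state for that function            */
} task_t;

/* Give every task one slice per pass, for 'rounds' passes.
 * All tasks share the single processor executing this loop. */
static void run_round_robin(task_t *tasks, size_t n, unsigned rounds)
{
    for (unsigned r = 0; r < rounds; ++r) {
        for (size_t i = 0; i < n; ++i) {
            tasks[i].run(tasks[i].state);
        }
    }
}

/* Example task body: simply counts how many slices it received. */
static void count_slice(void *state)
{
    ++*(unsigned *)state;
}
```

When one function needs more slices than such a loop can provide, that is the point at which the text's alternative applies: the function moves to its own dedicated application-specific processor.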

The leverage of the complex SOC design methodology on the partitioning problem is particularly important to understand. Previously, when a system designer looked at SOC design partitioning, it was important to settle on a partition between hardware and software early in the project. Once the partition was established, the task of backtracking (of saying “gosh, I was wrong”) became complex and difficult. Designers sometimes discovered that planned hardware subsystems were too complex and had to be implemented in software to take advantage of software’s better ability to manage complexity.

Conversely, tasks slated to be implemented in software sometimes required more performance than the general-purpose processor could provide, so designers had to figure out how to move the function into hardware. Each change between hardware and software implementation would necessitate a change in all the interfaces to that function, so every hardware or software function that interacted with the modified function would need redesign and re-verification. Often many iterations would be required to meet the system’s performance and functionality goals.

These difficult partitioning choices are a central and critical task associated with the current method of SOC design. The tools and design methods available to help designers make these partitioning choices and changes have been quite limited. Migrating a task between hardware and software has been very painful, especially because the hardware and software task representations are so different (high-level languages versus hardware-description languages). Further, it’s more painful to verify the proper interaction between the SOC’s hardware and software prior to building the chip. It’s still more painful to find out that something’s wrong with the design after the chip has been built.

Figure 2 revisits the system partitioning example of figure 1. Blocks E, F, H, and I are implementable as application-specific processors. This means that inter-task communications are implementable in software and can evolve easily and inexpensively, even after the chip is built. Not all hardware blocks are eliminated, of course, but the number of hardware-software interfaces, especially complex interfaces, is reduced. In addition, configurable processors can include optimized application-specific interfaces, allowing key interfaces such as H:J to be simply and directly implemented as a native part of the processor’s definition. The low-throughput tasks A, B, C, and D also map efficiently onto configurable processors, so all software can run on a single family of processors with a common set of tools, models, and development methods.


So, the true leverage of the application-specific processor really arises from the way it enables the designer to do more of the total work in a software-friendly form and to move more easily between the hardware and software worlds. When a much wider variety of subsystems all fit within the capabilities of a processor, the effort to move a software task running on a generic processor to an application-specific processor is very low, because the functional specification remains primarily the software, generally written in a high-level language such as C or C++.

As SOC designers seek even more subsystem performance, they need make only minor changes to the definition of the affected application-specific processor (adding facilities to improve execution speed and efficiency) and minor changes to the program running on that processor to take advantage of the processor’s new enhancements. Thus the effort needed to move a function onto or off of a particular processor, to split a function, or to combine functions is much lower than the Herculean effort required to move a task from a software representation to a hardware representation. That move requires fundamentally rethinking the design and completely rewriting the function using, for example, Verilog instead of C.

The advanced SOC design methodology also affects simulation and validation of the individual system functions and combinations of these functions. The world of electronic design already offers many facilities for modeling a piece of embedded software running on a processor. The program can run either on a hardware prototype of the processor or on an instruction-set simulator (ISS) for that processor. Software simulation is now so efficient (around one million simulated cycles per second versus hundreds per second for gate-level hardware simulation) that in many cases it is perfectly adequate to prototype significant pieces of embedded software using an ISS. Moreover, new modeling tools enable rapid description of tightly communicating groups of processors, memories, and other blocks. These tools make design of complex multiple-processor systems fast and simple. Better modeling removes the need for hardware prototyping: the software may never run on real hardware until the SOC prototype is powered on.
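To make the ISS idea concrete, the core of an instruction-set simulator is a fetch-decode-execute loop that updates a software model of the processor state. The sketch below is a toy, not a model of any real processor: the four-register machine, the opcodes (`OP_LOADI`, `OP_ADD`, `OP_ADDI`, `OP_BNEZ`, `OP_HALT`), and the one-cycle-per-instruction timing are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy instruction set. */
enum { OP_HALT, OP_LOADI, OP_ADD, OP_ADDI, OP_BNEZ };

typedef struct {
    uint8_t op;   /* opcode                                   */
    uint8_t a;    /* destination (or tested) register index   */
    uint8_t b;    /* source register index                    */
    int32_t imm;  /* immediate value or branch target index   */
} insn_t;

typedef struct {
    int32_t  reg[4];  /* architectural registers              */
    size_t   pc;      /* index of the next instruction        */
    uint64_t cycles;  /* assumed: one cycle per instruction   */
} cpu_t;

/* Simulate until HALT, end of program, or the cycle budget runs out. */
static void iss_run(cpu_t *cpu, const insn_t *prog, size_t len,
                    uint64_t max_cycles)
{
    while (cpu->pc < len && cpu->cycles < max_cycles) {
        insn_t in = prog[cpu->pc++];          /* fetch */
        cpu->cycles++;
        switch (in.op) {                      /* decode and execute */
        case OP_LOADI: cpu->reg[in.a & 3] = in.imm;                break;
        case OP_ADD:   cpu->reg[in.a & 3] += cpu->reg[in.b & 3];   break;
        case OP_ADDI:  cpu->reg[in.a & 3] += in.imm;               break;
        case OP_BNEZ:  if (cpu->reg[in.a & 3] != 0)
                           cpu->pc = (size_t)in.imm;               break;
        default:       return;                /* OP_HALT or unknown */
        }
    }
}
```

Because each simulated instruction costs only a handful of host instructions, a loop of this kind runs orders of magnitude faster than evaluating every gate of the same processor, which is the gap the cycle-rate figures above describe.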

Note: This Tensilica White Paper is based on the book Engineering the Complex SOC by Chris Rowen, published in June 2004 by Prentice Hall.
