
Designing Image Processing Pipelines for Printer SOCs using Xtensa Configurable Processors

High-quality color printers have become the norm in households. In fact, printers are often sold bundled with personal computers. There is a major shift in the factors driving printer adoption in the home. Digital cameras have become the largest driving factor; consumers want instant gratification – click and print. They want this so much that printers now feature docking stations and USB sockets into which a digital camera can be plugged, so that pictures can be printed directly from the camera (or camcorder), no computer required.

This consumer demand has led to a fundamental shift in the design of printers. All of the image processing required to prepare a document or an image for printing that used to be done on the PC has now shifted to the printer. As a result, the image processing capability of printers has advanced considerably. Today, printers allow consumers to print images directly from CDs and DVDs as well as digital cameras and camcorders. They feature LCD displays that enable the user to view pictures before printing them.

While printer manufacturers have added image processing intelligence, they’ve been pushed by their competitors to lower costs by integrating as many features as possible into as few chips as possible. In addition, it’s no longer cost effective to develop a different SOC for each printer model – one SOC needs to work for multiple printer designs. Printer SOCs typically have one system control processor that interacts with the image processing domain and the motor engine domain. The motor engine domain may have several small micro-controllers to control the speed and direction of the printer head. The image processing domain has recently moved from hardwired RTL to multiple DSP processors.

Using Programmable Processors for Image Processing

The flexibility required to support such a diverse range of high-performance processing has led printer chip designers to use programmable platforms as the basis of the image processing pipeline in printer SOCs. Processors provide flexibility over hardwired RTL blocks to support multiple variants of algorithms and to upgrade and fine tune algorithms in software. For example, half-toning algorithms in printers are continuously being improved, even after the printer chip has been deployed. Programmability also allows the processors to support new image compression standards as well as fix bugs in software post-silicon. Additionally, processors allow one design to be used for multiple printers – different features can be turned on or off depending on the model. By being able to use one design for multiple models, printer manufacturers can lower costs significantly. Finally, multiple variations of the chip can be quickly designed by increasing or decreasing the number of processors.

Using Configurable Processors to go Beyond Fixed DSP Performance

Configurable processors, such as Tensilica's Xtensa processors, have been used extensively by manufacturers of consumer inkjet and laser printers. Three out of four of the major printer manufacturers have printer SOCs that have Xtensa processors in them.

Xtensa processors enable architects to design DSP engines that are customized to particular tasks. Consider the conceptual diagram of a printer SOC shown in Figure 1. In this example, multiple Xtensa processors are used and customized to different tasks – the JPEG decompression engine, the image enhancement engine, the color processing engine, etc.

Figure 1: Conceptual diagram of a printer SOC

Xtensa processors give printer companies a programmable platform that is used across dozens of printer models.

Each processor can be tailored to its particular task because designers can add instructions to the Xtensa processor. Since each of these engines is customized for a particular type of task, the result is a very efficient implementation, unlike fixed DSPs, which carry instructions and features that the task being computed may not require.

Advantages of using Xtensa Processors for Imaging Pipelines

There are five main advantages to using Xtensa processors in printing/imaging applications (besides programmability itself):

  1. Ability to create task-specific engines by adding imaging-specific instructions to the Xtensa processor: These instructions can be general enough to be used by multiple variants of similar algorithms. Also, since no non-task-specific instructions are added to the processor, the DSP can be tailored exactly to the task. This DSP can easily achieve performance levels not possible with general-purpose DSPs, and is thus extremely competitive in area and power with a hardwired RTL block.
  2. Xtensa processors achieve RTL-like performance with competitive area and power: The base Xtensa processor is implemented in approximately 20,000 gates. The extensions made by the designers are inserted right into the processor pipeline and usually require approximately the same number of gates as an RTL implementation. Also, by extending the Xtensa processor with imaging-specific instructions, performance can be greatly improved without increasing the clock frequency or requiring a deeply pipelined, very high-MHz (and thus area-hungry) processor. Therefore, the area is competitive with a hardwired RTL block. On the power side, once the designer optimizes the Xtensa processor, the Xtensa Processor Generator automatically inserts extensive fine-grain clock gating throughout the processor pipeline, with no extra effort from the designer. This provides greater power savings than a typical RTL block.
  3. Excellent compiler and software tool chain support for task-specific instruction extensions: Once the designer creates new task-specific instructions, Tensilica’s Xtensa C/C++ compiler (XCC) and the rest of the software tool chain are immediately and automatically updated to support the new instructions. The new instructions are invoked through C intrinsics (function calls) in the application code. The XCC compiler automatically schedules the designer-defined instructions and does register allocation on the designer-defined register files. The instruction set simulator (ISS) simulates the new instructions, complete with timing information about multi-cycle operations. The debugger displays the new instructions and the values in the user-defined registers and register files.
  4. Faster time to market than RTL: By using Tensilica's design methodology, a new function block can be designed in a significantly shorter period of time than an RTL block. This is because the designer-defined instructions are specified in a high-level language called the TIE (Tensilica Instruction Extension) language. TIE is similar to Verilog (with C-like data types), but the designer only has to specify the functionality of an operation and not the structural implementation. This makes it much easier to verify, since the designer only has to verify the input-output relationship of an operation versus what needs to be verified in RTL, where the designer also has to verify the implementation. Tensilica guarantees that the processor RTL implementation that it will generate from a designer’s TIE description is pre-verified. This can significantly reduce verification schedules compared to hardwired RTL block development.
  5. Ability to create efficient, flexible image-processing pipelines: Tensilica's technology provides several advantages specific to imaging applications. Tensilica's Xtensa LX2 processor, in particular, offers the ability to create complex instructions, making the processor into a multi-slot VLIW processor. Designers can also add variable-width SIMD operations. Designers can dramatically restructure the data flow on chip by adding custom I/O ports and FIFO interfaces on the processor.

Example: Half-Toning Using Error Diffusion Algorithms

A great way to demonstrate the use of Xtensa for image processing is with the example of mapping the Floyd-Steinberg error diffusion dithering algorithm to an Xtensa processor.

An essential step during image processing is that of image quantization. However, image quantization introduces intensity errors, since a grey pixel may be quantized to black or white. These intensity errors are mitigated by spreading the errors over neighboring pixels. This is known as error diffusion, and there is a class of dithering algorithms for doing this. The Floyd-Steinberg algorithm is one of the oldest and most popular among these. Variants of this algorithm are in widespread use in printers today.

Figure 2: Basic Floyd-Steinberg Error Diffusion Algorithm

The Floyd-Steinberg algorithm works by spreading the quantization error from a pixel to its neighboring pixels as shown by the algorithm in Figure 2. As shown in this pseudo-code, first the error is computed and then fractions of the error are added to neighboring pixels. Note that the error for these neighboring pixels has not yet been computed.
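Since the pseudo-code of Figure 2 is not reproduced in this text, the following is a minimal Python sketch of the textbook Floyd-Steinberg algorithm for a grayscale image. The function name, the threshold of 128, and the list-of-rows image layout are assumptions made for illustration; this is not the code from the figure.

```python
def floyd_steinberg(image, threshold=128):
    """Dither a grayscale image (list of rows, values 0..255) to 0 or 255."""
    height = len(image)
    width = len(image[0]) if height else 0
    work = [list(row) for row in image]      # working copy accumulates errors
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= threshold else 0   # quantize to black/white
            out[y][x] = new
            err = old - new                        # quantization error
            # Spread the error to pixels that have not been quantized yet.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16          # right
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16  # below left
                work[y + 1][x] += err * 5 / 16          # below
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16  # below right
    return out
```

For example, `floyd_steinberg([[100, 100]])` quantizes the first pixel to 0 and pushes 7/16 of its error of 100 onto the second pixel, which then crosses the threshold and becomes 255.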

The computation of the error and error diffusion can proceed in parallel as shown in Figure 3. In this figure, “X” denotes the pixel being processed and the numbers around the pixel are the register numbers that the intermediate results can be stored to, if this computation were to be mapped to an Xtensa processor.

Figure 3: Parallel computation of quantization errors and the error diffusion across all the pixels in an 8x8 image

This parallel pixel processing looks like a computation wavefront. For an 8x8 image block, a maximum of four pixels can be processed in parallel, as shown in Figure 4. Note that, at the maximum, 15 registers are required to store the intermediate results.

Figure 4: Steady state of error diffusion computation. Up to 4 pixels can be processed in parallel
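The wavefront can be reproduced with a short scheduling sketch. It assumes the standard Floyd-Steinberg dependencies: a pixel needs the errors from its left neighbor and from the three neighbors in the row above, so each row can start two columns behind the row above it. The function name and the step formula `col + 2 * row` are illustrative assumptions, not taken from the figures.

```python
def wavefront_schedule(width, height):
    """Group pixels (row, col) by the earliest step at which each can run."""
    steps = {}
    for row in range(height):
        for col in range(width):
            # The above-right neighbor (row - 1, col + 1) finishes at step
            # (col + 1) + 2 * (row - 1) = col + 2 * row - 1, one step earlier.
            steps.setdefault(col + 2 * row, []).append((row, col))
    return [steps[t] for t in sorted(steps)]


schedule = wavefront_schedule(8, 8)
widest = max(len(group) for group in schedule)  # pixels processed in parallel
```

Under these assumptions an 8x8 block peaks at four pixels per step and a 16x16 block at eight, which agrees with the parallelism stated in the text.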

Mapping Error Diffusion Algorithms to Xtensa Processors

The Floyd-Steinberg error diffusion algorithm can be mapped to an Xtensa processor by first creating a 16-entry register file. This is a simple single-line statement in the TIE language that looks like “regfile myReg 8 16 mr”. This statement creates a 16-entry, 8-bit register file called “myReg” with the short name “mr”. The TIE compiler also creates a C datatype called “myReg” associated with this register file, so that variables of type “myReg” can be declared in the C/C++ code. The XCC compiler can then automatically do register allocation of the variables mapped to this register file.

One way to create the functional units to calculate the error and the error diffusion is to create two user-defined TIE instructions. The first instruction computes the error for a pixel and adds the errors that are from the error diffusion of the neighboring pixels (stored in the myReg register file). Note that errors are added to the pixel only once after that pixel has been loaded:
TIE Instruction 1:

error = ComputeError() + mr0 + mr15

where mr0 and mr15 are the registers that store the errors from the neighboring pixels. These are added to the error for the current pixel by this TIE instruction.

The second instruction computes the error diffusions that will be distributed to the neighboring pixels and stored as intermediate values in the myReg register file. For each pixel, it computes the errors for four neighboring pixels and also adds the errors from other neighboring pixels that have already been accumulated for those four pixels. This is TIE Instruction 2:

mr0 = e·α + mr14; mr1 = e·β; mr2 = e·γ + mr1; mr3 = e·δ + mr2

where “α, β, γ, and δ” are the error fractions (for Floyd-Steinberg, these are 7/16, 3/16, 5/16, and 1/16). mr14, mr1, and mr2 are the errors diffused from pixels that have already been processed. Note that the computation of the fractions can be done by multiplication and shifting instead of division. For example, dividing by 16 is equivalent to shifting right by 4 bits.
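The multiply-and-shift arithmetic and the register updates of TIE Instruction 2 can be modeled in a few lines of Python. This is a behavioral sketch only: the function names are hypothetical, the registers are modeled as an unbounded Python list instead of 8-bit hardware registers, and it assumes that all source registers are read before any destination is written.

```python
# Floyd-Steinberg weights as numerators over 16: alpha, beta, gamma, delta.
WEIGHTS = (7, 3, 5, 1)


def error_shares(e):
    """Return the four diffused shares of error e using multiply and shift."""
    # (e * w) >> 4 equals floor(e * w / 16). Python's >> is an arithmetic
    # shift, so negative errors round toward minus infinity.
    return tuple((e * w) >> 4 for w in WEIGHTS)


def tie_instruction_2(e, mr):
    """Model of TIE Instruction 2 on a 16-entry list; returns a new list."""
    alpha, beta, gamma, delta = error_shares(e)
    out = list(mr)
    # Assumption: every right-hand side reads the old register state.
    out[0] = alpha + mr[14]
    out[1] = beta
    out[2] = gamma + mr[1]
    out[3] = delta + mr[2]
    return out
```

Because the shift truncates, the four shares do not always sum to the original error: `error_shares(10)` gives `(4, 1, 3, 0)`, which sums to 8. A hardware design would have to decide how to handle that residue.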

The individual operations for these two instructions are shown in Figure 5.

Figure 5: Individual operations that comprise the two TIE instructions

Loading and Storing Pixel Data

For an 8x8 image block, 64-bit loads and stores are used. Xtensa LX2 processors support up to two load/store units, and each can be 128 bits wide. First, Load64 and Store64 instructions are created that load and store 64 bits (8 pixels) at a time. A Load64 instruction is then executed, the first pixel is computed, and the errors are saved in the “myReg” register file. Next, the error for the following pixel is computed while the next row of pixels is loaded in parallel. This process continues until the first row has been computed. Then a second Store64 is executed.

This process of computing, storing errors, and loading the next set of pixels continues as shown in Figure 6. Note that in the steady state, the errors for up to 4 pixels are computed at the same time. This means that four copies of the basic TIE instructions described above (i.e., there are 4 functional units for the error computation and diffusion) are required.

Figure 6: Loading and storing 64 bits (8 pixels) at a time in parallel with computing the errors. After a while, the compute, load, and store operations fall into a regular pattern.
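To make the "8 pixels per 64-bit access" idea concrete, here is a small Python sketch that moves one row of an 8x8 block as a single 64-bit word. The helper names mirror the Load64 and Store64 instructions named in the text, but their signatures and the little-endian pixel order are assumptions made for this illustration.

```python
def load64(memory, offset):
    """Read 8 bytes from memory and return them as one 64-bit integer."""
    return int.from_bytes(memory[offset:offset + 8], "little")


def store64(memory, offset, word):
    """Write a 64-bit integer back to memory as 8 bytes."""
    memory[offset:offset + 8] = word.to_bytes(8, "little")


def unpack_pixels(word):
    """Split a 64-bit word into eight 8-bit pixels, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def pack_pixels(pixels):
    """Combine eight 8-bit pixels into one 64-bit word, lowest byte first."""
    word = 0
    for i, pixel in enumerate(pixels):
        word |= (pixel & 0xFF) << (8 * i)
    return word


frame = bytearray(range(16))               # two rows of 8 pixels each
row1 = unpack_pixels(load64(frame, 8))     # second row: pixels 8..15
store64(frame, 0, pack_pixels([255] * 8))  # overwrite the first row
```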

Extensions for a Bigger Image Block Size and Other Algorithms

This outline of a basic error diffusion architecture can easily be extended to 16x16, 32x32, or larger image block sizes. Two load/store units can be used with 128-bit loads and stores, or FIFO interfaces can be instantiated on the Xtensa processor that stream the data directly to and from the processor pipeline. Larger images also require more error computations to occur in parallel, as shown in Figure 7 for a 16x16 image.

Figure 7: Processing up to 8 pixels at a time in a 16x16 image

There are also several algorithms that are variants of Floyd-Steinberg, such as the algorithm proposed by Jarvis, Judice, and Ninke, which diffuses the error over a wider range of pixels for better dithering, as shown in Figure 8. By adding more error computations in TIE, the requirement to add hardwired RTL blocks is eliminated.

Figure 8: An example of a more complex error diffusion algorithm
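The relationship between the two algorithms can be sketched by making the diffusion kernel a parameter. The kernels below are the commonly published Floyd-Steinberg and Jarvis-Judice-Ninke weights; whether Figure 8 shows exactly these values cannot be confirmed from the text, and the function and constant names are illustrative assumptions.

```python
# Each kernel is (divisor, [(dx, dy, weight), ...]) relative to the pixel.
FLOYD_STEINBERG = (16, [(1, 0, 7), (-1, 1, 3), (0, 1, 5), (1, 1, 1)])
JARVIS_JUDICE_NINKE = (48, [
    (1, 0, 7), (2, 0, 5),
    (-2, 1, 3), (-1, 1, 5), (0, 1, 7), (1, 1, 5), (2, 1, 3),
    (-2, 2, 1), (-1, 2, 3), (0, 2, 5), (1, 2, 3), (2, 2, 1),
])


def diffuse(image, kernel, threshold=128):
    """Dither a grayscale image (list of rows, 0..255) with a given kernel."""
    divisor, taps = kernel
    height = len(image)
    width = len(image[0]) if height else 0
    work = [list(row) for row in image]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new = 255 if work[y][x] >= threshold else 0
            err = work[y][x] - new
            out[y][x] = new
            for dx, dy, weight in taps:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and ny < height:  # skip taps off the image
                    work[ny][nx] += err * weight / divisor
    return out
```

The wider kernel spreads the same error over twelve neighbors instead of four, so each neighbor receives a smaller share: `diffuse([[100, 100]], JARVIS_JUDICE_NINKE)` leaves both pixels at 0, whereas the Floyd-Steinberg kernel turns the second pixel on.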

Case Study: Epson’s Use of Xtensa Processors for Their Printer SOC

Epson has embraced the Xtensa processor for designing their REALOID printer SOCs. Over the years, the complexity of the chips that Epson has been designing has increased considerably. When Epson was considering their latest architecture, they wanted this architecture to form the basis of printers for several years. This meant that the architecture had to be flexible enough to be upgraded to new imaging algorithms – particularly half-toning algorithms that are being improved frequently. The Epson team also wanted the SOC to have some headroom to deploy more complex algorithms in the future.

At first, Epson considered designing a multi-million gate chip completely using hardwired RTL blocks. However, this solution would not give them the flexibility and programmability they needed and would require a very large verification effort. They then found that the Xtensa processor offered them a design alternative to using hardwired RTL. Using configurable processors gave the flexibility they needed and the extensibility of the Xtensa architecture meant that they could still create a solution that matched the high performance of a hardwired architecture.

Epson designed a scalable architecture using multiple Xtensa processors. As algorithm complexity and product requirements scale up, they can simply add more Xtensa processors to address these needs.

Figure 9: Epson's Scalable Xtensa-based Image Processing Pipeline for their REALOID Printer SOCs

The Xtensa processors in the REALOID chip communicate with memories and each other using a conventional system bus plus FIFOs implemented using Tensilica’s unique TIE Queue interfaces. The Xtensa processors also send control communications to each other using TIE Ports, which are equivalent to having GPIO on the processor core. These Queue and Port interfaces are accessed directly from the data path by using instructions like “Pop_Queue” and “Write_Port”. The bandwidth and throughput of the data that can flow through the Xtensa processors therefore is not limited by the load/store unit and the system bus.
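The effect of connecting pipeline stages through a FIFO can be illustrated with a toy Python model. It is only a conceptual sketch: the `Fifo` class, the two example stages (invert, then threshold), and the one-item-per-cycle timing are assumptions for illustration and do not describe the REALOID design or the TIE Queue implementation.

```python
from collections import deque


class Fifo:
    """Bounded first-in, first-out queue between two pipeline stages."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def can_push(self):
        return len(self.items) < self.depth

    def can_pop(self):
        return bool(self.items)

    def push(self, item):
        self.items.append(item)

    def pop(self):
        return self.items.popleft()


def run_pipeline(pixels, depth=2):
    """Stream pixels through two stages joined by a FIFO; count cycles."""
    queue = Fifo(depth)
    source = deque(pixels)
    out = []
    cycles = 0
    while source or queue.can_pop():
        # Consumer stage: pop one value per cycle and threshold it.
        if queue.can_pop():
            out.append(255 if queue.pop() >= 128 else 0)
        # Producer stage: invert one pixel per cycle if the FIFO has room.
        if source and queue.can_push():
            queue.push(255 - source.popleft())
        cycles += 1
    return out, cycles
```

In this model both stages work in the same cycle once the FIFO is primed, so N pixels take N + 1 cycles, and no shared bus or memory access sits between the stages.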

Summary

A few of the features that make the Xtensa configurable processor ideal for implementing image processing pipelines have been explained. The error diffusion dithering example demonstrates the applicability of Xtensa processors to image processing. It leverages the ability to create complex multi-cycle, multi-operation, and parallel instructions in TIE. Wide loads and stores help maintain the bandwidth required for efficient operation.

The unique features of Xtensa processors have propelled them to become the platform of choice for implementing image processing pipelines in printer SOCs. Three out of four major printer manufacturers use chips that feature the Xtensa processor. The ever increasing demands placed on the image processing pipelines make programmable processors the easiest and most efficient way to create scalable solutions.
