Category Archives: compilers

YARDstick: custom processor development toolset

YARDstick is a design automation tool for custom processor development flows that focuces on the hard part: generating and evaluating application-specific hardware extensions. YARDstick is a powerful building block for ASIP (Application-Specific Instruction-set Processor) development, since it integrates application analysis, ultra-fast algorithms for custom instruction generation and selection with user-defined compiler intermediate representations. It integrates retargetable compiler features for the targeted IRs/architectures.


Important features of YARDstick are as follows:

  • retargetable to used-defined IRs by machine description
  • can be targeted to low-level compiler IRs, assembly-level representations of
    virtual machines, or assembly code for existing processors
  • fully parameterized custom instruction generation and selection engine
  • code selector for multiple-input multiple-output patterns
  • virtual register assignment for virtual machine targets
  • an extensive set of backends including:
    • assembly code emitter
    • C backend
    • visualization backend for Graphviz (
    • visualization backend for VCG ( (or aiSee:
    • an XML format backend amenable to graph rewriting.

YARDstick comes along with a cross-platform GUI written in Tcl/Tk 8.5 (


The ultimate goal of YARDstick is to liberate the designer’s development infrastructure from compiler and simulator idiosyncrasies. With YARDstick, the ASIP designer is empowered
with the freedom of specifying the target architecture of choice and adding new implementations of analyses and custom instruction generation/selection methods.

Typically, 2x to 15x speedups for benchmark applications (ANSI C optimized source code) can be fully automatically obtained by using YARDstick depending on the target architecture. Speedups are evaluated against a scalar RISC architecture.

Detailed look

  1. Analysis engines generating both static and dynamic statistics:
    1. Data types
    2. Operation-level statistics
    3. Basic block statistics (ranking)
    4. Performance estimations with/without custom instructions.
  2. Generation of CDFGs (Control-Data Flow Graphs).
  3. Backend engines:
    1. ANSI C
    2. dot (Graphviz)
    3. VCG (GDL, aiSee)
    4. XML (GGX for the AGG graph rewriting tool)
    5. Retargetable assembly emitter for entire translation units
    6. CDFG formats for various RTL synthesis tools
  4. Custom instruction engines:
    1. Full-parameterized MIMO custom instruction generation algorithm
    2. Fast heuristic
    3. Configurable number of inputs
    4. Configurable number of outputs
  5. Custom instruction selection:
    1. Based on priority metrics
  6. Graph (and graph-subgraph) isomorphism features for eliminating redundant patterns. Multiple algorithms supported.
  7. Visualization of custom instructions, basic blocks, control-flow graphs and control-data flow graphs (basic block nodes expanded to their constituent instructions).
  8. Basic retargetable compiler features:
    1. Code selector for MIMO instructions (tested with large cases).
    2. Virtual register assignment (allocation for a VM).
    3. Hard register allocator in the works.
  9. Miscellaneous features:
    1. single constant multiplication optimizer
    2. elimination of false data-dependences in assembly-level CDFGs.
    3. beautification options for visualization
    4. interfacing (co-operation) with external tools such as peephole optimizers, profilers, code generators etc.
    5. features related to the custom processor architecture named ByoRISC


Here is a list of application benchmarks that have been tested with YARDstick:

  • ADPCM encoder and decoder (typically: 4x speedup)
  • Video processing kernels: full-search block-matching motion estimation,
    logarithmic search motion estimation, motion compensation
  • Image processing kernels: steganography (hide/uncover), edge detection, matrix
  • Cryptographic kernels: crc32, rc5, raiden (7x speedup, 12x for unrolled

At the YARDstick homepage ( ) you can find
some additional materials:

Supporting testimonial to the ArchC infrastructure (2004)

[I was an early PhD student at the time and was quite impressed with ArchC had to offer; I still am]

What ArchC is all about

ArchC is an architecture description language that exploits the already mature SystemC legacy. It targets the automated generation of a) functional and cycle-accurate instruction set simulators and at a later stage of b) tools from the entire application development toolchain for a custom processor (compiler, assembler, linker, debugger, etc).

Simulator generation is at a very good and usable state, however needs to be extended towards the system-level (bus generation ?). Toolset generation which naturally comes next, requires more hard effort and therefore the appropriate support of the open-source community. In my personal opinion, compiler generation is so hard that it can only be possible through funding. Other issues as automated bus generation are also very HARD and need support.

Strengths of ArchC

Important points regarding ArchC are:

  • Nice integration with SystemC. The ArchC user codes instruction decoding, hardware resources, etc in small ArchC files and the corresponding SystemC are automatically generated, which removes heavy burden from the user.
  • Unique approach to the design of a retargetable system call library. This makes possible to benchmark real-sized applications with realistic demands, i.e. many input arguments. Many of the other simulators can’t cope with that.
  • Ease of introducing resource elements, e.g. multi-banked memories.
  • Reuse possibilities for existing SystemC codes.
  • Models of 3 processors, 2 of them with complete retargeted GNU toolchains! Plus binaries of very popular benchmarks (MediaBench, [MediaBench-II is here], MiBench).

How has ArchC helped me in my research

  • Ease of designing variations of a base processor. Most of my papers require tradeoff analysis on alternative architectures.
  • My “drproc” model is a 5-pipeline stage RISC processor with specialized instructions for data-reuse transformations [work of Catthoor et al, IMEC]. I originally had coded an assembler (~few days) and the entire processor in plain SystemC (~month). When I found out about ArchC, it required only 3-4 8-hour days to deliver to my colleagues a much more robust cycle-accurate simulator!
  • I have ran reasonably sized applications (e.g. one sourceforge application, of 25K C lines with intensive use of pointers) with no problem on the R3000 and MIPS models of ArchC.

What needs to be done

NOTE: In the most part, the roadmap has been completed!

This work will be incomplete unless the “ArchC roadmap” is fulfilled. While I believe the compiler retargeting is a very hard issue, [addressing UNICAMP team] I am certain that your team is experienced in DSP compilation techniques (I have read some of the papers). Certainly, assembler generation, and bus wrapper generation (e.g. for OCP) plus some more models are awaited by many users, including myself.

Thank you for all the (free) fish!

Multiplierless hack for constant division by certain factors

I would like to present you with a formula for quotient calculation that works for any constant divisor of the form 2^n-1 (^ denotes exponentiation). For n=2,3,4,5,6,7, these divisors are 3,7,15,31,63,127. I have tested all divisors with n up to 16 and the approach works. My formula works only for dividends x that are in the range [0,(2^(2*n))-2].

The formula is as follows and it is both divisionless and multiplierless (as initially derived):

y = (((x>>n)+x+(1<<n)+1)>>n)-1;

which can also be written as:

y = (x + (x>>n) + 1) >> n;

This is a formula for software integer division, i.e. truncating division that calculates the quotient of the division. Please note that 1<<n is the strength-reduced form for calculating 2^n. It does not require any double-word arithmetic (like with multiplying by the multiplicative inverse and then performing adjustment steps).

E.g. for n = 6, the divisor is 2^n – 1 = 63. The formula will work for x <= 4094 = (2^(2*n)) – 2 = (1<<n)*(1<<n)-2. It will produce one-off errors for certain x >= (2^(2*n))-2. But it will be correct for the specified “small” x values.

A testcase for the algorithm is here:

#include <stdio.h>

int main(void)
  int qapprox, qexact;
  int i, j, k;

  for (i = 2; i < 8; i++) {
    for (j = 0; j < (1<<i)*(1<<i)-1; j++) {
      k = (1<<i) - 1;       
      qapprox = (((j>>i)+j+(1<<i)+1)>>i)-1;
      qexact  = j / k;
      if (qapprox != qexact) {
        fprintf(stderr, "qapprox = (%d/%d) = %d\tqexact = (%d/%d) = %d\n",
          j, k, qapprox, j, k, qexact);
  return 0;

I had devised the hack via “algorithmic sketching”. I mean that I had started from a basic formula with some “holes” in it. The holes where small constants or operators. There exist such techniques in academia, put more nicely; sometimes called algorithmic sketching, combinatorial
sketching, and it is relevant to forms of program synthesis. Essentially it is about devising algorithmic sketches (incomplete algorithms with wildcards) and then using a mechanism to fill the gaps.

Seems I can claim the specific hack as back as May 16, 2011.

Special thanks to everyone who has participated in the following threads. The Taylor series connection is very interesting, as mentioned by Tim Wescott and Tauno Voipio.