Monthly Archives: October 2014

YARDstick: custom processor development toolset

YARDstick is a design automation tool for custom processor development flows that focuses on the hard part: generating and evaluating application-specific hardware extensions. YARDstick is a powerful building block for ASIP (Application-Specific Instruction-set Processor) development, since it combines application analysis and ultra-fast algorithms for custom instruction generation and selection with user-defined compiler intermediate representations. It also integrates retargetable compiler features for the targeted IRs/architectures.

Features

Important features of YARDstick are as follows:

  • retargetable to user-defined IRs via a machine description
  • can be targeted to low-level compiler IRs, assembly-level representations of
    virtual machines, or assembly code for existing processors
  • fully parameterized custom instruction generation and selection engine
  • code selector for multiple-input multiple-output patterns
  • virtual register assignment for virtual machine targets
  • an extensive set of backends including:
    • assembly code emitter
    • C backend
    • visualization backend for Graphviz (http://www.graphviz.org)
    • visualization backend for VCG (http://rw4.cs.uni-sb.de/~sander/html/gsvcg1.html) (or aiSee: http://www.absint.com/aisee/)
    • an XML format backend amenable to graph rewriting.

YARDstick comes with a cross-platform GUI written in Tcl/Tk 8.5 (http://wiki.tcl.tk).

Aim

The ultimate goal of YARDstick is to liberate the designer’s development infrastructure from compiler and simulator idiosyncrasies. With YARDstick, the ASIP designer is free to specify the target architecture of choice and to add new implementations of analyses and of custom instruction generation/selection methods.

Typically, YARDstick obtains 2x to 15x speedups for benchmark applications (optimized ANSI C source code) fully automatically, depending on the target architecture. Speedups are evaluated against a scalar RISC architecture.

Detailed look

  1. Analysis engines generating both static and dynamic statistics:
    1. Data types
    2. Operation-level statistics
    3. Basic block statistics (ranking)
    4. Performance estimations with/without custom instructions.
  2. Generation of CDFGs (Control-Data Flow Graphs).
  3. Backend engines:
    1. ANSI C
    2. dot (Graphviz)
    3. VCG (GDL, aiSee)
    4. XML (GGX for the AGG graph rewriting tool)
    5. Retargetable assembly emitter for entire translation units
    6. CDFG formats for various RTL synthesis tools
  4. Custom instruction engines:
    1. Fully parameterized MIMO custom instruction generation algorithm
    2. Fast heuristic
    3. Configurable number of inputs
    4. Configurable number of outputs
  5. Custom instruction selection:
    1. Based on priority metrics
  6. Graph (and graph-subgraph) isomorphism features for eliminating redundant patterns. Multiple algorithms supported.
  7. Visualization of custom instructions, basic blocks, control-flow graphs and control-data flow graphs (basic block nodes expanded to their constituent instructions).
  8. Basic retargetable compiler features:
    1. Code selector for MIMO instructions (tested with large cases).
    2. Virtual register assignment (allocation for a VM).
    3. Hard register allocator in the works.
  9. Miscellaneous features:
    1. single constant multiplication optimizer (see the sketch after this list)
    2. elimination of false data-dependences in assembly-level CDFGs.
    3. beautification options for visualization
    4. interfacing (co-operation) with external tools such as peephole optimizers, profilers, code generators etc.
    5. features related to the custom processor architecture named ByoRISC
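
As mentioned in the list above, here is a minimal sketch of the kind of rewrite the single constant multiplication optimizer performs (my own illustration in plain C, not YARDstick code): multiplication by the constant 10 is strength-reduced to two shifts and one addition, since 10 = 8 + 2.

#include <assert.h>

/* Sketch: strength reduction of x*10 into shifts and adds. */
static unsigned int mul10_shift_add(unsigned int x)
{
  return (x << 3) + (x << 1);   /* x*8 + x*2 == x*10 */
}

int main(void)
{
  unsigned int x;

  /* exhaustively check a range of operands against the plain multiplication */
  for (x = 0; x < 100000u; x++)
    assert(mul10_shift_add(x) == x * 10u);
  return 0;
}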

Benchmarks

Here is a list of application benchmarks that have been tested with YARDstick:

  • ADPCM encoder and decoder (typically: 4x speedup)
  • Video processing kernels: full-search block-matching motion estimation,
    logarithmic search motion estimation, motion compensation
  • Image processing kernels: steganography (hide/uncover), edge detection, matrix
    multiplication
  • Cryptographic kernels: crc32, rc5, raiden (7x speedup, 12x for unrolled
    version)

At the YARDstick homepage (http://www.nkavvadias.com/yardstick/ ) you can find
some additional materials:

Supporting testimonial to the ArchC infrastructure (2004)

[I was an early PhD student at the time and was quite impressed with what ArchC had to offer; I still am]

What ArchC is all about

ArchC is an architecture description language that exploits the already mature SystemC legacy. It targets the automated generation of a) functional and cycle-accurate instruction-set simulators and, at a later stage, b) the entire application development toolchain for a custom processor (compiler, assembler, linker, debugger, etc.).

Simulator generation is in a very good and usable state; however, it needs to be extended towards the system level (bus generation?). Toolset generation, which naturally comes next, requires much harder effort and therefore the appropriate support of the open-source community. In my personal opinion, compiler generation is so hard that it can only be made possible through funding. Other issues, such as automated bus generation, are also very HARD and need support.

Strengths of ArchC

Important points regarding ArchC are:

  • Nice integration with SystemC. The ArchC user codes instruction decoding, hardware resources, etc. in small ArchC files and the corresponding SystemC is automatically generated, which removes a heavy burden from the user.
  • Unique approach to the design of a retargetable system call library. This makes it possible to benchmark real-sized applications with realistic demands, i.e. many input arguments. Many of the other simulators can’t cope with that.
  • Ease of introducing resource elements, e.g. multi-banked memories.
  • Reuse possibilities for existing SystemC codes.
  • Models of 3 processors, 2 of them with complete retargeted GNU toolchains! Plus binaries of very popular benchmarks (MediaBench, [MediaBench-II is here], MiBench).

How has ArchC helped me in my research

  • Ease of designing variations of a base processor. Most of my papers require tradeoff analysis on alternative architectures.
  • My “drproc” model is a 5-pipeline-stage RISC processor with specialized instructions for data-reuse transformations [work of Catthoor et al., IMEC]. I had originally coded an assembler (~a few days) and the entire processor in plain SystemC (~a month). When I found out about ArchC, it took only 3-4 eight-hour days to deliver to my colleagues a much more robust cycle-accurate simulator!
  • I have run reasonably sized applications (e.g. a SourceForge application of 25K C lines with intensive use of pointers) with no problem on the R3000 and MIPS models of ArchC.

What needs to be done

NOTE: In the most part, the roadmap has been completed!

This work will be incomplete unless the “ArchC roadmap” is fulfilled. While I believe that compiler retargeting is a very hard issue, [addressing the UNICAMP team] I am certain that your team is experienced in DSP compilation techniques (I have read some of the papers). Certainly, assembler generation and bus wrapper generation (e.g. for OCP), plus some more models, are awaited by many users, including myself.

Thank you www.archc.org for all the (free) fish!

Top 12 HercuLeS HLS user feedback patterns

As you might already know, HercuLeS by Ajax Compilers, Inc. (http://www.ajaxcompilers.com ; tech page here: http://www.nkavvadias.com/hercules/) is a high-level synthesis (HLS) environment for the automatic translation of algorithms to hardware.

Since November 2013, the FREE version of HercuLeS has been available to download, install and use on Windows 7 (64-bit) or Linux (32-bit and 64-bit): http://www.nkavvadias.com/temp/index.php

This free/demo version has generated substantial user feedback. Dear users, wherever you are located (US, Canada, Japan, P.R. China, Sweden, Germany, UK, India, Brazil; the list is by no means exhaustive), thank you-thank you-thank you!!!

I have been compiling a short list of (let’s say) the “Top 12” of user feedback patterns. Focusing on the more generic points, I am disclosing the list as food-for-thought and to generate even more feedback!

Top-12 user feedback patterns and concerns (not in any particular order)

  1. Development time minimization (algorithm, early verification, RTL generation, simulation, implementation, late verification).
  2. High quality of results (QoR): reducing runtime requirements, chip area and power consumption. (The latest out-of-the-box head-to-head against Vivado HLS is a tie: http://nkavvadias.com/blog/2014/10/14/vivado-hls-vs-hercules-2/)
  3. Readability of the generated HDL code. (HercuLeS code is much more readable than code generated by the competition.)
  4. Source to IR to RTL to netlist cross-referencing (by means of “intelligent” cross-tagging). (It looks like an intelligent IDE is a whole new project.)
  5. Theoretically provable correct-by-design approach. (I have been looking into automatic proving of code transformations.)
  6. Transparent interface to the logic synthesis tool and up to downloading the bitstream. (The prototype version works and has been checked for my development boards.)
  7. Plug-in approach to interconnecting legacy reusable designs with HLS-generated designs (“I have this piece of code in idiosyncratic C flavor or assembly or FORTRAN…”). Looks like a call for implementing point solutions for select (aka paying) users.
  8. Optimize HDL descriptions for specific implementation processes (e.g. FPGA devices); users don’t actually seek portability!!!
  9. Pthread or OpenMP frontend support. Explicit parallelism if you please! (My bet is on parallelism extraction, but I get this important point.)
  10. VHDL frontend support (!) for behavioral VHDL to netlist HDL end-to-end flow. (There exist hard-to-the-core developers — and greatness is with them — that do their algorithmic exploration in behavioral VHDL. I usually do my own in C or VHDL and only lately I have been increasingly using MATLAB and Processing.)
  11. Can you synthesize malloc and free? (Yes I can, and I am improving it, since until now I had been *intercepting* malloc and free and mapping them to a hardware-managed stack; a toy sketch of the interception idea follows after this list.)
  12. Can you show me the automatic IP integration feature? (Yes I can, check this blog post as well: http://nkavvadias.com/blog/2014/10/13/hercules-overview/)

NOTE: quoted text is not the factual reply of an individual but rather an artistic rendition thereof.
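
Regarding item 11 above, here is a toy sketch of the interception idea in plain C (my own illustration of the concept, not HercuLeS code): malloc/free requests are served from a fixed, statically sized pool that behaves like a stack, which is the software analogue of mapping dynamic allocation onto a hardware-managed stack.

#include <stddef.h>
#include <stdio.h>

#define POOL_BYTES 4096

static unsigned char pool[POOL_BYTES];
static size_t top  = 0;   /* current top of the allocation stack        */
static size_t last = 0;   /* start offset of the most recent allocation */

/* Toy replacement for malloc(): bump-pointer allocation from the pool. */
static void *my_malloc(size_t n)
{
  n = (n + 7u) & ~(size_t)7u;      /* round up to 8-byte granularity */
  if (top + n > POOL_BYTES)
    return NULL;                   /* pool exhausted                 */
  last = top;
  top += n;
  return &pool[last];
}

/* Toy replacement for free(): only the most recently allocated block
   can be reclaimed, i.e. the pool behaves like a stack. */
static void my_free(void *p)
{
  if (p == (void *)&pool[last])
    top = last;
}

int main(void)
{
  int *a = my_malloc(16 * sizeof *a);

  if (a != NULL) {
    a[0] = 42;
    printf("allocated 16 ints at pool offset %lu, a[0] = %d\n",
           (unsigned long)((unsigned char *)a - pool), a[0]);
    my_free(a);
  }
  return 0;
}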

10x productivity boost with HLS: Myth, legend or fact?

There exists a saying about pretty much every new disruptive technology and so it goes for high-level synthesis. Maybe you have heard it before reading these lines; as an ingredient to a pitch, on colleagues’ coffee break, at a press conference or maybe you have found yourself thinking more or less of this: “When I become proficient with this XYZ HLS tool, I will be 10x more productive! I will be able to fast-forward this task, put a smile on my supervisor’s face; earn a good deal of quality time too!”

Well, this reads like a dream come true, but is it reasonable to believe it? Is it already possible, or will it become possible? Being an order of magnitude more productive: is it myth, legend or fact?

HLS attempts to eliminate the tedious iterations at the RTL (primarily). It does so by having the developer describe the intent or behavior of the system at an increased level of abstraction. There is no need to reason about each FSM state or the exact timing, or to take every stepwise, error-prone decision. Ideally, by giving the algorithm and a set of requirements to the HLS tool, you effectively get a correct design with the desired QoR. Or a no-go.

To add some data into the soup, let’s examine the following figure.

VHDL-to-C lines-of-code ratio for ten benchmark algorithms.

For these ten algorithms, ANSI C code is passed to HercuLeS HLS (http://www.nkavvadias.com/hercules/). At the end of the flow, among other things, we get the synthesizable RTL VHDL for the given C code. Measuring the VHDL-to-C ratio of useful lines of code (no comments, etc.) yields a variation from 7.8x to 25.15x, with an average of about 15 times (15.4x).
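
For reference, a crude way to obtain such counts is sketched below (this is an assumption about the counting method, not the actual script used): it counts the useful lines of a source file by skipping blank lines and lines whose first non-blank characters open a comment. Multi-line C block comments are not handled.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
  char line[4096];
  long count = 0;
  FILE *f;

  if (argc < 2)
    return 1;
  f = fopen(argv[1], "r");
  if (f == NULL)
    return 1;

  while (fgets(line, sizeof line, f) != NULL) {
    char *p = line;
    while (isspace((unsigned char)*p))
      p++;
    if (*p == '\0')
      continue;                                     /* blank line   */
    if (strncmp(p, "--", 2) == 0)
      continue;                                     /* VHDL comment */
    if (strncmp(p, "//", 2) == 0 || strncmp(p, "/*", 2) == 0)
      continue;                                     /* C comment    */
    count++;
  }
  fclose(f);
  printf("%s: %ld useful lines\n", argv[1], count);
  return 0;
}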

This is by no means conclusive but it is indicative. HercuLeS takes your code through a series of transformations: from C to N-Address Code (an intermediate language), to a graph-based representation (in Graphviz), and finally to RTL VHDL. Every step moves from a higher abstraction level to a lower one, and every step results in a more detailed representation as the design moves from software to hardware description.

This approach obviously sounds like the correct one. But bear in mind the potential fallacy: 10x less code to write does not necessarily yield 10x less time to develop it. What must also be accounted for:

  1. Iterations in developing the C code itself.
  2. Multiplicity in the number of generated RTL designs, as points in the architectural design space. These iterations are sloooow, so you have to keep them a) few, b) short. For this reason, it is much better that they should not happen at the RTL, but rather at the level of an IL (intermediate language) estimator/profiler like aprof: http://www.nkavvadias.com/doc/aprof/aprof-README.html .
  3. RTL design is only a fraction of the concept-to-implementation effort. As cliché as it may read, this is Amdahl’s law alright.
  4. Coding for parallelism, whether implicit or explicit, is challenging in the sense of near-optimal scaling for regular or irregular applications and adequacy of the underlying hardware paradigm.

And the verdict: a tenfold increase in productivity is a FACT, and you should be able to experience it, if you do it the RIGHT way!

Vivado HLS vs HercuLeS (Kintex-7 and VDS 2013.2 update)

As a follow-up to a previous blog post on the out-of-the-box Vivado HLS vs HercuLeS comparison, the following table provides updated information on the performance of HercuLeS against Vivado HLS 2013.2 on Virtex-6 and Kintex-7 (XC7K70TFBG676-2 FPGA device).

Better results (lower execution time; smaller area) have been typeset in bold. It can be clearly seen that HercuLeS outperforms Vivado HLS in key benchmarks such as filtering and numerical processing. As expected, on many occasions better speed/performance can be traded off for lower area. With 12 partial wins each, one could call this a tie :)

 #  Benchmark  Description                   Vivado HLS (VHLS)       HercuLeS                Device
                                             LUTs  Regs  TET (ns)    LUTs  Regs  TET (ns)
 1  bitrev     Bit reversal                    67    39      72.0      42    40      11.6    Virtex-6
 2  divider    Radix-2 division               218   226      63.6     318   332      30.6    Kintex-7
 3  edgedet    Edge detection                 246   130    1636.3     680   361    1606.4    Virtex-6; 1 BRAM for VHLS
 4  fibo       Fibonacci series               138   131      60.2     137   197     102.7    Virtex-6
 5  fir        FIR filter                      89   114    1027.1     606   540     393.8    Kintex-7
 6  gcd        Greatest common divisor        210    98      35.2     128    93      75.9    Virtex-6
 7  icbrt      Cubic root approximation       239   207     260.6     365   201     400.5    Virtex-6
 8  sieve      Prime sieve of Eratosthenes    525   595    6108.4     565   523    3869.5    Virtex-6; 1 BRAM for VHLS

NOTES:

  • TET is Total Execution Time in ns.
  • VHLS is a shortened form for Vivado HLS.
  • Vivado HLS 2013.2 was used.
  • Bold denotes smaller area and lower execution time.
  • Italic denotes an inconclusive comparison.
  • For the cases of edgedet and sieve, VHLS identifies a BRAM; HercuLeS does not. In these cases, HercuLeS saves a BRAM while VHLS saves on LUTs and FFs (Registers).

HercuLeS: Overview and features

HercuLeS is an extensible high-level synthesis environment for automatically mapping algorithms to hardware. Essentially, HercuLeS translates programs in a typed assembly language named N-Address Code (NAC) to a collection of Graphviz CDFGs (Control-Data Flow Graphs), which are then synthesized to vendor-independent, self-contained RTL VHDL. HercuLeS is also used for push-button synthesis of ANSI C code to VHDL.

Overview

The basic steps in the HercuLeS flow are shown in the figure below. C code is passed to GCC for GIMPLE dump generation, optionally following an external source-level optimizer. Textual GIMPLE is then processed by gimple2nac; alternatively, the user may directly supply a NAC translation unit or use their own frontend. Alternative frontends, currently work in progress, are being developed for LLVM (C, C++, Objective-C) and domain-specific languages (DSLs).

HercuLeS overview.
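
As an illustration of the front of the flow, consider a small kernel like the one below (my own example, not one of the HercuLeS benchmarks). A textual GIMPLE dump of such a translation unit can be produced with GCC’s -fdump-tree-gimple option, and that dump is the representation gimple2nac processes.

/* dotprod.c -- example input kernel (illustration only).
   Compile with: gcc -fdump-tree-gimple -c dotprod.c
   to obtain the textual GIMPLE dump consumed by gimple2nac. */
int dotprod(const int *a, const int *b, int n)
{
  int i, sum = 0;

  for (i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}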

Various optimizations can be applied at the NAC level: peephole transformations, if-conversion, and function call insertion to enable IP integration. Heuristic basic block partitioning avoids the introduction of excessive critical paths due to operation chaining. The core of HercuLeS comprises a frontend (nac2cdfg) and a graph-based backend (cdfg2hdl). nac2cdfg is a translator from NAC to flat CDFGs represented in Graphviz. cdfg2hdl is the actual synthesis kernel, automatically generating FSMD hardware in VHDL and self-checking testbenches from Graphviz CDFGs.

nac2cdfg is used for parsing, analysis and CDFG extraction from NAC programs. cdfg2hdl maps CDFGs to extended FSMDs (Finite-State Machines with Datapath). An ANSI C backend allows for rapid algorithm prototyping and NAC verification. The generated VHDL code can be simulated with GHDL and Modelsim and synthesized with Xilinx XST and the Vivado Design Suite using automatically generated scripts.

Features

HercuLeS supports a wealth of features. The following graphic provides a visualization of the most important existing or planned features.

HercuLeS features.

Third-party IP integration

Third-party hardware IP blocks can be seamlessly integrated into HercuLeS. HercuLeS allows for automatic IP integration, given that the user supplies the corresponding builtin functionalities. A HercuLeS user can import and use their own IP through the following process:

  1. Implement the IP with the expected interface and place it in the proper subdirectory.
  2. Add the corresponding entry in a textual database.
  3. Use TXL transformations, via a script, to replace an operator use with a black-box function call.
  4. A list of black-box functions is generated.
  5. HercuLeS automatically creates a hierarchical FSMD with the requested callee(s).

The following graphic illustrates the combined TXL/C approach. The first two steps apply preprocessing to split local variable declarations and to remove those that are redundant or unused. The remaining declarations are then localized, and subsequently procedure calls to black-box functions are introduced. These routines are the actual builtin functions. If the corresponding builtins are listed in the IP database, a VHDL implementation with an interface compatible to the HercuLeS caller FSMDs is assumed. cdfg2hdl then automatically handles interface generation and component instantiation in the HDL description of the caller FSMD. In addition, the simulation and synthesis scripts already account for the IP HDL files.


Automatic IP integration in HercuLeS.
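
To make step 3 of the process above concrete, here is a hedged before/after illustration in C (the names saxpy and bb_mul32 are hypothetical, not the actual builtin naming used by HercuLeS): an operator use is replaced by a call to a black-box function, which the backend can then bind to an interface-compatible IP core.

/* Before the transformation: the multiplication is implemented by the
   generic HLS-generated datapath. */
unsigned int saxpy(unsigned int a, unsigned int x, unsigned int y)
{
  return a * x + y;
}

/* After the transformation: the multiplication is routed through a
   black-box builtin (hypothetical name), for which an interface-compatible
   VHDL implementation is expected in the IP database.  This translation
   unit only compiles; the builtin is resolved by the hardware flow. */
extern unsigned int bb_mul32(unsigned int a, unsigned int b);

unsigned int saxpy_ip(unsigned int a, unsigned int x, unsigned int y)
{
  return bb_mul32(a, x) + y;
}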

This approach is also valid for floating-point computation, while both pipelined and multi-cycle third-party components are supported.

Summary

HercuLeS delivers a contemporary HLS environment that can be comfortably used for algorithm acceleration by predominantly software-oriented engineers. In this article, we covered only a high-level overview of how HercuLeS works, walked through existing and planned features, and briefly discussed automatic hardware IP integration.


3 predictions on the role of HLS for the 2020 Digital Platform

The latest ITRS tables (for 2011-2012 since the 2013 tables are not yet complete: http://www.itrs.net/reports.html) highlight three points that are important to anyone developing or using HLS/ESL solutions.

  • The responsibility for power reduction efforts will be shifted to higher abstractions.

The following figure illustrates this expected trend: the contribution of behavioral- and architectural-level efforts to overall system power reduction will rise from 50% to 80%.

Power reduction effort for the 2020 Digital Platform.

  • High-level synthesis estimates regarding speed, area, power and cost will need to be extremely accurate in the near future.

The accuracy of the estimates produced during the architectural design space exploration phase is projected to rise to approximately 100% within less than 10 years! The better-vs-faster estimation tradeoff will become even more evident.


Expected high-level estimates accuracy for the 2020 Digital Platform.

  • The 2020 Digital Platform will be fully supported by development toolchains.

This means that both programmable and configurable aspects of the platform (such as hardware accelerators) will be accessible via convenient layers of programmability. The current shift towards parallelism-aware languages including C/OpenMP, OpenCL, CUDA, AMP++, and OpenACC is clearly visible and a vibrant reality among programmers. The following figure indicates that within less than 10 years ALL computational platforms, from the HPC realm to autonomous, omnipresent, embedded systems, will require full support from accessible tools.


Expected levels of tool support for the 2020 Digital Platform.

What is your opinion?

Multiplierless hack for constant division by certain factors

I would like to present you with a formula for quotient calculation that works for any constant divisor of the form 2^n-1 (^ denotes exponentiation). For n=2,3,4,5,6,7, these divisors are 3,7,15,31,63,127. I have tested all divisors with n up to 16 and the approach works. My formula works only for dividends x that are in the range [0,(2^(2*n))-2].

The formula is as follows and it is both divisionless and multiplierless (as initially derived):

y = (((x>>n)+x+(1<<n)+1)>>n)-1;

which can also be written as:

y = (x + (x>>n) + 1) >> n;

This is a formula for software integer division, i.e. truncating division that calculates the quotient. Please note that 1<<n is the strength-reduced form of 2^n. The formula does not require any double-word arithmetic (unlike multiplying by the multiplicative inverse and then performing adjustment steps). Intuitively, it works because 1/(2^n-1) = 1/2^n + 1/2^(2*n) + 1/2^(3*n) + ...; keeping the first two terms gives x/(2^n-1) ≈ (x + (x>>n)) >> n, and the +1 compensates for the truncation error over the stated dividend range.

E.g. for n = 6, the divisor is 2^n - 1 = 63. The formula will work for x <= 4094 = (2^(2*n)) - 2 = (1<<n)*(1<<n)-2. It will produce off-by-one errors for certain x > (2^(2*n))-2, but it is correct for the specified “small” x values.

A testcase for the algorithm is here:

#include <stdio.h>

int main(void)
{
  int qapprox, qexact;
  int i, j, k;

  /* i is the exponent n; the divisor is k = 2^n - 1 */
  for (i = 2; i < 8; i++) {
    /* exhaustively check the valid dividend range [0, 2^(2n) - 2] */
    for (j = 0; j < (1<<i)*(1<<i)-1; j++) {
      k = (1<<i) - 1;
      qapprox = (((j>>i)+j+(1<<i)+1)>>i)-1;  /* divisionless formula */
      qexact  = j / k;                       /* reference quotient   */
      if (qapprox != qexact) {
        fprintf(stderr, "qapprox = (%d/%d) = %d\tqexact = (%d/%d) = %d\n",
          j, k, qapprox, j, k, qexact);
      }
    }
  }
  return 0;
}

I devised the hack via “algorithmic sketching”. By that I mean that I started from a basic formula with some “holes” in it; the holes were small constants or operators. Such techniques exist in academia, described more formally: they are sometimes called algorithmic sketching or combinatorial sketching, and they are related to forms of program synthesis. Essentially, it is about devising algorithmic sketches (incomplete algorithms with wildcards) and then using a mechanism to fill the gaps.
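
As a small illustration of the sketching idea (a reconstruction of the approach, not my original search code), the following program treats the simplified formula as a template with two constant holes, enumerates small values for them, and prints the combinations that reproduce exact division by 2^n-1 over the valid dividend range.

#include <stdio.h>

int main(void)
{
  int n, c1, c2, x, ok;

  for (n = 2; n < 8; n++) {
    int k    = (1 << n) - 1;               /* divisor 2^n - 1          */
    int xmax = (1 << n) * (1 << n) - 2;    /* upper end of valid range */

    /* Template with two holes: y = (((x >> n) + x + c1) >> n) + c2 */
    for (c1 = 0; c1 <= 4; c1++) {
      for (c2 = -2; c2 <= 2; c2++) {
        ok = 1;
        for (x = 0; x <= xmax && ok; x++)
          if ((((x >> n) + x + c1) >> n) + c2 != x / k)
            ok = 0;
        if (ok)
          printf("n=%d: y = (((x>>n) + x + %d) >> n) + %d\n", n, c1, c2);
      }
    }
  }
  return 0;
}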

It seems I can claim the specific hack as far back as May 16, 2011.

Special thanks to everyone who has participated in the following threads. The Taylor series connection is very interesting, as mentioned by Tim Wescott and Tauno Voipio.

elemapprox — The Rosetta stone of elementary functions approximation and plotting

Going back to my HDL development/design projects, I’ve been having fun working with elemapprox, a multi-language collection of modules and packages to assist in simulating floating-point hardware. It is a kind of Rosetta stone for elementary function approximation, adding basic plotting facilities as ASCII art (low resolution) and PBM (monochrome) bitmaps (higher resolution). Available ports include ANSI C, Verilog, VHDL and “VHDLIEEE” (using the existing approximations in the IEEE.math_real package). The data type used for the computations is Verilog’s and VHDL’s real.

This code has been tested with Icarus Verilog, GHDL and Modelsim (VHDL only). The Verilog driver module (testfunc.v) makes advanced use of Verilog system functions and tasks. By using this strong feature of Verilog, I was able to closely imitate the operation of the driver code (testfunc.c) from the ANSI C version. Development of the test driver for the VHDL version was not that straightforward and had to bypass some VHDL quirks; for instance the handling of variable-length strings.

The complete code is here: http://github.com/nkkav/elemapprox and is licensed under the Modified BSD license.

My motivation was to extend the original work on evaluating (single-precision) and plotting transcendental functions as discussed in Prof. Mark G. Arnold’s HDLCON 2001 paper.

At this point my version adds support for all trigonometric (cyclic), inverse trigonometric, hyperbolic and inverse hyperbolic functions, as well as a few others: exp, log (ln), log2, log10, pow. I will be adding more functions in the future, for instance hypot and cbrt (cube root), as well as other special functions of interest.

Using any of the versions of elemapprox (whether the ANSI C, Verilog or VHDL ones), you can easily plot the arctangent as ASCII art:


...........................................................*********************
..................................................**********....................
...............................................****.............................
.............................................***................................
............................................**..................................
...........................................*....................................
..........................................*.....................................
.........................................**.....................................
........................................**......................................
........................................*.......................................
.......................................**.......................................
......................................**........................................
......................................*.........................................
.....................................*..........................................
...................................**...........................................
.................................***............................................
..............................****..............................................
.....................**********.................................................
**********************..........................................................
................................................................................

or as a PBM bitmap file (for much higher resolution):

PBM plot of the arctangent (testkatan).

Or you can plot your typical sine:

PBM plot of the sine (testksin).

You can always configure elemapprox to suit your needs: write your own function approximations, plotting routines, etc.
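
For instance, a standalone plotting routine in the spirit of the ASCII output above can be as small as the following sketch (my own minimal illustration, not elemapprox code; it uses the C library atan() instead of the package’s own approximations):

#include <math.h>
#include <stdio.h>

#define PI   3.14159265358979323846
#define COLS 80
#define ROWS 20

int main(void)
{
  char grid[ROWS][COLS + 1];
  int r, c;

  /* start with an empty canvas of '.' characters */
  for (r = 0; r < ROWS; r++) {
    for (c = 0; c < COLS; c++)
      grid[r][c] = '.';
    grid[r][COLS] = '\0';
  }

  /* sample atan(x) for x in [-4, 4] and mark one '*' per column */
  for (c = 0; c < COLS; c++) {
    double x = -4.0 + 8.0 * c / (COLS - 1);
    double y = atan(x);                              /* y in (-pi/2, pi/2) */
    int row  = (int)((PI / 2.0 - y) / PI * (ROWS - 1));
    grid[row][c] = '*';
  }

  for (r = 0; r < ROWS; r++)
    puts(grid[r]);
  return 0;
}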

In the end: use the source, Luke! Besides that, there is also some documentation (thankfully) to get you started; you can browse the README directly at the project’s GitHub repo or just drop a note here.

I really hope that the community will find this work useful!

METATOR – A look into processor synthesis

These last few months, I have been slowly moving back to my main interests: EDA tools (as a developer and as a user), FPGA application engineering and, last but not least, processor design. After a 5-year hiatus, I have started revamping (and modernizing) my own environment, developed as an outcome of my PhD work on application-specific instruction-set processors (ASIPs). The flow was based on SUIF/Machine-SUIF (compiler), SALTO (assembly-level transformations) and ArchC (an architecture description language for producing binary tools and simulators). It was a highly successful flow that allowed me (along with my custom instruction generator YARDstick) to explore configurations and extensions of processors within seconds or minutes.

I have been thinking about what’s next. We have tools that assist the designer (the processor design engineer per se) in speeding up his/her development. Still, the processor must be designed explicitly. What would go beyond the state of the art is not having to design the golden model of the processor at all.

What I am proposing is an application-specific processor synthesis tool that goes beyond the state of the art: a model generator for producing the high-level description of the processor, based only on application analysis and user-defined constraints. And for the fun of it, let’s codename it METATOR, because I tend to watch too much Supernatural these days, and METATOR (messenger) is a possible meaning of METATRON, an angelic being from the Apocrypha with a human past. So think of METATOR as an upgrade (spiritual or not) to the current status of both academic and commercial ASIP design tools.

The Context, the Problem and its Solution

ASIPs are tuned for cost-effective execution of targeted application sets. An ASIP design flow involves profiling, architecture exploration, generation and selection of functionalities, and synthesis of the corresponding hardware, while enabling the user to take certain decisions.

The state of the art in ASIP synthesis includes commercial efforts from Synopsys, which has accumulated three relevant portfolios: the ARC configurable processor cores, Processor Designer (previously LISATek) and the IP Designer nML-based tools (previously Target Compiler Technologies); ASIPmeister by ASIP Solutions (site down?); Lissom/CodAL by Codasip; and the academic TCE and NISC toolsets. Apologies if I have missed any other ASIP technology provider!

The key differentiation point of METATOR against existing approaches is that ASIP synthesis should not require the explicit definition of a processor model by a human developer. The solution implies the development of a novel scheme for extracting, by means of graph similarity extraction, a common-denominator architectural model from a given set of user applications (accounting for high-level constraints and requirements) that are intended to be executed on the generated processor. From this automatically generated model, an RTL description, verification IP and a programming toolchain would be produced as part of an automated targeting process, in a “meta” fashion: a generated model generating models!

 

Conceptual ASIP Synthesis Flow

METATOR would accept as input the so-called algorithmic soup (a narrow set of applications) and generate the ADL (Architecture Description Language) description of the processor. My first aim would be ArchC, but this could also expand to the dominant ADLs, LISA 2.0 and nML.

METATOR would rely upon the HercuLeS high-level synthesis technology and the YARDstick profiling and custom instruction generation environment. In the past, YARDstick has been used for generating custom instructions (CIs) for ByoRISC (Build Your Own RISC) soft-core processors. ByoRISC is a configurable in-order RISC design, allowing the execution of multiple-input, multiple-output custom instructions and achieving higher performance than typical VLIW architectures. CIs for ByoRISC were generated by YARDstick, whose purpose is to perform application analysis on targeted codes, identify application hotspots, extract custom instructions and evaluate their potential impact on code performance for ByoRISC.

Conclusion

To sum up, METATOR is a thought experiment in ASIP synthesis technology. It would automatically generate a full-fledged processor and toolchain merely from its usage intent, expressed as indicative targeted application sets.