From GMP programs to hardware

Some time ago I did a small experiment to automatically generate RTL from GMP (GNU multi-precision integer) programs (as in gmplib.org). These programs use API calls to perform arbitrary-precision integer arithmetic, used in cryptography, in conducting and supporting mathematical proofs, and in scientific computing in general.

I have developed a concept implementation that is part of HercuLeS HLS (http://www.nkavvadias.com/hercules/). Essentially it is a set of additional features on top of HercuLeS. I actually used the fgmp (public-domain) implementation and API, due to GMP’s licensing. The extensions (coded within a few days or so) did not yet generate synthesizable code, but I was able to generate simulation HDL (VHDL) with GMP API calls as a proof of concept.

In order to support the basic GMP (multi-precision integer) API/DSL for just addition and multiplication, the following additions were needed:

  • new VHDL package, mpint.vhd: MP integer and SLV (standard logic vector) operators: 500 lines of code (LOC)
  • C to N-Address Code frontend extensions: 50 LOC
  • Backend code generation in HercuLeS: 150 LOC (in addition to what is already there)

This proof-of-concept frontend for GNU multi-precision integer programs has been implemented along with a VHDL IP library for key library functionalities such as mpz_set, mpz_add and mpz_mul.
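
To make the supported subset concrete, here is a minimal sketch of the kind of source program the prototype frontend can handle, in the spirit of the factorial example further below. The gmp_madd function is my own illustration and the "gmp.h" header name is assumed to be the GMP-compatible one shipped with fgmp; only the addition/multiplication subset (mpz_init, mpz_add, mpz_mul) is used.

#include "gmp.h" /* fgmp's GMP-compatible header (name assumed) */

/* Compute a*b + c on multi-precision integers, using only the
   addition/multiplication subset targeted by the prototype. */
MP_INT gmp_madd(MP_INT *a, MP_INT *b, MP_INT *c)
{
  MP_INT prod, sum;
  mpz_init(&prod);
  mpz_init(&sum);
  mpz_mul(&prod, a, b);     /* prod = a * b */
  mpz_add(&sum, &prod, c);  /* sum  = prod + c */
  return (sum);
}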

A simple example

The following code implements factorial with fgmp, which is in the public domain:


#include "gmp.h" /* fgmp's GMP-compatible header (name assumed) */

MP_INT gmp_fact(unsigned int n)
{
  MP_INT prod;
  int i;
  mpz_init(&prod);
  mpz_set_si(&prod, 1);
  for (i = 2; i <= n; i++) {
    mpz_mul_ui(&prod, &prod, i);
  }
  return (prod);
}

This is the unoptimized FSMD in VHDL:


library IEEE;
use WORK.operpack.all;
use WORK.gmp_fact_cdt_pkg.all;
use STD.textio.all;
use WORK.std_logic_textio.all;
use WORK.mpint.all;
use WORK.std_logic_1164_tinyadditions.all;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity gmp_fact is
port (
  clk : in std_logic;
  reset : in std_logic;
  start : in std_logic;
  n : in std_logic_vector(31 downto 0);
  D_2747 : out std_logic_vector(31 downto 0);
  done : out std_logic;
  ready : out std_logic
);
end gmp_fact;

architecture fsmd of gmp_fact is
  type state_type is (S_ENTRY, S_EXIT, S_001_001, S_001_002, S_001_003, S_001_004, S_001_005, S_002_001, S_002_002, S_002_003, S_002_004, S_002_005, S_003_001, S_003_002, S_004_001, S_004_002);
  signal current_state, next_state: state_type;
  -- every register is represented by a _reg/_next signal pair: current value and next value
  signal i_21_next : std_logic_vector(31 downto 0);
  signal i_21_reg : std_logic_vector(31 downto 0);
  signal m_11_next : std_logic_vector(31 downto 0);
  signal m_11_reg : std_logic_vector(31 downto 0);
  signal i_11_next : std_logic_vector(31 downto 0);
  signal i_11_reg : std_logic_vector(31 downto 0);
  signal prod_next : mp_int;
  signal prod_reg : mp_int;
  signal i_next : std_logic_vector(31 downto 0);
  signal i_reg : std_logic_vector(31 downto 0);
  signal prod_11_next : mp_int;
  signal prod_11_reg : mp_int;
  signal prod_21_next : mp_int;
  signal prod_21_reg : mp_int;
  signal prod_12_next : mp_int;
  signal prod_12_reg : mp_int;
  signal i_2_31_next : std_logic_vector(31 downto 0);
  signal i_2_31_reg : std_logic_vector(31 downto 0);
  signal i_1_21_next : std_logic_vector(31 downto 0);
  signal i_1_21_reg : std_logic_vector(31 downto 0);
  signal D_2747_next : std_logic_vector(31 downto 0);
  signal D_2747_reg : std_logic_vector(31 downto 0);
  constant CNST_0 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000000";
  constant CNST_1 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000001";
  constant CNST_2 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000010";
begin
-- current state logic
process (clk, reset)
begin
  if (reset = '1') then
    current_state <= S_ENTRY;
    i_21_reg <= (others => '0');
    m_11_reg <= (others => '0');
    i_11_reg <= (others => '0');
    prod_reg.data <= (others => 0);
    i_reg <= (others => '0');
    prod_11_reg.data <= (others => 0);
    prod_21_reg.data <= (others => 0);
    prod_12_reg.data <= (others => 0);
    i_2_31_reg <= (others => '0');
    i_1_21_reg <= (others => '0');
    D_2747_reg <= (others => '0');
  elsif (clk = '1' and clk'EVENT) then
    current_state <= next_state;
    i_21_reg <= i_21_next;
    m_11_reg <= m_11_next;
    i_11_reg <= i_11_next;
    prod_reg <= prod_next;
    i_reg <= i_next;
    prod_11_reg <= prod_11_next;
    prod_21_reg <= prod_21_next;
    prod_12_reg <= prod_12_next;
    i_2_31_reg <= i_2_31_next;
    i_1_21_reg <= i_1_21_next;
    D_2747_reg <= D_2747_next;
  end if;
end process;

-- next state and output logic
process (current_state, start,
  n,
  D_2747_reg,
  i_21_reg, i_21_next,
  m_11_reg, m_11_next,
  i_11_reg, i_11_next,
  prod_reg, prod_next,
  i_reg, i_next,
  prod_11_reg, prod_11_next,
  prod_21_reg, prod_21_next,
  prod_12_reg, prod_12_next,
  i_2_31_reg, i_2_31_next,
  i_1_21_reg, i_1_21_next
)
begin
  -- default assignments: outputs deasserted, every register keeps its current value
  done <= '0';
  ready <= '0';
  i_21_next <= i_21_reg;
  m_11_next <= m_11_reg;
  i_11_next <= i_11_reg;
  prod_next <= prod_reg;
  i_next <= i_reg;
  prod_11_next <= prod_11_reg;
  prod_21_next <= prod_21_reg;
  prod_12_next <= prod_12_reg;
  i_2_31_next <= i_2_31_reg;
  i_1_21_next <= i_1_21_reg;
  D_2747_next <= D_2747_reg; 
  case current_state is 
    when S_ENTRY =>
      ready <= '1';
      if (start = '1') then
        next_state <= S_001_001;
      else
        next_state <= S_ENTRY; 
      end if; 
    when S_001_001 =>
      m_11_next <= CNST_1(31 downto 0);
      next_state <= S_001_002; 
    when S_001_002 =>
      prod_12_next <= mpz_set_ui(m_11_reg);
      next_state <= S_001_003; 
    when S_001_003 =>
      i_11_next <= CNST_2(31 downto 0);
      prod_next <= prod_12_reg;
      next_state <= S_001_004; 
    when S_001_004 =>
      i_next <= i_11_reg(31 downto 0);
      next_state <= S_001_005; 
    when S_001_005 =>
      next_state <= S_003_001; 
    when S_002_001 =>
      i_1_21_next(31 downto 0) <= i_reg;
      next_state <= S_002_002; 
    when S_002_002 =>
      prod_21_next <= mpz_mul_ui(prod_reg, i_1_21_reg);
      next_state <= S_002_003; 
    when S_002_003 =>
      mpz_printh(prod_21_reg);
      next_state <= S_002_004; 
    when S_002_004 =>
      i_21_next <= std_logic_vector(signed(i_reg) + signed(CNST_1(31 downto 0)));
      next_state <= S_002_005; 
    when S_002_005 =>
      prod_next <= prod_21_reg;
      i_next <= i_21_reg(31 downto 0);
      next_state <= S_003_001; 
    when S_003_001 =>
      i_2_31_next(31 downto 0) <= i_reg;
      next_state <= S_003_002; 
    when S_003_002 =>
      if (i_2_31_reg <= n(31 downto 0)) then
        next_state <= S_002_001;
      else
        next_state <= S_004_001; end if; 
    when S_004_001 =>
      D_2747_next <= CNST_0(31 downto 0);
      next_state <= S_004_002; 
    when S_004_002 =>
      next_state <= S_EXIT; 
    when S_EXIT =>
      done <= '1';
      next_state <= S_ENTRY; 
    when others =>
      next_state <= S_ENTRY;
  end case;
end process;

D_2747 <= D_2747_reg;

end fsmd;

The 5-minute introduction to FSMDs for practitioners

A design approach that is widely used by HLS (high-level synthesis) tools but is not really advertised loud and proud to HLS users is the Finite-State Machine with Datapath, aka FSMD. For instance, the Wikipedia entry on FSMDs is really sketchy. FSMDs are the primary approach for dealing with generic/control-flow dominated codes in an HLS context.

An FSMD is a microarchitectural paradigm for implementing non-programmable/hardwired processors with their control unit and datapath combined/merged. In FSMDs, the datapath actions are embedded within the actual next-state and output logic decoder of your FSM description. From an RTL (Register Transfer Level) abstraction point of view, an FSM comprises:

  • a current state logic process for updating state and register storage
  • a next state logic process for calculating the subsequent state to transition
  • an output logic process for producing the circuit’s outputs.

[NOTE: There is an excellent writeup on alternate FSM description styles in VHDL by Douglas J. Smith that you can consult; any recent XST manual provides good advice if targeting Xilinx FPGAs for casual RTL coding of FSMs.]

Let’s see FSMDs as considered by HercuLeS high-level synthesis (http://www.nkavvadias.com/hercules/); a manual for HercuLeS is here: http://www.nkavvadias.com/hercules-reference-manual/hercules-refman.pdf while a relevant book chapter can be downloaded from: http://cdn.intechweb.org/pdfs/29207.pdf if you want to go beyond these five minutes.

HercuLeS’ FSMDs are based on Prof. Gajski’s and Pong P. Chu’s work, mostly on some of their books and published papers. When I started my work on HercuLeS, I borrowed a couple of Gajski’s books from the local library and actually bought two of P. P. Chu’s works; his RTL Hardware Design Using VHDL book is highly relevant. Gajski’s work on SpecC and the classic TRs (technical reports such as Modeling Custom Hardware in VHDL) from his group were at some point night (by the bed) and day (by the desk) readings…

I believe Vivado HLS (aka AutoESL/xPilot) and the others do the same thing, following a very similar approach, with one key difference in how the actual RTL FSMD code is presented. Their datapath code is implemented with concurrent assignments, and there are lots of control and status signals going in and out of the next-state logic decoder. In contrast, I prefer datapath actions embedded within state decoding; this produces slightly slower and marginally larger hardware overall, but the user’s intention in the RTL is much clearer and the code is easier to grasp and follow.

In an FSMD, the key notion is understanding how the _reg and _next signals work, as together they represent a register: its currently accessible value and the value that is going to be written into it. Essentially, _reg and _next are what you see when probing the register’s output and input ports at any time.

Following the basic principles from Pong P. Chu, every register is associated with a _reg and a _next signal. Some advice:

  1. Have a _reg and a _next version of each register declared as signals in the VHDL code.
  2. In each state, read all needed _reg signals and assign all needed _next ones.
  3. Do not reassign the same _next version of a register within a single FSMD state.
  4. You can totally avoid variables in your code. Not all tools provide equally mature support for synthesizing code with variables.
  5. Operation chaining is possible, but it requires that you write _next versions and read them back within the same state. These then act as plain wires and do not implement registers. Again, you cannot reuse (for writing) the same _next version more than once in the same state.

At some point I developed a technique for automatically modifying VHDL FSMD code to add controlled operation chaining. It just uses a lexer; to read more about it, see Section III.E of http://www.nkavvadias.com/publications/kavvadias_asap12_cr.pdf.

If you are more deeply curious about HercuLeS, you can read http://www.nkavvadias.com/publications/hercules-pci13.pdf; a journal paper has been accepted for publication and will soon be available. I had to say it!

YARDstick: custom processor development toolset

YARDstick is a design automation tool for custom processor development flows that focuses on the hard part: generating and evaluating application-specific hardware extensions. YARDstick is a powerful building block for ASIP (Application-Specific Instruction-set Processor) development, since it integrates application analysis and ultra-fast algorithms for custom instruction generation and selection with user-defined compiler intermediate representations. It also offers retargetable compiler features for the targeted IRs/architectures.

Features

Important features of YARDstick are as follows:

  • retargetable to user-defined IRs by machine description
  • can be targeted to low-level compiler IRs, assembly-level representations of
    virtual machines, or assembly code for existing processors
  • fully parameterized custom instruction generation and selection engine
  • code selector for multiple-input multiple-output patterns
  • virtual register assignment for virtual machine targets
  • an extensive set of backends including:
    • assembly code emitter
    • C backend
    • visualization backend for Graphviz (http://www.graphviz.org)
    • visualization backend for VCG (http://rw4.cs.uni-sb.de/~sander/html/gsvcg1.html) (or aiSee: http://www.absint.com/aisee/)
    • an XML format backend amenable to graph rewriting.

YARDstick comes along with a cross-platform GUI written in Tcl/Tk 8.5 (http://wiki.tcl.tk).

Aim

The ultimate goal of YARDstick is to liberate the designer’s development infrastructure from compiler and simulator idiosyncrasies. With YARDstick, the ASIP designer is empowered
with the freedom of specifying the target architecture of choice and adding new implementations of analyses and custom instruction generation/selection methods.

Typically, 2x to 15x speedups for benchmark applications (optimized ANSI C source code) can be obtained fully automatically by using YARDstick, depending on the target architecture. Speedups are evaluated against a scalar RISC architecture.

Detailed look

  1. Analysis engines generating both static and dynamic statistics:
    1. Data types
    2. Operation-level statistics
    3. Basic block statistics (ranking)
    4. Performance estimations with/without custom instructions.
  2. Generation of CDFGs (Control-Data Flow Graphs).
  3. Backend engines:
    1. ANSI C
    2. dot (Graphviz)
    3. VCG (GDL, aiSee)
    4. XML (GGX for the AGG graph rewriting tool)
    5. Retargetable assembly emitter for entire translation units
    6. CDFG formats for various RTL synthesis tools
  4. Custom instruction engines:
    1. Fully parameterized MIMO custom instruction generation algorithm
    2. Fast heuristic
    3. Configurable number of inputs
    4. Configurable number of outputs
  5. Custom instruction selection:
    1. Based on priority metrics
  6. Graph (and graph-subgraph) isomorphism features for eliminating redundant patterns. Multiple algorithms supported.
  7. Visualization of custom instructions, basic blocks, control-flow graphs and control-data flow graphs (basic block nodes expanded to their constituent instructions).
  8. Basic retargetable compiler features:
    1. Code selector for MIMO instructions (tested with large cases).
    2. Virtual register assignment (allocation for a VM).
    3. Hard register allocator in the works.
  9. Miscellaneous features:
    1. single constant multiplication optimizer (see the short sketch after this list)
    2. elimination of false data dependences in assembly-level CDFGs
    3. beautification options for visualization
    4. interfacing (co-operation) with external tools such as peephole optimizers, profilers, code generators, etc.
    5. features related to the custom processor architecture named ByoRISC
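
As a side note on the single constant multiplication optimizer mentioned in the miscellaneous features above, the hand-worked C fragment below shows the basic idea: a multiplication by a constant is decomposed into shifts and adds/subtracts. The decompositions are textbook examples for illustration; the ones YARDstick actually emits may differ.

/* Single constant multiplication (SCM): replace x*C by shifts and adds. */
unsigned int mul10(unsigned int x)
{
  return (x << 3) + (x << 1);   /* 10*x = 8*x + 2*x */
}

unsigned int mul7(unsigned int x)
{
  return (x << 3) - x;          /* 7*x = 8*x - x */
}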

Benchmarks

Here is a list of application benchmarks that have been tested with YARDstick:

  • ADPCM encoder and decoder (typically: 4x speedup)
  • Video processing kernels: full-search block-matching motion estimation,
    logarithmic search motion estimation, motion compensation
  • Image processing kernels: steganography (hide/uncover), edge detection, matrix
    multiplication
  • Cryptographic kernels: crc32, rc5, raiden (7x speedup, 12x for unrolled
    version)

At the YARDstick homepage (http://www.nkavvadias.com/yardstick/) you can find some additional materials.

Supporting testimonial to the ArchC infrastructure (2004)

[I was an early PhD student at the time and was quite impressed with what ArchC had to offer; I still am]

What ArchC is all about

ArchC is an architecture description language that exploits the already mature SystemC legacy. It targets the automated generation of a) functional and cycle-accurate instruction set simulators and, at a later stage, b) the tools of the entire application development toolchain for a custom processor (compiler, assembler, linker, debugger, etc.).

Simulator generation is in a very good and usable state; however, it needs to be extended towards the system level (bus generation?). Toolset generation, which naturally comes next, requires harder effort and therefore the appropriate support of the open-source community. In my personal opinion, compiler generation is so hard that it can only become possible through funding. Other issues, such as automated bus generation, are also very HARD and need support.

Strengths of ArchC

Important points regarding ArchC are:

  • Nice integration with SystemC. The ArchC user codes instruction decoding, hardware resources, etc. in small ArchC files and the corresponding SystemC code is automatically generated, which removes a heavy burden from the user.
  • Unique approach to the design of a retargetable system call library. This makes it possible to benchmark real-sized applications with realistic demands, i.e. many input arguments. Many of the other simulators can’t cope with that.
  • Ease of introducing resource elements, e.g. multi-banked memories.
  • Reuse possibilities for existing SystemC codes.
  • Models of 3 processors, 2 of them with complete retargeted GNU toolchains! Plus binaries of very popular benchmarks (MediaBench, [MediaBench-II is here], MiBench).

How has ArchC helped me in my research

  • Ease of designing variations of a base processor. Most of my papers require tradeoff analysis on alternative architectures.
  • My “drproc” model is a 5-pipeline stage RISC processor with specialized instructions for data-reuse transformations [work of Catthoor et al, IMEC]. I originally had coded an assembler (~few days) and the entire processor in plain SystemC (~month). When I found out about ArchC, it required only 3-4 8-hour days to deliver to my colleagues a much more robust cycle-accurate simulator!
  • I have run reasonably sized applications (e.g. a SourceForge application of 25K C lines with intensive use of pointers) with no problem on the R3000 and MIPS models of ArchC.

What needs to be done

NOTE: In the most part, the roadmap has been completed!

This work will be incomplete unless the “ArchC roadmap” is fulfilled. While I believe compiler retargeting is a very hard issue, [addressing the UNICAMP team] I am certain that your team is experienced in DSP compilation techniques (I have read some of the papers). Certainly, assembler generation and bus wrapper generation (e.g. for OCP), plus some more models, are awaited by many users, including myself.

Thank you www.archc.org for all the (free) fish!

Top 12 HercuLeS HLS user feedback patterns

As you might already know, HercuLeS by Ajax Compilers, Inc. (http://www.ajaxcompilers.com ; tech page here: http://www.nkavvadias.com/hercules/) is a high-level synthesis (HLS) environment for the automatic translation of algorithms to hardware.

Since November 2013, the FREE version of HercuLeS has been available to download, install and use on Windows 7 (64-bit) or Linux (32-bit and 64-bit): http://www.nkavvadias.com/temp/index.php

This free/demo version has generated substantial user feedback. Dear users, wherever you are located (US, Canada, Japan, P.R. China, Sweden, Germany, UK, India, Brazil; the list is by no means exhaustive), thank you-thank you-thank you!!!

I have been compiling a short list of (let’s say) the “Top 12” user feedback patterns. Focusing on the more generic points, I am disclosing the list as food for thought and to generate even more feedback!

Top-12 user feedback patterns and concerns (not in any particular order)

  1. Development time minimization (algorithm, early verification, RTL generation, simulation, implementation, late verification).
  2. High result quality (QoR): reduced runtime requirements, chip area and power consumption. (The latest out-of-the-box head-to-head against Vivado HLS is a tie: http://nkavvadias.com/blog/2014/10/14/vivado-hls-vs-hercules-2/)
  3. Readability of the generated HDL code. (HercuLeS code is much more readable than code generated by the competition.)
  4. Source to IR to RTL to netlist cross-referencing (by means of “intelligent” cross-tagging). (It looks like an intelligent IDE is a whole new project.)
  5. Theoretically provable correct-by-design approach. (I have been looking into automatic proving of code transformations.)
  6. Transparent interface to the logic synthesis tool and up to downloading the bitstream. (The prototype version works and has been checked for my development boards.)
  7. Plug-in approach to interconnecting legacy reusable designs with HLS-generated designs (“I have this piece of code in idiosyncratic C flavor or assembly or FORTRAN…”). Looks like a call for implementing point solutions for select (aka paying) users.
  8. Optimize HDL descriptions for specific implementation processes (e.g. FPGA devices); users don’t actually seek portability!!!
  9. Pthread or OpenMP frontend support. Explicit parallelism if you please! (My bet is with parallelism extraction but I get this important point)
  10. VHDL frontend support (!) for behavioral VHDL to netlist HDL end-to-end flow. (There exist hard-to-the-core developers — and greatness is with them — that do their algorithmic exploration in behavioral VHDL. I usually do my own in C or VHDL and only lately I have been increasingly using MATLAB and Processing.)
  11. Can you synthesize malloc and free? (Yes I can, and I am improving it, since until now I had been *intercepting* malloc and free and mapping them to a hardware-managed stack; a rough sketch of the idea follows the note below.)
  12. Can you show me the automatic IP integration feature? (Yes I can, check this blog post as well: http://nkavvadias.com/blog/2014/10/13/hercules-overview/)

NOTE: quoted text is not a factual reply of an individual but rather an artistic rendition thereof.
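
Returning to point 11 above, the fragment below is a rough C analogue of what intercepting malloc and free and serving them from a stack-managed arena amounts to. The names, the arena size and the LIFO-only free policy are simplifications of mine for illustration; they are not the actual HercuLeS implementation.

#include <stddef.h>

#define ARENA_BYTES 4096

static unsigned char arena[ARENA_BYTES];   /* fixed hardware-style memory pool */
static size_t top = 0;                     /* top of the allocation stack */

void *hw_malloc(size_t nbytes)
{
  void *p;
  if (top + nbytes > ARENA_BYTES)
    return NULL;                           /* out of arena space */
  p = &arena[top];
  top += nbytes;                           /* push: allocations grow like a stack */
  return p;
}

void hw_free(void *p, size_t nbytes)
{
  /* LIFO discipline: only the most recent allocation can be released */
  if ((unsigned char *)p + nbytes == &arena[top])
    top -= nbytes;
}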

10x productivity boost with HLS: Myth, legend or fact?

There exists a saying about pretty much every new disruptive technology and so it goes for high-level synthesis. Maybe you have heard it before reading these lines; as an ingredient to a pitch, on colleagues’ coffee break, at a press conference or maybe you have found yourself thinking more or less of this: “When I become proficient with this XYZ HLS tool, I will be 10x more productive! I will be able to fast-forward this task, put a smile on my supervisor’s face; earn a good deal of quality time too!”

Well this reads as a dream-come-true, but is it reasonable to believe in it? Is it already possible, or will it become possible? Being an order of magnitude more productive; is it myth, legend or a factual thing?

HLS attempts to eliminate the tedious iterations at the RTL (primarily). This is done by having the developer describe the intent or behavior of the system at an increased level of abstraction. There is no need to reason about each FSM state, exact timing, or every stepwise, error-prone decision. Ideally, by giving the algorithm and a set of requirements to the HLS tool, you effectively get a correct design with the desired QoR. Or a no-go.

To add some data into the soup, let’s examine the following figure.

Ratio of generated RTL VHDL lines to C source lines for ten algorithms.

For these ten algorithms, ANSI C code is passed to HercuLeS HLS ( http://www.nkavvadias.com/hercules/ ). At the end of the flow, among other things, we get the synthesizable RTL VHDL for the given C code. Measuring the VHDL-to-C ratio of useful lines of code (no comments, etc.) gives a 7.8x to 25.15x variation, with an average of about 15 times (15.4x).

This is by no means conclusive but it is indicative. HercuLeS takes your code through a series of transformations; from C to N-Address Code (intermediate language) to graph-based representation (in Graphviz) and finally to RTL VHDL. Every step is from a higher abstraction level to a lower one and every step results in a more detailed representation as it moves from software to hardware description.

This approach obviously sounds like the correct one. But bear in mind the potential fallacy: 10x less code to write does not necessarily yield 10x less time to develop it. What must also be accounted for are:

  1. Iterations in developing the C code itself.
  2. Multiplicity in the number of generated RTL designs, as points in the architectural design space. These iterations are sloooow, so you have to keep them a) few, b) short. For this reason, it is much better that they should not happen at the RTL, but rather at the level of an IL (intermediate language) estimator/profiler like aprof: http://www.nkavvadias.com/doc/aprof/aprof-README.html .
  3. RTL design is only a fraction of concept-to-implementation. As cliche as it may read, this is Amdahl alright.
  4. Coding for parallelism, whether implicit or explicit, is challenging in the sense of near-optimal scaling for regular or irregular applications and adequacy of the underlying hardware paradigm.

And the verdict: tenfold increase in productivity is a FACT, and you should be able to experience it, if you do it the RIGHT way!

Vivado HLS vs HercuLeS (Kintex-7 and VDS 2013.2 update)

As a followup to a previous blog post on the out-of-the-box Vivado HLS vs HercuLeS comparison, the following table provides updated information on the performance of HercuLeS against Vivado HLS 2013.2 on Virtex-6 and Kintex-7 (XC7K70TFBG676-2 FPGA device).

Better results (lower execution time; smaller area) have been typeset in bold. It can be clearly seen that HercuLeS outperforms Vivado HLS in key benchmarks such as filtering and numerical processing. As expected, on many occasions better speed/performance can be traded off for lower area. With 12 partial wins each, one could call this a tie :)

   Benchmark   Description                   Vivado HLS (VHLS)       HercuLeS                Device
                                             LUTs  Regs  TET (ns)    LUTs  Regs  TET (ns)
 1 bitrev      Bit reversal                    67    39     72.0       42    40     11.6     Virtex-6
 2 divider     Radix-2 division               218   226     63.6      318   332     30.6     Kintex-7
 3 edgedet     Edge detection                 246   130   1636.3      680   361   1606.4     Virtex-6; 1 BRAM for VHLS
 4 fibo        Fibonacci series               138   131     60.2      137   197    102.7     Virtex-6
 5 fir         FIR filter                      89   114   1027.1      606   540    393.8     Kintex-7
 6 gcd         Greatest common divisor        210    98     35.2      128    93     75.9     Virtex-6
 7 icbrt       Cubic root approximation       239   207    260.6      365   201    400.5     Virtex-6
 8 sieve       Prime sieve of Eratosthenes    525   595   6108.4      565   523   3869.5     Virtex-6; 1 BRAM for VHLS

NOTES:

  • TET is Total Execution Time in ns.
  • VHLS is a shortened form for Vivado HLS.
  • Vivado HLS 2013.2 was used.
  • Bold denotes smaller area and lower execution time.
  • Italic denotes an inconclusive comparison.
  • For the cases of edgedet and sieve, VHLS identifies a BRAM; HercuLeS does not. In these cases, HercuLeS saves a BRAM while VHLS saves on LUTs and FFs (Registers).

HercuLeS: Overview and features

HercuLeS is an extensible high-level synthesis environment for automatically mapping algorithms to hardware. Essentially, HercuLeS translates programs in a typed-assembly language named N-Address Code (NAC) to a collection of Graphviz CDFGs (Control-Data Flow Graphs) which are then synthesized to vendor-independent self-contained RTL VHDL. HercuLeS is also used for push-button synthesis of ANSI C code to VHDL.

Overview

The basic steps in the HercuLeS flow are shown in the figure below. C code is passed to GCC for GIMPLE dump generation, optionally after an external source-level optimizer. Textual GIMPLE is then processed by gimple2nac; alternatively, the user can directly supply a NAC translation unit or use their own frontend. Alternative frontends, currently works in progress, are being developed for LLVM (C, C++, Objective-C) and domain-specific languages (DSLs).

HercuLeS overview.

Various optimizations can be applied at the NAC level: peephole transformations, if-conversion, and function call insertion to enable IP integration. Heuristic basic block partitioning avoids the introduction of excessive critical paths due to operation chaining. The core of HercuLeS comprises a frontend (nac2cdfg) and a graph-based backend (cdfg2hdl). nac2cdfg is a translator from NAC to flat CDFGs represented in Graphviz. cdfg2hdl is the actual synthesis kernel, generating FSMD hardware in VHDL and self-checking testbenches from Graphviz CDFGs.
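
As an aside on if-conversion, one of the NAC-level transformations listed above, the hand-worked C analogue below shows the general idea: a control dependence becomes a data dependence, both candidate values are computed, and a select picks the result. This is only an illustration of the technique, not HercuLeS output.

/* Branchy form: control flow decides the value of y. */
int clamp_branchy(int x, int hi)
{
  int y = x;
  if (x > hi)
    y = hi;
  return y;
}

/* If-converted form: the predicate is computed as data and a select
   (conditional move) replaces the branch. */
int clamp_ifconverted(int x, int hi)
{
  int p = (x > hi);      /* predicate as data */
  return p ? hi : x;     /* select instead of branch */
}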

nac2cdfg is used for parsing, analysis and CDFG extraction from NAC programs. cdfg2hdl maps CDFGs to extended FSMDs (Finite-State Machines with Datapath). An ANSI C backend allows for rapid algorithm prototyping and NAC verification. VHDL code can be simulated with GHDL and Modelsim and synthesized in Xilinx XST and Vivado Design Suite using automatically generated scripts.

Features

HercuLeS supports a wealth of features. The following graphic provides a visualization of the most important existing or planned features.

HercuLeS features.

Third-party IP integration

Third-party hardware IP blocks can be seamlessly integrated into HercuLeS. HercuLeS allows for automatic IP integration, provided that the user supplies the corresponding builtin functionality. A HercuLeS flow user can import and use their own IP through the following process:

  1. Implement the IP with the expected interface and place it in the proper subdirectory.
  2. Add a corresponding entry in a textual database.
  3. Use TXL transformations, driven by a script, to replace an operator use with a black-box function call (a rough sketch follows this list).
  4. A list of black-box functions is generated.
  5. HercuLeS automatically creates a hierarchical FSMD with the requested callee(s).
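
The operator-to-call rewrite of step 3 can be pictured roughly as follows; the udiv_builtin name and prototype are hypothetical placeholders for this sketch, not HercuLeS's actual builtin naming convention.

/* Declared as a black-box builtin; listed in the textual IP database,
   so cdfg2hdl will bind the call to the user's divider IP block. */
unsigned int udiv_builtin(unsigned int a, unsigned int b);

unsigned int scale(unsigned int num, unsigned int denom)
{
  /* before the TXL rewrite:  return num / denom;  */
  return udiv_builtin(num, denom);
}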

The following graphic illustrates the combined TXL/C approach. The first two steps apply preprocessing to split local variable declarations and remove those that are redundant or unused. The remaining declarations are then localized and, subsequently, procedure calls to black-box functions are introduced; these routines are the actual builtin functions. If the corresponding builtins are listed in the IP database, a VHDL implementation that is interface-compatible with the HercuLeS caller FSMDs is assumed. cdfg2hdl then automatically handles interface generation and component instantiation in the HDL description of the caller FSMD. In addition, the simulation and synthesis scripts already account for the IP HDL files.

Automatic IP integration in HercuLeS.

This approach is also valid for floating-point computation, while both pipelined and multi-cycle third-party components are supported.

Summary

HercuLeS delivers a contemporary HLS environment that can be comfortably used for algorithm acceleration by predominantly software-oriented engineers. In this article, we only covered a high-level overview of how HercuLeS works, walked through existing and planned features, and briefly discussed automatic hardware IP integration.

Useful literature and links


3 predictions on the role of HLS for the 2020 Digital Platform

The latest ITRS tables (for 2011-2012 since the 2013 tables are not yet complete: http://www.itrs.net/reports.html) highlight three points that are important to anyone developing or using HLS/ESL solutions.

  • The responsibility for power reduction efforts will be shifted to higher abstractions.

The following figure illustrates this expected trend: the contribution of behavioral- and architectural-level efforts to overall system power reduction will rise from 50% to 80%.

Power reduction effort for the 2020 Digital Platform.

  • High-level synthesis estimates regarding speed, area, power and costs will need to be extremely accurate in the near future.

The quality of estimations produced during the phase of architectural design space exploration is projected to rise to approximately 100% within less than 10 years! The tradeoff between better and faster estimation will become even more evident.

Expected high-level estimates accuracy for the 2020 Digital Platform.

  • The 2020 Digital Platform will be fully supported by development toolchains.

This means that both programmable and configurable aspects of the platform (such as hardware accelerators) will be accessible via convenient layers of programmability. The current shift towards parallelism-aware languages including C/OpenMP, OpenCL, CUDA, AMP++, and OpenACC is clearly visible and a vibrant reality among programmers. The following figure indicates that within less than 10 years ALL computational platforms from the HPC realm
to the autonomous, omnipresent, embedded systems will require full support by accessible tools.

Expected levels of tool support for the 2020 Digital Platform.

What is your opinion?

Multiplierless hack for constant division by certain factors

I would like to present you with a formula for quotient calculation that works for any constant divisor of the form 2^n-1 (^ denotes exponentiation). For n=2,3,4,5,6,7, these divisors are 3,7,15,31,63,127. I have tested all divisors with n up to 16 and the approach works. My formula works only for dividends x that are in the range [0,(2^(2*n))-2].

The formula is as follows and it is both divisionless and multiplierless (as initially derived):

y = (((x>>n)+x+(1<<n)+1)>>n)-1;

which can also be written as:

y = (x + (x>>n) + 1) >> n;

This is a formula for software integer division, i.e. truncating division that calculates the quotient of the division. Please note that 1<<n is the strength-reduced form for calculating 2^n. It does not require any double-word arithmetic (like with multiplying by the multiplicative inverse and then performing adjustment steps).

E.g. for n = 6, the divisor is 2^n - 1 = 63. The formula will work for x <= 4094 = (2^(2*n)) - 2 = (1<<n)*(1<<n)-2. For instance, for x = 200: (200 + (200>>6) + 1) >> 6 = (200 + 3 + 1) >> 6 = 204 >> 6 = 3, which matches 200/63 = 3. The formula will produce off-by-one errors for certain x > (2^(2*n))-2, but it is correct for the specified “small” x values.

A testcase for the algorithm is here:

#include <stdio.h>

int main(void)
{
  int qapprox, qexact;
  int i, j, k;

  /* test every divisor k = 2^i - 1 for i = 2..7 */
  for (i = 2; i < 8; i++) {
    /* dividends j in the valid range [0, 2^(2*i) - 2] */
    for (j = 0; j < (1<<i)*(1<<i)-1; j++) {
      k = (1<<i) - 1;
      qapprox = (((j>>i)+j+(1<<i)+1)>>i)-1;
      qexact  = j / k;
      if (qapprox != qexact) {
        fprintf(stderr, "qapprox = (%d/%d) = %d\tqexact = (%d/%d) = %d\n",
          j, k, qapprox, j, k, qexact);
      }
    }
  }
  return 0;
}

I had devised the hack via “algorithmic sketching”. I mean that I started from a basic formula with some “holes” in it, where the holes were small constants or operators. Such techniques exist in academia, put more nicely; they are sometimes called algorithmic or combinatorial sketching, and they are relevant to forms of program synthesis. Essentially, it is about devising algorithmic sketches (incomplete algorithms with wildcards) and then using a mechanism to fill the gaps.

It seems I can claim the specific hack as far back as May 16, 2011.

Special thanks to everyone who has participated in the following threads. The Taylor series connection is very interesting, as mentioned by Tim Wescott and Tauno Voipio.