# From GMP programs to hardware

Some time ago I did a small experiment to automatically generate RTL from GMP (GNU multi-precision integer) programs (see gmplib.org). These programs use API calls to perform arbitrary-precision integer arithmetic, used in cryptography, in conducting or supporting mathematical proofs, and in scientific computing in general.

I have developed a concept implementation as part of HercuLeS HLS (http://www.nkavvadias.com/hercules/); essentially, a set of additional features to HercuLeS. I actually used the public-domain fgmp implementation and API, due to GMP’s licensing. The extensions (coded within a few days or so) did not yet generate synthesizable code, but I was able to generate simulation HDL (VHDL) with GMP API calls as a proof of concept.

To support the basic GMP (multi-precision integer) API/DSL for just addition and multiplication, the following additions were needed:

• new VHDL package, mpint.vhd: MP integer and SLV (standard logic vector) operators: 500 lines of code (LOC)
• C to N-Address Code frontend extensions: 50 LOC
• Backend code generation in HercuLeS: 150 LOC (in addition to what is already there)

This proof-of-concept frontend for GNU multi-precision integer programs has been implemented along with a VHDL IP library for key library functionalities such as mpz_set, mpz_add and mpz_mul.
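For intuition, the kind of operation that mpint.vhd must provide can be sketched in C: a limb-based addition with carry propagation. The representation below (four 32-bit limbs) is my own toy assumption for illustration; fgmp and the actual mpint.vhd use their own internals.

```c
#include <stdint.h>

#define LIMBS 4  /* toy fixed width: 4 x 32-bit limbs = 128 bits */

/* d[0] is the least-significant limb. */
typedef struct { uint32_t d[LIMBS]; } mp_toy;

/* r = a + b, propagating the carry across limbs (overflow wraps, as hardware would). */
static void mp_toy_add(mp_toy *r, const mp_toy *a, const mp_toy *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t s = (uint64_t)a->d[i] + b->d[i] + carry;
        r->d[i] = (uint32_t)s;
        carry = s >> 32;
    }
}
```

For example, adding 0xFFFFFFFF and 1 must carry into the second limb; this cross-limb carry chain is exactly what an `mpz_add`-style operator has to implement in hardware.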

# A simple example

The following code implements factorial with `fgmp`, which is in the public domain:

```c
MP_INT gmp_fact(unsigned int n)
{
    MP_INT prod;
    int i;
    mpz_init(&prod);
    mpz_set_si(&prod, 1);
    for (i = 2; i <= n; i++) {
        mpz_mul_ui(&prod, &prod, i);
    }
    return (prod);
}
```
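For checking the simulation output of the generated FSMD, a plain-C reference model is handy. The 64-bit limit below is my own simplification (exact up to n = 20); the whole point of the fgmp version is, of course, to go beyond it:

```c
#include <stdint.h>

/* Reference factorial mirroring gmp_fact's loop; exact for n <= 20 in 64 bits. */
static uint64_t fact_ref(unsigned int n)
{
    uint64_t prod = 1;
    for (unsigned int i = 2; i <= n; i++)
        prod *= i;
    return prod;
}
```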

This is the unoptimized FSMD in VHDL:

```vhdl
library IEEE;
use WORK.operpack.all;
use WORK.gmp_fact_cdt_pkg.all;
use STD.textio.all;
use WORK.std_logic_textio.all;
use WORK.mpint.all;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity gmp_fact is
  port (
    clk    : in  std_logic;
    reset  : in  std_logic;
    start  : in  std_logic;
    n      : in  std_logic_vector(31 downto 0);
    D_2747 : out std_logic_vector(31 downto 0);
    done   : out std_logic
  );
end gmp_fact;

architecture fsmd of gmp_fact is
  type state_type is (S_ENTRY, S_EXIT, S_001_001, S_001_002, S_001_003,
    S_001_004, S_001_005, S_002_001, S_002_002, S_002_003, S_002_004,
    S_002_005, S_003_001, S_003_002, S_004_001, S_004_002);
  signal current_state, next_state : state_type;
  signal i_21_next    : std_logic_vector(31 downto 0);
  signal i_21_reg     : std_logic_vector(31 downto 0);
  signal m_11_next    : std_logic_vector(31 downto 0);
  signal m_11_reg     : std_logic_vector(31 downto 0);
  signal i_11_next    : std_logic_vector(31 downto 0);
  signal i_11_reg     : std_logic_vector(31 downto 0);
  signal prod_next    : mp_int;
  signal prod_reg     : mp_int;
  signal i_next       : std_logic_vector(31 downto 0);
  signal i_reg        : std_logic_vector(31 downto 0);
  signal prod_11_next : mp_int;
  signal prod_11_reg  : mp_int;
  signal prod_21_next : mp_int;
  signal prod_21_reg  : mp_int;
  signal prod_12_next : mp_int;
  signal prod_12_reg  : mp_int;
  signal i_2_31_next  : std_logic_vector(31 downto 0);
  signal i_2_31_reg   : std_logic_vector(31 downto 0);
  signal i_1_21_next  : std_logic_vector(31 downto 0);
  signal i_1_21_reg   : std_logic_vector(31 downto 0);
  signal D_2747_next  : std_logic_vector(31 downto 0);
  signal D_2747_reg   : std_logic_vector(31 downto 0);
  constant CNST_0 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000000";
  constant CNST_1 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000001";
  constant CNST_2 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000010";
begin
  -- current state logic
  process (clk, reset)
  begin
    if (reset = '1') then
      current_state <= S_ENTRY;
      i_21_reg <= (others => '0');
      m_11_reg <= (others => '0');
      i_11_reg <= (others => '0');
      prod_reg.data <= (others => 0);
      i_reg <= (others => '0');
      prod_11_reg.data <= (others => 0);
      prod_21_reg.data <= (others => 0);
      prod_12_reg.data <= (others => 0);
      i_2_31_reg <= (others => '0');
      i_1_21_reg <= (others => '0');
      D_2747_reg <= (others => '0');
    elsif (clk = '1' and clk'EVENT) then
      current_state <= next_state;
      i_21_reg <= i_21_next;
      m_11_reg <= m_11_next;
      i_11_reg <= i_11_next;
      prod_reg <= prod_next;
      i_reg <= i_next;
      prod_11_reg <= prod_11_next;
      prod_21_reg <= prod_21_next;
      prod_12_reg <= prod_12_next;
      i_2_31_reg <= i_2_31_next;
      i_1_21_reg <= i_1_21_next;
      D_2747_reg <= D_2747_next;
    end if;
  end process;

  -- next state and output logic
  process (current_state, start, n, D_2747_reg,
           i_21_reg, i_21_next, m_11_reg, m_11_next,
           i_11_reg, i_11_next, prod_reg, prod_next,
           i_reg, i_next, prod_11_reg, prod_11_next,
           prod_21_reg, prod_21_next, prod_12_reg, prod_12_next,
           i_2_31_reg, i_2_31_next, i_1_21_reg, i_1_21_next)
  begin
    done <= '0';
    i_21_next <= i_21_reg;
    m_11_next <= m_11_reg;
    i_11_next <= i_11_reg;
    prod_next <= prod_reg;
    i_next <= i_reg;
    prod_11_next <= prod_11_reg;
    prod_21_next <= prod_21_reg;
    prod_12_next <= prod_12_reg;
    i_2_31_next <= i_2_31_reg;
    i_1_21_next <= i_1_21_reg;
    D_2747_next <= D_2747_reg;
    case current_state is
      when S_ENTRY =>
        if (start = '1') then
          next_state <= S_001_001;
        else
          next_state <= S_ENTRY;
        end if;
      when S_001_001 =>
        m_11_next <= CNST_1(31 downto 0);
        next_state <= S_001_002;
      when S_001_002 =>
        prod_12_next <= mpz_set_ui(m_11_reg);
        next_state <= S_001_003;
      when S_001_003 =>
        i_11_next <= CNST_2(31 downto 0);
        prod_next <= prod_12_reg;
        next_state <= S_001_004;
      when S_001_004 =>
        i_next <= i_11_reg(31 downto 0);
        next_state <= S_001_005;
      when S_001_005 =>
        next_state <= S_003_001;
      when S_002_001 =>
        i_1_21_next(31 downto 0) <= i_reg;
        next_state <= S_002_002;
      when S_002_002 =>
        prod_21_next <= mpz_mul_ui(prod_reg, i_1_21_reg);
        next_state <= S_002_003;
      when S_002_003 =>
        mpz_printh(prod_21_reg);
        next_state <= S_002_004;
      when S_002_004 =>
        i_21_next <= std_logic_vector(signed(i_reg) + signed(CNST_1(31 downto 0)));
        next_state <= S_002_005;
      when S_002_005 =>
        prod_next <= prod_21_reg;
        i_next <= i_21_reg(31 downto 0);
        next_state <= S_003_001;
      when S_003_001 =>
        i_2_31_next(31 downto 0) <= i_reg;
        next_state <= S_003_002;
      when S_003_002 =>
        if (i_2_31_reg <= n(31 downto 0)) then
          next_state <= S_002_001;
        else
          next_state <= S_004_001;
        end if;
      when S_004_001 =>
        D_2747_next <= CNST_0(31 downto 0);
        next_state <= S_004_002;
      when S_004_002 =>
        next_state <= S_EXIT;
      when S_EXIT =>
        done <= '1';
        next_state <= S_ENTRY;
      when others =>
        next_state <= S_ENTRY;
    end case;
  end process;

  D_2747 <= D_2747_reg;

end fsmd;
```

# The 5-minute introduction to FSMDs for practitioners

A design approach that is widely used by HLS (high-level synthesis) tools but is not really advertised loud and proud to HLS users is the Finite-State Machine with Datapath, aka FSMD. For instance, the Wikipedia entry on FSMDs is really sketchy. FSMDs are the primary approach for dealing with generic/control-flow dominated codes in an HLS context.

An FSMD is a microarchitectural paradigm for implementing non-programmable/hardwired processors with their control unit and datapath combined. In FSMDs, the datapath actions are embedded within the actual next-state and output logic decoder of your FSM description. From an RTL (Register Transfer Level) abstraction point of view, you can view an FSM as comprising:

• a current state logic process for updating state and register storage
• a next state logic process for calculating the subsequent state to transition
• an output logic process for producing the circuit’s outputs.

[NOTE: There is an excellent writeup on alternate FSM description styles in VHDL by Douglas J. Smith that you can consult; any recent XST manual provides good advice if targeting Xilinx FPGAs for casual RTL coding of FSMs.]

Let’s see FSMDs as considered by HercuLeS high-level synthesis (http://www.nkavvadias.com/hercules/). A manual for HercuLeS is here: http://www.nkavvadias.com/hercules-reference-manual/hercules-refman.pdf, while a relevant book chapter can be downloaded from http://cdn.intechweb.org/pdfs/29207.pdf if you want to go beyond these five minutes.

HercuLeS’ FSMDs are based on Prof. Gajski’s and Pong P. Chu’s work, mostly on some of their books and published papers. When I started my work on HercuLeS, I borrowed a couple of Gajski’s books from the local library and actually bought two of P. P. Chu’s works; the RTL Hardware Design Using VHDL book is highly relevant. Gajski’s work on SpecC and the classic technical reports from his group (such as Modeling Custom Hardware in VHDL) were at some point night (by the bed) and day (by the desk) readings…

I believe Vivado HLS (aka AutoESL/xPilot) and the others follow a very similar approach, with one key difference in how the actual RTL FSMD code is presented. Their datapath code is implemented with concurrent assignments, and there are lots of control and status signals going in and out of the next-state logic decoder. By contrast, I prefer datapath actions embedded within state decoding; this produces slightly slower and marginally larger hardware overall, but the user’s intention in the RTL is much clearer and easier to grasp and follow.

In an FSMD, the key notion is understanding how the _reg and _next signals work: they represent a register, i.e. its currently accessible value and the value that is going to be written into it. Essentially, _reg and _next are what you would see when probing the register’s output and input ports, respectively, at any time.

Following the basic principles from Pong P. Chu, every register is associated with a _reg and a _next signal. Some advice:

1. Have a _reg and _next version for each register as declared signals in VHDL code.
2. In each state, read all needed _reg signals and assign all needed _next ones.
3. Do not reassign the same _next version of a register within a single FSMD state.
4. You can totally avoid variables in your code. Not all tools provide equally mature support for synthesizing code with variables.
5. Operation chaining is possible but requires that you write _next versions and read them in the same state. These are then plain wires and do not implement registers. Again, you cannot assign the same _next version more than once in the same state.
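The discipline above can be prototyped in plain C before any VHDL is written: give each register a pair of variables, read only the _reg copies, write only the _next copies, and let the bottom of one loop iteration play the clock edge. This sketch (mine, not HercuLeS output) mimics the gmp_fact FSMD with native integers:

```c
#include <stdint.h>

/* Factorial FSMD prototyped in C with the _reg/_next discipline:
 * the decoder reads only *_reg and assigns only *_next; the "clock
 * edge" at the bottom of the loop latches every register at once. */
static uint64_t fsmd_fact(uint32_t n)
{
    enum { S_ENTRY, S_LOOP, S_EXIT } state_reg = S_ENTRY, state_next;
    uint64_t prod_reg = 0, prod_next;
    uint32_t i_reg = 0, i_next;

    for (;;) {
        /* next-state and output logic (the combinational part) */
        state_next = state_reg;
        prod_next  = prod_reg;
        i_next     = i_reg;
        switch (state_reg) {
        case S_ENTRY:                 /* mpz_set_si(&prod, 1); i = 2; */
            prod_next  = 1;
            i_next     = 2;
            state_next = S_LOOP;
            break;
        case S_LOOP:                  /* mpz_mul_ui(&prod, &prod, i); */
            if (i_reg <= n) {
                prod_next = prod_reg * i_reg;
                i_next    = i_reg + 1;
            } else {
                state_next = S_EXIT;
            }
            break;
        case S_EXIT:
            return prod_reg;
        }
        /* clock edge: all registers update simultaneously */
        state_reg = state_next;
        prod_reg  = prod_next;
        i_reg     = i_next;
    }
}
```

Note how each state reads only _reg values and assigns each _next value at most once, per the advice above.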

At some point I had developed a technique for automatically modifying VHDL FSMD code to add controlled operation chaining. It uses just a lexer; to read more about it, see section III.E of http://www.nkavvadias.com/publications/kavvadias_asap12_cr.pdf.

If you have a deeper curiosity on HercuLeS, you can read http://www.nkavvadias.com/publications/hercules-pci13.pdf; a journal paper has been accepted for publication and will soon be available. I had to say it!

# Top 12 HercuLeS HLS user feedback patterns

As you might already know, HercuLeS by Ajax Compilers, Inc. (http://www.ajaxcompilers.com ; tech page here: http://www.nkavvadias.com/hercules/) is a high-level synthesis (HLS) environment for the automatic translation of algorithms to hardware.

Since November 2013, the free version of HercuLeS has been available to download, install and use for Windows 7 (64-bit) or Linux (32-bit and 64-bit): http://www.nkavvadias.com/temp/index.php

This free/demo version has generated substantial user feedback. Dear users, wherever you are located (US, Canada, Japan, P.R. China, Sweden, Germany, UK, India, Brazil, the list is not exhaustive by any means) thank you-thank you-thank you!!!

I have been compiling a short list of (let’s say) the “Top 12” of user feedback patterns. Focusing on the more generic points, I am disclosing the list as food-for-thought and to generate even more feedback!

# Top-12 user feedback patterns and concerns (not in any particular order)

1. Development time minimization (algorithm, early verification, RTL generation, simulation, implementation, late verification).
3. Readability of the generated HDL code. (HercuLeS code is much more readable than code generated by the competition.)
4. Source to IR to RTL to netlist cross-referencing (by means of “intelligent” cross-tagging). (It looks like an intelligent IDE is a whole new project.)
5. Theoretically provable correct-by-design approach. (I have been looking into automatic proving of code transformations.)
6. Transparent interface to the logic synthesis tool and up to downloading the bitstream. (The prototype version works and has been checked for my development boards.)
7. Plug-in approach to interconnecting legacy reusable designs with HLS-generated designs (“I have this piece of code in idiosyncratic C flavor or assembly or FORTRAN…”). Looks like a call for implementing point solutions for select (aka paying) users.
8. Optimize HDL descriptions for specific implementation processes (e.g. FPGA devices); users don’t actually seek portability!!!
9. Pthread or OpenMP frontend support. Explicit parallelism if you please! (My bet is with parallelism extraction but I get this important point)
10. VHDL frontend support (!) for behavioral VHDL to netlist HDL end-to-end flow. (There exist hard-to-the-core developers — and greatness is with them — that do their algorithmic exploration in behavioral VHDL. I usually do my own in C or VHDL and only lately I have been increasingly using MATLAB and Processing.)
11. Can you synthesize malloc and free? (Yes I can, and I am improving it, since until now I had been *intercepting* malloc and free and mapping them to a hardware-managed stack.)
12. Can you show me the automatic IP integration feature? (Yes I can, check this blog post as well: http://nkavvadias.com/blog/2014/10/13/hercules-overview/)

NOTE: quoted text is not a factual reply of any individual but rather an artistic rendition thereof.

# 10x productivity boost with HLS: Myth, legend or fact?

There exists a saying about pretty much every new disruptive technology and so it goes for high-level synthesis. Maybe you have heard it before reading these lines; as an ingredient to a pitch, on colleagues’ coffee break, at a press conference or maybe you have found yourself thinking more or less of this: “When I become proficient with this XYZ HLS tool, I will be 10x more productive! I will be able to fast-forward this task, put a smile on my supervisor’s face; earn a good deal of quality time too!”

Well, this reads as a dream come true, but is it reasonable to believe in it? Is it already possible, or will it become possible? Being an order of magnitude more productive: is it myth, legend or fact?

HLS attempts primarily to eliminate the tedious iterations at the RTL. This is done by having the developer describe the intent or behavior of the system at an increased level of abstraction. There is no need to reason about each FSM state and exact timing, or to take every stepwise, error-prone decision. Ideally, by giving the algorithm and a set of requirements to the HLS tool, you can effectively get a correct design with the desired QoR. Or get a no-go.

To add some data into the soup, let’s examine the following figure.

For these ten algorithms, ANSI C code is passed to HercuLeS HLS (http://www.nkavvadias.com/hercules/). At the end of the flow, among other things, we get the synthesizable RTL VHDL for the given C code. Measuring the VHDL-to-C ratio of useful lines of code (no comments etc.), a 7.8x to 25.15x variation with an average of about 15 times (15.4x) is obtained.

This is by no means conclusive but it is indicative. HercuLeS takes your code through a series of transformations; from C to N-Address Code (intermediate language) to graph-based representation (in Graphviz) and finally to RTL VHDL. Every step is from a higher abstraction level to a lower one and every step results in a more detailed representation as it moves from software to hardware description.

This approach sounds like the obviously correct one. But bear in mind the potential fallacy: 10x less code to write does not necessarily mean 10x less time to develop it. What must also be accounted for are:

1. Iterations in developing the C code itself.
2. Multiplicity in the number of generated RTL designs, as points in the architectural design space. These iterations are sloooow, so you have to keep them a) few, b) short. For this reason, it is much better that they should not happen at the RTL, but rather at the level of an IL (intermediate language) estimator/profiler like aprof: http://www.nkavvadias.com/doc/aprof/aprof-README.html .
3. RTL design is only a fraction of concept-to-implementation. As cliche as it may read, this is Amdahl alright.
4. Coding for parallelism, whether implicit or explicit, is challenging in the sense of near-optimal scaling for regular or irregular applications and adequacy of the underlying hardware paradigm.

And the verdict: tenfold increase in productivity is a FACT, and you should be able to experience it, if you do it the RIGHT way!

# Vivado HLS vs HercuLeS (Kintex-7 and VDS 2013.2 update)

As a followup to a previous blog post on an out-of-the-box Vivado HLS vs HercuLeS comparison, the following table provides updated information on the performance of HercuLeS against Vivado HLS 2013.2 on Virtex-6 and Kintex-7 (XC7K70TFBG676-2) FPGA devices.

Better results (lower execution time; smaller area) have been typeset in bold. It can be clearly seen that HercuLeS outperforms Vivado HLS in key benchmarks such as filtering and numerical processing. As expected, on many occasions better speed/performance can be traded off for lower area. With 12 partial wins each, one could call this a tie.

| # | Benchmark | Description | VHLS LUTs | VHLS Regs | VHLS TET (ns) | HercuLeS LUTs | HercuLeS Regs | HercuLeS TET (ns) | Device |
|---|-----------|-------------|----------:|----------:|--------------:|--------------:|--------------:|------------------:|--------|
| 1 | bitrev | Bit reversal | 67 | **39** | 72.0 | **42** | 40 | **11.6** | Virtex-6 |
| 2 | divider | Radix-2 division | **218** | **226** | 63.6 | 318 | 332 | **30.6** | Kintex-7 |
| 3 | edgedet | Edge detection | **246** | **130** | 1636.3 | 680 | 361 | **1606.4** | Virtex-6; 1 BRAM for VHLS |
| 4 | fibo | Fibonacci series | 138 | **131** | **60.2** | **137** | 197 | 102.7 | Virtex-6 |
| 5 | fir | FIR filter | **89** | **114** | 1027.1 | 606 | 540 | **393.8** | Kintex-7 |
| 6 | gcd | Greatest common divisor | 210 | 98 | **35.2** | **128** | **93** | 75.9 | Virtex-6 |
| 7 | icbrt | Cubic root approximation | **239** | 207 | **260.6** | 365 | **201** | 400.5 | Virtex-6 |
| 8 | sieve | Prime sieve of Eratosthenes | **525** | 595 | 6108.4 | 565 | **523** | **3869.5** | Virtex-6; 1 BRAM for VHLS |

NOTES:

• TET is Total Execution Time in ns.
• VHLS is a shortened form for Vivado HLS.
• Vivado HLS 2013.2 was used.
• Bold denotes smaller area and lower execution time.
• Italic denotes an inconclusive comparison.
• For the cases of edgedet and sieve, VHLS identifies a BRAM; HercuLeS does not. In these cases, HercuLeS saves a BRAM while VHLS saves on LUTs and FFs (Registers).

# HercuLeS: Overview and features

HercuLeS is an extensible high-level synthesis environment for automatically mapping algorithms to hardware. Essentially, HercuLeS translates programs in a typed-assembly language named N-Address Code (NAC) to a collection of Graphviz CDFGs (Control-Data Flow Graphs) which are then synthesized to vendor-independent self-contained RTL VHDL. HercuLeS is also used for push-button synthesis of ANSI C code to VHDL.

# Overview

The basic steps in the HercuLeS flow are shown in the figure below. C code is passed to GCC for GIMPLE dump generation, optionally after an external source-level optimizer. Textual GIMPLE is then processed by gimple2nac; alternatively, the user may directly supply a NAC translation unit or use their own frontend. Alternative frontends, currently work in progress, are being developed for LLVM (C, C++, Objective-C) and domain-specific languages (DSLs).

HercuLeS overview.

Various optimizations can be applied at the NAC level: peephole transformations, if-conversion, and function call insertion to enable IP integration. Heuristic basic block partitioning avoids the introduction of excessive critical paths due to operation chaining. The core of HercuLeS comprises a frontend (nac2cdfg) and a graph-based backend (cdfg2hdl). nac2cdfg is a translator from NAC to flat CDFGs represented in Graphviz; cdfg2hdl is the actual synthesis kernel, generating FSMD hardware in VHDL and self-checking testbenches from Graphviz CDFGs.

nac2cdfg is used for parsing, analysis and CDFG extraction from NAC programs. cdfg2hdl maps CDFGs to extended FSMDs (Finite-State Machines with Datapath). An ANSI C backend allows for rapid algorithm prototyping and NAC verification. VHDL code can be simulated with GHDL and Modelsim and synthesized with Xilinx XST and the Vivado Design Suite using automatically generated scripts.

# Features

HercuLeS supports a wealth of features. The following graphic provides a visualization of the most important existing or planned features.

HercuLeS features.

# Third-party IP integration

Third-party hardware IP blocks can be seamlessly integrated into HercuLeS. HercuLeS allows for automatic IP integration, given that the user supplies the builtin functionalities. A HercuLeS flow user is able to import and use their own IP through the following process:

1. Implement IP with expected interface and place in proper subdirectory.
2. Add corresponding entry in a textual database.
3. Use TXL transformations for replacing an operator use by a black-box function call via a script.
4. A list of black box functions is generated.
5. HercuLeS automatically creates a hierarchical FSMD with the requested callee(s).
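To make step 3 concrete, here is a hypothetical before/after of the rewrite in C. The builtin name udiv_ip and its signature are made up for this illustration (the real names come from the IP database), and the C body is only a stub standing in for a divider IP:

```c
#include <stdint.h>

/* Hypothetical black-box builtin: in the actual flow, this name would come
 * from the IP database and be bound to a third-party divider IP; the C body
 * is only a stub so that the rewritten code still runs. */
static uint32_t udiv_ip(uint32_t a, uint32_t b)
{
    return a / b;
}

/* Before the TXL rewrite:  uint32_t c = a / b;
 * After the rewrite, the operator use becomes a black-box call: */
static uint32_t quotient(uint32_t a, uint32_t b)
{
    uint32_t c = udiv_ip(a, b);
    return c;
}
```

cdfg2hdl would then instantiate the corresponding IP component wherever the black-box callee appears, per step 5.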

The following graphic illustrates the combined TXL/C approach. The first two steps apply preprocessing for splitting local variable declarations and removing those that are redundant or unused. The remaining variables are then localized, and procedure calls to black-box functions are introduced; these routines are the actual builtin functions. If the corresponding builtins are listed in the IP database, an interface-compatible VHDL implementation is assumed for the HercuLeS caller FSMDs. cdfg2hdl then automatically handles interface generation and component instantiation in the caller FSMD's HDL description. In addition, the simulation and synthesis scripts already account for the IP HDL files.

Automatic IP integration in HercuLeS.

This approach is also valid for floating-point computation, while both pipelined and multi-cycle third-party components are supported.

# Summary

HercuLeS delivers a contemporary HLS environment that can be comfortably used for algorithm acceleration by predominantly software-oriented engineers. In this article, we covered only a high-level overview of how HercuLeS works, walked through existing and planned features and briefly discussed automatic hardware IP integration.

# 3 predictions on the role of HLS for the 2020 Digital Platform

The latest ITRS tables (for 2011-2012 since the 2013 tables are not yet complete: http://www.itrs.net/reports.html) highlight three points that are important to anyone developing or using HLS/ESL solutions.

• The responsibility for power reduction efforts will be shifted to higher abstractions.

The following figure illustrates this expected trend: behavioral and architectural level requirements will rise from a 50% to an 80% contribution to overall system power reduction.

Power reduction effort for the 2020 Digital Platform.

• High-level synthesis estimates regarding speed, area, power and cost will need to be extremely accurate in the near future.

The quality of estimations produced during the phase of architectural design space exploration is projected to rise to approximately 100% within less than 10 years! The tradeoff of better-vs-faster estimation will be even more evident.

Expected high-level estimates accuracy for the 2020 Digital Platform.

• The 2020 Digital Platform will be fully supported by development toolchains.

This means that both programmable and configurable aspects of the platform (such as hardware accelerators) will be accessible via convenient layers of programmability. The current shift towards parallelism-aware languages including C/OpenMP, OpenCL, CUDA, AMP++, and OpenACC is clearly visible and a vibrant reality among programmers. The following figure indicates that within less than 10 years ALL computational platforms, from the HPC realm to the autonomous, omnipresent embedded systems, will require full support by accessible tools.

Expected levels of tool support for the 2020 Digital Platform.

# Multiplierless hack for constant division by certain factors

I would like to present you with a formula for quotient calculation that works for any constant divisor of the form 2^n-1 (^ denotes exponentiation). For n=2,3,4,5,6,7, these divisors are 3,7,15,31,63,127. I have tested all divisors with n up to 16 and the approach works. My formula works only for dividends x that are in the range [0,(2^(2*n))-2].

The formula is as follows and it is both divisionless and multiplierless (as initially derived):

y = (((x>>n)+x+(1<<n)+1)>>n)-1;

which can also be written as:

y = (x + (x>>n) + 1) >> n;

This is a formula for software integer division, i.e. truncating division that calculates the quotient. Please note that 1<<n is the strength-reduced form of 2^n. It does not require any double-word arithmetic (unlike multiplying by the multiplicative inverse and then performing adjustment steps).

E.g. for n = 6, the divisor is 2^n - 1 = 63. The formula will work for x <= 4094 = (2^(2*n)) - 2 = (1<<n)*(1<<n) - 2. It will produce one-off errors for certain x > (2^(2*n)) - 2, but it is correct for the specified “small” x values.
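The two forms given earlier are equivalent: with s = x + (x>>n), the added 2^n contributes exactly +1 to the shifted result, which the trailing -1 cancels, leaving (s+1)>>n. A quick exhaustive check of both forms against exact division for n = 6, over the whole valid range:

```c
/* Both forms of the divide-by-(2^n - 1) formula, checked against exact
 * division for n = 6 (divisor 63) over the full valid range [0, 4094]. */
static int check_n6(void)
{
    const int n = 6;
    const int d = (1 << n) - 1;
    for (int x = 0; x <= (1 << (2 * n)) - 2; x++) {
        int y1 = (((x >> n) + x + (1 << n) + 1) >> n) - 1;  /* original form */
        int y2 = (x + (x >> n) + 1) >> n;                   /* compact form  */
        if (y1 != y2 || y1 != x / d)
            return 0;
    }
    return 1;
}
```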

A testcase for the algorithm is here:

```c
#include <stdio.h>

int main(void)
{
    int qapprox, qexact;
    int i, j, k;

    for (i = 2; i < 8; i++) {
        for (j = 0; j < (1<<i)*(1<<i)-1; j++) {
            k = (1<<i) - 1;
            qapprox = (((j>>i)+j+(1<<i)+1)>>i)-1;
            qexact  = j / k;
            if (qapprox != qexact) {
                fprintf(stderr, "qapprox = (%d/%d) = %d\tqexact = (%d/%d) = %d\n",
                        j, k, qapprox, j, k, qexact);
            }
        }
    }
    return 0;
}
```

I devised the hack via “algorithmic sketching”: I started from a basic formula with some “holes” in it, where the holes were small constants or operators. Such techniques exist in academia under nicer names, such as algorithmic sketching or combinatorial sketching, and they are relevant to forms of program synthesis. Essentially, it is about devising algorithmic sketches (incomplete algorithms with wildcards) and then using a mechanism to fill the gaps.
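As a toy reconstruction of the idea (not the original search I used): fix the shape of the formula, leave two constant “holes”, and brute-force small values until the formula divides exactly by 2^n - 1 for n = 2 over the valid range:

```c
/* Sketch with holes: y = (((x>>n) + x + C1) >> n) + C2.
 * Brute-force the holes for n = 2 (divisor 3) over the valid range
 * [0, 2^(2n) - 2]; C1 is kept non-negative to avoid shifting negatives. */
static int fill_holes(int *c1, int *c2)
{
    const int n = 2;
    const int d = (1 << n) - 1;
    const int xmax = (1 << (2 * n)) - 2;
    for (int a = 0; a <= 4; a++) {
        for (int b = -4; b <= 4; b++) {
            int ok = 1;
            for (int x = 0; x <= xmax && ok; x++)
                ok = ((((x >> n) + x + a) >> n) + b) == x / d;
            if (ok) { *c1 = a; *c2 = b; return 1; }
        }
    }
    return 0;
}
```

The search settles on C1 = 1, C2 = 0, i.e. exactly the compact form y = ((x>>n) + x + 1) >> n given earlier.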

It seems I can claim the specific hack as far back as May 16, 2011.

Special thanks to everyone who has participated in the relevant discussion threads. The Taylor series connection, as mentioned by Tim Wescott and Tauno Voipio, is very interesting.

# elemapprox — The Rosetta stone of elementary functions approximation and plotting

Going back to my HDL development/design projects, I’ve been having fun working with elemapprox, a multi-language collection of modules and packages to assist in simulating floating-point hardware. It is kind of a Rosetta stone for elementary functions approximation, adding basic plotting facilities: ASCII (low-resolution) and PBM monochrome bitmaps (higher resolution). Available ports include ANSI C, Verilog, VHDL and “VHDLIEEE” (reusing the existing approximations in the IEEE.math_real package). The data type used for the computations is Verilog’s and VHDL’s real.

This code has been tested with Icarus Verilog, GHDL and Modelsim (VHDL only). The Verilog driver module (testfunc.v) makes advanced use of Verilog system functions and tasks. By using this strong feature of Verilog, I was able to closely imitate the operation of the driver code (testfunc.c) from the ANSI C version. Development of the test driver for the VHDL version was not that straightforward and had to bypass some VHDL quirks; for instance the handling of variable-length strings.

My motivation was to extend the original work on evaluating (in single precision) and plotting transcendental functions as discussed in Prof. Mark G. Arnold’s HDLCON 2001 paper.

At this point my version adds support for all trigonometric (cyclic), inverse trigonometric, hyperbolic and inverse hyperbolic functions as well as a few others: exp, log (ln), log2, log10, pow. I will be adding more functions in the future, for instance hypot, cbrt (cubic root) as well as other special functions that are of interest.
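To give the flavor of what such a port contains, here is an arctangent approximation in ANSI C in the same spirit: a Taylor series on |x| <= 1, plus the standard identity atan(x) = ±pi/2 - atan(1/x) for |x| > 1. This is my own sketch with an arbitrary term count, not elemapprox’s actual code:

```c
#define MY_PI 3.14159265358979323846
#define TERMS 60  /* arbitrary cutoff; convergence slows as |x| approaches 1 */

/* atan(x) = x - x^3/3 + x^5/5 - ... for |x| <= 1;
 * atan(x) = +/-pi/2 - atan(1/x) for |x| > 1. */
static double my_atan(double x)
{
    if (x > 1.0)  return  MY_PI / 2.0 - my_atan(1.0 / x);
    if (x < -1.0) return -MY_PI / 2.0 - my_atan(1.0 / x);
    double term = x, sum = x;
    for (int k = 1; k < TERMS; k++) {
        term *= -x * x;          /* next odd power, alternating sign */
        sum  += term / (2 * k + 1);
    }
    return sum;
}
```

Near |x| = 1 the alternating series converges slowly (the error is bounded by the first omitted term, about 1/(2*TERMS + 1)), which is one reason real implementations prefer range reduction or minimax polynomials.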

Using any version of elemapprox (ANSI C, Verilog or either VHDL port), you can easily plot the arctangent in ASCII:

...........................................................*********************
..................................................**********....................
...............................................****.............................
.............................................***................................
............................................**..................................
...........................................*....................................
..........................................*.....................................
.........................................**.....................................
........................................**......................................
........................................*.......................................
.......................................**.......................................
......................................**........................................
......................................*.........................................
.....................................*..........................................
...................................**...........................................
.................................***............................................
..............................****..............................................
.....................**********.................................................
**********************..........................................................
................................................................................

or as a PBM bitmap file (for much higher resolution):

Or you can plot your typical sine:

You can always configure elemapprox to suit your needs: write your own function approximations, plotting routines, etc.

In the end: use the source, Luke! Besides that, there is also some documentation (thankfully) to get you started; you can also browse the README directly at the project’s GitHub repo or just drop a note here.

I really hope that the community will find this work useful!

# METATOR – A look into processor synthesis

These last few months, I have been slowly moving back to my main interests: EDA tools (as a developer and as a user), FPGA application engineering and, last but not least, processor design. After a 5-year hiatus I have started revamping (and modernizing) my own environment, developed as an outcome of my PhD work on application-specific instruction-set processors (ASIPs). The flow was based on SUIF/Machine-SUIF (compiler), SALTO (assembly-level transformations) and ArchC (an architecture description language for producing binary tools and simulators). It was a highly successful flow that allowed me (along with my custom instruction generator, YARDstick) to explore processor configurations and extensions within seconds or minutes.

I have been thinking about what’s next. We have tools that assist the designer (the processor design engineer per se) in speeding up development. Still, the processor must be designed explicitly. What would go beyond the state of the art is not having to design the golden model of the processor at all.

What I am proposing is an application-specific processor synthesis tool that goes beyond the state-of-the-art. A model generator for producing the high-level description of the processor, based only on application analysis and user-defined constraints. And for the fun of it, let’s codename it METATOR, because I tend to watch too much Supernatural these days, and METATOR (messenger) is a possible meaning for METATRON, an angelic being from the Apocrypha with a human past. So think of METATOR as an upgrade (spiritual or not) to the current status of both academic and commercial ASIP design tools.

# The Context, the Problem and its Solution

ASIPs are tuned for cost-effective execution of targeted application sets. An ASIP design flow involves profiling, architecture exploration, generation and selection of functionalities, and synthesis of the corresponding hardware, while enabling the user to take certain decisions.

The state-of-the-art in ASIP synthesis includes commercial efforts from Synopsys which has accumulated three relevant portfolios: the ARC configurable processor cores, Processor Designer (previously LISATek) and the IP Designer nML-based tools (previously Target Compiler Technologies); ASIPmeister by ASIP Solutions (site down?), Lissom/CodAL by Codasip, and the academic TCE and NISC toolsets. Apologies if I have missed any other ASIP technology provider!

The key differentiation point of METATOR against existing approaches is that ASIP synthesis should not require the explicit definition of a processor model by a human developer. The solution implies developing a novel scheme for extracting, by means of graph similarity extraction, a common-denominator architectural model from a given set of user applications (accounting for high-level constraints and requirements) that are intended to be executed on the generated processor. From this automatically generated model, an RTL description, verification IP and a programming toolchain would be produced as part of an automated targeting process, true to the “meta-” prefix: a generated model generating models!

# Conceptual ASIP Synthesis Flow

METATOR would accept as input the so-called algorithmic soup (narrow set of applications) and generate the ADL (Architecture Description Language) description of the processor. My first aim would be for ArchC but this could also expand to the dominant ADLs, LISA 2.0 and nML.

METATOR would rely upon HercuLeS high-level synthesis technology and the YARDstick profiling and custom instruction generation environment. In the past, YARDstick has been used for generating custom instructions (CIs) for ByoRISC (Build Your Own RISC) soft-core processors. ByoRISC is a configurable in-order RISC design, allowing the execution of multiple-input, multiple-output custom instructions and achieving higher performance than typical VLIW architectures. CIs for ByoRISC were generated by YARDstick, whose purpose is to perform application analysis on targeted codes, identify application hotspots, extract custom instructions and evaluate their potential impact on code performance for ByoRISC.

# Conclusion

To sum up, METATOR is a thought experiment in ASIP synthesis technology. It would automatically generate a full-fledged processor and toolchain merely from its usage intent, expressed as indicative targeted application sets.