From GMP programs to hardware

Some time ago I did a small experiment to automatically generate RTL from GMP (GNU multi-precision integer) programs (as in gmplib.org). These programs use API calls to perform arbitrary-precision integer arithmetic for use in cryptography, conducting/supporting mathematical proofs and for purposes of scientific computing in general.

I have developed a concept implementation that is part of HercuLeS HLS (http://www.nkavvadias.com/hercules/). Essentially it is a set of additional features to HercuLeS. I actually used the fgmp (public domain) implementation and API, due to GMP’s licensing. The extensions (coded within a few days or so) did not yet generate synthesizable code, but I was able to generate simulation HDL (VHDL) with gmp API calls as proof-of-concept.

In order to support for the basic GMP (multi-precision integer) API/DSL for just addition and multiplication, the following additions were needed:

• new VHDL package, mpint.vhd: MP integer and SLV (standard logic vector) operators: 500 lines of code (LOC)
• C to N-Address Code frontend extensions: 50 LOC
• Backend code generation in HercuLes: 150 LOC (in addition to what is already there)

This proof-of-concept frontend for GNU multi-precision integer programs has been implemented along with a VHDL IP library for key library functionalities such as mpz_set, mpz_add and mpz_mul.

A simple example

The following code implements factorial with “`fgmp“`, which is under the public domain:

``````
MP_INT gmp_fact(unsigned int n)
{
MP_INT prod;
int i;
mpz_init(&prod);
mpz_set_si(&prod, 1);
for (i = 2; i <= n; i++) {
mpz_mul_ui(&prod, &prod, i);
}
return (prod);
}
``````

This is the unoptimized FSMD in VHDL:

``````
library IEEE;
use WORK.operpack.all;
use WORK.gmp_fact_cdt_pkg.all;
use STD.textio.all;
use WORK.std_logic_textio.all;
use WORK.mpint.all;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity gmp_fact is
port (
clk : in std_logic;
reset : in std_logic;
start : in std_logic;
n : in std_logic_vector(31 downto 0);
D_2747 : out std_logic_vector(31 downto 0);
done : out std_logic;
);
end gmp_fact;

architecture fsmd of gmp_fact is
type state_type is (S_ENTRY, S_EXIT, S_001_001, S_001_002, S_001_003, S_001_004, S_001_005, S_002_001, S_002_002, S_002_003, S_002_004, S_002_005, S_003_001, S_003_002, S_004_001, S_004_002);
signal current_state, next_state: state_type;
signal i_21_next : std_logic_vector(31 downto 0);
signal i_21_reg : std_logic_vector(31 downto 0);
signal m_11_next : std_logic_vector(31 downto 0);
signal m_11_reg : std_logic_vector(31 downto 0);
signal i_11_next : std_logic_vector(31 downto 0);
signal i_11_reg : std_logic_vector(31 downto 0);
signal prod_next : mp_int;
signal prod_reg : mp_int;
signal i_next : std_logic_vector(31 downto 0);
signal i_reg : std_logic_vector(31 downto 0);
signal prod_11_next : mp_int;
signal prod_11_reg : mp_int;
signal prod_21_next : mp_int;
signal prod_21_reg : mp_int;
signal prod_12_next : mp_int;
signal prod_12_reg : mp_int;
signal i_2_31_next : std_logic_vector(31 downto 0);
signal i_2_31_reg : std_logic_vector(31 downto 0);
signal i_1_21_next : std_logic_vector(31 downto 0);
signal i_1_21_reg : std_logic_vector(31 downto 0);
signal D_2747_next : std_logic_vector(31 downto 0);
signal D_2747_reg : std_logic_vector(31 downto 0);
constant CNST_0 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000000";
constant CNST_1 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000001";
constant CNST_2 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000010";
begin
-- current state logic
process (clk, reset)
begin
if (reset = '1') then
current_state <= S_ENTRY;
i_21_reg <= (others => '0');
m_11_reg <= (others => '0');
i_11_reg <= (others => '0');
prod_reg.data <= (others => 0);
i_reg <= (others => '0');
prod_11_reg.data <= (others => 0);
prod_21_reg.data <= (others => 0);
prod_12_reg.data <= (others => 0);
i_2_31_reg <= (others => '0');
i_1_21_reg <= (others => '0');
D_2747_reg <= (others => '0');
elsif (clk = '1' and clk'EVENT) then
current_state <= next_state;
i_21_reg <= i_21_next;
m_11_reg <= m_11_next;
i_11_reg <= i_11_next;
prod_reg <= prod_next;
i_reg <= i_next;
prod_11_reg <= prod_11_next;
prod_21_reg <= prod_21_next;
prod_12_reg <= prod_12_next;
i_2_31_reg <= i_2_31_next;
i_1_21_reg <= i_1_21_next;
D_2747_reg <= D_2747_next;
end if;
end process;

-- next state and output logic
process (current_state, start,
n,
D_2747_reg,
i_21_reg, i_21_next,
m_11_reg, m_11_next,
i_11_reg, i_11_next,
prod_reg, prod_next,
i_reg, i_next,
prod_11_reg, prod_11_next,
prod_21_reg, prod_21_next,
prod_12_reg, prod_12_next,
i_2_31_reg, i_2_31_next,
i_1_21_reg, i_1_21_next
)
begin
done <= '0';
i_21_next <= i_21_reg;
m_11_next <= m_11_reg;
i_11_next <= i_11_reg;
prod_next <= prod_reg;
i_next <= i_reg;
prod_11_next <= prod_11_reg;
prod_21_next <= prod_21_reg;
prod_12_next <= prod_12_reg;
i_2_31_next <= i_2_31_reg;
i_1_21_next <= i_1_21_reg;
D_2747_next <= D_2747_reg;
case current_state is
when S_ENTRY =>
if (start = '1') then
next_state <= S_001_001;
else
next_state <= S_ENTRY;
end if;
when S_001_001 =>
m_11_next <= CNST_1(31 downto 0);
next_state <= S_001_002;
when S_001_002 =>
prod_12_next <= mpz_set_ui(m_11_reg);
next_state <= S_001_003;
when S_001_003 =>
i_11_next <= CNST_2(31 downto 0);
prod_next <= prod_12_reg;
next_state <= S_001_004;
when S_001_004 =>
i_next <= i_11_reg(31 downto 0);
next_state <= S_001_005;
when S_001_005 =>
next_state <= S_003_001;
when S_002_001 =>
i_1_21_next(31 downto 0) <= i_reg;
next_state <= S_002_002;
when S_002_002 =>
prod_21_next <= mpz_mul_ui(prod_reg, i_1_21_reg);
next_state <= S_002_003;
when S_002_003 =>
mpz_printh(prod_21_reg);
next_state <= S_002_004;
when S_002_004 =>
i_21_next <= std_logic_vector(signed(i_reg) + signed(CNST_1(31 downto 0)));
next_state <= S_002_005;
when S_002_005 =>
prod_next <= prod_21_reg;
i_next <= i_21_reg(31 downto 0);
next_state <= S_003_001;
when S_003_001 =>
i_2_31_next(31 downto 0) <= i_reg;
next_state <= S_003_002;
when S_003_002 =>
if (i_2_31_reg <= n(31 downto 0)) then
next_state <= S_002_001;
else
next_state <= S_004_001; end if;
when S_004_001 =>
D_2747_next <= CNST_0(31 downto 0);
next_state <= S_004_002;
when S_004_002 =>
next_state <= S_EXIT;
when S_EXIT =>
done <= '1';
next_state <= S_ENTRY;
when others =>
next_state <= S_ENTRY;
end case;
end process;

D_2747 <= D_2747_reg;

end fsmd;
``````

Supporting testimonial to the ArchC infrastructure (2004)

[I was an early PhD student at the time and was quite impressed with ArchC had to offer; I still am]

ArchC is an architecture description language that exploits the already mature SystemC legacy. It targets the automated generation of a) functional and cycle-accurate instruction set simulators and at a later stage of b) tools from the entire application development toolchain for a custom processor (compiler, assembler, linker, debugger, etc).

Simulator generation is at a very good and usable state, however needs to be extended towards the system-level (bus generation ?). Toolset generation which naturally comes next, requires more hard effort and therefore the appropriate support of the open-source community. In my personal opinion, compiler generation is so hard that it can only be possible through funding. Other issues as automated bus generation are also very HARD and need support.

Strengths of ArchC

Important points regarding ArchC are:

• Nice integration with SystemC. The ArchC user codes instruction decoding, hardware resources, etc in small ArchC files and the corresponding SystemC are automatically generated, which removes heavy burden from the user.
• Unique approach to the design of a retargetable system call library. This makes possible to benchmark real-sized applications with realistic demands, i.e. many input arguments. Many of the other simulators can’t cope with that.
• Ease of introducing resource elements, e.g. multi-banked memories.
• Reuse possibilities for existing SystemC codes.
• Models of 3 processors, 2 of them with complete retargeted GNU toolchains! Plus binaries of very popular benchmarks (MediaBench, [MediaBench-II is here], MiBench).

How has ArchC helped me in my research

• Ease of designing variations of a base processor. Most of my papers require tradeoff analysis on alternative architectures.
• My “drproc” model is a 5-pipeline stage RISC processor with specialized instructions for data-reuse transformations [work of Catthoor et al, IMEC]. I originally had coded an assembler (~few days) and the entire processor in plain SystemC (~month). When I found out about ArchC, it required only 3-4 8-hour days to deliver to my colleagues a much more robust cycle-accurate simulator!
• I have ran reasonably sized applications (e.g. one sourceforge application, of 25K C lines with intensive use of pointers) with no problem on the R3000 and MIPS models of ArchC.

What needs to be done

NOTE: In the most part, the roadmap has been completed!

This work will be incomplete unless the “ArchC roadmap” is fulfilled. While I believe the compiler retargeting is a very hard issue, [addressing UNICAMP team] I am certain that your team is experienced in DSP compilation techniques (I have read some of the papers). Certainly, assembler generation, and bus wrapper generation (e.g. for OCP) plus some more models are awaited by many users, including myself.

Thank you www.archc.org for all the (free) fish!

Multiplierless hack for constant division by certain factors

I would like to present you with a formula for quotient calculation that works for any constant divisor of the form 2^n-1 (^ denotes exponentiation). For n=2,3,4,5,6,7, these divisors are 3,7,15,31,63,127. I have tested all divisors with n up to 16 and the approach works. My formula works only for dividends x that are in the range [0,(2^(2*n))-2].

The formula is as follows and it is both divisionless and multiplierless (as initially derived):

``````y = (((x>>n)+x+(1<<n)+1)>>n)-1;
``````

which can also be written as:

``y = (x + (x>>n) + 1) >> n;``

This is a formula for software integer division, i.e. truncating division that calculates the quotient of the division. Please note that 1<<n is the strength-reduced form for calculating 2^n. It does not require any double-word arithmetic (like with multiplying by the multiplicative inverse and then performing adjustment steps).

E.g. for n = 6, the divisor is 2^n – 1 = 63. The formula will work for x <= 4094 = (2^(2*n)) – 2 = (1<<n)*(1<<n)-2. It will produce one-off errors for certain x >= (2^(2*n))-2. But it will be correct for the specified “small” x values.

A testcase for the algorithm is here:

``````#include <stdio.h>

int main(void)
{
int qapprox, qexact;
int i, j, k;

for (i = 2; i < 8; i++) {
for (j = 0; j < (1<<i)*(1<<i)-1; j++) {
k = (1<<i) - 1;
qapprox = (((j>>i)+j+(1<<i)+1)>>i)-1;
qexact  = j / k;
if (qapprox != qexact) {
fprintf(stderr, "qapprox = (%d/%d) = %d\tqexact = (%d/%d) = %d\n",
j, k, qapprox, j, k, qexact);
}
}
}
return 0;
}
``````

I had devised the hack via “algorithmic sketching”. I mean that I had started from a basic formula with some “holes” in it. The holes where small constants or operators. There exist such techniques in academia, put more nicely; sometimes called algorithmic sketching, combinatorial
sketching, and it is relevant to forms of program synthesis. Essentially it is about devising algorithmic sketches (incomplete algorithms with wildcards) and then using a mechanism to fill the gaps.

Seems I can claim the specific hack as back as May 16, 2011.

Special thanks to everyone who has participated in the following threads. The Taylor series connection is very interesting, as mentioned by Tim Wescott and Tauno Voipio.