elemapprox — The Rosetta stone of elementary functions approximation and plotting

Going back to my HDL development/design projects, I’ve been having fun working with elemapprox, a multi-language collection of modules and packages to assist in simulating floating-point hardware. It is a kind of Rosetta stone for elementary function approximation, with basic plotting facilities added: ASCII (low-resolution) and PBM (monochrome, higher-resolution) bitmaps. Available ports include ANSI C, Verilog, VHDL and “VHDLIEEE” (which uses the existing approximations in the IEEE.math_real package). The data type used for the computations is Verilog’s and VHDL’s real.

This code has been tested with Icarus Verilog, GHDL and Modelsim (VHDL only). The Verilog driver module (testfunc.v) makes advanced use of Verilog system functions and tasks. By using this strong feature of Verilog, I was able to closely imitate the operation of the driver code (testfunc.c) from the ANSI C version. Development of the test driver for the VHDL version was not that straightforward, and I had to work around some VHDL quirks, for instance the handling of variable-length strings.

The complete code is here: http://github.com/nkkav/elemapprox and is licensed under the Modified BSD license.

My motivation was to extend the original work on evaluating (single-precision) and plotting transcendental functions as discussed in Prof. Mark G. Arnold’s HDLCON 2001 paper.

At this point my version adds support for all trigonometric (cyclic), inverse trigonometric, hyperbolic and inverse hyperbolic functions as well as a few others: exp, log (ln), log2, log10, pow. I will be adding more functions in the future, for instance hypot, cbrt (cubic root) as well as other special functions that are of interest.

Using any of the versions of elemapprox (whether ANSI C, Verilog or the VHDL ones), you can easily plot the arctangent as ASCII:


or as a PBM bitmap file (for much higher resolution):


Or you can plot your typical sine:


You can always configure elemapprox to suit your needs: write your own function approximations, plotting routines, etc.
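To give a feel for what a user-written plotting routine might look like, here is a minimal sketch in C. It is not taken from elemapprox: the names plot_ascii, sin_taylor5, PLOT_COLS and PLOT_ROWS are made up for illustration, and the fifth-order Taylor polynomial of the sine merely stands in for whichever approximation you want to inspect.

```c
#include <assert.h>
#include <string.h>

#define PLOT_COLS 61   /* hypothetical plot width, in characters  */
#define PLOT_ROWS 21   /* hypothetical plot height, in characters */

/* Stand-in approximation: x - x^3/6 + x^5/120 (Taylor polynomial of sine). */
double sin_taylor5(double x)
{
  double x2 = x * x;
  return x * (1.0 - (x2 / 6.0) * (1.0 - x2 / 20.0));
}

/* Sample f at PLOT_COLS points in [xmin, xmax] and mark one '*' per column.
 * Row 0 corresponds to ymax, the last row to ymin. Samples that fall
 * outside the grid (or are NaN) are skipped. */
void plot_ascii(double (*f)(double), double xmin, double xmax,
                double ymin, double ymax,
                char grid[PLOT_ROWS][PLOT_COLS + 1])
{
  int r, c;

  assert(xmax > xmin && ymax > ymin);
  for (r = 0; r < PLOT_ROWS; r++) {
    memset(grid[r], ' ', PLOT_COLS);
    grid[r][PLOT_COLS] = '\0';
  }
  for (c = 0; c < PLOT_COLS; c++) {
    double x = xmin + (xmax - xmin) * c / (PLOT_COLS - 1);
    double y = f(x);
    /* Scale y to a fractional row position; +0.5 rounds to nearest row. */
    double pos = (ymax - y) / (ymax - ymin) * (PLOT_ROWS - 1) + 0.5;
    if (pos >= 0.0 && pos < (double)PLOT_ROWS) {
      grid[(int)pos][c] = '*';
    }
  }
}
```

Calling plot_ascii(sin_taylor5, -1.5, 1.5, -1.0, 1.0, grid) and then printing each row with puts(grid[r]) gives the ASCII picture, with the zero crossing in the middle of the plot.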

In the end: use the source, Luke! Besides that, there is also some documentation (thankfully) to get you started; you can browse the README directly at the project’s GitHub repo or just drop a note here.

I really hope that the community will find this work useful!

METATOR – A look into processor synthesis

These last few months, I have been slowly moving back to my main interests, EDA tools (as a developer and as a user), FPGA application engineering, and last but not least processor design. After a 5-year hiatus I have started revamping (and modernizing) my own environment, developed as an outcome of my PhD work on application-specific instruction-set processors (ASIPs). The flow was based on SUIF/Machine-SUIF (compiler), SALTO (assembly-level transformations) and ArchC (architecture description language for producing binary tools and simulators). It was a highly-successful flow that allowed me (along with my custom instruction generator YARDstick) to explore configurations and extensions of processors within seconds or minutes.

I have been thinking about what’s next. We have tools to assist the designer (the processor design engineer per se) to speed up his/her development. Still, the processor must be designed explicitly. What would go beyond the state-of-the-art is not to have to design the golden model of the processor at all.

What I am proposing is an application-specific processor synthesis tool that goes beyond the state-of-the-art. A model generator for producing the high-level description of the processor, based only on application analysis and user-defined constraints. And for the fun of it, let’s codename it METATOR, because I tend to watch too much Supernatural these days, and METATOR (messenger) is a possible meaning for METATRON, an angelic being from the Apocrypha with a human past. So think of METATOR as an upgrade (spiritual or not) to the current status of both academic and commercial ASIP design tools.

The Context, the Problem and its Solution

ASIPs are tuned for cost-effective execution of targeted application sets. An ASIP design flow involves profiling, architecture exploration, generation and selection of functionalities, and synthesis of the corresponding hardware, while enabling the user to take certain decisions.

The state-of-the-art in ASIP synthesis includes commercial efforts from Synopsys which has accumulated three relevant portfolios: the ARC configurable processor cores, Processor Designer (previously LISATek) and the IP Designer nML-based tools (previously Target Compiler Technologies); ASIPmeister by ASIP Solutions (site down?), Lissom/CodAL by Codasip, and the academic TCE and NISC toolsets. Apologies if I have missed any other ASIP technology provider!

The key differentiation point of METATOR against existing approaches is that ASIP synthesis should not require the explicit definition of a processor model by a human developer. The solution implies the development of a novel scheme that uses graph similarity extraction to derive a common-denominator architectural model from a given set of user applications (accounting for high-level constraints and requirements) that are intended to be executed on the generated processor. From this automatically generated model, an RTL description, verification IP and a programming toolchain would be produced as part of an automated targeting process. Hence the “meta-”: a generated model generating models!


Conceptual ASIP Synthesis Flow

METATOR would accept as input the so-called algorithmic soup (narrow set of applications) and generate the ADL (Architecture Description Language) description of the processor. My first aim would be for ArchC but this could also expand to the dominant ADLs, LISA 2.0 and nML.

METATOR would rely upon HercuLeS high-level synthesis technology and the YARDstick profiling and custom instruction generation environment. In the past, YARDstick has been used for generating custom instructions (CIs) for ByoRISC (Build Your Own RISC) soft-core processors. ByoRISC is a configurable in-order RISC design, allowing the execution of multiple-input, multiple-output custom instructions and achieving higher performance than typical VLIW architectures. CIs for ByoRISC were generated by YARDstick, whose purpose is to perform application analysis on targeted codes, identify application hotspots, extract custom instructions and evaluate their potential impact on code performance for ByoRISC.


To sum this up, METATOR is a thought experiment in ASIP synthesis technology. It would automatically generate a full-fledged processor and toolchain merely from its usage intent, expressed as indicative targeted application sets.

Streamlining your FPGA synthesis process

These last few days, I had an appetite for experimenting with a number of FPGA projects, ranging from very simple logic to considerably more complex state machines. The idea was to port a number of existing designs to the Xilinx Spartan 3E and 3AN starter kit boards: a few of my own designs, either targeting other boards/devices or never yet tested at the board level, and designs from other people. Mike Field’s hamsterworks website is an excellent source of FPGA designs with varying degrees of complexity! Most of them are for Spartan 6/Artix 7, so I had to backport some of his ideas to my less contemporary devices :)

To use my time effectively while porting or backporting a number of designs (maybe 15 or 20), I had to streamline the synthesis process; everything had to be done from the command line. Borrowing some ideas from Evgeni Stavinov’s Using Xilinx Tools in Command-Line Mode, I was able to synthesize, generate the bitstream and finally program the FPGA device via my download cable, without any GUI interaction.

As our vehicle I will use perhaps the simplest possible design. Let’s call it bstest, as a stand-in for “buttons and switches tester”. It is a very simple design for testing the four push buttons, four slide switches and eight discrete LEDs available on the Spartan-3E Starter Kit board by Digilent.

The design

The design associates push button and slide switch actions to specific LEDs. The code (bstest.vhd) is really simple:

library IEEE;
use IEEE.std_logic_1164.all;

entity bstest is
  port ( 
    sldsw  : in  std_logic_vector(3 downto 0);
    button : in  std_logic_vector(3 downto 0); -- N-E-W-S
    led    : out std_logic_vector(7 downto 0)
  );
end bstest;

architecture dataflow of bstest is
begin
  led(7 downto 4) <= sldsw;
  led(3 downto 0) <= button;
--  led <= sldsw & button; -- we could also do it this way
end dataflow;

User constraints file

The UCF (User Constraints File) associates a “net” (input or output port of our top-level design) to a “loc” (location), referring to a specific FPGA pin. The specific FPGA device available on the Spartan-3E Starter Kit is the XC3S500E-FG320-4 and the UCF (bstest.ucf) in this case should be as follows:

NET "sldsw<3>"   LOC = "N17";
NET "sldsw<2>"   LOC = "H18";
NET "sldsw<1>"   LOC = "L14";
NET "sldsw<0>"   LOC = "L13";

NET "button<3>"  LOC = "V4";   # NORTH
NET "button<2>"  LOC = "H13";  # EAST
NET "button<1>"  LOC = "D18";  # WEST 
NET "button<0>"  LOC = "K17";  # SOUTH

NET "led<7>"     LOC = "F9"  | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<6>"     LOC = "E9"  | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<5>"     LOC = "D11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<4>"     LOC = "C11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<3>"     LOC = "F11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<2>"     LOC = "E11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<1>"     LOC = "E12" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<0>"     LOC = "F12" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;

Automation scripts

The top-level HDL design file and the UCF are all we would need if we intended to use the Xilinx ISE/XST GUI (I have version 14.6 installed). However, we can be much more productive if we resort to a few scripts for automating this part of the process.

First, a Makefile (xst.mk) based on Evgeni Stavinov’s article can be used for automating all the way to bitfile generation:

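# NOTE: several short lines (target names, tool variables, endif) were
# missing from this listing. Lines marked "(assumed)" are reconstructed
# from how the names are used further down and may differ from xst.mk.

all: $(PROJECT).bit

floorplan: $(PROJECT).ngd $(PROJECT).par.ncd

# (assumed) target name
report:
	cat *.srp

# (assumed) target name; "make -f xst.mk clean" is used by the script
clean::
	rm -f *.work *.xst
	rm -f *.ngc *.ngd *.bld *.srp *.lso *.prj
	rm -f *.map.mrp *.map.ncd *.map.ngm *.mcs *.par.ncd *.par.pad
	rm -f *.pcf *.prm *.bgn *.drc
	rm -f *.par_pad.csv *.par_pad.txt *.par.par *.par.xpi
	rm -f *.bit
	rm -f *.vcd *.vvp
	rm -f verilog.dump verilog.log
	rm -rf _ngo/
	rm -rf xst/

# Xilinx tools and wine

DEFAULT_ARCH = spartan3
DEFAULT_PART = xc3s700an-fgg484-4

XBIN = $(XDIR)/bin/nt64

# (assumed) tool locations and intermediate file names
XST      = $(XBIN)/xst
NGDBUILD = $(XBIN)/ngdbuild
MAP      = $(XBIN)/map
PAR      = $(XBIN)/par
TRCE     = $(XBIN)/trce
BITGEN   = $(XBIN)/bitgen
PROMGEN  = $(XBIN)/promgen
XSTWORK  = $(PROJECT).work

# (assumed) default optimization settings
XST_OPT_MODE  ?= Speed
XST_OPT_LEVEL ?= 1

IMPACT_OPTIONS_FILE   ?= _impact.cmd    

ifndef ARCH
ARCH = $(DEFAULT_ARCH)
endif
ifndef PART
PART = $(DEFAULT_PART)
endif

# (assumed) rule header
$(XSTWORK): $(SOURCES)
	> $@
	for a in $(SOURCES); do echo "vhdl work $$a" >> $@; done   

# (assumed) rule header
$(PROJECT).xst: $(XSTWORK)
	> $@
	echo -n "run -ifn $(XSTWORK) -ifmt mixed -top $(TOP) -ofn $(PROJECT).ngc" >> $@
	echo " -ofmt NGC -p $(PART) -iobuf yes -opt_mode $(XST_OPT_MODE) -opt_level $(XST_OPT_LEVEL)" >> $@

# (assumed) rule header
$(PROJECT).bit: $(PROJECT).xst
	$(XST) -intstyle ise -ifn $(PROJECT).xst -ofn $(PROJECT).syr
	$(NGDBUILD) -intstyle ise -dd _ngo -nt timestamp -uc $(PROJECT).ucf -p $(PART) $(PROJECT).ngc $(PROJECT).ngd
	$(MAP) -intstyle ise -p $(PART) -w -ol high -t 1 -global_opt off -o $(PROJECT).map.ncd $(PROJECT).ngd $(PROJECT).pcf
	$(PAR) -w -intstyle ise -ol high $(PROJECT).map.ncd $(PROJECT).ncd $(PROJECT).pcf
	$(TRCE) -intstyle ise -v 4 -s 4 -n 4 -fastpaths -xml $(PROJECT).twx $(PROJECT).ncd -o $(PROJECT).twr $(PROJECT).pcf
	$(BITGEN) -intstyle ise $(PROJECT).ncd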

$(PROJECT).bin: $(PROJECT).bit
	$(PROMGEN) -w -p bin -o $(PROJECT).bin -u 0 $(PROJECT).bit

So what happens here? The synthesis procedure invokes several Xilinx ISE command-line tools for logic synthesis as described in the corresponding Makefile, found in the bstest main directory.

Typically, the process includes the following:

  • Generation of the *.xst synthesis script file.
  • Synthesis using xst, which produces the *.ngc gate-level netlist file in NGC format.
  • Building the corresponding *.ngd file using ngdbuild.
  • Mapping using map, which generates the corresponding *.map.ncd file.
  • Placement and routing using par, which produces the routed *.ncd file.
  • Static timing analysis using trce, which reports the critical paths of the routed design (*.twr and *.twx).
  • Bitstream generation (*.bit) using bitgen, however with unused pins.

As a result of this process, the bstest.bit bitstream file is produced.

Then, the shell script invokes the Xilinx IMPACT tool in batch mode with a batch file named impact_s3esk.bat, which automatically passes the series of commands that are necessary for configuring the target FPGA device:

setMode -bs
setCable -p auto
identify
assignFile -p 1 -file bstest.bit
program -p 1
exit

Each line provides a specific command to the IMPACT tool:

  1. Set mode to boundary scan.
  2. Set cable port detection to auto (tests various ports).
  3. Identify parts and their order in the scan chain.
  4. Assign the bitstream to the first part in the scan chain.
  5. Program the selected device.
  6. Exit IMPACT.

If you are using the Spartan-3AN starter kit board, the “program” command should be used with the -onlyFPGA option, in order to program only the FPGA device.

Finally, a Bash shell script can be used as our entry point (and only needed access point) to the process. The corresponding synthesis script (bstest-syn.sh) can be edited in order to specify the following for adapting to the user’s setup:

  • XDIR: the path to the Xilinx ISE/XST installation directory; the xst.exe executable is expected in its bin/nt64 subdirectory
  • arch: specific FPGA architecture (device family) to be used for synthesis
  • part: specific FPGA part (device) to be used for synthesis

# Change XDIR according to your host configuration.
export XDIR=/c/XilinxISE/14.6/ISE_DS/ISE
make -f xst.mk clean
make -f xst.mk PROJECT="bstest" \
  SOURCES="bstest.vhd" \
  TOPDIR="./log" TOP="bstest" \
  ARCH="spartan3" PART="xc3s500e-fg320-4"

# Invoke impact.exe for manual download of the generated bitstream to a 
# hardware platform.
${XDIR}/bin/nt64/impact.exe -batch impact_s3esk.bat

And that’s about it!

I have used this script in MinGW on Windows 7/64-bit. My setup is for an ISE 14.6 installation. There might be a catch here; on some systems, there is a board driver problem regarding ISE. To fix this issue, open a command prompt and uninstall, then reinstall the driver, as follows:

cd c:\XilinxISE\14.6\ISE_DS\ISE\bin\nt64
wdreg -compat -inf %cd%/windrvr6.inf uninstall
wdreg -compat -inf %cd%/windrvr6.inf install



Following these steps, you will be able to synthesize the bstest design. You can download either the Spartan-3E version (bstest-s3esk.zip) or the Spartan-3AN one (bstest-s3ansk.zip) to start experimenting!

BTW, in case your antivirus program goes crybaby, you can safely ignore it: impact_s3esk.bat is just the Windows batch file for passing commands to IMPACT, nothing special about it :)

Implementing 2D cellular automata in plain hardware (FPGA)

Dear all, it has been some time.

A couple of weeks ago, I had the honor to exhibit at the 2nd Panhellenic meeting on New Technologies, Robotics and Entrepreneurship (http://robo.teiste.gr). The meeting took place in Lamia, Greece (where I currently live) and the venue was at a convenient 2 min drive from home camp :)

I want to warmly thank Prof. Panayotis Papazoglou (http://papazoglou.edu.gr) for the hospitality. He made a great effort to make the ROBO meeting a success for the second consecutive time!

My first exhibit at a ROBO meeting was an all-digital, all-hardware demo based on 2D cellular automata. The demo was nicknamed “digital kaleidoscope” but the work was actually done by 2D automata using the so-called rug rule (http://www.mirekw.com/ca/rullex_udll.html).

Introduction to 2D cellular automata

In our case, these automata consist of a two-dimensional matrix of identical cells, whose internal state can be visualized by mapping it to the pixels of a display through a palette of 256 colors.

According to the rug rule, the following three steps are executed:

  1. Calculate the sum of the values for the 8 neighbors (Moore neighborhood) of a given cell C.
  2. Divide by 8 to get their floored average.
  3. Calculate the new value for the cell, C’, by adding a small integer increment to that average. This computation takes place in modulo 256 arithmetic.

Following these simple steps for every cell, the digital kaleidoscope presents an explosive, chaotic and at the same time, highly interesting behavior.
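As a small worked example of the three steps for one cell, here is a minimal sketch in C. The helper name rug_next and its signature are made up for illustration; it is not the code used in the demo.

```c
#include <assert.h>

/* One rug-rule update for a single cell, given the values of its
 * 8 Moore neighbors and the increment (hypothetical helper). */
unsigned char rug_next(const unsigned char nb[8], unsigned int incr)
{
  unsigned int sum = 0;  /* must hold up to 8 * 255 = 2040 */
  int k;

  assert(nb != 0);
  /* Step 1: sum of the 8 neighbors. */
  for (k = 0; k < 8; k++) {
    sum += nb[k];
  }
  /* Step 2: floored average (divide by 8).
   * Step 3: add the increment, modulo 256. */
  return (unsigned char)(((sum >> 3) + incr) & 0xFF);
}
```

For neighbors {10, 20, 30, 40, 50, 60, 70, 85} the sum is 365, the floored average is 45, and with an increment of 1 the new cell value is 46. With all eight neighbors at 255 and an increment of 3, the result wraps around to 2.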


For the implementation, I followed a number of specific steps.

Software exploration

First, a software-based, host-running implementation was examined. I coded this in plain ANSI C, and used it to produce PPM snapshots for each generation of the automaton. Then, using gifsicle, I had these PPMs converted to nice-looking animated GIFs. These allowed me to get a feel, very early in the development cycle, for how the hardware demo would potentially look.

The code makes use of my libpnmio library for PBM/PGM/PPM I/O and is given here:

/* NOTE: closing braces, comment terminators and a few one-line
 * statements were missing from this listing. Statements marked
 * "restored" are reconstructions and may differ from the original.
 */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <math.h>
#include "pnmio.h"

#define XDIM_DEFAULT 128
#define YDIM_DEFAULT 64
int step=1, incr=1, delay=0, gens=1;
char imgout_file_name[96];
FILE *imgout_file;
int img_xdim=XDIM_DEFAULT, img_ydim=YDIM_DEFAULT;
int *img_temp, *img_work, *img_out;

/* decode:
 * Decode the RGB encoding of the specified color.
 * NOTE: This scheme can only allow for up to 256 distinct colors
 * (essentially: R3G3B2).
 */
void decode(int c, int *red, int *green, int *blue)
{
  int t = c;
  *red = ((t >> 5) & 0x7) << 5;
  *green = ((t >> 2) & 0x7) << 5;
  *blue = ((t ) & 0x3) << 6;
}

/* rugca:
 * Generic implementation of the rug rule automaton.
 */
void rugca(int xsize, int ysize, int s, int inc, int g, int d)
{
  int i, k, x, y;
  int taddr, u, uaddr;
  int red, green, blue;
  int height=ysize, width=xsize;
  int cs;
  int sum=0;
  int x_offset[8] = {-1, 0, 1, 1, 1, 0,-1,-1};
  int y_offset[8] = {-1,-1,-1, 0, 1, 1, 1, 0};

  i = 0;
  while (i < g) {

    printf("### GENERATION %09d ###\n", i);

    // Print current generation.
    if ((i % s) == 0) {
      sprintf(imgout_file_name, "rugca-%09d.ppm", i);
      imgout_file = fopen(imgout_file_name, "w");
      for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
          taddr = y*width+x;
          decode(img_temp[taddr], &red, &green, &blue);
          img_out[3*taddr+0] = red;
          img_out[3*taddr+1] = green;
          img_out[3*taddr+2] = blue;
        }
      }
      write_ppm_file(imgout_file, img_out, imgout_file_name,
        xsize, ysize, 1, 1, 255);
    }

    // Calculate next grid state (border cells are left untouched).
    for (y = 1; y < height-1; y++) {
      for (x = 1; x < width-1; x++) {
        sum = 0;
        taddr = y*width + x;
        for (k = 0; k < 8; k++) {
          uaddr = taddr + y_offset[k]*width + x_offset[k];
          u = img_temp[uaddr];
          sum += u;
        }
        // Averaging sum.
        sum = sum >> 3;
        // Increment cs, modulo 256.
        cs = (sum + inc) & 0xFF;
        img_work[taddr] = cs;
      }
    }

    // Copy back current generation.
    for (x = 0; x < width*height; x++) {
      img_temp[x] = img_work[x];
    }

    // Advance generation.
    i++;                          /* restored */
  }
}

/* print_usage:
 * Print usage instructions for the "rugca" program.
 */
static void print_usage()
{
  printf("* Usage:\n");
  printf("* rugca [options]\n");
  printf("* \n");
  printf("* Options:\n");
  printf("* -h: Print this help.\n");
  printf("* -xsize <num>: Image width (Default: 128).\n");
  printf("* -ysize <num>: Image height (Default: 64).\n");
  printf("* -step <num>: Generate a PPM image every step generations (Default: 1).\n");
  printf("* -gens <num>: Total number of CCA generations (Default: 1).\n");
  printf("* -incr <num>: Cell increment (Default: 1).\n");
  printf("* -delay <num>: Delay factor for slowing down the main loop (Default: 0).\n");
  printf("* \n");
  printf("* For further information, please refer to the website:\n");
  printf("* http://www.nkavvadias.com\n\n");
}

/* main:
 * The main routine.
 */
int main(int argc, char **argv)
{
  int i, x, y;

  // Read input arguments
  if (argc < 2) {
    print_usage();                /* restored */
    exit(1);                      /* restored */
  }

  for (i = 1; i < argc; i++) {
    if (strcmp("-h",argv[i]) == 0) {
      print_usage();              /* restored */
      exit(1);                    /* restored */
    } else if (strcmp("-xsize", argv[i]) == 0) {
      if ((i+1) < argc) {
        i++;                      /* restored: move to the option value */
        img_xdim = atoi(argv[i]);
      }
    } else if (strcmp("-ysize", argv[i]) == 0) {
      if ((i+1) < argc) {
        i++;                      /* restored */
        img_ydim = atoi(argv[i]);
      }
    } else if (strcmp("-step", argv[i]) == 0) {
      if ((i+1) < argc) {
        i++;                      /* restored */
        step = atoi(argv[i]);
      }
    } else if (strcmp("-gens", argv[i]) == 0) {
      if ((i+1) < argc) {
        i++;                      /* restored */
        gens = atoi(argv[i]);
      }
    } else if (strcmp("-incr", argv[i]) == 0) {
      if ((i+1) < argc) {
        i++;                      /* restored */
        incr = atoi(argv[i]);
      }
    } else if (strcmp("-delay", argv[i]) == 0) {
      if ((i+1) < argc) {
        i++;                      /* restored */
        delay = atoi(argv[i]);
      }
    }
  }

  /* Allocate space for image data. */
  img_temp = malloc(img_xdim * img_ydim * sizeof(int));
  /* Zero-initialized: rugca() never writes the border cells of img_work. */
  img_work = calloc(img_xdim * img_ydim, sizeof(int));
  img_out = malloc(3 * img_xdim * img_ydim * sizeof(int));

  for (y = 0; y < img_ydim; y++) {
    for (x = 0; x < img_xdim; x++) {
      img_temp[y*img_xdim+x] = 0x00;
    }
  }

  /* Perform operations. */
  rugca(img_xdim, img_ydim, step, incr, gens, delay);

  /* Deallocate memory. */
  free(img_temp);                 /* restored */
  free(img_work);                 /* restored */
  free(img_out);                  /* restored */

  return 0;
}

Adapting reference C for high-level synthesis

Following this, the reference C code had to be adapted for high-level synthesis. I used my own high-level synthesis technology, named HercuLeS HLS: http://www.nkavvadias.com/hercules/

A detailed manual for HercuLeS can be found here: http://www.nkavvadias.com/hercules-reference-manual/hercules-refman.pdf

HercuLeS can generate single IP blocks (for single procedures) or entire system IP (from a given translation unit with a number of procedures). In our case, we will be generating a single block IP with two streaming outputs:

  • ok: is the state of the currently addressed cell
  • xy: the address of that cell (linearized from 0 to XDIM*YDIM-1)

This block will then be incorporated in a system I have developed for image and video synthesis demonstrations. This happens naturally, in a plug-and-play way: for custom procedural image/video generation, the system only needs to be updated with the specific finite-state machine with datapath (FSMD), provided it exposes the proper streaming outputs.
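To make the linearized xy address concrete, here is a minimal sketch in C of how a consumer could map between (x, y) coordinates and the linear address. The 80x60 grid size matches the adapted code; the helper names xy_pack, xy_unpack and bits_for are made up for illustration and are not part of HercuLeS or of the demo system.

```c
#include <assert.h>

#define XSIZE 80
#define YSIZE 60

/* Linearize (x, y) in row-major order: y*XSIZE + x. */
unsigned int xy_pack(unsigned int x, unsigned int y)
{
  assert(x < XSIZE && y < YSIZE);
  return y * XSIZE + x;
}

/* Recover the coordinates from a linear address (consumer side). */
void xy_unpack(unsigned int xy, unsigned int *x, unsigned int *y)
{
  assert(xy < XSIZE * YSIZE);
  *x = xy % XSIZE;
  *y = xy / XSIZE;
}

/* Number of bits needed to represent every value in 0..max_value. */
unsigned int bits_for(unsigned int max_value)
{
  unsigned int bits = 1;
  while (max_value >>= 1) {
    bits++;
  }
  return bits;
}
```

The last cell (79, 59) maps to address 4799, and bits_for(4799) is 13, so a 13-bit address is enough for the 4800 cells of the grid.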

The C code is adapted to the following snippet and then it is passed to HercuLeS for cooking:

/* NOTE: braces and a few one-line statements were missing from this
 * listing. Lines marked "restored" are reconstructions.
 */
#define XSIZE       80
#define YSIZE       60
#define XYSIZE      (XSIZE*YSIZE)   /* restored */

void rugca(int *ok, int *xy)
{
  unsigned int i, j, k, x, y;
  unsigned int g=100000000, d=10000000;
  unsigned int taddr, uaddr;
  unsigned char cs, u;
  // The sum of 8 neighbors can reach 8*255 = 2040, so it needs
  // more than 8 bits.
  unsigned short sum, nval;
  static unsigned char img_temp[XYSIZE], img_work[XYSIZE];
  static char x_offset[8] = {-1, 0, 1, 1, 1, 0,-1,-1};
  static char y_offset[8] = {-1,-1,-1, 0, 1, 1, 1, 0};  
  // Default.
  unsigned char incr=3;

  for (y = 0; y < YSIZE; y++) {
    for (x = 0; x < XSIZE; x++) {       
      img_temp[y*XSIZE+x] = ((x*y) >> 8) & 0x1;
    }
  }

  i = 0;
  while (i < g) {  

    // Calculate next grid state.
    for (y = 1; y < YSIZE-1; y++) {
      for (x = 1; x < XSIZE-1; x++) {
        sum = 0;
        taddr = y*XSIZE + x;
        for (k = 0; k < 8; k++) {           
          uaddr = taddr + y_offset[k]*XSIZE + x_offset[k];           
          u     = img_temp[uaddr];           
          sum   = sum + u;         
        }
        // Averaging sum.         
        sum = sum >> 3;
        // Increment cs, modulo 256.
        nval = (sum + incr) & 0xFF;
        cs = img_temp[taddr];        
        // Stream out the current cell state and its address.
        *ok = cs;
        *xy = taddr;
        img_work[taddr] = nval;      
      }
    }

    // Copy back current generation.
    for (x = 0; x < XYSIZE; x++) {
      img_temp[x] = img_work[x];
    }
    // Delay loop for slowing down the display.
    j = 0;
    while (j < d) {
      j++;                        /* restored */
    }

    // Advance generation.
    i++;                          /* restored */
  }
}

Automatically generated VHDL from HercuLeS

HercuLeS is now ready to rumble. Within a few tens of seconds, the VHDL code for the block is generated. Remember that “humans were not involved in the process :)”. So let’s see what we can do with this automatically-generated code. First, let’s see what it looks like.

library IEEE;
use WORK.operpack.all;
use WORK.rugca_cdt_pkg.all;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity rugca is
  port (
    clk : in std_logic;
    reset : in std_logic;
    start : in std_logic;
    mode : in std_logic_vector(3 downto 0);
    ok : out std_logic_vector(7 downto 0);
    xy : out std_logic_vector(12 downto 0);
    valid : out std_logic;
    done : out std_logic;
    ready : out std_logic
  );
end rugca;

architecture fsmd of rugca is
  type state_type is (S_ENTRY, S_EXIT, S_001_001, S_001_002, S_001_003, S_002_001, S_002_002, S_002_003, 
  S_003_001, S_003_002, S_003_003, S_003_004, S_003_005, S_003_006, S_004_001, S_005_001, S_005_002, S_005_003, 
  S_006_001, S_007_001, S_007_002, S_007_003, S_008_001, S_008_002, S_008_003, 
  S_009_001, S_009_002, S_009_003, S_010_001, S_010_002, S_010_003, S_010_004, 
  S_011_001, S_011_002, S_011_003, S_011_004, S_011_005, S_011_006, S_011_007, S_011_008, 
  S_012_001, S_012_002, S_012_003, S_013_001, S_013_002, S_013_003, S_014_001, S_014a_001, 
  S_015_001, S_015_002, S_015_003, S_015_004, S_018_001, S_018_002, S_019_001, S_019_002, S_019_003, 
  S_020_001, S_021_001, S_022_001, S_023_001, S_024_001, S_025_001, S_026_001, 
  S_027_001, S_028_001, S_029_001, S_029_002, S_029_003, S_030_001, 
  S_031_001, S_031_002, S_031_003, S_032_001, S_033_001, S_033_002, S_033_003, 
  S_034_001, S_034_002, S_034_003, S_034_004, S_035_001, 
  S_036_001, S_036_002, S_036_003, S_037_001, S_037_002, S_037_003, 
  S_038_001, S_039_001, S_039_002, S_039_003, S_040_001, 
  S_041_001, S_041_002, S_042_001);
  signal current_state, next_state: state_type;
  signal img_temp_we : std_logic;
  signal img_temp_addr : std_logic_vector(12 downto 0);
  signal img_temp_din : std_logic_vector(7 downto 0);
  signal img_temp_dout : std_logic_vector(7 downto 0);
  signal img_work_we : std_logic;
  signal img_work_addr : std_logic_vector(12 downto 0);
  signal img_work_din : std_logic_vector(7 downto 0);
  signal img_work_dout : std_logic_vector(7 downto 0);
  signal x_offset_we : std_logic;
  signal x_offset_addr : std_logic_vector(2 downto 0);
  signal x_offset_din : std_logic_vector(7 downto 0);
  signal x_offset_dout : std_logic_vector(7 downto 0);
  signal y_offset_we : std_logic;
  signal y_offset_addr : std_logic_vector(2 downto 0);
  signal y_offset_din : std_logic_vector(7 downto 0);
  signal y_offset_dout : std_logic_vector(7 downto 0);
  signal x_1_next : std_logic_vector(31 downto 0);
  signal x_1_reg : std_logic_vector(31 downto 0);
  signal j_1_next : std_logic_vector(31 downto 0);
  signal j_1_reg : std_logic_vector(31 downto 0);
  signal i_1_next : std_logic_vector(31 downto 0);
  signal i_1_reg : std_logic_vector(31 downto 0);
  signal D_1408_1_next : std_logic_vector(31 downto 0);
  signal D_1408_1_reg : std_logic_vector(31 downto 0);
  signal y_1_next : std_logic_vector(31 downto 0);
  signal y_1_reg : std_logic_vector(31 downto 0);
  signal taddr_1_next : std_logic_vector(31 downto 0);
  signal taddr_1_reg : std_logic_vector(31 downto 0);
  signal D_1417_1_next : std_logic_vector(31 downto 0);
  signal D_1417_1_reg : std_logic_vector(31 downto 0);
  signal uaddr_1_next : std_logic_vector(31 downto 0);
  signal uaddr_1_reg : std_logic_vector(31 downto 0);
  signal sum_1_next : std_logic_vector(15 downto 0);
  signal sum_1_reg : std_logic_vector(15 downto 0);
  signal k_1_next : std_logic_vector(31 downto 0);
  signal k_1_reg : std_logic_vector(31 downto 0);
  signal D_1429_1_next : std_logic_vector(7 downto 0);
  signal D_1429_1_reg : std_logic_vector(7 downto 0);
  signal D_1412_1_next : std_logic_vector(7 downto 0);
  signal D_1412_1_reg : std_logic_vector(7 downto 0);
  signal g_1_next : std_logic_vector(31 downto 0);
  signal g_1_reg : std_logic_vector(31 downto 0);
  signal d_1_next : std_logic_vector(31 downto 0);
  signal d_1_reg : std_logic_vector(31 downto 0);
  signal c_1_next : std_logic_vector(7 downto 0);
  signal c_1_reg : std_logic_vector(7 downto 0);
  signal D_1439_1_next : std_logic_vector(7 downto 0);
  signal D_1439_1_reg : std_logic_vector(7 downto 0);
  signal D_1413_1_next : std_logic_vector(7 downto 0);
  signal D_1413_1_reg : std_logic_vector(7 downto 0);
  signal D_1418_1_next : std_logic_vector(7 downto 0);
  signal D_1418_1_reg : std_logic_vector(7 downto 0);
  signal u_1_next : std_logic_vector(7 downto 0);
  signal u_1_reg : std_logic_vector(7 downto 0);
  signal u_next : std_logic_vector(15 downto 0);
  signal u_reg : std_logic_vector(15 downto 0);
  signal cs_1_next : std_logic_vector(7 downto 0);
  signal cs_1_reg : std_logic_vector(7 downto 0);
  signal x_next : std_logic_vector(31 downto 0);
  signal x_reg : std_logic_vector(31 downto 0);
  signal j_next : std_logic_vector(31 downto 0);
  signal j_reg : std_logic_vector(31 downto 0);
  signal i_next : std_logic_vector(31 downto 0);
  signal i_reg : std_logic_vector(31 downto 0);
  signal y_next : std_logic_vector(31 downto 0);
  signal y_reg : std_logic_vector(31 downto 0);
  signal k_next : std_logic_vector(31 downto 0);
  signal k_reg : std_logic_vector(31 downto 0);
  signal taddr_next : std_logic_vector(31 downto 0);
  signal taddr_reg : std_logic_vector(31 downto 0);
  signal sum_next : std_logic_vector(15 downto 0);
  signal sum_reg : std_logic_vector(15 downto 0);
  signal g_next : std_logic_vector(31 downto 0);
  signal g_reg : std_logic_vector(31 downto 0);
  signal cs_next : std_logic_vector(7 downto 0);
  signal cs_reg : std_logic_vector(7 downto 0);
  signal d_next : std_logic_vector(31 downto 0);
  signal d_reg : std_logic_vector(31 downto 0);
  signal nval_next : std_logic_vector(15 downto 0);
  signal nval_reg : std_logic_vector(15 downto 0);
  signal mode16_next : std_logic_vector(15 downto 0);
  signal mode16_reg : std_logic_vector(15 downto 0);
  signal D_1407_1_next : std_logic_vector(31 downto 0);
  signal D_1407_1_reg : std_logic_vector(31 downto 0);
  signal D_1409_1_next : std_logic_vector(31 downto 0);
  signal D_1409_1_reg : std_logic_vector(31 downto 0);
  signal D_1415_1_next : std_logic_vector(31 downto 0);
  signal D_1415_1_reg : std_logic_vector(31 downto 0);
  signal D_1410_1_next : std_logic_vector(31 downto 0);
  signal D_1410_1_reg : std_logic_vector(31 downto 0);
  signal D_1414_1_next : std_logic_vector(31 downto 0);
  signal D_1414_1_reg : std_logic_vector(31 downto 0);
  signal D_1422_1_next : std_logic_vector(31 downto 0);
  signal D_1422_1_reg : std_logic_vector(31 downto 0);
  signal taddr_0_1_next : std_logic_vector(31 downto 0);
  signal taddr_0_1_reg : std_logic_vector(31 downto 0);
  signal D_1411_1_next : std_logic_vector(7 downto 0);
  signal D_1411_1_reg : std_logic_vector(7 downto 0);
  signal D_1416_1_next : std_logic_vector(31 downto 0);
  signal D_1416_1_reg : std_logic_vector(31 downto 0);
  signal D_1419_1_next : std_logic_vector(31 downto 0);
  signal D_1419_1_reg : std_logic_vector(31 downto 0);
  signal ok_next : std_logic_vector(7 downto 0);
  signal ok_reg : std_logic_vector(7 downto 0);
  signal xy_next : std_logic_vector(12 downto 0);
  signal xy_reg : std_logic_vector(12 downto 0);
  signal serenity_next : std_logic;
  signal serenity_reg : std_logic;
  signal waitstate_next : std_logic;
  signal waitstate_reg : std_logic;
  constant CNST_0 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000000";
  constant CNST_1 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000001";
  constant CNST_500000 : std_logic_vector(63 downto 0)    := "0000000000000000000000000000000000000000000001111010000100100000";
  constant CNST_2000000 : std_logic_vector(63 downto 0)   := "0000000000000000000000000000000000000000000111101000010010000000";
  constant CNST_10000000 : std_logic_vector(63 downto 0)  := "0000000000000000000000000000000000000000100110001001011010000000";
  constant CNST_25000000 : std_logic_vector(63 downto 0)  := "0000000000000000000000000000000000000001011111010111100001000000";
  constant CNST_100000000 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000101111101011110000100000000";
  constant CNST_2 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000010";
  constant CNST_254 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000011111110";
  constant CNST_3 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000011";
  constant CNST_4 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000100";
  constant CNST_4799 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000001001010111111";
  constant CNST_5 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000101";
  constant CNST_58 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000111010";
  constant CNST_59 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000111011";
  constant CNST_6 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000110";
  constant CNST_7 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000000000111";
  constant CNST_78 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000001001110";
  constant CNST_79 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000000000001001111";
  constant CNST_8 : std_logic_vector(63 downto 0)     := "0000000000000000000000000000000000000000000000000000000000001000";
  constant CNST_80 : std_logic_vector(63 downto 0)    := "0000000000000000000000000000000000000000000000000000000001010000";
  constant CNST_118 : std_logic_vector(63 downto 0)   := "0000000000000000000000000000000000000000000000000000000001110110";
  constant CNST_119 : std_logic_vector(63 downto 0)   := "0000000000000000000000000000000000000000000000000000000001110111";
  constant CNST_158 : std_logic_vector(63 downto 0)   := "0000000000000000000000000000000000000000000000000000000010011110";
  constant CNST_159 : std_logic_vector(63 downto 0)   := "0000000000000000000000000000000000000000000000000000000010011111";
  constant CNST_160 : std_logic_vector(63 downto 0)   := "0000000000000000000000000000000000000000000000000000000010100000";
  constant CNST_19199 : std_logic_vector(63 downto 0) := "0000000000000000000000000000000000000000000000000100101011111111";
begin
  -- current state logic
  process (clk, reset)
  begin
    if (reset = '1') then
      current_state <= S_ENTRY;
      x_1_reg <= (others => '0');
      j_1_reg <= (others => '0');
      i_1_reg <= (others => '0');
      D_1408_1_reg <= (others => '0');
      y_1_reg <= (others => '0');
      taddr_1_reg <= (others => '0');
      D_1417_1_reg <= (others => '0');
      uaddr_1_reg <= (others => '0');
      sum_1_reg <= (others => '0');
      k_1_reg <= (others => '0');
      D_1429_1_reg <= (others => '0');
      D_1412_1_reg <= (others => '0');
      g_1_reg <= (others => '0');
      d_1_reg <= (others => '0');
      c_1_reg <= (others => '0');
      D_1439_1_reg <= (others => '0');
      D_1413_1_reg <= (others => '0');
      D_1418_1_reg <= (others => '0');
      u_1_reg <= (others => '0');
      u_reg <= (others => '0');
      cs_1_reg <= (others => '0');
      x_reg <= (others => '0');
      j_reg <= (others => '0');
      i_reg <= (others => '0');
      y_reg <= (others => '0');
      k_reg <= (others => '0');
      taddr_reg <= (others => '0');
      sum_reg <= (others => '0');
      g_reg <= (others => '0');
      cs_reg <= (others => '0');
      d_reg <= (others => '0');
      nval_reg <= (others => '0');
      mode16_reg <= (others => '0');
      D_1407_1_reg <= (others => '0');
      D_1409_1_reg <= (others => '0');
      D_1415_1_reg <= (others => '0');
      D_1410_1_reg <= (others => '0');
      D_1414_1_reg <= (others => '0');
      D_1422_1_reg <= (others => '0');
      taddr_0_1_reg <= (others => '0');
      D_1411_1_reg <= (others => '0');
      D_1416_1_reg <= (others => '0');
      D_1419_1_reg <= (others => '0');
      ok_reg <= (others => '0');
      xy_reg <= (others => '0');
      serenity_reg <= '0';
      waitstate_reg <= '0';
    elsif (clk = '1' and clk'EVENT) then
      current_state <= next_state;
      x_1_reg <= x_1_next;
      j_1_reg <= j_1_next;
      i_1_reg <= i_1_next;
      D_1408_1_reg <= D_1408_1_next;
      y_1_reg <= y_1_next;
      taddr_1_reg <= taddr_1_next;
      D_1417_1_reg <= D_1417_1_next;
      uaddr_1_reg <= uaddr_1_next;
      sum_1_reg <= sum_1_next;
      k_1_reg <= k_1_next;
      D_1429_1_reg <= D_1429_1_next;
      D_1412_1_reg <= D_1412_1_next;
      g_1_reg <= g_1_next;
      d_1_reg <= d_1_next;
      c_1_reg <= c_1_next;
      D_1439_1_reg <= D_1439_1_next;
      D_1413_1_reg <= D_1413_1_next;
      D_1418_1_reg <= D_1418_1_next;
      u_1_reg <= u_1_next;
      u_reg <= u_next;	  
      cs_1_reg <= cs_1_next;
      x_reg <= x_next;
      j_reg <= j_next;
      i_reg <= i_next;
      y_reg <= y_next;
      k_reg <= k_next;
      taddr_reg <= taddr_next;
      sum_reg <= sum_next;
      g_reg <= g_next;
      cs_reg <= cs_next;
      d_reg <= d_next;
      nval_reg <= nval_next;
      mode16_reg <= mode16_next;	  
      D_1407_1_reg <= D_1407_1_next;
      D_1409_1_reg <= D_1409_1_next;
      D_1415_1_reg <= D_1415_1_next;
      D_1410_1_reg <= D_1410_1_next;
      D_1414_1_reg <= D_1414_1_next;
      D_1422_1_reg <= D_1422_1_next;
      taddr_0_1_reg <= taddr_0_1_next;
      D_1411_1_reg <= D_1411_1_next;
      D_1416_1_reg <= D_1416_1_next;
      D_1419_1_reg <= D_1419_1_next;
      ok_reg <= ok_next;
      xy_reg <= xy_next;
      serenity_reg <= serenity_next;
      waitstate_reg <= waitstate_next;
    end if;
  end process;

  -- next state and output logic
  process (current_state, start, mode,
    serenity_reg, serenity_next,
    waitstate_reg, waitstate_next,
    x_1_reg, x_1_next,
    j_1_reg, j_1_next,
    i_1_reg, i_1_next,
    D_1408_1_reg, D_1408_1_next,
    y_1_reg, y_1_next,
    taddr_1_reg, taddr_1_next,
    D_1417_1_reg, D_1417_1_next,
    uaddr_1_reg, uaddr_1_next,
    sum_1_reg, sum_1_next,
    k_1_reg, k_1_next,
    D_1429_1_reg, D_1429_1_next,
    D_1412_1_reg, D_1412_1_next,
    g_1_reg, g_1_next,
    d_1_reg, d_1_next,
    c_1_reg, c_1_next,
    D_1439_1_reg, D_1439_1_next,
    D_1413_1_reg, D_1413_1_next,
    D_1418_1_reg, D_1418_1_next,
    u_1_reg, u_1_next,
    u_reg, u_next,	
    cs_1_reg, cs_1_next,
    x_reg, x_next,
    j_reg, j_next,
    i_reg, i_next,
    y_reg, y_next,
    k_reg, k_next,
    taddr_reg, taddr_next,
    sum_reg, sum_next,
    g_reg, g_next,
    cs_reg, cs_next,
    d_reg, d_next,
    nval_reg, nval_next,
    mode16_reg, mode16_next,	
    D_1407_1_reg, D_1407_1_next,
    D_1409_1_reg, D_1409_1_next,
    D_1415_1_reg, D_1415_1_next,
    D_1410_1_reg, D_1410_1_next,
    D_1414_1_reg, D_1414_1_next,
    D_1422_1_reg, D_1422_1_next,
    taddr_0_1_reg, taddr_0_1_next,
    D_1411_1_reg, D_1411_1_next,
    D_1416_1_reg, D_1416_1_next,
    D_1419_1_reg, D_1419_1_next
  )
  begin
    valid <= '0';
    done <= '0';
    ready <= '0';
    x_1_next <= x_1_reg;
    j_1_next <= j_1_reg;
    i_1_next <= i_1_reg;
    D_1408_1_next <= D_1408_1_reg;
    y_1_next <= y_1_reg;
    taddr_1_next <= taddr_1_reg;
    D_1417_1_next <= D_1417_1_reg;
    uaddr_1_next <= uaddr_1_reg;
    sum_1_next <= sum_1_reg;
    k_1_next <= k_1_reg;
    D_1429_1_next <= D_1429_1_reg;
    D_1412_1_next <= D_1412_1_reg;
    g_1_next <= g_1_reg;
    d_1_next <= d_1_reg;
    c_1_next <= c_1_reg;
    D_1439_1_next <= D_1439_1_reg;
    D_1413_1_next <= D_1413_1_reg;
    D_1418_1_next <= D_1418_1_reg;
    u_1_next <= u_1_reg;
    u_next <= u_reg;	
    cs_1_next <= cs_1_reg;
    x_next <= x_reg;
    j_next <= j_reg;
    i_next <= i_reg;
    y_next <= y_reg;
    k_next <= k_reg;
    taddr_next <= taddr_reg;
    sum_next <= sum_reg;
    g_next <= g_reg;
    cs_next <= cs_reg;
    d_next <= d_reg;
    nval_next <= nval_reg;
    mode16_next <= mode16_reg;	
    D_1407_1_next <= D_1407_1_reg;
    D_1409_1_next <= D_1409_1_reg;
    D_1415_1_next <= D_1415_1_reg;
    D_1410_1_next <= D_1410_1_reg;
    D_1414_1_next <= D_1414_1_reg;
    D_1422_1_next <= D_1422_1_reg;
    taddr_0_1_next <= taddr_0_1_reg;
    D_1411_1_next <= D_1411_1_reg;
    D_1416_1_next <= D_1416_1_reg;
    D_1419_1_next <= D_1419_1_reg;
    ok_next <= ok_reg;
    xy_next <= xy_reg;
    serenity_next <= serenity_reg;
    waitstate_next <= waitstate_reg;
    img_temp_we <= '0';
    img_temp_addr <= (others => '0');
    img_temp_din <= (others => '0');
    img_work_we <= '0';
    img_work_addr <= (others => '0');
    img_work_din <= (others => '0');
    x_offset_we <= '0';
    x_offset_addr <= (others => '0');
    x_offset_din <= (others => '0');
    y_offset_we <= '0';
    y_offset_addr <= (others => '0');
    y_offset_din <= (others => '0');
    case current_state is
      when S_ENTRY =>
        ready <= '1';
        if (start = '1') then
          next_state <= S_001_001;
        else
          next_state <= S_ENTRY;
        end if;
      when S_001_001 =>
        g_1_next <= CNST_100000000(31 downto 0);
        d_1_next <= CNST_500000(31 downto 0);
        y_1_next <= CNST_0(31 downto 0);
        next_state <= S_001_002;
      when S_001_002 =>
        y_next <= y_1_reg(31 downto 0);
        g_next <= g_1_reg(31 downto 0);
        d_next <= d_1_reg(31 downto 0);
        next_state <= S_001_003;
      when S_001_003 =>
        next_state <= S_006_001;
      when S_002_001 =>
        x_1_next <= CNST_0(31 downto 0);
        next_state <= S_002_002;
      when S_002_002 =>
        x_next <= x_1_reg(31 downto 0);
        next_state <= S_002_003;
      when S_002_003 =>
        next_state <= S_004_001;
      when S_003_001 =>
        x_1_next <= std_logic_vector(unsigned(x_reg) + unsigned(CNST_1(31 downto 0)));
        D_1407_1_next <= mul(y_reg, CNST_80(31 downto 0), '0', 32);
        next_state <= S_003_002;
      when S_003_002 =>
        D_1408_1_next <= std_logic_vector(unsigned(D_1407_1_reg) + unsigned(x_reg(31 downto 0)));
        next_state <= S_003_003;
      when S_003_003 =>
        x_next <= x_1_reg(31 downto 0);
        next_state <= S_003_004;
      when S_003_004 =>
        D_1412_1_next <= (others => '0');
        next_state <= S_003_005;
      when S_003_005 =>
        img_temp_we <= '1';
        img_temp_addr <= D_1408_1_reg(12 downto 0);
        img_temp_din <= D_1412_1_reg(7 downto 0);
        next_state <= S_003_006;
      when S_003_006 =>
        next_state <= S_004_001;
      when S_004_001 =>
        if (x_reg <= CNST_79(31 downto 0)) then
          next_state <= S_003_001;
        else
          next_state <= S_005_001;
        end if;
      when S_005_001 =>
        y_1_next <= std_logic_vector(unsigned(y_reg) + unsigned(CNST_1(31 downto 0)));
        next_state <= S_005_002;
      when S_005_002 =>
        y_next <= y_1_reg(31 downto 0);
        next_state <= S_005_003;
      when S_005_003 =>
        next_state <= S_006_001;
      when S_006_001 =>
        if (y_reg <= CNST_59(31 downto 0)) then
          next_state <= S_002_001;
        else
          next_state <= S_007_001;
        end if;
      when S_007_001 =>
        i_1_next <= CNST_0(31 downto 0);
        next_state <= S_007_002;
      when S_007_002 =>
        i_next <= i_1_reg(31 downto 0);
        next_state <= S_007_003;
      when S_007_003 =>
        next_state <= S_040_001;
      when S_008_001 =>
        y_1_next(31 downto 8) <= (others => '0');
        y_1_next(7 downto 0) <= CNST_1(7 downto 0);
        next_state <= S_008_002;
      when S_008_002 =>
        y_next <= y_1_reg(31 downto 0);
        next_state <= S_008_003;
      when S_008_003 =>
        next_state <= S_032_001;
      when S_009_001 =>
        x_1_next(31 downto 8) <= (others => '0');
        x_1_next(7 downto 0) <= CNST_1(7 downto 0);
        next_state <= S_009_002;
      when S_009_002 =>
        x_next <= x_1_reg(31 downto 0);
        next_state <= S_009_003;
      when S_009_003 =>
        next_state <= S_030_001;
      when S_010_001 =>
        sum_1_next <= CNST_0(15 downto 0);
        k_1_next <= CNST_0(31 downto 0);
        D_1407_1_next <= mul(y_reg, CNST_80(31 downto 0), '0', 32);
        next_state <= S_010_002;
      when S_010_002 =>
        taddr_1_next <= std_logic_vector(unsigned(D_1407_1_reg) + unsigned(x_reg(31 downto 0)));
        k_next <= k_1_reg(31 downto 0);
        sum_next <= sum_1_reg(15 downto 0);
        next_state <= S_010_003;
      when S_010_003 =>
        taddr_next <= taddr_1_reg(31 downto 0);
        next_state <= S_010_004;
      when S_010_004 =>
        next_state <= S_014_001;
      when S_011_001 =>
        y_offset_addr <= k_reg(2 downto 0);
        x_offset_addr <= k_reg(2 downto 0);
        waitstate_next <= not (waitstate_reg);
        if (waitstate_reg = '1') then
          D_1413_1_next <= y_offset_dout;
          D_1418_1_next <= x_offset_dout;
          next_state <= S_011_002;
        else
          next_state <= S_011_001;
        end if;
      when S_011_002 =>
        D_1414_1_next(31 downto 8) <= (others => D_1413_1_reg(7));
        D_1414_1_next(7 downto 0) <= D_1413_1_reg;
        D_1419_1_next(31 downto 8) <= (others => D_1418_1_reg(7));
        D_1419_1_next(7 downto 0) <= D_1418_1_reg;
        next_state <= S_011_003;
      when S_011_003 =>
        D_1415_1_next <= mul(D_1414_1_reg, CNST_80(31 downto 0), '1', 32);
        next_state <= S_011_004;
      when S_011_004 =>
        D_1416_1_next(31 downto 0) <= D_1415_1_reg;
        next_state <= S_011_005;
      when S_011_005 =>
        D_1417_1_next <= std_logic_vector(signed(D_1416_1_reg) + signed(taddr_reg(31 downto 0)));
        next_state <= S_011_006;
      when S_011_006 =>
        uaddr_1_next <= std_logic_vector(signed(D_1417_1_reg) + signed(D_1419_1_reg(31 downto 0)));
        next_state <= S_011_007;
      when S_011_007 =>
        img_temp_addr <= uaddr_1_reg(12 downto 0);
        waitstate_next <= not (waitstate_reg);
        if (waitstate_reg = '1') then
          u_1_next <= img_temp_dout;
          next_state <= S_011_008;
        else
          next_state <= S_011_007;
        end if;
      when S_011_008 =>
        u_next <= X"00" & u_1_reg(7 downto 0);
        next_state <= S_012_001;
      when S_012_001 =>
        sum_1_next <= std_logic_vector(unsigned(sum_1_reg) + unsigned(u_reg(15 downto 0)));
        next_state <= S_012_002;
      when S_012_002 =>
        sum_next <= sum_1_reg(15 downto 0);
        next_state <= S_013_001;
      when S_013_001 =>
        k_1_next <= std_logic_vector(unsigned(k_reg) + unsigned(CNST_1(31 downto 0)));
        next_state <= S_013_002;
      when S_013_002 =>
        k_next <= k_1_reg(31 downto 0);
        next_state <= S_013_003;
      when S_013_003 =>
        mode16_next <= X"000" & mode(3 downto 0);
        sum_next <= "000" & sum_reg(15 downto 3);
        next_state <= S_014_001;
      when S_014_001 =>
        if (k_reg <= CNST_7(31 downto 0)) then
          next_state <= S_011_001;
        else
          next_state <= S_014a_001;
        end if;
      when S_014a_001 =>
        nval_next <= std_logic_vector(unsigned(sum_reg) + unsigned(mode16_reg(15 downto 0)));
        next_state <= S_015_001;
      when S_015_001 =>
        taddr_0_1_next(31 downto 0) <= taddr_reg;
        img_temp_addr <= taddr_reg(12 downto 0);
        waitstate_next <= not (waitstate_reg);
        if (waitstate_reg = '1') then
          cs_1_next <= img_temp_dout;
          next_state <= S_015_002;
        else
          next_state <= S_015_001;
        end if;
      when S_015_002 =>
        xy_next <= taddr_0_1_reg(12 downto 0);
        cs_next <= cs_1_reg(7 downto 0);
        D_1422_1_next(31 downto 8) <= (others => '0');
        D_1422_1_next(7 downto 0) <= cs_1_reg;
        serenity_next <= not (serenity_reg);
        if (serenity_reg = '1') then
          valid <= '1';
          next_state <= S_015_003;
        else
          next_state <= S_015_002;
        end if;
      when S_015_003 =>
        ok_next <= D_1422_1_reg(7 downto 0);
        serenity_next <= not (serenity_reg);
        if (serenity_reg = '1') then
          valid <= '1';
          next_state <= S_015_004;
        else
          next_state <= S_015_003;
        end if;
      when S_015_004 =>
        next_state <= S_018_001;
      when S_018_001 =>
        img_work_we <= '1';
        img_work_addr <= taddr_reg(12 downto 0);
        img_work_din <= nval_reg(7 downto 0);
        next_state <= S_019_001;
      when S_019_001 =>
        next_state <= S_020_001;
      when S_020_001 =>
        next_state <= S_021_001;
      when S_021_001 =>
        next_state <= S_022_001;
      when S_022_001 =>
        next_state <= S_023_001;
      when S_023_001 =>
        next_state <= S_024_001;
      when S_024_001 =>
        next_state <= S_025_001;
      when S_025_001 =>
        next_state <= S_026_001;
      when S_026_001 =>
        next_state <= S_027_001;
      when S_027_001 =>
        next_state <= S_028_001;
      when S_028_001 =>
        next_state <= S_029_001;
      when S_029_001 =>
        x_1_next <= std_logic_vector(unsigned(x_reg) + unsigned(CNST_1(31 downto 0)));
        next_state <= S_029_002;
      when S_029_002 =>
        x_next <= x_1_reg(31 downto 0);
        next_state <= S_029_003;
      when S_029_003 =>
        next_state <= S_030_001;
      when S_030_001 =>
        if (x_reg <= CNST_78(31 downto 0)) then
          next_state <= S_010_001;
        else
          next_state <= S_031_001;
        end if;
      when S_031_001 =>
        y_1_next <= std_logic_vector(unsigned(y_reg) + unsigned(CNST_1(31 downto 0)));
        next_state <= S_031_002;
      when S_031_002 =>
        y_next <= y_1_reg(31 downto 0);
        next_state <= S_031_003;
      when S_031_003 =>
        next_state <= S_032_001;
      when S_032_001 =>
        if (y_reg <= CNST_58(31 downto 0)) then
          next_state <= S_009_001;
        else
          next_state <= S_033_001;
        end if;
      when S_033_001 =>
        x_1_next <= CNST_0(31 downto 0);
        next_state <= S_033_002;
      when S_033_002 =>
        x_next <= x_1_reg(31 downto 0);
        next_state <= S_033_003;
      when S_033_003 =>
        next_state <= S_035_001;
      when S_034_001 =>
        x_1_next <= std_logic_vector(unsigned(x_reg) + unsigned(CNST_1(31 downto 0)));
        img_work_addr <= x_reg(12 downto 0);
        waitstate_next <= not (waitstate_reg);
        if (waitstate_reg = '1') then
          D_1439_1_next <= img_work_dout;
          next_state <= S_034_002;
        else
          next_state <= S_034_001;
        end if;
      when S_034_002 =>
        img_temp_we <= '1';
        img_temp_addr <= x_reg(12 downto 0);
        img_temp_din <= D_1439_1_reg(7 downto 0);
        next_state <= S_034_003;
      when S_034_003 =>
        x_next <= x_1_reg(31 downto 0);
        next_state <= S_034_004;
      when S_034_004 =>
        next_state <= S_035_001;
      when S_035_001 =>
        if (x_reg <= CNST_4799(31 downto 0)) then
          next_state <= S_034_001;
        else
          next_state <= S_036_001;
        end if;
      when S_036_001 =>
        j_1_next <= CNST_0(31 downto 0);
        next_state <= S_036_002;
      when S_036_002 =>
        j_next <= j_1_reg(31 downto 0);
        next_state <= S_036_003;
      when S_036_003 =>
        next_state <= S_038_001;
      when S_037_001 =>
        j_1_next <= std_logic_vector(unsigned(j_reg) + unsigned(CNST_1(31 downto 0)));
        next_state <= S_037_002;
      when S_037_002 =>
        j_next <= j_1_reg(31 downto 0);
        next_state <= S_037_003;
      when S_037_003 =>
        next_state <= S_038_001;
      when S_038_001 =>
        if (j_reg < d_reg(31 downto 0)) then
          next_state <= S_037_001;
        else
          next_state <= S_039_001;
        end if;
      when S_039_001 =>
        i_1_next <= std_logic_vector(unsigned(i_reg) + unsigned(CNST_1(31 downto 0)));
        next_state <= S_039_002;
      when S_039_002 =>
        i_next <= i_1_reg(31 downto 0);
        next_state <= S_039_003;
      when S_039_003 =>
        next_state <= S_040_001;
      when S_040_001 =>
        if (i_reg < g_reg(31 downto 0)) then
          next_state <= S_008_001;
        else
          next_state <= S_041_001;
        end if;
      when S_041_001 =>
        next_state <= S_042_001;
      when S_042_001 =>
        next_state <= S_EXIT;
      when S_EXIT =>
        done <= '1';
        next_state <= S_ENTRY;
      when others =>
        next_state <= S_ENTRY;
    end case;
  end process;

  ok <= ok_reg;
  xy <= xy_reg;

  img_temp_instance : entity WORK.ram(img_temp)
    generic map (
      AW     => 13,
      DW     => 8,
      NR     => 4800
    )
    port map (
      clk    => clk,
      we     => img_temp_we,
      en     => '1',
      rwaddr => img_temp_addr,
      din    => img_temp_din,
      dout   => img_temp_dout
    );

  img_work_instance : entity WORK.ram(img_work)
    generic map (
      AW     => 13,
      DW     => 8,
      NR     => 4800
    )
    port map (
      clk    => clk,
      we     => img_work_we,
      en     => '1',
      rwaddr => img_work_addr,
      din    => img_work_din,
      dout   => img_work_dout
    );

  x_offset_instance : entity WORK.ram(x_offset)
    generic map (
      AW     => 3,
      DW     => 8,
      NR     => 8
    )
    port map (
      clk    => clk,
      we     => x_offset_we,
      en     => '1',
      rwaddr => x_offset_addr,
      din    => x_offset_din,
      dout   => x_offset_dout
    );

  y_offset_instance : entity WORK.ram(y_offset)
    generic map (
      AW     => 3,
      DW     => 8,
      NR     => 8
    )
    port map (
      clk    => clk,
      we     => y_offset_we,
      en     => '1',
      rwaddr => y_offset_addr,
      din    => y_offset_din,
      dout   => y_offset_dout
    );

end fsmd;

Whoa! That’s a lot of stuff coming out of HercuLeS, and it seems to have done the job. A self-checking testbench was also automatically generated by HercuLeS, but we will not focus on that in this particular blog post.

Technically, this is a single FSMD with separate processes for the current state logic and the next-state/output logic. Datapath actions are embedded within the next-state/output logic process, so there is no messy code with concurrent assignments (this has its pros and cons). Overall, the code closely follows the FSMD paradigm as presented in Prof. D. Gajski’s works (http://www.cecs.uci.edu/~gajski/) and in Prof. Pong P. Chu’s books: http://academic.csuohio.edu/chu_p/ (I own two of them).
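The two-process split can be mimicked in software. What follows is a minimal C sketch, not HercuLeS output: it models only the delay loop (states S_036_001 to S_038_001 of the listing), with one function standing in for the next-state/output process and a plain struct assignment standing in for the clocked process. The names regs_t, next_logic and run_delay_fsmd are made up for illustration.

```c
#include <stdint.h>

/* State names mirror the S_<block>_<step> labels of the generated VHDL. */
typedef enum {
    S_ENTRY, S_036_001, S_036_002, S_036_003,
    S_037_001, S_037_002, S_037_003, S_038_001, S_EXIT
} state_t;

/* All registers of this tiny FSMD: the state plus j and its temporary j_1. */
typedef struct {
    state_t  state;
    uint32_t j, j_1;
} regs_t;

/* Next-state/output logic: a pure function of the current registers and
   the inputs. By default every register keeps its value. */
static regs_t next_logic(const regs_t *cur, int start, uint32_t d, int *done)
{
    regs_t nxt = *cur;
    *done = 0;
    switch (cur->state) {
    case S_ENTRY:   nxt.state = start ? S_036_001 : S_ENTRY;       break;
    case S_036_001: nxt.j_1 = 0;          nxt.state = S_036_002;   break;
    case S_036_002: nxt.j = cur->j_1;     nxt.state = S_036_003;   break;
    case S_036_003:                       nxt.state = S_038_001;   break;
    case S_037_001: nxt.j_1 = cur->j + 1; nxt.state = S_037_002;   break;
    case S_037_002: nxt.j = cur->j_1;     nxt.state = S_037_003;   break;
    case S_037_003:                       nxt.state = S_038_001;   break;
    case S_038_001: nxt.state = (cur->j < d) ? S_037_001 : S_EXIT; break;
    case S_EXIT:    *done = 1;            nxt.state = S_ENTRY;     break;
    }
    return nxt;
}

/* Current state logic: on every "clock edge" the registers take their
   next values. Returns the number of cycles until done is raised. */
static unsigned long run_delay_fsmd(uint32_t d)
{
    regs_t regs = { S_ENTRY, 0, 0 };
    unsigned long cycles = 0;
    int done = 0;
    while (!done) {
        regs = next_logic(&regs, 1, d, &done);
        cycles++;
    }
    return cycles;
}
```

With d = 3, run_delay_fsmd counts 18 clock cycles: 6 for entry, initialization and exit, plus 4 per loop iteration.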

The automatically-generated implementation uses a kind of triple-buffering. We need a working and a temporary memory for the automaton world, representing generations n and n+1. In the hardware-oriented version, the world is of size 80×60 and is displayed using 8×8 upscaling, due to the limits of the available internal RAM of the FPGA device (block RAM), which is around 360 kbits (we will use around 70% of this). For better visual output, and since computations run within the video-on timings, we use a separate, third memory as a video frame buffer. Of course, this scheme can be improved, e.g. by using line buffers or by doing all computations within the blanking intervals. It would also be interesting to port this demo to another board using fast, zero-cycle turnaround SRAM.
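To make the working/temporary memory pair concrete, here is a rough C sketch of one generation step on the 80×60 world. It is not the original C source fed to HercuLeS. The update rule (average of the eight neighbours plus the increment, wrapping at 8 bits) and the contents of the two offset tables are assumptions; the listing only shows that two 8-entry tables named x_offset and y_offset exist and that the border cells are skipped.

```c
#include <stdint.h>
#include <string.h>

#define W 80   /* world width  */
#define H 60   /* world height */

/* Neighbour offsets: assumed values and ordering, for illustration only. */
static const int8_t x_offset[8] = { -1,  0,  1, -1, 1, -1, 0, 1 };
static const int8_t y_offset[8] = { -1, -1, -1,  0, 0,  1, 1, 1 };

/* temp holds generation n, work receives generation n+1.
   Border cells are never written, as in the listing (y = 1..58, x = 1..78). */
static void next_generation(uint8_t temp[W * H], uint8_t work[W * H], uint8_t incr)
{
    for (int y = 1; y <= H - 2; y++) {
        for (int x = 1; x <= W - 2; x++) {
            int taddr = y * W + x;
            unsigned sum = 0;
            for (int k = 0; k < 8; k++) {
                int uaddr = taddr + y_offset[k] * W + x_offset[k];
                sum += temp[uaddr];
            }
            /* Assumed rule: neighbour average plus increment, modulo 256. */
            work[taddr] = (uint8_t)((sum >> 3) + incr);
        }
    }
    /* Copy generation n+1 back so it becomes the input of the next step. */
    memcpy(temp, work, (size_t)W * H);
}
```

The third memory, the video frame buffer, is left out here since it only matters for display.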

About the exhibition

I had a great time at the exhibition, moving away from remote customer interaction (and their virtual whiplashes :) and meeting a lot of people in person, including school children, parents, technology aficionados, higher education students, hobbyists, local industry, teachers and professors.

This is what my demo looked like (so it is true hardware, no hidden computers running the show, I had to point this out a lot). It appears I was a little tired, but hey this was towards the end of the day (and I needed refueling).



And another shot of the demo:



I have uploaded two short videos showcasing the digital kaleidoscope demo on my YouTube channel:

Overview of the demo: http://www.youtube.com/watch?v=ahyBUAFcXHw

Starting sequence: http://www.youtube.com/watch?v=-sxB8DSznGU

The hardware uses a delay loop so that humans can follow the process. The increment parameter of the automaton is controlled by the four slide switches available on the specific Digilent board, so we can set any value from 0 to 15.
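As a back-of-the-envelope check, the length of the delay can be estimated from the listing: the loop bound d is loaded from CNST_500000 and each pass visits four states (S_037_001 to S_037_003 plus the test in S_038_001). The 50 MHz clock below is an assumption, not something stated in this post, and both helper names are hypothetical.

```c
#include <stdint.h>

/* The four slide switches give a 4-bit value, zero-extended as done for
   mode(3 downto 0) in the listing. */
static uint16_t increment_from_switches(uint8_t switches)
{
    return (uint16_t)(switches & 0x0Fu);
}

/* Approximate duration of the delay loop in milliseconds. */
static double delay_ms(uint32_t iterations, unsigned cycles_per_iteration, double clk_hz)
{
    return 1000.0 * (double)iterations * cycles_per_iteration / clk_hz;
}
```

Under these assumptions, delay_ms(500000, 4, 50e6) comes out at about 40 ms of pure waiting per generation, on top of the time spent computing it.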

Technology used for the demo and summary

The digital circuit was designed in the VHDL hardware description language.

To dramatically reduce design time, the behavior of the circuit was first described in the C programming language. The C program was automatically translated to VHDL using the HercuLeS high-level synthesis tool.

The resulting description was then synthesized on an FPGA integrated circuit (Xilinx XC3S700AN) using the Xilinx ISE/XST logic synthesis environment.

The development board which has been used is the Xilinx Spartan-3AN Starter Kit by Digilent.


Folks, I hope you have enjoyed this short (or long) walk through the lost artland of Kaveirian (that’s me) high-level synthesis. My next steps would involve pretty much everything; after all, HercuLeS is used for day-to-day, real-life, commercial-grade work, most frequently for work intended for clients (that most times cannot be disclosed).

So I am thinking of a more impressive set of demos, like an algorithmically-generated 3D world which you can explore via a simple keyboard interface, 3D graphics demos (all done in plain hardware), chess engines, obscure IOCCC entries, etc. I am always collecting ideas across the web, especially “mini-codes” or “tiny-codes” that could be turned into interesting hardware demos.


Ghosts of HLS past, present and future

This is mostly an adaptation of my position statement as requested by Brian Bailey Consulting.

  • The current state of HLS compared to the original expectations for it 10 years ago

I think that over the past few years a lot of interesting developments have occurred in the HLS field, especially in the programmable/FPGA realm. Essentially, 3rd generation high-level synthesis tools and environments made a successful, yet belated, entry into the FPGA market. Here, I’m arbitrarily making a distinction among 1st generation HLS tools (academic endeavors of the 80’s), 2nd generation HLS tools that made it to the ASIC market in the 90’s (e.g. Behavioral Compiler), and the current generation with usable high-level language frontends, rich optimization portfolios, IP integration and verification facilities. For these tools, the entry bar has lowered significantly from tens of thousands of USD to about 2-5k USD, and this really helps broader adoption. Technology vendors are not interested in selling their HLS tools alone, but entire platforms instead.

On the other side, it is much more difficult for tools of this grade to penetrate the ASIC market, where design failures are much more costly. There exist both software and hardware infrastructure issues. Most HLS offerings lack the design space exploration and analysis tools that would allow a safer and faster assessment of QoR on multiple design points.


  • Technological changes during the last 10 years. Where does the most potential lie?

I don’t think that 3rd gen HLS tools encompass significant theoretical advances compared to what was achievable 10 or 15 years ago in core HLS; most of the theory (scheduling, resource sharing, retiming) was already there. Changes and improvements are incremental in effect. However, it is this new bunch of tools, with usable C/C++/SystemC frontends and targeting accessible FPGA platforms, that starts to make a difference. There is potential for increased competition around MATLAB or Python to hardware. This market will become increasingly important. Computation-wise, bioinformatics will be big, big data of course, as well as non-Von Neumann computing, particularly neuromorphic computing, for instance to map the mammalian brain. Von Neumann machines will still be in use for emulating neurocomputers, basically as a convenience.


  • Is HLS a disruptive innovation that will shake the EDA industry? What has changed?

I think that HLS starts to find its place within the ESL flow. I don’t believe that the prevailing view is of expecting HLS to be disruptive. It seems that we never really lacked HLS, but the flow was not there (in this sense I agree with Gabe on EDA). There were a lot of things missing (interfaces, integration, frontends, competitive processes to ASIC) for HLS to be disruptive in the past.


  • Estimating market size.

I would say that the estimated market has risen from a few million USD to maybe 30 to 50 million USD. In order to expand the pie, software-oriented engineers, for instance algorithm developers, must ride the wagon of HLS. The majority of these engineers work with MATLAB, Python (or CUDA or OpenCL), so the corresponding language frontends have to be implemented. FPGA/SoC system engineers are either already heavily using HLS or considering extending their use of HLS technology. The easier converts are DSP engineers who clearly see the benefits of HLS in their day-to-day work, e.g. implementing matrix algebra. However, trusting HLS up to tapeout is a different story; I still see lots of manual interaction layers following the initial HLS outcome, primarily for interface modifications, old school optimizations and late adaptations (which should be back-propagated effectively by the current HLS tools in the first place).

A question arises here: what will ultimately replace RTL? If HLS is the answer, then the market size will first grow to the limits of the current RTL-powered market, and will then contract since it will have become a commodity. This transition will take about 12-15 years to complete; highly-customized functions such as device controllers will be the last stand.


  • Who benefits from HLS?

Semiconductor companies from the Far East are known to be early, faithful adopters of HLS. I think that HLS has played a small part in their success, primarily in reducing time-to-market. A number of IP vendors use either third-party or homebrew, partial, HLS tools for streamlining their IP. Whenever an IP vendor releases a new non-trivial IP every one or two weeks, this is a typical sign of heavy HLS use :) Apart from these, HLS should find its way to high-performance computing applications. HPC applications are in most cases stencil codes, and the most troublesome part is to exploit and map task-level or processor-level parallelism. I think that there is much room available in HLS for HPC scientific computing. So companies offering HPC (either large or small form factor) design/programming services have a lot to gain from HLS.


  • How has HLS progressed over the past 10 years?

For the most part, academic ideas are continuously transforming HLS. The key algorithms are established but (for instance) polyhedral frameworks are just starting to be used in hardware compilers. Further, putting an intermediate representation at the heart of HLS is the right idea; frontends, backends, analyses and optimizations are naturally easier to extend and maintain. Exposing this representation might also be of benefit to all parties.


  • HLS in 10 years from now.

There are still lots of interesting things to happen in HLS:

  1. Support very high-level, functional, dynamic, and concurrency-oriented specifications to HLS (there are multiple attempts towards these ends already)
  2. Transparent preoptimization through intelligent code refactoring
  3. The high-quality, extensible, open HLS toolflow: an LLVM for HLS. A key part is missing at the backend side of the flow; it is very difficult for open-source projects like VPR/VTR and Torc to keep up with process advances.
  4. HLS as a service: a superoptimizing hardware backend reusing its acquired knowledge for aggressive state optimizations running on the cloud.
  5. Better tools at all levels: early assessment, design-space exploration, and analysis tools in the HLS environment.
  6. Transparent development environments: ideally it will not be necessary to even know that we are using HLS, especially for hybrid, heterogeneous systems.

In 10-20 years from now, at or past the end of Moore's law, HLS will be the preferred choice for squeezing all performance potential out of 5nm or 8nm silicon. These last processes will be around for a decade or so. Universities will not offer advanced courses for ASIC/FPGA (no academic interest there); RTL will only be for hobbyists, fun nonetheless. Still, most of the theory will be usable on graphene, carbon nanotube, organic or bio- processes and in general on whatever else comes to prominence.

HLS tools: Portable generated HDL code is a must

This is a cross-post of my reply to Matthieu Wipliez of Synflow, a competitor who made a point about the necessity of HLS tools generating readable RTL code.

I totally agree with his position statement; generated HDL must be readable. Manual tracing by the human expert is still invaluable and should always be there.

There are certain rules that should apply to generating portable, generic and readable HDL code:

1) Maintain program symbols (variables) in the generated code. Temporaries should keep certain, easy to follow, naming conventions. I do this in HercuLeS.

2) Cross-tagging between source level, intermediate representation, graph-based representation and final HDL code. I’m currently investigating cross-tagging approaches since HercuLeS should support tracing among source (C w/wo GMP API for now), IR (N-Address Code), low-level IR (Graphviz CDFGs) and RTL VHDL.

3) Keep control steps (FSM/FSMD states) clear and visible. Consistent naming conventions can help (to associate states with the corresponding basic blocks or regions for instance). Again cross-tagging can make this more elegant.

4) Exploit the casual FSMD feature: do embed datapath actions into the next state logic code of your FSMD code. Don’t do it Vivado HLS style. The Vivado HLS approach is an ugly mess. Datapath actions are thrown out in concurrent code form, and nobody can follow anything. There is more to pay here than the minor gains in easier resource sharing. X and A people, I’m talking to you: your backend tools are better than that, not much is lost in optimization if you embed datapath actions.
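
To make rules 1, 3 and 4 a bit more concrete, here is a minimal sketch of an emitter that keeps program symbols, tags each state with its source line, and places the datapath actions inside the next state logic. It is only an illustration and not how HercuLeS is implemented: the `State` record, the `S_BB2_*` state names, the `a_reg`/`b_reg` signals and the emitted VHDL-like fragment are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    """One control step of a hypothetical FSMD."""
    name: str         # state name, derived here from a basic block label
    source_line: int  # source line this state is cross-tagged with
    actions: List[str] = field(default_factory=list)  # datapath assignments
    next_state: str = ""


def emit_next_state_logic(states: List[State]) -> str:
    """Emit a VHDL-like case statement with datapath actions kept per state."""
    lines = ["case current_state is"]
    for s in states:
        # Cross-tag: the trailing comment links the state to its source line.
        lines.append(f"  when {s.name} =>  -- source line {s.source_line}")
        # Datapath actions stay next to the state that triggers them.
        for action in s.actions:
            lines.append(f"    {action};")
        lines.append(f"    next_state <= {s.next_state};")
    lines.append("end case;")
    return "\n".join(lines)


# Hypothetical states for one step of a subtractive GCD loop; the program
# variables a and b keep their names in the generated signals.
gcd_states = [
    State("S_BB2_SUB", 5, ["a_next <= a_reg - b_reg"], "S_BB2_TEST"),
    State("S_BB2_TEST", 4, [], "S_EXIT"),
]
print(emit_next_state_logic(gcd_states))
```

With this layout, anyone tracing the output can read a state, see which source line it came from, and find the register updates it performs in the same place.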

Vivado HLS vs HercuLeS

I’ve spent the last couple of days performing head-to-head comparisons of Xilinx Vivado HLS against HercuLeS on HLS-generated digital circuits (from input C code).

I believe that HercuLeS lived up to the challenge; it is competitive with Vivado HLS. The reader should take into account that:

  1. Both tools have been used (almost) out-of-the-box. Vivado HLS was configured with no bufg inclusion, and in “out_of_context” mode. This means that no clock buffers and I/O pins were routed.
  2. HercuLeS does not (yet) customize the generated HDL in order to better fit specific architectural features (DSP blocks, embedded SRL units).
  3. Vivado HLS had some TOTAL FAILURES on some relatively simple codes such as a simple perfect number detector (positive integers equal to the sum of their proper divisors), a 1D wavelet code, and Easter date calculation. It seems that Vivado HLS has a hard time with integer modulo/remainder. Codes are provided to anyone interested.
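
For readers who have not met the first of these kernels, the arithmetic of a perfect number detector can be sketched as below. This is a Python model written for illustration, not the C benchmark used in the comparison; the function names are assumptions. Note that the inner loop is dominated by the integer remainder operation mentioned above.

```python
def sum_of_proper_divisors(n: int) -> int:
    """Add up every divisor of n that is smaller than n."""
    total = 0
    for d in range(1, n):
        if n % d == 0:  # integer remainder: d divides n exactly
            total += d
    return total


def is_perfect(n: int) -> bool:
    """A positive integer is perfect when it equals the sum of its proper divisors."""
    return n > 0 and sum_of_proper_divisors(n) == n


print([n for n in range(1, 500) if is_perfect(n)])
```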

The following table provides a summary of the results:

| # | Benchmark | Description | VHLS LUTs | VHLS Regs | VHLS TET (ns) | HercuLeS LUTs | HercuLeS Regs | HercuLeS TET (ns) | Comment |
|---|---|---|---|---|---|---|---|---|---|
| 1 | arraysum | Array sum | 102 | 132 | 26.5 | 103 | 63 | 73.3 | |
| 2 | bitrev | Bit reversal | 67 | 39 | 72.0 | 42 | 40 | 11.6 | |
| 3 | edgedet | Edge detection | 246 | 130 | 1636.3 | 680 | 361 | 1606.4 | 1 BRAM for VHLS |
| 4 | fibo | Fibonacci series | 138 | 131 | 60.2 | 137 | 197 | 102.7 | |
| 5 | fir | FIR filter | 102 | 52 | 833.4 | 217 | 140 | 2729.4 | |
| 6 | gcd | Greatest common divisor | 210 | 98 | 35.2 | 128 | 93 | 75.9 | |
| 7 | icbrt | Cubic root approximation | 239 | 207 | 260.6 | 365 | 201 | 400.5 | |
| 8 | popcount | Population count | 45 | 65 | 19.4 | 53 | 102 | 26.1 | |
| 9 | sieve | Prime sieve of Eratosthenes | 525 | 595 | 6108.4 | 565 | 523 | 3869.5 | 1 BRAM for VHLS |
| 10 | sierpinski | Sierpinski triangle | 88 | 163 | 11326.5 | 230 | 200 | 16224.9 | |


  • Measurements were obtained for the KC705 development board device: xc7k325t-ffg900-2
  • TET is Total Execution Time in ns.
  • VHLS is a shortened form for Vivado HLS.
  • Vivado HLS 2013.1 was used.
  • Bold denotes smaller area and lower execution time.
  • Italic denotes an inconclusive comparison.
  • For the cases of edgedet and sieve, VHLS identifies a BRAM; HercuLeS does not. In these cases, HercuLeS saves a BRAM while VHLS saves on LUTs and FFs (Registers).

Overall, there are about 30% wins for HercuLeS and ~70% wins for Vivado HLS. Not too bad for a tool like HercuLeS, which produces generic, portable, vendor-independent code. I estimate that the HercuLeS development effort is around 1-5% of that of Vivado HLS.
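
One way to arrive at a tally of this kind from the table is sketched below. The counting rule is an assumption made for illustration: each of the three metrics (LUTs, Regs, TET) of each benchmark counts as one comparison and the lower value wins. It ignores the BRAM differences and the comparisons marked as inconclusive, so it is not necessarily the rule behind the figures quoted above.

```python
# (benchmark, (VHLS LUTs, Regs, TET ns), (HercuLeS LUTs, Regs, TET ns))
RESULTS = [
    ("arraysum",   (102, 132, 26.5),    (103, 63, 73.3)),
    ("bitrev",     (67, 39, 72.0),      (42, 40, 11.6)),
    ("edgedet",    (246, 130, 1636.3),  (680, 361, 1606.4)),
    ("fibo",       (138, 131, 60.2),    (137, 197, 102.7)),
    ("fir",        (102, 52, 833.4),    (217, 140, 2729.4)),
    ("gcd",        (210, 98, 35.2),     (128, 93, 75.9)),
    ("icbrt",      (239, 207, 260.6),   (365, 201, 400.5)),
    ("popcount",   (45, 65, 19.4),      (53, 102, 26.1)),
    ("sieve",      (525, 595, 6108.4),  (565, 523, 3869.5)),
    ("sierpinski", (88, 163, 11326.5),  (230, 200, 16224.9)),
]


def tally_wins(results):
    """Count per-metric wins; a lower LUT, register or TET figure wins."""
    vhls_wins = hercules_wins = 0
    for _name, vhls, hercules in results:
        for v, h in zip(vhls, hercules):
            if h < v:
                hercules_wins += 1
            elif v < h:
                vhls_wins += 1
    return vhls_wins, hercules_wins


vhls_wins, hercules_wins = tally_wins(RESULTS)
total = vhls_wins + hercules_wins
print(f"Vivado HLS: {vhls_wins}/{total}, HercuLeS: {hercules_wins}/{total}")
```

Under this rule the split is 20 to 10 out of 30 comparisons, roughly two thirds to one third.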

I believe that HercuLeS will do much better in the out-of-the-box experience (which is of high importance in order to draw more software-minded engineers into the game) in the near future.

Both HercuLeS and Vivado HLS have optimization features (e.g. loop unrolling). HercuLeS applies optimizations by using a source-to-source C code optimizer. Vivado HLS mostly resorts to end-user directives. These coding aspects will be taken into account in a followup comparison; they also yield a much more extensive solution space.
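
As a reminder of what loop unrolling does at the source level, here is a small sketch on an array sum, written in Python for brevity. It is a generic illustration of the transformation and does not show the output of the HercuLeS source-to-source optimizer or the effect of any Vivado HLS directive; the unroll factor of 4 and the function names are assumptions.

```python
def array_sum(data):
    """Rolled loop: one addition per iteration."""
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total


def array_sum_unrolled(data):
    """Same reduction with the loop body unrolled by a factor of 4."""
    total = 0
    n = len(data)
    main = n - n % 4  # largest multiple of 4 not exceeding n
    i = 0
    while i < main:
        # Four independent reads per iteration expose more parallelism
        # to a scheduler than the rolled loop does.
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        i += 4
    # Epilogue: handle the leftover elements when n is not a multiple of 4.
    for j in range(main, n):
        total += data[j]
    return total


samples = list(range(10))
print(array_sum(samples), array_sum_unrolled(samples))
```

Different unroll factors give different area/time trade-offs, which is one reason such optimizations enlarge the solution space.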


A few words on HercuLeS high-level synthesis

HercuLeS is a new high-level synthesis tool marketed by Ajax Compilers (http://www.ajaxcompilers.com). HercuLeS has been in development since 2009 and it seems that now is the proper time to hit the market :) Full disclosure: I’m the main (read: sole) developer of HercuLeS.

A free evaluation of HercuLeS is available. You can grab it by sending me an email (see either ajaxcompilers.com or nkavvadias.com for contact details).

HercuLeS is based on the following flow: C -> GIMPLE -> N-Address Code -> VHDL.

HercuLeS is extensible, since frontends, analyses and optimization passes can be added by third parties. At this moment, HercuLeS is bundled with a number of external modules for analyses and optimizations at the C, NAC (N-Address Code, its textual IR), Graphviz, and VHDL levels. It generates vendor-independent code, so the generated HDL descriptions are synthesizable (in principle) to either FPGA or ASIC targets.

It should be noted that certain things are still missing from HercuLeS and there is ongoing work to support them in the future. This is inevitable since our resources are somewhat limited. For instance there is no Verilog backend yet.

We are looking to establish close communication with our users. Our users provide inspiration and their requests drive future development. Criticism is well-accepted at Ajax Compilers :)

Hello world!

Hi all! This is yet another blog on EDA (Electronic Design Automation) tools, high-level synthesis (HLS), compilers, processor design and digital circuit design in general. Hope you will find it interesting!

Now let’s start writing…