Category Archives: EDA

Streamlining your FPGA synthesis process

These last few days, I had an appetite for experimenting with a number of FPGA projects, ranging from very simple logic to more complex state machines. The idea was to port a number of existing designs to the Xilinx Spartan 3E and 3AN starter kit boards: a few of my own designs, either targeting other boards/devices or never yet tested at the board level, as well as designs from other people. Mike Field’s hamsterworks website is an excellent source of FPGA designs of varying degrees of complexity! Most of them are for Spartan 6/Artix 7, so I had to backport some of his ideas to my less contemporary devices :)

To use my time effectively on porting or backporting a number of designs (maybe 15 or 20), I had to streamline the synthesis process: everything had to be done from the command line. Borrowing some ideas from Evgeni Stavinov’s “Using Xilinx Tools in Command-Line Mode”, I was able to synthesize, generate a bitstream and finally program the FPGA device via my download cable, without any GUI interaction.

As our vehicle, I will use just about the simplest possible design. Let’s call it bstest, a stand-in for “buttons and switches tester”. It is a very simple design for testing the four push buttons, four slide switches and eight discrete LEDs available on the Spartan-3E Starter Kit board by Digilent (link at Xilinx website) (link at Digilent website).

The design

The design associates push button and slide switch actions to specific LEDs. The code (bstest.vhd) is really simple:


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity bstest is
  port ( 
    sldsw  : in  std_logic_vector(3 downto 0);
    button : in  std_logic_vector(3 downto 0); -- N-E-W-S
    led    : out std_logic_vector(7 downto 0)
  );
end bstest;

architecture dataflow of bstest is
begin
  --
  led(7 downto 4) <= sldsw;
  led(3 downto 0) <= button;
--  led <= sldsw & button; -- we could also do it this way
  --
end dataflow;

User constraints file

The UCF (User Constraints File) associates a “net” (an input or output port of our top-level design) with a “loc” (location), referring to a specific FPGA pin; individual bits of a bus are referenced with the <n> index syntax. The specific FPGA device available on the Spartan-3E Starter Kit is the XC3S500E-FG320-4 and the UCF (bstest.ucf) in this case should be as follows:


NET "sldsw<3>"  LOC = "N17";
NET "sldsw<2>"  LOC = "H18";
NET "sldsw<1>"  LOC = "L14";
NET "sldsw<0>"  LOC = "L13";

NET "button<3>" LOC = "V4";   # NORTH
NET "button<2>" LOC = "H13";  # EAST
NET "button<1>" LOC = "D18";  # WEST
NET "button<0>" LOC = "K17";  # SOUTH

NET "led<7>"    LOC = "F9"  | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<6>"    LOC = "E9"  | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<5>"    LOC = "D11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<4>"    LOC = "C11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<3>"    LOC = "F11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<2>"    LOC = "E11" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<1>"    LOC = "E12" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;
NET "led<0>"    LOC = "F12" | IOSTANDARD = LVTTL | SLEW = SLOW | DRIVE = 8;

Automation scripts

The top-level HDL design file and the UCF are all we would need if we intended to use the Xilinx ISE/XST GUI (I have version 14.6 installed). However, we can be much more productive if we resort to a few scripts to automate this part of the process.

First, a Makefile (xst.mk) based on Evgeni Stavinov’s article can be used for automating all the way to bitfile generation:


all: $(PROJECT).bit

floorplan: $(PROJECT).ngd $(PROJECT).par.ncd
	$(FLOORPLAN) $^

report:
	cat *.srp

clean::
	rm -f *.work *.xst
	rm -f *.ngc *.ngd *.bld *.srp *.lso *.prj
	rm -f *.map.mrp *.map.ncd *.map.ngm *.mcs *.par.ncd *.par.pad
	rm -f *.pcf *.prm *.bgn *.drc
	rm -f *.par_pad.csv *.par_pad.txt *.par.par *.par.xpi
	rm -f *.bit
	rm -f *.vcd *.vvp
	rm -f verilog.dump verilog.log
	rm -rf _ngo/
	rm -rf xst/

############################################################################
# Xilinx tools and wine
############################################################################

XST_DEFAULT_OPT_MODE = Speed
XST_DEFAULT_OPT_LEVEL = 1
DEFAULT_ARCH = spartan3
DEFAULT_PART = xc3s700an-fgg484-4

XBIN = $(XDIR)/bin/nt64
XST=$(XBIN)/xst
NGDBUILD=$(XBIN)/ngdbuild
MAP=$(XBIN)/map
PAR=$(XBIN)/par
TRCE=$(XBIN)/trce
BITGEN=$(XBIN)/bitgen
PROMGEN=$(XBIN)/promgen
FLOORPLAN=$(XBIN)/floorplanner

XSTWORK   = $(PROJECT).work
XSTSCRIPT = $(PROJECT).xst

IMPACT_OPTIONS_FILE   ?= _impact.cmd    

ifndef XST_OPT_MODE
XST_OPT_MODE = $(XST_DEFAULT_OPT_MODE)
endif
ifndef XST_OPT_LEVEL
XST_OPT_LEVEL = $(XST_DEFAULT_OPT_LEVEL)
endif
ifndef ARCH
ARCH = $(DEFAULT_ARCH)
endif
ifndef PART
PART = $(DEFAULT_PART)
endif

$(XSTWORK): $(SOURCES)
	> $@
	for a in $(SOURCES); do echo "vhdl work $$a" >> $@; done   

$(XSTSCRIPT): $(XSTWORK)
	> $@
	echo -n "run -ifn $(XSTWORK) -ifmt mixed -top $(TOP) -ofn $(PROJECT).ngc" >> $@
	echo " -ofmt NGC -p $(PART) -iobuf yes -opt_mode $(XST_OPT_MODE) -opt_level $(XST_OPT_LEVEL)" >> $@

$(PROJECT).bit: $(XSTSCRIPT)
	$(XST) -intstyle ise -ifn $(PROJECT).xst -ofn $(PROJECT).syr
	$(NGDBUILD) -intstyle ise -dd _ngo -nt timestamp -uc $(PROJECT).ucf -p $(PART) $(PROJECT).ngc $(PROJECT).ngd
	$(MAP) -intstyle ise -p $(PART) -w -ol high -t 1 -global_opt off -o $(PROJECT).map.ncd $(PROJECT).ngd $(PROJECT).pcf
	$(PAR) -w -intstyle ise -ol high $(PROJECT).map.ncd $(PROJECT).ncd $(PROJECT).pcf
	$(TRCE) -intstyle ise -v 4 -s 4 -n 4 -fastpaths -xml $(PROJECT).twx $(PROJECT).ncd -o $(PROJECT).twr $(PROJECT).pcf
	$(BITGEN) -intstyle ise $(PROJECT).ncd

$(PROJECT).bin: $(PROJECT).bit
	$(PROMGEN) -w -p bin -o $(PROJECT).bin -u 0 $(PROJECT).bit

So what happens here? The synthesis procedure invokes several Xilinx ISE command-line tools for logic synthesis and implementation, as described in the xst.mk Makefile found in the bstest main directory.

Typically, the process includes the following:

  • Generation of the *.xst synthesis script file.
  • Generation of the *.ngc gate-level netlist file in NGC format.
  • Building the corresponding *.ngd file.
  • Performing mapping using map, which generates the corresponding *.ncd file.
  • Performing place-and-route using par, which updates the corresponding *.ncd file.
  • Tracing critical paths using trce, which performs static timing analysis on the routed *.ncd file.
  • Bitstream generation (*.bit) using bitgen, with unused pins left at their default configuration.

As a result of this process, the bstest.bit bitstream file is produced.
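The $(XSTWORK) and $(XSTSCRIPT) rules in xst.mk can be hard to read through all the Make indirection. Here is a standalone shell sketch of what they produce for bstest, with all variables substituted by hand (illustrative only, not part of the original flow):

```shell
# Values as used by bstest-syn.sh below.
PROJECT=bstest
TOP=bstest
PART=xc3s500e-fg320-4
SOURCES="bstest.vhd"

# The .work file lists one "vhdl work <file>" line per source file.
> $PROJECT.work
for a in $SOURCES; do echo "vhdl work $a" >> $PROJECT.work; done

# The .xst file holds the single-line "run" script that xst consumes via -ifn.
{
  printf 'run -ifn %s.work -ifmt mixed -top %s -ofn %s.ngc' \
    "$PROJECT" "$TOP" "$PROJECT"
  echo " -ofmt NGC -p $PART -iobuf yes -opt_mode Speed -opt_level 1"
} > $PROJECT.xst

cat $PROJECT.work $PROJECT.xst
```

Running this yields a one-line bstest.work and a one-line bstest.xst, exactly the inputs that the `$(XST) -intstyle ise -ifn $(PROJECT).xst` step expects.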

Then, the Xilinx IMPACT tool is invoked with a batch command file named impact_s3esk.bat, which automatically passes the series of commands necessary for configuring the target FPGA device:


setMode -bs
setCable -p auto
identify 
assignFile -p 1 -file bstest.bit
program -p 1
exit

Each line provides a specific command to the IMPACT tool:

  1. Set mode to boundary-scan (JTAG).
  2. Set cable port detection to auto (tests various ports).
  3. Identify parts and their order in the scan chain.
  4. Assign the bitstream to the first part in the scan chain.
  5. Program the selected device.
  6. Exit IMPACT.

If using the Spartan-3AN starter kit board, the “program” command should be used with the -onlyFPGA command-line option, in order to program only the FPGA device.
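Instead of keeping two board-specific batch files, the command file can also be generated on the fly. A small shell sketch follows; the BOARD variable is my own illustrative choice, while the _impact.cmd output name matches the IMPACT_OPTIONS_FILE default in xst.mk:

```shell
# Generate the IMPACT command file; set BOARD=s3ansk to append -onlyFPGA
# for the Spartan-3AN starter kit (which also carries in-package flash).
BOARD=${BOARD:-s3esk}
PROGRAM_OPTS=""
if [ "$BOARD" = "s3ansk" ]; then PROGRAM_OPTS=" -onlyFPGA"; fi

cat > _impact.cmd <<EOF
setMode -bs
setCable -p auto
identify
assignFile -p 1 -file bstest.bit
program -p 1$PROGRAM_OPTS
exit
EOF

cat _impact.cmd
```

The resulting _impact.cmd can then be passed to impact.exe with the -batch option, just like impact_s3esk.bat below.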

Finally, a Bash shell script can be used as our entry point (and only needed access point) to the process. The corresponding synthesis script (bstest-syn.sh) can be edited in order to specify the following for adapting to the user’s setup:

  • XDIR: the path to the Xilinx ISE/XST installation; the command-line tools (xst.exe and friends) are expected under $XDIR/bin/nt64
  • ARCH: the specific FPGA architecture (device family) to be used for synthesis
  • PART: the specific FPGA part (device) to be used for synthesis

#!/bin/bash

# Change XDIR according to your host configuration.
export XDIR=/c/XilinxISE/14.6/ISE_DS/ISE
make -f xst.mk clean
make -f xst.mk PROJECT="bstest" \
  SOURCES="bstest.vhd" \
  TOPDIR="./log" TOP="bstest" \
  ARCH="spartan3" PART="xc3s500e-fg320-4"

# Invoke impact.exe for manual download of the generated bitstream to a 
# hardware platform.
${XDIR}/bin/nt64/impact.exe -batch impact_s3esk.bat

And that’s about it!

I have used this script with MinGW on Windows 7/64-bit; my setup assumes an ISE 14.6 installation. There might be a catch here: on some systems, there is a cable driver problem with ISE. To fix this issue, open a command prompt and uninstall, then reinstall the driver, as follows:

cd c:\XilinxISE\14.6\ISE_DS\ISE\bin\nt64
wdreg -compat -inf %cd%/windrvr6.inf uninstall
wdreg -compat -inf %cd%/windrvr6.inf install

 

Wrap-up

Following these steps, you will be able to synthesize the bstest design. You can download either the Spartan-3E version (bstest-s3esk.zip) or the Spartan-3AN one (bstest-s3ansk.zip) to start experimenting!

BTW, in case your antivirus program goes crybaby over impact_s3esk.bat, you can safely ignore it; it is just the Windows batch file for passing commands to IMPACT, nothing special about it :)

Ghosts of HLS past, present and future

This is mostly an adaptation of my position statement as requested by Brian Bailey Consulting.

  • The current state of HLS compared to the original expectations for it 10 years ago

I think that over these past few years, a lot of interesting developments have occurred in the HLS field, especially in the programmable/FPGA realm. Essentially, 3rd-generation high-level synthesis tools and environments made a successful, yet belated, entry into the FPGA market. Here, I’m arbitrarily making a distinction among 1st-generation HLS tools (academic endeavors of the 80s), 2nd-generation HLS tools that entered the ASIC market in the 90s (e.g. Behavioral Compiler), and the current generation with usable high-level language frontends, rich optimization portfolios, IP integration and verification facilities. For these tools, the entry bar has dropped significantly, from tens of thousands of USD to about 2-5k USD, and this really helps broader adoption. Technology vendors are not so much interested in selling their HLS tools as in selling entire platforms.

On the other hand, it is much more difficult for tools of this grade to penetrate the ASIC market, where design failures are much more costly. There are both software and hardware infrastructure issues. Most HLS offerings lack the design space exploration and analysis tools that would allow a safer and faster assessment of QoR (quality of results) across multiple design points.

 

  • Technological changes during the last 10 years. Where does the most potential lie?

I don’t think that 3rd-gen HLS tools encompass significant theoretical advances compared to what was achievable 10 or 15 years ago in core HLS; most of the theory (scheduling, resource sharing, retiming) was already there, and changes and improvements are incremental in effect. However, it is this new crop of tools, with usable C/C++/SystemC frontends targeting accessible FPGA platforms, that is starting to make a difference. There is potential for increased competition around MATLAB-to-hardware or Python-to-hardware flows, and this market will become increasingly important. Computation-wise, bioinformatics will be big, big data of course, as well as non-von Neumann computing, particularly neuromorphic computing, for instance to map the mammalian brain. Von Neumann machines will still be in use, basically as a convenience for emulating neurocomputers.

 

  • Is HLS a disruptive innovation that will shake the EDA industry? What has changed?

I think that HLS is starting to find its place within the ESL flow. I don’t believe that the prevailing view expects HLS to be disruptive. It seems that we never really lacked HLS, but the flow was not there (in this sense I agree with Gabe on EDA). A lot of things were missing (interfaces, integration, frontends, processes competitive with ASIC) for HLS to be disruptive in the past.

 

  • Estimating market size.

I would say that the estimated market has risen from a few million USD to maybe 30 to 50 million USD. In order to expand the pie, software-oriented engineers, for instance algorithm developers, must jump on the HLS wagon. The majority of these engineers work with MATLAB, Python (or CUDA or OpenCL), so the corresponding language frontends have to be implemented. FPGA/SoC system engineers are either already heavily using HLS or considering extending their use of HLS technology. The easiest converts are DSP engineers, who clearly see the benefits of HLS in their day-to-day work, e.g. implementing matrix algebra. However, trusting HLS all the way to tapeout is a different story; I still see lots of manual interaction layers following the initial HLS outcome, primarily for interface modifications, old-school optimizations and late adaptations (which current HLS tools should back-propagate effectively in the first place).

A question arises here: what will ultimately replace RTL? If HLS is the answer, then the market size will first grow to the limits of the current RTL-powered market, and will then contract once HLS has become a commodity. This transition will take about 12-15 years to complete; highly-customized functions such as device controllers will be the last stand.

 

  • Who benefits from HLS?

Semiconductor companies from the Far East are known to be early, faithful adopters of HLS. I think that HLS has played a small part in their success, primarily in reducing time-to-market. A number of IP vendors use either third-party or homebrew, partial HLS tools for streamlining IP development. Whenever an IP vendor releases a new non-trivial IP every one or two weeks, this is a typical sign of heavy HLS use :) Apart from these, HLS should find its way into high-performance computing applications. HPC applications are in most cases stencil codes, and the most troublesome part is to exploit and map task-level or processor-level parallelism. I think that there is much room in HLS for HPC scientific computing, so companies offering HPC (either large or small form factor) design/programming services have a lot to gain from HLS.

 

  • How has HLS progressed over the past 10 years?

For the most part, academic ideas are continuously transforming HLS. The key algorithms are established, but (for instance) polyhedral frameworks are just starting to be used in hardware compilers. Further, putting an intermediate representation at the heart of HLS is the right idea; frontends, backends, analyses and optimizations are naturally easier to extend and maintain. Exposing this representation might also benefit all parties.

 

  • HLS in 10 years from now.

There are still lots of interesting things to happen in HLS:

  1. Support for very high-level, functional, dynamic, and concurrency-oriented specifications as HLS input (there are multiple attempts towards these ends already)
  2. Transparent preoptimization through intelligent code refactoring
  3. The high-quality, extensible, open HLS toolflow: an LLVM for HLS. A key part is missing at the backend side of the flow; it is very difficult for open-source projects like VPR/VTR and Torc to keep up with process advances.
  4. HLS as a service: a superoptimizing hardware backend reusing its acquired knowledge for aggressive state optimizations running on the cloud.
  5. Better tools at all levels: early assessment, design-space exploration, and analysis tools in the HLS environment.
  6. Transparent development environments: ideally it will not be necessary to even know that we are using HLS, especially for hybrid, heterogeneous systems.

In 10-20 years from now, at or past the end of Moore’s law, HLS will be the preferred choice for squeezing all performance potential out of 5nm or 8nm silicon. These last processes will be around for a decade or so. Universities will not offer advanced courses on ASIC/FPGA design (no academic interest there); RTL will be hobbyist-only, fun nonetheless. Still, most of the theory will be usable on graphene, carbon nanotube, organic or bio- processes and, in general, on whatever else comes into prominence.

HLS tools: Portable generated HDL code is a must

This is a cross-post of my reply to Matthieu Wipliez of Synflow, a competitor, who made a point about the necessity of having HLS tools generate readable RTL code.

I totally agree with his position statement: generated HDL must be readable. Manual tracing by the human expert is still invaluable and should always be possible.

There are certain rules that should apply to generating portable, generic and readable HDL code:

1) Maintain program symbols (variables) in the generated code. Temporaries should follow certain, easy-to-follow naming conventions. I do this in HercuLeS.

2) Cross-tagging between source level, intermediate representation, graph-based representation and final HDL code. I’m currently investigating cross-tagging approaches since HercuLeS should support tracing among source (C w/wo GMP API for now), IR (N-Address Code), low-level IR (Graphviz CDFGs) and RTL VHDL.

3) Keep control steps (FSM/FSMD states) clear and visible. Consistent naming conventions can help (to associate states with the corresponding basic blocks or regions for instance). Again cross-tagging can make this more elegant.

4) Exploit the casual FSMD feature: do embed datapath actions into the next-state logic of your FSMD code. Don’t do it Vivado HLS style; the Vivado HLS approach is an ugly mess: datapath actions are thrown out in concurrent code form, and nobody can follow anything. There is more to pay here than the minor gains of easier resource sharing. X and A people, I’m talking to you: your backend tools are better than that; not much is lost in optimization if you embed datapath actions.

Vivado HLS vs HercuLeS

I’ve spent these last couple of days performing head-to-head comparisons of Xilinx Vivado HLS against HercuLeS on HLS-generated digital circuits (from input C code).

I believe that HercuLeS lived up to the challenge; it is competitive with Vivado HLS. The reader should take into account that:

  1. Both tools have been used (almost) out-of-the-box. Vivado HLS was configured with no BUFG inclusion, and in “out_of_context” mode. This means that no clock buffers were inserted and no I/O pins were routed.
  2. HercuLeS does not (yet) customize the generated HDL in order to better fit specific architectural features (DSP blocks, embedded SRL units).
  3. Vivado HLS had some TOTAL FAILURES on some relatively simple codes, such as a simple perfect number detector (positive integers equal to the sum of their proper divisors), a 1D wavelet code, and easter date calculation. It seems that Vivado HLS has a hard time with integer modulo/remainder. The codes are available to anyone interested.

The following table provides a summary of the results:

   #  Benchmark   Description                  Vivado HLS (VHLS)       HercuLeS                Comment
                                               LUTs  Regs  TET (ns)    LUTs  Regs  TET (ns)
   1  arraysum    Array sum                     102   132      26.5     103    63      73.3
   2  bitrev      Bit reversal                   67    39      72.0      42    40      11.6
   3  edgedet     Edge detection                246   130    1636.3     680   361    1606.4    1 BRAM for VHLS
   4  fibo        Fibonacci series              138   131      60.2     137   197     102.7
   5  fir         FIR filter                    102    52     833.4     217   140    2729.4
   6  gcd         Greatest common divisor       210    98      35.2     128    93      75.9
   7  icbrt       Cubic root approximation      239   207     260.6     365   201     400.5
   8  popcount    Population count               45    65      19.4      53   102      26.1
   9  sieve       Prime sieve of Eratosthenes   525   595    6108.4     565   523    3869.5    1 BRAM for VHLS
  10  sierpinski  Sierpinski triangle            88   163   11326.5     230   200   16224.9

NOTES:

  • Measurements were obtained for the KC705 development board device: xc7k325t-ffg900-2
  • TET is Total Execution Time in ns.
  • VHLS is a shortened form for Vivado HLS.
  • Vivado HLS 2013.1 was used.
  • Bold denotes smaller area and lower execution time.
  • Italic denotes an inconclusive comparison.
  • For the cases of edgedet and sieve, VHLS identifies a BRAM; HercuLeS does not. In these cases, HercuLeS saves a BRAM while VHLS saves on LUTs and FFs (Registers).

Overall, there are about 30% wins for HercuLeS and ~70% wins for Vivado HLS. Not too bad for a tool like HercuLeS, which produces generic, portable, vendor-independent code. I estimate that the HercuLeS development effort is around 1-5% of that of Vivado HLS.

I believe that, in the near future, HercuLeS will do much better in the out-of-the-box experience (which is of high importance in order to draw more software-minded engineers into the game).

Both HercuLeS and Vivado HLS have optimization features (e.g. loop unrolling): HercuLeS applies optimizations by using a source-to-source C code optimizer, while Vivado HLS mostly resorts to end-user directives. These coding aspects, which also yield a much more extensive solution space, will be taken into account in a followup comparison.

 

A few words on HercuLeS high-level synthesis

HercuLeS is a new high-level synthesis tool marketed by Ajax Compilers (http://www.ajaxcompilers.com). HercuLeS has been in development since 2009 and it seems that now is the proper time to hit the market :) Full disclosure: I’m the main (read: sole) developer of HercuLeS.

A free evaluation of HercuLeS is available. You can grab it by sending me an email (see either ajaxcompilers.com or nkavvadias.com for contact details).

HercuLeS is based on the following flow: C -> GIMPLE -> N-Address Code -> VHDL.

HercuLeS is extensible, since frontends, analyses and optimization passes can be added by third parties. At this moment, HercuLeS is bundled with a number of external modules for analyses and optimizations at the C, NAC (N-Address Code, its textual IR), Graphviz, and VHDL levels. It generates vendor-independent code, so the generated HDL descriptions are synthesizable (in principle) for either FPGA or ASIC targets.

It should be noted that certain things are still missing from HercuLeS and there is ongoing work to support them in the future. This is inevitable since our resources are somewhat limited. For instance there is no Verilog backend yet.

We are looking to establish close communication with our users. Our users provide inspiration and their requests drive future development. Criticism is well-accepted at Ajax Compilers :)