As a bit of a counterpoint: One of my prior projects involved working with a lot...

kyboren · on July 23, 2020

GPUs work great for accelerating many applications, and it's true that that reduces interest in FPGAs. For applications that map well to GPUs, you're absolutely correct that the higher clock speeds (and greater effective logic area) make GPUs superior as accelerators.

However, some applications do not map well to GPUs. Particularly those applications with a great deal of bit-level parallelism can achieve enormous speedups with bespoke hardware. For those applications where it doesn't make sense to tape out an ASIC, FPGAs are beautiful--even if they only operate at a few hundred MHz.

I think the "programming model" is actually the biggest barrier to wider adoption. Your comment is suffused with what I believe is the source of this disagreement: The idea that one programs an FPGA. One designs hardware that is implemented on an FPGA. The difference may sound pedantic, but it really is not. There is a massively huge difference between software programming and hardware design, and hardware design is downright unnatural for software developers. They are completely different skill sets.

On top of that add all the headaches that come with implementing a physical device with physical constraints (the article complains about P&R times but this is far from the only burden) and it becomes clear that FPGAs are quite frankly a massive pain in the ass compared to software running on CPUs or GPUs.

exmadscientist · on July 23, 2020

Very much this.

(Also, in general, FPGA tools are just some of the lowest quality garbage out there... and that is saying something. They're that bad. This is a completely unnecessary speedbump.)

The rebuttal to your objection is always tools like "HLS" (High-Level Synthesis), or in English it's "C to HDL" (FPGAs are 'programmed' in the two Hardware Description Languages VHDL (bad) or Verilog (worse, but manageable if you learn VHDL first).) These are not programming languages, they are hardware description languages. That means things like "everything in a block always executes in parallel". (Take that, Erlang?) In fact, everything on the chip always executes in parallel, all the time, no exceptions; you "just" select which output is valid. That's because this is how hardware works.

This model maps very, very poorly to traditional programming languages. This makes FPGAs hard to learn for engineers and hard to target for HLS tools. The tools can give you decent enough output to meet low- to mid-performance needs, but if you need high performance -- and if not, why are you going through this masochism? -- you're going to need to write some HDL yourself, which is hard and makes you use the industry's worst tools.
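A minimal Python sketch of the "everything executes in parallel" point, using hypothetical register names `a`, `b` and `sum` that are not from the comment above. It assumes a simple clocked model where every assignment reads the state from before the clock edge, and contrasts that with the same three statements run one after another as software would.

```python
def clock_edge(regs):
    """Hardware-style update: every right-hand side reads the OLD state."""
    return {
        "a": regs["b"],                # a <= b
        "b": regs["a"],                # b <= a
        "sum": regs["a"] + regs["b"],  # sum <= a + b
    }


def sequential_software(regs):
    """The same three statements executed one after another."""
    regs = dict(regs)
    regs["a"] = regs["b"]
    regs["b"] = regs["a"]              # already sees the new value of a
    regs["sum"] = regs["a"] + regs["b"]
    return regs


state = {"a": 1, "b": 2, "sum": 0}
hw = clock_edge(state)
sw = sequential_software(state)
print(hw)  # {'a': 2, 'b': 1, 'sum': 3}
print(sw)  # {'a': 2, 'b': 2, 'sum': 4}
```

The two results differ even though the "source" looks identical, which is roughly why a sequential language is an awkward fit for describing hardware.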

Thus, FPGAs languish.

jhj · on July 23, 2020

The biggest problem with HLS is that the HLS vendors still want to pretend it's "C++ / OpenCL / whatever to gates". What you get is pretending that there is no such concept of a clock even though you know it is always there and you care about it, and the language you are really writing consists mostly of all the crazy pragmas that you have to sprinkle over everything. It ends up failing on both counts: it isn't C++ to gates, and it is an exceedingly difficult HDL to use because it tries to hide the clock from you always even when you really need to do something with it (e.g., a handshake).

A weak spot of high-end commercial HLS tools (Catapult, Stratus) is interfacing with the rest of the hardware world, and in particular how the clock is handled: either you handle it yourself (SystemC) or it is handled only vaguely (Catapult's ac_channel). Getting HLS to deal with pipeline scheduling is great, but sometimes you want to break through and do something with the clock. Want to write a memory DMA in HLS? Talk AXI? Build a NoC in HLS? Build even something like a CPU in HLS? Interface with "legacy" RTL blocks, whether combinational or straight pipeline or with ready/valid interfaces or whatever? These things are sort of/just feasible at present with these commercial HLS tools, but very, very hard (I've tried it).

If they want to stick with it, I think C++11 could provide a superior type-safe metaprogramming facility for building hardware (compared to the extremely primitive metaprogramming and lack of type safety notions in SystemVerilog) or generators such as Chisel or the hand-written Perl/Python/TCL/whatever ones in use at most companies, but sometimes you need to break down and do something with the clock or interface with things that care about a clock, much in the same way that one would put inline asm statements in code. I want to do that, but not have to deal with the clock 95% of the time when I don't really need to, which is where the generators fail (let the tool determine the schedule most of the time). HLS needs to sit between the two: not a generator (glorified RTL), but not "pretend you write untimed C++ all the time" (not hardware at all).
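For readers who have not met a ready/valid handshake, here is a rough cycle-by-cycle Python model of one. It is only a sketch under the usual convention that a word transfers on a clock cycle where both signals are high; the function name and the signal patterns are made up for illustration and do not come from any HLS tool.

```python
def simulate_handshake(data, valid_pattern, ready_pattern):
    """Return (cycle, word) pairs for every cycle where valid and ready are both high."""
    received = []
    idx = 0  # index of the next word the producer wants to send
    for cycle, (wants_to_send, ready) in enumerate(zip(valid_pattern, ready_pattern)):
        valid = bool(wants_to_send) and idx < len(data)
        if valid and ready:
            received.append((cycle, data[idx]))
            idx += 1
        # otherwise the producer has to hold the same word for the next cycle
    return received


transfers = simulate_handshake(
    data=[10, 20, 30],
    valid_pattern=[1, 1, 0, 1, 1, 1],
    ready_pattern=[0, 1, 1, 1, 0, 1],
)
print(transfers)  # [(1, 10), (3, 20), (5, 30)]
```

The point is that the behaviour is defined per clock cycle, which is exactly the thing an untimed C++ description has no way to express.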

jcranmer · on July 23, 2020

Again, a counterpoint:

I worked on hardware for something akin to a FPGA on a much coarser granularity (kind of like coarse-grained reconfigurable arrays)--close enough that you have to adapt tools like place-and-route to compile to the hardware. The programming for this was mostly driven in pretty vanilla C++, with some extra intrinsics thrown in. This C++ was close enough to handcoded performance that many people didn't even bother trying to tune their applications by resorting to hand-coding in the assembly-ish syntax.

This helped bolster my opinion that FPGAs aren't really the answer that most people are looking for, and that there are useful nearby technologies that can leverage the benefits of FPGAs while having programming models that are on par with (say) GPGPU.

kyboren · on July 23, 2020

For sure. FPGAs are probably not the answer that most people are looking for. FPGAs are but one point in the trade-off space, and they're not one you jump to "just because".

> [...] there are useful nearby technologies that can leverage the benefits of FPGAs while having programming models that are on par with (say) GPGPU

I think CGRAs are really cool but they're even more niche, and I suspect your original point about GPUs eating everyone's lunch applies particularly strongly to CGRAs. The point is well taken, though, and I don't necessarily disagree.

panpanna · on July 23, 2020

> FPGA tools are just some of the lowest quality garbage out there

I think things are about to change thanks to yosys and other open source tools.

> VHDL (bad) or Verilog (worse,

VHDL (and its software counterpart Ada) are very well thought out and great to use once you get to know them (and understand why they are the way they are). Yeah, they are a bit verbose but I prefer a strong base to syntactic sugar.

adwn · on July 23, 2020

> VHDL (and its software counterpart Ada) are very well thought out and great to use once you get to know them (and understand why they are the way they are). Yeah, they are a bit verbose but I prefer a strong base to syntactic sugar.

As a professional FPGA developer: VHDL (and Verilog even more so) are bad [1] at what they're used for today: implementing and verifying digital hardware designs. In fact, they're at most moderately tolerable at what they were originally intended for: describing hardware.

[1] They're not completely terrible – a completely terrible idea would be to start with C and try to bend it so that you can design FPGAs with it...

roastsquirrel · on July 23, 2020

Parts of VHDL leave a little to be desired but overall I find it to be a really great language. To the extent I bought Ada 2012 by John Barnes and I kind of like that too after coding in C/C++ etc, but maybe I'm now biased after many years of VHDL coding :) It's not uncommon to see "VHDL is bad" and such like, and I do wonder what the reasons are for those comments.

adwn · on July 24, 2020

> It's not uncommon to see "VHDL is bad" and such like, and I do wonder what the reasons are for those comments.

VHDL is bad because it's bad at prototyping and implementing digital hardware [1]. One reason why it's bad at that task is the mismatch between the hardware you want and the way you have to describe it in the language. For example: You want a 32-bit register x which is assigned the value of a plus b whenever c is 0, and you want its reset value to be 25. VHDL code:

    signal x: unsigned(31 downto 0);
    ...
    process (clk, rst)
    begin
        if rst then
            x <= to_unsigned(25, x'length);
        elsif rising_edge(clk) then
            if c = '0' then
                x <= a + b;
            end if;
        end if;
    end process;

So you first have to translate the register you want into a clocked process, and then the synthesis software has to interpret the constructs you used according to some quasi-standard conventions, and will hopefully emit the hardware primitives you intended. I say "hopefully", because of the many, many footguns arising from those two translation steps.

[1] Okay, I concede that in theory, there might be a use case for which VHDL is perfectly suited, which would make VHDL a not-bad language. But designing digital hardware is not such a use case.
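For comparison, the intended behaviour of that register fits in a few lines of a behavioural Python model. This is only a sketch of the stated intent (32-bit register, reset value 25, loads a + b when c is 0); the class and method names are hypothetical, and it assumes the register starts out in its reset state, which real hardware does not guarantee.

```python
MASK32 = 0xFFFFFFFF   # keep values within 32 bits, like unsigned(31 downto 0)
RESET_VALUE = 25


class RegisterX:
    def __init__(self):
        # Assumption: the model starts as if reset had already been applied.
        self.x = RESET_VALUE

    def reset(self):
        """Asynchronous reset: takes effect immediately, no clock needed."""
        self.x = RESET_VALUE

    def rising_edge(self, a, b, c):
        """One rising clock edge: load a + b only when c is 0, else hold."""
        if c == 0:
            self.x = (a + b) & MASK32
        return self.x


reg = RegisterX()
print(reg.rising_edge(1, 2, c=0))           # 3
print(reg.rising_edge(5, 5, c=1))           # 3 (held, because c is not 0)
print(reg.rising_edge(0xFFFFFFFF, 1, c=0))  # 0 (wraps at 32 bits)
reg.reset()
print(reg.x)                                # 25
```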

panpanna · on July 24, 2020

Writing this with good intentions, not trying to start a fight...

---

There are some minor issues with your code that show you are probably a verilog/SV guy and not an experienced VHDL guy.

Please read Andrew Rushton's "VHDL for Logic Synthesis". I also recommend you read on VHDL's 9-valued logic and why it was designed this way and how it differs from Verilog's bit.

adwn · on July 24, 2020

> you are probably a verilog/SV guy and not an experienced VHDL guy

Wrong on both counts.

Please, enlighten me, what's wrong with my code? Note that it's in VHDL-2008, and the async. reset is intentional.

> I also recommend you read on VHDL's 9-valued logic and why it was designed this way

My main issue with VHDL is not the IEEE 1164 std_(u)logic, although it really doesn't help that this de-facto standard type for bitvectors and numbers (via the signed/unsigned types) is just a second-class citizen in the language – as opposed to bit and integer, which are fully supported syntactically and semantically, but which have serious shortcomings.

panpanna · on July 24, 2020

Inconsistent Boolean expressions + lack of familiarity with unsigned and how it is supported by the tools.

Nothing major, but in my books this is the difference between a Jr and a Sr designer. Nitpicking, yes. But the hardware business is like that.

adwn · on July 24, 2020

> lack of familiarity with unsigned and how it is supported by the tools

Do you mean this: "x <= to_unsigned(25, x'length);" ? Some tools, like Synopsys, allow "x <= 25;" here, but other tools, like ModelSim, do not. The VHDL-2008 standard does not allow "x <= 25;".

> Inconsistent Boolean expressions

Do you mean because I wrote "if rst ..." but later "if c = '0'..."? Come on, you're not nitpicking, you're trying to find issues where there are none. Fixating on such anal-retentive details does not make you a "Sr designer", it makes you a bad engineer.

exmadscientist · on July 24, 2020

As someone who just said that exact thing upthread, half of it is general curmudgeonry. VHDL is not a terrible language, though it does have terrible tools. The IDE side of things is a big opportunity to improve the language. Making refactoring easier by not needing to manually touch up three different files to fix one name is a huge help. (And the IDEs have probably improved in recent times; I've done mostly hardware recently.) The compilers/synthesizers... those are vendor crud and so dragons lie there. VHDL-2008 support would go a long way to improving life....

froh · on July 24, 2020

If IDE support for basics is an issue, like consistent renaming, then language server protocol support will help:

https://github.com/ghdl/ghdl-language-server

Edit: typo in url

tinus_hn · on July 24, 2020

So what’s better?

fanf2 · on July 26, 2020

I've heard good things about Bluespec. It is used for Cambridge's CHERI capability architecture extensions, for example - https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

kyboren · on July 23, 2020

> The rebuttal to your objection is always tools like "HLS"

Yup. I know HLS has gotten a lot better recently but my impression is that, somewhat like fusion, HLS as a first-class design paradigm is always a decade away.

> FPGA tools are just some of the lowest quality garbage out there

Absolutely. I think the problem is vendors see FPGA tooling as a cost center and a necessary evil in order to use their real products, the chips themselves. Users are also highly technical and traditionally have no alternative, so (mostly) working but poor-quality software is simply pushed out the door. "They'll figure it out".

Finally, to expand on the difficulties imposed by physical constraints, I think another huge blocker to wide adoption is that FPGAs are physically incompatible. I cannot take a bitstream compiled for one FPGA and program it to any other FPGA. Hell, I can't even take a bitstream compiled for one FPGA and use that bitstream for any other device in the same device family. Without some kind of standardized portability, FPGAs will remain niche devices used only for very specific applications.

s_gourichon · on July 23, 2020

> cannot take a bitstream compiled for one FPGA and program it to any other FPGA.

Like considering dumping memory content on a PC and reinject it on another with different RAM layout and devices and complaining the OS and programs can't continue running? Is that a sane expectation?

There are upstream formats targeting FPGAs that can be shared, although yes redoing place and route is slow.

Should manufacturers provide new formats closer to final form yet would allow binaries that can be adjusted, kind of like .a .so or even llvm?

Alternatively, would building whole images for many families of FPGA make sense? Feels like programs distributed as binaries for p OS variants times q hardware architectures, each producing a different binary... random example https://github.com/krallin/tini/releases/tag/v0.19.0 has 114 assets.

ianhowson · on July 23, 2020

> bitstream ... Is that a sane expectation?

No. Bitstream formats are not in any way compatible across devices. Because timing is a factor, even if you had the same physical layout of LUTs and routing, it's unlikely that your design would work.

(From parent)

> use that bitstream for any other device in the same device family

Not at the bitstream level. However, you can take a place&routed chunk of logic and treat it as a unit. You can replicate it (without repeating P&R), move it around, copy it onto other devices in the same family. This is super useful as most FPGA applications have large repeating structures, but P&R doesn't know that it's a factorable unit. It'll repeat P&R for each instance and you'll get unpredictable timing characteristics.

> Should manufacturers provide new formats closer to final form yet would allow binaries that can be adjusted, kind of like .a .so or even llvm?

> would building whole images for many families of FPGA make sense

You can license libraries that are a P&R'd blob and drop them into your design. There's no easy way to make this generalizable across devices without shipping the original RTL, and conversion from RTL->bitstream is where most of the pain lies.

kyboren · on July 23, 2020

> Like considering dumping memory content on a PC and reinject it on another with different RAM layout and devices and complaining the OS and programs can't continue running? Is that a sane expectation?

Even worse; it's more like that plus extracting the raw microarchitectural state of a CPU, serializing it in a somewhat arbitrary way, trying to shove that blob into a different CPU and still expecting everything to continue running.

I'm not necessarily complaining, just pointing out this significant difference WRT software programs running on CPUs.

> There are upstream formats targeting FPGAs that can be shared, although yes redoing place and route is slow.

Can you show me an example? I'd like to see this. You do not mean FPGA overlays, correct?

> Should manufacturers provide new formats closer to final form yet would allow binaries that can be adjusted, kind of like .a .so or even llvm?

Like you say, at the very least you will need to re-do place and route. But actually the problem is much worse than this. Different FPGAs have different physical resources. Not just differing amounts of logic area, but different amounts of block RAM, different DSP blocks and in varying numbers, high-speed transceivers, etc. This necessitates making different design trade-offs. Simply shoehorning the same design into different FPGAs, even if it were kind of possible, will not work well.

> Alternatively, would building whole images for many families of FPGA make sense?

Currently I think that's the only real option. But the extreme overhead, duplication of effort and maintenance burden make it very unattractive.

My napkin sketch is some sort of generalized array of partial reconfiguration regions with standardized resources in each region. Accelerator applications can distribute versions targeting different numbers of regions (e.g. one version for FPGAs supporting up to 8 regions, one for FPGAs supporting up to 16 regions, etc.). The FPGA gets loaded with a bitstream supporting a PCIe endpoint and management engine, and some sort of crossbar between regions. At accelerator load time, previously mapped, placed, and routed logical regions used in the application are placed onto actual partial reconfiguration regions and connections between regions are routed appropriately. The idea is to pre-compute as much of the work as possible, leaving a lower dimension problem to solve for final implementation. Timing closure and clock management are left as exercises for the reader :P.

monocasa · on July 23, 2020

> Can you show me an example? I'd like to see this. You do not mean FPGA overlays, correct?

Some of the coolest work to come out of the Chisel project is their intermediate representation FIRRTL.

phkahler · on July 23, 2020

Not sure why they think chip details and bitstreams need to be kept secret. If they would open up, people would make better tools for them.

imtringued · on July 24, 2020

Because competitors could make compatible chips.

vzidex · on July 23, 2020

> I think the problem is vendors see FPGA tooling as a cost center and a necessary evil

Yes to a degree, but another part of the problem is the "physical constraints" you mention. FPGA tooling has to solve multiple hard problems, on the fly, at large scale (some of the latest chips are edging up to 10M logic elements). Unfortunately for the FPGA industry, I think that this is unavoidable - though a lot of interesting work is being done around partial reconfiguration, which should allow for users to work with smaller designs on a large chip.

kyboren · on July 23, 2020

Well, that's an explanation for why FPGA compilation flows take so much time, but it's not a good explanation for why the software is so crap.

I think partial reconfiguration is really sexy, but it's been around for a long time. What's new and exciting there? Genuinely curious.

qppo · on July 23, 2020

> HLS as a first-class design paradigm is always a decade away.

What about Chisel?

henrikeh · on July 23, 2020

Chisel is not an HLS tool. Chisel is much closer to VHDL and Verilog, since the hardware is directly described.

qppo · on July 23, 2020

Chisel would allow me to write say, a codec algorithm and compile it into hardware, correct? As well as specify the hardware that is necessary to describe it?

I'm a casual in that space but I thought Chisel was an HDL that could be used to support HLS.

henrikeh · on July 24, 2020

And you do the same in VHDL and Verilog. And like in Chisel, you have to manually pipeline it and you can exactly control where registers are used and how resources are reused.

You could build something HLS like using Scala/JVM and Chisel, but Chisel itself is much closer to traditional HDLs.

https://en.m.wikipedia.org/wiki/High-level_synthesis

seldridge · on July 24, 2020

> These are not programming languages, they are hardware description languages.

There's a subtle point in that Verilog/SystemVerilog and VHDL are also just not powerful languages. While parametric, they lack polymorphism, object oriented programming (excluding SV simulation-only constructs), functional programming, etc.

Your point about the abstraction being different is well taken---hardware description languages describe circuits and programming languages describe programs. However, it's exceedingly unfortunate that the industry is stuck in a rut of such weak languages and trying to explain that weakness to hardware engineers, who haven't seen anything else, runs into the "Blub paradox" (e.g., a programmer who only knows assembly can't evaluate the benefits of C++). [^1]

[^1]: http://www.paulgraham.com/avg.html

mikevin · on July 25, 2020

While there's plenty of room to improve a language like Verilog I fail to see how these paradigms would help me in RTL. What would polymorphism even look like in an environment without a concept of runtime? Can you elaborate and enlighten me?

Edit: Disclaimer, I'm well aware of the pros and cons of these paradigms in software development and use them plenty

seldridge · on Aug 4, 2020

(Sorry! Just saw this!)

Polymorphism makes it way easier to build hardware that can handle any possible data type. Things like queues and arbiters beg for type parameters (you should be able to enqueue any data). Without polymorphism you can make something parameterized by data width (and then flatten/reconstruct the data), but it's janky and you lose any concept of type safety (as you're "casting" to a collection of bits and then back).
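The difference can be sketched in Python with a generic container. This is an analogy rather than hardware code: `Fifo`, `Pixel`, `pack` and `unpack` are hypothetical names, and Python's type hints are only enforced by an external checker such as mypy. The first queue is parameterized by element type; the second is the "width only" style, where data is flattened to bits and rebuilt on the other side.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Generic, TypeVar

T = TypeVar("T")


class Fifo(Generic[T]):
    """A bounded queue parameterized by the type of its elements."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self._items: Deque[T] = deque()

    def enqueue(self, item: T) -> bool:
        if len(self._items) >= self.depth:
            return False  # full: the caller has to retry later
        self._items.append(item)
        return True

    def dequeue(self) -> T:
        return self._items.popleft()


@dataclass(frozen=True)
class Pixel:
    r: int
    g: int
    b: int


def pack(p: Pixel) -> int:
    """Flatten a pixel into a 24-bit word."""
    return (p.r << 16) | (p.g << 8) | p.b


def unpack(bits: int) -> Pixel:
    """Rebuild a pixel from a 24-bit word, whatever that word really was."""
    return Pixel((bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF)


typed: Fifo[Pixel] = Fifo(depth=4)
typed.enqueue(Pixel(1, 2, 3))
print(typed.dequeue())        # Pixel(r=1, g=2, b=3)

raw: Fifo[int] = Fifo(depth=4)
raw.enqueue(pack(Pixel(1, 2, 3)))
raw.enqueue(0xABCDEF)         # not a pixel, but nothing can object
print(unpack(raw.dequeue()))  # Pixel(r=1, g=2, b=3)
print(unpack(raw.dequeue()))  # Pixel(r=171, g=205, b=239)
```

With the typed queue a checker can reject a non-pixel at the call site; with the bit-vector queue the mistake only shows up as wrong data downstream.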

There was some interesting work out of the University of Washington [^1] to build a "standard template library" using SystemVerilog. Polymorphism was identified as one of the shortcomings that made this difficult (Section 5: "A Wishlist for SystemVerilog"). [^2]

[^1]: https://github.com/bespoke-silicon-group/basejump_stl

[^2]: http://cseweb.ucsd.edu/~mbtaylor/papers/BaseJump_STL_DAC_Sli...

imtringued · on July 24, 2020

Just let those programmers play around with Redstone in Minecraft before you hand them an FPGA. They'll understand it very quickly.

Stubb · on July 23, 2020

Another big advantage of FPGAs is low latency and the ability to hit precise timing deadlines. When working with radio hardware, you still need an FPGA for automatic gain control calculations and recording/playing out samples. Similarly, you need to do your CRC and other calculations in an FPGA if you need to immediately respond to incoming signals, such as the CTS->RTS->DATA->ACK exchange in 802.11.
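To give a feel for why a CRC suits hardware, here is a bit-serial CRC-32 in Python. It is a sketch of the common reflected CRC-32 (the variant `zlib.crc32` computes, which is used here as the cross-check), not code from any radio stack. Each inner-loop step is just a shift and a conditional XOR, which in an FPGA becomes a handful of XOR gates around a register rather than a sequence of instructions.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Reflected CRC-32, processed one bit at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320  # reflected polynomial
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


message = b"123456789"
assert crc32_bitwise(message) == zlib.crc32(message)
print(hex(crc32_bitwise(message)))  # 0xcbf43926
```

In software the cost grows with every bit processed; in hardware the same logic can be replicated to handle a whole word per clock, so the result is ready essentially as the last bit arrives.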

daxfohl · on July 23, 2020

I think that's the big advantage of FPGA. If you need acceleration to hit a 10 microsecond latency target, FPGA is what you need. If your latency target is like a millisecond or longer, then GPU can handle a lot more throughput. But GPU can't typically give you a 10-us guarantee.

Okay, bit-banging is another advantage of FPGA that GPU doesn't do as well. There are a few things.

inaccel · on July 24, 2020

Regarding DNN inference, FPGAs can provide low latency AND higher throughput than GPUs.

If you want to compare apples-to-apples, we have done a comparison with realistic (and not synthetic) data regarding the performance of GPUs and FPGAs.

https://medium.com/@inaccel/faster-inference-real-benchmarks...

daxfohl · on July 24, 2020

Ugh, ad spam taking over HN.

cmrdporcupine · on July 23, 2020

See it's funny, I (software guy) have recently started doing a bunch of FPGA stuff on the side for "fun" and I find the programming model to not be the biggest challenge.

The tools, yes, because it seems like hardware engineers have a fetish for all-encompassing painful vendor specific IDEs with half the features that us software developers have, and with a crapload of vendor lock-in... but I digress.

I find working in Verilog to be pretty pleasant. Yes I can see that with sufficient complexity it wouldn't scale out well. But SystemVerilog does give you some pretty good tools for managing with modularity.

On the other hand, I've never particularly enjoyed working with GPUs, CUDA, etc.

So I would agree with your statement that the structural issues prevent their utility in wider market classes -- and those really are as you say ... lower clock speeds, cost, but also vendor tooling.

FPGAs could really do with a GCC/LLVM type open, universal, modular tooling. I use fusesoc, which is about as close to that as I will get (declarative build that generates the Vivado project behind the scenes), but it's not perfect, still.

jjoonathan · on July 23, 2020

I don't mean to belittle your exploration, but are you sure it's an apples-to-apples comparison? This suggests to me that it isn't:

> it seems like hardware engineers have a fetish for all-encompassing painful vendor specific IDEs

Hardware engineers feel pain just like you do. The reason why they put up with those awful software suites is because they have features they need that aren't available elsewhere. In particular, they interface with IP blocks and hard blocks, including at a debug + simulation level. Those tend to evolve quickly and last time I looked -- which admittedly was a while ago -- the open source FPGA tooling pretty much completely ignored them, even though they're critical to commercial development.

If you are content to live without gigabit transceivers, PCIe controllers, DRAM controllers, embedded ARM cores, and so on, I suspect it would be relatively easy to use the open source tooling, but you would only be able to address a small fraction of FPGA applications.

cmrdporcupine · on July 24, 2020

Vivado ships all kinds of "IP" for those things, yes. And once you get past the GUI wizards, drag and drop boxes and lines, and Tcl scripts you find in the end it's just a library of Verilog, all mangled to the point of illegibility.

I wasn't talking about open sourcing. I accept we won't have open source DRAM controllers and the like from them. I understand the licensing restrictions. I just don't like how they force all this stuff to be gatewayed through their baroque and over complicated GUI tools.

I prefer tools that are scriptable, that can work with the build system of my choice, that work properly with source control (imagine that!), where you have your choice of editor rather than having their garbage one rammed down your throat, where there are whiz-bang features like reformatting and auto-indentation... Hell, even refactoring.

Vivado and Quartus just get in the way. There's no reason to tie all the stuff you're talking about into an integrated tool. They could just ship libraries.

Fusesoc does in fact try to make them behave this way. But you can tell it's a bit of a war to make it happen.

jjoonathan · on July 24, 2020

Well yes, they shouldn't cram the awful GUI tools down HW engineers' throats, but they do.

I'm glad Fusesoc is fighting the good fight and I'm glad you're fighting the good fight, but as you point out, it's definitely a fight. It was hardly fair to call the desire to avoid said fight a "fetish."

cmrdporcupine · on July 24, 2020

I can only assume hardware engineers are asking for this kind of tooling, because I can't imagine why companies would be spending the enormous development effort on them and then giving them away for free if they weren't being asked for?

So many things that could be done in a programmatic, testable, declarative, scriptable, repeatable way are done with futzy GUI tools in hardware land. Schematic design _could_ be a matter of declaring components, buses, etc. and letting the tool produce something (and then manually manipulating the visual layout if necessary); I mean you could literally describe your board using something similar to Verilog and get the tool to produce the schematic for you... we have these kinds of powers in the 21st century. Instead, you futz with tools that are vaguely Illustrator-esque, find that half your connection points are not actually connected, etc. Why do people want to suffer like this?

Want to use a DRAM controller in Vivado? Find the wizard, enter into 10 text boxes... and if you're lucky you can find the Tcl scripts it generated and in the future just write your Tcl script... but they certainly won't make it easy.

Vivado project in source control? You're going to jump through hoops for that.

I want hardware engineers to demand better.

imtringued · on July 24, 2020

> the open source FPGA tooling pretty much completely ignored them, even though they're critical to commercial development.

"ignored" as in the vendors aren't cooperating with the developers of the open source tools? What the open source tools are doing is hard enough as is. When you consider how fragmented FPGA chips are it's difficult to support a wide variety of them even if you wanted.

jjoonathan · on July 24, 2020

I'm not blaming the open source devs at all. I admire them greatly. Unfortunately, it's one thing to admire someone greatly and quite another to believe they have a compelling offering.

andromeduck · on July 24, 2020

Please then explain why there is still no standardized synthesizable subset for Verilog. Even C/C++ at its worst was never this absurd.

tieze · on July 23, 2020

LLVM folks have actually just started on such tooling: CIRCT. With Chris Lattner at the helm, and industry players like Xilinx and Intel seemingly on board.

daxfohl · on July 23, 2020

Agreed. I never thought the mental leap to Verilog was a big hurdle. It's just C-like syntax with some new constructs around signaling and parallelism. I found this interesting rather than foreboding.

The main challenge I had was compilation time. It can sometimes take overnight to compile a simple application if there's a lot of nested looping, only to have it run out of gates. This can be a royal pain.

I'd expect most HPC scenarios would have lots of nested looping, and probably memory accesses, and thus have to spend a lot of time writing state machines to get around gate count limitations and wait for memory responses, at which point you're basically designing a 200 MHz CPU.

So I don't see it as being very useful for general purpose acceleration, but could be a good CPU offload for some very specific use cases that are more bit-banging than computing. Azure accelerates all its networking via FPGA, which seems like the ideal use case.

CamperBob2 · on July 24, 2020

There's no such thing as a "loop" on an FPGA. If you declare a loop in Verilog, the synthesizer allocates one set of gates per iteration. That's probably why your runs take all night.

HLS notwithstanding, you don't use traditional control structures to tell an FPGA what to do. You use clocked FSMs and asynchronous expressions to tell it what to be.
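A toy Python "unroller" shows what happens to a loop. It is not how any real synthesizer works internally; the netlist format and names are invented for illustration. The loop that sums N inputs never runs on the chip; it runs once, at build time, and leaves behind one adder instance per iteration.

```python
def unroll_sum(n_inputs):
    """Return the adder instances produced by unrolling a sum over n_inputs."""
    netlist = []
    acc = "in0"
    for i in range(1, n_inputs):       # this loop runs at build time only
        out = f"s{i}"
        netlist.append(("ADD", acc, f"in{i}", out))
        acc = out
    return netlist


for n in (4, 16, 64):
    print(n, "inputs ->", len(unroll_sum(n)), "adders")
# 4 inputs -> 3 adders
# 16 inputs -> 15 adders
# 64 inputs -> 63 adders
```

Nest a few such loops and the instance count multiplies, which is one plausible way a small-looking source file ends up running out of gates.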

daxfohl · on July 24, 2020

Right. But for HPC, loops (in Verilog) will be the norm, to squeeze out as much from each clock tick as possible. Running everything as discrete steps in a FSM would defeat the purpose.
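The trade-off being argued here can be put in numbers with a small Python model. It is a deliberately crude sketch: the function names are hypothetical, and it ignores whether a long chain of adders could actually meet timing within a single clock period.

```python
def unrolled(values):
    """Fully unrolled loop: one adder per addition, result after a single clock."""
    return {"result": sum(values), "adders": len(values) - 1, "clocks": 1}


def time_multiplexed(values):
    """FSM style: one shared adder, one addition per clock tick."""
    acc = 0
    clocks = 0
    for v in values:   # each iteration stands for one clock tick
        acc += v
        clocks += 1
    return {"result": acc, "adders": 1, "clocks": clocks}


samples = list(range(1, 9))
print(unrolled(samples))          # {'result': 36, 'adders': 7, 'clocks': 1}
print(time_multiplexed(samples))  # {'result': 36, 'adders': 1, 'clocks': 8}
```

Both produce the same answer; the unrolled form spends area to save clock ticks, and the FSM form does the opposite.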

lnsru · on July 23, 2020

It’s not the speed that holds FPGA adoption back. It’s the development process and time. While one can start with a GPU immediately, an FPGA first needs a whole PCIe infrastructure and efficient data movers to be developed. One is done with the GPU while the FPGA developers are just starting on the algorithms. As long as one does not need real-time capability, a GPU is the obvious choice. My 200 MHz design outcompetes every CPU and GPU out there with a very narrow data processing window, but development time is 5x that of regular software.

sfgnilnio · on July 23, 2020

You ever work with an FPGA? The programming model and the tooling are a huge part of the problem.

Verilog and VHDL have basically nothing in common with any language you've ever used.

Compilation can take multiple days. This means that debugging happens in simulation, at maybe 1/10000th of the desired speed of the circuit.

If you try to make something too big, it just plain won't fit. There is no graceful degradation in performance; an inefficient design will just not function, come Hell or high water.

The existing compilers will happily build you the wrong thing if you write something ill-defined. There are a ton of things expressible in a hardware description language that don't actually map onto a real circuit (at least not one that can be automatically derived). In any normal language anything you can express is well-defined and can be compiled and executed. Not so in hardware.

Timing problems are a nightmare. Every single logic element acts like its own processor, writing directly into the registers of its neighbours, with no primitives for coordination. Imagine if you had to worry about race conditions inside of a single instruction!

Maybe even if all these problems were solved FPGAs still wouldn't catch on, but let's not pretend the programming model isn't a problem. Hardware is fundamentally hard to design and the tooling is all 50 years out of date.

formerly_proven · on July 23, 2020

> You ever work with an FPGA? The programming model and the tooling are a huge part of the problem.

I'd argue FPGAs aren't programmed and don't have a programming model. Complaints that the programming model of FPGAs holds their adoption back are thus conceptually ill-founded. (The tooling still sucks).

alfalfasprout · on July 23, 2020

I mean, the problem is that in the FPGA world the tooling and synthesis languages are inextricably linked. HLS is an approach that, IMO, is also the completely wrong direction since a general purpose programming language like C/C++ won't map nicely to the constructs you need in FPGA design.

What we really need is a lightweight, open source toolchain for FPGAs and one or more "higher level" synthesis languages. I've always wondered if a DSL using a higher language like Python isn't a better way to do this. Rather than try to transpile an entire language, just provide building blocks and interfaces that can then be used to generate verilog/VHDL.
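
As a rough illustration of the "building blocks that generate Verilog" idea, here is a toy Python generator. It is a sketch under my own assumptions, not the API of any existing tool: `adder_module` and its parameters are hypothetical names, and a real DSL would build a design tree rather than paste strings together.

```python
def adder_module(name, width):
    """Return Verilog source for a registered adder of the given bit width."""
    if width < 1:
        raise ValueError("width must be at least 1")
    hi = width - 1
    lines = [
        f"module {name} (",
        "    input  wire clk,",
        f"    input  wire [{hi}:0] a,",
        f"    input  wire [{hi}:0] b,",
        f"    output reg  [{width}:0] sum",   # one extra bit for the carry
        ");",
        "    always @(posedge clk)",
        "        sum <= a + b;",
        "endmodule",
    ]
    return "\n".join(lines)


print(adder_module("add8", 8))
```

The point is only that the host language handles parameterization and composition, while the emitted text stays ordinary Verilog that existing tools can consume.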

bb88 · on July 23, 2020

> What we really need is a lightweight, open source toolchain for FPGAs and one or more "higher level" synthesis languages.

nMigen: python based DSL to verilog translator

LiteX: Open source gateware

SymbiFlow: Open source verilog compiler + PnR tooling.

There's a Linux kernel running on LiteX and a RISC-V core running on an ECP5 out on the internets.

A MicroPython version running on a RISC-V core and Migen (the earlier version of nMigen) can also be found here: https://fupy.github.io/

jlokier · on July 23, 2020

> I've always wondered if a DSL using a higher language like Python isn't a better way to do this

Like this? http://www.myhdl.org/

bb88 · on July 23, 2020

nMigen for python is where it's at these days.

https://github.com/m-labs/nmigen

tyingq · on July 23, 2020

There is another traditional FPGA use case where you need real time data capture or signal generation. That seems to be getting eaten from the bottom now that there are really high speed MCUs that are easier to program. It's less efficient, but easier to develop for.

exmadscientist · on July 23, 2020

The other problem with using an FPGA here is that microcontrollers are cheap and have great cheap dev boards. FPGAs, not so much. I've wanted to just "drop in" a small FPGA in several designs, the way you can drop in a microcontroller, but there's no available FPGA that's not a massive headache in that use case. Trust me, I've looked.

The iCE40 series is almost there but not quite. It's a bit pricey (this is sometimes okay, sometimes a dealbreaker) but its care and feeding is too annoying. Who wants to source a separate configuration memory? Sometimes I don't have the space for that crap.

If any company can bring a small, cheap, low power FPGA to the market, preferably with onboard non-volatile configuration memory, a microcontroller-like peripheral mix (UART, I2C, SPI, etc.), easy configuration (re)loading, and with good tool and dev board support, they'll sell a lot of units. They don't even have to be fast!

gvb · on July 23, 2020

The MiniZED is $89 and a ton of fun! It has an ARM processor (Xilinx Zynq XC7Z007S SoC), Arduino compatible daughterboard connectors, microcontroller-like peripheral mix, and runs linux.

http://zedboard.org/product/minized

https://www.avnet.com/shop/us/products/avnet-engineering-ser...

Oh, and Vivado (the FPGA development IDE) is free (as in beer) for that FPGA as well as Xilinx' other mid to low end FPGAs.

exmadscientist · on July 24, 2020

The XC7Z007S is $46 in volume at distributors (though with no volume discounts; Xilinx pricing is weird).

Zynq chips are beautiful parts. But they are not "low-cost drop-in" anything. They are chips that you can architect an entire system around and replace a dozen other chips with. I know; I've done it. (But they didn't bite on our proposal, so my sketched architecture remained just a detailed sketch.)

GeorgeTirebiter · on July 24, 2020

In my last project, I just bit-banged a port to load up the configuration bits in a 4K iCE40, something like 131KBytes; this was just a .h file that was included in the bit-banger; the static array ended up in Flash (the ST MPU had 2 MB flash, so no problem), and it only took a second or so to load the FPGA bits before it was ready-to-go. So, from my perspective, what you describe is already here. If even that's too much trouble, there's always TinyFPGA BX https://tinyfpga.com/ You can use the open source yosys or you can use Synplify and the Lattice dev system, which is free w/free license.
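
For readers who haven't written one, the core of a bit-banged configuration loader looks roughly like the sketch below. It is Python rather than the firmware C the comment implies, the GPIO is faked with a dictionary, and the bit order (MSB first) and sampling edge (rising) are assumptions to be checked against the Lattice configuration documentation. The reset and "done" handshaking around the transfer is omitted, and all names are hypothetical.

```python
def bitbang_bytes(data, set_clk, set_data):
    """Shift bytes out MSB first, one data bit per rising clock edge."""
    for byte in data:
        for bit in range(7, -1, -1):
            set_clk(0)                    # clock low while data changes
            set_data((byte >> bit) & 1)   # present the next bit
            set_clk(1)                    # rising edge: receiver samples


# Stand-in for real GPIO: record the data level seen at each rising edge.
sampled = []
pins = {"clk": 0, "data": 0}


def set_data(level):
    pins["data"] = level


def set_clk(level):
    if level == 1 and pins["clk"] == 0:
        sampled.append(pins["data"])
    pins["clk"] = level


bitstream = bytes([0xA5])   # placeholder byte, not a real iCE40 image
bitbang_bytes(bitstream, set_clk, set_data)
print(sampled)  # [1, 0, 1, 0, 0, 1, 0, 1]
```

On real hardware the two callbacks would write GPIO registers and `bitstream` would be the static array from the included header.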

exmadscientist · on July 24, 2020

Dropping in a midsize MCU with 256kB of Flash just to program a single FPGA is not viable in a margin-constrained commercial product. It works great if it's already there, of course, but the applications I'm thinking of have been the ones where it isn't.

Not to mention there are many FPGA applications where one purpose of the FPGA is to avoid having software in the path. If software is only responsible for configuration load, it's better, but still can be a problem.

ncmncm · on July 24, 2020

Crowd Supply has an endless variety of hobbyist-friendly variously FPGA / USB / MCU / PCIE / SDR combination boards.

It's ridiculous for anybody to insist that programming an FPGA isn't writing software. By definition, anything you can put in a text file that ends up controlling what some piece of hardware does is software. Probably almost all of what is wrong with FPGA ecosystems comes from failure to treat it like software.

It's not much like your typical C program, but that's a very parochial viewpoint. The languages available to program FPGAs in are abysmal, a poor match to the hardware: actually too much like ordinary programming languages, to their detriment. A person who makes an FPGA do something is going to be an engineer, and to an engineer any microprocessor and any FPGA are just two different state machines. Somebody who studied "computer science" will be disoriented, but that is just because the field has narrowed, as network effects pared down the field of computing substrates until practically nothing is left.

FPGAs emulating ASICs or von Neumann CPUs is the greatest waste of potential anywhere. If the architecture of (some) FPGAs could be elucidated, it could fuel a renaissance of programming formalisms. We could begin to program them in a language actually well-suited to the task, and vary their configuration in real time according to the instantaneous task at hand.

exmadscientist · on July 24, 2020

FPGAs aren't state machines or processors. Not inherently, anyway, even if you can build those things out of them or if they sometimes are sold co-packaged.

And their internal architecture is pretty well documented. See, for example, the Spartan-6 slices: https://www.xilinx.com/support/documentation/user_guides/ug3...

What's less well documented, at least publicly, is the routing, but on some level that's less interesting since it's "just" how you get the electrons from point A to point B, not about choosing A or B. But even the routing is decently well described, though you have to look in some fairly obscure places (like the device floorplan viewer).

I'm not sure why you think FPGAs emulating ASICs is a "waste of potential". By definition, ASICs are strictly more capable and more powerful than FPGAs, so you're climbing up the potential ladder, not down!

ncmncm · on July 24, 2020

Why? Because ASICs do one thing from the first time they are powered up until they are finally ground up into sand. But an FPGA could, if programmed right, do completely different things from one millisecond to the next. Their ability to do that is never exploited because our tooling is still much too primitive, and current devices' internal connectivity probably can't route signals to the places needed.

If you think an FPGA is not inherently and necessarily a state machine, no matter how it is programmed (provided power and clock are in specified bounds), that only means you don't know what a state machine is. All clocked digital devices are state machines, and can never be anything other than state machines.

(There is an argument to be made that an FPGA is, itself, an ASIC: an IC whose Specific Application is to be an FPGA. But such an argument would be transparent sophistry.)

exmadscientist · on July 24, 2020

There's also plenty of unclocked stuff in the FPGA... like the LUTs that do all the work. There's enough of this and it's important enough that I believe thinking of FPGAs as "just state machines" is dumb. But then I also believe that digital electronics are not "just digital circuits", but better thought of as "bistable analog circuits", so what do I know....

ncmncm · on July 24, 2020

If the results of the LUTs don't end up clocked into a register, where do they go?

Of course everything is analog, and ultimately quantum-electrodynamic, but the languages FPGAs are programmed in don't provide access to those domains.
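
Both sides of this exchange can be seen in a few lines. In the rough Python model below, the LUT is a pure lookup with no clock and no memory, and it only becomes a state machine once its output is fed through a register and back to its input. The class and function names are illustrative, and the input-to-index ordering is an assumption, not any vendor's convention.

```python
def lut4(truth_table, a, b, c, d):
    """Combinational 4-input LUT: the inputs index into a 16-bit truth table."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (truth_table >> index) & 1


class Register:
    """One flip-flop: the output changes only when clock() is called."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        self.q = d


# Truth table for "a XOR b" with c and d ignored: 0b0110 repeated four times.
XOR_AB = 0x6666

toggle = Register()
history = []
for _ in range(4):
    # Feed the register output back into the LUT with b held at 1, so the
    # next state is q XOR 1: a one-bit state machine that toggles each tick.
    next_state = lut4(XOR_AB, toggle.q, 1, 0, 0)
    toggle.clock(next_state)
    history.append(toggle.q)

print(history)  # [1, 0, 1, 0]
```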

jburgess777 · on July 23, 2020

Gowin might just fill this niche. They are working with yosys on open source support as well.

https://www.gowinsemi.com/en/product/detail/2/

http://www.clifford.at/yosys/cmd_synth_gowin.html

panpanna · on July 23, 2020

I think Cypress had a product line that combined a CPU and a small programmable array, just big enough to implement your own custom IO and protocols and maybe some minimal logic beyond that.

Maybe that's what most hobbyists need?

exmadscientist · on July 24, 2020

You're probably thinking of the Cypress PSoC, Programmable System on Chip.

Those things are fantastic for hobbyists and can be nice for low-volume production. But they're kind of crap for higher volume work:

* Expensive

* Physically fragile/easy to kill: personal experience suggests they are noticeably more fragile than their competition; ALWAYS add pull resistors and ESD diodes to their JTAG/SWD pins and use a real voltage supervisor, not the internal PoR/brownout, no matter what the datasheet says because it does not speak the truth

* Actually, just add external ESD diodes to anything even the least bit sketchy

* On-chip analog not good enough for serious applications or stupidly limited (just give me two of those please? no?)

* On-chip routing is very, very limiting

* Weak MCU cores

* Few large parts (high GPIO, fast core, ...); the 5LP is better but needs a refresh with bigger, better, cheaper flagships

* Needs more digital blocks (UDBs). They use a crappy old macrocell architecture, which wouldn't be a problem except they only give you TWO of them!

I've actually whined about the last one to the Cypress FAE (great guy!) and he just started laughing. Turns out, he's repeatedly said that to their higher-ups and gotten shot down... only to have customers like me ask for it again, over and over....

Hopefully under Infineon the PSoC line will be better managed. It could be a huge powerhouse, but right now it just does not have a good enough lineup of sane models.

panpanna · on July 24, 2020

Since you seem to have some experience with these: are the tools hobbyist-friendly?

(Small install, no need for licences and license renewal, work reasonably well on a cheap laptop)

exmadscientist · on July 24, 2020

Yeah, not bad at all. A little annoying, but above average for the HW side of things.

But that's PSoC Creator, used for their PSoC 4 and 5 lines. (Avoid the 3 and older -- they're really old.) The newer 6 requires Modus Toolbox, which I think doesn't support the 4 or 5 lines (STUPID). I have no experience with that one. It's Eclipse based, so who knows.

tyingq · on July 23, 2020

In the hobbyist space, I also see a fair amount of CPLDs used when something like a GAL (https://en.m.wikipedia.org/wiki/Generic_array_logic) would be much cheaper and easier. Doesn't work for everything, but they can be handy.

rogerbinns · on July 23, 2020

A good example of this is XMOS. Their chips are divided into "tiles" which can simultaneously run code, together with multiple interfaces such as USB, i2s, i2c, and GPIO. Latency is very deterministic because the tiles are not using caches, interrupts, shared buses etc.

Their development environment is Eclipse based with numerous libraries such as audio processing, interface management, DFU etc. They use a variant of C (xc) that lets you send data between channels/tiles, and easily parallelize processing.

An example use is in voice assistants where multiple microphones need to be analyzed simultaneously, echo and background noise have to be eliminated, and the speaker isolated into a single audio stream. I've used it for an audio processing product that needed to match hardware timers exactly, provide USB access, matched input and output etc.

borramakot · on July 23, 2020

Just to throw in one more complication, I'll assert that the only benefits of FPGAs over ASICs are one time costs and time to market. Those are big benefits, but almost by definition, they aren't as important for workloads that are large scale and stable. So, if you do have a workload that's an excellent match for FPGAs, and if that workload will have lots of long term volume, you should make an ASIC for it.

So, for FPGAs to be the next big thing in HPC, you'd need to find a class of workloads that benefit from the FPGA architecture, for long enough and with high enough volume to be worth the work to move over, and are also unstable or low volume enough that it's not worth making them their own chip.

cbzoiav · on July 23, 2020

That's not entirely true - the flexibility can have its own value. Unlike an ASIC you can handle multiple workloads or update flows.

For example, timing protocols on backbone equipment handling 100-400Gbps. Depending on how it's configured you may need to do different things. Additionally you probably don't want to replace 6 figure hardware every generation.

Another example is test equipment where you can't run the tests in parallel. A single piece of hardware can be far more portable / cost effective.

borramakot · on July 23, 2020

I may not have said it well, but I broadly agree with you. If a workload needs high performance but not consistently (e.g. because you're doing serial tests by swapping bitstreams), predictably (e.g. because you need flexibility for network stuff you can't predict at design time), or with enough volume (e.g. costs in the low millions are prohibitive), an ASIC isn't the right solution.

But my point is that for FPGAs to come to prominence as a major computation paradigm, it probably won't be because it outperforms GPU on one really big workload like bitcoin or genetic analysis or something. It'll have to be a moderately large number of medium scale workloads.

mindentropy · on July 24, 2020

There is also glue logic between different interfaces that can be satisfied with FPGAs or CPLDs.

Unklejoe · on July 24, 2020

> I'll assert that the only benefits of FPGAs over ASICs are one time costs and time to market.

There's one more big one: the ability to update the logic in the field.

rthomas6 · on July 23, 2020

Take a look at Vitis. Xilinx is aware of this problem and is seeking to capture the market of people who want magic programming solutions to speed up existing software. Who knows if it will be successful, but they are trying more than ever to make FPGAs usable without having to know how to do hardware design and verification.

mdiesel · on July 23, 2020

I work with fpgas, but from LabVIEW. NI have put some effort into making the same language work for everything including fpgas, and a graphical language is great for this kind of work.

It's so easy that it's quite common to see people pass off work onto the fpga if it involves some slightly heavier data processing, which is exactly how it should be.