Reconfigurable Computing



Okay, so it's a pleasure to introduce [inaudible]. Here he's been in charge of the High Performance Computing Center, but I think what he's going to do much more of in the future is work with us to help us optimize. We have a big cluster, but if we don't think at all about optimizing the way we do calculations, we're going to use it up, and there won't be enough money.

Right. So, as David said, I'm running the high-performance computing facility right now, but in my previous life at RIT I did research in reconfigurable computing. I want to give you a taste of what's possible with reconfigurable computing, specifically when it comes to biomedical and mainly bioinformatics applications. It's really just a little bit of a tour.

The problem, and David alluded to it, is this: suppose you have something you want to compute, and maybe you have an algorithm to compute it, but the question is how you implement that algorithm. Traditionally there are two ways: you could implement it in software or in hardware. The advantage of software is that it's cheap, it's generally taken to be easy by comparison to hardware, and it's flexible. The disadvantage relative to hardware is poor performance, which is almost a necessary consequence of the fact that you're running on a general-purpose computer, where the hardware is not optimized for your application but for executing any application.

Now the hardware alternative, the extreme hardware alternative, is a custom chip. Say you actually designed hardware that just runs your application, and fabricated a chip to do that. Now you have the potential for much better performance, but it's really expensive in the sense of non-recurring engineering costs. That's what "NRE" is: non-recurring engineering cost. What that means is the first-time cost, so if
you're going to design and fabricate a chip, you're committing to millions of dollars up front. Now, fabricating a chip can be really cheap per chip if you're cranking out millions of chips, so the per-unit cost can be low if you enjoy a high volume, but not the up-front cost, and that can be prohibitive for a lot of applications. Hardware is also inflexible: if you make a mistake in a chip, you've got to design a whole new chip. And it's generally taken to be hard to do (this can be mitigated), especially by people who are only used to software.

So are we stuck with this trade-off? Well, one possible solution is reconfigurable hardware. What's reconfigurable hardware? It's hardware that can be configured to act like, to emulate, any other hardware. Probably the most common example is a field-programmable gate array (FPGA), which can be reconfigured to emulate different custom hardware.

Ideally, what you would like from this third alternative is the best of both worlds. In fact, sometimes that's the advertising pitch for reconfigurable hardware: you get the inexpensive, flexible, and easy nature of software, but you get the performance of hardware. Beware of advertising pitches. This best of both worlds is very, very rarely achieved.

But the reality may be something like this, and this actually isn't that bad. Before, you were stuck with two extremes: the upper-left corner, the custom chip, and the lower-right corner, software. By the way, this is a very hand-wavy graph, qualitative, not quantitative. I'm plotting development performance, so how much work you have to put in to develop the thing in the first place (what I mean by "the thing" is your implementation of your algorithm), versus runtime performance, so how fast your implementation runs once you've
got it up and running. Before, we were stuck with these two corners. The idea with reconfigurable hardware is that now you have a place somewhere in between on the trade-off spectrum. So even though it's not the best of both worlds, it may be useful to be able to land someplace in between, if that's better for your application than being stuck with the two extremes.

Okay, so just to give you some examples as they relate to some of the things that we do in CHIBI: Bayesian network learning. This is the machine learning type of area. There's a paper from 2008 where they had an FPGA-based solution that achieved four orders of magnitude speed-up compared to the software. However, the plot thickens, because a couple of years later, in 2010, you have two papers from basically the same group, with the cast of characters shuffled around as to who the first author is. One lesson is that when I say speed-up, you always have to ask: compared to what? If you're comparing to software, was the software heavily optimized, or was it just the first software they could cobble together or found off the shelf? And was that software itself parallelized? Against a heavily optimized software version, they found only a two-times speed-up.

The same group also came up with a hybrid system with software, FPGAs, and GPUs, and they found similar performance to the FPGA-based version at just a fraction of the cost. GPUs are graphics processing units. They share a lot in common with FPGAs in the sense of exploiting a lot of fine-grained parallelism, so they were able to take advantage of GPUs there. And, as late-breaking news relative to this slide (not that it's really that new, but relative to those 2010 papers), they also came up with another FPGA solution that beats the GPU ones by a factor of three to four in performance, which is a significant speed-up, but not significant compared to
that first four orders of magnitude. So is that comparable to the FPGAs? Maybe.

Okay, more Bayesian network learning. This is Hibbert et al., but Hibbert was my student, so that includes me. We were working on Bayesian network learning as well, in this case using a particular combination: particle swarm optimization for Bayesian network learning on an FPGA. It was just a proof-of-concept design, and we got a two-and-a-half-times improvement per slave. What does that mean? Well, this was a parallel implementation. The software was parallel, running on a cluster, in what's known as a master-slave implementation, where you have one master controlling things and it farms out the compute to many slaves. So we got about two and a half times improvement per slave at about the same cost per slave, so you could also call that two and a half times improvement per dollar. However, we could only fit one of these slaves on one FPGA, so we were still slower than the cluster. So, depending on what your metric is, you may ask what the point is. The pitfall here was a direct translation from the software. The student basically took the software and translated it as directly as possible into the hardware, and that doesn't really take advantage of all of the parallelism that's available on the hardware. So there are plenty of opportunities for future work there.

More examples. Short read mapping: these guys in 2012 got a 250-times speed-up vs. BFAST and a 31-times speed-up vs.
Bowtie. In this case it wasn't an identical algorithm; they actually adapted the algorithm to take advantage of the FPGA, which is often how you get really good speed-up. But there's always a question: are you getting the same results, or are the results as good? Here they're claiming better results in terms of how many reads it's aligning. It's aligning more reads: 91% for the hardware versus 80% for Bowtie. Now, of course, you sequencing types would have to look at the paper and see whether they're doing it right, but they were claiming to do better in terms of functionality and significantly better in terms of performance.

More applications. Peptide mass fingerprinting, a proteomics application: they're claiming three orders of magnitude speed-up. Back to genomics, local complexity analysis of massive genomic data: they're claiming a hundred to five thousand times speed-up, compared both to serial and even to multi-threaded software, so not parallelizing across the whole compute cluster, but multi-threaded within a node. De novo genome assembly: 13 times speed-up in 2013. And the most recent paper here, from this year, is on NCBI BLASTP. These guys claim a five-times speed-up over a parallel implementation on a CPU designed at the same time, so it's a similar technology generation, and it's a parallel implementation. Another way these guys sold it, so to speak, is that they believe this is the best implementation per socket. A socket is the socket on a motherboard where a CPU chip plugs in. So you could think of one FPGA as equivalent to one CPU fitting in one socket. That CPU may have multiple cores, but it's one CPU chip, so think of it as one node of our cluster: one node, but you're still doing multi-threading. They're saying that for purposes of that comparison, they're doing the best there.

Okay, so how does... I think I got a little out of order
here. All right, so what do FPGAs consist of anyway? FPGAs consist of configurable logic. Logic gates are like AND, OR, and NOT functions. Basically all hardware consists of itty-bitty pieces of logic that compute functions of binary variables, functions of 0 and 1: AND, OR, or NOT. And the thing is, each of those has a truth table. You could think of it as just like an addition table or a multiplication table: AND, OR, and NOT each have logic tables. You can store those logic tables in a little memory, and by changing the values in that memory, change whether you're computing AND, OR, or NOT. That's basically one way in which these are configured. I had a little slide about that which I seem to be missing; we'll see if it comes up at the end. But basically, by storing values in look-up tables for the function you're trying to compute, you can change what function you're computing, and that's how the hardware becomes configurable.

Similarly, you can do configurable routing, where routing is the interconnection between the functions that you're computing. You could actually make a whole configurable device out of just configurable logic, because it turns out you can fake configurable routing with configurable logic. Or, the other way around, you can implement the whole thing out of configurable routing; that's enough. But modern FPGAs have augmented these resources with embedded hard resources like embedded memories, embedded multipliers, digital signal processing blocks, and even embedded CPU cores on the FPGA. That looks something like this, where most of this stuff is the configurable logic blocks, but there are built-in multipliers and built-in memories. One of the things to realize here is that you have a sea of stuff, where you have logic surrounding embedded memories, rather than one CPU with all the computation concentrated in one place and then all of the memory in another.

All right, so what are the advantages of doing things in reconfigurable hardware? One
thing is you've got fine-grained parallelism available in that fabric, and you can deploy all the silicon to the task at hand, instead of a CPU, which has to have logic on hand for other tasks that you're not necessarily doing. You can tailor the bit widths to what you're doing: a CPU has a fixed word size, but an application on an FPGA can use different word sizes. You can match the physical layout to your application's needs, and we're going to see that thinking spatially about layout can be the best way to achieve performance. And potentially, and I alluded to this with the small memories embedded on the FPGA, you can avoid the memory bottleneck of going between one central processor and one central memory.

But there are some significant challenges with moving to reconfigurable hardware. The clock on an FPGA typically runs at only about 1/10 the clock rate of a conventional CPU implemented in the same technology, so in any given year it's about 1/10 of that rate. To justify switching an application to an entirely new paradigm, saying we're not going to do this in software, we're going to have to learn how to do this in hardware and optimize it for that, people usually want to see at least an order of magnitude in speed-up. So we're coming from a disadvantage of a factor of 10, and now we want an advantage of a factor of 10, so we're going to need to find a factor of a hundred in parallelism that we can exploit on the FPGA and not elsewhere. That's a pretty high bar, a pretty tall order to meet.

Now the other thing is that to even take advantage of the FPGA, you've got to think like a hardware designer. You have to start thinking spatially and plan the interconnection of units. To really take advantage of an FPGA, it's typically not sufficient to just think like a software developer. So here's an example from a paper where they were implementing Smith-Waterman in an FPGA, and I think this gives a flavor for
what I mean by thinking like a hardware designer and thinking spatially. This is something where they're querying a database: they have a query sequence and a database sequence. This is a dynamic programming approach, where there are prior results and you're computing the next diagonal in terms of the values that came before this time step. On the left is what's going on in the algorithm, and time is moving from top to bottom, from this first square to the next matrix, and you see that the anti-diagonal has advanced in the next time step, computing a new diagonal of numbers.

Now what they did is arrange hardware so that each of these processing elements goes physically at the position of one slot in the matrix. It takes inputs from the neighbor above and the neighbor to the left, and it produces outputs to the neighbor to the right and the neighbor below. Each of these things in here that say "reg" is a register; think of those as one-clock-cycle delay elements. What that means is that this thing is pipelined. At each position of the matrix, the processing element can be computing the next value at every clock cycle, and while one of them is computing on the previous datum that's going through, another one is working on another datum. So you can keep all of these busy in parallel, and waves of computation travel through this. This is an example of not just implementing the code in a programming language and then translating line by line into hardware, but thinking: if this is what's going on in our algorithm, how can we arrange hardware that will compute this directly? And it might not look anything like the original code.

All right, another challenge is the relative costs of doing various things in FPGAs versus software. Just how you do the arithmetic is an interesting question: floating point versus fixed point. So if you're
programming in a language like C and you declare a variable to be a float, as opposed to declaring it to be an integer, you're asking for floating point. Floating point basically stores numbers and works on numbers in scientific notation, so you have a mantissa and an exponent. That concentrates the representable values around zero, so it's really good at controlling the relative error in numbers. Fixed point is just expressing things with the binary point in a fixed location, so it's not scientific notation but regular notation, you could think of it that way. That controls the absolute error: it spreads out the possible numbers evenly across the number line, not concentrated around zero.

Now floating point is really good if you have an application where, for a given variable, you need very high dynamic range (the value is sometimes really small and sometimes really big) and you need to control the relative error. It's often used in scientific programming, and it's often used in programming in general, but not always... (No, I don't want a new version of Java.) Kind of like Java, it's not always used because you need it, but just because it's convenient. Some programmers will use floating point as soon as they see that they need to store a number like 1.5, which isn't an integer, and not realize that if you just rescale your numbers you could store it all as integers, essentially.

Now why is this important? Because on an FPGA, a floating-point unit takes up a lot of area, so it's really expensive, whereas in software the floating-point unit is already there because some applications need it. So an interesting question, if you're mapping an application onto an FPGA, is: can your particular application do what it's doing now just as well in fixed point? So a student of mine and I and my father, so this is the law firm of Firenze, Peskin and Peskin, back in 2010 did an initial study of the immersed boundary method, which is
my father's algorithm for simulating blood flow in the heart, among other things. It uses floating point. We did an initial study to see what would happen to errors in values within the application if you used fixed point, and whether you would need an enormous number of bits to get decent error. This is one of the graphs from the paper. The program keeps track of the position of fiber points and the velocity of fluid, and what we're plotting is essentially how many bits we're using. The axis says log base 2 of the x resolution; if that's minus 10, that means we're using 10 bits for position. Likewise for log 2 of the u resolution, where u is velocity: if that's, say, minus 20, that means we're using 20 bits of resolution for the velocity. On the vertical axis we're plotting the error in velocity, measured in fluid grid points per time step.

Basically what we see is that even if we use, say, 14 bits to store velocity and 10 bits to store position, we get an inaccuracy of essentially one part per thousand. And if we're willing to use, say, 24 bits for velocity and 18 bits for position, we're down to an error of about one part per million, which is not bad. We ran this on an example 2D immersed boundary simulation of just a loop sitting in some fluid, and basically you can't see the differences if you're using about, I think it was, 16 bits. So this is pretty good news: we didn't have to use an enormous number of bits to get decent error. The error is also scaling pretty well with the number of bits we're using, because the difference of 10 bits between minus 14 and minus 24 here gives you a factor of about a thousand, and sure enough we're reducing the error by a factor of a thousand. So it's scaling with the number of bits, which is what you would expect and hope for, but not necessarily what you would get, depending on the application.

Okay, so perhaps the examples I showed with those speed-ups convinced you that FPGAs are cool,
but maybe you're scared about this: "Now I have to think like a hardware designer? Wait a minute, I'm not a hardware designer. Isn't there just some button I can push to map my algorithm onto FPGAs?" Well, yes, sort of. There are actually a lot of buttons people will sell you out there that claim: press this button and it will convert your favorite algorithm to an FPGA-based implementation. I have pressed a bunch of these buttons; I can't claim that I've pressed all of them. At the end of the day, it's a lot more than a button push. Typically, if you have one of these things that says it will translate MATLAB code into an FPGA-based implementation, it turns out it accepts only a restricted subset of MATLAB, and translating your application, which uses all sorts of MATLAB stuff, into the restricted subset may take so much work that, especially if you happen to have a little hardware background under your belt, you end up saying it would have been easier to just build the hardware from scratch rather than trying to shoehorn this MATLAB code into something that the tool will eat. The other problem is that often you don't get anywhere near the performance you would get with a heavily optimized hand design, or even, without heavily optimizing the details, just with rethinking it yourself as a hardware design as opposed to automatically translating from anything.

So basically the conclusions are: there's great potential in reconfigurable hardware, great potential in FPGAs, but it's non-trivial to realize this potential. We saw a couple of applications, and I've done some of them, where the instinctive approach of "let's just translate from software" didn't buy us much. You kind of get what you pay for: if you put in the work to do the interesting stuff, which is to really redesign in terms of hardware, you can realize a lot of potential. And yeah, direct translation,
automatic push-button stuff doesn't necessarily realize that potential. But if you rethink the algorithms from the beginning, and if you get close collaboration between an application expert and a hardware expert or a reconfigurable hardware expert (hint, hint: I want you guys to collaborate with me on your favorite applications), then there's potential to do interesting stuff there. I have a little bit more about the immersed boundary method, but any questions?

[Audience question, partly inaudible, about working remotely or on the cloud.]

Okay, right. GPU machines are available. So the question was: to start working with this, do you have to be sitting down with FPGA hardware attached locally? I guess I might split your question into two pieces. One is: can you be sitting at a desktop with a simulation of an FPGA, or do you have to actually have the hardware? The other, since you were asking about the cloud, is: is there a remote resource that you can use?

You could have FPGA nodes inside a cluster. Indeed, there are supercomputers from Cray that consist of regular CPU nodes with FPGA nodes sprinkled in. So there are clusters where FPGA nodes are available. Our cluster doesn't have that; our cluster has GPU nodes available, but it's the same idea. Such clusters could be made available remotely, and theoretically a cloud provider could have such a cluster that they then made available as a cloud service. I've never heard of that. I think it's unlikely that that exists currently, but theoretically there's no reason why it couldn't.

The other part of the question is: never mind the cloud, or even a remote resource, do I have to already have the hardware to start playing with this? There are simulators of the hardware, and there are also simulators of hardware description languages, that you can play with
even if you don't have hardware attached. You can even, and we've done this before, have software that's designed to synthesize your design for an FPGA. It has detailed models of the FPGA and of the elements on the FPGA, so it can do all the place and route, consider the delays that are inherent in the FPGA, and tell you: here's how fast your design would go if you put it on this hardware. We've done something like that with the immersed boundary method. Another student of mine and I have something where one step of the immersed boundary method is actually in hardware. The rest isn't yet, but that step has been completely designed, such that if you had a big enough FPGA you could download it onto it, and we've simulated it and asked the software the "how fast would this go" kind of question.

Yes, so this software is from the vendor of the FPGA. There's public software available just for simulating the hardware description languages. If you want to get the detailed delays, as in "if I put this exact design, using these libraries, on this particular FPGA", for that you need the vendor software, because the actual internals of the FPGA are proprietary to that vendor, so they sell their own software.

Other questions?

Right, so on my favorite hit list of projects to convert to an FPGA next: something that Constantine and I were talking about a while back. He has another causal discovery algorithm, HITON-PC, and we were talking about doing an FPGA implementation of that, so that would probably be next on my hit list. There's also something Stewart and I were talking about. Ying Fei, who's working for Pei, does a lot of BLAST on our cluster. This is actually an example that gets to what David was talking about, about swamping the cluster. One of the times when we find the cluster swamped is with a bazillion copies of BLAST being run by this particular person, who has a lot of work to do for some
very interesting microbiome stuff. BLAST is very resource intensive, and BLAST is one of the algorithms that has had a lot of success on FPGAs in the past. Stewart lent me this old FPGA hardware that was basically a desktop around an FPGA, together with software and an FPGA implementation from this company called CodeQuest. The deal with that implementation is that if you have many queries to run against one database, it can go significantly faster. I don't know, Stewart, if you remember any numbers from having tried that before. But it totally depends on that: if you just have one query for one database and then another query for another database, it's no good, because the whole way it takes advantage of the hardware is by reconfiguring it around the database that you're going to be querying. So one of the things we want to do is just resurrect existing implementations on FPGAs and see if this does well for Ying Fei's work. But then a very interesting question would be seeing what recent research has been done out there on BLAST, and whether we can improve on that, maybe optimizing for what Ying Fei needs or something like that. That's another thing on my hit list.

Yep. Right, so that depends highly on what FPGA you need and whether you need multiple FPGAs or not. There are FPGA cards that go into existing servers, or into a standard PC, for about two thousand dollars for an FPGA card, I think. There are also development kits that you can get if you're more oriented toward, say, "I'm an electrical engineering professor and I'm teaching a digital logic course." For some of the small FPGAs you could probably get down to more like a hundred dollars, or fifty or something, heavily dependent on whether you get an academic discount or not, and so on. There's a BEEcube unit that I have in my office that's more like twenty thousand dollars. And then from there the sky's the limit. As I say, there are
supercomputer clusters from Cray that have hundreds of regular nodes and hundreds of FPGA nodes in there, and so on. So I think the key questions become: how big an FPGA do you need to fit your project on it? How much integration, if any, with the standard CPU do you need? Do you need it integrated into the cluster? And what kind of communication back and forth do you need between an FPGA and a standard CPU? That can affect which FPGA solutions you want to buy and how expensive they are, by a lot. And then also, of course, the generation of FPGA: you usually can try this out on something relatively old, just as a proof of concept, but then there are new ones that have a lot more area, so they can fit bigger designs, and significantly more speed. It's a long way of saying it depends.

Right. Well, I don't know if a cloud FPGA solution exists yet, but with GPUs, sure. I've kind of almost ignored GPUs in this talk. You could argue that GPUs fall in a similar category of non-conventional computation with a lot more fine-grained parallelism, where taking advantage of the spatial organization of computation helps. But they're not as radical a departure from what you guys are used to doing as FPGAs are, and there are cloud instances that have GPUs in them. So again, it depends what you're doing, and I think it's worth trying both.

Yeah, so in terms of the cost, what you do is try a cheap one first, and then see: is the conclusion that this doesn't work out on FPGAs at all, or is the conclusion that it would, if only I had this bigger and faster FPGA, and there's a lot of potential here?

Yeah, right. No, there is a lot of that in the CAD industry, not just in terms of which FPGA and which design, but also in terms of the bells and whistles that the software has. A lot of those are licensed add-ons. And in terms of IP cores: one of the interesting things about FPGAs
is that they're hardware, but they're reconfigurable, and so there's a brisk business in intellectual property cores, which are configurations of a region of the FPGA that either the FPGA vendor or a third party can sell you. So there's a lot of this: "Well, if you had this core that we've designed, then you could do this." Yeah, I encountered such things very early on as an electrical engineering student. We were, I don't know, looking to do something with the CAD tool we were using, and we got to this dialog box that popped up, which said, you know, if you bought this add-on feature for this CAD tool, you could do this. Yeah, there's a lot of that.

Thank you.
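
As a rough illustration of the fixed-point discussion in the talk, here is a minimal Python sketch. It is not the code from the immersed boundary study: the helper names (`to_fixed`, `from_fixed`, `max_abs_error`) and the sine-wave test data are assumptions made purely for illustration. It shows the two points made above: a non-integer such as 1.5 can be stored as a rescaled integer, and each extra fractional bit halves the worst-case absolute rounding error, so ten extra bits buy a factor of about a thousand.

```python
import math

def to_fixed(x: float, frac_bits: int) -> int:
    """Store x as an integer count of steps of size 2**-frac_bits (hypothetical helper)."""
    return round(x * (1 << frac_bits))

def from_fixed(q: int, frac_bits: int) -> float:
    """Convert the stored integer back to a real value."""
    return q / (1 << frac_bits)

def max_abs_error(values, frac_bits: int) -> float:
    """Worst absolute error when every value is rounded to frac_bits fractional bits."""
    return max(abs(v - from_fixed(to_fixed(v, frac_bits), frac_bits)) for v in values)

# Illustrative data only: samples of a sine wave, not simulation output.
samples = [math.sin(0.001 * k) for k in range(1, 2000)]

# 1.5 rescaled by 2**1 is the integer 3, so no floating point is needed to store it.
print(to_fixed(1.5, 1))

# Rounding error is bounded by half a step, i.e. 2**-(frac_bits + 1).
for bits in (14, 24):
    print(bits, max_abs_error(samples, bits), 2.0 ** -(bits + 1))
```

The bound is on absolute error, which matches the talk's description of fixed point spreading representable values evenly along the number line.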

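The Smith-Waterman arrangement described in the talk, where a wave of computation moves along the anti-diagonal of the matrix, can be sketched in software as follows. This is a plain Python model, not the design from the paper: the scoring values (`match=2`, `mismatch=-1`, `gap=-1`, linear gap penalty) and the function names are assumptions. The inner loop of `sw_wavefront` visits cells that depend only on earlier anti-diagonals, so in hardware each of them could be handled by its own processing element in the same clock cycle.

```python
def _cell(H, query, db, i, j, match, mismatch, gap):
    """Score of cell (i, j) from its upper-left, upper, and left neighbors."""
    s = match if query[i - 1] == db[j - 1] else mismatch
    return max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)

def sw_rowwise(query, db, match=2, mismatch=-1, gap=-1):
    """Reference version: fill the matrix row by row, as typical software would."""
    m, n = len(query), len(db)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = _cell(H, query, db, i, j, match, mismatch, gap)
            best = max(best, H[i][j])
    return best

def sw_wavefront(query, db, match=2, mismatch=-1, gap=-1):
    """Fill the matrix one anti-diagonal (i + j == d) per 'time step'."""
    m, n = len(query), len(db)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):
        # Cells on this anti-diagonal are independent of one another:
        # each reads only values from anti-diagonals d - 1 and d - 2.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            H[i][j] = _cell(H, query, db, i, j, match, mismatch, gap)
            best = max(best, H[i][j])
    return best

# Both orderings produce the same best local alignment score.
print(sw_rowwise("ACACACTA", "AGCACACA"), sw_wavefront("ACACACTA", "AGCACACA"))
```

Only the order of evaluation differs between the two functions, which is the sense in which the hardware "might not look anything like the original code" while still computing the same result.
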
One Comment

  1. Md. Sahil Hassan said:

    Thank you very much for this video. It contains the whole program, including the QA session. Very nice and complete presentation. Really helpful.

    June 28, 2019