High Performance Computing with C++



hello everyone, welcome to this webinar. My name is Dmitri, and I'm going to be your host for this hour and a half that we're going to spend talking about C++ and high performance computing; thanks for joining us. Before we begin, I'd just like to mention Daria, who is the PMM (product marketing manager) for ReSharper; she's also here with us, and she might be able to answer some of your questions about ReSharper if you have any. But this isn't a product-specific webinar: this is a webinar where I basically want to discuss some of the things that are going on in the high-performance computing space, some of the technology solutions that are out there.

To begin with, I'd better tell you a little bit about myself. I'm a quant; I work in quant finance, a discipline which is kind of a marriage of mathematics, finance and software engineering. I program quite a bit in C++, .NET and MATLAB, and I even do a little bit of HDL programming for FPGAs; I'll also mention FPGAs in the context of our current discussion. I've been a Microsoft MVP for five years in the C# discipline, and I've done a couple of courses on Pluralsight on things like CUDA and D and generally high-performance stuff, so do check those out. At JetBrains I have the role of a technical evangelist: somebody who basically talks about all these wonderful technologies and how they can improve our lives. But as I said, this webinar isn't product-specific; it's a technology-oriented webinar about high-performance computing.

So what exactly are we going to cover? First of all, I want to talk about the technologies which are available for computation. I'm sure you know the personal computer or the server isn't the only technology you can use to actually calculate something; there are lots of different platforms, different implementations and so on, and I'll show you three particular technologies apart from the typical x86 architecture. I'll also discuss this idea of managed versus unmanaged code, and by managed I mean things like C# and Java, things which have a virtual machine that is supposed to be working for our benefit in terms of performance, though I'm not sure that's exactly true. I'll talk about how to leverage the capabilities of the x86 architecture, the architecture we've all become accustomed to, and I'll also talk about the specialized and unusual hardware solutions that exist for accelerating computation: what they are, how to leverage them, and why you would want to work with them in the first place.

So I want to begin with this discussion of native versus managed code. I'm sure all of you know the distinction. Native code, the kind of code that's produced by C++ and other native programming languages such as D, for example, is when source code gets turned into machine code. The alternative is managed code, something we have in languages like C# and Java: source code gets compiled into some intermediate representation, something that looks a bit like assembly language, and subsequently this gets turned into platform-specific code at the moment of execution; that's the idea of JIT compilation. It's a great idea in theory (you have your platform-neutral representation, and when you need to execute something, that's when things get converted), but it's a bit too ideal, a bit too good to be true, for performance reasons.

Managed code does have its advantages: it's considered to be more portable. But then again, at the moment I'm recording this, I would argue that C++ has also become at least somewhat portable, provided you don't use any platform-specific things. For example, in my practice I write on Windows and I run on Linux; I write the same code and I use the same compiler, the Intel compiler, which exists on Windows, Linux and OS X as well, so it's uniform. In that way, claiming that Java is the only portable language out there isn't really fair: you can write C++, and it is portable, at least in theory. With managed code, when you turn it into this intermediate representation that subsequently gets JIT-compiled, in theory the process of JIT compilation optimizes the code for the particular platform, but in practice the optimizations are, well... I'm generally not a believer in optimizing compilers in the sense that they can take every situation and make it perfect. I think they can make the most obvious situations perfect (the most obvious parallelizable loop can in fact be parallelized), but in the general case I believe in having the power to fine-tune things myself. Managed code, if we look at it today, generally does not permit any kind of low-level interaction with the processor: I think we're only now seeing the .NET JIT compiler getting SIMD support, and I think that's only happening in some preview version of the JIT, so we're not there yet. Of course, managed code gives you additional safety, things like array bounds checks and various type conversion checks, but then we end up with languages such as D which are configurable in this respect: in the D programming language you can choose whether or not you get array bounds checks (you can turn them off entirely, or turn them off for specific types of functions), and it's a lot more flexible that way, because with languages such as C# and Java you sometimes don't really control whether this happens or not.

In addition, managed code isn't actually as portable as you'd like it to be. Just because Roslyn went open-source and is now available on the web doesn't mean that all the .NET libraries suddenly became totally portable and available. Things like the user interface (the fact that .NET doesn't have a canonical, shall we say, cross-platform UI implementation) or libraries like WCF that aren't really available everywhere mean you cannot claim total portability, and the state of this portability keeps changing, because sometimes Mono, for example, catches up on things and sometimes it loses out again. I think things are going to be a bit different now that C# itself is open-source, so I'm hoping for the best. Managed code also typically went (at least C# and Java did) with this idea of garbage collection, which for performance reasons you might sometimes want to get rid of. Again, we're seeing languages such as D (I keep bringing up D as an example; it's a good example of the fact that you can have a language where garbage collection is available, but if you don't want it you can switch it off for particular constructs) being a lot more flexible that way. One thing I'll say about garbage collection is that diagnosability is a different problem: if you don't know when an object gets disposed, then you're going to be using a tracing application to figure out when exactly the object went out of scope and who held that reference and so on; you simply have different problems on different platforms. And of course it's great to be able to interoperate: if you want to use C++ and native code for the performance-critical bits, you can always interact with it from Java or from .NET. I'm not a great expert on Java interop, but I can say that on the .NET platform interoperability is generally great, especially if all you need is functions exported from dynamic libraries written in C++; in that regard, C# makes the consumption of this code really easy.

So, one thing that is always a problem, and our company is in fact in this game of productivity, and
the question here is: if you're advocating C++, what kind of productivity gains or losses are we actually experiencing? I think we have to make a distinction between developer productivity and the productivity of the software itself. We want our code to run as quickly as possible, but we also want developers to be able to write that code and to maintain it. I think the place where C++ loses out, even taking into account the modern C++ variety, is the initial learning challenge: if you're starting out in C++, it's a lot more difficult to just get something done than in managed languages. So we have this understanding that managed languages are somehow simpler to use, and historically they've had better coding assistance, but as you probably know we're working at JetBrains to redress the balance: we're making a C++ IDE and we're also adding C++ support to ReSharper, so those two things are going to maybe shift the focus slightly back to C++. And it's made to coincide with the fact that C++ itself has suddenly started evolving, because for about thirteen years, from C++98 until C++11, there was this huge era of stagnation; basically nothing was happening in C++. So I'm hoping that will play some sort of role in that as well.

Now, this talk is mainly focused on CPU-bound problems, meaning that you're not getting enough processing speed, you're not getting your tasks executed quickly enough. But it's important to note that some problems actually bottleneck on the I/O aspect, and the typical example of that is compilation speed. That's a problem that's been worrying me since forever, especially in the C++ space, because as you know we have some compilers which are extremely fast (the D compiler, for instance, is almost instant: you press F6 and everything has been compiled already), then we have compilers which are kind of tolerable, like the Python and Java compilers, where you can generally live with the amount of time it takes to compile something, and then at the extreme end of the scale we have C++ and Scala, which are very slow. I use a cluster build system, IncrediBuild, to compile C++, because for large projects that's the only way you can escape atrociously long compilation times. The point here is that some problems are in fact I/O-bound: when you swap out a typical hard drive for an SSD and compile those applications, everything becomes a lot faster. So while we're going to be talking about optimizing for the processor, in terms of optimizing the I/O mechanism there are options as well: there's obviously RAID and fast SSDs, and maybe Fusion-io if you can afford it. I have an approach where I build these kind of virtual labs: you take lots of tiny, small SSDs (they're cheap, actually) and you put them under a hypervisor that looks after them, so each SSD backs a separate virtual operating system, and then each one performs your parallel compilation or testing or whatever. That's a paradigm that works for me, and it's actually cost-effective. And of course there's the RAM disk. Recently a colleague asked me, "is the RAM disk still relevant?", and I told him no, SSDs came along and we no longer have to worry about any of this stuff. But he said, "well, check it again," so I went back to my computer, made a RAM disk (just a tiny one) and checked again, and a RAM disk is still faster than SSDs, at least for compilation purposes. Of course, you can no longer buy a hardware RAM disk that sits on the PCI bus, and having a RAM disk in your main memory requires lots of RAM, which in turn means ECC RAM on server boards and Xeons, and it becomes very expensive. So I don't know how realistic that advice is, but at least when I see people not using SSDs I begin to worry, because they're obviously losing some performance.

But that's not what this talk is about, so I'm going to talk about some of the things that affect the processor specifically, like parallelization, because we're at this weird stage in our history where we cannot really expect CPU clock speeds to suddenly pick up. We're not going to see a 100-gigahertz processor. Maybe I'm wrong; maybe we'll have graphene-based processors, or some scientific breakthrough might come along, but as it stands right now, and as it's been for the last five years or so, we can't expect higher clock speeds, nor can we expect the number of cores to rise significantly. Do you remember all those promises when Intel was saying you're going to have 80-core processors soon? Well, it's not happening; I'm not seeing it happen. What we do have is six-core processors, and if you're lucky you can have a 32-core machine (I think I have some of those), but they are stupidly expensive, and for the most part, if you look at somebody with a laptop or a desktop, they're likely to have a quad-core and that's it; that's pretty much our limitation. We also cannot expect the number of CPU sockets to increase. There might be technological limitations, but I suspect (and I have no proof, of course) that the manufacturers are just too greedy: nobody's going to make a motherboard where you can arbitrarily plug in new processors. I cannot go out, buy a new Xeon and stick it in, because there's nowhere to put it. So, unfortunately, the overall conclusion we can draw from this is that the PC and server architectures do not scale: you cannot buy new stuff, stick it into your machine and make it faster, because you've already
taken up all the memory slots, all the CPU sockets, and so on. So the only way to really accelerate computation is to somehow increase the number of entities to compute on, and by "entities" I mean absolutely anything: an entity can be the amount of data you process per single instruction, it can be the number of cores or the number of processors, and the sort of worst-case scenario is that you simply buy another machine. That's one of the things I've personally been trying to avoid as much as I can, because certainly you can buy lots and lots of computers, but when you buy a new computer, not only do you pay the cost of a processor (that would be fine), you also pay for the power supply, the motherboard, the memory; you pay for everything, basically, so it's not very cost-efficient. Although that's the model, that's what people do right now: a computer cluster, as people understand it today, is a lot of machines joined up together in some sort of network.

So let's talk about parallelization. Before we do, it's important to ask ourselves the question: is it still relevant? Given that we're not seeing an explosion in the number of cores, why should we really care? Well, even having a quad-core or something is already pretty good, and some of the parallelization approaches I'll mention aren't even related to cores. The first one, for example, is instruction-level parallelism. This is the idea that on the CPU you have a set of very large registers (registers larger than the typical word size), and then you have instructions which operate on them. For example, you can have an instruction which, instead of adding two floating-point values together, takes four pairs of those values and sums them all up in a single instruction. This approach is called SIMD: single instruction, multiple data. You might recognize some of the acronyms, like MMX and SSE and AVX. These things have been around for years, and they provide a certain amount of speed-up, because after all, if you can add four pairs of floating-point values instead of just one pair, that's effectively a fairly significant increase in speed. So if we're using C++, and not Java and not C#, then we can leverage these things through different mechanisms: we can write assembly language and stick it right into C++, we can use intrinsics, or finally we can hope for the compiler to do this for us automatically.

Apart from that, we come to the classical approaches, like using separate threads: if you have N logical cores, or N hardware threads, then simply make as many software threads, and then you sort of hope for the operating system to schedule them evenly so that they get placed in the right locations. When constructing new threads, there is always an API for making new threads, but it's also possible to make new threads declaratively: in C++, with the OpenMP technology, for example, you simply put a pragma in front of a loop and that's it, it parallelizes things by itself. This is one of the places where I actually trust the compiler; I trust the compiler to parallelize things, at least the simpler loops, shall we say. And then of course at the high level we have machine-level parallelization, where you basically build a cluster of machines that communicate through some sort of interconnect; that option is always on the table, though maybe not the most cost-efficient.

I just wanted to mention SIMD briefly as a technology that's available for leveraging those larger registers, because I have serious doubts that a JIT compiler for Java or C# would leverage this automatically in the correct way, but in C++ you can get to it directly. You can write inline assembly (ASM blocks right inside your C++ code), or you can do it via so-called intrinsics: these are essentially wrapper functions, but they get turned into, effectively, SIMD instructions. Of course, there is a certain amount of tax when you leverage these, because you have to use special data types: for example, if your SIMD registers are 128 bits, you have to use a special 128-bit data type, and you cannot use operators. You cannot write a * b, unfortunately, which can be a problem, because I'm used to writing a * b when I mean a times b; I don't want to write _mm_mul_ps(a, b), although there is no way around it. An alternative, again, is compiler vectorization: that's when you basically trust the compiler to leverage SIMD automatically, in a loop for example, and it should work in the simplest of cases, but I think that on critical paths you really want to handcraft this stuff. And as an extra option, there are actually special compilers, or compiler extensions, like the Intel SPMD Program Compiler (ispc), which provide language extensions specifically for leveraging SIMD; that's kind of an admission of the fact that the default constructs in C++ might not be the most user-friendly.

One thing I want to point out here, though, is that it's all well and good discussing SIMD, but you have to remember that the SIMD instructions are evolving: each of the new processors is essentially introducing more and more instructions and wider registers, and the consequence is that you may end up writing C++ code that simply isn't going to run on somebody's machine, because they don't have support for the particular instructions you've used. This is an interesting slant on portability, because suddenly you're targeting particular processors, and in certain cases this approach is completely valid.
And specifically, if we're talking about high-performance computing, then one of the assumptions I'm making, at least, is that the code I'm writing is going to run on my hardware, and I know my hardware: I know the level of Streaming SIMD Extensions support that is going to be on my processors, so I can write anything I want. It's the same if, let's say, you're using a hosted solution where all the servers are identical: you know what the servers are, so you can leverage that particular level of SIMD, and that's great.

I want to briefly show how this looks in practice, so let me bring up Visual Studio; hopefully all of you can see this. What I have on the screen is a function which takes a pointer to a set of bytes from an image and turns it from a color image into a black-and-white image, but instead of using the ordinary loops and whatever, it's using SIMD; you can see the __m128 types here. Why is this a good demonstration? Well, you can see some of the problems. For example, if you want to initialize a 128-bit register, you have to have an intrinsic for that, and when you want to address memory that references the elements four at a time, you have to do a pointer cast (I think ReSharper is actually reminding me that I can use reinterpret_cast here, but it doesn't matter). Subsequently, there's a lot of pointer manipulation going on, because essentially the C++ language itself is not attuned, shall we say, to working with these kinds of registers; effectively this __m128 construct is just a union that you can address in various different ways. So it's just a small illustration of how this looks in practice, and you can see that it's not very readable, but it does provide certain performance advantages. Ultimately it's not the only approach to leveraging SIMD, but it's one of the approaches for working with it as closely as possible without writing inline assembly, which is another possibility as well.

Moving past SIMD: what we looked at with SIMD is called instruction-level parallelism, where a single instruction works on several elements of data. The next option is obviously data-level parallelism, and the idea is that you've got an array of data and you want to process every element in roughly the same fashion. How do you do it? How can the kind of map/reduce and similar concerns be addressed? The way I do it is with OpenMP. OpenMP is a declarative, rather than imperative, mechanism for parallelizing things. Simply put, if you have a for loop, you can put the following pragma right in front of it: #pragma omp parallel for. This will attempt to automatically (or should I say automagically) parallelize the loop, provided of course that its iterations aren't interdependent in any way, there are no blocking conditions, and so on. I have a simple illustration here, but this syntax for parallelizing loops can be far more complicated; there are lots of options, and OpenMP is generally a very mature technology if you want to leverage it. And you can mix it with SIMD; I think I've actually done this. If you look here, right at the top, this line is not only leveraging SIMD, it's also attempting to parallelize the whole loop, so that each line of the image we're trying to convert gets converted on a separate core, effectively. So that's another advantage. And of course you can do things in an imperative fashion, working with threads directly or using a library like, for example, Intel Threading Building Blocks or the Microsoft Parallel Patterns Library; they actually have a more or less compatible interface, though I think the Intel library has a few more constructs.
With these libraries there are functions like parallel_for and parallel_for_each; if you're a .NET developer, you should recognize some of this stuff, because in .NET, for example (I don't know about Java), we have this kind of declarative parallelism where instead of writing a for loop you can write Parallel.For and have it done automatically, and there are collections which play nice with the idea of parallelism, meaning they play nice with being accessed from separate threads, should you need to.

But I'm a big fan of specialized hardware; you can see some of the images here on the screen. Ultimately, I have a personal computer under my desk, and we're very constrained there: it's very difficult to expand anywhere. I can certainly keep upgrading my machine until I've spent all my money on the fastest Xeons, but, well, I like expensive toys, and I think there are cases where they're actually relevant. I'm going to mention only the three technologies shown here on the right; there are lots of other ones that are even more specialized, shall we say, lots of specialized solutions dedicated to different industries, but what I want to talk about is the more general-purpose hardware.

First of all, the piece of hardware you all probably know is GPGPU: graphics cards, basically. They used to be just for graphics, but now we leverage them for general-purpose computation, thus the acronym GPGPU. GPUs are highly parallel and they're excellent at computation, whether it's graphical or just mathematics; we're no longer constrained to purely graphical applications. But the problem with GPUs is ultimately that they're actually not general enough: I cannot use the Boost libraries, I cannot use the STL on the graphics card. They're OK for data-parallel mathematics, things like plus, minus, sine, cosine, that sort of thing, but you cannot write general code on them, so you need something else.

Another option is, well, various expansion boards. All of these are expansion boards, essentially, but some of them are specifically computation-related. The idea is always the same: you have a computer inside a computer; that's the only way you're going to expand horizontally, I guess that's the right word, by plugging computers into your computer. The most well-known example is the Intel Xeon Phi; that's essentially a coprocessor you can plug into your machine, so it's literally plugging a computer inside a computer. And finally, the last thing I want to mention is custom chips, things like FPGAs or ASICs, which are also used in certain industries, shall we say. In the quant finance industry we use FPGAs for some of the tasks, but generally this is the danger zone, because these things are more expensive both to buy and to program, and they require a change in mindset: they behave differently, so it does require specialized skills. But I'll mention FPGAs as well.

GPGPU is probably the most accessible technology, because it's been made accessible by the companies which actually manufacture the hardware; they've realized that this isn't only for gaming, and that they can let people actually program things. Essentially, the two most popular hardware platforms for doing any kind of accelerated C++ development are NVIDIA and ATI, and on top of those we have software platforms: I'm going to mainly talk about CUDA, but there's also OpenCL and C++ AMP. If you're into NVIDIA's hardware, then CUDA is, in my opinion, the most successful GPU-related technology. I wouldn't claim it's universal, because it doesn't try to cover other devices, but on the GPU I would say CUDA is the most polished and the most accessible.

I see a question here, actually: somebody's asking whether ReSharper has support for CUDA. I actually asked the guys to make sure that CUDA is supported, and I'm hoping it will be, because there isn't really that much to support; there's only one language extension there. So I think they're going to make it, but I'm not making any promises; I just think this should definitely be done at some point. Then there is OpenCL, and the thing about OpenCL is, well, I'm not such a big fan, but the great thing about it is that it targets all the platforms; it tries to target absolutely everything, including FPGAs, although unfortunately I think FPGA development boards with OpenCL support aren't exactly cheap right now, but that's likely to change. And then there's Microsoft, and what they came up with is a library called C++ AMP, which is an attempt to make a universal API for both AMD and NVIDIA graphics accelerators; at the moment, though, it's Windows-only and fairly constrained. I'm hoping that maybe other compiler manufacturers will implement the standard, but there's no guarantee of that. I'll just show CUDA in my demo.

One thing about all of these devices, not just GPUs, is the question of how many devices you can plug in, because remember, I just claimed that we have horizontal scaling, and the reason we have horizontal scaling is that there's more than one PCIe slot on the motherboard, so theoretically you can plug in more than one device. The typical number is two; with GPUs, people generally see a drop-off in effectiveness after that, but I think this actually needs to be verified. Right now I'm working on machines where every machine has two GPUs or two Xeon Phis, but I think there are other limitations there, like whether the power supply is enough to power all these devices, and those
sorts of things but sometimes you easy to get there you to get limitations in terms of computability as well because you do get PCI bus congestion it depends on your usage pattern if you're sending lots of data to and from then then you know that's the that's one situation if you're sending off lots of data and then you're waiting for like an hour for it to compute then it's not a problem if you have more than two devices although once again as I said there are there might be other natural limitations related to just system design there is a question here as to why I'm not a fan of OpenCL I'm not I'm not a fan of OpenCL for GPUs specifically I think OpenCL as an idea is generally great the problem is that for GPUs it does get a bit verbose because uh programming programming OpenCL is a bit like programming CUDA but programming it in the in the device API rather than the kind of end user API so it does become a bit more verbose I think that OpenCL is generally a fantastic idea the fact that we're seeing a major optic of OpenCL for FPGAs means that essentially you can write a program you can write an open scale program and then you can run it on the xeon phi or you can run it on an FPGA and that's that's a fantastic idea for some reason though the at least in my experience the tool set for GPUs specifically on from nvidia is so great that i simply you know i love it and i'm going to show some of it during this presentation actually um oh yeah i have a slide on cuda versus OpenCL so ah the way I see it could a is kind of like the the main commercially successful GPU platform OpenGL is still present there and it does have its uses but the evidence I'm looking at why why am I making the strange claim it's because if we look at who supports CUDA then we see that programs like Photoshop and MATLAB and all sorts of other major manufacturers they go off and they support CUDA first of all and some of them yes some of them golf and support OpenGL as well but typically it's 
So first it's CUDA support, and then, if there's time, they do OpenCL support. I'm not saying that's an indicator of the technology's quality — it's an industry trend, an indicator of the technology's commercial success. In many domains — surprisingly enough, domains which should be leveraging the GPU to the maximum — we've seen situations where the GPU isn't being leveraged well, and video transcoding is one: you should google the article "the sad state of GPU video transcoding", because even though we have this technology, people aren't leveraging it well enough.

There's a question here about the cloud, and the fact that Amazon provides high-performance nodes. Yes, you can rent GPU nodes — I think they're NVIDIA ones — so you don't have to buy your own GPUs. In fact, if you look at the design of something like CUDA, you'll see that for development purposes it's a client-server sort of platform, meaning your GPUs can be stored somewhere far, far away, and the developer doesn't really need direct access to the GPU. It's been made very convenient for us, shall we say.

In terms of performance, though — and this is very subjective, I guess — it's generally assumed that NVIDIA is better at floating-point and AMD is better at integer math. At some point during the Bitcoin craze, ATI/AMD devices were preferred for mining bitcoins, and the reason was that one of the integer instructions was twice as fast as on CUDA devices. That's the perception of some people, but I think you have to verify this for your particular applications, because it depends on what exactly you're doing.

So, CUDA. One big surprise for developers, especially those coming from C# and Java, is that CUDA is actually a managed technology — it's not as native as you might think. You write programs in languages such as C — CUDA C is the main language, though you can use others — but they get compiled into an intermediate representation, kind of like C# getting compiled into IL. Your code gets compiled into something called PTX, and then the graphics driver compiles that into something device-specific and executable on the device. This allows CUDA to support different devices, and it also permits CUDA to support different programming languages, such as Python: if you want your language to support CUDA, you just have to transcompile to this PTX format, and the rest of the problem is taken care of.

Now, what I just said might look like a claim that CUDA is device-independent, but it really isn't so much, because device capabilities keep changing, and so you keep having to adapt to new devices. There's no swap on a GPU, for example — so how do you know how much memory to allocate? You allocate too much because you're working with a modern device, and then somebody runs it on a really old, terrible graphics card without the capabilities you're after, and you have a problem. This is why I love being able to work with hardware whose capabilities I know — unlike what game developers have to do, because if you're writing a game for the general market you have to adapt to a zillion different devices.

So CUDA C is the primary development language for CUDA, and CUDA actually uses something called a compiler driver: it takes your sources, which can be ordinary C++, and rips them into two parts.
The CPU part gets compiled by an ordinary host compiler, the GPU part gets compiled by the CUDA toolchain, and then the two are glued together; everything's taken care of automatically.

I really want to show a demo of what CUDA is like, so let me open up Visual Studio. Yes, the entire toolkit that enables CUDA development integrates into Visual Studio, and you get all sorts of wonderful debugging support. This is an application which just adds two arrays together. It's a bit verbose — I think CUDA 6 is making working with memory a bit easier in this regard — but essentially, if we ignore all these checks, we're taking two arrays of numbers and adding them up on a graphics device. The way it works, as I said, is that you have a single source file, but that file gets ripped into two pieces: everything that's decorated with specifiers such as __global__ or __device__ is recognized by the compiler driver as relating to the GPU, so it rips those parts out and compiles them into PTX, while the rest gets passed to an ordinary C++ compiler.

So the question is, how do we glue the pieces together? The answer is that there's a language extension here — let me find it... here it is. This is probably the only extension to the language: the triple angle brackets, the chevrons. That's the point of invocation of the so-called kernel — a call from the CPU to the GPU — and that's basically all we need to effectively leverage the device. At this point we're calling something on the GPU, the kernel called addKernel, and this part executes on the GPU. And we can actually debug on the GPU: I can literally put a breakpoint in GPU code, press the Start CUDA Debugging button, and I'm at a breakpoint on the GPU, where I can look at the variables and the thread ID. It's also possible to move between different threads — remember, it's a highly parallel setting, and we might want to examine a particular thread — and there are tools for this sort of thing: with CUDA Debug Focus I can say that I want to be on thread number one, press OK, and now I'm in a different thread, with thread index equal to one. So all the tools for debugging and figuring out what's going on are there, and you can leverage them. Once again, I don't know much about OpenCL — I've looked at it at some point — but leveraging CUDA, even running on a very early-generation CUDA device, is fairly easy, and the toolkit is fantastic. Obviously there's a certain paradigm shift, because you have to prepare your data, send it off to the device, specifically allocate things, and if you have more than one device you have to spawn threads and so on — the programming paradigm is different — but ultimately, leveraging graphics capabilities has never been easier, though I think it could be a lot easier still.

Just as a bird's-eye view of what CUDA is — the architecture, which is actually similar on AMD and NVIDIA devices; I'm just using CUDA's terminology — you have a set of so-called streaming multiprocessors on the device (certainly more than one), and each of these multiprocessors has a set of processors, also known as CUDA cores, and you can launch a huge number of threads in parallel to be executed on them. So computation is data-parallel, and one really unfortunate constraint, I think, is that the threads you schedule on the device essentially have to execute the same instruction — that's what's called a SIMT architecture, as opposed to the SIMD we had previously.
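To make the demo concrete, here's a minimal sketch of the kind of add-two-arrays program described above. The kernel name, array names, and launch configuration are my own illustration rather than the exact demo code, and you'd need nvcc and an NVIDIA GPU to actually run it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the GPU, one thread per array element.
// __global__ marks it as callable from the host.
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host-side data
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Allocate on the device and copy the inputs over
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    // The one language extension: <<<blocks, threadsPerBlock>>> launches the kernel
    int threads = 256, blocks = (n + threads - 1) / threads;
    addKernel<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back and look at one element
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    std::printf("c[0] = %f\n", c[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```

Note how much of the code is preparing and moving data rather than computing — that's the paradigm shift, and it's what CUDA 6's unified memory is meant to ease.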
In reality you can try splitting your execution model — you can stream things onto the device and do separate things — but what it doesn't allow you to do is too much branching. This very large number of streaming processors means that even though the clock speed of a GPU is lower, the sheer number of them lets the GPU win out over a CPU. Of course, it only wins for problems which are readily parallelizable — I hope that's fairly obvious — and let's not forget that you can use both the CPU and GPU simultaneously, connecting the two when you need to pick up the result of the data that's been processed.

I think we already did the demo, so I'd better run along. The first limitation of CUDA is that it isn't an ordinary x86 processor: you cannot run the STL or Boost or whatever your favorite library is. It's mainly usable for data-parallel math — things like plus, minus, sine, cosine; there are a few functions supported in there — but it would be really problematic to build complicated object-oriented systems and run them in CUDA; that's not something people do. So ordinary code isn't going to run on the GPU. The next problem is that running several distinct tasks on the GPU is generally difficult — running lots of different processes is not really CUDA's game, shall we say. If you want to simulate a card game, where the outcomes of a Monte Carlo simulation differ from trial to trial, CUDA is not the right place to do it, and the reason is a problem called branch divergence. As I mentioned, each of the threads wants to run the same instruction at the same time, so if you have branching code on some of the threads, your parallel computation effectively turns serial — it turns sequential — which is really annoying, because it's a waste of processing resources.
Given these limitations, there must be something else you can use if you don't want just data-parallel math, and we have other coprocessor types to help take care of that. The problem is always the same: we want to plug a computer into our computer, and we cannot plug in a new CPU, so what can we do? The alternative is to put some kind of coprocessor on the PCI bus. A GPU is certainly a coprocessor, but there are other ones which are more flexible, shall we say. The pattern of interaction stays the same: you send some data to that other machine, you get some data back from it, and this is somewhat scalable.

The device I want to talk about is the Intel Xeon Phi — maybe the most well-known commercial coprocessor implementation. It's made by Intel, and it's essentially a PCI board with 60 cores. They are slow cores, just so you know — roughly Pentium-class cores, I think — but there are two key things about this. The first is that these cores can run entirely separate, independent tasks; you don't have that GPU constraint anymore, you don't have to do data-parallel math, you can do parallel anything, effectively. The second is that this thing supports x86. You can have your cake and eat it — you can have Boost and whatever. It does require a recompile, but Boost, for instance, has already been recompiled for you, by Intel themselves, I believe. Once you recompile, you can run your stuff on this card, and there are different ways of leveraging the parallelism, given that there are 60 cores there: you can use OpenMP, you can use MPI, you can use Cilk Plus, which is Intel's own proprietary technology, you can use ordinary pthreads if you're on Linux, or Threading Building Blocks. Some of the external libraries provided by Intel already leverage the Xeon Phi, so that's another benefit.

And this is a computer — not just something that comes with a driver, like a graphics card. It runs its own instance of Linux; you can SSH into it, and so on. There are different ways of interacting with this beast: one is running it independently as a separate machine, another is offloading some data to the device. As I said, you do need to recompile, and you do need special development tools for this — specifically the Intel C++ compiler. I have to confess I only use the Intel C++ compiler; I don't use the Microsoft Visual C++ compiler or whatever. I think the Intel compiler has a critical advantage: it exists on both Windows and Linux, so you can develop on Windows and then run something on Linux, and it's kind of equivalent, in a way — that's helpful. Intel also makes lots of tools for C++ developers, not just the compiler — libraries, profilers, and so on — and it does come with Visual Studio integration, as you may have guessed. Visual Studio integration is a key thing for me; I love Visual Studio. So to work with the Xeon Phi you basically need this compiler, you need the specific technology stack for interacting with it, and you need lots of patience as you figure out the paradigm, because it's somewhat different from what you might be used to.

The mechanics of interaction with this device: there are three different paradigms. First, you can offload things to the device — you launch your application on the host, and if you've got extra work that would benefit from running on the Xeon Phi, you offload that extra work; there are custom directives to take care of that.
Then there is the native execution mode: that's when you treat the whole thing as a separate machine, and you can have the same application compiled for both the host side and the Phi side. And there's symmetric execution as well: if you think of a Phi as simply an element in your compute cluster — something you would control with MPI or a similar technology — then the Phi can participate in that with no problem. So it's a very flexible device in terms of execution, depending on how you want to leverage it.

I want to give you a brief demo of the Xeon Phi — bear with me as I fire this up; this might take a while. Meanwhile, let me answer some of the questions, or at least try to. "CUDA has to copy data to the devices — at which magnitude of data do we gain something from CUDA?" Well, one of the problems with CUDA is that, yes, it does take time to send data to and from the device, which means that for an application like high-frequency trading, for example, CUDA is not the best option, because the time spent sending data is still fairly significant. If you want microsecond responses, that's going to be a problem; but if you have no tight timing requirements and you simply have computation requirements — your calculation takes too much time — that's when you leverage these things. Let me give you a kind of real-life example: the Moscow Exchange here in Russia gives me about eight gigabytes a day of data — the whole transaction logs, about eight gigabytes, and it can be more on particularly active days — and you want to process all of this in some meaningful time. You can certainly do it on ordinary machines; it will just take a ridiculous amount of time. That's when you leverage the GPU. As for instant responses, so to speak, that's when you leverage something like FPGA technologies and similar things.

Seems like the law of presentations is very much in force here... wait — now we're on a machine that has two Xeon Phi cards plugged in. Let's try doing something; I think I have something in the phi directory that I can show. Let me `vi phi.cpp`. Here's an example C++ application that's almost entirely meaningless: we're getting the number of processors — or in this case, on the Xeon Phi, it gets us the number of hardware threads, because there are 60 processors and each supports four hardware threads, so we get that amount — and then, notice, I'm leveraging OpenMP to do a sum in parallel: on each of these processors I'm incrementing this one variable, doing a kind of reduction. I know it's meaningless to use powerful technology like this for that, but anyway, that's what we're going to work with. Let me exit from here and show you how this whole thing gets compiled. I'm obviously required to use the Intel compiler, and I need the switch `-mmic` — MIC is an abbreviation for Many Integrated Cores, which is the key name for the Intel Xeon Phi technology. Then I want `-std=c++0x`, I also want OpenMP, and the file itself, `phi.cpp`. Try compiling this... wait a second... right, `module load intel`... try this again... hooray. So I've got an a.out, which is an application built specifically for the Xeon Phi. I can't run it here, because it's compiled for the device — see, I get an error, "cannot execute binary file". What I can do is send it off to one of the Xeon Phis, so I `scp a.out` to one of these two devices here, mic0 and mic1 — I send it to mic0. Hopefully that's been done; please ignore the weird error messages, that happens sometimes.
Anyway, let's try SSHing into the device and actually running it there: `./a.out`... ah, there's another issue — I have to tell it where to find the libraries, so `LD_LIBRARY_PATH=. ./a.out`... alright. I know it doesn't look like much — we've just executed code on 240 threads inside a Xeon Phi, and we're looking at a black-and-white console — but this is just a taste of what it's like to program this thing. Obviously, depending on the mode of interaction you choose, there are different issues and different levels of complexity, but it is something you can plug into your machine to make it more powerful, and it is something you can compile things like Boost for, which is key to the whole thing — that's what I want, basically.

Jumping back to our presentation: as I said, it's 60 processors with four hardware threads per core — that's where the 240 came from. It's got eight gigs of memory, which is a constraint, I guess, because there's nowhere to swap to; that's still a problem even here. It's got 512-bit SIMD registers, so if you're doing computation you get these massive SIMD registers to leverage as well. But for me the key thing is that you can branch — you can put in conditional logic, you can run different things on different processors. Another important thing is that these devices support programming models similar to what you have on an ordinary PC, so you can leverage OpenMP, use MPI, or do explicit threading if you want; Intel is working on other models as well, and I think OpenCL is supported already, so there's no problem in this regard. It's another fantastic device, and once again I'll stress that you can be using both the CPU and your Xeon Phis at the same time.
Obviously, if you have a cluster, then — in my case — you can be using, I don't know, 24 Teslas and 24 Xeon Phis at the same time, if you need that kind of power to compute something.

Next we come to the most obscure technology of all: the FPGA. FPGA stands for field-programmable gate array, and this is probably as close as you're going to get to designing your own CPUs or your own custom chips, as opposed to using something that's provided for you. For those of you who may not be too aware of computer architectures: the CPUs we have in desktop machines and smartphones are very general-purpose — you feed them instructions, they'll do anything for you. On the other end of the spectrum we have things called ASICs, which are essentially chips hard-wired to do one thing and one thing only — that's what's being used for Bitcoin mining right now, because all the other technologies are considered not powerful enough anymore. FPGAs are the middle ground between something hard-wired and something very flexible; they're kind of like hardware emulators. They're certainly slower than ASICs, but they can be faster than ordinary CPUs in certain scenarios — I'm not saying they're great for everything. One of the key things is that these beasts are typically programmed using hardware description languages such as VHDL and Verilog. There are other approaches, like using SystemC or, as I mentioned, OpenCL, and there are a couple of higher-level solutions: for example, MATLAB has this thing called HDL Coder, where you write something in MATLAB and it converts it into HDL — unfortunately that conversion doesn't please me so much — and another approach is mbeddr, which is actually built on our technology, JetBrains MPS; that's a way of generating things like state machines and other constructs from high-level descriptions, so check it out if you're interested in this sort of thing.

So the question is, why FPGAs? Well, an FPGA basically gives you a huge array of logic gates, and you can do anything you want with them — reconfigure them in any way — and the end result is that you can end up with something so intrinsically parallel that it blows everything else out of the water, because effectively you've designed your own CPU specifically for one task: say, handling data from the stock market. You have something coming in on the TCP/IP stream, you parse it, and — very quickly, faster than your competition — you rip out the data; you can even perform some calculations. Another key thing is that FPGAs generally don't draw much power. Another thing I like is that they have much better scalability: you can go off and buy a board which has twenty FPGAs on it; you cannot buy a board with twenty Xeons on it, at least not for an affordable amount of money. But the key thing about FPGAs is that they're not a commodity off-the-shelf solution. You can't just walk into your local PC World and pick up an FPGA card, because they're extremely specialized, and because programming them is very difficult and very costly. If you want to compare it to ordinary software development, I would estimate the time taken to program an FPGA for a particular task at ten to a hundred times longer — that's something to keep in mind if you want to leverage this technology.

There's a question here from Orlando: "For the domain you currently work in, which hardware alternative do you recommend?" Well, actually, I recommend using everything — every one of these devices has its particular use.
It really depends on what you're doing. In quant finance there's certainly scope for leveraging everything, basically — it's horses for courses. If you have some data-parallel computation that's taking a long time, you can send it off to the GPUs — especially if it's single-precision — fire it off across a cluster and all your Teslas will munch on it. If you're running very complicated logic — for example, simulating card games in a Monte Carlo setting — then you'd use the Xeon Phi; it works fantastically, it can branch for a change, you can write all the new C++, you can use Boost and whatever, so that's great. If you need very fast responses in a colocated setting — for example, you're getting data from the markets and you need to do something with it in microsecond timeliness — you're quite often going to be looking at FPGAs; there are custom boards where even the Ethernet interface itself is programmable. It really depends on what you're actually doing.

So what is the FPGA for? Once again, the idea is exactly the same — you're offloading some tasks from the CPU. An alternative is that there is no CPU in the path: the FPGA has an Ethernet port on it, you plug in the cable, and everything happens on the FPGA, while something entirely different happens on the CPU — that's also possible, with infrequent interactions between the two. The sky is the limit as far as implementations go. FPGAs are, of course, a lot less flexible. They're not so good at math, particularly, because there's typically no specially designed floating-point unit or math coprocessor, so you'll end up implementing fixed-point logic yourself — which is not always what you want.
But for fixed-size, fixed-point data it can be exactly the right solution. It is, however, a low-level construct, and if you factor in the cost of development and production, FPGAs are definitely relatively expensive. They don't draw much power, okay, but the whole process of getting them into a state where you can start leveraging them — start making money off those FPGAs — is quite far away from just buying a device and assuming it works. There's an additional duality, I guess, in the fact that first you have to buy the development boards, and then you have to either produce the actual cards yourself or have somebody produce them for you — and typically, if you buy third-party cards, they come in a particular configuration you'd have to adapt to. There's lots of stuff to cover here.

I wouldn't say FPGAs directly compete with ordinary CPUs; typically they take a chunk of the market by solving a fixed set of tasks, and their advantage is due to the massive parallelism — the fact that during a single clock cycle the FPGA will do exactly what you told it to do. You have a huge array of gates; they can do several calculations at the same time, they can spread data across the chip. It's totally different from the instruction-processing mindset we're used to. So the goal is fairly simple: you pre-program an FPGA to solve a particular problem set very quickly, and that's it. An example would be something like protocol parsing in hardware.

And yes, there's a fair comment here that I'm talking about FPGAs and it's unclear what this has to do with C++ programming. In actual fact you can program FPGAs using OpenCL, as I said — and I lump C and C++ into the same bucket, shall we say — but the key thing here is to talk about computation, about the technologies available for speeding it up. In the same regard you could say CUDA isn't really C++, because you can use the C interface, but there are C++-flavored approaches to all of these. And in fact, one of the key things about C++ as a language is this: when a new device comes out, what language do you think the manufacturer supports first? I'd bet that in the vast majority of cases — as we can see from our GPUs and from the Xeon Phi — the first language to be supported is C++.

I feel I've made it within the scope of a single hour, which is great, so let me try going through the questions; I think I've actually addressed most of them. There's a comment here about an FPGA kit for the Raspberry Pi — I haven't heard about that. I actually use Altera's FPGAs, things like the Cyclone and Arria families, but in the FPGA space we really only have two major manufacturers, Altera and Xilinx — there are a few smaller ones as well — and if you're looking for a ready-made solution, for somebody to build you FPGA cards, they're likely to be using either Altera or Xilinx.

"Can you elaborate on the protocol parsing?" Yes, indeed. One thing I didn't mention is that in addition to programming FPGAs with very low-level constructs — hardware description languages — it's also possible to grab existing, almost microprocessor-like implementations and put them right on the FPGA chip. For example, you can get an implementation of the TCP/IP stack and stick it right on your FPGA. So why is this protocol-parsing business happening at all? The reason is that on a typical financial market, not only do you get the data about the deals that happen on the market, you get data about every single order.
With a huge number of participants, everybody's putting in orders, canceling orders, orders are being folded into deals all the time, and you want to track it all somehow — and unfortunately conventional hardware doesn't do it as well as you'd like. In part it's become a competition of technology as well as skill, in the sense that people are simply finding ways of doing this a few microseconds faster than the competition — because if you build your order book a few microseconds faster, you have the right information a bit sooner, and you can jump on an arbitrage opportunity and actually make some money. So one of the reasons why custom feed handlers — that's the term for a protocol parser, a feed handler — exist, and why people actually sell commercial solutions in this space, is that we're engaged in a war: instead of buying tanks, we're buying FPGA-based feed handlers, we're buying network interface cards that are faster, that are programmable, that have more sophisticated interfaces than ordinary consumer cards, and so on. Basically, that's what this whole talk has been about — me discussing the methods of war, shall we say, at least in this industry. Of course that's not all you should be using these technologies for.

This is my final slide. I wanted to mention that at JetBrains we're working on both a separate C++ IDE and C++ support for ReSharper. I haven't really shown either of these, in part because they're still in the works, but the key thing is that we're generally targeting the general market, shall we say — commercial off-the-shelf PC users — and today I wanted to show the more complicated side of the world.
the key thing about all these custom technologies that I mentioned is they're only applicable to a tiny proportion of the population because ultimately most people don't need them most people don't need to I mean the only the only technology out of those I mentioned that gained any proliferation at all is the GPUs now there's a question here ah so turn to them okay that that's a very long question about streaming seemed extensions and I'm not sure off the top of my head I'm not sure I could answer it the key thing about SSD and the question here is to the tune of what performance benefits would one gain from using sim D in a particular setting the answer is not it's not so clear because sometimes when you jump from I like I said I like to mix openmp NSSE so sometimes when you mix a mix the two you don't get as much you don't get a linear performance growth and that's something that you know is worth it needs to be investigated because there is no there is no assumption that you know once you're using SSE you're getting you getting you know caught the performance over a times the performance it really depends on your task it really depends on the usage patterns the the the memory alignments and so on it's there are lots and lots of concerns in this regard so I'm not sure I can I can answer this particular question ah just like that everything requires measurement and that's why I was saying that you know typically we use two cards per machine but it does require measurement in the sense that you know if more cards per machine work for you then why not Who am I to stop you so I think we're done with the questions so I'd like to thank everyone for our wait wait wait can you recommend to use another so the question is uh what I think the what you mean to say is would you recommend to use Intel Xeon Phi for complex simulations like a car simulator I imagine yes why not I mean if you think about it you're getting 60 processes that can do you know entirely separate things 
So if you have a problem that scales to 60, or beyond 60 (as I said, there are four hardware threads per core, so that's 240), then totally go ahead and leverage the parallelism. But it has to bear comparison with the speed of the actual cores, and so on. Unlike with GPUs, I don't think Amazon offers Xeon Phis to play around with, although when I didn't have a device of my own, Intel actually gave me a virtual machine to experiment on. So for both GPUs and Xeon Phis, if you ask nicely, the companies that sell those devices will give you a virtual machine to investigate, and that's what I think you should do. If you have a car simulator or some other complex simulation, ask for a virtual machine, port your code over, and see if you get any performance benefits. If you do, you can consider buying such a device. They're not exactly cheap, but they're not terribly expensive either; it won't put you through the roof.

I guess we're done at this point. Thanks everyone for joining. If you have any other questions after this talk, feel free to tweet, email, or contact us however you want. We're going to put the recording and the slides up at some point, so you can look at those. Best of luck with your high-performance development practices. Take care, bye-bye.

6 Comments

  1. Hariprasad Kannan said:

    23:25 SIMD is an example of data-level parallelism.

    June 29, 2019
  2. Dzung Nguyen said:

    nice talk πŸ™‚

    June 29, 2019
  3. John Foe said:

    Thanks!

    June 29, 2019
  4. Kowboy USA said:

    Appreciate the upload, but I was soon lost. Maybe if it were uploaded in smaller, more specific units, it would be easier for a viewer like myself to keep up.

    June 29, 2019
  5. You Wang said:

    Maybe this knowledge could be used in mobile computation…

    June 29, 2019
  6. Dev Guy said:

    What a waste of time talking about managed code? You lost me!

    June 29, 2019
