Stanford Seminar – Computational memory: A stepping-stone to non-von Neumann computing?



Okay, so thanks, Sandy, for the introduction. As you know, these are really exciting times in the field of information technology — we genuinely don't know what comes next — so it is a real pleasure, in fact an absolute honor, for me to come to Stanford, this cauldron of innovation, share some of the work I have been doing in the space of computational memory, and get your opinion on it. I would also appreciate it if the lecture is interactive, so please ask questions during the talk.

So where do I come from? I'm from IBM Research in Zurich. I put up this picture to show that it is actually a very old research lab — the Zurich lab was established in 1956, and you can see from these pictures that, luckily, we don't dress like that anymore. Currently the lab is located in the beautiful village of Rüschlikon by Lake Zurich, but of late it has been looking more like this — all the more reason for me to be in California now.

This is the agenda for the day. First I want to talk about what this in-memory computing, or computational memory, is and why we need it. Then I will look at what constitutes a computational memory — what do we have in there. Then we'll talk about some of the operations, some of the computational tasks, that you can perform in computational memory, and then we will look at the concept of mixed-precision in-memory computing. Finally, I will conclude with a summary and outlook.

So, the motivation. I'm sure you have seen some of the statistics listed here: something like 2.5 exabytes of data are created every day, 90% of the data in the world today has been created in the last two years, and the global scientific output doubles every nine years. The amount of data we are generating is truly remarkable. On top of that we have the IoT revolution going on: by some estimates we will have something like thirty billion devices connected to the Internet by 2020, which is just a couple of years from now — everything from smartphones to kitchen appliances — and all of these are generating a tremendous amount of data; we expect something like 40 trillion gigabytes of data to be generated by these connected devices. The impact is that an increasing number of our computers are literally combing through this vast amount of data and trying to derive insights and intelligence from it, so much so that many of us believe we are on the cusp of a revolution in artificial intelligence.

Why is that important? Because this AI revolution will have a significant impact on our computing systems, and there is an immense computational challenge involved. Nothing tells the story better than looking under the hood of one of those computers that looks at data and tries to derive intelligence: IBM's Watson, a classic example, which beat two former world champions at Jeopardy in 2011. It was a remarkable achievement, but the not-so-remarkable aspect was the cost. Watson was consuming about 80 kilowatts of power, as opposed to the human opponents consuming tens of watts; Watson had 2,880 processor threads and 16 terabytes of RAM, which I'm fairly sure the human opponents did not. Much of this inefficiency is attributed to the fact that Watson, just like any other computing system you work with, is still based on the von Neumann architecture: there is a physical separation between where you store your data and where you process it, and any time you want to perform a computation you have to move the data from the memory to the processing unit and back.

Let's fast-forward to 2018. This is the landscape of AI algorithms today, and in all these cases they are running on conventional von Neumann computers, be it CPUs or GPUs. There are some ASICs coming into play, but essentially they are all von Neumann machines. So what we have done is throw immense computing resources at the problem of AI, and even then, look at the results: we still need weeks to train certain state-of-the-art deep neural networks, in spite of having all these computational resources at hand.

Before we say "let us move beyond von Neumann computing", there is of course a lot we can do within von Neumann computing itself. One direction is storage-class memory. The idea is as follows: if you zoom into the memory part of the architecture, you typically have fast, volatile DRAM and slow, non-volatile flash. If we could store a large amount of data in a non-volatile fashion and still access it reasonably fast, that would improve the performance of the computing system; that is what we refer to as storage-class memory — something that sits between DRAM and flash. There is a lot of activity there, and we have even started to see commercial products that try to fill that niche between DRAM and flash. Then there are ideas such as processing in memory — I think Andy was alluding to that — where you bring processing units very close to the memory, possibly even into the memory chip, so that you can accelerate certain workloads; this is called processing in memory, or in-memory/near-memory computing. And there are more exotic ideas, like the one being pursued here at Stanford by Philip Wong and others, with monolithic 3D integration of memory interleaved with computational layers.

If you look at all these approaches, the overarching theme is to minimize the time and distance to memory access, but they are all confined within the von Neumann architecture — we are not deviating from the von Neumann picture. What we do is fundamentally different from that: it is a distinctly non-von Neumann approach, and that is the in-memory computing approach. So again, let us look at the conventional system.
You have the processing unit and the memory unit, and some data A written in the memory somewhere. If you want to perform an operation f(A) on that data, you have to fetch A, bring it all the way to the processing unit, perform the operation, and store the result back in the memory unit — and this is what causes the immense inefficiency. Now suppose we have a slightly different system: again a processing unit and the conventional memory, but besides the conventional memory, assume we have something called a computational memory sitting here. If A is now written into the computational memory instead of the conventional memory, and you want to perform f(A), what you could do is apply a signal to the computational memory unit, and somehow f(A) gets computed in place, in the memory itself, without the data ever having to travel to the processing unit. That is the whole concept of in-memory computing.

There are a few caveats, of course — we cannot do everything this way. We can perform certain computational tasks using these memory devices in place, without the need to shuttle data back and forth. Having said that, we can do quite a few things: logical operations, arithmetic operations, and we can even tackle some optimization problems in this manner. How exactly do we do it? We exploit the physics of the storage devices — the physical attributes and state dynamics of the memory devices on which the information is stored. That is the whole idea. Before I proceed, are there any questions so far? Okay, perfect.

So what do these computational memories comprise? Let's look at how we store information in our existing memory devices. In both DRAM and flash you store information in the charge state of a capacitor: in DRAM you have a capacitor in series with a transistor, and in flash a capacitor coupled to the gate of a transistor, and the presence or absence of charge determines the logical states 0 and 1. This is how we have been storing information for decades — Dennard invented the DRAM in 1966; think about that. There is some recent work on using this type of charge-based storage device for in-memory computing as well, but it is very recent, and I believe a better way to perform in-memory computing is with a different type of device, not one based on charge storage. That is what I want to show here.

In this mode of storing information, we do not store information in the charge state of a capacitor but rather in the atomic arrangement within a nanoscale volume of material: one atomic arrangement gives you one logic state, another arrangement gives you another. These atomic arrangements can be altered by applying electrical pulses, and the differences between them can be measured as a resistance — by measuring the low-field resistance you can decipher which state you are in. The underlying physical mechanisms span a wide range, all the way from ionic drift in metal oxides to phase transitions in certain types of materials.

What I want to talk about in more detail is one of these mechanisms, the one based on phase transitions, which is known as phase-change memory. By the way, this type of memory is called resistive memory because you decipher the stored information — encoded in the atomic configuration — through the resistance state; that is why they are known as resistive memories, or memristors. Phase-change memory is one of them. It is based on a unique property of compounds of germanium, antimony and tellurium: in the disordered atomic arrangement of the amorphous phase they have a very high resistivity, and in the ordered crystalline phase a very low resistivity. If you sandwich this material between two metal electrodes, as shown here, then by applying suitable electrical pulses you can reversibly change the phase between amorphous and crystalline, and because of the big difference in resistivity you get a very low resistance when the material is crystalline and a very high resistance when it is amorphous. This you can read back by applying a low bias voltage.

Clearly, phase-change memory gives you these two binary states — predominantly amorphous and predominantly crystalline — but we are also able to achieve intermediate phase configurations by applying suitable electrical pulses. Since we can create these mixed-phase configurations, we can achieve a continuum of conductance values, as shown here: plotting the resistance against the amplitude of the applied voltage pulses, you can see that you get a continuum. Essentially, what we have is an analog storage device. Besides storing a continuum of resistance states, these devices also have very rich dynamic behavior. The dynamics of a PCM device is an implicit feedback interconnection of electrical, thermal and structural dynamics: there is a strong field and temperature dependence, a very interesting thermal system with huge thermal gradients within the device — which means thermoelectric effects start playing a role — and very interesting structural dynamics such as crystallization and structural relaxation. I will come back to how we can compute using the structural dynamics; that is why it is so interesting. So we should be able to exploit the dynamics of PCM devices to perform computation in place.

But first, let us look at performing logical operations using computational memory. I only have a couple of slides on this, but I want to cover it because some of the first ideas on computational memory came about in this context.
So, for completeness' sake, what is the idea there? In conventional CMOS logic there is only one logic state variable, and it is represented in terms of voltage: your inputs are voltage signals, you process voltage signals, and you output voltage signals, and the CMOS gates essentially regenerate this state variable during the computation. Now suppose you bring memristive — resistive memory — devices into the mix: why not use the resistance state of these devices as an additional state variable for computation? Take a resistive memory device: say the high-resistance state is logic 0 and the low-resistance state is logic 1, and the nice thing is that we can toggle between the two by applying electrical pulses. That means we should be able to enable a whole set of logical operations just through the interaction between the voltage state variable and the resistance state variable. In fact, over the years people have come up with many such logic families, where the logic is built from both state variables rather than from voltage alone. Why is this interesting? Because it gives you a natural pathway towards seamlessly integrating memory and processing in the same place.

I will focus on just one such logic family, which has the very interesting property of statefulness. What do I mean by stateful? Stateful logic means that you have only one logic state variable, and in this case it is not voltage but resistance: both the inputs and the output are resistance states. Let me illustrate this with one simple example of how you could perform the NOR operation in stateful logic. Say I have three devices hooked up in this particular manner, and, as I said, in this logic family the logic is represented only in the resistance state variable — both inputs are devices in their resistance states, and this third device is the output. First, initialize the output to logic 1, the low-resistance state. Now, if either of the inputs is in logic 1 — meaning the low-resistance state — and we apply a voltage Vc here, then at least Vc/2 will drop across the output device, and you can tune the voltage such that when you do this the output toggles from 1 to 0, because of the way the circuit is hooked up. Now take the scenario where both inputs are in the 0 state, the high-resistance state: if you apply Vc, hardly anything drops across the output, which means the 1 remains a 1 — it does not toggle. So the inputs are resistance states, and the output is now the result of the logical NOR operation, implemented purely with resistance as the state variable. That is the whole idea behind stateful logic.
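To make the stateful-NOR idea concrete, here is a minimal Python sketch that simulates the scheme just described: logic values live only in device resistances, the two input devices act as a voltage divider against the output device, and a sufficient voltage drop toggles the output from 1 to 0. The resistance values, the conditioning voltage V_C, and the switching threshold are illustrative assumptions, not the parameters of any real memristive device.

```python
# Minimal simulation of "stateful NOR": logic lives only in device resistance.
# All numbers (resistances, V_C, switching threshold) are illustrative assumptions.

R_LOW, R_HIGH = 1e4, 1e7     # ohms: logic 1 (low resistance), logic 0 (high resistance)
V_C = 2.0                    # applied conditioning voltage (assumed)
V_RESET = 0.9                # drop needed across the output device to toggle it (assumed)

def to_resistance(bit):
    """Logic value -> device resistance."""
    return R_LOW if bit == 1 else R_HIGH

def stateful_nor(a_bit, b_bit):
    r_a, r_b = to_resistance(a_bit), to_resistance(b_bit)
    r_out = R_LOW                               # step 1: initialise the output device to logic 1
    r_inputs = 1.0 / (1.0 / r_a + 1.0 / r_b)    # the two input devices act in parallel
    v_out = V_C * r_out / (r_out + r_inputs)    # voltage-divider drop across the output device
    if v_out >= V_RESET:                        # enough drop -> output toggles 1 -> 0
        r_out = R_HIGH
    return 1 if r_out == R_LOW else 0           # read the resistance back as the result

for a in (0, 1):
    for b in (0, 1):
        print(f"NOR({a},{b}) = {stateful_nor(a, b)}")
```

With these numbers, any input in the low-resistance state pulls roughly Vc/2 or more across the output and resets it, while two high-resistance inputs leave it untouched — exactly the truth table of NOR.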
That was a very simple, single-gate example, so how can we scale it up? The idea is that you can perform bulk bitwise operations: instead of operating on a single device, you put the whole scheme into a large array and operate across an entire row of devices. For example, you have one row of devices representing a vector, an adjacent row, and you would like to perform a NOR and store the result there. You first initialize all the output devices to the low-resistance state 1, then apply Vc to the row, and because of the way the current flows you get the NOR operation implemented and written onto the adjacent row. So we can perform bulk bitwise operations in a completely stateful manner, and if you do not want the operation applied to some devices, you can use appropriate isolation voltages to exclude them. The idea, then, is that you can take more complicated processing tasks, break them down into these sub-operations, and work towards a more general-purpose processor built on these very basic operations. That is the idea behind performing logical operations in computational memory. Any questions?

[Audience question about the switching behavior of the devices.] Yes, you're right — this is very simple to show in a PowerPoint setting, but in reality you have to take care of the switching dynamics and related effects, so you clearly need to consider that. I have not really talked about the device dynamics here — for example the electrical dynamics; there are some very interesting effects, such as threshold switching in phase-change memory, on which we have published a lot of work, but I don't think I'll talk about that much today. It is in the literature.

[Audience: if this were a useful function, could it not have been done with earlier technologies, going back decades, with DRAMs and so forth? The way a memory works is that a bit line is pulled high if a transistor is set and not if it isn't; with an existing memory device of pretty much any technology, if you activate two word lines and either of them can pull the bit line high, you would have the same function being performed.] Right — in fact, there is a paper out of CMU and ETH where they show that you can do bulk bitwise operations using DRAM. I think I have cited it in my reference list towards the end, so you could look at it; I think they are doing exactly what you describe. [Audience: the point is that existing technology allows you to do this, whereas jumping to a new, somewhat less well-established technology and linking that new technology with a new computing philosophy at the same time generally increases the risk.] Sure, sure — and mind you, I am not saying that the NOR operation is the best thing to do with these devices; I will get to other things as well. But I must also say that even in this context there are many other logical operations you can implement and many different logic families, and some of them have their own merit. For the particular operation you mentioned, there is indeed very recent work using DRAM to perform bulk bitwise operations. But there is also the
fact that this logic is stateful, which is very interesting: it is non-volatile, which is extremely nice — the result simply stays in the devices, and you can read it back at any point in time.

Okay, I will now move on to arithmetic operations using computational memory. One of the arithmetic operations you can perform with these devices is matrix-vector multiplication. Suppose you want to compute Ax = b. You arrange the resistive memory devices in a crossbar configuration, as shown here, and you map the conductance values of the devices proportionally to the elements of A — so G11 is proportional to A11, and so on. Then you apply voltage signals along the rows, proportional to the values of x. If you do that, the current that flows out through each column is proportional to the corresponding element of the result b. Here we are using the multi-level storage capability — binary storage is no longer enough, because you need to map A onto G, so you need the analog storage capability — and we are exploiting Kirchhoff's circuit laws: Ohm's law and Kirchhoff's current law. What is really nice is that we can reverse the process: apply the voltages along the columns and read the currents back through the rows, and that gives you multiplication with the matrix transpose. So you can perform multiplication with the matrix and with its transpose using the same array.

Before performing these multiplications, we have to store the conductance values in the devices, and for that we use what we call an iterative programming algorithm. To reach a desired conductance value, we apply a sequence of pulses: you apply some pulses, look at how far you are from the target G, and keep modifying the applied pulses until you eventually reach the target value. This works beautifully well: these are experimental results from our lab on phase-change devices across a large array, and the distributions show how well the algorithm performs in reaching the desired target values.

Once the desired conductance has been achieved with this iterative scheme, you can perform, for instance, scalar multiplication — this is just an illustration of Ohm's law, again experimental. Suppose you want to compute beta = alpha times c: you map alpha to a conductance value, program the device precisely with the iterative scheme, map the read voltage proportionally to c, and read back the current, which gives you an estimate of beta. We performed this operation over 1,024 combinations of alpha and c — many different devices programmed to different conductance values.
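Here is a minimal NumPy sketch of the crossbar matrix-vector multiplication just described, including the transpose read-out from the same array. The multiplicative noise term is a crude stand-in for device and read-out non-idealities; the mapping of matrix entries to conductances (scaling, and handling negative values with device pairs) is glossed over, and the 2% noise level is an assumption, not measured data.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_read(exact, noise=0.02):
    """Crude stand-in for device/read-out non-idealities: multiplicative noise."""
    return exact * (1.0 + noise * rng.standard_normal(exact.shape))

def crossbar_Ax(A, x):
    # Voltages proportional to x are applied along one set of lines; each device
    # sources a current G*v (Ohm's law) and Kirchhoff's current law sums the
    # contributions along the perpendicular lines -- the product in one step.
    return analog_read(A @ x)

def crossbar_ATy(A, y):
    # Driving the other set of lines of the *same* array and sensing the first
    # set gives the transpose product "for free" -- used by the reconstruction later.
    return analog_read(A.T @ y)

# The matrix would be mapped to conductances with the iterative programming scheme.
A = rng.standard_normal((5, 4))
x = rng.standard_normal(4)
print("analog A x :", crossbar_Ax(A, x))
print("exact  A x :", A @ x)
```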
This plot shows the computed beta-hat against the exact beta, and you can see it is almost linear — fairly good. That is the kind of precision we have when performing multiplication using Ohm's law, and this is the corresponding error distribution. We can also average over multiple devices: these devices are really tiny, so you can have many devices encode the same information and take the average, and as you increase the number of devices you average over, the multiplication precision improves. Again, this is experimental data on PCM devices.

Now, suppose we have a multiplier like this — what is it good for? There is in fact a lot of work going on right now in which people use crossbar-based multipliers for a range of applications. One specific application we looked at in Zurich is compressed sensing and recovery. The idea of compressed sensing is as follows: you take a signal from a high-dimensional space, compress it into something of lower dimension, and thereafter reconstruct the signal accurately. What is really nice is that the sampling and the compression happen simultaneously: unlike other compression schemes, we are not taking the whole signal and then compressing it; rather, we compress it as we sample it. As you can imagine, compressed sensing is extremely useful in several applications, such as MRI, facial recognition, or even mobile-phone camera sensors — it is useful in many domains.

How do we do it? The compression is an extremely simple operation: you take a signal of dimension N and bring it to dimension M, and if the compression is done with a linear map, it is a single matrix-vector multiplication, y = Ax — an N-dimensional input and an M-dimensional output. This A matrix can be coded into the memristive array, so with O(1) time complexity you can perform the compression; that is the easy part, a single multiplication. The reconstruction is a bit more tricky: you need to go from the M-dimensional signal back to an N-dimensional signal, and for that you typically use an iterative procedure. One such iterative reconstruction is based on the approximate message passing algorithm, which came out of work here at Stanford: an iterative scheme in which you reconstruct the higher-dimensional signal from the lower-dimensional signal you compressed to. If you look at this algorithm, it involves a whole series of matrix-vector multiplications with the A matrix as well as with its transpose — and both of them can use the same array. So we have a single measurement matrix, M by N, and we use the same array for the compression as well as the reconstruction; that is the beauty of the whole thing. If you work out the math, you see that overall the reconstruction complexity drops from order N·M to order N — a significant reduction. So does it work? Yes, it does.
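The sketch below walks through the compressed-sensing flow under the same noisy-crossbar model: a single analog matrix-vector product for the compression, and an iterative reconstruction in which every product with A and with its transpose would reuse the same array. For simplicity it uses a plain iterative soft-thresholding (ISTA-style) loop as a stand-in for the approximate message passing algorithm mentioned in the talk; the signal dimensions, sparsity level, noise level, and regularization constant are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossbar(A, v, noise=0.02):
    """Noisy analog matrix-vector product (same crude model as before)."""
    out = A @ v
    return out * (1.0 + noise * rng.standard_normal(out.shape))

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Sparse test signal and a random measurement matrix (both assumptions).
n, m, k = 256, 128, 10                               # 50% compression, k-sparse signal
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)         # would live in the PCM array

# Compression: one analog matrix-vector product, O(1) in time on the crossbar.
y = crossbar(A, x_true)

# Reconstruction: iterative shrinkage (ISTA-like) standing in for AMP.
# Every product with A and with A.T is done through the same (noisy) array.
step = 1.0 / np.linalg.norm(A, 2) ** 2               # safe gradient step size
lam = 0.1
x_hat = np.zeros(n)
for _ in range(300):
    residual = y - crossbar(A, x_hat)
    x_hat = soft_threshold(x_hat + step * crossbar(A.T, residual), step * lam)

err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"normalized reconstruction error: {err:.3f}")
```

The point to notice is that the imprecision only perturbs each iteration slightly; the iterative structure of the reconstruction still drives the estimate towards the true signal, which mirrors the convergence behavior reported in the experiment.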
We performed experiments on compressed sensing in the context of image compression: we took a 128-by-128-pixel image, did a 50% compression, and then reconstructed it, while the whole time the measurement matrix was coded onto a PCM array — the measurement matrix lives entirely in the PCM array, all the multiplications are done using Ohm's law, and we got the result back from that. A couple of interesting points. First, the reconstruction is not too bad, given how low the precision is. I have also shown the normalized mean-square error as a function of the iterations — as I told you, the reconstruction is an iterative process. It is of course not as good as floating point, but, very interestingly, the convergence rate is not affected: the iterative algorithm still converges at the same speed even though it runs on lower-precision multiplications. It is slightly better than a 4-bit fixed-point implementation, which means our multiplier has an equivalent precision of roughly a 4-bit fixed-point operation. And compared to implementing the whole thing in an FPGA or similar, we still get a power reduction of around 50 times. Also, keep in mind that the measurement matrix has dimension M by N — that is a lot of data, and all of it resides in the computational memory without ever having to travel back to the processing unit. That is a big advantage: a lot of data stored in the computational memory, with the operations performed right on it.

Okay, any questions on this part? [Audience: how many write operations does the programming take?] It takes something like six to seven write operations, and each PCM write pulse is typically around 50 nanoseconds — so roughly six of them taken together. [Audience: can the devices be programmed in parallel?] Yes — it depends on how many write circuits we have; typically the limit comes from how much current can flow through the wires. But having said that, in all these applications you are not changing the weights at all: in compressed sensing the measurement matrix just sits there, never changed once programmed, and is then used for all the compression tasks. So the programming cost is something you can amortize. [Audience question about the voltages and the conversion circuitry required.] I view computational memory as an extension of memory: you have memory, and you typically have memory controllers. Even now, a memory chip has a controller taking care of some of these things, so we would have a computational-memory controller that does the conversions that are required. If you look at a multi-level flash chip, it already does a great deal — it does iterative programming, for example. Many of the things I mentioned here are already being done in existing memory chips, so I cannot see anything that is substantially different from what is already out there.
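Since the iterative program-and-verify scheme came up again in the questions, here is a toy sketch of it. The device model is purely phenomenological: every constant (pulse response, conductance window, noise levels, tolerance) is an assumption chosen only to make the loop behave plausibly, not a description of real PCM behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyPCM:
    """Phenomenological single-device model; all constants are illustrative assumptions."""
    def __init__(self):
        self.g = 0.1                                    # conductance, arbitrary units

    def apply_pulse(self, amplitude):
        # A partial-SET pulse grows the crystalline region a little; the growth is
        # stochastic, which is why programming has to be iterative in the first place.
        delta = 0.4 * amplitude * (1.0 + 0.3 * rng.standard_normal())
        self.g = float(np.clip(self.g + max(delta, 0.0), 0.1, 10.0))

    def reset(self):
        self.g = 0.1                                    # melt-quench back to amorphous

    def read(self):
        return self.g * (1.0 + 0.01 * rng.standard_normal())   # small read noise

def iterative_program(device, g_target, tol=0.05, max_pulses=30):
    """Program-and-verify: read, compare with the target, adjust the next pulse, repeat."""
    for pulse in range(max_pulses):
        error = g_target - device.read()
        if abs(error) <= tol:
            return pulse                                # within the target window
        if error > 0:
            device.apply_pulse(min(error, 1.0))         # pulse scaled with the remaining gap
        else:
            device.reset()                              # overshot: reset and accumulate again
    return max_pulses

dev = ToyPCM()
n = iterative_program(dev, g_target=3.0)
print(f"G = {dev.read():.2f} (target 3.0) after {n} programming pulses")
```

The loop typically terminates after a handful of pulses, which is consistent with the "six to seven write operations" figure mentioned above, though the exact count here is an artifact of the assumed toy model.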
Okay, let's now go to computing with device dynamics. I talked about logical operations and arithmetic operations — the question is, can we do a bit more? Can we compute using the dynamics? For the logical operations we used only the binary storage capability, and for the arithmetic operations the multi-level storage capability, but neither of them used any dynamics. The vision is this: you have a whole bunch of memory devices sitting there, you want to perform some operation f on the data stored in them, so you apply an electrical signal to all these devices, the conductance of the devices evolves in accordance with this electrical input, and somehow the result of the computation gets imprinted in the memory devices themselves. That would be really cool.

One obvious candidate, at least in the context of phase-change memory, is the so-called crystallization dynamics. If you take a PCM device, initialize it to a predominantly amorphous state as shown here, and then apply successive electrical pulses such as these, the crystalline front grows: the amorphous region progressively crystallizes, which means the conductance of the device keeps increasing with the number of pulses, as shown here. What is also nice is that if you increase the amplitude of the programming pulse, you get a larger increase in conductance per pulse — here the programming current increases from 50 all the way to 300 microamps. Essentially, what we have is a non-volatile integrator, an accumulator. Can we use this very interesting behavior for computation?

As a first application, I will show how we could use it to find factors of numbers. It is more of a toy example, but I think it really gives an idea of how the whole thing works. Take a PCM device and assume it is initialized such that it goes into the low-resistance state after you apply exactly X pulses — that is when it completely crystallizes. The question is: can we use this device to find out whether X is a factor of some larger number Y? What you do is apply Y pulses to the device, and as you apply the pulses, any time the resistance of the device drops to the low-resistance state, you reset it; you keep doing this until you have applied all Y pulses. After the application of the Y pulses, you go back and measure the resistance of the device: if it is in the low-resistance state, then X is a factor of Y; otherwise it is not. A very simple algorithm.
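Here is a small simulation of the factorization idea, using an idealized accumulator device that reaches the low-resistance state after exactly X pulses. Real devices are stochastic, so this captures only the logic of the procedure, not the device physics.

```python
class AccumulatorDevice:
    """Idealized PCM accumulator: reaches the low-resistance state after exactly
    `pulses_to_set` crystallizing pulses (real devices behave stochastically)."""
    def __init__(self, pulses_to_set):
        self.pulses_to_set = pulses_to_set
        self.count = 0

    def pulse(self):
        self.count += 1

    def is_low_resistance(self):
        return self.count >= self.pulses_to_set

    def reset(self):
        self.count = 0                       # melt-quench back to the amorphous state

def is_factor(x, y):
    """Apply y pulses to a device that SETs after x pulses, resetting it whenever it
    goes low mid-stream; x divides y iff the device ends in the low-resistance state."""
    dev = AccumulatorDevice(x)
    for i in range(1, y + 1):
        dev.pulse()
        if i < y and dev.is_low_resistance():
            dev.reset()                      # went low before the last pulse: reset, keep going
    return dev.is_low_resistance()           # low after the y-th pulse  <=>  x divides y

# Parallel version: one device per candidate factor, as in the experiment.
candidates = [4, 6, 9, 11, 13]
y = 44
print(f"factors of {y} among {candidates}:",
      [x for x in candidates if is_factor(x, y)])
```

Running this reports 4 and 11 for y = 44, mirroring the experimental illustration described next.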
This is just a schematic illustration. You want to know whether 4 is a factor of 8: you apply eight pulses, measure the resistance after the eighth pulse, and the device has gone to the low-resistance state — so yes, 4 is a factor of 8. Is 4 a factor of 10? You apply ten pulses and measure the resistance after the tenth pulse: the resistance is high, so no, it is not a factor. As simple as that. Does it work experimentally? Yes, it does — and even better, we can do it in parallel. You can have different devices initialized to different states: for example, devices corresponding to thirteen, eleven, nine, six and four — the number of pulses each takes to go to the low-resistance state — then apply Y pulses to all of them and check which devices are now in the low-resistance state: those are the factors of Y. This is an experimental illustration of exactly that: as the number of pulses increases you can, at any point, check which devices are in the low-resistance state and read off the factors. Look at 44, for example: the devices corresponding to X = 4 and X = 11 are both in the low-resistance state at 44, which means 4 and 11 are factors of 44. It is a very simple experiment, but it shows what I mean by computing in place and storing the result in the device in the form of resistance levels. Clearly, it is more or less a toy example to illustrate the concept.

Let's now move on and see whether we can solve more interesting problems. One such problem is learning the correlations between signals in a completely unsupervised manner. It is a very interesting problem that arises in a range of applications: suppose you have a large number of event-based data streams — binary stochastic processes — coming into a correlation detector, and a subset of these processes are weakly correlated with each other; you would like to find out which ones they are. You can imagine this coming up wherever you want to detect anomalies or figure out which signals are interesting. The question is whether we can perform this using the device dynamics — and indeed, yes. You take these event-based processes — binary stochastic processes, ones and zeros — say a million of them, and they come into the PCM array, the computational memory, with each process assigned to a single PCM device, as shown here. Any time a process produces a one, you apply a pulse to its PCM device — one of these crystallizing pulses — and the amplitude of the pulse is proportional to the instantaneous sum of all the processes: at each point in time you count how many of them fired, and the amplitude is proportional to that. I should also say that before we start, all devices are initialized to the high-resistance, fully amorphous state. What you see is that, as this progresses over time, the devices connected to the correlated processes move towards a high-conductance state, while the ones connected to the uncorrelated processes remain at low conductance. So, just by reading back the resistance values of these PCM devices, we are able to determine which processes are temporally correlated and which are not.
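Below is a minimal simulation of that correlation-detection scheme: each process gets one accumulator "device", and whenever a process fires, its device receives an increment proportional to the instantaneous sum of all processes. The way the correlated processes are generated (a shared hidden event train), the increment rule, and the read-out threshold are all illustrative assumptions; the real experiment used a million PCM devices and much weaker correlations.

```python
import numpy as np

rng = np.random.default_rng(2)

# N binary stochastic processes; the first n_corr share a weak common driver.
N, n_corr, T, p, c = 200, 20, 5000, 0.05, 0.4
common = rng.random(T) < p                           # hidden event train (assumption)
events = rng.random((T, N)) < p                      # independent background events
events[:, :n_corr] |= common[:, None] & (rng.random((T, n_corr)) < c)

# One device per process, modelled as a bounded accumulator.  Whenever a process
# fires, its device gets a crystallizing "pulse" whose amplitude is proportional
# to how many processes fired in that time step (the instantaneous sum), so
# devices tied to correlated processes grow faster on average.
g = np.zeros(N)
for t in range(T):
    fired = events[t]
    amplitude = fired.sum() / N                      # normalised instantaneous sum
    g[fired] += 0.01 * amplitude                     # toy conductance increment
    g = np.clip(g, 0.0, 1.0)                         # stay inside the device window

# Read back the conductances and threshold them (threshold choice is illustrative).
threshold = g.mean() + 2 * g.std()
detected = np.where(g > threshold)[0]
print("devices flagged as correlated:", detected)
print("ground truth:                 ", np.arange(n_corr))
```

The separation between the two groups of conductances is what the experiment reads out directly from the array; here the final thresholding is the only step a conventional processor would still have to do.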
In fact, although this algorithm looks very trivial, what it is effectively computing is a sum over the covariance matrix of the data — that is what it does; it just looks very simple. And this is one instance where you really are performing a rather sophisticated computational task completely in memory, with the result of the computation stored in the memory itself. So again, does it work? Yes, it does. We used a million devices in this case — these are all experimental results — and each PCM device is connected to one of these stochastic processes, represented here by flickering pixels: each time a pixel flickers, that is one of the ones showing up. Some of these pixels flicker with a weak correlation — the correlation coefficient is 0.01, so you really cannot see it by eye. Each pixel is mapped to one of a million devices, arranged as a thousand-by-thousand array, and you can watch the conductance values read from the PCM devices evolve as time progresses: you see which pixels were temporally correlated, and that information is now written entirely into the PCM devices. You can imagine that this is one instance where we really cannot say where the processing is happening and where the storage is happening — if someone can tell me where the processing is, I would be extremely interested. What we have done is imprint the result of the computation into the memory devices.

We also wanted to see how this kind of algorithm fares against state-of-the-art computing systems, because someone could say, "I can implement this algorithm on a state-of-the-art machine." For that we chose a system with two POWER8 processors and four GPUs and implemented the exact same algorithm on it, with a lot of optimization to make sure we used all the available parallelism. This is the resulting run time as a function of the number of processes, and you can see that by the time we reach around 10 million processes, even in run time our system is a factor of 200 better than the implementation on the conventional computing system. In terms of energy we are again far superior — orders of magnitude better in power. To my knowledge this is one of the largest and most significant demonstrations of true in-memory computing using device dynamics, where the result of the computation is written into the devices.

[Audience: which algorithm did you run on the GPUs?] In this case we just wanted to compare apples to apples. We tried two algorithms: one of them was k-means, which of course is not more efficient than this, and the other was the one based on sums of the covariance matrix — those are the two we looked into. If you have something else in mind, I'd be interested. [Audience: by deliberately choosing the exact same algorithm, you may not be taking advantage of the properties of the GPUs, so it may be an unfair comparison to say "look how much power we save, look how much faster we are" when a different algorithm might have made that difference smaller.] The idea was mostly to make a fair comparison between the two — but do you have something in mind? Because we could not think of anything else; when you try to find correlations between these patterns, the two things that came to mind were these two algorithms. I don't know whether there is something else we could try, but it is an interesting point. Any other questions on this part?

[Audience: how do you set the detection threshold when you are dealing with billions of elements?] You should know how many processes you have, what the rate of the processes is, and how much correlation you expect; from that you can do some rough calculations to figure out how to set the threshold — it is basically setting the threshold of a binary classifier. But you are right, it requires some tuning. [Audience: it looks to me like the problem right now would be where to get memristive devices — it seems you have to build the devices yourself to take advantage of this; isn't that a situation where you are building something with a very specific intent and tailoring it to a particular class of problems?] Yes — but having said that, if you look at the memory manufacturers out there, they are all capable of doing this, Samsung and the others. Whether it is easily accessible in an academic environment is a different question, but many of the companies do have access to this. [Audience: so you can actually get an advantage by building special-purpose machines.] Right.

Okay, let's move on to the next topic, which is mixed-precision in-memory computing. I guess the elephant in the room so far is the lack of precision — that is one thing people keep pointing out; it came up just now in the context of calibration, and I showed you earlier that when we perform matrix multiplications there are clear precision issues. That is why we started looking into the concept of mixed-precision computing, to counter exactly that.

So what is the challenge of imprecision? When we compute with these memory devices, we often do not reach the exact solution we want. That is fine in machine learning, because there are many problems where you can tolerate quite a bit of imprecision — there is no real golden answer for some machine-learning problems. It is rather like this drunken man who wants to get to his destination, ends up at an approximate solution, and is happy with it. That is one class of problems. But of course, someone could say: no, he has to get to the precise solution. What if we need arbitrarily
high precision — what do we do? What you can do is give him a very accurate GPS, which at any point in time tells him how far he is from the exact solution. If he has this GPS in hand, he can afford to take some imprecise, approximate steps, but he keeps recalibrating himself and eventually reaches the final solution. This is the idea that leads to the concept of mixed-precision computing. Within computing itself, many computational tasks can be formulated as a sequence of low-precision and high-precision components: when you want to solve a computational task, you split it in two — in the first step an approximate solution is obtained, and that step typically carries the higher computational load; in the second step, the resulting error in the overall objective is calculated precisely and accurately; then you keep iterating. In that way you can reach a very accurate solution in the end.

That leads us to mixed-precision in-memory computing: you always use your low-precision computational memory in conjunction with a high-precision, standard von Neumann machine. You try to perform the bulk of the computation in the computational memory unit, but the error-determination, or error-correction, step is performed in the precise von Neumann machine. If you do this the right way, in many problems you can still gain a lot in area, power and speed efficiency, even though a von Neumann machine is used in conjunction with the computational memory unit. It is a very powerful concept and a very good way of dealing with the lack of precision associated with computational memory. I'll show you a couple of examples to prove it to you.

One of them is solving systems of linear equations: given Ax = b, find x. One way to go about it is to first find an initial solution x and then make approximate corrections to it: x becomes x + z, and you find the correction z by solving Az = r, but this time only approximately — typically with a Krylov subspace method such as conjugate gradient or GMRES. The catch is that the residual r has to be calculated with very high precision. If you look at the inner solver, it contains a whole series of matrix-vector multiplications, and all of that you offload to the computational memory, because the A matrix is stored there and used over and over again — we are not reprogramming it. You only have to program it once for a given Ax = b problem; if the right-hand side b changes, it doesn't matter, you can still use the same array. All the multiplications are performed there, and the algorithm runs until the residual falls below a tolerance that you set. That is how you solve the problem.
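Here is a sketch of that mixed-precision scheme for Ax = b in the form just described: the inner correction is obtained with a few conjugate-gradient steps in which every matrix-vector product is deliberately corrupted by multiplicative noise (standing in for the analog crossbar), while the residual is computed exactly in float64. The noise level, the inner iteration count, and the synthetic symmetric positive-definite test matrix are assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_matvec(A, v, noise=0.02):
    """Low-precision matrix-vector product, standing in for the PCM crossbar."""
    out = A @ v
    return out * (1.0 + noise * rng.standard_normal(out.shape))

def cg_inexact(A, r, iters=10):
    """A few conjugate-gradient steps where every product with A is inexact."""
    z = np.zeros_like(r)
    res = r - noisy_matvec(A, z)
    p = res.copy()
    for _ in range(iters):
        Ap = noisy_matvec(A, p)
        alpha = (res @ res) / (p @ Ap)
        z = z + alpha * p
        new_res = res - alpha * Ap
        beta = (new_res @ new_res) / (res @ res)
        p, res = new_res + beta * p, new_res
    return z

def mixed_precision_solve(A, b, tol=1e-10, max_outer=50):
    """Iterative refinement: inexact inner solve + exact (float64) residual."""
    x = np.zeros_like(b)
    for _ in range(max_outer):
        r = b - A @ x                       # high-precision residual (the "GPS")
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x = x + cg_inexact(A, r)            # approximate correction from the "crossbar"
    return x

# Small symmetric positive-definite test problem (an assumption, needed for CG).
n = 64
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print("final relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The residual keeps shrinking even though every individual product is only a few-percent accurate, which is exactly the property exploited in the experiments: the achievable accuracy is set by the high-precision unit, not by the analog multiplications.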
These are again experimental results on PCM devices. This time we performed the experiments on so-called model covariance matrices — a special type of matrix with the decaying behavior shown here, simulating the decreasing correlation between features as you move away from the diagonal. They occur very often in data-analytics problems. All the matrix multiplications were performed using PCM devices fabricated in the 90-nanometer technology node, and this is the norm of the error between the exact solution and the estimated solution, plotted against the number of iterations — again an iterative scheme. You can see that for all these matrix dimensions the error keeps dropping, and it is not limited by the precision of the computational memory unit; it is only limited by the precision of the high-precision unit. In fact, in one experiment we ran it all the way until it reached the machine precision of the von Neumann machine — that is a property of the algorithm. So in this way we can beat the problem of imprecise multiplications and still obtain a very accurate solution.

I don't have to convince you that solving systems of linear equations is a very useful thing to do, and it appears a lot in problems associated with data analytics. One such problem is finding gene interaction networks from RNA expression measurements for cancerous and normal tissues: you take the covariance matrix and compute the inverse covariance matrix, and computing the inverse covariance matrix involves solving a whole series of linear equations — all of which was done with the PCM array. We programmed the covariance matrix into the PCM array, solved the linear systems for the different right-hand sides, and obtained these gene interaction networks from the experiments. You can see that while some gene interactions are preserved, there is a big difference between the connectivity graphs linked to cancerous tissue and those linked to normal tissue. This is just an illustration of how you could use this kind of approach in a data-analytics problem.

The natural question is what we gain by going through this whole process of combining the von Neumann machine with the computational memory. We did a lot of system-level studies, using POWER8 CPUs and P100 (Pascal) GPUs as the high-precision units alongside the computational memory, and what you see is a significant improvement in time and energy to solution as you go to larger matrices, compared to CPU-only or GPU-only implementations. So we can gain a lot if we have access to this computational memory in addition to the processing units. What is also very nice is that if the in-memory computing becomes more accurate, the performance gain becomes larger, because the convergence becomes faster. So the hope is that as the memory devices mature and we start getting better devices, we can tap into this property — things get better as we go along.
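As a small usage example in the same spirit, the snippet below builds an exponentially decaying Toeplitz matrix as an assumed stand-in for the model covariance matrices mentioned above (their exact construction is not spelled out in the talk) and runs mixed-precision iterative refinement, with float32 playing the role of the low-precision unit and float64 the role of the high-precision machine.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed model covariance matrix: correlations decay away from the diagonal.
n, rho = 128, 0.9
A = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
b = rng.standard_normal(n)

# Mixed-precision refinement: the correction is computed in low precision
# (float32, standing in for the analog unit), the residual in float64.
A32 = A.astype(np.float32)
x = np.zeros(n)
for k in range(1, 31):
    r = b - A @ x                                     # high-precision residual
    if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
        break
    z = np.linalg.solve(A32, r.astype(np.float32))    # low-precision correction
    x = x + z.astype(np.float64)

print(f"outer iterations: {k}, relative residual: "
      f"{np.linalg.norm(b - A @ x) / np.linalg.norm(b):.2e}")
```

Only a handful of outer iterations are needed to reach far beyond the accuracy of the low-precision solve — the same qualitative behavior as in the PCM experiments, where the final accuracy is set by the von Neumann side.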
Another application of mixed-precision in-memory computing is training deep neural networks. As you know, deep neural networks are loosely inspired by the brain: you have multiple layers of parallel processing units, which we call neurons, interconnected by plastic synapses, and once the synapses have been tuned appropriately these networks can solve certain tasks remarkably well — some state-of-the-art networks even outperform humans on tasks such as image recognition and voice recognition, which is really interesting. But how are they trained? These networks are trained with a global supervised learning algorithm: backpropagation. You take your inputs and propagate them through the neuronal layers — the forward propagation step — in which the synaptic network performs multiply-accumulate operations; the final-layer responses are compared with the labels, the resulting error is calculated and then backpropagated — again matrix-vector multiplications, multiply-accumulate operations — so each layer gets its own error, and you change the synaptic weights so as to reduce that error. So you have forward propagation, backward propagation, and the synaptic weight update — three steps. That is the whole idea behind backpropagation, and what we do is essentially brute-force optimization: we have millions of synapses, and we keep doing this over and over, over multiple samples and multiple epochs, until everything is trained. If you look at state-of-the-art networks, it takes days to weeks to train them, even on highly sophisticated CPU/GPU clusters. A big reason for this inefficiency is, again, that the synaptic weights are stored in one place, and every time you need to compute you have to bring them over, perform the computation, and store them back.

Can we do better? Here is one idea where we apply the mixed-precision approach to deep learning. Take a neural network layer and map the synaptic weights onto a computational memory, as shown here. You perform the forward propagation with O(1) time complexity using this synaptic array, and the backward propagation by performing the transpose operation, again using the same array. The weight updates, however, have to be calculated in high precision, and you accumulate them in the high-precision unit; any time an accumulated weight update exceeds a certain value, you issue programming pulses to the devices in a completely blind manner — you just nudge them forward or backward. What you see is that this system is able to learn, to train. Let me show you an example of how it works. We took the MNIST handwritten-digit classification problem — these are the input neurons, and the outputs tell you which digit it is — and used PCM devices as the synaptic weights. This is the training accuracy as a function of the epoch number, and after the training process the network achieves 97.78 percent accuracy, which is quite respectable.
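The sketch below illustrates that mixed-precision training rule on a toy two-class problem (a stand-in for MNIST): the forward and backward passes use the coarse "device" weights, the updates are accumulated in a high-precision variable chi, and whole multiples of the device granularity eps are transferred onto the weights — an idealized, deterministic version of the blind programming pulses. The network size, the data, the granularity, and the learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy two-class dataset: two Gaussian blobs (assumed stand-in for MNIST).
n = 400
X = np.vstack([rng.normal(-1.5, 1.0, (n // 2, 2)), rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.hstack([np.zeros(n // 2), np.ones(n // 2)])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# "Device" weights live on a coarse conductance grid (granularity eps);
# the high-precision unit only accumulates the updates in chi.
eps = 1.0 / 32.0
quantize = lambda w: np.round(w / eps) * eps

W1 = quantize(0.5 * rng.standard_normal((2, 8))); chi1 = np.zeros_like(W1)
W2 = quantize(0.5 * rng.standard_normal((8, 1))); chi2 = np.zeros_like(W2)
lr = 0.05

def transfer(chi, W):
    """Move whole multiples of eps from the accumulator onto the device weights."""
    pulses = np.trunc(chi / eps)            # how many +/- programming pulses to issue
    W = quantize(W + pulses * eps)          # coarse, blind update of the devices
    return chi - pulses * eps, W

for step in range(2000):
    # Forward pass (would be crossbar matrix-vector multiplications).
    h = np.tanh(X @ W1)
    p = sigmoid(h @ W2).ravel()
    # Backward pass (would reuse the same arrays, driven in the transpose direction).
    d2 = (p - y)[:, None] / n
    g2 = h.T @ d2
    d1 = (d2 @ W2.T) * (1 - h ** 2)
    g1 = X.T @ d1
    # High-precision accumulation of the updates, coarse transfer to the devices.
    chi1 -= lr * g1; chi1, W1 = transfer(chi1, W1)
    chi2 -= lr * g2; chi2, W2 = transfer(chi2, W2)

p = sigmoid(np.tanh(X @ W1) @ W2).ravel()
print(f"training accuracy with coarse device weights: {np.mean((p > 0.5) == y):.3f}")
```

Even though each device weight only ever moves in coarse steps, the high-precision accumulator ensures that, on average, the weights follow the true gradient — which is the essence of why the scheme trains successfully in the MNIST experiment.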
Okay, so that brings me to the summary of my talk. First of all, as I said, because of the explosive growth of data-centric AI applications we have an immense computing challenge to deal with, and computational memory, I believe, is a significant step toward addressing it. It is essentially a memory unit that can perform certain computational tasks in place, and it typically comprises what we call resistive memory devices, where you store information in the resistance state of the devices. In some types of computational memory you can perform logical operations in place, and you can also do arithmetic operations; one very interesting primitive is the matrix-vector multiplication, which can be performed with order-one time complexity (a toy model of this primitive is sketched below). There is a wide range of applications, one of them being the compressed sensing and recovery that I talked about. Another fascinating aspect is computing with the device dynamics (I think a lot can be done in that space), and I showed a couple of examples where the accumulative dynamics of the devices are used to perform computation in place, one of them being finding temporal correlations in an unsupervised fashion. And clearly, mixed-precision in-memory computing is a significant step toward tackling the key problem of imprecision that arises in computational memory; there is a wide range of applications there as well, and I touched upon two very important ones: solving systems of linear equations and training deep neural networks.
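As referenced in the summary above, here is a toy numerical model of why a resistive crossbar gives a constant-time matrix-vector product: matrix entries are programmed as conductances, the input vector is applied as read voltages, Ohm's law performs every multiplication and Kirchhoff's current law sums each column's currents in one shot. This is not the actual device physics; the conductance range, number of levels, read voltage, and noise figure are invented constants for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

G_MAX = 1e-5     # assumed maximum device conductance (siemens)
V_READ = 0.2     # assumed read voltage corresponding to an input value of 1.0
LEVELS = 16      # assumed number of programmable conductance levels

def program_crossbar(A):
    """Map matrix entries onto a coarse conductance grid; signs are kept
    separately (real arrays use device pairs for positive/negative values)."""
    scale = np.abs(A).max()
    g = np.round(np.abs(A) / scale * (LEVELS - 1)) / (LEVELS - 1) * G_MAX
    return g, np.sign(A), scale

def crossbar_matvec(g, sign, scale, x, read_noise=0.02):
    """One 'analog' step: I = G * V at every crosspoint (Ohm's law) and the
    currents on each column wire sum up (Kirchhoff's law)."""
    v = x * V_READ
    g_eff = g * (1 + read_noise * rng.standard_normal(g.shape))
    i = (sign * g_eff).T @ v                  # column currents
    return i / (G_MAX * V_READ) * scale       # convert currents back to numbers

A = rng.standard_normal((8, 8))
x = rng.standard_normal(8)
g, sign, scale = program_crossbar(A)
print(crossbar_matvec(g, sign, scale, x))     # imprecise in-memory product
print(A.T @ x)                                # exact product, for comparison
```

Because all of the multiply-accumulate work happens physically in parallel, the latency does not grow with the matrix size; the price is the limited precision that the mixed-precision schemes above are designed to absorb.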
Let me conclude with an outlook on how I see computing systems evolving over the next several years. That is a very difficult exercise, but I will still stick my hand up and say something about it. On the conventional side, I strongly think that storage-class memory has a place going forward; it will become more and more prominent, more players will get into it as they see there is a market there, and that will improve our computing systems for the good. I also think there will be a lot more research on in-memory computing; it should gain traction now with all the requirements of AI-related workloads. As for von Neumann accelerators: general-purpose GPUs are everywhere right now, but I think there will be a big push toward ASICs that are tailor-made for specific applications such as deep learning, and you will see that in the near future. That is about it for the von Neumann side of things. On the non-von Neumann side, I think computational memory could start playing a role; that will depend heavily on how successful storage-class memory turns out to be, because the two could feed each other: the essential element is the same in both. Even further down the road, I strongly believe we will start seeing more neuromorphic co-processors, because our brain is a remarkable engine of cognition; it is, for us, the existence proof that learning can be performed with remarkable energy efficiency. So that is probably where we will all end up at some point, though likely a few years down the road. I must also say that there is a nice confluence of work in computational neuroscience and nanoscale devices which could come together to make these neuromorphic co-processors possible. And of course there is quantum computing lurking in the background, and we don't yet know how that will play out. So that, to me, is my outlook on how things will proceed in this space. I would like to thank my students, my colleagues, my collaborators, and also some very generous funding from the European Union as well as the Swiss National Science Foundation. Thank you.

[Q&A]

[Audience question inaudible] [Speaker] The write times are mostly governed by how fast the material crystallizes, so it is not really the size of the device that plays a role there; the device size mostly influences the reset current, that is, how much current you need for programming, but for speed it does not matter much. If speed is a concern, there are phase-change materials that can switch much faster, even in one nanosecond if you want; pure GeTe (germanium telluride), for instance, switches much faster than the material we used here. But there are reasons why we use this material rather than that one, because it brings something else; it is a trade-off in the end. [Audience] What does it bring? [Speaker] One reason we use this material is its lower operating current: you need less current to operate it, and since we have very small devices now, reducing the programming current is a big concern for us. But if you go to smaller dimensions, the current is lower anyway, so you could probably move to the faster material. It is a multi-dimensional optimization problem in the end.

[Audience] I have two sets of questions, a hardware question and a software question; let me ask the hardware question first. I was expecting you to say a little more about the memory architecture, because other firms (the first was one I heard of in Aurora, Colorado, and the second was Tera in Seattle, which bought Cray) implemented, in addition to parity and error detection and correction, full/empty bits and semaphore memory. You haven't said anything about how you would deal with race conditions between processors and your memory here. I'm assuming yours is mostly regular memory, but the database community (Jim Gray) convinced people of the ACID concept: atomic, consistent... [Speaker] Okay, first of all, everything we do sits before the memory controller; it is really done in the analog domain. We have the read and write circuitry integrated in the memory chip, but we are working with real analog values. I didn't say anything about the memory controller because all of that currently lives in an FPGA in the picture I showed earlier: we have a pure analog memory chip, and an FPGA wrapped around it that does all the other work. So we are still quite far
from actually working with a real memory chip; it is still an emulation of what a computational memory would look like, and that is why I could not touch on that much.

[Audience] I might be getting hung up on details, but I was intrigued by your example of using these technologies for factorization (to check whether a number is a factor, you hit the device with a train of pulses and then inspect it), yet that does not seem to scale to factoring something like 10^13 - 7. [Speaker] That was mostly intended as a toy example to show how you could do some computations inside the memory.

[Audience] My sense is that you are suggesting a combination of digital and analog techniques. [Speaker] Yes, and I should say that some of the things I talked about, like the matrix-vector multiplications I showed here, are something we can use right now. As I said, if you have a mixed-precision architecture, and you have this rather imprecise multiplier that can nonetheless do a lot of multiplications with order-one time complexity, then that is something you can incorporate into a computing system; it is not completely exotic at this point. The precision is equivalent to something like 4-to-6-bit fixed point (a small numerical illustration of this follows below), and you don't really have to view it as an analog system, because it has a digital interface: you have DACs and ADCs, with the memory sitting in between. It is a bit imprecise, but that's what it is; you can take it into a digital system and incorporate it. So that part is not so exotic. Some of the things I said about computing with the device dynamics are more exotic; there we need more research, and it is not ready for prime time yet, but I wanted to show that these properties can potentially be used to perform computation in place, and that is why I showed those two examples.

[Audience] A true biological neural network does not encode at full precision either. [Speaker] And you see that already with the neural-network example: if you store the synaptic weights at this precision and enforce it during the training process, that seems to be enough, which is very interesting. And I think mixed precision is a very good idea in the sense that it allows us to get arbitrarily high precision out of these faulty devices, which is extremely useful.

[Audience] At least in the beginning, will the applications have to be specialized, fixed before you manufacture the chip, or can it be used for anything? When you build it, can I program it to do multiplication, to do neural networks, and tie those capabilities together? [Speaker] You could have multiple cores in a single chip; a one-megacell array is not a big deal at all, so you can have many different cores, and that is one option. Or you could have one core that is programmable. Exactly. As I said earlier, I view computational memory as an extension of memory as opposed to a processor, so ideally it should look very similar to a memory chip; that would be the ideal scenario.
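To attach a rough number to the "4-to-6-bit fixed point" remark above, the following toy comparison pits a matrix-vector product whose stored coefficients are each off by a few percent (a crude stand-in for the analog array; the 10% figure is an arbitrary assumption, not a measured device characteristic) against products computed with the matrix quantized to a few bits.

```python
import numpy as np

rng = np.random.default_rng(3)

def fixed_point_matvec(A, x, bits):
    """Reference: quantize the matrix to signed fixed point at the given bit
    width, then multiply exactly."""
    step = np.abs(A).max() / (2 ** (bits - 1) - 1)
    return (np.round(A / step) * step) @ x

def noisy_analog_matvec(A, x, noise=0.10):
    """Crude stand-in for the in-memory multiply: each stored coefficient is
    off by a few percent (the 10% level is an illustrative assumption)."""
    return (A * (1 + noise * rng.standard_normal(A.shape))) @ x

A = rng.standard_normal((256, 256))
x = rng.standard_normal(256)
exact = A @ x

def rel_err(y):
    return np.linalg.norm(y - exact) / np.linalg.norm(exact)

print("noisy analog-like multiply:", rel_err(noisy_analog_matvec(A, x)))
for b in (4, 5, 6, 8):
    print(f"{b}-bit fixed-point multiply:", rel_err(fixed_point_matvec(A, x, b)))
```

With this particular noise level the "analog-like" error typically lands between the 4-bit and 5-bit results, which is the spirit of the equivalence quoted in the answer above.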
[Audience] I was wondering whether there are other kinds of arithmetic computation that lend themselves really well to this, something it can do really fast. For instance, it would be really interesting, or potentially problematic for cryptography, if it could do incredibly fast discrete logarithms. [Speaker] I think that could be, because if you look at the electrical characteristics of these devices they have this nice exponential field dependence, so maybe someone could exploit properties like that. It is early days, but I think it is possible to get more complicated functionality out of these devices.

[Audience question inaudible] [Speaker] We still make a lot of things together with GlobalFoundries: we design the chip, but it is not just a PCM chip. The PCM itself was not done at GlobalFoundries, because we still have fabrication facilities within IBM where we can do PCM, while the CMOS was done elsewhere.

[Audience] Now my software question, which I think is relevant for the audience here. If you went out, for instance, to the high-performance-computing world, the first thing they are going to ask about is software; Danny Hillis had to come back, tail between his legs, apologizing and saying that for the Connection Machine they finally got around to developing a Fortran compiler. [Speaker] Clearly there is a lot of work to be done on the software-stack side. Having said that, special-purpose accelerators like GPUs had all of their infrastructure built up over time, so I don't see anything substantially different in this case. It is an interesting effort to look at the problems and figure out how to break them down into the parts that can be done in the computational memory as opposed to elsewhere; that is also something we are looking into, but I don't have a full answer yet. The obvious first target is the matrix-vector multiply: that would be a very efficient thing to realize, so I would say it comes first.

[Audience] One small question: in the cases where you did this fitting, is there one metric for that class of problem that tells you which method does better? [Speaker] Well, in the iterative process you basically compute a distance that you minimize. For compressed sensing there is an approach based on the L1 norm, where you exploit the sparsity, but I don't think it performs all that well in the end compared with approximate message passing, for example. Okay, sure. Thank you.

[Applause]

3 Comments

  1. DotBowder said:

    I really like these seminar videos. It's insightful to see where technology is right now! Thanks a ton for putting them up!

    I'm not sure who all is attending, but the audience seems very rude at times.

    June 26, 2019
  2. abchandle said:

    Thanks for revisiting Watson accordingly. I was actually on assignment at the IBM Thomas J Watson Research Center (Yorktown Heights) the date of that Jeopardy show's airing. IBM had a gigantic pizza party for all to watch together, but surprisingly the turnout was extremely low… Maybe it was actually the perfect metaphor as a few years later to date approximately 90% of US operations (blue badgers) are gone from technological reasons to the morphing of globalization. Good luck in Zurich and cheers.

    June 26, 2019
  3. Erwan Ounn said:

    The sound is too low 🙁

    June 26, 2019
