Ralf Gommers: The evolution of array computing in Python | PyData Amsterdam 2019

Thank you for the introduction. Yes, so I'll talk about array computing in Python, mainly as a NumPy developer, so starting from NumPy, but I'll also try to look at what else is out there besides NumPy. I'll start with a very, very short introduction. I've been contributing to NumPy and SciPy for about a decade. I've also been on the board of NumFOCUS for six years, from the start of the PyData conference series, which was partly set up to support these projects and to grow the community. And since about a month ago I'm the director of Quansight Labs, which is basically a startup company that tries to hire people who work on NumPy, SciPy, Jupyter, pandas, Dask, all the kinds of projects that you're probably using. So if you want to contribute to these projects, or you're interested in making work on these open source packages a job rather than something you try on a few evenings, or you work for a company that relies on these projects and you want to help make them sustainable and grow them, please come talk to me afterwards.

So, we'll start about 20 years ago, or 25 by now. Python is from the early 90s, and pretty soon after, Numeric got invented. Numeric was the first thing that was basically an N-dimensional array in Python. Then for about eight years not much happened; Numeric grew a little bit. Then someone wasn't happy with Numeric anymore, because it basically didn't work well for small arrays, so they invented numarray, which was faster on small arrays. And that's about when I started using NumPy. You saw this situation: there's Numeric, there's numarray, they're pretty much the same for a beginning user, so which one do I have to use? Luckily someone wrote NumPy at that point to bring things back together. And then for a decade, that's all we had. There were a few related things that were invented, like
Theano in 2008, as the first deep learning framework of this kind, I think, and then Numba to speed things up a little bit. But it's really in the last three or four years that there's been this explosion of new libraries that do different kinds of array computing. If you want distributed arrays, you're probably using Dask. If you're doing deep learning, you're using TensorFlow or PyTorch, or maybe even MXNet, or CuPy with Chainer. And then you have a new library for sparse arrays, and it kind of goes all over the place. If that keeps on continuing, we're all in for a lot of trouble. So I want to talk about what we as the NumPy team are trying to do to deal with this explosion of different libraries that don't talk to each other.

Another question: how many of you do deep learning? About a third or so, right. You've probably had the problem that you want to do some computer vision, let's say, and there's a new algorithm, like Faster R-CNN a couple of years ago, and it was built in Caffe, and all your other code is in TensorFlow. So do you switch to Caffe? Do you not use the algorithm at all? There are all these little silos that don't talk to each other. That's a problem we're trying to solve.

And here is the first example of what that can look like. If you have some code that just does something useful and it uses NumPy functions, normally you would expect to feed it NumPy arrays; you can't feed it, say, a TensorFlow array. If we time that, this little function takes 21 milliseconds. Now I make a new array with CuPy. CuPy is basically NumPy for GPUs. There are a couple of those, PyTorch is very similar, but CuPy is the most like a standalone library, about the size of
NumPy, with an almost identical API. So this array now lives on my NVIDIA GPU, and I can feed it into this same function, and all of a sudden it's six times faster. That's kind of nice, because I didn't have to change my function. But this wasn't really possible until today. I'm going to explain afterwards how this actually works and how you can start using it.

Here's another one. If you use Dask: Dask is a library that basically provides you an array, but under the hood it's a collection that can live on different machines, and it transparently distributes your data over those machines, so you can do things at a really large scale. But normally you would have to adapt all your code: you start with a small data sample, you build it with NumPy, and then you have to rewrite all your code to use Dask. What you see here is some linear algebra: you just feed in the array and you get some arrays Q and R back, a QR decomposition. If you do that with Dask, normally you would have to find the Dask equivalent that does QR decomposition. But now, in this example, you can just use the NumPy function, and in the end you still get your Dask array back.

I just want to point out that you should try this at home, or here, since it seems like we have good Wi-Fi, although this will download quite a few packages. If you just want to try this, it works the same with pip if you're a pip user, but I assume most people use conda. I just install the latest versions of NumPy, Dask and JupyterLab, and then, if you have a GPU in your computer, you want to install CuPy. The support I just showed you is only in CuPy 6, which will come out next month or so, but you can install it with this command, since the first release candidate is out. And the last important thing: in NumPy 1.16 all of this works, and that's the current version of
NumPy, but you have to set this experimental environment variable before you start. It will become the default in a future version, probably in about two months or so, but if you want to try it today you have to set this.

OK, so what do we actually want to achieve with this whole exercise? We want to take the NumPy API and separate it from the rest of NumPy, from, let's call it, the execution engine, so you can call a NumPy function and then basically route it to TensorFlow, PyTorch, CuPy or whatever. Because what these libraries have done already is basically this: they looked at NumPy and said, "Huh, that's kind of nice, but I want it on a GPU" or something, so they just copied the API and made some slight changes, and now you have a duplicate. We actually have five or six duplicates today. So we wanted to give them, and future libraries that come along, an option to say, "Hey, I don't need to copy-paste this whole API and reimplement it; I can just use this API and plug into it." And that will help, because right now we have all these array libraries, but we only have one SciPy and we have one scikit-learn. If we don't do anything, in two years we may have SciPy for TensorFlow and SciPy for PyTorch, and then everything takes five times more effort. So we basically want to go from this, where you have all these libraries that consume arrays and that basically have to use each individual array library, to this situation, where they just talk to NumPy. You can still talk to a library directly, but you can just talk to NumPy and then go to the library that you want.

All right, so how does that work? I'll start simple and talk about a single function, because most of NumPy, and most of data science, is just using functions. If you look at a function, you have a signature and a function body. The signature is just
this one line that defines your function, and it has a number of parameters and keywords. The function body is everything else. That everything else doesn't tell you what you can feed in, that's the signature, but it tells you how the function then behaves. For NumPy there can be two types of functions. The distinction usually doesn't matter to you as a user, but there are these things called ufuncs, or universal functions. Those are the ones that all look the same: they have an axis, a dtype and an out argument, and they're built with standard machinery. Everything that's not a ufunc is usually slightly different, because it doesn't fit nicely in the universal function pattern. For both of these types of functions this works now; before, it used to work only for ufuncs, because it was much easier with this standard pattern.

What we do now is we've basically separated the function signature from the function body. We copy it over, conceptually, and then we check the input arguments. If the first argument to a function is an array, then we check if it has this __array_function__ attribute, and if it does, we basically hand over execution of everything to whatever is in that __array_function__ attribute. So if you're CuPy or TensorFlow, you can just add this __array_function__ thing to your own tensor or N-dimensional array object and say, "Just use my implementation." So it allows you to replace the whole implementation of NumPy with your own.

At the moment this works for all functions and it works for ndarray methods, so if you have code that just uses those, you're basically good to go today. If you have complicated things like classes or decorators that are really NumPy-specific, those you can't yet override, so there you would still have to use the ones from the actual library you want to be executing your
code with.

All right, let's see if we can switch to a notebook. Is this readable? No? Is it readable like this? Now that's good, OK. All right, so let's import some things and just print what versions they are. I'll publish this later, by the way, so I'll put up my slides. I assume you're distributing all the slides, right? Yes, OK.

So this is the example I showed you, so let's now go through that quickly. First of all, I'll show you how to check if this actually works for a particular function. Just use the double question mark; that's the easiest way to show the source code. Here you see the signature, and the line above it says array_function_dispatch. That means it's implemented. The way this function signature copying works is with this one decorator, so if you were writing your own library, you'd want to be using that.

Now I wanted to try that on just some function that I wrote, because in the end that's what you're probably most interested in. I took a function from SciPy that I was familiar with, and that's just a plain function doing a t-test. I first just copied the thing as is, and the only line I commented out was this one, the asarray one. What that does, and you'll see it in most functions, probably not in your own code but in library code, is check if something is a NumPy array, and if it's not, convert it into one. That kind of defeats the whole purpose, right? So I commented that one out, but the rest is all the same. This is basically what you'll find in SciPy as well. It's standard code: it does some checks for NaNs, it calculates the mean, variance, square root, and then in the end it uses some distribution to calculate the actual statistic we're interested in. The one thing I inserted here was a print to
check that if we feed in a NumPy array we get a NumPy array at the end, and if we feed in a Dask array we get a Dask array at the end. Here we just create some random numbers, 100 random numbers, and split them into two columns, so we compare column A with column B. Then we do a t-test on that and we get some result. We get a p-value here; it's not really interesting what the p-value is, but you get some value out.

Now we do that same thing for Dask. Here it tells you, "Now you've got a Dask array"; that's what we printed. OK, so it kind of works, but there's one tricky bit with Dask: it looks a bit weird. If you look at the statistic here, it tells you it's a Dask array with something, blah blah, but if you read the whole thing there are no actual values. You have to remember that a Dask array is this distributed array object: it builds up a graph of what you want to do, and it only executes it when you tell it to. So at the end, to make this work, we have to do one thing that's specific to Dask and say, "Compute me this result, actually." If you just print the statistic, it just says the end result will be a Dask array, but it doesn't tell you what's in it. Now if we ask it to actually compute, then you get the values.

So this is still fairly new and it won't work for everything, but it will work for a lot of your code if you're just using regular functions and not too many magical things.

All right, so there are some limits. Here's a small example of something that does not work. For example, this numpy.errstate checks for invalid values. I tell it here: if you see an invalid, like a NaN or an inf, raise an error. If you feed in a NumPy array that has a NaN in it, because I just explicitly put one in here, I get an error
as I expect, because that's what I asked it to do. Now if you do the same for CuPy, there's also a NaN in here, but if I ask it to do something useful, it doesn't raise an error; it just gives me the NaN back. That's because this errstate is not in CuPy; it hasn't been overridden. So you still have to be a little bit careful, and at the moment we're talking about ways of making this more comprehensive. But for really special objects it's kind of hard to override them, because the idea is that we override like this: you give me an array and I check if the array has an __array_function__ attribute attached to it. In this case, np.errstate, there is no input, so there is nothing to check. We just need a different mechanism for overriding that.

The most important exceptions that don't work, and that you may be using a lot, are the array creation functions. If you have in the code that you write something that says np.array and then a list of values, that explicitly says, "I'm constructing a NumPy array." You can't check those individual values and say, "Oh, you meant a CuPy array or a PyTorch array here." So if you're creating arrays, you want to separate that out of your function, and the function itself should just have what's called your business logic.

OK, so that, I think, is the single most important change we made in NumPy over the past three years or so. For the first time ever we now have a few people that are paid to work on NumPy, because we got a grant from the Moore Foundation, or rather Moore and Sloan, so we have three engineers working on NumPy now. So there are a lot more interesting features coming, hopefully. These are, at a high level, what we're planning to do. First, interoperability; I call this __array_function__ interoperability. We can
now work better with all these GPU objects and these distributed arrays. There will be more of that, so it becomes more complete and more robust, because right now we've basically just released it, and yes, maybe you can start using it tomorrow, but there's not much code out there in other libraries that's using this.

Then extensibility. It's very hard to make custom dtypes. Everything is nice if you have the normal floats and ints and maybe datetimes, but if you want to add missing data support, or you want to have arrays with units, it gets hard. Units is a very popular one; there are something like 10 different libraries that make arrays with units, so that if you multiply meters by seconds you get meter-seconds, and, if that's not what you wanted, you get consistency checks that raise a flag. That works today, but it's extremely hard. There's an example of a thing called quaternions, which is kind of a rotation matrix. I just looked at it: it's 1,600 lines of C code to make a dtype that's a quaternion, and it should be maybe a hundred lines of code.

Then performance. We haven't done a whole lot on performance. I mean, it's fast compared to Python, but compared to Numba and some of the newer things that have come in, it's starting to look slightly slower, and more performance is always helpful; that's why we like GPUs in the first place. So there will be ufunc optimizations to basically make every function in NumPy have less overhead and make it faster. Then there are vector instructions: if you have very large arrays, the CPU has special instructions to process them in batches. We have some support for those, but it's incomplete at the moment. If you're on Linux with GCC, it's a lot faster than if you're on some exotic platform with a compiler that's not GCC. It shouldn't be too hard to make that more complete, so that's in the pipeline
as well.

Then random number generation. There's a whole rewrite that's about to be merged, so that whole module is new. It won't break your existing code, but there are a lot more options, for faster samplers and for longer quasi-periodicity.

Then indexing. There will be new indexing behavior. If you've used xarray, that has, in some cases, more intuitive indexing. If you give it, let's say, [1, 2, 3] and [1, 5, 7], it gives you, well, I don't have to name it, but basically it's more intuitive to see which elements will get selected, while fancy indexing in NumPy is often a bit counterintuitive. So we'll add a new indexing mode, basically.

And then the last one is type annotations. That's going a bit slower; it's not the highest priority, but it's nice to have for understanding your code better and for a bunch of other purposes.

OK, let's see, how much time do I have? Eight minutes, OK. So that was a brief overview of what's in NumPy, and I'll also give an overview of what might be coming in the next few years. Because it's actually interesting: almost everybody here uses NumPy at some point, but because it's so old and has such a large user base, it's kind of hard to evolve. We don't know what's going to happen in a few years. We're going to keep pushing it forward, but there are also new options.

All right. I tried this for a really long time with this slide: there's no good way to put all of these array libraries on a two-dimensional picture, because some are really big, some focus on performance, some focus on new features, like better missing data support, like R has. But the axes I chose here are maturity, and then I divided it into GPU, CPU or both, and I've put in data frames as well. Maturity is kind of interesting, because this
is one of the first things I think you should look at if you're trying to use something new. If you want to live on the bleeding edge and see what everybody else will be using in five years, you probably want to be here somewhere, and if you just want code that works, you want to be on this end. Then I put a few things in gray. uarray, for example, I'll talk about in the last slides. That's intentional: those are not really array objects, but they're kind of interfaces to them. The same goes for Arrow in terms of data frames: you've got pandas, you've got the GPU version of pandas, and Arrow is this interchange format that makes things work together better.

All right, so the first one I'll talk about is XND. That's a library that was started about two years ago with the idea to take all the concepts of NumPy and create them as a number of individual libraries, so it's all more modular, and it's very clean, new C code. There are three libraries, basically: one for an array container object, one for the actual data types, and a third one for how you make functions on top of those two things. It's still very young, so it might not be feature-complete in all places, but in terms of performance and features it's very similar to NumPy at this time. And then it has a whole bunch of things that we always wanted in NumPy but that were always too hard to build, like variable-length strings. If you work with strings, you've probably noticed that those are really annoying to work with, because there are no real variable-length strings. If you take a sentence and you break it up into words and you want to put them in an array, you have to make the data type of the array as long as the longest word that you have, and then when you encounter a longer word later, it doesn't fit and all your code breaks. XND has proper variable-length strings,
which means it's easier to work with and it fits in less memory, so you can work with bigger data. Then ragged arrays, same thing: if you have something that's not nicely square, so it sort of fits in a NumPy array or data frame but has varying lengths, that's called a ragged array, and XND has support for that. And a few more things, like easier dtypes. It has automatic multi-threading, which NumPy doesn't have, or only in part: if you install from Anaconda you do get NumPy with MKL, and things run on all the cores that you have, but if you just install NumPy with pip, you get a single core by default and you have to really try to parallelize it. So automatic multi-threading is nice: if you do a + b, it runs as fast as it can on all the hardware that you have. And then it also has JIT compilation, just like Numba, via Numba.

A brief example of that. Variable-length strings: "this is a test notebook". It tells you the type is string and that it has five elements, basically. If you do this in NumPy, it tells you, as you see down here, Unicode of length 8, because "notebook" is the longest word, so it needs eight characters. And now this "a", it's one character, but it also takes up the same space as the word "notebook". The same for ragged arrays: in this case you get a type of var, so variable lengths, basically. In NumPy, if you try this, it kind of eats it, but it turns everything into an object array, and that's basically fairly useless. It's in an array, but NumPy doesn't know what it is; it's just a reference to some object, and you can't really process it very well.

All right, the next array library is xtensor. This is a very interesting one, especially if you're a C++ developer. They took all the concepts of NumPy, like broadcasting and indexing, and put them into C++, and then
focused on performance and reusability. It's really fast, it has lazy evaluation, and there are some problems where it's the fastest thing out there. It can operate on NumPy arrays from C++ without doing any copying, and it has Python, Julia and R bindings. So if you're a company and you want to ship your algorithm and use it in Python and in R, or Python and Julia, or all three, the idea is that you write it with xtensor in C++. You get nice array semantics, so you don't have to use only the standard built-in things; things like broadcasting are really nice and didn't exist in C++ before. Then basically you write it once and you can get it into all of Python, Julia and R. It also has JIT compilation, not via Numba but via Pythran, and it's comparable in terms of performance. And I just noticed last month that R now has this project called rray, where someone thought, "OK, I'm stuck with R, and R doesn't really have good arrays." Now they have NumPy-like arrays in R, so if you're an R user, check that out. rray, yes.

OK, so I'll skip this example, but go to the xtensor website; you can use C++ interactively in the Jupyter notebook. And last, I'll leave this up: uarray is basically a generalization of the array protocols that I talked about. It's still a very young project, but it will allow you to override things like decorators and classes and whatever you want. All right, thank you very much. [Applause]
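
The dispatch idea described in the talk (check whether an argument carries an __array_function__ attribute and, if so, hand execution over to it) can be sketched in plain Python without NumPy installed. This is a toy illustration under stated assumptions, not NumPy's actual implementation: the decorator toy_array_function_dispatch, the function total and the class GpuLikeArray are hypothetical names invented for the example. Only the method name __array_function__ and its (func, types, args, kwargs) parameters follow the real protocol, and the sketch looks only at the first argument, as in the talk's simplified description.

```python
import functools


def toy_array_function_dispatch(func):
    """Toy stand-in for a dispatch decorator (hypothetical, not NumPy code)."""

    @functools.wraps(func)
    def public_api(array, *args, **kwargs):
        # Look for the override hook on the type of the first argument.
        override = getattr(type(array), "__array_function__", None)
        if override is not None:
            # Hand execution over to the array's own library.
            return override(
                array, public_api, (type(array),), (array, *args), kwargs
            )
        # No hook found: run the default implementation.
        return func(array, *args, **kwargs)

    return public_api


@toy_array_function_dispatch
def total(array):
    """Default implementation: works on plain Python sequences."""
    return sum(array)


class GpuLikeArray:
    """Hypothetical duck array standing in for a CuPy- or Dask-style object."""

    def __init__(self, values):
        self.values = list(values)

    def __array_function__(self, func, types, args, kwargs):
        # A real library would look up its own implementation of `func` here.
        if func is total:
            return ("handled by GpuLikeArray", sum(self.values))
        return NotImplemented


print(total([1, 2, 3]))                # 6
print(total(GpuLikeArray([1, 2, 3])))  # ('handled by GpuLikeArray', 6)
```

The same call, total(...), runs the default body for a plain list and the override for the duck array, which is why a function written against the public API does not need to change. It also shows why the array creation functions mentioned in the talk cannot be routed this way: a call that receives only a list of values has no argument carrying the hook.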
