Michael Franklin | AMPLab + the Berkeley Data Analytics Stack



Great, well, thanks for having me here. It's been a great visit; it's been a long time since I've been to Ann Arbor. Thanks for delaying your weekend to hear a talk. I'm going to try not to make it too technical, so hopefully it'll be more interesting, and in fact a lot of the conversations I had today were less about the technical aspects of what we did in the AMPLab and more about how the AMPLab was built and run, so I'm going to try to cover a little of both. Feel free to ask questions and interrupt me. If I have time at the end, I'll talk a little about what we're trying to do at Chicago and why I'm there, and why I hope some of you, as you're finishing your degrees and thinking about where to apply for faculty jobs, will take a look at what we're up to in Chicago.

So let me get started. The AMPLab, as Jag mentioned, played a pretty large role in the evolution of the open-source big data ecosystem. This is a chart I found online about Apache Hadoop. It goes back to 2006, when Doug Cutting and some other guy whose name I can't remember, Mike Cafarella, who is not at this talk, built this thing called Hadoop. If you extend the chart outward, you'd see a few other things. Google and some other big web companies, but in large part Google, were building a lot of their own infrastructure to handle large amounts of data. They felt the existing database systems either couldn't scale the way they wanted or would be too expensive if they tried, so out here you have Google building a bunch of infrastructure, in particular the MapReduce system for parallel processing. Going the other direction, for the grumpy old database folks like Jag and myself, there was a whole area of massively parallel database systems that pioneered a lot of the techniques used for spreading large data sets across large clusters and running queries on them. I know you had Mike Stonebraker here earlier in the week, and Mike famously raised some questions about whether the MapReduce movement was a good thing or a bad thing. I think that question has been answered by now.

Anyway, open-source Hadoop came out in 2006, and then there was this explosion of new systems being added to do all sorts of things: stream processing, graph processing, SQL query processing, ETL, you name it. Things just kept getting added to the stack. Spark showed up on Apache's consciousness around 2012, but it actually grew out of a project we started at Berkeley before the AMPLab: the RADLab, where RAD stands for Reliable, Adaptive, and Distributed computing. The goal there was to build self-managing clusters and self-managing systems. I wasn't part of it at the start; I was off doing stream query processing. The insight behind the RADLab was this: data centers are really complex, so if you want to build self-managing data centers, you're going to need machine learning to watch all the telemetry coming off the machines, build models, and make predictions, so the system can adjust what's going on in the machines. But since those systems are so complex, you probably can't do it just by taking an operating-systems graduate student and handing them an introduction-to-machine-learning book. Maybe what you do instead is take people working at the cutting edge of machine learning, put them together with people building cutting-edge computer systems, and have them try to solve the problem of self-managing systems together.

That was a pretty good idea, and around 2006 they put together a group of people from those different areas to attack the problem. One of the things that bubbled up toward the end of the RADLab was this system called Spark, and you'll notice there are a lot of connections to this talk. You'll notice this guy here took a break from some of his networking work to do Spark. Mosharaf can tell me whether this is a true story, but here's the way I heard it. We had machine learning and systems people sitting together. The machine learning people were trying to scale up some of their algorithms, so they took Hadoop, ran their algorithms on it, and the algorithms ran horribly; they were really slow. So they walked over to Matei and Mosharaf and some of the other systems people in the lab and said, hey, do you mind helping us debug our machine learning programs? We're running them on this cutting-edge parallel infrastructure and it's really slow. The systems folks looked at the code, saw what was going on, and said, you know what, it's actually not a bug. You're trying to use Hadoop for something it's really not good at. In particular, they were running iterative machine learning algorithms: you read in a data set, process it, come up with a model, and then iterate over the data to refine the model. The way Hadoop was set up, and since Cafarella didn't come to the talk I can say whatever I want, between iterations it had no recollection of what it had brought into memory or where it had put it, so it read everything in from disk again. Every time you iterate over the data, you're starting from scratch, and it goes really slowly.

So part of the motivation behind Spark was to solve that problem: build a scale-out computing engine that could be smarter about how it manages memory and could preserve that knowledge across multiple parts of the program, and in that way get rid of a lot of those inefficiencies. The reason I tell this long story is that it's part of a theme I'm going to revisit a few times. It happened because we had a group of machine learning people who felt comfortable walking, not across the hall or to a different building, but over to the next desk, and asking somebody working on computing systems to help out. That kind of close collaboration has been the hallmark of these projects at Berkeley over the years, the RADLab being one; and when the RADLab wound down, we started up this other thing called the AMPLab, which is what I'm really going to talk about for most of the rest of the talk.
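To make the problem concrete, here is a minimal sketch in Scala (Spark's native language) of the iterative pattern being described. The file path and the toy update rule are my own illustrations, not from the talk; the point is the cache() call, which keeps the parsed data resident in cluster memory across iterations instead of rereading it from disk on every pass, the way Hadoop MapReduce did.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Parse once, then ask Spark to keep the result in memory.
    // Without cache(), each iteration below would reread the file.
    val points = sc.textFile("data/points.txt")   // hypothetical input
      .map(_.split(",").map(_.toDouble))
      .cache()

    var weight = 0.0
    for (_ <- 1 to 10) {
      // Each pass refines the model over the same cached data set;
      // in Hadoop MapReduce, every pass would start from disk again.
      weight += 0.1 * points.map(p => p(0) - weight).mean()
    }
    println(s"final weight: $weight")
    sc.stop()
  }
}
```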
Toward the end of that project, some people were working on scalable machine learning, so they saw big data coming from that point of view. Meanwhile Scott and Ion and I were all off on industrial leaves doing startup companies of various types, and because of that we became aware of what was going on in industry: the huge volumes of data people were starting to collect, and the difficulty those people were having extracting any value out of that data. So when we all got back from our various leaves and Spark started up, we looked at each other and said, okay, the next big thing should be what's now called big data analytics, though at the time we were calling it scalable machine learning and things like that. That's how it happened.

The other thing I want to mention is that this explosion of innovation, this explosion of different systems, was really enabled by the open-source software movement. People all around the world could create all these different systems and get some amount of traction with them, and there are some good things and some bad things about that. There's lots of innovation; on the other hand, if you're a company trying to figure out the latest, greatest big data stack to solve your problem, you've got a real nightmare, because there are so many different systems, they have overlapping functionality, they weren't meant to work together, and the chance of any one of them solving your whole problem is basically zero. From an industrial, production point of view, all this innovation is great, but it's unwieldy and hard to work with. On the good side, open source let a thousand flowers bloom; on the bad side, it's chaotic, and industry in general doesn't like chaos. That will be another theme.

So we started working on Spark; the initial release, I think, was 2009. Very quickly we started getting a lot of interest in it, and some of the students on the project got interested in really building a community. They said, we're getting so much interest in this software that we just threw out into the world; why don't we do the extra work, and they were probably encouraged by some of the faculty here, make sure it really works, make sure people can use it, and then go out and do some evangelism and actually build a community? So we started a meetup group, if you're familiar with meetup.com, about Spark. None of us was forward-looking enough to take a snapshot of the page when we had the first meetup, so this one is from a couple of months later, the oldest I could find on the Wayback Machine. In early 2013, a little less than four years ago, we had one group in the world with a little over 500 members.

Then a bunch of strange things started happening. You start seeing articles like this one; I always show it when I give a talk at a university because it makes faculty really nervous. The O'Reilly folks did a study where they looked at job openings and salaries and ran a textual analysis to figure out which skills correlated with higher pay, and when they did it a year or so ago they found that Spark was the number one thing you could put on your resume to make more money as a data scientist. Saying you had a PhD gave you a bump, but not as much as saying you knew Spark. So, for the students in the audience: put down your books and start learning Spark. No, not really; that's just a fun, slightly crazy example. Less crazy is that Spark became super popular, and I think it still is the most active open source project in this whole big data menagerie. I looked last night at about 11 o'clock to see how many meetup groups there are now. In 2013 we had 549 members; as of yesterday we have 549 groups around the world, with over a quarter of a million people who are at least willing to sign up on a web page saying, yes, I'm willing to go to meetings with people who talk about Spark. For something that started as a graduate student research project, that's just phenomenal impact, and I want to talk a little about how that happened.

The AMPLab, as I said, was the next generation of this Berkeley shared-lab model that we'd been doing for a while. At its height it was probably about ninety to a hundred people, from systems, from machine learning, and so on. We were also able to do something I think was pretty interesting. When we started the project, we weren't sure it was the kind of thing the NSF and other government agencies would fund, so we went out to companies: Google first, then SAP, IBM, Amazon. We told them what we were thinking of doing and asked whether they'd be willing to contribute some funding to help it along, and those companies agreed, as did, at the end of the day, thirty or so others. This is a snapshot of the sponsors at a certain time. Given that head start, we were then able to apply for an NSF Expeditions award, one of those big five-year, multi-million-dollar awards they give very few of, and we got it. So the project had a really nice funding profile: about half the money came from the kinds of grants we usually get, and about half was very flexible funding from companies who were interested in the work, interested of course in meeting the students, and, as the software got more and more popular, interested in being in on the activities around the software and figuring out new directions for the project.

It ended up being a great environment in terms of resources, but also, and I'll come back to this toward the end, we would meet with these companies two or three times a year, and at every meeting all the students would get up and present their work. They would get detailed feedback from people at these companies who were interested in what they were doing. That leads to internships, it leads to joint projects, and it leads to very immediate and useful feedback on the direction of the project.
In particular, if you're a student working on, say, networking, and you say, I want to solve this networking problem in data centers, you can talk to people at these companies, and they'll tell you, well, actually, that's not really a problem in data centers, for this reason or that reason; or they might say, that is a problem, but not for the reason you think, it's because of this other thing. That type of feedback is invaluable. Now that I've moved out of the Berkeley envelope, which is good at doing these things, it's going to be interesting to see about building that kind of thing elsewhere. I know you have a lot of industrial engagement here as well; for this particular project it was just super valuable.

All right, so what did we pitch these companies? What did we actually say we were going to do? The idea was: you want to make sense of big data, so what do you have available to do that? You have three resources: Algorithms, Machines, and People, and that's where the name comes from. For algorithms, think machine learning, statistical processing, and so on. For machines, we were thinking about large clusters, data-center-scale computing, clouds, and so on. For people, our initial view was that we'd use things like Mechanical Turk and other connected systems to let people participate in the analytics. The idea was that machine learning on scalable machines gets you a certain amount of accuracy and solves a certain class of problems, but there will be other things the algorithms aren't up to yet, and that's when you bring in people. There are certain things people are good at and certain things algorithms are good at; if you build a hybrid system, you can solve much more of the analytics problem. So the pitch was: let's build a system that integrates these three very different types of resources to solve analytics problems. That's what AMP stands for.

So what did we end up doing? We built this thing called the Berkeley Data Analytics Stack. The acronym is BDAS, and it's pronounced "badass"; thank you, I'm glad you said it and not me. I draw it this way because it's a good way to think of the stack. At the bottom there's a virtualization layer called Mesos that lets you take a physical cluster and partition it into smaller clusters; like Spark, that actually grew out of the RADLab. There's a storage layer: Alluxio is a system that used to be called Tachyon, until a company with a product called Tachyon had their lawyers send us a friendly note that we needed to change the name; and Succinct is a more recent project that does compressed storage and lets you run queries on the compressed data without having to decompress it. Above that are two processing engines: Spark, which we'll look at in more detail, that extended version of MapReduce for iterative processing; and Velox, which is undergoing a name change now, focused on serving machine learning models. Spark is for building the models; once you've built a model and want to deploy it in an application, that's what Velox is for. And on top of that are all the systems built to let people do different types of analytics: a SQL engine, a streaming engine, an R interface for statistical processing, machine learning, graphs, approximate query processing, data cleaning. These are all built on top of the rest of the stack.

The way I like to explain it is that each of these white boxes is basically somebody's PhD thesis, maybe a couple of people's. This is all student-generated code and student-built systems, built in such a way that they work together, and that was a conscious effort by the students doing the work. In particular, if you're building something at the top of the stack and you can make it play well with Spark, then all of a sudden that quarter of a million people willing to go to Spark meetups, as well as the millions of other people actually using Spark, could be interested in your software. It's a really good way to get a user community very quickly.

So that's what we built, and I'm going to talk about a couple of pieces of it to give you a feel for it, starting of course with Spark. Think back to that first picture I showed, with this explosion of systems for doing all sorts of things, and why that's problematic for people who really want to do big data analytics. What happened was that everybody said, hey, I like MapReduce, I love this idea of being able to scale out, so that when I get more data I add a few more machines and some more disks and the system just grows to accommodate the new load; but MapReduce doesn't do exactly what I want, so I'm going to build a MapReduce system for graphs, or a MapReduce system for answering relational database queries. When everyone does that, you end up with the problem I was describing: all these individual systems, none of which solves your entire problem, and none of which were really meant to work together.

The insight behind Spark was to say: instead of specializing MapReduce, pulling it off in one particular direction, why not generalize it, make it support more things than it used to, and see if you can support all those different use cases in one system? That's the key idea. The thing that got Spark off the ground was that story I told you about iterative machine learning, but what really made it take off was this idea that you could support more than one way of getting at the data in the same system. So we're going to generalize MapReduce, and if you do that, you can plug all these different modalities into that one substrate. What do you have to do to generalize MapReduce? It turns out there are two things. First, you need to be able to do more than map and reduce, which increases the expressiveness of the system. Second, you need to fix that problem I talked about, where the original Hadoop MapReduce couldn't effectively manage the memory of the cluster across a complex program: you want to let different parts of a complex program share data they've already brought into memory and staged there.
If you can solve those two things, then you're not constantly moving data from disk to memory to a different machine, and remember, if you're talking about big data, the last thing you want to do with it is move it, because it's big. You'd like to keep it where it is and send the processing to it. If you can support this sharing of materialized data in memory across different parts of the program, that's how you do it. Those two changes to MapReduce are what Spark brought to the table.

So how did it do it? The original abstraction was something called Resilient Distributed Datasets, RDDs. An RDD is basically an abstraction over the cluster: you take a collection of objects and spread it across the cluster, but as a programmer you don't really have to worry about where it is. You have a collection, you know it's distributed, but you don't deal with where the different pieces actually sit.

Now, one of the things you give up when you move from Hadoop MapReduce, which stores everything on disk constantly, to an environment where you're not doing that so much, is the easy answer to what happens when a machine fails. You've got a big cluster running large jobs; the probability of some machine failing in the middle of a job is pretty high, especially if you're not running the best-managed systems in the world, so you need to tolerate a partial failure during execution. If you're not writing everything out to disk all the time, you need another way to solve that. The way RDDs help is that their interface is restricted, and it's restricted in two ways; I should probably present these in the opposite order. First, RDDs are immutable, which means you can't change them. You create an RDD, you can't go in and change one piece of it; you create it, you spread it across the machines, and that's what you've got. Second, what you can do is apply coarse-grained transformations to an RDD to create a new RDD, which again is immutable. So you can't do a point update on an RDD, but you can take one, apply a transformation or a sequence of transformations to it, and get a new RDD out the other side. It turns out that this very restricted API is rich enough to let you do all those things people were building that big collection of separate software for: graphs, queries, machine learning, streaming, all those different ways of getting at the data. That was the big aha of Spark: this simple API was rich enough to implement all of it.

So what do we mean by allowing it to do more? If you look at the transformations you can apply to RDDs, you've got map and you've got reduce, so we can do MapReduce, and anything you can do with MapReduce you can do here. Plus you've got all these other operators; you can see Jag is feeling much more comfortable now, because there are joins and group-bys, we've even got outer joins, left and right. All the things you naturally want to do to collections of data; you don't want just to map and reduce them. That's what we mean by expanding the functionality.

On the fault-tolerance side, the idea is to use this trick of logical logging. We're not writing the data out to disk after every step, so we need to keep track of what would have to be rebuilt if we lose a machine along with its memory contents. This is where the immutability comes in. You've got an RDD, you've created it, you've spread it across the machines, and you have a map of where all the pieces are. The situation is binary: either you have a given piece of the RDD, in which case you have exactly what you need, or you don't, in which case you have to figure out how to rebuild it. You never have to worry about whether a piece sitting on some machine is from before or after an update, because you can't update them.

When you put that together with the simplified interface, you end up being able to do things like this. Here's a sequence of operations on RDDs: you read in a file, you apply a filter transformation to it, and then you apply a map transformation to that. You can treat that chain as the lineage of any piece of the output RDD. I created an RDD by reading a file from HDFS; I applied a filter to it and got this RDD; then I applied a map operation and got that RDD; and that chain gives you a recipe to recreate any lost piece if you need to. That's how Spark addresses fault tolerance without having to constantly write things out to disk between operations.
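As a sketch of what that looks like in code, here is the lineage chain just described, in Scala. The HDFS path and the log format (tab-separated, error code in the first field) are made up for illustration. Each transformation produces a new immutable RDD, and Spark records the chain, so any lost partition can be recomputed by replaying textFile, then filter, then map; the last few lines also show the key-value operators that go beyond map and reduce.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

    val lines  = sc.textFile("hdfs://namenode/logs/app.log") // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))  // new RDD; 'lines' is untouched
    val fields = errors.map(_.split("\t"))          // another derived RDD

    // Beyond map and reduce: key-value operators such as reduceByKey and join.
    val countsByCode = fields.map(f => (f(0), 1)).reduceByKey(_ + _)
    val descriptions = sc.parallelize(Seq(("E404", "not found"), ("E500", "crash")))
    val joined = countsByCode.join(descriptions)    // the joins Jag was happy to see

    println(joined.take(5).mkString("\n"))
    sc.stop()
  }
}
```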
Those were the big ideas behind Spark initially. Where it's gone since is that people realized that while some programmers like the low-level abstraction of RDDs, it is fairly low-level, and in most environments people would actually prefer to write SQL, believe it or not, or something like R or pandas, some statistical language. SQL works on tables, which are sequences of rows; statistical languages work on data frames, which are basically matrices of data; if you squint, they're really not that different. So Spark incorporated, I should say, this idea of a DataFrame, where you can express, say, a join between employees and departments, the same thing you'd do in SQL, but written in a somewhat imperative-looking way that's more comfortable for some people. Other people might want to write SQL, and the system lets you do that too. Most users these days don't program directly against RDDs; they write DataFrame code or SQL or something else, and that gets translated by a compiler into operations on RDDs.

That compiler contains something called the Catalyst optimizer, and Catalyst is an interesting thing. For those of you who know about database query optimization, you typically have a bunch of rules you want to implement, things like: if a query joins one relation with another relation that has a selection on it, then no matter how the query was written, do the selection first and then the join, because that's less work in the long run; and a bunch of rules like that. Catalyst has a bunch of rules built into it, but it's been written by a handful of people, and it's competing against optimizers from companies like Oracle that people have been working on for thirty years or longer. So how do you catch up? You do it in the open source, so that when somebody sees an optimization they need that doesn't exist yet, they can write it and put it into the system. It's this extensible query optimizer combined with open source, leveraging that community to make the optimizer smarter and smarter as time goes on. It has a bunch of things modern query optimizers do, like compilation tricks to inline the function calls you'd normally make inside a query processor. It has a nice extension capability for bringing in data from HDFS, in CSV or one of these other formats, or from another database via a JDBC interface; you can keep adding these connectors. And it has support for user-defined types. The whole idea behind the optimizer is extensibility, and that's not particularly new by itself; what I think is new is the combination of that with open source, so it's not just a small cabal of people sitting in a basement in Redwood City or something; people around the whole world can add to the optimizer if they want to.
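Here is a small sketch of the employees/departments example in Spark's DataFrame API, with made-up column and table names, using the Spark 2.x SparkSession entry point. The same query is also expressed in SQL at the bottom; both routes go through Catalyst, and explain() prints the plan it produced, which is where rewrites like pushing a filter below a join happen.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("df-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val employees   = Seq(("Alice", 1), ("Bob", 2)).toDF("name", "deptId")
    val departments = Seq((1, "Eng"), (2, "Sales")).toDF("id", "dept")

    // The "imperative-looking" DataFrame version of the join.
    val joined = employees
      .join(departments, employees("deptId") === departments("id"))
      .filter($"dept" === "Eng")
      .select($"name", $"dept")

    joined.explain()  // show the plan Catalyst chose (e.g., filter pushed down)
    joined.show()

    // The same query in SQL goes through the same optimizer.
    employees.createOrReplaceTempView("employees")
    departments.createOrReplaceTempView("departments")
    spark.sql(
      """SELECT e.name, d.dept FROM employees e
        |JOIN departments d ON e.deptId = d.id
        |WHERE d.dept = 'Eng'""".stripMargin).show()

    spark.stop()
  }
}
```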
So how does it work in practice? Suppose you have an aggregation query, where you've got a bunch of data and you're computing some summary statistic on it. If you code it against RDDs using the Python interface, which a lot of people like, it takes this long; if you do it in Scala, it goes faster. If you write it in SQL, using the DataFrame package that comes with the Catalyst optimizer, the system can apply optimization tricks to make it run faster that you wouldn't necessarily know how to do if you were coding straight against RDDs. That's interesting, but not super interesting. The fun question is: what if I don't like SQL, I like R; or I don't even like R, I just want to write Python; or actually I really do like Scala; what kind of performance do I get there if I use DataFrames? And the fun answer is that you get essentially the same performance no matter which of those interfaces you use. Why? Because the DataFrame code is compiled into an internal representation, and that internal representation is run through the optimizer. As long as you arrive at the same internal representation, you end up with a very similar plan, and it runs at about the same speed. So you get a flexibility I had never seen in a database system before: the same performance regardless of which interface language you use to get at the data. That's a big step forward too.

I also want to mention briefly a couple of other techniques we used. Approximation is one. When you've got big data, even a linear algorithm takes a long time to go through all the data, so if you want to go faster than that, you have to do something sublinear. One way to be sublinear is to sample, and then run on that smaller set of data; you're still linear in what you process, but you finish much faster. BDAS does this in two places. One is a system called BlinkDB; this is Barzan's work, you might have left, oh hey, Barzan is here. Barzan worked on this. You take a sample of the data, run queries on the sample, and give the answer back with a confidence interval. It goes a lot faster, and you learn how accurate your answer is, so you can trade off accuracy, or at least confidence, against time. The other place is the SampleClean part of the system, where we take the same insight but use it for data cleaning. The idea is that if I wanted to go through all the records in my database, check each one for accuracy, and fix the bad ones, that would be super expensive; it probably involves a lot of people, and people are slow, people are expensive, people make mistakes. So instead we play the same trick: take a sample of the data and clean just that sample. Then you have two ways to go. The first is to run the query on the clean sample and get error bars; you're basically running BlinkDB, but over data you're pretty sure is clean, higher-quality data. But if you don't like that sampling error, you can run the query on the full dirty database and then use what you learned by cleaning the sample as a correction. It's like what they did with the Hubble telescope: when they found it wasn't working properly, they built a little correction lens and sent somebody up on the Space Shuttle to install it in the telescope. Same thing here: you build a little correction lens out of the cleaned sample, apply it to the answer you got over the dirty data, and get a better answer that way. So those are two places where approximation comes into the stack.
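BlinkDB has its own query interface, which I won't try to reproduce here; instead, here is a minimal sketch of the underlying idea in plain Spark Scala: answer the query on a small sample and return error bars with the result. The 1% sampling rate and the normal-approximation confidence interval are illustrative choices, not BlinkDB's actual mechanics.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ApproxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("approx-sketch").setMaster("local[*]"))
    val values = sc.parallelize(1 to 1000000).map(_.toDouble)

    // Query a 1% sample instead of the full data set.
    val sample = values.sample(withReplacement = false, fraction = 0.01)
    val stats  = sample.stats()  // count, mean, variance in one pass

    // 95% confidence interval for the mean, by the normal approximation.
    val halfWidth = 1.96 * stats.sampleStdev / math.sqrt(stats.count.toDouble)
    println(f"mean ~= ${stats.mean}%.2f +/- $halfWidth%.2f (95%% confidence)")
    sc.stop()
  }
}
```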
Let me say briefly one more thing. As I mentioned, when the RADLab started I was off, like a bunch of other database people, working on this idea of stream processing: instead of storing the data and then running queries on it, you run the query incrementally as the data comes in. We thought that was a good idea; industry wasn't quite ready for it, I don't think. But now people are getting excited about stream processing again, and there are all these systems coming out around it, things like Storm and Flink, and Google, we have to keep updating this slide, released this thing now called Apache Beam. At a somewhat lower level there are a bunch of interesting projects around reliable messaging, and you put these together to build these kinds of streaming applications.

What does the BDAS stack have to say about that? The way people were talking about doing it was something called the lambda architecture, which was developed to solve this problem. Basically the lambda architecture says: I've got systems like Storm that I can pipe data through and get instantaneous answers, and I've got systems like Spark or Hadoop MapReduce where I can store the data, run big programs on it, and get answers that way. But if I want answers that combine the latest real-time data with the answers I can get from the batch system, I have to glue those together. When data comes in, I stick it in the batch system and I also run it through the speed layer, the stream processor. I write my queries twice, once for batch and once for the streaming system, and when answers come out, I stitch them back together. That's what you have to do if you don't have a system that can do both batch and streaming, and it has all the problems you'd expect. You've got two systems that were never meant to work together; they have different semantics, they're on different revision cycles, they have different bugs. When you write your query twice, you have to convince yourself you wrote the same thing both times, and if you wrote a bug into your query, you get to debug it in both places. It's not a happy situation. A lot of people now realize that what you really want is to build one system that does both streaming and batch, and, as you can imagine, I wouldn't be talking about it if that's not what we did.

The way it was done in Spark was to say: we have this ability to take an RDD, a collection of data, and process it; what if we take a stream? And this insight is actually not due to Spark. If you look back at the streaming database literature I was talking about, people, in particular Jennifer Widom and her group at Stanford, said very early on: if you want to understand the semantics of a streaming system, think of the stream as a table that is being continuously appended to, where at any time you can take a snapshot of that table and get an answer; then you understand the semantics of a streaming query. That's essentially what's done here: you take the stream, break it into mini RDDs, and apply your Spark program to each of those little RDDs. By doing that you get this streaming, near-real-time behavior. It's not record-at-a-time, but it goes pretty fast; you can get sub-second response times in this type of system.

The advantage of doing it this way is that you can take a program you wrote for your batch system, say one that reads in a log file and calculates, for given actions, how many times they happened within a one-hour interval, over data you've stored; and if you say, I'd really like to do that over data as it's coming in, you write basically the same code, except that instead of reading a file you read a stream, and instead of writing out to a file you write out to a stream. If you look at the two programs, and you don't even have to look closely, it's the same code apart from that. So you remove this lambda-architecture problem of having two different systems.
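Here is a hedged sketch of that batch/streaming parity using Spark Streaming's DStream API. The socket source, the tab-separated "timestamp, action" log format, and the window sizes are all assumptions for illustration; the point is that the transformations in the middle are the same ones you would write against a stored log file, with only the source and sink swapped.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches

    // Batch version would start with sc.textFile("events.log") and end with
    // saveAsTextFile(...); everything in between stays the same.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .map(_.split("\t")(1))           // extract the action field
      .map(action => (action, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(3600), Seconds(60)) // hourly counts, refreshed each minute
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```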
So, to recap what I wanted to say about the stack: it's really graduate-student code, a combination of people from across a pretty wide set of subareas of computer science working together to build this thing. I talked about Spark in terms of scale-out processing with fault tolerance, and a little about how SQL and streaming work. What I didn't talk about, because I don't really have too much to say about it here, is that we also had another set of projects that had to do with trying to make machine learning easier to use. The insight, basically, was: just as you write a SQL query and the system compiles it into a plan that gets run on the machine, can we do that for machine learning? Can we state at a very high level what we'd like to predict, over what kind of data set, and have the system compile the program that's going to do that prediction for you? The thing I was going to talk about is something called KeystoneML, which is basically a pipeline system for machine learning: you take multiple machine learning operators, stitch them together, and then do the kinds of optimizations you'd want based on what we've learned from database systems. You can say, I want a linear solver here, and KeystoneML can figure out which of the six linear solvers in Spark's MLlib is the right one to use for the data set you have. Or you can say, here's the set of operations I want to do, and KeystoneML can figure out that if you order them this way you'll still get the right answer, but you can compute some intermediate result once instead of twice, keep it in memory, and have everyone access it. So it does those pipeline-level, multiple-operator optimizations.
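KeystoneML has its own operator and pipeline API, which I won't try to reproduce from memory; the toy Scala sketch below only illustrates the two ideas just described: composable pipeline stages, plus a planner that picks an implementation based on the data. The Transformer trait, chooseSolver, and the stand-in "solvers" are all hypothetical.

```scala
object PipelineSketch {
  // A hypothetical composable pipeline stage, not KeystoneML's real API.
  trait Transformer[A, B] {
    def apply(in: A): B
    def andThen[C](next: Transformer[B, C]): Transformer[A, C] = {
      val self = this
      new Transformer[A, C] { def apply(in: A): C = next(self(in)) }
    }
  }

  // Toy "planner": pick an implementation based on data shape, the way
  // KeystoneML chooses among MLlib's linear solvers. The bodies are
  // trivial stand-ins, not real solvers.
  def chooseSolver(nFeatures: Int): Transformer[Array[Double], Double] =
    if (nFeatures < 1000)
      new Transformer[Array[Double], Double] {
        def apply(x: Array[Double]): Double = x.sum            // "exact" stand-in
      }
    else
      new Transformer[Array[Double], Double] {
        def apply(x: Array[Double]): Double = x.take(100).sum  // "approximate" stand-in
      }

  def main(args: Array[String]): Unit = {
    val scale = new Transformer[Array[Double], Array[Double]] {
      def apply(x: Array[Double]): Array[Double] = x.map(_ / x.length)
    }
    val pipeline = scale andThen chooseSolver(nFeatures = 3)
    println(pipeline(Array(1.0, 2.0, 3.0)))
  }
}
```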
So that's the BDAS stack. Let me say something briefly about people. If you look at the people part of the project, we realized there were all these different ways people were going to be involved in analytics. The obvious one is that you need to serve predictions and answers to people. We had also thought, very early in the project, that we'd do these citizen-science kinds of things where we ask people to go out and find data for us; so that's another role, having people collect data for you. I talked a little about SampleClean, where you take a sample of the data, give it to the crowd, and ask the crowd to clean it; so you can actually have people do fairly sophisticated operations on the data to make it better. And then there are the data scientists writing the programs that are going to make these predictions. So part of the project was really supporting all these different ways for people to interact with the system.

We built a bunch of things, and where it ended up from a systems point of view is a system called AMPCrowd, which basically tries to take systems insights and, to the extent possible, apply them to crowd workers, so that you can take data coming from a live Spark system and integrate it with data you're getting from an engine like Mechanical Turk. I don't want to go into it too much, because I want to get to a couple of other things, but the key points are these. There had been a bunch of work in the crowdsourcing literature on how to do real-time crowdsourcing, because the problem is that if you send a job to Mechanical Turk, response time is measured in minutes or hours or days, and that's just not sufficient for a lot of analytics applications. One line of that work keeps pools of workers on retainer, so that when you ask them a question they give you an answer right away; that's sometimes called real-time crowdsourcing. We took some of those techniques and extended them in a couple of ways. One is that we took a page from the Hadoop-style notebook: when you spread a job across a bunch of machines, the time to completion is basically the time of the slowest machine, so systems like Hadoop do straggler mitigation, where if you notice a node going slowly, you speculatively fire up a duplicate task, see which one finishes first, and take that answer. You can do the same thing with the crowd; we did, and we showed it works really well: you get much faster answers and you pay only a little more than you would have otherwise. We also do things like monitoring workers to figure out which jobs they're good at and which they're not. And we incorporated a bunch of active learning, including some of the things Barzan figured out, combining that with other ways of scheduling tasks to improve quality and get faster answers. So really, all I want to say about AMPCrowd is that you can take a systems approach to crowd work; some of those systems insights carry over, and some of them don't, in some pretty surprising ways.
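As one concrete illustration of carrying a systems idea over to the crowd, here is a hypothetical sketch of straggler mitigation for crowd tasks: issue the same question to two workers and take whichever answer arrives first. askWorker is a stand-in for a real crowd-platform call, and the simulated latency is just for the demo.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

object SpeculativeCrowdSketch {
  // Stand-in for posting a task to a crowd platform; a real version would
  // hit something like a retainer pool and return that worker's answer.
  def askWorker(worker: Int, question: String): Future[String] = Future {
    Thread.sleep(Random.nextInt(3000))  // simulated, highly variable human latency
    s"worker $worker says: yes ('$question')"
  }

  def main(args: Array[String]): Unit = {
    val q = "Is this record clean?"
    // Speculatively duplicate the task, as Hadoop does for slow nodes:
    // pay for one extra worker, cut the latency tail.
    val first = Future.firstCompletedOf(Seq(askWorker(1, q), askWorker(2, q)))
    println(Await.result(first, 10.seconds))
  }
}
```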
Dave Patterson wrote an article in CACM a couple of years ago called "How to Build a Bad Research Center," and the idea was: here are a bunch of things you shouldn't do; so if you do the opposite of what he says in the article, you'll know how to build a better research center. The AMPLab followed a lot of these rules, as did the RADLab before it, and the article, if you haven't seen it, talks about the evolution of this model. One of the things you're supposed to do, and we did, is build a cross-disciplinary team: you get people who wouldn't normally work together to work together, and one way you do that is to physically put them in the same place. The AMPLab, and the RADLab before it, was really one big room; I have some pictures. A big room with a bunch of cubicles, or even open desks like you might see at Facebook or Google, an open office with a bunch of meeting rooms, never enough. Some of you in the audience spent time in the RADLab. These are just some random pictures of things that would happen in the lab; we used it to host our industry visitors; this is a really horrible, funny picture of people with cables; and, what is that, Space Invaders? Thank you, yes, we actually did have Space Invaders running. The interesting thing is that at the very beginning of the project, this area here was where the faculty sat. In the early days of the AMPLab, there was a group of, I think, six to eight of us who basically gave up our offices and moved into the lab, with all the students and postdocs and researchers also in the lab. That was how we collaborated in the very early days, and it allowed us to move really quickly. You might be wondering, given that stack I showed you with all those white boxes, each of them somebody's PhD project, how we could ever get all that stuff to work together. A big part of it is the fact that people were constantly talking to each other to make sure things fit together. Eventually a lot of the faculty said, boy, I really like having an office, and they migrated back, but at the beginning of the project it helped a lot.

I already talked about the next rule: engage industry and foster community. We did things, again started by the students, like AMP Camp, a set of tutorials we would run for industry people, for our sponsors, and really anybody else who wanted to come. We'd webcast them, so some of them had thousands of people joining in. There's a funny story about how it started. The student who wanted to start AMP Camp was Andy Konwinski. He came to me and said, I want to do a tutorial; it'll be two days, we'll invite a few hundred people. And I said, boy, that sounds like a lot of work; I don't think we should do that. He said, but listen, we're going to call it AMP Camp. And I said, okay, we've got to do it. So that's how AMP Camp started, and we ran seven or eight of them. Eventually a group in China, that's what this picture is, it says "Welcome to AMP Camp China", started running AMP Camps too. It was a way to bring people in and get them to use the software, and the students basically ran these and fostered that community. That was just one thing we did; there was a whole bunch of online material people produced as well. So that's another rule: engage the outside world. And we talked a lot about the BDAS stack, which is the rule about building real things and getting people to use them.

Finally, I'll end with this, which is the reason this talk is a retrospective. When we started the project, we said, okay, we're going to run it for five years; then a year in we got a five-year grant, so okay, we'll run it for six. But we always knew, all along, that there would be a day when we'd say: this project is now ending, or at least winding down. Students working on it, sorry, if you didn't finish in time, you have to leave; no, of course the students still working on their PhDs get to finish, but we don't start any new things on this project. Instead we figure out who wants to keep working together, bring in some new people, and think about: okay, we just spent five years doing that; what do we want to do for the next five? So instead of having the thing just go on and on and on, you put in a date at which you're going to reevaluate and try to move in a new direction, and that was really valuable too. I think with that I'm going to stop, and hopefully you have some questions.
