Research in Focus: New Research Opportunities in Data Analytics



[Applause]

Gretchen Huizinga: Hello, and welcome to Research in Focus. I'm Gretchen Huizinga. Today we're going to talk about data analytics in the cloud, and I'm talking with Distinguished Scientist Surajit Chaudhuri and Researcher Bailu Ding, both in the DMX group of Microsoft Research, which works on data management and exploration. Thanks for coming today, both of you. So let's start with a level set: give us an overview of the field of data analytics, and then, in general, what are the new research opportunities in data analytics?

Surajit Chaudhuri: OK. This has been a very active field of work, both in industry and in research. A simple way to say what data analytics does for you is this: we collect data, and data analytics is the technology that allows you to get to the insights in that data, and those insights in turn lead to a lot of business innovation. When you step back and ask what we have achieved so far: today we can deal with huge volumes of data, terabytes and petabytes, and in many cases exabytes. So we can process a lot of data well, and we can handle scale. But although we can handle that scale, it is not cheap; handling data at large scale costs money. So that's one challenge: how to reduce the cost of data analytics as we go along.

The second problem is that, while we can process the data well, getting the data into the data analytics systems is a lot of work, and that process hasn't been smoothed as much as we would like. The smoother you make that process, the greater the reach of data analytics, because you can get more data in and handle it. The second aspect of that is configuring this machinery of data analytics. It's not simple, and if you don't set it up right, it will often not perform as well as you want. That's the second challenge.

And the third challenge, which I think is very important and goes back to the point about cost, is how we can get back-of-the-envelope answers, because that really makes it easy for us to deal with large volumes of data at relatively low cost.

Gretchen Huizinga: So what specifically are you doing to get to that in the research community right now?

Surajit Chaudhuri: One idea is approximate query processing. When you ask me a question, just as in real life, I might give you an approximate answer. If you stop me and ask how long it will take me to cook this dish, or how long it will take to go from point A to point B, I might say I'll be there in 20 to 30 minutes with all the traffic. Of course, today we have maps and everything that makes that easier. If you ask how long it will take me to cook this dish, I'll say 25 minutes, half an hour. We haven't applied this idea as much to these large collections of data. If we did, we could answer such questions relatively snappily, and in effect use a lot fewer resources to answer them. But it is important that these answers are not way off, because I'm using them to decide where to focus my investigation.
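To make the cost-versus-accuracy trade-off behind approximate query processing concrete, here is a minimal sketch in Python. The orders table, the region filter, and the one percent sample rate are hypothetical; real approximate query engines maintain samples ahead of time and report principled error bounds.

```python
import random

random.seed(42)

# Hypothetical fact table: one row per order. 100k rows stand in for the
# tera/petabyte-scale data discussed above.
orders = [{"region": random.choice(["PNW", "East", "South"]),
           "amount": random.uniform(5, 500)}
          for _ in range(100_000)]

def exact_total(rows, region):
    # Full scan: touches every row; this is what gets expensive at scale.
    return sum(r["amount"] for r in rows if r["region"] == region)

def approx_total(rows, region, sample_rate=0.01):
    # Scan only a uniform sample and scale the result back up. A real AQP
    # engine would draw the sample once, not per query, and would also
    # attach a confidence interval to the estimate.
    sample = [r for r in rows if random.random() < sample_rate]
    partial = sum(r["amount"] for r in sample if r["region"] == region)
    return partial / sample_rate

exact = exact_total(orders, "PNW")
approx = approx_total(orders, "PNW")
print(f"exact={exact:,.0f}  approx={approx:,.0f}  "
      f"relative error={abs(approx - exact) / exact:.2%}")
```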
Gretchen Huizinga: Right, right. Well, I'll ask you about that in just a second. Researchers are always trying to shorten the amount of time between when data is born and when we get insights from it. So what's the felt need there in our current data analytics landscape? What problems are we trying to solve with that shortening of the time?

Surajit Chaudhuri: I think shortening that time is important because of the pace of the world today. I'll give you an example. A supermarket decides to have a sales event, or a department store is going to have a sales event. They want to launch the sales event, but they also want to get feedback based on how the first few hours went, to understand what they need to change, either about the pricing or the merchandising. They want to react quickly to the incoming data, of course in a robust way. If they can't get that data in sufficiently early, the sales event or the promotion would be over before they can do any course correction or intervene. That's an example of a case where getting the data in quickly and being able to analyze it in a timely manner is important, and it goes back to your point of shortening the gap in time between ingestion of data and analysis.

Gretchen Huizinga: Right. So, Bailu, talk to us a little bit about the idea of auto-tuning and cloud databases. How do cloud-based databases differ from regular databases, and ultimately, how do we get rid of manual tuning? Or do we need to?

Bailu Ding: OK. So databases have evolved into a very powerful but also complicated piece of software. They provide a lot of knobs for users to customize the database so that it works best for their own applications, but usually it is precisely the user who needs to figure out what the best configurations are for their application in order to get the most out of the database. Previously there has been work on how to choose, for example, the indexes for a database, but this again just provides another toolkit for the user: it requires input from the user, and the user needs to use their judgment to decide when to use the tool and how to use it. It's like how nowadays it's easier to drive a car with an automatic transmission than with a manual transmission, but you still need to pay attention to your surroundings to decide when to accelerate and when to slow down. The vision of this project is to automate the entire process of database tuning, so that the user just needs to specify the metric and the target they want; we figure out how to get there, and we actually go ahead and get there for you. In terms of the cloud, Azure provides a unique opportunity for us to learn both within a database and across multiple databases. By collecting telemetry from the cloud, we can transfer and generalize this learning to millions of databases in Azure.
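As a rough illustration of learning across databases from telemetry, the sketch below recommends configuration knobs for a new database by matching its workload profile against databases already observed. The feature names, knob values, and nearest-neighbour matching are hypothetical simplifications of what a cloud auto-tuning service would actually do.

```python
# Minimal sketch of "learning across databases" from telemetry: recommend
# a configuration for a new database by finding the most similar workload
# profile among databases we have already observed.
import math

# Telemetry: (reads_per_sec, writes_per_sec, avg_query_ms) -> knobs that
# worked well for that profile. All values are made up for illustration.
observed = [
    ((9000.0, 100.0, 4.0),  {"max_memory_mb": 4096, "create_index_on": "hot_keys"}),
    ((200.0, 5000.0, 12.0), {"max_memory_mb": 2048, "create_index_on": None}),
    ((4500.0, 4500.0, 8.0), {"max_memory_mb": 8192, "create_index_on": "join_cols"}),
]

def recommend(new_telemetry):
    def dist(a, b):
        # Euclidean distance; a real system would normalize features and
        # use a much richer model than nearest neighbour.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    profile, knobs = min(observed, key=lambda entry: dist(entry[0], new_telemetry))
    return knobs

# A new, read-heavy database inherits the knobs of the closest known profile.
print(recommend((8000.0, 300.0, 5.0)))
```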
Gretchen Huizinga: Sure, that's awesome. Let's go back to approximate answers for a second, Surajit. Why would that work in a database query? Because I understand the idea of you asking me a question and me saying, "well, I guess 20 minutes," but if I go on my computer and run a query, I'm thinking I'm going to get something exact, and that's what's been aimed for, hasn't it? Why would "close enough" be better than "for sure"?

Surajit Chaudhuri: You have to think about cost. If you don't think about cost, or latency, or how much time it takes to answer the question, approximation would not seem interesting. The challenge here is the following. As you know, we have been accumulating data at a very rapid pace, so there is a lot of data for a computer to go and look at, and one of two things has to happen. Either it has to be a super powerful computer, which often means a large cluster of machines all working very hard to get that answer out, and that costs money; if you think about how much work the computer is doing and the time you are spending, in the world of the cloud that translates directly into money, and it would be a lot of money to answer your question exactly. Alternatively, it will take a lot of time to answer, unless you bring all those resources to bear, very large clusters, which means that to get the answer I have to wait for quite a while. So either from the latency side, or from the cost side, or sometimes both, getting an exact answer in a world of big data, where a lot of data has been accumulated, is expensive. What we are trying to do is give you a mode of answering where you get a back-of-the-envelope feel for the estimate before you go and spend that money to get the exact answer. We are not at all saying the exact answer is not important, or that that world will go away. But we are trying to have a portfolio, so that I don't always have to see the exact answer and spend a lot of money. Instead, I work in a world where I sometimes ask for an estimate, and based on that estimate I may say, "huh, now let me ask the exact question, because I really want to know this accurately."

Gretchen Huizinga: Right. So is there going to be some kind of customer retraining in expectations, or expectation management, in an approximate world versus a for-sure world?

Surajit Chaudhuri: You know, the reality is that approximation already happens today. Think about what happens when I face a large data collection and I'm going to analyze that data. Often, out of the need to get an answer quickly or to spend less money, we say: just take the last three months of data, don't bother looking at older data; or just look at last week's data; or just look at the data from the Pacific Northwest for now and don't look at the other regions. So the customer base is already tuned to the notion of approximation. That said, these are ad hoc approximations: out of necessity, people are making ad hoc calls about which part of the data to look at. What we are trying to do is bring to bear the foundations of sampling, sketches, and other powerful tools that we, in the broader scientific and statistical community, have developed, sometimes over decades and sometimes in recent years, and use them to answer approximately in a principled way. That's the journey we are on. Yes, there is some adjustment to be made, but I think the idea of approximation is not new.

Gretchen Huizinga: Yeah, people kind of have that expectation.
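One classic example of such a sketch (not named in the conversation) is the Count-Min sketch, which answers frequency queries approximately from a small, fixed amount of memory. The width, depth, and hashing scheme in this minimal Python version are illustrative choices, not a production design.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in O(width * depth) memory."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed bucket per row; seeding the hash with the row index
        # gives us `depth` independent-ish hash functions.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counts, so the minimum across rows is an
        # upper-biased estimate of the true frequency.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["select", "select", "update", "select", "delete"]:
    cms.add(word)
print(cms.estimate("select"))  # approximately 3
```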
Gretchen Huizinga: OK, so database systems are made up of a lot of moving parts and pieces. How can machine learning and AI techniques make individual components smarter, and maybe make our systems more robust?

Bailu Ding: Yeah, I can start. What I'm thinking about is that, especially for the cloud services, we now have the opportunity to collect a lot of performance metrics. This is not confidential data; it's just telemetry about the execution characteristics of the databases. Once you see one database, then two, then millions of them, you start to discover patterns. So if a new customer comes to the cloud, we can leverage this knowledge and transfer our learning to their databases, so that they can enjoy these intelligent database features from the start.

Surajit Chaudhuri: Right. Just to rephrase what Bailu has just told you: traditionally we have built one piece of software that goes and gets installed in many places. This database system works on your data and on my data in a very similar way. It has a fixed set of capabilities; it doesn't really tune itself based on your characteristics. It essentially answers in a certain fixed way, no matter which data you point it at. The availability of the data that Bailu talked about, this telemetry data from the cloud, opens the door for an investigation which says: oh, I'm in Gretchen's world, and in that world the data has certain characteristics, and now I know how to optimize the functioning of this database system to suit her environment better. And if I were in Bailu's work environment, I might see very different patterns, maybe she is very bursty in how she asks questions, and I might use a very different strategy for how to optimize the resources of the system. I think, where we stand today, that opportunity is very much there, and it behooves us to explore it. Are we ready today to declare victory and say that this data has already changed the way systems are going to be built in the near term? I don't think we are there yet, and the question you raised about robustness is real. Database systems take pride in being a very dependable piece of software in the enterprise. So before we make these transitions, before a database starts basing its decisions on telemetry data, we have to be super sure that we have the guardrails to make sure it acts in a robust way, so that it doesn't do something erratic based on that data. This part, just like software engineering, needs to be understood much better: how we deal with these ML-powered systems, or specifically ML-powered database systems.

Gretchen Huizinga: Yeah. So the machine learning bit is personalizing the whole thing, but it needs more research to solidify whether this is going to be something that's going to play well in my world?

Surajit Chaudhuri: Yes, exactly, without creating unexpected surprises.

Bailu Ding: And also, just in a few words, there are two aspects of the machine learning here: one is how we generalize this knowledge to others, and the other is how we adapt this knowledge to you. We are actually working on both dimensions.
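A minimal sketch of the guardrail idea Surajit mentions might look like the following: a learned recommendation is applied only if it passes sanity checks, and is rolled back if performance regresses. The knob names, bounds, 20 percent regression threshold, and measurement hooks are all hypothetical.

```python
# Guardrails around an ML-powered tuning recommendation: reject clearly
# unsafe suggestions, and revert if the observed latency regresses.
DEFAULT_CONFIG = {"max_memory_mb": 2048, "parallelism": 4}

def within_safe_bounds(config):
    return (512 <= config["max_memory_mb"] <= 16384
            and 1 <= config["parallelism"] <= 64)

def apply_with_guardrails(recommended, measure_latency_ms, apply_config):
    """Apply `recommended` only if it is sane and does not regress latency."""
    if not within_safe_bounds(recommended):
        return DEFAULT_CONFIG                    # reject erratic suggestions outright
    baseline = measure_latency_ms()
    apply_config(recommended)
    if measure_latency_ms() > 1.2 * baseline:    # more than 20% slower: roll back
        apply_config(DEFAULT_CONFIG)
        return DEFAULT_CONFIG
    return recommended

# Hypothetical usage with fake hooks: latency regresses, so we roll back.
latencies = iter([100.0, 150.0])
chosen = apply_with_guardrails({"max_memory_mb": 8192, "parallelism": 8},
                               measure_latency_ms=lambda: next(latencies),
                               apply_config=lambda cfg: None)
print(chosen)  # -> DEFAULT_CONFIG
```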
Gretchen Huizinga: Wow, that's exciting. So let's talk about self-service data wrangling. It's interesting; it throws me back to gas stations, when self-service was the new thing. First of all, what is self-service data wrangling, and how do machine learning and program synthesis inform the research that you're doing in this area?

Surajit Chaudhuri: Yeah. When you opened our discussion, you asked what the problems in data analytics are, and one of the problems I mentioned is that the journey from having the data to getting it to a point where I'm ready to run powerful analytics on it is a long one. One reason, of course, is getting the data from the system where it is born to the system in which I'm going to analyze it; moving the data from one point to another is sometimes not so easy. The second reason is that between when the data is born and when I'm analyzing it, often the shape of the data needs to be changed. The data gets generated in a certain way, but I want to reshape it before I analyze it. If you are familiar with or work with spreadsheets, think of how we often move things around and shape the data the right way before we run some aggregate statistics or other analysis. This process is what is known as data wrangling, or data preparation, or data cleaning, and this phase is often extremely labor intensive and human intensive. The push of self-service data wrangling is to ask: how can I make this easier, without compromising on how you want to shape the data, but also without heavy-duty manual intervention where I have to be a rocket scientist to learn it? That automatically means fewer people are able to do it, which means less of the data gets analyzed. Can I put the power in the hands of a user like you, Gretchen? The idea is: give some examples. If you have the data in a certain shape, show me examples of the shape you want it to be in, for a couple of your data items. If you show that to me, what self-service data wrangling tries to do is say: I've got your examples, I know this kind of problem, and I know the space of programs that could possibly reshape the data. Based on your examples, I can narrow down what program needs to be written. Instead of you writing the program, you simply show the examples, and the program synthesis technology, with the ability of machine learning to rank the various candidate programs by how well they fit your examples, can present you with the program, and the data is reshaped automatically. This is a very powerful tool.

Gretchen Huizinga: Yeah. So is this like autofill, or Flash Fill, but only with programs?

Surajit Chaudhuri: Exactly right, that's the right way to think about it. What Flash Fill does, just to remind all of us, is essentially this: you give some examples in Excel, saying "hey, I want this column to be like this," so you're really giving the example of the reshaping for a couple of rows, maybe, and then Flash Fill automatically helps you fill in the rest. Behind it, it has a space of programs it can think about, and it generates from that space. We are taking this idea, using exactly the same example-driven paradigm for you to indicate the reshaping you want, but we hypothesize that many people have already written many powerful transforms, and they're available, because in enterprises people have written them out of necessity. What we want to do is create a search engine for such transformations. The idea is that you give these examples, and these examples allow me to search over all the transforms that we have seen, including the kinds of transforms Flash Fill is already capable of today, and pick a program I already know of that meets your requirements. This means that when you, Gretchen, write a transformation, Bailu can automatically discover and leverage it, and that, I think, is a very powerful capability.

Gretchen Huizinga: Yeah. And, you know, I just think of how much autofill and Flash Fill have done for me, speed- and productivity-wise; I don't have to think through filling in each cell and so on. I can imagine this would be really amazing for this program.

Surajit Chaudhuri: Exactly. I think, for an area like data wrangling, which has been kind of a slow-moving field and very manual, this combination of program synthesis and machine learning really has a unique opportunity to reshape the field in terms of the new innovations that will come. And we are not alone in this; I think there is a broader opportunity for the research community to do even more here.
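The example-driven idea can be illustrated with a toy sketch: enumerate a small space of candidate transformations and keep the ones consistent with the user's input/output examples. Real program-synthesis engines such as Flash Fill search a vastly larger program space and use ranking; the candidate list and examples below are purely illustrative.

```python
# Toy programming-by-example for data wrangling: the user shows how a
# couple of rows should be reshaped, and we pick a consistent program
# from a tiny, hand-written space of candidate transforms.
candidates = {
    "upper":       lambda s: s.upper(),
    "first_token": lambda s: s.split()[0],
    "last_token":  lambda s: s.split()[-1],
    "initials":    lambda s: ".".join(w[0].upper() for w in s.split()) + ".",
}

def synthesize(examples):
    """Return the names of candidate programs that fit every (input, output) pair."""
    return [name for name, fn in candidates.items()
            if all(fn(inp) == out for inp, out in examples)]

examples = [("ada lovelace", "A.L."), ("alan turing", "A.T.")]
programs = synthesize(examples)                  # -> ["initials"]
print(candidates[programs[0]]("grace hopper"))   # -> "G.H."
```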
Gretchen Huizinga: You know, that's great, and that kind of leads into my last question for you. I'll start with you, Bailu. What would be the goal or vision of this research? If, in a perfect world, you were wildly successful, what would the data analytics and the query and the database world look like?

Bailu Ding: For me, this is where my research interests and my current project come together. Right now, cloud services are valuable in the sense that they offload some of the overhead of managing data and databases from the shoulders of the users, so that an average user, who does not necessarily need to be an expert in the field, can go ahead and deploy their own database services and use them without much friction. My view of the future of this line of work is to make the cloud services even smarter, so that users don't need to worry about the underlying infrastructure or do a side job as a DBA. Instead, they can just focus on their application and business logic, and rely on us to figure out how to support their applications. So you need less expertise in house; it's like democratizing database services for average users.

Gretchen Huizinga: Yeah, or lower than average! What would you say, Surajit?

Surajit Chaudhuri: Well, I think it comes back to where we started. From the time the data is born to when we are getting the insights from the data, I want this process to have very little friction and to be inexpensive. If you think about it, today it requires a lot of manual intervention, and as data grows, the cost goes up. I want to shorten that: data is born, I can get to the insights relatively quickly, and it doesn't cost that much. Everything we talked about today leads in that direction, and that's sort of my uber dream.

Gretchen Huizinga: The uber dream, I love that. Surajit, Bailu, thank you so much for joining us today. I'm glad you came in, and I look forward to seeing where this is going to go in the future.

Surajit Chaudhuri: Thank you so much for the conversation.

Gretchen Huizinga: And thanks for joining us for this session. We'll see you next time.

[Applause]

