Computing with Uncertainty



Each year Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors, and leading academics, and makes videos of these lectures freely available.

Well, good morning, and let me add my welcome to MSR Cambridge. We're now some 40 or 50 years into the digital revolution, and it's been driven by one extraordinary fact: the number of transistors you can fit onto a piece of silicon doubles about every couple of years. That's called Moore's law, it's at the heart of the digital revolution, and it's fundamentally different from any other technology in human history because it grows exponentially. Everything else grows linearly: 100, 200, 300, 400. This technology grows exponentially: 100, 200, 400, 800, and so on. That makes it fundamentally very different. And it's not just the IT industry that gets to benefit from exponential growth; it's touched pretty much every aspect of our lives. Some aspects have been utterly transformed; others, education for example, are I think on the brink of a major transformation.

Today we're seeing, for the second time in human history, another kind of exponential growth: exponential growth in the quantity of data in the world. It seems to be doubling about every two or three years, and it's interesting because data contains information, and that information has value, to businesses and to society. So computing is undergoing a transformation: it's becoming data-driven. There's a lot of hype and discussion around big data and so on, but I think it's very fundamental, and the reason I think it's fundamental is that it's an exponential growth, and that's qualitatively very different from any other kind of behavior.

So what can you do if you've got data? Well, lots of things. A good place to start is just to visualize the data; here's some data from our Computational Sciences group. Visualization gives you a flavor of the data, and maybe plotting crime statistics might already give you useful information. Always a good place to start. The next thing you can do is compute statistics of the data, compute functions of the data: averages, trends, that sort of thing. Now you're starting to get more useful information out of the data. When you run out of steam with that process, you can bring out the big guns: you can turn to machine learning. Kinect is an example of machine learning; we can think of it as a prediction engine, a box which takes (in this case) a depth image as input, and whose outputs are the labels of the body parts: left hand, right elbow, and so on. In the case of Kinect we showed it a million examples of pairs of inputs and outputs, we tuned all the parameters in the middle, and it became very good at taking a new depth image and predicting the body parts. And of course there are thousands and thousands of applications of machine learning now, in particular of these prediction engines, predicting all kinds of things. But if you really want to go the whole hog, if you really want to bring out the heavy artillery, then you turn to probabilistic modeling, and that's really the subject of this talk: machine learning on steroids, if you like.
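To make the prediction-engine idea concrete, here is a toy sketch in Python using scikit-learn. The real Kinect system was trained on vast numbers of labeled depth images; the three-number "features", the body-part labels, and all the data below are made up purely for illustration.

```python
# A toy "prediction engine": learn a mapping from inputs to labels from
# example (input, output) pairs, then predict the label for a new input.
# The features, labels and numbers here are invented for illustration only.
from sklearn.ensemble import RandomForestClassifier

# Pretend each row is a few depth-image features around one pixel,
# and the label says which body part that pixel belongs to.
X_train = [[0.2, 1.3, -0.4],
           [0.1, 1.1, -0.5],
           [2.0, -0.3, 0.7],
           [1.8, -0.2, 0.9]]
y_train = ["left hand", "left hand", "right elbow", "right elbow"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# A new, unseen input: the trained engine predicts a body-part label for it.
print(model.predict([[1.9, -0.25, 0.8]]))   # -> ['right elbow']
```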
Now, along with the data revolution, computing is undergoing a major transformation in another sense as well. For the last 50 years we've thought of computing as based on logic, on Boolean algebra. Everything is binary, a zero or a one; it's deterministic, it has a precise value, and computers work because they act in this very precise and predictable way. All that changes when we move to the world of data, because everything in the data world has uncertainty. If I want to know which movie I should recommend to a user, I may know something about their preferences, but I can't be certain which movie they're going to like best; there's uncertainty. Which word did the user write? My handwriting is so bad even I have trouble reading it, so there's a huge amount of uncertainty. Speech recognition is a notoriously difficult technical challenge, often with a lot of ambiguity about what somebody said, particularly in a noisy environment. Which web page is the user trying to find? Which link are they going to click on? What product do they want to buy? Which gesture is the user making? Uncertainty is everywhere. So uncertainty is fundamental to this next, data-driven phase of the computing revolution, and we need to be very clear about the way we manage it. In particular, we need a foundation, a principled foundation, a mathematical calculus for dealing with uncertainty.

Luckily this was all figured out 250 years ago by the Reverend Thomas Bayes and, particularly, the French mathematician Laplace, who developed a branch of mathematics called probability theory. Probability theory is that foundation. It is mathematically provable to be the unique consistent calculus of uncertainty: if you want to describe uncertainty with numbers, you're either using probability theory, or something equivalent to it, or what you're doing is inconsistent or suboptimal. So we use probability theory. Now, this is not just a little tweak ("hey, add a bit of probability theory and off we go"); it's very fundamental and very different, and I'm going to use one example to illustrate why. Think about the numbers one, two, three, four. I can put them in order such that two is bigger than one, three is bigger than two, and four is bigger than three; they're what we call transitive. I can't put them in a circle so that each one is bigger than the previous one, obviously. That changes when you go to random numbers. Think of these dice. Imagine they're unbiased dice, equally likely to land on any one of their six sides, but they have unusual choices of numbers on them, and those special choices of numbers make them non-transitive. So if I hadn't told you this, I could win money off you in the following way. We play a game: I show you these dice, I convince you that they're not loaded, that they really are equally likely to land on any of their sides (you can inspect them, you can look at all the numbers), and then you pick one of them. Let's say you pick the red die; then I'll pick the yellow die, and we'll roll them, and whoever gets the higher number wins. We'll play an odd number of games, nine or fifteen or twenty-one or something, and whoever wins the biggest number of times wins the bet. Well, the yellow die will beat the red die two-thirds of the time, so if we play nine or eleven rolls I'm almost sure to win money off you. You get a bit fed up with losing money, so you go for the yellow die. That's great: I'll pick the purple die, and I'm going to win two-thirds of the time. Whichever one you pick, I can pick another that's going to beat it two-thirds of the time. Absolutely bizarre and weird.
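If you want to check this for yourself, here is a short Python calculation using one classic non-transitive set, Efron's dice, in which each die beats the next one around the cycle exactly two-thirds of the time. The numbers on the dice handed out at the talk may well be different; this particular set is just an assumption to make the point runnable.

```python
from itertools import product

# Efron's dice: a classic non-transitive set. A beats B, B beats C,
# C beats D, and D beats A, each with probability 2/3.
dice = {
    "A": [4, 4, 4, 4, 0, 0],
    "B": [3, 3, 3, 3, 3, 3],
    "C": [6, 6, 2, 2, 2, 2],
    "D": [5, 5, 5, 1, 1, 1],
}

def p_beats(x, y):
    """Exact probability that die x shows a higher number than die y."""
    wins = sum(a > b for a, b in product(dice[x], dice[y]))
    return wins / 36  # 6 x 6 equally likely face pairs

for x, y in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]:
    print(f"P({x} beats {y}) = {p_beats(x, y):.3f}")   # each one prints 0.667
```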
And if you don't believe me, we're going to be giving out little tins of these non-transitive dice, so for your friends who didn't come to Microsoft Research, who weren't carefully selected to participate in today's event, you can take them to the cleaners. The only thing is, we're only handing those out at the very end of the event, so you do have to stay for the last talk.

Okay, so that's probability theory in the abstract, and I want to show you a little demonstration of how we can use probability theory in a real application. So I'm just going to switch to this demonstration, which is going to learn about my movie preferences and make recommendations. This is actually a toy demonstration put together just for talks like this, but the technology that drives it is used in real applications. It's a toy demonstration in the sense that we've deliberately stripped out all information about the type of movie. In practice you'd know that it's a romantic comedy, you'd know the director, the actors, the length; you'd have lots of features. We're going to pretend we don't know those. You also may know something about the user, their gender, their age, and so on, and that may correlate with movie preferences; again, we're going to ignore that. What we do have is a large body of data from Netflix, which really just says that user number 53 liked movie number 5 and didn't like movie number 7. We don't know anything about the users or the movies, just which people liked which movies. The system has already been trained on hundreds of thousands of those preferences, but it knows nothing about me, so it's going to start to learn about my movie preferences.

To start with, it might just recommend popular movies. But let's say I watch one of these movies (once I've found the mouse); let's say I watch this movie and I like it. What I'll do is drag it across to the right-hand side, just to tell the system that I liked that movie. What it's doing is reordering all the other movies that I haven't seen according to the probability that it thinks I will like them. If a movie is down the right-hand side of the white region, it's very sure that I will like it; if it's down the left-hand side, it's very sure I won't like it; and in the middle it's fifty-fifty. Now, the only thing it knows about me is that I liked Pretty Woman, nothing else at all, so it hasn't got much to go on, and if you notice, most of the movies are clustered in the middle. There's a lot of white space here and here; most of the movies are in the middle. It's pretty uncertain about whether I'm going to like those movies or not.

So obviously what we need to do is give it a little more data. What I'm going to do is find another movie; let's say I watched this movie and I didn't like it, so I'll drag it across to the red region. You see what's happening: even after two data points they're starting to spread out. The movies are moving towards the right-hand region, where it's more confident that I'll like them, or the left-hand region, where it's more confident that I won't. What we're seeing is a reduction in uncertainty as a result of seeing data. And that's the modern view of machine learning: rather than just viewing machine learning as tuning parameters to predict outputs from inputs, in this broader view we think of the computer, the machine, the computational engine, learning by virtue of its uncertainty reducing as it sees new data.
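That idea of uncertainty shrinking as data arrives is easy to see in the simplest possible Bayesian model. This is not the model behind the movie demo; it's just a Beta-Bernoulli sketch with made-up ratings, where the quantity being learned is how often a user likes what they're shown.

```python
import math

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

a, b = 1, 1                # uniform prior: the "like" rate could be anything in [0, 1]
ratings = [1, 0, 1, 1]     # invented ratings: like, dislike, like, like

print("prior:            mean %.2f, sd %.2f" % beta_mean_sd(a, b))
for n, r in enumerate(ratings, start=1):
    a, b = a + r, b + (1 - r)          # Bayesian update after each rating
    print("after %d ratings:  mean %.2f, sd %.2f" % (n, *beta_mean_sd(a, b)))
# The standard deviation, i.e. the uncertainty, falls with every observation.
```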
Let me give the system a couple more examples. Let's pick a couple more movies: say I like this one, and let's say I dislike this one. You'll notice that after just four examples it's getting pretty confident. There's a bunch of movies down here it's pretty confident I'm going to like, and some it's pretty confident I'm going to dislike; there's not a lot of white space in the middle, so the probabilities have moved towards zero and one, much less uncertainty. It's learned from those four data points, and there are a few in the middle that it's just not sure about.

And I can't resist making one more little technical point, because I think this is rather nice. Suppose I take a movie here that it's very confident I'm going to like, and I drag it across and say yes, actually I do like that movie. Watch what happens when I let go of the mouse button: very little change. The reason is that it was already sure I was going to like the movie, so when I say yes, I do like it, that's very unsurprising, and there's very little information in that data. Now let me do a similar thing the other way round: I'll take this movie, which it's very confident I'm going to like, and say that in fact I didn't like it. It's a nice demo because you can un-rate movies as well; let me rate this one as a movie I didn't like. You see a huge change. The reason is that it's very surprising: the system was very confident I was going to like the movie, and I said I didn't, so there's a lot more information in that data. In fact Claude Shannon invented the field of information theory in a classic paper; he posed the three main challenges of the field and solved two of them, all in one paper, quite a landmark paper, and he defined information mathematically as the degree of surprise. So there's a key difference here between data and information: every one of those ratings was a single bit of data, but some of them carried a lot of information, because they carried a lot of surprise, and others carried very little information. We'll come back to that point later.
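Shannon's definition is easy to state: an event that had probability p carries -log2(p) bits of information. A quick calculation with made-up probabilities shows why the two confident-movie ratings above behaved so differently.

```python
import math

def surprisal_bits(p):
    """Shannon information (in bits) of observing an event that had probability p."""
    return -math.log2(p)

# Each rating is one bit of *data*, but the *information* it carries depends
# on how surprised the system is. These probabilities are invented.
print(f"expected 'like' confirmed (p=0.95):   {surprisal_bits(0.95):.2f} bits")
print(f"confident prediction wrong (p=0.05):  {surprisal_bits(0.05):.2f} bits")
```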
So that was just a simple demo, to illustrate the idea of learning in the face of uncertainty. But the engine that powers it, a system called Matchbox, is also used at the heart of real recommendation engines, in particular the recommendation engine on Xbox Live, which is serving around a hundred million recommendations a day, and you can see a demonstration of that during the breaks.

Because I'm a scientist, I can't resist going into just a little bit of technical detail about how this works, but I promise: no mathematics. I'm just going to choose a very simple example, illustrated with pictures, to give you a little flavor of what's going on under the hood, as it were. The problem I'm going to choose is ranking people in the face of uncertainty. Imagine we're all members of a chess club and we play games of chess against each other. You want to know who's the best chess player, who's the second best, and so on, and the problem is that we have uncertainty. Let's say you're a stronger player than me: that means most of the time you're going to win, but sometimes I'll win even though I'm the weaker player. If there's a big difference in strength, only very rarely will the weaker player win; if we're very closely matched, the weaker player will win quite often. But the outcome of any particular game is uncertain.

There's a standard technique for dealing with this called Elo, named after its inventor, and it's used in many, many games throughout the world. If you're a chess player you'll have a rating, and that rating will be determined by the Elo algorithm. It's a classical, non-probabilistic algorithm: it assigns a unique number to every person, and when you win or lose that number goes up or down; a single number for each player. There are lots of limitations to Elo: it doesn't apply to more than two players, and it doesn't apply to team games. Well, researchers in the lab here, back in 2005, produced what we call a Bayesian system, a probabilistic solution to the problem of ranking in the face of uncertainty. It's called TrueSkill, and TrueSkill was launched as part of the Xbox Live platform. There are something like 50 million users of Xbox Live and about 150 competitive game titles, and the outcome of every single game played on every one of those titles today is ranked using TrueSkill. It's very important, because TrueSkill is used to do matchmaking: you say, hey, I want a game of Halo or something, and in the next few seconds the server is going to try to match you up with other people around the world who have a similar skill level. So that skill level is very important to the service.

So how does this all work? I'm going to draw pictures, but these are actually the pictures that we as researchers draw on the whiteboard. They're called factor graphs, and they're the pictures we draw when we're building new models and trying to solve new applications. Here's how it works. Imagine two people playing a particular game, maybe a game of chess, or a game of something on Xbox. Instead of giving every player a particular skill level (1, 2, 3), we're going to give them a probability distribution. We use a thing called a Gaussian, that bell-shaped curve: it has a mean, a middle, but it also has a width, and that width is the uncertainty. So your skill, instead of being 120, might be 120 plus or minus 10; mine might be 105 plus or minus 20. There's a mean and there's an uncertainty associated with it. Likewise for the other player: the second player might have a different skill level, and perhaps less uncertainty because we've seen more data from them in the past. Then we consider the particular game they're going to play, and we allow for the fact that the weaker player can win by assuming each player has a performance on that particular game which is a noisy version of their skill. It's as if my skill is my average performance: sometimes I play a bit better, sometimes a bit worse. Again, that's another one of these Gaussian distributions; think of it as adding a little bit of noise for this particular game. Then, finally, we simply say that whichever player has the higher performance is the winner. That's all; that's a little model of the situation. So first of all we build our model; then we go out and collect data, and we illustrate that by shading in the node. This node is something we've observed: we've observed that player one was the winner of this game, so this node is no longer an uncertain quantity; we know its value. Then the computational process is to figure out: what does that observation imply for the skills?
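Before getting to the inference, the model itself is only a few lines if you write it as a forward simulation: sample each player's skill from its Gaussian, add Gaussian performance noise, and whoever's performance is higher wins. The noise parameter beta and the skill numbers below are my own made-up values.

```python
import random

random.seed(0)

def p_player1_wins(skill1, skill2, beta=5.0, n_games=100_000):
    """Monte Carlo estimate of P(player 1 wins) under the little factor-graph
    model: skill ~ Gaussian(mean, sd), performance = skill + Gaussian noise,
    and the higher performance wins."""
    wins = 0
    for _ in range(n_games):
        perf1 = random.gauss(*skill1) + random.gauss(0.0, beta)
        perf2 = random.gauss(*skill2) + random.gauss(0.0, beta)
        wins += perf1 > perf2
    return wins / n_games

# Player 1: 120 +/- 10, player 2: 105 +/- 20 (numbers invented for illustration).
print(p_player1_wins((120, 10), (105, 20)))   # roughly 0.74: favored, but far from certain
```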
Just as when we observed that I liked a particular movie we updated the probabilities for the other movies, here we observe that player 1 is the winner, and that is somehow going to increase our estimate of player 1's skill and decrease our estimate of player 2's skill; the uncertainty in both will also reduce a little bit. That computational problem is what we call inference; that's where all the computational grunt is, and it can be expressed mathematically in a very beautiful way in terms of messages being passed along this graph. Those messages are important for two reasons. First, it's computationally efficient: we can spread the work across many processors by having different processors work asynchronously on different messages in different parts of the graph, so it scales very well. The other beautiful thing is that it becomes very modular: these elements of the graph become things we can plug together, and I'll come back to that in a moment. So what you see here is that, as a result of those messages (in this case player two is the winner), their skill distribution shifts to the right and becomes a bit narrower, and player one, the loser, has their distribution shift to the left, but again it becomes a bit narrower, because we're a bit less uncertain about their skill.

Okay, let's compare old-fashioned Elo with TrueSkill on some real data. This is data taken from Halo, from thousands of players, but let's just look at two of them. This is a graph of the estimate of their skill level versus the number of games they've played; notice the axis here, 100, 200, 300, a lot of games. This is what we get if we analyze that data using Elo: everybody starts off with a skill of zero, and as they play games their skill evolves; this is our estimate of their skill evolving through time. Now let's analyze exactly the same data using TrueSkill. We get a dramatically different result: this converges, in terms of how much data we have to collect or how many games they have to play, more than an order of magnitude faster. Instead of having to wait for a hundred games, even after just a few games we're getting quite close to convergence, so we're getting much better estimates of their skill.

The reason has to do with uncertainty, and let me take a moment to explain it, because it's a nice illustration of the power of uncertainty. If I can pick you as an example: let's say we're chess players, and let's say I'm the stronger player and you're the weaker player. In Elo, I'm 130 and you're 120, but let's say today you're the winner. Well, the Elo algorithm would reduce my 130 a little bit, to 129, and increase your 120 to 121, and off we go and play another game. It makes these small changes because that's all it can do; it's not able to make big changes in one step, and that's sort of optimal if you don't have uncertainty. But now suppose I'm a very experienced player, I've been in the club a very long time, so in TrueSkill I'm 130 plus or minus 1: a very narrow distribution, because the system is very sure about my skill level. You're quite new to the club, so you're 120 plus or minus 20: much, much more uncertainty. Now you're the winner. Intuitively, you can see what's going to happen: if I'm 130 plus or minus 1 and you beat me, you must be about 129 at least. So in one step TrueSkill will move your skill from 120 to around 129; it makes a very big jump in one step, because it has the uncertainties in those distributions. And we don't have to code that intuition: the beauty is that we just write down the laws of probability and run computational inference, and the right thing happens automatically. That intuition just emerges, correctly and automatically.
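To see that contrast in numbers, here is a sketch of the two update rules side by side. The Elo K-factor, the performance-noise parameter beta, and the ratings are all made-up values, and the TrueSkill update is written in the simplified two-player, no-draw form given in the TrueSkill paper, not the production Xbox Live code.

```python
import math

def elo_update(r_winner, r_loser, k=10.0):
    """Classical Elo: both ratings move by a step of the same size (set by k),
    regardless of how well-established either player is."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

def trueskill_update(winner, loser, beta=5.0):
    """Single-game update for the two-player, no-draw case, following the
    moment-matching form from the TrueSkill paper. Players are (mean, sd)."""
    (mu_w, sd_w), (mu_l, sd_l) = winner, loser
    c = math.sqrt(2 * beta ** 2 + sd_w ** 2 + sd_l ** 2)
    t = (mu_w - mu_l) / c
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)    # standard normal density
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))         # standard normal CDF
    v = pdf / cdf
    w = v * (v + t)
    new_winner = (mu_w + (sd_w ** 2 / c) * v,
                  sd_w * math.sqrt(1.0 - (sd_w ** 2 / c ** 2) * w))
    new_loser = (mu_l - (sd_l ** 2 / c) * v,
                 sd_l * math.sqrt(1.0 - (sd_l ** 2 / c ** 2) * w))
    return new_winner, new_loser

# The uncertain newcomer (120 +/- 20) beats the established player (130 +/- 1).
print(elo_update(120, 130))                  # both ratings move by the same ~5 points
print(trueskill_update((120, 20), (130, 1))) # the newcomer jumps a long way and narrows;
                                             # the established player barely moves
```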
There's something else about this which is very important and very nice, and it comes back to that modularity. That was TrueSkill as the equivalent of Elo, but I said that Elo has these limitations: it doesn't handle more than two players. Well, we can handle multiple players very easily; we just extend the graph. That's the graph for three players, and it's trivial to extend the graph to this more complex situation. Likewise with teams: here are two teams, each consisting of two players. We model this just by drawing the corresponding graph, and when we run inference the correct thing happens. So far I've also assumed that a person's skill is constant; what's happening is that as they play more games we're learning more about that skill. But of course the skill itself will probably change through time: if you play a hundred games of Halo, you're probably better at Halo at the end than you were at the beginning. We can allow for that too: we can let the skill evolve through time, so that the player's skill at the new time depends on their skill at the previous time but can change a little. Once again, we just model that by drawing the graph.

Now, one of the biggest projects we have, probably the biggest project we have in the lab in the field of machine learning, is called Infer.NET, and Infer.NET is a very ambitious project to deliver what we call probabilistic programming. The idea is that for 50 years we've been programming computers using languages that treat variables as certain, deterministic quantities. If x is a Boolean variable, it's either 0 or it's 1 (it may be undefined, but once it has a state it's either 0 or 1). Now we want to deal with uncertain quantities, so x can be what we call a Bernoulli variable: it might be 0 with probability thirty percent and 1 with probability seventy percent. It's now a random variable. Again, if you think this is just a little tweak, let me point out the following. Think of a standard piece of code with a test: if x is 0, do this piece of code, else do that piece of code. What happens is you grind through, you test the value of x, and you execute either the first piece or the second piece, depending on the value of x at the time. Now suppose x is a Bernoulli variable, 0 with probability thirty percent and 1 with probability seventy percent. You don't know which, so you have to execute both pieces of code, because it could be zero and it could be one. And later on you might find out other information that informs the value of x. So it is very fundamental; it's very different from traditional programming.

Infer.NET allows us to code up things like the movie recommender, or TrueSkill, and many other models, by taking that graph and expressing it as a very compact piece of a programming language. The original TrueSkill was built before Infer.NET existed, and it took a very smart, very competent developer-researcher about six months to code up. We can now put it together in a couple of hours, because we can express that graph using literally 20 or so lines of code. Infer.NET is then a compiler: it takes that model, as we call it, that description of the probabilistic graph, and generates highly efficient, bespoke software that uses message passing to run inference efficiently at scale.
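In Infer.NET itself this would be a few lines of C#; to keep the examples in one language, here is the underlying idea sketched in Python with invented probabilities. The point is simply that a branch on a random variable means carrying both branches, weighted by their probabilities, and that later evidence can flow back to the variable.

```python
# x is a Bernoulli random variable: 1 with probability 0.7, 0 with probability 0.3.
# A probabilistic program cannot test x and run just one branch; it has to account
# for both branches, weighted by how probable they are. All numbers are invented.
p_x = 0.7

# Each branch gives a different probability for some observable outcome y:
#   if x: y ~ Bernoulli(0.9)   else: y ~ Bernoulli(0.2)
p_y_given_x = {1: 0.9, 0: 0.2}

# "Running" the program means combining the two branches:
p_y = p_x * p_y_given_x[1] + (1 - p_x) * p_y_given_x[0]
print(f"P(y = 1) = {p_y:.2f}")                      # 0.69

# Later we observe y = 1; Bayes' rule then updates our belief about which
# branch we were really in, i.e. about the value of x:
p_x_given_y = p_x * p_y_given_x[1] / p_y
print(f"P(x = 1 | y = 1) = {p_x_given_y:.2f}")      # about 0.91
```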
That code is expressed in C#, which is then compiled to machine code using, again, a standard compiler. And if you want to extend the model to teams, to multiple players and so on, you simply change the code that describes the graph and recompile, so it's a very, very flexible system. Probabilistic programming has actually become quite a hot topic: earlier this year DARPA, the US Defense Advanced Research Projects Agency (the big funding agency in the US), launched a fifty-million-dollar program for academics in the US to stimulate interest in probabilistic programming. There's a lot of work being done in this field, but as far as I know Infer.NET is the only system that focuses on doing this at scale; it's the only industrial-strength system that I'm currently aware of.

I'm going to finish with one final comment, and it's about this phrase "big data". You've already learned something interesting about data from that movie recommender, the difference between data and information, so I just want to comment on big data. When people talk about big data, we usually think they mean that they've got so much data that storing it is a headache, moving it from one place to another is a headache, computing with it is a headache; there's just so much of it. Well, let's think about what we mean by the size of a data set, because the size of a data set has two quite different meanings. There's the computational size, meaning how many bits there are, and there's the statistical size, which means how much information it contains, and these are different things. I'm going to show you two corner cases, two extremes, just to illustrate the point.

The first example is a data set which is computationally tiny but statistically large. Imagine we've got a block of material; we apply a voltage and measure the current, and we get some data, say seven data points, and our goal is to predict the current for some new value of the voltage. Our friendly local physicist has very kindly told us about something called Ohm's law, and apparently this is a linear relationship, so we're off to a really good start. We can all use our favorite technique to fit a straight line, or, in this lab, we would fit a probability distribution over possible straight lines (that's how we do things in this lab), and that allows us to make a prediction, and again it's a prediction with uncertainty. The point is that two points define a straight line, so seven slightly noisy points define a straight line pretty well. Although there's some uncertainty in that line, there's not very much, so the predictions are very precise; they're really tightly grouped. Now, that's a data set consisting of seven pairs of floating-point numbers, so computationally it's a trivial data set. But statistically it is big data, a large data set. By that I mean that if I give you another million measurements of voltage and current, your uncertainty will reduce, but not by much, because it's already pretty small. After a million data points it's getting close to zero, but it was already very small. So that data set is already statistically a large data set.
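Here is that example as a small calculation: a Bayesian fit of a line through the origin, I = G*V, with Gaussian noise. The conductance, the noise level, the prior width and the voltages are all invented; the point is only how little the slope's uncertainty changes when you pile on a million extra measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
true_G, noise_sd = 0.5, 0.05          # invented "true" conductance and noise level

def posterior_slope(V, I, noise_sd, prior_sd=10.0):
    """Posterior mean and sd of G in the model I = G*V + Gaussian noise,
    with a broad Gaussian prior on G (conjugate, so the result is exact)."""
    precision = 1.0 / prior_sd**2 + np.sum(V**2) / noise_sd**2
    mean = (np.sum(V * I) / noise_sd**2) / precision
    return mean, np.sqrt(1.0 / precision)

# Seven measurements: computationally tiny, statistically already very informative.
V7 = np.linspace(1.0, 7.0, 7)
I7 = true_G * V7 + rng.normal(0.0, noise_sd, size=V7.shape)
print("7 points:   G = %.4f +/- %.4f" % posterior_slope(V7, I7, noise_sd))

# A million more measurements: vastly more bits, but the uncertainty was
# already tiny, so the predictions barely change.
Vm = rng.uniform(1.0, 7.0, size=1_000_000)
Im = true_G * Vm + rng.normal(0.0, noise_sd, size=Vm.shape)
print("1M points:  G = %.4f +/- %.4f" % posterior_slope(np.concatenate([V7, Vm]),
                                                        np.concatenate([I7, Im]),
                                                        noise_sd))
```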
To give you the other corner case, let's think of an example of a data set which is computationally large but statistically small: the problem of recognizing objects. Consider images of objects, buses and cars and planes and that sort of thing. Each image might have millions of pixels, and I might give you millions of examples of each class, millions of examples of aeroplanes, millions of examples of bicycles, where each image contains a bicycle or an aeroplane. That's starting to be a computationally large data set: millions of images times millions of pixels is a lot of data. But is it statistically large? Let's think about this. What is an aeroplane? The aeroplane could be in different positions, horizontally and vertically, in the image; it could be at different orientations in 3D space; it could be at different distances from the camera, making it different sizes in the image; it could be painted different colors; it could be illuminated with different lighting; parts of it could be occluded, with other objects hiding part of the aeroplane; and so on. They're all pictures of an aeroplane. In fact, the number of possible images I could display that we would all agree constitute a picture of an aeroplane is vast, vast compared to the number of atoms in the universe, and a million images is tiny compared to that space of possible images of aeroplanes. So if I give you a million pictures of aeroplanes, that is incredibly sparse within the space of possible images of aeroplanes. It's the opposite of the previous example: it doesn't fill that space and constrain it; far from it, it's very sparse. So that's a data set which is statistically small but computationally large.
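A rough back-of-the-envelope count shows just how vast that space is. Assume a one-megapixel greyscale image with 256 intensity levels per pixel; the choice is arbitrary, and the scale of the answer is the only point.

```python
import math

pixels, levels = 1_000_000, 256
log10_images = pixels * math.log10(levels)   # number of decimal digits in 256**1_000_000

print(f"possible 1-megapixel images: about 10^{log10_images:,.0f}")
print("atoms in the observable universe: about 10^80")
print("a training set of a million images: 10^6, vanishingly sparse in that space")
```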
I should just add, in fairness to my colleagues from the computer vision group (they'd be livid if they were sat in the audience at this point), that we don't naively treat an image of an aeroplane as just some long vector feeding into a classifier. We have sophisticated computer vision techniques that recognize that it is an image, that nearby pixels are related, and so on, so you can actually build good object recognizers for things like aeroplanes using far fewer than a million images. I just chose that as an illustration. With that, I think I will stop, but I'm happy to take questions; we have a minute or two.

[Moderator] Okay, time for questions. Stunned silence; remember, the dice are only handed out at the end.

[Audience question] Many films have almost zero value, in the sense that maybe it's a blockbuster that everyone under the sun has watched; you gave the example of Pretty Woman getting a huge thumbs-up. Does your approach take into account those outliers? The more arthouse films, watched by a much smaller audience, the rare ones, are likely to carry much more information than the blockbusters.

Right, you've hit on two really good points, so I'll give you two answers, because you've touched on two very interesting things. The first is this: let's say we're doing recommendation, and you've just bought a weather-prediction app for your phone. The last thing you want to be shown is twenty adverts for weather-prediction apps for your phone; you've got one, you don't want more, even though you obviously liked it, because you just bought it, and you obviously like weather-prediction apps. You actually only want one of them. So there's something to do with diversity here. If I discover you like action adventure, and I just show you action adventure for the rest of your life, you'll never get to see those nice romantic comedies and so on; there's got to be a little bit of diversity in there as well. What I've been talking about here is what we call the inference problem: working out the probability that you will like each movie. Afterwards there's a decision step, which is generally much simpler but is very important, where we actually take those probabilities and try to optimize something, in this case maybe the number of movies you buy, or your satisfaction with the movies you watch, or whatever it is we're trying to tune. We use that uncertainty to achieve some goal, and that, separately, is the decision step; it might involve showing you lots of action adventures but occasionally putting in some other movies, documentaries and so on, just to give you a little bit of variety.

That touches on the second point, which is what we call the exploration-exploitation trade-off. It goes like this. I show you a couple of movies; they happen to be action adventures; you like them, so now I know you like action adventures. I've got a dilemma, because I think you're going to carry on buying action-adventure movies from me, so if I want to make lots of money in the next five minutes, I'll show you an action adventure. But maybe it turns out you like romantic comedies as well, and I could make a whole lot more money selling you those, except I don't know that, because I haven't shown you any. So I've got to make a trade-off: occasionally show you other stuff, so I can carry on learning, while at the same time delivering the things you want now. I have to trade off delivering what you want in the short term against delivering what you want over the long term, and how you make that trade-off is not part of the inference step; it's part of the decision step, and it depends on what you're trying to optimize, whether you're playing for the long term or the short term. So that's quite fundamental. And part of the solution is to recognize that every time you take an action, or don't take an action, that's data, and something can be learned from it. There's information everywhere, even in the things you don't do, so you have to think of data as a rich source of information. Generally a good policy is: don't throw data away. If you're allowed to keep the data, who knows what's in it? Keeping data is good, which is one of the reasons it's growing exponentially, of course. Don't stop that.

[Moderator] Okay, right, well, thanks. Thanks, Chris. Thank you.
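As a footnote to that last answer: the simplest illustration of the exploration-exploitation trade-off is an epsilon-greedy strategy, which exploits the genre you currently believe is best but keeps showing something else a small fraction of the time so you never stop learning. The genres, purchase probabilities and epsilon below are invented, and real systems use more sophisticated policies than this.

```python
import random

random.seed(1)

# Two genres with unknown probabilities that this user buys what we recommend.
# We only learn about a genre by actually showing it.
true_buy_prob = {"action": 0.30, "romcom": 0.45}   # hidden from the system
shown = {g: 0 for g in true_buy_prob}
bought = {g: 0 for g in true_buy_prob}

def estimated_rate(genre):
    return bought[genre] / shown[genre] if shown[genre] else 0.0

epsilon = 0.1                      # fraction of recommendations spent exploring
for _ in range(5000):
    if random.random() < epsilon:
        genre = random.choice(list(true_buy_prob))        # explore: try anything
    else:
        genre = max(true_buy_prob, key=estimated_rate)    # exploit current belief
    shown[genre] += 1
    bought[genre] += random.random() < true_buy_prob[genre]

print({g: round(estimated_rate(g), 3) for g in true_buy_prob})
print(shown)   # with no exploration at all, "romcom" would never have been discovered
```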
