Cloud Computing – Big Data and Beyond

Each year Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors, and leading academics, and makes videos of these lectures freely available.

I'm going to start by looking at this largely from the end-user application side, so I'll talk about what cloud means in the world of research science. How many of you are computer science PhDs, and how many are non-computer-science science PhDs? OK, a few. Hopefully this will be of interest to everyone, because there are some pretty cool science applications, but I'll also explain how they're built under the hood using cloud computing.

So, in the scientific world -- and think about computer science as well -- back many centuries, even millennia, ago, science was very observational. People like Galileo would look up into the sky, look at the stars and the planets, and write down observations of what was happening. Newton too. Leonardo da Vinci, for instance, would sit by the river, watch the water flowing around rocks, and then think about designing helicopters and airplanes. So science was very much experimental, or observational.

Then came mathematics, and what that meant was that these observations could be instantiated in equations. Does anyone know what that equation is? It's a very famous equation -- you would die without it. It's the Navier-Stokes equations, which describe fluid flow: everything from the atmosphere to the oceans, flow around cars, flow around airplanes, air in your lungs, blood in your veins and arteries. So from da Vinci sitting watching rivers, we get to an equation that can do all of that. And sometimes you can solve such an equation: in fact, if you can show that the Navier-Stokes equations have a unique smooth solution in three dimensions, you get a million dollars today, because nobody has ever achieved it -- it has been done in two dimensions, but showing it in three dimensions remains open.

Sometimes, though, you can't solve equations analytically, and then we resort to computers to do simulations. That's how weather simulation is done for forecasting and climate modelling, and we typically do that on big supercomputers. So that's roughly where science has got to -- pretty much the state of the art. But we're moving into an interesting world. Jim Gray, one of the researchers at Microsoft -- who actually went missing a few years ago -- was very big in databases but got interested in science, and he coined the idea of the fourth paradigm of scientific discovery: we have data all around us, and we can collect it -- from sensors, from simulations, from computer networks, for instance -- and use it as the basis for hypothesising about new science. What that means is that the nature of science is changing. The first paradigm of observational science was very much a single person, like Galileo; the same was true for mathematicians, and to some extent for simulation. But this data-intensive kind of science can be much more collaborative. The Large Hadron Collider is an interesting example, where teams of thousands of scientists are working together, and that's what makes this fourth paradigm so interesting. The LHC is a great example of a huge experiment: they collide particles, see what happens, and then have to analyse the huge cascade of data that results. It's a classic example of what we call data-intensive science, producing petabytes of data.
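For reference, since the slide itself isn't reproduced in the transcript, the incompressible Navier-Stokes equations being discussed can be written as follows (this is the standard textbook form, not taken from the talk):

```latex
% Momentum balance plus mass conservation for an incompressible fluid:
% u is the velocity field, p the pressure, rho the density,
% mu the dynamic viscosity, f any external body force.
\rho\left(\frac{\partial \mathbf{u}}{\partial t}
        + (\mathbf{u}\cdot\nabla)\,\mathbf{u}\right)
  = -\nabla p + \mu\,\nabla^{2}\mathbf{u} + \mathbf{f},
\qquad
\nabla\cdot\mathbf{u} = 0
```

The million-dollar Clay Millennium Prize problem mentioned above asks whether smooth, globally defined solutions always exist in three dimensions.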
I went to a talk last week where the person in charge of this gave a presentation, and just the amount of data per second coming out of the Large Hadron Collider is amazing. That's particle physics, but climate science is the same, and astronomy is the same. There's a project called the Square Kilometre Array, being built in South Africa and Australia, which will look at the night sky in much more detail than ever before, and there they're talking about exabytes -- they say that by 2018 they'll have three exabytes of data. Those are what we call "big science" fields, but even fields like social science are in this realm: think about Twitter, Facebook, the data on phones, and how we can do science around that to understand what people do -- again, petabytes and petabytes of data.

So there's this explosion of data. How many of you have heard of the term "big data"? It's a pretty trendy term at the moment -- big data, big science, data science -- it's pretty much everywhere now. When we think about how we do science in this new world, we can think about data being in the middle: data is an interesting resource that people can tap into. If we start at twelve o'clock, the standard scientific model is linear: I design an experiment, acquire data from it, look at my results, maybe with some colleagues, do my analysis, test my hypothesis (or null hypothesis), then publish my paper and share it, and the paper goes to the library and is archived. Whereas in this world of data science, with data in the middle, somebody else -- with the Large Hadron Collider, for instance -- has already collected the data and done some pre-analysis, and I can come to it and run my own analysis without having to do the acquisition myself, because the data is already there. We're seeing this increasingly in many different fields: we want to be able to tap into existing data. It's a much more holistic way of doing science. So that's where science is heading, in terms of these different epochs, and it's pretty exciting for people who work with computers, like us, to be in that world. I'm going to talk about big data in the context of science, not so much the commercial world.

I want to show you an interesting tool we've built at Microsoft Research called WorldWide Telescope. This is a planetarium on your computer -- and not only that, it's the highest-resolution map of the night sky, all accessible through your computer. The problem was that NASA had done some sky surveys, and the individual images had to be stitched together -- literally a panorama, like you do with your camera or phone, but for the whole night sky. You can see from the numbers that there are lots and lots of images, lots of data. And what you get is a mosaic effect: as with a panorama on your camera phone, you get mismatches, because the light metering is done differently as you change angle, depending on where the Sun is. What we wanted to do was create a seamless image of the sky, so we had to figure out how to get rid of those nasty edges, do the contrast matching, and so on. It's an example of a data-processing pipeline: we went from the raw data, applied lots of different corrections, and created a pipeline using Dryad and DryadLINQ, which is a MapReduce-style DAG framework for parallel processing. Here we were using Windows HPC -- supercomputing on Windows -- to process those gigabytes of data and create what we call a tiled image. It's like Bing Maps or Google Maps, where you zoom in and out seamlessly, but for the night sky. So it's an interesting big-data problem: we started with a mosaic-type image and ended up with a very, very smooth map of the night sky.

Let me show you what that looks like -- it's very difficult to navigate with the mouse. This is WorldWide Telescope. You can see that in the original image these are all tiles; we can pan around the night sky, but we can also zoom in to the different galaxies. You can see how we overlay the constellations -- we can turn those on and off -- and we can put in different spectra, an X-ray spectrum for instance, and overlay images, for instance from the Hubble telescope. That's the famous Hubble deep-sky image showing the early universe; each one of those is a galaxy. Pretty incredible -- that's one of the Hubble pictures I always remember. So you can see how we stitch this together and overlay data on our map, with that great zooming in and out like we get on terrestrial maps, but for the night sky. It required a huge amount of processing, and we can deliver it over the web -- we're switching to a different spectrum there. Go to worldwidetelescope.org and you can try it yourself. Now, when we talk about big data, one of the interesting things is that, obviously, there's a lot of data -- it can be some big files, or lots and lots of small files -- and it gets quite difficult to handle.
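The tile-pyramid idea behind this seamless zooming can be sketched very simply. This is an illustrative sketch of the general technique used by tiled-map services, not WorldWide Telescope's actual code; the 256-pixel tile size and function names are assumptions.

```python
# Sketch of a tiled-image pyramid: the processed sky image is cut into
# fixed-size tiles at each zoom level, and each level doubles the
# resolution in both axes, so zooming in swaps a tile for its 4 children.
TILE_SIZE = 256  # pixels per tile edge (a common convention, assumed here)

def tiles_at_level(level):
    """Total number of tiles at a zoom level: (2^level)^2."""
    return (2 ** level) ** 2

def tile_for_point(x_norm, y_norm, level):
    """Map a normalised point (0..1, 0..1) on the sky image to the
    (col, row) of the tile containing it at the given zoom level."""
    n = 2 ** level
    col = min(int(x_norm * n), n - 1)
    row = min(int(y_norm * n), n - 1)
    return col, row

def children(col, row):
    """Zooming in one level replaces a tile by its four children."""
    return [(2 * col + dc, 2 * row + dr) for dr in (0, 1) for dc in (0, 1)]

print(tiles_at_level(3))         # 64 tiles at level 3
print(tile_for_point(0.6, 0.2, 2))  # (2, 0)
```

The point is that the viewer only ever fetches the handful of tiles covering the current viewport, which is what makes delivering a sky-sized image over the web practical.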
So one definition of big data is data you can't handle using conventional data systems, like databases. Being able to apply analysis against it then becomes very difficult, and this is where cloud computing can become very useful, because in the cloud we can -- from a technical standpoint -- have effectively infinite storage and infinite compute. What we're trying to do, certainly in our group at Microsoft Research, is make that available to people; previously it was very difficult to have both the access and the expertise to manage lots of data and then apply complex algorithms against it. So one of the things our team in Research Connections is doing is making that much more accessible from the desktop, through things like Excel, where typically you would have to do an awful lot of work.

An example of this comes from one of our colleagues, David Heckerman, who is based out in California and works in genomics. On the left-hand axis you can see the cost -- with lots of zeros behind it -- of sequencing a human genome, which is dropping, and the white line is Moore's law. What you can see is that a few years ago genome-sequencing technology suddenly advanced so much that there was a sharp drop in price, to the point where you will soon be able to buy a sequencing device for about a thousand dollars. And notice what's on the end of the device: a USB plug. You can plug it into your laptop and it will sequence a genome for about a thousand dollars. So we're going through a revolution where we'll be able to sequence individual people's genomes -- personalised genomes -- and that's going to create a mountain of data, but also a huge amount of opportunity for the medical community to understand disease, do screening, and come up with vaccines, for instance, which is what the video will show you. But let me go through this first and come back to the video.

David has been at Microsoft quite a long time, and he joined the team that was doing spam filtering. Back in the day, spam started out not really being a problem, and then it became one. What would happen is that the spammers would try to advertise something -- Viagra, say -- and put something about Viagra in your email message. So the filter people said: fine, we'll search for the word "Viagra" and filter any email containing it as junk. The spammers noticed their stuff was being filtered, so they replaced the "i" with a "1"; the filters didn't spot it, and spam started coming through unfiltered again. Then the spam-filter people said: aha, we've noticed spammers are replacing i's with 1's, we'll look for that too -- and they added it to the filter and started putting those messages in the junk folder. Then the spammers put "@" signs in instead of a's. Then they got really clever and put in a bitmap image that says "Viagra" -- it's not even text, so it fools the filter entirely. At that point David and the team said: this is nuts, it's an arms race between the spammers and the anti-spam people. So they stepped back and asked: what's the one thing all spam email does? All spam email directs you to a website, because that's ultimately where the spammers make money. So they said: why don't we look for the links to the websites, because every spam email will link to one. They had to think about what the weak spot in the spammers' technique was.
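The arms race above can be illustrated in a few lines. This is my own toy sketch, not Microsoft's actual filter: it shows why verbatim keyword matching loses to obfuscation, while the email's link -- the invariant the team identified -- survives every disguise.

```python
# Toy spam-filter sketch: keyword matching is trivially evaded by
# character substitution (v1agr@), but the message must still carry a
# working URL to the spammer's site, so link extraction keeps working.
import re

def keyword_filter(text, banned=("viagra",)):
    """Naive filter: flag mail containing a banned word verbatim."""
    lower = text.lower()
    return any(word in lower for word in banned)

def extract_links(text):
    """Pull out the domains the mail links to -- the spammers' weak spot."""
    return re.findall(r"https?://([\w.-]+)", text)

spam = "Cheap V1agr@ here! http://pills.example.com/buy"
print(keyword_filter(spam))   # False - the obfuscated word slips through
print(extract_links(spam))    # ['pills.example.com'] - the link remains
```

A real filter would then score or blocklist the linked domains; the sketch only shows why the link is the harder-to-hide feature.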
What's interesting is that HIV is similar: it mutates. Every time you go after it and the immune system tries to attack it, it mutates. So David and the team thought: can we apply that same logic against HIV -- what's the weak spot in HIV? It turns out that you can. This is a picture of HIV, and I've tried to highlight, in red and green, two spots on the virus. The red spot there will mutate, and if we go after it, the green protein will also mutate -- they compensate for each other. But if we go after the red piece and stop it mutating, the green one is constrained too. So David and the team said: why don't we look at the pairs, so that we can stop this adaptive mutation? That's the weak spot in HIV, and if we identify those pairs we can train the immune system to go after those spots on the virus. That is what the team has done: they applied machine learning against a massive data set. What's interesting is that, to do this the traditional way, you would take a single patient, take a blood sample, sequence it, look at what the virus looks like, then a week later take another sample and sequence that, and repeat every week so you can see how the HIV is mutating. That's very difficult to do with one patient, and much harder with hundreds or thousands of patients somewhere like Africa, where getting people to come in on a weekly basis is pretty much impossible. So what David and the team did was take a single snapshot of a population of thousands of people with HIV, sequence those genomes, and use machine learning to make that the equivalent of sampling a single person over time. It's an example, in big data, of applying machine learning to a data set and looking at it in a different way.
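A minimal sketch of the pair-finding idea: given a population snapshot of sequences, count how often two positions are mutated (relative to a reference) in the same individual. This is an assumed simplification for illustration -- the actual work used far more sophisticated machine learning -- but it captures the "look for pairs" intuition.

```python
# Find candidate compensating pairs: positions that tend to mutate
# together across a single population snapshot of viral sequences.
from itertools import combinations
from collections import Counter

def comutation_counts(sequences, reference):
    """For each pair of positions, count how many individuals carry
    mutations (differences from the reference) at both positions."""
    pair_counts = Counter()
    for seq in sequences:
        mutated = [i for i, (a, b) in enumerate(zip(seq, reference)) if a != b]
        for i, j in combinations(mutated, 2):
            pair_counts[(i, j)] += 1
    return pair_counts

reference = "AAAAA"
population = ["ACACA", "ACACA", "AAACA", "ACAAA"]  # positions 1 and 3 co-vary
counts = comutation_counts(population, reference)
print(counts.most_common(1))  # [((1, 3), 2)]
```

Positions that co-occur far more often than chance would predict are candidates for the red/green compensating pairs described above.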
They also came up with a novel way of figuring out the weakness in HIV, so I think it's a good example, in healthcare, of big data and machine learning going after a big problem.

So let me talk to you a little bit about cloud computing. The video -- I'll point you to the URL for it -- shows David and his team looking at lots of different diseases, running on, I think, 27,000 cores on Azure to do an analysis across those diseases. The cloud that Microsoft has is not just Windows Azure but everything: it includes Outlook.com, it includes Bing Maps, it includes Xbox. We have lots of data centres around the world -- the big data centres you'll see there -- and we also have what are called edge facilities that allow us to do global data distribution. So we have one of the biggest data-centre networks in the world, as you might expect.

Those data centres have evolved over time, from simple racks of boxes to high-density racks -- which is pretty much what most server rooms look like; if you go up to the sixth floor here, our server rooms have a bunch of those racks. Then came what we call generation three: people had the idea of using shipping containers, putting the racks inside them, because then you can stack them up and build data centres very quickly. You just plug in power, water cooling, and network, put in a container with maybe a couple of thousand servers, and let it run; when enough servers have failed you take the whole container out and put in a new one, rather than replacing individual machines. That was pretty much state of the art. There's a team called Global Foundation Services -- their website is great if you're interested in what goes on behind the scenes in the cloud -- which does a lot of R&D on how to make our data centres better. They pointed out that these containers were designed to ship cuddly toys and electronics across the seas: they've been around for a long time, and they're standardised, so you can stack them on a ship, take them off, stick them on a truck or a train, and transport them. They were not designed to efficiently house 2,000 servers -- we crowbarred that in. So the team in Global Foundation Services said: OK, let's start from a blank sheet. We still like the idea of containerised data centres, but what happens if we start from scratch? They came up with a design called the ITPAC, which our latest data centres have, and there's a little video here (without sound) that shows what one of those containers looks like. These containers are now being deployed in some of our newest data centres, and by designing them from scratch we can really optimise how they behave.

There's a measure called PUE -- power usage effectiveness -- which is basically the ratio of how much power goes into the data centre as a whole versus how much power is required to power the servers themselves; it's the power overhead. Typically the PUE of a normal data centre, at your university say, would be something like 1.5 to 1.8: there's essentially fifty to eighty per cent extra energy going into the data centre, mostly to cool the servers. With generation three, the PUE was about 1.25 to 1.3, so the wasted extra energy was twenty-five to thirty per cent. With these ITPACs we're using some very interesting evaporative cooling technology, and we're also siting them to use free-air cooling, where we don't even need to use the water, and we can now get down to PUEs of around 1.1 -- only ten per cent extra energy going into the data centre to cool the servers. So again, think about how different this is from a steel shipping container, and how much more thought has gone into the design. Here it's interesting: the display shows the outside temperature versus the temperature and humidity of the servers, and how the system alters the airflow to keep the servers at the right temperature. The other thing we do is run the servers hotter than you might normally do. They're still within the processor and manufacturer design bounds, but if you run them a bit hotter you obviously don't have to cool them quite as much, and when you scale that across thousands or millions of servers globally, that's a huge saving on power. So this shows how, in our cloud data centres, careful design lets us optimise how we operate the servers, keep our energy consumption down, and reduce our carbon footprint. It's quite a nice video -- I think it's on YouTube as well.

We're going to talk a lot more about Azure, but at a high level Azure is building blocks: different pieces that you can put together to build whatever you want. There's quite a lot to it -- it's quite complicated. People think about cloud as just virtual machines, and that's true, but with Windows Azure we provide lots of different building blocks, as I say, and Carlos will be going through that straight after this lecture.

We've been looking at uses of cloud computing across different research areas: things like protein folding, looking at how viruses mutate, which is good for drug design; civil protection, predicting fire outbreaks; mineral extraction, in France; MRI of the brain; and, in Japan, internet word analysis.
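The PUE arithmetic above is worth making concrete. A short sketch, using the figures quoted in the talk:

```python
# PUE (Power Usage Effectiveness) = total facility power / IT equipment
# power. A PUE of 1.0 would mean every watt drawn reaches a server;
# everything above 1.0 is overhead, mostly cooling.
def pue(total_facility_kw, it_equipment_kw):
    return total_facility_kw / it_equipment_kw

def overhead_percent(pue_value):
    """Extra energy spent per unit of IT power."""
    return (pue_value - 1.0) * 100

for label, value in [("typical university DC", 1.8),
                     ("generation-3 containers", 1.3),
                     ("ITPAC free-air cooling", 1.1)]:
    print(f"{label}: PUE {value} -> {overhead_percent(value):.0f}% overhead")
```

So moving a fleet from PUE 1.8 to 1.1 cuts the cooling-and-distribution overhead from roughly 80% of IT power to roughly 10% -- which is why it matters at the scale of millions of servers.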
Where cloud computing is quite interesting is that it gives you access to very large amounts of compute. One of the projects we did with Newcastle University and various other partners looked at QSAR -- quantitative structure-activity relationships -- which is about the toxicity of a drug, trying to figure out the dosage of a new drug. Previously, running this big study on a single server, which is what they would do, would have taken them five years. So the Newcastle team ported it onto the cloud using something called e-Science Central, which is a nice workflow system, and got it down to ten hours, because they could fire up hundreds of servers -- the problem is pretty much embarrassingly parallel -- and run it very quickly. Now, you can do that in a supercomputer centre, if you have access to one. Supercomputers are great -- I used to be a supercomputing person -- and they typically have very high-performance, very low-latency network interconnects, which are good and necessary for things like weather forecasting. But they're not necessary for embarrassingly parallel work, and that's one class of application where the cloud is very interesting, because I can spin up a thousand nodes and spin them down again. It costs the same to run one node for a thousand hours as a thousand nodes for one hour, and if that means I can get a thousand-times speed-up, with instant access on the cloud, then we can do things like this. These are some quotes from people on the project, which show how the cloud really is changing the way we can think about using our computers. Part of it is that idea of parallel computing, but a lot of it is just the access and the ease of use, and Carlos will show you some examples of that -- interestingly, with data as well.
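The "same cost, thousand-times faster" point can be stated as a two-line calculation. The per-node-hour rate below is an illustrative assumption, not a real Azure price:

```python
# With per-node-hour billing, an embarrassingly parallel job costs the
# same however you slice it, but wall-clock time shrinks with node count.
RATE_PER_NODE_HOUR = 0.12  # assumed $/node-hour for illustration

def cost(nodes, hours):
    return nodes * hours * RATE_PER_NODE_HOUR

def wall_clock(total_node_hours, nodes):
    """Perfect scaling: no inter-node communication needed."""
    return total_node_hours / nodes

job = 1000  # node-hours of work, e.g. a QSAR-style parameter sweep
print(cost(1, 1000) == cost(1000, 1))             # True - identical cost
print(wall_clock(job, 1), wall_clock(job, 1000))  # 1000.0 vs 1.0 hours
```

The perfect-scaling assumption is exactly what "embarrassingly parallel" means; for tightly coupled codes like weather models, communication costs break this equivalence, which is why those still want supercomputer interconnects.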
Here are some examples not so much in the research realm. Transport for London -- how many of you have been to London? Quite a few. London has a very big transport network, buses and the Underground, and Transport for London actually make the real-time data available over the web. When they first did this, they deployed it in their own data centre (or a hired one), everybody started building apps, and the system essentially ground to a halt. So Microsoft worked with Transport for London to take that data feed and put it up onto our cloud, onto Azure, so that it could scale out with demand, and there are now hundreds of apps consuming those live data feeds. It's a good example of the cloud doing very scalable distribution of data. Another example is Ordnance Survey, the UK's mapping agency, which has a full digital mapping service: you can go onto the website, draw a box around Cambridge, say "give me a map at 1:5000", and download it. Imagine all of that high-resolution vector mapping data for the whole of the UK -- again, putting it on Azure really helps them. Still on data, there's something called the Azure Marketplace, which I wasn't going to go through in detail, but there are lots of data sets on there, a lot of them free, and some great data feeds. One example is the UK Met Office, which makes its weather forecasts available through the Azure Marketplace, so you can program against it and build apps that pull weather data from the marketplace. And it's not just data: there are services too. With the Microsoft Translator service you can write programs that translate between lots of different languages -- including Klingon. There are web widgets and so on, but if you're writing a program you can get access to that top-quality machine translation programmatically.

This is a video about the fire prediction -- again, I'll point you to that later.

I just want to finish off by looking at life, the universe and everything -- that's the Hubble image again, which I love -- with an example of an end-user application called ChronoZoom, which is essentially a timeline for everything. It was developed by us and Walter Alvarez at UC Berkeley. Walter is a very interesting chap: he was the person who went out to Italy and looked at the rocks, and he saw that in the rock layer from when the dinosaurs were around there was, all of a sudden, a definite line in the stratification. What he found was that iridium is present at that line -- and iridium is vanishingly rare on the Earth's surface. That point in time was when the dinosaurs were wiped out, so Walter was the one who theorised that the dinosaurs were wiped out by a meteorite. After that, people started looking for the impact site, and they found the Yucatán crater, which we think is the crater of the meteorite that wiped out the dinosaurs. Walter is interested in these very large timescales -- he's a geologist -- and he teaches a course in something called Big History, which tries to understand, essentially, the timeline of everything. It's very difficult to visualise time from the start of the universe -- a little over 13.7 billion years ago -- down to a single day in modern history. One of his students looked at a piece of technology we have called Deep Zoom, which lets you zoom in and out through a web browser, and developed a prototype they called ChronoZoom to do this. It's at chronozoomproject.org, so you can have a look, and I'm going to quickly show you.

So this is a timeline of everything. In ChronoZoom we have what we call exhibits: I can zoom in here and show different artifacts -- just as when you go to a museum, you see an exhibit made up of different artifacts. These artifacts can be things like movies, which is quite nice -- we can embed Vimeo videos -- plus Wikipedia articles, graphics, and so on. So within this timeline we can present different exhibits, and also a bibliography, so if you want to use this for your research you can. We can zoom right in, but what's really interesting is when you zoom out, back to the timeline. Look here: we've got gigayears, and we can see the cosmos. I can zoom in and look at the Earth and solar-system timescale, and zoom in again to look at, for instance, the different craters on the Earth -- look at the timeline at the top: 1.8 billion years there. So we've gone from the cosmos to the life of the solar system. It really gets interesting when we start looking at life: we zoom into the origins of life, things like bacteria, and then into prehistory -- again, watch the timeline as we zoom and zoom, into thousands of years -- and then we can look at people; I think Homo sapiens is over here. Then over here is humanity -- modern humanity -- and the timeline is now quite interesting because we're into thousands of years. We can look at Mayan history, then go over here, I think, to the United States, and zoom into Microsoft -- here are our folks in Redmond -- and then zoom in here: yes, that was last year's meeting in January. What's interesting then is that you can really feel suitably insignificant, because I can zoom in to a day in modern history and then zoom out through life, out through Earth and the solar system, and all the way out to the life of the universe.
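The scale problem ChronoZoom solves is worth quantifying: the ratio between the age of the universe and a single day is about twelve orders of magnitude, so a Deep Zoom-style viewer has to halve the visible window dozens of times to get from one to the other. A sketch, using the 13.7-billion-year figure from the talk (the doubling factor is an assumption about how such viewers typically zoom):

```python
# How many halvings of the visible time window does it take to zoom from
# the whole history of the universe down to a single day?
import math

YEAR_DAYS = 365.25
UNIVERSE_YEARS = 13.7e9

def zoom_levels(full_span_years, target_span_years, factor=2.0):
    """Number of zoom steps (each shrinking the window by `factor`)
    needed to go from the full timeline to the target span."""
    return math.ceil(math.log(full_span_years / target_span_years, factor))

one_day_years = 1 / YEAR_DAYS
print(zoom_levels(UNIVERSE_YEARS, one_day_years))  # 43 halvings
```

Forty-odd doublings of detail is the same order as a web map zooming from the whole Earth to a single street, which is why the Deep Zoom tile machinery transfers so naturally to timelines.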
So you can see -- again, an interesting tool, all delivered through the cloud: all of that data is in the cloud, delivered as a really nice internet experience. It's great fun, so do have a look at it. I just want to finish with where we're going. I talked a bit about supercomputing, and I'm sure some of you use supercomputers for your research. This is how a lot of research looks today: you have high-performance computing clusters and national machines -- universities will have big machines; where I used to work we had an 8,000-core machine, which was fantastic if you could get on it -- and then you have everybody else. The trends are that we're getting more data, and it's getting harder to do the analysis. So our team in MSR is really working on using cloud computing to make it much easier for researchers to access data, analyse data, and share their results. For this fourth paradigm of data-intensive science, it's really great that we now have the cloud and can build solutions on it, so that, as we say, scientists can be scientists. With that, thank you very much for listening -- I'm happy to take questions.
