Big Data – Distributed Computing and Hadoop



All right, welcome to distributed computing, with a brief look at how it relates to Hadoop. Distributed computing has been around for quite a while. It's kind of funny: I used to be a VMware instructor, and in VMware we would take one real physical computer and create virtual computers on it. We'd take one real computer and turn it into multiple pretend computers, and we called that virtualization. Distributed computing is almost the opposite: we take lots and lots of real computers and make them work together, pretending to be one virtual computer. That's really what distributed computing is, and we're going to talk about what some of its different components are.

Fundamentally, one of the common phrases you'll hear describing distributed computing is "divide and conquer." We take a job, slice and dice it into smaller pieces, have different workers process those smaller pieces, and when they're done, combine the results. Imagine you had one of those 10,000-piece puzzles and you wanted to figure out how many pieces had one nub, how many had two nubs, three nubs, and so forth. With 10,000 pieces and, say, a team of five people at work, how do you handle that? Ideally, you get an intern and make him do all 10,000. What we actually do, of course, is break the pile into five chunks. Each person analyzes the puzzle pieces they're given, figures out how many have one nub, two nubs, or three nubs, and keeps their own little tally. When everyone is done, we take the data from those five individuals and combine it to get our final result. That's one simple example of distributed computing.
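To make the puzzle example concrete, here is a minimal sketch in Java of that divide-and-conquer tally. The piece data, the five-way split, and the nub counts are all invented for illustration; only the pattern (partition the work, count locally, merge the tallies) comes from the example above.

```java
import java.util.*;
import java.util.concurrent.*;

public class PuzzleTally {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: 10,000 puzzle pieces, each described by its nub count (0-4).
        Random rnd = new Random(42);
        List<Integer> pieces = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) pieces.add(rnd.nextInt(5));

        int workers = 5;                      // our "team of five"
        int chunk = pieces.size() / workers;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Map<Integer, Integer>>> partials = new ArrayList<>();

        // Divide: each worker tallies only its own slice of the pile.
        for (int w = 0; w < workers; w++) {
            List<Integer> slice = pieces.subList(w * chunk, (w + 1) * chunk);
            partials.add(pool.submit(() -> {
                Map<Integer, Integer> tally = new HashMap<>();
                for (int nubs : slice) tally.merge(nubs, 1, Integer::sum);
                return tally;
            }));
        }

        // Conquer: merge the five partial tallies into the final answer.
        Map<Integer, Integer> total = new HashMap<>();
        for (Future<Map<Integer, Integer>> f : partials)
            f.get().forEach((nubs, count) -> total.merge(nubs, count, Integer::sum));

        pool.shutdown();
        System.out.println(total);            // e.g. {0=..., 1=..., 2=..., 3=..., 4=...}
    }
}
```

The same shape (local tallies merged into one global tally) is exactly what MapReduce automates across machines instead of threads, as we'll see below.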
One of the most famous examples of distributed computing involves aliens: a project from SETI, the Search for Extraterrestrial Intelligence, called SETI@home. You could run a screensaver, and when you weren't using your computer, the screensaver would kick in, reach out to SETI's servers, and grab some data. Your computer would process that data, analyzing it for unusual radio signals that might indicate the presence of intelligence, and then return the results. People competed to see who could get the most blocks done, and there were millions of participants around the world. That's a great positive example of distributed computing.

It can also be used in a negative manner. If you want to attack a computer, you can run a denial-of-service attack: instead of opening one page, your computer keeps trying to open hundreds of pages to slow the server down. The most effective form is known as a DDoS, a distributed denial-of-service attack, where attackers install some type of virus or worm on thousands or hundreds of thousands of computers, which then all work as a team to attack one particular target. So distributed computing is powerful, whether used for good or evil. Like my mom used to say, many hands make light work.

Now let's consider some of the basic components. In the old days we had one great big computer, and we'd say, "Okay, we have this great big computer; go handle everything." If we wanted to upgrade for a little more power, growth was really challenging. With distributed computing we get incremental growth: as I need more power, I simply add more nodes (more individual small servers) to my cluster, and I can scale linearly and grow gradually.

As the cluster grows, something neat happens. The first thing we notice is increased throughput, because each new machine brings its own network, compute, and storage: more IOPS, more gigabits per second on the network, more of everything. And not only do we get increased throughput, we also get decreased response time, because now we have more responders. So as the system grows, it actually keeps improving.

Now you might say, "Great, I can add nodes, but it seems like I'm adding more and more points of failure." Actually, what's interesting about distributed computing is that it has very high reliability. Traditionally, when we think of reliability, we think about mean time between failures and the like. That's a very meaningful measurement when the failure of one computer takes your whole system down. But this is not one computer; it's an army of computers working as a single logical unit, and if one of them fails, the army marches on. So you get incredibly high reliability.

Some people say, "Hey, build these clusters with the cheapest components you can get; the overall system will tolerate it." That's usually not the best practice, because you'll end up spending more time maintaining the system and replacing all those low-end compute nodes. You want to find a nice balance. You don't have to go super high-end, either, because a lot of resiliency is built into the system itself. For example, you don't necessarily need individual servers running things like RAID 5 for disk protection, because in a lot of clusters that protection is built into the distributed framework itself.

The last point about distributed computing is price. Compare scaling up, buying one giant computer to handle everything, with buying lots of computers to accomplish the same task: the cluster wins on price. We get much more compute power and throughput, and honestly, some of the throughput we achieve couldn't possibly come from a single machine. Even if you bought a single super mainframe, you just can't match a thousand regular computers acting as a distributed computing cluster. So these systems are very, very powerful, and distributed computing has become huge.

One of the areas where it has really gotten attention is big data, and in big data the platform currently getting the most attention is Hadoop. Hadoop has a couple of different components. The first one people usually talk about is something called MapReduce. MapReduce in Hadoop is literally a divide-and-conquer strategy. A bunch of data comes in, and instead of having one computer try to crank through all of it, we break that data up into blocks, maybe 128 MB each, and ship those blocks out across the cluster. Each node processes the work it's given, and when it's done with its job, it returns its results. We then consolidate, or reduce, those results to get the final answer. MapReduce is simply dividing a big job up and conquering it, and it's a textbook application of distributed computing.
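To show what that looks like in code, here is the classic word-count job written against Hadoop's Java MapReduce API. This is the standard introductory example rather than anything specific from this lecture: the map step tallies words within the block of input it was shipped, and the reduce step merges those per-block tallies into the final counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: runs once per block of input, emitting (word, 1) for every word it sees.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: gathers every count emitted for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-sum on each node to cut network traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop handles the shipping for you: where possible, each map task is scheduled on a node that already holds the block it will read, so the compute moves to the data rather than the other way around.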
So from a compute perspective, that's really cool, but what about storage? For storage, Hadoop uses something called the Hadoop Distributed File System, or HDFS, and HDFS is pretty neat. Typically our big data is what we call WORM data: write once, read many times. That's usually how we store it, and HDFS makes it very easy. When a chunk of data comes in, HDFS doesn't make just one copy on one server, or even two copies on two servers; it makes three copies, so you end up with three copies of the same block spread throughout your cluster. They're scattered around so that if any one node fails, the system detects the failure, automatically re-replicates the data, and you're still running. You can have failures in your system while your compute (MapReduce) and your storage both remain fully functional.

Another great benefit: when I need a particular block, say one of those 128 MB stripes, I can get it not just from one source but from any of three. So if a file is broken up into 128 MB chunks scattered across a hundred different computers, then when I start requesting those blocks, the cluster can send a tremendous amount of data, because many computers can feed me different blocks of the same file at once. That gives you incredibly high input/output.
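As a small, hedged sketch of how a client touches HDFS, here is a read through the Java FileSystem API. The cluster address and file path are made up for the example; the point is that the client reads one logical stream while, behind the scenes, the NameNode directs each block request to one of the DataNodes holding a replica.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical cluster address, purely for illustration.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/events/log.txt"))))) {
      // The client reads the file as one stream; under the hood, each 128 MB
      // block is fetched from whichever of its replicas is closest.
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```

The replication factor itself is just a setting (dfs.replication, which defaults to 3), which is where the "three copies" figure above comes from.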
So, as I said, that's what we're able to do with distributed computing, with Hadoop being a very popular solution for big data: we take a bunch of regular computers and effectively turn them into one giant virtual computer, one virtual supercomputer. They all work together as a single cluster to take something that's really difficult for a single computer to do, break it into small chunks, distribute it, and process it. That's it for distributed computing. Thank you for listening to this section.