Learn Real Time Big Data Analytics Using Python and Spark: Hands-On | Learn Python and Spark



Hi everyone, welcome to this webinar on real-time big data analytics using Spark and Python. My name is Vinod Venkatramana, I am the head of technology here at RIT Learning, and I'm very excited to have all of you on board as we explore what real-time big data analytics is and how you can use Spark and Python to actually do it.

First, the agenda — what are we going to cover in the next hour? We will start by understanding what real-time data analytics means, why it is relevant today, and how it is different from the normal data analytics that you and I do. Then we will move on to the use cases where real-time data analytics is being applied today and why being able to do it is so critical. After that we will look at how to model real-time data: whenever you have a lot of data, you have to model it in a certain way so that it is easy to analyze. Once the data is modeled that way, we will discuss how to actually go about analyzing it in real time. Then we will get into Spark, and a particular component of Spark called Spark Streaming, which can be used to analyze data in real time. And at the end, which is my favorite part, we are going to do a real-time analysis of tweets as they happen during this webinar and pull some insights out of them. That's the agenda, so let's get started.

So what is real-time data analytics? It is essentially the ability to analyze data as soon as it is produced. Data is produced all around us: every tweet we make, every social media post, every user visiting a page on our website, every user touching a product on our e-commerce site. All of this is data produced at different points in time, and real-time data analytics is the ability to gather insights from this data as soon as it is produced, with effectively zero latency.

There is a related term, near real-time analytics. Here data gets produced but you are given a little bit of leeway: you can analyze it within a given time frame. For example, say we have an e-commerce setup where users are purchasing products, and I want to find out how many products were sold in the last minute. Orders are getting placed one after another, and based on each order-placed event I keep a running count of how many orders were placed in the last minute. So near real-time analytics is analytics where we analyze data within a given time frame after it is produced, and that time frame depends on the use case.
In the e-commerce example we just discussed, the business team asks for a report of how many orders were placed every minute, so that one-minute window is what defines the time frame for near real-time analytics. Okay, so now we know what real-time data analytics is, and we also know what near real-time analytics is.

Now I would like to take a little bit of time and discuss the difference between traditional batch analytics, done using Hadoop and MapReduce, and real-time analytics. If you take any large analytics deployment in any company, what happens is that data gets gathered — in this example, orders get placed throughout the day in the order management system of the e-commerce company — and a job runs in the night which pulls it out of the orders table in the orders database, or whatever the system is, and imports the data overnight into Hadoop, into HDFS. The next day you analyze all the data that came from the previous day or the previous week, depending on how much data you want to analyze. That is how traditional batch analytics happens: transactional systems produce data, that data is imported at some frequency into analytic systems like Hadoop or even Spark, and after that it is analyzed, with a turnaround time that may be days or weeks. Real-time and near real-time data analytics, on the other hand, is about the ability to analyze data within a very short time frame from when it is produced, so borrowing procedures from traditional batch analytics — where a nightly cron job or something like it gathers all the data and imports it into your analytic system — is not going to work. I wanted to highlight this key difference: batch analytic systems run once a day, they use a tool like Sqoop or an HDFS copy to move data from the transactional systems into Hadoop, and the batch analysis runs the next day, whereas for real-time data analytics, which is the topic of today's webinar, the system we build needs the ability to analyze data as soon as it is produced.

Just to summarize: we have learned what real-time data analytics is, what near real-time data analytics is, and the difference between traditional batch analytics and real-time analytics. Moving on, let's look at a few of the use cases where real-time data analytics is required. The first use case is very simple — I don't even know if it is an analytics use case per se, but it is definitely a real-time use case. Say there is a system deployed in the intensive care unit, beside each bed, monitoring the key vitals of the patient. The main decision such a system has to make is to raise an alarm in case the patient needs immediate medical attention. A system like this cannot work with traditional batch processing, where you gather all the stats, they go to another system where a cron job runs every five minutes, checks the vitals and then raises the alert — the patient might meet with a fatality in the meantime. In a case like the ICU, as soon as you detect that the vitals of the patient are deteriorating, the alarm has to go off immediately.
That is one use case. The next is fraud detection. Today we are all doing online transactions, entering our credit card details on multiple sites. What if a fraudulent website somehow takes your credit card information and starts doing transactions on your card on your behalf, basically spending your money to buy things for themselves? In finance and cyber security it is very important to detect these frauds in real time or near real time. If all this data is gathered overnight and only examined later, the fraud has already happened, the money is spent, and now you have to reconcile somehow — go to your insurance and make a claim. So in the cyber security or finance space, fraud detection is a use case which demands real-time analytics of financial transactions.

The next is application health monitoring. Say you have an application deployed on the cloud — your website, for example. How do you know whether it is working right now? In the last five minutes, or the last two minutes, how many pages did it serve, how many requests failed? If you don't act immediately on your application's health, it might cause widespread impact on your business. So application health monitoring — how many errors you are serving to your end users, how many page load errors, 500 errors and so on — needs to be real-time as well.

Next, every business nowadays is moving towards data-driven decision making, and one aspect of that is this: say you roll out a new feature on your website, and people start talking about that feature and your product on social media. You want to know, as soon as the feature goes live, what the user sentiment is based on social media — what are they tweeting, what are they saying on Facebook? If you do the traditional batch analytics thing, you will come to know maybe a day or two later, and by that time you could have impacted maybe 90% of your users. If they loved it, all good; but if they disliked it, you could not prevent the damage, because your system was not geared to analyze data as soon as it was produced.

The next is Google Maps. When I want to go somewhere, Google Maps uses live traffic data: it looks at traffic data as it is produced and, based on that, suggests which routes are fastest in terms of time and which routes are less congested. That is another use case where Google Maps has to act on traffic information in real time and then suggest better routes.

The next is high-frequency stock trading. Say there is a system whose responsibility is to automatically, using an algorithm, invest money in stocks — buy and sell — so that at the end of the day it has made a margin, a profit, for its users. Such a system has to act quickly, because stock prices rise and fall within seconds.
If the system only looked at historical data and could only act tomorrow, by the time it analyzed that data and decided to buy a stock it might be too late — the stock could already be falling the next day. The last one, something we are all going to experience in the next 10 to 15 years, is self-driving cars. A self-driving car is not so much a real-time data analytics use case as a real-time data processing one: the car has to decide whether to accelerate, brake, turn slightly left, turn fully left, turn right, and so on, so it has to be able to act on data as soon as it captures it. These are a few of the use cases I wanted to impress on you where real-time data analytics is now not just a good thing to know — it is an essential thing to have in your analytics toolkit.

Moving on: we have been talking about all this real-time data getting produced and you having the ability to process it in real time, but how can you model this data in your mind so that you can understand how to do analytics with it? That is what this slide covers. One way to look at data as it is being generated is as a stream of events — a stream of data. Real-time data, data produced at different points in time, can be looked at as a stream of events. It is like a river flowing right in front of you: you are standing there, a river of events is flowing across you, and each event in the river carries the timestamp of when it occurred as well as the actual data of the event. Let's keep this model in our heads so we can see how to tie it to analytics — if we model data this way, analytics becomes much easier to reason about.

Next, let's talk about the types of real-time data analytics that generally get done — the different ways in which real-time data gets analyzed. One common use case is to aggregate statistics from recent events. In the case of the website deployed on the cloud that we discussed, I, as the architect of the system, care about how many page load errors it served in the last five seconds, because only that can tell me whether my system is doing well right now. Another example of aggregating statistics from recent events: say you work in an e-commerce company that sells different types of products, and the CEO asks you, "Tell me how much revenue each category of products we sell — apparel, footwear, books, whatever else — is generating every minute, and I want it as fresh as the last minute." So: revenue per category in the last minute, moving forward minute by minute. Every minute you have different data — maybe this minute books are hot, the next minute a footwear promotion goes live and people order footwear instead.
So one type of real-time data analytics is getting aggregate stats from recent events: the recent events are the real-time part, the aggregated stats are the analytics part. The next type of real-time data analytics is applying a machine learning model to this stream of events. Say that at Amazon, or any e-commerce website, you build a model offline which, based on what a user is doing on the website right now, predicts whether they will actually go ahead and make a purchase in this session. Say a model like this is built by all the machine learning gurus at the company. What you want to do is take the events users are generating on your website — that stream of events — run this machine learning model on top of it, and see how many of those people are just window-shopping versus how many are actually likely to make a purchase soon. So that is another way a real-time data analytics system is used: you train machine learning models offline for certain business problems, and as events happen you run them through the model to predict, in this case, how many users will actually make a purchase. Another use case for applying machine learning models is high-frequency stock trading: there is a model which says that if a stock belongs to this type of company and its price changes by this much, then you should go ahead and buy this much quantity of it. As stock prices keep changing in the market, those are the raw events — the stream of events — and the model sits in between, deciding whether you should buy the stock, sell it, or do nothing. So the second type of real-time data analytics is building offline models using the traditional batch-based analytics we spoke about, putting those models on top of the real-time stream of data, and using them to make business decisions.

The third, and very interesting, type of real-time data analytics is building the model itself as the stream of events is happening. In the second type we just discussed, models are built offline and then placed on top of this river of data to make business decisions. But what if your model has to be very recent — it cannot be a day old, it cannot even be an hour old, because your use case and the data you are dealing with are very sensitive to time? Then the model has to be built on the fly and used on the data coming immediately after. For example, you build a model from five minutes of gathered data and use it to predict for the next half an hour; then, as the stream keeps flowing, you build a new model from the next five minutes and use it for the following half hour, and so on. So, in my head, these are the three types of real-time data analytics I have encountered: one is getting aggregate stats from recent events; the second is applying machine learning models to the stream of events; and the third is, if your use case demands it, building the model from the most recent data and using it to predict on the data that comes next.
Cool. So now we have covered the basics: what real-time data analytics is, how it differs from traditional batch processing, and how we can model data as a stream of events so that we can think about how to do real-time data analytics. Now let's discuss how Spark Streaming enables this — how, with Spark Streaming, you can do real-time analytics. Essentially, Spark does exactly the modeling we spoke about. It asks you to point it at a stream of data, and it models that stream as a sequence of data sets: it starts receiving events from the stream, and what it gives out is a sequence of data sets, where each data set is bounded by a time frame. In other words, each data set contains the events from the last five seconds, or the last ten seconds, or the last five minutes, or the last hour — again, the interval depends on the use case. You tell Spark, "This is my stream of data; give me a sequence of data sets where each data set contains data from the last X seconds" — the last five seconds, in this example. So we have taken a continuous stream and, in a way, discretized it: we have made logical chunks out of it, one after the other. The stream of data has been converted into a sequence of data sets, and how much data goes into each data set is defined by you, in terms of how long Spark should monitor the stream for each chunk. If you say "events in the last five seconds", it collects data for five seconds and that becomes your first data set; then it collects data for the next five seconds for the second data set, then the next five seconds for the third, and so on. Once we do this, you will see how each of the three types of analytics we discussed becomes easy — we'll get to that in the coming slides.

What Spark does, as I explained, is collect data into the corresponding data set as it arrives, and then it lets you analyze each data set as soon as its time window expires — as soon as it is full. As soon as five seconds have elapsed in this example, the first data set in your stream is ready and you can do whatever analytics you want on it. That is what enables near real-time data analytics: you have a stream of data, you tell Spark to monitor it and give you a data set every five seconds, and Spark creates those data sets and lets you write code that runs on each of them to do different things. And what could those things be? Going back to the previous slide: on each data set we could compute aggregate stats, like page load errors in the last five seconds — each data set still has the raw events, they are just grouped together, put into a bucket.
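To make the discretization concrete, here is a minimal Spark Streaming sketch, assuming a plain-text event source available on a local socket; the host, port and five-second interval are illustrative and not taken from the webinar's own files.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DiscretizedOrders")
ssc = StreamingContext(sc, 5)  # cut the stream into one data set every 5 seconds

# Point Spark Streaming at the stream; it hands back a DStream,
# i.e. a sequence of RDDs, one per 5-second window.
orders = ssc.socketTextStream("localhost", 9999)

# "Aggregate stats on recent events": count the events in each window.
orders.count().pprint()

ssc.start()             # start collecting data into the windows
ssc.awaitTermination()  # keep running until the job is stopped
```

The pprint call fires once per batch, printing the count of events received in that five-second window.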
On each of those buckets we could also apply an offline machine learning model — for example, to predict whether this user will make a purchase — or, in the third case, we could use each bucket to build a machine learning model on the fly and apply it to the subsequent buckets. So we took the three types of real-time data analytics we learned about, and we saw how Spark's way of transforming the data stream into a sequence of data sets lets you do all three of them.

Each of these data sets is a Spark RDD, and RDD stands for resilient distributed dataset. Spark is a distributed computing framework, so these data sets are not kept in the memory of one machine; they are held by a cluster of servers. Some servers will hold parts of the first data set, some the second, some the third, and as these data sets arrive and become ready for processing, you can send arbitrary code to work on them and extract your insights. So what we have understood now is how Spark Streaming lets you take a stream of data flowing in and convert it into a sequence of data sets, each of which you can work on. That is how real-time data is modeled in Spark: as a sequence of RDDs.

Now, what does Spark call this sequence of RDDs? A discretized stream, or DStream. The stream boundaries are the first five seconds, the next five seconds, the next five seconds, and so on — that is the DStream concept in Spark Streaming. When you point Spark at a stream of data, it gives you a DStream. What can you do with it? Say I am an e-commerce company and orders are getting placed — that is the raw event stream — and on top of it I create a Spark DStream by saying "collect five seconds' worth of orders at a time". Now I have a sequence of RDDs, which is the DStream, and I can transform it into a new DStream. The functions Spark provides are documented on its website; the popular ones are listed here. You can use the map function, for example, which takes each element inside the DStream as input, and whatever you output replaces that element. Suppose I only want to look at orders worth more than a thousand rupees: I take the DStream of orders and use the filter function — this is really a filter example rather than map — which keeps only the events in the DStream that satisfy the condition, so the stream is left only with orders worth more than a thousand rupees. You can also take two streams, say a Twitter stream and a Facebook stream, and combine them using union — like taking two rivers and merging them into one combined river. You can also reduce, where you provide a function that compares two elements and gives one element back, which lets you count or aggregate. So DStreams can be transformed into new DStreams.
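As a rough sketch of those transformations — the thousand-rupee filter and the "two rivers" union from the slide — assuming each line on one socket is an order amount and a second socket feeds a second stream; both ports are made up for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamTransforms")
ssc = StreamingContext(sc, 5)

orders = ssc.socketTextStream("localhost", 9999)   # e.g. one order value per line
other = ssc.socketTextStream("localhost", 9998)    # a second, hypothetical stream

amounts = orders.map(lambda line: float(line))      # map: one-to-one transform
big = amounts.filter(lambda amt: amt > 1000.0)      # keep only orders above 1000 rupees
combined = orders.union(other)                      # union: merge two streams into one
totals = amounts.reduce(lambda a, b: a + b)         # reduce two elements into one (sum per batch)

big.pprint()
totals.pprint()

ssc.start()
ssc.awaitTermination()
```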
These are some of the popular functions; if you are not following each of them, don't worry — when we get to the code walkthrough we will see them in action. The other thing you can do with a DStream is output it: save it somewhere so that you can visualize it, analyze it later, and so on. For that Spark also provides output and save functions. The print function prints ten elements per RDD in the DStream — if you call print on a DStream, it takes the first ten elements from each underlying RDD and prints them, so you can see what data is flowing through the stream. Next, you can save the RDDs as Hadoop files, so you have a time-stamped sequence of files, one after the other: all the orders placed in the first five seconds become one file, all the orders placed in the next five seconds become another file, and so forth. There is also a very generic foreachRDD function, where you can pass whatever logic you want to run on each RDD. For example, say you want to export all these events to another analytics system like Mixpanel, or save them to a relational database like MySQL or PostgreSQL — you call the foreachRDD function provided on DStreams and write your logic for opening the connection and sending the data to the database, or whatever it is. It is the most general way to output an RDD. So, to recap: we have learned how Spark Streaming models data so that you can do real-time analytics on it, which is the DStream, and we have learned what you can do with DStreams — transform a DStream into another DStream, or output a DStream.
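A sketch of those output operations on the same kind of DStream; the database sink is purely hypothetical, and you would plug in whatever client library you actually use. In PySpark the file-saving call is saveAsTextFiles, the Python counterpart of the "save as Hadoop files" operation on the slide.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamOutputs")
ssc = StreamingContext(sc, 5)
orders = ssc.socketTextStream("localhost", 9999)

orders.pprint()                         # print the first 10 elements of every batch
orders.saveAsTextFiles("orders-batch")  # one set of files per 5-second window

def send_to_database(rdd):
    # foreachRDD: the most general output; run arbitrary logic on each batch.
    for record in rdd.collect():        # fine for small batches; use partitions for big ones
        pass                            # e.g. INSERT the record into MySQL/PostgreSQL here

orders.foreachRDD(send_to_database)

ssc.start()
ssc.awaitTermination()
```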
All right, we have had a lot of theory — we've covered modeling and we've covered Spark — so now let's do a hands-on example, look at some code, and actually do some real-time analytics. In this exercise we are going to livestream tweets from Twitter that contain the word "India", as they are happening, and then use Spark Streaming to find the top hashtags in those tweets every 10 seconds. So every 10 seconds we find the trending hashtags for that window — right now something may be trending, then a news story breaks and something else starts trending, and we stay on top of things as they happen. By "top hashtags" I mean the most frequently occurring hashtags, the ones mentioned in the most tweets. The example is simple: we connect to Twitter, ask it to send us all tweets containing the word India — that is our stream of events — and on that stream we use Spark Streaming to find, every 10 seconds, the most common hashtags in that 10-second window.

The first thing we need to do is set up that stream of events. We are going to do this entire hands-on exercise in Python, and the editor I am using is called Canopy, which I would recommend as an IDE for Python work in general. The package I am using is tweepy — Twitter for Python. It lets you use the streaming API that Twitter provides: you connect to Twitter and ask it to send you all tweets matching a filter, which in our case is the word India. Line 46 is the main method of this Python program, where we first do some authentication so that Twitter knows who I am — you need to create a Twitter app for this, and I won't get into all of that, so treat those two lines of code as boilerplate. After I am authenticated with Twitter, I create a stream of tweets and ask Twitter to filter it and give me only tweets that contain the word India. I also specify the function to be called whenever data is received — this is what tweepy does for you: it connects to the Twitter streaming API, and whenever data is available it calls the on_data method. The data I receive there is the tweet object: it contains the actual tweet text, which will have the hashtags in it, when the tweet happened, and a lot more besides. So for each tweet containing the word India this function is called, and all I am doing here is printing it to the output. Let's go ahead and run this program and see the tweets that are happening right now. The file we looked at was twitter_stream.py, and when I run it, it prints each of these tweets — I can stop it now. This is the tweet object it has given me, and there is a text portion in it where the actual tweet is, the highlighted part. These are all tweets containing the word India, as they are being made right now: that is the stream of data flowing from Twitter.

Good. Now we have an idea of how to create a stream, in this case from Twitter, but in reality your source might be something else. There are many connectors that Spark Streaming provides with which you can connect to different data sources and create a stream. One of the most popular sources is Kafka: Kafka is a distributed system in which you can keep time-series data, and you can easily use Spark Streaming to connect to Kafka and get a stream of events, just like we are getting one from Twitter here.
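Here is roughly what that twitter_stream.py does, as a sketch using the tweepy 3.x streaming API that was current at the time; the credential placeholders stand for keys you would create in your own Twitter app, and the exact file shown in the webinar may differ in details.

```python
import tweepy

# Placeholders: fill in the keys from your own Twitter app.
CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

class TweetPrinter(tweepy.StreamListener):
    def on_data(self, data):
        print(data)        # `data` is the raw JSON string of one tweet
        return True

    def on_error(self, status):
        print(status)      # e.g. rate-limit or authentication errors
        return True

if __name__ == "__main__":
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = tweepy.Stream(auth, TweetPrinter())
    stream.filter(track=["India"])   # only tweets containing the word "India"
```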
Okay, so this is a very simple piece of code: we connect to Twitter, ask for a stream filtered by the word India, and each tweet we receive, we write to standard output — the tweets come in and get printed. Now what we want to do is point Spark Streaming at this stream so that it can connect to it, and then we can do some magic with Spark, which is finding the top hashtags. For that we need somewhere for Spark to connect. Obviously this is not Kafka — right now the stream is just the standard output of a process, because we are simply printing it. To make it usable, instead of printing to stdout I am going to redirect this output to a port on my local machine: port 9999 on localhost. Then there is a port with data flowing into it, which Spark Streaming can easily connect to.

Now let's go to the second part of the code, the Spark Streaming part, in the file trending_tags.py. This is where we will spend most of our time. For the stream itself, most of the time you will find a ready-made connector for the source you want — for Twitter we had to do a little bit of coding, but usually you will be reading data from a Kafka cluster, and Spark Streaming has built-in support for Kafka, so you won't have to write this part yourself.

So what are we doing here? Our task is simple: we have that stream of tweets coming in on a port, and we first need to connect to it. I'll give you a top-level overview of the code and then go a little deeper. This is the main, starting from line 24. When you run this program you pass the host and port where the stream is available — in our case localhost and port 9999, because that is where we are writing the tweets. Then we create a SparkContext and give it an app name, so that in the Spark monitoring dashboard we can see our application; this is a real-time application, it will keep running and keep telling you, every 10 seconds, what the popular hashtags are, so there should be a way to monitor it remotely, and that is why you give the SparkContext an app name. The window interval we have chosen is 10 seconds — every 10 seconds I want my analysis to be done — so I define a batch interval of 10 seconds and create a StreamingContext from the SparkContext. Whenever you want to do streaming, or real-time data analytics, with Spark, you create a StreamingContext, passing it your SparkContext and the batch interval: how long it should collect from the stream before handing you each data set, which is 10 seconds in our case.
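The Spark side of this — the beginning of trending_tags.py as just described — looks roughly like the sketch below. One simple way to make the tweet printer's stdout available on the port, assuming a Unix shell with netcat, would be something like `python twitter_stream.py | nc -lk 9999`, though the webinar does not show exactly how the redirection was done.

```python
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Host and port of the tweet stream are passed on the command line,
# e.g. "localhost 9999" as in the demo.
host, port = sys.argv[1], int(sys.argv[2])

sc = SparkContext(appName="TrendingHashtags")  # the app name shows up in the Spark UI
ssc = StreamingContext(sc, 10)                 # 10-second batch window

# Connecting to the socket gives back a DStream of raw JSON lines, one tweet per line.
tweets_dstream = ssc.socketTextStream(host, port)
```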
With the StreamingContext created, what I do next is tell it to connect to localhost — which is the first argument we pass — on port 9999. We are pointing Spark Streaming at the stream we created with the Twitter script, and when you point Spark at a stream of data it returns a DStream, which in this case I have named the tweets DStream. Now we can work with this DStream.

So what am I doing here? Let's go through it step by step. First of all, as I said, you can take a DStream and transform it into another DStream, which is what we want to do. If you look at the output we are getting, these are all JSON strings, one after another — each line is a JSON string — and what I want is the text part of it. So I take the stream of JSON strings and convert it into a stream of tweet texts. To do that I call the map function, which transforms the stream, and for each tweet I call a function, extract_tweet_text, which returns the actual text of the tweet. Each line I receive is the JSON representation of a tweet, so I parse it as JSON, and in extract_tweet_text I check: if there is a text field in the tweet, return it, otherwise return an empty string. That last part is error handling — I was getting some entries in my stream that had no text in them at all, which was breaking things, so I wrote this function to take the JSON representation of a tweet and return just its text portion.
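A sketch of that first transformation, continuing from the tweets_dstream above; the helper name matches the one mentioned in the walkthrough, but its exact body here is an approximation.

```python
import json

def extract_tweet_text(raw):
    """Return the tweet's text, or an empty string if it has none or isn't valid JSON."""
    try:
        tweet = json.loads(raw)
        return tweet.get("text", "")
    except ValueError:
        return ""

# Stream of JSON strings -> stream of tweet texts (one-to-one, so map).
texts = tweets_dstream.map(extract_tweet_text)
```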
Coming back here: what we are doing is extracting the actual text from each tweet — we take one stream, call map on it, and transform it into another stream where, instead of JSON strings, I now have the actual text of each tweet. Next I want to find the hashtags. I take the text of each tweet and split it into words, splitting on the space character, so that I emit all the words in that string. That is another transformation: the first transformation was the map call, the second is a flatMap call. flatMap is very much like map, but with the ability to output multiple records for each input record, whereas map outputs exactly one record per input — map is a one-to-one transformation, flatMap is one-to-many. In this case, because I am splitting the text into words, each text produces multiple outputs, one per word: emit all the words in the tweet. Now, because I am only interested in hashtags, I use a third transformation: I filter the words flowing out of the flatMap step and keep only the hashtags — any word that starts with #. Now I have a stream of hashtags, and what I do is lowercase each one and emit a tuple — in Python, anything you put inside parentheses like this is a tuple — of the lowercased hashtag and the number one, which represents that I have seen this hashtag once. Then, in a distributed environment, I can reduce all these ones — add them up — and count how many times each hashtag occurs across the data set. So I emit the tuple (word, 1), where 1 means I have seen it once, and then I do reduceByKey, which is another transformation from one DStream to another. At this point, what does my stream contain? Lowercased hashtags paired with ones: (hashtag, 1), (hashtag, 1), and so on. reduceByKey takes two elements with the same key and combines them by adding up their values, so for a given hashtag — say #India — all the ones get added, and I end up with a single tuple saying #India occurred, for example, six times. That is what reduceByKey does: it counts how many times each hashtag occurred.

Let's take a step back and see what we have done so far. We took the stream of JSON strings that Twitter was giving us; in the first step we converted it into a stream of just the tweet texts; then we converted that into a stream of the words in those tweets; then we filtered it into another stream containing only the hashtag words; then for each hashtag we emitted a tuple saying we have seen this hashtag once; and then we reduced by key, so wherever the key is the same, the values get summed — how many times each hashtag occurred. But our problem doesn't end there: we now know all the hashtags and how many times each occurred, but we are interested in the top 10 — we want the top 10 trending hashtags every 10 seconds — which means I have to order all of these, do an order-by, and pick the top. Up to now we have used the ability to transform the DStream; what we are going to do next is save the final transformed DStream as a table, so that we can run SQL queries on it using the Spark SQL component. So for each of these tuples containing a hashtag and its count, I emit an object called TagCount, holding the name of the hashtag and the number of times it occurred. Now my stream has been transformed from a stream of (hashtag, count) tuples into a stream of TagCount objects — Python objects with two things in them, the actual hashtag and the number of times it occurred. Once I have this structured data, I can use the output functionality we discussed: there are two things you can do with a DStream, transform it and output it, and now I am going to output it into the SQL context as a DataFrame.
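Continuing from the texts DStream, the chain of transformations just described looks roughly like this; each batch of tag_counts ends up holding (hashtag, count) pairs for one 10-second window.

```python
from operator import add

words = texts.flatMap(lambda text: text.split(" "))        # one tweet -> many words
hashtags = words.filter(lambda word: word.startswith("#"))  # keep only hashtag words
tag_ones = hashtags.map(lambda tag: (tag.lower(), 1))       # (hashtag, 1): "seen once"
tag_counts = tag_ones.reduceByKey(add)                      # sum the ones per hashtag, per batch
```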
To do that output, I say: for each of the RDDs inside this stream, run this function — I take the RDD, convert it into a DataFrame, and register it as a table with the name tag_counts. The reason I save it as a table is that I can then periodically query it and output the top tags. (Ignore these two lines for a second; we will come back to them.) So we have done all these transformations on our tweet stream and finally saved the result into the SQL context under the name tag_counts. Now what I do is run a while loop a hundred times, sleeping for 15 seconds each iteration. The reason I sleep for 15 seconds is that I have asked Spark to collect data for 10 seconds; it collects for 10 seconds, then does this transformation and output, and then I wake up after 15 seconds and query the SQL context: select tag, count from tag_counts order by count descending, limit 10. It is essentially an order-by query with a limit, to find the top 10 tags, and I just loop through the result and print it.

Just to recap what we have done: we created a Spark streaming context, pointed it at the stream, instructed Spark that 10 seconds is the interval with which to create the DStream, and then, on the DStream, we do these transformations and this output — the output in this case being into a Spark SQL context, so that we can run a SQL query to find the top tags. After that, the client part of the program connects to the SQL context, queries the tag_counts table, and prints the results. Now look at the structure of this program: here we describe what to do with the DStream, and here we describe what to do once the DStream is saved into the SQL context — but Spark Streaming won't actually do anything until you start the context, which is this call here. You set up everything you want done on the DStream, but nothing runs until you start the context. Once you do, the Spark engine starts collecting data from the stream, doing the transformations on each batch, and our program just looks at the output saved in the SQL context, prints out the hashtags, and then waits to be terminated.

That is the broad overview; now let's actually see it in action. I'll pull up two windows. In the first window I run the Twitter stream script, which creates the stream and outputs it to that port on localhost. In the other terminal I submit the program we wrote — the one that tells Spark where to look for the stream, what to do with the DStream, and finally queries the SQL context and prints the top hashtags. That is trending_tags.py, and I submit it to Spark using the spark-submit command, giving it trending_tags.py along with the host and port, because that is what the script requires.
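Putting the output side together, here is a sketch of what the rest of trending_tags.py does as described: register each batch as a tag_counts table, start the context, and poll the table with a SQL query every 15 seconds. The Row-based conversion stands in for the TagCount objects mentioned in the talk, and the try/except simply skips the query until the first window has produced a table.

```python
import time
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

def save_as_table(rdd):
    # Expose the latest (hashtag, count) batch to Spark SQL as a table.
    rows = rdd.map(lambda tc: Row(tag=tc[0], count=tc[1]))
    if not rows.isEmpty():                       # empty batches have no schema to infer
        sqlContext.createDataFrame(rows).registerTempTable("tag_counts")

tag_counts.foreachRDD(save_as_table)

ssc.start()   # nothing is collected or transformed until the context is started

for _ in range(100):
    time.sleep(15)   # let a 10-second batch finish before peeking at the table
    try:
        top10 = sqlContext.sql(
            "SELECT tag, `count` AS total FROM tag_counts "
            "ORDER BY `count` DESC LIMIT 10")
        for row in top10.collect():
            print(row.tag, row.total)
    except Exception:
        pass   # the table may not exist yet during the very first window

ssc.awaitTermination()
```

You would then launch it roughly the way the webinar does: start the tweet script feeding port 9999 in one terminal, and in another run something like `spark-submit trending_tags.py localhost 9999`.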
If you look at the script, we use the first argument as the host and the second as the port when connecting to the stream, which is why we pass localhost and the port here. So what the left-hand window is doing is creating that stream from Twitter; what the right-hand window is doing is taking that stream, running all the transformations on the DStream, and printing the top hashtags. Once every 15 seconds we should see an output. After 10 seconds the Spark job is running — that came up on the right-hand screen — and then there is an error, which is fine. Let me just restart it once. There, the Spark job is running, and now our client program has queried the table and got the results. In the first 10-second window these were the trending hashtags — #StartupIndia and a few other startup-related tags, so it looks like startup topics are trending — and here there are a couple of tweets about India, one about an emergency, a heart failure, Swachh Bharat of course. So every 15 seconds we are peeking into the data we created using Spark Streaming and printing it here for us to see.

Let's take a step back: what is the overall streaming analytics we are doing? We are streaming tweets from Twitter that contain the word India; we point Spark Streaming at that stream; using the transformation and output functionality of DStreams we save the results into a SQL context; and then we peek into the SQL context once every 15 seconds and print the top trending hashtags.

Cool — that is essentially everything I had for you. Let me let it run once more while we move to the questions. I'm going through the questions now. One question is: where does Spark store data? Spark has the concept of RDDs, and an RDD is what stores data. Spark is deployed on a cluster of machines, and it uses the memory as well as the disk of those machines to store the RDDs. To you as the end user, an RDD is just a data structure on which you can call map, reduce and so on, but internally Spark knows which nodes in the cluster are hosting that RDD; when you call rdd.map with some function, it sends that code to those nodes, runs it there, and that creates a new RDD. That is how Spark stores data.

The next question is: when should we choose Spark versus a plain Python script? I guess you are referring to doing data analytics in plain Python versus doing it with Spark. I would say that whenever your data is small enough that you can bring it onto your laptop, you should use plain Python with scikit-learn, NumPy and pandas to do your analysis. But when the data itself is big — in this demo I have only filtered on the word India, but if I broadened the filter I might get a lot of tweets, maybe a lakh of tweets per second — then it is a different story.
In that case my machine alone cannot process it, and I need multiple machines to do the analysis, which is what Spark gives me. In this demo I have actually run it on my laptop, but in reality, when you submit this job with spark-submit, you would submit it to a cluster where Spark is hosted, and all of those servers would process that lakh of tweets per second to figure out the trending hashtags.

There is also a great question asking whether Spark is good for real-time analytics or whether we should prefer Storm and Kafka and those tools. The beauty of Spark is that it is a complete package, not just real-time analytics: you can do MapReduce-style jobs, the traditional batch processing; it provides a machine learning library for machine learning on distributed data sets, which is a replacement for Mahout; it gives you Spark SQL, which is what we used today, where you can query your big data using SQL syntax; and it also gives you Spark Streaming — all with one abstraction, the RDD. The reason Spark is getting so much adoption, in my view, is the unified interface and the simplicity. Just look at this code — it is literally so simple, it is just a select, that's it — and I did not have to learn too many new concepts to do streaming. The only new concept introduced was the DStream, and everything else felt natural because the programming model is built on top of Spark core. So it is not just about Spark for real-time analytics: Spark is a very good big data tool in general, because it gives you a lot, from the ability to do MapReduce, to Spark SQL, to machine learning on big data sets with MLlib, to real-time analytics like this. Storm plus Kafka is another popular combination for real-time analytics, but I think Spark is getting more and more popular because the people who do real-time analytics are the same people who write the batch jobs and the same people who build the machine learning models, and Spark is a unified platform for all of them.

All right folks, thanks a lot for attending this webinar. I had a great time, and I hope you also had some takeaways and understood how we can use Spark and Python to do real-time data analytics. Thank you so much.

8 Comments

  1. Swarnava Sinha Roy said:

    Every time I run it I'm getting a "tag_counts" table not found error.

    June 26, 2019
    Reply
  2. Spike Grazer said:

    Your presentation was good, but I kindly request your partners who publish this online to take a bit more care with the audio quality of the published recording.

    June 26, 2019
    Reply
  3. Victor Charles Vincent said:

    Hi, I appreciate your enthusiasm, but your audio is irritating. Please always check the quality of your audio before publishing; one would lose all interest because of it.
    One of the worst audio recordings: the mic cutting in and out and screeching, background noise, stalling and all such crazy stuff. Your intent is good, but the output and its quality suck.

    June 26, 2019
    Reply
  4. Varun Mittal said:

    Very helpful. Where can I get these 2 code files?

    June 26, 2019
    Reply
  5. Ash M said:

    Can anybody help me build or any tool to analyse unstructured (doc,pdf) & naukri real time and unsupervised machine learning

    June 26, 2019
    Reply
  6. Shivam Makan said:

    Very helpful sir. Cleared my concepts.

    June 26, 2019
    Reply
  7. Mahmudur R Saniat said:

    Could you please also share how you connected to (got data from) Twitter? e.g. the authentication / access token stuff?

    June 26, 2019
    Reply
  8. Kishor Dhulugade said:

    Please tell me which is better to learn, Python or Hadoop.

    June 26, 2019
    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *