Webinar | Data Analytics using PySpark Hands-on (Python & Spark) | Tutorial | Great Learning



Hi everyone, welcome to this webinar on data analytics using Python and Spark. I hope you are all able to hear me. Today this is going to be a hands-on webinar: we will start with some theory, look at why Spark is relevant in today's world and why Python and Spark make a killer combination for analyzing data, especially very big data, and then we will dive in and look at some code. We have a real data set which we will query, run some Python scripts against, derive insights from, and then make sense of those results. So without further delay, let's look at the agenda and see what we are going to cover. If you have any questions, please post them in the questions tab as they come to you, and in the last ten minutes or so I will try to answer all of them.

So here is the agenda for today. First, why Spark: why is Spark relevant, why are we doing this webinar, why are more and more companies adopting Spark to analyze big data. After that we will see why Python and Spark make a great combination for data scientists working with big data. That is the theory part of the webinar. Then we will take an open-source data set that I have picked, first understand what the data set is about, and once we have a grasp of the data we will list the insights we want to derive from it. After that comes the hands-on part, where I will walk you through the code for each of those insights, run it on Spark, and we will look at the output. That is the agenda for today, and I hope by the end of it you will be convinced that Python and Spark form a great combination for big data analytics.

Moving on, the first item I want to cover is: why Spark for big data analytics? There are many reasons why Spark is becoming more and more popular, and some of them are listed here. The first is that Spark is faster in execution than traditional MapReduce on Hadoop. Spark was built well after Hadoop came up, and it learned from the mistakes and the problems people ran into. By the way it is designed, Spark lets you specify all your operations up front, and only when you say "execute" does it build a step-by-step plan. Because it knows everything you want to do to the data, it can optimize the execution plan, so your MapReduce-style job runs faster on your big data. The key takeaway: it is faster than traditional MapReduce on Hadoop.

The next reason is that Spark can run anywhere. What do I mean by that? Spark can run on your laptop, which is what I am doing for this webinar: I am using the laptop from which I am presenting, Spark is running on it, and we will derive insights from the data I am going to point it at. Spark can also run on its own on the cloud as a cluster: you can spin up machines on Amazon EC2, install Spark on them, and create a cluster computing engine which can then crunch data present in different systems and do distributed computing at scale. So Spark can run standalone on your laptop, and it can form a cluster out of a couple of machines on the cloud.
And the beauty of Spark is that it also integrates with Hadoop. If you already have a large Hadoop cluster deployed in your organization and you are using the traditional MapReduce paradigm, Spark can seamlessly come and sit on top of that cluster and use the same resources to run your queries. Spark can use YARN as its execution engine on Hadoop, so it can talk to the machines you have dedicated to your Hadoop cluster and use them to run its jobs. That is a great win: you do not need to set up a new set of machines, and it does not matter whether you already have a Hadoop cluster or are just starting your big data analytics journey; Spark fits either way.

The next big win Spark brings is its ability to analyze data present anywhere. In the reality of modern computing, data resides in many different types of sources. For this webinar, the data I am analyzing sits on my laptop's hard drive, but in real life you will find data sitting in SQL databases, whether PostgreSQL, MySQL, Oracle, or other relational databases. Data can also exist in a Hadoop cluster on HDFS, its distributed file system. It can exist in a columnar store like HBase or Cassandra. You could already have a Hadoop cluster where you have created a Hive schema and have been running Hive queries. The beauty of Spark is that it can connect to all of these sources and many more: it can run anywhere, and it can access data present in any of these locations. You should not be spending a lot of time connecting to data; you should be analyzing it, digging a little deeper, and that is exactly what Spark makes easy with these two properties: it can run anywhere, and it can analyze data present anywhere.
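To make the "runs anywhere, reads from anywhere" point concrete, here is a minimal sketch (not taken from the webinar's repo) showing that the same PySpark calls are used whichever master and storage you point them at; the master URLs, hostnames, and paths below are placeholders, not the presenter's setup.

    from pyspark import SparkConf, SparkContext

    # The master can be the local machine, a standalone cluster, or YARN on an
    # existing Hadoop cluster; only this one setting changes.
    conf = SparkConf().setMaster("local[*]").setAppName("RunsAnywhereDemo")
    # conf = SparkConf().setMaster("spark://my-cluster-host:7077").setAppName("RunsAnywhereDemo")
    # conf = SparkConf().setMaster("yarn").setAppName("RunsAnywhereDemo")
    sc = SparkContext(conf=conf)

    # The same textFile call reads from local disk or from HDFS;
    # only the URI scheme changes (paths are placeholders).
    local_rdd = sc.textFile("file:///tmp/u.data")
    # hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/u.data")

    print(local_rdd.count())
    sc.stop()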
Moving on, the next thing Spark provides is that it is very simple to work with, because it removes all the complexity of the big Hadoop stack. If you were running a traditional Hadoop stack, you might be writing MapReduce jobs in Java, installing Hive separately so you can do data warehousing, running some form of streaming solution, and installing Mahout to do machine learning on Hadoop. Spark does away with all these complexities and gives you one single interface, called the RDD, with which to play with your data. The key thing Spark did is to create one interface for you to work with, the RDD: if you want to do machine learning, you still work with an RDD; if you want to run a Hive-style query on your data using Spark, you still work with an RDD; if you want to run a traditional MapReduce-style job, you still go through the RDD. These are all examples we will see further down in this webinar. This is another big win, because learning new technologies takes time; what you want is one solid toolkit which gives you a unified interface for all these different types of big data analytics.

As I mentioned in that last point, Spark is much more than MapReduce. At its core, Spark lets you do anything you could do with a MapReduce job on Hadoop, but it also provides other components; Spark is essentially several components packaged together. The first is Spark Core, which gives you the ability to do MapReduce-style operations. Then there is Spark SQL, which is essentially a replacement for Hive; Hive is the element of the Hadoop stack that lets you warehouse data and run SQL queries on it, and Spark SQL gives you the equivalent. Spark also gives you the ability to do machine learning via MLlib, which stands for machine learning library and is built into Spark; using it you can do predictive analysis, regression analysis, clustering, pretty much all the machine learning techniques you would want to apply to your big data, all packaged into the MLlib part of Spark. Then Spark has a streaming component. Let us say Flipkart opens a Big Billion Days sale, and as the sale goes along it reads all the tweets carrying the sale's hashtag and tries to figure out the sentiment of how people are feeling about it, or to find the products people want or the things they are complaining about. Such a use case demands that you stream all the tweets being made in real time and process them to derive the insight, in this case customer sentiment. Spark Streaming lets you take a stream of data as it is being generated and run whatever analytics you want on it. The other beautiful part of Spark is the component called GraphX, which gives you the ability to work with graphs, like social networks. Suppose you had access to a graph of friends of friends, who my friends are, who their friends are, and so on, and you wanted to do some analysis, say recommend people I might want to connect with or could befriend. GraphX lets you run graph computations and graph algorithms on big data, because if you take all the people on the planet and map out who is friends with whom, that certainly qualifies as big data, and GraphX can be used to analyze it. So in a sense Spark is much more than MapReduce: it provides a core with which you can do MapReduce-like operations, but it also gives you a SQL interface, MLlib the machine learning library, a stream processing engine, and a graph component. Those are the reasons I think Spark is so relevant for big data analytics.

Moving on: why is this talk about Spark and Python, and what is special about Python? The first reason I could think of while writing this slide is that, in my opinion, Python is plain and simple the best programming language around. In my career so far I have been exposed to a few programming languages, from Java to JavaScript to Ruby to Python, and if doomsday came and I had to pick one language to survive with, I would probably be best off with Python. I can write code simply, there is less boilerplate and more output, and it makes you very productive. That is why Python is relevant. Also, for those of you who are already data engineers, Python is no stranger; in fact a recent survey found that Python is the most popular language among data engineers and data scientists. So in case any of you are not yet familiar with Python, I would strongly advocate that you get your hands dirty and learn it.
Learn Python, because it is going to be the language for data analytics and big data analytics in the future. Another reason why this talk is about Python and Spark is that Spark has a programming interface in Python, called PySpark: you can write your code in Python, submit that code to Spark, and Spark will run it for you. Spark itself was written in a language called Scala, which I personally dislike; it is a pretty complicated language to me. If you want to squeeze the most out of Spark, you could learn Scala and write your Spark programs in Scala, but I would like to impress upon you in the next half hour or so how Python code is much simpler, easier to write, and easier to understand. Spark also supports other languages like Java and R; if you are familiar with R, SparkR can be used to drive Spark from R. So that is why Python and Spark make such a great combination. Let's move on.

Okay, so now we are done with most of the theory I wanted to go through in this webinar. What I want to do next is understand the data set we are going to be using for the next half hour. The data set I picked comes from an organization called GroupLens, who have collected ratings on real movies given by real people. The data has two parts. One is the movie catalog, where they have listed the movies: the name of each movie, the year in which it was released, the genre, whether it is an action movie or a comedy, and so on. So there is a catalog of all the movies. Then there is another file, a delimited CSV-style file, which has the ratings given to those movies by users. Let me actually take a break from the presentation and look at the data sets.

This is the movie catalog file. The first column is the movie ID they have generated for each movie, the second is the name of the movie, the third is the year of release, and so on. What is important is that we have all the movies here, movies you probably would have seen. The other file, the one that matters most for this webinar, is the ratings file. It has four columns: the first column is the user ID of the user who gave the rating; the second column, which I have highlighted, is the movie ID of the movie being rated; the third column is the actual rating given by this user; and the fourth column is the UNIX timestamp at which the rating was given. So if I read the first line out: user 196 rated movie 242 three stars at this timestamp. If you want to know which movie 242 is, we can go to the catalog and look it up: it is a movie called Kolya, released in 1996. So user 196 rated the movie Kolya three stars at this time. That is how the data is structured.
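To make the layout of the two files concrete, here is a small hypothetical parsing snippet in plain Python (not one of the webinar's scripts); the tab and pipe delimiters and the sample field values are assumptions about this MovieLens-style download, shown purely for illustration.

    # Hypothetical sample lines, illustrating the formats described above:
    # ratings file: userID <tab> movieID <tab> rating <tab> UNIX timestamp
    # catalog file: movieID | title | further fields...
    rating_line = "196\t242\t3\t881250949"
    catalog_line = "242|Kolya (1996)|..."

    user_id, movie_id, rating, timestamp = rating_line.split("\t")
    movie_fields = catalog_line.split("|")

    print(f"user {user_id} rated movie {movie_id} ({movie_fields[1]}) "
          f"{rating} stars at {timestamp}")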
Now, since we are talking about big data analytics using Spark, one note: the data set I have picked has only one lakh movie ratings, that is, 100,000 ratings in this file. GroupLens have published a bigger data set as well, which has, I think, 20 million movie ratings given by users. For the purposes of this webinar session I have taken the subset of 100,000 ratings, but all the code, everything we are going to look at, would run equally well on the 10 or 20 million rating data set if you run it on a cluster of machines. Because I am running the code on my laptop, I picked the 100,000-rating data set. So I hope we now have an understanding of our data: we have two files, one is the movie catalog, which has the movie ID and the movie name, and the other is the ratings file, which records which user rated which movie, what rating they gave, and at what time, and we have 100,000 such ratings. Let's go back to the slides.

Now that we have an understanding of the data set we are going to analyze, let us list the kinds of insights we want to derive from it, and then we will write some Python code to extract those insights. So what are the insights we are going to derive? The first is the ratings histogram. The ratings users can give in this data set range from one star to five stars. What we are going to do first, using the core module of Spark, is count how many one-star ratings have been given, how many two-star ratings, and so on. Once we do that, we will move on and try to find how the ratings of the movie The Godfather are distributed: how many one-star ratings it got, how many two-star ratings, and so forth. For that I will show you how to do MapReduce using Spark. Then we will find the most rated movies, that is, which five movies have the largest number of ratings; for this we will use the Spark SQL module, actually write a SQL query, ask Spark to run it on our data, and return the top five most-rated movies. And the last example I have for today is to build a real movie recommendation system using the Spark machine learning library: given a user ID, we will find movies we should recommend to that user, given the data we have.

Before moving on, I also wanted to point out some resources I used which will be handy for you after this webinar. The first is the documentation for PySpark, since we are going to use PySpark, that is, drive Spark from Python; this is the documentation link. The next is how to install Python and Spark, whether on your machine, on the cloud, on EC2, or on your Hadoop cluster; I have pointed you to documentation that will help you do that. Next is the data set itself that you just looked at; that is the link to it. And last is the source of all the code: everything I am going to show you is available in my GitHub repo, and the link is given right here. Okay, so let's dive in.

The first insight we want to derive is how the ratings are distributed: how many one-star ratings are there in the data set, how many two-star ratings, and so forth; that is the ratings histogram. This is the code for it, and let me try to walk you through it. In Spark, the main abstraction, the main interface I told you about, is the RDD. To create a Spark RDD you first need a SparkConf, a Spark configuration, where you essentially tell Spark where the cluster is located: whether it is on your local machine, at some URL, or on a set of machines deployed on EC2.
In this script I am essentially telling Spark to use my local machine, and I have named the application RatingsHistogram. Once we have this configuration we can create a SparkContext, which is how we talk to the Spark cluster to get a job done; we create it by passing it the configuration object we just created. Once we have the SparkContext, we can create the RDD, which is the core of Spark and the primary interface with which you do everything. As I said earlier, Spark can connect to data residing anywhere, and that abstraction all comes from the RDD: you just take the context you created and ask it for an RDD. I am using the textFile method because I am creating an RDD from data sitting on my local machine, but the file:// prefix could just as well be replaced with hdfs://, in which case Spark would connect to the HDFS location and create an RDD from that data. You can also connect to relational databases and various other sources, but in the end what you get is an RDD. So what I have done is taken this file, u.data, and pointed Spark at it to create an RDD. An RDD is a distributed data set; essentially it behaves like an iterator over your data, and what you can do is pass functions to it and ask it to manipulate the data and extract results.

Once we have this lines RDD, where each line in the file becomes one element of the RDD, what we need to do is count how many five-star ratings there are, how many one-star ratings, and so on. All I am interested in is the third column. To do that, I take the lines RDD we just created and run a function on it, which is what the map function does: I pass map the function I want to run on each of the rows, and in this case all it needs to do is split the row and extract the third column, the rating, which is accessed by index 2. So I have taken one RDD, the lines RDD, called the map function on it, passed the actual function it needs to run on each row, and created another RDD. This is the most common operation you will do: you transform one RDD into another RDD by using the map function and specifying the logic. The first RDD has the complete lines, whereas once we run the map function and ask it to split out the third column, we are left with an RDD containing just that third column, the ratings.

Once you have that, RDDs also give you actions. You can take an RDD and transform it into another RDD using the map function, but RDDs also let you run actions on them, and one such action is countByValue. What it does is count how many unique values there are in the RDD and how many times each occurs, and it returns that result to you as a dictionary. So what we have done is: taken the lines of this file to create an RDD, transformed it into another RDD which has just the ratings in it, and then used an action on that RDD, countByValue, which counts how many unique values exist in the RDD and how many times each occurs, which is exactly what we wanted: count how many one-star ratings there are, how many two-star ratings, and so on. Once you invoke this action, Spark will actually go and do the work. In fact, I think I will take a little detour here and explain this concept.
Given an RDD that you have created, you can either transform it or run an action on it. The way Spark optimizes its execution is that until you issue an action, like countByValue, Spark does nothing; it just keeps track of all the transformations you have asked for so far. Then, once you have done all your transformations and you invoke an action like countByValue, Spark knows all of the operations needed for extracting that insight. It can group them together, optimize them, and send them to the cluster in one go. That is one way Spark optimizes its execution. So when you run the line that creates the lines RDD, it actually does nothing; it just remembers, okay, you want to read this file and you want to convert it into another RDD. But when you invoke an action like countByValue, that is when Spark goes out to the cluster, issues the commands to the executors, and asks them to actually do the computation. It can optimize because it already knows all the transformations you want to perform and the exact action you want to run on this chain of RDDs.

Once we do this, we get back a plain Python dictionary as the result, and all I am doing here is sorting it by key and printing each entry. Just to mention, I am using the Canopy IDE for Python, which is what I would recommend; it is comfortable to use an IDE like Canopy. Once you have Canopy set up according to its documentation, you can open a terminal. The script we have written here is ratings-histogram.py, and to submit it to Spark all we need to do is run spark-submit ratings-histogram.py; the file itself knows where to connect, where the Spark cluster is, and which RDDs to create and which transformations and actions to run. Once you submit this, the job is sent to the cluster and the results are printed back on the client, which in this case is my laptop.

Okay, here are the results: the one-star rating has been given nearly six thousand times, the two-star rating nearly eleven thousand times, and so on. What we can see from this data is that one-star ratings are given very infrequently. The insight we can draw is that people reserve one star for something they really hate; most of the time, if they really dislike a movie, they simply do not bother to come and rate it. The two-star rating is given a bit more often than one star, and four stars is the most common rating, by the same psychology in reverse; the five-star rating is reserved for something really special. So what we did here is: we understood this code, and then we submitted the job to Spark with spark-submit, which is a command you get once you install Spark on your client device. With spark-submit you can take a Spark script you wrote, submit it to the cluster, and play around with RDDs.
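For reference, here is a minimal sketch of a ratings-histogram script along the lines just described; it is reconstructed from the walkthrough rather than copied from the presenter's GitHub repo, and the script name and file path are assumptions. You would run it with spark-submit, as in the demo.

    from pyspark import SparkConf, SparkContext
    import collections

    # Run Spark on the local machine and name the application.
    conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
    sc = SparkContext(conf=conf)

    # Each element of this RDD is one raw line of the ratings file
    # (userID, movieID, rating, timestamp). The path is a placeholder.
    lines = sc.textFile("file:///path/to/ml-100k/u.data")

    # Transformation: keep only the rating column (index 2).
    ratings = lines.map(lambda line: line.split()[2])

    # Action: count how many times each distinct rating occurs.
    # Nothing actually runs on the data until this call.
    result = ratings.countByValue()

    # Sort by rating value and print the histogram.
    sorted_results = collections.OrderedDict(sorted(result.items()))
    for rating, count in sorted_results.items():
        print(f"{rating} star: {count}")

    sc.stop()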
Now let's move on to the next example I wanted to cover, the movie ratings distribution: given a single movie, how are its ratings distributed? For that the file is called movie-ratings-distribution.py. I am going to skip the first few lines because we already ran through that boilerplate in the last example, and move to line number seven directly. From our data set we know that the movie ID for The Godfather is 127. Essentially what I am trying to find is how the ratings of The Godfather are distributed. The movie ID is 127, and I know that in my data the movie ID is the second column and the rating is the third column. So again I am reading the same file and creating a lines RDD from it, but this time I only want to look at the ratings that are for the movie The Godfather. What I need to do here is take this RDD and keep only those rows where the movie ID column has the value 127, and that is what this line of code does: it takes one RDD and uses the filter function to create a subset RDD. It is again an RDD-to-RDD transformation, and the result is an RDD with just the ratings for this one movie.

Now what I wanted to show you is how to do MapReduce. What we want to do is count how many one-star ratings the movie got, how many two-star ratings, and so on. In my map step I take the filtered RDD and call the map function on it: for each of the ratings The Godfather got, I emit the rating as the key and 1 as the value. Say The Godfather got a thousand-odd ratings, which I hope it has; then this function will run on those thousand rows, and for each of them it emits a tuple where the first element is the rating, say one star, and the second element is the count 1, which simply means "I have seen a one-star rating once". The next time it sees a one-star rating, it again emits (one star, 1). So for each one-star rating I emit a one, for each two-star rating I emit a one, and so on. All I need to do in my reduce step is add those up, because by the time reduce is called, all the values emitted for a given rating come together; that is how MapReduce works. When reduce is called, all my one-star counts are grouped together, so to find how many one-star ratings there were I just add them up. So we take the RDD we created, call the map function on it, then reduceByKey, where we add up the ones emitted for each key. The collect function at the end is again an action, which goes and gathers all the values of the result and brings them back. So here we took an RDD, transformed it using the map function, transformed it into another RDD with reduceByKey, and then collected all the values. This is a MapReduce example in Spark, and I hope you can all see how succinct and clean it is. Then we just print out the ratings.

So let's go ahead and run this. Again we submit with spark-submit movie-ratings-distribution.py, which is the file we have. Once you submit it, it should go ahead and tell us how the ratings of the movie The Godfather are distributed: how many one-star, how many two-star, and so on. Let's wait for it to complete. All right, there is the result. As expected, The Godfather got mostly four and five star ratings, because it is an amazing movie, and there are a few strange people who gave it a one-star rating; that is an insight in itself, that there exist strange people who rated The Godfather one star and are happy about it.
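Likewise, a sketch of the filter-then-MapReduce pipeline just walked through; movie ID 127 is the one quoted in the demo, while the path, the whitespace splitting, and the overall script structure are assumptions.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("MovieRatingsDistribution")
    sc = SparkContext(conf=conf)

    GODFATHER_ID = "127"  # movie ID quoted in the demo

    lines = sc.textFile("file:///path/to/ml-100k/u.data")

    # Keep only the rows for this one movie (movie ID is the second column).
    godfather = lines.filter(lambda line: line.split()[1] == GODFATHER_ID)

    # Map: emit (rating, 1) for every rating the movie received.
    pairs = godfather.map(lambda line: (line.split()[2], 1))

    # Reduce: add up the 1s per rating, then collect the small result locally.
    distribution = pairs.reduceByKey(lambda a, b: a + b).collect()

    for rating, count in sorted(distribution):
        print(f"{rating} star: {count}")

    sc.stop()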
All right, so what did we do? We took the ratings file, we created an RDD from it, and then we filtered it to only have ratings for The Godfather. Then we ran a MapReduce job on that RDD to extract how the ratings of The Godfather are distributed. Great, let's move on.

The next thing I wanted to cover is how to find the top movies which have been rated the most: among all the movies covered by the 100,000 ratings we have, which five movies have received the most ratings. Chances are those will all be good movies, because we have already seen that people mostly come and rate when they like a movie. The technique I am going to use here is SQL: we are going to actually write a SQL query and run it with Spark SQL. Now we are entering territory where we are dealing with structured data. If you want to run SQL queries, there has to be a schema associated with your data, and if you look at this file, it is not carrying a schema: it is just rows of numbers, and nothing tells you what the first column stands for, what the second column stands for, what the third column stands for, and so on. So when you want to use Spark SQL, you essentially have to assign a schema to your data, and that is what this mapper function is doing. It processes one line at a time, splits it by the delimiter so that it gets access to each of the fields, and then returns a Row, which is the Spark SQL abstraction representing a row of structured data. Essentially, for each line this function processes, I return a Row saying the first field is the user ID, the second field is the movie ID, the third is the rating, and the fourth is the timestamp at which the rating was given.

Once I have this, I can take my lines RDD, the raw RDD, and transform it into a ratings RDD that consists of Row objects, because this mapper function is what I am passing to map: each line goes through this function, and the result is another RDD that, instead of having raw lines inside it, has the data in the form of Rows. Once I have that, I just have to give the table a name, and I am calling this table movieRatings, because these are ratings of movies. Now we can write SQL queries like so. To find the top rated movies, I am selecting the movie ID and counting the ratings, that is, the number of ratings it has got, after grouping by the movie ID; then I order the result in descending order so that the movies rated the most come up on top, and I pick just the top of that list. That is the SQL query, and to run it you just call the sql method of the Spark SQL session we have created. This is a new abstraction, but it still takes only a line or two of code to create this spark object. So you can take your raw RDD, apply a schema to it, give a name to that schema, and run a query against that name; that is how simple it is. Then we just print out the results. Here we are doing collect, which is the action: when Spark ran the SQL it handed us back another distributed result, because for all it knows you might be selecting millions of rows, so you run the collect action on it, and then you can just print the results.
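A sketch of the Spark SQL version of this, written against the SparkSession and temp-view API of newer Spark releases (the original demo may have used the older SQLContext/registerTempTable style); the table name movieRatings follows the walkthrough, while the path and the LIMIT value are assumptions.

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("MostRatedMovies").getOrCreate()

    lines = spark.sparkContext.textFile("file:///path/to/ml-100k/u.data")

    # Give the schema-less lines a schema by turning each one into a Row.
    def parse_rating(line):
        fields = line.split()
        return Row(userID=int(fields[0]), movieID=int(fields[1]),
                   rating=float(fields[2]), timestamp=int(fields[3]))

    ratings = lines.map(parse_rating)

    # Turn the RDD of Rows into a DataFrame and give it a name for SQL.
    ratings_df = spark.createDataFrame(ratings)
    ratings_df.createOrReplaceTempView("movieRatings")

    # collect() is safe here because the query is limited to a handful of rows.
    top_movies = spark.sql("""
        SELECT movieID, COUNT(rating) AS num_ratings
        FROM movieRatings
        GROUP BY movieID
        ORDER BY num_ratings DESC
        LIMIT 5
    """).collect()

    for row in top_movies:
        print(row.movieID, row.num_ratings)

    spark.stop()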
So let's see which are the most rated movies. Okay, we submit the file, most-rated-movies.py. This is only looking at the 100,000 ratings we have, but I am hoping something familiar will show up. There is the job getting delegated, and you can see that it was only after it printed this line, when it executed the collect, that the job actually ran; you can see the progress bar here. Until you perform an action on an RDD, nothing really happens, and that is how Spark does its lazy magic. Okay, so movie ID 50 is the most rated movie; let's find out which movie that is by looking it up in the catalog: it is Star Wars. Let's also look at movie number 258, which has the second largest number of ratings: 258 is the movie called Contact. Great. I want to check one more, movie ID 100. Yes. So essentially what we have seen here is: we took a raw RDD and mapped it into an RDD containing Row objects, which is the abstraction for structured data; we gave that structured data a name so we can refer to it in SQL queries, the name acting as the table name in the SQL; and then we executed the query, which gave us back a result, and we ran the collect action on it. Only then was the entire thing delegated: that is when the lines RDD was actually created on the cluster, that is when the mapping ran, and that is when the query ran. Since we are a bit short on time, let's move to the last example.

The last example I wanted to run through is essentially creating a movie recommendation system. How can we create a movie recommendation system with this data? Let us say the user who rated movie 50 five stars is me. What I have done is enter three rows manually: I have picked some movies that I like and given them five-star ratings, fabricated this data, and put these three rows on top of the actual data set. I have picked movies from my own taste: movie 50 is a movie I like, Star Wars, and movie 172 is another movie that I like, and I have also taken a movie I would not like and given it one star. I put this data in so I can see what movies Spark recommends to this user, who has rated these movies this way. I want to build an engine which can say which movies we should recommend to this user based on the data we have.

One way to do that is to look at which movies this user likes, find out which other people liked the same movies, then see which other movies those people liked, and then recommend those movies to me. That is the collaborative filtering technique: we look at how people have rated movies, starting with the movies I rated highly, find the people who also rated those movies highly, see which other movies they rated highly, take a union of all those movies, and recommend the ones the engine thinks I should see. That is collaborative filtering, and that is the technique we are going to be using.
All right, again I am going to skip all this boilerplate. The main thing to note here is that we are now starting to use the machine learning component of Spark, MLlib, and within it there is a built-in recommendation module which we are going to use to create this recommendation engine. How are we going to do it? To understand whether the recommended movies are any good, I want to see the names of the movies instead of their IDs and then judge whether each one is a good recommendation. So I have added a part where we look up the names of the movies instead of the IDs. All the data we need for the analysis is present in the ratings file, but for the final recommended movies I want to print their names. What I have done is take the u.item data, which is the movie catalog file, and create a dictionary where the key is the movie ID and the value is the name of the movie. All I am doing here is reading this file line by line, splitting it on the delimiter, picking the first column, which is the movie ID, as the key and putting the movie name in as the value; that is what this function does, and it returns a dictionary of movie ID to name.

After that, we need to look at the actual ratings data, which is the same file as earlier, and what I need to do is create an RDD that the recommendation package of Spark can understand, so it can recommend movies to me. The way this works is that MLlib has a Rating object which you can create, and it has three fields: the first signifies the user who gave the rating, the second is the movie being rated, and the third is the rating itself. It essentially says: this user rated this movie with this rating. That is exactly what I am doing: I create my raw RDD here from the file, then I transform it using the map function to split each line on the delimiter, and then I run another transformation, another map, where I create an RDD of Rating objects. In the last example, with SQL, we created an RDD of Rows; here we are creating an RDD of Rating objects. Once you create that, in the format the library expects, you can use machine learning models on it.

I have also called the cache method here; this is a special method. When you create an RDD, you can call cache on it if you intend to use the RDD multiple times. Say you load a file containing 20 million ratings; you do not want to keep loading it again and again and rebuilding the RDD, you want to load it once and reuse it. That is the beauty of functional programming: you create an RDD, it is immutable, and you can only transform it into another RDD or extract a result from it. So I have called the cache method, which tells Spark to keep this RDD around, because I will be using it multiple times.
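A sketch of the two preparation steps just described: building a movie-ID-to-name dictionary from u.item and turning u.data into a cached RDD of MLlib Rating objects. The pipe delimiter, Latin-1 encoding, and file paths are assumptions about this MovieLens-style download, not details confirmed in the webinar.

    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.recommendation import Rating

    def load_movie_names(path="ml-100k/u.item"):
        """Return a dict mapping movie ID -> movie name ('|' delimiter assumed)."""
        names = {}
        with open(path, encoding="ISO-8859-1") as f:
            for line in f:
                fields = line.split("|")
                names[int(fields[0])] = fields[1]
        return names

    conf = SparkConf().setMaster("local[*]").setAppName("MovieRecommendations")
    sc = SparkContext(conf=conf)

    movie_names = load_movie_names()

    # Raw lines -> Rating(user, product, rating); whitespace splitting assumed.
    lines = sc.textFile("file:///path/to/ml-100k/u.data")
    ratings = lines.map(lambda l: l.split()).map(
        lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))
    )

    # cache() because the training step will pass over this RDD many times.
    ratings.cache()
    print(ratings.count(), "ratings loaded")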
After that, there is an algorithm called ALS, alternating least squares, which allows you to train a model based on the ratings RDD you created, and the result is a model with which we can recommend. To actually recommend movies the user might want to see, all you need to do is call recommendProducts on the model, telling it which user ID you want recommendations for, which we are taking in as an argument here, and how many movies you want it to recommend. That gives you a collection of recommendations, which we iterate over, and we print the name of the movie instead of the ID, along with how good a match the model thinks it is, because it also gives you a score. The recommendations come back with the ID of the movie, but we have the dictionary mapping IDs to names, so we use it to look up the name.

Okay, I rushed through that, so just to summarize what we did. All of the first part is simply creating a dictionary of movie ID to movie name. After that, we read our ratings data, turned it into a raw RDD, and then transformed it into another RDD containing Rating objects. Spark gives you this Rating object, and the format it expects is: which user rated which movie, and what rating they gave. By the way, the second field does not have to be a movie; it is an abstract "product", so this could refer to any kind of rating, a product rating, a movie rating, anything. So we took our raw RDD, transformed it into an RDD of Rating objects, and then used ALS and trained it with our ratings RDD. There are a few other parameters, which control things like how many iterations the algorithm runs; those are internals of the algorithm and beyond the scope of this webinar, but the big input you feed it is the ratings RDD. We create the model, with which we can recommend, and then there is the user we want to recommend for, which is a command-line input; I will show you how to pass arguments to Spark scripts. Just ignore this part of the boilerplate for now. Going down, we take the model we created and call recommendProducts on it; that is the name of the method, and in our case it is actually recommending movies, but "product" is the abstract term, so it could just as well be a title on Netflix or anything else. On the model we trained, we ask it to recommend products for this user, tell it how many products we want, and that gives us a collection of recommendations, which we iterate over and print, printing the movie name rather than the ID.
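Before running it, here is a self-contained sketch of the training and recommendation steps just summarized; ALS.train and recommendProducts are the MLlib calls described above, while the rank and iteration values, paths, and helper names are assumptions rather than the presenter's actual settings.

    import sys
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    def load_movie_names(path="ml-100k/u.item"):
        # Movie ID -> title; '|' delimiter and Latin-1 encoding are assumptions.
        with open(path, encoding="ISO-8859-1") as f:
            return {int(line.split("|")[0]): line.split("|")[1] for line in f}

    conf = SparkConf().setMaster("local[*]").setAppName("MovieRecommendationsALS")
    sc = SparkContext(conf=conf)

    # Ratings RDD of Rating(user, product, rating); path and splitting assumed.
    ratings = (sc.textFile("file:///path/to/ml-100k/u.data")
                 .map(lambda l: l.split())
                 .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))
                 .cache())

    # Train a collaborative-filtering model with ALS. rank and iterations are
    # internal knobs of the algorithm; the values here are guesses, not the
    # ones used in the webinar.
    model = ALS.train(ratings, rank=10, iterations=10)

    # Which user to recommend for comes in on the command line
    # (e.g. 0 for the fabricated user), as in the demo.
    user_id = int(sys.argv[1])
    recommendations = model.recommendProducts(user_id, 10)

    movie_names = load_movie_names()
    for rec in recommendations:
        print(movie_names.get(rec.product, rec.product), rec.rating)

    sc.stop()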
Okay, so let's go ahead and run this. It will take a little bit of time, because ALS itself is complex: it has to, essentially, look at the movies this user liked, find out which other people liked those movies, find the other movies those people liked, and then rank them. Okay, I got an error, because I did not pass the argument: the script reads an argument telling it which user to recommend for. As the first argument let's pass 0, the user we fabricated in our data set; user 0 liked these two movies and disliked this one. Okay, it is doing everything we discussed and finding the recommendations.

So first, what are the movies this user has rated? We have The Shawshank Redemption and Forrest Gump, rated 5.0, plus the one movie I gave a low rating. And as we can see, the recommendations it produced are not very relevant: none of them are movies I have actually seen, or even heard from friends that I should see. So that is the takeaway: when you use a complex algorithm like alternating least squares, do not just look at the output and assume that, because it is a complex algorithm, the insight must be correct or even meaningful; in this case the insight is rather all over the place. I wanted to impress upon you that the machine learning component of Spark has the capability of building recommendation engines, but for the data set and the rows we fed in, the recommendations do not look very convincing. Nonetheless, there are many algorithms in the toolkit; ALS is just one of them. Maybe we should use another algorithm; maybe we should do item-based filtering, that is, look at the properties of the movies themselves instead of what other people liked: if I like The Shawshank Redemption, maybe recommend other Morgan Freeman films, and so on. There are many different approaches, and that is the whole point: you have to play around, and you always have to verify the output you are getting. Never blindly believe it, especially with big data, where it is easy to get lost and easy to reach the wrong conclusions. This turned out to be a good example where the insight we got is pretty bad, and it is always good to go back, look at the recommendations or the predictions your big data system, in this case Spark, is giving you, and check whether you are using the right algorithm. That is actually the real data science part of the work you will have to do.

Great, so that is all I had for you. I hope the last hour or so was useful and that you got an idea of how to use Spark and Python to do big data analytics. Thank you.

8 Comments

  1. Tabitha Patcha said:

    Hi Viniod, can you please explain to me how to concat two columns in that text file using pyspark? I mean, I want to concat the two columns timestamp and rating.

    June 30, 2019
    Reply
  2. Venkat Raman said:

    I saw the same example in a Udemy pyspark course.

    June 30, 2019
    Reply
  3. Pawan H P said:

    Thank you for the tutorial. As a beginner I was able to understand all the concepts. As you mentioned at the end, can you please give me an example of item-based recommendation, i.e. recommending movies to the user based on the genre of the movies that he has watched?

    June 30, 2019
    Reply
  4. Lorenzo Ostano said:

    Hello, I'm trying to install pyspark and on the last leg I can't run rdd = sc.textFile("README.md");
    it says NameError: name 'sc' is not defined.

    Could anybody please help?

    Thanks

    June 30, 2019
    Reply
  5. pravin pathak said:

    Hello Sir, thank you for the nice tutorial. I got confused: for Spark ML you gave the input [user->movie->rating]; how do we know which format we have to feed to Spark ML?

    June 30, 2019
    Reply
  6. biswadeep paul said:

    Hi,
    First of all, thank you for giving a nice introduction to pyspark. Can you please help me understand the two things below?
    1. I could see in your third example that the Spark SQL is fired on a temp view which was created from a data frame, and you created the data frame from the RDD. Is it necessary to create the DF from the RDD? Can't we create the temp view directly from the RDD?
    2. We are using the RDD at the very first level to read the .txt file. Can't we create a data frame directly to read the text file?
    It would be very helpful for me if you can please explain.

    June 30, 2019
    Reply
  7. Dharaneesh Vrd said:

    Can you please explain how MapReduce works in movie-ratings-distribution.py?
    What will be returned by map(), and how will it be processed in reduceByKey()?
    If anyone knows please explain it to me! TIA !

    June 30, 2019
    Reply
  8. Prasanna Kumar said:

    Perfectly organised the content…demoing the code is a faster way to make someone understand….great job and thank you very much.

    June 30, 2019
    Reply
