Predictive Analytics Using R | Data Science With R | Data Science Certification Training | Edureka



Hi guys, this is Lekha from Edureka. I welcome you to this session on predictive analytics using R. Without any further delay, let's take a look at today's agenda.

We're going to begin this session with an introduction to predictive analytics: we'll see what exactly predictive analytics is, and then we'll move on and discuss the various stages of predictive analytics. Next, we'll look at how predictive analytics can be implemented using R. Here we will also see what the features of R are and why you should choose R. After that, we'll look at a real-world use case, and we'll finally end this session with a demo where we'll see how predictive analytics can be used to solve a real-world problem. I hope all of you are clear with the agenda, so let's get started with our first topic.

So guys, what is predictive analytics? Predictive analytics refers to using historical data, machine learning, and artificial intelligence to predict what will happen in the future. To put it simply, it answers the question: what is most likely to happen based on current data, and what can I do to change that outcome? So it basically tries to predict the future by studying the data available at present.

How the process works is that the historical data is fed into a mathematical model, or an algorithm, that considers key trends and patterns in this data. The model is then applied to the current data to predict what will happen next. Predictive analytics is something that top-tier companies implement so that their business or their revenue grows. Analysts can use predictive analytics to foresee whether a change will help them reduce risk, improve operations, or increase revenue.

Now let's move ahead and talk about the different stages of predictive analytics. Predictive analytics always involves a business problem that needs to be solved. You also have to
have data that needs to be prepped for analysis and models that need to be built and refined, and finally you have to retrieve the predictions and convert them into positive outcomes. A predictive analytics project will involve these seven steps.

The first step is to define your problem statement. Before you even start on a project, it is very important that you understand the problem you're trying to solve. In this stage you identify all the objectives of your project by identifying the variables that you need to predict.

Once you're done with that, your next stage is data collection. After you define the objectives of your project, it's time to start gathering all the data. At this stage, some of the questions you can consider are: what data do I need for my project, where does it live, how can I obtain it, and what is the most efficient way to store and access all of it? So this phase involves collecting all the data. It might be data stored in a database, or you might actually have to go out, do some research, and then collect the data.

Once you're done with data collection, the next step is data cleaning. Like I said, finding the right data will usually take you both time and effort. If the data lives in a database, then your job is very simple, because all you have to do is query the relevant data using SQL queries. But if your data doesn't actually exist in a data set, you will have to scrape it. This is also where you transform your data into the desired format so that it can be read. Once you're done with the transformation part, you move on to the most time-consuming step, which is cleaning and prepping the data. This is especially true in big data projects, which usually involve terabytes of data to work with. When I say cleaning the data, I mean that there are a lot of missing values or a
lot of redundant values in your data set, and you need to get rid of all of these inconsistencies so that you can predict your outcome correctly.

The next stage is data analysis. Data analysis is the process of inspecting, transforming, and modeling the data, and the main aim of this process is to discover useful information. What you do in this stage is evaluate all the variables and all the data available to you. You understand the construction, the population, the quality, and the relationships among the different variables, and then you retrieve useful information from this data. Basically, this is where you understand all the patterns that are hidden in your data. To do this, you can also pull up and analyze random subsets of the data by plotting a histogram or by creating an interactive visualization. You have to dive down into each data point and explore and understand the story behind the data. Once you've retrieved the useful insights and patterns from your data, you begin to form hypotheses about your data and the problem you're tackling.

The next stage in predictive analytics is building a predictive model. Building a predictive model involves mapping your data into a machine learning algorithm. Machine learning algorithms can be used to solve a range of problems, whether it is a classification problem, a regression problem, or a clustering problem. The first step in building the model is to split the input data randomly into a training data set and a testing data set. After that, you build the model using the training data set.

The next step is validating the model. Once you've built the model, you have to test the accuracy of the model, or you
have to test how efficiently the model is predicting your outcome. How do you do this? You run the model on the testing data set and evaluate whether the predictions are accurate or not. At this stage you must also try to improve the efficiency of your model by filtering down to the predictor variables that actually help you predict the resultant variable. Many a time there are hundreds of predictor variables which aren't actually helping to predict the outcome variable, so you filter down to the set of predictor variables which strongly affect the outcome. You're just going to keep the most important variables in your model.

The next stage is the deployment stage. The goal of this stage is to deploy the model into a production or production-like environment for final user acceptance. The users have to validate the performance of the model, and if there are any issues or redundancies, they have to be fixed in this stage.

So guys, that was all about the stages in predictive analytics. In our demo we'll be discussing each and every stage, and we'll be implementing it using the R language.

Now let's move on to our next topic, which is languages for predictive analytics. There are a lot of languages that you can use to implement predictive analytics: we have Python, we have R, we have MATLAB, and we also have Java. There is a reason why I chose R for this tutorial, so let's look at why we should go for R and discuss a few of its features.

To begin with, R is an easily understandable language, and it has a very simple and flexible syntax. The next feature is that it is both a programming and a statistical language. When it comes to processes like predictive analytics, or fields like data science, a statistical language is very important, because all of these processes involve a lot of math and a lot of statistics. Since R is a statistical language, it
is very useful in such processes.

The next feature is data manipulation. With R, you can easily shape a data set into a format that can be easily accessed and analyzed by slicing large multivariate data sets. So data manipulation is something that is easily done in R.

The next feature is that R has inbuilt functions for data analysis and predictive analytics. R provides a lot of statistical functions because, as I mentioned earlier, it is a statistical language. It has a lot of inbuilt functions which are already defined; all you have to do is call them. R has over a thousand packages that can be used to implement statistical analysis tools related to hypothesis testing, model fitting, clustering techniques, and machine learning. For each of the machine learning algorithms there is a library, so you don't have to sit down and code the entire algorithm; all you have to do is load the library.

The next important feature is data visualization. In order to understand a data set and the relationships between its variables, it is very important to visualize the data, and R provides a lot of packages which help with data visualization.

Moving on, R has massive community support, and it is one of the most sought-after technologies because of that. There are over 2.5 million users using R, and companies such as Google, Facebook, and Mozilla also make use of R.

Now let's try to understand predictive analytics with the help of a real-world use case. The first question here is: what is sports analytics? Sports analytics involves using data related to sports, such as player statistics, weather conditions, information from expert scouts, and so on, to build a predictive model and make informed decisions. So what you're going to do here is you're just
going to study the data about each player, and you're going to predict how well the player is going to play, or how well the player plays under particular humidity or weather conditions. A lot of things can be done using sports analytics, and it is actually implemented in real-world sports right now. For people who've watched the movie Moneyball, that was all about sports analytics: it was about how you can use predictive analytics to make better decisions about the game.

Now that we have a lot of data available, if predictive analytics is applied to sports, you can draw out a lot of useful information from a match. For example, let's consider the game of baseball. The speed of the pitch, the speed of the bat, the ground conditions, the humidity, the temperature: all of this data can be recorded for each match. This data might seem useless to a viewer like you and me, but by using predictive analytics we can process it and draw comparisons and conclusions about a game, such as the performance of a player under a particular temperature or humidity, or the performance of a batsman against a particular pitch.

These days, stadiums are also equipped with special cameras which are dedicated to every player on the ground. These high-speed cameras record every minor detail, from the speed of the ball to the body temperature of the player, and they send this to a server in real time. This data can then be used to make better decisions or to determine the performance of a player over a series or a season.

These days we've also been introduced to wearable technologies. Basically, there are watches which record your heartbeat, your perspiration level, your fatigue, and your stress. These kinds of devices are also used in sports. They record the heartbeat, the perspiration level, and all of that, and they will help the coach to train the
players in a better way, so that they can avoid injuries and all of that. There are actually a lot of football teams and a lot of baseball teams which adopt predictive analytics. An example of this is Real Madrid: they practice predictive analytics in order to focus better and make better decisions about the game.

Now, finally, let's move on and look at today's demo. Before we start with the practical, let me tell you what exactly we're going to be doing in this demo. The problem statement is: we are going to implement predictive analytics using R to diagnose whether or not a person has breast cancer, based on past medical data. The data set is collected from the University of California website and consists of breast cancer cases, and we're going to use it to build a predictive model that classifies a tumor as either malignant or benign based on certain feature variables. These are the feature variables: we have uniformity of cell size, uniformity of cell shape, marginal adhesion, bare nuclei, normal nucleoli, and so on. So there are certain variables that can be used to detect whether a tumor is benign or malignant; we're basically detecting whether a person has breast cancer or not. I hope all of you are clear with our problem statement and the data description.

Now, the process or the logic behind this demo is predictive analytics. We are going to begin the demo by importing the data, and then we are going to perform data cleaning, which I have spoken about earlier in today's session. After that we'll build the model and train it using the training data set, and then we'll finally evaluate the efficiency of the model using the testing data set. So this is how we are going to perform this demo.

Now, without wasting any further time, let's look at our code. In order to save some time, I've already typed out the entire code. For those of you who are not familiar with the
R language, I'll leave a link in the description, and you can go through that content. Don't worry, I'm going to explain each line in detail.

We're going to begin the demo by importing the data. I've already downloaded the data and stored it at this path: it's on my desktop, in the predictive analysis demo folder, and this is my text file which contains the data. Then we are going to use the read.csv function to read our data set in CSV format, and we're going to store the entire data set in a variable called cancer data. So let's just run this line of code. Okay, we've successfully loaded our data set.

Now let's display the structure of the data set. To do that, there's a function called str in R. The output here shows that we have 698 observations and 11 variables. Now the problem with the variables is that they are not labeled in a way that we can understand: we can't tell what X1 or X2 or X3 stands for. For that reason, we're going to label the data set. When I took the data set from the website there was a description of the data set, and that's why I have all of these labels over here. So this code snippet is just going to label all my variables. Now let's look at the structure of the data set again. You can see that everything here is labeled: there's bare nuclei, there's mitoses, class, bland chromatin, and all of that. So guys, these eleven variables are used to predict our outcome, and our outcome is to classify a tumor as either malignant or benign. We're basically predicting whether the tumor is harmful or not.

Our next step is data preparation. If you look at the data structure over here, there is a variable called ID, which is the first variable. This variable is obviously not needed to predict the outcome. Okay, we don't
need such variables, so we get rid of such redundant variables. This is exactly what this line is going to do. Now if you check the structure, you can see that we have only ten variables, because we removed the ID variable.

Our next step in data preparation is to convert the data into numeric format. If you look at the variable bare nuclei over here, it's of type character, even though its values are numerical digits. So we need to convert it into a numeric variable, which we'll do by running this line of code. R has a function called as.numeric that can be used to convert a character variable into a numeric one. Initially it was character here, and if I look at the structure now, you'll see that it's numeric. So we've converted it into a numeric format.

The next line of code is going to identify the rows without missing data. The problem with missing values is that they lead to inaccurate predictions, so our outcome will not be precise if we have missing values in our data. There's a function called complete.cases which checks for the samples that have all their values. We're just going to use this function to keep only the observations which are free of missing values. Earlier we had 698 observations; if we run this line of code and display the structure of our data set, we now have 682 observations. So 16 observations had missing values, and those have been removed.

The next step is about this variable called class. This class is basically our resultant variable: it tells us whether a tumor is benign or malignant. If it's benign it is labeled as 2, and if it's malignant it is labeled as 4. What we're going to do is change this labeling; we'll keep it as 1 and 2. If I run this and
if I look at the structure, you can see that it has now been transformed into 1 or 2.

Once you're done with cleaning and preparing your data set, your next step is building the model. Like I mentioned earlier, the first step in building your model is splitting your data. Data splicing is the process of splitting your data into a training set and a testing set. You have to always remember that your training set has to have a lot more data samples than your testing set. That's because the training set is where your machine is going to learn: the more data you feed the machine, the better it is going to learn. So rows 1 to 477 I've assigned as the training set, and rows 478 to 682 are my testing set.

One more thing to note here is that I have only added columns 1 to 9 to each of these data sets. The reason is that the tenth variable, which is class, is our outcome variable. When I'm training and testing on my data set, I don't want to feed in the output as well; I want the machine to figure out the output when I give it the input. I'm storing our training outcomes and our testing outcomes over here, and you can see that only variable number 10 is stored in both of these variables. So let's run this line of code, and this one as well.

After splitting, we're finally ready to create the model. To do this, we load the classification package, which is a library called class, and we run the k-nearest neighbor classification algorithm. This knn denotes the KNN algorithm. If you are interested in learning more about the KNN algorithm, I'll leave a link in the description, and you can go through that video as well. What we're going to do is run this KNN algorithm on our training set and on our testing set. The test data set is also passed in to allow us to evaluate the effectiveness of the model. That's where validation of the model
takes place. My value of k is 21, which means it is going to take into account the 21 closest neighbors. I have chosen 21 because it is roughly the square root of the number of training samples: we have 477 training samples, and the square root of that is around 21. That's why the value of k is 21.

If I run this entire code snippet, our model is created; we've trained our model on the training set. Now let's look at the predictions that the model made, that is, the output. What this denotes is that our first sample is malignant, and similarly the second sample is classified as benign. So this is the classification that has happened.

Now we're going to validate our model; we're going to test its efficiency. After displaying the output of our model, we perform model evaluation. Here we're just going to cross-tabulate our test outcomes and our predictions, so we're basically comparing our actual results to our predicted results. Let's run this line of code. You can see that 160 of the samples have been correctly classified as benign, and similarly 45 of the samples are correctly labeled as malignant. So our predictions are 100% accurate.

To show you the accuracy, we can run this code snippet. Your actual value for sample number one is 2, meaning that it is malignant, and when you predicted it, you correctly classified it as 2. Similarly, your second sample is in class 1, and in your prediction you correctly classified it as class 1. So we have created a model with a hundred percent accuracy.

So guys, all of this can be accomplished using predictive analytics. That was it for our demo session. If you have any doubts regarding today's session, please leave them in the comment section and we'll get back to you at the earliest. Thank you so much for watching this video, and have a great day. I hope you have enjoyed listening to this video. Please be kind enough
to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning!
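
The code shown on screen during the demo is not part of this transcript. The following is a minimal R sketch of the steps described in the session (drop the ID column, convert a text column to numeric, remove incomplete rows, relabel the class, split by row position, run `knn` from the `class` package, and cross-tabulate). It is not the presenter's script. Everything about the data is an assumption made so the example runs on its own: the data set is synthetic, the column names `feature_1` to `feature_9` are hypothetical, and the roughly 70/30 split only mirrors the 477-of-682 split mentioned in the session. The class is relabeled as a factor with text labels rather than as 1 and 2.

```r
library(class)  # recommended package shipped with standard R; provides knn()

set.seed(42)

# ASSUMPTION: synthetic stand-in for the breast cancer data. Nine features on
# a 1-10 scale and a class column coded 2 (benign) / 4 (malignant).
n <- 300
class_raw <- sample(c(2, 4), n, replace = TRUE, prob = c(0.65, 0.35))
features <- sapply(1:9, function(j) {
  centre <- ifelse(class_raw == 2, 2, 7)
  pmin(10, pmax(1, round(rnorm(n, mean = centre, sd = 1.5))))
})
cancer_data <- data.frame(id = seq_len(n), features, class = class_raw)
names(cancer_data)[2:10] <- paste0("feature_", 1:9)  # hypothetical names

# Mimic the "bare nuclei" issue: a column stored as text, with "?" where a
# value is missing (rows 5, 50 and 120 are arbitrary choices).
cancer_data$feature_6 <- as.character(cancer_data$feature_6)
cancer_data$feature_6[c(5, 50, 120)] <- "?"

# --- Data preparation ---
cancer_data$id <- NULL                                   # ID does not help prediction
cancer_data$feature_6 <-
  suppressWarnings(as.numeric(cancer_data$feature_6))    # "?" becomes NA
cancer_data <- cancer_data[complete.cases(cancer_data), ]  # keep complete rows only
cancer_data$class <- factor(cancer_data$class,
                            levels = c(2, 4),
                            labels = c("benign", "malignant"))

# --- Split by row position: features are columns 1-9, class is column 10 ---
n_rows  <- nrow(cancer_data)
n_train <- floor(0.7 * n_rows)
train_x <- cancer_data[1:n_train, 1:9]
test_x  <- cancer_data[(n_train + 1):n_rows, 1:9]
train_y <- cancer_data[1:n_train, 10]
test_y  <- cancer_data[(n_train + 1):n_rows, 10]

# --- Model: k is the square root of the training set size, rounded down ---
k <- floor(sqrt(n_train))
predictions <- knn(train = train_x, test = test_x, cl = train_y, k = k)

# --- Evaluation: cross-tabulate actual against predicted classes ---
confusion <- table(actual = test_y, predicted = predictions)
accuracy  <- sum(diag(confusion)) / sum(confusion)
print(confusion)
cat("k =", k, " accuracy =", accuracy, "\n")
```

With the session's numbers, `floor(sqrt(477))` gives 21, the value of k used in the demo. Note that `knn` has no separate training step: the training rows and labels are passed in together with the test rows, and the function returns the predicted class for each test row. The synthetic classes here are well separated, so a high accuracy on this sketch says nothing about how the real data behaves.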

3 Comments

  1. edureka! said:

    Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Data Science Training Certification Curriculum, Visit our Website: http://bit.ly/2r6btSL

    May 22, 2019
  2. Akinchan Singhai said:

    Hi.. thanks for uploading such a terrific video with nice elaboration. I will appreciate if you post any demo for a Geographical Information System (GIS) based predictive analysis using R.

    May 22, 2019
  3. Aashikh Rider said:

    Good video . I just had an refreshing part here

    May 22, 2019