Multiple Logistic Regression and Predictive Analytics Made Simple in R – Part 1

Hi everyone, David mail here, and today we're gonna start on a multi-part series on multiple logistic regression. This is one of the important features, or hallmarks, of doing data science stuff, so it's really cool. Here is the data: I'm using this kratom sales data, which I think I posted on Kaggle a couple of months back. We're gonna pull this in and I'm gonna walk you through everything. Today we're just going to do the exploratory data analysis: load the data, look at the data, and see what kind of insights we can get from it.

So let's first pull this data in, right? That's this line right here, and now we have it in there. This goes into a data frame called test data one, so if I want to open it up I have to go into my data frames and find it in there. Test data... let's open this up a little more... there it is, right there. I click on it, and it shows you that we have week number, date, the year, total sales, transactions, units, average T (that's average temperature), and a ranking, which I think is 1 all the way through, so that's basically meaningless. Then we have violent crime stats for that store's local area, and then we have a ratio and day type. Day type A means lower crime, lower sales, and B would be higher crime, higher sales.

Now you're saying, well, how can you determine crime for that? Well, I have other videos posted on here, one on correlations and another on random forest, that show you that we have a positive correlation between sales and violent crime for this store. So as violent crime goes up, the sales of kratom products, which would be your teas and, I don't know, powders and whatnot, lozenges maybe, I have no idea, go up in conjunction with that, in a positive manner.

So being that we know that, let's go back to our code here and start our exploratory data analysis. First, we already know that our Y variable in this case is going to
be day type. I originally wrote this for another model, which used political party, so let's just put down day type here. Okay, now that we've done that, we have to establish: does it have bias? So let's take a look and make a table of that particular variable. Let's run this, and there we go. See how close 29 is to 30? We have about the same number in both, so it's really not skewed. Now, if you saw 60 in one and two in the other, we'd have some serious bias that we would have to take into account, or possibly we'd use a different variable.

Okay, next step: we want to check for NAs. NAs are your incomplete or missing data, so let's take a look at that by running the summary command. We run this, and here we go. Let's open this up a little bit so you can see what's in there. When we run the summary command it shows you the NAs at the very bottom of each column. See, here's one, here's one, here's one, here's one, all the way through. I think basically every single column has one NA. So being that we have NAs, let's go back up here, where it says that if there are NAs then you run this. This is na.omit, and it removes those rows. So we do that, and we just hit Ctrl+Enter because we're in RStudio. Once we're done with that, let's go back and run the same summary again, just highlight and Ctrl+Enter, and there we go. Look, see, the NAs are gone. See how quick and easy that was?

Once we're done with that, we want to check the distribution of our variables: are they skewed, or are they fine? First let's take a look at this str command. Let's run that. What does that show us? As you go down this list (let's bring this up so you can see it a little better) you can clearly see three integers, and the rest are num, numeric, except for the last one. The last one is a factor, but look at it closely: it's not only a factor, it's a factor with three levels. Now being that that is our Y
dependent variable, we're good: that can be a factor, and that's fine for this multiple logistic regression. But the problem is that we know we've got A or B days, right? Well, this says factor with three levels. See it right there, factor with three levels. What that means is we've got A and B and, see, right before them, we have a blank. We do not want that. The blank is right here, in case you can't see it, right there. We do not want a blank factor level making all of our data look weird when we graph stuff, so we have to remove that.

So what we're gonna do next is remove the extra blank level, see, right here, with this line of code. We take just that one column, day type, and we factor it, just like this, and then put it back into that same column. This will remove that extra blank level. So let's do that, and now if we run the same str against it again, watch what happens. Bam, factor with two levels. It's correct, just A and B, and that's what we want.

Okay, so we have that. Next we're gonna get into the cooler parts of exploratory data analysis, where you can actually visualize stuff. The first one here, let's just run this. It's gonna be this code right here, but let's just run it so you see what it is, and let's bring it up a little bit bigger so you can really see what's going on. There we go. It has to be the right size; if it's too small then the little bars won't show and it's meaningless. So this is a little bit of code, which I'll open up so you can see it, that takes all of the columns and shows us the frequency in each column. You can quickly see by looking at this that date Y, see, right here, is fine: you have either 2017 or 2018 and nothing in between. You have total sales, which is normally distributed. And you have some skewed things like units
right here, where it's all bunched to the left, right there. Rank is fine, but it leans a little bit to the left. So upon looking at that, we can get some ideas about which things we're gonna use. I'm gonna stick with total sales. That's not skewed, so I don't have to worry about it. These two I would have to worry about, so I'm not gonna use them. I'm gonna deal with total sales, with average temperature, which looks normally distributed, and with violent crime, which leans a little bit to the left but has still got a good number there, so that's fine. I'm going to deal with those three, which are more normal.

So let me open this up so you can see the code. Okay, this is the code I just ran, so you can see what it is. par is the command to divide up your screen so you'll see more than one graph returned. In this case you saw that I have nine graphs in there, so that's why I have c(3, 4), okay? This will actually give me twelve openings, and this part sets the margins. Let's open it back up. Actually there's ten, that's why. If there were nine I would use c(3, 3), but there's actually ten here, so I've got space for two more if I had two more variables. That's how that works. Let's just shrink it down so you can see the code again. Okay, so this is the actual code to show that for all of the columns.

Next let's do the same kind of idea, same concept, except we're only gonna have a few graphs this time. Remember, I just told you I want to focus on total sales, violent crime, and average temperature. So what we're gonna do is run these three; let me make sure I've got them all selected here. Now obviously it looks like crap right now, but let's pull that out so you can really see it. Okay, what this is showing you is type A and type B of the day, right? B has higher crime, A has lower crime. But we're looking at total sales, violent crime, and average temperature, and you can
clearly see that this is B over here and this is A. So as sales go up there's a much higher propensity for it to be a B type day than an A type day. As sales go down there's almost a hundred percent probability that it's an A type day, and at the very high end it's almost a 100% probability of being a B type day. So we know that's good, we want to look at that; that backs up our thinking here. On top of it, we have violent crime showing the same thing: on a type A day it's lower, on a type B day it's much higher.

And then we have the temperature. The temperature on a type A day is here, and on a B day it's here. So you can clearly see that there is more B at the higher temperatures. It's not as strong a showing as these two, but it's still there, because there are more of the higher temperatures on B than on A. You can see the line would actually have to be somewhere around here to be 50%, and clearly there's more B than A.

Now let's close this up a little bit so you can see the code, right there. Okay, there's the code that I just used. All I'm doing is plotting them against each other: see, day type, total sales, and then it just says the data is the original test data one.

Now let's go to the next one. Let's bring this down a little bit so you can see. I'm not gonna cover creating training and testing samples, or sample sets, today; I'll cover that in the next video. But I want to show you this one right here. These are box plots, and this is pretty cool too. This is where you can start getting some insights on top of what you just saw. So let's run this. Again it starts with par and mfrow, which in this case gives you three by three, because I'm gonna have nine graphs here, as you'll see in a second. When you look at this, what you can see is that I've plotted all of our columns against the Y, the dependent variable. So in this case you have A and B, right, and you have the week number, date Y, which is your year, total
sales, and what you're looking for is differences between the two, the A and the B. Like, for instance, this one here, rank, really doesn't show anything significant, anything that we would care about. But look at violent crime. Look at the difference. See, it shows you the range: there's one outlier here and there's a little bit of range here, but the vast majority is right in there. And then look at this one: the vast majority is a bigger swath, but much higher up. So you can clearly see right here that type B days have a much higher violent crime statistic than type A.

Now you can also look at the ratio, which is higher on B than on A, but let's look at some other things. Let's look at the temperature. Where is temperature in here? There it is, right there, average temp. This is what I was showing you earlier: there's some difference, and it's not as big as with violent crime, but there still are more B days than A days by temperature. So that's interesting. Then you look at units: clearly you have more on B than on A, and you do have some outliers; it shows you how the outliers look. Then you have transactions: you get some outliers up here, but the bulk of it is still much higher than here, on an A day. Same with total sales.

So we're going to stick with the big ones. Remember, we want to stick with total sales because these two might be skewed; we found that earlier. So I'm gonna stick with total sales, average T, and violent crime. Going further, we're gonna delve deeper into those, and we're gonna see a lot more about them.

If we go back here, let me open this up so you can see the code again. We did a box plot, and if you look down the list you see week num, date year, total sales, et cetera, then there's the little squiggly sign (the tilde), and then you have day type. Then we put in the Y label and the X label, so we have them correctly situated. The color could be anything. I picked light blue, but
it could be pink, green, red, blue, whatever you want. And then of course the data is the original data set, which is test data one. Later on you've got to pay attention to this, because we'll actually have test data and training data based off of that, but I'll go over that and it won't be confusing.

So this is how you get to this, and you really end up with some cool insights. Just off this video alone you can already start pulling some insights out of this data. In the next video we're gonna go into creating training and testing sets, and then we're gonna go into actually creating our model and doing more testing and more testing and more testing, and you're gonna see some really cool stuff with this in the end.

All right, thanks again for watching. Please take a moment to subscribe, like, and leave me a comment. Let me know what you think, what you like, and what you would like to see in some future videos. Thanks again and have a great day.
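
For reference, the steps walked through in this video could be sketched roughly like this in R. This is only a minimal sketch under stated assumptions: the real data file is not shown in the transcript, so the code builds a small synthetic data frame, and the names testdata1, TotalSales, AvgT, ViolentCrime and DayType are assumed spellings, not necessarily what the video's code uses. It also assumes the day-type plots are spinograms (spineplot); the video only says the variables are plotted against each other.

```r
# Synthetic stand-in for the video's data. All names and values here are
# assumptions for illustration, not the real kratom sales file.
set.seed(1)
day <- rep(c("A", "B"), times = c(29, 30))
testdata1 <- data.frame(
  TotalSales   = c(rnorm(29, 900, 150), rnorm(30, 1200, 150), NA),
  AvgT         = c(rnorm(29, 65, 10), rnorm(30, 72, 10), NA),
  ViolentCrime = c(rpois(29, 15), rpois(30, 25), NA),
  # Include an unused blank level, like the three-level factor in the video.
  DayType      = factor(c(day, NA), levels = c("", "A", "B"))
)

# 1. Check the balance of the outcome variable.
print(table(testdata1$DayType))

# 2. Look for NAs, drop incomplete rows, and confirm they are gone.
print(summary(testdata1))
testdata1 <- na.omit(testdata1)
print(summary(testdata1))

# 3. Inspect column types, then drop the unused blank factor level.
str(testdata1)
testdata1$DayType <- factor(testdata1$DayType)
str(testdata1)

# 4. Histograms of every numeric column on one screen.
num_cols <- names(testdata1)[sapply(testdata1, is.numeric)]
op <- par(mfrow = c(1, 3))
for (col in num_cols) {
  hist(testdata1[[col]], main = col, xlab = col)
}

# 5. Share of A vs. B days across the range of each predictor.
spineplot(DayType ~ TotalSales, data = testdata1)
spineplot(DayType ~ ViolentCrime, data = testdata1)
spineplot(DayType ~ AvgT, data = testdata1)

# 6. Box plots of each predictor split by day type.
boxplot(TotalSales ~ DayType, data = testdata1,
        ylab = "TotalSales", xlab = "DayType", col = "lightblue")
boxplot(ViolentCrime ~ DayType, data = testdata1,
        ylab = "ViolentCrime", xlab = "DayType", col = "lightblue")
boxplot(AvgT ~ DayType, data = testdata1,
        ylab = "AvgT", xlab = "DayType", col = "lightblue")
par(op)
```

With the real data you would replace the synthetic data frame with your own loaded file and widen the par(mfrow = ...) grid to fit however many columns you have.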
