Predictive Analytics using Orange Data Mining



Orange is a component-based visual programming software package for data visualization, machine learning, data mining, and data analysis. Orange components are called widgets, and they range from simple data visualization, subset selection, and pre-processing to empirical evaluation of learning algorithms and predictive modeling. Visual programming is implemented through an interface in which workflows are created by linking predefined or user-defined widgets, while advanced users can use Orange as a Python library for data manipulation and widget alteration.

So how do you go about getting Orange? Simply follow the link below and it will lead you to the download page. Once you're on the page, click Download Orange, install it, and you're good to go. Once downloaded, go to Applications and look for Orange.

First we'll be training on the available data files in order to predict future or unknown instances. Open a new workflow and name it "test". Drag a File widget onto the canvas and double-click it to open a data file. For this example we'll be using the vehicles data file, which is a purely numerical data set; later in the tutorial we'll go through the analysis of a discrete, categorical data set.

Now we'd like to look at what's inside this file, so we open up a Data Table. Double-click on the Data Table to look at the values and attributes of the data file: there are a total of 846 examples and 18 attributes.

Before we start training on our data set, we first have to check whether it contains any outliers that might hurt the machine-learning accuracy. To do so we have to visualize the data: go over and select Scatter Plot or Linear Projection. For this tutorial we are going to select Linear Projection. Double-click on Linear Projection and you'll notice there are a few points that sit further away from the rest of the data; these outliers are the ones we wish to remove. Close the widget and output the file to the Outliers widget, then double-click it. Here we can choose a distance metric such as Euclidean distance to identify outliers; for this tutorial we'll set the number of nearest neighbors to 3. We then output the outliers to a Linear Projection, and when we open it the shaded data points are the ones identified as outliers. We need to remove all of them before we can proceed to the actual machine-learning part of the workflow, so we want to export only the inliers, and not the outliers, downstream.

From here we feed the data into a Classification Tree. By default, the Outliers widget sends the outliers as its output to the Classification Tree; since we are going to use only the inliers, connect the inliers output instead and click OK. Now only the inlier data will be used for classification.
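Since Orange doubles as a Python library, the same pipeline can be scripted. Below is a minimal sketch, assuming Orange 3's scripting API and using scikit-learn's LocalOutlierFactor as a stand-in for the Outliers widget's nearest-neighbors option (the widget's internals may differ); the dataset name "vehicle" is an assumption, so point it at wherever your copy of the vehicles file lives.

    import numpy as np
    import Orange
    from sklearn.neighbors import LocalOutlierFactor

    # Load the vehicles data; "vehicle" is an assumed name/path.
    data = Orange.data.Table("vehicle")
    print(len(data), "examples,", len(data.domain.attributes), "attributes")  # 846, 18

    # Flag outliers by local density among the 3 nearest neighbors,
    # mirroring the widget's setting; fit_predict returns +1 for
    # inliers and -1 for outliers.
    mask = LocalOutlierFactor(n_neighbors=3).fit_predict(data.X) == 1

    # Keep only the inliers and hand them to a classification tree.
    inliers = Orange.data.Table.from_table_rows(data, np.flatnonzero(mask))
    tree_model = Orange.classification.TreeLearner()(inliers)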
Now let's look at categorical data, using the zoo file as an example. We do the same thing here as well: under the File tab, go to Browse Documentation Datasets, scroll to the bottom, and you'll find zoo. As you can see, zoo has 101 data instances and 16 features.

Now that it's loaded, in order to classify the data we have to use learners, and this brings us to the Classify tab. Open it and you'll get a list of learners that you can apply to the zoo file. Right now we do not know which learning algorithm is best for this data file, so for this tutorial we will open six algorithms for the assessment: naive Bayes, neural network, classification tree, CN2, random forest, and k-nearest neighbors. Connect the file to all of these learners.

Now that we have connected the file to all the learners, go to Test Learners, which can be found under the Evaluate tab, and connect the learners to Test Learners. This will help us determine which learner is most suitable for this set of data and which one is most likely to lead to an accurate prediction. If we click on Test Learners right now, it does not show anything, and that's because you need to connect your data to Test Learners as well.

Now that it's loaded, we can see a table of scores. What we need to focus on here is CA, which means classification accuracy. If you look on the left you will see some sampling options; the assessment of the classifier models differs depending on the estimation method you choose. Random sampling, which is where we are now, gives a classification accuracy of 91%. Cross-validation gives an accuracy of about 94%, but it is usually used for a much larger data set. Leave-one-out takes a longer time to process the data but gives a more accurate reading: it holds out a single data instance for testing and uses the rest for training, then repeats the process until every instance has been used for testing exactly once. It is only recommended for small data sets, as it is very time-consuming on a large one. So we have to select the estimation method that is best for our data set and our goals. In this sample, looking at classification accuracy, k-nearest neighbors has the value closest to 1.

Orange follows the usual two-step train-and-test methodology for classification; however, in Orange you do not need separate test files. In fact, over here you can choose what percentage of your data will be the training set and what percentage will be the test set. Let's bring it to 66%: this means that 66% of our data will be in the training set, and as you can see, the numbers change. Let the program run. Now that we have set our training sample at 66% and our test set at 34%, let us look at the classification accuracy again. Once more, k-nearest neighbors has the highest accuracy, as it is closest to 1. If we want to double-check against our classes: here we are looking at the accuracy in terms of mammals, and if we look at, for example, amphibians, the numbers change again, but once more k-nearest neighbors has the highest classification accuracy. So what we can do now is be sure to use k-nearest neighbors for the prediction.

We can also bring out the Confusion Matrix: connect Test Learners to Confusion Matrix. The confusion matrix allows you to see how each classifier classifies the data, and which data is classified correctly and which is not. On the left you can have a look at the different learners we have, and if you click on Correct, it will highlight the instances that each learner has predicted correctly. Running through them, k-nearest neighbors still has the highest number of correct predictions, and this corresponds with what Test Learners previously showed us: the highest classification accuracy.
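The Test Learners step can be reproduced in a script as well. A minimal sketch, again assuming Orange 3's scripting API; four of the six learners are shown, since the neural-network and CN2 learner classes vary across Orange versions, and the confusion matrix is computed with scikit-learn from the evaluation results.

    import Orange
    from sklearn.metrics import confusion_matrix

    data = Orange.data.Table("zoo")  # 101 instances, 16 features

    learners = [
        Orange.classification.NaiveBayesLearner(),
        Orange.classification.TreeLearner(),
        Orange.classification.RandomForestLearner(),
        Orange.classification.KNNLearner(),   # k-nearest neighbors
    ]

    # 10-fold cross-validation; newer Orange releases call this as
    # CrossValidation(k=10)(data, learners) instead.
    res = Orange.evaluation.CrossValidation(data, learners, k=10)

    # CA = classification accuracy, the column we read in Test Learners.
    for learner, ca in zip(learners, Orange.evaluation.CA(res)):
        print(f"{learner.name}: CA = {ca:.3f}")

    # Confusion matrix for one learner (index 3 = k-nearest neighbors).
    print(confusion_matrix(res.actual, res.predicted[3]))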
Orange also allows you to view a model's predictions on data, and we will be doing this from scratch. To clear the canvas, drag across the widgets, right-click, and click Remove. We are going to use the same zoo file, so double-click the File widget, reload, and zoo is back.

Drag the file out and add a Data Sampler. When you click on the Data Sampler, there are options for stratified sampling and random sampling of the current pool of data. To make a prediction, we connect naive Bayes and a classification tree to the Predictions widget by dragging them across; click OK, and click OK again when you see the pop-up window. When you drag the connection from the Data Sampler to Predictions, click Clear All first and choose Remaining Data, then drag it across to the Data input. Sometimes you might miss the green arrow, so make sure the connection snaps to the center of the box.

The model is also able to show you how the various attributes drive the predicted class; you can use a Nomogram for that. Double-click on the Nomogram and you can pick the target class from the drop-down list. Let's say the target is bird: you can slide the points across the attributes, where 1 means yes and 0 means no, and the widget shows you the resulting probability.

As for the Predictions widget, do not forget to connect the naive Bayes model, the classification tree model, and the data to it. Double-click on Predictions and you'll be able to see how the models come up with their answers. For example, they are able to tell that the bear is a mammal because they are looking at all 16 of the data attributes; you can see these inputs, such as hair, feathers, milk, and so on.
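The Data Sampler to Predictions flow also has a scripting equivalent. A minimal sketch, assuming Orange 3's scripting API; the 70/30 split and the random seed are illustrative choices, not widget defaults.

    import numpy as np
    import Orange

    data = Orange.data.Table("zoo")

    # Split the pool into a training sample and "remaining data",
    # playing the role of the Data Sampler widget.
    rng = np.random.RandomState(42)
    idx = rng.permutation(len(data))
    cut = int(0.7 * len(data))
    train = Orange.data.Table.from_table_rows(data, idx[:cut])
    test = Orange.data.Table.from_table_rows(data, idx[cut:])

    # Fit both models and compare their predictions side by side.
    models = {
        "naive Bayes": Orange.classification.NaiveBayesLearner()(train),
        "classification tree": Orange.classification.TreeLearner()(train),
    }
    for name, model in models.items():
        preds = model(test)                # predicted class indices
        probs = model(test, model.Probs)   # per-class probabilities
        label = data.domain.class_var.values[int(preds[0])]
        print(f"{name}: first animal predicted as {label} "
              f"(P = {probs[0].max():.2f})")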
Now I'm going to show you the different ways we can visualize the data. On the left-hand side of the screen, scroll down and you'll see the Visualize tab; click on it and you'll see several widgets that you can use to visualize data. Here we will focus on four of them: Distributions, Attribute Statistics, Scatter Plot, and Linear Projection.

First we have Distributions. Click on the widget and connect it to the file. Using the same zoo sample file, double-click on Distributions and a graph comes up. Here we can select the variable we want to see. For example, if we pick hair, the display shows the outcomes for all the different types of zoo animals: the x-axis is hair and the y-axis is the frequency. Over here we have the probability plot. Let's say our target value is amphibian: we can read off that amphibians have no hair, and that there is a low number of amphibians in the sample file. We can select other variables according to what we want to see, and we can also click Save Graph to keep the image for later use.

Next we have the Attribute Statistics visualization widget. Similarly, we connect the widget to the file. Here we can see the attributes in the left-hand corner, along with the categories and total values. For categories we have 0 and 1; these nominal values are encoded as numbers, where 0 means no and 1 means yes. This tells us that 57.4% of the data set has no hair and 42.6% of the data set has hair. Accordingly, we can click on the different attributes to find out the percentages for each of them. We can also save the graph if we want to keep it for further use.

Now we will look at the Scatter Plot. Connect the file to the Scatter Plot and double-click it. Here we can choose the x- and y-axis attributes; let's say we choose hair for the x-axis and feathers for the y-axis, and you can see how the data points are distributed on the scatter plot. For the point color we can choose type, so blue is amphibian, red is bird, green is fish, and so on, and you can see where the anomalies are. You can also zoom into the scatter plot by drawing a box and look at the individual points: for example, the orange points are insects, and this insect has hair but no feathers. You can also see that the majority of the data is actually mammals. The additional point properties can be used for complicated data sets with many independent variables; if you want to narrow down your data very precisely, you can set point labels, point shapes, and point sizes accordingly.

Finally we go to Linear Projection. Connect the file to Linear Projection. You can move all the attributes that you do not want to appear on the graph down to the hidden attributes; here we can see the axes feathers, hair, milk, and eggs. You also have the types, or classes, of the different zoo animals, which you can use to check qualities: birds have feathers and they lay eggs. We can also zoom into the graph, like we did before, to see the different details up close.

When we use this in predictive analytics, we first have to select attributes: add a Select Attributes widget and link it into the prediction region. Here we can see the available attributes, the class attribute, and the meta attributes. We move the attributes we want into the attribute set on the right and click Apply. At Predictions, let's say we want to see the prediction possibilities for amphibians: double-check, click Apply, and we can use any form of visualization to see our predictions. For example, here we will use the scatter plot. Link Predictions to the Scatter Plot and select your x- and y-axis attributes, say hair and feathers, with the point color set to type. We can zoom in on one part and inspect the attributes of a single data point: the naive Bayes model classified it as an insect and the classification tree classified it as a mammal, and since this data point is actually an insect, the naive Bayes prediction model is more accurate than the classification tree model here. You can also use the other visualization widgets to show our final prediction.

That concludes my brief demonstration of the Orange data mining tool. Thank you for watching.

4 Comments

  1. papalapapiricoipi2 said:

    you just put your voice over this video
    https://www.youtube.com/watch?v=G3W2Jc7Wtfw

    June 27, 2019
  2. Rex Galilae said:

    He's speaking like he's got asthma

    June 27, 2019
  3. Siabi K. Ebenezer said:

    hi Sir, Can you please help me to use orange to analyse my data. Please contact me [email protected]

    June 27, 2019
  4. Prashant B said:

    Hey Anurag, how Orange tool is useful in say youtube search data mining?

    June 27, 2019
