Ask the Oracle Experts Big Data Analytics with Oracle Advanced Analytics



Good afternoon. My name is Shawn and I will be your conference operator today. At this time I would like to welcome everyone to the Ask the Oracle Experts: Big Data Analytics with Oracle Advanced Analytics 12c conference call. All lines have been placed on mute to prevent any background noise. If you need assistance during the call, please press star zero on your telephone keypad and an operator will come back to assist you. Thank you. Tara Crockett, you may begin your conference.

Hello everyone, my name is Tara Crockett and I am the senior manager of Oracle Academy in North America. Welcome to the Oracle Academy Ask the Experts webcast series, where students and faculty have the opportunity to hear directly from Oracle experts on a wide variety of technology-related topics. In today's economy all careers require computer science skills, and Oracle Academy provides tools to help educators and students prepare for the future. We work with ministries of education, districts, and schools worldwide to provide computer science curriculum resources and teacher training that are truly fit for purpose. We work to ensure our computer science curricula are relevant for today's job market, and it is with that goal in mind that we launched the Ask the Experts webcast series.

I am really excited about our topic today. Charlie Berger is with us to discuss big data analytics with Oracle Advanced Analytics 12c and Big Data SQL. Charlie Berger is senior director of product management for data mining and advanced analytics at Oracle. He has over 30 years of experience in data analysis and software, including Oracle, Thinking Machines Corporation, Bolt Beranek and Newman, Palladian expert systems, and IBM. He is responsible for product management and direction of Oracle's in-database data mining and predictive analytics technology, including data mining, text mining, and SQL statistical functions. He holds an MS in manufacturing engineering and an MBA, both from Boston University, and a BS in industrial engineering and operations research from the University of Massachusetts at Amherst. If anyone would like to ask a question during the webcast, please use the chat function. Thank you to everyone who has already submitted questions; we will try to address as many as we can. And with that, over to you, Charlie.

All right, thank you very much for that introduction. Hi, this is Charlie Berger, and for the next hour or so I'll be going through big data analytics and describing what we've done by moving analytics inside the Oracle database, moving the algorithms to where the data resides, and in doing so I'll try to highlight the new doors that opens, the new possibilities. This is really a fun and exciting time if you're going into this field today, or re-entering it, or want to do some of this down the road. If you've been watching anything on the internet or on the news, you've probably seen a lot about big data, big data analysis, data scientists, and so on; it's been quite the buzz, at least from my vantage point. I'll skip through this legal safe harbor statement. The data scientist, one of the titles used for the people who do this type of work, was described as the sexiest job of the 21st century in an October 2012 Harvard Business Review article on the topic, and it really got a lot of attention. That seems pretty cool, right? Who would not want the sexiest job of the 21st century?
That's a little bit of what I do, though I think it's a bit overstated in some ways. I'll talk more about the nuts and bolts of what we do in the area of data analysis and big data analysis. Also, if you've been watching the hype, there's an industry analyst firm called Gartner that does a great job of tracking all the different technologies and vendors, and they publish an annual technology hype curve. What's interesting is that at the top of that hype curve right now is big data. You see the ads on television from all the major players, IBM's Smarter Planet, Microsoft, Oracle has ads, SAP; all the major consulting firms are talking about all the data that's everywhere these days and what you do with it. And that is exciting: there are new data sources coming in from sensors, from RFID, from satellites, from all sorts of new sources. But a lot of this is real nuts-and-bolts data analysis, the kind of thing you've probably seen in your statistics classes early on. The area I focus on more specifically is the use of that data in the context of what we often call predictive analytics: looking at the past, finding patterns, and applying the knowledge that was learned to new situations, either current or going forward.

So, a sort of predictive analytics 101. There are different types of BI and analytics that people talk about. There's BI reporting, which is really reporting on what happened; the business value of that is often fairly low and the complexity is simple: a pie chart, a bar chart, a total of sales by region by product. As you go deeper you get into analysis, then monitoring, and finally into the area I'll focus on for the rest of this presentation, predictive analytics, where you gain greater insight into what was really happening at a detailed level and can use the past as a predictor of the future. The typical use cases are developing customer profiles, predicting customer behavior, sometimes predicting the behavior of parts or inanimate objects, materials, things of that sort; sentiment analysis, market basket analysis, fraud detection, text mining. There are a lot of specifics in this field, but that's the general topic of predictive analytics.

In the realm of big data analytics more specifically, a lot of studies have been done about the growth of data, in contrast to what you may choose to become, which is a data analyst. The good news is that this is a really growing field: there's a huge gap between the number of data analysts available and the amount of data out there to be analyzed. Another way of seeing it is the gap between the amount of data that gets produced and the amount that actually gets used. A lot of people have data lying around and don't put it to sufficient use. I'll give you my favorite example: when I go to my ATM on, let's say, a Friday night, the very first question it asks me is what language I want to use, as if
that has changed in all the years I've banked there. In fact, with that same bank, what I expect today is for it to say: hey Charlie, it's Friday night, do you want to take out an extra hundred dollars like I've noticed you usually do on a Friday night? That's what I'm looking for in this day and age, where all this data has been collected and I want to put it to good use. You do see very specific use cases of that, say Amazon's recommendation engine, and on many Android phones you'll get suggestions about events nearby; there are a lot of very specialized use cases out there. What we're talking about is using all the data that's stored and managed inside your Oracle database, and having the analytical tools to mine all of that data, find patterns and relationships, and deploy that to others who may want to apply it in some way.

If you're trying to do this with so much data and so few data analysts, the data analysis platform had better be very powerful. It really has to handle very large volumes of data, it should be pretty easy to learn, and it should be highly automated and enable deployment throughout the enterprise. There are different languages, techniques, and products, commercial and open source, that people use for this. This is from a recent KDnuggets survey, from the knowledge discovery community; you can see the links at the bottom of many of the slides I'm showing. They run this survey every year, and R, an open source statistical programming language that's fun, powerful, and free to use, is, as you would expect, a very popular package. SAS, SPSS, and a number of other packages are out there, and Python has been coming up the ranks in recent years. And the one a lot of people use, not just for data analysis but for data management, is of course SQL. At Oracle we have support for SQL in the Oracle database, we have very tight integration with R, which I'll cover later on, and you can do some things with Python, but it's SQL and R that I'll focus on most here.

Now, if you do data analysis, there's a methodology that's well accepted and understood. It's a pretty simple but disciplined process, followed by most data analysts. The idea is: if I'm going to solve a problem, the first thing I want is a good business understanding of what I'm trying to do. Am I trying to recommend movies to people? If I'm Zappos selling shoes, do I care about the person who buys one pair of shoes every year, or do I really want to identify and give special treatment to those people who I predict might buy more than five hundred or a thousand dollars worth of shoes in the next 12 months? That may be the business challenge I'm going after: people who buy a lot of shoes, or buy a certain product, or are likely to take a first-class flight or a cruise, when I'm trying to sell first-class flights, cruises, lots of shoes, things like that. So I need to understand the business problem I'm trying to solve. Then I have to do data understanding: figure out the right data I need to get my hands on in order to solve that problem, because each of those
different business problems is going to require a different kind of data. I need to assemble that data and sometimes prepare it. Then I actually go through the process of data modeling; that's where the machine learning, the automation, happens, provided the first part was set up properly. Then I do the evaluation, which model is best, and finally I deploy it. This whole methodology will often take a data analyst several days, more likely several weeks, two, three, four weeks, oftentimes just because of the nature of moving the data from one location to another, going back and asking for new data after all the questions, and so on. What we try to do at Oracle is streamline that as best we can. We see people that have this platform sprawl: data in open source R on a PC or some compute platform, SAS and SPSS alongside an Oracle database, Hadoop platforms. That generates platform sprawl, duplicated data, lack of security; it's a problem for many, many reasons.

And if you do build a good model, how do you deploy it? How do you make it actionable, so that when I go to my ATM there's a recommendation: hey, we thought you might want to take out some additional money, or here's a coupon for somewhere you might be interested in? Before I get to the next slide, here's an interesting one. You may have seen the Netflix challenge: they had a one-million-dollar prize for whoever came up with the best predictive algorithm, better than what they were doing, for which movies you were likely to watch. The algorithm that won was never actually deployed, never implemented, because the engineering effort needed to bring it into a production environment didn't offset the small incremental gains they were going to get. And that's really what it's all about. At Oracle, what we're trying to do is create a big data and analytics platform, in the era of big data and the cloud, that makes both big data and analytics simple: any data size, any compute, any variety of data, any combination. We're also trying to make big data and analytics deployment simple; that's the key. I build a good model, I have this eureka moment; how do I get that out to the ATM machine, or onto a website, or into a marketing campaign? How do I put it into production without having to take all the data out, move it over to an analytical server on a laptop or something, and score all those millions and millions of records? I'd like it to be one platform: wherever I store the data, that's where I analyze it, all in place.

Now I want to switch over to the concepts of data mining. What is data mining in general? The idea, and you can read books and take courses on it, hopefully you are, is to automatically sift through large amounts of data to find previously hidden patterns, discover insights, and make predictions. There are different types of analytical techniques. There's attribute importance, where we try to find the key factors that most drive a business question: who's going to buy my product, and which factors matter most, age, income, whether they rent or own, whether they own a dog or a cat, and so on. If I'm trying to predict customer
behavior, that's called classification, and here's an example: I'm trying to take all these records and classify them into likely to buy $500 worth of shoes versus not likely to buy $500 worth of shoes. If I'm trying to predict an estimated value, that's a different algorithmic technique called regression, and there what I'm trying to do is fit a line to some data. The best example of that, in my mind, is Zillow. If you go to Zillow to look up the value of a home or property, it will take all of the other houses in the area, the square footage, the number of bedrooms, the age, and so on, and it will make a prediction of the value of that house. And that's all you do: you just press go on Zillow, I want to see predictions for the values of the houses in this region, and under the hood what it's building is a machine learning regression model. Similarly, there are different algorithms for market basket analysis, anomaly detection, and automatic unsupervised learning, clustering, where you automatically find natural subpopulations of people. There's a whole discipline to this, and different algorithms are designed to do different types of things.

Now, if you're going to follow that CRISP-DM process to solve some sort of problem, you generally start with a business question like: I want to predict employees that are likely to voluntarily leave, or churn; predict customers that are likely to churn, say from Verizon to T-Mobile or somewhere else; I want to target my best customers; and on and on. These are the typical questions that come to me as a product manager from customers looking to get into this, and as far as best practices go, I'd say those statements are a little loose. The last one is what we see all too often: I've got all this data, can you mine it and tell me what I didn't know that's hidden in it? And that's just not a good way to frame the problem.

To take this a little further, imagine you're trying to predict which customers are likely to churn, and I have just two variables, income and customer months. If I fit a straight statistical line through that data, y as a function of x1 and x2, it's not going to do a very good job. If I fit more of a quadratic model with squared terms, it might do a little better, even with just these two variables, but it's still not so good. If I use a machine learning algorithm, it's going to iterate through all the different possibilities: it keeps checking each of these cut points for customer months to see if there's a difference, until it finally comes up with this value here, 14, and it says: at 14 there's a significant difference; the customers greater than 14 all seem to leave, they're all cell phone churners, and over here they're not. If I continue on the other axis, and for each variable, going dimension by dimension, I can find different cut points and more complex rules that describe historical patterns, like this one: if customer months is greater than 7 and income is less than 175, then five out of six of those people, eighty-three percent, have historically churned. In SQL terms, discovering rules like that is building a classification model; a sketch of how you'd build one in the database follows.
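As a rough illustration of that classification step, here is a minimal sketch of building an in-database decision tree with the DBMS_DATA_MINING package. The SHOE_CUSTOMERS table, the CUST_ID key, and the BIG_SPENDER yes/no target are hypothetical names, not from the webcast; the package, constants, and procedure are the documented Oracle Data Mining API.

    -- Hypothetical settings table naming the decision tree algorithm
    CREATE TABLE dt_settings (
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(4000));

    BEGIN
      INSERT INTO dt_settings VALUES
        (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);

      -- Learn rules (cut points on income, customer months, etc.)
      -- that separate likely big spenders from everyone else
      dbms_data_mining.create_model(
        model_name          => 'SHOE_SPENDER_DT',
        mining_function     => dbms_data_mining.classification,
        data_table_name     => 'SHOE_CUSTOMERS',
        case_id_column_name => 'CUST_ID',
        target_column_name  => 'BIG_SPENDER',
        settings_table_name => 'DT_SETTINGS');
    END;
    /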
So if I find a new person that fits that profile, I'm going to say that represents an eighty-three percent confident prediction that the person will churn, and that profile covers, let's say, fifteen percent of the overall population. As a marketer or a business analyst I can put that to good use: maybe design a marketing program or special treatment for those customers, anticipating that they might be thinking about leaving, and offer a special promotion. Maybe they have children, or they do a lot of international calling, and you can make some sort of specialized promotional offer to try to retain them. So what we do is provide insight and prediction.

And where is this going? Well, it's really not so far off: your doctor can have information about you, you can monitor where phones are, you can monitor transactions. You leave a sort of digital exhaust wherever you go these days. You can look at that as a scary, fearful kind of thing, or as an opportunity to use that data to make society and everything better, and I like to look at it that way, although the particular case where I go to my doctor and he knows everything about me is a little bit scary. It's how you put the technology to use, and that really is up to you. The fact is, the data is out there, there's a digital exhaust, and if you can get access to that data you can do a lot of interesting things with it.

We at Oracle have been working to support the analysis of data for a number of years. A number of us joined Oracle in 1999 when it acquired a company called Thinking Machines Corporation; about 25 of us joined the ranks of Oracle, and we set about trying to teach a decades-old relational database to do things like neural networks and k-nearest neighbor algorithms. We found that those algorithms were not so easily implemented inside a database. But if you look at what a database does well, then and still to this day, databases can count things very, very quickly. So we initially started with techniques that use counting principles. I count the number of times the person who buys a motorcycle, or doesn't, was male or female, and that's the beginning of a conditional probability model. I count the number of times they rent or own, whether they prefer dogs or cats, and on down the line, and pretty soon, just by counting up the people who bought motorcycles against all these other variables, I start to build up a conditional probability model.
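That counting idea is easy to see in plain SQL. Here is a minimal sketch, assuming a hypothetical CUSTOMERS table with GENDER and BOUGHT_MOTORCYCLE columns; a naive Bayes style model is essentially many of these conditional counts combined.

    -- For each gender, what fraction bought a motorcycle?
    -- These group counts are the raw material of a conditional
    -- probability (naive Bayes) model.
    SELECT gender,
           bought_motorcycle,
           COUNT(*) AS n,
           RATIO_TO_REPORT(COUNT(*)) OVER (PARTITION BY gender) AS p_given_gender
    FROM   customers
    GROUP  BY gender, bought_motorcycle;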
Since then we've added support for a number of more sophisticated and complex algorithms, although those original counting-based ones actually work quite well. We've added the ability to mine unstructured textual data, spatial data, and transactional data; we've integrated with open source R, so you can write your scripts in R and they map transparently to equivalent SQL functions, getting all the parallelism and scalability that comes almost for granted with an Oracle database; we've added graphical user interfaces; and we've been developing applications that automate these predictive methodologies. I'll cover that a little further on.

Here's a high-level overview of what the product is. It's called the Oracle Advanced Analytics option. It has a collection of in-database algorithms that are supported and accessed via SQL, or via a graphical user interface called Oracle Data Miner, or, more recently, via the R integration, so I can write R scripts and they get mapped to the database and pushed down. Our positioning is that we're not trying to be everything to everyone, but if you are an Oracle customer with a lot of your data inside an Oracle database, we think we are the fastest way to deliver scalable, enterprise-wide predictive analytics, and you can even add the words "and analytics applications."

You can think of Oracle Advanced Analytics a little bit like this. You have traditional SQL doing SQL queries: select the customers where their gender is male and the value of shoes they purchased is greater than five hundred dollars. You can do some pretty clever analysis that way. Or you can add some very powerful analytical verbs: I PREDICT they're going to spend more than five hundred dollars, or I DETECT that they're an anomaly, or I REGRESS, predicting the value of their home, or I ASSOCIATE, they bought this and that, and I find the associations. We've added very powerful analytical verbs to the SQL language, and that's really what it's all about. We've moved the algorithms to the data, because algorithms are small and data can become large; rather than move the data out of the database, we move the algorithms in, to coexist with and be part of the actual database.

For the quants in the room who want to go deeper, here's a list of the high-performance in-database algorithm implementations we support: logistic regression, decision trees, linear regression, something called a support vector machine, and on down the list. Because each of these algorithms has been implemented inside the database, they all honor everything else the database does well. If the data is encrypted, we mine encrypted data. If you have redaction, so I can't see your social security number, I can still use that data without exposing it. And I can leverage all the other strengths of the database, like Oracle Text, which will tokenize unstructured data and give me a bag of words, call it a vector, of the number of times you say "discount" or "on sale"; if you've used those words three times more often than the rest of the population, that's as useful a signal to mine as being three times taller or having three times the income. So we can handle unstructured data and transactional data, and we're really able to mine star schema data inside the database, which is a pretty cool thing.
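To give a concrete feel for those "analytical verbs," here is a minimal sketch of scoring with them in a query. The model name SHOE_SPENDER_DT, the 'YES' class label, and the CUSTOMERS table are hypothetical carryovers from the earlier sketch; PREDICTION and PREDICTION_PROBABILITY are the real in-database SQL scoring functions.

    -- Score every customer with a previously built model,
    -- keeping only confident positive predictions
    SELECT cust_id,
           PREDICTION(shoe_spender_dt USING *)             AS predicted_class,
           PREDICTION_PROBABILITY(shoe_spender_dt USING *) AS confidence
    FROM   customers
    WHERE  PREDICTION_PROBABILITY(shoe_spender_dt, 'YES' USING *) > 0.75
    ORDER  BY confidence DESC;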
Our real value add, I think, is that we do not move the data. We avoid the data extraction, transformation, and preparation phase; that part of the problem can be as much as eighty percent of the challenge, just trying to get the right data: data wrangling, assembling it, and creating derived variables. Think of driving your car with the Progressive Snapshot device: what's the fastest speed I've ever recorded you at, what percentage over the speed limit have you driven? If I do all of that outside the database, in some other commercial product or even in R, and I get a really good model, how do I deploy it? How do I apply that model in real time to somebody else? It's not so easily done. We've automated all of that, and in many cases, in less time than it takes just to export the data, we've already gone through the complete process of learning the patterns and applying them to the larger data set to make predictions on everyone. So it's a much lower total cost of ownership and the fastest way to deliver enterprise predictive analytics. Most of the customer success stories I have are stories of "it used to take us two or three or four weeks, now I can do it in two or three or four hours," and it really boils down to the simplified environment: you work with the data directly with SQL, or the GUI, or R, and do your analysis right there in the same spot. Fewer moving parts; a simpler design.

Here's Turkcell, one of our customers that is a public reference, just to give you an idea. They have prepaid phone cards that were costing millions and millions of dollars a year in fraud: you'd get a prepaid card and someone would start illegally using it for money to buy different things. Detecting that originally took days to weeks, I think, and they've cut it down to almost immediate: they can detect that you seem to be using this card in an illegal sort of fashion, and they can stop the card. If you go to a gas station, American Express will ask you to type in your zip code; there are different tricks like that you can do, but the hard part is detecting the behavior, as quickly as you can, and the behavior may be complicated. It's not just a simple business rule you're trying to capture and automate.
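Fraud cases like the Turkcell one are typically modeled as anomaly detection. As a hedged sketch of how that looks in the database: in the documented Oracle API, passing a NULL target with the support vector machine algorithm yields a one-class ("anomaly") model. The CARD_TRANSACTIONS table and column names here are hypothetical.

    BEGIN
      -- Settings table (created as in the earlier sketch) names the
      -- SVM algorithm; a NULL target turns classification into
      -- one-class anomaly learning
      dbms_data_mining.create_model(
        model_name          => 'CARD_FRAUD_SVM',
        mining_function     => dbms_data_mining.classification,
        data_table_name     => 'CARD_TRANSACTIONS',
        case_id_column_name => 'TXN_ID',
        target_column_name  => NULL,
        settings_table_name => 'SVM_SETTINGS');
    END;
    /

    -- Flag transactions the model considers atypical (prediction = 0)
    SELECT txn_id
    FROM   card_transactions
    WHERE  PREDICTION(card_fraud_svm USING *) = 0;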
So, at a high level now, stepping back a little: what we have assembled at Oracle is the database, and within the database are various options that are separately licensed but installed by default. If you install the Oracle database, you will have the Oracle Advanced Analytics option already installed, and you can access it via the GUI, which I'll show some demos of, the R interface, which I'll also demo, and applications, which I'll show a little of as well. It's really a platform that undergirds anything you want to build on it, and it's more than a toolbox; it's meant to be a serious, industrial-strength big data management and big data analytics platform.

So let's see how we'd put this stuff to use. I like Albert Einstein's comment here: if I had an hour to solve a problem, I'd spend most of the time thinking about the problem. What is the definition of the problem? I'm trying to find churners; well, do I have a field in the database that says "you churned, yes or no," or "you bought more than $500 worth of shoes, yes or no"? Chances are I don't. When we do this in the database, I have to start with that business problem. I don't move the data; that's the big change. I have to assemble the right data for the problem, then create new derived variables, like the maximum speed I've ever seen you drive over the speed limit, 10 miles over, 50 miles over, whatever it is; be creative. Then various analytical methodologies, which I'll cover a little more, transform the data into actionable insights: build a good predictive model and, more importantly, deploy it.

Let's go back to those original problem statements: I want to predict employees that are likely to leave, or target the best customers, or how can I make more money. Those are all rather weak statements. A better statement, in terms of "how can I make more money," would be: what helps me sell soft drinks and coffee, since I make a lot more profit on those products, and how can I sell more of them? Predict customers likely to churn? Well, based on past customers who churned; but how much in advance of churning do I want to predict? If I want a one-month or three-month look-ahead, I need to look at what the customers looked like one month or three months before they actually churned. So I have to be really specific about assembling the data, and eventually I map the problem to some data mining technique; we covered what those were earlier on.

If I'm going to use the GUI, and I'll do a brief demo in a second, this gives you a feel for what it looks like. I can bring in point-of-sale data, all the transactional data. If I'm Walmart and I have 10,000 items, I don't have to allocate ten thousand columns in the database; it's stored as a very sparse representation. I just say Charlie bought product A, product Z, and product B, and those are just three rows in the database, not ten thousand columns of zeros with three ones; it's not done that way. We can take that data, aggregate it, and bring it in (see the sparse-data sketch below). If I have unstructured data, I can prep that too, in an even more automated fashion now. I can even nest a predictive model in the middle of this workflow: say customers never told me their credit score, and I don't want to pay a third-party service, so I'd like to make my own prediction based on the information I have, like how timely they've been paying their bills. I assemble all of that together and I finally have a 360-degree view of my customer: I've joined all this data together, and now I go off and build multiple models, score them, and get my predictions. And if I want to automate this, the GUI is laying down little SQL breadcrumbs, and I can hand that script to my DBA and say: could you please automate this whole methodology? If you're a data scientist or a data analyst or a business analyst, and you have a tool that's easy to use and you know your business domain rather well, you can do a lot of cool things and add a lot of value to your organization.

Here's another thing I wanted to include in this talk. A lot of people come to me and say: hey, can you replicate in your tool this predictive model I'm using in R or SAS or something, I've got 20 variables, can you replicate that? The answer is usually yes, of course; we have the same algorithms implemented inside the database. But why are you stopping at 20? Why don't we add some more variables? Get me more data; maybe it's predictive in nature. I can get up to 250 variables or more, and with the advent of big data I can bring in your geospatial data as you move around with your Android or your iPhone, I can get all your web clicks, I can get your comments on any kind of website. Any and all of those variables are candidates for building a much more accurate, more predictive model, and I can do that much more easily because the data management platform is also my data analysis platform.
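Here is the sparse, transactional shape mentioned above, and a hedged sketch of mining it for market basket rules. The POS_TRANSACTIONS table is hypothetical; association models in the Oracle API default to the Apriori algorithm, and a two-column case/item layout is the classic transactional input.

    -- One row per (customer, item): three purchases are three rows,
    -- not 10,000 mostly-zero columns
    CREATE TABLE pos_transactions (
      cust_id NUMBER,
      prod_id VARCHAR2(20));

    BEGIN
      dbms_data_mining.create_model(
        model_name          => 'BASKET_RULES',
        mining_function     => dbms_data_mining.association,
        data_table_name     => 'POS_TRANSACTIONS',
        case_id_column_name => 'CUST_ID');
    END;
    /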
This sounds like a lot of fun, but we're reminded by George Box, a famous statistician, that essentially all models are wrong, but some are useful; they may not be perfect, but they can still be put to work. Here's a use case we did that I was a little bit involved with myself, although others did most of the yeoman's work. A sailing team had a lot of data and they were trying to make the boat go faster. They said: we always collect data on the wind speed, the water speed, the tension on the lines; could we use it to make the boat go faster? The way they did it was to collect a lot of data; they had another boat that traveled behind with Bluetooth, picking up all the sensor data, loading it into a database, and doing nightly analysis of all that data using Oracle Data Mining. They found different factors in different scenarios that they could capitalize on. Without trying to reveal too many secrets, essentially they took all this sensor data, race data, and supplemental weather data, joined it all together, and built unsupervised learning clustering models: this race scenario is choppy water and light winds, that one is flat water and strong winds, whatever it was. For those different race conditions and directions they built separate velocity made good (VMG) predictive models, and depending on the conditions and the leg of the race they were in, they would change accordingly and operate much more intelligently. It was a great big win for us; we were very happy with it.

dunnhumby has been a good public reference for us as well; all the references have been good, but dunnhumby is interesting because they're so well known. They're part of the Tesco group, and they're famous for having collected lots and lots of data, analyzed it, and used it to better recommend to their customers which coupons or promotions to offer on a tailored basis; that's really their strength. This was taking them the better part of a week for these weekly promotions, and they found they just had to move everything into an Oracle Exadata machine and leverage the analytics in there. So that's another use case. The third one is Dunkin' Donuts. Remember, they're trying to find more profitable products; they want to sell more coffee and more soda, so what goes with those? One thing they've talked about publicly is trying to sell more of what are called PM flatbreads, the sandwiches you buy in the afternoon; they're trying to extend the hours when you have a good reason to go into Dunkin'. So what can they do to get more people to buy more of those sandwiches in the afternoon? Again, it's clustering analysis, market basket analysis, and then putting it all together into various dashboards and such.
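The clustering used in the sailing and retail stories maps onto the in-database clustering function. A minimal sketch, assuming a hypothetical SENSOR_READINGS table and k-means with five clusters; the constants and functions are the documented API names.

    CREATE TABLE km_settings (
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(4000));

    BEGIN
      INSERT INTO km_settings VALUES
        (dbms_data_mining.algo_name, dbms_data_mining.algo_kmeans);
      INSERT INTO km_settings VALUES
        (dbms_data_mining.clus_num_clusters, '5');

      -- Discover natural race-condition segments (choppy/light winds,
      -- flat/strong winds, ...) from the raw sensor attributes
      dbms_data_mining.create_model(
        model_name          => 'RACE_CONDITIONS_KM',
        mining_function     => dbms_data_mining.clustering,
        data_table_name     => 'SENSOR_READINGS',
        case_id_column_name => 'READING_ID',
        settings_table_name => 'KM_SETTINGS');
    END;
    /

    -- Assign each reading to its nearest condition segment
    SELECT reading_id,
           CLUSTER_ID(race_conditions_km USING *) AS segment
    FROM   sensor_readings;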
So, some brief demos; let me cut over to the live product for a minute. I have SQL Developer here; Oracle Data Miner is an extension to SQL Developer, so I'm in this Data Miner extension, and I'm going to build a workflow. First I take a data source, drag it over, click, drop it on there. It gives me different data sets I could use; I'm going to use this customer insurance lifetime value data set, and I can see the variables that are in there once it finishes. If I want to view that data, I can: there's a table of it, and there are the columns I'll be looking at: number of ATM transactions, number of transactions at the kiosk or the teller, what profession I am, my salary, credit balance, bank funds. I'm going to try to predict whether or not you buy insurance, or some sort of product. And behind the scenes it's laying down little SQL breadcrumbs; that's one of the really nice things about doing this all inside the database, because if I want to deploy this later I can do it rather easily.

If I want to explore the data, I connect this node to that one, and say maybe I want to group it by another variable, like lifetime value binned. I want to calculate a bunch of statistics, skewness, kurtosis, and so on, and I just say run. That asks the database to make a pass, calculate the statistics on all of this, and render little thumbnail histograms for all the different variables I have. If I want to view that data afterwards, I get the histogram of age grouped by something else, I get the credit card limits, each of the different variables as little thumbnails, and I can click on one to see more information. So I'm really exploring the data to make sure it makes sense to me before I try to solve a problem.

If I want to build some models, I might come down here, and I'm skipping over a number of steps, and say I want to go directly to a model build with no data preparation. Actually, before I do that, I'm going to use this Filter Columns node, because I may just want to filter some columns out. In the model build I say my target field is going to be BUY_INSURANCE, my case ID is CUSTOMER_ID, and off it goes. I'll run this Filter Columns node first, just to show what we get: it tells me that some of these variables are all constants, others are all unique, and it suggests that maybe I want to get rid of those. What I also wanted to turn on is attribute importance, which says: I'm trying to solve a particular business problem; which variables have the strongest correlation with it? Because if I have big data, lots and lots of variables, I may not want to include all of them.
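That attribute importance step also exists as a one-call API. A minimal sketch using the documented DBMS_PREDICTIVE_ANALYTICS.EXPLAIN procedure; the INSURANCE_CUSTOMERS table and the result table name are hypothetical stand-ins for the demo data.

    BEGIN
      -- Rank every column by how strongly it helps explain the target
      dbms_predictive_analytics.explain(
        data_table_name     => 'INSURANCE_CUSTOMERS',
        explain_column_name => 'BUY_INSURANCE',
        result_table_name   => 'BUY_INSURANCE_AI');
    END;
    /

    SELECT attribute_name, explanatory_value, rank
    FROM   buy_insurance_ai
    ORDER  BY rank;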
Here I've found the variables that have the strongest correlation with the target field, and I can accept those recommendations, with the recommended output settings; I'll just leave them the way they are and say run. Notice this node has not been built or tested yet; it's now going to build a generalized linear model, a support vector machine, a decision tree, and a naive Bayes model. Why? Because we have support for all those algorithms. It automatically sets up a training data set with a holdout sample for testing, to see how accurate those models are. It's finished; it chewed through about 25,000 records and about twenty-five variables. If I want to compare the test results, I can see different views of which models give you the highest lift; there's not enough time to cover it all today. If I want to view a particular model, say the decision tree, I can do that. I look at the overall population and the rule says most people are not buying insurance, but if I walk my way down this decision tree, I eventually find people down here who really are very likely to buy insurance, and the rule that describes them has been discovered in the data; here it is. So that's a real quick spin of how to go about all this, and if I like this model I can go down here and generate and deploy it as a bunch of SQL code.

I also have a workflow that's already done and a little more elaborate; we won't have time to go through it all, but I've done the column filter and built different models, and applied them at the very end. I get my list of customers who are most likely to buy something; I can sort this on probability, descending, and now here's the person most likely to buy something, and here's the reason why. I've also asked it: I know they're likely to buy, but what's the reason? Their age is 61 and in this range, their number of web transactions is 5,000, and so on. We can give a little transparency into why the machine learning algorithm has made that recommendation.

I'm going to go back to PowerPoint in a bit. At the end I can generate the scripts that automate all of that. If I wanted to run it as a direct script by itself, I have that here; it will automate the whole flow and then show me the people at the end who are most likely to respond, or in this case likely to be anomalous. I built the model right here, and at the end I use that model, this claims model, to select all customers where the prediction and some other conditions hold true. Similarly, if I want to use a model to make a real-time prediction, maybe on the fly, a single-record apply on some new data: here's a prediction probability using a previously built model and values supplied right in the query, the guy's on the website at this moment or whatever. I come in here and just run that, and it makes a prediction. Then say the $78 was a typo and they actually have much more money than that; run it again and the prediction comes back much higher, much more likely, like they'd have to do something there.
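That on-the-fly, single-record apply looks roughly like this in SQL. A minimal sketch: the model name, attribute names, and values are hypothetical, but supplying literals with USING ... AS aliases against DUAL is the documented real-time scoring pattern.

    -- Score one "record" assembled on the fly, no table needed
    SELECT PREDICTION_PROBABILITY(insurance_model, 'YES'
             USING 61    AS age,
                   78    AS bank_funds,   -- the "typo" value
                   5000  AS n_trans_web) AS prob_buy
    FROM   dual;

    -- Re-run with the corrected value and the probability jumps
    SELECT PREDICTION_PROBABILITY(insurance_model, 'YES'
             USING 61    AS age,
                   78000 AS bank_funds,
                   5000  AS n_trans_web) AS prob_buy
    FROM   dual;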
So that's a brief demo of the product. What do I do if I have fraud, or very rare cases? Well, I use different techniques: I'm trying to find things that are different at a multi-dimensional level, that stand out in some way, so I might devise a different type of workflow for that. That could be a whole topic on its own; at the end I'll point to some YouTube videos I've done in the past that you can watch.

What if I'm using R? R is widely popular, but it does have some constraints: it's memory-constrained and single-threaded. So what do we do with R? We take your R syntax and map it to SQL transparently. Your R instructions, say, give me summary statistics on some data, get mapped to the in-database equivalents, and you get the scalability of the database. If I want to call out to something we don't have support for inside the database, I can do that as well. Here's a little of how that works: I'm doing a support vector machine using the R syntax, and it maps to the equivalent in-database functions seamlessly; you won't even see it, it does it automatically, and there are your predictions.

Let me show that with R for just a second, keeping an eye on the clock. I have my R instance here, and I do, say, a histogram of cars' miles per gallon, or a plot of some sort; those are mostly just running R by itself. But the summary command here, where I calculate the min, the max, the standard deviation, all of that was pushed down for high-performance calculation inside the database. Essentially the point is that I can use my R syntax, and for the base R language and base operations it maps transparently. As you go into other packages, we'd encourage you to use the high-performance in-database decision trees, clustering, and so on that we already have. If you do want to call out to something we don't support, let's say a random forest, which we don't have in-database support for today, you'd use an open source package and do a call-out; the database controls all the data preparation and the whole process leading up to it, and at the very end it just asks the R package to do that one piece, and we either leave the results in R or bring the results back.

We also have the ability to run some algorithms on the Big Data Appliance, on Hadoop. I'm not covering it much here; I just wanted to expose you to the fact that we have a growing collection of algorithms that will actually do the analysis on the Hadoop servers. Going a little further, we have something called Big Data SQL that allows you, from one language and one platform, to access and mine all of that data. I can query the data on the Hadoop server, on what we call the Big Data Appliance, join it to other data inside the database, and build my predictive models. I may have some JSON data out there that I'm joining in, and I can build very sophisticated methodologies leveraging all of the data, both on the Hadoop server and inside the database, joined together. Big Data SQL is the Oracle technology, and I think it's pretty amazing, that allows you to do that.
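As a hedged sketch of the Big Data SQL pattern just described: a Hive-backed external table makes the Hadoop data queryable, and one SQL statement can then join it to database tables and score it. The table, directory, and model names are hypothetical; ORACLE_HIVE is the Big Data SQL access driver, though the exact access parameters depend on your configuration.

    -- Expose a Hive table on the Big Data Appliance to Oracle SQL
    CREATE TABLE web_clicks (
      cust_id  NUMBER,
      page     VARCHAR2(200),
      click_ts TIMESTAMP)
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY default_dir
      ACCESS PARAMETERS (com.oracle.bigdata.tablename=web_clicks))
    REJECT LIMIT UNLIMITED;

    -- Join Hadoop clickstream data to in-database customers and score
    SELECT c.cust_id,
           w.page,
           PREDICTION_PROBABILITY(churn_model, 'YES' USING *) AS churn_prob
    FROM   customers c
    JOIN   web_clicks w ON w.cust_id = c.cust_id
    WHERE  w.click_ts > SYSTIMESTAMP - INTERVAL '7' DAY;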
So, we've done all our analysis, and I know I've skipped over a lot; we've built our predictive models. What do we do with them? It's really not worth anything unless you can get it into the hands of others. I'm still sitting at my ATM while it asks what language I want to use, and I want it to be smarter than that. All those results and predictions are just part of the database, right? I can leave those result tables in the database, I can query them, I can build applications. This over here is an application showing that I've found different clusters, different segments, of telephone users for a telco company. If I go through here I can say, for each one of those clusters, let's say the family user, high-revenue segment: apply that cluster model, and it returns each of the members and their likelihood to churn; I can get a rule, I can drill through, I can do lots of things there. We've also done some social network analysis, where I can analyze a bunch of customers and find the interactions between people. I guess this session has expired, but we can launch it again; I'm going to skip over that, watching the clock.

So here are examples of what you can build using the data management and analytics platform, and then deploy. There's the communications example I just showed. This one predicts employee performance: I can predict which employees are likely to leave and what their performance is, I can see the top reasons, and in the real live application I can do what-if analysis: what if I were to change these input parameters, what would the likely outcome be? It's very powerful when you can operate looking ahead instead of in your rear-view mirror all the time. This is the communications data model one: if I click on it, it takes you into the documentation and shows you the models we've built. I click on the churn prediction model and it tells you the variables used in that analysis; this is somebody in the telco business using their domain knowledge. How long has the billing address been effective, in days; lifespan in days; customer call count by call center; customer call count for the total lifetime. There are a lot of variables that some business analyst, some domain expert, thought were possibly useful in building a predictive model, and of course we use all of those.

We've solved some different problems with this. Here's the visual I was looking for before, the social network analysis. In this industry it's well known that if you churn and go from, say, Verizon to AT&T, your friends and family, maybe your top 25 contacts, are two to three times more likely to also leave. I want to know that information when I'm trying to manage my customers, and I want to check the pulse of those friends and family members as well, because I know who you're making calls to and from, I know the age of their equipment, and I know if they churned. If I add that new source of information into my analysis, I can build an even better model, more proactively manage my customers, and keep them all happier.
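That friends-and-family signal can be fed to a model as a simple derived variable. A minimal sketch, assuming hypothetical CALLS and CUSTOMERS tables: count each customer's distinct contacts who have already churned and use that count as a model input.

    -- How many of each customer's call contacts have churned?
    SELECT c.cust_id,
           COUNT(DISTINCT x.cust_id) AS churned_contacts
    FROM   customers c
    LEFT JOIN calls k     ON k.caller_id = c.cust_id
    LEFT JOIN customers x ON x.cust_id   = k.callee_id
                         AND x.churned   = 'YES'
    GROUP  BY c.cust_id;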
Let me just touch on some new features I think are cool, since we still have a little time. In the latest release of the database, 12c, and the latest release of the GUI I'm showing, SQL Developer 4.1, we added a new graph node and a SQL query node that also allows you to integrate R scripts; we've had the automatic script generation, with the query node running on the server side; we've added some new algorithms, expectation maximization clustering and a singular value decomposition algorithm, along with support vector machine enhancements; and we've added something called predictive queries, which I'll cover in a minute. Here's a visual of the graph node: we can just launch a box plot, a scatter plot, an a la carte graph or two or three where you really want to see the data. Here's the SQL query node, which I sometimes call the SQL duct-tape node, because it lets me do something freeform: run some arbitrary SQL, say to calculate recency, frequency, and monetary value for customers, assign an index score, and join that to the rest of the data before building my predictive models. This is also where I can integrate R scripts; there's a little tab where I can insert R scripts inside the GUI.

The text support is something that's kind of cool. Here I have text data, and one of the fields is customer comments. I always joke: if marital status is not single, married, divorced, or widowed, but is actually a paragraph, or a PowerPoint, or a PDF, or a legal document, all I do is specify that it's text. Under the hood the database figures out what language it's in, tokenizes it all, and gives me tokens or themes, and that serves up a whole vector of additional information that's fair game for the machine learning algorithms to mine.

Then there are the predictive queries. Predictive queries automate the whole data mining process; it's kind of like Zillow. I have a bunch of data and I just say: show me a prediction, are they likely to buy a refrigerator, yes or no, based on past people that bought a refrigerator. It builds a model, tunes the model, applies the model, throws the model away, and returns a result set. All I do is say show me the predictions, and it does the rest, sort of like Zillow with the values of the homes.
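In SQL, the 12c predictive-query form uses the mining functions with an OVER clause and no pre-built model. A minimal sketch, assuming a hypothetical CUSTOMERS table where BUY_REFRIGERATOR is known for some rows; this follows the documented PREDICTION analytic form, though details vary by release.

    -- Build, apply, and discard a transient model in one query
    SELECT cust_id,
           PREDICTION(FOR buy_refrigerator USING *) OVER () AS likely_buyer
    FROM   customers;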
So, getting started, assuming this is where you want to go and you want to explore this with Oracle: it's very easy. If you send an email to me at charlie.berger@oracle.com, which should have been on the cover slide, I can send you back my list of free links; here are some of them. If you go to the main pages, here's the Oracle Advanced Analytics option on OTN, the Oracle Technology Network, with a bunch of information about what the product is. When I click on "learn more" I've got a data sheet, whitepapers, and these tutorials, which are really useful: setting up Oracle Data Miner, using it, and the Oracle Data Miner drill-down material. If you take the Oracle Data Miner course, it's free and it's out there, you'll get a tutorial that tells you here's what you're going to do and here's how you step through it; it's about a 30 to 45 minute step-by-step tutorial. There are also a number of customer reference stories. I've noticed I focused mostly on the GUI side, which I think is easier to use; if you prefer to program in R, there's the whole R side of things, with equivalent learning courses and tutorials covering the transparency layer, embedded R execution, and so on. The last thing I want to show is my blog, in those links, with a number of YouTube videos; if you just want to see a YouTube video on, say, in-database analytics for retail market basket analysis, it's had something like five thousand clicks, and you'll see me going through the whole thing, and a little further in there's a whole demo. So there are a number of links you can pursue if you want to investigate further.

There are also some great books out there: one on the Oracle Data Miner GUI written by a great consultant, Brendan Tierney, and another, a great book on the R side of OAA, written by Mark Hornick and Tom Plunkett; those are available on Amazon. And if you just want to take a test drive, I have a great partner, Vlamis Software, who hosts all of this on the Amazon cloud. Google the Vlamis test drives, fill out your information, and they'll give you coordinates to connect in with a remote desktop, and you'll have three hours to play with the software; the tutorials are out there, there's demo data, and after three hours it turns off, but you can come back in and log in for another three hours; they just constrain it that way. You can even load up your own data; you can really take it for a test drive this afternoon if you want. Lastly, I know some of you may be out in the California area, but we have the BIWA Summit, the BI, Warehousing and Analytics user group meeting, once a year, plus some web conferences and other things; I encourage you to take a look. We do sometimes invite students in, maybe in exchange for helping out a little, to get free admission, and it's pretty inexpensive anyway, about four or five hundred dollars for customers, for a two-and-a-half-day conference with a lot of hands-on learning labs and twenty-five tracks, five going on concurrently; a lot of technical content for your money. And that's it. We've got about ten minutes left; I don't know if there are any questions on the line, but that's what I wanted to cover today. I know I went through it a little fast, but it's being recorded, so you can always put it on pause or back it up. Are there any questions?

There are; I actually do have some questions for you. First, thank you so much, that was so interesting. So, the first question: how would a student with classic database design technique knowledge
best learn the modeling techniques demonstrated in SQL Developer? I would say use those tutorials; it's pretty straightforward. One of my favorite case studies is Fiserv; they've presented publicly on this. They're looking for fraud, and they talk about their data analysts taking two or three or four weeks to take the data and analyze it, using SAS, SPSS, R, or even our tools, although with a much faster turnaround. But there's this database guy in the back room, the guy they go to all the time to get their data: can you give me all these transactions grouped by this, and I want to join in this table? The guy with the database knowledge is actually doing a lot of the work already, and at the very end they just have to press go and detect a cluster or whatever. So this guy at Fiserv is my favorite: he's now building models that are almost as good, ninety-nine percent as good, as what the PhD data scientist types are doing, and he's doing it in an hour, because he already knows how to do most of the work. So if you're a SQL person, you can go to the pages I was showing; the documentation takes you right into the directory of the APIs, and there's a bunch of SQL scripts in there. If you're much more of a GUI person, frankly like myself, you can just take those tutorials, and again, if you're an R person, there's the equivalent R material. There's also a two-day course on Oracle Data Mining that Oracle University puts on, but honestly I think if you just take those tutorials they explain pretty well what's going on, and if anybody has questions, just send me one and I can follow up offline. Any more questions?

Yep, I've got some coming in. If I were to pick skills from a menu to become a data scientist, what would be my top five skills or technologies to learn? That's a good question, and you could Google it; go to that KDnuggets website, or just search for what skills are needed. It's interesting: Brendan Tierney, who wrote that book, talks about this in it and in some of his presentations. Here's what I always ask: do you already know the difference between a mean and a median? It's a pretty easy question for most people, and I know some sales guys here will say hey, no tough questions, but if you are analytically inclined and you know the difference between a mean and a median, you're already fifty percent qualified. Do you know your business? If I'm trying to sell shoes, do I know the factors that sell shoes? When we did the predictive analytics for predicting employee turnover, we worked with the PeopleSoft development team, now called the Oracle Human Capital Management team, and they said: well, Charlie, everyone knows that the insurance policy is one of the biggest predictors. No, I didn't know that! Oh yeah, sure: does the employee have a really shallow, basic implementation of an insurance policy, no dental and so on, or do they have a family and a lot of kids and pay for the premium levels of all the benefits? That was a variable that, through their domain expertise, they knew should be included in the models. So a lot of it is:
So a lot of it is: know your business, know your domain, be analytically curious. And beyond that, if we're doing our job right with that drag-and-drop GUI, you should be able to just drag and drop your way through this stuff. Our goal is to make it simple and automated; in a lot of cases you'll have challenges, but you can go back to your DBA and say, hey, can you help me do something a little trickier? So: be analytically curious, know your domain, and take a few courses, learn some things online, or just do the Oracle course.

And the last factor they always mention is the ability to tell a story. You get this whole analysis done, like how Obama got elected (there's a whole analytical process behind that which you can google and read about), and you want to be able to articulate what am I doing here and how is this all working. You don't want to sound like a techno-speak guy, so being able to tell an interesting analytical story rounds out those skills.

I have two more, and just because we're getting close to the top of the hour: how does Oracle Advanced Analytics help with assembling the right data?

You have direct access. If you're using SQL Developer, as I'm showing right now, or if you know R, you can do the same thing from a command line; personally I'm a little bit more inclined to use the GUI. So if I'm down here and I want to get my data: now theoretically, if I don't have access to salary, you know, Oracle employee salaries, then the DBA doesn't give me that, or I'm not allowed to have it. But here in the SH schema I have the customers tables. I have the data source here, and I may need some customer transactional data down here, so to pull in the supplementary demographics I have to do a join: over here is my join node, and now I just connect this guy to that guy and this guy to that guy. I might want to bring in that transactional data over here, the sales data, and I need to put an aggregate node in here to total all those guys up. If I have access to the tables and I kind of know the data (I may need to go down the hall and talk to my database guy), this is essentially what I'm doing here, and the data right here is now the data that I would then start pushing through something else. You need to know what the data is, and you need to know what the unique identifier for the join is, but that's kind of how that works. (A rough SQL equivalent of that join-plus-aggregate workflow is sketched below.)
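For readers following along, here is roughly the SQL that a join node plus an aggregate node over the standard SH sample schema would amount to; the selected columns are just illustrative, not the exact ones used in the demo.

    -- Join customers to their demographics, plus per-customer sales totals.
    SELECT c.cust_id,
           c.cust_gender,
           c.cust_year_of_birth,
           d.household_size,
           d.affinity_card,
           s.total_spent,
           s.txn_count
    FROM   sh.customers c
    JOIN   sh.supplementary_demographics d ON d.cust_id = c.cust_id
    JOIN  (SELECT cust_id,
                  SUM(amount_sold) AS total_spent,   -- the "aggregate node"
                  COUNT(*)         AS txn_count
           FROM   sh.sales
           GROUP  BY cust_id) s ON s.cust_id = c.cust_id;

CUST_ID is the unique identifier he mentions; the inner query does the totaling that the aggregate node does in the workflow, and the result is the one-row-per-customer table you would feed into a model build.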
So I have one final comment, and then I'll ask my last question. One of our attendees says that seeing this presentation was like watching a rock star at his guitar; it was amazing to see this stuff in action. So you're getting lots of great comments, and like I said, we really, really appreciate this. And this is the final question, as we're coming up on the top of the hour: where do you think data mining and analytics are headed in the next three to five years?

Geez, I hope all for the greater good. I mean, I want to go to my ATM and have it recommend the right thing. I think you know what it is on most of your Droids and iPhones: in the really good applications, the data mining is just invisible now. When I started out in this stuff fifteen or more years ago, someone told me there are actually neural nets in the ABS braking in my car. I drive a Toyota RAV4 and, gosh, I don't know, maybe there are. But with data mining and analytics, the use cases are discovered by smart people like all of you, they are put into applications by companies in clever ways where they have the best value, and they are just automated and, if you will, made invisible. That's where I think it's all going. You want to go to a rock and roll concert? It's not going to recommend something that's not your thing. You submit your taxes (and we're working with partners that do this today), you put in your deductions for charity, and if you say something like two dollars, or twenty thousand dollars, it'll catch you right then and there: really, Charlie, two dollars? Because usually you do this, or other people of your income generally do that. So it can be real-time, pervasive, and at the end of the day useful. (A rough sketch of that kind of anomaly detection follows below.)

And, proud dad here: my daughter is actually doing predictive analytics for her dissertation, predicting from satellite imagery and weather data and such the likelihood of getting Lyme disease, based on the prevalence of ticks out in the woods as you go, almost like a tick forecast. So if you can think it and you've got the data, you can do it these days. I wish you all the best of luck going forward.
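That tax-deduction catch is classic anomaly detection, and in Oracle Advanced Analytics the documented route is a one-class support vector machine: a classification model built with a NULL target. The tables and columns below are hypothetical placeholders for a tax-filing data set, not anything shown in the webcast.

    -- With no target column, CREATE_MODEL builds a one-class (anomaly) model.
    CREATE TABLE ad_settings (setting_name VARCHAR2(30), setting_value VARCHAR2(4000));
    INSERT INTO ad_settings VALUES ('ALGO_NAME', 'ALGO_SUPPORT_VECTOR_MACHINES');

    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'DEDUCTION_AD',
        mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
        data_table_name     => 'RETURNS_TRAIN',   -- hypothetical filings view
        case_id_column_name => 'RETURN_ID',
        target_column_name  => NULL,              -- NULL target = one-class SVM
        settings_table_name => 'AD_SETTINGS');
    END;
    /

    -- A prediction of 0 flags the unusual filings; rank by how unusual.
    SELECT return_id
    FROM   returns_new
    WHERE  PREDICTION(DEDUCTION_AD USING *) = 0
    ORDER  BY PREDICTION_PROBABILITY(DEDUCTION_AD, 0 USING *) DESC;

Because scoring is just a SQL function, a check like this can run inside the application at submit time, which is exactly the "invisible, real-time" pattern described above.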
Thank you, Charlie, and thank you, everybody, for attending. Join us next month for a recorded Ask the Experts session led by our EMEA team. Thank you all for joining today, and we will see you next time.