Operational Data Analytics with Elasticsearch, Elastic Stack (ELK Stack)



...like to say it's just another index in Elasticsearch. Whether that index is a logs index, a metrics or monitoring index, or an APM index, you can now take all three of these indexes and make interesting realizations as a result of them. And to show you just the tip of what's possible if you take all of this data and put it together (logs, metrics, and APM), I'd like to welcome Tonya on stage. Tonya?

Thanks. I've got a long walk here. All right, so this demo scenario should be familiar to many of you, just based on your experience shopping on the Internet. You can look for an item on this site, examine it, decide that you want it, add it to a cart, and hopefully purchase it. The infrastructure underlying the site is also based on fairly common components you'll recognize: Django for application servers, Apache and nginx for proxies, MySQL for a database, and of course Elasticsearch for site search.

From a monitoring perspective, we're doing exactly what I just mentioned: we're bringing together all these historically disparate sources of operational data into a single instance of Elasticsearch dedicated to monitoring this site. So we're taking the Elastic APM
Python agent, which is running within Django, looking at things like durations of operations within an application server, and errors. We're taking Metricbeat and Packetbeat, something that Monica talked about earlier, to look at infrastructure metrics. And finally, Filebeat and Auditbeat add log information and other event data to help us with root cause analysis. On top of all of these KPIs we of course have Kibana, and we have machine learning and alerting running on all of this data holistically, to give us a view into all of the anomalies we detect in what's happening on this site.

As an operator of the site, one of the things I'm really concerned with is abandon rate. This is when users place something in a cart but then don't purchase it. Of course it's going to happen; some users will change their mind. But what we don't want to see is a sudden spike in abandon rate that's unusual for the time of day or day of week. Usually that indicates that the site is not performing well and is preventing users from buying your products and from having a good experience. So of course I'm very concerned when I see exactly that at the top of my inbox: a high abandon rate anomaly, detected by a machine learning job looking at that specific KPI. Looking at this machine learning job, the silver lining is that at least we detected this right when it started crossing an important anomaly threshold.

Let's go ahead and troubleshoot this. This is a Kibana dashboard that starts at the business-level view. I'm using transactional data that's recorded from the application straight into Elasticsearch to look at various business-level KPIs, including the above-mentioned abandon rate. Here's my most recent spike, and here's this blue line, which is inching toward the top: not something you want your business trend to be. Looking below, I notice that my sales are also affected, so this is not a red herring. This is something that's immediately affecting
my business, and I need to investigate as soon as possible.

What's really cool about this dashboard is that, in addition to the business KPIs, we can weave in the APM data that's most critical to this view. In this particular case, I want to see any user-facing errors on the site right next to all these indicators, as a first level of potential root cause. And in this case there is indeed a cluster of errors happening right now, so we'll jump over and investigate to see if that's relevant to what we're looking at.

This is the APM UI that Ron just showed you, and I'm not going to go through it in detail. I'll just look at the errors in particular. There are a few of them that are possibly innocuous, but this checkout internal server error looks really bad. If you're a developer, you'll notice this is a number parse exception in a phone number validation procedure. Thankfully, that's probably very easy to fix. We just need to do it immediately, because clearly this is not allowing users to purchase items from our site.

Now, I wish I was done, but nothing in operations is that easy, as you probably know. If you look back at some of the previous spikes that didn't trigger the machine learning job, maybe because the average rate wasn't as high, those look pretty concerning as well, and they happened way before the errors started clustering together. So possibly there's something else happening on the site that we should investigate.

As you probably know, another problem commonly felt by users that makes them walk away is slowness. If a site doesn't load, or only partially loads, most of you will probably just say, "Never mind, I'm going to go somewhere else." So, as an operator of the site, I'm again very concerned to see site performance anomalies found in this data set, and those go way back. Let me start with the earliest one, which again was detected by a machine learning job. I'm going to go ahead and take a look at that, and the first thing I note in this
view is that the slowness does not pertain to one particular part of our site. It's actually spread across several important views, such as the basket view, the order complete view, and the default checkout: all important views, not just one particular page. When I see that as an operator, I typically suspect an infrastructure problem, so that's where I'll look first.

I'll look at my infrastructure analysis dashboard. Again, this is a Kibana dashboard, and everything in here is just based on another index in Elasticsearch, but it brings all of the data together to help me troubleshoot this scenario. We see site response times from APM; memory usage, CPU usage, and total disk I/O from Metricbeat; network data from Packetbeat; and finally slow logs from MySQL, and auditd data.

Let's go ahead and take a look first at site response times. That's where we'll start, with the user experience. We see that this started on January 20th. If I take a look at the time frame when this began, I note a couple of things. First of all, this memory profile doesn't look great: it's very concerning to see the memory consumption on my database just go up in an unbounded manner. But the CPU spikes also look very strange and very regular, and that's what I'll investigate first, given that they are so obviously strange and appear across the whole infrastructure.

Zooming in on one of those spikes, I can see that there's a lot of traffic going between the MySQL database underlying the site and something called proxy01, which I know is my internal proxy. I don't expect that to happen; I don't expect that much traffic between an internal proxy and my database. So let's go ahead and take a look at just that proxy, and a curious thing pops up. We see auditd data, and one of the things it looks at is unusual processes, processes that were started on all of our hosts, and something called reported.py jumps to the top of that list. Let's take a look. All
right, that sounds like some sort of reporting job. Again, I don't expect that to be happening. If I can only click in the right place, we can see that this job is indeed happening very regularly. If we look closer, we see that it's happening every minute. Now, I think that's suspicious, but if there's a business need to do that, we need to move it off our production database. That's definitely not something this site was designed to handle from a load perspective. If we go back to a wider time interval, we see that this kind of load is causing a lot of problems on the database itself: very slow SELECT statements across the whole site. So we need to isolate this process to some other place.

In summary, this demo really highlights how elegantly and simply the Elastic Stack brings together these historically separate sources of data that are so important to operations: APM, infrastructure metrics, and logs. Speaking to users is part of my job, and many of you have already embraced that: you've put logs and metrics into Elasticsearch, and you're looking forward to adding APM to that scenario. But for some of you, I find that in your environments you've historically relied on separate tools for these jobs. They're hyper-focused on one use case, and that's nice, but the thing many of you miss is this ability to bring the data together to tell a story. Some of you have actually built custom UIs to do that, and that's very expensive, and many of you can't really afford to do it. So what ends up happening is that troubleshooting scenarios that really should take minutes, maybe hours, take days or even weeks, while the business is basically bleeding money.

So my hope is that this demo inspires you to take a look at your operational intelligence and think about whether Elastic can help you there. Hopefully we can focus on making operations more efficient while you focus on running your business. Thank you very much.

Thank
you very much, Tonya. I love this demo because it basically proves the fact that every operational problem boils down to either date parsing or a misconfigured cron job.
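
The abandon-rate check at the heart of the demo can be sketched in a few lines of Python. This is only a rough illustration, not the algorithm the Elastic machine learning job uses: it swaps in a plain z-score against past values for the same hour of the week. The function names, the threshold of 3.0, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def abandon_rate(carts_created: int, purchases: int) -> float:
    """Fraction of carts that did not end in a purchase."""
    if carts_created == 0:
        return 0.0
    return (carts_created - purchases) / carts_created


def is_unusual(current: float, history: list[float], threshold: float = 3.0) -> bool:
    """Flag `current` when it is more than `threshold` standard deviations
    above the mean of `history` (a simple z-score, assumed for illustration)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold


# Hypothetical abandon rates observed at this same hour in previous weeks.
same_hour_history = [0.30, 0.28, 0.33, 0.31, 0.29]

# Hypothetical counts for the current hour: 200 carts, 60 purchases.
current = abandon_rate(carts_created=200, purchases=60)
print(is_unusual(current, same_hour_history))
```

Comparing against the same hour of previous weeks is what keeps ordinary daily and weekly swings from raising alerts, which matches the "unusual for the time of day or day of week" condition described in the demo.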
