Webinar | Big Data Analytics with Cassandra and Spark



Okay, terrific. So, sorry for the delay, folks; I'll try to keep things moving pretty quickly and still cover most if not all of the material we had for today.

Let's start with this guy. This is a little bit of history: Willie Sutton was a bank robber in the 1930s, 40s, and actually into the 50s, and within the first month of the creation of the FBI's most wanted list, Willie managed to get onto it. He was captured a number of times, managed to escape a few times, and was finally captured in 1952. The interesting thing about Willie is that after his career of robbing banks they asked him: why do you do it, why do you rob banks? And his answer was, that's where the money is. So we're going to keep Willie in the back of our minds for a little while as we talk through the other pieces of the puzzle, and we'll come back to him towards the end.

Now, talking about Cassandra and Spark and this whole space, I think a motivating example helps ground things, so the case I'll talk about primarily here is an Internet of Things use case. You know those pictures about connecting toasters to the Internet; we've been talking about doing that for something like 30 years, and I still have never understood what the toasters would have to say. Nonetheless, the idea is that we're going to have all these connected devices across the enterprise and out in the world, connecting back to some location to deliver data. We see this in a whole bunch of places: thermostats like Nest and some of the others getting into that space, connected cars, manufacturing floors sending telemetry data, smart metering, and so on. In all of these, the idea is that we've got our sensors or devices out there on the Internet and they are sending data back to some central system. That central system can certainly be distributed around the world, but from a logical standpoint the data is coming into this one system, and now we have to think about what that system should do.

It needs to be able to receive data from all these places, and it also needs to be able to answer some relatively straightforward questions. Those could be questions like "what did that silver toaster say last?", or something a little fancier like "what was the average number of pieces of toast per toaster type?", or something along those lines. Now, if we are at all successful in what we're doing, we're going to see more toasters coming onto the scene, and our system is going to need to handle the fact that we're successful: it has to tolerate the growth in the number of sensors out there. The other part is that we're in the real world, where faults will happen and nodes will go down for hardware reasons or something else. We had the situation almost a year ago where Amazon had to reboot a whole bunch of servers for some regular maintenance, and some folks hosting things on Amazon saw their systems go offline while other folks, like Netflix, kept rolling just fine. We now talk about the Amazon outage and not the Netflix outage.

So in that Internet of Things context we're going to talk about two things, Cassandra and Spark, and then we're going to do the "two great tastes that taste great together" thing and bring the two pieces together.
So first, a bit on Apache Cassandra. Apache Cassandra is a distributed NoSQL database. It is sort of the love child of the BigTable paper from Google and the Dynamo paper from Amazon. BigTable is really the data structure side: tabular data with column families, stored sparsely, in other words we don't store the nulls that don't happen for certain columns, and it's a very flexible data structure. The Dynamo paper was all about resilience and fault tolerance, and this concept of having no master.

That masterless design is an interesting element of Cassandra that we make a lot of use of: all the nodes are equal, and in that sort of democratized world we don't need to worry about a couple of things. First of all, there isn't a master to go down, so we're always on; if a node has a failure, say a network failure, and goes offline, the other nodes are able to pick up the pieces for it. Similarly, since all nodes are the same, all nodes are answering queries from client applications as well as owning and serving up some part of the data. That means if we double the number of nodes in the system, we're not only doubling the amount of storage, we're also doubling the number of questions we can answer at the same time. In Cassandra we usually talk about transactions, merging reads and writes into one bucket; there certainly are ways in which the read path and the write path differ a bit, but we usually talk about transactions per second, so when we scale out we get more transactions and more data.

One of the things baked into Cassandra, which I won't dwell on too much in this talk, is the multi data center idea. Data centers can be geographically disparate for a couple of reasons. One is fault tolerance: if there were a flood in, say, New York, like Hurricane Sandy, and a data center there went out, the application can fail over to another data center, say one high up on a mountaintop far away from any flood plain. Another reason for geographic distribution is locality: the clients in New York talk to the New York data center and the clients in Los Angeles talk to the Los Angeles data center. But for this talk we're actually more interested in a third reason, which is workload isolation. Those data centers can be virtual data centers; they can actually be in the same physical location, and we've just isolated them so that if you're doing a slightly different workload you've got different hardware dedicated to it.

Last, a little bit on the Cassandra Query Language. When Cassandra first came out it was admittedly not the easiest thing to use; a lot of the easy, very straightforward use cases still required you to get down into the guts. Admittedly the harder, more exotic ones also required you to get down into the guts, but you were probably willing to go that far for something that custom. So we came up with this thing called the Cassandra Query Language, CQL, which looks a lot like SQL: it covers all the operations you would do in Cassandra in the normal use cases, makes them very simple, and they look like the SQL analog.
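To make that concrete, here is a minimal sketch using the DataStax Java driver from Scala (the webinar itself doesn't show code). The contact point, keyspace, table, and data center names are all made up for illustration; the keyspace uses NetworkTopologyStrategy with two (possibly virtual) data centers along the lines of the workload isolation idea above.

```scala
import com.datastax.driver.core.Cluster

// Contact point, keyspace, table, and data center names are all made up;
// the DC names in the replication map must match your cluster's real
// (possibly virtual) data center names.
val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
val session = cluster.connect()

// Replication factor 3 in each of two data centers, e.g. one for the
// operational workload and one isolated for analytics.
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS iot
    |WITH replication = {'class': 'NetworkTopologyStrategy',
    |                    'operational': 3, 'analytics': 3}""".stripMargin)

// A sparse, wide-row style table: sensor_id is the partition key and
// reading_time clusters the readings within each partition.
session.execute(
  """CREATE TABLE IF NOT EXISTS iot.sensor_readings (
    |  sensor_id    text,
    |  reading_time timestamp,
    |  temperature  double,
    |  PRIMARY KEY ((sensor_id), reading_time)
    |)""".stripMargin)

session.execute(
  "INSERT INTO iot.sensor_readings (sensor_id, reading_time, temperature) " +
    "VALUES ('nest-42', '2015-08-12 10:30:00', 73.0)")

val row = session.execute(
  "SELECT temperature FROM iot.sensor_readings WHERE sensor_id = 'nest-42'").one()
println(row.getDouble("temperature"))

session.close()
cluster.close()
```

The DDL and DML really do read like their SQL analogs, which is exactly the point the talk is making.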
Now, that being said, CQL can be a little deceiving: there are things that look like you should be able to do, because isn't this an SQL database, but you cannot, and that's part of where Spark comes back into the story. We'll talk about that a little later.

If we think briefly about how Cassandra works on writes: we've got our little Nest thermostat here (Nest actually does use Cassandra), and it's going to connect to the cluster. When it connects, it actually sees the whole set of nodes, and it can make all sorts of intelligent load balancing decisions. We'll do something simple and say it's doing round robin, sharing its workload among these five nodes as it goes around, so it just tells one of them, "hey, it's 73 degrees." Now, that node may not actually own the data, and if it doesn't, that's okay: it's happy to be what we call the coordinator. I think of the coordinator more as the maitre d'. He says, "I'll take care of you; I can route that to whoever needs to have that information." Internally he passes it on to the node that owns the data, that node responds when it's done, and the coordinator, the maitre d', replies back to the thermostat, the client.

There are a few things going on there that we can start controlling, and that comes under the large umbrella of tunable consistency. Consistency is one of the parts of ACID, but it's frequently something we take for granted in the SQL space and don't really take advantage of, yet we end up paying for it. One of the things that's been done in Cassandra is relaxing consistency and letting people tune it to the needs they have, and by doing that we can get different scale-out characteristics, fault tolerance characteristics, and so on.

The first thing to note is that data in Cassandra is replicated; you set that when you create your keyspace. Data is distributed by token range: each node is responsible for some portion of the range of tokens for rows. All rows have a primary key which maps to a token, and the ranges tell you where that data should be located. To keep it simple, say we had a hundred tokens (we actually have a 2 to the 64 token space, which is a little too big even for illustrative purposes). With a hundred tokens, node one might take tokens 1 to 20, node two takes tokens 21 to 40, and so on.
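Just to illustrate that ownership idea, here is a toy sketch of the simplified 100-token ring described above. The hash function and node names are stand-ins, not what Cassandra actually uses: the real ring is a 2^64 token space, Cassandra uses a Murmur3 hash, and each token range is replicated to several owners.

```scala
// A toy model of the simplified 100-token ring: five nodes, 20 tokens each.
val nodes = Seq("node1", "node2", "node3", "node4", "node5")
val tokensPerNode = 100 / nodes.length                 // 20 tokens each: 1-20, 21-40, ...

def token(partitionKey: String): Int =
  ((partitionKey.hashCode % 100) + 100) % 100 + 1      // stand-in hash into 1..100

def owner(partitionKey: String): String =
  nodes((token(partitionKey) - 1) / tokensPerNode)

// Any node can act as the coordinator for a request, but the row itself lives
// on the node(s) that own its token.
println(s"'nest-42' hashes to token ${token("nest-42")}, owned by ${owner("nest-42")}")
```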
Now, when we do writes, we have options for how many of those replicas need to acknowledge the data before we tell the client it's all done, and by combining reads and writes and these consistency levels we're in a position to ensure certain things about how the data is distributed. This becomes a core part of the application and of the design decisions we make.

The first option, the most relaxed one (upper left), is basically telling the maitre d': as long as you have it, even if you don't own it, as long as you acknowledge that you've received this data, tell me, the client, and I'll be happy. What the coordinator does in the background, say we have a replication factor of three, which is pretty common, is make sure that all three replicas eventually get the data; it's just a matter of when he tells the client that things are done and what assurances you get around it. So the upper left is the most relaxed: as long as the coordinator has it, it's fine. The lower left is: just make sure one of the replicas has it; if one replica has it, I know it's durably in the system somewhere and I'm happy. The upper right is probably the most common, which is quorum: ensure that a majority of the replicas have actually seen the data, and then we can do some interesting things with respect to guaranteeing data consistency. The lower right is the most stringent: I need to make sure that every replica has actually written the write I just got. That last one has some challenges, because if any replica is down, the write won't succeed, so we lose some of the availability, which is one of the things we're trying to keep very high here. Quorum ends up being a really popular choice for two main reasons: it's a good middle ground between consistency and availability, since I'm likely to have at least a majority of my replicas up at any given time, and I still get a couple of nodes actually holding the data.

That's the write side of things. On the read side, our client on the left is doing a query that looks like SQL but is actually CQL, in that intersection between the two. Again, it contacts a coordinator, which could be any node, and asks: give me the user ID for the user whose name is PBCupFan. The coordinator says no problem, let me go get that, and consults some number of the replicas based on the consistency level. Say it's a quorum read, so the coordinator asks two of the nodes for what they have, and they each give him the value they've got. Now, those values may be different: this is a very active system, and we said before that a write might only need to be acknowledged by a subset of the nodes, so when you ask the question you may get different answers. The way we handle that in Cassandra is that every write carries a timestamp. The coordinator isn't trying to see if the values are the same; this is not a majority-vote system, it's a last-write-wins system. He checks the timestamps and returns the value with the highest timestamp. So if we combine a quorum write with a quorum read, we can make sure we always get the most up-to-date data, because we will always be consulting at least one of the nodes that got the previous write.
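Here is a minimal sketch of that quorum write plus quorum read combination with the DataStax Java driver, continuing with the made-up iot.sensor_readings table from before; the contact point and values are illustrative only.

```scala
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
val session = cluster.connect("iot")

// Write at QUORUM: with RF = 3 the coordinator waits for 2 replicas to
// acknowledge before telling the client the write succeeded.
val write = new SimpleStatement(
  "INSERT INTO sensor_readings (sensor_id, reading_time, temperature) " +
    "VALUES ('nest-42', '2015-08-12 10:31:00', 74.0)")
write.setConsistencyLevel(ConsistencyLevel.QUORUM)
session.execute(write)

// Read at QUORUM: the coordinator consults 2 replicas and returns the value
// with the highest write timestamp (last write wins). Because a quorum write
// and a quorum read always overlap in at least one replica, the latest value
// comes back.
val read = new SimpleStatement(
  "SELECT temperature FROM sensor_readings WHERE sensor_id = 'nest-42' " +
    "ORDER BY reading_time DESC LIMIT 1")
read.setConsistencyLevel(ConsistencyLevel.QUORUM)
println(session.execute(read).one().getDouble("temperature"))
```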
Now, this is all interesting, this is the very basic picture of how Cassandra works and how we get a basic scale-out system, but the interesting part, especially with Internet of Things, isn't really the individual thermostat; it's a whole stack of thermostats, and if we're at all successful, that ramps up to many thermostats and then zillions.

So that's where the general concepts of Cassandra are, and Cassandra really is well suited to the Internet of Things. We talk about always-on: there's no downtime, we're always able to receive data from the Internet as well as serve up queries. We've got the scalability, so that if we're successful we can scale without having to massively re-architect things. That makes it a really good choice for Internet of Things.

Okay, so now Cassandra is really positioned as being the place where the data is. But, as I alluded to, Cassandra doesn't do everything. It's great, it's awesome, we love it, but it does come up short, by design, in certain areas. One is that it does not do aggregates: it's been optimized for lookups and for writes, but it's not really about doing large GROUP BY aggregates, let alone the fancier windowed aggregates you get in SQL. We also don't do joins; much of the time we spend with people who are new to Cassandra is on the data model change, because we can cover most of the join use cases through intelligent data modeling, so we can work around some of these problems. Queries are primarily lookups, like I mentioned: we talked about needing a partition key, and sometimes your predicates aren't based on the partition key, which is another challenge we hit with Cassandra. Essentially, Cassandra is not designed to be a full-table-scan kind of place, so we need something else helping us out here.

Alright, we've talked about the peanut butter, so let's talk about the chocolate: Spark. Spark is a distributed computing framework; it went GA with version 1.1 last spring. It has a generalized execution model and can reuse data that has already been computed. People talk about Spark being an in-memory system; it's actually more than that, it does work off of disk as well, but the thing that's really important to keep in mind is that Spark is really good at reusing memory, and very efficient there, and that pays a lot of dividends. Then there's an easy abstraction, this thing called DataFrames, and a whole bunch of other tools built on top of it. If we look at Spark the product, it has a number of elements all integrated in one place: a basic core, then a SQL engine, a streaming engine, and a few other pieces I won't talk about as much around machine learning, graph analytics, R, and statistics. That's the general high-level view of the Spark stack.

Spark has a relatively simple architecture to understand: there are only two roles for the nodes in the cluster, a master, which we've talked about a little before, and some workers. We've got the efficient memory caching I mentioned, this generalized execution model, which is really nice, and this abstraction called DataFrames. If I had given this talk about six months ago we would be talking about Resilient Distributed Datasets, or RDDs; a lot of similar things can be said about them. DataFrames are the newly envisioned version that gives a lot more capability, but conceptually they share a lot with the concept of an RDD.
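As a small illustration of that memory reuse, here's a sketch that parses a file once, caches the result, and reuses it across several actions; the HDFS path and the field layout are made up, and this assumes the job is launched through spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))

// Parse the file once and keep the result in executor memory...
val readings = sc.textFile("hdfs:///archive/2014/readings.csv")   // made-up path
  .map(_.split(","))
  .map(fields => (fields(0), fields(2).toDouble))                 // (sensor_id, temperature)
  .cache()

// ...then reuse it across several actions without re-reading or re-parsing.
println(readings.count())
println(readings.values.max())
println(readings.filter(_._2 > 100).count())
```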
The Spark master is the one you submit jobs to, and he assigns resources to a particular job. The Spark worker is the one who now has tasks to do, and he assigns work to the local executors running on his machine. The executor is the actual bit of code doing the work: he's the guy taking a part of the DataFrame or the RDD and processing the work that comes out of it. The RDDs or DataFrames can be generated from a whole host of sources: things like HDFS, text files, databases, and, of course, Apache Cassandra.

So now let's talk about how the two pieces fit together. The way Cassandra fits in with Spark is on the bottom: we can put a Cassandra database underneath the Spark core engine, and we've got this piece called the DataStax Spark Cassandra Connector, which is basically the interface code between the Cassandra database and the Spark engine. That's an open source package we've put out there; others are contributing, but DataStax is doing the lion's share of the work. What it does is surface Cassandra tables up to the Spark engine as DataFrames, and by integrating at that lower level we get all the elements above it, the SQL engine, the streaming engine, the machine learning engine, all reaping the benefits of this data through the DataFrame abstraction.

If we dig in a little and look at what's really going on under the covers: I mentioned the Spark executor. The executor is the workhorse; he's processing the data sitting in the dataset or DataFrame, taking it piece by piece and working it through the DAG of execution. Inside each executor we need to look at how he's grabbing data from Cassandra. Each one of those executors has a connection to the Cassandra database, which he makes through the Java driver. Another thing DataStax does is build a whole bunch of the open source drivers for Cassandra; this one happens to use the Java driver, but there are a number of other drivers available. Each executor makes a connection to the cluster and pulls across a part of the DataFrame, a part of the full dataset. The way we split this up is that the RDDs and DataFrames divide the full token range into pieces, and a Spark executor operates on all of the data for a subset of the tokens. In this picture the first slice might be tokens 1 to 1,000, the second one 1,001 to 2,000, and so on, and we work our way through the entire dataset; that way we can work on these slices in parallel.

Conceptually speaking, we have the situation where each Spark node, the orange marbles on the left, is making a connection to a Cassandra node on the right. This is a simplified version; there's really no reason why the Spark cluster and the Cassandra cluster have to be the same size, but for simplicity we'll display them this way. They buddy up: the top Spark node works with the top Cassandra node, and they're going to be reading the data owned in that space.
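Here is a minimal sketch of that connector path in Scala. The host, keyspace, and table names continue the made-up iot example; the write target table is also hypothetical.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Point the connector at any node of the (made-up) Cassandra cluster; it
// discovers the rest of the ring from there.
val conf = new SparkConf()
  .setAppName("iot-analytics")
  .set("spark.cassandra.connection.host", "10.0.0.1")
val sc = new SparkContext(conf)

// The table comes back as an RDD. Under the covers each Spark partition maps
// to a slice of the token range, so each executor reads the slice owned by
// its (ideally co-located) Cassandra node.
val readings = sc.cassandraTable("iot", "sensor_readings")
println(readings.count())

// Writes go back through the same connector; the target table is made up and
// assumed to be keyed by sensor_id.
readings
  .map(row => (row.getString("sensor_id"), row.getDouble("temperature")))
  .saveToCassandra("iot", "latest_temperature", SomeColumns("sensor_id", "temperature"))
```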
Once you see this picture, you pretty rapidly come to: why are there two different clusters? Why don't we co-locate the Spark process and the Cassandra process on the same node and get these sort of hybrid nodes running Cassandra and Spark on the same machine? That allows us to do local reads and writes and skip some of the network cost of running Spark, so we end up with this Spark plus Cassandra hybrid.

Now that we've done that, let's look at things we can do with Spark that we couldn't do before with Cassandra alone. The first one is joins. This is a silly little example: maybe I have some metadata about each of my sensors in a table called metadata, and another table that has the actual temperature readings, and I want to know the latest temperatures in the dataset but also show the location for each sensor. That's a relatively simple join; they can certainly get more complicated than that. The second example is aggregates: now I might be interested in the yearly and monthly maximum temperature for each of my sensors, so this will do that kind of full table scan, grab the data, process it, and give us a summary table, sort of an OLAP-style query.

Another piece: in the Spark ecosystem there's a lot of use of distributed file systems like HDFS, and those may be external to the cluster we've got; they're certainly external to Cassandra, which does not have HDFS in it. We might want to do a join between the two. Say our HDFS data is the temperature data from 2014, maybe stored as CSVs or something along those lines, and we'd like to join it with the hot, operational data we've got in Cassandra. In this example, the first line defines the HDFS dataset, the second line defines the Cassandra dataset, and the third line joins the two together and then does a little filter to find out which sensors are hotter this year in this month than they were last year in this month. Simple, but I just wanted to give an example of a year-over-year kind of query. The last example is super simple: what if I want to restrict my query on something that is not the partition key? In other words, I'd like to know every time there was a temperature reading above a hundred degrees, across the entire dataset. Again, Spark is well suited to scrubbing through all of that data.

Because we're integrating with Spark, we can also latch onto the entire Spark ecosystem of tools. One piece of that is ODBC and JDBC, because Spark SQL has that interface, so we can tap into things like Tableau and Pentaho. I put R up here as well; there is a SparkR capability, but there are also R ODBC and R JDBC packages you could leverage just for getting an extract of data. And there are notebook-style tools like Apache Zeppelin, an incubating project at Apache, that we can also tap into. So the Spark ecosystem enables a number of tools in addition to the tools we've already got in the Cassandra ecosystem.
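The slides for those four examples aren't reproduced in this transcript, so here is a sketch of the same four kinds of queries using the Spark SQL DataFrame API on top of the connector. It reuses the SparkContext from the sketch above, and the sensor_metadata table, column names, and HDFS path are all made up for illustration.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

val sqlContext = new SQLContext(sc)               // sc from the connector sketch above
import sqlContext.implicits._

// Expose Cassandra tables as DataFrames through the connector's data source.
def cassandraDF(table: String) =
  sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "iot", "table" -> table))
    .load()

val readings = cassandraDF("sensor_readings")
val metadata = cassandraDF("sensor_metadata")     // made-up lookup table: sensor_id, location

// 1. A join Cassandra alone won't do: attach each sensor's location to its readings.
val withLocation = readings.join(metadata, readings("sensor_id") === metadata("sensor_id"))

// 2. A full-scan, OLAP-style aggregate: maximum temperature per sensor.
val maxBySensor = readings.groupBy(readings("sensor_id")).agg(max("temperature"))

// 3. Cold data outside Cassandra (a 2014 CSV archive on HDFS) joined with the hot data
//    for a year-over-year style comparison.
val archive2014 = sc.textFile("hdfs:///archive/2014/readings.csv")
  .map(_.split(","))
  .map(f => (f(0), f(2).toDouble))
  .toDF("sensor_id", "temperature")
val maxBySensor2014 = archive2014.groupBy(archive2014("sensor_id")).agg(max("temperature"))
val yearOverYear = maxBySensor.join(
  maxBySensor2014, maxBySensor("sensor_id") === maxBySensor2014("sensor_id"))

// 4. A predicate on something other than the partition key: every reading above 100 degrees.
readings.filter(readings("temperature") > 100).show()
```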
I'll do a quick word on Spark Streaming; I'm conscious of the fact that we got a late start, so this will be relatively quick. Spark Streaming is one of the big things we see a lot of folks using with Cassandra. The idea is that you've got data coming in from the ether into a receiver; we could think of this as, say, the temperature data coming in from our sensor network. What Spark Streaming does is basically make these little windows. The dark blue boxes are, say, one-second windows that we build up, so we end up making little batches; it's really a micro-batch kind of approach. We build up all the data for one second and then we can process that second, or some collection of seconds. So in Spark Streaming we can say: I'm going to put the data in one-second buckets, but every two seconds I'll do the actual processing, and I'll slide by one second; that's what this example is illustrating. Maybe I'm doing roll-ups, I could do pre-aggregation, I could do some filtering, anything I want to do here, and then I can take that data and persist it somewhere. One of the things we see a lot of people doing is using a message queue like RabbitMQ or Kafka, pushing the data through Spark Streaming, and having the results go into Cassandra. I'll skip going into the details, but there are some relatively basic commands you need to set this up, and it's relatively straightforward: you set up what the receiver is, set up what you're going to do on each window, and tell it to keep running until you decide to stop.
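Those setup commands aren't shown in the transcript, so here is a hedged sketch of that Kafka to Spark Streaming to Cassandra pipeline. It assumes the spark-streaming-kafka dependency is on the classpath, reuses the SparkContext from the earlier connector sketch, and the ZooKeeper address, consumer group, topic, message format, and target table are all made up.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

// One-second micro-batches, as in the windows described above; sc is the
// SparkContext from the earlier sketch (with the Cassandra host configured).
val ssc = new StreamingContext(sc, Seconds(1))

// A Kafka receiver: ZooKeeper address, consumer group, and topic are made up.
val messages = KafkaUtils
  .createStream(ssc, "zk-1:2181", "iot-consumers", Map("sensor-readings" -> 1))
  .map(_._2)                                        // keep only the message payload

// Messages are assumed to look like "sensor_id,temperature".
val readings = messages.map(_.split(",")).map(f => (f(0), f(1).toDouble))

// Two-second window sliding by one second: roll each window up to a max per
// sensor and persist the result into a (made-up) Cassandra table.
readings
  .reduceByKeyAndWindow((a: Double, b: Double) => math.max(a, b), Seconds(2), Seconds(1))
  .saveToCassandra("iot", "windowed_max", SomeColumns("sensor_id", "temperature"))

ssc.start()
ssc.awaitTermination()
```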
Just a quick plug on DataStax Enterprise: one of the things DataStax Enterprise delivers is Cassandra at its core, along with these other capabilities; we've got the Spark integration built into the platform, as well as integration with Apache Solr on the search side of things.

So if we come back to our Internet of Things use case, we can see where Spark and Cassandra really fit: Cassandra fits into this system by being able to receive all this data and scale for all the toasters we're going to add, and Spark really helps us address the broader set of questions that the person at the bottom is going to want to ask of that data. And if we come back to Willie Sutton, we see that one of the real benefits of putting Cassandra and Spark together is being able to unleash the power of analytics where the data is; that's really the big benefit. I had a couple of links here for where you can go to follow up; I won't belabor the point since we had some time issues at the beginning.

So I think I'll pass it over then to Brian, who I think is next. Is there something I need to do to enable that, Devon? Yeah, I'll pass the ball, and we should be ready to go. I think I just did, actually. Cool, we're all good to go; go ahead and start when you're ready. Okay, I'm going to share my screen here. Excuse me, I'm assuming I'm still seeing Brian's screen, is that correct? Yes, that is correct. Okay, what do I need to do? Obviously this is not giving me the option. Okay, you should be able to go to "share my desktop" now that you have the ball. Okay, there we go; let me know when everybody can see that. I can see it, you look good. Okay, great.

Again, as Brian mentioned, I'll keep this very brief since we started a little bit late, so we'll go through it quickly. We're going to cover the DataStax implementation methodology. KPI is a DataStax partner, and as a partner we'll also be at the Cassandra Summit, September 22nd through 24th; hopefully all of you can join us there. KPI has done quite a few DataStax implementations for both retail and financial services customers throughout the US.

Here's an overview of the methodology we're going to cover; we'll keep it at a high level and not get too technical, just covering the things that have ensured the success of implementations we've done in the past. Initially, to get started, we have what we refer to as the requirements phase. At a high level we capture the use case requirements for data modeling, which is very important, security and encryption requirements, service level agreements (SLAs), operational requirements that let us monitor and manage the system, and the search and analytics requirements as well. Some of the key points here: as Brian said earlier, getting the data model right is going to be key to success, so leverage whatever you can; if there's an existing database where you can get access to the query logs, go do that, it's going to help you define your data model. You want to define specific create, read, update, and delete requirements; those are essential for the requirements phase. Security is important, both from an authentication perspective and an authorization perspective, and you can see the areas we cover there. For encryption, we cover the client application to the DataStax cluster, and then node to node, which is the internode encryption. SLAs are a must-have and are highly recommended even if you just define them for your own project; the lack of SLAs has led to a lot of project failures, and this always comes up during our implementation process, so we work with the customer to help define them. We also have to understand that we're building a mission-critical system, so we have to define the operational monitoring and management of the system early on in the process and make sure we build to those requirements. Then, as we talked about earlier with DataStax search, we define those requirements here as well; ultimately we need to determine the fields that will be searched on and returned in the searching process, and you can see some of the examples we give. Data analytics has requirements too, and they're important to capture at this time; the key ones we see that need to be incorporated are the statistical algorithms required, the data sources, data movement and modifications, security and access, and then the other analytical requirements. We just have to make sure we have enough detail to perform a good design.

Step two is the design phase, and again the data model leads off the list, along with data access object design, data movement design, operational design, and search and analytics design, which we'll touch on. The data model design needs to include the keyspace design, your replication strategy, and the table design, and you can see the components within that which are necessary to put the design together. Any relationship between tables needs to be noted here; joining is not technically feasible within Cassandra itself, but as Brian showed in his demonstration, that can be overcome and accomplished. So again, we want to identify this stuff early in the design process and incorporate it here.
When leveraging simple data access objects, we want to keep it simple to be successful, so we use the data access objects to encapsulate and abstract the data manipulation logic. The current trend in the industry right now in application development is for projects to leverage frameworks that encapsulate, abstract, and represent database components as application objects; we do that a little bit differently here, and designing the application's data access objects as much as possible up front will help in the overall application development and functionality. The last piece is the data movement design: batch and real-time data integration between systems, ETL, change data capture, data pipelines; these are all essential parts of the design phase for data movement.

In the operational design we want to put as much tooling and technique in place as possible: how we deploy new nodes, configure and upgrade nodes in the cluster, how we do backup and restore operations, cluster monitoring with OpsCenter, repairs, alerting, and disaster management processes. We want to have all of those things in place, and KPI highly recommends putting a playbook in place to manage the operational design process. Here's some of the search design we talked about earlier: searchable terms, returned items, tokenizers, filters, multi-document search terms; these are all things we identify and incorporate into the design phase. You can see the analytics component as well; we do a design pass for analytics this early in the process.

Step three is our implementation phase. It covers the infrastructure deployment and configuration management, the software components, which include both the data model and the application that's been built, and then the unit testing of those components. When you're doing the application development, depending on your organization you can use an agile or waterfall methodology; both work well with this process. We want to cover the deployment and configuration management mechanisms, and the key in a distributed system like this is to automate as much as possible, so try to leverage OpsCenter, Docker, Vagrant, Chef, Puppet, all those types of components. Obviously, unit testing is more complex with a distributed system compared to a single-node system, so be prepared for that; we're going to be looking for specific defects such as race conditions that can only be observed at production scale, or at least at some reasonable fraction of the actual system's scale, so we always recommend that unit testing be executed on a small cluster that contains more than a single node. There are other things you can use to automate some of your testing and launching, and those are always recommended as well.

Step four is the pre-production testing phase: basic stuff here, defect tracking tools you can use, and the operational readiness checklist, which is key to deploying an application like this. It's critical to enable the project team to identify actual issues prior to going to production at scale; in financial services it's very common that the testing environment is an exact clone of the production environment for that reason alone, because they want to identify as much up front as possible. We recommend a minimum two-week period of running your application at production scale to look for all these errors and issues prior to going live.
It may take several iterations of your configuration, some code changes, and other things before you're able to do the full execution. Here's a good example of an operational readiness checklist, and you can bullet through these; most people should be expecting this type of thing. These are things you need to be ready to execute prior to going to production, so that when you're in production you feel comfortable doing them and everything's in place.

Our last step here is scale and enhancements, and these highlight the normal operational mode of an application built on DataStax. We're always looking at how we scale this and how we can enhance it, whether that's tuning performance or adding more features and functions; those are the things we're always looking for in this phase. You also have to prepare for all the eventualities; things are going to happen, and you can address that by adding nodes to expand the capacity of the system when it's needed. These are all options you have with the DSE product and capabilities you should plan to take advantage of as you grow. The final point is that scaling with the DataStax Enterprise solution is exactly what the product is for; that's what it's known for and where its strength is out in the field today. Here are some reference architecture examples we've put in place; Brian gave some great examples, and these are some other ones commonly used in the field. Here's another one that's cloud based: a lot of global organizations set up these kinds of clusters, West and East and then EMEA, to do various kinds of real-time analytics with what's available to you.

Quickly, to wrap up, here are the key things we want to push today. One of the things we've been very successful with, and that a lot of customers appreciate, is scheduling a Lunch and Learn. Reach out to us and let us know you're interested; what we do at KPI is team up with DataStax and come out to your location for a purely educational two-hour session based on your knowledge and experience with Cassandra. It can be introductory all the way up to more advanced data modeling or performance tuning if necessary. That's something we're more than happy to do at no charge, and we've built quite a following doing it with many different organizations; typically it's geared towards architects and developers, people looking to understand how they can take advantage of Cassandra. The other offer is a DataStax assessment call: we can quickly get on a call with you, answer any questions you may have, and do an assessment of a use case you're considering or of changes to something you're doing today. My contact information is included here; feel free to reach out at any time and we'll get back to you promptly. The other thing I want to remind everybody of is the Cassandra Summit in September; KPI will be in booth 111, which should be easy to find, and there's a piece on it right there.

So at this point I'm going to change back, or actually, Devon, if you could change Brian back to the presenter, we can take questions at this time; I would appreciate that, thank you. Okay, perfect, yeah, let's switch Brian back. We'll take a few minutes to answer a few questions. First question: what are the cluster monitoring tools that DataStax offers?
Sure. So OpsCenter doesn't only come with DataStax Enterprise; DataStax produces OpsCenter as our visual monitoring tool, and you can use it on open source Cassandra clusters with some limited capability, but with DataStax Enterprise itself it allows visibility into all sorts of metrics. That goes from the Cassandra level, understanding things like read requests and write requests and seeing spikes, down to things like the Java virtual machine and how it's doing, and all the way down into lower-level things like disk access and operating-system sorts of metrics. You can do some charting of those to see the trends as things are going on. Again, this is a distributed system, so sometimes it's good to see an aggregated view, and we provide that, or you can look at an individual view; if you have 100 nodes in the cluster the individual view can be a little busy, but it certainly can help to dive in at that level. That's the monitoring side.

On the management side there's alerting, which is a relatively standard thing to have in a monitoring tool. We can also handle some of the common housekeeping kinds of things: backups and restores, some of the anti-entropy repairs, so there are maintenance pieces we take care of for the regular care and feeding of the database, upgrades, and so on. Lastly, there are a number of what we call services through the OpsCenter tool that do things like evaluate whether or not you're following some of the best practices in your configuration and alert you. Sometimes there are plenty of good reasons not to follow a best practice, and we just want to make you aware that you're doing something non-standard; that's great if it works for you, and not so great if you didn't know you were doing it by accident. There are a number of other services, a performance service and so on, that dive a level down. So OpsCenter is our visual monitoring tool. I did skip relatively quickly through the DataStax slides; on this DataStax slide I talked primarily about the things in blue on the top, but there are a number of things DataStax provides along the bottom which I think of as enhancing Cassandra and making it enterprise class, and that includes this visual monitoring tool you can see on the right there.

Great. Do you encourage or discourage running Cassandra nodes along with Spark workers, and if so, under what hardware conditions? Yeah, I think the answer there unfortunately falls into the "how long is a piece of string" category: it really depends on what you're doing in your Spark job. With Cassandra we can talk a little more concretely about sizes and recommended configurations, but Spark jobs can go from relatively simple to extremely complicated, so it really matters. The basic configuration we talk about for Cassandra is, I believe, four cores and 32 gigabytes of RAM; I would prefer to double that, but it's certainly not a requirement, and there are a number of cloud instance types we recommend on Amazon and Google and Azure and the rest of them, so you can certainly go out there and do that. When you bring Spark into the mix, a few more cores and some more RAM really do start to help the picture.
If you're doing analytics that ends up building big models and doing a lot of number crunching, that can put real stress on both the computation and the RAM, so the requirements float up; the general rule of thumb would probably be more like six to eight cores and drifting up to 48 to 64 gigabytes of RAM. But I think the best approach is really to make a good guess and then actually try it out and see if the workload works. One thing about sizing, and I'll take this opportunity to talk about Cassandra cluster sizing for a quick second, is that we don't always size by data volume; a lot of times we size the cluster by your query SLAs. Are you trying to get high concurrency, say 50,000 clients all asking questions at the same time? That's going to change the cluster compared to only needing 5,000. Similarly, are you trying to get very low latencies, say I need to get under 100 milliseconds on my writes, versus I'm okay with a one-second SLA? Those are the other sorts of things we take into account when we're sizing a cluster.

Okay, great, let's take a few more. This one is about the availability of individual nodes being offline for many hours: what is the SLA for how long a node can be disconnected from the rest of the Cassandra cluster? In terms of how long it can be disconnected, there are a couple of things mixed into that. The architecture is designed to tolerate that node not being there relatively indefinitely. Now, while it's not there, if we go back to our replication factor story of the data being replicated around, a couple of things are happening. First of all, I'm only storing some of the data in two places instead of three, so I'll start to worry if I lose another node, because then I only have it in one place and my fault tolerance is starting to degrade a bit. So I do want to address it: if I know, for instance, that it's going to be a while, I may need to rebalance, in other words give the token range that's now missing to one of the active nodes. If I had, say, ten nodes and one went down, I could give that range to one of the other nine; I don't necessarily need additional hardware, but I do need to know that each of the remaining nine is going to take a little more strain. I could also spin up another node to bring it back to ten and do it that way after a little while.

Network hiccups happen all the time; that's part of the world we live in, and the architecture is fine with losing a node for a second or a couple of seconds, which is a normal network hiccup. What happens there is that the coordinator stores a little hint, as we call it: basically what it would have told that node had it been up, and it holds onto that for a little while. There's actually a configuration for how long that is; it's usually a few minutes, maybe tens of minutes, and at that point it says, look, I'm just not going to bother, because the hints are piling up and it's going to be too much of a drain, and we'll deal with that node when it does come back online. Now, say a machine's power supply blew up and the node died for two days: when you bring it back, there's an operation we call repair.
It's not so much that something was broken; it's more of what we call anti-entropy: I need to get this node to have the latest data from the other replicas. They can stream the data across to the node that's rejoining, it comes up to par, and once it's at the current state it's able to join and take care of things. That operation of bringing a node back in after it's been offline for a while actually looks a lot like bringing in a brand new node and having it join entirely. So in terms of tolerating it, you can handle it for a long time, the system isn't super sensitive to it, but you do have to be concerned if it's down for a long time, and hand that range to somebody else so there isn't a concern with having the data available. I don't know, Brian, if you have any other thoughts on that; I've been dominating the question answering. I appreciate that, that's where we bring the expertise in. Okay, sorry, let's go for one more; let's see, there are plenty of good ones.

Is there a way to archive the Cassandra commit logs, since these log files can be quite big? Sure, there are a couple of things you can do with the commit log. First of all, you can set how much commit log space you allow. We didn't cover this in the talk, so just to bring everybody up to speed: when each replica goes to do a write, the first thing it does is write to a little space on disk called the commit log, which is how it knows the write has landed somewhere durable. Then it goes through the rest of the path: it stores the data in memory for a while in an in-memory table, the memtable, which makes it a much cheaper operation to query data that has been active most recently. After a while that grows to a size that needs to be flushed to disk, and that's when we start creating these things called SSTables, the on-disk version. When you do a read, you combine what's on disk with what's in memory. If the machine were to just go down, say you kick the plug out, then when it starts back up it can go to the commit log and replay everything in it to get itself back up to state, and then it brings itself online and says, now I'm ready to talk to people. So the commit log is this safety mechanism, and it can grow.

One of the things you can do is set how big it is. Another is that there is a hook inside the APIs that allows you to tell it you'd like to do something like archiving, so you can do commit log archiving (in Cassandra this is configured through the commit log archiving properties file), and that hook gets called when a file is about to be removed because it's occupying more space than allowed. We never get rid of a commit log segment until we know that everything in it has made it durably into one of those SSTables on disk; in other words, that memtable I was talking about eventually overflows and goes onto disk as SSTables, and we don't want to remove the only disk copy we have until we're sure we actually have the SSTable copy on disk. Once that's done we mark those commit log segments as fine to get rid of, we're covered. So between a combination of those knobs and interfaces you can really manage larger commit log files. Another thing you can do is run an operation in one of the management tools to force a flush of the memtables.
You do that with the nodetool command-line utility (nodetool flush, passing it the keyspace and table): you say, hey, I want you to take this table, say my temperature readings table, and flush everything you have in memory onto disk. When it does that, it goes to the commit log and can clean up, since a lot of that data has now been persisted, so you can get rid of some of those files. So there are a number of options; it's clearly a question from somebody who's done some operational work with Cassandra, it's a good question, and it's relatively targeted to operational use.

Perfect. All of the questions we didn't get to, we'll try to get to in the blog post. So thanks again, everyone, for joining. We will be sending out a recording of this video along with the slide decks within the next 48 hours. Sorry for the technical difficulties, and thanks again for coming. Thank you, have a great day. Cheers.
