EcoCast Big Data and Analytics



David: Thank you, everyone, for joining the EcoCast today. We're kicking it off right now. We already did the poll, so with that, I'm excited to introduce Vikram, who is a Senior Manager of Product and Solutions Marketing at Pivot3. Vikram, are you with us?

Vikram: Hi David, yes. Can you hear me okay?

David: I can hear you great. Thanks for being on. Take it away.

Vikram: Awesome, thank you. Good morning and good afternoon, folks. My name is Vikram, and as David mentioned, I look after solutions marketing and enablement initiatives at Pivot3. From a solutions standpoint, we're increasingly running into big data analytics opportunities where customers are looking for hyper-converged infrastructure and find some of our architectural features really beneficial. For today's presentation, I'm going to go through the high-level characteristics of what a big data analytics workload means, some of the evolution from an infrastructure standpoint, and some of the requirements. I'll then get into some of the Pivot3 architectural features that optimize our infrastructure for a big data analytics deployment. We also recently did some technical work and published a reference architecture for Splunk, so I'll go through some of those details and highlight the results toward the end of my presentation. That's the agenda. We also have George Wagner, my colleague who does product marketing here, for questions and answers, so if you have any questions, George would be more than happy to answer them through the chat window.

All right, so: making machine data usable, because that's what big data does. If you think about business analytics in general, it's gone through a massive transformation over the last couple of decades, and I think there are a couple of things driving that transformation. One is the proliferation of connected devices, which really changed a few things. The second is that over the last couple of decades, most organizations have become almost completely digitized. Put those two things together, unusual data sources and fully digitized organizations, and you have a lot of data being created that is not traditional data. It's very different from traditional business intelligence data. Traditionally, your business intelligence systems, maybe Business Objects or Informatica, would take in tabular data. You would point them at certain databases, maybe your sales database or your accounting database, and they could run certain queries on them, fairly simplistic in some ways. The new data does not fit that model. It's coming from all kinds of devices, in the form of system logs and the information those devices are sending, and most of it is unstructured, so we need fundamentally different approaches to look at this data and make sense of it. Traditional BI approaches were really backward-looking: they would tell you what happened last month or last quarter, but not provide any real-time insights that could help you make more strategic, forward-looking decisions. With big data analytics, and this sort of data coming at us at such high volume, we can create analytics that are closer to real time and that provide insights to help you optimize whatever strategy you're pursuing.
So a big data platform can essentially take data from all of these sources, process it, and use it to derive critical insights for many different reasons. If you're an application delivery person concerned with how your application is doing, this data can provide insights into how your application performs at certain times, which parts of it are used more than others, whether there is any contention within certain parts of it, and how users experience it. If you're an IT operations person, it can provide a lot of critical insight into what's going on in your IT right here, right now: is any contention about to happen, are there correlations I can learn from so I can be proactive and avoid outages, that sort of thing.

The third one there is security and compliance, and I'd like to give a Pivot3 example. Working with one of our channel partners and a third-party software vendor, we bundled Splunk into a security and compliance solution. It's designed to look at your IT infrastructure, continuously analyze it, and alert you if anything abnormal or any security threat is happening in the environment, so you can address the problem before it becomes an issue. Similarly, from a business analytics point of view, it can inform your product pricing strategies and that sort of thing.

The last one there, IoT and machine data, is the one I think is going to drive a lot of growth and adoption for big data platforms. The notion is that all these machines are now connected and sending data, whether it's a manufacturing machine sitting on a factory floor or a jet engine flying in the air. All of these machines generate and transmit a lot of data, so by looking at that data and making sense of it, you can optimize operations and do preventative maintenance so you don't have outages. That sort of usage is increasing, and analysts are predicting it will drive a lot of big data growth. So really, big data allows you to look at your data universe holistically, take in different kinds of non-traditional data sources, and derive critical insights that can help you in a number of business areas.

Now think about the infrastructure side of big data. What we just saw was what it can deliver from an application standpoint, but from an infrastructure standpoint, big data has gone through quite a bit of transformation in a short period of time. When these platforms initially became available for this sort of unstructured data processing, they were essentially built for x86 standalone server nodes. That allowed them to get started quickly using commodity, standards-based hardware and start analyzing data. Those standards-based x86 servers would use local storage, and because the servers are not resilient by themselves, if one of the nodes goes down, resiliency was handled by the application by duplicating data sets. As with a lot of physical deployments, the infrastructure is extremely underutilized. Think about it: say a deployment needs four indexers and three search nodes; now you actually need seven physical machines to run that infrastructure.
So there is a lot of infrastructure, and a lot of it is underutilized in this scenario. There are performance challenges as well, and as you scale and grow and have more and more nodes, it becomes unmanageable.

Then virtualization came into the picture to solve some of these problems. You started virtualizing your compute elements, running index and search virtual machines on a compute tier using a hypervisor, and from a storage standpoint you could now use shared storage. However, for shared storage you may need many different storage subsystems to address the needs of specific data buckets: you may need high-performance storage, and you may need scalable, capacity-efficient storage for long-term retention, typically a shared SAN and that sort of storage. As you can imagine, that storage can become complex and expensive, and many of these storage solutions were really not designed for virtualization; they were scale-up, dual-controller architectures. In certain high-density environments, those storage resources can become a point of contention.

Fast-forward to where we are today, and if you combine the best of both worlds, bare-metal servers as well as virtualized compute and shared storage, you have something called hyper-converged. Hyper-convergence offers you a simple, modular, x86-based infrastructure that handles your compute needs as well as your storage needs by virtualizing the storage underneath, so you don't need external storage. It's operationally efficient because you're just managing x86 boxes, and it's simple to deploy and operate. And with our HCI solution, which utilizes a policy-based framework and an intelligent engine to prioritize resources, it can be a great platform for applications that have diverse data needs. We'll see that in some of the architectural aspects as we go through these slides. Okay, I think we have a poll question. David, do you want to read it out?

David: Yeah, let me read it out. It says: do you currently have, or are you planning to deploy, a big data analytics platform? Everyone out there in the audience, check out the slide window; we would appreciate your response. I'll share the results with everyone here in just a moment. I'm curious to see what everyone is doing around big data. I know a lot of companies, as was said in the previous presentation, might have a big data problem but just not realize it, or don't have any current plans to address it at this point. Are you seeing that a lot today?

Vikram: Yeah, as I said, some of those are transitioning into big data analytics. What we've seen is that many of the big data analytics deployments built in the past on shared storage are coming off of warranty, and those customers are looking at new infrastructure. So I think it's all over the place for big data, and I think your poll sort of represents that too.

David: I agree. It looks like most everyone out there has voted, so let me share the results with you. It looks like 21% say they have a big data deployment, 14% are planning to deploy a big data platform, 25% aren't quite sure yet, and 39% don't have any specific plans. What do you think, Vikram?

Vikram: Yeah, this is sort of what we've seen generally in our install base as well, and the good thing about our platform is it can handle multiple applications.
So a lot of our customers who use us for VDI, or who use us for video surveillance and analytics, are starting to look at us for this, and based on that, I think this looks representative of our install base.

David: All right, thank you.

Vikram: So now, from an infrastructure standpoint: I represent an infrastructure company, so I'm going to talk more about the infrastructure you need as a foundation to build a solid big data analytics solution on top of. If you think about many of these big data platforms, whether it's Splunk or Cloudera or Hortonworks, you'll see that data is coming into them from a lot of different sources. All of these devices are sending data simultaneously, concurrently, at a very high rate, and that's usually called ingest. The application needs to take in all of this data, index it, and write it to storage, and that needs to happen at fairly low latency for it to be effective, so it can take in all of the data at the same time without dropping any of it. You need scalable, high throughput to achieve that, and when I say throughput, it's on both sides, the application side as well as the storage side. You need to take in all of that data quickly, process it from a compute standpoint, and write it to disk at very high speed and very low latency.

Secondly, once the data is written, it is searched, and certain parts of your data are going to be searched more often than others; the data that came in most recently is usually searched most often. For these search queries to be effective, whether it's an application or an operator conducting the query, they also need to run at very high speed and very low latency, so you need high-performance, low-latency storage access here too. It's not so much about ingest now; it's about the read performance of your data search.

Then, to make things complicated, these deployments typically gather a lot of data over time, and certain regulatory and compliance requirements force companies to keep this data for a long time. So you also need storage that is cost-effective and scalable enough to handle petabyte-scale cold and frozen data, which is rarely searched but still needs to be retained. Looking at all this, the requirements for big data infrastructure are pretty extreme: it needs to be high-performance and low-latency, not only in compute but also in storage, while at the same time being scalable and cost-effective enough to handle all of the data the organization accumulates over time.
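To make that tiering requirement concrete, here is a minimal sketch of the bucket lifecycle being described, with hot/warm data aging into cold and then frozen tiers; the tier names, ages, and thresholds are illustrative assumptions, not Splunk or Pivot3 defaults:

```python
from dataclasses import dataclass

# Illustrative age thresholds (days); real deployments tune these to
# their own search patterns and compliance retention requirements.
TIER_BY_MAX_AGE = [
    (7,    "hot/warm"),  # freshly ingested, indexed and searched constantly
    (90,   "cold"),      # searched occasionally, cheaper storage is fine
    (1095, "frozen"),    # retained ~3 years for compliance, rarely touched
]

@dataclass
class Bucket:
    name: str
    age_days: int

    def tier(self) -> str:
        for max_age, tier in TIER_BY_MAX_AGE:
            if self.age_days <= max_age:
                return tier
        return "expired"  # eligible for deletion once retention lapses

for bucket in [Bucket("web-logs", 2), Bucket("web-logs", 45), Bucket("web-logs", 400)]:
    print(bucket.age_days, "days ->", bucket.tier())
```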
That's where I think our platform can make a big difference, so let me introduce some of Pivot3's architectural differentiators. To boil it down to a nutshell, I think this slide does a good job. Some of the largest HCI deployments are built on our infrastructure; for example, one large city metro organization runs its operations on top of Pivot3 and is handling somewhere close to 10 petabytes of data, and we have a number of examples like this. The infrastructure is designed to work seamlessly and rebalance automatically, so these deployments don't need a lot of operational or administrative overhead to keep functioning and performing the way they should. In that instance I mentioned, with about 10 petabytes of data, there's one storage admin managing the whole infrastructure. In terms of efficiency, whether compute efficiency or storage efficiency, we deliver market-leading outcomes: we're able to boost our compute densities as well as our storage densities because of the way we've optimized our NVMe data path. And from a performance standpoint, the scale-out architecture and the NVMe data path we leverage provide overall application performance about two to three times that of traditional HCI solutions. With that kind of architecture, we can handle big data deployments quite well: it's got scale, it's got efficiency, and it's got performance. So I'll take you through some of the architectural aspects that play a role.

First of all, the foundation of the Pivot3 solution is the distributed, scalable architecture we've leveraged for quite some time. As you can see in the image, you can cluster a number of nodes into one Pivot3 cluster, and when you do, we're not just virtualizing the storage, we're aggregating pretty much everything the nodes provide. From a storage standpoint, we aggregate all the drives, which means every single virtual machine and every single volume you create spans every single drive in the cluster. That's critically important in a deployment like big data, where you have to support high ingest rates: you have a large number of drives taking in data and writing it to disk, and since we parallelize that across all the drives, in a hybrid cluster that can scale to 12 nodes, or an all-flash cluster that can scale to 16 nodes, you can have up to 144 drives. Every single VM and every single volume benefits from all of those drives providing I/O performance. We also aggregate parts of the RAM to provide a read cache, and we aggregate bandwidth: every single node you add contributes 20 gigabits of application bandwidth and 20 gigabits of SAN bandwidth. So as you scale your infrastructure because more data is being ingested, you're scaling your bandwidth as well.

The next critical element, built on top of this distributed architecture, is our patented erasure coding. Erasure coding is our patented algorithm that protects the infrastructure from failures: if a node fails, how do you handle redundancy? Most other HCI solutions use something fairly primitive, like replicating data from one node to the next so there's an extra copy. We believe that's a crude and inefficient way of doing it. If you look at the graph, the flat line at the bottom is the sort of efficiency you get with that kind of data protection, and at the same time you incur a 2-3x I/O penalty, because every write I/O becomes either two or three write I/Os, so you're wasting a lot of I/O capacity as well. With our erasure coding, we eliminate that I/O duplication, which adds to performance, but more importantly, we get RAID-like efficiencies, and that's quite important when you have to retain data for two to three years.
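The efficiency argument is easy to check with back-of-the-envelope numbers. Here is a rough sketch comparing replication to erasure coding on the same raw capacity; the 10+2 stripe width is an illustrative assumption, since the talk does not state Pivot3's exact scheme:

```python
def replication(copies: int, raw_tb: float) -> dict:
    # Every logical write becomes `copies` physical writes, and usable
    # capacity is raw capacity divided by the number of copies kept.
    return {"usable_tb": raw_tb / copies, "writes_per_logical_write": copies}

def erasure_coding(data: int, parity: int, raw_tb: float) -> dict:
    # A stripe is `data` chunks plus `parity` protection chunks, so usable
    # capacity is the data fraction of raw capacity (full-stripe writes assumed).
    return {
        "usable_tb": raw_tb * data / (data + parity),
        "writes_per_logical_write": (data + parity) / data,
    }

raw = 100.0  # TB of raw disk across the cluster
print(replication(2, raw))         # 50 TB usable, every write done twice
print(replication(3, raw))         # ~33 TB usable, every write done three times
print(erasure_coding(10, 2, raw))  # ~83 TB usable, while tolerating two failures
```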
Then lastly, with our Acuity platform, we added a policy-based quality of service that utilizes an NVMe flash tier in this architecture, and now we're able to prioritize workloads. You assign policies to your workloads or your data sets, in the range of one to five, with policy one being mission-critical and policy five being non-critical. The system is intelligent enough to understand what the priorities are, what the workloads or volumes are doing, and what performance they require, and based on priority and on access patterns, it prioritizes resources, whether that's NVMe, RAM cache, SSDs, or SATA drives (in a hybrid model there are both SSDs and SATA drives). All of that put together into this intelligent, multi-tiered architecture means you can prioritize and have hands-free, policy-based performance management that will always meet your service levels. So now you can have one policy for your hot/warm data bucket, which is ingesting a lot of fresh data and serving a lot of searches, and a different policy for your frozen buckets, and that way you make sure the right data set gets the right performance.

To go a little deeper into the NVMe architecture in our Acuity hyper-converged infrastructure: we leverage NVMe flash, and it's a standard; you can put NVMe flash in pretty much any server today that has a PCIe slot. What we do on top of it is what makes the difference. We manage that NVMe intelligently, using it as a persistent tier of storage as well as a cache, and making sure the right data sets move in and out of NVMe at the right time, so we make the most of the NVMe resources. NVMe resources are ultra-fast, but they're not very scalable at this point; they're expensive, and you can only scale them to a certain extent in a node. So given our ability to decide what needs to be in NVMe, what needs to come out of NVMe, and when, we can get a lot of performance out of it. As a result, we can boost VM density by two to three times and provide better I/O performance as well as application performance. All put together, the intelligence working on top of NVMe is what drives this superior performance.
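As a toy illustration of that hands-free prioritization, here is a sketch of policy-ordered resource allocation: when the policy-one volume is idle, lower policies get the headroom, and when it becomes busy again, it is served first. The allocation logic is a deliberate simplification for illustration, not Pivot3's actual scheduler:

```python
def allocate_iops(volumes, total_iops):
    """Grant I/O capacity in policy order (1 = mission-critical, 5 = non-critical)."""
    grants, remaining = {}, total_iops
    for vol in sorted(volumes, key=lambda v: v["policy"]):
        grants[vol["name"]] = min(vol["demand_iops"], remaining)
        remaining -= grants[vol["name"]]
    return grants

volumes = [
    {"name": "hot-warm-bucket", "policy": 1, "demand_iops": 0},  # idle for now
    {"name": "cold-bucket",     "policy": 2, "demand_iops": 40_000},
    {"name": "frozen-bucket",   "policy": 5, "demand_iops": 30_000},
]
print(allocate_iops(volumes, 60_000))  # idle policy 1 leaves headroom for the rest

volumes[0]["demand_iops"] = 50_000     # the mission-critical workload wakes up
print(allocate_iops(volumes, 60_000))  # now policy 1 is served first, others throttled
```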
Here you can see quality of service in action and why it matters. On the left side is without quality of service, and this is a problem that has been plaguing a lot of IT organizations and preventing them from using either shared storage or hyper-converged infrastructure to consolidate multiple mixed workloads. What happens in that case is that you cannot prioritize, so the wrong workload can soak up all of the resources and the right workload suffers. In our case, you can prioritize them. On the right-hand side, you can say my order database, or think of it as a hot/warm bucket in the big data context, is mission-critical, and you have business-critical and non-critical workloads below it. If my mission-critical workload is not driving any I/Os to storage, the system will allow other workloads that want access to get that performance, but as soon as my mission-critical workload starts to drive I/O again, the system will throttle the others and prioritize resources so the right workload always meets its service levels. It's a hands-free way of prioritizing applications and making sure the right applications get performance no matter what the situation is. In many use cases, we were able to completely eliminate the performance optimization that otherwise goes on on an ongoing basis.

Coming to Splunk, we recently published a reference architecture using our new Acuity HCI systems. We did it on a hybrid system, because a hybrid system is what we feel is cost-effective for big data analytics, and the way we use the NVMe data path really does provide terrific performance for search and indexing activities. What we saw was about four to six times the indexing and search performance compared to what Splunk's reference hardware recommends. We were able to boost performance by optimizing I/O and making sure the CPUs are not waiting on storage to serve I/Os; that way you can fit in more VMs more efficiently, and fewer VMs can actually do more searching and more indexing. I'm going to talk about some of these statistics on the next couple of slides, but if you're interested in this reference architecture, email your request to me, my email address is right here, and I'll be happy to send it to you.

Here's a rough depiction of how the architecture was built, and the point is that it was exceptionally simple to build and to make sure it performs the way it's supposed to. You have indexers and search heads, and if you have any analytics software, you can also deploy that on the same nodes as your Splunk software; in this case, this is a Splunk reference architecture. Our quality of service allows multiple sorts of workloads, your search heads, your indexers, as well as your analytics, like the Malwarebytes case I mentioned previously, to coexist on one cluster, and you can prioritize them. When it comes to your data buckets, you essentially create the buckets and assign them policies. You'll have a few hot and warm buckets, which are relatively small; assign them policy one, and your indexing operations will always get priority. You may have a cold bucket that retains a month or two or three months' worth of data; assign it policy two, and it will be prioritized after the indexing activities, but when indexing is quiet, your cold bucket can get a lot of performance. A frozen bucket can be on a non-critical policy, because a frozen bucket is usually not searchable, it's just data retention. That way, you cost-effectively optimize where your data lives.

What's key about this flexible architecture is that it's flexible over time as well. You build it, it works based on policies, and there's very little performance optimization you have to do over time. But as your needs change, for example, an audit is happening and you need to make data that was generated three years ago available for searching, you can simply flip policies on the volumes. With policy-based quality of service, you can make policy changes on the fly and make that data available; now it's at policy two or three, utilizing some of the NVMe resources so your searches are more productive.
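Expressed as code, that bucket-to-policy scheme might look like the sketch below. The `set_policy` helper is a hypothetical stand-in for whatever management interface is used; the talk does not specify the actual call:

```python
# Illustrative mapping of Splunk data buckets to QoS policies, following the
# scheme described above (policy 1 = mission-critical ... 5 = non-critical).
BUCKET_POLICIES = {
    "splunk-hot-warm": 1,  # indexing and fresh searches always win
    "splunk-cold":     2,  # served after indexing, fast when indexing is quiet
    "splunk-frozen":   5,  # retention only; policies 4-5 consume no NVMe
}

def set_policy(volume: str, policy: int) -> None:
    # Hypothetical placeholder for a real management API or CLI call.
    print(f"volume {volume!r} -> QoS policy {policy}")

for volume, policy in BUCKET_POLICIES.items():
    set_policy(volume, policy)

# Audit scenario: promote three-year-old frozen data on the fly so searches
# against it get higher-performance tiers, then demote it afterward.
set_policy("splunk-frozen", 2)
# ... run the audit searches ...
set_policy("splunk-frozen", 5)
```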
So it's a very flexible architecture, operationally simple, and easy to scale: you can add nodes as your capacity needs grow.

What were we able to achieve with this architecture? With Splunk, generally, you look at indexing performance, how fast you can take in, index, and store the data, and you look at search performance, how fast you can search that data. What I'm showing on the left side is the indexing performance we were able to achieve: a three-node hybrid system was enough to support up to 13 terabytes of indexing per day. Now, in my practical experience, I haven't seen a deployment that uses that kind of indexing; I haven't really seen a deployment that uses more than a few hundred gigabytes, maybe a terabyte, of indexing in a day. So this is quite high; pretty much any large big data indexing need you may have, a three-node cluster may be able to satisfy from a performance standpoint. You may need to add more storage just to retain all of the data coming in, but from a performance standpoint, a three-node cluster can very effectively and safely index up to 13 terabytes a day on our system.

What you see on the right side is the search performance, which is measured for different kinds of searches: dense, sparse, and rare. Dense means there's a large concentration of matching data relative to the total data available, and that concentration goes down from there. Across the board, we were able to achieve a great performance improvement compared to what Splunk recommends or suggests you should get. So the platform does very well from an indexing as well as a search performance standpoint, and it's very simple to deploy, simple to manage, and simple to operate.
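For a sense of what 13 terabytes a day implies, here is a quick back-of-the-envelope conversion into sustained throughput, assuming ingest is spread evenly across the day (real ingest is burstier than this):

```python
TB = 10**12  # decimal terabytes, as storage capacities are usually quoted

daily_index_bytes = 13 * TB
seconds_per_day = 24 * 60 * 60

sustained_mb_s = daily_index_bytes / seconds_per_day / 10**6
print(f"{sustained_mb_s:.0f} MB/s sustained across the cluster")  # ~150 MB/s
print(f"{sustained_mb_s / 3:.0f} MB/s per node on three nodes")   # ~50 MB/s
```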
Vikram: I think we have a voting question here?

David: Yes sir, we do have a poll; let me pop that up for you.

Vikram: Awesome, thank you.

David: The poll should be on the screen. It says: if you currently have or are planning to deploy a big data analytics platform, what infrastructure are you considering? The choices are standalone physical hosts; virtualized servers and a separate SAN or NAS; hyper-converged infrastructure; not sure; or not applicable at this time. I'm curious to see what the responses are on this, so let's give the audience here a moment to answer. All right, I see some responses coming in... It looks like most of the audience has voted; if you haven't, get your vote in now. Let me share the results: 5% standalone physical hosts, 32% virtualized servers with a SAN or NAS, 26% hyper-converged infrastructure, 21% not sure yet, and 16% not deploying. What do you think, Vikram?

Vikram: On the first one, a lot fewer people today are running on standalone servers; that's a change since the beginning days of big data, when it was exclusively standalone x86 servers, so that's expected. I think people are moving toward newer technologies to simplify that infrastructure. The second one you see there, virtualized servers with separate SAN or NAS storage, a lot of these may be deployments already in place that probably came before HCI; they may be coming up for renewals and may start looking at HCI as their infrastructure. And the 26% you see under hyper-converged looks like a good number to me: people who may be looking at these deployments for the first time, or as a new deployment.

David: Absolutely, I agree.

Vikram: I've got one more slide, and then I'll take questions. This slide essentially maps some of the technical features and benefits we bring to the table to what they do for a big data analytics deployment. The low-latency NVMe data path we talked about provides high indexing as well as faster search performance, and it lets you do all of that while slashing hardware: you may need fewer nodes with Pivot3, and you may get much better performance per indexer VM than on other infrastructure, so you may need fewer VMs to do the same amount of indexing and search. We saw that firsthand when we did our reference architecture for Splunk: we got quite a bit higher performance, whether indexing or search, per VM with similar vCPU and RAM configurations, and much better transaction performance out of that.

The next thing we bring to the table is the multi-tier architecture, which is a first in HCI, and the policy-based quality of service, which is not only a first in HCI but which, compared to other storage solutions, provides very sophisticated resource prioritization, not just resource reservation, capabilities, and that's quite important. With that architecture, you really have one single infrastructure: Pivot3 can do your high-performance search and indexing but also your long-term retention, so you do not need two different storage systems, one for performance and one for efficiency. Typically, you would deploy a high-speed flash system to handle your hot/warm needs and a more scalable, lower-performance, cost-effective solution for your cold data retention; now you're managing two systems, sometimes three. In our case, you can put all of those data sets on one single Acuity cluster. It's policy-based and hands-free: once you assign the policies, the system will perform, and you can be assured of that. And it's flexible in the sense that you can change those policies on the fly, because there are situations in your organization, an audit coming up, or maybe you want to look at your organization's data from three years ago for your own business benefit, where you can very easily promote the frozen data sets into high-performing data sets at policy two or three.

And lastly, the distributed scale-out architecture and the erasure coding really add simplicity in scaling. Erasure coding provides market-leading efficiency from a capacity standpoint, so you have more usable storage per terabyte of raw storage that you buy. It can scale to petabytes; we have numerous examples of customers using us at that scale very easily, and you can grow non-disruptively as your needs grow. You simply add a node, you have more storage, and the whole thing gets aggregated again. You can add capacity as needed, in small steps; you don't have to plan your infrastructure for the next three years. You can add a node when you need it, maybe in two months, in six months, in a year, based on your data needs, one node each time you need capacity. It makes it really simple to grow your
infrastructure as your data grows. Oh, here's a question; let's get this one, David.

David: Okay... yeah, we already answered that one.

Vikram: And here's my information; if you have any questions, feel free to contact me there.

David: Excellent, excellent. I'll just leave that up while we do some Q&A. We do have some questions for you from the audience, if you're ready, Vikram.

Vikram: Yeah, I'm ready.

David: All right, first one here: can quality of service policies for storage be applied to data buckets, or are they based on virtual machines?

Vikram: Oh yes, they can be applied to data buckets. The quality of service policies, mission-critical being one and non-critical being five and everything in between, can be applied at a datastore or a volume level, and you can design your buckets so that they live in those datastores. We usually create a volume per bucket, so for your one or two hot/warm volumes, you would assign quality of service policy one; for your cold data storage volumes, maybe you assign them quality of service two or three; and then you may have one or more large volumes, depending on your use case, for frozen data, and you can assign them policy four or five. The good thing about four and five is that data sets assigned to those policies will not take up NVMe resources, so you're making sure that the data sets that don't need performance don't take performance away.

David: Got it, got it. Another question here is about NVMe in Pivot3: is it used for data storage, is it used for cache, or what is it for?

Vikram: It's for both. You would not be able to create a datastore and tag it to NVMe, because that would be a very inefficient way of using NVMe. Our system uses it internally, as a persistent tier, and parts of the NVMe are used as a caching tier. We manage that internally, transparent to the application, and it's really the quality of service policies that govern what goes into NVMe, and how much NVMe is used for which data sets.

David: Okay, nice. And if I want to do some automation and use some APIs, does Pivot3 offer APIs?

Vikram: Yes, Pivot3 has REST APIs, so if there are any application integrations, or you want to do an integration from an automation standpoint, it's very easy to do that with Pivot3.
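The talk confirms only that REST APIs exist, so here is a hedged sketch of what automating a policy change over REST might look like; the endpoint, payload shape, and authentication are hypothetical placeholders, not documented Pivot3 API details:

```python
import requests

# Hypothetical endpoint and schema for illustration only; consult the
# vendor's actual REST API documentation for the real interface.
BASE_URL = "https://pivot3-mgmt.example.com/api/v1"
TOKEN = "REDACTED"

def set_volume_policy(volume_id: str, policy: int) -> None:
    resp = requests.put(
        f"{BASE_URL}/volumes/{volume_id}",
        json={"qos_policy": policy},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

# Example: promote a frozen Splunk volume ahead of an audit window.
set_volume_policy("splunk-frozen-vol", 2)
```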
David: Okay. And what kind of server nodes are compatible with Pivot3?

Vikram: Good question. We work extensively with two key vendors right now, Dell and Lenovo, and we recently started supporting a lot more form factors. You can now choose between 1U and 2U form factors, and we have some high-density form factors as well: Dell R730s and R740s, and on the Lenovo side the 3650 as well as the SR650 that came out recently. Some of the high-density models can go all the way up to 172 terabytes per node. So we provide various form factors, either 1U or 2U, and then you have a choice of different drive sizes, meaning different total node capacities, different CPU core configurations, and different RAM configurations. And there are a couple of ways you can deploy our solution: you can buy Pivot3-branded appliances from our channel partners, where everything comes to you branded and ready to go, or you can work with our partners like Dell to source your hardware and put us on top of it as a software-only offering.

David: Very nice, very nice. Well, I think that's all the time we have for questions, but a great presentation. Thank you for being on the event today.

Vikram: Thank you, David. I appreciate the opportunity.
