Next Gen Big Data Analytics with Apache Apex



Hello, good afternoon everyone, and welcome to our talk about Apache Apex. I'm Thomas, and I'm here with Pramod to give you an introduction to what Apex is. You have probably heard a lot about stream data processing at this conference, there were a number of great talks, and I'm curious: who has heard about Apex before? Okay, that's good. And how many of you have used a stream processing framework so far? Nice.

So let's start with what stream data processing is, in a short summary. We have an increasing number of data sources and increasing volumes of data that we would like to process and turn into actionable insights. Data comes from IoT devices, from Kafka topics, from files, which are actually a continuous source of data as they are being continuously produced, from social media feeds, and so on. There is a lot of data, large volumes, and we want to find ways to process it. What is common to all of these is that they are continuous streams of data: there is not really a start and an end point, the data keeps coming, and we want to be able to process it and compute results from it.

The architecture for dealing with large volumes of data at this scale is really an in-memory processing architecture, and there are a number of frameworks out there that do processing in memory. In addition to that, when we have a continuous stream of data, we need to establish boundaries at which we can do computations and at which we can emit results. Those boundaries are called windows in stream data processing systems, and most of the time they are time-based windows.

What is very important for this type of processing is that we can maintain state in the system. Let's say I want to compute a count: I need to collect the data, and every time I receive a new event I add to the count, but I need to maintain that count in memory, and I can't lose it, because at some point I need to materialize that result and provide it to a downstream system. So state management, stateful operation, is key to stream data processing. Having that foundation, and then being able to do aggregations, windowing, and so on, is what enables us to build streaming analytics pipelines.

Then, what do we do with the results? They can be made available to a variety of other systems, the usual suspects: databases and files, but also a message bus like Kafka, to hand them to other downstream systems. An interesting idea that was also presented earlier today is to make the in-memory state we already have available to a querying system directly.

Here is an example flow. We have a browser that generates events on a web server, those events manifest in logs, and the logs go into a Kafka topic, let's say. The blue boxes are what Apex would be doing: it subscribes to the data from Kafka as a consumer, retrieves the log messages, perhaps decompresses them, parses the individual lines, filters them, and then we have a stateful operation, the aggregation, where we actually need to accumulate some data before we can compute a result from it. Then we provide the result for consumption; in this example we chose a Kafka topic as the destination, because from there we could send it to another system or to a front end. Interestingly, a use case that very closely resembles this pipeline is running with an existing customer that uses Apex in production, and they use it to visualize data in real time, directly from the in-memory state of the pipeline.
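To make the windowing and state idea concrete, here is a minimal plain-Java sketch, not Apex code, of the kind of computation the aggregation step above performs: events are counted per time bucket, and that map of counts is exactly the state that must survive a failure, which is what checkpointing provides in a real engine. The one-minute window size is an arbitrary choice for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (plain Java, not the Apex API): counting events per one-minute
// window, keyed by the window start time.
public class WindowedCounter {
  private static final long WINDOW_MS = 60_000L;

  // This map is the operator state: lose it and the counts are gone.
  private final Map<Long, Long> countsByWindow = new HashMap<>();

  public void onEvent(long eventTimeMillis) {
    long windowStart = eventTimeMillis - (eventTimeMillis % WINDOW_MS);
    countsByWindow.merge(windowStart, 1L, Long::sum);
  }

  public Long resultFor(long windowStart) {
    return countsByWindow.get(windowStart);
  }
}
```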
So, what is Apex, and what does it do? It does in-memory distributed stream processing. It divides the application's processing logic over many machines, so it can parallelize the computation and scale out. It is Java based, and it provides an API for you to build custom building blocks in addition to the library of building blocks that is already provided. It is scalable, it is a high-throughput system, and it can do computation with low latency. Beyond scalability in general, it is also possible to scale up and down dynamically with Apex, which is what we refer to as elasticity. It also has a number of optimizations for compute locality that allow you to arrange the different operators, as we call them, for better performance: co-located in the same thread, in the same process, or on the same machine.

Things that were important for us from the beginning are fault tolerance and operability. Fault tolerance, of course, and correctness: not losing any data, but also not double counting. The overarching goal is to provide a system that can do end-to-end exactly-once processing, meaning it takes into account not only the stream processor itself but also the interaction with external systems, to give you strong processing guarantees. Stateful means the state is preserved: there is a mechanism to bring back the state in the event of a failure, and checkpointing is the way to keep a durable copy that can be used for recovery. For operability there are a number of interesting features in Apex: various metrics that can be used not just for monitoring but also to drive scaling decisions, the ability to record data and to visualize data as it is being computed in the system, and the ability to perform dynamic changes, which means configuration changes but also changes to the processing logic itself while the system is running.

In the Hadoop stack, Apex is a Hadoop-based platform: it runs on top of YARN and HDFS. It is broken down into two pieces: the engine, which provides the high-performance, fault-tolerant streaming and data-in-motion processing, and the library, Malhar, which provides ready-made operators that you can use to assemble applications. There is a wide range of connectors in there that interface with most of the common systems out there. I mentioned Kafka earlier, but there are also connectors for HBase, Cassandra, RabbitMQ, and many other systems, so chances are you will find what you need there. On top of that, DataTorrent provides a number of value-added tools for visualization. Those actually use the same REST API that is available for building custom tooling, which lets you access applications at runtime, get stats from them, control certain aspects, and make dynamic changes. There is a management console for the admin, there is a dashboard for visualizing data as it is being computed inside Apex applications, and there is a way to assemble applications without writing Java code.

I said that Apex is a Hadoop-native, YARN-native system. What that means is that for every application the user launches on the cluster there is a separate application master, and that application master is responsible for acquiring the resources that are needed for the distributed compute. It will look at the processing graph that was defined and acquire the containers initially, but it is also able to acquire resources dynamically while the application is running, because if we talk about things like dynamic changes, dynamic scaling up and down, or introducing new compute logic, you need to be able to go and get the resources for that, and that happens by interacting with the YARN ResourceManager. The green boxes with the numbers are the streaming containers. With that, all the things you are familiar with from YARN, or from Hadoop in general, apply: it plays nice with security, you have multi-tenancy, and you have isolation for applications on a secure cluster.
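The compute-locality hints mentioned above are expressed on the streams that connect operators. Here is a minimal hedged sketch against the Apex DAG API; the operator names are hypothetical and the fragment is assumed to sit inside an application's populateDAG method.

```java
// Fragment assumed to live inside StreamingApplication.populateDAG(DAG dag, Configuration conf);
// "parser" and "counter" are hypothetical operators previously added via dag.addOperator().
// Locality is a hint to the engine: CONTAINER_LOCAL places producer and consumer in the
// same JVM, THREAD_LOCAL in the same thread, NODE_LOCAL on the same machine; leaving it
// unset lets the engine place them freely across the cluster.
dag.addStream("parsedEvents", parser.output, counter.input)
   .setLocality(DAG.Locality.CONTAINER_LOCAL);
```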
The application development model is operators and streams: you form a directed acyclic graph. The one exception to the directed acyclic graph is when you want to do iterative processing; it is actually possible to feed the output of one operator back upstream as an input. A stream is a sequence of data tuples. In a streaming-first system we process event by event, without waiting for batches or anything like that, so each event is presented to the operator as a separate call, and the stream is the sequence of such tuples that is given to the operator. An operator can accept multiple streams, and it can produce multiple streams too. Operators can be prebuilt operators from the library or custom operators built against the Apex API.

The next concept, which is important for state management, is checkpointing. In Apex we also call it a streaming window, not to be confused with application-level windowing. These are events that flow through the processing graph, through the execution layer, along with the data: every source periodically generates a window tuple, by default every 500 milliseconds, and that window tuple traverses the entire DAG. It provides an opportunity to do processing periodically that we don't want to do on every single event; that is optional, but inside the engine it is also used for checkpointing. At a configurable interval, such a tuple will be a checkpoint tuple. Those are the colorful markers that you see here; the data tuples are gray, and you might not see them clearly, but the colored markers are the control tuples that flow through the system. When such a yellow checkpoint tuple arrives at an operator, the operator snapshots its state and saves it to durable storage. Every operator does that independently, in a distributed way, so there is no blocking and no central entity involved in the checkpointing, and the saving of the checkpoints happens asynchronously so that it does not block the processing. With this checkpointing we have consistent state that we can go back to when we need it for recovery, but it is also needed, for example, for dynamic partitioning, when we need to redistribute the state. An important point here: there is no artificial latency. These streaming windows, or intervals, have nothing to do with micro-batches; they are just additional events inserted into the flow and traversing the DAG.
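As a rough illustration of how that cadence is controlled, here is a hedged sketch using the Apex attribute API as we understand it; the values are illustrative, not recommendations, and the operator variable is hypothetical.

```java
// Fragment assumed to sit inside populateDAG(DAG dag, Configuration conf);
// assumes an import of com.datatorrent.api.Context and a previously added
// operator instance named "counter".
// Streaming windows every 500 ms (the default mentioned above)...
dag.setAttribute(Context.DAGContext.STREAMING_WINDOW_SIZE_MILLIS, 500);
// ...and a checkpoint every 60 streaming windows, i.e. roughly every 30 seconds.
dag.setAttribute(counter, Context.OperatorContext.CHECKPOINT_WINDOW_COUNT, 60);
```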
So what about event-based computations? It is important that we can process based on event time, because this is how the user most of the time understands the data, and also to decouple the processing from the time at which an event was generated. For example, we may want to replay last week's data and get the same result, so we need to be able to look at the event time. How does that work? When you do processing in Apex, you have your data tuples, and the data tuples contain the timestamp, which is the event time; those are the blue boxes at the top. Below the arrow you have the state, and the state is really a set of time buckets. The processing of the data tuples affects the state, and the vertical lines are the moments where checkpoints are taken. As the events come in, we update the time buckets, and the checkpointing makes sure the data is safe. As you can see, if you repeat the same sequence of events, we arrive at the same result. In addition to that, Apex operators also support a number of the other windowing semantics that you might be familiar with from Beam.

Scalability is achieved through partitioning. This first slide will probably look familiar to you: you have one logical operator, or task, and you want to have multiple instances of it in the execution layer so that you can handle more data. In addition to those instances there is the unifier, which brings the output of the multiple partitions back together. What that unifier exactly does of course depends on the compute logic. For example, if we are counting, then the unifier is simple: you just add up all the partial counts to get the total count, and that's great. But if we do a top-N computation, then the unifier logic is much more involved and different. That is why the unifier in Apex is also a pluggable component, although it is tightly related to the operator that is being partitioned. On the right side, what you see is really a shuffle; the parallelism of each operator can be controlled separately, and the unifiers can also be partitioned so as to avoid bottlenecks in the processing. This is standard stuff, it's a shuffle.

Now let's go to the advanced partitioning. Apex has a concept of parallel partitioning. Parallel partitioning means I don't want to have a shuffle after every operator; I would like to run multiple computations in sequence and only then have a unifier, and it is possible to express that. So we can build parallel pipes: for example, here you have three operators arranged in a pipe, and then you have another partition with three operators, and these process independently. Of course this provides greater efficiency, because we don't need to shuffle, but it also has another benefit. If, let's say, I don't have this last operator and I don't need a unifier, then I have truly independent pipelines that process independently, and they are also independent when it comes to recovery. Let's say one operator fails, this one here: then there is no effect whatsoever on the other pipe in the Apex topology, and only these two operators have to be reset to the checkpoint and recovered. This is a unique feature in Apex, and it enables you to do things like speculative execution to achieve an SLA guarantee.

Then we have cascading unifiers. Assume we have an operation that reduces the data, so we get a high-volume input but a relatively small output in terms of data volume. Now we may have a bottleneck, for example at the network interface of a single machine. You can overcome this in Apex by cascading the unifier: you simply have multiple unifiers in a sequence, which of course adds to the latency, but in the first instance it allows you to actually handle the data volume. You get a partial reduction done at one level, another partial reduction at the next level, and if you have a very large number of partitions this can cascade over multiple levels.
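In code, the partition count and the parallel-partitioning behaviour are typically expressed as attributes on the DAG; here is a hedged sketch against the Apex attribute API, with hypothetical operator names.

```java
// Fragment assumed to sit inside populateDAG(DAG dag, Configuration conf);
// assumes imports of com.datatorrent.api.Context and
// com.datatorrent.common.partitioner.StatelessPartitioner, and hypothetical
// operators "parser" (of type LogParser) and "filter" added earlier.
// Run four instances of the parser...
dag.setAttribute(parser, Context.OperatorContext.PARTITIONER,
    new StatelessPartitioner<LogParser>(4));
// ...and keep the downstream filter inside the same parallel pipes as the parser,
// so no shuffle or unifier is inserted between the two operators.
dag.setInputPortAttribute(filter.input, Context.PortContext.PARTITION_PARALLEL, true);
```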
Then we support dynamic partitioning, as I said earlier. Let's say we have two operators, and based on a metric, which can be throughput or latency or some external information, we want to scale this up and allocate more resources to it. That is possible, and both the trigger that decides we want to make such a scaling decision and the logic that distributes the state among the partitions are customizable, because where you had two partitions here, you now have four over there. It is a stateful system, so each of these partitions has state in it that needs to be carried over, and how the state of two partitions is redistributed into four is, in many cases, something that only the developer of the operator or the application knows. There are some default implementations that we provide, but in general this is something the user can influence. So we can do dynamic scaling, we can acquire additional resources, and then we can throttle it back down again. One example of that is actually the Kafka consumer: if the cluster changes, we can automatically detect from the metadata that there are more partitions, and we can bring up additional consumers or simply change the mapping of which Kafka partitions are consumed by which Apex operators.
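We said the scaling trigger is customizable; here is a hedged sketch of what such a trigger could look like, written against the Apex StatsListener interface as we understand it. The threshold and the specific metric accessor are our own assumptions, not something shown in the talk.

```java
import java.io.Serializable;
import com.datatorrent.api.StatsListener;

// Hedged sketch of a custom scaling trigger: ask the engine to repartition an
// operator when its processed-tuples-per-second moving average crosses a made-up
// threshold. How the operator state is then split across the new partitions is
// decided by the operator's Partitioner, as described above.
public class ThroughputScaler implements StatsListener, Serializable {
  private static final long MAX_TUPLES_PER_SECOND_PER_PARTITION = 50_000L;

  @Override
  public Response processStats(BatchedOperatorStats stats) {
    Response response = new Response();
    response.repartitionRequired =
        stats.getTuplesProcessedPSMA() > MAX_TUPLES_PER_SECOND_PER_PARTITION;
    return response;
  }
}
```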
For fault tolerance, operator state is checkpointed to a persistent store; the default for that is HDFS. There are two things that happen during checkpointing: the engine receives the checkpoint tuple that I talked about earlier, takes a copy of the operator's state, and then that copy needs to be written to a durable store. That second operation happens asynchronously, to HDFS by default. The checkpointing back end is actually pluggable; recently we had a contribution to make this work with Apache Geode as well, and any other back end could be used. HDFS is just the default, because Apex runs with minimum dependencies on Hadoop. In case of a failure, an operator can be put back to the checkpoint, so we don't lose the state, and I will also talk about how we don't lose the data that is in flight. We have two ways of detecting that something failed: one is YARN, of course, telling us that a process has failed, and from the exit status we understand what to do with it; we also have the heartbeat mechanism from each container to the application master. The application master state, the controller, also needs to be checkpointed, because we said we can make dynamic changes to the topology, so we need to keep track of that knowledge too, and that is done as well.

And then we buffer the data, to preserve the in-flight data. The way this works is with the buffer server: each operator that produces data sends it to a buffer, and the downstream operator takes it from the buffer. It is an in-memory pub-sub mechanism, a bit like Kafka without the disk, because we don't need fault tolerance for that buffer, but we do need the buffer to be available. If this operator fails, we bring back a replacement with the checkpointed state, and it will go back and subscribe with the buffer server from the position of the checkpointed state. So there is no need to do anything over here: if this operator fails, there is no resetting of the entire topology in Apex; this is incremental recovery, and the same mechanism is also used for dynamic partitioning.

Finally, we have end-to-end exactly once. Apex has the ability to do the checkpointing, it doesn't lose the data, and it will replay the data when needed. Given an idempotent operation, this is exactly once within the system. Then, depending on what external system we are interacting with, and this is where the operator library comes into the picture, we utilize the support that the respective system offers. For example, when we are writing to a database we use transactions; when we are writing to files we use atomic file renames; whatever technique is available. With a message bus it might be querying the last message that was written. So we can achieve end-to-end exactly-once semantics. With that, I will hand it over to Pramod, and he will talk about application development on Apex.

Thanks, Thomas. So far we saw the features and functionality of the Apex stream processing engine, the scalability and fault-tolerance features that it provides. Now let's take a bit of a look into application development. On some of these slides you will see an Apex logo at the top; when you see the Apex logo, it means that's a feature in Apex, otherwise it is an add-on provided by DataTorrent.

The easiest way to build applications, if you already have your business logic, is the pipeline builder. As Thomas talked about earlier, to build your application you write operators, which are the building units of your application and contain your business logic. With the pipeline builder you can connect these operators up to quickly get your pipelines going, you can configure them, and you can assemble different kinds of pipelines.

If you want more flexibility when you are actually building the business logic, you build your application in Java. There are two ways that are supported: there is a low-level API and there is a high-level API. The low-level API is more compositional: you are actually building a graph, specifying the individual processing units in the graph and then connecting them up. This is a simple word count application with four operators: the first one reads the data from an input source, the second one parses it, the third one counts the words continuously, and the fourth one outputs it, just showing it to you on a console. In this compositional API, the first four lines on the slide instantiate those operators and the next three lines connect them up.

The next one is the high-level API, which is easier to pick up and go. With this one you build your applications in a declarative fashion: you specify a sequence of operations to perform on your data, and you use the Apex stream API for this. Here you can see that you specify the source of the data, files in a folder, then apply a simple function on each event, and then a count and a print. These APIs are extensible as well: you can extend the stream API with your own functionality, and you can also mix low-level operators with the high-level API.
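For reference, here is a hedged sketch of what that four-operator word count could look like against the low-level DAG API. LineByLineFileInputOperator and ConsoleOutputOperator are Malhar operators; WordParser and WordCounter are hypothetical custom operators of our own naming (sketched a bit further below), not necessarily the ones shown on the talk's slide, and the input directory is made up.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.io.ConsoleOutputOperator;
import com.datatorrent.lib.io.fs.LineByLineFileInputOperator;

// Hedged sketch of the four-operator word count: instantiate the operators,
// then connect them into a directed acyclic graph.
@ApplicationAnnotation(name = "WordCountDemo")
public class WordCountApp implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    LineByLineFileInputOperator input =
        dag.addOperator("input", new LineByLineFileInputOperator());
    input.setDirectory("/tmp/wordcount-input");   // hypothetical input folder

    WordParser parser = dag.addOperator("parser", new WordParser());
    WordCounter counter = dag.addOperator("counter", new WordCounter());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    dag.addStream("lines", input.output, parser.input);
    dag.addStream("words", parser.output, counter.input);
    dag.addStream("counts", counter.output, console.input);
  }
}
```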
This is a new feature that is going to come out in our next version, 3.5, where we have high-level support for the different event-time windowing specifications defined by Beam. This is a modified version of the previous example where we apply windows to it: we are saying that we want to output the result every one second, we want a cumulative count, and since the data has no boundaries as such, we create one global window. We are also working on a Beam runner, and that will be out shortly as well.

So how would you actually go about building a new operator, your own custom business logic that is not among the ones already provided? Here are a couple of examples. The first one is a simple parser. To receive data, you define an input port and implement one callback called process. Whenever data arrives at your operator this method is called, and in it you can do any custom processing you want. When you want to output some data, you simply emit it to an output port. That's it. Each operator does not have to worry about where the data is coming from or where it is going next; you can just concentrate on your business logic at that stage. At the bottom you see a more complex operator which has state. This is a counter: it counts the number of unique events it is receiving, so it keeps the current counts in a map and increments the count every time a new event is received. This state is preserved, so even if your operator goes down it will be recovered from the checkpointed state, and you don't have to do anything special; it is done for you automatically. Any variables in your operator that are not transient will be preserved and recovered.

This is a brief overview of the operator library, listing the commonly used operators. I want to mention Kafka: we have very robust support for Kafka. Our Kafka operators are dynamically partitionable, as Thomas mentioned earlier, so they detect the number of partitions on the Kafka side for a topic and scale accordingly. They support multiple-topic ingestion, they support both the 0.8 and 0.9 APIs, and they are all fault tolerant and idempotent. You can see there are different kinds of operators here: operators for messaging, NoSQL, parsers, common ETL stuff, analytics and dimensions, and a whole bunch of other operators.
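To make those two operator styles concrete, here is a hedged sketch of a stateless parser and a stateful counter, written against the Apex operator API (BaseOperator, DefaultInputPort, DefaultOutputPort); the class names and the word-splitting logic are our own illustration rather than the exact code from the slides.

```java
import java.util.HashMap;
import java.util.Map;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Stateless parser: react to each tuple in the process() callback and emit results.
public class WordParser extends BaseOperator {
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>() {
    @Override
    public void process(String line) {
      for (String word : line.split("\\s+")) {
        output.emit(word);   // the engine takes care of routing to downstream operators
      }
    }
  };
}

// Stateful counter: the non-transient "counts" field is checkpointed by the engine
// and restored automatically after a failure.
class WordCounter extends BaseOperator {
  private final Map<String, Long> counts = new HashMap<>();

  public final transient DefaultOutputPort<Map<String, Long>> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>() {
    @Override
    public void process(String word) {
      counts.merge(word, 1L, Long::sum);
    }
  };

  @Override
  public void endWindow() {
    output.emit(new HashMap<>(counts));   // publish a snapshot once per streaming window
  }
}
```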
When your application is running, it is a distributed application: there are many components, and all of them are running on a cluster. You don't want it to be a black box; you want to know what is going on. This is where the monitoring console comes in. When your application is running, you can see all the individual operators and partitions, you can look at their stats, their configuration, and the resources they are using, and you can even record data live as the application is running. For developers the console also provides easy debugging tools, like looking at the logs all in one place, searching through logs, and finding any failure and recovery events.

These are two views of an application. On the left is the logical view, the application as you designed it, and the right side shows the multiple partitions; as you can see, the block reader and block writer are actually partitioned four ways. So you can see all the details at both a logical level and a physical level. There is one more thing you can see on the right, even if it is not entirely clear here: the block reader and block writer partitions are the same color, and that is actually showing locality. That is another feature we provide: these operators could be separate processes running on different machines, they could be on the same machine, they could be running within the same process as different threads, or they could be running inside the same thread. So we allow different placements of operators based on your data needs and your processing needs.

Another capability that DataTorrent provides is real-time dashboards. If you want to visualize the output of your application in real time, without it actually touching a store, you can do that, and we provide a big set of widgets; you can add your own widgets as well.

Just briefly, I want to touch upon some of the use cases. Apex and DataTorrent are being used in production in different places, and these are use cases the customers have talked about publicly and done meetups on; there is a slide after these use cases with links to those videos and slides.

The first one is PubMatic. They are in the ad space and provide analytics for publishers, so they are able to analyze ad impressions and click logs, slice and dice that information, perform different dimensional combinations and cubing, and show those results in real time. This is a brief overview of how much data they are processing and what the outcome of using Apex has been for them.

The second one is GE. GE is one of the leaders in industrial IoT; they have built a cloud-based platform called Predix, which is used to process data from all the GE machines deployed across the world. Think about windmills, deep-sea oil rigs, whatever; GE is pretty big. GE is using Apex in their time-series application service, where they use it to ingest and process high-speed data at a rate of millions of events per second.

Next is Silver Spring Networks. They are also in the IoT space: they provide the technology for smart meters for utilities, and they deploy technology to collect and process that data. They are using Apex to analyze logs and failures so that they can predict future outages. Their input data is XML and their output is Avro, so essentially they are using different file formats for their different applications.

This page has the links for the use cases that I talked about, if you want to go into more depth on what those are. And on the last page there are some more resources about Apex and the other features that Thomas and I talked about; if you want to go into detail, there are slides and videos that give you more information. Any questions?
Yes, if the locality is node-local, meaning different processes on the same node, the data is still serialized, but it goes through the loopback interface, so you are not sending the data over the network. If the operators are in the same container, then there is no serialization at all; it just goes through an in-memory queue.

Yes, it's D3 underneath, and the widgets are pluggable, so you can plug in a different framework if you want; your widget just has to follow a certain API.

Yes, we do have operators that allow you to do that. If you are talking about application data, we have what is called managed state: if you have a large amount of state in your operator and you cannot store it all in memory, we provide a state management system that spills to disk.

It is the last checkpoint, because the new partitions will start from the last committed state. Yes, it is consistent state: when you repartition, you take the last consistent state that you have, redistribute it to form the new state, and bring it online. We can only take data that is consistent. It is actually similar to the recovery case: when an operator goes down, it goes back to the most recent available state that we have checkpointed, and dynamic partitioning works in a similar way.

Yes, because the data is already available in the buffer of the upstream operator, it will automatically be replayed: the operator restarts and pulls the data from the buffer. Unless it is the input operator, the first operator in the DAG; in that case it needs to keep track of the offset in its state. For example, the Kafka operator keeps track of the offset, so when it is restarted it pulls the data from that offset. The actual data is in the buffer, but these buffers keep moving forward, and older data is discarded once the entire DAG has checkpointed beyond a certain point.

Yes, we do. On the Apex website there are benchmarks that we report with every release, and those are published. We also have a contribution to the Yahoo streaming performance benchmark that is in a repository out there; there was a preliminary presentation about that, and we will shortly publish a blog post as well. We got pretty good results, so stay tuned.

It is achieved with the platform and the operator working together, typically the output operator. The platform guarantees that it will recover the state and replay the data, so you receive the tuples in an idempotent fashion. The output operators that we have in the library, for example the file output operator or the database output operator, then ensure that the state of the external system is consistent with what they recovered from the checkpoint, so that they don't duplicate the data. I can go into details offline if you want, and we can also take that offline in case there are other questions, but we do have operators specifically for file output that preserve exactly-once semantics. We can discuss how we do it, but it is definitely correct. What that mechanism is depends on the sink; it is not a problem you can solve in the engine alone, but with the engine working together with the specific connector you can achieve it.
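As a rough illustration of that last point, here is a hedged sketch of a file sink built on the Malhar file output base class, which handles the idempotent replay and part-file finalization discussed above; the class name, file name, and encoding are our own assumptions.

```java
import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

// Hedged sketch: a sink that writes one line per tuple. The heavy lifting for
// exactly-once file output (replaying from the checkpoint and atomically
// finalizing part files) is done by the Malhar base class.
public class LineFileWriter extends AbstractFileOutputOperator<String> {
  @Override
  protected String getFileName(String tuple) {
    return "word-counts.txt";          // single output file, purely for illustration
  }

  @Override
  protected byte[] getBytesForTuple(String tuple) {
    return (tuple + "\n").getBytes();
  }
}
```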
Any other questions? We are available after the talk if you want more details or have any other questions; we will be at the door. Thank you.
