The Data Analytics Platform or How to Make Data Science in a Box Possible – Krzysztof Adamski



It's the end of the second day and everyone is almost ready for the party. I don't see much of a party atmosphere here, though. Not really. This is a highly technical conference, and you see two people from a bank who are going to be talking about technical things, which is interesting. We are not a famous tech company, we're a bank, but apparently we've got a massive amount of data: billions of transactions. There's one more thing that a bank has, and that is trust, the trust that whatever you put in a bank is safe. If you combine data with trust, it becomes our responsibility to manage and process that data in a responsible way. That is the challenge on our side, and with that challenge we would like to change our approach. Is Kubernetes the answer here? Let's find out.

So we start our mission, and the aim is ambitious. We would like to be a platform that is used widely across ING, by fifty percent of our employees, fifty percent of fifty thousand employees across the world, in a secure and responsible way, and we would like to achieve that goal in two years' time.

But if you work in a bank, that's not a greenfield; we haven't started from scratch with a blank page. There is a history of data warehouse systems within the bank, and there are BI systems. When we set up the advanced analytics team five years ago, we started with the Hadoop stack. It was the standard stack back then to build your platform on top of, and obviously we embraced the scalability of Hadoop: you can scale Hadoop to thousands of nodes, that's not a big deal. But there were some limitations of that stack, and Holden, during her session, scratched the surface of these limitations: managing Python dependencies, monitoring tens of Conda deployments, different versions of Python packages. This is something we faced back then.

There is also something different: the well-known problem of pets versus cattle. Whenever you manage your own Hadoop deployments, you know your servers by heart, you know their names; they are pets to you. It shouldn't be like this. You shouldn't have a relationship with your servers, and if you don't like them you should be able to break that relationship and release those machines. That's the way I see it. Another problem is managing Hadoop versions. Upgrading the whole Hadoop stack and never being sure whether it is really going to work scares me: I have to upgrade the whole platform and I cannot upgrade components independently. That's something we really didn't like back then.

When we decided to start our journey we said: OK, we are talking about technology, but what is important for us? Let's start from the user journey perspective. Our goal is to put the data and the tooling in one place and to guide our users on that journey. Basically, we would like to democratize data and democratize tooling, and that's why we put the user at the front, at the start of our journey. We also want to be cloud ready, cloud native. We work in a bank, and banking is not exactly the first industry to go into the cloud, but how do we prepare for that movement, which will happen eventually? How do we prepare the architecture to be ready for that move to the cloud and for the rise of microservices?
We were observing Kubernetes back then. It's a five-year-old project, and I had been watching it from the very beginning; we were using Docker for local development. But using it to manage data, to build data pipelines, was not an obvious choice. It was being used for stateless applications, for web applications, but not for data processing. Something is changing, though. We've got the benefits of Kubernetes, the benefits of isolation, the benefits of resource management, and projects like Spark on Kubernetes have emerged that give us exciting possibilities to take that step and become cloud native. It's closing that gap. Over to you, Rob.

OK, thank you. Can you hear me? Yep. Thanks, Krzysztof. It's probably time to talk some architecture, about the solution that we settled on for this challenge. It is called the Data Analytics Platform, and we intend it to be a platform as a service internally inside ING; we're hoping to also take it to other places in the future. To be a platform as a service you need a number of things.

Firstly, you need a front end: something that guides your users as they make their journey through your product. They need to be hand-held, shown all the tools they can use and how they should achieve their jobs, without having to constantly refer back to a user manual, and that should bring some consistency to your product. It should feel like a coherent whole.

Secondly, you need a compute framework, something within this platform that is ultimately going to be running these data science workloads. Based on the lessons of the past, we didn't want this to be too opinionated. We wanted it to be agnostic to the technologies, so that data scientists can choose the latest and greatest technologies of the day, and we don't want to have to manage them as separate clusters. We don't want a Spark cluster, a Flink cluster and a cluster for technology X that comes out next year; we want a framework that can handle all of this. And yes, spoiler alert, we're at KubeCon: it's Kubernetes.

Being a bank, we have lots and lots of data, multiple petabytes coming into this platform every year, and quite different types of data: obviously transactions, but also trades, clickstreams, things like that. We want to be able to hold this in a high-speed cache right alongside where the data scientists need to do their work, and we'll be going into that. But our data must also be kept in certain separate areas. Certain types of data can never be combined in a bank, and certain users should never be able to see certain types of data, so whatever data system we select needs to have really quite fine-grained access controls.

Finally, running through all of this, we have a load of security and integration tooling. We have thousands of users on this platform and thousands of production pipelines, and we need full traceability of all of it. We need to know exactly where our data is and where our data was. If we get asked to explain why a model gave a certain result three years ago, we have to be able, within a number of days, to go back, replicate everything and work out what happened. So we have a whole suite of tooling running through the whole platform to deal with this.
So let's drill into one of these sections. The first thing a user sees when they log into the platform is our DAP portal, and it is precisely what I mentioned: a web interface that shows you all the things you can do as a user. It has a service catalog with all the tooling you're allowed to access, a file browser to view our storage cluster, and ultimately an environment where you can define all your dependencies and how you want to do your job, and it provides that environment to you as a data scientist.

Interestingly, we found that one of the main bottlenecks for a data scientist, once they have a good working platform with all their tooling, is actually just understanding the complex data environment within an organization. ING is a very large multinational bank, with lots of geographies, different industries and legacy systems, and just understanding "I have to solve this data science problem, so what data do I need to look at?" is a really significant challenge. For this we formed a partnership with Lyft, the transportation company in the US, and we built an open-source product called Amundsen, which is what you see behind me. It is a search-based interface: you start by searching for a certain class of data, and you can then drill into it. It will show you the obvious things, like the metadata for that data, who owns it, when it was updated and metrics around the data, but it also shows other things, such as the social nature of that data. Perhaps you work in a similar department to a power user; it can recommend which data sources you should probably be looking at. It can also show you things like access patterns: how useful do other people in the bank find this data? The coolest feature for me, when combined with lineage (which we'll go into later), is that you can also see what people did with this data. Here I have a data source called "grid", and I can see that other data scientists in my department have already enriched it, maybe geocoded it with latitudes and longitudes, so I'm ready to do my geospatial analysis without having to do that work myself, and that's all exposed via this tool. As mentioned, it's an open-source project; if this sounds like an interesting problem or an interesting tool for you, go and check it out and, as always, contribute back and help us.

One issue with the DAP portal being a web interface is that it could itself be a security vulnerability. It has an interface for you to fire off a SQL query and get data back, and there's nothing stopping you then copying that data onto your local laptop. No surprises, in a banking environment that is a very bad thing: we don't want users being able to take data onto a portable or non-secure device. So we needed something with the nature of a jump host, a remote desktop, and here Kubernetes comes to the rescue. There are commercial VDI solutions about, but we wanted to build our own open-source platform, and we needed it to be scalable and secure. What we did is take the xrdp project, give them a couple of pull requests (we added the ability for you to copy into your remote desktop but not copy data out) and then quite simply wrap that in a container. We have a web interface when you first land on the platform that spins up a dedicated container for every user on demand, deploys it into our cluster, and Apache Guacamole then exposes that desktop into the user's browser. We're super happy with this. It was quite simple, it took us about a week to achieve, the performance is great, and every time we get new users we can just add more cluster resources. We're not paying for our own VDI server farm; we built one very simply using Kubernetes primitives, and this has been excellent.
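As a rough illustration of that on-demand pattern, here is a minimal sketch, using the Kubernetes Python client, of how a per-user remote desktop pod could be created at login. The image name, namespace, labels and resource numbers are hypothetical placeholders rather than the actual DAP configuration.

# Minimal sketch: spin up a dedicated remote desktop pod for one user on demand.
# Image, namespace, labels and resources are hypothetical, not ING's actual setup.
from kubernetes import client, config


def launch_desktop_pod(username: str) -> str:
    config.load_incluster_config()  # inside the cluster; use config.load_kube_config() locally
    api = client.CoreV1Api()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"desktop-{username}",
            labels={"app": "remote-desktop", "user": username},
        ),
        spec=client.V1PodSpec(
            containers=[
                client.V1Container(
                    name="xrdp",
                    image="registry.example.internal/dap/xrdp-desktop:latest",  # hypothetical image
                    ports=[client.V1ContainerPort(container_port=3389)],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "1Gi"},
                        limits={"cpu": "2", "memory": "4Gi"},
                    ),
                )
            ],
            restart_policy="Never",
        ),
    )
    api.create_namespaced_pod(namespace="remote-desktops", body=pod)
    # A gateway such as Apache Guacamole would then be pointed at this pod's RDP port.
    return pod.metadata.name

In a real deployment you would likely wrap this in a Deployment or a small controller rather than creating bare pods, but the idea is the same: one isolated desktop per user, scheduled onto whatever cluster capacity is free.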
OK, let's talk about the compute part of our framework. Here we think of Kubernetes in a slightly non-traditional way. The classic bread and butter of Kubernetes is to take a long-lived stateful or stateless service, monitor it for user load, auto-scale it as needed and make sure that service is always available, dealing with failover, redundancy, rolling upgrades, those kinds of things. We definitely need that: the front end I just described needs all of those concepts, and Kubernetes is great at them. But we also have a second challenge, which is the multi-petabyte analytics workloads. These are slightly different, in that they should scale up rapidly to all spare cluster resources, so that the workload can be computed as quickly as possible, and then scale right back down to zero to keep the cluster available to other users. What makes this even more challenging is that we need those two patterns to exist on the same cluster; as I mentioned, we don't want to be managing separate clusters for this. So we did a lot of work within Kubernetes on concepts like auto-scaling, quality of service and resource classes. One interesting thing is that this is a data science problem in itself, so we turned the platform onto itself: we record all of the scheduling data that the platform gives us, so we can see peak loads, we can see when different types of workloads come up, and we can look over time at the moments where the cluster maybe didn't perform as well as we would have hoped and keep adjusting our quality-of-service rules. It was quite a nice case of using the kind of technologies we are enabling to improve those technologies themselves.

Data Science in a Box is perhaps the main part of the platform, the really critical tool we have, because data scientists are our main audience. It is ultimately a place where ING data scientists can come to do their experiments. They are a valuable and scarce resource, and we found that on our old platforms they were spending up to 80% of their time not actually doing any data science at all. They were doing things that feel much more akin to infrastructure work, like installing their own JupyterLabs, or data engineering work, trying to work out complicated data structures and cleaning them up ready for their actual modeling. Data Science in a Box is purely aimed at speeding up this process. What it is, in fact, is a dedicated JupyterLab environment for each user. In the portal they have an interface where they can select everything they want about that environment: they can choose their language (R, Python, Scala), they can choose computational frameworks like Spark or Flink, and they can add any operating system packages or Python packages they need. That will then launch this dedicated, isolated environment for them, and they can do all the remaining work inside it without having to be infrastructure experts. It also attempts to have everything ready for them: all the data that the user is allowed to access, we can look up in our authorization system and pre-mount into the container. So they click the button, they get a JupyterLab, and all the data is right there for them.

Architecturally it looks like this picture. The user logs into the DAP portal and types their requirements into that interface. Behind it we have a quite lightweight API we call Callisto, which tracks the state of all the builds we have fired up in the past, so if a user has already requested this pattern we can just provide that image straight to them.
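To make the build step concrete, here is a minimal, hypothetical sketch of how an environment request from the portal could be turned into an image definition. The spec layout, base image and function name are illustrative assumptions, not the actual Callisto implementation.

# Hypothetical sketch: turn a user's environment request into a Dockerfile.
# The spec format and the internal base image are assumptions for illustration only.

def render_dockerfile(spec: dict) -> str:
    """Render a Dockerfile for a per-user JupyterLab image from a simple spec."""
    lines = [
        # A small set of hardened internal base images (certificates and monitoring baked in).
        f"FROM {spec.get('base_image', 'registry.example.internal/dap/jupyterlab-base:latest')}",
    ]
    if spec.get("os_packages"):
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            + " ".join(spec["os_packages"])
            + " && rm -rf /var/lib/apt/lists/*"
        )
    if spec.get("python_packages"):
        lines.append("RUN pip install --no-cache-dir " + " ".join(spec["python_packages"]))
    lines.append('CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser"]')
    return "\n".join(lines)


# Example request: a PySpark user who also wants geospatial tooling.
print(render_dockerfile({
    "os_packages": ["libgeos-dev"],
    "python_packages": ["pyspark", "geopandas"],
}))

In a setup like this, the build API only needs to compare specs, for example by hashing them, to decide whether an existing image can be reused or a fresh build has to be triggered.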
Perhaps the image is out of date and needs rebuilding for the latest vulnerability patches; then it will trigger a new image to be built for that user. Currently that is just sending commands into a GitLab pipeline, and we're also looking into Kaniko for this, for better scalability and security. The image is then loaded into Artifactory and the portal tells the user that it's ready, typically in about 20 seconds. At that point the user can select that image and describe their resource requirements: perhaps they need lots of CPU but not so much RAM this time, perhaps they need GPUs. In the future we also hope that you will be able to select the public cloud you want to use here. The platform then deploys that image into the cluster and forwards you directly into your JupyterLab web page.

Let's look into the storage element of the Data Analytics Platform. As mentioned before, we wanted one single framework for all of our storage needs, and it had to meet a whole host of different use cases. Obviously it needs to be a highly performant, highly scalable storage system for us to hold all the data we need, but we also need things like block storage, we need file system access, and we want cloud-native APIs as well. It also needs to be able to serve persistent volume mounts for Kubernetes. We found Ceph to be a really excellent fit: it met all those requirements and it is also blazingly fast. With the Presto SQL engine, if we're querying a table hosted on S3 in Ceph, we currently get about 10 billion records a second on a relatively complex SQL query. So when you have years and years of transactions from around the globe, you can get some really complicated analytics back in one or two seconds, just using simple SQL.

One area where Ceph did come up short for us, surprise surprise, is the security domain. Open-source projects tend to treat security as a bit of an afterthought, and the thing we particularly needed was fine-grained access controls. We want to be able to say precisely: this user, in this team, these are the models, these are the data points, these are the types of actions they're allowed to perform, at every single level in Ceph. So we built another open-source tool, which we call Rokku, and there's a link to it here if you want to check it out. What it does is intercept every S3 or file system command going into Ceph. It checks that the user has an OpenID token from our Keycloak server, our single sign-on identity provider, and it also checks authorization in Apache Ranger. If a user is trying to do something they're not allowed to do, or haven't been enabled to do yet, the Rokku proxy rejects the request. Rokku also adds the concept of short-term tokens, so we can give users time-based access to the system: if a credential were somehow compromised, it would only be active for a short amount of time before that token gets refreshed and replaced.
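From a client's point of view, a proxy like that is just another S3 endpoint. As a hedged sketch (the endpoint URL, bucket name and credentials below are placeholders, and the way the short-lived credentials are issued is not shown), accessing data through it with boto3 could look like this:

# Minimal sketch: talk to Ceph S3 through an intercepting proxy such as Rokku.
# Endpoint, bucket and credentials are placeholders; in the real platform the
# short-lived credentials would come from the platform rather than being hard-coded.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rokku.example.internal",  # the proxy, not Ceph directly
    aws_access_key_id="TEMPORARY_KEY",
    aws_secret_access_key="TEMPORARY_SECRET",
    aws_session_token="SHORT_LIVED_TOKEN",          # time-based access that expires quickly
)

# Each call is intercepted, authenticated against Keycloak and authorized
# against Apache Ranger before it ever reaches Ceph.
response = s3.list_objects_v2(Bucket="mortgages-belgium", Prefix="2019/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])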
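And going back to the Presto numbers quoted above: from a user's notebook such a query is just ordinary SQL sent to a Presto coordinator. Here is a minimal sketch with the presto-python-client, where the host, catalog, schema and table names are illustrative assumptions:

# Minimal sketch: query a table stored on S3 in Ceph through Presto.
# Host, catalog, schema and table names are illustrative assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",
    port=8080,
    user="data_scientist",
    catalog="hive",     # S3/Ceph-backed tables are typically registered via a Hive metastore
    schema="payments",
)
cur = conn.cursor()
cur.execute("""
    SELECT country, count(*) AS txn_count, sum(amount) AS total_amount
    FROM transactions
    WHERE txn_date >= date '2019-01-01'
    GROUP BY country
    ORDER BY total_amount DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)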
Lastly, let's look into the security and integration part of our toolkit. Security has been the single toughest challenge of the DAP platform, because we're really trying to reconcile two conflicting goals: we're trying to bring cutting-edge technology to a bank, but banking is one of the most heavily regulated and trust-sensitive industries on earth. We can't just bring in insecure technologies for the sake of doing data science; we always have to keep that security element.

The problem of security in an open-source world is actually a problem of proliferation. If I want just, let's say, three big data tools, Flink, Spark, Presto, whatever, I rapidly have twenty or thirty tools, because you have to look at the different authentication systems, back-end databases and config management systems, and there is not a great amount of overlap in a lot of these places. That is a security nightmare, because in this kind of environment all those tools need to be penetration tested regularly and monitored for vulnerabilities, and it becomes impossible to manage at a certain scale. This platform has taught me that my new favorite number is one. To maintain this platform we had to choose a single technology for each concern. We can't have multiple authentication systems, so we settled on OpenID Connect using Keycloak, and that meant most of our development effort actually went into open-source commits bringing OpenID support to technologies that don't necessarily support it. Another key example is Apache Ranger, which we use for saying what users are allowed to see. We needed that to be a single authorization tool, because we don't want to be asking "what can this user see?" and then looking through four or five different dashboards to work out what their permissions actually are. So, to stay in control, we made it one tool for each concern.

So how does it actually work? When a user wants to analyze a data set, they get access to parts of that data set for a certain period of time; we assume a role for the user. For example, it could be that one of our employees wants to analyze how competitive our mortgages are in Belgium, so that user has access to this particular data set for only a few hours. When all the analytics is done, they can persist the data in the underlying storage subsystem. The data is encrypted, separated in its own namespace in the persistent storage, and managed centrally by Rokku and Ranger.

How is that possible? We track the data from the source to the very end, to the report. That is one of the main goals of the platform: we have to know where this data is coming from and how it was used in the end. We are using tags, tags attached to the data, and with them we can do some smart things. For example, we know if a user is analyzing only public data, and public data can be made available to anyone. The tags are also inherited, so if someone analyzes a specific type of data, those tags follow the data to the very end, and we know what kind of policies have to be attached to the resulting data set. We couldn't find a proper tool in the open-source world that manages this; data governance is pretty much non-existent there. That's why we fell back on Apache Atlas, an open-source project, and we are very keen on building all our data governance on top of it.
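As a hedged illustration of that tag inheritance idea (the tag names, the classification scale and the function below are illustrative, not the platform's actual governance logic), propagating tags to a derived data set can be thought of as taking the union of the input tags and the strictest input classification:

# Hypothetical sketch of tag inheritance for derived data sets.
# The c1..c4 classification scale and the column tags follow the talk;
# the code itself is illustrative, not the platform's actual implementation.
from dataclasses import dataclass

CLASSIFICATION_ORDER = ["c1", "c2", "c3", "c4"]  # c1 = essentially open ... c4 = most restricted


@dataclass(frozen=True)
class DatasetTags:
    classification: str       # e.g. "c1"
    tags: frozenset           # e.g. frozenset({"public"}) or frozenset({"pii:email"})


def derive_tags(*inputs: DatasetTags) -> DatasetTags:
    """Derived data set: union of all tags, and the strictest classification wins."""
    classification = max(
        (d.classification for d in inputs),
        key=CLASSIFICATION_ORDER.index,
    )
    tags = frozenset().union(*(d.tags for d in inputs))
    return DatasetTags(classification, tags)


# Joining an open data set with one containing email addresses promotes the result.
grid = DatasetTags("c1", frozenset({"public"}))
customers = DatasetTags("c3", frozenset({"pii:email", "pii:date_of_birth"}))
print(derive_tags(grid, customers))  # classification becomes "c3", tags are combined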
I have to highlight that none of this would be possible without a perfect team, the great team that we have and work with. On the slide we've highlighted the nations the people building this platform come from; it would not be possible without them. Some of them are in this audience, and some of them are still working hard making the platform prosper. Even better, they are people with different skills: front-end and back-end engineers, ops engineers, and even people from non-financial disciplines who are helping us make it possible within a banking environment. We encourage you not only to join that team but also to join the open-source projects we are part of. Rob mentioned Amundsen, which we're building together with Lyft; please join us, and build something around data governance with Apache Atlas, because that is something that is pretty much missing in the open-source world. We have to make sure we analyze data in a responsible way, to make sure that what we are doing is good for society, good for humankind, and that we are doing it in a manageable way. So thank you for joining our session, I hope you enjoyed it. [Applause] Let's have a great party. Any questions from the audience? I can be the microphone guy.

Audience: You said that the Data Analytics Platform is open source. Can you tell us a bit more about that?

Yes, we plan to open-source the parts of the platform. Almost all of the components that we showed on the maps are open source individually. The platform as a whole we also plan to open source, creating something like "data science in a box" so that you'll be able to spin it up yourself. At the moment there are components that you can use and set up yourself, but it doesn't yet run as a whole; we plan to do this in the form of Helm charts, for example. We have a couple of components that are already available like that, and we definitely have it in mind to open-source the whole platform.

Audience: And the links to the separate components, where can I find them?

There was a link to Rokku, which is a good starting point, but if you're interested in a specific part, just reach out to us and we will help guide you to where you can find the proper tools. At the moment we don't have the full platform open-sourced, and we have some commercial components, just a couple, because we need container security and there's no good open-source container security platform that I know about. There are some products on the market, but not as good as the commercial offerings, so we have to choose wisely. If there are good commercial products, especially in terms of security, we use them; we are a bank, so that's a top priority.

Audience: So all the containers that you run from a user interaction are built automatically by the platform, and no user brings their own Docker image, for example. Is that correct?

Yes. Currently everything is built inside the platform, for a few reasons. Firstly, it fits our users' skill set: data scientists are not necessarily the kind of people who naturally build Docker containers left, right and center. We do, however, have data engineers and people like that also using the platform, and we do hope to support the ability to just submit your own containers. There are some complexities there. We ultimately have a small set of base images that have certificates loaded in and certain types of monitoring loaded in, so we need to build it so that you're essentially submitting something more like layers than complete images. And we obviously can't allow people to submit containers full of vulnerabilities onto the platform, so it's also about container vulnerability scanning, that kind of thing. As Rob mentioned, the base images that we have are stamped, pre-approved builds that developers can build on top of, with the integration with the whole enterprise layer, the certificate authority and so on, already done for them. That makes it easy for them to build on top of our image rather than relying on stuff from the internet.
Yes, and it's good for their local development and testing, but for productionizing things they should stick with our base images. Any more questions?

Audience: Did you notice any performance problems, given that every request goes through Rokku and gets authenticated? I've heard about Ceph integrations having performance problems between the user's request going in and finally reaching Ceph.

I'm not sure which stage you're talking about, saving the results or spinning up a new image, but no, we haven't seen any, especially not with Rokku. Rokku is a proxy, but it doesn't cache anything; it's basically stateless, scalable infrastructure. We've got multiple RADOS gateways set up behind it and it load-balances between them, and we haven't found any performance penalties on this. We fully utilize whatever we have on the network layer, so obviously if we had a faster network we could go even faster, but at the moment it's blazing fast.

I'd agree with Krzysztof: our bottleneck is the network bandwidth. I guess everyone at ING, for now, is astonished that they can query the whole history of ING transactions in a matter of seconds.

Audience: Maybe an extension to that: you can also integrate the RADOS gateway directly with Keystone and basically have the requests authenticated by Keystone for you, but the issue is that Keystone has absolutely no caching and is actually quite slow. So if you have a Keystone integration for Ceph happening there, I think that's what's behind the question, because if you put some serious load on the object storage, every request gets separately authenticated against Keystone. I don't know how it is with the Rokku mechanism.

The authentication requests are handled on top of Rokku, but we are not doing Keystone; we are using Keycloak, so OpenID Connect, and with a cache in front you don't have similar issues. Anyway, we haven't faced those problems. If we ever face performance problems with the storage, we are much more likely to look into Alluxio to create a caching layer, an in-memory storage layer; that is something we are looking into, but at the moment there is no such problem, and if there's no problem, just don't create one. Basically we are limiting ourselves: we don't want to pick up as many components as we could, we only pick up a component when it resolves a problem we actually have.

All right, okay, cool, thanks.

Audience: There is some overlap between Amundsen and Apache Atlas. Why do you actually use both?

So, Amundsen and Atlas. Atlas is a very good lineage tool in the sense that it is a registry of all lineage: I can query it with an API and it can store that at quite large scales. As a front end, though, it's not really user-ready. Amundsen is purely designed for your actual users, your data scientists, to search for their data. We actually added the Atlas back end into Amundsen; Amundsen on the Lyft side was using Neo4j, but now it has pluggable back ends, and we basically made a user-friendly UI for Atlas.
Audience: I'm curious about the GDPR compliance. You talked about the tagging; I don't know if you created the standard or are using your own internal standard to tag columns, to manage the composition of data and to follow that data through to its usage. How are you managing this?

ING has massive legal departments whose sole job is this problem. We do have internal standards, like the C classifications of data: data can be C1, which is essentially open, all the way up to C4, which means it needs to sit on encrypted hardware where no users can ever access it. Within that, one of our tags is the C rating. Also, all data that comes to us from the data lake has its columns tagged with a description of what type of data they hold, so if we have a date of birth or an email address in there, that column gets tagged, and that column tag is part of the lineage we track. So if I do a join and take that email address and add it to another data set, that data set gets promoted to being a C3 or C4 data set.

Okay, any final questions? Okay, we're going to be staying around here. Thank you very much and enjoy the party. [Applause]
