Berlin Buzzwords 2019: Shaofeng Shi–Accelerate big data analytics with Apache Kylin

Welcome to this session. My name is Shaofeng Shi, and I'm from Shanghai, China. Today the topic is how to accelerate big data analytics on Hadoop with Apache Kylin. A little about me: I have been an Apache Kylin committer and PMC member since 2014, when I was working for eBay in Shanghai. Now I'm the chief architect at Kyligence.

This is today's agenda. First I'd like to introduce the background of this project: what problem we were facing and what solution we proposed. Next I will introduce how Apache Kylin builds, stores, and queries OLAP cubes on Hadoop. I will also give a simple performance benchmark, and in the final part I will introduce several use cases.

The background is pretty simple. Today more and more data is collected by companies, and for processing big data, Hadoop is one of the best choices. But if you look at the technical components in Hadoop, you will find that most of them were developed for batch processing; they were not designed for low-latency SQL queries. That means as the data size keeps increasing, you will spend more and more time running SQL queries, which causes a worse and worse experience for your data analysts. So the challenge we were facing is how to keep high performance as the data grows. To be precise, high performance here means the query latency should be at the second to sub-second level, which is very comfortable for a human being.

We investigated the solutions in this domain. There are mainly two technologies: one is the massively parallel processing (MPP) solution, and the other is SQL-on-Hadoop. Examples include Amazon Redshift and Pivotal Greenplum, and also Presto, Impala, and Spark SQL. When we ran benchmarks on big data, we found that these MPP solutions couldn't work well. These solutions try to accelerate queries with the following methodologies. One is the shared-nothing architecture: they distribute the data across multiple nodes by certain keys, and then translate a single query into parallel processing on those nodes, to reduce the total latency. Another optimization is to convert the data into a columnar format with some kind of compression and vectorized execution. And to improve performance further, they cache the data in memory as much as possible. If you look at the total architecture, there are many bottlenecks, such as memory, CPU, and network.

So the MPP solutions have the following limitations. First, the performance couldn't reach very low latency; usually the latency is from tens of seconds to even tens of minutes. Second, they couldn't serve many concurrent users, because one query may exhaust all the resources in the cluster. The third problem is scalability: each time you add a node to rebalance the cluster, you will find it needs to redistribute all the data, which can take several hours to finish; and as you add more and more nodes to the cluster, the master node becomes a bottleneck, which limits the cluster from expanding to a larger size.

So what would a reliable OLAP solution for big data look like? First, it should be high performance, and it should serve many concurrent users. It should have good scalability, so that you don't always need to move your data. It needs to be ANSI SQL compliant, so that you can integrate it with other systems. And finally, the OLAP solution needs to be easy to use.

The solution we proposed is to develop a new OLAP engine on top of Hadoop. We call it Apache Kylin. This project originated at eBay. This diagram shows the position of Kylin: it runs on top of Hadoop and connects with your BI tools, your dashboards, and your reports. The core concept in Apache Kylin is the OLAP cube technology, which can support a very large data scale
from GB to TB and up to PB, while the overall query latency can be controlled at the second to sub-second level. The query language is ANSI SQL, so the user doesn't need to learn another language. We provide JDBC, ODBC, and REST APIs for you to integrate it with other tools. It has good BI integration, for example connecting from Tableau, from Cognos, from MicroStrategy, and other such tools. It provides a user-friendly web GUI for the analysts to self-serve, and you can integrate it with your user authentication system via LDAP and single sign-on.

When we talk about the OLAP cube: a cube is a data structure optimized for multi-dimensional data access, and its performance can be very fast. This technology has been widely adopted by data warehouse products over the past years. But if you look at the big data domain, there was no tool to build OLAP cubes there. So we were thinking: can we move the cube concept into the big data domain? If we pre-compute the cube and then query the cube for big data, what's the benefit? First, we get a benefit in performance, because you don't need to scan your raw data; the cube is aggregated and indexed, so it can be a thousand times faster than reading from the raw data. Second, it is cost-effective: once the cube is built, all the queries after that will be very fast, so you just need to run one comprehensive build, and the builds after that are incremental. Besides, cubes are very easy to learn, to understand, to create, and to update by your analysts, which means the data engineers don't need to be involved. Your concern might be: is it possible to build the cube for such a huge data set? You don't need to worry about that too much, because Hadoop is mature enough today; we can use it to do this processing.

This is the overall data flow in Apache Kylin. We separate the data flow into two parts: one part is the offline data flow, and the other is the online data flow, marked in blue and green. The offline data flow is the cube-building process. Apache Kylin generates the job steps to extract the source data from Hive, Kafka, or other systems into Hadoop, then runs MapReduce or Spark jobs against it, converts it to HBase format, and loads it into HBase. Once the cube is built, you will be able to query it with ANSI SQL. You can connect from your BI tools or dashboards to Kylin; Kylin will parse the SQL and translate it into cube visits. It no longer needs to touch your source data, so the online data flow is much faster.

Someone may ask: is the cube the same concept as the materialized view? They are similar, but they are different. One cube by default contains many cuboids: an N-dimension cube can have 2 to the power of N cuboids, and each cuboid is effectively a materialized view. This is a sample of a four-dimensional cube: there is one cuboid with all four dimensions, and then four three-dimension cuboids, six two-dimension cuboids, four one-dimension cuboids, plus one cuboid for the grand total. All these cuboids are built in one job, so the data in all of them is consistent. When you browse your data, rolling up or drilling down, Kylin will pick the best-matching cuboid to answer your query, so even if you drill up to a very high-level aggregation, the performance is still satisfying.

This page shows how Kylin builds the cube. We divide the cube building into several steps, and each step may be a Hive job, a MapReduce job, or an HBase job. Kylin will execute them one by one, automatically, so you don't need to write any code. The first step is to extract the source data into Hadoop; then Kylin will extract the dimension values from it to build the dimension dictionaries. Then it starts several rounds of jobs to build the cube gradually. By default we use the by-layer cubing process: it first builds the base cuboid, which contains all the
dimensions, and then based on it, it aggregates to the N-minus-one-dimension cuboids, and repeats this until all the cuboids are calculated. Finally, it converts the cube to the HFile format and loads it into HBase.

How does Kylin persist the cube in HBase? We know that a cube is composed of many cuboids, and each cuboid has dimensions and measures. A query is essentially a scan over a certain cuboid with some dimension values. So we combine the cuboid ID together with the dimension values as the HBase row key, and we put the measures, also called metrics, into the HBase column values.

How do we query the OLAP cubes in Kylin? To make the system easy to use, we use SQL as the query language. Kylin integrates Apache Calcite as the SQL parser and optimizer. Let's see an example. This is a typical OLAP query that selects two dimensions and two metrics from two joined tables; there is a filter condition, and finally an order by. This query will first be parsed into such an execution plan, which will be executed from bottom to top. Looking at this plan, the table join and the aggregation are surely the most time-consuming parts, which means if you execute this plan directly, the time complexity is at least O(N). How does Kylin handle it? Kylin pre-calculates the data into cubes, which means the table join and the aggregation have already been finished in the cube-building phase. So for this plan, we can rewrite it to start from the cube, apply the filter, and then sort. As we know, the cube is already pre-aggregated and indexed, so the cube visit has nearly constant time complexity, which is almost O(1).

We did a performance benchmark to compare Kylin with Hive, using the star schema benchmark. This diagram shows how much each query improved when changed to Apache Kylin. We can see each query was improved by at least 200 times, and the biggest speed-up was more than one thousand times. This comparison is somewhat unfair, because most of the computation has already been finished in Kylin at build time, but the user experience is very different. Another diagram shows that as the data increases, Kylin's latency is very stable, while for Apache Hive the latency increases as your data increases.

Kylin also has many advanced features. For example, when you have many dimensions, you can build a partial cube with some rules, or with an algorithm we call the cube planner. It also supports very high cardinality dimensions, such as user ID, cell phone number, etc., and it supports a precise count-distinct measure on such ultra-high-cardinality columns. It supports incremental data loads, so you don't need to refresh historical data when you load new data. It also supports using Kafka or an RDBMS as a data source. In a production environment you can scale it out to a multi-node cluster, and you can even separate the cube building and the cube querying into different Hadoop clusters, which we call read/write separation. Currently Kylin 3.0 is on the way; the main feature is real-time OLAP.

This is a summary of the best scenarios for Apache Kylin. The first scenario is dashboards, reporting, and business intelligence, migrating from legacy databases or MPP solutions. The second scenario is multi-dimensional data exploration: you can let your data analysts self-serve to find the value in the data. The third scenario is offloading a traditional data warehouse onto Hadoop; Kylin can help you accelerate those queries. Many users also use Kylin for user behavior analysis, for example to compute PV and UV, conversion funnels, retention analysis, and so on. And it can also be used for transparent query acceleration: the user doesn't need to be aware of the underlying engines and just sends standard SQL queries.

Okay, next let's look at several use cases. This is one of the biggest Apache Kylin deployments in the world; it is at Meituan-Dianping. Meituan-Dianping is the
biggest online-to-offline service provider in China. There are many business lines, including food delivery, hotel and travel booking, and also Mobike, the orange bike-sharing service, which is another business line in Meituan-Dianping; you can see a lot of those bikes on the street. Meituan has more than 300 million active users and over 3 million merchants, so you can imagine how much transaction data and user behavior data is collected. About four years ago they decided to build a central OLAP platform for the whole company. They did some investigation and selected Apache Kylin to serve thousands of their business analysts and their external partners. As of last year, there are more than 1,000 cubes in Kylin, and the total data fed into Kylin is more than a trillion rows; the cube storage is close to one PB. Every day they have more than three million SQL queries, and most of these queries are triggered by the analysts. Even on such a huge dataset and with so many queries, most of the queries can be answered in around one second.

The second use case is from Yahoo! JAPAN. Yahoo! JAPAN is one of the biggest portals in Japan; it provides email services, news services, and other internet services. They used to use Apache Impala as the query engine for data analysis, but its performance couldn't keep up as the data grew, so they migrated from Impala to Kylin. After this, most of the queries were reduced from around one minute to seconds. Also, as I mentioned, Kylin can be deployed across two Hadoop clusters, so they deployed the Kylin job servers in DC1 and the Kylin query servers in DC2. DC1 is a shared Hadoop cluster in America, and DC2 is in Japan, which is closer to their analysts. After Kylin builds the raw data into a cube, it can automatically push the cube to the query cluster. Normally the cube is much smaller than the original raw data, so with this setup they save a lot of bandwidth. You can read more about it in their tech blog.

The third use case is from Telecoming. Telecoming is a mobile payments company in Spain. They used to use MySQL and similar databases for their data analytics, but as the data grew, they started looking for new solutions, and finally they deployed Hadoop and Kylin to support their interactive queries. This solution has had very good results.

The next use case is from OLX Group; yesterday evening I had dinner with them and learned the whole story. OLX is a global online marketplace; the data team is important, and they have businesses in more than 40 countries. They used to use Amazon Redshift as the data warehouse, but as the data grew, Redshift couldn't scale, so they decided to migrate the data to S3 and build the analytics on top of S3. They tried a couple of solutions, including Redshift Spectrum and others, and finally they found that only Kylin could match their requirements. One major reason is that their business analysts are using Tableau as the dashboard tool, and the other engines couldn't match either the performance requirement or the BI integration requirement. They said Kylin is a game changer, with its extremely fast performance and its seamless integration with Tableau.

Here are some useful links you can use to learn more about Apache Kylin. We also have meetups; we hold them in China, in the US, and sometimes in Europe, mainly in Spain or in England, and in the near future we hope we can have a meetup in Berlin as well. If you need an enterprise solution, you can contact Kyligence.

At last, I have one demo; I believe a live demo will give a more comprehensive understanding of this tool. This demo was made by my colleague. The demo tries to show the difference in user experience between before and after using Apache Kylin. Okay, here we will first use Hive as the query engine on a
fairly large dataset. You can see that when you click a button, the page is just refreshing, refreshing, refreshing; the experience is very bad. Next, we built the cube for that dataset. Now, when you make any selection on the Kylin web GUI or in the BI tool, the page refreshes in real time. It can also be used as a backend engine for Tableau. Here we have a dataset with several billion rows; let's see how interactive it is in Tableau. When you drag some filters or dimensions and make some selections, the report or the chart refreshes very quickly. You can also define some calculated columns on your data and use them in your report. Okay, that is the end of the demo. Does anyone have any questions?

Okay, let's thank Shaofeng for his talk. Thank you. Now we have time for questions.

First question: I have a question about the persistence layer. So right now it persists to HBase, and that's all; I saw there's a plan to replace HBase with Parquet or something else. Is this going to be in the next release?

Yes, this is a good question. Currently in Apache Kylin the cube is stored in HBase, and in the meanwhile we are developing the Parquet storage. It is still in an early stage, and the release date hasn't been determined, because the performance still needs to be optimized; maybe it will come in another major release. In the commercial version, we have already replaced HBase with a columnar file format.

Next question: do you have any technology for identifying correlated dimensions? Each dimension is expensive; is there any way to analyze the data to say you don't need some dimension, or other ways of reducing the dimensionality of the cube based on the data?

Okay, let me confirm: are you asking whether Kylin provides any way to optimize the dimensions? Right. Actually, Kylin has a set of rules to optimize this. I didn't list them in today's presentation because they are a little technical, but let me give you some examples. First, in Kylin you can divide the dimensions into several aggregation groups; this avoids computing combinations across too many dimensions. Within each group you can define some rules. For example, you can define a hierarchy, such as country, state, city, and so on. You can also define joint dimensions: for example, an ID and a name that will always appear together. And you can define mandatory dimensions: for example, if in your dashboard or report some conditions are always required, you can set those dimensions as mandatory when you define the cube, so that Kylin doesn't calculate the cuboids that lack that dimension. So yes, we provide many rules for you to optimize the cube.

Another question: so Kylin works well with Tableau, but we tested with Power BI, the Microsoft solution, and it only supports the import mode, so we have to dump all the data to it. Are there other visualization tools supported, besides Tableau?

Currently Tableau is the most popular BI tool used with Kylin. For Power BI, as you just mentioned, it used to need to extract the data to its server and run on that, but recently Power BI released a mode called DirectQuery. DirectQuery means it can push down queries to the underlying engine, so it doesn't need to cache the data. We developed a driver implementing DirectQuery for Power BI, and the video I just showed is using that DirectQuery driver. Only with this can you do analytics on a big dataset. For the Tableau users, we also ask
our users to use the live connection mode, and only then let Tableau generate the SQL and send it to Kylin. Otherwise Tableau will send a select-star query to try to cache all the data, and then it will crash.

Do we have any other questions? Okay, then I will ask a question, because we still have time. When we add more columns to our dataset, the storage space needed grows roughly exponentially, right, because we have all these combinations of the dimensions, and right now you just store all of them in HBase. Following up on the question before, do you plan to do some compression or something along those lines to minimize the storage space, or do you plan to always save everything?

First, we know that if a cube has many dimensions, the number of combinations will be very, very large, and many of them are useless; you only need to calculate the valuable dimension combinations. Since Kylin 2.3 we have developed a component called the cube planner, which can parse your query history, collect the most frequently visited cuboids, and then optimize the cube. After that, the cuboid set will be much smaller than before, but the query performance can be kept at the same level. In the future we will keep optimizing in this way to make the cube more acceptable for the users.

Okay, if there are no other questions, we can thank Shaofeng again for his talk. Thank you.
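Editor's sketch: the cuboid arithmetic described in the talk (an N-dimension cube expands into 2^N cuboids; the four-dimension example gives one 4-D, four 3-D, six 2-D, and four 1-D cuboids, plus the grand total) can be checked in a few lines of Python. The dimension names below are invented for illustration; this shows the combinatorics only, not Kylin's implementation.

```python
from collections import Counter
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Enumerate every cuboid (dimension subset) of a full cube."""
    n = len(dimensions)
    # Each of the 2^n subsets of the dimension set is one cuboid:
    # the full set is the base cuboid, the empty set is the grand total.
    return [combo
            for k in range(n, -1, -1)
            for combo in combinations(dimensions, k)]

dims = ["year", "country", "category", "device"]   # illustrative names
cuboids = enumerate_cuboids(dims)
print(len(cuboids))                                # 16, i.e. 2^4

# Cuboids per dimensionality level, matching the talk's 4-D example.
by_level = Counter(len(c) for c in cuboids)
print([by_level[k] for k in range(4, -1, -1)])     # [1, 4, 6, 4, 1]
```

This also makes the storage-explosion question from the Q&A concrete: every dimension added doubles the cuboid count, which is why pruning rules matter.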
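Editor's sketch: the by-layer cubing process from the talk (build the base cuboid with all dimensions, then aggregate layer by layer until the grand total) can be illustrated in miniature. For brevity this toy version re-aggregates each cuboid from the raw rows, whereas Kylin's actual MapReduce/Spark jobs derive each layer from the previous, smaller one; the table and measure names are invented.

```python
from itertools import combinations

def aggregate(rows, dims):
    """Group rows by the given dimension list, summing the 'sales' measure."""
    out = {}
    for r in rows:
        key = tuple(r[d] for d in dims)
        out[key] = out.get(key, 0) + r["sales"]
    return out

# Tiny illustrative fact table.
rows = [
    {"year": 2019, "country": "DE", "sales": 10},
    {"year": 2019, "country": "US", "sales": 20},
    {"year": 2018, "country": "DE", "sales": 5},
]

all_dims = ["year", "country"]
# Layer 0: the base cuboid holds every dimension.
cube = {tuple(all_dims): aggregate(rows, all_dims)}
# Each following layer drops one more dimension, down to the grand total.
for k in range(len(all_dims) - 1, -1, -1):
    for child in combinations(all_dims, k):
        cube[child] = aggregate(rows, list(child))

print(cube[("country",)])  # {('DE',): 15, ('US',): 20}
print(cube[()])            # {(): 35}, the grand total
```

Because every cuboid is derived in the same build, all of them stay mutually consistent, which is the property the talk relies on for drill-up/drill-down.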
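Editor's sketch: the HBase row-key layout described in the talk (cuboid ID concatenated with the dimension values, with the measures stored as column values) can be mocked up as follows. This is not Kylin's exact byte format; Kylin dictionary-encodes dimension values to compact per-column widths, while this sketch uses fixed 4-byte integers purely to make the idea concrete.

```python
import struct

def encode_rowkey(cuboid_id, dim_ids):
    """Pack a cuboid ID plus dictionary-encoded dimension values into one key.

    Big-endian fixed-width integers keep all keys of one cuboid sorted and
    contiguous, so a prefix scan on the cuboid ID retrieves exactly that
    cuboid's rows.
    """
    return struct.pack(">I", cuboid_id) + b"".join(
        struct.pack(">I", d) for d in dim_ids)

# Suppose cuboid 9 covers (year, country); the values are dictionary IDs.
key = encode_rowkey(9, [2019, 49])
print(key.hex())  # 00000009000007e300000031

# The measure values (e.g. SUM(sales)) would be stored in the HBase
# column family under this row key.
```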
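Editor's sketch: the cube planner idea from the final Q&A answer (mine the query history, keep the frequently visited cuboids, prune the rest) reduces to a frequency count. The history and threshold below are invented; Kylin's real component works from its query metrics and also weighs cuboid cost.

```python
from collections import Counter

# Simulated query history: each entry is the cuboid that served one query.
history = [
    ("year",), ("year", "country"), ("year",),
    ("country",), ("year",), ("year", "country"),
]

hits = Counter(history)
# Keep only cuboids that answered at least 2 queries; the rest are pruned
# from the cube, shrinking storage while the hot queries stay fast.
keep = {cuboid for cuboid, n in hits.items() if n >= 2}
print(sorted(keep))  # [('year',), ('year', 'country')]
```

A pruned cuboid's queries are still answerable from a surviving ancestor cuboid, just with a little on-the-fly aggregation.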
