😲Big Data Analytics for Financial Services: 10 Ways You'll Get Addicted to SQL!



Our speaker this evening has over 20 years of industry experience and specializes in delivering complex analytical solutions to Fortune 100 companies — that's a mouthful. Please welcome Xcalar's Head of Solutions Engineering, Nikita Ogievetsky.

Thank you, thank you guys. So yes, my name is Nikita. Before joining Xcalar I was with various investment banks, so it has been quite a journey, but my heart was always with the product and engineering side. I was building systems and holding executive leadership roles, but I always wanted to move to a product company to make things right, because the product and vendor companies I interacted with in those roles all seemed to have a bit of tunnel vision here and there. Then, about two years ago, I met the founders of Xcalar and I was really blown away by the vision: how they want to democratize access to big data, and how they want to make very complex things simple and available. That matters, because today you need data and analytics to drive pretty much anything, and hopefully through this presentation I will be able to share some of that with you.

So I joined Xcalar about two years ago and I run the solutions engineering team here in New York. A few friends from our team are also here — Josh and Arif — and we will be staying after the presentation to answer your technical questions. Feel free to ask questions during the presentation as well; hopefully they will all be good ones. Today I will take you through why Xcalar is a great technology, what it enables, how it empowers you, and how it makes you more efficient and effective at what you do. I will also show you two demos covering two different use cases.

While putting this presentation together, I was thinking back to my years at Morgan Stanley, Goldman, and so on, and about how complexity has always kept growing. Around the beginning of the century there was a lot of messy data and files lying around, and at that point people reached for angle brackets: let's just put angle brackets around all the information. It was called XML, and it was going to solve all our problems — same architecture, same everything, just put angle brackets on everything. Of course it did not solve everything, though there were consultants like myself who really enjoyed it, and that sort of fun goes on. Then the latest trend was data lakes: at some point everyone was building data lakes. At one investment company I worked with there were hundreds of data lakes popping up, even where there may not have been enough data to justify them, each promising to solve all your problems. Now everybody is moving to the cloud — the new initiative that will solve all problems. But the complexity keeps growing.
We need some answers about how you can survive in this ever-changing world. So imagine, as was just mentioned, that you come out of college, or you are already in the business, and you are really good at SQL. SQL, as we know from the trivia, is about fifty years old as a technology, and it is still around — it is that good. So you know SQL, you know some scripting, and you really love doing analytics. What is happening around you is this: some data has already been migrated to the lake — maybe several lakes — maybe it sits in HDFS or Parquet, maybe in Hive, maybe in some other data-lake technology. Some of it is still in RDBMSs, in those SQL databases; some of it lives in Excel files, some in CSV files; some is available only through RESTful services, some only through HTML pages — structured, semi-structured, or unstructured. Some of it has already started moving to the cloud, some is still on-prem. You have all this mess, and then you come in and your task is: give me the insights, do some analytics, make something out of this data. So what do you do? You need to explore and profile the data; you need to make sure the data quality is good enough for the business; you need to apply business rules or implement complex business-logic transformations, maybe run some ML algorithms; and in the end, after you are done and happy, you have to operationalize it so that it runs consistently on a given schedule, or reacts to events.

So what is the answer here? The answer is Xcalar. Xcalar is a five-and-a-half-year-old company, and the core concept is the separation of compute from storage. Your storage can be anywhere: Azure storage, S3, Google Cloud Storage, HDFS, SAN or NAS, any RDBMS you name, plain files, or even unstructured sources such as PDFs, as well as XML and JSON. It can be in the cloud — Azure, AWS, Google Cloud — or it can be on-prem. The Xcalar engine sits on top of all of it and has native connectors to many of these storage types; where it does not, it makes it very easy to build connectors so you can bring data in from pretty much anywhere. Once you ingest the data into Xcalar — petabytes of it, if you have a sufficiently large cluster — you can run your transformations, and they all run in parallel. Moreover, on top of this engine you can build your own user-defined functions, and they too run at scale. You can write those user-defined functions in the language of your choice, Python being the most popular, and they run on ingest, on transform, and on export. So you can ingest from anywhere, apply any transformations you want, and export to any place — any data store, or even an action such as sending an email.
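As a rough illustration of the kind of user-defined function described above, here is a minimal sketch of a row-level Python UDF. It assumes the platform passes column values in as strings and applies the function row by row on ingest, transform, or export; the function name, field names, and FX table are all invented for illustration.

```python
# Hypothetical row-level UDF module: normalize a trade notional into USD.
# Assumption: column values arrive as strings and the engine calls this per row.

FX_TO_USD = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27, "JPY": 0.0067}

def notional_usd(quantity, price, currency):
    """Return quantity * price converted to USD; empty string on bad input."""
    try:
        rate = FX_TO_USD.get(currency.strip().upper())
        if rate is None:
            return ""  # unknown currency -> leave blank for downstream QC
        return str(float(quantity) * float(price) * rate)
    except (ValueError, AttributeError):
        return ""
```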
Xcalar offers a 10x improvement in developer productivity and time to value. These numbers come from some of our customers and have been validated by some of the large investment banks. It provides 3x to 10x cost savings, it scales out linearly, and it can process petabytes of data in seconds given a sufficiently large cluster. And because we sit on top of your storage, we can eliminate the need to copy data, and we also enforce strong transactional consistency. Any questions so far?

[Audience question: is this a virtualization layer?] Yes and no — it is much more than a virtualization layer. It is a layer that lets you run complex business transformations, and that is what I am going to dive into; but because we can sit on top of anything, from a compute perspective you could call it a virtualization layer. Our strongest point, though, is that we enable people to build super-complex transformations and dataflows, and that will be the subject of the rest of the presentation and the demos. [Question: so the data for those transformations can come from any source, and Xcalar is the compute engine?] Yes, exactly.

The title of the presentation is the ten ways you'll get addicted to SQL with Xcalar, and here is what we will cover today:

1. A super-intuitive IDE for ad-hoc SQL queries.
2. When you run a query, a button switches you to the interactive plan, which shows where things may go wrong — where you may lose data, where you may have skew, or other issues you can address.
3. You can run any SQL over any data from anywhere, as someone already mentioned.
4. You can build super-complex transformations with modular dataflows, and it is excitingly easy — this will be my main topic today.
5. You can extend SQL with user-defined functions or SQL functions: write your UDFs in Python or other languages, or build them with Xcalar constructs themselves.
6. You can integrate your data with BI tools such as Qlik, Tableau, and so on.
7. You can use your favorite ML libraries in UDFs or access them from Jupyter notebooks.
8. You can operationalize and integrate with change-data-capture tools or schedulers.
9. You can easily see resource utilization and cluster management from the administrative console — all of it made very easy and convenient so that you become really effective at what you do.
10. Data lineage across the entire data pipeline, which lets you see how things move from the columns in your sources through all the computations — complex formulas, user-defined functions, group-bys, unions, joins, everything.

I will touch on all of these and dive into the details of several. First, this is the SQL-mode interface where you run your ad-hoc queries. On the left is the SQL query you are writing, with type-ahead that makes it very easy to put together, pulling in your tables, attributes, and SQL functions — and it is fully ANSI compliant. On the right is the catalog of your tables, or the output of your data. At the bottom you can see the history of queries you have run and whether they failed. By clicking the View button you can jump from your SQL query into the interactive plan — the query plan for that query — and from there you can turn that plan into its own dataflow and edit it. You can see the skew, the row counts, and other statistics available there.
You can either modify the plan right from there or modify your SQL to make it run better, and you can also see which of these operators are the filters, the joins, and so on — the complex SQL statement has been translated into those basic relational operators. Answering your earlier question, this is one of the things that makes Xcalar compelling: it doesn't matter whether your data is in any of the clouds, in any of the lakes, in a forest where it's dark and you don't know where the trees are, or out in the mountain rivers — it gives you a way to find the path, drive forward, and deliver tasks on big data very efficiently.

Now I am switching to the advanced mode. In advanced mode you take your SQL statements, and those statements — or any individual relational operators such as joins, maps, filters, or group-bys — become the units of your dataflows. So the units of your dataflows can be individual relational operators or complete SQL statements, and as you move through your dataflow you can see what is happening at every step, which gives you complete transparency and control over what you do. Here you see a dataflow being built, and at the top you can preview your tables as you build them. This particular project was done for one of our customers: they had 20,000 lines of Java code over Spark that somebody had spent a six-month effort writing, and we turned it into these visual dataflows, where really complex hard-wired pieces become plain-vanilla SQL statements — very easy to author, very easy to troubleshoot, very easy to review as your transformations move along, as we will see in the demos. Comparing these modular dataflows with the original Java code: before, there were long development and troubleshooting cycles, a dependency on scarce expert programmers, no IDE and no ability to troubleshoot or debug what was going on, and a very long time to value. Afterward there was roughly a 10x performance boost — the batch that used to run for hours now runs in 15 minutes — and the work took two weeks instead of six and a half months.

Now moving on to number five, extensibility. You can extend your ANSI SQL with user-defined functions. Here a user-defined function that uses the Torch deep-learning library has been introduced, and you can use that function from your SQL as if it were just another SQL operator. Or you can build SQL functions, which are similar to parameterizable views, and use them from your SQL. If you have a favorite machine-learning library, you can integrate it right here, either inside your user-defined functions, as we saw with Torch, or you can access any structured data in Xcalar from your Jupyter notebooks, which are natively integrated. These are two screenshots from two different projects — one is Tableau, one is Qlik — and obviously anything else can be used as well to integrate with Xcalar published tables.
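To make the extensibility point concrete, here is a minimal sketch of what a Torch-backed UDF could look like. The model path, the feature names, and the assumption that the engine calls this function once per row with string arguments are all mine, not taken from the demo itself.

```python
import torch

# Hypothetical: a TorchScript model exported ahead of time and made available
# on every worker node. Loading once at module import keeps per-row cost low.
_model = torch.jit.load("/models/order_risk_scorer.pt")
_model.eval()

def score_order(notional, price, quantity):
    """Return a risk score for one order; inputs arrive as strings."""
    features = torch.tensor([[float(notional), float(price), float(quantity)]])
    with torch.no_grad():
        return str(float(_model(features).item()))
```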
Operationalization: besides running a dataflow from the IDE, you can execute it in batch — either by connecting to change-data-capture and streaming tools or to schedulers, or by building your own scripts and running them from the command line — and you can process micro-batches or simply run at a defined frequency. This is the administrative screen where, as your dataflow pipeline runs, you can watch the resource utilization of the nodes in your cluster. Here you can see that we utilize almost one hundred percent of the CPU, so the utilization of your infrastructure is essentially complete, which again gives you a lot of efficiency.
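As a rough sketch of the "run it from a script on a schedule" option just described: everything below is a placeholder. The `run_dataflow` stub stands in for whatever invocation your environment uses (an SDK call, a CLI, or a REST request), which the talk does not show.

```python
import logging
import time
from datetime import datetime

logging.basicConfig(level=logging.INFO)

def run_dataflow(name: str) -> None:
    """Placeholder: replace with your actual dataflow invocation (SDK/CLI/REST)."""
    logging.info("Triggering dataflow %s at %s", name, datetime.utcnow().isoformat())
    # e.g. a subprocess.run([...]) or an SDK client call would go here.

def main() -> None:
    # Minimal micro-batch loop: run the pipeline every 15 minutes.
    # In practice cron, a scheduler, or a CDC trigger would drive this instead.
    while True:
        run_dataflow("trade_surveillance_pipeline")
        time.sleep(15 * 60)

if __name__ == "__main__":
    main()
```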
And lastly, before we jump into the demos: data lineage analytics. In that same project, where twenty thousand lines of code were translated into SQL-based dataflows, we suspected there might be some inefficiency or duplication, because people had been writing those Java classes over years and years — probably different people, who no longer remember what is where. First, once you put it into a visual dataflow, you see right away where you are doing something that has already been done, or where you compute something that is never used later. But it is also quite useful to take an attribute and trace it, as is done here, back to its origins: these four attributes are traced back to the beginning to see where they came from, and you can see that two of them came from here and the rest came from about 74 other tables. That helps a lot when you are trying to understand a data-quality issue in production, or simply to understand things while you develop. You can identify the fields of interest so you don't bother with things you don't need; you can go from source to result, which is on the right-hand side; you can predict the impact of source changes — if something changes in a source, what will be affected — and you can remove unnecessary transforms. Any questions so far, before I move into the demos?

[Audience question about the working set.] Great question — the question is whether the working set is limited by memory. There are two modes: an interactive mode and a batch mode. In interactive mode, which we will see soon, as you are building your dataflows we keep the data of a few previous tables along with the metadata, so that you can roll back and go back and forth; that is a bit more expensive, and we also keep all of your attributes around, because we don't yet know what you will want to use next. But when you operationalize it and turn it into production mode, we only keep the attributes you actually need — we know what is needed from the dataflow itself — so a tremendous optimization happens, and at that point we become far more efficient. That is where the roughly 10x TCO reduction comes from: we need about 10x less infrastructure for the same performance.

[Question about the performance of that specific project versus Spark.] The differences vary, and I won't claim to be a deeper Spark expert than you, but what we learned from running benchmarks against Spark is that with multi-key joins, with left outer joins, and in a few other cases Spark behaves either a bit unpredictably or simply not as well; the same goes for certain filters over non-indexed columns, where things need to be indexed or repartitioned all the time. None of that applies to Xcalar, and that is where the roughly 10x performance gains in some of our projects come from. There is no magic: Xcalar is not a MapReduce-based technology, it is a relational-algebra-based technology that scales to hundreds of nodes and thousands of users.

On the data-quality question: the reason we used lineage in that specific project was that there was a bunch of transformations where you don't know whether they are used or not — you spend time re-implementing something, but then what happens to it? The codebase ends up riddled with inefficiency, and the same transformations may run several times; when you look at the lineage you can see the same thing happening again and again, while some of it has simply died out. Data quality itself is a different story, which I will touch on in the next demo, where data-quality monitoring is driven by business rules. The key idea is that the business rules are metadata, which is just another kind of data to this system.

[Question: is Spark underneath?] No — the data and the compute are separated, and we do not use Spark inside. We can plug into HDFS, into Hadoop, into pretty much anything, and we can read Parquet and so on, but this is not Spark-based. [Question about why the big database vendors haven't built this.] That is why the company was founded. There is a reason the big players do not simply take their Oracle or Microsoft SQL Server environments and scale them out onto a cluster: you can run Oracle clusters, you can run SQL Server clusters, but they just don't scale to hundreds of nodes, even with billions of dollars behind them — it is a completely different architecture. If there are deeper architectural questions, we can get together afterwards. [Question about shuffling data between nodes.] Even in Spark you still transfer data, because the data sits on the HDFS data nodes and your processes do not run on every node where the data resides; you still have to bring the bits and pieces of your Parquet files together and assemble them. And, for what it's worth, we have beaten Exadata with some customers who were considering us as a replacement for their Exadata clusters. So the next question is: how do you measure the number of nodes and the memory footprint that you need?
Before even measuring it: if you have a hundred-node Hadoop cluster, for example, and you run your Spark jobs on that cluster, and now you want to bring in Xcalar, you will probably need — depending on the complexity of your transformations — maybe a two-node cluster, maybe a three-node cluster, maybe a ten-node cluster. It is not proportional to the data; your data can be as huge as you want. The sizing of your Xcalar cluster is proportional to the complexity of your transformations and to the amount of data those transformations actually need. We do not need all of the data to be in Xcalar — you only need to ingest what the compute requires, and we can talk through the details afterwards. For this example, the customer has a thousand-node Hadoop/Spark cluster, and they implemented only a three-node Xcalar cluster; three nodes is enough to run all of the complex transforms they run over that data.

[Audience pushback: moving data over the network is itself the source of inefficiency, and the system that manages the data should push execution down to wherever the data resides and bring results back.] Let's not go too deep into that before the demos — but even in Spark you still transfer data, because the data sits on the HDFS data nodes and your processes do not run on every node where the data resides; you still end up shipping data around and assembling it. If the data resides in many different systems, then yes, loading it somewhere to transform it has a cost — happy to take that discussion offline. The evidence is available if you want it: we are roughly 10x more efficient than Spark doing the same thing on the same infrastructure. I understand that sounds surprising, and we can walk you through the diagrams and the architecture if you are interested.

Anyway, to your point, in this example the bulk of the data comes from the data lake, some business rules come from a JSON file that sits on a file system, and some reference data comes from Postgres. The use case here is an emulation of a post-trade surveillance system.
What is being watched for is client favoritism: you want to make sure that clients who placed their trades later did not get executed earlier than clients who placed their trades before them. That is the definition of the problem. The computations feed from large tables of orders, inquiries, and trades — I think about ten million records in each — and a bunch of pre-calculated measures are computed, such as, based on the direction of the orders, the notional amount and price, the total notional in the same direction, weighted averages in both directions, the basis-point difference in price, and other attributes. The business rules are then read in at runtime and applied as a user-defined function, and the alerts are generated.

The visualization here is Qlik Sense, and we leverage Qlik Sense's ability to build on-demand dashboards: you have one summary dashboard where you can navigate and see the high-level alerts, and when you see something weird, strange, or interesting, you can click through and it will build you a new dashboard in real time that drills into just those records. Here you see how the sources are structured: on the left are the data connections — the inquiries, orders, and trades; the middle section is the reference data, which comes from the Postgres relational database; and here come the business rules in JSON. It looks like a very simple, flat plan, but underneath there are SQL statements and some complex logic, and in this map transformation — the basic relational map operator — the function that was generated at run time from the rules is applied to derive the alerts. Lastly, the resulting tables are exposed and published into the BI tools. So what we will see is how we convert the JSON into a Python UDF and then apply it. There are two screens: one is client-facing — watching client exposure through the aggregate notional of flagged trades, the number of fraud alerts, and the breakdown by alert type — and the other is, for example, the trader-facing screen, so that trading supervisors can identify potentially suspicious activity, see which traders look worse than others, drill into the trades of a specific trader for a given day, and then look at all trades of that trader. Now let's switch to the demo.
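Here is a minimal sketch of the pattern just described — business rules kept as JSON "metadata" and applied as a Python UDF that flags alerts. The rule-file layout, field names, and thresholds below are invented for illustration; the talk does not show the actual rule schema.

```python
import json

# Hypothetical rule file, e.g.:
# {"rules": [{"measure": "bps_diff", "op": ">", "threshold": 5.0,
#             "alert": "CLIENT_FAVORITISM"}]}
with open("/rules/favoritism_rules.json") as f:
    _RULES = json.load(f)["rules"]

_OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b, ">=": lambda a, b: a >= b}

def evaluate_alerts(bps_diff, total_notional_same_dir):
    """Map UDF: return a comma-separated list of alert names fired for this row."""
    measures = {
        "bps_diff": float(bps_diff),
        "total_notional_same_dir": float(total_notional_same_dir),
    }
    fired = [
        r["alert"]
        for r in _RULES
        if r["measure"] in measures
        and _OPS[r["op"]](measures[r["measure"]], r["threshold"])
    ]
    return ",".join(fired)
```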
Here you see the dataflow, and you can see how all the sources are connected — in the second demo we will see how those connections are created. You can see that this node is linked to a JSON file, and you can review the result of reading it right here. The transformation of that JSON file is what generated this code: it was generated by a user-defined function, so one user-defined function generates another user-defined function, which I can then apply in a map. Then comes the SQL. My SQL statements are pretty basic — this is an example of a SQL statement where you define your inputs and then write a query that joins, or does whatever you need. Here, for example, is an even simpler one: with common table expressions you compute all the measures you need for your alerts, and then here is where you apply the alert logic. This is just an example of how you could do it, not the only way. And here you can see the alert that was computed — the same one I showed you on the dashboard.
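To give a feel for the shape of those statements, here is an illustrative CTE-style query of the kind described above, kept as a plain string so nothing platform-specific is implied. The table names, columns, and the `evaluate_alerts` UDF call are assumptions carried over from the earlier sketch, not the demo's actual schema.

```python
# Illustrative only: compute per-trade measures in a CTE, then apply a rule UDF.
SURVEILLANCE_SQL = """
WITH measures AS (
    SELECT
        t.trade_id,
        t.client_id,
        t.trader_id,
        t.side,
        t.qty * t.price AS notional,
        SUM(t.qty * t.price) OVER (PARTITION BY t.client_id, t.side)
            AS total_notional_same_dir,
        (t.price - o.order_price) / o.order_price * 10000 AS bps_diff
    FROM trades t
    JOIN orders o ON o.order_id = t.order_id
)
SELECT m.*,
       evaluate_alerts(m.bps_diff, m.total_notional_same_dir) AS alerts
FROM measures m
"""
```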
[Question about languages.] We support other languages as well, but Python is the most predominantly used. The language here is SQL, and the user-defined functions — which we will spend more time on in the second demo — are all built in Python. [Is the SQL ANSI? What about procedural SQL?] Yes, it is standard ANSI SQL, not a proprietary dialect. We are also going to support PL/SQL; right now you can handle that kind of procedural work through the Jupyter integration, and PL/SQL itself is coming by the end of the year. [Question about what happens when too much is happening at once.] Let me start with paging: the cluster never crashes. We support paging, so you can tune your cluster to page when it needs to. If your volume suddenly increases exponentially, you can turn paging on; the cluster will start paging, but the dataflow still completes.

Now let me show you quickly how the interactivity works. Suppose I am the trading supervisor and I really want to understand what is going on with this trader, Steven Shear, who has the largest number of alerts in the organization. I zoom in to the day when this happened, and this is the way Qlik's on-demand generation works: Qlik goes to Xcalar and picks up the data, so instead of loading the tens of millions of records that sit in Xcalar, it first brings only about a thousand records into the summary dashboard, and then, on demand, the framework lets me pull just the records for this particular trader on that particular date — two to three thousand records in about four seconds. Now I can launch the dashboard, and this is a freshly generated app; when we connect from Tableau, by contrast, the query simply runs interactively. So this is a brand-new dashboard built on demand, and I can zoom in on the client with which this trader has the most similar alerts — the customer toward which this trader seems to show favoritism. Now let's generate another one: I can delete this app and generate a new dashboard; again Qlik goes to Xcalar, picks up a couple of thousand records in about four seconds, and I can launch the new dashboard. So it lets me build very lightweight dashboards on demand without overloading the Qlik infrastructure — or Tableau, if that is what I am using — in an interactive fashion. Any more questions?

[Question about what is doing the work here.] This is the Xcalar IDE, and the SQL is being processed by this two-node Xcalar cluster. Xcalar is the engine — a relational-algebra-driven engine that processes the queries. It is the query engine, and the data can reside anywhere you place it. As for the different ways of interacting: the most common is that you interact with the data on ingest — you point at the data sources you want to bring in, using the native integrations. Then, once you start interacting more, there are two main ways. One is what I am doing right now, building an interactive dataflow in the IDE. The other is to write SQL: I can write ad-hoc SQL right here from the Xcalar interface, or a notebook can generate SQL, or Qlik can generate SQL, or some other BI system can generate SQL. I will show you a part of the demo where a web application issues SQL queries over an ODBC connection against the same published tables. So you can interact with Xcalar through the SDK, through ODBC or JDBC, or through the RESTful services; we ship ODBC and JDBC drivers, including versions used by Spark and some of the open-source tooling, and maybe the next part of the demo will answer several of your questions as well.
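Since the published tables are reachable over ODBC, a client as simple as this can query them — a minimal sketch assuming a configured ODBC DSN named `xcalar` and a published table called `alerts_summary`, both of which are placeholders rather than names from the demo.

```python
import pyodbc

# Assumes an ODBC driver and DSN for the cluster are configured on this machine.
conn = pyodbc.connect("DSN=xcalar;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Ordinary ANSI SQL against a published table (table/column names are made up).
cursor.execute(
    "SELECT trader_id, COUNT(*) AS alert_count "
    "FROM alerts_summary GROUP BY trader_id ORDER BY alert_count DESC"
)
for trader_id, alert_count in cursor.fetchall():
    print(trader_id, alert_count)

conn.close()
```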
Let me jump to the second demo and start from the SQL mode — this is actually what I was going to do at the end. This is the SQL mode, where I have two tables published in this workbook — it is a tiny sandbox — and I can run SQL against those tables; we will come back to it at the end of the demo. Let me get to the main point: here the data comes from four data sources. Two of them are Excel spreadsheets, one is a bunch of CSV files, and one is actually a RESTful service. We built the connector for that last one — it is a custom connector, but it is very easy to build connectors like this to bring data in from pretty much anywhere, including RESTful endpoints. The use case is running natural-language processing to extract data from a Q&A board, processing the collected data to extract knowledge graphs, and then visualizing those knowledge graphs. Where did it come from? Some of my friends working at one of the large investment banks run the legal department, and they needed to process data from multiple policies spread across multiple websites in a semi-structured way, and then be able to explain who could be affected when one of the policies changes — you really need to be able to do that kind of root-cause analysis. This was just a demo implementation to show how that could be done.

So let me show you this Discourse forum — it is a bit like Stack Overflow; it is the Xcalar Discourse website where you can go for information about Xcalar or ask questions and get them answered. It is organized into categories, topics, and posts, with posts being added in response to your questions. Here is the start page of the Xcalar Discourse: these are the categories you can drill into. Say you go into Operators — Operators is the category, the topics underneath are the individual operators, and you can drill through them. Now let's take the same thing, go into Xcalar, and create a new data source. I have a few options; I select the Discourse data source, paste in the root URL, click Browse — and here is the list of my categories, and within each category a list of topics. I can choose one category and ingest it. The out-of-the-box result set is not that telling, but I have a user-defined function for it — it is right here, and you can see a bunch of different Python modules being loaded. This load-posts user-defined function is what I use to ingest the JSON from Discourse; it is about 20 lines of code. Let's select that user-defined function — it is called load discourse posts — hit refresh, and here we go: from one of these topics came three posts, so three rows, and there are more topics in the selected category. Now I can say "create table", and the table gets created, and I can work with it as if it were just a tabular form and run SQL against it.
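For flavor, here is roughly what a ~20-line loader like that could look like. The Discourse endpoints used (`/c/<...>.json`, `/t/<topic_id>.json`) follow the public Discourse API, but treat the exact paths, field names, and the idea that the platform calls `load_posts` with a category URL as assumptions made for illustration.

```python
import requests

def load_posts(category_url):
    """Yield (topic_id, topic_title, post_html) rows for every topic in a category.

    category_url is expected to look like https://discourse.example.com/c/operators/5
    """
    topics = requests.get(category_url + ".json", timeout=30).json()
    base = category_url.split("/c/")[0]
    for topic in topics["topic_list"]["topics"]:
        detail = requests.get(f"{base}/t/{topic['id']}.json", timeout=30).json()
        for post in detail["post_stream"]["posts"]:
            # "cooked" is the rendered HTML body; downstream steps strip the tags.
            yield topic["id"], topic["title"], post["cooked"]
```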
Going back to the dataflow: here is a basic SQL statement that brings it all together, and running this SQL for the first time — pipelining it — is part of the dataflow I am developing; as I type it up, it gets executed. Let me reset it and start executing the plan. First, when I execute this SQL, I pull data from those data sources, send the raw text to BeautifulSoup to strip all the extra markup out of the phrases, and build this first structure, which is assembled by merging data from the Excel files, the CSV files, and the RESTful Discourse website. Moreover, as you see here, I can build custom nodes that I can then share with my business users. Again, this is all about efficiency — how you can make things super efficient. There were a lot of questions about performance, but besides the performance, one of the strongest sides of Xcalar is that it really empowers you to do very complex things in a very simple way, and it facilitates collaboration: the engineers can start building libraries of user-defined functions, libraries of custom nodes, and libraries of reusable dataflows. For example, here is a sentiment node: the only thing I need to do in order to run the sentiment analysis is bring in this sentiment node. Inside, it is really just a map function that calls a UDF called get sentiment — and that could be backed by anything: any ML library, any NLP library, any other library you want to bring. Here it is the function that computes the sentiment by calling VADER from NLTK, and similar functions — get nouns, get stems, and so on — are invoked from these other nodes here.
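A sketch of what that get-sentiment UDF might boil down to, using NLTK's VADER analyzer as mentioned in the talk; stripping the HTML with BeautifulSoup is included since the pipeline passes the raw Discourse posts through it first. The function and column names are assumptions.

```python
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time setup per worker: requires nltk.download("vader_lexicon") beforehand.
_analyzer = SentimentIntensityAnalyzer()

def get_sentiment(cooked_html):
    """Map UDF: strip the HTML from a post body and return its compound sentiment."""
    text = BeautifulSoup(cooked_html, "html.parser").get_text(separator=" ")
    return str(_analyzer.polarity_scores(text)["compound"])
```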
Once you have the sentiment, you do part-of-speech tagging, extract the noun stems — the roots of the nouns in the phrases — then bring in the common stems so you can get rid of the noise; and here we go, the corpus finally gets built. At the end, after I am done with that (I have run it a number of times, so it goes fast), I extract the similarity. I also wanted to show how I can extend ANSI SQL by running UDFs right from SQL: as you can see in this SQL statement, I am looking for the similarity between different words, and it is calling the NLP similarity UDF, which I built in Python; I can then use this similarity to implement a TF-IDF-style algorithm to find the nouns that are related to each other. I will not go into the details of this particular semantic-graph standard because we are running out of time, but basically the topics have names, occurrences in various resources, and associations, and the associations can be of different types. The way I structured the dataflows here: these are the associations, these are the topics — similarity associations, mentions, synonym relations, co-occurrence associations — and these are the names and occurrences in other resources.

Now let's see how it all works. Let's go to the temporary tables. What is happening here, as opposed to the previous example where Qlik was issuing the queries against Xcalar, is that this tool — which a colleague put together in just a few days — is querying Xcalar directly and bringing in all the neighbors. As I drill into the neighbors, it runs the same query we just looked at, only issued from an external web server, with the root and the other parameters substituted. So what do we see? We see that this temporary table is related to the dataflow graph, which makes sense: temporary tables are the tables created in between, which will be dropped when I operationalize, but which I need in order to build my graphs and to go back, undo, and redo. They are related to some active tables, and I can drill through to an active table; now the focus is on that active table, another query is issued, and I can see that this active table is mentioned positively — the green means a positive mention — and if I click here it launches the page where it was mentioned. That basically gives me a way to navigate, drill, and find the relationships between things.

Because we are running out of time, in closing I want to mention: if you have any questions, please don't hesitate to reach out, and if you are interested in more demos, don't hesitate to reach out either. We are also hiring — we are growing here in New York and we are looking for good folks, especially those with experience in Spark and similar technologies, and in SQL. If you have any questions I will be happy to answer them as well. We are going to raffle off a pair of Beats headphones — everyone who dropped in a card has a chance to win. The winning card is drawn: Christopher Beer, that's you — congratulations. Nikita will be here for one-on-one questions, so feel free to come up right now and ask them. Thank you all for coming. [Applause]
