Rethinking Business: Data Analytics With Google Cloud (Cloud Next '19)



Thank you for coming to the session. I know it's right after noon, so I'll try to make sure I don't let you fall asleep. We have an exciting set of product demos, and a couple of our customers will be sharing their stories about how they're using the platform. Did you get a chance to see the smart analytics demo in the keynote this morning? Good. The key goal here is to take that smart analytics theme, give you more context on the product announcements we're making and what we're launching, and give you all the details you need.

But before that, let's talk about what's changing in the world. Industries across the board are changing. If you looked at the automotive industry ten, fifteen, twenty years back, it was a very different industry. Now, organizations like Cruise Automation are collecting real-time information from all of their vehicles and making real-time decisions at massive scale for self-driving cars. This ability to create these large amounts of data and make decisions on them is super important for organizations.

The second key thing is democratizing data insights within the organization; it's becoming more and more critical. AirAsia is another great example. They were here with me at this session last year, and they shared that they've saved roughly five to ten percent of their operational costs. For an airline, that's a pretty large number. This was possible only because they were able to take all the data and insights within the organization and make them available to all of their users doing different kinds of activities. So it's critical not just to have infrastructure that can scale to massive amounts of queries, but also to make those insights available to everybody within the organization.

And then there's another aspect. Customers across many industries, like the Broad Institute, which was founded by MIT, Harvard, and the Harvard-affiliated hospitals, are producing around 12 terabytes of data per day and leveraging cloud computing infrastructure to sequence a genome every 12 minutes. It's interesting to see how organizations across different industries are using cloud computing, and especially our data analytics platform, to take massive amounts of data, derive insights from it, and make decisions at every point.

We're seeing momentum across all industry verticals and all regions with the platform. A few interesting facts I wanted to share with you today: BigQuery now manages more than an exabyte of data for our customers; our largest customer now has more than 250 petabytes of data in a single data warehouse; and last year we saw roughly 300 percent growth in data analyzed on the platform. It's fascinating to see all this growth, with organizations across the world leveraging the platform and the insights they can gather from Google Cloud.

So let me talk a bit about our philosophy around our investments in the analytics platform. Our main goal is a theme we call radical simplicity: deriving insights from data needs to be so simple that anybody within an organization can do it.
And how do we make that happen? The most important thing is investing in serverless: you should be able to bring any data, put it into Google Cloud, and start analyzing it without having to worry about infrastructure. The second is providing a comprehensive solution that covers the end-to-end lifecycle of your data management. Then, embedding ML: not just using ML to improve our products, but making ML available to everybody within the organization. We're also firm believers in an open cloud, in making the different open-source components available for you to run at scale within our environment. And finally, all the enterprise capabilities you expect us to have in the platform are super important.

A quick visual on the serverless data platform. In traditional platforms, you would have to spend time figuring out your capacity requirements: how many servers you're going to need, the provisioning, the monitoring; so much goes into that. Our key point is that you shouldn't have to worry about any of it. That's all managed by Google; we take care of it. You just bring as much data as you need and start analyzing it from there.

From a platform perspective, I know there are a lot of logos here, and I'm not going to go in depth on every one of them, but for the end-to-end lifecycle we have services that let you ingest data. That can be real-time streaming at scale with Pub/Sub, where you can collect millions or billions of events per second; there are services for transferring data from on-premises systems and different SaaS applications; and we have a service for IoT data coming in. So all the ingestion services are available to you. Then, for real-time and batch processing, we have Dataflow, our streaming engine, which lets you do batch and streaming with a single API. With Dataproc you get managed Hadoop and Spark environments. Dataprep lets your data analysts do data wrangling. In data warehousing you have BigQuery, you can use Cloud Storage for massive amounts of unstructured data, and on the advanced analytics side you have our Cloud AI services. In addition, we have Cloud Composer, managed Airflow, for workflow orchestration. And we're announcing two new services you heard about today: Cloud Data Fusion, which I'll say more about, and Data Catalog. That completes the whole portfolio we have.

With that, let me share a few things about the different scenarios. When we talk to customers, there are three main scenarios they leverage the platform for. Thomas mentioned these earlier today, and I'll take them to the next step and give you more details. One is modernizing data warehouses, making them broader than just data warehousing for reporting and dashboarding: intelligent decision making, predictions, and things like that. The second is taking the large-scale Hadoop clusters customers are running on-premises and moving them into the cloud, to get much better TCO but also the scalability the cloud can provide. And the third is streaming analytics. By 2025, I think more than 25 percent of the data generated will be in streaming form, and as industries change, you'll need capabilities to collect this streaming data and make real-time decisions on it in whatever application you're in. That is super critical, and it's what customers are using the platform for.
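On that streaming point, ingestion typically starts with Pub/Sub. Here is a minimal sketch of publishing events with the Python client library; the project, topic, and payload fields are illustrative placeholders, not taken from the talk.

```python
# Minimal sketch: publishing a stream of sales events to Cloud Pub/Sub.
# Project, topic, and payload fields are illustrative placeholders.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "transactions")

event = {"person": "alice", "item": "book", "state": "TX", "amount": 12.5}

# Pub/Sub carries raw bytes, so the payload is JSON-encoded before publishing.
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message", future.result())  # result() blocks until the server acks
```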
Beyond those scenarios, we heard a lot from our customers about breaking data silos, making it easy to get data into the platform, and protecting and governing that data, so we have solutions for those as well.

So let's talk about the first thing. Earlier today you saw a demo of Cloud Data Fusion, which is our fully managed, code-free data integration service. The whole idea is to make bringing data to GCP super easy for our customers. Data Fusion is based on an open-source project called CDAP, and it gives you a visual tool where you can drag and drop, picking from the hundreds of connectors we've already built for on-prem systems and different applications. You can then transform the data as it comes in and publish it into any of the data stores we have: it could be BigQuery, Cloud SQL, or any of the other destinations available. The goal here is simplifying migration of your data to the cloud, transforming it as it comes in, and making sure you have a single place to manage all your data pipelines. Finally, it lets you do visual transformations, track lineage on the data coming in, and apply data quality on top of it. This is one of our big releases for this Next; it's available in beta, so you can go ahead and use it. You saw some of the demos earlier today.

There are two other things we have in the same realm. If you've used BigQuery, it has connectors for our first-party services, like AdWords and DoubleClick, that make it easy for customers to bring data in from those applications, put it into BigQuery, and analyze it. We have extended that to our partner ecosystem, so I'm happy to announce we now have more than 135 connectors, I think, across different applications: Salesforce, Marketo, Adobe Analytics, Facebook analytics, Workday, all the different SaaS applications. You can start using those in your BigQuery environment. And the third thing: we know there's a big challenge in migrating traditional data warehouses running on-premises, say Teradata, or Redshift if you're using that. We now have tooling available to easily migrate those to BigQuery. So that's the key theme: providing all this tooling to make it easy to bring data into GCP, so that you can start leveraging the other capabilities we already have.
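As a rough sketch of what configuring one of those transfers looks like from code rather than the UI, the BigQuery Data Transfer Service also has a Python client. The data source ID and params below are hypothetical placeholders; real values depend on the specific connector you enroll in on the marketplace.

```python
# Sketch: creating a transfer config with the BigQuery Data Transfer Service
# Python client. The data source ID and params are hypothetical; actual values
# come from the connector's marketplace listing.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="salesforce",         # BigQuery dataset to load into
    display_name="Salesforce daily load",
    data_source_id="example_salesforce_source",  # hypothetical connector ID
    params={"api_user": "analyst@example.com"},  # connector-specific settings
    schedule="every 24 hours",
)

created = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("created transfer:", created.name)
```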
Customers want us to be able to help them understand their business better; they don't just want us to do their banking. Our employees' expectations are changing as well: they'd like us to provide them with relevant data and insights so that they can make smart decisions, in a timely manner. To do that, we need to look at digital transformation, and a key part of that digital transformation is data.

Got it. So can you share some use cases, with BigQuery or other things you're using, so that we can get more insight into what you're doing?

Yeah, absolutely. BigQuery has been one of the key tools our data scientists use on a daily basis, and it has helped us a lot in terms of scalability, handling those heavy computational queries on top of different data sets. I'll give you a real story. Some of our data scientists were using customer transaction data to build aggregated, de-identified insights for our institutional clients, for them to understand their customers better: analysis like what your loyal customers look like, where they live, and who is lapsing from your business. We were analyzing billions of transactions; back then it was around 17 terabytes for a single table, and it literally took five days to extract the data and get insights from the data set. That was quite costly for us in delivering insights to our clients, and it also limited our data scientists in continuously developing new insights and adding innovation to the data set. By moving that full pipeline to BigQuery, we successfully reduced the time from five days to about twenty seconds to get the insights, which is a big achievement for us. It not only enhanced efficiency for our data scientists, it also let us start rethinking data science processes in the organization. Our data scientists now meet our clients directly, rather than authoring queries sitting in the back end; they take their insights to the clients and take direct feedback from them. We even conducted a customer-led design workshop to gather customized insight requirements, to support our clients better.

The reason we can take on customized insights is the confidence that BigQuery can handle the heavy computation on the back end. We've managed to work with airline industries, helping them analyze their customers' shopping behavior before and after flying, so they can use those insights to optimize their campaign effectiveness. We've also been working with a few retail companies in Australia to help them identify which location is the best to open a new store. These kinds of customized analyses have helped us position ourselves not only as a service provider but as a strategic partner on the data and analytics side for our clients. Currently, we have streams of data scientists working on bringing more data, like payments, supply chain, and credit rating data, onto GCP, and on commingling those different data sets to unlock the value of the data held in the bank.

I think that's awesome. I just heard five days dropped to roughly a few seconds; that's the power of at-scale data processing and analytics. Super interesting to hear. Can you share more about your Cloud Composer usage, how you're using Composer for orchestration?

Yeah, of course. The team has been exploring different orchestration tools, and Composer has been a great tool for teams to keep going. Our data scientists love it because it's Python-based, and it's very easy to manage your dependencies and multiple layers of data pipelines. We currently have daily and weekly data pipelines running on Composer, generating hundreds of features and terabytes of data. However, with the growth of the team and the complexity of the data pipelines, we're starting to meet challenges like running multi-tenancy on Composer, and we've been very excited to hear more announcements from the Composer side this time.
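For readers who haven't seen Composer pipelines, here is a minimal sketch of the kind of daily, Python-defined Airflow DAG being described; the DAG name, tasks, and commands are invented for illustration.

```python
# Minimal sketch of a daily Cloud Composer (Apache Airflow 1.x) pipeline.
# DAG name, tasks, and commands are invented for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-science",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "start_date": datetime(2019, 4, 1),
}

with DAG("daily_feature_pipeline",
         default_args=default_args,
         schedule_interval="@daily") as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    features = BashOperator(task_id="build_features", bash_command="echo build")

    # Dependencies are plain Python, which is part of why analysts like it.
    extract >> features
```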
That's great; we'll share some more later. So, Keith, can you share more on how you made the decision to go to GCP, and what everybody here, especially in industries like yours, should think about as they move to the cloud and make that decision?

Absolutely. When looking at moving to a cloud provider, one of our key requirements was a provider who could help us get the most out of our data. So the core data capability was very important, and the services on top of the data, the AI and ML services, and partnering with someone who has those services, were absolutely critical. But data doesn't live in a vacuum, and where the data is is where application delivery starts to converge. When looking at GCP and Google, we found a provider that has those AI and ML services but also has the application delivery components, so we're also very heavy users of GKE, Cloud SQL, and a number of other components, along with the underlying data capability. As an ecosystem, that's great. As a financial services organization, there's a whole other suite of considerations that need to be overlaid on an implementation. In a heavily regulated industry, it's very important when implementing a cloud environment that it's not only the awesome technology that's there; it's balanced with great controls that can meet the expectations of your regulators, and that ensure you hit the privacy expectations of regulators but also of your customers. And as a final point, an interesting piece of the cloud implementation journey is that once you're there, things become a lot faster in terms of your ability to deliver, but that can shine a bit of a spotlight on yourself internally as an organization: on your processes and your ability to actually internalize and deliver on that change.

Good, thank you. Thanks a lot for sharing.

So it's very interesting, as we've seen over the last couple of years, how different industries have been starting to adopt the cloud and use some of our analytics capabilities for different scenarios. With that, let me share a few things about what's coming new in BigQuery at this conference. One: we had a goal last year to launch BigQuery everywhere, and we've been steadily increasing our footprint globally. This is super critical, as organizations want to keep their data in specific geographic locations. We've launched around 12 regions in the last year, and we'll continue that momentum going forward, making sure we're available in every region where Google data centers exist. The work is not done, but we're already in all of these different regions, so we should be in your region wherever you are.

The second big thing we're announcing today relates to pricing. BigQuery supports two pricing models: an on-demand model, where you pay per query for the data you access, and a flat-rate model, which gives you price predictability: you buy a set number of slots for the whole month and use them as you like.
What we're announcing today is two things, in alpha. We will have our Reservations API, which gives you two capabilities. One, if you're registered for the alpha, you can go online and start buying slots directly; you can say, I want 2,000 slots. We're also reducing the entry point: we're making a 500-slot BigQuery flat-rate tier available, which should reduce the cost of entry if you want to get started at a lower level. The second capability is quickly and easily managing resources. Say you have 2,000 slots and want to distribute them across four different teams, giving everybody 500 each, so that different kinds of queries can have different priorities. You can do that, but we always make sure any idle slots remain available to everybody: the compute resources you have are always fully usable, and you can allocate them across your organization very easily. This has been one of the asks from our customers for quite some time, so this is going to be available.

The second thing is the Storage API, which we announced a couple of months back. A lot of organizations are putting all of their data in BigQuery; BigQuery storage is the structured storage layer for all of that data in the organization. SQL is a great language for a lot of things, but not for all things, and we have a lot of customers who want to use Spark or Hadoop on top of that same data. Why would you want the same data copied into GCS and BigQuery and all the different storage layers just so you can process it? So we now have a high-speed Storage API. With it, the data stored in BigQuery is available from any of your Spark or Hadoop workloads; you can use Dataflow for batch jobs against BigQuery if you want; and ML Engine and the ODBC drivers can all directly leverage the same storage layer at high speed. You'll be able to run all these different types of workloads on the data in BigQuery, which just expands what you can do.
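A sketch of what that looks like from Spark, assuming the open-source spark-bigquery connector is on the classpath (for example, added with --packages); the table name is a placeholder.

```python
# Sketch: reading BigQuery-managed storage from PySpark over the BigQuery
# Storage API via the spark-bigquery connector (assumed to be on the
# classpath). The table name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-storage-read").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales.transactions")
    .load()
)

# The same table BigQuery queries, processed here as a Spark DataFrame.
df.groupBy("state").count().show()
```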
The third thing coming in BigQuery: last year at Next, in July, we announced the beta of BigQuery ML, and BigQuery ML will be going GA in a few weeks from now. Along with that, based on the demand we're getting, we have k-means clustering available, so if you want to do customer segmentation and those kinds of scenarios, you'll be able to do it very easily with just a couple of lines of SQL. You can do matrix factorization, so recommender systems; and you can import TensorFlow models directly into BigQuery ML. So those are the three key things there.

The fourth key announcement, which we made earlier in the keynote, is BI Engine. The whole idea of BI Engine is a fast, low-latency analysis service. You don't have to create any kind of models; it automatically accelerates queries on top of the data that's in BigQuery. Our goal is response times in milliseconds, under a second in most cases, so that you can do interactive reporting and dashboarding very easily across your organization at large concurrency.

So I've talked about a lot of things here: the 100-plus SaaS application connectors, BI Engine. Let's do a quick demo. Let me call upon Michael to come and show us some of the capabilities we're launching.

I'm going to show you what it's actually like inside BigQuery to get data from an external source and bring it in through a transfer. I'll try to go through pretty quickly here. Let's click the transfer button in BigQuery, which takes us to a view where I can see my active transfers. I'll hit the create-transfer button. We have Google's built-in transfers down here: I can transfer from Google Play, Google Ads, sources like that. But now I can click "explore data sources", and just like Sudhir said, we have a long list of more than 100 external data sources built by our providers. For example, here's an Adobe Analytics connector, a highly requested source, made by Supermetrics, one of our close partners; I can see details about this connector, enroll in it, or search for others on the marketplace. Another example: Facebook connectors, so we have Facebook Ads data here. I can search for Salesforce as well, and the top result is a Salesforce connector built by Fivetran, another of our really great partners. I can enroll in this connector right on the marketplace and choose the project I want to enroll.

I've already enrolled this project, so the connector now shows up automatically in my drop-down list. I pick Salesforce by Fivetran, enter the name of the transfer I want, and choose the schedule that's right for me: weekly, or in this case a daily schedule. I select the destination dataset inside BigQuery and hit "connect source". Right here I get a warning asking permission for the connector to write data into the BigQuery dataset I selected, so I'll hit accept. Then up comes a pop-up from Fivetran where I can authorize the connector. Normally this would ask for my Salesforce password; I've already done that, though, so when I click save, it creates the connection for me and takes me back to the transfers page, where I can complete my settings. I can also choose to get notifications in case the transfer fails. So let's click save, and that configures the transfer; there you can see the transfer run is now pending. That's really all I needed to do. That's how easy it is to go all the way from choosing one of the 100-plus sources to a transfer that lands right in your BigQuery dataset.

This particular transfer takes about seven minutes, so I have a dataset already set up where you can see what it looks like once it's inside BigQuery. Here's that Salesforce data; we can preview the leads table. There's city data, and data on the company for each of these rows. We could go query that, join it with other data, and integrate it with other information inside BigQuery.
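Once a transfer like this lands, the data is queryable like any other BigQuery table. A minimal sketch, with hypothetical dataset, table, and column names:

```python
# Sketch: querying the transferred Salesforce leads table like any other
# BigQuery table. Dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
    SELECT status, COUNT(*) AS lead_count
    FROM `my-project.salesforce.leads`
    GROUP BY status
    ORDER BY lead_count DESC
"""

for row in client.query(sql).result():
    print(row.status, row.lead_count)
```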
But now I'm going to show you the BI Engine feature with this data. Like Sudhir mentioned, with BI Engine we have the capability to run really fast, sub-second-latency queries, because it's running from memory, from RAM, inside GCP. I can go ahead and create a reservation and decide on the capacity I want with BI Engine. Once I've done that, what can I do with it? Here's a great example: a Data Studio dashboard running off BigQuery on BI Engine. As I click around here, filtering down to nurturing and new leads, for example, or slicing and dicing by Houston and Dallas, it reacts really fast, because it's using BI Engine. So please try out BI Engine today, it's in beta, and check out the external data sources on the marketplace. Thank you.

I think the key value here is connecting all these different types of applications. Organizations use many different applications, and bringing all that data together, running analytics across all of it, and deriving insights is going to be very interesting for organizations; making that easy is one of our key goals. Beyond that, there's a lot more we're working on. I won't go in depth on each one, but here's some additional information: later this year we will have the ability to do federated queries on Parquet or ORC files directly on GCS, and we'll support federation across Cloud SQL, which will be another data source for queries from within BigQuery. There's also a good economic advantage report created by the ESG Group; you should take a look at it if you're moving to the cloud, especially with BigQuery: there are massive savings you can get from a total-cost-of-ownership perspective.

Let's switch gears and talk about the second key scenario: running large-scale Hadoop and Spark workloads on GCP. One of the key parts of our value proposition is that we let you pick any of the open-source projects you want to run, through Dataproc and Composer and the technologies we have. We've been continuously adding more projects: now you can leverage Presto, and you could already run Hadoop, Spark, and various other projects underneath it. It's secure, and we do the management: we can launch the clusters and shut them down automatically. If you really look at the value proposition, I won't go into depth on each of these points, but in the comparison, green is what you would have to do if you were on-prem or managing it yourself on Compute Engine, and blue is what we take care of with the managed service. Just focus on the last column: you only have to write your code and deploy it, and we take care of cluster management and everything else. That's the biggest value proposition of the cloud, and especially with ephemeral clusters you can get massive savings, because you don't have to keep static clusters running throughout the day.

With that, let me call upon Jonathan and his colleague from Booking.com to share more of what they're doing. [Applause]

So why don't you introduce yourself and tell us more about Booking.com?
Sure. I'm a principal developer at Booking.com; I work on enabling cloud technology for Booking and opening that up to the company. Booking is the largest online travel agent in the world; we employ over 17,000 people and have offices in over 120 countries, so of course we work with a lot of data as well. I'm joined by Jonathan.

I'm Jonathan, a data scientist working on data quality at Booking, so I hear quite a lot about the products we're putting together here.

Got it. So what were the key challenges at Booking.com before you started the migration to cloud?

Sure. At Booking we run quite a large installation of Hadoop, and the workloads on it are mostly Hive and Spark, both production workloads and human, interactive ones. We have over a thousand daily users on these clusters, and because they all like to work together, there's a lot of contention for resources. That was a huge challenge for us, and it was an opportunity to use the cloud to give the data scientists, especially for the more data-intensive workloads, personalized capacity: basically, dedicated clusters per user. That was the proof-of-concept work we started in late 2017. It was very successful with our data scientists, and it became the business case that later triggered a big data migration to the cloud.

So that means every data scientist can have their own cluster they can spin up and work on?

Yes. The default is a large multi-tenant cluster where they contend for resources, but they all have the option to elevate that to a dedicated cluster for themselves, where the data scientist decides, up to a certain limit, the size of the capacity they need.

Got it. That's interesting, because that's one of the benefits of moving these things to the cloud: you can have static clusters but also burst out for specific workloads. That's really good. Can you share, I know we worked together on some interesting challenges you had, how we've incorporated some of them into the product portfolio?

Yes. Aside from the challenges of moving the data to the cloud, integrating it, and making these clusters appear with the data on them, available for the data scientists, the first thing the data scientists asked for was for their toolbox to be the very same as on-prem: the same libraries, the same integrations with on-prem technologies, and so on. This required customization of Dataproc, installation of libraries, and so forth, and we found out pretty quickly that the time to spin up such clusters went up quite high, which was impacting the user experience. So our ask to your team was to make it possible for us to create customized images for Dataproc, which, in collaboration with Google, we've now managed to get to GA.

Yeah, that's great. We always learn from our customers; it was a great scenario, and we were able to put it in quickly so that everybody can benefit.
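As a sketch of the pattern being described, an ephemeral Dataproc cluster built from a customized image might be created and torn down like this with the Python client; the project, region, cluster, and image names are placeholders.

```python
# Sketch: an ephemeral Dataproc cluster built from a customized image, created
# and deleted with the Python client. Project, region, cluster name, and image
# URI are placeholders; a real custom image is built with Dataproc's tooling.
from google.cloud import dataproc_v1

region = "europe-west4"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

image = "projects/my-project/global/images/custom-dataproc"
cluster = {
    "project_id": "my-project",
    "cluster_name": "analyst-adhoc",
    "config": {
        "master_config": {"num_instances": 1, "image_uri": image},
        "worker_config": {"num_instances": 2, "image_uri": image},
    },
}

# Create, use, and delete: the ephemeral, per-user cluster pattern.
client.create_cluster(project_id="my-project", region=region,
                      cluster=cluster).result()
# ... submit jobs for the data scientist's session here ...
client.delete_cluster(project_id="my-project", region=region,
                      cluster_name="analyst-adhoc").result()
```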
So Jonathan, why don't you share more about what you've been up to with this whole set of technologies?

I'd love to. Working with our Google counterparts, we started an exploration: now that we're in the cloud, there are tools available to us, like BigQuery and BigQuery ML, that we don't have on-premises, so let's see if we can use them for a case close to my heart: can we surface data quality issues that would be very difficult to find otherwise?

The scenario we chose to explore is very Booking in nature. We serve many properties on the website; each property has many room types; a room type might represent many rooms; but each room type is quite particular. The scale is millions of properties, so tens of millions of room types. And maybe most relevant for a visitor, those room types have lots and lots of facilities, or potential facilities: something like a hundred and seventy-six of them. So the scenario: a customer visits the website, maybe with a particular facility in mind that they'd really like to make sure is there, a bathroom, a TV, who knows, and they can filter for it. That's really helpful. From our side, however, we then need to make sure that data is correct. How could it go wrong? Well, if as a property manager or owner you forget to list a facility, you can lose customers by way of the filter. Going the other way, if you accidentally say you have it, you might be misrepresenting yourself, and the customer experience is quite odd when they arrive and the bathroom isn't there: okay, that's not so good, I don't know about your trip. So how do we fix these potential issues? They might also tell us something about ourselves: maybe we can ask better questions of the property owners, and by putting this in the right form, we can eliminate certain repeated mistakes. It's certainly added value for Booking if we can get this right; we're the intermediary.

Again, the data: it's reasonably long, tens of millions of rows, and quite wide. It's very boolean, yes or no to having each facility, but 176 of them is quite a lot. We wanted to attack this with something in BigQuery ML, and in particular we ended up using k-means clustering. Why try that approach? Well, we could attack it with rules. We could say, we know for sure that if you have pay-per-view channels, you'd better have a TV to watch them on; that's pretty reasonable. However, 176 facilities lends itself to lots and lots of subsets of rules, which is very difficult to manage and keep up, because you'll certainly be adding lots more facilities in the future: maybe all kinds of HoloLens or something will be available in your room. It would be very difficult to manage by hand, so let's see if we can surface these by throwing math at the problem, in a quick and iterable way, via BigQuery ML and SQL.

We have one premise, one assumption, backing our project: we assume most of the data is pretty healthy, pretty representative of the truth. If that's true, then we hope similar things will end up next to each other in such a clustering, and oddities will stick out; we'll be able to find the odd things. I think it's a pretty good assumption: we know our data relatively well, and it's not our first time looking at this kind of thing, so we're hoping math will find the ones we haven't caught yet.

Okay, k-means clustering. There was a very nice talk earlier today where we gave a longer version of this; I hope you'll visit it on YouTube. If we throw k-means clustering at this, we have just a few lines of code to build several different versions of the model, we only have to tune maybe one parameter, and we can very quickly see what comes out.
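A minimal sketch of that workflow in BigQuery ML, run through the Python client; the dataset, table, and column names are invented, and num_clusters is the one knob referred to.

```python
# Sketch of the BigQuery ML k-means workflow described here. Dataset, table,
# and column names are invented; num_clusters is the one parameter to tune.
from google.cloud import bigquery

client = bigquery.Client()

# Train: a few lines of SQL build the clustering model.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.room_type_clusters`
    OPTIONS (model_type = 'kmeans', num_clusters = 12) AS
    SELECT * EXCEPT (room_type_id)
    FROM `my_dataset.room_type_facilities`
""").result()

# Inspect: ML.PREDICT returns the assigned centroid and the distances, so the
# rows farthest from their centroid are the candidate oddities to review.
for row in client.query("""
    SELECT
      room_type_id,
      centroid_id,
      (SELECT MIN(distance)
       FROM UNNEST(nearest_centroids_distance)) AS distance_to_centroid
    FROM ML.PREDICT(
      MODEL `my_dataset.room_type_clusters`,
      (SELECT * FROM `my_dataset.room_type_facilities`))
    ORDER BY distance_to_centroid DESC
    LIMIT 20
""").result():
    print(row.room_type_id, row.centroid_id, row.distance_to_centroid)
```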
Let's visualize some of those. Actually, sorry, let me take a step back: what would clustering look like? This is two dimensions, of course; we have a hundred and seventy-six. Remember, our goal is to find the things in the triangles, the ones that really stand out. If we can cluster well, the ones that are very far from their centroid, their middle, are maybe the ones where something is going on. This is visually what we're aiming for.

Okay, here it is in Data Studio; it's very convenient that we can look at this through the GCP pipeline. There's a lot going on, but the red on the right relates both the size of the clusters and how far they are from each other. Notice that cluster 10 is actually quite far from any other cluster. That tells us something: maybe it's odd. That's one way a cluster could be odd: being very far from the others. It turns out cluster 10 is weird for another reason. The green shows the distribution of distances within each cluster: for each cluster, if you're at 0 you're at the centroid, and if you're at the top, you're as far from the centroid as anything in that cluster. I care about cluster 10 again because its farthest points are the farthest of any from their centroid, so we should probably look there first; it gives us an indication of where to peek.

So let's take a look at a few examples from cluster 10. If we drill in, what do we see? I've drilled in, and I'm looking at exactly one item. The blue bars represent something about the whole cluster, cluster 10: how common each of the room facilities along the bottom is. Take "toilet" as an example: the bar says it's available in my outlier, and it's also very commonly available in the cluster overall. I see a toilet, I see a shower, I see body soap, I see free toiletries. These are all very good things, and where would you put them? Probably in a bathroom, which is not listed here: the outlier does not have a bathroom listed. Maybe it does have one, I don't know. And no free toilet paper, so maybe it's BYO toilet paper, I'm not sure. Maybe you filtered for that yourself, maybe that's something you want, but for my trip I'd like to be sure there's toilet paper waiting for me when I arrive. That's good to know; we should at least follow up with the property.

Let's take one more example, going in a different direction. This one might have other bathroom problems, but let's see what it does have: I see cable TV channels, I see satellite channels, but I don't see TV or flat-screen TV. I hinted at this before: you have plenty of channels to watch but nothing to watch them on. We would definitely want to surface this for the property, because they might be missing out: anyone filtering for a TV is maybe going to miss this property, even though it probably has one.

So we found some really interesting stuff: things we wouldn't have found otherwise, or would have had a very hard time identifying. The next step for us is automation and improvements, so that we can do this on a very regular basis and not have to explore by hand the way we did.

Got it. Thank you, Jonathan, thanks a lot; some interesting use cases there.
So let's continue with some of the new investments we're announcing. We're investing in the security features of Dataproc: Kerberos support is now available, so you'll be able to use the same security models you're using on-prem. You have auto-scaling capabilities. One of the other big investments we're making with Composer is Composer Flex, which will give you a completely serverless Composer capability. And we just announced our partnership with Qubole. A lot of enterprise organizations use Qubole for their Hadoop and Spark workloads, and their unified experience, with the workbench, notebooks, and dashboards, is super valuable for enterprises. They're now available on GCP, with great enterprise security features, controls and governance, and seamless workload migration, so if you're using Qubole today, you'll be able to continue using it on GCP from your on-prem environment.

The next key thing: as I said, 25 percent of data will be generated in streaming form by 2025, and we have great capabilities on the platform for streaming, starting from ingestion with Pub/Sub through transformation and analysis across the board, and you can also do it with our open-source technologies or partners. So one of the key announcements we have today is Dataflow SQL. A lot of organizations use Dataflow by writing Java code for Beam and so on, but SQL is a good interface that a lot of customers like, so we're making it available. Let's do a quick demo from Sergey. Welcome, Sergey.

Thanks, Sudhir. In this demo I'm going to take a Pub/Sub topic, associate a schema with the topic, and join it with a BigQuery table to do some stream enrichment. Once I have a schema'd topic and an enriched stream, I'm going to group it by time and insert the results into a BigQuery data warehouse. The goal of this demo is to show that you can very quickly calculate aggregate statistics on a stream of events.

I'm going to start with BigQuery, and many BigQuery users will find it quite useful that they can now access Dataflow right from the query settings: you get the choice of the Dataflow engine as the execution back end, and once you choose the Dataflow engine and save the setting, you'll be able to create Dataflow jobs. I have a SQL statement saved in my notepad, just for demo purposes, so that I can avoid typing it. I'll quickly explain what's going on here. I have a Pub/Sub topic; no, this is not a table, this is actually a stream of events. I'm going to join it with a static table in BigQuery, allowing me to do stream enrichment; here's my join condition. And the key portion of this streaming SQL statement is the tumbling function. This is the piece that lets you do streaming analytics: it creates fixed five-second windows, and you can run aggregations on top of those windows. That's exactly what's going to happen in the projection part of my SQL statement: I'm going to create statistics per sales region for all of the sales events in my stream, with a timestamp of the calculation and the sales amount. Oh, and by the way, I mentioned that we now store the schema for Pub/Sub in the catalog; here's the schema of my Pub/Sub topic. This is what enables me to run SQL on streams: having a schema. My events have very simple attributes: a timestamp and a payload, and the payload contains the person who purchased the good, the good itself, where it was purchased, the state of purchase, and the amount.
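Reconstructed from that narration, the statement looks roughly like this; it's held in a Python string here, since in the demo it's typed into the BigQuery UI with the Dataflow engine selected, and the source names and field paths are placeholders.

```python
# The approximate shape of the streaming SQL from the demo, reconstructed from
# the narration; source names and field paths are placeholders. In the demo it
# is entered in the BigQuery UI with the Cloud Dataflow engine selected.
DATAFLOW_SQL = """
SELECT
  sr.sales_region,
  TUMBLE_START('INTERVAL 5 SECOND') AS period_start,
  SUM(tr.payload.amount) AS amount
FROM pubsub.topic.`my-project`.transactions AS tr
INNER JOIN bigquery.table.`my-project`.sales.us_state_salesregions AS sr
  ON tr.payload.state = sr.state_code
GROUP BY
  sr.sales_region,
  -- TUMBLE buckets the unbounded stream into fixed five-second windows,
  -- which is what makes aggregating over a stream possible.
  TUMBLE(tr.event_timestamp, 'INTERVAL 5 SECOND')
"""
print(DATAFLOW_SQL)
```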
Great, so let's run the job. In the next screen I just need to type in the destination table. Initially we support BigQuery as the destination, but we'll add more destinations in the future as well. Within a second or two, I get a Dataflow job ID, and if I click on it, I get rerouted to Dataflow. Once the resources have been provisioned and the SQL query has been executed and transformed into a Dataflow graph, I'll see things in the middle of the screen. I don't want to wait for the purposes of the demo, so I launched the same job a few minutes ago; here's what you'll see in the Dataflow experience. For those of you familiar with execution plans, that's exactly what's going on here: Dataflow took the SQL statement and created an execution plan for it using the Beam framework. I have my input flow from BigQuery, my input flow from Pub/Sub, and the join condition in the middle; for this particular SQL statement we're using very efficient side-input joins. To conclude the demo, I also want to show you that data is actually flowing through this SQL statement, so I've switched back to BigQuery, where I have a SELECT * statement; let me quickly run it. Here are the results; now let me rerun it. As you can see, my data gets updated every five seconds. Back to you, Sudhir.

Thank you, Sergey. As I said earlier when I talked about our philosophy, we want to make the activities you do today really simple. You could have done the exact same thing in Java by writing a few lines of code, but we want to make it really easy for all the analysts to do the same work on streaming data, and with this SQL interface, Dataflow becomes accessible to everyone in the organization who can write SQL. In addition to that, we also have FlexRS, flexible resource scheduling: for delayed data pipelines, you can save up to 40 percent by using preemptible VMs on our side. The main point there is that we guarantee your jobs finish in any case, because we mix regular VMs and preemptibles; you get a lot of savings with a guarantee that your jobs will get done. There are a lot of other announcements I won't go through in depth, but you can take a look at them.

From a governance perspective, there are things we already offer: built-in encryption, customer-managed encryption keys, access transparency, which Thomas touched on earlier today, and tools for efficient governance as well as compliance with HIPAA and the other compliance regimes. The key thing we're announcing today is Data Catalog. Data Catalog is our fully managed and scalable metadata service, which lets you search for all your data assets: where they are, what they are, who has access, all the different details. Along with that, it also lets you define schemas for Pub/Sub, for streaming data, and once you have that, you can start using it in SQL. It's a very simple search experience. It gives you auto-tagging with DLP: we can run DLP and mark PII data, and you can also add your own business metadata that you define. From there, we'll be able to define policies on top of it, so you can say, anything that's PII, don't give access to this group of people, or give access only to these people. You'll be able to do those kinds of activities, but fundamentally it's easy discovery of your data within your organization, and we're going to solve that across all the different assets within GCP.
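As a sketch of that search experience from code, using the Data Catalog Python client; the project ID and search query are hypothetical.

```python
# Sketch: searching Data Catalog for assets from code. The project ID and
# search query are hypothetical placeholders.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-project"]
)

# Find assets whose schema mentions an email column, wherever they live.
for result in client.search_catalog(scope=scope, query="column:email"):
    print(result.search_result_type, result.relative_resource_name)
```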
Other than that, the most important thing is that you have a lot of investment in different tools you may have acquired across your organization, and we have a huge partner ecosystem, so those should just run as-is without any problems. We have great partners in the BI space, like Tableau and Looker, and all those tools are there; Informatica, Fivetran, which we showed earlier today, and Talend are all there for ingestion; and there's a huge partner ecosystem beyond that for you to leverage. Google has also been identified as a leader in both the Forrester Waves and Gartner, so we're trending in the right direction. We have a lot of investment going into the analytics platform generally, and it's ready for enterprise adoption. We have a lot of customers, and it would be great if you could take a look at some of the key new capabilities, start playing with the products, and give us more feedback. Thank you. [Applause]
