DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP



[MUSIC – ALABAMA SHAKES, "DON'T
WANNA FIGHT NO MORE"] NEIL PALMER: Good. Ready? OK. Good afternoon, everybody. Thank you for coming. My name's Neil Palmer. I'm the CTO for the FIS
advanced technology group. Today, we're here to
talk to you about this. It's a very long winded
title, I do realize. We wanted to call
it, "How we're going to save Western
civilization," and all I got was this corporate
branded t-shirt. But that wouldn't
fly with marketing. So you got that. What we are going to talk
to you about today is how we are planning to ingest
every piece of stock market data from the US
markets, process them in four hours on a
daily basis– that's 15 terabytes– and make it available for market reconstruction and
analysis to the regulators. And so to help me
with that today, and we'll demo a little bit
of what we've been building. I have Mr. Carter Page from
Google, a lead for Bigtable, and Mr. Todd Ricker,
principal engineer from FIS, who's going to be
driving the demo. A little bit about us: FIS is the largest financial technology firm in the world. We build everything from online
banking systems to ATM software to help you get your
cash out, for those who still use that,
through to asset management software, trading
systems risk, everything at the back end of Wall Street. My world's a little
bit different. My world is to go out and
do work in other industry verticals, to look at
media, to look at internet and to figure out what's going
on in emerging technologies in those industries. We find that those
industries are typically about 18 months ahead of the
financial services firms. And so doing real
work there allows us to bring that knowledge in house and apply it to our products and to our clients. So the problem we're
looking at today is– what on earth is
going on in Wall Street? As you may have noticed in
the media over the last couple of years, there's been
some issues with the well functioning of Wall Street. And you look at the film,
"The Big Short", you look at the book, "Flash Boys". There's some
serious issues going on in terms of the systems
and how they're being managed. And I think part of that
is because a lot of people have forgotten that up
until about 15 years ago, Wall Street drove significant
technology innovation. The acceleration,
the electronification of the markets, the
straight-through processing piece was all a competitive
advantage to them and made them a lot of money. The problem with
that is it's left us with a large number of very
disparate systems on very disparate technologies across
ever increasing numbers of broker dealers and
execution environments. And so when you couple
what is essentially technical debt with what could
be viewed as regulatory debt, you have a much
greater risk profile to the well-ordered functioning of the markets. And given the way our economy and our way of life is based on the well-ordered functioning of those markets, this is not a good thing. And so there is a need for
much better forensics in terms of how the market's behaving,
and so that when things do go wrong, we can actually
figure out what has happened. So I want to touch a little
bit about where we've come from before we get to
what we're going to talk about, which is the
consolidated audit trail. It was only about 40
years ago, we still had pit trading, paper
trading, like you see in all the old movies. And over the space of 40 years, we've gone from that through to
the high frequency trading that you hear so
much about these days where essentially 75%
of the trades placed are algorithmically driven. This has greatly
increased data volumes, and it's greatly
increased data complexity. And so it's gotten to the point
where many of the Wall Street firms have embraced this
electronification quite aggressively. It's not necessarily a
bad thing in and of itself. It's driven a lot greater
price efficiencies. It's actually done a better
job for a lot of even retail participants there. But when things
do go wrong, they create anomalies in the market,
which creates unfairness. And people suffer the consequences. And so the goal is
that we need to have a single system these
days that allows people to understand what happened. And so you can see here in the
last not even 15, 20 years, there's been astronomical data
growth in the financial market. It's interesting because
even though there's been astronomical growth,
it's not uniform growth. It's not one way
traffic, if you will. You can look at the
spike on this graph and see 2008 volumes
reach peak, and then they plunge as a result of the
recession and the bubble that popped then. Also what happened
was that in 2010, there was what's known
as the Flash Crash. And this was a problem with
the markets and the trading where several trillion dollars
was wiped off the US markets in a matter of minutes and
then mysteriously reappeared about 30 minutes later. Now, I don't know about you, but
having several trillion dollars magically disappear and then
reappear is probably not really a good thing for
life in general. And so the SEC as a
result put together a proposal to create the
consolidated audit trail, or CAT as it's known. We have our own very
special design CAT symbol. Not just your usual
grumpy cat stuff. Thank you very much. And so really,
the goal of CAT is to track every life cycle
event, every tick, every trade, every piece of data that's
involved in the US market in one place. Previously, the quote
has been that regulators have been trying to use
bikes to catch Ferraris. And so our goal is to build
the next generation system that will allow them to understand
in a reasonable amount of time what is happening in the market. And so this is a
lot of data, right? This is taking data
from all these silos across all these banks, these
broker dealers, the executions, the dark pools, and bringing
them into a single system. We're going to have to
ingest 100 billion market events per day. We have to process
them in four hours. That's 15 terabytes
a day for six years. That's 30 petabytes of data. That, and here's
the kicker– we have to be able to query
the whole 30 petabytes. We are one of three final
bidders for this process. The process started
four years ago with over 30 firms involved. It's down to the final three. With the help of
Google, we're going to show you a little bit about
what we've built and managed to achieve so far. There's three core
problems we'll go through. Just the sheer scale of
ingesting that much data, the sheer scale of processing
that much data, and what is frankly, a ridiculous
time frame– the four hours. And then the final problem
is how do we actually get any actionable information
out of all that data. There is absolutely zero
point in us throwing this into a big black hole that we can't do anything with. Although, when the
original RFP came out– it was about 250 pages long,
and there were two lines in it about querying– it said we need batch and we need ad hoc. So we were like,
well, thank you. That's extremely helpful. So what we're going
to do now is we're going to walk you through–
we're going to kick off a demo that's going to
come in several parts around data
ingestion, processing. We'll talk about the
linkage because we have to be able to
link life cycle events across multiple reporters. And then we'll show you some
of the visualization tools we're creating to actually
interrogate the information. And so for that, I'm going
to hand it over to Todd to kick off the demo. TODD RICKER: Thank you. Thank you, Neil. Can we get my demo
screen up here? Perfect. Thank you. So I'm going to just start
by kicking off the job, and then I'll tell
you what it's doing. You can see I'm about
to run a Maven command. I'll say I'm doing a
two-part demo here. I'm running the data processing piece first, which is a Dataflow job. And then a little later,
we'll look at the data, and we'll see how we
can visualize that data. So let's kick it off. Should be exciting. It's executing a Dataflow job– that Java code sitting on my local laptop is being pushed out into the cloud right now. It's just standard
Maven stuff going on. So what does the job do? The job is going to process 25
million synthetic trade events that I generated this
morning from real price data from this morning. So at the end of the demo,
you can compare prices to the real prices. You'll see they're the same. But it is synthetic data. So the job kicked off. I'll Alt-Tab over to the
Google Cloud console. We should see it running. Very good. So what the job's going to
do, take those 25 million synthetic trade events. They're stored in
Google Cloud Storage. They're FIX messages. FIX is the standard protocol
by which exchanges and banks communicate trade events. So it's going to copy those
out of Google Cloud Storage. It's going to insert
them into Bigtable. Then we'll make another
run through that data and create the links
between the trade events. So where they have
relationships, we'll link them at that phase. That's the part that's
difficult at scale. And then finally, we'll copy all that data from Bigtable into BigQuery so that we can easily visualize it and query it.
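To make the job's shape concrete, here is a rough sketch of a pipeline with the same three ideas– read FIX messages from Cloud Storage, parse them, and land them in a queryable table– written against the Apache Beam Python SDK. The bucket, dataset, and field choices are placeholders for illustration; the actual FIS job is Java, and its Bigtable load and linkage phases are only indicated in comments here.

```python
# Illustrative sketch only; bucket, table, and tag choices are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_fix(line):
    # FIX messages are tag=value pairs separated by the SOH (0x01) delimiter.
    return dict(f.split("=", 1) for f in line.strip().split("\x01") if "=" in f)

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadFIX"  >> beam.io.ReadFromText("gs://my-bucket/fix-messages/*")
     | "ParseFIX" >> beam.Map(parse_fix)
     # The real job first loads events into Bigtable, then runs a second
     # pass that links related events, before copying into BigQuery.
     | "ToRow"    >> beam.Map(lambda e: {"order_id": e.get("11"),   # ClOrdID
                                         "symbol":   e.get("55"),   # Symbol
                                         "price":    e.get("44")})  # Price
     | "WriteBQ"  >> beam.io.WriteToBigQuery(
           "my-project:cat_demo.market_events",
           schema="order_id:STRING,symbol:STRING,price:STRING",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```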
So, Neil, why don't you tell us more about linkage? NEIL PALMER: OK. So we're going to let
that keep ticking along. Fingers crossed
I haven't angered the demo gods anytime soon. So what a lot of
people maybe don't know is that when you order
stock, when you purchase a stock online through
your broker dealer– be it Fidelity, Charles Schwab,
your Cousin Vinny, whoever. If you've just cashed out in
some highly successful IPO and you need to reinvest
it into something a bit more interesting
and you want to buy 10,000 shares of Google,
for example, you click Submit. You get an order confirmation
number back straightaway, but that is not how that
trade gets executed. That trade gets routed
through your broker dealer, and your broker
dealer is governed by a couple of regulatory
things, one of which is best execution. They have to make
sure that you get the best price for that trade. But after that, they have a
lot of options about how they might want to fill the trades. They might route it
to multiple exchanges. They might route it to
other broker dealers. They might fill it themselves. They might send it to dark
pools– a wide range of areas– and that can generate multiple, multiple hops in the trade. And in fact, most
trades these days that are actually executed,
at the point of execution, on Wall Street are typically
no more than 200 shares because they don't want
to move the market. They don't want to tip their
hand around who's buying what. And in fact, there was an
incident earlier this week where somebody fat
fingered some Apple shares, and you could see a huge spike
over several seconds in Apple share prices because
somebody was trying to buy 70,000 shares instead of 70. So what this does
is it generates thousands and thousands of
life cycle events, billions. And we have to reconstruct them. We have to be able to connect
the order execution back to the ultimate beneficial owner, the person who ultimately pays for the trade. Now, the problem is this is a
regulatory transaction system. All right? This is not log file analysis. We cannot drop one
of these events. We cannot make a mistake. And so we're not allowed
to use pattern matching. We're not allowed
to use fuzzy logic. We have to take the trade
events and connect them from point of execution, point
of fill, reconstruct the tree, all the way back up
to the original order. And that's complex. It takes a huge amount
of processing power, and we have to have very
flexible schemas in order to be able to handle attributes
that may vary based on reporter, may vary based on asset class, may vary based on time. And so for that, we
chose Cloud Bigtable. And to talk about Cloud
Bigtable, here's Carter. CARTER PAGE: Thanks, Neil. Hi, everybody. My name is Carter Page. I'm the engineering
manager for Cloud Bigtable. I'm also the engineering
manager for Bigtable internally at Google as well. At Google, we've had our fair
share of big data problems. With an audacious mission
to organize the world's information, to
make it universally accessible and useful,
we've had to come up with a whole suite
of big data tools, as you've heard for the
last couple of days. We developed these different
tools– GFS, MapReduce, Bigtable, Dremel. And all these things start to
come together into our cloud tools to allow Neil and the team
to build something like this. And all these things came out
of real needs that we had. They were actual problems
we had at Google, and we sat down and
figured out what do we need to engineer
to solve this problem. So for instance,
very early on, we're basically collecting a copy
of the entire worldwide web. And we're trying to figure out
where do we store this thing? And you can either have
an application trying to figure out how to
shard it across a bunch of different machines,
which puts a lot of burden on the product team,
or you can come up with an abstraction, an API, which simplifies this actually very complicated process underneath. And we did that, and it became GFS. GFS was later
upgraded, essentially replaced with Colossus in 2010. It also has a key value
API in front of it that we know internally
as Blobstore. And externally, you would know
it as Google Cloud Storage. So we were building out these files, and we had a place to store them, but now we were getting a lot of updates. And while a file system
is great at doing blocks, it's really bad at doing
small reads and writes. So you got a copy of the
web, and you're updating the latest pages that came in. If you have just a file
system like we had before, you have to rebuild the entire index from scratch. And this was taking days, turning into weeks, and approaching a month, and it was creating a problem. You had data staleness issues
in addition to the fact that this was costing
more and more resources and time to actually run. So we had to build a
database on petabyte scale, and what we came up
with was Bigtable. We launched an internal
service in 2006. So this was actually
our internal cloud. Externally, we launched this
as Google Cloud Bigtable a year ago, but this is actually exposing the internal cloud service we've been running for a decade. So you have a whole
bunch of data, and you want to be able to
do an ad hoc query on say, a trillion rows. And you want to be able to just
ask some basic questions of it. It's not a very easy thing to
do without something like what we developed internally,
Dremel, or if you saw the previous presentation,
BigQuery, which is also talked about a lot here. And there's a fun kiosk
on BigQuery out there. And then you've
got all this data. You want to do something
with it, right? You want to identify which
photos have cats in them. You want to figure out what
your page ranking is going to be for your search index. You want to do something
useful with it. So you need to actually have
data processing at scale– massively parallel processing. We built three tools
internally– MapReduce, Flume, and MillWheel. And these things became
what you see up here, which is Dataflow. Dataflow is the
kind of conductor that helps tie all
these things together. It's a really key piece to be
able to do big data analytics. You need something
that can actually talk to Bigtable, for instance,
at the scale where Bigtable becomes useful. So imagine what you
can build if you have these products, these
services internally that just will magically scale for you. Whether you have 10,000
customers or whether you have 100 million customers,
you can think about the product and not think about how
you're going to scale it up at each step along
the way and how you're going to deal with
sharding your databases. When you don't have
to think about it, you can focus on building
really great products. And I really think that we attribute a lot of the product diversity that Google's
been able to develop over the years to this really
rich infrastructure. And this is the same
infrastructure that is in Google Cloud Platform. And that's, I think, what
makes this really exciting for a lot of us,
is the idea of this was an enabling factor
for Google to be able to grow and
have this diversity and richness of our products. We think that external
customers will be able to find new
and exciting ways to be able to do things
when they don't have to worry about the basics. So back to Cloud
Bigtable, which is, again, my area of expertise. So I'll spend a couple
of minutes on this. It's the exact same
service in Cloud Bigtable as Photos and Maps and Calendar, et cetera, talk to. It's not an abstraction or a Bigtable-lite version we have running. It's actually the
exact same service. It just has an external cloud
API kind of attached to it. Cloud Bigtable, for those
of you who don't know, I'll give you a
quick introduction. It's a NoSQL database. So it's not relational. You define columns
as you insert them. It supports sequential
scans, which basically lets you do some clever
things which are good for time series and the like. And it'll auto-adjust to access patterns. So you've got this large sharded database, and as you change your access patterns, it will adjust the data for you so you don't have to think about it.
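As a toy illustration of why sequential scans suit time-series data like this: if the row key starts with the instrument, all of its events sit next to each other and one range scan pulls them back in order. The project, instance, table, and key scheme below are assumptions, not the CAT design.

```python
# Toy example of a prefix/range scan; names and key layout are assumptions.
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("cat-demo").table("market-events")

# Row keys like b"GOOG#<timestamp>#<event_id>" keep one symbol's events contiguous.
row_set = RowSet()
row_set.add_row_range_from_keys(start_key=b"GOOG#", end_key=b"GOOG#\xff")

for row in table.read_rows(row_set=row_set):
    print(row.row_key)
```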
Briefly, how it works is you have a set of clients
lot more than that– that are talking to a set of nodes. Now, the nodes are
basically just CPU and RAM, and they will pin to a certain
chunk of the actual data that sits in the Colossus file system
which handles our durability and basically replicates
that data for us. And each node will serve
out a chunk of that data to the clients. What this allows us
to do– in Cloud Bigtable– is it allows you to scale processing and file size orthogonally. So for instance, if
you have a lot of data but you're just
kind of trickling it in over the course
of a few years, you could have a few
nodes and let that grow. Or if you're doing a lot
of really fast processing on a couple of
terabytes, you don't need to actually scale
your nodes to the same size as your actual data. So I was talking about
learning access patterns. For instance, imagine you have
a hypothetical situation where we've got a really small– three
nodes is a very small cluster for production, but it's the minimum size we allow in Cloud Bigtable. But imagine for
illustration purposes, we have a three node cluster. And by the luck of the
draw, all your clients are all asking for the
data that's sitting in the node on the far left. What Bigtable will do is it will
identify these load patterns as they start to appear,
and it'll immediately start shuffling data around. It's actually really
fast at doing this. It can move entire chunks of
data in a couple of seconds. And if the access
patterns change again and there's a new server
that starts to heat up, it'll also start to rebalance. And that's useful when
you start to think about how you want to scale it. Basically, what
you get out of this is you get linear scalability. You can go from three
nodes to 30 nodes, and you get 10 X increase
in terms of your throughput. You can go from
30 nodes to 300 nodes. You get another 10 X. It
is linear all the way out. You go to the kind of
sizes that they'll need to run– it just keeps going. And that's actually
really great for planning because a lot of times these
[INAUDIBLE] architectures will eventually start to flatten
out a little bit on the curve. But you can actually count
on your line going straight, and you don't have
to do a load test at an order of 1,000 nodes. You can do it at
100 nodes and know what resources you're going
to need for that higher set. It also gives you confidence. As you grow, you don't have
to worry about it too much. Your cost of your
underlying database can be just a function
of your growth. So internally– and
getting to the point where we exposed Cloud
Bigtable last year– there's a lot of engineering
that we did internally to be able to have
this cloud product. Again, it's essentially a
cloud product internally. We had to teach Bigtable
how to configure itself. It started with thousands
of different flags and configurations
that were really confusing to internal engineers. And we realized that
Bigtable could actually do a better job figuring out how
to configure itself correctly than our engineers
could a lot of the time. And this results in a simpler
control plane on the cloud interface for it. We do a lot of isolation. We had to make sure that
one product at Google that was having a very big
day was not interfering with another product at Google. And as I showed before, there's
a rebalancing aspect to it that we've been spending
a lot of time on. And this is true of all these
products you're seeing up here. These aren't things that
were invented for cloud. These are things that we've
been developing and running and using internally and are
instrumental to operations for the past decade. And the same sort
of work we've had to do to be able to make
all these products– so these Google
products co-exist side by side– that's actually what
puts us in a great position to have products that work
very well in a cloud scenario. So with that, I'll
hand it back to Neil. NEIL PALMER: Thank
you very much. So what Carter's
just been describing was very important for us from
a design decision perspective. The levels of load we're talking
about– with Google, we're taking technologies that we already know scale to that level, if not beyond it. For me, that was a
weight off my mind. Like I don't have
to worry about this. Is this going to scale? I know it's going to scale. So I'm going to talk a little
bit about the high level architecture. But first, I want to
check in on the demo. Todd how is it going? TODD RICKER: It's
going pretty well. We've ingested 26 million
order events, market events. We've linked them, and we're
finalizing with the BigQuery copy. It'll be a race. We'll see what happens. NEIL PALMER: I'll slow down. TODD RICKER: Yeah. [LAUGHS] NEIL PALMER: So just a little
bit about the thought process when this started in 2012. And a lot's changed
in those four years. FIS runs a lot of infrastructure
and a lot of processes and a lot of applications
for financial services firms. But we looked at the scale
of this, and we thought, no. Too big. We can't handle that. We started to look
at private cloud as an option– it addresses some of the scale issues, but there were definite economic issues. And so we came to
the point where we're looking at public cloud,
and it became clear to us that they had the scale
and you could already see the impact of the economics
and the impacts of Moore's law with prices decreasing. And so we decided that
was the way to go. I have to say at the time
in financial services, that was not a particularly
popular decision. When we first walked in
front of the regulators with Google in the
room and told them what we were planning on doing,
they basically were like, you are out of your mind. And so on one hand, the process
has taken a really long time because that's the way
regulatory systems work. On the other hand,
that's good because we've had the journey that's brought
the regulators along with us in terms of understanding
what public cloud is, how the security works, and
especially the economics. And the continued price
reductions in all the public cloud providers have made
a significant difference in terms of the risk
cost-benefit equation. So in terms of high
level architecture, you've heard Todd
talk about some of it. We have a direct peering connection with Google and My5 and the data center there. It's a 10 gig pipe. Right now, we're
pushing L2 equities data in through it on
a regular basis. We're only consuming about
500 megs of that bandwidth. We've got 20 times
headroom ready to go. It smokes. We drop the files– basically FIX drop copies– into Cloud Storage. We use Dataflow
to transform and do an initial validation
on the data and insert it into Bigtable. We run another
job that then does the linkage on all
the events in Bigtable before extracting
it out to BigQuery. Bigtable was critical
because as I said before, we have to link
from the ground up. And so when we get an
event– unless it's an absolutely new order, unless
it's the very first step– what we know is we have the event
and we know its parent event. And so what we do with Bigtable is essentially we write a put to put the row in, but we also do essentially another put which puts a phantom row in for the parent. And so it allows us to know, by constructing the column names on the fly, whether or not the previous or subsequent steps in the chain have already been found.
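A minimal sketch of that write pattern, using the Cloud Bigtable Python client– the row keys, column family, and column naming convention here are assumptions for illustration, not the actual CAT schema.

```python
# Illustrative "phantom parent row" write; schema details are assumptions.
from google.cloud import bigtable

table = bigtable.Client(project="my-project").instance("cat-demo").table("market-events")

def write_event(event_id: str, parent_id: str, payload: bytes):
    # Put the event itself under its own row key.
    row = table.direct_row(event_id.encode())
    row.set_cell("event", b"payload", payload)
    if parent_id:
        # A column name constructed on the fly records which parent we expect.
        row.set_cell("link", b"parent:" + parent_id.encode(), b"")
    row.commit()

    # Also put a phantom row for the parent. Whichever side arrives first,
    # the other can tell from these columns whether its neighbor in the
    # chain has already been seen.
    if parent_id:
        phantom = table.direct_row(parent_id.encode())
        phantom.set_cell("link", b"child:" + event_id.encode(), b"")
        phantom.commit()
```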
Part of the problem is we have no guarantees on when
report it by– I believe it's 8 PM end of day trading. We have four hours
to process it, but we don't know the orders. We don't know the
sequence of the orders and when they come
in, none of that. And so you have to be
able to very flexibly see whether or not the full
life cycle has been found. And there are some
orders that you won't. Some orders are good until canceled. The other big thing
that was important to us was BigQuery and the
API access to BigQuery. I don't know about
you guys, but I hate getting requests
from business people to write me a new report. That drives me crazy. And because we have to bring
an entire industry with us– we have to move
an entire industry from a variety of different
regulatory systems onto this one, and
it's an industry that is basically saying
you're going to pry Excel from my cold, dead hands. So the fact that with BigQuery we can plug Excel into it, and we can plug other BI tools into it without our involvement, was very, very important. And so we'll talk a
little bit about that later on in some of
the visualizations that are going on. We haven't gotten
this far yet, but I think the next major
challenge for us is going to be around testing
tools and regression testing and things that we give the
industry so that they can understand how our
software is working, how our system is working,
how their order management systems are operating. When they upgrade
their software, how do they know that
it's compatible with what we've done lately? And I think that's going
to be a big, big challenge. Are we ready? Are we going to
look at the demo? TODD RICKER: Yes. NEIL PALMER: All right. TODD RICKER: The demo
gods have smiled on me, and the job completed. First, I'm going to show you a BI tool
on another set of data. This is not the data
we just processed. This is a BI tool
called Qlik Sense. It's a web-based tool. It uses the BigQuery connector. So we can nicely connect
it to our data sources. As BI tools go, you can
set up the visualizations without necessarily
using a programmer. So business users can set
up their own visualizations, help them do their own job well. And this is powerful stuff in an
industry that's still primarily run on Excel spreadsheets. So here, we're looking at the
most active underlying options from today– from this morning. Market open until about 11. We have a pie chart. We have a bar chart,
pretty simple stuff. You can see that SPDR's ETF was
the most actively traded option this morning. That's typical of SPDR as an ETF that's tracking the S&P 500. So on most normal days, it's going to be a very high volume. Little odometer. Great. Very simple, but powerful stuff.
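For a sense of what sits behind a chart like that, a query along these lines is all it takes once the data is in BigQuery– the dataset, table, and column names here are made up for illustration.

```python
# Hypothetical "most active underlyings" aggregation; names are assumptions.
from google.cloud import bigquery

sql = """
SELECT underlying_symbol, COUNT(*) AS trade_count
FROM `my-project.cat_demo.option_trades`
WHERE trade_time BETWEEN TIMESTAMP('2016-03-24 09:30:00')
                     AND TIMESTAMP('2016-03-24 11:00:00')
GROUP BY underlying_symbol
ORDER BY trade_count DESC
LIMIT 10
"""
for row in bigquery.Client().query(sql).result():
    print(row.underlying_symbol, row.trade_count)
```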
So let's move on and look at another visualization we had some fun with. So we have here a heat map. It's the same data source,
same underlying data. We've added the
dimension of the trading venues up at the top there. So you can see the various exchanges. The darker color
denotes higher volume. So we can easily look
across all the exchanges and see that SPDR's on the ISE exchange was the highest volume this morning. So that's fine and good. Let's look at something
even more complex. This psychedelic looking
thing is the zoomable sunburst visualization. I think that's pretty
arbitrarily named, but we can really drill
down and see even more data with this visualization. The inside wheel here is the
various highly traded options, underlying options. So we see SPDR here. We can drill in on
that, and then we see that middle wheel is
the various exchanges. So we can see that on
the ISE, SPDR's was at the highest volume there. And then we can
drill in even further and see that the majority
of the orders were puts. So that's great. It's a BI tool. It doesn't necessarily
require programmers. The next thing we
built was a custom app. And this is going to visualize
the data we just processed. So my web app is
connected up to that data that we just processed,
and it is built with D3 and also uses the
BigQuery data connector. So let's look at Google
stock for this morning. There we go. So we have a scatter plot here. These are all market events. These are my synthetic market events, but based on real pricing data
from this morning, from market open until about 11. The various colors are
different market event types. So these blue ones are
new order events. There's routing
events and a bunch of different kind of events. But let's look at one
of the filled events because they are the
interesting ones. They've reached the end
of their life cycle– or at least a
portion of them has. OK. So here is our D3
visualization of linkage. Neil walked us through
what linkage meant. This is a single order
and the life cycle it took on to be filled. So if we see the blue dot up
there, that was the new order. When I mouse over
at the top, you can see the specifics
of the order. So 458 shares. It was routed through a couple of different broker dealers. And then at this
point, the order was split into six
smaller orders. So they branch out. We can see that what started as
a few hundred shares is now 91. So we're breaking down the
order into smaller bits. Got routed again, and then
finally it was filled. And the other path
is it goes out, gets split a bunch of times,
and it's not yet filled. So those smaller
orders are still open. So again, this is
a D3 visualization using the BigQuery connector. It's all client side. It's deployed to App Engine,
but there's really no smarts on the app server side. It's all client side. So the final thing
I'm going to show you is the real Swiss army
knife of data processing. This is Google Cloud Datalab. It's a Python based tool. It's web based. It's built on IPython notebooks. I've hidden some of the code
here– the rendering code– but it's not too much. I'm calling it there at the top. You can see the BigQuery query
that we're issuing there. It's pretty simple. And then what this graph is
doing– the underlying data is the price data for
Bank of America on a day a couple months ago
back in November. So it's one day of price
data for Bank of America. The yellow line is
the moving average. You probably can barely
see the yellow line. But there's a moving
average line there. And the red dots
indicate when the price has gotten 3 and 1/2 standard deviations away from the moving average. So these are like anomalies in the price during the day.
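The gist of that notebook is easy to sketch with the BigQuery client and Pandas– the table, column names, and window size below are assumptions, not the code on screen.

```python
# Rough sketch of the moving-average anomaly idea; names are assumptions.
import pandas as pd
from google.cloud import bigquery

sql = """
SELECT trade_time, price
FROM `my-project.cat_demo.trades`
WHERE symbol = 'BAC'
ORDER BY trade_time
"""
df = bigquery.Client().query(sql).to_dataframe()

window = 500  # trades per rolling window (assumed)
df["moving_avg"] = df["price"].rolling(window).mean()
rolling_std = df["price"].rolling(window).std()

# Flag prices more than 3.5 standard deviations from the moving average.
anomalies = df[(df["price"] - df["moving_avg"]).abs() > 3.5 * rolling_std]
print(anomalies[["trade_time", "price"]])
```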
So this is interesting to traders and auditors and a bunch of people. So this is your
data scientist tool. People who can program,
people who know Python, you're able to use NumPy,
SciPy, Pandas– a lot of the real powerful analytics
tools that we have available. So that's three ways
of looking at our data. Back to Neil. NEIL PALMER: Thank you. So you kind of
get a feel there– tools for business users, tools
for day to day operations, and then tools for
your data science guys to do the exploration. Just a couple of lessons
learned from going through this process. We originally used MapR
when we first built it– native Hadoop– then migrated to Dataflow. We love Dataflow. The abstraction
level is fantastic. Cannot recommend highly
enough optimizing your code. One stray logging statement
at this sort of scale equals about a five hour delay. When you're developing
your pipelines, we found it helpful to
generate multiple data sets, static data sets,
that we could then re-run with both different configurations and different logic at the same time. Sometimes the
configurations that you need vary based on your
data load, data scale. So don't assume that a small number of large workers is better than a large number of small workers, or vice versa. It is very contingent. You will spend quite a bit of time playing around with those to figure out what actually suits your job.
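As a rough illustration, these are the kinds of worker-shape knobs you end up benchmarking on Dataflow; the values and option names here are examples, not FIS's actual settings.

```python
# Two worker shapes to benchmark against the same static data set.
# Values and option names are illustrative assumptions.
from apache_beam.options.pipeline_options import PipelineOptions

many_small_workers = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    num_workers=300,
    machine_type="n1-standard-4",
)

few_large_workers = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    num_workers=40,
    machine_type="n1-standard-32",
)
```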
And then really learn to tune the job to maximize the worker throughput. That, we found,
was very important. So where are we
going in the future? So the CAT bidder is supposed
to be chosen this year. Whoever wins then has one year
to get all the major exchanges on board, which is just
a silly short time frame. In year two, the major broker
dealers have to come on. And year three, everybody else. So three years to build and
deploy to the entire industry. After that, things could get
interesting– different asset classes, different
types of transactions could be brought in. Again, bigger, wider, more
holistic view of what's going on in the market. Potentially move
closer to real time. Real time for the SEC for
the market at the moment is about five days. That's what they
consider real time. We can– I mean, Dataflow
supports streaming out of the box. We can potentially
contract this whole process as much as the people
on the other end are able to handle that. And then we're
having discussions about opening up this. Right now, this is
a regulatory system. We think it's quite
important that the broker dealers be given access
to the data as well. We want the broker
dealers and the regulators to be having
conversations and looking at exactly the same data. The goal is to build a better
regulatory system for our stock market. And if you do that,
then the broker dealers could potentially shut off
all sorts of internal systems that they are now running
because they have to. And you come to sort of
a global source of truth. It's going to save everybody
money, time, and effort. And if you take that
concept a little bit further– I don't
do it with them very often because it scares
the living bejesus out of them– but if the data's anonymous
and if the data's secure, well, what would happen if you
opened up APIs onto that data and let the market
innovate around it? And so for us, it's exciting
because no matter who wins at this point,
all post trade data will be in the public cloud. So think about that. All the arguments that
have happened about– it's not secure this, it's
not secure that– done. The regulators would have put
all the data in the cloud. And so hopefully that will
bring and spur the same type of innovation that we've
seen in media and internet to financial services
and eventually to other regulated industries,
such as health care, energy, et cetera. So conclusion–
scalability and security. Scalability– no brainer. Using those technologies,
we [INAUDIBLE] go higher. Security– again, that was a
great conversation over the years. Google never had a breach. Lots of breaches at the banks. I mean, that kind of conversation
solved itself there. Economic cost– again, decreased
pricing knocked a huge amount of cost off the bid, very
valuable to everybody. And the elasticity,
because of the daily trading volumes, is
another key point. So the goal is to change
not just the where, but the how of
financial services computing and what can be done. This has never been done before. It's only recently it's
been technically viable. It's even more recently that
it's been economically viable. So we're pretty
excited about it. However, one more
thing, which I think is the phrase that gets used
in this part of the world. We ran a test with Carter's
team two weeks ago. We processed 25 billion market
events in an hour end to end. Well, technically
it was 50 minutes, but I didn't want to
be seen as showing off. That is 20% more than
the highest market volume day ever in the US markets. We created 25
billion FIX messages and ran exactly that
demo in 50 minutes. We used 3,500 Bigtable nodes
and 300 standard-32 VMs to do that, which
isn't that many when you consider those numbers. Sustained 22 million
events read a second. 60 million events written
a second sustained. Burst– this is
where it gets silly. It starts to get
silly at this point. 34 million events read a second. 22 million events
written a second. CARTER PAGE: So I just
want to jump in here and provide a little context. We see a lot of traditionally
large numbers at Google. And not to be cocky, but we often
kind of yawn at terabytes and things like that. These are actually
really big numbers and kind of made our team blush
being able to pull them off. To put it in context, imagine
34 million events per second is the equivalent of over
two billion events processed every minute. 22 million writes per second
is over a billion events every minute. And usually when you talk about
processing billions of events, you generally are
talking about– and this is in
big data context– you get a billion
events, usually saying I did a billion events
in a day or I did it over hours. Now we're saying that
we can do billions in just a single minute. These are the actual reads and
writes to the database itself. So this is to a persistent
storage system, which actually, even though the reads
number is bigger, from an engineering
standpoint, the write is actually what
we're prouder of. And there's actually
a lot more work that goes into that because each of those writes is durable. They're not cached. So to put it in
another context, if we were in the middle of
this process that's going– the task that we did, for 50 minutes– and we're writing at a rate of 22 million writes per second, and we simultaneously yank the plug on all the machines, every single one of those acknowledged writes has been preserved
and will be there when it comes back. And that's something we're
actually really proud of, and I think it's a pretty
extraordinarily large thing. [APPLAUSE] NEIL PALMER: So we're going to translate that into just sheer gigs per second, just because. So sustained, 22 gigs
reads per second, 13 gigs written per second. Burst, 34 gigs read, 18
gigs written per second. It was just smoking. And so it was an utter pleasure. We've had more fun than
we know what to do with, and we're really looking forward
to taking the system forward and seeing where we go
with it and also talking about where else we can
apply this technology. So I think that's it. We'll be around for
questions if you need us. Thank you very much. [APPLAUSE] All right.
