Lecture 54: Blockchain for Data Analytics – I (Blockchain for Big Data)



Welcome to the course on blockchain architecture, design and protocols. We are discussing different use cases as well as research topics on blockchain. Today we will touch upon another interesting use case of blockchain, and the research topics and aspects around it: the use of blockchain for data analytics, that is, how you can utilise blockchain technology for effective data analytics in various environments. We will take two lectures on this. In this first lecture we will discuss blockchain and its application to big data technology.

So, let us look into what is meant by big data. Big data is classified, or identified, by three V's, three parameters: volume, velocity and variety. The volume, the first parameter, denotes the data size, the amount of data that we have. Just think about a typical data centre, say the YouTube, Google or Amazon data centres; in any of these, the amount of data being generated is in the order of trillions of bytes. So, the question comes: how will you manage this huge amount of data? Another interesting example is social networking sites like Facebook. Just think of the amount of data generated from Facebook every day; it is in the order of a few hundred terabytes, if I recall correctly. So, managing this huge amount of data, and running applications on top of it, is a challenging task. The question comes: with this huge amount of data, how can you design effective mechanisms or algorithms to process the data?
The second perspective of big data technology is the velocity of the data, the speed at which the data changes. Again, think of the example of a social networking website like Facebook. The rate of change of the data is again very large. For example, the data generated from a personal profile, a personal Facebook profile, possibly changes every day. Another interesting example is video streaming websites like YouTube. Every day people are uploading data to YouTube, and the rate at which this data and information change is tremendous. So, if you want to do some analytics on top of this streaming media data, and you have run certain analytics today, you have to run the analytics again the next day, and the day after, to find a correct result or to predict the trend of change. So, this velocity of the data is the second important aspect.
The third important aspect, which is again crucial for applying blockchain to big data applications, is the variety of data. To see what variety of data means, just think of geospatial applications. Say you want to throw a geospatial query. To answer it, you possibly require data from the land registry, you possibly require imaging data from ISRO, and you possibly require the meteorological data, the weather data. So, different varieties of data are there all together, and your query needs to be executed on top of that variety of data. This poses multiple challenges for developing an application over a big data platform. The question comes not only in terms of query processing or analysing the data, but also in managing the data, sharing the data among multiple peers and then analysing it. How will you handle all these things in an effective way?

So, we will first look into the challenges of managing this kind of data on the traditional platforms that we use nowadays. We will identify certain problems there, and then we will look at how we can design solutions for those problems with the help of blockchain technology.
Let us look into the traditional way of processing big data. The volume of data started increasing after the Internet became global and everyone started having a personal computer, and then a smartphone, through which people generate data every day. In the early-to-mid 2000s, people started thinking about what type of technology or platform would be required to handle big data. Multiple technologies were developed, such as ZooKeeper at Yahoo, the Bigtable and MapReduce platforms at Google, and then Cassandra at Facebook. All of these technologies were developed for building applications and analytical methods over big data platforms, or methodologies to effectively analyse data in a time-sensitive way. If you want to process petabytes of data, then time is a huge constraint: if you just run a single processing machine with some fixed amount of RAM and processor power, it could take even millions of years to process your data. That is why people explored technologies for parallel processing, through MPI-based techniques or other techniques by which you can optimise processing over big data.
Then came different open-source projects like the Hadoop Distributed File System (HDFS), which is there for storing big data on a distributed platform. Storing big data is again a challenge: given this huge amount of data, how will you effectively store it on a platform, and how will you access it? We require some kind of specialised file system to support distributed storage of data, because a central storage or a single storage is not sufficient to hold all of the available data. That is why people designed this kind of distributed storage architecture, where data is stored across multiple storage platforms through network-attached storage (NAS) or storage area networks (SAN), different kinds of distributed storage platforms. The Hadoop Distributed File System then helps in efficient access to the data over this kind of distributed platform.
Then came Hadoop MapReduce, which is again an open-source project. Hadoop MapReduce helps in executing queries effectively on an HDFS-style platform by optimising the query processing time: combining the data, clustering it into multiple groups, and then processing the query only on the target groups of interest to the end user. In that way, we have witnessed many such technologies for processing big data.
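To make the MapReduce idea concrete, here is a minimal toy sketch of the pattern in plain Python: the map phase emits key-value pairs, the shuffle phase groups them by key, and the reduce phase aggregates each group, which is what makes the work parallelisable across machines. This only illustrates the pattern that Hadoop implements at scale; it is not Hadoop itself.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (key, value) pair for every word occurrence.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group independently; each group
    # could be processed on a different worker node.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big infrastructure",
        "data velocity and data variety"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'big': 2, 'data': 3, 'needs': 1, ...}
```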
But then the question comes: what are the challenges of handling big data on a large-scale platform? The first challenge we worry about is who will control the infrastructure when there are multiple actors involved. Just think of the example of geospatial data that I mentioned. With geospatial data, if you want to execute a certain geospatial query, the data is shared among multiple government and private agencies. There are multiple stakeholders who hold data of different types, and all of those different types of data are required to process a geospatial query. Say, for example, you want to find out, for a specific location, what the possibility is of having a flood in Mumbai during April 2018. To run this kind of query over a big data platform, you require multiple kinds of data. First of all, you require the meteorological data, the weather data; you require the land-usage data; and you possibly require data from the marine department, the sea level and so on. So, you require data from multiple such sources, and then you execute the query on top of that data to find the result.

Now, interestingly, these different types of data are in the hands of different agencies, different stakeholders. For example, the weather data comes from the meteorological department, which maintains its own set of data; the marine department maintains the sea-level data; the land registry contains the land registration data; and the municipality contains the data on water disposal and similar things. In that way, multiple stakeholders hold different pieces of data, and for effective processing you require the data from multiple stakeholders. So, the question comes: with this kind of geospatial data, where the data is shared among multiple government and private agencies, say ISRO, DRDO, the meteorological department, the land registry department and so on, who will take charge of the data? Say I am taking data from the meteorological department and trying to combine it with some imaging data from ISRO; if there is a data loss, or some kind of fraudulent behaviour on my side, then who will take charge of that? Is it because of some problem, some security loophole, in the data-sharing platform of the meteorological department, or some loophole in the data-sharing platform of ISRO?
So, that is a major challenge. The second issue is: if you have multiple copies of data at different locations, how will you know which one is the most updated? For example, from the ISRO imaging data you can get information about land usage; from the land registry records you can also get information about land usage. How will you know which data is the most trustworthy, or the most recent? That is the challenge in developing a data-sharing platform or data-sharing infrastructure: who will have control over the infrastructure when there are multiple such stakeholders?
The second problem we have is: how can you trust the data? Whatever data you are getting, is it trustworthy or not? You may have generated the data yourself; how will you prove that you are the originator of the data, or that you are an authorised person to share it? Or, if you are getting certain data from some agency or stakeholder, how will you verify whether they are authorised to share the data with you or not? These are important questions; otherwise there can be serious copyright violations, and you can land in big trouble. To handle this problem, you have to think of a mechanism to maintain this kind of cooperative relationship among multiple stakeholders, multiple actors.
Another problem is how you will handle crashes and malicious behaviour during data transfer from the source. Say you are getting the data from ISRO; it comes over the network. If some malicious behaviour happens, or someone makes changes to the data, how will you establish whether those changes were made by ISRO or by some malicious agent? It may also happen that ISRO has shared the data with another agency and authorised that agency to share it further with you. When you receive the data, how will you verify that the intermediate agency sharing the ISRO data with you, perhaps with ISRO's consent, has not made any modification to it? So, the trustworthiness of the data is a major requirement that we need to ensure.
The third requirement is: how will you monetize the data? The question comes down to how you transfer the rights to the data. For example, whenever you share your data with a third party, how will you transfer the rights to that third party? How will you ensure that the third party does not share the data with anyone else, or shares it only under certain control policies? That is one important question. Another question currently floating around the big data community is: can we develop a universal data marketplace, treating data like electricity or the Internet? Everyone can buy electricity or Internet service and then participate; data can also be a kind of service, because of the huge amounts of data we are generating, and if we get a certain share of the data we can develop lots of applications and possibly utilise it for building a better nation. So, the question comes: how will you operate this kind of universal data marketplace without fraudulent or malicious behaviour in the environment?
Well, let us look into certain use cases and see how we can effectively solve these kinds of problems. The first use case we are going to discuss is shared control of big data infrastructure. Think of a blockchain database, rather than the traditional database that we have. If you have a blockchain database, that means you have multiple copies, multiple infrastructural components, owned by, say, different zonal offices, and whatever access is made on this infrastructure is logged inside the blockchain. That is what we call a blockchain database. The control of the database infrastructure is shared across entities; the entities can be within an enterprise, within a consortium, or spread across the planet. For instance, there can be a consortium where multiple private companies join together and agree to share data among themselves. They have their own infrastructure, and those infrastructures need to be connected in a trusted way. We have an example of this kind of blockchain database, called BigchainDB, which we will look into in a little detail. The advantage here is that the infrastructure can be spread across different locations while the properties of a database, like integrity and consistency, are still ensured with the help of the blockchain. The blockchain ensures that whatever accesses are made on top of the database, whatever transactions are made, are consistent and tamper-proof. If you access the data through the blockchain, everyone will be able to validate who has accessed the database, who has accessed a particular data store, and what type of data has been accessed. Let us see a specific instance of this use case.
Each regional office, with its own sysadmin, controls one node of the overall database. So now we are distributing the database across multiple places: the entire database is controlled collectively, but every regional office is under its own sysadmin's control. In this architecture the data is still protected even if one or two sysadmins go rogue or a regional office is hacked, because everything goes through the blockchain, so a hacked regional office will not be able to access the infrastructure at the other regional offices.
Let us look into a second use case: audit trails on data. Consider a data pipeline where data is generated and finally consumed. The data is generated from IoT sensors; then you have something like Kinesis or Event Hub with stream analytics, which takes the data from the IoT sensors and performs some initial streaming analysis on it; then the data is put into HDFS storage. Next you apply a Spark-based data-cleaning mechanism on the IoT sensor data, followed by certain normalisations, Spark normalisation, on the data. Finally, you transfer the data to MongoDB, a specific database storage, and run Tableau analytics and queries on top of that data. Adding audit trails to this data pipeline means you should have information about how a particular piece of data has passed through the individual pipeline stages, who performed each step, and at what time the data moved from one stage of the pipeline to the next. We can again solve this with the help of a blockchain.
Think of a solution like this: before each data pipeline step starts, you timestamp the input data as follows. You create a transaction containing a hash of the data and the corresponding metadata, or any other information you want to add to the transaction. Then you cryptographically sign the transaction, and finally you write it to a blockchain database. Whenever you write the transaction to a blockchain database, it automatically timestamps the transaction. So, once the transaction is written into the blockchain database, you have immutable evidence that you had access to the data at that point in time.
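As a minimal sketch of this timestamping step, assuming Python with the `cryptography` package for Ed25519 signatures; the commented-out write at the end is a hypothetical stand-in for whichever blockchain database client you actually use:

```python
import hashlib
import json
import time

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def timestamp_pipeline_step(data: bytes, step: str,
                            signing_key: Ed25519PrivateKey) -> dict:
    # 1. Hash the input data of this pipeline step, plus its metadata.
    payload = {
        "data_hash": hashlib.sha256(data).hexdigest(),
        "step": step,                # e.g. "spark-cleaning"
        "timestamp": time.time(),    # the blockchain DB adds its own, too
    }
    # 2. Cryptographically sign the transaction payload.
    message = json.dumps(payload, sort_keys=True).encode()
    signature = signing_key.sign(message)
    # 3. Write the transaction to the blockchain database
    #    (hypothetical helper; replace with your client's call).
    tx = {"payload": payload, "signature": signature.hex()}
    # write_to_blockchain_db(tx)
    return tx

key = Ed25519PrivateKey.generate()
tx = timestamp_pipeline_step(b"raw IoT sensor readings",
                             "stream-analytics", key)
```

The hash ties the transaction to the exact bytes of the data, while the signature ties it to you.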
In this way you establish that you got the data from the IoT sensors, that at a certain point in time you ran the stream analytics, that at the next point in time you put the data into HDFS storage, and so on. Every event that happens on top of the data is logged in the form of a transaction inside the blockchain; with the blockchain, these records are tamper-proof, and you can also perform verification. This can actually answer a few of the questions we asked before.
First of all, how will you prove that you are the originator of the data? That was one of our original questions. To prove that you are the originator, your data is cryptographically signed by you and stored in the immutable database. Anyone can then use your public key to verify that you are the originator of the data, because that information is already there in the blockchain and the blockchain is tamper-proof. The second question was about the crashes and malicious behaviour we discussed. If there is some kind of crash or malicious behaviour, what you can do is re-hash whatever data you receive from the data source and check it against the information recorded in the blockchain. If there is a match, it means you got the correct copy of the data and the validation succeeds; if there is a mismatch, something is wrong somewhere: the data you received is possibly not the correct copy, or someone has tried to forge it.
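A matching verification sketch, following the same hypothetical transaction layout as above: re-hash the received data, compare it with the hash recorded on chain, and check the originator's signature against their public key.

```python
import hashlib
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_received_data(data: bytes, tx: dict,
                         originator_key: Ed25519PublicKey) -> bool:
    # Re-hash the received data and compare with the on-chain record.
    if hashlib.sha256(data).hexdigest() != tx["payload"]["data_hash"]:
        return False  # mismatch: wrong or forged copy of the data
    # Verify the originator's signature using their public key.
    message = json.dumps(tx["payload"], sort_keys=True).encode()
    try:
        originator_key.verify(bytes.fromhex(tx["signature"]), message)
        return True   # match: correct, untampered copy
    except InvalidSignature:
        return False  # payload was not signed by the claimed originator

# Usage with the earlier sketch:
# verify_received_data(received_bytes, tx, key.public_key())
```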
With these two use cases in mind, let us look into a practical blockchain database, an open-source blockchain database which is in place and gaining popularity for big data applications. It is called BigchainDB. I suggest all of you explore BigchainDB in more detail: analyse its content and its code, and see how you can write applications on top of BigchainDB. BigchainDB has the following features.
First, decentralisation: you do not have any single point of control, and hence there is no single point of failure. Second, querying: you can write and run any MongoDB query over the database; it supports the MongoDB query format. The third property is immutability: once data is stored in the database, it cannot be changed or deleted; the data is immutable. Fourth, it is Byzantine fault tolerant, which is an important property of BigchainDB; it provides this Byzantine fault tolerance to support the decentralisation of the infrastructure. Based on the BFT principle, as we discussed earlier, it can tolerate up to one third of the nodes in the network experiencing arbitrary failures: even if at most one third of the nodes fail arbitrarily, the system will still be able to recover and give you the correct result. The fifth feature is low latency: transactions finalise fast because of the Byzantine fault tolerant consensus algorithm used. The next one is customisability: you can design your own private network by taking some infrastructure, installing BigchainDB on multiple machines, and connecting them to each other. Finally, it has rich permissioning support, so you can set permissions at the transaction level: you can specify that a particular transaction should be accessible by this person or this group of persons and no one else in the universe. So, it provides good support for access control lists and access-control mechanisms.
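To give a flavour of how these features look in practice, here is a short sketch using the BigchainDB Python driver (`bigchaindb_driver`) to create and commit an asset, following its documented CREATE-transaction flow; the node URL and asset contents are illustrative placeholders:

```python
from bigchaindb_driver import BigchainDB
from bigchaindb_driver.crypto import generate_keypair

# Connect to a BigchainDB node (placeholder URL).
bdb = BigchainDB("https://bigchaindb.example.com:9984")

# Each participant holds an Ed25519 keypair.
owner = generate_keypair()

# The asset is immutable once committed; metadata can vary per transaction.
asset = {"data": {"sensor_id": "iot-temp-42"}}
metadata = {"reading_c": 21.5}

# Prepare, sign (fulfil) and commit a CREATE transaction.
prepared = bdb.transactions.prepare(
    operation="CREATE", signers=owner.public_key,
    asset=asset, metadata=metadata)
fulfilled = bdb.transactions.fulfill(
    prepared, private_keys=owner.private_key)
bdb.transactions.send_commit(fulfilled)
```

Ownership can later move via TRANSFER transactions, and the per-output conditions on a transaction are what provide the transaction-level permissions mentioned above.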
As I mentioned, BigchainDB provides a decentralized ecosystem for big data applications. You can write your standard big data processing applications using MongoDB; then you have this platform, which is again a decentralized platform, where you can have multiple data servers on which BigchainDB is installed and which are connected with each other. Just like a peer-to-peer architecture, it supports decentralized processing of the business logic: whenever you throw a query, it takes care of finding which particular data source holds the data, accesses it, and performs the query. At the lower end of the stack, you have the decentralized file system to store files in a decentralized environment, together with the BigchainDB blockchain database that we are talking about. Here is a comparison between a typical blockchain architecture and a typical distributed database architecture.
Interestingly, BigchainDB combines the properties of a blockchain with those of a distributed database. A blockchain supports decentralization, Byzantine fault tolerance, immutability and owner-controlled assets, whereas a typical distributed database supports a high transaction rate, low latency, and indexing and querying of unstructured data. BigchainDB actually combines all these features together. It takes the power of blockchain technology to build a decentralized platform with Byzantine fault tolerance, an immutable architecture, owner-controlled assets and permissioned access control, and it combines this with the features of a typical big data distributed database, like a high transaction rate, low latency, and indexing and querying over big data through MapReduce-style processing,
as supported by MongoDB-like databases, all without requiring a centrally controlled platform. So, that is a brief description of BigchainDB.
Here is some further reading that you can explore. There is a whitepaper from BigchainDB; the link is given here. I suggest all of you explore it and find out the different properties of BigchainDB, and how BigchainDB, at the code level, actually combines the features of a database with those of blockchain technology. You can also try BigchainDB yourself: you can go to their developer platform at this particular link (all of these things are under the bigchaindb.com website). You can install it, create a local cluster of a few machines, and start implementing your own database and running your own queries. So, just try it out; hopefully it will be enjoyable for you. Thank you all for attending the class. In the next class we will come back with another application, with certain research aspects of blockchain for data science and data processing applications. Thank you all.
