EESTech Challenge Online Seminar – Part 1 – Introduction to Big Data Analytics

Hello and good evening to everyone. I am Giorgos Daniil from LC Patras, and I am the online seminar coordinator of the EESTech Challenge. Let me start with the slides in order to tell you what the EESTech Challenge is and what we are doing here. On behalf of the EESTech Challenge board and the online seminars team, I would like to welcome you to the first online seminar. I will be the host, and Dr. Konstantinos Pelechrinis, live from Pittsburgh, Pennsylvania, will be the lecturer today; he will introduce himself later.

This seminar is held in the context of the EESTech Challenge, and as you may have heard, EESTEC is organizing this challenge. But what is EESTEC? EESTEC is a volunteering European association that envisions all electrical engineering and computer science students reaching their full potential in their academic, professional, and social lives. And what is the EESTech Challenge? The EESTech Challenge is a competition that aims to create opportunities for European students to gain knowledge in the fields of electrical engineering and computer science and to develop a professional network. The focus and topic of the competition will be chosen annually from the electrical engineering and computer science field; this year's topic, as you know, is Big Data.

The competition consists of local rounds and one final round. The local rounds will be implemented during March, and the 24 local committees that will organize them are, as you can see on the slide: Ankara, Antwerp, Athens, Aveiro, Belgrade, Catania, Chemnitz, Craiova, Delft, Dublin, East Sarajevo, Kyiv, Krakow, Ljubljana, Milano, Novi Sad, Patras, Sarajevo, Tampere, Tirana, Trieste, Tuzla, Xanthi, and Zurich. The winners of the local rounds will pass to the final in May, which will be a live round: the winning teams of the local rounds will compete on additional tasks given by the company sponsoring this year's challenge. The final round will be held in Serbia.

The competition is addressed to students aged 18 to 26 who are able to reach one of the cities in their country that host a local round, and they can participate in teams of three. The online seminars are introduced as a free educational tool that will be used to support our goals and help students who have great ideas, motivation, and interest in the field but don't know how to get started. We are going to provide three online seminars: this one, one in January, and one in February. During the next seminars we are going to have a presentation of the tools that are going to be used in the local tasks, like Hadoop and MapReduce, and in the final seminar we may present some good practices on how to solve the final task. To stay tuned, check our channels of communication, and feel free to email us in case you have further questions. Dr. Pelechrinis, you may start.

Thank you for the introduction; I am very glad to be here talking on this topic. So, as Giorgos said, we are going to be talking about big data analytics, and today is the first seminar on big data. Let me get started. The title of my talk is "Introduction to Big Data Analytics," and we are going to see what big data is, what the challenges are, and also some big failures that we have seen — so not only the positive things but also some things you need to be cautious about. If you have any questions after the presentation, you can also email me.

Just a bit of background on me, so you know who I am: I am from Greece. I did my undergraduate studies at the National Technical University of Athens, where actually, back in 2004, we started the Local Committee of Athens, so I am one of the founding members; EESTEC is very close to me, and I am glad to be back so many years later to give this presentation. Once I graduated in 2006, I moved to the US. I did my PhD at the University of California in computer science, and then in 2010 I joined the University of Pittsburgh, where I still am.
Today's presentation is going to introduce Big Data, and here are the objectives and the roadmap I will follow. We will basically see what Big Data is and why we care about Big Data now, and we will see what you might have heard of as the three V's — volume, variety, and velocity — and what they exactly mean. We will also talk about the challenges associated with Big Data, such as storage, processing, and analysis, and the vast majority of this presentation will deal with applications of Big Data: why Big Data is important, but also, I think equally important, why you should be cautious when using big data — what big failures have already been observed in the use of big data, what can go wrong, and what we can do about it.

This slide has the time series of the search volume for the term "big data analytics" from Google Trends. As you can see, ten years back nobody was looking for big data, but after 2010–2011 there has been a steady increase in people's interest in big data analytics. Obviously, if you search for similar keywords you will see a similar trend. So why this increase? There are many reasons. The major one is the success of early adopters: when people started using big data, before "big data" became a thing, those who used it had a lot of success, so others want to replicate that. Nowadays the technology is also mature, so even though there are a lot of challenges, there is mature technology that can actually help solve some of them. There is also an increased propensity to collect a variety of data — the idea being, collect as much data as you can and you will find some use for it later. So there is an increasing interest in big data analytics, and this is not going to go away.

Here is an expression that has become very popular over the last couple of years, because everyone wants to know what Big Data is, and when you ask someone, they give you a vague answer — if you ask ten people what Big Data is, you will most probably get ten different answers. Dan Ariely from Duke University put it very nicely. Big Data, he said, is like teenage sex: everyone talks about it, nobody really knows how to do it, but everyone thinks that everyone else is doing it, so everyone claims that they are doing it. And this is exactly what happens with big data: everyone talks about Big Data, nobody really knows what it is or what to do with it, and everyone believes the other companies do it, so they claim they are doing big data too. I think that is a very clever saying that captures the reality of Big Data today.

So — sorry, I skipped a slide, it is out of order, sorry about that. Why do we care about Big Data? Because there are three major things Big Data can accomplish for an organization, and these have to do with cost reduction, decision-making, and new products and services. Let's start with cost reduction: Big Data technologies such as Hadoop, MapReduce, and cloud-based analytics can bring significant cost advantages when it comes to storing large amounts of data. You are able to store a significant amount of information that you can analyze later to find more efficient ways of doing business, and more efficient ways of doing business essentially mean cost reduction. Big Data can also help you make faster and better decisions: if you combine the ability to perform fast analysis with technologies such as Hadoop, you can analyze new sources of data, and businesses can analyze information in near real time and make decisions based on what they have learned from this data. Third, Big Data can also help release
new products and services. Companies can gauge customer needs and satisfaction based on analytics of the data they collect from customers, so they can passively understand what customers like and create new products that meet those needs.

So — as I said, I skipped ahead with my slides — the question is, what is Big Data, really? As I said, Big Data sounds more like a buzzword, and people don't really know exactly what it is, so here we will attempt to answer this question as best we can. Obviously it is not a foolproof definition, but it is a start. One way I like to define Big Data is by defining what Big Data is not: you can imagine Big Data as data that does not fit on a single machine. You cannot take this data and store it on your desktop, your laptop, or even a single server, and you need more specialized tools in order to analyze it. However, one thing you need to keep in mind is that volume is not the only defining characteristic of Big Data. Volume is obviously what comes to mind first — gigabytes, terabytes, petabytes of data — but Big Data doesn't have to do only with size: the data is being generated at much higher speed, even near real time, compared to traditional data, and it is heterogeneous. There is variety: you get data from sensors, you get data from mobile devices, and so on.

So here are the three V's we talked about: volume, velocity, and variety. When we talk about data volume, we talk about the size of the available data, and the size of the available data has been growing at an ever-increasing rate. This applies both to companies that collect data about their customers and to individual people's data: everywhere you visit online, you leave behind your online breadcrumbs, and this increases the amount of information that exists about you. A text file can be a few kilobytes, a sound file a few megabytes, and movies are easily in the order of gigabytes. More sources of data are also added continually. What we mean here is that in the old days, companies generated data internally, through their employees, but currently data is not generated only manually by employees: you also collect it from your customers, and machines such as sensors generate data. Smartphones, for example, generate a variety of information for the network infrastructure and the network operator; this data didn't exist 10 or 20 years ago. We also have more sources of data that we need to combine and analyze than before, and this is a major issue for those trying to put the data to use: you no longer have to care about only one source of data, you need to care about several sources that can give you different angles on the same problem. The days when you will have exabytes of data are not far away, and indeed in some fields this is a reality right now.

This creates issues with regard to how you analyze and process the data. In the past, when you had small data — data that could fit on one server — you could analyze it with a batch process: you take a chunk of data, you submit the job to the server, and you wait for the result. This works when you have a slow incoming data rate, data that you collect offline. But as we said, one of the characteristics of big data is velocity: you can have data generated at a very high rate that you need to process without delay. Such sources can be, for example, social and mobile applications, where data is generated very fast, and in this case batch processing breaks down.
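The break-down the speaker describes — batch jobs versus continuous processing — can be illustrated with a minimal single-machine sketch in plain Python. The event names here are invented for illustration, and a real system would use a streaming framework rather than a simple loop; this only shows the shape of the two computations:

```python
from collections import Counter

def batch_count(events):
    """Batch model: wait until the whole dataset is available,
    then run one job over it and return a single answer."""
    return Counter(events)

def stream_count(event_source):
    """Streaming model: update a running result as each event
    arrives, so a current answer exists at every moment."""
    counts = Counter()
    for event in event_source:
        counts[event] += 1
        yield dict(counts)  # snapshot after every event

# A tiny, made-up event stream.
events = ["click", "view", "click", "purchase", "click"]

final_batch = batch_count(events)
snapshots = list(stream_count(events))

# Both models agree at the end; only streaming had intermediate answers.
assert dict(final_batch) == snapshots[-1]
```

The batch function cannot say anything until the input ends, while the streaming version always has an up-to-date count — which is exactly what high-velocity sources require.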
So you need some sort of real-time, distributed stream processing. Now, data variety is the third V. In the past, twenty years ago, data was basically an Excel table or, in its more advanced form, a database, but these are not enough anymore: the data is not necessarily relational. You can have data ranging from pure text, which is unstructured data, to photos, images, audio, video, GPS data, and sensor data, and of course relational databases as well. You cannot just store all of this in a table or a SQL database, so you need specialized infrastructure to be able to deal with this heterogeneity — and as new applications are introduced, new data formats will keep coming. This is not going to end; you will keep getting more and more variety.

The Big Data landscape is also growing: there is an increasing number of companies trying to help organizations and enterprises deal with these challenges. You have traditional players such as Oracle, IBM, and SAP working in the area; you have conglomerates like Amazon offering big data infrastructure as a service, for example with Amazon Web Services; and startups are trying to get a piece of the pie by specializing in specific niches of the area, such as visualization with Tableau, or data as a service — for example Gnip, which you can imagine as a data broker that collects data from social media and other sources and then resells it to companies that want to make use of it, or Inrix, which is a company that collects a massive amount of transportation data and then sells it to Departments of Transportation and other local government organizations that want access to it. So the Big Data landscape is a vast landscape, and you can find companies that deal with technologies like Hadoop, companies that provide business intelligence — traditional ones like IBM, Microsoft, and Oracle — big conglomerates like Amazon and Google that provide infrastructure as a service, and startups specialized in specific niches.

The challenges associated with Big Data are several, and clearly storage is one of them — the first one someone has to worry about, because after all, you cannot have analytics if you cannot store the data. As we said, big data cannot fit on a single machine (that is one way to define big data), therefore what we need is distributed technologies that can store the data over several physical devices. One problem this causes, though, is that we need to consider things like network bottlenecks: if you have the data on separate physical devices, you will need to transfer data from device to device, and this can create a vast networking bottleneck. The good thing is that companies nowadays do not really need to worry a lot about this infrastructure, because big companies like Microsoft and Amazon provide it as a service to everyone who wants to use big data: if you want to store a large amount of data and then process it, you can buy this service from Amazon, Google, Microsoft, or any other company that offers infrastructure as a service. So storage itself is not a big challenge from the perspective of an enterprise that wants to use big data, but it obviously is a big challenge from the perspective of the service providers, like Amazon, Google, and Microsoft, that offer it.

Once you have solved the problem of storing the data, the next challenge you face is processing and analyzing it. In the general case, as we said, you have a variety of data, and you need a new paradigm for analyzing all of this data, since simple relational SQL databases might not be appropriate for storing the data given this variety of formats.
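The new paradigm the speaker is leading up to is MapReduce, which he walks through next with a word-count example. As a taste, the whole map → shuffle → reduce pipeline can be sketched on a single machine in plain Python, using the same input sentence that appears on the slide. In real MapReduce the workers run on separate machines; here they are just function calls:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) key-value pair for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: route all pairs with the same key to the same reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

# The input file from the slide, split into three chunks (one per "worker").
chunks = ["deer bear river", "car car river", "deer car bear"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())

print(counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

Each map call is independent of the others, which is what makes the work distributable; the shuffle step is the only point where data moves between workers.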
Distributed file systems are needed, and there are actually new technologies that can be used to analyze and process the data, such as Hadoop and MapReduce, and one can even move beyond CPU processing to graphical processing units (GPUs) and, of course, supercomputing technologies. Now, the focus of this presentation is not the technology itself — you are not going to learn a lot about MapReduce and Hadoop from this presentation; my understanding is that the second and third seminars of this series will go deeper into these topics — but let's very quickly, in a few slides, try to understand the MapReduce paradigm. We are not going to learn how to program MapReduce, but let's see what the paradigm is, in order to see the difference from traditional programming.

This figure is just an illustrative example of how MapReduce works, which is essentially a divide-and-conquer strategy. A complicated task that you have to solve, you break down into smaller tasks, and these tasks are solved individually by several workers. This breaking down is what we call the mapping phase; then every worker solves its piece of the problem, and the pieces are combined together into the final result. This will become clearer with the example we will see, but the main idea of MapReduce is essentially to divide the problem, solve the smaller problems, and combine the solutions. The two phases, in the terminology of MapReduce, are obviously map and reduce: the map phase takes the input data and essentially creates a hash table — a dictionary, key-value pairs — whereas the reduce phase combines and aggregates these individual dictionaries and provides the final answer.

For example, let's assume that we have the task of finding the frequency of words in a big file. The simplest way, if we don't want to use MapReduce, is to scan the file sequentially, count the appearances of each word, and at the end, when you have scanned the whole file, you have the answer. This obviously will give you the answer, but it will take forever, especially if the file is huge. The way MapReduce would solve this problem is that you split the file into smaller chunks, these chunks are assigned to different workers, and each worker takes its chunk and creates the mapping, which is basically dictionaries. For example, say we have this input file: "deer bear river car car river deer car bear". What we do is split it into three parts and assign them to three workers. The first worker gets one part of the file, which is "deer bear river". What the mapping phase does is create key-value pairs: in this case, the first worker essentially creates three pairs — "deer", where the key is "deer" and the value is one, because there was one occurrence; then the word "bear" creates another key-value pair, where the key is "bear" and the value is one, one appearance of the word "bear"; and so on. Then, once the workers have finished this mapping, there is a shuffling process, in which the key-value pairs with the same key are all assigned to a specific server: during the shuffling, all the key-value pairs with the word "bear" go to the first processor, all the key-value pairs with key equal to "car" go to the second one, and so on. Then you aggregate — reduce — these into specific dictionaries, and at the end the final result is that the word "bear" appeared twice, the word "car" appeared three times, the word "deer" appeared twice, and the word "river" appeared twice.

Obviously, this might not look like it provides a very big advantage over sequential processing, but I would urge you to go to this web page on GitHub and try the mincemeat.py library, which is essentially an emulation of MapReduce in Python. What this library does is use the cores of your machine as individual workers, so actual MapReduce code is run, emulated, on your machine. If you try the word-counting example using this MapReduce emulation, you will see a tremendous speed-up compared to traditional sequential processing of the file. It is a nice tool to play around with and get familiar with MapReduce.

Now, an interesting thing to keep in mind is that computer scientists have always been trying to find ways to store, retrieve, and process data more efficiently, so the objective itself of Big Data is not something new. What is new is the nature of the data: we don't have only one type of data, relational data; we have vast amounts of data of different types and velocities. Computer scientists have always cared about scalability — think Big-O notation — and have always wanted efficient ways to store and retrieve data. What is new is that the techniques we used to have are no longer good enough, because we don't only have a large amount of data, we also have a large variety and high speeds of data.

So now that we know a little bit about the challenges, some of the technological paradigms for big data, and ways to address them, let's move to the main part of this presentation, which is applications of big data — why big data has become an integral part of technology companies. Big Data has seen tremendous use and has improved
operations in a variety of fields, and today we will see applications in retail, healthcare, transportation, and sports.

Let's start with retail. Retail has been using data for many, many years — it is not something new — but big data has facilitated several operations and services in this domain during the past decade. For example, customer behavior and sentiment can be determined using big data analytics, and this data can come from interactions of the customer within the store, through direct mail, and through other marketing channels like social media: you can find the sentiment of your customers through social media and correlate it with transaction data, online browsing behavior, in-store shopping trends, product preferences, and much more. You can also incorporate external data streams — social media, traffic — and assess customer sentiment and behavior. Some of this data is raw clickstream data, not relational transactional data, and the insight you get can be valuable for setting inventory and pricing strategies.

Another important thing is measuring brand sentiment. Brand studies traditionally use focus groups and customer-polling techniques, which can be expensive and often inaccurate, because you can have issues with sampling and with identifying the appropriate people to survey. With big data analytics, you can do brand sentiment analysis on behavioral traces from things like Pinterest, Twitter, and Facebook. The results are still biased — in terms of who has access to these services and who uses them — but they are less biased, and they can be used to guide product development, advertising, and marketing programs, because these users can represent a large part of your customer base.

You can also customize promotions: big data analytics can be used to create custom offers based on the browsing history you have and on what the customer bought from the physical retail store. You can customize promotions for localized marketing, push coupons and offers to smartphones based on location, and make real-time offers via social media; I will show you some very interesting examples of that in a while. You can improve the store layout: big data can be used to analyze how customers move through the space inside your store — sensor data, RFID, QR codes, or Bluetooth tracking store traffic and shopping habits — and this can help you understand which products to place near each other. Other technologies are emerging for in-store mapping and for applications that serve instant coupons, and in general you can learn the store flow. You can optimize your e-commerce, if you have an online presence, which most retailers do today: clickstream data and monitoring of online behavior can help you optimize the e-commerce website. Without big data technology, the sheer volume of clicks and data would be difficult to analyze; if you are not able to collect context information and click-through data, it will be difficult to do any sort of analysis, whereas if you have this information, you can incorporate other metrics, such as social media and purchase history, and improve your e-commerce website overall. You can also do order management: big data can be very valuable for inventory management and tracking. For example, big data can help forecast inventory needs in order to facilitate real-time delivery — a big challenge today for companies like Amazon, which want very fast, almost real-time delivery — and it can be used to automate order processing, for example for out-of-stock goods.

All of these are important to retail companies, and important enough, actually, to build research laboratories such as the ones that Google,
Microsoft, and Facebook have. It might sound strange, but Target and Walmart both have significant research labs that hire CS and engineering PhD graduates to help them solve all of these problems, and this is a very good indication of how valuable big data is to them.

So let's see some specific examples. We talked about Target having a research lab, and Target actually has one of the most well-known big data success stories. One of the stories from Target that made big headlines is that they were able to correlate the baby shower registry with their guest ID program, and the retailer was able to identify other products that pregnant women were most likely to purchase and then customize promotions and coupons for these products. This program tracks data such as purchase history, returns, web visits, whether you have contacted customer support, and other web clicks, so Target was able to target pregnant women with special offers, such as coupons for prenatal vitamins, moisturizing lotion, and so on. The same strategy is being applied to other shoppers using age, education, marital status, and more, so they are able to correlate large amounts of data and offer personalized coupons.

Another interesting story comes from a pizza franchise chain that offers customers a mobile phone app for mobile marketing and delivery of special offers. What this chain, Papa Murphy's, did was mash up user data with local weather conditions, and the pizza maker was able to offer special delivery coupons to customers who were unable to cook because of power outages during a storm. This mobile program actually yielded a twenty percent response rate, which is very, very big. There was also a study performed by GrubHub, a startup that facilitates food home delivery: they found that when there is a storm, pizza deliveries increase, and not only that, the amount people tip the delivery person increases as well. Now, Kohl's, which is a big-box retailer in the US, is testing real-time offers in its stores for shoppers who opt in to offers delivered via smartphone. If you allow Kohl's to send you coupons through your smartphone, you can get more personalized coupons: for example, if shoppers visit the shoe department, the app can correlate the shopping and browsing history they had, either in store or through the website, and deliver coupons for shoes that these customers might like.

The healthcare industry is another big industry that has relied on big data during the past decade, and maybe more than any other industry it is on the brink of a major transformation through the use of advanced analytics and big data technologies. There are four major trends in healthcare nowadays. The first one is healthcare IoT, where IoT stands for the Internet of Things. There is a rapid increase in the number of smart, connected devices, such as sensors, that generate and exchange data about people and the environment; spending on healthcare IoT is about 120 billion dollars by some estimates. Most of this data is created in an unstructured way, and you need a big data technology such as Hadoop in order to analyze it. What do these devices provide? There is a variety of monitoring devices that monitor every sort of patient measurement: glucose levels, fetal monitors, electrocardiograms, blood pressure. All of these devices generate data at a high rate. Many of these measurements require a follow-up visit with a physician — you take your measurements, your blood pressure, your glucose level, and many of these will require follow-up visits — but with smarter monitoring, the devices can communicate with other devices and greatly refine this process. Basically, you can learn whether other patients with similar measurements and similar characteristics to yours actually had a follow-up visit and what that visit concluded, so you can perhaps replace these follow-up visits with a phone call from a nurse, and this can significantly reduce the cost of healthcare, not only for the individual but for the country as a whole. Other smart devices detect whether medicines are being taken regularly at home; if not, they can initiate a call to the healthcare provider to make sure that patients are properly medicated. So the possibility offered by healthcare IoT is basically to lower costs and improve patients' healthcare.

The second big area where big data analytics is used in healthcare is reducing fraud, waste, and abuse. The cost of this fraud in the healthcare industry is a very key contributor to the spiraling healthcare costs in the US.
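The talk does not describe the actual models that fraud-detection systems use, but the "over-utilization in a short time period" pattern mentioned next can be illustrated with a deliberately simplified sketch. The provider names and claim counts below are invented, and real systems combine many such signals with machine learning; this only shows the basic idea of flagging a statistical outlier against historical baselines:

```python
from statistics import mean, stdev

# Hypothetical claim counts filed by each provider in one week.
weekly_claims = {
    "provider_a": 12,
    "provider_b": 15,
    "provider_c": 11,
    "provider_d": 14,
    "provider_e": 95,  # far above the others
}

def flag_outliers(counts, threshold=1.5):
    """Flag providers whose claim volume sits more than `threshold`
    standard deviations above the mean across all providers."""
    values = list(counts.values())
    mu, sigma = mean(values), stdev(values)
    return [p for p, v in counts.items() if (v - mu) / sigma > threshold]

print(flag_outliers(weekly_claims))  # ['provider_e']
```

A single z-score like this is only a caricature, but it shows why storing historical data matters: without the other providers' history there is no baseline to compare against.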
so big data analytics can actually be a game-changer for for detecting health care fraud so basically a predictive analytics are being used by the Centers for Medicare and Medicaid Services they prevent it more than two hundred and ten million dollars last year from fraudulent claims so united healthcare for example transition to a predictive modeling environment in order to identify inaccurate claims so there are patterns in inaccurate claims and United Healthcare was able to use Big Data technologies to basically detect this and generated an almost 2,000 percent return on their Big Data technology so the gain they had was the Mendes compared to the investment they didn't Big Data infrastructure and analytics so the key not to identify fraud is the ability to store this data and go back in history and analyze and analyze this largely unstructured data set of historical claims so what people have claimed in the past for healthcare so then you can use machine learning algorithms to detect these anomalies and patterns you can analyze patient records billing to detect anomalies you can see whether there was an over utilization of services in short time periods whether patients received healthcare service from different hospitals in different locations very close times which you should not expect to see so in general you can analyze all these vast amount of data that you collect from the static health care providers and identify fraud so the third pillar on health care service and how to use big data is predictive analytics or what you might have shared as evidence-based medicine so initiative's initiatives here basically make use of the adoption of electronic health care records so the patient record are now electronic they're actually bits on a machine and the volume and detail of this information which are for every patient is increasing so there has been a big investment from the federal government stimulus about 30 million 30 billion dollars in this healthcare and 
This stimulus was designed specifically to provide incentives to health care providers to collect these electronic health records from patients. Once you have all this information, you can correlate treatments with outcomes: you can combine and analyze structured and unstructured data across multiple sources, improve the accuracy of diagnosing patient conditions, match treatments with outcomes, and predict which patients are at risk for disease or for readmission. This predictive modeling using big data is actually used for early diagnosis, and it is reducing mortality rates for problems such as congestive heart failure and sepsis. In fact, congestive heart failure accounts for a large share of health care spending. If you diagnose it early, it can be treated and you can avoid expensive complications, but what has been found is that early manifestations of these conditions can easily be missed by physicians. If you collect, store, and analyze the vast amount of data that you have from all the patients, then you will be able to identify and detect these conditions early. For example, at Georgia Tech there was a group that showed that a machine learning algorithm could consider many more factors in patients' charts than doctors could, and by adding these additional features there was a substantial increase in the ability to distinguish people who had congestive heart failure from those who didn't. So this predictive modeling can uncover patterns that couldn't previously be uncovered. There are actually startups, like Optum Labs, that have collected a lot of health care records, 30 million patients' worth, and have created a database for predictive analytics tools that help doctors make informed decisions and improve patient treatments. Obviously this doesn't mean that the expertise of doctors is going away; what it just means is that it
allows doctors to simultaneously consider multiple factors that couldn't be considered before, because a human brain doesn't have the ability to process thirty million data records at the same time. This is the promise of evidence-based medicine. Now, the fourth pillar in the health care use of big data analytics has to do with real-time patient monitoring. Health care facilities are looking to provide more proactive care, because proactive care is actually cheaper in the long term; being proactive is better than being reactive to a condition. You constantly want to monitor patients' vital signs, and there are monitors that do that; you can get the data from these monitors, analyze it in real time, and send alerts to care providers so they instantly know about changes in the patient's condition. Processing real-time events with machine learning algorithms can provide physicians with insights and help them make decisions that lead to effective interventions in good time. You can have wearable sensors that give caregivers the opportunity to interact with patients in new ways. This makes health care not only cheaper but also more convenient and persistent, because you have it all the time, so essentially it changes the nature of the relationship between the patient and the caregiver. As one example, you can have remote in-home monitoring of patients with chronic pulmonary diseases; monitors can track the weight of patients with heart disease and detect fluid retention before hospitalization is required, so you can intervene before a hospital visit becomes necessary, which not only helps your health but also manages your health care costs. For patients on asthma medication, monitors can track their usage. In general, wearable monitors can generate a vast amount of data that can help in real-time patient monitoring. So let's see some specific applications in this area.
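As a toy version of the alerting pipeline just described, the sketch below scans a stream of readings and emits an alert whenever a vital sign leaves its allowed band. The vital names and bands are illustrative assumptions; a real system would use learned, patient-specific models rather than fixed thresholds.

```python
def vital_sign_alerts(readings, limits):
    """Scan (patient_id, vital, value) readings and collect an alert
    whenever a value falls outside its allowed (low, high) band.
    `limits` maps a vital name to that band; both the names and the
    bands here are made up for illustration."""
    alerts = []
    for patient_id, vital, value in readings:
        low, high = limits[vital]
        if not (low <= value <= high):
            alerts.append((patient_id, vital, value))
    return alerts
```

In production this loop would sit on a streaming platform and push notifications to care providers instead of returning a list, but the core logic, per-event evaluation against a patient's expected range, is the same.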
One of the most interesting examples is what Blue Cross Blue Shield, a health care provider in the US, has been doing in recent years. Just to give you some background, over the last years in the US there has been a huge epidemic of opioid abuse. Essentially what this means is that people are being prescribed strong opioid painkillers, they are getting addicted to them, they keep wanting these painkillers, and this has resulted in several overdose deaths. So data scientists at Blue Cross Blue Shield have started working with big data experts to tackle the problem of opioid addiction. What they have been doing is using health insurance and pharmacy data, and they have been able to identify more than 700 risk factors that can predict with a high degree of accuracy whether someone is at risk for abusing opioids. Essentially, as Blue Cross Blue Shield's data scientists say, it's not one thing; it's not like "he went to the doctor a lot", because that alone is not predictive. What is predictive is a combination of things: you hit the threshold of going to the doctor a lot, you have certain types of conditions, and you go to more than one doctor in more than one zip code to get your prescription. These things add up, and using large data sets from insurance and pharmacy records they have managed to identify which of these factors can predict opioid addiction. Once you do that, you can reach out to people who are identified as high-risk, intervene, and prevent them from developing this drug issue. Another interesting example of big data in health care is the Cancer Moonshot program. President Obama, before the end of his last term, came up with this program, and its goal was to accomplish ten years of progress toward curing cancer in half that time, that is, to basically halve the amount of time needed to cure cancer. The program's recommendations included big data use cases.
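To make the Blue Cross Blue Shield "it's not one thing" idea concrete, here is a minimal sketch of how several individually weak signals can be combined into one risk flag. The field names, conditions, and thresholds are entirely hypothetical; the point is only the structure, where no single rule is predictive but the conjunction is.

```python
def opioid_risk_flags(patient, visit_threshold=12,
                      doctor_threshold=2, zip_threshold=2):
    """Combine weak signals into a single risk flag. Each rule alone
    is not predictive; flagging requires all of them together.
    Field names and thresholds are illustrative, not BCBS's real 700
    risk factors."""
    flags = {
        "many_visits": patient["visits_per_year"] >= visit_threshold,
        "risk_condition": bool(patient["conditions"]
                               & {"chronic_pain", "anxiety"}),
        # multiple prescribers across multiple zip codes
        "doctor_shopping": (patient["n_doctors"] >= doctor_threshold
                            and patient["n_zip_codes"] >= zip_threshold),
    }
    flags["high_risk"] = all(flags.values())
    return flags
```

A production model would learn weights for hundreds of such factors instead of hard-coding an `all()`, but the combination-of-signals logic is the same.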
Medical researchers can use large amounts of data on different treatment plans and recovery rates of cancer patients in order to find trends and treatments with the highest rates of success. What this essentially means is that you can collect data from all the hospitals that have provided specific treatment plans, you also know what the recovery rate was, whether there was a remission or not, and then you can analyze all this data and see which treatment plans have the best chance of success. For example, researchers have examined tumor samples in biobanks that are linked with patient treatment records, and using this data they were able to find that certain mutations and cancer proteins interact differently with different treatments, and they could find trends that lead to better outcomes. The researchers and doctors actually found unexpected benefits: for example, they found that desipramine, an old antidepressant, actually had the ability to help treat certain types of lung cancer. They were able to do that because they could analyze a unified database from many different hospitals that held all this different data. Of course, this kind of big data initiative also comes with challenges. For example, you might have incompatible data systems: different hospitals might use different systems to store their data, they might store different types of information, and even if you have structured data the specific attributes might differ, so you have some sort of incompatibility, and this is a huge problem in databases that you need to solve. Obviously, when we talk about health care we also have privacy concerns and patient confidentiality issues. Simply put, even though there are clear applications and clear benefits, there are other things that you need to consider when you're talking about applications of big data in health care.
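Returning for a moment to the idea of pooling treatment records from many hospitals: the core computation, ranking treatments by observed remission rate across a unified database, is simple. The record schema below is a hypothetical flattening of what such a database might hold.

```python
from collections import defaultdict

def treatment_success_rates(records):
    """Pool per-patient records from many hospitals and rank
    treatments by observed remission rate, best first. Each record
    is a dict with hypothetical keys 'hospital', 'treatment', and
    a boolean 'remission'."""
    totals = defaultdict(lambda: [0, 0])  # treatment -> [remissions, cases]
    for rec in records:
        stats = totals[rec["treatment"]]
        stats[0] += 1 if rec["remission"] else 0
        stats[1] += 1
    rates = {t: r / n for t, (r, n) in totals.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

Real analyses would of course control for patient mix, disease stage, and the data-incompatibility issues mentioned above before comparing raw rates.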
Another interesting example of big data analytics in health care is Asthmapolis, a startup which is currently known as Propeller. What it essentially does is produce inhalers for asthma patients that have GPS-enabled trackers. It keeps track of your location and of how much you use your inhaler, and then it is able to identify whether the environmental conditions in the area you are in are good or bad, and it can provide other users of the inhaler with information on whether they might have an asthma attack. So essentially it tries to correlate locations, inhaler usage, and the environment, in order to provide better information to people; the CDC is actually using this data to develop better treatment plans for asthmatics. So, we have talked about health care and about retailers, which are obviously big users of big data. The last two applications we'll see are in somewhat less traditional fields, and the first one is transportation. Urban roads were designed decades ago, and they are now filled up with cars. In the middle of the 20th century everyone thought the car was the savior of transportation; obviously today we don't think the same, and there is a big problem with traffic in many US cities, and of course in many European and Asian cities as well. So the idea is: how can you use big data to create efficient public transport that takes some of the congestion off the streets? One of the things big data can do is facilitate route design. For example, you can collect a vast amount of data on people's mobility, either through the GPS sensor in your mobile phone or through the GPS sensors many cars are equipped with, and you can identify what transportation engineers call origin-destination pairs: you can
see where people want to go and at what times of day, and you can facilitate better and personalized route design for public transport, and actually identify ways to reduce commute time. An interesting experiment that happened in Minneapolis a decade ago was to put traffic lights at the entrances of highways in order to better regulate the incoming flow from side streets onto the highway. The perceived delay was higher, but the actual commute time was smaller; you can use big data to run this sort of experiment and identify what works best. One of the big things we will see that people did not do well when using big data is that they don't put the human at the center of the problem. When we talk about transportation and about our cities, humans need to be the central focus, so we need to think of humans first rather than simply minimizing congestion. Big data can help us understand better what people are looking for in their transportation, whether for example they would be interested in taking a slightly longer route that makes them feel better. So big data can facilitate this human-centric transportation, and of course, at the end of the day, local governments can use big data to do data-driven policy, to make policy decisions that improve the overall quality of living in cities. Some examples we will see here have to do with transportation: for example, StreetLight Data is a startup company doing analytics on electric vehicles. It is collecting data from electric vehicle charging stations, and it tries to identify better locations for the recharging stations and to dimension them better, that is, how many spots for cars they should have and what the needs will be in the future, and they are basically trying to improve the usage
of electric vehicles; basically, they are trying to make the experience of people using electric vehicles better, so as to move more people from traditional diesel cars to electric vehicles. Of course, another big problem has to do with parking. For example, there is research showing that about 30 percent of the traffic in downtown areas of US cities is traffic looking for on-street parking. Big data can help in many ways here, and one of them is dynamic pricing of parking: if you collect a vast amount of data on parking demand over the course of the day, you can identify a better pricing scheme so as to improve the utilization of these parking spaces. The problem in US cities is basically that people want to park on the street even though parking garages are pretty much the same price, so the idea is that you can increase the price in areas with small supply and provide a better experience for everyone. There have actually been several experiments with dynamic pricing, using big data collected through sensors, and it has been shown that this has reduced congestion overall and has improved the utilization of the parking spaces. Another approach in transportation using big data is street redesign.
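The demand-responsive pricing rule behind these parking experiments can be sketched very simply: raise the hourly price where occupancy sits above a target band, lower it where spaces go unused, and leave it alone otherwise. The target band, step size, and price floor below are illustrative numbers, not the parameters of any specific city's pilot.

```python
def adjust_parking_price(price, occupancy,
                         target_low=0.6, target_high=0.8, step=0.25):
    """One adjustment step of demand-responsive parking pricing.
    `occupancy` is the observed fraction of occupied spaces on a
    block; prices move in `step` increments toward a target
    occupancy band, with a floor of 0.25. All numbers are
    illustrative assumptions."""
    if occupancy > target_high:
        return round(price + step, 2)   # too full: raise the price
    if occupancy < target_low:
        return round(max(0.25, price - step), 2)  # too empty: lower it
    return price                        # within the band: leave it
```

Run periodically per block against sensor-derived occupancy data, a rule like this nudges each block toward the target band, which is essentially what the experiments described above did.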
One of the big pushes in the US (in Europe things are obviously a little better) is that many cities are trying to build biking infrastructure. In the US, unfortunately, there is not a lot of prior knowledge on how to build efficient biking infrastructure, so what many cities are doing is experimenting: they create different layouts for bike lanes and then they collect all sorts of data. They collect data on how this infrastructure affects local businesses, how it improves the safety of cyclists, how it makes them feel, and how it affects overall congestion in the city. So there is a vast amount of data being collected and analyzed, which can then be used to make policy decisions on what infrastructure should be built. The same goes for the infrastructure for electric vehicles we discussed earlier. One of the things we mentioned has to do with human-centered transportation, and here there is interesting work done by researchers in Cambridge and at Yahoo: the idea of happy maps. What the researchers did is collect a vast amount of data, through gaming, on what makes people feel happy. They built a game called Urban Gems in which they show people two images from a city and ask them a single question: which one makes you feel happier? By collecting these answers from a very large number of participants, they were able to identify which features of the urban environment make people feel happy, and they were able to provide route recommendations aimed at making people feel happy rather than just getting them to their destination faster. This is an example of human-centric transportation. Finally, with regard to applications, we have sports, and this might seem strange to some people, but in reality analytics, statistics and
data have been part of sports since as early as the 19th century; actually the first box score ever collected, which is data after all, was collected in the 1850s for a baseball game. Today, obviously, we have much more detailed data in sports, and some of the applications this data can facilitate have to do with in-game strategy, so you can have something similar to evidence-based medicine, an evidence-based sports strategy. You can have roster selection: how can you identify players who provide good value but are essentially undervalued in terms of money contracts? There is fan engagement: sports is a business, and as a business teams want to increase their revenue, so how can you engage your fans better, and what do your fans want, both from the perspective of the in-stadium experience and the out-of-stadium experience? And of course there is something also related to medicine: injury prevention. Obviously you cannot prevent an injury such as a sprained ankle; that is an event that is not predictable. But in sports with repeated motion, for example in baseball, where you have a pitcher repeating the same movement, the same motion, time and time again in a game, people have actually analyzed the way pitchers perform this movement in order to spot signs of a possible coming injury. So why, all of a sudden, in the last five to eight years, have sports analytics gained so much attention and entered this big data era? The answer is simple: now we are able to collect much more information. Ten or fifteen years back, the only information that could cover a game was basically box score information, aggregate statistics for players and for the team, and nothing very detailed. I actually remember when I was young and playing basketball, we were keeping these statistics by hand, which was a very manually intensive way of doing
things, but today there are all sorts of technologies that can generate a vast amount of data for all different kinds of sports. For example, in soccer, in hockey, and also in basketball, you have optical tracking technology, where a set of cameras overlooks the field and can track the players, identifying the location of every player and of the ball several times every second; as you can imagine, this generates a vast amount of data. For other sports, like American football, where players wear a lot of padding, you can put small sensors in this padding that capture similar information. These sensors work like GPS: they can identify the location of players on the field and measure things like speed, and some sensors can actually also measure the hydration levels of a player. So there is a vast amount of information you can get from players, and this has led to several applications. Maybe the thing that popularized the use of data and big data in sports analytics is Michael Lewis's book Moneyball; if you haven't read it, it's a very good book, or you can watch the movie. It is actually a real story: there was a small-market baseball team in the US with a very small budget, and by using advanced analytics they were able to build a roster that could compete with much richer teams and actually had a lot of in-game success. Now, it took several years to get the next major advancement in the use of big data in sports, and this came from the NBA. The NBA has, not necessarily mandated, but introduced a program where all the teams now have six cameras in their stadiums that track the players and the ball 25 times every second, so you have the locations of the players and the ball throughout the game.
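A quick back-of-envelope calculation shows why this counts as big data: tracking ten players plus the ball at 25 samples per second over 48 minutes of game clock already yields hundreds of thousands of position records per game, before overtime, multiple seasons, or the whole league are considered.

```python
def tracking_records_per_game(n_tracked=11, hz=25, minutes=48):
    """Rough volume of an NBA optical-tracking feed: 10 players plus
    the ball (11 tracked objects), sampled 25 times per second over
    48 minutes of game clock. Ignores overtime and stoppages."""
    return n_tracked * hz * minutes * 60
```

That is 792,000 position records for a single game; multiply by roughly 1,230 regular-season games and you are near a billion records per season from this one source.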
Basically, you can use these data to do a lot of analysis. Some teams are analyzing defensive strategies: this small demo is something built by the Toronto Raptors, and it was one of the major successes of this technology. They used a vast amount of possession data from over 2,000 games and introduced the concept of ghosting: they analyzed all these possessions and managed to identify what the optimal position of a defender should be, and by creating these ghosts, players placed where a defender should have been, they were able to identify which players perform better on defense. In another application, the Toronto Raptors were monitoring players and how fast they cut through the baseline, and if they saw a change in a player's velocity they inferred that the player must be tired and substituted him. There are other companies, for example YinzCam, which is actually a local company here in Pittsburgh, that provide services to NBA and NFL teams for fan engagement: it provides tools like social media analytics and other tools, such as virtual reality, that help engage fans better, both when they are in the stadium and outside it. As I said, sports franchises are businesses, and what they want is for their customers, their fans, to be satisfied and have the best experience they can. Here's another example of this ghosting application, this time from Disney Research. You might be wondering why Disney is into sports; it's because they own ESPN, which is the major sports network in the US, and maybe worldwide. What they did in this work is analyze soccer possessions, I believe from the Premier League, and basically they were able to build a probability model of whether a possession will end up in a
goal. For example, in the case I showed you, taken from their work, the scoring probability is 70 percent. What they did is build this ghosting system, where the transparent circles are ghosts, not real players, but where the players should have been. For example, player number 5 is pretty much where he should have been, but player number 8 is not. Essentially, if the defense had played exactly as it was supposed to, the scoring chance would have been much lower, 41 percent. So this analysis, which uses a vast amount of optical tracking data from all the games in a season together with some deep learning techniques, is able to put a value on the defensive play of players. This is actually revolutionizing the way players are evaluated, because until now, without this data, we were only able to evaluate the offensive contribution of a player: in basketball we talk about points and rebounds, in soccer we talk about goals, but now we have the ability to actually evaluate the defense a player produces. OK, so now that we are done with this overview of the applications, let's see the last part of this presentation, which is the big failures. Not all is rosy in the landscape of big data; there are also failures in applying big data to problems, and in this case the failures can be big. For example, traditional statistical techniques that have been known for years to have flaws, like p-values, have these flaws become more pronounced when applied to big data. The p-value is a concept we learn in our first statistics course, and what p-values can do is identify significance where there is none. Researchers have tried to point out the shortcomings, but the problems keep appearing all the time. For example, a recent paper from researchers at the
University of Maryland and the Indian School of Business basically showed that if you have a large amount of data and you try to identify correlations or differences between two data sets, and you rely only on this statistical construct of the p-value, you will always identify a significant effect, because the amount of data is so big that the test will always be able to pick up some difference. To understand why this is a problem, let's see an example from the criminal justice system. Assume you are doing a murder investigation, and at the crime scene you found a fingerprint. Typically what the authorities do is take this fingerprint and pass it through a database with several hundred million entries, let's say 150 million fingerprints, trying to find a match, that is, trying to see whether this fingerprint is in the database so you can identify who it belongs to. Now, the false positive rate is very low, let's say 10 to the minus 8, so you will falsely match a fingerprint only once in 10 to the 8th comparisons, once in 100 million. And let's say you found a match: should you convict this person? I hope you won't, because if you go through this big database blindly, you expect to find 1.5 false matches. Even if the person doesn't exist in the database, you still have this false positive rate, and even though it is very small, once you have a very large database you are almost guaranteed to find a match by false positives. This is similar to what we said about p-values: finding an impact or effect where there is none. Obviously, there are ways to work around this issue.
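Both pitfalls here are easy to check numerically. The sketch below uses the numbers from the fingerprint example (150 million entries, false positive rate of 10 to the minus 8) and a generic two-sample z statistic to show how a negligible difference crosses the conventional 1.96 "significance" bar once n is huge; the effect size and standard deviation in the z example are invented for illustration.

```python
import math

def expected_false_matches(db_size, fp_rate):
    """Expected number of spurious hits when screening one print
    against every entry in the database."""
    return db_size * fp_rate

def prob_at_least_one_false_match(db_size, fp_rate):
    """Probability of at least one false match, assuming the
    per-entry comparisons are independent."""
    return 1.0 - (1.0 - fp_rate) ** db_size

def z_statistic(mean_diff, sd, n):
    """Two-sample z statistic for a mean difference with common
    standard deviation `sd` and `n` samples per group. With huge n
    the denominator shrinks, so even a tiny difference looks
    'significant'."""
    return mean_diff / (sd * math.sqrt(2.0 / n))
```

With 150 million entries and a 10^-8 false positive rate, the expected number of false matches is 1.5, and the chance of at least one false hit is about 78 percent, so a blind database match is far from proof. Similarly, a mean difference of 0.01 standard deviations is nowhere near significant at n = 100 per group, but sails past 1.96 at n = 10 million.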
These ways, though, are still being developed. So this is one example of why you need to be careful with big data, and my personal opinion is that here we need a fourth V: we talked about the three Vs of velocity, volume, and variety, and we need one more, which is variance. The more data you have, the lower the variance, and the easier it becomes for a statistical test to pick up tiny differences. Now, another big flaw in our current thinking about big data is that we keep the human element out of the loop. Everything in big data is rows and columns, or connections, or bits and bytes, and we never consider the implications of using this data or how it was generated. This creates several ethics issues, but also validity issues for the systems we build. One sad story is the Google Flu predictor. You might actually be aware that Google built a system that was supposedly able to predict the flu earlier than traditional methods, and the way they did that was by using the queries sent to the Google search engine: if people were searching for the flu a lot, Google inferred that the flu was coming, that a flu epidemic was coming. That was presented as a great example of using big data, but unfortunately it didn't work quite as well as it seemed. The main problem with the method was that the search engine itself was prompting users to search for keywords similar to those they had already searched for. I'm sure many of you have noticed that when you are searching for a term in Google or another search engine, you also get recommendations, "how about searching for this". So if you search for flu, Google will prompt you to search for flu shots and other similar terms, and clearly this last search should not be counted towards the flu prediction algorithm, since it was
artificially created by the system. However, what researchers found is that it was counted, and this led to overestimation and eventually to the failure of the application. The article you see here has many more technical details, but I gave you the layman's description. Essentially, when analyzing big data, the mechanisms through which the data themselves were created need to be considered, and in this case the data were created not only by the people searching for something on Google but also by Google's own algorithm. Now, the corporate smart-city rhetoric is also all about efficiency: predictability, security, safety, you will get to work on time, there will be no queues, you are safe because there are CCTV cameras around you. All of these things make a city acceptable to live in, but they don't make it great. We have the technical solutions, but we need to think more about fundamental urban problems, about what makes people happier; at the end of the day we care about people enjoying living in cities. As I mentioned earlier, the Good City Life project, from the researchers I mentioned before, is using social media data to map the emotional layers of cities. Now, the biggest concern of all has to do with biases. Algorithms don't necessarily deliver truth; in fact they are quite subjective, and this has to do essentially with what data are being used. The saying goes that big data is algorithmic, so it must be objective and unbiased, which is actually wrong: all traditional forms of social discrimination exhibit themselves in the big data ecosystem. What do I mean by that? I mean that once you have an algorithm, what the algorithm gives you is deterministic, but how you created the algorithm, how you trained it, and the data you used for it can be biased. For example, Amazon recently came under fire
for algorithmic models that limited its Prime delivery service in minority neighborhoods in both the US and the UK, and financial institutions have been penalized for computer-generated decisions that end up discriminating against credit or insurance seekers by race, because using this race feature is very discriminatory. The White House actually released a report which found that big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used; however, the effects are hard to track, because we do not have a good understanding of big data. A very interesting book on big data and its big risks is Cathy O'Neil's Weapons of Math Destruction, which describes how algorithms can be biased. What we basically need in order to solve this problem is policy: we need new policies on how data are collected, how data are used, and how data are used to train algorithms; we need education and ethics around data; and eventually we need human involvement. At the end of the day, whatever we do with big data is done in order to help humans in specific applications, in specific domains, so this human element needs to be in the design from the beginning: what data you are collecting, how you are using them, how you are training your algorithms. One very interesting article from Nature essentially says that we need more accountability for big data algorithms, and it summarizes the bias problem as "bias in, bias out", rephrasing the well-known "garbage in, garbage out", which basically means that if your input data, your training data, are biased, so will your output be. Gartner actually says that by 2018 half of business ethics violations will occur through improper use of big data analytics, and this is a big problem, because as engineers and computer scientists what we are trained in is how to solve all these technical challenges: how do you store
the data, how do you process the data, how do you analyze them, how do you visualize them. All of these are clear challenges, but one of the main challenges that we often forget is the ethics of using this big data. I hope that if there is only one thing you get out of today's seminar, it is this: we need to put the human in the loop when we are analyzing big data, and we need to think hard about ethics and biases when we build systems and algorithms using big data. As I said, if you have any questions about what I discussed, you can either send them to George, who can forward them to me, or send them to me at the email address from the beginning of the presentation. I'm really glad I gave this seminar and I hope you enjoyed it too. I'll turn it over now to George.
