2014 Symposium on Big Data Analytics in Healthcare – Sudha Ram



So it's my privilege to introduce the next speaker because, number one, she doesn't need an introduction and, number two, I'm not going to be reading a long bio. I don't make that same mistake twice on the same day. So it's my pleasure to introduce Dr. Sudha Ram. She's the Anheuser-Busch Distinguished Professor in MIS, Innovation and Entrepreneurship. She also has a joint appointment with the computer science department. Sudha came to the University of Arizona back in 1985; that's when she also got her PhD from the University of Illinois. She worked on distributed databases back in 1985, even before the internet was invented by Al Gore, who somebody else [unclear] today. So she has always been at the frontier of discovering and creating new knowledge related to data and information. She has worked on distributed databases, she has worked on conceptual modeling, she has worked on semantic interoperability, and now she's really making a lot of progress and a lot of impact in data analytics. She and I co-founded INSITE a couple of years ago; we're sponsoring today's symposium.

It's so exciting to have Sudha around in the department. Her middle name is collaboration: she collaborates with everybody who wants to come and talk to her about data and data issues. She was a co-PI of a 60 million dollar grant, the iPlant grant, the first instantiation of that grant, and she has this very long resume of research funding, 200 publications, 200 articles. And PhD students who are there in the audience, that's what you have to shoot for. It's just a pleasure to have her here in the department. I interact with her every day, and so I just wanted to invite her to come and present on this exciting topic. Thank you, Sudha.

Thank you very much, Paulo, for that lovely introduction. So this is a very, very interesting project for us in INSITE. We're working with partners in the healthcare industry. I would first like to credit my co-authors and co-workers on this project. A lot of the credit, as you know,
doesn't really go to the professor; it goes to the doctoral students you work with. So, one of our doctoral students is Wenli Zhang. She also, by the way, doubles as a very good photographer. We wear multiple hats in INSITE: we don't just work with big data, we also create the big data, so she's doing that with her pictures. Max Williams and Dr. Yolande Pengetnze are from Parkland Center for Clinical Innovation. This research is funded in part by the Parkland Center, and we're actually working with the Parkland Hospital in Dallas. The other physicians in the emergency department that we've closely collaborated with, along with their teams, are Dr. Anand Shah and Dr. Ruben Amarasingham.

So this project started off as a big data project. Ruben and I started talking a few years ago about the emergence of big data, and then I served on an NIH committee with him on predictive analytics, looking at the future of predictive analytics. At that time we were talking about what big data really means. So we said, well, let's get together: you're doing a lot of work on big data, we're in the healthcare industry and you're interested in it, so let's think about what we want to focus on. So this project actually uses asthma as an example, but it can be generalized to other kinds of chronic diseases as well as other healthcare challenges.

Are there any asthma sufferers in this audience? OK, lots of hands going up. This is actually a huge challenge in the U.S.:
25 million people, causing 2 million emergency department visits, which are entirely preventable, and half a million hospitalizations, resulting in a lot of preventable deaths. Also 50 billion dollars annually in medical costs for treatment, and of course a lot of missed school days, because childhood asthma is very prevalent, and missed work days.

So we first started looking at chronic diseases. If you look at a map of the US, you can see from the color, the dark brown is pediatric asthma prevalence, that it is prevalent in a lot of states, especially in Texas and California. This was one of the reasons why we picked asthma: this is a hospital in Dallas, this is a center in Dallas, and asthma is one of their biggest challenges. So our objective in this project was to develop robust models to predict asthma-related emergency department visits in near real time. I think we're making very good progress, and I'll explain to you how we've done this.

There's a lot of asthma surveillance that happens. CDC issues weekly reports, but the problem is that there's a lag time of several weeks. It takes three to four weeks for hospitals to report their data to CDC, and then CDC aggregates them and reports them. So you really see the ups and downs of asthma cases, but you only see them after about four weeks.

At the individual level, Parkland has worked on developing asthma predictive models using their electronic medical records and, more specifically, the history of individual patients. What they've succeeded in doing is coming up with models that can predict who's going to show up in the emergency room within a three-month time window. So this is at the individual level. At the population level you have this reporting and surveillance; at the individual level you have this prediction, but it has a very long time window. So our objective really was to see if we could do something in between. Our real goal is to go down to the personalized level, but we
started with the population risk model, and we want to reduce the lag time for predicting how many people will show up in an emergency room to a day, for example, or to a few hours.

So we thought, well, why not leverage big data? I've been doing a lot of work on big data in many different areas for a long time now, and to me big data is not just about volume. We've heard that message over and over again today: it's really about the variety, which is absolutely correct. So there's social media data, and we're leveraging that in our project. We're leveraging internet searches through Google Trends. We're leveraging sensor-based data. You've heard the term Internet of Things; well, there are a lot of different devices that constitute the Internet of Things, and one of them is the environmental sensors, which are very relevant for asthma. We also have wearable sensors, and we're starting to leverage those, but not in this talk. I'm sure there are a lot of people wearing something to track the number of steps they walked. Anybody wearing either the Jawbone or Nike Fuel or the Fitbit? OK. So, wearable sensors: we saw them in the last talk and a couple of talks before that, and we want to be able to leverage that.

So variety is very important, but I want to go a bit further. What's really important for us is to learn how to leverage what's in this variety. To me, there are two things that are very important that we're leveraging in this project. When you look at these various sources of data, including your electronic health records, every piece of data that you track today comes with a spatial stamp and a temporal stamp, so a location and a timestamp associated with it. Leveraging that geographic location and timestamp is actually what gives us the power to get insights out of data. So it's not just the variety, it's the specific characteristics of the space stamp and the time stamp. On top of that, a lot of this data is very fine-grained
both spatially and temporally. So you're not just getting city-wise or state-wise or country-wise aggregates over three months; what you're getting is individual records, down to microsecond timestamps. If you think about Twitter or sensor data or even the wearable sensors, for example, you can actually get data down to the hour, the minute, the second, and to us that's very important. So those are some of the characteristics we're leveraging.

On Twitter, I don't have to introduce this; I think Mark and Brian did a very nice introduction of how Twitter has locations. You can get locations in a variety of ways from Twitter data: you can get them at the city level, you can even get GPS coordinates, the lat/long coordinates. On the temporal side, for every tweet you get a timestamp with a high level of accuracy.

Google searches: of course, Google Flu Trends is very well known to people. You can actually get data from Google Trends, and you can use certain keywords just like you do in Twitter, and you can even specify the geographic area, although the Google Trends APIs are not as finely grained in their resolution as Twitter is, for example. But you can still get data, so we were actually able to extract data from the area of interest during the time frame that we wanted to study.

Environmental sensors: again, EPA has dozens, hundreds of thousands of datasets on various air quality variables. They have multiple sensors in different places all over the US, and you can actually leverage both the location and the timestamp.

There's a lot of work that's been done using these types of data individually. With Twitter, for example, people have looked at Twitter messages to correlate them with CDC statistics; we saw some of that. People have used some machine learning techniques to extract and classify tweets, etc. We've also seen examples with Google Flu Trends; it's been in the news lately. There is also other work that's been done on all of these individual data
sets. But what we really wanted to do was to combine them. So for us, big data was not just terabytes, but really combining these in a very novel way to see if we could then use it to predict visits to the emergency room.

We first started with de-identified data. The reason why we're working with Parkland Hospital is because we needed the emergency room visit data. I think it was Brian or Mark who pointed out that a lot of the CDC statistics that get posted are not at a hospital level; you have them at the city level, for example, a very high level of aggregation. But we were fortunate to work with this hospital, and they actually gave us their ED visits for a whole year. We took a subset of it for a period of time, and this period of time was a three-month period between October and December. Since then we've extended the project. You can see here the x-axis is the time, the y-axis is the number of ED visits, and you can see an ebb and flow of ED visits for asthma as a primary diagnosis. We know where the hospital is located. We also know where the people for these ED visits are coming from; we have their precise zip codes. So we're able to leverage that, because then we can use it along with the locations from Twitter and Google Trends as well as the environmental sensors.

We used the Twitter streaming API and we collected two types of data. We used certain keywords to collect asthma-related tweets, and we call that the asthma stream. Then we collected a general Twitter stream, which is all of the tweets regardless of whether they were about asthma or something else. That's because when you combine data from multiple sources, or when you take a data set, you don't want to take absolute numbers; you want to normalize them so you can use them in prediction. So you have a general Twitter stream and an asthma Twitter stream. For the asthma Twitter stream, the physicians actually were very helpful in helping us identify common keywords
that might be useful. We had to go through a big process to decide on these keywords. So we used "asthma", several misspellings of the word, terms for common medical devices associated with asthma like nebulizer, inhaler, etc., and the names of prescription drugs used to treat asthmatic conditions, like albuterol, Singulair, and a few others. These were some of the keywords we used, and we got a pretty good data set of tweets. In fact, if you look at those tweets, we see other keywords as well related to asthma symptoms: runny nose, wheezing, sneezing, inhaler, etc.

This is sort of the trend of tweets; you see thousands of tweets. It's very interesting: when we started this project I wasn't even sure that we could find anything about asthma on Twitter. What's interesting is, when people have asthma, you would think they would reach for their inhaler first. No, they first tweet and then they reach for their inhaler, which is great for a researcher like me. I love that. I'm not so sure about the patient.

This is the asthma-related tweets globally. They're in various languages, but you can see, obviously, the English-speaking countries have a predominance. This is zooming in on the US. You can also see the various keywords: the orange is asthma, the yellow is inhaler, etc. And this is within the US. Very quickly, if this comes up really quickly, I can show you: this is a real-time visualization as the data is streaming in over time, geographically and temporally. You can see the timeline moving, and you can see the ebb and flow of tweets, and these are all asthma-related tweets. OK, I'm going to skip through this because I want to get back and I'm afraid I'll run out of time. We also have a similar one for us, and I can show you that.

So, as you can see, from these various data sources we have a timestamp and we have a geographic location. So we defined geo boundaries for collecting the Twitter stream
based on the ED data set that we had. We know where the hospital is and we know where the people are coming from, so we drew a rectangle and collected the asthma stream only from within that particular geo-bounded rectangle.

Obviously, 80% of our time went into cleaning the data, as Brian said; actually, it was probably more than that. We did a lot of things for cleaning the data, and I don't want to go into too much detail. I also want to tell you that we removed all the non-English tweets, because at this point most of the people going to the hospital were English-speaking people.

It's interesting: you do want to extract signal out of all of this data, and there's a lot of noise. On your left side you will see people talking about "I had a mini asthma attack, I need to get to the doctor." On the right side, people are talking about an asthma attack because someone is taking their breath away, or "every time I refresh my news page I give myself" one, or "first chemistry exam tomorrow, having an asthma attack." Those are really not relevant. I even saw tweets where someone actually went to a Justin Bieber concert and said, "Oh my god, I'm going to have an asthma attack, Justin Bieber and I took a Snapchat," or something like that. So you do have to remove all of those tweets.

We actually used machine learning techniques to train on a data set and then clean up, because we have hundreds of thousands of tweets and you really cannot do it manually. We gave a sample of 4,500 tweets to three different ED physicians and had them classify the tweets, and, God bless them, they actually did that. We took that as a training sample, used machine learning techniques with it as training data, got very good precision, and were able to create a very clean data set.

The next thing we did was we took the Google search data. The interesting thing about Google
Trends data is you can get data on a daily basis and you can get it for a certain area, but you can get it for that rectangle we were talking about. So we used the same keywords that we used on Twitter to look at the total number of searches on Google for those asthma-related keywords, the same 19 or 20 keywords. We first collected the data set on one of the days in December, and then for some reason we decided, well, we're going to go back and see how reliable the data is. So we did it two more times, and it's interesting: the same keywords on Google Trends give you three completely different data sets, and we weren't really able to explain it. So I emailed Google, and I'm still waiting to hear from them about why there are three different data sets. It's the same keywords, but the number of searches is different.

Then we looked at the AQS AQI, the air quality index data, and I learned a lot about air quality. It turns out there are hundreds of different types of measurements on air quality. Who would have ever guessed? Because all we hear about is carbon monoxide and the ozone layer, right? Well, there's PM 2.5, PM 10, there's nitrogen oxide, so all of my chemistry training actually came in very useful here. We collected pollution data, again from the same geographic region for the same time period. There are multiple air quality sensors; sometimes you have sensors that sense every second, sometimes it's every hour, and sometimes you see gaps because the air quality sensors are down. So it's a very interesting process of trying to get this data, triangulate it, and clean it up. We got data on a number of pollutants, and you see these: carbon monoxide, sulfur monoxide, PM 2.5, etc.

We also normalized all of our ED visits, because, remember, what we're trying to predict is an outcome variable of how many people will show up in an emergency
room. Rather than predicting the actual number, we said we're going to predict high, medium, low. So we normalized it, and we used some rules for saying what's high, what's medium, and what's low after consulting with the ED doctors. For our methodology, we're using a bunch of different machine learning and AI techniques, and we also use some stacking and boosting techniques.

But before we get into each of the predictions: we have the Google Trends data, we have the tweets, we have the air quality index data on all those variables, and we wanted to first select a feature space to do the prediction. So we first wanted to run some correlations between the emergency room visits and the number of tweets, between the emergency room visits and each of the air quality variables individually, etc. We found, much to our surprise and delight, that there's a high level of correlation, very significant, between the asthma-related tweets and the ED visits if you aggregate them at a daily level, and that's the level at which we did the analysis for this project. You can see a very close correlation.

Then we said, well, maybe Twitter data is correlated with everything else that goes on in the ED. So we decided to take something that's completely unrelated to asthma; in this case it was abdominal pain and constipation-related ED visits. We chose that because it affects a completely different organ system of the body. And we found that, obviously, the Twitter data is not at all correlated. So that gives us evidence that Twitter would be a useful source of predictor variables for ED visits for asthma.

Similarly, when we did the pollution data, we found that these three variables, carbon monoxide, nitrogen dioxide, and PM 2.5 levels, were most closely correlated. That brings down the number of variables that we want to use in our prediction. PM 10, interestingly enough, wasn't very closely correlated.

Google Trends data: remember, we had three data sets for Google Trends. We tried this correlation with all three and found that two of them were not at all correlated and one of them was. But because we could never resolve the issue of why the numbers were different, we decided not to use Google Trends data.

So now the results. This is our input: the outcome variable is the number of people, high, medium, or low, and the inputs are the number of asthma tweets and the air quality indexes. Based on this, to cut a long story short, we found a high level of accuracy. Our best results differ depending on which class you're talking about; whether it's high, medium, or low, different techniques work best. A decision tree worked very well for accurately predicting whether there were going to be high visits on a day. A neural net combined with another technique worked very well, and a decision tree for medium. So there's really no one technique that works best; it's a combination of predictive modeling techniques. We're experimenting with random forests right now, and a couple of other techniques as well.

When we showed these results to the doctors, they were actually quite shocked, because they thought only medical history would be a good predictor, and it turns out you can use these other sources in a very innovative way. So these results are very promising. They demonstrate the utility and value of linking big data from diverse sources, and this is the first time, to my knowledge, this is being done for a non-communicable disease. Usually people have tried to do that for viruses, for Ebola or for flu, etc. But asthma is a non-communicable disease, it's a chronic disease, and as you saw, it works for asthma. Perhaps it's relevant for other chronic conditions as well, and that's what we want to experiment with.

There are some challenges that we still have to deal
with, but first I want to talk about why this is important and why this is useful. Well, this is a population-level risk model. We do want to take it down to the personalized level. What we think we could do is use this as surveillance in real time, and using these predictions you can also identify patients. In fact, Parkland wants to do this: identify specific patients to whom they can communicate the risk of an asthma attack, or the fact that on that particular day a high number of people are expected to show up. It would act as a notification system for individual patients.

The hospital is most interested in using our results for making their staffing decisions because, like I said, a lot of times they're overwhelmed with the number of people who show up, and they don't expect them. So they don't have the emergency room physicians with the required expertise, and they don't have the respiratory equipment that's needed to help these people. So they're very interested in using this model for that. Right now their staffing decisions are being made on a three-week basis, and they want to be able to change them on a rolling basis, in more real time.

We want to be able to do targeted patient interventions using the patient address and geolocation, so alerting them and counseling them on preventative methods using our risk model. But most of all, what we want to do is use other attributes as well. This is not an individual risk model, but we do want to use demographic variables for prediction.

We've also got buy-in from other major hospitals in the Dallas area when they heard about the results, so now we have access to a very large database that pretty much covers all of the hospitals in the Dallas area. The hardest thing to get is the medical emergency records. You can get Twitter data, you can get the Google Trends, you can get the air quality; it's
very hard to extract medical records, probably either because it's hard to get them out of an Epic or Cerner system or because of ownership and privacy issues. We've resolved those issues, and we've now got about 75 member hospitals in the Dallas area that want to work with us.

The other thing that we are also doing is we're talking to a startup company that actually uses an inhaler which has an IP address associated with it. These are inhalers given to individuals, and the inhaler actually tracks every time someone takes a sniff from it. So we're actually going to get more data on who's adhering to their prescriptions and actually taking sniffs, etc. Hopefully, if we get that data, we can add that.

I've already talked about how this was data from a single hospital, so we've got more data now. I'm also very interested in stock market prediction models; I've always followed them. So we're looking at those models to see if we can borrow from them and modify them to use in individual-level prediction.

Another thing that we're doing: we did this at a day level, but when you actually look at the data, there's a time lag between people tweeting, or between the time that the air quality index changes, and them showing up in the emergency room. So we want to make the prediction more fine-grained, down to an hour perhaps. What we're also seeing is that in these tweets there's a distinct pattern that shows escalation of symptoms. People start by saying "I have some chest tightness, I'm coughing, I was up all night, I've been wheezing," and then the symptoms become more and more severe as hours drag on, as a couple of days drag on, and then you see them saying "I think I'm going to have to go to the emergency room" or "I'm on my way to pick up my kid from school to take them to an emergency room." So we want to use this escalation of symptoms as a way to maybe add some corrections to our predictive
model. And last of all, we're building a Twitter-based mobile app to provide patients with real-time information about this kind of risk.

So this has been a very interesting experience. We started this project about a year ago, and it was very interesting to work with the emergency room physicians to get their domain knowledge and to use our big data analytics knowledge to put it together. This is the kind of project we typically like to work on in the INSITE center. I do want to again thank Wenli, because she did a lot of the work in actually collecting and cleaning the data and then doing the analysis. Thank you very much.

Hello. Thanks for the presentation. My name is Andy Panetta Joshi, from the School of Pharmacy. I was just wondering, for a disease state like asthma, do you think looking into the weather patterns, like seasonal variation, is important, and do you think of incorporating that in your data?

Yes, absolutely. This pilot project was done over a three-month time period. Now we've got data for a much longer time period, and we're seeing a lot of seasonal variations that we're incorporating into a predictive model. Another thing that some of you might be thinking about is that we've used a subset of air quality data, but temperature changes also trigger asthma attacks, as I understand, and so we're incorporating the temperature information as well. Yes, absolutely.

Thank you, this was very exciting, with lots of applications. Sorry, I'm over here. I'm just wondering: since you've expanded your study and you're now partnering with 75 hospitals in the Dallas area, given the huge percentage of families that are Spanish-only speaking in that area, are you expanding your analysis to also cover Spanish-language Twitter feeds?

So that's an interesting question. We've actually talked about tweets in other languages, and we are seeing a lot of tweets in different kinds of languages coming from that area. So yes, I think we're going to
have to expand to these areas. Again, with the tweets themselves, we're going to have to figure out how to actually clean them and how to extract them, because that's going to be a huge input into our... so yes, absolutely.

Yes. You know, that's an excellent, excellent question. I would love to get a dataset from AT&T or T-Mobile that gives me the actual text of text messages being sent, because I suspect that a lot of people are texting each other, and so I'd love to get that data. We've actually talked about it in the hospital. Apparently about 80% of the people in this hospital have smartphones, and 90% of them actually use text messages as well as social media. That is something where we would probably have to have an opt-in, because there are huge privacy issues associated with that. But yeah, that would be fantastic; that's another source of data.

Hello. One of the things I think you did a really good job of is showing the spectrum of symptoms, but there's also a spectrum of triggers. One thing I wonder if you're missing is that in some cities there's actually urgent care. So what I'm wondering is, did you track people going to urgent care versus ER visits? Because they've really done that well in many big cities, where people don't go to the ER because they have an in-between means of...

You know, that's absolutely right, and we've thought about that. This particular data set was only for the ED, but we definitely know that not everybody goes to an ED. Apparently there are also certain zip codes from where people go to an ED and other zip codes where they go the next day either to their primary care or to urgent care. So part of what we want to do is to get that data set. You're absolutely right.

Hi, just a real quick question. How did Parkland and U of A get together in this relationship?

So it's interesting. I went to give a talk at a symposium for big data, and I was presenting my work where I've done
some work on news propagation on Twitter, doing some prediction on how fast a news article spreads and whether you can predict the lifespan of a news article; not a news meme, but a news article. The CEO of Parkland Center for Clinical Innovation, Ruben, was in the audience, and he came and said, "You know, we want to do something with big data. What can you do to help us?" We got talking about it, and then we said, well, let's work on something together. So it just happened.

So I think, John, what you're pointing to is what it takes to do something like this. Big data can mean a lot of different things, and if you just start thinking in the abstract, it's not easy to come up with a problem. You really have to think about what some of the problems are. We tossed around a variety of different ideas, and then we zoomed in on, well, what is a big challenge for you guys? Is it about personalized medicine, is it something in the ED, is it treatment for other things? And so we said, well, let's work with the ED. It also depends on who wants to work with you. You need the domain experts, you need the datasets, so it all came together. And then we picked asthma. We could have picked some other disease, but we picked asthma because the ED doctors were willing to work with us.

It sounds like you've been getting a lot of momentum and success around the effort. I was wondering if maybe you could speak to how you would potentially operationalize that, or what the hospitals could do proactively to help provide faster care, or to actually prevent ER visits and suggest other alternatives.

Right. In this particular case, like I pointed out, this is a surveillance system, so we could provide early warning. We could also make the ED better prepared. Rather than deciding three weeks ahead how many people of certain types they need, if we know there are going to be certain numbers during
certain days, we could better prepare them and also have the equipment ready in case they show up. Now, to prevent asthma, you could provide warnings through mobile apps and things like that. So you have to use a multitude of mechanisms, I think, for prevention.

Very cool.

Hi, Michael. Hi. Well, lots of implications for this, at least in our world, I think, as we were just talking. Very exciting. Do you or Ruben have any insights into other types of chronic disease that might follow a similar pattern?

So actually, we want to work on diabetes. For diabetes, I think we can get a lot of data on what people eat through social media, and what their work habits are, and wearable devices can tell us a lot about it. So that's the other thing we've talked about, and also cardiac diseases. We do want to address other common chronic diseases, so asthma is just an example for us here.
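
The daily-level preparation steps described in the talk (normalizing the asthma stream against the general Twitter stream, turning ED visit counts into high, medium, and low classes, and checking correlation before choosing features) can be illustrated with a minimal Python sketch. This is not the project's code: the function names, the toy counts, and the cut-offs of 4 and 8 visits are all hypothetical, since the talk does not give the actual rules or data.

```python
from math import sqrt


def normalize(asthma_counts, general_counts):
    """Share of each day's tweets that are asthma-related (0.0 if no tweets)."""
    return [a / g if g else 0.0 for a, g in zip(asthma_counts, general_counts)]


def bin_visits(visits, low_cut, high_cut):
    """Label each daily visit count as low, medium, or high.

    The cut-offs are placeholders; in the talk the rules were set
    after consulting the ED doctors and are not stated.
    """
    labels = []
    for v in visits:
        if v < low_cut:
            labels.append("low")
        elif v >= high_cut:
            labels.append("high")
        else:
            labels.append("medium")
    return labels


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (sd_x * sd_y)


# Made-up daily counts for one week, for illustration only.
asthma_tweets = [12, 18, 9, 30, 25, 7, 15]
general_tweets = [1000, 1200, 900, 1500, 1250, 700, 1000]
ed_visits = [4, 6, 3, 11, 9, 2, 5]

rates = normalize(asthma_tweets, general_tweets)
print("asthma tweet share per day:", rates)
print("visit classes:", bin_visits(ed_visits, low_cut=4, high_cut=8))
print("correlation with ED visits:", pearson(rates, ed_visits))
```

The same `pearson` check could be repeated against an unrelated visit series, as the talk describes doing with abdominal pain visits, to see whether the tweet signal is specific to asthma.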
