Interview with Jimeng Sun Part 1 – Georgia Tech – Health Informatics in the Cloud



So today I'm talking with Doctor Jimeng Sun
from right here at Georgia Tech. Jimeng joined us earlier this year
after many years at IBM, where his work has depended heavily on analyzing
data from electronic health records. Jimeng, thanks very much for
taking the time to be in the course. I've heard you talk many times about the
difficulties of working with EHR data. Could you explain some of
those to the students? >> Sure. First, Mark, great to be here. So electronic health records are very
important for my research, and many other people who work
on clinical informatics and health care informatics also need
to access electronic health records. But electronic health record data is very
messy, with a lot of missing data in there, because the records are collected for
billing and clinical operation purposes. They're not designed for research. So there is a lot of information
in the EHR systems, but for a given patient they only have a small subset
of it, and it's not uniform. So some patients have a lot more
information than others, and that makes the electronic health records
very messy, very difficult to deal with. The other thing is,
there are just a lot of different kinds of information that are in EHR systems,
and it's getting more and more. For example, there's diagnosis
information such as ICD codes in there. And there's medication information,
there are lab tests. And there are clinical notes
represented as free text. And there are medical images. And more and more genomic data
is getting into the EHR system. So all of those different types of
information require different kinds of techniques to process them,
to analyze them. For example, for text data like clinical notes,
you need natural language processing. And for images, you need specialized
medical image analysis tools. For structured data like
diagnoses and medications, you would use general
data mining techniques. So people have to be able to pick the
right tools to deal with the right data. >> In a moment, we're going to talk
about some work you've done around the early diagnosis of
congestive heart failure. Which is interesting, particularly
within the context of this course, because chronic disease has really been
the exemplar I've used throughout. But first, in that work,
you've had to use that free text data. And in fact, you've extracted from it
clinical features that can be used by machine
learning algorithms. So how do you do that? How does that work? >> Right, so first, for
people who are not so familiar with clinical notes,
they contain very rich information. A lot of subtlety and a lot of
symptoms are represented in the notes. Only in the notes,
not in the structured data. So it's very important to process
the clinical notes, to extract the symptom information and the severity
information that are hidden in the text. So that's the reason why we
need to process clinical notes. And the way we did that is
to use a software pipeline for dealing with text called UIMA,
Unstructured Information Management Architecture, that was
originally developed at IBM Research and later on became an open source project
that many people are using, including later on the Watson project. So it facilitates natural language
processing development, and you develop a pipeline of extractors to go through
the text, extracting the information that you care about, like symptoms
that are related to heart failure. So we used heart failure signs and
symptoms, and also the context. You not only want to know that the symptoms
are present in the text, but also whether it's in
a positive context, meaning that the patient is
confirmed to have the symptoms, or it's in a negative context, meaning
the patient doesn't have the symptom. So you also need to know the context. So we built many different extractors to extract that information and
also the context. So it's a lot of work. >> So armed with that information and the structured information, you've
developed a machine learning algorithm to diagnose congestive
heart failure earlier. I should mention to the students
without a clinical background that congestive heart failure is the single
most expensive medical condition. And by diagnosing and treating it
early, which, as we discussed in the course, is true of most chronic diseases, we can hope to avoid the complications
that are very expensive to treat. But it's very subtle at first. There
is no test, like a blood glucose you can take or a hemoglobin A1C, to say
you've got congestive heart failure. So clinicians often don't pick up on
it as early as they might. Your model does.
Can you tell us how you did that? >> Right, so that's a project we
started when I was at IBM Research a few years ago in collaboration
with Geisinger Health System. So we would visit Geisinger and, I mean,
try to learn about what are the most important or challenging problems
that we can tackle together, using data mining and machine
learning techniques. And they identified heart failure as one. And at the time, there were already many
groups interested in heart failure, mostly in the post-diagnosis
phases, in terms of hospitalization and readmission
after a heart failure diagnosis. But that group at Geisinger, their focus, they wanted to look at pre-diagnosis,
look at what are the signs and symptoms that lead to
a diagnosis of heart failure. Can we identify the patients at high risk
of developing heart failure earlier? So the goal is early detection. And the first thing we
did is we wanted to see whether that's even possible. Are there relevant signals in the data
earlier than the diagnosis time? And we looked at the clinical notes, and
that's where we first developed natural language processing
techniques to extract that, and the findings from those
notes were very shocking. In fact,
with just the findings from the clinical notes, we were able to publish in
the Journal of Cardiac Failure. And those are medical journals
that clinicians write for. There were some exciting
findings in terms of signs and symptoms that are in
the electronic health records two years prior to the diagnosis,
and that's very surprising. Many people don't believe that. I mean, [inaudible]% of patients have those signs and symptoms already in the clinical notes
two years before the diagnosis. So those make us believe
we can do something. Then the question is how do we really
leverage that reach the HR information. So, it would be the integrated data
analysis tools to integrate diagnosis, lab measure, and medication, and also clinical notes together to view
the very reach patient profiles. Then score those patient
profiles using advanced machine learning techniques to
be able to predict a model. And that helped us to develop this very
accurate predicting model that can predict heart failures diagnosis six to
12 months before the actual diagnosis.
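
Editor's sketch: the interview describes two steps, extracting symptoms from notes along with their positive or negative context, and then combining them with diagnoses, labs, and medications into a patient profile that gets scored. The Python below is only a minimal illustration of that idea and is not the UIMA pipeline or the model from the Geisinger project. The symptom list, negation cues, feature names, and weights are all hypothetical, and a real model would learn its weights from data rather than take them as inputs.

```python
import math
import re

# Hypothetical vocabulary and negation cues, for illustration only.
SYMPTOMS = ["shortness of breath", "edema", "fatigue"]
NEGATION = re.compile(r"\b(no|not|denies|without)\b")


def extract_symptoms(note):
    """Map each mentioned symptom to True (affirmed) or False (negated).

    Crude rule: a mention counts as negated when a negation cue appears
    earlier in the same sentence. An affirmed mention anywhere in the
    note wins over a negated one.
    """
    found = {}
    for sentence in re.split(r"[.;\n]", note.lower()):
        for symptom in SYMPTOMS:
            idx = sentence.find(symptom)
            if idx == -1:
                continue
            negated = NEGATION.search(sentence[:idx]) is not None
            found[symptom] = found.get(symptom, False) or not negated
    return found


def build_profile(diagnoses, labs, medications, note):
    """Merge structured data and affirmed note symptoms into one feature dict."""
    profile = {}
    for code in diagnoses:
        profile["dx:" + code] = 1.0
    for name, value in labs.items():
        profile["lab:" + name] = float(value)
    for drug in medications:
        profile["rx:" + drug] = 1.0
    for symptom, affirmed in extract_symptoms(note).items():
        if affirmed:
            profile["sym:" + symptom] = 1.0
    return profile


def risk_score(profile, weights, bias):
    """Logistic score in (0, 1); the weights here are supplied, not learned."""
    z = bias + sum(weights.get(k, 0.0) * v for k, v in profile.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Calling `build_profile` with a list of diagnosis codes, a dict of lab values, a list of medications, and a note string returns a single dictionary of features. `risk_score` then turns that dictionary into a number between 0 and 1 for whatever weights are passed in.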
