ML #8 - Open Healthcare Datasets

Hosted by Mike Mastanduno

April 13, 2017 - 20min

Many people want healthcare data to play with, but don't know where to find it. In this chat we'll provide you the data resources you need to start doing machine learning.

Full Broadcast Transcript


Narrator: Welcome to the weekly live broadcast with the Health Catalyst
data science team where we discuss the latest in machine-learning topics with
hands-on examples. Here’s your host, Levi Thatcher.

Levi: Welcome. This is the Machine Learning broadcast. Mike Mastanduno, I feel like
it’s been a while.

Mike: Yeah, Levi. We haven’t been on this side of the desk together in quite some time.

Levi: Yeah. I feel like I’ve missed it.

Mike: It feels like– what was it? Two weeks?

Levi: It’s been a while.

Mike: Yeah.

Levi: I was out last week. I heard, you know, things went amazing while I was gone. Sorry, I’m back. We’ll try to get back in the groove here.

Mike: Yeah. It’s great to be back. So what are we talking about today?

Levi: So open healthcare datasets. So how do you find data to play with? So let’s
imagine that you’re wanting to use our package. And 1000 rows
would give you— aren’t enough. So how do you go and find more data to play
with since healthcare is such a locked down environment, so many security
concerns. We’ll dive into that. And I’m excited for it.

Mike: Totally. It’s going to be awesome.
So, I guess, we have some bookkeeping to do at the beginning of every show. We
want to remind you to log into YouTube so you can chat with us. We want to
answer your questions on the air. Feel free to try and engage with us. And if
we’re not too flustered, we’ll definitely try and respond. Make sure to increase
your video resolution with the gear in the lower right.
We’re going to be doing a lot of code work today so it’ll definitely help.
Nice to see you, Jonas. Glad to have you back. You’re a regular subscriber.
The rest of you, please subscribe. Our broadcasts are better by having more
people watching them and a better community. And on that note, you can join
our Slack channel community, healthcare-ai. And we can continue the conversation
throughout the week.

Levi: Yeah, yeah. Keep it going throughout the week so that things just don’t drop off
because we want to actually have a discussion back and forth. And so, the Slack
Channel has been great for doing that. We’re excited to build on that progress
we’ve made.

Mike: Yeah.

Levi: So what’s first, today? What are we diving into?

Mike: So, today, we’re going to do the mailbag. We missed it last week to talk about an
excellent article that was in the New Yorker. Again, if you haven’t read it, I hope
you will. It’s by Siddhartha Mukherjee who’s a Pulitzer Prize-winning author and

Levi: Ooh.

Mike: And he writes a great article about AI and medicine. Again, it’s in the New
Yorker. Check it out.

Levi: It looks like 20 pages long, probably.

Mike: Yeah, it’s a 20-minute—

Levi: But it’s amazing.

Mike: Yeah, a 20-minute reading.

Levi: It’s amazing.

Mike: It was awesome. And so, we’ll do some questions from users. We’ve had a few
pile up over the course of the week. And then we’ll dive right into some
healthcare datasets.

Levi: Yeah. And we’ll do a demo for live coding. Stay tuned.

Mike: We’re going to put Levi on the spot and see if he can—

Levi: Yeah, a little bit–

Mike: See if he can code under the gun.

Levi: –trepidation there.
So first question, from Paul, “Would dynamic machine learning be better than
static machine learning in healthcare?” So this has come up a lot. Hot topic.

Mike: Yup.

Levi: Dynamic ML? It seems dynamic as it is. The machine’s learning.

Mike: Yeah. So what is the difference between— what does Paul mean by dynamic machine learning?

Levi: Yeah. So this is a good one. And what people are wondering is, when you have a
model that’s driving risk stratification. So let’s say we’re predicting each day who
are our most high-risk patients for sepsis. Does that model learn each day that
it’s in production? Is it constantly getting better? Because you feel like– okay,
the robots in the movies. They feel like they’re learning from their environment,

Mike: Yeah, for sure.

Levi: Yeah. Well, what about our sepsis model?

Mike: Well, our sepsis model is not learning from its environment because it’s built–
you know, there’s a lot of different machine learning algorithms.

Levi: Yeah.

Mike: Some are dynamic, some are not. So the sepsis model is built on a random forest
which kind of takes this huge dataset, and digests it, and learns all it can from
that dataset all at once. And then it’s just applying that knowledge to new rows
and new predictions. There’s another class of machine learning models that can
take feedback from new predictions. And those would be like Bayesian classifiers
or recurrent neural nets.

Levi: It’s a little more like ongoing over time—

Mike: Yeah. So they improve over time. But the good news about the sepsis app is that
sepsis– we’re looking back three years and we don’t expect sepsis to change a
whole lot in the next, like, month. And if it does, we can re-train our model and
get that new input.

Levi: That’s a great point. So with sepsis and a lot of these clinical use cases, you’re
predicting on 30,000 to 60,000 rows. And then you tweak. Like, for CLABSI,
which is a central line infection, you’re predicting for those folks that have a
central line in them. So it’s only a few hundred people a week. So your data
doesn’t turn over from week to week, really. You will take advantage of the past
data and not really worry too much about necessarily what you’re learning that
week. You’re just trying to predict for those folks that week. So I think that
covers like a lot of machine learning use cases and a lot of different algorithms.

Mike: Yeah. And I guess, it’d be worth stating that, again, it comes back to your use
case. If you have a really high turnover dataset, you might get more value out of
a dynamic algorithm.

Levi: It’s some kind of what’s called online learning?

Mike: Yeah.

Levi: I’ve read that term. So that’s something you can google if you’re interested.
So great question, Paul— Paul 1.
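
A minimal sketch of the static versus online distinction discussed above. It is not the sepsis model: the data are synthetic, and logistic regression is swapped in for the random forest purely to keep the example short. The variable names and the learning rate are assumptions made for illustration.

```r
# Hypothetical synthetic data: one feature x and a binary outcome y.
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))

# Static (batch) learning: fit once on all historical rows, then only predict.
static_fit <- glm(y ~ x, family = binomial())

# Online learning: start from zero weights and nudge them after every new row
# with one stochastic-gradient step on the logistic loss.
w <- c(intercept = 0, slope = 0)
learning_rate <- 0.05  # assumed value, chosen only for this illustration
for (i in seq_len(n)) {
  features <- c(1, x[i])
  p <- plogis(sum(w * features))                   # current predicted probability
  w <- w + learning_rate * (y[i] - p) * features   # update from this single row
}

print(coef(static_fit))  # weights learned all at once
print(w)                 # weights learned one row at a time
```

The static fit has to be re-trained to absorb new data, while the online weights keep moving as rows arrive, which is the trade-off the hosts describe for high-turnover datasets.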

Mike: Paul 1.
So second question from Paul 2. Paul 2 said that a barrier for machine learning
in healthcare is, basically, “Why should we spend resources on this? We don’t
even know that it works. Formal studies haven’t shown effectiveness.” And he
just asked us to comment on that. So I’d love to comment on that, Paul 2.
Thank you.
I think, why should we spend resources on this, we don’t know it’s good? Well,
machine learning is in every aspect of our lives these days whether you like it or
not. Any website you visit, any marketing effort that’s directed at you has been
done by machine learning. And I would say that the internet is quite a bit better
than it was five years ago even.

Levi: Oh, definitely.

Mike: Netflix is a great experience–

Levi: Oh, man.

Mike: –and machine learning makes that possible so it seems silly to think like, “Oh,
just because it hasn’t been validated, it can’t help.” It’s helped in
every other industry.

Levi: Yeah. It’s like saying, “Okay. Well, why should we use math to do this when we
could just like guess?” I’m not making fun of anybody, really. But just like
that’s— so machine learning is math. And so, it’s like, “Okay, if we’re all about
data and excited about data, it makes sense to use math to analyze data.”

Mike: Yeah.

Levi: [inaudible 00:06:24].

Mike: Well, I guess, the other side of the coin is that it is healthcare so this is–

Levi: It’s a special arena—

Mike: Yeah. Potentially, your medical advice, you don’t want that to be coming from a
machine that’s unvalidated so.

Levi: Yeah.

Mike: Yeah.

Levi: They’ll test it, you know?

Mike: It could be sticky too. But I think the benefit is going to show in the years to come.

Levi: For sure.

Mike: Yeah.

Levi: Yeah.

Mike: If we have anything to say about it.

Levi: Exactly.
Great question, Paul 2. We’ll find out your— yeah. We’ll go to Paul’s.
So question 3 from Francis, “Do you provide example code for a lunch-and-learn-type

Mike: So Francis was—

Levi: [inaudible 00:07:03] Francis an email?

Mike: Yeah. So Francis sent us an email and he said he was tasked with giving his
organization a lunch and learn, or a brown bag, on kind of like using R for
healthcare data analysis. And, do we have any example code? So, Francis, I think,
we’ll respond to you separately in case you miss this – which would be crazy
because who would miss this? But we have the example code in the package documentation.
You could do some things with that. And then we also have a few R notebooks
that are— I think they are hosted on the blog.

Levi: Yeah.

Mike: You could use as a starting point for what you’re interested in covering.

Levi: Reading our blog, never a bad idea, Francis. I should read it more, in fact.
So check out the packages to see those examples that are
tied to the documentation. And then, on the blog, there are the notebooks. Just
a repeat of everything that Mike said, essentially. So I understand everything,
you know? I like to repeat stuff.

Mike: Cool.
So great mailbag this week. Please keep the questions coming. We enjoy getting
the feedback and thinking about how– it helps us learn new things too which

Levi: Yeah, yeah, yeah. Online learning. And that was interesting. Okay.

Mike: So yeah.

Levi: So, mailbag is good.
What’s on the docket? Open healthcare data. So open, like, why do we say
open? I guess, we could start with.

Mike: So open healthcare data, that’s actually a pretty small percentage of data
because most healthcare data is protected. So open healthcare data would be
data that we could go onto the internet and download and use for testing our
analyses or playing around with machine learning.
So this is a healthcare show so it’s nice to talk about healthcare-specific datasets.
We don’t want to have to point you to stock exchange or sports datasets
because our package is really— it’s really geared towards healthcare. So that’s
So we thought we’d kind of talk about basically where you could get some
healthcare data. And it does exist out there. There’s a good amount of it, if you
know where to look.
So if we go to my screen, we can see one place that’s really great for imaging.
There’s this Grand Challenges in Biomedical Image Analysis website. And it has 80
or 90 different datasets that are imaging. Let’s see, all challenges. I can click into
that. Maybe it will show us something.

Levi: How did you find this?

Mike: I found this from googling around. So here’s one on endoscopy. Here’s one on
retinal scans. This one’s on cataracts, coronary arteries. And I think there’s like
80 or 90 different imaging datasets. So if you want to get into some image
analysis or neural nets, this is a great place to go.
And then, of course, you have like population health-type datasets. So this is the
website for the CDC. And there’s a whole bunch of datasets on all sorts of
different population health and—

Levi: So much broader than—

Mike: Yeah, disease metrics and—

Levi: –what we did with—

Mike: So here’s one on Lyme disease. And frequently, motor vehicle occupant death

Levi: Whoa, that’s pretty cool.

Mike: So that’s kind of a fun way to do some population health analysis. And actually,
on that note, there are some blog posts on that even. And I think Levi posted
some code and other datasets on population health.

Levi: Check out the blog. Because we love—like, we want the blog to be about all
healthcare data and not just health system data.

Mike: Yeah.

Levi: So it’s all fair game.

Mike: And then finally, we can look at things like Kaggle which is a way to find any
dataset. And so if you go to Kaggle and then click datasets, you can find all of
these user-contributed datasets. And so, there’s stuff like FIFA player datasets
and product back orders, credit card, fraud detection. But we want to see
medical data too, so like–

Levi: Medical [inaudible 00:10:59].

Mike: So like there’s a great dataset on diabetic retinopathy which the Google Health
Group in England used to do some pretty cool things. And then, I think, they
published it.

Levi: This is actually the images of their–

Mike: Yeah. I think that’s their dataset. And then—

Levi: It’s a nice search box.

Mike: There’s another great one that’s kind of interesting from the perspective of our
data which is more like electronic medical record data. Say, it’s this one.
So we get a lot of our data from electronic medical records. And this isn’t exactly
what an EMR might look like but this is a pretty good dataset of whether patients
miss their appointments or not. So it’s pretty cool to see kind of like patient
demographics and health characteristics that then relate to whether or not
someone missed an appointment.

Levi: Yeah.

Mike: I think this one is 300,000 rows.

Levi: That’s a lot.

Mike: So there’s quite a bit of data here to play with. And actually, that’s what we’re
going to do on this portion of the show.

Levi: Whoa.

Mike: We’re going to turn it over to Levi and let him download and explore the data.
And we can do it along with— along?

Levi: Yeah, yeah. Whoa, whoa. Wow.
Okay. So 300,000 rows. Is that big data?

Mike: Definitely getting close to getting big data.

Levi: Big data?

Mike: I like that. Definitely getting close.

Levi: Where’s the threshold?

Mike: I mean, it seems like random forest is definitely going to be fine for this dataset.
It might not be as good as a neural net but I think it depends on the data.

Levi: But we can hold it– yeah.

Mike: Yeah.

Levi: We can hold on our laptop so you don’t need some Hadoop cluster—

Mike: Definitely.

Levi: –to hold this data and process it?

Mike: Yeah. The download’s only 4 MB. And once you unzip it, I think it’s about 25.

Levi: Okay

Mike: So, no problem holding it in memory.
But, Levi, I’m going to throw you under the bus and let’s get into some code.

Levi: Yeah.
Okay. I think we can do this here. So I’m going to take over.

Mike: There you go.

Levi: And let us download and crunch some data.
So for those of you following at home, we’re going to— well, let the screen adjust
here. We’re going to download some data and we’re going to use what Mike just
described. All right.

Mike: Maybe I can talk a little bit about what’s in that dataset while Levi’s—

Levi: Yeah. Well, maybe talk about why is no-show data interesting. Like, what can
you do with no-show data–

Mike: Ooh, yeah. That’s a good point.

Levi: –like, in terms of machine learning?

Mike: I like it. You’re thinking about the business use case, you know?

Levi: So I like the business questions.

Mike: That’s the first step in any machine learning pipeline. It’s like, “Why are we doing
it? What’s it going to help with?”

Levi: Yeah. So we’ve done this with Catalyst?

Mike: We actually have done this with Catalyst. We have a no-show model in
production with one client right now and another one in development. And that
client is using it to— what are they doing? They’re either double booking slots or
using extra reminder calls before appointments because missed appointments
are bad for the hospital. They slow down [crosstalk].

Levi: Bad for business. Yeah, so either of those use cases are super helpful.
So if you could see my screen here. Let’s focus on the screen for a moment. So
we have this dataset that Mike mentioned. And you can google it easily. Kaggle,
medical appointment no-shows.
Now, we’re going to be working in R here to actually do something with this
data. So let’s go to R first, now that we’ve seen that website. So this is RStudio.
And I think we’ve introduced this before.
What we’re going to do to start out is create a new project. And so,
RStudio is fantastic for working with data. So, on the top right, you’ll notice, you
can click on this project tab and then go to “new project.” And what we’ll do is
click on– okay, let’s put in a new directory.
The project is basically a way to group your work. So if you have some datasets,
some files, you want to keep those organized. And that’s what this allows you to do.
So let’s say, “Yeah, we’re not going to create a package. We’re going to create an
empty project.” And so, here, you would type in your directory name and you
decide where you want to put it. And I won’t do that now because I actually
created this but for those of you at home doing this, go ahead and type a
directory name. Click “create project”. And I’ll just go ahead and open one that I
created earlier called “open demo”.
And what we’ll do here is we’ll actually show you how to load in a dataset from
Kaggle. Okay. So what you do is you go ahead and look at this appointment no-show
page. And you click “download”. And you might notice, instead of just
downloading to wherever you want, you can click “save link as” and put it in the
directory you just created using RStudio.
And so, you can find your directory there. And then you have to go and unzip it.
But that’s fairly easy.
And so, if you go back to RStudio now what you’ll notice is, in that project, you’ll
have several files. And so, you’ll have your .Rproj file, a csv file which
represents your data. So maybe a good question for the beginners, csv file? You
know, in grad school, I never really played with csv’s. Did you?

Mike: Csv’s?

Levi: Yeah.

Mike: Yeah, that’s how you get data from one place to another.

Levi: See, we used different datasets. We used like these netCDF files because we
were looking at geographic data.
But you used them for your medical imaging work?

Mike: Yeah, it was easy. Just because it was easy to write to a csv and read from it. It’s
kind of a simple standard.

Levi: Yeah.

Mike: It stands for comma-separated values. So it’s a table format where a new cell is
drawn just when there is a comma. Like, that means new cell.

Levi: Yeah.

Mike: So it’s pretty easy to create one, pretty easy to read one.
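
To make the csv description concrete, here is a small sketch using base R only. The three rows and the column names (age, gender, status) are hypothetical and are not taken from the Kaggle file.

```r
# A hypothetical three-row table written out as comma-separated text.
# The first line holds the column names; every comma starts a new cell.
csv_text <- "age,gender,status
34,F,Show-Up
61,M,No-Show
47,F,Show-Up"

# read.csv() is part of base R; text= lets it parse a string instead of a file.
toy <- read.csv(text = csv_text)

print(toy)       # the same data as a table of rows and columns
print(dim(toy))  # number of rows, then number of columns
```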

Levi: Exactly. And you have rows and columns, so it’s tabular data.
So you’ll notice my [inaudible 00:16:31] here, csv and then I have this R demo file
which you won’t have. And then you’ll have an R history file spring up.
But let’s pretend that we don’t have this demo file yet. We’re going to create this
on the fly. So for those of you at home following along, let’s go ahead and create
a new script.
And we’ll do a couple of things here. First off, we need to install the tidyverse.
And I’ve done this already but I’ll show you how you can do it at home.
#install.packages('tidyverse'). So, you’ll have to have that. And you also have to
have ggplot, too. And I put the little hashtag in front because the hashtag is how
you put a comment into code. And the comment basically means don’t run this.

Mike: So this is just kind of just a note for yourself at this point? Like—

Levi: Yeah. It’s just, you know, this code does this so you can tell yourself later on, if
you forget, that sort of thing.

Mike: Yeah, comments are really important for code because if you give your code to
somebody else, they know what you’re doing so.

Levi: Yeah. And oftentimes, like later, like a month or six months later, you look at
the code that you’ve written—

Mike: Six hours later.

Levi: Six hours, yeah. That’s like, “Wait, what was I thinking here?” All the time, so
comments are amazing.
So what we’re going to do is load this dataset into d, which will represent our
data frame or our— well, we’ll talk about that in a minute. But basically, you do
assignment by this <-, a less-than sign and a hyphen. read_csv brings in the csv file.
And if you press Tab in RStudio—hmm, it’s not working for some reason. But the
idea is you simply have to type the name of the file. And so, we can see, if we do
this on the fly here – Issue-comma-300k.csv. And this is using the tidyverse
package. And so, if we run this. Oh, that’s a good point. So it says, “We cannot
find that function.” And so, to load the function in from a particular package, you
do library and we’ll load in the tidyverse.
So library basically says, “Okay, well, we already downloaded this package. Now,
let’s bring it in the memory to be able to use the functions that came with it.” So
now, if we go ahead and run it. Okay. So it’s chugging along there. Now, 300,000
rows, decent size.
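
The load steps just described can be sketched roughly as follows. So that the example runs anywhere, it writes a tiny hypothetical csv to a temporary file as a stand-in for the Kaggle download; the column names and values are assumptions, and the tidyverse is assumed to be installed already.

```r
# install.packages("tidyverse")  # run once; the leading # turns it into a comment
library(tidyverse)                # attaches readr (read_csv) and ggplot2, among others

# Stand-in for the downloaded file: a hypothetical csv in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("age,gender,status",
    "34,F,Show-Up",
    "61,M,No-Show",
    "47,F,Show-Up"),
  csv_path
)

# `<-` assigns the result of read_csv() to the name d.
d <- read_csv(csv_path)

d  # typing the name prints the tibble in the console
```

With the real download, csv_path would instead be the name of the unzipped csv sitting in the project directory.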

Mike: Yup.

Levi: If you don’t have a lot of memory on your computer, this may take a moment
here but we have it now in our memory. And so, if we type d down here, look at
dataset. Oh, there we go. Okay.

Mike: Cool.

Levi: So let’s focus on the console for a minute. So console comes up a lot in data
science, programming. What’s the console, Mike?

Mike: The console is kind of like where stuff is evaluated, the ad hoc place to test
stuff. So if you type 2+2 and hit enter, it will return 4. That’s where output comes
out. So if you run a script that will print the first five rows of that data we just
loaded, it will come up in the console.

Levi: Yeah. So basically how you interact with the program, a lot of the time. And so,
we see in the console here, we have this dataset and we have our columns – age,
gender, appointment, registration time and date, day of the week. Stuff that
relates to no-shows. Awesome, just what we expected.

Mike: Yup.

Levi: And you notice, this is a tibble. So the tidyverse uses what are called tibbles
which is a nice tidy way of looking at data. You might have heard of data frames.
So it’s something that’s similar to that, just a little more tidy. And so, you know,
so we have 300,000 rows, 15 columns. And not all the columns are shown here.
They’re listed down below. And this is one of the beautiful things of tibbles is that
you have all the different columns listed that weren’t shown here.
So, okay, well we brought the data in. And for some people, that’d be really cool.
For some people like, “Okay, well [inaudible 00:20:17] loaded data, great.”
So I guess, where would you start with, you bring in a dataset, what are the first
things you do with data when you start to look at it?

Mike: Oh, the first thing I do is I want to know what’s in it so I would just get the
column names—

Levi: Okay.

Mike: And so, you could do something like names of d and get the column names, get
all of them.
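
A short sketch of those first-look commands, run on a small hypothetical tibble rather than the real no-show data (the columns here are assumptions):

```r
library(tibble)

# Hypothetical stand-in for the no-show data.
d <- tibble(
  age    = c(34, 61, 47),
  gender = c("F", "M", "F"),
  status = c("Show-Up", "No-Show", "Show-Up")
)

names(d)    # just the column names
head(d, 2)  # the first two rows
d           # printing a tibble also reports its dimensions and each column's type
```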

Levi: Okay.

Mike: Does the tibble there show you all the column names?

Levi: Yeah, yeah. So we haven’t talked about tibble much, Mike and I. But here’s the
tibble. So actually, yeah.

Mike: Oh, so it does show you all.

Levi: It’s quite nice.

Mike: Man, that’s helpful.

Levi: And the tibble– above and beyond the data frames, the tibble actually shows the
type of column you’re looking at.

Mike: Oh, that’s great.

Levi: Yeah.

Mike: So that tells you what’s what.

Levi: Tools are pretty fantastic. And so, let’s actually do something with this, right? So
ggplot. We loaded that in. Well, we installed it. And we’re going to load it in now
because ggplot is how we like to plot. And you can plot with base R.

Mike: Yeah, ggplot looks a lot nicer though.

Levi: It can look— yeah, in a lot of cases, it does–

Mike: It does a lot of really handy things for you.

Levi: It’s super handy.
Okay. So we load in our function from our ggplot.
And let’s go ahead and do something here. So I am lazy and I like to google stuff.
So let’s just pretend we don’t have this page up. And let’s say, “Ggplot, give me a
histogram.” Done. Okay. So it’s the top link, click through. Just for reproducibility
so you can all do this at home.
Now, in documentation you often have, you know, arguments and things like
that. But a lot of times it’s easiest and most exciting to just skip to the example.
That’s what we’re going to do. So let’s skip down to the example and grab the
first couple of lines here. We’ll adapt it to our use case. And we’ll start with the
histogram. So a lot of times you want to look at the distribution.

Mike: Yeah.

Levi: You want to see like, “Okay. Well, how does my data look?”

Mike: That’s probably this exploration process, right?

Levi: Yeah.

Mike: Once you want to know what’s in the data, graphing it is a great way to do that.

Levi: Yeah. So before you do machine learning, it’s best to kind of see, “Oh, well, what
data do I have?”
So we’ll copy this. Drop it into RStudio. And we’ll change a couple of things. So
the first element here to highlight is the name of the dataset. So that’s now d.
And we have to choose a column to look at. So the ggplot syntax is fairly simple.
So you’ll be working in this aes object a lot of the time. Or we’re working with
whatever you call the aes indicator here.
And so, this is where you say, “Okay, well, what is the column that we’re going to
be looking at on the x or y axis?” Histogram is nice because you only have to
specify one.
But let’s look at age real quick. And so, we just go ahead and put age in there and
run these two lines. Notice, we don’t have to load in the dataset again. Already
in memory. So we can click run.
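
The two adapted lines might look something like this sketch. The ages are randomly generated stand-ins, not the Kaggle data, and binwidth = 5 is an assumed choice.

```r
library(ggplot2)

# Hypothetical ages standing in for the age column of the no-show data.
set.seed(1)
d <- data.frame(age = round(runif(200, min = 0, max = 90)))

# First argument: the dataset. Inside aes(): the column mapped to the x axis.
# A histogram needs only x; the bar heights are counted for you.
p <- ggplot(d, aes(x = age)) +
  geom_histogram(binwidth = 5)

print(p)
```

If binwidth is left out, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a value, which may be the kind of notice that shows up when the documentation example is run unchanged.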

Mike: With those lines highlighted though, right?

Levi: Yeah, highlighting the two lines of interest.
There we go. So we get some sort of warning there. I don’t know if we still have
to worry about that.
But to get to the punchline, we have our histogram here. And it’s fairly basic
since it’s just very minimal code. But the idea is you have an example of, okay,
well, you know most of the folks in the dataset, they’re you know 50, 60 years of
age and pretty steep drop off.

Mike: That kind of makes sense, right?

Levi: Yeah.

Mike: So you’re probably going to have more people that age coming to the hospital
than anywhere else.

Levi: Yeah, yeah. It’s interesting but it’s not like normally distributed. So notice how
fast it drops off for folks over 60. Hmm, maybe there’s a different Medicare clinic
or something that the older folks use.
But there we go, an example of how to start looking at data once you’ve
downloaded it from Kaggle using modern tools that are free and easily set up.
Download it from the internet and get going using what we showed here.
So, Mike, any questions? Any comments?

Mike: So now, what would be the next step if you already used ggplot?

Levi: Yeah. So, that’s a great question.

Mike: I guess, maybe, I’d want to look at— maybe how the age is associated with
each gender or something like that?

Levi: Yeah, yeah.
So we could start with some box plots, that sort of thing. So there’s a lot of good
documentation on ggplot. So if we go ggplot boxplot. If you’re familiar with
boxplots, or if you’re not, they’re fantastic. And you can see all the
documentation here that you can easily grab. So if we scroll down to the
example code, you can easily see– okay, well, wow, two lines of code there. It
gives me that nice boxplot. And you can kind of take it from there and google
around for other ggplot things that you want to do. So scatter plots—

Mike: Yup.

Levi: And filling in lines or just [inaudible 00:24:52] distributions are cool.
But that’s the demo for today. Anything that we need to go over? Any
housekeeping items or anything? We want to keep these tight.
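
As a possible next step along the lines Mike suggested (age by gender), a boxplot could be sketched like this. The data are again randomly generated and the column names are assumptions.

```r
library(ggplot2)

# Hypothetical data: 100 rows per gender with random ages.
set.seed(2)
d <- data.frame(
  gender = rep(c("F", "M"), each = 100),
  age    = round(runif(200, min = 0, max = 90))
)

# A boxplot needs both axes: the grouping column on x, the numeric column on y.
p <- ggplot(d, aes(x = gender, y = age)) +
  geom_boxplot()

print(p)
```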

Mike: Yeah. We had a couple of questions through the chat we could touch on real
quick. Someone asked why we thought Netflix maybe switched from a
thumbs up/thumbs down to a star rating system? And I didn’t know that
happened but I gave my best guess that maybe there’s more gradation within the star
system so you get a more accurate representation, or maybe it’s just more
familiar to users?

Levi: Yeah.

Mike: And then we had some people asking about our— we had someone ask about
what we spend our days doing but I think we’re going to have to save that for a
future topic because there’s just too many things to think about.

Levi: A lot of stuff. Yeah. A lot of building, package work. Also, the good stuff,
education. In terms of Netflix though like they’ve actually recently switched even
from stars to an actual percentage as to how much you liked this movie.

Mike: Oh, did they really?

Levi: Yeah. So it’s super precise now.

Mike: Well, that’s the feedback they give, right? But what about the feedback that goes
into it?

Levi: Yeah. Yeah. So the ratings that go in could still be stars. I’d have to check.
Man, great questions. Join our Slack Channel. Join the community.
And you can sign up for the emails that we send with our blog posts.
Check us out. Subscribe. Like and share.
Thanks for joining. We’ll talk to you all, next week.

Mike: Thanks, everybody.
