ML #11 - How Do You Evaluate Model Performance?

Hosted by Mike Mastanduno

May 04, 2017 - 20min


This week we'll talk about when and how to evaluate machine learning model performance. It depends on the type of model, the type of data, and, as always, the business objectives. Join us this week to learn which performance metrics are the most appropriate for each use case.

Full Broadcast Transcript


Levi: ~joining us. Hands-on healthcare machine learning. I’m Levi.

Mike: And I’m Mike.

Levi: This is Mike. We did this last week. Were you here last week?

Mike: No.

Levi: I think we had some [inaudible 00:00:10].

Mike: [inaudible 00:00:10]. Yeah, we had some dinner and Taylor was in last week.

Levi: [inaudible 00:00:13] fun?

Mike: Nope.

Levi: Just working?

Mike: Yup. So it just wasn’t my turn, I guess.

Levi: Well, glad to have you back.

Mike: Yeah, great to be back on the show.

Levi: Yeah. So today, well, we’re getting some exciting stuff, evaluating model performance. I can’t wait.

Mike: Yup.

Levi: Hopefully, you’re all excited.

So just some logistics here before we get started. So we want to encourage you to contribute to the conversation in YouTube so be sure to log into your YouTube account. And we’re going to show on the screen a little bit, so you may need to adjust your resolution to HD. We’ll let you figure out which level of HD. And we also encourage you to join the community at – blog posts, get notices as to when the broadcasts are coming up and that sort of thing. And join the Slack Channel as well. You can find all those links at

And let’s dive into what we have on the docket. So mailbag?

Mike: Yeah. We’ve got a mailbag this week. And then the main event. And we’ll be taking questions and answering them throughout the whole show so.

Levi: Beautiful, beautiful.

Okay. So first mailbag question is, “How can I progress my career in data science?” We get this question a lot. And it’s a good question.

People are excited about data science. They want to maybe shift to a new role at a current company, maybe change companies. So, have you thought about this, Mike? I mean, I had a couple responses but I don’t want to bias you or lead this into a particular direction.

Mike: Yeah. So I’ve always thought, kind of, there aren’t too many degrees in data science. And, I guess, you start to see business intelligence and data analytics master’s degrees. But I think the thing that people are really looking for when they’re evaluating data science candidates is, “Can you do the work?” So the best way to showcase that is to have exciting, interesting side projects hosted on GitHub using the tools you use at the job. And so, yeah, if you can showcase yourself with almost like a portfolio, I think that’s a great way to kind of sell yourself.

Levi: Portfolio idea. Yeah. That’s fantastic. And Mike himself recently left academia and made this transition.

Mike: Yeah. And I found the portfolio to be pretty helpful for that.

Levi: Yeah. That’s a great point. So blog posts and GitHub. And so GitHub can be your own personal projects or even contributing to other folks’ work. You know, even the ones that kind of works on their own.

Mike: Yeah. Just something you can speak to about your own experience and why you did what you did and the choices you made along the way.

Levi: Yeah, that’s super helpful.

Mike: Yeah. 

Levi: And don’t be afraid of putting stuff out there that might seem trivial because other people would want to see those first steps. And it helps you to learn things more in-depth-ly. Whenever I write something out, it feels like I’m learning a little bit better than if I didn’t write it.


Go ahead.

Mike: I was going to say, I think the Data Elixir Newsletter or the Data Science Weekly Newsletter, they gave me a lot of good examples of kind of analyses people have done in their spare time.

Levi: Yeah, fantastic.

Google data elixir, it’s amazing.

One of the things I had mentioned, so somebody e-mailed us and said, “Hey, how do I get into data science?” And so, I gave him a couple of things. And one thing I said that has helped me is to become known as the machine learning person at your work. So maybe your company’s not doing machine learning yet – and maybe they should be. So be the person that knows, in the context of your current business, how machine learning could help and be able to explain it clearly to executives and others. And soon you’ll be known as that person –the ML girl or guy that is the go-to whenever questions arise about machine learning. And that will help you transition eventually in your current company or give you enough projects that you can move to a different company and do data science, if that’s your goal.

Great questions.

Mike: We did have a question in the chat about whether MOOCs and Kaggle contests are equivalent to that. And I would say, they are a good piece of that portfolio but I think having something on GitHub, that you’ve conceptualized yourself and done the analysis, is really important because a lot of these data science jobs, they’re not going to be “Here’s a dataset and a problem. We want you to solve it.” It’s going to be more like, “Here’s a problem. Can you help?” And so, they’re more open ended. And that’s kind of the downside to the Kaggle competition. There’s a lot of structure to it. But I think both of them can be useful tools together.

Levi: For sure.

So it’s super helpful to find a use case and see like, “Okay, well how will my machine learning apply here?” because then you’re dealing with messy data and you have to kind of sort out the real world implications and logistics that aren’t always so easy.

Mike: And kind of picking the use cases is half the battle, you know?

Levi: Yeah, yeah.

Mike: You have to find a use case that’s interesting and the Kaggle competition takes that out of the mix.

Levi: Yeah. Maybe it kind of helps you to determine like what machine learning can do as an initial step. And then from there, if you go ahead and apply it to your business or others.

Great questions.

Mike: Yeah.

Levi: So second question of the day was, “What version of R is required for” Fantastic question.

And so, just for those who don’t follow the R Community very well, it feels like quarterly— it’s fairly frequent that the base R version is updated. And great question.

So we actually need version 3.2.3, which came out, it feels like, maybe last year in the spring or summer. So they’re up to, actually, 3.4.0 just barely came out. And the required version is noted in the documentation of the description file, if you go ahead and pull down the repo or download our code. But 3.2.3 is the answer there. So any newer version is great. If you’re able to update, that’s fantastic. You shouldn’t see any problems.

Great questions. Keep them coming in, especially in the chat window. We love to see what you’re wondering as the discussion flows. We want to be able to contribute and learn from you since this is a two-way discussion. And also, on the Slack Channel for afterwards, so after the broadcast. How can we help with your projects? Where are you getting stuck? What kind of things are you working on? We want to be able to work together and learn from you so that we’re not working in silos separately.

Mike: And just for the community’s benefit, 3.2.3 is named Wooden Christmas-Tree.

Levi: Oh, Wooden Christmas-Tree. That was one of the favorites.

Mike: Yup. So it would have to be more recent than Wooden Christmas-Tree. And Another Canoe, which is 3.3.3. That’ll work.

Levi: Oh, I love Another Canoe.

Mike: That’s the one I’m on. Or You Stupid Darkness is 3.4.

Levi: Oh, that’s kind of random.

Mike: Where do you think they come up with the names?

Levi: I don’t know. Pointing to the dictionary, I guess.

Mike: When we release a new, are we going to give it a name like this?

Levi: We might need to get names. That’s true. It seems pretty common out there.

Mike: Speaking of which, we are going to be releasing a new version fairly soon.

Levi: That’s true.

Mike: Hopefully, in the next couple of days. And it’ll have some great functionality for feature availability profiling and generating AUCs, and what else is in–?

Levi: Yeah. Glad you brought that up. So it doesn’t tie you into SQL servers. That’s one thing that we’ve gotten. So a lot of people don’t use SQL server in their personal work. They want to work with csv files. And so, in this new version, or release 1.12, it will be in the next couple of days, you’ll actually be able to work both pulling from and pushing predictions to csv files or JSON files or what have you.

Mike: And we’ll come up with a catchy name for it too so it doesn’t have to be 1.0.

Levi: Yeah. Yeah, yeah, yeah. Exactly.

I saw an Ubuntu release the other day that was Precise Pangolin. I like the ring of that, alliteration is good.

Mike: Well, we can’t steal that though.

Levi: No, no. Something similar, perhaps.

Mike: All right. Well, enough chatter, Levi.

Levi: I just wanted to mill that. Sorry.

Mike: Let’s get into the main event here.

Levi: Okay. So what are we talking about today?

Mike: So today, we are talking about performance metrics and how you evaluate machine learning models. And so, we’ll get into that.

We’ll talk about classification and regression problems. And then, kind of, how do performance metrics change over time, because all these things, you’ve got to consider all of them.

So, first one, let’s start with classification. So classification problems, everybody remembers, I hope, that you’re predicting a probability between 0 and 1 with the idea of giving a binary outcome – yes or no. And so, the simplest way to evaluate a classification problem is accuracy which is just “how many did you get right? How many did you get wrong?”

That has a few problems. With the most common one being kind of, let’s think about trying to find a rare disease. And so, if we built a machine learning model that just always predicted that the patient didn’t have the disease. If that disease is rare enough, accuracy is going to be pretty high. So maybe if the prevalence of the disease is 1%, you always say no, your accuracy is 99%. That’s pretty good.

Levi: Pretty good and pretty deceptive.

Mike: Yeah, it’s pretty deceptive. That’s a great way to [inaudible 00:09:06]. It totally is, because that’s a great accuracy, a high performance metric, but that’s the wrong performance metric because it’s not telling you anything about the model, and your model is terrible because it never identifies any of the sick people.
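Editor’s sketch of the rare-disease example above, in plain Python. The population of 1,000 patients is a made-up illustration (only the 1% prevalence comes from the discussion), and the "model" is simply a rule that always answers no.

```python
# Hypothetical population: 1,000 patients at 1% prevalence, so 10 are sick.
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts that the patient does not have the disease.
y_pred = [0] * len(y_true)

# Accuracy: the fraction of patients where prediction and truth agree.
correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
accuracy = correct / len(y_true)

# Of the truly sick patients, how many did the model actually flag?
sick_total = sum(y_true)
sick_found = sum(1 for truth, pred in zip(y_true, y_pred) if truth == 1 and pred == 1)
recall = sick_found / sick_total

print(f"accuracy: {accuracy:.0%}")
print(f"sick patients found: {sick_found} of {sick_total}")
```

The accuracy comes out at 99% even though none of the ten sick patients is found, which is the deception being described.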

Levi: Yeah. The yes’s are what’s most interesting, it feels like.

Mike: Right.

So we need to pick performance metrics that evaluate the yes examples and the no examples a little bit more fairly. And so, one way to do that is area under the curve of a receiver operating characteristic curve.

Levi: That’s a mouthful.

Mike: Yeah. It really is. But the idea behind that curve is that to go from probabilities to yes and no, you have to pick some threshold. So do you call everything above 0.5 a yes? Do you call everything above 0.6 a yes?

Levi: That’s a great point. Model delivers probabilities, not yes’s or no’s. I get that question a lot.

Mike: Yeah.

And so, the threshold – you can imagine like if you chose a really high threshold versus a really low threshold, you’re going to get really different distributions of yes’s and no’s.

Levi: Yeah, exactly.

Mike: And so that’s another reason accuracy is not a very good metric because it implies that a threshold was chosen but you don’t know what it was.

Levi: Yeah. So you have to be flexible on threshold. And there’s two ways you can be wrong with a model, where you can over-predict or under-predict. And so at different levels of this threshold, you’re sort of trading off or you’re wrong on one side or the other.

Mike: Yeah. That’s a good point. So the benefit of the AUC is that it evaluates all of the thresholds at once.

Levi: Wow.

Mike: And so, any potential threshold goes into the performance metric, and that essentially kind of takes it out of the equation. You don’t have to think about it anymore.

Levi: Beautiful.

Mike: Yeah. So that’s one reason people really like it.

And then the other thing that’s really good about area under the curve is that it kind of, with a hand-wavy definition, it evaluates the positive cases and the negative cases with a little more “even-ness”. So even if you have imbalanced classes, it’s going to do a better job at making sure that the model is evaluated on how many of the sick people it found as well as how many of the healthy people it correctly didn’t flag.

Levi: That’s a great point.

Okay. So we like AUC more than accuracy for doing classification.

Question. So we mentioned confusion matrix in the chat. Great point. So you mentioned threshold, how does that tie in with confusion matrix?

Mike: Yeah, so the threshold is how you generate the confusion matrix. Once you’ve picked the threshold, then you go through and you’re going to count up all the false positives, the true positives, and then the false negatives, and the true negatives, and you can build that confusion matrix.
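Editor’s sketch of how a threshold turns probabilities into a confusion matrix. The function name `confusion_matrix`, the eight patients, and their probabilities are all hypothetical, and calling a probability at or above the threshold a "yes" is an assumed convention.

```python
def confusion_matrix(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when probabilities >= threshold are called 'yes'."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted_yes = prob >= threshold
        if predicted_yes and truth == 1:
            tp += 1  # sick, and the model said yes
        elif predicted_yes and truth == 0:
            fp += 1  # healthy, but the model said yes
        elif truth == 1:
            fn += 1  # sick, but the model said no
        else:
            tn += 1  # healthy, and the model said no
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Hypothetical outcomes (1 = sick) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.65, 0.6, 0.4, 0.3, 0.2, 0.1]

low = confusion_matrix(y_true, y_prob, threshold=0.35)
high = confusion_matrix(y_true, y_prob, threshold=0.7)
print("threshold 0.35:", low)
print("threshold 0.70:", high)
```

With the low threshold every sick patient is caught at the cost of one false positive; with the high threshold there are no false positives but two sick patients are missed.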

Levi: So it’s tied to one particular threshold which may be a little bit limiting?

Mike: Yeah, but if you know the threshold that’s okay.

Now, AUC comes from the confusion matrix over all thresholds so that’s why we kind of prefer it. But the confusion matrix is a good place to start.
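Editor’s sketch of "the confusion matrix over all thresholds": sweep the threshold across every distinct score, record the true positive rate and false positive rate at each step, and take the area under the resulting ROC curve with the trapezoid rule. The helper names `roc_points` and `auc` and the data are hypothetical, and this is one simple way to compute it rather than the method any particular package uses.

```python
def roc_points(y_true, y_prob):
    """Return (false positive rate, true positive rate) at every threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing is called yes), then step down
    # through each distinct score until everything is called yes.
    thresholds = [float("inf")] + sorted(set(y_prob), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a curve given as (x, y) points, using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical outcomes (1 = sick) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.65, 0.6, 0.4, 0.3, 0.2, 0.1]

points = roc_points(y_true, y_prob)
model_auc = auc(points)
print(f"AUC = {model_auc:.4f}")
```

No single threshold appears in the final number, which is why it can be reported before anyone has decided where to draw the line between yes and no.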

Levi: A good place to start, you think.


I mean, the AUC [inaudible 00:12:09] all potential thresholds, you wouldn’t start with AUC, get an idea of your model performance, and then when you have a threshold, go to confusion matrix?

Mike: Hmm, yeah.

Levi: [inaudible 00:12:19] we haven’t used confusion matrix a lot lately?

Mike: Yeah. We’ve been keeping it higher level with just looking at AUC and calling it quits there.

Levi: Yeah, but so let’s say you have a doctor discussion where you’re saying, “Hey, clinician A, here’s how your model performed. You can either err on the side of false positives or false negatives.” From there you can get a threshold. And so, once you have your threshold, a great way to display that is with the confusion matrix.

Mike: Yeah.

Levi: So thanks for bringing that up, Paul.

Mike: Yeah.

That’s a good one. And then the other thing is this trade off of precision and recall.

Levi: Ooh.

Mike: What’s the difference?

Levi: Yeah.

So that one’s a hard thing to remember because there are some synonyms here. So is sensitivity the same thing as precision?

Mike: You know, I can never remember.

Levi: Yeah, [inaudible 00:13:06] for those.

Mike: But the gist of it is that AUC evaluates the positive cases and the negative cases. Whereas, precision and recall both focus on the positives: recall evaluates the cases that really were positive, and precision evaluates the cases that the model said were positive.

Levi: So folks, it’s more on getting the yes’s right.

Mike: Yeah, exactly.

Levi: Like, people that had cancer that we said they were going to have it.

Mike: Exactly.

Levi: Okay.

Mike: So it kind of looks, from both angles, at the sick people, how well did the model do at finding the sick people?

Levi: Yeah.

Mike: So it’s a harder metric to get good.

Levi: Yeah. It’s less popular. So a lot of the literature you’ll find on and other places focus on AUC where sometimes it’s called a c-statistic. Have you seen?

Mike: Yup.

Levi: So PR or the precision recall curve is a little bit more esoteric or whatever.

Mike: Well so recall is the same as sensitivity. I might have that backwards. But assuming I have that right, precision is the same as positive predictive value which is pretty common in healthcare.
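Editor’s sketch confirming the definitions above: recall is the same as sensitivity, and precision is the same as positive predictive value. The function names and the confusion-matrix counts are hypothetical.

```python
def precision(tp, fp):
    """Positive predictive value: of the cases the model called positive,
    the fraction that truly were positive."""
    called_positive = tp + fp
    return tp / called_positive if called_positive else 0.0


def recall(tp, fn):
    """Sensitivity: of the cases that truly were positive,
    the fraction the model called positive."""
    truly_positive = tp + fn
    return tp / truly_positive if truly_positive else 0.0


# Hypothetical screening result for 1,000 patients, 10 of them sick.
counts = {"tp": 8, "fp": 12, "fn": 2, "tn": 978}

p = precision(counts["tp"], counts["fp"])
r = recall(counts["tp"], counts["fn"])
print(f"precision (PPV) = {p:.2f}")
print(f"recall (sensitivity) = {r:.2f}")
```

Neither formula uses the true negative count, so the large group of healthy patients cannot inflate either number. That is why these metrics are harder to score well on when the positive class is rare.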

Levi: Yeah, yeah.

I guess, the PR curve or the area under the PR curve is a little less.

Mike: Yeah, the area under the PR curve, that’s more of a machine learning thing.

Levi: Yeah, but definitely helpful and in 1.11 release.

Mike: Of course.

Shameless plug.

Levi: Yeah. So this is good stuff. And this is classification.

And where else do we go? What other things can we do with machine learning?

Mike: Oh yeah, that’s a great question. So as I’m sure you know, you can also do regression. And so regression is we’re predicting a continuous output. And you’re going to have to compare your continuous output versus the true continuous output. So all the stuff for classification we just talked about doesn’t really apply. And that makes it a little harder. But it gives you other advantages too.

So if we can look at—if we have a dataset– we’ll head over to an image and if we look at a dataset, you’ll notice that a regression problem is typically a lot more visual. So you can kind of like plot all the data in a scatter plot and then plot the line of best fit. And I’ve just shown a simple linear regression here. Maybe, it’s cost of a house versus square footage or something. But you get a nice idea of kind of what the spread of the data looks like and how you can evaluate that.

And so, similar to accuracy, a lot of people start with the R² coefficient. And that’s what comes out of Excel.

Levi: Interesting. I know that.

Mike: Yeah.

And so, one problem with the R² coefficient is that, just like accuracy, you can get a really high R² for data that doesn’t really fit a model.

So looking at these four examples, we have a few different cases where a linear fit is just not a good fit to the data. But you’re going to get a pretty high R² value because your data is close to that line, even though it’s not a good fit.
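Editor’s sketch of a high R² on a poor fit. The data here is a made-up curved dataset (y = x² for x from 0 to 10), not one of the four examples shown on screen, and the helper names `linear_fit` and `r_squared` are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, fitted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data that is clearly curved, not linear.
xs = list(range(11))
ys = [x ** 2 for x in xs]

slope, intercept = linear_fit(xs, ys)
fitted = [slope * x + intercept for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
r2 = r_squared(ys, fitted)

print(f"R^2 = {r2:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

R² comes out above 0.9, yet the residuals are positive at both ends and negative in the middle. That systematic pattern shows a straight line is the wrong shape for this data, which the single R² number hides.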

Levi: Interesting. And it’s pretty popular.

Mike: Yeah, it is.

Levi: Maybe, too popular.

Mike: So the other two options that are better, they’re going to focus on the residuals, which are the differences between the predicted outcome and the true value. And so, that’s a little closer to kind of looking at classification. In my mind, it’s kind of more apples to apples but with the assumption that the spread in the residuals is going to be a normal distribution because it’s kind of a random process, the difference between the true data and the fit. You can do things like take the absolute value of the difference, and sum them all, and then divide by the number of points and that’s called mean absolute error.

Levi: And you want to take the absolute part of it because you don’t want them all summing to zero sort of getting—

Mike: Right, right. You don’t want the negative ones and the positive ones to sum to zero so you do the absolute. And if you’re more concerned about outliers, you can use this thing called root mean squared error.

Levi: Ooh.

Mike: Which is where you square the residual for each point, and then take the average of that, and then the square root. And that kind of penalizes big outliers in your data more than just lots of small variations.
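Editor’s sketch of mean absolute error and root mean squared error as just described. The house prices (in thousands of dollars) and both sets of predictions are made up to show three things: signed errors can cancel to zero, MAE stays in the units of the outcome, and RMSE penalizes one big miss more than several small ones.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square each residual, average, take the root."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical house prices in thousands of dollars.
y_true = [200, 300, 400, 500]
pred_even = [210, 290, 410, 490]     # off by 10 every time, alternating sign
pred_outlier = [200, 300, 400, 540]  # perfect three times, off by 40 once

# Without the absolute value, positive and negative errors cancel out.
mean_signed = sum(t - p for t, p in zip(y_true, pred_even)) / len(y_true)

print(f"mean signed error (even): {mean_signed}")
print(f"MAE even: {mae(y_true, pred_even)}, RMSE even: {rmse(y_true, pred_even)}")
print(f"MAE outlier: {mae(y_true, pred_outlier)}, RMSE outlier: {rmse(y_true, pred_outlier)}")
```

Both sets of predictions have the same MAE of 10 (thousand dollars), but the RMSE doubles for the set with the single large miss.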

Levi: That’s a great summary.

So gives you these metrics, by the way. So that when you create a model, it tells you how well it did in terms of MAE or RMSE. One of like the benefits with MAE is that it’s in the same units. So you get this number and it’s like, “Well, what does this mean? Is this good?” And so, if you’re predicting house price with mean absolute error, it will tell you the average error in terms of house price, which is kind of nice.

Mike: That is nice.

Levi: If you are ever really concerned or excited about making sure you’re handling outliers well, then you use RMSE which is maybe slightly harder to interpret. But if you’re using multiple algorithms like in then you can compare apples to apples, then it’s not so difficult.

Mike: Yup.

The fact that all these metrics are in makes me think that either the people who wrote the package are really good or maybe they’re paying us to be on this broadcast and talk about these [inaudible 00:18:29].

Levi: Yeah. That’s one of the benefits of, is it does it for you. And that’s one of the annoying things because when I got started with machine learning, I had to re-calculate these things and write functions myself for these different projects that we were taking on. And now, everything’s packaged up very nicely to make your life easier to say, “Okay, if you’re doing regression, here’s what’s an appropriate metric. Or for classification, here’s AUC.”

Mike: And we just have one more point on monitoring our performance metrics, and that’s kind of like, “How do performance metrics change over time?”

Levi: Interesting.

Mike: What could cause a performance metric to change? If you build a model and it’s looking good. And then you put it in production. How do you make sure it’s still good?

Levi: Yeah. So you can, I guess most easily, just manually check. We have something called the generate AUC function in where you can go in and say, “Okay, well I’ve gathered data for the last couple of months, seeing if people got an infection or not.” And this will say, “Oh, here’s your AUC. Here’s how good your model did in production” which is quite nice because it could be quite different from when you’re developing the model, because the underlying data might have changed. Health systems have really interesting data pipelines that aren’t always static over time. Data leakage is one that we’ve come across. That’s a couple of big ones.

Mike: Yeah. So if you can watch that AUC over time and see if it’s falling versus where it was in development, maybe it’s time to re-train the model or maybe it’s time to just leave the [inaudible 00:19:59] alone.
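Editor’s sketch of watching AUC over time. This is not the package function mentioned above; `pairwise_auc`, the development AUC of 0.85, the tolerance of 0.10, and the two monthly batches are all hypothetical. The AUC is computed as the fraction of (sick, healthy) pairs in which the sick patient received the higher predicted probability, with ties counted as half.

```python
def pairwise_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


DEV_AUC = 0.85    # assumed AUC measured when the model was developed
TOLERANCE = 0.10  # assumed acceptable drop before someone takes a look

# Hypothetical production batches: (observed outcomes, predicted probabilities).
batches = {
    "month_1": ([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]),
    "month_2": ([1, 0, 1, 0, 0, 1], [0.9, 0.8, 0.5, 0.6, 0.2, 0.3]),
}

flagged = []
for name, (outcomes, probabilities) in batches.items():
    batch_auc = pairwise_auc(outcomes, probabilities)
    if batch_auc < DEV_AUC - TOLERANCE:
        flagged.append(name)
    print(f"{name}: AUC = {batch_auc:.3f}")

print("batches to review:", flagged)
```

A flagged batch is a prompt to investigate whether the underlying data changed, not an automatic signal to re-train.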

Levi: Yeah, yeah, yeah.

So you can decide. And it’s super easy to re-train using the tools.

Awesome rundown, Mike. Thanks so much.

Mike: Yeah.

Levi: It was beautiful.

Mike: It was great.

Levi: So in the chat, if you have any questions, let us know.

We actually want to do something new. We’re going to bring in a poll question. And we’re quite excited for this. And so, what we’ll do here is we’ll direct to you to So folks, if you wouldn’t mind heading to I believe that’s going up on the screen there. And we have a question that says, “Where do you feel like you’re struggling most with machine learning?”

So we want to be able to help the community learn what people are struggling with. And that’s the point of this poll. And (a) finding an appropriate business question, (b) data prep, (c) feature engineering, and (d) model evaluation, (e) visualizing the predictions. So yeah, it’s

It’s quite a handy little tool, pollev is. If you ever need a poll service.

Mike: Yeah, it’s pretty good.

Levi: Okay. We have the responses coming in. It looks like finding an appropriate business question and feature engineering are actually the top two.

Wow, actually, finding the appropriate business question is polling ahead. Interesting.

Mike: Yeah. I think that’s really interesting because like we were saying at the beginning, with getting, you know– why is the Kaggle contest not as good as a project you’ve put together on GitHub. And that’s because the Kaggle contest doesn’t have the finding an appropriate business question part, you know?

Levi: Yeah.

Mike: And that part’s hard.

Levi: It is hard. You have to really know your business and know the tools.

And we’re here to help with that. We’ve been learning that ourselves, here in Health Catalyst.

And we talked a little bit about that in the webinar yesterday. Maybe we should have a whole broadcast just based around that. [inaudible 00:22:00]

Mike: Yeah. So could we help the community with that idea? Maybe it would be finding appropriate use cases. Use cases in healthcare, we’ve done broadcasts on that but we’ll have to think about how we can help people think about how to come up with a business question that’s useful.

Levi: For sure. That’s a good point.

And if you want to drop any questions in the chat, on YouTube or in the Slack Channel, you can follow up a little more in-depthly over the coming days. But we definitely want to help think about that because, with so many issues in healthcare, sometimes it’s hard to know what to chew off first.

Mike: Mm-hmm, that’s true.

Levi: Bite off.

Mike: Bite off.

Levi: Okay.

Next question. Two poll questions today and then we’ll finish up. So next poll question is, “Would you be interested in small machine learning challenges each week?”

So we’ve debated this a little bit. The idea is that we want to engage during the week. We want to help you make progress. So the idea would be to drop a small, tiny, homework assignment in the Slack Channel such that you have something that you’re working on and asking us about in the same Slack Channel. And then we can check in during the broadcast and see, “Okay. Well, what do people struggle with? What’s the proper route to accomplish this small task?”

So let’s open this up, if you stay at Let’s see the returns come in here.

Mike: Let those come in for a little while and then, you know.

Levi: Yeah.

I mean, it’s nice to watch things but I feel like when you’re doing a MOOC or when you’re trying to learn something hands-on is the name of the game. And so, we can have people follow us along here. But it might be helpful for you when you’re not watching the video to have little snippets of things you’re working on.

So it looks like most people think that’ll be a good idea. So please keep answering there if you haven’t got to it yet. But we’ll have to think on that.

Mike: Yeah.

I mean, I think it certainly looks like the community is interested in that kind of thing so maybe the best way for us to do it would be to kind of serve it up through the Slack Channel. And then spend a little time on the next broadcast, discussing it, once people have had time to kind of explore and do whatever it is.

Levi: Yeah, yeah, yeah. That’s great.

Mike: Just talking points.

Levi: And the idea would be we help you along in the Slack Channel and when you get stuck, you have a venue there where you can get help not only from us but other members of the community. So we’ll follow up with that.

So that’s what we have for today. Any other items on the docket? I think—

Mike: I’ve got nothing, Levi, except the weekly shameless plug to please subscribe and join the Slack Channel, star our GitHub repos and...

Levi: Oh, GitHub repos. Yeah.

Mike: You can send snacks and prizes to Levi and I at Health Catalyst, if you really like it.

Levi: [inaudible 00:24:35] our mailing address, you know.

Thanks for joining, everybody. See you next week.

Mike: See you next week. Thank you.

Levi: Thank you for joining us today. Remember to like, share and subscribe. Comment below and click the links below to join our Slack Channel and our community.

What topic or projects should we feature?

Let us know what you think would make it great.