ML #9 - From Zero to Your First Open Source Contribution: It Happens Today!
Hosted by Mike Mastanduno
April 20, 2017 - 20min
Open source tools are dominating the data science world. The ability to interact with and contribute to such projects is an invaluable skill set inside and outside of Catalyst. Join us for a hands-on workshop where you will be guided through making your first open source contribution, no matter your background.
Links to Materials Mentioned
- Contributing to Open Source – The landing page containing links to walk you through the experience of making your first open source contribution. http://contributing-to-open-source.readthedocs.io/en/master/
- healthcare.ai Slack channel – Join our Slack channel! https://healthcare.ai/slack/
Levi: Hi everybody, I’m Levi, here with Taylor Miller.
Levi: Taylor, it’s been a while. How are you doing?
Taylor: Good. Good, good.
Levi: So today, we’re talking about open source contributions.
First up, before we get to that, we just have a couple of housekeeping items. So if you want to log in to your YouTube channel, such that you can participate in the chat, that’d be fantastic. And you might want to check your video resolution. High resolution is good since we’ll be doing a lot of screen work here and getting it going with GitHub and Git. Go and subscribe.
Join our Slack Community. We’re here to interact and to work with you on problems. And Slack’s a great way to do that so go ahead and join. We have a lot of folks that have joined us there. And we’re eager to get to know you and to work on stuff together.
So first off, mailbag.
Taylor: Yeah, we had a couple today.
Levi: We’ve got a couple of questions that came in. And #1, it’s been a couple of weeks since we got this one so sorry about the delay here but someone asked, “What are the differences between deep learning and machine learning?” You’ve been getting into that a little bit lately. You’ve been you know researching, doing a little bit of—
Taylor: Doing some [inaudible 00:01:03].
Levi: Yeah. Any thoughts or–?
Taylor: Yeah, so machine learning is kind of this umbrella concept that there’s many different types of machine learning. We’ve talked about a couple of those. We’ve talked about regression. And we’ve talked about classification. Deep learning is a particular algorithm that can be used for both of those tasks. It’s less of a statistical approach and more of a neural net approach patterned after, of course, our biology.
Levi: Yeah, yeah. That’s fantastic. So, often, multiple layers.
Levi: Whereas, with like linear regression you’ll think of coefficients where that’s just sort of one layer. Deep learning, the idea is that it’s stacked deep. And the deeper the better, it seems like these days.
Taylor: It works really well for predicting things or classifying things with lots of features, particularly images. I mean, if you take an image and you’re going to run regression on that image, you’d look at every single pixel. So you build a neural net that would take every single pixel as part of the input. So many, many, many pixels. So really wide data.
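To make the "every single pixel" point concrete, here is a minimal sketch in plain Python. The 28x28 image size and the `flatten_image` helper are illustrative assumptions, not something taken from healthcare.ai.

```python
def flatten_image(image):
    """Turn a 2D grid of pixel intensities into one flat list of features."""
    return [pixel for row in image for pixel in row]


# Hypothetical 28x28 grayscale image; every pixel is 0 (black) here.
height, width = 28, 28
image = [[0 for _ in range(width)] for _ in range(height)]

# Each pixel becomes one input feature, so even a small image is "wide" data.
features = flatten_image(image)
print(len(features))  # 784
```

Even this small image yields 784 input columns for a single row of data, which is the sense in which image data is "really wide".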
Levi: And doing things that you cannot do with logistic regression. It’s fantastic and we’re excited to put it into healthcare.ai. So it does have some applications in healthcare even before we get to things like image classification. So if you have tabular data, you get to above 300,000 rows, it can help you out in a way that logistic regression or random forest could not. And we’re excited to put that in. Taylor and Mike have actually been working on that a little bit lately.
So second question is Kevin from the Slack Channel. Again, hop on Slack. Let us know your thoughts. So Kevin says, “How do you communicate what features impacted your model most?” That’s a fantastic question. And one that we’ve designed healthcare.ai around.
So the idea is that we want to enable users to actually not have to worry about the details as to, “Okay, well for this question, what algorithm should I choose? How do I figure out which features are important? How do I process my data?” So in choosing random forest and lasso, we’ve actually picked those because they let you see, okay, well, these four features are more important than these other four. It gives you guidance as to what’s been helpful or not.
And you know space is limited. We only have so many variables that we can keep on the ATL server each night. So the idea is that if it’s helpful to you, and it often is, throw out the variables that weren’t needed. Only keep the ones that were. It also makes your model more interpretable, which of course is big. So, Kevin, thanks for reaching out.
Anybody else? Please, hop in the chat. Let us know your thoughts. We’ll respond to you dynamically during the session. And then if you have questions that are maybe of longer length or they’re more in-depth, throw those in the Slack Channel and we can chat afterwards.
So what’s on tap for today? What are we getting into?
Taylor: So we’re going to talk about something that’s really important. And that’s barriers. And we’ll start with the question, “Levi, what was it like when you first started contributing to open source software?”
Levi: There’s a lot of terminology that I wasn’t familiar with. There’s a lot of folks that seemed to know what they were doing and that were at a sort of different tier of knowledge. It was kind of hard to tap into that and feel like you were one of them.
There’s a lot of tools.
Levi: A lot of different terms out there.
Well, that’s what we hope to address today. We’re trying to break down some of those barriers. Particularly, in this small community, we’re trying to germinate. We would love to collaborate on software together.
And if we can get you through some of that initial pain, once you’ve been through that once, you really realize, looking back, “Okay, that was a substantial hurdle but I’m over it now. And now, let’s roll. Let’s collaborate.”
Levi: Exactly. You know, it’s amazing. You know, you dedicate yourself to this for a few hours and the clouds start to go away, especially with Taylor here to help.
Levi: So how do we start out? So many topics. So many different parts of open source contributions.
Taylor: So the approach I’d like to take today is a very workshop based approach. We’re going to ask you to try to follow along. We’re probably moving along a little too quickly for you to follow on directly live but pause that YouTube video, do the things we’re doing, rewind it 10 seconds to make sure you did it right, and then continue on. I figure, if you step through today’s video, doing this from scratch, you’d probably get through this in half an hour pretty easily.
We would love it if you would post questions to Slack when you get stuck, if there’s something that I missed, if there is something that’s confusing, if it wasn’t clear, please let us know so we can clarify that.
Levi: We love the contributions. And then the YouTube channel as well. You know, I can help out. I’ll be responding to questions here while Taylor’s walking you through. But, again, we’re trying to build a back catalog here so feel free to pause and take it slow.
So let’s dive right in.
Just a quick overview of what we’re going to do. We’re going to get through some of the terminology which is required. We’re going to try to make that as fun as possible. And then we’re going to walk through setting up accounts, getting tools you need. And then we’re going to just do it.
From zero to finish, we’re going to take an issue that needs to be worked on. We’re going to get that in our machine. We’re going to fix that issue. We’re going to push that back up to the collaborative cloud. And then walk through it.
Taylor: So strap in. Let’s do this.
Levi: All right. I’m excited.
Levi: So where do we start?
Taylor: Well, we start right here. I’ve put together a little site, today, that will walk us through a lot of these concepts [inaudible 00:06:24] outline as a resource.
So first thing, what we’re going to do, we’re going to set up accounts and tools. So go ahead and click on that.
You need two things. We’re going to be using GitHub today. And this is a widely used collaborative platform for software. You sign up for an account. Piece of cake. You give an e-mail. It takes about two minutes and you’ll be ready to go. And then download GitHub Desktop. And that is a desktop application that helps you understand and work with Git in a really easy manner.
So I’ll just walk you through that. I signed up for a new GitHub account today so it could be just the same. So here’s the signup form: e-mail address, password. You’re done. Then you will go and jump over and download GitHub Desktop, which is this delightful little application we’ll be walking through this afternoon, and get that installing. While you’re waiting, pause the video, come back and then we’ll move on.
Levi: And you like the application more than just doing it through the web? There’s kind of two different routes.
Taylor: Yes. Yes, because you need to do some things on the web. And you also need to do some things on your local machine. And we’ll talk about that.
So let’s jump right in. If you want to follow along, there’s a terminology page here that we’re going to go through. But we’re going to try to do this a little more interesting visually because that’s how I think. And I think, particularly starting out, it really helps to see what all this jargon means.
All right, so let’s do some doodling here. So, what are we trying to do? We did go over this a little bit, in a prior broadcast. And we’ll link to that later. But let’s just jump in.
So what’s happening here? You’ve got your local machine, your laptop, where you want to work on some code. You have a remote environment. We’ll put the ever present cloud. So that’s your remote environment and this is your local environment. These are terms you need to know. And there’s a syncing process. And there’s a two-way syncing process.
So first, we should probably preface that what we’re doing— I jumped right in without asking. So Git is what we’re using today. That is a version control tool. Version control is absolutely essential if you are working on code. It can be used in other things. I used to use it for papers in school. And it helps you track changes. And we’ll talk about what that means.
So you, first, will need a repository. This is commonly referred to as a repo. And we’ll walk through all this. So these are just the terms. And then we’ll walk through it so don’t get overwhelmed. Stick with us. I promise we can get through this.
When you have your code or your repository and you want to push it up to the cloud, that is called a push. When you want to pull it down, that is called a pull. And this is the syncing process.
Levi: And the cloud— so the cloud seems so ubiquitous, these clouds. Is it just like a computer somewhere holding your code?
Taylor: Yeah, so the cloud that we’re going to be using today is GitHub. And that is a hosted collaborative platform. That’s why I had you sign up with those accounts. So GitHub.com. It’s where lots and lots of open source software is built in teams. Good question.
So what is a repo? What is this? So you might have some code you’re working on. You’ve got some files here. And maybe it’s a Python file. We’re going to be working in the Python repo today.
And you’ve made something that works. So you make yourself what is called, the next term, a commit. And you can think of this commit as a point in time that you want to save. And it’s not just every file, it takes the entire snapshot. So my project, my start is one file. But this could be folders and folders of stuff. And this saves all of that in a single state that you can deal with.
So maybe you did some more work and you’ve added some more things to this script. And it’s working. You’re in a good state. You want to make another commit.
Now, why do you do this? You do this because as time goes on, things get very complicated and you want to be able to see how things have changed. So the way you do that is called a diff.
Levi: So commit, it’s sort of like taking save up a notch?
Levi: It’s sort of like a checkpoint?
So a commit is a checkpoint in time. So let’s say we add this green little line of code here and that would be the difference to fix some bug. Maybe the next thing we want to do is add a new feature and that required a whole new file. And so, the diff between these two commits would be this entire file et cetera. So commits go on through time and that’s their process.
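As a rough illustration of what a diff between two commits looks like, the sketch below compares two snapshots of one file with Python's standard `difflib` module. Git computes its diffs with its own machinery; the file contents and the names `commit_1` and `commit_2` are made up for this example.

```python
import difflib

# Two hypothetical snapshots ("commits") of the same file, as lists of lines.
commit_1 = [
    "def add(a, b):",
    "    return a - b",
]
commit_2 = [
    "def add(a, b):",
    "    return a + b",
]


def diff_commits(old, new):
    """Return unified-diff lines describing how `old` became `new`."""
    return list(
        difflib.unified_diff(
            old, new, fromfile="commit_1", tofile="commit_2", lineterm=""
        )
    )


diff_lines = diff_commits(commit_1, commit_2)
for line in diff_lines:
    print(line)
```

Lines starting with `-` were removed and lines starting with `+` were added, which is the same information GitHub Desktop shows in red and green.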
So the next process to think about, the next terminology we need to cover is forking. And forking is what we’re going to do today. Forking is a process of taking someone’s repository – an open source repository and making a copy of it.
So let’s presume that this is our healthcare.ai repository and it’s got our code in it and so forth. But we want you to become contributors and help. As a community, we want to build awesome machine learning software for the healthcare community. So what you’re going to do is you’re going to fork that. And that happens at a single point in time.
Let’s say it happens right here. And you would make your fork. And now you have a complete copy of the entire repository, including its history. So you get all the history going back since the repository started so you can dive in to stuff and see how a software’s evolved. And then you begin to make your own history with your own commits as time goes on.
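One way to picture a fork is as a full copy of the history that then grows on its own. The toy model below is only a mental model, not how Git stores data; the commit messages, file names, and the `fork` helper are hypothetical.

```python
import copy

# Hypothetical upstream history: a list of (commit message, snapshot) pairs,
# where each snapshot maps file names to file contents.
upstream = [
    ("initial commit", {"README.md": "Welcom to the project"}),
    (
        "add script",
        {"README.md": "Welcom to the project", "model.py": "print('hi')"},
    ),
]


def fork(history):
    """Copy the whole history so later changes never touch the original."""
    return copy.deepcopy(history)


my_fork = fork(upstream)

# A new commit that exists only on the fork: fix the typo in the README.
latest_snapshot = dict(my_fork[-1][1])
latest_snapshot["README.md"] = "Welcome to the project"
my_fork.append(("fix typo in README", latest_snapshot))

print(len(upstream), len(my_fork))  # 2 3
```

The fork carries every earlier commit, and the original is untouched by the new one.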
Once you have solved a problem or maybe you’ve added a new feature, maybe you want to add a new algorithm, a new clustering, or something simple – maybe there’s a typo in our documentation that we missed.
Levi: Start small.
Taylor: Yeah, start small. And we’ll talk about that.
The next terminology is a pull request. So as this healthcare.ai repository exists and as your fork exists, things change. So maybe you fixed a typo, or you’ve written a new function, or you’ve made it super— now it can play chess, or whatever that is. We’re going to bring this back into the code base. And what that is called, it’s called a pull request where you, as the contributor, are saying, “Hey, I have done this work. I’m letting the maintainers know. Can you check it out? Make sure it makes sense. Bring it into the package.” So that’s your pull request.
Levi: That’s kind of a funny term – pull request. Some people kind of say, “Okay, why pull?” Like, “What’s the point in going on?”
Taylor: Yes, good question.
Because you’ve got your own repository going on and the pull is saying, “Hey, I think this would actually be valuable to the main repository. Why don’t you guys pull that in?” So there’s a review process here and we’re not going to talk about that today. But it’s a part where the maintainers look at the quality of the code that’s been added or check the typo. “Oh yeah, as it turns out, it was correct.” Then they bring that in.
Levi: So you don’t have to worry that like you’re necessarily going to break something.
And that’s the whole process of having a fork is you can do anything you want in your own fork. And it’s not going to affect healthcare.ai, the main package, at all until that pull request is completed. So that’s the merge situation.
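The review-then-merge idea can be sketched as a small toy model. This is a simplification for illustration, assuming a hypothetical `PullRequest` class and `merge` function; it is not GitHub's API.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    title: str
    commits: list  # commit messages that exist only on the fork
    approved: bool = False  # set by a maintainer after review


def merge(main_history, pull_request):
    """Return the main history with the pull request's commits added."""
    if not pull_request.approved:
        raise ValueError("pull request has not been approved yet")
    return main_history + pull_request.commits


main_history = ["initial commit", "add script"]
pr = PullRequest(title="Fix typo in docs", commits=["fix typo in docs"])

# Nothing reaches the main history until a maintainer approves the request.
pr.approved = True
main_history = merge(main_history, pr)
print(main_history)
```

Until `approved` is set, `merge` refuses to run, which mirrors the point that work in a fork cannot affect the main package before the pull request is accepted.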
Taylor: So hopefully that gives you a little bit of an overview of some of the terminology. I hope it makes it a little easier to deal with. You can look at the terminology page if there’s anything we missed. And there are some great— GitHub itself has some fantastic tutorials and articles on all these terms in detail if you need more visual or you just need to read more about it.
But now that we’ve covered some terminology, let’s jump in and actually go through the process of how do you actually do this. We’ve got the jargon. How do we do it? So jump over here to the work flow document and this will outline what we’re going to do.
Okay. So we need to fork your repo. We’re going to jump into our new GitHub account and— let’s see.
Levi: Okay. So we’re on GitHub. We’re on the website. There’s all these different repo’s out there representing different projects. So you found healthcare.ai, hopefully. We’d love your contributions. Maybe you’re working with some other organization or some other project that’s built some great software and you want to improve it. So you have your code base that you want to improve.
So we’re going to go and click on healthcare.ai. And we’re going to press this fork button. And that copies – it takes a moment, that copies the entire repository, including the whole history into your own GitHub account. This is your code. You can break it. You can fix it. You can do anything you want.
And then we’re going to open up GitHub Desktop which you’ve downloaded. And this makes Git really easy to understand. Once you’ve signed in, you’ll see this delightful little glowing plus button. You say, “Hey, let’s add a repository.” You want to click on the clone tab. And this jumps in. I did not press the fork button. Well, let’s fork it.
Come on. Remember my password. Last minute show prep.
Levi: We have a question I came across. So Leon asked, “What if we want to undo a commit and re-commit? Sort of, rewrite. Let’s say we made a mistake. We need to go back.”
Levi: Is that what commits are for?
Taylor: That’s exactly what commits are for. There’s processes that are, once you’ve been through it once, fairly easy to understand to allow you to roll back in time and say, “Ooh, man. You know what? That was actually a mistake. We can roll back to that point and see.” And that’s the beautiful thing about having that in version control is that you can tell. At any given point in time, you can go back to that and test things.
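For readers who later move to the command line, the sketch below builds three standard Git commands that cover Leon's question. The helper function names and the commit id `abc1234` are hypothetical; the Git subcommands and flags themselves are real. The functions only build the argument lists, so nothing changes in a repository unless you pass a list to something like `subprocess.run(cmd, check=True)`.

```python
def amend_last_commit(message):
    """Replace the most recent commit with a new one using this message."""
    return ["git", "commit", "--amend", "-m", message]


def undo_last_commit_keep_changes():
    """Move the branch back one commit but keep the changes staged."""
    return ["git", "reset", "--soft", "HEAD~1"]


def revert_commit(commit_id):
    """Add a new commit that undoes the changes made by `commit_id`."""
    return ["git", "revert", commit_id]


# "abc1234" is a placeholder commit id, not a real one.
for cmd in (
    amend_last_commit("Fix typo in contributing guide"),
    undo_last_commit_keep_changes(),
    revert_commit("abc1234"),
):
    print(" ".join(cmd))  # display only; the message is not shell-quoted here
```

As a rule of thumb, amending or resetting rewrites history and is safest for commits that have not been pushed yet, while `git revert` keeps the history intact by adding a new commit.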
Levi: So we like to commit often since—
Taylor: Yes, we do.
Levi: –it makes it a lot easier to go back.
Taylor: Yes, we do.
All right. Healthcare.ai. So we need to get an issue. And an issue, if you jump onto the healthcare.ai Python repository, this is where we keep track of features we want to add, new things we want to experiment with, bugs.
And typically, a lot of open source projects, because they want to appeal to people who are new, they’ll have different levels of issues. I mean, we’re not going to ask you to go and convert all filters and transformers to transformer mix-ins because that’s a pretty serious thing. But lots of open source projects will have what’s called a label. And they’ll put something like help wanted, or easy, or starter.
And we’ve got some issues here, all ready to go, that help you with that process. Let’s see, help update author’s file. That sounds like an easy thing to do.
So you would decide on an issue. You’re going to go work on that code in your repository. And then we’re going to commit that up. So let’s see if we can pull this off here.
Levi: All right, so you’ve found your task. You know what you’re going to do. Now, let’s see how we’re actually going to go and do this thing here.
Taylor: Yes. Yes.
And because I did not prepare enough for this, I am locked out of my GitHub account. So we’re going to have to do this a little bit differently.
Levi: Yeah. Plan B.
Finding an issue may be one of the hardest parts, the most intimidating parts. So with healthcare.ai, like Taylor mentioned, we’re excited for contributions from the community. We’re excited to help people go down this path that we have, ourselves, gone down before. And so, if you’re ever curious or uncertain as to, “Okay, should I work on this? Should I work on that?” Feel free to reach out.
So our info is in the GitHub repo. You can contact us on healthcare.ai. And really, you can’t go wrong by saying like, “Hey, like I really want to contribute. How do I get started? Like, what issue is most appropriate for me?” if you’re uncertain. Because, really, establishing that relationship really endears you to the maintainer of the package.
Taylor: It does.
Levi: Yeah. Still—
Taylor: We’re stuck. We’re going to have to finish this later.
Levi: I mean, we could do it from my machine.
Levi: So, let me walk through the steps briefly here.
Taylor: So what you do is, we’ve chosen an issue. We pulled up our text editor. We would make our changes to– let’s see, what was the issue we picked? You know, fix some typo in a document. We save that in our text editor. We use GitHub Desktop. And we can point towards some resources there. And we’ll probably finish this up in the chat as well.
Then in GitHub Desktop— let’s see if we can actually pull that up. GitHub Desktop. Let’s show you how easy it is to make a commit and then push that up. So pick a project. Here’s your changes.
This is what the UI will look like. You can see the changes here. And that diff we talked about, that red and green, red being a subtraction, green being an addition. You type a little commit message down at the bottom here. And then you get this delightful blue check mark that says “commit changes”. That is, you’ve made that change. It has saved everything in that current state right now. And then there’s a little sync button at the top. That actually pushes those changes up to the GitHub account – the cloud, so to speak. And then you’re good to go.
Levi: Yeah. So I’ve got GitHub Desktop here. We could switch over.
Levi: Let’s give it a whirl just so we can show people the UI briefly. So in terms of where we are in this workflow, so I have a branch. And I’ve already forked. And I have something I’m working on. So you’ve made this change to some file. And let’s see where we go from there.
So now we’re sharing out and what we want to do is go to your particular branch. Let’s see. Here, it’s on the left, you have your repo’s. I’m usually doing this on the website instead of the app.
Taylor: Yes, you get your repo’s here on the left.
Levi: Yeah, so that’s not the name of the repo though.
Taylor: Nope. Hit the plus button and hit clone.
Levi: There we go.
Taylor: And now you have your repo.
Levi: So you have your list.
Taylor: So here, it’s downloading. It’s going to ask—
Levi: You see, I should have that in my computer already though. So we wouldn’t want to re-clone, right?
Taylor: It doesn’t matter. Either way.
You don’t want to choose the same directory if you’re cloning though.
Levi: Yeah. So we’ll just put that in downloads. Clone it down. And so here you’re pulling the clone off the website, or off the server, or the cloud. Putting it on your computer.
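The clone step in GitHub Desktop corresponds roughly to a `git clone` of your fork's URL. The sketch below builds that command; the user name, repository name, and target directory are placeholders, and the helper functions are hypothetical.

```python
def fork_clone_url(username, repo_name):
    """Build the HTTPS clone URL GitHub uses for a repository."""
    return f"https://github.com/{username}/{repo_name}.git"


def clone_command(username, repo_name, target_dir):
    """Build the git command that copies the fork to a local directory."""
    return ["git", "clone", fork_clone_url(username, repo_name), target_dir]


# Placeholder names; substitute your own GitHub user name and fork.
cmd = clone_command("your-username", "example-repo", "Downloads/example-repo")
print(" ".join(cmd))
```

Git refuses to clone into an existing directory that is not empty, which is why a fresh target directory is needed when a copy is already on disk.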
Okay. And now, in RStudio, if we hop over real quick.
Let’s edit a file.
Okay. So RStudio’s opening. What we’ll do is we’re going to go ahead and open up a file to edit. And this will be off of the downloads.
Okay. There’s our repo. Let’s open up our proj file. Yes, we want to open that project.
Okay. Let’s open up a real sample file. So we’ll open up—
Let’s actually create a new branch here. So how do we create a new— oh, here we go. So you click on that little branch icon. We’ll create a new branch called Levi test. And what this will do is it will let us work on a part of the code that won’t affect other parts of the code.
So now, in RStudio, what we’ll do is we’ll go ahead and open up this branch. Make a change. And then do a pull request.
Okay. So there’s Levi test. What we just created.
So now if we open up a file, let’s just do the contributing document. We’ll fix the typo, for example.
Levi: Okay. So let’s say that, “Okay, we need to set up Git really well.” And we’ll save that. We’ll actually go ahead and—
We commit this in the Desktop–?
Taylor: Yeah, use the Desktop App.
Levi: Yeah. Let’s commit it over there.
Taylor: So there’s your change. You write your little— if you want to see the diff, click on contributing and it should show what you changed.
Levi: Okay. There we go.
Taylor: Which makes it really easy to track over time.
Okay. So now, let’s go ahead and commit this. And your commit message is— we’re supposed to keep them short but we’ll say, “Okay, we fixed a typo.” That’s great. We’ll commit that. And then we’ll publish, which is the same as pushing to the remote, which is basically backing up your code – the change you’ve made.
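The branch, commit, and publish steps done in GitHub Desktop map roughly onto four Git commands. The sketch below only builds them as argument lists. The branch name `levi-test` (Git branch names cannot contain spaces), the file name `CONTRIBUTING.md`, and the `contribution_commands` helper are assumptions made for illustration.

```python
def contribution_commands(branch, files, message, remote="origin"):
    """Build the git commands for a small contribution, in order."""
    return [
        # Create the new branch and switch to it (do this before editing).
        ["git", "checkout", "-b", branch],
        # Stage the edited files for the next commit.
        ["git", "add", *files],
        # Save a snapshot with a short message.
        ["git", "commit", "-m", message],
        # Publish the branch to the remote and track it for later pushes.
        ["git", "push", "-u", remote, branch],
    ]


commands = contribution_commands(
    branch="levi-test",
    files=["CONTRIBUTING.md"],
    message="Fix typo in contributing guide",
)
for cmd in commands:
    print(" ".join(cmd))  # display only; the message is not shell-quoted here
```

To actually run them from inside a cloned repository, each list could be passed to `subprocess.run(cmd, check=True)`.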
Levi: In case of fire or other natural disasters.
Taylor: It’s worth noting that even if you’re not planning on contributing to open source software, if you’re writing software at all, if you’re doing technical writing, using version control is so helpful and it can save you so many times.
Levi: Even by yourself.
Taylor: Oh yeah. Yeah, it’s not just for— I use it for all. Any time I’m writing code. Even if it’s prototype trash code, I’m always putting it in version control. That way I can track my thought process as time goes on.
Levi: Fantastic idea.
Okay, so now we’ve pushed our change to our remote repository. So now we want to create a pull request where we can actually have our code reviewed by the package maintainer. So in GitHub for Desktop, you click pull request. And you can change the name of the pull request if you like, but you don’t have to. And then you click send pull request.
Now, if you go to this repo. You’ll be able to see—
Taylor: [inaudible 00:25:10]
Levi: Oh, there we go.
Yeah. You see, I’ve usually worked on the actual browser. I’ve not really used the app quite a lot but there we go. There’s your pull request and anybody can go to our repo right there and see that this came through.
And then we can request particular reviewers over here on the right. So let’s say that, “We want Taylor to review this or somebody else.” Then they can come in and say, “Okay, well, yes you did this part really well, or no this part needs some improvement.” And then I can go back and fix it. And re-push my code change back up. And when everything’s good to go, and all the commenters give you the thumbs up, then you go ahead and merge your branch into the master branch and your code is an official part of the repo.
So that kind of runs us end to end.
Levi: Anything to add there?
Taylor: No. That’s it. Try it out. Hop on Slack. Let’s see if we can walk through this together.
It is nontrivial to do this. It is nontrivial but once you’ve been through it a few times you realize you are so much better off. These kinds of skills are in demand and it will help you a lot.
Thanks so much, Taylor.
Taylor: Sure thing.
Levi: Thanks for joining us guys. We would like to remind you to like the broadcast on YouTube, subscribe, join our Slack Channel, and join the community as well at healthcare.ai. We’re sending out blog posts. And I’m trying to form really a group here that can work together on healthcare problems.
Thanks for joining.
Taylor: Thanks a lot.