Getting Started in Open Source

A Primer for Data Scientists

Rebecca Bilbro

I really, honestly love programming... I also love collaborating, exchanging ideas, learning better and faster ways to accomplish things that I'm already familiar with or, even better, learning completely new things that broaden my horizons as a developer or person... I enjoy getting feedback from friends - or programmers I'm building friendships with on GitHub, hearing their thoughts about my code and what I could have improved.

But what shouldn't come with the territory - and no maintainer should have to lose sleep over - is dealing with trolls and negative comments from users who only seem to want to beat the maintainers into submission, while showing no intent of helping to improve the project itself.

- Jon Schlinkert, The Maintainer's Guide to Staying Positive

Introduction

The phrase "open source" evokes an egalitarian, welcoming niche where programmers can work together towards a common purpose: creating software that is freely available to the public, in a community that sees contribution as its own reward. But for data scientists who are just entering the open source milieu, it can sometimes feel like an intimidating place. Even experienced, established open source developers like Jon Schlinkert have found the community to be less than welcoming at times. If the author of more than a thousand projects, someone whose scripts are downloaded millions of times every month, has to remind himself to stay positive, you might question whether the open source community is really the developer Shangri-la it would appear to be!

And yet, open source development does have a lot going for it:

Users have access to both the functionality and the methodology of the software (as opposed to just the functionality, as with proprietary software).
Contributors are also users, meaning that contributions track closely with user stories, and are intrinsically (rather than extrinsically) motivated.
Everyone has equal access to the code, and no one is excluded from making changes (at least locally).
Contributor identities are open to the extent that a contributor wants to take credit for her work.
Changes to the code are documented over time.

So why start a blog post for open source noobs with a quotation from an expert like Jon, especially one that paints such a dreary picture? It's because I want to show that the bar for contributing is... pretty low.

Ask yourself these questions: Do you like programming? Enjoy collaborating? Like learning? Appreciate feedback? Do you want to help make a great open source project even better? If your answer is 'yes' to one or more of these, you're probably a good fit for open source. Not a professional programmer? Just getting started with a new programming language? Don't know everything yet? Trust me, you're in good company.

Becoming a contributor to an open source project is a great way to support your own learning, to get more deeply involved in the community, and to share your own unique thoughts and ideas with the world. In this post, we'll provide a walkthrough for data scientists who are interested in getting started in open source — including everything from version control basics to advanced GitHub etiquette.

Step One: Find a Project

As data scientists, we are already significantly involved in the open source community, even if we don't realize it. Nearly every one of the tools we use (Linux, Git, Python, R, Julia, Java, D3, Hadoop, Spark, PostgreSQL, MongoDB) is open source. We are used to going to Stack Overflow and Stack Exchange to find answers to our programming questions, grabbing code snippets from blog posts, and leveraging useful packages from places like CRAN and the Python Package Index (PyPI). But our Google-fu/copy-and-paste/pip install approach is heavily dependent on the contributions and thought leadership of others, and at some point, you are likely to encounter at least one of the following three scenarios:

You are running a package and encounter a bug.
You are looking for a function in a well-loved package to do some specific task, but realize that it isn't implemented.
You've developed your own code to help you do a certain task or sequence of tasks over and over, and you realize that other people have to do those tasks, too.

These scenarios should be signals to you to get engaged in contributing to open source. In the first case, you could just submit a bug report, or you could go a step further and actually fix the bug. In the second case, you've identified a new feature, which you could suggest to the maintainer of the code, or implement yourself and give back to the code base. In the third case, you're very close to having developed a standalone library that you could refine, package, and then share with the world. This post will be primarily concerned with contributing under one of the first two scenarios.

Step Two: Figure out Git/GitHub Basics

In order to become an open source contributor, you have to buy into version control. Version control is a way of managing files as they evolve, and offers a methodology for recording changes to a file or a whole project over time so that historic versions can later be retrieved. Version control is important to developers of all languages, and is especially important to open source development.

Git

One of the most popular tools for version control is Git, which is itself an open source project. Git was developed by the Linux community and first released in 2005. Git is a distributed version control system that's fast and simple and supports non-linear development of small and large projects. If you don't already have Git, download and install the latest version. Then configure Git with your name and email by typing the following into your terminal:

~$ git config --global user.name "YOUR NAME"    
~$ git config --global user.email "YOUR EMAIL ADDRESS"

Next, take a crash course on Git (Pro Git, listed at the end of this post, is a good place to start) and learn the main Git commands. Make sure you know how to use these:

~$ git init                          # Initialize a repository in an existing directory
~$ git clone URL                     # Get a copy of an existing repository
~$ git status                        # See which files are changed or staged
~$ git add FILENAME                  # Stage a new or updated file
~$ git commit -m "Describe change."  # Commit the staged changes with a message
~$ git push origin master            # Push the commits back to GitHub
~$ git log                           # See the commit history
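To see how these commands fit together, here is a minimal sketch that runs them in sequence as a script inside a throwaway repository. The temporary directory, the file name `analysis.py`, and the commit identity are all assumptions made up for the demo; nothing is pushed anywhere.

```shell
# Practice run in a temporary directory, so none of your own files are touched.
# ASSUMPTIONS: the file name and the "Demo User" identity are invented for this sketch.
scratch=$(mktemp -d)
cd "$scratch"

git init --quiet                       # start tracking this directory
echo "print('hello')" > analysis.py
git status --short                     # analysis.py is listed as untracked

git add analysis.py                    # stage the new file
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add first analysis script"

git log --oneline                      # the history now holds one commit
```

Passing the identity with `-c` keeps the demo from depending on your global Git configuration; in everyday use the `git config --global` settings above take care of that.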

GitHub

GitHub is a popular web-based Git repository hosting service that offers all of the distributed revision control and source code management (SCM) functionality of Git and adds its own features, like metadata and reporting tools. If you haven't already, create a GitHub account. In addition to being useful for open source development, your GitHub page is your programming portfolio: the place you can point potential teammates and employers to show them what you've worked on and how you approach problem-solving with code.

Waffle

An additional tool I like to use is Waffle.io. Waffle is a lightweight project management tool for GitHub repositories. For open source developers who use GitHub, it's particularly convenient because Waffle's source of record is GitHub. GitHub issues and pull requests become cards on a board (and likewise, adding a card to the Waffle board creates a new issue in GitHub), making it easier to plan, organize, and track work across one or many repositories. Create your own Waffle.io account to facilitate collaborative coding and good project management practices.

Step Three: Find the README

Once you've identified a project that you want to contribute to, and you've learned the Git/GitHub basics, the next step is to find the project README. Usually written as a plain text or Markdown file, a README is the author's way of communicating to users what the package is, how to get and install it, what it looks like, and how to use it.

In Art of README, Stephen Whitmore writes, "Your documentation is complete when someone can use your module without ever having to look at its code." As the package's public-facing documentation (as opposed to documentation that appears in comments and docstrings inside the source code, for instance), the README is a critical first point of entry for anyone interested in contributing to a project.

This documentation lays out what the module is supposed to do at a high level: how the installation should work, how the package is meant to be used, and importantly, how the application programming interface (API) is meant to work. This is particularly important for those who want to modify or contribute to the code, since whatever changes you make should be compatible with the API.

Step Four: Read the Issues

Now that you have a sense of what you'd like to contribute to the project, you should read the existing issues. Most software projects have a bug or proposed feature tracker of some kind, and in GitHub it's called Issues. Every GitHub repository has its own section for issues. Issues can be categorized using labels and tags to denote their type (task, enhancement, question, bug), their degree of importance, the level of expertise required to address them, the project milestone they're associated with, etc. For any public repo, the issues are also public, and if you are thinking of reporting a bug or suggesting a new feature, you should first read through the existing issues to make sure that the bug you've identified, or the feature you'd like to suggest, has not already been submitted and added to the backlog.

You can also go look at the project's Waffle to help you visualize the core developers' current workflow, backlog, and priorities.

Generally, maintainers of popular open source projects provide some guidelines to others about how to contribute. At District Data Labs, we encourage contributions in the following ways:

Adding issues or bugs to the bug tracker.
Working on a card on the dev (Waffle) board.
Creating a pull request in GitHub.

The labels in the GitHub issues for our projects are defined in the blog post How we use labels on GitHub Issues at Mediocre Laboratories.

Step Five: Fork It

When you want to create your own copy of a repository on GitHub, you fork it. You can find the fork button in the top-right corner of the page. Your forked version of the project is yours; it's a place where you can freely experiment, modify the code, and add and change things as you like without changing anything about the original repository. Forking is the typical way to contribute to an open source project.

Step Six: Submit a Pull (Merge) Request

Now that you have identified a novel feature or a suggestion for a bug fix, you can fork the repo, add the new feature or fix the bug, commit the changes and push them back to your forked version of the repo, and then submit a pull request to the maintainers of the project. They'll receive a notification, and can review your pull request and either merge it into the source code or provide you with feedback to help guide you toward contribution. Keep in mind that your pull request might not get pulled in immediately for a number of reasons — everything from a potential complication with your proposed fix/addition that will require the creation of a new test to the maintainers having a busy work week. Patience is appreciated, but for the most part, maintainers will likely be pretty stoked that you like their code enough to contribute, and they'll be inclined to coach you through the process to get a successful contribution merged into master.
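The fork, change, and push loop described above can be rehearsed offline. The sketch below is a simulation under stated assumptions, not GitHub itself: the local directory `original` stands in for the maintainers' repository, the bare repository `fork.git` stands in for your fork, and the branch name, file contents, and commit identity are hypothetical. On GitHub, the fork is created with the fork button and the last step is opening the pull request in the web interface.

```shell
# Offline rehearsal of the fork -> branch -> push loop, run as a script.
# ASSUMPTIONS: "original", "fork.git", "fix-readme", and "Demo User" are
# invented stand-ins for this sketch.
demo=$(mktemp -d)
cd "$demo"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# The maintainers' repository, with a single commit on master.
git init --quiet original
cd original
git symbolic-ref HEAD refs/heads/master
echo "Demo project" > README.md
git add README.md
git commit --quiet -m "Add README"
cd "$demo"

# Your fork: a bare copy, playing the role of the copy hosted on GitHub.
git clone --quiet --bare original fork.git

# Your working copy of the fork, with the original registered as "upstream".
git clone --quiet fork.git work
cd work
git remote add upstream "$demo/original"

# Make the change on its own branch, then push that branch to the fork.
git checkout --quiet -b fix-readme
echo "Install with pip." >> README.md
git add README.md
git commit --quiet -m "Document installation"
git push --quiet origin fix-readme

# The commits a pull request from fix-readme would propose:
git log --oneline master..fix-readme
```

Keeping the change on its own branch, rather than on master, leaves your fork's master free to follow the original project while the pull request is under review.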

Next Steps: Learn Git Branching and Semantic Versioning

Steps 1-6 will get you operational as an open source contributor to an existing project. However, if you're at the stage of converting one of your own custom projects into an open source package, there are a couple of additional things to consider, two of which are Git branching workflows and semantic versioning.

Git Branching

Most version control tools have a way of supporting project branching. Branching means diverging from the main line of development in order to make changes, experiment, and explore without breaking the main line. Fortunately, Git makes branching very easy and lightweight by facilitating smooth switching back and forth between branches. Git encourages a workflow that branches and merges often, even multiple times in a day.

A typical production/release/development cycle is described in A Successful Git Branching Model. A typical workflow is as follows:

Select a card from the dev board, preferably one that is "ready", then move it to "in-progress".

Create a branch off of develop called "feature-[feature name]", work and commit into that branch.

~$ git checkout -b feature-myfeature develop

Once you are done working (and everything is tested), merge your feature into develop.

~$ git checkout develop
~$ git merge --no-ff feature-myfeature
~$ git branch -d feature-myfeature
~$ git push origin develop

Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server.

Semantic Versioning

Semantic versioning is a way of encoding information about a project's stage, and about how its code has changed over time, in its version numbers. Semantic versioning is very important to the open source community because it provides another tool for developers and users to communicate about things like package dependencies, changes in the application programming interface, and backwards compatibility.

Under semantic versioning, package version numbers are assigned and incremented as follows: Once a public API is declared (either within the documentation, the code, or both), changes to the API are denoted with specific increments to the version number. As explained in Semantic Versioning 2.0.0: "Consider a version format of X.Y.Z (Major.Minor.Patch). Bug fixes not affecting the API increment the patch version, backwards compatible API additions/changes increment the minor version, and backwards incompatible API changes increment the major version."
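The increment rules quoted above can be sketched as a small shell function. `bump_version` is a hypothetical helper written for this post, not part of Git or of any packaging tool, and it assumes a plain X.Y.Z version string with no pre-release or build suffix.

```shell
# bump_version VERSION PART: print the next version under the X.Y.Z rules.
# ASSUMPTION: VERSION is a plain X.Y.Z string (no pre-release or build suffix).
bump_version() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}            # text after the first dot
    minor=${rest%%.*}
    patch=${rest#*.}
    case "$2" in
        major) echo "$((major + 1)).0.0" ;;            # backwards incompatible API change
        minor) echo "$major.$((minor + 1)).0" ;;       # backwards compatible addition
        patch) echo "$major.$minor.$((patch + 1))" ;;  # bug fix, API untouched
        *)     echo "usage: bump_version X.Y.Z major|minor|patch" >&2; return 1 ;;
    esac
}

bump_version 1.4.2 patch   # 1.4.3
bump_version 1.4.2 minor   # 1.5.0
bump_version 1.4.2 major   # 2.0.0
```

Note that the lower-order numbers reset to zero whenever a higher-order number is incremented, so a user can tell from the first changed position how disruptive an upgrade is likely to be.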

In addition to semantic versioning, calendar versioning is a recommended convention that bases version numbers on a project's release calendar rather than on arbitrary numbers.
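As a rough illustration, a calendar-based version can be generated from the release date. The layout used here (four-digit year, zero-padded month, then a counter for releases within that month) is only one possible scheme, assumed for this sketch, and `calver_version` is a hypothetical helper.

```shell
# calver_version COUNTER: print a version built from today's year and month.
# ASSUMPTION: the year.month.counter layout is chosen for illustration only;
# projects that use calendar versioning pick their own scheme.
calver_version() {
    echo "$(date +%Y.%m).$1"
}

calver_version 0    # first release of the current month
```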

Dive In!

If you're interested in getting into open source, you can get started by checking out the District Data Labs organization on GitHub, and looking through the projects we have in active development. I hope you enjoyed this post and that you will consider becoming a contributor!

Yellowbrick - A suite of visual analysis and diagnostic tools to facilitate feature selection, model selection, and parameter tuning for machine learning.
Cultivar - A multi-dimensional data management and visual exploration tool.
Baleen - An automated ingestion service for blogs to construct a corpus for NLP research.
Partisan Discourse - A web application that identifies party in political discourse and an example of operationalized machine learning.
Minimum Entropy - A DDL-hosted question & answer site for beginners who need answers to Data Science questions.

Resources

Guide to Idiomatic Contributing by Jon Schlinkert
Markup for Fast Data Science Publication by Benjamin Bengfort
Pro Git by Scott Chacon and Ben Straub
A Successful Git Branching Model by Vincent Driessen
Art of README by Stephen Whitmore
How we use labels on GitHub Issues at Mediocre Laboratories by Shawn Miller
Semantic Versioning
Calendar Versioning
