Open-Source: How to Get Started
This article is for people who’d like to get started with open-source, whether through code contributions, community contributions, or writing their own libraries from scratch. Note that it is quite opinionated and reflects my own limited experience.
How I Started my Journey
I came across a video by Hugging Face on YouTube that changed my life. I started following Thomas Wolf (the person in the video, who’s Chief Scientist at Hugging Face) on Twitter, and I must say I was a big fan of what Hugging Face was doing. One day I saw him post about a contribution sprint.
I signed up without knowing anything about open-source workflows. We were supposed to contribute a dataset script that would load the dataset, create its splits, and define its configurations and other specifications.
I must be honest: I struggled quite a lot at first. I didn’t even know about code formatting, let alone CI!
I needed guidance at first, but after a while I started helping other people out, thanks to Quentin and Yacine teaching me the workflows 🤩 They also usually write contribution guides, which help a lot if you follow them.
After that, I participated in other sprints (like the wav2vec one), where I failed: because of the quality of the data, the model wasn’t performing well, and I didn’t have time to find other data sources without breaching licenses or mishandling PII. After a while, I started teaching people how to use Hugging Face Transformers, and I even included it in my talk at Google I/O.
Teaching is another way of contributing to open-source. You don’t necessarily need to write code to contribute; there are other ways:
- writing documentation (blog posts, library documentation, projects on GitHub)
- gathering people and giving workshops to teach them about a specific open-source tool
- opening issues/feature requests on GitHub, giving feedback, and testing beta versions
- answering questions on forums, Stack Overflow, or developer Discord servers
These contributions are more valuable than you think. A library with no adoption due to a lack of documentation or resources is just code. Codebases need a community to thrive.
Getting Started with Community Contributions
- Pick a new tool in the ecosystem, test it, and give feedback if something doesn’t work or the developer experience isn’t intuitive. Even if you don’t find a bug, pointing out where usage could be simplified is valuable.
- Writing a blog post about how to use the tool, or publishing a Kaggle notebook or a GitHub repository that demonstrates a potential use case, is a good place to start.
- If you’re good at making videos, you can record a walkthrough of the ecosystem, show how to develop an application with the tool, and more!
- If you’d like to give a talk at a developer conference, a deck on the ecosystem, on what’s new in the latest release, or on an end-to-end application works very well. At Google I/O, I went through a concept called transfer learning and how to train a model with it using Google’s stack and Hugging Face.
Getting Started with Code Contributions
Getting started with contributing to a big codebase can be intimidating and overwhelming: the fear of your PR being rejected, receiving a lot of review comments and thinking that’s a bad thing, tests not passing, and more. Here are a couple of things no one will tell you, even though they are obvious:
- No one starts contributing to open-source at the age of 1. Everyone has learnt it from someone else.
- As open-source maintainers, we love onboarding new contributors and are more than happy to provide guidance.
- Your PR being rejected is not the end of the world, and it doesn’t mean your code is bad. There are multiple reasons it can happen (which I’ll go through), and you should open a second PR.
🌈 Open an issue before opening a PR, or find the issue you want to solve, and discuss it with the developers first 🌈
Every codebase comes with its own design decisions. If you come across a design you don’t agree with, or you spot a problem, first find the existing issue, or open one if there is none, and discuss it with the developers. Come up with an initial design for a solution, then open a PR that you can iterate on.
🌈 Find good first issues 🌈
Most libraries list their issue labels under https://github.com/organization/library-name/labels. Filter for issues with the good first issue label; they are usually a good way to get started with a codebase.
These steps do not guarantee that your PR will be merged, but they ease the process and make a merge more likely.
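To make the filtering step concrete, here is a minimal sketch that builds the labels page URL and an issue search URL for a repository. The function names are my own, the organization and library-name values are the same placeholders as above, and the exact label text ("good first issue") is an assumption, since projects name this label differently.

```python
from urllib.parse import urlencode


def labels_page_url(organization: str, library_name: str) -> str:
    """Return the page that lists all issue labels of a repository."""
    return f"https://github.com/{organization}/{library_name}/labels"


def good_first_issues_url(
    organization: str,
    library_name: str,
    label: str = "good first issue",  # assumption: check the labels page for the real name
) -> str:
    """Return an issue search URL for open issues carrying the given label."""
    # GitHub's issue search accepts qualifiers such as is:open and label:"..."
    query = f'is:issue is:open label:"{label}"'
    base = f"https://github.com/{organization}/{library_name}/issues"
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(labels_page_url("organization", "library-name"))
    print(good_first_issues_url("organization", "library-name"))
```

Opening the second URL in a browser should show the same list you would get by clicking the label on the labels page.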
✨ Try to follow the developer communities ✨
As I mentioned in my story, my journey began with following Hugging Face. Like Hugging Face, many developer communities, such as scikit-learn, hold open-source contribution sprints. Keep in touch with the core developers to hear about sprints and more. The good thing about sprints is that the core developers dedicate a large portion of their time to helping you out, which makes it easier to get started. At Hugging Face, we’ve held a good number of sprints, including:
- A documentation sprint: the transformers library needed its function docstrings updated, and with the help of the community we documented the codebase.
- wav2vec/2 sprints: everyone trains an audio model for their own language. Since the number of languages is too large for the core developers to handle alone, we provided compute, scripts, and guidance so participants could train a model and commit it.
🙌🏻 Some tools you can get familiar with that open-source libraries use:
- pre-commit: Some libraries use pre-commit hooks. Once installed, the hooks run automatically whenever you commit, for example to format your code.
- black/flake8/isort/any formatting or quality tool: Every codebase follows a style guide. For Python, black formats the code automatically, while flake8 checks it for quality issues, e.g. a variable you’ve defined but never used. For other languages, you need to find out what the project uses.
- test libraries: If you’re contributing a new addition to an open-source library, you need to be familiar with testing, as you’ll be expected to write tests for your code to preserve code coverage.
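If you have never written a test before, the sketch below shows roughly what one might look like. It assumes pytest-style test functions (plain functions whose names start with test_ and that use assert); normalize_label is a hypothetical function invented for this example, not part of any real library.

```python
def normalize_label(label: str) -> str:
    """Lowercase a label and collapse runs of whitespace into single spaces."""
    return " ".join(label.lower().split())


def test_normalize_label_collapses_whitespace():
    # Typical input: mixed case with extra spaces around and between words.
    assert normalize_label("  Good   First Issue ") == "good first issue"


def test_normalize_label_handles_empty_string():
    # Edge case: an empty label should stay empty rather than raise.
    assert normalize_label("") == ""
```

With pytest installed, running pytest in the directory containing this file would collect and run both test functions. Each project documents its own test setup in its contribution guide, so check there first.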
Getting Started with Writing Your Own Library
I recently started writing a library from scratch with my colleagues. Before that, I tried writing another library with one of my friends and failed horribly. The main reason was that I couldn’t come up with a design for the functions, classes & co. I got overwhelmed and couldn’t turn my ideas into a library. This time I was lucky enough to work with a very experienced open-source developer (follow him!). Here are a couple of things I learnt from him that gave me a starting point:
- Always put developer experience before anything else. A codebase that is not intuitive will not be adopted.
- To that end, he gave me a good starting point: before writing the library code itself, he writes executable, notebook-like code cells that define how users will interact with it, and only then implements the code.
- Related: document your code well, e.g. write docstrings first and then implement.
- Write tests. Writing tests might be cumbersome, but it ensures your code works as intended and is robust against edge cases, prevents consistency problems the next time someone contributes code, and helps with backwards compatibility.
- For later, put up contribution guidelines to onboard new developers. Try to define rules about merging, styling, and so on, so that the project scales easily to multiple maintainers or contributors, e.g. we don’t merge our own code ourselves, as a sanity check.
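The "usage and docstrings first" idea from the list above can be sketched in a few lines. This is only my illustration, not necessarily how he works: count_words is a hypothetical function, and using Python’s built-in doctest module to execute the docstring examples is my own assumption about one way to keep the intended usage and the implementation in sync.

```python
import doctest


def count_words(text: str) -> int:
    """Count whitespace-separated words in ``text``.

    The examples below were written first, as the intended user experience,
    and the one-line implementation was added afterwards.

    >>> count_words("open source needs community")
    4
    >>> count_words("")
    0
    """
    return len(text.split())


def run_docstring_examples(func):
    """Execute the examples in ``func``'s docstring and return the results."""
    parser = doctest.DocTestParser()
    # Make the function itself available by name inside the examples.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    return doctest.DocTestRunner(verbose=False).run(test)


if __name__ == "__main__":
    print(run_docstring_examples(count_words))
```

The returned result reports how many examples were attempted and how many failed, so a broken implementation shows up as soon as the documented usage stops working.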
To finish up, I’d like to thank the people who onboarded me to open-source and convinced me to do it full-time (give them a follow if you want to learn more about contributing 🙂):
- Yacine & Quentin for helping me out on my first sprint
- Omar & Julien for reaching out to me to let me work with them
- Adrin for teaching me a new thing every single day.
(and more people from Hugging Face whose names I couldn’t fit in this post)
This article is open to additions: if you have anything to add or a story to share, feel free to reach out to me on Twitter and I will add it below. I’d love to hear your open-source stories.