Over the past few years, our research team at PLAYERUNKNOWN Productions built a range of internal machine learning systems, datasets, and research projects. Like many engineering teams, most of that work stayed internal. It worked, but it was fragmented, difficult to maintain, and largely inaccessible outside the team.
At the same time, we started running into recurring issues. Systems were hard to reproduce, onboarding new contributors required significant context, and knowledge was effectively locked inside the organization. Valuable work existed, but it was not easy to reuse or extend.
This led us to a simple question:
What if we treated research and machine learning systems the same way we treat production software: structured, versioned, reproducible, and designed to be shared?
Exploring that question pushed us toward open source. Not as a final step after a project is done, but as a way to improve how we design, build, and maintain systems from the beginning.
This article is the first in a series about moving from internal research and engineering projects to open source machine learning systems. Here, we focus on the strategic layer: why open source matters, what changes when you take it seriously, and how to approach it in a structured way.
Author: Hirad Emamialagha
Rethinking Open Source
Our initial assumption was simple: open source means publishing code on GitHub with a README.
That assumption did not hold for long.
Well-structured open source projects, especially in machine learning, are rarely a single repository. In practice, they form small ecosystems of related components.
These systems typically include:
Code
Model weights
Datasets or data pipelines
Documentation
Research papers or reports
Examples and tutorials
Tooling and CI/CD
Community and governance
Releasing only code is rarely enough. Without context, reproducibility, and usable artifacts, most projects are difficult to adopt and easy to abandon.
A more useful definition is:
Open source is not sharing code. It is packaging a system so others can understand it, run it, and build on top of it.
Open Source Improves Internal Engineering
One of the most important insights we had was that open source is not only an external activity. It has strong internal effects.
Preparing a project for open source is one of the fastest ways to expose engineering debt.
It forces you to:
Remove hidden dependencies
Eliminate environment assumptions
Clean up repository structure
Write real documentation
Make the system reproducible from scratch
In practice, this often improves the internal quality of the project significantly, regardless of whether it is ever released publicly.
This leads to a shift in mindset:
Projects should be built as if they could be open sourced later, even if that decision has not been made yet.
That single constraint changes how you design systems.
Open Source as a Long-Term Strategy
From an engineering perspective, open source is not a one-time release. It is a long-term strategy that operates across multiple dimensions.
For engineers, open source becomes a portfolio of real work:
Code quality
System design
Documentation
Engineering practices
Unlike a resume, it shows how you actually build systems.
For organizations, open source contributes to:
Technical reputation
Hiring and talent attraction
Collaboration opportunities
Visibility in specific domains
For the broader ecosystem, it enables:
Reproducibility
Shared tooling
Accelerated research and development
Open source sits at the intersection of all three.
Before You Open Source Anything
Not every project should be open sourced, and not every project is ready.
Over time, we found that most issues fall into a small number of categories. To keep things simple, we reduced them to a checklist we use before any release.
1. Legal and Licensing
You need to ensure that:
No proprietary components are exposed
No internal infrastructure is referenced
No restricted data is included
You also need to choose a license that defines how others can use the project.
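The first three checks lend themselves to a rough automated pass before a human review. The sketch below is one way to do that in Python, not a description of our internal tooling. The patterns in `SUSPICIOUS_PATTERNS` (including the `.internal`, `.corp`, and `.local` hostname suffixes) are hypothetical examples that you would replace with your own, and a regex scan will miss things a dedicated secret scanner or a legal review would catch.

```python
import re
from pathlib import Path

# Hypothetical patterns for illustration; adapt them to your organization.
SUSPICIOUS_PATTERNS = {
    "private key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded credential": re.compile(
        r"(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
    "internal hostname": re.compile(r"\b[\w.-]+\.(internal|corp|local)\b"),
}


def scan_text(text):
    """Return (line_number, label) pairs for every suspicious line in text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings


def scan_repository(root):
    """Scan every readable text file under root, skipping the .git directory."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file: skip it
        findings = scan_text(text)
        if findings:
            results[str(path.relative_to(root))] = findings
    return results
```

Calling `scan_repository(".")` from the repository root returns a dictionary that maps each flagged file to its findings. An empty dictionary means only that these particular patterns did not match, not that the repository is safe to publish.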
2. Maintenance Commitment
Open sourcing a project creates expectations.
Even small projects can generate:
Issues
Questions
Feature requests
Contributions
Open source is not just publishing. It is committing to a certain level of ongoing maintenance and communication.
3. Repository Quality
Internal repositories often contain:
Experimental code
Hardcoded paths
Implicit assumptions
Missing documentation
External users have none of the internal context.
If someone cannot clone your repository and run it on a clean machine, it is not ready.
Open Source Readiness Checklist
Before making a project public, we now use a simple checklist:
No secrets or internal references
License selected and added
Installation works in a clean environment
Documentation is complete and clear
Examples or notebooks are provided
Basic validation or tests exist
Contribution guidelines are defined
If these are not satisfied, the project is not ready, regardless of how good the code is.
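Part of this checklist can be verified mechanically. The following is a minimal sketch that covers only the items detectable by file presence. The file and directory names in `CHECKS` are common conventions that we assume here, not requirements. The remaining items (secrets, a clean installation, documentation quality) still need their own checks or a human reviewer.

```python
from pathlib import Path

# Hypothetical mapping from checklist items to conventional file or
# directory names. Adjust the names to match your own repository layout.
CHECKS = {
    "License selected and added": ["LICENSE", "LICENSE.md", "LICENSE.txt"],
    "Documentation is present": ["README.md", "README.rst"],
    "Examples or notebooks are provided": ["examples", "notebooks"],
    "Basic validation or tests exist": ["tests", "test"],
    "Contribution guidelines are defined": ["CONTRIBUTING.md"],
}


def check_readiness(repo_root):
    """Return a dict mapping each checklist item to True or False."""
    root = Path(repo_root)
    return {
        item: any((root / name).exists() for name in candidates)
        for item, candidates in CHECKS.items()
    }


def is_ready(repo_root):
    """True only if every file-based checklist item is satisfied."""
    return all(check_readiness(repo_root).values())


if __name__ == "__main__":
    for item, passed in check_readiness(".").items():
        print(f"[{'x' if passed else ' '}] {item}")
```

Running a script like this in CI turns the checklist from a document people are supposed to remember into a gate that fails visibly.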
The Four Pillars of Open Source ML
A useful way to think about open sourcing machine learning projects is through four core pillars:
Code: training, evaluation, inference
Models: weights, checkpoints, configurations
Datasets: the data itself, or reproducible data pipelines
Research and Documentation: papers, reports, explanations
Most projects only release code. That is usually insufficient.
The most useful and widely adopted projects combine at least:
Code
Model artifacts
Documentation
Examples
Without these, reproducibility breaks down.
What We Tested in Practice
We started small by publishing internal tools and utilities rather than full research systems. This reduced risk, shortened feedback loops, and helped us build internal experience with:
Repository preparation
Documentation standards
Licensing decisions
CI/CD for public projects
We also experimented with publishing models through platforms like Hugging Face. This introduced a different set of requirements:
Model cards
Usage examples
Configuration clarity
Releasing models is not the same as releasing code.
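To make the difference concrete: on Hugging Face, a model card is a `README.md` file that starts with a YAML metadata block and continues with Markdown. The sketch below renders a minimal card covering the three requirements above. It is an illustration under assumptions, not our release tooling: the function name `render_model_card`, the section titles, and all example values are hypothetical, and only the `license` and `tags` metadata fields are used.

```python
def render_model_card(name, license_id, tags, description, usage, config):
    """Render a minimal model card: YAML metadata, then Markdown sections."""
    fence = "`" * 3  # built at runtime so this example nests cleanly in Markdown
    lines = ["---", f"license: {license_id}", "tags:"]
    lines += [f"- {tag}" for tag in tags]
    lines += ["---", "", f"# {name}", "", description, ""]
    # Usage example: a snippet the reader can copy and run.
    lines += ["## Usage", "", f"{fence}python", usage, fence, ""]
    # Configuration clarity: list every setting needed to reproduce results.
    lines += ["## Configuration", ""]
    lines += [f"- `{key}`: {value}" for key, value in config.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    card = render_model_card(
        name="example-model",  # hypothetical model name
        license_id="mit",
        tags=["example"],
        description="A placeholder description of what the model does.",
        usage="model = load_model('weights.pt')  # hypothetical loader",
        config={"hidden_size": 256, "seed": 0},
    )
    print(card)
```

Generating the card from the same configuration object used in training is one way to keep the published settings from drifting away from the real ones.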
In parallel, we explored reproducible research workflows using version-controlled LaTeX projects (Research-PaperOps). Treating research papers like software artifacts, with versioning, CI builds, and structured repositories, proved to be a powerful approach.
Open source, in this sense, became part of a broader effort toward reproducible engineering and research.
Challenges We Encountered
The process was not trivial.
One of the biggest challenges was uncovering hidden assumptions. Many internal systems depended on:
Specific directory structures
Internal services
Undocumented environment setups
Making these systems runnable in a clean environment required significant effort.
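A common fix for the first and third items is to replace hardcoded locations with a single, documented resolution rule. The sketch below shows one such rule in Python, assuming a hypothetical environment variable `PROJECT_DATA_DIR` and a default `data` directory relative to the working directory. Both names are illustrative and not taken from our projects.

```python
import os
from pathlib import Path


def resolve_data_dir(env_var="PROJECT_DATA_DIR", default="data",
                     environ=None, base=None):
    """Resolve the data directory without assuming a machine-specific path.

    Order of precedence:
      1. The environment variable, if set and non-empty.
      2. `default`, relative to `base` (the current directory if omitted).
    """
    environ = os.environ if environ is None else environ
    base = Path.cwd() if base is None else Path(base)
    override = environ.get(env_var)
    if override:
        return Path(override).expanduser()
    return base / default


def require_dir(path, env_var="PROJECT_DATA_DIR"):
    """Fail early with an actionable message instead of a late, obscure error."""
    if not path.is_dir():
        raise FileNotFoundError(
            f"Expected a data directory at {path}. "
            f"Create it or set {env_var} to point at your data."
        )
    return path
```

The error message matters as much as the lookup: an external user who hits it learns what the system expected and how to fix it, which is context an internal user would have picked up informally.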
Documentation was another major challenge.
Internally, many things feel obvious. Externally, nothing is.
Writing clear documentation often required as much effort as implementing parts of the system itself.
We also had to rethink how we structure repositories:
What belongs together
What should be separated
How to organize multi-repository systems
These challenges were not drawbacks. They were signals. In most cases, addressing them improved both external usability and internal quality.
Strategic Considerations
Open source is a strategic decision first and foremost.
Open source is not always the right decision, and treating it as a default can be as problematic as ignoring it completely.
Releasing a project can expose:
Architectural decisions
System limitations
Areas still under development
In some cases, this transparency is beneficial. In others, it may conflict with business or competitive considerations.
The decision to open source should balance:
Engineering benefits
Organizational goals
Long-term maintenance capacity
Key Takeaways
Open source should be treated as an engineering and research strategy, not a place to upload code.
Preparing a project for open source improves its structure, documentation, and reproducibility. The most useful projects go beyond code and include models, datasets, and clear documentation. Finally, open source is not a one-time action; it is an ongoing commitment.
A simple test is this:
If someone outside your organization can understand, run, and extend your project without additional explanation, you are not just ready to open source it; you have already engineered it properly.
What Comes Next
In the next article, we will focus on the practical side:
Repository structure
Licensing decisions
Documentation standards
CI/CD pipelines for research and ML projects
References and Resources
GitHub Open Source Guides
Hugging Face Documentation
Choose a License
Contributor Covenant: A Code of Conduct for Digital Communities