GitHub is currently the most popular hosting platform for open source software, so could the Linux kernel community migrate to it? Daniel Vetter of Intel wrote a blog post on this question, and his answer is no. Whether that answer is fair is left to the reader.
This article is an abridged adaptation of
blog.ffwll.ch/2017/08/git…
The original author, Daniel Vetter, works on kernel development at Intel OTC.
Daniel says that this article was motivated by two things:
The first is a discussion at the Maintainerati conference, an industry gathering of open source maintainers. Daniel talked with several maintainers about how to scale truly large open source projects, and about how GitHub forces projects to scale in only a few specific ways. Linux kernel development uses a different model, one that maintainers of GitHub-hosted projects often do not understand, so Daniel explains how Linux works and why its model differs.
The second is the comments section of Maintainers Don’t Scale, another of Daniel’s blog posts, where the top-voted comment was “… Why don’t these dinosaurs use modern development tools?”. (In fact, this question has been on my own mind for a long time.) Some of the kernel community’s top maintainers defend traditional mailing lists and patch submissions against the GitHub pull request. Is the problem really just a handful of people in power? Daniel doesn’t think so. In his view, the root cause is that GitHub’s model simply does not support the Linux kernel’s development model with its very large number of contributors. As a result, moving even a few kernel subsystems to GitHub is impossible. Daniel says this is not about hosting Git data, but about how GitHub’s pull request, issue, and fork features work. (At this point I am a little skeptical of the conclusion. GitHub hosts the entire community of so many large projects that it seems a stretch to say it cannot handle even one kernel subsystem.)
How GitHub scales
What’s great about Git is that anyone can easily fork an open source project and create branches on top of it. Eventually, once you have a worthwhile improvement, you create a pull request against the upstream main repository and get the code reviewed, tested, and merged. GitHub is great too, because it hides the complexities of Git behind a UI that is easy to learn and use, which makes it easy for newcomers to contribute to open source projects.
Eventually, though, a project becomes hugely successful, and no amount of labelling, sorting, bot-herding, and automating can keep up with all the pull requests and issues in a single repository. The answer is to break the single code base into manageable parts. More importantly, different parts of a project need different rules and processes depending on their size and maturity: a brand-new experimental library has different stability and CI (continuous integration) criteria than mainline code. The project may also contain a lot of outdated, unsupported code that cannot be removed yet. All of this calls for splitting a very large project into multiple subprojects, each with its own process, merge criteria, repository, and pull request and issue tracking. Typically, a project reaches dozens or even hundreds of full-time contributors before the administrative cost becomes painful enough to make such a massive restructuring necessary.
Almost all GitHub-hosted projects solve this problem by splitting a single source tree into many different projects, divided by function. The usual result is some core code plus a collection of plugins, libraries, and extensions. All of this is tied together by some kind of plugin or package manager, in some cases by pulling code directly from other GitHub repositories.
Since almost all big projects are done this way, Daniel doesn’t want to belabor the benefits of this approach. Instead, he points out the downsides of this approach. Here’s a brief summary:
- Communities are needlessly divided. Most contributors only keep the repository they directly contribute to and ignore the rest of the project. That is convenient for them, but it makes it much less likely that parallel, duplicated work in different subprojects is noticed and the results shared.
- Refactoring and code sharing are inefficient: first you release a new version of the core project containing the new code, then you go through all the other subprojects and update them, and only then can you remove the old shared code from the core project.
- Supporting arbitrary combinations of subproject versions is unsustainable in theory; in practice it has to be backed by integration testing.
- Reorganizing the subprojects of a large project is painful because it means reorganizing Git repositories and deciding how to split them. In a single repository, a reorganization is just an update to the maintainer information file.
(Reading this, I agree that these problems do arise when a tightly coupled code base is split up merely because it has grown large. But if the project is large enough, the coupling can be worked around with process. Isn’t a Linux distribution, for example, a collection of repositories split in the same way? Of course, a Linux distribution is made up of many independent open source projects. But in my previous work on Solaris OS, a weakly coupled project was in essence split into multiple repositories. To avoid interface coupling, interface stability conventions were defined between repositories, distinguishing public from private interfaces, and there were end-of-life (EOL) rules governing how multiple versions of public interfaces were supported.)
Why pull requests exist
Daniel says the Linux kernel is one of the few projects he knows of that has not been split up in the way described above. The kernel is a huge project, and it could not function without some sub-project structure. So it is worth looking at why Git has pull requests in the first place. On GitHub, the pull request is the one true way for a contributor’s code to get merged. Kernel changes, however, are submitted as patches to a mailing list, and that remained true even after the kernel project adopted Git.
Yet pull requests have been supported since very early versions of Git. Their original users were kernel maintainers, and Git itself was designed to solve Linus Torvalds’ kernel maintenance problems. Pull requests are clearly necessary and useful, but they were not designed to handle changes from individual contributors. To this day, kernel pull requests are used to forward the changes of an entire subsystem, or to synchronize refactoring and other changes across multiple subprojects. For example, in the 4.12 release, the pull request from the network subsystem maintainer, Dave S. Miller, was merged by Linus. It contained over 2,000 commits from 600 individual contributors, along with a number of merges of pull requests from the next level of maintainers. All of these patches, however, were picked up from mailing lists and committed by maintainers, not by their original authors. This is one of the oddities of the Linux kernel process: the original author does not commit code to the repository. It is also why Git tracks committer and author separately.
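As a rough illustration of the author/committer split described above, the sketch below models commits with separate author and committer fields and counts the distinct people in each role. The `Commit` class and the sample names are hypothetical and exist only for illustration; in a real repository the same two fields can be printed with `git log --format='%an | %cn'`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified view of a Git commit's identity fields."""
    subject: str
    author: str      # who wrote the patch
    committer: str   # who applied it to the tree


def role_summary(commits):
    """Return (distinct authors, distinct committers) for a list of commits."""
    authors = {c.author for c in commits}
    committers = {c.committer for c in commits}
    return authors, committers


# Made-up history: three contributors mail patches, one maintainer applies them.
history = [
    Commit("net: fix checksum offload", "alice", "subsystem-maintainer"),
    Commit("net: add new driver", "bob", "subsystem-maintainer"),
    Commit("net: tidy up headers", "carol", "subsystem-maintainer"),
]

authors, committers = role_summary(history)
print(f"{len(authors)} authors, {len(committers)} committer(s)")
```

In this toy history every patch keeps its original author even though a single maintainer committed all of them, which is the shape of the 4.12 networking pull request mentioned above, only much smaller.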
GitHub’s innovation was to use pull requests everywhere, all the way down to individual contributors. That, however, is not the purpose for which pull requests were originally created.
How the Linux kernel scales
At first glance Linux is a monolithic repository, with everything crammed into Linus’s main repository. In reality it is far from that:
- Few people run Linux from Linus Torvalds’ main repository. If they run upstream at all, they usually use a stable kernel branch. Far more often they run a distribution kernel, which typically carries extra patches and backports, is not even hosted on kernel.org, and is often organized in a completely different way. Or they use a kernel from their hardware manufacturer, which often differs substantially from the main repository.
- No one develops directly in Linus’s repository except Linus himself. Each subsystem, and even each large driver, has its own Git repository and its own mailing list for tracking patch submissions and discussing issues, and these subsystems are independent of each other.
- Cross-subsystem work is done in the linux-next integration tree, which typically contains hundreds of Git branches from different Git repositories.
- All of this is managed through the MAINTAINERS file and the get_maintainer.pl script, which can tell you, for any given piece of code, who the maintainer is, who the reviewers are, where the Git repository is, which mailing list to use, and where to report bugs. The tool works not only from file locations but also from patterns in the code itself, such as device tree handling or the kobject hierarchy, so that cross-subsystem changes reach the appropriate experts.
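The maintainer lookup described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the kernel's actual script: the `Entry` class, the `lookup` function, and all names and addresses are hypothetical, and the real MAINTAINERS format is considerably richer. The sketch only shows the two matching strategies mentioned above, by file location and by patterns in the change itself.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, simplified stand-in for one MAINTAINERS section."""
    name: str
    maintainers: list
    mailing_list: str
    path_prefixes: list = field(default_factory=list)     # match by file location
    content_patterns: list = field(default_factory=list)  # match by patch content


def lookup(entries, path, patch_text=""):
    """Return every entry responsible for a change to `path`."""
    hits = []
    for entry in entries:
        by_path = any(path.startswith(p) for p in entry.path_prefixes)
        by_content = any(re.search(rx, patch_text) for rx in entry.content_patterns)
        if by_path or by_content:
            hits.append(entry)
    return hits


# Made-up entries; maintainer names and list addresses are placeholders.
ENTRIES = [
    Entry("GPU DRIVERS", ["maintainer-a"], "gpu-list@example.org",
          path_prefixes=["drivers/gpu/"]),
    Entry("DEVICE TREE", ["maintainer-b"], "dt-list@example.org",
          content_patterns=[r"\bof_get_property\b"]),
]

patch = '+\tprop = of_get_property(np, "reg", NULL);'
for entry in lookup(ENTRIES, "drivers/gpu/drm/foo.c", patch):
    print(entry.name, entry.maintainers, entry.mailing_list)
```

A driver patch that touches device tree handling is routed to both groups here, which is how a cross-subsystem change ends up in front of the right experts without anyone having to know the subsystem boundaries by heart.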
According to Daniel, this approach has the following advantages over the GitHub-style subproject split described earlier:
- Reorganizing the split into subprojects is very easy: update the MAINTAINERS file and create a new Git repository.
- Pull request and issue discussions are very easy to reassign between subprojects: just add the other subsystem’s mailing list to the Cc: of your reply. Cross-subsystem work is similarly easy to coordinate, because the same pull request can be submitted to multiple subprojects while there is still only one global discussion.
- Cross-subsystem work requires no release coordination between multiple projects. You simply change all the affected code in your own repository.
- It does not prevent you from making your own experimental changes, which is one of the key benefits of having multiple repositories. Just add the code to your own fork; no one forces you to change it back or to push it into one single repository, because there is no central repository.
The only problem is that GitHub does not support cross-repository workflows.
Linux: Monotree with Multiple Repositories
Some might argue that the Linux model looks like a monolithic repository with exactly the scaling problems mentioned at the beginning of this article. Daniel therefore spends a good part of his post explaining that Linux is a monotree with multiple repositories.
Daniel also uses the pull request workflow between kernel maintainers to show why Linux does not work on GitHub. Here is a summary:
The simple scenario is propagating changes up the kernel maintainer hierarchy until they land in a tree from which software is actually shipped. On GitHub, this is easy to do with the pull request UI.
More interesting are changes that span multiple subsystems, because the pull request flow is then no longer a simple tree but an acyclic graph, and sometimes a mesh. The first step is to have the changes reviewed and tested by the maintainers of all affected subsystems. In the GitHub workflow, this would be one pull request submitted to multiple repositories at the same time, with a single discussion thread shared across all of them. In the kernel community, this is done by sending the patches to a whole set of mailing lists and maintainers at once.
The way a change is reviewed is usually not the way it is eventually merged. Instead, one subsystem is chosen to accept the pull request, with the agreement of all other subsystem maintainers. Usually the chosen subsystem is the one most affected by the change, but sometimes it is chosen because it already carries work that conflicts with the pull request. Sometimes an entirely new repository, with its own maintainers, is set up. This usually happens when the functionality spans the whole tree and cannot be neatly confined to a set of files in one place or one directory. The most recent example is the DMA mapping tree, which tries to consolidate work that used to be spread across drivers, platform maintainers, and processor architecture support.
Sometimes, however, a set of changes conflicts with several subsystems at once, and each of them would have to resolve non-trivial merge conflicts. In that case the cross-subsystem patches are usually not applied directly (the equivalent of a rebasing pull request on GitHub). Instead, a pull request containing only the necessary patches, based on a commit common to all the subsystems, is merged into every subsystem tree. The common baseline matters because it keeps unrelated changes from leaking into a subsystem. Since such a pull request exists for one specific topic, these branches are commonly called topic branches.
Daniel also gives the example of Microsoft’s OS project, which is a single code tree. According to his conversations with people at Microsoft, the tree is so large that Microsoft had to write a virtual file system, GVFS, to keep development efficient. (In my opinion this example is slightly unfair. The Microsoft OS tree contains not only the kernel but also a great deal of user-mode code, whereas the Linux kernel tree contains only the kernel. Seen from the broader, more comparable perspective of a Linux distribution, Linux is in fact made up of multiple code trees as well, and it was Daniel who criticized GitHub’s multi-tree approach to subprojects…)
Dear GitHub, it’s your turn…
Unfortunately, GitHub does not support the cross-subsystem workflows discussed above, at least not in its native UI. They can of course be done with plain Git commands, but then you are back to sending patches through a mailing list, sending pull requests by mail, and merging manually. In Daniel’s opinion, this is the only reason the kernel community cannot move to GitHub. There is also one minor hiccup (really?): some of the top maintainers are strongly opposed to GitHub, but that is not a technical issue. Nor is this only about the Linux kernel. It is a general problem that every giant GitHub project faces when scaling, because GitHub offers no way to scale out to multiple repositories while keeping a single code tree.
In short, Daniel puts what he considers a simple feature request to GitHub:
Support pull requests and issue tracking that span the multiple repositories of a single monotree.
Very simple idea, but with huge impact.
Daniel does not stop at the core idea; he also spells out some details of the proposal, summarized below:
Repositories and organizations
First, one organization must be able to own multiple forks of the same repository. Take git.kernel.org as an example: most of the repositories there are not personal.
Multiple branches within one repository are no substitute, because the main reason for splitting repositories is to keep pull requests and issues separate.
It must also be possible to establish fork relationships after the fact. This is not a problem for new projects, but it is for a Linux migration: otherwise every Linux subsystem would have to move at once, and there are already many Linux repositories on GitHub that have no proper GitHub fork relationship with one another.
Pull Request
A pull request needs to be submittable to multiple repositories at the same time while keeping a single discussion thread. Besides targeting all repositories at once, it should also be possible to reassign a pull request to a different branch of a repository.
The state of a pull request also needs to be tracked separately for each repository. One repository’s maintainer might close the pull request without merging it, because the maintainers have agreed that another subsystem will pull it in, and that subsystem’s maintainer will merge and close it. Another tree might close the pull request as invalid because it does not apply to an older version of the code, or to a particular vendor’s fork. More interesting still, a pull request might be merged several times, once in each subsystem’s repository, each time with a different commit ID.
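One way to picture the requirement above is a data model with a single discussion thread and a separate state per target repository. The sketch below is purely illustrative: `CrossRepoPullRequest`, `State`, the repository names, and the commit ID are all hypothetical and do not correspond to any existing GitHub API.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OPEN = "open"
    MERGED = "merged"
    CLOSED = "closed"    # closed without merging, e.g. another tree pulls it
    INVALID = "invalid"  # does not apply to this tree


@dataclass
class CrossRepoPullRequest:
    """Hypothetical model: one discussion thread, one state per repository."""
    title: str
    comments: list = field(default_factory=list)       # single shared thread
    states: dict = field(default_factory=dict)         # repo name -> State
    merge_commits: dict = field(default_factory=dict)  # repo name -> commit ID

    def target(self, repo):
        self.states[repo] = State.OPEN

    def merge(self, repo, commit_id):
        self.states[repo] = State.MERGED
        self.merge_commits[repo] = commit_id

    def close(self, repo, state=State.CLOSED):
        self.states[repo] = state


pr = CrossRepoPullRequest("rework DMA mapping helpers")
for repo in ("subsystem-a", "subsystem-b", "vendor-fork"):
    pr.target(repo)
pr.comments.append("Agreed: subsystem-a will carry this.")
pr.merge("subsystem-a", "1a2b3c4")      # placeholder commit ID
pr.close("subsystem-b")                 # closed here, merged elsewhere
pr.close("vendor-fork", State.INVALID)  # does not apply to the older fork
print({repo: s.value for repo, s in pr.states.items()})
```

Keeping `merge_commits` as a per-repository mapping leaves room for the case where the same pull request is merged into several trees, each with its own commit ID.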
Issues
Like pull requests, issues can be relevant to multiple repositories and need to be movable between them. For example, a bug may be reported against a distribution’s kernel repository; after analysis it turns out that the driver bug also exists in the latest development branch, so the issue concerns not only that repository but also the main upstream branch, and perhaps other repositories as well.
Issue state should be independent per repository, because a fix pushed to one repository is not immediately available in all the others. Moreover, backporting to older kernels and distributions may require extra work, and some repositories may decide the bug is not worth fixing and close it as WONTFIX, while it is marked as successfully fixed elsewhere. (The defect tracker at my previous company had all of these features. They are indeed required of a commercial software defect tracking system.)
Summary: Monotree, not Monorepo
The Linux kernel is not moving to GitHub. But enabling “monotree, not monorepo” projects to be hosted on GitHub would be a huge benefit for all existing mega-projects.
Tradeoffs are everywhere
That concludes Daniel’s article. What follows are my own views, offered for reference only.
First, because of GitHub’s design, being forced to split a tightly coupled code base into subprojects is a real problem. The benefits of splitting loosely coupled code, however, are just as obvious: each part can be developed and maintained in its own direction and at its own pace. Broadly speaking, this is the relationship between glibc and the Linux kernel, and a stable system call interface is what makes it work. The boundaries between repositories should therefore be drawn where stable protocols and interfaces exist.
Second, GitHub, as the world’s largest social platform for programmers, serves the needs of individual open source enthusiasts. It is not surprising, then, that it does not support the “monotree, not monorepo” model that large open source projects need. In terms of product positioning, GitHub’s current design may be one of the reasons for its success: it is what allowed Git to spread from a handful of Linux kernel developers to the entire programmer community.
Finally, in the spirit of KISS (Keep It Simple, Stupid), GitHub’s biggest strength is being simple enough for mainstream users. But as GitHub’s influence grows and more businesses and organizations host projects on it, perhaps Daniel’s advice is worth taking seriously. At the very least, enterprise users are likely to pay real money.
It seems we face tradeoffs not only in technology but also in product direction. Who is to say that the GitHub shortcomings criticized in this article were not a deliberate decision by its product managers? And Daniel’s suggestion may simply reflect what large enterprises now need.
This article is the original content of Aliyun and shall not be reproduced without permission.