Today, Kubernetes has become the de facto standard for distributed cluster management and for public and private clouds. Kubernetes is, in essence, a distributed operating system: it crystallizes more than a decade of Google's engineering experience and wisdom in that field. Having long managed some of the world's largest clusters, Google leads the world in the research and understanding of distributed operating systems. As a result, Kubernetes, released in 2014, was able to surpass many of its predecessors in just a few years.
As a distributed operating system, Kubernetes (including its predecessor, Google Borg) appeared much later than UNIX, Linux, Windows, and other famous stand-alone operating systems, and its architecture naturally inherits many precious legacies of those systems. The microkernel architecture is the most important of these legacies. In the remainder of this article, we focus on the concept of the microkernel and its impact on the Kubernetes architecture.
What is a microkernel?
Before introducing the microkernel, it is worth looking back at the history of stand-alone operating systems to understand its value. In this section, "operating system" refers to a single-node operating system.
The rise of UNIX
In the decades between the birth of the electronic computer and the 1970s, many operating systems appeared, among which DOS, OS/360, and Multics were well-known representatives. This was a pioneering era in the field of operating systems, and twenty years of pioneering work bore great fruit: advances in CPU technology led, in 1969, to the birth of UNIX, a true time-sharing operating system.
Image credit: Wikipedia
Supported by new CPU capabilities, UNIX divides a software system into the kernel and userland programs. The kernel is a collection of interrupt handlers that encapsulate the capabilities of the hardware as system calls; user-mode programs use hardware functionality through those system calls. User-mode programs run in their own processes; when a program makes a system call, it traps into the kernel, and a time-sharing scheduling algorithm in the kernel decides which process the CPU is handed to next and manages the context switch between processes. In addition, UNIX encapsulates (almost) all hardware as files. UNIX also provides a special user-mode program, the shell, through which users operate the system directly; via the inter-process communication facilities provided by the kernel, the shell lets users combine a series of small programs to handle complex requirements. This design philosophy is often summed up as KISS ("Keep It Simple, Stupid"). All of these design ideas were remarkable innovations at the time.
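To make the system-call and "everything is a file" ideas concrete, here is a minimal Go sketch (assuming a Linux machine; the device path /dev/urandom is chosen purely for illustration): the user-mode program never touches the hardware directly, it only asks the kernel, via the open and read system calls, to operate on what looks like an ordinary file.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// open() traps into the kernel; the kernel driver for the random
	// device is exposed to user space as the file /dev/urandom.
	fd, err := syscall.Open("/dev/urandom", syscall.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)

	// read() is another system call: the CPU switches to kernel mode,
	// the kernel copies data into our buffer, and control returns to us.
	buf := make([]byte, 8)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d random bytes via system calls: %x\n", n, buf[:n])
}
```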
Not only did UNIX make a direct contribution to the industry in its own right, it became the blueprint for all modern operating systems, earning its two authors, Ken Thompson and Dennis Ritchie, the 1983 Turing Award.
UNIX was born at Bell Labs, part of AT&T. Seeing how powerful it could become, AT&T made the seemingly selfless decision to open up UNIX (initially only to universities), which led to the birth of all modern operating systems. AT&T was eventually broken up, but the legacy of that decision continues to this day. Now, in the 2020s, whether it is macOS, Windows, or Linux, every major operating system is directly influenced by UNIX; iOS descends from macOS and Android from Linux, so the soul of UNIX lives on in everyone's phone and in the services behind every phone app.
In addition, UNIX produced a byproduct worth even more than the operating system itself: Dennis Ritchie designed the C language for UNIX. C became the primary source of inspiration for the design of all popular modern programming languages, and nearly fifty years after its birth it remains one of the most important languages in use.
It is worth mentioning that, at the time, UNIX was mainly opened up to research universities such as Berkeley and Carnegie Mellon. Small liberal arts colleges without graduate programs were not among AT&T's main targets, so a young graduate of Olivet College was untouched by the UNIX school of thought. That software genius, David Cutler, began designing the VMS operating system at DEC in 1975. VMS ran on DEC's VAX, the successor to the PDP-11 on which early UNIX ran, but it was designed independently of UNIX. VMS did not make great waves in the industry and ended up adding UNIX compatibility; David Cutler later left Digital to join Microsoft, where he created his own legend. Interestingly, Jobs also attended a liberal arts college, so it seems American liberal arts college students are not to be underestimated.
The rise of the microkernel
UNIX's "everything is a file" design brings great convenience to user program design, but it requires all hardware to be encapsulated in kernel mode, so a bug in any kernel module affects the entire system. For example, if a device driver has a memory leak, every process using that device leaks memory; if a kernel module has a security vulnerability, the security of the whole system is compromised.
In order to solve these problems, operating system researchers began to develop the concept of the "microkernel" in the 1970s. The essence of the microkernel is that the kernel retains only the most basic functions: memory address-space management, thread management, and inter-process communication (IPC). Other functions, such as file systems, device drivers, network protocol stacks, and GUI systems, are treated as separate services, typically implemented as independent user-mode daemons.
User-mode applications access these services through IPC and thereby obtain the full functionality of the operating system. This greatly reduces the number of system calls that must trap into the kernel and makes the system more modular. It also makes the system more robust: only a small number of system calls in the kernel have access to the full capabilities of the hardware, so, for example, a faulty device driver affects only the corresponding service rather than the entire system. In contrast to the microkernel, the UNIX design came to be called a monolithic kernel.
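As a thought experiment, the sketch below models the microkernel's message-passing idea in Go (goroutines stand in for processes and channels stand in for kernel-mediated IPC; the message fields are invented for illustration): instead of trapping into the kernel, a client "opens" a file by sending a request message to a user-mode file-system service and waiting for its reply.

```go
package main

import "fmt"

// openRequest and openReply stand in for the IPC messages a microkernel
// copies between address spaces; the channels stand in for kernel IPC.
type openRequest struct {
	path  string
	reply chan openReply
}

type openReply struct {
	fd  int
	err error
}

// fileServer plays the role of a user-mode file-system service: it owns
// all file-system state and answers requests from other processes over IPC.
func fileServer(requests <-chan openRequest) {
	nextFD := 3
	for req := range requests {
		req.reply <- openReply{fd: nextFD}
		nextFD++
	}
}

func main() {
	requests := make(chan openRequest)
	go fileServer(requests)

	// A user program "opens" a file: instead of trapping into the kernel,
	// it sends a message to the file-system service and waits for the reply.
	reply := make(chan openReply)
	requests <- openRequest{path: "/etc/hosts", reply: reply}
	r := <-reply
	fmt.Println("got fd", r.fd, "err", r.err)
}
```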
After UNIX was opened up, AT&T continued to iterate on it, and universities developed many new operating system kernels based on AT&T's UNIX. Among the better known are:
- BSD, a monolithic kernel, first released in 1978 by Berkeley legend Bill Joy (who reportedly developed the first version of the BSD kernel in three days; Bill Joy's other work includes the BSD TCP/IP stack, vi, Solaris, SPARC chips, and more). The kernel had a great impact on the industry and later branched into FreeBSD, OpenBSD, NetBSD, and others; modern operating systems such as Solaris, Mac OS X, and Windows NT have all borrowed from it.
- Mach, a microkernel, released by Carnegie Mellon University in 1984 and developed principally by Rick Rashid and his graduate student Avie Tevanian. This kernel also had a great impact on the industry, with GNU Hurd and Mac OS X drawing on it, but the project itself ultimately failed.
- MINIX, a microkernel, released in 1987 by Professor Andrew Tanenbaum of Vrije Universiteit Amsterdam. Countless computer science students have learned the design principles of operating systems through MINIX and its accompanying textbook, and the initial version of Linux was developed on MINIX and drew on it. MINIX, while famous, was used primarily for teaching and never gained a foothold in industry.
The silence of the microkernel
From the 1990s to the 2010s, the descendants of UNIX and VMS fought a free-for-all, and for the microkernel the result was a beautiful ideal meeting a harsh reality:
- MINIX, a microkernel, remained a system designed mainly for teaching, while Linux, a monolithic kernel that grew up in MINIX's shadow, was a huge success. Mach had a profound influence on the industry but was never itself widely used, and its successor, GNU Hurd, has remained under development without ever seeing real use.
- Windows' NTOS kernel was designed by David Cutler, drawing on VMS, the system he had designed at DEC independently of UNIX. NTOS borrowed ideas from microkernels and some code from BSD, but Cutler ultimately decided to place key services (such as the GUI) in kernel mode rather than user mode. As a result, Windows NT is microkernel-like in its software architecture while running, in practice, much like a monolithic kernel; this design is called a hybrid kernel.
- Mac OS X was based on NeXTSTEP, which was designed by Avie Tevanian, one of the principal architects of Mach. After his PhD, both Gates and Jobs courted him; Tevanian chose NeXT, while his CMU colleague Rick Rashid went to Microsoft, where he later led Microsoft Research, and Tevanian reportedly pulled out a calculator every day at NeXT to work out how much stock appreciation he had forgone by not joining Microsoft. After returning to Apple with Jobs, Tevanian designed OS X on the basis of NeXTSTEP and BSD code. OS X, too, adopted the hybrid-kernel architecture, achieved great success, and later migrated seamlessly from the PowerPC to the x86 instruction set.
Avie Tevanian and Rick Rashid, like Andrew Tanenbaum, were standard-bearers of the microkernel, engineers no less formidable than Linus Torvalds or David Cutler. Yet there is a reason none of them managed to get a pure microkernel off the ground.
A microkernel operating system accesses system services far less efficiently than a monolithic one. In Linux, a system call such as open traps into the kernel exactly once, switching the CPU into privileged mode and then back. In a microkernel operating system, the same open call requires the user process to assemble an IPC request message, send it to the file-system service process, then receive and unpack the IPC response message from that service to obtain the result. The message copying and the extra process context switches add a great deal of overhead. Messages must be copied because user-mode processes cannot access each other's address spaces, whereas kernel code can access the memory of any user-mode process. It was for performance reasons that both OS X and Windows chose the hybrid-kernel architecture, and NTOS even moved the GUI subsystem into the kernel for a better user experience.
To put it simply, even on a weak PC the Windows mouse pointer keeps following your hand, moving smoothly even when the system is on the verge of crashing. The success of Windows XP over its predecessor, Windows 98, owed much to NTOS's close attention to performance. By contrast, although Apple's original Macintosh was a feat of engineering in the mid-1980s, Jobs could not convince the sales-driven management to equip it with more memory, so the original Mac performed poorly, ran programs very slowly, and failed to achieve the breakout success it deserved.
Kubernetes and microkernels
Performance may be crucial for a stand-alone operating system, but not so for a distributed one. A distributed operating system works behind the scenes and does not face users directly, and any small loss in single-machine performance can be made up for with more machines. Under this premise, a better architecture often matters more than raw performance.
The birth of Borg
At the time the stand-alone OS wars were being decided, Google, the industry's newest darling, was preparing for its IPO. Google was, in today's terms, a "little giant": a rising giant the established players would have liked to snipe at, had they not been too bogged down in their own war to spare the effort. In 2003, in order to better support a new version of its search engine (based on MapReduce) for hundreds of millions of users, Google began work on a massive cluster management system called Borg, whose goal was to manage clusters of ten thousand machines. Although it started with a small team of three or four people, Borg kept pace with Google's rapid growth and proved its potential. Eventually, all of Google's machines were managed by Borg, and well-known systems such as MapReduce and Pregel were built on top of it. From the operating-system perspective, Borg was a monolithic system: any functional upgrade required modifying Borg's underlying code. In a mature technology company like Google, with plenty of strong engineers, this is not a serious problem for internal systems. A public cloud, however, must accommodate the needs of a great many third-party applications, and no company's engineering team, however strong, can connect every other system in the industry to Borg itself. In that setting, the extensibility of the system becomes critical.
Around 2010, as Google wound down its operations in China, many of Google's best engineers joined Chinese companies such as BAT, and some joined Tencent's Soso. The former Googlers replicated many of Google's systems with great technical skill, including a Borg clone called TBorg, later renamed Torca. Torca played a very important role in Soso's advertising business. Later, owing to Tencent's business adjustments and the merger of Sogou and Soso, Torca lost its users inside Tencent and its maintenance gradually stopped.
A few years after Borg was launched, Google recognized the problems and bottlenecks of its monolithic architecture, and a small team began working on the Omega system. Omega inherited the ideas of the microkernel and was more flexible and extensible than Borg: adding new functionality required almost no changes to the underlying code. However, because all of Google's systems had been built on Borg, and because Borg's monolithic nature meant that MapReduce and other systems were tightly bound to Borg's core code, a seamless migration to Omega was impossible. Migration would have required an enormous amount of effort, time, and trial and error, so despite the persistent efforts of its core team, Omega never took off inside Google.
Interestingly, the career trajectory of Brendan Burns, one of the core members of the Omega project, has a great deal in common with that of David Cutler, the operating-system guru:
- Both graduated from liberal arts colleges: David Cutler from Olivet College and Brendan Burns from Williams College.
- Both joined a traditional-industry giant after graduation: David Cutler at DuPont and Brendan Burns at Thomson Financial.
- As the Godfather said, a man can have only one destiny. Cutler and Burns learned to write code at those two traditional giants, and it may have been then that they discovered their talent for software and their destiny to build a new generation of operating systems. Both chose the hottest tech company of their day for their second job: David Cutler joined Digital and Brendan Burns joined Google.
- Both reached the pinnacle of their careers at Microsoft: Brendan Burns is now a Corporate Vice President, and David Cutler is Microsoft's only Senior Technical Fellow. Microsoft reportedly has a rule that Cutler's technical rank must always be the highest in the company: whenever anyone else reaches Cutler's level, Cutler automatically moves up a level.
The birth of Kubernetes
In the era of stand-alone operating systems, the hybrid kernel prevailed, which vindicated the microkernel as a software architecture. But because of performance problems, no successful kernel adopted a "pure" microkernel architecture, so from a practical point of view the microkernel failed.
Unlike the failures of microkernel architecture in the stand-alone era, Omega's failure inside Google had nothing to do with performance; it was a legacy problem. The open source community and most companies had no comparable system and no historical burden, so a few years later Google decided to open up the ideas of Omega, its next-generation distributed operating system beyond Borg, as an open source project named Kubernetes.
In order to understand the relationship between Kubernetes and the microkernel, and the advantages that the microkernel architecture brings to Kubernetes, we need to introduce a few technical details.
As mentioned above, a system call in a stand-alone operating system must "trap" into the kernel, which is implemented via an interrupt. Whatever the kernel type, a stand-alone operating system registers its interrupt handlers in a region of memory at startup; this region is called the interrupt vector table or Interrupt Descriptor Table (IDT). Interrupt handling in modern operating systems is complex and there are many system calls, so in addition to the IDT there is a system call table, which we will call the System Call Vector (SCV): all system calls enter the kernel through a single interrupt entry (e.g. INT 0x80 on x86 Linux), and that interrupt handler dispatches each call to the corresponding kernel function via the SCV. The SCV is as important to an operating system as the SCV is to StarCraft. In a microkernel architecture, the system capabilities provided by user-mode services must likewise be registered somewhere, in addition to the system calls registered in the SCV.
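The following Go sketch is a toy model of that registration-and-dispatch idea, not real kernel code: handlers are registered in a table at startup, and a single trap entry point dispatches by system-call number, the way INT 0x80 dispatches through the system call table. The syscall number 39 and its return value are arbitrary choices for illustration.

```go
package main

import "fmt"

// handler is the shape of a registered system-call implementation.
type handler func(args ...uint64) uint64

// syscallTable plays the role of the SCV: call number -> kernel function.
var syscallTable = map[uint64]handler{}

// registerSyscall is what "registering a system call at startup" means here.
func registerSyscall(number uint64, h handler) {
	syscallTable[number] = h
}

// trap is the single entry point every system call goes through,
// analogous to the unified interrupt entry.
func trap(number uint64, args ...uint64) uint64 {
	h, ok := syscallTable[number]
	if !ok {
		return ^uint64(0) // -ENOSYS in spirit
	}
	return h(args...)
}

func main() {
	// Register a fake "getpid" under an arbitrary number and invoke it.
	registerSyscall(39, func(args ...uint64) uint64 { return 4242 })
	fmt.Println("pid =", trap(39))
}
```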
Similarly, a distributed operating system like Kubernetes provides its services to the outside world in the form of APIs. The APIs provided by the distributed operating system itself are equivalent to the system calls of a stand-alone operating system, and every API needs to be registered somewhere; for Kubernetes, API objects are stored in etcd. The APIs that Kubernetes itself provides, the equivalent of system calls, are backed by components called Controllers, while new APIs added by developers are backed by Operators; Operators and Controllers are built on the same mechanism. This matches the idea of the microkernel architecture: the built-in Controllers correspond to the services running in kernel mode, providing core capabilities such as scheduling and thread/process management, while Operators correspond to the GUI, file system, printer, and other services that run in user mode in a microkernel.
Image source: mapr.com/products/ku…
Kubernetes therefore works in a way similar to a stand-alone operating system. etcd provides a watch mechanism; each Controller and Operator declares what it wants to watch and registers that interest with the API server, which watches etcd on its behalf. This is the equivalent of registering handlers in the IDT or SCV in a microkernel architecture.
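As a concrete illustration, the following Go sketch (using the standard client-go library and assuming a local kubeconfig at the default path) registers a watch on Pods in the default namespace. The API server streams change events back to the client, which is the Kubernetes analogue of installing a handler in the IDT/SCV; a real Controller or Operator watches its own resource types through the same mechanism, usually via informers.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config); assumed for illustration.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Register a watch on Pods in the "default" namespace: the API server
	// (backed by etcd's watch) streams change events to us, the analogue
	// of registering a handler in the IDT/SCV.
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		fmt.Printf("pod event: %s\n", event.Type)
	}
}
```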
For example, Argo is an Operator that adds the ability to run DAG workflows on Kubernetes. kubectl submits an Argo workflow YAML to the Kubernetes API server, which writes the data as key-value pairs into etcd; etcd then notifies every service watching those keys, in this case Argo. This is just like a user process requesting a user-mode service in a microkernel architecture.
Argo receives the watch event, reads and parses the workflow data, determines which container to start, and asks Kubernetes to start the corresponding container through the API. The Kubernetes scheduler, itself a Controller, allocates resources for the container upon receiving the request, and the kubelet on the chosen node starts it. This mirrors the way one user process starts another through a system call in a microkernel architecture.
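Below is a minimal sketch of that "ask Kubernetes to start a container" step, again using client-go with a local kubeconfig; the Pod name, image, and namespace are made up for illustration, and a real workflow engine such as Argo builds these Pod specs from the parsed workflow. The controller simply creates a Pod object through the API server, and the scheduler and kubelet do the rest.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Ask Kubernetes to start a container by creating a Pod object through
	// the API server; the scheduler then binds the Pod to a node, and the
	// kubelet on that node actually starts the container.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "workflow-step-1"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "main",
				Image:   "alpine:3.19",
				Command: []string{"echo", "hello from a workflow step"},
			}},
		},
	}
	if _, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```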
Of course, there are differences between Kubernetes and a stand-alone operating system: Kubernetes has no explicit "trap" step, whereas in a microkernel-based stand-alone operating system a process traps into the kernel for system calls but not for requests to user-mode services. However, Kubernetes can grant different permissions to different services, which is somewhat analogous to the distinction between kernel-mode and user-mode CPU privilege levels in a stand-alone operating system.
The architectural advantage of the microkernel is evident in Kubernetes. In Borg, adding a new subsystem is complicated, usually requires modifying Borg's underlying code, and binds the new system to Borg. In Kubernetes, developers simply implement an Operator with the available SDKs and can add a whole new set of APIs without touching the underlying Kubernetes code; Argo and Kubeflow are both Operators. Because almost any existing piece of software can be integrated into Kubernetes through the Operator mechanism, Kubernetes is very well suited to being the underlying distributed operating system of a public cloud. Released in mid-2014 and maturing through 2015, Kubernetes became mainstream in the industry by 2016, and companies without historical burdens also use it as the foundation of their internal clouds.
Conclusion
In this article, we gave a brief history of stand-alone operating systems, traced the rise and fall of the microkernel architecture within that history, and described how the microkernel architecture has been revitalized in Kubernetes. In general, a technology far ahead of its time may not succeed when it is first proposed, but it can reclaim its glory years later once the times catch up. The contrasting fortunes of the microkernel in the stand-alone era and the cloud computing era demonstrate this, as do the contrasting fortunes of deep learning in the eras of low and high computing power.
It is worth noting that after Kubernetes, Google introduced Fuchsia as a possible alternative to Android. Fuchsia is built on Zircon, a microkernel written in C++. It remains to be seen whether, beyond distributed operating systems, the microkernel will also enjoy a renaissance in mobile and IoT operating systems in this era of exploding computing power.
This article is based on a recent talk Wang Yi shared with the SQLFlow and ElasticDL teams, whose members include Shen Junmo, Zhang Haitao, Wu Yi, Yan Xu, and Zhang Ke. The talk explained why SQLFlow is a Kubernetes-native distributed compiler, and why ElasticDL targets distributed AI on Kubernetes only. The authors of this article also include authors of Baidu Paddle EDL, a distributed computing framework based on PaddlePaddle and Kubernetes that was contributed to the Linux Foundation in 2018.
This article was first published on the "Financial-Grade Distributed Architecture" public account.