Kubeflow contains many components, and it is not always easy to figure out what each one does. This article gives a systematic summary of Kubeflow so that we understand what each component is for and how the components relate to each other, helping you choose the right components quickly and sensibly. The underlying architecture and workflow of each component are then introduced and analyzed in detail for further study.
What is Kubeflow?
Kubeflow is a machine learning toolkit for Kubernetes. It is a technology stack that runs on K8S and contains many components; the relationships between the components are relatively loose, so we can use them together or use only some of them on their own. The following image shows Kubeflow as a platform for arranging the components of an ML system on Kubernetes:
(Figure: Kubeflow as a platform for arranging ML system components on top of Kubernetes)
In the experimental phase, we will develop the model based on initial assumptions and repeatedly test and update the model to produce the desired results:
- Identify the problem we want the ML system to solve.
- Collect and analyze the data required to train ML models.
- Select an ML framework and algorithm and code the initial version of the model.
- Test data and train models.
- Adjust model hyperparameters to ensure the most efficient processing and the most accurate results.
During production, we will deploy systems that perform the following processes:
- Convert the data into the format required by the training system (to ensure that our model behaves consistently during training and prediction, the transformation process must be the same during the experimental and production phases).
- Train ML models.
- Serve the model for online prediction or in batch mode.
- Monitor the performance of the model and feed the results back into the process to tune or retrain the model.
It can be seen that the goal of Kubeflow is to build a unified machine learning platform on top of K8S that covers the main machine learning workflow (data -> feature -> model -> serving -> monitoring), while supporting both the experimental exploration phase of machine learning and the formal production environment.
Kubeflow components
Kubeflow provides many components covering all aspects of machine learning. To get a more intuitive and in-depth understanding of Kubeflow, let's first look at the components as a whole, with a brief introduction to the main ones:
- Central Dashboard: the Kubeflow dashboard page.
- Metadata: used to track individual datasets, jobs, and models.
- Notebooks: an interactive IDE coding environment.
- Frameworks for Training: the supported ML frameworks:
  - Chainer
  - MPI
  - MXNet
  - PyTorch
  - TensorFlow
- Hyperparameter Tuning: Katib, the hyperparameter server.
- Pipelines: an ML workflow component used to define complex ML workflows.
- Tools for Serving: deployment of machine learning models on Kubernetes:
  - KFServing
  - Seldon Core Serving
  - TensorFlow Serving (TFJob): provides online deployment of TensorFlow models, with support for version control and switching models without stopping the online service.
  - NVIDIA Triton Inference Server (formerly the TensorRT Inference Server)
  - TensorFlow Batch Prediction
- Multi-tenancy in Kubeflow: multi-tenant support in Kubeflow.
- Fairing: a library that packages code and builds it into an image for Kubeflow.
Most Kubeflow components work by defining CRDs. At present, the main Kubeflow components are:
- Operator: provides resource scheduling and distributed training capabilities for different machine learning frameworks (TF-Operator, PyTorch-Operator, Caffe2-Operator, MPI-Operator, MXNet-Operator).
- Pipelines: a pipeline project for machine learning scenarios implemented on top of Argo; it provides creation, scheduling, and management of machine learning workflows, and also provides a Web UI.
- Katib: a hyperparameter search and simple model structure search system built on the various Operators; it supports parallel search and distributed training. Hyperparameter optimization has not yet been applied at large scale in practical work, so this part of the technology still needs some time to mature.
- Serving: supports service-oriented deployment and offline prediction of models trained with each framework. Kubeflow offers solutions based on TFServing, KFServing, Seldon, and so on. Because there are many machine learning frameworks and a wide variety of algorithm models, the industry has lacked a truly unified deployment framework and solution; in this respect, Kubeflow only integrates the common ones and does not add further abstraction or unification.
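Since most of these components expose their functionality as CRDs, working with them from code usually means creating custom resources. The sketch below, which is illustrative rather than taken from the Kubeflow docs, submits a minimal TFJob (the resource handled by TF-Operator) with the Kubernetes Python client; the namespace, image, and training command are placeholder assumptions.

```python
# Rough sketch: a minimal TFJob custom resource (handled by TF-Operator) submitted
# with the Kubernetes Python client. The namespace, image, and command are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-demo", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",  # TF-Operator expects this container name
                                "image": "tensorflow/tensorflow:2.4.0",
                                "command": ["python", "/opt/train.py"],
                            }
                        ]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)
```

The same pattern (group/version/plural plus a resource body) applies to the other CRD-based components, such as PyTorchJob or Katib Experiments.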
Above, I have made a systematic summary of Kubeflow components to help us have a basic understanding and overall grasp of each component. To strike while the iron is hot, let’s go through the architecture and workflow of each component in detail.
Jupyter Notebooks
Jupyter itself contains many components. For an individual user, JupyterLab + Notebook is sufficient, but that is not enough when treating Jupyter as an enterprise-level platform: there are many things to consider, such as multiple users, resource allocation, data persistence, data isolation, high availability, and permission control. These problems are exactly K8S's forte, so it makes sense to combine Jupyter with K8S. JupyterHub is a multi-user Jupyter portal that was designed from the beginning to handle multi-user creation, resource allocation, data persistence, and other functions as pluggable modules. Its working mechanism is shown in the figure below:
(Figure: JupyterHub working mechanism, covering user authentication via OS accounts or OAuth and single-user notebook servers spawned on K8S)
Kubeflow supports both single-user (solo) deployments and Multi-Tenancy in Kubeflow. In a Kubeflow installation, the default-editor ServiceAccount is assigned to every Jupyter notebook Pod, and this ServiceAccount is bound to the kubeflow-edit ClusterRole, which carries namespace-scoped permissions for many Kubernetes resources, including: Pod, Deployment, Service, Job, TFJob, and PyTorchJob.
Therefore, the above Kubernetes resources can be created directly from a Jupyter Notebook in Kubeflow. The notebook comes with the Kubernetes kubectl command-line tool preinstalled, which makes this quite straightforward. With the Jupyter Notebook bound to Kubeflow, you can use the Fairing library to submit training jobs as TFJobs. Training jobs can run on a single node or be distributed across the same Kubernetes cluster, but not inside the notebook Pod itself. Submitting jobs through the Fairing library gives data scientists a clear picture of processes such as Docker containerization and Pod allocation. Overall, Kubeflow-hosted Notebooks integrate better with the other components while keeping the notebook image extensible.
Pipelines
After Kubeflow v0.1.3, Pipelines became a core component of Kubeflow. The purpose of Kubeflow is mainly to simplify running machine learning tasks on Kubernetes, with the ultimate goal of a complete, usable pipeline that implements an end-to-end machine learning process from data to model. Pipelines is a workflow platform for building and deploying machine learning workflows, so in this sense it is no surprise that it is a core component of Kubeflow. Kubeflow Pipelines implements a workflow model: a workflow, or pipeline, can be thought of as a directed acyclic graph (DAG), and each node in the graph is called a component. Components carry the real logic, such as preprocessing, data cleaning, model training, and so on. Each component is responsible for a different function, but they have one thing in common: every component is packaged as a Docker image and runs as a container.
Key concepts in Pipelines include the experiment, the step, and the step output artifacts.
Pipelines architecture
Figure 2: Kubeflow Pipelines
- Python SDK: used to create Kubeflow Pipelines components in a domain-specific language (DSL).
- DSL compiler: converts the Python DSL code into a static YAML configuration file.
- Pipeline Web Server: the pipeline front-end service. It collects various data in order to display the list of currently running pipelines, pipeline execution history, and, for each pipeline run, its debugging information and execution status.
- Pipeline Service: the pipeline back-end service, which calls the K8S service to create pipeline runs from the YAML.
- Kubernetes Resources: the CRDs created to run the pipeline.
- Machine Learning Metadata Service: watches the Kubernetes resources created by the Pipeline Service and persists the state of these resources in the ML metadata service (it stores the input/output data exchanged between the task-flow containers).
- Artifact Storage: used to store Metadata and Artifacts. Kubeflow Pipelines stores metadata in a MySQL database and stores artifacts in an artifact store such as a MinIO server or Cloud Storage.
- Orchestration Controllers: task orchestration, such as the **Argo Workflow** controller, which coordinates task-driven workflows.
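To make the Python SDK and DSL compiler roles above more concrete, here is a minimal, hedged sketch of a two-step pipeline written with the kfp v1-style DSL; the pipeline name, images, and commands are placeholders, and newer kfp v2 releases use a different component API.

```python
# Minimal sketch of defining and compiling a pipeline with the Kubeflow Pipelines
# Python SDK (kfp v1-style DSL); component names, images, and commands are illustrative.
import kfp
from kfp import dsl

def preprocess_op():
    # A lightweight container step; the image and command are placeholders.
    return dsl.ContainerOp(
        name="preprocess",
        image="python:3.9",
        command=["python", "-c", "print('preprocessing data')"],
    )

def train_op():
    return dsl.ContainerOp(
        name="train",
        image="python:3.9",
        command=["python", "-c", "print('training model')"],
    )

@dsl.pipeline(name="demo-pipeline", description="Preprocess then train")
def demo_pipeline():
    preprocess = preprocess_op()
    train = train_op()
    train.after(preprocess)  # edge in the DAG: train depends on preprocess

if __name__ == "__main__":
    # The DSL compiler turns the Python definition into a static YAML spec.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```

Compiling produces the static YAML that the Pipeline Service then turns into a run.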
How Pipelines works
Defining a pipeline takes two steps. The first step is to define the components, which can be fully customized starting from an image. To customize a component: first, package a Docker image that carries the component's dependencies, since each component runs as a Docker container; second, define a Python function that describes the component's inputs and outputs, so that the pipeline understands the component's structure, such as how many input and output nodes it has (a sketch of this appears after the list below). The component can then be used just like an ordinary component. The second step is to compose the pipeline from the defined components; in the pipeline, the input/output relationships determine the edges of the graph and their directions. Once the pipeline is defined, it can be submitted to the system through the Python pipeline client and run. Using Kubeflow Pipelines is somewhat involved, but its implementation is not complicated. The entire architecture can be divided into five parts: the ScheduledWorkflow CRD and its Operator, the pipeline front end, the pipeline back end, the Python SDK, and the Persistence Agent.
- ScheduledWorkflow CRD: extends the Workflow definition from argoproj/argo. This is the core of the pipeline project and the part that actually does the work, creating the corresponding containers on Kubernetes in topological order to carry out the pipeline logic.
- Python SDK: responsible for constructing the pipeline and, based on it, generating the YAML definition of the ScheduledWorkflow, which is then passed as a parameter to the back-end service of the pipeline system.
- Back-end service: depends on a relational database (e.g. MySQL) and object storage (e.g. S3), and handles all CRUD requests for pipelines.
- Front end: responsible for visualizing the entire pipeline process, retrieving logs, launching new runs, and so on.
- Persistence Agent: responsible for syncing data from etcd behind the Kubernetes Master to the relational database of the back-end service. Its implementation is similar to a CRD operator: it uses an informer to listen to the corresponding resources on the Kubernetes apiserver.
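As mentioned above, the first step of defining a pipeline is describing a component's inputs and outputs with a Python function. The snippet below is a small sketch of that idea using the kfp v1 SDK's create_component_from_func; the function and base image are illustrative.

```python
# Sketch of wrapping a plain Python function as a pipeline component (kfp v1 SDK);
# the function body and base image are illustrative.
from kfp import components

def add(a: float, b: float) -> float:
    """Inputs and outputs declared via type annotations describe the component I/O."""
    return a + b

# Packages the function so it runs inside a container with the given base image.
add_op = components.create_component_from_func(add, base_image="python:3.9")
```

The returned factory (add_op here) is then used inside a @dsl.pipeline function just like the ContainerOp factories in the earlier sketch.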
Pipelines provides creation, scheduling, and management of machine learning workflows, as well as a Web UI. This part is implemented on top of Argo Workflow, and I believe it represents the future direction of Kubeflow's development.
Fairing
Kubeflow Fairing is a Python package that makes it easy to train and deploy ML models on Kubeflow. Fairing can also be extended to train or deploy on other platforms; it has already been extended to train on Google AI Platform. Fairing simplifies the process of building, training, and deploying machine learning (ML) training jobs in a hybrid cloud environment. By using Fairing and adding a few lines of code, you can run an ML training job with your Python code directly from a Jupyter Notebook, either locally or in the cloud. Once training is complete, you can use Fairing to deploy the trained model as a prediction endpoint. Within Kubeflow, Notebooks provide the Jupyter coding environment, Fairing packages code and builds images, Pipelines provide the workflow component, Katib handles hyperparameter tuning, and KFServing delivers deployment services.
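As a rough illustration of the "few lines of code" workflow, the snippet below uses Fairing's config API to package a local training function into an image and run it as a job on the cluster. The registry, base image, and training function are placeholders, and the available builder and deployer options depend on your Fairing version and environment, so treat this as a sketch rather than a canonical recipe.

```python
# Hedged sketch: running a local training function remotely with Kubeflow Fairing.
# DOCKER_REGISTRY, the base image, and train() are illustrative placeholders.
from kubeflow import fairing

DOCKER_REGISTRY = "registry.example.com/my-team"  # where the built image is pushed

def train():
    # Normally this would contain real model training code.
    print("training model...")

# Build an image by appending the code onto a base image, then run it as a K8s Job.
fairing.config.set_builder(
    "append",
    base_image="tensorflow/tensorflow:2.4.0",
    registry=DOCKER_REGISTRY,
    push=True,
)
fairing.config.set_deployer("job")

if __name__ == "__main__":
    remote_train = fairing.config.fn(train)  # wrap the function for remote execution
    remote_train()
```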
Katib
Before walking through how Katib works, let's first introduce Katib's components:
- Experiment Controller: provides lifecycle management for the Experiment CRD.
- Trial Controller: provides lifecycle management for the Trial CRD.
- Suggestions: deployed as a Deployment and exposed as a Service; provides the hyperparameter search service. Currently there are random search, grid search, Bayesian optimization, and so on.
- Katib Manager: a gRPC server that provides the operation interface to the Katib DB and acts as the agent between Suggestion and Experiment.
- Katib DB: the database, which stores Trials and Experiments as well as the training metrics of Trials. The current default database is MySQL.
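Before looking at the architecture, the sketch below shows what creating an Experiment can look like from the client side: a minimal Experiment custom resource (random search over a single learning-rate parameter) submitted with the Kubernetes Python client. It assumes the Katib v1beta1 API; the namespace, objective metric, and parameter range are illustrative, and the trialTemplate that defines each Trial's training job is omitted for brevity.

```python
# Sketch: submitting a Katib Experiment CRD with the Kubernetes Python client.
# The spec below is a minimal, illustrative random-search example (Katib v1beta1).
from kubernetes import client, config

config.load_kube_config()

experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-search-demo", "namespace": "kubeflow"},
    "spec": {
        "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 6,
        "parallelTrialCount": 2,
        "parameters": [
            {
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.01", "max": "0.1"},
            }
        ],
        # trialTemplate omitted here; it would define the training job run for each Trial.
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1beta1",
    namespace="kubeflow",
    plural="experiments",
    body=experiment,
)
```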
Katib architecture
How Katib works
When an Experiment is created, the Experiment Controller uses Katib Manager to create an Experiment object in the Katib DB, and a Finalizer is used to indicate that the object holds external resources (the database). Based on its own state and the defined degree of parallelism, the Experiment Controller then asks the Manager to fetch hyperparameter values from the gRPC interface provided by Suggestion and forward them back to the Experiment Controller; in this process, Katib Manager acts as the agent that brokers the Experiment Controller's request to Suggestion. Once the hyperparameters are obtained, the Experiment Controller constructs a Trial definition from the Trial Template and the hyperparameters, and creates it in the cluster. After the Trial is created, similar to the behavior of the Experiment Controller, the Trial Controller creates a Trial object in the Katib DB through Katib Manager, constructs the expected jobs (such as a batch/v1 Job, TFJob, PyTorchJob, etc.) and the Metrics Collector job, and creates them on the cluster. After these jobs finish running, the Trial Controller updates the status of the Trial, the Experiment Controller updates the status of the Experiment, and the Experiment moves on to its next iteration. The Trials so far have been trained and their training metrics collected; based on its configuration, the Experiment decides whether to create a new Trial, and if so the previous process repeats. The figure below is a comparison of Katib with competing AutoML products, found online (Zhihu @goce), for your reference:
(Figure: comparison of Katib with other AutoML frameworks)
KFServing
For the productization of deep learning, training is only a means, not an end in itself. The purpose is to put the models produced by training into mobile apps or Internet applications for speech or text recognition and other application scenarios.
KubeFlow provides several tools for model serving: TF Serving, KFServing, and Seldon Core Serving. TensorFlow Serving (used together with TFJob) covers TensorFlow models only, whereas KFServing and Seldon Core Serving are generic, multi-framework serving platforms on Kubernetes. KFServing is developed within the Kubeflow project itself, while Seldon Core Serving is an external project that Kubeflow integrates. The discussion below focuses on KFServing.
KFServing provides a Kubernetes CRD for serving machine learning models behind a unified interface. It supports multiple frameworks out of the box, including Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX, as well as the NVIDIA Triton Inference Server, and exposes prediction endpoints over REST and GRPC. The NVIDIA Triton Inference Server (formerly the TensorRT Inference Server) can serve TensorRT, TensorFlow, Pytorch, ONNX, and Caffe2 models; within KFServing, Triton can be plugged in as the predictor for these model formats, so KFServing and the NVIDIA Triton Inference Server complement rather than compete with each other.
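As a small illustration of this CRD-based interface, the sketch below creates an InferenceService for a scikit-learn model via the Kubernetes Python client. It assumes the KFServing v1beta1 API (serving.kubeflow.org/v1beta1) and uses the publicly documented sample model URI; the namespace and the exact API group/version depend on the installed release.

```python
# Sketch: deploying a scikit-learn model with a KFServing InferenceService CRD.
# Assumes the serving.kubeflow.org/v1beta1 API; newer releases (KServe) use a
# different API group, so adjust group/version to match your installation.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "kubeflow"},
    "spec": {
        "predictor": {
            "sklearn": {
                # Model artifacts are pulled from object storage by the model server.
                "storageUri": "gs://kfserving-samples/models/sklearn/iris"
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org",
    version="v1beta1",
    namespace="kubeflow",
    plural="inferenceservices",
    body=inference_service,
)
```

Once the InferenceService is ready, predictions can be sent to its REST endpoint; the Triton, TensorFlow, PyTorch, and other predictors listed above are configured the same way by swapping the predictor section.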
In addition to the components covered above, the Kubeflow ecosystem also includes kubeflow/tf-operator (TF-Operator) for TensorFlow training jobs, the Metadata component, and closely related infrastructure such as Prometheus (monitoring), Argo (the workflow engine behind Pipelines), and Istio (the service mesh Kubeflow relies on), all of which are worth exploring further.