The scalability of Kubernetes: sig-scalability, SLO, ClusterLoader2
This article analyzes ClusterLoader2 (cl2) at the source level and documents the long road of pits I stepped into. First of all, I'm using the release-1.15 branch of cl2, not the latest master branch, because my K8S version is 1.15.5.
What exactly does the config file mean?
After cloning the perf-tests repository and entering the clusterloader2 directory, the following sentence comes into view:
To run ClusterLoader type:
go run cmd/clusterloader.go --kubeconfig=kubeConfig.yaml --testconfig=config.yaml
When you open config.yaml, I'm sure most people will be dumbfounded. And if you built your Kubernetes cluster yourself, there is a high probability that CL2 will not run smoothly. So before we step into the pits, we first need to understand: what exactly does the config file describe?
The config file describes which phases of testing cl2 performs, which actions are executed in each phase, and which data are collected. To save space, I will not repeat the design doc here; after reading it, you should understand the structure very well. Below is the flow of the config file used in the density test, which can be used as a template for testing scheduler performance.
The parameters in density-config-local.yaml are complicated. You need to know the template grammar first so that you can adjust them.
Here is the grammar you must know:
1. {{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}} means the parameter DENSITY_RESOURCE_CONSTRAINTS_FILE defaults to "" if it is not set; you can set it manually to override the default value.
2. {{$MIN_LATENCY_PODS := 300}} simply sets the parameter to 300.
3. {{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}} means the number of namespaces is equal to floor(nodes / nodes_per_namespace). NOTE that .Nodes MUST NOT be less than .NODES_PER_NAMESPACE.
4. {{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}} is similar to rule 3, but multiplies the parameters.
5. {{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 1200}} means max(saturationDeploymentTimeout, 1200).
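In Go terms, the helpers behave roughly like this (my own paraphrase of the semantics described above, not cl2's actual implementation, which operates on template values):

// divideInt mirrors DivideInt: integer (floor) division for positive values,
// so DivideInt .Nodes $NODES_PER_NAMESPACE is floor(nodes / nodesPerNamespace).
func divideInt(a, b int) int { return a / b }

// multiplyInt mirrors MultiplyInt.
func multiplyInt(a, b int) int { return a * b }

// maxInt mirrors MaxInt, e.g. MaxInt $saturationDeploymentTimeout 1200.
func maxInt(a, b int) int {
    if a > b {
        return a
    }
    return b
}

// defaultParam mirrors DefaultParam: use the override if the parameter was
// set, otherwise fall back to the given default.
func defaultParam(value, fallback interface{}) interface{} {
    if value == nil {
        return fallback
    }
    return value
}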
Then you must be familiar with the procedure so that you know what each parameter means. There’s no silver bullet.
Below the parameters are the procedures of the testing. I’ll explain them step by step.
name: density
automanagedNamespaces: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    qps: 5
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}
Don't mind name, automanagedNamespaces and tuningSets; in most cases you don't need to care about them. The chaosMonkey feature (named after the open-source tool developed by Netflix) tests the robustness of your system by shutting down nodes at random and creating jitter. It is not enabled by default.
steps:
- name: Starting measurements
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start
steps lists the procedures you define. Each step may contain phases and measurements. A measurement defines what you want to monitor or capture; a phase describes the attributes of a particular batch of work. This config defines the following steps:
- Starting measurements: don’t care about what happens during preparation.
- Starting saturation pod measurements: same as above
- Creating saturation pods: the first case is saturation pods
- Collecting saturation pod measurements
- Starting latency pod measurements
- Creating latency pods: the second case is latency pods
- Waiting for latency pods to be running
- Deleting latency pods
- Waiting for latency pods to be deleted
- Collecting pod startup latency
- Deleting saturation pods
- Waiting for saturation pods to be deleted
- Collecting measurements
So we can see the test mainly gathers measurements during the CRUD of saturation pods and latency pods:
- saturation pods: pods in deployments with a rather large number of replicas
- latency pods: pods in deployments with a single replica
So you can see the difference between the two modes: when saturation pods are created, the replica controller in kube-controller-manager handles a single event, whereas with latency pods it handles hundreds of events. Why does that matter? Because the various rate limiters inside Kubernetes affect the performance of the scheduler and controller-manager differently in the two cases.
In each case, what we care about is the number of pods, deployments and namespaces. Kubernetes limits pods per node and pods per namespace, so it's essential to adjust the relevant parameters to achieve a reasonable load.
latency pods
Follow my math:
- latency pods = namespaces * latencyReplicas
- namespaces = nodes / nodes per namespace
- nodes = the number of available Kubernetes nodes in your cluster
- nodes per namespace is $NODES_PER_NAMESPACE in line 8 of the config
- latencyReplicas = max(MIN_LATENCY_PODS, nodes) / namespaces
- MIN_LATENCY_PODS is $MIN_LATENCY_PODS in line 18 of the config
saturation pods
Follow me:
- saturation pods = namespaces * pods per namespace; this formula can be found in the "Creating saturation pods" step
- pods per namespace = pods per node * nodes per namespace
- pods per node is $PODS_PER_NODE in line 9 of the config
- the calculation of namespaces and nodes per namespace is the same as in the latency pods part above
A worked example with concrete numbers is sketched right after this list.
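To make the formulas concrete, here is a small sketch that reproduces the math for a hypothetical 4-node local cluster (the input numbers are illustrative assumptions, not values taken from the config; plug in your own cluster size and parameters):

package main

import "fmt"

// maxInt mirrors the MaxInt template helper.
func maxInt(a, b int) int {
    if a > b {
        return a
    }
    return b
}

func main() {
    // Illustrative inputs; replace with your own values.
    nodes := 4             // available Kubernetes nodes in the cluster
    nodesPerNamespace := 4 // $NODES_PER_NAMESPACE, set equal to nodes on a small local cluster
    podsPerNode := 30      // $PODS_PER_NODE
    minLatencyPods := 300  // $MIN_LATENCY_PODS

    // namespaces = floor(nodes / nodes per namespace)
    namespaces := nodes / nodesPerNamespace

    // latency pods = namespaces * latencyReplicas,
    // latencyReplicas = max(MIN_LATENCY_PODS, nodes) / namespaces
    latencyReplicas := maxInt(minLatencyPods, nodes) / namespaces
    latencyPods := namespaces * latencyReplicas

    // saturation pods = namespaces * pods per namespace,
    // pods per namespace = pods per node * nodes per namespace
    podsPerNamespace := podsPerNode * nodesPerNamespace
    saturationPods := namespaces * podsPerNamespace

    fmt.Println("namespaces:", namespaces)          // 1
    fmt.Println("latency pods:", latencyPods)       // 300
    fmt.Println("saturation pods:", saturationPods) // 120
}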
It's quite complicated. You have to be patient to figure out what is really going on. Here are some tips and rules of thumb:
- When testing on a local cluster, since the scale is small, we can set nodes per namespace = nodes so that there is only one namespace. It simplifies the math.
- When testing on kubemark, we can simulate hundreds of nodes, so it's better to have 2 or more namespaces.
- The pod startup latency measurement only applies to latency pods, not saturation pods, although it reports the metric during the saturation pods phase as well.
Now you can set the parameters and run the test. After a while (usually 5~10 minutes for a local cluster test), you can check out the results.
Why doesn’t this thing work out of the box?
I installed my Kubernetes cluster myself, without tools like GCE or kubeadm, so I hit a lot of pits, hacked on CL2 quite a bit, and barely managed to get through the test.
Here are the pits I encountered, with source-level analysis.
SSH issue
CL2 needs to SSH to the master node to collect certain data. For example, see pkg/measurement/common/simple/scheduler_latency.go:
cmd := "curl -X " + opUpper + " http://localhost:10251/metrics"
sshResult, err := measurementutil.SSH(cmd, host+":22", provider)
When you run CL2 in an environment that can't SSH to the master, the program doesn't terminate; it just keeps printing error logs. Therefore, you need to make sure the machine running the test can reach the cluster's master nodes over passwordless SSH.
What username does it use for SSH? The username of your current account, and it cannot be overridden with any flag. When I tested on the corporate cluster, my personal account did not have SSH permission, so I had to run it as root. If you have a K8S cluster on your personal computer, you may be able to avoid this problem.
dependency installation issues
This refers primarily to the probes and Prometheus components. For a cl2 run you can choose whether or not to install the Prometheus stack. In cmd/clusterloader.go there is the following code:
func initFlags() {
    flags.StringVar(&clusterLoaderConfig.ReportDir, "report-dir", "", "Path to the directory where the reports should be saved. Default is empty, which cause reports being written to standard output.")
    flags.BoolEnvVar(&clusterLoaderConfig.EnablePrometheusServer, "enable-prometheus-server", "ENABLE_PROMETHEUS_SERVER", false, "Whether to set-up the prometheus server in the cluster.")
    flags.BoolEnvVar(&clusterLoaderConfig.TearDownPrometheusServer, "tear-down-prometheus-server", "TEAR_DOWN_PROMETHEUS_SERVER", true, "Whether to tear-down the prometheus server after tests (if set-up).")
    flags.StringArrayVar(&testConfigPaths, "testconfig", []string{}, "Paths to the test config files")
    flags.StringArrayVar(&clusterLoaderConfig.TestOverridesPath, "testoverrides", []string{}, "Paths to the config overrides file. The latter overrides take precedence over changes in former files.")
    initClusterFlags()
}
The Prometheus stack and probes are not installed by default and can be managed by the user. Note:
- If enable-prometheus-server is false, the tear-down-prometheus-server flag has no effect.
- Make sure your own Prometheus installation is consistent with what cl2 expects. cl2's Prometheus YAML files are stored in pkg/prometheus/manifests; they use the Prometheus Operator, and key information such as ports and namespaces is not changed. So if you also install with the Prometheus Operator, chances are you'll be able to run CL2 without getting stuck on Prometheus.
- If you let CL2 install Prometheus, here comes the pit! CL2 reads $GOPATH to find the Prometheus manifests. In other words, it assumes you are not using Go modules and that the system $GOPATH is not empty. In practice we probably do use Go modules, and on many systems echo $GOPATH prints nothing! So I recommend installing Prometheus and the probes yourself.
- Prometheus and the probes must be installed, or the cl2 tests cannot run.
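Concretely, if you install Prometheus yourself, you just leave the flag at its default; the flag name comes from the code above, and the rest of the invocation follows the README example (adjust paths to your setup):
go run cmd/clusterloader.go --kubeconfig=kubeConfig.yaml --testconfig=config.yaml --enable-prometheus-server=false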
metrics grabber issue
In my cluster, kube-scheduler, kube-controller-manager, kube-apiserver, kube-proxy and etcd are all deployed as binaries, and etcd and the kubelet forbid plain HTTP access. This caused runtime errors all the time. See pkg/measurement/common/simple/etcd_metrics.go:
// In https://github.com/kubernetes/kubernetes/pull/74690, mTLS is enabled for etcd server
// http://localhost:2382 is specified to bypass TLS credential requirement when checking
// etcd /metrics and /health.
if samples, err := e.sshEtcdMetrics("curl http://localhost:2382/metrics", host, provider); err == nil {
return samples, nil
}
// Use old endpoint if new one fails.
return e.sshEtcdMetrics("curl http://localhost:2379/metrics", host, provider)
You then need to check, one by one, how metrics are collected for the other components in that folder to make sure they are consistent with your environment.
In addition, there is another deep pit. In pkg/measurement/common/simple/metrics_for_e2e.go, a grabber is created to grab the metrics of the related components:
grabber, err := metrics.NewMetricsGrabber(
    config.ClusterFramework.GetClientSets().GetClient(),
    nil, /*external client*/
    grabMetricsFromKubelets,
    true, /*grab metrics from scheduler*/
    true, /*grab metrics from controller manager*/
    true, /*grab metrics from apiserver*/
    false /*grab metrics from cluster autoscaler*/)
The grabber comes from the vendored package vendor/k8s.io/kubernetes/test/e2e/framework/metrics/metrics_grabber.go. Digging into that package, we find that by default it assumes each component is deployed as a pod when it scrapes it. See vendor/k8s.io/kubernetes/test/e2e/framework/metrics/metrics_grabber.go:
func (g *MetricsGrabber) GrabFromScheduler() (SchedulerMetrics, error) {
    if !g.registeredMaster {
        return SchedulerMetrics{}, fmt.Errorf("Master's Kubelet is not registered. Skipping Scheduler's metrics gathering.")
    }
    output, err := g.getMetricsFromPod(g.client, fmt.Sprintf("%v-%v", "kube-scheduler", g.masterName), metav1.NamespaceSystem, ports.InsecureSchedulerPort)
    if err != nil {
        return SchedulerMetrics{}, err
    }
    return parseSchedulerMetrics(output)
}
If you deploy the K8S components as binaries, you need to modify the vendored package. The worst part is that if you use the master branch, it has already switched to Go modules, and you have to fork the package and rewrite it all…
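For reference, the kind of hack I ended up with looks roughly like this: instead of getMetricsFromPod, curl the scheduler's insecure metrics port over SSH, the same way scheduler_latency.go does above. This is a sketch under assumptions: the function name and error handling are mine, I assume the scheduler serves metrics on localhost:10251 as in the earlier curl command, and I assume measurementutil.SSH returns a result with Code/Stdout/Stderr fields; it would live next to the other measurements in pkg/measurement/common/simple, where fmt and measurementutil are already imported.

// grabSchedulerMetricsViaSSH fetches the scheduler's /metrics endpoint by
// SSHing to the master host, which works for binary-deployed components
// that do not run as pods in kube-system.
func grabSchedulerMetricsViaSSH(host, provider string) (string, error) {
    cmd := "curl http://localhost:10251/metrics"
    sshResult, err := measurementutil.SSH(cmd, host+":22", provider)
    if err != nil {
        return "", fmt.Errorf("ssh to %s failed: %v", host, err)
    }
    if sshResult.Code != 0 {
        return "", fmt.Errorf("curl failed with code %d: %s", sshResult.Code, sshResult.Stderr)
    }
    // sshResult.Stdout holds the raw Prometheus text exposition; parse it the
    // same way the vendored grabber parses the pod-based output.
    return sshResult.Stdout, nil
}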
master node issue
CL2 automatically determines which node is the master node; the logic is in vendor/k8s.io/kubernetes/pkg/util/system/system_utils.go:
// TODO: find a better way of figuring out if given node is a registered master.
// IsMasterNode checks if it's a master node, see http://gitlab.bj.sensetime.com/xialei1/perf-tests/issues/4
func IsMasterNode(node corev1.Node) bool {
// We are trying to capture "master(-...)?$" regexp.
// However, using regexp.MatchString() results even in more than 35%
// of all space allocations in ControllerManager spent in this function.
// That's why we are trying to be a bit smarter.
name := node.Name
if strings.HasSuffix(name, "master") {
return true
}
return false
}
It decides whether a node is a master node by the suffix "master". Kubeadm, after installation, adds the node-role.kubernetes.io/master='' label to the master node, and other K8S installers don't necessarily name the master node that way either. I have raised this with the community, see perf-tests #1191.
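A label-based check would sidestep the naming assumption. A minimal sketch, assuming a kubeadm-installed cluster where the master carries the node-role.kubernetes.io/master label (my own illustration, not the upstream fix):

import corev1 "k8s.io/api/core/v1"

// isMasterNodeByLabel checks the kubeadm node-role label instead of relying
// on the node name ending in "master".
func isMasterNodeByLabel(node corev1.Node) bool {
    _, ok := node.Labels["node-role.kubernetes.io/master"]
    return ok
}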
scheduler throughput issue
This is an issue I raised with the community earlier, and it has been fixed on the master branch; see perf-tests #1083. In short, the code incorrectly used average throughput as the metric, while maximum throughput is what has always actually been used as the metric.
For the relevant code, see pkg/measurement/common/simple/scheduler_throughput.go:
type schedulingThroughput struct {
    Average float64 `json:"average"`
    Perc50  float64 `json:"perc50"`
    Perc90  float64 `json:"perc90"`
    Perc99  float64 `json:"perc99"`
}
There should actually be a Max value.
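The shape of the change I had in mind is roughly the following (a sketch only; see perf-tests #1083 for what actually landed on master):

type schedulingThroughput struct {
    Average float64 `json:"average"`
    Max     float64 `json:"max"`
    Perc50  float64 `json:"perc50"`
    Perc90  float64 `json:"perc90"`
    Perc99  float64 `json:"perc99"`
}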
What exactly are these indicators?
After numerous hacks and debugging sessions, CL2 finally ran. Once it ran, a new question came up: what exactly is each of these metrics? How do I locate bottlenecks from them?
Most CL2 metrics are end-to-end from the user's point of view. Each metric spans many processes and components, which makes it hard to pin down a specific bottleneck. So it is necessary to sort out exactly which process each metric covers: where it starts and where it ends.
There are currently three official metrics:
- mutating API latency
- read-only API latency
- pod startup latency
pod startup latency
This one is relatively complex. The source code is in pkg/measurement/common/slos/pod_startup_latency.go. It is divided into two phases, start and gather, and the results are recorded in podStartupEntries, a map[string]map[string]time structure that records the timestamp of each stage of each pod.
In the start phase, an informer is started to watch pods. When a pod is in the running state and no record for it is found in podStartupEntries, the pod is recorded in podStartupEntries:
- watchPhase: time.Now() when the informer sees the running pod
- createPhase: the pod's creationTimestamp
- runPhase: the timestamp at which the pod's containers reached the running state
In the gather phase, the pod watch is stopped. All events are then iterated over. The scheduler logs an event after scheduling a pod, like this:
4d17h Normal Scheduled pod/sensestar-test-gvl92 Successfully assigned default/sensestar-test-gvl92 to sh-idc1-10-5-8-62
The schedulePhase of every recorded pod is set to the timestamp of its Scheduled event. The following metrics are then summarized:
- create_to_schedule: timestamp of the Scheduled event minus the pod's creationTimestamp
- schedule_to_run: time the pod's containers reached running minus the timestamp of the Scheduled event
- run_to_watch: time the informer saw the running pod minus the time the containers reached running
- schedule_to_watch: time the informer saw the running pod minus the timestamp of the Scheduled event
- pod_startup: time the informer saw the running pod minus the pod's creationTimestamp
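To make the arithmetic concrete, here is a small sketch of how the five durations fall out of the recorded timestamps (the nested-map layout mirrors the map[string]map[string]time description above, but the key names and sample numbers are my own illustration, not the exact identifiers in pod_startup_latency.go):

package main

import (
    "fmt"
    "time"
)

// phaseTimes maps a phase name to its timestamp for a single pod,
// mirroring the inner map of podStartupEntries.
type phaseTimes map[string]time.Time

// startupLatencies derives the five reported durations for one pod.
func startupLatencies(p phaseTimes) map[string]time.Duration {
    return map[string]time.Duration{
        "create_to_schedule": p["schedule"].Sub(p["create"]),
        "schedule_to_run":    p["run"].Sub(p["schedule"]),
        "run_to_watch":       p["watch"].Sub(p["run"]),
        "schedule_to_watch":  p["watch"].Sub(p["schedule"]),
        "pod_startup":        p["watch"].Sub(p["create"]),
    }
}

func main() {
    t0 := time.Now()
    p := phaseTimes{
        "create":   t0,                              // pod creationTimestamp
        "schedule": t0.Add(200 * time.Millisecond),  // Scheduled event
        "run":      t0.Add(1500 * time.Millisecond), // containers running
        "watch":    t0.Add(1600 * time.Millisecond), // informer sees the running pod
    }
    for name, d := range startupLatencies(p) {
        fmt.Printf("%s: %v\n", name, d)
    }
}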
This breakdown looks a little odd, but it splits out the latency of each phase of the pod creation process. The kubelet metric kubelet_pod_start_duration_seconds can also be queried from Prometheus:
histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[1h])) by (le))
A few extra words about the process of creating a pod that belongs to a Deployment:
- Apiserver receives the request to create a Deployment, stores it to etcd, and notifies the controller-manager
- Controller-manager creates the pod object, fills in creationTimestamp, and sends the create request to apiserver
- Apiserver receives the request to create the pod, writes it to etcd, and pushes it to the scheduler
- The scheduler selects a node, populates nodeName, and updates the pod through apiserver. The pod is pending; nothing has actually been created on a node yet
- Apiserver updates the pod in etcd and pushes it to the kubelet on the chosen node
- Kubelet starts creating the pod, fills in hostIP and resourceVersion, and sends an update request to apiserver; the pod is still pending
- Apiserver updates the pod in etcd while kubelet continues creating the pod. When the containers are running, kubelet sends another pod update request to apiserver, and the pod becomes running
- Apiserver receives the request, updates etcd, and pushes the update to the informer, which records the watchPhase
Mutating API and Readonly API
Cl2 query statement:
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{resource!="events", verb!~"WATCH|WATCHLIST|PROXY|proxy|CONNECT"}[20m])) by (resource, subresource, verb, scope, le))
It is scraped from apiserver, and its meaning is:
Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
That is, it covers the interval from the moment apiserver receives the request until the reply is sent. The read-only API involves only etcd; the mutating API may involve other components and is not discussed here.
etcd metrics
An extra section on etcd is added here because etcd is the focus of our K8S performance tuning. The etcd part of the results is pretty clear; I just want to emphasize that…
ClusterLoader2: impressions after stepping through the pits
What I think CL2 does well:
- It models the test process
- The collected metrics are fairly comprehensive
What I think CL2 does poorly:
- A fairly steep learning curve, and it is extremely unfriendly to users who install their own K8S clusters…
- Limited introductory documentation
- Branch management is a bit chaotic