Kubernetes scalability: sig-scalability, SLO, ClusterLoader2

This article analyzes ClusterLoader2 at the source level and documents the long trail of pits I stepped in along the way. First of all, I am using the release-1.15 branch of cl2, not the latest master branch, because my Kubernetes version is 1.15.5.

What exactly does the config file mean?

After cloning the perf-tests repository and entering the clusterloader2 directory, the following sentence comes into view:

To run ClusterLoader type:

go run cmd/clusterloader.go --kubeconfig=kubeConfig.yaml --testconfig=config.yaml

When you open config.yaml, I'm sure most people will be dumbfounded. And if you built your Kubernetes cluster yourself, there is a high probability that cl2 will not run smoothly. So before stepping into the pits, we first need to answer: what exactly is this configuration file?

This configuration file describes which phases the cl2 test goes through, which actions are performed in each phase, and which data are collected. To save space, I will not repeat the design doc here; after reading it, you should understand the structure quite well. Below is the flow of the config file for the density test, which can be used as a template for testing scheduler performance.

The parameters in density-config-local.yaml are complicated. You must first understand the template syntax before you can adjust them.

Here is the syntax you must know (a plain-Go reading of these helpers follows the list):

  1. {{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}} means the parameter DENSITY_RESOURCE_CONSTRAINTS_FILE defaults to "" if it is not set. You can set it manually to override the default value.
  2. {{$MIN_LATENCY_PODS := 300}} simply sets the parameter to 300.
  3. {{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}} means the number of namespaces equals floor(nodes / nodes_per_namespace). NOTE that .Nodes MUST NOT be less than .NODES_PER_NAMESPACE.
  4. {{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}} is similar to rule 3, but multiplies the parameters.
  5. {{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 1200}} means max(saturationDeploymentTimeout, 1200).
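
If it helps, here is how I read those template helpers in plain Go. This is only a sketch of my understanding, not cl2 code; the point to remember is that DivideInt is integer (floor) division:

package main

import "fmt"

// Plain-Go reading of the template helpers used above (illustration only).
func defaultParam(value, def string) string { // DefaultParam: use the override if set, otherwise the default
	if value == "" {
		return def
	}
	return value
}

func divideInt(a, b int) int   { return a / b } // DivideInt: integer (floor) division
func multiplyInt(a, b int) int { return a * b } // MultiplyInt
func maxInt(a, b int) int {                     // MaxInt
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(defaultParam("", "none")) // "none"
	fmt.Println(divideInt(3, 4))          // 0 -- why .Nodes must not be less than .NODES_PER_NAMESPACE
	fmt.Println(multiplyInt(30, 4))       // 120
	fmt.Println(maxInt(600, 1200))        // 1200
}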

Then you must be familiar with the procedure so that you know what each parameter means. There’s no silver bullet.

Below the parameters comes the test procedure itself. I'll explain it step by step.

name: density
automanagedNamespaces: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    qps: 5
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}

Don't worry about name, automanagedNamespaces, and tuningSets; in most cases you don't need to care about them. Chaos Monkey is open-source software from Netflix that tests the robustness of your system by shutting down nodes at random and injecting jitter. It is disabled by default.

steps:
- name: Starting measurements
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start

steps is the procedure you define. Each step may contain phases and measurements. A measurement defines what you want to monitor or capture; a phase describes the attributes of a particular set of tasks. This config defines the following steps:

  1. Starting measurements: don’t care about what happens during preparation.
  2. Starting saturation pod measurements: same as above
  3. Creating saturation pods: the first case is saturation pods
  4. Collecting saturation pod measurements
  5. Starting latency pod measurements
  6. Creating latency pods: the second case is latency pods
  7. Waiting for latency pods to be running
  8. Deleting latency pods
  9. Waiting for latency pods to be deleted
  10. Collecting pod startup latency
  11. Deleting saturation pods
  12. Waiting for saturation pods to be deleted
  13. Collecting measurements

So we can see the testing mainly gathers measurements during the CRUD of saturation pods and latency pods:

  • saturation pods: pods in deployments with quite a large number of replicas
  • latency pods: pods in deployments with a single replica

Now you can see the difference between the two modes: when saturation pods are created, the replica controller in kube-controller-manager handles a single event, whereas with latency pods it is hundreds of events. Why does that matter? Because the various rate limiters inside Kubernetes affect the performance of the scheduler and controller-manager differently in the two cases.

In each case, what we care about is the number of pods, deployments, and namespaces. We all know Kubernetes limits pods per node and pods per namespace, so it is essential to adjust the relevant parameters to achieve a reasonable load.

latency pods

Follow my math (a worked numeric example comes after the saturation pods list below):

  • latency pods = namespaces * latencyReplicas
  • namespaces = nodes / nodes per namespace
  • nodes = the number of available Kubernetes nodes in your cluster
  • nodes per namespace is $NODES_PER_NAMESPACE in line 8
  • latencyReplicas = max(MIN_LATENCY_PODS, nodes) / namespaces
  • MIN_LATENCY_PODS is $MIN_LATENCY_PODS in line 18

saturation pods

Follow me:

  • saturation pods = namespaces * pods per namespace; this formula can be found in the Creating saturation pods step
  • pods per namespace = pods per node * nodes per namespace
  • pods per node is $PODS_PER_NODE in line 9
  • see the calculation of namespaces and nodes per namespace in the latency pods section above
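
To make the math concrete, here is a worked example under assumed inputs (the numbers are mine, purely for illustration):

package main

import "fmt"

func main() {
	// Assumed inputs, purely for illustration.
	nodes := 100            // .Nodes: available Kubernetes nodes
	nodesPerNamespace := 50 // $NODES_PER_NAMESPACE
	podsPerNode := 30       // $PODS_PER_NODE
	minLatencyPods := 300   // $MIN_LATENCY_PODS

	namespaces := nodes / nodesPerNamespace             // 2
	podsPerNamespace := podsPerNode * nodesPerNamespace // 1500

	// latencyReplicas = max(MIN_LATENCY_PODS, nodes) / namespaces
	latencyBase := minLatencyPods
	if nodes > latencyBase {
		latencyBase = nodes
	}
	latencyReplicas := latencyBase / namespaces // 150

	fmt.Println("saturation pods:", namespaces*podsPerNamespace) // 3000 (= nodes * podsPerNode)
	fmt.Println("latency pods:   ", namespaces*latencyReplicas)  // 300
}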

It's quite convoluted; you have to be patient to figure out what is really going on. Here are some tips and rules of thumb:

  1. When testing on a local cluster, since the scale is small, we can set nodes per namespace = nodes so that there is only one namespace. That simplifies the math.
  2. When testing on kubemark, we can simulate hundreds of nodes, so it is better to have 2 or more namespaces.
  3. The pod startup latency measurement only applies to latency pods, not saturation pods, even though it reports the metric during the saturation pod phase as well.

Now you can set the parameters and run the test. After a while (usually 5-10 minutes for a local cluster test), you can check out the results.

Why doesn’t this thing work out of the box?

I installed my Kubernetes cluster by myself, without tools like GCE or kubeadm, so I hit a lot of potholes, made a lot of hacks to cl2, and barely managed to get through the tests.

Here are the pits I encountered, with source-level analysis.

SSH issue

cl2 needs passwordless SSH access to the master node to collect certain data. See, for example, pkg/measurement/common/simple/scheduler_latency.go:

cmd := "curl -X " + opUpper + " http://localhost:10251/metrics"
sshResult, err := measurementutil.SSH(cmd, host+": 22", provider)

When you run cl2 in an environment that cannot SSH to the master, the program does not terminate; it just keeps printing error logs. So we need to make sure the machine running the test has passwordless SSH access to the cluster nodes.

Which username is used for SSH? The username of your current account, and it cannot be overridden by any flag. When I tested on the corporate cluster, my personal account did not have SSH permission, so I had to run cl2 as root. If your Kubernetes cluster is on your personal machine, you may be able to avoid this problem.
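
Before running cl2 it is worth checking passwordless SSH (and the scheduler metrics endpoint) by hand. Here is a quick sketch using the system ssh binary; this is not what cl2 does internally, and the user/IP are placeholders:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	master := "root@10.0.0.1" // placeholder: SSH user and master node address

	// BatchMode=yes makes ssh fail immediately instead of prompting for a
	// password -- exactly the situation cl2 cannot handle.
	out, err := exec.Command("ssh", "-o", "BatchMode=yes", master,
		"curl -s http://localhost:10251/metrics | head -n 5").CombinedOutput()
	if err != nil {
		fmt.Println("passwordless SSH / scheduler metrics check failed:", err)
	}
	fmt.Print(string(out))
}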

dependency installation issues

This refers primarily to the probes and Prometheus components. For a cl2 test you can choose whether or not to have cl2 install the Prometheus stack. In cmd/clusterloader.go there is the following code:

func initFlags() {
	flags.StringVar(&clusterLoaderConfig.ReportDir, "report-dir", "", "Path to the directory where the reports should be saved. Default is empty, which cause reports being written to standard output.")
	flags.BoolEnvVar(&clusterLoaderConfig.EnablePrometheusServer, "enable-prometheus-server", "ENABLE_PROMETHEUS_SERVER", false, "Whether to set-up the prometheus server in the cluster.")
	flags.BoolEnvVar(&clusterLoaderConfig.TearDownPrometheusServer, "tear-down-prometheus-server", "TEAR_DOWN_PROMETHEUS_SERVER", true, "Whether to tear-down the prometheus server after tests (if set-up).")
	flags.StringArrayVar(&testConfigPaths, "testconfig", []string{}, "Paths to the test config files")
	flags.StringArrayVar(&clusterLoaderConfig.TestOverridesPath, "testoverrides", []string{}, "Paths to the config overrides file. The latter overrides take precedence over changes in former files.")
	initClusterFlags()
}

Prometheus Stack and probes are not installed by default and can be managed by users. Note:

  1. If enable-prometheus-server is false, the tear-down-prometheus-server flag has no effect.
  2. Make sure your own Prometheus installation is consistent with what cl2 expects. cl2's Prometheus YAML files are stored in pkg/prometheus/manifests; they use the Prometheus Operator and leave key information such as ports and namespaces unchanged. So if you also install with the Prometheus Operator, chances are you will be able to run cl2 without getting stuck on Prometheus.
  3. If you let cl2 install Prometheus, here comes the pit: cl2 reads $GOPATH to find the Prometheus manifests. In other words, it assumes you are not using Go modules and that $GOPATH is not empty. In practice we probably do use Go modules, and on many systems echo $GOPATH prints nothing at all! So I recommend installing Prometheus and the probes yourself (see the example invocation after this list).
  4. Prometheus and the probes must be installed, or the cl2 tests cannot run.
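
For reference, a run that relies on a Prometheus stack you installed yourself might look like the following; the flag names come from the code above, and the file paths are placeholders (enable-prometheus-server already defaults to false, so passing it is only for clarity):

go run cmd/clusterloader.go --kubeconfig=kubeConfig.yaml --testconfig=config.yaml --enable-prometheus-server=false --report-dir=./report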

metrics grabber issue

In my cluster, kube-scheduler, kube-controller-manager, kube-apiserver, kube-proxy, and etcd are all deployed as binaries, and etcd and the kubelet forbid HTTP access. This results in runtime errors all the time. See pkg/measurement/common/simple/etcd_metrics.go:

	// In https://github.com/kubernetes/kubernetes/pull/74690, mTLS is enabled for etcd server
	// http://localhost:2382 is specified to bypass TLS credential requirement when checking
	// etcd /metrics and /health.
	if samples, err := e.sshEtcdMetrics("curl http://localhost:2382/metrics", host, provider); err == nil {
		return samples, nil
	}

	// Use old endpoint if new one fails.
	return e.sshEtcdMetrics("curl http://localhost:2379/metrics", host, provider)

Likewise, you need to check how metrics are fetched for the other components in that folder and make sure the assumptions are consistent with your environment.

In addition, there is an even deeper hole. pkg/measurement/common/simple/metrics_for_e2e.go creates a grabber to scrape metrics from the individual components:

grabber, err := metrics.NewMetricsGrabber(
	config.ClusterFramework.GetClientSets().GetClient(),
	nil,   /*external client*/
	grabMetricsFromKubelets,
	true,  /*grab metrics from scheduler*/
	true,  /*grab metrics from controller manager*/
	true,  /*grab metrics from apiserver*/
	false /*grab metrics from cluster autoscaler*/)

The grabber comes from the vendored package vendor/k8s.io/kubernetes/test/e2e/framework/metrics/metrics_grabber.go. Looking deeper into this package, we find that it assumes the components are deployed as pods when it scrapes them. From vendor/k8s.io/kubernetes/test/e2e/framework/metrics/metrics_grabber.go:

func (g *MetricsGrabber) GrabFromScheduler() (SchedulerMetrics, error) {
	if !g.registeredMaster {
		return SchedulerMetrics{}, fmt.Errorf("Master's Kubelet is not registered. Skipping Scheduler's metrics gathering.")
	}
	output, err := g.getMetricsFromPod(g.client, fmt.Sprintf("%v-%v", "kube-scheduler", g.masterName), metav1.NamespaceSystem, ports.InsecureSchedulerPort)
	if err != nil {
		return SchedulerMetrics{}, err
	}
	return parseSchedulerMetrics(output)
}

If you deploy the Kubernetes components as binaries, you need to modify this vendored package. The worst part is that if you use the master branch, it has already switched to Go modules, and you have to fork the package and rewrite it all…
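
For illustration, one direction for such a patch is to fetch the metrics directly over HTTP instead of from a pod. This is only a standalone sketch under the assumption that the scheduler's insecure port (10251 in 1.15) is reachable from the machine running cl2; if it only listens on localhost, you need the SSH-plus-curl approach used by the scheduler latency measurement instead:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

// grabSchedulerMetricsDirect fetches the raw /metrics text from the
// scheduler's insecure port. The master IP is a placeholder.
func grabSchedulerMetricsDirect(masterIP string) (string, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10251/metrics", masterIP))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	metrics, err := grabSchedulerMetricsDirect("10.0.0.1") // placeholder master IP
	if err != nil {
		fmt.Println("grab failed:", err)
		return
	}
	fmt.Println(len(metrics), "bytes of scheduler metrics")
}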

master node issue

cl2 automatically determines which node is the master node; the logic is in vendor/k8s.io/kubernetes/pkg/util/system/system_utils.go:

// TODO: find a better way of figuring out if given node is a registered master.
// IsMasterNode checks if it's a master node, see http://gitlab.bj.sensetime.com/xialei1/perf-tests/issues/4
func IsMasterNode(node corev1.Node) bool {
	// We are trying to capture "master(-...)?$" regexp.
	// However, using regexp.MatchString() results even in more than 35%
	// of all space allocations in ControllerManager spent in this function.
	// That's why we are trying to be a bit smarter.
	name := node.Name
	if strings.HasSuffix(name, "master") {
		return true
	}
	return false
}

It decides whether a node is the master purely by the suffix "master" in the node name. But kubeadm marks the master node with a node-role.kubernetes.io/master="" label after installation, and other Kubernetes installation methods do not necessarily name the master node this way either. I have raised this with the community, see perf-tests #1191.
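
A label-based check along these lines would be more robust for kubeadm-style clusters. This is only a sketch of what I have in mind, not the upstream code:

package system

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isMasterNodeByLabel treats a node as a master if it carries the kubeadm
// node-role.kubernetes.io/master label, and only falls back to the
// name-suffix heuristic shown above.
func isMasterNodeByLabel(node corev1.Node) bool {
	if _, ok := node.Labels["node-role.kubernetes.io/master"]; ok {
		return true
	}
	return strings.HasSuffix(node.Name, "master")
}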

scheduler throughput issue

This is an issue I raised with the community earlier, and it has been fixed on the master branch; see perf-tests #1083. In short, the average throughput was incorrectly used as the metric, when the maximum throughput should be used instead.

The relevant code is in pkg/measurement/common/simple/scheduler_throughput.go:

type schedulingThroughput struct {
	Average float64 `json:"average"`
	Perc50  float64 `json:"perc50"`
	Perc90  float64 `json:"perc90"`
	Perc99  float64 `json:"perc99"`
}

There should actually be a Max value.
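
Roughly the shape I have in mind: keep the percentiles and also report the maximum observed scheduling throughput. The exact fix on the master branch may differ; this is just the struct I would expect:

type schedulingThroughput struct {
	Average float64 `json:"average"`
	Perc50  float64 `json:"perc50"`
	Perc90  float64 `json:"perc90"`
	Perc99  float64 `json:"perc99"`
	Max     float64 `json:"max"`
}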

What exactly are these indicators?

After numerous hacks and rounds of debugging, cl2 finally ran. Once it did, a new question came up: what does each of these metrics actually measure, and how do I locate bottlenecks from them?

Most cl2 metrics are end-to-end from the user's perspective. Each metric spans many components and steps, which makes it hard to pin a bottleneck on a specific part. So it is necessary to sort out exactly which part of the process each metric covers: where it starts and where it ends.

There are currently three official SLO metrics:

  1. mutating API latency
  2. read-only API latency
  3. pod startup latency

pod startup latency

The query for this one is relatively complex. The source is in pkg/measurement/common/slos/pod_startup_latency.go and is split into two actions, start and gather. The results are recorded in podStartupEntries, a nested map (pod -> phase -> timestamp) that records the time of each phase for each pod.

In the start action, an informer is started to watch pods. When a pod is in the Running state and no record for it is found in podStartupEntries, the pod is recorded with:

  1. watchPhase set to time.Now()
  2. createPhase set to the pod's creationTimestamp
  3. runPhase set to the timestamp at which the pod's containers entered the running state

In the gather action, the pod watch is stopped and all events are iterated over. The scheduler records an event after scheduling a pod, like this:

4d17h Normal Scheduled pod/sensestar-test-gvl92 Successfully assigned default/sensestar-test-gvl92 to sh-idc1-10-5-8-62

The schedulePhase of every recorded pod is set to the timestamp of its Scheduled event. The metrics are then summarized as follows (a sketch of the arithmetic comes after the list):

  1. create_to_schedule: time of the Scheduled event - the pod's creationTimestamp
  2. schedule_to_run: time the pod's containers reach the running state - time of the Scheduled event
  3. run_to_watch: time the informer sees the running pod - time the containers reach the running state
  4. schedule_to_watch: time the informer sees the running pod - time of the Scheduled event
  5. pod_startup: time the informer sees the running pod - the pod's creationTimestamp
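
A small sketch of the arithmetic (the struct and field names are mine, not cl2's): given the four timestamps recorded per pod, every reported latency is a simple difference between two of them.

package main

import (
	"fmt"
	"time"
)

// podPhases holds the four timestamps recorded per pod (names are mine).
type podPhases struct {
	create   time.Time // pod creationTimestamp
	schedule time.Time // scheduler's Scheduled event
	run      time.Time // containers reached the running state
	watch    time.Time // informer observed the running pod
}

func main() {
	base := time.Now()
	p := podPhases{
		create:   base,
		schedule: base.Add(200 * time.Millisecond),
		run:      base.Add(1500 * time.Millisecond),
		watch:    base.Add(1600 * time.Millisecond),
	}
	fmt.Println("create_to_schedule:", p.schedule.Sub(p.create)) // 200ms
	fmt.Println("schedule_to_run:   ", p.run.Sub(p.schedule))    // 1.3s
	fmt.Println("run_to_watch:      ", p.watch.Sub(p.run))       // 100ms
	fmt.Println("schedule_to_watch: ", p.watch.Sub(p.schedule))  // 1.4s
	fmt.Println("pod_startup:       ", p.watch.Sub(p.create))    // 1.6s
}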

This measurement looks a little odd, but it breaks down the latency of each phase of the pod creation process. For comparison, the kubelet metric kubelet_pod_start_duration_seconds is also available from Prometheus:

histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[1h])) by (le))

A few extra words on the process by which a pod in a Deployment gets created:

  1. apiserver receives the request to create a Deployment, stores it in etcd, and the change is picked up by controller-manager
  2. controller-manager creates the pod object (an empty shell), stamps its creationTimestamp, and sends the create request to apiserver
  3. apiserver receives the pod create request, writes it to etcd, and pushes it to the scheduler
  4. The scheduler selects a node, fills in nodeName, and updates the pod via apiserver. The pod is Pending; nothing has actually been created on a node yet
  5. apiserver updates the pod in etcd and pushes it to the kubelet on the chosen node
  6. kubelet starts creating the pod, fills in hostIP and resourceVersion, and sends an update request to apiserver; the pod is still Pending
  7. apiserver updates the pod in etcd while kubelet continues creating the containers. Once the containers are running, kubelet sends another pod update to apiserver and the pod becomes Running
  8. apiserver receives the update, writes it to etcd, and pushes it to the informer, which records the watchPhase

Mutating API and Readonly API

Cl2 query statement:

histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{resource!="events", verb!~"WATCH|WATCHLIST|PROXY|proxy|CONNECT"}[20m])) by (resource, subresource, verb, scope, le))

It is scraped from the apiserver, and the metric means:

Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.

It covers the time from when the apiserver receives a request until the response is sent. The read-only API only involves etcd; the mutating API may involve other components and is not discussed here.

etcd metrics

An extra word on etcd, because etcd is the focus of our Kubernetes performance tuning. The etcd part of the results is pretty clear, so I just want to emphasize that…

ClusterLoader2: impressions from stepping in the pits

The advantages of cl2, as I see them:

  1. It models the test process well
  2. The metrics it collects are relatively comprehensive

The disadvantages of cl2, as I see them:

  1. A fairly steep learning curve, and extremely unfriendly to users who install their own Kubernetes clusters…
  2. Limited introductory documentation
  3. Branch management is a bit chaotic