Currently, K8s-based Spark applications run in one of two modes:

  • Spark on K8s, supported natively by Spark
  • Spark on K8s operator, built on the K8s operator pattern

The former is the K8s client implementation that the Spark community introduced to support K8s as a resource management framework. The latter is an operator developed by the K8s community to support Spark.

The difference between Spark on K8s and the Spark on K8s operator

Community support
  • Spark on K8s: the Spark community.
  • Spark on K8s operator: GoogleCloudPlatform, without official support.

Versions
  • Spark on K8s: Spark >= 2.3, Kubernetes >= 1.6.
  • Spark on K8s operator: Spark > 2.3, Kubernetes >= 1.13.

Installation
  • Spark on K8s: install according to the official website. Permission to create, list, edit, and delete K8s pods is required, and the image has to be built by compiling the source code, which is a tedious process.
  • Spark on K8s operator: a K8s admin has to install incubator/sparkoperator, and permission to create, list, edit, and delete pods is required.

Use
  • Spark on K8s: submit as in Code 1. Client and cluster modes are supported. See "spark on k8s" for reference.
  • Spark on K8s operator: submit through a YAML configuration file as in Code 2. Client and cluster modes are supported. For the specific parameters, see "spark operator configuration".

Advantages
  • Spark on K8s: tasks are submitted the usual Spark way, which is more convenient for users who are used to Spark.
  • Spark on K8s operator: tasks are submitted with K8s configuration files, which are highly reusable.

Disadvantages
  • Spark on K8s: driver resources are not released automatically after the driver finishes.
  • Spark on K8s operator: driver resources are not released automatically after the driver finishes.

Implementation
  • Spark on K8s: for Spark submission, both client and cluster submissions inherit SparkApplication.
    In client mode, the subclass is JavaMainApplication, which runs the main class through reflection. For K8s tasks the clusterManager is KubernetesClusterManager. This mode works the same way as submitting tasks to YARN.
    In cluster mode, the Spark program entry for K8s tasks is KubernetesClientApplication. The client creates a service with clusterIP set to None, and executors interact with this service through RPC, for example when tasks are submitted. The client also creates a configMap whose name ends in driver-conf-map. It is referenced as a volume mount when the Spark driver pod is created, and the contents of the file are finally passed through --properties-file when the driver submits the task, so that configuration items such as spark.driver.host reach the driver. At the same time, a configMap whose name ends in -hadoop-config is created.
    How does a K8s image distinguish between an executor and a driver? Everything is in the Dockerfile (configured differently at build time depending on the Hadoop and Kerberos environment) and in the entrypoint shell script, where the driver and the executor are told apart.
  • Spark on K8s operator: it uses the K8s CRD controller mechanism to define a custom CRD, following the operator SDK, and listens for add, delete, update, and query events. When a CRD creation event is detected, it creates a pod and submits the Spark task according to the configuration items in the corresponding YAML file. For details, see "spark on k8s operator design". Cluster and client modes work on the same principle as in Spark on K8s, because the image that is reused is the official Spark image.
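
The way the entrypoint script tells the driver and the executor apart can be pictured as a dispatch on the first argument passed to the container. The following is a simplified sketch of that idea, not the entrypoint shipped in the Spark image. The function name build_cmd is hypothetical, the command lines are abbreviated, the properties file path and driver URL are illustrative assumptions, and the commands are only printed, never executed.

```shell
#!/usr/bin/env bash
# Simplified sketch of role dispatch in an entrypoint script.
# build_cmd is a hypothetical helper: it prints the command it would run.
build_cmd() {
  local role="$1"
  shift
  case "$role" in
    driver)
      # The driver container re-enters spark-submit in client mode.
      echo "spark-submit --deploy-mode client $*"
      ;;
    executor)
      # The executor container starts the executor backend JVM.
      echo "java org.apache.spark.executor.CoarseGrainedExecutorBackend $*"
      ;;
    *)
      # Any other command is passed through unchanged.
      echo "$role $*"
      ;;
  esac
}

# Illustrative arguments only (assumed path and URL).
build_cmd driver --properties-file /opt/spark/conf/spark.properties
build_cmd executor --driver-url spark://CoarseGrainedScheduler@driver-svc:7078
```

Since the same image serves both roles, only the first argument set on the pod decides which branch runs.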
Code 1

    bin/spark-submit \
      --master k8s://https://192.168.202.231:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf "spark.kubernetes.namespace=dev" \
      --conf "spark.kubernetes.authenticate.driver.serviceAccountName=lijiahong" \
      --conf "spark.kubernetes.container.image=harbor.k8s-test.uc.host.<domain>/dev/spark-py:cdh-server-5.13.1" \
      --conf "spark.kubernetes.container.image.pullSecrets=regsecret" \
      --conf "spark.kubernetes.file.upload.path=hdfs:///tmp" \
      --conf "spark.kubernetes.container.image.pullPolicy=Always" \
      hdfs:///tmp/spark-examples_2.12-3.0.0.jar
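
Because the driver pod is not released automatically after a cluster-mode run, it has to be inspected and removed by hand. A minimal sketch follows. It assumes kubectl is already configured for the cluster, the dev namespace from the submission above, and the spark-role=driver label that Spark attaches to driver pods. The run helper and the DRY_RUN variable are hypothetical additions so that the commands can be previewed without touching a cluster.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
NAMESPACE="${NAMESPACE:-dev}"   # namespace used in the spark-submit example

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List driver pods left behind by earlier submissions.
list_drivers() {
  run kubectl get pods -n "$NAMESPACE" -l spark-role=driver
}

# Delete only the driver pods that completed successfully.
clean_drivers() {
  run kubectl delete pods -n "$NAMESPACE" -l spark-role=driver \
    --field-selector=status.phase=Succeeded
}

list_drivers
clean_drivers
```

Restricting the delete to the Succeeded phase keeps failed drivers around, so their logs can still be read with kubectl logs.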
Code 2

    ---
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: dev
    spec:
      type: Scala
      mode: cluster
      image: "gcr.io/spark-operator/spark:v3.0.0"
      imagePullPolicy: Always
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
      sparkVersion: "3.0.0"
      restartPolicy:
        type: Never
      volumes:
        - name: "test-volume"
          hostPath:
            path: "/tmp"
            type: Directory
      driver:
        cores: 1
        coreLimit: "1200m"
        memory: "512M"
        labels:
          version: 3.0.0
        serviceAccount: lijiahong
        volumeMounts:
          - name: "test-volume"
            mountPath: "/tmp"
      executor:
        cores: 1
        instances: 1
        memory: "512M"
        labels:
          version: 3.0.0
        volumeMounts:
          - name: "test-volume"
            mountPath: "/tmp"
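
With the operator, the whole lifecycle of a job goes through kubectl and the SparkApplication resource. The sketch below prints the command for each step instead of running it. The function name operator_cmd and the manifest file name (the application name plus .yaml) are hypothetical, and the defaults spark-pi and dev are taken from the manifest above. Deleting the SparkApplication is expected to remove its driver pod as well, but that is worth verifying on the cluster in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the kubectl command for one lifecycle step.
# Usage: operator_cmd submit|status|cleanup [name] [namespace]
operator_cmd() {
  local action="$1" name="${2:-spark-pi}" namespace="${3:-dev}"
  case "$action" in
    submit)
      # Create or update the SparkApplication from its manifest file.
      echo "kubectl apply -f $name.yaml"
      ;;
    status)
      # Show the application object, including its reported state.
      echo "kubectl get sparkapplication $name -n $namespace -o yaml"
      ;;
    cleanup)
      # Remove the application object once the job is finished.
      echo "kubectl delete sparkapplication $name -n $namespace"
      ;;
    *)
      echo "usage: operator_cmd submit|status|cleanup [name] [namespace]" >&2
      return 1
      ;;
  esac
}

operator_cmd submit
operator_cmd status
operator_cmd cleanup
```

To execute a step rather than preview it, the printed command can be evaluated, for example eval "$(operator_cmd submit)".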
