Background

Spark provides several operating modes, such as local, standalone, and on YARN. To keep the development environment consistent with the actual runtime environment, code is usually written locally, then compiled into a jar and uploaded to the Spark cluster for debugging. But when the processing logic is complex, or code has to change to fix performance problems, developers must modify, compile, and upload jars over and over again. This endless repetition is a drain on productivity.

Thinking behind Local mode

Spark local mode is a mode provided by the framework that uses threads to simulate the coordination of multiple processes, so that programs can run directly in the IDE. By default, the local file system and Hive libraries used in this mode differ from those of the actual runtime environment. Therefore, to use local mode for rapid development and testing, you must first configure it to use the shared resources of the cluster.
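The number of simulated worker threads is controlled through the master URL. A minimal sketch (the app name here is a placeholder):

import org.apache.spark.sql.SparkSession

// "local"    -> one worker thread
// "local[4]" -> four worker threads
// "local[*]" -> one worker thread per logical core
val spark = SparkSession.builder()
  .appName("LocalModeDemo")
  .master("local[*]")
  .getOrCreate()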

How to Configure (Windows)

Cluster environment: Hadoop 2.7.4, Spark 2.1.1
Required software: winutils.zip
Development tool: IDEA

  • Set HADOOP_HOME. Decompress hadoop-2.7.4.tar.gz to D:\hadoop\hadoop-2.7.4 and decompress winutils.zip into D:\hadoop\hadoop-2.7.4\bin. Set HADOOP_HOME to D:\hadoop\hadoop-2.7.4 and add %HADOOP_HOME%\bin to the PATH, for example with the commands below.
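A sketch of setting the variables from a Windows command prompt (persisted for the current user; paths as above):

setx HADOOP_HOME "D:\hadoop\hadoop-2.7.4"
setx PATH "%PATH%;%HADOOP_HOME%\bin"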

  • Copy the cluster configuration files. Cluster files: core-site.xml, hdfs-site.xml, and hive-site.xml. Copy them to the resources folder of the project, as shown in the layout below.
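A typical resulting layout, assuming a standard Maven/SBT project structure:

src/
  main/
    scala/            <- application code
    resources/
      core-site.xml
      hdfs-site.xml
      hive-site.xml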

  • Configure local DNS resolution so that the local environment can resolve the hostnames used in the configuration files, for example via the Windows hosts file shown below.
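A sketch of entries in C:\Windows\System32\drivers\etc\hosts (the IP addresses and hostnames are hypothetical; substitute your cluster's values):

192.168.1.101  master.cluster.local
192.168.1.102  worker1.cluster.local
192.168.1.103  worker2.cluster.local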

  • Configure HDFS permissions. By default, the program reads and writes HDFS as the local Windows user, which is not authorized in the cluster environment. This can be solved as follows:

System.setProperty("HADOOP_USER_NAME", "hdfs")

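Alternatively (an assumption based on Hadoop's standard user-resolution behavior, not shown in the original), the identity can be supplied through the HADOOP_USER_NAME environment variable instead of setting it in code:

set HADOOP_USER_NAME=hdfs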
  • Run the test code. Run the following code directly in IDEA to test:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object App {
  def main(args: Array[String]): Unit = {
    // Set the log level
    Logger.getLogger("org").setLevel(Level.INFO)
    // When running on Windows and accessing HDFS, a valid identity must be specified
    System.setProperty("HADOOP_USER_NAME", "hdfs")

    val spark = SparkSession.builder()
      .appName("App")
      .master("local") // local mode
      .config("HADOOP_USER_NAME", "root")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext
    // Import Spark's implicit conversions
    import spark.implicits._
    // Import the Spark SQL functions
    import org.apache.spark.sql.functions._

    spark.sql("show tables").show()

    sc.stop()
    spark.stop()
  }
}
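To run this in the IDE, the Spark SQL and Hive modules must be on the classpath. A minimal sketch of the dependencies, assuming an sbt project and versions matching the cluster above:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.1",
  "org.apache.spark" %% "spark-sql"  % "2.1.1",
  "org.apache.spark" %% "spark-hive" % "2.1.1"
)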

Conclusion

Configuring Spark local mode to use cluster resources avoids repeatedly packaging and uploading jars, which greatly improves development efficiency. This article describes how to configure the development environment on Windows; the same approach applies to other platforms. In addition, it has been tested against cluster environments such as vanilla Apache Hadoop and CDH.

Installation package and project source code (extraction code: 1i6h)