Background

Spark provides several operating modes, such as local, standalone, and on YARN. To keep the development environment consistent with the actual runtime environment, code is usually written locally, then compiled into a jar and uploaded to the Spark cluster for debugging. But when the processing logic is complex, or code has to change to fix performance problems, developers must modify, compile, and upload jars over and over again. This endless repetition is a drain on productivity.

Thinking behind Local mode

Spark local mode is a mode provided by the framework that uses threads to simulate the coordination of multiple processes, so that programs can run directly in the IDE. By default, the local file system and Hive libraries used in this mode differ from those of the actual runtime environment. Therefore, to use local mode for rapid development and testing, you must first configure it to use the shared resources of the cluster.
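The number of simulated worker threads is controlled through the master URL. A minimal sketch (the app name here is a placeholder):

import org.apache.spark.sql.SparkSession

// "local"    -> one worker thread
// "local[4]" -> four worker threads
// "local[*]" -> one worker thread per logical core
val spark = SparkSession.builder()
  .appName("LocalModeDemo")
  .master("local[*]")
  .getOrCreate()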

How to Configure (Windows)

Cluster environment: Hadoop 2.7.4, Spark 2.1.1
Required software: winutils.zip
Development tool: IDEA

  • Set HADOOP_HOME. Decompress hadoop-2.7.4.tar.gz to D:\hadoop\hadoop-2.7.4 and decompress winutils.zip into D:\hadoop\hadoop-2.7.4\bin. Set HADOOP_HOME to D:\hadoop\hadoop-2.7.4 and add %HADOOP_HOME%\bin to the PATH, for example with the commands below.
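A sketch of setting the variables from a Windows command prompt (persisted for the current user; paths as above):

setx HADOOP_HOME "D:\hadoop\hadoop-2.7.4"
setx PATH "%PATH%;%HADOOP_HOME%\bin"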

  • Copy the cluster configuration files. Cluster files: core-site.xml, hdfs-site.xml, and hive-site.xml. Copy them to the resources folder of the project, as shown in the layout below.
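A typical resulting layout, assuming a standard Maven/SBT project structure:

src/
  main/
    scala/            <- application code
    resources/
      core-site.xml
      hdfs-site.xml
      hive-site.xml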

  • Configure local DNS resolution so that the local environment can resolve the hostnames used in the configuration files, for example via the Windows hosts file shown below.
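A sketch of entries in C:\Windows\System32\drivers\etc\hosts (the IP addresses and hostnames are hypothetical; substitute your cluster's values):

192.168.1.101  master.cluster.local
192.168.1.102  worker1.cluster.local
192.168.1.103  worker2.cluster.local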

  • Configure HDFS permissions. By default, the program reads and writes HDFS as the local Windows user, which is not authorized in the cluster environment. This can be solved as follows:

System.setProperty("HADOOP_USER_NAME", "hdfs")

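Alternatively (an assumption based on Hadoop's standard user-resolution behavior, not shown in the original), the identity can be supplied through the HADOOP_USER_NAME environment variable instead of setting it in code:

set HADOOP_USER_NAME=hdfs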
  • Run the test code. Run the following code directly in IDEA to test:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object App {
  def main(args: Array[String]): Unit = {
    // Set the log level
    Logger.getLogger("org").setLevel(Level.INFO)
    // When running on Windows and accessing HDFS, a valid identity must be specified
    System.setProperty("HADOOP_USER_NAME", "hdfs")

    val spark = SparkSession.builder()
      .appName("App")
      .master("local") // local mode
      .config("HADOOP_USER_NAME", "root")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext
    // Import Spark's implicit conversions
    import spark.implicits._
    // Import the Spark SQL functions
    import org.apache.spark.sql.functions._

    spark.sql("show tables").show()

    sc.stop()
    spark.stop()
  }
}
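To run this in the IDE, the Spark SQL and Hive modules must be on the classpath. A minimal sketch of the dependencies, assuming an sbt project and versions matching the cluster above:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.1",
  "org.apache.spark" %% "spark-sql"  % "2.1.1",
  "org.apache.spark" %% "spark-hive" % "2.1.1"
)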

Conclusion

Configuring Spark local mode to use cluster resources avoids repeatedly packaging and uploading jars, which greatly improves development efficiency. This article describes how to configure the development environment on Windows; the same approach applies to other platforms. In addition, it has been tested against cluster environments such as vanilla Apache Hadoop and CDH.

Installation package and project source code (extraction code: 1i6h)