The simplest Spark tutorial ever – Chapter 1: Run the first Spark application

The simplest Spark tutorial ever. All code examples for the series are at: github.com/Mydreamandr…

(A note up front: this article was put together by the author, Zhang Yaofeng, from his own production experience and distilled into something simple and easy to understand. Writing it was not easy, so please credit the source when reprinting.)

What is Spark?

There is plenty of material about Spark on the Internet. Here is a quick summary of its use cases and advantages:

  • 1. Apache Spark is a next-generation batch-processing framework that also includes stream-processing capabilities. Built on many of the same principles as Hadoop's MapReduce engine, Spark focuses on speeding up batch workloads through in-memory computation and processing optimizations
  • 2. Spark can be deployed as a standalone cluster (paired with a suitable storage layer) or integrated with Hadoop as a replacement for the MapReduce engine
  • 3. Unlike MapReduce, Spark keeps data in memory throughout processing and only touches the storage layer when data is first read in and when the final result is persisted; all intermediate results stay in memory
  • 4. Beyond the engine itself, Spark has built an ecosystem of libraries around it to better support tasks such as machine learning and interactive queries, and Spark jobs are generally easier to write than MapReduce jobs
  • 5. Another important advantage is versatility: the same cluster can run both batch and streaming workloads and handle different types of tasks

Spark components

Now that we have a rough idea of what Spark is, let's look at which components it contains. I will just list them here so you can get familiar with the names; later chapters will walk through them one by one.

Spark is a fast and versatile cluster computing framework. Its core components include:

  • 1. Spark Core: implements Spark's basic functionality
    • Modules for task scheduling, memory management, fault recovery, and interaction with storage systems
    • Defines the API for RDDs (Resilient Distributed Datasets)
  • 2. Spark SQL: a package for working with structured data on Spark
    • Spark SQL lets you query data using SQL or HQL (the Hive query language)
    • Spark SQL supports many data sources
  • 3. Spark Streaming: the real-time stream-processing component
    • Provides APIs for manipulating data streams
  • 4. MLlib: the machine learning library, rich and powerful
  • 5. GraphX: graph computation and graph algorithms
  • 6. Cluster managers (such as Hadoop YARN)
    • To scale from a single node up to thousands of nodes, Spark also ships its own simple scheduler (the standalone cluster manager)

These are some of Spark's commonly used components. As you can see, Spark has a very strong ecosystem that stretches from big data processing all the way to machine learning. If your boss still won't give you a raise after you learn all this, is he even human? (smile)
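To make the list above a little more concrete, here is a minimal sketch of my own (not from the original article) showing what Spark Core (RDDs) and Spark SQL look like from the Scala spark-shell. It assumes the shell's built-in `sc` (SparkContext) and `spark` (SparkSession); the /tmp/people.json path is just a hypothetical example file.

```scala
// Spark Core: distribute a local collection as an RDD and run a parallel operation on it
val numbers = sc.parallelize(1 to 100)
println(numbers.reduce(_ + _))   // action: sums the elements in parallel -> 5050

// Spark SQL: load structured data as a DataFrame and query it with SQL
// (/tmp/people.json is a hypothetical JSON file, one record per line)
val people = spark.read.json("/tmp/people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT count(*) AS n FROM people").show()
```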

Install Spark

  • 1. First we need a virtual machine or cloud server; I chose CentOS 7.5
  • 2. Download Spark from the downloads page: spark.apache.org/downloads.h… I chose 2.2.3. All tutorials in this series are based on Spark 2.2.3, so try to stay consistent
  • 3. Unpack it with tar -zxvf spark……. [once the JDK environment is OK, Spark can be used as-is]
  • 4. Install the JDK and configure the environment variables in /etc/profile; apply the configuration with source /etc/profile
  • 5. Verify with java -version

Then go into the directory where you unpacked Spark and look at what's inside:

  • 1. python, R directories: source code for the Python and R APIs
  • 2. README.md: getting-started help
  • 3. bin: executable commands
  • 4. examples: introductory demo source code in Java, Python, and R

A first example

The Spark package already ships with Python and Scala shells. Let's run a small Spark example in the shell and see the effect.

  • 1. Run the Python shell
    • Go to the spark/bin directory
    • Execute: ./pyspark
  • 2. Run the Scala shell
    • Go to the spark/bin directory
    • Execute: ./spark-shell

After launching the shell, wait a moment and the Spark ASCII logo will appear, followed by the interactive prompt.
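Once the prompt appears, the Scala shell has already created a SparkContext for you and bound it to the variable `sc`. As a quick sanity check (my own aside, not part of the original article), you can inspect it:

```scala
sc.version   // prints the Spark version, e.g. "2.2.3"
sc.master    // shows which master / cluster manager the shell is connected to, e.g. "local[*]"
```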

Now let's write our first small Spark program and see the result.

val lines = sc.textFile("/usr/local/spark/spark-2.2.3-bin-hadoop2.7/README.md")   // the path is my installation directory; README.md ships with Spark itself, so make sure your file is in the corresponding location
lines.count()   // count the number of elements (lines) in the RDD
lines.first()   // print the first element in the RDD [the first line of README.md]

Execution Result:

Press Ctrl+D to exit the shell.

  • Spark also provides a JavaWordCount example under the examples directory (a Scala sketch of the same idea follows below)
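For readers who want to try the same thing from the Scala shell, here is a minimal word-count sketch of my own (not taken from Spark's bundled example). It reuses the README.md path from the earlier snippet; adjust it to your own installation directory.

```scala
// Read the file, split it into words, and count how often each word occurs
val lines = sc.textFile("/usr/local/spark/spark-2.2.3-bin-hadoop2.7/README.md")
val counts = lines
  .flatMap(_.split("\\s+"))        // split each line into words
  .filter(_.nonEmpty)              // drop empty tokens
  .map(word => (word, 1))          // pair each word with an initial count of 1
  .reduceByKey(_ + _)              // sum the counts for each distinct word
counts.take(10).foreach(println)   // print the first ten (word, count) pairs
```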

Conclusion

Through the simple example above, we have made our first real contact with Spark. Let me briefly describe how the Spark program executes and what happens inside Spark during that execution.

  • 1. lines is an RDD, created from a local text file on your machine
  • 2. In Spark, we express our computation through operations on distributed datasets, and those operations are automatically computed in parallel across the cluster. Such a dataset is called a Resilient Distributed Dataset (RDD)
  • 3. The RDD is Spark's basic abstraction for distributed data and computation
  • 4. We can run all kinds of parallel operations on this RDD, such as the count() and first() we just used (see the short sketch below)
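As a small illustration of my own (assuming the `lines` RDD from the example above is still defined in the shell), here is what a transformation followed by two actions looks like:

```scala
// filter() is a transformation: it lazily describes a new RDD and does no work yet
val sparkLines = lines.filter(_.contains("Spark"))

// count() and first() are actions: they trigger the actual parallel computation
println(sparkLines.count())   // number of lines in README.md that mention "Spark"
println(sparkLines.first())   // the first such line
```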

The whole series is already finished on CSDN. If you are in a hurry, you can go there to read and learn, at: blog.csdn.net/youbitch1/a…

If you follow me here on Nuggets, don't worry, just wait for my updates. The Spark tutorial on Nuggets will differ slightly from the CSDN version: when I post here I re-read the original, improve the weaker parts, and then publish. The CSDN version will stay as it is; I'm too lazy to change it (laughs)