Last updated on March 30, 2016:

Foreword

This series is based on my own understanding from learning Spark, on my reading of reference articles, and on my own experience using Spark. The purpose of writing the series is to organize my notes on learning Spark, so I focus on overall understanding and do not record unnecessary details. In addition, quotes from the original English documentation sometimes appear in the articles; to avoid distorting their meaning, I do not translate them. For more information, read the reference articles and the official documentation.

Second, this series is based on the latest Spark 1.6.0 release. Spark is updated quickly, so it is important to keep track of the version number. Finally, if you think any of the content is wrong, feel free to leave a comment; all comments will be replied to within 24 hours. Thank you very much.

Tips: if an illustration is hard to read, you can: 1. Open the image in a new tab to view it at full size. 2. Click Present Mode at the top of the directory on the right.

1. What is a Spark DataFrame

Let’s take a look at the official documentation:

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

From this we can see a few key points about the Spark DataFrame:

  • It is a distributed data set
  • It is like a table in a relational database, a sheet in Excel, or a DataFrame in Python/R
  • It has rich manipulation functions, similar to the operators on an RDD
  • A DataFrame can be registered as a data table and then operated on with SQL
  • There are rich ways to create one:
    • An existing RDD (see the sketch after this list)
    • Structured data files
    • JSON data sets
    • Hive tables
    • External databases
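
As a quick taste of the first of these sources, here is a minimal sketch of creating a DataFrame from an existing RDD. It assumes a sc and sqlContext like the ones created in section 3 below, and the sample rows are made up:

from pyspark.sql import Row

# build an RDD of Row objects and let Spark infer the schema from them
rdd = sc.parallelize([Row(ticker='000001', closePrice=10.70),
                      Row(ticker='000002', closePrice=24.43)])
df_from_rdd = sqlContext.createDataFrame(rdd)
df_from_rdd.show()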

2. Why the Spark DataFrame

Explaining in detail why the DataFrame is used gets fairly involved, but the following diagram basically explains it all:

However, this article looks at the Spark DataFrame from a basic point of view. Instead of focusing on the details, let's first understand some basic principles and advantages. The content in the picture above will be covered later in the series, perhaps in chapter 15.

The DataFrame API is inspired by the design of R and Python Data Frames and has the following features:

  • Supports data volumes from KB to PB scale;
  • Supports multiple data formats and multiple storage systems;
  • Advanced optimization and code generation through Spark SQL's Catalyst optimizer;
  • Seamless integration with all big data tools and infrastructure via Spark;
  • APIs for Python, Java, Scala, and R (SparkR);

In short, the DataFrame makes it easier to manipulate data sets, and it is faster because the underlying execution is optimized and code-generated by Spark SQL's Catalyst optimizer. To sum up, using the Spark DataFrame to build a Spark app means you can:

  • Write less: write less code (see the sketch after this list)
  • Do more: do more with it
  • Faster: run at a faster speed
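
As a rough illustration of "write less", here is a sketch comparing the RDD API and the DataFrame API for the same per-ticker average price. It assumes a pair RDD price_pairs of (ticker, closePrice) tuples and the stock df built later in this article; the variable names are mine:

# RDD API: average closePrice per ticker, done by hand
avg_rdd = (price_pairs.mapValues(lambda p: (p, 1))
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                      .mapValues(lambda s: s[0] / s[1]))

# DataFrame API: the same computation in one line, optimized by Catalyst
avg_df = df.groupBy('ticker').avg('closePrice')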

3. Creating a DataFrame

Spark SQL, the DataFrame, and Datasets are all built on the Spark SQL library, and they share the same code optimization, generation, and execution pipeline. Therefore, the entry point for SQL, DataFrames, and Datasets is the sqlContext. There are many data sources from which a Spark DataFrame can be created; here we cover the easiest way, creating a DataFrame from a structured file.

  • Step 1: Create the SparkContext and SQLContext. Here is the template I use:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import time

sc_conf = SparkConf()
sc_conf.setAppName("03-DataFrame-01")
sc_conf.setMaster(SPARK_MASTER)            # SPARK_MASTER is defined elsewhere in my environment
sc_conf.set('spark.executor.memory', '2g')
sc_conf.set('spark.logConf', True)
sc_conf.getAll()

# stop any existing SparkContext before creating a new one
try:
    sc.stop()
    time.sleep(1)
except:
    pass

sc = SparkContext(conf=sc_conf)
sqlContext = SQLContext(sc)
  • Step 2: Create a DataFrame from a JSON file

Description of the data file: basic information about Chinese A-share listed companies. It can be obtained here: stock_5.json

Note: the JSON file is not a standard JSON file. Spark (as of 1.6) does not support reading standard multi-line JSON files; you need to pre-process a standard JSON file into the format Spark supports: one JSON object per line.

For example, the people.json example from the official website requires the following format:

{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}Copy the code

By contrast, a standard JSON file for this data would take one of the following two forms:

{"name": ["Yin", "Michael"], "address":[ {"city":"Columbus","state":"Ohio"}, {"city":null, "State" : "California"}}] # # # # or [{" name ":" Yin ", "address" : {" city ":" Columbus ", "state" : "Ohio"}}, {" name ":" Michael," "address":{"city":null, "state":"California"}} ]Copy the code

Therefore, when using Spark SQL to read a JSON file, it is important to handle the file's format in advance. Here the JSON file has already been pre-processed into the supported format, as shown below:

{"ticker":"000001","tradeDate":"2016-03-30","exchangeCD":"XSHE","secShortName":"\u5e73\u5b89\u94f6\u884c","preClosePrice ": 10.43," openPrice ": 10.48," dealAmount turnoverValue ": 19661," ": 572627417.1299999952," highestPrice ": 10.7," lowestPrice ": 10.4 7, "closePrice" : 10.7, "negMarketValue" : 126303384220.0, "marketValue" : 153102835340.0, "isOpen" : 1, "secID" : "000001. XSHE", "listD ate":"1991-04-03","ListSector":"\u4e3b\u677f","totalShares":14308676200}, {" ticker ":" 000002 ", "tradeDate" : "2016-03-30", "exchangeCD" : "XSHE", "secShortName" : "\ u4e07 \ u79d1A", "preClosePrice" : 24.43, "op EnPrice ": 0.0," dealAmount ": 0," turnoverValue ": 0.0," highestPrice ": 0.0," lowestPrice ": 0.0," closePrice negMarketValue ": 24.43," " : 269685994760.0, 237174448154.0, "marketValue" : "isOpen" : 0, "secID" : "000002. XSHE", "listDate" : "1991-01-29", "ListSector" : "\ u4e 3b\u677f","totalShares":11039132000}Copy the code
# df is short for DataFrame
df = sqlContext.read.json('hdfs://10.21.208.21:8020/user/mercury/stock_5.json')
df.printSchema()
df.select(['ticker', 'secID', 'tradeDate', 'listDate', 'openPrice', 'closePrice',
           'highestPrice', 'lowestPrice', 'isOpen']).show(n=5)

4. DataFrame operations

Just like the RDD, the DataFrame has a number of operators that work on the whole data set; we will call them the DataFrame API for short. Operators, DSLs, and the like are not easy to grasp if you are unfamiliar with them. Here is the full list of APIs: Spark Dataframe API
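
Before getting into the subsections, here is a minimal sketch of a few common operations on the stock df created above (the column names come from stock_5.json); it is illustrative, not exhaustive:

# select a few columns and keep only stocks that traded on the day
df.select('ticker', 'closePrice', 'isOpen').filter(df.isOpen == 1).show(n=5)

# count rows per listing sector, largest first
df.groupBy('ListSector').count().orderBy('count', ascending=False).show()

# basic statistics for a numeric column
df.describe('closePrice').show()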

4.1 Executing SQL statements on a Dataframe
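
A minimal sketch of the idea, assuming the stock df created above: register the DataFrame as a temporary table (the table name stock is arbitrary) and query it with SQL through sqlContext.

# register the DataFrame so it can be queried with SQL
df.registerTempTable('stock')

# run a SQL query against the temporary table; the result is itself a DataFrame
top5 = sqlContext.sql("""
    SELECT ticker, secShortName, closePrice
    FROM stock
    WHERE isOpen = 1
    ORDER BY closePrice DESC
    LIMIT 5
""")
top5.show()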

4.2 Converting a Spark DataFrame to a pandas DataFrame

A picture is worth a thousand words:

Throughout the birth and development of Spark, I think Spark has done one thing very well: staying compatible with similar products. On the large scale, it is like this quote on Spark's website: "Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3." Being compatible with the Hadoop ecosystem makes it easy for Hadoop developers to switch to Spark. On the small scale, Spark is also compatible with more specialized tools; for example, the DataFrame supports converting a Spark DataFrame to a pandas DataFrame.

A pandas DataFrame can draw figures, save to various file formats, and support quick data analysis. In my view, the division of labor is: use the Spark DataFrame to process the data and run the main analysis logic at scale, then convert the results to a pandas DataFrame for display and reporting.
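
A minimal sketch of the round trip, assuming the stock df created above and that pandas is installed on the driver. Note that toPandas() collects the whole result to the driver, so only apply it to reasonably small results:

# Spark DataFrame -> pandas DataFrame (collects everything to the driver)
pdf = df.select('ticker', 'tradeDate', 'closePrice').toPandas()
print pdf.head()

# pandas DataFrame -> Spark DataFrame
df2 = sqlContext.createDataFrame(pdf)
df2.printSchema()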

5. Some lessons learned

5.1 Spark JSON Format Problem

Currently (as of Spark 1.6), Spark does not support reading standard multi-line JSON files. You need to pre-process a standard JSON file into the format Spark supports: one JSON object per line.
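
Here is a minimal pre-processing sketch in plain Python, assuming the standard file holds one JSON array of objects; the file names are hypothetical:

import json

# read a standard JSON file whose top level is an array of objects,
# then write it back out with one JSON object per line for Spark
with open('stock_standard.json') as src, open('stock_5.json', 'w') as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + '\n')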

5.2 Choosing between the Spark DataFrame and the pandas DataFrame

In short: do large-scale data processing and analysis with the Spark DataFrame, then hand the results over to a pandas DataFrame for display and reporting.

6. Next

OK, that is a brief introduction to the DataFrame. We will stop here for now and pick things up in the next article with an example I have been practicing: using Spark for quantitative investing.


References

Links to articles in this series

  • Spark 1. Introduction to Spark
  • Spark 2. Basic concepts of Spark explained
  • Spark 3. The Spark programming model
  • Spark 4. Spark's RDD
  • Spark 5. Spark learning resources you can't miss
  • Spark 6. An in-depth look at how Spark runs: jobs, stages, and tasks
  • Spark 7. Use Spark DataFrame to analyze big data
  • Spark 8. Practical cases | Spark applications in finance | intraday trend forecasting
  • Spark 9. Set up an IPython + Notebook + Spark development environment
  • Spark 10. Spark application performance optimization | 12 optimization methods
  • Spark 11. Spark machine learning
  • Spark 12. Spark 2.0 features
  • Spark 13. Spark 2.0 release notes, Chinese version
  • Spark 14. A Spark SQL performance optimization journey