Join the background

SparkSQL supports three join algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join. Both of them belong to Hash Join, but Shuffle or Broadcast is required before Hash Join. In fact, the Hash Join algorithm is derived from traditional databases, while Shuffle and Broadcast are the combination of the concepts of distributed big data. So the root of big data is the traditional database. Hash Join is the kernel.

Classification and implementation mechanism of Spark Join

Above is the classification and use of Spark Join.

Hash Join

Select * from ORDER,item where item.id = order.i_id; select * from order,item where item.id = order.i_id Now assume that Join adopts hash Join algorithm, the whole process will go through three steps:

  • Confirm Build Table and Probe Table: This concept is important. Build Table is built into a Hash Table with join keys as the key, while Probe Table uses join keys to search for qualified rows in the Hash Table and then join links. The Build table and Probe table are determined by Spark. In general, small tables are used as Build tables and large tables are used as Probe tables.
  • Building a Hash Table: Read Build Table(item) data in turn, hash each data according to Join Key(item.id), hash into the corresponding bucket (similar to the principle of HashMap), and finally generate a HashTable. The HashTable is cached in memory. If not, the HashTable is dumped to disk.
  • Match: After the Hash Table is generated, scan the Probe Table(Order) data in sequence and use the same Hash function (in Spark, the same partitioner is actually used) to search for the same Hash (join key) value in the Hash Table. If the match is successful, join the two together.

Broadcast Hash Join

Broadcast Hash Join is used when the Join table is small.

Conditions for a Broadcast Hash Join are as follows:

  • Is broadcast table needs to be less than the spark. SQL. AutoBroadcastJoinThreshold configuration information, the default is 10 m;
  • The base table cannot be broadcast; for example, when left outer JOIN, only the right table can be broadcast.

Broadcast Hash Join can be divided into two steps:

  • Broadcast stage: Broadcast a small table to all executors. There are many algorithms for broadcast. The simplest one is to send the table to the driver, and then the driver distributes the table to all executors in a unified manner.
  • Hash Join phase: Hash join is performed on each executor, the small table is created as a Hash table, and the partitioned data of the large table matches the data in the Hash table.
Sort Merge Join

When both tables are very large, SparkSQL uses a completely new scheme to Join the tables, called Sort Merge Join. This method does not need to load all the data on one side before hash join, but needs to sort the data before join.

First, reshuffle the two tables according to join keys to ensure that records with the same Join key value are divided into corresponding partitions. After partitioning, sort the data in each partition, and then join the records in the corresponding partition. Sort Merge Join does not load all the data on one side into memory, regardless of the size of the partition. Because both sequences have ordered, go through from the beginning, encounter the same key output, if different, small left continue to take the left, vice versa. Thus, the stability of SQL Join under large data volume is greatly improved.

The whole process is divided into three steps:

  • Shuffle phase: The two large tables are repartitioned based on join keys. The data of the two tables is distributed to the whole cluster for distributed parallel processing
  • Sort stage: Sort the two table data of a single partitioned node
  • Merge phase: Join the sorted data of the two partitioned tables. The join operation is very simple. The two ordered sequences are iterated separately. If they encounter the same join key, merge the output.

After the above analysis, it is obvious that the cost relation of these kinds of Joins can be obtained: Broadcast Hash Join (cost)< cost(Shuffle Hash Join)< cost(Sort Merge Join). When designing a data warehouse, you are advised to avoid Join query of large tables. SparkSQL can also according to memory, bandwidth resources moderate the spark parameters. SQL. AutoBroadcastJoinThreshold big, let more join actual execution for Broadcast Hash join.

Statement: all articles in this number are original, except for special notes, public readers have the right to read first, shall not be reproduced without the permission of the author, or tort liability.

Pay attention to my public number, background reply [JAVAPDF] get 200 pages of questions! 50000 people pay attention to the big data into the way of god, don’t you want to know? Fifty thousand people pay attention to the big data into the road of god, really not to understand it? Fifty thousand people pay attention to the big data into the way of god, sure really not to understand it?

Welcome your attentionBig Data as the Road to God