(This reflects the author's personal understanding; corrections are welcome if anything is lacking.)
1. What is Spark SQL?
SQL: Briefly, SQL (Structured Query Language) is divided into DDL, DML, DCL, and DQL — definition, manipulation, control (permissions), and query. It is used to query structured data; put the other way, structured data (which is typically stored in a relational database) supports SQL operations.
Spark SQL: Spark with SQL support — that is, Spark's implementation of SQL execution.
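The four categories above can be shown in a minimal sketch using Python's built-in sqlite3 module. SQLite does not implement DCL (GRANT/REVOKE), so that category appears only as a comment; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL (definition): define the structure of the data.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# DML (manipulation): change the data itself.
cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
cur.execute("INSERT INTO users (name) VALUES (?)", ("bob",))
conn.commit()

# DQL (query): read the structured data back.
rows = cur.execute("SELECT name FROM users ORDER BY id").fetchall()
print(rows)  # [('alice',), ('bob',)]

# DCL (control/permissions) would look like:
#   GRANT SELECT ON users TO analyst;
# (not supported by SQLite, but standard in MySQL/PostgreSQL.)
conn.close()
```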
A table is simply an abstract description of structured data in files. As file systems developed, database systems emerged; among database systems there are relational databases, relational databases defined the schema, and from the schema came the concept of the table.
2. Start with Hive MapReduce
Note: MySQL is not the only option for persisting metadata.
Terms: CLI refers to the command-line interface (the interactive shell that Hive starts). Driver refers to a component in Hive that can be understood as driving the metastore. Metadata, in Hive's metadata management, can be regarded as the description of data and data structures; for example, everything MySQL stores about a table's definition is metadata.
SQL commands are parsed, optimized, and executed by Hive. During execution, Hive consults metadata management to look up the metadata behind the identifiers in the SQL; metadata management stores all metadata in MySQL. If metadata matching the SQL statement is found, Hive converts the SQL into a MapReduce program and requests resources from YARN. Once resources are granted, the MapReduce program is submitted for execution. The results can be stored in HBase or HDFS (HBase serves as a storage layer on top of HDFS), and the result is displayed in the CLI window. (This is early Hive SQL.)
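The flow above (CLI → parse → metastore lookup → compile to MapReduce) can be sketched as a toy, purely illustrative Python simulation. All names here are invented; this is not Hive's actual API, and the "job" is just a string standing in for a real MapReduce program.

```python
# The metastore: table metadata that real Hive would persist in MySQL.
METASTORE = {
    "logs": {"columns": ["ts", "level", "msg"], "location": "/warehouse/logs"},
}

def run_hive_sql(sql: str) -> str:
    # 1. "Parse": extract the table name from a trivial SELECT statement.
    table = sql.split("FROM")[1].strip().rstrip(";")
    # 2. Metadata lookup: fail if the table is not registered in the metastore.
    meta = METASTORE.get(table)
    if meta is None:
        raise ValueError(f"Table not found in metastore: {table}")
    # 3. "Compile" to a MapReduce job description and "submit" it
    #    (real Hive generates an actual MapReduce program and asks YARN for resources).
    return f"MR job over {meta['location']} reading {meta['columns']}"

print(run_hive_sql("SELECT * FROM logs"))
```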
3. In Hive operations, replace MapReduce with Spark
As MapReduce's problems became more and more apparent, Spark was introduced to address some of its shortcomings. In this setup, Spark naturally had to support Hive access: SQL was converted into Spark jobs for execution. However, Spark's data model is the RDD, so many of the SQL optimizations Hive performed could not be optimized further at the Spark execution level. Moreover, whenever Hive changed, Spark had to be adjusted accordingly, which exposed many problems.
4. Spark supports SQL phases
In the end, Spark SQL supports executing SQL directly as Spark programs, so SQL optimization can be carried further at the Spark level. For metadata management it still uses Hive's Metastore, accessed through a Catalog layer that decouples the two; the Catalog can also serve as temporary (in-memory) metadata management. At the bottom, Spark SQL invokes Spark Core APIs and native operators to execute the SQL as Spark jobs.
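The "further optimization" mentioned above happens on a logical plan before execution (in Spark SQL this is done by the Catalyst optimizer). Below is a toy sketch of one such rewrite, constant folding on an expression tree; it is purely illustrative and not Spark's actual implementation — the `Lit`/`Add`/`Col` node names are invented for this example.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Lit:
    """A literal constant in the plan."""
    value: int

@dataclass
class Col:
    """A reference to a table column."""
    name: str

@dataclass
class Add:
    """An addition expression over two sub-expressions."""
    left: "Expr"
    right: "Expr"

Expr = Union[Lit, Col, Add]

def fold_constants(e: Expr) -> Expr:
    """Rewrite Add(Lit, Lit) into a single Lit, bottom-up."""
    if isinstance(e, Add):
        left, right = fold_constants(e.left), fold_constants(e.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return e  # Lit and Col are already in simplest form.

# e.g. the projection `price + (1 + 2)` is simplified to `price + 3`
# once, at plan time, instead of computing 1 + 2 for every row.
plan = Add(Col("price"), Add(Lit(1), Lit(2)))
print(fold_constants(plan))
```

The same idea generalizes to predicate pushdown, column pruning, and the other rule-based rewrites an SQL optimizer applies before handing the plan to the execution engine.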