Now, with the support of various source and sink connectors of computing engine, ETL developers can focus all their energy on Extraction and translation. The focus, of course, is on writing SQL to extract data from different data sources for conversion operations. So writing a SELECT statement is roughly equal to all the work. Master SQL SELECT syntax, no doubt master the key to win the data test.

SELECT basic

The SELECT statement is used to query data from a database table. It allows you to manually specify which columns and rows in a table to retrieve. The returned data is called a result set and is displayed in the development tool or used by the application running the query. SELECT statements can be as simple as selecting columns from a small table. Or, they can be complex, with many columns and complex calculations. This article will look at the syntax differences and specific use cases of SELECT statements in MySQL, Flink and Spark.

The basic grammar

SELECT [ ALL | DISTINCT] {field list (*)} [INTO1 ]
FROM{data source} [WHEREExpression] [GROUP BYGrouping expression] [HAVINGBlock query expression] [ORDER BYSort expression]Copy the code

In the expression above

  • {} is mandatory, such as selection lists
  • | are optional, such as DISTINCT
  • [] is an optional parameter, such as WHERE

When evaluating a SELECT statement, the system evaluates the FROM clause first, and then the SELECT clause. The order of execution was made clear in the first article summary, which is mentioned here in passing.

MySQL, by default, is not case sensitive. So there is no real difference between a SELECT and a SELECT in MySQL, which is why I can use caps freely in this article. Such as

SELECT * from (SELECT 1 as id,'diga' as 'name' UNION SELECT 1 as `no`,'japan' as 'courtry') t
Copy the code

with

select * from (select 1 as id,'diga' as 'name' UNION select 1 as `no`,'japan' as 'courtry') t
Copy the code

This also works in Spark and Flink, so there is no special demo. However, to make the SQL more readable, it is recommended to use uppercase letters to describe SQL keywords and lowercase letters to write description fields or table names.

A simple query

select * from table
Copy the code

Select * from table; select * from table; When the table structure changes, the application may not work properly if you use an asterisk () and do not change the application code to work with the new table structure. In addition, such queries have a performance bottleneck, returning many useless fields. Clearly, using ” is not a good decision on either a functional or a performance level. The other is to get the data directly from the column name:

select column,column1,... from table
Copy the code

The data is retrieved from a list of field names.

PS: Relevant simple query cases can be returned to the previous two chapters for reference, and this article will not demonstrate.

Expression query

select 1+1
Copy the code

As shown in SQL above, some simple expression operations can also be performed in select statements. This is helpful in actual writing, as it preprocesses some of the logic. It is also worth noting that we can use arithmetic expressions in WHERE conditions, but not index fields, especially in production. Or you might have to get ready for the bursar office. More on this later when we introduce WHERE.

MySQL test

Spark Actual Test

spark-sql> select 1+1 ;2 Time taken: 0.225 seconds, Touch1 row(s)Copy the code

Flink actual test

Of course, SELECT also supports constant queries, which can be interpreted as inserting a fixed value into a temporary table. Such as

SELECT *,"cq_test" from (SELECT 1 as id,'diga' as `name` UNION SELECT 1 as `no`,'japan' as `courtry`) t
Copy the code

Mysql test

Spark Actual Test

spark-sql> SELECT *,"cq_test" from (SELECT 1 as id,'diga' as `name` UNION SELECT 1 as `no`,'japan' as `courtry`) t;1 Diga cq_test 1 Japan cq_test Time taken: 0.73 seconds, Touch2 row(s)Copy the code

Flink actual test

Why do we need to do this? A simple scenario would be helpful when using a UNION query to tag each result set based on which table it comes from. Of course, the method of using UNION will also be mentioned in the later articles, which will not be introduced in depth here.

Function queries

If you’ve seen the interview, MySQL has a very famous interview question – “tell the difference between selct count(*) and count(1).” If you don’t understand, there is an article on this topic that you can read carefully. The count() in here is a function.

It could be a simple function

select count() from...
Copy the code

It can also be a slightly more complex one, such as

select LENGTH(result) from.Copy the code

Or a date handler

select FROM_UNIXTIME(unix_timestamp)
Copy the code

And so on, there are hundreds of functions like this, I don’t want to do the extension of the use of specific functions, SQL functions can become very complex, but they are very powerful, for more information about how to use you can look up on the Internet. There may be inconsistencies between functions in different engines. For example, earlier versions of MySQL did not support functions such as from_json, while Flink and Spark are more capable of supporting user-defined functions.

The subquery

Finally, we can use subqueries in SELECT statements. This is a more advanced feature, and in general we rarely use subqueries directly in SELECT statements.

Adding a subquery to the SELECT clause means that the results of the subquery are displayed as column values, for example:

SELECT *, (SELECT `name` from (SELECT 'test' as `name`) t2) from (SELECT 1 as id,'diga' as `name` UNION SELECT 1 as `no`,'japan' as `courtry`) t
Copy the code

Haha, please forgive me for writing a very boring and meaningless query. Just for demonstration purposes. Because in fact, in all my years of work so far, I don’t really remember anyone writing SQL like this. Take a look at the actual MySQL test results

Spark Actual Test

spark-sql> SELECT *, (SELECT `name` from (SELECT 'test' as `name`) t2) from (SELECT 1 as id,'diga' as `name` UNION SELECT 1 as `no`,'japan' as `courtry`) t;
1	diga	test
1	japan	test
Time taken: 0.429 seconds, Fetched 2 row(s)
Copy the code

Flink actual test

Flink SQL>  SELECT *,(SELECT `name` from (SELECT 'test' as `name`)) from (SELECT 1 as id,'diga' as `name` UNION SELECT 1 as `no`,'japan' as `courtry`) t;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: SINGLE_VALUE aggregate function doesn't support type 'CHAR'.
Copy the code

Flink query failed. Detailed error log

org.apache.flink.table.client.gateway.SqlExecutionException: Could not execute SQL statement. at org.apache.flink.table.client.gateway.local.LocalExecutor.executeOperation(LocalExecutor.java:215) SQL - client_2. ~ [flink - 11-1.13.1. Jar: 1.13.1] the at org.apache.flink.table.client.gateway.local.LocalExecutor.executeQuery(LocalExecutor.java:235) ~ [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] at org. Apache. Flink. Table. The client. The cli. CliClient. CallSelect (CliClient. Java: 479) SQL - client_2. ~ [flink - 11-1.13.1. Jar: 1.13.1] the at Org. Apache. Flink. Table. Client. Cli. CliClient. CallOperation (CliClient. Java: 412) ~ [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1]  at org.apache.flink.table.client.cli.CliClient.lambda$executeStatement$0(CliClient.java:327) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] at Java. Util. Optional. The ifPresent (Optional. Java: 159) ~ [? : 1.8.0 comes with _271] the at org.apache.flink.table.client.cli.CliClient.executeStatement(CliClient.java:327) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] the at org.apache.flink.table.client.cli.CliClient.executeInteractive(CliClient.java:297) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] the at org.apache.flink.table.client.cli.CliClient.executeInInteractiveMode(CliClient.java:221) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] at org. Apache. The flink. Table. Client. SqlClient. OpenCli (SqlClient. Java: 151) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] at org. Apache. The flink. Table. Client. SqlClient. Start (SqlClient. Java: 95) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] at org. Apache. The flink. Table. Client. SqlClient. StartClient (SqlClient. Java: 187) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] at org. Apache. The flink. Table. Client. SqlClient. Main (SqlClient. Java: 161) [flink - SQL - client_2. 11-1.13.1. Jar: 1.13.1] under Caused by: org. Apache. The flink. Table. API. TableException: SQL optimization: Cannot generate a valid execution plan for the given query: FlinkLogicalSink(table=[default_catalog.default_database.Unregistered_Collect_Sink_20], fields=[id, name, $f0]) +- FlinkLogicalJoin(condition=[true], joinType=[left]) :- FlinkLogicalAggregate(group=[{0, 1}]) : +- FlinkLogicalUnion(all=[true]) : :- FlinkLogicalCalc(select=[1 AS id, _UTF-16LE'diga':VARCHAR(5) CHARACTER SET "UTF-16LE" AS name]) : : +- FlinkLogicalValues(tuples=[[{ 0 }]]) : +- FlinkLogicalCalc(select=[1 AS no, _UTF-16LE'japan':VARCHAR(5) CHARACTER SET "UTF-16LE" AS courtry]) : +- FlinkLogicalValues(tuples=[[{ 0 }]]) +- FlinkLogicalAggregate(group=[{}], agg#0=[SINGLE_VALUE($0)]) +- FlinkLogicalCalc(select=[_UTF-16LE'test' AS name]) +- FlinkLogicalValues(tuples=[[{ 0 }]]) SINGLE_VALUE aggregate function doesn't support type 'CHAR'.Copy the code

According to the error log, flink does not currently support subquery queries in SELECT due to the inability to generate execution plans.

Write in the last

In the process of building SQL capabilities, I expected flink, Spark, and mysql to be quite different. I didn’t expect it to happen so quickly, so obviously. This strengthened my intention to complete the whole series of articles. In order to reduce detours for the students who are in contact with big data, to get started quickly and identify the differences clearly, I will keep updating, hoping it will be helpful to you.