This is the fifth day of my participation in the August Wen Challenge.More challenges in August

Null value handling in SQL is often the most common exception handling in ETL. First of all, as data sources may come from different business domains, business development in each domain usually allows the existence of NULL value when defining table structure. One value is meaningful in business, and the other value is simply written into data as default value. This requires data development and testing students in the actual development and testing process, need to judge the business meaning of null value, or special processing.

What is a NULL value in SQL?

NULL represents missing or unknown values in the database. Business development usually allows the presence of null values when defining table structures. One value is the default value that has business significance, and even the default value set during table building. In addition, if the aggregate function is used in SQL, null values will be present in the result if the data is not matched. Such as

SELECT SUM(1) from DUAL WHERE 1 =2;
Copy the code

In actual development, null may directly affect subsequent calculations in the case of the above unsatisfactory conditions and may need to be handled accordingly. It is important to note that a null value is not part of any particular data type. It is a flexible data type that stands for nothing and can be placed in the column of any data type, be it string, int, BLOb, or CLOB.

To overcome this problem, we usually check in the process of ET data processing and filter or replace meaningless null values.

The following null value handling is often seen in ETL scripts.

select 
       ifnull(product_category, '') as product_category,
       ifnull(user_name_create, '') as user_name_create,
       ifnull(user_name_picking, '') as user_name_picking,
       ifnull(user_name_package, '') as user_name_package,
       ifnull(user_name_knight, '') as user_name_knight,
from table
Copy the code

NULL processing in SQL

function instructions
ISNULL() Helps us replace NULL values with the desired values.
IFNULL() Allows us to return the first value if the value is NULL, and the second value otherwise.
COALESCE() Helps us return the first non-null value in the argument.
NVL() Help to replace NULL values with the required values provided by the user.

In today’s SQL basically by the above four processing methods, let’s take a look at the most commonly used ifNULL

ifnull

Ifnull takes two arguments, the first of which is the judged field or expression, and the second is the value to be replaced. If the value of the first argument is not null, the value is returned. If the result is null, the second argument is returned.

Mysql is used in

mysql> SELECT IFNULL(`name`,'kong') as `name` from (select 1 as `id`,'test' as `name` UNION SELECT 2 as `id`,null as `name`) t;+ -- -- -- -- -- -- + | name + | | + -- -- -- -- -- - test | | kong | + -- -- -- -- -- - + 2 rows in the set (0.00 SEC)Copy the code

Here we can see that all values corresponding to NULL are automatically replaced with ‘kong’

The spark is used in

spark-sql> SELECT IFNULL(`name`,'kong') as `name` from (select 1 as `id`,'test' as `name` UNION SELECT 2 as `id`,null as `name`) t;Test Kong Time taken: 6.339 seconds, Touch2 row(s)Copy the code

Use in FLink Because flink cannot directly use the above statement to define the table, so by connecting the data table to display

Flink SQL> select * from testsuit;
+----+-------------+--------------------------------+
| op | testsuit_id |                     project_id |
+----+-------------+--------------------------------+
| +I |          72 |                             26 |
| +I |          73 |                             26 |
| +I |          74 |                             26 |
| +I |          75 |                             26 |
| +I |          76 |                             26 |
| +I |          77 |                             26 |
| +I |         116 |                             26 |
| +I |         119 |                             28 |
| +I |         127 |                              1 |
| +I |         129 |                              1 |
| +I |         130 |                           null |
+----+-------------+--------------------------------+
Received a total of 11 rows
Copy the code

Use ifNULL substitution to see the output

What’s going on? How come null is not filtered? Check the database carefully. It turns out that I wrote a null value manually during the replication. The null value is the string ‘NULL’ instead of the null value. After resetting

In this way, ifnull is used the same in Spark, Flink, and MySQL.

ISNULL

Isnull accepts a single parameter, which can be a field or an expression. If the parameter isnull, 1 (true) is returned. If the parameter is not null, 0(false) is returned. Specific differences see the following specific cases

Again, use the previous example to see the results

MySQL

spark

Subfunctions are not supported in Flink

Note that ISNULL may be supported differently in all three engines due to low scenarios. Flink does not support this function at all, whereas Spark returns a Boolean value and MySQL returns 0 and 1 to distinguish the results.

COALESCE

Coalesce can accept multiple parameters and is used to return the first non-empty expression of its parameters. If all parameters are NULL, NULL is returned by COALESCE. At this point, we need to modify our test statement slightly

SELECT coalesce(`name`,null.'3') as `name` from (select 1 as `id`,'test' as `name` UNION SELECT 2 as `id`,null as `name`) t;
Copy the code

The result is as follows

As you can see, null values are continuously filtered and the value of 3 is returned.

The spark is used in

Flink used in

All parameters in coalesce are null

NVL

NVL and IFNULL function types also take two arguments, returning the first argument when the first value is non-null, and the second expression when the first value is null

MySQL does not support NVL functions

The spark to use

This function is also not supported in Flink

Conclusion:

This paper mainly introduces the occurrence of NULL value in the existing ETL is inevitable, and how to deal with it carefully becomes the focus of subsequent cleaning work. In addition, functions such as IFNULL(), ISNULL(), COALESCE(), and NVL() are discussed, which operate on NULL values. With examples of each function, we have seen usage and syntax and the different support levels for different hospitality.

  1. Ifnull is supported by all three engines and has the same effect.
  2. Isnull The return types of MySQL and Spark are different, but Flink does not support isNULL.
  3. Coalesce The three engines support the same information.
  4. NVL is an Oracle dialect and is supported only by Spark among the three engines.