Hello, I’m Bella Sauce

I'm a girl who mainly writes Java and SQL. This week I got to build a feature of my own choosing, and the one I picked checks a set of data metrics to verify that their values are correct; in other words, it monitors data quality.

Background

These metrics are generated by SQL in MaxCompute (yes, I'm monitoring my own work), and they are all stored as JSON in a single field. Why put every metric into one field instead of one column each? First, there are simply too many of them: dozens. Second, the set of metrics is not fixed; it is generated dynamically from configuration. If the metrics were stored as columns, the columns would have to be generated dynamically too. Anyone who has worked with MongoDB knows that it stores data as documents, and a major advantage of the document model is that the fields are not fixed: whatever you throw into a document, it stores, which is far more flexible than a fixed set of columns. So, given the nature of my metrics, I threw everything I generated dynamically into a single field, and to make the metrics easy to read from Java, I serialized them and their values into JSON.

All metrics live as JSON in one field (let's say the field is called index_Value). Cool! It really was cool!

But today it stopped being cool, very much so. Why? Because now I need to deserialize these metrics and run all kinds of math on them.

The solution

OK, with that background, let's get back to the metric-auditing feature. My first instinct for the implementation was to write a UDF and have MaxCompute SQL call it: the UDF's input parameter would be index_Value (the field holding all the metrics), and its output would be the results of the various mathematical operations performed after deserialization. The moment this scheme came to mind, the corners of my mouth turned up involuntarily: this implementation is dead simple, just Java deserialization plus ordinary arithmetic, anyone could do it. But then a little voice in my head spoke up: "Why not see whether SQL alone can solve this? Why call a UDF from SQL yet again? Why not try something new and get out of your comfort zone?"

Following the little voice's advice, I began exploring how to deserialize JSON strings in SQL. Putting in the effort paid off: my feature is now implemented and released. So let's take a look at how MaxCompute SQL operates on JSON strings.

Generate JSON data

Let's first look at how to generate JSON data, using both MaxCompute's built-in functions and UDFs.

MaxCompute officially provides a TO_JSON function to generate JSON-formatted strings, but the scenarios this function supports are quite limited. The command format is as follows:

to_json(expr)

expr is required, and only three types are supported: MAP, ARRAY, and STRUCT.

Let’s look at a few examples.

MAP type

1. Keys and values must appear in pairs; otherwise an error is reported.

2. Keys in the generated JSON keep exactly the same case as the keys in the map; no automatic case conversion is applied.

3. Key-value pairs whose value is NULL are still output normally.

SELECT to_json(map('BellaSauce_map', 100, 'MySQL', 100, 'Java', 99, 'Redis', 98, 'geography', 60, 'Flink', CAST(NULL AS STRING)));
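Based on the three rules above, the output should look roughly like this (key order follows the map):

{"BellaSauce_map":100,"MySQL":100,"Java":99,"Redis":98,"geography":60,"Flink":null}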

STRUCT type

1. Keys and values must appear in pairs; otherwise an error is reported.

2. All keys in the generated JSON are converted to lower case.

3. Key-value pairs whose value is NULL are automatically filtered out.

SELECT to_json(NAMED_STRUCT('BellaSauce_named_struct', 100, 'ES', 99, 'HBase', 98, 'Java', CAST(NULL AS STRING)));
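Consistent with the rules above, all keys come out lowercased and the NULL-valued Java pair is dropped, so the output should look roughly like:

{"bellasauce_named_struct":100,"es":99,"hbase":98}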

ARRAY type

1. Generates data in JSON array format.

SELECT to_json(ARRAY(map('BellaSauce_array_map_1', 100, 'ES', 90, 'Java', 60), map('BellaSauce_array_map_2', 90, 'C', 80)));
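The result should be a JSON array, roughly:

[{"BellaSauce_array_map_1":100,"ES":90,"Java":60},{"BellaSauce_array_map_2":90,"C":80}]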

UDF

In addition to the three forms above, you can also generate JSON with a UDF. That is what I did, because I needed to pivot multiple rows into a single column whose data format is JSON.

-- 1. Create the table
DROP TABLE IF EXISTS student_score;
CREATE TABLE IF NOT EXISTS student_score (
    id            BIGINT COMMENT 'id',
    student_no    BIGINT COMMENT 'student number',
    student_name  STRING COMMENT 'name',
    suject        STRING COMMENT 'subject',
    score         BIGINT COMMENT 'score'
);

-- 2. Prepare the data
INSERT OVERWRITE TABLE student_score VALUES
(1,  2021073101, 'Bella Sauce',   'MySQL',     100),
(2,  2021073101, 'Bella Sauce',   'Java',      99),
(3,  2021073101, 'Bella Sauce',   'Redis',     98),
(4,  2021073101, 'Bella Sauce',   'HBase',     97),
(5,  2021073101, 'Bella Sauce',   'geography', 60),
(6,  2021073102, 'Trafalgar Law', 'MySQL',     100),
(7,  2021073102, 'Trafalgar Law', 'Java',      100),
(8,  2021073102, 'Trafalgar Law', 'Redis',     100),
(9,  2021073102, 'Trafalgar Law', 'HBase',     100),
(10, 2021073102, 'Trafalgar Law', 'geography', 100),
(11, 2021073103, 'Tholon',        'MySQL',     95),
(12, 2021073103, 'Tholon',        'Java',      94),
(13, 2021073103, 'Tholon',        'Redis',     93),
(14, 2021073103, 'Tholon',        'HBase',     98),
(15, 2021073103, 'Tholon',        'geography', 20);

-- 3. Generate the JSON data
DROP TABLE IF EXISTS student_score_json;
CREATE TABLE IF NOT EXISTS student_score_json AS
SELECT  MAX(id) AS id
        ,student_no
        ,MAX(student_name) AS student_name
        ,GENERATEJSONSTRING(
            WM_CONCAT(',', suject_score)
            ,','
            ,'='
        ) AS suject_score
FROM    (
            SELECT  id
                    ,student_no
                    ,student_name
                    ,suject
                    ,score
                    ,CONCAT_WS('=', suject, score) AS suject_score
            FROM    student_score
        ) a
GROUP BY student_no
;

The UDF code is as follows:

import com.aliyun.odps.udf.UDF;
import com.google.gson.Gson;
import org.apache.commons.lang3.StringUtils;

import java.util.HashMap;
import java.util.Map;

/**
 * Turns a flat "k=v,k=v" style string into a JSON string.
 *
 * @author Bella Sauce
 * @date 2021/08/01
 */
public class GenerateJsonString extends UDF {

    public String evaluate(String source, String delimiter, String joiner) {
        Map<String, String> map = transferStr2Map(source, delimiter, joiner);
        return new Gson().toJson(map);
    }

    /**
     * @param source    the source data, e.g. "MySQL=100,Java=99"
     * @param delimiter the separator between key-value pairs, e.g. ","
     * @param joiner    the separator between a key and its value, e.g. "="
     * @return the parsed key-value map
     */
    private static Map<String, String> transferStr2Map(String source, String delimiter, String joiner) {
        Map<String, String> map = new HashMap<>(128);
        if (StringUtils.isBlank(source)) {
            return map;
        }
        String[] sourceArray = source.split(delimiter);
        for (String item : sourceArray) {
            String[] itemArray = item.split(joiner);
            // Skip malformed items rather than risk an ArrayIndexOutOfBoundsException
            if (itemArray.length < 2) {
                continue;
            }
            map.put(itemArray[0], itemArray[1]);
        }
        return map;
    }
}
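As a quick sanity check, here is a minimal call of this UDF, assuming it has been registered in MaxCompute under the name GENERATEJSONSTRING as in the script above:

-- pair delimiter is ',' and the key-value joiner is '='
SELECT GENERATEJSONSTRING('MySQL=100,Java=99', ',', '=');
-- expected output, roughly: {"MySQL":"100","Java":"99"}
-- (the values are strings, since the UDF builds a Map<String, String>)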

The generated student_score_json table contains one row per student.

You can see that the suject_score field holds JSON-formatted data.
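For example, based on the rows inserted above, Bella Sauce's record should look roughly like this (the key order inside the JSON is not guaranteed, since the UDF builds a HashMap):

id=1, student_no=2021073101, student_name=Bella Sauce, suject_score={"MySQL":"100","Java":"99","Redis":"98","HBase":"97","geography":"60"}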

Next, using the student_score_json table as the example, let's look at how MaxCompute SQL can parse JSON data.

Parse JSON data to generate multiple columns

MaxCompute SQL provides two functions for parsing JSON strings: GET_JSON_OBJECT and JSON_TUPLE.

GET_JSON_OBJECT

Let’s look at GET_JSON_OBJECT first. The command format is as follows:

string get_json_object(string json, string path)

As the command format shows, each call to GET_JSON_OBJECT can extract only one field from the JSON string. If the JSON string contains N fields, GET_JSON_OBJECT has to be called N times, and the JSON string gets read N times as well. With large data volumes (exactly what MaxCompute deals with), this behavior is very bad: it multiplies the processing work by N and hurts performance.

Note that in path, $ denotes the root node and . denotes a child node, so the Java field in suject_score is read as $.Java.

GET_JSON_OBJECT returns NULL when the value at path does not exist.

SELECT  id
        ,student_no
        ,student_name
        ,GET_JSON_OBJECT(s.suject_score, "$.Java") AS Java
        ,GET_JSON_OBJECT(s.suject_score, "$.geography") AS geography
        ,GET_JSON_OBJECT(s.suject_score, "$.MySQL") AS MySQL
        ,GET_JSON_OBJECT(s.suject_score, "$.Redis") AS Redis
        ,GET_JSON_OBJECT(s.suject_score, "$.HBase") AS HBase
FROM    student_score_json s
;
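Run against the sample data above, the query should produce roughly:

id  student_no  student_name   Java  geography  MySQL  Redis  HBase
1   2021073101  Bella Sauce    99    60         100    98     97
6   2021073102  Trafalgar Law  100   100        100    100    100
11  2021073103  Tholon         94    20         95     93     98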

What if you want to read a JSON array? Of course you can.

1. key[*] reads every element in the array.

2. key[index] reads the JSON string stored at that index of the array. To read a value inside that JSON string, key[index].field is enough.

Let's look at some examples. First, data preparation.

DROP TABLE IF EXISTS tmp_score_array_demo;
CREATE TABLE IF NOT EXISTS tmp_score_array_demo (
    score_array STRING COMMENT 'score JSON array'
);
INSERT OVERWRITE TABLE tmp_score_array_demo VALUES
(to_json(MAP('scores', ARRAY(MAP('array_map_1', 80, 'Flink', 70, 'Redis', 60), MAP('array_map_2', 90, 'ES', 70, 'Redis', 60)))));

The data prepared by the above script looks like this.
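Working through TO_JSON by hand, the single row's score_array value should be:

{"scores":[{"array_map_1":80,"Flink":70,"Redis":60},{"array_map_2":90,"ES":70,"Redis":60}]}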

1. Read the scores array.

SELECT  GET_JSON_OBJECT(tmp_score_array_demo.score_array, '$.scores[*]')
FROM    tmp_score_array_demo
;
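This should return the entire array:

[{"array_map_1":80,"Flink":70,"Redis":60},{"array_map_2":90,"ES":70,"Redis":60}]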

2. Read the value of the first element in the scores array.
SELECT  GET_JSON_OBJECT(tmp_score_array_demo.score_array, '$.scores[0]')
FROM    tmp_score_array_demo
;
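This should return the first element only:

{"array_map_1":80,"Flink":70,"Redis":60}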

3. Read the value of Flink in the first element of the scores array.
SELECT  GET_JSON_OBJECT(tmp_score_array_demo.score_array, '$.scores[0].Flink')
FROM    tmp_score_array_demo
;
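This should return just the Flink value from the first element: 70.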

JSON_TUPLE

Now let's look at JSON_TUPLE. The command format is as follows:

string json_tuple(string json, string key1, string key2, string key3...)

As the command format shows, even when we want to read multiple keys from the JSON, the JSON data only needs to be read once. This is exactly what I had been looking for! Excited, I immediately gave it a try.

SELECT  JSON_TUPLE(
            student_score_json.suject_score
            ,"Java"
            ,"geography"
            ,"MySQL"
            ,"Redis"
            ,"HBase"
        )
FROM    student_score_json
;
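For Bella Sauce's row, for example, the output should be roughly the five values 99, 60, 100, 98, 97, produced in a single pass over the JSON.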

It works, and it works really well, but there are two problems.

1. The result contains only scores and no student information, so I can't tell which scores belong to which student.

2. All the column names are gone, so I can't tell which subject each column represents.

Fortunately, MaxCompute also provides the LATERAL VIEW syntax. Pairing JSON_TUPLE with LATERAL VIEW solves both problems perfectly.

SELECT  s.id AS id
        ,s.student_no AS student_no
        ,s.student_name AS student_name
        ,a.Java AS Java
        ,a.geography AS geography
        ,a.MySQL AS MySQL
        ,a.Redis AS Redis
        ,a.HBase AS HBase
FROM    student_score_json s
LATERAL VIEW JSON_TUPLE(s.suject_score, "Java","geography","MySQL","Redis","HBase") a AS Java, geography, MySQL, Redis, HBase
;
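The output should now carry both the student columns and one named column per subject, matching the GET_JSON_OBJECT result above, but with each JSON string parsed only once.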

And with that, the effect I wanted is finally here!

You can also wrap each column with NVL, e.g. NVL(a.Java, 0), to fall back to a default of 0 when the key is missing from the JSON, and then do whatever you want with these columns, as sketched below.
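A minimal sketch of that final query, assuming 0 is an acceptable default score:

SELECT  s.id AS id
        ,s.student_no AS student_no
        ,s.student_name AS student_name
        ,NVL(a.Java, 0) AS Java            -- 0 if the Java key is absent from the JSON
        ,NVL(a.geography, 0) AS geography
        ,NVL(a.MySQL, 0) AS MySQL
        ,NVL(a.Redis, 0) AS Redis
        ,NVL(a.HBase, 0) AS HBase
FROM    student_score_json s
LATERAL VIEW JSON_TUPLE(s.suject_score, "Java","geography","MySQL","Redis","HBase") a AS Java, geography, MySQL, Redis, HBase
;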

What if you want to read a JSON array? Of course you can do that too. Let's again take the tmp_score_array_demo table, with the same data shown earlier, as the example.

1. Read the value of the scores array.
SELECT json_tuple(tmp_score_array_demo.score_array, "scores[*]") FROM tmp_score_array_demo;

2. Read the value of the first element in the scores array.
SELECT json_tuple(tmp_score_array_demo.score_array, "scores[0]") FROM tmp_score_array_demo;

3. Read the value of Flink in the first element of the scores array.
SELECT  json_tuple(tmp_score_array_demo.score_array, "scores[0].Flink")
FROM    tmp_score_array_demo
;
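Assuming the same table state, the three queries above should return the same results as their GET_JSON_OBJECT counterparts earlier.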

OK, that covers how to generate JSON data and how to parse JSON data, including JSON arrays. That's all for today. See you next time!