In Hive, a lot of data is stored in Json format. For example, when a developer burrows a page on an APP, multiple fields are stored in a Json array. Therefore, the data platform must parse the buried data when calling the data. Let’s take a look at how JSON data is parsed in Hive.

This article is published on the official account of “Five Minutes to Learn Big Data”.

Hive provides the JSON parsing function

1. get_json_object

  • Syntax: get_json_object(json_string, ‘$.key’)

  • Parse the json string json_string and return the content specified by path. If the input json string is invalid, then NULL is returned. This function can only return one item at a time.

  • Example:

select 
get_json_object('{"name":"zhangsan","age":18}'.'$.name'); 
Copy the code
  • Results:
name
zhangsan

If you want to resolve both the name and age fields, you can write:

select 
get_json_object('{"name":"zhangsan","age":18}'.'$.name'),
get_json_object('{"name":"zhangsan","age":18}'.'$.age');
Copy the code

But if you have a lot of fields to parse, it’s too cumbersome to do this, so there’s the json_tuple function.

2. json_tuple

  • Syntax: json_tuple(json_string, k1, k2…)

  • Note: Parse json string json_string, you can specify multiple KEY in JSON data, return the corresponding value. If the input json string is invalid, then NULL is returned.

  • Example:

select 
b.name
,b.age
from tableName a lateral view
json_tuple('{"name":"zhangsan","age":18}'.'name'.'age') b as name,age;
Copy the code
  • Results:
name age
zhangsan 18

Note: there is no $in the json_tuple function above.

If you add $. To the json_tuple function, the resolution fails:

select 
b.name
,b.age
from tableName a lateral view
json_tuple('{"name":"zhangsan","age":18}'.'$.name'.'$.age') b as name,age;
Copy the code

Results:

name age
NULL NULL

The json_tuple function does not need to add $. Otherwise, it will fail to resolve.

Summary: The advantage of json_tuple equivalent to get_jSON_object is that you can parse multiple JSON fields at once. But if we have a JSON array, neither of these functions can handle it.

Hive parses a JSON array

Nested subqueries parse json arrays

If there is a Hive table, the contents of the json_str field in the table are as follows:

json_str
[{” website “:” baidu.com “, “name” : “baidu”}, {” website “:” google.com “, “name” : “Google”}]

We want to parse this field into the following structure:

website name
baidu.com baidu
google.com Google

To parse this JSON array, you need to use the following functions:

Explodes function

  • Syntax: Explode (Array OR Map)

  • Explode () takes an array or map as input, and separates the elements of an array or map into rows. This function is also known as the “column to row” function.

  • Example:

Analytical array -
hive> select explode(array('A'.'B'.'C'));
OK
A
B
C
Analytical map -
hive> select explode(map('A'.10.'B'.20.'C'.30));
OK
A       10
B       20
C       30
Copy the code

Function regexp_replace

  • Regexp_replace (string A, string B, string C)

  • Note: Replace the part of string A that conforms to Java regular expression B with C. Note that escape characters are used in some cases, similar to the regexp_replace function in Oracle.

  • Example:

hive> select regexp_replace('foobar'.'oo|ar'.' '); 
OK
fb
Copy the code

The above example replaces oo or AR in the string with ”.


With these functions in place, let’s parse the contents of the json_str field:

    1. First parse the elements in the JSON array and convert them to each line:
hive> SELECT explode(split(regexp_replace(regexp_replace('[{" website ":" baidu.com ", "name" : "baidu"}, {" website ":" google.com ", "name" : "Google"}]'.'\ \ [| \ \]'.' '),'\ \} \ \ and \ \ {'.'\ \} \ \; \ \ {'),'\ \; ')); OK {" website ":" baidu.com ", "name" : "baidu"} {" website ":" google.com ", "name" : "Google"}Copy the code

A brief explanation of the above SQL:

SELECT explode(split(
    regexp_replace(
        regexp_replace(
            '[{" website ":" baidu.com ", "name" : "baidu"}, {" website ":" google.com ", "name" : "Google"}]'.'\ \ [| \ \]' , ' '), remove the brackets around the JSON array'\ \} \ \ and \ \ {' , '\ \} \ \; \ \ {'), replace commas with semicolons between the elements of the JSON array'\ \; ') is separated by a semicolon (the split function is separated by a semicolon);Copy the code

Why replace commas with semicolons between the elements of a JSON array?

Since the elements are separated by commas, if we do not replace the commas between the elements, we will also split the elements with the split function, which is not what we want.

    1. (son_tuple) (son_tuple) (son_tuple) (son_tuple) (son_tuple)
select 
json_tuple(explode(split(
regexp_replace(regexp_replace('[{" website ":" baidu.com ", "name" : "baidu"}, {" website ":" google.com ", "name" : "Google"}]'.'\ \ [| \ \]'.' '),'\ \} \ \ and \ \ {'.'\ \} \ \; \ \ {'),'\ \; ')),'website'.'name');Copy the code

FAILED: SemanticException [Error 10081]: FAILED: SemanticException [Error 10081]: UDTF’s are not supported outside the SELECT clause, nor nested in expressions

If you want to write UDTF inside of another function, you can’t write explode inside of json_tuple.

If explode cannot be used in another json_tuple, we can use a subquery like this. If explode cannot be used in another json_tuple, we can use a subquery like this.

select json_tuple(json, 'website'.'name') 
from (
select explode(split(regexp_replace(regexp_replace('[{" website ":" baidu.com ", "name" : "baidu"}, {" website ":" google.com ", "name" : "Google"}]'.'\ \ [| \ \]'.' '),'\ \} \ \ and \ \ {'.'\ \} \ \; \ \ {'),'\ \; ')) 
as json) t;
Copy the code

No error is reported. The execution result is as follows:

www.baidu.com Baidu google.com GoogleCopy the code

Use lateral View to parse JSON arrays

The goods_id and json_str fields in the Hive table are as follows:

goods_id json_str
1, 2, 3 [{” source “, “seven fresh”, “monthSales:” 4900, “userCount” : 1900, “score” : “9.9”}, {” source “:” jd “and” monthSales: “2090,” userCount “: 78981 , “score” : “9.8”}, {” source “:” jdmart “, “monthSales:” 6987, “userCount” : 1600, “score” : “9.0”}]

Destination: Resolve monthSales in the goods_id field and jSON_STR field.

Let’s start parsing:

  1. Split the goods_id field and convert the JSON array to multiple JSON strings:
select 
explode(split(goods_id,', ')) as good_id,
explode(split(regexp_replace(regexp_replace(json_str , '\ \ [| \ \]'.' '),'\ \} \ \ and \ \ {'.'\ \} \ \; \ \ {'),'\ \; ')) 
as sale_info 
from tableName;
Copy the code

Error: FAILED: SemanticException 3:0 Only a single expression in the SELECT clause is supported with UDTF’s. Error encountered near token ‘sale_info’

SELECT supports only one field when using UDTF. The select statement has two fields, so an error is reported.

To solve this problem, we need to introduce another hive syntax:

lateral view

Lateral View is used with split, explode, and other UDTFs to split a row of data into multiple rows. In this way, the split view can be used to divide a row of data into multiple rows. The Lateral View combines the results to produce a virtual table that supports different tables.

  • Example:

Let’s say we have a hobbies_table with two columns. The first column is name, and the second column is id_list of the user’s interests, which is an array containing the id values of the interests:

name id_list
zhangsan [1, 2, 3]
lisi [three, four, five]

We want to count the number of occurrences of all interest ids among all users:

  1. Parse the interest ID:
SELECT name, hobby_id 
FROM hobbies_table 
LATERAL VIEW explode(id_list) tmp_table AS hobby_id;
Copy the code

SQL execution result:

name hobby_id
zhangsan 1
zhangsan 2
zhangsan 3
lisi 3
lisi 4
lisi 5

After splitting the ID_list, it is easy to calculate the number of times it appears in the user, grouping and aggregative according to hobby_id.

SELECT hobby_id ,count(name) client_num
FROM hobbies_table 
LATERAL VIEW explode(id_list) tmp_table AS hobby_id
group by hobby_id;
Copy the code

Results:

hobby_id client_num
1 1
2 1
3 2
4 1
5 1

When using the UDTF, SELECT only supports one field:

select good_id,get_json_object(sale_json,'$.monthSales') as monthSales
from tableName 
LATERAL VIEW explode(split(goods_id,', '))goods as good_id 
LATERAL VIEW explode(split(regexp_replace(regexp_replace(json_str , '\ \ [| \ \]'.' '),'\ \} \ \ and \ \ {'.'\ \} \ \; \ \ {'),'\ \; ')) sales as sale_json;
Copy the code

Note: The above statement is the result of the Cartesian product of the three tables, so this approach works if the amount of data is not very large.

The execution result of the preceding statement is as follows:

goods_id monthSales
1 4900
1 2090
1 6987
2 4900
2 2090
2 6987
3 4900
3 2090
3 6987

If there are other fields in the table, we can filter the data that matches the result based on the other fields.

Summary: Lateral View usually appears together with UDTF, in order to solve the PROBLEM that UDTF does not allow multiple fields in select.


The public account [five minutes to Learn Big Data], the original technology number in the field of big data, focuses on the research of big data technology