Hive SQL usage is abnormal

Hive is a Hadoop-based data warehouse tool used for data ETL, which is a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive maps structured data files to a database table and provides SQL query functions. Hive SQL is an SQL language that differs slightly from the SQL syntax supported by relational databases. This article compares the SQL syntax supported by MySQL and Hive, and finds that the output of the same SQL statement is different in Hive and MySQL.

Two integer divisions

Division is an arithmetic operation that is difficult for SQL engines to interpret. If you add, subtract, or multiply two integers, you always get an integer. It is worth noting that different SQL query engines produce different results when dividing two integers. In Hive and MySQL, arithmetic divides two integers, and the output is of type Decimal.

- Query in Hive
select 10/3       -- Output: 3.333333333335
Select * from MySQL
select 10/3       -- Output: 3.3333
Copy the code

If you use the following method, an integer type is returned

- Query in Hive
select 10 div 3       -- Output: 3
Select * from MySQL
select 10 div 3       -- Output: 3
Copy the code

Case sensitive

When we compare two strings, different SQL engines produce different results. Note that Apache Hive is case sensitive in string comparisons, as shown in the following example.

- Query in Hive
select 'Bigdata' = 'bigdata'   False - output
Select * from MySQL
select 'Bigdata' = 'bigdata'  Output 1 -
Copy the code

The same SQL statement will return 1 if MySQL is used, because MySQL is case insensitive when comparing strings. This means that MySQL will interpret two strings as the same string as long as they have the same letter.

Let’s look at another phenomenon, what happens when we write table names in uppercase?

Depending on the SQL engine you are using, you need to be case-sensitive when referring to tables in the database. If you use Hive, you don’t need to worry about case when referencing tables because they always convert letters to lowercase. MySQL MySQL MySQL 1146-table ‘XX’ doesn’t exist

MySQL > create table test
- Query information in Hive
select * from Test   -- Normal output result
Select * from MySQL
select * from Test   -- error: 1146-table 'Test' doesn't exist
Copy the code

Use aliases in GROUP BY

Assume the following query:

-- Use the alias to query information in Hive
Compiling statement: FAILED: SemanticException [Error 10004]: Error while compiling statement: FAILED: SemanticException [Error 10004]: line 7:9 Invalid table alias or column reference 'inventory_status': (possible column names are: userid, visitdate, visitcount)
SELECT CASE
           WHEN visicount > 5 THEN "more than 5"
           ELSE "less than 5"
       END AS inventory_status,
       count(*) AS cnt
FROM test
GROUP BY inventory_status

-- Do not use an alias. If the following statement is used, the output will be normal
SELECT CASE
           WHEN visitcount > 5 THEN "more than 5"
           ELSE "less than 5"
       END AS inventory_status,
       count(*) AS cnt
FROM test
GROUP BY CASE
           WHEN visitcount > 5 THEN "more than 5"
           ELSE "less than 5"
       END
Copy the code

If the same query statement is queried in MySQL, the output is normal.

A non-numeric string is converted to a numeric type

With SQL, we can use the CAST command to convert the data types of columns in a table. If you want to convert a string column to an integer, do the following.

SELECT CAST(column_name AS INT) FROM table_name
Copy the code

So what happens if we convert a non-numeric string to a numeric type?

- Query information in Hive
select cast("bigdata" as int) - returns null
Select * from MySQL
select cast("bigdata" as signed int)  - return 0
Copy the code

Views and SQL query statements in Hive

When a view is created in Hive, the SQL statement corresponding to the view is stored in a field in a table. In Hive, the metadata contains the following table:

CREATE TABLE `TBLS` (
  `TBL_ID` bigint(20) NOT NULL,
  `CREATE_TIME` int(11) NOT NULL,
  `DB_ID` bigint(20) DEFAULT NULL,
  `LAST_ACCESS_TIME` int(11) NOT NULL,
  `OWNER` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `OWNER_TYPE` varchar(10) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `RETENTION` int(11) NOT NULL,
  `SD_ID` bigint(20) DEFAULT NULL,
  `TBL_NAME` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `TBL_TYPE` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `VIEW_EXPANDED_TEXT` mediumtext CHARACTER SET utf8,
  `VIEW_ORIGINAL_TEXT` mediumtext CHARACTER SET utf8,
  `LINK_TARGET_ID` bigint(20) DEFAULT NULL.PRIMARY KEY (`TBL_ID`),
  UNIQUE KEY `UNIQUETABLE` (`TBL_NAME`,`DB_ID`),
  KEY `TBLS_N50` (`SD_ID`),
  KEY `TBLS_N49` (`DB_ID`),
  KEY `TBLS_N51` (`LINK_TARGET_ID`),
  CONSTRAINT `TBLS_FK1` FOREIGN KEY (`SD_ID`) REFERENCES `SDS` (`SD_ID`),
  CONSTRAINT `TBLS_FK2` FOREIGN KEY (`DB_ID`) REFERENCES `DBS` (`DB_ID`),
  CONSTRAINT `TBLS_FK3` FOREIGN KEY (`LINK_TARGET_ID`) REFERENCES `TBLS` (`TBL_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Copy the code

The previous table stores metadata information about tables and views in Hive. If a view is created, the VIEW_EXPANDED_TEXT and VIEW_ORIGINAL_TEXT fields store SQL statements corresponding to the view.

Different results may occur when we create a view or execute it directly using the following SQL statement:

create view as select * from test where name like"% big data"Copy the code

If the SQL statement is executed directly, the desired results are filtered based on the criteria. However, if you are creating a view, you may not get the desired results. As mentioned above, the SQL statement corresponding to a view is stored as a field in Hive metadata, corresponding to a table. SQL > select ** from ‘% big data’ where ‘% big data’ = ‘% big data’;

create view as select * from test where name like"?????"Copy the code

The above problem can be solved simply by changing the encoding of this field in the metadata:

ALTER TABLE `TBLS` MODIFY COLUMN VIEW_EXPANDED_TEXT mediumtext CHARACTER SETUtf8;ALTER TABLE `TBLS` MODIFY COLUMN VIEW_ORIGINAL_TEXT mediumtext CHARACTER SETUtf8;Copy the code

conclusion

This article shares some problems with Hive and provides corresponding examples. We can pay attention to these problems in the process of using the same SQL statement in MySQL and Apache Hive results.

Two integer divisions

Case sensitive

Use aliases in GROUP BY

A non-numeric string is converted to a numeric type

Views and SQL query statements in Hive

conclusion

Related Posts

ICME VQA Grand Challenge award-winning work sharing

Finally, someone has spelled out the metrics of machine learning

[Future Cloud – Business Monitoring] Real-time large-screen technology solution