The data type

When creating a table, we need to specify the data type when creating a new column. Generally, in a data file, we can specify all the data as string, which is a string type, and then use the function to convert the data type to other data types, such as date, value, etc.

CREATE  TABLE [IF NOT EXISTS] [db_name.]table_name    
  ``[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type
Copy the code

Hive data types can be divided into five types, but the five types can be divided into two types. The first type is primitive data type, or basic data type. The other type is complex data structure, which we will learn in the next section

Numeric types

type Occupied bytes Store the scope The default type example note
TINYINT 1 byte - 128. to 127 The INT type 0 Signed integer
SMALLINT 2 – - 32768. to 32767 The INT type 0 Signed integer
INT 4 bytes - 2147483648. to 2147483647 The INT type 0 Signed integer
BIGINT 8 bytes - 9223372036854775808. to 9223372036854775807 The INT type 0 Signed integer
FLOAT 4 bytes Type DOUBLE Signed single-precision floating point number
DOUBLE 8 bytes Type DOUBLE Signed double – precision floating – point number
DECIMAL An exact numeric string with decimals is available in Hive 0.11.0
NUMERIC As with DECIMAL, Hive version 3.0.0 is available

It is important to note that all of these data types are implementations of interfaces in Java, so the behavior details of these types are exactly the same as their Java counterparts. For example, STRING implements STRING in Java, FLOAT implements FLOAT in Java, and so on

The integer

INTEGER types include TINYINT, SMALLINT, INT/INTEGER, BIGINT. The default type is INT. In other words, when you store INTEGER numbers in a Hive table, the data type is INT, not TINYINT, SMALLINT, INT/INTEGER, or BIGINT. Examples are TINYINT and BIGINT

Only if the number you store exceeds the limit of the INT type can it take your BIGINT, or if you specify the data type when you store the data

Type Postfix Example
TINYINT Y 100Y
SMALLINT S 100S
BIGINT L 100L

floating-point

A double can hold a double value, and float is a single value.

We don’t use much of either type in Hive because it loses precision

A DECIMAL and NUMERIC

In an enterprise where we specify the number of digits and the precision of a value, we need to use the DECIMAL data type, which can specify any precision and number of digits for easy calculation. The usage is DECIMAL(precision, scale), where precision is precision and specifies how many significant digits there are, and scale is the number of DECIMAL digits after the DECIMAL point, such as DECIMAL(10,2) where there are 10 significant digits and 2 DECIMAL places.

Hive-0.11.0 and hive-0.12.0 fix the precision of DECIMAL types and limit them to 38 digits, starting with Hive-0.13.0 you can specify the size and precision of DECIMAL, The DECIMAL(Precision,scale) syntax can be used when creating tables using the DECIMAL type. DECIMAL(9,8) represents up to 9 digits, with the last 8 digits being decimals. This means that at most one digit is in front of the decimal point, and more than one digit is null. If no argument is specified, the default is DECIMAL(10,0), that is, there are no DECIMAL places, in which case 0.82 becomes 1.

The DECIMAL type provides precise values and a wider range for floating-point numbers than the DOUBLE type, which stores an exact representation of a number, and the DOUBLE type stores very close approximations. You can use DECIMAL when approximations of DOUBLE are insufficiently precise, as in financial applications, equal and unequal checks, and rounding operations, when values are outside the range of DOUBLE (< -10308 or > 10308) or very close to 0 (-10308> <… < 10^308), the type DECIMAL can also be used.

Hive’s DECIMAL type relies on Java’s BIGDECIMAL, which in Java means an immutable arbitrary precision DECIMAL number. Many operations on other numeric types are also applicable to DECIMAL. Persistence of DECIMAL supports both scientific and non-scientific counting, meaning that if you store data files in formats such as 4.004E3 and 4004, it is legal for BIGDECIMAL.

Time to type

type format The scope of The name of the The introduction of version instructions
DATE yyyy-MM-dd 0000.01.01 ~ 9999.12.31 string Hive 0.12.0 It is mainly used to indicate the year, month and day
IMESTAMP yyyy-MM-dd HH:mm:ss.fffffffff is Indefinite length string Introduced in Hive 0.8.0 The decimal is followed by nine digits, which means that the accuracy is nanosecond
CHAR Maximum number of characters: 255 no Fixed length string Introduced in Hive 0.13.0 ‘abc’ or “abc”

HIVE provides the DATE type description year month day in the format yyyY-MM-DD. The value ranges from 0000.01.01 to 9999.12.31.

HIVE provides TIMESTAMP type description detailed time type with time zone information

The date type in Hive is not commonly used. Data of the date type is usually saved as string, which saves the trouble of converting the date format. Generally, the string type is string. VARCHAR and CHAR are not recommended.

We also mentioned above that it is not recommended to use the time type but the string type. However, if an existing table has a time type, you can use the following conversion method to convert it

Valid casts to/from Date type Result
cast(date as date) Same date value
cast(timestamp as date) Turn out the date timestamp
cast(string as date) If the string is in ‘YYYY-MM-DD’ format, you can also convert it to date, or return NULL if the string is not ‘YYYY-MM-DD’
cast(date as timestamp) The date type is converted to TIMESTAMP and the time zone information is the current time zone
cast(date as string) The date type is converted to a string in ‘YYYY-MM-DD’ format

Character types

type The length of the range Whether trailing whitespace affects the comparison The name of the The introduction of version example
STRING Unlimited length is string From the beginning ‘abc’ or “abc”
VARCHAR The value ranges from 1 to 65535 characters is Indefinite length string Hive 0.12.0 ‘abc’ or “abc”
CHAR Maximum number of characters: 255 no Fixed length string Introduced in Hive 0.13.0 ‘abc’ or “abc”

STRING types in Hive can be expressed in single or double quotation marks, “” or” “

The VARCHAR type can be thought of as a length limited version of STRING. It can only store characters between 1 and 65535. If you convert a STRING to a VARCHAR type, it will be truncated if the length exceeds the limit. Affects the result of comparing strings

A non-generic UDF cannot directly use a VARCHAR type as an input parameter or return value. You can create a string UDF, and the VARCHAR value is converted to a string and passed to the UDF. To use the VARCHAR argument directly or to return the vARCHAR value, create a GenericUDF. If other contexts rely on reflection based methods to retrieve type information, there may be other contexts that do not support VARCHAR. This includes some SerDe implementations.

The CHAR type is similar to the VARCHAR type. Different real varchars are variable in length, ranging from 1 to 65535. However, the length of the CHAR type is immutable, which means you need to specify the length when creating the data. However, this length does not affect the comparison, and the maximum length supported is 255

Many people think that the timestamp is an integer number, but it is not. The timestamp in the database is actually readable, which is yyyY-MM-DD HH: MM :ss. FFFFFFF

Intervals

In SQL calculation, unix_timestamp and from_unixtime are used together: first, the Unix time from 1970 to the present integer seconds, then add and subtract to achieve the purpose of the time before and after, now let’s recognize a more convenient function on time add and subtract: INTERVAL

-- Current time Indicates the time 10 seconds ago
select current_timestamp(a)- INTERVAL 10 second;
-- one year from now
select current_timestamp(a)+ INTERVAL 1 year;
Copy the code

Year_month,month_day, day, hour, hour, minute and second can be added or subtracted according to the format of month,month, day, hour, hour, minute and second. The format is arbitrary delimited characters (usually Spaces), corresponding to year_month,month_day, etc. It is important to note that they must be adjacent and month_day comes first. For example, year_month cannot be month_year. It can’t be a year, a year

-- like a year and a month ago
select current_timestamp(a)- INTERVAL '1 1' year_month
Copy the code

Here you can replace current_timestamp() with an arbitrary time function, for example

select current_date()  + INTERVAL 1 year;
Copy the code

Other types of

BOOLEAN The BOOLEAN type indicates true or false

BINARY arrays I’ve never used before

For Boolean types, we usually use 0 and 1 instead, which makes it easier to use later.

Type conversion

The atomic data types of Hive can be converted implicitly, similar to Java type conversion. For example, if an expression is of the INT type, TINYINT is automatically converted to INT. However, Hive does not convert the atomic data types of Hive. INT is not automatically converted to TINYINT and will return an error unless CAST is used.

The implicit type conversion rules are as follows

Any integer type can be implicitly converted to a wider range of types, such as TINYINT to INT and INT to BIGINT.

All integer types, floats, and strings can be implicitly converted to DOUBLE.

TINYINT, SMALLINT, INT can all be converted to FLOAT.

BOOLEAN types cannot be converted to any other type.

Data type conversions can be performed using the CAST operation display

For example, CAST(‘1’ AS INT) converts the string ‘1’ to the integer 1; If a CAST fails, such AS CAST(‘X’ AS INT), the expression returns NULL.

Below we provide a type conversion table, in which true represents the row can be converted to the corresponding column type

void boolean tinyint smallint int bigint float double decimal string varchar timestamp date binary
void to true true true true true true true true true true true true true true
boolean to false true false false false false false false false false false false false false
tinyint to false false true true true true true true true true true false false false
smallint to false false false true true true true true true true true false false false
int to false false false false true true true true true true true false false false
bigint to false false false false false true true true true true true false false false
float to false false false false false false true true true true true false false false
double to false false false false false false false true true true true false false false
decimal to false false false false false false false false true true true false false false
string to false false false false false false false true true true true false false false
varchar to false false false false false false false true true true true false false false
timestamp to false false false false false false false false false true true true false false
date to false false false false false false false false false true true false true false
binary to false false false false false false false false false false false false false true