== Background ==
Hive has powerful built-in functions for data processing and analysis, but sometimes the logic a requirement needs is too specific, or a piece of processing logic is reused so often, that handling it with built-in functions alone is clumsy or even impossible. In those cases you need Hive user-defined functions, which fall into three broad categories:
- UDF (User-Defined Function): takes a value and returns a single value after applying some logic, similar to built-ins like floor and round.
- UDAF (User-Defined Aggregation Function): aggregates many rows (typically with GROUP BY) into a single result, similar to sum and count.
- UDTF (User-Defined Table-Generating Function): takes one row of input and returns multiple rows, similar to explode.
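As a plain-Java analogy (this is only an illustration of the input/output shape of each function type, not the Hive API; the class and method names are mine):

```java
import java.util.Arrays;
import java.util.List;

public class FunctionShapes {
    // UDF shape: one value in, one value out (like floor, round)
    static long floorLike(double x) {
        return (long) Math.floor(x);
    }

    // UDAF shape: many rows in, one value out (like sum, count)
    static int sumLike(List<Integer> rows) {
        return rows.stream().mapToInt(Integer::intValue).sum();
    }

    // UDTF shape: one value in, many rows out (like explode)
    static List<String> explodeLike(String csv) {
        return Arrays.asList(csv.split(","));
    }

    public static void main(String[] args) {
        System.out.println(floorLike(3.7));                  // 3
        System.out.println(sumLike(Arrays.asList(1, 2, 3))); // 6
        System.out.println(explodeLike("a,b,c"));            // [a, b, c]
    }
}
```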
The steps for building a UDF are as follows:
- Inherit the UDF, UDAF, or UDTF base class (or implement the related interface) and write the logic your requirement calls for.
- Package the classes you wrote into a JAR and, preferably, upload it to HDFS.
- Log in to the Hive client and run
CREATE FUNCTION db_name.your_function_name AS 'your_jar_package_name.your_jar_class_name' USING JAR 'hdfs:///your_jar_hdfs_path/your_jar.jar';
to create your UDF function, where:
db_name is the name of your Hive database; your_function_name is the name you give the function; your_jar_package_name is the package name of the UDF class in your Java code; your_jar_class_name is the class name of the UDF in your Java code; hdfs:///your_jar_hdfs_path/your_jar.jar is the path of the JAR stored in HDFS; the rest are Hive CREATE FUNCTION keywords.
- On the Hive client, run
select your_function_name(cols)
to test the custom function.
This post first explains the implementation of a UDF, the most commonly used of the three, and then moves on to a hands-on example.
== Hive UDF in practice ==
== Requirement background ==
I have a batch of device data with a field for the screen aspect ratio, such as 16:9 or 4:3. The front-end collection was not rigorous, however, so both 16:9 and 9:16 show up for the same shape, which makes the data very inconvenient to use. I decided to clean it in the data warehouse and standardize every device's aspect ratio as large:small, so that only values like 16:9 and 4:3 remain.
== Java programming ==
I use IDEA to write the Java code; as shown in Figure 1, it is a Maven-based Java project that implements the UDF.
The Java code for the logic is as follows; it is fairly simple and should need little explanation.
package org.example;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PicRatio_UDF extends UDF {
    public String evaluate(String picratio) {
        try {
            String[] myArray = picratio.split(":");
            int myArray0 = Integer.parseInt(myArray[0]);
            int myArray1 = Integer.parseInt(myArray[1]);
            return myArray0 < myArray1 ? myArray1 + ":" + myArray0 : picratio;
        } catch (Exception e) {
            System.out.println(e);
            return picratio;
        }
    }
}
For the sake of rigor, let's write a main function to test whether our Java method is correct. The test code is as follows:
package org.example;

public class Test_UDF {
    public static void main(String[] args) {
        String source_PicRatio = "3:4";
        PicRatio_UDF myPicRatio_UDF = new PicRatio_UDF();
        String myPicRatio = myPicRatio_UDF.evaluate(source_PicRatio);
        System.out.println(myPicRatio);
    }
}
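For quick experiments without the Hive dependencies on the classpath, the same normalization rule can also be exercised standalone; this sketch inlines the logic of evaluate (the PicRatioLogic class and normalize method names are my own, for illustration):

```java
public class PicRatioLogic {
    // Same rule as PicRatio_UDF.evaluate: put the larger side first,
    // and fall back to the raw input on malformed data.
    static String normalize(String picRatio) {
        try {
            String[] parts = picRatio.split(":");
            int a = Integer.parseInt(parts[0]);
            int b = Integer.parseInt(parts[1]);
            return a < b ? b + ":" + a : picRatio;
        } catch (Exception e) {
            return picRatio;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("3:4"));  // 4:3
        System.out.println(normalize("16:9")); // 16:9 (already large:small)
        System.out.println(normalize("oops")); // oops (malformed, returned as-is)
    }
}
```

Note how the catch block covers both non-numeric input and strings without a colon, mirroring the UDF's behavior of passing dirty rows through untouched.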
The project is based on Maven; the dependencies and plug-ins used are declared in the pom.xml file as follows.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>udf</artifactId>
<version>1.0-SNAPSHOT</version>
<name>udf</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.8.5</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>2.3.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.5</version>
</dependency>
</dependencies>
<!-- repository for pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde -->
<repositories>
<repository>
<id>spring-plugin</id>
<url>https://repo.spring.io/plugins-release/</url>
</repository>
</repositories>
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
After everything is ready, use Maven to package the code into a JAR (see the IntelliJ IDEA documentation on building JARs if needed). Note that it is important to check the build log when packaging and make sure there are no errors: if you ignore a failed build and simply copy the broken JAR, Hive will still report an error when referencing the package, and you will have to fix the problem and repackage. Then upload the JAR to HDFS, in my case to hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar. Now open the Hive client and create a permanent UDF function. Run the following in the Hive CLI:
hive> create function rowyet.picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
Added [/opt/hive/log/c0017eab0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 6.03 seconds
-- test
hive> select rowyet.picratio("4:3")
> ;
OK
4:3
Time taken: 0.902 seconds, Fetched: 1 row(s)
hive> select rowyet.picratio("3:4");
OK
4:3
Time taken: 0.064 seconds, Fetched: 1 row(s)
You can also add the function temporarily; the HiveQL is as follows. This is generally less common, since most functions are created as permanent ones. Note that a temporary function is session-scoped and cannot be qualified with a database name:
create temporary function picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
== Exception analysis ==
- If the JAR fails to load, the Hive client will not print the "Added ... to class path" log. If you instead see an error such as
FAILED: SemanticException java.lang.IllegalArgumentException: java.net.UnknownHostException: hive
the JAR was never loaded into the Hive client; check whether the JAR path is correct. A successful load looks like this:
Added [/opt/hive/log/c0017eab0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
- If the "Added ... to class path" log appears but you then get an error similar to
Failed to register rowyet.picratio using class com.ruoyin.hiveudf.PicRatio_UDF
the JAR itself has a problem. Go back to the packaging log, look for errors, and fix them before repackaging.
== Delete the custom UDF function ==
User-defined UDFs can also be deleted. The HiveQL is as follows:
hive> drop function rowyet.picratio;
Added [/opt/hive/log/ce6a5159-2063-4796-b4cb-5a6762b0ad15_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 0.752 seconds
hive> select rowyet.picratio("4:3")
> ;
FAILED: SemanticException [Error 10011]: Invalid function rowyet.picratio