A UDF takes one row of input and returns one result

This covers most of the needs of day-to-day work

UDF implementation method

Hive provides two ways to implement a UDF:

The first: extend the UDF class

Advantages:
- Simple to implement
- Supports Hive primitive types, arrays, and maps
- Supports function overloading
Disadvantages:
- Only suitable for functions with simple logic

This approach needs little code, keeps the logic clear, and lets you implement a simple UDF quickly

The second: extend the GenericUDF class

Advantages:
- Supports any number and type of parameters
- Can implement different logic depending on the number and types of the arguments
- Can initialize and release resources (initialize, close)
Disadvantages:
- Slightly more complex to implement than extending the UDF class

GenericUDF is more flexible than the UDF class and can implement more complex functions

About the choice between the two

Prefer extending the UDF class if the function has the following characteristics:

- Simple logic, such as a function that converts English text to lowercase
- Simple parameter and return value types: Hive primitive types, arrays, or maps
- No resources that need to be initialized or closed

Otherwise, consider extending the GenericUDF class

The steps for both implementations are described below

Extending the UDF class

The first approach is the simplest: create a new class that extends UDF and write an evaluate() method

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Extends org.apache.hadoop.hive.ql.exec.UDF
 */
public class SimpleUDF extends UDF {

  /**
   * Write a function that meets the following requirements:
   * 1. The function name must be evaluate
   * 2. Parameter and return value types can be Java primitive types, Java wrapper classes,
   *    org.apache.hadoop.io.Writable types, List, Map, and so on
   * 3. The function must return a value; it cannot be void
   */
  public int evaluate(int a, int b) {
    return a + b;
  }

  // Overloads are supported
  public Integer evaluate(Integer a, Integer b, Integer c) {
    if (a == null || b == null || c == null)
      return 0;
    return a + b + c;
  }
}

Extending the UDF class is quite simple, but there are a few caveats:

- evaluate() is not inherited from the UDF class, so it is not an override; Hive looks it up by reflection
- evaluate() cannot return void
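Because evaluate() is not declared on the UDF base class, each overload is found by its name and parameters rather than through overriding. The sketch below imitates that idea with plain Java reflection. It is only an illustration: EvaluateLookup and AddFunctions are hypothetical names, the class deliberately does not extend Hive's UDF so that it runs without Hive on the classpath, and Hive's real method resolution (which also matches and converts argument types) is more involved.

```java
import java.lang.reflect.Method;

final class EvaluateLookup {

    /** Hypothetical stand-in for a UDF class with two evaluate() overloads. */
    public static class AddFunctions {
        public Integer evaluate(Integer a, Integer b) {
            if (a == null || b == null) return null;
            return a + b;
        }

        public Integer evaluate(Integer a, Integer b, Integer c) {
            if (a == null || b == null || c == null) return null;
            return a + b + c;
        }
    }

    /** Finds the evaluate() overload with a matching parameter count and invokes it. */
    static Object call(Object udf, Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate") && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("no evaluate() with " + args.length + " arguments");
    }

    public static void main(String[] args) {
        AddFunctions udf = new AddFunctions();
        System.out.println(call(udf, 1, 2));     // two-argument overload
        System.out.println(call(udf, 1, 2, 3));  // three-argument overload
    }
}
```

Nothing in AddFunctions carries an @Override annotation; the method name and its parameter list are all the lookup relies on, which is also why overloading works.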

Supported parameter and return value types

Hive primitive types, arrays, and maps are supported

Hive primitive types

On the Java side you can use Java primitive types, Java wrapper classes, or the corresponding Writable classes

PS: It is best not to use Java primitive types for parameters. The UDF reports an error when null is passed to a Java primitive parameter, whereas a Java wrapper class can hold null and be checked for it

Hive type	Java primitive type	Java wrapper class	Writable class
tinyint	byte	Byte	ByteWritable
smallint	short	Short	ShortWritable
int	int	Integer	IntWritable
bigint	long	Long	LongWritable
string	String	–	Text
boolean	boolean	Boolean	BooleanWritable
float	float	Float	FloatWritable
double	double	Double	DoubleWritable
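The note above about null and Java primitive parameters can be reproduced without Hive. The following is a minimal sketch, assuming the function is invoked reflectively in a way similar to how evaluate() is called; NullArgumentDemo and PlusOne are hypothetical names, and the exact error Hive reports may look different.

```java
import java.lang.reflect.Method;

final class NullArgumentDemo {

    /** Hypothetical function class with a primitive and a wrapper overload. */
    public static class PlusOne {
        public int evaluate(int a) {
            return a + 1;
        }

        public Integer evaluate(Integer a) {
            return a == null ? null : a + 1;
        }
    }

    /** Returns true if the overload taking paramType can be invoked with a null argument. */
    static boolean acceptsNull(Class<?> paramType) {
        try {
            Method m = PlusOne.class.getMethod("evaluate", paramType);
            Object nullArgument = null;
            m.invoke(new PlusOne(), nullArgument);
            return true;
        } catch (IllegalArgumentException | NullPointerException e) {
            // Reflection rejects null for a primitive parameter before the method body runs
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("int accepts null:     " + acceptsNull(int.class));
        System.out.println("Integer accepts null: " + acceptsNull(Integer.class));
    }
}
```

The wrapper overload receives the null and can decide what to return, while the primitive overload never gets the chance.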

Array and Map

Hive type	Java type
array<T>	List<T>
map<K, V>	Map<K, V>

Extending GenericUDF

This approach is the most flexible, but also somewhat more complex to implement than the previous one

After extending GenericUDF, three methods must be implemented:

- initialize()
- evaluate()
- getDisplayString()

initialize()

/**
 * Initialize the GenericUDF. Each GenericUDF instance calls this method only once
 *
 * @param arguments
 *          ObjectInspector instances for the arguments of the custom UDF
 * @throws UDFArgumentException
 *           Thrown if the number or types of the arguments are wrong
 * @return The return value type of the function
 */
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
    throws UDFArgumentException;

initialize() is called once when the GenericUDF is initialized and performs setup work, including:

- Checking the number of function arguments
- Checking the types of the function arguments
- Determining the return value type of the function

In addition, you can do custom initialization here, such as creating an HDFS client

First: check the number of function arguments

The number of arguments is the length of the arguments array

Example of checking the number of arguments:

// 1. Check the number of arguments; this function takes exactly one
if (arguments.length != 1)
  // Throw an exception when the arguments of the custom UDF are not as expected
  throw new UDFArgumentException("The function requires exactly one argument");

Second: check the types of the function arguments

ObjectInspector is used to inspect the data type of an argument. Its nested enum Category represents the kind of type that the ObjectInspector describes

public interface ObjectInspector extends Cloneable {
  public static enum Category {
    PRIMITIVE, // Hive primitive type
    LIST,      // Hive array
    MAP,       // Hive map
    STRUCT,    // struct
    UNION      // union
  };
}

Hive primitive types are further divided into several subtypes. PrimitiveObjectInspector extends ObjectInspector to describe the specific Hive primitive type

public interface PrimitiveObjectInspector extends ObjectInspector {

  /**
   * The primitive types supported by Hive.
   */
  public static enum PrimitiveCategory {
    VOID, BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING,
    DATE, TIMESTAMP, BINARY, DECIMAL, VARCHAR, CHAR, INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME,
    UNKNOWN
  };
}

The individual PrimitiveCategory values are not explained one by one here

Example of checking the argument type:

// 2. Check the argument type; the argument must be a string
if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument is a Hive primitive type
    || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
        ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // the argument is a Hive string
  // Throw an exception when the arguments of the custom UDF are not as expected
  throw new UDFArgumentException("The first argument of the function must be a string");

Third: determine the return value type of the function

initialize() must return an ObjectInspector instance that represents the return value type of the custom UDF. The return value of initialize() determines the type that evaluate() returns

A comment in the ObjectInspector source code says, in effect, that ObjectInspector instances should be obtained from the corresponding factory class, so that each ObjectInspector has only one instance and the related properties hold

/**
 * An efficient implementation of ObjectInspector should rely on factory, so
 * that we can make sure the same ObjectInspector only has one instance. That
 * also makes sure hashCode() and equals() methods of java.lang.Object directly
 * works for ObjectInspector as well.
 */
public interface ObjectInspector extends Cloneable { }

For primitive types (byte, short, int, long, float, double, boolean, string), the instance can be obtained directly from the static fields of PrimitiveObjectInspectorFactory

Hive type	Writable type	Java wrapper types
tinyint	writableByteObjectInspector	javaByteObjectInspector
smallint	writableShortObjectInspector	javaShortObjectInspector
int	writableIntObjectInspector	javaIntObjectInspector
bigint	writableLongObjectInspector	javaLongObjectInspector
string	writableStringObjectInspector	javaStringObjectInspector
boolean	writableBooleanObjectInspector	javaBooleanObjectInspector
float	writableFloatObjectInspector	javaFloatObjectInspector
double	writableDoubleObjectInspector	javaDoubleObjectInspector

Note: the return value of a primitive type can be either a Writable or a Java wrapper:

- When initialize() specifies a Writable return type, evaluate() should return an instance of the corresponding Writable class
- When initialize() specifies a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class

For complex types such as array<T> and map<K, V>, the instance can be obtained from the static methods of ObjectInspectorFactory

Hive type	ObjectInspectorFactory static method	evaluate() return value type
array<T>	getStandardListObjectInspector(elementInspector)	List<T>
map<K, V>	getStandardMapObjectInspector(keyInspector, valueInspector)	Map<K, V>

Example where the return type is map<string, int>:

// 3. The custom UDF returns map<string, int>
return ObjectInspectorFactory.getStandardMapObjectInspector(
    PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
    PrimitiveObjectInspectorFactory.javaIntObjectInspector     // the value is an int
);

The complete initialize() function looks like this:

/**
 * Initialize the GenericUDF. Each GenericUDF instance calls this method only once
 *
 * @param arguments
 *          ObjectInspector instances for the arguments of the custom UDF
 * @throws UDFArgumentException
 *           Thrown if the number or types of the arguments are wrong
 */
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
  // 1. Check the number of arguments; this function takes exactly one
  if (arguments.length != 1)
    // Throw an exception when the arguments of the custom UDF are not as expected
    throw new UDFArgumentException("The function requires exactly one argument");

  // 2. Check the argument type; the argument must be a string
  if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument is a Hive primitive type
      || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
          ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // the argument is a Hive string
    // Throw an exception when the arguments of the custom UDF are not as expected
    throw new UDFArgumentException("The first argument of the function must be a string");

  // 3. The custom UDF returns map<string, int>
  return ObjectInspectorFactory.getStandardMapObjectInspector(
      PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
      PrimitiveObjectInspectorFactory.javaIntObjectInspector     // the value is an int
  );
}

evaluate()

The core method, which holds the logic of the custom UDF

The implementation can be divided into three steps:

- Receive the arguments
- Run the core logic of the custom UDF
- Return the result

Step 1: Receive the arguments

The arguments of evaluate() are the arguments passed to the custom UDF

/**
 * Evaluate the GenericUDF with the arguments.
 *
 * @param arguments
 *          The arguments as DeferedObject, use DeferedObject.get() to get the
 *          actual argument Object. The Objects can be inspected by the
 *          ObjectInspectors passed in the initialize call.
 * @return The
 */
public abstract Object evaluate(DeferredObject[] arguments)
  throws HiveException;

As the source code comment above says, DeferredObject.get() returns the value of an argument

/**
 * A Defered Object allows us to do lazy-evaluation and short-circuiting.
 * GenericUDF use DeferedObject to pass arguments.
 */
public static interface DeferredObject {
  void prepare(int version) throws HiveException;
  Object get() throws HiveException;
};

DeferredObject.get() returns an Object, so the value has to be cast to its actual type

For Hive primitive types, the value passed in is the corresponding Writable type

Hive type	Java type
tinyint	ByteWritable
smallint	ShortWritable
int	IntWritable
bigint	LongWritable
string	Text
boolean	BooleanWritable
float	FloatWritable
double	DoubleWritable
Array	ArrayList
Map<K, V>	HashMap<K, V>

Example of receiving the arguments:

// Assume the argument type is map<string, int>
// 1. Return early when the argument value is null
if (arguments[0].get() == null)
  return ...
// 2. Receive the argument
Map<Text, IntWritable> map = (Map<Text, IntWritable>) arguments[0].get();
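The DeferredObject comment quoted earlier mentions lazy evaluation and short-circuiting. The idea can be pictured with a small stand-alone sketch in which java.util.function.Supplier plays the role of DeferredObject. This is an analogy only: DeferredSketch, deferred, and firstNonNull are hypothetical names, and Hive's own interface additionally declares prepare() and throws HiveException.

```java
import java.util.function.Supplier;

final class DeferredSketch {

    /** Counts how many argument values were actually fetched. */
    public static int computed = 0;

    /** Wraps a value so that the counter records when it is really fetched. */
    public static Supplier<Object> deferred(Object value) {
        return () -> {
            computed++;
            return value;
        };
    }

    /** Returns the first non-null argument, fetching arguments only as far as needed. */
    @SafeVarargs
    public static Object firstNonNull(Supplier<Object>... arguments) {
        for (Supplier<Object> argument : arguments) {
            Object value = argument.get();
            if (value != null) {
                return value; // later arguments are never fetched
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Object result = firstNonNull(deferred(null), deferred("b"), deferred("c"));
        System.out.println(result + " after fetching " + computed + " arguments");
    }
}
```

A function that can decide its result early skips the cost of fetching the remaining arguments, which is the benefit of passing arguments in deferred form.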

Step 2: Core logic of the custom UDF

Once you have the arguments, you are free to implement whatever logic the function needs
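One way to keep this step easy to test is to put the core logic in a plain static method that knows nothing about Hive. The sketch below does this for the character-count logic used in the complete example at the end of this article; CharCounter and countChars are hypothetical names, and returning an empty map for null input is an assumption made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

final class CharCounter {

    /** Counts how often each character occurs in str; null input gives an empty map. */
    public static Map<String, Integer> countChars(String str) {
        Map<String, Integer> counts = new HashMap<>();
        if (str == null) {
            return counts;
        }
        for (char ch : str.toCharArray()) {
            // Add 1 to the existing count, or start at 1 for a new character
            counts.merge(String.valueOf(ch), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countChars("hello"));
    }
}
```

evaluate() then only has to convert the incoming argument to a String, call the helper, and return its result.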

Step 3: Return the result

This step must match the return value type declared by initialize()

The return value of a primitive type can be either a Writable or a Java wrapper:

- When initialize() specifies a Writable return type, evaluate() should return an instance of the corresponding Writable class
- When initialize() specifies a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class

Hive array and map values are returned as the following types:

Hive type	Java type
Array<T>	List<T>
Map<K, V>	Map<K, V>

getDisplayString()

getDisplayString() returns the text shown for the function in EXPLAIN output

/**
 * Get the String to be displayed in explain.
 */
public abstract String getDisplayString(String[] children);

Note: do not return null here, otherwise a NullPointerException may be thrown at run time, and the cause is hard to track down

ERROR [b1c82c24-bfea-4580-9a0c-ff47d7ef4dbe main] ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
  at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
  ...
  at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
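A simple way to avoid the null return is to always build the string from the function name and the children array. The helper below is a minimal sketch, assuming children holds the textual form of each argument; DisplayStrings, displayString, and the function name char_count are hypothetical and not part of Hive.

```java
final class DisplayStrings {

    /** Builds "name(arg1, arg2, ...)" and never returns null, even for a null array. */
    public static String displayString(String functionName, String[] children) {
        if (children == null) {
            return functionName + "()";
        }
        return functionName + "(" + String.join(", ", children) + ")";
    }

    public static void main(String[] args) {
        System.out.println(displayString("char_count", new String[] {"col1"}));
    }
}
```

Inside a GenericUDF, getDisplayString(children) could simply delegate to such a helper.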

close()

Callback for releasing resources

It is not an abstract method, so implementing it is optional

/**
 * Close GenericUDF.
 * This is only called in runtime of MapRedTask.
 */
@Override
public void close() throws IOException { }

Complete example of a custom GenericUDF

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

import java.util.HashMap;
import java.util.Map;

public class SimpleGenericUDF extends GenericUDF {

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // 1. Check the number of arguments; this function takes exactly one
    if (arguments.length != 1)
      // Throw an exception when the arguments of the custom UDF are not as expected
      throw new UDFArgumentException("The function requires exactly one argument");

    // 2. Check the argument type; the argument must be a string
    if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument is a Hive primitive type
        || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
            ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // the argument is a Hive string
      // Throw an exception when the arguments of the custom UDF are not as expected
      throw new UDFArgumentException("The first argument of the function must be a string");

    // 3. The custom UDF returns map<string, int>
    return ObjectInspectorFactory.getStandardMapObjectInspector(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
        PrimitiveObjectInspectorFactory.javaIntObjectInspector     // the value is an int
    );
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // 1. Receive the argument; return an empty map when its value is null
    Object value = arguments[0].get();
    if (value == null)
      return new HashMap<>();
    String str = ((Text) value).toString();

    // 2. Core logic of the custom UDF:
    // count the occurrences of each character in the string and record them in a Map
    Map<String, Integer> map = new HashMap<>();
    for (char ch : str.toCharArray()) {
      String key = String.valueOf(ch);
      Integer count = map.getOrDefault(key, 0);
      map.put(key, count + 1);
    }

    // 3. Return the result
    return map;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "This is a simple test for custom UDF~";
  }
}
