A UDF takes one row of input and returns one result

It covers most of the needs of daily work

UDF implementation methods

Hive provides two ways to implement a UDF:

First: extend the UDF class
  • Advantages:
    • Simple to implement
    • Supports Hive basic types, arrays, and maps
    • Supports function overloading
  • Disadvantages:
    • Only suitable for functions with simple logic

This approach needs little code, keeps the logic clear, and lets you implement a simple UDF quickly

Second: extend the GenericUDF class
  • Advantages:
    • Supports any number of parameters, of any type
    • Can implement different logic depending on the number and types of the arguments
    • Can initialize and release resources (initialize, close)
  • Disadvantages:
    • The implementation is a little more complex than extending UDF

GenericUDF is more flexible than UDF and can implement more complex functions

Choosing between the two

Prefer extending the UDF class if the function has the following characteristics:

  1. Simple logic, such as a function that converts English text to lowercase
  2. Simple parameter and return types: Hive basic types, arrays, or maps
  3. No resources need to be initialized or closed

Otherwise, consider extending the GenericUDF class

The steps for both implementations are described below

Extending the UDF class

The first approach is the simplest: create a new class that extends UDF, then write evaluate()

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Extends org.apache.hadoop.hive.ql.exec.UDF
 */
public class SimpleUDF extends UDF {

    /**
     * Write a function that meets the following requirements:
     * 1. The function name must be evaluate
     * 2. Parameter and return types can be Java primitive types, Java wrapper classes,
     *    org.apache.hadoop.io.Writable types, List, Map, etc.
     * 3. The function must not return void
     */
    public int evaluate(int a, int b) {
        return a + b;
    }

    // Overload: evaluate() can be defined again with a different signature
    public Integer evaluate(Integer a, Integer b, Integer c) {
        if (a == null || b == null || c == null)
            return 0;
        return a + b + c;
    }
}

Extending the UDF class is quite simple, but there are a few caveats:

  1. evaluate() is not inherited from the UDF class; you declare it yourself rather than overriding it
  2. evaluate() must not return void
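
Because evaluate() is not declared on the UDF base class, the method has to be found by name and argument list at run time. The following is a minimal sketch of that idea using plain Java reflection. It is not Hive's actual resolver, and EvaluateLookup, AddUdf, and call are hypothetical names with no Hive dependency.

```java
import java.lang.reflect.Method;

public class EvaluateLookup {

    // Hypothetical stand-in for a UDF class: two overloads of evaluate().
    public static class AddUdf {
        public Integer evaluate(Integer a, Integer b) {
            if (a == null || b == null) return null;
            return a + b;
        }

        public Integer evaluate(Integer a, Integer b, Integer c) {
            if (a == null || b == null || c == null) return null;
            return a + b + c;
        }
    }

    // Simplified lookup: pick the public method named "evaluate" whose
    // parameter count matches. A real resolver would also compare types.
    public static Object call(Object udf, Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate") && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("evaluate() call failed", e);
                }
            }
        }
        throw new IllegalArgumentException("no evaluate() with " + args.length + " arguments");
    }
}
```

Calling EvaluateLookup.call(new EvaluateLookup.AddUdf(), 1, 2) selects the two-argument overload, and adding a third argument selects the other one. This is also why a misspelled method name compiles without complaint but cannot be found when the function is called.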

Supported parameter and return value types

Hive basic types, arrays, and maps are supported

Hive Basic Types

On the Java side you can use Java primitive types, Java wrapper classes, or the corresponding Writable classes

Note: for basic types, it is best not to use Java primitive types, because the UDF reports an error when NULL is passed to a parameter declared as a Java primitive type. With Java wrapper classes, the value can also be checked for null

| Hive type | Java primitive type | Java wrapper class | Writable class |
|---|---|---|---|
| tinyint | byte | Byte | ByteWritable |
| smallint | short | Short | ShortWritable |
| int | int | Integer | IntWritable |
| bigint | long | Long | LongWritable |
| string | (none) | String | Text |
| boolean | boolean | Boolean | BooleanWritable |
| float | float | Float | FloatWritable |
| double | double | Double | DoubleWritable |

Array and Map

| Hive type | Java type |
|---|---|
| array<T> | List<T> |
| map<K, V> | Map<K, V> |
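
The null-handling advice for basic types can be illustrated without Hive. The sketch below uses a hypothetical class NullSafeAdd (it is not a real Hive UDF and extends nothing) whose evaluate() takes wrapper types, so a NULL value can be received as null and handled explicitly.

```java
public class NullSafeAdd {

    // Wrapper parameters can hold null, so NULL inputs can be detected.
    // Returning null for a NULL input is an assumption made for this sketch.
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }
}
```

With an int signature the same method could not receive null at all, which is why the wrapper or Writable types are the safer choice for parameters that may be NULL.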

Extending GenericUDF

This approach is the most flexible, but also a bit more complex to implement than the previous one

After extending GenericUDF, three methods must be implemented:

  1. initialize()
  2. evaluate()
  3. getDisplayString()

initialize()

/**
 * Initialize the GenericUDF. Each GenericUDF instance calls this method only once.
 *
 * @param arguments
 *          ObjectInspector instances for the custom UDF's arguments
 * @throws UDFArgumentException
 *           Thrown if the number or types of the arguments are wrong
 * @return The ObjectInspector for the function's return type
 */
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
    throws UDFArgumentException;

initialize() is called once when the GenericUDF is initialized and performs setup work, including:

  1. Checking the number of function arguments
  2. Checking the function argument types
  3. Determining the function's return type

In addition, users can perform their own initialization here, such as creating an HDFS client

Step 1: check the number of function arguments

The number of function arguments is the length of the arguments array

Example of checking the number of arguments:

// 1. Check the number of arguments; exactly one is expected
if (arguments.length != 1)
    // Throw an exception when the custom UDF's arguments do not match expectations
    throw new UDFArgumentException("The function requires one argument");

Step 2: check the function argument types

An ObjectInspector can be used to check an argument's data type. It has a nested enum, Category, that identifies the kind of type the ObjectInspector describes

public interface ObjectInspector extends Cloneable {
  public static enum Category {
    PRIMITIVE, // Hive primitive type
    LIST,      // Hive array
    MAP,       // Hive map
    STRUCT,    // struct
    UNION      // union
  };
}

Hive primitive types are further divided into subtypes. PrimitiveObjectInspector extends ObjectInspector to identify the specific Hive primitive type

public interface PrimitiveObjectInspector extends ObjectInspector {

  /**
   * The primitive types supported by Hive.
   */
  public static enum PrimitiveCategory {
    VOID, BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING,
    DATE, TIMESTAMP, BINARY, DECIMAL, VARCHAR, CHAR, INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME,
    UNKNOWN
  };
}

The PrimitiveCategory values are not explained one by one here

Example of checking the argument type:

// 2. Check the argument type; a string is expected
if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
    || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
        ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // and that primitive type must be string
    // Throw an exception when the custom UDF's arguments do not match expectations
    throw new UDFArgumentException("The function's first argument must be a string");

Step 3: determine the function's return type

initialize() must return an ObjectInspector instance that represents the custom UDF's return type. The return value of initialize() determines the type of value that evaluate() must return

A comment in the ObjectInspector source code says, in effect, that ObjectInspector instances should be obtained from the corresponding factory class, so that each ObjectInspector has only one instance and hashCode() and equals() work as expected

/**
 * An efficient implementation of ObjectInspector should rely on factory, so
 * that we can make sure the same ObjectInspector only has one instance. That
 * also makes sure hashCode() and equals() methods of java.lang.Object directly
 * works for ObjectInspector as well.
 */
public interface ObjectInspector extends Cloneable { }
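
The idea in that comment, one shared instance per described type, is a plain caching-factory pattern. A minimal sketch in ordinary Java follows. TypeInspector and TypeInspectorFactory are hypothetical stand-ins rather than Hive classes, and using a type name as the cache key is an assumption made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TypeInspectorFactory {

    // Hypothetical stand-in for an ObjectInspector. It does not override
    // equals() or hashCode(), so callers get identity comparison.
    public static final class TypeInspector {
        private final String typeName;

        private TypeInspector(String typeName) {
            this.typeName = typeName;
        }

        public String getTypeName() {
            return typeName;
        }
    }

    private static final Map<String, TypeInspector> CACHE = new ConcurrentHashMap<>();

    // Returns the single cached instance for a type name, creating it on first use.
    public static TypeInspector get(String typeName) {
        return CACHE.computeIfAbsent(typeName, TypeInspector::new);
    }
}
```

Because the constructor is private, every caller goes through get(), so two requests for "string" return the same object and the default equals() and hashCode() from java.lang.Object are sufficient.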

For basic types (byte, short, int, long, float, double, boolean, string), the ObjectInspector can be obtained directly from static fields of PrimitiveObjectInspectorFactory

| Hive type | Writable ObjectInspector | Java wrapper ObjectInspector |
|---|---|---|
| tinyint | writableByteObjectInspector | javaByteObjectInspector |
| smallint | writableShortObjectInspector | javaShortObjectInspector |
| int | writableIntObjectInspector | javaIntObjectInspector |
| bigint | writableLongObjectInspector | javaLongObjectInspector |
| string | writableStringObjectInspector | javaStringObjectInspector |
| boolean | writableBooleanObjectInspector | javaBooleanObjectInspector |
| float | writableFloatObjectInspector | javaFloatObjectInspector |
| double | writableDoubleObjectInspector | javaDoubleObjectInspector |

Note: for basic types, the return value can take one of two forms, Writable or Java wrapper:

  • When initialize() specifies a Writable return type, evaluate() should return an instance of the corresponding Writable class
  • When initialize() specifies a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class

For complex types such as array<T> and map<K, V>, the ObjectInspector can be obtained from static methods of ObjectInspectorFactory

| Hive type | ObjectInspectorFactory static method | evaluate() return type |
|---|---|---|
| array<T> | getStandardListObjectInspector(T t) | List<T> |
| map<K, V> | getStandardMapObjectInspector(K k, V v) | Map<K, V> |

Example where the return type is Map<String, Integer>:

// 3. The custom UDF's return type is Map<String, Integer>
return ObjectInspectorFactory.getStandardMapObjectInspector(
    PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
    PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an int
);

The complete initialize() function looks like this:

/**
 * Initialize the GenericUDF. Each GenericUDF instance calls this method only once.
 *
 * @param arguments
 *          ObjectInspector instances for the custom UDF's arguments
 * @throws UDFArgumentException
 *           Thrown if the number or types of the arguments are wrong
 * @return The ObjectInspector for the function's return type
 */
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // 1. Check the number of arguments; exactly one is expected
    if (arguments.length != 1)
        // Throw an exception when the custom UDF's arguments do not match expectations
        throw new UDFArgumentException("The function requires one argument");

    // 2. Check the argument type; a string is expected
    if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
        || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
            ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // and that primitive type must be string
        // Throw an exception when the custom UDF's arguments do not match expectations
        throw new UDFArgumentException("The function's first argument must be a string");

    // 3. The custom UDF's return type is Map<String, Integer>
    return ObjectInspectorFactory.getStandardMapObjectInspector(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
        PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an int
    );
}

evaluate()

The core method, where the custom UDF's logic is implemented

The implementation can be divided into three steps:

  1. Receive the arguments
  2. Run the custom UDF's core logic
  3. Return the result

Step 1: receive the arguments

The arguments of evaluate() are the arguments passed to the custom UDF:

/**
 * Evaluate the GenericUDF with the arguments.
 *
 * @param arguments
 *          The arguments as DeferedObject, use DeferedObject.get() to get the
 *          actual argument Object. The Objects can be inspected by the
 *          ObjectInspectors passed in the initialize call.
 * @return The
 */
public abstract Object evaluate(DeferredObject[] arguments)
  throws HiveException;

As the source comments show, DeferredObject.get() returns the value of the argument

/**
 * A Defered Object allows us to do lazy-evaluation and short-circuiting.
 * GenericUDF use DeferedObject to pass arguments.
 */
public static interface DeferredObject {
  void prepare(int version) throws HiveException;
  Object get() throws HiveException;
};
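
The lazy evaluation and short-circuiting mentioned in that comment can be sketched in plain Java. Below, Deferred is a hypothetical stand-in for DeferredObject, and firstNonNull is an illustrative function (not part of Hive) that stops calling get() once it has an answer.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyArgs {

    // Hypothetical stand-in for DeferredObject: the value is computed only on get().
    public interface Deferred {
        Object get();
    }

    // Returns the first non-null argument value; later arguments are never evaluated.
    public static Object firstNonNull(Deferred... arguments) {
        for (Deferred argument : arguments) {
            Object value = argument.get();
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Demo helper: counts how many arguments were actually evaluated.
    public static int evaluationsNeeded() {
        AtomicInteger calls = new AtomicInteger();
        firstNonNull(
            () -> { calls.incrementAndGet(); return null; },
            () -> { calls.incrementAndGet(); return "hit"; },
            () -> { calls.incrementAndGet(); return "never evaluated"; });
        return calls.get();
    }
}
```

In the demo helper only the first two arguments are evaluated; the third is skipped because the result is already known. Passing arguments as deferred objects is what makes that kind of saving possible.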

DeferredObject.get() returns an Object, whose concrete Java type depends on the Hive type of the argument

For Hive basic types, the Writable type is passed in

| Hive type | Java type |
|---|---|
| tinyint | ByteWritable |
| smallint | ShortWritable |
| int | IntWritable |
| bigint | LongWritable |
| string | Text |
| boolean | BooleanWritable |
| float | FloatWritable |
| double | DoubleWritable |
| array<T> | ArrayList<T> |
| map<K, V> | HashMap<K, V> |

Example of receiving an argument:

// Receive an argument of Hive type map<string, int>

// 1. Handle a null argument
if (arguments[0] == null || arguments[0].get() == null) return ...

// 2. Receive the argument
Map<Text, IntWritable> map = (Map<Text, IntWritable>) arguments[0].get();

Step 2: run the custom UDF's core logic

Once you have the arguments, you are free to implement whatever logic you need here
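
As one concrete possibility for this step, the sketch below counts how often each character occurs in a string, which is the logic used in the complete example at the end of this article. It is written as a plain static method (CharCounter.countChars, a hypothetical name) so that it can be tried without Hive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CharCounter {

    // Counts occurrences of each character; keys keep first-seen order.
    public static Map<String, Integer> countChars(String str) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (str == null) {
            return counts; // treating null as "no characters" is an assumption
        }
        for (char ch : str.toCharArray()) {
            counts.merge(String.valueOf(ch), 1, Integer::sum);
        }
        return counts;
    }
}
```

Keeping the core logic in a method like this, separate from argument handling, also makes it easy to unit test before wiring it into evaluate().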

Step 3: return the result

This step must match the return value of initialize()

For basic types, the return value can take one of two forms, Writable or Java wrapper:

  • When initialize() specifies a Writable return type, evaluate() should return an instance of the corresponding Writable class
  • When initialize() specifies a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class

Hive array and map return values use the following Java types:

| Hive type | Java type |
|---|---|
| array<T> | List<T> |
| map<K, V> | Map<K, V> |

getDisplayString()

getDisplayString() returns the text shown in EXPLAIN output

/**
 * Get the String to be displayed in explain.
 */
public abstract String getDisplayString(String[] children);

Note: do not return null here; otherwise a NullPointerException may be thrown at run time, and the cause is hard to track down

ERROR [b1c82c24-bfea-4580-9a0c-ff47d7ef4dbe main] ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
    at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
    ...
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
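
One way to avoid this is to always build a non-null string from the function name and its argument descriptions. A minimal sketch follows. The class DisplayStrings, the method format, and the function name char_count are hypothetical, and the name(arg1, arg2) layout is only an assumed convention.

```java
public class DisplayStrings {

    // Builds "name(child1, child2, ...)" and never returns null,
    // even when the children array itself is null.
    public static String format(String functionName, String[] children) {
        if (children == null) {
            return functionName + "()";
        }
        return functionName + "(" + String.join(", ", children) + ")";
    }
}
```

Inside getDisplayString(String[] children), returning DisplayStrings.format("char_count", children) would produce text such as char_count(col1) for a single argument described as col1.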

close()

Callback for releasing resources

It is not an abstract method, so implementing it is optional

/**
 * Close GenericUDF.
 * This is only called in runtime of MapRedTask.
 */
@Override
public void close() throws IOException { }

Complete custom GenericUDF example

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SimpleGenericUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // 1. Check the number of arguments; exactly one is expected
        if (arguments.length != 1)
            // Throw an exception when the custom UDF's arguments do not match expectations
            throw new UDFArgumentException("The function requires one argument");

        // 2. Check the argument type; a string is expected
        if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
            || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
                ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // and that primitive type must be string
            // Throw an exception when the custom UDF's arguments do not match expectations
            throw new UDFArgumentException("The function's first argument must be a string");

        // 3. The custom UDF's return type is Map<String, Integer>
        return ObjectInspectorFactory.getStandardMapObjectInspector(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
            PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an int
        );
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // 1. Receive the argument; a NULL input yields an empty map
        if (arguments[0] == null || arguments[0].get() == null)
            return new HashMap<>();
        String str = ((Text) arguments[0].get()).toString();

        // 2. Custom UDF core logic:
        // count the occurrences of each character in the string and record them in the map
        Map<String, Integer> map = new HashMap<>();
        for (char ch : str.toCharArray()) {
            String key = String.valueOf(ch);
            Integer count = map.getOrDefault(key, 0);
            map.put(key, count + 1);
        }

        // 3. Return the result
        return map;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "This is a simple test for custom UDF~";
    }
}