Table-generating function (UDTF)

One input row, multiple output rows

Implementing a custom UDTF requires extending GenericUDTF and implementing three methods:

  • initialize()
  • process()
  • close()

process() and close() are abstract methods in GenericUDTF and must be implemented. initialize() is not abstract, but it must still be overridden, because the default implementation in GenericUDTF ends up throwing an exception:

throw new IllegalStateException("Should not be called directly");

initialize()

The method to override is:

public StructObjectInspector initialize(StructObjectInspector argOIs)
  throws UDFArgumentException {}

initialize() is called once when the UDTF is set up. It typically does three things:

  1. Check the number of function arguments
  2. Check the function argument types
  3. Declare the return type of the function

It is also the place for any custom one-time setup, such as creating an HDFS client.

First: check the number of function arguments

The argument of initialize() is StructObjectInspector argOIs.

All arguments passed to the custom UDTF can be obtained as follows:

List<? extends StructField> inputFieldRef = argOIs.getAllStructFieldRefs();

Checking the number of arguments is then simple: check the number of elements in inputFieldRef.

Example:

List<? extends StructField> inputFieldRef = argOIs.getAllStructFieldRefs();
if (inputFieldRef.size() != 1) {
  throw new UDFArgumentException("One argument is required");
}

Second: check the function argument types

Each element of inputFieldRef is a StructField, from which the ObjectInspector of that argument can be obtained.

For how to inspect the type of an ObjectInspector, refer to UDF Development Manual – UDF.

An example that checks both the number and the type of the arguments:

// 1. Check the number of arguments
List<? extends StructField> inputFieldRef = argOIs.getAllStructFieldRefs();
if (inputFieldRef.size() != 1) {
  throw new UDFArgumentException("One argument is required");
}

// 2. Check the argument type
ObjectInspector objectInspector = inputFieldRef.get(0).getFieldObjectInspector();
if (objectInspector.getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
    || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
        ((PrimitiveObjectInspector) objectInspector).getPrimitiveCategory())) { // and that primitive type must be string
  // Throw an exception when the argument does not match expectations
  throw new UDFArgumentException("The first argument must be a string");
}

Third: declare the return type of the function

A UDTF can produce multiple output rows for one input row, and each output row can have multiple columns. Declaring the return type of a custom UDTF is therefore slightly more involved: the names and types of all output columns must be specified.

The initialize() method returns a StructObjectInspector.

A StructObjectInspector describes the structure of one row, which can contain multiple columns. Each column has a name, a type, and an optional comment.

An instance of StructObjectInspector can be obtained from ObjectInspectorFactory:

/**
 * structFieldNames: column names
 * structFieldObjectInspectors: column types
 */
ObjectInspectorFactory.getStandardStructObjectInspector(
    List<String> structFieldNames,
    List<ObjectInspector> structFieldObjectInspectors)

The nth element of structFieldNames is the name of the nth column, and the nth element of structFieldObjectInspectors is the type of the nth column.

structFieldNames and structFieldObjectInspectors must have the same length.

// Only one column, whose type is Map<String, Integer>
return ObjectInspectorFactory.getStandardStructObjectInspector(
    Collections.singletonList("result_column_name"),
    Collections.singletonList(
        ObjectInspectorFactory.getStandardMapObjectInspector(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector, // key is a String
            PrimitiveObjectInspectorFactory.javaIntObjectInspector))); // value is an int
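The column-name list and the column-type list have to stay aligned, so one option is to build them together. The following is only a sketch in plain Java with no Hive dependency: OutputSchema, column(), names() and inspectors() are hypothetical names, the type parameter T stands in for ObjectInspector, and the duplicate-name check is an extra assumption of this sketch rather than something the original text requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of Hive; T stands in for ObjectInspector. */
final class OutputSchema<T> {
    private final List<String> names = new ArrayList<>();
    private final List<T> inspectors = new ArrayList<>();

    /** Adds one output column; name and type are always appended together. */
    OutputSchema<T> column(String name, T inspector) {
        if (names.contains(name)) {
            throw new IllegalArgumentException("duplicate column name: " + name);
        }
        names.add(name);
        inspectors.add(inspector);
        return this;
    }

    /** Column names, in declaration order. */
    List<String> names() {
        return Collections.unmodifiableList(names);
    }

    /** Column types, in the same order as names(). */
    List<T> inspectors() {
        return Collections.unmodifiableList(inspectors);
    }
}
```

In a real UDTF, T would be ObjectInspector, and the two resulting lists would be the two arguments of getStandardStructObjectInspector.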

process()

This is the core method, where the custom UDTF logic is implemented.

The implementation can be divided into three steps:

  1. Receive the arguments
  2. Run the custom UDTF logic
  3. Output the results

/**
 * Give a set of arguments for the UDTF to process.
 *
 * @param args
 *          object array of arguments
 */
public abstract void process(Object[] args) throws HiveException;

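Step 1: Receive the arguments

args holds the arguments passed to the custom UDTF. Each Hive type arrives as a different Java type; the table below lists the Java types for common Hive types. The exact runtime class is governed by the ObjectInspector of each argument, so treat the table as the common case.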

Hive type     Java type
tinyint       ByteWritable
smallint      ShortWritable
int           IntWritable
bigint        LongWritable
string        Text
boolean       BooleanWritable
float         FloatWritable
double        DoubleWritable
array         ArrayList
map<K, V>     HashMap<K, V>

Example of receiving an argument:

if (args[0] == null) {
  return;
}
String str = ((Text) args[0]).toString();

Step 2: Run the custom UDTF logic

Once the arguments have been read, the function is free to apply whatever logic it needs.
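As an illustration of what such core logic might look like, the sketch below turns one delimited string into several (position, token) rows. It is plain Java with no Hive dependency; the class SplitLogic and the method toRows are hypothetical, and in a real UDTF each returned row would be handed to forward() rather than collected in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical core logic for a splitting UDTF; not part of Hive. */
final class SplitLogic {
    private SplitLogic() {}

    /**
     * Splits the input on a literal delimiter and returns one row per
     * non-blank token. Each row is {position, token}, with position
     * counting only the tokens that were kept.
     */
    static List<Object[]> toRows(String input, String delimiter) {
        List<Object[]> rows = new ArrayList<>();
        if (input == null) {
            return rows; // a null argument produces no output rows
        }
        int position = 0;
        for (String token : input.split(Pattern.quote(delimiter), -1)) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty or whitespace-only tokens
            }
            rows.add(new Object[] {position++, trimmed});
        }
        return rows;
    }
}
```

Keeping the logic in a plain method like this also makes it testable without a Hive session.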

Step 3: Output the results

process() itself returns nothing; each output row is emitted by calling forward() in GenericUDTF. forward() can be called repeatedly, so any number of rows can be output.

/**
 * Passes an output row to the collector.
 *
 * @param o
 * @throws HiveException
 */
protected final void forward(Object o) throws HiveException {
  collector.collect(o);
}

forward() accepts either a List or a Java array, where the nth element is the value of the nth column.

List<Object> list = new LinkedList<>();
// The first column is an int
list.add(1);
// The second column is a string
list.add("hello");
// The third column is a boolean
list.add(true);
// Output one row
forward(list);

close()

This method is called when there are no more input rows.

It is the place for final cleanup, such as closing resources.
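The call order described in this section (process() once per input row, forward() any number of times per call, close() once at the end) can be simulated without Hive. The sketch below is a rough model under stated assumptions: SketchUdtf is a hypothetical stand-in for GenericUDTF that models only this contract, RepeatUdtf and run() are invented for illustration, and arguments arrive as plain String and Integer rather than Writable types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for GenericUDTF; only process/forward/close is modeled. */
abstract class SketchUdtf {
    private final List<List<Object>> collected = new ArrayList<>();

    abstract void process(Object[] args);

    abstract void close();

    /** Stand-in for forward(): records one output row. */
    protected final void forward(List<Object> row) {
        collected.add(new ArrayList<>(row));
    }

    /** Drives the lifecycle: process() per input row, then close() exactly once. */
    final List<List<Object>> run(List<Object[]> inputRows) {
        for (Object[] args : inputRows) {
            process(args);
        }
        close();
        return collected;
    }
}

/** Emits {value, i} for i in [0, times): one input row, several output rows. */
final class RepeatUdtf extends SketchUdtf {
    boolean closed = false;

    @Override
    void process(Object[] args) {
        if (args[0] == null || args[1] == null) {
            return; // null input produces no output rows
        }
        String value = (String) args[0];
        int times = (Integer) args[1];
        for (int i = 0; i < times; i++) {
            forward(Arrays.<Object>asList(value, i));
        }
    }

    @Override
    void close() {
        closed = true; // a real UDTF would release clients or files here
    }
}
```

Feeding the rows ("a", 2), (null, 5) and ("b", 1) through run() yields three output rows, and close() has been called by the time run() returns.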