In this article I will go through the process of building and configuring a UDF for Hive.
Intro#
Today I wanted to use a simple sha256 function in Hive to mask a column, and it turns out the Hive version shipped with the latest Cloudera distribution doesn't have that native function.
This article will explain how you can build a sha256 (or any other) UDF and add it to Hive.
Checking Cloudera Package Versions#
Check the following URL to see the latest package versions shipped in Cloudera.
CDH 5.12 -> hive-1.1.0+cdh5.12.1+1197
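If you have shell access to a cluster node, you can also confirm the installed version directly with the Hive CLI:
hive --version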
The sha2 function was only added upstream in Hive 1.3.0, which explains why the hive-1.1.0 shipped with CDH 5.12 doesn't have it. From the Hive documentation:
Return Type | Name(Signature) | Description |
---|---|---|
string | sha2(string/binary, int) | Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be hashed. The second argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256). SHA-224 is supported starting from Java 8. If either argument is NULL or the hash length is not one of the permitted values, the return value is NULL. Example: sha2('ABC', 256) = 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'. |
Implementing the UDF#
This will serve more as an exercise; one could create a more complex UDF. For the time being, let's create a GenericUDFSha2 based on the existing Hive 1.3.0 implementation.
You can clone my repo with some udfs-utils here.
The original code for Hive version 1.3.0 is available in the repo.
Let's create the build structure. Note that the source path must follow the standard Maven layout and match the package we will declare in the Java file (com.rramos.bigdata.utils):
mkdir -p GenericUDFSha2/src/main/java/com/rramos/bigdata/utils
cd GenericUDFSha2
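By the end of this section the project should look like this (we will also add an optional test under src/test/java further down):
GenericUDFSha2/
├── pom.xml
└── src/main/java/com/rramos/bigdata/utils/GenericUDFSha2.java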
Let's create a pom.xml file with the following content:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.rramos.bigdata.utils</groupId>
<artifactId>GenericUDFSha2</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>GenericUDFSha2</name>
<url>http://maven.apache.org</url>
<!-- properties -->
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<!-- prerequisites -->
<prerequisites>
<maven>3.0</maven>
</prerequisites>
<!-- Dependencies -->
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>2.9.3</version>
</dependency>
</dependencies>
<!-- Build options -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.6</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<mainClass>com.rramos.bigdata.utils.GenericUDFSha2</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>
com.rramos.bigdata.utils.GenericUDFSha2
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
</project>
You should obviously change this to your own package namespace; I'm just using com.rramos.bigdata.utils to keep it simple.
Next, let's create the file src/main/java/com/rramos/bigdata/utils/GenericUDFSha2.java with the following content:
package com.rramos.bigdata.utils;
import static org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.PrimitiveGrouping.BINARY_GROUP;
import static org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.PrimitiveGrouping.NUMERIC_GROUP;
import static org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.PrimitiveGrouping.STRING_GROUP;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.commons.codec.binary.Hex;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFParamUtils;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.Converter;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
/**
* GenericUDFSha2.
*
*/
@Description(name = "sha2", value = "_FUNC_(string/binary, len) - Calculates the SHA-2 family of hash functions "
+ "(SHA-224, SHA-256, SHA-384, and SHA-512).",
extended = "The first argument is the string or binary to be hashed. "
+ "The second argument indicates the desired bit length of the result, "
+ "which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256). "
+ "SHA-224 is supported starting from Java 8. "
+ "If either argument is NULL or the hash length is not one of the permitted values, the return value is NULL.\n"
+ "Example: > SELECT _FUNC_('ABC', 256);\n 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'")
public class GenericUDFSha2 extends GenericUDF {
private transient Converter[] converters = new Converter[2];
private transient PrimitiveCategory[] inputTypes = new PrimitiveCategory[2];
private final Text output = new Text();
private transient boolean isStr;
private transient MessageDigest digest;
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
checkArgsSize(arguments, 2, 2);
checkArgPrimitive(arguments, 0);
checkArgPrimitive(arguments, 1);
// the function should support both string and binary input types
checkArgGroups(arguments, 0, inputTypes, STRING_GROUP, BINARY_GROUP);
checkArgGroups(arguments, 1, inputTypes, NUMERIC_GROUP);
if (PrimitiveObjectInspectorUtils.getPrimitiveGrouping(inputTypes[0]) == STRING_GROUP) {
obtainStringConverter(arguments, 0, inputTypes, converters);
isStr = true;
} else {
GenericUDFParamUtils.obtainBinaryConverter(arguments, 0, inputTypes, converters);
isStr = false;
}
// the hash length must be a constant so the MessageDigest can be set up once
if (arguments[1] instanceof ConstantObjectInspector) {
Integer lenObj = getConstantIntValue(arguments, 1);
if (lenObj != null) {
int len = lenObj.intValue();
if (len == 0) {
len = 256;
}
try {
digest = MessageDigest.getInstance("SHA-" + len);
} catch (NoSuchAlgorithmException e) {
// unsupported length: leave digest null so evaluate() returns NULL
}
}
} else {
throw new UDFArgumentTypeException(1, getFuncName() + " only takes constant as "
+ getArgOrder(1) + " argument");
}
ObjectInspector outputOI = PrimitiveObjectInspectorFactory.writableStringObjectInspector;
return outputOI;
}
@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
if (digest == null) {
return null;
}
digest.reset();
if (isStr) {
Text n = GenericUDFParamUtils.getTextValue(arguments, 0, converters);
if (n == null) {
return null;
}
digest.update(n.getBytes(), 0, n.getLength());
} else {
BytesWritable bWr = GenericUDFParamUtils.getBinaryValue(arguments, 0, converters);
if (bWr == null) {
return null;
}
digest.update(bWr.getBytes(), 0, bWr.getLength());
}
byte[] resBin = digest.digest();
String resStr = Hex.encodeHexString(resBin);
output.set(resStr);
return output;
}
@Override
public String getDisplayString(String[] children) {
return getStandardDisplayString(getFuncName(), children);
}
@Override
protected String getFuncName() {
return "sha2";
}
}
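Since the pom already pulls in JUnit, we can sanity-check the UDF before deploying it. Here is a minimal test sketch; the file path src/test/java/com/rramos/bigdata/utils/GenericUDFSha2Test.java and the test name are my own choices, and the expected value is the hash of 'ABC' from the documentation above:
package com.rramos.bigdata.utils;
import static org.junit.Assert.assertEquals;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.junit.Test;
public class GenericUDFSha2Test {
  @Test
  public void testSha256OnString() throws HiveException {
    GenericUDFSha2 udf = new GenericUDFSha2();
    // first argument: a plain string
    ObjectInspector strOI = PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    // second argument: the bit length, which must be passed as a constant
    ObjectInspector lenOI = PrimitiveObjectInspectorFactory
        .getPrimitiveWritableConstantObjectInspector(TypeInfoFactory.intTypeInfo,
            new IntWritable(256));
    udf.initialize(new ObjectInspector[] { strOI, lenOI });
    DeferredObject[] args = { new DeferredJavaObject(new Text("ABC")),
        new DeferredJavaObject(new IntWritable(256)) };
    Text result = (Text) udf.evaluate(args);
    // expected value taken from the sha2 documentation above
    assertEquals("b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78",
        result.toString());
  }
}
You can run it with mvn test.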
Now build the package:
mvn package
After compiling, you will find in the target directory the required jar (GenericUDFSha2-1.0-SNAPSHOT.jar) that you need to add to Hive.
You should follow your Hadoop distribution's instructions for deploying new jars.
Here are the Cloudera Instructions.
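For example, if you want the jar available for a permanent function, one option is to put it on HDFS. The target path below is just an assumption; adjust it to your environment:
hdfs dfs -mkdir -p /user/hive/udfs
hdfs dfs -put target/GenericUDFSha2-1.0-SNAPSHOT.jar /user/hive/udfs/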
Next, in your Hive session, you need to ADD JAR and create a FUNCTION or TEMPORARY FUNCTION:
ADD JAR ./target/GenericUDFSha2-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION sha2 AS 'com.rramos.bigdata.utils.GenericUDFSha2';
SELECT sha2(foo, 256) FROM bar LIMIT 1;
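A TEMPORARY FUNCTION only lives for the current session. If you deployed the jar to HDFS as sketched above, you can register a permanent function instead (the HDFS path is the same assumed one):
CREATE FUNCTION sha2 AS 'com.rramos.bigdata.utils.GenericUDFSha2'
USING JAR 'hdfs:///user/hive/udfs/GenericUDFSha2-1.0-SNAPSHOT.jar';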
The Matthew Rathbone Blog has some great tutorials on Hive functions. Take a look if you want to go deeper.
Cheers, RR