There are few official reference materials on reading from and writing to HBase with Spark. I recently happened to be testing HBase and its integration with Spark, so I took the chance to learn about this topic. There are two main Spark-on-HBase driver implementations: the official driver (hbase-connectors) and Hortonworks SHC.

hbase-connectors

This is the purely official driver. It used to be a separate project and is now maintained together with the Kafka connector in the hbase-connectors repository. Its advantage is that it is easy to reference: you add the dependency directly in Maven, and in general there are no dependency conflicts.
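For example, referencing the Spark connector is just an ordinary Maven dependency. The coordinates and version below are an assumption on my side; check the hbase-connectors releases for the version that matches your HBase:

        <dependency>
            <groupId>org.apache.hbase.connectors.spark</groupId>
            <artifactId>hbase-spark</artifactId>
            <version>1.0.0</version>
        </dependency>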

But there are obvious drawbacks:

  • It is complicated to use: you have to build the HBase data structures (such as Put) yourself. Few examples can be found online, but the driver itself ships with some examples that can help you get running quickly. If the example link is broken, you can open the project and look for the files whose names end with "Example".
  • You can only construct a JavaHBaseContext and call the methods it provides, and those methods are all RDD based, so it is not very friendly to the DataFrame API commonly used in Spark 2.x. A minimal sketch of this RDD style follows this list.
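To give a sense of that RDD-based style, here is a minimal sketch of a bulk put with JavaHBaseContext. The table name, column family, and data are made up for illustration, and I am assuming the target table already exists:

    import java.util.Arrays;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.spark.JavaHBaseContext;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class JavaHBaseBulkPutSketch {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext("local[*]", "hbase-bulk-put-sketch");
            // JavaHBaseContext is the entry point of the official connector
            JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, HBaseConfiguration.create());

            // A toy RDD of "rowKey,value" strings
            JavaRDD<String> rdd = jsc.parallelize(Arrays.asList("row1,foo", "row2,bar"));

            // Every record has to be turned into an HBase Put by hand
            hbaseContext.bulkPut(rdd, TableName.valueOf("demo_table"), record -> {
                String[] parts = record.split(",");
                Put put = new Put(Bytes.toBytes(parts[0]));
                put.addColumn(Bytes.toBytes("general"), Bytes.toBytes("value"), Bytes.toBytes(parts[1]));
                return put;
            });

            jsc.stop();
        }
    }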

hortonworks-spark

This is the driver contributed by Hortonworks (SHC). It adds some encapsulation on top of the official driver, so its usage is basically the same as Spark's other built-in data sources (JDBC, etc.), and it is DataFrame friendly.

But its biggest problem is how to reference it:

  • SHC is published as a Spark package in the repository hosted by Hortonworks, and that repository cannot be reached from a network environment without a proxy.

Note: hortonworks:shc-core:1.1-2.1-s_2.11 has not been published to spark-packages.org, but will be there soon.

  • To reference the JAR via Maven, you need to add the Hortonworks repository to your Maven repositories.

Note: this artifact is located in the Hortonworks repository (repo.hortonworks.com/content/rep…).

  • So you can only use SHC either by configuring the Hortonworks repository so that the dependent JARs are downloaded automatically, or by preparing the JARs yourself and passing them to spark-submit with --jars. A sketch of both options is shown below.
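For concreteness, here is a sketch of the two options with spark-submit. The repository URL, SHC version, class name, and JAR paths are placeholders you need to adjust to your own environment:

    # Option A: let spark-submit resolve SHC from the Hortonworks repository
    spark-submit \
      --repositories <hortonworks-repository-url> \
      --packages com.hortonworks:shc-core:<shc-version> \
      --class com.example.HBaseDemo \
      my-job.jar

    # Option B: prepare the JARs yourself and ship them explicitly
    spark-submit \
      --jars /path/to/shc-core.jar \
      --class com.example.HBaseDemo \
      my-job.jar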

Due to network issues, the connection to the Hortonworks repository is too slow for me. If you package things yourself, you can, as discussed in the related issue, add the referenced JAR packages and run with them directly. However, because of version constraints, the commands in that link are basically out of date. I did get it running once by referencing the JAR packages this way, but the same problems will reappear whenever the version of Spark, HBase, or SHC changes. So instead of giving a command that shows how to run it, I want to record a method for solving the problem:

  • Check your Spark and HBase versions. Currently, the latest Spark and HBase versions supported by SHC are 2.4.0 and 2.0.4 respectively. If the versions you use are close to or earlier than these, SHC can be used properly.
  • Find the branch or tag in the SHC repository that best matches your Spark and HBase versions. If you can find a release package, download it directly; otherwise compile it from source. Either way you get the shc-core JAR package, which you can put on the classpath directly or reference in your own project with a Maven system-scoped dependency.
  • Reading the SHC source shows that it mainly uses classes from the HBase server and client modules, so we just need to add those dependencies to our own project's Maven build; SHC's own POM can be used as a reference. After testing, only hbase-server and hbase-mapreduce are needed to run; the Phoenix and Avro dependencies are only needed if you use the related features.
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.0.4</version>
        </dependency>
  • Finally, prune the Maven dependency tree with exclusions. The main likely conflicts are the netty and Jackson packages, which I excluded from the HBase dependencies; a sketch follows this list.
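For reference, here is a sketch of what such an exclusion can look like. Which artifacts you actually need to exclude depends on your own dependency tree; netty and Jackson are just the ones that conflicted for me:

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.0.4</version>
            <exclusions>
                <exclusion>
                    <groupId>io.netty</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>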

As for usage, SHC's encapsulation is very good and shields you from most of the details of HBase. The only extra thing you need to write is an HBase table catalog. Since the official examples already cover the Scala version in detail, I will record a Java example here.

        // Imports needed in addition to the usual Spark SQL ones:
        // import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog;
        // import java.util.HashMap;
        // import java.util.Map;

        // Catalog describing how the HBase row key and columns map to DataFrame columns
        String catalog = "{\r\n"
                + "\"table\":{\"namespace\":\"default\", \"name\":\"demo_person_with_row_key\", \"tableCoder\":\"PrimitiveType\"},\r\n"
                + "\"rowkey\":\"idname\",\r\n" + "\"columns\":{\r\n"
                + "\"idname\":{\"cf\":\"rowkey\", \"col\":\"idname\", \"type\":\"string\"},\r\n"
                + "\"id\":{\"cf\":\"general\", \"col\":\"id\", \"type\":\"string\"},\r\n"
                + "\"name\":{\"cf\":\"general\", \"col\":\"name\", \"type\":\"string\"},\r\n"
                + "\"mobile\":{\"cf\":\"general\", \"col\":\"mobile\", \"type\":\"string\"}\r\n" + "}\r\n" + "}";

        Map<String, String> map = new HashMap<>();
        map.put(HBaseTableCatalog.tableCatalog(), catalog);
        // Number of regions to use if SHC has to create the table (relevant when writing)
        map.put(HBaseTableCatalog.newTable(), "5");

        // Read the HBase table as a DataFrame through the SHC data source
        Dataset<Row> df = spark.read().options(map).format("org.apache.spark.sql.execution.datasources.hbase").load();
        df.show();

        // Writing back works the same way:
        // df.write().options(map).format("org.apache.spark.sql.execution.datasources.hbase").save();