I recently used Spark Streaming and Kudu to build a real-time data warehouse. The reference material is sparse, so it took a fair amount of trial and error to get everything working. Here are the pitfalls I hit along the way.

Create a Kudu table from Impala

create table kudu_appbind_test(
md5 string,
userid string,
datetime_ string,
time_ string,
cardno string,
flag string,
cardtype string,
primary key(md5,userid,datetime_)
)
stored as kudu;
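Note that a table created through Impala is registered on the Kudu side with an impala:: prefix (usually impala::<database>.<table>), which is why the kudu.table option later in this post carries that prefix. To double-check the exact name, you can list tables with the Kudu Java client. A minimal sketch, assuming a master at server:7051 (a placeholder address) and kudu-client on the classpath:

import org.apache.kudu.client.KuduClient

import scala.collection.JavaConverters._

object ListKuduTables {
  def main(args: Array[String]): Unit = {
    // "server:7051" is a placeholder; use your own Kudu master address
    val client = new KuduClient.KuduClientBuilder("server:7051").build()
    try {
      // Impala-managed tables typically show up as "impala::<db>.<table>"
      client.getTablesList.getTablesList.asScala.foreach(println)
    } finally {
      client.shutdown()
    }
  }
}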

Choosing the dependency

See the Kudu docs: kudu.apache.org/docs/develo… The key points there are:

  • Use the kudu-spark_2.10 artifact if using Spark with Scala 2.10. Note that Spark 1 is no longer supported in Kudu starting from version 1.6.0, so to use Spark 1 with Kudu, 1.5.0 is the latest usable version.
  • Use the kudu-spark2_2.11 artifact if using Spark 2 with Scala 2.11.
  • kudu-spark versions 1.8.0 and below have slightly different syntax.
  • Spark 2.2+ requires Java 8 at runtime even though the Kudu Spark 2.x integration is Java 7 compatible. Spark 2.2 is the default dependency version as of Kudu 1.5.0.

I’m using Spark 2.4.0, Scala 2.11, and Kudu 1.8.0, so the right artifact is kudu-spark2_2.11-1.8.0.jar.

    <!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
    <dependency>
      <groupId>org.apache.kudu</groupId>
      <artifactId>kudu-spark2_2.11</artifactId>
      <version>1.8.0</version>
    </dependency>
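For reference, if you build with sbt instead of Maven, the equivalent dependency (an assumption based on the same coordinates, with scalaVersion set to 2.11) would be:

// sbt: %% appends the Scala binary version, yielding kudu-spark2_2.11
libraryDependencies += "org.apache.kudu" %% "kudu-spark2" % "1.8.0"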

But the following write statement throws an error:

kuduDF.write.format("kudu")
  .mode("append")
  .option("kudu.master"."server:7051")
  .option("kudu.table"."impala::kudu_appbind_test")
  .mode("append")
  .save()
java.lang.ClassNotFoundException: Failed to find data source: kudu. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:649)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: kudu.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:628)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:628)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:628)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:628)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:628)
  ... 51 more

The error message says that Spark cannot find a data source named kudu. I tried upgrading to kudu-spark2_2.11-1.9.0.jar, but that just produced a different error:

# using kudu-spark2_2.11-1.9.0.jar
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.kudu.spark.kudu.DefaultSource not a subtype
  at java.util.ServiceLoader.fail(ServiceLoader.java:239)
  at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:624)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided

The "not a subtype" error is typically a sign that two different versions of the kudu-spark classes ended up on the classpath. Digging into the kudu-spark source, I found this implicit class in the org.apache.kudu.spark.kudu package object:

  implicit class KuduDataFrameWriter[T](writer: DataFrameWriter[T]) {
    def kudu = writer.format("org.apache.kudu.spark.kudu").save
  }
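This means that with the package object imported, writer.kudu is just shorthand for writer.format("org.apache.kudu.spark.kudu").save. A sketch of the shorthand form, reusing the same placeholder master and table names as above:

import org.apache.kudu.spark.kudu._  // brings KuduDataFrameWriter into scope

// .kudu expands to .format("org.apache.kudu.spark.kudu").save
kuduDF.write
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .kudu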

Note that the format string is the fully qualified class name, not the format("kudu") shown in the official docs. I changed my write to use it, and it finally worked:

kuduDF.write.format("org.apache.kudu.spark.kudu")
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .save()
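The same fully qualified format string works for reading. A minimal sketch, assuming an existing SparkSession named spark (the df variable is reused in the SparkSQL example further down):

// Read the Kudu table back as a DataFrame via the same data source
val df = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .load()

df.show()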

The Spark integration with Kudu has several known limitations (kudu.apache.org/docs/develo…):

  • Kudu tables with a name containing upper case or non-ascii characters must be assigned an alternate name when registered as a temporary table.
  • Kudu tables with a column name containing upper case or non-ascii characters may not be used with SparkSQL. Columns may be renamed in Kudu to work around this issue.
  • <> and OR predicates are not pushed to Kudu, and instead will be evaluated by the Spark task. Only LIKE predicates with a suffix wildcard are pushed to Kudu, meaning that LIKE “FOO%” is pushed down but LIKE “FOO%BAR” isn’t.
  • Kudu does not support every type supported by Spark SQL. For example, Date and complex types are not supported.
  • Kudu tables may only be registered as temporary tables in SparkSQL. Kudu tables may not be queried using HiveContext (see the sketch after this list).
  • When writing to a partitioned Kudu table, the DataFrame's rows must fall into existing partitions; you cannot insert data into a partition that does not exist.
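To make the temporary-table limitation concrete, here is how a Kudu table is typically queried through SparkSQL. A sketch, reusing the df from the read example above:

// Kudu tables can only be queried via a temporary view, not HiveContext
df.createOrReplaceTempView("kudu_appbind_test")

// LIKE 'xx%' (suffix wildcard only) can be pushed down to Kudu;
// <>, OR, and LIKE 'xx%yy' predicates are evaluated by Spark instead
spark.sql(
  "SELECT userid, cardno FROM kudu_appbind_test WHERE md5 LIKE 'a1%'"
).show()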

Original post: blog.csdn.net/lzw2016/art…

More big data tips can be found at: github.com/josonle/Cod… and github.com/josonle/Big…