Preface

A few words first

A graph database is a kind of non-relational database that uses graph theory to store the relationships between entities, which gives it a natural advantage in describing, storing, and querying the associations in a knowledge graph. Commonly used graph databases today include Neo4j, JanusGraph, Giraph, and TigerGraph. Neo4j performs very well thanks to its own storage model, but the community edition does not support clustering and scales poorly. JanusGraph supports HBase, Cassandra, Google Cloud Bigtable, and others as the underlying storage, and Elasticsearch, Apache Solr, or Apache Lucene as the index backend; it implements the TinkerPop standard graph framework and scales well.

At the development level, we usually interact with JanusGraph data through an embedded Java application or by connecting to JanusGraph Server. However, for a representative distributed graph database, never having used it together with a distributed computing engine such as Spark left me uneasy. Based on the latest release, JanusGraph 0.4, this post records the errors I ran into, and how I resolved them, while importing graph data into JanusGraph in Spark on YARN mode.

Background

  • JanusGraph 0.4 (HBase as the storage backend)
  • Spark 2.3.1 (HDP)
  • CentOS 7.4
  • IDEA Community Edition 2019

Set off!

Maven Configuration

As long as the required functionality is covered, the fewer dependency JARs the better.
<dependencies>

        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-all</artifactId>
            <version>0.4.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.janusgraph</groupId>
                    <artifactId>janusgraph-berkeleyje</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-graphx -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.tinkerpop/spark-gremlin -->
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>spark-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>

    </dependencies>
  1. The janusgraph-berkeleyje artifact could not be downloaded; since we do not use BerkeleyDB as the storage backend, we simply exclude it.
  2. The httpclient package is required for the Elasticsearch index, as discussed in the previous post in this column.
  3. The code reads data from a CSV file on HDFS and writes it to JanusGraph using Spark.
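
The job in the next section opens the graph with `JanusGraphFactory.open(conf)` but never shows what `conf` holds. As a rough sketch only, and assuming `conf` is the path to a properties file (one of the forms `JanusGraphFactory.open` accepts), the settings for an HBase store with an Elasticsearch index could be generated as follows. The object name `GraphConf`, its methods, and both hostnames are hypothetical; the index name `search` is a common convention, not something the article states.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object GraphConf {
  // Hostnames are placeholders, not values from the article.
  val settings: Seq[(String, String)] = Seq(
    "storage.backend"       -> "hbase",
    "storage.hostname"      -> "zk-host-1,zk-host-2", // hypothetical ZooKeeper quorum
    "index.search.backend"  -> "elasticsearch",
    "index.search.hostname" -> "es-host-1"            // hypothetical Elasticsearch node
  )

  /** Renders the settings as key=value lines, one per setting. */
  def render(kv: Seq[(String, String)]): String =
    kv.map { case (k, v) => s"$k=$v" }.mkString("", "\n", "\n")

  /** Writes the settings to a temporary .properties file and returns its path. */
  def writeTempFile(kv: Seq[(String, String)] = settings): Path = {
    val path = Files.createTempFile("janusgraph-", ".properties")
    Files.write(path, render(kv).getBytes(StandardCharsets.UTF_8))
    path
  }
}
```

With this in place, `conf` could be `GraphConf.writeTempFile().toString`. Calling it inside the partition closure would create the file on the executor that needs it, which matters once the job runs on YARN rather than locally.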

The code

import org.apache.spark.sql.SparkSession
import org.janusgraph.core.JanusGraphFactory

def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("test-load-data")
      .master("local[*]") // Comment this line out when packaging for spark-submit, so that YARN mode is used
      .getOrCreate()

    // Each CSV row becomes (vertex label, property1 as Int, property2)
    val rdd = spark.read.csv("hdfs://101.bigdata:8020/xxx/graphData.csv")
      .rdd.map(x => {
      (x.getString(0), x.getString(1).toInt, x.getString(2))
    })

    rdd.foreachPartition { x => {
      //var tx = janusGraph.newTransaction()
      // One graph connection per partition; `conf` is the JanusGraph configuration, defined elsewhere
      val janusGraph = JanusGraphFactory.open(conf)
      val g = janusGraph.traversal()
      var counts = 0L
      try {
        x.foreach(y => {
          g.addV(y._1).property("property1", y._2).property("property2", y._3).next()
          counts += 1
          // Commit every 1000 vertices to keep each transaction small
          if (counts % 1000 == 0) {
            g.tx().commit()
            println("#" * 48)
            println("#" * 17 + counts + "#" * 17)
            println("#" * 48)
          }
        })
      } finally {
        // Commit whatever is left and release the connection
        g.tx().commit()
        janusGraph.close()
      }
    }
    }
    println("------- this means the end --------")
}
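
The commit-every-N-records pattern in the loop above can be separated from the graph calls. The sketch below is one possible way to do that, not the code used in this post: `BatchCommit.writeInBatches` and its parameters are hypothetical names, and the `write` and `commit` functions stand in for `g.addV(...)...next()` and `g.tx().commit()`.

```scala
object BatchCommit {
  /** Applies `write` to every row and calls `commit` once per batch,
    * including the final partial batch. Returns the number of rows written.
    * All names here are illustrative; nothing is part of JanusGraph or Spark. */
  def writeInBatches[T](rows: Iterator[T], batchSize: Int)
                       (write: T => Unit)(commit: () => Unit): Long = {
    require(batchSize > 0, "batchSize must be positive")
    var total = 0L
    // grouped() yields chunks of at most batchSize elements, so the last
    // chunk is committed as well, with no separate leftover branch.
    rows.grouped(batchSize).foreach { batch =>
      batch.foreach(write)
      commit()
      total += batch.size
    }
    total
  }
}
```

Inside `foreachPartition`, the call might look like `BatchCommit.writeInBatches(x, 1000)(y => g.addV(y._1).property("property1", y._2).property("property2", y._3).next())(() => g.tx().commit())`, with `janusGraph.close()` still placed in a `finally` block.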

Bug1 – LZ4BlockInputStream

Exception in thread "main" java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V

The solution

This is a JAR conflict. Exclude the lower-version lz4 dependency, as shown here:

        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-all</artifactId>
            <version>0.4.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.janusgraph</groupId>
                    <artifactId>janusgraph-berkeleyje</artifactId>
                </exclusion>
                <exclusion>
                    <artifactId>lz4</artifactId>
                    <groupId>net.jpountz.lz4</groupId>
                </exclusion>
            </exclusions>
        </dependency>

With this change, the job runs fine in local mode.

Bug2 – Stopwatch

When the job is submitted with spark-submit and the master is set to yarn in cluster mode, it fails with java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createStarted(). This error is quite common, and the culprit is the Guava library. After a long struggle, it was finally solved:

The solution

Note the order in which JARs are loaded: the root cause is that Guava 14 is picked up while the submitted Spark job runs, whereas the Stopwatch.createStarted() method was only added in version 15.
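
One way to confirm which Guava JAR is actually being picked up is to ask the JVM where a class was loaded from. The helper below is a diagnostic sketch using only standard JVM APIs; `JarLocator` and `locationOf` are hypothetical names, and the default class name is the one from the error above.

```scala
object JarLocator {
  /** Returns the location (usually a JAR path) a class was loaded from,
    * or None if the class cannot be loaded or comes from the JVM itself. */
  def locationOf(className: String): Option[String] =
    try {
      val cls = Class.forName(className)
      for {
        domain <- Option(cls.getProtectionDomain)
        source <- Option(domain.getCodeSource)
        url    <- Option(source.getLocation)
      } yield url.toString
    } catch {
      case _: ClassNotFoundException | _: LinkageError => None
    }

  def main(args: Array[String]): Unit = {
    // Defaults to the class from the NoSuchMethodError; other names can be passed as arguments.
    val names = if (args.nonEmpty) args.toSeq else Seq("com.google.common.base.Stopwatch")
    names.foreach(n => println(s"$n -> ${locationOf(n).getOrElse("not found")}"))
  }
}
```

Because the driver and the YARN executors can see different classpaths, it may be worth calling `JarLocator.locationOf` inside the `foreachPartition` closure as well and checking the executor logs.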

The Guava JAR under the spark2/jars path is version 14. The application works with either version 16 or 18, but to keep changes to a minimum we switched to version 16. Replacing only the JAR in the Spark installation directory is not enough, because a job submitted to YARN still depends on the Spark libraries that YARN itself uses at run time. The final change is as follows: replace the Guava JAR inside spark2-hdp-yarn-archive.tar.gz on the machine with the version 16 JAR. (Note that this test uses the HDP integrated environment.)

The data is ready.

The next step

Next, I plan to move on to reading graph data with Spark and using Spark for OLAP analysis of that data. Bye.

Thanks

Some of the screenshots were provided by a friend of mine. Thank you, @xiaosample!