This article describes how to read Nebula Graph data using the Spark Connector.

Spark Connector overview

The Spark Connector is a Spark data connector that can be used to read data from and write data to external data systems. It consists of two parts, the Reader and the Writer. This article focuses on the Spark Connector Reader; the Writer will be covered in detail in the next article.

Spark Connector Reader principle

The Spark Connector Reader exposes Nebula Graph as an extended data source for Spark: data is read from Nebula Graph into a DataFrame, on which subsequent map, reduce, and other operations can be performed.

Spark SQL allows users to define their own data sources, which makes it possible to extend Spark to external systems. Data read through Spark SQL is a DataFrame, a distributed dataset organized into named columns. Spark SQL also provides numerous APIs for computing on and transforming DataFrames, and the same DataFrame interface can be used across many different data sources.
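For reference, this is what the DataFrame reading pattern looks like against a built-in data source (a minimal illustration only; the JSON path and the name/age columns are hypothetical):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("reader-demo").getOrCreate()

// A DataFrame is a distributed dataset organized into named columns;
// the same DataFrame API applies regardless of which data source produced it.
val people: DataFrame = spark.read.json("/tmp/people.json")  // hypothetical sample file
people.filter(people("age") > 21).select("name", "age").show()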

Spark's interfaces for external data sources are defined in the org.apache.spark.sql.sources package. Let's first look at the interfaces Spark SQL provides for extending data sources.

Basic Interfaces

  • BaseRelation: represents a collection of tuples with a known Schema. All subclasses that inherit from BaseRelation must generate a StructType Schema. In other words, BaseRelation defines the format in which data read from the data source is stored in the Spark SQL DataFrame.
  • RelationProvider: takes a list of parameters and returns a new BaseRelation built from those parameters.
  • DataSourceRegister: registers a short name for the data source, so that users can refer to it by its user-defined shortName instead of its fully qualified class name.

Providers

  • RelationProvider: produces a custom relation for the specified data source; its createRelation() builds a new relation from the given parameters.
  • SchemaRelationProvider: builds a new relation from the given parameters together with a given schema.

RDD

  • RDD[InternalRow]: the RDD constructed after scanning the data source, from which the rows of the resulting DataFrame are built.

To implement a custom Spark external data source, you need to implement some of the interfaces above, depending on the characteristics of the data source.
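For orientation, the following is a minimal, self-contained sketch of a custom data source built from these interfaces. It is purely illustrative, not the connector's actual code: the DefaultSource and DemoRelation classes and the "demo" short name are made up, and it uses the simple TableScan variant, which returns an RDD[Row].

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// DataSourceRegister: lets users write .format("demo") instead of the fully qualified class name.
// RelationProvider: builds a BaseRelation from the options passed to spark.read.
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "demo"
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext)
}

// BaseRelation: declares the StructType schema of the resulting DataFrame.
// TableScan: produces the RDD[Row] that backs the DataFrame.
class DemoRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("name", StringType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
}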

With Nebula Graph’s Spark Connector, we implemented Nebula Graph as an external data source for Spark SQL, with data read in the form of sparkSession.read. The main classes of the implementation are as follows:

  1. NebulaRelatioProvider implements RelationProvider to provide the custom relation and DataSourceRegister to register the external data source.
  2. NebulaRelation defines Nebula Graph’s data schema and the data conversion methods. Its getSchema() method connects to Nebula Graph’s Meta service to obtain the schema information for the configured return fields.
  3. NebulaRDD performs the actual read of the Nebula Graph data. Its compute() method defines how the data is read: it scans Nebula Graph, converts each Nebula Graph row into Spark’s InternalRow, and each InternalRow forms one row of the RDD, representing one row of data in Nebula Graph. Finally, all of the Nebula Graph data is read out in per-partition iterations and assembled into the final DataFrame result. (A simplified sketch of this RDD follows the list.)
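Below is a heavily simplified, hypothetical sketch of the RDD piece. The class and field names are made up for illustration and the scan itself is only indicated by comments; the real implementation lives in the nebula-spark module on GitHub.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// One Spark partition corresponds to a slice of the Nebula Graph parts to scan.
case class SketchPartition(index: Int) extends Partition

class NebulaRDDSketch(sc: SparkContext, partitionNum: Int) extends RDD[InternalRow](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until partitionNum).map(i => SketchPartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    // 1. Scan the Nebula Graph parts assigned to `split`.
    // 2. Convert each returned Nebula Graph row's property values into Spark values.
    // 3. Wrap each row as an InternalRow; Spark assembles the partitions into the DataFrame.
    val scannedRows: Iterator[Seq[Any]] = Iterator.empty   // stands in for the real storage scan
    scannedRows.map(values => InternalRow.fromSeq(values))
  }
}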

Spark Connector Reader practice

The Reader part of the Spark Connector provides an interface for users to read data: one tag or edge type is read per call, and the result is returned as a DataFrame.

To get started, pull the Spark Connector code from GitHub and build it:

git clone -b v1.0 git@github.com:vesoft-inc/nebula-java.git
cd nebula-java/tools/nebula-spark
mvn clean compile package install -Dgpg.skip -Dmaven.javadoc.skip=true

The mvn install step above places the compiled package into your local Maven repository.

The following is an example:

  1. Add the nebula-spark dependency to the pom.xml file of your Maven project:
<dependency>
    <groupId>com.vesoft</groupId>
    <artifactId>nebula-spark</artifactId>
    <version>1.1.0</version>
</dependency>
  2. Read the Nebula Graph data in your Spark program:
// Read Nebula Graph vertex data
val vertexDataset: Dataset[Row] = spark.read
  .nebula("127.0.0.1:45500", "spaceName", "100")
  .loadVerticesToDF("tag", "field1,field2")
vertexDataset.show()

// Read Nebula Graph edge data
val edgeDataset: Dataset[Row] = spark.read
  .nebula("127.0.0.1:45500", "spaceName", "100")
  .loadEdgesToDF("edge", "*")
edgeDataset.show()

Configuration description (a short combined example follows the option list):

  • nebula(address: String, space: String, partitionNum: String)
address: the Nebula Graph service addresses; multiple addresses are separated by commas (,), for example ip1:45500,ip2:45500
space: the name of the Nebula Graph graph space
partitionNum: the number of Spark partitions used when reading Nebula parts. It is recommended to use the partitionNum that was specified when the Nebula Graph space was created, so that one Spark partition reads one Nebula part.
  • loadVerticesToDF(tag: String, fields: String)
tag: the tag name in Nebula Graph
fields: the property fields of the tag, with field names separated by commas; only the listed fields are read, and * reads all fields
  • loadEdgesToDF(edge: String, fields: String)
edge: the edge type name in Nebula Graph
fields: the property fields of the edge, with field names separated by commas; only the listed fields are read, and * reads all fields
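Combining these options, reading every property of one tag from a cluster with two addresses might look like the snippet below. The addresses, the graph space name test, and the tag name player are placeholders, and the same setup as in the example above is assumed:

// Hypothetical values: two service addresses, graph space "test", 100 Spark partitions;
// "*" reads every property field of the "player" tag.
val players: Dataset[Row] = spark.read
  .nebula("ip1:45500,ip2:45500", "test", "100")
  .loadVerticesToDF("player", "*")
players.show()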

Other

Spark Connector Reader GitHub code: github.com/vesoft-inc/…

Special thanks to Half Cloud for contributing the Java version of the Spark Connector.



Want to exchange ideas on graph database technology? NebulaGraphbot will take you into the Nebula Graph community.