After covering the concepts of Spark in our last article, let's take a closer look at its core architecture and useful APIs. The video linked below is very informative.

Link: 7 Steps to Mastering Apache Spark 2.0

3. Core architecture of Apache Spark

To understand how the components of Spark interact, you need to grasp Spark's core architecture in some detail. The key terms and concepts come to life when they are explained, and this Spark Summit training video will help you start your Spark journey faster:

Video link: youtu.be/7ooZ4S7Ay6Y

4. DataFrames, Datasets and Spark SQL

In Step 3, you learned about Resilient Distributed Datasets (RDDs). They form the core data abstraction of Spark and are the basis for all other higher-level data abstractions and APIs, including DataFrames and Datasets.

In Spark 2.0, DataFrames and Datasets, built on top of RDDs, form the core high-level, structured distributed data abstraction. A DataFrame organizes data into named columns, which imposes a structure and schema on your data and dictates how you process it, express a computation, or issue a query. A Dataset goes a step further and provides strict compile-time type safety, so that certain type errors are found at compile time, not at run time. With structure and data types, Spark can understand how you express your computation, which typed columns or named fields you access in your data, and which domain-specific operations you use. Spark then optimizes your code through Spark 2.0's Catalyst optimizer and generates efficient bytecode through Project Tungsten.
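The difference between compile-time and run-time checking can be sketched as follows. This is a minimal illustration, assuming the spark-sql library is on the classpath and a local master; the Person case class, the sample rows, and the object and method names are hypothetical and only mirror the field names used in this article.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical record type with the fields fname, lname, age, weight.
case class Person(fname: String, lname: String, age: Int, weight: Double)

object TypeSafetySketch {
  // Typed Dataset access: the compiler checks that `age` exists and is numeric.
  def seniorNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val peopleDS = Seq(
      Person("Ada", "Lovelace", 61, 58.0),
      Person("Alan", "Turing", 41, 70.0)
    ).toDS()
    // peopleDS.filter(p => p.agee > 55)  // would not compile: Person has no field `agee`
    peopleDS.filter(p => p.age > 55).map(p => p.fname).collect().toSeq
  }

  // Untyped DataFrame access: a misspelled column name compiles fine
  // and is only rejected when Spark analyzes the query at run time.
  def misspelledColumnFails(spark: SparkSession): Boolean = {
    import spark.implicits._
    val peopleDF = Seq(Person("Ada", "Lovelace", 61, 58.0)).toDF()
    Try(peopleDF.select("agee")).isFailure
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("type-safety-sketch")
      .getOrCreate()
    println(seniorNames(spark))
    println(misspelledColumnFails(spark))
    spark.stop()
  }
}
```

The typed filter is checked by the Scala compiler, while the string-based column lookup is checked only once Spark resolves the column against the schema.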

DataFrames and Datasets offer APIs in several high-level programming languages, which makes your code more readable, and they support high-level operators such as filter, sum, count, avg, min, and max. Whether you express your computation in Spark SQL or in Python, Java, Scala, or R, the underlying code generation is the same, because every execution plan passes through the same Catalyst optimizer.
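The operators named above can be combined in a single DataFrame expression. The following is a rough sketch, assuming the spark-sql library and a local master; the sample rows, column aliases, and the AggregateSketch object are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, count, max, min, sum}

object AggregateSketch {
  // Filters the rows, then computes several aggregates in one pass.
  def summarize(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample data: (fname, age, weight)
    val peopleDF = Seq(
      ("Ada", 61, 58.0),
      ("Alan", 41, 70.0),
      ("Grace", 85, 55.0)
    ).toDF("fname", "age", "weight")

    peopleDF
      .filter($"age" > 40)
      .agg(
        count("*").as("n"),
        sum("weight").as("total_weight"),
        avg("age").as("avg_age"),
        min("age").as("min_age"),
        max("age").as("max_age")
      )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("aggregate-sketch")
      .getOrCreate()
    summarize(spark).show()
    spark.stop()
  }
}
```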

For example, the domain-specific Scala code below and its equivalent SQL query generate the same underlying code. Consider a Dataset of Scala objects of type Person and a SQL table named "person".

// a dataset object Person with field names fname, lname, age, weight
// access using object notation
val seniorDS = peopleDS.filter(p=>p.age > 55) 

// a dataframe with structure with named columns fname, lname, age, weight
// access using col name notation
val seniorDF = peopleDF.where(peopleDF("age") > 55)

// equivalent Spark SQL code
val seniorSqlDF = spark.sql("SELECT * FROM person WHERE age > 55")
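A self-contained version of this comparison might look like the sketch below. It assumes the spark-sql library and a local master; the Person case class, the sample rows, and the SamePlanSketch object are hypothetical. The DataFrame and SQL variants should print closely matching physical plans, whereas the typed lambda filter is opaque to Catalyst and typically shows up as a function call on deserialized objects.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type with the fields used in the snippets above.
case class Person(fname: String, lname: String, age: Int, weight: Double)

object SamePlanSketch {
  // Runs the same "older than 55" query through three APIs and returns the row counts.
  def seniorCounts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._
    val peopleDS = Seq(
      Person("Ada", "Lovelace", 61, 58.0),
      Person("Alan", "Turing", 41, 70.0),
      Person("Grace", "Hopper", 85, 55.0)
    ).toDS()
    val peopleDF = peopleDS.toDF()
    peopleDF.createOrReplaceTempView("person") // exposes the data to Spark SQL

    val seniorDS = peopleDS.filter(p => p.age > 55)                     // typed Dataset API
    val seniorDF = peopleDF.where(peopleDF("age") > 55)                 // DataFrame API
    val seniorSqlDF = spark.sql("SELECT * FROM person WHERE age > 55")  // Spark SQL

    // Print the physical plans so they can be compared side by side.
    seniorDF.explain()
    seniorSqlDF.explain()

    (seniorDS.count(), seniorDF.count(), seniorSqlDF.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("same-plan-sketch")
      .getOrCreate()
    println(seniorCounts(spark))
    spark.stop()
  }
}
```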

To learn why structure matters in Spark and why DataFrames, Datasets, and Spark SQL provide an efficient way to program Spark, watch this video (youtu.be/1a4pgYzeFwE).

5. Graph processing with GraphFrames

Although Spark has a general-purpose, RDD-based graph processing library, GraphX, which is optimized for distributed computing and supports graph algorithms, it has some limitations: it has no Java or Python APIs, and it is built on the low-level RDD API. Because of these constraints, it cannot benefit from the performance optimizations recently introduced through Project Tungsten and the Catalyst optimizer.

In contrast, GraphFrames, a DataFrame-based graph processing library, addresses all of these limitations: it provides a library analogous to GraphX but with a higher-level, more expressive and readable API in Java, Scala, and Python; it can save and load graphs; and it takes advantage of the underlying performance and query optimizations in Spark 2.0. In addition, it integrates with GraphX, which means that you can seamlessly convert a GraphFrame into the equivalent GraphX representation.
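Saving, loading, and converting to GraphX could be sketched as follows. This assumes the graphframes package and spark-graphx are on the classpath; the sample data, the output directory, and the GraphXInteropSketch object are hypothetical. Persisting the vertex and edge DataFrames as Parquet is one common way to save a graph, not the only one.

```scala
import java.nio.file.Files
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.{Row, SparkSession}
import org.graphframes.GraphFrame

object GraphXInteropSketch {
  // Builds a tiny, made-up airport graph.
  def sampleGraph(spark: SparkSession): GraphFrame = {
    import spark.implicits._
    val vertices = Seq(("JFK", "New York", "NY"), ("SEA", "Seattle", "WA"))
      .toDF("id", "city", "state")
    val edges = Seq(("JFK", "SEA", 45, 1058923))
      .toDF("src", "dst", "delay", "tripID")
    GraphFrame(vertices, edges)
  }

  // Saves the graph, loads it back, converts it to GraphX, and returns
  // (GraphFrame vertex count, GraphX vertex count, GraphX edge count).
  def roundTrip(spark: SparkSession, dir: String): (Long, Long, Long) = {
    val gf = sampleGraph(spark)
    // A GraphFrame is just two DataFrames, so each one is saved separately.
    gf.vertices.write.mode("overwrite").parquet(s"$dir/vertices")
    gf.edges.write.mode("overwrite").parquet(s"$dir/edges")

    val reloaded = GraphFrame(
      spark.read.parquet(s"$dir/vertices"),
      spark.read.parquet(s"$dir/edges")
    )
    // Equivalent GraphX representation; rows become vertex and edge attributes.
    val gx: Graph[Row, Row] = reloaded.toGraphX
    (reloaded.vertices.count(), gx.vertices.count(), gx.edges.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("graphx-interop-sketch")
      .getOrCreate()
    val dir = Files.createTempDirectory("graphframes-sketch").toString
    println(roundTrip(spark, dir))
    spark.stop()
  }
}
```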

In the example below, cities are identified by their airport codes. All vertices can be represented as rows of a DataFrame; similarly, all edges can be represented as rows of a DataFrame, each with its own named and typed columns. Together, the vertex and edge DataFrames make up a GraphFrame.

import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// create a Vertices DataFrame
val vertices = spark.createDataFrame(List(("JFK", "New York", "NY"))).toDF("id", "city", "state")
// create an Edges DataFrame
val edges = spark.createDataFrame(List(("JFK", "SEA", 45, 1058923))).toDF("src", "dst", "delay", "tripID")
// create a GraphFrame and use its APIs
val airportGF = GraphFrame(vertices, edges)
// filter all edges from the GraphFrame with delays greater than 30 mins
val delayDF = airportGF.edges.filter("delay > 30")
// using the PageRank algorithm, determine the airport ranking by importance
val pageRanksGF = airportGF.pageRank.resetProbability(0.15).maxIter(5).run()
// display is a Databricks notebook helper; outside a notebook, call .show() instead
display(pageRanksGF.vertices.orderBy(desc("pagerank")))

GraphFrames lets you express three kinds of powerful queries. The first is simple SQL-type queries on vertices and edges, such as which routes are likely to incur significant delays. The second is graph-type queries, such as how many edges come into and go out of a vertex. The third is motif queries, which find structural patterns in the graph from a pattern or path of vertices and edges that you provide.
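The three kinds of queries can be sketched against a small graph. This is an illustration only, assuming the graphframes package is on the classpath; the airports, delays, and the QueryTypesSketch object are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object QueryTypesSketch {
  // Hypothetical airport graph with three vertices and four edges.
  def airportGraph(spark: SparkSession): GraphFrame = {
    import spark.implicits._
    val vertices = Seq(
      ("JFK", "New York"), ("SEA", "Seattle"), ("SFO", "San Francisco")
    ).toDF("id", "city")
    val edges = Seq(
      ("JFK", "SEA", 45), ("SEA", "SFO", 10), ("SFO", "JFK", 60), ("JFK", "SFO", 5)
    ).toDF("src", "dst", "delay")
    GraphFrame(vertices, edges)
  }

  // 1. SQL-type query: routes with a delay above 30 minutes.
  def delayedRoutes(g: GraphFrame): DataFrame =
    g.edges.filter("delay > 30").select("src", "dst", "delay")

  // 2. Graph-type query: incoming and outgoing edge counts per vertex.
  def degrees(g: GraphFrame): DataFrame =
    g.inDegrees.join(g.outDegrees, "id")

  // 3. Motif query: two-hop paths a -> b -> c that do not return to the start.
  def twoHopPaths(g: GraphFrame): DataFrame =
    g.find("(a)-[ab]->(b); (b)-[bc]->(c)").filter("a.id != c.id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("query-types-sketch")
      .getOrCreate()
    val g = airportGraph(spark)
    delayedRoutes(g).show()
    degrees(g).show()
    twoHopPaths(g).show()
    spark.stop()
  }
}
```

Because each query returns an ordinary DataFrame, the results can be filtered, joined, or aggregated further with the usual DataFrame operators.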

In addition, GraphFrames supports all of the GraphX algorithms. For example, you can use PageRank to find the important vertices, determine the shortest path from a source to a destination, perform a breadth-first search (BFS), or identify strongly connected components to explore how vertices are connected.
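The algorithms mentioned above can be invoked roughly as follows. This is a hedged sketch, assuming the graphframes package is on the classpath; the graph data, the parameter values (iteration counts, path length), and the AlgorithmsSketch object are illustrative assumptions rather than recommendations.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

object AlgorithmsSketch {
  // Hypothetical airport graph: a cycle JFK -> SEA -> SFO -> JFK plus a direct JFK -> SFO edge.
  def airportGraph(spark: SparkSession): GraphFrame = {
    import spark.implicits._
    val vertices = Seq("JFK", "SEA", "SFO").toDF("id")
    val edges = Seq(
      ("JFK", "SEA"), ("SEA", "SFO"), ("SFO", "JFK"), ("JFK", "SFO")
    ).toDF("src", "dst")
    GraphFrame(vertices, edges)
  }

  // PageRank: returns the vertices with an added "pagerank" column.
  def ranks(g: GraphFrame): DataFrame =
    g.pageRank.resetProbability(0.15).maxIter(5).run().vertices

  // Shortest paths: hop counts from every vertex to the given landmark vertices.
  def distancesToSfo(g: GraphFrame): DataFrame =
    g.shortestPaths.landmarks(Seq("SFO")).run()

  // Breadth-first search: shortest paths from JFK to SFO, up to three hops.
  def jfkToSfo(g: GraphFrame): DataFrame =
    g.bfs.fromExpr("id = 'JFK'").toExpr("id = 'SFO'").maxPathLength(3).run()

  // Strongly connected components: adds a "component" column to the vertices.
  def components(g: GraphFrame): DataFrame =
    g.stronglyConnectedComponents.maxIter(10).run()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("algorithms-sketch")
      .getOrCreate()
    val g = airportGraph(spark)
    ranks(g).show()
    distancesToSfo(g).show(truncate = false)
    jfkToSfo(g).show(truncate = false)
    components(g).show()
    spark.stop()
  }
}
```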

In the webinar (go.databricks.com/graphframes…), Spark community contributor Joseph Bradley talks about the motivation for GraphFrames and their ease of use for graph processing, as well as the benefits of the DataFrame-based API. As part of the webinar, you will also see how convenient the GraphFrames library is to use, along with all of the query types and algorithms described above.

In Apache Spark 2.0, many Spark components, including MLlib for machine learning and Streaming, increasingly offer equivalent DataFrame APIs because of their performance gains, ease of use, and high level of abstraction and structure. Where necessary or appropriate for your use case, you can choose GraphFrames instead of GraphX. Here is a neat summary and comparison between GraphX and GraphFrames.

The GraphFrames library continues to get faster. A newer version of GraphFrames, compatible with Spark 2.0, is available as a Spark package.

Jules S. Damji & Sameer Farooqui, Databricks. Article source: www.kdnuggets.com/2016/09/7-s…