Key points:

  • Learn about machine learning data pipelining.
  • How to implement machine learning data pipelining using the Apache Spark machine learning package.
  • Steps in data value chain processing.
  • Spark machine learning pipeline module and API.
  • Text classification and spam detection use cases.

In the previous articles in this series, we learned about the Apache Spark framework: an introduction to Spark and its libraries for big data processing (Part 1), the Spark SQL library (Part 2), Spark Streaming (Part 3), and the Spark MLlib machine learning library (Part 4).

In this article, we will discuss the other Spark machine learning API, called Spark ML, which is the recommended solution for developing big data applications based on data pipelines.


The Spark ML (spark.ml) package provides a machine learning API built on top of DataFrames, which are a core part of the Spark SQL library. This package can be used to develop and manage machine learning pipelines. It also provides feature extractors, transformers, and selectors, and supports machine learning techniques such as classification, regression, and clustering. All of these are critical for developing machine learning solutions.

Here we take a look at exploratory data analysis using Apache Spark, developing machine learning pipelines, and using the APIs and algorithms provided in the Spark ML package.

With support for building machine learning data pipelines, the Apache Spark framework is now a great choice for building a comprehensive use case that includes ETL, descriptive analytics, real-time streaming analytics, machine learning, graph processing, and visualization.

Machine learning data pipelining

Machine learning pipelines can be used to create, tune, and validate machine learning workflows. The machine learning pipeline lets us focus on the big data requirements and machine learning tasks in a project, rather than spending time and energy on infrastructure and distributed computing. It also helps us when dealing with machine learning problems where we develop and combine models iteratively during the data exploration phase.

Machine learning workflows typically involve a series of processing and learning stages. The machine learning data pipeline is often described as a sequence of stages, each of which is either a transformer or an estimator component. These stages are executed in order, and the input data is processed and transformed as it moves through each stage of the pipeline.

Machine learning development frameworks need to support distributed computation as well as utilities for assembling pipeline components. Other requirements for building data pipelines include fault tolerance, resource management, scalability, and maintainability.

In real projects, machine learning workflow solutions also include tools for model import and export, cross-validation for parameter selection, and data aggregation from multiple data sources. They also provide data utilities such as feature extraction, feature selection, and statistics. These frameworks support machine learning pipeline persistence, so that models and pipelines can be saved and loaded for future use.

The concept of machine learning workflows and the composition of workflow stages has become increasingly popular in many different systems. Machine learning frameworks such as scikit-learn and GraphLab also use the pipeline concept in their design.

A typical data value chain process consists of the following steps:

  • Discovery
  • Ingestion
  • Processing
  • Persistence
  • Integration
  • Analysis
  • Exposition

Machine learning data pipelining uses a similar approach. The following table lists the different steps involved in machine learning pipeline processing.

Step # | Name | Description
ML1 | Data ingestion | Import data from different data sources.
ML2 | Data cleansing | The data is preprocessed to get it ready for the machine learning data analysis that follows.
ML3 | Feature extraction | Also known as feature engineering, this step extracts features from the data set.
ML4 | Model training | The machine learning model is trained using the training data set.
ML5 | Model validation | The effectiveness of the machine learning model is evaluated based on different prediction parameters. The model is also tuned during the validation step, which is used to pick the best model.
ML6 | Model testing | The model is tested before it is deployed.
ML7 | Model deployment | The final step is to deploy the selected model to run in production.

Table 1: Machine learning pipeline processing steps

These steps can also be represented in Figure 1 below.

Figure 1: Machine learning data pipeline processing flow diagram

Let’s take a look at each step in detail.

Data ingestion: The data we collect for machine learning pipeline applications can come from a variety of sources and can range in size from a few hundred gigabytes to several terabytes. Big data applications are also characterized by ingesting data in different formats.

Data cleansing: Data cleansing is the first and most crucial step in the entire data analysis pipeline. It is also referred to as data scrubbing or data wrangling. Its main purpose is to turn the input data into a structured form suitable for subsequent data processing and predictive analysis. Depending on the quality of the data coming into the system, 60 to 70 percent of the total processing time is spent cleaning the data and converting it into a proper format so that machine learning models can be applied to it.

There are always quality problems with data, such as incomplete records or incorrect and invalid data items. The data cleaning step typically uses custom transformers in the pipeline to perform the required cleaning actions.

Sparse or coarse-grained data is another challenge in data analysis. Many edge cases arise here, so we use the data cleansing techniques described above to ensure that the data entering the pipeline is of high quality.

Data cleansing is often an iterative process, revisited with each successive attempt as our understanding of the problem and of the model improves. Data wrangling tools such as Trifacta, OpenRefine, or ActiveClean can be used to carry out the data cleaning task.
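As an illustration only (not part of the original workshop example), the snippet below sketches a simple cleansing step using the Spark DataFrame API; the input DataFrame and the "text" column name are hypothetical.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower, regexp_replace, trim}

// Minimal cleansing sketch: assumes a DataFrame with a "text" column,
// such as the dfAllData DataFrame built later in this article.
def cleanTextData(rawData: DataFrame): DataFrame = {
  rawData
    .na.drop()            // drop rows that contain null values
    .dropDuplicates()     // drop exact duplicate rows
    .withColumn("text",   // lower-case, strip punctuation, trim whitespace
      trim(regexp_replace(lower(col("text")), "[^a-z0-9\\s]", " ")))
}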

Feature extraction: In feature extraction (sometimes called feature engineering), techniques such as Hashing Term Frequency and Word2Vec are used to extract specific features from the raw data. The output of this step is often an assembled feature vector, which is passed on for processing in the next step.
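As a brief sketch (HashingTF is shown in the sample program later; the DataFrame and column names here are hypothetical), Word2Vec can be used for feature extraction as follows:

import org.apache.spark.ml.feature.Word2Vec

// Word2Vec is an estimator: fit() learns word vectors from a column of token
// sequences, and the resulting model transforms each sequence into a feature vector.
val word2Vec = new Word2Vec()
  .setInputCol("words")      // column of Seq[String] tokens
  .setOutputCol("features")  // output feature vector column
  .setVectorSize(100)
  .setMinCount(1)

// val w2vModel = word2Vec.fit(tokenizedData)            // tokenizedData: hypothetical DataFrame
// val featurizedData = w2vModel.transform(tokenizedData)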

Model training: Machine learning model training involves providing an algorithm and some training data for the model to learn from. The learning algorithm discovers patterns in the training data and generates an output model.

Model validation: This step involves evaluating and tuning the machine learning model to measure how effective it is at making predictions. As mentioned in this article, Receiver Operating Characteristic (ROC) curves can be used as an evaluation metric for binary classification models. A ROC curve represents the performance of a binary classifier system. It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings.
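For example (a sketch that assumes a predictions DataFrame like the one produced at the end of this article), Spark ML's BinaryClassificationEvaluator can compute the area under the ROC curve:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Evaluates binary classification output; it reads the "label" column and the
// "rawPrediction" column produced by classifiers such as LogisticRegression.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

// val areaUnderROC = evaluator.evaluate(predictions)  // predictions: hypothetical DataFrame
// println(s"Area under ROC = $areaUnderROC")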

Model selection: Model selection uses data to choose parameters for transformers and estimators. This is also a key step in machine learning pipelining. Classes such as ParamGridBuilder and CrossValidator provide APIs for selecting the machine learning model.
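The following sketch (not part of the original example; the training DataFrame is assumed to have "label" and "features" columns) shows how ParamGridBuilder and CrossValidator fit together:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Grid of candidate parameter values to try.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.maxIter, Array(5, 10))
  .build()

// CrossValidator fits the estimator for every parameter combination,
// scores each model with k-fold cross-validation, and keeps the best one.
// A whole Pipeline can be used as the estimator instead of a single stage.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingData)  // trainingData: hypothetical DataFrame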

Model deployment: Once the right model has been selected, we can deploy it, start feeding it new data, and obtain predictive results. Machine learning models can also be deployed as web services.
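As a sketch of one possible deployment step (the path shown is hypothetical; ML pipeline persistence is available as of Spark 2.0), a fitted pipeline model can be saved and reloaded elsewhere:

import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline model produced later in this article (lrModel).
// lrModel.write.overwrite().save("/models/spam-detector")

// Load the same model in another application, for example behind a web service.
// val deployedModel = PipelineModel.load("/models/spam-detector")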

Spark Machine Learning

The machine learning pipeline API was introduced in version 1.2 of the Apache Spark framework. It gives developers APIs to create and execute complex machine learning workflows. The goal of the pipeline API is to let users quickly and easily assemble and configure practical distributed machine learning pipelines by standardizing the APIs for different machine learning concepts. The pipeline API is contained in the org.apache.spark.ml package.

Spark ML also helps to combine multiple machine learning algorithms into a pipeline.

The Spark machine learning API is split into two packages: spark.mllib and spark.ml. The spark.mllib package contains the original API, built on top of RDDs. The spark.ml package provides a higher-level API, built on top of DataFrames, for constructing machine learning pipelines.

The RDD-based MLlib library API is now in maintenance mode.

As shown in Figure 2 below, Spark ML is a very important big data analysis library in the Apache Spark ecosystem.

Figure 2: Spark ecosystem with Spark ML

Machine learning pipeline components

A machine learning data pipeline consists of several components needed to complete the data analysis task. The key components of a data pipeline are listed below:

  • Datasets
  • Pipelines
  • Pipeline stages
  • Transformers
  • Estimators
  • Evaluators
  • Parameters (and parameter maps)

Let’s take a quick look at how these modules fit into the overall steps.

Datasets: A DataFrame is used to represent the data set in a machine learning pipeline. It allows structured data to be stored in named columns. These columns can hold text, feature vectors, true labels, and predictions.

Pipelines: A machine learning workflow is modeled as a pipeline, which consists of a sequence of stages. Each stage processes the input data to produce output for the next stage. A pipeline chains multiple transformers and estimators together to specify a machine learning workflow.

Pipeline stages: There are two kinds of pipeline stages: transformers and estimators.

Transformers: A transformer is an algorithm that converts one DataFrame into another. For example, a machine learning model is a transformer that converts a DataFrame with features into a DataFrame with predictions.

A transformer converts one DataFrame into another, typically by appending new columns. For example, in the Spark ML package, OneHotEncoder converts a column of label indices into a column of binary vectors. Each transformer has a transform() method that, when called, converts one DataFrame into another.

Estimators: An estimator is a machine learning algorithm that learns from the data you provide. The input to an estimator is a DataFrame and the output is a transformer. An estimator is used to train a model, and the trained model it produces is a transformer. For example, a logistic regression estimator produces a logistic regression model, which is a transformer. Another example is K-means: as an estimator, it takes training data and produces a K-means model, which is a transformer.
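A minimal sketch (using Tokenizer and LogisticRegression as examples; the DataFrames are hypothetical) contrasting the two kinds of stages:

import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.classification.LogisticRegression

// A transformer: transform() maps one DataFrame to another by appending columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// val tokenized = tokenizer.transform(inputData)        // inputData: hypothetical DataFrame

// An estimator: fit() learns from a DataFrame and returns a model, which is a transformer.
val lr = new LogisticRegression().setMaxIter(5)
// val lrModel = lr.fit(trainingFeatures)                // trainingFeatures: hypothetical DataFrame
// val predictions = lrModel.transform(testFeatures)     // the fitted model is itself a transformer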

Parameters: Machine learning modules use a common API to describe parameters. An example of a parameter is the maximum number of iterations to be used by the model.
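A short sketch of the parameter API (the estimator and training DataFrame are hypothetical): parameters can be set directly on a stage, or supplied as a ParamMap when fit() is called.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()
lr.setMaxIter(10)                                   // set a parameter with a setter

// A ParamMap overrides parameters for a single fit() call.
val paramMap = ParamMap(lr.maxIter -> 5, lr.regParam -> 0.01)
// val model = lr.fit(trainingData, paramMap)       // trainingData: hypothetical DataFrame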

The following figure shows the components of a data pipeline used for text classification.

Figure 3: Data pipelining using Spark ML

Use cases

One use case for machine learning pipelines is text classification. Such use cases usually include the following steps:

  • Cleaning the text data
  • Converting the data into feature vectors, and
  • Training the classification model

In text classification, data preprocessing steps such as n-gram extraction and TF-IDF feature weighting are carried out before the classification model (for example an SVM) is trained.
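As an illustration of the n-gram step mentioned above (a sketch with hypothetical column names; it is not used in the sample program later in this article), the spark.ml NGram transformer turns token sequences into n-grams:

import org.apache.spark.ml.feature.NGram

// Builds bigrams from an input column of Seq[String] tokens.
val ngram = new NGram()
  .setN(2)
  .setInputCol("words")
  .setOutputCol("bigrams")

// val withBigrams = ngram.transform(tokenizedData)   // tokenizedData: hypothetical DataFrame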

Another machine learning pipeline use case is the image classification described in this article.

There are many other machine learning use cases, including fraud detection (using classification models, which are part of supervised learning) and customer segmentation (using clustering models, which are part of unsupervised learning).

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is within a given collection of documents. It is widely used in information retrieval to weight the relevance of a word to a document in a collection.

TF: The more often a word appears in a document, the more important it is to that document. It is calculated as follows:

TF = (# of times word X appears in a document) / (Total # of words in the document)

IDF: However, if a word appears frequently across many documents (words such as "the", "and", "of", etc.), it carries little distinguishing meaning and should therefore be weighted lower.
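A standard formulation, written here in the same style as the TF formula above, is:

IDF = log(Total # of documents / # of documents containing word X)

TF-IDF = TF * IDF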

The sample program

Let’s take a look at an example application to see how the Spark ML package can be used in a big data processing system. We will develop a document classifier to identify spam content in the program’s input data. The input data set used for testing includes documents, emails, or any other content received from external systems that may contain spam.

We will build our sample application on the spam detection example from the “Building Machine Learning Applications with Spark” workshop presented at the Strata Hadoop World Conference.

Use case

This use case analyzes the various messages sent to our system. Some of the messages contain spam and some do not. Our goal is to use the Spark ML API to find the messages that contain spam.

Algorithm

We will use the logistic regression algorithm from machine learning. Logistic regression is a regression analysis model that predicts the probability of a yes-or-no outcome based on one or more independent variables.
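For reference (this formula is not part of the original example), the logistic model estimates the probability of the positive outcome from a linear combination of the independent variables:

p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))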

Detailed solution

Let’s take a look at the details of the Spark ML sample application and how to run it.

Data ingestion: We import both data that contains spam (text files) and data that does not.

Data cleansing: In our sample program we do not perform any special data cleansing; we simply aggregate all the data into a single DataFrame object.

We randomly split the data into training data and test data and create an array holding the two data sets. In this case the split is 70% training data and 30% test data.

In the subsequent pipeline operations, we use these two data sets to train the model and to make predictions, respectively.

Our machine learning data pipeline consists of four steps:

  • Tokenizer
  • HashingTF
  • IDF
  • LR

We create a pipeline object and set the above stages on it. We can then fit the pipeline to the training data to create a logistic regression model.

Now we use the test data (the new data set) to make predictions with the model.

The architecture diagram of the sample application is shown in Figure 4 below.

Figure 4: Data classifier architecture diagram

Technologies

The following technologies were used to implement the machine learning pipeline solution.

Technology | Version
Apache Spark | 2.0.0
JDK | 1.8
Maven | 3.3

Table 2: Technologies and tools used in the machine learning example

The Spark ML program

The machine learning code, based on the workshop examples, is written in the Scala programming language and can be run directly from the Spark shell console.

Spam detection Scala code snippets:

Step 1: Create a custom class to store the details of the spam content.

case class SpamDocument(file: String, text: String, label: Double)

Step 2: Initialize the SQLContext and import the implicit conversions that turn Scala objects into DataFrames. The data sets are then loaded from the specified directories where the input files are stored; wholeTextFiles returns an RDD, and DataFrame objects are created from the RDDs of the two data sets.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

//
// Load the data files with spam
//
val rddSData = sc.wholeTextFiles("SPAM_DATA_FILE_DIR", 1)
val dfSData = rddSData.map(d => SpamDocument(d._1, d._2, 1)).toDF()
dfSData.show()

//
// Load the data files with no spam
//
val rddNSData = sc.wholeTextFiles("NO_SPAM_DATA_FILE_DIR", 1)
val dfNSData = rddNSData.map(d => SpamDocument(d._1, d._2, 0)).toDF()
dfNSData.show()

Step 3: Now, aggregate the data sets and split the combined data into training data and test data at a 70% to 30% ratio.

//
// Aggregate both data frames
//
val dfAllData = dfSData.unionAll(dfNSData)
dfAllData.show()

//
// Split the data into training and test sets
//
val Array(trainingData, testData) = dfAllData.randomSplit(Array(0.7, 0.3))

Step 4: Now configure the machine learning data pipeline by creating the stages we discussed earlier in this article: Tokenizer, HashingTF, and IDF. The pipeline is then fit to the training data to create the regression model, in this case a logistic regression model.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

//
// Configure the ML data pipeline
//

//
// Create the Tokenizer step
//
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

//
// Create the TF and IDF steps
//
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("rawFeatures")

val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

//
// Create the Logistic Regression step
//
val lr = new LogisticRegression()
  .setMaxIter(5)
lr.setLabelCol("label")
lr.setFeaturesCol("features")

//
// Create the pipeline
//
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, idf, lr))

val lrModel = pipeline.fit(trainingData)
println(lrModel.toString())

Step 5: Finally, we call the transform() method on the fitted model to make predictions on the test data.

//
// Make predictions.
//
val predictions = lrModel.transform(testData)

//
// Display prediction results
//
predictions.select("file", "text", "label", "features", "prediction").show(300)

Conclusion

The Spark machine learning library is one of the most important libraries in the Apache Spark framework and is used to implement data pipelines. In this article, we learned how to use the Spark ML package APIs and used them to implement a text classification use case.

Next up

A graph data model is about connections and relationships between different entities in a data model. Graph data processing has received a lot of attention recently because it can be used to solve many problems, from fraud detection to recommendation engine development.

The Spark framework provides a library dedicated to graph data analysis. Next in this series of articles, we’ll learn about the library called Spark GraphX. We will use Spark GraphX to develop a sample application for graph data processing and analysis.

References

  • Big Data Processing with Apache Spark – Part 1: Introduction
  • Big Data Processing with Apache Spark – Part 2: Spark SQL
  • Big Data Processing with Apache Spark – Part 3: Spark Streaming
  • Big Data Processing with Apache Spark – Part 4: Spark Machine Learning
  • Official website of the Apache Spark project
  • Spark machine learning website
  • Spark machine learning programming guide
  • Spark workshop spam detection exercise

About the author

Srini Penchikala currently lives in Austin, Texas and works as a senior software architect. He has over 22 years of experience in software architecture, development, and design. Penchikala is currently writing a book about Apache Spark. He is also one of the authors of Manning's book "Spring Roo in Action". He has spoken at conferences including JavaOne, SEI Architecture Technology Conference (SATURN), IT Architecture Conference (ITARC), No Fluff Just Stuff, NoSQL Now, Enterprise Data World, and Project World Conference. Penchikala has also published numerous articles on software architecture, security, and risk management, as well as articles on NoSQL and big data topics, on InfoQ, TheServerSide, O'Reilly Network (ONJava), DevX Java, Java.net, and JavaWorld. He is also editor-in-chief of the Data Science community at InfoQ.

Big Data Processing with Apache Spark – Part 5: Spark ML Data Pipelines