This article is from the OPPO Internet technology team. Please credit the author when reposting. You are also welcome to follow our official account, OPPO_tech, where we share OPPO's cutting-edge Internet technology and activities.

1. The significance of feature processing

More often than not, the data we get contains dirty data or noise. Before model training, this data needs to be preprocessed, otherwise even the best model can only produce "garbage in, garbage out" results.

Data preprocessing mainly includes three parts: feature extraction, feature transformation, and feature selection.

2. Feature extraction

Feature extraction generally refers to the process of extracting features from original data.

1. CountVectorizer

(1) Definition and usage: CountVectorizer builds a vocabulary over all the text and uses the per-document word counts as the feature vector.

(2) Code examples in Spark ML

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
      (0, Array("a", "e", "a", "d", "b")),
      (1, Array("a", "c", "b", "d", "c", "f", "a", "b")),
      (2, Array("a", "f"))
)).toDF("id", "words")
val cv_model = new CountVectorizer().setInputCol("words").setOutputCol("features").setVocabSize(10).setMinDF(2).fit(df)
val cv1 = cv_model.transform(df)
cv1.show(false)

Note: CountVectorizer builds a single vocabulary over all the input data. The setVocabSize and setMinDF parameters control which words enter the vocabulary: setVocabSize caps the vocabulary length, while setMinDF sets how many different documents a word must appear in before it is admitted. In the example above, the vocabulary length is set to 10 and a word must appear in at least two documents, so only a, b, d and f enter the vocabulary; c and e each appear in only one document, so their frequencies are not counted.

2. Term frequency-inverse document frequency (TF-IDF)

(1) Definition and usage: intuitively, TF-IDF measures how well a word distinguishes a document. It combines the word's frequency within a document with how many documents in the corpus contain that word; judging a document by raw word frequency alone is unreasonable.

For example, a document may contain many common words, but such words are useless for identifying the document. What we need are distinctive words that occur frequently within a document yet appear in only a small number of documents. To take an extreme example, the word "we" may appear in dozens of documents and carry no discriminating power at all. Many readers will point out that such words can be removed with a stop-word list, and that is true; the same reasoning also applies to words that, while not stop words, still have little discriminating value.

(2) Code examples in Spark ML

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val wordsData = spark.createDataFrame(Seq(
      "Legendary Game Warrior".split(""),
      "Apples, pears and bananas.".split(""),
      "IPhone fluency.".split("")
    ).map(Tuple1.apply)).toDF("words")
wordsData.show(false)    
// step1 hashingTF
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(2000)
val featurizedData = hashingTF.transform(wordsData)
// step2 compute IDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("words", "features").show(false)


Note: setNumFeatures sets the length of the feature vector. In the three documents above, every word except "Apple" appears in only one document, so those words have high value for identifying a document. "Apple" appears in two documents, which lowers its value for identifying documents and hence its IDF weight.

3. Word-to-vector (Word2Vec)

(1) Definition and usage: Word2Vec maps words into a vector space and represents each word with a vector; the similarity of words can then be measured by the distance between their vectors.

(2) Code examples in Spark ML

import org.apache.spark.ml.feature.Word2Vec
val documentDF = spark.createDataFrame(Seq(
      "Legendary Game Warrior".split(""),
      "Apples, pears and bananas.".split(""),
      "Legendary game variety".split(""),
      "IPhone fluency.".split("")
    ).map(Tuple1.apply)).toDF("text")
val word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(10).setMinCount(2)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.show(false)


Note: setVectorSize sets the length of the vector, and setMinCount sets the minimum number of documents a word must appear in before it is converted to a vector. In the example above, the vector length is 10 and a word must appear in at least two documents; only the words "Legend", "game" and "Apple" meet that condition. The vectors for documents 1 and 3 are therefore exactly the same, because "warrior" and "variety" each appear only once and are not converted to vectors. If setMinCount were set to 1, the vectors of documents 1 and 3 would be very close but no longer identical, because "warrior" and "variety" would also be taken into account.

3. Feature transformation

1. Continuous data is converted into discrete data

1.1 Binarizer
  1. Definition and purpose: The process of converting continuous data into 0-1 features based on threshold values.

  2. Note: A feature value greater than the threshold is mapped to 1.0; a value less than or equal to the threshold is mapped to 0.0. The Binarizer input column (inputCol) supports both Vector and Double types.
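
  3. A minimal code sketch (assuming a SparkSession named spark; the column names and the 0.5 threshold are illustrative assumptions, not from the original article):

import org.apache.spark.ml.feature.Binarizer

// Hypothetical data: a single Double column to be binarized
val binDF = spark.createDataFrame(Seq(
  (0, 0.1), (1, 0.8), (2, 0.5)
)).toDF("id", "feature")

// Values > 0.5 become 1.0; values <= 0.5 become 0.0
val binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarizedFeature")
  .setThreshold(0.5)

binarizer.transform(binDF).show(false)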

1.2 Bucketizer
  1. Definition and Purpose: To convert continuous data into corresponding segments according to segment rules.

  2. Examples of code in Spark ML:

import org.apache.spark.ml.feature.Bucketizer
val data = Array(-8.0, -0.5, -0.3, 0.0, 0.2)
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val bucketizer = new Bucketizer().setInputCol("features").setOutputCol("bucketedFeatures").setSplits(splits)
bucketizer.transform(dataFrame).show(false)  

Copy the code

Note: in the third line of the code above, the split rule defines four buckets: (-∞, -0.5), [-0.5, 0), [0, 0.5) and [0.5, +∞). Each bucket is left-closed and right-open, i.e. [a, b). Double.NegativeInfinity and Double.PositiveInfinity should be added as the outer splits when the lower and upper bounds of the data are not known.

1.3 QuantileDiscretizer
  1. Definition and purpose: To convert continuous data into corresponding segments according to quantile rule.

  2. Examples of code in Spark ML:

import org.apache.spark.ml.feature.QuantileDiscretizer
val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")
val discretizer = new QuantileDiscretizer().setInputCol("hour").setOutputCol("result").setNumBuckets(3)
val result = discretizer.fit(df).transform(df)
result.show()


Note: setNumBuckets sets the number of quantile buckets to 3, so the hour column is divided into three segments.

2. String and index conversion

2.1 String-to-index transform (StringIndexer)
  1. Definition and purpose: converts string features into indexes. Many models accept only numeric features during training, so strings need to be converted into numeric values first.

  2. Examples of code in Spark ML:

import org.apache.spark.ml.feature.StringIndexer
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id"."category")
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)
indexed.show(false)


Note: indexes are assigned by label frequency, so the most frequent label gets index 0. A new data set may contain strings that were not seen during training: if the training set has a, b, c and the new data set has a, b, c, d, there are two strategies for handling the new string d. The first is to throw an exception (the default); the second is to ignore rows containing such labels entirely by calling setHandleInvalid("skip"). A sketch of the second strategy follows.
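
A minimal sketch of the skip strategy, reusing the StringIndexer and df from the example above; newDF stands for a hypothetical new DataFrame that contains the unseen label d:

// Drop rows whose label was not seen during fit, instead of throwing an exception
val skipIndexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")   // the default, "error", throws on unseen labels

// skipIndexer.fit(df).transform(newDF).show(false)  // rows with label d are dropped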

2.2 Index-to-string transform (IndexToString)

Definition and purpose: generally used together with the string-to-index transform above. String features are first converted into numeric features by StringIndexer; after model training, IndexToString restores the numeric values to the original string labels.
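
A minimal sketch of the round trip, reusing the indexed DataFrame from the StringIndexer example above (column names are just for illustration):

import org.apache.spark.ml.feature.IndexToString

// Restore the original string labels from the numeric indexes.
// By default IndexToString reads the label metadata that StringIndexer attached to the column.
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")

converter.transform(indexed).show(false)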

3. Normalizer

  1. Definition and usage: normalization here operates on each row of data, i.e. each sample. Every row is normalized by computing its p-norm. This puts the input data on a common scale and can improve the performance of the downstream learning algorithm.

  2. Examples of code in Spark ML:

import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
val data = Seq(
  Vectors.dense(-1, 1, 1, 8, 56),
  Vectors.dense(-1, 3, -1, -9, 88),
  Vectors.dense(0, 5, 1, 10, 96),
  Vectors.dense(0, 5, 1, 11, 589),
  Vectors.dense(0, 5, 1, 11, 688)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val normalizer = new Normalizer().setInputCol("features").setOutputCol("normFeatures").setP(1.0)
normalizer.transform(df).show(false)


4. Standardization (StandardScaler)

  1. Definition and purpose: standardization operates on each column of data, i.e. each one-dimensional feature, scaling every feature to a common standard deviation. On the one hand, values of the same feature can differ greatly across samples, and abnormally small or large values can mislead model training. On the other hand, a very scattered data distribution also affects the training results. In both cases the variance is very large, which is what standardization addresses.

  2. Examples of code in Spark ML:

import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.{Vector, Vectors}
val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")
val scaler = new StandardScaler().setInputCol("features")
.setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)
val scalerModel = scaler.fit(dataFrame)
val scaledData = scalerModel.transform(dataFrame)
scaledData.show(false)

Note: the code above scales each column to unit standard deviation. If a feature has zero standard deviation, its value in the output vector defaults to 0.0.

5. Principal Component Analysis (PCA)

(1) Definition and purpose: principal component analysis (PCA) is a statistical method. In essence it performs a change of basis in a linear space so that the variance of the data projected onto the new low-dimensional space is maximized. The weight, or importance, of each new axis is determined by the variance after the transformation, and the axes with the largest weights become the principal components. PCA is mainly used for dimensionality reduction.

(2) Code examples in Spark ML:

import org.apache.spark.ml.feature.{PCA, StandardScaler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val scaledDataFrame = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").fit(df).transform(df)
val pca = new PCA().setInputCol("scaledFeatures").setOutputCol("pcaFeatures").setK(3).fit(scaledDataFrame)
val pcaDF = pca.transform(scaledDataFrame)
pcaDF.select("features", "pcaFeatures").show(false)

Note: setK sets the target dimensionality k. The example above starts with 5-dimensional features and reduces them to 3 dimensions with PCA. The feature vectors must be standardized before PCA: otherwise the principal components differ by orders of magnitude, while after standardization they are roughly on the same scale and the result is more reasonable. For choosing k, start with a relatively large value, compute the explained variance of the model via pcaModel.explainedVariance, and pick the k at which the variance levels off.
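
A rough sketch of that k-selection idea, reusing scaledDataFrame from the example above (the larger k of 5 is just an assumption for illustration):

// Fit PCA with a larger k first, then inspect how much variance each component explains
val probeModel = new PCA()
  .setInputCol("scaledFeatures")
  .setOutputCol("pcaFeatures")
  .setK(5)
  .fit(scaledDataFrame)

// One proportion per principal component; choose k where the values level off
println(probeModel.explainedVariance)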

6. VectorIndexer

(1) Definition and application: mainly used to convert the discrete features inside a feature vector into category indexes in batch.

(2) Code examples in Spark ML:

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
  Vectors.dense(-1, 1, 1, 8, 56),
  Vectors.dense(-1, 3, -1, -9, 88),
  Vectors.dense(0, 5, 1, 10, 96),
  Vectors.dense(0, 5, 1, 11, 589)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val indexer = new VectorIndexer().setInputCol("features").setOutputCol("indexed").setMaxCategories(3)
val indexerModel = indexer.fit(df)
indexerModel.transform(df).show(false)

Note: setMaxCategories(k) treats any feature with at most k distinct values as categorical and converts it into indexes. In the example above setMaxCategories is set to 3; the second column has three distinct values, so it is re-encoded as 0, 1, 2.

7. SQLTransformer

(1) Definition and purpose: people who are used to processing data with SQL can use the SQL transformer to build features.

(2) Code examples in Spark ML:

import org.apache.spark.ml.feature.SQLTransformer
val df = spark.createDataFrame(Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()

8. OneHotEncoder

One-hot encoding maps a column of label indexes to a column of binary vectors, with at most one non-zero value per row.
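
A minimal sketch, assuming a SparkSession named spark and the Spark 2.x OneHotEncoder, which is a plain transformer (in Spark 3.x OneHotEncoder is an estimator and needs fit before transform):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val catDF = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
)).toDF("id", "category")

// Index the string labels first, then one-hot encode the numeric index
val indexedDF = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIndex")
  .fit(catDF).transform(catDF)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex").setOutputCol("categoryVec")
encoder.transform(indexedDF).show(false)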

9. Max-min scaling (MinMaxScaler)

Rescales each feature to a specified range, usually [0, 1].
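
A minimal sketch (assuming a SparkSession named spark; the data and column names are illustrative):

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val mmDF = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")

// Rescale every feature to [0, 1]; setMin/setMax can change the target range
val mmScaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures")
mmScaler.fit(mmDF).transform(mmDF).show(false)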

10. Feature vector combination (VectorAssembler)

Combines the original features and the features produced by different feature transformers into a single feature vector. The values of the input columns are appended to the new vector in the specified order.
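
A minimal sketch (assuming a SparkSession named spark; the columns are illustrative):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val rawDF = spark.createDataFrame(Seq(
  (0, 18.0, 1.0, Vectors.dense(0.0, 10.0, 0.5))
)).toDF("id", "hour", "clicked", "userFeatures")

// Append hour, clicked and the userFeatures vector, in order, into one feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "clicked", "userFeatures"))
  .setOutputCol("features")

assembler.transform(rawDF).show(false)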

4. Feature selection

Feature selection means choosing simpler, more effective features from a feature vector. It is used in high-dimensional data analysis to eliminate redundant features and improve model performance. The selected features are a subset of the original features.

1. VectorSlicer

Selects the required features from an existing feature vector by index or column name.
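
A minimal sketch (assuming a SparkSession named spark; selecting by index here, since selecting by name requires attribute metadata on the vector column):

import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors

val sliceDF = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(-2.0, 2.3, 0.0, 5.0))
)).toDF("userFeatures")

// Keep only the features at indices 1 and 3 of the input vector
val slicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("selectedFeatures")
  .setIndices(Array(1, 3))

slicer.transform(sliceDF).show(false)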

2. RFormula

Generates a feature vector and a label column from an R model formula. It works much like string indexing plus one-hot encoding under the hood, so all discrete features can be converted into a numeric representation with very little code.
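
A minimal sketch (assuming a SparkSession named spark; the formula and columns are illustrative):

import org.apache.spark.ml.feature.RFormula

val clickDF = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

// clicked ~ country + hour: clicked becomes the label; country is indexed and
// one-hot encoded, hour stays numeric, and both are assembled into "features"
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

formula.fit(clickDF).transform(clickDF).select("features", "label").show(false)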

3. Chi square feature selection (ChiSqSelector)

(1) Definition and usage: chi-square feature selection ranks features using the chi-square test of independence against the class label. It applies when you have a large set of features and do not know which of them are useful; chi-square selection can screen features quickly. Its drawback is that it is relatively slow.

(2) Code examples in Spark ML:

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
val df = spark.createDataFrame(data).toDF("id", "features", "clicked")
val selector = new ChiSqSelector().setNumTopFeatures(2).setFeaturesCol("features").setLabelCol("clicked").setOutputCol("selectedFeatures")
val result = selector.fit(df).transform(df)
result.show(false)


Finally, here’s the point

The OPPO Business Center data tagging team is hiring for multiple positions. We are committed to using big data to understand the business interests of every OPPO user. We are looking for people with more than two years of experience in data analysis, big data processing, machine learning/deep learning, or NLP to join us and grow with our team and business.

Resume: ping.wang#oppo.com