This technical column is a summary and distillation of the author's (Qin Kaixin) day-to-day work: cases are drawn from real business environments, summarized, and shared, along with tuning suggestions for business applications and capacity planning for cluster environments. Please continue to follow this series. Copyright notice: no reprinting; you are welcome to learn from it. QQ email: [email protected]; feel free to get in touch for business exchanges.
1 Model selection
- Model selection can be carried out for a single Estimator, such as logistic regression or a decision tree.
- Model selection can also tune an entire Pipeline at once, eliminating the need to tune each element of the Pipeline individually.
- Estimator: the algorithm or Pipeline to be tuned.
- ParamMap: the set of parameter combinations to search over, e.g. the number of iterations and the regularization strength.
- Evaluator: measures how well a fitted model performs on held-out test data.
2 Model validation
- Spark ML currently supports two tools for this: CrossValidator (k-fold cross-validation) and TrainValidationSplit (a single train/validation split).
3 Model training process
- Split the data into training and test sets.
- For each ParamMap in the parameter grid, fit on the training data, then evaluate the model's performance on the test data with the Evaluator.
- Select the best-performing parameter set and use it to generate the final model.
4 Evaluator
- RegressionEvaluator is used for regression problems.
- BinaryClassificationEvaluator is used for binary classification; its default metric is AUC (areaUnderROC).
- MulticlassClassificationEvaluator is used for multiclass problems.
- The default metric used to select the best ParamMap can be overridden with each Evaluator's setMetricName method, as sketched below.
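A minimal sketch of overriding the defaults (the metric names are the ones documented for Spark ML; the val names are just for illustration):
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator, RegressionEvaluator}
// Each evaluator exposes setMetricName to replace its default metric:
val binEval = new BinaryClassificationEvaluator().setMetricName("areaUnderPR")    // default: areaUnderROC
val multiEval = new MulticlassClassificationEvaluator().setMetricName("accuracy") // default: f1
val regEval = new RegressionEvaluator().setMetricName("mae")                      // default: rmse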
5 ML cross-validation Pipeline in practice
5.1 The CrossValidator method
- A CrossValidator splits the data set into k folds (e.g., three), each used once as a test set while the remaining folds form the training set, so with k = 3 there are three training sets and three test sets.
- With 3-fold cross-validation, each (training, test) pair uses 2/3 of the data for training and 1/3 for testing.
- To evaluate a particular ParamMap, the CrossValidator fits the Estimator on the three different training sets and averages the Evaluator's metric over the three resulting models.
- After determining the best ParamMap, the CrossValidator finally re-fits the Estimator on the entire data set using that ParamMap.
An example is as follows:
Suppose we choose 2-fold cross-validation and a parameter grid with two parameters: hashingTF.numFeatures with 3 candidate values and lr.regParam with 2. How many models are trained? (3×2)×2 = 12, i.e., 12 models are fitted, so the cost is clearly very high.
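As a sanity check, the count can be computed from the grid itself; a minimal sketch assuming the paramGrid and cv values built in section 5.2 below:
// Each ParamMap in the grid is fitted once per fold:
val numFits = paramGrid.length * cv.getNumFolds // 6 * 2 = 12
// CrossValidator then performs one final re-fit on the full data set with the best ParamMap.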
5.2 CrossValidator hands-on example
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row
Prepare the training data in the format (id, text, label):
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0),
(4L, "b spark who", 1.0),
(5L, "g d a y", 0.0),
(6L, "spark fly", 1.0),
(7L, "was mapreduce", 0.0),
(8L, "e spark program", 1.0),
(9L, "a e c l", 0.0),
(10L, "spark compile", 1.0),
(11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")
1 Configure an ML Pipeline consisting of three stages: tokenizer, hashingTF, and lr
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
For reference, the tokenizer's output can be inspected directly:
val tokenized = tokenizer.transform(training)
tokenized.show()
scala> tokenized.rdd.foreach(println)
[0,a b c d e spark,1.0,WrappedArray(a, b, c, d, e, spark)]
[1,b d,0.0,WrappedArray(b, d)]
[2,spark f g h,1.0,WrappedArray(spark, f, g, h)]
[3,hadoop mapreduce,0.0,WrappedArray(hadoop, mapreduce)]
[4,b spark who,1.0,WrappedArray(b, spark, who)]
[5,g d a y,0.0,WrappedArray(g, d, a, y)]
[6,spark fly,1.0,WrappedArray(spark, fly)]
[7,was mapreduce,0.0,WrappedArray(was, mapreduce)]
[8,e spark program,1.0,WrappedArray(e, spark, program)]
[9,a e c l,0.0,WrappedArray(a, e, c, l)]
[10,spark compile,1.0,WrappedArray(spark, compile)]
[11,hadoop software,0.0,WrappedArray(hadoop, software)]
2 Configure an ML HashingTF
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
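Note: if numFeatures is not set, HashingTF defaults to 2^18 = 262144 features; the parameter grid below therefore searches much smaller, explicit values. For illustration:
hashingTF.getNumFeatures // 262144 (2^18) unless overridden
// hashingTF.setNumFeatures(1000) // an explicit alternative to searching a grid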
3 Configure an ML LogisticRegression; the default column names label, features, and prediction can be used.
val lr = new LogisticRegression().setMaxIter(10)
4 Build the Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
5 Use ParamGridBuilder to build a parameter grid: hashingTF.numFeatures has three candidate values and lr.regParam has two, so the grid contains 3×2 = 6 parameter combinations for the CrossValidator to search.
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(lr.regParam, Array(0.1, 0.01)).build()
Array({
hashingTF_a4b3e2e4efc2-numFeatures: 10,
logreg_3f15efefe425-regParam: 0.1
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 10,
logreg_3f15efefe425-regParam: 0.01
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 100,
logreg_3f15efefe425-regParam: 0.1
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 100,
logreg_3f15efefe425-regParam: 0.01
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 1000,
logreg_3f15efefe425-regParam: 0.1
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 1000,
logreg_3f15efefe425-regParam: 0.01
})
6 CrossValidator cross-validation; the default evaluation metric is AUC
Here the entire Pipeline is treated as a single Estimator.
This lets us jointly select parameters across all of the Pipeline's stages.
A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
The Evaluator here is a BinaryClassificationEvaluator, whose default metric is areaUnderROC.
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
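The 12 fits are independent of one another, so on Spark 2.3+ they can also be evaluated in parallel; a minimal sketch (not part of the original example):
// Spark 2.3+: evaluate up to 4 parameter settings concurrently.
val cvParallel = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2).setParallelism(4)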
7 Train the model and build the test set
val cvModel = cv.fit(training)
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
8 Apply the model to the test set and show the results
val allresult=cvModel.transform(test)
allresult.show
+---+---------------+------------------+--------------------+--------------------+--------------------+----------+
| id| text| words| features| rawPrediction| probability|prediction|
+---+---------------+------------------+--------------------+--------------------+--------------------+----------+
| 4| spark i j k| [spark, i, j, k]|(10,[5,6,9],[1.0,...|[0.52647041270060...|[0.62865951622023...| 0.0|
| 5| l m n| [l, m, n]|(10,[5,6,8],[1.0,...|[-0.6393098371808...|[0.34540256830050...| 1.0|
| 6|mapreduce spark|[mapreduce, spark]|(10,[3,5],[1.0,1.0])|[-0.6753938557453...|[0.33729012038845...| 1.0|
| 7| apache hadoop| [apache, hadoop]|(10,[1,5],[1.0,1.0])|[-0.9696913340282...|[0.27494203016056...| 1.0|
+---+---------------+------------------+--------------------+--------------------+--------------------+----------+
9 Apply the model and print the detailed results
val allresult=cvModel.transform(test)
allresult.rdd.foreach(println)
[4,spark i j k,WrappedArray(spark, i, j, k),(10,[5,6,9],[1.0,1.0,2.0]),[0.5264704127006035,-0.5264704127006035],[0.6286595162202399,0.37134048377976003],0.0]
[5,l m n,WrappedArray(l, m, n),(10,[5,6,8],[1.0,1.0,1.0]),[-0.6393098371808272,0.6393098371808272],[0.3454025683005015,0.6545974316994986],1.0]
[6,mapreduce spark,WrappedArray(mapreduce, spark),(10,[3,5],[1.0,1.0]),[-0.6753938557453469,0.6753938557453469],[0.3372901203884568,0.6627098796115432],1.0]
[7,apache hadoop,WrappedArray(apache, hadoop),(10,[1,5],[1.0,1.0]),[-0.9696913340282707,0.9696913340282707],[0.2749420301605646,0.7250579698394354],1.0]
10 Select specific output columns
cvModel.transform(test).select("id", "text", "probability", "prediction").collect().foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
(4, spark i j k) --> prob=[0.6286595162202399,0.37134048377976003], prediction=0.0
(5, l m n) --> prob=[0.3454025683005015,0.6545974316994986], prediction=1.0
(6, mapreduce spark) --> prob=[0.3372901203884568,0.6627098796115432], prediction=1.0
(7, apache hadoop) --> prob=[0.2749420301605646,0.7250579698394354], prediction=1.0
11 Inspect the parameter values of the best model
val bestModel= cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel=bestModel.stages(2).asInstanceOf[LogisticRegressionModel]
lrModel.getRegParam
res22: Double = 0.1
lrModel.numFeatures
res24: Int = 10
scala> lrModel.getMaxIter
res25: Int = 10
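Besides the best model, CrossValidatorModel also records the average metric each ParamMap achieved across the folds, which shows how close the runner-up settings were; a minimal sketch using the cvModel fitted above:
// Pair each parameter combination with its average AUC over the two folds.
cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).foreach { case (params, metric) =>
  println(s"avg AUC = $metric for:\n$params")
}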
5.3 The TrainValidationSplit method
- In addition to CrossValidator, Spark also provides TrainValidationSplit for tuning hyperparameters.
- TrainValidationSplit evaluates each parameter combination only once, as opposed to the k evaluations of CrossValidator. This means the cost is relatively low, but when the training set is not large it may not produce reliable results.
- Unlike CrossValidator, TrainValidationSplit produces a single (training, test) dataset pair, split according to the trainRatio parameter. With trainRatio = 0.75, for example, TrainValidationSplit produces a training set and a test set, with 75% of the data used for training and 25% for validation.
- Like CrossValidator, TrainValidationSplit finally fits the Estimator using the best ParamMap and the entire data set.
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
1 Load the test data (shipped with the Spark installation package), in libsvm format:
val data = spark.read.format("libsvm").load("/data/mllib/sample_linear_regression_data.txt")
-9.490009878824548 1:0.4551273600657362 2:0.36644694351969087 3:-0.38256108933468047 4:-0.4458430198517267 5:0.33109790358914726 6:0.8067445293443565 7:-0.2624341731773887 8:-0.44850386111659524 9:-0.07269284838169332 10:0.5658035575800715
0.2577820163584905 1:0.8386555657374337 2:-0.1270180511534269 3:0.499812362510895 4:-0.22686625128130267 5:-0.6452430441812433 6:0.18869982177936828 7:-0.5804648622673358 8:0.651931743775642 9:-0.6555641246242951 10:0.17485476357259122
-4.438869807456516 1:0.5025608135349202 2:0.14208069682973434 3:0.16004976900412138 4:0.505019897181302 5:-0.9371635223468384 6:-0.2841601610457427 7:0.6355938616712786 8:-0.1646249064941625 ...
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)
2 Select the model
val lr = new LinearRegression().setMaxIter(10)
3 Use ParamGridBuilder to build a parameter grid; TrainValidationSplit tries every combination and picks the best one with the Evaluator
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
4 The Estimator is a simple linear regression; 80% of the data is used for training and 20% for validation
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new RegressionEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
5 Select the best parameters
val model = trainValidationSplit.fit(training)
6 Predict on the test data; the parameters used are the best ones just selected
val allresult = model.transform(test)
allresult.rdd.take(5).foreach(println)
[-23.51088409032297,(10,[0,1,2,3,4,5,6,7,8,9],[0.4683538422180036,...,0.40690300256653256]),-1.6659388625179559]
[-21.432387764165806,(10,[0,1,2,3,4,5,6,7,8,9],[0.4785033857256795,...,0.39541998419731494]),0.3400877302576284]
[-12.977848725392104,(10,[0,1,2,3,4,5,6,7,8,9],[0.5908891529017144,...,0.4872169215922426]),-0.02335359093652395]
[-11.827072996392571,(10,[0,1,2,3,4,5,6,7,8,9],[0.9409739656166973,...,0.7901644743575837]),2.5642684021108417]
[-10.945919657782932,(10,[0,1,2,3,4,5,6,7,8,9],[0.7669971723591666,...,0.8183723865760859]),0.1631314487734783]
scala> allresult.show
+--------------------+--------------------+--------------------+
|               label|            features|          prediction|
+--------------------+--------------------+--------------------+
|  -23.51088409032297| (10,[0,1,2,3,4,5...| -1.6659388625179559|
| -21.432387764165806| (10,[0,1,2,3,4,5...|  0.3400877302576284|
| -12.977848725392104| (10,[0,1,2,3,4,5...|-0.02335359093652395|
| -11.827072996392571| (10,[0,1,2,3,4,5...|  2.5642684021108417|
| -10.945919657782932| (10,[0,1,2,3,4,5...|  0.1631314487734783|
|  -10.58331129986813| (10,[0,1,2,3,4,5...|   2.517790654691453|
| -10.288657252388708| (10,[0,1,2,3,4,5...| -0.9443474180536754|
|  -8.822357870425154| (10,[0,1,2,3,4,5...| -0.6872889429113783|
|  -8.772667465932606| (10,[0,1,2,3,4,5...|  -1.485408580416465|
|  -8.605713514762092| (10,[0,1,2,3,4,5...|   1.110272909026478|
|  -6.544633229269576| (10,[0,1,2,3,4,5...|  3.0454559778611285|
|  -5.055293333055445| (10,[0,1,2,3,4,5...|  0.6441174575094268|
|  -5.039628433467326| (10,[0,1,2,3,4,5...|  0.9572366607107066|
|  -4.937258492902948| (10,[0,1,2,3,4,5...|  0.2292114538379546|
|  -3.741044592262687| (10,[0,1,2,3,4,5...|   3.343205816009816|
|  -3.731112242951253| (10,[0,1,2,3,4,5...|  2.6826413698701064|
|  -2.109441044710089| (10,[0,1,2,3,4,5...| -2.1930034039595445|
| -1.8722161156986976| (10,[0,1,2,3,4,5...| 0.49547270330052423|
| -1.1009750789589774| (10,[0,1,2,3,4,5...| -0.9441633113006601|
|-0.48115211266405217| (10,[0,1,2,3,4,5...| -0.6756196573079968|
+--------------------+--------------------+--------------------+
only showing top 20 rows
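As in the CrossValidator case, the fitted TrainValidationSplitModel exposes the winning sub-model and its tuned parameters; a minimal sketch assuming the model value fitted above:
import org.apache.spark.ml.regression.LinearRegressionModel
// The sub-model that won on the 80/20 validation split, with its selected parameters.
val bestLr = model.bestModel.asInstanceOf[LinearRegressionModel]
println(s"regParam=${bestLr.getRegParam}, elasticNetParam=${bestLr.getElasticNetParam}, fitIntercept=${bestLr.getFitIntercept}")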
6 Conclusion
We have finally reached the end. This detailed comparative analysis was written with great effort and much feeling; please treasure it.
Qin Kaixin in Shenzhen, 2018-11-18, 15:46