Bank risk prediction with the random forest algorithm

Source code and dataset: github.com/luo94852184…

In machine learning, a random forest is a classifier made up of many decision trees; its output class is the mode of the classes predicted by the individual trees. The algorithm was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. The term derives from "random decision forests", proposed in 1995 by Tin Kam Ho of Bell Labs. The approach combines Breiman's "bootstrap aggregating" (bagging) idea with Ho's "random subspace method" to build an ensemble of decision trees.
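The "output by the mode of the trees' votes" idea can be sketched in a few lines of Scala (a toy illustration only, not the Spark implementation used below):

// Toy majority vote: each tree casts a vote for a class; the forest returns the most frequent vote
def forestPredict(treeVotes: Seq[Int]): Int =
  treeVotes.groupBy(identity).maxBy(_._2.size)._1

forestPredict(Seq(1, 0, 1, 1, 0)) // => 1, the class most trees voted for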



1. Splitting: during the training of a decision tree, the training data set is repeatedly divided into two sub-data sets; this operation is called splitting.

2. Features: in a classification problem, the data fed into the classifier are called features. In the stock-prediction example mentioned earlier, for instance, the features are the previous day's trading volume and closing price.

3. Features to be selected: while building a decision tree, features are chosen from the full feature set in some order. The features to be selected are the features that have not yet been chosen before the current step. For example, if all the features are A, B, C, D and E, then in step 1 the features to be selected are A, B, C, D, E; if C is chosen in step 1, then in step 2 the features to be selected are A, B, D, E.

4. Split feature: the counterpart of the features to be selected. Each feature that is actually chosen at a step becomes a split feature; in the example above, the split feature of step 1 is C. They are called split features because the chosen features split the dataset into disjoint parts (a short sketch follows this list).
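The bookkeeping behind points 3 and 4 can be sketched in Scala as follows (the feature names are hypothetical and purely illustrative):

// Features to be selected vs. split features, step by step
var candidates = Set("A", "B", "C", "D", "E") // features to be selected before step 1
val splitFeatureStep1 = "C"                   // split feature chosen in step 1
candidates -= splitFeatureStep1               // features to be selected before step 2: A, B, D, E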

Data format:

Column A is the user's credit status (1 = normal credit, 0 = bad credit); the remaining columns are the user's basic information. The CSV file is therefore expected to contain 21 comma-separated numeric columns, in the same order as the fields of the Credit case class defined below.

Code:

// About the bank risk forecast
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object Credit {

  // Use a Scala case class to define the attributes of a credit record
  case class Credit(
                     creditability: Double,
                     balance: Double, duration: Double, history: Double, purpose: Double, amount: Double,
                     savings: Double, employment: Double, instPercent: Double, sexMarried: Double, guarantors: Double,
                     residenceDuration: Double, assets: Double, age: Double, concCredit: Double, apartment: Double,
                     credits: Double, occupation: Double, dependents: Double, hasPhone: Double, foreign: Double
                   )

  // Use a function to parse a line and store the value into the Credit class
  def parseCredit(line: Array[Double]): Credit = {
    Credit(
      line(0),
      line(1) - 1, line(2), line(3), line(4) , line(5),
      line(6) - 1, line(7) - 1, line(8), line(9) - 1, line(10) - 1,
      line(11) - 1, line(12) - 1, line(13), line(14) - 1, line(15) - 1,
      line(16) - 1, line(17) - 1, line(18) - 1, line(19) - 1, line(20) - 1)
  }

  // Convert the RDD of strings into an RDD of Array[Double]
  def parseRDD(rdd: RDD[String]): RDD[Array[Double]] = {
    rdd.map(_.split(",")).map(_.map(_.toDouble))
  }

  // Define main
  def main(args: Array[String]): Unit = {
    // Define the Spark configuration and contexts
    val conf = new SparkConf().setAppName("SparkDFebay")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Read the CSV file
    import sqlContext.implicits._
    // The first map converts the string RDD to Double RDD and the second map injects Double into the Credit class
    // toDF converts the RDD into a Credit DataFrame (a table structure)

    val creditDF = parseRDD(sc.textFile("germancredit.csv")).map(parseCredit).toDF().cache()
    creditDF.registerTempTable("credit")

    // creditDF.printSchema prints the schema
    // creditDF.show displays the rows
    // The SQLContext can now be used to run further SQL queries against the registered "credit" table
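    // For illustration, the registered temp table could also be queried directly,
    // e.g. to check the class balance of the credit status column:
    // val labelCounts = sqlContext.sql("SELECT creditability, COUNT(*) AS cnt FROM credit GROUP BY creditability")
    // labelCounts.show()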


    // Assemble the feature columns into a single feature vector so the ML algorithms can consume them


    val featureCols = Array("balance", "duration", "history", "purpose", "amount", "savings", "employment", "instPercent", "sexMarried", "guarantors", "residenceDuration", "assets", "age", "concCredit", "apartment", "credits", "occupation", "dependents", "hasPhone", "foreign")
    val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
    val df2 = assembler.transform(creditDF)

    // Add a "label" column to df2 by indexing the "creditability" column
    // StringIndexer converts the credit status values into indexed label values
    val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")
    val df3 = labelIndexer.fit(df2).transform(df2)

    // Split the data in df3 into training (70%) and test (30%) sets
    // splitSeed fixes the random seed so the split is reproducible
    val splitSeed = 5043
    val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)






    // The first method: train a random forest classifier directly
    /*
     * maxDepth: deeper trees fit the training data more closely but take longer to train
     * numTrees: more trees generally give higher accuracy (at a higher training cost)
     * maxBins: the maximum number of bins used when discretizing continuous features at a node
     * impurity: the criterion ("gini" or "entropy") used in the information-gain calculation
     * featureSubsetStrategy = "auto": the number of features considered at each split is chosen automatically
     * seed: the random seed
     */

    val classifier = new RandomForestClassifier()
      .setImpurity("gini")
      .setMaxDepth(3)
      .setNumTrees(20)
      .setFeatureSubsetStrategy("auto")
      .setSeed(5043)
    val model = classifier.fit(trainingData) // Do some training

    val evaluator = new BinaryClassificationEvaluator().setLabelCol("label") // to set the label value
    val predictions = model.transform(testData) // Make a prediction
    println(model.toDebugString) // Print the structure of the learned trees

    // Save the model
    model.save("BankModel001")
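    // For illustration, a model saved this way could later be reloaded with
    // RandomForestClassificationModel.load("BankModel001")
    // (requires importing org.apache.spark.ml.classification.RandomForestClassificationModel)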





    // Evaluate the predictions (BinaryClassificationEvaluator reports the area under the ROC curve by default)
    val accuracy = evaluator.evaluate(predictions)
    println("Accuracy before pipeline fitting: " + accuracy * 100 + "%")



    /*
     * The second method trains the model through a Pipeline combined with grid search:
     * different parameter combinations are tried and their results compared.
     */
    // Use the ParamGridBuilder tool to build the parameter grid
    // The grid entries refer to the parameters of the classifier defined above
    val paramGrid = new ParamGridBuilder()
      .addGrid(classifier.maxBins, Array(25, 31))
      .addGrid(classifier.maxDepth, Array(5, 10))
      .addGrid(classifier.numTrees, Array(20, 60))
      .addGrid(classifier.impurity, Array("entropy", "gini"))
      .build()

    // The pipeline is built from a series of stages; each stage is either an Estimator or a Transformer
    val steps: Array[PipelineStage] = Array(classifier)
    val pipeline = new Pipeline().setStages(steps)

    // Use CrossValidator to select the best model; the number of folds should not be set too high
    // CrossValidator runs the pipeline over every combination in the parameter grid
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(10)

    // Fit the pipeline over the parameter grid; the best-performing model is kept
    val pipelineFittedModel = cv.fit(trainingData)
    pipelineFittedModel.save("BankPipelineMode")
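    // For illustration, the winning parameter combination could be inspected from the fitted cross-validator, e.g.:
    // val bestPipeline = pipelineFittedModel.bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel]
    // println(bestPipeline.stages(0).extractParamMap)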

    // Test data
    val predictions2 = pipelineFittedModel.transform(testData)
    val accuracy2 = evaluator.evaluate(predictions2)
    println("accuracy after pipeline fitting" + accuracy2*100+"%")}}Copy the code

Package the project into a JAR, submit it to the Spark cluster, and run it to obtain the model accuracy:
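A typical submission command looks like the following (the JAR file name and master URL are placeholders; adjust them to your own build and cluster):

spark-submit --class Credit --master spark://<master-host>:7077 bank-credit-prediction.jar

Note that germancredit.csv must be readable from the cluster (for example on HDFS or present on every worker node) for sc.textFile to load it.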