Optimization and Practice of XGBoost-Ray in Uber's Large-Scale Production Deployment
This article introduces the design principles, code structure, and applications of XGBoost-Ray in Uber's production environment. It is divided into five parts:
- Why XGBoost-Ray was introduced
- XGBoost-Ray architecture
- XGBoost-Ray feature points
- XGBoost-Ray code walkthrough
- XGBoost-Ray applications at Uber
1.1. What is XGBoost?
XGBoost is an open-source machine learning project developed by Tianqi Chen et al. It is an efficient implementation of the GBDT algorithm with many algorithmic and engineering improvements, and it has been widely used in Kaggle and many other machine learning competitions with good results. When talking about XGBoost, we have to mention GBDT (Gradient Boosting Decision Tree): XGBoost is essentially a GBDT that strives for maximum speed and efficiency, hence the name X (eXtreme) Gradient Boosting. Both are boosting methods.
1.2. Why use distributed XGBoost?
- Single-machine XGBoost training and prediction cannot meet the requirements of large data volumes
- Multi-GPU accelerated training is not supported on a single machine
1.3. What distributed XGBoost options exist today?
- Implementations based on Spark, Dask, and Kubernetes
1.4. What are the shortcomings of current distributed XGBoost options?
- No dynamic computation graphs
- Limited fault-tolerance handling
- Limited multi-GPU support
- No integration with hyperparameter tuning
1.5. What can XGBoost-Ray do?
- Each Ray actor is a stateful training worker
- Good fault tolerance
- Multi-GPU support
- Support for distributed data loading
- Integration with Ray Tune for distributed hyperparameter optimization
2.1. The distributed XGBoost-Ray architecture diagram is as follows:
XGBoost's distributed logic is implemented through Rabit, and task coordination is done by the Rabit Tracker.
Rabit is a lightweight library that provides fault-tolerant Allreduce and Broadcast interfaces; the underlying communication uses sockets.
2.2. The execution logic of the XGBoost-Ray architecture is as follows:
(1) A RabitTracker is started on the driver side, providing N endpoints for worker connections.
(2) Each XGBoost training node registers with the RabitTracker through the Rabit framework; the RabitTracker numbers these workers and organizes them into a ring network structure.
(3) Once all worker nodes have established the complete network topology, the computing task can start, and the tracker monitors the entire execution process.
2.3. Specific tasks performed by the RabitTracker:
The RabitTracker is created by tracker.py in the project. It does the following:
- Starts daemon threads and provides endpoints for worker nodes to register connections. All worker nodes register their status information by communicating with the tracker.
- Coordinates worker node execution: assigns a rank number to each worker node, constructs the ring network structure from the received worker registration information, and broadcasts it to the workers to ensure a compliant network topology is established between them.
- Once all worker nodes have established the complete network topology, the computing task can start, and the tracker monitors the entire execution process.
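The rank numbering and ring construction described above can be sketched in plain Python (a simplified illustration, not the actual tracker.py logic):

```python
def build_ring(workers):
    """Assign ranks in registration order and compute each
    worker's (prev, next) neighbor ranks in a ring topology."""
    n = len(workers)
    ranks = {w: i for i, w in enumerate(workers)}
    neighbors = {
        w: ((i - 1) % n, (i + 1) % n)  # (prev_rank, next_rank)
        for w, i in ranks.items()
    }
    return ranks, neighbors

ranks, neighbors = build_ring(["worker-a", "worker-b", "worker-c"])
# ranks == {"worker-a": 0, "worker-b": 1, "worker-c": 2}
# neighbors["worker-a"] == (2, 1): the ring wraps around
```

Once every worker knows its neighbors, ring-based Allreduce and Broadcast can run without the tracker in the data path.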
2.4. The operating principle diagram of distributed XGBoost-Ray is as follows:
- A ray.remote call in each Ray worker node loads its shard of the distributed dataset;
- The driver side assigns XGBoost training to each worker, and the workers communicate with each other through Rabit;
- The checkpoint results and training evaluation results of each worker are fed back to the driver.
3.1. Distributed dataset loading
- The Modin component is used here for distributed dataset loading.
Pandas is single-core; Modin splits a large DataFrame into many partitions so that loading and processing can take full advantage of multi-core CPUs.
- How partitioned data is transferred
XGBoost-Ray checks which nodes the partitioned DataFrame data currently resides on and assigns each partition to a worker on that node, minimizing cross-node data transfer while keeping partition sizes uniform.
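A simplified, hypothetical sketch of that locality-aware assignment (the real logic lives in xgboost_ray's data-loading code; names here are illustrative):

```python
from collections import defaultdict

def assign_partitions(partition_locations, workers_per_node):
    """Greedily assign each partition to a worker on the node that
    already holds it, keeping per-worker partition counts even."""
    load = defaultdict(int)  # worker -> partitions assigned so far
    assignment = {}
    all_workers = [w for ws in workers_per_node.values() for w in ws]
    for part, node in partition_locations.items():
        local = workers_per_node.get(node, [])
        candidates = local if local else all_workers  # fall back if no local worker
        worker = min(candidates, key=lambda w: load[w])
        assignment[part] = worker
        load[worker] += 1
    return assignment

assignment = assign_partitions(
    {"p0": "node1", "p1": "node1", "p2": "node2"},
    {"node1": ["w0", "w1"], "node2": ["w2"]},
)
# p0 and p1 go to node1's workers, p2 stays on node2's worker
```

Keeping partitions on the node that already holds them avoids shipping data across the network, which is exactly the goal described above.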
3.2. Distributed dataset loading: API use
As shown in the figure above: when writing code through the API, import the required modules, convert the loaded dataset into a Modin DataFrame, and then convert the training set and label set into XGBoost-Ray's data format.
3.3. XGBoost-Ray fault-tolerance mechanism
In distributed training, some nodes are bound to fail. XGBoost-Ray's fault-tolerance mechanism provides three recovery modes:
(1) Single-point recovery
- Default: a single node recovers from the last checkpoint.
If a worker fails during training, XGBoost-Ray by default restores the training state from the previous checkpoint and continues training.
(2) Non-elastic training (warm restart)
- Restart and recover only the failed worker nodes.
(3) Elastic training
- Continue training on fewer workers until the failed actors come back up.
3.4. Fault tolerance: API use
The following figure shows the specific configuration of the fault-tolerance mechanism:
3.5. Fault tolerance: benchmark
- When a small number of workers failed, training time increased slowly as the number of dead workers grew.
- In non-elastic training, performance was almost twice as bad as in the other two modes.
- Elastic training performed best, reaching the optimal training duration even with 2 failed nodes.
3.6. Hyperparameter tuning based on Ray Tune
- Run multiple XGBoost-Ray training jobs in parallel, each with a different hyperparameter configuration. Move the training code into a function, then pass that function to tune.run. Internally, train() detects whether Tune is in use and automatically reports results back to it.
3.7. Hyperparameter tuning based on Ray Tune: API use
4.1. XGBoost-Ray for distributed training on a dataset; the code execution flow chart is as follows:
The following figure shows the source code for the specific execution of the training. To see the complete process, refer to the source code.
4.2. Inference with XGBoost-Ray; the flow chart is as follows:
The specific source-code execution process is as follows:
5.1. Application at Uber
The following is a flow chart of XGBoost-Ray as used at Uber:
The Spark + XGBoost-Ray technology stack is used:
- Transformer: an algorithm that converts one DataFrame into another. A Transformer implements a method transform() that converts one DataFrame into another, typically by appending one or more columns.
- Estimator: a conceptual abstraction of a learning algorithm or training method over the training data. An Estimator implements a method fit() that takes a DataFrame and produces a Transformer.
The flow chart of Spark + XGBoost-Ray is as follows:
- Perform data preprocessing on Spark
- Create a Ray context
- Create a Ray estimator
- Use the Ray estimator for training and prediction-set transformation
- Save and load the model
The Spark + XGBoost-Ray code is as follows:
The flow chart of Flink + XGBoost-Ray for distributed training is as follows:
The above covers the optimization and practice of XGBoost-Ray in Uber's large-scale production deployment.