• MEET MICHELANGELO: UBER’S MACHINE LEARNING PLATFORM
  • JEREMY HERMANN & MIKE DEL BALSO
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: lsvih
  • Proofread by: TobiasLee, xfffrank

Uber machine learning platform — Michelangelo

Uber engineers are constantly working on new technologies that give customers an effective and seamless experience. They are now investing more heavily in artificial intelligence and machine learning to realize this vision. To that end, Uber’s engineers built Michelangelo, an internal “MLaaS” (machine learning as a service) platform that lowers the barrier to machine learning development and lets AI scale to meet different business needs as conveniently as a customer uses Uber.

The Michelangelo platform allows in-house teams to seamlessly build, deploy, and run machine learning solutions at Uber’s scale. It is designed to cover the entire end-to-end machine learning workflow: managing data, training, evaluating, and deploying models, making predictions, and monitoring predictions. The system supports traditional machine learning models as well as time series forecasting and deep learning.

Michelangelo has been in use at Uber for about a year and has become the standard platform for Uber’s engineers and data scientists, with dozens of teams building and deploying models on it. The platform is now deployed in multiple Uber data centers and uses dedicated hardware to serve predictions for the highest-load online services in the company.

This article introduces Michelangelo and its product use cases, and briefly walks through the entire machine learning workflow supported by this MLaaS system.

The motivation behind Michelangelo

Before Michelangelo, Uber’s engineers and data scientists faced many challenges in building and deploying machine learning models at the scale the company’s operations required. They built predictive models with a variety of tools (R, scikit-learn, custom algorithms, etc.), while separate engineering teams built one-off systems to serve predictions from those models. As a result, only the few data scientists and engineers who could assemble such a stack quickly from open source tools were able to apply machine learning, which limited its use within the company.

Specifically, there was no reliable, unified system with reusable pipelines for creating and managing training and prediction data at scale. Back then, no one built models larger than what a data scientist could run on a desktop machine, there was no standard place to store results, and it was very difficult to compare results across experiments. More importantly, there was no established way to deploy a model into production. Most of the time, the engineering team involved had to develop a custom serving container for the project at hand. At this point, the engineers noticed that these signs matched the machine learning anti-patterns documented by Sculley et al.

Michelangelo aims to standardize workflows and tools across teams, so that users throughout the company can easily build and run large-scale machine learning systems through a single end-to-end system. The engineers’ goal, however, is not just to solve these immediate problems but to create a system that grows with the business.

When engineers began building Michelangelo in mid-2015, they started with the problems of training models at scale and deploying them to production serving containers. They then focused on building systems to better manage and share feature pipelines. More recently, the focus has shifted to developer productivity: how to speed up the path from an idea to a first production model, and the rapid iterations that follow.

The next section uses an example to show how to use Michelangelo to build and deploy a model to solve a specific problem for Uber. While the following focuses on specific use cases in UberEATS, the platform also manages similar models for a variety of predictive use cases in the company.

Use case: UberEATS home delivery time estimation model

UberEATS has several models running on Michelangelo, covering delivery time prediction, search ranking, search autocomplete, and restaurant ranking. The delivery time models predict how long it will take to prepare and deliver a meal, at each stage of the delivery process.

Figure 1: The UberEATS app shows an estimated delivery time, driven by a machine learning model built on Michelangelo.

Predicting the estimated time of delivery (ETD) is not simple. When an UberEATS user places an order, it is sent to the restaurant for processing. The restaurant needs to confirm the order and prepare the food, which takes time that depends on the complexity of the order and how busy the restaurant is. When the meal is almost ready, an Uber delivery partner is dispatched to pick it up. The delivery partner then needs to drive to the restaurant, find parking, pick up the food, return to the car, drive to the customer’s home (which depends on the route, traffic, and so on), find parking again, walk to the customer’s door, and finally hand over the order. The goal of UberEATS is to predict the total time of this complex multi-stage process and to recalculate the ETD at each step.

On the Michelangelo platform, UberEATS data scientists use gradient boosted decision tree (GBDT) regression models to predict this end-to-end delivery time. The features used include request information (such as time of day and delivery location), historical features (such as the restaurant’s average meal preparation time over the past seven days), and near-real-time features (such as its average meal preparation time over the last hour). The models are deployed to Michelangelo serving containers in Uber’s data centers and are invoked through network calls from the UberEATS microservices. The predictions are presented to the customer before the meal is prepared and delivered.
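To make the idea of gradient boosted regression concrete, here is a minimal from-scratch sketch that boosts depth-1 trees (stumps) on squared error. It is not Uber's model: the three feature columns (hour of day, 7-day average preparation time, 1-hour average preparation time), the sample numbers, and all function names are invented for illustration, and a production system would use a library such as XGBoost instead.

```python
from statistics import mean

def fit_stump(rows, residuals):
    """Return (feature_index, threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean, right_mean = mean(left), mean(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # no valid split: predict the mean residual everywhere
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]

def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value

def fit_gbdt(rows, targets, rounds=20, learning_rate=0.3):
    """Each round fits a stump to the current residuals (the negative gradient of squared error)."""
    base = mean(targets)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, learning_rate, stumps

def gbdt_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

def mse(model, rows, targets):
    return mean((gbdt_predict(model, r) - y) ** 2 for r, y in zip(rows, targets))

# Hypothetical training rows: [hour_of_day, avg_prep_minutes_7d, avg_prep_minutes_1h]
rows = [[12, 20.0, 25.0], [18, 30.0, 40.0], [9, 15.0, 14.0],
        [20, 28.0, 35.0], [14, 22.0, 20.0], [19, 32.0, 45.0]]
delivery_minutes = [38.0, 62.0, 27.0, 55.0, 36.0, 70.0]

model = fit_gbdt(rows, delivery_minutes)
baseline = (mean(delivery_minutes), 0.3, [])  # predicts the mean for every order
```

Comparing `mse(model, rows, delivery_minutes)` with `mse(baseline, rows, delivery_minutes)` shows how much the boosted stumps improve on simply predicting the average delivery time.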

System architecture

The Michelangelo system consists of a mix of open source systems and components built in-house. The main open source components are HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow. Where possible, the development team prefers mature open source systems and will fork, customize, and contribute back to them as needed. When no suitable open source solution exists, they build the system themselves.

The Michelangelo system is built on top of Uber’s data and compute infrastructure, which provides a data lake holding all of Uber’s transactional and log data, Kafka brokers that aggregate log messages from all of Uber’s services, a Samza stream computing engine, managed Cassandra clusters, and Uber’s in-house service provisioning and deployment tools.

In the next section, the ETD model of UberEATS is used as an example to briefly introduce the various layers of the system and illustrate Michelangelo’s technical details.

Machine learning workflow

At Uber, most machine learning use cases (including work currently under way on classification, regression, and time series forecasting) follow the same workflow. The workflow is independent of any concrete implementation, so it can easily be extended to support new algorithms and frameworks (such as the latest deep learning frameworks). It also applies across deployment modes for different prediction use cases (such as online versus offline, or in-car versus on-phone).

Michelangelo is specifically designed to provide scalable, reliable, reusable, and easy-to-use automated tools for the following six-step workflow:

  1. Manage data
  2. Train models
  3. Evaluate models
  4. Deploy models
  5. Make predictions
  6. Monitor predictions

The following sections detail how Michelangelo’s architecture supports each step of the workflow.

Manage data

Finding good features is often the hardest part of machine learning, and engineers have found that building and managing data pipelines is usually the most time-consuming and laborious part of a complete machine learning solution.

A platform should therefore provide a set of standard tools for building data pipelines that generate labeled feature data sets for training (and retraining) and unlabeled feature data for prediction. These tools need to be deeply integrated with the company’s data lake or data warehouse and with its online data serving systems. The pipelines must be scalable and performant, must monitor data flow and data quality, and must fully support both online and offline training and prediction. The tools should also let teams generate features in a shareable way, which reduces duplicate work and improves data quality. In addition, they should provide strong guard rails that steer users toward best practices (for example, ensuring that the same data generation process is used for training and for prediction).

Michelangelo’s data management components are divided into online and offline pipelines. At present, the offline pipelines are mainly used to feed batch model training and batch prediction jobs. The online pipelines feed online, low-latency prediction (and, later, online learning systems).

In addition, engineers have added a feature store to the data management layer, which allows teams to share and discover high-quality features for their machine learning problems. They found that many of Uber’s models use similar or identical features, so there is real value in sharing features between teams in different organizations and between projects within a team.

Figure 2: The data preparation pipelines push data into the feature store and the training data repository.

Offline pipelines

Uber’s transactional and log data flows into an HDFS data lake and is easily accessible through Spark and Hive SQL compute jobs. The platform provides containers and scheduling for running regular jobs that compute features, which can be kept private to a project or published to the feature store (see below) and shared with other teams. Batch jobs, whether run on a schedule or triggered in some other way, are integrated with data quality monitoring tools that can quickly trace a problem back to the point in the pipeline where it occurred and determine whether local code or upstream code caused the data error.

Online pipelines

Models deployed online cannot access data stored in HDFS, and features that live in the databases backing Uber’s production services are often hard to compute directly at prediction time (for example, it is not possible to query the UberEATS order service directly to compute a restaurant’s average meal preparation time over a given period). Instead, engineers precompute the features needed by online models and store them in Cassandra, where they can be read with low latency at prediction time.

The online pipelines support two ways of computing features: batch precompute and near-real-time compute. Details are as follows:

  • Batch precompute. The system periodically runs bulk computations and loads historical features from HDFS into Cassandra. This approach is simple and works well when the features do not need to be real-time (for example, when updates every few hours are acceptable). It also guarantees that the batch pipeline uses the same data for training and serving. UberEATS uses this system for features such as “a restaurant’s average meal preparation time over the past seven days.”
  • Near-real-time compute. The system publishes relevant metrics to Kafka and then runs Samza stream computing jobs that generate aggregate features at low latency. These features are written directly to Cassandra for serving and logged back to HDFS for later training jobs. As with batch precompute, this guarantees that the same data is used for serving and training. To avoid a cold start, engineers built a tool that “backfills” this data by running a batch job over historical logs to generate training data. UberEATS uses this near-real-time pipeline for features such as “a restaurant’s average meal preparation time over the past hour.”
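The near-real-time path can be pictured as a sliding-window aggregation over a stream of events. The sketch below is an assumption-laden illustration in plain Python rather than a Samza job: `RollingMean`, `on_order_prepared`, the in-memory `online_store` dictionary (standing in for Cassandra), and the feature name `avg_prep_minutes_1h` are all hypothetical. It assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque

class RollingMean:
    """Mean of the values seen within a trailing time window (timestamps in seconds)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def _evict(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def mean(self, now):
        self._evict(now)
        if not self.events:
            return None  # no data in the window (the cold-start case)
        return self.total / len(self.events)

WINDOW_SECONDS = 3600
windows = defaultdict(lambda: RollingMean(WINDOW_SECONDS))
online_store = {}  # stand-in for the low-latency store read at prediction time

def on_order_prepared(restaurant_id, timestamp, prep_minutes):
    """Handle one stream event and refresh the restaurant's one-hour average."""
    window = windows[restaurant_id]
    window.add(timestamp, prep_minutes)
    online_store[(restaurant_id, "avg_prep_minutes_1h")] = window.mean(timestamp)

on_order_prepared("restaurant-1", 0, 10.0)
on_order_prepared("restaurant-1", 1800, 20.0)
on_order_prepared("restaurant-1", 4000, 30.0)  # the event at t=0 is now outside the window
```

A backfill tool of the kind described above could replay historical events through the same handler, so that training data is produced by the same logic that serves live features.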

Shared feature store

The engineers found it useful to build a centralized feature store in which Uber teams can create and manage reliable features for their own use and share them with other teams. At a high level, it does two things:

  1. It lets users easily add the features they have built to a shared feature store (with a small amount of metadata, such as owner, description, and SLA), and it also allows features used by a specific project to be stored privately.
  2. Once a feature is in the feature store, it is easy to consume. For both online and offline models, the user simply references the feature by name in the model configuration. The system joins in the correct data from HDFS for model training or batch prediction, and fetches the right values from Cassandra for online prediction.
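The two capabilities above (publishing a feature with metadata, then consuming it by name) can be sketched with a small in-memory class. Everything here is hypothetical: the `FeatureStore` class, its method names, and the feature names are invented for illustration and do not reflect Michelangelo's actual interface, which stores offline values in HDFS and online values in Cassandra.

```python
class FeatureStore:
    """Toy feature store: metadata per feature plus values keyed by entity ID."""

    def __init__(self):
        self._metadata = {}
        self._values = {}

    def register(self, name, owner, description, sla):
        # Publishing a feature requires a small amount of metadata.
        self._metadata[name] = {"owner": owner, "description": description, "sla": sla}
        self._values.setdefault(name, {})

    def write(self, name, entity_id, value):
        if name not in self._metadata:
            raise KeyError(f"feature {name!r} is not registered")
        self._values[name][entity_id] = value

    def read(self, names, entity_id):
        # Consumers reference features by name only; missing values come back as None.
        unknown = [n for n in names if n not in self._metadata]
        if unknown:
            raise KeyError(f"unknown features: {unknown}")
        return {n: self._values[n].get(entity_id) for n in names}

store = FeatureStore()
store.register("restaurant.avg_prep_minutes_7d", owner="eats-team",
               description="Mean preparation time over seven days", sla="daily")
store.register("restaurant.avg_prep_minutes_1h", owner="eats-team",
               description="Mean preparation time over one hour", sla="near-real-time")
store.write("restaurant.avg_prep_minutes_7d", "restaurant-1", 21.5)

# A model configuration lists the features it needs by name.
model_config = {"features": ["restaurant.avg_prep_minutes_7d",
                             "restaurant.avg_prep_minutes_1h"]}
feature_values = store.read(model_config["features"], "restaurant-1")
```

Because the model configuration only names features, the same configuration can be resolved against an offline backend for training and an online backend for serving.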

Uber currently has about 10,000 features in its feature store, which speed up machine learning projects, and teams are constantly adding new ones. Features in the feature store are automatically computed and updated every day.

In the future, engineers intend to build an automated system that searches the feature store and finds the most useful features for a given prediction problem.

Domain specific language (DSL) for feature selection and transformation

The features generated by data pipelines or sent by client services are often not in the format the model needs, and some values may be missing and need to be filled in. Sometimes the model needs only a subset of the features provided. In other cases, converting a timestamp into an hour of day or a day of week works better for the model. Feature values may also need to be normalized (for example, by subtracting the mean and dividing by the standard deviation).
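The transformations mentioned above are simple to express in code. The following sketch uses only the Python standard library; the function names are invented for illustration and are not part of Michelangelo's DSL (which is a subset of Scala). Timestamps are interpreted in UTC, which is an assumption.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev

def fill_missing(values, default):
    """Replace missing (None) entries with a default value."""
    return [default if v is None else v for v in values]

def hour_of_day(unix_seconds):
    """Hour of day (0-23) in UTC for a Unix timestamp."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).hour

def day_of_week(unix_seconds):
    """Day of week in UTC, with Monday = 0 and Sunday = 6."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).weekday()

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]

prep_minutes = fill_missing([18.0, None, 24.0], default=21.0)
normalized = z_score(prep_minutes)
```

In practice the mean and standard deviation would be computed once on the training data and reused at prediction time, so that both paths scale values identically.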

To address these issues, engineers created a DSL (domain specific language) that modelers use to select, transform, and combine the features sent to the model for training or prediction. The DSL is implemented as a subset of Scala. It is a pure functional language with a set of commonly used functions, and it can be extended with user-defined functions. It also includes accessor functions that fetch feature values from the right place (from the data pipeline for offline models, from the client request for online models, or directly from the feature store).

DSL expressions are part of the model configuration, and the same expressions are applied at training time and at prediction time, which guarantees that the same final set of features is generated and sent to the model in both cases.

Train models

The platform currently supports offline, large-scale distributed training of decision trees, linear and logistic models, unsupervised models (k-means), time series models, and deep neural networks. Engineers regularly add new algorithms in response to user demand and as they are developed by Uber’s AI lab. In addition, users can provide their own model types, including custom training, evaluation, and serving code. The distributed training system scales up to billions of samples and down to small data sets for quick iteration.

A model configuration specifies the model type, hyperparameters, data source reference, feature DSL expressions, and compute resource requirements (number of machines, amount of memory, whether to use GPUs, and so on). It is used to configure a training job that runs on a YARN or Mesos cluster.
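A model configuration of this kind might be represented as a small typed structure that is validated before a training job is submitted. The field names, the example values, and the `validate` rules below are assumptions made for illustration; the article does not show Michelangelo's real configuration schema.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceSpec:
    machines: int = 1
    memory_gb: int = 8
    use_gpu: bool = False

@dataclass
class ModelConfig:
    model_type: str             # e.g. "gbdt_regression"
    hyperparameters: dict       # algorithm-specific settings
    data_source: str            # reference to the training data
    feature_expressions: list   # feature DSL expressions, kept as strings here
    resources: ResourceSpec = field(default_factory=ResourceSpec)

    def validate(self):
        """Return a list of human-readable problems (empty when the config is usable)."""
        problems = []
        if not self.model_type:
            problems.append("model_type is required")
        if not self.feature_expressions:
            problems.append("at least one feature expression is required")
        if self.resources.machines < 1:
            problems.append("resources.machines must be at least 1")
        if self.resources.memory_gb <= 0:
            problems.append("resources.memory_gb must be positive")
        return problems

config = ModelConfig(
    model_type="gbdt_regression",
    hyperparameters={"num_trees": 200, "max_depth": 6, "learning_rate": 0.1},
    data_source="hive://eats.delivery_training_data",  # hypothetical reference
    feature_expressions=["hour_of_day(request_time)", "restaurant.avg_prep_minutes_7d"],
    resources=ResourceSpec(machines=16, memory_gb=32),
)
problems = config.validate()
```

Keeping the feature expressions inside the configuration is what allows the same transformations to travel with the model from training to serving.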

After a model is trained, the system computes performance metrics (such as the ROC curve and PR curve) and combines them into a model evaluation report. At the end of training, the original configuration, the learned parameters, and the evaluation report are saved back to the model repository for analysis and deployment.

In addition to training single models, Michelangelo supports hyperparameter search for all model types, as well as partitioned models. With a partitioned model, the system automatically partitions the training data according to the user’s configuration and trains one model per partition, falling back to a parent model where needed (for example, training one model per city and falling back to a country-level model when an accurate city-level model cannot be achieved).
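The partition-with-fallback idea can be sketched in a few lines. In this illustration each "model" is just the mean of its partition's targets, and a city is judged too sparse for its own model when it has fewer than `min_rows` examples. Both simplifications are assumptions: the article does not say how Michelangelo decides that a city-level model is not accurate enough.

```python
from collections import defaultdict
from statistics import mean

def train_partitioned(rows, min_rows=3):
    """rows: (city, target) pairs. Returns (city_models, country_model)."""
    by_city = defaultdict(list)
    for city, target in rows:
        by_city[city].append(target)
    # Parent model trained on all of the data.
    country_model = mean(target for _, target in rows)
    # One model per partition, but only where there is enough data.
    city_models = {city: mean(targets)
                   for city, targets in by_city.items()
                   if len(targets) >= min_rows}
    return city_models, country_model

def predict(city, city_models, country_model):
    """Use the city-level model when it exists, otherwise fall back to the parent."""
    return city_models.get(city, country_model)

rows = [("san_francisco", 30.0), ("san_francisco", 34.0), ("san_francisco", 32.0),
        ("new_york", 40.0)]
city_models, country_model = train_partitioned(rows)
```

Here `predict("san_francisco", ...)` uses the city-level model, while `predict("new_york", ...)` falls back to the country-level model because only one training row exists for that city.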

Training jobs can be configured and managed through a web UI or an API (often from a Jupyter notebook). Many teams use the API and workflow tools to retrain their models on a regular schedule.

Figure 3: Model training jobs use data sets from the feature store and the training data repository to train models, then push them to the model repository.

Evaluate models

Model training can be seen as an exploration process for finding the set of features, algorithms, and hyperparameters that produce the best model for the problem. It is not uncommon to train hundreds of models that fall short before arriving at the ideal model for a use case. Although these models are never used in production, they guide engineers toward the configuration that performs best. Keeping track of trained models (who trained them, when, on which data sets, with which hyperparameters, and so on) and evaluating and comparing their performance brings a lot of value to the platform, but handling so many models is a major challenge.

Every model trained on the Michelangelo platform is stored as a versioned object in the model repository in Cassandra, with the following information:

  • Who trained the model.
  • The start and end time of the training job.
  • The full model configuration (including the features used, hyperparameter settings, and so on).
  • References to the training and test data sets.
  • The importance of each feature.
  • Model accuracy metrics.
  • Standard evaluation charts or graphs for each model type (e.g. ROC curves, PR curves, and confusion matrices for binary classifiers).
  • All learned parameters of the model.
  • Summary statistics for model visualization.

Users can easily access this data through a web UI or an API to examine the details of a single model or to compare multiple models.
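One way to picture the stored record and the comparison it enables is a small immutable data class plus a helper that picks the best version by a chosen metric. The class, its fields, and the `best_by` helper are hypothetical names that mirror the list above; they are not Michelangelo's schema or API, and the sample values are made up.

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelVersion:
    trained_by: str
    started_at: str           # ISO 8601 timestamps kept as strings for simplicity
    finished_at: str
    config: dict              # features used, hyperparameters, and so on
    train_set_ref: str
    test_set_ref: str
    metrics: dict             # accuracy metrics, e.g. {"rmse": 4.2}
    feature_importance: dict
    model_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def best_by(versions, metric, lower_is_better=True):
    """Pick the version with the best value of one metric."""
    def key(version):
        return version.metrics[metric]
    return min(versions, key=key) if lower_is_better else max(versions, key=key)

v1 = ModelVersion("alice", "2017-09-01T10:00:00Z", "2017-09-01T11:30:00Z",
                  {"max_depth": 4}, "train_v1", "test_v1",
                  {"rmse": 5.1}, {"avg_prep_minutes_7d": 0.6})
v2 = ModelVersion("bob", "2017-09-02T10:00:00Z", "2017-09-02T12:00:00Z",
                  {"max_depth": 6}, "train_v1", "test_v1",
                  {"rmse": 4.4}, {"avg_prep_minutes_7d": 0.5})
winner = best_by([v1, v2], "rmse")
```

Comparing versions only makes sense when they were evaluated on the same test set, which is one reason to store the data set references alongside the metrics.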

Model accuracy report

The accuracy report for a regression model shows standard accuracy metrics and charts; the report for a classification model shows a different, classification-specific set, as shown in Figures 4 and 5:

Figure 4: The regression model report shows the performance metrics associated with regression.

Figure 5: The binary classification model report shows the performance metrics associated with classification.

Decision tree visualization

Decision trees are an important model type, so engineers provide a visualization tool that helps modelers understand why a model behaves as it does and debug it when needed. For a tree model, users can browse each individual tree and see its relative importance to the overall model, its split points, the importance of each feature to that tree, and the distribution of data at each split, among other variables. The user can also enter feature values, and the visualization shows the path triggered through the decision trees, the prediction of each tree, and the prediction of the overall model, as shown below:

Figure 6: View the tree model using the powerful tree visualization component.

Feature report

Michelangelo provides a feature report that uses partial dependence plots and distribution histograms to show the importance of each feature to the model. Selecting two features lets the user see a two-way partial dependence diagram of their interaction, as shown below:

Figure 7: Features, their importance to the model, and their interactions can be explored through the feature report.

Deploy models

Michelangelo provides end-to-end support for managing model deployment through the UI or the API. A model can be deployed in one of three ways:

  1. Offline deployment. The model is deployed to an offline container and run in a Spark job to generate batch predictions, either on demand or on a repeating schedule.
  2. Online deployment. The model is deployed to an online prediction service cluster (typically hundreds of machines behind a load balancer), where clients send individual or batched prediction requests as network RPC calls.
  3. Library deployment. Engineers intend to support models deployed to a serving container that is embedded as a library in another service and invoked through a Java API (this type is not shown in the figure below, but it works much like online deployment).

Figure 8: Models from the model repository are deployed to online and offline serving containers.

In all cases, the required model artifacts (metadata files, model parameter files, and compiled DSL expressions) are packaged into a ZIP archive and copied to the relevant hosts across Uber’s data centers using Uber’s standard code deployment infrastructure. The prediction containers automatically load the new models from disk and start handling prediction requests.
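Packaging the three kinds of artifacts into one archive, and loading them back on the serving side, can be sketched with Python's `zipfile` module. The file names inside the archive (`metadata.json`, `model.params`, `features.dsl`) and both function names are invented for this example; the article does not describe the real archive layout.

```python
import io
import json
import zipfile

def package_model(metadata, parameters, compiled_dsl):
    """Bundle metadata (dict), parameters (bytes), and compiled DSL (bytes) into ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("metadata.json", json.dumps(metadata, sort_keys=True))
        archive.writestr("model.params", parameters)
        archive.writestr("features.dsl", compiled_dsl)
    return buffer.getvalue()

def load_model(archive_bytes):
    """Read the artifacts back, as a serving container might after the copy completes."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {
            "metadata": json.loads(archive.read("metadata.json")),
            "parameters": archive.read("model.params"),
            "compiled_dsl": archive.read("features.dsl"),
        }

archive_bytes = package_model(
    metadata={"model_id": "example-model", "model_type": "gbdt_regression"},
    parameters=b"\x00\x01\x02",
    compiled_dsl=b"hour_of_day(request_time)",
)
loaded = load_model(archive_bytes)
```

Shipping a single self-contained archive means the serving container never has to reach back to the training system to reconstruct a model.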

Many teams write their own automation scripts that use the Michelangelo API to retrain and redeploy their models on a regular schedule. The UberEATS delivery time models, however, are trained and deployed manually by data scientists and engineers through the web UI.

Make predictions

Once models are deployed to and loaded by the serving container, they are used to make predictions on feature data loaded from a data pipeline or sent directly by a client service. The raw features pass through the compiled DSL expressions, which can modify the raw features or fetch additional features from the feature store. The final feature vector is then passed to the model for scoring. For online models, the prediction is returned to the client over the network. For offline models, the predictions are written back to Hive, where they can be consumed by downstream batch jobs or accessed by users directly through SQL queries, as shown below:

Figure 9: Online and offline prediction services generate prediction results using a set of feature vectors.

Referencing models

More than one model can be deployed to a given serving container at the same time. This allows painless transitions from old models to new ones and side-by-side A/B testing of models. At serving time, a model is identified by its UUID and by an optional tag (or alias) specified at deployment. For an online model, the client service sends the feature vector along with the UUID or tag of the model it wants to use. If a tag is used, the container makes the prediction with the model most recently deployed to that tag. When several models are deployed for batch prediction, all of them score each batch data set, and each prediction record carries the model’s UUID and tag so that consumers can filter the results.

If a new model that expects the same set of features as the old one is deployed to replace it, users can give the new model the same tag as the old model, and the container starts using the new model immediately. This lets users update their models without changing their client code. Users can also deploy the new model by UUID alone and then change the model UUID in the client or middleware configuration, gradually switching traffic from the old model to the new one.
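The UUID and tag lookup described above amounts to two dictionaries inside the serving container. The `ServingContainer` class and its methods are a hypothetical sketch; models are represented as plain Python callables and the identifiers are made up.

```python
class ServingContainer:
    """Toy container holding several models, addressable by UUID or by tag."""

    def __init__(self):
        self._models = {}  # UUID -> prediction function
        self._tags = {}    # tag -> UUID of the model most recently deployed to it

    def deploy(self, model_uuid, predict_fn, tag=None):
        self._models[model_uuid] = predict_fn
        if tag is not None:
            self._tags[tag] = model_uuid  # the newest deployment takes over the tag

    def predict(self, features, model_uuid=None, tag=None):
        if model_uuid is None:
            if tag not in self._tags:
                raise KeyError(f"no model deployed for tag {tag!r}")
            model_uuid = self._tags[tag]
        prediction = self._models[model_uuid](features)
        # Return the UUID so the caller knows which model produced the result.
        return {"model_uuid": model_uuid, "prediction": prediction}

container = ServingContainer()
container.deploy("uuid-old", lambda features: 30.0, tag="eats-etd")
container.deploy("uuid-new", lambda features: 28.0, tag="eats-etd")

by_tag = container.predict({"hour_of_day": 12}, tag="eats-etd")
by_uuid = container.predict({"hour_of_day": 12}, model_uuid="uuid-old")
```

A client that addresses the model by tag picks up `uuid-new` without any code change, while a client pinned to `uuid-old` keeps its old behavior until its configuration is updated.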

For A/B testing, users simply deploy the competing models by UUID or tag, then use Uber’s experimentation framework from within the client service to send portions of the traffic to each model and track performance metrics.
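The article does not describe how Uber's experimentation framework splits traffic. One common approach, shown here purely as an assumption, is to hash a stable identifier so that the same user is always routed to the same model. The function name, the tags, and the weights are illustrative.

```python
import hashlib

def assign_variant(unit_id, weights):
    """Deterministically map an ID to a variant; weights maps variant name -> traffic share."""
    if not weights:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(unit_id.encode("utf-8")).hexdigest()
    point = int(digest[:8], 16) / 2 ** 32  # stable pseudo-random value in [0, 1)
    total = sum(weights.values())
    cumulative = 0.0
    chosen = None
    for name, weight in weights.items():
        chosen = name
        cumulative += weight / total
        if point < cumulative:
            break
    return chosen  # falls through to the last variant if rounding leaves a tiny gap

weights = {"eats-etd-control": 0.5, "eats-etd-candidate": 0.5}
assignments = [assign_variant(f"user-{i}", weights) for i in range(1000)]
candidate_count = assignments.count("eats-etd-candidate")
```

The client would then send each prediction request with the tag returned by `assign_variant`, and the per-variant metrics can be compared afterwards.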

Scale and latency

Because machine learning models are stateless and share nothing, scaling them out is straightforward in both online and offline modes. For online models, engineers simply add machines to the prediction service cluster and let the load balancer spread the load. For offline prediction, they can add more Spark executors and let Spark manage the parallelism.

Online serving latency depends on the type and complexity of the model and on whether it needs features from the Cassandra feature store. For models that do not need features from Cassandra, P95 latency is less than 5 ms. For models that do fetch features from Cassandra, P95 latency is still less than 10 ms. The highest-traffic models currently serve more than 250,000 predictions per second.
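For readers less familiar with the term, P95 latency is the value below which 95% of the measured request latencies fall. A minimal sketch using the nearest-rank definition follows; the sample latencies are made up, and other percentile definitions (for example, with interpolation) give slightly different values.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [2.1] * 90 + [4.0] * 6 + [9.5] * 3 + [40.0]
p95 = percentile(latencies_ms, 95)
meets_target = p95 < 5.0
```

Percentiles are preferred over averages for latency targets because a small number of very slow requests can hide behind a healthy-looking mean.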

Monitor predictions

A model is trained and evaluated on historical data. Monitoring its predictions is therefore essential to make sure it keeps working well in the future. Engineers need to ensure that the data pipelines continue to send accurate data and that the production environment has not changed in ways that prevent the model from making accurate predictions.

To address this, Michelangelo automatically logs a sample of the predictions it makes and later joins them to the observed outcomes (labels) generated by the data pipeline. With this information, it can produce ongoing, live measurements of model accuracy. For a regression model, it publishes R-squared (the coefficient of determination), root mean square logarithmic error (RMSLE), root mean square error (RMSE), and mean absolute error (MAE) to Uber’s real-time monitoring system. Users can analyze charts of these metrics over time and set threshold alerts:
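The four regression metrics named above have standard definitions, sketched here over a list of observed outcomes and the matching logged predictions. The function name and sample values are illustrative. RMSLE is computed with `log1p`, so it assumes all values are greater than -1 (delivery times are non-negative).

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, RMSLE, and R-squared for paired observations and predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)) / n
    )
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum(e * e for e in errors)
    # R-squared is undefined when every observed value is identical.
    r_squared = 1 - ss_residual / ss_total if ss_total else float("nan")
    return {"mae": mae, "rmse": rmse, "rmsle": rmsle, "r_squared": r_squared}

observed_minutes = [10.0, 20.0, 30.0]
predicted_minutes = [12.0, 18.0, 33.0]
metrics = regression_metrics(observed_minutes, predicted_minutes)
```

A monitoring job could evaluate these metrics over a sliding window of joined predictions and outcomes, and raise an alert when, say, RMSE crosses a configured threshold.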

Figure 10: Predictions are sampled and compared with observed outcomes to generate model accuracy metrics.

Management layer, API, Web UI

The last important piece of the Michelangelo system is the API layer, which is also the brain of the system. It consists of a management application that serves the web UI and the network API and integrates with Uber’s monitoring and alerting infrastructure. It also houses the workflow system that orchestrates the batch data pipelines, training jobs, batch prediction jobs, and the deployment of models to both batch and online containers.

Michelangelo users interact with these components directly through the web UI, the REST API, and the monitoring and management tools.

Building on the Michelangelo platform

Engineers plan to continue scaling and hardening the existing system in the coming months to support the growing number of users and the growth of Uber’s business. As the layers of the Michelangelo platform mature, they plan to build higher-level tools and services to drive machine learning within the company and better support the business:

  • AutoML. This will be a system for automatically searching and discovering model configurations (algorithms, feature sets, hyperparameter values, and so on) that yield the best-performing models for a given problem. The system would also automatically build the data pipelines that generate the features and labels the models need. The engineering team has already addressed much of this through the feature store, the unified offline and online data pipelines, and the hyperparameter search feature. AutoML could speed up the early stages of data science work: data scientists would specify a set of labels and an objective function, and the system would use Uber’s data to find the best model for the problem. The ultimate goal is to build smarter tools that simplify data scientists’ work and increase their productivity.
  • Model visualization. Understanding and debugging models is increasingly important, especially for deep learning. Engineers have made a start with visualization tools for tree-based models, but much more needs to be done to help data scientists understand, debug, and tune their models and obtain results they can trust.
  • Online learning. Most of Uber’s machine learning models directly affect the Uber product in real time, which means they need to operate in a complex, ever-changing real world. To stay accurate as the environment changes, models need to change with it. Today, teams retrain their models regularly on the Michelangelo platform. A complete platform solution would let users easily update models, train and evaluate them faster, and rely on more sophisticated monitoring and alerting. It will be a big undertaking, but early research suggests that a full online learning system could bring huge benefits.
  • Distributed deep learning. A growing number of Uber’s machine learning systems use deep learning. The workflow for defining and iterating on deep learning models is very different from the standard workflow and needs additional platform support. Deep learning typically handles larger data sets and requires different hardware (such as GPUs), so it needs stronger support for distributed learning and tighter integration with a more flexible resource management stack.

If you are interested in tackling machine learning challenges at scale, consider joining Uber’s Machine Learning Platform team!

Jeremy Hermann is the engineering manager of Uber’s Machine Learning Platform team. Mike Del Balso is the product manager of Uber’s Machine Learning Platform team.

