The past two decades have taught the software industry some profound lessons, many of them driven by the advent of DevOps and its widespread adoption.
Leading software companies follow the same pattern: rapid iteration in development, followed by continuous integration, continuous delivery, and continuous deployment. Every artifact is tested for its ability to provide value, the software is always kept in a deployable state, and deployment is automated.
While machine learning differs from traditional software development, there are many practical lessons to learn from the software industry. Over the past few years, we have been building production-oriented machine learning projects. Our goal is not just proof of concept, but reproducibility and repeatability, just as in software development. So we built a machine learning pipeline orchestrator with strong automation capabilities and established a workflow to support it.
Why not just use Jupyter Notebooks? How long does it take to build a set of notebooks covering all the processing steps from scratch? How easy is it for new members to join the team? Can you quickly reproduce the results from two months ago? Can you compare today’s results with historical ones? Can you trace the provenance of the data used during training? What happens when your model goes stale?
We have run into all of these problems. We have now distilled those lessons into 12 factors of successful production machine learning (analogous to the twelve-factor app in software development).
1. Version control
Version control is a matter of course for software engineers, but the practice is not yet widely adopted by data scientists. To quote the folks at GitLab as a quick primer:
Version control facilitates coordination, sharing, and collaboration across the entire software development team. Version control software enables teams to work in a distributed and asynchronous environment, manage changes and versioning of code and files, and resolve merge conflicts and associated exceptions.
Simply put, version control allows you to safely manage the changing parts of your software development.
Machine learning is a special kind of software development with its own unique needs. First, machine learning has not one but two moving parts: code and data. Second, models are trained through (fast) iteration, so the code differs widely between runs (e.g. data splitting, preprocessing, model initialization).
Whenever the data changes, it needs to be versioned so that experiments and model training can be reproduced. Improvised versioning (hard copies) leaves much room for improvement, and especially in a shared team setting, consistent version control is critical.
Versioning the code is even more critical. Beyond the quote above, preprocessing code matters not only during training but also at serving time, and it needs to stay tied to a specific model version. A serverless architecture can offer an accessible middle ground between the data scientist’s workflow and production requirements.
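Dedicated tools such as DVC handle data versioning far more robustly, but a minimal sketch of the idea is to fingerprint each dataset and record it next to the current code commit; the file paths and helper names below are illustrative.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def dataset_fingerprint(path: str) -> str:
    """Content hash of a data file, used as its version identifier."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_versions(data_path: str, out_path: str = "run_versions.json") -> dict:
    """Pair the data hash with the current git commit so a run can be reproduced."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    versions = {"code_commit": commit, "data_sha256": dataset_fingerprint(data_path)}
    Path(out_path).write_text(json.dumps(versions, indent=2))
    return versions


# record_versions("data/train.csv")  # hypothetical dataset path
```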
Bottom line: You need to version your code, and you need to version your data.
2. Explicit feature dependency
In a perfect world, whatever produces your input data would always produce exactly the same data, at least structurally. But the world is not perfect: the data you receive from upstream services is built by humans and will change, and eventually the features will change too. At best, your model fails outright; at worst, it quietly keeps working while producing garbage results.
Clearly defined feature dependencies enable early failure detection. Well-designed systems also accommodate evolving feature dependencies during continuous training and serving.
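A minimal sketch of one way to make these dependencies explicit: declare the columns and dtypes the model relies on and fail fast when incoming data diverges (the feature names below are hypothetical).

```python
import pandas as pd

# Explicit feature contract: every column the model depends on, with its expected dtype.
EXPECTED_FEATURES = {
    "age": "int64",
    "income": "float64",
    "country": "object",
}


def validate_features(df: pd.DataFrame) -> None:
    """Fail early if upstream data no longer matches the declared dependencies."""
    missing = set(EXPECTED_FEATURES) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected features: {sorted(missing)}")
    for column, expected_dtype in EXPECTED_FEATURES.items():
        actual = str(df[column].dtype)
        if actual != expected_dtype:
            raise TypeError(
                f"Feature '{column}' has dtype {actual}, expected {expected_dtype}"
            )
```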
Bottom line: Make your feature dependencies clear in your code.
3. Descriptive training and preprocessing
Good software is well described and annotated – it’s easy to read and understand the functionality without having to read every line of code.
Although machine learning is a special type of software development, it does not exempt practitioners from following established coding guidelines. One of the most basic requirements is that a reader should be able to gain a basic understanding of the code with very little effort and in very little time.
Preprocessing, training, and prediction code should follow PEP 8. Your code should use meaningful object names and include comments that aid understanding. Following PEP 8 improves readability, reduces complexity, and speeds up debugging. Programming paradigms such as SOLID provide guiding principles that make code more maintainable, easier to understand, and more flexible for future use cases.
Configuration should be separated from code. Do not hard-code the data split ratio; provide it as configuration so it can be changed at run time. This is already well understood for hyperparameter tuning: using separate configuration files significantly speeds up iteration and keeps the code base reusable.
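A minimal sketch, assuming a PyYAML-readable config.yaml with hypothetical keys for the data split and model hyperparameters:

```python
# config.yaml (illustrative):
#   data:
#     test_size: 0.2
#   model:
#     learning_rate: 0.01
#     n_estimators: 200

import yaml


def load_config(path: str = "config.yaml") -> dict:
    """Read run-time parameters from a file instead of hard-coding them."""
    with open(path) as f:
        return yaml.safe_load(f)


config = load_config()
test_size = config["data"]["test_size"]            # no magic numbers in the code base
learning_rate = config["model"]["learning_rate"]
```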
Bottom line: Improve code readability and separate code from configuration.
4. Reproducible training results
If you cannot reproduce training results, you cannot trust them. While reproducibility is the overall theme of this article, a few details deserve explanation. It is not just you who needs to be able to reproduce your training results; your entire team needs to be able to do the same. Training results hidden away in a Jupyter Notebook, whether on a PC or on an AWS virtual machine, are the opposite of reproducibility.
By using a pipeline to train models, the whole team gets transparent access to the experiments and training runs that have been performed. By pairing a reusable code base with separate configuration files, anyone can successfully retrain at any time.
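Pipelines automate the workflow, but exact repeatability usually also requires pinning the sources of randomness. A minimal sketch:

```python
import os
import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    """Pin the sources of randomness so a pipeline run can be repeated exactly."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)


set_global_seed(42)
# Framework-specific seeds (e.g. torch.manual_seed, tf.random.set_seed) would be set here too.
```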
Bottom line: Automate your workflow using pipelines.
5. Test
Testing takes many forms. Here are two examples:
- Unit testing is testing at the atomic level – each function is tested individually according to its own specific criteria.
- Integration testing takes the opposite approach – testing all elements of the code base together, as well as testing clone or mock versions of upstream and downstream services.
Both paradigms apply well to machine learning.
The preprocessing code must be unit tested – given various inputs, does the transformation produce the right results?
Models are a good use case for integration testing – when served in production, does your model perform as well as it did during evaluation?
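A sketch of both styles with pytest, using a hypothetical `normalize` preprocessing function and placeholder accuracy numbers for the serving comparison:

```python
import numpy as np
import pytest


def normalize(values: np.ndarray) -> np.ndarray:
    """Example preprocessing step under test: scale values to zero mean, unit variance."""
    return (values - values.mean()) / values.std()


def test_normalize_zero_mean_unit_variance():
    # Unit test: one function, one well-defined expectation.
    result = normalize(np.array([1.0, 2.0, 3.0, 4.0]))
    assert result.mean() == pytest.approx(0.0)
    assert result.std() == pytest.approx(1.0)


def test_served_model_matches_offline_evaluation():
    # Integration-style test (sketch): compare the serving path against offline results.
    # In a real setup this would query the deployed endpoint instead of using stubs.
    offline_accuracy = 0.91   # hypothetical value logged during training
    serving_accuracy = 0.91   # hypothetical value measured against the same holdout set
    assert abs(offline_accuracy - serving_accuracy) < 0.02
```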
Bottom line: Test your code, test your model.
6. Model drift/continuous training
In production scenarios, model drift is a legitimate concern. You need to account for the possibility of drift whenever the data can change (e.g. user input, upstream service instability). Two measures can address this risk:
- Monitor the data in the production system. Establish automated reporting mechanisms that notify the team when incoming data changes beyond the well-defined feature dependencies (see the drift-detection sketch after this list).
- Continuously train on newly recorded data. A well-automated pipeline can be rerun on new data, compare the results against historical training runs to expose performance degradation, and provide a fast path to promoting a newly trained, better-performing model into production.
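A hedged sketch of the monitoring half: compare production feature statistics against the statistics recorded at training time and flag features that have drifted (the threshold and feature names are assumptions).

```python
import pandas as pd


def detect_drift(training_stats: dict, production_df: pd.DataFrame, threshold: float = 3.0) -> list:
    """Flag features whose production mean moved more than `threshold` training standard deviations."""
    drifted = []
    for feature, stats in training_stats.items():
        shift = abs(production_df[feature].mean() - stats["mean"])
        if stats["std"] > 0 and shift / stats["std"] > threshold:
            drifted.append(feature)
    return drifted


# Example: training_stats = {"income": {"mean": 52000.0, "std": 11000.0}}
# Any feature returned here could trigger an alert or a retraining run of the pipeline.
```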
Bottom line: If your data can change, adopt a pipeline for continuous training.
7. Tracking of experimental results
Excel is not a good way to track experiment results. And it is not just Excel: any manually tracked, scattered record of results is neither authoritative nor trustworthy.
The correct approach is to record training results automatically in centralized storage. Automation ensures that every training run is reliably tracked, which makes it easy to compare runs later. Centralized storage makes results transparent across the team and enables continuous analysis.
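Tracking tools such as MLflow provide exactly this kind of centralized, automated record. A minimal sketch, assuming a local or remote MLflow tracking store and hypothetical parameter and metric values:

```python
import mlflow

mlflow.set_experiment("churn-model")           # hypothetical experiment name

with mlflow.start_run():
    # Parameters and metrics go to a central tracking store, not a spreadsheet.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("data_sha256", "abc123")  # tie the run to the exact data version
    mlflow.log_metric("val_accuracy", 0.93)
    mlflow.log_artifact("config.yaml")         # keep the full configuration with the run
```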
Bottom line: Track experiment results with automated methods.
8. Experiment models versus production models
Understanding a data set takes effort. Often we build that understanding through experimentation, especially when the domain carries a lot of implicit knowledge. Create a Jupyter Notebook, load some or all of the data into a pandas DataFrame, spend a few hours on unstructured exploration, train a first model, evaluate the results – job done. Unfortunately, that is not how it works.
Experiments have their place in the machine learning life cycle, but their product is not models – it is understanding. Models built in exploratory Jupyter Notebooks exist to build understanding, not to go into production. Once that understanding is gained, further development and adaptation are needed to build a training process fit for production.
However, any understanding that does not depend on domain-specific knowledge can be automated. Generate statistics for every version of the data you use, so you can skip the one-off, throwaway exploration in Jupyter Notebooks and go straight to the first pipeline. The earlier you experiment within the pipeline, the earlier you can collaborate on intermediate results, and the earlier you will reach a production-ready model.
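A minimal sketch of the automatable part: generate per-column statistics for each data version and persist them alongside the version identifier (function and path names are illustrative).

```python
import json

import pandas as pd


def profile_dataset(df: pd.DataFrame, data_version: str, out_path: str) -> None:
    """Persist basic per-column statistics for a given data version."""
    profile = {
        "data_version": data_version,
        "n_rows": int(len(df)),
        "columns": {
            col: {
                "dtype": str(df[col].dtype),
                "missing": int(df[col].isna().sum()),
                "stats": df[col].describe().to_dict(),
            }
            for col in df.columns
        },
    }
    with open(out_path, "w") as f:
        json.dump(profile, f, indent=2, default=str)
```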
Bottom line: Notebooks cannot go into production, so experiment within the pipeline early.
9. Training-serving skew
Training-serving skew is the difference between a model’s performance during training and its performance in the production environment. This skew can be caused by:
- Data being processed differently in the training and production workflows.
- Training data differing from the data seen in production.
- A feedback loop between your model and your algorithm.
The usual advice for avoiding skew between training and production is to embed all data preprocessing correctly into the model serving environment. That is certainly true, and you should follow that principle. However, it is also a rather narrow interpretation of training-serving skew.
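One common way to embed preprocessing into the served model is to bundle it with the estimator, so training and serving execute exactly the same code. A sketch with scikit-learn, using stand-in data:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training data; in practice this comes from the versioned dataset.
X_train = np.random.rand(100, 3)
y_train = np.random.randint(0, 2, size=100)

# Training: preprocessing and model are fitted and persisted as a single object.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model_pipeline.joblib")

# Serving: loading the same artifact runs exactly the same preprocessing code.
served = joblib.load("model_pipeline.joblib")
predictions = served.predict(np.random.rand(5, 3))
```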
A little bit of DevOps history: in 2006, Werner Vogels, Amazon’s CTO, coined the phrase “You build it, you run it” – meaning that developers are responsible not only for writing programs but also for running them.
Machine learning projects need a similar mindset – understanding upstream data generation and downstream model usage is part of the data scientist’s remit. Which systems generate the data you train on? Can they break? What are their service-level objectives (SLOs)? Are those consistent with the objectives for serving in production? How is your model served? What does the runtime environment look like? How are features preprocessed at serving time? These are questions data scientists need to understand and answer.
Bottom line: Properly embed preprocessing into the production environment’s model services to ensure you understand the upstream and downstream of the data.
10. Comparability
From the moment a second training script is introduced to a project, comparability becomes an essential part of all future work. If the results of the second model cannot be compared with those of the first, effort has been wasted and at least one of them, possibly both, is redundant.
By definition, all training runs that attempt to solve the same problem need to be comparable; otherwise they are not solving the same problem. Although iteration may change exactly what is being compared, the ability to compare training runs needs to be built into the training architecture as a first-class feature from the start.
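A small sketch of one precondition for comparability: refuse to compare two tracked runs unless they were evaluated on the same data version (the run-record structure is an assumption).

```python
def compare_runs(run_a: dict, run_b: dict, metric: str = "val_accuracy") -> float:
    """Return the metric difference between two tracked runs, or fail if they are not comparable."""
    if run_a["eval_data_version"] != run_b["eval_data_version"]:
        raise ValueError("Runs were evaluated on different data versions and cannot be compared.")
    return run_b["metrics"][metric] - run_a["metrics"][metric]


# Example with hypothetical tracked run records:
# run_a = {"eval_data_version": "abc123", "metrics": {"val_accuracy": 0.91}}
# run_b = {"eval_data_version": "abc123", "metrics": {"val_accuracy": 0.93}}
# compare_runs(run_a, run_b)  # -> roughly 0.02 improvement
```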
Bottom line: Build your own pipeline process so you can easily compare training results from each process.
11. Monitoring
Roughly speaking, the goal of machine learning is to learn from data in order to solve a problem. Solving it requires allocating computing resources: first for training the model, then for serving it. The person or team responsible for allocating resources during training should also be responsible for carrying them over into serving. Many kinds of performance degradation can occur once a model is in use: data can drift, the model can become a bottleneck for overall performance, and bias is a real problem.
Model effectiveness: data scientists and their teams are responsible for monitoring the models they produce. They are not necessarily responsible for implementing the monitoring, especially in large organizations, but they are certainly responsible for understanding and interpreting the monitoring data. At a minimum, you need to monitor input data, inference latency, resource usage (such as CPU and RAM), and output data.
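A minimal sketch of what that minimum could look like in the serving path: wrap inference so latency, input statistics, and output statistics are logged on every call (logger setup and the model object are assumed to exist elsewhere).

```python
import logging
import time

import numpy as np

logger = logging.getLogger("model_monitoring")


def monitored_predict(model, features: np.ndarray) -> np.ndarray:
    """Wrap model inference with basic monitoring of inputs, latency, and outputs."""
    start = time.perf_counter()
    prediction = model.predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "inference latency_ms=%.2f input_mean=%.4f output_mean=%.4f",
        latency_ms, float(np.mean(features)), float(np.mean(prediction)),
    )
    return prediction
```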
Bottom line: Again, “You build it, You run it.” Monitoring models in a production environment is part of the data science effort.
12. Deployability of the model
Technically, every training run needs to produce an artifact that can be deployed to a production environment. The model’s results may well be poor, but it still needs to be packaged as a finished, deployable product.
This is a common pattern in software development, also known as Continuous Delivery. Teams need to be able to deploy their software at any time, and to meet this goal, iteration cycles need to be fast enough.
Machine learning needs a similar approach. It forces the team to confront the balance between reality and expectations up front. All stakeholders should be clear about which model outcomes are theoretically possible, and should agree on how the model will be deployed and how it will integrate with the broader software architecture. This in turn requires stronger automation and, by necessity, the adoption of most of the factors outlined above.
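A hedged sketch of the “deployable product” idea: every pipeline run writes a self-contained artifact directory containing the model, its configuration, and version metadata, so any run could in principle be promoted to production (paths and structure are illustrative).

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def package_run(model_path: str, config_path: str, metrics: dict, data_version: str) -> Path:
    """Bundle everything a deployment needs into one versioned artifact directory."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    artifact_dir = Path("artifacts") / f"run-{stamp}"
    artifact_dir.mkdir(parents=True)
    shutil.copy(model_path, artifact_dir / "model.joblib")
    shutil.copy(config_path, artifact_dir / "config.yaml")
    (artifact_dir / "metadata.json").write_text(
        json.dumps({"data_version": data_version, "metrics": metrics}, indent=2)
    )
    return artifact_dir
```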
Bottom line: Every training process needs to result in a deployable product, not just a model.
Conclusion
This is by no means an exhaustive list. It draws on our experience, and you are welcome to use it as a benchmark, as a template for evaluating your production architecture, or as a blueprint for designing your own.
12 Factors of reproducible Machine Learning in production