This is the 11th day of my participation in the First Challenge 2022
In the previous article, Quick Start DVC (5): Data Pipelining, I covered how to build a machine learning model training pipeline and how to reproduce the trained model. This article describes how DVC tracks model metrics, how to modify training parameters and retrain, and how to visualize model performance with plots.
Collecting model metrics
First, let's look at the mechanism for capturing the values of these ML attributes. We added an evaluation stage at the end of the previous pipeline:
$ dvc run -n evaluate \
-d src/evaluate.py -d model.pkl -d data/features \
-M scores.json \
--plots-no-cache prc.json \
--plots-no-cache roc.json \
python src/evaluate.py model.pkl \
data/features scores.json prc.json roc.json
The -M option specifies a metrics file, and the --plots-no-cache option specifies a plots file (produced by this stage) that DVC will not cache. dvc run adds a new stage to the dvc.yaml file:
evaluate:
  cmd: python src/evaluate.py model.pkl data/features scores.json prc.json roc.json
  deps:
    - data/features
    - model.pkl
    - src/evaluate.py
  metrics:
    - scores.json:
        cache: false
  plots:
    - prc.json:
        cache: false
    - roc.json:
        cache: false
The biggest difference between this stage and the previous stages in our pipeline lies in the two new sections: metrics and plots. These sections mark certain files as containing machine learning "telemetry" data. Metrics files hold scalar values (e.g. AUC), while plots files hold matrices and series of data points intended for visualization and comparison (e.g. ROC curves or model loss plots).
Because cache: false is set for these outputs, DVC skips caching them: we want scores.json, prc.json, and roc.json to be version-controlled by Git instead.
The evaluate.py script writes the model's ROC AUC and average precision (AP) to scores.json, the file marked as a metrics file with -M. Its contents look like this:
{ "avg_prec": 0.5204838673030754."roc_auc": 0.9032012604172255 }
At the same time, evaluate.py also writes the precision, recall, and threshold arrays (obtained with scikit-learn's precision_recall_curve function) to the plots file prc.json:
{
  "prc": [
    { "precision": 0.021473008227975116, "recall": 1.0, "threshold": 0.0 },
    ...,
    { "precision": 1.0, "recall": 0.009345794392523364, "threshold": 0.6 }
  ]
}
Similarly, it writes the arrays produced by scikit-learn's roc_curve function to roc.json, which will be used for another plot.
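To make this concrete, here is a minimal sketch of what such an evaluate.py could look like. It is not the exact script from the tutorial: the model type, the location of the test set (data/features/test.pkl), and the pickle layout are assumptions made for illustration.
import json
import os
import pickle
import sys

import sklearn.metrics as metrics

# Arguments as passed by the evaluate stage:
# python src/evaluate.py model.pkl data/features scores.json prc.json roc.json
model_file, features_dir, scores_file, prc_file, roc_file = sys.argv[1:6]

with open(model_file, "rb") as fd:
    model = pickle.load(fd)

# Assumption: test features and labels are pickled together in the features directory
with open(os.path.join(features_dir, "test.pkl"), "rb") as fd:
    x, labels = pickle.load(fd)

predictions = model.predict_proba(x)[:, 1]

# Scalar metrics go to scores.json (marked with -M)
with open(scores_file, "w") as fd:
    json.dump({
        "avg_prec": float(metrics.average_precision_score(labels, predictions)),
        "roc_auc": float(metrics.roc_auc_score(labels, predictions)),
    }, fd)

# Precision/recall/threshold points go to prc.json (marked with --plots-no-cache)
precision, recall, prc_thresholds = metrics.precision_recall_curve(labels, predictions)
with open(prc_file, "w") as fd:
    json.dump({
        "prc": [
            {"precision": float(p), "recall": float(r), "threshold": float(t)}
            for p, r, t in zip(precision, recall, prc_thresholds)
        ]
    }, fd)

# FPR/TPR/threshold points go to roc.json
fpr, tpr, roc_thresholds = metrics.roc_curve(labels, predictions)
with open(roc_file, "w") as fd:
    json.dump({
        "roc": [
            {"fpr": float(f), "tpr": float(t), "threshold": float(th)}
            for f, t, th in zip(fpr, tpr, roc_thresholds)
        ]
    }, fd)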
DVC does not force you to use any particular file names, nor any particular format or structure for metrics and plots files; these are entirely up to the user and the specific use case.
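For example, nothing stops a stage from pointing at custom paths. A hypothetical layout (not used in this tutorial) could look like this inside a stage definition in dvc.yaml:
metrics:
  - eval/live_scores.json:
      cache: false
plots:
  - eval/curves/precision_recall.json:
      cache: false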
You can view the tracked metrics and plots with DVC.
First, let’s look at the metrics generated:
$ dvc metrics show
Path         avg_prec    roc_auc
scores.json  0.52048     0.9032
Next, let's look at the plots. Before viewing them, we need to tell DVC which arrays to use as the axes. We only need to do this once, and DVC will save our plot configuration.
$ dvc plots modify prc.json -x recall -y precision
Modifying stage 'evaluate' in 'dvc.yaml'
$ dvc plots modify roc.json -x fpr -y tpr
Modifying stage 'evaluate' in 'dvc.yaml'
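These commands record the axis choice in dvc.yaml. The plots section of the evaluate stage should now look roughly like this (the exact keys may differ slightly between DVC versions):
plots:
  - prc.json:
      cache: false
      x: recall
      y: precision
  - roc.json:
      cache: false
      x: fpr
      y: tpr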
Now, let's render the plots:
$ dvc plots show
file:///Users/dvc/example-get-started/plots.html
We save this iteration for later comparison:
$ git add scores.json prc.json roc.json
$ git commit -a -m "Create evaluation stage"
We’ll see how to compare and visualize different pipeline iterations later.
Now, let’s look at how to get another important piece of information that is useful for comparison: parameters.
Defining pipeline stage parameters
It is common for data science pipelines to include configuration files that define parameters that can be modified to train the model, perform preprocessing, and so on.
DVC provides a mechanism for stages to rely on the values of specific parts of such configuration files (supported by YAML, JSON, TOML, and Python formats).
Fortunately, we already have a stage with parameters in dvc.yaml:
featurize:
  cmd: python src/featurization.py data/prepared data/features
  deps:
    - data/prepared
    - src/featurization.py
  params:
    - featurize.max_features
    - featurize.ngrams
  outs:
    - data/features
Let's recall how this stage was generated. The featurize stage was created with the following dvc run command:
dvc run -n featurize \
-p featurize.max_features,featurize.ngrams \
-d src/featurization.py -d data/prepared \
-o data/features \
python src/featurization.py data/prepared data/features
Note the -p option (short for --params), which defines the parameter dependencies of the featurize stage. By default, DVC reads these values (featurize.max_features and featurize.ngrams) from the params.yaml file. However, just like metrics and plots file names, parameter file names and structure can be customized by the user for each use case.
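For instance, a hypothetical variant (not part of this pipeline) could read the parameters from a different file by prefixing the parameter list with that file name; the stage's script would then also need to load that file itself:
dvc run -n featurize \
-p myparams.toml:featurize.max_features,featurize.ngrams \
-d src/featurization.py -d data/prepared \
-o data/features \
python src/featurization.py data/prepared data/features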
Here are the contents of our params.yaml file:
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
train:
  seed: 20170428
  n_est: 50
  min_split: 2
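For reference, the stage's own script is what actually consumes these values. Here is a minimal sketch of how featurization.py might load its section of params.yaml (assuming PyYAML; the real script's details may differ):
import yaml

# Load only the "featurize" section of params.yaml
with open("params.yaml") as fd:
    params = yaml.safe_load(fd)["featurize"]

max_features = params["max_features"]  # 500 in the current configuration
ngrams = params["ngrams"]              # 1 in the current configuration

# These values could then drive the feature extraction, for example:
# CountVectorizer(max_features=max_features, ngram_range=(1, ngrams))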
Updating stage parameters and iterating on the pipeline
We're definitely not satisfied with the AUC value obtained so far!
Let's edit the params.yaml file to use bigrams and increase the number of features:
 featurize:
-  max_features: 500
-  ngrams: 1
+  max_features: 1500
+  ngrams: 2
The beauty of dvc.yaml is that all you need to do now is run dvc repro:
$ dvc repro
DVC analyzes what has changed, reuses existing results from the run cache, executes only the commands that are required, and produces the new results (model, metrics, plots). The same logic applies to any other change (editing the source code, updating the data set): make the change, run dvc repro, and DVC re-runs only what is needed.
Comparing the results of two pipeline iterations
Finally, let’s see if the update operation improves the performance of the model.
DVC has commands to view and visualize changes in metrics, parameters, and plots, and they work across one or more pipeline iterations. Let's compare the current "bigrams" run against the "baseline" iteration we committed earlier:
dvc params diff shows the difference between the parameters in the workspace and those in the last commit.
$ dvc params diff
Path         Param                   HEAD  workspace
params.yaml  featurize.max_features  500   1500
params.yaml  featurize.ngrams        1     2
dvc metrics diff does the same for metrics, showing the difference between the metrics in the workspace and those in the last commit:
$ dvc metrics diff
Path         Metric    HEAD     workspace  Change
scores.json  avg_prec  0.52048  0.55259    0.03211
scores.json  roc_auc   0.9032   0.91536    0.01216
Finally, we can compare the PR and ROC curves between the two iterations with a single command!
$ dvc plots diff
file:///Users/dvc/example-get-started/plots.html
Conclusion
This article covered commands such as dvc metrics and dvc plots, which DVC provides to make it easy to track metrics, update parameters, and visualize model performance with plots.