Version control, also known as source control, tracks changes to code and other artifacts in software development and data science work. It lets you checkpoint artifacts, compare versions, and branch off onto different development paths. Most of us already know about version control, but it’s not obvious how to use it effectively with Jupyter notebooks. Because a Jupyter notebook is a JSON document containing code, metadata, and output, comparing notebook versions can be cumbersome with standard version control tools like Git. But for developers and data scientists who use Jupyter notebooks, there are ways to make life easier.

A simple diff example shows how messy it can be

The best way to understand the problem is to look at a simple example. Here I will show how the three parts of a Jupyter notebook (code, metadata, and output) change during development, and how the notebook file interacts with the default Git tools.

If you want to follow along, you can create a notebook and Git repository by hand, mimicking the code and examples below, or you can clone my Git repository and edit the test notebook there.

In this example, we made a simple notebook with the following code.

import matplotlib.pyplot as plt
plt.plot([x**2 for x in range(100)])

[Figure: Matplotlib plot of x**2]

If we execute the notebook, save it, and add it to Git, we have our base version.

> git add jupyter_git_example.ipynb
> git commit -m "initial git commit for example, for diffing"

I won’t show the full source of the notebook here, but if you look at it yourself (using a text editor, or less in a Unix shell), you’ll see that it’s just a JSON document. But what happens if we re-execute the cell that contains the code and save the notebook? What has changed? Here is the diff I see.

> git diff jupyter_git_example.ipynb
diff --git a/tools/jupyter_git_example.ipynb b/tools/jupyter_git_example.ipynb
index 906fb11..750fe85 100644
--- a/tools/jupyter_git_example.ipynb
+++ b/tools/jupyter_git_example.ipynb
@@ -2,17 +2,17 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 1,
-   "id": "28c01064",
+   "execution_count": 2,
+   "id": "e2c08a0a",
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "[<matplotlib.lines.Line2D at 0x1187977c0>]"
+       "[<matplotlib.lines.Line2D at 0x11891e100>]"
       ]
      },
-     "execution_count": 1,
+     "execution_count": 2,
      "metadata": {},
      "output_type": "execute_result"
     },
@@ -33,6 +33,14 @@
     "import matplotlib.pyplot as plt\n",
     "plt.plot([x**2 for x in range(100)])"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2915ced8",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {

What’s going on here? As you can see, the cell’s execution count has changed (along with some identifiers). The diff also shows a new blank cell below the top cell, created when I executed the first cell. No code has actually changed, but you have to look very closely to figure that out, which can be confusing if you haven’t seen a notebook file before.
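One way to convince yourself that only volatile fields changed is to compare just the code in the two versions. The sketch below is a minimal illustration and not part of Git or Jupyter: the helper names code_sources and make_notebook are hypothetical, and the two notebooks are built in memory to mirror the fields seen in the diff above.

```python
import json


def code_sources(notebook_json):
    """Return the source text of every non-empty code cell in a notebook."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The .ipynb format stores source as a list of lines or a single string.
        text = "".join(source) if isinstance(source, list) else source
        if text.strip():  # ignore blank cells such as the one Jupyter appended
            sources.append(text)
    return sources


def make_notebook(execution_count, cell_id, add_blank_cell=False):
    """Build a tiny in-memory notebook (hypothetical helper for this sketch)."""
    cells = [{
        "cell_type": "code",
        "execution_count": execution_count,
        "id": cell_id,
        "metadata": {},
        "outputs": [],
        "source": [
            "import matplotlib.pyplot as plt\n",
            "plt.plot([x**2 for x in range(100)])",
        ],
    }]
    if add_blank_cell:
        cells.append({
            "cell_type": "code",
            "execution_count": None,
            "id": "2915ced8",
            "metadata": {},
            "outputs": [],
            "source": [],
        })
    return json.dumps(
        {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
    )


before = make_notebook(1, "28c01064")
after = make_notebook(2, "e2c08a0a", add_blank_cell=True)

print(before == after)                              # the raw JSON differs
print(code_sources(before) == code_sources(after))  # the code does not
```

The raw files differ, yet the extracted code is identical, which is exactly the situation a line-based diff makes hard to see.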

Visual output, especially images, makes diffs even more confusing

Now, if I change the code (say from x**2 to x**3) and execute and save the notebook, both the code and the output change.

For me, the diff looks something like this.

> git diff jupyter_git_example.ipynb
diff --git a/tools/jupyter_git_example.ipynb b/tools/jupyter_git_example.ipynb
index 906fb11..6a8084e 100644
--- a/tools/jupyter_git_example.ipynb
+++ b/tools/jupyter_git_example.ipynb
@@ -2,23 +2,23 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 1,
-   "id": "28c01064",
+   "execution_count": 3,
+   "id": "57e4b20e",
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "[<matplotlib.lines.Line2D at 0x1187977c0>]"
+       "[<matplotlib.lines.Line2D at 0x11897a160>]"
       ]
      },
-     "execution_count": 1,
+     "execution_count": 3,
      "metadata": {},
      "output_type": "execute_result"
     },
     {
      "data": {
-      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYMAAAD4CAYAAAAO9oqkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZ...
+      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEDCAYAAAAlRP8qAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZ...
       "text/plain": [
        "<Figure size 432x288 with 1 Axes>"
       ]
@@ -31,8 +31,16 @@
    ],
    "source": [
     "import matplotlib.pyplot as plt\n",
-    "plt.plot([x**2 for x in range(100)])"
+    "plt.plot([x**3 for x in range(100)])"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "09023a82",
+   "metadata": {},
+   "outputs": [],
+   "source": []

As before, the execution count and the ID have changed, but now the diff contains both code changes and output changes. Also, since the Matplotlib image is base64 encoded, it’s hard to tell what changed in the image/png output. If we want our diffs to be as clean as possible, what should we do? I use one of two strategies, each for a different reason.

Simple strategy: treat the notebook only as source code

The simplest strategy is to commit only cleared notebooks with no output. To get a clean notebook with no output, open the Kernel menu and select “Restart & Clear Output”. If you then save the notebook, you will see that the notebook file contains no output, so the only differences are in the code and metadata. The advantage of this strategy is that you don’t have to deal with messy diffs of output data, especially visual data. Your diffs (and probably most merges) will be easy to understand.

This will make a lot of sense for some uses of notebooks. For example, if your notebook is a parameterized notebook for Papermill, you have one base notebook that is used to create several or even thousands of different notebooks with different parameters. Saving the output for one arbitrary set of parameters isn’t necessarily helpful, and saving every version of these notebooks usually doesn’t make sense. However, saving the base version (with no output) may be a good choice. This approach treats the notebook primarily as source code, not as data or results.
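If you would rather clear outputs from a script than from the menu, the idea can be sketched in a few lines of Python. This is only an illustration under the assumption that the notebook uses the standard nbformat 4 layout; strip_outputs is a hypothetical helper, and in practice the menu command or a dedicated tool is probably the safer choice.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counts cleared."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's data untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A small in-memory notebook standing in for an executed .ipynb file.
executed = {
    "cells": [{
        "cell_type": "code",
        "execution_count": 3,
        "id": "57e4b20e",
        "metadata": {},
        "outputs": [{
            "data": {"text/plain": ["<Figure size 432x288 with 1 Axes>"]},
            "metadata": {},
            "output_type": "display_data",
        }],
        "source": ["plt.plot([x**3 for x in range(100)])"],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(executed)
print(json.dumps(cleaned["cells"][0], indent=1))
```

To apply this to a file, you would load it with json.load, pass the result through strip_outputs, and write it back with json.dump.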

If you use this option, you may want to consider creating unit tests for your notebook code. You might also consider pulling the code out into a Python module (a .py file) if that makes sense for your use case.
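As a rough sketch of what that could look like for the example notebook, the plotting data can move into a small function with a couple of plain assert-based tests. The function name power_curve and the idea of keeping it in its own module are assumptions made for illustration; they are not part of the original notebook.

```python
def power_curve(exponent, n=100):
    """Return [x**exponent for x in range(n)], the data the example notebook plots."""
    return [x ** exponent for x in range(n)]


def test_power_curve_values():
    assert power_curve(2, 4) == [0, 1, 4, 9]
    assert power_curve(3, 4) == [0, 1, 8, 27]


def test_power_curve_length():
    assert len(power_curve(2)) == 100


if __name__ == "__main__":
    # Run the checks directly; a test runner could also collect the test_ functions.
    test_power_curve_values()
    test_power_curve_length()
    print("all tests passed")
```

The notebook cell then shrinks to an import plus plt.plot(power_curve(3)), and the logic that matters lives in a file that diffs cleanly.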

Alternative strategy: treat your notebook as code and data

Ideally, you want to keep your results together with your notebook, which is really one of the main reasons to use a notebook in the first place.

nbdime makes diffing and merging easier

If you use a notebook as both code and data, there is a simple tool that makes life easier: nbdime, a tool for diffing and merging notebooks. It is part of the Jupyter project, knows how to compare notebook files properly, and integrates with Git or Mercurial for version control, as well as with the Jupyter notebook server and JupyterLab, among other tools.

Installation

You can install nbdime with pip. Ideally, install it in the virtual environment where you develop your notebooks.

pip install nbdime

Simple diffs

To diff a notebook, simply run nbdiff. Before running this command, I re-ran the notebook and saved it (because I had cleared the output earlier). The output is a bit verbose for such a small change, but it is broken up into several sections. As you can see, the cell ID, the text output of the cell, the image output of the cell, and the source of the cell have all changed. It also shows a new code cell at the end of the notebook. This is much easier for a person to understand!

nbdiff jupyter_git_example.ipynb
nbdiff tools/jupyter_git_example.ipynb (HEAD) tools/jupyter_git_example.ipynb
--- tools/jupyter_git_example.ipynb (HEAD)  (no timestamp)
+++ tools/jupyter_git_example.ipynb  2021-09-02 17:25:50.889444
## modified /cells/0/id:
-  28c01064
+  2f344fee

## modified /cells/0/outputs/0/data/text/plain:
-  [<matplotlib.lines.Line2D at 0x1187977c0>]
+  [<matplotlib.lines.Line2D at 0x10f789a60>]

## inserted before /cells/0/outputs/1:
+  output:
+    output_type: display_data
+    data:
+      image/png: iVBORw0K...<snip base64, md5=86ab879409f2b42a...>
+      text/plain: <Figure size 432x288 with 1 Axes>
+    metadata (unknown keys):
+      needs_background: light

## deleted /cells/0/outputs/1:
-  output:
-    output_type: display_data
-    data:
-      image/png: iVBORw0K...<snip base64, md5=d93fe1c011d9afe8...>
-      text/plain: <Figure size 432x288 with 1 Axes>
-    metadata (unknown keys):
-      needs_background: light

## modified /cells/0/source:
   import matplotlib.pyplot as plt
-  plt.plot([x**2 for x in range(100)])
+  plt.plot([x**3 for x in range(100)])

## inserted before /cells/1:
+  code cell:

Visual diffs

There’s also a web-based diff tool that lets you see the changes visually.

nbdiff-web jupyter_git_example.ipynb

This launches a visual diff tool in a new browser tab or window. For my notebook example above, it looks something like this.

[Figure: Visualizing differences using nbdiff-web]

Git integration

You can configure Git to use the nbdiff tool to compare notebook files.

From the root of your Git project, run the enable command.

nbdime config-git --enable

By default, this enables nbdime only for the current repository. You can also enable it globally (--global) or at the system level (--system), and you can turn it off again (--disable). Once enabled, git diff produces output in nbdiff format for notebook files.

Merging

Once you track notebook changes in version control, you will most likely run into a situation where you have to merge in changes from another version, either a branch you created or someone else’s changes delivered to you as another .ipynb file. If you try to merge with the default Git tools, or by hand-editing the JSON file itself, this can be an incredible hassle. In some merges, Git inserts merge conflict markers (that is, <<<<<<< and =======) in the middle of the JSON file, making it invalid. When that happens, the file cannot be loaded by any of the normal Jupyter notebook editing tools and has to be inspected and repaired by hand.
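To see why conflict markers are so damaging, here is a minimal sketch that simulates one in Python. The notebook is built in memory, the marker text is inserted by hand rather than by a real Git merge, and the branch name feature-branch and the helper is_valid_json are hypothetical.

```python
import json

# Serialize a tiny notebook the way Jupyter does, one JSON value per line.
valid_notebook = json.dumps({
    "cells": [{
        "cell_type": "code",
        "execution_count": None,
        "id": "28c01064",
        "metadata": {},
        "outputs": [],
        "source": [
            "import matplotlib.pyplot as plt\n",
            "plt.plot([x**3 for x in range(100)])",
        ],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}, indent=1)

# Hand-made imitation of what a line-based merge can leave behind.
conflicted = valid_notebook.replace(
    '"plt.plot([x**3 for x in range(100)])"',
    "<<<<<<< HEAD\n"
    '"plt.plot([x**3 for x in range(100)])"\n'
    "=======\n"
    '"plt.plot([x**4 for x in range(100)])"\n'
    ">>>>>>> feature-branch",
)


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(valid_notebook))  # the original file parses
print(is_valid_json(conflicted))      # the file with markers does not
```

The markers are not valid JSON tokens, so the whole file fails to parse even though only one line is actually in dispute.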

For a simple example of merging just two notebooks, I copied the notebook above, changed the source code in one place, and saved it as jupyter_git_example2.ipynb. The merge tool knows how to combine the two sets of changes into a valid notebook file that I can open in Jupyter.

nbmerge jupyter_git_example.ipynb jupyter_git_example2.ipynb  > merged.ipynb

If I open merged.ipynb in Jupyter, I see that the code cell contains the merged code. It has a conflict that I can easily resolve.

import matplotlib.pyplot as plt
<<<<<<< local
plt.plot([x**3 for x in range(100)])
=======
plt.plot([x**4 for x in range(100)])
>>>>>>> remote

The merge tool supports more options and can perform a three-way merge. It also has a visual tool similar to nbdiff-web. The merge tools are most useful when they are integrated with Git and used in Git workflows.

For example, you might set up a feature branch and do some work on a notebook there. Meanwhile, you might also end up making some changes on the main branch. At some point you need to merge the feature branch changes back into the main branch. nbdime’s merge tool knows how to do this in the context of a notebook. You can also merge visually.

The following annotated example shows how this situation typically arises, whether from one developer or several committing on multiple branches.

> git branch nbdime-example    # make a feature branch
> git checkout nbdime-example  # checkout branch
# work on this branch, make a notebook change, save the notebook
> git add jupyter_git_example.ipynb
> git commit -m "changes on feature branch"
> git checkout main            # switch back to main branch
# make a change to jupyter_git_example.ipynb, but in the same lines as on nbdime-example branch
> git add jupyter_git_example.ipynb
> git commit -m "changes on main branch"

At this point, we want to merge the two branches. With nbdime installed, we can use its merge tool to see these changes in context.

> git checkout main
> git merge nbdime-example
# get message about merge conflicts on command line
> git mergetool --tool=nbdime
# now we see the merges in context

You can check the documentation for more detailed configuration. You can also enable an nbdiff button in your notebook toolbar (in Jupyter Notebook or JupyterLab). If you don’t want to run diffs on the command line, this is probably the best option. Full instructions are in the nbdime documentation. Decide whether you want to install the extensions only in your virtual environment or for the entire system.

Some best practices

Here’s what has worked well for me when using Jupyter notebooks with version control. Larger teams may require stricter standards, but in general these guidelines have served me well.

Extract as much code as you can into Python files

Putting your code in a Python file makes it easier to access, test, and reuse. It’s also easier to diff and merge with a version control tool like Git. But for some code, it’s more convenient to keep it in a notebook, and it may be worth giving notebook authors that flexibility. So it doesn’t always make sense to force all Python code into separate Python files.

Before committing the notebook, restart it and re-run it, then save.

Notebook development can introduce errors because cells are sometimes run out of order or deleted after they have run. It’s a good idea to restart the kernel and re-run the notebook from scratch. This ensures that all the code is in the notebook and runs in the correct order. For notebooks with long run times, a full re-run may not be practical; in that case you are better off splitting the notebook into smaller notebooks. Choosing not to run the full notebook is asking for trouble. As an added benefit, the metadata for most cells (such as the execution count) does not change between commits of the same file.
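If you want a quick check that a notebook really was run top to bottom before you commit it, one possible sketch is to look at the execution counts. The rule used here (non-empty code cells numbered 1, 2, 3 and so on, in order) is my own assumption about what a clean run looks like, and executed_in_order is a hypothetical helper rather than an existing tool.

```python
def executed_in_order(notebook):
    """True if non-empty code cells have execution counts 1, 2, 3, ... top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
        and "".join(cell.get("source", [])).strip()  # skip blank cells
    ]
    return counts == list(range(1, len(counts) + 1))


def code_cell(source, execution_count):
    """Build a minimal code cell (hypothetical helper for this sketch)."""
    return {
        "cell_type": "code",
        "execution_count": execution_count,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


fresh_run = {"cells": [
    code_cell(["import matplotlib.pyplot as plt"], 1),
    code_cell(["plt.plot([x**3 for x in range(100)])"], 2),
    code_cell([], None),  # trailing blank cell is ignored
]}

stale_run = {"cells": [
    code_cell(["import matplotlib.pyplot as plt"], 3),
    code_cell(["plt.plot([x**3 for x in range(100)])"], 7),
]}

print(executed_in_order(fresh_run))  # counts are 1, 2
print(executed_in_order(stale_run))  # counts are 3, 7
```

A check like this could run as part of a pre-commit step, although it only detects stale execution order and says nothing about whether the code itself is correct.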

If a notebook is a template, clear the output and save it

If a notebook is a parameterized notebook for use with Papermill, I usually don’t keep the output when I save it. These notebooks are run as templates with many different inputs, so you can treat them as plain code.

Use nbdime to diff and merge Jupyter notebooks

Even if you don’t need the merge tools, just seeing a clean diff that shows the code changes can make a big difference. If you install nbdime in the virtual environment you use for notebook development, you will find the extra information and convenience very helpful when committing changes, recovering code, or sharing a notebook with a colleague.

While the notebook format is really helpful because it keeps the code together with the output, it can make changes painful to track. But with the right strategies and tools, we can still make good use of version control.

The post Version control for Jupyter notebooks appeared first on wrighters.io.