Notebooks are a data scientist’s best friend, and they can also be a workplace nightmare. For anyone used to a modern integrated development environment (IDE), working in a notebook feels like stepping back decades. In addition, today’s notebook environments are mostly limited to Python and lack first-class support for other programming languages.

But a few days ago, Netflix open-sourced Polynote, a new notebook environment that addresses some of those challenges.

Polynote was created in response to a need to speed up Netflix’s data science experiments. Over the years, Netflix has built a world-class machine learning platform based on JVM languages like Scala. Mainstream technologies such as Jupyter notebooks offer only rudimentary support for these languages, so a better solution was needed. Polynote emerged to meet that need, and it also draws on the lessons Netflix learned from building one of the most ambitious notebook infrastructures in data science.

Inside Netflix’s notebook push

Over the past few years, Netflix has transformed the data science notebook from an experimental artifact into a critical part of the life cycle of its machine learning solutions. Initially, Netflix used Jupyter notebooks as a data exploration and analysis tool. However, the engineering team quickly realized that Jupyter had distinct advantages in terms of runtime abstraction, extensibility, interpretability of code, and debugging that, when used properly, could have a significant impact on data science workloads. To expand the use of Jupyter as a data science runtime, the Netflix team needed to address some major challenges:

  • Code-output mismatch: Notebooks are updated frequently, and in many cases the output shown in the environment no longer corresponds to the current code.
  • Server requirements: Notebooks usually need a notebook server to run, which presents architectural challenges when notebooks are used at scale.
  • Scheduling: Most data science models need to be executed on a regular basis, but the tools for scheduling notebooks are still fairly limited.
  • Parameterization: A notebook is a fairly static code environment, and passing input parameters to it is not easy.
  • Integration testing: A notebook is an isolated code environment that is notoriously difficult to integrate with other notebooks. As a result, tasks such as integration testing can be a nightmare.

To address these challenges, Netflix built an ambitious architecture for operationalizing notebooks. The initial implementation included technologies such as Papermill, which parameterizes notebooks.
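To make the idea of parameterization concrete, here is a minimal sketch of injecting parameters into a notebook. It is not Papermill’s code; it only imitates the general approach of adding a cell of overrides after a cell tagged "parameters". The notebook contents, the variable names (region, top_n, summary), and the helper inject_parameters are all hypothetical, and the notebook is modeled as a plain dictionary in the Jupyter JSON layout.

```python
import copy
import json


def inject_parameters(notebook, parameters):
    """Return a copy of `notebook` with a new code cell that assigns `parameters`.

    The new cell goes right after the first cell tagged "parameters",
    or at the top of the notebook if no such cell exists.
    """
    result = copy.deepcopy(notebook)
    cells = result["cells"]

    # One assignment per parameter. json.dumps yields valid Python literals
    # for strings and numbers (this sketch does not handle None/True/False).
    source = "\n".join(
        f"{name} = {json.dumps(value)}" for name, value in parameters.items()
    )
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }

    # Locate the cell that holds the defaults, identified by its tag.
    position = 0
    for index, cell in enumerate(cells):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break

    cells.insert(position, injected)
    return result


# Hypothetical notebook: a defaults cell followed by a cell that uses them.
notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "region = \"us\"\ntop_n = 10",
         "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {},
         "source": "summary = f\"{region}:{top_n}\"",
         "outputs": [], "execution_count": None},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

parameterized = inject_parameters(notebook, {"region": "eu", "top_n": 25})

# Run the code cells in order in one namespace to see the override take effect.
namespace = {}
for cell in parameterized["cells"]:
    exec(cell["source"], namespace)
print(namespace["summary"])
```

Because the injected cell runs after the defaults, the same notebook can be reused with different inputs without editing its code, which is what makes scheduled or batch execution practical.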

Netflix’s original notebook architecture was ambitious but limited to Python programs. It has now been extended further.

Enter Polynote

Polynote is a multi-language notebook environment for experimentation. In addition to Python, the current version supports SQL, visualizations (Vega), and of course Scala. The platform also integrates with data science infrastructure such as Apache Spark. The core of Polynote includes the following features:

A) Improved editing experience: Polynote attempts to bring the editing experience closer to a modern IDE.

B) Multi-language support: Polynote provides first-class support for Scala and other languages in data science environments.

C) Data visualization improvements: Polynote builds data visualization natively into the notebook, so datasets can be explored without writing a lot of code.

D) Configuration and dependency management: Languages such as Scala require complex package dependencies in their programs. Polynote saves the package dependency configuration in the notebook itself to address some common problems encountered by JVM developers in this area.

E) Repeatability: Combining code, data, and execution results in a single document is what makes notebooks powerful, but it also makes them hard to reproduce. Polynote treats reproducibility as a first-class feature of the framework.

A) Improved editing experience

Polynote improves the experience for data scientists and researchers by including common features found in IDEs, such as code completion and syntax error highlighting. Additional editing capabilities are provided by the Monaco editor, which also powers Visual Studio Code.

B) Multi-language support

Polynote not only supports multiple languages, it also lets them be combined in a single program. In Polynote, each cell can be written in a different language. When a cell runs, the kernel makes the available typed input values accessible to that cell’s language interpreter, which in turn returns the resulting typed output values to the kernel. This allows all the cells in a Polynote notebook to run in the same context. For example, a Python library can be used to compute an isotonic regression on a dataset generated in Scala.
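As a rough illustration of what the Python side of such a notebook might do, the sketch below fits an isotonic (non-decreasing) regression. Two assumptions are worth flagging: the list ys is only a local stand-in for values that a Scala cell would have produced, and the fit uses a small hand-written pool-adjacent-violators routine instead of a Python library, purely to keep the example self-contained.

```python
def isotonic_regression(values):
    """Least-squares non-decreasing fit to `values` (pool adjacent violators)."""
    # Each block stores [sum of its values, number of values].
    blocks = []
    for value in values:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Expand each block back to one fitted value per original point.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Stand-in for a dataset that an earlier Scala cell would have generated.
ys = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
fitted = isotonic_regression(ys)
print(fitted)  # [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
```

In a polyglot notebook the interesting part is not the regression itself but that the input comes from a cell written in another language, with the kernel handing the values across.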

C) Data visualization improvements

Data visualization is a common component of most notebook environments. Polynote takes it a step further by building visualization into the platform as a native component, so users can explore datasets intuitively without writing any code.

D) Configuration and dependency management

Most data scientists who work in notebooks are used to the convenience of handling dependencies through Python’s package management model. In a JVM language like Scala, however, dependency management can be a nightmare. Polynote addresses this by storing configuration and dependencies directly in the notebook instead of relying on external files. In addition, Polynote provides a user-friendly “Configuration” section where users can set the dependencies for each notebook.

E) Repeatability

With Polynote, Netflix introduced a new code interpretation model that does not rely on the REPL model used by traditional notebooks. A key feature of the new model is the removal of hidden state, which lets data scientists copy cells within a notebook without carrying over any state from their previous positions.
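The difference from a REPL can be shown with a toy model. The sketch below is not Polynote’s implementation; it assumes a simplified rule in which a cell only sees the values defined by the cells positioned above it. The cell contents and the helpers input_state and run_cell are hypothetical.

```python
def input_state(cells, index):
    """Build the namespace visible to cells[index] from the cells above it only."""
    state = {}
    for cell in cells[:index]:
        state.update(cell["outputs"])
    return state


def run_cell(cells, index):
    """Run one cell against its positional input state and record what it defines."""
    namespace = input_state(cells, index)
    before = dict(namespace)
    exec(cells[index]["source"], namespace)
    # Keep only the names this cell created or rebound.
    cells[index]["outputs"] = {
        name: value
        for name, value in namespace.items()
        if name != "__builtins__"
        and (name not in before or before[name] is not value)
    }


# Hypothetical three-cell notebook.
cells = [
    {"source": "x = 2", "outputs": {}},
    {"source": "y = x * 10", "outputs": {}},
    {"source": "z = y + 1", "outputs": {}},
]
for i in range(len(cells)):
    run_cell(cells, i)
z_value = cells[2]["outputs"]["z"]

# Delete the cell that defines y, then re-run the cell that depends on it.
del cells[1]
try:
    run_cell(cells, 1)
    stale_state_used = True
except NameError:
    stale_state_used = False

print(z_value, stale_state_used)  # 21 False
```

In a REPL-style kernel, y would still be sitting in the global state after its cell was deleted, so the last cell would keep working even though the notebook on disk could no longer reproduce the result. Tying state to cell position makes that kind of failure visible immediately.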

Polynote is a new entrant in the competitive field of data science notebooks, but it has real merits. Its support for JVM-based languages makes it a natural choice for developers working on Spark infrastructure. Likewise, the editing experience and the repeatability features are clear improvements over the traditional notebook environment.

Polynote is available on GitHub, and updates can be followed on the project’s website.
