Design and Implementation of a Visual Analysis Framework for Machine Learning

Preface

Data visualization

Data visualization is essentially the transformation of data into visual encodings. It is well suited to data exploration, scientific insight, communication, and education. Visualization and statistics are distinct but related: visualization does not necessarily start from a defined problem, whereas statistics studies a specific one, and the two complement each other. Visualization attracts the audience’s attention through visual encoding and then conveys the data to the observer; with computers and other media, the data can also be explored and analyzed interactively. Good visual encoding takes full advantage of the preattentive processing that humans naturally have: position, color, shape, and similar channels are processed in parallel. However, there is almost always too much information to display properly in a static chart, so design is not only about deciding how to show something, but also about deciding what to show and what to leave out, based on what we think matters to our imagined readers. With computers, we can customize charts according to readers’ interests. When designing interaction, we follow the guideline for human-computer interaction proposed by Ben Shneiderman [1]: Overview First, Zoom and Filter, then Details on Demand. The overview is the initial form of the chart; its purpose is not to display everything but to provide a “macro” view of all the data. Zooming and filtering strip away displayed content to focus on the topics of interest. Details on demand let the reader extract exact values from the chart.
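
The three stages of this guideline can be sketched in a few lines of JavaScript. This is only an illustration: the records, their field names, and the three helper functions are hypothetical and are not part of the framework described in this article.

```javascript
// Hypothetical records; the field names are illustrative only.
const records = [
  { id: 'a', category: 'x', value: 3 },
  { id: 'b', category: 'x', value: 5 },
  { id: 'c', category: 'y', value: 7 },
];

// Overview first: summarize every record as a count per category.
function overview(rows) {
  const counts = {};
  for (const row of rows) {
    counts[row.category] = (counts[row.category] || 0) + 1;
  }
  return counts;
}

// Zoom and filter: keep only the rows the reader is interested in.
function zoomAndFilter(rows, predicate) {
  return rows.filter(predicate);
}

// Details on demand: return the exact record behind one mark, or null.
function detailsOnDemand(rows, id) {
  return rows.find((row) => row.id === id) || null;
}

const summary = overview(records);
const focused = zoomAndFilter(records, (row) => row.category === 'x');
const detail = detailsOnDemand(focused, 'b');
```

Each stage narrows the amount of data while increasing precision: the overview aggregates, the filter selects a subset, and the detail lookup returns one exact record.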

Machine learning visualization

The application of data visualization in machine learning is called machine learning visualization. According to the data used by different user groups at different stages, it can be roughly divided into four categories: training data, model performance, interpretability and model inspection, and high-dimensional data. Data visualization exposes patterns to our eyes. Visualization tools use machine learning to extract patterns for us, help us find deeper patterns, and give us new ways to navigate the data. The patterns extracted by machine learning form high-dimensional data in the form of feature vectors. Embedding Projector [2] is a tool for interactive visualization and analysis of high-dimensional data. It provides four dimensionality reduction methods that are very useful for visualizing high-dimensional data: UMAP [3], t-SNE [4], PCA, and custom linear projections [5].

UMAP is a dimensionality reduction algorithm based on manifold learning and topological data analysis. It provides a very general framework for manifold approximation and dimensionality reduction, and also offers a concrete implementation;
t-SNE can be used to explore local neighborhoods and find clusters;
PCA can often effectively reveal the internal structure of the embedding and the most influential dimensions in the data;
Custom linear projections help discover meaningful directions in a dataset.

Figure 1 Embedding Projector
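
To make the PCA item above more concrete, the following is a minimal sketch that finds the first principal component by power iteration on the covariance matrix and projects the data onto it. It is an assumption-laden illustration, not the algorithm used by the Embedding Projector: all function names are hypothetical, only the first component is computed, and the all-ones start vector would fail in the rare case where it is orthogonal to that component.

```javascript
// Center each column of a row-major matrix on its mean.
function center(rows) {
  const dims = rows[0].length;
  const mean = new Array(dims).fill(0);
  for (const row of rows) {
    row.forEach((v, j) => { mean[j] += v / rows.length; });
  }
  return rows.map((row) => row.map((v, j) => v - mean[j]));
}

// Sample covariance matrix of already-centered rows.
function covariance(centered) {
  const dims = centered[0].length;
  const cov = Array.from({ length: dims }, () => new Array(dims).fill(0));
  for (const row of centered) {
    for (let i = 0; i < dims; i++) {
      for (let j = 0; j < dims; j++) {
        cov[i][j] += (row[i] * row[j]) / (centered.length - 1);
      }
    }
  }
  return cov;
}

// Power iteration: repeatedly multiply by the covariance matrix and normalize.
function firstPrincipalComponent(rows, iterations = 100) {
  const cov = covariance(center(rows));
  let v = new Array(cov.length).fill(1);
  for (let k = 0; k < iterations; k++) {
    const next = cov.map((covRow) => covRow.reduce((s, c, j) => s + c * v[j], 0));
    const norm = Math.hypot(...next);
    if (norm === 0) break; // degenerate data with no variance
    v = next.map((x) => x / norm);
  }
  return v;
}

// Project each centered row onto the component to get a 1-D embedding.
function project(rows, component) {
  return center(rows).map((row) => row.reduce((s, x, j) => s + x * component[j], 0));
}

// Four points that lie exactly on the line y = 2x.
const points = [[0, 0], [1, 2], [2, 4], [3, 6]];
const pc1 = firstPrincipalComponent(points);
const coords = project(points, pc1);
```

Because the sample points lie on a single line, the first component points along that line and the one-dimensional coordinates preserve the order of the points.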

Problems and Challenges

In machine learning scenarios, we encounter a series of challenges. (1) The object of study is uncertain: it is impossible to design, in advance, a static chart that expresses everything clearly, and it is not necessary to do so. (2) The group’s ecosystem currently lacks support for high-dimensional data visualization. G2 [6] is a low-level visualization engine based on the grammar of graphics; it is data-driven and provides a graphics grammar and an interaction grammar that are easy to use and extend. G2 and its related ecosystem let users build a variety of interactive statistical charts with Canvas or SVG in a single statement, without worrying about tedious implementation details. As of G2 4.0, however, G2 and its ecosystem focus mainly on individual statistical charts. Machine learning scenarios care not only about the distribution of the feature space, but also about comparative analysis of multiple feature subspaces and the spatio-temporal distribution of features. For exploring the unknown, a high-level visualization grammar such as Vega-Lite [7] helps analyze data quickly and create a series of expressive visualizations, but Vega-Lite cannot take advantage of the group’s existing visualization capabilities. (3) Unit visualization theory has limitations. During visual encoding, each data item is represented by its own visual mark, which leads to performance problems when the amount of data is large; the necessary interaction support is also lacking.

Figure 2 Official G2 statistical chart examples

Framework design

To address these challenges, we combine visualization with human-computer interaction to handle the uncertainty of the object of study, and adopt multi-view visualization for the spatial visualization and subspace comparison of high-dimensional data. The result is a visual analysis framework whose high-level syntax intuitively expresses the visualization design space, and which supports multi-data, multi-attribute, multi-view visual analysis of massive high-dimensional data, covering comparative analysis scenarios such as time series and geographic space.

Data: a data table with multiple rows, where each row contains multiple columns (attributes). To simplify processing, the data generally uses a flat structure;
Containers: a geometric abstraction that includes the position and area in which a Group will be placed
- Bin: select an attribute of interest and perform a binning operation on it
- Layout: compute the position of visual elements or groups in a custom way
Groups: subsets of the data rows. Groups can be nested, that is, a Group can contain other Groups
Cells: a specific instance of a Container that is associated with one row in the data set
Units: the graphical representation of one row of data. Units have visual attributes such as color, shape, size (relative to the enclosing cell), and opacity
View: a specific visualization of a data table, which can be linked to other views of the same data
Pipeline: through this data visualization pipeline, users can perform data-level filtering and sorting; attribute selection for the bin operation; selection of the layout method; selection of the visual encoding; and interactions such as unit selection, tooltips, hover, and even linked analysis between views
Animation: animation of visual elements during the add, update, and delete phases

Figure 3 Machine learning visual analysis framework
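
The grouping operations behind these concepts can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the function names `getByPath`, `groupBy`, `bin`, and `flatten` are hypothetical, the fixed-width binning rule is an assumption, and the sample rows only mimic the shape of the business data shown later in the article.

```javascript
// Resolve a dotted key such as 'basic_info.id' against a nested row.
function getByPath(row, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), row);
}

// groupby: one group per distinct value of the key.
function groupBy(rows, key) {
  const groups = new Map();
  for (const row of rows) {
    const value = getByPath(row, key);
    if (!groups.has(value)) groups.set(value, []);
    groups.get(value).push(row);
  }
  return groups;
}

// bin: one group per fixed-width numeric interval, keyed by the interval start.
function bin(rows, key, step) {
  const groups = new Map();
  for (const row of rows) {
    const start = Math.floor(getByPath(row, key) / step) * step;
    if (!groups.has(start)) groups.set(start, []);
    groups.get(start).push(row);
  }
  return groups;
}

// flatten: one unit per attribute of the object stored under the key.
function flatten(row, key) {
  return Object.entries(getByPath(row, key)).map(([name, value]) => ({ name, value }));
}

// Hypothetical rows shaped like the business data in this article.
const sampleRows = [
  { basic_info: { id: '1' }, feats: { feature_1: 0, feature_11: 104 } },
  { basic_info: { id: '1' }, feats: { feature_1: 1, feature_11: 98 } },
  { basic_info: { id: '2' }, feats: { feature_1: 1, feature_11: 12 } },
];

const byId = groupBy(sampleRows, 'basic_info.id');
const byFeature11 = bin(sampleRows, 'feats.feature_11', 50);
const unitsOfFirstRow = flatten(sampleRows[0], 'feats');
```

Each operation turns one set of rows into several subsets, which is what allows a Container to be divided into nested Groups and, at the last level, into Units.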

Key steps

Raw data

To better explain the framework, we analyze specific business data. The data is provided as an array of entities, each of which includes basic information (basic_info), selection, features (feats), and details. The visualization scheme, settled on after many discussions, uses multi-view visualization to support vertical comparison of the feature data of different entities, with feature data arranged in descending order of time.

// Business data
[
  // An entity description
  {
    // Basic information
    basic_info: {
      "id": "1", // group id
    },
    selection: { ... },
    // Entity feature space
    feats: {
      "feature_1": 0, // bool type
      "feature_2": 1,
      "feature_3": 1,
      "feature_4": 0,
      "feature_5": 1,
      "feature_6": 0,
      "feature_7": 1,
      "feature_8": 0,
      "feature_9": 0,
      "feature_10": 1,
      "feature_11": 104, // number type
      "feature_12": 104
    },
    details: { ... }
  },
  ...
] // end

Advanced syntax configuration

{
  width: 600,
  height: 200,
  margin: {
    top: 10,
    right: 30,
    bottom: 30,
    left: 100,
  },
  autoFit: true, // If set to false, you need to manually set width and height
  layouts: [
    {
      name: 'layout1',
      type: 'gridxy',
      aspect_ratio: 'fillY', // Layout mode
      align: 'TB', // From the top down
      subgroup: { // subspace
        type: 'groupby', // bin ｜ groupby ｜ flatten
        key: 'basic_info.id', // Cluster by ID
      },
      size: {
        type: 'count', // Count the number of elements in the subspace
      },
      sort: null, // The sort of subspace elements
      padding: { // layout padding
        top: 0,
        left: 0,
        bottom: 0,
        right: 0,
      },
      margin: { // layout margin
        top: 0,
        left: 0,
        bottom: 0,
        right: 0,
      },
      box: { // box style
        fill: 'white',
        stroke: 'gray',
        'stroke-width': 1,
        opacity: 0.5,
      },
    }, {
      name: 'layout2',
      type: 'gridxy',
      subgroup: {
        type: 'flatten', // Tiled layout
        key: 'feats', // features
      },
      aspect_ratio: 'fillX',
      size: {
        type: 'count',
      },
      align: 'LR', // From left to right
      interactions: [], // interaction
      padding: { // padding
        top: 5,
        left: 5,
        bottom: 5,
        right: 5,
      },
      margin: { // margin
        top: 5,
        left: 5,
        bottom: 5,
        right: 5,
      },
      box: { // box style
        fill: 'white',
        stroke: 'gray',
        'stroke-width': 1,
        opacity: 0.5,
      },
    },
  ],
  mark: {
    shape: 'rect', // Cell shape
    isColorScaleShared: true,
    size: { // Depending on the cell shape
      type: 'uniform', // The size is uniform
      width: 20, // rect width
      height: 20, // rect height
      rx: 2, // rx
      ry: 2, // ry
    },
  },
  filters: [ // Filter field cross processing
    'feature_1',
    'feature_2',
    'feature_3',
    'feature_4',
    'feature_5',
    'feature_6',
    'feature_7',
    'feature_8',
    'feature_9',
    'feature_10',
    'feature_11',
    'feature_12',
  ],
  chart: undefined, // Use custom chart without specifying the chart name
};

As shown in the figure below, we start from the root container, which contains all the data. The first layout arranges sub-containers from top to bottom, and the size of each sub-container is determined by the number and length of its elements. The second layout arranges elements from left to right, following the order of the feature attributes in the entity data. Finally, the multi-view order visualization is drawn.

Figure 4 A multi-view visualization of an order generated by advanced syntax
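
The way a container is divided by these two layouts can be sketched as follows. This is a simplified illustration under assumptions: the `splitBox` function is hypothetical, it shares space strictly in proportion to element counts, and it ignores the `padding`, `margin`, and `aspect_ratio` options, whose exact semantics in the framework are not described here.

```javascript
// Split a parent box into child boxes along one axis.
// direction: 'TB' stacks children top to bottom, 'LR' places them left to right.
// weights: one non-negative number per child, for example its element count.
function splitBox(parent, weights, direction) {
  const total = weights.reduce((sum, w) => sum + w, 0);
  let offset = 0;
  return weights.map((w) => {
    const share = total === 0 ? 0 : w / total;
    let box;
    if (direction === 'TB') {
      box = { x: parent.x, y: parent.y + offset, width: parent.width, height: parent.height * share };
      offset += box.height;
    } else {
      box = { x: parent.x + offset, y: parent.y, width: parent.width * share, height: parent.height };
      offset += box.width;
    }
    return box;
  });
}

// Root container sized like the configuration above (width 600, height 200).
const root = { x: 0, y: 0, width: 600, height: 200 };

// First layout: two hypothetical entities with 3 and 1 elements, top to bottom.
const entityBoxes = splitBox(root, [3, 1], 'TB');

// Second layout: twelve features of equal weight, left to right.
const featureBoxes = splitBox(entityBoxes[0], new Array(12).fill(1), 'LR');
```

Applying the second split inside each box produced by the first split yields the grid of cells seen in the figure.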

Advanced syntax parsing

To generate the target visualization, the syntax builds a root container and applies unit visualization operations recursively until all containers are units. In other words, rendering becomes a tree traversal in which the root container is the root node and the unit containers are the leaf nodes. Once all units have been generated, the layout is complete and the unit visualization can be drawn. Before parsing the syntax, we build the RootContainer, which holds the raw data, the predecessor node, a label, the visual space (width, height, padding, and position), and other information. The layouts are parsed into a hierarchically nested structure, and the data is then passed from the RootContainer down to the ChilrenContainer of each level, that is, its child containers.

Figure 5 Case data results
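
The recursive construction described in this section can be sketched as follows. This is a minimal illustration, assuming one tree level per layout; the names `buildContainer`, `collectUnits`, and the `split` method on each layout are hypothetical and do not come from the framework, and visual space is omitted for brevity.

```javascript
// Recursively build a container tree: one tree level per layout.
// `layouts[i].split(rows)` is a hypothetical function returning an array of row subsets.
function buildContainer(rows, layouts, depth = 0, parent = null, label = 'root') {
  const isUnit = depth === layouts.length; // no layout left: this container is a unit
  const container = { label, parent, rows, isUnit, children: [] };
  if (!isUnit) {
    container.children = layouts[depth]
      .split(rows)
      .map((subset, i) => buildContainer(subset, layouts, depth + 1, container, `${label}/${i}`));
  }
  return container;
}

// Depth-first traversal that collects the leaf (unit) containers in order.
function collectUnits(container, units = []) {
  if (container.isUnit) units.push(container);
  else container.children.forEach((child) => collectUnits(child, units));
  return units;
}

const treeLayouts = [
  { // first level: group rows by entity id
    split(rows) {
      const groups = new Map();
      for (const row of rows) {
        const id = row.basic_info.id;
        if (!groups.has(id)) groups.set(id, []);
        groups.get(id).push(row);
      }
      return [...groups.values()];
    },
  },
  { // second level: flatten, one subset per feature of each row
    split(rows) {
      return rows.flatMap((row) =>
        Object.entries(row.feats).map(([name, value]) => [{ id: row.basic_info.id, name, value }]));
    },
  },
];

// Hypothetical rows shaped like the business data in this article.
const treeRows = [
  { basic_info: { id: '1' }, feats: { feature_1: 0, feature_2: 1 } },
  { basic_info: { id: '2' }, feats: { feature_1: 1, feature_2: 1 } },
  { basic_info: { id: '1' }, feats: { feature_1: 1, feature_2: 0 } },
];

const tree = buildContainer(treeRows, treeLayouts);
const treeUnits = collectUnits(tree);
```

Each container keeps a reference to its parent, so a unit can be traced back to the root, which is useful when an interaction on a unit needs the data of its enclosing group.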

Application cases

Single-user order heat map

Figure 6 Order data containing only one user

Multi-user order heat map

Figure 7 Comparison of multiple users

Interaction

Click, mouseover, mouseout, and other interactions are supported: click retrieves all information for an order, while mouseover and mouseout highlight and unhighlight the order currently in focus.

Figure 8 Mouse interaction
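
The behavior of these three events can be sketched without any rendering code. This is an illustration only: `createInteraction`, its `handle` and `isHighlighted` methods, and the unit and order fields are hypothetical names, and a real implementation would be driven by Canvas or SVG events rather than direct calls.

```javascript
// Create a small interaction controller for a list of units (one per order).
// onSelect is called with the order behind the clicked unit.
function createInteraction(units, onSelect) {
  let highlightedId = null;
  return {
    handle(eventType, unitId) {
      const unit = units.find((u) => u.id === unitId);
      if (!unit) return; // ignore events on unknown units
      switch (eventType) {
        case 'mouseover':
          highlightedId = unitId; // highlight the unit in focus
          break;
        case 'mouseout':
          if (highlightedId === unitId) highlightedId = null; // remove the highlight
          break;
        case 'click':
          onSelect(unit.order); // hand over all order information
          break;
        default:
          break;
      }
    },
    isHighlighted(unitId) {
      return highlightedId === unitId;
    },
  };
}

// Hypothetical units, each linked to the order it represents.
const orderUnits = [
  { id: 'u1', order: { orderId: 'o1', feats: { feature_1: 0 } } },
  { id: 'u2', order: { orderId: 'o2', feats: { feature_1: 1 } } },
];

const selectedOrders = [];
const interaction = createInteraction(orderUnits, (order) => selectedOrders.push(order));

interaction.handle('mouseover', 'u1');
const highlightedAfterOver = interaction.isHighlighted('u1');
interaction.handle('click', 'u1');
interaction.handle('mouseout', 'u1');
const highlightedAfterOut = interaction.isHighlighted('u1');
```

Keeping the highlight state separate from drawing makes it possible to share the same state between linked views.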

In addition, feature filtering is supported. If users focus only on feature 1 and feature 2, the effect is as shown in the following figure.

Figure 9 Comparison effect of field filtering
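
Feature filtering of this kind can be sketched as a projection of each entity's `feats` object onto the names listed in `filters`. The `filterFeatures` function is a hypothetical helper written for this illustration, and the sample entities are made up; only the `feats` and `filters` names come from the configuration above.

```javascript
// Keep only the features whose names appear in `filters`; other fields are untouched.
function filterFeatures(entities, filters) {
  return entities.map((entity) => ({
    ...entity,
    feats: Object.fromEntries(
      Object.entries(entity.feats).filter(([name]) => filters.includes(name)),
    ),
  }));
}

// Hypothetical entities shaped like the business data in this article.
const entities = [
  { basic_info: { id: '1' }, feats: { feature_1: 0, feature_2: 1, feature_3: 1 } },
  { basic_info: { id: '2' }, feats: { feature_1: 1, feature_2: 0, feature_3: 0 } },
];

const filtered = filterFeatures(entities, ['feature_1', 'feature_2']);
```

The function returns new objects instead of mutating the input, so the unfiltered data remains available when the reader changes the filter.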

References

[1] The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations

[2] projector.tensorflow.org/

[3] umap-learn.readthedocs.io/en/latest/h…

[4] distill.pub/2016/misrea…

[5] Visualization of Labeled Data Using Linear Transformations

[6] g2.antv.vision/useful/docs/man…

[7] vega.github.io/vega-lite/

Author: ES2049 / Miss Li

The article can be reproduced at will, but please keep this link to the original text.

You are welcome to join ES2049 Studio. Please send your resume to caijun.hcj@alibaba-inc.com