Preface
Data visualization
Data visualization is essentially the transformation of data into visual encodings. It is well suited to data exploration, scientific insight, communication, and education. Visualization and statistics are distinct but related: visualization does not necessarily start from a defined problem, whereas statistics studies a specific one, and the two complement each other. Visualization attracts the audience's attention through visual encoding and then conveys the data to the observer; with computers and other media, the data can also be explored and analyzed interactively. Good visual encoding takes full advantage of the pre-attentive processing that humans naturally have: position, color, shape, and so on are processed in parallel. However, there is almost always too much information to display properly in a static chart, so design is not only about deciding how to show something, but also about deciding what to show and what to leave out, based on what we think matters to our imagined readers. With computers, we can tailor the graphics to each reader's interests.
When designing interaction, we follow the guideline for human-computer interaction proposed by Ben Shneiderman [1]: overview first, zoom and filter, then details on demand. The overview is the initial form of the chart; its purpose is not to display everything but to provide a "macro" view of all the data. Zooming and filtering strip away displayed content to focus on the topics of interest. Details on demand let the reader extract exact values from the chart.
Machine learning visualization
The application of data visualization to machine learning is called machine learning visualization. According to the data used by different user groups at different stages, it can be roughly divided into four categories: training data, model performance, interpretability and model inspection, and high-dimensional data. Data visualization exposes patterns to our eyes. Visualization tools use machine learning to extract patterns for us, help us find deeper patterns, and give us new ways to navigate the data. The patterns extracted by machine learning take the form of feature vectors, which are high-dimensional data. The Embedding Projector [2] is a tool for interactive visualization and analysis of high-dimensional data. It provides four dimensionality reduction methods that are very useful for visualizing such data: UMAP [3], t-SNE [4], PCA, and custom linear projections [5].
- UMAP is a dimensionality reduction algorithm based on manifold learning techniques and topological data analysis. It provides a very general framework for approaching manifold learning and dimensionality reduction, and also provides a concrete implementation;
- t-SNE can be used to explore local neighborhoods and find clusters;
- PCA can often effectively explore the internal structure of an embedding and reveal the most influential dimensions in the data;
- Custom linear projections help discover meaningful directions in a dataset;
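To make the PCA item above more concrete, here is a minimal sketch that projects points onto their first principal component using power iteration. It is not how the Embedding Projector implements PCA; every function name below is hypothetical, and the sample points are made up for illustration.

```javascript
// Minimal PCA sketch: project points onto their first principal component.
// All names here are illustrative and not part of any library.

// Subtract the column mean from every column of a row-major data matrix.
function center(rows) {
  const n = rows.length;
  const d = rows[0].length;
  const mean = new Array(d).fill(0);
  for (const r of rows) r.forEach((v, j) => { mean[j] += v / n; });
  return rows.map((r) => r.map((v, j) => v - mean[j]));
}

// Sample covariance matrix (d x d) of already-centered rows.
function covariance(centered) {
  const n = centered.length;
  const d = centered[0].length;
  const cov = Array.from({ length: d }, () => new Array(d).fill(0));
  for (const r of centered) {
    for (let i = 0; i < d; i++) {
      for (let j = 0; j < d; j++) cov[i][j] += (r[i] * r[j]) / (n - 1);
    }
  }
  return cov;
}

// Power iteration: repeatedly multiply by the covariance matrix and normalize,
// which converges to the eigenvector with the largest eigenvalue.
function firstComponent(cov, iterations = 200) {
  let v = cov.map((_, i) => 1 / (i + 1)); // deterministic non-zero start vector
  for (let k = 0; k < iterations; k++) {
    const next = cov.map((row) => row.reduce((s, c, j) => s + c * v[j], 0));
    const norm = Math.hypot(...next);
    if (norm === 0) break; // data has no variance
    v = next.map((x) => x / norm);
  }
  return v;
}

// Return the principal axis and the 1-D coordinate of every point along it.
function projectOntoFirstComponent(rows) {
  const centered = center(rows);
  const axis = firstComponent(covariance(centered));
  const scores = centered.map((r) => r.reduce((s, x, j) => s + x * axis[j], 0));
  return { axis, scores };
}

// Made-up sample: four 3-D points that all lie on the line y = 2x, z = 0.
const points = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]];
const { axis, scores } = projectOntoFirstComponent(points);
console.log(axis, scores);
```

Because the sample points lie on a single line, the recovered axis points along that line and the scores are simply the positions of the points on it. Real embeddings need more components and a more robust eigen-solver.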
Problems and Challenges
In machine learning application scenarios, we encounter a series of challenges.
(1) The object of study is uncertain. It is impossible to design in advance a static chart that expresses everything clearly, and of course it is not necessary to do so.
(2) The group's visualization ecosystem currently lacks support for high-dimensional data visualization. G2 [6] is a low-level visualization engine based on the grammar of graphics; it is data-driven and provides a graphics grammar and an interaction grammar that are easy to use and highly extensible. G2 and its ecosystem let users build a variety of interactive statistical charts with Canvas or SVG in a single statement, without worrying about tedious chart implementation details. As of G2 4.0, however, G2 and its ecosystem focus mainly on individual statistical charts, whereas machine learning scenarios are concerned not only with the distribution of the feature space but also with comparative analysis of multiple feature subspaces and with the distribution of features over space and time. For exploring the unknown, a high-level visualization grammar designed for exploration, such as Vega-Lite [7], helps analyze data rapidly and create a series of expressive visualizations, but Vega-Lite cannot take advantage of the group's existing visualization capabilities.
(3) Unit visualization theory has limitations. Because every data item is drawn as its own visual mark during visual encoding, large data volumes lead to performance problems. Necessary interaction support is also lacking.
Framework design
To address these challenges, we combine visualization with human-computer interaction to deal with the uncertainty of the object of study, and we adopt multi-view visualization for spatial visualization and subspace comparison of high-dimensional data. The result is a visual analysis framework that expresses the visualization design space intuitively through a high-level syntax and supports multi-data, multi-attribute, multi-view visual analysis of massive high-dimensional data, covering comparative analysis scenarios such as time series and geographic space.
- Data: a data table with multiple rows, each row containing multiple columns (attributes). To make the data easier to process, it generally uses a flat structure;
- Containers: a geometric abstraction that defines the position and area in which a Group will be placed;
- Bin: selects an attribute of interest and bins the data on that attribute;
- Layout: computes the position information of visual elements or groups, and can be customized;
- Groups: subsets of the rows of data. Groups can be nested, so a Group may contain other Groups;
- Cells: a specific Container instance associated with one row of the data set;
- Units: the graphical representation of one row of data. Units have visual attributes such as color, shape, size (relative to the enclosing cell), and opacity;
- View: a specific visualization of a data table. It can be linked to other views of the same data;
- Pipeline: through this data visualization pipeline, the framework supports data-level filtering and sorting; attribute selection for the bin operation; choice of layout method; choice of visual encoding; and interactions such as unit selection, tooltips, hover, and linked analysis between views;
- Animation: animates visual elements as they are added, updated, and deleted.
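The Bin and Groups concepts above, together with a flatten operation, can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's actual code: the helper names `getPath`, `groupBy`, `bin`, and `flatten` are hypothetical, equal-width binning is an assumed strategy, and the sample rows are invented.

```javascript
// Hypothetical helpers illustrating three ways to split rows into groups.

// Read a nested attribute such as "basic_info.id" from a row.
function getPath(row, path) {
  return path.split(".").reduce((obj, key) => (obj == null ? undefined : obj[key]), row);
}

// groupby: one group per distinct value of the attribute.
function groupBy(rows, key) {
  const groups = new Map();
  for (const row of rows) {
    const value = getPath(row, key);
    if (!groups.has(value)) groups.set(value, []);
    groups.get(value).push(row);
  }
  return groups;
}

// bin: one group per equal-width interval of a numeric attribute (assumed strategy).
function bin(rows, key, binCount) {
  const values = rows.map((row) => getPath(row, key));
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / binCount || 1; // fall back to 1 when all values are equal
  const bins = Array.from({ length: binCount }, () => []);
  rows.forEach((row, i) => {
    // Clamp so that the maximum value falls into the last bin.
    const index = Math.min(binCount - 1, Math.floor((values[i] - min) / width));
    bins[index].push(row);
  });
  return bins;
}

// flatten: one entry per attribute of an object-valued field of a single row.
function flatten(row, key) {
  return Object.entries(getPath(row, key)).map(([name, value]) => ({ name, value, source: row }));
}

// Invented sample rows shaped like the business data in this article.
const rows = [
  { basic_info: { id: "1" }, feats: { feature_1: 0, feature_11: 104 } },
  { basic_info: { id: "1" }, feats: { feature_1: 1, feature_11: 20 } },
  { basic_info: { id: "2" }, feats: { feature_1: 1, feature_11: 60 } },
];

const byId = groupBy(rows, "basic_info.id");
const binned = bin(rows, "feats.feature_11", 2);
const units = flatten(rows[0], "feats");
console.log(byId.size, binned.map((b) => b.length), units.map((u) => u.name));
```

Keeping each operation as a plain function over rows makes it easy to chain them level by level, which is how nested Groups arise.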
Key steps
Original data
To better explain the framework, we analyze a specific set of business data. It is provided as an array of entities, each containing basic information (basic_info), selection, features (feats), and details. The visualization scheme settled on after many discussions uses multi-view visualization to support vertical comparison of the feature data of different entities, with feature data arranged in descending order of time.
// Business data
[
  // One entity description
  {
    // Basic information
    basic_info: {
      "id": "1", // group id
    },
    selection: { ... },
    // Entity feature space
    feats: {
      "feature_1": 0, // bool type
      "feature_2": 1,
      "feature_3": 1,
      "feature_4": 0,
      "feature_5": 1,
      "feature_6": 0,
      "feature_7": 1,
      "feature_8": 0,
      "feature_9": 0,
      "feature_10": 1,
      "feature_11": 104, // number type
      "feature_12": 104
    },
    details: { ... }
  },
  ...
]
// end
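Since the framework prefers a flat data structure and the configuration addresses nested fields with dotted keys such as 'basic_info.id', one way to prepare a nested entity is to flatten it into dotted keys. The sketch below is only an assumption about how such preprocessing could look; `flattenRecord` is a hypothetical helper and the sample entity is abbreviated.

```javascript
// Hypothetical helper: flatten a nested entity into a single-level record
// whose keys are dotted paths, e.g. { "basic_info.id": "1" }.
function flattenRecord(record, prefix = "") {
  const flat = {};
  for (const [key, value] of Object.entries(record)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Recurse into plain nested objects.
      Object.assign(flat, flattenRecord(value, path));
    } else {
      // Primitives and arrays are kept as leaf values.
      flat[path] = value;
    }
  }
  return flat;
}

// Abbreviated, invented entity in the shape of the business data.
const entity = {
  basic_info: { id: "1" },
  feats: { feature_1: 0, feature_11: 104 },
};

const flatRow = flattenRecord(entity);
console.log(flatRow);
```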
Advanced syntax configuration
{
width: 600,
height: 200,
margin: {
top: 10,
right: 30,
bottom: 30,
left: 100,
},
autoFit: true, // If set to false, width and height must be set manually
layouts: [
{
name: 'layout1',
type: 'gridxy',
aspect_ratio: 'fillY', // Layout mode
align: 'TB', // From the top down
subgroup: { // subspace
type: 'groupby', // bin | groupby | flatten
key: 'basic_info.id', // Cluster by ID
},
size: {
type: 'count', // Count the number of elements in the subspace
},
sort: null, // Sort order of the subspace elements
padding: { // layout padding
top: 0,
left: 0,
bottom: 0,
right: 0,
},
margin: { // layout margin
top: 0,
left: 0,
bottom: 0,
right: 0,
},
box: { // box style
fill: 'white',
stroke: 'gray',
'stroke-width': 1,
opacity: 0.5,
},
}, {
name: 'layout2',
type: 'gridxy',
subgroup: {
type: 'flatten', // Tiled layout
key: 'feats', // features
},
aspect_ratio: 'fillX',
size: {
type: 'count',
},
align: 'LR', // From left to right
interactions: [], // interaction
padding: { // padding
top: 5,
left: 5,
bottom: 5,
right: 5,
},
margin: { // margin
top: 5,
left: 5,
bottom: 5,
right: 5,
},
box: { // box style
fill: 'white',
stroke: 'gray',
'stroke-width': 1,
opacity: 0.5,
},
},
],
mark: {
shape: 'rect', // Cell shape
isColorScaleShared: true,
size: { // Depending on the cell shape
type: 'uniform', // The size is uniform
width: 20, // rect width
height: 20, // rect height
rx: 2, // horizontal corner radius
ry: 2, // vertical corner radius
},
},
filters: [ // Filter field cross processing
'feature_1',
'feature_2',
'feature_3',
'feature_4',
'feature_5',
'feature_6',
'feature_7',
'feature_8',
'feature_9',
'feature_10',
'feature_11',
'feature_12',
],
chart: undefined, // Use custom chart without specifying the chart name
};
As the following figure shows, we start with the root container, which contains all the data. The first layout runs from top to bottom, and the size of each sub-container is determined by the number and length of its elements. The second layout runs from left to right, following the order of the feature attributes in the entity data. Finally, the multi-view order visualization is drawn.
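The two layout passes described above can be approximated with a small sketch that splits a parent rectangle among its groups. This is a simplification under assumptions: `splitContainer` is a hypothetical function, each child's share is taken to be proportional to its element count, and padding, margin, and aspect_ratio handling are left out.

```javascript
// Hypothetical sketch: split a parent rectangle into child rectangles, one per group.
// direction "TB" stacks children top to bottom; "LR" places them left to right.
// Assumption: each child's share of the parent is proportional to its element count.
function splitContainer(parent, groups, direction) {
  const total = groups.reduce((sum, g) => sum + g.items.length, 0);
  let offset = 0;
  return groups.map((g) => {
    const share = g.items.length / total;
    const child = direction === "TB"
      ? { x: parent.x, y: parent.y + offset, width: parent.width, height: parent.height * share }
      : { x: parent.x + offset, y: parent.y, width: parent.width * share, height: parent.height };
    // Advance along the axis that is being divided.
    offset += direction === "TB" ? child.height : child.width;
    return { ...child, label: g.label, items: g.items };
  });
}

// Root container sized like the configuration above (600 x 200).
const root = { x: 0, y: 0, width: 600, height: 200 };

// First pass: one row per entity, top to bottom (invented counts: 3 orders and 1 order).
const entityRows = splitContainer(root, [
  { label: "1", items: ["a", "b", "c"] },
  { label: "2", items: ["d"] },
], "TB");

// Second pass: inside the first row, one cell per feature, left to right.
const featureCells = splitContainer(entityRows[0], [
  { label: "feature_1", items: [0] },
  { label: "feature_2", items: [1] },
  { label: "feature_3", items: [1] },
  { label: "feature_4", items: [0] },
], "LR");

console.log(entityRows.map((r) => r.height), featureCells.map((c) => c.x));
```

Applying the same function once per layout level is what makes the nested top-to-bottom, then left-to-right arrangement fall out naturally.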
Advanced syntax parsing
To generate the target visualization, our syntax builds a root container and applies the unit visualization operations recursively until all containers are units. In other words, rendering becomes a tree traversal in which the root container is the root node of the tree and the unit containers are the leaf nodes. Once all the units have been generated, the layout is complete and the unit visualization can be drawn. Before parsing the syntax, we build the RootContainer, which holds the raw data, the parent node, the label, the visual space (width, height, padding, and position), and other information. The layouts are parsed into a hierarchically nested structure, which is then passed down from the RootContainer to the ChildrenContainer instances, the child containers at each level of the hierarchy.
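The recursion described above can be sketched as building a tree level by level and then collecting its leaves. This is a hypothetical illustration, not the framework's parser: `partition`, `buildTree`, and `collectUnits` are invented names, only the groupby and flatten subgroup types are handled, and geometry is omitted.

```javascript
// Hypothetical sketch of the recursive parse: each layout level partitions a
// container's rows, and containers below the last level become units (leaves).

// Read a nested attribute such as "basic_info.id" from a row.
function getPath(row, path) {
  return path.split(".").reduce((obj, key) => (obj == null ? undefined : obj[key]), row);
}

// Split rows into labeled child groups according to one subgroup definition.
function partition(rows, subgroup) {
  if (subgroup.type === "groupby") {
    const groups = new Map();
    for (const row of rows) {
      const value = getPath(row, subgroup.key);
      if (!groups.has(value)) groups.set(value, []);
      groups.get(value).push(row);
    }
    return [...groups].map(([label, items]) => ({ label: String(label), rows: items }));
  }
  if (subgroup.type === "flatten") {
    // One child per attribute of each row's object-valued field.
    return rows.flatMap((row) =>
      Object.entries(getPath(row, subgroup.key)).map(([name, value]) => ({
        label: name,
        rows: [{ ...row, attribute: name, value }],
      })));
  }
  throw new Error(`unsupported subgroup type: ${subgroup.type}`);
}

// Recursively apply one layout per tree level; stop when the layouts are exhausted.
function buildTree(container, layouts, depth = 0) {
  if (depth === layouts.length) return { ...container, children: [] }; // unit (leaf)
  const children = partition(container.rows, layouts[depth].subgroup)
    .map((child) => buildTree({ ...child, parent: container.label }, layouts, depth + 1));
  return { ...container, children };
}

// Depth-first traversal that collects the leaf containers, i.e. the units to draw.
function collectUnits(node, out = []) {
  if (node.children.length === 0) out.push(node);
  else node.children.forEach((child) => collectUnits(child, out));
  return out;
}

// Invented data and a two-level layout mirroring the configuration in this article.
const data = [
  { basic_info: { id: "1" }, feats: { feature_1: 0, feature_2: 1, feature_3: 1 } },
  { basic_info: { id: "2" }, feats: { feature_1: 1, feature_2: 0, feature_3: 0 } },
];
const layouts = [
  { name: "layout1", subgroup: { type: "groupby", key: "basic_info.id" } },
  { name: "layout2", subgroup: { type: "flatten", key: "feats" } },
];

const tree = buildTree({ label: "root", rows: data }, layouts);
const leafUnits = collectUnits(tree);
console.log(leafUnits.map((u) => `${u.parent}/${u.label}`));
```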
Application cases
Single-user order heat map
Multi-user order heat map
Interaction
Click, mouseover, mouseout, and other interactions are supported: click retrieves all the information for an order, while mouseover and mouseout highlight and unhighlight the order currently in focus.
Filtering by feature attribute is also supported. If users only care about feature 1 and feature 2, the result is as shown in the following figure.
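The feature filter and the three interactions can be sketched without any rendering code. The following is an assumption-laden illustration: `filterFeatures` and `createInteraction` are hypothetical names, the unit records are invented, and a real implementation would bind these handlers to canvas or SVG events.

```javascript
// Hypothetical sketch: keep only the feature attributes listed in `filters`.
function filterFeatures(entities, filters) {
  const keep = new Set(filters);
  return entities.map((entity) => ({
    ...entity,
    feats: Object.fromEntries(
      Object.entries(entity.feats).filter(([name]) => keep.has(name))
    ),
  }));
}

// Hypothetical interaction model: mouseover highlights one unit, mouseout
// clears the highlight, and click returns the full record behind a unit.
function createInteraction(units) {
  let highlighted = null;
  return {
    mouseover(id) { highlighted = id; },
    mouseout() { highlighted = null; },
    click(id) { return units.find((u) => u.id === id) ?? null; },
    isHighlighted(id) { return highlighted === id; },
  };
}

// Invented sample data.
const entities = [
  { basic_info: { id: "1" }, feats: { feature_1: 0, feature_2: 1, feature_3: 1 } },
];
const filtered = filterFeatures(entities, ["feature_1", "feature_2"]);

const interaction = createInteraction([
  { id: "order-1", details: { amount: 104 } },
  { id: "order-2", details: { amount: 20 } },
]);
interaction.mouseover("order-1");
const focusedWhileHovering = interaction.isHighlighted("order-1");
interaction.mouseout();
const clicked = interaction.click("order-2");
console.log(Object.keys(filtered[0].feats), focusedWhileHovering, clicked);
```

Filtering on the data before layout, rather than hiding drawn units, keeps the remaining features evenly distributed across the available width.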
References
[1] The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations
[2] projector.tensorflow.org/
[3] umap-learn.readthedocs.io/en/latest/h…
[4] distill.pub/2016/misrea…
[5] Visualization of Labeled Data Using Linear Transformations
[6] g2.antv.vision/zh/docs/man…
[7] vega.github.io/vega-lite/
Author: ES2049 / Miss Li
The article can be reproduced at will, but please keep this link to the original text.