Visualization of complex high-dimensional multivariate data

Definition

  • High-dimensional multivariate data is data in which each data object has two or more attributes, which may be independent or related.
  • "High-dimensional" describes data with many independent attributes.
  • "Multivariate" describes data with many related attributes.
  • Because researchers often cannot tell whether the attributes are independent, such data is usually just called multivariate data.
  • Example
    • When buying a laptop, you compare the configurations of different models: CPU, memory, hard disk, screen, and weight. Each parameter is an attribute describing the computer, and the full configuration of one model is a multivariate data object.
    • This is a classic example of decision making based on multivariate data.

Conventional visualization methods

  • Two-dimensional and three-dimensional data can be represented with conventional visualization methods.
    • Scatter plot: map each attribute to a coordinate axis and place every data point in the resulting coordinate system. When the data has more than three dimensions, additional visual encodings such as color, size, and shape can represent the extra attributes.
  • Disadvantages
    • Only a limited number of visual encodings are available.
    • Too many or overly complex visual encodings reduce the readability of the visualization.
  • Solution
    • Display the multivariate data in a low-dimensional space (usually two-dimensional).

Three basic methods of multivariate data visualization

Space mapping

In essence, a scatterplot maps abstract data objects into a space defined by a two-dimensional rectangular coordinate system. For multivariate data, this idea generalizes: data objects are arranged in a two-dimensional plane by some spatial mapping method. The position of each data object reflects its attributes and its relationship to the other objects, while the distribution of the whole data set reflects the relationships between dimensions and the overall characteristics of the data set.

Scatterplot matrix

A scatterplot matrix arranges the scatterplots of every pair of attributes in a grid. In the figure, MPG clearly decreases as horsepower and weight increase.
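As a rough sketch of how such a matrix is organized, the snippet below creates one panel per ordered pair of attributes and computes the Pearson correlation that each panel would make visible. The `cars` records are invented for illustration and are not the data behind the figure; `pearson` and `column` are helper names introduced here.

```python
from itertools import product
from math import sqrt

# Hypothetical records: (mpg, horsepower, weight). All values are invented.
attributes = ["mpg", "horsepower", "weight"]
cars = [
    (33.0, 70, 950),
    (27.0, 95, 1150),
    (21.0, 130, 1400),
    (16.0, 165, 1700),
    (13.0, 200, 1950),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def column(records, index):
    """Extract one attribute from every record."""
    return [record[index] for record in records]

# One panel per (row attribute, column attribute) pair, as in the matrix grid.
panels = {}
for (i, row_name), (j, col_name) in product(enumerate(attributes), repeat=2):
    panels[(row_name, col_name)] = pearson(column(cars, i), column(cars, j))

for (row_name, col_name), r in panels.items():
    print(f"{row_name:>10} vs {col_name:<10} r = {r:+.2f}")
```

With these invented values, the panels that pair mpg with horsepower or weight have a negative correlation, which is the kind of trend the matrix lets a reader spot at a glance.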

Table Lens

  • The Table Lens method extends the traditional way of presenting multivariate data in tables (as in spreadsheet software such as Excel).
  • Its mapping is similar to the traditional one: each row represents a data object and each column represents an attribute. Unlike a traditional table, Table Lens does not list the value of each cell directly but draws it as a horizontal bar or a point. Because bars and points take up less space than numbers, a large number of data objects and attributes fits into limited screen space, and users can quickly compare data objects and attributes.
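The row-and-bar mapping described above can be sketched in text form. This is only an illustration under stated assumptions: `table_lens` is a hypothetical helper, the laptop values are invented, and bar length is taken as the value's share of its column maximum, which assumes non-negative values.

```python
def table_lens(rows, names, width=20):
    """Render rows as text lines in which each value becomes a horizontal bar.

    Bar length is the value's share of its column maximum, so this sketch
    assumes non-negative values.
    """
    maxima = [max(column) for column in zip(*rows)]
    lines = [" | ".join(name.ljust(width) for name in names)]
    for row in rows:
        cells = []
        for value, maximum in zip(row, maxima):
            share = value / maximum if maximum > 0 else 0.0
            cells.append(("#" * round(share * width)).ljust(width))
        lines.append(" | ".join(cells))
    return lines

# Invented laptop configurations: (memory, disk, weight).
names = ["memory (GB)", "disk (GB)", "weight (kg)"]
laptops = [(8, 256, 1.2), (16, 512, 1.8), (32, 1024, 2.4)]

lines = table_lens(laptops, names)
print("\n".join(lines))
```

Each output line after the header is one data object, and the bars in a column can be compared by eye without reading any numbers.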

Parallel coordinates

  • Parallel coordinates are widely used for the visualization and analysis of multivariate data.
  • In traditional visualization methods, the axes are perpendicular to each other, and each data object corresponds to a point in the coordinate system.
  • The parallel coordinates method uses parallel axes, each representing one attribute of the data, so each data object corresponds to a polyline that crosses all the axes.
  • Parallel coordinates not only reveal the distribution of the data on each attribute but also show the relationship between adjacent attributes. For example, in a car data set, a car with many cylinders tends to have relatively low fuel mileage but high horsepower.
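A minimal sketch of the mapping from data objects to polylines follows. It assumes min-max normalization per axis, which is a common choice but not one stated in the text; `to_polylines` and the car values are hypothetical.

```python
def to_polylines(records, attributes):
    """Map each record to a polyline with one (axis_index, height) vertex per attribute.

    Heights are min-max normalized per attribute so every axis spans [0, 1].
    """
    lows = {a: min(r[a] for r in records) for a in attributes}
    highs = {a: max(r[a] for r in records) for a in attributes}
    polylines = []
    for record in records:
        vertices = []
        for axis_index, a in enumerate(attributes):
            span = highs[a] - lows[a]
            # A constant attribute is drawn at mid-height to avoid dividing by zero.
            height = 0.5 if span == 0 else (record[a] - lows[a]) / span
            vertices.append((axis_index, height))
        polylines.append(vertices)
    return polylines

# Invented car records.
cars = [
    {"cylinders": 4, "mpg": 32.0, "horsepower": 75},
    {"cylinders": 6, "mpg": 21.0, "horsepower": 110},
    {"cylinders": 8, "mpg": 14.0, "horsepower": 190},
]
axes = ["cylinders", "mpg", "horsepower"]

polylines = to_polylines(cars, axes)
for car, line in zip(cars, polylines):
    print(car["cylinders"], "cylinders:", line)
```

Lines that cross between two adjacent axes suggest an inverse relationship between those attributes, so the order of the axes affects which relationships are visible.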

Dimension reduction

  • When the data has very many dimensions (for example, more than 50), none of the visualization methods above can clearly show all the details of the data.
  • Dimensionality reduction projects or embeds multivariate data into a low-dimensional space (usually two or three dimensions) through a linear or nonlinear transformation, while preserving as far as possible the relationships and characteristics that the data has in the original multivariate space.
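One well-known linear instance of this strategy is principal component analysis (PCA). The text does not name a specific technique, so the sketch below is only one possible illustration: it finds the leading directions of variance by power iteration with Gram-Schmidt orthogonalization, using the standard library only. All function names and the sample `points` are assumptions made for the example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(data):
    """Subtract the per-dimension mean from every row."""
    means = [sum(col) / len(data) for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(a, b) / (n - 1) for b in cols] for a in cols]

def orthonormalize(vector, axes):
    """Remove the parts of vector along axes, then scale it to unit length."""
    for axis in axes:
        overlap = dot(vector, axis)
        vector = [v - overlap * a for v, a in zip(vector, axis)]
    norm = math.sqrt(dot(vector, vector))
    # The absolute threshold is a simplification that suits data of moderate scale.
    return [v / norm for v in vector] if norm > 1e-12 else None

def principal_axes(cov, count, iterations=200, seed=0):
    """Find the leading eigenvectors of cov by power iteration."""
    rng = random.Random(seed)
    axes = []
    for _ in range(count):
        vector = orthonormalize([rng.uniform(-1.0, 1.0) for _ in cov], axes)
        for _ in range(iterations):
            step = [dot(row, vector) for row in cov]
            candidate = orthonormalize(step, axes)
            if candidate is None:  # no variance left outside the axes found so far
                break
            vector = candidate
        axes.append(vector)
    return axes

def pca_project(data, dimensions=2):
    """Project the rows of data onto their leading principal axes."""
    centered = center(data)
    axes = principal_axes(covariance(centered), dimensions)
    return [[dot(row, axis) for axis in axes] for row in centered]

# Invented 3-D points that all lie on one straight line.
points = [[0.0, 0.0, 0.0], [1.0, 2.0, 2.0], [2.0, 4.0, 4.0], [3.0, 6.0, 6.0]]
embedded = pca_project(points)
for original, low in zip(points, embedded):
    print(original, "->", [round(value, 3) for value in low])
```

Because the sample points lie on a line, the two-dimensional embedding keeps the distances between them, which is the sense in which the characteristics of the data are preserved. Real data loses some information in the projection, and nonlinear methods make different trade-offs.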

Chernoff Faces

  • This method represents each data object with an icon that imitates a human face, and maps different attributes to different parts and structures of the face, such as face size and eye size.
  • Example: in US state crime data, the length of a face indicates the murder rate, the width of a face indicates the rape rate, and so on. The idea behind Chernoff Faces is that human vision and the brain are so good at recognizing faces that we can detect subtle differences between them, and can therefore detect differences between data objects by looking at icons that imitate faces.
  • Disadvantage: users often need to consult the legend repeatedly to recall the mapping.
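The attribute-to-feature mapping can be sketched as a lookup table plus a linear rescaling. The face length and face width mappings follow the example above; the eye size mapping, the drawing ranges, the region names, and all rates are invented for illustration, and the sketch stops short of actually drawing the faces.

```python
def rescale(value, low, high, out_low, out_high):
    """Linearly map value from [low, high] to [out_low, out_high]."""
    if high == low:
        return (out_low + out_high) / 2
    return out_low + (value - low) / (high - low) * (out_high - out_low)

# Facial feature -> (data attribute, smallest drawn size, largest drawn size).
FEATURE_MAP = {
    "face_length": ("murder", 0.6, 1.0),
    "face_width": ("rape", 0.5, 0.9),
    "eye_size": ("assault", 0.05, 0.15),  # hypothetical third mapping
}

def face_parameters(records):
    """Return, for every data object, the size of each facial feature."""
    faces = {}
    for name, record in records.items():
        faces[name] = {}
        for feature, (attribute, out_low, out_high) in FEATURE_MAP.items():
            values = [r[attribute] for r in records.values()]
            faces[name][feature] = rescale(
                record[attribute], min(values), max(values), out_low, out_high
            )
    return faces

# Invented rates for three placeholder regions.
crime = {
    "Region A": {"murder": 2.0, "rape": 20.0, "assault": 150.0},
    "Region B": {"murder": 8.0, "rape": 35.0, "assault": 300.0},
    "Region C": {"murder": 14.0, "rape": 50.0, "assault": 250.0},
}

faces = face_parameters(crime)
for name, features in faces.items():
    print(name, {k: round(v, 3) for k, v in features.items()})
```

`FEATURE_MAP` plays the role of the legend: a reader who forgets which feature encodes which attribute has to look it up again, which is the disadvantage noted above.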

Visualization of unstructured and heterogeneous data

Unstructured data

  • The complexity of data comes not only from its high dimensionality but also from its lack of structure and its heterogeneity.
  • Conventional relational databases process structured data, which has a well-defined structure and can be stored efficiently in two-dimensional database tables.
  • Unstructured data (text, time, logs, and so on) cannot be represented in this form.
  • Unstructured data is not only abundant but also enormously valuable.



Heterogeneous data

  • Heterogeneous data is data in which objects with different structures or attributes occur in the same data set.
  • Heterogeneous data can be represented as a network structure.

  • Topology
    • The study of the properties of a geometric figure or space that remain unchanged under continuous deformation. It considers only the positions of objects relative to one another, regardless of their shape and size.

Visualization of uncertainty

Statisticians have invented many methods for visualizing uncertainty, such as error bar charts and box-and-whisker plots. In the visualization field, uncertainty visualization is listed as one of the ten core research problems, and many new methods have been proposed, such as the flow radar map and methods based on visual element coding.

Icon method

In an error bar chart, the horizontal axis usually represents the data entities, while the vertical axis represents the statistical characteristics of each entity. In most cases, each entity is drawn with at least three values on the vertical axis: the mean, the lower error limit, and the upper error limit.
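The three values per entity can be computed as in the sketch below. It assumes the limits are the mean minus and plus one sample standard deviation; the standard error or a confidence interval are equally common choices. The `measurements` are invented, and `error_bar` is a hypothetical helper.

```python
from statistics import mean, stdev

def error_bar(samples):
    """Return (lower limit, mean, upper limit) using one sample standard deviation."""
    center = mean(samples)
    spread = stdev(samples)
    return center - spread, center, center + spread

# Invented repeated measurements for two data entities.
measurements = {
    "sensor 1": [9.8, 10.1, 10.0, 10.3, 9.8],
    "sensor 2": [12.0, 15.5, 9.0, 14.0, 10.5],
}

bars = {entity: error_bar(values) for entity, values in measurements.items()}
for entity, (lower, center, upper) in bars.items():
    print(f"{entity}: mean={center:.2f}, lower={lower:.2f}, upper={upper:.2f}")
```

A plotting library would then draw the mean as a marker and the two limits as the ends of a vertical bar above each entity on the horizontal axis.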





In a wind field, an arrow represents the wind direction at each sampling point. The length of the arrow represents the strength of the wind, and the width of the arrow indicates the range over which the wind direction varies, that is, the uncertainty of the wind direction. Thin arrows indicate less uncertainty, while thick arrows indicate greater uncertainty.
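A sketch of how the arrow parameters for one sampling point might be derived from repeated observations follows. The text does not say how the variation range is measured, so the use of the circular standard deviation for the width is an assumption, as are the `wind_glyph` helper and the sample values.

```python
import math

def wind_glyph(samples):
    """Derive arrow parameters from samples of (direction in degrees, speed)."""
    n = len(samples)
    # Average unit vectors so that 350 and 10 degrees average to north, not south.
    mean_cos = sum(math.cos(math.radians(d)) for d, _ in samples) / n
    mean_sin = sum(math.sin(math.radians(d)) for d, _ in samples) / n
    direction = math.degrees(math.atan2(mean_sin, mean_cos)) % 360
    # Resultant length is 1 when all directions agree and near 0 when they scatter.
    resultant = min(max(math.hypot(mean_cos, mean_sin), 1e-12), 1.0)
    # Circular standard deviation in degrees (assumed measure of uncertainty).
    spread = math.degrees(math.sqrt(max(0.0, -2.0 * math.log(resultant))))
    length = sum(speed for _, speed in samples) / n
    return {"direction": direction, "length": length, "width": spread}

# Invented observations at two sampling points.
steady = wind_glyph([(90.0, 5.0), (90.0, 5.0), (90.0, 5.0)])
variable = wind_glyph([(60.0, 5.0), (120.0, 7.0), (90.0, 6.0)])
print("steady:  ", steady)
print("variable:", variable)
```

Both points have the same mean direction, but the second one gets a wider arrow because its observed directions disagree. The width value would still need to be scaled to a drawable thickness.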