This is the 30th day of my participation in the August Text Challenge.

I. Foreword

This post comes from my machine learning course homework. Some of the techniques involved are worth learning, so I am sharing the implementation process and part of the technology stack.

II. Scatter plot display

In this section, we will randomly generate some regularly shaped point coordinates with Python's sklearn library and visualize them with matplotlib's pyplot (plt).

1. Installing the dependencies

Here are the basic steps:

  1. Create a new virtual environment (to prevent conflicts with existing library versions)
  2. Switch the pip source
  3. Install the dependencies
# Permanently switch the pip index (Tsinghua mirror)
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Install the dependencies
pip install -r requirements.txt

The following are my library dependencies for this experiment:

# requirements.txt
numpy
pandas
matplotlib
seaborn
scipy
scikit-learn  # the package that provides the sklearn module
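After installing, a quick sanity check of my own (not part of the assignment) confirms that the environment can see every dependency:

# Verify that all of the listed dependencies import correctly
import numpy, pandas, matplotlib, seaborn, scipy, sklearn
print(numpy.__version__, sklearn.__version__)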

2. Generating scatter points

Generating a circular scatter

We use the make_circles function under sklearn.datasets.

This function returns two numpy.ndarray objects: the first contains the (x, y) coordinates of the generated points, and the second contains the tag of each point.

The tag values are 0 or 1: 0 marks a point on the outer circle and 1 a point on the inner circle (we can color them differently to tell them apart).

Here is the code implementation:

from sklearn import datasets
points, tags = datasets.make_circles(
    n_samples=400,
    shuffle=True,
    noise=0.1,          # keep the noise small so the two circles stay visible
    random_state=4103,
    factor=0.5          # must lie strictly between 0 and 1
)

Parameter settings:

  • n_samples: number of points
  • shuffle: whether to shuffle the samples (True here)
  • noise: standard deviation of the Gaussian noise added to the data
  • random_state: random seed
  • factor: the ratio of the inner circle's radius to the outer circle's, which must lie between 0 and 1
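For intuition, here is a quick check of the returned arrays (a small sketch of my own, assuming the points and tags variables from the snippet above):

print(points.shape)        # (400, 2): one (x, y) coordinate pair per sample
print(tags.shape)          # (400,): one label per sample
print(set(tags.tolist()))  # {0, 1}: 0 marks the outer circle, 1 the inner circle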

Then the plotting section:

colors = ["b" if tag else "m" for tag in tags]
plot = figure.add_subplot(location)
plot.set_title('data by make_circles()')
plot.scatter(
    x=points[:, 0],
    y=points[:, 1],
    s=100,
    marker="o",
    c=colors,
)

Here points[:, 0] takes the x coordinates of all the points and points[:, 1] takes their y coordinates.

In addition, the parameter s sets the size of the markers, marker sets their shape ("o" means a circle), and c sets their color (besides a list, you can also pass a single string to draw every point in one color).
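As an aside, here is a small sketch of my own showing two alternatives for c, reusing the plot, points, and tags from the snippet above:

# Color by label via a colormap instead of building the colors list by hand
plot.scatter(points[:, 0], points[:, 1], s=100, marker="o", c=tags, cmap="coolwarm")
# Or draw every point in the same color
plot.scatter(points[:, 0], points[:, 1], s=100, marker="o", c="b")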

The effect of the plotting code above is as follows:

Generating a crescent scatter

This is actually much the same as above; just replace make_circles with make_moons.

points, tags = datasets.make_moons(
    n_samples=400,
    shuffle=True,
    noise=0.1,
    random_state=4103,
)

Everything else is the same.

After a little tidying up, the full code is as follows:

import matplotlib.pyplot as plt
from sklearn import datasets


def build_circle_figure(figure: plt.Figure, location=211) -> None:  # 211 = first slot of a 2-row, 1-column subplot grid
    points, tags = datasets.make_circles(
        n_samples=400,
        shuffle=True,
        noise=0.1,
        random_state=4103,
        factor=0.5  # must lie strictly between 0 and 1
    )
    colors = ["b" if tag else "m" for tag in tags]
    plot = figure.add_subplot(location)
    plot.set_title('data by make_circles()')
    plot.scatter(
        x=points[:, 0],
        y=points[:, 1],
        s=100,
        marker="o",
        c=colors,
    )


def build_make_moons(figure: plt.Figure, location=212) -> None:  # 212 = second slot of the same 2-row, 1-column grid
    points, tags = datasets.make_moons(
        n_samples=400,
        shuffle=True,
        noise=0.1,
        random_state=4103,
    )
    colors = ["b" if tag else "m" for tag in tags]
    plot = figure.add_subplot(location)
    plot.set_title('data by make_moons()')
    plot.scatter(
        x=points[:, 0],
        y=points[:, 1],
        s=100,
        marker="o",
        c=colors,
    )


if __name__ == '__main__':
    fig = plt.figure()
    build_circle_figure(fig)
    build_make_moons(fig)
    plt.tight_layout()
    plt.show()

The results are as follows:

Thinking: how can high-dimensional data be visualized?

A: You can try dimensionality reduction first, using algorithms such as PCA or LDA, and then plot the reduced data.
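For example, here is a minimal sketch of my own (using the Iris data built into sklearn for convenience) that projects the 4-dimensional features down to 2 dimensions with PCA and then plots them:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

# Load the 4-dimensional Iris features together with their class labels
iris = datasets.load_iris()
# Project the 4 features onto the first 2 principal components
reduced = PCA(n_components=2).fit_transform(iris.data)
# Color each point by its class and plot the 2-D projection
plt.scatter(reduced[:, 0], reduced[:, 1], c=iris.target, s=50)
plt.title("Iris after PCA (4D -> 2D)")
plt.show()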

3. Downloading the data sets

Download address: archive.ics.uci.edu/ml/index.ph…

Document data set

I downloaded the Iris data set, one of the most popular data sets on UCI. It contains 150 records, each consisting of 4 feature attributes plus one class output:

  • Features
    • sepal_length
    • sepal_width
    • petal_length
    • petal_width
  • Output
    • class

Here is a list of downloaded files:

Here, Index is a directory listing and iris.names holds some basic information about the data set; neither has any effect on training and both can be ignored.

Also, bezdekIris.data and iris.data are essentially the same, so we only need to focus on iris.data, the actual data file:
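Each line of iris.data is a comma-separated record of the four features followed by the class name, for example:

5.1,3.5,1.4,0.2,Iris-setosa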

Using operations similar to those in Section 2, we can display the data as scatter plots:

The large points are the sepal measurements and the small points are the petal measurements.

The code implementation is as follows:

import matplotlib.pyplot as plt
import numpy


if __name__ == '__main__':
    with open("./iris.data") as f:
        lines = f.readlines()
    dataset = []
    for line in lines[:-1]:                  # the last line of the file is empty
        dataset.append(line.strip().split(","))
    dataset = numpy.array(dataset)
    features = dataset[:, :4].astype(float)  # numeric measurement columns
    colors = []
    for label in dataset[:, 4]:              # the class column decides the color
        if "setosa" in label:
            colors.append("r")
        elif "versicolor" in label:
            colors.append("g")
        else:
            colors.append("b")

    # Large points: sepal length vs. sepal width
    plt.scatter(
        x=features[:, 0],
        y=features[:, 1],
        s=100,
        marker="o",
        c=colors
    )
    # Small points: petal length vs. petal width
    plt.scatter(
        x=features[:, 2],
        y=features[:, 3],
        s=20,
        marker="o",
        c=colors
    )
    plt.show()
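As an alternative (a sketch of my own; pandas is already in the requirements), the same file can be loaded with pandas.read_csv. The column names below are chosen by me, since the file itself has no header row:

import pandas as pd

# iris.data has no header row, so supply the column names ourselves
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv("./iris.data", header=None, names=columns)
print(df.head())
print(df["class"].value_counts())  # 50 records per class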

Image data set

For the image data set I chose Kaggle's cat vs. dog recognition data set (12,500 images of cats and 12,500 of dogs); ResNet18-based cat vs. dog recognition was introduced in a previous blog post.

Here’s how to read image data in PyTorch:

def get_data(input_size, batch_size):
    """Read the image folders and build the training and validation loaders."""
    from torchvision import transforms
    from torchvision.datasets import ImageFolder
    from torch.utils.data import DataLoader

    # Chain several image transformations together (training set):
    # transforms.RandomResizedCrop(input_size) randomly crops the image and rescales the crop to the same size
    # transforms.RandomHorizontalFlip() flips the given PIL image horizontally with a given probability
    # transforms.ToTensor() converts the image to a Tensor with values scaled to [0, 1]
    # transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) normalizes each channel to roughly [-1, 1]
    transform_train = transforms.Compose([
        transforms.RandomResizedCrop(input_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
    # Get the training set (using the transforms above)
    train_set = ImageFolder('train', transform=transform_train)
    # Wrap the training set in a DataLoader
    train_loader = DataLoader(dataset=train_set,
                              batch_size=batch_size,
                              shuffle=True)

    # Chain several image transformations together (validation set)
    transform_val = transforms.Compose([
        transforms.Resize([input_size, input_size]),  # note that Resize takes a 2-D size, unlike RandomResizedCrop
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
    # Get the validation set (using the transforms above)
    val_set = ImageFolder('val', transform=transform_val)
    # Wrap the validation set in a DataLoader
    val_loader = DataLoader(dataset=val_set,
                            batch_size=batch_size,
                            shuffle=False)
    # Return everything
    return transform_train, train_set, train_loader, transform_val, val_set, val_loader
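A minimal usage sketch of my own (ImageFolder expects a 'train' and a 'val' folder, each containing one subfolder per class; the input_size and batch_size values here are just illustrative):

# Build loaders for 224x224 crops in batches of 32
_, train_set, train_loader, _, _, val_loader = get_data(input_size=224, batch_size=32)

# Pull one batch to confirm the shapes and labels
images, labels = next(iter(train_loader))
print(images.shape)       # torch.Size([32, 3, 224, 224])
print(train_set.classes)  # class names taken from the subfolder names, e.g. ['cat', 'dog']
print(labels[:8])         # the corresponding integer class indices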

Next, generate a 10×10 thumbnail grid in which each cell is a 100×100 image; the implementation is as follows:

import PIL.Image as Image
import os


def image_compose(label):
    """Stitch the first 100 images under the given folder into one 10x10 grid."""
    image_names = [name for name in os.listdir(label + "/")]
    to_image = Image.new('RGB', (10 * 100, 10 * 100))   # blank 1000x1000 canvas
    for y in range(1, 10 + 1):
        for x in range(1, 10 + 1):
            from_image = Image.open(label + "/" + image_names[10 * (y - 1) + x - 1]).resize(
                (100, 100), Image.ANTIALIAS)            # Image.LANCZOS in newer Pillow versions
            to_image.paste(from_image, ((x - 1) * 100, (y - 1) * 100))
    return to_image.save(label + ".jpg")
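A hypothetical usage example (it assumes a folder named cat, containing at least 100 images, sits next to the script):

image_compose("cat")  # writes cat.jpg: a 1000x1000 grid of the first 100 images in ./cat/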

The effect is as follows: