Reproduced by: Quant Online

Data cleansing is an important part of data science and machine learning. This article describes how to process MNIST image data in TensorFlow.js (0.11.1) and explains the code line by line.


Some joke that 80% of data science is cleaning data and the other 20% is complaining about cleaning data… Cleaning data is a much bigger part of data science than outsiders realize. In general, training models is a small part of a machine learning practitioner's or data scientist's work (often less than 10%). – Kaggle CEO Anthony Goldbloom


Data processing is an important step in any machine learning problem. This article walks through the MNIST sample from TensorFlow.js (0.11.1) (github.com/tensorflow/…


MNIST sample

import * as tf from '@tensorflow/tfjs';

const IMAGE_SIZE = 784;
const NUM_CLASSES = 10;
const NUM_DATASET_ELEMENTS = 65000;

const NUM_TRAIN_ELEMENTS = 55000;
const NUM_TEST_ELEMENTS = NUM_DATASET_ELEMENTS - NUM_TRAIN_ELEMENTS;

const MNIST_IMAGES_SPRITE_PATH =
    'https://storage.googleapis.com/learnjs-data/model-builder/mnist_images.png';
const MNIST_LABELS_PATH =
    'https://storage.googleapis.com/learnjs-data/model-builder/mnist_labels_uint8'; 


First, import TensorFlow.js (make sure you're transpiling the code) and set up some constants, including:

  • IMAGE_SIZE: the flattened image size (28 × 28 = 784)
  • NUM_CLASSES: the number of label classes (the digits 0 to 9, so 10 classes)
  • NUM_DATASET_ELEMENTS: the total number of images (65,000)
  • NUM_TRAIN_ELEMENTS: the number of images in the training set (55,000)
  • NUM_TEST_ELEMENTS: the number of images in the test set (10,000)
  • MNIST_IMAGES_SPRITE_PATH and MNIST_LABELS_PATH: the paths to the images and labels
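As a quick sanity check, these constants fit together like this (a standalone sketch, not part of the sample itself):

```javascript
// Each MNIST image is 28x28 grayscale, flattened to a row of 784 values.
const IMAGE_WIDTH = 28;
const IMAGE_HEIGHT = 28;
const IMAGE_SIZE = IMAGE_WIDTH * IMAGE_HEIGHT; // 784

// The 65,000 images split into 55,000 for training and 10,000 for testing.
const NUM_DATASET_ELEMENTS = 65000;
const NUM_TRAIN_ELEMENTS = 55000;
const NUM_TEST_ELEMENTS = NUM_DATASET_ELEMENTS - NUM_TRAIN_ELEMENTS; // 10000
```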


All of these images are concatenated into one giant sprite image so they can be downloaded in a single request.


MnistData

Next, starting at line 38, is the MnistData class, which exposes the following functions:

  • load: asynchronously loads the image and label data;
  • nextTrainBatch: loads the next batch from the training set;
  • nextTestBatch: loads the next batch from the test set;
  • nextBatch: a generic function that returns the next batch from either the training set or the test set, depending on which is requested.

This article is a primer, so it covers only the load function.
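Put together, the shape of the class looks roughly like this (a sketch with method bodies elided; only the names come from the sample):

```javascript
// Rough outline of MnistData; bodies elided since this primer covers load only.
class MnistData {
  constructor() {
    this.shuffledTrainIndex = 0;
    this.shuffledTestIndex = 0;
  }
  async load() { /* fetch the sprite and labels, fill typed arrays */ }
  nextTrainBatch(batchSize) { /* nextBatch over the training slices */ }
  nextTestBatch(batchSize) { /* nextBatch over the test slices */ }
  nextBatch(batchSize, data, index) { /* returns an {xs, labels} pair */ }
}
```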


load

async load() {
 // Make a request for the MNIST sprited image.
 const img = new Image();
 const canvas = document.createElement('canvas');
 const ctx = canvas.getContext('2d');


Async functions are a relatively new JavaScript language feature, so you may need a transpiler.


The Image object is a native DOM API that represents an in-memory image and fires a callback once the image has loaded, at which point its properties can be accessed. canvas is another DOM element that provides easy access to an array of pixels, manipulated through its context. Because both of these are DOM APIs, you won't have access to them in Node.js (or in a Web Worker); alternative approaches are covered below.


imgRequest

const imgRequest = new Promise((resolve, reject) => {
 img.crossOrigin = ' ';
 img.onload = () => {
 img.width = img.naturalWidth;
 img.height = img.naturalHeight;


This code initializes a new Promise that resolves once the image loads successfully. The example does not explicitly handle error states.


crossOrigin is an image attribute that allows images to be loaded across domains, working around CORS (Cross-Origin Resource Sharing) restrictions when interacting with the DOM. naturalWidth and naturalHeight are the intrinsic dimensions of the loaded image; setting width and height from them ensures the image size is correct in later calculations.

const datasetBytesBuffer =
    new ArrayBuffer(NUM_DATASET_ELEMENTS * IMAGE_SIZE * 4);

const chunkSize = 5000;
canvas.width = img.width;
canvas.height = chunkSize;


This code initializes a buffer large enough to hold every pixel of every image: the total number of images, times the pixels per image, times 4 bytes per 32-bit float. The chunkSize appears to exist to prevent the UI from loading too much data into memory at once, but I'm not 100% sure.
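The sizes involved work out as follows (a standalone arithmetic sketch):

```javascript
const IMAGE_SIZE = 784;
const NUM_DATASET_ELEMENTS = 65000;
const BYTES_PER_FLOAT32 = 4;

// One float per pixel, four bytes per float: ~204 MB for the whole dataset.
const bufferBytes = NUM_DATASET_ELEMENTS * IMAGE_SIZE * BYTES_PER_FLOAT32;

// The sprite is consumed in 13 chunks of 5000 image rows each.
const chunkSize = 5000;
const numChunks = NUM_DATASET_ELEMENTS / chunkSize; // 13
```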

for (let i = 0; i < NUM_DATASET_ELEMENTS / chunkSize; i++) {
    const datasetBytesView = new Float32Array(
        datasetBytesBuffer, i * IMAGE_SIZE * chunkSize * 4,
        IMAGE_SIZE * chunkSize);
    ctx.drawImage(
        img, 0, i * chunkSize, img.width, chunkSize, 0, 0, img.width,
        chunkSize);

    const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);


This code iterates over the sprite image in chunks, initializing a new TypedArray view for each iteration. Next, drawImage draws a slice of the sprite onto the canvas. Finally, the drawn slice is converted into image data with the context's getImageData function, which returns an object representing the underlying pixel data.
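To make the offset arithmetic concrete, here is the same windowing pattern at toy sizes (hypothetical numbers standing in for the sample's constants):

```javascript
// Toy stand-ins for IMAGE_SIZE (784), chunkSize (5000), and 65000 images.
const IMAGE_SIZE = 4;
const chunkSize = 2;
const numElements = 6;
const buffer = new ArrayBuffer(numElements * IMAGE_SIZE * 4);

// Each iteration creates a Float32Array window over one chunk's byte range.
const views = [];
for (let i = 0; i < numElements / chunkSize; i++) {
  views.push(new Float32Array(
      buffer, i * IMAGE_SIZE * chunkSize * 4, IMAGE_SIZE * chunkSize));
}
// Three non-overlapping windows of eight floats each tile the buffer.
```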


for (let j = 0; j < imageData.data.length / 4; j++) {
    // All channels hold an equal value since the image is grayscale, so
    // just read the red channel.
    datasetBytesView[j] = imageData.data[j * 4] / 255;
  }
}


We iterate over the pixels and divide each by 255 (the maximum possible pixel value) to normalize values into the range 0 to 1. Only the red channel is needed because the image is grayscale.
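The same normalization can be seen on a fabricated two-pixel array (the canvas's imageData.data has exactly this flat RGBA layout):

```javascript
// A white pixel and a black pixel, as flat [r, g, b, a] bytes.
const data = Uint8ClampedArray.from([255, 255, 255, 255, 0, 0, 0, 255]);

const view = new Float32Array(data.length / 4);
for (let j = 0; j < data.length / 4; j++) {
  // Grayscale, so r === g === b; reading red alone loses nothing.
  view[j] = data[j * 4] / 255;
}
// view is now [1, 0]
```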


  this.datasetImages = new Float32Array(datasetBytesBuffer);
  resolve();
};
img.src = MNIST_IMAGES_SPRITE_PATH;
});


These lines wrap the buffer in a new Float32Array holding our pixel data and then resolve the promise. The last line (setting the src attribute) is what actually kicks everything off by loading the image. One thing that confused me at first was how a TypedArray behaves relative to its underlying buffer. You may have noticed that datasetBytesView is assigned inside the loop but never returned. datasetBytesView is a view over datasetBytesBuffer (the buffer it was initialized with), so when the code writes pixel data into the view, it indirectly writes into the buffer, which is then wrapped by the new Float32Array on line 78.
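This aliasing behavior is easy to demonstrate in isolation: writes through a view land in the shared buffer, so the view itself never needs to be returned.

```javascript
// A buffer with room for four 32-bit floats.
const buffer = new ArrayBuffer(4 * 4);

// A window over the first two floats; writing through it mutates the buffer.
const view = new Float32Array(buffer, 0, 2);
view[0] = 0.5;

// A second array over the same buffer observes the write.
const whole = new Float32Array(buffer);
// whole[0] is 0.5
```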


Get image data outside the DOM

If you're working in the DOM, just use the DOM: the browser (via canvas) takes care of decoding the image and converting the buffer data into pixels. But if you're working outside the DOM (that is, in Node.js or a Web Worker), you need an alternative approach. fetch provides response.arrayBuffer(), which gives access to a file's underlying buffer. With it we can read the bytes manually, with no DOM at all. Here is an alternative way to write the code above (it requires fetch, which is available in Node.js through packages such as isomorphic-fetch):

// PNGReader is provided by a third-party PNG-decoding library.
const imgRequest = fetch(MNIST_IMAGES_SPRITE_PATH)
    .then(resp => resp.arrayBuffer())
    .then(buffer => {
    return new Promise(resolve => {
        const reader = new PNGReader(buffer);
        return reader.parse((err, png) => {
            const pixels = Float32Array.from(png.pixels).map(pixel => {
                return pixel / 255;
            });
            this.datasetImages = pixels;
            resolve();
        });
    });
});


This produces the pixel array for that particular image. While writing this article I first tried parsing the incoming buffer myself, but I don't recommend doing so. If you need to parse PNGs, I recommend using pngjs. When working with images in other formats, you'll need to write your own parsing function.


Going further

Understanding data manipulation is an important part of doing machine learning in JavaScript. By understanding the use cases and requirements described in this article, we can format data to meet those requirements with just a few key functions. The TensorFlow.js team is continually improving the library's underlying data API, which will help meet more of these needs; as TensorFlow.js continues to improve and evolve, the API will keep pace.

Related reading: Machine Learning Fundamentals and Practices — Data cleansing