About the author:

MATTHIJS HOLLEMANS

Dutch independent developer specializing in low-level coding, GPU optimization, and algorithm research. His current research focuses on deep learning on iOS and its use in apps.

Twitter: twitter.com/mhollemans

Email address: mailto:[email protected]

Github address: github.com/hollance

Personal blog: machineThink.net/


Before a deep learning network can be used for a predictive task, it must first be trained. There are many tools for training neural networks, and TensorFlow is one of the most popular choices.

You can use TensorFlow to train your machine learning models and then use those models to make predictions. Training is usually done on a powerful machine or in the cloud, but TensorFlow can also run on iOS, albeit with some limitations.

In this article, the author describes in detail how to use TensorFlow to train a simple classifier and use it in an iOS app. The Gender Recognition by Voice and Speech Analysis dataset is used throughout. The project source code is hosted on GitHub.



About TensorFlow

TensorFlow is a software library for building computational graphs for machine learning. Many other tools work at a higher level of abstraction. With Caffe, for example, you design a neural network by connecting layers to each other. On iOS, Basic Neural Network Subroutines (BNNS) and Metal Performance Shaders Convolutional Neural Network (MPSCNN) provide similar functionality.

You can think of TensorFlow as a toolkit for implementing new machine learning algorithms, whereas other deep learning tools are for using algorithms that have already been implemented. That does not mean you have to build everything from scratch: TensorFlow comes with many reusable building blocks, and other libraries, such as Keras, provide convenience modules on top of TensorFlow.


Binary classification with logistic regression

In this article, we create a classifier using the logistic regression algorithm. The classifier takes input data and returns the category to which the data belongs. This project has only two categories, male and female, so it is a binary classifier.

Note: Binary classifiers are the simplest classifiers, but the idea is the same as for classifiers that distinguish hundreds or thousands of classes. Although this article does not do any deep learning, much of the underlying theory is shared.

Each input consists of 20 numbers that describe acoustic characteristics of a speaker’s voice. They are explained later; for now, just think of them as measurements taken from an audio recording. As shown in the figure, the 20 numbers are connected to a sum block. Each connection has its own weight, which corresponds to the importance of that feature.



In the figure, x0 to x19 are the input features and w0 to w19 are the weights of the connections. The sum block computes an ordinary dot product plus a bias term b:

sum = x0*w0 + x1*w1 + ... + x19*w19 + b

Training the classifier means finding the right values for w and b. Initially, w and b are set to 0. After many rounds of training, the classifier has learned values of w and b that let it distinguish male voices from female ones. To convert sum into a probability between 0 and 1, we use the logistic sigmoid function:

y_pred = 1 / (1 + exp(-sum))

If sum is a large positive number, the sigmoid returns a value close to 1, or 100%. If sum is a large negative number, the sigmoid returns a value close to 0. So for large positive and negative sums we get a definite yes or no prediction. If sum is close to 0, however, the sigmoid returns a probability close to 50%. When we start training the classifier, its initial predictions are 50/50, because the classifier has not learned anything yet and is unsure about every input. As training progresses, the probabilities move closer to 1 and 0, and the classification becomes more clear-cut.

y_pred is the probability that the voice is male. If the probability is greater than 0.5, we conclude it is a male voice; otherwise, we conclude it is a female voice.

To summarize how the binary logistic regression classifier works: its input is 20 numbers describing the acoustic characteristics of an audio recording, it computes a weighted sum of them and applies the sigmoid function, and its output is the probability that the voice is male.
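As a rough illustration of the weighted sum and sigmoid described above, here is a minimal sketch in plain Python (no TensorFlow needed). The function names `sigmoid` and `predict_male_probability` and the feature values are made up for this example; they are not taken from the project.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_male_probability(x, w, b):
    """Weighted sum of the features plus the bias, squashed by the sigmoid."""
    total = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(total)

# Untrained classifier: all weights and the bias are zero,
# so every input gets a probability of exactly 0.5.
x = [0.1 * i for i in range(20)]   # 20 made-up feature values
w = [0.0] * 20
b = 0.0
print(predict_male_probability(x, w, b))   # 0.5

# Large positive sums approach 1, large negative sums approach 0.
print(sigmoid(10.0), sigmoid(-10.0))
```

With all-zero weights the sum is 0 and the sigmoid returns 0.5, which is the "50/50" starting point mentioned in the text.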


Implementing the classifier in TensorFlow

To build the classifier in TensorFlow, you first need to create a computational graph. A computational graph consists of nodes that perform operations and of data that flows between the nodes. The graph for logistic regression looks like this:



This graph looks a little different from the previous one. The input data x is no longer 20 independent numbers, but a vector with 20 elements, the weight is represented by a matrix W, and the dot product is replaced by a simple matrix multiplication.

The input y is used to check how well the classifier is doing. The dataset used here contains 3,168 voice recordings, and for each one we know whether the speaker is male or female. These known outputs (male/female) are called labels, and they are fed into the input y.

Because the weights are all initialized to 0, the classifier will make wrong predictions at first. A loss function is therefore used to measure how good the classifier’s predictions are. The loss function compares the predicted result y_pred with the correct label y. After computing the loss over the training samples, we use backpropagation to adjust the values of W and b. This process is repeated over all samples until good values for the weights have been found. The loss value, which measures how well the classifier is doing, becomes smaller and smaller over time.


About tensors

In the figure above, data flows from left to right, from input to output. This is where the word “flow” in TensorFlow comes from. The data flowing through the graph takes the form of tensors. A tensor is just an n-dimensional array. We said that W is a weight matrix; TensorFlow treats it as a second-order tensor, which is simply a two-dimensional array. For example:

1. Scalar numbers are tensors of order 0;

2. Vectors are first-order tensors;

3. A matrix is a second-order tensor;

4. A three-dimensional array is a third-order tensor

Deep learning models such as convolutional neural networks (CNNs) often deal with four-dimensional tensors, but the logistic classifier in this article is simple and needs nothing beyond second-order tensors, that is, matrices. We said that x is a vector, but let’s treat both x and y as matrices. That way, the loss can be computed for all samples in one go. A single sample has 20 values, and if you load all 3,168 samples, x becomes a 3168 x 20 matrix. Multiplying x by W gives the output y_pred, a 3168 x 1 matrix that holds one prediction for each sample in the dataset. In short, representing the computational graph with matrices and tensors lets you make predictions for many samples at once.
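The shape bookkeeping described above can be checked with a small sketch in plain Python. The `matmul` and `shape` helpers and the dummy values are assumptions made for illustration only; in the project itself TensorFlow performs the multiplication.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]

def shape(m):
    """Return (rows, columns) of a list-of-rows matrix."""
    return (len(m), len(m[0]))

num_samples, num_inputs = 3168, 20
x = [[1.0] * num_inputs for _ in range(num_samples)]   # dummy feature values
W = [[0.5] for _ in range(num_inputs)]                 # made-up 20 x 1 weight matrix

sums = matmul(x, W)
print(shape(x), shape(W), shape(sums))   # (3168, 20) (20, 1) (3168, 1)
```

Each row of the result is the weighted sum for one sample, so a single matrix multiplication covers the whole dataset.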


Install TensorFlow

Environment: Python 3.6

Online installation


brew install python3


pip3 install numpy
pip3 install scipy
pip3 install scikit-learn
pip3 install pandas
pip3 install tensorflow


Besides TensorFlow, we have also installed NumPy, SciPy, pandas, and scikit-learn. These packages are installed in the /usr/local/lib/python3.6/site-packages directory, where you can inspect them at any time. pip automatically installs the version of TensorFlow that works best for your system. If you want to install a different version, refer to the offline installation guide.

Let’s test that everything has been installed correctly. Create tryit.py as follows:


import tensorflow as tf

a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

print(sess.run(a + b))


Running the script from the terminal then prints debugging information about the device being used: most likely the CPU, or the GPU if your Mac has an NVIDIA GPU. It should end with the output:


[5 7 9]


This is the sum of the two vectors a and b. You may also see a message like this:


W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.


If this message appears, the version of TensorFlow installed on your system is not optimized for your CPU. One solution is to build TensorFlow from source, which lets you configure all the options yourself.


A closer look at the data

In this experiment, instead of the MNIST handwritten digits commonly used in TensorFlow tutorials, we use a dataset for recognizing gender from voice recordings. The voice.csv file is shown below. The numbers represent acoustic properties of the different voice recordings. These features were extracted from the recordings by a script and saved as a CSV file. If you are interested, you can look at the R source code of that script.

The dataset contains 3,168 samples, one for each row of the table, roughly evenly split between men and women. Each sample data contains 20 acoustic features, as shown in the figure:



You do not need to understand exactly what these features mean. We are only interested in training a classifier on this data that can distinguish between male and female voices. To detect whether a voice is male or female in your app, you first need to extract these acoustic features from the audio data. Once you have those 20 acoustic features, you can use the classifier to make a prediction. The classifier therefore does not work directly on the audio, only on the extracted features.

Note: It is worth pointing out the difference between deep learning and traditional algorithms such as logistic regression. The classifier we train here cannot learn very complex things, so the features have to be extracted in a preprocessing step. A deep learning system can take the raw audio data directly as input, extract the important acoustic features itself, and then classify them.


Create training sets and test sets

I created a Python script called split_data.py to split the training set from the data set as follows:

# This script loads the original dataset and splits it into a training set and test set.

import numpy as np
import pandas as pd

# Read the CSV file.
df = pd.read_csv("voice.csv", header=0)

# Extract the labels into a numpy array. The original labels are text but we convert
# this to numbers: 1 = male, 0 = female.
labels = (df["label"] == "male").values * 1

# labels is a row vector but TensorFlow expects a column vector, so reshape it.
labels = labels.reshape(-1, 1)

# Remove the column with the labels.
del df["label"]

# OPTIONAL: Do additional preprocessing, such as scaling the features.
# for column in df.columns:
#     mean = df[column].mean()
#     std = df[column].std()
#     df[column] = (df[column] - mean) / std

# Convert the training data to a numpy array.
data = df.values
print("Full dataset size:", data.shape)

# Split into a random training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=123456)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

# Save the matrices using numpy's native format.
np.save("X_train.npy", X_train)
np.save("X_test.npy", X_test)
np.save("y_train.npy", y_train)
np.save("y_test.npy", y_test)


In this binary classifier, 1 stands for male and 0 for female. Running the script from a terminal produces four files: the training data (X_train.npy) and its labels (y_train.npy), and the test data (X_test.npy) and its labels (y_test.npy).
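To make the label encoding and the 70/30 split more tangible, here is a standard-library sketch of the same idea. It is not how scikit-learn implements train_test_split; the helper names `encode_label` and `split_indices` are hypothetical, and the sketch assumes the test size is rounded up, which reproduces the 2,217/951 split reported in this article.

```python
import math
import random

def encode_label(text):
    """1 for "male", 0 for anything else (here: "female")."""
    return 1 if text == "male" else 0

def split_indices(num_samples, test_fraction, seed):
    """Shuffle the row indices and cut them into a training part and a test part."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_test = math.ceil(num_samples * test_fraction)   # assumption: round the test size up
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(3168, 0.3, seed=123456)
print(len(train_idx), len(test_idx))                    # 2217 951
print([encode_label(t) for t in ["male", "female"]])    # [1, 0]
```

The shuffled order produced here will differ from scikit-learn's, even with the same seed, because the two use different random number generators.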


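Building the computational graph

Next, we use the train.py script to train the logistic classifier with TensorFlow. The full code can be viewed on GitHub.

First, load the training data (X_train and y_train) from the X_train.npy and y_train.npy files written by split_data.py, in the same way that test.py later loads the test data with np.load().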

Now let’s start building the computational graph. The input data x and y are defined as placeholders first: inside tf.name_scope("inputs"), x is a tf.float32 placeholder of shape [None, num_inputs] named "x-input", and y is a tf.float32 placeholder of shape [None, num_classes] named "y-input".

tf.name_scope() groups different parts of the graph into scopes. The scope name is used as a prefix for everything created inside that scope, so the unique name of x becomes "inputs/x-input". The inputs x and y are defined in the inputs scope and named x-input and y-input so that they can be looked up later.

Each input is a vector of 20 elements with a corresponding label (1 for male, 0 for female). If all the training data is combined into a matrix, the calculation can be done in one go. That is why x and y are defined as two-dimensional tensors: the shape of x is [None, 20], and the shape of y is [None, 1]. None means that the first dimension is not known in advance. There are 2,217 samples in the training set and 951 in the test set.

After defining the inputs, define the parameters of the classifier:

    with tf.name_scope("model"):
        # Weights: one per input feature (20 x 1), initialized to zero.
        W = tf.Variable(tf.zeros([num_inputs, num_classes]), name="W")
        # Bias, also initialized to zero.
        b = tf.Variable(tf.zeros([num_classes]), name="b")

The tensor W is the weight matrix (a 20 x 1 matrix), and b is the bias. W and b are declared as TensorFlow variables, which means they are updated during backpropagation.

The core formula of logistic regression classifier is declared below:

    y_pred = tf.sigmoid(tf.matmul(x, W) + b)

Here x is multiplied by W, b is added, and the result is passed through the sigmoid function to obtain the prediction y_pred, the probability that the audio data in x is a male voice.

Note: This line of code does not actually compute anything yet; it only builds the graph. It adds nodes for the matrix multiplication, the addition, and the sigmoid function (tf.sigmoid) to the graph. Once the graph is complete, we create a TensorFlow session and run it on real data.
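The difference between building a graph and running it can be shown with a toy example in plain Python. This is only an illustration of the idea of deferred evaluation; the `Node` class is invented here and has nothing to do with how TensorFlow is implemented internally.

```python
class Node:
    """A toy graph node: remembers an operation and its inputs, computes nothing yet."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = inputs

    def run(self, feed):
        """Evaluate this node by first evaluating the nodes it depends on."""
        if self in feed:                  # placeholder: value supplied by the caller
            return feed[self]
        args = [node.run(feed) for node in self.inputs]
        return self.op(*args)

x = Node(op=None)                         # placeholder, filled in at run time
w = Node(op=lambda: 2.0)                  # constant
product = Node(op=lambda a, b: a * b, inputs=(x, w))   # builds the graph, no arithmetic yet

# Only now, when values are fed in, is anything computed.
print(product.run({x: 3.0}))              # 6.0
```

Creating `product` corresponds to writing the y_pred line above, and calling `run` with a feed corresponds to sess.run() with a feed_dict.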

To train the model, we need to define a loss function. For a binary logistic regression classifier, TensorFlow has a built-in log_loss function:

    with tf.name_scope("loss-function"):
        loss = tf.losses.log_loss(labels=y, predictions=y_pred)
        loss += regularization * tf.nn.l2_loss(W)

The log_loss node takes the true labels y of the samples as input and compares them with the predictions y_pred. The result of this comparison is the loss value. At the start of training, y_pred is 0.5 for every sample, because the classifier does not know the right answers yet. The initial loss is therefore -ln(0.5), or about 0.6931. As training progresses, the loss becomes smaller and smaller.

The third line of the block above adds L2 regularization to prevent overfitting. The regularization coefficient, regularization, is defined as another placeholder:

    with tf.name_scope("hyperparameters"):
        regularization = tf.placeholder(tf.float32, name="regularization")
        learning_rate = tf.placeholder(tf.float32, name="learning-rate")

Earlier we used placeholders to define the inputs x and y; here we use them to define hyperparameters. Unlike the weights W and the bias b, which the model learns, hyperparameters cannot be learned from the data and have to be set by hand, usually based on experience. The other hyperparameter, learning-rate, defines the step size of each update.
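To make the loss and the L2 term more concrete, here is a plain-Python sketch of the same quantities. The helper names `log_loss` and `l2_penalty`, the epsilon value, and the sample predictions are assumptions for illustration; `l2_penalty` mirrors the half sum of squares that tf.nn.l2_loss computes.

```python
import math

def log_loss(labels, predictions, epsilon=1e-7):
    """Mean binary cross-entropy; epsilon keeps log() away from zero."""
    total = 0.0
    for y, p in zip(labels, predictions):
        total += -(y * math.log(p + epsilon) + (1 - y) * math.log(1 - p + epsilon))
    return total / len(labels)

def l2_penalty(weights):
    """Half the sum of squared weights."""
    return sum(w * w for w in weights) / 2.0

labels = [1, 0, 1, 1]

# Untrained classifier: every prediction is 0.5, so the loss is -ln(0.5).
untrained = [0.5, 0.5, 0.5, 0.5]
print(round(log_loss(labels, untrained), 4))   # 0.6931

# More confident, mostly correct predictions give a lower loss.
confident = [0.9, 0.1, 0.8, 0.95]
print(log_loss(labels, confident) < log_loss(labels, untrained))   # True

# Total loss with regularization; with zero weights the penalty vanishes.
regularization = 1e-5
print(log_loss(labels, untrained) + regularization * l2_penalty([0.0] * 20))
```

Because the penalty grows with the size of the weights, a larger regularization coefficient pushes the optimizer toward smaller weights at the cost of a slightly higher loss on the training data.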

The optimizer performs backpropagation: given the loss as input, it works out how to update the weights and the bias. TensorFlow offers several optimizer classes that compute the gradients of the loss function for you. Here we use AdamOptimizer:

    with tf.name_scope("train"):
        optimizer = tf.train.AdamOptimizer(learning_rate)
        train_op = optimizer.minimize(loss)

This adds the operation node train_op, which is used to minimize the loss. We will run this node later to train the classifier. To see how well the classifier is doing during training, we take periodic snapshots and measure its accuracy. Define a graph node that calculates the accuracy of the predictions:

    with tf.name_scope("score"):
        correct_prediction = tf.equal(tf.to_float(y_pred > 0.5), y)
        accuracy = tf.reduce_mean(tf.to_float(correct_prediction), name="accuracy")

As mentioned, y_pred is a probability between 0 and 1. tf.to_float(y_pred > 0.5) returns 0 if the prediction is female and 1 if it is male. tf.equal compares these predictions with the actual labels y and returns a boolean for each sample. We convert the booleans to floating-point numbers and take their mean with tf.reduce_mean(); the result is the accuracy. This accuracy node will also be used later on the test set to see how well the classifier really works.
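The same threshold, compare, and average steps can be written out in plain Python. The function name `accuracy` and the sample numbers are made up for this sketch.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    # Step 1: turn probabilities into 0/1 predictions (like tf.to_float(y_pred > 0.5)).
    predicted = [1.0 if p > threshold else 0.0 for p in probabilities]
    # Step 2: compare with the labels and convert the booleans to numbers.
    correct = [1.0 if pred == label else 0.0 for pred, label in zip(predicted, labels)]
    # Step 3: the mean of the 0/1 values is the accuracy.
    return sum(correct) / len(correct)

y_pred = [0.92, 0.15, 0.61, 0.40, 0.50]
y_true = [1.0, 0.0, 0.0, 1.0, 0.0]
print(accuracy(y_pred, y_true))   # 0.6
```

Note that a probability of exactly 0.5 is not greater than the threshold, so it counts as a female prediction.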

For new data that has no labels, we define an inference node that makes the prediction:

    with tf.name_scope("inference"):
        inference = tf.to_float(y_pred > 0.5, name="inference")


Training the classifier

This simple logistic classifier trains quickly, but a deep neural network may take hours or even days to reach good enough accuracy. Here is the first part of the training code in train.py:

    with tf.Session() as sess:
        # Save the graph definition so it can be reloaded later.
        tf.train.write_graph(sess.graph_def, checkpoint_dir, "graph.pb", False)

        # Initialize W and b; init is the variable-initializer op created earlier in train.py (not shown here).
        sess.run(init)

        step = 0
        while True:
            # here comes the training code

We create a Session object to run the graph. Calling sess.run(init) sets W and b to 0. We also save the graph to /tmp/voice/graph.pb. This graph will be used later to evaluate the classifier on the test set and to run the classifier in the iOS app.

Inside the while True: loop, we do the following:

        perm = np.arange(len(X_train))
        np.random.shuffle(perm)
        X_train = X_train[perm]
        y_train = y_train[perm]

At the start of each pass, the training set is randomly shuffled so that the classifier cannot learn anything from the order of the samples. Next, the session runs the train_op node to perform one training step:

        feed = {x: X_train, y: y_train, learning_rate: 1e-2,
                regularization: 1e-5}
        sess.run(train_op, feed_dict=feed)

The feed_dict argument passed to sess.run() supplies the values for the placeholders so that the computation can start.

The classifier in this article is a simple one, so every step trains on the whole training set: the X_train array is fed into x and the y_train array into y. If you have a lot of data, you should instead train on a small batch (100 to 1,000 samples) in each iteration.
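The article trains on the full dataset, so the following is only a sketch of what mini-batch training could look like, written in plain Python. The `minibatches` generator and the toy data are hypothetical; the key point is that features and labels are shuffled with the same permutation so that each row stays paired with its label.

```python
import random

def minibatches(features, labels, batch_size, rng):
    """Yield (features, labels) batches after shuffling both with the same permutation."""
    perm = list(range(len(features)))
    rng.shuffle(perm)
    for start in range(0, len(perm), batch_size):
        batch = perm[start:start + batch_size]
        yield [features[i] for i in batch], [labels[i] for i in batch]

# Made-up data: the label of each row equals its single feature value,
# which makes it easy to see that rows and labels stay paired.
X_train = [[i] for i in range(10)]
y_train = list(range(10))

for X_batch, y_batch in minibatches(X_train, y_train, batch_size=4, rng=random.Random(0)):
    print(len(X_batch), X_batch, y_batch)
```

In a TensorFlow training loop, each batch would be passed to sess.run() through feed_dict in place of the full X_train and y_train arrays.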

The train_op node is run many times, and each time backpropagation adjusts the weights W and the bias b a little. As the number of iterations grows, W and b gradually approach their optimal values. To follow the training process, every 1,000 iterations we run the accuracy and loss nodes and print their values:

        if step % print_every == 0:
            train_accuracy, loss_value = sess.run([accuracy, loss],
                                                  feed_dict=feed)
            print("step: %4d, loss: %.4f, training accuracy: %.4f" % \
                    (step, loss_value, train_accuracy))
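To show what repeatedly running a training step does to W, b, and the loss, here is a small training loop in plain Python. Note that it deliberately uses plain gradient descent rather than Adam, which the article uses, and that the tiny dataset, the function names, and the step counts are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)

def mean_log_loss(X, y, w, b):
    losses = []
    for x, t in zip(X, y):
        p = predict(x, w, b)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

def gradient_step(X, y, w, b, learning_rate):
    """One step of plain gradient descent on the mean log loss."""
    n = len(X)
    errors = [predict(x, w, b) - t for x, t in zip(X, y)]   # y_pred - y per sample
    grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n for j in range(len(w))]
    grad_b = sum(errors) / n
    new_w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - learning_rate * grad_b

# Tiny made-up dataset with two features per sample.
X = [[2.0, 1.0], [1.5, 0.5], [-1.0, -2.0], [-0.5, -1.5]]
y = [1, 1, 0, 0]
w, b = [0.0, 0.0], 0.0

print_every = 50
history = []
for step in range(201):
    if step % print_every == 0:
        loss_value = mean_log_loss(X, y, w, b)
        history.append(loss_value)
        print("step: %4d, loss: %.4f" % (step, loss_value))
    w, b = gradient_step(X, y, w, b, learning_rate=0.1)
```

The first printed loss is -ln(0.5), as for any classifier with zero weights, and each later value is smaller, which is the behavior to look for in the output of train.py as well.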

Note that high accuracy on the training set does not mean good performance on the test set, but this value should gradually increase with the training process, and the loss value should continue to decrease.

The script also saves checkpoint files at regular intervals; these can be used later to restore the model for further training or for evaluation. The values of W and b that the classifier has learned so far are saved to /tmp/voice/, and a *** SAVED MODEL *** message is printed each time.

When you see that the loss is no longer decreasing, wait for the next *** SAVED MODEL *** message and then press Ctrl+C to stop the training.

I chose learning_rate = 1e-2 and regularization = 1e-5, which gives about 97% accuracy and a loss of 0.157 on the training set. If you set regularization to 0 in the feed, the loss will be even lower.


Evaluating the classifier

Once the classifier has been trained, we can check how well it really works on the test data. We create a new script, test.py, that loads the graph and the test set and calculates the prediction accuracy.

Note: Accuracy on the test set will be lower than on the training set (97%), but it should not be much lower. If it is, your classifier has overfit the training data and you may need to adjust your training procedure.

First, load the test data:

import os
import numpy as np
import tensorflow as tf
from sklearn import metrics

# Directory where train.py saved the graph and checkpoints.
checkpoint_dir = "/tmp/voice/"

X_test = np.load("X_test.npy")
y_test = np.load("y_test.npy")

Since we are now only evaluating the classifier, we do not need every part of the graph; the train_op and loss nodes, for example, are not used. We already saved the computational graph to graph.pb, so we just need to load it:

with tf.Session() as sess:
    graph_file = os.path.join(checkpoint_dir, "graph.pb")
    with tf.gfile.FastGFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

TensorFlow likes to store its data in protocol buffer files (*.pb), so a little helper code is needed to load the file and import it into the session. Next, load W and b from the checkpoint file:

    W = sess.graph.get_tensor_by_name("model/W:0")
    b = sess.graph.get_tensor_by_name("model/b:0")

    checkpoint_file = os.path.join(checkpoint_dir, "model")
    saver = tf.train.Saver([W, b])
    saver.restore(sess, checkpoint_file)

Because we put the nodes in scopes and gave them names, they are easy to find with get_tensor_by_name(). If you do not give nodes explicit names, you have to search the whole graph for the default names that TensorFlow assigned, which is tedious. We also need references to a few other nodes, in particular the inputs x and y and the nodes that make predictions:

    x = sess.graph.get_tensor_by_name("inputs/x-input:0")
    y = sess.graph.get_tensor_by_name("inputs/y-input:0")
    accuracy = sess.graph.get_tensor_by_name("score/accuracy:0")
    inference = sess.graph.get_tensor_by_name("inference/inference:0")

Now we can make predictions on the test set:

    feed = {x: X_test, y: y_test}
    print("Test set accuracy:", sess.run(accuracy, feed_dict=feed))

We use scikit-learn to print some additional reports:

    predictions = sess.run(inference, feed_dict={x: X_test})
    print("Classification report:")
    print(metrics.classification_report(y_test.ravel(), predictions))
    print("Confusion matrix:")
    print(metrics.confusion_matrix(y_test.ravel(), predictions))




As shown in the figure above, the accuracy on the test set is 96%, slightly lower than on the training set. This means the trained classifier generalizes well to data it has not seen before. The classification report and the confusion matrix show that some samples are still misclassified. According to the confusion matrix, 446 of the female samples were predicted correctly and 28 were wrong, while 466 of the male samples were correct and 11 were wrong. This suggests that the classifier makes more mistakes on female voices.
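The reported numbers can be cross-checked with a few lines of arithmetic. This sketch uses only the four counts quoted above and assumes they are grouped by the true class of each sample; the variable names are made up.

```python
# Counts as reported in the text, grouped by the true class.
female_correct, female_wrong = 446, 28
male_correct, male_wrong = 466, 11

total = female_correct + female_wrong + male_correct + male_wrong
accuracy = (female_correct + male_correct) / total
female_recall = female_correct / (female_correct + female_wrong)
male_recall = male_correct / (male_correct + male_wrong)

print(total)                      # 951
print(round(accuracy, 3))         # 0.959
print(round(female_recall, 3))    # 0.941
print(round(male_recall, 3))      # 0.977
```

The total matches the 951 test samples, the overall accuracy rounds to the reported 96%, and the lower share of correct predictions for female samples is what the last sentence above refers to.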

In the next section, we’ll show you how to apply this classifier to a real app.


The above is the translation

This article is recommended by Beijing Post @ Love coco – Love life teacher, translated by Ali Yunqi Community organization.

The article was originally titled “Getting Started with TensorFlow on iOS” and was published by Matthijs Hollemans.

Translator: Li Feng; Proofread: Dong Zhaonan

The article is a brief translation. For more details, please refer to the original text. The Chinese translation document is attached.