Source: Sentiance. Compiled by Heart of the Machine.

Sentiance, a data science company, recently published an article about a machine learning platform that learns from location data in a self-supervised manner and extracts insights from it. Heart of the Machine has compiled that article below.

Introduction

We at Sentiance have developed a platform that takes data from smartphone sensors such as the accelerometer, gyroscope, and location subsystem and extracts behavioral insights from it. Our AI platform learns user patterns and can predict and explain why and when things happen, which allows our customers to guide their users in the right way at the right time.

The venue mapping algorithm is an important component of our platform. Its goal is to figure out which venue you are actually visiting, based on the often inaccurate location measurements coming from the smartphone's location subsystem.

Figure 1: Left: venue mapping means estimating which nearby venue the user is actually visiting. Right: human intuition lets us quickly rule out unlikely venues; a user heading to the beach is unlikely to be visiting the lifeguard station.


Although venue mapping in general is a hard problem, and will be the focus of a future blog post, the kind of human intuition that is based on the surrounding geography is easy to illustrate. As shown in Figure 1, suppose a user is heading to Santa Monica Beach. One glance at the surrounding geography tells us that the probability that the user is actually heading to the lifeguard station is probably quite small.

In fact, simply by looking at a map of the area, humans can often quickly rule out unlikely venues and construct a prior belief about what is actually going on. Is the venue located in an industrial area, a park, near the beach, downtown, or beside a highway?

To give our venue mapping algorithm the same intuitive awareness, we developed a deep learning-based solution that trains a model to encode geospatial relationships and describe the semantic similarity between locations. Figure 2 is a schematic illustration of this concept.

Figure 2: The area around a given location is rasterized and then fed into a deep neural network. This network acts as an encoder, outputting an embedding that captures the high-level semantics of the input location.


The encoder translates locations into distributed representations, similar to what Word2Vec [1] does for natural language. These embeddings live in a metric space and therefore obey algebraic rules. For example, we can use word embeddings to infer word similarity and analogies, and can even perform arithmetic operations like "king - man + woman = queen" directly in the embedding space.

In the next few paragraphs, we will discuss how we devised a solution that learns to map location coordinates into a metric space, which allows us to perform operations similar to those on word embeddings, as shown in Figure 3.

Figure 3: Our proposed solution directly optimizes the metric space, so that the embedding space can be explored using basic arithmetic operations.

Image tile generation

Rasterize GIS data

Given a location coordinate and a radius, we can query our GIS database for a wealth of geographic information. Our GIS database is a local copy of OpenStreetMap stored in a PostGIS database. PostGIS is a handy PostgreSQL extension that adds support for spatial operators, types, and indexes.

For example, a few queries are enough to check whether a location is near a river, how far it is from the nearest train station, and whether there are roads close to the location. Moreover, the road itself can be extracted as a polyline, while the outline of the railway station building can be extracted as a polygon object.
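As an illustration, the sketch below shows the kind of spatial query this enables, here finding the nearest train station and its distance for a given coordinate. It is only a rough sketch: the connection settings and the planet_osm_point table and its columns assume the default osm2pgsql schema, which may differ from our actual setup.

    import psycopg2

    # Hypothetical connection; database name and user are assumptions.
    conn = psycopg2.connect("dbname=osm user=gis")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT name,
                   ST_Distance(ST_Transform(way, 4326)::geography,
                               ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS dist_m
            FROM planet_osm_point
            WHERE railway = 'station'
            ORDER BY dist_m
            LIMIT 1;
        """, (-118.4912, 34.0195))        # lon, lat near Santa Monica
        print(cur.fetchone())             # nearest train station and its distance in meters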

However, it is not obvious how to feed such a large amount of unstructured data into a neural network for further processing. Considering that the goal of training our neural network is to understand shapes and spatial relations such as distance, containment, overlap, and separation, we decided to rasterize the surroundings of a location into a fixed-size image before feeding it into the encoder.

Fortunately, we have the tools to do just that. We combined Mapnik and its Python bindings with a customized version of the OpenStreetMap-Carto stylesheet to build a fast rasterizer that we can use to generate image tiles; see Figure 4.

  • Mapnik:github.com/mapnik/mapn…
  • OpenStreetmap-Carto:github.com/gravitystor…

Figure 4: Mapnik is used to rasterize GIS data from PostGIS into images


We parameterized our rasterization service so that data augmentation can be performed easily by rotating and shifting the map before generating image tiles. The image tiles in Figure 5 show the same location, but with different rotation angles and horizontal and vertical offsets.

Figure 5: Our image tile generator makes data augmentation easy by rotating and shifting the map prior to generating the image tile

From images to tensors

Although these rasterized image tiles make it easy for our encoder to learn spatial structures and relationships, a lot of information is still lost during rasterization. Rasterization flattens the polygons and polylines of roads, buildings, park outlines, rivers, and so on into a single image. Since our GIS database contains each of these structures separately, there is no real need to force the neural network encoder to learn to segment them again.

Therefore, instead of rasterizing the data into a three-channel RGB image, we modified the rasterizer to produce a 12-channel tensor, in which each channel contains a different type of rasterized information. Figure 6 shows such a 12-channel tensor for the same coordinates as Figure 5.

Figure 6: A 12-channel tensor is used to represent the region. Each channel contains a specific kind of information, such as the road network, land cover (green areas, water, etc.), amenities, and so on.


For readability, the rest of this article mostly shows the RGB rasterized version rather than the 12-channel tensor.
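To give an idea of how such a tensor is assembled, the sketch below stacks one single-channel raster per feature class into a 128×128×12 tensor. The channel layout and the rasterize_channel helper are purely hypothetical placeholders; the real channel definitions live inside our rasterization service.

    import numpy as np

    # Hypothetical channel layout; the real mapping of OSM feature types to channels
    # is not specified in the article and is an assumption here.
    CHANNELS = ["roads", "railways", "buildings", "green_areas", "water", "amenities",
                "land_use", "retail", "industrial", "parking", "paths", "boundaries"]

    def rasterize_channel(lat, lon, radius_m, size, layer):
        # Placeholder for a per-layer Mapnik render; returns an empty raster here.
        return np.zeros((size, size), dtype=np.float32)

    def rasterize_location(lat, lon, radius_m=100, size=128):
        # Render each feature class into its own channel and stack them into one tensor.
        layers = [rasterize_channel(lat, lon, radius_m, size, layer=name) for name in CHANNELS]
        return np.stack(layers, axis=-1)        # shape: (128, 128, 12)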

Representation learning

Spatial similarity

Our goal is to learn a metric space in which semantically similar image tiles correspond to embedding vectors that are close to each other. The question then becomes how to define "semantically similar".

A simple and direct way to obtain such a similarity space would be to represent each image tile by a histogram per channel, cluster these histograms with k-means, and model the space with a bag-of-words model. However, it is unclear how much weight each channel should get. For example, are two image tiles semantically similar if their roads are similar but their buildings are not?

Moreover, even if two image tiles have similar histograms, this does not give us any information about the spatial structure around the location. If half of an image tile is covered by ocean, is that tile semantically similar to one containing a large number of small ponds, lakes, or fountains? Figure 7 shows two image tiles that produce almost identical histograms:

Figure 7: Clustering based on histograms is insufficient to capture semantic similarity and spatial relations. The two images have almost identical histograms, but their semantic meanings are quite different.


However, the semantics of these image tiles are not the same. The first tile shows an intersection area, while the second shows small roads that probably lead to people's houses. In fact, in our embedding space, the Euclidean distance between the embeddings of these two tiles turns out to be quite large, even though the chi-square distance between their histograms is close to zero.
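For reference, the chi-square distance mentioned above can be computed from per-channel histograms as in the sketch below; the bin count and value range are arbitrary assumptions.

    import numpy as np

    def channel_histograms(tile, bins=16):
        # One normalized histogram per channel of a 128x128x12 tile.
        return np.stack([np.histogram(tile[..., c], bins=bins, range=(0, 1), density=True)[0]
                         for c in range(tile.shape[-1])])

    def chi_square(h1, h2, eps=1e-9):
        # Chi-square distance between two stacks of histograms.
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))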

Instead of using histograms, each channel could be summarized by a set of features that capture spatial relationships, such as histograms of oriented gradients (HOG) or the more traditional SIFT or SURF descriptors. But rather than trying to manually specify which features define semantic similarity, we decided to use the power of deep learning to automatically detect meaningful features.

To do this, we feed the 12-channel tensor into a convolutional neural network that serves as our encoder. The network is trained in a self-supervised fashion using a triplet loss, which means no manually annotated data is required during training.

Self-supervised learning: triplet networks

The concept of the triplet loss is inspired by the Siamese network architecture; the triplet network was proposed by Hoffer and Ailon [2] as a way to perform deep metric learning for unsupervised feature learning.

A triplet network is a neural network architecture that is trained on triplets consisting of:

  • An anchor instance x
  • A positive instance x+ that is semantically similar to x
  • A negative instance x- that is semantically dissimilar to x

The network is then trained to learn an embedding function f(·) that directly optimizes the metric space; see Figure 8.

Figure 8: The triplet network is trained with a triplet loss so that, in the learned metric space, similar instances end up close to each other and dissimilar instances end up far apart.


Metric learning with triplet networks was popularized by Google's FaceNet [3], which used a triplet loss to learn an embedding space for face images in which embeddings of the same face lie close together, while embeddings of different faces lie far apart.

For face recognition, the positive image shows the same person as the anchor image, while the negative image is randomly selected from the mini-batch. In our case, however, there are no class labels that make it easy to choose positive and negative instances.

To define semantic similarity, we can rely on Tobler's first law of geography: "Everything is related to everything else, but near things are more related than distant things."

Now let I(·) denote the mapping from location coordinates to rasterized image tiles. For a location X, let T(·) be a random rotation and translation applied to the map before rasterizing the tile, and let Y be a random location with X ≠ Y. We then obtain our triplets as:

  • x = I(X)
  • x+ = I(T(X))
  • x- = I(Y)

In other words, we assume that two geographically adjacent, partially overlapping image tiles are semantically more related than two tiles from completely different locations. Figure 9 shows two example triplets that share the same anchor image.

Figure 9: Triplets consisting of an anchor image, a positive image, and a negative image


To prevent the neural network from merely learning simple transformations, we also randomly enable or disable some of the 12 channels of each positive instance during training. This forces the network to treat the positive image tile as similar to the anchor image even when a random subset of the information (no buildings, no roads, etc.) is missing.
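A minimal sketch of this channel-level augmentation might look as follows, assuming the positive tile is a NumPy array of shape 128×128×12; the maximum number of dropped channels is an arbitrary choice here.

    import numpy as np

    def drop_random_channels(tile, max_drop=4, rng=np.random.default_rng()):
        # Zero out a random subset of the 12 channels of a positive example tile.
        tile = tile.copy()
        n_drop = rng.integers(0, max_drop + 1)
        channels = rng.choice(tile.shape[-1], size=n_drop, replace=False)
        tile[..., channels] = 0.0
        return tile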

The SoftPN triplet loss

Figure 10 shows the general structure of our triplet network.

Figure 10: The triplet loss directly optimizes the ratio between the anchor-positive embedding distance and the anchor-negative embedding distance


The loss function is defined so that training the network corresponds to minimizing the mean squared error (MSE) between the vector (d+, d-) and the target vector (0, 1).

Why define the loss this way? We want the anchor-positive distance δ(a, p) to be as close to zero as possible, and the anchor-negative distance δ(a, n) to be as large as possible. To optimize this ratio, we apply a softmax to the two distances to obtain similarities bounded between 0 and 1:

d+ = exp(δ(a, p)) / (exp(δ(a, p)) + exp(δ(a, n)))
d- = exp(δ(a, n)) / (exp(δ(a, p)) + exp(δ(a, n)))

This definition of the triplet loss is often referred to as the softmax ratio and was first proposed by Hoffer and Ailon [2].

The main problem with this definition is that the network can quickly learn an embedding space in which d- is close to 1, because most randomly chosen negative images are already very different from the anchor image. As a result, most (a, n) pairs contribute very little to the gradient during optimization, and the network quickly stops learning.

There are different ways around this problem. One of them is hard negative mining [3], i.e., carefully selecting the (a, n) pairs to ensure that the network keeps learning. In our case, however, it is not clear how to select hard negatives efficiently without introducing bias into the learning process. A simpler solution is the SoftPN triplet loss proposed by Balntas et al. [4].

The SoftPN loss replaces δ(a, n) in the softmax calculation above with min(δ(a, n), δ(p, n)). The effect is that during optimization the network tries to learn a metric space in which both the anchor embedding and the positive embedding are as far away from the negative embedding as possible, whereas the original softmax ratio loss only considers the distance between the anchor and negative embeddings. Figure 11 illustrates the difference.

Figure 11: The SoftPN loss optimizes a harder problem by maximizing the minimum distance from the negative embedding to both the anchor and the positive embeddings
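For concreteness, here is a hedged TensorFlow sketch of the SoftPN loss as described above, built from the softmax-ratio definition with δ(a, n) replaced by min(δ(a, n), δ(p, n)). It is an illustration of the formula, not our production implementation.

    import tensorflow as tf

    def softpn_loss(emb_a, emb_p, emb_n):
        # Euclidean distances between the three embeddings.
        d_ap = tf.norm(emb_a - emb_p, axis=-1)
        d_an = tf.norm(emb_a - emb_n, axis=-1)
        d_pn = tf.norm(emb_p - emb_n, axis=-1)
        # SoftPN: use the smallest negative distance, i.e. min(d(a,n), d(p,n)).
        d_neg = tf.minimum(d_an, d_pn)
        # Softmax over the two distances gives bounded similarities d+ and d-.
        d_plus = tf.exp(d_ap) / (tf.exp(d_ap) + tf.exp(d_neg))
        d_minus = tf.exp(d_neg) / (tf.exp(d_ap) + tf.exp(d_neg))
        # MSE of (d+, d-) against the target vector (0, 1).
        return tf.reduce_mean(tf.square(d_plus) + tf.square(d_minus - 1.0))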

Neural network architecture

We use a fairly traditional convolutional neural network architecture as the encoder: five convolution layers with 3×3 filters, followed by two 1×1 convolution layers and a densely connected layer. The purpose of the 1×1 convolutions is to reduce the dimensionality toward the top of the network through cross-channel parametric pooling [5].

The embedding layer itself is another dense layer with a linear activation function, so that its output is not confined to the positive domain by the nonlinearity of the preceding layer. Figure 12 shows the complete network architecture.

Figure 12: The encoder consists of a convolutional neural network followed by fully connected layers. The final embedding layer is a dense layer with a linear activation function


We made extensive use of dropout and batch normalization, and used the Leaky ReLU activation function to avoid the dying-ReLU problem we observed during initial test runs.

In addition, we apply spatial dropout directly to the input. This drops randomly selected input channels altogether, forcing the network to learn to distinguish between images by focusing on different channels.

The full network was implemented in Keras, contains only 305,040 parameters, and was trained for two weeks on a p3.2xlarge AWS instance using the Adam optimizer.
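A Keras sketch of an encoder along these lines is shown below. The layer widths and dropout rates are assumptions for illustration and will not reproduce the exact 305,040-parameter count of our model.

    from tensorflow.keras import layers, models

    def build_encoder(input_shape=(128, 128, 12), embedding_dim=16):
        inp = layers.Input(shape=input_shape)
        x = layers.SpatialDropout2D(0.1)(inp)        # randomly drop whole input channels
        for filters in (32, 32, 64, 64, 128):        # five 3x3 convolution layers
            x = layers.Conv2D(filters, 3, padding="same")(x)
            x = layers.BatchNormalization()(x)
            x = layers.LeakyReLU(0.2)(x)
            x = layers.MaxPooling2D(2)(x)
        # Two 1x1 convolutions act as cross-channel parametric pooling [5].
        x = layers.Conv2D(64, 1)(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Conv2D(32, 1)(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Flatten()(x)
        x = layers.Dense(128)(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Dropout(0.5)(x)
        # Linear embedding layer so outputs are not confined to positive values.
        emb = layers.Dense(embedding_dim, activation="linear")(x)
        return models.Model(inp, emb)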

Training data

To generate our training data, we took one million locations that users of our platform had visited and added about half a million locations recorded during transport.

For each of these 1.5 million locations, we rasterized a 128×128×12 image tile representing the area within a 100-meter radius around the location. These tensors serve as the anchor images.

For each location, we also rasterized 20 randomly translated and rotated image tiles, which serve as positive examples. The offsets are sampled uniformly between 0 and 80 meters, both horizontally and vertically. This yields 20 (anchor, positive) pairs per location, for a total of 30 million images.

Triplets are generated on the fly while mini-batches are assembled during training. Each mini-batch contains 20 locations. For each location, we randomly select 5 (anchor, positive) pairs to obtain a meaningful estimate of the anchor-positive distance. Negative images are then randomly selected within the mini-batch, giving a mini-batch size of 100.

Generating mini-batches of triplets on the fly during training gives us a virtually unlimited dataset of distinct triplets, which allows the network to keep learning over many epochs.
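The sketch below illustrates this kind of on-the-fly mini-batch generation, assuming the anchors and their pre-rasterized positives are available as NumPy arrays; in practice, tiles would be rasterized or loaded lazily rather than held in memory.

    import numpy as np

    def triplet_batches(anchors, positives, locations_per_batch=20, pairs_per_location=5,
                        rng=np.random.default_rng(0)):
        # anchors: (n_locations, 128, 128, 12); positives: (n_locations, 20, 128, 128, 12)
        n_locations = anchors.shape[0]
        while True:
            locs = rng.choice(n_locations, size=locations_per_batch, replace=False)
            a, p = [], []
            for loc in locs:
                for j in rng.choice(positives.shape[1], size=pairs_per_location, replace=False):
                    a.append(anchors[loc])
                    p.append(positives[loc, j])
            a, p = np.stack(a), np.stack(p)
            # Negatives: reuse anchors from other locations within the same mini-batch
            # (rolling by pairs_per_location guarantees a different location).
            n = np.roll(a, shift=pairs_per_location, axis=0)
            yield a, p, n        # three arrays of shape (100, 128, 128, 12)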

Visualizing filters and activations

Because the embedding space is learned in a self-supervised way without labeled data, it is difficult to monitor whether the network is actually learning anything during training.

Visualizing the filters the network has learned is an imperfect but still useful approach. In effect, we want to visualize the input images that maximally activate different layers of the network. To do this, we start from a randomly generated image and treat each pixel as a parameter to be optimized. We then update the image pixels with gradient ascent so as to maximize the output of the selected layer.

We compute the gradient of the input image with respect to the mean output activation of the chosen convolution layer, and run a few iterations of gradient ascent to obtain an image that highlights the structures that layer responds to most strongly.

Since our input is a 12-channel tensor rather than an RGB image, we select the three channels with the highest average pixel magnitude and arrange them into an RGB image. Histogram equalization is applied to each channel to further enhance visual detail.
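The gradient-ascent procedure described above can be sketched in TensorFlow as follows; the layer name, step count, and learning rate are illustrative assumptions.

    import numpy as np
    import tensorflow as tf

    def visualize_layer(model, layer_name, steps=30, lr=10.0):
        # Sub-model that outputs the activations of the chosen layer.
        layer = model.get_layer(layer_name)
        feature_extractor = tf.keras.Model(model.input, layer.output)
        # Start from a random 12-channel image and treat its pixels as parameters.
        img = tf.Variable(np.random.uniform(0, 1, (1, 128, 128, 12)).astype("float32"))
        for _ in range(steps):
            with tf.GradientTape() as tape:
                loss = tf.reduce_mean(feature_extractor(img))   # mean activation of the layer
            grad = tape.gradient(loss, img)
            grad = grad / (tf.norm(grad) + 1e-8)                # normalize the gradient
            img.assign_add(lr * grad)                           # gradient ascent on the pixels
        return img.numpy()[0]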

Figure 13 shows the result for each of the 32 filters in one of the lower layers of the network. This layer clearly seems to focus on low-level details such as roads and small structural building blocks.

Figure 13: The lower layers of the network learn to detect low-level details such as roads and small building-block structures


Figure 14 visualizes 64 filters from a higher layer. These filters are clearly activated by smoother and more complex structures, suggesting that the network is indeed learning a hierarchical decomposition of its inputs into features.

Figure 14: Higher layers of the network learn more complex structures by combining the low-level features from the layers below


While the usefulness of these visualizations should not be overestimated, they proved interesting, especially during our many research iterations. For example, early versions of these visualizations quickly pointed us in the right direction when a bunch of dead ReLUs appeared in our network, which we later fixed by switching to the Leaky ReLU activation function.

Exploring the metric space

Visualizing the embeddings

Visualizing the filters learned by the network is certainly interesting for debugging, but not very useful for evaluating the quality of the learned embedding space.

To get an idea of what the embedding space looks like, Figure 15 shows it after reducing the dimensionality to three using PCA. For ease of interpretation, each location's embedding is represented in the plot by its rasterized image tile.

Figure 15: 3D view of the embedding space obtained with PCA


This clearly shows that a lot of relevant information is captured even by the first three principal components alone. There are clear distinctions between green areas such as parks, different road types such as highways and main roads, and areas such as the city centre in the lower right corner of the image.

To show these local structures more clearly, Figure 16 shows a three-dimensional t-SNE animation of the embedding space.

Figure 16: 3D view of the embedding space obtained with t-SNE


Although venue mapping is an obvious use case for these embeddings, they can also be used by our transport mode classifier. Figure 17 shows a scatter plot of the embeddings of locations extracted from our transport mode classification training set (goo.gl/VhwuwS). In this case, we used Linear Discriminant Analysis (LDA) to project the 16-dimensional embedding space down to two dimensions.

Figure 17: 2D LDA projection of the embeddings of waypoints collected during transport


The plot shows that different transport modes tend to occur in different kinds of areas. For example, our embeddings capture information about train tracks and tram stops.

To show how differently the various geographic areas are encoded, we used PCA to reduce the 16-dimensional embeddings to 3 dimensions, which, after rescaling, were used directly as RGB color values to plot our test dataset on a map. Figure 18 shows the result for London, UK. You can clearly see that city centres, roads, water, tourist areas, and residential areas are all encoded differently.

Figure 18: Embeddings of randomly sampled locations in London, UK. The colors are obtained by reducing the 16D embedding vectors to 3D RGB triplets using PCA.
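The PCA-to-RGB mapping used for these maps can be sketched with scikit-learn as follows; the exact scaling we used is an assumption here.

    import numpy as np
    from sklearn.decomposition import PCA

    def embeddings_to_rgb(embeddings):
        # Reduce 16-D embeddings to 3 components and rescale each to [0, 255] for use as RGB.
        coords = PCA(n_components=3).fit_transform(embeddings)
        mins, maxs = coords.min(axis=0), coords.max(axis=0)
        rgb = (coords - mins) / (maxs - mins + 1e-9)
        return (rgb * 255).astype(np.uint8)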


We did the same for Birmingham, UK. You can see that Birmingham's suburbs are larger than London's, and that the area around London contains many more buildings; see Figure 19.

Figure 19: Embeddings of randomly sampled locations in Birmingham, UK. The colors are obtained by reducing the 16D embedding vectors to 3D RGB triplets using PCA.

A random walk through the embedding space

To further check the smoothness of the embedding space, we can perform a random walk starting from a random seed point. At each hop, we randomly select one of the k nearest neighbors of the current embedding and visualize the corresponding image tile.

Figure 20 shows several results of such random walks. Note that in most cases, the nearest neighbors in the embedding space are geographically hundreds or even thousands of kilometers apart, yet they are highly semantically similar.

Figure 20: Results of six random walks through the embedding space, each starting from a different seed point.
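Such a random walk can be sketched with a simple k-nearest-neighbor index, as below; the number of hops and neighbors are illustrative choices.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def random_walk(embeddings, start_idx, hops=10, k=10, seed=0):
        # Index all embeddings, then repeatedly jump to a random one of the k nearest neighbors.
        rng = np.random.default_rng(seed)
        knn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
        path, current = [start_idx], start_idx
        for _ in range(hops):
            _, idx = knn.kneighbors(embeddings[current:current + 1])
            current = int(rng.choice(idx[0][1:]))   # skip the point itself
            path.append(current)
        return path                                  # indices of the visited image tiles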

Computing with locations

Although the visualizations above show that the learned embedding space is smooth and captures semantic similarity, they do not prove that we have actually learned a Euclidean metric space. In a Euclidean metric space, we should be able to interpolate between embeddings and perform basic arithmetic on them while producing meaningful results.

Figure 21 shows the result of interpolating between two embeddings, from left to right. At each interpolation step, the resulting embedding is mapped onto its nearest neighbor in our test data; the corresponding image tiles are shown.

Figure 21: Interpolation from one embedding (left) to another (right), showing the nearest-neighbor image from our test data at each intermediate step.


Finally, Figure 22 shows the results of adding and subtracting embeddings. Again, the images shown are the corresponding nearest neighbors from the test data.

Figure 22: Arithmetic with embeddings, with the results mapped back to their nearest-neighbor images in our test data


These results show that distances in the metric space represented by our embeddings are actually meaningful, and that basic arithmetic rules hold.
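The interpolation and arithmetic shown in Figures 21 and 22 boil down to simple vector operations followed by a nearest-neighbor lookup, as in the sketch below; the embedding matrix here is a random placeholder standing in for our test-set embeddings.

    import numpy as np

    # Placeholder data: in practice these come from the trained encoder (assumption).
    embeddings = np.random.randn(10000, 16)      # embeddings of the test-set tiles
    a, b, c = embeddings[0], embeddings[1], embeddings[2]

    def nearest_tile(query, embeddings):
        # Map an arbitrary point in the embedding space back to the closest known tile index.
        return int(np.argmin(np.linalg.norm(embeddings - query, axis=1)))

    # Linear interpolation between two embeddings (cf. Figure 21).
    interp_tiles = [nearest_tile(a + t * (b - a), embeddings) for t in np.linspace(0, 1, 8)]

    # Embedding arithmetic (cf. Figure 22): "a - b + c", mapped back to its nearest tile.
    arithmetic_tile = nearest_tile(a - b + c, embeddings)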

Because this metric space is trained in a self-supervised manner, huge amounts of unlabeled data can be used to force the network to learn meaningful relationships. Using these embeddings as feature vectors in our downstream classifiers therefore amounts to a form of transfer learning, which allows us to train powerful classifiers with a very limited amount of labeled data.

Conclusion

In this article, we showed how triplet networks can be used to learn a metric space that captures the semantic similarity between different geographic locations.

We trained a convolutional neural network to learn and extract the features that define this semantic similarity, and obtained an embedding space through metric learning.

The resulting embedding space can be used directly for venue mapping or transport classification tasks, and through transfer learning it greatly improves the accuracy and generalization ability of our classifiers.

In addition, these embeddings give our classifiers a certain amount of intuition, so that even incorrect classification results remain intuitively plausible. For example, a venue mapping model can quickly learn to associate daytime and nighttime activity with specific kinds of areas such as industrial zones, city centres, parks, and train stations.

If you'd like to learn more about our platform and try it out for yourself, contact us or download our Journeys demo app: www.sentiance.com/demo

References

  • [1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
  • [2] Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.
  • [3] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
  • [4] Balntas, Vassileios, et al. "PN-Net: Conjoined triple deep network for learning local image descriptors." arXiv preprint arXiv:1601.05030 (2016).
  • [5] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).


Original article: www.sentiance.com/2018/05/03/…