By Lak Lakshmanan, Technical Director of Google Cloud Platform
The original link: mp.weixin.qq.com/s/MhadSlQER…
If you have a sparse categorical variable (one that can take many possible values), it can be very useful to embed that variable in a lower-dimensional space. The best-known form of embedding is the word embedding (word2vec or GloVe, for example), where every word in a language is represented by a vector of about 50 elements, the idea being that similar words end up close together in that 50-dimensional space. You can do the same with categorical variables: train an embedding on one problem, and then reuse that embedding, instead of one-hot encoding the variable, on related problems. Because the lower-dimensional embedding space is continuous, the embedding can also serve as input to a clustering algorithm: you can find natural groupings of the categorical variable.
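Clustering those learned vectors is straightforward. Here is a minimal sketch, assuming you have already extracted the embedding matrix from a trained model into a NumPy array (station_embeddings below is a hypothetical placeholder):

    # A minimal sketch: cluster learned embeddings to find natural groupings.
    # 'station_embeddings' stands in for a real [num_stations, embed_dim] array
    # extracted from a trained model, as described later in this article.
    import numpy as np
    from sklearn.cluster import KMeans

    station_embeddings = np.random.rand(650, 2)  # placeholder for real embeddings
    kmeans = KMeans(n_clusters=5, random_state=0).fit(station_embeddings)
    print(kmeans.labels_)  # cluster assignment for each of the 650 stations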
To serve embeddings from a trained estimator, you can emit the lower-dimensional representation of the categorical variable along with the usual prediction output. The embedding weights are stored in the SavedModel, and one option is to share that file itself. Alternatively, you can serve the embeddings on demand; clients of your machine learning team are then only loosely coupled to your choice of model architecture, which can be easier to maintain. Each time a newer, better version replaces your model, clients get the updated embeddings.
In this article, I will show you how to:
- create embeddings in a regression or classification model
- represent categorical variables in different ways using feature columns
- do the math needed to serve both the embeddings and the original model output
You can find the full code for this article, with more context, on GitHub; I'll show only the key snippets here. Note: the GitHub link is github.com/GoogleCloud…
Let's build a simple demand forecasting model that predicts the number of bicycle rentals at a rental station, given the day of the week and whether or not it rains. The data come from publicly available datasets of New York City bicycle rentals and NOAA weather observations:
The inputs to the model are:
- Day of the week (an integer from 1 to 7)
- Rental station ID (since we don't know the full vocabulary, we use hash buckets here; the dataset has about 650 unique values, and we'll use a large number of hash buckets but then embed them into a lower dimension)
- Whether it is raining (true/false)
The label we want to predict is num_trips.
We can create the dataset in BigQuery (cloud.google.com/bigquery/) by joining the bicycle rental and weather datasets and doing the necessary aggregation:
    #standardsql
    WITH bicycle_rentals AS (
      SELECT
        COUNT(starttime) as num_trips,
        EXTRACT(DATE from starttime) as trip_date,
        MAX(EXTRACT(DAYOFWEEK from starttime)) as day_of_week,
        start_station_id
      FROM `bigquery-public-data.new_york.citibike_trips`
      GROUP BY trip_date, start_station_id
    ),

    rainy_days AS (
      SELECT
        date,
        (MAX(prcp) > 5) AS rainy
      FROM (
        SELECT
          wx.date AS date,
          IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp
        FROM
          `bigquery-public-data.ghcn_d.ghcnd_2016` AS wx
        WHERE
          wx.id = 'USW00094728'
      )
      GROUP BY
        date
    )

    SELECT
      num_trips,
      day_of_week,
      start_station_id,
      rainy
    FROM bicycle_rentals AS bk
    JOIN rainy_days AS wx
    ON wx.date = bk.trip_date
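To pull these query results into Python for training, one option (a sketch only; the GitHub repository may structure its pipeline differently) is the BigQuery client library's pandas integration:

    # A sketch of materializing the query results for training.
    # 'QUERY' is assumed to hold the SQL shown above as a string.
    from google.cloud import bigquery

    client = bigquery.Client()
    df = client.query(QUERY).to_dataframe()  # runs the query, returns a DataFrame
    df.to_csv('bike_rentals.csv', index=False)  # save for the input pipeline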
To write this model with an estimator, we need a custom estimator in TensorFlow. Even though this is just a linear model, we can't use LinearRegressor, because LinearRegressor hides all of the underlying feature-column computation. We need access to the model's intermediate output (the output of the embedding feature column), so we write the linear model out explicitly.
To use a custom estimator, you need to write a model function and pass it to the estimator constructor:
    def train_and_evaluate(output_dir, nsteps):
      estimator = tf.estimator.Estimator(
        model_fn = model_fn,
        model_dir = output_dir)
The model function in the custom estimator consists of the following five parts:
1. Define the model:
    def model_fn(features, labels, mode):
      # linear model
      station_col = tf.feature_column.categorical_column_with_hash_bucket('start_station_id', 5000, tf.int32)
      station_embed = tf.feature_column.embedding_column(station_col, 2) # embedding dimensionality
      embed_layer = tf.feature_column.input_layer(features, station_embed)

      cat_cols = [
        tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets = 8),
        tf.feature_column.categorical_column_with_vocabulary_list('rainy', ['false', 'true'])
      ]
      cat_cols = [tf.feature_column.indicator_column(col) for col in cat_cols]
      other_inputs = tf.feature_column.input_layer(features, cat_cols)

      all_inputs = tf.concat([embed_layer, other_inputs], axis=1)
      predictions = tf.layers.dense(all_inputs, 1) # linear model
We take the rental station column and assign it to a bucket based on its hash; this technique avoids having to build a full vocabulary. There are only about 650 bicycle rental stations in New York, so with 5,000 hash buckets we greatly reduce the chance of collisions. Then, by embedding the station IDs into a small number of dimensions, we also learn which stations are similar to each other, at least with respect to bicycle rentals on rainy days. Here, each station ID ends up represented by a two-dimensional vector; the number 2 controls how much information the lower-dimensional representation of the categorical variable can carry. I chose 2 arbitrarily, but in practice we would tune this hyperparameter for best performance.
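As a quick back-of-the-envelope check on that collision claim (my own arithmetic, not from the original article): hashing n items into B buckets occupies about B * (1 - (1 - 1/B)^n) distinct buckets in expectation, so around 610 distinct buckets end up used for the 650 stations, and collisions affect only a small fraction of them:

    # Back-of-the-envelope estimate of hash collisions (my own arithmetic).
    n, B = 650, 5000  # stations, hash buckets
    occupied = B * (1 - (1 - 1.0 / B) ** n)  # expected number of distinct buckets used
    print(round(occupied))  # ~610, i.e. roughly 40 stations share a bucket with another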
The other two categorical columns are created from their actual vocabularies and then one-hot encoded (the indicator column one-hot encodes the data).
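For intuition, here is a tiny runnable check (using the TF 1.x feature-column API that this article uses throughout) of what the indicator column produces for day_of_week = 4:

    # One-hot encoding via indicator_column: a tiny runnable check (TF 1.x).
    import tensorflow as tf

    features = {'day_of_week': tf.constant([4])}
    col = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets=8))
    dense = tf.feature_column.input_layer(features, [col])
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        print(sess.run(dense))  # [[0. 0. 0. 0. 1. 0. 0. 0.]]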
The two sets of inputs are concatenated to create a wide input layer, which is then passed through a dense layer with a single output node. This is how you write a linear model at a lower level of abstraction; it is equivalent to the following LinearRegressor:
    station_embed = tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket('start_station_id', 5000, tf.int32), 2)
    feature_cols = [
      tf.feature_column.categorical_column_with_identity('day_of_week', num_buckets = 8),
      station_embed,
      tf.feature_column.categorical_column_with_vocabulary_list('rainy', ['false', 'true'])
    ]
    estimator = tf.estimator.LinearRegressor(
      model_dir = output_dir,
      feature_columns = feature_cols)
Note that LinearRegressor hides the input_layer, the indicator_column, and so on. I wanted access to the station embedding, though, so I wrote it out explicitly.
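If all you want are the learned weights, you can also read them directly off the trained estimator rather than serving them. The exact variable name below is generated by TensorFlow, so treat it as an assumption and use get_variable_names() to find the real one:

    # Reading the learned embedding weights off a trained estimator.
    print(estimator.get_variable_names())  # find the embedding variable's name
    weights = estimator.get_variable_value(
        'input_layer/start_station_id_embedding/embedding_weights')  # assumed name
    print(weights.shape)  # (5000, 2): one 2-d vector per hash bucket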
2. For the regression, we can use a regression head with the FTRL optimizer to minimize the mean squared error (MSE):
    my_head = tf.contrib.estimator.regression_head()
    spec = my_head.create_estimator_spec(
      features = features, mode = mode,
      labels = labels, logits = predictions,
      optimizer = tf.train.FtrlOptimizer(learning_rate = 0.1)
    )

3-4. Normally, we would send back only the predictions, but in this case we want to send back both the predictions and the embeddings:

    # 3. Create predictions
    predictions_dict = {
      "predicted": predictions,
      "station_embed": embed_layer
    }

    # 4. Create export outputs
    export_outputs = {
      "predict_export_outputs": tf.estimator.export.PredictOutput(outputs = predictions_dict)
    }
Another reason to use a custom estimator here is precisely this ability to change export_outputs.
5. Return an EstimatorSpec with the predictions and the export outputs replaced:
    # 5. Return EstimatorSpec
    return spec._replace(predictions = predictions_dict,
                         export_outputs = export_outputs)
Now, we train the model as usual.
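For completeness, here is a sketch of the training-and-export step that produces the export directory used below. The input functions and the serving input function are my assumptions; the GitHub repository has the exact code:

    # A sketch of training plus export; train_input_fn and eval_input_fn
    # are assumed to exist and read the BigQuery extract from above.
    def serving_input_fn():
        feature_placeholders = {
            'day_of_week': tf.placeholder(tf.int32, [None]),
            'start_station_id': tf.placeholder(tf.int32, [None]),
            'rainy': tf.placeholder(tf.string, [None])
        }
        return tf.estimator.export.ServingInputReceiver(
            feature_placeholders, feature_placeholders)

    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=nsteps)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=eval_input_fn,
        exporters=[tf.estimator.LatestExporter('exporter', serving_input_fn)])
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)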
You can then serve the exported model with TensorFlow Serving, or deploy it to Cloud ML Engine (which is essentially hosted TF Serving) and invoke predictions there. You can also invoke the model locally using gcloud (which, for this purpose, offers a more convenient interface than saved_model_cli):
    EXPORTDIR=./model_trained/export/exporter/
    MODELDIR=$(ls $EXPORTDIR | tail -1)
    gcloud ml-engine local predict --model-dir=${EXPORTDIR}/${MODELDIR} --json-instances=./test.json
What's in test.json?

    {"day_of_week": 4, "start_station_id": 435, "rainy": "true"}
    {"day_of_week": 4, "start_station_id": 521, "rainy": "true"}
    {"day_of_week": 4, "start_station_id": 3221, "rainy": "true"}
    {"day_of_week": 4, "start_station_id": 3237, "rainy": "true"}
As you can see, I sent 4 instances corresponding to rental stations 435, 521, 3221 and 3237.
The first two stations are in Manhattan, an area of heavy rental activity (by both commuters and tourists). The last two stations are on Long Island, where bicycle rentals are much less common (perhaps weekend-only). The resulting output contains the predicted number of trips (our label) and the embedding of each rental station:
In this case, the first embedding dimension is nearly zero for every station, so a one-dimensional embedding would have been enough. Looking at the second dimension, the Manhattan stations clearly have positive values (0.0081, 0.0011) and the Long Island stations have negative values (-0.0025, -0.0031).
This is information the machine learning model learned just by looking at bicycle rentals at these locations on different days! If you have categorical variables in your TensorFlow models, try serving embeddings from those models. Perhaps they will enable new kinds of data analysis!