01 Foreword

In CTR (Click-Through Rate) recommendation scenarios, TensorFlow Feature Column is widely used in practice. On the one hand, it makes model feature processing convenient; on the other hand, it introduces performance problems for the online inference service. To optimize recommendation service performance and improve online serving efficiency, the iQiyi deep learning platform team has summarized some performance optimization methods from practice.

With these optimizations, the throughput of the recommendation business's online inference service more than doubled, and P99 latency dropped by more than 50%.

02 Background

Feature Column is a tool provided by TensorFlow for processing structured data, and it serves as the bridge that maps sample features to model features. It provides a variety of feature-processing methods, so that algorithm engineers can easily convert raw features of all kinds into model inputs for experiments.

As shown in the figure above, all feature columns derive from the FeatureColumn class, under which there are three subclasses: CategoricalColumn, DenseColumn, and SequenceDenseColumn, corresponding to sparse features, dense features, and sequence dense features respectively. Algorithm engineers can pick the interface that matches the type of each sample feature.

In addition, Feature Column integrates well with the TF Estimator interface: by defining the corresponding feature columns, the feature inputs can be used directly in predefined Estimator models. TF Estimator is widely used in recommendation algorithms, particularly because it encapsulates distributed training.

The following shows an example of processing features with Feature Column and feeding them into an Estimator DNNClassifier:
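A minimal sketch of what such code typically looks like; the feature names, bucket size, and layer sizes are illustrative, not the original figure's exact values:

```python
import tensorflow as tf

# Numeric, vocabulary, and hashed-ID features (names are illustrative).
age = tf.feature_column.numeric_column("age")
gender = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        "gender", ["male", "female"]))
user_id = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket(
        "user_id", hash_bucket_size=100000, dtype=tf.int64),
    dimension=16)

# Feature columns plug straight into a canned Estimator.
estimator = tf.estimator.DNNClassifier(
    feature_columns=[age, gender, user_id],
    hidden_units=[256, 128, 64])
```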

Although Feature Column is convenient to use and makes model code fast to write, some performance problems gradually became prominent as iQiyi's recommendation services went into production. The following sections walk through the problems we encountered in practice, one by one, and how we optimized them.

03 Integer feature hashing optimization

Recommendation models usually hash ID features into a fixed number of buckets and then convert them into embeddings as neural-network input, for example video IDs, user IDs, and commodity IDs. The following is an example:
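A minimal sketch, assuming an int64 `video_id` feature; the bucket size and embedding dimension are illustrative:

```python
import tensorflow as tf

# Hash the int64 ID into a fixed number of buckets, then embed it.
video_id = tf.feature_column.categorical_column_with_hash_bucket(
    "video_id", hash_bucket_size=500000, dtype=tf.int64)
video_emb = tf.feature_column.embedding_column(video_id, dimension=32)
```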

The documentation of `categorical_column_with_hash_bucket` [2] says that for string input, the output is computed as `output_id = Hash(input_feature_string) % bucket_size`; for integer input, the value is first converted to a string and then hashed. Looking at the source code [3], the logic is as follows:
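Condensed from the hashed categorical column's transform in the TF source [3] (simplified, not verbatim):

```python
# Integer inputs take a detour through AsString before hashing.
if input_tensor.dtype == dtypes.string:
    sparse_values = input_tensor.values
else:
    sparse_values = string_ops.as_string(input_tensor.values)

sparse_id_values = string_ops.string_to_hash_bucket_fast(
    sparse_values, self.hash_bucket_size)
```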

In recommendation services, such IDs have usually already been hashed in some way into 64-bit integer features in the sample, so the integer-to-string conversion is always performed. However, the TensorFlow Timeline shows that the internal `AsString` OP, which implements `as_string`, is actually expensive. Analysis and comparison found that `AsString` usually takes more than three times as long as the subsequent hash operation, as shown below:

Further analysis of the code inside the `AsString` OP shows that it also involves memory allocation and copy operations, so it is understandable that it is slower than the pure hash computation.

Naturally, the team considered removing this conversion, and wrote a function that hashes integers directly. Example code is as follows:
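A hedged sketch of the idea; `custom_ops.int64_to_hash_bucket_fast` is hypothetical and stands in for the team's custom OP, which is not a stock TensorFlow API:

```python
import tensorflow as tf
from my_project import custom_ops  # hypothetical module wrapping the custom OP

def to_hash_bucket_fast(input_tensor, hash_bucket_size):
    # Dispatch on dtype: strings keep the stock fast hash; integers skip
    # the costly int -> string conversion and are hashed directly.
    if input_tensor.dtype == tf.string:
        return tf.strings.to_hash_bucket_fast(input_tensor, hash_bucket_size)
    return custom_ops.int64_to_hash_bucket_fast(input_tensor, hash_bucket_size)
```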

With this type-dispatching hash, the time-consuming conversion is eliminated entirely. Note that the new OP backing the added hash function must also be registered in TF Serving.

04 Fixed-length feature conversion optimization

Fixed-length features are features parsed via the `tf.io.FixedLenFeature` interface, such as a user's gender or age. The length of such features is fixed, either to 1 or to a fixed number of dimensions. They are parsed into a Dense Tensor by `tf.io.parse_example`, processed by Feature Column, and then fed into the model's input layer. A common code example follows:
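A minimal sketch, assuming a fixed-length string feature `user_name` with a small illustrative vocabulary; the `make_example` helper is just for constructing test protos:

```python
import tensorflow as tf

def make_example(name):
    # Build a serialized tf.train.Example carrying one "user_name" value.
    return tf.train.Example(features=tf.train.Features(feature={
        "user_name": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name.encode()]))
    })).SerializeToString()

serialized = tf.constant([make_example("Bob"), make_example("Wanda")])

feature_spec = {
    "user_name": tf.io.FixedLenFeature([1], tf.string, default_value=["UNK"]),
}
parsed = tf.io.parse_example(serialized, feature_spec)   # Dense Tensor [2, 1]

# Vocabulary lookup followed by one-hot expansion.
user_name = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        "user_name", ["Bob", "Alice", "Wanda"]))
inputs = tf.keras.layers.DenseFeatures([user_name])(parsed)  # one-hot [2, 3]
```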

Now look at the Tensor transformation logic inside TensorFlow. As shown in the figure below, the `user_name` values of two samples are Bob and Wanda. After parsing, they become a Dense Tensor with batch size 2; `categorical_column_with_vocabulary_list` then looks them up in the vocabulary, converting them to 0 and 2 respectively, and `indicator_column` converts those IDs into a one-hot encoded dense input.

The sample processing above looks correct. Now let's look at the conversion logic inside the Feature Column code:

As you can see, the Vocabulary Categorical Column first removes illegal values and converts the Dense Tensor into a Sparse Tensor; the Indicator Column then converts it back from Sparse to Dense, and finally into the one-hot Tensor that is actually needed.
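Conceptually, the current path looks like the following sketch (a simplified paraphrase, not the verbatim TF internals):

```python
import tensorflow as tf

dense_input = tf.constant([["Bob"], ["Wanda"]])

# 1) Dense -> Sparse (this step is also where illegal values get dropped).
sparse = tf.sparse.from_dense(dense_input)

# 2) Vocabulary lookup on the sparse values.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        tf.constant(["Bob", "Alice", "Wanda"]),
        tf.constant([0, 1, 2], dtype=tf.int64)),
    default_value=-1)
ids = tf.SparseTensor(sparse.indices, table.lookup(sparse.values),
                      sparse.dense_shape)

# 3) Sparse -> Dense again inside indicator_column, then one-hot.
one_hot = tf.one_hot(tf.sparse.to_dense(ids, default_value=-1), depth=3)
```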

On the one hand, this removes abnormal values from the sample data; on the other hand, it also handles the case where the input is already a Sparse Tensor: for a Sparse Tensor input, the vocabulary lookup is done directly and the result is then converted to a Dense Tensor. This design achieves code reuse but incurs a performance penalty, because the conversion between Sparse and Dense Tensors is very time-consuming. If the original input Tensor could be converted directly into a one-hot Tensor, both conversions could be saved.

Returning to the nature of fixed-length features: if a value is missing during sample processing, it is filled with the default value, and sample generation guarantees there are no empty values or -1. The outlier handling can therefore be omitted. The optimized internal conversion logic is shown in the figure below; it saves the two conversions between Sparse and Dense Tensors.
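A hedged sketch of the optimized path, assuming the input is a guaranteed-valid fixed-length Dense Tensor: the vocabulary lookup runs on the Dense Tensor and feeds one-hot directly, with no Dense → Sparse → Dense round trip.

```python
import tensorflow as tf

dense_input = tf.constant([["Bob"], ["Wanda"]])   # always valid by contract

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        tf.constant(["Bob", "Alice", "Wanda"]),
        tf.constant([0, 1, 2], dtype=tf.int64)),
    default_value=0)

ids = table.lookup(dense_input)                         # stays dense throughout
one_hot = tf.one_hot(tf.squeeze(ids, axis=-1), depth=3)  # [2, 3]
```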

Besides the Vocabulary Categorical Column above, other similar feature columns have the same problem. The platform therefore developed a set of optimized Feature Column interfaces for such features for business use, and the optimized performance is quite good.

05 User feature optimization

Recommendation models share a typical trait: the model takes both user-side features and the features of the candidate items, such as video features or product features. When the model is deployed online, it scores multiple videos or products for a single user, and items are then recommended according to the scores. Since the scoring is done for a single user, the user features are repeated once per candidate item in the request sent to the model. The following diagram shows the online inference of a typical recommendation ranking model:

The model input in the diagram has three user features and three item features. Suppose a user with features U1, U2, and U3 is being served, and a scoring request covers two different items, I1 and I2, each with three features (I1 has I11, I12, I13, and so on). This constitutes an inference request with batch size 2. As the figure shows, because two different items are scored for the same user, the item-side features differ, but the user features are repeated twice.
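To illustrate the duplicated transmission (placeholder values, not a real request):

```python
# A batch-size-2 request in which every user feature is repeated per item.
request_features = {
    "user_id":  [["u1"], ["u1"]],   # same user, sent twice
    "user_age": [[30],   [30]],
    "user_sex": [["m"],  ["m"]],
    "item_id":  [["i1"], ["i2"]],   # two different candidate items
}
```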

The example above uses only 2 items, but a real online inference request typically carries 100 items or more, so the user features are repeated 100 times or more. The repeated user features not only increase transmission bandwidth but also increase the feature-processing computation, so the business was very keen to solve this problem.

The root of the problem lies in TensorFlow's model training code. During training, each sample is one user's behavior toward one item, and samples enter the model after shuffle and batch. A batch therefore necessarily contains behavior samples from multiple users, which is a completely different input format from the online inference service.

How can this be solved? The simplest idea: what if the online service sent the user features only once? A quick attempt shows that `concat` fails when the feature data reaches the model's input layer, because the batch size of the item features is N while the batch size of the user features is only 1, as in the following example:
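A minimal sketch of the failure, with illustrative shapes:

```python
import tensorflow as tf

user_emb = tf.ones([1, 8])   # user features, batch size 1
item_emb = tf.ones([2, 16])  # item features, batch size 2

# Raises InvalidArgumentError: concat requires dimension 0 to match (1 vs 2).
bad = tf.concat([user_emb, item_emb], axis=1)
```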

To solve the concat failure, from the model's perspective alone, the user features can be restored to the same batch size as the item features before entering the input layer, as shown in the figure below.

This is obviously mathematically equivalent, so the question is how to implement the idea in TensorFlow code. Note that the replication must be performed only in the serving graph, not in the training graph.

The TF Estimator interface is common in recommendation algorithms, and it provides a good way to distinguish the two graphs: when the mode is `tf.estimator.ModeKeys.PREDICT`, the graph is being built for online serving; when it is `tf.estimator.ModeKeys.TRAIN`, it is the training graph. Sample code follows:
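A hedged sketch of such a `model_fn`; the column lists, layer sizes, and loss are illustrative, not the team's exact code. Labels are assumed to be shaped `[batch, 1]`.

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # User and item columns are passed in separately (see the note below).
    item_input = tf.compat.v1.feature_column.input_layer(
        features, params["item_columns"])
    user_input = tf.compat.v1.feature_column.input_layer(
        features, params["user_columns"])

    if mode == tf.estimator.ModeKeys.PREDICT:
        # Serving graph only: the request carries the user features once;
        # tile them to the item batch size so concat succeeds.
        batch_size = tf.shape(item_input)[0]
        user_input = tf.tile(user_input, [batch_size, 1])

    net = tf.concat([user_input, item_input], axis=1)
    logits = tf.compat.v1.layers.dense(net, 1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode, predictions={"score": tf.sigmoid(logits)})

    loss = tf.compat.v1.losses.sigmoid_cross_entropy(labels, logits)
    train_op = tf.compat.v1.train.AdamOptimizer().minimize(
        loss, global_step=tf.compat.v1.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
```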

In the actual model, the user and item feature columns need to be separated and passed in independently, which is a sizable change to the original model code. The batch size can be obtained from the length of the item features; the details are omitted here.

In the actual rollout, the team went through two stages. In the first stage, only the algorithm model code was modified: only the first row of the user features was taken before feature processing, while the inference requests still carried the repeated user features. In the second stage, the recommendation request itself was optimized to send the user features only once; at that point the model code needed no further changes, since it had already adapted automatically.

As shown in the figure above, in the first stage the user features in the input are still repeated many times; inside the model, only the first row of the user features is taken before feature processing. Sample code follows:
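A hedged sketch of the stage-one trick; `user_keys` and the dense slicing are illustrative:

```python
def dedup_user_features(features, user_keys):
    """Keep only the first row of each user feature before feature processing."""
    for key in user_keys:
        # Slicing keeps rank: shape [N, ...] -> [1, ...]. This works whether
        # the request repeated the user features N times or sent them once.
        features[key] = features[key][:1]
    return features
```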

The model code above can handle both cases: inference requests that send duplicated user features, and requests that send them only once. So in the second stage there was no need to modify the model code; only the engine-side code that builds the inference request had to be optimized.

After this optimization, the online inference service no longer sends user features repeatedly, which both saves bandwidth and reduces serialization cost. Moreover, the Feature Column conversion is performed only once for the user features of a batch, and only then is the replication done; the replication takes far less time than the conversion.

In fact, a further optimization is possible: postpone the replication until after the first layer's matrix multiplication, which saves part of the computation of that first multiplication. The higher the proportion of user-feature dimensions, the more pronounced the benefit.
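A hedged sketch of the idea, assuming illustrative shapes and weight variables (`W_user`, `W_item`, `b` are not the production model's): the user slice of the first layer is computed once at batch size 1, and the small activation is tiled instead of the raw features.

```python
import tensorflow as tf

H = 128  # illustrative hidden size
W_user = tf.Variable(tf.random.normal([8, H]))   # user slice of layer-1 weights
W_item = tf.Variable(tf.random.normal([16, H]))  # item slice of layer-1 weights
b = tf.Variable(tf.zeros([H]))

def first_layer(user_input, item_input):
    # user_input: [1, 8], item_input: [N, 16]
    user_part = tf.matmul(user_input, W_user)    # computed once, [1, H]
    item_part = tf.matmul(item_input, W_item)    # [N, H]
    n = tf.shape(item_part)[0]
    # Equivalent to concat-then-matmul on tiled inputs, but cheaper.
    return tf.nn.relu(tf.tile(user_part, [n, 1]) + item_part + b)
```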

06 Summary

This article introduced some TensorFlow Feature Column optimizations summarized by the iQiyi deep learning platform team in practice. With these optimizations, the throughput of the online inference service more than doubled, and P99 latency dropped by more than 50%. Compared with OP fusion, model graph rewriting, and similar optimizations, these changes are easier to apply in business practice.

Finally, we should still acknowledge the feature-processing convenience that TensorFlow Feature Column brings to recommendation algorithms. It abstracts the whole feature pipeline, so algorithm engineers can iterate and experiment quickly with only slight adaptation of the sample features.

References

1. www.tensorflow.org/tutorials/s…

2. www.tensorflow.org/api_docs/p…

3. github.com/tensorflow/…