Introduction

TensorFlow is arguably the most popular deep learning framework, and much of its interface design is quite nice: the SWIG-generated Python API hides the underlying implementation from users, custom ops can be added dynamically in Python or C++ (and integrated easily into Serving), and the SavedModel format bundles the model together with its signature.

TensorFlow also has some design choices that can confuse both novice and veteran programmers. For example, multi-GPU training requires users to build tower structures and place ops on devices by hand. Distributed training requires you to write the parameter server (PS) yourself, even though it is essentially a single line of join code (and because the PS does not exit after training by default, you need a TensorFlow queue trick to shut it down). The recommended Dataset interface has to be driven with a while True loop plus an OutOfRangeError handler, and there is a zoo of overlapping high-level APIs such as Keras, Estimator, Learn and Slim. Of course, these "unreasonable" designs exist for a reason: as the lowest-level modeling language, exposing with tf.device and the PS/worker interfaces is what makes it possible for developers to implement more complex model parallelism, data parallelism, and in-graph or between-graph replication.
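As a tiny illustration (a sketch, not taken from any official example) of the explicit device placement this exposes, where the user decides which job and device each variable and op lives on:

import tensorflow as tf

# Pin the variable to the parameter server and the matmul to a worker GPU.
with tf.device("/job:ps/task:0/cpu:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")

with tf.device("/job:worker/task:0/gpu:0"):
    logits = tf.matmul(tf.zeros([32, 784]), weights)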

In the last month we have found several equally "interesting" interface designs, which we discuss below together with the corresponding GitHub issues so that everyone can learn from them.

1. Flags only show default values before parsing

Anyone who has written a TensorFlow script has used tf.app.flags. It lets you define command-line parameters and their default values through a TensorFlow interface, similar to Python's argparse and ConfigParser. It is the de facto "best practice" in almost all official TensorFlow code, and the sample code is simple.

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_integer("image_width", 224, "Width of the image")
flags.DEFINE_integer("image_height", 224, "Height of the image")
flags.DEFINE_integer("channels", 3, "Channel of the image")
FLAGS = flags.FLAGS

# FLAGS.image_width

There is a "bug" that shows up when we try to print the values of FLAGS at runtime: the first print shows the default values, but printing them again later shows the overridden values. See the issue "The default values of tf.app.flags are printed even though parameters are passed, the first time" · Issue #20680 · tensorflow/tensorflow.


Two conditions trigger this problem. The first is simply being rigorous: we print the effective parameters (overridden or default) when the script starts. The second is that the convention tf.app.flags inherits from absl.flags is a bit rough. To catch incorrect arguments passed on the command line, we recommend checking them at runtime; different models have different hyperparameters, and in general we can obtain all key-value pairs from the FLAGS object.

FLAGS = flags.FLAGS
parameter_value_map = {}
# Dump every flag by reading the internal __flags dict directly.
for key in FLAGS.__flags.keys():
  parameter_value_map[key] = FLAGS.__flags[key].value
print("Parameters: {}".format(parameter_value_map))
# Parameters: {'channels': 3, 'image_height': 224, 'image_width': 224}

The values obtained this way are the defaults defined in the Python code, regardless of what was actually passed on the command line, which is obviously not what we want. However, if we first access any flag directly, say FLAGS.channels, FLAGS.image_height or FLAGS.image_width, the values are updated to the overridden ones: parsing happens lazily, on first access. TensorFlow relies on abseil-py, another open-source project from Google engineers, to parse parameters (github.com/abseil/abse… ); the arguments in sys.argv are only parsed when the FlagValues object's __call__() is actively invoked. So both TensorFlow and Abseil read the command line from Python's sys.argv, and there is an explicit parse step. TensorFlow also wraps Abseil: natively, reading a value before parsing would throw an exception, so the wrapper parses on demand the first time a value is read (github.com/tensorflow/… ).


Most people don't read the source code when they use tf.app.flags, and don't know that TensorFlow checks whether the flags have been parsed every time a value is read. This design lets ordinary users always get the latest value without ever thinking about when parsing happens, but FLAGS.__flags[key].value bypasses it and returns the default if no flag has been accessed yet. The issue is still under discussion and no pull request or concrete fix has been proposed.
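A simple workaround (my own suggestion, not an official fix) is to force parsing before dumping the flags, for example by touching any flag attribute first; the flag names below are the ones from the earlier snippet.

FLAGS = flags.FLAGS
# Touch one flag so the wrapper parses sys.argv before we read __flags directly.
_ = FLAGS.image_width
parameter_value_map = {key: FLAGS.__flags[key].value for key in FLAGS.__flags}
print("Parameters: {}".format(parameter_value_map))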

2. Multiple HashTables restored from a Checkpoint overwrite each other

Most people know that a TensorFlow Checkpoint stores model parameters; besides the weight matrices of the neural network, it can also store key-value HashTables, for which TensorFlow provides a number of op implementations under tf.contrib.lookup. For example, we tried to store the mapping between a sample's human-readable label (string) and its training label (integer) in the Checkpoint and the model as MutableHashTables. To convert in both directions there are two hash tables, string-to-int and int-to-string. A colleague in our group found that multiple HashTables overwrite each other when restored (github.com/tensorflow/… ).


After defining two MutableHashTables without specifying names, you can export them to a Checkpoint as in the following example, and then restore them with a matching script.

import tensorflow as tf

keys = tf.placeholder(dtype=tf.string, shape=[None])
values = tf.placeholder(dtype=tf.int64, shape=[None])
table1 = tf.contrib.lookup.MutableHashTable(tf.string, tf.int64, -1)
table2 = tf.contrib.lookup.MutableHashTable(tf.string, tf.int64, -1)
insert_table1 = table1.insert(keys, values)
insert_table2 = table2.insert(keys, values)
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(insert_table1, feed_dict={keys: ["a"], values: [1]})
    sess.run(insert_table2, feed_dict={keys: ["b"], values: [2]})
    print("table1:", sess.run(table1.export()))
    print("table2:", sess.run(table2.export()))
    saver.save(sess, "checkpoint/test")

No errors are reported, but after restoring from the Checkpoint, table1 is empty while table2 is normal. This is a good occasion to dig into the TensorFlow source: the Python class tf.contrib.lookup.MutableHashTable has a name attribute whose default value is "MutableHashTable", and the subsequent Checkpoint export is keyed on the passed-in or default name (github.com/tensorflow/… ).


So, as has been emphasized before, if you define multiple HashTables without specifying names, they will silently overwrite each other; the problem never shows up for users who habitually give every op a unique name. But TensorFlow users usually don't name ops like tf.add() and tf.multiply(), especially since TensorFlow overloads the Python operators. Because TensorFlow's Graph and Checkpoint depend heavily on op names, the Python API assigns a unique name to every op a user creates; even if you pass a conflicting name, it is suffixed until it is unique. This logic lives in TensorFlow's Python layer, which means you should be careful if you call the C++ API or any other language API directly.
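A minimal sketch of the obvious workaround (my assumption, not an official fix): give each table an explicit, distinct name so their Checkpoint entries no longer collide.

# Explicit, distinct names keep the two tables separate in the Checkpoint.
table1 = tf.contrib.lookup.MutableHashTable(tf.string, tf.int64, -1, name="table1")
table2 = tf.contrib.lookup.MutableHashTable(tf.string, tf.int64, -1, name="table2")

With distinct names, both tables survive the save/restore round trip.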

The same logic appears at github.com/tensorflow/… ; for now all tables share the same fixed default name, so exporting multiple tables without specifying names leads to these overwrite errors. The issue is still under discussion and a pull request has been sent (whether a unique-name assignment scheme similar to Variable's can be merged still needs to be discussed): github.com/tensorflow/… .

3. The Epoch progress saved by the Dataset conflicts with Shuffle

Dataset is currently the main data-reading interface. It is rich in features: you can specify the number of epochs, the batch size, whether to shuffle, whether to cache, and so on, and with a custom map function it can parse files in different formats such as TFRecords or CSV. File names can even be fed in dynamically through a placeholder so that they are not hard-coded into the Graph. The scenario here is that users want "resumable" training in their TensorFlow scripts: training progress and parameters are stored in the Checkpoint, and the Dataset knows which epoch and batch the user has reached. The Dataset design took this into account from the beginning with the SaveInternal interface (github.com/tensorflow/… ).
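A rough sketch of how that looks from user code, assuming the TF 1.x contrib API and a made-up file name: the iterator's progress is registered as a saveable object so that tf.train.Saver includes it in the Checkpoint.

import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train.tfrecords"]).batch(32).repeat(10)
iterator = dataset.make_initializable_iterator()

# Register the iterator state so the Saver also checkpoints reading progress.
saveable = tf.contrib.data.make_saveable_from_iterator(iterator)
tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)
saver = tf.train.Saver()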

The problem is that we hit this issue as soon as we actually used it: after adding the Dataset iterator state to the saved variables, saving the Checkpoint fails outright. See the issue "Fail to save the checkpoint with dataset iterator saveable when using shuffle" · Issue #18583 · tensorflow/tensorflow.


When we looked into it, it turned out that a plain TensorSliceDataset does save its progress into the Checkpoint, but as soon as a shuffle op is added to the Dataset the export fails. If the user adds shuffle, the training data is shuffled within a buffer of the given size; the Dataset can still guarantee that every record in an epoch is used, but if it only saved its progress to the Checkpoint, the reshuffle on the next run could skip some of the data.

One way to live with the current API: since the Dataset iterator variables cannot be saved to the Checkpoint when shuffle is used, users can prepare a pre-shuffled data set in advance and drop the shuffle op to avoid the problem entirely. If API support is really needed, I see two options. The first is to accept that the Dataset iterator does not guarantee every epoch processes all the data, so after restoring from the Checkpoint we simply shuffle again, regardless of which records have already been processed. The other is to save the shuffle result itself into the Checkpoint, so that after restoring, the iterator can continue exactly where it left off. Of course, even if both were implemented, some users would still find something to complain about. The issue is still under discussion and no pull request or concrete approach has emerged. A small sketch of the pre-shuffling workaround follows.
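Here the shard file names are hypothetical: the list is shuffled in Python before building the Dataset and dataset.shuffle() is skipped, so the iterator state can still be checkpointed as in the earlier sketch.

import random
import tensorflow as tf

filenames = ["part-00000.tfrecords", "part-00001.tfrecords"]
random.shuffle(filenames)  # shuffle outside the graph, once per run

dataset = tf.data.TFRecordDataset(filenames).batch(32)
iterator = dataset.make_initializable_iterator()
saveable = tf.contrib.data.make_saveable_from_iterator(iterator)
tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)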

4. Other

Besides the issues above, there are other "interesting" interface designs. For example, we once asked in the community whether the parameter server in distributed TensorFlow could exit automatically after training; the answer was that it can't, but you can implement it yourself. How? The usual way to implement a PS in distributed TensorFlow is to call server.join(), as in the official documentation; as the name suggests, this call blocks, so the PS keeps blocking and never exits even after all workers have finished training. Instead we can use the queue provided by the TensorFlow API: the PS dequeues a fixed number of items (one per worker) from a queue, and each worker enqueues an item into that queue once its training is done. The PS blocks on the queue until every worker has finished, then dequeues enough items and lets the Python process exit. The code looks roughly like this; if you are interested, see github.com/tobegit3hub… .

# If this task is the PS
queue = worker_done_queues[task_index]
dequeue_op = queue.dequeue()
for i in range(master_worker_number):
    sess.run(dequeue_op)
    logging.info("{} workers are already done".format(i + 1))

# If this task is a worker
enqueue_ops = []
for queue in worker_done_queues:
    enqueue_op = queue.enqueue(1)
    enqueue_ops.append(enqueue_op)
for enqueue_op in enqueue_ops:
    sess.run(enqueue_op)

Another interesting corner of the distributed TensorFlow interface is the variety of Session wrappers, such as Supervisor and MonitoredTrainingSession. Supervisor can hang or fail along with the PS or a worker under network jitter, while the Session object you get from MonitoredTrainingSession cannot be passed directly to the official saved_model_builder to export a SavedModel, because the types don't match. One workaround is to let MonitoredTrainingSession write the parameters to a Checkpoint and then load that Checkpoint with a new tf.Session() to export the model. A nicer one is to add a SavedModelHook to the MonitoredTrainingSession; note that it goes into chief_only_hooks rather than the generic hooks. The hook inherits from the SessionRunHook API and grabs the raw Session object in end() to store the model. So when you see the following code, you are looking at the wisdom of the TensorFlow API designers.

class SavedModelHook(tf.train.SessionRunHook):
  def end(self, session):
    # saved_model() is a user-defined helper that calls saved_model_builder
    # with the raw session captured here.
    saved_model(session, model_path, FLAGS.model_version,
                model_signature, legacy_init_op)

chief_hooks = [SavedModelHook()]

with tf.train.MonitoredTrainingSession(
    master=server.target,
    is_chief=(task_type == "master"),
    chief_only_hooks=chief_hooks) as mon_sess:
  # ... training loop driven by mon_sess ...

Conclusion

We've poked fun at some of TensorFlow's API design, but as developers ourselves we understand the inherent trade-offs and difficulties. As an "all-in-one" deep learning framework, TensorFlow implements not only basic autograd but also DAG management and control flow; its FileSystem API supports local storage, HDFS, S3 and other file systems; and it abstracts a uniform operator interface over CPU, GPU and TPU. Everything above that layer ultimately has to be implemented as ops, which is what guarantees it can run in a C++ tensorflow::ClientSession.

As a result, it is hard to design every interface for both ease of use and extensibility. For example, FLAGS' convention of checking __getattr__ and parsing on first access can leave other APIs unaware of it, and many interfaces become slightly harder to use precisely because they are so extensible; distributed TensorFlow alone is complex enough to deserve its own write-up. This is where we can give back to the TensorFlow community with more issues and pull requests. Reading the TensorFlow source is the best way to learn to use TensorFlow, and we look forward to more TensorFlow contributors joining (and helping us solve the problems above).