Training error reported at the beginning of training

Epoch 1/2000
Traceback (most recent call last):
  File "multi_type_question_number.py", line 450, in <module>
    train(model)
  File "multi_type_question_number.py", line 281, in train
    layers='heads')
  File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
    use_multiprocessing=False,
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/legacy/interfaces. Py." ", line 91, in wrapper
    return func(*args, **kwargs)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/training. Py." ", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/ root/anaconda3 / envs/dl/lib/python3.5 / site - packages/keras/engine/training_generator py." ", line 217, in fit_generator
    class_weight=class_weight)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/training. Py." ", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/training. Py." ", line 751, in _standardize_user_data
    exception_prefix='input')
  File "/ root/anaconda3 / envs/dl/lib/python3.5 / site - packages/keras/engine/training_utils py." ", line 138, in standardize_input_data
    str(data_shape))
ValueError: Error when checking input: expected input_image_meta to have shape (16,) but got array with shape (22,)

Cause: the number of classes in the model config does not match the dataset. Solution: set the class count to 1 (background) + the number of target classes to be detected.
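A minimal sketch of where this is set, assuming a project-specific Config subclass (the class name and the class count below are placeholders, not this project's actual values). The size of input_image_meta depends on NUM_CLASSES, which is why a wrong count shows up as this shape mismatch.

from mrcnn.config import Config

class QuestionConfig(Config):
    # Hypothetical config for illustration only.
    NAME = "questions"
    # 1 for the background class plus the number of target classes to detect.
    # A mismatch with the dataset changes the size of input_image_meta
    # and triggers the error above.
    NUM_CLASSES = 1 + 10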

Unable to save model error

100/100 [==============================] - 261s 3s/step - loss: 1.6207 - rpn_class_loss: 0.1146 - rpn_bbox_loss: 0.4441 - mrcnn_class_loss: 0.2516 - mrcnn_bbox_loss: 0.4165 - mrcnn_mask_loss: 0.3940 - val_loss: 0.8447 - val_rpn_class_loss: 0.0514 - val_rpn_bbox_loss: 0.4179 - val_mrcnn_class_loss: 0.0822 - val_mrcnn_bbox_loss: 0.1607 - val_mrcnn_mask_loss: 0.1326
Traceback (most recent call last):
  File "multi_type_question_number.py", line 450, in <module>
    train(model)
  File "multi_type_question_number.py", line 281, in train
    layers='heads')
  File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
    use_multiprocessing=False,
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/legacy/interfaces. Py." ", line 91, in wrapper
    return func(*args, **kwargs)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/training. Py." ", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/ root/anaconda3 / envs/dl/lib/python3.5 / site - packages/keras/engine/training_generator py." ", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/ root/anaconda3 / envs/dl/lib/python3.5 / site - packages/keras/callbacks. Py." ", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/ root/anaconda3 / envs/dl/lib/python3.5 / site - packages/keras/callbacks. Py." ", line 446, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/network. Py." ", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/saving. Py." ", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/saving. Py." ", line 83, in _serialize_model
    model_config['config'] = model.get_config()
  File "/ root/anaconda3 envs/dl/lib/python3.5 / site - packages/keras/engine/network. Py." ", line 931, in get_config
    return copy.deepcopy(config)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 218, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 155, in deepcopy
    y = copier(x, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/ root/anaconda3 envs/dl/lib/python3.5 / copy. Py." ", line 174, in deepcopy
    rv = reductor(4)
TypeError: can't pickle SwigPyObject objects

Solution:

The model-saving parameters are wrong. In model.py, configure the checkpoint callback to save only the weights. Saving the full Mask R-CNN model fails with the error above, for reasons that are unclear.

save_best_only=True, save_weights_only=True

# Callbacks in the train method of mrcnn/model.py
callbacks = [
    keras.callbacks.ModelCheckpoint(self.checkpoint_path, monitor='val_loss',
                                    verbose=0, save_best_only=True, save_weights_only=True),
]
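Because only the weights are checkpointed, a fresh model has to be built and the weights re-loaded when resuming training or running inference. A minimal sketch; the config values and the weight path are placeholders, not this project's real ones:

import mrcnn.model as modellib
from mrcnn.config import Config

class InferenceConfig(Config):
    # Placeholder config; it must match the classes used during training.
    NAME = "questions"
    NUM_CLASSES = 1 + 10
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

# Rebuild the graph, then restore only the saved weights, matching by layer name.
model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="./logs")
model.load_weights("./logs/mask_rcnn_questions_0030.h5", by_name=True)  # placeholder path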

Loss nan

Epoch 33/2000
100/100 [==============================] - 231s 2s/step - loss: 0.9838 - rpn_class_loss: 0.0180 - rpn_bbox_loss: 0.2431 - mrcnn_class_loss: 0.1373 - mrcnn_bbox_loss: 0.2723 - mrcnn_mask_loss: 0.3130 - val_loss: 0.9683 - val_rpn_class_loss: 0.0417 - val_rpn_bbox_loss: 0.3798 - val_mrcnn_class_loss: 0.0751 - val_mrcnn_bbox_loss: 0.2362 - val_mrcnn_mask_loss: 0.2355
Epoch 34/2000
100/100 [==============================] - 215s 2s/step - loss: 0.8713 - rpn_class_loss: 0.0227 - rpn_bbox_loss: 0.1844 - mrcnn_class_loss: 0.1292 - mrcnn_bbox_loss: 0.2486 - mrcnn_mask_loss: 0.2863 - val_loss: 0.7546 - val_rpn_class_loss: 0.0275 - val_rpn_bbox_loss: 0.2965 - val_mrcnn_class_loss: 0.0583 - val_mrcnn_bbox_loss: 0.1699 - val_mrcnn_mask_loss: 0.2023
Epoch 35/2000
100/100 [==============================] - 221s 2s/step - loss: nan - rpn_class_loss: 0.6370 - rpn_bbox_loss: 0.4525 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0335 - mrcnn_mask_loss: 0.0907 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4715 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 36/2000
100/100 [==============================] - 219s 2s/step - loss: nan - rpn_class_loss: 0.7061 - rpn_bbox_loss: 0.4460 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4431 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 37/2000
100/100 [==============================] - 209s 2s/step - loss: nan - rpn_class_loss: 0.7066 - rpn_bbox_loss: 0.4681 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7073 - val_rpn_bbox_loss: 0.4084 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
...
Epoch 167/2000
100/100 [==============================] - 210s 2s/step - loss: nan - rpn_class_loss: 0.7062 - rpn_bbox_loss: 0.4489 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7074 - val_rpn_bbox_loss: 0.5061 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 168/2000
 24/100 [======>.......................] - ETA: 1:34 - loss: nan - rpn_class_loss: 0.7069 - rpn_bbox_loss: 0.4690 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00

mrcnn_class_loss becomes nan, which drags the total loss to nan.

Solution: the training data contains class IDs that fall outside the configured classes; fix the annotations (or the class mapping) so that every class ID is within the range defined by NUM_CLASSES.
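One way to catch this before training is to scan every image's annotations for out-of-range class IDs. A rough sketch, assuming dataset is an already-prepared utils.Dataset subclass instance (whose load_mask returns a mask array and a class-ID array) and config is the training config; both names are placeholders:

import numpy as np

# dataset: a prepared mrcnn.utils.Dataset subclass instance; config: the training Config.
for image_id in dataset.image_ids:
    _, class_ids = dataset.load_mask(image_id)
    # Valid object class IDs are 1 .. NUM_CLASSES - 1 (0 is the background).
    bad = class_ids[(class_ids < 1) | (class_ids >= config.NUM_CLASSES)]
    if bad.size:
        print("image", image_id, "has out-of-range class ids:", np.unique(bad))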

Class ID Precautions

github.com/matterport/…

It took me two days to get this code running on my own data set; I think the guidance should give more details. 1. When using add_image() in the utils.Dataset class, the image_id must be a consecutive integer from 1 to some number, because image_id is used as the index of a list. The class number should also be a consecutive integer from 1 to some number, or you will get a nan loss.

  1. Class IDs must be consecutive integers
  2. They must start at 1, or an error will be reported
  3. Class ID 0 is reserved for the background (see the sketch below for how the dataset registers classes and images)
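A rough sketch of a utils.Dataset subclass that follows these rules; the dataset name, class names, and loader method are hypothetical, not this project's real code:

from mrcnn import utils

class QuestionDataset(utils.Dataset):
    # Hypothetical dataset; only the ID conventions matter here.
    def load_questions(self, image_paths):
        # Class IDs: consecutive integers starting at 1; 0 is reserved for "BG".
        self.add_class("questions", 1, "choice")
        self.add_class("questions", 2, "blank")
        self.add_class("questions", 3, "subjective")
        # Image IDs: consecutive integers as well, used to index the image list.
        for i, path in enumerate(image_paths, start=1):
            self.add_image("questions", image_id=i, path=path)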

Understanding the RPN

RPN class loss: it is not computed over all object classes; it is a binary foreground-vs-background classification loss.
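In matterport's implementation, anchors are labelled with an rpn_match array (1 = foreground, -1 = background, 0 = neutral and ignored), so the RPN classifier only ever answers "object or not". A schematic illustration of that labelling, not the library's actual loss code:

import numpy as np

# rpn_match: per-anchor labels, 1 = foreground, -1 = background, 0 = neutral.
rpn_match = np.array([1, -1, 0, -1, 1])

# Only non-neutral anchors contribute to the RPN class loss, and the target
# is binary: object (1) vs. background (0), regardless of the object's class.
anchor_class = (rpn_match == 1).astype(np.int32)
used = rpn_match != 0
print(anchor_class[used])   # [1 0 0 1]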

CUDA and NVIDIA driver compatibility

nvtop for monitoring GPU usage

blog.csdn.net/fword/artic…

blog.csdn.net/jiang_xinxi…

github.com/Syllo/nvtop

Training initiation process

  • Go to the training directory you created

    1. Activate the training environment: source activate dl
    2. Start training: nohup python xxxx.py train >train.log &
  • Running Jupyter Notebook and TensorBoard as background services on the remote ECS server:

nohup jupyter notebook --ip 0.0.0.0 --no-browser --allow-root > jupyter.log &

  • TensorBoard loss monitoring: nohup tensorboard --logdir=./logs > tensorboard.log &

    • TensorBoard fails to start on CentOS 7 with "Tensorboard could not bind to unsupported address family".

IPv4 is not bound by default, so specify the host explicitly: tensorboard --logdir=./logs --host=0.0.0.0 > tensorboard.log &

  • Firewall ports: you also need to open port 6006 (and 8888 for Jupyter) in the firewall:
iptables -A INPUT -p tcp --dport 6006 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 6006 -j ACCEPT


iptables -A INPUT -p tcp --dport 8888 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 8888 -j ACCEPT

service iptables restart

www.jianshu.com/p/586da7c8f…

Training gets stuck

In the train method of model.py, turn off use_multiprocessing and set workers to 1.
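For reference, the fit_generator call inside the train method of mrcnn/model.py looks roughly like the sketch below (other code omitted); the last two arguments are the ones to pin:

# Inside MaskRCNN.train() in mrcnn/model.py (sketch, not the full method)
self.keras_model.fit_generator(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=1,                  # a single worker avoids the generator deadlock
    use_multiprocessing=False,  # matches the traceback at the top of these notes
)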

Understanding step, batch size, and epoch in practice

  1. Iteration: one iteration (also called a training step); each iteration updates the network parameters once.
  2. Batch size: the number of samples used in a single iteration;
  3. Epoch: one epoch means that every sample in the training set has been passed through the network once.

It is worth noting that in deep learning, mini-batch Stochastic Gradient Descent (SGD) is the usual way to train deep models. One advantage of SGD is that each update does not need to traverse all samples, which matters when the dataset is very large. In that setting the epoch can be defined to suit the problem; for example, 10,000 iterations can be declared one epoch. If each iteration uses a batch size of 256, one such epoch is then equivalent to passing 2,560,000 training samples through the network.
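A tiny sanity check of that arithmetic (the numbers are only the example's, not anything from the Mask R-CNN config):

iterations_per_epoch = 10000   # one "epoch" defined here as 10,000 iterations
batch_size = 256               # samples consumed per iteration
samples_per_epoch = iterations_per_epoch * batch_size
print(samples_per_epoch)       # 2560000, i.e. 2,560,000 samples per epoch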