
NaN loss while training with COCO dataset #11

Open
VeeranjaneyuluToka opened this issue Dec 19, 2021 · 9 comments

@VeeranjaneyuluToka

Hi,

I have been trying to train a model on COCO data using coco_train.py, but I keep ending up with NaN losses. It shows different NaN loss errors at different times: earlier I was getting a NaN loss w.r.t. the bbox, and now the classification, as shown below:

File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\training.py", line 134, in train_model
model.fit(train_datagen.repeat(),
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1095, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 800, in train_function
return step_function(self, iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3417, in _call_for_each_replica
return fn(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 572, in wrapper
return func(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 783, in run_step
outputs = model.train_step(data)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 86, in train_step
assert not np.any(np.isnan(rpn_class_loss_val))

I have completely disabled augmentation, but it still happens. Do you have any suggestions for training the COCO weights successfully?

Thanks,
Veeru.

@alexander-pv
Owner

alexander-pv commented Dec 19, 2021

Hi, @VeeranjaneyuluToka,

Could you please provide more information about the issue to speed up the process:

  1. What's happening with the losses according to TensorBoard?
  2. What images were loaded in the batch that caused NaNs in the losses? The info can be extracted from the __getitem__ method of the SegmentationDataset class (see the sketch after this list).
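For reference, a minimal sketch (not the repo's actual code) of how such per-batch logging could look inside __getitem__; the class name SegmentationDatasetDebug, the images_meta list, and the 'id' key are assumptions for illustration only.

# Minimal sketch: log which images end up in each batch so the batch that
# trips the NaN assertion can be identified afterwards.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("SegmentationDataset")


class SegmentationDatasetDebug:
    """Hypothetical stand-in for the SegmentationDataset batching logic."""

    def __init__(self, images_meta, batch_size=2):
        self.images_meta = images_meta  # list of dicts, each with at least an 'id'
        self.batch_size = batch_size

    def __getitem__(self, idx):
        batch = self.images_meta[idx * self.batch_size:(idx + 1) * self.batch_size]
        # When training crashes on the NaN assertion, the last logged line
        # points to the images of the offending batch.
        logger.info("batch %d image ids: %s", idx, [m["id"] for m in batch])
        return batch


if __name__ == "__main__":
    ds = SegmentationDatasetDebug([{"id": i} for i in range(6)], batch_size=2)
    _ = ds[1]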

Thanks,
Alexander

@VeeranjaneyuluToka
Author

VeeranjaneyuluToka commented Dec 20, 2021

Hi @alexander-pv ,

Thanks for the quick reply!

Please note that it is not happening in the first epoch in either case (with or without data augmentation). With augmentation in place it was at the 6th epoch, and without augmentation at the 4th epoch.

Please find the TensorBoard graphs:
[three TensorBoard loss-curve screenshots attached]

One more thing: I thought of checking with mini data to debug the issue. I set 150 epochs and it is going fine so far (it is around the 80th epoch).

The image processed in that batch is an 'Aeroplane', and it has only one mask in it.

Thanks,
Veeru.

@VeeranjaneyuluToka
Author

VeeranjaneyuluToka commented Dec 21, 2021

When I print the class IDs from the __getitem__() method, some of them are negative, as below:
[ 26 26 52 52 52 52 1 1 1 1 1 1 1 1 1 1 1 1
26 52 52 52 52 52 52 52 52 1 26 52 -1 -52]
[ 26 26 52 52 52 52 1 1 1 1 1 1 1 1 1 1 1 1
26 52 52 52 52 52 52 52 52 1 26 52 -1 -52]

The negative values seem to signify crowd annotations, but those are handled in build_rpn_targets(), so I do not think that should be the problem. Still investigating the probable cause.
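For context, here is a minimal sketch of how negative class IDs (COCO crowd annotations) are commonly treated when building RPN targets, assuming Matterport-style conventions rather than this repo's exact code: anchors overlapping a crowd box are simply excluded from the pool of negative samples.

import numpy as np


def compute_overlaps(boxes1, boxes2):
    """IoU matrix between two sets of boxes in (y1, x1, y2, x2) format."""
    ious = np.zeros((boxes1.shape[0], boxes2.shape[0]), dtype=np.float32)
    for j, b2 in enumerate(boxes2):
        y1 = np.maximum(boxes1[:, 0], b2[0])
        x1 = np.maximum(boxes1[:, 1], b2[1])
        y2 = np.minimum(boxes1[:, 2], b2[2])
        x2 = np.minimum(boxes1[:, 3], b2[3])
        inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
        area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
        area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
        ious[:, j] = inter / np.maximum(area1 + area2 - inter, 1e-8)
    return ious


def split_crowd_boxes(gt_class_ids, gt_boxes, anchors):
    """Drop crowd boxes (negative class IDs) from the GT set and mark anchors
    that must NOT be used as negatives because they overlap a crowd region."""
    crowd_ix = np.where(gt_class_ids < 0)[0]
    if crowd_ix.size > 0:
        crowd_boxes = gt_boxes[crowd_ix]
        gt_boxes = np.delete(gt_boxes, crowd_ix, axis=0)
        crowd_overlaps = compute_overlaps(anchors, crowd_boxes)
        no_crowd_bool = crowd_overlaps.max(axis=1) < 0.001
    else:
        no_crowd_bool = np.ones(anchors.shape[0], dtype=bool)
    return gt_boxes, no_crowd_bool


if __name__ == "__main__":
    anchors = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=np.float32)
    ids = np.array([26, -52])
    boxes = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=np.float32)
    print(split_crowd_boxes(ids, boxes, anchors))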

@VeeranjaneyuluToka
Author

It happened again, but this time it is the bbox loss. If I am right, it is most likely happening only in the RPN (either the bbox or the class loss). Here is the log for bbox:

outputs = model.train_step(data)
File "F:\users\downloads\source\maskrcnn_tf2-master\src\model.py", line 87, in train_step
assert not np.any(np.isnan(rpn_bbox_loss_val))
AssertionError

I will check the implementation of the RPN in case I can figure something out.
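For orientation, a minimal sketch of a standard Mask R-CNN-style RPN bbox loss (smooth-L1 over positive anchors only), not necessarily this repo's exact code; it assumes the target deltas are aligned per anchor. The classic NaN source here is normalising by the number of positive anchors when a batch contains none, which tf.math.divide_no_nan (or a small epsilon) guards against.

import tensorflow as tf


def smooth_l1(y_true, y_pred):
    # Smooth-L1: quadratic for |diff| < 1, linear beyond.
    diff = tf.abs(y_true - y_pred)
    less_than_one = tf.cast(diff < 1.0, tf.float32)
    return less_than_one * 0.5 * diff ** 2 + (1.0 - less_than_one) * (diff - 0.5)


def rpn_bbox_loss(target_deltas, rpn_match, rpn_deltas):
    """target_deltas/rpn_deltas: [batch, anchors, 4] (assumed aligned per anchor);
    rpn_match: [batch, anchors, 1] with 1=positive, -1=negative, 0=neutral."""
    rpn_match = tf.squeeze(rpn_match, -1)
    positive_ix = tf.where(tf.equal(rpn_match, 1))
    pred = tf.gather_nd(rpn_deltas, positive_ix)
    target = tf.gather_nd(target_deltas, positive_ix)
    loss = smooth_l1(target, pred)
    # divide_no_nan returns 0 instead of NaN when there are no positive anchors.
    return tf.math.divide_no_nan(tf.reduce_sum(loss),
                                 tf.cast(tf.size(loss), tf.float32))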

@VeeranjaneyuluToka
Author

VeeranjaneyuluToka commented Dec 28, 2021

!!! Detected Infinity or NaN in output 0 of eagerly-executing op "Log" (# of outputs: 1) !!!
dtype: <dtype: 'float32'>
shape: (1, 200, 1)
#of -Inf elements: 100
Input tensor: tf.Tensor(****)

I have used tf.debugging.enable_check_numerics(), and it reported the above. I started analysing these tensors. I will post here if any of these findings help to resolve the NaN loss issue.
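For anyone reproducing this, a minimal sketch of that numerics check together with the usual epsilon guard for logarithms; the tensor values below are made up for illustration.

import tensorflow as tf

tf.debugging.enable_check_numerics()   # raise as soon as any op yields Inf/NaN

probs = tf.constant([[0.0], [0.3], [1.0]], dtype=tf.float32)

# log(0) would produce -Inf and trip the check; clip inputs to a small epsilon.
eps = tf.keras.backend.epsilon()       # 1e-7 by default
safe_log = tf.math.log(tf.clip_by_value(probs, eps, 1.0))
print(safe_log)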

@alexander-pv
Owner

alexander-pv commented Dec 29, 2021

Hi, @VeeranjaneyuluToka,

Thanks for the information!

I conducted a small review of the repository after a long absence, updated the general structure of the model class, and added epsilons where there is a risk of getting NaN, for example in logarithms or division operations.

I trained the model with a frozen MobileNet backbone, batch_size=2, and train_bn=True using coco_minitrain.py, with 1000 train and 100 val images for 150 epochs. No NaN was caught.
I also added a check_loss_nan method to the MaskRCNN class with an optional feature of replacing NaNs with some other value instead of a straightforward assertion.
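The idea is roughly the following sketch (assumed, not necessarily the exact implementation in the repo); it runs eagerly and either asserts on a NaN loss or replaces it with a fallback value so training can continue.

import tensorflow as tf


def check_loss_nan(name, loss, assert_nans=True, replace_value=0.0):
    # Detect NaNs in the loss tensor; this bool() call assumes eager execution.
    if bool(tf.reduce_any(tf.math.is_nan(loss))):
        msg = f"Warning. Nan loss was found: {name}"
        if assert_nans:
            raise AssertionError(msg)
        tf.print(msg, "- replacing with", replace_value)
        loss = tf.where(tf.math.is_nan(loss),
                        tf.fill(tf.shape(loss), replace_value),
                        loss)
    return loss


# Example: a deliberately broken loss value, with replacement enabled.
print(check_loss_nan("rpn_bbox_loss", tf.constant([float("nan"), 0.5]),
                     assert_nans=False))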

I have a hypothesis that some kind of image or class produces these NaNs. I would be grateful if you could run training on the full COCO dataset after the recent repo updates.

[Attached: training loss screenshots coco_mini_1 through coco_mini_4]

@VeeranjaneyuluToka
Author

Hi @alexander-pv ,

Thanks for looking into this issue. I will pick up the latest changes and trigger training again with the complete dataset.

However, I had also tried coco_minitrain earlier with an EfficientNet-B0 backbone, and it did not fail with NaNs.

@VeeranjaneyuluToka
Author

Hi, @alexander-pv ,

I wish you a very happy new year.

I have started training with the complete dataset; however, it is still failing with a NaN loss:

File "coco_train.py", line 98, in
sys.exit(coco_train(coco.coco_parse_arguments()))
File "coco_train.py", line 89, in coco_train
train_model(model,
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\training.py", line 140, in train_model
model.fit(train_datagen,#train_datagen.repeat(),
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1095, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 800, in train_function
return step_function(self, iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3417, in _call_for_each_replica
return fn(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 572, in wrapper
return func(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 783, in run_step
outputs = model.train_step(data)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 163, in train_step
loss = self.get_summary_loss(rpn_class_loss_val, rpn_bbox_loss_val, mrcnn_class_loss_val,
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 102, in get_summary_loss
rpn_bbox_loss = self.check_loss_nan('rpn_bbox_loss', rpn_bbox_loss, assert_nans)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 80, in check_loss_nan
raise AssertionError(msg)
AssertionError: Warning. Nan loss was found: rpn_bbox_loss

@VeeranjaneyuluToka
Author

Hi, @alexander-pv ,

FYI, I have decreased the learning rate from 0.001 to 0.0001, and it is at the 7th epoch as of now. I am not sure if this will resolve the NaN loss issue we are facing. I will keep you posted on the progress.
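For completeness, a minimal sketch (assumed setup, not the actual training script) of the two usual knobs when RPN losses blow up, a lower learning rate and gradient clipping on the optimizer:

import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(
    learning_rate=1e-4,   # lowered from 1e-3, as tried above
    momentum=0.9,
    clipnorm=5.0,         # cap the gradient norm to avoid sudden huge updates
)

# The model would then be compiled with this optimizer, e.g.:
# model.compile(optimizer=optimizer)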
