
NaN loss while training with COCO dataset #11

Open
VeeranjaneyuluToka opened this issue Dec 19, 2021 · 9 comments

@VeeranjaneyuluToka

Hi,

I have been trying to train a model on COCO data using coco_train.py, but I keep ending up with NaN losses. It shows different NaN loss errors at different times: earlier I was getting a NaN loss w.r.t. the bbox, and now the classification, as shown below:

File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\training.py", line 134, in train_model
model.fit(train_datagen.repeat(),
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1095, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 800, in train_function
return step_function(self, iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3417, in _call_for_each_replica
return fn(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 572, in wrapper
return func(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 783, in run_step
outputs = model.train_step(data)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 86, in train_step
assert not np.any(np.isnan(rpn_class_loss_val))

I have completely disabled augmentation, but it still happens. Do you have any suggestions for training the COCO weights successfully?

Thanks,
Veeru.

@alexander-pv
Owner

alexander-pv commented Dec 19, 2021

Hi, @VeeranjaneyuluToka,

Could you please provide more information about the issue to speed up the process:

  1. What's happening with the losses according to TensorBoard?
  2. What images were loaded in the batch that caused NaNs in the losses? The info can be extracted from the __getitem__ method of the SegmentationDataset class (see the sketch after this list).
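For reference, a minimal sketch (not the repo's actual code) of how such per-batch logging could look inside __getitem__; the class name SegmentationDatasetDebug, the images_meta list, and the 'id' key are assumptions for illustration only.

# Minimal sketch: log which images end up in each batch so the batch that
# trips the NaN assertion can be identified afterwards.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("SegmentationDataset")


class SegmentationDatasetDebug:
    """Hypothetical stand-in for the SegmentationDataset batching logic."""

    def __init__(self, images_meta, batch_size=2):
        self.images_meta = images_meta  # list of dicts, each with at least an 'id'
        self.batch_size = batch_size

    def __getitem__(self, idx):
        batch = self.images_meta[idx * self.batch_size:(idx + 1) * self.batch_size]
        # When training crashes on the NaN assertion, the last logged line
        # points to the images of the offending batch.
        logger.info("batch %d image ids: %s", idx, [m["id"] for m in batch])
        return batch


if __name__ == "__main__":
    ds = SegmentationDatasetDebug([{"id": i} for i in range(6)], batch_size=2)
    _ = ds[1]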

Thanks,
Alexander

@VeeranjaneyuluToka
Author

VeeranjaneyuluToka commented Dec 20, 2021

Hi @alexander-pv ,

Thanks for the quick reply!

Please note that it is not happening in the first epoch in either case (with or without data augmentation). With augmentation in place it was at the 6th epoch, and without augmentation at the 4th epoch.

Please find the TensorBoard graphs:
[three TensorBoard loss-curve screenshots attached]

One more thing: I thought of checking with mini data to debug the issue. I set 150 epochs and it is going fine so far (it is around the 80th epoch).

The image processed in that batch is an 'Aeroplane', and it has only one mask in it.

Thanks,
Veeru.

@VeeranjaneyuluToka
Author

VeeranjaneyuluToka commented Dec 21, 2021

When I print the class IDs from the __getitem__() method, some of them are negative, as below:
[ 26 26 52 52 52 52 1 1 1 1 1 1 1 1 1 1 1 1
26 52 52 52 52 52 52 52 52 1 26 52 -1 -52]
[ 26 26 52 52 52 52 1 1 1 1 1 1 1 1 1 1 1 1
26 52 52 52 52 52 52 52 52 1 26 52 -1 -52]

The negative values seem to signify crowd annotations, but those are handled in build_rpn_targets(), so I do not think that should be the problem. Still investigating the probable cause.
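For context, here is a minimal sketch of how negative class IDs (COCO crowd annotations) are commonly treated when building RPN targets, assuming Matterport-style conventions rather than this repo's exact code: anchors overlapping a crowd box are simply excluded from the pool of negative samples.

import numpy as np


def compute_overlaps(boxes1, boxes2):
    """IoU matrix between two sets of boxes in (y1, x1, y2, x2) format."""
    ious = np.zeros((boxes1.shape[0], boxes2.shape[0]), dtype=np.float32)
    for j, b2 in enumerate(boxes2):
        y1 = np.maximum(boxes1[:, 0], b2[0])
        x1 = np.maximum(boxes1[:, 1], b2[1])
        y2 = np.minimum(boxes1[:, 2], b2[2])
        x2 = np.minimum(boxes1[:, 3], b2[3])
        inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
        area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
        area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
        ious[:, j] = inter / np.maximum(area1 + area2 - inter, 1e-8)
    return ious


def split_crowd_boxes(gt_class_ids, gt_boxes, anchors):
    """Drop crowd boxes (negative class IDs) from the GT set and mark anchors
    that must NOT be used as negatives because they overlap a crowd region."""
    crowd_ix = np.where(gt_class_ids < 0)[0]
    if crowd_ix.size > 0:
        crowd_boxes = gt_boxes[crowd_ix]
        gt_boxes = np.delete(gt_boxes, crowd_ix, axis=0)
        crowd_overlaps = compute_overlaps(anchors, crowd_boxes)
        no_crowd_bool = crowd_overlaps.max(axis=1) < 0.001
    else:
        no_crowd_bool = np.ones(anchors.shape[0], dtype=bool)
    return gt_boxes, no_crowd_bool


if __name__ == "__main__":
    anchors = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=np.float32)
    ids = np.array([26, -52])
    boxes = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=np.float32)
    print(split_crowd_boxes(ids, boxes, anchors))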

@VeeranjaneyuluToka
Author

It happened again, but this time it is the bbox loss. If I am right, it is most likely happening only in the RPN (either the bbox or the class loss). Here is the log for bbox:

outputs = model.train_step(data)
File "F:\users\downloads\source\maskrcnn_tf2-master\src\model.py", line 87, in train_step
assert not np.any(np.isnan(rpn_bbox_loss_val))
AssertionError

I will check the implementation of the RPN in case I can figure something out.
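For orientation, a minimal sketch of a standard Mask R-CNN-style RPN bbox loss (smooth-L1 over positive anchors only), not necessarily this repo's exact code; it assumes the target deltas are aligned per anchor. The classic NaN source here is normalising by the number of positive anchors when a batch contains none, which tf.math.divide_no_nan (or a small epsilon) guards against.

import tensorflow as tf


def smooth_l1(y_true, y_pred):
    # Smooth-L1: quadratic for |diff| < 1, linear beyond.
    diff = tf.abs(y_true - y_pred)
    less_than_one = tf.cast(diff < 1.0, tf.float32)
    return less_than_one * 0.5 * diff ** 2 + (1.0 - less_than_one) * (diff - 0.5)


def rpn_bbox_loss(target_deltas, rpn_match, rpn_deltas):
    """target_deltas/rpn_deltas: [batch, anchors, 4] (assumed aligned per anchor);
    rpn_match: [batch, anchors, 1] with 1=positive, -1=negative, 0=neutral."""
    rpn_match = tf.squeeze(rpn_match, -1)
    positive_ix = tf.where(tf.equal(rpn_match, 1))
    pred = tf.gather_nd(rpn_deltas, positive_ix)
    target = tf.gather_nd(target_deltas, positive_ix)
    loss = smooth_l1(target, pred)
    # divide_no_nan returns 0 instead of NaN when there are no positive anchors.
    return tf.math.divide_no_nan(tf.reduce_sum(loss),
                                 tf.cast(tf.size(loss), tf.float32))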

@VeeranjaneyuluToka
Author

VeeranjaneyuluToka commented Dec 28, 2021

!!! Detected Infinity or NaN in output 0 of eagerly-executing op "Log" (# of outputs: 1) !!!
dtype: <dtype: 'float32'>
shape: (1, 200, 1)
#of -Inf elements: 100
Input tensor: tf.Tensor(****)

I have used tf.debugging.enable_check_numerics(), and it reported the above. I started analysing these tensors. I will post here if any of these findings help to resolve the NaN loss issue.
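For anyone reproducing this, a minimal sketch of that numerics check together with the usual epsilon guard for logarithms; the tensor values below are made up for illustration.

import tensorflow as tf

tf.debugging.enable_check_numerics()   # raise as soon as any op yields Inf/NaN

probs = tf.constant([[0.0], [0.3], [1.0]], dtype=tf.float32)

# log(0) would produce -Inf and trip the check; clip inputs to a small epsilon.
eps = tf.keras.backend.epsilon()       # 1e-7 by default
safe_log = tf.math.log(tf.clip_by_value(probs, eps, 1.0))
print(safe_log)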

@alexander-pv
Owner

alexander-pv commented Dec 29, 2021

Hi, @VeeranjaneyuluToka,

Thanks for the information!

I conducted a small review of the repository after a long absence, updated the general structure of the model class, and added epsilons where there is a risk of getting NaN, for example in logarithms or division operations.

I trained the model with a frozen MobileNet backbone, batch_size=2, and train_bn=True using coco_minitrain.py, with 1000 train and 100 val images for 150 epochs. No NaN was caught.
I also added a check_loss_nan method to the MaskRCNN class with an optional feature of replacing NaNs with some other value instead of a straightforward assertion.
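The idea is roughly the following sketch (assumed, not necessarily the exact implementation in the repo); it runs eagerly and either asserts on a NaN loss or replaces it with a fallback value so training can continue.

import tensorflow as tf


def check_loss_nan(name, loss, assert_nans=True, replace_value=0.0):
    # Detect NaNs in the loss tensor; this bool() call assumes eager execution.
    if bool(tf.reduce_any(tf.math.is_nan(loss))):
        msg = f"Warning. Nan loss was found: {name}"
        if assert_nans:
            raise AssertionError(msg)
        tf.print(msg, "- replacing with", replace_value)
        loss = tf.where(tf.math.is_nan(loss),
                        tf.fill(tf.shape(loss), replace_value),
                        loss)
    return loss


# Example: a deliberately broken loss value, with replacement enabled.
print(check_loss_nan("rpn_bbox_loss", tf.constant([float("nan"), 0.5]),
                     assert_nans=False))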

I have a hypothesis that some kind of image or class produces these NaNs. I would be grateful if you could run training on the full COCO dataset after the recent repo updates.

[Attached: training loss screenshots coco_mini_1 through coco_mini_4]

@VeeranjaneyuluToka
Author

Hi @alexander-pv ,

Thanks for looking into this issue. I will pick up the latest changes and trigger training again with the complete dataset.

However, I had also tried coco_minitrain earlier with an EfficientNet-B0 backbone, and it did not fail with NaNs.

@VeeranjaneyuluToka
Author

Hi, @alexander-pv ,

I wish you a very happy new year.

I have started training with the complete dataset; however, it is still failing with a NaN loss:

File "coco_train.py", line 98, in
sys.exit(coco_train(coco.coco_parse_arguments()))
File "coco_train.py", line 89, in coco_train
train_model(model,
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\training.py", line 140, in train_model
model.fit(train_datagen,#train_datagen.repeat(),
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1095, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 800, in train_function
return step_function(self, iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3417, in _call_for_each_replica
return fn(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 572, in wrapper
return func(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 783, in run_step
outputs = model.train_step(data)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 163, in train_step
loss = self.get_summary_loss(rpn_class_loss_val, rpn_bbox_loss_val, mrcnn_class_loss_val,
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 102, in get_summary_loss
rpn_bbox_loss = self.check_loss_nan('rpn_bbox_loss', rpn_bbox_loss, assert_nans)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 80, in check_loss_nan
raise AssertionError(msg)
AssertionError: Warning. Nan loss was found: rpn_bbox_loss

@VeeranjaneyuluToka
Author

Hi, @alexander-pv ,

FYI, I have decreased the learning rate from 0.001 to 0.0001, and it is at the 7th epoch as of now. I am not sure if this will resolve the NaN loss issue we are facing. I will keep you posted on the progress.
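For completeness, a minimal sketch (assumed setup, not the actual training script) of the two usual knobs when RPN losses blow up, a lower learning rate and gradient clipping on the optimizer:

import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(
    learning_rate=1e-4,   # lowered from 1e-3, as tried above
    momentum=0.9,
    clipnorm=5.0,         # cap the gradient norm to avoid sudden huge updates
)

# The model would then be compiled with this optimizer, e.g.:
# model.compile(optimizer=optimizer)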
