Nan loss while training with coco dataset #11
Comments
Hi, @VeeranjaneyuluToka, could you please provide more information about the issue to speed up the process:
Thanks,
Hi @alexander-pv, thanks for the quick reply! Please note that it is not happening in the first epoch in either case (with or without data augmentation). With augmentation in place it happened at the 6th epoch, and without augmentation at the 4th. One more thing: I thought of checking with a small dataset and debugging the issue there. I have given it 150 epochs and it is going fine so far (it is around the 80th epoch). The image processed in the failing batch is 'Aeroplane' and it has only one mask in it. Thanks,
When I print the class ids from the getitem() method, some of them are negative, as below. The negative values appear to signify crowd annotations, but they are handled in build_rpn_targets(), so I do not think that should be the problem. Still investigating the probable reason.
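For reference, a minimal sketch of how negative class ids are commonly used to flag COCO "iscrowd" annotations so that the RPN target builder can ignore them; the values and variable names here are illustrative, not the repository's exact code:

import numpy as np

# Illustrative values: negative ids mark crowd instances, positive ids are real objects.
class_ids = np.array([1, -1, 3, 5, -2])
crowd_ix = np.where(class_ids < 0)[0]      # anchors overlapping these boxes are typically excluded
non_crowd_ix = np.where(class_ids > 0)[0]  # regular instances kept as positive/negative RPN targets
print("crowd:", crowd_ix, "objects:", non_crowd_ix)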
Another time, but this time it is with the bbox loss. Most likely, if I am right, it is happening only with the RPN (either the bbox or the class loss). Here is the relevant line from the log for bbox:
outputs = model.train_step(data)
I will check the implementation of the RPN in case I can figure something out.
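One way to narrow down which loss term produces the NaN is to wrap each loss value in tf.debugging.check_numerics inside the training step; the dictionary below uses hypothetical stand-in values rather than the real loss tensors computed in train_step():

import tensorflow as tf

# Stand-ins for the actual rpn_class_loss_val / rpn_bbox_loss_val tensors.
losses = {"rpn_class_loss": tf.constant(0.3), "rpn_bbox_loss": tf.constant(float("nan"))}
for name, value in losses.items():
    # Raises InvalidArgumentError naming the offending loss as soon as it contains Inf/NaN.
    tf.debugging.check_numerics(value, message=f"NaN/Inf detected in {name}")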
!!! Detected Infinity or NaN in output 0 of eagerly-executing op "Log" (# of outputs: 1) !!!
I have used tf.debugging.enable_check_numerics(), and it reported the above. I have started analysing these tensors and will post here if any of this helps to resolve the NaN loss issue.
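For context, enabling the global numerics check is a one-liner; it instruments every op so that the first one producing Inf or NaN fails with a report like the "Log" message above:

import tensorflow as tf

# Call before building/running the model; every subsequent op that outputs
# Inf or NaN raises an error identifying the op (e.g. "Log").
tf.debugging.enable_check_numerics()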
Hi, @VeeranjaneyuluToka, thanks for the information! I conducted a small review of the repository after a long absence, updated the general structure of the model class, and added epsilons where there is a risk of getting NaN, for example in logarithms or division operations. I trained the model with a frozen
I have a hypothesis that some kind of image or class produces these NaNs. I would be grateful if you could run training on the full COCO dataset after the recent repo updates.
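As a rough sketch of the kind of guard described above (the epsilon value and helper names are assumptions, not the repository's exact code):

import tensorflow as tf

EPS = 1e-7  # assumed magnitude; the repository may use a different constant

def safe_log(x):
    # log(0) gives -inf, which can turn into NaN downstream, so offset the argument first.
    return tf.math.log(x + EPS)

def safe_divide(numerator, denominator):
    # Guards normalization by counts that can legitimately be zero (e.g. no positive anchors).
    return numerator / (denominator + EPS)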
Hi @alexander-pv, thanks for looking into this issue. I will pick up the latest changes and trigger training again with the complete dataset. However, I have tried earlier with coco_minitrain and efficientnet-b0, and it did not fail with NaNs.
Hi, @alexander-pv, I wish you a very happy new year. I have started training with the complete dataset; however, it is still failing with NaN loss: File "coco_train.py", line 98, in
Hi, @alexander-pv, FYI, I have decreased the learning rate from 0.001 to 0.0001 and it is at the 7th epoch as of now. I am not sure if this helps to resolve the NaN loss issue we are facing. I will keep you posted on the progress.
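For completeness, a minimal sketch of the change described here, assuming a standard tf.keras optimizer (the actual optimizer and arguments used in coco_train.py may differ):

import tensorflow as tf

# Dropping the learning rate from 1e-3 to 1e-4, which often tames exploding losses.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)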
Hi,
I have been trying to train a model with COCO data using coco_train.py, but I am ending up with NaN losses. It shows a different NaN loss error at different times; earlier I was getting a NaN loss w.r.t. the bbox, and now the classification, as shown below:
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\training.py", line 134, in train_model
model.fit(train_datagen.repeat(),
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1095, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 800, in train_function
return step_function(self, iterator)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3417, in _call_for_each_replica
return fn(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 572, in wrapper
return func(*args, **kwargs)
File "C:\Users\veeru\Anaconda3\envs\tf2.4_mrcnn\lib\site-packages\tensorflow\python\keras\engine\training.py", line 783, in run_step
outputs = model.train_step(data)
File "C:\Users\veeru\Downloads\source\maskrcnn_tf2-master\src\model.py", line 86, in train_step
assert not np.any(np.isnan(rpn_class_loss_val))
I have completely disabled augmentation, but it still happens. Do you have any suggestions for training COCO weights successfully?
Thanks,
Veeru.