Total time difference when training #65

Closed
alic-xc opened this issue May 23, 2021 · 8 comments

Comments

@alic-xc

alic-xc commented May 23, 2021

Hi guys,
Is there any time difference between training on 300 images vs 150 images?
I tried training on 300 images with 55 validation images (epochs=300), and Google Colab terminated the session at epoch 267/300; it took over 7 hours to get that far. So I divided the dataset into two, hoping that would cut the time by about 50%, but from what I can see below, it is taking about the same time as before.

Using resnet50 as network backbone For Mask R-CNN model
Applying Default Augmentation on Dataset
Train 300 images
Validate 55 images
Checkpoint Path: /content/mask_rcnn_models
Selecting layers to train
Epoch 1/300
100/100 [==============================] - 205s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.9180 - rpn_class_loss: 0.0146 - rpn_bbox_loss: 0.3307 - mrcnn_class_loss: 0.0552 - mrcnn_bbox_loss: 0.3484 - mrcnn_mask_loss: 0.1690 - val_loss: 0.6101 - val_rpn_class_loss: 0.0078 - val_rpn_bbox_loss: 0.2722 - val_mrcnn_class_loss: 0.0228 - val_mrcnn_bbox_loss: 0.1798 - val_mrcnn_mask_loss: 0.1275
Epoch 2/300
100/100 [==============================] - 121s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.4959 - rpn_class_loss: 0.0057 - rpn_bbox_loss: 0.1984 - mrcnn_class_loss: 0.0274 - mrcnn_bbox_loss: 0.1425 - mrcnn_mask_loss: 0.1218 - val_loss: 0.5547 - val_rpn_class_loss: 0.0047 - val_rpn_bbox_loss: 0.2960 - val_mrcnn_class_loss: 0.0110 - val_mrcnn_bbox_loss: 0.1219 - val_mrcnn_mask_loss: 0.1212
Epoch 3/300
100/100 [==============================] - 126s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.4234 - rpn_class_loss: 0.0043 - rpn_bbox_loss: 0.1824 - mrcnn_class_loss: 0.0206 - mrcnn_bbox_loss: 0.1022 - mrcnn_mask_loss: 0.1140 - val_loss: 0.3582 - val_rpn_class_loss: 0.0029 - val_rpn_bbox_loss: 0.1576 - val_mrcnn_class_loss: 0.0124 - val_mrcnn_bbox_loss: 0.0807 - val_mrcnn_mask_loss: 0.1046
Epoch 4/300
100/100 [==============================] - 121s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.3597 - rpn_class_loss: 0.0033 - rpn_bbox_loss: 0.1438 - mrcnn_class_loss: 0.0164 - mrcnn_bbox_loss: 0.0839 - mrcnn_mask_loss: 0.1123 - val_loss: 0.3611 - val_rpn_class_loss: 0.0018 - val_rpn_bbox_loss: 0.1736 - val_mrcnn_class_loss: 0.0076 - val_mrcnn_bbox_loss: 0.0670 - val_mrcnn_mask_loss: 0.1111
Epoch 5/300
100/100 [==============================] - 121s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.3001 - rpn_class_loss: 0.0025 - rpn_bbox_loss: 0.1137 - mrcnn_class_loss: 0.0122 - mrcnn_bbox_loss: 0.0595 - mrcnn_mask_loss: 0.1123 - val_loss: 0.3264 - val_rpn_class_loss: 0.0020 - val_rpn_bbox_loss: 0.1344 - val_mrcnn_class_loss: 0.0089 - val_mrcnn_bbox_loss: 0.0771 - val_mrcnn_mask_loss: 0.1040
Epoch 6/300
100/100 [==============================] - 125s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.2718 - rpn_class_loss: 0.0019 - rpn_bbox_loss: 0.0992 - mrcnn_class_loss: 0.0095 - mrcnn_bbox_loss: 0.0556 - mrcnn_mask_loss: 0.1055 - val_loss: 0.2959 - val_rpn_class_loss: 0.0023 - val_rpn_bbox_loss: 0.1174 - val_mrcnn_class_loss: 0.0098 - val_mrcnn_bbox_loss: 0.0614 - val_mrcnn_mask_loss: 0.1050
Epoch 7/300
100/100 [==============================] - 120s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.2894 - rpn_class_loss: 0.0022 - rpn_bbox_loss: 0.1127 - mrcnn_class_loss: 0.0113 - mrcnn_bbox_loss: 0.0562 - mrcnn_mask_loss: 0.1071 - val_loss: 0.3831 - val_rpn_class_loss: 0.0028 - val_rpn_bbox_loss: 0.1883 - val_mrcnn_class_loss: 0.0095 - val_mrcnn_bbox_loss: 0.0698 - val_mrcnn_mask_loss: 0.1127
Epoch 8/300

Using resnet50 as network backbone For Mask R-CNN model
Applying Default Augmentation on Dataset
Train 150 images
Validate 28 images
Checkpoint Path: /content/mask_rcnn_models
Selecting layers to train
Epoch 1/300
100/100 [==============================] - 192s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.8760 - rpn_class_loss: 0.0113 - rpn_bbox_loss: 0.2987 - mrcnn_class_loss: 0.0464 - mrcnn_bbox_loss: 0.3224 - mrcnn_mask_loss: 0.1972 - val_loss: 0.6289 - val_rpn_class_loss: 0.0057 - val_rpn_bbox_loss: 0.2860 - val_mrcnn_class_loss: 0.0325 - val_mrcnn_bbox_loss: 0.1937 - val_mrcnn_mask_loss: 0.1110
Epoch 2/300
100/100 [==============================] - 121s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.4598 - rpn_class_loss: 0.0034 - rpn_bbox_loss: 0.1973 - mrcnn_class_loss: 0.0159 - mrcnn_bbox_loss: 0.1235 - mrcnn_mask_loss: 0.1197 - val_loss: 0.4465 - val_rpn_class_loss: 0.0044 - val_rpn_bbox_loss: 0.1912 - val_mrcnn_class_loss: 0.0217 - val_mrcnn_bbox_loss: 0.1230 - val_mrcnn_mask_loss: 0.1063
Epoch 3/300
100/100 [==============================] - 120s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.3609 - rpn_class_loss: 0.0029 - rpn_bbox_loss: 0.1573 - mrcnn_class_loss: 0.0120 - mrcnn_bbox_loss: 0.0824 - mrcnn_mask_loss: 0.1064 - val_loss: 0.4007 - val_rpn_class_loss: 0.0034 - val_rpn_bbox_loss: 0.1635 - val_mrcnn_class_loss: 0.0149 - val_mrcnn_bbox_loss: 0.1122 - val_mrcnn_mask_loss: 0.1068
Epoch 4/300
100/100 [==============================] - 121s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.3058 - rpn_class_loss: 0.0023 - rpn_bbox_loss: 0.1208 - mrcnn_class_loss: 0.0082 - mrcnn_bbox_loss: 0.0707 - mrcnn_mask_loss: 0.1039 - val_loss: 0.3454 - val_rpn_class_loss: 0.0027 - val_rpn_bbox_loss: 0.1386 - val_mrcnn_class_loss: 0.0125 - val_mrcnn_bbox_loss: 0.0861 - val_mrcnn_mask_loss: 0.1055
Epoch 5/300
100/100 [==============================] - 121s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.2659 - rpn_class_loss: 0.0020 - rpn_bbox_loss: 0.1040 - mrcnn_class_loss: 0.0076 - mrcnn_bbox_loss: 0.0521 - mrcnn_mask_loss: 0.1001 - val_loss: 0.3603 - val_rpn_class_loss: 0.0025 - val_rpn_bbox_loss: 0.1673 - val_mrcnn_class_loss: 0.0144 - val_mrcnn_bbox_loss: 0.0813 - val_mrcnn_mask_loss: 0.0949
Epoch 6/300
100/100 [==============================] - 120s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.2282 - rpn_class_loss: 0.0016 - rpn_bbox_loss: 0.0842 - mrcnn_class_loss: 0.0054 - mrcnn_bbox_loss: 0.0399 - mrcnn_mask_loss: 0.0971 - val_loss: 0.3388 - val_rpn_class_loss: 0.0021 - val_rpn_bbox_loss: 0.1219 - val_mrcnn_class_loss: 0.0145 - val_mrcnn_bbox_loss: 0.0944 - val_mrcnn_mask_loss: 0.1059
Epoch 7/300
100/100 [==============================] - 120s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.1911 - rpn_class_loss: 0.0014 - rpn_bbox_loss: 0.0589 - mrcnn_class_loss: 0.0052 - mrcnn_bbox_loss: 0.0340 - mrcnn_mask_loss: 0.0915 - val_loss: 0.3143 - val_rpn_class_loss: 0.0018 - val_rpn_bbox_loss: 0.1305 - val_mrcnn_class_loss: 0.0069 - val_mrcnn_bbox_loss: 0.0735 - val_mrcnn_mask_loss: 0.1016
Epoch 8/300
100/100 [==============================] - 122s 1s/step - batch: 49.5000 - size: 4.0000 - loss: 0.1919 - rpn_class_loss: 0.0012 - rpn_bbox_loss: 0.0599 - mrcnn_class_loss: 0.0048 - mrcnn_bbox_loss: 0.0326 - mrcnn_mask_loss: 0.0934 - val_loss: 0.2894 - val_rpn_class_loss: 0.0019 - val_rpn_bbox_loss: 0.1249 - val_mrcnn_class_loss: 0.0065 - val_mrcnn_bbox_loss: 0.0618 - val_mrcnn_mask_loss: 0.0943

Which means there is no real difference between them, so I am still stuck spending over 7 hours to train on 150 images.

Is there any recommendation, or am I missing something?
I would appreciate any help.
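
For a rough sense of where the 7+ hours comes from, here is a back-of-the-envelope estimate based only on the log output above (both runs report 100 steps per epoch at roughly 1.2 s per step after the first epoch):

# Back-of-the-envelope estimate from the logs above.
# Assumption: steps per epoch is fixed at 100 for both runs, as the
# "100/100" progress bars show, and each step takes ~1.2 s after epoch 1.
steps_per_epoch = 100       # shown as "100/100" in both logs
seconds_per_step = 1.2      # ~120-126 s per epoch in the logs
num_epochs = 300

epoch_seconds = steps_per_epoch * seconds_per_step
total_hours = num_epochs * epoch_seconds / 3600
print(f"~{epoch_seconds:.0f} s per epoch, ~{total_hours:.1f} h for {num_epochs} epochs")
# -> ~120 s per epoch, ~10.0 h for 300 epochs, whether the dataset has 150 or
#    300 images, because the number of steps per epoch is the same in both runs.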

@khanfarhan10
Contributor

What's your batch size?

Try a batch size of 2-4 on the Colab GPU.
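
If it helps, the batch size in PixelLib is set through modelConfig. A minimal sketch, assuming the standard PixelLib custom-training setup (the backbone and class count below are placeholders for your own settings):

from pixellib.custom_train import instance_custom_training

train_maskrcnn = instance_custom_training()
# a batch_size of 2-4 is usually what a Colab GPU can handle for Mask R-CNN
train_maskrcnn.modelConfig(network_backbone = "resnet50", num_classes = 1, batch_size = 4)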

@alic-xc
Author

alic-xc commented May 24, 2021

I used the same batch size for both runs: batch size 4.

@khanfarhan10
Contributor

Can you share a notebook with a minimal reproducible ERROR?

@alic-xc
Author

alic-xc commented May 25, 2021

The thing is, there is no error to show. The issue is just that the total estimated training time for 300 images is the same, or almost the same, as for 150 images with a batch size of 4.

Can you share a notebook with a minimal reproducible ERROR?

@ayoolaolafenwa
Owner

@alic-xc This is a weakness of the Mask R-CNN algorithm: training it consumes a lot of compute. If you want to train faster, you will have to use a GPU with more capacity.
The other option is to train only the head layers of Mask R-CNN; by default I set it to train all the layers.

train_maskrcnn.train_model(num_epochs = 300, augmentation=True, layers = "heads", path_trained_models = "mask_rcnn_models")

In the train_model function, set the layers parameter to "heads".

Note: Training only the heads of Mask R-CNN may not reach validation losses as low as training all the layers.
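
Roughly, the full call sequence would look like this; it is only a sketch assuming the usual PixelLib custom-training workflow, with the dataset folder, class count, and pretrained weights file as placeholders:

from pixellib.custom_train import instance_custom_training

train_maskrcnn = instance_custom_training()
train_maskrcnn.modelConfig(network_backbone = "resnet50", num_classes = 1, batch_size = 4)
# COCO-pretrained weights as the starting point; adjust the file path as needed
train_maskrcnn.load_pretrained_model("mask_rcnn_coco.h5")
train_maskrcnn.load_dataset("your_dataset_folder")  # placeholder dataset directory
# layers = "heads" trains only the head layers instead of the full network
train_maskrcnn.train_model(num_epochs = 300, augmentation = True, layers = "heads", path_trained_models = "mask_rcnn_models")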

@ayoolaolafenwa
Owner

Can you share a notebook with a minimal reproducible ERROR?

Thank you @khanfarhan10 for your contributions.

@alic-xc
Author

alic-xc commented May 25, 2021

@alic-xc This is a weakness of the Mask R-CNN algorithm: training it consumes a lot of compute. If you want to train faster, you will have to use a GPU with more capacity. The other option is to train only the head layers of Mask R-CNN. [...]

Okay, I think I understand it better now.
Thanks @ayoolaolafenwa @khanfarhan10

@alic-xc alic-xc closed this as completed May 25, 2021
@khanfarhan10
Contributor

@alic-xc This is a weakness of the Mask R-CNN algorithm: training it consumes a lot of compute. If you want to train faster, you will have to use a GPU with more capacity. The other option is to train only the head layers of Mask R-CNN. [...]

Ah, now we know why training only the heads is sometimes required.
