
Training with PyTorch is significantly slower than with PaddleOCR #9

Open · gaozhun opened this issue Aug 4, 2024 · 19 comments

@gaozhun (Contributor) commented Aug 4, 2024

I have noticed that the training speed of our model using PyTorch is significantly slower compared to PaddleOCR. Here are some specific examples using the SVTR model under different settings:

Without AMP (Automatic Mixed Precision):
PaddleOCR: 310 samples/s
PyTorch: 250 samples/s

With AMP enabled:
PaddleOCR: 490 samples/s
PyTorch: 150 samples/s

Is there any known reason for such a discrepancy in training speed? Are there optimizations or configurations we might be missing in our PyTorch setup that would achieve performance similar to or better than PaddleOCR's?

@Topdu (Owner) commented Aug 4, 2024

Please post the config file here and we will try to reproduce the issue.

@gaozhun (Contributor, Author) commented Aug 4, 2024

Thank you for your prompt response. Here is the configuration file that we are using for our training setup. We hope this helps in reproducing the issue.
OpenOCR:

Global:
  device: gpu
  epoch_num: 200
  log_smooth_window: 20
  print_batch_step: 10
  output_dir: ./output/rec
  eval_epoch_step: [0, 1]
  eval_batch_step: [0, 5000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  use_tensorboard: true
  infer_img:
  # for data or label process
  character_dict_path: &character_dict_path ./dicts/hwzh.dict
  max_text_length: &max_text_length 80
  use_space_char: &use_space_char False
  save_res_path: 
  use_amp: false
  distributed: true

Optimizer:
  name: AdamW
  lr: 0.0005
  weight_decay: 0.05
  filter_bias_and_bn: True

LRScheduler:
  name: CosineAnnealingLR
  warmup_epoch: 2

Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
  Encoder:
    name: SVTRNet
    img_size: [48, 320]
    out_char_num: 80 # W//4 or W//8 or W//12
    out_channels: 512
    patch_merging: Conv
    embed_dim: [192, 256, 512]
    depth: [6, 6, 9]
    num_heads: [6, 8, 16]
    mixer: ['Conv','Conv','Conv','Conv','Conv','Conv','Conv','Conv','Conv','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global']
    local_mixer: [[5, 5], [5, 5], [5, 5]]
    last_stage: True
    prenorm: True
  Decoder:
    name: CTCDecoder

Loss:
  name: CTCLoss
  zero_infinity: True

PostProcess:
  name: CTCLabelDecode
  character_dict_path: *character_dict_path
  use_space_char: *use_space_char

Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: True

Train:
  dataset:
    name: SimpleDataSet
    data_dir: .
    label_file_list:
    - ./xxx.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      # - SVTRRecAug:
      #     aug_type: 0 # or 1
      - CTCLabelEncode: # Class handling label
          character_dict_path: *character_dict_path
          use_space_char: *use_space_char
          max_text_length: *max_text_length
      - SVTRResize:
          image_shape: [3, 48, 320]
          padding: True
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 128
    drop_last: True
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: .
    label_file_list:
    - ./xxx.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
          character_dict_path: *character_dict_path
          use_space_char: *use_space_char
          max_text_length: *max_text_length
      - SVTRResize:
          image_shape: [3, 48, 320]
          padding: True
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 128
    num_workers: 8

Paddle:

Global:
  debug: false
  use_gpu: true
  epoch_num: 200
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/paddle_test
  save_epoch_step: 50
  # evaluation is run every 2000 iterations after the 0th iteration
  eval_batch_step: [0, 1000]
  cal_metric_during_train: true
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_tensorboard: true
  infer_img: 
  character_dict_path: dicts/hwzh.dict
  max_text_length: &max_text_length 80
  infer_mode: false
  use_space_char: false
  ignore_space: true
  distributed: true
  save_res_path: ./output/paddle_test


Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
  epsilon: 1.0e-08
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed char_node_embed pos_node_embed char_pos_embed vis_pos_embed
  one_dim_param_no_weight_decay: true
  lr:
    name: Cosine
    learning_rate: 0.0005 # 8gpus 64bs
    warmup_epoch: 5


Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform: null    
  Backbone:
    name: SVTRNet
    img_size:
    - 48
    - 320
    out_char_num: 80
    out_channels: 512
    patch_merging: Conv
    embed_dim: [192, 256, 512]
    depth: [6, 6, 9]
    num_heads: [6, 8, 16]
    mixer: ['Conv','Conv','Conv','Conv','Conv','Conv','Conv','Conv','Conv','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global']
    local_mixer: [[5, 5], [5, 5], [5, 5]]
    last_stage: True
    prenorm: True
  Neck:
    name: SequenceEncoder
    encoder_type: reshape
  Head:
    name: CTCHead

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: true

Train:
  dataset:
    name: SimpleDataSet
    data_dir: .
    label_file_list:
    - ./xxx.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - CTCLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label
        - length
  loader:
    shuffle: true
    batch_size_per_card: 128
    drop_last: true
    num_workers: 8
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: .
    label_file_list:
    - ./xxx.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - CTCLabelEncode:
    - SVTRRecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label
        - length
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 128
    num_workers: 4

@Topdu (Owner) commented Aug 4, 2024

Based on the provided config file, we obtain the following results:
Without AMP (Automatic Mixed Precision):
PaddleOCR: 240 samples/s
OpenOCR: 225 samples/s

With AMP enabled:
PaddleOCR: 435 samples/s
OpenOCR: 325 samples/s

The possible reasons are as follows:

1. When there are too many character classes (e.g. Chinese), running evaluation during training slows the training down. This can be avoided by setting cal_metric_during_train: False.
2. torch's built-in CTCLoss is not as fast as Paddle's CTCLoss (the best-performing CTCLoss implementation on GPUs).

In addition, Paddle may crash during training when using AMP, so its AMP parameters need to be set carefully. Torch AMP performs consistently, with results nearly identical to full precision.
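For reference, the built-in loss in question is torch's nn.CTCLoss, with the zero_infinity option from the config above. A small usage sketch; the shapes and class count are illustrative only (blank index 0 is an assumption):

```python
import torch
import torch.nn as nn

# Sketch of torch's built-in CTC loss as configured above
# (zero_infinity: True, blank assumed at index 0). Shapes are illustrative.
T, N, C = 80, 4, 6625          # time steps, batch size, classes incl. blank
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 20), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```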

@gaozhun (Contributor, Author) commented Aug 4, 2024

Thank you for your prompt response and for sharing your results.

However, I am puzzled as to why my training speed decreases when AMP is enabled. It seems that instead of speeding up, my training slows down significantly with AMP. For your reference, I am using PyTorch version 2.1.1 with CUDA 11.8.

Any insights or suggestions on what might be causing this issue would be greatly appreciated.

Thank you!

@Topdu (Owner) commented Aug 4, 2024

It is indeed puzzling. We use the same setup: PyTorch 2.1.1 with CUDA 11.8. What GPU type are you using?

@gaozhun (Contributor, Author) commented Aug 4, 2024

Thank you for your response.

I am using four NVIDIA H800 GPUs for training.

@Topdu (Owner) commented Aug 4, 2024

Unfortunately, we have no way to run it on an H800; the GPUs we use are 4090s. If you can, please try to verify it again on another type of GPU.

@gaozhun (Contributor, Author) commented Aug 4, 2024

Thank you for your understanding.
I will take the opportunity to verify it on another type of GPU and will update you with the results.
Thanks again for your assistance!

@gaozhun (Contributor, Author) commented Aug 5, 2024

I conducted some experiments on a machine with V100 GPUs, using the OpenOCR code directly pulled from the repository without any modifications. Due to insufficient GPU memory, I had to reduce the batch size to 64. The results show that, without AMP, the PyTorch version seems to perform better. However, with AMP enabled, the performance of PyTorch's OpenOCR is significantly lower, which is quite strange.

Here are the details:

Batch Size: 64, GPU: V100

Without AMP:
PaddleOCR: 77 samples/s
OpenOCR (PyTorch): 92 samples/s

With AMP enabled:
PaddleOCR: 178 samples/s
OpenOCR (PyTorch): 45 samples/s

Additionally, when using AMP, I encounter the following warning, even though the code does not appear to call them in that order:

[2024/08/05 14:16:33] openrec INFO: During the training process, after the 0th iteration, an evaluation is run every 500 iterations
/home/anaconda3/envs/torch2/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
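This warning commonly appears under AMP because GradScaler.step() silently skips optimizer.step() on iterations where it detects inf/NaN gradients (frequent in the first few iterations while the loss scale calibrates), so scheduler.step() ends up running without a preceding optimizer.step(). That is an assumption about the cause, not a confirmed reading of OpenOCR's loop; a sketch of one common guard:

```python
import torch

# Hypothetical loop fragment showing why the warning fires under AMP and
# one common way to guard scheduler.step(); not OpenOCR's actual code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 2).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def train_step(x, y):
    x, y = x.to(device), y.to(device)
    opt.zero_grad(set_to_none=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()
    # scaler.step() silently skips opt.step() when grads contain inf/NaN;
    # on those iterations sched.step() would run "before" any optimizer
    # step, which is exactly what the UserWarning complains about.
    scaler.step(opt)
    scaler.update()            # halves the scale if the step was skipped
    if scaler.get_scale() >= scale_before:
        sched.step()           # only advance the schedule on real steps
```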

@Topdu (Owner) commented Aug 5, 2024

Please post the log like this:

[2024/08/04 11:36:10] openrec INFO: epoch: [1/2], global_step: 310, lr: 0.000006, acc: 0.000000, norm_edit_dis: 0.000000, num_samples: 64.000000, loss: 22.806549, avg_reader_cost: 0.00679 s, avg_batch_cost: 0.19636 s, avg_samples: 64.0, ips: 325.93692 samples/s, eta: 1:25:07

@gaozhun (Contributor, Author) commented Aug 5, 2024

OpenOCR (without AMP):

[2024/08/05 14:06:54] openrec INFO: epoch: [1/20], global_step: 40, lr: 0.000001, acc: 0.000000, norm_edit_dis: 0.000210, num_samples: 64.000000, loss: 676.488342, avg_reader_cost: 0.00903 s, avg_batch_cost: 0.69598 s, avg_samples: 64.0, ips: 91.95673 samples/s, eta: 2 days, 16:16:56

OpenOCR (with AMP enabled):

[2024/08/05 14:17:32] openrec INFO: epoch: [1/20], global_step: 40, lr: 0.000001, acc: 0.000000, norm_edit_dis: 0.000204, num_samples: 64.000000, loss: 676.378845, avg_reader_cost: 0.00928 s, avg_batch_cost: 1.40544 s, avg_samples: 64.0, ips: 45.53749 samples/s, eta: 4 days, 14:54:22

@Topdu (Owner) commented Aug 5, 2024

It's very strange. For reference, AMP (OpenOCR) on 3090 GPUs gives us a speedup of about 50%. You could try changing the torch version, or verifying with other models.

@gaozhun (Contributor, Author) commented Aug 5, 2024

Additionally, to investigate the performance further, I reduced the number of dictionary items from 8000 to 3000 and 1000, respectively. I observed that smaller dictionary sizes tend to align more with the expected performance improvements when AMP is enabled.

Here are the details:

Batch Size: 64, GPU: V100

Without AMP:

3000 dictionary items:

[2024/08/05 15:11:11] openrec INFO: epoch: [1/20], global_step: 40, lr: 0.000001, acc: 0.000000, norm_edit_dis: 0.000457, num_samples: 64.000000, loss: 610.872131, avg_reader_cost: 0.00905 s, avg_batch_cost: 0.62218 s, avg_samples: 64.0, ips: 102.86458 samples/s, eta: 2 days, 9:48:37

1000 dictionary items:

[2024/08/05 15:14:29] openrec INFO: epoch: [1/20], global_step: 40, lr: 0.000001, acc: 0.000000, norm_edit_dis: 0.001395, num_samples: 64.000000, loss: 528.975159, avg_reader_cost: 0.00901 s, avg_batch_cost: 0.58037 s, avg_samples: 64.0, ips: 110.27408 samples/s, eta: 2 days, 6:35:46

With AMP enabled:

3000 dictionary items:

[2024/08/05 15:12:45] openrec INFO: epoch: [1/20], global_step: 40, lr: 0.000001, acc: 0.000000, norm_edit_dis: 0.000468, num_samples: 64.000000, loss: 614.066589, avg_reader_cost: 0.00929 s, avg_batch_cost: 0.73057 s, avg_samples: 64.0, ips: 87.60238 samples/s, eta: 2 days, 14:42:51

1000 dictionary items:

[2024/08/05 15:16:30] openrec INFO: epoch: [1/20], global_step: 40, lr: 0.000001, acc: 0.000000, norm_edit_dis: 0.001320, num_samples: 64.000000, loss: 530.905640, avg_reader_cost: 0.00895 s, avg_batch_cost: 0.42328 s, avg_samples: 64.0, ips: 151.19841 samples/s, eta: 1 day, 15:25:31

@Topdu (Owner) commented Aug 5, 2024

Our experiments above were run with the English dictionary (94 categories plus 3 special categories). We will verify the effect of the number of character categories again based on the information you provided.

@Topdu (Owner) commented Aug 6, 2024

Chinese dictionary with 6624 characters:

Without AMP:

[2024/08/06 18:05:16] openrec INFO: epoch: [1/2], global_step: 100, lr: 0.000005, loss: 68.377701, avg_reader_cost: 0.00710 s, avg_batch_cost: 0.15078 s, avg_samples: 64.0, ips: 424.46925 samples/s, eta: 0:28:14
[2024/08/06 18:05:19] openrec INFO: epoch: [1/2], global_step: 110, lr: 0.000005, loss: 63.503082, avg_reader_cost: 0.00717 s, avg_batch_cost: 0.15115 s, avg_samples: 64.0, ips: 423.41447 samples/s, eta: 0:27:50
[2024/08/06 18:05:22] openrec INFO: epoch: [1/2], global_step: 120, lr: 0.000006, loss: 63.131191, avg_reader_cost: 0.00723 s, avg_batch_cost: 0.15088 s, avg_samples: 64.0, ips: 424.17841 samples/s, eta: 0:27:30
[2024/08/06 18:05:25] openrec INFO: epoch: [1/2], global_step: 130, lr: 0.000006, loss: 61.284630, avg_reader_cost: 0.00737 s, avg_batch_cost: 0.15119 s, avg_samples: 64.0, ips: 423.31444 samples/s, eta: 0:27:13
[2024/08/06 18:05:28] openrec INFO: epoch: [1/2], global_step: 140, lr: 0.000007, loss: 59.751579, avg_reader_cost: 0.00718 s, avg_batch_cost: 0.15115 s, avg_samples: 64.0, ips: 423.43237 samples/s, eta: 0:26:58
[2024/08/06 18:05:31] openrec INFO: epoch: [1/2], global_step: 150, lr: 0.000007, loss: 60.813232, avg_reader_cost: 0.00701 s, avg_batch_cost: 0.15105 s, avg_samples: 64.0, ips: 423.69061 samples/s, eta: 0:26:45
[2024/08/06 18:05:33] openrec INFO: epoch: [1/2], global_step: 160, lr: 0.000008, loss: 62.727783, avg_reader_cost: 0.00719 s, avg_batch_cost: 0.15147 s, avg_samples: 64.0, ips: 422.53764 samples/s, eta: 0:26:33
[2024/08/06 18:05:36] openrec INFO: epoch: [1/2], global_step: 170, lr: 0.000008, loss: 60.813217, avg_reader_cost: 0.00668 s, avg_batch_cost: 0.15051 s, avg_samples: 64.0, ips: 425.23351 samples/s, eta: 0:26:22
[2024/08/06 18:05:39] openrec INFO: epoch: [1/2], global_step: 180, lr: 0.000009, loss: 58.633625, avg_reader_cost: 0.00688 s, avg_batch_cost: 0.15068 s, avg_samples: 64.0, ips: 424.74361 samples/s, eta: 0:26:13
[2024/08/06 18:05:42] openrec INFO: epoch: [1/2], global_step: 190, lr: 0.000009, loss: 58.076485, avg_reader_cost: 0.00680 s, avg_batch_cost: 0.15077 s, avg_samples: 64.0, ips: 424.47697 samples/s, eta: 0:26:04
[2024/08/06 18:05:45] openrec INFO: epoch: [1/2], global_step: 200, lr: 0.000010, loss: 60.331661, avg_reader_cost: 0.00687 s, avg_batch_cost: 0.15088 s, avg_samples: 64.0, ips: 424.17754 samples/s, eta: 0:25:56
[2024/08/06 18:05:47] openrec INFO: epoch: [1/2], global_step: 210, lr: 0.000010, loss: 58.427017, avg_reader_cost: 0.00692 s, avg_batch_cost: 0.15091 s, avg_samples: 64.0, ips: 424.10510 samples/s, eta: 0:25:49
[2024/08/06 18:05:50] openrec INFO: epoch: [1/2], global_step: 220, lr: 0.000011, loss: 59.204517, avg_reader_cost: 0.00691 s, avg_batch_cost: 0.15059 s, avg_samples: 64.0, ips: 424.98623 samples/s, eta: 0:25:42
[2024/08/06 18:05:53] openrec INFO: epoch: [1/2], global_step: 230, lr: 0.000011, loss: 58.415764, avg_reader_cost: 0.00680 s, avg_batch_cost: 0.15089 s, avg_samples: 64.0, ips: 424.15837 samples/s, eta: 0:25:35
[2024/08/06 18:05:56] openrec INFO: epoch: [1/2], global_step: 240, lr: 0.000012, loss: 56.117561, avg_reader_cost: 0.00695 s, avg_batch_cost: 0.15076 s, avg_samples: 64.0, ips: 424.50422 samples/s, eta: 0:25:29
[2024/08/06 18:05:59] openrec INFO: epoch: [1/2], global_step: 250, lr: 0.000012, loss: 57.601730, avg_reader_cost: 0.00688 s, avg_batch_cost: 0.15065 s, avg_samples: 64.0, ips: 424.81621 samples/s, eta: 0:25:23
[2024/08/06 18:06:01] openrec INFO: epoch: [1/2], global_step: 260, lr: 0.000013, loss: 57.454727, avg_reader_cost: 0.00670 s, avg_batch_cost: 0.15068 s, avg_samples: 64.0, ips: 424.75228 samples/s, eta: 0:25:18
[2024/08/06 18:06:04] openrec INFO: epoch: [1/2], global_step: 270, lr: 0.000013, loss: 54.404583, avg_reader_cost: 0.00685 s, avg_batch_cost: 0.15115 s, avg_samples: 64.0, ips: 423.43403 samples/s, eta: 0:25:13
[2024/08/06 18:06:07] openrec INFO: epoch: [1/2], global_step: 280, lr: 0.000014, loss: 56.005775, avg_reader_cost: 0.00674 s, avg_batch_cost: 0.15045 s, avg_samples: 64.0, ips: 425.38014 samples/s, eta: 0:25:08
[2024/08/06 18:06:10] openrec INFO: epoch: [1/2], global_step: 290, lr: 0.000014, loss: 57.292404, avg_reader_cost: 0.00684 s, avg_batch_cost: 0.15064 s, avg_samples: 64.0, ips: 424.86011 samples/s, eta: 0:25:04
[2024/08/06 18:06:13] openrec INFO: epoch: [1/2], global_step: 300, lr: 0.000015, loss: 54.812862, avg_reader_cost: 0.00682 s, avg_batch_cost: 0.15047 s, avg_samples: 64.0, ips: 425.32251 samples/s, eta: 0:24:59
[2024/08/06 18:06:16] openrec INFO: epoch: [1/2], global_step: 310, lr: 0.000015, loss: 52.929249, avg_reader_cost: 0.00726 s, avg_batch_cost: 0.15144 s, avg_samples: 64.0, ips: 422.62292 samples/s, eta: 0:24:55
[2024/08/06 18:06:18] openrec INFO: epoch: [1/2], global_step: 320, lr: 0.000016, loss: 52.851337, avg_reader_cost: 0.00696 s, avg_batch_cost: 0.15058 s, avg_samples: 64.0, ips: 425.01887 samples/s, eta: 0:24:51
[2024/08/06 18:06:21] openrec INFO: epoch: [1/2], global_step: 330, lr: 0.000016, loss: 52.118073, avg_reader_cost: 0.00721 s, avg_batch_cost: 0.15060 s, avg_samples: 64.0, ips: 424.97143 samples/s, eta: 0:24:47
[2024/08/06 18:06:24] openrec INFO: epoch: [1/2], global_step: 340, lr: 0.000017, loss: 52.040649, avg_reader_cost: 0.00686 s, avg_batch_cost: 0.15040 s, avg_samples: 64.0, ips: 425.52080 samples/s, eta: 0:24:43
[2024/08/06 18:06:27] openrec INFO: epoch: [1/2], global_step: 350, lr: 0.000018, loss: 49.329895, avg_reader_cost: 0.00685 s, avg_batch_cost: 0.15061 s, avg_samples: 64.0, ips: 424.95145 samples/s, eta: 0:24:40
[2024/08/06 18:06:30] openrec INFO: epoch: [1/2], global_step: 360, lr: 0.000018, loss: 51.202164, avg_reader_cost: 0.00691 s, avg_batch_cost: 0.15066 s, avg_samples: 64.0, ips: 424.80195 samples/s, eta: 0:24:36
[2024/08/06 18:06:32] openrec INFO: epoch: [1/2], global_step: 370, lr: 0.000019, loss: 51.715294, avg_reader_cost: 0.00687 s, avg_batch_cost: 0.15062 s, avg_samples: 64.0, ips: 424.90214 samples/s, eta: 0:24:33
[2024/08/06 18:06:35] openrec INFO: epoch: [1/2], global_step: 380, lr: 0.000019, loss: 51.019779, avg_reader_cost: 0.00700 s, avg_batch_cost: 0.15098 s, avg_samples: 64.0, ips: 423.89481 samples/s, eta: 0:24:30
[2024/08/06 18:06:38] openrec INFO: epoch: [1/2], global_step: 390, lr: 0.000020, loss: 51.824024, avg_reader_cost: 0.00692 s, avg_batch_cost: 0.15073 s, avg_samples: 64.0, ips: 424.61166 samples/s, eta: 0:24:26
[2024/08/06 18:06:41] openrec INFO: epoch: [1/2], global_step: 400, lr: 0.000020, loss: 50.717896, avg_reader_cost: 0.00669 s, avg_batch_cost: 0.15070 s, avg_samples: 64.0, ips: 424.69469 samples/s, eta: 0:24:23

With AMP:

[2024/08/06 18:03:31] openrec INFO: epoch: [1/2], global_step: 100, lr: 0.000005, loss: 69.376022, avg_reader_cost: 0.00971 s, avg_batch_cost: 0.18023 s, avg_samples: 64.0, ips: 355.10891 samples/s, eta: 0:32:33
[2024/08/06 18:03:33] openrec INFO: epoch: [1/2], global_step: 110, lr: 0.000005, loss: 63.990349, avg_reader_cost: 0.01058 s, avg_batch_cost: 0.18164 s, avg_samples: 64.0, ips: 352.35357 samples/s, eta: 0:32:12
[2024/08/06 18:03:35] openrec INFO: epoch: [1/2], global_step: 120, lr: 0.000006, loss: 63.299240, avg_reader_cost: 0.01072 s, avg_batch_cost: 0.18242 s, avg_samples: 64.0, ips: 350.84672 samples/s, eta: 0:31:55
[2024/08/06 18:03:37] openrec INFO: epoch: [1/2], global_step: 130, lr: 0.000006, loss: 61.347343, avg_reader_cost: 0.01023 s, avg_batch_cost: 0.18157 s, avg_samples: 64.0, ips: 352.48265 samples/s, eta: 0:31:40
[2024/08/06 18:03:38] openrec INFO: epoch: [1/2], global_step: 140, lr: 0.000007, loss: 59.796600, avg_reader_cost: 0.01043 s, avg_batch_cost: 0.18167 s, avg_samples: 64.0, ips: 352.28051 samples/s, eta: 0:31:27
[2024/08/06 18:03:40] openrec INFO: epoch: [1/2], global_step: 150, lr: 0.000007, loss: 60.858330, avg_reader_cost: 0.01041 s, avg_batch_cost: 0.18174 s, avg_samples: 64.0, ips: 352.15369 samples/s, eta: 0:31:15
[2024/08/06 18:03:42] openrec INFO: epoch: [1/2], global_step: 160, lr: 0.000008, loss: 62.775482, avg_reader_cost: 0.00933 s, avg_batch_cost: 0.18135 s, avg_samples: 64.0, ips: 352.91465 samples/s, eta: 0:31:04
[2024/08/06 18:03:44] openrec INFO: epoch: [1/2], global_step: 170, lr: 0.000008, loss: 60.844872, avg_reader_cost: 0.01013 s, avg_batch_cost: 0.18200 s, avg_samples: 64.0, ips: 351.65021 samples/s, eta: 0:30:55
[2024/08/06 18:03:46] openrec INFO: epoch: [1/2], global_step: 180, lr: 0.000009, loss: 58.660248, avg_reader_cost: 0.01100 s, avg_batch_cost: 0.18293 s, avg_samples: 64.0, ips: 349.85706 samples/s, eta: 0:30:47
[2024/08/06 18:03:48] openrec INFO: epoch: [1/2], global_step: 190, lr: 0.000009, loss: 58.103447, avg_reader_cost: 0.01108 s, avg_batch_cost: 0.18313 s, avg_samples: 64.0, ips: 349.47937 samples/s, eta: 0:30:39
[2024/08/06 18:03:50] openrec INFO: epoch: [1/2], global_step: 200, lr: 0.000010, loss: 60.355034, avg_reader_cost: 0.01048 s, avg_batch_cost: 0.18146 s, avg_samples: 64.0, ips: 352.69481 samples/s, eta: 0:30:32
[2024/08/06 18:03:51] openrec INFO: epoch: [1/2], global_step: 210, lr: 0.000010, loss: 58.444603, avg_reader_cost: 0.00983 s, avg_batch_cost: 0.18169 s, avg_samples: 64.0, ips: 352.24348 samples/s, eta: 0:30:25
[2024/08/06 18:03:53] openrec INFO: epoch: [1/2], global_step: 220, lr: 0.000011, loss: 59.229370, avg_reader_cost: 0.00897 s, avg_batch_cost: 0.18147 s, avg_samples: 64.0, ips: 352.66891 samples/s, eta: 0:30:19
[2024/08/06 18:03:55] openrec INFO: epoch: [1/2], global_step: 230, lr: 0.000011, loss: 58.432549, avg_reader_cost: 0.00992 s, avg_batch_cost: 0.18137 s, avg_samples: 64.0, ips: 352.87377 samples/s, eta: 0:30:12
[2024/08/06 18:03:57] openrec INFO: epoch: [1/2], global_step: 240, lr: 0.000012, loss: 56.128662, avg_reader_cost: 0.00969 s, avg_batch_cost: 0.18164 s, avg_samples: 64.0, ips: 352.34506 samples/s, eta: 0:30:07
[2024/08/06 18:03:59] openrec INFO: epoch: [1/2], global_step: 250, lr: 0.000012, loss: 57.606712, avg_reader_cost: 0.01010 s, avg_batch_cost: 0.18331 s, avg_samples: 64.0, ips: 349.13365 samples/s, eta: 0:30:02
[2024/08/06 18:04:01] openrec INFO: epoch: [1/2], global_step: 260, lr: 0.000013, loss: 57.450527, avg_reader_cost: 0.01046 s, avg_batch_cost: 0.18173 s, avg_samples: 64.0, ips: 352.18021 samples/s, eta: 0:29:57
[2024/08/06 18:04:02] openrec INFO: epoch: [1/2], global_step: 270, lr: 0.000013, loss: 54.395607, avg_reader_cost: 0.01038 s, avg_batch_cost: 0.18134 s, avg_samples: 64.0, ips: 352.93065 samples/s, eta: 0:29:52
[2024/08/06 18:04:04] openrec INFO: epoch: [1/2], global_step: 280, lr: 0.000014, loss: 55.983822, avg_reader_cost: 0.01070 s, avg_batch_cost: 0.18134 s, avg_samples: 64.0, ips: 352.93195 samples/s, eta: 0:29:47
[2024/08/06 18:04:06] openrec INFO: epoch: [1/2], global_step: 290, lr: 0.000014, loss: 57.264977, avg_reader_cost: 0.01036 s, avg_batch_cost: 0.18174 s, avg_samples: 64.0, ips: 352.15549 samples/s, eta: 0:29:43
[2024/08/06 18:04:08] openrec INFO: epoch: [1/2], global_step: 300, lr: 0.000015, loss: 54.802746, avg_reader_cost: 0.01047 s, avg_batch_cost: 0.18134 s, avg_samples: 64.0, ips: 352.92583 samples/s, eta: 0:29:38
[2024/08/06 18:04:10] openrec INFO: epoch: [1/2], global_step: 310, lr: 0.000015, loss: 52.902641, avg_reader_cost: 0.01031 s, avg_batch_cost: 0.18130 s, avg_samples: 64.0, ips: 353.01035 samples/s, eta: 0:29:34
[2024/08/06 18:04:12] openrec INFO: epoch: [1/2], global_step: 320, lr: 0.000016, loss: 52.814461, avg_reader_cost: 0.01083 s, avg_batch_cost: 0.18140 s, avg_samples: 64.0, ips: 352.81770 samples/s, eta: 0:29:30
[2024/08/06 18:04:14] openrec INFO: epoch: [1/2], global_step: 330, lr: 0.000016, loss: 52.073181, avg_reader_cost: 0.01115 s, avg_batch_cost: 0.18350 s, avg_samples: 64.0, ips: 348.77103 samples/s, eta: 0:29:27
[2024/08/06 18:04:15] openrec INFO: epoch: [1/2], global_step: 340, lr: 0.000017, loss: 51.995941, avg_reader_cost: 0.01031 s, avg_batch_cost: 0.18142 s, avg_samples: 64.0, ips: 352.77681 samples/s, eta: 0:29:23
[2024/08/06 18:04:17] openrec INFO: epoch: [1/2], global_step: 350, lr: 0.000018, loss: 49.285133, avg_reader_cost: 0.01002 s, avg_batch_cost: 0.18138 s, avg_samples: 64.0, ips: 352.84831 samples/s, eta: 0:29:19
[2024/08/06 18:04:19] openrec INFO: epoch: [1/2], global_step: 360, lr: 0.000018, loss: 51.151272, avg_reader_cost: 0.00993 s, avg_batch_cost: 0.18172 s, avg_samples: 64.0, ips: 352.19066 samples/s, eta: 0:29:16
[2024/08/06 18:04:21] openrec INFO: epoch: [1/2], global_step: 370, lr: 0.000019, loss: 51.662258, avg_reader_cost: 0.00962 s, avg_batch_cost: 0.18115 s, avg_samples: 64.0, ips: 353.29306 samples/s, eta: 0:29:12
[2024/08/06 18:04:23] openrec INFO: epoch: [1/2], global_step: 380, lr: 0.000019, loss: 50.934994, avg_reader_cost: 0.00991 s, avg_batch_cost: 0.18152 s, avg_samples: 64.0, ips: 352.57631 samples/s, eta: 0:29:09
[2024/08/06 18:04:25] openrec INFO: epoch: [1/2], global_step: 390, lr: 0.000020, loss: 51.761528, avg_reader_cost: 0.00978 s, avg_batch_cost: 0.18120 s, avg_samples: 64.0, ips: 353.19660 samples/s, eta: 0:29:06
[2024/08/06 18:04:26] openrec INFO: epoch: [1/2], global_step: 400, lr: 0.000020, loss: 50.644489, avg_reader_cost: 0.00957 s, avg_batch_cost: 0.18125 s, avg_samples: 64.0, ips: 353.10103 samples/s, eta: 0:29:02

The IPS is consistent with what you posted.
Interestingly, even though the IPS metric shows a slowdown with AMP turned on, the log timestamps show that AMP did speed up training:
Without AMP: [2024/08/06 18:05:16] to [2024/08/06 18:06:41], 1 min 25 s
With AMP: [2024/08/06 18:03:31] to [2024/08/06 18:04:26], 55 s

Yet the IPS calculation in the code appears to be correct. We will follow up with deeper troubleshooting of this issue.
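For what it's worth, one possible cause of an IPS/wall-clock mismatch (an assumption, not a confirmed diagnosis) is that CUDA kernels launch asynchronously: a host-side timer read without torch.cuda.synchronize() measures mostly launch time, and the queued GPU work gets charged to whichever later call happens to block. A sketch of a synchronized throughput measurement, with hypothetical names:

```python
import time
import torch

def samples_per_sec(step_fn, batch_size, iters=20):
    """Measure training throughput with explicit GPU synchronization.

    Without the synchronize() calls, time.perf_counter() deltas can
    misattribute asynchronous GPU work, which could make AMP look slower
    in per-batch IPS even while the wall clock says it is faster.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return iters * batch_size / (time.perf_counter() - start)
```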

In addition, when the dictionary is very large, we recommend setting cal_metric_during_train to False; this speeds up training significantly.

@gaozhun (Contributor, Author) commented Aug 6, 2024

I appreciate your insights regarding the IPS and the log timestamps showing that AMP indeed worked. I'll look forward to further troubleshooting on this issue.

Regarding the suggestion to set cal_metric_during_train to False when the number of dictionaries is large, I will make this adjustment in my experiments to see how it impacts the training speed.

Thank you again for your assistance!

@gaozhun (Contributor, Author) commented Aug 7, 2024

I am encountering an issue where the accuracy of OpenOCR implemented with PyTorch is consistently lower compared to PaddleOCR, despite using identical configurations. Specifically, the accuracy of the PyTorch version is around 4 percentage points lower than that of the PaddleOCR version.

From the training curves, it is evident that OpenOCR (PyTorch) shows a more rapid increase in training accuracy. However, the evaluation accuracy remains lower compared to PaddleOCR.
Have you encountered similar issues with OpenOCR (PyTorch) evaluation accuracy being lower than PaddleOCR?
(attached image: training and evaluation accuracy curves for OpenOCR vs. PaddleOCR)

@Topdu (Owner) commented Aug 7, 2024

I suspect that the post-processing of the two is inconsistent. You could save the recognition results from both and compare them to see where they differ.
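To make that comparison concrete, a small hypothetical helper could diff the two result dumps. The tab-separated "image&lt;TAB&gt;prediction" line format here is an assumption; adjust it to however the results are actually saved:

```python
def load_preds(path):
    """Load one 'image_path<TAB>predicted_text' pair per line (assumed format)."""
    preds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            img, _, text = line.rstrip("\n").partition("\t")
            preds[img] = text
    return preds

def diff_preds(a, b):
    """Images present in both dumps whose predictions disagree."""
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
```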

@nissansz commented

> I have noticed that the training speed of our model using PyTorch is significantly slower compared to PaddleOCR. […]

What system and Paddle version are you using to train rec_svtrv2_ch.yml?
