[`feat`] Integration of dataset benchmarks #25

ir2718 · 2024-11-03T20:26:06Z

Hi,

as mentioned in issue #24, this is a PR for introducing benchmark datasets for the STR task. Each of the datasets has a .yaml config defined which is used for the scripts. For example:

python tools/download/download_dataset.py --config configs/dataset/rec/iiit.yaml
python tools/create_lmdb_dataset.py --config configs/dataset/rec/iiit.yaml

The create_lmdb_dataset.py script has been changed slightly to accommodate creating an LMDB dataset dynamically. The current PR does not support separate splits; it uses a concatenation of all examples.
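To make the "concatenation of all examples" behaviour concrete, here is a minimal sketch of how samples from several splits could be merged into one set of LMDB records. It assumes the `image-%09d` / `label-%09d` / `num-samples` key layout that is common in STR tooling; whether the script in this PR uses exactly that layout is an assumption, and `build_lmdb_entries` and `concat_splits` are hypothetical helper names, not functions from the repository. The records are collected in a plain dict so the sketch runs without the `lmdb` package.

```python
from typing import Dict, Iterable, Iterator, Tuple

Sample = Tuple[bytes, str]  # (encoded image bytes, text label)


def concat_splits(*splits: Iterable[Sample]) -> Iterator[Sample]:
    """Yield every sample from every split, in order, ignoring split boundaries."""
    for split in splits:
        yield from split


def build_lmdb_entries(samples: Iterable[Sample]) -> Dict[bytes, bytes]:
    """Build the key/value records for one LMDB (assumed STR key convention)."""
    entries: Dict[bytes, bytes] = {}
    count = 0
    for image_bytes, label in samples:
        count += 1  # keys are 1-indexed in this convention
        entries[f"image-{count:09d}".encode()] = image_bytes
        entries[f"label-{count:09d}".encode()] = label.encode("utf-8")
    # A final record stores the total number of samples.
    entries[b"num-samples"] = str(count).encode()
    return entries


if __name__ == "__main__":
    train = [(b"<img-1>", "hello")]
    test = [(b"<img-2>", "world")]
    records = build_lmdb_entries(concat_splits(train, test))
    print(sorted(records))
```

Supporting separate splits later would mostly mean calling `build_lmdb_entries` once per split instead of once on the concatenation.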

For now, I'll set this as a draft PR as I'm open to any suggestions or changes.

@Topdu

Topdu · 2024-11-04T03:00:00Z

Thank you for your efforts; this is a very exciting feature to work on. Some of these datasets have been changed compared to their original versions, especially the commonly used ones. For example, PARSeq has corrected some of the mislabeling in IIIT5k, SVT, SVTP, IC13, IC15, and CUTE80. Since much of the existing work follows the datasets provided by PARSeq, I suggest downloading the PARSeq versions directly; many of them have been uploaded to Google Drive. For the training and test sets of Union14M-L, we will provide a Google Drive version and let you know when it is available.

Topdu · 2024-11-21T14:36:58Z

Downloading Datasets

All data can be downloaded from Google Drive.

The structure of Datasets and OpenOCR code will be organized as follows:

```text
benchmark_bctr # Chinese text datasets, optional
├── benchmark_bctr_test
│   ├── document_test
│   ├── handwriting_test
│   ├── scene_test
│   └── web_test
└── benchmark_bctr_train
    ├── document_train
    ├── handwriting_train
    ├── scene_train
    └── web_train
evaluation
├── CUTE80
├── IC13_857
├── IC15_1811
├── IIIT5k
├── SVT
└── SVTP
iiit5k_test_images # for Latency Measurement, optional
ltb # Long Text Benchmark
OpenOCR
OST
synth # optional
├── MJ
│   ├── test
│   ├── train
│   └── val
└── ST
test # Common Benchmarks from PARSeq
├── ArT
├── COCOv1.4
├── CUTE80
├── IC13_1015
├── IC13_1095  
├── IC13_857
├── IC15_1811
├── IC15_2077
├── IIIT5k
├── SVT
├── SVTP
└── Uber
u14m # lmdb format Union14M-Benchmark
├── artistic
├── contextless
├── curve
├── general
├── multi_oriented
├── multi_words
└── salient
Union14M-L-LMDB-Filtered # lmdb format Union14M-L-Filtered
├── train_challenging
├── train_easy
├── train_hard
├── train_medium
└── train_normal
```
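A quick way to check a local download against the layout above is sketched below. Only a few of the listed directories are encoded, the choice of which ones to check is illustrative, and `find_missing_dirs` is a hypothetical helper rather than part of OpenOCR.

```python
from pathlib import Path
from typing import Dict, List

# Subset of the directory tree above; extend as needed.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "evaluation": ["CUTE80", "IC13_857", "IC15_1811", "IIIT5k", "SVT", "SVTP"],
    "u14m": [
        "artistic", "contextless", "curve", "general",
        "multi_oriented", "multi_words", "salient",
    ],
    "Union14M-L-LMDB-Filtered": [
        "train_challenging", "train_easy", "train_hard",
        "train_medium", "train_normal",
    ],
}


def find_missing_dirs(root, layout: Dict[str, List[str]] = EXPECTED_LAYOUT) -> List[str]:
    """Return the expected sub-directories (relative, POSIX style) that do not exist under root."""
    root = Path(root)
    missing: List[str] = []
    for parent, children in layout.items():
        for child in children:
            rel = Path(parent) / child
            if not (root / rel).is_dir():
                missing.append(rel.as_posix())
    return missing


if __name__ == "__main__":
    for rel in find_missing_dirs("."):
        print("missing:", rel)
```

Running it from the directory that contains `OpenOCR`, `evaluation`, `u14m`, and the other folders prints any expected directory that has not been downloaded yet.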

Datasets used during Training

Datasets	Google Drive	Baidu Yun
Union14M-L-Filter	LMDB archives
Evaluation	LMDB archives

If you have downloaded Union14M-L, you can use the filtered list of images to create an LMDB of the training set Union14M-L-Filter.
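As a rough illustration of that filtering step, the sketch below keeps only the annotations whose image path appears in a filter list. The thread does not specify the format of the filtered list, so one image path per line is an assumption, and `load_keep_list` and `filter_annotations` are hypothetical names.

```python
from typing import Iterable, List, Set, Tuple


def load_keep_list(lines: Iterable[str]) -> Set[str]:
    """Collect the image paths to keep (assumed format: one path per line)."""
    return {line.strip() for line in lines if line.strip()}


def filter_annotations(
    annotations: Iterable[Tuple[str, str]], keep: Set[str]
) -> List[Tuple[str, str]]:
    """Keep only the (image_path, label) pairs whose path is in the filter list."""
    return [(path, label) for path, label in annotations if path in keep]


if __name__ == "__main__":
    keep = load_keep_list(["a.jpg\n", "\n", "c.jpg\n"])
    annotations = [("a.jpg", "hello"), ("b.jpg", "noise"), ("c.jpg", "world")]
    print(filter_annotations(annotations, keep))
```

The surviving pairs can then be passed to whatever routine writes the training LMDB.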

Test Set

Datasets	Google Drive	Baidu Yun
Union14M-L-Benchmark	LMDB archives
Common-Benchmarks	LMDB archives
Long Text Benchmark (LTB)	LMDB archives
Occluded Scene Text (OST)	LMDB archives

Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively.

ir2718 · 2024-11-21T17:35:40Z

@Topdu

Thanks for uploading the datasets, this will help immensely. I was a bit busy these past few weeks, but I'll try to finish the PR over the weekend.

ir2718 · 2024-12-01T15:14:00Z

Hi,

I've finally found time to start wrapping up this PR. The uploaded datasets helped out immensely. If you've got time, @Topdu, please have a look and tell me if you would like me to change something.

Before converting this into a non-draft PR, I'd like to ask 2 questions:

1. Aside from the aforementioned datasets, I also added OpenVINO and TextOCR from parseq. However, I noticed parseq also uses LSVT, MLT19, RCTW17, ReCTS, and COCO-Text v2.0. Would you like me to add those as well?
2. I'm having trouble finding a link for the benchmark_bctr datasets, as I'm not well acquainted with Chinese text recognition. Can you provide me with a link for downloading those datasets?

ir2718 · 2024-12-10T10:13:48Z

Are there any updates on the datasets? @Topdu

Topdu · 2024-12-10T10:33:39Z

Very sorry for our delayed response.

> Aside from the aforementioned datasets, I also added OpenVINO and TextOCR from parseq. However, I noticed parseq also uses LSVT, MLT19, RCTW17, ReCTS, and COCO-Text v2.0. Would you like me to add those as well?

They are already included in Union14M, and you can leave them out for now.

> I'm having trouble finding a link for the benchmark_bctr datasets, as I'm not well acquainted with Chinese text recognition. Can you provide me with a link for downloading those datasets?

Since benchmark_bctr is copyrighted (especially the Chinese handwriting dataset), we recommend requesting permission to use it from the dataset's original creator. Once the creator has granted permission, we can send the data to the applicant confidentially.

> Are there any updates on the datasets? @Topdu

I think this draft version can be merged as a ready PR.

Thank you for your outstanding contribution.

ir2718 · 2024-12-11T01:40:52Z

You're welcome. In the future I hope I'll find time to work on more PRs.

Topdu · 2024-12-13T02:48:38Z

It is now ready to be merged. Thanks for your efforts!

ir2718 added 7 commits November 1, 2024 20:32

add automatic download for datasets

2df23ff

add loading for all but union

56d389d

add sroie, textocr

bfc626b

refactor to using configs, add totaltext, synthtext

9bbe90e

add dataset configs

76871da

add more datasets

bf67657

fix max len param

4f18b66

ir2718 marked this pull request as draft November 3, 2024 20:26

ir2718 added 3 commits December 1, 2024 03:08

parseq compliant refactor

5bd5f1b

update datasets

3b3dcab

add textocr, openvino

0d7a051

Merge branch 'main' into datasets

31730c7

ir2718 added 2 commits December 10, 2024 23:20

final cleanup

ef9cd1f

update docs

450ae4c

ir2718 marked this pull request as ready for review December 10, 2024 23:03

remove unnecessary imports

0b278ff

Topdu reviewed Dec 12, 2024

docs/openocr.md (outdated, resolved)

tools/create_lmdb_dataset.py (outdated, resolved)

ir2718 added 3 commits December 12, 2024 22:07

revert to old docs

35da22f

revert

16dcbd4

add removed comment

ebf718b

Topdu approved these changes Dec 13, 2024

Topdu merged commit da8b837 into Topdu:main Dec 13, 2024
