[feat] Integration of dataset benchmarks #25

Merged
merged 17 commits into from
Dec 13, 2024

Conversation

ir2718
Contributor

@ir2718 ir2718 commented Nov 3, 2024

Hi,

as mentioned in issue #24, this PR introduces benchmark datasets for the STR task. Each dataset has a .yaml config that the scripts use. For example:

python tools/download/download_dataset.py --config configs/dataset/rec/iiit.yaml
python tools/create_lmdb_dataset.py --config configs/dataset/rec/iiit.yaml 

The create_lmdb_dataset.py script is slightly changed to accommodate dynamically creating an LMDB dataset. The current PR does not support different splits (it uses a concatenation of all examples).
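
To illustrate what "a concatenation of all examples" could look like, here is a minimal sketch. It is not the code from this PR: the function name `build_lmdb_entries` and the `Sample` type are hypothetical, and the key layout (`image-%09d`, `label-%09d`, `num-samples`) is the convention commonly used by STR LMDB datasets, assumed here rather than confirmed for create_lmdb_dataset.py.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample type: (encoded image bytes, text label).
Sample = Tuple[bytes, str]


def build_lmdb_entries(splits: Iterable[List[Sample]]) -> Dict[bytes, bytes]:
    """Flatten every split into one key/value cache with 1-based indices."""
    cache: Dict[bytes, bytes] = {}
    count = 0
    for split in splits:  # splits are concatenated, not kept apart
        for image_bytes, label in split:
            count += 1
            cache[b"image-%09d" % count] = image_bytes
            cache[b"label-%09d" % count] = label.encode("utf-8")
    # Total number of samples, stored as text under a fixed key.
    cache[b"num-samples"] = str(count).encode("utf-8")
    return cache


# A real script would write each pair inside an LMDB write transaction
# (for example with txn.put(key, value)); that step is omitted here.
entries = build_lmdb_entries([[(b"img-a", "hello")], [(b"img-b", "world")]])
```

Because the index keeps counting across splits, the resulting dataset carries no information about which split a sample came from, which matches the limitation described above.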

For now, I'll set this as a draft PR as I'm open to any suggestions or changes.

@Topdu

@ir2718 ir2718 marked this pull request as draft November 3, 2024 20:26
@Topdu
Owner

Topdu commented Nov 4, 2024

Thank you for your efforts; this is a very exciting contribution. Some of these datasets have been changed relative to their original versions, especially the commonly used ones. For example, PARSeq corrected some mislabeled samples in IIIT5k, SVT, SVTP, IC13, IC15, and CUTE80. Since much of the later work follows the datasets provided by PARSeq, I suggest downloading those directly; many of them have been uploaded to Google Drive. For the training and test sets of Union14M-L, we will provide a Google Drive version and let you know when it is available.

@Topdu
Owner

Topdu commented Nov 21, 2024

Downloading Datasets

All data can be downloaded from Google Drive.

The structure of Datasets and OpenOCR code will be organized as follows:

```text
benchmark_bctr # Chinese text datasets, optional
├── benchmark_bctr_test
│   ├── document_test
│   ├── handwriting_test
│   ├── scene_test
│   └── web_test
└── benchmark_bctr_train
    ├── document_train
    ├── handwriting_train
    ├── scene_train
    └── web_train
evaluation
├── CUTE80
├── IC13_857
├── IC15_1811
├── IIIT5k
├── SVT
└── SVTP
iiit5k_test_images # for Latency Measurement, optional
ltb # Long Text Benchmark
OpenOCR
OST
synth # optional
├── MJ
│   ├── test
│   ├── train
│   └── val
└── ST
test # Common Benchmarks from PARSeq
├── ArT
├── COCOv1.4
├── CUTE80
├── IC13_1015
├── IC13_1095  
├── IC13_857
├── IC15_1811
├── IC15_2077
├── IIIT5k
├── SVT
├── SVTP
└── Uber
u14m # lmdb format Union14M-Benchmark
├── artistic
├── contextless
├── curve
├── general
├── multi_oriented
├── multi_words
└── salient
Union14M-L-LMDB-Filtered # lmdb format Union14M-L-Filtered
├── train_challenging
├── train_easy
├── train_hard
├── train_medium
└── train_normal
```
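
As a rough aid for checking a local download against the tree above, the following sketch lists expected directories that are missing. It is an illustration only: `REQUIRED_LAYOUT` and `missing_dataset_dirs` are hypothetical names, and treating every entry not marked "optional" in the tree as required is an assumption.

```python
from pathlib import Path
from typing import Dict, List

# Directories from the tree above that are not marked "optional" (assumed required).
REQUIRED_LAYOUT: Dict[str, List[str]] = {
    "evaluation": ["CUTE80", "IC13_857", "IC15_1811", "IIIT5k", "SVT", "SVTP"],
    "ltb": [],
    "OST": [],
    "test": ["ArT", "COCOv1.4", "CUTE80", "IC13_1015", "IC13_1095", "IC13_857",
             "IC15_1811", "IC15_2077", "IIIT5k", "SVT", "SVTP", "Uber"],
    "u14m": ["artistic", "contextless", "curve", "general",
             "multi_oriented", "multi_words", "salient"],
    "Union14M-L-LMDB-Filtered": ["train_challenging", "train_easy", "train_hard",
                                 "train_medium", "train_normal"],
}

# Flatten to relative paths; a parent without children is checked itself.
EXPECTED_DIRS: List[str] = [
    parent if child is None else f"{parent}/{child}"
    for parent, children in REQUIRED_LAYOUT.items()
    for child in (children or [None])
]


def missing_dataset_dirs(root: str) -> List[str]:
    """Return the expected relative paths that are not directories under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (base / rel).is_dir()]
```

Calling `missing_dataset_dirs(".")` from the folder that holds the datasets returns an empty list when every expected directory is present.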

Datasets used during Training

Datasets Google Drive Baidu Yun
Union14M-L-Filter LMDB archives
Evaluation LMDB archives

If you have downloaded Union14M-L, you can use the filtered list of images to create an LMDB of the training set Union14M-L-Filter.

Test Set

Datasets Google Drive Baidu Yun
Union14M-L-Benchmark LMDB archives
Common-Benchmarks LMDB archives
Long Text Benchmark (LTB) LMDB archives
Occluded Scene Text (OST) LMDB archives

Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively.

@ir2718
Contributor Author

ir2718 commented Nov 21, 2024

@Topdu

Thanks for uploading the datasets, this will help immensely. I was a bit busy these past few weeks, but I'll try to finish the PR over the weekend.

@ir2718
Contributor Author

ir2718 commented Dec 1, 2024

Hi,

I've finally found time to start wrapping up this PR. The uploaded datasets helped out immensely. If you've got time, @Topdu, please have a look and tell me if you would like me to change something.

Before converting this into a non-draft PR, I'd like to ask 2 questions:

  1. Aside from the aforementioned datasets, I also added OpenVINO and TextOCR from PARSeq. However, I noticed PARSeq also uses LSVT, MLT19, RCTW17, ReCTS, and COCO-Text v2.0. Would you like me to add those as well?
  2. I'm having trouble finding a link for the benchmark_bctr datasets, as I'm not well acquainted with Chinese text recognition. Can you provide me with a link for downloading those datasets?

@ir2718
Contributor Author

ir2718 commented Dec 10, 2024

Are there any updates on the datasets? @Topdu

@Topdu
Owner

Topdu commented Dec 10, 2024

Very sorry for our delayed response.

> 1. Aside from the aforementioned datasets, I also added OpenVINO and TextOCR from parseq. However, I noticed parseq also uses LSVT, MLT19, RCTW17, ReCTS, and COCO-Text v2.0. Would you like me to add those as well?

They are already included in Union14M, and you can leave them out for now.

> 2. I'm having trouble finding a link for the benchmark_bctr datasets as I'm not well acquainted with Chinese text recognition. Can you provide me with a link for downloading those datasets?

Since benchmark_bctr is copyrighted (especially the Chinese handwriting dataset), we recommend requesting permission to use it from the dataset's original creator. Once the creator has granted permission, we can send the data to the applicant confidentially.

> Are there any updates on the datasets? @Topdu

I think this draft version can be merged as a ready PR.

Thank you for your outstanding contribution.

@ir2718 ir2718 marked this pull request as ready for review December 10, 2024 23:03
@ir2718
Contributor Author

ir2718 commented Dec 11, 2024

You're welcome. In the future I hope I'll find time to work on more PRs.

Review comment on docs/openocr.md (outdated, resolved)
Review comment on tools/create_lmdb_dataset.py (outdated, resolved)
@Topdu
Owner

Topdu commented Dec 13, 2024

It is now ready to be merged in. Thanks to your efforts!!

Owner

@Topdu Topdu left a comment

Done

@Topdu Topdu merged commit da8b837 into Topdu:main Dec 13, 2024
2 participants