[feat] Integration of dataset benchmarks #25
Thank you for your efforts; this is a very exciting feature to work on. Some of these datasets differ from their original versions, especially the commonly used ones. For example, PARSeq corrected some mislabeled samples in IIIT5k, SVT, SVTP, IC13, IC15, and CUTE80. Because many works follow the datasets provided by PARSeq, I suggest downloading those datasets directly; many of them have been uploaded to Google Drive. For the training and test sets of Union14M-L, we will provide a Google Drive version and will let you know when it is available. |
Downloading Datasets
All data can be downloaded from Google Drive. The structure of Datasets and OpenOCR code will be organized as follows:
Structure of Datasets and OpenOCR code
Datasets used during Training
If you have downloaded Union14M-L, you can use the filtered list of images to create an LMDB of the training set Union14M-L-Filter.
Test Set
Note: Both Union14M-L-Filter and Union14M-L-Benchmark are based on Union14M-L and therefore comply with its copyright. Common Benchmarks and OST are derived from PARSeq and VisionLAN, respectively. |
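For the Union14M-L-Filter step mentioned in the comment above, a minimal sketch of the filtering could look like the following. The file layouts are assumptions, not taken from this thread: the filter list is assumed to hold one image file name per line, and the label file is assumed to hold tab-separated `image_path<TAB>label` lines. The helper names `load_filter_names` and `filter_samples` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable, Iterator


def load_filter_names(lines: Iterable[str]) -> set[str]:
    """Collect the image file names to keep, one per non-empty line (assumed format)."""
    return {line.strip() for line in lines if line.strip()}


def filter_samples(
    label_lines: Iterable[str], keep: set[str]
) -> Iterator[tuple[str, str]]:
    """Yield (image_path, label) pairs whose file name appears in `keep`.

    Each label line is assumed to be 'image_path<TAB>label'.
    """
    for line in label_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        image_path, _, label = line.partition("\t")
        if Path(image_path).name in keep:
            yield image_path, label
```

The surviving pairs can then be passed to whatever script writes the LMDB; matching on the file name rather than the full path is a design choice that would need to change if file names are not unique across sub-folders.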
Thanks for uploading the datasets, this will help immensely. I was a bit busy these past few weeks, but I'll try to finish the PR over the weekend. |
Hi, I've finally found time to start wrapping up this PR. The uploaded datasets helped immensely. If you've got time, @Topdu, please have a look and tell me if you would like me to change anything. Before converting this into a non-draft PR, I'd like to ask 2 questions:
|
Are there any updates on the datasets? @Topdu |
Very sorry for our delayed response.
They are already included in Union14M, and you can leave them out for now.
Since benchmark_bctr is copyrighted (especially the Chinese handwriting dataset), we recommend requesting permission to use it from the dataset's original creator. Once the creator has granted permission, we can send the dataset to the applicant confidentially.
I think this draft version can be merged as a ready PR. Thank you for your outstanding contribution. |
You're welcome. In the future I hope I'll find time to work on more PRs. |
It is now ready to be merged in. Thanks to your efforts!! |
Done
Hi,
as mentioned in issue #24, this is a PR for introducing benchmark datasets for the STR task. Each of the datasets has a .yaml config defined which is used for the scripts. For example:
The create_lmdb_dataset.py script is slightly changed to accommodate dynamically creating an LMDB dataset. The current PR does not support different splits (it uses a concatenation of all examples).
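To illustrate what "a concatenation of all examples" means in practice, here is a minimal sketch, not the code from this PR. It assumes the `image-%09d` / `label-%09d` / `num-samples` key layout that STR LMDB tooling commonly uses; whether this PR's script uses exactly that layout is an assumption. `build_lmdb_cache` is a hypothetical helper, and writing the result out with the `lmdb` package is left out so the sketch stays dependency-free.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[bytes, str]  # (encoded image bytes, text label)


def build_lmdb_cache(splits: Iterable[List[Sample]]) -> Dict[bytes, bytes]:
    """Merge every split into one key space with 1-based sequential indices.

    Split boundaries are discarded, which mirrors the "no separate splits"
    limitation described above.
    """
    cache: Dict[bytes, bytes] = {}
    count = 0
    for split in splits:
        for image_bytes, label in split:
            count += 1
            cache[b"image-%09d" % count] = image_bytes
            cache[b"label-%09d" % count] = label.encode("utf-8")
    # Total number of samples, stored as a decimal string (assumed convention).
    cache[b"num-samples"] = str(count).encode("utf-8")
    return cache
```

Supporting separate splits later would mostly mean calling such a helper once per split and writing each result to its own LMDB directory.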
For now, I'll set this as a draft PR as I'm open to any suggestions or changes.
@Topdu