feat(translation): Experimental integration of new YADT backend into CLI #475

Merged
merged 8 commits into from
Jan 20, 2025

awwaawwa
Contributor

@awwaawwa awwaawwa commented Jan 16, 2025

yadt is a new document translation backend, a rewrite of pdf2zh's translation backend. It is not intended for direct use by end users; we hope users who need self-deployment will use pdf2zh.

Currently, yadt is still in early development with some serious bugs, but in some cases, it can provide better translation quality than pdf2zh's original backend, and its performance is significantly better.

Breaking Changes:

  • Raised minimum Python version required by pdf2zh to 3.10

Known Issues:

  • Some formula layout errors
  • Loss of all lines (table borders, etc.)
  • Some layout parsing errors
  • Poor table of contents performance

Major Improvements:

  • Introduced intermediate representation, completely decoupling each stage
  • Improved performance: in local testing with 37 QPS upstream, 600 pages were translated in just 12 minutes (about 0.8 pages per second).
  • PDF generation more compliant with PDF specifications, better output PDF compatibility
  • More advanced typography features: hanging punctuation, auto half-space insertion between Chinese and English text, dynamic line spacing & font size scaling
  • OCR-based paragraph recognition for better paragraph identification
  • arxiv watermark support
  • Ignores all text within chart/diagram areas
  • Bold & Italic support
  • First line indent (fixed at 2 Chinese character widths)
  • Other improvements I've forgotten
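
To illustrate the "auto half-space insertion between Chinese and English text" item above, here is a minimal sketch of the general idea. It is not yadt's implementation: the function name `insert_spacing`, the character ranges, and the use of a plain space as the spacer are all assumptions made for this example.

```python
import re

# Assumption: "Chinese" means the CJK Unified Ideographs block (U+4E00-U+9FFF)
# and "English" means ASCII letters and digits. A real typesetter would likely
# cover more ranges and use a layout-level gap rather than a space character.
_CJK = r"\u4e00-\u9fff"
_LATIN = r"A-Za-z0-9"

_CJK_THEN_LATIN = re.compile(f"([{_CJK}])([{_LATIN}])")
_LATIN_THEN_CJK = re.compile(f"([{_LATIN}])([{_CJK}])")


def insert_spacing(text: str, spacer: str = " ") -> str:
    """Insert `spacer` wherever a CJK character directly touches a Latin one."""

    def _join(match: re.Match) -> str:
        # Keep both characters and put the spacer between them.
        return match.group(1) + spacer + match.group(2)

    text = _CJK_THEN_LATIN.sub(_join, text)
    return _LATIN_THEN_CJK.sub(_join, text)


if __name__ == "__main__":
    print(insert_spacing("使用Python翻译PDF文档"))
```

Text that is already spaced is left unchanged, because the patterns only match directly adjacent characters.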

Usage Notes:

  • The new backend is currently for testing only. Due to limited maintainer resources and yadt's rapid iteration, community contributions and issue reports are not accepted for now; we will gradually open up later.
  • Please use capable LLMs for translation, such as glm-4-flash, deepseek-chat, etc.

Preview:2023 - Zhao_Xiangyu, Wang_Maolin - Embedding in Recommender Systems A Survey.zh-CN.dual.pdf

- ignore *.pdf files
- ignore *.docx files
…ndency

- bump requires-python to ">=3.12,<3.13"
- add yadt dependency with version ">=0.0.1a15"
- bump python version to 3.12 in build workflow
- update python version to 3.12 in publish workflow
- import and use `yadt_translate` for yadt backend
- add `--yadt` option to enable yadt backend
- implement `yadt_main` function to handle yadt translation process
- download remote fonts for yadt translation
- configure and use appropriate translator based on service name

refactor(translator): add placeholder methods for rich text and formulars

- add `get_rich_text_left_placeholder` and `get_rich_text_right_placeholder` methods
- add `get_formular_placeholder` method
- update `OpenAITranslator` to override placeholder methods
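
A rough sketch of what this refactor describes: a base translator exposes placeholder hooks and a subclass overrides them. Only the three method names and the `OpenAITranslator` name come from the commit message. The base class name, the `placeholder_id` parameter, the `wrap_rich_text` helper, and every placeholder format below are assumptions for illustration, not the actual pdf2zh code.

```python
class BaseTranslator:
    """Hypothetical base class; the real signatures may differ."""

    def get_rich_text_left_placeholder(self, placeholder_id: int) -> str:
        # Marks the start of a styled (e.g. bold or italic) span.
        return f"<b{placeholder_id}>"

    def get_rich_text_right_placeholder(self, placeholder_id: int) -> str:
        # Marks the end of the same styled span.
        return f"</b{placeholder_id}>"

    def get_formular_placeholder(self, placeholder_id: int) -> str:
        # Stands in for a formula that must not be translated.
        return f"[[formula:{placeholder_id}]]"


class OpenAITranslator(BaseTranslator):
    """Stand-in subclass that picks placeholder formats of its own."""

    def get_rich_text_left_placeholder(self, placeholder_id: int) -> str:
        return f"<style id='{placeholder_id}'>"

    def get_rich_text_right_placeholder(self, placeholder_id: int) -> str:
        return "</style>"

    def get_formular_placeholder(self, placeholder_id: int) -> str:
        return f"{{v{placeholder_id}}}"


def wrap_rich_text(translator: BaseTranslator, placeholder_id: int, text: str) -> str:
    """Hypothetical helper: surround `text` with the translator's markers."""
    left = translator.get_rich_text_left_placeholder(placeholder_id)
    right = translator.get_rich_text_right_placeholder(placeholder_id)
    return f"{left}{text}{right}"


if __name__ == "__main__":
    print(wrap_rich_text(OpenAITranslator(), 1, "bold text"))
```

Keeping the formats behind methods lets each translation service choose markers that it tends to preserve in its output, without the caller needing to know which service is in use.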
@awwaawwa
Contributor Author

I tested locally using the following commands and the translation works fine. When no output folder is specified, the output file is placed in the same folder as the input file.

Need community help to test compatibility of other supported features. Related issues are tracked at funstory-ai/yadt#20.

Please note that we currently do not accept issues regarding the output PDF, such as translation quality or translation errors. For now, we only accept issues like parameter passing errors, such as direct errors that prevent completing the translation process.

uv run pdf2zh --yadt --pages 1 -s openai:Qwen/Qwen2.5-7B-Instruct -t 20 "FILE"

- add upper bound to yadt version to prevent breaking changes
@awwaawwa awwaawwa marked this pull request as ready for review January 16, 2025 05:26
@Byaidu
Owner

Byaidu commented Jan 18, 2025

The 3.12 restriction is a bit too strict. Could you try relaxing it?

@awwaawwa
Contributor Author

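It is mainly because this feature is used: https://docs.python.org/3.12/whatsnew/3.12.html#pep-701-syntactic-formalization-of-f-strings . It should be possible to relax the restriction, but I don't have the bandwidth to handle it right now. Could we make yadt an optional dependency for the time being and deal with this later?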
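
For context, PEP 701 allows an f-string to reuse its own quote character inside a replacement field, which is a syntax error before Python 3.12. The sketch below shows the kind of rewrite that typically restores compatibility with 3.10 and 3.11. The `describe_page` function and its data are hypothetical and are not taken from yadt.

```python
# Python 3.12+ only (PEP 701): inner quotes may match the outer quotes.
#     label = f"{page["number"]}: {", ".join(page["fonts"])}"
# On 3.10 and 3.11 the line above is a SyntaxError, so it stays in a comment.


def describe_page(page: dict) -> str:
    """Portable form: hoist the quoted expressions out of the f-string."""
    number = page["number"]
    fonts = ", ".join(page["fonts"])
    return f"{number}: {fonts}"


if __name__ == "__main__":
    print(describe_page({"number": 1, "fonts": ["A", "B"]}))
```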

@awwaawwa
Contributor Author

awwaawwa commented Jan 18, 2025

If it is to be an optional dependency, I can make the corresponding changes after #480 is merged.

- expand python version range to >=3.10,<3.13
- update yadt requirement to >=0.0.1a20, <0.0.2
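
As a small illustration of what the `>=3.10,<3.13` range means at runtime, the following sketch checks an interpreter version against it using only the standard library. The function name `is_supported` and the constants are made up for this example; the real constraint is enforced by the installer through `requires-python`, not by code like this.

```python
import sys

# Bounds taken from the commit above: requires-python ">=3.10,<3.13".
MIN_VERSION = (3, 10)            # inclusive lower bound
MAX_VERSION_EXCLUSIVE = (3, 13)  # exclusive upper bound


def is_supported(version=None) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    print(f"Running Python {sys.version_info[0]}.{sys.version_info[1]}, supported: {is_supported()}")
```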
@awwaawwa
Contributor Author

awwaawwa commented Jan 20, 2025

Thanks to funstory-ai/yadt#21, yadt now supports Python 3.10 and 3.11. @Byaidu

Since the latest versions of onnxruntime and numpy no longer support Python 3.9, yadt also does not plan to support Python 3.9.

ONNX Runtime packages will stop supporting Python 3.8 and Python 3.9. This decision aligns with NumPy Python version support. To continue using ORT with Python 3.8 and Python 3.9, you can use ORT 1.19.2 and earlier.
https://github.com/microsoft/onnxruntime/releases/tag/v1.20.0

NumPy 2.1.0 provides support for the upcoming Python 3.13 release and drops support for Python 3.9.
https://github.com/numpy/numpy/releases/tag/v2.1.0

- adjust python version to mitigate potential compatibility issues
- maintain existing pip caching configuration
@Byaidu Byaidu merged commit c7a3cbd into Byaidu:main Jan 20, 2025
2 checks passed
@awwaawwa awwaawwa deleted the yadt branch January 20, 2025 04:29