UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 9: invalid continuation byte error #486

Closed
konbakuyomu opened this issue Jan 18, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@konbakuyomu

konbakuyomu commented Jan 18, 2025

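Problem description

Please describe the problem: When uploading certain PDFs, the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 9: invalid continuation byte is raised and translation fails. I am using the image pulled directly from Docker. From what I can see, converter.py should already contain a fix for this problem, so has the source code in the Docker image not been updated?

The run log is as follows: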
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 9: invalid continuation byte
Files before translation: ['20250117233209_kzc84ja3-dual.pdf', 'BL24C256A-test.pdf', 'G_series_Lua_API-mono.pdf', 'BL24C256A-test1.pdf', 'mathematics-10-03413-v2-dual.pdf', 'BL24C256A-008.pdf', 'mathematics-10-03413-v2-mono.pdf', 'BL24C256A-3.pdf', 'Graph Equations Involving Tensor Product of Graphs.pdf', 'BL24C256A-2-mono.pdf', 'BL24C256A-2.pdf', 'Graph Equations Involving Tensor Product of Graphs-mono.pdf', 'BL24C256A.pdf', 'BL24C256A-1-mono.pdf', 'BL24C256A-1.pdf', 'BL24C256A-1-dual.pdf', 'C700177_模数转换芯片ADC_SGM58031XMS10G-TR_规格书_WJ1490145.PDF', 'BL24C256A-2-dual.pdf', '20250117233209_kzc84ja3.pdf', 'Graph Equations Involving Tensor Product of Graphs-dual.pdf', 'BL24C256A-09.pdf', 'C2859066_心率传感器_MAX30101EFDT_规格书_WJ09450.PDF', 'G_series_Lua_API.pdf', 'mathematics-10-03413-v2.pdf', 'G_series_Lua_API-dual.pdf', '20250117233209_kzc84ja3-mono.pdf']
{'files': ['pdf2zh_files/BL24C256A-2.pdf'], 'pages': None, 'lang_in': 'en', 'lang_out': 'zh', 'service': 'google', 'output': PosixPath('pdf2zh_files'), 'thread': 4, 'callback': <function translate_file.<locals>.progress_bar at 0x7f232f5d2480>}

0%| | 0/16 [00:00<?, ?it/s]
6%|▋ | 1/16 [00:01<00:15, 1.06s/it]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/gradio/queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/gradio/blocks.py", line 2047, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/gradio/blocks.py", line 1594, in call_function
    prediction = await anyio.to_thread.run_sync( # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2505, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1005, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/gradio/utils.py", line 869, in wrapper
    response = f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/gui.py", line 165, in translate_file
    translate(**param)
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/high_level.py", line 278, in translate
    s_mono, s_dual = translate_stream(s_raw, **locals())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/high_level.py", line 213, in translate_stream
    obj_patch: dict = translate_patch(fp, **locals())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/high_level.py", line 148, in translate_patch
    interpreter.process_page(page)
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/pdfinterp.py", line 266, in process_page
    ops_new = self.device.end_page(page)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/converter.py", line 56, in end_page
    return self.receive_layout(self.cur_item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/converter.py", line 224, in receive_layout
    or vflag(child.fontname, child.get_text()) # 3. formula font
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pdf2zh/converter.py", line 175, in vflag
    font = font.decode()
           ^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 9: invalid continuation byte
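
The last frame shows the failure coming from a bare `font.decode()`, which assumes the font name bytes are valid UTF-8. The following is a minimal sketch of a more tolerant decode, not the fix that was actually made in converter.py. The helper name `decode_font_name`, the choice of GBK as a fallback encoding, and the sample bytes are all assumptions made for illustration.

```python
def decode_font_name(font, fallbacks=("gbk", "latin-1")):
    """Return a str for a font name that may arrive as bytes.

    Hypothetical helper: tries UTF-8 first, then each fallback encoding
    in order, and finally decodes with replacement characters so that
    an unusual font name never aborts the whole page.
    """
    if isinstance(font, str):
        return font
    try:
        return font.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return font.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the readable part, mark the rest with U+FFFD.
    return font.decode("utf-8", errors="replace")


# Made-up font name with byte 0xc8 at position 9, followed by a byte
# that is not a valid UTF-8 continuation byte.
raw_name = b"ABCDEF+XY\xc8\xcb"

try:
    raw_name.decode()  # same call shape as in the traceback
except UnicodeDecodeError as exc:
    print("bare decode fails:", exc)

print("tolerant decode:", decode_font_name(raw_name))
```

Whether GBK is the right guess depends on how the PDF producer encoded the name. The replacement fallback at the end at least keeps a font-name check from raising.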

Test document

Important

Please provide the PDF document used to reproduce the issue

BL24C256A.pdf

@konbakuyomu konbakuyomu added the bug Something isn't working label Jan 18, 2025
@awwaawwa
Contributor

awwaawwa commented Jan 18, 2025

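According to my testing, the latest source code can translate this file. The Docker image has indeed not been updated.
Current backend: BL24C256A-dual.pdf

New backend: BL24C256A.zh-CN.dual.pdf

Also note that this project is not currently optimized for technical documents, so the results may not be good. Optimization for technical documents is on the relatively long-term to-do list. The new backend's main tasks right now are improving results on papers and fixing bugs, so please be patient.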
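
Since the difference here came down to the Docker image lagging behind the source, a quick way to see which release an environment actually carries is to ask Python for the installed version. This is a small sketch using only the standard library. It assumes the distribution is installed under the name `pdf2zh` (inferred from the site-packages path in the traceback), and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pdf2zh" is an assumed distribution name; adjust if it differs.
    print("pdf2zh:", installed_version("pdf2zh") or "not installed")
```

Running this inside the container and comparing the result with the latest published release shows whether the image needs to be pulled again.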

@konbakuyomu
Author

OK, looking forward to the upcoming updates.

@Byaidu
Owner

Byaidu commented Jan 19, 2025

Updated.

@Byaidu Byaidu closed this as completed Jan 19, 2025