[Bug]: Encoding error when using Japanese text in simple_labelled.py #17846

Open
meniyama5319 opened this issue Feb 17, 2025 · 1 comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@meniyama5319

Bug Description

Hi,
Thank you so much for developing and maintaining this great library.
I ran into an encoding error during the property graph RAG indexing process when using Japanese text.
I fixed it by adding "encoding='utf-8'" to the two calls below in \llama_index\core\graph_stores\simple_labelled.py:

  1. 'with fs.open(persist_path, "w") as f:'
  2. 'with fs.open(persist_path, "r") as f:'

I would be very happy if you could apply this change to the original code, so that I no longer need to patch it after pulling your code!

Thank you,

Version

0.12.10

Steps to Reproduce

  1. Create a property graph index from many Japanese documents.
  2. An encoding error occurs when the index JSON is written or read.

Relevant Logs/Tracebacks

meniyama5319 added the bug and triage labels Feb 17, 2025

dosubot bot commented Feb 17, 2025

Hey @meniyama5319,

I'm here to help you with any issues you encounter. Whether it's solving bugs, answering questions, or becoming a contributor, I've got your back. Let's tackle this together!

The issue you're experiencing with encoding errors when using Japanese text in the simple_labelled.py file can be resolved by explicitly setting the encoding to 'utf-8' in the file operations. You can modify the code as follows:

  1. Change the line 'with fs.open(persist_path, "w") as f:' to 'with fs.open(persist_path, "w", encoding="utf-8") as f:'.
  2. Change the line 'with fs.open(persist_path, "r") as f:' to 'with fs.open(persist_path, "r", encoding="utf-8") as f:'.

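This change ensures that the file is read and written as UTF-8 regardless of the platform default. Without an explicit encoding, Python text-mode file operations typically fall back to the locale's preferred encoding, which on some systems cannot represent every character found in Japanese text. The modification also aligns with the default encoding used in other parts of the LlamaIndex library, such as the SimpleDirectoryReader class, which also uses UTF-8 encoding [1].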
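
The suggested change can be illustrated with a small round trip. This sketch uses the built-in `open` as a stand-in for the `fs.open` call quoted in the issue (an assumption made to keep the example self-contained). `persist_graph`, `load_graph`, the file name, and the sample data are hypothetical and are not LlamaIndex functions or formats.

```python
import json
import os
import tempfile


def persist_graph(data: dict, persist_path: str) -> None:
    # Explicit encoding, so the result does not depend on the OS locale.
    with open(persist_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(data, ensure_ascii=False))


def load_graph(persist_path: str) -> dict:
    # Read back with the same explicit encoding used for writing.
    with open(persist_path, "r", encoding="utf-8") as f:
        return json.loads(f.read())


# Made-up sample containing Japanese text, including U+301C (wave dash).
sample = {"nodes": [{"id": "n1", "label": "東京\u301c大阪"}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph_store.json")
    persist_graph(sample, path)
    restored = load_graph(path)

print("round trip ok:", restored == sample)
```

Using the same explicit encoding on both the write and the read path is the point of the two-line change: a file written as UTF-8 but read with a different locale default could still fail or come back garbled.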
