Segmentation Recognition Fails When Using create_by_text in Parent-Child Mode in the Knowledge Base API #13007

Closed
5 tasks done
tigflanker opened this issue Jan 24, 2025 · 3 comments
Labels
🐞 bug Something isn't working

Comments

@tigflanker

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

v0.15.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Step 1:
Execute the following curl command:

curl --location --request POST 'http://xxx/v1/datasets/0e9fd7d3-d610-42a2-bf9d-b29f6cf14faf/document/create_by_text' \
--header 'Authorization: Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "<sep>",
                "max_tokens": 2000,
                "parent_mode": "paragraph"
            },
            "subchunk_segmentation": {
                "separator": "\t",
                "max_tokens": 200,
                "chunk_overlap": 50
            }
        }
    }
}'

The response is as follows:

{
  "document": {
    "id": "68c8a870-fc54-4c28-bcda-eb4faec73a56",
    "position": 2,
    "data_source_type": "upload_file",
    "data_source_info": {
      "upload_file_id": "c078b1b9-93c4-4e6c-adcf-ab629a5db643"
    },
    "data_source_detail_dict": {
      "upload_file": {
        "id": "c078b1b9-93c4-4e6c-adcf-ab629a5db643",
        "name": "test-doc1",
        "size": 79,
        "extension": "txt",
        "mime_type": "text/plain",
        "created_by": "bf295d5e-b00f-4d5b-bd78-32e5f7aa8519",
        "created_at": 1737687348.807085
      }
    },
    "dataset_process_rule_id": "1d5c12bb-dd4c-4fe2-be50-7c171c537846",
    "name": "test-doc1",
    "created_from": "api",
    "created_by": "bf295d5e-b00f-4d5b-bd78-32e5f7aa8519",
    "created_at": 1737685813,
    "tokens": 0,
    "indexing_status": "parsing",
    "error": null,
    "enabled": true,
    "disabled_at": null,
    "disabled_by": null,
    "archived": false,
    "display_status": "indexing",
    "word_count": 87,
    "hit_count": 0,
    "doc_form": "hierarchical_model"
  },
  "batch": "20250124025548716342"
}

Step 2: The knowledge base shows that the document has been loaded successfully, but the "Preview Segmentation Interface" displays no content.

Image

Step 3: Opening the segmentation settings shows that parent-child mode is set correctly, but the delimiters and segment lengths from the request were not applied. They remain at the defaults (for example, the delimiter is \n\n and the parent segment length is 500).

Image

After manually setting these parameters, the segmentation works correctly, and the preview displays as expected.

Image
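
For reference, the split that the separators in the request describe can be sketched in a few lines of Python. This is only an illustration of the intended parent/child boundaries: it uses plain string splitting, ignores max_tokens and chunk_overlap, and is not Dify's actual chunking code. The function name split_parent_child is hypothetical.

```python
def split_parent_child(text, parent_sep, child_sep):
    """Naive illustration: split text into parent chunks, then each parent into child chunks.

    Assumption: separators are applied literally and token limits are ignored.
    """
    parents = [p for p in text.split(parent_sep) if p]
    return [(p, [c for c in p.split(child_sep) if c]) for p in parents]


# Same text and separators as in the request from Step 1.
text = "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5"
chunks = split_parent_child(text, "<sep>", "\t")

for parent, children in chunks:
    print(repr(parent), "->", children)
```

Under these assumptions the text yields two parent chunks, split at <sep>, and the second parent yields two child chunks, split at the tab. The default \n\n delimiter would split the text at a different position, after content_part_3.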

✔️ Expected Behavior

No response

❌ Actual Behavior

No response


dosubot bot commented Jan 24, 2025

The issue you're experiencing with the 'create_by_text' API endpoint in parent-child mode seems to be related to the segmentation settings not being applied correctly. Here are a few things to consider:

  1. Segmentation Configuration: Ensure that the segmentation settings in your process_rule are correctly defined. The separator and max_tokens should be set as per your requirements. In your case, it seems like the custom settings are not being applied, and the defaults (\n\n and 500 tokens) are being used instead.

  2. Process Rule Structure: Double-check the structure of your process_rule parameter. It should be correctly nested under rules, and all necessary fields should be included. The parent_mode and subchunk_segmentation should be properly placed under the rules object [1].

  3. API Endpoint Behavior: There might be a bug or limitation in the API that causes it to revert to default settings under certain conditions. This could be related to how the API validates and applies the process_rule parameter.

  4. Manual Adjustment: As you've noted, manually setting the parameters resolves the issue. This suggests that the API might not be correctly processing the initial configuration, possibly due to a bug or misconfiguration.

If the problem persists, consider reviewing the API documentation for any updates or known issues related to segmentation settings. Additionally, ensure that your Dify version is up to date, as newer versions might contain fixes for such issues.
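
The nesting described in point 2 can be checked mechanically before sending a request. The sketch below assumes one reading of that point, namely that parent_mode belongs directly under rules, next to segmentation and subchunk_segmentation. That layout is not confirmed anywhere in this thread, and the helper check_process_rule is hypothetical, not part of Dify.

```python
def check_process_rule(payload):
    """Return a list of structural problems, under the assumed layout described above."""
    problems = []
    rule = payload.get("process_rule") or {}
    rules = rule.get("rules") or {}

    if rule.get("mode") != "custom":
        problems.append("process_rule.mode is not 'custom'")

    # Both segmentation sections are expected to carry a separator and a token limit.
    for key in ("segmentation", "subchunk_segmentation"):
        section = rules.get(key)
        if not isinstance(section, dict):
            problems.append(f"rules.{key} is missing")
            continue
        for field in ("separator", "max_tokens"):
            if field not in section:
                problems.append(f"rules.{key}.{field} is missing")

    # Assumption: parent_mode sits directly under rules.
    if "parent_mode" not in rules:
        segmentation = rules.get("segmentation")
        if isinstance(segmentation, dict) and "parent_mode" in segmentation:
            problems.append("parent_mode is nested under rules.segmentation, not rules")
        else:
            problems.append("rules.parent_mode is missing")
    return problems


# The process_rule from the report, reproduced as a Python dict.
reported = {
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": False},
                {"id": "remove_urls_emails", "enabled": False},
            ],
            "segmentation": {"separator": "<sep>", "max_tokens": 2000, "parent_mode": "paragraph"},
            "subchunk_segmentation": {"separator": "\t", "max_tokens": 200, "chunk_overlap": 50},
        },
    },
}

print(check_process_rule(reported))
```

If the assumed layout is right, the reported payload is flagged only for where parent_mode is placed. Whether that placement explains the defaults being applied would still need to be confirmed against the API documentation for the version in use.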


@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 24, 2025
@crazywoola
Member

Please upgrade to v0.15.2.

@tigflanker
Author

Image

@crazywoola Hi crazywoola, I have upgraded Docker Compose to the latest version, but the result is the same: the segment configuration still does not take effect.
I have reduced the reproduction case to its simplest form. If there is a mistake in my POST structure, please point it out. Thank you.

curl --location --request POST 'http://ip/v1/datasets/7bb72914-d1ed-4422-957c-56bb25ac8170/document/create_by_text' \
--header 'Authorization: Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "<sep>",
                "max_tokens": 2000,
                "parent_mode": "paragraph"
            },
            "subchunk_segmentation": {
                "separator": "\t",
                "max_tokens": 200,
                "chunk_overlap": 50
            }
        }
    }
}'
