Segmentation Recognition Fails When Using create_by_text in Parent-Child Mode in the Knowledge Base API #13007

tigflanker · 2025-01-24T03:14:43Z

Self Checks

This is only for bug reports; if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

v0.15.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Step 1:
Execute the following curl command:

curl --location --request POST 'http://xxx/v1/datasets/0e9fd7d3-d610-42a2-bf9d-b29f6cf14faf/document/create_by_text' \
--header 'Authorization: Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "<sep>",
                "max_tokens": 2000,
                "parent_mode": "paragraph"
            },
            "subchunk_segmentation": {
                "separator": "\t",
                "max_tokens": 200,
                "chunk_overlap": 50
            }
        }
    }
}'

The response is as follows:

{
  "document": {
    "id": "68c8a870-fc54-4c28-bcda-eb4faec73a56",
    "position": 2,
    "data_source_type": "upload_file",
    "data_source_info": {
      "upload_file_id": "c078b1b9-93c4-4e6c-adcf-ab629a5db643"
    },
    "data_source_detail_dict": {
      "upload_file": {
        "id": "c078b1b9-93c4-4e6c-adcf-ab629a5db643",
        "name": "test-doc1",
        "size": 79,
        "extension": "txt",
        "mime_type": "text/plain",
        "created_by": "bf295d5e-b00f-4d5b-bd78-32e5f7aa8519",
        "created_at": 1737687348.807085
      }
    },
    "dataset_process_rule_id": "1d5c12bb-dd4c-4fe2-be50-7c171c537846",
    "name": "test-doc1",
    "created_from": "api",
    "created_by": "bf295d5e-b00f-4d5b-bd78-32e5f7aa8519",
    "created_at": 1737685813,
    "tokens": 0,
    "indexing_status": "parsing",
    "error": null,
    "enabled": true,
    "disabled_at": null,
    "disabled_by": null,
    "archived": false,
    "display_status": "indexing",
    "word_count": 87,
    "hit_count": 0,
    "doc_form": "hierarchical_model"
  },
  "batch": "20250124025548716342"
}

Step 2: The knowledge base shows that the document has been loaded successfully, but the "Preview Segmentation Interface" displays no content.

Step 3: Opening the segmentation settings shows that parent-child mode is set correctly, but the delimiters and segment lengths from the request are not applied; they remain at the defaults (for example, the delimiter is \n\n and the parent segment length is 500).

After manually setting these parameters, the segmentation works correctly, and the preview displays as expected.
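
To make the expected result concrete, here is a minimal Python sketch of how the sample text would split under the requested separators, compared with the default parent delimiter seen in the UI. It uses plain `str.split` and ignores `max_tokens` and `chunk_overlap`. It is only an illustration, not Dify's actual splitter, and `split_parent_child` is a hypothetical helper name.

```python
def split_parent_child(text, parent_sep, child_sep):
    """Split text into parent chunks, then each parent into child chunks.

    Illustration only: ignores max_tokens and chunk_overlap, and is not
    Dify's splitter.
    """
    parents = [p for p in text.split(parent_sep) if p]
    return [(p, [c for c in p.split(child_sep) if c]) for p in parents]


# The "text" field from the request, after JSON decoding of \t and \n.
TEXT = "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5"

# Split requested in process_rule: parents on "<sep>", children on a tab.
requested = split_parent_child(TEXT, "<sep>", "\t")

# Parent split with the default delimiter that the UI shows instead.
default_parents = TEXT.split("\n\n")

for parent, children in requested:
    print(repr(parent), "->", children)
print("default parents:", default_parents)
```

With the requested separators the first parent is `content_part_1` alone, whereas the default `\n\n` delimiter leaves `<sep>` embedded inside the first parent chunk.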

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot · 2025-01-24T03:16:46Z

The issue you're experiencing with the 'create_by_text' API endpoint in parent-child mode seems to be related to the segmentation settings not being applied correctly. Here are a few things to consider:

1. Segmentation Configuration: Ensure that the segmentation settings in your process_rule are correctly defined. The separator and max_tokens should be set as per your requirements. In your case, it seems like the custom settings are not being applied, and the defaults (\n\n and 500 tokens) are being used instead.
2. Process Rule Structure: Double-check the structure of your process_rule parameter. It should be correctly nested under rules, and all necessary fields should be included. The parent_mode and subchunk_segmentation should be properly placed under the rules object ^[1].
3. API Endpoint Behavior: There might be a bug or limitation in the API that causes it to revert to default settings under certain conditions. This could be related to how the API validates and applies the process_rule parameter.
4. Manual Adjustment: As you've noted, manually setting the parameters resolves the issue. This suggests that the API might not be correctly processing the initial configuration, possibly due to a bug or misconfiguration.

If the problem persists, consider reviewing the API documentation for any updates or known issues related to segmentation settings. Additionally, ensure that your Dify version is up to date, as newer versions might contain fixes for such issues.
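
The structure check suggested above can be sketched in a few lines of Python. The expected layout in `EXPECTED` is an assumption taken from the request posted in this thread, not from an official schema, and `check_process_rule` is a hypothetical helper.

```python
# Assumed layout, copied from the request in this thread (not an official schema).
EXPECTED = {
    "segmentation": ("separator", "max_tokens", "parent_mode"),
    "subchunk_segmentation": ("separator", "max_tokens", "chunk_overlap"),
}


def check_process_rule(process_rule):
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    if process_rule.get("mode") != "custom":
        problems.append("mode is not 'custom'")
    rules = process_rule.get("rules")
    if not isinstance(rules, dict):
        return problems + ["rules object is missing"]
    for section, keys in EXPECTED.items():
        block = rules.get(section)
        if not isinstance(block, dict):
            problems.append(f"rules.{section} is missing")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"rules.{section}.{key} is missing")
    return problems


# The process_rule from the reported request.
reported_rule = {
    "mode": "custom",
    "rules": {
        "pre_processing_rules": [
            {"id": "remove_extra_spaces", "enabled": False},
            {"id": "remove_urls_emails", "enabled": False},
        ],
        "segmentation": {"separator": "<sep>", "max_tokens": 2000, "parent_mode": "paragraph"},
        "subchunk_segmentation": {"separator": "\t", "max_tokens": 200, "chunk_overlap": 50},
    },
}

print(check_process_rule(reported_rule))
```

Under these assumptions the reported request has every field nested where this check expects it.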

crazywoola · 2025-01-24T03:20:11Z

Please upgrade to v0.15.2.

tigflanker · 2025-01-24T06:36:50Z

@crazywoola Hi crazywoola, I have upgraded Docker Compose to the latest version, but the result remains the same; the segment configuration does not take effect.
I have minimized the reproduction case to the simplest form. If there is a mistake in my POST structure, please point it out for me. Thank you.

curl --location --request POST 'http://ip/v1/datasets/7bb72914-d1ed-4422-957c-56bb25ac8170/document/create_by_text' \
--header 'Authorization: Bearer dataset-AWrQfx5NXyAwmI3VRhiGxaNP' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "<sep>",
                "max_tokens": 2000,
                "parent_mode": "paragraph"
            },
            "subchunk_segmentation": {
                "separator": "\t",
                "max_tokens": 200,
                "chunk_overlap": 50
            }
        }
    }
}'
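
For anyone reproducing this from a script, the same request can be built in Python with only the standard library. This is a sketch of the curl command above: `BASE_URL`, `DATASET_ID`, and `API_KEY` are placeholders to replace with your own values, and `build_request` is a hypothetical helper. The request is constructed but not sent.

```python
import json
import urllib.request

# Placeholders: substitute your own host, dataset ID, and dataset API key.
BASE_URL = "http://xxx/v1"
DATASET_ID = "your-dataset-id"
API_KEY = "dataset-your-key"

payload = {
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": False},
                {"id": "remove_urls_emails", "enabled": False},
            ],
            "segmentation": {"separator": "<sep>", "max_tokens": 2000, "parent_mode": "paragraph"},
            "subchunk_segmentation": {"separator": "\t", "max_tokens": 200, "chunk_overlap": 50},
        },
    },
}


def build_request(base_url, dataset_id, api_key, body):
    """Build (but do not send) the create_by_text POST request."""
    url = f"{base_url}/datasets/{dataset_id}/document/create_by_text"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(BASE_URL, DATASET_ID, API_KEY, payload)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```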

dosubot bot added the 🐞 bug Something isn't working label Jan 24, 2025

crazywoola closed this as completed Jan 24, 2025

tigflanker mentioned this issue Jan 24, 2025

[ReOpen] Segmentation Recognition Fails When Using create_by_text in Parent-Child Mode in the Knowledge Base API #13013

Closed
