Commit

Require users to provide unique urls when using VectorRM.
AMMAS1 committed Jul 7, 2024
1 parent df8912d commit e31dcf1
Showing 3 changed files with 5 additions and 7 deletions.
2 changes: 1 addition & 1 deletion examples/helper/process_kaggle_arxiv_abstract_dataset.py
@@ -21,7 +21,7 @@

# Reformat the dataset to match the VectorRM input format.
df.rename(columns={"abstracts": "content", "titles": "title"}, inplace=True)
- df['url'] = ['uid_' + str(idx) for idx in range(len(df))]
+ df['url'] = ['uid_' + str(idx) for idx in range(len(df))] # Ensure the url is unique.
df['description'] = ''

print(f'The downsampled dataset has {len(df)} samples.')
6 changes: 2 additions & 4 deletions examples/run_storm_wiki_gpt_with_VectorRM.py
@@ -8,13 +8,11 @@
You will also need an existing Qdrant vector store either saved in a folder locally offline or in a server online.
If not, then you would need a CSV file with documents, and the script is going to create the vector store for you.
The CSV should be in the following format:
- Content | Title | URL | Description
+ content | title | url | description
I am a document. | Document 1 | docu-n-112 | A self-explanatory document.
I am another document. | Document 2 | docu-l-13 | Another self-explanatory document.
- Notice that the URL can be an identifier for the document for any internal use.
- The Title, URL, and Description columns are optional. If not provided, the script will use default empty values.
- The content column is crucial and should be provided.
+ Notice that the URL will be a unique identifier for the document so ensure different documents have different urls.
Output will be structured as below
args.output_dir/
4 changes: 2 additions & 2 deletions src/rm.py
@@ -171,8 +171,8 @@ class VectorRM(dspy.Retrieve):
To be compatible with STORM, the custom documents should have the following fields:
- content: The main text content of the document.
- title: The title of the document.
- - url: The URL of the document. STORM use url as the unique identifier of the document.
- If not provided, a random string will be generated as the url.
+ - url: The URL of the document. STORM use url as the unique identifier of the document, so ensure different
+ documents have different urls.
- description (optional): The description of the document.
The documents should be stored in a CSV file.
"""
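
Since this change makes unique urls the caller's responsibility, it may help to check a CSV before handing it to VectorRM. The following is a minimal sketch and not part of this commit: the helper name `find_duplicate_urls` is hypothetical, it uses only the standard library, and it assumes a comma-delimited file with a `url` header column (pass a different `delimiter` if your file uses another separator).

```python
import csv
import io
from collections import Counter


def find_duplicate_urls(csv_file, delimiter=","):
    """Return the url values that appear on more than one row, sorted.

    Hypothetical helper for illustration; not part of STORM.
    """
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        raise ValueError("CSV is missing the required 'url' column")
    # Count how often each url occurs across all rows.
    counts = Counter(row["url"] for row in reader)
    return sorted(url for url, n in counts.items() if n > 1)


# In-memory sample whose third row reuses the url of the first row.
sample = io.StringIO(
    "content,title,url,description\n"
    "I am a document.,Document 1,docu-n-112,A self-explanatory document.\n"
    "I am another document.,Document 2,docu-l-13,Another self-explanatory document.\n"
    "I am a third document.,Document 3,docu-n-112,Reuses the first url.\n"
)
duplicates = find_duplicate_urls(sample)
print(duplicates)  # ['docu-n-112']
```

An empty result means every document has its own url. For a file on disk, the same function can be called on `open(path, newline="")`.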
