Require users to provide unique urls when using VectorRM.

ailabteam · Jul 7, 2024 · e31dcf1
1 parent df8912d

Showing 3 changed files with 5 additions and 7 deletions.
diff --git a/examples/helper/process_kaggle_arxiv_abstract_dataset.py b/examples/helper/process_kaggle_arxiv_abstract_dataset.py
@@ -21,7 +21,7 @@
 
  # Reformat the dataset to match the VectorRM input format.
  df.rename(columns={"abstracts": "content", "titles": "title"}, inplace=True)
- df['url'] = ['uid_' + str(idx) for idx in range(len(df))]
+ df['url'] = ['uid_' + str(idx) for idx in range(len(df))] # Ensure the url is unique.
  df['description'] = ''
 
  print(f'The downsampled dataset has {len(df)} samples.')

diff --git a/examples/run_storm_wiki_gpt_with_VectorRM.py b/examples/run_storm_wiki_gpt_with_VectorRM.py
@@ -8,13 +8,11 @@
 You will also need an existing Qdrant vector store either saved in a folder locally offline or in a server online.
 If not, then you would need a CSV file with documents, and the script is going to create the vector store for you.
 The CSV should be in the following format:
-Content | Title | URL | Description
+content | title | url | description
 I am a document. | Document 1 | docu-n-112 | A self-explanatory document.
 I am another document. | Document 2 | docu-l-13 | Another self-explanatory document.
 
-Notice that the URL can be an identifier for the document for any internal use.
-The Title, URL, and Description columns are optional. If not provided, the script will use default empty values.
-The content column is crucial and should be provided.
+Notice that the URL will be a unique identifier for the document so ensure different documents have different urls.
 
 Output will be structured as below
 args.output_dir/

diff --git a/src/rm.py b/src/rm.py
@@ -171,8 +171,8 @@ class VectorRM(dspy.Retrieve):
  To be compatible with STORM, the custom documents should have the following fields:
  - content: The main text content of the document.
  - title: The title of the document.
- - url: The URL of the document. STORM use url as the unique identifier of the document.
- If not provided, a random string will be generated as the url.
+ - url: The URL of the document. STORM use url as the unique identifier of the document, so ensure different
+ documents have different urls.
  - description (optional): The description of the document.
  The documents should be stored in a CSV file.
  """