feat(dataframe): add support for DataFrame outputs across multiple components (langflow-ai#5589)

* add dataframe outputs to vector stores, directory, url, split text
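The pattern behind these new outputs can be sketched in isolation. This is a minimal sketch using hypothetical stand-ins for langflow's `Data` and `DataFrame` classes (the real ones live in `langflow.schema`); only the shape of `as_dataframe()` mirrors the actual diff, which simply re-wraps each component's existing `list[Data]` output.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Data:
    """Stand-in for langflow.schema.Data: a thin record wrapper."""

    text: str = ""
    data: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {"text": self.text, **self.data}


class DataFrame(pd.DataFrame):
    """Stand-in for langflow.schema.DataFrame: pandas built from Data rows."""

    def __init__(self, data=None, **kwargs):
        # Accept either a list of Data objects or a list of dicts.
        if data and isinstance(data[0], Data):
            data = [d.to_dict() for d in data]
        super().__init__(data, **kwargs)


class URLComponentSketch:
    """Toy component showing the as_dataframe() output pattern."""

    def fetch_content(self) -> list[Data]:
        return [
            Data(text="hello", data={"source": "http://a"}),
            Data(text="world", data={"source": "http://b"}),
        ]

    def as_dataframe(self) -> DataFrame:
        # The new output re-wraps the existing Data output; no new fetching logic.
        return DataFrame(self.fetch_content())


df = URLComponentSketch().as_dataframe()
print(df.shape)  # (2, 2)
```

The same one-liner (`DataFrame(self.existing_output())`) is what the vector store, directory, URL, and split-text components each gain below.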

* add dataframe import

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* add parse dataframe

* [autofix.ci] apply automated fixes

* Refactor: Update DataFrame handling in components

- Added import of DataFrame in directory and url components.
- Renamed variable 'df' to 'dataframe' in ParseDataFrameComponent for clarity.
- Updated the _clean_args and parse_data methods to use 'dataframe' instead of 'df' for consistency.

These changes enhance code readability and maintainability by standardizing the terminology used for DataFrame objects.

* [autofix.ci] apply automated fixes

* remove parse dataframe

* Add tests for URL component functionality and data handling

* Enhance DirectoryComponent tests with new functionality and parameters

- Added tests for loading files with specific types and handling hidden files.
- Implemented tests for directory loading with depth and multithreading support.
- Introduced a new test for converting directory contents to a DataFrame.
- Updated existing tests to include additional parameters like 'silent_errors' and 'types'.

These changes improve test coverage and ensure the DirectoryComponent behaves as expected under various conditions.

* update retrieve_file_paths for backwards compatibility

* Refactor DirectoryComponent to handle file types more robustly

- Removed the default assignment of TEXT_FILE_TYPES to 'types' and added logic to use all supported types if none are specified.
- Implemented validation to ensure only valid file types are processed, improving error handling.
- Updated the file retrieval process to utilize the filtered list of valid types.

These changes enhance the flexibility and reliability of the DirectoryComponent's file loading functionality.

* Refactor and simplify tests in test_data_components.py

- Removed multiple tests related to HTTP requests, including successful and failed GET requests, timeouts, and multiple URL handling, to streamline the test suite.
- Cleaned up imports and unnecessary mock setups to enhance readability and maintainability.
- Focused on retaining essential tests for DirectoryComponent and URLComponent functionality, ensuring core features are still validated.

These changes improve the clarity and efficiency of the test suite while maintaining coverage for critical components.

* Add unit tests for DirectoryComponent functionality

- Introduced a new test file for DirectoryComponent, enhancing test coverage.
- Implemented various tests to validate loading files with specific types, handling hidden files, and supporting multithreading.
- Added tests for directory loading with depth and converting directory contents to a DataFrame.
- Ensured tests cover different scenarios, including recursive loading and file type filtering.

These changes improve the robustness and reliability of the DirectoryComponent's functionality through comprehensive testing.

* Add unit tests for URLComponent functionality

- Introduced a new test file for URLComponent, enhancing test coverage for its methods.
- Implemented tests for fetching content from valid URLs, handling multiple URLs, and validating error handling for invalid URLs.
- Added tests for converting fetched content to a DataFrame and ensuring correct message formatting.
- Mocked web requests to simulate various scenarios, ensuring robust testing of URLComponent's functionality.

These changes improve the reliability and correctness of the URLComponent through comprehensive testing.
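The mocking approach these tests rely on can be illustrated with a self-contained example. `fetch_text` below is a toy stand-in for the component's loader call, not langflow code; the point is that patching the network boundary makes the test offline and deterministic.

```python
import urllib.request
from io import BytesIO
from unittest.mock import patch


def fetch_text(url: str) -> str:
    """Toy fetcher standing in for the component's document-loader call."""
    with urllib.request.urlopen(url) as resp:  # never hit during the test
        return resp.read().decode("utf-8")


def test_fetch_text_without_network():
    # Patch the network call so the test runs offline with a canned response.
    with patch("urllib.request.urlopen") as mock_open:
        mock_open.return_value.__enter__.return_value = BytesIO(b"<p>hi</p>")
        assert fetch_text("http://example.invalid") == "<p>hi</p>"
        mock_open.assert_called_once_with("http://example.invalid")


test_fetch_text_without_network()
print("ok")
```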

* Add unit tests for SplitTextComponent functionality

- Introduced a new test file for SplitTextComponent, enhancing test coverage for its methods.
- Implemented tests for basic text splitting, handling overlaps, custom separators, and preserving metadata.
- Added tests for converting split text results to a DataFrame and handling empty input.
- Ensured functionality for single and multiple input texts is validated.

These changes improve the reliability and correctness of the SplitTextComponent through comprehensive testing.
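The chunking semantics under test can be sketched as a simple sliding window. This is a simplification for illustration; the real component delegates to a LangChain text splitter, which also honors separators.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        msg = "chunk_overlap must be smaller than chunk_size"
        raise ValueError(msg)
    # Slide a window of chunk_size characters, advancing by size - overlap.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


print(split_text("abcdefgh", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'gh']
```

Note that empty input yields an empty list, the case the new empty-input test covers.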

* Add comment to ignore FBT001 in retrieve_file_paths function

* Validate specified file types in DirectoryComponent and raise ValueError for invalid types

* Fix type hint in DataFrame constructor to support list of dicts or Data objects. This change enhances type safety and clarity in the DataFrame initialization process.

* Enhance DirectoryComponent tests to validate error handling for invalid file types

- Removed the test case for 'exe' file type from valid scenarios.
- Added a new test to ensure DirectoryComponent raises a ValueError for invalid file types, specifically when 'exe' is specified.
- Improved test coverage for DirectoryComponent by validating error messages for unsupported file types.

These changes strengthen the reliability of the DirectoryComponent by ensuring proper error handling for invalid inputs.

* [autofix.ci] apply automated fixes

* Update error handling in Component class to return None for missing flow_id or session_id

- Modified the send_error_message method to include a type hint that allows for returning None in addition to Message.
- Added a conditional check to return None if flow_id or session_id is not present, improving robustness in error handling.

These changes enhance the reliability of the Component class by ensuring it gracefully handles cases with missing identifiers.

* Refactor error handling in Component class to return None for missing session_id

- Updated the send_error_message method to remove the flow_id check, simplifying the logic.
- Enhanced robustness by ensuring that the method returns None if session_id is not present.

These changes improve the reliability of the Component class in handling error messages.

* Update required_inputs for DataFrame method in JSON configurations

- Modified the 'required_inputs' field for the 'DataFrame' method in both 'Graph Vector Store RAG.json' and 'Vector Store RAG.json' files to include necessary parameters: 'api_endpoint', 'collection_name', and 'token'.
- In 'Vector Store RAG.json', added 'collection_name_new' to the 'required_inputs' list.

These changes ensure that the DataFrame method has the appropriate inputs defined for proper functionality.

* [autofix.ci] apply automated fixes

* Enhance BaseComponent to use deep copy for attribute values in template configuration

- Updated the BaseComponent class to utilize `copy.deepcopy` when assigning values to `template_config`. This change ensures that modifications to the original component's attributes do not affect the template configuration, enhancing data integrity and preventing unintended side effects.

These changes improve the reliability of the BaseComponent by ensuring that the template configuration remains consistent and isolated from the original component's state.
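A toy example of the aliasing bug that `copy.deepcopy` prevents (the real code iterates ATTR_FUNC_MAPPING over component attributes; this sketch uses a plain dict):

```python
import copy


def get_template_config(component: dict, *, deep: bool) -> dict:
    template_config = {}
    for attribute, value in component.items():
        if value is not None:
            # Without deepcopy, the config and the component share one object.
            template_config[attribute] = copy.deepcopy(value) if deep else value
    return template_config


component = {"outputs": [{"name": "data"}]}
shared = get_template_config(component, deep=False)
shared["outputs"][0]["name"] = "mutated"
print(component["outputs"][0]["name"])  # 'mutated' -- the aliasing bug

component = {"outputs": [{"name": "data"}]}
isolated = get_template_config(component, deep=True)
isolated["outputs"][0]["name"] = "mutated"
print(component["outputs"][0]["name"])  # 'data' -- isolated, as intended
```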

* Added output for 'dataframe' in both ingestion and rag graphs

- Updated the ingestion vector store ID for better identification.
- Added output for 'dataframe' in both ingestion and rag graphs to enhance data handling.
- Simplified the output assignment for search results in rag graph by using a data list.

These changes improve the test structure and ensure that the vector store components are correctly configured for better testing outcomes.

* Refactor vector store RAG tests for improved validation and consistency

- Updated test assertions in `test_vector_store_rag_dump_components_and_edges` to verify the expected number of nodes and their types using a mapping for easier lookup.
- Changed the ingestion vector store ID from `vector-store-123` to `ingestion-vector-store-123` for better identification.
- Adjusted expected edges in the tests to reflect the new vector store ID, ensuring accurate edge validation.

These changes enhance the test structure and ensure that the vector store components are correctly configured for better testing outcomes.

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Gabriel Luiz Freitas Almeida <[email protected]>
3 people authored Jan 10, 2025
1 parent 18acd30 commit 45ed8e5
Showing 18 changed files with 952 additions and 216 deletions.
7 changes: 4 additions & 3 deletions src/backend/base/langflow/base/data/utils.py
@@ -56,11 +56,12 @@ def format_directory_path(path: str) -> str:
     return path.replace("\n", "\\n")
 
 
+# Ignoring FBT001 because the DirectoryComponent in 1.0.19
+# calls this function without keyword arguments
 def retrieve_file_paths(
     path: str,
-    *,
-    load_hidden: bool,
-    recursive: bool,
+    load_hidden: bool,  # noqa: FBT001
+    recursive: bool,  # noqa: FBT001
     depth: int,
     types: list[str] = TEXT_FILE_TYPES,
 ) -> list[str]:
6 changes: 5 additions & 1 deletion src/backend/base/langflow/base/vectorstores/model.py
@@ -6,7 +6,7 @@
 from langflow.field_typing import Text, VectorStore
 from langflow.helpers.data import docs_to_data
 from langflow.io import DataInput, MultilineInput, Output
-from langflow.schema import Data
+from langflow.schema import Data, DataFrame
 
 if TYPE_CHECKING:
     from langchain_core.documents import Document
@@ -70,6 +70,7 @@ def __init_subclass__(cls, **kwargs):
                 name="search_results",
                 method="search_documents",
             ),
+            Output(display_name="DataFrame", name="dataframe", method="as_dataframe"),
         ]

def _validate_outputs(self) -> None:
@@ -143,6 +144,9 @@ def search_documents(self) -> list[Data]:
         self.status = search_results
         return search_results
 
+    def as_dataframe(self) -> DataFrame:
+        return DataFrame(self.search_documents())
+
     def get_retriever_kwargs(self):
         """Get the retriever kwargs. Implementations can override this method to provide custom retriever kwargs."""
         return {}
24 changes: 19 additions & 5 deletions src/backend/base/langflow/components/data/directory.py
@@ -2,6 +2,7 @@
 from langflow.custom import Component
 from langflow.io import BoolInput, IntInput, MessageTextInput, MultiselectInput
 from langflow.schema import Data
+from langflow.schema.dataframe import DataFrame
 from langflow.template import Output


@@ -67,11 +68,12 @@ class DirectoryComponent(Component):
 
     outputs = [
         Output(display_name="Data", name="data", method="load_directory"),
+        Output(display_name="DataFrame", name="dataframe", method="as_dataframe"),
     ]
 
     def load_directory(self) -> list[Data]:
         path = self.path
-        types = self.types or TEXT_FILE_TYPES
+        types = self.types
         depth = self.depth
         max_concurrency = self.max_concurrency
         load_hidden = self.load_hidden
@@ -81,13 +83,22 @@ def load_directory(self) -> list[Data]:
 
         resolved_path = self.resolve_path(path)
 
+        # If no types are specified, use all supported types
+        if not types:
+            types = TEXT_FILE_TYPES
+
+        # Check if all specified types are valid
+        invalid_types = [t for t in types if t not in TEXT_FILE_TYPES]
+        if invalid_types:
+            msg = f"Invalid file types specified: {invalid_types}. Valid types are: {TEXT_FILE_TYPES}"
+            raise ValueError(msg)
+
+        valid_types = types
+
         file_paths = retrieve_file_paths(
-            resolved_path, load_hidden=load_hidden, recursive=recursive, depth=depth, types=types
+            resolved_path, load_hidden=load_hidden, recursive=recursive, depth=depth, types=valid_types
         )
 
-        if types:
-            file_paths = [fp for fp in file_paths if any(fp.endswith(ext) for ext in types)]
-
         loaded_data = []
         if use_multithreading:
             loaded_data = parallel_load_data(file_paths, silent_errors=silent_errors, max_concurrency=max_concurrency)
@@ -97,3 +108,6 @@ def load_directory(self) -> list[Data]:
         valid_data = [x for x in loaded_data if x is not None and isinstance(x, Data)]
         self.status = valid_data
         return valid_data
+
+    def as_dataframe(self) -> DataFrame:
+        return DataFrame(self.load_directory())
5 changes: 5 additions & 0 deletions src/backend/base/langflow/components/data/url.py
@@ -6,6 +6,7 @@
 from langflow.helpers.data import data_to_text
 from langflow.io import DropdownInput, MessageTextInput, Output
 from langflow.schema import Data
+from langflow.schema.dataframe import DataFrame
 from langflow.schema.message import Message


@@ -35,6 +36,7 @@ class URLComponent(Component):
     outputs = [
         Output(display_name="Data", name="data", method="fetch_content"),
         Output(display_name="Text", name="text", method="fetch_content_text"),
+        Output(display_name="DataFrame", name="dataframe", method="as_dataframe"),
     ]
 
     def ensure_url(self, string: str) -> str:
@@ -88,3 +90,6 @@ def fetch_content_text(self) -> Message:
         result_string = data_to_text("{text}", data)
         self.status = result_string
         return Message(text=result_string)
+
+    def as_dataframe(self) -> DataFrame:
+        return DataFrame(self.fetch_content())
@@ -2,7 +2,7 @@
 
 from langflow.custom import Component
 from langflow.io import HandleInput, IntInput, MessageTextInput, Output
-from langflow.schema import Data
+from langflow.schema import Data, DataFrame
 from langflow.utils.util import unescape_string


@@ -19,6 +19,7 @@ class SplitTextComponent(Component):
             info="The data to split.",
             input_types=["Data"],
             is_list=True,
+            required=True,
         ),
         IntInput(
             name="chunk_overlap",
@@ -42,6 +43,7 @@
 
     outputs = [
         Output(display_name="Chunks", name="chunks", method="split_text"),
+        Output(display_name="DataFrame", name="dataframe", method="as_dataframe"),
     ]
 
     def _docs_to_data(self, docs):
@@ -61,3 +63,6 @@ def split_text(self) -> list[Data]:
         data = self._docs_to_data(docs)
         self.status = data
         return data
+
+    def as_dataframe(self) -> DataFrame:
+        return DataFrame(self.split_text())
@@ -1,3 +1,4 @@
+import copy
 import operator
 import re
 from typing import TYPE_CHECKING, Any, ClassVar
@@ -83,7 +84,8 @@ def get_template_config(component):
         if hasattr(component, attribute):
             value = getattr(component, attribute)
             if value is not None:
-                template_config[attribute] = func(value=value)
+                value_copy = copy.deepcopy(value)
+                template_config[attribute] = func(value=value_copy)
 
     for key in template_config.copy():
         if key not in ATTR_FUNC_MAPPING:
@@ -1173,9 +1173,11 @@ async def send_error(
         session_id: str,
         trace_name: str,
         source: Source,
-    ) -> Message:
+    ) -> Message | None:
         """Send an error message to the frontend."""
         flow_id = self.graph.flow_id if hasattr(self, "graph") else None
+        if not session_id:
+            return None
         error_message = ErrorMessage(
             flow_id=flow_id,
             exception=exception,
@@ -187,6 +187,17 @@
           "Message"
         ],
         "value": "__UNDEFINED__"
-      }
+      },
+      {
+        "cache": true,
+        "display_name": "DataFrame",
+        "method": "as_dataframe",
+        "name": "dataframe",
+        "selected": "DataFrame",
+        "types": [
+          "DataFrame"
+        ],
+        "value": "__UNDEFINED__"
+      }
     ],
     "pinned": false,
@@ -208,7 +219,7 @@
"show": true,
"title_case": false,
"type": "code",
"value": "import re\n\nfrom langchain_community.document_loaders import AsyncHtmlLoader, WebBaseLoader\n\nfrom langflow.custom import Component\nfrom langflow.helpers.data import data_to_text\nfrom langflow.io import DropdownInput, MessageTextInput, Output\nfrom langflow.schema import Data\nfrom langflow.schema.message import Message\n\n\nclass URLComponent(Component):\n display_name = \"URL\"\n description = \"Fetch content from one or more URLs.\"\n icon = \"layout-template\"\n name = \"URL\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=\"Output Format. Use 'Text' to extract the text from the HTML or 'Raw HTML' for the raw HTML content.\",\n options=[\"Text\", \"Raw HTML\"],\n value=\"Text\",\n ),\n ]\n\n outputs = [\n Output(display_name=\"Data\", name=\"data\", method=\"fetch_content\"),\n Output(display_name=\"Text\", name=\"text\", method=\"fetch_content_text\"),\n ]\n\n def ensure_url(self, string: str) -> str:\n \"\"\"Ensures the given string is a URL by adding 'http://' if it doesn't start with 'http://' or 'https://'.\n\n Raises an error if the string is not a valid URL.\n\n Parameters:\n string (str): The string to be checked and possibly modified.\n\n Returns:\n str: The modified string that is ensured to be a URL.\n\n Raises:\n ValueError: If the string is not a valid URL.\n \"\"\"\n if not string.startswith((\"http://\", \"https://\")):\n string = \"http://\" + string\n\n # Basic URL validation regex\n url_regex = re.compile(\n r\"^(https?:\\/\\/)?\" # optional protocol\n r\"(www\\.)?\" # optional www\n r\"([a-zA-Z0-9.-]+)\" # domain\n r\"(\\.[a-zA-Z]{2,})?\" # top-level domain\n r\"(:\\d+)?\" # optional port\n r\"(\\/[^\\s]*)?$\", # optional path\n re.IGNORECASE,\n )\n\n if not url_regex.match(string):\n msg = f\"Invalid URL: {string}\"\n 
raise ValueError(msg)\n\n return string\n\n def fetch_content(self) -> list[Data]:\n urls = [self.ensure_url(url.strip()) for url in self.urls if url.strip()]\n if self.format == \"Raw HTML\":\n loader = AsyncHtmlLoader(web_path=urls, encoding=\"utf-8\")\n else:\n loader = WebBaseLoader(web_paths=urls, encoding=\"utf-8\")\n docs = loader.load()\n data = [Data(text=doc.page_content, **doc.metadata) for doc in docs]\n self.status = data\n return data\n\n def fetch_content_text(self) -> Message:\n data = self.fetch_content()\n\n result_string = data_to_text(\"{text}\", data)\n self.status = result_string\n return Message(text=result_string)\n"
"value": "import re\n\nfrom langchain_community.document_loaders import AsyncHtmlLoader, WebBaseLoader\n\nfrom langflow.custom import Component\nfrom langflow.helpers.data import data_to_text\nfrom langflow.io import DropdownInput, MessageTextInput, Output\nfrom langflow.schema import Data\nfrom langflow.schema.dataframe import DataFrame\nfrom langflow.schema.message import Message\n\n\nclass URLComponent(Component):\n display_name = \"URL\"\n description = \"Fetch content from one or more URLs.\"\n icon = \"layout-template\"\n name = \"URL\"\n\n inputs = [\n MessageTextInput(\n name=\"urls\",\n display_name=\"URLs\",\n info=\"Enter one or more URLs, by clicking the '+' button.\",\n is_list=True,\n tool_mode=True,\n ),\n DropdownInput(\n name=\"format\",\n display_name=\"Output Format\",\n info=\"Output Format. Use 'Text' to extract the text from the HTML or 'Raw HTML' for the raw HTML content.\",\n options=[\"Text\", \"Raw HTML\"],\n value=\"Text\",\n ),\n ]\n\n outputs = [\n Output(display_name=\"Data\", name=\"data\", method=\"fetch_content\"),\n Output(display_name=\"Text\", name=\"text\", method=\"fetch_content_text\"),\n Output(display_name=\"DataFrame\", name=\"dataframe\", method=\"as_dataframe\"),\n ]\n\n def ensure_url(self, string: str) -> str:\n \"\"\"Ensures the given string is a URL by adding 'http://' if it doesn't start with 'http://' or 'https://'.\n\n Raises an error if the string is not a valid URL.\n\n Parameters:\n string (str): The string to be checked and possibly modified.\n\n Returns:\n str: The modified string that is ensured to be a URL.\n\n Raises:\n ValueError: If the string is not a valid URL.\n \"\"\"\n if not string.startswith((\"http://\", \"https://\")):\n string = \"http://\" + string\n\n # Basic URL validation regex\n url_regex = re.compile(\n r\"^(https?:\\/\\/)?\" # optional protocol\n r\"(www\\.)?\" # optional www\n r\"([a-zA-Z0-9.-]+)\" # domain\n r\"(\\.[a-zA-Z]{2,})?\" # top-level domain\n r\"(:\\d+)?\" # optional port\n 
r\"(\\/[^\\s]*)?$\", # optional path\n re.IGNORECASE,\n )\n\n if not url_regex.match(string):\n msg = f\"Invalid URL: {string}\"\n raise ValueError(msg)\n\n return string\n\n def fetch_content(self) -> list[Data]:\n urls = [self.ensure_url(url.strip()) for url in self.urls if url.strip()]\n if self.format == \"Raw HTML\":\n loader = AsyncHtmlLoader(web_path=urls, encoding=\"utf-8\")\n else:\n loader = WebBaseLoader(web_paths=urls, encoding=\"utf-8\")\n docs = loader.load()\n data = [Data(text=doc.page_content, **doc.metadata) for doc in docs]\n self.status = data\n return data\n\n def fetch_content_text(self) -> Message:\n data = self.fetch_content()\n\n result_string = data_to_text(\"{text}\", data)\n self.status = result_string\n return Message(text=result_string)\n\n def as_dataframe(self) -> DataFrame:\n return DataFrame(self.fetch_content())\n"
},
"format": {
"_input_type": "DropdownInput",
