This project accompanies the Data + AI Summit 2024 presentation "Streaming Data Pipelines: From Supernovas to LLMs", which is available here.
Note: This is not a beginner tutorial, nor is it a step-by-step guide. For easy to replicate tutorials, please visit databricks.com/demos.
This hands-on, in-depth project demonstrates the use of 36,000 NASA circulars for a compoind AI application (RAG + LLL).
-
Get the Circulars in packed JSON format from here
-
Upload them to a Databricks UC managed volume and extract them
-
Use the use the provided DLT pipeline to ingest the data, make sure to specify the right folder to read from
-
Chunk the data (as a reference, have a look at the provided data prep notebook)
-
Define a Langchain template with a question and RAG content. (as a reference have a look at the provided RAG chain notebook)
-
Examine the output and iterate over the prompt. Note that unlike without RAG, this version contains precise data from 2024 and no hallucinations.
The code is provided "as is" without any warranty.
For questions about Databricks products, use our forum at community.databricks.com.
We would like to express our gratitude to the following individuals for their contributions and support:
- Judith Rascusin (NASA)
- Alex, Nicolas, Raghu, Praveen, Neil, Eric (Databricks)