jettro/MyDataPipeline

Repository containing my experiments with a data pipeline
Introduction

Welcome to my data pipeline project. This is not a single project; it contains multiple experiments that help me explore new technologies. I'll explain the different applications below. In general, the repository has the following structure.

  • config_files : Contains the files needed to configure components, like indexes in OpenSearch or a schema in Weaviate
  • data_sources : Contains the different data files used in the applications
  • tests : Contains a few unit tests, only used to try out small, specific things
  • infra : Docker files to start up specific components (OpenSearch, for instance)

The other folders are modules:

  • pipeline : Contains the Dagster pipeline that manages the indexes in OpenSearch
  • reranking : Components that can re-rank results
  • search : Contains the files for working with OpenSearch: templates, data, queries, and a tool to parse explain output
  • util : Small utilities that we can re-use in the modules
  • weaviatedb : Files used to interact with Weaviate
  • langchainmod : Files used to interact with LangChain code

Files

  • log_config.py : Configuration for the Python logging framework (see the sketch after this list)
  • requirements.txt : The Python libraries used in the project
  • run_query_pipeline.py : Runner for the Weaviate query pipeline that also applies the re-ranker
  • run_langchain_ro_vac.py : Runner for the LangChain sample that imports and queries multiple vector stores
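
The contents of log_config.py are not reproduced in this README. As an illustration only, a dictConfig-based setup for Python's standard logging module might look roughly like this:

import logging.config

# Hypothetical logging configuration: a single console handler with a readable format.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "default": {"format": "%(asctime)s %(levelname)s %(name)s - %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "default"},
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}


def configure_logging():
    # Apply the configuration for the whole project.
    logging.config.dictConfig(LOGGING_CONFIG)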

Docker issues with OpenSearch

When running OpenSearch under Rancher Desktop, create the following file to raise vm.max_map_count and the open-file limits that OpenSearch needs: /Users/jettrocoenradie/Library/Application Support/rancher-desktop/lima/_config/override.yaml

provision:
- mode: system
  script: |
    #!/bin/sh
    set -o xtrace
    sysctl -w vm.max_map_count=262144
    cat <<'EOF' > /etc/security/limits.d/rancher-desktop.conf
    * soft     nofile         82920
    * hard     nofile         82920
    EOF

You can test with curl; for now we do not verify the certificate:

curl https://localhost:9200 -ku admin:admin
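
If you prefer to check the connection from Python, a minimal sketch using the opensearch-py client could look like this (same admin credentials as the curl call; certificate verification stays disabled):

from opensearchpy import OpenSearch

# Connect to the local OpenSearch container; verification of the self-signed
# certificate is disabled, matching the -k flag in the curl call above.
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=True,
    verify_certs=False,
    ssl_show_warn=False,
)
print(client.info())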

Dagster

I am experimenting with Dagster. Start the local development server with:

dagster dev -f ./pipeline/products_dagster.py

https://docs.dagster.io/getting-started
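
The real pipeline lives in pipeline/products_dagster.py and is not reproduced here. As a rough sketch of the kind of module dagster dev expects, with placeholder asset names:

from dagster import Definitions, asset


@asset
def products():
    # Placeholder: the real asset would load product data, e.g. from data_sources/.
    return [{"id": 1, "name": "example product"}]


@asset
def product_index(products):
    # Placeholder: the real asset would push the products into an OpenSearch index.
    return {"indexed": len(products)}


# dagster dev -f <file> discovers this Definitions object.
defs = Definitions(assets=[products, product_index])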

Weaviate custom QnA module

First build the custom Docker image from the Dockerfile, then start Docker Compose with the locally built image:

docker build -f mdeberta.Dockerfile -t mdeberta-qna-transformers .
docker compose -f docker-compose-weaviate.yml up -d

I am also experimenting with Streamlit to provide a GUI. Run the sample with the following command:

streamlit run run_weaviate_qna.py
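
Behind the Streamlit GUI, a question against the custom qna-transformers module boils down to an ask query. A minimal sketch with the v3 Python client (the port, class name, and property names are assumptions, not necessarily what this repository uses):

import weaviate

# Connect to the local Weaviate started by docker-compose-weaviate.yml.
client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Document", ["content"])
    .with_ask({"question": "What is this document about?"})
    .with_additional("answer { result certainty }")
    .with_limit(1)
    .do()
)
print(result)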

Weaviate CLIP image search

Start Docker Compose for the CLIP model. Everything runs on your local machine:

docker compose -f docker-compose-weaviate-clip.yml up -d

Run the sample with the following command. Make sure Streamlit is installed, which it is if you installed the dependencies from requirements.txt:

streamlit run run_weaviate_clip.py
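
The image search itself is a nearImage query against the CLIP module. A minimal sketch with the v3 Python client (the port, class name, property names, and image path are assumptions):

import base64

import weaviate

client = weaviate.Client("http://localhost:8080")

# Encode a local image and look for visually similar objects.
with open("data_sources/example.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

result = (
    client.query
    .get("Image", ["filename"])
    .with_near_image({"image": encoded}, encode=False)
    .with_limit(3)
    .do()
)
print(result)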

References

Just some articles I used

Setting up Docker/OpenSearch

https://opster.com/guides/opensearch/opensearch-basics/spin-up-opensearch-cluster-with-docker/
