jettro/MyDataPipeline

Repository containing my experiments with a data pipeline
Introduction

Welcome to my data pipeline project. This is not a single project; it contains multiple experiments that help me explore new technologies. I'll explain the different applications below. In general, the repository has the following structure.

  • config_files : Contains the files needed to configure components, like indexes in OpenSearch or a schema in Weaviate
  • data_sources : Contains the different data files used in the applications
  • tests : Contains a few unit tests, only used to try out small, specific things
  • infra : Docker files to start up specific components (OpenSearch, for instance)

The other folders are modules:

  • pipeline : Contains the Dagster pipeline that manages the indexes in OpenSearch
  • reranking : Components that can re-rank results
  • search : Contains the files for working with OpenSearch: templates, data, queries, and a tool to parse explain output
  • util : Small utilities that we can re-use in the modules
  • weaviatedb : Files used to interact with Weaviate
  • langchainmod : Files used to interact with LangChain code

Files

  • log_config.py : Configuration for the Python logging framework (see the sketch after this list)
  • requirements.txt : The Python libraries used in the project
  • run_query_pipeline.py : Runner for the Weaviate query pipeline that also applies the re-ranker
  • run_langchain_ro_vac.py : Runner for the LangChain sample that imports and queries multiple vector stores
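
The contents of log_config.py are not reproduced in this README. As an illustration only, a dictConfig-based setup for Python's standard logging module might look roughly like this:

import logging.config

# Hypothetical logging configuration: a single console handler with a readable format.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "default": {"format": "%(asctime)s %(levelname)s %(name)s - %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "default"},
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}


def configure_logging():
    # Apply the configuration for the whole project.
    logging.config.dictConfig(LOGGING_CONFIG)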

Docker issues with OpenSearch

When running OpenSearch under Rancher Desktop, create the following file to raise vm.max_map_count and the open-file limits that OpenSearch needs: /Users/jettrocoenradie/Library/Application Support/rancher-desktop/lima/_config/override.yaml

provision:
- mode: system
  script: |
    #!/bin/sh
    set -o xtrace
    sysctl -w vm.max_map_count=262144
    cat <<'EOF' > /etc/security/limits.d/rancher-desktop.conf
    * soft     nofile         82920
    * hard     nofile         82920
    EOF

You can test with curl; for now we do not verify the certificate:

curl https://localhost:9200 -ku admin:admin
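
If you prefer to check the connection from Python, a minimal sketch using the opensearch-py client could look like this (same admin credentials as the curl call; certificate verification stays disabled):

from opensearchpy import OpenSearch

# Connect to the local OpenSearch container; verification of the self-signed
# certificate is disabled, matching the -k flag in the curl call above.
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=True,
    verify_certs=False,
    ssl_show_warn=False,
)
print(client.info())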

Dagster

I am experimenting with Dagster. Start the local development server with:

dagster dev -f ./pipeline/products_dagster.py

https://docs.dagster.io/getting-started
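
The real pipeline lives in pipeline/products_dagster.py and is not reproduced here. As a rough sketch of the kind of module dagster dev expects, with placeholder asset names:

from dagster import Definitions, asset


@asset
def products():
    # Placeholder: the real asset would load product data, e.g. from data_sources/.
    return [{"id": 1, "name": "example product"}]


@asset
def product_index(products):
    # Placeholder: the real asset would push the products into an OpenSearch index.
    return {"indexed": len(products)}


# dagster dev -f <file> discovers this Definitions object.
defs = Definitions(assets=[products, product_index])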

Weaviate custom QnA module

First build the custom Docker image from the Dockerfile, then start Docker Compose with the locally built image:

docker build -f mdeberta.Dockerfile -t mdeberta-qna-transformers .
docker compose -f docker-compose-weaviate.yml up -d

I am also experimenting with Streamlit to provide a GUI. Run the sample with the following command:

streamlit run run_weaviate_qna.py
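
Behind the Streamlit GUI, a question against the custom qna-transformers module boils down to an ask query. A minimal sketch with the v3 Python client (the port, class name, and property names are assumptions, not necessarily what this repository uses):

import weaviate

# Connect to the local Weaviate started by docker-compose-weaviate.yml.
client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Document", ["content"])
    .with_ask({"question": "What is this document about?"})
    .with_additional("answer { result certainty }")
    .with_limit(1)
    .do()
)
print(result)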

Weaviate CLIP image search

Start Docker Compose for the CLIP model. Everything runs on your local machine:

docker compose -f docker-compose-weaviate-clip.yml up -d

Run the sample with the following command. Make sure Streamlit is installed, which it is if you installed the dependencies from requirements.txt:

streamlit run run_weaviate_clip.py
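
The image search itself is a nearImage query against the CLIP module. A minimal sketch with the v3 Python client (the port, class name, property names, and image path are assumptions):

import base64

import weaviate

client = weaviate.Client("http://localhost:8080")

# Encode a local image and look for visually similar objects.
with open("data_sources/example.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

result = (
    client.query
    .get("Image", ["filename"])
    .with_near_image({"image": encoded}, encode=False)
    .with_limit(3)
    .do()
)
print(result)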

References

Just some articles I used

Setting up Docker/OpenSearch

https://opster.com/guides/opensearch/opensearch-basics/spin-up-opensearch-cluster-with-docker/
