The challenge for this project is to create a data pipeline that ingests a UniProt XML file (data/Q9Y261.xml), which contains information about a protein, and stores as much of that information as possible in a Neo4j graph database.
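As a rough illustration of the ingestion step, the sketch below parses a UniProt XML document with Python's standard library and builds a parameterized Cypher statement. It is not the code in apps/xml-ingest.py. The sample entry, the function names (extract_protein, to_cypher), and the node labels and relationship types (Protein, Gene, Organism, FROM_GENE, IN_ORGANISM) are assumptions made for this example; only the namespace and element names follow the UniProt XML format.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of UniProt XML files.
NS = {"up": "http://uniprot.org/uniprot"}

# Made-up entry for illustration only; it is not the content of Q9Y261.xml.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00000</accession>
    <name>DEMO_HUMAN</name>
    <protein><recommendedName><fullName>Demo protein</fullName></recommendedName></protein>
    <gene><name type="primary">DEMO1</name></gene>
    <organism><name type="scientific">Homo sapiens</name></organism>
  </entry>
</uniprot>"""


def extract_protein(xml_source):
    """Pull a few core fields out of a UniProt XML document (str or bytes)."""
    root = ET.fromstring(xml_source)
    entry = root.find("up:entry", NS)
    if entry is None:
        raise ValueError("no <entry> element found")

    def text(path):
        # Return the text of the first matching element, or None if absent.
        node = entry.find(path, NS)
        return node.text if node is not None else None

    return {
        "accession": text("up:accession"),
        "name": text("up:name"),
        "full_name": text("up:protein/up:recommendedName/up:fullName"),
        "gene": text("up:gene/up:name[@type='primary']"),
        "organism": text("up:organism/up:name[@type='scientific']"),
    }


def to_cypher(protein):
    """Build a parameterized Cypher statement; labels here are hypothetical."""
    query = (
        "MERGE (p:Protein {accession: $accession}) "
        "SET p.name = $name, p.fullName = $full_name "
        "MERGE (g:Gene {name: $gene}) "
        "MERGE (o:Organism {name: $organism}) "
        "MERGE (p)-[:FROM_GENE]->(g) "
        "MERGE (p)-[:IN_ORGANISM]->(o)"
    )
    return query, protein


if __name__ == "__main__":
    query, params = to_cypher(extract_protein(SAMPLE))
    print(query)
    print(params)
```

To try it on the real file, pass the bytes of data/Q9Y261.xml to extract_protein. Note that Cypher's MERGE rejects null property values, so a real loader would need to skip any field that is missing from the entry.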
You need Docker and Docker Compose installed on your system to use this project.
Once both are installed, follow the steps below to get started.
- Clone the repository to your local machine.
- Navigate to the project directory.
- Build the Docker images for Airflow and Spark:
  sh build_images.sh
- Start the Docker containers:
  sh start.sh
- Once the containers are up and running, open http://localhost:8080 in a browser to access the Airflow web UI.
  - Airflow login credentials: username "airflow", password "airflow".
- In the Airflow UI, enable the "challenge_dag" DAG to run the XML data ingestion and validation process.
- Place the XML data files to be ingested and validated in the "data" directory.
- Once the XML data has been processed successfully, open http://localhost:7474/ in a browser to view the graph representation of the data in Neo4j.
  - Neo4j login credentials: username "neo4j", password "password".
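If either web UI does not load, a quick reachability check can help tell whether the containers are still starting. The following is a minimal sketch using only the standard library; the helper names (is_up, report) are hypothetical and not part of this repository, and the URLs are the two given in the steps above.

```python
import urllib.error
import urllib.request

# URLs taken from the setup steps.
SERVICES = {
    "Airflow": "http://localhost:8080",
    "Neo4j": "http://localhost:7474",
}

# Ignore any proxy settings in the environment: these are local services.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_up(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def report(services=SERVICES):
    """Map each service name to whether it is currently reachable."""
    return {name: is_up(url) for name, url in services.items()}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'up' if ok else 'not reachable'}")
```

A True result only means that something is answering HTTP on that port; it does not confirm that the DAG has run or that data has been loaded into Neo4j.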
.
├── README.md
├── apps
│ └── xml-ingest.py
├── assets
│ └── xmlSchema.xml
├── build_images.sh
├── dags
│ └── challenge_dag.py
├── data
│ └── Q9Y261.xml
├── docker
│ ├── airflow
│ │ ├── Dockerfile
│ │ └── requirements.txt
│ ├── docker-compose.yml
│ ├── spark
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ └── start-spark.sh
│ └── spark-docker-compose.yml
└── start.sh
- Apache Spark - Open-source big data processing framework.
- Apache Airflow - Open-source workflow management platform.
- Docker - Containerization platform.
- Neo4j - Graph database platform.