This repository, DuckDB Geospatial Analysis Project, demonstrates how to handle massive datasets efficiently with Python and DuckDB. The project analyzes real-world geospatial datasets containing over 2 billion records, showcasing the power of in-memory SQL analytics.
By following this project, you will:
- Learn to process large-scale datasets with DuckDB’s in-memory database.
- Perform advanced geospatial analysis using SQL.
- Explore how to work with cloud-stored datasets in formats like GeoParquet and FlatGeobuf.
Key features:

- **In-Memory Processing with DuckDB**: Analyze datasets of up to 140 million records in a lightweight, scalable setup.
- **Cloud Integration**: Directly access datasets stored in the cloud, including public data from Google Open Buildings and Microsoft Building Footprints.
- **Geospatial Analysis**: Use geospatial extensions to perform (see the sketch after this list):
  - Intersections and comparisons of datasets.
  - Statistical analysis and insights.
  - Exports to geospatial file formats such as FlatGeobuf.
- **No Database Installation**: Run the entire project in-memory without installing external database systems.
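A minimal sketch of what these features look like in practice, using a throwaway in-memory connection and made-up geometries (the table and column names here are illustrative, not the project's):

```python
import duckdb

con = duckdb.connect()  # in-memory database: no server, no installation
con.sql("INSTALL spatial; LOAD spatial;")

# Two tiny illustrative layers; the real notebook works with millions of footprints.
con.sql("""
    CREATE TABLE buildings AS
    SELECT * FROM (VALUES
        (1, ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')),
        (2, ST_GeomFromText('POLYGON((5 5, 5 6, 6 6, 6 5, 5 5))'))
    ) AS t(id, geom)
""")
con.sql("""
    CREATE TABLE aoi AS
    SELECT ST_GeomFromText('POLYGON((-1 -1, -1 2, 2 2, 2 -1, -1 -1))') AS geom
""")

# Intersection test plus a simple statistic: how many buildings fall inside the AOI.
con.sql("""
    SELECT COUNT(*) AS buildings_in_aoi
    FROM buildings b, aoi a
    WHERE ST_Intersects(b.geom, a.geom)
""").show()
```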
The project uses two public datasets:

- Google Open Buildings V3
- Microsoft Building Footprints

These datasets provide geospatial information for billions of buildings worldwide. For this project, a subset of 140 million records is extracted for analysis.
The repository contains:

- `geospatial_analysis_duckdb.ipynb`: The complete Jupyter Notebook for the project workflow.
- `requirements.txt`: A list of required Python packages.
- `boundaries.geojson`: A GeoJSON file defining the Area of Interest (AOI) for analysis (see the loading sketch below).
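As a quick illustration (not taken from the notebook itself), the AOI file can be read straight into DuckDB with the `spatial` extension's GDAL-backed `ST_Read` table function:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")

# ST_Read uses GDAL under the hood, so GeoJSON is read directly into a table.
con.sql("SELECT * FROM ST_Read('boundaries.geojson') LIMIT 5").show()
```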
To get started:

1. **Clone the repository**

   ```bash
   git clone https://github.com/Anello92/duckdb-project.git
   cd duckdb-project
   ```

2. **Install dependencies** with `pip`:

   ```bash
   pip install -r requirements.txt
   ```

3. **Run the Jupyter Notebook** in your local environment:

   ```bash
   jupyter notebook geospatial_analysis_duckdb.ipynb
   ```

4. **Run in Google Colab (optional)**: If your local machine has limited memory, upload the project files to Google Colab for execution.
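In Colab, the packages from `requirements.txt` are not preinstalled; after uploading the repository files, a first notebook cell along these lines (a sketch, assuming the files sit in the Colab working directory) sets things up:

```python
# Colab setup sketch: install the project's dependencies inside the notebook.
# Assumes requirements.txt was uploaded to the current working directory.
!pip install -r requirements.txt
```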
The notebook workflow (a minimal end-to-end sketch follows this list):

- Step 1: Install and enable the DuckDB extensions `httpfs` (remote file access) and `spatial` (geospatial functions).
- Step 2: Load geospatial datasets directly from the cloud (AWS S3).
- Step 3: Execute SQL queries to analyze geospatial data and generate insights.
- Step 4: Export results in geospatial formats (e.g., FlatGeobuf).
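Below is a minimal end-to-end sketch of Steps 1-4. The S3 path, column names, and output file are placeholders for illustration; the notebook uses the project's actual dataset locations and queries.

```python
import duckdb

con = duckdb.connect()  # in-memory connection, no database server required

# Step 1: install and enable the extensions used throughout the notebook.
con.sql("INSTALL httpfs; LOAD httpfs;")    # remote file access (HTTP/S3)
con.sql("INSTALL spatial; LOAD spatial;")  # geospatial types and functions

# Step 2: load building footprints straight from cloud storage (placeholder URL)
# and the AOI that ships with the repository. GeoParquet stores geometry as WKB.
con.sql("""
    CREATE VIEW buildings AS
    SELECT ST_GeomFromWKB(geometry) AS geom
    FROM read_parquet('s3://example-bucket/buildings/*.parquet')
""")
con.sql("CREATE VIEW aoi AS SELECT geom FROM ST_Read('boundaries.geojson')")

# Step 3: run SQL analysis, e.g. count footprints inside the AOI and sum their area.
con.sql("""
    SELECT COUNT(*)             AS building_count,
           SUM(ST_Area(b.geom)) AS total_area
    FROM buildings b
    JOIN aoi a ON ST_Intersects(b.geom, a.geom)
""").show()

# Step 4: export the filtered footprints to FlatGeobuf via DuckDB's GDAL writer.
con.sql("""
    COPY (
        SELECT b.geom
        FROM buildings b
        JOIN aoi a ON ST_Intersects(b.geom, a.geom)
    ) TO 'buildings_in_aoi.fgb' WITH (FORMAT GDAL, DRIVER 'FlatGeobuf')
""")
```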
Install the following dependencies (listed in `requirements.txt`):

- `duckdb`
- `geopandas`
- `pyarrow`

Installation command:

```bash
pip install -r requirements.txt
```
With this setup you can:

- Process massive geospatial datasets using SQL.
- Achieve high performance with DuckDB's in-memory engine.
- Avoid complex infrastructure setups: everything is self-contained.
Dataset sources:

- Google Open Buildings
- Microsoft Building Footprints