This repository contains our tools & research for running DeepCell segmentation and QuPath measurements on Google Cloud Batch.
Our results show an overall improvement from ~13 hours to ~10 minutes for segmenting & measuring a cell. The starting point was running on a laptop or colo machine, and our work ran on GCP Batch with some cloud-focused enhancements.
The workflow operates on one or more input images, each converted from an image to a NumPy pixel array. Then DeepCell preprocesses the data (denoising & normalization), runs the segmentation prediction, and postprocesses the predictions into a cell mask. Then, we load the image and mask into QuPath to compute quantitative metrics (size, channel intensity, etc.) for further analysis. For an example downstream usage, see SpaFlow (cell clustering & quantification).
Here is the workflow diagram:
You'll need a JSON file available in a cloud bucket, configuring the application environment. Create a file something like this:
{
  "segment_container_image": "$REPOSITORY/benchmarking:latest",
  "quantify_container_image": "$REPOSITORY/qupath-project-initializer:latest",
  "bigquery_benchmarking_table": "$PROJECT.$DATASET.$TABLE",
  "region": "$REGION",
  "networking_interface": {
    "network": "the_network",
    "subnetwork": "the_subnetwork",
    "no_external_ip_address": true
  },
  "service_account": {
    "email": "$SERVICE_ACCOUNT_EMAIL"
  }
}
You'll need to replace the variables with your environment.
- You can use the public Docker Hub containers, or copy them to your own artifact repository.
- For the benchmarking, you need to create a dataset & table in a GCP project; or you can omit it or set it to blank to skip collecting benchmarks. The table must be created with the schema specified in this file.
- Lastly, specify the GCP region where compute resources will be provisioned. This is not the same as storage buckets, but consider making it the same for efficiency & egress cost reduction.
- The networking_interface and service_account sections are optional if you want to use default settings.
For example, using the Docker Hub containers & skipping benchmarking & default networking + service account:
{
  "segment_container_image": "dchaley/deepcell-imaging:latest",
  "quantify_container_image": "dchaley/qupath-project-initializer:latest",
  "region": "us-central1"
}
Upload this file somewhere to GCP storage. We put ours in the root of our working bucket. You'll pass this GS URI as a parameter to the scripts.
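If you would rather generate the file than write it by hand, a minimal sketch along these lines might help. The helper name build_env_config and its arguments are hypothetical and not part of this repository; only the JSON keys are taken from the examples above, and the no_external_ip_address value simply mirrors the first example.

```python
import json


def build_env_config(segment_image, quantify_image, region,
                     benchmarking_table=None, network=None,
                     subnetwork=None, service_account_email=None):
    """Build the environment config dict (hypothetical helper).

    Optional sections are left out when their values are not provided,
    which corresponds to "use the defaults" in the text above.
    """
    config = {
        "segment_container_image": segment_image,
        "quantify_container_image": quantify_image,
        "region": region,
    }
    if benchmarking_table:
        config["bigquery_benchmarking_table"] = benchmarking_table
    if network and subnetwork:
        config["networking_interface"] = {
            "network": network,
            "subnetwork": subnetwork,
            # Assumption: mirrors the example config above.
            "no_external_ip_address": True,
        }
    if service_account_email:
        config["service_account"] = {"email": service_account_email}
    return config


# Docker Hub containers, no benchmarking, default networking + service account.
config = build_env_config(
    "dchaley/deepcell-imaging:latest",
    "dchaley/qupath-project-initializer:latest",
    "us-central1",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Save the printed JSON to a file and copy it to your bucket, for example with gcloud storage cp env-config.json gs://bucket/path/to/env-config.json.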
To run DeepCell on input images then compute QuPath measurements, use the helper scripts/segment-and-measure.py. There are two ways to run this script: (1) running on a QuPath workspace, and (2) running on explicit paths.
- QuPath workspace:

  Many QuPath projects are organized something like this:

    📁 Dataset
    ↳ 📁 OMETIFF
      ↳ 🖼️ SomeTissueSample.ome.tiff
      ↳ 🖼️ AnotherTissueSample.ome.tiff
    ↳ 📁 NPZ_INTERMEDIATE
      ↳ 🔢 SomeTissueSample.npz
      ↳ 🔢 AnotherTissueSample.npz
    ↳ 📁 SEGMASK
      ↳ 🔢 SomeTissueSample_WholeCellMask.tiff
      ↳ 🔢 SomeTissueSample_NucleusMask.tiff
      ↳ 🔢 AnotherTissueSample_WholeCellMask.tiff
      ↳ 🔢 AnotherTissueSample_NucleusMask.tiff
    ↳ 📁 REPORTS
      ↳ 📄 SomeTissueSample_QUANT.tsv
      ↳ 📄 AnotherTissueSample_QUANT.tsv
    ↳ 📁 PROJ
      ↳ 📁 data
        ↳ ...
      ↳ 📄 project.qpproj

  To generate segmentation masks & quantification reports, run the following command:

    scripts/segment-and-measure.py --env_config_uri gs://bucket/path/to/env-config.json workspace gs://bucket/path/to/dataset

  This will enumerate all files in the OMETIFF directory that have matching files in NPZ_INTERMEDIATE, and run DeepCell segmentation to generate the SEGMASK numpy files. Then it will run QuPath measurements to generate the REPORTS files.

  If your folder structure is different (for example OME-TIFF instead of OMETIFF) you can use these parameters to specify the workspace subdirectories: --images_subdir, --npz_subdir, --segmasks_subdir, --project_subdir, --reports_subdir. Put these parameters after the workspace command.
- Explicit paths:

  You can also specify all paths explicitly (the files don't have to be organized in a dataset). To do so, run this command:

    scripts/segment-and-measure.py --env_config_uri gs://bucket/path/to/env-config.json paths --images_path gs://bucket/path/to/ometiffs --numpy_path gs://bucket/path/to/npzs --segmasks_path gs://bucket/path/to/segmasks --project_path gs://bucket/path/to/project --reports_path gs://bucket/path/to/reports
In either case, when you download the QuPath project, you'll need to download the OMETIFF files as well. When you open the project it will prompt you to select the base directory containing the OMETIFFs, and from there should automatically remap the image paths.
You can use the parameter --image_filter to only operate on a subset of the OMETIFFs. For example,
scripts/segment-and-measure.py \
  --env_config_uri gs://.../config.json \
  --image_filter SomeTissue \
  workspace gs://path/to/workspace
This will operate on every file whose name begins with the string SomeTissue. This would match SomeTissueSample, SomeTissueImage, etc. Note that this parameter has to come before the workspace or paths parameter.
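To make the selection rules concrete, here is a rough sketch of how images could be paired with their NPZ files and then narrowed by a name prefix. This is not the script's actual code: the function names are hypothetical, and the .ome.tiff and .npz suffixes are assumed from the example layout above.

```python
from pathlib import PurePosixPath


def stem(name, suffix):
    """Return the base file name without `suffix`, or None if it doesn't match."""
    base = PurePosixPath(name).name
    return base[: -len(suffix)] if base.endswith(suffix) else None


def select_images(image_names, npz_names, image_filter=""):
    """Keep images that have a matching NPZ and whose name starts with the filter."""
    npz_stems = {s for s in (stem(n, ".npz") for n in npz_names) if s}
    selected = []
    for name in image_names:
        s = stem(name, ".ome.tiff")
        if s and s in npz_stems and s.startswith(image_filter):
            selected.append(s)
    return sorted(selected)


images = [
    "OMETIFF/SomeTissueSample.ome.tiff",
    "OMETIFF/AnotherTissueSample.ome.tiff",
    "OMETIFF/SomeTissueImage.ome.tiff",  # no NPZ counterpart, so it is skipped
]
npzs = [
    "NPZ_INTERMEDIATE/SomeTissueSample.npz",
    "NPZ_INTERMEDIATE/AnotherTissueSample.npz",
]

everything = select_images(images, npzs)
filtered = select_images(images, npzs, image_filter="SomeTissue")
print(everything)
print(filtered)
```

An empty filter matches every name, which is why leaving out --image_filter processes the whole workspace in this sketch.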
DeepCell does not process TIFF files. The TIFF channels must be extracted into Numpy arrays first.
DeepCell divides the preprocessed input into 512x512 tiles which it predicts in batches, then recombines into a single image for postprocessing.
This makes the prediction very resource-efficient. Note, however, that pre- and post-processing still operate on the entire image. This is particularly problematic for post-processing, which is very resource-intensive.
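As a rough illustration of the tiling idea, the sketch below splits an image into 512x512 windows and groups them into batches. It is a simplification with hypothetical function names: tiles here do not overlap and edge tiles are simply smaller, whereas DeepCell's own tiling may pad or overlap tiles.

```python
def tile_grid(height, width, tile=512):
    """Yield (row_start, row_stop, col_start, col_stop) windows covering the image."""
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            yield (r, min(r + tile, height), c, min(c + tile, width))


def batches(items, batch_size):
    """Group items into lists of at most `batch_size` elements."""
    items = list(items)
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# A 1200 x 1024 image needs 3 rows x 2 columns of tiles.
tiles = list(tile_grid(1200, 1024))
tile_batches = list(batches(tiles, batch_size=4))
print(len(tiles), [len(b) for b in tile_batches])
```

Only one batch of tiles needs to be in the model at a time, which is why prediction memory stays roughly constant while the full-image steps around it do not.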
The prediction step outputs which pixels are most likely to be the center of their cell. The post-processing step runs image analysis algorithms to create the final cell masks. It operates a bit like a "flood fill" to expand the center out.
This uses the h_maxima grayscale reconstruction algorithm, which is (counterintuitively) far slower than prediction itself for large images.
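For intuition, here is a deliberately naive 1-D sketch of the h-maxima idea built on grayscale reconstruction by dilation. It is not DeepCell's or scikit-image's implementation, and the function names are hypothetical. A marker (the signal lowered by h) is grown under the original signal until nothing changes, and peaks that still stand at least h above the result are kept.

```python
def dilate(values):
    """Grayscale dilation with a 3-wide window: each value becomes its neighborhood max."""
    n = len(values)
    return [max(values[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]


def reconstruct(marker, mask):
    """Grow `marker` under `mask` until stable; return the result and the pass count."""
    current = list(marker)
    passes = 0
    while True:
        grown = [min(d, m) for d, m in zip(dilate(current), mask)]
        passes += 1
        if grown == current:
            return current, passes
        current = grown


def h_maxima_1d(signal, h):
    """Flag positions whose peak rises at least `h` above the reconstruction."""
    marker = [v - h for v in signal]
    reconstructed, passes = reconstruct(marker, signal)
    flags = [s - r >= h for s, r in zip(signal, reconstructed)]
    return flags, passes


signal = [0, 1, 0, 5, 0]  # a shallow bump at index 1, a tall peak at index 3
flags, passes = h_maxima_1d(signal, h=2)
print(flags, passes)
```

In this naive form every pass touches every value, and the number of passes grows with how far values have to spread, which gives a feel for why the step can dominate runtime on large images.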
Once we have cell predictions, we need to generate quantified metrics for the cells: location, size, channel intensities, and so on. This is crucial for downstream processing & analysis, including in a QuPath desktop environment. For example, a researcher might provide an analyzed & packaged QuPath project to a principal investigator for review.
QuPath is distributed as JAR files. Bioinformaticians typically run Groovy scripts in the embedded QuPath environment, but we don't have a desktop or VM environment for that. Instead, we compile Kotlin code against the JARs to run on Google Cloud Batch.
The source code for quantifying the metrics plus building the container is located in a different repository: qupath-project-initializer.
QuPath measurements are computed a cell at a time. The algorithm re-fetches the image region containing the cell for each cell. This is prohibitively expensive for bulk measurement.
Adding code to prefetch the image into memory, then retrieve subregions from memory, provided a dramatic ~99% speed-up.
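The effect of prefetching can be sketched as follows. This is an illustration in Python rather than the project's Kotlin code, and CountingImage is a hypothetical stand-in for a slow image source that counts how many reads it serves.

```python
def crop(pixels, top, left, height, width):
    """Return the sub-rectangle of a 2-D list of pixel values."""
    return [row[left:left + width] for row in pixels[top:top + height]]


def region_mean(region):
    """Mean of all values in a 2-D region."""
    values = [v for row in region for v in row]
    return sum(values) / len(values)


class CountingImage:
    """Hypothetical slow image source; every read_region call counts as one fetch."""

    def __init__(self, pixels):
        self.pixels = pixels
        self.height = len(pixels)
        self.width = len(pixels[0])
        self.reads = 0

    def read_region(self, top, left, height, width):
        self.reads += 1
        return crop(self.pixels, top, left, height, width)


def measure_per_cell(image, cells):
    """Naive approach: fetch the region from the source once per cell."""
    return [region_mean(image.read_region(*box)) for box in cells]


def measure_prefetched(image, cells):
    """Fetch the whole image once, then crop each cell's region from memory."""
    full = image.read_region(0, 0, image.height, image.width)
    return [region_mean(crop(full, *box)) for box in cells]


pixels = [[r * 4 + c for c in range(4)] for r in range(4)]
cells = [(0, 0, 2, 2), (2, 2, 2, 2), (1, 1, 2, 2)]  # (top, left, height, width)

naive_image = CountingImage(pixels)
naive = measure_per_cell(naive_image, cells)

prefetch_image = CountingImage(pixels)
prefetched = measure_prefetched(prefetch_image, cells)

print(naive == prefetched, naive_image.reads, prefetch_image.reads)
```

Both approaches produce the same measurements; the difference is that the number of fetches no longer grows with the number of cells, at the cost of holding the whole image in memory.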
- GOAL: Understand and optimize DeepCell cellular segmentation on GCP at scale.
- KEY LINK #1: our benchmarking process.
- KEY LINK #2: our support/testing notebooks.
- KEY LINK #3: our project board & work areas for this project.
GPU makes a dramatic difference in model inference time.
Memory usage increases linearly with number of pixels.
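A back-of-the-envelope estimate shows what linear scaling means in practice. The sketch below counts the bytes for a single array only; real peak usage also includes intermediate copies, so treat the numbers as a lower bound. The two-channel input is an assumption made for illustration.

```python
def image_bytes(height, width, channels, bytes_per_value):
    """Size of one dense pixel array in bytes."""
    return height * width * channels * bytes_per_value


CHANNELS = 2  # assumption for illustration

for side in (1_000, 10_000, 20_000):
    as_float64 = image_bytes(side, side, CHANNELS, 8)
    as_float32 = image_bytes(side, side, CHANNELS, 4)
    print(f"{side} x {side}: float64 {as_float64 / 1e9:.2f} GB, "
          f"float32 {as_float32 / 1e9:.2f} GB")
```

Doubling the pixel count doubles the array size, and halving the bytes per value halves it.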
Here are some areas we've identified:
- Preprocessing
  - DeepCell converts everything to 64-bit float. That's memory intensive. Do we actually need to?
- Postprocessing
  - h_maxima: need to ship a ~15x speedup optimization
- Cost
  - Run the prediction phase only with GPU infrastructure. Run everything else with CPU-only infrastructure.
This repo uses git-lfs (Git Large File Storage) to keep large files (like sample numpy data) out of the source history. This process is automatic & transparent, but requires git-lfs to be installed beforehand. Please see these instructions.
TLDR:
- on Mac, brew install git-lfs.
- on Linux, sudo [apt-get | yum] install git-lfs.
- on Windows, git-lfs is included in the Git distribution.
Set these repository variables:
- DOCKERHUB_REPOSITORY, e.g. dchaley
  - If you set this, you need to set these further variables:
    - _DOCKERHUB_USERNAME_SECRET_NAME, e.g. dockerhub-username/versions/1
    - _DOCKERHUB_PASSWORD_SECRET_NAME, e.g. dockerhub-password/versions/1
  - And you need the corresponding secrets in the GCP project.
- GCP_ARTIFACT_REPOSITORY, e.g. my-repository
- GCP_PROJECT_ID, e.g. my-gcp-project-4321
- GCP_REGION, e.g. us-central1
You need Python 3.10 or earlier.
The main trick is that some dependencies specify outdated sub-dependencies. In particular, DeepCell specifies TensorFlow 2.8, but we want to pull in 2.17 to get security patches. So we install DeepCell without dependencies, then fill in the dependencies in our main requirements.txt file.
python3.10 -m venv venv
source venv/bin/activate
pip install --no-deps -r requirements-no-deps.txt
pip install -r requirements.txt