Merge branch 'master' into update/docs
vsoch authored Dec 11, 2017
2 parents fe1b521 + a21fd3e commit 78f44ae
Showing 17 changed files with 313 additions and 338 deletions.
108 changes: 7 additions & 101 deletions Dockerfile
@@ -1,113 +1,19 @@
FROM python:3.6
ENV PYTHONUNBUFFERED 1
RUN apt-get update && apt-get install -y cmake \
libpng12-dev libtiff5-dev libxml2-dev libjpeg62-turbo-dev \
zlib1g-dev libwrap0-dev libssl-dev \
libopenblas-dev \
gfortran \
pkg-config \
libxml2-dev \
libxmlsec1-dev \
libhdf5-dev \
libgeos-dev \
build-essential \
openssl \
nginx \
wget \
vim
FROM pydicom/sendit-base

RUN pip install --upgrade setuptools
RUN pip install --upgrade pip
RUN pip install cython
RUN pip install numpy
RUN pip install scikit-learn pandas h5py matplotlib
RUN pip install uwsgi
RUN pip install Django==1.11.2
RUN pip install social-auth-app-django
RUN pip install social-auth-core[saml]
RUN pip install djangorestframework
RUN pip install django-rest-swagger
RUN pip install django-filter
RUN pip install django-taggit
RUN pip install django-form-utils
RUN pip install django-crispy-forms
RUN pip install django-taggit-templatetags
RUN pip install django-dirtyfields
RUN pip install 'dropbox==1.6'
RUN pip install 'django-dbbackup<2.3'
RUN pip install psycopg2
RUN pip install numexpr
RUN pip install shapely
RUN pip install Pillow
RUN pip install requests
RUN pip install requests-oauthlib
RUN pip install python-openid
RUN pip install django-sendfile
RUN pip install django-polymorphic
RUN pip install celery[redis]==3.1.25
RUN pip install django-celery
RUN pip install scikit-learn
RUN pip install django-cleanup
RUN pip install django-chosen
RUN pip install opbeat
RUN pip install 'django-hstore==1.3.5'
RUN pip install django-datatables-view
RUN pip install django-oauth-toolkit
RUN pip install simplejson
RUN pip install django-gravatar2
RUN pip install pygments
RUN pip install django-lockdown
RUN pip install xmltodict
RUN pip install grpcio
#RUN pip install som
RUN pip install django-cors-headers
RUN pip install django-user-agents
RUN pip install django-guardian
RUN pip install pyinotify


# Install pydicom
WORKDIR /tmp
RUN git clone https://github.com/pydicom/pydicom
WORKDIR pydicom
RUN git checkout affb1cf10c6be2aca311c29ddddc622f8bd1f810
RUN python setup.py install

# deid
WORKDIR /tmp
# update deid
WORKDIR /opt
RUN git clone -b development https://github.com/pydicom/deid
WORKDIR /tmp/deid
WORKDIR /opt/deid
RUN python setup.py install

# som
WORKDIR /tmp
RUN git clone https://github.com/vsoch/som
WORKDIR /tmp/som
WORKDIR /opt
RUN git clone -b add/bigquery https://github.com/vsoch/som
WORKDIR /opt/som
RUN python setup.py install


RUN mkdir /code
RUN mkdir -p /var/www/images
RUN mkdir /data
WORKDIR /code
ADD . /code/
RUN /usr/bin/yes | pip uninstall cython
RUN apt-get remove -y gfortran

# Crontab
RUN apt-get update && apt-get install -y gnome-schedule

RUN apt-get autoremove -y
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENV MESSAGELEVEL -1

# This sometimes errors, need to run manually
#RUN pip install -r /code/google-requirements.txt > /dev/null 2>&1
#RUN pip3 install -r /code/google-requirements.txt > /dev/null 2>&1

WORKDIR /code
CMD /code/run_uwsgi.sh

EXPOSE 3031
20 changes: 7 additions & 13 deletions docs/README.md
@@ -1,21 +1,15 @@
# SendIt Documentation

## Overview
The Sendit application is intended to be a modular application that includes the following:
The Sendit application is an on-demand application that works in two stages to anonymize images and then push the anonymized images and metadata to Google Cloud Storage and Google BigQuery, respectively. It works as follows:

- a data folder that is watched for complete DICOM datasets.
- an (optional) pipeline for anonymization, meaning removing/replacing fields in the header and image data.
- (optionally) sending data to storage, meaning an Orthanc server, and/or Google Cloud Storage/Datastore
- the researcher starts the anonymization pipeline with an input of one or more folders
- each folder is added as a "Batch" with status "QUEUE" to indicate they are ready for import
- anonymization is performed (status "PROCESSING"), meaning removing/replacing fields in the header and image data
- when status "DONEPROCESSING" is achieved for all in the queue, the researcher triggers the final job to send data to storage (status "SENT")

Reasonable updates would be:

- to add a DICOM receiver directly to the application using `pynetdicom3`, so instead of listening for datasets on the filesystem, we can receive them directly.
- remove the web interface component and make sendit more of a service.


## Application Flow

- [Application](application.md): If you are a new developer, please read about the application flow and infrastructure first. Sendit is a skeleton that uses other python modules to handle interaction with Stanford and Google APIs, along with anonymization of datasets.
## Preparation
The base of the image is distributed via [sendit-base](scripts/docker/README.md). This base image bundles all of the dependencies, so we can easily bring the application image up and down.

## Deployment

76 changes: 40 additions & 36 deletions docs/application.md
@@ -23,29 +23,53 @@ This application lives in a docker-compose orchestration of images running on `S


## Job Queue
The job queue generally works by processing tasks when the server has available resources. There will likely be 5 workers for a single application deployment. The worker will do the following:

1. First receive a job from the queue to run the [import dicom](import_dicom.md) task when a finished folder is detected by the [watcher](watcher.md)
2. When import is done, hand off to the next task to [anonymize](anonymize.md) images. If the user doesn't want to do this based on [settings](../sendit/settings/config.py), a task is fired off to send to storage. If they do, the request is made to the DASHER endpoint, and the identifiers are saved.
a. In the case of anonymization, the next job will do the data scrubbing with the identifiers, and then trigger sending to storage.
3. Sending to storage can be enabled for either, both, or neither of Orthanc and Google Cloud Storage. If no storage option is chosen, the application works as static storage.
### Step 1: Start Queue
The job queue accepts a manual request to import one or more dicom directories, subfolders under `/data`. We call it a "queue" because it is handled by the worker and redis images, where the worker is a set of threads that can process multiple (~16) batches at once, and redis is the database that manages the queue. The queue can "pile up" and the workers will process tasks when the server has available resources. Thus, to start the pipeline:

**Important note**: for this first round of testing, when we are starting with many pre-existing folders, we are instead using a continuous worker queue with 16 threads (over 16 cores).
1. You should make sure your `DATA_INPUT_FOLDERS` are defined in [sendit/settings/config.py](../sendit/settings/config.py).
2. You should then start the queue, which means performing dicom import, get_identifiers, and replace identifiers (not upload). This means that images go from status "QUEUE" to "DONEPROCESSING".

```
# Start the queue
python manage.py start_queue

# The defaults are max count 1, /data folder
python manage.py start_queue --number 1 --subfolder /data
```

When you call the above, the workers will do the following:

1. Check for any Batch objects with status "QUEUE," meaning they were added and not started yet. If there are none in the QUEUE (the default when you haven't used it yet!) then the function uses the `DATA_INPUT_FOLDERS` to find new "contenders." The contender folders each have a Batch created for them, and the Batch is given status QUEUE. We do this up to the max count provided by the "number" variable in the `start_queue` request above.
2. Up to the max count, the workers then launch the [import dicom](import_dicom.md) task to run async. This function changes the Batch status to "PROCESSING," imports the dicom, extracts header information, prepares/sends/receives a request for [anonymized identifiers](anonymize.md) from DASHER, and then saves a BatchIdentifiers object. The Batch is then given status "DONEPROCESSING".

It is expected that a set of folders (batches) will do these steps first, meaning that there are no Batches with status "QUEUE" and all are "DONEPROCESSING." We do this because we want to upload to storage in large batches to optimize using the client.
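
The following is a minimal sketch of the queue logic described in the two steps above, written in plain Python for illustration. The dict-based `batches` list, the `data_input_folders` argument, and the inlined import step are stand-ins for the real Batch model, the `DATA_INPUT_FOLDERS` setting, and the async import task.

```
import os

def start_queue(batches, data_input_folders, number=1):
    """Illustrative sketch: batches is a list of dicts with 'folder' and 'status'."""
    queued = [b for b in batches if b["status"] == "QUEUE"]
    if not queued:
        # No queued batches yet: create contenders from the input folders
        for base in data_input_folders:
            for name in sorted(os.listdir(base)):
                batches.append({"folder": os.path.join(base, name),
                                "status": "QUEUE"})
        queued = [b for b in batches if b["status"] == "QUEUE"]
    # Launch the import task (async in the real application) for up to `number` batches
    for batch in queued[:number]:
        batch["status"] = "PROCESSING"
        # ... import dicom, extract headers, request identifiers from DASHER ...
        batch["status"] = "DONEPROCESSING"
```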


### Step 2: Upload to Storage
When all Batches have status "DONEPROCESSING" we launch a second request to the application to upload to storage:

```
python manage.py upload_finished
```
This task looks for Batches that are "DONEPROCESSING" and distributes them equally among 10 workers. 10 is not a magic number, but in testing I found it to be a good balance that avoids the odd connection errors that likely come from using network resources from inside a Docker container. Sending to storage means two steps:

1. Upload Images (compressed .tar.gz) to Google Storage, and receive back metadata about bucket locations
2. Send image metadata + storage metadata to BigQuery

If you are more interested in reading about the storage formats, read more about [storage](storage.md).
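
As a sketch of the distribution step, the snippet below chunks finished batches across ten workers and marks the two storage sub-steps as comments. It is illustrative only: the chunking helper and the dict-based batches are plain Python stand-ins for the real task and client calls.

```
NUM_WORKERS = 10

def chunk(items, n):
    """Split items into n roughly equal groups."""
    return [items[i::n] for i in range(n)]

def upload_finished(batches):
    done = [b for b in batches if b["status"] == "DONEPROCESSING"]
    for group in chunk(done, NUM_WORKERS):
        # each group would be handed to one worker in the real application
        for batch in group:
            # 1. upload the compressed .tar.gz of images to Google Storage
            # 2. send image metadata + storage metadata to BigQuery
            batch["status"] = "DONE"
```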

## Status
In order to track status of images, we have status states for images and batches.


```
IMAGE_STATUS = (('NEW', 'The image was just added to the application.'),
                ('PROCESSING', 'The image is currently being processed, and has not been sent.'),
                ('DONEPROCESSING','The image is done processing, but has not been sent.'),
                ('SENT','The image has been sent, and verified received.'),
                ('DONE','The image has been received, and is ready for cleanup.'))

BATCH_STATUS = (('QUEUE', 'The batch is queued and not picked up by worker.'),
                ('NEW', 'The batch was just added to the application.'),
                ('EMPTY', 'After processing, no images passed filtering.'),
                ('PROCESSING', 'The batch is currently being processed.'),
                ('DONE','The batch is done, and images are ready for cleanup.'))
```

@@ -56,23 +80,6 @@ python manage.py export_metrics
sendit-process-time-2017-08-26.tsv
```

### Image Status
Image statuses are updated at each appropriate timepoint, for example:

- All new images by default are given `NEW`
- When an image starts any anonymization, but before any request to send to storage, it will have status `PROCESSING`. This means that if an image is not to be processed, it will immediately be flagged with `DONEPROCESSING`
- As soon as the image is done processing, or if it is intended to go right to storage, it gets status `DONEPROCESSING`.
- After being sent to storage, the image gets status `SENT`, and only when it is ready for cleanup does it get status `DONE`. Note that this means that if a user has no requests to send to storage, the image will remain with the application (and not be deleted).

### Batch Status
A batch status is less granular, but more informative for alerting the user about possible errors.

- All new batches by default are given `NEW`.
- `PROCESSING` is added to a batch as soon as the job to anonymize is triggered.
- `DONEPROCESSING` is added when the batch finishes anonymization, or if it skips this step and is intended to go straight to storage.
- `DONE` is added after all images are sent to storage, and are ready for cleanup.


## Errors
The most likely error would be an inability to read a dicom file, which could happen for any number of reasons. This, and generally any errors that are triggered during the lifecycle of a batch, will flag the batch as having an error. The variable `has_error` is a boolean that belongs to a batch, and a matching JSONField `errors` will hold a list of errors for the user. This error flag will be most relevant during cleanup.
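
A minimal sketch of how this flag might be set, using a dict as a stand-in for the batch model (field names follow the description above; the helper and filename are hypothetical):

```
def flag_error(batch, message):
    """Record a batch-level error without stopping the pipeline."""
    batch["has_error"] = True
    batch.setdefault("errors", []).append(message)

batch = {"folder": "/data/folder1", "status": "PROCESSING", "has_error": False}
flag_error(batch, "could not read dicom file image-0001.dcm")
```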

@@ -82,10 +89,7 @@ For server errors, the application is configured to be set up with Opbeat. @vsoc
## Cleanup
Upon completion, we will want some level of cleanup of both the database, and the corresponding files. It is already the case that the application moves the input files from `/data` into its own media folder (`images`), and cleanup might look like any of the following:

- In the most ideal case, there are no errors, no flags for the batch, and the original data folder was removed by the `dicom_import` task, and the database and media files removed after successful upload to storage. This application is not intended as some kind of archive for data, but a node that filters and passes along.
- In the most ideal case, there are no errors, no flags for the batch, and the database and media files removed after successful upload to storage. Eventually we would want to delete the original files too. This application is not intended as some kind of archive for data, but a node that filters and passes along.
- Given an error to `dicom_import`, a file will be left in the original folder, and the batch `has_error` will be true. In this case, we don't delete files, and we rename the original folder to have extension `.err`

If any further logging is needed (beyond the watcher) we should discuss (see questions below)


Now let's [start the application](start.md)!
40 changes: 6 additions & 34 deletions docs/config.md
@@ -1,6 +1,7 @@
# Configuration
The configuration for the application consists of the files in the [sendit/settings](../sendit/settings) folder. The files that need attention are `secrets.py` and [config.py](../sendit/settings/config.py).


## Application Secrets
First make your secrets.py like this:

@@ -14,6 +15,7 @@ Once you have your `secrets.py`, it needs the following added:
- `SECRET_KEY`: Django will not run without one! You can generate one [here](http://www.miniwebtool.com/django-secret-key-generator/)
- `DEBUG`: Make sure to set this to `False` for production. A minimal example is sketched below.
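
For example, a minimal `secrets.py` might look like this (placeholder values only; generate your own key and never commit the real file):

```
# sendit/settings/secrets.py (example values only)
SECRET_KEY = "replace-me-with-a-generated-50-character-key"
DEBUG = False   # must be False in production
```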


## "anonymization" (Coding)
For [config.py](../sendit/settings/config.py) you should first configure settings for the anonymization process, which is everything that happens after images are imported, but before sending to storage. These steps broadly include:

@@ -37,12 +39,6 @@ ANONYMIZE_PIXELS=False

**Important**: the pixel scrubbing is not yet implemented, so this variable will currently only check the header, alert you about the image, and skip it. Regardless of the setting you choose for the variable `ANONYMIZE_PIXELS`, the header will always be checked. If you have pixel scrubbing turned on (and it's implemented), the images will be scrubbed and included. If you have scrubbing turned on (and it's not implemented), it will just warn you and skip them. The same thing will happen if it's off, just to alert you that flagged images exist.
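
To make the branching above concrete, here is an illustrative sketch (not the sendit source) of how a flagged image is handled, with `scrubbing_implemented` standing in for the not-yet-written pixel scrubbing:

```
def handle_flagged_image(dicom_file, anonymize_pixels, scrubbing_implemented=False):
    """Illustrative: decide what to do with an image flagged for burned-in pixels."""
    if anonymize_pixels and scrubbing_implemented:
        # would scrub the pixel data and keep the image
        return dicom_file
    # scrubbing turned off, or turned on but not yet implemented:
    # warn the user and skip the image
    print("Burned-in pixel data suspected, skipping %s" % dicom_file)
    return None
```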

```
# The default study to use
SOM_STUDY="test"
```
The `SOM_STUDY` is part of the Stanford DASHER API to specify a study, and the default should be set before you start the application. If the study needs to vary between calls, please [post an issue](https://www.github.com/pydicom/sendit) and it can be added to be done at runtime.

Next, you likely want a custom filter applied to whitelist (accept no matter what), greylist (not accept, but in the future know how to clean the data) and blacklist (not accept). Currently, the deid software applies a [default filter](https://github.com/pydicom/deid/blob/development/deid/data/deid.dicom) to filter out images with known burned in pixels. If you want to add a custom file, currently it must live with the repository, and is referenced by the name of the file after the `deid`. You can specify this string in the config file:

```
@@ -72,44 +68,19 @@ Note that the fields for `ENTITY_ID` and `ITEM_ID` are set to the default of [de
## Storage
The next set of variables are specific to [storage](storage.md), which is the final step in the pipeline.

```
# We can turn on/off send to Orthanc. If turned off, the images would just be processed
SEND_TO_ORTHANC=True
# The ipaddress of the Orthanc server to send the finished dicoms (cloud PACS)
ORTHANC_IPADDRESS="127.0.0.1"
# The port of the same machine (by default they map it to 4747)
ORTHAC_PORT=4747
```

Since the Orthanc is a server itself, if we are ever in need of a way to quickly deploy and bring down these instances as needed, we could do that too, and the application would retrieve the ipaddress programmatically.

And I would like to eventually add the following, meaning that we also send datasets to Google Cloud Storage and Datastore, ideally in compressed nifti instead of dicom, and with some subset of fields. These functions are by default turned off.

```
# Should we send to Google at all?
SEND_TO_GOOGLE=False
SEND_TO_GOOGLE=True
# Google Cloud Storage Bucket (must be created)
GOOGLE_CLOUD_STORAGE='radiology'
GOOGLE_STORAGE_COLLECTION=None # define here or in your secrets
GOOGLE_PROJECT_NAME="project-name" # not the id, usually the end of the url in Google Cloud
```

Note that the storage collection is set to None, and this should be the id of the study (e.g., the IRB). If this is set to None, it will not upload. Finally, to add a special header to signify a Google Storage project, you should add the name of the intended project to your header:

```
GOOGLE_PROJECT_ID_HEADER="12345"
# Will produce this key/value header
x-goog-project-id: 12345
```

**Note**: we aren't currently using this header and it works fine.

Note that this approach isn't suited for having more than one study - when that is the case, the study will likely be registered with the batch. Importantly, for the above, there must be a `GOOGLE_APPLICATION_CREDENTIALS` filepath exported in the environment, or it should be run on a Google Cloud Instance (unlikely).
Note that the storage collection is set to None, and this should be the id of the study (e.g., the IRB). For Google Storage, this collection corresponds with a Bucket. For BigQuery, it corresponds with a database (and a table of dicom). If this is set to None, it will not upload. Also note that we derive the study name to use with Dasher from this bucket. It's simply the lowercase version of it. This means that a `GOOGLE_STORAGE_COLLECTION` of `IRB12345` maps to a study name `irb12345`.
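
In other words (an illustrative sketch; the variable name used for the derived study is not prescribed here):

```
GOOGLE_STORAGE_COLLECTION = "IRB12345"
study_name = GOOGLE_STORAGE_COLLECTION.lower()   # -> "irb12345", used with DASHER
```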

Note that this approach isn't suited for having more than one study - when that is the case, the study will likely be registered with the batch. Importantly, for the above, there must be a `GOOGLE_APPLICATION_CREDENTIALS` filepath exported in the environment, or it should be run on a Google Cloud Instance (unlikely in the near future).

## Authentication
If you look in [sendit/settings/auth.py](../sendit/settings/auth.py) you will see something called `lockdown` and that it is turned on:
@@ -155,6 +126,7 @@ LOCKDOWN_PASSWORDS = ('mysecretpassword',)
Note that here we will need to add notes about securing the server (https), etc. For now, I'll just mention that it will come down to changing the [nginx.conf](../nginx.conf) and [docker-compose.yml](../docker-compose.yml) to those provided in the folder [https](../https).



### Reading Input
You need to specify either a `DATA_SUBFOLDER` (assumed within `DATA_BASE` for the application), which suits a streaming application, or a list of `DATA_INPUT_FOLDERS` instead.
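
A sketch of the two options in [config.py](../sendit/settings/config.py) (paths and values are illustrative):

```
# Streaming setup: watch a single subfolder under DATA_BASE
DATA_SUBFOLDER = "incoming"

# Batch setup: import on demand from a list of folders
DATA_INPUT_FOLDERS = ["/data/batch1", "/data/batch2"]
```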

Binary file added docs/img/bigquery.png
