Merge branch 'master' into update/docs
vsoch authored Dec 11, 2017
2 parents fe1b521 + a21fd3e commit 78f44ae
Showing 17 changed files with 313 additions and 338 deletions.
108 changes: 7 additions & 101 deletions Dockerfile
@@ -1,113 +1,19 @@
FROM python:3.6
ENV PYTHONUNBUFFERED 1
RUN apt-get update && apt-get install -y cmake \
libpng12-dev libtiff5-dev libxml2-dev libjpeg62-turbo-dev \
zlib1g-dev libwrap0-dev libssl-dev \
libopenblas-dev \
gfortran \
pkg-config \
libxml2-dev \
libxmlsec1-dev \
libhdf5-dev \
libgeos-dev \
build-essential \
openssl \
nginx \
wget \
vim
FROM pydicom/sendit-base

RUN pip install --upgrade setuptools
RUN pip install --upgrade pip
RUN pip install cython
RUN pip install numpy
RUN pip install scikit-learn pandas h5py matplotlib
RUN pip install uwsgi
RUN pip install Django==1.11.2
RUN pip install social-auth-app-django
RUN pip install social-auth-core[saml]
RUN pip install djangorestframework
RUN pip install django-rest-swagger
RUN pip install django-filter
RUN pip install django-taggit
RUN pip install django-form-utils
RUN pip install django-crispy-forms
RUN pip install django-taggit-templatetags
RUN pip install django-dirtyfields
RUN pip install 'dropbox==1.6'
RUN pip install 'django-dbbackup<2.3'
RUN pip install psycopg2
RUN pip install numexpr
RUN pip install shapely
RUN pip install Pillow
RUN pip install requests
RUN pip install requests-oauthlib
RUN pip install python-openid
RUN pip install django-sendfile
RUN pip install django-polymorphic
RUN pip install celery[redis]==3.1.25
RUN pip install django-celery
RUN pip install scikit-learn
RUN pip install django-cleanup
RUN pip install django-chosen
RUN pip install opbeat
RUN pip install 'django-hstore==1.3.5'
RUN pip install django-datatables-view
RUN pip install django-oauth-toolkit
RUN pip install simplejson
RUN pip install django-gravatar2
RUN pip install pygments
RUN pip install django-lockdown
RUN pip install xmltodict
RUN pip install grpcio
#RUN pip install som
RUN pip install django-cors-headers
RUN pip install django-user-agents
RUN pip install django-guardian
RUN pip install pyinotify


# Install pydicom
WORKDIR /tmp
RUN git clone https://github.com/pydicom/pydicom
WORKDIR pydicom
RUN git checkout affb1cf10c6be2aca311c29ddddc622f8bd1f810
RUN python setup.py install

# deid
WORKDIR /tmp
# update deid
WORKDIR /opt
RUN git clone -b development https://github.com/pydicom/deid
WORKDIR /tmp/deid
WORKDIR /opt/deid
RUN python setup.py install

# som
WORKDIR /tmp
RUN git clone https://github.com/vsoch/som
WORKDIR /tmp/som
WORKDIR /opt
RUN git clone -b add/bigquery https://github.com/vsoch/som
WORKDIR /opt/som
RUN python setup.py install


RUN mkdir /code
RUN mkdir -p /var/www/images
RUN mkdir /data
WORKDIR /code
ADD . /code/
RUN /usr/bin/yes | pip uninstall cython
RUN apt-get remove -y gfortran

# Crontab
RUN apt-get update && apt-get install -y gnome-schedule

RUN apt-get autoremove -y
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENV MESSAGELEVEL -1

# This sometimes errors, need to run manually
#RUN pip install -r /code/google-requirements.txt > /dev/null 2>&1
#RUN pip3 install -r /code/google-requirements.txt > /dev/null 2>&1

WORKDIR /code
CMD /code/run_uwsgi.sh

EXPOSE 3031
20 changes: 7 additions & 13 deletions docs/README.md
@@ -1,21 +1,15 @@
# SendIt Documentation

## Overview
The Sendit application is intended to be a modular application that includes the following:
The Sendit application is an on-demand application that works in two stages to anonymize images and then push the anonymized images and metadata to Google Cloud Storage and Google BigQuery, respectively. It works as follows:

- a data folder that is watched for complete DICOM datasets.
- an (optional) pipeline for anonymization, meaning removing/replacing fields in the header and image data.
- (optionally) sending data to storage, meaning an Orthanc server, and/or Google Cloud Storage/Datastore
- the researcher starts the anonymization pipeline with an input of one or more folders
- each folder is added as a "Batch" with status "QUEUE" to indicate they are ready for import
- anonymization is performed (status "PROCESSING"), meaning removing/replacing fields in the header and image data
- when status "DONEPROCESSING" is achieved for all in the queue, the researcher triggers the final job to send data to storage (status "SENT")

Reasonable updates would be:

- to add a DICOM receiver directly to the application using `pynetdicom3`, so instead of listening for datasets on the filesystem, we can receive them directly.
- remove the web interface component and make sendit more of a service.


## Application Flow

- [Application](application.md): If you are a new developer, please read about the application flow and infrastructure first. Sendit is a skeleton that uses other python modules to handle interaction with Stanford and Google APIs, along with anonymization of datasets.
## Preparation
The base of the image is distributed via [sendit-base](scripts/docker/README.md). This base image bundles all of the dependencies, so we can easily bring the application image up and down.

## Deployment

76 changes: 40 additions & 36 deletions docs/application.md
@@ -23,29 +23,53 @@ This application lives in a docker-compose orchestration of images running on `S


## Job Queue
The job queue generally works by processing tasks when the server has available resources. There will likely be 5 workers for a single application deployment. The worker will do the following:

1. First receive a job from the queue to run the [import dicom](import_dicom.md) task when a finished folder is detected by the [watcher](watcher.md)
2. When import is done, hand off to the next task to [anonymize](anonymize.md) images. If the user doesn't want to do this based on [settings](../sendit/settings/config.py), a task is fired off to send to storage. If they do, the request is made to the DASHER endpoint, and the identifiers are saved.
a. In the case of anonymization, the next job will do the data scrubbing with the identifiers, and then trigger sending to storage.
3. Sending to storage can be enabled for either, both, or neither of Orthanc and Google Cloud Storage. If no storage option is chosen, the application works as static storage.
### Step 1: Start Queue
The job queue accepts a manual request to import one or more dicom directories, subfolders under `/data`. We call it a "queue" because it is handled by the worker and redis images, where the worker is a set of threads that can process multiple (~16) batches at once, and redis is the database that manages the queue. The queue can "pile up" and the workers will process tasks when the server has available resources. Thus, to start the pipeline:

**Important note**: for this first round of testing, when we are starting with many pre-existing folders, we are instead using a continuous worker queue with 16 threads (over 16 cores).
1. You should make sure your `DATA_INPUT_FOLDERS` are defined in [sendit/settings/config.py](../sendit/settings/config.py).
2. You should then start the queue, which means performing dicom import, get_identifiers, and replace identifiers (not upload). This means that images go from status "QUEUE" to "DONEPROCESSING".

```
# Start the queue
python manage.py start_queue

# The defaults are max count 1, /data folder
python manage.py start_queue --number 1 --subfolder /data
```

When you call the above, the workers will do the following:

1. Check for any Batch objects with status "QUEUE," meaning they were added and not started yet. If there are none in the QUEUE (the default when you haven't used it yet!) then the function uses the `DATA_INPUT_FOLDERS` to find new "contenders." The contender folders each have a Batch created for them, and the Batch is given status QUEUE. We do this up to the max count provided by the "number" variable in the `start_queue` request above.
2. Up to the max count, the workers then launch the [import dicom](import_dicom.md) task to run async. This function changes the Batch status to "PROCESSING," imports the dicom, extracts header information, prepares/sends/receives a request for [anonymized identifiers](anonymize.md) from DASHER, and then saves a BatchIdentifiers object. The Batch is then given status "DONEPROCESSING".

It is expected that a set of folders (batches) will do these steps first, meaning that there are no Batches with status "QUEUE" and all are "DONEPROCESSING." We do this because we want to upload to storage in large batches to optimize using the client.
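
The following is a minimal sketch of the queue logic described in the two steps above, written in plain Python for illustration. The dict-based `batches` list, the `data_input_folders` argument, and the inlined import step are stand-ins for the real Batch model, the `DATA_INPUT_FOLDERS` setting, and the async import task.

```
import os

def start_queue(batches, data_input_folders, number=1):
    """Illustrative sketch: batches is a list of dicts with 'folder' and 'status'."""
    queued = [b for b in batches if b["status"] == "QUEUE"]
    if not queued:
        # No queued batches yet: create contenders from the input folders
        for base in data_input_folders:
            for name in sorted(os.listdir(base)):
                batches.append({"folder": os.path.join(base, name),
                                "status": "QUEUE"})
        queued = [b for b in batches if b["status"] == "QUEUE"]
    # Launch the import task (async in the real application) for up to `number` batches
    for batch in queued[:number]:
        batch["status"] = "PROCESSING"
        # ... import dicom, extract headers, request identifiers from DASHER ...
        batch["status"] = "DONEPROCESSING"
```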


### Step 2: Upload to Storage
When all Batches have status "DONEPROCESSING" we launch a second request to the application to upload to storage:

```
python manage.py upload_finished
```
This task looks for Batches that are "DONEPROCESSING" and distributes them equally among 10 workers. 10 is not a magic number, but in testing I found it to be a good balance that avoids the odd connection errors that likely come from using network resources from inside a Docker container. Sending to storage means two steps:

1. Upload Images (compressed .tar.gz) to Google Storage, and receive back metadata about bucket locations
2. Send image metadata + storage metadata to BigQuery

If you are more interested in reading about the storage formats, read more about [storage](storage.md).
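
As a sketch of the distribution step, the snippet below chunks finished batches across ten workers and marks the two storage sub-steps as comments. It is illustrative only: the chunking helper and the dict-based batches are plain Python stand-ins for the real task and client calls.

```
NUM_WORKERS = 10

def chunk(items, n):
    """Split items into n roughly equal groups."""
    return [items[i::n] for i in range(n)]

def upload_finished(batches):
    done = [b for b in batches if b["status"] == "DONEPROCESSING"]
    for group in chunk(done, NUM_WORKERS):
        # each group would be handed to one worker in the real application
        for batch in group:
            # 1. upload the compressed .tar.gz of images to Google Storage
            # 2. send image metadata + storage metadata to BigQuery
            batch["status"] = "DONE"
```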

## Status
In order to track status of images, we have status states for images and batches.


```
IMAGE_STATUS = (('NEW', 'The image was just added to the application.'),
                ('PROCESSING', 'The image is currently being processed, and has not been sent.'),
                ('DONEPROCESSING','The image is done processing, but has not been sent.'),
                ('SENT','The image has been sent, and verified received.'),
                ('DONE','The image has been received, and is ready for cleanup.'))

BATCH_STATUS = (('QUEUE', 'The batch is queued and not picked up by worker.'),
                ('NEW', 'The batch was just added to the application.'),
                ('EMPTY', 'After processing, no images passed filtering.'),
                ('PROCESSING', 'The batch is currently being processed.'),
                ('DONE','The batch is done, and images are ready for cleanup.'))
```

@@ -56,23 +80,6 @@ python manage.py export_metrics
sendit-process-time-2017-08-26.tsv
```

### Image Status
Image statuses are updated at each appropriate timepoint, for example:

- All new images by default are given `NEW`
- When an image starts any anonymization, but before any request to send to storage, it will have status `PROCESSING`. This means that if an image is not to be processed, it will immediately be flagged with `DONEPROCESSING`
- As soon as the image is done processing, or if it is intended to go right to storage, it gets status `DONEPROCESSING`.
- After being sent to storage, the image gets status `SENT`, and only when it is ready for cleanup does it get status `DONE`. Note that this means that if a user has no requests to send to storage, the image will remain with the application (and not be deleted).

### Batch Status
A batch status is less granular, but more informative for alerting the user about possible errors.

- All new batches by default are given `NEW`.
- `PROCESSING` is added to a batch as soon as the job to anonymize is triggered.
- `DONEPROCESSING` is added when the batch finishes anonymization, or if it skips this step and is intended to go straight to storage.
- `DONE` is added after all images are sent to storage, and are ready for cleanup.


## Errors
The most likely error would be an inability to read a dicom file, which could happen for any number of reasons. This, and generally any errors that are triggered during the lifecycle of a batch, will flag the batch as having an error. The variable `has_error` is a boolean that belongs to a batch, and a matching JSONField `errors` will hold a list of errors for the user. This error flag will be most relevant during cleanup.
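
A minimal sketch of how this flag might be set, using a dict as a stand-in for the batch model (field names follow the description above; the helper and filename are hypothetical):

```
def flag_error(batch, message):
    """Record a batch-level error without stopping the pipeline."""
    batch["has_error"] = True
    batch.setdefault("errors", []).append(message)

batch = {"folder": "/data/folder1", "status": "PROCESSING", "has_error": False}
flag_error(batch, "could not read dicom file image-0001.dcm")
```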

@@ -82,10 +89,7 @@ For server errors, the application is configured to be set up with Opbeat. @vsoc
## Cleanup
Upon completion, we will want some level of cleanup of both the database, and the corresponding files. It is already the case that the application moves the input files from `/data` into its own media folder (`images`), and cleanup might look like any of the following:

- In the most ideal case, there are no errors, no flags for the batch, and the original data folder was removed by the `dicom_import` task, and the database and media files removed after successful upload to storage. This application is not intended as some kind of archive for data, but a node that filters and passes along.
- In the most ideal case, there are no errors, no flags for the batch, and the database and media files removed after successful upload to storage. Eventually we would want to delete the original files too. This application is not intended as some kind of archive for data, but a node that filters and passes along.
- Given an error to `dicom_import`, a file will be left in the original folder, and the batch `has_error` will be true. In this case, we don't delete files, and we rename the original folder to have extension `.err`

If any further logging is needed (beyond the watcher) we should discuss (see questions below)


Now let's [start the application](start.md)!
40 changes: 6 additions & 34 deletions docs/config.md
@@ -1,6 +1,7 @@
# Configuration
The configuration for the application consists of the files in the [sendit/settings](../sendit/settings) folder. The files that need attention are `secrets.py` and [config.py](../sendit/settings/config.py).


## Application Secrets
First make your secrets.py like this:

@@ -14,6 +15,7 @@ Once you have your `secrets.py`, it needs the following added:
- `SECRET_KEY`: Django will not run without one! You can generate one [here](http://www.miniwebtool.com/django-secret-key-generator/)
- `DEBUG`: Make sure to set this to `False` for production. A minimal example is sketched below.
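
For example, a minimal `secrets.py` might look like this (placeholder values only; generate your own key and never commit the real file):

```
# sendit/settings/secrets.py (example values only)
SECRET_KEY = "replace-me-with-a-generated-50-character-key"
DEBUG = False   # must be False in production
```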


## "anonymization" (Coding)
For [config.py](../sendit/settings/config.py) you should first configure settings for the anonymization process, which is everything that happens after images are imported, but before sending to storage. These steps broadly include:

@@ -37,12 +39,6 @@ ANONYMIZE_PIXELS=False

**Important**: the pixel scrubbing is not yet implemented, so this variable will currently only check the header, alert you about the image, and skip it. Regardless of the setting you choose for the variable `ANONYMIZE_PIXELS`, the header will always be checked. If you have pixel scrubbing turned on (and it's implemented), the images will be scrubbed and included. If you have scrubbing turned on (and it's not implemented), it will just warn you and skip them. The same thing will happen if it's off, just to alert you that flagged images exist.
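
To make the branching above concrete, here is an illustrative sketch (not the sendit source) of how a flagged image is handled, with `scrubbing_implemented` standing in for the not-yet-written pixel scrubbing:

```
def handle_flagged_image(dicom_file, anonymize_pixels, scrubbing_implemented=False):
    """Illustrative: decide what to do with an image flagged for burned-in pixels."""
    if anonymize_pixels and scrubbing_implemented:
        # would scrub the pixel data and keep the image
        return dicom_file
    # scrubbing turned off, or turned on but not yet implemented:
    # warn the user and skip the image
    print("Burned-in pixel data suspected, skipping %s" % dicom_file)
    return None
```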

```
# The default study to use
SOM_STUDY="test"
```
The `SOM_STUDY` is part of the Stanford DASHER API to specify a study, and the default should be set before you start the application. If the study needs to vary between calls, please [post an issue](https://www.github.com/pydicom/sendit) and it can be added to be done at runtime.

Next, you likely want a custom filter applied to whitelist (accept no matter what), greylist (not accept, but in the future know how to clean the data) and blacklist (not accept). Currently, the deid software applies a [default filter](https://github.com/pydicom/deid/blob/development/deid/data/deid.dicom) to filter out images with known burned in pixels. If you want to add a custom file, currently it must live with the repository, and is referenced by the name of the file after the `deid`. You can specify this string in the config file:

```
@@ -72,44 +68,19 @@ Note that the fields for `ENTITY_ID` and `ITEM_ID` are set to the default of [de
## Storage
The next set of variables are specific to [storage](storage.md), which is the final step in the pipeline.

```
# We can turn on/off send to Orthanc. If turned off, the images would just be processed
SEND_TO_ORTHANC=True
# The ipaddress of the Orthanc server to send the finished dicoms (cloud PACS)
ORTHANC_IPADDRESS="127.0.0.1"
# The port of the same machine (by default they map it to 4747)
ORTHAC_PORT=4747
```

Since the Orthanc is a server itself, if we are ever in need of a way to quickly deploy and bring down these instances as needed, we could do that too, and the application would retrieve the ipaddress programmatically.

And I would like to eventually add the following, meaning that we also send datasets to Google Cloud Storage and Datastore, ideally in compressed nifti instead of dicom, and with some subset of fields. These functions are by default turned off.

```
# Should we send to Google at all?
SEND_TO_GOOGLE=False
SEND_TO_GOOGLE=True
# Google Cloud Storage Bucket (must be created)
GOOGLE_CLOUD_STORAGE='radiology'
GOOGLE_STORAGE_COLLECTION=None # define here or in your secrets
GOOGLE_PROJECT_NAME="project-name" # not the id, usually the end of the url in Google Cloud
```

Note that the storage collection is set to None, and this should be the id of the study (e.g., the IRB). If this is set to None, it will not upload. Finally, to add a special header to signify a Google Storage project, you should add the name of the intended project to your header:

```
GOOGLE_PROJECT_ID_HEADER="12345"
# Will produce this key/value header
x-goog-project-id: 12345
```

**Note**: we aren't currently using this header and it works fine.

Note that this approach isn't suited for having more than one study - when that is the case, the study will likely be registered with the batch. Importantly, for the above, there must be a `GOOGLE_APPLICATION_CREDENTIALS` filepath exported in the environment, or it should be run on a Google Cloud Instance (unlikely).
Note that the storage collection is set to None, and this should be the id of the study (e.g., the IRB). For Google Storage, this collection corresponds with a Bucket. For BigQuery, it corresponds with a database (and a table of dicom). If this is set to None, it will not upload. Also note that we derive the study name to use with Dasher from this bucket. It's simply the lowercase version of it. This means that a `GOOGLE_STORAGE_COLLECTION` of `IRB12345` maps to a study name `irb12345`.
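
In other words (an illustrative sketch; the variable name used for the derived study is not prescribed here):

```
GOOGLE_STORAGE_COLLECTION = "IRB12345"
study_name = GOOGLE_STORAGE_COLLECTION.lower()   # -> "irb12345", used with DASHER
```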

Note that this approach isn't suited for having more than one study - when that is the case, the study will likely be registered with the batch. Importantly, for the above, there must be a `GOOGLE_APPLICATION_CREDENTIALS` filepath exported in the environment, or it should be run on a Google Cloud Instance (unlikely in the near future).

## Authentication
If you look in [sendit/settings/auth.py](../sendit/settings/auth.py) you will see something called `lockdown` and that it is turned on:
@@ -155,6 +126,7 @@ LOCKDOWN_PASSWORDS = ('mysecretpassword',)
Note that here we will need to add notes about securing the server (https), etc. For now, I'll just mention that it will come down to changing the [nginx.conf](../nginx.conf) and [docker-compose.yml](../docker-compose.yml) to those provided in the folder [https](../https).



### Reading Input
You need to specify either a `DATA_SUBFOLDER` (assumed within `DATA_BASE` for the application), which suits a streaming application, or a list of `DATA_INPUT_FOLDERS` instead.
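
A sketch of the two options in [config.py](../sendit/settings/config.py) (paths and values are illustrative):

```
# Streaming setup: watch a single subfolder under DATA_BASE
DATA_SUBFOLDER = "incoming"

# Batch setup: import on demand from a list of folders
DATA_INPUT_FOLDERS = ["/data/batch1", "/data/batch2"]
```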

Binary file added docs/img/bigquery.png
