This repository contains the source code for the Semantic Knowledge Extractor Tool (SKET).
SKET is an unsupervised hybrid knowledge extraction system that combines a rule-based expert system with pre-trained machine learning models to extract cancer-related information from pathology reports.
CAVEAT: the package has been tested using Python 3.7 and 3.8 on unix-based systems and win64 systems. There are no guarantees that it works with different configurations.
Clone this repository:
git clone https://github.com/ExaNLP/sket.git
Install all the requirements:
pip install -r requirements.txt
Then install any core model from scispacy v0.3.0 (default is en_core_sci_sm):
pip install </path/to/download>
The required scispacy models are available at: https://github.com/allenai/scispacy/tree/v0.3.0
Users can go into the datasets folder and place their datasets within the corresponding use case folders. Use cases are: Colon Cancer (colon), Cervix Uterine Cancer (cervix), and Lung Cancer (lung).
Datasets can be provided in two formats:
Users can provide .xls or .xlsx files whose first row contains the column headers (i.e., fields) and whose remaining rows contain the data.
Users can provide .json files structured in two ways:
As a dict containing a reports field consisting of multiple key-value reports:
{'reports': [{k: v, ...}, ...]}
As a dict containing a single key-value report:
{k: v, ...}
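As a minimal sketch of the two JSON layouts, the following snippet writes one file per layout. Only the top-level reports key comes from the description above; the report fields, their values, and the file names are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical report: field names and values are illustrative only.
report = {"id": "r1", "age": 64, "gender": "F", "diagnosis": "adenocarcinoma del colon"}


def write_datasets(folder):
    """Write a multi-report and a single-report JSON dataset into `folder`."""
    multi_path = os.path.join(folder, "multi_reports.json")
    single_path = os.path.join(folder, "single_report.json")
    # Layout 1: a dict whose "reports" field holds a list of key-value reports.
    with open(multi_path, "w", encoding="utf-8") as f:
        json.dump({"reports": [report, dict(report, id="r2")]}, f, ensure_ascii=False)
    # Layout 2: a dict holding a single key-value report.
    with open(single_path, "w", encoding="utf-8") as f:
        json.dump(report, f, ensure_ascii=False)
    return multi_path, single_path


multi_path, single_path = write_datasets(tempfile.mkdtemp())
```

The resulting files would then be copied into the folder of the matching use case (for example, the colon folder inside datasets).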
SKET concatenates data from all the fields before translation. Users can alter this behavior by filling ./sket/rep_proc/rules/report_fields.txt with target fields, one per line. Users can also provide a custom file to SKET, as long as it contains one field per line (more on this below).
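To make the field selection concrete, here is a small sketch of the behavior described above. It is not SKET's own code: the function names, the space separator, and the sample field names are assumptions made for illustration.

```python
def load_target_fields(path):
    """Read target field names from a text file, one per line, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def concat_report_text(report, target_fields=None):
    """Join field values into one text: all fields, or only the target ones."""
    # With no target fields, fall back to every field in the report.
    fields = target_fields if target_fields else list(report.keys())
    values = [str(report[k]) for k in fields if report.get(k) is not None]
    return " ".join(values)  # assumption: values are joined with a single space


# Hypothetical Italian report with two free-text fields.
report = {"materiali": "biopsia colon", "diagnosi": "adenoma tubulare"}
all_text = concat_report_text(report)
only_diagnosis = concat_report_text(report, ["diagnosi"])
```

With an empty fields file every field is concatenated, while a file listing only diagnosi would restrict the text to that field.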
Users can provide special headers that are treated differently from regular text by SKET. These fields are:
id: when specified, the id field is used to identify the corresponding report. Otherwise, uuid is used.
gender: when specified, the gender field is used to provide patient's information within RDF graphs. Otherwise, gender is set to None.
age: when specified, the age field is used to provide patient's information within RDF graphs. Otherwise, age is set to None.
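The fallback rules for the special headers can be sketched as follows. The function name is hypothetical, and the use of a random UUID4 string is an assumption: the text only states that a uuid is used when id is missing.

```python
import uuid


def extract_special_fields(report):
    """Return (report_id, gender, age) following the fallback rules above."""
    report_id = report.get("id")
    if report_id is None:
        # Assumption: a random UUID4 string stands in for the missing id.
        report_id = str(uuid.uuid4())
    # Missing gender and age fall back to None.
    return report_id, report.get("gender"), report.get("age")


with_headers = extract_special_fields({"id": "r1", "gender": "F", "age": 64})
without_headers = extract_special_fields({"diagnosi": "adenoma tubulare"})
```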
Users can compute dataset statistics to understand the distribution of concepts extracted by SKET for each use case. For instance, if a user wants to compute statistics for Colon Cancer, they can run
python compute_stats.py --outputs ./outputs/concepts/refined/colon/*.json --use_case colon
SKET can be deployed with different pretrained models, i.e., fastText and BERT. In our experiments, we employed the BioWordVec fastText model and the Bio + Clinical BERT model.
BioWordVec can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin
The Bio + Clinical BERT model can be automatically downloaded at run time by setting the biobert SKET parameter equal to 'emilyalsentzer/Bio_ClinicalBERT'.
Users can pass different pretrained models depending on their preferences.
Users can deploy SKET using run_med_sket.py. We release within ./examples three sample datasets that can be used as toy examples to play with SKET. SKET can be deployed with different configurations and using different combinations of matching models.
Furthermore, SKET exposes a threshold parameter that controls the strictness of the entity linking component. The higher the threshold, the more precise the model, at the expense of recall, and vice versa. Users can tune this parameter to obtain the desired trade-off between precision and recall. Note that threshold must always be lower than or equal to the number of considered matching models. Otherwise, the entity linking component does not return any concept.
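One way to read the constraint on threshold is sketched below. It assumes that each matching model returns a similarity score between 0 and 1 and that the scores are summed and compared with the threshold. This is an assumption about the mechanism, not a description of SKET's internals; the function name and the scores are hypothetical.

```python
def link_concept(scores, threshold):
    """Accept a candidate concept when the summed similarity reaches the threshold."""
    # Each score is assumed to lie in [0, 1], so the sum can never exceed
    # the number of matching models in use.
    return sum(scores.values()) >= threshold


# Hypothetical scores from two matching models for one candidate concept.
scores = {"biow2v": 0.93, "str_match": 0.88}

lenient = link_concept(scores, 1.5)       # accepted: 0.93 + 0.88 is above 1.5
strict = link_concept(scores, 2.0)        # rejected: only a perfect match reaches 2.0
unreachable = link_concept(scores, 2.5)   # rejected: two models can never reach 2.5
```

Under this reading, raising the threshold discards weaker matches (higher precision, lower recall), and a threshold above the number of models can never be met, which is consistent with the note above.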
The available matching models, in form of SKET parameters, are:
biow2v: the scispacy pretrained word embeddings. Set this parameter to True to use them.
biofast: the fastText model. Set this parameter to /path/to/fastText/file to use fastText.
biobert: the BERT model. Set this parameter to bert-name to use BERT (see https://huggingface.co/transformers/pretrained_models.html for model IDs).
str_match: the Gestalt Pattern Matching (GPM) model. Set this parameter to True to use GPM.
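For readers unfamiliar with Gestalt Pattern Matching, the sketch below scores string similarity with Python's standard difflib.SequenceMatcher, whose algorithm is closely related to the Ratcliff/Obershelp gestalt approach. Whether SKET relies on difflib internally is an assumption; the candidate labels are made up.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(mention, candidates):
    """Pick the candidate label most similar to the mention."""
    return max(candidates, key=lambda c: gestalt_similarity(mention, c))


# Hypothetical candidate labels for a mention found in a report.
candidates = ["colon adenocarcinoma", "tubular adenoma", "hyperplastic polyp"]
match = best_match("adenocarcinoma of the colon", candidates)
```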
When using BERT, users can also set the gpu parameter to the corresponding GPU number to speed up SKET execution.
For instance, a user can run the following script to obtain concepts, labels, and RDF graphs on the test.xlsx sample dataset:
python run_med_sket.py --src_lang it --use_case colon --spacy_model en_core_sci_sm --w2v_model --string_model --thr 2.0 --store --dataset ./examples/test.xlsx
or, if a user also wants to use BERT with GPU support, they can run the following script:
python run_med_sket.py --src_lang it --use_case colon --spacy_model en_core_sci_sm --w2v_model --string_model --bert_model emilyalsentzer/Bio_ClinicalBERT --gpu 0 --thr 2.5 --store --dataset ./examples/test.xlsx
In both cases, we set src_lang to it, as the source language of the reports is Italian. Therefore, SKET needs to translate reports from Italian to English before performing information extraction.
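When SKET is launched from another Python program, the two commands above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of SKET; it only uses the flags shown in the two commands, and sys.executable stands in for python.

```python
import sys


def build_sket_command(dataset, use_case, src_lang, thr, bert_model=None, gpu=None):
    """Assemble the run_med_sket.py argument list from the flags shown above."""
    cmd = [
        sys.executable, "run_med_sket.py",
        "--src_lang", src_lang,
        "--use_case", use_case,
        "--spacy_model", "en_core_sci_sm",
        "--w2v_model",
        "--string_model",
    ]
    if bert_model is not None:
        cmd += ["--bert_model", bert_model]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    cmd += ["--thr", str(thr), "--store", "--dataset", dataset]
    return cmd


cpu_cmd = build_sket_command("./examples/test.xlsx", "colon", "it", 2.0)
gpu_cmd = build_sket_command(
    "./examples/test.xlsx", "colon", "it", 2.5,
    bert_model="emilyalsentzer/Bio_ClinicalBERT", gpu=0,
)
```

Either list can then be passed to subprocess.run(cmd, check=True) from the repository root.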
SKET can also be deployed as a Docker container -- thus avoiding the need to install its dependencies directly on the host machine. Two Docker images can be built: sket_cpu and sket_gpu.
For sket_gpu, NVIDIA drivers must already be installed on the host machine. Users can refer to the NVIDIA user guide for more information.
Instructions on how to build and run the sket images are reported below. If you already have Docker installed on your machine, you can skip the first step.
1. Install Docker. In this regard, check out the correct installation procedure for your platform.
2. Install docker-compose. In this regard, check out the correct installation procedure for your platform.
3. Check that the Docker daemon (i.e., dockerd) is up and running.
4. Download or clone the sket repository.
5. In sket_server/sket_rest_config, the config.json file allows you to configure the sket instance. Edit this file to set the following parameters: w2v_model, fasttext_model, bert_model, string_model, gpu, thr, where thr stands for the similarity threshold and its default value is set to 0.9.
6. Depending on the Docker image of interest, follow one of the two procedures below:
6a) SKET CPU-only: from the sket folder, type: docker-compose run --service-ports sket_cpu
6b) SKET GPU-enabled: from the sket folder, type: docker-compose run --service-ports sket_gpu
7. When the image is ready, the sket server is running at http://0.0.0.0:8000 if you run sket_cpu. If you run sket_gpu, the server runs at http://0.0.0.0:8001.
8. The annotation of medical reports can be performed with two types of POST request:
8a) If you want to store the annotations in the outputs directory, the URL to make the request to is: http://0.0.0.0:8000/annotate/<use_case>/<language> where use_case and language are the use case and the language (identified using the ISO 639-1 code) of your reports, respectively.
Request example:
curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it
where path/to/examples is the path to the examples folder. With this type of request, labels and concepts are stored in .json files, while graphs are stored in .json, .n3, .ttl, and .trig files.
If you want to store exclusively one file format among .n3, .ttl, and .trig, append one of the following after the language: /trig to store graphs in .trig format, /turtle to store graphs in .ttl format, or /n3 to store graphs in .n3 format.
Request example:
curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/turtle
where path/to/examples is the path to the examples folder.
8b) If you want to use the labels, the concepts, or the graphs returned by sket without saving them, the URL to make the request to is: http://0.0.0.0:8000/annotate/<use_case>/<language>/<output> where use_case and language are the use case and the language (identified using the ISO 639-1 code) of your reports, respectively, and output is labels, concepts, or graphs.
Request example:
curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/labels
where path/to/examples is the path to the examples folder.
If you want your request to return a graph, it must also include the graph format. Hence, the URL is: http://0.0.0.0:8000/annotate/<use_case>/<language>/graphs/<rdf_format> where <rdf_format> is one of turtle, n3, and trig.
Request example:
curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/graphs/turtle
where path/to/examples is the path to the examples folder.
9. If you want to embed your medical reports in the request, change the content type and set -H "Content-Type: application/json". Then, instead of -F "data=@...", put -d '{"reports":[{},...,{}]}' if you have multiple reports, or -d '{"k":"v",...}' if you have a single report.
10. If you want to build the images again, from the project folder type docker-compose down --rmi local. Note that this command removes all the images created (both CPU and GPU). If you want to remove only one of the two images, see the docker image documentation. Finally, repeat steps 5-8.
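The annotation endpoints described above can also be called from Python. The sketch below builds the request URLs from the documented path patterns and posts embedded reports as JSON using only the standard library. The function names are hypothetical, the response is returned as raw text because its format is not specified here, and the sketch assumes that the GPU server exposes the same paths on port 8001.

```python
import json
import urllib.request

# Ports taken from the instructions above.
PORTS = {"sket_cpu": 8000, "sket_gpu": 8001}


def annotate_url(use_case, language, output=None, rdf_format=None,
                 image="sket_cpu", host="0.0.0.0"):
    """Build an annotation URL following the path patterns described above."""
    if output == "graphs" and rdf_format is None:
        raise ValueError("graph requests must include an RDF format")
    parts = ["annotate", use_case, language]
    if output is not None:
        parts.append(output)      # labels, concepts, or graphs: nothing is stored
    if rdf_format is not None:
        parts.append(rdf_format)  # turtle, n3, or trig
    return "http://{}:{}/{}".format(host, PORTS[image], "/".join(parts))


def post_reports(url, reports):
    """POST a list of reports embedded as JSON and return the raw response text."""
    body = json.dumps({"reports": reports}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


store_all = annotate_url("colon", "it")
store_turtle = annotate_url("colon", "it", rdf_format="turtle")
labels_only = annotate_url("colon", "it", output="labels")
graphs_gpu = annotate_url("colon", "it", output="graphs", rdf_format="turtle", image="sket_gpu")
```

With a running server, post_reports(labels_only, [{"diagnosi": "adenoma tubulare"}]) would send one hypothetical report and return the server's reply.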
Regarding SKET GPU-enabled, the corresponding Dockerfile (located at sket_server/docker-sket_server-config/sket_gpu) references the nvidia/cuda:11.0-devel image. Users are encouraged to change the NVIDIA/CUDA image within the Dockerfile depending on the NVIDIA drivers installed on their host machine. NVIDIA/CUDA images are listed on Docker Hub.