Add files via upload
trangptm authored Oct 4, 2016
1 parent 413ea06 commit a02ba4a
Showing 3 changed files with 131 additions and 0 deletions.
69 changes: 69 additions & 0 deletions work_flow/1.data_preprocessing.txt
@@ -0,0 +1,69 @@
1. Raw preprocess:
- Replace '|' -> '\t'
raw_preprocess.py
inputs: data/raw/*.txt
outputs: data/preprocessed/*.txt
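A minimal sketch of the delimiter replacement in this step. The function names are hypothetical and raw_preprocess.py may be organised differently; only the '|' -> '\t' substitution and the two directories come from the notes above.

```python
from pathlib import Path


def pipes_to_tabs(text: str) -> str:
    """Replace every '|' delimiter with a tab character."""
    return text.replace("|", "\t")


def convert_dir(src_dir: str, dst_dir: str) -> None:
    """Convert every .txt file in src_dir and write it to dst_dir under the same name."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src_file in sorted(Path(src_dir).glob("*.txt")):
        converted = pipes_to_tabs(src_file.read_text())
        (dst / src_file.name).write_text(converted)
```

Under these assumptions the step would be run as convert_dir("data/raw", "data/preprocessed").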

2. Mapping:
- Map medication names -> medication codes
map_medi_code.py
- Map procedure codes -> procedure blocks
map_proc_code.py

inputs:
data/preprocessed/medications.txt
data/preprocessed/diagnosis_procedures.txt
outputs:
data/preprocessed/medication_mapped.txt
data/preprocessed/diag_proc_block_mapped.txt
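
Both mappings are dictionary lookups. The sketch below shows the idea only: the lookup tables and their contents are made up for illustration, and the notes do not say how map_medi_code.py or map_proc_code.py treat values that have no mapping (here they become None).

```python
def map_values(values, mapping, default=None):
    """Look up each value in `mapping`; unmapped values become `default`."""
    return [mapping.get(value, default) for value in values]


# Made-up lookup tables, for illustration only.
MEDI_NAME_TO_CODE = {"drug_a": "MED0000001", "drug_b": "MED0000002"}
PROC_CODE_TO_BLOCK = {"P001": "BLOCK_1", "P002": "BLOCK_1", "P101": "BLOCK_2"}

medi_codes = map_values(["drug_a", "drug_x"], MEDI_NAME_TO_CODE)
proc_blocks = map_values(["P001", "P101"], PROC_CODE_TO_BLOCK)
```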


3. Cut off codes:
- Cut off levels of diagnosis codes: keep the first 2 characters
- Cut off levels of medication codes: keep the first 6 characters
cut_off_code.py

inputs:
data/preprocessed/medication_mapped.txt
data/preprocessed/diag_proc_block_mapped.txt

outputs:
data/preprocessed/medication_mapped_cutoff.txt
data/preprocessed/diag_proc_block_mapped_cutoff.txt
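
The cut-off is a prefix truncation. A small sketch, assuming each code is a plain string; the function and constant names are hypothetical, and the example codes in the usage note are placeholders.

```python
DIAG_KEEP = 2  # diagnosis codes keep their first 2 characters
MEDI_KEEP = 6  # medication codes keep their first 6 characters


def cut_off(code: str, keep: int) -> str:
    """Keep only the first `keep` characters of a code (shorter codes are unchanged)."""
    return code[:keep]
```

For instance, cut_off("AB123", DIAG_KEEP) gives "AB" and cut_off("ABCDEFGH", MEDI_KEEP) gives "ABCDEF".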

4. Filter admission:
- Filter out all unusual admissions (admissions starting with one of the following characters: R, D, L, M, Q, S, Y)
- Filter out all dialysis admissions (not Emergency)
filter_adm.py

inputs:
data/preprocessed/admissions.txt
data/preprocessed/diag_proc_block_mapped_cutoff.txt
outputs:
data/preprocessed/admissions_filtered.txt
data/preprocessed/diag_proc_filtered.txt
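
A sketch of the prefix filter only. The notes do not say which field is tested against the listed characters, so the "code" field name is an assumption, as are the function names; the dialysis rule is not covered here.

```python
UNUSUAL_PREFIXES = ("R", "D", "L", "M", "Q", "S", "Y")


def is_unusual(code: str) -> bool:
    """True when the code starts with one of the listed characters."""
    return code.startswith(UNUSUAL_PREFIXES)


def filter_admissions(admissions, code_field="code"):
    """Keep only the admissions whose code does not start with an unusual prefix."""
    return [adm for adm in admissions if not is_unusual(adm[code_field])]
```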

5. Filter & cut off attendance:
- Filter out all attendances with missing information
- Cut off levels of diagnosis codes: keep the first 2 characters
filter_cutoff_atd.py

inputs:
data/preprocessed/attendances.txt
outputs:
data/preprocessed/atd_filtered.txt

6. Filter patients: remove all duplicated information
filter_patients.py
input:
data/preprocessed/patients.txt
output:
data/preprocessed/patnts_filtered.txt
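
A possible reading of "duplicated information" is rows that are identical field for field. The sketch below drops such repeats while keeping the first occurrence and the original order; filter_patients.py may define duplicates differently (for example by patient identifier only).

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```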


---> Files for creating the dataset:
patnts_filtered.txt
admissions_filtered.txt
diag_proc_filtered.txt
medication_mapped_cutoff.txt
atd_filtered.txt
43 changes: 43 additions & 0 deletions work_flow/2.data_combining.txt
@@ -0,0 +1,43 @@
Files for creating datasets:
patnts_filtered.txt
admissions_filtered.txt
diag_proc_filtered.txt
medication_mapped_cutoff.txt
atd_filtered.txt

I. CREATE ADM DATASET AND ATD DATASET
Brief description:
- From admissions_filtered.txt & diag_proc_filtered.txt: create adm_dataset
- From atd_filtered.txt: create atd_dataset
- Dump 2 datasets into 2 separate pkl files: adm.pkl & atd.pkl

Steps:
- Create 3 dictionaries:
+ diag_dict (encoding diagnosis in diag_proc and attendances)
+ proc_dict (encoding procedures in diag_proc)
+ medi_dict (encoding medications in medi)
- Create 2 dictionaries: prvsp_dict & prcae_dict (medi uses prvsp_refno & diag_proc uses prcae_refno)
These two dicts are used for mapping diag, proc, medi into their admissions
- Map diag, proc, medi into their admissions
After this step, we have adm_dataset containing information of all admissions.
Each admission has patnt_refno, admit_time, disch_time, method, a list of its diag, and lists of its proc & medi
- Create atd_dataset:
Encode diagnosis of each attendance and then create the atd_dataset with the information of UR, arr_time, dep_time & code (code of diagnosis)

Script: combine_data.py
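
The encoding and mapping steps above can be sketched as follows. This is an illustration under assumptions, not the content of combine_data.py: codes are encoded as consecutive integers in first-seen order, admissions are held in a dict keyed by the reference number (standing in for prvsp_dict / prcae_dict), and all function names are hypothetical.

```python
def build_code_dict(codes):
    """Assign each distinct code an integer id in first-seen order."""
    code_dict = {}
    for code in codes:
        if code not in code_dict:
            code_dict[code] = len(code_dict)
    return code_dict


def attach_codes(admissions, records, code_dict, field):
    """Append encoded codes to the admission they belong to.

    admissions: dict mapping a reference number to an admission dict.
    records: iterable of (reference number, code) pairs.
    Records whose admission is absent (e.g. filtered out) are skipped.
    """
    for refno, code in records:
        admission = admissions.get(refno)
        if admission is None:
            continue
        admission.setdefault(field, []).append(code_dict[code])
    return admissions
```

The same two helpers would be called once each for diag, proc and medi, with the matching dictionary and reference number.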

II. CREATE PATIENT DATASET
Steps:
- Create 2 dictionaries:
+ patnt_dict (admissions use patnt_refno to identify the patients)
+ ur_dict (attendances use ur to identify the patients)

- Map admissions into their patients (use patnt_dict):
This step creates for each patient a list of his/her admissions (list_adm)

- Map attendances into their patients (use ur_dict):
This step creates for each patient a list of his/her attendances (list_atd)

- Dump these 2 lists (list_adm & list_atd) to the file patnt.pkl

Script: create_patnt_records.py
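
A sketch of the grouping and the dump, assuming each admission or attendance is a dict that carries its patient identifier (patnt_refno or ur). The function names and the tuple layout inside patnt.pkl are assumptions; create_patnt_records.py may store the lists differently.

```python
import pickle
from collections import defaultdict


def group_by_patient(records, id_field):
    """Group records into one list per patient, keyed by the given id field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[id_field]].append(record)
    return dict(grouped)


def dump_patient_lists(list_adm, list_atd, path="patnt.pkl"):
    """Write both per-patient groupings to a single pickle file."""
    with open(path, "wb") as handle:
        pickle.dump((list_adm, list_atd), handle)
```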
19 changes: 19 additions & 0 deletions work_flow/3.create_dataset.txt
@@ -0,0 +1,19 @@
- Input: 'patnt.pkl', 'adm.pkl'
patnt.pkl: contains all information of all patients' admissions.
Each row is information of a patient with a list of admissions and a list of attendances.


- Randomly create datasets; each dataset includes train, validation, and test sets and an adm dataset

- Readmission prediction:
Each data point is a sequence of a patient's admissions. Randomly choose an admission A and cut off all later ones.
Check whether there is any emergency admission within 1 year (for diabetes) or 3 months (for mental health) after admission A. If there is, the label is 1; otherwise 0.
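
A sketch of the labelling rule under several assumptions that the notes do not settle: the window is measured from the discharge time of admission A, 1 year is taken as 365 days and 3 months as 90 days, emergency admissions are those whose method equals "Emergency", and the admissions are already sorted by time. All names are hypothetical.

```python
import random
from datetime import datetime, timedelta

WINDOW_DAYS = {"diabetes": 365, "mental_health": 90}  # assumed day counts


def readmission_example(admissions, index, window_days):
    """Cut the sequence after admissions[index] and label it.

    Returns (history, label): history ends at the chosen admission, and label
    is 1 if a later emergency admission starts inside the window, else 0.
    """
    anchor = admissions[index]
    window_end = anchor["disch_time"] + timedelta(days=window_days)
    history = admissions[: index + 1]
    label = 0
    for later in admissions[index + 1:]:
        in_window = anchor["disch_time"] <= later["admit_time"] <= window_end
        if later["method"] == "Emergency" and in_window:
            label = 1
            break
    return history, label


def random_example(admissions, window_days, rng=random):
    """Pick admission A uniformly at random, then build the labelled example."""
    index = rng.randrange(len(admissions))
    return readmission_example(admissions, index, window_days)
```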

- Next diagnosis prediction
Sequence mapping: from a sequence of admissions -> sequence of outputs, each output is a set of next diagnoses.
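
One way to read this sequence mapping: the input at step t is the admission at position t, and the target is the set of diagnoses of the admission at position t+1. The sketch assumes each admission is represented by its list of encoded diagnoses; the function name is hypothetical.

```python
def next_diagnosis_pairs(diag_sequence):
    """Split a patient's per-admission diagnosis lists into inputs and targets.

    inputs[t] is the diagnosis list of admission t; targets[t] is the set of
    diagnoses of admission t+1. The last admission has no target, so it only
    appears as a target.
    """
    inputs = diag_sequence[:-1]
    targets = [set(diags) for diags in diag_sequence[1:]]
    return inputs, targets
```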

- High risk prediction:
Same as readmission prediction. Output is 1 if the patient has at least 3 emergency readmissions within 1 year (for diabetes) or 3 months (for mental health) after discharge.
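
A sketch of the high-risk label, reading the rule as "at least 3 emergency readmissions inside the window that starts at discharge". The window length in days, the "Emergency" method value and the function name are assumptions.

```python
from datetime import datetime, timedelta


def high_risk_label(disch_time, later_admissions, window_days, min_count=3):
    """Return 1 if at least `min_count` emergency admissions start in the window."""
    window_end = disch_time + timedelta(days=window_days)
    count = sum(
        1
        for adm in later_admissions
        if adm["method"] == "Emergency"
        and disch_time <= adm["admit_time"] <= window_end
    )
    return 1 if count >= min_count else 0
```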

- Current interventions:
Same as Next diagnosis prediction
