Add files via upload

trangptm · Oct 4, 2016 · a02ba4a
1 parent 413ea06
commit a02ba4a
Showing 3 changed files with 131 additions and 0 deletions.
diff --git a/work_flow/1.data_preprocessing.txt b/work_flow/1.data_preprocessing.txt
@@ -0,0 +1,69 @@
+1. Raw preprocess:
+	- Replace '|' -> '\t'
+		raw_preprocess.py
+	inputs: data/raw/*.txt 
+	outputs: data/preprocessed/*.txt
+
+2. Mapping:
+	- Map medication names -> medication codes
+		map_medi_code.py
+	- Map procedure codes -> procedure blocks
+		map_proc_code.py
+
+	inputs: 
+		data/preprocessed/medications.txt
+		data/preprocessed/diagnosis_procedures.txt
+	outputs:
+		data/preprocessed/medication_mapped.txt
+		data/preprocessed/diag_proc_block_mapped.txt
+
+
+3. Cut off codes:
+	- Cut off levels of diagnosis codes: keep the first 2 characters
+	- Cut off levels of medication codes: keep the first 6 characters
+		cut_off_code.py	
+
+	inputs:
+		data/preprocessed/medication_mapped.txt
+		data/preprocessed/diag_proc_block_mapped.txt
+
+	outputs:
+		data/preprocessed/medication_mapped_cutoff.txt
+		data/preprocessed/diag_proc_block_mapped_cutoff.txt
+
+4. Filter admission: 
+	- Filter out all unusual admissions (admissions starting with one of the following characters: R, D, L, M, Q, S, Y)
+	- Filter out all dialysis admissions (not Emergency)
+		filter_adm.py
+
+	inputs: 
+		data/preprocessed/admissions.txt
+		data/preprocessed/diag_proc_block_mapped_cutoff.txt
+	outputs:
+		data/preprocessed/admissions_filtered.txt
+		data/preprocessed/diag_proc_filtered.txt
+
+5. Filter & cut off attendance:
+	- Filter out all attendances with missing information
+	- Cut off levels of diagnosis codes: keep the first 2 characters
+		filter_cutoff_atd.py
+
+	inputs:
+		data/preprocessed/attendances.txt
+	outputs:
+		data/preprocessed/atd_filtered.txt
+
+6. Filter patients: remove all duplicated information
+		filter_patients.py
+	input: 
+		data/preprocessed/patients.txt
+	output:
+		data/preprocessed/patnts_filtered.txt
+
+
+---> Files for creating the dataset: 
+	patnts_filtered.txt
+	admissions_filtered.txt
+	diag_proc_filtered.txt
+	medication_mapped_cutoff.txt
+	atd_filtered.txt
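The delimiter replacement, code cut-off, and admission filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical and are not taken from raw_preprocess.py, cut_off_code.py, or filter_adm.py, and the notes do not say which admission field the prefix test applies to, so the sketch takes a plain string.

```python
# Hypothetical helpers illustrating steps 1, 3 and 4 of the workflow notes.
# None of these names come from the actual scripts.

# Prefixes that mark an "unusual" admission, as listed in step 4.
UNUSUAL_PREFIXES = ("R", "D", "L", "M", "Q", "S", "Y")


def raw_preprocess_line(line):
    """Step 1: replace the '|' delimiter with a tab."""
    return line.replace("|", "\t")


def cut_off_code(code, keep):
    """Step 3: keep only the first `keep` characters of a code."""
    return code[:keep]


def is_unusual(value):
    """Step 4: True if the value starts with one of the unusual prefixes.

    Which field of an admission is tested is an assumption; the notes
    only say "admissions starting with" these characters.
    """
    return value.startswith(UNUSUAL_PREFIXES)


if __name__ == "__main__":
    print(raw_preprocess_line("a|b|c"))
    print(cut_off_code("E1165", 2))     # diagnosis code: first 2 characters
    print(cut_off_code("A10BA02", 6))   # medication code: first 6 characters
    print(is_unusual("R123"), is_unusual("E123"))
```

The example codes are made up for the demonstration and are not drawn from the data.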
diff --git a/work_flow/2.data_combining.txt b/work_flow/2.data_combining.txt
@@ -0,0 +1,43 @@
+Files for creating datasets:
+	patnts_filtered.txt
+	admissions_filtered.txt
+	diag_proc_filtered.txt
+	medication_mapped_cutoff.txt
+	atd_filtered.txt
+
+I. CREATE ADM DATASET AND ATD DATASET	
+Brief description:
+- From admissions_filtered.txt & diag_proc_filtered.txt: create adm_dataset
+- From atd_filtered.txt: create atd_dataset
+- Dump the 2 datasets into 2 separate pkl files: adm.pkl & atd.pkl
+
+Steps:
+- Create 3 dictionaries: 
+	+ diag_dict (encoding diagnoses in diag_proc and attendances)
+	+ proc_dict (encoding procedures in diag_proc)
+	+ medi_dict (encoding medications in medi)
+- Create 2 dictionaries: prvsp_dict & prcae_dict (medi uses prvsp_refno & diag_proc uses prcae_refno)
+	These two dicts are used for mapping diag, proc, medi into their admissions
+- Map diag, proc, medi into their admissions
+	After this step, we have adm_dataset containing information of all admissions.
+	Each admission has information of patnt_refno, admit_time, disch_time, method, a list of its diag, and a list of its proc & medi
+- Create atd_dataset: 
+	Encode the diagnosis of each attendance, then create the atd_dataset with the information of UR, arr_time, dep_time & code (code of diagnosis)
+
+Script: combine_data.py
+
+II. CREATE PATIENT DATASET
+Steps:
+- Create 2 dictionaries:
+	+ patnt_dict (admissions use patnt_refno to identify the patients)
+	+ ur_dict (attendances use ur to identify the patients)
+
+- Map admissions into their patients (use patnt_dict):
+	This step creates, for each patient, a list of his/her admissions (list_adm)
+
+- Map attendances into their patients (use ur_dict):
+	This step creates, for each patient, a list of his/her attendances (list_atd)
+
+- Dump these 2 lists (list_adm & list_atd) to the file patnt.pkl
+
+Script: create_patnt_records.py
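The dictionary-based encoding and the mapping of records to patients described above can be sketched as follows. This is a minimal illustration, not the code in combine_data.py or create_patnt_records.py: the helper names, the record layout (plain dicts), and the sample values are all assumptions.

```python
# Hypothetical sketch of the two dictionary patterns in the notes:
# (1) encoding codes as integers, (2) grouping records per patient.


def build_code_dict(codes):
    """Assign each distinct code a consecutive integer id (like diag_dict)."""
    code_dict = {}
    for code in codes:
        if code not in code_dict:
            code_dict[code] = len(code_dict)
    return code_dict


def group_by_patient(records, key):
    """Group records by a patient identifier.

    Returns (index_of, grouped): index_of maps the identifier to a row
    index (like patnt_dict or ur_dict), and grouped[row] is that
    patient's list of records (like list_adm or list_atd).
    """
    index_of = {}
    grouped = []
    for rec in records:
        pid = rec[key]
        if pid not in index_of:
            index_of[pid] = len(grouped)
            grouped.append([])
        grouped[index_of[pid]].append(rec)
    return index_of, grouped


if __name__ == "__main__":
    # Made-up sample admissions; field names follow the notes.
    admissions = [
        {"patnt_refno": 10, "diag": ["E1", "I1"]},
        {"patnt_refno": 20, "diag": ["F3"]},
        {"patnt_refno": 10, "diag": ["E1"]},
    ]
    diag_dict = build_code_dict(c for adm in admissions for c in adm["diag"])
    patnt_dict, list_adm = group_by_patient(admissions, "patnt_refno")
    print(diag_dict)
    print(patnt_dict, [len(rows) for rows in list_adm])
```

The same `group_by_patient` helper would serve for attendances by passing the UR field as `key`, assuming attendances are stored in the same dict form.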
diff --git a/work_flow/3.create_dataset.txt b/work_flow/3.create_dataset.txt
@@ -0,0 +1,19 @@
+- Input: 'patnt.pkl', 'adm.pkl'
+	patnt.pkl: contains all information of all patients' admissions.
+		Each row is information of a patient with a list of admissions and a list of attendances.
+
+
+- Randomly create datasets; each dataset includes train, validation and test sets and an adm dataset
+
+- Readmission prediction:
+	Each data point is a sequence of a patient's admissions. Randomly choose an admission A and cut off all later ones.
+	Check whether there is any emergency admission within 1 year (for diabetes) or 3 months (for mental health) after admission A. If there is, the label is 1, otherwise 0.
+
+- Next diagnosis prediction
+	Sequence mapping: from a sequence of admissions -> sequence of outputs, each output is a set of next diagnoses.
+
+- High risk prediction:
+	Same as readmission prediction. Output is 1 if, within 1 year (for diabetes) or 3 months (for mental health) after discharge, the patient has at least 3 emergency readmissions.
+
+- Current interventions:
+	Same as Next diagnosis prediction
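The readmission and high-risk labels described above can be sketched as one counting function with two thresholds. Several details are assumptions not stated in the notes: the window is measured from the discharge time of admission A, the windows are approximated as 365 and 90 days, admissions are sorted by admit time, and the `method` field holds the string "Emergency". All function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed window lengths; the notes say "1 year" and "3 months".
WINDOW_DIABETES = timedelta(days=365)
WINDOW_MENTAL_HEALTH = timedelta(days=90)


def count_emergency_after(admissions, idx, window):
    """Count emergency admissions within `window` after admission idx.

    `admissions` is one patient's list, assumed sorted by admit_time.
    The window starts at the discharge time of admission idx (assumption).
    """
    anchor = admissions[idx]["disch_time"]
    return sum(
        1
        for adm in admissions[idx + 1:]
        if adm["method"] == "Emergency"
        and anchor < adm["admit_time"] <= anchor + window
    )


def readmission_label(admissions, idx, window):
    """1 if there is any emergency admission in the window, else 0."""
    return 1 if count_emergency_after(admissions, idx, window) >= 1 else 0


def high_risk_label(admissions, idx, window):
    """1 if there are at least 3 emergency readmissions in the window, else 0."""
    return 1 if count_emergency_after(admissions, idx, window) >= 3 else 0


if __name__ == "__main__":
    def adm(admit, disch, method):
        return {"admit_time": admit, "disch_time": disch, "method": method}

    # Made-up admissions for a single patient.
    history = [
        adm(datetime(2015, 1, 1), datetime(2015, 1, 5), "Elective"),
        adm(datetime(2015, 2, 1), datetime(2015, 2, 3), "Emergency"),
        adm(datetime(2015, 6, 1), datetime(2015, 6, 2), "Emergency"),
    ]
    print(readmission_label(history, 0, WINDOW_MENTAL_HEALTH))
    print(high_risk_label(history, 0, WINDOW_DIABETES))
```

In the workflow the index of admission A is chosen at random (for example with `random.randrange(len(history))`); the sketch takes it as an argument so the labels can be checked deterministically. The input sequence for the model would then be `history[:idx + 1]`.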