diff --git a/work_flow/1.data_preprocessing.txt b/work_flow/1.data_preprocessing.txt
new file mode 100644
index 0000000..3ea38f9
--- /dev/null
+++ b/work_flow/1.data_preprocessing.txt
@@ -0,0 +1,69 @@
+1. Raw preprocess:
+   - Replace '|' -> '\t'
+   raw_preprocess.py
+   inputs:  data/raw/*.txt
+   outputs: data/preprocessed/*.txt
+
+2. Mapping:
+   - Map medication names -> medication codes
+     map_medi_code.py
+   - Map procedure codes -> procedure blocks
+     map_proc_code.py
+
+   inputs:
+      data/preprocessed/medications.txt
+      data/preprocessed/diagnosis_procedures.txt
+   outputs:
+      data/preprocessed/medication_mapped.txt
+      data/preprocessed/diag_proc_block_mapped.txt
+
+3. Cut off code levels:
+   - Cut off levels of diagnosis codes: keep the first 2 characters
+   - Cut off levels of medication codes: keep the first 6 characters
+   cut_off_code.py
+
+   inputs:
+      data/preprocessed/medication_mapped.txt
+      data/preprocessed/diag_proc_block_mapped.txt
+   outputs:
+      data/preprocessed/medication_mapped_cutoff.txt
+      data/preprocessed/diag_proc_block_mapped_cutoff.txt
+
+4. Filter admissions:
+   - Filter out all unusual admissions (admissions starting with one of the following characters: R, D, L, M, Q, S, Y)
+   - Filter out all dialysis admissions that are not Emergency
+   filter_adm.py
+
+   inputs:
+      data/preprocessed/admissions.txt
+      data/preprocessed/diag_proc_block_mapped_cutoff.txt
+   outputs:
+      data/preprocessed/admissions_filtered.txt
+      data/preprocessed/diag_proc_filtered.txt
+
+5. Filter & cut off attendances:
+   - Filter out all attendances with missing information
+   - Cut off levels of diagnosis codes: keep the first 2 characters
+   filter_cutoff_atd.py
+
+   inputs:
+      data/preprocessed/attendances.txt
+   outputs:
+      data/preprocessed/atd_filtered.txt
+
+6. 
Filter patients: remove all duplicate records
+   filter_patients.py
+   input:
+      data/preprocessed/patients.txt
+   output:
+      data/preprocessed/patnts_filtered.txt
+
+
+---> Files for creating the dataset:
+   patnts_filtered.txt
+   admissions_filtered.txt
+   diag_proc_filtered.txt
+   medication_mapped_cutoff.txt
+   atd_filtered.txt
\ No newline at end of file
diff --git a/work_flow/2.data_combining.txt b/work_flow/2.data_combining.txt
new file mode 100644
index 0000000..9075940
--- /dev/null
+++ b/work_flow/2.data_combining.txt
@@ -0,0 +1,43 @@
+Files for creating the datasets:
+   patnts_filtered.txt
+   admissions_filtered.txt
+   diag_proc_filtered.txt
+   medication_mapped_cutoff.txt
+   atd_filtered.txt
+
+I. CREATE ADM DATASET AND ATD DATASET
+Brief description:
+- From admissions_filtered.txt & diag_proc_filtered.txt: create adm_dataset
+- From atd_filtered.txt: create atd_dataset
+- Dump the 2 datasets into 2 separate pkl files: adm.pkl & atd.pkl
+
+Steps:
+- Create 3 dictionaries:
+   + diag_dict (encodes diagnoses in diag_proc and attendances)
+   + proc_dict (encodes procedures in diag_proc)
+   + medi_dict (encodes medications in medi)
+- Create 2 dictionaries: prvsp_dict & prcae_dict (medi uses prvsp_refno & diag_proc uses prcae_refno)
+  These two dicts are used for mapping diag, proc & medi into their admissions
+- Map diag, proc & medi into their admissions
+  After this step, we have adm_dataset, which contains the information of all admissions.
+  Each admission holds patnt_refno, admit_time, disch_time, method, a list of its diag, and a list of its proc & medi
+- Create atd_dataset:
+  Encode the diagnosis of each attendance, then create atd_dataset with the information UR, arr_time, dep_time & code (the diagnosis code)
+
+Script: combine_data.py
+
+II. 
CREATE PATIENT DATASET
+Steps:
+- Create 2 dictionaries:
+   + patnt_dict (admissions use patnt_refno to identify patients)
+   + ur_dict (attendances use ur to identify patients)
+
+- Map admissions into their patients (use patnt_dict):
+  This step creates, for each patient, a list of his/her admissions (list_adm)
+
+- Map attendances into their patients (use ur_dict):
+  This step creates, for each patient, a list of his/her attendances (list_atd)
+
+- Dump these 2 lists (list_adm & list_atd) to the file patnt.pkl
+
+Script: create_patnt_records.py
\ No newline at end of file
diff --git a/work_flow/3.create_dataset.txt b/work_flow/3.create_dataset.txt
new file mode 100644
index 0000000..0fb2647
--- /dev/null
+++ b/work_flow/3.create_dataset.txt
@@ -0,0 +1,19 @@
+- Input: 'patnt.pkl', 'adm.pkl'
+  patnt.pkl contains the information of all patients.
+  Each row is the information of one patient, with a list of admissions and a list of attendances.
+
+
+- Randomly create datasets; each dataset includes train, validation and test sets, plus an adm dataset
+
+- Readmission prediction:
+  Each data point is a sequence of a patient's admissions. Randomly choose an admission A and cut off all later ones.
+  Check whether there is any emergency admission within 1 year (for diabetes) or 3 months (for mental health) after admission A. If there is, the label is 1; otherwise 0.
+
+- Next diagnosis prediction:
+  Sequence mapping: from a sequence of admissions -> a sequence of outputs, where each output is the set of next diagnoses.
+
+- High risk prediction:
+  Same as readmission prediction, except the output is 1 if, within 1 year (for diabetes) or 3 months (for mental health) of discharge, the patient has at least 3 emergency readmissions.
+
+- Current interventions prediction:
+  Same as next diagnosis prediction.
\ No newline at end of file
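The cut-off step in 1.data_preprocessing.txt (step 3) can be sketched as simple prefix truncation. This is a minimal illustration, not the actual `cut_off_code.py`; the function name and the example codes are hypothetical.

```python
# Illustrative sketch of the code cut-off described in step 3 of
# 1.data_preprocessing.txt: diagnosis codes keep their first 2 characters,
# medication codes their first 6. Function/constant names are assumptions.

DIAG_PREFIX_LEN = 2
MEDI_PREFIX_LEN = 6

def cut_off(code, keep):
    """Keep only the first `keep` characters of a code."""
    return code[:keep]
```

For example, `cut_off('E11.9', DIAG_PREFIX_LEN)` yields `'E1'`, grouping fine-grained codes into coarser categories.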
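The encoding dictionaries in step I of 2.data_combining.txt (diag_dict, proc_dict, medi_dict) can be built by assigning each distinct code a small integer id as it is first seen. A minimal sketch, assuming the dicts map code string -> integer id; the `encode` helper is hypothetical and not taken from `combine_data.py`:

```python
# Hypothetical sketch of the encoding dictionaries from 2.data_combining.txt.
# Each distinct code gets a consecutive integer id on first encounter.

def encode(code, code_dict):
    """Return the integer id for `code`, adding it to `code_dict` if unseen."""
    if code not in code_dict:
        code_dict[code] = len(code_dict)
    return code_dict[code]

# One dict per code type, as in the notes.
diag_dict, proc_dict, medi_dict = {}, {}, {}
```

Re-encoding the same code always returns the same id, so admissions and attendances that share a diagnosis map to the same integer.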
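The readmission-label construction in 3.create_dataset.txt can be sketched as follows. This is an assumption-laden illustration, not the project's actual script: it assumes each admission is a dict with `admit_time`, `disch_time` (datetimes) and `method` (with `'emergency'` marking emergency admissions), and it measures the window from discharge, as the high-risk task describes.

```python
from datetime import timedelta
import random

def make_readmission_example(admissions, window_days, rng=random):
    """Pick a random cut-off admission A from a time-ordered sequence,
    drop all later admissions, and label the kept prefix 1 if any
    emergency admission falls within `window_days` after A's discharge
    (365 for diabetes, ~90 for mental health), else 0.

    Field names ('admit_time', 'disch_time', 'method') are assumptions."""
    idx = rng.randrange(len(admissions))
    history = admissions[: idx + 1]          # admission A and everything before it
    cutoff = admissions[idx]['disch_time']
    horizon = cutoff + timedelta(days=window_days)
    label = int(any(
        adm['method'] == 'emergency' and cutoff < adm['admit_time'] <= horizon
        for adm in admissions[idx + 1:]
    ))
    return history, label
```

The high-risk variant is the same walk over the admissions after A, except the label is 1 only when at least 3 emergency admissions fall inside the window.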