Cite this article if using the data:

Low, D. M., Randolph, G., Rao, V., Ghosh, S. S., & Song, P. C. (2023). Uncovering the important acoustic features for detecting vocal fold paralysis with explainable machine learning. medRxiv.
Note:
- "speech" in this repo refers to "reading" task in the manuscript.
- The original audio wav files cannot be shared due to consent restrictions. Here we provide the extracted eGeMAPS features (see manuscript for details).
`./data/input/VFP_DeidentifiedDemographics.csv`: de-identified demographic information.

Columns:

- `sid`: subject ID. Important for the group shuffle split.
- `filename`: wav file from which the features were extracted.
- `token`: type of sample (`speechN` or `vowelN`, where `N` is the sample number).
- `target`: label to be predicted.

Feature files:

- `egemaps_vector_both.csv`
- `egemaps_vector_speech.csv`
- `egemaps_vector_vowel.csv`
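Because several samples come from the same subject, train/test splits should keep each `sid` in a single fold. A minimal sketch with scikit-learn's `GroupShuffleSplit` (the toy data below is made up; only the column names come from this repo):

```python
# Group-aware split so that all samples from one subject ("sid") land in
# the same fold. Toy stand-in data; only the column names match the repo.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "sid":    [1, 1, 2, 2, 3, 3, 4, 4],   # subject ID
    "token":  ["speech1", "vowel1"] * 4,  # sample type
    "target": [0, 0, 1, 1, 0, 0, 1, 1],   # label to predict
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, df["target"], groups=df["sid"]))

# No subject appears in both train and test
assert set(df.loc[train_idx, "sid"]).isdisjoint(df.loc[test_idx, "sid"])
```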
To run the `.py` scripts (including the pydra-ml package) or the `.ipynb` notebooks in Jupyter Notebook, create a virtual environment and install `requirements.txt`:

```
conda create --name pydra_vfp --file requirements.txt
conda activate pydra_vfp
```
`./fig_1.ipynb`

- `./data/input/rainbow.wav`: audio file used.
- `./data/input/rainbow_f0.txt`: f0 over time from Praat.
- `./data/input/VFP_DeidentifiedDemographics.csv`: de-identified demographic information.
- `demographics.py`: script to obtain the info for Table 1.
- `duration.ipynb`
We ran models using pydra-ml, which requires a spec file in which the dataset is specified. The dataset needs to be in the same dir where the spec file is run; since we ran the models on a cluster, the dataset sits in the same dir as the SLURM scripts. `if` and `indfact` stand for Independence Factor, the algorithm we created for removing redundant features.
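For orientation, a pydra-ml spec file is a JSON document along these lines. This is a hypothetical sketch: apart from `gen_shap` (mentioned below), the field names and values follow the pydra-ml documentation rather than this repo, so check the actual files under `specs/` for the exact schema:

```json
{
  "filename": "egemaps_vector_both.csv",
  "target_vars": ["target"],
  "group_var": "sid",
  "n_splits": 100,
  "test_size": 0.2,
  "clf_info": [
    ["sklearn.linear_model", "LogisticRegression", {"penalty": "l2"}]
  ],
  "permute": [true, false],
  "gen_shap": false,
  "metrics": ["roc_auc_score"]
}
```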
- `./vfp_v7_indfact/`
  - `specs/`: spec files.
  - `run_collinearity_job_array_{data_type}_if.sh`: SLURM script to run the pydra-ml spec files, where `data_type` is 'speech', 'vowel' or 'both':

    ```
    $ pydraml -s specs/vfp_spec_4models_both_if_{spec_id}.json
    ```

    where `spec_id` is a value in `range(1, 10)` corresponding to the dcorrs thresholds `np.arange(0.2, 1.1, 0.1)` (i.e., we removed redundant features according to the dcor threshold). The job array runs the different spec files in parallel.
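The `spec_id`-to-threshold correspondence can be written out explicitly (a sketch, not code from the repo; plain floats are used instead of `np.arange` to sidestep floating-point endpoint surprises):

```python
# Map spec_id (1..9) to the dcor threshold it encodes (0.2 .. 1.0).
thresholds = [round(0.2 + 0.1 * i, 1) for i in range(9)]
spec_to_threshold = dict(zip(range(1, 10), thresholds))

for spec_id, thr in spec_to_threshold.items():
    print(f"vfp_spec_4models_both_if_{spec_id}.json -> dcor threshold {thr}")
```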
  - `thresholds_if_Nvars_{data_type}.txt`: files used to build those spec files.
  - `run_clear_locks.sh`: runs `clear_locks.py`. Run this if you want to re-run a model with different specs (pydra-ml will otherwise re-use its cache, `cache-wf`).
  - `run_collinearity_speech_explanations.sh`: re-runs the models with `gen_shap` set to true in the spec files to output SHAP values/explanations.
  - `./outputs/`: each run outputs a dir named after its spec file, such as `out-vfp_spec_4models_both_if_1.json-20200910T024552.823868`.
- `performance_stats.py`: p-values in Figure 3.
- `./vfp_v8_top5/`: runs the top 5 features specified in the spec files.
- `analyze_results.py`: takes `outputs/out-*` files from pydra-ml and produces summaries that were then concatenated into Table 2, as well as the figures for Sup. Figure S10.
- `cpp.ipynb`: CPP models.
- `duration.ipynb`: duration models.
- `shap_analysis.ipynb`: parallel coordinate plots using SHAP scores.
- `./vfp_v8_top1outof5/`: runs one of the top 5 features at a time; `shap_analysis.ipynb` makes the plots.
- `collinearity.py`: removes redundant features (reduces multicollinearity) using the Independence Factor.
- `redudant_features.ipynb`: clustermap (Figure 6).
- `audio_annotation.ipynb`: code to run the experiment/survey.
- `analyze_annotations.ipynb`
- `classification_wo_correlated_features_duration.ipynb`
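A minimal sketch of the kind of greedy redundancy filtering `collinearity.py` performs: drop any feature that is too strongly dependent on one already kept. The manuscript's Independence Factor uses distance correlation (dcor); absolute Pearson correlation stands in here so the example needs only NumPy:

```python
import numpy as np

def remove_redundant(X, threshold=0.9):
    """Greedily keep columns whose |corr| with every kept column is < threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not redundant with any kept column
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), b])  # col 1 duplicates col 0

print(remove_redundant(X))  # col 1 is dropped as redundant with col 0
```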
We removed the 24 patients that were recorded using a different device (an iPad). If performance drops significantly, then the original dataset may have been using the recording setup to dissociate groups (i.e., if iPad-related features lie within a certain range determined by the iPad, then the prediction equals patient).

Patients recorded with the iPad: [3, 4, 5, 8, 9, 12, 13, 18, 24, 27, 28, 29, 31, 33, 38, 53, 54, 55, 56, 64, 65, 66, 71, 74]
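Recreating the reduced dataset from a full feature table is then a one-line filter on `sid` (a sketch; the toy frame is made up, only the column name and ID list come from this README):

```python
import pandas as pd

# Subject IDs recorded with the iPad (the list above)
IPAD_SIDS = [3, 4, 5, 8, 9, 12, 13, 18, 24, 27, 28, 29, 31, 33, 38,
             53, 54, 55, 56, 64, 65, 66, 71, 74]

# Toy stand-in for egemaps_vector_both.csv
df = pd.DataFrame({"sid": [1, 3, 7, 12, 80], "target": [0, 1, 0, 1, 1]})
df_wo = df[~df["sid"].isin(IPAD_SIDS)].reset_index(drop=True)

print(df_wo["sid"].tolist())  # [1, 7, 80]
```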
- `./data/input/features/egemaps_vector_both_wo-24-patients.csv`: dataset.
- `./data/output/vfp_v8_wo-24-patients/`: pydra-ml scripts.
- `egemaps_vector_both_wo-24-patients.csv`, `egemaps_vector_speech_wo-24-patients.csv`, `egemaps_vector_vowel_wo-24-patients.csv`
- See `test_different_recording.ipynb`.