This project contains the code and data for the paper "A random forest based computational model for predicting novel lncRNA-disease associations" by Yao, Dengju, et al.
The original data and code of the RFLDA algorithm are available in code & data.zip.
The data and code have been re-organised as follows:
- Data resides in the `input_data` folder. These are the original Excel files extracted from code & data.zip.
- Source code resides in the `src` folder.
- Output generated by RFLDA resides in the `output_data` folder.
- Results of code optimisation tests reside in the `optimisation_data` folder.

File | Description |
---|---|
RFLDA.R | The original RFLDA.txt has been renamed to RFLDA.R and the code optimised. For more information see the RFLDA code changes section. Additionally, only the functions are declared in this file. |
test_RFLDA.R | Executes the functions in RFLDA.R in order and records the execution time in optimisation_data/RFLDA_result.csv |
common_functions.R | Shared functions used by the optimisation test scripts |
compare_excel_libraries_read.R | Compares the read speed of the openxlsx and readxl packages and records the results in optimisation_data/excel_read_result.csv |
compare_excel_libraries_write.R | Compares the write speed of the openxlsx and writexl packages and records the results in optimisation_data/excel_write_result.csv |
compare_parquet_and_excel_libraries.R | Compares the write speed of the openxlsx, writexl and arrow packages and records the results in optimisation_data/parquet_and_excel_write_result.csv |
compare_randomforest_libraries.R | Compares the training time of the randomForest and ranger packages and records the results in optimisation_data/randomforest_comparison_result.csv |
optimise_nested_loop.R | Compares joining two datasets using the original nested-loop code against the sqldf package. Warning: the nested loop takes a long time! Records the results in optimisation_data/optimise_nested_loop.csv |
save_main_R_package_versions.R | Writes the names and versions of the R packages used for code optimisation to main_R_package_versions.txt |
save_R_package_versions.R | Writes the names and versions of all packages in the R environment to R_package_versions.txt |
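
The comparison scripts listed above each time an operation and record the result in a CSV file. The sketch below illustrates that general pattern in base R only. The helper `time_call`, the column names, and the output file name are hypothetical and are not taken from common_functions.R; the timed operations (`sort` and `order`) are stand-ins for the real library calls.

```r
# Hypothetical helper: time a single call of `fn` and return a one-row data frame.
time_call <- function(label, fn, ...) {
  # system.time() reports user, system and elapsed time; keep the wall-clock value.
  elapsed <- system.time(fn(...))[["elapsed"]]
  data.frame(label = label, elapsed_seconds = elapsed, stringsAsFactors = FALSE)
}

# Stand-in workloads; the real scripts time Excel, Parquet and random forest calls.
results <- rbind(
  time_call("sort_1e5", sort, runif(1e5)),
  time_call("order_1e5", order, runif(1e5))
)

# Write to a temporary location so the sketch does not touch optimisation_data.
result_file <- file.path(tempdir(), "timing_result.csv")
write.csv(results, result_file, row.names = FALSE)
```

A single timed run is noisy, so repeating each call several times and recording every run is usually more informative than one measurement.
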
The following changes have been made to the original code:
1 - The original code fails to write and read back the LDA object. This is resolved by converting LDA to a data.frame before saving it to disk. The original code may have worked with an earlier version of openxlsx, with later changes to the package possibly dropping support for writing a matrix to Excel.
2 - openxlsx is very slow at writing xlsx files. Additionally, the file lncRNA-disease-ALL.xlsx cannot be opened with LibreOffice Calc. To resolve these issues, writexl is used instead.
3 - The original code converted LDA into a matrix, which is a bug because this data frame contains two columns of text.
4 - Changed the generation of labels for LDExcl0 to use sqldf instead of nested loops, reducing the time taken from approximately 10 hours to 1 minute.
5 - Switched from the randomForest library to ranger, as it supports the use of multiple processor cores.
6 - Added support for Parquet files, which are compact and fast to read.
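
To illustrate change 4, the sketch below labels lncRNA-disease pairs first with a nested loop and then with a single join. It is a toy example under stated assumptions, not the code in RFLDA.R: the data frames `pairs` and `known`, their column names, and the `label` column are all hypothetical. If sqldf is not installed, base R `merge` is used as a stand-in for the join.

```r
# Toy data; every name below is hypothetical and not taken from RFLDA.R.
pairs <- data.frame(
  lncRNA = rep(c("L1", "L2", "L3"), times = 2),
  disease = rep(c("D1", "D2"), each = 3),
  stringsAsFactors = FALSE
)
known <- data.frame(
  lncRNA = c("L1", "L3"),
  disease = c("D2", "D1"),
  stringsAsFactors = FALSE
)

# Nested-loop labelling: compares every pair with every known association,
# so the work grows with nrow(pairs) * nrow(known).
label_loop <- integer(nrow(pairs))
for (i in seq_len(nrow(pairs))) {
  for (j in seq_len(nrow(known))) {
    if (pairs$lncRNA[i] == known$lncRNA[j] &&
      pairs$disease[i] == known$disease[j]) {
      label_loop[i] <- 1L
    }
  }
}

# Join-based labelling: one LEFT JOIN marks matched pairs with 1, the rest with 0.
if (requireNamespace("sqldf", quietly = TRUE)) {
  suppressPackageStartupMessages(library(sqldf))
  labelled <- sqldf(
    "SELECT p.lncRNA, p.disease,
            CASE WHEN k.lncRNA IS NULL THEN 0 ELSE 1 END AS label
       FROM pairs p
       LEFT JOIN known k
         ON p.lncRNA = k.lncRNA AND p.disease = k.disease"
  )
} else {
  # Stand-in for the SQL join when sqldf is unavailable.
  labelled <- merge(pairs, cbind(known, label = 1L),
    by = c("lncRNA", "disease"), all.x = TRUE
  )
  labelled$label[is.na(labelled$label)] <- 0L
}
```

The row order of the joined result is not guaranteed to match `pairs`, so compare the two labellings by key rather than by position.
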
You can run the R files directly in RStudio or from the command line using the Rscript utility. Please make sure the working directory, set using the setwd function, is the src folder containing the R files. This ensures that the input files are found and the output files are placed in the desired location.
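
As a small illustration of the working-directory requirement, the sketch below checks whether a directory is named src before any script is sourced. The function `in_src_folder` is a hypothetical helper and is not part of the project; it only inspects the folder name and does not verify that the R files are actually present.

```r
# Hypothetical guard, not part of the project: TRUE when `dir` is a folder named "src".
in_src_folder <- function(dir = getwd()) {
  basename(normalizePath(dir, mustWork = FALSE)) == "src"
}

if (!in_src_folder()) {
  message(
    "Working directory is ", getwd(),
    "; call setwd() on the src folder before sourcing the scripts."
  )
}
```

Once the check passes, a script such as test_RFLDA.R can be run with `source("test_RFLDA.R")` in an R session, or with Rscript from a shell opened in the same folder.
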
- Primary Development OS: Ubuntu 24.04 LTS
- Compatible OS: Windows 10, macOS 11.0 (Big Sur)
- R (version 4.4.1)
- Bash (5.2.21)
The full list of R-Packages that were installed on the development machine can be found in R_package_versions.txt.
The main R-Packages are:
Package | Version |
---|---|
arrow | 16.1.0 |
diffdf | 1.0.4 |
openxlsx | 4.2.5.2 |
randomForest | 4.7-1.1 |
ranger | 0.16.0 |
sqldf | 0.4-11 |
writexl | 1.5.0 |
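
To compare a local R environment against the versions in the table above, a check along the following lines may help. It is a minimal sketch using base R only; the `expected` vector is copied from the table, while the helper `installed_version` and the `report` data frame are hypothetical names introduced for illustration.

```r
# Versions copied from the table above.
expected <- c(
  arrow = "16.1.0", diffdf = "1.0.4", openxlsx = "4.2.5.2",
  randomForest = "4.7-1.1", ranger = "0.16.0", sqldf = "0.4-11",
  writexl = "1.5.0"
)

# Hypothetical helper: installed version string, or NA if the package is missing.
installed_version <- function(pkg) {
  as.character(suppressWarnings(utils::packageDescription(pkg, fields = "Version")))
}

report <- data.frame(
  package = names(expected),
  expected = unname(expected),
  installed = vapply(names(expected), installed_version, character(1)),
  stringsAsFactors = FALSE,
  row.names = NULL
)

# Compare as version objects so "4.7-1.1" and "4.7.1.1" are treated as equal.
report$matches <- mapply(
  function(exp, inst) !is.na(inst) && package_version(exp) == package_version(inst),
  report$expected, report$installed,
  USE.NAMES = FALSE
)

print(report)
```

A mismatch does not necessarily mean the code will fail; it only indicates that the environment differs from the one used during development.
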
The styler package is used to format the R files created in this project.
There is a pre-commit git hook which you can use to automate this process.
Make sure that styler is installed in your R environment, then install the hook using the ./install-hooks.sh bash script.