This script provides a simple means of applying the functional group filters from the ChEMBL database, as well as a number of property filters from the RDKit, to a set of compounds. As of ChEMBL 23, the database table structural_alerts contains 8 sets of alerts. ChEMBL doesn't appear to have much documentation on the different alert sets.
Rule Set | Number of Alerts |
---|---|
BMS | 180 |
Dundee | 105 |
Glaxo | 55 |
Inpharmatica | 91 |
LINT | 57 |
MLSMR | 116 |
PAINS | 479 |
SureChEMBL | 166 |
The SMARTS patterns in a number of these alerts were not compatible with the RDKit so I edited them. A complete list of the changes I made is in the file Notes.txt.
- At least Python 3.6
- The RDKit. Installation instructions are available on the RDKit website; I'd recommend the conda route.
pip install git+https://github.com/PatWalters/rd_filters.git
git clone https://github.com/PatWalters/rd_filters
cd rd_filters
pip install .
The script needs 2 files to operate.
- alert_collection.csv - the set of structural alerts
- rules.json - the configuration file
The script uses the following logic to find alert_collection.csv and rules.json.
- Use locations specified by the "--alerts" (for alert_collection.csv) and "--rules" (for rules.json) command line arguments.
- Look in the current directory.
- Look in the directory pointed to by the FILTER_RULES_DATA environment variable.
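The lookup order above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the script's actual code; find_config_file is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Optional


def find_config_file(file_name: str, cli_path: Optional[str] = None) -> Optional[Path]:
    """Resolve a configuration file using the lookup order described above.

    Hypothetical helper for illustration; not part of rd_filters.
    """
    # 1. A path given on the command line takes precedence.
    if cli_path is not None:
        return Path(cli_path)
    # 2. Next, look in the current working directory.
    candidate = Path.cwd() / file_name
    if candidate.is_file():
        return candidate
    # 3. Finally, look in the directory named by FILTER_RULES_DATA.
    data_dir = os.environ.get("FILTER_RULES_DATA")
    if data_dir:
        candidate = Path(data_dir) / file_name
        if candidate.is_file():
            return candidate
    # Nothing found in any of the three places.
    return None
```

For example, find_config_file("rules.json") returns the copy in the current directory if one exists, and otherwise falls back to the copy under FILTER_RULES_DATA.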
I'll provide some examples below to illustrate.
That's it, at this point you should be good to go.
The file alert_collection.csv contains alerts. You shouldn't have to mess with this unless you want to add your own structural alerts. I think the format is pretty obvious.
The file rules.json controls which filters and alerts are used. You can use the command below to generate a rules.json with the default settings.
rd_filters template --out rules.json
The rules.json file looks like this. The two values for each property are the minimum and maximum allowed (inclusive). To choose which structural alert sets are used, set the corresponding Rule_ entries to true or false. You can enable multiple alert sets. Just edit the file with your favorite text editor.
{
"HBA": [
0,
10
],
"HBD": [
0,
5
],
"LogP": [
-5,
5
],
"MW": [
0,
500
],
"Rule_BMS": false,
"Rule_Dundee": false,
"Rule_Glaxo": false,
"Rule_Inpharmatica": true,
"Rule_LINT": false,
"Rule_MLSMR": false,
"Rule_PAINS": false,
"Rule_SureChEMBL": false,
"TPSA": [
0,
200
]
}
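To make the meaning of these settings concrete, here is a minimal sketch that reads a rules file of this shape, lists the enabled alert sets, and applies the inclusive property ranges. It is not the script's own code: split_rules and out_of_range are hypothetical helper names, the JSON is an abbreviated copy of the file above, and the property values are made up rather than calculated with the RDKit.

```python
import json

# Abbreviated copy of the rules.json shown above.
RULES_JSON = """
{
    "HBA": [0, 10],
    "HBD": [0, 5],
    "LogP": [-5, 5],
    "MW": [0, 500],
    "Rule_Inpharmatica": true,
    "Rule_PAINS": false,
    "TPSA": [0, 200]
}
"""


def split_rules(rules: dict):
    """Separate the enabled alert sets from the property ranges."""
    alert_sets = [key[len("Rule_"):] for key, value in rules.items()
                  if key.startswith("Rule_") and value]
    ranges = {key: tuple(value) for key, value in rules.items()
              if not key.startswith("Rule_")}
    return alert_sets, ranges


def out_of_range(props: dict, ranges: dict) -> list:
    """Return the names of properties outside their inclusive [min, max] range."""
    return [name for name, (low, high) in ranges.items()
            if not low <= props[name] <= high]


rules = json.loads(RULES_JSON)
alert_sets, ranges = split_rules(rules)
print(alert_sets)  # ['Inpharmatica']

# Hypothetical property values for one molecule (not computed here).
props = {"HBA": 4, "HBD": 1, "LogP": 5.0, "MW": 512.3, "TPSA": 75.0}
print(out_of_range(props, ranges))  # ['MW']
```

Note that a LogP of exactly 5.0 passes because the limits are inclusive, while a molecular weight of 512.3 exceeds the maximum of 500.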
First off, you're going to want to copy alert_collection.csv and rules.json to a directory and set the FILTER_RULES_DATA environment variable to point to that directory. If you are using a bash-ish shell and the files are in /home/elvis/data that would be:
export FILTER_RULES_DATA=/home/elvis/data
If you type
rd_filters -h
you'll see this:
Usage:
rd_filters filter --in INPUT_FILE --prefix PREFIX [--rules RULES_FILE_NAME] [--alerts ALERT_FILE_NAME][--np NUM_CORES]
rd_filters template --out TEMPLATE_FILE [--rules RULES_FILE_NAME]
Options:
--in INPUT_FILE input file name
--prefix PREFIX prefix for output file names
--rules RULES_FILE_NAME name of the rules JSON file
--alerts ALERTS_FILE_NAME name of the structural alerts file
--np NUM_CORES the number of cpu cores to use (default is all)
--out TEMPLATE_FILE parameter template file name
The basic operation is pretty simple. If I want to filter a file called test.smi and I want my output files to start with "out", I could do something like this:
rd_filters filter --in test.smi --prefix out
This will create 2 files
- out.smi - contains the SMILES strings and molecule names for all of the compounds passing the filters
- out.csv - contains calculated property values and a listing of the alerts triggered by each molecule
By default, this script runs in parallel and uses all available processors. To change this value, use the --np flag.
rd_filters filter --in test.smi --prefix out --np 4
As mentioned above, alternate rules files or alerts files can be specified on the command line.
rd_filters filter --in test.smi --prefix out --rules myrules.json
rd_filters filter --in test.smi --prefix out --alerts myalerts.csv
rd_filters filter --in test.smi --prefix out --rules myrules.json --alerts myalerts.csv
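If you want to drive these commands from a Python workflow, one option is to assemble the argument list and hand it to subprocess. The sketch below uses only the flags listed in the usage message; build_filter_command and run_filter are hypothetical helper names, and it assumes the rd_filters command is on your PATH.

```python
import subprocess
from typing import List, Optional


def build_filter_command(in_file: str, prefix: str,
                         rules: Optional[str] = None,
                         alerts: Optional[str] = None,
                         num_cores: Optional[int] = None) -> List[str]:
    """Assemble an 'rd_filters filter' command line (hypothetical helper)."""
    cmd = ["rd_filters", "filter", "--in", in_file, "--prefix", prefix]
    # Optional arguments are only added when a value was supplied.
    if rules is not None:
        cmd += ["--rules", rules]
    if alerts is not None:
        cmd += ["--alerts", alerts]
    if num_cores is not None:
        cmd += ["--np", str(num_cores)]
    return cmd


def run_filter(*args, **kwargs) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_filter_command(*args, **kwargs), check=True)


# Show the command that would be run, without executing it.
print(" ".join(build_filter_command("test.smi", "out", rules="myrules.json", num_cores=4)))
```

Calling run_filter("test.smi", "out", rules="myrules.json") would then execute the equivalent of the second-to-last command shown above.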
A new default rules template file can be generated using the template option.
rd_filters template --out myrules.json
As always please let me know if you have questions, comments, etc.
Pat Walters, August 2018
Programmatic access to RDFilters:
from rd_filters import rd_filters
rules = rd_filters.RDFilters()
rules.get_alert_sets()
rule_dict = {
"Rule_BMS": False,
"Rule_Dundee": False,
"Rule_Glaxo": False,
"Rule_Inpharmatica": True,
"Rule_LINT": False,
"Rule_MLSMR": False,
"Rule_PAINS": False,
"Rule_SureChEMBL": False,
}
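# Keep only the enabled alert sets, with the "Rule_" prefix stripped.
rule_list = [x.replace("Rule_", "") for x in rule_dict if x.startswith("Rule") and rule_dict[x]]
rules.build_rule_list(rule_list)
# The input should be a dataframe with a column of Mols.
# output_mol_col_in_df_format is a placeholder for that data.
mols = output_mol_col_in_df_format
# Use the result of rules.filter to select the rows to keep.
mols = mols[rules.filter(mols)]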