forked from greglandrum/rdkit_blog
-
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Greg Landrum
committed
Jan 16, 2023
1 parent
87aac15
commit 032b083
Showing
3 changed files
with
372 additions
and
0 deletions.
There are no files selected for viewing
Binary file not shown.
Binary file not shown.
372 changes: 372 additions & 0 deletions
372
notebooks/Building a Similarity Comparison Set Revisited-2023.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,372 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.\n", | ||
"\n", | ||
"Update from http://rdkit.blogspot.com/2016/04/revisiting-similarity-comparison-set.html\n", | ||
"The earlier version of this notebook (http://rdkit.blogspot.ch/2013/10/building-similarity-comparison-set-goal.html or https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Building%20A%20Similarity%20Comparison%20Set.ipynb)included a number of molecules that have counterions (from salts). Because this isn't really what we're interested in (and because the single-atom fragments that make up many salts triggered a bug in the RDKit's Morgan fingerprint implementation), I repeat the analysis here and restrict it to single-fragment molecules (those that do not include a `.` in the SMILES).\n", | ||
"\n", | ||
"The other big difference from the previous post is that an updated version of ChEMBL is used; this time it's ChEMBL21.\n", | ||
"\n", | ||
"I want to start with molecules that have some connection to each other, so I will pick pairs that have a baseline similarity: a Tanimoto similarity using count based Morgan0 fingerprints of at least 0.7. I also create a second set of somewhat more closely related molecules where the baseline similarity is 0.6 with a Morgan1 fingerprint. Both thresholds were selected empirically.\n", | ||
"\n", | ||
"**Note:** this notebook and the data it uses/generates can be found in the github repo: https://github.com/greglandrum/rdkit_blog" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Creating the tables in PostgreSQL\n", | ||
"\n", | ||
"I'm going to use ChEMBL as my data source, so I'll start by adding a table with count-based morgan fingerprints. Here's the SQL for that, assuming that you've installed the RDKit extension and setup an RDKit schema as described in the [docs](https://www.rdkit.org/docs/Cartridge.html#loading-chembl)\n", | ||
"```\n", | ||
"select molregno,morgan_fp(m,0) mfp0,morgan_fp(m,1) mfp1,morgan_fp(m,2) mfp2 into rdk.countfps from rdk.mols;\n", | ||
"create index cfps_mfp0 on rdk.countfps using gist(mfp0);\n", | ||
"create index cfps_mfp1 on rdk.countfps using gist(mfp1);\n", | ||
"create index cfps_mfp2 on rdk.countfps using gist(mfp2);\n", | ||
"```\n", | ||
"\n", | ||
"Fingerprints that only contains molecules with <= 50 heavy atoms and a single fragment (we recognize this because there is no '.' in the SMILES):\n", | ||
"\n", | ||
"```\n", | ||
"select molregno,mfp0,mfp1 into table rdk.tfps_smaller from rdk.countfps join compound_properties using (molregno) join compound_structures using (molregno) where heavy_atoms<=50 and canonical_smiles not like '%.%';\n", | ||
"create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);\n", | ||
"create index sfps_mfp1_idx on rdk.tfps_smaller using gist(mfp1);\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"And now I'll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn't that strong.\n", | ||
"\n", | ||
"Start by getting a set of 60K random small molecules:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2023-01-16T08:20:51.492914Z", | ||
"start_time": "2023-01-16T08:20:51.394115Z" | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"2022.09.1\n", | ||
"Mon Jan 16 09:20:51 2023\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from rdkit import Chem\n", | ||
"from rdkit import rdBase\n", | ||
"print(rdBase.rdkitVersion)\n", | ||
"import time\n", | ||
"print(time.asctime())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2023-01-16T08:25:43.609849Z", | ||
"start_time": "2023-01-16T08:25:35.695948Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import psycopg2\n", | ||
"cn = psycopg2.connect(host='localhost',dbname='chembl_30')\n", | ||
"curs = cn.cursor()\n", | ||
"curs.execute(\"select chembl_id,m from rdk.mols join rdk.tfps_smaller using (molregno)\"\n", | ||
" \" join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')\"\n", | ||
" \" order by random() limit 60000\")\n", | ||
"qs = curs.fetchall()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"And now find one neighbor for 50K of those from the mfp0 table of smallish molecules:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2023-01-16T12:48:36.981493Z", | ||
"start_time": "2023-01-16T12:43:05.733091Z" | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Done: 0\n", | ||
"Done: 1000\n", | ||
"Done: 2000\n", | ||
"Done: 4000\n", | ||
"Done: 5000\n", | ||
"Done: 6000\n", | ||
"Done: 7000\n", | ||
"Done: 8000\n", | ||
"Done: 9000\n", | ||
"Done: 10000\n", | ||
"Done: 11000\n", | ||
"Done: 12000\n", | ||
"Done: 13000\n", | ||
"Done: 14000\n", | ||
"Done: 15000\n", | ||
"Done: 16000\n", | ||
"Done: 17000\n", | ||
"Done: 18000\n", | ||
"Done: 19000\n", | ||
"Done: 20000\n", | ||
"Done: 21000\n", | ||
"Done: 22000\n", | ||
"Done: 23000\n", | ||
"Done: 24000\n", | ||
"Done: 25000\n", | ||
"Done: 26000\n", | ||
"Done: 27000\n", | ||
"Done: 28000\n", | ||
"Done: 29000\n", | ||
"Done: 30000\n", | ||
"Done: 31000\n", | ||
"Done: 32000\n", | ||
"Done: 33000\n", | ||
"Done: 34000\n", | ||
"Done: 35000\n", | ||
"Done: 36000\n", | ||
"Done: 37000\n", | ||
"Done: 38000\n", | ||
"Done: 39000\n", | ||
"Done: 40000\n", | ||
"Done: 41000\n", | ||
"Done: 42000\n", | ||
"Done: 43000\n", | ||
"Done: 44000\n", | ||
"Done: 45000\n", | ||
"Done: 46000\n", | ||
"Done: 47000\n", | ||
"Done: 48000\n", | ||
"Done: 49000\n", | ||
"Done: 50000\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"cn.rollback()\n", | ||
"curs.execute('set rdkit.tanimoto_threshold=0.65')\n", | ||
"\n", | ||
"keep=[]\n", | ||
"for i,row in enumerate(qs):\n", | ||
" curs.execute(\"select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller \"\n", | ||
" \"join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') \"\n", | ||
" \"where mfp0%%morgan_fp(%s,0) \"\n", | ||
" \"and chembl_id!=%s limit 1) t2 using (molregno) \"\n", | ||
" \"limit 1\",(row[1],row[0]))\n", | ||
" d = curs.fetchone()\n", | ||
" if not d: \n", | ||
" continue\n", | ||
" keep.append((row[0],row[1],d[0],d[1]))\n", | ||
" if len(keep)>=50000: \n", | ||
" break\n", | ||
" if not i%1000: print('Done: %d'%i)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, write those out to a file so that we can use them elsewhere:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2023-01-16T12:48:37.896921Z", | ||
"start_time": "2023-01-16T12:48:36.983261Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import gzip\n", | ||
"outf = gzip.open('../data/chembl30_50K.mfp0.pairs.txt.gz','wb+')\n", | ||
"for cid1,smi1,cid2,smi2 in keep: \n", | ||
" outf.write(f'{cid1}\\t{smi1}\\t{cid2}\\t{smi2}\\n'.encode('UTF-8'))\n", | ||
"outf=None\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Try molecules that are a bit more similar.\n", | ||
"Use a similarity threshold for the pairs using MFP1 bits." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 17, | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2023-01-16T13:17:03.816967Z", | ||
"start_time": "2023-01-16T12:48:37.898370Z" | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Done: 0\n", | ||
"Done: 1000\n", | ||
"Done: 2000\n", | ||
"Done: 4000\n", | ||
"Done: 5000\n", | ||
"Done: 6000\n", | ||
"Done: 7000\n", | ||
"Done: 8000\n", | ||
"Done: 9000\n", | ||
"Done: 10000\n", | ||
"Done: 11000\n", | ||
"Done: 12000\n", | ||
"Done: 13000\n", | ||
"Done: 14000\n", | ||
"Done: 15000\n", | ||
"Done: 16000\n", | ||
"Done: 17000\n", | ||
"Done: 18000\n", | ||
"Done: 19000\n", | ||
"Done: 20000\n", | ||
"Done: 21000\n", | ||
"Done: 22000\n", | ||
"Done: 23000\n", | ||
"Done: 24000\n", | ||
"Done: 25000\n", | ||
"Done: 26000\n", | ||
"Done: 27000\n", | ||
"Done: 28000\n", | ||
"Done: 29000\n", | ||
"Done: 30000\n", | ||
"Done: 31000\n", | ||
"Done: 32000\n", | ||
"Done: 33000\n", | ||
"Done: 34000\n", | ||
"Done: 35000\n", | ||
"Done: 36000\n", | ||
"Done: 37000\n", | ||
"Done: 38000\n", | ||
"Done: 39000\n", | ||
"Done: 40000\n", | ||
"Done: 41000\n", | ||
"Done: 42000\n", | ||
"Done: 43000\n", | ||
"Done: 44000\n", | ||
"Done: 45000\n", | ||
"Done: 46000\n", | ||
"Done: 47000\n", | ||
"Done: 48000\n", | ||
"Done: 49000\n", | ||
"Done: 50000\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"cn.rollback()\n", | ||
"curs.execute('set rdkit.tanimoto_threshold=0.55')\n", | ||
"\n", | ||
"keep=[]\n", | ||
"for i,row in enumerate(qs):\n", | ||
" curs.execute(\"select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller \"\n", | ||
" \"join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') \"\n", | ||
" \"where mfp1%%morgan_fp(%s,1) \"\n", | ||
" \"and chembl_id!=%s limit 1) t2 using (molregno) \"\n", | ||
" \"limit 1\",(row[1],row[0]))\n", | ||
" d = curs.fetchone()\n", | ||
" if not d: \n", | ||
" continue\n", | ||
" keep.append((row[0],row[1],d[0],d[1]))\n", | ||
" if len(keep)>=50000: \n", | ||
" break\n", | ||
" if not i%1000: print('Done: %d'%i)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 19, | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2023-01-16T13:23:34.529592Z", | ||
"start_time": "2023-01-16T13:23:33.517374Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import gzip\n", | ||
"outf = gzip.open('../data/chembl30_50K.mfp1.pairs.txt.gz','wb+')\n", | ||
"for cid1,smi1,cid2,smi2 in keep: \n", | ||
" outf.write(f'{cid1}\\t{smi1}\\t{cid2}\\t{smi2}\\n'.encode('UTF-8'))\n", | ||
"outf=None\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.6" | ||
}, | ||
"toc": { | ||
"base_numbering": 1, | ||
"nav_menu": {}, | ||
"number_sections": true, | ||
"sideBar": true, | ||
"skip_h1_title": false, | ||
"title_cell": "Table of Contents", | ||
"title_sidebar": "Contents", | ||
"toc_cell": false, | ||
"toc_position": {}, | ||
"toc_section_display": true, | ||
"toc_window_display": false | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 1 | ||
} |