Skip to content

Commit

Permalink
update data set
Browse files Browse the repository at this point in the history
  • Loading branch information
Greg Landrum committed Jan 16, 2023
1 parent 87aac15 commit 032b083
Show file tree
Hide file tree
Showing 3 changed files with 372 additions and 0 deletions.
Binary file added data/chembl30_50K.mfp0.pairs.txt.gz
Binary file not shown.
Binary file added data/chembl30_50K.mfp1.pairs.txt.gz
Binary file not shown.
372 changes: 372 additions & 0 deletions notebooks/Building a Similarity Comparison Set Revisited-2023.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,372 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.\n",
"\n",
"Update from http://rdkit.blogspot.com/2016/04/revisiting-similarity-comparison-set.html\n",
"The earlier version of this notebook (http://rdkit.blogspot.ch/2013/10/building-similarity-comparison-set-goal.html or https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Building%20A%20Similarity%20Comparison%20Set.ipynb)included a number of molecules that have counterions (from salts). Because this isn't really what we're interested in (and because the single-atom fragments that make up many salts triggered a bug in the RDKit's Morgan fingerprint implementation), I repeat the analysis here and restrict it to single-fragment molecules (those that do not include a `.` in the SMILES).\n",
"\n",
"The other big difference from the previous post is that an updated version of ChEMBL is used; this time it's ChEMBL21.\n",
"\n",
"I want to start with molecules that have some connection to each other, so I will pick pairs that have a baseline similarity: a Tanimoto similarity using count based Morgan0 fingerprints of at least 0.7. I also create a second set of somewhat more closely related molecules where the baseline similarity is 0.6 with a Morgan1 fingerprint. Both thresholds were selected empirically.\n",
"\n",
"**Note:** this notebook and the data it uses/generates can be found in the github repo: https://github.com/greglandrum/rdkit_blog"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating the tables in PostgreSQL\n",
"\n",
"I'm going to use ChEMBL as my data source, so I'll start by adding a table with count-based morgan fingerprints. Here's the SQL for that, assuming that you've installed the RDKit extension and setup an RDKit schema as described in the [docs](https://www.rdkit.org/docs/Cartridge.html#loading-chembl)\n",
"```\n",
"select molregno,morgan_fp(m,0) mfp0,morgan_fp(m,1) mfp1,morgan_fp(m,2) mfp2 into rdk.countfps from rdk.mols;\n",
"create index cfps_mfp0 on rdk.countfps using gist(mfp0);\n",
"create index cfps_mfp1 on rdk.countfps using gist(mfp1);\n",
"create index cfps_mfp2 on rdk.countfps using gist(mfp2);\n",
"```\n",
"\n",
"Fingerprints that only contains molecules with <= 50 heavy atoms and a single fragment (we recognize this because there is no '.' in the SMILES):\n",
"\n",
"```\n",
"select molregno,mfp0,mfp1 into table rdk.tfps_smaller from rdk.countfps join compound_properties using (molregno) join compound_structures using (molregno) where heavy_atoms<=50 and canonical_smiles not like '%.%';\n",
"create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);\n",
"create index sfps_mfp1_idx on rdk.tfps_smaller using gist(mfp1);\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now I'll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn't that strong.\n",
"\n",
"Start by getting a set of 60K random small molecules:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-16T08:20:51.492914Z",
"start_time": "2023-01-16T08:20:51.394115Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022.09.1\n",
"Mon Jan 16 09:20:51 2023\n"
]
}
],
"source": [
"from rdkit import Chem\n",
"from rdkit import rdBase\n",
"print(rdBase.rdkitVersion)\n",
"import time\n",
"print(time.asctime())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-16T08:25:43.609849Z",
"start_time": "2023-01-16T08:25:35.695948Z"
}
},
"outputs": [],
"source": [
"import psycopg2\n",
"cn = psycopg2.connect(host='localhost',dbname='chembl_30')\n",
"curs = cn.cursor()\n",
"curs.execute(\"select chembl_id,m from rdk.mols join rdk.tfps_smaller using (molregno)\"\n",
" \" join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')\"\n",
" \" order by random() limit 60000\")\n",
"qs = curs.fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now find one neighbor for 50K of those from the mfp0 table of smallish molecules:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-16T12:48:36.981493Z",
"start_time": "2023-01-16T12:43:05.733091Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done: 0\n",
"Done: 1000\n",
"Done: 2000\n",
"Done: 4000\n",
"Done: 5000\n",
"Done: 6000\n",
"Done: 7000\n",
"Done: 8000\n",
"Done: 9000\n",
"Done: 10000\n",
"Done: 11000\n",
"Done: 12000\n",
"Done: 13000\n",
"Done: 14000\n",
"Done: 15000\n",
"Done: 16000\n",
"Done: 17000\n",
"Done: 18000\n",
"Done: 19000\n",
"Done: 20000\n",
"Done: 21000\n",
"Done: 22000\n",
"Done: 23000\n",
"Done: 24000\n",
"Done: 25000\n",
"Done: 26000\n",
"Done: 27000\n",
"Done: 28000\n",
"Done: 29000\n",
"Done: 30000\n",
"Done: 31000\n",
"Done: 32000\n",
"Done: 33000\n",
"Done: 34000\n",
"Done: 35000\n",
"Done: 36000\n",
"Done: 37000\n",
"Done: 38000\n",
"Done: 39000\n",
"Done: 40000\n",
"Done: 41000\n",
"Done: 42000\n",
"Done: 43000\n",
"Done: 44000\n",
"Done: 45000\n",
"Done: 46000\n",
"Done: 47000\n",
"Done: 48000\n",
"Done: 49000\n",
"Done: 50000\n"
]
}
],
"source": [
"cn.rollback()\n",
"curs.execute('set rdkit.tanimoto_threshold=0.65')\n",
"\n",
"keep=[]\n",
"for i,row in enumerate(qs):\n",
" curs.execute(\"select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller \"\n",
" \"join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') \"\n",
" \"where mfp0%%morgan_fp(%s,0) \"\n",
" \"and chembl_id!=%s limit 1) t2 using (molregno) \"\n",
" \"limit 1\",(row[1],row[0]))\n",
" d = curs.fetchone()\n",
" if not d: \n",
" continue\n",
" keep.append((row[0],row[1],d[0],d[1]))\n",
" if len(keep)>=50000: \n",
" break\n",
" if not i%1000: print('Done: %d'%i)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, write those out to a file so that we can use them elsewhere:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-16T12:48:37.896921Z",
"start_time": "2023-01-16T12:48:36.983261Z"
}
},
"outputs": [],
"source": [
"import gzip\n",
"outf = gzip.open('../data/chembl30_50K.mfp0.pairs.txt.gz','wb+')\n",
"for cid1,smi1,cid2,smi2 in keep: \n",
" outf.write(f'{cid1}\\t{smi1}\\t{cid2}\\t{smi2}\\n'.encode('UTF-8'))\n",
"outf=None\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Try molecules that are a bit more similar.\n",
"Use a similarity threshold for the pairs using MFP1 bits."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-16T13:17:03.816967Z",
"start_time": "2023-01-16T12:48:37.898370Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done: 0\n",
"Done: 1000\n",
"Done: 2000\n",
"Done: 4000\n",
"Done: 5000\n",
"Done: 6000\n",
"Done: 7000\n",
"Done: 8000\n",
"Done: 9000\n",
"Done: 10000\n",
"Done: 11000\n",
"Done: 12000\n",
"Done: 13000\n",
"Done: 14000\n",
"Done: 15000\n",
"Done: 16000\n",
"Done: 17000\n",
"Done: 18000\n",
"Done: 19000\n",
"Done: 20000\n",
"Done: 21000\n",
"Done: 22000\n",
"Done: 23000\n",
"Done: 24000\n",
"Done: 25000\n",
"Done: 26000\n",
"Done: 27000\n",
"Done: 28000\n",
"Done: 29000\n",
"Done: 30000\n",
"Done: 31000\n",
"Done: 32000\n",
"Done: 33000\n",
"Done: 34000\n",
"Done: 35000\n",
"Done: 36000\n",
"Done: 37000\n",
"Done: 38000\n",
"Done: 39000\n",
"Done: 40000\n",
"Done: 41000\n",
"Done: 42000\n",
"Done: 43000\n",
"Done: 44000\n",
"Done: 45000\n",
"Done: 46000\n",
"Done: 47000\n",
"Done: 48000\n",
"Done: 49000\n",
"Done: 50000\n"
]
}
],
"source": [
"cn.rollback()\n",
"curs.execute('set rdkit.tanimoto_threshold=0.55')\n",
"\n",
"keep=[]\n",
"for i,row in enumerate(qs):\n",
" curs.execute(\"select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller \"\n",
" \"join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') \"\n",
" \"where mfp1%%morgan_fp(%s,1) \"\n",
" \"and chembl_id!=%s limit 1) t2 using (molregno) \"\n",
" \"limit 1\",(row[1],row[0]))\n",
" d = curs.fetchone()\n",
" if not d: \n",
" continue\n",
" keep.append((row[0],row[1],d[0],d[1]))\n",
" if len(keep)>=50000: \n",
" break\n",
" if not i%1000: print('Done: %d'%i)\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-16T13:23:34.529592Z",
"start_time": "2023-01-16T13:23:33.517374Z"
}
},
"outputs": [],
"source": [
"import gzip\n",
"outf = gzip.open('../data/chembl30_50K.mfp1.pairs.txt.gz','wb+')\n",
"for cid1,smi1,cid2,smi2 in keep: \n",
" outf.write(f'{cid1}\\t{smi1}\\t{cid2}\\t{smi2}\\n'.encode('UTF-8'))\n",
"outf=None\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

0 comments on commit 032b083

Please sign in to comment.