update data set

cthoyt · Jan 16, 2023 · 032b083 · 032b083
1 parent 87aac15
commit 032b083
Show file tree

Hide file tree

Showing 3 changed files with 372 additions and 0 deletions.
diff --git a/data/chembl30_50K.mfp0.pairs.txt.gz b/data/chembl30_50K.mfp0.pairs.txt.gz
diff --git a/data/chembl30_50K.mfp1.pairs.txt.gz b/data/chembl30_50K.mfp1.pairs.txt.gz
diff --git a/notebooks/Building a Similarity Comparison Set Revisited-2023.ipynb b/notebooks/Building a Similarity Comparison Set Revisited-2023.ipynb
@@ -0,0 +1,372 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.\n",
+    "\n",
+    "Update from http://rdkit.blogspot.com/2016/04/revisiting-similarity-comparison-set.html\n",
+    "The earlier version of this notebook (http://rdkit.blogspot.ch/2013/10/building-similarity-comparison-set-goal.html or https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Building%20A%20Similarity%20Comparison%20Set.ipynb)included a number of molecules that have counterions (from salts). Because this isn't really what we're interested in (and because the single-atom fragments that make up many salts triggered a bug in the RDKit's Morgan fingerprint implementation), I repeat the analysis here and restrict it to single-fragment molecules (those that do not include a `.` in the SMILES).\n",
+    "\n",
+    "The other big difference from the previous post is that an updated version of ChEMBL is used; this time it's ChEMBL21.\n",
+    "\n",
+    "I want to start with molecules that have some connection to each other, so I will pick pairs that have a baseline similarity: a Tanimoto similarity using count based Morgan0 fingerprints of at least 0.7. I also create a second set of somewhat more closely related molecules where the baseline similarity is 0.6 with a Morgan1 fingerprint. Both thresholds were selected empirically.\n",
+    "\n",
+    "**Note:** this notebook and the data it uses/generates can be found in the github repo: https://github.com/greglandrum/rdkit_blog"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Creating the tables in PostgreSQL\n",
+    "\n",
+    "I'm going to use ChEMBL as my data source, so I'll start by adding a table with count-based morgan fingerprints. Here's the SQL for that, assuming that you've installed the RDKit extension and setup an RDKit schema as described in the [docs](https://www.rdkit.org/docs/Cartridge.html#loading-chembl)\n",
+    "```\n",
+    "select molregno,morgan_fp(m,0) mfp0,morgan_fp(m,1) mfp1,morgan_fp(m,2) mfp2 into rdk.countfps from rdk.mols;\n",
+    "create index cfps_mfp0 on rdk.countfps using gist(mfp0);\n",
+    "create index cfps_mfp1 on rdk.countfps using gist(mfp1);\n",
+    "create index cfps_mfp2 on rdk.countfps using gist(mfp2);\n",
+    "```\n",
+    "\n",
+    "Fingerprints that only contains molecules with <= 50 heavy atoms and a single fragment (we recognize this because there is no '.' in the SMILES):\n",
+    "\n",
+    "```\n",
+    "select molregno,mfp0,mfp1 into table rdk.tfps_smaller from rdk.countfps join compound_properties using (molregno) join compound_structures using (molregno) where heavy_atoms<=50 and canonical_smiles not like '%.%';\n",
+    "create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);\n",
+    "create index sfps_mfp1_idx on rdk.tfps_smaller using gist(mfp1);\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And now I'll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn't that strong.\n",
+    "\n",
+    "Start by getting a set of 60K random small molecules:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-16T08:20:51.492914Z",
+     "start_time": "2023-01-16T08:20:51.394115Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2022.09.1\n",
+      "Mon Jan 16 09:20:51 2023\n"
+     ]
+    }
+   ],
+   "source": [
+    "from rdkit import Chem\n",
+    "from rdkit import rdBase\n",
+    "print(rdBase.rdkitVersion)\n",
+    "import time\n",
+    "print(time.asctime())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-16T08:25:43.609849Z",
+     "start_time": "2023-01-16T08:25:35.695948Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import psycopg2\n",
+    "cn = psycopg2.connect(host='localhost',dbname='chembl_30')\n",
+    "curs = cn.cursor()\n",
+    "curs.execute(\"select chembl_id,m from rdk.mols join rdk.tfps_smaller using (molregno)\"\n",
+    "             \" join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')\"\n",
+    "             \" order by random() limit 60000\")\n",
+    "qs = curs.fetchall()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And now find one neighbor for 50K of those from the mfp0 table of smallish molecules:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-16T12:48:36.981493Z",
+     "start_time": "2023-01-16T12:43:05.733091Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Done: 0\n",
+      "Done: 1000\n",
+      "Done: 2000\n",
+      "Done: 4000\n",
+      "Done: 5000\n",
+      "Done: 6000\n",
+      "Done: 7000\n",
+      "Done: 8000\n",
+      "Done: 9000\n",
+      "Done: 10000\n",
+      "Done: 11000\n",
+      "Done: 12000\n",
+      "Done: 13000\n",
+      "Done: 14000\n",
+      "Done: 15000\n",
+      "Done: 16000\n",
+      "Done: 17000\n",
+      "Done: 18000\n",
+      "Done: 19000\n",
+      "Done: 20000\n",
+      "Done: 21000\n",
+      "Done: 22000\n",
+      "Done: 23000\n",
+      "Done: 24000\n",
+      "Done: 25000\n",
+      "Done: 26000\n",
+      "Done: 27000\n",
+      "Done: 28000\n",
+      "Done: 29000\n",
+      "Done: 30000\n",
+      "Done: 31000\n",
+      "Done: 32000\n",
+      "Done: 33000\n",
+      "Done: 34000\n",
+      "Done: 35000\n",
+      "Done: 36000\n",
+      "Done: 37000\n",
+      "Done: 38000\n",
+      "Done: 39000\n",
+      "Done: 40000\n",
+      "Done: 41000\n",
+      "Done: 42000\n",
+      "Done: 43000\n",
+      "Done: 44000\n",
+      "Done: 45000\n",
+      "Done: 46000\n",
+      "Done: 47000\n",
+      "Done: 48000\n",
+      "Done: 49000\n",
+      "Done: 50000\n"
+     ]
+    }
+   ],
+   "source": [
+    "cn.rollback()\n",
+    "curs.execute('set rdkit.tanimoto_threshold=0.65')\n",
+    "\n",
+    "keep=[]\n",
+    "for i,row in enumerate(qs):\n",
+    "    curs.execute(\"select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller \"\n",
+    "                 \"join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') \"\n",
+    "                 \"where mfp0%%morgan_fp(%s,0) \"\n",
+    "                 \"and chembl_id!=%s limit 1) t2 using (molregno) \"\n",
+    "                 \"limit 1\",(row[1],row[0]))\n",
+    "    d = curs.fetchone()\n",
+    "    if not d: \n",
+    "        continue\n",
+    "    keep.append((row[0],row[1],d[0],d[1]))\n",
+    "    if len(keep)>=50000: \n",
+    "        break\n",
+    "    if not i%1000: print('Done: %d'%i)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, write those out to a file so that we can use them elsewhere:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-16T12:48:37.896921Z",
+     "start_time": "2023-01-16T12:48:36.983261Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import gzip\n",
+    "outf = gzip.open('../data/chembl30_50K.mfp0.pairs.txt.gz','wb+')\n",
+    "for cid1,smi1,cid2,smi2 in keep: \n",
+    "    outf.write(f'{cid1}\\t{smi1}\\t{cid2}\\t{smi2}\\n'.encode('UTF-8'))\n",
+    "outf=None\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Try molecules that are a bit more similar.\n",
+    "Use a similarity threshold for the pairs using MFP1 bits."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-16T13:17:03.816967Z",
+     "start_time": "2023-01-16T12:48:37.898370Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Done: 0\n",
+      "Done: 1000\n",
+      "Done: 2000\n",
+      "Done: 4000\n",
+      "Done: 5000\n",
+      "Done: 6000\n",
+      "Done: 7000\n",
+      "Done: 8000\n",
+      "Done: 9000\n",
+      "Done: 10000\n",
+      "Done: 11000\n",
+      "Done: 12000\n",
+      "Done: 13000\n",
+      "Done: 14000\n",
+      "Done: 15000\n",
+      "Done: 16000\n",
+      "Done: 17000\n",
+      "Done: 18000\n",
+      "Done: 19000\n",
+      "Done: 20000\n",
+      "Done: 21000\n",
+      "Done: 22000\n",
+      "Done: 23000\n",
+      "Done: 24000\n",
+      "Done: 25000\n",
+      "Done: 26000\n",
+      "Done: 27000\n",
+      "Done: 28000\n",
+      "Done: 29000\n",
+      "Done: 30000\n",
+      "Done: 31000\n",
+      "Done: 32000\n",
+      "Done: 33000\n",
+      "Done: 34000\n",
+      "Done: 35000\n",
+      "Done: 36000\n",
+      "Done: 37000\n",
+      "Done: 38000\n",
+      "Done: 39000\n",
+      "Done: 40000\n",
+      "Done: 41000\n",
+      "Done: 42000\n",
+      "Done: 43000\n",
+      "Done: 44000\n",
+      "Done: 45000\n",
+      "Done: 46000\n",
+      "Done: 47000\n",
+      "Done: 48000\n",
+      "Done: 49000\n",
+      "Done: 50000\n"
+     ]
+    }
+   ],
+   "source": [
+    "cn.rollback()\n",
+    "curs.execute('set rdkit.tanimoto_threshold=0.55')\n",
+    "\n",
+    "keep=[]\n",
+    "for i,row in enumerate(qs):\n",
+    "    curs.execute(\"select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller \"\n",
+    "                 \"join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') \"\n",
+    "                 \"where mfp1%%morgan_fp(%s,1) \"\n",
+    "                 \"and chembl_id!=%s limit 1) t2 using (molregno) \"\n",
+    "                 \"limit 1\",(row[1],row[0]))\n",
+    "    d = curs.fetchone()\n",
+    "    if not d: \n",
+    "        continue\n",
+    "    keep.append((row[0],row[1],d[0],d[1]))\n",
+    "    if len(keep)>=50000: \n",
+    "        break\n",
+    "    if not i%1000: print('Done: %d'%i)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-16T13:23:34.529592Z",
+     "start_time": "2023-01-16T13:23:33.517374Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import gzip\n",
+    "outf = gzip.open('../data/chembl30_50K.mfp1.pairs.txt.gz','wb+')\n",
+    "for cid1,smi1,cid2,smi2 in keep: \n",
+    "    outf.write(f'{cid1}\\t{smi1}\\t{cid2}\\t{smi2}\\n'.encode('UTF-8'))\n",
+    "outf=None\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}