Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Saturated Ring Sampling Conformation Feature #34

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

Feriolet
Copy link
Contributor

@Feriolet Feriolet commented May 11, 2024

Implement Saturated Ring Sampling Conformation as discussed in the Issues #33 .
test_ring_conformer.zip

Note that:

  1. The compound is aligned per saturated ring found in the compound and the RMSD is then averaged (so if a ligand has three saturated ring, the ligand is aligned three times based on each saturated ring. Three RMSD Matrix is then averaged).
  2. I am not sure if I should put remove_confs_rms(mol) function outside of the mol_embedding_3d(mol) function, so I just put it inside for now.
  3. I thought of leaving the numConfs value as default in the EmbedMultipleConfs, but I have to specify the number in the function. So, I put 10 as its default value. I am not sure what value we should put for this parameter.
  4. I am using the default rms=0.25. I am not sure how to do the subset of conformers with keep_nconf after rms criterion. Is it the below code?
mol = remove_confs_rms(mol, saturated_ring, rms=0.25)
mol = remove_confs_rms(mol, saturated_ring, keep_nconf)
  1. remove_confs_rms(mol) is executed after UFFOptimizeMolecule(mol) if I interpret it correctly.
  2. I changed the remove_confs_rms(mol) to return mol instead of mol_tmp because the mk_prepare_ligand(mol) complained about the implicit hydrogen, which I assume is because of the mol_tmp = Chem.RemoveHs(mol)
  3. The command line I used to test the feature is: run_dock -i test_saturated_ring.smi -o test_ring_conformer.db --config config.yml -s 2 --protonation pkasolver --sdf --program vina. Since the sugar gives too many isomers, I just limited it to two

sys.stderr.write('STDERR output:\n')
sys.stderr.write(e.stderr + '\n')
sys.stderr.flush()
output = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is already not necessary and can be deleted. If docking was unsuccessful it will not contribute to the output_list

for i in atomIds:
d = conf1.GetAtomPosition(i).Distance(conf2.GetAtomPosition(i))
ssr += d * d
ssr /= mol.GetNumAtoms()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ssr /= len(atomsIds)

Comment on lines 94 to 97
for i in atomIds:
d = conf1.GetAtomPosition(i).Distance(conf2.GetAtomPosition(i))
ssr += d * d
ssr /= mol.GetNumAtoms()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add if statement to split the code execution and cover the previous behavior

if atomsIds:
    # new code
else:
    # old code

pdbqt_string, is_ok, error_msg = PDBQTWriterLegacy.write_string(setup)
if not is_ok:
print(f"{mol.GetProp('_Name')} has error in converting to pdbqt: {error_msg}")
pdbqt_string_list.append(pdbqt_string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will be pdbqt_string if not is_ok? Is it reasonable to add this value to the list? Need to check

Copy link
Contributor Author

@Feriolet Feriolet May 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of the implicit hydrogen error, the mk_prepare_ligand return an empty string ''. So, is it better to not append the empty string and let the program continue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is not reasonable. Then it can be treated in this way below. If one conformer will fail, maybe some other will pass, so no need to interrupt preparation. In the worst case we will return an empty list.

if not is_ok:
    print(f"{mol.GetProp('_Name')} has error in converting to pdbqt: {error_msg}")
else:
    pdbqt_string_list.append(pdbqt_string)

Comment on lines 169 to 175
for ring in ssr:
is_atom_saturated_array = np.array([atom_list[atom_id].GetHybridization() == Chem.HybridizationType.SP3 for atom_id in ring])
is_ring_unsaturated = np.any(np.nonzero(is_atom_saturated_array==0))
if is_ring_unsaturated:
continue

saturated_ring_list.append(ring)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for ring in ssr:
    is_atom_saturated_array = [mol.GetAtomWithIdx(atom_id).GetHybridization() == Chem.HybridizationType.SP3 for atom_id in ring])
    if any(is_atom_saturated_array):
        saturated_ring_list.append(ring)

This seems more compact and expressive

def mol_embedding_3d(mol: Chem.Mol, seed: int=43) -> Chem.Mol:

def find_saturated_ring(mol: Chem.Mol) -> list[list[int]]:
atom_list = mol.GetAtoms()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atom_list var will be no longer required, the line can be deleted

if not isinstance(mol, Chem.Mol):
return None

saturated_ring = find_saturated_ring(mol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to rename the variable and the function and add s at the end of both names to designate that the function returns multiple objects and the variable stores multiple objects

@DrrDom
Copy link
Contributor

DrrDom commented May 12, 2024

  1. This is not correct if rings have different size. If there are two rings A (size 5) and B (size 6) then RMSD = sqrt((A**2 * 5 + b**2 * 6) / (5 + 6)), where A and B are corresponding RMSD. I'm not sure that averaging is a good strategy, because one ring can be sampled while other not and in average RMSD will be non-zero. This may be checked later in experiments.
  2. It is OK. Later we will revise functions placement. I usually try to keep function visibility and not put them inside other functions except some very specific ones.
  3. OK, later we may adjust the number if needed.
  4. In general you are right. However we may avoid calculation of RMSD twice. We may in remove_confs_rms invoke first clustering for rms, virtually remove those conformers and invoke clustering with fixed nconf, at the end we will combine two lists of conformers to remove from both calls. Is this explanation clear?
  5. Yes, you are right.
  6. OK
  7. I did not run test yet, but thanks for the example

@Feriolet
Copy link
Contributor Author

Feriolet commented May 13, 2024

For number one, say we have ring A and B, each with two ring conformations (A1, A2) and (B1, B2). If we have a molecule with four conformation, excluding similar conformation due to the linker (A1-B1, A1-B2, A2-B1, A2-B2), do we take all of the conformers because each has different conformers? Or do we take only two (either A1-B1 and A2-B2 or A1-B2 and A2-B1), because they compromise the four conformers?

I assume the issue is with the averaging the matrix and not generating the matrix based on the ring number, right?

@Feriolet
Copy link
Contributor Author

Feriolet commented May 13, 2024

Also for number four, I think I get your explanation, but can that accidentally remove the whole conformation? For example, out of 7 conformers [0...6], can rms ask to remove [0,1,2,4,6] and then keep_nconf ask to remove [2,3,4,5], removing all of the conformers?

@DrrDom
Copy link
Contributor

DrrDom commented May 13, 2024

For number 4. We will pass to keep_nconf only those conformers which passed rms filter. So after rms at least one will survive and then it cannot be removed on the next step with keep_nconf.
There will be a minor issue. If after rms on;y one conformer will remain, clustering should return an error, because it cannot work with a single instance. So, there is a need to add the check this and skip keep_nconfif there is only one conformer.

@DrrDom
Copy link
Contributor

DrrDom commented May 13, 2024

Effectively at the step keep_nconf we will give a RMS matrix, where rows and columns of conformers which did not pass rms filter are removed. Thus, we will not recalculate the matrix

@DrrDom
Copy link
Contributor

DrrDom commented May 13, 2024

The number 1 is a difficult question.

  1. The simplest way, which I considered initially, is to use A1-B1 and A2-B2 only. Because we do not generate conformers for rings individually, but always for a whole molecule. It is difficult to cross-link ring conformers sampled in different conformers of a molecule.

  2. The right way is to use all four conformers, because they all will be different and their docking may differ as well. However, I do not know how to implement this easily.

  • we can split a molecule for rings with attached atoms and other parts (linkers, substituents and aromatic rings); generate ring conformers individually for selected substructures; select different conformers for each ring individually and then assembly rings and other parts to re-create the target molecules. However, we will have some parts 3D, some 2D. We may run conformer embedding keeping the conformations of rings intact and sample only 2D parts, but we cannot place 3D parts in right position relative to each other and therefor this procedure will fail.

Probably it will be a good idea to ask rdkit mailing list.

As an alternative we may try conforgeconformer generator. It is very flexible and developed by my colleague from Vienna. Maybe it has such an option or can be easily implemented. However, this will add additional dependence to the project and it does not have simple converter between RDKit and conforge molecule types, so we have to write it by ourselves. But we will receive another benefit - conforge is much faster than RDKit.

Solution 1 is appropriate, although not perfect. so we may implement it as a beginning.

@Feriolet
Copy link
Contributor Author

Yeah, it seems that the second solution is quite troublesome. We can start with solution 1 first, then move to the second if needed.

I have tried to interpret point number four. Can you check if I have understood it correctly?


cmat_list_array = np.array(cmat_list)

return list(np.mean(cmat_list_array, axis=0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not strong in numpy to write the code directly from the top of my head. So, I'll explain. If cmat_list_array is a matrix A of size N x M, where N is a number of conformer pairs and M is the number of rings. You have to compute A^2 for every element, then multiply each column on the corresponding number of atoms in ring 1..M, sum over rows, divide on the sum of all ring atoms in a molecule and take a square root. This would be mathematically exact.
Simple averaging will work only if all rings have the same size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still confused, but do you want to calculate the mathematical RMSD based on the total number of atoms in the ring? Say we have a five- and six- membered ring. Instead of calculating:
$$\frac{\sqrt{\frac{\sum{d_{a}^{2}}}{5}} + \sqrt{\frac{\sum{d_{b}^{2}}}{6}}}{2} $$
We should calculate it this way to reflect the actual RMSD:
$$\sqrt{\frac{\sum{d_{a}^{2}} + \sum{d_{b}^{2}}}{5 + 6}}$$

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I propose to calculate true RMSD which will account atoms in all ring systems - the latter equation. However, we have RMSD for individual rings only.

$$RMSD_{a} = \sqrt{\frac{\sum{d_{a}^{2}}}{5}}$$

and

$$RMSD_{b} = \sqrt{\frac{\sum{d_{b}^{2}}}{5}}$$

Therefore, I suggest to recalculate them into the final RMSD by simple math operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright. I have edited the code to calculate the true RMSD based on the rdkit RMSD

saturated_ring_list = []
for ring in ssr:
is_atom_saturated_array = [mol.GetAtomWithIdx(atom_id).GetHybridization() == Chem.HybridizationType.SP3 for atom_id in ring]
if all(is_atom_saturated_array):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any is more appropriate then all, because this will skip any saturated ring fused with an aromatic ring. If the conformations of such a ring will be very similar they will be cut by rms filter. So, it looks safe to use less strict conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah in that case, we should make it less stricter

Comment on lines 239 to 241
if keep_nconf:
if mol.GetNumConformers() <= keep_nconf:
return mol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified

if keep_nconf and mol.GetNumConformers() > keep_nconf:
    ...

return mol

cids = [c.GetId() for c in mol.GetConformers()]
arr = arr[np.ix_(cids, cids)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a potential issue. Conformer ids are arbitrary, they do not guarantee to be sequential from 0 to N-1 as atoms in RDKit. Thus, to be safe, another indexing should be used, probably involving keep_ids.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, instead of cids, we can use keep_ids as the reference to be removed for the nkeep filters? Something like this?

            arr = arr[np.ix_(keep_ids, keep_ids)]
            keep_ids_nconf_filter = []
            #calculation 
            remove_ids = set(keep_ids) - set(keep_ids_nconf_filter)

@Feriolet
Copy link
Contributor Author

test_conformer2.zip
Also here is the output of the fixed indexing, I still include 0.35-0.5 rms because I see that some of the molecules still generate many conformers in 0.35 rms

@DrrDom
Copy link
Contributor

DrrDom commented May 27, 2024

You were right with these corrections, I really lost with indexing.

  1. You chose too low thresholds as I understood: 0.2-0.6 A. They should be in the range 1-3 A. We cannot consider so many remaining conformers, even if many of them are relevant and really distinct. We have to find a reasonable trade-off between representativeness of conformations and their number. i would expect to keep maximum 5 conformers to not overload docking too much.
  2. As I understood, after the current selection procedure (complete linkage clustering) the remaining conformers may still have RMSD below the given threshold. Thus, it will be reasonable to apply an additional filtration step to keep only conformers with RMSD greater than the threshold. This can be implemented in an iterative manner. You may create a smaller RMSD matrix with remaining conformers and while there is any value greater than the threshold you select for the entry (conformer) with the maximum number of RMSD values below the threshold and remove it from the matrix. Is the explanation clear? Could you implement this?

@Feriolet
Copy link
Contributor Author

Feriolet commented May 27, 2024

  1. Sorry, I misread your suggestion to use rms 2-3.5 as rms 0.25-0.35. That's why it still keeps a huge number of conformers.
  2. I still don't quite understand the explanation to filter the lower rms matrix. Assuming that we have these remaining four conformers and assume rms threshold is 3, how should I iteratively remove the conformation with lower rms?

The way I understand your interpretation is that from the truncated rms matrix:
cid 1 has two rms below 3
cid 4 has one rms below 3
cid 6 has one rms below 3
cid 8 has no rms below 3.

Given this information, we have to remove cid 1 for our second filter. Is that correct?

333783512-f0f8dd2d-8e3a-4d81-a080-94ca9ef6cef0

@DrrDom
Copy link
Contributor

DrrDom commented May 27, 2024

Yes, you are completely right. In this case on the first iteration you select cid 1 for removal as it has the greatest number of RMSD values below 3. After the removal the remaining matrix has no values below the threshold, therefore, the procedure is finished. If there would remain other RMSD value below 3 in the matrix, we had to continue iterations.

This iterative strategy can be implemented instead of the suggested two-step approach. However, the two-step procedure looks a little bit more reasonable and should be faster. On the first step we quickly identify a representative subset of conformers and after that we reduce redundancy to a given level.

@Feriolet
Copy link
Contributor Author

Okay, I got what you mean by the filter. I'll try to code the feature later. For the meantime, here is the rms result. I didn't try >2, because at rms=2, all of the rings only have one conformer.
conformer2.zip

@DrrDom
Copy link
Contributor

DrrDom commented May 28, 2024

I looked through the conformers and my opinion is that rms=1 looks a reasonable trade-off.

We may combine keep_nconf with rms to avoid overwhelming of docking with too many conformers, that may occur in polycyclic systems. To set keep_nconf we can use a fixed value or the value calculated on-the-fly depending on the number of saturated ring systems and their sizes. For example for 6-membered ring and below the increment may be 2, for 7-membered - 3, for 8-membered and above - 4 (we will not sample macrocycles, this is a completely another problem and there are approaches to solve it). We determine the number and size of saturated rings, sum their increments and set keep_nconf value.

To test how this combination will work we may use betulinic acid. It has 5 saturated ring systems, but the structure is very rigid and should result in a very small number of conformers.

Do you see some issues with this filtration approach?
If not, then we may finalize the code and test on some real examples.

@Feriolet
Copy link
Contributor Author

I don't particularly see much problem with the keep_nconf filtration approach. It's probably won't use too much because the "below_rms_filter" will probably reduce the nconf to the desired number, but it's good as a check.

I'm not sure how I should count the ring size, since the find_saturated_ring_with_substituent function has the neighbour atom id. I decided to make another function find_saturated_ring, although it will calculate the ring id twice overall. I can change the function find_saturated_ring_substituent which takes saturated_ring_list and mol to prevent calculating twice if you prefer it that way.

conformer_belowrmsfilter_and_keepnconf.zip

Here is the .sdf file for both the rms_matrix_filter and rms_matrix_filter & keep_nconf result.

@DrrDom
Copy link
Contributor

DrrDom commented May 29, 2024

I don't particularly see much problem with the keep_nconf filtration approach. It's probably won't use too much because the "below_rms_filter" will probably reduce the nconf to the desired number, but it's good as a check.

The results with and without keen_nconf are the same (the number of remained conformers). This is expected for the majority of compounds, However we have only 2 conformers for chondroitin (isomer 0) that is quite small for me. Conformer 1 has a saturated ring 1 with all axial substituent and a partially unsaturated ring 2 with all equatorial ones. Conformer 2 has a ring 1 with all equatorial substituents and a ring 2 with all axial ones. Why are there no conformers where substituents in both rings are axial or equatorial simultaneously? This is somewhat unexpected. Could you check this please?

I'm not sure how I should count the ring size, since the find_saturated_ring_with_substituent function has the neighbour atom id. I decided to make another function find_saturated_ring, although it will calculate the ring id twice overall. I can change the function find_saturated_ring_substituent which takes saturated_ring_list and mol to prevent calculating twice if you prefer it that way.

What you suggest is a right way. First, this will avoid code duplication. Second, it will bring a small speed up.

Please also remove hard coded saving to a file before the merge)

@Feriolet
Copy link
Contributor Author

Feriolet commented May 29, 2024

For chondroitin, the conformers with both axial or equitorial position conformer seen in the conformer2.zip is unfortunately removed during the rms_matrix_filter. This is the arr value of the conformers before being filtered out. The first and third row (corresponding to both axial and both equitorial conformer, respectively) is removed because they have three rms < 1

[[0.         0.96358251 0.78035173 0.93407865]
 [0.96358251 0.         1.15282597 0.80218658]
 [0.78035173 1.15282597 0.         0.71092557]
 [0.93407865 0.80218658 0.71092557 0.        ]]

@DrrDom
Copy link
Contributor

DrrDom commented May 29, 2024

Thank you. So maybe this filtration is not worth? However, if disable it, there will be output conformers with RMSD below the threshold, that can be considered an unexpected behavior.

Let's try to disable this filtration step and repeat tests. How conformers for other (single ring) compounds will look like, will they be as diverse as for the current version.

You may test one more implementation. We may replace the first clustering with this iterative procedure. Could you test this hypothesis as well.

@Feriolet
Copy link
Contributor Author

conformer_without_clustering_and_iterative.zip
Here is the result for both test. It seems that not using the iterative method is better than without clustering, because the iterative method still reduce the chondroitin conformer to two. While I have not seen the .sdf file, the method without the iterative method still retains the same conformer for other molecules, as well as retaining 4 chondroitin conformers

@DrrDom
Copy link
Contributor

DrrDom commented May 30, 2024

Difficult to decide which one is better. Without iterative step it is better for chrondroitin but worse for oxepane. Without clustering it is vice versa.

However, the output for oxepane looks strange. Clustering only gives 1 conformer, iterative approach alone gives 3. We use complete linkage clustering, that means that all conformers in a cluster has at most distance 1A. Since we did not remove conformers after clustering, that means that there was only one cluster for exapane where all conformers differed less than 1A from any other. But in the case when only iterative procedure was applied there were at least three conformers differed greater than 1A. Could you please check this? Maybe I'm wrong in my reasoning.

@Feriolet
Copy link
Contributor Author

Feriolet commented May 30, 2024

For thiepane and oxepane in iterative process, the three conformers exist because it skipped the iterative process, and immediately filtered through the keep_nconf=3 (seven membered-ring), hence the 3 remaining conformers. I set the iterative process to be skipped if there is no instance where arr > rms as can be seen in the all(arr[arr != 0] < rms) part because otherwise it will remove all conformers.

            #sometimes clustering result gives matrix < rms when rms is high enough
            if all(arr[arr != 0] < rms) or not any(arr[arr != 0] < rms):
                break

Here is the arr for both thiepane and oxepane:

[For Testing Only] oxepane_0 has 1 saturated ring
[For Testing Only] Before removing conformation: oxepane_0 has 100 conf
[[0.         0.61248338 0.58374022]
 [0.61248338 0.         0.51815107]
 [0.58374022 0.51815107 0.        ]]
[For Testing Only] After removing conformation: oxepane_0 has 3 conf
conformer_without_clustering/oxepane_0_after_remove_100.sdf
[For Testing Only] thiepane_unsaturated_0 has 1 saturated ring
[For Testing Only] Before removing conformation: thiepane_unsaturated_0 has 100 conf
[[0.         0.6227772  0.25156321]
 [0.6227772  0.         0.39474366]
 [0.25156321 0.39474366 0.        ]]
[For Testing Only] After removing conformation: thiepane_unsaturated_0 has 3 conf

@DrrDom
Copy link
Contributor

DrrDom commented May 30, 2024

Thank you!
So, what will be the final version? Clustering by rms followed by clustering by the number of clusters?

@Feriolet
Copy link
Contributor Author

Yepp, that should be fine. I think the three conformers generated is also very similar because of the symmetries, so the docking pose should be very similar to each other.

@DrrDom
Copy link
Contributor

DrrDom commented May 30, 2024

Agree, then please finalize the code to merge it and test.

@Feriolet
Copy link
Contributor Author

should I remove the sanity check print also before merging it?

@DrrDom
Copy link
Contributor

DrrDom commented May 30, 2024

Not necessary, I expect that I'll go through the code and will remove it later. Thank you!

@Feriolet
Copy link
Contributor Author

Feriolet commented Jun 7, 2024

my bad I missed the review for the mk_prepare_ligand function.

@DrrDom
Copy link
Contributor

DrrDom commented Jun 7, 2024

We made preliminary tests and the results were not too encouraged. Docking of some strange ring conformations can be much more favorable than the native conformation. So, to answer the question we have to perform a more systematic study.

  1. We may inspect studies on macrocycles sampling and docking to maybe borrow some ideas.
  2. We may select proteins with many co-crystallized ligands having saturated cycles and perform re-docking for them to evaluate whether our procedure improves success rate of re-docking and ring conformations.

All these will require time and I'm not sure that we have this time right now.

@Feriolet
Copy link
Contributor Author

Alright. Then, let me know if there is anything that I can do or when we can continue implementing the feature.

I have tried testing betulinic acid with two co-crystallised protein structure, and one of them does not look promising to me (5LSG). The other one (8GXP) shows the same conformation as the crystallised structure, given the correct isomer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants