
Switch t-SNE implementation to openTSNE #1561

Draft · wants to merge 2 commits into main

Conversation

pavlin-policar

@dkobak @ivirshup @Koncopd This is a first stab at #1233.

Features

  • Construct t-SNE embeddings
  • Recipes
  • Ingest functionality

As discussed, this PR currently implements t-SNE with uniform affinity kernels, making it fit in nicely with the existing sc.pp.neighbors architecture. While this isn't technically t-SNE per se, it's visually almost impossible to tell them apart.

It would also make sense to add a tsne option to sc.pp.neighbors, but from what I can tell, there's no direct way to change the existing code to do this. It looks like sc.pp.neighbors calls UMAP to calculate the nearest neighbors directly, calculating the UMAP weights. We'd probably have to do something similar to the gauss option and just overwrite the UMAP weights after the fact. Does this sound reasonable?

I like the API of calling sc.tl.tsne.recipe_X(adata). Adding the recipes would be simple here; we can just add a simple class wrapper around _tsne which would __call__ tsne, and a bunch of static methods to the wrapper for recipes. This is kind of messy and probably not something you guys do anywhere else throughout the code base, so I'd appreciate your feedback on this. Do you like this, or should we do it in a different way?

The ingest functionality should be fairly straightforward as well, just adding a tsne option to embedding_method.

All of these things should be pretty easy to do, but I'd like to see that this is moving in a direction you like first.

@dkobak

dkobak commented Jan 2, 2021

Hi Pavlin, Happy New Year!

Great to see this going forward!

We'd probably have to do something similar to the gauss option and just overwrite the UMAP weights after the fact. Does this sound reasonable?

Yes, I also thought we'd need to do it similar to the gauss option. Can one use openTSNE code for computing perplexity-based weights or would one need to copy the binary search in here? I think it may be possible to use openTSNE's function to compute the affinities and then get the weights out of there?

One question here though: how would the tsne function know whether it should use the uniform kernel or the weights constructed by the neighbors function? Could some flag be set when neighbors is run with mode='tsne' so that tsne() later uses those weights? I'm not sure what the best interface is here. Alternatively, tsne() could check whether the weights conform to what t-SNE expects (sum to 1). Maybe that's better, actually.
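That conformance check could be a tiny helper; a minimal sketch (function name hypothetical), assuming the graph is a scipy sparse matrix:

```python
import numpy as np
from scipy import sparse

def looks_like_tsne_affinities(P, atol=1e-8):
    """True if the weight matrix sums to 1, as t-SNE expects."""
    return bool(np.isclose(P.sum(), 1.0, atol=atol))

# UMAP-style connectivities generally do not sum to 1, while a
# normalized t-SNE affinity matrix does.
P_raw = sparse.random(50, 50, density=0.1, format="csr", random_state=0)
P_tsne = P_raw / P_raw.sum()
```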

Apart from that, I noticed that you implemented

 class UniformAffinities(openTSNE.affinity.Affinities):

in here, but isn't it part of openTSNE already?

@dkobak

dkobak commented Jan 2, 2021

Oh, one other small thing: I would definitely suggest adding an exaggeration=1 argument to tsne().

@pavlin-policar
Author

Hey Dmitry, happy New Year's to you too!

Can one use openTSNE code for computing perplexity-based weights or would one need to copy the binary search in here? [...] I noticed that you implemented UniformAffinities in here, but isn't it part of openTSNE already?

No, I think we should be able to call the existing machinery. But we'd need to do something like I do with the Uniform affinities here. The reason I had to write separate classes is that the ones in openTSNE calculate the KNNG internally, and don't really offer a way to pass an existing KNNG. In openTSNE that makes sense, since otherwise, the API would be pretty complicated. But here, we have to deal with that. As you can see, it's a pretty trivial wrapper anyway.

How would tsne function know if it should use the uniform kernel or the weights constructed by the neighbors function?

I noticed that sc.tl.umap and now sc.tl.tsne add their parameters to adata.uns. I would imagine sc.pp.neighbors probably does the same, and if not, that seems like an easy addition in line with the scanpy architecture. Determining which affinity kernel to use would then be as simple as looking into adata.uns to find which parameter values sc.pp.neighbors was called with.
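As a sketch of that lookup (the dispatch helper is hypothetical; a plain dict stands in for adata.uns, which scanpy fills with a "neighbors" → "params" record):

```python
# Plain dict standing in for adata.uns; sc.pp.neighbors records its
# parameters under adata.uns["neighbors"]["params"].
uns = {"neighbors": {"params": {"n_neighbors": 15, "method": "umap"}}}

def affinity_kind(uns):
    """Hypothetical dispatch: pick the kernel based on how neighbors ran."""
    params = uns.get("neighbors", {}).get("params", {})
    # Perplexity weights only if neighbors was run with the t-SNE method;
    # otherwise fall back to the uniform kernel.
    return "perplexity" if params.get("method") == "tsne" else "uniform"
```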

I would definitely suggest to add exaggeration=1 argument to tsne().

I added exaggeration=None, as is the default in openTSNE. But setting it to 1 instead of None is better, and I should change that in the next release.

@dkobak

dkobak commented Jan 3, 2021

As you can see, it's a pretty trivial wrapper anyway.

Yes, makes sense.

Determining which affinity kernel to use would then be as simple as looking into adata.uns to find which parameter value sc.pp.neighbors was called with.

Yes, I like this.

I added exaggeration=None, as is the default in openTSNE. But setting it to 1 instead of None is better, and I should change that in the next release.

Ah, right, I somehow overlooked that you did add the exaggeration parameter. That's fine then!

@dkobak

dkobak commented Jan 11, 2021

I hope everybody had a good New Year break! Just pinging @ivirshup again because I think it'd be great to move forward with this. Also wanted to ping @falexwolf as I know he did a lot of work on dim reduction etc. Cheers.

@ivirshup
Member

Happy new year! And thanks for opening this PR @pavlin-policar.


First, a general question: what is the scope of this PR? Will this just be single-dataset t-SNE calculation, with integration/ingest functionality happening separately, or would you like to do it all at once?


In terms of workflow, I think I'd like it to look similar to UMAP

  • One function for calculating the graph/ manifold
  • One function for computing the embedding

If possible, I would like it if the user could specify an arbitrary manifold (e.g. the umap weighted one) to pass to the embedding step, but this is icing.

It would also make sense to add a tsne option to sc.pp.neighbors

I would prefer for this to be a separate function, maybe neighbors_tsne? This could use the entire neighbor calculating workflow from openTSNE.

How different are the arguments to the various affinity methods? At first glance they look pretty similar. I'd like to have the option of choosing which one, but does it make sense to have all the methods available through one function?

noticed that sc.tl.umap and now sc.tl.tsne add their parameters to adata.uns. ... Determining which affinity kernel to use would then be as simple as looking into adata.uns to find which parameter value sc.pp.neighbors was called with.

+1. Do you need to know what the affinity method was if you're just calculating an embedding? Or does that only become important when you want to add new data?

Comment on lines 13 to 25
class UniformAffinities(openTSNE.affinity.Affinities):
    def __init__(self, neighbors, symmetrize=True, verbose=False):
        self.verbose = verbose

        # Ignore the weights and just assign every neighbor equal weight
        P = neighbors > 0
        # Symmetrize the probability matrix
        if symmetrize:
            P = (P + P.T) / 2
        # Convert weights to probabilities
        P /= np.sum(P)

        self.P = P

I believe this will error if symmetrize is False, since you can't do an inplace type conversion. That is, if symmetrize is False, P.dtype is bool, so you can't inplace assign floats.

Example
from scipy import sparse
s = sparse.random(100, 100, format="csr", density=0.1)
sb = s > 0
sb /= sb.sum()
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-4-23169c7c63c6> in <module>
----> 1 sb /= sb.sum()

/usr/local/lib/python3.8/site-packages/scipy/sparse/data.py in __itruediv__(self, other)
     61         if isscalarlike(other):
     62             recip = 1.0 / other
---> 63             self.data *= recip
     64             return self
     65         else:

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('bool') with casting rule 'same_kind'
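A sketch of the straightforward fix: cast to float after binarizing, so the inplace division no longer operates on a bool matrix.

```python
from scipy import sparse

# Binarize first, then cast, so /= does not hit the bool dtype
s = sparse.random(100, 100, format="csr", density=0.1, random_state=0)
sb = (s > 0).astype("float64")  # dtype is now float64, not bool
sb /= sb.sum()                  # no UFuncTypeError; weights now sum to 1
```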

@dkobak

dkobak commented Jan 12, 2021

It would also make sense to add a tsne option to sc.pp.neighbors

I would prefer for this to be a separate function, maybe neighbors_tsne? This could use the entire neighbor calculating workflow from openTSNE.

Not sure I understood this bit. Why? sc.pp.neighbors already has method='gauss' implemented, which uses Gaussian weights instead of UMAP weights. t-SNE also uses Gaussian weights, only slightly differently. Why would you prefer a separate function over having method='tsne' in the same function? My first impression is that it'd be more confusing.

@ivirshup
Member

I think it may be possible to use openTSNE's function to compute the affinities and then get the weights out of there?

I would definitely like this to be the case. I'm not sure I see

Why? sc.pp.neighbors already has method='gauss'

To me, it's largely a maintenance and documentation issue. Most bugs I fix (here and in upstream libraries) come from argument handling. The more features you lump into a function, the more complicated argument handling gets. There are questions of default values and fallbacks for different backends, and of making sure users understand which arguments are valid for each backend.

The use of the Neighbors class ends up making the neighbors function much more complicated than it needs to be. I think skipping it here can make this implementation much simpler.

From an API stand point, I would like the "blessed" tsne workflow to be dead obvious. I'm thinking:

sc.pp.neighbors_tsne(adata)
sc.tl.tsne(adata)

How many arguments is it going to take to make this work if this functionality is in sc.pp.neighbors? At a minimum, k=30, method=tsne_affinity, nn_method="annoy", right?

@dkobak

dkobak commented Jan 13, 2021

Thanks! So my understanding is that you are saying that neighbors function is ALREADY too complicated, so we should not complicate it any further (and rather the existing function could be eventually split by taking that gauss out of it, I guess?)

From an API stand point, I would like the "blessed" tsne workflow to be dead obvious. I'm thinking:
sc.pp.neighbors_tsne(adata)
sc.tl.tsne(adata)
How many arguments is it going to take to make this work if this functionality is in sc.pp.neighbors? At a minimum, k=30, method=tsne_affinity, nn_method="annoy", right?

I'd say perplexity=30, method='tsne'.

I don't see why t-SNE should use annoy if UMAP uses pynndescent; this does not matter for the algorithm. So nn_method could either be included in all neighbors functions or in none, IMHO.

In any case,

sc.pp.neighbors_tsne(adata)
sc.tl.tsne(adata)

would also be fine with me. I think it's important that the following works:

sc.pp.neighbors(adata)
sc.tl.tsne(adata)

and is actually the recommended way to run t-SNE within scanpy. (Using uniform affinities). But if somebody wants to do the "actual" t-SNE, then they can run sc.pp.neighbors_tsne(adata) first. I think this makes sense.

One question here is maybe what should other downstream functions like Leiden clustering use, if somebody runs neighbors_tsne (or both neighbors and neighbors_tsne).

@ivirshup
Member

ivirshup commented Jan 13, 2021

Thanks! So my understanding is that you are saying that neighbors function is ALREADY too complicated, so we should not complicate it any further (and rather the existing function could be eventually split by taking that gauss out of it, I guess?)

Pretty much. I prefer more small, simple functions with common APIs to fewer functions with larger APIs.

and rather the existing function could be eventually split by taking that gauss out of it, I guess?

I think I'd be pro that. I'd probably prefer exposing an interface for computing weights from KNN distances where methods like gauss could sit.

I think it's important that the following works and is actually the recommended way to run t-SNE within scanpy. (Using uniform affinities).

A couple of questions, the first scientific: Why would you prefer uniform edge weights as input to your t-SNE? I would think the information about relative distance is useful.

Second API: I'm not sure I completely agree with this. I think it would be the most clear for sc.pp.neighbors to essentially mean "build umap's connectivity graph", and functions like sc.tl.tsne or sc.tl.umap to be "find a 2d embedding using the passed connectivity graph". This means whatever affinities you're passing through (e.g. via connectivities_key) are the weights that get used.

Are there cases you think this disallows?

One question here is maybe what should other downstream functions like Leiden clustering use, if somebody runs neighbors_tsne (or both neighbors and neighbors_tsne).

The graph that's used is provided from arguments like neighbors_key or obsp from leiden.

@dkobak

dkobak commented Jan 13, 2021

Why would you prefer uniform edge weights as input to your t-sne? I would think the information about relative distance is useful.

My argument was mostly about API. But one big benefit of using kNN with k=15 is that it's much faster than using k=90 (with perplexity=30), which is what t-SNE uses by default (for historical reasons). It's not about uniform vs non-uniform; it's about k=15 vs k=90. Uniform on k=15 just happens to give almost the same results as perplexity=30 on k=90. So it's a lucky coincidence that I thought we could benefit from.
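For context, the k=90 figure follows the common convention (used by openTSNE, among others) of taking three times the perplexity as the neighborhood size; a trivial sketch:

```python
def tsne_neighborhood_size(perplexity=30):
    """Conventional t-SNE neighborhood size: k = 3 * perplexity."""
    return int(3 * perplexity)
```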

Second API: I'm not sure I completely agree with this. I think it would be the most clear for sc.pp.neighbors to essentially mean "build umap's connectivity graph", and functions like sc.tl.tsne or sc.tl.umap to be "find a 2d embedding using the passed connectivity graph". This means whatever affinities you're passing through (e.g. via connectivities_key) are the weights that get used.

I understand what you're saying, but the situation won't be symmetric, because neighbors already exists, is not called neighbors_umap, and all Scanpy users are very familiar with it. I'd like to make it very easy to use t-SNE in scanpy, so that it fits naturally into scanpy's established workflow. That's why I think simply running tsne() after running neighbors() should be possible.

Please note that t-SNE requires normalized weights (they should sum to 1). The weights constructed by UMAP in neighbors are not normalized. So if you run neighbors() and then tsne() then t-SNE should do something in order to be able to use this graph. My suggestion is that it discards the weights and uses normalized affinity matrix. I am not sure what exactly is your suggestion? Take UMAP weights and normalize them? This has never been explored in the literature.

@pavlin-policar
Author

I'm a little late to the party, but here's my $0.02.

What is the scope of this PR? Will this just be single dataset TSNE calculation, with integration/ ingest functionality happening separately, or would you like to do it all at once?

I think we can split it into two PRs, since they're going to touch different parts of the code base, and it should be easier to review them individually.

How different are the arguments to the various affinity methods?

So, if we use the KNNG provided by sc.pp.neighbors, these parameters become unnecessary. Both perplexity and k specify the number of nearest neighbors used when constructing the KNNG. Here, we assume the KNNG already exists, so there is no need for this parameter.

Do you need to know what the affinity method was if you're just calculating an embeddings? Or does that only become important when you want to add new data?

Yes, the affinity model will have to be somehow kept, since when we call transform, we need to find the nearest neighbors in the index. I haven't checked how your UMAP functionality does this, but I'm guessing it's similar.

Regarding the whole API, I have a few comments.

I very much dislike the API sc.pp.neighbors_tsne(adata). scanpy is nice because it's easy to use and the API is dead simple: I can just call sc.pp.neighbors followed by clustering, visualization, and whatever else I want, using simple function calls. If we went this route, it would mean changing sc.pp.neighbors to sc.pp.umap_neighbors, and then splitting off yet another sc.pp.gauss_neighbors. This would not only make things confusing, it would mean re-calculating the KNNG at each call, which we would inevitably have to do if we wanted different visualizations. It then also becomes quite unclear what to do when I want to do Louvain clustering. Should there be a sc.pp.louvain_neighbors as well? Which neighbors should I use there? (As an aside, I don't understand why using UMAP connectivities is the default for clustering at all. From what I can tell, the standard way of weighting the KNNG for graph-based clustering in single-cell is to use the Jaccard index of the mutual nearest neighbors to weight the edges.)

From an implementation standpoint, sc.pp.tsne_neighbors will inevitably have to call the UMAP KNNG construction, since I can see that it's not split out in the code base. sc.pp.neighbors calls the UMAP implementation directly, and since the goal is to use the same KNNG construction procedure, t-SNE will have to call the same UMAP function and override the weights afterward, much like gauss does now. It would probably make more sense to split the KNNG construction out from the UMAP weight calculation, but that seems like a lot of work. Or maybe not: in the latest UMAP release from a few days ago, they split the graph construction out into pynndescent. Either way, refactoring this doesn't belong in this PR.

Alternatively, we could construct our own KNNG in sc.pp.tsne_neighbors using Annoy, which openTSNE does by default. But that seems suboptimal, because the design philosophy seems to be re-use the same graph for everything.

What I think would make more sense is to remove the connectivity calculation from the sc.pp.neighbors altogether, and have that calculate an unweighted KNNG. Since different methods need their own connectivities anyways, that should be done in each method separately. So UMAP connectivities would be calculated in sc.tl.umap, and the Louvain Jaccard connectivities in sc.tl.louvain. Then, you wouldn't be assigning any preference to any one connectivity method. From what I can tell, there's no evidence the UMAP connectivities are better than the others in any way, especially not for clustering. If you have any information on this, I'd be really curious to know.

Ultimately, the decision is up to you guys, since this is more of a design choice of where you want to take the scanpy API. The existing solution is worse than anything we've discussed because we use a slow variant of t-SNE, which calculates its own KNNG completely separately from the sc.pp.neighbors. So really, any step we take would be a step in the right direction.

@ivirshup
Member

About the API, I still think it makes sense for TSNE weighted neighbor calculation to be separate, especially if it is going to have multiple weighting options that depend on the openTSNE package. If it turns out these methods don't have much in the way of parameters, then it might be reasonable for this to be a part of sc.pp.neighbors.

How about this, the implementation here should be well factored out into:

  1. Getting nearest neighbors
  2. Weighting the graph
  3. Computing the layout

Once the available parameters are clear I think it'll be easier to make an informed decision about whether neighbor weighting for tsne should occur through sc.pp.neighbors. Additionally, I think it'll be easier to integrate cleanly separated code than to separate integrated code.

The weights constructed by UMAP in neighbors are not normalized. So if you run neighbors() and then tsne() then t-SNE should do something in order to be able to use this graph.

For passing the umap connectivity matrix to tsne layout, I think I would expect the weights to be used. Something like this should accomplish that:

import numpy as np
import openTSNE

class WrappedAffinities(openTSNE.affinity.Affinities):
    def __init__(self, neighbors, symmetrize=True, verbose=False):
        self.verbose = verbose
        P = neighbors
        if symmetrize:
            P = (P + P.T) / 2
        total = P.sum()
        if not np.isclose(total, 1.0):
            P = P / total
        self.P = P

That said, I'm not too familiar with the assumptions of tsne, or if this would be appropriate. I think binarizing the edge weights is a bit of a strong assumption unless specifically requested though.

With umap, we throw a warning if it looks like the passed graph didn't come from umap. You could do the same here?

From an implementation standpoint, the sc.pp.tsne_neighbors will inevitably have to call the UMAP KNNG construction, since I can see that it's not split out in the code-base.

I would like nearest neighbor calculation and graph weighting to be split out eventually. Since it's already done this way in openTSNE, I think this could help with that goal.

From what I can tell, the standard way of weighing the KNNG for graph-based clustering in single-cell is to use the Jaccard index of the mutual nearest neighbors to weigh the edges).

Seurat does this, but I'm not sure this has been standardized much otherwise. Off the top of my head, I'd guess that the Jaccard method is going to be much more sensitive to k as a parameter, particularly how it relates to cluster size. I'm not really aware of any consensus made on this or benchmarks comparing approaches. When I've looked into it, using the weights (as opposed to just binarized edges) seems to give more stable results, particularly for small clusters. We've had some previous discussions on this here: #586, #240.

@dkobak

dkobak commented Jan 18, 2021

My primary wish here is very simple. I'd like the following sequence of commands:

sc.pp.neighbors(adata)
sc.tl.tsne(adata)

to produce a reasonable t-SNE result that could be called "t-SNE" in publications. What you suggest @ivirshup (t-SNE on normalized UMAP affinities) could maybe achieve that, but we would need to check. As I said, I don't think anybody ever has tried that. I could imagine that it would roughly correspond to t-SNE with perplexity less than 30, perhaps 20 or so, but this is just a wild guess.

I am worried that it may be a bit weird to refer to this as "t-SNE" in publications, because it's really t-SNE on normalized UMAP affinities which is an odd-sounding hybrid. But if the result is similar enough to t-SNE, then maybe it's okay to call it simply "t-SNE (as implemented in Scanpy)"...

A separate question is how a user would be able to achieve t-SNE proper, and here I could live with either

sc.pp.neighbors(adata, method='tsne')  # this would use perplexity=30 by default
sc.tl.tsne(adata)

or

sc.pp.neighbors_tsne(adata)
sc.tl.tsne(adata)

This is just a question of API, and is less important for me personally. I agree that it could be better to have neighbors() compute kNN adjacency matrix without computing any weights, but this is refactoring beyond the scope of this PR.

@ivirshup
Member

I'd like the following sequence of commands ... to produce a reasonable t-SNE result that could be called "t-SNE" in publications.
I am worried that it may be a bit weird to refer to this as "t-SNE" in publications

I share the same worry, but am not qualified to answer when something becomes "t-SNE". I think it would be sufficient for sc.tl.tsne to warn users if the graph it was passed looks unexpected (or if it could tell it was generated by a different method).

What you suggest (t-SNE on normalized UMAP affinities) could maybe achieve that

From an API point of view, we don't control weights at the sc.tl.umap call, so I think it would be strange to control weights at the sc.tl.tsne call. I'm also not sure if binarizing the graph would be closer to "t-SNE".


About sc.pp.neighbors vs sc.pp.neighbors_tsne

This is just a question of API, and is less important for me personally. I agree that it could be better to have neighbors() compute kNN adjacency matrix without computing any weights, but this is refactoring beyond the scope of this PR.

I think for backwards compatibility I would like to keep neighbors pretty much as is. I think new functions like distance_neighbors, umap_neighbors, tsne_neighbors could be reasonable to add. It's also possible we could add a "tsne" method to neighbors, but I think the implementation can look very similar to having a tsne_neighbors function, so this can be kicked down the road a bit.

@dkobak

dkobak commented Jan 19, 2021

I think it would be sufficient for sc.tl.tsne to warn users if the graph it was passed looks unexpected

Yes, warning is a good idea. But the warning IMHO should not convey the message "Do not do this!". In my mind, it should convey the message "What you are computing is not exactly t-SNE, but it is close enough to t-SNE that you can ignore this message. If you really want to get exactly t-SNE, run the following command instead: ..., but note that it can be slower and the result will likely look around the same".
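A sketch of such a soft warning (helper name and wording hypothetical; in practice the method that built the graph would be read from adata.uns):

```python
import warnings

def warn_if_not_tsne_graph(neighbors_method):
    """Warn, but do not block, when the graph was not built for t-SNE."""
    if neighbors_method != "tsne":
        warnings.warn(
            "The neighbor graph was not computed with perplexity-based "
            "weights, so the result is close to, but not exactly, t-SNE. "
            "This is usually fine; for exact t-SNE affinities, run the "
            "t-SNE neighbor computation first (slower, with a very "
            "similar result)."
        )
```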

From an API point of view, we don't control weights at the sc.tl.umap call, so I think it would be strange to control weights at the sc.tl.tsne call.

But we will have to control them anyway... Your suggested solution also controls them: namely, symmetrizes and normalizes.

I'm also not sure if binarizing the graph would be closer to "t-SNE".

Maybe not, but
(1) it will be further away from UMAP, so that e.g. the UMAP paper does not need to be cited when using such t-SNE.
(2) There is citeable literature showing that binary affinities yield practically the same result as t-SNE proper. We can cite this in Scanpy docs. I am not aware of any such literature for normalized UMAP affinities in t-SNE.

Stepping back, I am not sure I managed to convey my main point here. Which is: Scanpy is in a unique position to offer people t-SNE with k=15 binary affinities as a convenient, faster, UMAP-independent, and nearly equivalent replacement for k=90, perplexity=30 affinities.

I agree with Pavlin though that pretty much any decision would be better than the current situation :-)

@ivirshup
Member

ivirshup commented Feb 4, 2021

Sorry about the wait, had to focus on getting the last release out. Now we can do new features!

But the warning IMHO should not convey the message "Do not do this!". In my mind, it should convey the message "What you are computing is not exactly t-SNE, but it is close enough to t-SNE that you can ignore this message.

That sounds appropriate.

But we will have to control them anyway... Your suggested solution also controls them: namely, symmetrizes and normalizes.

I think normalization is a "lighter touch" than binarization. To me, the alternative would be to error for non-normalized data since the optimization won't converge properly.

Not knowing too much about the internals of tsne, is a symmetric graph necessary? If it's not, then I'd be fine with not doing that.

Exactly how the option to do this is provided to users could take some consideration. I think it would be clean and composable to have graph weighting options separate from embedding layout options, but considering tsne has restrictions on graph weights there may have to be some exception here. Perhaps there needs to be a weights option on tsne which allows normalization, binarization, or just erroring if the passed graph doesn't have correct weighting.
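A hedged sketch of what that weights option could look like (function and mode names hypothetical), assuming a scipy sparse graph:

```python
import numpy as np
from scipy import sparse

def prepare_weights(P, mode="normalize"):
    """Hypothetical handling of a passed affinity graph.

    "normalize": rescale so the matrix sums to 1 (t-SNE convention);
    "binarize":  replace the weights with a uniform kernel over the edges;
    "error":     refuse graphs that are not already normalized.
    """
    if mode == "binarize":
        P = (P > 0).astype(np.float64)
    total = P.sum()
    if mode == "error" and not np.isclose(total, 1.0):
        raise ValueError("affinity matrix must sum to 1")
    return P / total

P = sparse.random(30, 30, density=0.2, format="csr", random_state=0)
```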


From my perspective, what we have to gain here is:

  • More efficient TSNE by default
  • Consolidate implementation to a single well maintained library
  • More flexibility in how tsne is computed

Scanpy is in a unique position to offer people t-SNE with k=15 binary affinities as a convenient, faster, UMAP-independent, and nearly equivalent replacement for k=90, perplexity=30 affinities.

I'm happy to have this be an option. I'm less comfortable with something like this being the "recommended path", since not using perplexity weights seems non-standard.


In general, are we agreed on these points?

  • tsne should allow weights to be passed through (whether perplexity based, or not)
    • There should be a warning to notify the user if the weights were computed in a non-standard way
  • There should be a function for computing a perplexity weighted nearest neighbor graph.

@dkobak

dkobak commented Feb 4, 2021

In general, are we agreed on these points?
tsne should allow weights to be passed through (whether perplexity based, or not)
There should be a warning to notify the user if the weights were computed in a non-standard way
There should be a function for computing a perplexity weighted nearest neighbor graph.

Yes. I agree.

Perhaps there needs to be a weights option on tsne which allows normalization, binarization, or just erroring if the passed graph doesn't have correct weighting.

This sounds good. IMHO erroring is not necessary. There will be a warning anyway.

Not knowing too much about the internals of tsne, is a symmetric graph necessary? If it's not, then I'd be fine with not doing that.

Actually UMAP weights are symmetric. So it would be enough to normalize the entire weight matrix to sum to 1.


I think there are two different choices that we have been disagreeing about:

  • Choice 1: whether preprocess_weights='normalize' or preprocess_weights='binarize' is default for tl.tsne() if the passed weights do not sum to 1.
    • Argument for normalize (Isaac): closer to originally calculated weights;
    • Argument for binarize (Dmitry and Pavlin): will make it UMAP-independent if tl.tsne() is run after the default pp.neighbors().
  • Choice 2: whether perplexity weights are given by pp.neighbors_tsne() or by pp.neighbors(method='tsne')
    • Argument for neighbors_tsne (Isaac): the existing function is complicated enough, so let's not make it even more complicated;
    • Argument for neighbors(method='tsne') (Dmitry and Pavlin): the other option would make the API for UMAP weights and for t-SNE weights asymmetric and even confusing, as neighbors is not called neighbors_umap.

Does this summarize the arguments from both sides?

@ivirshup
Member

ivirshup commented Mar 1, 2021

Ahh, I had written a response to this, but appear not to have hit send! Apologies!

For the decisions, I think that these are both relatively easy to change, and it'll be easier to get another opinion from the team once the basic functionality is present in this PR. Does that sound right to you?

@dkobak

dkobak commented Mar 12, 2021

@pavlin-policar Have you had a chance to read our conversation above (I tried to summarize it in my last comment above)? What are your thoughts?

@pavlin-policar
Author

So, having re-read the thread, the steps forward seem pretty clear, and we're really just debating the API, which wouldn't be that hard to change before a release anyway. It becomes much harder after a release, because then you have to worry about backward compatibility.

So, I suggest the following.

First, calling sc.pp.neighbors followed by sc.tl.tsne should not recompute the nearest neighbors, but use the existing KNNG. To get around the whole "should we binarize or not" question, I suggest adding a binarize="auto" parameter to sc.tl.tsne. If binarize=True, we binarize the KNNG regardless of input. If binarize=False, we just re-normalize the weights if needed; this way, we can potentially use UMAP connectivities. The default binarize="auto" would automatically binarize weights unless they come from sc.pp.neighbors_tsne. That way, the default would be either t-SNE proper or uniform-kernel t-SNE, which is close enough. Since most users stick with the defaults, this would avoid people running a strange combination of UMAP and t-SNE: they would have something close to t-SNE proper and would only have to cite the t-SNE paper (as implemented in scanpy). This way, we can run any of the three scenarios.
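A sketch of that dispatch (names hypothetical; in practice the neighbor method would be read from adata.uns):

```python
def resolve_binarize(binarize, neighbors_method):
    """Decide whether tsne() should binarize the passed graph."""
    if binarize == "auto":
        # Binarize anything that did not come from neighbors_tsne, so the
        # default path is either t-SNE proper or uniform-kernel t-SNE.
        return neighbors_method != "tsne"
    return bool(binarize)
```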

Second, I agree that adding more parameters to sc.pp.neighbors is not a good idea, so, at least for now, the least bad solution seems to be adding sc.pp.neighbors_tsne. This way, we can see what parameters are needed without having to work around the existing implementation. That said, this is not a good solution, just not as bad as the other one: it gives clear preferential treatment to UMAP weights. I am still confused about why the UMAP weights are used for everything, including downstream clustering (e.g. sc.pp.neighbors(...); sc.tl.leiden(...)). I haven't been following the single-cell literature as much lately, but from what I can tell, there's no evidence showing this is better than anything else.

From #1739, it seems that you are considering a change in the API, and I would definitely be in favour of that. As you add more and more functionality to scanpy, things are inevitably going to get more complicated, and patching the existing API will just lead to thousands of parameters. The API in #1739 seems like the logical next step.

I'll try to work on this in the coming days/weeks, so we can see what's really necessary, and we can work out the exact API after we have a working prototype. What do you think?

@dkobak

dkobak commented Mar 18, 2021

Sounds great to me! Looking forward.

Would be interesting to compare these three t-SNEs:

sc.pp.neighbors()
sc.tl.tsne(binarize=True)
sc.tl.tsne(binarize=False)
sc.pp.neighbors_tsne()
sc.tl.tsne(binarize=False)

on a couple of datasets after the standard scanpy preprocessing pipeline.

@@ -22,3 +22,4 @@ umap-learn>=0.3.10,<0.5
legacy-api-wrap
packaging
sinfo
openTSNE>=0.5.0
Author

This needs to be updated to 0.6, when I release it. Until then, CI will fail here.

@@ -2,7 +2,7 @@

import numpy as np
import pytest
from sklearn.utils.testing import assert_array_almost_equal
from numpy.testing import assert_array_almost_equal
Author

This was an internal scikit-learn module, and importing it was throwing warnings that it would be removed from the scikit-learn public API.


# If `binarize="auto"`, we binarize everything except t-SNE weights
if binarize == "auto":
binarize = neighbors["params"]["method"] != "tsne"
Author

Binarization will produce uniform weights that differ ever so slightly from openTSNE.affinity.Uniform. The difference is that in openTSNE, we do not start from a symmetric graph, so when we symmetrize the uniform weights, points that appear as each other's neighbors end up with higher weights. In that case we get two sets of possible weights, e.g. 0.05 and 0.1.

Here, we already have a symmetric connectivity graph, so this doesn't happen, and all points are assigned equal weights. I strongly suspect this has practically no impact on the resulting embeddings.
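A toy illustration of the two weight levels (plain NumPy, not openTSNE code): after symmetrizing a directed uniform kNN graph, mutual neighbors carry twice the weight of one-sided neighbors.

```python
import numpy as np

# Directed kNN graph for 4 points with k=2: row i marks i's two neighbors.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

P = (A + A.T) / 2.0   # symmetrize the uniform weights
P /= P.sum()          # normalize to a probability distribution

levels = np.unique(P[P > 0])
# Two levels remain: one-sided edges vs. mutual-neighbor edges,
# with the mutual ones exactly twice as heavy.
```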


I agree. For consistency, we could make the Uniform affinity class in openTSNE support not only symmetrize=True/False but also something like symmetrize='or', which would compute P = (P + P.T) > 0. I actually thought about this a while ago already.
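In NumPy terms, that suggested `symmetrize='or'` behaviour would amount to something like this (a sketch, not the openTSNE implementation):

```python
import numpy as np

# Directed kNN graph (1 = "j is among i's nearest neighbors")
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

# 'or' symmetrization: keep an edge if it exists in either direction,
# then give every kept edge the same weight.
P_or = ((A + A.T) > 0).astype(float)
P_or /= P_or.sum()
# Unlike plain averaging, this leaves a single weight level.
```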

if adata.is_view: # we shouldn't need this here...
adata._init_as_actual(adata.copy())

n_neighbors = int(np.ceil(perplexity * 3)) + 1  # +1 because it includes self
Author

This is really the only functionally different LOC from the original neighbors method.
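For clarity, the perplexity-to-neighbors conversion on that line is just the usual t-SNE convention (sketched here as a standalone helper; the function name is illustrative):

```python
import math

def n_neighbors_for_perplexity(perplexity):
    # t-SNE convention: consider ~3 * perplexity nearest neighbors;
    # +1 because the kNN search returns each point as its own first neighbor
    return int(math.ceil(perplexity * 3)) + 1

n_neighbors_for_perplexity(30)   # the default perplexity of 30 -> 91 neighbors
```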

@@ -743,6 +888,10 @@ def compute_neighbors(
self._adata.shape[0],
self.n_neighbors,
)
elif method == 'tsne':
self._distances, self._connectivities = _compute_connectivities_tsne(
knn_indices[:, 1:], knn_distances[:, 1:], n_jobs=-1
Author

Use n_jobs=-1, meaning all available cores, since (I assume) this is what the other functions do as well.

@pavlin-policar
Author

@ivirshup @dkobak I've fixed up this PR, so now it implements what I mentioned in my comment above. I've left a couple of comments on the code, commenting on anything noteworthy.

The tests and everything will fail until I release a new version of openTSNE, which I'll do in the coming days. But please look through the changes and let me know if there's anything you'd like me to change, so we can get this merged. Also, I haven't updated the docstrings at all.

The most glaring thing is neighbors_tsne. Over 90% of the code here is identical to neighbors. Really, the only difference is that I changed the n_neighbors parameter to perplexity, but there was no elegant way to incorporate that into neighbors. I've also tried refactoring the duplicated code that saves the settings into adata.uns, but doing that would also make the code pretty messy. Obviously, it's not a good idea to have duplicated code like this. What do you think would be the best way to handle this?

Functionally, this now works as agreed. Let me know how you want to proceed.

@dkobak

dkobak commented Apr 18, 2021

Cool! Did you have a chance to compare these embeddings:

sc.pp.neighbors()
sc.tl.tsne(binarize=True)
sc.tl.tsne(binarize=False)
sc.pp.neighbors_tsne()
sc.tl.tsne(binarize=False)

on one of the standard scanpy datasets, like e.g. scanpy.datasets.pbmc3k_processed (used in scanpy tutorials such as https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)?

@pavlin-policar
Author

pavlin-policar commented Apr 18, 2021

Indeed I did:

Plots

UMAP weights, binarize=True
image

UMAP weights, binarize=False
image

t-SNE weights
image

The differences between UMAP weights with and without binarization make sense: using the weights looks like t-SNE with a smaller perplexity, while the uniform kernel tends to make things more compact. The only noticeable structural difference between the UMAP and t-SNE weights is that the Megakaryocytes are maybe a bit closer to the left cluster. But that's probably due to the larger number of neighbors.

@dkobak

dkobak commented Apr 18, 2021

For some reason I don't see the figures here on the Github page (and get an error message when I click on the link), but they showed fine in the email notification I received. Looks good!

@pavlin-policar
Author

I had put them in the collapsible thing, because they take quite a bit of space. But I've removed that now, so they should be visible.

@dkobak

dkobak commented Apr 18, 2021

The problem is not the collapsible thing. I still don't see the images here but only links instead, and when I click on a link e.g. https://user-images.githubusercontent.com/5758119/115158730-f0768a00-a08f-11eb-939a-1b9c35373fae.png I get an error message that the image cannot be displayed because it contains errors. Odd. Maybe it's something weird on my side.

@@ -692,11 +837,11 @@ def compute_neighbors(
if n_neighbors > self._adata.shape[0]: # very small datasets
n_neighbors = 1 + int(0.5*self._adata.shape[0])
logg.warning(f'n_obs too small: adjusting to `n_neighbors = {n_neighbors}`')
if method == 'umap' and not knn:
if method in {'umap', 'tsne'} and not knn:
raise ValueError('`method = \'umap\' only with `knn = True`.')

This ValueError should now probably print method=\'{method}\' and not method=\'umap\'.

raise ValueError('`method = \'umap\' only with `knn = True`.')
if method == 'rapids' and metric != 'euclidean':
raise ValueError("`method` 'rapids' only supports the 'euclidean' `metric`.")
if method not in {'umap', 'gauss', 'rapids'}:
if method not in {'umap', 'gauss', 'rapids', 'tsne'}:
raise ValueError("`method` needs to be 'umap', 'gauss', or 'rapids'.")

Add 'tsne' to the ValueError message?

@ivirshup
Member

Thanks for updating this! Just a heads up that it might be a little bit until I can give a full review.

As a wider design question, would you want to allow using the other Affinities weightings from openTSNE?

@pavlin-policar
Author

As a wider design question, would you want to allow using the other Affinities weightings from openTSNE?

The only other interesting affinity kernel we could use is the multiscale kernel. But that generally requires an even larger number of nearest neighbors. Either way, it would be easy to incorporate it into neighbors_tsne by just allowing perplexity to be a list instead of a single number. But I think this might not belong in this PR. Let's get the basic t-SNE sorted out first.
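For reference, a multiscale kernel can be thought of as averaging the single-scale affinities over several perplexities. A rough NumPy sketch under that reading (openTSNE's actual Multiscale class uses Gaussian kernels at each perplexity; the uniform variant and the function names here are illustrative only):

```python
import numpy as np

def uniform_affinities(knn_indices, k):
    """Symmetrized, normalized uniform affinities over the first k neighbors."""
    n = len(knn_indices)
    P = np.zeros((n, n))
    for i, neighbors in enumerate(knn_indices):
        P[i, neighbors[:k]] = 1.0
    P = (P + P.T) / 2.0
    return P / P.sum()

def multiscale_affinities(knn_indices, perplexities):
    """Average the affinities across scales. Note this needs
    3 * max(perplexities) precomputed neighbors, hence the larger
    kNN graph mentioned above."""
    ks = [int(np.ceil(3 * p)) for p in perplexities]
    return sum(uniform_affinities(knn_indices, k) for k in ks) / len(ks)

# Toy kNN graph: 6 points, each listing its 5 nearest neighbors
knn = np.array([[(i + j) % 6 for j in range(1, 6)] for i in range(6)])
P = multiscale_affinities(knn, perplexities=[0.5, 1])
```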

@dkobak

dkobak commented Jun 30, 2021

Just got reminded of this PR... @ivirshup Should we maybe ping somebody else from the Scanpy team to join the discussion / review here?

@dkobak

dkobak commented Jul 6, 2021

I've just seen a new really big/important study that extensively uses t-SNE visualization of ~500k cells (this is unusual as most papers in the field seem to be using UMAP nowadays), and they use Scanpy t-SNE with default parameters :-/ I think this highlights the importance of this PR.

https://www.biorxiv.org/content/10.1101/2021.07.04.451050v1

I am going to CC @LuckyMD @giovp (apologies guys if it's not up your street).

@giovp
Member

giovp commented Jul 15, 2021

@dkobak sorry for the late reply, I'm not able to follow this but I'll ask around and keep you posted here!

@pavlin-policar
Author

pavlin-policar commented Sep 19, 2021

@ivirshup @giovp I'm wondering whether you've had the time to look over this. If this PR is maybe too big a change, then perhaps it would make more sense to migrate to openTSNE with a more iterative approach. For instance, we could just replace the t-SNE implementation with openTSNE, ignoring the API discussion and ignoring the precomputed graphs. I think switching to openTSNE, regardless of integration, would make the t-SNE implementation faster. We could then go for tighter integration step by step. What do you think?

@dkobak

dkobak commented Sep 16, 2022

I was talking to some of our students who attended the seminar in Munich earlier this week, and this made me think of this PR... Too bad it got stalled. Is there anything we can do about it @ivirshup?

@dkobak

dkobak commented Oct 24, 2022

Update here: I met @ivirshup at the SCG conference in Utrecht and we briefly chatted about this issue on the way to a pub. Isaac seemed supportive of the whole setup in this PR, which would allow doing this:

sc.pp.neighbors()
sc.tl.tsne()                # default binarize='auto' resolves to binarize=True here. UMAP kNN graph but binary weights
sc.tl.tsne(binarize=False)  # this would use UMAP weights but normalize to sum to 1
sc.pp.neighbors_tsne()
sc.tl.tsne()                # default binarize='auto' resolves to binarize=False here

As @pavlin-policar showed above, these three t-SNE calls yield very similar embeddings.
