Switch t-SNE implementation to openTSNE #1561
base: main
Conversation
Hi Pavlin, Happy New Year! Great to see this going forward!
Yes, I also thought we'd need to do it similar to the […]. One question here though: how would […]? Apart from that, I noticed that you implemented […] in here, but isn't it part of […]?

Oh, one other small thing: I would definitely suggest adding […].
Hey Dmitry, happy New Year's to you too!
No, I think we should be able to call the existing machinery. But we'd need to do something like I do with the Uniform affinities here. The reason I had to write separate classes is that the ones in openTSNE calculate the KNNG internally, and don't really offer a way to pass an existing KNNG. In openTSNE that makes sense, since otherwise, the API would be pretty complicated. But here, we have to deal with that. As you can see, it's a pretty trivial wrapper anyway.
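The wrapper idea can be illustrated without openTSNE installed: an openTSNE affinity object essentially just exposes a normalized sparse matrix as its `P` attribute. A duck-typed sketch (hypothetical class name, scipy only — the actual PR subclasses `openTSNE.affinity.Affinities`):

```python
import numpy as np
from scipy import sparse

class PrecomputedUniformAffinities:
    """Sketch: build uniform t-SNE affinities from an existing KNN graph,
    instead of letting openTSNE compute the graph internally."""

    def __init__(self, knn_graph, symmetrize=True):
        # Ignore distances/weights: every neighbor gets equal weight
        P = (knn_graph > 0).astype(np.float64)
        if symmetrize:
            P = (P + P.T) / 2
        # t-SNE expects the affinities to form a probability distribution
        P /= P.sum()
        self.P = P

# Toy KNN graph: 100 points, ~5 random neighbors each
graph = sparse.random(100, 100, density=0.05, format="csr", random_state=0)
aff = PrecomputedUniformAffinities(graph)
```

The only state downstream code needs is `aff.P`, which is why such a trivial wrapper suffices.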
> I noticed that […]

I added […].
Yes, makes sense.
Yes, I like this.
Ah, right, I somehow overlooked that you did add the exaggeration parameter. That's fine then!
I hope everybody had a good New Year break! Just pinging @ivirshup again because I think it'd be great to move forward with this. Also wanted to ping @falexwolf as I know he did a lot of work on dim reduction etc. Cheers.
Happy new year! And thanks for opening this PR @pavlin-policar. First, a general question: what is the scope of this PR? Will this just be single-dataset t-SNE calculation, with integration/[…]? In terms of workflow, I think I'd like it to look similar to UMAP […].
If possible, I would like it if the user could specify an arbitrary manifold (e.g. the umap weighted one) to pass to the embedding step, but this is icing.
I would prefer for this to be a separate function, maybe […]. How different are the arguments to the various […]?
+1. Do you need to know what the affinity method was if you're just calculating an embedding? Or does that only become important when you want to add new data?
scanpy/tools/_tsne.py
```python
class UniformAffinities(openTSNE.affinity.Affinities):
    def __init__(self, neighbors, symmetrize=True, verbose=False):
        self.verbose = verbose

        # Ignore the weights and just assign every neighbor equal weight
        P = neighbors > 0
        # Symmetrize the probability matrix
        if symmetrize:
            P = (P + P.T) / 2
        # Convert weights to probabilities
        P /= np.sum(P)

        self.P = P
```
I believe this will error if `symmetrize` is `False`, since you can't do an in-place type conversion. That is, if `symmetrize` is `False`, `P.dtype` is `bool`, so you can't in-place assign floats.
Example:

```python
from scipy import sparse

s = sparse.random(100, 100, format="csr", density=0.1)
sb = s > 0
sb /= sb.sum()
```

```
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-4-23169c7c63c6> in <module>
----> 1 sb /= sb.sum()

/usr/local/lib/python3.8/site-packages/scipy/sparse/data.py in __itruediv__(self, other)
     61         if isscalarlike(other):
     62             recip = 1.0 / other
---> 63             self.data *= recip
     64             return self
     65         else:

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('bool') with casting rule 'same_kind'
```
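One possible fix (a sketch, not the PR's code) is to cast the boolean mask to float before any in-place arithmetic, so the normalization works regardless of whether the matrix was symmetrized first:

```python
import numpy as np
from scipy import sparse

s = sparse.random(100, 100, format="csr", density=0.1, random_state=0)
# `s > 0` has dtype bool; cast it before in-place float operations
sb = (s > 0).astype(np.float64)
sb /= sb.sum()  # fine now: in-place division on a float matrix
```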
Not sure I understood this bit. Why?
I would definitely like this to be the case. I'm not sure I see […].
To me, it's largely a maintenance and documentation issue. Most bugs I fix (here, and in upstream libraries) come from argument handling. The more features you lump into a function, the more complicated argument handling gets. There are questions of default values and fallbacks for different backends, and being sure users understand which arguments are valid for each backend. The use of the […]

From an API standpoint, I would like the "blessed" workflow to be:

```python
sc.pp.neighbors_tsne(adata)
sc.tl.tsne(adata)
```

How many arguments is it going to take to make this work if this functionality is in […]?
Thanks! So my understanding is that you are saying that […].
I'd say I don't see why t-SNE should use annoy if UMAP uses pynndescent. This does not matter for the algorithm. So […]. In any case, […] would also be fine with me. I think it's important that the following works: […] and is actually the recommended way to run t-SNE within scanpy (using uniform affinities). But if somebody wants to do the "actual" t-SNE, then they can run […].

One question here is maybe what should other downstream functions like Leiden clustering use, if somebody runs […]?
Pretty much. I prefer more small, simple functions with common APIs over fewer functions with larger APIs.
I think I'd be pro that. I'd probably prefer exposing an interface for computing weights from KNN distances, where methods like […].
A couple of questions. First, scientific: why would you prefer uniform edge weights as input to your t-SNE? I would think the information about relative distance is useful. Second, API: I'm not sure I completely agree with this. I think it would be the most clear for […]. Are there cases you think this disallows?
The graph that's used is provided from arguments like […].
My argument was mostly about API. But one big benefit of using kNN with k=15 is that it's much faster than using k=90 (with perplexity=30), which is what t-SNE uses by default (for historical reasons). It's not about uniform vs non-uniform, it's about k=15 vs k=90. Uniform on k=15 just happens to give almost the same results as perplexity=30 on k=90. So it's a lucky coincidence that I thought we could benefit from.
I understand what you're saying, but the situation won't be symmetric because […]. Please note that t-SNE requires normalized weights (they should sum to 1). The weights constructed by UMAP in […].
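The normalization requirement is lightweight: given a symmetric UMAP-style connectivity matrix, a single global rescale suffices. A sketch with a hypothetical stand-in matrix (in scanpy this would be `adata.obsp["connectivities"]`):

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a UMAP connectivity matrix: symmetric,
# entries in (0, 1], but not summing to 1
C = sparse.random(200, 200, density=0.05, format="csr", random_state=0)
C = C.maximum(C.T)  # symmetrize, as UMAP's fuzzy-union does

P = C / C.sum()  # rescale so the affinities sum to 1, as t-SNE requires
```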
I'm a little late to the party, but here's my $0.02.
I think we can split it into two PRs, since they're going to touch different parts of the code base, and it should be easier to review them individually.
So, if we use the KNNG provided by […]?
Yes, the affinity model will have to be somehow kept, since when we call […].

Regarding the whole API, I have a few comments. I very much dislike the API […]. From an implementation standpoint, the […]. Alternatively, we could construct our own KNNG in […]. What I think would make more sense is to remove the connectivity calculation from the […].

Ultimately, the decision is up to you guys, since this is more of a design choice of where you want to take the scanpy API. The existing solution is worse than anything we've discussed, because we use a slow variant of t-SNE, which calculates its own KNNG completely separately from the […].
About the API, I still think it makes sense for t-SNE-weighted neighbor calculation to be separate, especially if it is going to have multiple weighting options that depend on the […]. How about this: the implementation here should be well factored out into […].
Once the available parameters are clear, I think it'll be easier to make an informed decision about whether neighbor weighting for t-SNE should occur through […].
For passing the umap connectivity matrix to tsne layout, I think I would expect the weights to be used. Something like this should accomplish that:

```python
class WrappedAffinities(openTSNE.affinity.Affinities):
    def __init__(self, neighbors, symmetrize=True, verbose=False):
        self.verbose = verbose
        P = neighbors
        if symmetrize:
            P = (P + P.T) / 2
        total = P.sum()
        if not np.isclose(total, 1.):
            P = P / total
        self.P = P
```

That said, I'm not too familiar with the assumptions of t-SNE, or whether this would be appropriate. I think binarizing the edge weights is a bit of a strong assumption unless specifically requested, though. With […]
I would like nearest neighbor calculation and graph weighting to be split out eventually, since it's already done this way in […].
My primary wish here is very simple. I'd like the following sequence of commands […] to produce a reasonable t-SNE result that could be called "t-SNE" in publications. What you suggest @ivirshup (t-SNE on normalized UMAP affinities) could maybe achieve that, but we would need to check. As I said, I don't think anybody has ever tried that. I could imagine that it would roughly correspond to t-SNE with perplexity less than 30, perhaps 20 or so, but this is just a wild guess. I am worried that it may be a bit weird to refer to this as "t-SNE" in publications, because it's really t-SNE on normalized UMAP affinities, which is an odd-sounding hybrid. But if the result is similar enough to t-SNE, then maybe it's okay to call it simply "t-SNE (as implemented in Scanpy)"...

A separate question is how a user would be able to achieve t-SNE proper, and here I could live with either […] or […]. This is just a question of API, and is less important for me personally. I agree that it could be better to have […].
I share the same worry, but am not qualified to answer when something becomes "t-SNE". I think it would be sufficient for […].
From an API point of view, we don't control weights at the […]. About […]:
I think for backwards compatibility I would like to keep neighbors pretty much as is. I think new functions like […].
Yes, warning is a good idea. But the warning IMHO should not convey the message "Do not do this!". In my mind, it should convey the message "What you are computing is not exactly t-SNE, but it is close enough to t-SNE that you can ignore this message. If you really want to get exactly t-SNE, run the following command instead: ..., but note that it can be slower and the result will likely look around the same".
But we will have to control them anyway... Your suggested solution also controls them: namely, it symmetrizes and normalizes.
Maybe not, but […]. Stepping back, I am not sure I managed to convey my main point here, which is: Scanpy is in a unique position to offer people t-SNE with k=15 binary affinities as a convenient, faster, UMAP-independent, and nearly equivalent replacement for k=90, perplexity=30 affinities. I agree with Pavlin though that pretty much any decision would be better than the current situation :-)
Sorry about the wait, had to focus on getting the last release out. Now we can do new features!
That sounds appropriate.
I think normalization is a "lighter touch" than binarization. To me, the alternative would be to error for non-normalized data, since the optimization won't converge properly. Not knowing too much about the internals of t-SNE: is a symmetric graph necessary? If it's not, then I'd be fine with not doing that. Exactly how the option to do this is provided to users could take some consideration. I think it would be clean and composable to have graph weighting options separate from embedding layout options, but considering […].

From my perspective, what we have to gain here is: […]
I'm happy to have this be an option. I'm less comfortable with something like this being the "recommended path", since not using perplexity weights seems non-standard. In general, are we agreed on these points? […]
Yes. I agree.
This sounds good. IMHO erroring is not necessary. There will be a warning anyway.
Actually, UMAP weights are symmetric, so it would be enough to normalize the entire weight matrix to sum to 1. I think there are two different choices that we have been disagreeing about: […]

Does this summarize the arguments from both sides?
Ahh, I had written a response to this, but appear not to have hit send! Apologies! For the decisions, I think that these are both relatively easy to change, and it'll be easier to get another opinion from the team once the basic functionality is present in this PR. Does that sound right to you?
@pavlin-policar Have you had a chance to read our conversation above (I tried to summarize it in my last comment above)? What are your thoughts?
So, having re-read the thread, the steps forward seem pretty clear, and we're really just debating the API, which wouldn't be that hard to change before a release anyway. It becomes much harder after a release, because then you have to worry about backward compatibility. So, I suggest the following. First, calling […]. Second, I agree that adding more parameters to […].

From #1739, it seems that you are considering a change in the API, and I would definitely be in favour of that. As you add more and more functionality to scanpy, things are inevitably going to get more complicated, and patching the existing API will just lead to thousands of parameters. The API in #1739 seems like the logical next step. I'll try to work on this in the coming days/weeks, so we can see what's really necessary, and we can work out the exact API after we have a working prototype. What do you think?
Sounds great to me! Looking forward to it. It would be interesting to compare these three t-SNEs […] on a couple of datasets after the standard scanpy preprocessing pipeline.
```diff
@@ -22,3 +22,4 @@ umap-learn>=0.3.10,<0.5
 legacy-api-wrap
 packaging
 sinfo
+openTSNE>=0.5.0
```
This needs to be updated to 0.6, when I release it. Until then, CI will fail here.
```diff
@@ -2,7 +2,7 @@

 import numpy as np
 import pytest
-from sklearn.utils.testing import assert_array_almost_equal
+from numpy.testing import assert_array_almost_equal
```
scikit-learn imported this internally, and was throwing warnings that it will be removed from the scikit-learn public API.
```python
# If `binarize="auto"`, we binarize everything except t-SNE weights
if binarize == "auto":
    binarize = neighbors["params"]["method"] != "tsne"
```
Binarization will produce uniform weights that differ ever so slightly from `openTSNE.affinity.Uniform`. The difference is that in openTSNE we do not have a symmetric graph, so when we symmetrize uniform weights, points that appear as each other's neighbors actually end up with higher weights. So in that case we have two sets of possible weights, e.g. 0.05 and 0.1.
Here, we already have a symmetric connectivity graph, so this doesn't happen, and all points are assigned equal weights. I strongly suspect this has practically no impact on the resulting embeddings.
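The two-weight-level effect is easy to see on a tiny asymmetric graph (a toy example, not scanpy code):

```python
import numpy as np
from scipy import sparse

# Directed 1-NN graph on 3 points: 0 -> 1, 1 -> 0, 2 -> 0;
# only the (0, 1) pair is mutual
P = sparse.csr_matrix((np.ones(3), ([0, 1, 2], [1, 0, 0])), shape=(3, 3))
P = (P + P.T) / 2  # mutual edges keep weight 1, one-way edges drop to 0.5
P /= P.sum()       # normalize to a probability distribution

print(sorted(set(P.data)))  # two distinct weight levels remain
```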
I agree. For consistency, we can make the `Uniform` affinity class in openTSNE support not only `symmetrize=True/False` but also something like `symmetrize='or'`, which would compute `P = (P + P.T) > 0`. I thought about it a while ago already, actually.
```python
if adata.is_view:  # we shouldn't need this here...
    adata._init_as_actual(adata.copy())

n_neighbors = int(np.ceil(perplexity * 3)) + 1  # +1 because it includes self
```
This is really the only functionally different LOC from the original `neighbors` method.
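For concreteness, the default `perplexity=30` resolves to 91 neighbors under this rule (the conventional 3 × perplexity, plus one for the point itself):

```python
import numpy as np

perplexity = 30
n_neighbors = int(np.ceil(perplexity * 3)) + 1  # +1 because it includes self
print(n_neighbors)  # 91
```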
```diff
@@ -743,6 +888,10 @@ def compute_neighbors(
                 self._adata.shape[0],
                 self.n_neighbors,
             )
+        elif method == 'tsne':
+            self._distances, self._connectivities = _compute_connectivities_tsne(
+                knn_indices[:, 1:], knn_distances[:, 1:], n_jobs=-1
+            )
```
Use `n_jobs=-1`, meaning all available cores, since this is what the other functions do as well (I guess).
@ivirshup @dkobak I've fixed up this PR, so now it implements what I mentioned in my comment above. I've left a couple of comments on the code, remarking on anything noteworthy. The tests and everything will fail until I release a new version of openTSNE, which I'll do in the coming days. But please look through the changes and let me know if there's anything you'd like me to change, so we can get this merged. Also, I haven't updated the docstrings at all. The most glaring thing is […].

Functionally, this now works as agreed. Let me know how you want to proceed.
Cool! Did you have a chance to compare these embeddings […] on one of the standard scanpy datasets, like e.g. `scanpy.datasets.pbmc3k_processed` (used in scanpy tutorials such as https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)?
Indeed I did. The differences between UMAP weights with and without binarization make sense: using the weights looks like t-SNE with a smaller perplexity, compared to the uniform kernel, which tends to make things more compact. The only noticeable structural difference between the UMAP and t-SNE weights is that maybe the Megakaryocytes are a bit closer to the left cluster. But that's probably due to the larger number of neighbors.
For some reason I don't see the figures here on the GitHub page (and get an error message when I click on the link), but they showed fine in the email notification I received. Looks good!
I had put them in the collapsible thing, because they take quite a bit of space. But I've removed that now, so they should be visible.
The problem is not the collapsible thing. I still don't see the images here but only links instead, and when I click on a link, e.g. https://user-images.githubusercontent.com/5758119/115158730-f0768a00-a08f-11eb-939a-1b9c35373fae.png, I get an error message that the image cannot be displayed because it contains errors. Odd. Maybe it's something weird on my side.
```diff
@@ -692,11 +837,11 @@ def compute_neighbors(
         if n_neighbors > self._adata.shape[0]:  # very small datasets
             n_neighbors = 1 + int(0.5*self._adata.shape[0])
             logg.warning(f'n_obs too small: adjusting to `n_neighbors = {n_neighbors}`')
-        if method == 'umap' and not knn:
+        if method in {'umap', 'tsne'} and not knn:
             raise ValueError('`method = \'umap\' only with `knn = True`.')
```
This ValueError should now probably print `method=\'{method}\'` and not `method=\'umap\'`.
```diff
             raise ValueError('`method = \'umap\' only with `knn = True`.')
         if method == 'rapids' and metric != 'euclidean':
             raise ValueError("`method` 'rapids' only supports the 'euclidean' `metric`.")
-        if method not in {'umap', 'gauss', 'rapids'}:
+        if method not in {'umap', 'gauss', 'rapids', 'tsne'}:
             raise ValueError("`method` needs to be 'umap', 'gauss', or 'rapids'.")
```
Add `'tsne'` to the ValueError message?
Thanks for updating this! Just a heads up that it might be a little bit until I can give a full review. As a wider design question, would you want to allow using the other […]?
The only other interesting affinity kernel we could use is the multiscale kernel, but that generally requires an even larger number of nearest neighbors. Either way, it would be easy to incorporate it into […].
Just got reminded of this PR... @ivirshup Should we maybe ping somebody else from the Scanpy team to join the discussion / review here?
I've just seen a really big/important new study that extensively uses t-SNE visualization of ~500k cells (this is unusual, as most papers in the field seem to be using UMAP nowadays), and they use Scanpy t-SNE with default parameters :-/ I think this highlights the importance of this PR. https://www.biorxiv.org/content/10.1101/2021.07.04.451050v1 I am going to CC @LuckyMD @giovp (apologies guys if it's not up your street).
@dkobak sorry for the late reply, I'm not able to follow this, but I'll ask around and keep you posted here!
@ivirshup @giovp I'm wondering whether you've had the time to look over this. If this PR is maybe too big a change, then perhaps it would make more sense to migrate to openTSNE in a more iterative approach. For instance, we could just replace the t-SNE implementation with openTSNE, ignoring the API discussion and ignoring the precomputed graphs. I think switching to openTSNE, regardless of integration, would make the t-SNE implementation faster. We could then go for tighter integration step by step. What do you think?
I was talking to some of our students who attended the seminar in Munich earlier this week, and this made me think of this PR... Too bad it got stalled. Is there anything we can do about it @ivirshup?
Update here: I met @ivirshup at the SCG conference in Utrecht and we briefly chatted about this issue on the way to a pub. Isaac seemed supportive of the whole setup in this PR, which would allow one to do this:

```python
sc.pp.neighbors()
sc.tl.tsne()                # default binarize='auto' resolves to binarize=True here:
                            # UMAP kNN graph but binary weights
sc.tl.tsne(binarize=False)  # this would use UMAP weights but normalize them to sum to 1

sc.pp.neighbors_tsne()
sc.tl.tsne()                # default binarize='auto' resolves to binarize=False here
```

As @pavlin-policar showed above, these three t-SNE calls yield very similar embeddings.
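The `binarize='auto'` rule from the diff earlier in the thread can be sketched as a tiny standalone helper (hypothetical function name; in the PR this is a one-liner inside the neighbors handling):

```python
def resolve_binarize(binarize, neighbors_method):
    """Resolve binarize='auto' from the method that built the KNN graph:
    binarize everything except weights that were already computed for t-SNE."""
    if binarize == "auto":
        return neighbors_method != "tsne"
    return binarize

# sc.pp.neighbors() stores method='umap'       -> binary weights
# sc.pp.neighbors_tsne() stores method='tsne'  -> keep t-SNE weights
```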
@dkobak @ivirshup @Koncopd This is a first stab at #1233.

Features

As discussed, this PR currently implements t-SNE with uniform affinity kernels, making it fit in nicely with the existing `sc.pp.neighbors` architecture. While this isn't technically t-SNE per se, it's visually almost impossible to tell them apart.

It would also make sense to add a `tsne` option to `sc.pp.neighbors`, but from what I can tell, there's no direct way to change the existing code to do this. It looks like `sc.pp.neighbors` calls UMAP to calculate the nearest neighbors directly, calculating the UMAP weights. We'd probably have to do something similar to the `gauss` option and just overwrite the UMAP weights after the fact. Does this sound reasonable?

I like the API of calling `sc.tl.tsne.recipe_X(adata)`. Adding the recipes would be simple here; we can just add a simple class wrapper around `_tsne` which would `__call__` tsne, and a bunch of static methods on the wrapper for recipes. This is kind of messy and probably not something you guys do anywhere else throughout the code base, so I'd appreciate your feedback on this. Do you like this, or should we do it in a different way?

The `ingest` functionality should be fairly straightforward as well, just adding a `tsne` option to `embedding_method`.

All of these things should be pretty easy to do, but I'd like to see that this is moving in a direction you like first.