
Default t-SNE parameters #22

Closed · dkobak opened this issue May 5, 2021 · 4 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments


dkobak commented May 5, 2021

Great paper and great package. Amazing work!

I am specifically interested in your UMAP/t-SNE comparisons and benchmarks, and I am now trying to figure out the SG-t-SNE-Pi default parameters that you use. As far as I understood your API, your defaults are max_iter=500, early_iter=200, alpha=10, where alpha denotes the early exaggeration coefficient. I noticed that 10 and 200 differ slightly from the default values in most existing t-SNE implementations (12 and 250), and I wonder why; but they are close enough that it does not really matter. What is not mentioned, though, is the learning rate. The learning rate can have a huge influence on t-SNE embeddings and on the speed of convergence. See https://www.nature.com/articles/s41467-019-13056-x and https://www.nature.com/articles/s41467-019-13055-y, which recommend setting the learning rate to n/12, where n is the sample size. What learning rate is used by the SG-t-SNE-Pi implementation that you use?

Unrelated: if I understood the "SG" part and your implementation correctly, you construct a kNN graph using k=10, then assign UMAP weights to the edges, and then, when running t-SNE, SG-t-SNE normalizes each row of the affinity matrix to sum to 1, symmetrizes it, and runs t-SNE as usual. Right? If I understood correctly, then this is pretty much exactly how it should soon be implemented in Scanpy, see the pending scverse/scanpy#1561 by @pavlin-policar. Nice.
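Just to make sure we mean the same pipeline, here is a minimal sketch of what I have in mind (the kNN search and the edge weights below are simplified stand-ins, not your actual implementation or the exact UMAP weighting):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def knn_affinities(X, k=10):
    """Sketch: kNN graph -> edge weights -> row-normalize -> symmetrize."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)          # first neighbor is the point itself
    dist, idx = dist[:, 1:], idx[:, 1:]

    # Placeholder for the UMAP-style weights: exponential decay measured from
    # the nearest neighbor (UMAP additionally tunes a per-point bandwidth).
    w = np.exp(-(dist - dist[:, [0]]) / (dist.mean(axis=1, keepdims=True) + 1e-12))

    rows = np.repeat(np.arange(n), k)
    P = csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(n, n))

    # Normalize each row to sum to 1 (what SG-t-SNE does with lambda = 1) ...
    P = P.multiply(1.0 / P.sum(axis=1))
    # ... then symmetrize as in standard t-SNE so the whole matrix sums to 1.
    return csr_matrix((P + P.T) / (2 * n))
```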

Finally, I am not entirely sure I understood your initialization approach. It's great that you use the same initialization for t-SNE and UMAP (another relevant paper here: https://www.nature.com/articles/s41587-020-00809-z). But I am confused by the following bit:

> PCA is performed on the Kmeans clusters centroid matrix which of form n × c, where n is the number of cells and c is the number of centroids.

Is this a binary matrix that has a 1 in position ij if cell i belongs to cluster j? If so, I'm not quite sure what the point of running PCA on such a matrix would be. I'm probably misunderstanding.

parashardhapola (Owner) commented

Thanks!

Learning rate: The default being used is 200. Thanks for bringing this up; I completely overlooked it. The SG-tSNE implementation does not expose this parameter. I have created an issue here and hope that the developers fix it soon. Once that is done, I will set the learning rate dynamically based on the number of cells, as has been suggested.

SG part: You are mostly correct; the rows of the affinity matrix each sum to 1. The value of k is of course tunable. The only catch is the lambda parameter in SG-tSNE. I have not investigated the effect of lambda in detail yet and have set it to 1, which means that there is no rescaling. I refer you to the SG-tSNE paper for more details.

Initialization: Thank you for bringing this to our notice. There is a major error in the explanation, and it will be corrected in the preprint asap. What we actually do: we perform a PCA reduction (d dimensions) of the data and then fit a MiniBatch KMeans (c clusters) on this reduced data. This gives the cluster centroid matrix, which is of form c x d. We then perform another round of PCA on this matrix to obtain either a c x 2 or a c x 3 matrix. Thereafter, each cell is assigned a 2D or 3D initial embedding based on its cluster identity from KMeans.

Justification: The idea is to keep cells with higher similarity closer together in the initial embedding. This could of course be done simply by using the first two or three PCA dimensions. However, there is a likelihood that the first 2/3 PCs do not capture the features that mark rare populations of cells, and hence these cells may end up with a poorly informed initial embedding. By grouping cells into clusters and re-running PCA on the centroids, the likelihood of rare cells obtaining distinct initial coordinates is increased. This is all conjecture, though, as we have not performed a systematic comparison of this approach against using the first 2/3 PC dimensions directly. I'll be happy to hear your take on this and am open to any suggestions.
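In code, the idea is roughly the following (a simplified sketch using sklearn, not the actual package code; the values of d and c are placeholders, not our real defaults):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

def cluster_based_init(X, d=30, c=1000, out_dims=2, random_state=0):
    """Sketch of the initialization described above (not the actual implementation)."""
    # 1) PCA reduction of the data to d dimensions.
    X_pca = PCA(n_components=d, random_state=random_state).fit_transform(X)

    # 2) MiniBatch KMeans with c clusters on the reduced data.
    km = MiniBatchKMeans(n_clusters=c, random_state=random_state).fit(X_pca)
    centroids = km.cluster_centers_                        # shape (c, d)

    # 3) Second round of PCA on the c x d centroid matrix -> c x 2 (or c x 3).
    centroids_low = PCA(n_components=out_dims,
                        random_state=random_state).fit_transform(centroids)

    # 4) Each cell inherits the low-dimensional coordinates of its cluster.
    return centroids_low[km.labels_]                       # shape (n_cells, out_dims)
```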

parashardhapola added the question (Further information is requested) and enhancement (New feature or request) labels on May 6, 2021

dkobak commented May 6, 2021

Thanks for the replies.

Re learning rate -- great! Some remarks here: if the SG-tSNE implementation has the factor 4 in the gradient (as sklearn does), then the heuristic should be n/48 and not n/12.

Also, the 12 here comes from the early exaggeration, so I'd suggest you consider changing the default alpha to 12 and early_iter to 250 too; but if you prefer to keep your defaults, then maybe the learning rate should be n/10 (or n/40).
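In other words, something along these lines (just a sketch of the heuristic; the floor at the conventional default of 200 and the factor-4 switch are my additions, and I don't know how SG-tSNE defines its gradient):

```python
def suggested_learning_rate(n_cells, early_exaggeration=12, gradient_has_factor_4=False):
    # Heuristic from the two papers above: learning rate = n / early exaggeration,
    # never going below the conventional default of 200.
    lr = max(200.0, n_cells / early_exaggeration)
    if gradient_has_factor_4:
        # If the implementation keeps the factor 4 in the gradient (like sklearn),
        # divide by 4, i.e. n/48 instead of n/12 (or n/40 instead of n/10).
        lr /= 4.0
    return lr
```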

In any case, for n = 4 million you will change the learning rate from 200 to something like 100,000, which is a HUGE change, so you should definitely run your benchmarks and see if everything still holds. I have no experience with the SG-tSNE implementation, so I don't know if there are any caveats here.


SG part: yes, I noticed that you set lambda to 1. This is what makes the affinities sum to 1, if I understood correctly. The UMAP weights will not sum to 1, so you need to fix that before running t-SNE, and it seems that SG-tSNE will do that for you, i.e. choose some gamma exponents so that the weights in each row sum to 1. Correct me if I am wrong.
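My mental model of that step is something like the following (purely illustrative; I have not checked the actual SG-tSNE code):

```python
import numpy as np

def rescale_row(weights, lam=1.0, n_iter=60):
    """Find gamma such that sum_j weights_j ** gamma ~= lam, then apply it.
    This is my reading of the lambda rescaling, not SG-tSNE's actual code."""
    weights = np.asarray(weights, dtype=float)   # kNN edge weights in (0, 1]
    lo, hi = 1e-6, 1e6                           # bisection bounds for gamma
    for _ in range(n_iter):
        gamma = 0.5 * (lo + hi)
        if np.sum(weights ** gamma) > lam:       # sum too large -> need larger gamma
            lo = gamma
        else:
            hi = gamma
    return weights ** gamma                      # with lam = 1, the row sums to ~1
```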


Re initialization -- this makes perfect sense and is actually what I suspected you are doing :-)

As we show in https://www.nature.com/articles/s41587-020-00809-z, it's important to have informative initialization, but there are many possible choices for what is informative.

I can see two caveats here:

(1) When you perform the PCA on the centroid matrix to get the c x 2 initialization, do you set the scale of the PCs somehow? I always scale the PCs for initialization so that the std of PC1 is 0.0001, which is the default value for random initialization (see the linked paper). Using a large initialization sometimes interferes with convergence.

And (2), the way you do it, you will have lots of points that exactly overlap in the initialization. I found this to cause a lot of numerical problems, especially for Barnes-Hut but also for FFT the way it's implemented in FIt-SNE. So it's actually beneficial to add a tiny amount of noise to the initialization. But I don't know if it matters in the SG-tSNE implementation.
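Concretely, what I usually do for both points is something like this (the constants are just the ones I use, nothing SG-tSNE-specific):

```python
import numpy as np

def prepare_init(init, target_std=1e-4, jitter=1e-6, random_state=0):
    """Rescale the initialization so that PC1 has std 0.0001 (caveat 1) and add
    a tiny jitter so that no two points start at identical coordinates (caveat 2)."""
    rng = np.random.default_rng(random_state)
    init = np.asarray(init, dtype=float)
    init = init / np.std(init[:, 0]) * target_std
    init = init + rng.normal(scale=jitter, size=init.shape)
    return init
```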

In any case, your results look reasonable, so maybe none of these caveats matters for you.

parashardhapola (Owner) commented

Learning rate: These are some fantastic suggestions, thanks! It will be interesting to see how the embedding benefits from an increased learning rate. I expect that the runtime will increase when using higher values for the learning rate, so it will be interesting to compare the results of a low learning rate + more iterations against a high learning rate + fewer iterations. From the articles you shared, it seems that setting a high learning rate solves issues that a large number of iterations (within a reasonable limit) might not be able to solve. It will be interesting to see how the local neighborhood is preserved in these comparisons.

SG part: Yes, lambda is set to 1, but this is not what makes the row affinities sum to 1. This happens here in the SG-tSNE code as a pre-processing step. The lambda rescaling happens here. I think the lambda rescaling step is akin to smooth_knn_dist; hence, by default, the lambda rescaling is turned off (by setting the value to 1). In UMAP, the equivalents of lambda would be the local_connectivity and bandwidth parameters that regulate the sparsity of the graph. I'm by no means an expert here, so please feel free to correct me. Any clarification from the SG-tSNE developers is welcome here (I realized I can't tag them here).

Initialization: Again, thank you for the very constructive feedback here.

  1. Yes, it is a good idea to scale the PC axes, and I will definitely incorporate that into the code (I will create a separate issue to keep track of it). What we currently do is simply floor/ceil the extreme values on each axis using this function (a rough sketch of this idea follows after this list). This is done with the expectation that clusters of cells do not "fly off" in the embedding, creating large 'white spaces'. In practice, it does help.

  2. Good point. I have, though, never experienced any numerical errors in SG-tSNE; I'm not sure what avoids the problem here. We should also note that we encourage users to 'overcluster' in this step (default value: 1000 clusters). This in itself ensures that not too many points end up lumped together at the same coordinates. Maybe more usage of the tool will unveil this issue.
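For completeness, the clipping mentioned in point 1 is conceptually just something like this (a simplified stand-in for the actual function; the percentile cutoffs are placeholders, not our defaults):

```python
import numpy as np

def clip_embedding(init, low_pct=1, high_pct=99):
    """Floor/ceil extreme values on each axis so that no cluster 'flies off'.
    Illustrative only; the real function and cutoffs may differ."""
    lo = np.percentile(init, low_pct, axis=0)
    hi = np.percentile(init, high_pct, axis=0)
    return np.clip(init, lo, hi)
```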


dkobak commented May 7, 2021

> From the articles you shared, it seems that setting a high learning rate solves issues that a large number of iterations (within a reasonable limit) might not be able to solve. It will be interesting to see how the local neighborhood is preserved in these comparisons.

Yes, but what a high learning rate solves is the incomplete early exaggeration phase, because increasing the number of iterations without increasing the length of the early exaggeration phase won't help with that. I don't think this will matter much for local neighborhood preservation though... So I am actually very curious to see how the learning rate will affect your local neighborhood metric.

> SG part: Yes, lambda is set to 1, but this is not what makes the row affinities sum to 1.

Hmm, I am not so sure. The first link in your comment goes to a place where the entire P matrix is normalized to sum to 1, but not the individual rows... Equations 5-6 in the original paper suggest to me that the rows are normalized to 1 before that, via the lambda rescaling. But I now see that, due to Equation 6, they will sum to 1 independent of the value of lambda.

Repository owner locked and limited conversation to collaborators Jul 3, 2021

This issue was moved to a discussion.

You can continue the conversation there.
