
Dealing with challenging data sets #9

Open
ivan-marroquin opened this issue May 18, 2021 · 4 comments

Comments

@ivan-marroquin

Hi,

Many thanks for such a great package! I found the dimensionality reduction approach proposed in PaCMAP very interesting compared to other techniques, so I decided to give it a try with my data sets.

I tried different initial conditions:

  • Number of neighbors 10 or 32 (chosen according to the number of samples in the data set), with default MN_ratio and FP_ratio
  • MN_ratio 0.25 or 1.0, keeping the number of neighbors at 10 (or 32) and the default FP_ratio
  • FP_ratio 1.0 or 3.0, keeping the number of neighbors at 10 (or 32) and the default MN_ratio
  • apply_pca set to False or True
  • init set to 'random' or 'pca'

In all tests, I always get a "blob".

So, I am looking for your suggestions/comments. I have provided a Python script along with one of my data sets (see attached file).

Many thanks,

Ivan

testing_dim_reduction.zip

@hyhuang00
Collaborator

Hi Ivan,

Thank you for using our package! Are you trying to discover a particular structure in your dataset? Do you know what kind of structure it could be? For general purposes (such as discovering cluster structure), it may be useful to increase the FP_ratio while keeping all the other hyperparameters at their defaults. Keep in mind that PaCMAP can only discover structure that already exists in the dataset, so you may want to know what kind of structure you would like to find before you perform the dimensionality reduction.

@ivan-marroquin
Author

Hi @hyhuang00

Many thanks for your suggestions. I believe the dataset should have regions of high density that are interconnected or overlap to some degree. For that reason, I believe PaCMAP has some difficulty unfolding the original structure.

I will give it a try with an increased FP_ratio and let you know how it works.

Ivan

@cynrudin
Collaborator

cynrudin commented May 19, 2021 via email

@ivan-marroquin
Author

Hi Cynthia,

Thanks for the advice. I am sharing the results I obtained by generating my own nearest-neighbor pairs:

import numpy as np
import pacmap
from pynndescent import NNDescent

# train_attr_cube, n_neighbors and cpu_count are defined earlier in the script
tree = NNDescent(train_attr_cube, metric='minkowski', metric_kwds={'p': 0.3},
                 n_neighbors=n_neighbors, random_state=1969, n_jobs=cpu_count)
tree.prepare()

# query returns (indices, distances); drop the first column, which is
# each point's own index
nbrs_ = tree.query(train_attr_cube, k=n_neighbors + 1)
nbrs = nbrs_[0][:, 1:].astype(np.int32)

scaled_dist = np.ones((train_attr_cube.shape[0], n_neighbors), dtype=np.float32)

pair_neighbors = pacmap.pacmap.sample_neighbors_pair(train_attr_cube, scaled_dist,
                                                     nbrs, np.int32(n_neighbors))

embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=n_neighbors, MN_ratio=0.05,
                          FP_ratio=20.0, lr=1.0, pair_neighbors=pair_neighbors,
                          num_iters=200, apply_pca=True)

Note that n_neighbors is set according to the rule used in PaCMAP, and that FP_ratio = 2.0, 10.0, and 20.0 were tested.
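For reference, the n_neighbors rule mentioned above can be sketched as follows (this is my reading of PaCMAP's default heuristic and is worth verifying against the source):

```python
from math import log10

def default_n_neighbors(n_samples):
    # PaCMAP's default heuristic, as I understand it: 10 neighbors up to
    # 10,000 samples, then growing logarithmically with dataset size.
    if n_samples <= 10000:
        return 10
    return round(10 + 15 * (log10(n_samples) - 4))

print(default_n_neighbors(5000))    # 10
print(default_n_neighbors(100000))  # 25
```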

I noticed that with an increased FP_ratio, the computation time increased as well. The results are in the attached zip file.

Any comments/suggestions?

Ivan

fp_ratio_tests_pacmac.zip
