
Dealing with challenging data sets #9

Open
ivan-marroquin opened this issue May 18, 2021 · 4 comments

Comments

@ivan-marroquin

Hi,

Many thanks for such a great package! I found the dimensionality reduction approach proposed in PaCMAP very interesting compared to other techniques, so I decided to give it a try with my data sets.

I tried different initial conditions:

  • Number of neighbors 10 or 32 (chosen according to the number of samples in the data set), with default MN_ratio and FP_ratio
  • MN_ratio 0.25 or 1.0, keeping the number of neighbors at 10 (or 32) and the default FP_ratio
  • FP_ratio 1.0 or 3.0, keeping the number of neighbors at 10 (or 32) and the default MN_ratio
  • apply_pca set to False or True
  • init set to 'random' or 'pca'

In all tests, I always get a "blob".

So, I am looking for your suggestions/comments. I have provided a Python script along with one of my data sets (see attached file).

Many thanks,

Ivan

testing_dim_reduction.zip

@hyhuang00
Collaborator

Hi Ivan,

Thank you for using our package! Are you trying to discover a particular structure in your dataset? Do you know what kind of structure it could be? For general purposes (such as discovering cluster structure), it may be useful to increase the FP_ratio while keeping all the other hyperparameters at their defaults. Keep in mind that PaCMAP can only discover structure that already exists in the dataset, so you may want to know what kind of structure you would like to find before you perform the dimensionality reduction.

@ivan-marroquin
Author

Hi @hyhuang00

Many thanks for your suggestions. I believe the dataset should have regions of high density that are interconnected or overlap to some degree. For that reason, I believe PaCMAP has some difficulty unfolding the original structure.

I will give it a try with an increased FP_ratio and let you know how it works.

Ivan

@cynrudin
Collaborator

cynrudin commented May 19, 2021 via email

@ivan-marroquin
Author

Hi Cynthia,

Thanks for the advice. I am sharing the results I obtained by generating my own nearest-neighbor pairs:

import numpy as np
import pacmap
from pynndescent import NNDescent

# train_attr_cube, n_neighbors and cpu_count are defined earlier in the script
tree = NNDescent(train_attr_cube, metric='minkowski', metric_kwds={'p': 0.3},
                 n_neighbors=n_neighbors, random_state=1969, n_jobs=cpu_count)
tree.prepare()

# query returns (indices, distances); drop the first column, which is
# each point's own index
nbrs_ = tree.query(train_attr_cube, k=n_neighbors + 1)
nbrs = nbrs_[0][:, 1:].astype(np.int32)

scaled_dist = np.ones((train_attr_cube.shape[0], n_neighbors), dtype=np.float32)

pair_neighbors = pacmap.pacmap.sample_neighbors_pair(train_attr_cube, scaled_dist,
                                                     nbrs, np.int32(n_neighbors))

embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=n_neighbors, MN_ratio=0.05,
                          FP_ratio=20.0, lr=1.0, pair_neighbors=pair_neighbors,
                          num_iters=200, apply_pca=True)

Note that n_neighbors is set according to the rule used in PaCMAP, and that FP_ratio = 2.0, 10.0, and 20.0 were tested.
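For reference, the n_neighbors rule mentioned above can be sketched as follows (this is my reading of PaCMAP's default heuristic and is worth verifying against the source):

```python
from math import log10

def default_n_neighbors(n_samples):
    # PaCMAP's default heuristic, as I understand it: 10 neighbors up to
    # 10,000 samples, then growing logarithmically with dataset size.
    if n_samples <= 10000:
        return 10
    return round(10 + 15 * (log10(n_samples) - 4))

print(default_n_neighbors(5000))    # 10
print(default_n_neighbors(100000))  # 25
```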

I noticed that with an increased FP_ratio, the computation time increased as well. The results are in the attached zip file.

Any comments/suggestions?

Ivan

fp_ratio_tests_pacmac.zip
