
Summer Camp 2022 in Diffusion Map

Convention

There are $n$ data points, each of which is $p$-dimensional. The data are stored in an $n\times p$ matrix. After applying the diffusion map, we reduce the dimension from $p$ to $m$; hence the data are compressed into an $n\times m$ matrix.

Outline

| # | Topic | Demo | Contents | TODO |
|---|-------|------|----------|------|
| 0 | Eigen Decomposition & SVD | Link | eigen-decomposition, SVD | sparse matrix, low rank approximation |
| 1 | Transition Matrix & Diffusion Map | Link | Transition matrix, Spectral of transition matrix, Kernel bandwidth and outlier | ev of all ones matrix, ev of identity matrix, Mahalanobis distance |
| 2 | Diffusion Map on Simulation Data | Link | Bandwidth, Diffusion distance, Mobius strip, Klein bottle | bandwidth and recover manifold, ambient/geodesic distance |
| 3 | Diffusion Map on Real Data | Link | Fisheriris dataset, MNIST dataset, ECG dataset, EEG dataset, Roseland (SVD), self-tune bandwidth | EEG signal, wavelet transform & scattering transform |
| 4 | Clustering & Classification | Link | Clustering (k-means), Classification (SVM), metric | K-fold, Leave one subject out, metric |
| 5 | Dynamical Diffusion Map | Link | RRI, Wave shape, Spectrum | |

Eigen Decomposition & SVD

  • How to apply eigen-decomposition in MATLAB.
  • Reduce time complexity.
  • Know eigs and svd (see the sketch after this list).
  • The costs of eigen-decomposition and SVD are as follows:
    • eigen-decomposition: $9n^3$ if the matrix has size $n\times n$
    • SVD: $14nk^2+8k^3$ if the matrix has size $n\times k$ where $n\geq k$.
    • Please refer to Nicholas J. Higham, Functions of Matrices: Theory and Computation, (2008) for more detail.
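A minimal MATLAB sketch of these calls; the matrix sizes and the number of requested eigenpairs are illustrative assumptions.

```matlab
% Symmetric test matrix; sizes are illustrative assumptions.
n = 2000;
A = randn(n); A = (A + A') / 2;

[V, D] = eig(A);                 % full eigen-decomposition, roughly 9n^3 flops

m = 10;
[Vm, Dm] = eigs(A, m);           % only the m largest-magnitude eigenpairs

B = randn(n, 200);               % tall n-by-k matrix with n >= k
[U, S, W] = svd(B, 'econ');      % economy-size SVD, roughly 14nk^2 + 8k^3 flops
```

When only a few eigenpairs are needed, eigs avoids the full $9n^3$ cost, which is the point of the complexity comparison above.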

Transition Matrix & Diffusion Map

  • Apply the diffusion map step by step (a MATLAB sketch follows this block):
    1. distance matrix $d_{ij}=d(x_i, x_j)$, e.g. $d_{ij}=|x_i-x_j|$
    2. affinity matrix $W_{ij}=\exp\left(-\frac{{d_{ij}}^2}{\epsilon^2}\right)$
    3. transition matrix $K=D^{-1}W$, where the diagonal matrix satisfies $D_{ii}=\sum_jW_{ij}$
    4. eigen-decomposition of $K$, i.e. $KU=US$ where $S_{ii}=\lambda_i$
    5. dimension reduction by the largest $m$ eigenvectors $U'=\left[u_2, u_3, \cdots,u_{m+1}\right]$ (the trivial $u_1$ is dropped)
  • Tune the bandwidth to see when outliers occur.
  • What happens to the transition matrix if the bandwidth is too small? What are the eigenvalues and eigenvectors of the identity matrix?
  • What happens to the transition matrix if the bandwidth is too large? What are the eigenvalues and eigenvectors of the all-ones matrix?
  • Know different types of distances, e.g. the Mahalanobis distance. Please refer to Malik, Shen, Wu & Wu, (2018).
Click me to see something scary Let $\{x_i\}_{i=1}^n\subset \mathcal{M}$. Let the kernel matrix be $W_{ij}=k_\epsilon(x_i,x_j)$. The degree matrix is defined as $D_{ii}=\sum_jW_{ij}$. Then the (negative) graph Laplacian operator is defined as $$L=\frac{D^{-1}W-I}{\epsilon^2}.$$ Coifman and Lafon (2006) and Singer (2006) derived the bias term and the variance term, respectively. Given a "normalized" kernel with $m_0=1$, the graph Laplacian operator satisfies $$\sum_{j=1}^{n} L_{i j} f\left(x_{j}\right)=\frac{m_2}{2d}\Delta f\left(x_{i}\right)+O\left(\frac{\sqrt{\log(n)}}{n^{1 / 2} \epsilon^{d/2+1}}\right)+O(\epsilon).$$ Singer (2006) gives a lower bound on $\epsilon$: for a finite dataset, the bandwidth cannot approach zero. Intuitively, this is because there must be enough data points within the kernel bandwidth.
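A minimal MATLAB sketch of the five steps on toy data; the dataset, the bandwidth $\epsilon$, and the target dimension $m$ are assumptions to be tuned.

```matlab
% Toy data; epsilon and m are illustrative assumptions.
X = randn(500, 3);                 % n-by-p data matrix
epsilon = 0.5;                     % kernel bandwidth
m = 2;                             % reduced dimension

d = squareform(pdist(X));          % step 1: distance matrix d_ij
W = exp(-d.^2 / epsilon^2);        % step 2: affinity matrix
D = sum(W, 2);                     % degrees D_ii = sum_j W_ij
K = W ./ D;                        % step 3: transition matrix K = D^{-1} W

[U, S] = eigs(K, m + 1);           % step 4: top m+1 eigenpairs of K
[~, order] = sort(diag(S), 'descend');
U = U(:, order);
Y = U(:, 2:m + 1);                 % step 5: drop trivial u_1, keep u_2, ..., u_{m+1}
```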

Diffusion Map on Simulation Data

  • In practice, since $K=D^{-1}W$ is not symmetric, we use the symmetric matrix $D^{-1/2}WD^{-1/2}$, which is similar to $K$ and therefore has the same eigenvalues. Please refer to J. Banks, J. Garza-Vargas, A. Kulkarni, N. Srivastava, (2019) for more detail on the time complexity of the eigen-decomposition of a symmetric matrix.
  • In practice, we use knnsearch to construct the affinity instead of pdist, because knnsearch is based on a k-d tree, whose time complexity is $O(nk\log(k))$; in contrast, pdist followed by sort costs $O(n^2\log(n))$. (A sketch follows the PCA comparison below.)
  • Suppose a dataset lies on a $d$-dimensional manifold $M$ embedded in the ambient space $R^p$. The diffusion map reduces the dimension from $p$ to $m$ while preserving the topological properties of the manifold.
  • Different types of tori give different embedding figures.
  • Preserve topological properties, e.g. geodesic distance and diffusion distance. Please refer to A. Singer, H.-T. Wu, (2011) for more detail.
Click me to see comparison with PCA
    PCA is a linear dimension reduction method, so it cannot straighten the spring. Hence, PCA cannot recover the geodesic distance.
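A minimal sketch combining the two practical points above: a sparse kNN affinity built with knnsearch, followed by the symmetric normalization $D^{-1/2}WD^{-1/2}$. The dataset, $k$, and $\epsilon$ are assumptions.

```matlab
% Sparse kNN affinity via knnsearch, then D^{-1/2} W D^{-1/2};
% k and epsilon are illustrative assumptions.
X = randn(2000, 3);
n = size(X, 1);
k = 20;
epsilon = 0.5;

[idx, dist] = knnsearch(X, X, 'K', k);          % k-d tree based neighbor search
rows = repmat((1:n)', 1, k);
W = sparse(rows(:), idx(:), exp(-dist(:).^2 / epsilon^2), n, n);
W = max(W, W');                                 % symmetrize the kNN graph

deg = full(sum(W, 2));
Dhalf = spdiags(1 ./ sqrt(deg), 0, n, n);       % D^{-1/2}
A = Dhalf * W * Dhalf;                          % similar to K = D^{-1} W, same eigenvalues
[U, S] = eigs(A, 5);                            % eigenvectors of K are Dhalf * U
```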

Diffusion Map on Real Data

Two useful techniques

  • Roseland:

    • To accelerate the diffusion map, this algorithm is based on SVD.
    • The default number of landmarks is chosen as $\sqrt{n}$.
    • In my code, I apply a few steps of k-means to choose the landmarks (see the sketch after this list).
    • Please refer to Shen and Wu, (2019) for more detail.
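A rough MATLAB sketch of the landmark idea only: the dataset, bandwidth, and the plain row normalization are simplifying assumptions, not the exact Roseland construction; see Shen and Wu, (2019) for the correct normalization.

```matlab
% Landmark idea only; not the exact Roseland normalization.
X = randn(5000, 10);
n = size(X, 1);
nL = round(sqrt(n));                           % default number of landmarks

[~, L] = kmeans(X, nL, 'MaxIter', 5);          % a few k-means steps pick landmarks
dist = pdist2(X, L);                           % n-by-nL distances to landmarks
W = exp(-dist.^2 / 0.5^2);                     % data-to-landmark affinity
W = W ./ sum(W, 2);                            % crude row normalization (a simplification)

[U, S, V] = svds(W, 5);                        % SVD of the thin matrix replaces eigs
```

Because $W$ is only $n\times\sqrt{n}$, the SVD is far cheaper than eigen-decomposing the full $n\times n$ transition matrix.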
  • Self-tune:

    • The affinity matrix is created by $k(x_i,x_j)=\exp\left(-\frac{|x_i-x_j|^2}{\epsilon_i\epsilon_j}\right)$, where $\epsilon_i$ is a local scale, e.g. the distance from $x_i$ to its $k$-th nearest neighbor (a sketch follows).
    • Please refer to Zelnik-Manor & Perona, (2005) for more detail.
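A minimal sketch of the self-tuning affinity; the dataset and the choice $k=7$ (the value suggested in Zelnik-Manor & Perona, (2005)) are assumptions.

```matlab
% Self-tuning affinity; k = 7 follows Zelnik-Manor & Perona (2005).
X = randn(1000, 3);
k = 7;

[~, dist] = knnsearch(X, X, 'K', k + 1);    % +1: the first neighbor is x_i itself
epsLocal = dist(:, end);                    % epsilon_i = distance to k-th neighbor

d2 = pdist2(X, X).^2;
W = exp(-d2 ./ (epsLocal * epsLocal'));     % W_ij = exp(-|x_i-x_j|^2 / (eps_i eps_j))
```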

Clustering & Classification

Classifier

  • Clustering: k-means is an unsupervised learning method.
  • Classification: SVM is a supervised learning method.
    • There are two hyperparameters, the box constraint and the kernel scale. Note that the kernel scale plays the role of $\gamma$ in the equation $k(x,y)=\exp\left(-\gamma|x-y|^2\right)$.
    • Usually, we tune these two parameters on a $\log$ scale, as in the sketch below.
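A minimal sketch of log-scale tuning using MATLAB's templateSVM and fitcecoc; the grids, fold count, and the fisheriris dataset are illustrative assumptions.

```matlab
% Grid search over BoxConstraint and KernelScale on a log scale,
% scored by 5-fold cross-validated error; grids are assumptions.
load fisheriris
X = meas; y = species;

best = struct('loss', inf, 'C', NaN, 's', NaN);
for C = 10.^(-2:2)                     % log-scale grid for the box constraint
    for s = 10.^(-2:2)                 % log-scale grid for the kernel scale
        t = templateSVM('KernelFunction', 'gaussian', ...
                        'BoxConstraint', C, 'KernelScale', s);
        mdl = fitcecoc(X, y, 'Learners', t);          % multiclass SVM
        cvLoss = kfoldLoss(crossval(mdl, 'KFold', 5));
        if cvLoss < best.loss
            best = struct('loss', cvLoss, 'C', C, 's', s);
        end
    end
end
```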

Evaluation

  • Metrics: recall, precision, F1 score (a sketch of how to compute them follows this list).

  • Evaluation with k-fold cross validation.
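A minimal sketch of the metrics computed from a confusion matrix; yTrue and yPred are assumed label vectors, e.g. collected over the k folds.

```matlab
% Per-class recall, precision, and F1 from a confusion matrix.
C = confusionmat(yTrue, yPred);      % rows: true class, columns: predicted class

recall    = diag(C) ./ sum(C, 2);    % TP / (TP + FN), per class
precision = diag(C) ./ sum(C, 1)';   % TP / (TP + FP), per class
f1        = 2 * precision .* recall ./ (precision + recall);
```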

Dynamical Diffusion Map

Appendix for methodology

There are many methods for dimension reduction; the diffusion map is just one of them. Hence, you could compare the diffusion map with other methods. Feel free to use the data in my folder data.

  • Diffusion map
  • PCA (kernel PCA)
  • Locally linear embedding (LLE)
  • t-SNE

Material

HackMD version

Data

The data sources in the folder data are introduced as follows.

  1. Sphere data UniSphere.mat: There are 998 points on a sphere in (x, y, z)-coordinates.

This is generated from Brian Z Bentz (2021), ( https://www.mathworks.com/matlabcentral/fileexchange/57877-mysphere-n ), MATLAB Central File Exchange.

  2. Iris data irismat.mat: There are 3 categories of flowers, and each category contains 50 samples. Each flower has 4 features.

From the MATLAB dataset load('fisheriris').

  3. Fake ECG data: There are 1229 pulses, and each pulse is approximated by 141 points. The original recording is about 15 minutes long.

This is generated from McSharry PE, Clifford GD, Tarassenko L, Smith L. (2003).

  4. Real ECG data: The database contains 14552 heartbeat pulses from 290 people. There are two categories, normal and abnormal: 10506 abnormal pulses and 4046 normal pulses.

This dataset is from Kaggle and is called the PTB Diagnostic ECG Database.

  5. EEG spectrum: The database contains 4462 EEG epochs from 5 people. The EEG channel is Fpz-Cz, sampled at 100 Hz. The sleep stages are reduced to 5 stages: Awake, REM, N1, N2, N3.

This dataset is from PhysioNet and is called the Sleep-EDF Database.

  6. Klein bottle dataset

This dataset is generated and modified from David Smith (2022). Klein Bottle ( https://www.mathworks.com/matlabcentral/fileexchange/5880-klein-bottle ), MATLAB Central File Exchange.

  7. MNIST dataset: The database contains 5000 digit images of size 28x28. Each class contains 400-600 images.

This dataset is randomly chosen from here.

Code references

  1. src/Lazykmeans.m is modified from Kai (2021), Improved Nystrom Kernel Low-rank Approximation ( https://www.mathworks.com/matlabcentral/fileexchange/38422-improved-nystrom-kernel-low-rank-approximation ), MATLAB Central File Exchange.
  2. src/cluster_acc.m is from Dong Dong (2022). clustering accuracy ( https://www.mathworks.com/matlabcentral/fileexchange/77452-clustering-accuracy ), MATLAB Central File Exchange.

References

In no particular order

  1. J. de la Porte, B. M. Herbst, W. Hereman and S. J. van der Walt, An introduction to diffusion maps, (2008).
  2. J. Wang, Geometric Structure of High-Dimensional Data and Dimensionality Reduction.
  3. L. Zelnik-Manor and P. Perona, Self-Tuning Spectral Clustering, (2005).
  4. C. Shen and H.-T. Wu, Scalability and robustness of spectral embedding: landmark diffusion is all you need, (2019).
  5. J. Malik, C. Shen, H.-T. Wu and N. Wu, Connecting Dots -- from Local Covariance to Empirical Intrinsic Geometry and Locally Linear Embedding, (2018).
  6. A. Singer and H.-T. Wu, Orientability and diffusion maps, (2011).
  7. R. R. Lederman, R Talmon, H.-T. Wu, Y.-L. Lo and R. R. Coifman, Alternating diffusion for common manifold learning with application to sleep stage assessment, (2015).
  8. R. R. Coifman and S. Lafon, Diffusion maps, (2006).
  9. A. Singer, From graph to manifold Laplacian: The convergence rate, (2006).
  10. A. Singer and H.-T. Wu, Vector Diffusion Maps and the Connection Laplacian, (2011).
  11. Y.-T. Lin, J. Malik and H.-T. Wu, Wave-shape oscillatory model for nonstationary periodic time series analysis, (2021).

Contact me

Please do not hesitate to contact me (Sing-Yuan Yeh, [email protected]) if you have any questions.
