
Summer Camp 2022 in Diffusion Map

Convention

There are $n$ data points, each of which is $p$-dimensional. The data are stored in an $n\times p$ matrix. After applying the diffusion map, we reduce the dimension from $p$ to $m$; hence the data are compressed into an $n\times m$ matrix.

Outline

| # | Topic | Demo | Contents | TODO |
|---|-------|------|----------|------|
| 0 | Eigen Decomposition & SVD | Link | eigen-decomposition, SVD | sparse matrix, low rank approximation |
| 1 | Transition Matrix & Diffusion Map | Link | Transition matrix, Spectral of transition matrix, Kernel bandwidth and outlier | ev of all ones matrix, ev of identity matrix, Mahalanobis distance |
| 2 | Diffusion Map on Simulation Data | Link | Bandwidth, Diffusion distance, Mobius strip, Klein bottle | bandwidth and recover manifold, ambient/geodesic distance |
| 3 | Diffusion Map on Real Data | Link | Fisheriris dataset, MNIST dataset, ECG dataset, EEG dataset, Roseland (SVD), self-tune bandwidth | EEG signal, wavelet transform & scattering transform |
| 4 | Clustering & Classification | Link | Clustering (k-means), Classification (SVM), metric | K-fold, Leave one subject out, metric |
| 5 | Dynamical Diffusion Map | Link | RRI, Wave shape, Spectrum | |

Eigen Decomposition & SVD

  • How to apply eigen-decomposition in MATLAB.
  • Reduce time complexity.
  • Know eigs and svd (see the sketch after this list).
  • The costs of eigen-decomposition and SVD are as follows:
    • eigen-decomposition: $9n^3$ if the matrix has size $n\times n$
    • SVD: $14nk^2+8k^3$ if the matrix has size $n\times k$ where $n\geq k$.
    • Please refer to Nicholas J. Higham, Functions of Matrices: Theory and Computation, (2008) for more detail.
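A minimal MATLAB sketch of these calls; the matrix sizes and the number of requested eigenpairs are illustrative assumptions.

```matlab
% Symmetric test matrix; sizes are illustrative assumptions.
n = 2000;
A = randn(n); A = (A + A') / 2;

[V, D] = eig(A);                 % full eigen-decomposition, roughly 9n^3 flops

m = 10;
[Vm, Dm] = eigs(A, m);           % only the m largest-magnitude eigenpairs

B = randn(n, 200);               % tall n-by-k matrix with n >= k
[U, S, W] = svd(B, 'econ');      % economy-size SVD, roughly 14nk^2 + 8k^3 flops
```

When only a few eigenpairs are needed, eigs avoids the full $9n^3$ cost, which is the point of the complexity comparison above.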

Transition Matrix & Diffusion Map

  • Apply the diffusion map step by step (a MATLAB sketch follows this block):
    1. distance matrix $d_{ij}=d(x_i, x_j)$, e.g. $d_{ij}=|x_i-x_j|$
    2. affinity matrix $W_{ij}=\exp\left(-\frac{{d_{ij}}^2}{\epsilon^2}\right)$
    3. transition matrix $K=D^{-1}W$, where the diagonal matrix satisfies $D_{ii}=\sum_jW_{ij}$
    4. eigen-decomposition of $K$, i.e. $KU=US$ where $S_{ii}=\lambda_i$
    5. dimension reduction by the largest $m$ eigenvectors $U'=\left[u_2, u_3, \cdots,u_{m+1}\right]$ (the trivial $u_1$ is dropped)
  • Tune the bandwidth to see when outliers occur.
  • What happens to the transition matrix if the bandwidth is too small? What are the eigenvalues and eigenvectors of the identity matrix?
  • What happens to the transition matrix if the bandwidth is too large? What are the eigenvalues and eigenvectors of the all-ones matrix?
  • Know different types of distances, e.g. the Mahalanobis distance. Please refer to Malik, Shen, Wu & Wu, (2018).
Click me to see something scary Let $\{x_i\}_{i=1}^n\subset \mathcal{M}$. Let the kernel matrix be $W_{ij}=k_\epsilon(x_i,x_j)$. The degree matrix is defined as $D_{ii}=\sum_jW_{ij}$. Then the (negative) graph Laplacian operator is defined as $$L=\frac{D^{-1}W-I}{\epsilon^2}.$$ Coifman and Lafon (2006) and Singer (2006) derived the bias term and the variance term, respectively. Given a "normalized" kernel with $m_0=1$, the graph Laplacian operator satisfies $$\sum_{j=1}^{n} L_{i j} f\left(x_{j}\right)=\frac{m_2}{2d}\Delta f\left(x_{i}\right)+O\left(\frac{\sqrt{\log(n)}}{n^{1 / 2} \epsilon^{d/2+1}}\right)+O(\epsilon).$$ Singer (2006) gives a lower bound on $\epsilon$: for a finite dataset, the bandwidth cannot approach zero. Intuitively, this is because there must be enough data points within the kernel bandwidth.
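A minimal MATLAB sketch of the five steps on toy data; the dataset, the bandwidth $\epsilon$, and the target dimension $m$ are assumptions to be tuned.

```matlab
% Toy data; epsilon and m are illustrative assumptions.
X = randn(500, 3);                 % n-by-p data matrix
epsilon = 0.5;                     % kernel bandwidth
m = 2;                             % reduced dimension

d = squareform(pdist(X));          % step 1: distance matrix d_ij
W = exp(-d.^2 / epsilon^2);        % step 2: affinity matrix
D = sum(W, 2);                     % degrees D_ii = sum_j W_ij
K = W ./ D;                        % step 3: transition matrix K = D^{-1} W

[U, S] = eigs(K, m + 1);           % step 4: top m+1 eigenpairs of K
[~, order] = sort(diag(S), 'descend');
U = U(:, order);
Y = U(:, 2:m + 1);                 % step 5: drop trivial u_1, keep u_2, ..., u_{m+1}
```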

Diffusion Map on Simulation Data

  • In practice, since $K=D^{-1}W$ is not symmetric, we use the symmetric matrix $D^{-1/2}WD^{-1/2}$, which is similar to $K$ and therefore has the same eigenvalues. Please refer to J. Banks, J. Garza-Vargas, A. Kulkarni, N. Srivastava, (2019) for more detail on the time complexity of the eigen-decomposition of a symmetric matrix.
  • In practice, we use knnsearch to construct the affinity instead of pdist, because knnsearch is based on a k-d tree, whose time complexity is $O(nk\log(k))$; in contrast, pdist followed by sort costs $O(n^2\log(n))$. (A sketch follows the PCA comparison below.)
  • Suppose a dataset lies on a $d$-dimensional manifold $M$ embedded in the ambient space $R^p$. The diffusion map reduces the dimension from $p$ to $m$ while preserving the topological properties of the manifold.
  • Different types of tori give different embedding figures.
  • Preserve topological properties, e.g. geodesic distance and diffusion distance. Please refer to A. Singer, H.-T. Wu, (2011) for more detail.
Click me to see comparison with PCA
    PCA is a linear dimension reduction method, so it cannot straighten the spring. Hence, PCA cannot recover the geodesic distance.
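A minimal sketch combining the two practical points above: a sparse kNN affinity built with knnsearch, followed by the symmetric normalization $D^{-1/2}WD^{-1/2}$. The dataset, $k$, and $\epsilon$ are assumptions.

```matlab
% Sparse kNN affinity via knnsearch, then D^{-1/2} W D^{-1/2};
% k and epsilon are illustrative assumptions.
X = randn(2000, 3);
n = size(X, 1);
k = 20;
epsilon = 0.5;

[idx, dist] = knnsearch(X, X, 'K', k);          % k-d tree based neighbor search
rows = repmat((1:n)', 1, k);
W = sparse(rows(:), idx(:), exp(-dist(:).^2 / epsilon^2), n, n);
W = max(W, W');                                 % symmetrize the kNN graph

deg = full(sum(W, 2));
Dhalf = spdiags(1 ./ sqrt(deg), 0, n, n);       % D^{-1/2}
A = Dhalf * W * Dhalf;                          % similar to K = D^{-1} W, same eigenvalues
[U, S] = eigs(A, 5);                            % eigenvectors of K are Dhalf * U
```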

Diffusion Map on Real Data

Two useful techniques

  • Roseland:

    • To accelerate the diffusion map, this algorithm is based on SVD.
    • The default number of landmarks is chosen as $\sqrt{n}$.
    • In my code, I apply a few steps of k-means to choose the landmarks (see the sketch after this list).
    • Please refer to Shen and Wu, (2019) for more detail.
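A rough MATLAB sketch of the landmark idea only: the dataset, bandwidth, and the plain row normalization are simplifying assumptions, not the exact Roseland construction; see Shen and Wu, (2019) for the correct normalization.

```matlab
% Landmark idea only; not the exact Roseland normalization.
X = randn(5000, 10);
n = size(X, 1);
nL = round(sqrt(n));                           % default number of landmarks

[~, L] = kmeans(X, nL, 'MaxIter', 5);          % a few k-means steps pick landmarks
dist = pdist2(X, L);                           % n-by-nL distances to landmarks
W = exp(-dist.^2 / 0.5^2);                     % data-to-landmark affinity
W = W ./ sum(W, 2);                            % crude row normalization (a simplification)

[U, S, V] = svds(W, 5);                        % SVD of the thin matrix replaces eigs
```

Because $W$ is only $n\times\sqrt{n}$, the SVD is far cheaper than eigen-decomposing the full $n\times n$ transition matrix.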
  • Self-tune:

    • The affinity matrix is created by $k(x_i,x_j)=\exp\left(-\frac{|x_i-x_j|^2}{\epsilon_i\epsilon_j}\right)$, where $\epsilon_i$ is a local scale, e.g. the distance from $x_i$ to its $k$-th nearest neighbor (a sketch follows).
    • Please refer to Zelnik-Manor & Perona, (2005) for more detail.
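A minimal sketch of the self-tuning affinity; the dataset and the choice $k=7$ (the value suggested in Zelnik-Manor & Perona, (2005)) are assumptions.

```matlab
% Self-tuning affinity; k = 7 follows Zelnik-Manor & Perona (2005).
X = randn(1000, 3);
k = 7;

[~, dist] = knnsearch(X, X, 'K', k + 1);    % +1: the first neighbor is x_i itself
epsLocal = dist(:, end);                    % epsilon_i = distance to k-th neighbor

d2 = pdist2(X, X).^2;
W = exp(-d2 ./ (epsLocal * epsLocal'));     % W_ij = exp(-|x_i-x_j|^2 / (eps_i eps_j))
```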

Clustering & Classification

Classifier

  • Clustering: k-means is an unsupervised learning method.
  • Classification: SVM is a supervised learning method.
    • There are two hyperparameters, the box constraint and the kernel scale. Note that the kernel scale plays the role of $\gamma$ in the equation $k(x,y)=\exp\left(-\gamma|x-y|^2\right)$.
    • Usually, we tune these two parameters on a $\log$ scale, as in the sketch below.
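A minimal sketch of log-scale tuning using MATLAB's templateSVM and fitcecoc; the grids, fold count, and the fisheriris dataset are illustrative assumptions.

```matlab
% Grid search over BoxConstraint and KernelScale on a log scale,
% scored by 5-fold cross-validated error; grids are assumptions.
load fisheriris
X = meas; y = species;

best = struct('loss', inf, 'C', NaN, 's', NaN);
for C = 10.^(-2:2)                     % log-scale grid for the box constraint
    for s = 10.^(-2:2)                 % log-scale grid for the kernel scale
        t = templateSVM('KernelFunction', 'gaussian', ...
                        'BoxConstraint', C, 'KernelScale', s);
        mdl = fitcecoc(X, y, 'Learners', t);          % multiclass SVM
        cvLoss = kfoldLoss(crossval(mdl, 'KFold', 5));
        if cvLoss < best.loss
            best = struct('loss', cvLoss, 'C', C, 's', s);
        end
    end
end
```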

Evaluation

  • Metrics: recall, precision, F1 score (a sketch of how to compute them follows this list).

  • Evaluation with k-fold cross validation.
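A minimal sketch of the metrics computed from a confusion matrix; yTrue and yPred are assumed label vectors, e.g. collected over the k folds.

```matlab
% Per-class recall, precision, and F1 from a confusion matrix.
C = confusionmat(yTrue, yPred);      % rows: true class, columns: predicted class

recall    = diag(C) ./ sum(C, 2);    % TP / (TP + FN), per class
precision = diag(C) ./ sum(C, 1)';   % TP / (TP + FP), per class
f1        = 2 * precision .* recall ./ (precision + recall);
```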

Dynamical Diffusion Map

Appendix for methodology

There are many methods for dimension reduction; the diffusion map is just one of them. Hence, you could compare the diffusion map with other methods. Feel free to use the data in my folder data.

  • Diffusion map
  • PCA (kernel PCA)
  • Locally linear embedding (LLE)
  • t-SNE

Material

HackMD version

Data

The data sources in the folder data are introduced as follows.

  1. Sphere data UniSphere.mat: There are 998 points on a sphere in (x, y, z)-coordinates.

This is generated from Brian Z Bentz (2021), ( https://www.mathworks.com/matlabcentral/fileexchange/57877-mysphere-n ), MATLAB Central File Exchange.

  2. Iris data irismat.mat: There are 3 categories of flowers, and each category contains 50 samples. Each flower has 4 features.

From the MATLAB dataset load('fisheriris').

  3. Fake ECG data: There are 1229 pulses, and each pulse is approximated by 141 points. The original recording is about 15 minutes long.

This is generated from McSharry PE, Clifford GD, Tarassenko L, Smith L. (2003).

  4. Real ECG data: The database contains 14552 heartbeat pulses from 290 people. There are two categories, normal and abnormal: 10506 abnormal pulses and 4046 normal pulses.

This dataset is from Kaggle and is called the PTB Diagnostic ECG Database.

  5. EEG spectrum: The database contains 4462 EEG epochs from 5 people. The EEG channel is Fpz-Cz, sampled at 100 Hz. The sleep stages are reduced to 5 stages: Awake, REM, N1, N2, N3.

This dataset is from PhysioNet and is called the Sleep-EDF Database.

  6. Klein bottle dataset

This dataset is generated and modified from David Smith (2022). Klein Bottle ( https://www.mathworks.com/matlabcentral/fileexchange/5880-klein-bottle ), MATLAB Central File Exchange.

  7. MNIST dataset: The database contains 5000 digit images of size 28x28. Each class contains 400-600 images.

This dataset is randomly chosen from here.

Code references

  1. src/Lazykmeans.m is modified from Kai (2021), Improved Nystrom Kernel Low-rank Approximation ( https://www.mathworks.com/matlabcentral/fileexchange/38422-improved-nystrom-kernel-low-rank-approximation ), MATLAB Central File Exchange.
  2. src/cluster_acc.m is from Dong Dong (2022). clustering accuracy ( https://www.mathworks.com/matlabcentral/fileexchange/77452-clustering-accuracy ), MATLAB Central File Exchange.

References

In no particular order

  1. J. de la Porte, B. M. Herbst, W. Hereman and S. J. van der Walt, An introduction to diffusion maps, (2008).
  2. J. Wang, Geometric Structure of High-Dimensional Data and Dimensionality Reduction.
  3. L. Zelnik-Manor and P. Perona, Self-Tuning Spectral Clustering, (2005).
  4. C. Shen and H.-T. Wu, Scalability and robustness of spectral embedding: landmark diffusion is all you need, (2019).
  5. J. Malik, C. Shen, H.-T. Wu and N. Wu, Connecting Dots -- from Local Covariance to Empirical Intrinsic Geometry and Locally Linear Embedding, (2018).
  6. A. Singer and H.-T. Wu, Orientability and diffusion maps, (2011).
  7. R. R. Lederman, R Talmon, H.-T. Wu, Y.-L. Lo and R. R. Coifman, Alternating diffusion for common manifold learning with application to sleep stage assessment, (2015).
  8. R. R. Coifman and S. Lafon, Diffusion maps, (2006).
  9. A. Singer, From graph to manifold Laplacian: The convergence rate, (2006).
  10. A. Singer and H.-T. Wu, Vector Diffusion Maps and the Connection Laplacian, (2011).
  11. Y.-T. Lin, J. Malik and H.-T. Wu, Wave-shape oscillatory model for nonstationary periodic time series analysis, (2021).

Contact me

Please do not hesitate to contact me (Sing-Yuan Yeh, [email protected]) if you have any questions.
