There are
# | Topic | Demo | Contents | TODO | exp |
---|---|---|---|---|---|
0 | Eigen Decomposition & SVD | Link | | | |
1 | Transition Matrix & Diffusion Map | Link | | | |
2 | Diffusion Map on Simulation Data | Link | | | |
3 | Diffusion Map on Real Data | Link | | | |
4 | Clustering & Classification | Link | | | |
5 | Dynamical Diffusion Map | Link | | | |
- How to apply eigen-decomposition in MATLAB.
- Reduce time complexity.
- Know `eigs` and `svd`.
- The costs of eigen-decomposition and SVD are as
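As a minimal sketch of the difference between a full and a partial decomposition, the snippet below compares `eig`, `svd` and `eigs` on a random symmetric test matrix. The matrix, its size `n` and the number of requested eigenvalues `k` are illustrative assumptions, not values from this repository.

```matlab
% Sketch: full vs. partial decompositions of a symmetric test matrix.
rng(0);                          % reproducible random test matrix (assumption)
n = 300;
A = randn(n);
A = (A + A') / 2;                % symmetrize so that all eigenvalues are real

lambdaFull = eig(A);             % all n eigenvalues
sFull      = svd(A);             % all n singular values, in descending order

k = 5;
lambdaTop = eigs(A, k);          % only the k eigenvalues of largest magnitude

% For a symmetric matrix the singular values are the absolute eigenvalues,
% so the three columns printed below should agree.
topByEig  = sort(abs(lambdaFull), 'descend');
topByEigs = sort(abs(lambdaTop),  'descend');
disp([topByEig(1:k), topByEigs, sFull(1:k)]);
```

When only a few leading eigenpairs are needed, as in diffusion maps, asking `eigs` for `k` of them usually avoids the cost of the full decomposition.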
- Apply diffusion map step by step:
  - distance matrix $d_{ij}=d(x_i, x_j)$, e.g. $d_{ij}=|x_i-x_j|$
  - affinity matrix $W_{ij}=\exp\left(-\frac{{d_{ij}}^2}{\epsilon^2}\right)$
  - transition matrix $K=D^{-1}W$, where the diagonal matrix $D$ has entries $D_{ii}=\sum_jW_{ij}$
  - eigen-decomposition of $K$, i.e. $KU=US$ where $S_{ii}=\lambda_i$
  - dimension reduction by the eigenvectors of the $m$ largest nontrivial eigenvalues, $U'=\left[u_2, u_3, \cdots,u_{m+1}\right]$
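The steps above can be sketched in a few lines of MATLAB. This is a dense, toolbox-free illustration under stated assumptions: the toy data (points on a circle), the bandwidth `epsilon` and the target dimension `m` are hypothetical choices, and the variable names are not taken from the repository code.

```matlab
% Sketch of the diffusion map steps on toy data (all values are assumptions).
rng(0);
n = 200;
theta = 2*pi*rand(n, 1);
X = [cos(theta), sin(theta)];        % toy data: n points on the unit circle
epsilon = 0.5;                       % bandwidth (illustrative)
m = 2;                               % target dimension (illustrative)

% Step 1: squared Euclidean distances d_ij^2.
sq = sum(X.^2, 2);
D2 = max(sq + sq' - 2*(X*X'), 0);    % clip tiny negative round-off

% Step 2: affinity matrix W_ij = exp(-d_ij^2 / epsilon^2).
W = exp(-D2 / epsilon^2);

% Step 3: transition matrix K = D^{-1} W (each row sums to one).
d = sum(W, 2);
K = W ./ d;                          % implicit expansion, R2016b or later

% Step 4: eigen-decomposition K U = U S, sorted by decreasing eigenvalue.
[U, S] = eig(K);
[lambda, order] = sort(real(diag(S)), 'descend');
U = real(U(:, order));

% Step 5: drop the trivial eigenvector u_1 and keep u_2, ..., u_{m+1}.
Uprime = U(:, 2:m+1);
```

The leading eigenvalue `lambda(1)` equals 1 with a constant eigenvector, which is why the embedding starts from $u_2$.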
- Tune the bandwidth to see when outliers occur.
- What happens to the transition matrix if the bandwidth is too small? What are the eigenvalues and eigenvectors of the identity matrix?
- What happens to the transition matrix if the bandwidth is too large? What are the eigenvalues and eigenvectors of the all-ones matrix?
- Know different types of distance, e.g. the Mahalanobis distance. Please refer to Malik, Shen, Wu & Wu (2018).
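The two bandwidth extremes in the questions above can be checked numerically. The sketch below uses a hypothetical one-dimensional grid of 20 points and two deliberately extreme bandwidths; `sqdist`, `affinity` and `transition` are helper names introduced here for illustration only.

```matlab
% Sketch: transition matrix for a very small and a very large bandwidth.
X = linspace(0, 1, 20)';             % toy data: 20 equally spaced points
n = size(X, 1);

% Helper functions (hypothetical names, defined only for this example).
sqdist     = @(X) max(sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X'), 0);
affinity   = @(X, ep) exp(-sqdist(X) / ep^2);
transition = @(X, ep) affinity(X, ep) ./ sum(affinity(X, ep), 2);

Ksmall = transition(X, 1e-3);        % bandwidth far below the point spacing
Klarge = transition(X, 1e3);         % bandwidth far above the data diameter

errIdentity = norm(Ksmall - eye(n));      % distance to the identity matrix
errUniform  = norm(Klarge - ones(n)/n);   % distance to the scaled all-ones matrix
fprintf('|K_small - I| = %.2e, |K_large - ones/n| = %.2e\n', errIdentity, errUniform);
```

In the first case every point only sees itself, so no direction is preferred; in the second case every point sees all others equally, so only the constant eigenvector carries a nonzero eigenvalue.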
Click me to see something scary
Let
- In practice, since $K=D^{-1}W$ is not symmetric, we use the symmetric matrix $D^{-1/2}WD^{-1/2}$, which is similar to $K$. Please refer to J. Banks, J. Garza-Vargas, A. Kulkarni, N. Srivastava (2019) for more detail on the time complexity of eigen-decomposition of symmetric matrices.
- In practice, we use `knnsearch` instead of `pdist` to construct the affinity matrix, because `knnsearch` is based on the KD-tree algorithm, whose time complexity is $O(nk\log(k))$, whereas the time complexity of `pdist` followed by a sort is $O(n^2\log(n))$.
- Suppose a dataset lies on a $d$-dimensional manifold $M$ embedded in the ambient space $R^p$. Diffusion map reduces the dimension $p$ to dimension $m$ while preserving the topological properties of the manifold.
- Different types of torus give different types of embedding figure.
- Preserve topological properties, e.g. geodesic distance and diffusion distance. Please refer to A. Singer, H.-T. Wu (2011) for more detail.
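The two practical remarks above (a nearest-neighbour affinity and the symmetric normalization) can be combined as in the following sketch. It assumes the Statistics and Machine Learning Toolbox for `knnsearch` and R2017b or later for the `'largestreal'` option of `eigs`; the toy data and the values of `n`, `k`, `m` and `epsilon` are illustrative assumptions.

```matlab
% Sketch: sparse kNN affinity plus symmetric normalization (values are assumptions).
rng(0);
n = 500; k = 20; m = 2; epsilon = 0.3;
theta = 2*pi*rand(n, 1);
X = [cos(theta), sin(theta)];            % toy data on the unit circle

% k nearest neighbours of every point (each point is its own first neighbour).
[idx, dist] = knnsearch(X, X, 'K', k);

% Sparse affinity: only the k nearest neighbours of each point are stored.
rows = repmat((1:n)', 1, k);
W = sparse(rows(:), idx(:), exp(-dist(:).^2 / epsilon^2), n, n);
W = max(W, W');                          % symmetrize the kNN graph

% Symmetric normalization A = D^{-1/2} W D^{-1/2}, which is similar to K = D^{-1} W.
d = full(sum(W, 2));
Dinvhalf = spdiags(1 ./ sqrt(d), 0, n, n);
A = Dinvhalf * W * Dinvhalf;
A = (A + A') / 2;                        % remove round-off asymmetry

% Leading eigenpairs of the symmetric matrix.
[V, S] = eigs(A, m + 1, 'largestreal');
[lambda, order] = sort(diag(S), 'descend');
V = V(:, order);

% Map back: if A v = lambda v, then K (D^{-1/2} v) = lambda (D^{-1/2} v).
U = Dinvhalf * V;
K = spdiags(1 ./ d, 0, n, n) * W;
residual = norm(K * U - U * diag(lambda), 'fro');
```

Because `A` and `K` share their eigenvalues, `residual` should be close to zero, and the columns `U(:, 2:m+1)` can be used as the embedding.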
Click me to see comparison with PCA
- Roseland:
  - To accelerate the algorithm, Roseland is based on SVD.
  - By default, the number of landmarks is $\sqrt{n}$.
  - In my code, I apply a few steps of `k-means` to choose the landmarks.
  - Please refer to Shen and Wu (2019) for more detail.
- Self-tune:
  - The affinity matrix is created by $k(x_i,x_j)=\exp\left(-\frac{|x_i-x_j|^2}{\epsilon_i\epsilon_j}\right)$.
  - Please refer to Zelnik-Manor & Perona (2005) for more detail.
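Both variants above can be sketched without any toolbox. The code is a rough illustration under several assumptions, not the repository implementation: the local bandwidth $\epsilon_i$ is taken as the distance to the `kNN`-th neighbour, the landmarks are a random subset rather than the result of `k-means`, the landmark bandwidth is a median heuristic, and the final scaling of the landmark embedding is one possible choice.

```matlab
% Sketch: self-tuning affinity and a landmark (Roseland-style) embedding.
rng(0);
n = 400;
X = randn(n, 3);                         % toy data (assumption)
sq = sum(X.^2, 2);
D2 = max(sq + sq' - 2*(X*X'), 0);        % squared Euclidean distances

% --- Self-tuning affinity: k(x_i,x_j) = exp(-|x_i-x_j|^2 / (eps_i * eps_j)) ---
kNN = 7;                                 % neighbour index (illustrative)
sortedD = sort(sqrt(D2), 2);
epsLocal = sortedD(:, kNN + 1);          % column 1 is the point itself
Wst = exp(-D2 ./ (epsLocal * epsLocal'));

% --- Landmark embedding based on SVD ---
nL = round(sqrt(n));                     % default number of landmarks, sqrt(n)
landmarks = X(randperm(n, nL), :);       % random subset instead of k-means (assumption)
sqL = sum(landmarks.^2, 2);
D2L = max(sq + sqL' - 2*(X*landmarks'), 0);   % n-by-nL squared distances
epsilon = median(sqrt(D2L(:)));          % heuristic bandwidth (assumption)
Wr = exp(-D2L / epsilon^2);              % data-to-landmark affinity

% Degrees of the implied affinity Wr*Wr', computed without forming it.
dR = Wr * (Wr' * ones(n, 1));

% SVD of the normalized n-by-nL matrix; the leading singular value is 1.
[Ubar, Sig, ~] = svd(Wr ./ sqrt(dR), 'econ');
sigma = diag(Sig);

m = 2;                                   % target dimension (illustrative)
embedding = (Ubar(:, 2:m+1) ./ sqrt(dR)) .* (sigma(2:m+1)'.^2);
```

The landmark variant only decomposes an $n \times \sqrt{n}$ matrix instead of an $n \times n$ one, which is where the speed-up comes from.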
- Clustering: k-means is unsupervised learning.
- Classification: SVM is supervised learning.
- Please refer to Lin, Malik and Wu (2019) for more detail of dynamical diffusion map.
- There are three ways to apply feature extraction:
  - Reciprocal of RRI (instantaneous frequency). Please refer to Li, Frasch and Wu (2017).
  - Wave shape. Please refer to Lin, Malik and Wu (2019).
  - Spectrum. Here, we use the scattering transform to extract features. Please refer to Anden and Mallat (2013) for more detail on the scattering transform.
- There are many methods for dimension reduction, and diffusion map is just one of them. Hence, you could compare diffusion map with other methods. Feel free to use the data in my folder `data`.
The data sources in the folder `data` are introduced as follows.
- Sphere data `UniSphere.mat`: There are 998 points on a sphere in (x, y, z)-coordinates.
  This is generated from Brian Z Bentz (2021), ( https://www.mathworks.com/matlabcentral/fileexchange/57877-mysphere-n ), MATLAB Central File Exchange.
- Iris data `irismat.mat`: There are 3 categories of flowers and each category contains 50 samples. Each sample has 4 features.
  From the MATLAB database, `load('fisheriris')`.
- Fake ECG data: There are 1229 pulses and each pulse is approximated by 141 points. The original recording is about 15 minutes long.
  This is generated from McSharry PE, Clifford GD, Tarassenko L, Smith L. (2003).
- Real ECG data: The database contains 14552 heartbeat pulses from 290 people. There are two categories, normal and abnormal, with 10506 abnormal pulses and 4046 normal pulses.
  This dataset is from Kaggle, called the PTB Diagnostic ECG Database.
- EEG spectrum: The database contains 4462 EEG epochs from 5 people. The EEG channel is Fpz-Cz, sampled at 100 Hz. The sleep stages are reduced to 5 stages: Awake, REM, N1, N2, N3.
  This dataset is from Physionet, called the Sleep-EDF Database.
- Klein bottle dataset:
  This dataset is generated and modified from David Smith (2022), Klein Bottle ( https://www.mathworks.com/matlabcentral/fileexchange/5880-klein-bottle ), MATLAB Central File Exchange.
- MNIST dataset: The database contains 5000 digit images of size 28x28. Each class contains 400-600 images.
  This dataset is randomly chosen from here.
`src/Lazykmeans.m` is modified from Kai (2021), Improved Nystrom Kernel Low-rank Approximation ( https://www.mathworks.com/matlabcentral/fileexchange/38422-improved-nystrom-kernel-low-rank-approximation ), MATLAB Central File Exchange.

`src/cluster_acc.m` is from Dong Dong (2022), clustering accuracy ( https://www.mathworks.com/matlabcentral/fileexchange/77452-clustering-accuracy ), MATLAB Central File Exchange.
Random Arrangement
- J. de la Porte, B. M. Herbst, W. Hereman and S. J. van der Walt, An introduction to diffusion maps, (2008).
- J. Wang, Geometric Structure of High-Dimensional Data and Dimensionality Reduction.
- L. Zelnik-Manor and P. Perona, Self-Tuning Spectral Clustering, (2005).
- C. Shen and H.-T. Wu, Scalability and robustness of spectral embedding: landmark diffusion is all you need, (2019).
- J. Malik, C. Shen, H.-T. Wu and N. Wu, Connecting Dots -- from Local Covariance to Empirical Intrinsic Geometry and Locally Linear Embedding, (2018).
- A. Singer and H.-T. Wu, Orientability and diffusion maps, (2011).
- R. R. Lederman, R. Talmon, H.-T. Wu, Y.-L. Lo and R. R. Coifman, Alternating diffusion for common manifold learning with application to sleep stage assessment, (2015).
- R. R. Coifman and S. Lafon, Diffusion maps, (2006).
- A. Singer, From graph to manifold Laplacian: The convergence rate, (2006).
- A. Singer and H.-T. Wu, Vector Diffusion Maps and the Connection Laplacian, (2011).
- Y.-T. Lin, J. Malik and H.-T. Wu, Wave-shape oscillatory model for nonstationary periodic time series analysis, (2021).
Please do not hesitate to contact me (Sing-Yuan Yeh, [email protected]) if you have any questions.