robertsamples/pvclust

 
 

pvclust

This is a Python reimplementation of pvclust, the algorithm originally implemented in R by Suzuki and Shimodaira (2006): Pvclust: an R package for assessing the uncertainty in hierarchical clustering. For each cluster in the hierarchy it reports two values, the Approximately Unbiased p-value (AU) and the Bootstrap Probability (BP), which indicate how strongly the data support that cluster. The AU value is the less biased of the two, and clusters with an AU value greater than 95% are considered significant. Both values are calculated using multiscale bootstrap resampling.
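To make the relationship between the two values more concrete, here is a minimal sketch of the curve fit described for the R package: bootstrap probabilities observed at several relative sample sizes r are converted to z-values, the curve z(r) = v*sqrt(r) + c/sqrt(r) is fitted, and then AU = 1 - Phi(v - c) and BP = 1 - Phi(v + c). The function name `fit_au_bp` and the synthetic inputs are illustrative assumptions, not part of this repository, and the sketch uses ordinary least squares where the R package uses a weighted fit.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def fit_au_bp(scales, bps):
    """Fit z(r) = v*sqrt(r) + c/sqrt(r) and return (au, bp) as probabilities."""
    # Turn each bootstrap probability into a z-value: z = -Phi^{-1}(BP).
    zs = [-STD_NORMAL.inv_cdf(p) for p in bps]
    a = [math.sqrt(r) for r in scales]        # regressor multiplying v
    b = [1.0 / math.sqrt(r) for r in scales]  # regressor multiplying c

    # Solve the 2x2 normal equations of ordinary least squares by hand.
    saa = sum(x * x for x in a)
    sbb = sum(x * x for x in b)
    sab = sum(x * y for x, y in zip(a, b))
    saz = sum(x * z for x, z in zip(a, zs))
    sbz = sum(x * z for x, z in zip(b, zs))
    det = saa * sbb - sab * sab
    v = (saz * sbb - sbz * sab) / det
    c = (saa * sbz - sab * saz) / det

    au = 1.0 - STD_NORMAL.cdf(v - c)  # approximately unbiased p-value
    bp = 1.0 - STD_NORMAL.cdf(v + c)  # bootstrap probability at scale r = 1
    return au, bp


# Synthetic check (not real data): generate BP values from known v and c,
# then confirm that the fit recovers the matching AU and BP.
true_v, true_c = -2.0, 0.5
scales = [0.5 + 0.1 * i for i in range(10)]  # 0.5, 0.6, ..., 1.4
bps = [1.0 - STD_NORMAL.cdf(true_v * math.sqrt(r) + true_c / math.sqrt(r))
       for r in scales]
au, bp = fit_au_bp(scales, bps)
```

In this synthetic case AU comes out larger than BP, which illustrates how the two values can differ for the same cluster.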

This implementation was developed as part of a master's thesis at the Faculty of Computer and Information Science, University of Ljubljana.

Example

Here we show an example of using the Python implementation on the Boston Housing dataset.

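import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston
from pvclust import PvClust

if __name__ == "__main__":
    # Load the Boston Housing features; the target y is not used for clustering.
    X, y = load_boston(return_X_y=True)
    X = pd.DataFrame(X)
    # Ward linkage on Euclidean distances, with nboot=1000 bootstrap replications.
    pv = PvClust(X, method="ward", metric="euclidean", nboot=1000)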

While the algorithm is running, we can follow its progress through the bootstrap stages.

[Image: bootstrap_stages]
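Each stage corresponds to one bootstrap sample size. A minimal sketch of the multiscale resampling idea follows, assuming that a scale r means drawing about n * r observations with replacement from the n original ones. The helper `bootstrap_indices`, the rounding rule and the chosen scales are illustrative assumptions and may differ from what this implementation does internally.

```python
import random


def bootstrap_indices(n, r, rng):
    """Draw round(n * r) row indices from range(n) with replacement."""
    size = max(1, round(n * r))  # assumed rounding rule, for illustration
    return [rng.randrange(n) for _ in range(size)]


rng = random.Random(0)         # fixed seed so the sketch is reproducible
n = 506                        # number of rows in the Boston Housing data
scales = [0.5, 0.7, 1.0, 1.4]  # a few relative sample sizes r = n'/n
samples = {r: bootstrap_indices(n, r, rng) for r in scales}
sizes = {r: len(idx) for r, idx in samples.items()}
```

Each index list would then be used to build one resampled dataset, which is clustered again with the same linkage method and metric.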

To display the resulting dendrogram with p-values, we call pv.plot().

[Image: dendrogram]
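Each p-value on the dendrogram belongs to one cluster, that is, to the set of items under one branch. As a rough sketch of how a bootstrap probability for such a cluster can be counted, the code below computes the fraction of bootstrap dendrograms that contain exactly the same set of items. The function `bootstrap_probability` and the toy replicates are invented for illustration and are not taken from this repository.

```python
def bootstrap_probability(cluster, replicate_clusterings):
    """Fraction of replicates whose dendrogram contains exactly this cluster."""
    target = frozenset(cluster)
    hits = sum(1 for clusters in replicate_clusterings if target in clusters)
    return hits / len(replicate_clusterings)


# Toy input, invented for illustration: each replicate is the set of clusters
# (as frozensets of item labels) found in one bootstrap dendrogram.
replicates = [
    {frozenset("ab"), frozenset("abc"), frozenset("abcd")},
    {frozenset("ab"), frozenset("cd"), frozenset("abcd")},
    {frozenset("bc"), frozenset("abc"), frozenset("abcd")},
    {frozenset("ab"), frozenset("abc"), frozenset("abcd")},
]
bp_ab = bootstrap_probability({"a", "b"}, replicates)  # found in 3 of 4
bp_cd = bootstrap_probability({"c", "d"}, replicates)  # found in 1 of 4
```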

To display the results, we call the print_result function.

pv.print_result()

[Image: results]

Furthermore, if we are interested in specific clusters or want to control the number of decimal places, we can pass the which and digits arguments:

pv.print_result(which=[2, 6], digits=5)

[Image: results2]

The standard errors of the AU p-values can be plotted by calling the seplot function.

pv.seplot()

[Image: seplot]
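Standard errors arise because every bootstrap probability is estimated from a finite number of replicates. The sketch below shows only the basic ingredients: the binomial standard error of a single BP estimate and, via the delta method, the standard error of its z-value. It does not reproduce the AU standard error itself, and the function names are hypothetical and not part of this package.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def bp_standard_error(bp, nboot):
    """Binomial standard error of a proportion estimated from nboot replicates."""
    return math.sqrt(bp * (1.0 - bp) / nboot)


def z_standard_error(bp, nboot):
    """Delta-method standard error of z = -Phi^{-1}(bp)."""
    z = -STD_NORMAL.inv_cdf(bp)
    # |dz/dbp| = 1 / phi(z), so the error in bp is scaled by 1 / phi(z).
    return bp_standard_error(bp, nboot) / STD_NORMAL.pdf(z)


se_bp = bp_standard_error(0.5, 1000)
se_z = z_standard_error(0.5, 1000)
```

Both quantities shrink in proportion to 1/sqrt(nboot), so a larger nboot gives smaller standard errors at the cost of a longer run.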

We also implemented a parallel version, which is enabled by setting parallel=True. In this mode, the algorithm uses all the cores on the machine to speed up the calculation.

import pandas as pd
from sklearn.datasets import load_boston
from pvclust import PvClust

if __name__ == "__main__":
    X, y = load_boston(return_X_y=True)
    X = pd.DataFrame(X)
    # parallel=True runs the computation on all available cores.
    pv = PvClust(X, method="ward", metric="euclidean", nboot=1000, parallel=True)

[Image: parallel]
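As an illustration of how bootstrap work can be spread across cores, the sketch below splits the replicates into chunks and hands each chunk to a worker process using the standard library. It is a generic pattern under stated assumptions and not the code used by this package; `split_replicates`, `run_chunk` and `parallel_bootstrap` are hypothetical names.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def split_replicates(nboot, workers):
    """Split nboot replicates into per-worker chunk sizes that sum to nboot."""
    base, extra = divmod(nboot, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def run_chunk(task):
    """Draw one chunk of bootstrap index samples; runs inside a worker process."""
    n, size, count, seed = task
    rng = random.Random(seed)  # one seed per chunk keeps runs reproducible
    return [[rng.randrange(n) for _ in range(size)] for _ in range(count)]


def parallel_bootstrap(n, size, nboot, workers):
    """Distribute nboot resamples over worker processes and merge the results."""
    tasks = [(n, size, count, seed)
             for seed, count in enumerate(split_replicates(nboot, workers))]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(run_chunk, tasks)
        return [sample for chunk in chunks for sample in chunk]
```

A call such as parallel_bootstrap(506, 506, 1000, 4) should be made under an `if __name__ == "__main__":` guard, as in the examples here, because on platforms that start worker processes by spawning, the main module is imported again in each worker.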
