scikit-learn - Clustering

scikit-learn provides the sklearn.cluster module for clustering unlabeled data.

The sklearn.cluster module provides the following clustering algorithms:

[1] - K-means

[2] - Affinity Propagation

[3] - Mean Shift

[4] - Spectral clustering

[5] - Hierarchical clustering

[6] - DBSCAN

[7] - OPTICS

[8] - Birch

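Each clustering algorithm comes in two variants:

[1] - A class that implements the fit method to learn the clusters on the training data.

[2] - A function that, given the training data, returns an array of integer labels corresponding to the different clusters.

For the class, the cluster labels of the training data are available through the labels_ attribute.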
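The two variants can be compared side by side with K-means. This is a minimal sketch: the toy data, n_clusters=2, n_init=10, and random_state=0 are illustrative assumptions, not values from the text. Note that the function variant k_means returns a tuple of centroids, labels, and inertia rather than the labels alone.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Toy data (made up for illustration): two well-separated groups in 2-D.
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]])

# Class variant: fit() learns the clusters; labels_ holds the training labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: k_means returns (centroids, labels, inertia).
centroids, func_labels, inertia = k_means(
    X, n_clusters=2, n_init=10, random_state=0
)

print(class_labels)
print(func_labels)
```

The integer label assigned to each cluster is arbitrary, so only the grouping of the samples is meaningful, not which group is called 0 or 1.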

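Input data:

[1] - The clustering algorithms in sklearn.cluster accept different kinds of matrices as input.

[2] - All clustering algorithms accept a standard data matrix of shape [n_samples, n_features].

Matrices of this form can be obtained from the feature extraction module: sklearn.feature_extraction

[3] - AffinityPropagation, SpectralClustering, and DBSCAN also accept a similarity matrix of shape [n_samples, n_samples] as input.

Such matrices can be computed with the pairwise metrics module: sklearn.metrics.pairwise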
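Passing a precomputed [n_samples, n_samples] matrix can be sketched as follows. The toy data and the values gamma=0.1, eps=0.5, and min_samples=2 are assumptions chosen for illustration. One detail to keep in mind: SpectralClustering with affinity="precomputed" interprets the matrix as similarities, while DBSCAN with metric="precomputed" interprets it as distances.

```python
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.metrics.pairwise import pairwise_distances, rbf_kernel

# Toy data (made up for illustration): two tight groups far apart.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Similarity (affinity) matrix of shape (n_samples, n_samples).
S = rbf_kernel(X, gamma=0.1)
spectral = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
)
spectral_labels = spectral.fit_predict(S)

# Distance matrix of shape (n_samples, n_samples) for DBSCAN.
D = pairwise_distances(X, metric="euclidean")
dbscan_labels = DBSCAN(
    eps=0.5, min_samples=2, metric="precomputed"
).fit_predict(D)

print(S.shape, D.shape)
print(spectral_labels)
print(dbscan_labels)
```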

1. Comparison of clustering algorithms

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
| --- | --- | --- | --- | --- |
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |

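The difference between flat and non-flat geometry in the comparison can be illustrated on the two-moons toy dataset. This is only a sketch: the dataset, the noise level, and the parameters eps=0.2 and min_samples=5 are assumptions picked for this example, and the adjusted Rand index is used here simply as one convenient way to score agreement with the known grouping.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a typical non-flat geometry.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means assumes flat geometry (compact, roughly convex clusters).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points by density, so it can follow curved shapes.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On data like this, DBSCAN would be expected to score noticeably higher than K-Means, though the result depends on the choice of eps and min_samples.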
Last modification: April 28th, 2021 at 10:46 am