Faiss 构建了几个非常高效的实现:K-means 聚类、PCA降维、PQ量化(编码/解码).

Faiss building blocks: clustering, PCA, quantization

1. K-means 聚类

Faiss K-means 与 Scikit-learn K-means 实现的对比可参考:

K-Means 8x faster, 27x lower error than Scikit-learn’s in 25 lines

K-means 迭代(From: https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif)

例如,将给定 2D 张量 x 进行聚类的实现如:

ncentroids = 1024 niter = 20 verbose = True d = x.shape[1] #cpu kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose) #gpu,使用所有的gpu kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True) #gpu,使用 3 个gpu kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=3) # kmeans.train(x) cluster_centers = kmeans.centroids #聚类后的聚类中心 obj = kmeans.obj #目标函数,kmeans 中为总的平方差 iteration_stats = kmeans.iteration_stats #聚类中的统计信息

1.1. 参数说明

Kmeans 对象是 C++ Clustering 的封装层,其参数如:

struct ClusteringParameters { int niter; // 聚类迭代次数 int nredo; // 重复聚类的次数,保留最佳的centroids bool verbose; // 是否使聚类更加冗长(verbose) bool spherical; /// 是否球形 kmeans;是否需要在每次迭代后将 centroids 进行 L2 归一化 bool int_centroids; /// 是否将 centroids 坐标取整 bool update_index; /// 是否每次迭代后重训练索引 bool frozen_centroids; /// 是否使用预提供的 centroids 作为输入,并在迭代中更新 centroids int min_points_per_centroid; // int max_points_per_centroid; // int seed; // 随机数生成器种子 size_t decode_block_size; // ClusteringParameters(); };

1.2. 计算归属聚类

当 Kmeans 训练完成后,可以将向量 x 集映射到聚类中心,如:

D, I = kmeans.index.search(x, 1)

其中, I 中为 x 中每一行的向量所对应的最接近的聚类(centroid),D 包含了对应的平方 L2 距离.

对于倒排才做,如,查找 x 的 15 个最近邻点,需要建立索引,如:

index = faiss.IndexFlatL2 (d) index.add (x) D, I = index.search (kmeans.centroids, 15)

其中,I 的大小为 (ncentroids,15),包含了每个 centroid 的最近邻点.

1.3. FaissKmeans 使用

import faiss import numpy as np class FaissKMeans: def __init__(self, n_clusters=8, n_init=10, max_iter=300): self.n_clusters = n_clusters self.n_init = n_init self.max_iter = max_iter self.kmeans = None self.cluster_centers_ = None self.inertia_ = None def fit(self, X, y): self.kmeans = faiss.Kmeans(d=X.shape[1], k=self.n_clusters, niter=self.max_iter, nredo=self.n_init) self.kmeans.train(X.astype(np.float32)) self.cluster_centers_ = self.kmeans.centroids self.inertia_ = self.kmeans.obj[-1] def predict(self, X): return self.kmeans.index.search(X.astype(np.float32), 1)[1]

2. PCA 计算

例如,将 40D 向量降维到 10D,

#随机生成训练数据 mt = np.random.rand(1000, 40).astype('float32') mat = faiss.PCAMatrix (40, 10) mat.train(mt) assert mat.is_trained tr = mat.apply_py(mt) #print this to show that the magnitude of tr's columns is decreasing print((tr ** 2).sum(0))

3. PQ 量化

如:

d = 32 # data dimension cs = 4 # code size (bytes) #随机生成数据集 nt = 10000 xt = np.random.rand(nt, d).astype('float32') # dataset to encode (could be same as train) n = 20000 x = np.random.rand(n, d).astype('float32') # pq = faiss.ProductQuantizer(d, cs, 8) pq.train(xt) # encode codes = pq.compute_codes(x) # decode x2 = pq.decode(codes) # compute reconstruction error avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()

标量量化(scalar quantizer):

d = 32 # data dimension # train set nt = 10000 xt = np.random.rand(nt, d).astype('float32') # dataset to encode (could be same as train) n = 20000 x = np.random.rand(n, d).astype('float32') # QT_8bit allocates 8 bits per dimension (QT_4bit also works) sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit) sq.train(xt) # encode codes = sq.compute_codes(x) # decode x2 = sq.decode(codes) # compute reconstruction error avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
Last modification:March 15th, 2021 at 09:51 am