Faiss - 聚类与PCA及量化实现

Faiss 构建了几个非常高效的实现：K-means 聚类、PCA降维、PQ量化(编码/解码).

Faiss building blocks: clustering, PCA, quantization

1. K-means 聚类

Faiss K-means 与 Scikit-learn K-means 实现的对比可参考：
K-Means 8x faster, 27x lower error than Scikit-learn’s in 25 lines

K-means 迭代(From: https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif)

例如，将给定 2D 张量 x 进行聚类的实现如：


ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
#cpu
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
#gpu，使用所有的gpu
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
#gpu，使用 3 个gpu
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=3)
#
kmeans.train(x)
cluster_centers = kmeans.centroids #聚类后的聚类中心
obj = kmeans.obj #目标函数，kmeans 中为总的平方差
iteration_stats = kmeans.iteration_stats #聚类中的统计信息

1.1. 参数说明

Kmeans 对象是 C++ Clustering 的封装层，其参数如：


struct ClusteringParameters {
    int niter; // 聚类迭代次数
    int nredo; // 重复聚类的次数，保留最佳的centroids

    bool verbose; // 是否使聚类更加冗长(verbose)
    bool spherical;        /// 是否球形 kmeans；是否需要在每次迭代后将 centroids 进行 L2 归一化
    bool int_centroids;    /// 是否将 centroids 坐标取整
    bool update_index;     /// 是否每次迭代后重训练索引
    bool frozen_centroids; /// 是否使用预提供的 centroids 作为输入，并在迭代中更新 centroids
    int min_points_per_centroid; //
    int max_points_per_centroid; //

    int seed; // 随机数生成器种子

    size_t decode_block_size; //
  
    ClusteringParameters();
};

1.2. 计算归属聚类

当 Kmeans 训练完成后，可以将向量 x 集映射到聚类中心，如：


D, I = kmeans.index.search(x, 1)

其中， I 中为 x 中每一行的向量所对应的最接近的聚类(centroid)，D 包含了对应的平方 L2 距离.

对于倒排才做，如，查找 x 的 15 个最近邻点，需要建立索引，如：


index = faiss.IndexFlatL2 (d)
index.add (x)
D, I = index.search (kmeans.centroids, 15)

其中，I 的大小为 (ncentroids，15)，包含了每个 centroid 的最近邻点.

1.3. FaissKmeans 使用


import faiss
import numpy as np

class FaissKMeans:
    def __init__(self, n_clusters=8, n_init=10, max_iter=300):
        self.n_clusters = n_clusters
        self.n_init = n_init
        self.max_iter = max_iter
        self.kmeans = None
        self.cluster_centers_ = None
        self.inertia_ = None

    def fit(self, X, y):
        self.kmeans = faiss.Kmeans(d=X.shape[1],
                                   k=self.n_clusters,
                                   niter=self.max_iter,
                                   nredo=self.n_init)
        self.kmeans.train(X.astype(np.float32))
        self.cluster_centers_ = self.kmeans.centroids
        self.inertia_ = self.kmeans.obj[-1]

    def predict(self, X):
        return self.kmeans.index.search(X.astype(np.float32), 1)[1]

2. PCA 计算

例如，将 40D 向量降维到 10D,


#随机生成训练数据
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix (40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
#print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))

3. PQ 量化

如：


d = 32  # data dimension
cs = 4  # code size (bytes)

#随机生成数据集
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

#
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()

标量量化(scalar quantizer):


d = 32  # data dimension

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode 
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()