K-means Clustering with an OpenCV Implementation

2023-11-19

Clustering

kmeans (k-means clustering)

Finds centers of clusters and groups input samples around the clusters.

C++: double kmeans(InputArray data, int K, InputOutputArray bestLabels, TermCriteria criteria, int attempts, int flags, OutputArray centers=noArray() )

Python: cv2.kmeans(data, K, criteria, attempts, flags[, bestLabels[, centers]]) → retval, bestLabels, centers

C: int cvKMeans2(const CvArr* samples, int cluster_count, CvArr* labels, CvTermCriteria termcrit, int attempts=1, CvRNG* rng=0, int flags=0, CvArr* _centers=0, double* compactness=0 )

Python: cv.KMeans2(samples, nclusters, labels, termcrit, attempts=1, flags=0, centers=None) → float


Parameters:

samples – Floating-point matrix of input samples, one row per sample.
data – Data for clustering.
cluster_count – Number of clusters to split the set by.
K – Number of clusters to split the set by.
labels – Input/output integer array that stores the cluster indices for every sample.
criteria – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster centers moves by less than criteria.epsilon on some iteration, the algorithm stops.
termcrit – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy.
attempts – Flag to specify the number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness (see the last function parameter).
rng – CvRNG state initialized by RNG().
flags – Flag that can take the following values:
- KMEANS_RANDOM_CENTERS Select random initial centers in each attempt.
- KMEANS_PP_CENTERS Use kmeans++ center initialization by Arthur and Vassilvitskii [Arthur2007].
- KMEANS_USE_INITIAL_LABELS During the first (and possibly the only) attempt, use the user-supplied labels instead of computing them from the initial centers. For the second and further attempts, use the random or semi-random centers. Use one of the KMEANS_*_CENTERS flags to specify the exact method.
centers – Output matrix of the cluster centers, one row per each cluster center.
_centers – Output matrix of the cluster centers, one row per each cluster center.
compactness – The returned value that is described below.
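
The `KMEANS_PP_CENTERS` flag refers to kmeans++ seeding: the first center is drawn uniformly from the samples, and each later center is drawn with probability proportional to the squared distance from a sample to its nearest already-chosen center. The following is a minimal sketch of that rule in plain C++ with no OpenCV dependency. The names `Point2`, `squaredDistance`, and `seedCentersPlusPlus` are hypothetical and introduced only for illustration; OpenCV's internal implementation may differ in its details.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

using Point2 = std::array<double, 2>;

// Squared Euclidean distance between two 2-D points.
double squaredDistance(const Point2& a, const Point2& b) {
    const double dx = a[0] - b[0];
    const double dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

// Hypothetical helper (not an OpenCV function): pick k initial centers
// from `points` following the kmeans++ seeding rule.
std::vector<Point2> seedCentersPlusPlus(const std::vector<Point2>& points,
                                        std::size_t k,
                                        std::mt19937& gen) {
    assert(k >= 1 && k <= points.size());

    std::vector<Point2> centers;
    centers.reserve(k);

    // First center: uniformly at random among the samples.
    std::uniform_int_distribution<std::size_t> pickUniform(0, points.size() - 1);
    centers.push_back(points[pickUniform(gen)]);

    // minDist2[i] = squared distance from sample i to its nearest chosen center.
    std::vector<double> minDist2(points.size());
    for (std::size_t i = 0; i < points.size(); ++i)
        minDist2[i] = squaredDistance(points[i], centers[0]);

    while (centers.size() < k) {
        const double total = std::accumulate(minDist2.begin(), minDist2.end(), 0.0);
        std::size_t index;
        if (total > 0.0) {
            // Draw the next center with probability proportional to minDist2.
            std::discrete_distribution<std::size_t> pickNext(minDist2.begin(),
                                                             minDist2.end());
            index = pickNext(gen);
        } else {
            // Every sample coincides with a chosen center: fall back to a uniform pick.
            index = pickUniform(gen);
        }
        const Point2 next = points[index];
        centers.push_back(next);

        // Update the nearest-center distances with the newly added center.
        for (std::size_t i = 0; i < points.size(); ++i) {
            const double d2 = squaredDistance(points[i], next);
            if (d2 < minDist2[i]) minDist2[i] = d2;
        }
    }
    return centers;
}
```

Because a sample that is already a center has weight zero, far-away samples are favoured for the next pick, which tends to spread the initial centers across the data.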

The function kmeans implements a k-means algorithm that finds the centers of cluster_count clusters and groups the input samples around the clusters. As an output, $\texttt{labels}_i$ contains a 0-based cluster index for the sample stored in the $i^{th}$ row of the samples matrix.

The function returns the compactness measure that is computed as

$$\sum_i \| \texttt{samples}_i - \texttt{centers}_{\texttt{labels}_i} \|^2$$

after every attempt. The best (minimum) value is chosen and the corresponding labels and the compactness value are returned by the function. Basically, you can use only the core of the function, set the number of attempts to 1, initialize labels each time using a custom algorithm, pass them with the (flags = KMEANS_USE_INITIAL_LABELS) flag, and then choose the best (most-compact) clustering.
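
The compactness is the sum of squared distances between each sample and the center of the cluster it is labelled with. As a minimal sketch, it can be recomputed by hand from the labels and centers. The code below is plain C++ without OpenCV, and `Point2` and `compactness` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Point2 = std::array<double, 2>;

// Hypothetical helper (not part of OpenCV): sum of squared distances between
// every sample and the center of the cluster it is labelled with.
double compactness(const std::vector<Point2>& samples,
                   const std::vector<int>& labels,
                   const std::vector<Point2>& centers) {
    assert(samples.size() == labels.size());

    double sum = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        assert(labels[i] >= 0 &&
               static_cast<std::size_t>(labels[i]) < centers.size());
        const Point2& c = centers[static_cast<std::size_t>(labels[i])];
        const double dx = samples[i][0] - c[0];
        const double dy = samples[i][1] - c[1];
        sum += dx * dx + dy * dy;
    }
    return sum;
}
```

For the samples (0,0), (2,0), (10,0), (12,0) with centers (1,0) and (11,0), the labelling 0, 0, 1, 1 gives a compactness of 4, while the swapped labelling 1, 1, 0, 0 gives 404. A lower value means a tighter clustering, which is why the attempt with the minimum value is kept.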

Note

An example on K-means clustering can be found at opencv_source_code/samples/cpp/kmeans.cpp
(Python) An example on K-means clustering can be found at opencv_source_code/samples/python2/kmeans.py

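With this picture of samples grouped around cluster centers, we can derive the objective function that k-means optimizes. Suppose there are N data points to be divided into K clusters; k-means minimizes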

$$J = \sum_{n=1}^N\sum_{k=1}^K r_{nk} \|x_n-\mu_k\|^2$$

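where $r_{nk}$ is 1 when data point n is assigned to cluster k and 0 otherwise. Finding the $r_{nk}$ and $\mu_k$ that minimize $J$ directly is not easy, but we can proceed iteratively. First fix $\mu_k$ and choose the optimal $r_{nk}$: clearly, assigning each data point to its nearest center minimizes $J$. Next fix $r_{nk}$ and solve for the optimal $\mu_k$: differentiating $J$ with respect to $\mu_k$ and setting the derivative to zero shows that at the minimum $\mu_k$ must satisfy: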

$$\mu_k=\frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}$$

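That is, $\mu_k$ should be the mean of all data points in cluster k. Since each of the two steps minimizes $J$ over the variables it updates, $J$ can only decrease (or stay the same) and never increase, which guarantees that k-means eventually reaches a local minimum. Although k-means is not guaranteed to find the global optimum, for a problem like this an algorithm of such low complexity already gives quite good results.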
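
The differentiation step can be written out explicitly. With the assignments $r_{nk}$ held fixed, only the terms of $J$ that involve $\mu_k$ contribute to the derivative:

$$
\frac{\partial J}{\partial \mu_k}
= \frac{\partial}{\partial \mu_k}\sum_{n=1}^N r_{nk}\|x_n-\mu_k\|^2
= -2\sum_{n=1}^N r_{nk}\,(x_n-\mu_k) = 0
\quad\Longrightarrow\quad
\mu_k\sum_{n=1}^N r_{nk} = \sum_{n=1}^N r_{nk}\,x_n
$$

Dividing by $\sum_n r_{nk}$, the number of points assigned to cluster k, gives the mean formula, provided the cluster is not empty.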

The steps of the k-means algorithm can be summarized as follows:

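1. Choose initial values for the K centers $\mu_k$. This is often done with a problem-specific heuristic, or in most cases by random selection. As noted earlier, k-means does not guarantee a global optimum, and whether it converges to one depends heavily on the initial values, so k-means is sometimes run several times with different initial values and the best result is kept.
2. Assign each data point to the cluster represented by its nearest center.
3. Compute the new center of each cluster with $\mu_k = \frac{1}{N_k}\sum_{j\in\text{cluster}_k}x_j$.
4. Repeat steps 2 and 3 until the maximum number of iterations is reached or the value of $J$ changes by less than a threshold between successive iterations.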
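
The steps above can be sketched in plain C++ without OpenCV. This is an illustrative sketch, not OpenCV's implementation: `Point2`, `KMeansResult`, `squaredDistance`, and `runKMeans` are hypothetical names, the caller supplies the initial centers (step 1), and the stopping rule compares successive values of $J$ as in step 4, whereas `criteria.epsilon` in OpenCV is described in terms of how far the centers move.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point2 = std::array<double, 2>;

struct KMeansResult {
    std::vector<int> labels;      // cluster index of each sample
    std::vector<Point2> centers;  // final cluster centers
    double J = 0.0;               // objective value for the returned labels and centers
    int iterations = 0;           // number of iterations actually run
};

// Squared Euclidean distance between two 2-D points.
double squaredDistance(const Point2& a, const Point2& b) {
    const double dx = a[0] - b[0];
    const double dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

// Hypothetical helper: one k-means run starting from the given centers.
KMeansResult runKMeans(const std::vector<Point2>& x,
                       std::vector<Point2> centers,  // step 1: initial centers
                       int maxIterations,
                       double threshold) {
    assert(!x.empty() && !centers.empty() && maxIterations >= 1);

    KMeansResult result;
    result.labels.assign(x.size(), 0);
    double previousJ = std::numeric_limits<double>::infinity();

    for (int it = 0; it < maxIterations; ++it) {
        // Step 2: assign every point to its nearest center (ties go to the lower index).
        for (std::size_t n = 0; n < x.size(); ++n) {
            double best = std::numeric_limits<double>::infinity();
            for (std::size_t k = 0; k < centers.size(); ++k) {
                const double d2 = squaredDistance(x[n], centers[k]);
                if (d2 < best) {
                    best = d2;
                    result.labels[n] = static_cast<int>(k);
                }
            }
        }

        // Step 3: move each center to the mean of the points assigned to it.
        std::vector<Point2> sums(centers.size(), Point2{0.0, 0.0});
        std::vector<std::size_t> counts(centers.size(), 0);
        for (std::size_t n = 0; n < x.size(); ++n) {
            const std::size_t k = static_cast<std::size_t>(result.labels[n]);
            sums[k][0] += x[n][0];
            sums[k][1] += x[n][1];
            ++counts[k];
        }
        for (std::size_t k = 0; k < centers.size(); ++k) {
            if (counts[k] > 0)  // an empty cluster keeps its previous center
                centers[k] = {sums[k][0] / counts[k], sums[k][1] / counts[k]};
        }

        // Evaluate the objective J for the current labels and updated centers.
        double J = 0.0;
        for (std::size_t n = 0; n < x.size(); ++n)
            J += squaredDistance(x[n], centers[static_cast<std::size_t>(result.labels[n])]);
        result.J = J;
        result.iterations = it + 1;

        // Step 4: stop once J has decreased by less than the threshold.
        if (previousJ - J < threshold) break;
        previousJ = J;
    }

    result.centers = centers;
    return result;
}
```

For the four points (0,0), (0,2), (10,0), (10,2) with initial centers (1,1) and (9,1), maximum 10 iterations and threshold 1e-9, the run stops after two iterations with labels 0, 0, 1, 1, centers (0,1) and (10,1), and $J = 4$. Running it from several different initial centers and keeping the result with the smallest $J$ corresponds to the multiple-attempts strategy mentioned in step 1.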

OpenCV implementation:

```cpp
// Note: this sample uses the legacy CV_* constants of the OpenCV 2.x API.
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/core/core.hpp"
#include <iostream>

using namespace cv;
using namespace std;

int main( int /*argc*/, char** /*argv*/ )
{
    const int MAX_CLUSTERS = 5;
    Scalar colorTab[] =     // at most 5 clusters, so 5 colors are enough
    {
        Scalar(0, 0, 255),
        Scalar(0,255,0),
        Scalar(255,100,100),
        Scalar(255,0,255),
        Scalar(0,255,255)
    };

    Mat img(500, 500, CV_8UC3);
    RNG rng(12345); // random number generator

    for(;;)
    {
        int k, clusterCount = rng.uniform(2, MAX_CLUSTERS+1);
        int i, sampleCount = rng.uniform(1, 1001);
        // One row per sample: a 2-channel column vector whose elements are Point2f
        Mat points(sampleCount, 1, CV_32FC2), labels;

        clusterCount = MIN(clusterCount, sampleCount);
        Mat centers(clusterCount, 1, points.type());    // receives the cluster centers

        /* generate random sample from multigaussian distribution */
        for( k = 0; k < clusterCount; k++ )
        {
            Point center;
            center.x = rng.uniform(0, img.cols);
            center.y = rng.uniform(0, img.rows);
            // The samples may not divide evenly among the clusters;
            // the last cluster takes whatever remains.
            Mat pointChunk = points.rowRange(k*sampleCount/clusterCount,
                                             k == clusterCount - 1 ? sampleCount :
                                             (k+1)*sampleCount/clusterCount);
            // Every cluster has the same standard deviation; only the mean differs
            rng.fill(pointChunk, CV_RAND_NORMAL, Scalar(center.x, center.y), Scalar(img.cols*0.05, img.rows*0.05));
        }

        // Shuffle the points before clustering. Note that pointChunk shares
        // its data with points, so the fills above wrote directly into points.
        randShuffle(points, 1, &rng);

        // Run 3 attempts and keep the best one; centers are initialized with kmeans++.
        kmeans(points, clusterCount, labels,
               TermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0),
               3, KMEANS_PP_CENTERS, centers);

        img = Scalar::all(0);

        // Draw each sample in the color of its cluster
        for( i = 0; i < sampleCount; i++ )
        {
            int clusterIdx = labels.at<int>(i);
            Point ipt = points.at<Point2f>(i);
            circle( img, ipt, 2, colorTab[clusterIdx], CV_FILLED, CV_AA );
        }

        imshow("clusters", img);

        char key = (char)waitKey();     // wait indefinitely for a key press
        if( key == 27 || key == 'q' || key == 'Q' ) // 'ESC'
            break;
    }

    return 0;
}
```

References: http://www.cnblogs.com/tornadomeet/archive/2012/11/23/2783709.html

http://blog.csdn.net/heavendai/article/details/7029465
