(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011
between a data vector x(j) and the nearest cluster centroid ci:

SSE = ∑_{i=1}^{K} ∑_{x(j)∈Ci} ‖x(j) − ci‖²

The above Eq. is a form of quantization encountered in clustering, particularly intended for compressing data. In vector quantization, the cluster centroids appearing in the above Eq. are called codebook vectors. The codebook vectors partition the input space into nearest neighbor regions Vi. A region Vi is associated with the nearest cluster centroid ci by

Vi = {x : ‖x − ci‖ ≤ ‖x − cl‖, ∀l}

(nearest neighbor condition). Cluster Ci in the above Eq. is now the set of input data points that belong to Vi.

K-means

k-means refers to a family of algorithms that appear often in the context of vector quantization. K-means algorithms are tremendously popular in clustering and are often used for exploratory purposes. As a clustering model the vector quantizer has an obvious limitation: the nearest neighbor regions are convex, which limits the shape of the clusters that can be separated.

We consider only the batch k-means algorithm; different sequential procedures are explained elsewhere. The batch K-means algorithm proceeds by applying alternately, in successive steps, the centroid and nearest neighbor conditions that are necessary for optimal vector quantization:

1. Given a codebook of vectors ci, i = 1, 2, ..., K, associate the data vectors to the codebook vector that is nearest according to the nearest neighbor condition. Now each codebook vector has a set of data vectors Ci associated to it.
2. Update the codebook vectors to the centroids of the sets Ci according to the centroid condition. That is, for all i set ci := (1/|Ci|) ∑_{j∈Ci} xj.
3. Repeat from step 1 until the codebook vectors ci do not change any more.

When the iteration stops, a local minimum of the quantity SSE has been reached. K-means typically converges very fast. Furthermore, when K << N, K-means is computationally far less expensive than the hierarchical agglomerative methods, since computing the KN distances between the codebook vectors and the data vectors suffices.

Well-known problems with the K-means procedure are that it converges, but only to a local minimum, and that it is quite sensitive to initial conditions. A simple initialization is to start the procedure using K randomly picked vectors from the sample. A first-aid solution for trying to avoid bad local minima is to repeat K-means a couple of times from different initial conditions. More advanced solutions include using some form of stochastic relaxation, among other modifications.
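As a concrete illustration of the batch procedure, the following plain-Python sketch alternates the nearest neighbor and centroid conditions until the codebook stops changing, and repeats the run from different random initializations, keeping the lowest SSE. The function names and the restart wrapper are our own illustration, not part of the paper.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def centroid(C):
    """Componentwise mean of the vectors in the set C."""
    n = len(C)
    return [sum(v[d] for v in C) / n for d in range(len(C[0]))]

def batch_kmeans(data, K, rng, max_iter=100):
    # Simple initialization: K randomly picked vectors from the sample.
    codebook = [list(v) for v in rng.sample(data, K)]
    for _ in range(max_iter):
        # Nearest neighbor condition: associate each data vector
        # with its nearest codebook vector, forming the sets C_i.
        sets = [[] for _ in range(K)]
        for x in data:
            nearest = min(range(K), key=lambda i: sq_dist(x, codebook[i]))
            sets[nearest].append(x)
        # Centroid condition: c_i := (1/|C_i|) * sum of x in C_i
        # (an empty set keeps its previous codebook vector).
        new_codebook = [centroid(C) if C else c
                        for C, c in zip(sets, codebook)]
        # Stop when the codebook vectors do not change any more.
        if new_codebook == codebook:
            break
        codebook = new_codebook
    # SSE: squared distance of each vector to its nearest codebook vector.
    sse = sum(min(sq_dist(x, c) for c in codebook) for x in data)
    return codebook, sse

def kmeans_restarts(data, K, restarts=5, seed=0):
    """Repeat K-means from different initial conditions and keep the
    run with the lowest SSE, to reduce the risk of a bad local minimum."""
    rng = random.Random(seed)
    return min((batch_kmeans(data, K, rng) for _ in range(restarts)),
               key=lambda result: result[1])
```

Run on two well-separated blobs with K = 2, the restart wrapper reliably recovers the two blob centroids; a single run can still end in a poor local minimum, which is exactly the sensitivity to initial conditions discussed above.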
http://sites.google.com/site/ijcsis/ ISSN 1947-5500