(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011
between a data vector x(j) and the nearest cluster centroid ci:

SSE = ∑_{i=1}^{K} ∑_{x(j)∈Ci} ‖x(j) − ci‖²

The above Eq. is a form of quantization encountered in clustering, particularly intended for compressing data. In vector quantization, the cluster centroids appearing in the above Eq. are called codebook vectors. The codebook vectors partition the input space into nearest neighbor regions Vi. A region Vi is associated with the nearest cluster centroid ci by

Vi = {x : ‖x − ci‖ ≤ ‖x − cl‖, ∀l}

(nearest neighbor condition). Cluster Ci in the above Eq. is now the set of input data points that belong to Vi.

K-means

k-means refers to a family of algorithms that appear often in the context of vector quantization. K-means algorithms are tremendously popular in clustering and are often used for exploratory purposes. As a clustering model the vector quantizer has an obvious limitation: the nearest neighbor regions are convex, which limits the shape of the clusters that can be separated.

We consider only the batch k-means algorithm; different sequential procedures are explained elsewhere. The batch K-means algorithm proceeds by applying alternately, in successive steps, the centroid and nearest neighbor conditions that are necessary for optimal vector quantization:

1. Given a codebook of vectors ci, i = 1, 2, ..., K, associate the data vectors to the codebook vector that is nearest according to the nearest neighbor condition. Now each codebook vector has a set of data vectors Ci associated to it.
2. Update the codebook vectors to the centroids of the sets Ci according to the centroid condition. That is, for all i set ci := (1/|Ci|) ∑_{j∈Ci} xj.
3. Repeat from step 1 until the codebook vectors ci do not change any more.

When the iteration stops, a local minimum of the quantity SSE has been reached. K-means typically converges very fast. Furthermore, when K << N, K-means is computationally far less expensive than the hierarchical agglomerative methods, since computing the KN distances between the codebook vectors and the data vectors suffices.

Well-known problems with the K-means procedure are that it converges, but only to a local minimum, and that it is quite sensitive to initial conditions. A simple initialization is to start the procedure using K randomly picked vectors from the sample. A first-aid solution for trying to avoid bad local minima is to repeat K-means a couple of times from different initial conditions. More advanced solutions include using some form of stochastic relaxation, among other modifications.
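As a concrete illustration of the batch procedure, the following plain-Python sketch alternates the nearest neighbor and centroid conditions until the codebook stops changing, and repeats the run from different random initializations, keeping the lowest SSE. The function names and the restart wrapper are our own illustration, not part of the paper.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def centroid(C):
    """Componentwise mean of the vectors in the set C."""
    n = len(C)
    return [sum(v[d] for v in C) / n for d in range(len(C[0]))]

def batch_kmeans(data, K, rng, max_iter=100):
    # Simple initialization: K randomly picked vectors from the sample.
    codebook = [list(v) for v in rng.sample(data, K)]
    for _ in range(max_iter):
        # Nearest neighbor condition: associate each data vector
        # with its nearest codebook vector, forming the sets C_i.
        sets = [[] for _ in range(K)]
        for x in data:
            nearest = min(range(K), key=lambda i: sq_dist(x, codebook[i]))
            sets[nearest].append(x)
        # Centroid condition: c_i := (1/|C_i|) * sum of x in C_i
        # (an empty set keeps its previous codebook vector).
        new_codebook = [centroid(C) if C else c
                        for C, c in zip(sets, codebook)]
        # Stop when the codebook vectors do not change any more.
        if new_codebook == codebook:
            break
        codebook = new_codebook
    # SSE: squared distance of each vector to its nearest codebook vector.
    sse = sum(min(sq_dist(x, c) for c in codebook) for x in data)
    return codebook, sse

def kmeans_restarts(data, K, restarts=5, seed=0):
    """Repeat K-means from different initial conditions and keep the
    run with the lowest SSE, to reduce the risk of a bad local minimum."""
    rng = random.Random(seed)
    return min((batch_kmeans(data, K, rng) for _ in range(restarts)),
               key=lambda result: result[1])
```

Run on two well-separated blobs with K = 2, the restart wrapper reliably recovers the two blob centroids; a single run can still end in a poor local minimum, which is exactly the sensitivity to initial conditions discussed above.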
http://sites.google.com/site/ijcsis/ ISSN 1947-5500