Journal of Computer Science and Information Security January 2011 by ijcsis Editor

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011

A New Approach for Clustering Categorical Attributes

Parul Agarwal1, M. Afshar Alam2

Ranjit Biswas3

Department of Computer Science, Jamia Hamdard (Hamdard University), New Delhi - 110062, India parul.pragna4@gmail.com, aalam@jamiahamdard.ac.in

Manav Rachna International University, Green Fields Colony, Faridabad, Haryana 121001 ranjitbiswas@yahoo.com

Abstract- Clustering is the process of grouping similar objects together, placing each object in the cluster to which it is most similar. In this paper we provide a new measure for calculating the similarity between two clusters for categorical attributes; the approach used is agglomerative hierarchical clustering.

Keywords- Agglomerative hierarchical clustering, Categorical Attributes, Number of Matches.

INTRODUCTION

Data mining is a process of extracting useful information. Clustering is one of the problems addressed in data mining: it discovers interesting patterns in the underlying data by grouping similar objects together in one cluster (or clusters) and dissimilar objects in other clusters. This grouping depends on the approach used by the algorithm and on the similarity measure, which identifies the similarity between an object and a cluster. The approach in turn depends on the clustering method chosen. Clustering methods are broadly divided into hierarchical and partitional.

Hierarchical clustering performs partitioning sequentially and works either bottom-up or top-down. The bottom-up approach, known as agglomerative, starts with each object in a separate cluster and keeps combining the two most similar objects or clusters, according to the similarity measure, until all objects are combined in one big cluster. The top-down approach, known as divisive, starts with all objects in one big cluster and divides the large cluster into smaller clusters until each cluster consists of a single object. The general approach of hierarchical clustering is to use an appropriate metric, which measures the distance between two tuples, and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of the observations in the sets. The linkage criterion can be of three types [28]: single linkage, average linkage and complete linkage.

In single linkage (also known as nearest neighbour), the distance between two clusters is computed as D(Ci,Cj) = min { d(a,b) : a ∈ Ci, b ∈ Cj }. Thus the distance between clusters is defined as the distance between the closest pair of objects, where only one object from each cluster is considered, i.e. the distance between two clusters is given by the value of the shortest link between the clusters.

In the complete linkage method (or farthest neighbour), the distance between clusters is defined as the distance between the most distant pair of objects, one from each cluster. D(Ci,Cj) is computed as D(Ci,Cj) = max { d(a,b) : a ∈ Ci, b ∈ Cj }, i.e. the distance between two clusters is given by the value of the longest link between the clusters.

In average linkage, D(Ci,Cj) = (1 / (l1 * l2)) Σ d(a,b), where the sum runs over all a ∈ Ci and b ∈ Cj, l1 is the cardinality of cluster Ci, l2 is the cardinality of cluster Cj, and d(a,b) is the distance defined.

Partitional clustering, on the other hand, breaks the data into disjoint clusters.

In Section II we discuss the related work. Section III presents our algorithm, Section IV the experimental results, Section V the conclusion and Section VI the future work.
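The single, complete and average linkage criteria, together with the bottom-up merging of the agglomerative approach, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm proposed in this paper: the function names are hypothetical, and `mismatch_distance` (a count of differing attribute values) is an assumed stand-in for whatever distance d(a,b) is chosen.

```python
from itertools import product


def mismatch_distance(a, b):
    """Assumed d(a, b): number of attributes on which two tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(ci, cj, d=mismatch_distance):
    """Shortest link: distance between the closest pair, one object per cluster."""
    return min(d(a, b) for a, b in product(ci, cj))


def complete_linkage(ci, cj, d=mismatch_distance):
    """Longest link: distance between the most distant pair, one object per cluster."""
    return max(d(a, b) for a, b in product(ci, cj))


def average_linkage(ci, cj, d=mismatch_distance):
    """Sum of all pairwise distances divided by l1 * l2 (the cluster cardinalities)."""
    return sum(d(a, b) for a, b in product(ci, cj)) / (len(ci) * len(cj))


def agglomerate(objects, linkage=single_linkage):
    """Bottom-up clustering: start with one cluster per object and merge the
    closest pair of clusters until a single cluster remains.
    Returns the merge history as (cluster_a, cluster_b, distance) triples."""
    clusters = [[obj] for obj in objects]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        history.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


if __name__ == "__main__":
    ci = [("red", "small", "round"), ("red", "large", "round")]
    cj = [("blue", "large", "square"), ("red", "large", "square")]
    print(single_linkage(ci, cj))    # 1
    print(complete_linkage(ci, cj))  # 3
    print(average_linkage(ci, cj))   # 2.0
```

For the two example clusters the four pairwise distances are 3, 2, 2 and 1, so single linkage gives 1, complete linkage gives 3 and average linkage gives 8 / (2 * 2) = 2.0.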

RELATED WORK

Hierarchical clustering has its basis in older algorithms: the Lance-Williams formula (based on the Williams dissimilarity update formula, which calculates the dissimilarities between a newly formed cluster and the existing points from the dissimilarities found prior to forming the new cluster), conceptual clustering, SLINK [1] and COBWEB [2], as well as newer algorithms like CURE [3] and CHAMELEON [4]. The SLINK algorithm performs single-link (nearest-neighbour) clustering on arbitrary dissimilarity coefficients and constructs a representation of the dendrogram which can be

In single linkage (also known as nearest neighbour), the distance between two clusters is computed as: D(Ci,Cj) = min { d(a,b) : a ∈ Ci, b ∈ Cj }.

http://sites.google.com/site/ijcsis/ ISSN 1947-5500