International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 4, Oct 2013, 143-150 © TJPRC Pvt. Ltd.
CLUSTER ANALYSIS – AN OVERVIEW

ANURADHA BHATIA¹ & GAURAV VASWANI²

¹Faculty, Department of Computer, VES Polytechnic, Mumbai, Maharashtra, India
²Student, Department of Computer Technology, VESIT, Mumbai, Maharashtra, India
ABSTRACT

Cluster analysis, also called segmentation analysis or taxonomy analysis, aims to partition objects into homogeneous groups, named clusters, according to given criteria. Clustering is a very important technique of knowledge discovery for human beings. It has a long history and can be traced back to the times of Aristotle. These days, cluster analysis is mainly conducted on computers to deal with very large-scale and complex datasets. With the development of computer-based techniques, clustering has been widely used in data mining, ranging from web mining, image processing, machine learning, artificial intelligence, pattern recognition, social network analysis, bio-informatics, geography, geology, biology, psychology, sociology, and customer behaviour analysis to marketing, e-business and other fields.
KEYWORDS: Cluster Analysis, K-Means, Hierarchical, Genes, Microdata, Problems

INTRODUCTION

The clustering of large datasets in data mining is an iterative process involving humans. Thus, the user's initial estimate of the number of clusters is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Likewise, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage. All of this relies heavily on the user's visual perception of the data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data. Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful clusters are the goal, then the resulting clusters should capture the "natural" structure of the data. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, and to provide a grouping of spatial locations prone to earthquakes. However, in other cases, cluster analysis is only a useful starting point for other purposes, e.g., data compression or efficiently finding the nearest neighbours of points. Whether for understanding or utility, cluster analysis has long been used in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types.
This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems; they are the subject of this survey. Cluster analysis, like factor analysis, makes no distinction between dependent and independent variables: the entire set of interdependent relationships is examined. Cluster analysis is the obverse of factor analysis. Whereas
factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters. Cluster analysis stayed inside academic circles for a long time, but the recent "big data" wave has made it relevant to BI, data visualization and data mining users.
Figure 1: Data Visualization

Cluster analysis can usually be defined as a method for finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. A main reason to use such a method is to reduce the size of large data sets. Some people confuse clustering with classification, segmentation, partitioning or the results of queries, but this is a mistake.
PURPOSE OF CLUSTER ANALYSIS

Clustering occurs in almost every aspect of daily life. A factory's Health and Safety Committee may be regarded as a cluster of people. Supermarkets display items of a similar nature, such as types of meat or vegetables, in the same or nearby locations. Biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. In medicine, the clustering of symptoms and diseases leads to taxonomies of illnesses. In the field of business, clusters of consumer segments are often sought for successful marketing strategies. Cluster analysis (CA) is an exploratory data analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups, or clusters, based on combinations of independent variables, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknown. In this sense, CA creates new groupings without any preconceived notion of what clusters may arise, whereas discriminant analysis classifies people and items into already known groups. CA provides no explanation as to why the clusters exist, nor is any interpretation made. Each cluster thus describes, in terms of the data collected, the class to which its members belong. Items in each cluster are similar in some ways to each other and dissimilar to those in other clusters. Using cluster analysis, a customer 'type' can represent a homogeneous market segment. Identifying their particular needs in that market allows products to be designed with greater precision and direct appeal within the segment. Targeting specific segments is cheaper and more accurate than broad-scale marketing. Customers respond better to segment marketing which addresses their specific needs, leading to increased market share and customer retention. This is valuable, for example, in banking, insurance and tourism markets.
Imagine four clusters or market segments in the vacation travel industry. They are:
The Elite – they want top-level service and expect to be pampered;
The Escapists – they want to get away and just relax;
The Educationalist – they want to see new things, go to museums, have a safari, or experience new cultures;
The Sports Person – they want the golf course, tennis court, surﬁng, deep-sea ﬁshing, climbing, etc.
Different brochures and advertising are required for each of these. Brand image analysis, or defining product 'types' by customer perceptions, allows a company to see where its products are positioned in the market relative to those of its competitors. This type of modelling is valuable for branding new products or identifying possible gaps in the market. Clustering supermarket products by linked purchasing patterns can be used to plan store layouts, maximizing spontaneous purchasing opportunities. Banking institutions have used hierarchical cluster analysis to develop a typology of customers for two purposes: to retain the loyalty of members by designing the best possible new financial products to meet the needs of different groups (clusters), i.e. new product opportunities; and to capture more market share by identifying which existing services are most profitable for which type of customer, thereby improving market penetration.
Figure 2: Clustering Types

What Cluster Analysis is Not

Cluster analysis is a classification of objects from the data, where by classification we mean a labelling of objects with class (group) labels. As such, clustering does not use previously assigned class labels, except perhaps for verification of how well the clustering worked. Thus, cluster analysis is distinct from pattern recognition or the areas of statistics known as discriminant analysis and decision analysis, which seek to find rules for classifying objects given a set of pre-classified objects. While cluster analysis can be useful in the previously mentioned areas, either directly or as a preliminary means of finding classes, there is much more to these areas than cluster analysis. For example, the decision of what features to use when representing objects is a key activity of fields such as pattern recognition. Cluster analysis typically takes the features as given and proceeds from there. Thus, cluster analysis, while a useful tool in many areas (as described later), is normally only part of a solution to a larger problem which typically involves other steps and techniques.
Figure 3: Stages of Clustering
CLUSTERING METHODS

Because we usually don't know the number of groups or clusters that will emerge in our sample, and because we want an optimum solution, a two-stage sequence of analysis occurs as follows: We carry out a hierarchical cluster analysis using Ward's method, applying squared Euclidean distance as the distance or similarity measure. This helps to determine the optimum number of clusters we should work with.
The next stage is to rerun the hierarchical cluster analysis with our selected number of clusters, which enables us to allocate every case in our sample to a particular cluster.

Hierarchical Cluster Analysis

This is the major statistical method for finding relatively homogeneous clusters of cases based on measured characteristics. It starts with each case as a separate cluster, i.e. there are as many clusters as cases, and then combines the clusters sequentially, reducing the number of clusters at each step until only one cluster is left. The clustering method uses the dissimilarities or distances between objects when forming the clusters. The SPSS programme calculates 'distances' between data points in terms of the specified variables. A hierarchical tree diagram, called a dendrogram in SPSS, can be produced to show the linkage points. The clusters are linked at increasing levels of dissimilarity. The actual measure of dissimilarity depends on the measure used, for example:
Figure 4: Hierarchical Clustering

This example illustrates clusters B and C being combined at the fusion value of 2, and BC with A at 4. The fusion values or linkage distances are calculated by SPSS. The goal of the clustering algorithm is to join objects together into successively larger clusters, using some measure of similarity or distance. At the left of the dendrogram we begin with each object or case in a class by itself (in our example above there are only four cases). In very small steps, we 'relax' our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision of when to declare two or more objects to be members of the same cluster. As a result, we link more and more objects together and amalgamate larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together as one cluster. In these plots, the horizontal axis denotes the fusion or linkage distance. For each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. As a general process, clustering can be summarized as follows:
1. The distance is calculated between all initial clusters. In most analyses, initial clusters will be made up of individual cases.
2. Then the two most similar clusters are fused and distances recalculated.
3. Step 2 is repeated until all cases are eventually in one cluster.
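As an illustrative sketch, the general process above can be written out in plain Python. This toy uses single linkage (the smallest pairwise point distance between two clusters) simply to keep the sketch short, rather than the Ward's method discussed in this paper; the function names and the four sample points are invented for illustration:

```python
# Minimal agglomerative (bottom-up) clustering sketch.
# Starts with each point as its own cluster, repeatedly fuses the
# two closest clusters, and records the fusion distance at each
# step -- the values a dendrogram would display.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2, points):
    # Cluster-to-cluster distance = smallest pairwise point distance.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    clusters = [[i] for i in range(len(points))]   # one cluster per case
    fusions = []                                   # (cluster, cluster, distance)
    while len(clusters) > 1:
        # Find the two most similar clusters.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points),
        )
        d = single_linkage(clusters[a], clusters[b], points)
        fusions.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]    # fuse the pair
        del clusters[b]
    return fusions

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for left, right, dist in agglomerate(points):
    print(left, right, round(dist, 2))
```

Each printed triple is one fusion: the two clusters joined and the linkage distance at which they merged, i.e. the criterion distance read off a dendrogram node.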
K-Means Clustering

This method of clustering is very different from hierarchical clustering and Ward's method, which are applied when there is no prior knowledge of how many clusters there may be or what they are characterized by. K-means clustering is used when you already have hypotheses concerning the number of clusters in your cases or variables. You may want to 'tell' the computer to form exactly three clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k-means clustering algorithm. In general, the k-means method will produce exactly the k clusters demanded, of the greatest possible distinction. Very frequently, both the hierarchical and the k-means techniques are used successively.
The former (Ward’s method) is used to get some sense of the possible number of clusters and the way they merge as seen from the dendrogram.
Then the clustering is rerun with only a chosen optimum number of clusters in which to place all the cases (k-means clustering).
One of the biggest problems with cluster analysis is identifying the optimum number of clusters: as the fusion process continues, increasingly dissimilar clusters are merged and the classification becomes increasingly artificial. Deciding upon the optimum number of clusters is largely subjective, although looking at a dendrogram may help. Clusters are interpreted solely in terms of the variables included in them. Clusters should also contain at least four elements; once we drop to three or two elements a cluster ceases to be meaningful.

Basic K-Means Algorithm for Finding K Clusters
1. Select K points as the initial centroids.
2. Assign all points to the closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids don't change (or change very little).
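The four steps above can be sketched minimally in plain Python under the usual Euclidean assumptions. For reproducibility this toy takes the first K points as initial centroids instead of choosing them randomly; the data and names are illustrative only:

```python
# Basic K-means sketch following the four steps above.

def kmeans(points, k, max_iter=100):
    centroids = [list(p) for p in points[:k]]          # step 1: initial centroids
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append([sum(xs) / len(cluster) for xs in zip(*cluster)])
            else:
                new_centroids.append(old)              # keep an empty cluster's centroid
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

On these four points the run converges in a few iterations, with the two nearby pairs ending up in separate clusters.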
K-means has a number of variations, depending on the method for selecting the initial centroids, the choice of similarity measure, and the way the centroid is computed. The common practice, at least for Euclidean data, is to use the mean as the centroid and to select the initial centroids randomly. In the absence of numerical problems, this procedure converges to a solution, although the solution is typically a local minimum. Since only the vectors are stored, the space requirements are O(m*n), where m is the number of points and n is the number of attributes. The time requirements are O(I*K*m*n), where I is the number of iterations required for convergence. I is typically small and can easily be bounded, as most changes occur in the first few iterations. Thus, K-means is efficient as well as simple, as long as the number of clusters is significantly less than m.
ANALYZING MICROARRAY DATA

As gene chips become more routine in basic research, it is important for biologists to understand the biostatistical methods used to analyze these data so that they can better interpret the biological meaning of the results. Strategies for analyzing gene chip data can be broadly grouped into two categories: discrimination (or supervised learning) and clustering (or unsupervised learning).

Raw Data

Gene expression data measured by gene chips (microarrays) are pre-processed using image analysis techniques to extract expression values from images, and scaling algorithms to make expression values comparable across chips. These pre-processing steps are generally done with a microarray-platform vendor's software or through software developed by researchers interested in improving the estimates of the expression data. While these steps can have a significant impact on the quality of the data and are an area of active research, our review will start with the assumption that these pre-processing steps have already been performed and the estimates of the expression level are as good as can be obtained. Expression data are typically analyzed in matrix form, with each row representing a gene and each column representing a chip or sample. For a study with 20 samples run on Affymetrix GeneChips™, the dimensions of the data
matrix would be (approximately) 12,000 rows (one for each gene) by 20 columns. Newer chips have even more genes on them. Often there will be one additional column giving the gene label for identification. However, this column is excluded from analysis and only the chip columns containing expression values are used. We represent the data matrix by the symbol X and denote the data as follows:
Figure 5: Matrix of Gene Data
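As a toy sketch of this layout (all gene names and expression values below are invented for illustration), each row of X holds one gene's profile across the samples. A common gene-to-gene dissimilarity in microarray clustering is one minus the Pearson correlation of two profiles:

```python
# Toy expression matrix X: each row is a gene, each column a sample.
# The gene names and values are invented for illustration only.
X = {
    "geneA": [2.1, 2.3, 0.5, 0.4],
    "geneB": [2.0, 2.2, 0.6, 0.5],   # profile similar to geneA
    "geneC": [0.3, 0.4, 2.5, 2.6],   # roughly opposite pattern
}

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# 1 - correlation: near 0 for co-expressed genes,
# near 2 for opposite expression patterns.
print(round(1 - pearson(X["geneA"], X["geneB"]), 3))  # small
print(round(1 - pearson(X["geneA"], X["geneC"]), 3))  # close to 2
```

Correlation-based dissimilarities are popular here because they compare the shape of two expression profiles rather than their absolute levels.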
PROBLEMS IN CLUSTER ANALYSIS

The important problems with cluster analysis that we have identified in our survey are as follows:
The Identification of a Distance Measure: For numerical attributes, standard distance measures can be used, but identifying a suitable measure for categorical attributes is difficult.
The Number of Clusters: Identifying the number of clusters is a difficult task if the number of class labels is not known in advance. A careful analysis of the number of clusters is necessary to produce correct results.
Structure of Database: Real-life data may not always contain clearly identifiable clusters. Also, the order in which the tuples are arranged may affect the results when an algorithm is executed if the distance measure used is not perfect. With structureless data (e.g. data having many missing values), identifying an appropriate number of clusters will not yield good results.
Types of Attributes in a Database: The databases may not necessarily contain distinctively numerical or categorical attributes. They may also contain other types such as nominal, ordinal, binary, etc., so these attributes have to be converted to categorical type to make calculations simple.
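One common workaround for the distance-measure and mixed-attribute problems noted above is a Gower-style mixed distance: range-normalize each numerical attribute and score each categorical attribute 0 on a match and 1 on a mismatch, then average. A minimal sketch (the records, attribute names and ranges are invented for illustration):

```python
# Gower-style mixed-attribute distance sketch: numeric attributes
# contribute a range-normalized difference, categorical attributes
# contribute 0 on a match and 1 on a mismatch.

def mixed_distance(a, b, numeric_ranges):
    parts = []
    for key, value in a.items():
        if key in numeric_ranges:                       # numeric attribute
            parts.append(abs(value - b[key]) / numeric_ranges[key])
        else:                                           # categorical attribute
            parts.append(0.0 if value == b[key] else 1.0)
    return sum(parts) / len(parts)                      # average over attributes

ranges = {"age": 50.0}                                  # observed range of 'age'
r1 = {"age": 30, "colour": "red", "segment": "elite"}
r2 = {"age": 40, "colour": "red", "segment": "escapist"}
print(round(mixed_distance(r1, r2, ranges), 3))
```

The result always lies in [0, 1], so numeric and categorical attributes contribute on a comparable scale.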
CONCLUSIONS

Cluster analysis determines how many 'natural' groups there are in the sample. It also allows you to determine who in your sample belongs to which group. Cluster analysis is not so much a typical statistical test as it is a 'collection' of different algorithms that put objects into clusters according to well-defined similarity rules. The aim is to:
Minimize variability within clusters
Maximize variability between clusters.
The most common approach is to use hierarchical cluster analysis and Ward’s method. Unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any ‘a priori’ hypotheses, but are still
in the exploratory phase of research. Once clear clusters are identified, the analysis is re-run with only those clusters to allocate items/respondents to the selected clusters.