Thesis report by Sagar Makwana

Parallel K-Means Clustering Using Hadoop MapReduce

Submitted by NAME OF GROUP MEMBERS

SAP ID

ADITYA D. KALUSHTE

60004115033

SAGAR B. MAKWANA

60004115043

VATSAL R. SHAH

60004115061

Guided by Prof. Harish Narula Asst. Professor

Department of Computer Engineering, D.J. Sanghvi College of Engineering, Mumbai - 400 056

CERTIFICATE This is to certify that the following students have submitted the synopsis for the seminar titled

Parallel K-Means Clustering Using Hadoop MapReduce at D. J. Sanghvi College of Engineering, Mumbai for the Seminar subject for the degree of Computer Engineering (Semester VI) of University of Mumbai in the year 2013-2014.

Student Name

SAP ID

ADITYA D. KALUSHTE

60004115033

SAGAR B. MAKWANA

60004115043

VATSAL R. SHAH

60004115061

________________________

Internal Guide Prof. Harish Narula (Asst. Professor)

____________________ Internal Examiner

___________________________________

HoD, Computer Engg. Dept. (Dr. N. M. Shekokar)

____________________ External Examiner

_______________________________

Principal (Dr. Hari Vasudevan)

ACKNOWLEDGEMENT

The work presented in this report was completed as a part of the T.E. Seminar course in Computer Engineering, Mumbai University. We have great pleasure in presenting this report on “Parallel K-Means Clustering using Hadoop MapReduce”. We take this opportunity to thank all those who have contributed to the successful completion of this report. We wish to express our sincere thanks and deep sense of gratitude to our respected mentor and guide Prof. Harish Narula, Asst. Professor in the Computer Engineering department of D.J. Sanghvi College of Engineering, for his in-depth and enlightening support and his kindness. Apart from technical advice, we received a great deal of encouragement and constructive criticism, which motivated us to strive harder for excellence. We also wish to express our thanks to Prof. Narendra M. Shekhokar, H.O.D., Computer Engineering department, for his continuous support and encouragement to achieve the best in whatever we do. We are also thankful to our parents for their continuous encouragement to pursue higher studies and to our friends for the help and support they provided.

Aditya D. Kalushte - 60004115033
Sagar B. Makwana - 60004115043
Vatsal R. Shah - 60004115061

TABLE OF CONTENTS

List of Figures
List of Tables
Abstract
1. INTRODUCTION
2. K-MEANS CLUSTERING ALGORITHM
3. THE HADOOP DISTRIBUTED FILE SYSTEM
4. PARALLEL K-MEANS BASED ON MAPREDUCE
5. COMPARATIVE ANALYSIS
Conclusions
Future Work
References
Research Papers

Parallel K-Means Clustering using Hadoop MapReduce

LIST OF FIGURES

Figure No.   Figure Name
3.1          Data Nodes
3.2          An HDFS client creates a new file by NameNode
3.3          Replication Management
3.4          MapReduce Programming Model
3.5          MapReduce operation process
3.6          Solving a word count
5.1          Execution time of K-means clustering algorithm
5.2          Speed up
5.3          Scale up
5.4          Size up
5.5          Comparison of standalone k-means vs parallel k-means

Computer Engineering Department,DJSCOE


LIST OF TABLES

Table No.   Table Name
3.1         Hadoop project components
5.1         Execution time for K-means clustering
5.2         Performance measures for different runs of original K-means
5.3         The executing result of 2 different K-means Algorithm


Abstract: Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by advances in technology makes clustering of very large data sets a challenging task. To deal with this problem, many researchers have tried to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware. The results also show that the MapReduce-based K-Means clustering algorithm achieves higher performance when handling large-scale automatic document classification, and they demonstrate the effectiveness and accuracy of the algorithm. [2][3]

Keywords: Data mining; Parallel clustering; K-means; Hadoop; MapReduce.


1. Introduction:
With the development of information technology, the data volumes processed by many applications will routinely cross the peta-scale threshold, which in turn increases the computational requirements. Efficient parallel clustering algorithms and implementation techniques are the key to meeting the scalability and performance requirements of such scientific data analyses. So far, several researchers have proposed parallel clustering algorithms. All these parallel clustering algorithms have the following drawbacks: a) they assume that all objects can reside in main memory at the same time; b) their parallel systems provide restricted programming models and use the restrictions to parallelize the computation automatically. Both assumptions are prohibitive for very large datasets with millions of objects. Therefore, dataset-oriented parallel clustering algorithms should be developed.

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Google and Hadoop both provide MapReduce runtimes with fault tolerance and dynamic flexibility support.

In this paper, we adapt the k-means algorithm to the MapReduce framework, as implemented by Hadoop, to make the clustering method applicable to large-scale data. By choosing proper <key, value> pairs, the proposed algorithm can be executed effectively in parallel. We conduct comprehensive experiments to evaluate the proposed algorithm. The results demonstrate that our algorithm can deal effectively with large-scale datasets.

The rest of the paper is organized as follows. We first describe the k-means clustering algorithm and the Hadoop Distributed File System. We then present our parallel k-means algorithm based on the MapReduce framework. Experimental results evaluate the parallel algorithm with respect to speedup, scaleup and sizeup. Finally, we offer our conclusions. [3]


2. K-MEANS CLUSTERING ALGORITHM:

2.1. K-MEANS ALGORITHM
K-means assigns each given point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean, for each dimension separately, over all the points in the cluster. It is popular because its running time is linear in the number of points (O(n·k·d) per iteration for n points, k clusters and d dimensions) and because, for a fixed set of initial centroids, the result does not depend on the order of the data. The algorithm consists of four steps:
1. Initialization: The data set, the number of clusters and the initial centroid of each cluster are defined.
2. Classification: The distance from each data point to every centroid is calculated, and the data point is assigned to the cluster whose centroid is at minimum distance.
3. Centroid Recalculation: The centroid of each cluster generated in the previous step is recalculated.
4. Convergence Condition: Some convergence conditions are:
4.1 Stopping when a given number of iterations is reached.
4.2 Stopping when there is no exchange of data points between the clusters.
4.3 Stopping when a threshold value is achieved.
5. If none of the above conditions is satisfied, go back to step 2 and repeat the whole process until one of them is met.

Main advantages:
1. K-means clustering is fast, robust and easy to understand. It gives the best results when the clusters in the data set are well separated from each other.
2. The clusters do not overlap and are non-hierarchical in nature.

Main disadvantages:
1. The complexity is higher than that of some other algorithms.
2. The cluster centres need to be predefined.
3. Handling of empty clusters: an empty cluster is generated during execution if no data points are allocated to a cluster during the assignment phase. [4]
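The steps of Section 2.1 can be sketched as a small standalone Python function. This is only an illustrative sketch, not the implementation evaluated in this report: the name `kmeans`, the parameters `max_iter` and `tol`, and the choice to keep the old centroid of an empty cluster are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Plain k-means on coordinate tuples; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]      # step 1: initial centroids
    labels = []
    for _ in range(max_iter):                      # condition 4.1: iteration cap
        # Step 2, classification: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3, centroid recalculation: arithmetic mean per dimension.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)          # assumption: empty cluster keeps its centroid
        # Condition 4.3: stop when no centroid moved by more than tol.
        moved = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= tol:
            break
    return centroids, labels
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` separates the two lower-left points from the two upper-right points.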


The actual data samples are to be collected before the clustering algorithm is applied. Priority has to be given to the features that describe each data sample in the database. The values of these features make up a feature vector $(F_{i1}, F_{i2}, \ldots, F_{im})$, where $F_{ij}$ is the value of the $j$-th feature of sample $i$ in the $m$-dimensional feature space. As in other clustering algorithms, k-means requires that a distance metric between points be defined. This distance metric is used in the classification step of the algorithm described above. A common distance metric is the Euclidean distance. If the features used in the feature vector have different relative values and ranges, the distance computation may be distorted, so the features may need to be scaled.

The input parameters of the clustering algorithm are the number of clusters to be found and the initial starting point values. Given the initial starting values, the distance from each sample data point to each initial starting value is computed with the chosen distance metric. Each data point is then placed in the cluster associated with the nearest starting point. After all the data points are assigned to a cluster, the new cluster centroids are calculated: for each feature in each cluster, the new centroid value is computed. The new centroids are then taken as the new starting values, and the classification and centroid recalculation steps are repeated. This process continues until no data point changes cluster or until the centroids no longer move. [2]

2.2. VARIOUS FORMULAE FOR CALCULATING DISTANCES:

COSINE SIMILARITY
It measures the similarity between two vectors of dimension n by finding the cosine of the angle between them. It is mostly used for comparing data in text mining. Given vectors A and B, the cosine similarity is given by the formula

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

The result ranges from -1 to 1; the closer the result is to 1, the more similar the vectors.

MAHALANOBIS DISTANCE
It is used to determine the similarity of an unknown sample set to a known one. For a multivariate vector $x$ and a group of values with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_N)^T$ and covariance matrix $S$, it is defined as:

$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$

It is widely used in cluster analysis and other classification techniques.

EUCLIDEAN DISTANCE
Finding the distance between two data points involves computing the square root of the sum of the squares of the differences between corresponding values. The formula for this distance between a point $X(x_1, x_2, x_3, \ldots)$ and a point $Y(y_1, y_2, y_3, \ldots)$ is:

$$d = \left( \sum_{j=1}^{n} (x_j - y_j)^2 \right)^{1/2}$$

It is the most widely used distance metric. [1]
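The formulae of Section 2.2 translate directly into a few lines of Python. The sketch below is illustrative only and the function names are hypothetical. To stay within the standard library, the Mahalanobis distance is shown only for the special case of a diagonal covariance matrix S, where the inverse reduces to dividing by the per-dimension variances; the general case needs a matrix inverse.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences of corresponding values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def mahalanobis_diag(x, mu, variances):
    """Mahalanobis distance, assuming S is diagonal with the given variances."""
    return math.sqrt(sum((a - m) ** 2 / v
                         for a, m, v in zip(x, mu, variances)))
```

With unit variances, `mahalanobis_diag` reduces to the Euclidean distance from the mean, which shows how the covariance matrix rescales each dimension.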


2.3. K-MEANS WITH MAPREDUCE
The main task of the K-means algorithm is to compute the distances between the objects of the dataset and the centroids, and this takes most of the execution time. For large datasets the number of iteration steps increases and the computation becomes overloaded. Improving the distance calculation is therefore the most important way to reduce the running time of the algorithm. The order in which the distances of the objects are calculated does not affect the result of clustering, so the distances between objects and centroids can be computed in parallel using the MapReduce framework. K-means with MapReduce works as follows:
1. Place k points into the space represented by the objects that are being clustered. These points represent the initial centroids.
2. Assign each object to the cluster whose centroid is closest to it.
3. When all objects have been assigned, recalculate the positions of the centroids based on the results.
Repeat steps 2 and 3 until the centroids no longer change. This produces a separation of the objects into clusters from which the metric to be minimized can be calculated. [1]

2.4. FUNCTION K-MEANS IN MAP
The map function of the K-means algorithm using MapReduce, with its steps, is shown in Figure 1. This function runs on each mapper in the cluster and assigns each object to its nearest centroid. In step 3 the function uses a distance method to calculate the distance between the given object and each centroid. After finding the nearest centroid, the function puts the object into the cluster that contains that centroid in step 8.


Function K-means_in_Map
Input: centroids, input. // The key of the input is the offset of the input segment in the raw text (data set), and the value of the input is an object for clustering. The centroids are raw centroids, and are given to the map function by the JobConf of MapReduce.
Output: output. // The key of output is the nearest cluster to the object, and the value of output is the object.
1: nstCentroid ← null, nstDist ← ∞
2: for each c ∈ centroids do
3:   dist ← Distance(input.value, c);
4:   if nstCentroid == null || dist < nstDist then
5:     nstCentroid ← c, nstDist ← dist;
6:   end if
7: end for
8: output.collect(nstCentroid, object);

2.5. FUNCTION K-MEANS IN REDUCE
The reduce function of the K-means algorithm using MapReduce is shown in Figure 2. This function runs on each reducer in the cluster and calculates the new centroid of a cluster for the next iteration. It first collects all objects of the cluster in step 3, and then uses a centroid function to recalculate the centroid from those objects in step 5. After obtaining the new centroid of the cluster, the function returns it in step 6.

Function K-means_in_Reduce
Input: input. // The key of input is the centroid of a cluster, and the value of input is a list of objects that are assigned to the cluster.
Output: output. // The key of output is the old centroid of the cluster, and the value of output is the new centroid of the cluster.
1: v ← ∅;
2: for each obj ∈ input.value do
3:   v ← v ∪ {obj};
4: end for
5: centroid ← ReCalCentroid(v);
6: output.collect(input.key, centroid);
[1]
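To make the data flow between the two functions concrete, one iteration can be simulated in a single Python process. This is a sketch under stated assumptions rather than Hadoop code: `kmeans_map`, `kmeans_reduce` and `run_iteration` are hypothetical names, objects and centroids are plain tuples, the distance is Euclidean, and the grouping by key that Hadoop performs between map and reduce is imitated with a dictionary.

```python
import math
from collections import defaultdict

def kmeans_map(obj, centroids):
    """Map: emit (nearest centroid, object) for one input object."""
    # min() keeps the first centroid on ties, like the strict '<' in step 4.
    nearest = min(centroids, key=lambda c: math.dist(obj, c))
    return nearest, obj

def kmeans_reduce(old_centroid, objects):
    """Reduce: emit (old centroid, mean of the objects assigned to it)."""
    new_centroid = tuple(sum(dim) / len(objects) for dim in zip(*objects))
    return old_centroid, new_centroid

def run_iteration(objects, centroids):
    """Simulate one job: map every object, group by key, reduce each group."""
    groups = defaultdict(list)
    for obj in objects:
        key, value = kmeans_map(obj, centroids)
        groups[key].append(value)
    return dict(kmeans_reduce(k, v) for k, v in groups.items())
```

Calling `run_iteration` repeatedly, each time feeding the new centroids back in as the centroids of the next call, reproduces the loop over steps 2 and 3 of Section 2.3. A centroid that attracts no objects produces no reduce call in this sketch and therefore does not appear in the result.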


3. THE HADOOP DISTRIBUTED FILE SYSTEM

3.1. INTRODUCTION
Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts, and executing application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 25 000 servers, and store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.

Component    Description
HDFS         Distributed file system (subject of this paper)
MapReduce    Distributed computation framework
HBase        Column-oriented table service
Pig          Dataflow language and parallel execution framework
Hive         Data warehouse infrastructure
ZooKeeper    Distributed coordination service
Chukwa       System for collecting management data
Avro         Data serialization system

Table 3.1. Hadoop project components

Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce). HBase was originally developed at Powerset, now a department at Microsoft. Hive was originated and developed at Facebook. Pig, ZooKeeper, and Chukwa were originated and developed at Yahoo! Avro was originated at Yahoo! and is being co-developed with Cloudera.

HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favour of improved performance for the applications at hand. HDFS stores file system metadata and application data separately. As in other distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols. [7]


3.2. ARCHITECTURE

NAME NODE
The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data).

An HDFS client wanting to read a file first contacts the NameNode for the locations of the data blocks comprising the file and then reads block contents from the DataNode closest to the client. When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas. The client then writes data to the DataNodes in a pipeline fashion. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently.

HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host’s native file system is called a checkpoint. The NameNode also stores the modification log of the image, called the journal, in the local host’s native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers. During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.
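The block and replication figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: `block_layout` is a hypothetical helper, not an HDFS API. It assumes the typical values quoted in the text (128 MB blocks, three replicas), assumes that a partially filled last block occupies only its actual length, and ignores the small per-block metadata files.

```python
MB = 1024 * 1024

def block_layout(file_size, block_size=128 * MB, replication=3):
    """Return (number of blocks, size of last block, raw bytes on DataNodes)."""
    if file_size == 0:
        return 0, 0, 0
    n_blocks = (file_size + block_size - 1) // block_size   # ceiling division
    last_block = file_size - (n_blocks - 1) * block_size
    raw_bytes = file_size * replication                     # every block is stored 'replication' times
    return n_blocks, last_block, raw_bytes
```

Under these assumptions a 300 MB file is split into two full blocks and one 44 MB block, and its replicas occupy 900 MB across the DataNodes.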
DATA NODES
Each block replica on a DataNode is represented by two files in the local host’s native file system. The first file contains the data itself and the second file is the block’s metadata, including checksums for the block data and the block’s generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.

During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down. The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system. The consistency of software versions is important because incompatible versions may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to the software upgrade or were not available during the upgrade. A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster’s namespace ID.

After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.

A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block id, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.

Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode’s space allocation and load balancing decisions. The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:
• replicate blocks to other nodes;
• remove local block replicas;
• re-register or to shut down the node;
• send an immediate block report.
These commands are important for maintaining the overall system integrity, and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations. [7]

Fig No. 3.1 Data Nodes [8]
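The heartbeat rule described above (a DataNode is treated as out of service after ten minutes of silence) can be illustrated with a toy bookkeeping class. This is a hypothetical sketch, not NameNode code: the class `HeartbeatMonitor`, its methods and the use of plain numbers of seconds as timestamps are assumptions made for the example.

```python
DEAD_AFTER_S = 10 * 60   # ten minutes without a heartbeat, as stated in the text

class HeartbeatMonitor:
    """Toy NameNode-side record of the last heartbeat per DataNode."""

    def __init__(self):
        self.last_seen = {}   # storage ID -> time of last heartbeat, in seconds

    def heartbeat(self, storage_id, now):
        """Record a heartbeat received from a DataNode at time 'now'."""
        self.last_seen[storage_id] = now

    def dead_nodes(self, now):
        """Storage IDs that have been silent for at least DEAD_AFTER_S seconds."""
        return sorted(sid for sid, seen in self.last_seen.items()
                      if now - seen >= DEAD_AFTER_S)
```

Keying the record by storage ID rather than by network address mirrors the point made above that a DataNode stays recognizable even after restarting with a different IP address or port.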


HDFS CLIENT
User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface. Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node to node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Each choice of DataNodes is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Figure 3.2.

Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file’s replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves their tolerance against faults and increases their read bandwidth. [7]

Figure 3.2. An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes, which eventually confirm the creation of the block replicas to the NameNode.


3.3. REPLICATION MANAGEMENT
The NameNode endeavours to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block’s availability.

When a block becomes under-replicated, it is put in the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas that is greater than two thirds of its replication factor has the lowest priority. A background thread periodically scans the head of the replication queue to decide where to place new replicas. Block replication follows a policy similar to that of new block placement. If the number of existing replicas is one, HDFS places the next replica on a different rack. If the block has two existing replicas and both are on the same rack, the third replica is placed on a different rack; otherwise, the third replica is placed on a different node in the same rack as an existing replica. Here the goal is to reduce the cost of creating new replicas.

The NameNode also makes sure that not all replicas of a block are located on one rack. If the NameNode detects that a block’s replicas end up on one rack, the NameNode treats the block as under-replicated and replicates the block to a different rack using the same block placement policy described above. After the NameNode receives the notification that the replica is created, the block becomes over-replicated. The NameNode then decides to remove an old replica, because the over-replication policy prefers not to reduce the number of racks. [7]

Fig No. 3.3 Replication Management [8]
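The re-replication rule for one or two existing replicas can be written down as a small decision function. The sketch below is a hypothetical illustration, not the HDFS placement code: `choose_target` and the (node, rack) pairs are invented for the example, the behaviour for any other number of existing replicas is an assumption (it reuses the "different rack" branch), and falling back to any free node when no preferred node exists is also an assumption.

```python
def choose_target(existing, candidates):
    """Pick a (node, rack) from candidates for the next replica, or None.

    existing and candidates are lists of (node, rack) pairs.
    """
    used_nodes = {node for node, _ in existing}
    used_racks = {rack for _, rack in existing}
    free = [(n, r) for n, r in candidates if n not in used_nodes]
    if len(existing) == 2 and len(used_racks) == 2:
        # Two replicas on two racks: a different node on a rack already in use.
        preferred = [(n, r) for n, r in free if r in used_racks]
    else:
        # One replica, or two on the same rack: move to a different rack.
        preferred = [(n, r) for n, r in free if r not in used_racks]
    pool = preferred or free          # assumption: relax the rule if nothing fits
    return pool[0] if pool else None
```

The second branch is also what resolves the case in the last paragraph above, where all replicas of a block have ended up on one rack.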


3.4. MAPREDUCE OVERVIEW
When data held in distributed storage are processed in parallel, many details of the underlying system have to be considered, such as synchronization, concurrency and load balancing. This makes even a simple calculation very complex. The MapReduce programming model was proposed by Google in 2004 for processing and generating large data sets. The framework takes care of many problems, such as data distribution, job scheduling, fault tolerance and machine-to-machine communication. MapReduce is applied in Google's Web search. Programmers need to write many special-purpose programs to process the massive data distributed and stored in the server cluster, such as crawled documents and web request logs, in order to compute derived data such as inverted indices, different views of the structure of web documents, summaries of the number of pages crawled per host, and the set of most common queries on a given date. [2]

MAPREDUCE PROGRAMMING MODEL
In the MapReduce programming model, the map and reduce functions are realized by implementing the Mapper and Reducer interfaces. They form the core of a task.

Fig No. 3.4 MapReduce Programming model [9]

MAPPER
The map function, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs. A <key,value> pair consists of two parts: value stands for the data related to the task, and key stands for the "group number" of the value. MapReduce combines the intermediate values that have the same key and sends them to the reduce function. [7]


MAP ALGORITHM PROCESS IS DESCRIBED AS FOLLOWS:

Step1. The Hadoop MapReduce framework creates one map task for each InputSplit, and the InputSplits are generated by the InputFormat of the job. Each <key,value> pair in a split is passed to one call of the map function.

Step2. Execute the map task, processing each input <key,value> to form new <key,value> pairs. This process is called "divide into groups": correlated values are made to correspond to the same key. The output key/value pairs do not need to be of the same type as the input pairs, and a given input pair can be mapped to zero or more output pairs.

Step3. The Mapper's output is sorted and partitioned, one partition per Reducer. The total number of partitions is the same as the number of reduce tasks of the job. Users can implement the Partitioner interface to control which key is assigned to which Reducer.

3.5.REDUCER

The reduce function is also provided by the user. It handles an intermediate key together with the set of values associated with that key, and merges these values to produce a smaller set of values. This process is called "merge". It is not necessarily simple accumulation; the merge can involve complex operations.

In the MapReduce framework, the programmer does not need to care about the details of data communication, so <key,value> is the communication interface for the programmer in the MapReduce model. A <key,value> pair can be seen as a "letter": the key is the letter's posting address and the value is the letter's content. Letters with the same address are delivered to the same place. Programmers only need to set up <key,value> correctly, and the MapReduce framework automatically and accurately groups the values with the same key together.

REDUCER ALGORITHM PROCESS IS DESCRIBED AS FOLLOWS:

Step1. Shuffle. The input of the Reducer is the sorted output of the Mappers. In this stage, MapReduce assigns the relevant partition of each Mapper's output to each Reducer.

Step2. Sort. In this stage, the input of the Reducer is grouped by key (because the outputs of different Mappers may have the same key). The Shuffle and Sort stages run at the same time.

Step3. Secondary Sort. If the rule for grouping intermediate keys needs to differ from the rule for grouping keys before reduction, we can define a Comparator. The comparator is then used to group the intermediate keys a second time.

Map tasks and Reduce tasks form a whole and cannot be separated; they are used together in a program. We call one MapReduce pass an MR process. Within an MR process, Map tasks run in parallel with each other, Reduce tasks run in parallel with each other, and the Map phase and the Reduce phase run serially. Successive MR processes also run serially. Synchronization between these operations is guaranteed by the MR system, without the programmer's involvement. [7]
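The partition, sort and group steps described above can be illustrated with a small in-memory sketch. This is not Hadoop code: the function names partition and shuffle_and_sort are hypothetical, the sample pairs are made up, and zlib.crc32 merely stands in for whatever key hash a real partitioner would use.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick a reducer for a key: stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group them by key."""
    buckets = defaultdict(list)
    for key, value in map_outputs:
        buckets[partition(key, num_reducers)].append((key, value))

    grouped = {}
    for reducer_id, pairs in buckets.items():
        pairs.sort(key=itemgetter(0))  # sort phase: order the pairs by key
        grouped[reducer_id] = [
            (key, [value for _, value in group])  # one value list per key
            for key, group in groupby(pairs, key=itemgetter(0))
        ]
    return grouped


# Hypothetical mapper output in which some keys repeat.
map_outputs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
result = shuffle_and_sort(map_outputs, num_reducers=2)
```

Because the reducer is chosen from the key alone, every pair with the same key ends up in the same reducer's input, which is the "same address, same place" behaviour described in the letter analogy.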


Fig no. 3.5 MapReduce operation process

3.6.EXAMPLE

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));


The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word. In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library. [6]
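The word count pseudo-code can be tried out locally with a minimal single-process sketch. It only imitates the map, group-by-key and reduce stages in memory; the names map_fn, reduce_fn and word_count and the two sample documents are assumptions made for illustration and are not part of any MapReduce library.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document contents."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group the intermediate pairs by key, then reduce each group."""
    intermediate = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            intermediate[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in intermediate.items())


documents = {  # hypothetical input documents
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}
counts = word_count(documents)
```

In a real job the grouping step is performed by the framework between the map and reduce phases, so the user only supplies the two functions.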

Fig No. 3.6 Solving a word count [9]


3.7.PRACTICE AT YAHOO!

Large HDFS clusters at Yahoo! include about 3500 nodes. A typical cluster node has:
· 2 quad-core Xeon processors @ 2.5 GHz
· Red Hat Enterprise Linux Server Release 5.1
· Sun Java JDK 1.6.0_13-b03
· 4 directly attached SATA drives (one terabyte each)
· 16 GB RAM
· 1-gigabit Ethernet

Seventy percent of the disk space is allocated to HDFS. The remainder is reserved for the operating system (Red Hat Linux), logs, and space to spill the output of map tasks. (MapReduce intermediate data are not stored in HDFS.) Forty nodes in a single rack share an IP switch. The rack switches are connected to each of eight core switches. The core switches provide connectivity between racks and to out-of-cluster resources. For each cluster, the NameNode and the BackupNode hosts are specially provisioned with up to 64 GB RAM; application tasks are never assigned to those hosts.

In total, a cluster of 3500 nodes has 9.8 PB of storage available as blocks that are replicated three times, yielding a net 3.3 PB of storage for user applications. As a convenient approximation, one thousand nodes represent one PB of application storage.

Over the years that HDFS has been in use (and into the future), the hosts selected as cluster nodes benefit from improved technologies. New cluster nodes always have faster processors, bigger disks and larger RAM. Slower, smaller nodes are retired or relegated to clusters reserved for development and testing of Hadoop. The choice of how to provision a cluster node is largely an issue of economically purchasing computation and storage. HDFS does not compel a particular ratio of computation to storage, or set a limit on the amount of storage attached to a cluster node. [7]
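The storage figures quoted above follow from the node specification, as the short calculation below shows. The constant names are chosen here for illustration, and decimal units (1 PB = 1000 TB) are assumed.

```python
NODES = 3500
DRIVES_PER_NODE = 4
TB_PER_DRIVE = 1
HDFS_FRACTION = 0.70   # share of disk space allocated to HDFS
REPLICATION = 3        # each block is stored three times
TB_PER_PB = 1000       # assumption: decimal units

raw_pb = NODES * DRIVES_PER_NODE * TB_PER_DRIVE / TB_PER_PB  # total disk space in PB
hdfs_pb = raw_pb * HDFS_FRACTION                             # space available as HDFS blocks
net_pb = hdfs_pb / REPLICATION                               # space left for user data

print(f"raw: {raw_pb:.1f} PB, HDFS: {hdfs_pb:.1f} PB, net: {net_pb:.1f} PB")
```

The same arithmetic gives a little under 1 PB of net storage per thousand nodes, which is the approximation stated in the text.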


4. PARALLEL K-MEANS BASED ON MAPREDUCE

The Parallel K-Means algorithm needs one kind of MapReduce job. The map function assigns each sample to the closest center, while the reduce function updates the centers. In order to decrease the cost of network communication, a combiner function is used to partially combine the intermediate values that have the same key within the same map task.

4.1.MAP FUNCTION

The input dataset is stored on HDFS (Hadoop Distributed File System) as a sequence file of <key, value> pairs, each of which represents a record in the dataset. The key is the offset in bytes of the record from the start of the data file, and the value is a string holding the content of the record. The dataset is split, and the splits are distributed across all mappers, so the distance computations are executed in parallel. For each map task, Parallel K-Means constructs a global variable centers, which is an array containing the information about the centers of the clusters. Given this information, a mapper can compute the closest center point for each sample. The intermediate values are then composed of two parts: the index of the closest center point and the sample information. The pseudocode of the map function is shown in Algorithm 1. [3]

Algorithm 1. map (key, value)
Input: Global variable centers, the offset key, the sample value
Output: <key’, value’> pair, where key’ is the index of the closest center point and value’ is a string comprising the sample information
1. Construct the sample instance from value;
2. minDis = Double.MAX_VALUE;
3. index = -1;
4. For i=0 to centers.length-1 do
       dis = ComputeDist(instance, centers[i]);
       If dis < minDis {
           minDis = dis;
           index = i;
       }
5. End For
6. Take index as key’;
7. Construct value’ as a string comprising the values of different dimensions;
8. output <key’, value’> pair;
9. End

Note that Step 2 and Step 3 initialize the auxiliary variables minDis and index; Step 4 finds the center point closest to the sample, where the function ComputeDist(instance, centers[i]) returns the distance between instance and the center point centers[i]; Step 8 outputs the intermediate data that is used in the subsequent procedures. [3]
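Algorithm 1 can be sketched in Python as follows. This is an illustrative, single-process version rather than a Hadoop Mapper: the name kmeans_map, the comma-separated record format and the use of Euclidean distance inside compute_dist are assumptions. The sample records and centers are the Mapper 1 points and the initial centers from the numerical example in Section 4.4.

```python
import math


def compute_dist(instance, center):
    """Euclidean distance between a sample and a center (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(instance, center)))


def kmeans_map(key, value, centers):
    """Return (index of the closest center, sample string) for one input record."""
    instance = [float(x) for x in value.split(",")]  # step 1: parse the sample
    min_dis = float("inf")                           # step 2
    index = -1                                       # step 3
    for i, center in enumerate(centers):             # step 4: scan all centers
        dis = compute_dist(instance, center)
        if dis < min_dis:
            min_dis = dis
            index = i
    return index, value                              # steps 6 to 8


centers = [(2, 2), (1, 14), (4, 3)]               # initial centers C1, C2, C3
records = ["2,2", "1,14", "10,7", "1,11", "3,4"]  # points P1 to P5
assignments = [kmeans_map(offset, rec, centers)[0] for offset, rec in enumerate(records)]
```

Indices 0, 1 and 2 correspond to C1, C2 and C3. The input key (the record offset) is not needed for the assignment itself and is simply ignored.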


4.2.COMBINE FUNCTION

After each map task, we apply a combiner to combine the intermediate data of the same map task. Since the intermediate data is stored on the local disk of the host, this procedure does not incur any network communication cost. In the combine function, we partially sum the values of the points assigned to the same cluster. In order to calculate the mean value of the objects for each cluster, we also record the number of samples in the same cluster in the same map task. The pseudocode for the combine function is shown in Algorithm 2.

Algorithm 2. combine (key, V)
Input: key is the index of the cluster, V is the list of the samples assigned to the same cluster
Output: <key’, value’> pair, where key’ is the index of the cluster and value’ is a string comprising the sum of the samples in the same cluster and the sample number
1. Initialize one array to record the sum of the values of each dimension of the samples contained in the same cluster, i.e. the samples in the list V;
2. Initialize a counter num as 0 to record the number of samples in the same cluster;
3. while(V.hasNext()){
       Construct the sample instance from V.next();
       Add the values of different dimensions of instance to the array;
       num++;
   }
4. Take key as key’;
5. Construct value’ as a string comprising the sum values of different dimensions and num;
6. output <key’, value’> pair;
7. End [3]
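A minimal Python sketch of Algorithm 2 is given below. The name kmeans_combine and the comma-separated sample format are assumptions, and the result is returned as a (sum vector, count) tuple instead of the string encoding that the pseudocode describes.

```python
def kmeans_combine(key, values):
    """Partially sum the samples of one cluster within one map task."""
    sums = None
    num = 0
    for value in values:
        instance = [float(x) for x in value.split(",")]
        if sums is None:
            sums = [0.0] * len(instance)  # one accumulator per dimension
        sums = [s + x for s, x in zip(sums, instance)]
        num += 1
    return key, (sums, num)


# Two samples assigned to the same cluster by one mapper.
combined = kmeans_combine("C2", ["1,14", "1,11"])
```

Sending one (sum, count) pair per cluster instead of every sample is what reduces the amount of data transferred to the reducer.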

4.3.REDUCE FUNCTION

The input of the reduce function is the data obtained from the combine function of each host. As described for the combine function, the data includes the partial sum of the samples in the same cluster and the sample number. In the reduce function, we sum all the partial sums and compute the total number of samples assigned to the same cluster. From these we obtain the new centers, which are used for the next iteration. The pseudocode for the reduce function is shown in Algorithm 3.


Algorithm 3. reduce (key, V)
Input: key is the index of the cluster, V is the list of the partial sums from different hosts
Output: <key’, value’> pair, where key’ is the index of the cluster and value’ is a string representing the new center
1. Initialize one array to record the sum of the values of each dimension of the samples contained in the same cluster, i.e. the samples in the list V;
2. Initialize a counter NUM as 0 to record the total number of samples in the same cluster;
3. while(V.hasNext()){
       Construct the sample instance from V.next();
       Add the values of different dimensions of instance to the array;
       NUM += num;
   }
4. Divide the entries of the array by NUM to get the new center’s coordinates;
5. Take key as key’;
6. Construct value’ as a string comprising the center’s coordinates;
7. output <key’, value’> pair;
8. End [3]

4.4. NUMERICAL ANALYSIS OF PARALLEL K-MEANS BASED ON MAPREDUCE

Here we use three mappers and one reducer on the distributed Hadoop framework to run parallel k-means on an input dataset. The input dataset is stored in the Hadoop Distributed File System (HDFS) on various datanodes. The input dataset is given as follows:

P1(2,2), P2(1,14), P3(10,7), P4(1,11), P5(3,4)
P6(11,8), P7(4,3), P8(12,9), P9(6,9), P10(2,19)
P11(7,10), P12(2,0), P13(6,6), P14(1,3), P15(2,12)

The global array of cluster centers is constructed for each map task. The initial cluster centers are chosen as C1(2,2), C2(1,14) and C3(4,3). The mappers assign each sample to the closest center, while the reducer updates the centers. The solution is as follows:


MAPPER 1:

i) MAP FUNCTION: map (key, value)

Input: <P1, [2,2]> <P2, [1,14]> <P3, [10,7]> <P4, [1,11]> <P5, [3,4]>

Calculations (distance from each point to each center):

Center      P1(2,2)   P2(1,14)   P3(10,7)   P4(1,11)   P5(3,4)
C1(2,2)     0         12.04      9.43       9.06       2.24
C2(1,14)    12.04     0          11.4       3          10.2
C3(4,3)     2.24      11.4       7.21       8.54       1.41
CLUSTER     C1        C2         C3         C2         C3

Output: <C1, [2,2]> <C2, [1,14]> <C3, [10,7]> <C2, [1,11]> <C3, [3,4]>

ii) COMBINE FUNCTION: combine (key, V)

Input: <C1, { [2,2] }> <C2, { [1,14], [1,11] }> <C3, { [10,7], [3,4] }>

Output: <C1, { [2,2], 1 }> <C2, { [2,25], 2 }> <C3, { [13,11], 2 }>


MAPPER 2:

i) MAP FUNCTION: map (key, value)

Input: <P6, [11,8]> <P7, [4,3]> <P8, [12,9]> <P9, [6,9]> <P10, [2,19]>

Calculations (distance from each point to each center):

Center      P6(11,8)   P7(4,3)   P8(12,9)   P9(6,9)   P10(2,19)
C1(2,2)     10.82      2.24      12.2       8.06      17
C2(1,14)    11.66      11.4      12.08      7.07      5.1
C3(4,3)     8.6        0         10         6.32      16.12
CLUSTER     C3         C3        C3         C3        C2

Output: <C3, [11,8]> <C3, [4,3]> <C3, [12,9]> <C3, [6,9]> <C2, [2,19]>

ii) COMBINE FUNCTION: combine (key, V)

Input: <C2, { [2,19] }> <C3, { [11,8], [4,3], [12,9], [6,9] }>

Output: <C2, { [2,19], 1 }> <C3, { [33,29], 4 }>


MAPPER 3:

i) MAP FUNCTION: map (key, value)

Input: <P11, [7,10]> <P12, [2,0]> <P13, [6,6]> <P14, [1,3]> <P15, [2,12]>

Calculations (distance from each point to each center):

Center      P11(7,10)   P12(2,0)   P13(6,6)   P14(1,3)   P15(2,12)
C1(2,2)     9.43        2          5.66       1.41       10
C2(1,14)    7.21        14.04      9.43       11         2.24
C3(4,3)     7.62        3.6        3.6        3          9.22
CLUSTER     C2          C1         C3         C1         C2

Output: <C2, [7,10]> <C1, [2,0]> <C3, [6,6]> <C1, [1,3]> <C2, [2,12]>

ii) COMBINE FUNCTION: combine (key, V)

Input: <C1, { [2,0], [1,3] }> <C2, { [7,10], [2,12] }> <C3, { [6,6] }>

Output: <C1, { [3,3], 2 }> <C2, { [9,22], 2 }> <C3, { [6,6], 1 }>


REDUCER

iii) REDUCE FUNCTION: reduce (key, V)

Input: <C1, { ([2,2], 1), ([3,3], 2) }> <C2, { ([2,25], 2), ([2,19], 1), ([9,22], 2) }> <C3, { ([13,11], 2), ([33,29], 4), ([6,6], 1) }>

Calculations:

C1 = ([2,2] + [3,3]) / (1 + 2) = [5,5] / 3 = [1.67,1.67]

C2 = ([2,25] + [2,19] + [9,22]) / (2 + 1 + 2) = [13,66] / 5 = [2.6,13.2]

C3 = ([13,11] + [33,29] + [6,6]) / (2 + 4 + 1) = [52,46] / 7 = [7.43,6.57]

Output: <C1, [1.67,1.67]> <C2, [2.6,13.2]> <C3, [7.43,6.57]>
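The reducer calculation above can be reproduced with a short Python sketch of Algorithm 3. The function name kmeans_reduce and the representation of each partial result as a (sum vector, count) tuple are assumptions made for illustration; the input values are the combiner outputs of the three mappers.

```python
def kmeans_reduce(key, partials):
    """Merge partial (sum vector, count) pairs and return the new center."""
    total = None
    num = 0
    for sums, count in partials:
        if total is None:
            total = [0.0] * len(sums)  # one accumulator per dimension
        total = [t + s for t, s in zip(total, sums)]
        num += count
    return key, [t / num for t in total]


combiner_outputs = {
    "C1": [([2, 2], 1), ([3, 3], 2)],
    "C2": [([2, 25], 2), ([2, 19], 1), ([9, 22], 2)],
    "C3": [([13, 11], 2), ([33, 29], 4), ([6, 6], 1)],
}
new_centers = {
    key: [round(c, 2) for c in kmeans_reduce(key, partials)[1]]
    for key, partials in combiner_outputs.items()
}
```

The rounded centers match the reducer output of the worked example, and they would serve as the global centers array for the next iteration.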


5. COMPARATIVE ANALYSIS:

5.1.Experimental results of Execution Time Analysis for K-Means Clustering Algorithm

The execution time of the K-means clustering algorithm is analyzed against the number of records being clustered. For example, clustering 100 records takes 132 ms. Repeating the measurement for different numbers of records shows how the execution time varies with the input size.

Fig 5.1 Execution time of K-means clustering algorithm.

Table 5.1 lists the number of records and the corresponding clustering execution time of the K-means algorithm. For example, with 50 records the execution time is 98 ms. Such tables make it easy to assess performance.


Table 5.1. Execution time for K-means clustering

Records   Execution Time for Clustering Method (ms)
50        98
100       132
150       198
200       209
250       287
300       309
350       380
400       390
450       467
500       487

[4]
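One way to summarize Table 5.1 is to fit a straight line to the measurements. The sketch below is an illustration added here, not part of the cited study: it assumes that a linear model is a reasonable description of the data and computes an ordinary least-squares slope and intercept from the table values.

```python
records = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
times_ms = [98, 132, 198, 209, 287, 309, 380, 390, 467, 487]

n = len(records)
mean_x = sum(records) / n
mean_y = sum(times_ms) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(records, times_ms)) / sum(
    (x - mean_x) ** 2 for x in records
)
intercept = mean_y - slope * mean_x

print(f"time ~ {intercept:.1f} ms + {slope:.3f} ms per record")
```

Under this assumption the slope estimates the extra execution time per additional record, and the intercept estimates the fixed overhead of a run.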

Table 5.2. Performance measures for different runs of Original K-Means

Runs   No. of iterations   No. of points misclassified   Accuracy (%)   Validity Index   Time (s)
                                                         97.8           0.38             4.9
                                                         98.4           0.3819           10.03
                                                         98.4           0.3816           5.61
                                                         97.6           0.3815           16.40
                                                         97.6           0.3815           18.48
                                                         96.6           0.37             5.096
                                                         97.6           0.3815           21.49
                                                         97.6           0.3815           5.98
                                                                        0.3709           4.85
                                                         98.2           0.3611           10.05

[5]


5.2.Experimental Results of Comparison based on: a) SpeedUp b) ScaleUp c) SizeUp

In this section, we evaluate the performance of the proposed algorithm with respect to speedup, scaleup and sizeup. Performance experiments were run on a cluster of computers, each of which has two 2.8 GHz cores and 4 GB of memory. Hadoop version 0.17.0 and Java 1.5.0_14 are used as the MapReduce system for all experiments.

To measure the speedup, we keep the dataset constant and increase the number of computers in the system. A perfect parallel algorithm demonstrates linear speedup: a system with m times the number of computers yields a speedup of m. However, linear speedup is difficult to achieve because the communication cost increases as the number of clusters becomes large.

a) Speedup: We have performed the speedup evaluation on datasets with different sizes and systems. The number of computers varied from 1 to 4. The size of the dataset increases from 1 GB to 8 GB. Fig. 5.2 shows the speedup for different datasets. As the result shows, PKMeans has a very good speedup performance. Specifically, as the size of the dataset increases, the speedup improves. The PKMeans algorithm can therefore treat large datasets efficiently.

Fig. 5.2 Speed up

b) Scaleup: Scaleup evaluates the ability of the algorithm to grow both the system and the dataset size. It is defined as the ability of an m-times larger system to perform an m-times larger job in the same run time as the original system. To demonstrate how well the PKMeans algorithm handles larger datasets when more computers are available, we have performed scaleup experiments in which the size of the datasets is increased in direct proportion to the number of computers in the system. Datasets of size 1 GB, 2 GB, 3 GB and 4 GB are executed on 1, 2, 3 and 4 computers respectively. Fig. 5.3 shows the performance results for these datasets. Clearly, the PKMeans algorithm scales very well.


Fig. 5.3. Scale up

c) Sizeup: Sizeup analysis holds the number of computers in the system constant and grows the size of the datasets by the factor m. Sizeup measures how much longer a given system takes when the dataset is m times larger than the original dataset. To measure the sizeup performance, we have fixed the number of computers to 1, 2, 3 and 4 respectively. Fig. 5.4 shows the sizeup results on different numbers of computers. The graph shows that PKMeans has a very good sizeup performance. [3]
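The three measures defined above can be written as simple ratios of run times. The sketch below follows the verbal definitions in this section; the function names and all run times are hypothetical values chosen only for illustration and are not measurements from the cited experiments.

```python
def speedup(time_one_computer, time_m_computers):
    """Same dataset on m computers instead of one: the ideal value is m."""
    return time_one_computer / time_m_computers


def scaleup(time_base, time_scaled):
    """m-times larger dataset on an m-times larger system: the ideal value is 1."""
    return time_base / time_scaled


def sizeup(time_base, time_m_times_data):
    """Same system, m-times larger dataset: how many times longer the run takes."""
    return time_m_times_data / time_base


# Hypothetical run times in seconds.
t_1gb_1node = 400.0
t_1gb_4nodes = 125.0
t_4gb_4nodes = 440.0
t_4gb_1node = 1500.0

print(speedup(t_1gb_1node, t_1gb_4nodes))  # 4 computers, same data
print(scaleup(t_1gb_1node, t_4gb_4nodes))  # 4x data on 4x computers
print(sizeup(t_1gb_1node, t_4gb_1node))    # 4x data on the same computer
```

A speedup close to m, a scaleup close to 1 and a sizeup no greater than m are the outcomes one would hope for from a well-parallelized algorithm.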

Fig. 5.4 Size up


5.3.Comparison based on no. of nodes and accuracy:

The data used in this experiment comes from the text classification corpus of Sogou Lab. The corpus consists of news from the Sohu site in recent years. The corpus data is divided into 10 categories (automobile, finance, IT, health, sports, tourism, education, recruitment, culture, and military). We choose a number of documents to cluster in order to verify the effectiveness and feasibility of the algorithm.

Experiment 1. Executing time of the K-Means algorithm on a stand-alone computer and of the version based on MapReduce and Hadoop, for the same number of documents. The following table shows the executing time of the stand-alone K-Means algorithm and the parallel one for 50,000 documents, about 50M of data.

Table 5.3. The Executing Result of 2 different K-means Algorithm

Clustering                                         The number of Documents   The number of nodes   Execution time
Stand-alone K-Means clustering                     50000 (50M)               1                     30 minutes
Hadoop and MapReduce Parallel K-Means Clustering   50000 (50M)               10                    10 minutes
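The two rows of Table 5.3 can be compared with a short calculation. The sketch treats the ratio of the two run times as a speedup and divides it by the node count to get a per-node efficiency; this interpretation is an assumption added here, since the stand-alone and parallel programs are different implementations.

```python
standalone_minutes = 30  # Table 5.3: stand-alone K-Means, 1 node
parallel_minutes = 10    # Table 5.3: parallel K-Means, 10 nodes
nodes = 10

speedup = standalone_minutes / parallel_minutes  # ratio of the two run times
efficiency = speedup / nodes                     # speedup obtained per node

print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
```

On these figures the parallel version is three times faster while using ten nodes, which is consistent with the remark in Section 5.2 that linear speedup is difficult to achieve.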

The following figure shows the executing time of the stand-alone K-Means algorithm and the parallel one for different numbers of documents. [2]


Experiment 2. Executing time of the parallel K-Means algorithm based on Hadoop and MapReduce under different numbers of nodes.

Fig. 5.5 Comparison of standalone k-means vs parallel k-means [2]


6. CONCLUSION

This work represents only a small first step in using the MapReduce programming technique to process large-scale data sets. We take advantage of the parallelism of MapReduce to design a parallel K-Means clustering algorithm. This algorithm can automatically cluster massive data, making full use of the performance of the Hadoop cluster, and can finish text clustering in a relatively short period of time. Experiments show that it achieves high accuracy. The main deficiencies are that the number k (the number of clusters to be generated) has to be set in advance, and that the algorithm is sensitive to initial values: different initial values may lead to different results. It is also sensitive to "noise" and outlier data.

Three aspects can be optimized to reduce the running time on the Hadoop platform:
(1) Run the reduce program across multiple nodes instead of on a single node.
(2) Platform optimization. We have not optimized the platform yet; it spends extra management time in each iteration, and this time can be reduced.
(3) Optimization of the algorithm itself. We need to customize the outputs of the mapper and reducer functions. A more efficient data format for storing the intermediate results would enable faster reads and writes. [2]

As data clustering has attracted a significant amount of research attention, many clustering algorithms have been proposed in the past decades. However, the growing volume of data in applications makes clustering of very large-scale data a challenging task. In this paper, we proposed a fast parallel k-means clustering algorithm based on MapReduce, which has been widely embraced by both academia and industry. We use speedup, scaleup and sizeup to evaluate the performance of the proposed algorithm. The results show that the proposed algorithm can process large datasets on commodity hardware effectively. [3]

7. FUTURE WORK

The MapReduce paradigm enables efficient parallelization of various large data-centric operations. Hadoop provides a very good framework for implementing parallel data-centric operations through HDFS (Hadoop Distributed File System) and MapReduce. We have implemented Parallel K-means clustering on Hadoop, and the results are efficient and promising. This parallel distributed paradigm can also be extended to various other clustering algorithms, one of which is the Hierarchical Clustering algorithm. In future, we look forward to implementing Hierarchical clustering on large-scale data sets and analyzing its efficiency against other parallel clustering algorithms.


REFERENCES:

[1] Yadav Krishna R and Purnima Singh, “MapReduce Programming Paradigm Solving Big-Data Problems by Using Data-Clustering Algorithm”, International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3, Issue 1, January 2014
[2] Ping Zhou, Jingsheng Lei, Wenjun Ye, “Large-Scale Data Sets Clustering Based on MapReduce and Hadoop”, Journal of Computational Information Systems 7: 16 (2011) 5956-5963
[3] Weizhong Zhao, Huifang Ma, and Qing He, “Parallel K-Means Clustering Based on MapReduce”, in M.G. Jaatun, G. Zhao, and C. Rong (Eds.): CloudCom 2009, LNCS 5931, pp. 674–679, Springer-Verlag Berlin Heidelberg, 2009
[4] Navjot Kaur, Jaspreet Kaur Sahiwal, Navneet Kaur, “Efficient K-Means Clustering Algorithm Using Ranking Method in Data Mining”, International Journal of Advanced Research in Computer Engineering & Technology, ISSN: 2278 – 1323, Volume 1, Issue 3, May 2012
[5] M.P.S. Bhatia and Deepika Khurana, “Experimental study of Data clustering using k-Means and modified algorithms”, International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 3, May 2013
[6] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, in OSDI 2004
[7] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System”, 978-1-4244-7153-9/10, IEEE, 2010
[8] Video 3. Hadoop Distributed File System (HDFS), http://www.cbtnuggets.com/it-training-videos/course/apache-hadoop/video3
[9] Video 4. Introduction to MapReduce, http://www.cbtnuggets.com/it-training-videos/course/apache-hadoop/video4


Research Papers:

1. Weizhong Zhao, Huifang Ma, and Qing He, “Parallel K-Means Clustering Based on MapReduce”, in M.G. Jaatun, G. Zhao, and C. Rong (Eds.): CloudCom 2009, LNCS 5931, pp. 674–679, Springer-Verlag Berlin Heidelberg, 2009

2. Ping ZHOU †, Jingsheng LEI, Wenjun YE School of Computer and Information Engineering, Shanghai University of Electric Power, Shanghai 200090, China, “Large-Scale Data Sets Clustering Based on MapReduce and Hadoop” in Journal of Computational Information Systems 7: 16 (2011) 5956-5963

3. Yadav Krishna R (Pursuing M.tech,) PIET, Vadodara Ms. Purnima Singh (Assistant Professor) PIET, Vadodara, “MapReduce Programming Paradigm Solving Big-Data Problems by Using Data-Clustering Algorithm” in International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) Volume 3, Issue 1, January 2014

4. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA, {Shv, Hairong, SRadia, Chansler} @ Yahoo-Inc.com, “The Hadoop Distributed File System”, in 978-1-4244-7153-9/10 ©2010 IEEE
