
Hadoop is an open-source framework for distributed programming that implements the MapReduce parallel computing model. With the help of Hadoop, programmers can easily write parallel, distributed programs that run on computing clusters to process massive data [11]. The basic components of an application on Hadoop's MapReduce are a Mapper class, a Reducer class, and a driver program that creates a JobConf. Some applications also include a Combiner class, which is in effect the Reducer executed locally.

Hadoop also implements a distributed file system referred to as HDFS. HDFS is highly fault-tolerant and is designed for deployment on low-cost hardware. It provides the high-throughput data access required by applications with large data sets, and it relaxes some POSIX requirements to allow streaming access to file system data. MapReduce splits the mission of an application into small blocks of work, while HDFS establishes multiple replicas of each data block for reliability and places them on compute nodes across the server groups, so that MapReduce can handle the data on the associated nodes. Figure 3 shows the Hadoop MapReduce framework [10].

Fig 3. Hadoop MapReduce framework architecture

During the Map process, a Combiner, which has a similar function to the Reducer, can be used to reduce intermediate results locally and thus improve the combination efficiency. The transformation between the input and the output looks as follows [6]:

map(key1, value1) → list(key2, value2)
reduce(key2, list(value2)) → list(key3, value3)
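To make these components concrete, the following minimal sketch shows a Hadoop application written against the classic org.apache.hadoop.mapred API, matching the signatures above: a Mapper, a Reducer (also registered as the local Combiner), and a driver that creates a JobConf. The class names and the simple counting logic are illustrative assumptions, not part of the proposed algorithm.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SkeletonJob {

    // map(key1, value1) -> list(key2, value2)
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            // Emit one intermediate (key2, value2) pair per input record.
            out.collect(new Text(value.toString()), new IntWritable(1));
        }
    }

    // reduce(key2, list(value2)) -> list(key3, value3)
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SkeletonJob.class);
        conf.setJobName("skeleton");
        conf.setMapperClass(Map.class);
        // The Combiner runs the reduce logic locally on each node.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}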

III. PROPOSED ALGORITHM

A. Distributed C4.5 with ensemble and MapReduce

The proposed algorithm, DC4.5 with ensemble and MapReduce, has three phases: the Partition/Map phase, the Build-base-classifier phase, and the Reduce/Ensemble phase. The data set D is divided into n subsets {D1, D2, ..., Dn}, where the user determines the value of n (a simple partitioning sketch is shown below). In the Map phase, a base classifier Ci is built for each subset Di with the DC4.5 algorithm. In the Reduce/Ensemble phase, the n base classifiers are assembled into the final classifier using Bagging.
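As a minimal illustration of the Partition phase, the sketch below divides a data set, represented here simply as a Java List, into the n user-chosen subsets by round-robin assignment. The in-memory List representation is an assumption made for clarity; in the actual system the subsets would be written to HDFS as the inputs of the Map phase.

import java.util.ArrayList;
import java.util.List;

public class Partition {
    // Divide data set D into n subsets {D1, D2, ..., Dn}; n is user-chosen.
    public static <T> List<List<T>> split(List<T> d, int n) {
        List<List<T>> subsets = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            subsets.add(new ArrayList<>());
        }
        // Round-robin assignment keeps the n subsets roughly equal in size,
        // which balances the workload of the n map tasks.
        for (int i = 0; i < d.size(); i++) {
            subsets.get(i % n).add(d.get(i));
        }
        return subsets;
    }
}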


B. Types of keys and values

The key and value types of MReC4.5 are as follows:

key1 : Text      value1 : Instances
key2 : Text      value2 : Iterator of Classifiers
key3 : Text      value3 : Classifier

key1, key2, and key3 are all of the Text type offered by Hadoop, and their values are the file name associated with the input data set D. In the Partition phase, the data set D is split into n subsets, and according to the input format of the C4.5 algorithm each subset is formatted as a value1 of the Instances type. In the Map phase, we build a classifier model with the C4.5 algorithm on each subset and obtain the classifier model set value2, which is of the Iterator of Classifiers type. In the Reduce phase, we assemble the classifiers from value2 to obtain a single classifier model value3 of the Classifier type; a sketch of this combination step follows.
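The following hedged sketch shows one way the Reduce/Ensemble step could combine the base classifiers in value2 into a single value3 by majority voting, the combination rule used by Bagging. The VotedClassifier wrapper and its use of the Weka Classifier API are illustrative assumptions; the paper does not show this code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import weka.classifiers.Classifier;
import weka.core.Instance;

public class VotedClassifier {
    private final List<Classifier> base = new ArrayList<>();

    // value2 in the paper's notation: an Iterator of Classifiers.
    public VotedClassifier(Iterator<Classifier> classifiers) {
        while (classifiers.hasNext()) base.add(classifiers.next());
    }

    // Majority vote over the base classifiers' predicted class indices.
    public double classifyInstance(Instance inst) throws Exception {
        Map<Double, Integer> votes = new HashMap<>();
        for (Classifier c : base) {
            votes.merge(c.classifyInstance(inst), 1, Integer::sum);
        }
        double best = 0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}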


C. Map/Reduce Phase

Figure 4 specifies the proposed algorithm for the Map operation with respect to the C4.5 algorithm. The original algorithm is modified so that Map and Reduce operate on the key/value pairs described above.

function mapper(key, value)
  /* Build base classifier */
  1: Build a C4.5 classifier c with the data set value;
  /* Submit intermediate results */
  2: Emit(key, c);
  3: Generate the subsets {D1, D2, ..., Dn};
  4: Build and map each base classifier Ci;
  5: Integrate the (key, c) pairs into one cluster;


Fig 4. The Map Operation
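A hedged Java sketch of the map operation in Figure 4 follows, assuming the C4.5 learner is Weka's J48 implementation and that a custom input format (implied but not shown in the paper) presents each subset as a Weka Instances object. Serializing the trained model into a BytesWritable is likewise an illustrative choice, not the paper's exact encoding.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import weka.classifiers.trees.J48;
import weka.core.Instances;

public class C45Mapper extends MapReduceBase
        implements Mapper<Text, Instances, Text, BytesWritable> {

    public void map(Text key, Instances value,
                    OutputCollector<Text, BytesWritable> out, Reporter reporter)
            throws IOException {
        try {
            // 1: build a C4.5 classifier c with the local subset Di.
            J48 c = new J48();
            c.buildClassifier(value);

            // 2: emit the intermediate result (key, c) for the Reducer,
            // serializing the trained model into a byte array.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bytes);
            oos.writeObject(c);
            oos.flush();
            out.collect(key, new BytesWritable(bytes.toByteArray()));
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}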
