GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) - 2016 | July 2016
e-ISSN: 2455-5703
Association Rule Mining in Big Data using MapReduce Approach in Hadoop

1J. Jenifer Nancy, 2M. Jansi Rani, 3Dr. D. Devaraj
1P. G. Scholar, 2Assistant Professor, 3Senior Professor and H.O.D.
1,2Department of Computer Science & Engineering, 3Department of Electrical and Electronics Engineering
1,2,3Kalasalingam University, Krishnankovil, India

Abstract

Association rule mining is an important task in data mining. With big data, the sheer volume of data makes it impossible to generate rules quickly. By exploiting parallel execution in Hadoop with the MapReduce framework, the rules can be generated much faster and more efficiently. The existing method transforms the input dataset into a binomial representation before processing it with MapReduce. However, binomial conversion is not user-friendly because it becomes complex for continuous values. In this paper, an improved and scalable algorithm is proposed for association rule mining that converts the input dataset into key-value pairs instead of a binomial representation. All stages of the proposed association rule mining algorithm are parallelized using MapReduce. The proposed algorithm works on high-cardinality features, so no dimension detection is needed.

Keywords: Hadoop; MapReduce; Association rule mining; Data mining; Big data
I. INTRODUCTION

A. Big Data and Characteristics

Organizations and institutes collect and store data every minute, every hour and every day, so data is available in large quantities. What matters, however, is not the amount of data but what organizations do with it to identify useful information. Analyzing the data reveals insights and critical information that help an organization make sound decisions for its growth. The term big data describes a large volume of data that is available in both structured and unstructured formats. Although big data is a new term, the practice of collecting data, storing it in large amounts and analyzing it to gather new information long predates the term. The characteristics of big data can be explained using the 3 V's: (1) Volume, (2) Velocity and (3) Variety. The applications of big data include areas such as health care, telecom and finance. This paper discusses the process of association rule generation in big data and proposes an association rule mining technique to generate rules from the KDD Cup 99 dataset.

B. Data Mining in Big Data

Big data mining deals with the large amounts of data stored in data warehouses and databases. It can be used to extract or identify interesting patterns and information from this data. Many data mining techniques can be applied to big data: classification, clustering, association rules, prediction, estimation, documentation and description. These techniques have long been the subject of extensive research. Many algorithms have been applied within each data mining technique, and this also holds for big data. One well-known technique that is applied is association rule mining in big data.
Association rule mining is an efficient data mining technique used to discover hidden patterns and information in large databases. The relationships between the various attributes of the data are identified using an association rule mining algorithm. Some basic types of association rule mining algorithms are the Apriori algorithm, distributed algorithms and parallel algorithms.

C. Association Rule Mining

Association Rule Mining (ARM) [1] is a popular data mining approach used to analyse a given dataset and discover interesting patterns or relationships between the items in it. The concept of strong association rules was first used by Agrawal et al. [2] to identify association rules between the items sold, using a large-scale transaction database collected from a supermarket through a point-of-sale system. The relationship between the items is identified from the purchase pattern. The ARM technique generates a set of association rules that hold between the items of the given dataset, based on the number of occurrences of item combinations in the dataset.
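
The counting step described above can be illustrated with a minimal sketch in plain Python that imitates the MapReduce style on a single machine. This is not the algorithm proposed in this paper: the toy transactions, the function names (`map_phase`, `reduce_phase`, `count_itemsets`, `confidence`) and the `max_size` parameter are all hypothetical, and confidence is computed with its standard definition, which this section does not state explicitly.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transactions; not taken from the paper's dataset.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def map_phase(transaction, max_size=2):
    """Emit (itemset, 1) for every item combination of up to max_size items."""
    items = sorted(transaction)  # sort so each itemset has one canonical key
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            yield combo, 1


def reduce_phase(pairs):
    """Sum the values emitted for each itemset key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def count_itemsets(transactions, max_size=2):
    """Run the map step over all transactions, then reduce the emitted pairs."""
    pairs = (kv for t in transactions for kv in map_phase(t, max_size))
    return reduce_phase(pairs)


def confidence(counts, antecedent, consequent):
    """Standard confidence: count(antecedent and consequent) / count(antecedent)."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return counts[both] / counts[tuple(sorted(antecedent))]


counts = count_itemsets(TRANSACTIONS)
print(counts[("bread", "milk")])                    # occurrences of the pair
print(confidence(counts, ("butter",), ("bread",)))  # rule: butter -> bread
```

In a real Hadoop job the map and reduce steps would run in parallel across nodes, with the framework grouping the emitted pairs by key; here the grouping is done by a dictionary purely for illustration.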
All rights reserved by www.grdjournals.com