International Journal of Computer Networking, Wireless and Mobile Communications (IJCNWMC) ISSN 2250-1568 Vol. 3, Issue 2, Jun 2013, 53-56 © TJPRC Pvt. Ltd.
ENHANCE THE PERFORMANCE OF HADOOP DISTRIBUTED FILE SYSTEM FOR RANDOM FILE ACCESS USING INCREASED BLOCK SIZE P. D. GUDADHE1, A. D. GAWANDE2 & L. K. GAUTHAM3 1
M.E Scholar, Computer Engineering, Sipna COET, Amravati, Maharashtra, India
HOD, Department of Computer Science & Engineering, Sipna COET, Amravati, Maharashtra, India
Assistant professor in Department of computer Science & Engineering, Sipna COET, Amravati, Maharashtra, India
ABSTRACT Apache Hadoop is a top-level Apache project that includes open source implementations of a distributed file system and MapReduce that were inspired by Google’s GFS and MapReduce projects. It also called as a HDFS. The Hadoop Distributed File System (HDFS) is designed to store very large data sets, and to stream those data sets at high bandwidth to any application. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. Hadoop Distributed File System (HDFS) has been widely used to manage the large-scale data due to its high scalability. Data distribution, storage and access are essential to CPU-intensive and data-intensive high performance Grid computing . The random queries in large-scale data are becoming more and more important. Unfortunately, the HDFS is not optimized for random reads. Hence there are many disadvantages in random access to HDFS. So to overcome that problem we try to provide solutions which improve the performance of HDFS while accessing random file. Improvement of file access performance is a great challenge in real-time system. Then to achieve this design and implementation of a novel distributed layered cache system built on the top of the Hadoop Distributed File System which is named HDFS-based Distributed Cache System (HDCache). Another method is to use the block of larger size for storing the files in HDFS. This will increase the performance of HDFS for random access.
KEYWORDS: Apache Hadoop, Hadoop Distributed File System (HDFS), Robust Coherency Model INTRODUCTION HDFS has many similarities with other distributed file systems, but is different in several respects. One noticeable difference is HDFS's write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access . Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space. HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written. HDFS has following goals:
Fault tolerance by detecting faults and applying quick, automatic recovery
P. D. Gudadhe, A. D. Gawande & L. K. Gautham
Data access via MapReduce streaming
Simple and robust coherency model
Processing logic close to the data, rather than the data close to the processing logic
Portability across heterogeneous commodity hardware and operating systems
Scalability to reliably store and process large amounts of data
Economy by distributing data and processing across clusters of commodity personal computers
Efficiency by distributing data and logic to process it in parallel on nodes where data is located
Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures .
NAME NODES AND DATA NODES Within HDFS, a given name node manages file system namespace operations like opening, closing, and renaming files and directories. A name node also maps data blocks to data from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node , .
Figure 1: The HDFS Architecture As Figure 1 illustrates, each cluster contains one name node. This design facilitates a simplified model for managing each namespace and arbitrating data distribution. Name nodes and data nodes are software components designed to run in a decoupled manner on commodity machines across heterogeneous operating systems. HDFS is built using the Java programming language; therefore, any machine that supports the Java programming language can run HDFS. A typical installation cluster has a dedicated machine that runs a name node and possibly one data node. Each of the other machines in the cluster runs one data node. Data nodes continuously loop, asking the name node for instructions. A name node can't connect directly to a data node; it simply returns values from functions invoked by a data node. Each data node maintains an open server socket so
Enhance the Performance of Hadoop Distributed File System for Random File Access Using Increased Block Size
that client code or other data nodes can read or write data. The host or port for this server socket is known by the name node, which provides the information to interested clients or other data nodes. See the Communications protocols sidebar for more about communication between data nodes, name nodes, and clients.
DEFAULT BLOCK SIZE HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Figure 2: File Indexing for 64 MB Block As figure 2 shows the indexing for total 320MB of data (i.e. 64*5), we need 5 File index. With the help of file index we can search for the particular file on a block. More the number of index it is difficult to find a particular file on a Block. It will degrade HDFS performance for random file access , , .
OUR STRATEGY TO OVERCOME THE PROBLEM STATED ABOVE In figure 2 we discussed about the problem which degrades HDFS performance for random file access. To overcome this we problem we have suggested increasing the Block size to 128MB. This will reduce the numbers of index for accessing file in a particular Block.
Figure 3: File Indexing for 128MB Block
P. D. Gudadhe, A. D. Gawande & L. K. Gautham
In figure 3, we have stored a data of size 384MB. For storing a 384MB of data we need 3 Blocks and for accessing the files on 3 Blocks we need 3 Index. As number of index reduces the searching for particular file will become faster. In figure 2, we stored 324MB of data in 5 Blocks and for accessing data on that block we required 5 Index but in figure 3, we stored 384MB of data in 3 Blocks and for accessing particular data we need 3 Index. This technique will increase the performance for accessing a random file on HDFS.
We have suggested to increase the block size in HDFS to increase the performance.
As index reduces the file can be accessed faster than the older system.
This works best under very large set of files.
Wei Zhou, Jizhong Han, Zhang Zhang, Jiao Dai ” Dynamic Random Access for Hadoop Distributed FileSystem”, Distributed Computing Systems Workshops (ICDCSW), 2012 32nd International Conference on18-21 June 2012
2. Hung-Chang Hsiao, Haiying Shen, Hsueh-Yi Chung, Yu-Chang Chao” Load Rebalancing for Distributed File Systems in Clouds” , IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 24, NO. 5, MAY 2013 3.
VMware, http://www.vmware.com/, 2012.
Hadoop Distributed File System, http://hadoop.apache.org/hdfs/ , 2012.
Apache Hadoop, http://hadoop.apache.org/, 2012.
G.S. Manku, “Balanced Binary Trees for ID Management and Load Balance in Distributed Hash Tables,” Proc. 23rd ACM Symp. Principles Distributed Computing (PODC ’04), pp. 197-205, July 2004.
Hieu Hanh Le Dept. of Comput. Sci., Tokyo Inst. of Technol., Tokyo, Japan Hikida, S., Yokota, H. “An Evaluation of Power-Proportional Data Placement for Hadoop Distributed File Systems” Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on 12-14 Dec. 2011
Chandrasekar, S. Dept. of Comput. Sci. &'||','||' Eng., SSN Coll. of Eng., Kalavakkam, India Dakshinamurthy, R., Seshakumar, P.G., Prabavathy, B.;Babu, C. “A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System ” Computer Communication and Informatics (ICCCI), 2013 International Conference on 4-6 Jan. 2013
Published on Jun 30, 2013
Apache Hadoop is a top-level Apache project that includes open source implementations of a distributed file system and MapReduce that were i...