International Journal of Innovative Research in Applied Sciences and Engineering by G.SUSEENDRAN

International Journal of Innovative Research in Applied Sciences and Engineering (IJIRASE)

Volume 1, Issue 1, June 2016

A DETAILED STUDY ON DEDUPLICATION IN CLOUD COMPUTING

B. Mahalakshmi, Research Scholar, Department of Information Technology, School of Computing Sciences, Vels University, Chennai
maha.karthik921@gmail.com

ABSTRACT: In cloud computing, the term "cloud" refers to a remote service provider that offers large-scale network services and all types of resources connected to the network. Data deduplication is a data compression technique that is very important in cloud computing. With the massive use of cloud storage, the growing volume of data becomes one of the major issues. Deduplication is used to avoid this complication and to reduce storage space and bandwidth. In this paper we discuss cloud architecture and present a detailed study of deduplication.

1. INTRODUCTION

1.1 Cloud Computing

Cloud computing is a style of computing that depends primarily on sharing resources rather than on local servers or individual devices to handle applications. It permits system software on web-enabled devices to access those resources. Cloud computing is also referred to as "the cloud", a term often used as a metaphor for the Internet. Cloud computing serves various functions over the Internet, such as storage, virtual servers, applications and access to desktop applications. By taking advantage of shared resources, cloud computing is able to provide scalability and reliability. Cloud computing models are classified into two groups: service models and deployment models.

1.2. Types of Cloud Computing Model

The deployment models [1] represent types of cloud environment that are distinguished mainly by privileges, size and access. A deployment model describes the nature of the cloud and its purpose. An organization or individual uses the model that matches its needs and requirements.

Fig 1 Cloud Deployment Models


1.2.1. Public Cloud

Public clouds are owned and operated by companies that use them to offer other organizations or individuals rapid access to affordable computing resources. When an enterprise uses a public cloud service, it does not need to purchase hardware, software or supporting infrastructure, all of which are owned and managed by the provider. A public cloud offers low cost but lacks the security of private and hybrid clouds.

1.2.2. Private Cloud

A private cloud is owned and controlled by a single company, which determines how virtualized resources and other automated services are customized and used by its various business units. A private cloud offers an enterprise a high level of security as well as extensive control over configuration.

1.2.3. Hybrid Cloud

A hybrid cloud is a combination of public and private clouds. It uses a private cloud foundation combined with the strategic use of public cloud services. For the best possible functionality, the private cloud must be integrated with the company's other resources and with the public cloud. Most companies that use private clouds will eventually manage workloads across data centers in both public and private clouds. The result is a hybrid cloud. A hybrid cloud can support strong security and can be configured to handle both secure and public processing at the lowest cost.

1.2.4. Community Cloud

In a community cloud, groups with shared and specific needs share the cloud infrastructure. Examples include a U.S. federal agency cloud with strict security requirements and a medical cloud with strict regulatory and policy requirements for data privacy. A community cloud supports collaborative efforts and the management of secure data.

2. CLOUD COMPUTING SERVICES

Many cloud service providers are available for a business cloud that connects its users from anywhere. The cloud offers clients cloud-based collaboration, communication and infrastructure. For operations that must meet regulatory requirements, specialized compliance and security services make it possible to safeguard information and remain compliant. Compliant messaging must also meet government and regulatory mandates. Figure 2 shows the cloud service models [2].

Fig 2 Cloud Service Model


2.1. Software as a Service (SaaS)

Software as a service (SaaS) provides software applications to end users as a service. The software is hosted by the provider and accessed over the Internet.

2.2. Platform as a Service (PaaS)

Platform as a service provides a cloud-based platform for developing web-based applications. This service option reduces the complexity and cost of acquiring and managing the underlying hardware, software and provisioning. When choosing a PaaS offering, one has to know what is included for free, how long it can be retained, whether the security requirements are met, and which customers it fits in terms of functionality and size.

2.3. Infrastructure as a Service (IaaS)

Infrastructure as a service (IaaS) provides companies with computing resources, including servers, networking, storage and data center space, on a pay-per-use basis. The service provider owns the equipment and is responsible for maintaining it. The client pays for usage rather than up front, so setup costs can be minimized.

3. LITERATURE REVIEW

Using a fuzzy clustering model, duplicate records are clustered into groups so that elimination is simple and the level of deduplication is improved, as in "A Semantic Deduplication of Temporal Dynamic Records from Multiple Web Databases" [3]. In cloud computing, data is encrypted in such a way that a user must hold certain attributes, and privilege rights are used for accessing the data, so the data is stored securely against unknown users [4]. An artificial intelligence technique is used to detect whether any intrusion happens in a private cloud, so that real-time data is secured. This model is proposed for use in the banking sector because it is a high-end technique [5].

An investigation of data deduplication, its techniques, and the changes introduced in deduplication by virtualized data centers and the evolution of the current cloud computing era is presented in "An Investigation on Data De-duplication Methods and Its Recent Advancement" [6]. For a hybrid cloud architecture, a new deduplication system with differential duplicate check is proposed in which the S-CSP resides in the public cloud. The user is allowed to perform the duplicate check only for files marked with the corresponding privileges, as described in "A Hybrid Cloud Approach for Secure Authorized Deduplication" [7].


4. DEDUPLICATION

Among data compression methods, data deduplication [8, 9] is one of the best for eliminating duplicate copies of repeated data in storage. The technique improves storage utilization and can also be applied to network data transfer to reduce the number of bytes that must be sent. Instead of keeping multiple copies of data with identical content, deduplication removes redundant data by keeping only one physical copy and pointing the other redundant data to that copy, as shown in Fig 3.

Fig 3 Before and After Data Deduplication

Deduplication occurs at two levels: the file level and the block level. File-level deduplication eliminates duplicate copies of the same file. Block-level deduplication eliminates duplicate blocks of data that occur in non-identical files. Although data deduplication has many benefits, security and privacy issues arise because users' sensitive data are at risk from both outside and inside attacks.

Conventional encryption provides data confidentiality but is incompatible with data deduplication. Specifically, conventional encryption requires different users to encrypt their data with their own keys. Hence identical data copies belonging to different users lead to different ciphertexts, which makes deduplication impossible.

5. OVERVIEW OF EXISTING ALGORITHMS USED FOR DEDUPLICATION

This section lists some of the existing algorithms used for deduplication. An overview of these algorithms is useful for understanding deduplication and for choosing the most suitable one to implement.

5.1 Post-Process Deduplication

Post-process deduplication [10] is a technique in which new data is first stored on the storage device and is analyzed for duplication at a later time. The advantage is that there is no need to wait for hash calculations and lookups to complete before storing the data, which ensures that storage performance is not degraded. Implementations that offer policy-based operation can give users the flexibility to defer optimization on "active" files or to process files based on type and location. One possible drawback is that duplicate data may be stored unnecessarily for a time, which is a problem if the storage system is close to full capacity. Fig 4 illustrates this process.
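The store-first, analyze-later behaviour described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the class `PostProcessStore`, its method names and the choice of SHA-256 are hypothetical and are not taken from any of the cited systems.

```python
import hashlib


class PostProcessStore:
    """Toy store: writes are kept as-is; duplicates are removed by a later pass."""

    def __init__(self):
        self.blobs = {}    # blob id -> bytes physically held
        self.files = {}    # file name -> blob id
        self._next_id = 0

    def write(self, name, data):
        # Write path: no hashing and no lookup, so ingest is not slowed down.
        old = self.files.get(name)
        blob_id = self._next_id
        self._next_id += 1
        self.blobs[blob_id] = data
        self.files[name] = blob_id
        # Drop the overwritten blob if no other file still points at it.
        if old is not None and old not in self.files.values():
            del self.blobs[old]

    def deduplicate(self):
        # Deferred pass: hash every stored blob and keep one copy per digest.
        keeper_for = {}  # digest -> blob id that is kept
        remap = {}       # blob id -> blob id that replaces it
        for blob_id in sorted(self.blobs):
            digest = hashlib.sha256(self.blobs[blob_id]).hexdigest()
            remap[blob_id] = keeper_for.setdefault(digest, blob_id)
        self.files = {name: remap[b] for name, b in self.files.items()}
        for blob_id, keeper in remap.items():
            if keeper != blob_id:
                del self.blobs[blob_id]

    def read(self, name):
        return self.blobs[self.files[name]]

    def physical_bytes(self):
        return sum(len(b) for b in self.blobs.values())
```

Until `deduplicate()` runs, every duplicate occupies space, which is the drawback noted above for a store that is close to full capacity.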

Fig 4 Post-Process Deduplication

5.2 In-Line Deduplication

In inline deduplication [11], the deduplication hash calculations are performed on the target device in real time as the data enters the device. If the device finds a block that it has already stored on the system, it does not store the new block but simply references the existing block. The advantage of inline deduplication over post-process deduplication is that it requires less storage, because data is not duplicated. On the negative side, it is often argued that because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device. However, certain vendors with inline deduplication have demonstrated equipment with performance similar to that of their post-process deduplication counterparts. Post-process and inline deduplication methods are often heavily debated. Fig 5 illustrates inline deduplication.

Fig 5 Inline Deduplication
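A minimal sketch of the inline write path follows, assuming fixed-size blocks and SHA-256 digests as block identifiers. The class `InlineDedupStore`, the tiny `BLOCK_SIZE` and the in-memory dictionaries are hypothetical choices made for illustration; real devices differ in chunking and indexing.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


class InlineDedupStore:
    """Toy block store that deduplicates while data is being written."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (one physical copy each)
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            # The hash is computed on the write path, before anything is stored.
            digest = hashlib.sha256(block).hexdigest()
            # Assumption: equal digests are treated as equal blocks (no byte compare).
            if digest not in self.blocks:
                self.blocks[digest] = block
            refs.append(digest)  # a known block is only referenced, not stored again
        self.files[name] = refs

    def read(self, name):
        # Reassemble the file by following its block references.
        return b"".join(self.blocks[d] for d in self.files[name])

    def physical_bytes(self):
        return sum(len(b) for b in self.blocks.values())
```

Writing `b"AAAABBBBCCCC"` and then `b"AAAAXXXXCCCC"` stores four distinct blocks instead of six. The hash and lookup performed on every block inside `write` is the ingest cost that the text attributes to inline deduplication.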

5.3 Source Versus Target Deduplication

Another way to classify data deduplication is by where it occurs. When deduplication happens close to where the data is created, it is referred to as source deduplication. When it happens near where the data is stored, it is called target deduplication. Source deduplication ensures that the data on the data source is deduplicated. This usually takes place within the file system. The file system periodically scans new files, creates hashes and compares them with the hashes of existing files [12]. When files with the same hash are found, the duplicate copy is removed and the new file points to the existing file. Unlike hard links, however, deduplicated files are considered separate entities, and if one of the deduplicated files is later modified, a copy of that file or of the changed block is created using a mechanism called copy-on-write. The deduplication process is transparent to users and backup applications. Backing up a deduplicated file system often causes duplication to occur, resulting in backups that are larger than the source data. Target deduplication is the process of removing duplicates of data in the secondary store. This is usually a backup store such as a data repository or a virtual tape library. Figure 6 illustrates source versus target deduplication.
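The difference from hard links can be made concrete with a small model of file-level source deduplication with copy-on-write. This is a sketch under simplifying assumptions: the class `DedupFileSystem` and its fields are hypothetical, the periodic scan is collapsed into the write call, and deduplication is done on whole files rather than blocks.

```python
import hashlib


class DedupFileSystem:
    """Toy file system: identical files share one copy until one is modified."""

    def __init__(self):
        self.content = {}   # digest -> file bytes (single physical copy)
        self.refcount = {}  # digest -> number of paths pointing at that copy
        self.files = {}     # path -> digest

    def _release(self, digest):
        # Drop one reference; free the copy when nothing points at it.
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.content[digest]

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        old = self.files.get(path)
        if old == digest:
            return  # content unchanged, nothing to do
        # Copy-on-write: the shared copy is never changed in place;
        # the modified file gets its own (possibly already existing) copy.
        if digest not in self.content:
            self.content[digest] = data
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        self.files[path] = digest
        if old is not None:
            self._release(old)

    def read(self, path):
        return self.content[self.files[path]]
```

After `/a` and `/b` are written with the same bytes, only one copy is held. Rewriting `/b` leaves `/a` untouched, whereas with a hard link both names would show the change.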

Fig 6 Source versus Target Deduplication

One of the most common forms of data deduplication works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identifier, calculated by the software, typically using a cryptographic hash function. Many implementations assume that if the identifiers are identical, the data is identical, even though this cannot be true in all cases because of the pigeonhole principle. Other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with the same identifier is identical. Once the software has either assumed that a given identifier already exists in the deduplication namespace or actually verified that the two blocks of data are identical, depending on the implementation, it replaces the duplicate chunk with a link. After the data has been deduplicated, when the file is read back and a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications.

6. CONCLUSION

This study of existing algorithms shows that many algorithms are used for deduplication. Some of them, namely inline deduplication, post-process deduplication, and source versus target deduplication, are discussed here. The result of this study is that although many deduplication algorithms exist, storage volume and bandwidth usage still have to be improved in a secure way.

7. REFERENCES

1. Li, Xiao-Yong, et al. "A trusted computing environment model in cloud architecture." Machine Learning and Cybernetics (ICMLC), 2010 International Conference on. Vol. 6. IEEE, 2010.
2. Ostermann, Simon, et al. "A performance analysis of EC2 cloud computing services for scientific computing." Cloud Computing. Springer Berlin Heidelberg, 2009. 115-131.
3. Devi R P, Thigarasu V, A Semantic Deduplication of Temporal Dynamic Records from Multiple Web Databases, Indian Journal of Science and Technology, 2015 Dec, 8(34), doi: 10.17485/ijst/2015/v8i34/75103.
4. Manjusha R, Ramachandran R, Secure Authentication and Access System for Cloud Computing Auditing Services Using Associated Digital Certificate, Indian Journal of Science and Technology, 2015 Apr, 8(S7), doi: 10.17485/ijst/2015/v8iS7/71223.
5. Rajendran P K, Hybrid Intrusion Detection Algorithm for Private Cloud, Indian Journal of Science and Technology, 2015 Dec, 8(35), doi: 10.17485/ijst/2015/v8i35/80167.

6. Kaurav N, An Investigation on Data De-duplication Methods and Its Recent Advancement, 2014.
7. Li J, Li Y K, Chen X, Patrick P.C. Lee, Lou W, A Hybrid Cloud Approach for Secure Authorized Deduplication, IEEE, 2015 May 1, 26(5), pp. 1206-16.
8. George C, Efficient Secure Authorized Deduplication in Hybrid Cloud Using OAuth, 2015 March, 4(3), pp. 200-05.
9. Dheeraj, Alamuri, and P. Krishna Sai. "The cloud Approach for Consistent Appropriate deduplication." IJSEAT 3.11 (2015): 1010-1015.
10. Tsuchiya, Yoshihiro, and Takashi Watanabe. "Dblk: Deduplication for primary block storage." Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th Symposium on. IEEE, 2011.
11. Fu, Yinjin, et al. "AA-Dedupe: An application-aware source deduplication approach for cloud backup services in the personal computing environment." Cluster Computing (CLUSTER), 2011 IEEE International Conference on. IEEE, 2011.
12. Kakariya G, and Rangdale S, "A Hybrid Cloud Approach for Secure Authorized Deduplication."