
ISSN: 2250–3676

K PRIYANKA* et al. [IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY

Volume - 2, Special Issue - 1, 71 – 75

AN APPROACH FOR PRIVACY PRESERVATION USING XML DISTANCE MEASURE

K. Priyanka1, U. Arundhathi2, D. Grace Priscilla3

1 M.Tech, CSE, K L University, Andhra Pradesh, India, priyankarachelk@gmail.com
2 M.Tech, CSE, K L University, Andhra Pradesh, India, arundhathi.ummadisetty89@gmail.com
3 M.Tech, CSE, K L University, Andhra Pradesh, India, prici.diyya@gmail.com

Abstract

In recent years, a wide variety of personal information has become available, and protecting the privacy of individual information against linkage with external sources has become a great challenge. Data mining, which plays a key role in knowledge discovery, is applied to many of these data sets and can reveal private and sensitive details of individuals, so privacy protection is necessary. One commonly used approach is to anonymize the data. However, anonymization techniques suffer from the homogeneity attack and the background knowledge attack, and cannot prevent attribute disclosure. In this paper we use an XML distance measure to preserve the privacy of individuals' information.

Index Terms: Anonymity, Closeness, XML distance, Privacy.

1. INTRODUCTION

Many organizations publish microdata such as census data and medical data. This data contains private and sensitive details. For example, in the medical domain, data mining techniques are commonly applied to datasets to obtain disease patterns, but the information about each patient's disease must be kept private. For this we need to anonymize the data. Two commonly used anonymization techniques are generalization and suppression. The main drawback of generalization is that it requires manual construction of a domain hierarchy tree for every domain. The drawback of suppression is that it results in over-anonymization if not used properly.

Microdata is published in the form of tables, where each table consists of records containing information about individuals. Every record has a set of attributes, and these attributes are categorized into three types:

1. Explicit Identifiers: Attributes that identify an individual exactly. Ex: Name, Social Security Number.
2. Quasi Identifiers: Attributes that can be linked with external sources, such as a voter registration table, to re-identify the individual even after anonymization. Ex: Zip-Code, Birth-Date and Gender.
3. Sensitive Attributes: Attributes that are considered sensitive. Ex: Disease and Salary.

We have to protect the sensitive attributes before we release the data for data mining tasks. Samarati and Sweeney [13, 14, 15, 16] formulated mechanisms for k-anonymization using the ideas of generalization and suppression. In a relational database, there is a domain (e.g., integer, date) associated with each attribute of a relation. Given this domain, it is possible to construct a "more general" domain in a variety of ways. For example, the Zip code domain can be generalized by dropping the least significant digit, and continuous attribute domains can be generalized into ranges.

The technique commonly used to protect the sensitive attributes of individuals is k-anonymity. However, it has drawbacks: it is vulnerable to the homogeneity attack, and it is insufficient to prevent attribute disclosure. We therefore use an XML distance measure to preserve the privacy of individuals' information. In the basic model, we determine the closeness between the attributes by setting a threshold value, and then represent the data in XML format and measure it with an XML distance measure.
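The two generalization operations mentioned in this section (dropping trailing Zip code digits and mapping a continuous attribute to a range) can be sketched as follows. This is only an illustration: the function names, the default range width of 10 and the sample values are assumptions, not part of the paper.

```python
def generalize_zip(zip_code: str, masked_digits: int = 1) -> str:
    """Replace the last `masked_digits` digits of a Zip code with '*'."""
    if not 0 <= masked_digits <= len(zip_code):
        raise ValueError("masked_digits out of range")
    keep = len(zip_code) - masked_digits
    return zip_code[:keep] + "*" * masked_digits


def generalize_age(age: int, width: int = 10) -> str:
    """Map an age to the range of size `width` that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(generalize_zip("67891"))      # 6789*
print(generalize_zip("67891", 2))   # 678**
print(generalize_age(56))           # 50-59
```

Masking more digits or widening the range gives a more general domain, at the cost of less precise published data.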

IJESAT | Jan-Feb 2012 Available online @ http://www.ijesat.org

2. PRIVACY PRESERVATION USING K-ANONYMITY

The approach we have used here is k-anonymity; on its own it does not fully protect the information, because of attribute disclosure. In our approach, tuples in the table are partitioned into several groups such that each group has at least k different sensitive attribute values. We then perform a permutation between the tuples' quasi-identifiers and their sensitive attribute inside each group.

k-anonymity (definition): Each release of data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k respondents.

Besides full-domain generalization, numerous other models have also been proposed for k-anonymization. The second contribution in this paper is a single taxonomy that categorizes previous models and introduces some promising new alternatives. k-anonymity [1] demands that every tuple in the released microdata table be indistinguishably related to no fewer than k respondents. One interesting aspect of k-anonymity is its association with protection techniques that preserve the truthfulness of the data. In this section we discuss the concept of k-anonymity and the algorithms for its enforcement.

The k-anonymity requirement [2, 12] is quite simple. Intuitively, it stipulates that no individual record should be uniquely identifiable from a group of k on the basis of its quasi-identifier values. We refer to each group of tuples in a table T with identical quasi-identifier values as an equivalence class. The concept of k-anonymity was proposed by Samarati and Sweeney [10] to anonymize microdata such that the correctness of the released (anonymized) data is preserved. For microdata to meet the k-anonymity requirement, every record must be indistinguishable from at least k-1 other records with respect to the quasi-identifier. In other words, k-anonymity requires that each equivalence class contains at least k records. However, k-anonymity cannot always guarantee privacy: while it protects against identity disclosure, it is insufficient to prevent attribute disclosure. Machanavajjhala et al. [2] showed that k-anonymity suffers from the homogeneity attack and the background knowledge attack.
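The equivalence-class form of the k-anonymity requirement can be checked with a short sketch: group the records by their quasi-identifier values and verify that every group holds at least k records. The function names and the sample records below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records by their tuple of quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return dict(classes)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class holds at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


# Hypothetical generalized records, for illustration only.
records = [
    {"age": "5*", "zip": "678**", "disease": "Heart disease"},
    {"age": "5*", "zip": "678**", "disease": "Flu"},
    {"age": "3*", "zip": "768**", "disease": "Bronchitis"},
    {"age": "3*", "zip": "768**", "disease": "Bronchitis"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # True
print(is_k_anonymous(records, ["age", "zip"], k=3))  # False
```

In this sample the second class is 2-anonymous, yet both of its records carry the same disease, so the sensitive value is disclosed anyway. This is the homogeneity problem described above.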

Table 1

S.NO  AGE  ZIP CODE  DISEASE        GENDER
1     50   67891     HEART DISEASE  MALE
2     56   67898     HEART DISEASE  FEMALE
3     32   67892     FLU            FEMALE
4     45   67821     GASTRITIS      FEMALE
5     35   76890     BRONCHITIS     MALE
6     60   76894     PNEUMONIA      MALE
7     56   76897     FLU            FEMALE
8     45   89759     DIABETES       MALE
9     29   89756     CANCER         MALE
10    25   67456     CANCER         FEMALE

After applying k-anonymity to the Age and Zip Code attributes of Table 1, the second digit of Age and the last two digits of Zip Code are replaced with '*'. Explicitly identifying attributes, such as names, are also removed. The result is shown in Table 2. In this table the data is anonymized, and the first four records, which share the generalized Zip Code 678**, can be formed into a single equivalence class containing a variety of diseases: Heart Disease, Flu and Gastritis. We cannot directly identify the person who is suffering from Flu, but if we know background details such as the age of the person, we can find that person. To protect the person's details better, we introduce the t-closeness method.

Table 2

S.NO  AGE  ZIP CODE  DISEASE        GENDER
1     5*   678**     HEART DISEASE  MALE
2     5*   678**     HEART DISEASE  FEMALE
3     3*   678**     FLU            FEMALE
4     4*   678**     GASTRITIS      FEMALE
5     3*   768**     BRONCHITIS     MALE
6     6*   768**     PNEUMONIA      MALE
7     5*   768**     FLU            FEMALE
8     4*   897**     DIABETES       MALE
9     2*   897**     CANCER         MALE
10    2*   674**     CANCER         FEMALE

3. PRIVACY GOAL

There are three types of attributes in an original microdata table M: explicit identifiers, quasi-identifiers and sensitive attributes. For simplicity, we assume there is only one sensitive attribute S in a microdata table, and focus on numeric sensitive attributes. Our technique can be easily adapted to process categorical sensitive attributes as well.


4. t-CLOSENESS

An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes [4] have t-closeness. Closeness is a property under which records with similar attribute values fall into the same group. For identifying the closeness we propose a distance method called the XML distance measure.

t-closeness is the basic model. The more flexible model is (n,t)-closeness. The choice of the parameters n and t affects the level of privacy and utility. Here n is the number of individuals and t is the threshold value. A larger n value gives better privacy, but utility is reduced when the threshold value is low. We have to choose compatible values of n and t so that we obtain both good privacy and good utility. In (n,t)-closeness, the distance between the distribution of the sensitive attribute in an equivalence class and its distribution in a superset of that class containing at least n records is no more than the threshold t.

5. XML DISTANCE MEASURE

The XML distance measure is an efficient technique for measuring the distance between two attributes. In the table, the details of the individuals are listed along with sensitive details such as disease and age. This data must be kept private and available only to authorized users. Closeness was proposed by Li et al. [3]. The XML distance measure can use various mathematical notations for finding the distance between the attributes. The XML distance is mainly useful for nested relations, and it can handle different kinds of attributes, categorical as well as numerical. We represent the values in string form. In the XML distance measure, the comparison can be done on the left portion and the right portion. An algorithm is also used for this distance: if the distances of both attributes are zero, the result is computed from attribute 1 and attribute 2; if the distance of only one of the attributes is zero, the result is computed from attribute 1 and attribute 2 plus 1.

5.1 XML REPRESENTATION

XML uses two very different sets of syntax, as well as a variety of representations for linked content. Although the choice was made very early to preserve SGML compatibility, alternative notation might enhance XML's simplicity and its extensibility. The alternative notation described in this proposal could still be mapped to existing SGML notation, while making it possible to extend XML in new directions. This notation is also intended to make XML more self-consistent, reducing the number of notations needed to represent similar constructs.

Distance measure techniques are used to match the records, each of which holds the particular details of an individual. The XML distance method is one such distance measure technique; it helps to find the distance between the attributes. The records have to be kept in a specified format. Since we are using the XML method [8], a document type definition (DTD) needs to be used.
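The t-closeness condition defined in Section 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's procedure: total variation distance is used here as a simple stand-in for whichever distance is chosen (the paper proposes the XML distance measure for this role), and the function names and sample values are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def has_t_closeness(classes, t):
    """True if every class distribution is within t of the overall one."""
    all_values = [v for group in classes for v in group]
    overall = distribution(all_values)
    return all(total_variation(distribution(g), overall) <= t for g in classes)


# Hypothetical sensitive values, one list per equivalence class.
classes = [
    ["Flu", "Cancer"],
    ["Flu", "Flu"],
]

print(has_t_closeness(classes, t=0.5))  # True
print(has_t_closeness(classes, t=0.1))  # False
```

Both sample classes lie at distance 0.25 from the overall distribution, so the table satisfies the condition for t = 0.5 but not for t = 0.1. A smaller t is a stricter requirement.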

Representation languages are typically classified as being either attribute-value or relational first-order. In an attribute-value representation, the objects in a data set can be summarized in a table, where the columns represent attributes, and each row represents an object, with the cell entries in that row being the specific values for the corresponding attributes. In a first-order representation, an object is represented by a ground atom of a distinguished predicate symbol, where the positions of the arguments represent the attributes, the arguments themselves represent [5] the corresponding attributes' values, and these arguments are further defined by their occurrence in some additional set of ground atoms. Analysis of data in terms of a first-order representation is also referred to as multi-relational data mining, since the first-order representation provides a more powerful and reasonable way to describe objects than an attribute-value representation.

With the increased use of XML (Extensible Markup Language) as a means for unambiguous representation of data across platforms, more and more data is delivered in XML. When performing data analysis over such data, it is normally transformed into an attribute-value or first-order representation if distance measures are involved. The XML distance measure was developed initially by [6] for use with a hierarchical clustering algorithm to cluster XML documents representing patterns of alerts in the domain of intrusion detection. The results of that data mining experiment show that the measure is very effective for this purpose.

In our setting, the document contains the patient details Name, Age, Zip code and Disease. A DTD (Document Type Definition) is used when finding the distance between two attributes. The XML document can be constructed as a tree-like structure, and the attributes compared. Based on the table we can construct the XML tree. For the distance, we can use a formula based on the Hungarian matrix method.
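As an illustration of the tree representation described above, a patient record can be turned into a small XML tree with Python's standard `xml.etree.ElementTree` module. The element names (`patient`, `age`, `zip`, `disease`) and the helper `mismatch_fraction` are assumptions made for this sketch; in particular, `mismatch_fraction` is a deliberately simple comparison and is not the XML distance measure of [6].

```python
import xml.etree.ElementTree as ET


def patient_to_xml(record):
    """Build a <patient> element with one child element per attribute."""
    patient = ET.Element("patient")
    for name, value in record.items():
        child = ET.SubElement(patient, name)
        child.text = str(value)
    return patient


def leaf_values(element):
    """Return {tag: text} for the direct children of an element."""
    return {child.tag: child.text for child in element}


def mismatch_fraction(a, b):
    """Fraction of child tags whose text differs between two elements."""
    values_a, values_b = leaf_values(a), leaf_values(b)
    tags = set(values_a) | set(values_b)
    if not tags:
        return 0.0
    differing = sum(values_a.get(t) != values_b.get(t) for t in tags)
    return differing / len(tags)


p1 = patient_to_xml({"age": "5*", "zip": "678**", "disease": "Flu"})
p2 = patient_to_xml({"age": "5*", "zip": "678**", "disease": "Gastritis"})

print(ET.tostring(p1, encoding="unicode"))
# <patient><age>5*</age><zip>678**</zip><disease>Flu</disease></patient>
print(mismatch_fraction(p1, p2))  # one of three attributes differs
```

Keeping each attribute as its own child element means two records can be compared node by node, which is the kind of pairwise comparison that the matrix in the next paragraph collects.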
The attributes from the table, such as Age, Zip code and Disease, are applied to the patient records. We have 10 patients, so the matrix has 10 rows and 4 columns: M is an m × n matrix with m = 10 and n = 4. Since m > n, M is extended to a square m × m matrix M′. The distance between the two element sets is then defined as Hungarian(M′) / m. If m is much larger than n, then M′ will be much larger than M, and applying the Hungarian algorithm to the former will incur a much greater cost. It turns out, however, that under the above assumptions Hungarian(M′) can be computed more simply as Hungarian(M) + m − n. This is because adding a virtual element to one of the sets, and defining its distance to the elements in the other set to be 1, means that however that virtual element is matched with an element from the other set, this always adds exactly 1 to the overall cost.

Previous work on similarity between XML documents has concentrated on structural similarity. This paper uses an XML distance measure based on finding optimal matchings. XML structure, attributes, and contents are considered in the similarity computation. This measure has been proven effective as the basis for a clustering algorithm in the specific domain of network intrusion detection.
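The relation between Hungarian(M′) and Hungarian(M) + m − n can be checked on a small example. The sketch below is a minimal illustration under stated assumptions: it finds the minimum-cost matching by brute force instead of implementing the Hungarian algorithm (so it is only usable for very small matrices), the virtual columns are filled with 1 as described above, and the matrix values and function names are hypothetical.

```python
from itertools import permutations


def min_assignment_cost(matrix):
    """Minimum-cost matching of every column to a distinct row (rows >= cols).

    Brute force over row choices; a stand-in for the Hungarian algorithm
    that is only practical for very small matrices.
    """
    m, n = len(matrix), len(matrix[0])
    return min(
        sum(matrix[row][col] for col, row in enumerate(rows))
        for rows in permutations(range(m), n)
    )


def pad_to_square(matrix, fill=1.0):
    """Append columns of `fill` until the m x n matrix (m > n) is m x m."""
    m, n = len(matrix), len(matrix[0])
    return [list(row) + [fill] * (m - n) for row in matrix]


def set_distance(matrix):
    """Distance between two element sets: Hungarian(M') / m via the shortcut."""
    m, n = len(matrix), len(matrix[0])
    return (min_assignment_cost(matrix) + m - n) / m


# Hypothetical pairwise distances in [0, 1]: 3 elements versus 2 elements.
M = [
    [0.0, 0.5],
    [0.5, 0.25],
    [1.0, 1.0],
]

direct = min_assignment_cost(pad_to_square(M))
shortcut = min_assignment_cost(M) + len(M) - len(M[0])
print(direct, shortcut)   # 1.25 1.25
print(set_distance(M))
```

Both routes give the same cost, but the shortcut only has to search the smaller m × n matrix. In practice a polynomial-time Hungarian implementation would replace the brute-force search.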

In this matrix we can apply the Hungarian method for the distance measurement, which can then be used in the XML method. Here a to j are the patients, and the numbers are the details of the individuals entered in the matrix. The Hungarian method can be used effectively for finding the distance between two sets of data or two points in the table.

Other Anonymization Techniques: (n,t)-closeness allows us to take advantage of anonymization techniques other than the generalization and suppression used by k-anonymity and other privacy preservation techniques. (Samarati describes an algorithm for finding a single minimal k-anonymous full-domain generalization, based on a specific definition of minimality.) For example, instead of suppressing a whole record, one can hide some sensitive attributes of the record; one advantage is that the number of records in the anonymized table is accurate, which may be useful in some applications. Because this technique does not affect quasi-identifiers, it does not help achieve k-anonymity and hence has not been considered before. Removing a sensitive value in a group reduces diversity, and therefore it does not help in achieving l-diversity. However, in t-closeness, removing an outlier may smooth a distribution and bring it closer to the overall distribution. Another possible technique is to generalize a sensitive attribute value, rather than hiding it completely. An interesting question is how to effectively combine these techniques with generalization and suppression to achieve better data quality.

For finding the distance between data sets or attributes we could use the Euclidean distance or the Earth Mover's Distance, but in this paper we have introduced XML into privacy preservation, specifically in the anonymization process. In this approach, the data is first anonymized using k-anonymity; closeness is then identified between the attributes, and that distance is measured using the XML distance method, which in turn uses the Hungarian algorithm.

6. CONCLUSION

In this paper we have explained the privacy preservation technique k-anonymity and used a distance measure technique called XML distance, which can be used to identify the closeness between the attributes. If the closeness is greater, we can provide better privacy for individuals' sensitive details.

ACKNOWLEDGEMENT

We would like to express our gratitude to all those who made it possible for us to carry out this paper. We would like to thank Mr. K. Satyanarayana, Chancellor of K. L. University, Dr. K. Raja Sekhara Rao, Dean, and K. L. University for stimulating suggestions and encouragement. We furthermore thank Prof. S. Venkateswarlu, Dr. K. Subrahmanyam, and Dr. G. Rama Krishna, who encouraged us to go ahead with this paper.

REFERENCES

[1] C. Aggarwal, "On k-Anonymity and the Curse of Dimensionality," Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pp. 901-909, 2005.
[2] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," Proc. Int'l Conf. Data Engineering (ICDE), pp. 24, 2006.
[3] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," Proc. Int'l Conf. Data Engineering (ICDE), pp. 106-115, 2007.
[4] J. Long, D. G. Schwartz, and S. Stoecklin, "Improving the Effectiveness of Snort by Clustering Patterns of Alerts," ACM Transactions on Information and System Security, in review.
[5] C. Bettini, X. S. Wang, and S. Jajodia, "Protecting Privacy Against Location-Based Personal Identification," Proc. of the Secure Data Management, Trondheim, Norway, 2005.
[6] J. Long, D. G. Schwartz, and S. Stoecklin, "Improving the Effectiveness of Snort by Clustering Patterns of Alerts," ACM Transactions on Information and System Security, in review.
[7] N. Li, T. Li, and S. Venkatasubramanian, "Closeness: A New Privacy Measure for Data Publishing," IEEE Transactions on Knowledge and Data Engineering, pp. 943, 2010.
[8] E. Bertino, G. Guerrini, and M. Mesiti, "A Matching Algorithm for Measuring the Structural Similarity Between an XML Document and a DTD and Its Applications," Information Systems, 19(1): 23-46, 2004.
[9] J. Long, D. G. Schwartz, and S. Stoecklin, "An XML Distance Measure," Department of Computer Science, Florida State University, Tallahassee, FL, U.S.A.
[10] P. Samarati, "Protecting Respondent's Privacy in Microdata Release," IEEE Trans. on Knowledge and Data Engineering (TKDE), vol. 13, no. 6, pp. 1010-1027, 2001.
[11] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertain. Fuzz., vol. 10, no. 5, pp. 557-570, 2002.
[12] T. Li, N. Li, and J. Zhang, "Modeling and Integrating Background Knowledge in Data Anonymization," to appear in Proc. Int'l Conf. Data Engineering (ICDE), 2009.
[13] P. Samarati, "Protecting Respondents' Identities in Microdata Release," IEEE Transactions on Knowledge and Data Engineering, 13(6), November/December 2001.
[14] P. Samarati and L. Sweeney, "Protecting Privacy When Disclosing Information: k-Anonymity and Its Enforcement Through Generalization and Suppression," Technical Report SRI-CSL-98-04, SRI Computer Science Laboratory, 1998.
[15] L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression," International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5):571-588, 2002.

BIOGRAPHIES

K. Priyanka received her B.Tech. degree in Information Technology from KLCE, Acharya Nagarjuna University, in 2009. She is currently pursuing an M.Tech in the Department of Computer Science at KL University. Her interests include data privacy and data mining.

U. Arundhathi received her B.Tech. degree in Computer Science and Engineering from VelTech Multi Tech Engineering College, Anna University, in 2010. She is currently pursuing an M.Tech in the Department of Computer Science at KL University.

D. Grace Priscilla received her B.Tech. degree in Computer Science and Engineering from Narasaraopet Engineering College, JNTU University, in 2010. She is currently pursuing an M.Tech in the Department of Computer Science at KL University.
