Duplicate Detection of Records in Queries using Clustering


M.Anitha, A.Srinivas, T.P.Shekhar, D.Sagar

• Calculate the similarity of the documents matched in main concepts (Xmc) and the similarity of the documents matched in detailed descriptions (Xdd).
• Evaluate Xmc and Xdd against the rules to derive the corresponding memberships.
• Compare the memberships and select the minimum membership from these two sets to represent the membership of the corresponding concept (high similarity, medium similarity, or low similarity) for each rule.
• Collect memberships that represent the same concept into one set.
• Derive the maximum membership for each set, and compute the final inference result.
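The inference steps above can be sketched as follows. This is a minimal illustration only: the triangular membership functions and the rule table are assumptions for demonstration, not the paper's actual fuzzy sets or rules.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b (illustrative shape)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed fuzzy sets over similarity scores in [0, 1].
MEMBERSHIP = {
    "low":    lambda x: tri(x, -0.5, 0.0, 0.5),
    "medium": lambda x: tri(x, 0.0, 0.5, 1.0),
    "high":   lambda x: tri(x, 0.5, 1.0, 1.5),
}

# Assumed rule table: (main-concept set, detailed-description set) -> concept.
RULES = [
    ("high",   "high",   "high similarity"),
    ("high",   "medium", "medium similarity"),
    ("medium", "medium", "medium similarity"),
    ("low",    "low",    "low similarity"),
]

def infer(x_mc, x_dd):
    """Run the min-max fuzzy inference described in the bullet list."""
    fired = {}
    for mc_set, dd_set, concept in RULES:
        # Per rule: take the minimum of the two memberships (Xmc, Xdd).
        strength = min(MEMBERSHIP[mc_set](x_mc), MEMBERSHIP[dd_set](x_dd))
        # Collect by concept, keeping the maximum membership per concept.
        fired[concept] = max(fired.get(concept, 0.0), strength)
    # Final inference result: the concept with the highest membership.
    return max(fired, key=fired.get), fired
```

For instance, two documents that are highly similar in both main concepts and detailed descriptions (e.g. `infer(0.9, 0.9)`) fire the "high similarity" rule most strongly.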

Data cleansing is a complex and challenging problem. This rule-based strategy helps to manage that complexity, but does not remove it. The approach can be applied to any subject-oriented data warehouse in any domain.

C. Evaluation Metric

The overall performance can be measured using precision and recall, where

    Precision = Number of correctly identified duplicate pairs / Number of all identified duplicate pairs

    Recall = Number of correctly identified duplicate pairs / Number of true duplicate pairs

The classification quality is evaluated using the F-measure, which is the harmonic mean of precision and recall:

    F-measure = 2 × (Precision × Recall) / (Precision + Recall)
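The three metrics above can be computed directly from pair counts. The counts in the usage line are hypothetical, purely for illustration:

```python
def evaluate(correct_pairs, identified_pairs, true_pairs):
    """Compute precision, recall, and F-measure from duplicate-pair counts."""
    precision = correct_pairs / identified_pairs   # correct / all identified
    recall = correct_pairs / true_pairs            # correct / all true duplicates
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

# Hypothetical run: 80 of 100 flagged pairs are correct; 120 true duplicates exist.
p, r, f = evaluate(80, 100, 120)
```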

IV. CONCLUSION

Deduplication and data linkage are important pre-processing tasks in many data mining projects. It is important to improve data quality before data is loaded into the data warehouse. Locating approximate duplicates in a large data warehouse is an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data, improving data quality, and to support any subject-oriented data.

In this research work, an efficient duplicate detection and elimination approach is developed to obtain good duplicate detection results by reducing false positives. The performance of this research work shows that time is saved significantly and that duplicate detection results are improved over the existing approach. The framework is developed mainly to increase the speed of the duplicate detection and elimination process and to improve data quality by identifying true duplicates while being strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule.
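One way a per-rule certainty factor can keep weak matches from producing false positives is sketched below. The MYCIN-style combination rule and the 0.8 threshold are assumptions for illustration, not the paper's exact formulation:

```python
def combine_cf(cf_a, cf_b):
    """Combine two positive certainty factors (MYCIN-style accumulation)."""
    return cf_a + cf_b * (1 - cf_a)

def is_duplicate(rule_certainties, threshold=0.8):
    """Decide whether two records are duplicates.

    rule_certainties: certainty factors of the matching rules that fired.
    Records are declared duplicates only if the combined certainty clears
    the threshold, so a single weak rule match is rejected.
    """
    cf = 0.0
    for c in rule_certainties:
        cf = combine_cf(cf, c)
    return cf >= threshold, cf
```

Under this scheme, two moderately confident rules (e.g. certainty factors 0.6 and 0.7) reinforce each other and clear the threshold, while a lone 0.5-certainty match is rejected.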

