C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join

Abstract: Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation identifies the pairs (p, q) in the Cartesian product of P and Q that satisfy a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution, reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large, high-dimensional datasets. However, network cost estimation frequently becomes a bottleneck in a distributed processing environment, which makes a faster and more efficient similarity join challenging to achieve. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) a minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for runtime scheduling of collision counting tasks. Experiments have shown that, in comparison to

