Linking Heterogeneous Data in the Semantic Web Using Scalable and Domain-Independent Candidate Selection
Abstract: Due to the decentralized nature of the Semantic Web, the same real-world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. In order to facilitate data utilization and consumption in the Semantic Web without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Matching, i.e., finding which identifiers refer to the same real-world entity. In this paper, we propose two candidate selection algorithms to improve the scalability of entity matching systems. First, we propose HistSim, which utilizes the matching histories of the instances to prune instance pairs that are not sufficiently similar to the same pool of other instances. A sigmoid-function-based thresholding method is proposed to automatically adjust the threshold for such commonality on-the-fly. Furthermore, we propose DisNGram, which selects candidate instance pairs by computing a character-level similarity metric on discriminating literal values that are chosen using domain-independent unsupervised learning. Instances are indexed on the chosen predicates' literal values to enable efficient look-up for similar instances. Finally, in order to handle heterogeneous datasets with a large number of predicates, a mechanism for automatically determining predicate comparability is proposed.
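To make the idea of index-based candidate selection concrete, the following is a minimal Python sketch, not the DisNGram algorithm itself. It assumes that one discriminating literal value per instance has already been chosen, uses character bigrams, and scores pairs with Jaccard similarity; the abstract specifies none of these choices. The names char_ngrams, build_index, and select_candidates and the parameters n and min_sim are hypothetical.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased, stripped string."""
    s = text.lower().strip()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_index(instances, n=2):
    """Build an inverted index: n-gram -> set of instance ids containing it.

    `instances` maps an instance id to its (assumed pre-selected) literal value.
    """
    index = defaultdict(set)
    for inst_id, literal in instances.items():
        for gram in char_ngrams(literal, n):
            index[gram].add(inst_id)
    return index


def select_candidates(query_literal, index, instances, n=2, min_sim=0.5):
    """Return (instance id, similarity) pairs whose Jaccard similarity to the
    query literal is at least min_sim (an assumed, fixed threshold)."""
    query_grams = char_ngrams(query_literal, n)
    # Only instances sharing at least one n-gram with the query are examined,
    # which avoids comparing the query against every instance.
    seen = set()
    for gram in query_grams:
        seen |= index.get(gram, set())
    result = []
    for inst_id in seen:
        grams = char_ngrams(instances[inst_id], n)
        union = query_grams | grams
        sim = len(query_grams & grams) / len(union) if union else 0.0
        if sim >= min_sim:
            result.append((inst_id, sim))
    # Highest similarity first; ties broken by instance id for determinism.
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))


instances = {"a": "John Smith", "b": "Jon Smith", "c": "Mary Jones"}
index = build_index(instances)
candidates = select_candidates("John Smith", index, instances)
```

In this toy example, instances "a" and "b" are kept as candidates for the query "John Smith", while "c" shares only a single bigram with it and falls below the threshold. The inverted index is what lets the look-up skip instances that share no n-gram with the query.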