Link Prediction Evaluation
Di Matteo Valerio 1379412 dimatteovalerio@hotmail.it
ABSTRACT Given a representation of a social network, can we predict its evolution in the (more or less) near future? What is the chance that nodes that are “similar” will become linked to each other? This problem has been well studied and is known as “link prediction”. Similarity between nodes can be evaluated in many different ways. In this paper we focus on the similarity between their “neighborhoods”, i.e. the sets of nodes that they are linked to, and we measure it with the Jaccard coefficient.
Keywords Social Networks, Link Prediction, Jaccard, Similarity.
1. INTRODUCTION In the recent past the study of social networks has received more and more attention. The nodes of a network can be thought of as people, societies, companies or any other social entity, with the links representing any kind of relation that can occur among the nodes: friendships, collaborations, deals, etc. We know many social network models (think of Erdős and Rényi, Watts and Strogatz, and many others) that we can use to represent a realistic social network, and by looking at its evolution in time we are able to make predictions about the evolution of the true social environment that we are trying to represent. The Link Prediction problem addresses exactly this. It has been well explained by Liben-Nowell and Kleinberg in [1]. In a few words, given a snapshot of a network, we want to use the information that it gives us to predict the new links that will emerge in the future. How can we do so? There are many techniques, but the main idea is to look at the neighbors of two nodes and output a number that represents how similar the two nodes are. If this number is “sufficiently” large, then we predict that a link between the two nodes will appear.
1.2 Similarities There are many possible similarities that we can compute on two nodes. In particular:
Graph distance
Common neighbors
Jaccard’s coefficient
Adamic / Adar
Preferential attachment
Katz
Hitting time
Commute time
Rooted PageRank
SimRank
All of them are explained in detail in [1]. For our purposes, we used the Jaccard coefficient. This coefficient is a number between 0 and 1: the number of neighbors that two nodes have in common, divided by the total number of distinct neighbors they have.
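The Jaccard coefficient described above can be written in a few lines. This is only a minimal sketch, not the code used for the experiments: the function name `jaccard`, the toy adjacency dictionary, and the choice to return 0 when both nodes have no neighbors are assumptions made for illustration.

```python
def jaccard(neighbors_u, neighbors_v):
    """Jaccard coefficient of two neighbor sets: |intersection| / |union|."""
    union = neighbors_u | neighbors_v
    if not union:
        # Assumption: two nodes with no neighbors at all get score 0.
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


# Hypothetical toy adjacency: node -> set of neighbors.
adjacency = {
    "a": {"c", "d", "e"},
    "b": {"d", "e", "f"},
}
score = jaccard(adjacency["a"], adjacency["b"])  # 2 shared / 4 distinct = 0.5
```

In this toy example nodes a and b share two neighbors (d and e) out of four distinct ones, so their score is 0.5.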
1.1 Link Prediction problem First of all, let’s formalize the problem. We have a social network G=(V,E), where each edge e=(u,v) represents an interaction between nodes u and v. Given a training graph G’=(V,E’) containing a subset of the edges in E and a test graph G’’=(V,E’’) containing the remaining edges of E, we want to use G’ to predict G’’. To evaluate the result, we can use either precision P (the fraction of predicted links that actually appear in G’’) or recall R (the fraction of links in G’’ that we predicted).
So, essentially, what we want to do is compute similarity scores for pairs of nodes in the training graph, predict a link whenever the score is large enough, and then use the actual test graph as a benchmark to evaluate the performance of our predictions.
The higher the score, the more similar we consider the two nodes.
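The procedure just outlined (score pairs on the training graph, predict above a cutoff, compare with the test graph) might be sketched as follows. The function names, the explicit `threshold` parameter, and the choice to score only pairs that are not already linked in the training graph are assumptions made for this illustration; the text only says that the score must be “sufficiently” large.

```python
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def predict_links(train_edges, threshold):
    """Predict a link for every unlinked pair whose Jaccard score >= threshold."""
    adj = build_adjacency(train_edges)
    predicted = set()
    for u, v in combinations(sorted(adj), 2):  # pairs (u, v) with u < v
        if v in adj[u]:
            continue  # already linked in the training graph
        union = adj[u] | adj[v]
        score = len(adj[u] & adj[v]) / len(union) if union else 0.0
        if score >= threshold:
            predicted.add((u, v))
    return predicted


def precision_recall(predicted, test_edges):
    """Precision and recall of the predicted pairs against the test edges."""
    truth = {tuple(sorted(e)) for e in test_edges}
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data: a 4-cycle as training graph, one diagonal as test graph.
train = [(1, 2), (1, 3), (2, 4), (3, 4)]
test = [(1, 4)]
predicted = predict_links(train, threshold=0.5)
p, r = precision_recall(predicted, test)
```

On this toy input both diagonals, (1, 4) and (2, 3), have identical neighborhoods and are predicted, while only (1, 4) is in the test graph, so precision is 0.5 and recall is 1.0. Enumerating all pairs is quadratic in the number of nodes, which is acceptable for a sketch but may call for a smarter candidate selection on larger graphs.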
2. EXPERIMENTAL SETUP 2.1 Data Set The network we used comes from the ca-GrQc.txt file available at https://snap.stanford.edu/data/index.html. It is a collaboration network where the nodes (5,242) are scientists and the edges (14,496) are co-authorships. In particular, it is an undirected network, so each edge appears twice in the text file, once per direction, with links represented as source id – destination id.
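Reading such a file into a set of undirected edges could look like the sketch below. It assumes that each data line holds two whitespace-separated integer ids and that header or comment lines start with `#`; the function name `read_undirected_edges` is hypothetical.

```python
def read_undirected_edges(lines):
    """Parse 'source destination' lines into a set of undirected edges.

    Each edge is stored once as (smaller id, larger id), so the two
    directed copies of the same link collapse into a single entry.
    """
    edges = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks header/comment lines
        src, dst = line.split()[:2]
        u, v = int(src), int(dst)
        edges.add((min(u, v), max(u, v)))
    return edges


# Hypothetical sample in the assumed format.
sample = ["# FromNodeId ToNodeId", "1\t2", "2\t1", "3\t1"]
edges = read_undirected_edges(sample)
```

With a real file, the same function can be given an open file handle, e.g. `with open("ca-GrQc.txt") as f: edges = read_undirected_edges(f)`.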
2.2 Computing the sets To get our Training Set and Test Set, we first filtered the nodes, excluding those with fewer than k=6 collaborations. By doing this, we obtained a filtered set of 10,614 edges (what Liben-Nowell and Kleinberg call the Core). Then, we assigned each of those edges to either the Training Set or the Test Set with 50% probability.
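One possible reading of this step is sketched below. Several details are assumptions, since the text does not fix them: a node's number of collaborations is taken to be its degree in the full graph, an edge is kept only if both endpoints pass the filter, the filter is applied once rather than iteratively, and a fixed `seed` is used only to make the split reproducible.

```python
import random


def core_edges(edges, k=6):
    """Keep the edges whose endpoints both have at least k incident edges."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    keep = {node for node, d in degree.items() if d >= k}
    return [(u, v) for u, v in edges if u in keep and v in keep]


def split_edges(edges, p_train=0.5, seed=0):
    """Put each edge in the training set with probability p_train, else in the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for edge in edges:
        (train if rng.random() < p_train else test).append(edge)
    return train, test


# Hypothetical toy graph, filtered with k=2 instead of k=6 to keep it small.
toy = [(1, 2), (2, 3), (1, 3), (3, 4)]
core = core_edges(toy, k=2)          # node 4 has degree 1, so (3, 4) is dropped
train, test = split_edges(core)
```

Because every edge is assigned independently, the two sets are only roughly equal in size; a different seed gives a different split.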