Triangle Counting and Vertex Similarity by Charalampos Tsourakakis

Charalampos (Babis) E. Tsourakakis ctsourak@math.cmu.edu

Canadian Mathematical Society 12th December ‘11 CMS '11

Mihail N. Kolountzakis (Math, University of Crete), Gary L. Miller (SCS, CMU)


Rasmus Pagh SCS, University Copenhagen

PART I: Triangle counting
•  Motivation and Related Work
•  Algorithms, Results and Discussion

PART II: Vertex Similarity
•  Motivation and Related Work
•  Our Approach, a few Results and Discussion


Friends of friends tend to become friends themselves!

[Wasserman Faust ’94]

(left to right) Paul Erdős, Ronald Graham, Fan Chung Graham

http://fellows-exp.com/

[Friggeri et al., 2011]

721 million users 69 billion links


Subjective ratings given to communities by real persons show that triangles are the key quantity that determines the rating.

Uncovering the Hidden Thematic Structure of the Web [Eckmann-Moses, PNAS 2001]

Key Idea: Connected regions of high curvature (i.e., dense in triangles) indicate a common topic!

Triangles used for Web Spam Detection [Becchetti et al. KDD ’08]

Key Idea: The triangle distribution among spam hosts is significantly different from that among non-spam hosts!


Triangles used for assessing content quality in Social Networks

Welser, Gleave, Fisher, Smith, Journal of Social Structure 2007

Key Claim: The number of triangles in the self-centered social network of a user is a good indicator of the role of that user in the community!



[Watts, Strogatz ’98]




Signed triangles appear in structural balance theory



Triangle closing models are also used to model the microscopic evolution of social networks [Leskovec et al., KDD ’08]

Numerous other applications, including:
•  Motif Detection / Frequent Subgraph Mining
•  Community Detection [Berry et al. ’09]
•  Outlier Detection and Link Recommendation
and many more.

Fast triangle counting algorithms are necessary.

Alon, Yuster, Zwick: asymptotically the fastest algorithm, but not practical for large graphs.

In practice, one of the iterator algorithms is preferred:
•  Node Iterator (count the edges among the neighbors of each vertex)
•  Edge Iterator (count the common neighbors of the endpoints of each edge)
Both run asymptotically in O(mn) time.
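The two iterator algorithms can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the adjacency-map representation are assumptions made here, and vertices are assumed to be mutually comparable (e.g., integers) so that each undirected edge can be visited once.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_iterator(adj):
    """Count the edges among the neighbors of each vertex.

    Every triangle is found once from each of its three vertices.
    """
    total = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                total += 1
    return total // 3


def edge_iterator(adj):
    """Count the common neighbors of the endpoints of each edge.

    Every triangle is found once from each of its three edges.
    """
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge exactly once
                total += len(adj[u] & adj[v])
    return total // 3
```

On the complete graph with 4 vertices, both functions return 4.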

Remarks
•  The idea of partitioning the vertices into “large” and “small” degree and treating them appropriately appears in Alon, Yuster, Zwick.
•  For more work, see references in our paper(s):
   ▪  Itai, Rodeh (STOC ‘77)
   ▪  Papadimitriou, Yannakakis (IPL ‘81)
   ……




r independent samples of three distinct vertices

Then the following holds:

with probability at least 1-δ

Works for dense graphs, e.g., T3 on the order of n^2 log n.
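The sampling idea on this slide can be sketched as below. The slide's exact bound on r did not survive extraction, so this sketch only shows the estimator itself: draw r independent triples of distinct vertices, measure the fraction that form triangles, and rescale by the number of triples. The function name and the adjacency-map input (which must list every vertex, including isolated ones) are assumptions for illustration.

```python
import random
from math import comb


def triple_sampling_estimate(adj, r, seed=0):
    """Estimate the triangle count from r random triples of distinct vertices.

    adj maps every vertex (isolated ones included) to its set of neighbors.
    """
    rng = random.Random(seed)
    vertices = list(adj)
    n = len(vertices)
    if n < 3 or r <= 0:
        return 0.0
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(vertices, 3)  # three distinct vertices
        if b in adj[a] and c in adj[a] and c in adj[b]:
            hits += 1
    # Fraction of sampled triples that are triangles, times the number of triples.
    return hits / r * comb(n, 3)
```

On sparse graphs almost every sampled triple misses, which is why the approach needs many triangles to be useful.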

•  (Bar-Yossef, Kumar, Sivakumar ‘02) require n^2/polylog n edges
•  More follow-up work:
   ▪  (Jowhari, Ghodsi ‘05)
   ▪  (Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler ‘06)
   ▪  (Becchetti, Boldi, Castillo, Gionis ‘08)


Approximate a given graph G with a sparse graph H, such that H is close to G in a certain sense.

Examples:
•  Cut-preserving sparsifier: Benczúr-Karger
•  Spectral sparsifier: Spielman-Teng

Modern Data Mining Algorithms

•  t: number of triangles.
•  T: number of triangles in the sparsified graph, essentially our estimate.
•  Δ: maximum number of triangles an edge is contained in. Δ = O(n)
•  tmax: maximum number of triangles a vertex is contained in. tmax = O(n^2)
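These quantities suggest a simple estimator: keep each edge independently with probability p, count the triangles T in the sparsified graph, and rescale. The sketch below assumes the rescaling factor 1/p^3 (a triangle survives only if all three of its edges do) and a simple edge list with no duplicates or self-loops; the function names are illustrative and not the authors' code.

```python
import random


def count_triangles(edges):
    """Exact triangle count for a simple undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is counted once per edge, i.e. three times in total.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def sparsified_estimate(edges, p, seed=0):
    """Keep each edge with probability p, count triangles, rescale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    T = count_triangles(kept)  # triangles in the sparsified graph
    return T / p**3
```

With p = 1 every edge is kept and the estimate equals the exact count; smaller p trades accuracy for speed.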


How to choose p?

Mildness, pick p=1

Concentration



Kim










Among all graphs G with n vertices and m edges, which one maximizes the number of edges in the line graph L(G)?
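To experiment with this question, note that two edges of G are adjacent in L(G) exactly when they share an endpoint, so for a simple graph the number of edges of L(G) is the sum over vertices of C(d_v, 2). The helper below is a hypothetical illustration of that identity; it does not answer the extremal question.

```python
from collections import Counter
from math import comb


def line_graph_edge_count(edges):
    """Number of edges in L(G) for a simple graph G given as an edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each pair of edges meeting at a vertex contributes one edge of L(G).
    return sum(comb(d, 2) for d in degree.values())
```

For n = 4 and m = 3, the star gives 3 line-graph edges while the path gives 2.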




Orkut (3.1M, 117M)
LiveJournal (5.4M, 48M)
YouTube (1.2M, 3M)
Flickr (1.9M, 15.6M)
Web-EDU (9.9M, 46.3M)

Social networks abundant in triangles!


[Figure: running time in secs (axis 0 to 250) of the Exact, Triple Sampling, and Hybrid algorithms on Orkut, Flickr, Livejournal, Wiki-2006, and Wiki-2007]

    



•  p was set to 0.1. More sophisticated techniques for setting p exist, using a doubling procedure.
•  Sampling from a binomial can be done easily in (expected) sublinear time.
•  Our code, even our exact algorithm, outperforms the fastest approximate counting competitors' code, hence we compared different versions of our code!
•  To the best of our knowledge, used in Twitter.
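One standard way to get expected sublinear sampling time is geometric skipping: rather than flipping a coin for each of the m edges, draw the gap to the next kept edge from a geometric distribution, so the work is proportional to the number of kept edges. The slides do not say which method was used, so treat this as an assumed stand-in; the function name is hypothetical.

```python
import math
import random


def sample_kept_indices(m, p, seed=0):
    """Return the sorted indices in range(m) kept by independent coin flips
    with success probability p, generated by jumping over rejected indices."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return list(range(m))
    rng = random.Random(seed)
    log_q = math.log1p(-p)  # log(1 - p), strictly negative
    kept = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1]
        skip = int(math.log(u) / log_q)  # number of rejections before the next keep
        i += skip + 1
        if i >= m:
            return kept
        kept.append(i)
```

The expected number of loop iterations is about p·m + 1, compared with m for per-edge coin flips.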


Remove any weighted edge, w sufficiently large.


Remove edge (1,2)


Let N=1/p be the number of colors we use to color the vertices. Call an edge monochromatic if its endpoints receive the same color.
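A minimal sketch of the coloring scheme: color every vertex uniformly at random with one of N colors, keep only the monochromatic edges, count the triangles that remain, and rescale. The rescaling by 1/p^2 is inferred from the fact, stated later in the slides, that a fixed triangle is monochromatic with probability p^2; the function name and the simple edge-list input are assumptions.

```python
import random


def colorful_triangle_estimate(edges, num_colors, seed=0):
    """Estimate the triangle count from the monochromatic subgraph."""
    rng = random.Random(seed)
    color = {}
    for u, v in edges:
        for x in (u, v):
            if x not in color:
                color[x] = rng.randrange(num_colors)
    adj = {}
    mono = []
    for u, v in edges:
        if color[u] == color[v]:  # monochromatic edge
            mono.append((u, v))
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Each surviving triangle is counted once per edge.
    mono_triangles = sum(len(adj[u] & adj[v]) for u, v in mono) // 3
    p = 1.0 / num_colors
    return mono_triangles / p**2
```

With a single color (N = 1, p = 1) every edge is monochromatic and the estimate is exact.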











From these extreme cases we see that, if we want to hope for concentration, p has to be at least ω(n)t/Δ and ω(n)t^(-1/2), respectively.



We pick p large enough to make Var(T) = o(E[T]^2).
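The reason this variance condition suffices is Chebyshev's inequality. A worked statement, assuming a fixed accuracy parameter ε > 0 (the symbol ε is introduced here for illustration):

```latex
\Pr\bigl(\,|T - \mathbb{E}[T]| \ge \epsilon\,\mathbb{E}[T]\,\bigr)
  \;\le\; \frac{\operatorname{Var}(T)}{\epsilon^{2}\,\mathbb{E}[T]^{2}}
  \;=\; o(1)
```

So T lies within a (1 ± ε) factor of its expectation with probability tending to 1.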




Every graph on n vertices with maximum degree Δ(G) = k is (k+1)-colorable with all color classes differing in size by at most 1.

[Figure: color classes labeled up to k+1]

•  Create an auxiliary graph where each triangle is a vertex and two vertices are connected iff the corresponding triangles share a vertex.
•  Invoke the Hajnal-Szemerédi theorem and apply a Chernoff bound to each color class. Finally, take a union bound. Q.E.D.

Pr(Xi = 1 | rest are monochromatic) = p ≠ Pr(Xi = 1) = p^2
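This dependence can be checked exactly on a tiny case. The sketch below assumes one reading of the slide: two triangles that share an edge, on vertices {0,1,2} and {1,2,3}. It enumerates every coloring with N = 1/p colors and compares the unconditional probability that the first triangle is monochromatic with the probability conditioned on the second being monochromatic. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def monochromatic_probabilities(num_colors):
    """Return (Pr[A], Pr[A | B]) where A and B are the events that triangles
    (0, 1, 2) and (1, 2, 3) are monochromatic under a uniform random coloring."""
    tri_a, tri_b = (0, 1, 2), (1, 2, 3)
    total = count_a = count_b = count_both = 0
    for coloring in product(range(num_colors), repeat=4):
        a = len({coloring[v] for v in tri_a}) == 1
        b = len({coloring[v] for v in tri_b}) == 1
        total += 1
        count_a += a
        count_b += b
        count_both += a and b
    return Fraction(count_a, total), Fraction(count_both, count_b)
```

With 3 colors (p = 1/3) the result is (1/9, 1/3), that is, p^2 unconditionally and p after conditioning.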


•  We can adapt our proposed method to the semi-streaming model with space usage

   so that it performs only 3 passes over the data.
•  MapReduce implementations.


PART I: Triangle counting
•  Motivation and Related Work
•  Algorithms, Results and Discussion

PART II: Vertex Similarity
•  Motivation and Related Work
•  Our Approach, a few Results and Discussion


Privacy attacks [Hay et al., VLDB]

Vertex Similarity & Link Recommendation

Viral Marketing

Since there can be pairs of vertices which are highly similar, recursive equations are better, e.g.:

S = φAS + ψI, where φ, ψ are given parameters.

Vertex similarity in networks, Leicht et al.
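The recursion S = φAS + ψI can be solved by simple fixed-point iteration, which converges when φ times the largest eigenvalue of A is below 1; in that case the limit equals ψ(I − φA)^(-1). The sketch below is a dependency-free illustration under that assumption, with a hypothetical function name and a dense list-of-lists adjacency matrix; it is not the method used in the cited work.

```python
def similarity_fixed_point(A, phi, psi, iterations=200):
    """Iterate S <- phi * A * S + psi * I starting from S = psi * I.

    A is a square adjacency matrix given as a list of lists.
    """
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = [[psi * identity[i][j] for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AS = [[sum(A[i][k] * S[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        S = [[phi * AS[i][j] + psi * identity[i][j] for j in range(n)]
             for i in range(n)]
    return S
```

On the path 0-1-2, vertices 0 and 2 receive a positive similarity even though they are not adjacent, because they share neighbor 1.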




We are interested in robust simplex fitting, since many real-world graph embeddings contain outliers.


“Hard” formulation

Robust formulation


Having a set of mixture coefficients for each vertex, we can use any nearest-neighbor structure to perform queries such as “find the k vertices most similar to v”. Several data structures exist, e.g., Mount & Arya.
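Such a query can be sketched with a brute-force scan, used here as a stand-in for a real nearest-neighbor structure. The sketch assumes that similarity is measured by Euclidean distance between mixture-coefficient vectors, which the slides do not specify; the function name and the dictionary input are hypothetical.

```python
import heapq
import math


def k_most_similar(coeffs, v, k):
    """Return the k vertices whose coefficient vectors are closest to that of v.

    coeffs maps each vertex to a tuple of mixture coefficients.
    """
    target = coeffs[v]
    candidates = ((math.dist(target, c), u) for u, c in coeffs.items() if u != v)
    return [u for _, u in heapq.nsmallest(k, candidates)]
```

Example usage: with `coeffs = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0), "d": (0.5, 0.5)}`, the call `k_most_similar(coeffs, "a", 2)` returns `["b", "d"]`. A spatial index would replace the linear scan for large graphs.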


Reconstructing more complex geometric structures, such as simplicial complexes?


THANK YOU!
