Sample project - genetic algorithm to cluster graphs


Course: Algorithms in Nature/02317 School of Computer Science / Machine Learning Department Carnegie Mellon University

Professor: Ziv Bar-Joseph Student: Fereshteh Shahmiri

A genetic algorithm to cluster graphs

Finding dense modules or clusters in a graph is an important part of many data mining problems. One popular definition of a 'module' is a set of nodes with many more within-module connections (i.e. connections between nodes in the same module) than between-module connections (i.e. connections between nodes in different modules), relative to what would be expected by chance. In 2002, Newman proposed an objective function, called modularity, that characterizes the quality of a clustering C of a graph G = (V, E):

\[ Q(C) = \sum_{u, v \in V} \left( A_{uv} - \frac{k_u k_v}{2m} \right) \left( 1 - x_{uv} \right) \]

where A_uv is 1 if u and v have an edge in E and 0 otherwise; k_u is the degree of node u (i.e. its number of neighbors); m is the total number of edges in the graph; and the variables x_uv describe C by indicating which nodes are in the same module. Specifically, for every pair of nodes, x_uv = 0 if u and v belong to the same module, and x_uv = 1 otherwise. Notice that a pair of nodes that lie in different modules contributes nothing to the modularity score, and that all terms (A_uv, k_u, k_v, m) are fixed except the x_uv terms. The goal is to find the clustering C that maximizes this function. In general, the clustering C can have any number of modules (from 1 to n, where n is the number of nodes in the graph), but every node must be assigned to exactly one module.

Write a genetic program to cluster an input graph into modules that optimizes the Newman objective function using at most 5 clusters.

Answer:

Here are some sample outputs for different numbers of generations and individuals, with running times in seconds.

10 generations:

How fast you want the result? Enter a number between 1 and 10, higher value return result faster: 2
#modularity=22.7820512821
Module 1: 11 14 15 17 18 19 22 23 24 26 27 28 29 30 31 32 33
Module 2: 0 1 2 3 4 5 6 7 8 9 10 12 13 16 20 21 25
Time: 1304.78400016

30 generations:

How fast you want the result? Enter a number between 1 and 10, higher value return result faster: 8
#modularity=30.2628205128
Module 1: 5 6 16 24 25 27 28 31
Module 2: 0 1 2 3 7 10 11 12 13 17 21
Module 3: 4 8 9 14 15 18 19 20 22 23 26 29 30 32 33
Time: 1024.56500006

How fast you want the result? Enter a number between 1 and 10, higher value return result faster: 2
#modularity=31.8269230769
Module 1: 0 1 2 3 4 5 6 7 10 12 13 16 17 19 21
Module 2: 8 9 11 14 15 18 20 22 23 24 25 26 27 28 29 30 31 32 33
Time: 6231.27499986

How fast you want the result? Enter a number between 1 and 10, higher value return result faster: 2
#modularity=31.8782051282
Module 1: 8 9 14 15 18 20 22 23 24 25 26 27 29 30 31 32 33
Module 2: 0 1 2 3 4 5 6 7 10 11 12 13 16 17 19 21 28

The best answer is the last one:

#modularity=31.8782051282
Module 1: 8 9 14 15 18 20 22 23 24 25 26 27 29 30 31 32 33
Module 2: 0 1 2 3 4 5 6 7 10 11 12 13 16 17 19 21 28
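To make the objective concrete, the following minimal sketch (Python 3) evaluates the modularity sum on a small hypothetical graph: two triangles joined by a single edge. The graph, the function name pair_modularity, and the three example clusterings are illustrative assumptions and not part of the assignment. The sum runs over unordered pairs u < v without any normalization, which is the convention the program's modularity function follows.

```python
from itertools import combinations


def pair_modularity(edges, modules):
    """Sum (A_uv - k_u*k_v/(2m)) * (1 - x_uv) over unordered pairs u < v."""
    n = len(modules)
    m = len(edges)
    edge_set = {frozenset(e) for e in edges}
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    q = 0.0
    for u, v in combinations(range(n), 2):
        a_uv = 1 if frozenset((u, v)) in edge_set else 0
        x_uv = 0 if modules[u] == modules[v] else 1  # 0 means "same module"
        q += (a_uv - degree[u] * degree[v] / (2 * m)) * (1 - x_uv)
    return q


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_modules = [0, 0, 0, 1, 1, 1]   # one module per triangle
one_module = [0, 0, 0, 0, 0, 0]    # everything in a single module
mixed = [0, 1, 0, 1, 0, 1]         # modules that cut across the triangles

print(pair_modularity(edges, two_modules))
print(pair_modularity(edges, one_module))
print(pair_modularity(edges, mixed))
```

On this toy graph the split into the two triangles scores 26/7 (about 3.71), the single module scores 17/14 (about 1.21), and the mixed assignment scores -2/7, so the objective prefers the clustering that keeps each triangle together.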



Coding Part:

import random
import time


def readFile(filename, mode="rt"):
    # "rt" stands for "read text"; the with-block closes the file on exit.
    with open(filename, mode) as fin:
        return fin.read()


def adjacency(contents):
    # The header line holds the node count (after the first "=", before the
    # comma) and the edge count (after the second "="). Every following line
    # is one tab-separated edge.
    line = contents.splitlines()
    num_lines = int(line[0].split("=")[2])
    num_nodes = int(line[0].split("=")[1].split(",")[0])
    adjc = []
    for row in range(num_lines):
        fields = line[row + 1].split("\t")
        adjc += [[int(fields[0]), int(fields[1])]]
    return (num_nodes, num_lines, adjc)


def is_adj(adj, u, v):
    # A_uv: 1 if the undirected edge (u, v) is in the edge list, else 0.
    for i in range(len(adj)):
        if ((adj[i][0] == u and adj[i][1] == v) or
                (adj[i][1] == u and adj[i][0] == v)):
            return 1
    return 0


def degree(adj, u):
    # k_u: number of edges incident to node u.
    count = 0
    for i in range(len(adj)):
        if adj[i][0] == u or adj[i][1] == u:
            count += 1
    return count


def is_cluster(u, v, modules):
    # x_uv: 0 if u and v are in the same module, 1 otherwise.
    if modules[u] == modules[v]:
        return 0
    else:
        return 1


def modularity(adj, num_nodes, num_edges, modules):
    # Sum (A_uv - k_u*k_v/(2m)) * (1 - x_uv) over all unordered pairs u < v.
    q = 0
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            q += ((is_adj(adj, i, j)
                   - float(degree(adj, i) * degree(adj, j)) / (2 * num_edges))
                  * (1 - is_cluster(i, j, modules)))
    return q


def randomarray(n, k):
    # Random individual: one module label in 0..k-1 for each of the n nodes.
    a = []
    for i in range(n):
        a += [random.randint(0, k - 1)]
    return a


def crossover(a1, a2, n):
    # Two-point crossover: swap the genes strictly between c1 and c2.
    c1 = random.randint(0, n - 1)
    c2 = random.randint(c1, n - 1)
    ax1 = []
    ax2 = []
    for i in range(n):
        if i > c1 and i < c2:
            ax1 += [a2[i]]
            ax2 += [a1[i]]
        else:
            ax1 += [a1[i]]
            ax2 += [a2[i]]
    return (ax1, ax2)


def mutation(a, n, k):
    # Reassign up to n // 10 randomly chosen positions to random modules.
    m = random.randint(0, n // 10)
    for i in range(m):
        index = random.randint(0, n - 1)
        a[index] = random.randint(0, k - 1)
    return a


def selection(adj, a, num_nodes, num_edges, num_tests):
    # Score every individual, then keep the fittest num_tests // 4 as parents.
    # Ranking (rather than comparing against a threshold value) gives a fixed
    # number of parents even when several individuals tie on modularity.
    mod = []
    for i in range(len(a)):
        mod += [modularity(adj, num_nodes, num_edges, a[i])]
    order = sorted(range(len(a)), key=lambda i: mod[i], reverse=True)
    b = [a[i] for i in order[:max(1, num_tests // 4)]]
    return (b, max(mod), a[mod.index(max(mod))])


# t1 = time.time()
filename = input("Enter File Name: ")
f = int(input("How fast you want the result? Enter a number between 1 and 10, higher value return result faster: "))
(num_nodes, num_edges, adj) = adjacency(readFile(filename))
num_tests = num_nodes * num_nodes // (f * 4)
# Baseline: all nodes in one module. Keeping it as the initial best answer
# ensures cluster is defined even if no generation improves on it.
cluster = randomarray(num_nodes, 1)
maxm = modularity(adj, num_nodes, num_edges, cluster)
# print(maxm)
for k in range(2, 4):  # tries k = 2 and k = 3 modules
    a = []
    # maxm = -1000
    for i in range(num_tests):
        a += [randomarray(num_nodes, k)]
    j = 0
    while j < 30:  # number of generations
        j += 1
        (b, mod, indiv) = selection(adj, a, num_nodes, num_edges, num_tests)
        if mod > maxm:
            maxm = mod
            cluster = indiv
        a = []
        for i in range(num_tests):
            # Draw two parents from the selected pool and add both children.
            i1 = random.randint(0, len(b) - 1)
            i2 = random.randint(0, len(b) - 1)
            (cr1, cr2) = crossover(b[i1], b[i2], num_nodes)
            a += [mutation(cr1, num_nodes, k)]
            a += [mutation(cr2, num_nodes, k)]
# print(maxm)
# print(cluster)
print("#modularity=" + str(maxm))
for i in range(max(cluster) + 1):
    print("Module " + str(i + 1) + ": ", end=" ")
    for j in range(len(cluster)):
        if cluster[j] == i:
            print(str(j), end=" ")
    print()
# print(time.time() - t1)
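The running times in the sample outputs are likely dominated by fitness evaluation: modularity calls is_adj and degree, each a scan of the whole edge list, for every pair of nodes and every individual. One possible speed-up, sketched here in Python 3, is to precompute the degrees and an edge set once per graph and then visit only same-module pairs. The names build_graph_index and fast_modularity are hypothetical and not part of the original program; the intent is that the result matches modularity(adj, num_nodes, num_edges, modules) up to floating-point rounding.

```python
def build_graph_index(adj, num_nodes):
    """Precompute node degrees and a set of undirected edges from an edge list."""
    degrees = [0] * num_nodes
    edge_set = set()
    for u, v in adj:
        degrees[u] += 1
        if v != u:
            degrees[v] += 1
        # Store each edge with its smaller endpoint first for O(1) lookup.
        edge_set.add((min(u, v), max(u, v)))
    return degrees, edge_set


def fast_modularity(degrees, edge_set, num_edges, modules):
    """Pairwise modularity sum, restricted to pairs that share a module."""
    two_m = 2.0 * num_edges
    # Group nodes by module label; pairs in different modules contribute 0.
    members = {}
    for node, label in enumerate(modules):
        members.setdefault(label, []).append(node)
    q = 0.0
    for nodes in members.values():
        for i in range(len(nodes)):
            u = nodes[i]
            for v in nodes[i + 1:]:  # nodes are in increasing order, so u < v
                a_uv = 1 if (u, v) in edge_set else 0
                q += a_uv - degrees[u] * degrees[v] / two_m
    return q


# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = [[0, 1], [0, 2], [1, 2], [3, 4], [3, 5], [4, 5], [2, 3]]
degrees, edge_set = build_graph_index(adj, 6)
print(fast_modularity(degrees, edge_set, len(adj), [0, 0, 0, 1, 1, 1]))
```

The index is built once after reading the file, so each fitness evaluation no longer rescans the edge list; the genetic operators themselves would stay unchanged.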