Extracting network of critics based on their selection and evaluation of products by Işık Barış Fidaner

CMPE 58B Project Report

Introduction

Metacritic is a website that aggregates reviews of movies, games, music, books and TV shows. Most of these reviews are available online on the websites of newspapers and magazines, but Metacritic collects them in a more uniform and structured way. Many professionals refer to the site as a metric of a product's quality. In addition to collecting and summarizing verbal content, Metacritic calculates the Metascore, the mean of the ratings given by different sources to the same product. This meta-rating is considered so informative that producers even use it to decide whether to work with a studio or an artist. In this project, I extracted the similarity relations among the critics (newspapers, magazines) based on which products they chose to review and how they evaluated them. The relations among all the critics in a specific domain (TV, music or cinema) were then used to extract a hierarchical classification of the publications in that domain. The result is a picture of the overall similarity groups of publications.

Intention

If a critic reviewed a product and gave it a high score, we can assume that there is a social connection between the audience of that publication and the audience of that movie, album, etc. Thus, if we can extract similarities and differences between these publications, this information might yield clues for understanding the cultural network of the audiences of these products. For example, if we can extract clusters among the publications that correspond to mainstream, alternative or other types of products (and their respective audiences), we can consider this regularity a reflection of a deeper structure, namely the cultural network of the population. We can then independently relate the ratings of these publications to fractions of the population that share the same culture.

Dataset

On Metacritic, every cultural product has a page that lists its reviews. Every review belongs to a certain critic who has given a score (out of 100), and a short quote from the review is provided. An example data sample is as follows:

Score: 38
Reviewer: Miami Herald
Quote: A loud, dumb movie, but its male, car-obsessed audience will probably enjoy it anyway.

Figure 1. An example data sample from the dataset extracted from metacritic.com

The correlation between review scores and words from review quotes was analyzed in our previous unpublished study [1]. For that study, a PHP-based web crawler was developed to extract a dataset that consists of:

• 8,223 reviews of 560 TV shows, by 51 distinct publications
• 62,293 reviews of 4,390 music albums, by 88 distinct publications
• 113,456 reviews of 6,125 movies, by 47 distinct publications

The same dataset is used in this project.

Method

We have three domains (TV, music and cinema), each of which contains thousands of cultural products in its pool. In every domain, we have a set of critics (reviewers) that have assigned scores (out of 100) to these products. Let R1 and R2 be two reviewers in the same domain. If P is the set of products that are reviewed by both of these reviewers, we define two criteria for the similarity of the two reviewers (publications, critics) R1 and R2:

1- Number of products reviewed by both R1 and R2. This value S1(R1, R2) is calculated as |P|, the element count of P:

S1(R1, R2) = |P|

2- S2(R1, R2) is based on score similarity, calculated by a triangle function summed over P:
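One triangle function consistent with the surrounding description is sketched here. The unit peak at d = 0 and the linear falloff are assumptions; the text itself only states that the contribution reaches zero at a difference of M and that the terms are summed over P. The symbols s_1(p) and s_2(p), the scores given to product p by R1 and R2, are introduced for illustration.

```latex
S_2(R_1, R_2) = \sum_{p \in P} \max\!\left(0,\; 1 - \frac{|d_p|}{M}\right),
\qquad d_p = s_1(p) - s_2(p)
```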

In this formula, d is the difference between the scores given by R1 and R2 to a product. The sum is calculated over every product in P. M is the maximum difference of two scores, a constant at which the similarity contribution of a product drops to zero. In this dataset, where scores are given out of 100, we assumed M = 30.

After we obtain the similarity matrix, we have to derive a clustering that shows the underlying structure of the network. We used the “Ward” hierarchical clustering method based on a Euclidean-like dissimilarity measure. This method is available in the data analysis software Pajek. We only present results for the second similarity criterion (score differences), which we believe is more suitable for calculating the similarity of publications.
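The two similarity criteria can be illustrated with a minimal Python sketch. It is not the PHP code used in the project: the `(reviewer, product, score)` tuple layout, the function names and the toy data are hypothetical, and the triangle kernel assumes a peak of 1 at d = 0 with a linear falloff to 0 at |d| = M.

```python
from collections import defaultdict
from itertools import combinations

M = 30  # score difference at which similarity reaches zero (value from the text)

def triangle(d, m=M):
    """Assumed triangle kernel: 1 at d = 0, falling linearly to 0 at |d| >= m."""
    return max(0.0, 1.0 - abs(d) / m)

def similarities(reviews, m=M):
    """Return {(r1, r2): (S1, S2)} for every reviewer pair with r1 < r2.

    reviews: iterable of (reviewer, product, score) tuples (hypothetical layout).
    """
    scores = defaultdict(dict)  # reviewer -> {product: score}
    for reviewer, product, score in reviews:
        scores[reviewer][product] = score

    result = {}
    for r1, r2 in combinations(sorted(scores), 2):
        common = scores[r1].keys() & scores[r2].keys()  # the set P
        s1 = len(common)                                # S1 = |P|
        s2 = sum(triangle(scores[r1][p] - scores[r2][p], m) for p in common)
        result[(r1, r2)] = (s1, s2)
    return result

# Toy data, invented for illustration only
reviews = [
    ("A", "film1", 80), ("B", "film1", 65),
    ("A", "film2", 40), ("B", "film2", 90),
    ("A", "film3", 70), ("C", "film3", 70),
]
sims = similarities(reviews)
```

In this toy example, reviewers A and B share two products but agree closely on only one of them, so S1 counts both while S2 credits only the closer pair of scores.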

Implementation

Two PHP files were written for the following purposes:

1- index1.php: Traverses the review table to extract S1(R1, R2), the number of commonly reviewed products, for every pair of reviewers.
2- index3.php: Traverses the review table to extract S2(R1, R2), the score similarity over commonly reviewed products, for every pair of reviewers.

Other files created through Pajek are described below:

1- source_relation#.net: Network of publications in a domain; edges are weighted by S1(R1, R2).
2- source_relation#_similarity.net: Network of publications in a domain; edges are weighted by S2(R1, R2).
3- source_relation#_similarity.hie: The hierarchy of publications extracted from the similarity network by using Ward hierarchical clustering with Euclidean-like dissimilarity.
4- dend#.png: The hierarchy of publications based on the number of common products.
5- dend#_sim.png: The hierarchy of publications based on the similarity network.
6- perm#.png: The similarity matrix of publications ordered by the hierarchical clustering.
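As an illustration of how a weighted publication network could be handed to Pajek, the sketch below writes pairwise weights in Pajek's basic .net layout (a `*Vertices` section followed by an `*Edges` section). The function name, the input dictionary and the example critics are hypothetical; the report's own files were produced by the PHP scripts and Pajek itself.

```python
def to_pajek_net(labels, weights):
    """Build the text of a Pajek .net file for an undirected weighted network.

    labels:  list of publication names (vertex i+1 gets labels[i])
    weights: {(name_a, name_b): weight}; non-positive weights are skipped
    """
    index = {name: i + 1 for i, name in enumerate(labels)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[name]} "{name}"' for name in labels]
    lines.append("*Edges")
    for (a, b), w in sorted(weights.items()):
        if w > 0:
            lines.append(f"{index[a]} {index[b]} {w:g}")
    return "\n".join(lines) + "\n"

# Invented example weights, e.g. S2 values for three critics
net = to_pajek_net(
    ["Critic A", "Critic B", "Critic C"],
    {
        ("Critic A", "Critic B"): 0.5,
        ("Critic A", "Critic C"): 1.0,
        ("Critic B", "Critic C"): 0.0,
    },
)
```

Writing `net` to a file such as `source_relation1_similarity.net` would then give Pajek one vertex per publication and one weighted edge per pair with non-zero similarity.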

Results

The results in three domains (TV show, album and movie critics) are given in three figures each:

1- Hierarchy dendrogram: An overall picture of the hierarchy of publications. Separations on the tree that are closer to the root denote greater differences. We select clusters from the upper branches of this tree.
2- Groups of publications: Clusters extracted from the hierarchy. Each group contains a set of publications that fall under the same branch in the hierarchy.
3- Similarity matrix: This matrix shows the distinctions between clusters. White regions denote greater similarity, and black regions denote dissimilarity among these groups of publications.

Figure 2. Hierarchical clustering of TV show critics.

San Jose Mercury News TV Guide Seattle Post-Intelligencer Detroit Free Press Orlando Sentinel Philadelphia Inquirer Newark Star-Ledger

1 Los Angeles Times The New York Times Hollywood Reporter Variety Boston Globe Pittsburgh Post-Gazette Chicago Tribune New York Daily News Entertainment Weekly

New York Post Newsday USA Today Chicago Sun-Times San Francisco Chronicle Washington Post

2 Amazon.com DVD Town TVShowsOnDVD.com digitallyOBSESSED IGN Paste Magazine Under The Radar Flak Magazine Houston Chronicle The New Republic

PopMatters Salon Time LA Weekly Wall Street Journal Kansas City Star Miami Herald New York Magazine Slate Baltimore Sun People Weekly

3 Arizona Republic Christian Science Monitor The New Yorker Slant Magazine Cleveland Plain Dealer Village Voice The Onion A.V. Club

Figure 3. The clusters of TV show critics extracted from the hierarchy.

Figure 4. The similarity matrix of TV show critics.

Figure 5. Hierarchical clustering of music album critics.

Armchair DJ L.A. Weekly Salon.com Checkout.com MTV.com HOB.com Spin Cycle Ink Blot Magazine Ink 19 Planet CultureDose.net Flak Magazine Logo Mixer Revolution Epilogue Music Shredding Paper Puncture Revolver Select Outburn Resonance Philadelphia Daily News

New York Magazine Trouser Press Drawer B Nude As The News MSN Consumer Guide RapReviews.com BBC collective Vibe CDNow Launch.com Sonicnet Wall of Sound

Delusions of Adequacy Filter Urb Dusted Magazine Tiny Mix Tapes NOW Magazine Slant Magazine Paste Magazine musicOMH.com Magnet ShakingThrough.net Lost At Sea Playlouder Splendid Austin Chronicle Village Voice The New York Times E! Online

2 Pitchfork PopMatters All Music Guide Mojo Q Magazine Uncut

Boston Globe Hartford Courant Observer Music Monthly Amazon.com Los Angeles Times Sputnikmusic The Phoenix Hot Press Junkmedia The Wire Neumu.net Almost Cool No Ripcord

3 Blender Entertainment Weekly Billboard Rolling Stone Spin The Onion (A.V. Club)

Alternative Press Stylus Magazine Under The Radar Dot Music The Guardian New Musical Express cokemachineglow Prefix Magazine Drowned In Sound

Figure 6. The clusters of music album critics extracted from the hierarchy.

Figure 7. The similarity matrix of music album critics.

Figure 8. Hierarchical clustering of movie-cinema critics.

Boston Globe Chicago Tribune Entertainment Weekly San Francisco Chronicle Chicago Reader Austin Chronicle LA Weekly The Onion (A.V. Club)

New York Daily News New York Post Village Voice

TV Guide Variety The New York Times Los Angeles Times Washington Post

Chicago Sun-Times USA Today The Globe and Mail Philadelphia Inquirer Seattle Post-Intelligencer

Film Threat Salon.com Baltimore Sun ReelViews The Hollywood Reporter Miami Herald Portland Oregonian Christian Science Monitor

Newsweek Time The New Yorker New York Magazine Slate Premiere Film.com

Empire Wall Street Journal Charlotte Observer Dallas Observer Rolling Stone

4 Mr. Showbiz San Francisco Examiner New Times (L.A.) The New Republic NPR TNT RoughCut

Figure 9. The clusters of movie-cinema critics extracted from the hierarchy.

Figure 10. The similarity matrix of movie-cinema critics.

Discussion

If we examine the groups of publications extracted from the tree, along with the similarity matrices, we can distinguish the following properties in each domain:

1- TV show critics: The white boxes on the matrix diagonal show that the self-similarity of each group is strong. In particular, publications in groups 5 and 6 are very close both to each other and within their own groups, whereas they are very dissimilar to the publications in group 4.

2- Music albums: Groups 1-2 and 3-4 emerge as two large sets that are internally uniform and similar to each other, whereas the other groups (5-6-7), and especially group 5, appear to be very different from these two sets and also internally heterogeneous. This opposition suggests a binary separation between two mainstream media groups that focus on the same set of popular products and the alternative media that are interested in a larger and different set of products.

3- Movies: In movies, we see more complex relations, as the dataset is larger. The largest difference is between groups 1-2-3 and 6-7-8, and especially between groups 3 and 8.

Future work

There are some things I thought of implementing but could not complete for this report:

1- Top ten lists for publication groups. If we assembled a top ten albums list for the music critics in groups 1-2, 3-4 and 5-6-7, we would probably see more clearly which kind of music these publications stand for. I expect very popular music for 1-2 and 3-4, whereas 5-6-7 might reveal unknown gems. We could also extract top ten lists for movie critics to see which distinct audiences like which kinds of movies. For example, we could identify publications that focus on action movies and others that focus on dramas.

2- Number of reviews for a publication. Some publications have so many reviews that they appear similar to many more nodes. Some kind of normalization could be used to prevent this effect.
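One possible normalization for the second item, offered only as a sketch and not as something the report implemented, is to divide the summed score similarity S2 by the number of common products S1, giving a mean similarity per commonly reviewed product. The function name and the `min_common` threshold are assumptions made for illustration.

```python
def normalized_similarity(s1, s2, min_common=5):
    """Mean triangle similarity per commonly reviewed product.

    s1: number of common products |P|; s2: summed score similarity over P.
    Pairs with fewer than min_common common products return 0.0, since a
    mean over very few products is unreliable (the threshold is an assumption).
    """
    if s1 < min_common:
        return 0.0
    return s2 / s1

# Invented numbers: a prolific pair versus a small but closely agreeing pair
prolific = normalized_similarity(s1=400, s2=220.0)
small = normalized_similarity(s1=20, s2=15.0)
```

With the raw sum, the prolific pair would look far more similar (220 versus 15); after normalization the smaller pair scores higher, because its reviewers agree more closely on the products they share.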

[1] Işık Barış Fidaner (2009) "Estimating review score from words", unpublished paper, prepared for "Artificial Neural Networks" course, available: http://issuu.com/fidaner/docs/16161675-estimating-score-from-wordsin-metacritic