Estimating score from words in Metacritic


CMPE 545 Artificial Neural Networks

Estimating review score from words
Işık Barış Fidaner


$$\text{Metascore} = \frac{1}{N} \sum_i \text{score}_i$$


• rt = Score: the rating given to this product
• Reviewer: the source of this review
• Quote: a few sentences that summarize this review
• xt = ? Existence of some words in the quote (bag of words representation): + affectionate + exuberant + embrace


Purposes 1. A new database that relates text to score

(...) An affectionate, exuberant picture that seeks to bring even those who don't know Klingon from Portuguese into the embrace of a pop-culture phenomenon. (...) → ? → 90


Purposes 2. Quantify meaning with machine learning

Review quote: An affectionate, exuberant picture that seeks to bring even those who don't know Klingon from Portuguese into the embrace of a pop-culture phenomenon.

wT xt:

word            term
riveting        0
exhilarating    0
affectionate    73 × 1
crafted         0
exuberant       70 × 1
dull            0
lacking         0
embrace         65 × 1


Purposes 3. Meta-metacritic deductions, such as
• Positive words: riveting, exhilarating, crafted, superb, extraordinary, brilliant
• Negative words: unfunny, tedious, fails, mess, dull, lacking


Obtaining the database
• Developed a PHP web crawler (PHP + MySQL)
• It ran for a few days
• TV show reviews – 8,335 records
• Music album reviews – 62,293 records
• Movie reviews – 113,456 records


Bag of words assumption
• Features affect the result independently

An affectionate, exuberant picture that seeks to bring even those who don't know Klingon from Portuguese into the embrace of a pop-culture phenomenon.

=

phenomenon from an exuberant picture those into a portuguese don't pop-culture affectionate to embrace bring klingon of who know seeks

• Semantic organization does not matter


Bag of words assumption
• The problem with modifiers: "This is not good." vs. "Is this not good?"
• We rely on the information encoded in the vocabulary, not grammar
• Opinions expressed clearly and simply: "Excellent, wonderful!" "This is dreadful."
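A minimal sketch of the modifier problem, assuming a simple regex tokenizer (the function name `bag_of_words` and the tokenization rule are illustrative assumptions, not the crawler's actual code). Both example sentences collapse to the same bag, so the model cannot tell them apart:

```python
import re

def bag_of_words(text):
    """Return the set of lowercase word tokens; order and punctuation are discarded."""
    # Assumed tokenizer: runs of letters, apostrophes and hyphens.
    return set(re.findall(r"[a-z'-]+", text.lower()))

statement = bag_of_words("This is not good.")
question = bag_of_words("Is this not good?")

# Same words, different meaning: the two bags are identical.
print(statement == question)  # True
```

A clearly worded opinion such as "Excellent, wonderful!" keeps all of its information in the bag, which is why the approach relies on vocabulary rather than grammar.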


Word selection
Statistics computed for each word:
1. Quote count (QC)
2. Product count (PC)
3. Score mean (SM)
4. Score stdev (SS)

~20 thousand words → ~300 words
• Meaningful words (SS < SSmax = 20)
• Frequently used words (PC > PCmin = 20)
• Non-grammatical words (PC < PCmax = 100)
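The selection step can be sketched as follows. The three thresholds come from the slide; everything else is an assumption for illustration: the record layout `(product_id, score, quote)`, the tokenizer, the function names, and the use of the population standard deviation.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev

SS_MAX, PC_MIN, PC_MAX = 20, 20, 100  # thresholds stated on the slide

def word_stats(records):
    """records: iterable of (product_id, score, quote) tuples (assumed layout).
    Returns {word: (QC, PC, SM, SS)}."""
    scores = defaultdict(list)    # scores of the quotes containing each word
    products = defaultdict(set)   # distinct products whose quotes contain it
    for product_id, score, quote in records:
        for word in set(re.findall(r"[a-z'-]+", quote.lower())):
            scores[word].append(score)
            products[word].add(product_id)
    return {w: (len(s), len(products[w]), mean(s), pstdev(s))
            for w, s in scores.items()}

def select_words(stats):
    """Keep words that are meaningful (low SS), frequent (PC > PC_MIN)
    and non-grammatical (PC < PC_MAX)."""
    return sorted(w for w, (qc, pc, sm, ss) in stats.items()
                  if ss < SS_MAX and PC_MIN < pc < PC_MAX)
```

Words that appear for almost every product (such as "the") fail the PCmax test, rare words fail the PCmin test, and words whose scores are spread widely fail the SSmax test.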


Significant words for TV and movies
casual words! fancy words!
• unfunny, waste: TV takes too much time!
• disappointment, supposed, fails: Movies are overrated!


Significant words for music albums
• masterpiece, artists: Music is art
• date, modern: Music ages quickly
• personality: Albums are attached to the musician's personality


The input vector and estimation
• Example input vector (divided by quote size)
  – xt = [1 0 0 1 0 0 0 1 0 0 0 0 ... 0] / 3
• Estimation function
• There is a weight for every selected word
• xt chooses the subset of contained words
• Estimation is the sum of w0 and the arithmetic mean of the weights of contained words


Linear and SVM regression
• Linear regression uses the squared-difference error,
• which implies these update equations:
• SVM regression uses the ε-sensitive error function,
• with these simpler update equations:
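The update equations themselves are not reproduced in the text. As a hedged sketch, assuming the linear model y = w0 + wT xt and plain stochastic gradient descent, the textbook updates for the two error functions might look like this (the learning rate `lr`, the tolerance `eps` and both function names are assumptions, and the author's exact equations may differ):

```python
import numpy as np

def step_squared(w, w0, x, r, lr):
    """One SGD step on the squared error 0.5 * (r - y)**2."""
    err = r - (w0 + w @ x)
    return w + lr * err * x, w0 + lr * err

def step_eps_sensitive(w, w0, x, r, lr, eps):
    """One SGD step on the error max(0, |r - y| - eps)."""
    err = r - (w0 + w @ x)
    if abs(err) <= eps:
        return w, w0              # inside the tolerance tube: no update
    s = np.sign(err)              # only the direction of the error is used
    return w + lr * s * x, w0 + lr * s
```

In the squared-error step the update grows with the size of the error, so a single badly mispredicted review moves the weights a lot. In the ε-sensitive step the update has fixed magnitude and small errors are ignored, which is what makes the update both simpler and more tolerant.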


Linear regression learning
• Unstable learning in validation set
• Error of 17 points
• Error of 14 points


SVM regression learning
• Robustness increased, because the SVM error function is linear and tolerant to error.
• Better results with SVM!
• Error of 13 points
• Error of 11 points


Possible improvements
• Non-linear model that actually weighs the importance of words
• Normalization by estimating reviewer parameters
• Adding two-word combinations to the input vector
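The last idea can be illustrated with a short sketch, assuming adjacent-word pairs are what "two-word combinations" means (the function name and tokenizer are hypothetical). Adding such pairs would let the input vector distinguish "not good" from "good":

```python
import re

def unigrams_and_bigrams(text):
    """Single words plus adjacent two-word combinations found in the text."""
    tokens = re.findall(r"[a-z'-]+", text.lower())
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams

features = unigrams_and_bigrams("This is not good.")
print("not good" in features)  # True
```

The cost is a much larger candidate vocabulary, so the same word selection thresholds would presumably have to be applied to the pairs as well.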



