
What Relates to High Ratings on Yelp? Zach Branson, Jinsub Hong, Kyongche Kang, Jonathan Yu Carnegie Mellon University, Pittsburgh, Pennsylvania

Introduction

Distribution of Businesses by Rating

•  Yelp is one of the most widely used business rating websites

•  Every day users log on to Yelp and rate businesses on a 1-to-5-star scale

- A business’ Yelp rating can greatly affect its future.

•  Question: What relates to a high star rating on Yelp?

[Diagram: Reviews, Review Usefulness, and Location as factors feeding into a business’ Rating]

•  Many variables may affect a business’ rating, including the types of reviews it receives and its location

•  Our data come from Phoenix, AZ and include:
- 11,537 unique businesses
- 43,873 unique Yelp users
- 229,907 unique reviews from users

Variables We Consider in Our Models

Reviews: number of reviews a business receives from Yelp users
Review Usefulness: average number of Yelp users who marked a business’ reviews as “useful”
Location: latitude and longitude of each business
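As a sketch of how the three predictors above could be assembled from raw review records (the field names and sample records here are assumptions for illustration, not Yelp's actual schema):

```python
from collections import defaultdict

def business_features(reviews):
    """Aggregate per-review records into per-business predictors.

    Each review is a dict with (assumed) keys:
    business_id, useful_votes, latitude, longitude.
    Returns {business_id: (review_count, avg_useful_votes, lat, lon)}.
    """
    counts = defaultdict(int)
    useful = defaultdict(int)
    coords = {}
    for r in reviews:
        b = r["business_id"]
        counts[b] += 1
        useful[b] += r["useful_votes"]
        coords[b] = (r["latitude"], r["longitude"])
    return {
        b: (counts[b], useful[b] / counts[b], *coords[b])
        for b in counts
    }

# Hypothetical reviews for two Phoenix businesses
reviews = [
    {"business_id": "a", "useful_votes": 2, "latitude": 33.45, "longitude": -112.07},
    {"business_id": "a", "useful_votes": 0, "latitude": 33.45, "longitude": -112.07},
    {"business_id": "b", "useful_votes": 1, "latitude": 33.50, "longitude": -112.00},
]
print(business_features(reviews)["a"])  # (2, 1.0, 33.45, -112.07)
```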

- Over 100 million users have visited the site so far, and its database holds over 17 million reviews.

- A good rating encourages other Yelp users to visit a business, while a bad rating discourages them from going.

From the maps of businesses by rating, highly rated businesses are concentrated in the central city, and as ratings drop, businesses are more dispersed from the center. For example, businesses with 4 stars tend to cluster near the city center.

Model Evaluation

Logistic Regression
•  Took a subset of the data, keeping only businesses rated by useful reviewers (those averaging more than 1.1 “useful” votes per review), in order to reduce bias
•  Created a binary response variable: 1 for businesses rated 4 stars or above, 0 otherwise

Model 1: log(p / (1 - p)) = β0 + β1(Review count) + β2(Avg. useful votes)

Variables              Coefficient   Std. Error   P-value
Intercept              -0.0899       0.0707       0.203
Review count           0.00459       0.00062      8.73e-14
Average Useful votes   -0.0275       0.0453       0.544

Misclassification rate: 0.470

Model 2: log(p / (1 - p)) = β0 + β1(Review count) + β2(Avg. useful votes) + β3(Longitude) + β4(Latitude)

Variables              Coefficient   Std. Error   P-value
Intercept              -1.353e+01    2.731e+01    0.620
Review count           4.596e-03     6.151e-04    7.93e-14
Average Useful votes   -3.042e-02    4.550e-02    0.504
Longitude              -5.265e-02    2.725e-01    0.847
Latitude               2.255e-01     3.207e-01    0.482

Misclassification rate: 0.469

•  Model 2 indicates that the location variables alone are not significant in classifying the best- and worst-rated businesses.
•  We therefore added two more variables to the model with location: the average number of “cool” votes and “funny” votes per review.

Model 3: Model 2 + β5(Avg. cool votes) + β6(Avg. funny votes)

Variables              Coefficient   Std. Error   P-value
Intercept              15.9          14.3         0.265
Review count           0.00402       0.000521     1.21e-14
Average Useful votes   -0.0938       0.0250       0.000175
Longitude              0.244         0.141        0.0846
Latitude               0.337         0.1708       0.0487
Average Cool votes     0.536         0.0355       < 2e-16
Average Funny votes    -0.498        0.0293       < 2e-16

Misclassification rate: 0.413

•  Adding the two variables made all the variables significant (at α = 0.1)
•  The misclassification rate also improved slightly.
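A minimal sketch of fitting a model of Model 1's form by plain batch gradient descent. The synthetic data, learning settings, and assumed true coefficients are inventions for illustration; the poster's models were presumably fit with standard statistical software.

```python
import math
import random

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit log(p / (1 - p)) = b0 + b1*x1 + ... by batch gradient descent."""
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)  # intercept first
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            z = max(-30.0, min(30.0, z))  # guard against exp overflow
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

def misclassification(beta, X, y):
    """Fraction of points on the wrong side of the fitted boundary."""
    wrong = 0
    for xi, yi in zip(X, y):
        z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
        wrong += (z > 0) != (yi == 1)
    return wrong / len(X)

# Synthetic stand-in for standardized (review count, avg. useful votes)
random.seed(0)
X, y = [], []
for _ in range(200):
    x1, x2 = random.uniform(-2, 2), random.uniform(-2, 2)
    X.append([x1, x2])
    # assumed ground truth: x1 raises the odds of 4+ stars, x2 lowers them
    y.append(1 if 1.5 * x1 - x2 + random.gauss(0, 0.5) > 0 else 0)

beta = fit_logistic(X, y)
print("misclassification rate:", misclassification(beta, X, y))
```

Predicting the class as 1 whenever the fitted log-odds are positive is what the misclassification rates above measure.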

Clustering Analysis
Applied clustering and classification methods that capture spatial features, for comparison with logistic regression.
•  K-means
- Applied K-means clustering, which assigns each observation to the cluster with the nearest centroid. Misclassification rate: 0.541
•  K-nearest neighbors (K-NN)
- Applied K-NN, in which an observation is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k a small positive integer). We chose k = 8.
- Divided the data into training and test subsets. Misclassification rate (on the test set): 0.101
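A minimal K-NN sketch in the spirit described above (k = 8, majority vote over Euclidean neighbors). The toy coordinates and labels are assumptions, not the poster's data.

```python
import math
from collections import Counter

def knn_predict(train, query, k=8):
    """Classify `query` by majority vote among its k nearest
    training points. `train` is a list of ((x, y), label) pairs."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy spatial data: a central cluster of highly rated businesses (1)
# and an outlying cluster of low-rated ones (0)
train = [((x / 10, y / 10), 1) for x in range(5) for y in range(5)]
train += [((x / 10 + 2, y / 10 + 2), 0) for x in range(5) for y in range(5)]

print(knn_predict(train, (0.2, 0.2)))  # inside the first cluster -> 1
print(knn_predict(train, (2.2, 2.2)))  # inside the second cluster -> 0
```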

Future Work
•  Logistic regression and K-nearest neighbors performed well at classifying business ratings.
•  Case-based reasoning (CBR) is an emerging decision-making paradigm in which new cases are solved by relying on previously solved, similar cases.
•  The K-nearest neighbors algorithm can be combined with information obtained from a logistic regression model to improve classification:
- Use logistic regression to classify the usefulness of each review of a business, and feed those classifications into the K-nearest neighbors classifier (or vice versa).
- As proposed by Campillo-Gimenez et al.1), find weights for each classification via soft K-nearest neighbors and weights for the predictor variables via the Wald statistic, then combine the two sets of weights in the classification decision.

1) Campillo-Gimenez, B. et al. (March 2013). Coupling K-nearest neighbors with logistic regression in case-based reasoning. CoRR, abs/1303.1700.
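One way such a coupling could look: scale each feature dimension by a weight derived from logistic-regression output before computing K-NN distances. This is only an illustration of the general idea, not the method of Campillo-Gimenez et al.; the weights and points below are made up.

```python
import math
from collections import Counter

def weighted_knn(train, query, weights, k=3):
    """K-NN where each feature dimension is scaled by a weight
    (e.g. a Wald-style |coefficient| / std.error from a logistic
    regression) before computing Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))
    neighbors = sorted(train, key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical weights: feature 0 is informative, feature 1 is near-noise
weights = [1.0, 0.01]
train = [((0.0, 0.0), 0), ((0.1, 9.0), 0), ((1.0, 0.0), 1), ((1.1, 9.0), 1)]
print(weighted_knn(train, (0.9, 5.0), weights, k=1))  # nearest in weighted space -> 1
```

Down-weighting a noisy dimension lets the vote be driven by the feature the regression found significant.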

Conclusion
•  We used several classification methods to classify whether or not a business is highly rated
- We define “highly rated” as at least four stars
•  In logistic regression, review count, the average “useful”, “cool”, and “funny” votes, and business location were all significant
•  Among methods that capture spatial features, K-nearest neighbors outperformed logistic regression, while K-means performed worse
•  Location and other spatial features of a business influence its rating on Yelp

Acknowledgements
We owe thanks to the Department of Statistics at Carnegie Mellon University, and to Yelp, Inc., for providing the tools and data for this analysis.

