What Relates to High Ratings on Yelp?
Zach Branson, Jinsub Hong, Kyongche Kang, Jonathan Yu
Carnegie Mellon University, Pittsburgh, Pennsylvania
Distribution of Businesses by Rating
• Yelp is one of the most widely used business rating websites
• Every day users log on to Yelp and rate businesses on a 1-to-5-star scale
- A business’ Yelp rating can greatly affect its future.
• Question: What relates to a high-star rating on Yelp?
• Many variables may affect a business’ rating, including the types of reviews it gets and its location
• Our data are from Phoenix, AZ and include:
- 11,537 unique businesses
- 43,873 unique Yelp users
- 229,907 unique reviews from users
Variables We Consider in Our Models Reviews
• Number of reviews a business receives from Yelp users
• Average number of Yelp users who thought businesses’ reviews were “useful”
• Latitude and longitude locations of each business
• Logistic regression with location variables indicates that location is not significant in classifying the best- and worst-rated businesses.
• Added two more variables to the logistic regression with location variables: the average number of “cool” votes and the average number of “funny” votes.
- Over 100 million users have visited the site so far, and over 17 million reviews are on its database.
- A good rating will encourage other Yelp users to go to that business, while a bad rating will discourage users from going to that business
From the distribution of businesses by rating, we see that highly rated businesses are more concentrated in the city center, and as ratings go lower, businesses are more dispersed away from the center. For example, businesses with 4 stars tend to be more concentrated in the center of the city.
Average Useful votes
Average Cool votes
Average Funny votes
Misclassification rate: 0.413
Logistic Regression
• Took a subset of the data, keeping only businesses rated by useful reviewers (those averaging more than 1.1 useful votes per review), in order to reduce bias
• Created a binary variable: 1 for businesses above 4 stars and 0 otherwise.
Model 1: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Review\_count}) + \beta_2(\text{Avg.useful\_votes})$
Variables
Average Useful votes
Misclassification rate: 0.470
Model 2: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Review\_count}) + \beta_2(\text{Avg.useful\_votes}) + \beta_3(\text{Longitude}) + \beta_4(\text{Latitude})$
Variables
Average Useful votes
Misclassification rate: 0.469
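The misclassification rates reported for the logistic models can be illustrated with a minimal sketch. It assumes a 0.5 probability cutoff for predicting "highly rated" (the poster does not state the cutoff), and the coefficients and businesses below are made-up toy values, not estimates or data from the Yelp set.

```python
import math

def predict_probability(coefs, features):
    """Invert the log-odds: p = 1 / (1 + exp(-(b0 + b1*x1 + ...)))."""
    intercept, *slopes = coefs
    log_odds = intercept + sum(b * x for b, x in zip(slopes, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical coefficients: b0, b1 (review count), b2 (average useful votes).
coefs = (-0.5, 0.01, 0.3)

# Toy businesses as (review_count, avg_useful_votes, stars); not Yelp data.
businesses = [(120, 1.5, 4.5), (8, 0.4, 3.0), (40, 2.0, 4.0), (15, 1.0, 5.0)]

# Binary response: 1 for businesses above 4 stars, 0 otherwise.
labels = [1 if stars > 4 else 0 for _, _, stars in businesses]

# Assumed decision rule: predict 1 when the fitted probability is at least 0.5.
probs = [predict_probability(coefs, (rc, uv)) for rc, uv, _ in businesses]
preds = [1 if p >= 0.5 else 0 for p in probs]

rate = misclassification_rate(labels, preds)
print(f"Misclassification rate: {rate:.3f}")
```

In practice the coefficients would come from fitting the model to the data; the sketch only shows how a fitted model turns into class predictions and an error rate.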
• Adding the two extra variables made all the variables significant (at α = 0.1)
• Misclassification rate improved slightly.
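By analogy with Models 1 and 2, the extended model with the cool and funny vote averages could be written as follows. The label "Model 3", the coefficient indices β5 and β6, and the variable names Avg.cool_votes and Avg.funny_votes are our assumed notation, not taken from the poster.

```latex
\text{Model 3: } \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Review\_count}) + \beta_2(\text{Avg.useful\_votes}) + \beta_3(\text{Longitude}) + \beta_4(\text{Latitude}) + \beta_5(\text{Avg.cool\_votes}) + \beta_6(\text{Avg.funny\_votes})
```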
Clustering Analysis
Applied clustering methods that capture spatial features to compare with logistic regression.
• K-means
- Applied K-means clustering analysis, which assigns each observation to the cluster with the nearest centroid (mean).
Misclassification rate: 0.541
• K-nearest neighbors (K-NN)
- Applied K-nearest neighbors analysis, in which an observation is classified by a majority vote of its neighbors: it is assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). We chose k to be 8.
- Divided the data into training and test subsets.
Misclassification rate (on test set): 0.101
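The K-NN procedure described above (majority vote among k = 8 neighbors, evaluated on a held-out test set) can be sketched as follows. This is an illustration under assumptions: Euclidean distance on raw latitude and longitude, a 70/30 split, and synthetic points standing in for businesses. None of these choices are stated in the poster.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=8):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # With an even k, ties are possible; most_common breaks them by first occurrence.
    return votes.most_common(1)[0][0]

rng = random.Random(0)

def make_points(center, label, n):
    """Synthetic (latitude, longitude) points scattered around a center."""
    return [((rng.gauss(center[0], 0.01), rng.gauss(center[1], 0.01)), label)
            for _ in range(n)]

# Two well-separated synthetic groups: 1 = highly rated, 0 = not (toy data).
data = make_points((33.45, -112.07), 1, 50) + make_points((33.60, -111.90), 0, 50)
rng.shuffle(data)

# Assumed 70/30 train/test split.
train, test = data[:70], data[70:]

errors = sum(knn_predict(train, point) != label for point, label in test)
rate = errors / len(test)
print(f"Misclassification rate (on test set): {rate:.3f}")
```

Because the toy groups barely overlap, the error rate here is far lower than on real data; the point is only the mechanics of the vote and the test-set evaluation.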
• Logistic regression and K-nearest neighbors performed well in classifying business ratings.
• Case-Based Reasoning (CBR) is an emerging decision-making paradigm in which new cases are solved by relying on previously solved similar cases.
• The K-nearest neighbors algorithm can be combined with information obtained from a logistic regression model during the learning process to improve classification.
- Use logistic regression to classify the usefulness of each review of a business, and take those classifications into account in the K-nearest neighbors classification (and vice versa).
- Proposed by Campillo-Gimenez et al. 1): find the weight of each classification by soft K-nearest neighbors and the weights of the predictor variables in logistic regression by the Wald statistic and weighting of attributes, then combine the two sets of weights in the decision-making process for classification.
1) Campillo-Gimenez, B. et al. (March 2013). Coupling K-nearest neighbors with logistic regression in case-based reasoning. CoRR, volume abs/1303.1700.
Conclusion
• We used different classification methods to classify whether or not a business is highly rated
- We consider “highly rated” to mean at least four stars
• In the logistic regression, review counts, the average votes for usefulness, coolness, and funniness, and the locations of businesses were all significant
• Among the clustering analyses that capture spatial features, K-nearest neighbors performed better than logistic regression; K-means performed worse
• Location and the spatial features of businesses influence the ratings of businesses on Yelp
Acknowledgements
We owe thanks to the Department of Statistics at Carnegie Mellon University as well as Yelp, Inc. for providing us with the tools and data for this analysis.