Data Exploration on Yelp
Yutian Luo | University of Maryland, College Park
Introduction
Background
As online communities grow in popularity, more people share their restaurant experiences, rate food on the Internet, and rely on online information when deciding where to eat. The goal of this project is to analyze the Yelp restaurant dataset, which contains customer review texts, restaurant ratings, location information, and other attributes for a variety of restaurants. At a high level, we aim to find out which restaurant attributes lead to higher ratings and to draw further inferences about restaurants from their text reviews.
More specifically, our questions of interest cover three perspectives: user, restaurant, and review. We study users' rating and reviewing patterns, identify the characteristics of highly rated restaurants, and use the characteristics of reviews to deliver better recommendations to Yelp's users, both customers and restaurants. Insights about the customer base, and about the factors that influence a customer's decision to visit a particular restaurant, can help diners find the restaurant they want by attributes such as quality or cuisine. They can also give restaurants recommendations for expanding their business by attracting more customers, improving the customer experience, and targeting particular customer groups. Finally, from Yelp's perspective, we want to help improve its recommendation system.
Data Description
We chose the Yelp dataset from Kaggle because it is feasible to work with and, at about 10 GB, large enough to offer real analytical potential. The data is gathered from the Yelp website, one of the best-known review platforms, which averaged 76.7 million unique monthly visitors via its mobile website in 2019, so it is authentic and can support practical insights. The dataset includes 5.2 million user reviews of 174,000 businesses in 11 metropolitan areas across four countries. To avoid out-of-memory errors, the analysis focuses on restaurants in Nevada, the U.S. state with the highest total review count. We acquired seven datasets in CSV format: yelp_business_attributes.csv, yelp_business.csv, yelp_business_hours.csv, yelp_checkin.csv, yelp_tip.csv, yelp_review.csv, and yelp_user.csv.
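The Nevada restriction described above can be sketched in pandas as follows. This is a minimal illustration, not the exact code used in the project. It assumes that the business file has `business_id`, `state`, `categories`, and `review_count` columns, that Nevada is coded as `NV`, that restaurants carry the label `Restaurants` in `categories`, and that the review file has a `business_id` column. The function names and the chunk size are hypothetical. Tiny in-memory CSVs stand in for yelp_business.csv and yelp_review.csv so the sketch runs on its own.

```python
import io

import pandas as pd


def review_totals_by_state(business: pd.DataFrame) -> pd.Series:
    """Sum review_count per state, largest first (assumed column names)."""
    return (
        business.groupby("state")["review_count"]
        .sum()
        .sort_values(ascending=False)
    )


def nevada_restaurants(business: pd.DataFrame) -> pd.DataFrame:
    """Keep businesses in NV whose categories mention 'Restaurants'."""
    in_nevada = business["state"] == "NV"
    # na=False treats a missing categories value as "not a restaurant".
    is_restaurant = business["categories"].str.contains(
        "Restaurants", regex=False, na=False
    )
    return business[in_nevada & is_restaurant]


def filter_reviews_in_chunks(review_source, business_ids, chunksize=100_000):
    """Read the review file piece by piece and keep only matching rows.

    Reading in chunks avoids loading the whole review file into memory.
    """
    kept = []
    for chunk in pd.read_csv(review_source, chunksize=chunksize):
        kept.append(chunk[chunk["business_id"].isin(business_ids)])
    if not kept:  # empty input: nothing to concatenate
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Toy stand-ins for yelp_business.csv and yelp_review.csv (made-up rows).
business_csv = io.StringIO(
    "business_id,state,categories,review_count\n"
    "b1,NV,Restaurants;Mexican,120\n"
    "b2,NV,Hair Salons,30\n"
    "b3,AZ,Restaurants;Pizza,80\n"
)
review_csv = io.StringIO(
    "review_id,business_id,stars\n"
    "r1,b1,5\n"
    "r2,b2,3\n"
    "r3,b3,4\n"
    "r4,b1,2\n"
)

business = pd.read_csv(business_csv)
totals = review_totals_by_state(business)
nv = nevada_restaurants(business)
reviews = filter_reviews_in_chunks(review_csv, set(nv["business_id"]), chunksize=2)
print(totals)
print(reviews)
```

On the real files, the `io.StringIO` objects would be replaced by the CSV paths. The chunk size trades memory use against the number of read passes and would need tuning to the available RAM.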
Data Acquisition
Dataset Preparation