Web Scraping Hotel Details on Tripadvisor Using Python

WWW.XBYTE.IO / LARANA, INC.

INTRODUCTION

Web scraping is the automated extraction of data from websites. There are two types of web scraping: content scraping and structure scraping. Content scraping extracts textual content from a website’s pages, whereas structure scraping extracts relational data from HTML objects.

A web scraper is an agent that performs web scraping to extract information for further use.

Web scrapers have diverse uses, such as monitoring online trends or news, updating existing data sets with information extracted from websites for further analysis, and maintaining sites by detecting and correcting broken links.

Scraping can be done manually, but software is generally used to automate it. Python is a popular language for web scraping because it has several libraries that make it easy to scrape data from websites.

IMPORTING PACKAGES

We need to import a few packages to scrape data from a website. The first is the requests package, which allows us to make HTTP requests to websites. We also need the BeautifulSoup package, which enables us to parse HTML and extract data. Finally, we need the pandas package, which allows us to store data in a data frame. The import list also includes Selenium, which drives a real browser and can wait for page elements to load.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import time
    from bs4 import BeautifulSoup
    import pandas as pd
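To show how the Selenium imports might fit together, here is a minimal sketch that opens a page in headless Chrome, waits for an element to appear, and returns the rendered HTML. The function name `load_page_source`, the default selector `h1`, and the timeout are assumptions made for illustration, not part of the original program. The Selenium imports sit inside the function so the file can still be loaded on a machine without Selenium or Chrome.

```python
def load_page_source(url, wait_css="h1", timeout=15):
    """Open `url` in headless Chrome and return the rendered HTML.

    `wait_css` is a hypothetical CSS selector for an element that signals
    the page has finished loading; adjust it to the page being scraped.
    """
    # Imported here so that defining the function needs no browser setup.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element exists or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `load_page_source("https://www.tripadvisor.com/")` would return the home page HTML as a string, which can then be handed to BeautifulSoup. An explicit wait is generally more reliable than a fixed `time.sleep`, because it returns as soon as the element is present.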

TRIPADVISOR URL

The first step in web scraping is to find the URL of the website we want to scrape. We can start by looking at the Tripadvisor home page. From there, we can navigate to the page for a specific hotel.

SCRAPING HOTEL DETAILS IN PYTHON

Once we have the URL for the hotel, we can start scraping the data. We can use the requests package to make a GET request to the hotel’s Tripadvisor page. This gives us the HTML of the page, which we can then parse using BeautifulSoup. We can use BeautifulSoup to find all of the elements on the page that contain data about the hotel. In this case, we want to find the elements that contain the hotel’s name, rating, number of reviews, and price. We can then extract the data from these elements and store it in a list. Finally, we can write the hotel’s name and price into a text file for later use and keep the data in a pandas data frame.
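The parsing step described above can be sketched as follows. The HTML snippet and the class names (`hotel-name`, `hotel-rating`, `hotel-reviews`, `hotel-price`) are invented for illustration; Tripadvisor's real markup differs and changes over time, so the selectors would need to be read off the live page. The snippet is parsed from an inline string so that it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded hotel page.
SAMPLE_HTML = """
<div class="hotel-card">
  <h1 class="hotel-name">Example Grand Hotel</h1>
  <span class="hotel-rating">4.5</span>
  <span class="hotel-reviews">1,234 reviews</span>
  <span class="hotel-price">$189</span>
</div>
"""


def text_or_none(soup, css_class):
    """Return the stripped text of the first element with `css_class`, or None."""
    node = soup.find(class_=css_class)
    return node.get_text(strip=True) if node else None


def parse_hotel(html):
    """Extract name, rating, review count, and price from one hotel page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": text_or_none(soup, "hotel-name"),
        "rating": text_or_none(soup, "hotel-rating"),
        "reviews": text_or_none(soup, "hotel-reviews"),
        "price": text_or_none(soup, "hotel-price"),
    }


hotel = parse_hotel(SAMPLE_HTML)
print(hotel)
```

Returning `None` for a missing element keeps the scraper from crashing when a page omits a field, at the cost of having to handle missing values later.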

This article aims to write code that uses web scraping to extract information on well-known hotels around the world and to compare them by rating, location, price, and reviews.

READING THE DATA INTO A DATA FRAME USING PANDAS

Once we have extracted the data from the HTML, we can use the pandas package to read it into a data frame. This allows us to analyze the data more easily. To read tabular data straight from HTML, we can use pd.read_html, which returns a list of data frames, one for each table it finds.
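Since pd.read_html only handles table markup, values that have already been extracted into Python lists can instead be passed to the DataFrame constructor. The sketch below takes that route and converts the scraped strings into numbers. The hotel records are made-up sample data, and the column names are assumptions.

```python
import pandas as pd

# Made-up records in the shape a scraper might produce (all strings).
records = [
    {"name": "Example Grand Hotel", "rating": "4.5", "reviews": "1,234 reviews", "price": "$189"},
    {"name": "Sample Beach Resort", "rating": "4.0", "reviews": "856 reviews", "price": "$142"},
    {"name": "Demo City Inn", "rating": "3.5", "reviews": "97 reviews", "price": "$75"},
]


def to_frame(rows):
    """Build a data frame and convert the text columns to numeric types."""
    df = pd.DataFrame(rows)
    df["rating"] = df["rating"].astype(float)
    # Keep only the digits, so "1,234 reviews" becomes 1234.
    df["reviews"] = df["reviews"].str.replace(r"[^\d]", "", regex=True).astype(int)
    # Keep digits and the decimal point, so "$189" becomes 189.0.
    df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True).astype(float)
    return df


hotels = to_frame(records)
print(hotels)
```

Converting to numeric types early means later sorting and plotting compare numbers rather than strings.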

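Our program should do the following: fetch the hotel page, parse its HTML, extract the hotel’s name, rating, number of reviews, and price, and store the results in a pandas data frame.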

Here is what our program looks like so far. It scrapes the data from Tripadvisor and stores it in a pandas data frame.

Now that we have our hotel information stored in a pandas data frame, we can plot the ratings of different hotels against each other to better understand how they differ. This can give us good insight into which hotels are rated more highly than others and help us make informed decisions when booking.

CLEANING THE DATA

Once we have the data in a data frame, we can clean it up. This may involve removing duplicates, filling in missing values, or changing the data format. In this case, we will remove the duplicates. We also want to remove the comments associated with the top five hotels so that we only have information for the three hotels in our analysis. To accomplish this, we will use a regex. The code replaces every "#" with a space and every "&amp;" with "&", and appends the comment before or after.
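The original regex is not reproduced in the text, so the following is only a sketch of what such a cleaning step could look like, assuming the goal is to turn "#" into spaces, decode the HTML-escaped ampersand, and then drop duplicate entries. The function names and sample comments are hypothetical.

```python
import re


def clean_text(raw):
    """Normalize one scraped string."""
    text = raw.replace("&amp;", "&")   # decode the HTML-escaped ampersand
    text = re.sub(r"#", " ", text)     # replace every "#" with a space
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.strip()


def clean_and_dedupe(items):
    """Clean every string, then drop duplicates while keeping first-seen order."""
    cleaned = [clean_text(item) for item in items]
    return list(dict.fromkeys(cleaned))


# Made-up sample comments for illustration.
comments = ["Great#view", "Great view", "Bed#&amp;#Breakfast"]
print(clean_and_dedupe(comments))
```

Cleaning before deduplicating matters here: "Great#view" and "Great view" only become identical once the "#" has been replaced.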

We can now store this data in a list and display our ranked hotel list to see how the hotels compare.
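One way to produce such a ranked list is to sort the hotels by rating, breaking ties by number of reviews. The ranking rule and the sample data below are assumptions chosen for illustration; the source does not say how the hotels are ranked.

```python
# Made-up hotel data, deliberately out of order.
hotel_list = [
    {"name": "Demo City Inn", "rating": 3.5, "reviews": 97},
    {"name": "Example Grand Hotel", "rating": 4.5, "reviews": 1234},
    {"name": "Sample Beach Resort", "rating": 4.0, "reviews": 856},
]


def rank_hotels(rows):
    """Sort by rating, then by review count, both highest first."""
    return sorted(rows, key=lambda row: (row["rating"], row["reviews"]), reverse=True)


ranked = rank_hotels(hotel_list)
for position, row in enumerate(ranked, start=1):
    print(f"{position}. {row['name']}: {row['rating']} ({row['reviews']} reviews)")
```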

Once our data is clean, we can analyze it further by drawing a scatter plot that compares the hotels’ ratings.
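The original figure is not available, so the sketch below shows one possible scatter plot using matplotlib, a library the source does not name. It plots rating against price with one labeled point per hotel; the choice of axes and the sample data are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt

# Made-up hotel data for illustration.
plot_data = [
    {"name": "Example Grand Hotel", "rating": 4.5, "price": 189.0},
    {"name": "Sample Beach Resort", "rating": 4.0, "price": 142.0},
    {"name": "Demo City Inn", "rating": 3.5, "price": 75.0},
]


def plot_rating_vs_price(rows):
    """Draw one labeled point per hotel and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    prices = [row["price"] for row in rows]
    ratings = [row["rating"] for row in rows]
    ax.scatter(prices, ratings)
    for row in rows:
        # Offset each label slightly so it does not cover its point.
        ax.annotate(row["name"], (row["price"], row["rating"]),
                    textcoords="offset points", xytext=(5, 5))
    ax.set_xlabel("Price per night (USD)")
    ax.set_ylabel("Rating (out of 5)")
    ax.set_title("Hotel rating vs. price")
    return fig, ax


fig, ax = plot_rating_vs_price(plot_data)
```

Calling `fig.savefig("hotel_ratings.png")` afterwards would write the chart to a file.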

CONCLUSION

Web scraping is a powerful tool for collecting data from websites.

Python makes it easy to scrape data from websites using a few different packages.

Once you have the data, you can use pandas to read it into a data frame for further analysis. Given a URL, you can scrape data from the page and store it in a pandas data frame.

You can access the data in the data frame using the pandas package.

We can also clean the data and remove unwanted data with regex.
