A Web Scraping Solution for a Multi-Subdomain Website
When you hear a term like “web scraping”, what comes to mind?
If you’re like me, you may think of an unethical spammer: one who sets out to scrape email addresses from hundreds or thousands of websites and then spams their owners relentlessly until they purchase the Viagra, Cialis or acai berry products being promoted. But just as electricity can power hospitals, convents and orphanages and can equally power grow ops or electrocute people, web scraping also has its good side; it all comes down to the intent of the user. This case study shows how our web scraping utility, Power Search, helped one of our clients get their work done more quickly and easily.
Practical Uses for Scraping the Web

One of our customers, who runs a website in the legal industry, recently wanted to use our Web2Disk product to make a copy of their website.
The only problem was that this was a huge site with multiple subdomains, and even the customer wasn’t sure how many subdomains existed. Out of the box, Web2Disk is designed to crawl a root URL but not its subdomains, because following every subdomain risks crawling the web endlessly. Imagine wanting to create a website backup of only wordpress.com but ending up crawling all the other subdomains like:
en.wordpress.com
themes.wordpress.com
en.support.wordpress.com
en.blog.wordpress.com
Web2Disk protects users from unintentionally making this mistake. It can be configured to crawl multiple domains and group them under one project, but only if the domains are added as additional root URLs. So what do you do when, like our Web2Disk customer, you don’t even know all of the URLs you need to back up? Our support team was on the case and soon came up with a solution using the “Swiss Army knife” of the Inspyder software suite: Power Search. Here’s how we did it.
Using Power Search to Find Unknown Subdomains

1) Set the Root URL to the site we wanted to scrape (www.yahoo.com).
2) Change the "Query Type" to "Wildcard Match". (We could also have used "Regular Expression", but Wildcard Match is simpler.)
3) We used the following query:

href="http://#Subdomain#.yahoo.com/*"

This will look for any text on the site like this:

href="http://______.yahoo.com/____"

and scrape the first matched part into a result column called "Subdomain".
4) Next, we turned on "Ignore Case" and "Include HTML". By default, Power Search ignores HTML content and searches only the visible text on the site. Because we wanted to search for an HTML snippet, turning on "Include HTML" is critical.
5) Finally, we went into "More Query Options" and made the following changes:

1. Turned ON Ignore Duplicate Matches
2. Turned OFF Show URL in Results
3. Turned OFF Include Context of Match

Show URL in Results is useful if you want to know which page something was found on, and Include Context of Match shows where on the page it was found. We just wanted a simple list with no duplicates, so context and URL were unnecessary.
6) Press Go! Power Search will crawl the site and pull out all the matches.

When you're setting up a new scrape, it's sometimes useful to leave the "Context" and "URL" data turned on while you fine-tune the query. We like to let it grab the first couple of matches before taking "the training wheels" off. Once we’re satisfied with the results, we stop, disable Context and URL, and then let it run the full scan.
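If you would rather see the idea in code, here is a rough Python sketch of what the wildcard query above does when applied to a single page of HTML. It is only an illustration, not how Power Search works internally: the function name find_subdomains, the regular expression, and the sample HTML are our own assumptions, and the sketch searches one string instead of crawling a site.

```python
import re


def find_subdomains(html, domain="yahoo.com"):
    """Return unique subdomain labels linked from html, in first-seen order."""
    # Approximate regex equivalent of: href="http://#Subdomain#.yahoo.com/*"
    pattern = re.compile(
        r'href="http://([a-z0-9.-]+)\.' + re.escape(domain) + r'/[^"]*"',
        re.IGNORECASE,  # plays the role of the "Ignore Case" option
    )
    seen = set()
    result = []
    for match in pattern.finditer(html):
        name = match.group(1).lower()
        if name not in seen:  # plays the role of "Ignore Duplicate Matches"
            seen.add(name)
            result.append(name)
    return result


# Hypothetical sample markup, made up for this example.
sample = (
    '<a href="http://news.yahoo.com/world">News</a>'
    '<a href="HTTP://Finance.yahoo.com/quote">Finance</a>'
    '<a href="http://news.yahoo.com/tech">Tech</a>'
    '<a href="http://example.com/">Other</a>'
)
print(find_subdomains(sample))  # ['news', 'finance']
```

Like the query in step 3, this only matches links written with http:// and a path after the host, so a real site may need a looser pattern. Each name it returns could then be added to Web2Disk as an additional root URL.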
Summary

Power Search solved a problem for our client that they didn’t even realize they had when they purchased Web2Disk. Not only did it provide the required data, it did so in minutes. For one low price, our customer could have solved this problem themselves and could continue to use Power Search indefinitely.
If you manage multiple sites, or have inherited one that has been through multiple updates and different managers over the years, Power Search can save you a significant amount of time. It will often pay for itself by solving just one problem for your business.
Power Search: Find the Needle in Your Haystack
Or Download the FREE Power Search Trial