search engines for software engineers by regan rose

Crawler Based Search Engine for Software Professionals

Chapter 1 Introduction

According to the Internet World Stats survey, 1.407 billion people were using the Internet as of March 31, 2008, and the number keeps growing day by day. The World Wide Web (commonly called the Web) is a system of interlinked hypertext documents accessed via the Internet. With a Web browser, a user views Web pages that may contain text, images, videos, and other multimedia and navigates between them using hyperlinks [1].

Difference between Web and Internet

It is easy to assume that the World Wide Web and the Internet are the same thing, but they are quite different. The Internet is a collection of interconnected computer networks, linked by copper wires, fiber-optic cables, wireless connections, etc. In contrast, the Web is a collection of interconnected documents and other resources, linked by hyperlinks and URLs. The World Wide Web is one of the services accessible via the Internet, along with various others including e-mail, file sharing, and online gaming. However, "the Internet" and "the Web" are commonly used interchangeably in non-technical settings.

1.1 Web Search: origins, today's usage, problems

In the beginning, there was a complete directory of the whole World Wide Web. These were the times when one could know all the existing servers on the web. Later, other web directories appeared, among them Yahoo, AltaVista, Lycos and Ask. These newer web directories kept a hierarchy of web pages based on their topics. Web directories are human-edited, which makes them very hard to maintain when the web is growing so fast. As a result, information retrieval techniques that had been developed for physical collections of documents, such as libraries, were put into practice on the web.

The first web search engines appeared in 1993. They did not keep information about the content of web pages; instead, they indexed only the titles of the pages. In 1994, web search engines started to index the whole content of web pages, so that the user could search the content and not only the title.

In 1998, Google appeared and changed everything. Its searches returned better results than those of earlier search engines because it considered the link structure of the web, not only its contents. The algorithm used to analyze the link structure was called PageRank. This algorithm introduced the concept of "citation" into the web: the more citations a web page has, the more important it is; furthermore, the more important the citing page is, the more important the cited page is. The information about citations was taken from the links in web pages.

Nowadays, web search engines are widely used, and their usage is still growing. As of November 2008, Google performed 7.23 billion searches.

Web search engines are used today by everyone with access to a computer, and those people have very different interests. Yet search engines return the same results regardless of who performs the search. Search results could be improved if more information about the user were considered [2].

1.2 Aim of the thesis

Web search engines have, broadly speaking, three basic phases: crawling, indexing and searching. Information about the user's interests can be taken into account in any of those three phases, depending on its nature. Work on search personalization already exists. To address the problem that search engines know nothing about the user and his interests, we have developed a system aimed only at software professionals that searches over a fixed number of seed hosts and generates results using our own algorithm and some prediction.

1.3 Web Search Engine

A web search engine is designed to search for information on the World Wide Web. The search results are usually presented in a list and are commonly called hits. The information may consist of web pages, images and other types of files. Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input [3].

An Internet search engine is an information retrieval system that helps us find information on the World Wide Web. The World Wide Web is a universe of information accessible over the network, and it facilitates global sharing of information. But the WWW is in effect an unstructured database, and it is growing exponentially into an enormous store of information. Searching for information on the web is hence a difficult task. There is a need for a tool to manage, filter and retrieve this ocean of information, and a search engine serves this purpose [4].

In the context of the Internet, search engines refer to the World Wide Web and not to other protocols or areas. Furthermore, search engines mine data available in newspapers, large databases, and open directories like DMOZ.org. Because the data collection is automated, they are distinguished from Web directories, which are maintained by people [5].

A search engine is a program designed to help find files stored on a computer, for example a public server on the World Wide Web, or one's own computer. The search engine allows one to ask for media content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of files that match those criteria. A search engine often uses a previously built and regularly updated index to look for files after the user has entered search criteria [6].

The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, the most popular currently being Google (with MSN Search and Yahoo! closely behind). There have been several attempts to create open-source search engines, among which are Htdig, Nutch, Egothor, and OpenFTS.

On the Internet, a search engine is a coordinated set of programs that includes [6]:

• A spider (also called a “crawler” or a “bot”) that goes to every page, or representative pages, on every Web site that wants to be searchable and reads it, using the hypertext links on each page to discover and read the site’s other pages
• A program that creates a huge index (sometimes called a “catalog”) from the pages that have been read
• A program that receives our search request, compares it to the entries in the index, and returns results to us
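The three programs listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the in-memory PAGES dictionary, its example URLs, and the function names crawl, build_index and search are all hypothetical, and a real spider would fetch pages over HTTP rather than read them from a dictionary.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real spider would fetch pages over HTTP instead.
PAGES = {
    "a.example/home": ("python tutorials for engineers",
                       ["a.example/docs", "b.example/blog"]),
    "a.example/docs": ("python reference documentation", ["a.example/home"]),
    "b.example/blog": ("notes on search engines", []),
    "c.example/orphan": ("unreachable python page", []),
}


def crawl(seed):
    """Spider: follow links breadth-first from the seed and return the pages read."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Query program: look the word up in the index and return matching URLs."""
    return sorted(index.get(word.lower(), set()))


index = build_index(crawl("a.example/home"))
print(search(index, "python"))
```

The page c.example/orphan never appears in the results because no crawled page links to it, which is the same reason unlinked pages stay invisible to a real crawler.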

An alternative to using a search engine is to explore a structured directory of topics. Yahoo, which also lets us use its search engine, is the most widely used directory on the Web. A number of Web portal sites offer both the search engine and directory approaches to finding information.

Are Search Engines and Directories the Same Thing?

Search engines and Web directories are not the same thing, although the term "search engine" is often used for both. Search engines automatically create web site listings by using spiders that "crawl" web pages, index their information, and optionally follow each site's links to other pages. Spiders return to already-crawled sites on a fairly regular basis in order to check for updates or changes, and everything that these spiders find goes into the search engine database. On the other hand, Web directories are databases of human-compiled results. Web directories are also known as human-powered search engines [7].

1.4 Different Search Engine Approaches

• Major search engines such as Google, Yahoo (which uses Google), AltaVista, and Lycos index the content of a large portion of the Web and provide results that can run for pages, and consequently overwhelm the user.
• Specialized content search engines are selective about what part of the Web is crawled and indexed. For example, TechTarget sites for products such as the AS/400 (http://www.search400.com) and CRM applications (http://www.searchCRM.com) selectively index only the best sites about these products and provide a shorter but more focused list of results.
• Ask Jeeves (http://www.ask.com) provides a general search of the Web but allows us to enter a search request in natural language, such as "What's the weather in Seattle today?"
• Special tools and some major Web sites such as Yahoo let us use a number of search engines at the same time and compile the results in a single list.
• Individual Web sites, especially larger corporate sites, may use a search engine to index and retrieve the content of just their own site. Some of the major search engine companies license or sell their search engines for use on individual sites [8].

1.4.1 Where to Search First

The last time we looked, the Open Directory Project listed 370 search engines available for Internet users. There are about ten major search engines, each with its own anchor Web site (although some have an arrangement to use another site's search engine or license their own search engine for use by other Web sites). Some sites, such as Yahoo, not only search using their own search engine but also give the results from simultaneous searches of other search indexes. Sites that let us search multiple indexes simultaneously include [8]:

• Yahoo (http://www.yahoo.com)
• search.com (http://search.com)
• Easy Searcher (http://www.easysearcher.com)

Yahoo first searches its own hierarchically structured subject directory and gives those entries. Then, it provides a few entries from the AltaVista search engine. It also launches a concurrent search for entries matching our search argument with six or seven other major search engines. We can link to each of them from Yahoo (at the bottom of the search result page) to see what the results were from each of these search engines. A significant advantage of a Yahoo search is that if we locate an entry in Yahoo, it's likely to lead to a Web site or entire categories of sites related to our search argument.

A search.com search primarily searches the Infoseek index first but also searches the other major search engines.

Easy Searcher lets us choose from either the popular search engines or a very comprehensive list of specialized search engine/databases in a number of fields.

Yahoo, search.com, and Easy Searcher all provide help with entering our search phrase. Most Web portal sites offer a quickly-located search entry box that connects us to the major search engines.

1.4.2 How to Search

By "How to Search," we mean a general approach to searching: what to try first, how many search engines to try, whether to search USENET newsgroups, when to quit. It's difficult to generalize, but this is the general approach we use at whatis.com [8]:

1. If we know of a specialized search engine such as Search Networking that matches our subject (for example, Networking), we'll save time by using that search engine. We'll find some specialized databases accessible from Easy Searcher 2.
2. If there isn't a specialized search engine, try Yahoo. Sometimes we'll find a matching subject category or two and that's all we'll need.
3. If Yahoo doesn't turn up anything, try AltaVista, Google, Hotbot, Lycos, and perhaps other search engines for their results. Depending on how important the search is, we usually don't need to go below the first 20 entries on each.
4. For efficiency, consider using a ferret that will use a number of search engines simultaneously for us.
5. At this point, if we haven't found what we need, consider using the subject directory approach to searching. Look at Yahoo or someone else's structured organization of subject categories and see if we can narrow down a category our term or phrase is likely to be in. If nothing else, this may give us ideas for new search phrases.
6. If we feel it's necessary, also search the Usenet newsgroups as well as the Web.
7. As we continue to search, keep rethinking our search arguments. What new approaches could we use? What are some related subjects to search for that might lead us to the one we really want?
8. Finally, consider whether our subject is so new that not much is available on it yet. If so, we may want to go out and check the very latest computer and Internet magazines or locate companies that we think may be involved in research or development related to the subject.

1.5 Historical Search Engine Information

During the early development of the web, there was a list of web servers edited by Tim Berners-Lee and hosted on the CERN web server. One historical snapshot from 1992 has remained. As more web servers went online the central list could not keep up. On the NCSA site new servers were announced under the title "What's New!"[3]

The very first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v." It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites.

The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation and Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.

In the summer of 1993, no search engine existed yet for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that would periodically mirror these pages and rewrite them into a standard format which formed the basis for W3Catalog, the web's first primitive search engine, released on September 2, 1993.

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called 'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine Aliweb appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.

JumpStation (released in December 1993) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below. Because of the limited resources available on the platform on which it ran, its indexing and hence searching were limited to the titles and headings found in the web pages the crawler encountered.

One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one to be widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory, rather than full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search. In 1996, Netscape was looking to give a single search engine an exclusive deal to be its featured search engine. There was so much interest that instead a deal was struck with Netscape by five of the major search engines, where for $5 million per year each search engine would be in a rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.

Around 2000, the Google search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal.
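The iterative ranking idea described above can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual implementation: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page example graph are all assumptions made for the example. Every linked page is assumed to appear as a key in the links dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for each page from its incoming links.

    links maps each page to the list of pages it links to; every linked
    page is assumed to be a key of the dictionary as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal parts to the pages it cites.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by every other page.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this small graph, page C is cited by all other pages and ends up with the highest rank, while page D, which nobody links to, ends up with the lowest.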

By 2000, Yahoo was providing search services based on Inktomi's search engine. Yahoo! acquired Inktomi in 2002 and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! switched to Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from LookSmart blended with results from Inktomi, except for a short time in 1999 when results from AltaVista were used instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).

Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.

According to Hitbox, Google's worldwide popularity peaked at 82.7% in December 2008. July 2009 rankings showed Google (78.4%) losing traffic to Baidu (8.87%) and Bing (3.17%). The market shares of Yahoo! Search (7.16%) and AOL (0.6%) were also declining. In the United States, Google held a 63.2% market share in May 2009, according to Nielsen NetRatings. In the People's Republic of China, Baidu held a 61.6% market share for web search in July 2009 [3].

1.6 Challenges faced by search engines

• The web is growing much faster than any present-technology search engine can possibly index.
• Many web pages are updated frequently, which forces the search engine to revisit them periodically.
• The queries one can make are currently limited to searching for keywords, which may result in many false positives.
• Dynamically generated sites may be slow or difficult to index, or may result in excessive results from a single site.
• Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the invisible web.
• Some search engines do not order the results by relevance, but rather according to how much money the sites have paid them.
• Some sites use tricks to manipulate the search engine to display them as the first result returned for some keywords. This can lead to some search results being polluted, with more relevant links being pushed down in the result list [5].

1.7 Types of Search Engines

In the early 2000s, more than 1,000 different search engines were in existence, although most Web masters focused their efforts on getting good placement in the leading 10. This, however, was easier said than done. InfoWorld explained that the process was more art than science, requiring continuous adjustments and tweaking, along with regularly submitting pages to different engines for good or excellent results. The reason for this is that every search engine works differently. Not only are there different types of search engines—those that use spiders to obtain results, directory-based engines, and link-based engines—but engines within each category are unique. They each have different rules and procedures companies need to follow in order to register their site with the engine.

The term "search engine" is often used generically to describe crawler-based search engines, human-powered directories, and hybrid search engines. These types of search engines gather their listings in different ways, through crawler-based searches, human-powered directories, and hybrid searches [9].

1.7.1 Crawler-based search engines

Crawler-based search engines, such as Google, create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If web pages are changed, crawler-based search engines eventually find these changes, and that can affect how those pages are listed. Page titles, body copy and other elements all play a role.

The life span of a typical web query normally lasts less than half a second, yet involves a number of different steps that must be completed before results can be delivered to a person seeking information. The following graphic (Figure 1.7.1) illustrates this life span:

1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match the query.

2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.

3. The search results are returned to the user in a fraction of a second.

Figure 1.7.1: The life span of a typical web query

Steps of crawler-based search engines [10]:

1. Web crawling: Search engines use a special program called a robot or spider which crawls (travels) the web from one page to another. It visits the popular sites and then follows each link available at that site.
2. Information collection: The spider records all the words and their respective positions on the visited web page. Some search engines do not consider common words such as articles ('a', 'an', 'the') and prepositions ('of', 'on').
3. Build index: After collecting all the data, search engines build an index to store that data so that users can access pages quickly. Different search engines use different approaches to indexing; because of this, different search engines give different results for the same query. Some important considerations for building indexes include the frequency with which a term appears in a web page, the part of the web page where the term appears, and the font size of the term (and whether it is capitalized). In addition, Google ranks a page higher if more pages vote for it (that is, link to it).
4. Data encoding: Before the indexing information is stored in databases, it is encoded into a reduced size to speed up the response time of the search engine.
5. Store data: The last step is to store this indexing information in databases.

1.7.2 Human-powered directories

A human-powered directory, such as the Open Directory Project, depends on humans for its listings. (Yahoo!, which used to be a directory, now gets its information from the use of crawlers.) A directory gets its information from submissions, which include a short description of the entire site written for the directory, or from editors who write one for sites they review. A search looks for matches only in the descriptions submitted. Changing web pages, therefore, has no effect on how they are listed. Techniques that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site [9].

Open Directory is one such directory and submission depends on a human to actually submit a website. The submitter must provide website information including a proper title and description. Open Directory's editors may write their own description of our site, or re-write the information submitted. They have total control over our submission.

When we submit a website to a human-powered directory, we must follow the rules and regulations set forth by that specific directory. While following its directives, we must submit the most appropriate information needed by potential internet users. A good site with good content has a greater chance of being reviewed and accepted by a human-powered directory.

Human Search Method

From Bessed's perspective, a human-powered search engine finds useful sites, attempts to rank them by usefulness, and attempts to find answers for "long-tail" searches that a directory never would. A human-powered search engine also doesn't care about hierarchies: there's no infrastructure that says we have to drill down to Business and Industry_Apparel_Shoes_Crocs in order to find sites that sell Crocs. We just create a list of sites where the searcher can find Crocs, which is all the searcher wants. Also, our goal is to update searches to weed out dated material that would sit in a directory forever. And we would never charge for inclusion [11].

1.7.3 Hybrid search engines

Today, it is extremely common for crawler-type and human-powered results to be combined when conducting a search. Usually, a hybrid search engine will favor one type of listing over another. For example, MSN Search is more likely to present human-powered listings from LookSmart. However, it also presents crawler-based results, especially for more obscure queries [9].

1.7.4 Meta-search engines

A meta-search engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Meta-search engines enable users to enter search criteria once and access several search engines simultaneously. Meta-search engines operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately. The term "meta-search" is frequently used to classify a set of commercial search engines, but is also used to describe the paradigm of searching multiple data sources in real time. The National Information Standards Organization (NISO) uses the terms Federated Search and Meta-search interchangeably to describe this web search paradigm [12].

Figure 1.7.4: Architecture of a meta-search engine

Operation

Meta-search engines create what is known as a virtual database. They do not compile a physical database or catalogue of the web. Instead, they take a user's request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm.

No two meta-search engines are alike. Some search only the most popular search engines while others also search lesser-known engines, newsgroups, and other databases. They also differ in how the results are presented and the quantity of engines that are used. Some will list results according to search engine or database. Others return results according to relevance, often concealing which search engine returned which results. This benefits the user by eliminating duplicate hits and grouping the most relevant ones at the top of the list. Search engines frequently have different ways they expect requests submitted. For example, some search engines allow the usage of the word "AND" while others require "+" and others require only a space to combine words. The better meta-search engines try to synthesize requests appropriately when submitting them.
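The two tasks described above, rewriting a request in each engine's syntax and compiling the results into one list without duplicates, can be sketched as follows. This is a hedged illustration only: the engine names, the style labels, and the canned result lists are hypothetical, and scoring each URL by the sum of 1/rank is one simple merging rule chosen for the example, not the algorithm of any particular meta-search engine.

```python
def translate_query(terms, style):
    """Combine search terms in the syntax a given engine expects."""
    if style == "and":
        return " AND ".join(terms)
    if style == "plus":
        return " ".join("+" + term for term in terms)
    if style == "space":
        return " ".join(terms)
    raise ValueError(f"unknown query style: {style}")


def merge_results(results_by_engine):
    """Merge ranked URL lists into one list without duplicates.

    Each URL scores 1/rank for every engine that returned it, so a URL
    found near the top by several engines rises to the top of the merge.
    """
    scores = {}
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical engines and canned results; a real meta-search engine
# would send the translated query to each engine over the network.
engines = {"engine_a": "and", "engine_b": "plus", "engine_c": "space"}
for name, style in engines.items():
    print(name, "->", translate_query(["web", "crawler"], style))

merged = merge_results({
    "engine_a": ["x.example", "y.example"],
    "engine_b": ["y.example", "z.example"],
    "engine_c": ["y.example", "x.example"],
})
print(merged)
```

Because the merged list no longer records which engine returned which URL, it also shows how relevance-ordered output conceals the source of each hit.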

Quality of results

Results can vary between meta-search engines based on a large number of variables. Still, even the most basic meta-search engine will allow more of the web to be searched at once than any one stand-alone search engine. On the other hand, the results are said to be less relevant, since a meta-search engine can’t know the internal “alchemy” a search engine does on its result (a meta-search engine does not have any direct access to the search engines’ database).

Meta-search engines are sometimes used in vertical search portals, and to search the deep web.

1.8 How web search engines work

A search engine operates in the following order [3]:

1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query can be a single word. The purpose of an index is to allow information to be found as quickly as possible.

Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of link rot, and Google's handling of it increases usability by satisfying the user's expectation that the search terms will be on the returned web page, in keeping with the principle of least astonishment. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that is no longer available elsewhere.

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
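The crawling and analysis steps described above (honoring robots.txt exclusions, then extracting words from titles, headings and meta tags) can be sketched with Python's standard library. The robots.txt rules, the sample page, the user agent name demo-bot, and the helper names allowed, extract_fields and FieldExtractor are assumptions made for this illustration; only urllib.robotparser.RobotFileParser and html.parser.HTMLParser are real standard-library classes.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class FieldExtractor(HTMLParser):
    """Collect words from the title, headings, and keywords meta tag."""

    FIELDS = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.current = None  # field we are currently inside, if any
        self.words = {"title": [], "heading": [], "meta": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.FIELDS:
            self.current = "title" if tag == "title" else "heading"
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.words["meta"] += [w.strip().lower()
                                   for w in content.split(",") if w.strip()]

    def handle_endtag(self, tag):
        if tag in self.FIELDS:
            self.current = None

    def handle_data(self, data):
        # Text outside the tracked fields (for example body paragraphs) is skipped.
        if self.current:
            self.words[self.current] += data.lower().split()


def allowed(robots_lines, user_agent, url):
    """Return True if the robots.txt rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def extract_fields(html):
    """Return the words found in the title, headings, and keywords meta tag."""
    extractor = FieldExtractor()
    extractor.feed(html)
    return extractor.words


# Hypothetical robots.txt rules and page content.
ROBOTS = ["User-agent: *", "Disallow: /private/"]
PAGE = """<html><head><title>Crawler Basics</title>
<meta name="keywords" content="crawler, index"></head>
<body><h1>How Spiders Work</h1><p>Body text is ignored here.</p></body></html>"""

print(allowed(ROBOTS, "demo-bot", "http://site.example/index.html"))
print(allowed(ROBOTS, "demo-bot", "http://site.example/private/page.html"))
print(extract_fields(PAGE))
```

Keeping the three fields separate, rather than pooling all words, would let an indexer weight a match in the title differently from a match in a heading or meta tag.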
The index is built from the information stored with the data and the method by which the information is indexed. Unfortunately, there are currently no known public search engines that allow documents to be searched by date. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search; the engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the search involves statistical analysis of pages containing the words or phrases we search for. Finally, natural language queries allow the user to type a question in the same form one would ask it of a human; Ask.com is an example of such a site.

All search engines follow this basic process when conducting searches, but because search engines differ, the results are bound to differ depending on which engine we use [7]:

1. The searcher types a query into a search engine.
2. Search engine software quickly sorts through literally millions of pages in its database to find matches to this query.
3. The search engine's results are ranked in order of relevancy.
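The Boolean operators and the proximity search mentioned above map naturally onto set operations over an index that records word positions. The sketch below is a minimal illustration under assumptions: the three sample documents and the function names build_positional_index, docs_with and near are hypothetical, and real engines use far more elaborate query parsing and index structures.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
DOCS = {
    1: "open source search engine written in java",
    2: "search tips for the java programmer",
    3: "engine repair manual",
}


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} so word distance can be checked."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def docs_with(index, word):
    """Return the set of document ids that contain the word."""
    return set(index.get(word, {}))


def near(index, first, second, max_distance):
    """Proximity search: documents where both words occur within max_distance words."""
    hits = set()
    for doc_id in docs_with(index, first) & docs_with(index, second):
        if any(abs(p - q) <= max_distance
               for p in index[first][doc_id]
               for q in index[second][doc_id]):
            hits.add(doc_id)
    return hits


index = build_positional_index(DOCS)
print(docs_with(index, "search") & docs_with(index, "java"))    # search AND java
print(docs_with(index, "search") | docs_with(index, "engine"))  # search OR engine
print(set(DOCS) - docs_with(index, "java"))                     # NOT java
print(near(index, "search", "engine", 1))                       # adjacent words only
```

AND, OR and NOT become set intersection, union and difference, while the proximity check needs the stored positions, which is why the information collection step records where each word appears and not only whether it appears.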

What follows is a basic explanation of how search engines work [13].
• Keyword Searching
• Refining Our Search
• Relevancy Ranking
• Meta Tags

Search engines use automated software programs known as spiders or bots to survey the Web and build their databases. Web documents are retrieved by these programs and analyzed. Data collected from each web page are then added to the search engine index. When we enter a query at a search engine site, our input is checked against the search engine's index of all the web pages it has analyzed. The best URLs are then returned to us as hits, ranked in order with the best results at the top.

1.8.1 Keyword Searching

This is the most common form of text search on the Web. Most search engines do their text query and retrieval using keywords. What is a keyword, exactly? It can simply be any word on a webpage. For example, I used the word "simply" in the previous sentence, making it one of the keywords for this particular webpage in some search engine's index. However, since the word "simply" has nothing to do with the subject of this webpage (i.e., how search engines work), it is not a very useful keyword. Useful keywords and key phrases for this page would be "search," "search engines," "search engine methods," "how search engines work," "ranking," "relevancy," "search engine tutorials," etc. Those keywords would actually tell a user something about the subject and content of this page.

Unless the author of the Web document specifies the keywords for her document (this is possible by using Meta tags), it is up to the search engine to determine them. Essentially, this means that search engines pull out and index words that appear to be significant. Since search engines are software programs, not rational human beings, they work according to rules established by their creators for what words are usually important in a broad range of documents. The title of a page, for example, usually gives useful information about the subject of the page (if it doesn't, it should!).
Words that are mentioned towards the beginning of a document (think of the "topic sentence" in a high school essay, where we lay out the subject we intend to discuss) are given more weight by most search engines. The same goes for words that are repeated several times throughout the document. Some search engines index every word on every page. Others index only part of the document. Full-text indexing systems generally pick up every word in the text except commonly occurring stop words such as "a," "an," "the," "is," "and," "or," and "www." Some search engines distinguish upper case from lower case; others store all words without reference to capitalization.

The Problem with Keyword Searching

Keyword searches have a tough time distinguishing between words that are spelled the same way but mean something different (e.g., hard cider, a hard stone, a hard exam, and the hard drive on our computer). This often results in hits that are completely irrelevant to our query. Some search engines also have trouble with so-called stemming: if I enter the word "big," should they return a hit on the word "bigger"? What about singular and plural words? What about verb tenses that differ from the word we entered by only an "s" or an "ed"? Search engines also cannot return hits on keywords that mean the same thing but are not actually entered in our query. A query on heart disease would not return a document that used the word "cardiac" instead of "heart."

1.8.2 Refining Search

Most sites offer two different types of searches: "basic" and "refined" or "advanced." In a "basic" search, I just enter a keyword without sifting through any pull-down menus of additional options. Depending on the engine, though, "basic" searches can be quite complex. Advanced search refining options differ from one search engine to another, but some of the possibilities include the ability to search on more than one word, to give more weight to one search term than to another, and to exclude words that might be likely to muddy the results. We might also be able to search on proper names, on phrases, and on words that are found within a certain proximity to other search terms. Some search engines also allow us to specify what form we'd like our results to appear in, and whether we wish to restrict our search to certain fields on the Internet (e.g., Usenet or the Web) or to specific parts of Web documents (e.g., the title or URL). Many, but not all, search engines allow us to use so-called Boolean operators to refine our search. These are the logical terms AND, OR, NOT, and the so-called proximal locators, NEAR and FOLLOWED BY.
Boolean AND means that all the terms we specify must appear in the documents, e.g., "heart" AND "attack." We might use this if we wanted to exclude common hits that would be irrelevant to our query. Boolean OR means that at least one of the terms we specify must appear in the documents, e.g., bronchitis, acute OR chronic. We might use this if we didn't want to rule out too much. Boolean NOT means that the term following NOT must not appear in the documents. We might use this if we anticipated results that would be totally off-base, e.g., nirvana AND Buddhism, NOT Cobain.

Not quite Boolean: + and -. Some search engines use the characters + and - instead of Boolean operators to include and exclude terms. NEAR means that the terms we enter should be within a certain number of words of each other. FOLLOWED BY means that one term must directly follow the other. ADJ, for adjacent, serves the same function. A search engine that allows us to search on phrases uses, essentially, the same method (i.e., determining adjacency of keywords).
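The Boolean operators above map naturally onto set operations over a keyword index. The following is only a minimal sketch under stated assumptions: the stop-word list, the four sample documents, and the function names (`build_index`, `search_and`, `search_or`, `search_not`) are all hypothetical and are not taken from any particular search engine.

```python
import re

# Hypothetical stop-word list; real engines use their own, larger lists.
STOP_WORDS = {"a", "an", "the", "is", "and", "or"}


def tokenize(text):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def search_and(index, *terms):
    """Boolean AND: documents that contain every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Boolean OR: documents that contain at least one term."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result


def search_not(index, candidates, term):
    """Boolean NOT: remove documents that contain the excluded term."""
    return candidates - index.get(term.lower(), set())


# Made-up sample documents, for illustration only.
DOCS = {
    1: "Nirvana in Buddhism is the end of suffering",
    2: "Kurt Cobain founded the band Nirvana",
    3: "Acute bronchitis usually clears within weeks",
    4: "Chronic bronchitis is a long-term condition",
}
INDEX = build_index(DOCS)

# nirvana AND Buddhism, NOT Cobain
hits = search_not(INDEX, search_and(INDEX, "nirvana", "buddhism"), "cobain")
```

Representing each posting list as a set keeps the three operators to one line each; a production index would more likely use sorted lists so that positions (needed for NEAR and phrase queries) can be stored alongside the document ids.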

Phrases: The ability to query on phrases is very important in a search engine. Those that allow it usually require that we enclose the phrase in quotation marks, e.g., "space the final frontier."

Capitalization: This is essential for searching on proper names of people, companies or products. Unfortunately, many words in English are used both as proper and common nouns (Bill, bill, Gates, gates, Oracle, oracle, Lotus, lotus, Digital, digital); the list is endless.

All the search engines have different methods of refining queries. The best way to learn them is to read the help files on the search engine sites and practice!

1.8.3 Relevancy Rankings

Most search engines return results with confidence or relevancy rankings. In other words, they list the hits according to how closely they think the results match the query. However, these lists often leave users shaking their heads in confusion, since to the user the results may seem completely irrelevant. Why does this happen? Basically, it is because search engine technology has not yet reached the point where humans and computers understand each other well enough to communicate clearly.

Most search engines use search term frequency as a primary way of determining whether a document is relevant. If we're researching diabetes and the word "diabetes" appears multiple times in a Web document, it is reasonable to assume that the document will contain useful information. Therefore, a document that repeats the word "diabetes" over and over is likely to turn up near the top of our list. If our keyword is a common one, or if it has multiple other meanings, we could end up with a lot of irrelevant hits. And if our keyword is a subject about which we desire information, we don't need to see it repeated over and over; it is the information about that word that we're interested in, not the word itself.
Some search engines consider both the frequency and the positioning of keywords to determine relevancy, reasoning that if the keywords appear early in the document, or in the headers, this increases the likelihood that the document is on target. For example, one method is to rank hits according to how many times our keywords appear and in which fields they appear (e.g., in headers, titles or plain text). Another method is to determine which documents are most frequently linked to by other documents on the Web. The reasoning here is that if other folks consider certain pages important, we should, too. If we use the advanced query form on AltaVista, we can assign relevance weights to our query terms before conducting a search. Although this takes some practice, it essentially allows us to have a stronger say in what results we will get back. As far as the user is concerned, relevancy ranking is critical, and becomes more so as the sheer volume of information on the Web grows. Most of us don't have the time to sift through scores of hits to determine which hyperlinks we should actually explore. The more clearly relevant the results are, the more we're likely to value the search engine.
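The frequency-and-position method described above can be illustrated with a small scoring function. This is a sketch, not any engine's real formula: the field names and the weights in `FIELD_WEIGHTS` are assumptions chosen only to show that a hit in a title can count for more than a hit in plain text.

```python
import re

# Hypothetical field weights: a title hit counts more than a header hit,
# which counts more than a hit in plain body text.
FIELD_WEIGHTS = {"title": 3.0, "headers": 2.0, "body": 1.0}


def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    return re.findall(r"[a-z0-9]+", text.lower()).count(term.lower())


def score(doc, query_terms):
    """Sum of (field weight x term frequency) over all fields and terms."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = doc.get(field, "")
        for term in query_terms:
            total += weight * count_term(text, term)
    return total


def rank(docs, query_terms):
    """Return names of matching documents, best score first."""
    scored = [(score(doc, query_terms), name) for name, doc in docs.items()]
    matching = [(s, name) for s, name in scored if s > 0]
    return [name for s, name in sorted(matching, key=lambda p: (-p[0], p[1]))]


# Made-up documents, for illustration only.
DOCS = {
    "a": {"title": "Diabetes care guide",
          "headers": "Managing diabetes",
          "body": "Diet and exercise help control diabetes."},
    "b": {"title": "Healthy recipes",
          "headers": "",
          "body": "Recipes for people with diabetes. Diabetes friendly desserts."},
    "c": {"title": "Gardening", "headers": "", "body": "Tomatoes"},
}
ranking = rank(DOCS, ["diabetes"])
```

Document "b" repeats the keyword more often in its body than "a" does, yet "a" ranks first because its hits fall in the title and header fields. This also shows the weakness noted in the text: raw frequency alone rewards repetition, not usefulness.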

1.8.4 Information on Meta Tags

Some search engines are now indexing Web documents by the Meta tags in the documents' HTML (at the beginning of the document, in the so-called "head" tag). What this means is that the Web page author can have some influence over which keywords are used to index the document, and even over the description of the document that appears when it comes up as a search engine hit. This is obviously very important if we are trying to draw people to our website based on how our site ranks in search engines' hit lists. There is no perfect way to ensure that we'll receive a high ranking. Even if we do get a great ranking, there's no assurance that we'll keep it for long. For example, at one period a page from the Spider's Apprentice was the number-one-ranked result on AltaVista for the phrase "how search engines work." A few months later, however, it had dropped lower in the listings.

There is a lot of conflicting information out there on meta-tagging. If we're confused, it may be because different search engines look at Meta tags in different ways. Some rely heavily on Meta tags; others don't use them at all. The general opinion seems to be that Meta tags are less useful than they were a few years ago, largely because of the high rate of spam-indexing (web authors using false and misleading keywords in the Meta tags). It seems to be generally agreed that the "title" and the "description" Meta tags are important to write effectively, since several major search engines use them in their indices. Use relevant keywords in the title, and vary the titles on the different pages that make up the website, in order to target as many keywords as possible. As for the "description" Meta tag, some search engines will use it as their short summary of our URL, so make sure the description is one that will entice surfers to our site.
Note: The "description" Meta tag is generally held to be the most valuable, and the most likely to be indexed, so pay special attention to this one. In the keyword tag, list a few synonyms for keywords, or foreign translations of keywords (if we anticipate traffic from foreign surfers). Make sure the keywords refer to, or are directly related to, the subject or material on the page. Do NOT use false or misleading keywords in an attempt to gain a higher ranking for our pages. The "keyword" Meta tag has been abused by some webmasters. For example, a recent ploy has been to put such words as "mp3" into keyword Meta tags, in hopes of luring searchers to one's website by using popular keywords. The search engines are aware of such deceptive tactics, and have devised various methods to circumvent them, so be careful. Use keywords that are appropriate to our subject, and make sure they appear in the top paragraphs of actual text on our webpage. Many search engine algorithms score the words that appear towards the top of our document more highly than the words that appear towards the bottom. Words that appear in HTML header tags (H1, H2, H3, etc.) are also given more weight by some search engines. It sometimes helps to give our page a file name that makes use of one of our prime keywords, and to include keywords in the "alt" attributes of image tags.
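To make the discussion of Meta tags concrete, the sketch below shows one way an indexer might read the title, "description" and "keywords" fields from a page's head. It uses Python's standard `html.parser` module; the class name `MetaTagParser`, the helper `extract_meta` and the sample page are illustrative assumptions, and how a real search engine weighs these fields is not specified here.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the <title> text and the name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            # Meta names are case-insensitive, so store them lower-cased.
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return the title, description and keyword list found in the page."""
    parser = MetaTagParser()
    parser.feed(html)
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": keywords,
    }


# Made-up page, for illustration only.
PAGE = """<html><head>
<title>How Search Engines Work</title>
<meta name="description" content="A tutorial on search engine methods.">
<meta name="keywords" content="search engines, ranking, relevancy">
</head><body><h1>Search</h1></body></html>"""

info = extract_meta(PAGE)
```

Because anything in these tags is supplied by the page author, an indexer that trusts them blindly is open to the spam-indexing abuse described above, which is one reason engines may discount or ignore the keywords field.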

One thing we should not do is use some other company's trademarks in our Meta tags. Some website owners have been sued for trademark violations because they've used other company names in the Meta tags. We have, in fact, testified as an expert witness in such cases. We do not want the expense of being sued! Remember that all the major search engines have slightly different policies. If we're designing a website and meta-tagging our documents, it is worth taking the time to check what the major search engines say in their help files about how they each use Meta tags. We might want to optimize our Meta tags for the search engines we believe are sending the most traffic to our site.

Chapter 2 Search Engine with Web Crawling

In the previous chapter we briefly discussed the vast expansion occurring in the World Wide Web. As the number of web pages around the world increases day by day, the need for search engines has also emerged. In this chapter, we explain the basic components of a basic search engine along with how it works. After this, the role of web crawlers, one of the essential components of any search engine, is explained.

2.1 Basic Web Search Engine

The plentiful content of the World-Wide Web is useful to millions. Some simply browse the Web through entry points such as Yahoo, MSN, etc. But many information seekers use a search engine to begin their Web activity [14]. In this case, users submit a query, typically a list of keywords, and receive a list of Web pages that may be relevant, typically pages that contain the keywords. By "search engine" in relation to the Web, we usually mean the actual search form that searches through databases of HTML documents. Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's Meta tags, and also follow the links that the site connects to, indexing all linked Web sites as well. The crawler returns all that information to a central repository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.

2.2 Structure & Working of Search Engine

The basic structure of any crawler-based search engine is shown in Figure 2.2. The main steps in any search engine are described below [14].

Figure 2.2: Generic Structure of a Search Engine

2.2.1 Gathering, also called "Crawling"

Every engine relies on a crawler module to provide the grist for its operation. This operation is performed by special software called "crawlers." Crawlers are small programs that browse the Web on the search engine's behalf, similarly to how a human user would follow links to reach different pages. The programs are given a starting set of URLs, whose pages they retrieve from the Web. The crawlers extract URLs appearing in the retrieved pages, and give this information to the crawler control module. This module determines what links to visit next, and feeds the links to visit back to the crawlers.

2.2.2 Maintaining Database/Repository

All the data of the search engine is stored in a database, as shown in Figure 2.2. All searching is performed through that database, and it needs to be updated frequently. During a crawling process, and after completing it, search engines must store all the new useful pages that they have retrieved from the Web. The page repository (collection) in Figure 2.2 represents this possibly temporary collection. Sometimes search engines maintain a cache of the pages they have visited beyond the time required to build the index. This cache allows them to serve out result pages very quickly, in addition to providing basic search facilities.

2.2.3 Indexing

Once the pages are stored in the repository, the next job of the search engine is to make an index of the stored data. The indexer module extracts all the words from each page, and records the URL where each word occurred. The result is a generally very large "lookup table" that can provide all the URLs that point to pages where a given word occurs. The table is of course limited to the pages that were covered in the crawling process. As mentioned earlier, text indexing of the Web poses special difficulties, due to its size and its rapid rate of change. In addition to these quantitative challenges, the Web calls for some special, less common kinds of indexes. For example, the indexing module may also create a structure index, which reflects the links between pages.

2.2.4 Querying

This section deals with user queries. The query engine module is responsible for receiving and fulfilling search requests from users. The engine relies heavily on the indexes, and sometimes on the page repository. Because of the Web's size, and the fact that users typically enter only one or two keywords, result sets are usually very large.

2.2.5 Ranking

Since a user query produces a large number of results, it is the job of the search engine to display the most appropriate results to the user. To make searching efficient, the results are ranked. The ranking module therefore has the task of sorting the results such that results near the top are the most likely to be what the user is looking for. Once the ranking is done by the ranking component, the final results are displayed to the user. This is how any search engine works.

2.3 Web Crawling

A spider, also known as a robot or a crawler, is actually just a program that follows, or "crawls", links throughout the Internet, grabbing content from sites and adding it to search engine indexes. Spiders can only follow links from one page to another and from one site to another. That is the primary reason why links to the site (inbound links) are so important. Links to the website from other websites will give the search engine spiders more "food" to chew on. The more times they find links to the site, the more times they will stop by and visit. Google especially relies on its spiders to create its vast index of listings. Spiders find Web pages by following links from other Web pages, but we can also submit our Web pages directly to a search engine or directory and request a visit by its spider. In fact, it is a good idea to manually submit our site to a human-edited directory such as Yahoo; usually spiders from other search engines (such as Google) will then find it and add it to their database. It can be useful to submit our URL straight to the various search engines as well, but spider-based engines will usually pick up our site regardless of whether or not we've submitted it to a search engine [15].

Figure 2.3: "Spiders" take a Web page's content and create key search words that enable online users to find pages they're looking for.

2.3.1 A Survey of Web Crawlers

Web crawlers are almost as old as the web itself. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Several papers about web crawling were presented at the first two World Wide Web conferences. However, at the time, the web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's web. Obviously, all of the popular search engines use crawlers that must

scale up to substantial portions of the web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler.

The original Google crawler [4] (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes. Each crawler process ran on a different machine, was single-threaded, and used asynchronous I/O to fetch data from up to 300 web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk. The pages were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URL resolver process read the link file, converted the relative URLs contained therein to absolute form, and saved the absolute URLs to the disk file that was read by the URL server. Typically, three to four crawler machines were used, so the entire system required between four and eight machines. Research on web crawling continued at Stanford even after Google was transformed into a commercial effort.

The Internet Archive also used multiple machines to crawl the web. Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler. Each single-threaded crawler process read a list of seed URLs for its assigned sites from disk into per-site queues, and then used asynchronous I/O to fetch pages from these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it. If a link referred to the site of the page it was contained in, it was added to the appropriate site queue; otherwise it was logged to disk.
Periodically, a batch process merged these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.

2.3.2 Basic Crawling Terminology

Before we discuss the working of crawlers, it is worth explaining some of the basic terminology related to crawlers. These terms will be used in the forthcoming chapters as well [14].

2.3.2.1 Seed Page: By crawling, we mean traversing the Web by recursively following links from a starting URL or a set of starting URLs. This starting URL set is the entry point through which any crawler starts its search procedure. This set of starting URLs is known as the "Seed Page". The selection of a good seed is the most important factor in any crawling process.

2.3.2.2 Frontier (Processing Queue): The crawling method starts with a given URL (seed), extracting links from it and adding them to a list of un-visited URLs. This list of un-visited links or URLs is known as the "Frontier". Each time, a URL is picked from the frontier by the Crawler Scheduler. The frontier is implemented using a queue or priority queue data structure. The maintenance of the frontier is also a major function of any crawler.

2.3.2.3 Parser: Once a page has been fetched, we need to parse its content to extract information that will feed and possibly guide the future path of the crawler. Parsing may imply simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree. The job of any parser is to parse the fetched web page, extract the list of new URLs from it, and return the new unvisited URLs to the frontier.
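The parser role described above (extract the URLs from a fetched page and hand the unvisited ones back to the frontier) can be sketched with Python's standard library. This is an illustrative sketch only: the names `LinkParser` and `extract_new_links` are hypothetical, and it covers the "simple hyperlink extraction" case, not the tag-tree analysis also mentioned.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collect absolute http(s) URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)


def extract_new_links(base_url, html, visited):
    """Return the page's links that are not yet visited, without duplicates."""
    parser = LinkParser(base_url)
    parser.feed(html)
    new_links = []
    for url in parser.links:
        if url not in visited and url not in new_links:
            new_links.append(url)
    return new_links


# Made-up page and visited set, for illustration only.
PAGE_URL = "http://example.com/dir/page.html"
PAGE_HTML = (
    '<a href="other.html">x</a> <a href="/top.html#sec">y</a> '
    '<a href="mailto:a@b.c">m</a> <a href="other.html">dup</a>'
)
VISITED = {"http://example.com/top.html"}

frontier_additions = extract_new_links(PAGE_URL, PAGE_HTML, VISITED)
```

Resolving relative links and stripping fragments before the visited check matters: without it, the same page reached through `/top.html` and `/top.html#sec` would enter the frontier twice.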

2.4 Working of Basic Web Crawler

From the beginning, a key motivation for designing Web crawlers has been to retrieve web pages and add them or their representations to a local repository. Such a repository may then serve particular application needs, such as those of a Web search engine. In its simplest form, a crawler starts from a seed page and then uses the external links within it to reach other pages. The structure of a basic crawler is shown in Figure 2.4(a). The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached. Behind this simple description lies a host of issues related to network connections and to parsing fetched HTML pages to find new URL links.

Figure 2.4(a): Components of a web-crawler

A common web crawler implements a method composed of the following steps:
• Acquire the URL of the next web document from the processing queue
• Download the web document
• Parse the document's content to extract the set of URL links to other resources and update the processing queue
• Store the web document for further processing

The basic working of a web-crawler can be described as follows:
• Select a starting seed URL or URLs
• Add it to the frontier
• Pick a URL from the frontier
• Fetch the web page corresponding to that URL
• Parse that web page to find new URL links
• Add all the newly found URLs to the frontier
• Repeat while the frontier is not empty
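The seven steps above can be sketched as a short sequential loop. This is a minimal sketch under stated assumptions, not a production crawler: the `fetch` argument is a hypothetical callable (URL in, HTML text or `None` out) so the example can run against an in-memory "web" instead of the network, the regular expression is a deliberately simplistic stand-in for a real HTML parser, and `max_pages` is an added safety limit that the steps do not mention.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplistic link pattern, for illustration only; a real parser is more robust.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch, max_pages=100):
    """Sequential crawl; returns a dict mapping each fetched URL to its HTML."""
    frontier = deque(seeds)        # steps 1-2: put the seed URLs in the frontier
    visited = set()
    repository = {}
    while frontier and len(repository) < max_pages:  # step 7: until empty
        url = frontier.popleft()   # step 3: pick a URL from the frontier
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)          # step 4: fetch the page
        if html is None:           # page missing or unreachable
            continue
        repository[url] = html
        for href in HREF_RE.findall(html):   # step 5: parse for new links
            link = urljoin(url, href)
            if link not in visited:
                frontier.append(link)        # step 6: add them to the frontier
    return repository


# Made-up three-page site, for illustration only.
FAKE_WEB = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://site.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://site.test/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://site.test/"], FAKE_WEB.get)
```

Using a first-in, first-out queue as the frontier gives a breadth-first crawl; swapping it for a priority queue is how an ordering by importance would be plugged in. The `visited` set is what keeps the loop from cycling between pages that link to each other.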

Thus a crawler will recursively keep on adding newer URLs to the database repository of the search engine. So we can see that the main function of a crawler is to add new links to the frontier and to select a new URL from the frontier for further processing after each recursive step. The working of the crawlers can also be shown in the form of a flowchart (Figure 2.4(b)). Note that it also depicts the 7 steps given earlier [16]. Such crawlers are called sequential crawlers because they follow a sequential approach. In simple form, the flowchart of a web crawler can be stated as below:

Figure 2.4(b): Sequential flow of a Crawler

2.5 Crawling Policies

There are important characteristics of the Web that make crawling very difficult [17]:
• Its large volume,
• Its fast rate of change, and
• Dynamic page generation.

These characteristics combine to produce a wide variety of possible crawlable URLs. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

The number of pages being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next. The behavior of a Web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download,
• A re-visit policy that states when to check for changes to the pages,
• A politeness policy that states how to avoid overloading Web sites, and
• A parallelization policy that states how to coordinate distributed Web crawlers.
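The photo-gallery arithmetic earlier in this section can be checked by enumerating the combinations. In the sketch below, the parameter names, their values and the base URL are all invented for illustration; the `canonical` helper additionally assumes that none of the four parameters changes which photos are shown, which a crawler could not know in general.

```python
from itertools import product
from urllib.parse import urlencode, urlsplit

# Hypothetical GET parameters for the photo-gallery example.
OPTIONS = {
    "sort": ["name", "date", "size", "rating"],   # four ways to sort
    "thumb": ["small", "medium", "large"],        # three thumbnail sizes
    "format": ["jpg", "png"],                     # two file formats
    "user_content": ["on", "off"],                # user-provided content toggle
}


def gallery_urls(base="http://gallery.example/photos"):
    """Enumerate every URL that the option combinations can produce."""
    names = list(OPTIONS)
    urls = []
    for combo in product(*OPTIONS.values()):
        urls.append(base + "?" + urlencode(dict(zip(names, combo))))
    return urls


def canonical(url):
    """Drop the query string, assuming (for this example only) that it
    affects presentation and never the underlying content."""
    return urlsplit(url)._replace(query="").geturl()


urls = gallery_urls()
unique_content = {canonical(u) for u in urls}
```

Here 48 distinct URLs collapse to a single canonical address, which is the duplicate-content problem in miniature. Adding one more two-valued option would double the count again, which is why the text describes the combinations as endless.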

2.5.1 Selection Policy

Given the current size of the Web, even large search engines cover only a portion of the publicly available Internet; a 1999 study by Steve Lawrence and Lee Giles showed that no search engine indexed more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages and not just a random sample of the Web. This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL

(the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Cho et al. made the first study on policies for crawling scheduling. Their data set was a 180,000-page crawl from the stanford.edu domain, in which a crawling simulation was done with different strategies. The ordering metrics tested were breadth-first, backlink count and partial PageRank calculations. One of the conclusions was that if the crawler wants to download pages with high PageRank early during the crawling process, then the partial PageRank strategy is the best, followed by breadth-first and backlink count. However, these results are for just a single domain.

Najork and Wiener performed an actual crawl on 328 million pages, using breadth-first ordering. They found that a breadth-first crawl captures pages with high PageRank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates".

Abiteboul designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" that is distributed equally among the pages it points to. It is similar to a PageRank computation, but it is faster and is only done in one step. An OPIC-driven crawler downloads first the pages in the crawling frontier with higher amounts of "cash". Experiments were carried out on a 100,000-page synthetic graph with a power-law distribution of in-links.
However, there was no comparison with other strategies or experiments in the real Web.

Boldi et al. used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy. The comparison was based on how well PageRank computed on a partial crawl approximates the true PageRank value. Surprisingly, some visits that accumulate PageRank very quickly (most notably, breadth-first and the omniscient visit) provide very poor progressive approximations.

Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.

Daneshpajouh et al. designed a community-based algorithm for discovering good seeds. Their method crawls web pages with high PageRank from different communities in fewer iterations than a crawl starting from random seeds. One can extract good seeds from a previously crawled Web graph using this method. Using these seeds, a new crawl can be very effective.

2.5.1.1 Restricting followed links

A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may examine the URL and only request a resource if the URL ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp,

.jspx or a slash. This strategy may cause numerous HTML Web resources to be unintentionally skipped. Some crawlers may also avoid requesting any resources that have a "?" in them (are dynamically produced) in order to avoid spider traps that may cause the crawler to download an infinite number of URLs from a Web site. This strategy is unreliable if the site uses URL rewriting to simplify its URLs.

2.5.1.2 Path-ascending Crawling

Some crawlers intend to download as many resources as possible from a particular web site. So the path-ascending crawler was introduced, which ascends to every path in each URL that it intends to crawl. For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling. Many path-ascending crawlers are also known as Harvester software, because they are used to "harvest" or collect all the content (perhaps the collection of photos in a gallery) from a specific page or host.

2.5.1.3 Focused crawling

The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers. The concepts of topical and focused crawling were first introduced by Menczer and by Chakrabarti et al. The main problem in focused crawling is that, in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Diligenti et al.
propose to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general Web search engine for providing starting points.

2.5.1.4 Crawl the Deep Web

A vast number of Web pages lie in the deep or invisible Web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find these pages if there are no links that point to them. Google's Sitemap Protocol and mod_oai are intended to allow discovery of these deep-Web resources. Deep Web crawling also multiplies the number of Web links to be crawled. Some crawlers only take some of the URLs that appear in <a href="URL"> form. In some cases, such as the Googlebot, Web crawling is done on all text contained inside the hypertext content, tags, or text.

2.5.2 Re-visit Policy

The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

$$F_p(t) = \begin{cases} 1 & \text{if } p \text{ is equal to the local copy at time } t \\ 0 & \text{otherwise} \end{cases}$$

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:

$$A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \text{modification time of } p & \text{otherwise} \end{cases}$$

Coffman et al. worked with a definition of the objective of a Web crawler that is equivalent to freshness, but used a different wording: they proposed that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of Web crawling can be modeled as a multiple-queue, single-server polling system, in which the Web crawler is the server and the Web sites are the queues. Page modifications are the arrival of the customers, and switch-over times are the interval between page accesses to a single Web site. Under this model, mean waiting time for a customer in the polling system is equivalent to the average age for the Web crawler.

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are out-dated, while in the second case, the crawler is concerned with how old the local copies of pages are.

Two simple re-visiting policies were studied by Cho and Garcia-Molina:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

(In both cases, the repeated crawling order of pages can be done either in a random or a fixed order.)
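
The two policies can be compared with a few lines of arithmetic. The sketch below is not taken from Cho and Garcia-Molina; it assumes that each page changes as a Poisson process and is re-visited at evenly spaced intervals, in which case the time-averaged freshness of a page with change rate λ and visit rate f is (1 - e^(-λ/f)) / (λ/f). The change rates, the visit budget and all function names are made-up values for illustration.

```python
import math


def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (assumes Poisson changes, evenly spaced visits)."""
    if visit_rate <= 0:
        return 0.0
    ratio = change_rate / visit_rate  # expected number of changes between two visits
    if ratio == 0:
        return 1.0
    return (1.0 - math.exp(-ratio)) / ratio


def average_freshness(change_rates, visit_rates):
    """Average the per-page freshness over the whole collection."""
    values = [expected_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(values) / len(values)


def uniform_policy(change_rates, budget):
    """Every page gets the same share of the visit budget."""
    return [budget / len(change_rates)] * len(change_rates)


def proportional_policy(change_rates, budget):
    """Each page is visited in proportion to its change rate."""
    total = sum(change_rates)
    return [budget * c / total for c in change_rates]


# Hypothetical collection: one page changing 9 times a day, one changing once a day.
change_rates = [9.0, 1.0]
budget = 10.0  # hypothetical total number of page visits per day

uniform = average_freshness(change_rates, uniform_policy(change_rates, budget))
proportional = average_freshness(change_rates, proportional_policy(change_rates, budget))
print(f"uniform: {uniform:.3f}  proportional: {proportional:.3f}")
```

With these made-up numbers the uniform allocation yields the higher average freshness, because the proportional allocation spends most of its budget on the page that changes too fast to be kept fresh.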

Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. The explanation for this result comes from the fact that, when a page changes too often, the crawler will waste time by trying to re-crawl it too fast and still will not be able to keep its copy of the page fresh. To improve freshness, the crawler should penalize the elements that change too often.

The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimal policy is closer to the uniform policy than to the proportional policy: as Coffman et al. note, "in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible". Explicit formulas for the re-visit policy are not attainable in general, but they are obtained numerically, as they depend on the distribution of page changes. Cho and Garcia-Molina show that the exponential distribution is a good fit for describing page changes, while Ipeirotis et al. show how to use statistical tools to discover parameters that affect this distribution.

Note that the re-visiting policies considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), something that is not a realistic scenario, so further information about the Web page quality should be included to achieve a better crawling policy.

2.5.3 Politeness Policy

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site.
Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers. As noted by Koster, the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:

• Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
• Server overload, especially if the frequency of accesses to a given server is too high;
• Poorly-written crawlers, which can crash servers or routers, or which download pages they cannot handle; and
• Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. More recently, commercial search engines such as Ask Jeeves, MSN and Yahoo have been able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests.
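
As an illustration, Python's standard library ships a parser for this protocol. The following is a minimal sketch, not the crawler described in this report: the robots.txt content, the host example.com and the crawler name "ExampleBot" are all made up, and the file is parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an invented crawler name used only for this sketch.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests, or None

print(allowed, blocked, delay)
```

A polite crawler would consult `can_fetch` before every request and sleep for at least `delay` seconds between two requests to the same host.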

The first proposed interval between connections was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire Web site; also, only a fraction of the resources from that Web server would be used. This does not seem acceptable.

Cho uses 10 seconds as an interval for accesses, and the WIRE crawler uses 15 seconds as the default. The Mercator Web crawler follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page. Dill et al. use 1 second.

For those using Web crawlers for research purposes, a more detailed cost-benefit analysis is needed and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl. Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3 to 4 minutes. It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading Web servers, some complaints from Web server administrators are received. Brin and Page note that: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of e-mail and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen."

2.5.4 Parallelization Policy

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.

2.6

URL Normalization

Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component [17].

2.7

Crawler Identification

Web crawlers typically identify themselves to a Web server by using the User-agent field of an HTTP request. Web site administrators typically examine their Web servers' logs and use the User-agent field to determine which crawlers have visited the Web server and how often. The User-agent field may include a URL where the Web site administrator may find out more information about the crawler. Spam bots and other malicious Web crawlers are unlikely to place identifying information in the User-agent field, or they may mask their identity as a browser or other well-known crawler.
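
A minimal sketch of such self-identification with Python's standard library is shown below. The crawler name "ExampleCrawler" and the information URL are invented for the illustration, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Hypothetical identification string: crawler name, version and a page describing it.
USER_AGENT = "ExampleCrawler/0.1 (+http://example.com/crawler-info.html)"


def build_request(url):
    """Create an HTTP request that announces the crawler in the User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://example.com/")
# urllib stores header names in capitalized form, hence "User-agent" here.
print(request.get_header("User-agent"))
```

The request object could then be passed to `urllib.request.urlopen`; the value sent is what the administrator sees in the server log.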

It is important for Web crawlers to identify themselves so that Web site administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or they may be overloading a Web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators who are interested in knowing when they may expect their Web pages to be indexed by a particular search engine [17].

2.8

Crawler-Based Search Engine Criteria

Comparing and understanding the differences in crawling-type search engines can greatly assist a web designer in writing and coding the pages. The table below provides an accurate and concise comparison of the major crawling search engines and their criteria for sorting and ranking query results. Each of the terms is defined following the table [18]. Crawling

Yes No Notes AllTheWeb, AltaVista, Deep Crawl Google, Inktomi Teoma Frames Support All n/a robots.txt All n/a Meta Robots All n/a Tag Paid Inclusion All but... Google Full Body Text All n/a Some stop words may not be indexed AltaVista, Stop Words FAST Teoma unknown Inktomi, Google All provide some support, but Meta AltaVista, AllTheWeb and Teoma Description make most use of the tag AllTheWeb, Meta Keywords Inktomi, Teoma AltaVista, Teoma support is "unofficial" Google AltaVista, AllTheWeb, ALT text Google, Inktomi Teoma Comments Inktomi Others

Chapter 3 Indexing

3.1

Search Engine Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is Web indexing [19].

Figure 3.1: The general structure and flow of a search engine.

Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

3.2

Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval [19].

3.2.1 Index Design Factors

Major factors in designing a search engine's architecture include:

Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.
Storage techniques: How to store the index data, that is, whether information should be data compressed or filtered.
Index size: How much computer storage is required to support the index.
Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
Maintenance: How the index is maintained over time.
Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

3.2.2 Index Data Structures

Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include:

Suffix tree: Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. The suffix tree is a type of trie. Tries support extendable hashing, which is important for search engine indexing. Used for searching for patterns in DNA sequences and clustering. A major drawback is that the storage of a word in the tree may require more storage than storing the word itself. An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression such as the BWT algorithm.
Inverted index: Stores a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
Citation index: Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
N-gram index: Stores sequences of a fixed length n of data to support other types of retrieval or text mining.
Document-term matrix: Used in latent semantic analysis; stores the occurrences of words in documents in a two-dimensional sparse matrix.

3.2.3 Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.

3.2.4 Inverted indices

Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted Index
Word | Documents
the | Document 1, Document 3, Document 4, Document 5
cow | Document 2, Document 3, Document 4
says | Document 5
moo | Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a Boolean index. Such an index determines which documents match a query but does not rank matched documents.
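
A Boolean inverted index of this kind can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: the three sample documents are made up, words are split on white space, and a query matches a document only if every query word occurs in it.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents containing every word of the query (Boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample corpus used only for this sketch.
documents = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
index = build_inverted_index(documents)
print(search_all(index, "the cow"))
```

The result is an unordered set of matching documents; as the text notes, nothing in this structure supports ranking them.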
In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table.

3.2.5 Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts: the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.

3.2.6 The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document | Words
Document 1 | the, cow, says, moo
Document 2 | the, cat, and, the, hat
Document 3 | the, dish, ran, away, with, the, spoon
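
The inversion step can be sketched as follows, using the three documents from the table above. This is a toy in-memory illustration, not the procedure of any particular search engine: the forward index is flattened into (word, document) pairs, the pairs are sorted by word, and each run of equal words becomes one entry of the inverted index. The function name `invert` is made up.

```python
from itertools import groupby

forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping by sorting pairs."""
    # A set removes duplicates such as the two occurrences of "the" in one document.
    pairs = {(word, doc) for doc, words in forward.items() for word in words}
    inverted = {}
    # Sorting puts identical words next to each other, so groupby can collect them.
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        inverted[word] = [doc for _, doc in group]
    return inverted


inverted_index = invert(forward_index)
print(inverted_index["the"])
```

At scale the sort would typically be an external, disk-based sort, but the principle is the same.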

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.

3.2.7 Compression

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine.

• An estimated 2,000,000,000 different web pages exist as of the year 2000
• Suppose there are 250 words on each webpage (based on the assumption they are similar to the pages of a novel)
• It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character
• The average number of characters in any given word on a page may be estimated at 5 (Wikipedia: Size comparisons)
• The average personal computer comes with 100 to 250 gigabytes of usable space
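
The storage estimate that follows from these assumptions is simple arithmetic. The short calculation below uses only the figures listed above; the variable names are arbitrary, and a gigabyte is taken to be 10^9 bytes, which is an assumption of this sketch.

```python
pages = 2_000_000_000      # estimated number of web pages
words_per_page = 250       # assumed words on each page
chars_per_word = 5         # estimated average word length
bytes_per_char = 1         # 8-bit encoding

word_entries = pages * words_per_page
index_bytes = word_entries * chars_per_word * bytes_per_char
index_gigabytes = index_bytes / 10**9  # decimal gigabytes (assumption)

# Number of personal computers needed if each offers 100 GB of usable space.
computers_needed = index_gigabytes / 100

print(word_entries, index_gigabytes, computers_needed)
```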

Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression. Notably, large-scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.

3.3

Document Parsing

Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang. Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets [19].

3.3.1 Challenges in Natural Language Processing

Word Boundary Ambiguity: Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by white space. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).

Language Ambiguity: To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.

Diverse File Formats: In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.

Faulty Storage: The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

3.3.2 Tokenization

Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer, parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex. During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs.
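
As a rough illustration, a tokenizer for white-space-delimited text can be written with a single regular expression. The sketch below is deliberately simplistic and is not how production search engines tokenize: the patterns for URLs and email addresses are loose approximations chosen for the example, and the names `Token`, `TOKEN_PATTERN` and `tokenize` are made up.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str      # the characters of the token
    kind: str      # "url", "email", "word" or "punct"
    position: int  # character offset in the source text


# Alternatives are tried in order, so URLs and emails win over plain words.
TOKEN_PATTERN = re.compile(r"""
    (?P<url>https?://[^\s<>"]*[^\s<>".,;:!?)])
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)
  | (?P<word>\w+(?:'\w+)?)
  | (?P<punct>[^\w\s])
""", re.VERBOSE)


def tokenize(text):
    """Split text into tokens, recording each token's kind and position."""
    return [
        Token(match.group(), match.lastgroup, match.start())
        for match in TOKEN_PATTERN.finditer(text)
    ]


sample = "Mail bob@example.com or see http://example.com/a, thanks."
tokens = tokenize(sample)
for token in tokens:
    print(token)
```

White space is skipped rather than emitted, which is one reason this approach does not carry over to languages that do not delimit words with spaces.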
When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.

3.3.3 Language Recognition

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.

3.3.4 Format Analysis

If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:

• HTML
• ASCII text files (a text document without specific computer readable formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• UseNet Netnews server formats
• XML and derivatives like RSS
• SGML
• Multimedia meta data formats like ID3
• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
• A few other common image file formats
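
For HTML, the first format in the list, separating content from markup can be sketched with the parser in Python's standard library. This is a minimal illustration only: the class name `TextExtractor` is made up, script and style blocks are simply dropped, and joining text fragments with spaces can split a word that contains inline tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document and ignore the tags."""

    SKIPPED = {"script", "style"}  # elements whose content is not document text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string, ready for tokenization."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


page = "<html><head><style>b{color:red}</style></head><body><p>The <b>cow</b> says moo</p></body></html>"
print(extract_text(page))
```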

Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser. Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

• ZIP - Zip archive file
• RAR - Roshal Archive File
• CAB - Microsoft Windows Cabinet File
• Gzip - File compressed with gzip
• BZIP - File compressed using bzip2
• Tape ARchive (TAR), Unix archive file, not (itself) compressed
• TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files compressed with Compress, GZIP or BZIP2
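
The decompression step can be illustrated for the ZIP format with Python's standard library. The sketch below builds a small archive in memory so that it is self-contained; the file names and contents are invented, members are assumed to be UTF-8 text, and `iter_archive_documents` is a hypothetical helper name.

```python
import io
import zipfile


def iter_archive_documents(archive_bytes):
    """Yield (name, text) for every file in a ZIP archive, one document per member."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to index
            # Assumption: members are UTF-8 text; undecodable bytes are replaced.
            yield info.filename, archive.read(info).decode("utf-8", errors="replace")


# Build a hypothetical two-file archive in memory for the example.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "the cow says moo")
    archive.writestr("docs/b.txt", "the cat and the hat")

documents = dict(iter_archive_documents(buffer.getvalue()))
print(sorted(documents))
```

Each yielded pair would then be passed through format analysis and tokenization as a separate document.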

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

• Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
• Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

3.3.5 Section Recognition

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain primary material (that which the document is about). For example, a web article may display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:

• Content in different sections is treated as related in the index, when in reality it is not.
• Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.

Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

3.3.6 Meta Tag Indexing

Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was the hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.

As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe web pages (many of which were corporate-oriented web pages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the web page high in the search results for specific search queries. Because these keywords were subjectively specified, the practice led to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a web page before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.
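The early practice of indexing only the meta keywords, without parsing the full document, can be sketched roughly as follows. This is an illustration under simple assumptions (keywords are comma-separated and case is ignored); MetaKeywordParser and meta_keywords are hypothetical names, and the sample page is invented.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects name/content pairs from <meta> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def meta_keywords(html):
    """Return the normalized keyword list declared in the page's meta tags."""
    parser = MetaKeywordParser()
    parser.feed(html)
    parser.close()
    raw = parser.meta.get("keywords", "")
    return [k.strip().lower() for k in raw.split(",") if k.strip()]


sample = (
    "<html><head><title>Demo</title>"
    '<meta name="keywords" content="Education, Online Teaching, NCATE">'
    "</head></html>"
)
print(meta_keywords(sample))
```

Because the page author supplies these keywords directly, nothing in this approach checks them against the visible content, which is why it was open to the spamdexing described above.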
In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full text index.

Chapter 4 Page Ranking and Searching

4.1 Page Ranking

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work [3].

Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search engine results make money by running search related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads [3].

Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. Search engines determine relevance by following a set of rules, known as an algorithm. Exactly how a search engine’s algorithm works is not disclosed to the public. However, the following general rules apply to all search engines, and can be categorized as “on the page” (more controllable) and “off the page” (less controllable) factors [20].

"On the Page" Factors

Location and Frequency of Keywords

One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Location involves searching for pages with the search terms appearing in the HTML title tags, which are assumed to be most relevant. Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. For instance, the figure below contains an example of HTML coding positioned within the head tag of the EDTECH website.
Notice that the title tag includes important words as to the content of the webpage/site [21]:

<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Educational Technology at Boise State University</title>
<meta name="keywords" content="education, educator, educational, educational technology, instructional, instructional technology, graduate, graduate certificate, graduate certificates, masters degree, master's degree, online masters degree, online master's degree, educational research, instructional theory, integration, integrating technology, technology integration, multimedia, evaluation, assessment, authentic assessment, teaching online, online teaching, graduate certificate, problem based learning, problem-based learning, instructional theory, learning theory, online, Internet, internet, asynchronous, interactive, technology, constructivist, constructivism, accredited, regionally accredited, national council for accreditation of teacher education, NCATE">
<meta name="Microsoft Border" content="l">
</head>

Frequency is the other major factor in how search engines determine ranking. A search engine will analyze how often keywords appear in relation to other words in a web page. Those with a higher frequency are often deemed more relevant than other web pages. While all major search engines follow the above procedure to some degree, they each have their own specific criteria. Some search engines index more web pages than others. That is why when search terms are inserted in different search engines, different results occur. Search engines may also penalize pages or exclude them from the index, if they detect search engine "spamming."
This occurs when a word is repeated hundreds of times on a page to increase the frequency and put the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including responding to complaints from their users. While web designers can control the above coding and design of their websites, there are additional factors in search engine criteria that are less controllable, often called “off the page” factors.

"Off the Page" Factors

Link Analysis

In order to maintain an accurate representation of web pages indexed in a search, search engines also use link analysis to determine relevance. By analyzing how pages link to each other, a search engine can often determine what a page is about and whether that page might be important, resulting in a rank increase. This is considered an “off the page” factor, as it cannot be as easily controlled and manipulated by web designers [22].

Click through Measurement

Another “off the page” factor is click through measurement. This refers to the way search engines watch what someone selects from a list of search results. Search engines will eventually drop high-ranking pages that do not attract clicks, while promoting lower-ranking pages that generate more. As with link analysis, search engines have systems in place that will identify artificial clicks generated by unethical web designers [23].

4.2 Search Engine Ranking Algorithms

Search engine ranking algorithms are closely guarded secrets, for at least two reasons: search engine companies want to protect their methods from their competitors, and they also want to make it difficult for web site owners to manipulate their ranking [23]. That said, a specific page's relevance ranking for a specific query currently depends on three factors:

• Its relevance to the words and concepts in the query
• Its overall link popularity
• Whether or not it is being penalized for excessive search engine optimization (SEO).

Examples of SEO abuse would be a lot of sites linked to each other in a circular scam, or excessive and highly ungrammatical stuffing with keywords. Factor #2 was innovated by Google with Page Rank. Essentially, the more incoming links a page has, the better. But it is more complicated than that: Page Rank is a tricky concept because it is circular, as follows. Every page on the Internet has a minimum Page Rank score just for existing. 85% of this Page Rank (at least, that is the best known estimate, based on an early paper) is passed along to the pages that page links to, divided more or less equally along its outgoing links. A page's Page Rank is the sum of the minimum value plus all the Page Rank passed to it via incoming links. Although this is circular, mathematical algorithms exist for calculating it iteratively.
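The iterative calculation mentioned above can be sketched as a simple power iteration. This is a minimal illustration, not Google's actual implementation: the function name pagerank, the normalization (scores sum to 1), the even redistribution from pages without outgoing links, and the three-page example graph are all assumptions made for this sketch. The damping value 0.85 corresponds to the 85% figure in the text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate Page Rank for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a minimum score just for existing.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass 85% of this page's rank equally along its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a and b link to each other, c links only to a.
links = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example page c has no incoming links, so it keeps only the minimum score, while page a ranks highest because it receives rank from both b and c.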

In one final complication, the description above applies to "raw Page Rank." Google actually reports Page Rank scores of 0 to 10 that are believed to be based on the logarithm of raw Page Rank (they are reported as whole numbers), and the base of that logarithm is believed to be approximately 6. There are about 30 sites on the Web with Page Rank 10, including Yahoo, Google, Microsoft, Intel, and NASA. IBM, AOL, and CNN, by way of contrast, were only at Page Rank 9 as of early 2004.

Further refinements in link popularity rankings are under development. Notably, link popularity can be made specific to a subject or category; i.e., pages can have different Page Ranks for health vs. sports vs. computers, and so on. Supposedly, AskJeeves/Teoma already works that way. It is believed that Inktomi, AltaVista, et al. use link popularity in their ranking algorithms, but to a much lesser extent than Google. Yahoo, owner of Inktomi, AltaVista, and Alltheweb, is rolling out a new search engine, which reportedly includes a feature called Web Rank.

Keyword Search

Most search engines handle words and simple phrases. In its simplest form, text search looks for pages with lots of occurrences of each of the words in a query, stop words aside. The more common a word is on a page, compared with its frequency in the overall language, the more likely that page will appear among the search results. Hitting all the words in a query is a lot better than missing some. Search engines also make some efforts to “understand” what is meant by the query words. For example, most search engines now offer optional spelling correction. Increasingly, they search not just on the words and phrases actually entered; they also use stemming to search for alternate forms of the words (e.g., speak, speaker, speaking, spoke). Teoma-based engines are also offering refinement by category, à la the now-defunct Northern Light.
However, Excite-like concept search has otherwise not made a comeback yet, since the concept categories are too unstable. When ranking results, search engines give special weight to keywords that appear:

• High up on the page
• In headings
• In BOLDFACE (at least in Inktomi)
• In the URL
• In the title (important)
• In the description
• In the ALT tags for graphics
• In the generic keywords meta tags (only for Inktomi, and only a little bit even for them)
• In the link text for inbound links
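The idea of giving special weight to keywords by location can be sketched as a weighted term-frequency score. Since real ranking algorithms are not disclosed, everything here is an assumption made for illustration: the FIELD_WEIGHTS values, the field names, the small stop word list and the score function are hypothetical.

```python
import re

# Hypothetical weights: locations that are harder to fake count for more.
FIELD_WEIGHTS = {
    "title": 3.0,
    "inbound_link_text": 3.0,
    "headings": 2.0,
    "url": 1.5,
    "body": 1.0,
}

# Illustrative stop word list only.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, page):
    """Sum the weighted relative frequency of each query term in each field."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(page.get(field, ""))
        if not tokens:
            continue
        for term in terms:
            # Relative frequency: occurrences divided by field length.
            total += weight * tokens.count(term) / len(tokens)
    return total


page_one = {"title": "Search engine", "body": "About crawlers"}
page_two = {"title": "Cooking", "body": "A search engine recipe"}
print(score("search engine", page_one) > score("search engine", page_two))
```

A page with the query terms in its title scores higher here than one that mentions them only in the body, mirroring the "(important)" note on the title in the list above.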

More weight is put on the factors that the site owner would find it awkward to fake, such as inbound link text, page title (which shows up on the SERP, the Search Engine Results Page), and description.

How sites get into search engines

The base case is that spiders crawl the entire Web, starting from known pages and following all links, and also crawling pages that are hand-submitted. Google is pretty much like that still. If a site has high Page Rank, it is spidered more often and more deeply. However, search engines are trying to encourage site owners to pay for the privilege of having their pages spidered. Teoma's index is very hard to get into without paying money, and Inktomi's is not that easy either. And even if we do get into Inktomi for free, they will take a long time to respider, while if we pay they respider constantly. One advantage of being respidered often is that we can tweak our page to come up higher in their relevancy rankings, and then see if our changes worked.

Finally, we can also pay to appear on a search page. That is, our link will appear when someone searches on a specific keyword or key phrase. Google does a good job of making it clear which results (at the top or on the right of the page) are paid; others do a less clear job. Paid search results are typically all pay-per-click, based on keyword. The advertiser pays the search engine vendor a specific amount of money each time an ad is clicked on, this fee having been determined by an auction of each keyword or key phrase.

4.3 Web Search Query

A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are unstructured and often ambiguous; they vary greatly from standard query languages, which are governed by strict syntax rules [24].

4.3.1 Types of Search Query

There are three broad categories that cover most web search queries:

• Informational Queries – Queries that cover a broad topic (e.g., Colorado or trucks) for which there may be thousands of relevant results.
• Navigational Queries – Queries that seek a single website or web page of a single entity (e.g., youtube or delta airlines).
• Transactional Queries – Queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.

Search engines often support a fourth type of query that is used far less frequently:

• Connectivity Queries – Queries that report on the connectivity of the indexed web graph (e.g., which links point to this URL, and how many pages are indexed from this domain name?).

4.3.2 Characteristics

Most commercial web search engines do not disclose their search logs, so information about what users are searching for on the Web is difficult to come by. Nevertheless, a 2001 study that analyzed queries from the Excite search engine showed some interesting characteristics of web search:

• The average length of a search query was 2.4 terms.
• About half of the users entered a single query, while a little less than a third of users entered three or more unique queries.
• Close to half of the users examined only the first one or two pages of results (10 results per page).
• Less than 5% of users used advanced search features (e.g., Boolean operators like AND, OR, and NOT).
• The top four most frequently used terms were (empty search), and, of, and sex.

A study of the same Excite query logs revealed that 19% of the queries contained a geographic term (e.g., place names, zip codes, geographic features, etc.). A 2005 study of Yahoo's query logs revealed that 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. In addition, much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually. This example of the Pareto principle (or 80-20 rule) allows search engines to employ optimization techniques such as index or database partitioning, caching and pre-fetching [24].

4.3.3 Structured Queries

With search engines that support Boolean operators and parentheses, a technique traditionally used by librarians can be applied. A user who is looking for documents that cover several topics or facets may want to describe each of them by a disjunction of characteristic words, such as vehicles OR cars OR automobiles. A faceted query is a conjunction of such facets; e.g. a query such as (electronic OR computerized OR DRE) AND (voting OR elections OR election OR balloting OR electoral) is likely to find documents about electronic voting even if they omit one of the words "electronic" and "voting", or even both [24].

4.3.4 Web Query Classification

Web query topic classification/categorization is a problem in information science. The task is to assign a Web search query to one or more predefined categories, based on its topics. The importance of query classification is underscored by many services provided by Web search. A direct application is to provide better search result pages for users with interests of different categories. For example, users issuing the Web query “apple” might expect to see Web pages related to the fruit apple, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on the query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Unlike in document classification tasks, queries submitted by Web search users are usually short and ambiguous; also, the meanings of the queries evolve over time. Therefore, query topic classification is much more difficult than traditional document classification tasks [25].

KDDCUP 2005

The KDDCUP 2005 competition highlighted the interest in query classification. The objective of this competition was to classify 800,000 real user queries into 67 target categories. Each query can belong to more than one target category. As an example of a QC task, given the query “apple”, it should be classified into ranked categories: “Computers \ Hardware; Living \ Food & Cooking” [25].
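The faceted query from Section 4.3.3 can be sketched as a conjunction of disjunctions over a document's words. This is a minimal illustration, assuming whitespace tokenization and case-insensitive matching; the matches function and the three sample documents are hypothetical, while the two facets are taken from the example query in the text.

```python
def matches(facets, text):
    """True if the text contains at least one term from every facet.

    Each facet is a set of alternative terms (an OR group);
    the facets together form an AND of those groups.
    """
    words = set(text.lower().split())
    return all(
        any(term.lower() in words for term in facet)
        for facet in facets
    )


# (electronic OR computerized OR DRE) AND
# (voting OR elections OR election OR balloting OR electoral)
facets = [
    {"electronic", "computerized", "DRE"},
    {"voting", "elections", "election", "balloting", "electoral"},
]

docs = [
    "computerized balloting systems were tested",
    "electronic music festival",
    "DRE machines and election security",
]

hits = [doc for doc in docs if matches(facets, doc)]
print(hits)
```

The first and third documents match even though neither contains both "electronic" and "voting", which is the behaviour the faceted query is meant to provide.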

Query