
IS531 - Ch 8

Modern Information Retrieval

Indexing and Searching

Presented by: Raed Ibrahim Al-Fayez, Ali Sulaiman Al-Humaimidi
Supervised by: Dr. Mourad Ykhlef


Contents

 8.1 Introduction
 8.2 Inverted Files
 8.3 Other Indices for Text
 8.4 Boolean Queries
 8.5 Sequential Searching
 8.6 Pattern Matching
 8.7 Structural Queries
 8.8 Compression
 8.9 Trends and Research Issues


8.1 Introduction (1)

 Options in searching for basic queries:
– Sequential/online text searching: finding the occurrences of a pattern in a text when the text is not preprocessed.
• Good when the text is small (a few MB) and when the index overhead cannot be afforded.
– Indexed searching: building data structures over the text (indices) to speed up the search.
• Good when the text is large or huge and semi-static (not updated often).


8.1 Introduction (3)

 Main indexing techniques:
– Inverted files (keyword-based search): the best choice for most applications.
– Suffix arrays/trees: faster for phrase searches, but harder to build and maintain.
– Signature files: were popular in the mid-1980s, but inverted files have since taken their place.

 For each technique, pay attention to:
– search cost and space overhead,
– construction cost and maintenance cost.


8.1 Introduction (4)

 An index should be built and stored in a data structure before searching:
– Basic data structures: sorted arrays, binary search trees, B-trees, hash tables, tries, Patricia trees, etc.

 Trie (from retrieval):
– Multi-way trees that store a set of strings and can retrieve any of them quickly, in time depending on the string's length.
– Every edge of the tree is labeled with a letter.
– Used for storing strings over an alphabet.
– Used in dictionaries (a, an, and, etc.).


8.2 Inverted files (1)

 Definition:
– A word-oriented mechanism for indexing a text collection in order to speed up the searching task.
– Also called an inverted index.

 Composed of 2 elements:
– Vocabulary: the set of all different words in the text.
– Occurrences: for each word, a list of all the text positions where the word appears.
• The positions can refer to words or characters.


8.2 Inverted files (2)

 A sample text and an inverted index built on it:

Text (character positions of each word): 1 This  6 is  9 a  11 text.  17 A  19 text  24 has  28 many  33 words.  40 Words  46 are  50 made  55 from  60 letters

Inverted index:
Vocabulary  Occurrences
letters     60…
made        50…
many        28…
text        11, 19…
words       33, 40…
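The figure above can be reproduced in a few lines. This is a minimal sketch: the `build_inverted_index` helper and the stopword set are illustrative, not part of the slides.

```python
import re

def build_inverted_index(text, stopwords=()):
    """Map each indexed word to the list of its (1-based) character positions."""
    index = {}
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in stopwords:
            continue
        # 1-based positions, to match the slide's figure
        index.setdefault(word, []).append(match.start() + 1)
    return index

text = "This is a text. A text has many words. Words are made from letters"
stop = {"this", "is", "a", "has", "are", "from"}  # words the slide leaves unindexed
index = build_inverted_index(text, stop)
```

A single-word query is then just a dictionary lookup, e.g. `index["text"]` yields the occurrence list `[11, 19]`.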


8.2 Inverted files (3)

 Required space:
– The space required for the vocabulary is rather small.
– The occurrences demand much more space.

 Block addressing:
– The text is divided into blocks, and the occurrences point to the blocks where the word appears (instead of the exact positions).
– Reduces space requirements:
• Pointers are smaller because there are fewer blocks than positions.
• All the occurrences of a word inside one block collapse to a single reference.
– If the exact occurrence positions are required:
• An online search over the qualifying blocks has to be performed.
• Note: with at most 256 blocks, each pointer fits in a single byte even for a 200 MB text.


8.2 Inverted files (4)

 The sample text split into four blocks:

block 1: This is a text.   block 2: A text has many   block 3: words. Words are   block 4: made from letters

Inverted index:
Vocabulary  Occurrences (blocks)
letters     4…
made        4…
many        2…
text        1, 2…
words       3…


8.2 Inverted files (5)

 Block addressing (continued):
– Blocks of fixed size:
• Improve efficiency at retrieval time.
• But larger blocks match queries more often, incurring more sequential traversal of the text.
– Blocks following the natural divisions of the text collection (files, documents, web pages, etc.):
• Good for single-word queries when the exact occurrence positions are not required.


8.2.1 Searching (1)

 General search steps:
– Vocabulary search: the words and patterns present in the query are isolated and searched in the vocabulary.
– Retrieval of occurrences: the lists of the occurrences of all the words found are retrieved.
– Manipulation of occurrences: the occurrences are processed to solve phrase, proximity, or Boolean operations.
• If block addressing is used, it may be necessary to search the text directly to find the information missing from the occurrences.


8.2.1 Searching (2)

 Single-word queries (simple):
– Return the list of occurrences.

 Context queries (complex):
– Each element is searched separately and a list is generated for each of them.
– The lists are then traversed to find places where all the words appear in sequence (for a phrase query) or close enough to each other (for a proximity query).

 With block addressing, watch block boundaries, since they may split a match (time consuming).
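The list traversal for a phrase query can be sketched as an intersection with offsets. This is an illustrative sketch over a word-granularity index; the helper names are hypothetical.

```python
def build_index(words):
    """Word-granularity inverted index: word -> list of word positions."""
    index = {}
    for i, w in enumerate(words):
        index.setdefault(w.lower(), []).append(i)
    return index

def phrase_query(index, phrase):
    """Keep only the start positions whose i-th following word matches
    the i-th phrase term (the occurrence lists are intersected with an offset)."""
    terms = [t.lower() for t in phrase.split()]
    if not terms or terms[0] not in index:
        return []
    candidates = index[terms[0]]
    for offset, term in enumerate(terms[1:], start=1):
        positions = set(index.get(term, []))
        candidates = [p for p in candidates if p + offset in positions]
    return candidates

words = "this is a text a text has many words words are made from letters".split()
index = build_index(words)
```

A proximity query would replace the exact-offset test `p + offset in positions` with a distance window.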


8.2.2 Construction (1)

 Constructing:
– Building and maintaining an inverted index is a relatively low-cost task.
– All the vocabulary is kept in a data structure (a trie), storing with each word a list of its occurrences.
– Once constructed, the index is written to disk in two files:
• Posting file: the lists of occurrences, stored contiguously.
• Vocabulary file: the vocabulary, stored in lexicographical order with a pointer for each word to its list in the posting file.
– Splitting the index into 2 files allows the vocabulary to be kept in memory to speed up the search.


8.2.2 Construction (2)

 Construction steps (all the vocabulary known up to now is kept in a trie structure):
1. Read each word of the text.
2. Search for the word in the trie.
3. If the word is not found in the trie, it is added to the trie with its list of occurrences.
4. If the word is already in the trie, the new position is appended to the end of its list of occurrences.
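The four steps above can be sketched with a character-level trie whose word-end nodes carry the occurrence lists. The class and helper names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # letter -> TrieNode
        self.occurrences = []  # positions, filled at word-end nodes

def add_word(root, word, position):
    """Steps 2-4: search the trie, create missing nodes for a new word,
    and append the new position to the word's occurrence list."""
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.occurrences.append(position)

def occurrences(root, word):
    """Follow the word's path; return its occurrence list (empty if absent)."""
    node = root
    for letter in word:
        if letter not in node.children:
            return []
        node = node.children[letter]
    return node.occurrences

root = TrieNode()
for pos, word in enumerate(["text", "many", "text", "words"], start=1):  # step 1
    add_word(root, word, pos)
```

Writing the trie out in lexicographical order then produces the vocabulary and posting files of the previous slide.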


8.2.2 Construction (3)

 Building an inverted index for the sample text (word positions: text at 11 and 19, many at 28, words at 33 and 40, made at 50, letters at 60):

[Figure: a trie over the vocabulary, with the occurrence lists at the word-end nodes]
letters: 60
made: 50
many: 28
text: 11, 19
words: 33, 40


Example (2)


8.3 Other indices for text

 Suffix trees and suffix arrays
 Signature files


Suffix Trees and Suffix Arrays

 Suffix:
– Each position in the text is considered as a text suffix.
• A string that starts at that text position and extends to the end of the text.

 Both structures:
– Answer more complex queries efficiently.
– Have a costly construction process.
– Require the text to be readily available at query time.
– Do not deliver the results in text position order.


Suffix tree (1)

 Index points of interest:
– Selected from the text; they point to the beginnings of the text positions which will be retrievable.
– Each position is considered as a text suffix.
– Each suffix is uniquely identified by its position.

 Structure:
– A trie data structure built over all the suffixes of the text.
• The pointers to the suffixes are stored at the leaf nodes.
• This trie is compacted into a Patricia tree (compressing unary paths).

 Searching:
– Many basic patterns such as words, prefixes, and phrases can be searched with a simple trie search.


Suffix tree (2)

 The suffix trie and suffix tree for the sample text.

Index points of interest (suffixes):
11: text. A text has many words. Words are made from letters
19: text has many words. Words are made from letters
28: many words. Words are made from letters
33: words. Words are made from letters
40: Words are made from letters
50: made from letters
60: letters

[Figure: the suffix trie built over these suffixes, and the suffix tree (PAT) obtained by compressing its unary paths; the leaves store the positions 11, 19, 28, 33, 40, 50, 60.]


Suffix tree (Example)

Let S = abab. A suffix tree of S is a compressed trie of all suffixes of S$ = abab$:

Suffix  Position
$       5
b$      4
ab$     3
bab$    2
abab$   1

[Figure: the suffix tree of abab$, whose five leaves correspond to these positions.]


Trivial algorithm to build a Suffix tree

Insert the suffixes of abab$ ($ at 5, b$ at 4, ab$ at 3, bab$ at 2, abab$ at 1) from longest to shortest.

Put the largest suffix (abab$) in; then put the suffix bab$ in.

[Figure: after these two steps the root has two edges, labeled abab$ and bab$.]


Trivial algorithm to build a Suffix tree

Put the suffix ab$ in.

[Figure: the edge abab$ is split after its shared prefix ab; the new internal node has children ab$ and $, and the root keeps the edge bab$.]


Trivial algorithm to build a Suffix tree

Put the suffix b$ in.

[Figure: the edge bab$ is split after its shared prefix b; the new internal node has children ab$ and $.]


Trivial algorithm to build a Suffix tree

Put the suffix $ in.

END: label each leaf with the starting point of the corresponding suffix.

[Figure: the final suffix tree of abab$; the root has edges ab, b, and $; the leaves are labeled 1 and 3 under ab, 2 and 4 under b, and 5 under $.]


Suffix arrays (1)

 Structure:
– Suffix arrays are a space-efficient implementation of suffix trees.
– Simply an array containing pointers to all the text suffixes, listed in lexicographical order.
– Suffix arrays allow binary searches, done by comparing the contents of each pointer.
– Supra-indices:
• If the suffix array is large, this binary search can perform poorly because of the number of random disk accesses.
• To remedy this situation, the use of supra-indices over the suffix array has been proposed.


Suffix arrays (2)

 Example for the sample text (word positions: text at 11 and 19, many at 28, words at 33 and 40, made at 50, letters at 60):

Suffix array (pointers in lexicographical order of the suffixes they point to):
60  50  28  19  11  40  33

Supra-index (sampled suffix prefixes over the suffix array):
lett  text  word

[Figure: the suffix tree leaves, the suffix array, and the supra-index entries pointing into the suffix array.]


Suffix arrays (3)

 Searching:
– Search steps:
• Derive two limiting patterns P1 and P2 such that P1 ≤ S < P2, where S is the original pattern.
• Binary-search both limiting patterns in the suffix array.
– Supra-indices are used as a first step, to alleviate disk accesses.
• All the elements lying between both positions point to exactly those suffixes that start like the original pattern.
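The two limiting binary searches can be sketched as follows (0-based positions; the `lower_bound`/`sa_search` names and the use of a maximal sentinel character for P2 are assumptions of this sketch, not from the slides).

```python
def build_suffix_array(text):
    """All suffix start positions (0-based), sorted lexicographically
    by the suffixes they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lower_bound(text, sa, pattern):
    """First suffix-array slot whose suffix compares >= pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo

def sa_search(text, sa, pattern):
    """Binary-search the two limiting patterns P1 <= S < P2: P1 is the
    pattern itself, P2 the pattern extended with a maximal character.
    Everything between the two bounds starts with the pattern."""
    first = lower_bound(text, sa, pattern)
    last = lower_bound(text, sa, pattern + "\uffff")
    return sorted(sa[first:last])
```

Note the results come out in suffix-array order, not text order, which is why the slide warns that they must be sorted if position order matters.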


Signature files (1)

 Definition:
– A word-oriented index structure based on hashing.
– Uses linear search.
– Suitable for texts that are not very large.

 Structure:
– Based on a hash function that maps words to bit masks.
– The text is divided into blocks.
• The bit mask of a block is obtained by bitwise ORing the signatures of all the words in the block.
• A word cannot be in a block if some bit set in the query mask is not set in the block mask.


Signature files (2)

 Example:

block 1: This is a text.   block 2: A text has many   block 3: words. Words are   block 4: made from letters

Signature function:
h(text)    = 000101
h(many)    = 110000
h(words)   = 100100
h(made)    = 001100
h(letters) = 100001

Text signature (one mask per block, the OR of its word signatures):
block 1: 000101   block 2: 110101   block 3: 100100   block 4: 101101

Signature files (3)

 The false drop problem:
– The corresponding bits may all be set even though the word is not in the block!
– The design should ensure that the probability of a false drop is low, while keeping the signature file as short as possible.
– Enhance the hashing function to minimize the error probability.


Signature files (4)

 Searching:
1. If searching a single word, hash it to a bit mask W.
2. If searching phrases or reasonable proximity queries:
1) Hash each word in the query to a bit mask.
2) Bitwise-OR all the query masks into a single bit mask W.
3. Compare W to the bit masks Bi of all the text blocks.
• If all the bits set in W are also set in Bi, the text block may contain the word.
4. For all candidate text blocks, an online traversal must be performed to verify whether the query is actually there.

 Construction:
1. Cut the text into blocks.
2. Generate an entry of the signature file for each block.
• This entry is the bitwise OR of the signatures of all the words in the block.


8.4 Boolean queries

 Manipulation algorithms:
– Used to operate on sets of results.
– Example: a OR (b AND c).

 Search phases:
1. Determine which documents qualify.
2. Determine the relevance of the qualifying documents, so as to present them appropriately to the user.
3. Retrieve the exact positions of the matches, to highlight them in those documents that the user actually wants to see.
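With an inverted index, phase 1 reduces to set operations over the retrieved occurrence lists. A minimal sketch of the slide's example, with made-up document-id sets:

```python
def boolean_or_and(a, b, c):
    """Evaluate a OR (b AND c) over sets of qualifying document ids."""
    return a | (b & c)

# Hypothetical posting sets for the terms a, b, c
docs_a = {1, 2, 5, 8}
docs_b = {2, 3, 5, 9}
docs_c = {5, 9, 10}

result = boolean_or_and(docs_a, docs_b, docs_c)
```

In practice the lists are kept sorted and merged rather than materialized as hash sets, so sub-expressions over short lists can be evaluated first.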


8.5 Sequential searching

– Used for text searching when no data structure has been built on the text.
– The problem of exact string matching:
• Given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs.


8.5 Sequential searching

 Brute force
 Knuth-Morris-Pratt
 Boyer-Moore family
 Shift-Or


Brute Force

 Brute Force algorithm (BF):
– The simplest possible one.
– It consists of merely trying all possible pattern positions in the text; for each such position, it verifies whether the pattern matches at that position.
– Does not need any pattern preprocessing.
– Many algorithms use a modification of this scheme.
– Searches left to right.


Brute Force example

[Figure: the pattern 'abracadabra' slid along the text one position at a time; at each window position its characters are compared left to right until a mismatch, and the window is then shifted by one.]
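The scheme in the example can be sketched directly (the function name is illustrative; positions are 0-based):

```python
def brute_force(pattern, text):
    """Try every window position; verify the pattern left to right."""
    matches = []
    m = len(pattern)
    for pos in range(len(text) - m + 1):
        if text[pos:pos + m] == pattern:  # character-by-character check
            matches.append(pos)
    return matches
```

Worst case O(m * n) comparisons, but no preprocessing and no extra space.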


Knuth-Morris-Pratt (1)

 Reuse information from previous checks:
– When the window has to be shifted, there is a prefix of the pattern that matched the text.
– The algorithm takes advantage of this information to avoid trying window positions which can be deduced not to match.
– Scans left to right, like the Brute Force algorithm.


Knuth-Morris-Pratt (2)

 Next table:
– The next table entry at position j gives the length of the longest proper prefix of P1..j-1 which is also a suffix of P1..j-1 and such that the characters following the prefix and the suffix are different.
• j - next[j] + 1 window positions can be safely skipped if the characters up to j-1 matched and the j-th did not.


Knuth-Morris-Pratt (3)

 Next table for 'abracadabra':

pattern: a b r a c a d a b r a
next:    0 0 0 0 1 0 1 0 0 0 0 4

(The twelfth entry corresponds to the position just past the pattern and is used after a complete match.)

Knuth-Morris-Pratt (4)

 Searching 'abracadabra':

[Figure: the search example; on a mismatch at pattern position j, the window is shifted by j - next[j] positions and the comparison resumes without re-reading the already-matched text characters.]
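A runnable sketch of KMP follows. Note it uses the simple failure function (longest proper prefix that is also a suffix), without the extra mismatched-character optimization of the slide's next table, so its values differ in the middle of 'abracadabra'; the search behavior is the same. Names are illustrative; positions 0-based.

```python
def build_next(pattern):
    """nxt[j] = length of the longest proper prefix of pattern[:j]
    that is also a suffix of it (simple variant, no mismatch rule)."""
    m = len(pattern)
    nxt = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = nxt[k]            # fall back to the next shorter border
        if pattern[j] == pattern[k]:
            k += 1
        nxt[j + 1] = k
    return nxt

def kmp_search(pattern, text):
    """Scan the text once; never re-read a matched text character."""
    nxt = build_next(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = nxt[k]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = nxt[k]            # continue searching for overlapping matches
    return matches
```

The text pointer i never moves backwards, which gives the O(n) worst case.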


Boyer-Moore Family (1)

 BM algorithm:
– Based on the fact that the check inside the window can proceed backwards.
• When a match or mismatch is determined, a suffix of the pattern has been compared and found equal to the text in the window.


Boyer-Moore Family (2)

 BM example — searching 'date':
P = "date"
index[d] = 0, index[a] = 1, index[t] = 2, index[e] = 3
index[anything else] = -1


Boyer-Moore Family (3)

 BM example — searching 'date':
T = "some date", P = "date"
m ≠ t, and index[m] = -1, so shift the pattern so that position -1 of P lies below the 'm' (i.e., move the whole pattern past it).
a ≠ e, and index[a] = 1, so shift so that character 1 of P ('a') lies below the 'a' of the text.
The pattern now aligns with "date" in the text, and all four characters match.

Shift-Or (1)

– The basic idea of the Shift-Or (SO) algorithm is to represent the state of the search as a number; each search step then costs a small number of arithmetic and logical operations.
– Efficient if the pattern length is no longer than the memory word size w of the machine (w is typically 32 or 64).


Shift-Or (2)

 SO example:
– Searching 'GCAGAGAG'.
– [Figure: the per-character bit masks and the search states, omitted.]
– The pattern has been found at position 12 - 8 + 1 = 5.
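A sketch of Shift-Or (here with Python integers standing in for the machine word, so the w-bit limit of the slide does not apply; the slide's example text is not reproduced, and positions are 0-based):

```python
def shift_or(pattern, text):
    """Shift-Or: one state bit per pattern position; bit i is 0 when
    pattern[:i+1] currently matches a suffix of the text read so far."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c]: 0-bit at position i iff pattern[i] == c
    masks = {c: all_ones for c in set(pattern)}
    for i, c in enumerate(pattern):
        masks[c] &= ~(1 << i)
    state = all_ones
    matches = []
    for j, c in enumerate(text):
        # one shift and one OR per text character
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:       # full match ends at j
            matches.append(j - m + 1)         # the slide's j - m + 1 rule
    return matches
```

The same bit-parallel skeleton extends to classes of characters and to approximate matching (Shift-And/Wu-Manber style), which is its main appeal.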


Phrases and proximity

 The best way to search a phrase:
– Search first for the element which is least frequent or can be searched fastest; for instance:
• longer patterns are better than shorter ones,
• allowing fewer errors is better than allowing more errors.

 The best way to search a proximity query is similar to the best way to search a phrase.


8.6 Pattern Matching

 String matching allowing errors.
 Pattern matching using indices.


String matching allowing errors (1)

– This problem is called 'approximate string matching'.
– It can be stated as follows:
• Given a short pattern P of length m, a long text T of length n, and a maximum allowed number of errors k, find all the text positions where the pattern occurs with at most k errors.


String matching allowing errors (1)

 Dynamic programming:
– The classical solution to approximate string matching.
– A matrix C[0..m, 0..n] is filled column by column, where C[i, j] represents the minimum number of errors needed to match P1..i to a suffix of T1..j.
• m: length of the short pattern P.
• n: length of the long text T.


String matching allowing errors (2)

 Dynamic programming:
– The matrix is computed as follows:
C[0, j] = 0
C[i, 0] = i
C[i, j] = C[i-1, j-1]                                   if Pi = Tj
C[i, j] = 1 + min(C[i-1, j], C[i, j-1], C[i-1, j-1])    otherwise
– A match is reported at the text positions j such that C[m, j] ≤ k.
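The recurrence can be sketched column by column, keeping only the previous column in memory (the function name is illustrative; reported positions are 1-based text positions j, as in the recurrence):

```python
def approximate_search(pattern, text, k):
    """Fill C column by column; report each text position j with
    C[m][j] <= k. Boundary: C[i][0] = i, C[0][j] = 0."""
    m = len(pattern)
    prev = list(range(m + 1))          # column j = 0
    matches = []
    for j, tc in enumerate(text, start=1):
        curr = [0] * (m + 1)           # C[0][j] = 0: a match can start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == tc:
                curr[i] = prev[i - 1]
            else:
                curr[i] = 1 + min(prev[i], curr[i - 1], prev[i - 1])
        if curr[m] <= k:
            matches.append(j)
        prev = curr
    return matches
```

Space is O(m) and time O(m * n), the classical dynamic-programming bounds.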


String matching allowing errors (3)

 Dynamic programming — search 'survey' in the text 'surgery' with two errors (k = 2):

      s  u  r  g  e  r  y
   0  0  0  0  0  0  0  0
s  1  0  1  1  1  1  1  1
u  2  1  0  1  2  2  2  2
r  3  2  1  0  1  2  2  2
v  4  3  2  1  1  2  3  3
e  5  4  3  2  2  1  2  3
y  6  5  4  3  3  2  2  2

Matches are reported at the text positions j where C[6, j] ≤ 2 (the last three columns).


IS531 - Ch 8

String matching allowing errors(4) ď Ž Dynamic programming survey sur_e_y

53


String matching allowing errors (5)

 Bit-parallelism:
– Has been used to parallelize the computation of the dynamic programming matrix.

 Filtering:
– Filters the text, reducing the area where dynamic programming needs to be used.


Pattern matching using indices

 Inverted files:
– Are word-oriented.
– Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary.
– If block addressing is used, the search must be completed with a sequential search over the blocks.
– Not able to efficiently find approximate matches or regular expressions that span many words.


8.7 Structural Queries (1)

 Algorithms to search structured text:
– Some implementations build an ad hoc index to store the structure.
• More efficient and independent of any consideration about the text.
• Needs extra development and maintenance effort.


8.7 Structural Queries (2)

 Algorithms to search structured text:
– Other techniques assume that the structure is marked in the text using 'tags' (the case of HTML text).
• These techniques rely on the same index used to query content (such as inverted files), using it to index and search the tags as if they were words.
• In many cases this is as efficient as an ad hoc index.
• Its integration into an existing text database is simpler.


Compressed indices (1)

 Inverted files:
– Are quite amenable to compression, because the lists of occurrences are in increasing order of text position.
– An obvious choice is to represent each position by its difference from the previous one.
– The text can be compressed independently of the index.
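The difference representation can be sketched together with a byte-oriented code for the gaps. The slides only mention storing differences; the variable-byte code below is one standard way to exploit the fact that most gaps are small (the function names are illustrative).

```python
def gaps(positions):
    """Store each occurrence as the difference from the previous one."""
    out, prev = [], 0
    for p in positions:
        out.append(p - prev)
        prev = p
    return out

def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte; the high bit marks the
    final (low-order) byte of each number, so small gaps take one byte."""
    encoded = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80               # flag the terminating byte
        encoded.extend(reversed(chunk))
    return bytes(encoded)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for b in data:
        if b & 0x80:
            numbers.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return numbers
```

Decoding the gaps and taking a running sum recovers the original occurrence list.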


Compressed indices (2)

 Suffix trees and suffix arrays:
– Suffix arrays are very hard to compress further.
• Because they represent an almost perfectly random permutation of the pointers to the text.
– Suffix arrays on compressed text:
• The main advantage is that both index construction and querying almost double their performance.
– Construction is faster because more compressed text fits in the same memory space, and therefore fewer text blocks are needed.
– Searching is faster because a large part of the search time is spent in disk seek operations over the text area to compare suffixes.


8.9 Trends and Research Issues

 The main trends in indexing and searching textual databases:
– Text collections are becoming huge.
– Searching is becoming more complex.
– Compression is becoming a star in the field.


References

 "Modern Information Retrieval", Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison Wesley, 1999.
 "Readings in Information Retrieval", K. Sparck Jones and P. Willett.
 Many different resources on the Internet:
– http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
– http://www.cs.unt.edu/~rada/CSCE5200/


.. That's All ..

Thanks .. Any Questions?

Software Research Group
create software