
Introduction to the Basics of Search and Relevancy with Apache Solr

Featuring: Mark Bennett, CTO


Agenda
• Prerequisites: Browser Tricks
• Web “Command Line”
• The DisMax Parser
• Boosting Formula
• Explaining “Explain”
• Check Your Index!
• Q&A
• Resources / About NIE

12/2/2009

Lucid Imagination, Inc.



Prerequisite: Some Browser Tricks



Browsers Matter – install them all!

Firefox:
• Default XML rendering (also some versions of IE)
• Lots of plugins

IE and Safari:
• Better “Explain” copy & paste: maintains line breaks
• Better table copy and paste



Larger Firefox “Command Line”

Customize the Firefox URL box as a command line in 3 easy steps:
1. Toolbar: right click
2. Customize… Add New Toolbar
3. URL bar -> click and drag



Turn off Solr HTTP Caching

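• Change in solrconfig.xml
• Disable the http304 section (presumably the <httpCaching> element; setting never304="true" there stops Solr from answering with HTTP 304 Not Modified)
• Turn it back on before you deploy!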



Understanding Solr’s “Web Command Line”



The “Web Command Line”

CLI CONCEPT                SOLR EQUIVALENT
• Command prompt           URL bar
• -o or --foo bar          ? or & and =
• (spaces)                 +
• some punctuation         %nn
• output                   XML or HTML
• Command line “adapter”   curl

• Script files can call URLs
• Not built into Windows – try Cygwin
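The encoding rows of the table can be tried out with a small Python sketch. It only illustrates the "+" and "%nn" conventions; the sample strings are made up for the example and are not from the slides.

```python
from urllib.parse import quote_plus

# Spaces become "+", and reserved punctuation becomes a %nn hex escape.
print(quote_plus("apache solr"))  # apache+solr
print(quote_plus("AT&T"))         # AT%26T
```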



Solr “Command Line”
• Typical base URL
  • http://localhost:8983/solr/select?...
• Basic input (not counting dismax)
  • q = query, fq = filter query
  • df = default field
  • qt = query type (standard / dismax)
• Controlling output (lots more!!!)
  • debugQuery = true
  • wt = “what type” (actually “writer type”)
    • standard/XML, xslt (with tr=), javabin, json…
  • fl = *,score (which fields)
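As a rough sketch of how a script might assemble such a URL, the helper below joins the base URL with encoded parameters. The function name `solr_url` and the two-word query are assumptions made for illustration; the parameter names (q, qt, fl, debugQuery) are the ones listed on the slide.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"

def solr_url(**params):
    """Build a select URL; '*' and ',' are left readable for fl=*,score."""
    return BASE_URL + "?" + urlencode(params, safe="*,")

# Hypothetical query; the slides themselves use q=solr.
url = solr_url(q="apache solr", qt="dismax", fl="*,score", debugQuery="true")
print(url)
# http://localhost:8983/solr/select?q=apache+solr&qt=dismax&fl=*,score&debugQuery=true
```

The resulting string can be pasted into the browser URL bar or handed to curl.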



Example: search for “solr”

http://localhost:8983/solr/select?q=solr&debugQuery=true

With Firefox you get XML output you can expand and collapse

With MSIE* and Safari, not so much

* Some versions



Detailed Debug & Explain Output

http://localhost:8983/solr/select?q=solr&debugQuery=true

<str name="parsedquery">text:solr</str>
…
<lst name="explain">
  <str name="SOLR1000">
0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.6026897 = idf(docFreq=1, numDocs=26)
  0.125 = fieldNorm(field=text, doc=13)
  </str>
</lst>
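The "product of" line can be checked by hand. The sketch below assumes Lucene's classic scoring, where tf is the square root of the term frequency, and takes the idf and fieldNorm values straight from the output above; the function name `field_weight` is just for this example.

```python
import math

def field_weight(term_freq, idf, field_norm):
    """fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)."""
    return math.sqrt(term_freq) * idf * field_norm

# Values copied from the explain output for document SOLR1000.
score = field_weight(term_freq=2, idf=3.6026897, field_norm=0.125)
print(f"{score:.7f}")  # 0.6368716
```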



A look at the DisMax query parser



Solr DisMax: Defined
• What is it? (the name comes from Lucene’s DisjunctionMaxQuery)
  • Dis-joint text (multiple fields)
  • Max-imum match (score)
• How do you get it?
  • Configured in: solrconfig.xml and schema.xml
  • Called with: qt=dismax
  • Adjusted with: mm, bf, qf, pf, qs, ps, tie



Solr DisMax: Pros and Cons

General benefits
• Multiple fields
• Multiple relevancy rules
• Great for freshness / popularity

Issues to be aware of
• Tie-in between schema.xml & solrconfig.xml
• Trouble with some CJK (Chinese, Japanese, Korean)
• Limited wildcard / field / range support
• Difficult to customize and debug
• Trouble with shingles
• Understand mm (minimum should match)!



About the “dis” and the “max”

Distributed across multiple fields
• Break up the query into words
• Each part becomes a field clause
• Like an OR but with extra credit

Takes the maximum of each set
• Word 1 had highest score in Title
• Word 2 very dense in the doc body
• Adds in tie breaker if in multiple fields
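The "max plus tie breaker" rule can be sketched in a few lines. The per-field scores and the 0.01 tie value below are the ones that appear in this deck's explain output for the word "solr"; the function name `dismax_score` is only illustrative.

```python
def dismax_score(field_scores, tie):
    """Best field score, plus tie times the sum of the other field scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

# Scores for "solr" in each field that matched (from the explain output).
scores = {
    "text": 0.026233677,
    "name": 0.17808011,
    "features": 0.03710002,
    "sku": 0.44520026,
}
score = dismax_score(scores.values(), tie=0.01)
print(f"{score:.7f}")  # 0.4476144
```

With tie=0 this would be a pure maximum; with tie=1 it would behave like a plain sum across fields.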



Coming soon: Extended DisMax

Improvements
• Flexible case Boolean ops: AND/and, OR/or
• Auto-escape punctuation: & -> \&, etc.
• Improved proximity boosting (via word bigrams)
• Other changes in stop words, relevancy calc, URL arguments

How to get it
• Post-1.4 patch, planned for 1.5
• Details + patch in JIRA: SOLR-1553 http://issues.apache.org/jira/browse/SOLR-1553
• TBD: change URL option qt=edismax (or qt=dismax)



Boosting Formulas



Boost Functions in Dismax

High-level feature
• Numeric functions for scoring
  • sum(), product(), sqrt(), log(), etc.
• Boost on recent dates, user popularity

Good combination: reverse-ordinal & reciprocal
• Position in index: ord(); reverse is: rord()
• Larger y for smaller x: recip()

How to get it
• URL parameter bf = “boost function”
• Configured in solrconfig.xml
• See http://wiki.apache.org/solr/FunctionQuery



“Freshness”: Boosting Recent Dates

WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )

Date         Position   N-Position   Linear (x,m,c)   recip(x,m,a,c)
             ord()      rord()       mx+c             a / (mx+c)
1/1/2000     1          120          1120             0.89286
2/1/2000     2          119          1119             0.89366
3/1/2000     3          118          1118             0.89445
1/1/2005     61         60           1060             0.94340
1/1/2009     109        12           1012             0.98814
2/1/2009     110        11           1011             0.98912
3/1/2009     111        10           1010             0.99010
4/1/2009     112        9            1009             0.99108
5/1/2009     113        8            1008             0.99206
6/1/2009     114        7            1007             0.99305
7/1/2009     115        6            1006             0.99404
8/1/2009     116        5            1005             0.99502
9/1/2009     117        4            1004             0.99602
10/1/2009    118        3            1003             0.99701
11/1/2009    119        2            1002             0.99800
12/1/2009    120        1            1001             0.99900

Parameters: slope m = 1; numerator a = 1000; intercept c = 1000 (aka "b")

[Chart: recip() values plotted by date; y-axis from 0.880 to 1.000]
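The table values can be reproduced with a short sketch of the reciprocal formula a / (m*x + c), where x stands for the rord() value. This is a plain Python re-implementation for checking the arithmetic, not Solr's own code.

```python
def recip(x, m, a, c):
    """Reciprocal boost as described on the slide: a / (m*x + c)."""
    return a / (m * x + c)

# rord() is 1 for the newest document and 120 for the oldest in the table.
newest = recip(1, 1, 1000, 1000)
oldest = recip(120, 1, 1000, 1000)
print(f"{newest:.5f}")  # 0.99900
print(f"{oldest:.5f}")  # 0.89286
```

With a and c both at 1000 the boost stays in a narrow band near 1.0, so recent documents get a gentle nudge rather than dominating the text score.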



Sifting through Solr’s “Explain” output



DisMax Example for “solr”

INPUT:
http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax

DEBUG OUTPUT: (1 OF 2)
<str name="parsedquery">
+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)
DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01)
FunctionQuery((top(ord(popularity)))^0.5)
FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
</str>



DisMax explain output for a single word query

<lst name="explain">
  <str name="SOLR1000">
0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm
  </str>
</lst>
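One way to read this output is top-down: the final score is the sum of the two DisMax clauses and the two boost functions. The sketch below recomputes the function-query parts and the total from numbers in the explain text; the helper name `function_query_score` is an assumption for the example, and small differences in the last digits come from Lucene's single-precision floats.

```python
QUERY_NORM = 0.022867065  # queryNorm as printed in the explain output

def function_query_score(value, boost, query_norm=QUERY_NORM):
    """A boost function contributes value * boost * queryNorm."""
    return value * boost * query_norm

popularity = function_query_score(6.0, 0.5)                    # ord(popularity)=6
price = function_query_score(1000.0 / (1.0 * 14 + 1000.0), 0.3)  # rord(price)=14

# The two "max plus 0.01 times others" values, copied from the explain output.
total = 0.4476144 + 0.22311316 + popularity + price
print(f"{popularity:.6f} {price:.7f} {total:.6f}")
```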



“Explain” example:

...
0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
  0.04119147 = queryWeight(text:solr^0.5), product of:
    0.5 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
    1.4142135 = tf(termFreq(text:solr)=2)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=text, doc=13)
0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
  0.09885953 = queryWeight(name:solr^1.2), product of:
    1.2 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
    1.0 = tf(termFreq(name:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.5 = fieldNorm(field=name, doc=13)
0.03710002 = (MATCH) weight(features:solr in 13), product of:
  0.08238294 = queryWeight(features:solr), product of:
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
    1.0 = tf(termFreq(features:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=features, doc=13)
...

Callouts: tf(termFreq(text:solr)=2), idf(docFreq=1, numDocs=26)
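Each "weight" entry is the product of its two children. As a minimal check, the sketch below recomputes the text:solr^0.5 entry from the numbers printed above; `query_weight` is a hypothetical helper name, and the fieldWeight value is taken as given from the output.

```python
def query_weight(boost, idf, query_norm):
    """queryWeight = boost * idf * queryNorm."""
    return boost * idf * query_norm

# Numbers copied from the text:solr^0.5 entry of the explain output.
qw = query_weight(boost=0.5, idf=3.6026897, query_norm=0.022867065)
weight = qw * 0.6368716  # queryWeight * fieldWeight
print(f"{qw:.8f} {weight:.8f}")
```

Note that idf appears in both halves, so a rare term is rewarded twice in this scoring scheme.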



Solr’s XSLT “debugger”

http://localhost:8983/solr/select?
  q=solr
  &debugQuery=true
  &wt=xslt
  &tr=example.xsl
  &fl=*,score
  &qt=dismax



Another way to view Explain data
• Solr 1.4 has Solritas
• Various features, including toggle explain display
• “Some assembly required…”

http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/



Checking your Index and IDF



Checking what got Indexed

Bad Index = Bad Search
• Check upper / lower case and punctuation
• Bad fields / metadata = bad facets, filters, sorting

Use built-in Schema Browser:
• Check each field
• Common words
• IDF (“Inverse Document Frequency”)
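To see why common words matter for IDF, here is a minimal sketch assuming Lucene's classic formula, idf = 1 + ln(numDocs / (docFreq + 1)). The document counts are hypothetical, and the exact document count Solr feeds into the formula may differ slightly from the numDocs it prints.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index of 1000 documents.
rare = classic_idf(doc_freq=1, num_docs=1000)
common = classic_idf(doc_freq=900, num_docs=1000)
print(f"rare term idf={rare:.3f}, common term idf={common:.3f}")
```

A word that appears in almost every document ends up with an IDF close to 1, so it contributes little to ranking.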



Check IDF w/ the Schema Browser

Start at the Admin Screen: http://localhost:8983/solr/admin

Schema Browser
• select a field
• change # to see more



About NIE: New Idea Engineering



NIE Resources

Newsletter & Whitepapers: www.ideaeng.com/current

Search Dev Newsgroup: www.SearchDev.org

Blogs: EnterpriseSearchBlog.com, SearchComponentsOnline.com



Finish Line / Q & A

Review & Questions

Mark Bennett
mbennett@ideaeng.com
main 408-446-3460
cell 408-829-6513



Q&A

These slides and a recorded presentation are available at bit.ly/SolrRelevancy


