
Introduction to basics of Search and Relevancy with Apache Solr FEATURING:

Mark Bennett, CTO


Agenda
• Prerequisites: Browser Tricks
• Web “Command Line”
• The DisMax Parser
• Boosting Formula
• Explaining “Explain”
• Check Your Index!
• Q&A
• Resources / About NIE

12/2/2009

Lucid Imagination, Inc.



Prerequisite: Some Browser Tricks



Browsers Matter – install them all!

Firefox:
• Default XML Rendering (also some versions of IE)
• Lots of Plugins

IE and Safari:
• Better “Explain” copy & paste: maintains line breaks
• Better table copy and paste



Larger Firefox “Command Line”

Customize the Firefox URL box as a command line in 3 easy steps:
1. Toolbar: Right Click
2. Customize… Add New Toolbar
3. URL bar -> CLICK and DRAG



Turn off Solr HTTP Caching

• Change in solrconfig.xml
• Disable HTTP 304 (Not Modified) responses in the <httpCaching> section, e.g. never304="true"
• Turn it back on before you deploy!



Understanding Solr’s “Web Command Line”



The “Web Command Line”

CLI CONCEPT                  SOLR EQUIVALENT
• Command Prompt             URL bar
• -o or --foo bar            ? or & and =
• (spaces)                   +
• some punctuation           %nn
• output                     XML or HTML
• Command line “adapter”     Curl
    • Script files can call URLs
    • Not built into Windows – try cygwin



Solr “Command Line”

• Typical Base URL
    • http://localhost:8983/solr/select?...
• Basic Input (not counting dismax)
    • q = query, fq = filter query
    • df = default field
    • qt = query type (standard / dismax)
• Controlling Output (lots more!!!)
    • debugQuery = true
    • wt = “what type” (actually “writer type”)
    • standard/XML, xslt (with tr=), javabin, json…
    • fl = *,score (which fields)
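The parameter list above can be assembled from a script instead of typed into the URL bar. What follows is a minimal sketch, not part of the original slides: the helper name `solr_url` is hypothetical, and the base URL is simply the one shown on the slide, so adjust it for your own install. Python's standard `urlencode` takes care of the `+` for spaces and the `%nn` escapes mentioned earlier.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide; host, port, and path are assumptions
# about a default local install.
BASE_URL = "http://localhost:8983/solr/select"


def solr_url(query, **params):
    """Build a select URL (hypothetical helper).

    urlencode turns spaces into '+' and escapes punctuation as %nn.
    """
    all_params = {"q": query}
    all_params.update(params)
    return BASE_URL + "?" + urlencode(all_params)


# Query for two words, ask for debug output, all fields plus score, JSON writer.
url = solr_url("apache solr", debugQuery="true", fl="*,score", wt="json")
print(url)
```

Paste the printed URL into the browser, or hand it to curl from a script file.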



Example: search for “solr”

http://localhost:8983/solr/select?q=solr&debugQuery=true

With Firefox you get XML output you can expand and collapse.
With MSIE* and Safari, not so much.

* Some versions


Detailed Debug & Explain Output

http://localhost:8983/solr/select?q=solr&debugQuery=true

<str name="parsedquery">text:solr</str>
…
<lst name="explain">
<str name="SOLR1000">
0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.6026897 = idf(docFreq=1, numDocs=26)
  0.125 = fieldNorm(field=text, doc=13)
</str>
</lst>
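To check the arithmetic in this explain output by hand: the fieldWeight is the product of the three indented lines, and the tf value is the square root of the raw term frequency (sqrt(2) = 1.4142135). A small sketch, with the function name `field_weight` chosen here for illustration and the inputs copied from the output above:

```python
import math


def field_weight(term_freq, idf, field_norm):
    """fieldWeight as printed by explain: tf * idf * fieldNorm.

    tf is the square root of the raw term frequency.
    """
    tf = math.sqrt(term_freq)
    return tf * idf * field_norm


# Values copied from the explain output for text:solr in doc 13.
score = field_weight(term_freq=2, idf=3.6026897, field_norm=0.125)
print(score)
```

The result agrees with the 0.6368716 on the top line, up to float rounding.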



A look at the DisMax query parser



Solr DisMax: Defined

• What is it? (“DisMax” is short for Disjunction Max)
    • Dis-joint text (Multiple fields)
    • Max-imum match (score)
• How do you get it?
    • Configured in: solrconfig.xml and schema.xml
    • Called with: qt=dismax
    • Adjusted with: mm, bf, qf, pf, qs, ps, tie


Solr DisMax: Pros and Cons

General Benefits
• Multiple Fields
• Multiple Relevancy Rules
• Great for Freshness / Popularity

Issues to be Aware of
• Tie-in between schema.xml & solrconfig.xml
• Trouble with some CJK (Chinese, Japanese, Korean)
• Limited wildcard / field / range support
• Difficult to customize and debug
• Trouble with shingles
• Understand mm!


About the “dis” and the “max”

Distributed across multiple fields
• Break up query into words
• Each part becomes field clause
• Like an OR but with extra credit

Takes the Maximum of each set
• Word 1 had highest score in Title
• Word 2 very dense in the doc body
• Adds in Tie breaker if in multiple fields
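The “maximum plus tie breaker” rule can be sketched in a few lines. This is an illustration rather than Solr's actual code: `dismax_score` is a name made up here, and the per-field scores are the four values that appear under “max plus 0.01 times others of” in the explain output later in this deck.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Per-field scores for the word "solr", copied from the explain output.
per_field = {
    "text": 0.026233677,
    "name": 0.17808011,
    "features": 0.03710002,
    "sku": 0.44520026,
}

score = dismax_score(list(per_field.values()), tie=0.01)
print(score)
```

With tie=0.01 the sku field wins and the other three fields add only a sliver, which matches the 0.4476144 reported by explain. A tie of 0 would be a pure maximum; a tie of 1 would be a plain sum.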



Coming soon: Extended DisMax

Improvements
• Flexible case Boolean ops: AND/and, OR/or
• Auto-escape punctuation: & -> \&, etc.
• Improved Proximity Boosting (via word bigrams)
• Other changes in stop words, relevancy calc, URL arguments

How to get it
• Post 1.4 patch, planned for 1.5
• Details + Patch in JIRA: SOLR-1553
  http://issues.apache.org/jira/browse/SOLR-1553
• TBD: change URL option qt=edismax (or qt=dismax)



Boosting Formulas



Boost Functions in Dismax

High Level Feature
• Numeric functions for scoring
    • sum(), product(), sqrt(), log(), etc.
• Boost on recent dates, user popularity

Good Combination: Reverse-Ordinal & Reciprocal
• Position in index: ord(), reverse is: rord()
• Larger y for smaller x: recip()

How to get it
• URL parameter bf = “boost function”
• Configured in solrconfig.xml
• See http://wiki.apache.org/solr/FunctionQuery


“Freshness”: Boosting Recent Dates

WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )

              Position   N-Position   Linear (x,m,c)   recip(x,m,a,c)
Date          ord()      rord()       mx+c             a / (mx+c)
1/1/2000      1          120          1120             0.89286
2/1/2000      2          119          1119             0.89366
3/1/2000      3          118          1118             0.89445
1/1/2005      61         60           1060             0.94340
1/1/2009      109        12           1012             0.98814
2/1/2009      110        11           1011             0.98912
3/1/2009      111        10           1010             0.99010
4/1/2009      112        9            1009             0.99108
5/1/2009      113        8            1008             0.99206
6/1/2009      114        7            1007             0.99305
7/1/2009      115        6            1006             0.99404
8/1/2009      116        5            1005             0.99502
9/1/2009      117        4            1004             0.99602
10/1/2009     118        3            1003             0.99701
11/1/2009     119        2            1002             0.99800
12/1/2009     120        1            1001             0.99900

Parameters
slope                  m    1
numerator              a    1000
intercept (aka "b")    c    1000

[Chart omitted; vertical axis from 0.880 to 1.000]
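The recip column can be reproduced with a one-line function. This sketch mirrors the recip(x,m,a,c) = a / (mx+c) definition from the slide; the Python function is only for checking the numbers and is not how Solr evaluates it internally.

```python
def recip(x, m, a, c):
    """Reciprocal function from the slide: a / (m*x + c)."""
    return a / (m * x + c)


# rord values taken from the table: oldest document, a middle one, newest.
for rord in (120, 60, 1):
    print(rord, round(recip(rord, m=1, a=1000, c=1000), 5))
```

Because a and c are both 1000, the newest document (rord = 1) scores just under 1.0 and the boost falls off gently with age. Smaller values of a and c would make the curve steeper.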



Sifting through Solr’s “Explain” output



DisMax Example for “solr”

INPUT:
http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax

DEBUG OUTPUT: (1 OF 2)
<str name="parsedquery">
+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)
DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01)
FunctionQuery((top(ord(popularity)))^0.5)
FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
</str>


DisMax explain output for a single word query

<lst name="explain">
<str name="SOLR1000">
0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm
</str>
</lst>
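One way to make this output less intimidating is to rebuild the top-level score from its four pieces: two “max plus 0.01 times others” clauses and two function-query clauses. The sketch below copies its inputs from the explain output; the helper name `function_clause` is invented for illustration.

```python
# queryNorm as reported throughout the explain output.
QUERY_NORM = 0.022867065


def function_clause(value, boost, query_norm=QUERY_NORM):
    """Score of a boost-function clause: function value * boost * queryNorm."""
    return value * boost * query_norm


# The two "max plus 0.01 times others" results, copied from explain.
first_dismax = 0.4476144
second_dismax = 0.22311316

# ord(popularity) = 6 with boost 0.5
popularity = function_clause(6.0, 0.5)

# recip-style price function with rord(price) = 14 and boost 0.3
price = function_clause(1000.0 / (1.0 * 14 + 1000.0), 0.3)

total = first_dismax + second_dismax + popularity + price
print(popularity, price, total)
```

The three printed values line up with 0.06860119, 0.0067654043 and the overall 0.74609417, give or take float rounding.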



“Explain” example:

...
0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
  0.04119147 = queryWeight(text:solr^0.5), product of:
    0.5 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
    1.4142135 = tf(termFreq(text:solr)=2)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=text, doc=13)
0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
  0.09885953 = queryWeight(name:solr^1.2), product of:
    1.2 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
    1.0 = tf(termFreq(name:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.5 = fieldNorm(field=name, doc=13)
0.03710002 = (MATCH) weight(features:solr in 13), product of:
  0.08238294 = queryWeight(features:solr), product of:
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
    1.0 = tf(termFreq(features:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=features, doc=13)
...

Callouts: tf (termFreq(text:solr)=2), idf (docFreq=1, numDocs=26)
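Each weight(...) entry in this excerpt is a queryWeight multiplied by a fieldWeight, and the queryWeight is itself boost * idf * queryNorm. A short sketch under those assumptions, with illustrative function names and inputs copied from the text and name entries:

```python
def query_weight(boost, idf, query_norm):
    """queryWeight as printed by explain: boost * idf * queryNorm."""
    return boost * idf * query_norm


def clause_weight(boost, idf, query_norm, field_weight):
    """weight(field:term^boost) = queryWeight * fieldWeight."""
    return query_weight(boost, idf, query_norm) * field_weight


IDF = 3.6026897
QUERY_NORM = 0.022867065

# fieldWeight values are taken straight from the explain lines.
text_weight = clause_weight(0.5, IDF, QUERY_NORM, field_weight=0.6368716)
name_weight = clause_weight(1.2, IDF, QUERY_NORM, field_weight=1.8013449)
print(text_weight, name_weight)
```

Note that the boost from the query only enters through queryWeight, while document-side factors (tf and fieldNorm) only enter through fieldWeight. The idf appears in both, so it is effectively squared.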



Solr’s XSLT “debugger”

http://localhost:8983/solr/select?
  q=solr
  &debugQuery=true
  &wt=xslt
  &tr=example.xsl
  &fl=*,score
  &qt=dismax



Another way to view Explain data

• Solr 1.4 has Solritas
• Various features, including toggle explain display
• “Some assembly required…”

http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/



Checking your Index and IDF



Checking what got Indexed

Bad Index = Bad Search
• Check Upper / lower case and Punctuation
• Bad Fields / Meta Data = Bad Facets, Filters, Sorting

Use built-in Schema Browser:
• Check each field
• Common words = low IDF (“Inverse Document Frequency”)
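To see why common words carry little weight, it helps to compute IDF directly. The sketch below uses the classic Lucene default formula, 1 + ln(numDocs / (docFreq + 1)). The 1000-document index is hypothetical. As an aside, the 3.6026897 seen in the explain output corresponds to a document count of 27 rather than the displayed 26. That would be consistent with the index also counting one deleted document, but treat it as an inference rather than something the slides state.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical 1000-document index: the more documents contain a term,
# the lower its idf.
for doc_freq in (1, 10, 100, 1000):
    print(doc_freq, round(idf(doc_freq, 1000), 3))

# Inferred check against the explain output (docFreq=1, count of 27).
print(idf(1, 27))
```

A term that appears in nearly every document ends up with an idf close to 1, so matches on it barely move the score.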



Check IDF w/ the Schema Browser

Start at the Admin Screen: http://localhost:8983/solr/admin

Schema Browser
• select a field
• change # to see more



About NIE New Idea Engineering



NIE Resources

Newsletter & Whitepapers: www.ideaeng.com/current

Search Dev Newsgroup: www.SearchDev.org

Blogs: EnterpriseSearchBlog.com SearchComponentsOnline.com



Finish Line / Q & A Review & Questions

Mark Bennett mbennett@ideaeng.com main 408-446-3460 cell 408-829-6513



Q&A

These slides and a recorded presentation are available at bit.ly/SolrRelevancy
