Hill Jay - 7 Deadly Sins of Solr


Introductions…
- Who the hell am I?
  - Jay Hill, Lucid Imagination
  - 7 years Lucene experience
  - 4 years Solr experience
  - Author of Lucid Training
  - SME for Lucid Certification
- Who the hell are you?
  - New to search?
  - New to Lucene/Solr?
  - Battle-tested veterans?

© Lucid Imagination, Inc.


We'll Leave Time For Q&A
- Who's doing what?
  - Solr 3.1?
  - Solr 1.4.1?
  - Nightly build?
  - Solr 1.3 or older?
- Are there any specific problems you're having?
- Meanwhile, interrupt, ask questions as we go, etc.


A Brief Word About Lucid Imagination
- Lucid Imagination: the commercial company supporting Lucene/Solr open source search
- Founded by:
  - Yonik Seeley – Creator of Solr
  - Erik Hatcher – Co-author, Lucene in Action
  - Grant Ingersoll – Apache PMC Chair
  - Marc Krellenstein – Lucid CTO
- Staff includes 9 Lucene/Solr committers
- Training, certification, support, LucidWorks Enterprise


Lucid Customers (That I've Worked With)


…On To The Sinning!


Sins As Anti-Patterns?
- "Sorta kinda"
  - Specify Nothing (Sloth)
  - Creeping Featuritis (Greed)
  - Blowhard Jamboree (Pride)
  - Boat Anchor (Lust)
  - Not Invented Here (Envy)
  - Phatware (Gluttony)
  - Emperor's New Clothes (Wrath)


Sins Can Contradict One Another!
- You'll notice that many of the "sins" we see will be the exact opposite of others
- Just as some of us tend towards laziness, others tend towards excess
- Sometimes it's "Look before you leap."
- Other times, "He who hesitates is lost."
- In Solr (or any search app), one size never fits all


"I don't know and I don't care."


Sloth
- "We aren't really into open source."
  - Lack of commitment to Solr and/or the search application itself
- Not developing in-house Solr expertise
- Not paying enough attention to JVM settings, garbage collection, and RAM allocation


Sloth
- Neglecting to get familiar with the source code
  - It is open source after all!
- Not taking the time to understand the main parts of Solr:
  - Request handlers
  - Search components
  - Query parsers
    - Extend the QParserPlugin class
    - ValueSource & ValueSourceParser – custom functions
    - New pseudo-fields in 4.x
  - Response writers


Sloth
- Not keeping up with new features and developments in Lucene and Solr
- CHANGES.txt – use "diff" to keep up on changes
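
One way to follow the "diff" advice without leaving a script is sketched below. It is only an illustration: the helper name `added_lines` and the two sample excerpts are hypothetical, and in practice you would read the CHANGES.txt shipped with each release you are comparing.

```python
import difflib


def added_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines that appear in new_text but not in old_text."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" opcodes both carry lines that are new in b.
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


# Hypothetical excerpts standing in for two versions of CHANGES.txt.
OLD_CHANGES = "== Release A ==\n* Feature one\n"
NEW_CHANGES = "== Release B ==\n* Feature two\n== Release A ==\n* Feature one\n"

for line in added_lines(OLD_CHANGES, NEW_CHANGES):
    print(line)
```

Running a comparison like this after each upgrade gives a short reading list of what changed since the version you are on.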


Sloth
- New features in Solr 3.1:
  - Solr spatial
  - Edismax query parser
    - NOT experimental!
  - Dynamic metadata extraction via UIMA
  - Numeric range faceting (like date faceting)
  - Lucene RAMDirectoryFactory available
  - Faceting performance improvements
  - Spellcheck and Terms components now work for distributed search
  - Suggester component – better autosuggest!
    - Can add custom dict., phrases, etc.


Sloth
- New features coming in Solr 4.x:
  - Lucene DocumentsWriterPerThread (DWPT)
    - Moving towards "real time"
  - UpdateHandler upgrade to work with real-time
  - Field collapsing/grouping
  - Pivot facets
  - SolrCloud (ZooKeeper)
  - Fuzzy queries 100 times faster
  - Pseudo-fields via functions
  - Relevancy function queries: tf, idf, docFreq, norm, …


Sloth: The Path To Salvation
- Commit to the project and to learning Solr
- Stay up to date on Solr changes
- Stay current with ongoing releases
- Get familiar with the source code
- Spend some time to understand the main configuration files:
  - solrconfig.xml
  - schema.xml
- Read through the entire Solr Wiki once every so often
- Develop in-house Solr expertise


Save a penny, lose a customer.


Greed
- Skimping on resources such as:
  - RAM
    - "Here's a quarter buddy, go buy some RAM!"
  - Storage space
- You will get what you pay for!
  - …on the other hand, not every company has "deep pockets"


Greed
- Trying to "squeeze by", indexing to, and searching on, the same server

[Diagram labels: Indexing, Shards (Indexers), Slave/Searchers, Load Balancer, Searches]


Greed
- Not making the effort to find the right balance between precision and recall
  - Recall: What fraction of the relevant documents in the collection were returned by the system?
  - Precision: What fraction of the returned results are relevant to the information need?
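
The two definitions above reduce to a few lines of arithmetic. The sketch below is a minimal illustration, assuming you have a hand-judged set of relevant document IDs for a test query; the function name and the document IDs are hypothetical.

```python
def precision_recall(returned: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute (precision, recall) for one query from sets of document IDs."""
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents returned, 3 documents truly relevant.
returned_ids = {"d1", "d2", "d3", "d4"}
relevant_ids = {"d1", "d2", "d5"}

precision, recall = precision_recall(returned_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers for a fixed set of test queries makes it easier to see which way a configuration change moved the balance.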


Greed
- A few thoughts about relevance:
  - Get feedback from domain experts
  - Is it better to have lots of results with less precision, or fewer, more targeted results?
  - Different sites will have very different requirements


Greed: The Path To Salvation
- Pry open your wallet – don't be cheap
- You don't have to push the envelope
- Find the right balance between recall and precision
- Don't push for more results over precision – unless that is a clear requirement (sometimes it is)


"What could possibly go wrong?"


Pride
- Reinventing the wheel
  - "Why don't we just write our own search libraries?"
  - Nobody has a use case like us – right?
  - "We need to change the scoring algorithms."


Pride
- Thinking you can "do it all" in Solr
  - Solr is rarely a good choice as a system of record (SOR)
- Consider other tools to work with Solr:
  - Nutch
  - Mahout
  - OpenNLP
  - Google Connector Framework
  - Your own code


Pride
- Stubbornly refusing to use resources such as the mailing lists:
  - Solr user list: solr-user@lucene.apache.org
  - Solr developer list: dev@lucene.apache.org
  - Lucene user list: java-user@lucene.apache.org
- LucidFind: http://www.lucidimagination.com/search/


Pride
- "I will not yield!"
  - Trying to "win battles" on the mailing lists
  - Good karma – be a good citizen in the community


Pride: The Path To Salvation
- Ask for help when needed
- Let the business needs define the project – don't let the tail wag the dog
- Get a feel for the Solr community and respect the experience of others
- Your situation, while possibly unique, is probably not completely dissimilar to others. Learn from the pioneers and Solr veterans


"Someone stop me!"


Lust
- Obsessing over unimportant details too early in the project
  - Agile approach is well suited to Solr development – iterate!
- Trying to "push the envelope"
  - Necessary sometimes, but it's not called the "bleeding edge" without reason
  - "Ease in" to major changes
- Too much attention to JVM settings
  - Solr experts are not usually JVM/GC experts


Lust
- "Anti-greed" – committing too many resources to Solr
  - Make sure the OS has plenty of RAM to cache files, etc.
- "If one is good, a dozen must be better!"
  - As much as possible, try to get a sense of what your query volume will be, and don't just throw money at building a monstrous farm of searchers
  - Solr has proven to be much more efficient than some large, commercial search solutions


Lust
- Blood from a turnip:
  - Trying some absurd new technique, "just because"
- RAMDirectoryFactory – not a secret way to faster indexing/searching
  - No disk-backed persistence
  - Usually not worth it
  - …but you never know…
- Research first before going "extreme"


Lust
- No need to index millions of docs for development
- Better to work with small sets of data while getting started
- Don't worry too much about field types as you get started. Get data in the index, then analyze and refine.


Lust: The Path To Salvation
- Use an agile approach – start simply, build your application slowly, iterate
- Deal with the low-hanging fruit first
- Measure twice, cut once
- Don't miss the forest for the trees – no need to obsess over details in the early stages
- Do some due diligence before trying unorthodox approaches
- Get a small sample of data indexed without worrying about field types, then refine over iterations


"If we had some bacon we could have some bacon and eggs – if we had some eggs."


Envy
- Adding "cool" features you see on other sites, but don't really need
  - Keep it "lean and mean", especially to start
  - Resist the urge to include the "kitchen sink"


Envy
- You too can master dismax!
  - Don't be afraid of dismax/edismax
  - Lots of controls to learn, but also lots of power
  - Flexibility to search multiple fields
  - Boost different fields
  - Boost phrase fields (pf) higher than query fields (qf)
  - Use boost queries (bq) and function queries (bf)
  - Most intimidating params:
    - tie
    - mm
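
To make the dismax parameters above concrete, here is a rough sketch of a request that sets all of them. The field names (`title`, `body`, `inStock`, `created`), the boost values, and the base URL are assumptions for illustration only; suitable values depend entirely on your schema and relevance testing.

```python
from urllib.parse import urlencode


def build_dismax_url(base_url: str, user_query: str) -> str:
    """Build a dismax select URL; every field name and boost here is hypothetical."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # qf: fields to search, with per-field boosts.
        "qf": "title^2 body",
        # pf: phrase fields, boosted higher than qf so whole-phrase matches rank up.
        "pf": "title^4 body^2",
        # bq: an extra boost query added to the user's query.
        "bq": "inStock:true^1.5",
        # bf: a boost function, here favouring newer documents.
        "bf": "recip(rord(created),1,1000,1000)",
        # tie: how much lower-scoring fields contribute next to the best field.
        "tie": "0.1",
        # mm: up to 2 clauses all required; 3-5 allow one miss; above 5 need 80%.
        "mm": "2<-1 5<80%",
    }
    return f"{base_url}?{urlencode(params)}"


print(build_dismax_url("http://localhost:8983/solr/select", "red shoes"))
```

Starting with only `qf` and adding `pf`, `tie`, and `mm` one at a time makes it easier to see what each control does to the ranking.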


Envy
- Spatial search – seems complicated, but major sites make it look easy
- Now, in Solr 3.1 – it is easy!
- You can:
  - Store spatial data in your index
  - Filter by distance
  - Sort by distance
  - Boost/bias by distance
  - Facet by distance
- Also consider: search-based navigation such as "Show me in-stock items only"
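
As a sketch of the "filter by distance" and "sort by distance" items, the parameters below follow the Solr 3.1 spatial conventions (`geofilt`, `sfield`, `pt`, `d`, `geodist()`). The field name `store` and the coordinates are assumptions; the field would need to be a spatial type such as a lat/lon field in your schema.

```python
def spatial_params(lat: float, lon: float, radius_km: float, field: str = "store") -> dict[str, str]:
    """Request parameters for a distance filter plus nearest-first sort."""
    return {
        "q": "*:*",
        # Keep only documents within d kilometres of pt.
        "fq": "{!geofilt}",
        "sfield": field,           # hypothetical spatial field name
        "pt": f"{lat},{lon}",      # centre point as "lat,lon"
        "d": f"{radius_km}",       # radius in kilometres
        # Nearest documents first.
        "sort": "geodist() asc",
    }


params = spatial_params(45.15, -93.85, 5)
for name, value in params.items():
    print(f"{name}={value}")
```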


Envy: The Path To Salvation
- Focus on your requirements, don't try to add "bells and whistles" you don't need
- Don't be hesitant to dive into the power of dismax/edismax
- Take advantage of new features such as Solr spatial, if those features will add value to the end user experience


"A fat stomach never breeds fine thoughts."


Gluttony
- "Staying fit and trim" is usually good practice when designing and running Solr applications
  - Once again – keep it "lean and mean"
- A lot of these issues cross over into the "Sloth" category
  - The effort needed to keep your configuration and data efficiently managed is not considered important
- Don't lose control of your configuration files
  - Remove unnecessary elements
  - Version control all configuration files


Gluttony
- Slim down those "bloated" queries:

q="red shoes"&accountId=(12343 OR 338899 OR 554443 OR 243445 OR 55442 OR 3330899 OR 59927 OR 3888999 OR 549 OR 440293579 OR 34201 OR 339917 OR 300191 OR 339338 OR 109823 OR 679176 OR 31407815 OR 3001756 OR 134322 OR 311123 OR 987888 OR 997181 OR 771819 OR 100292 OR 3389474 OR 5505759 OR 2459577 OR 4499957 OR 1996571 OR 559590 OR 220299 OR 4404872 OR 151510 OR 66017 OR 666 OR 113459 OR 890575 OR 505725 OR 330393 OR 349940 OR 4094994 OR 1245995 OR 2459959 OR 4255909 OR 899955 OR 7878899 OR 100999 … ∞ )
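
The slide does not prescribe a fix, so the following is only one possible direction, not the author's method: keep the user's words in `q` and move the account restriction into a filter query (`fq`), with the IDs de-duplicated and sorted so that the same set of accounts always yields the same filter string. The helper name is hypothetical, and a very long ID list is still better replaced by an indexed grouping field where the data allows it.

```python
def build_params(user_query: str, account_ids: list[int]) -> dict[str, str]:
    """Separate the scored user query from the account restriction."""
    params = {"q": user_query}
    ids = sorted(set(account_ids))  # stable order, no duplicates
    if ids:
        # The restriction goes in fq, so it filters results without affecting scores.
        params["fq"] = "accountId:(" + " OR ".join(str(i) for i in ids) + ")"
    return params


print(build_params('"red shoes"', [338899, 12343, 12343]))
```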


Gluttony
- Stay in shape – flex your Solr muscles!
  - Keep up on new features
  - Training, when appropriate
  - Certification
  - Contribute!
  - Follow the user lists
  - Refactor when new features can help
  - Keep up to date on new releases


Gluttony: The Path To Salvation
- Keep configuration files clean and trim. Remove unused elements
- Periodically review queries to make sure they are efficient
- Refactor when necessary – keep your application fit and trim


"Hope is the denial of reality."


Wrath
- Wrath – usually synonymous with anger, but…
- Let's use an older definition here:
  - "A vehement denial of the truth, both to others and in the form of self-denial and impatience."
- Step back every now and then and look objectively at your application


Wrath
- Resist the push to rush to production…


Wrath
- Ignoring new Solr releases
  - OK to wait until a release is proven
  - But getting too far behind makes upgrading more painful with each release
- We don't have time to do it right, but we always have time to fix it


Wrath
- Ignoring complaints about results relevance
- Disregarding feedback from stakeholders
- Remember – the point of your search application is to support the business, not to "build cool stuff"
- Not taking advantage of log files
  - Consider mining log files, storing data in a relational DB for generating reports
  - Capturing user queries and query counts can be extremely useful
  - Can also be used for query-based autosuggest (not just indexed terms)
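
A minimal sketch of the log-mining idea follows. It assumes request log lines that contain a `params={...}` section with URL-encoded parameters; the sample lines are made up, so check the pattern against your own logs before relying on it. Queries containing a literal `}` would need a more careful parser.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumed log layout: "... params={q=red+shoes&rows=10} hits=42 ..."
PARAMS_RE = re.compile(r"params=\{([^}]*)\}")


def count_queries(log_lines: list[str]) -> Counter:
    """Count user queries (the q parameter), normalised to lower case."""
    counts: Counter = Counter()
    for line in log_lines:
        match = PARAMS_RE.search(line)
        if not match:
            continue  # not a request line
        for q in parse_qs(match.group(1)).get("q", []):
            counts[q.strip().lower()] += 1
    return counts


# Hypothetical sample lines for illustration.
SAMPLE_LOG = [
    "INFO: [] webapp=/solr path=/select params={q=red+shoes&rows=10} hits=42 status=0 QTime=3",
    "INFO: [] webapp=/solr path=/select params={q=Red+Shoes&start=10} hits=42 status=0 QTime=2",
    "INFO: [] webapp=/solr path=/select params={q=boots} hits=7 status=0 QTime=1",
    "INFO: start commit",
]

for query, count in count_queries(SAMPLE_LOG).most_common():
    print(count, query)
```

The resulting counts can be loaded into a reporting database, or used as the source for query-based autosuggest.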


Wrath: The Path To Salvation
- Keep your version of Solr up to date
  - OK to wait "awhile", but don't skip versions
- Seek and embrace feedback from business and domain experts
- Constantly gauge and improve relevance as an ongoing task
- Avoid the push to release too soon (as best you can)
- Take advantage of log files to understand what users are doing, and what is not working well


Seek, and you shall find!

