Overview of Searching in Solr 1.4 by ethan ray


Solr 1.4 can now ingest these other types of documents using a feature called Solr Cell.³ Solr Cell uses another open source project, Tika, to read documents in a variety of formats and convert them to an XHTML stream. Solr parses the stream to produce a document, which is then indexed. Here are a few of the formats that Tika understands:

• PDF
• OpenDocument (OpenOffice formats)
• Microsoft OLE 2 Compound Document (Word, PowerPoint, Excel, Visio, etc.)
• HTML
• RTF
• gzip
• ZIP
• Java Archive (JAR) files
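As a rough illustration of how a client might hand one of these files to Solr Cell, the sketch below builds a request URL for the extracting handler and posts the raw file bytes to it. Several details are assumptions rather than anything stated in this paper: the handler is taken to be mapped to /update/extract in solrconfig.xml, the schema's unique key field is taken to be named id (supplied through the literal.id parameter), and Solr is taken to be listening at http://localhost:8983/solr. The function names build_extract_url and post_document are hypothetical helpers introduced only for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id, commit=True):
    """Build the URL for a Solr Cell extraction request.

    Assumes the extracting handler is mapped to /update/extract and
    that the unique key field is called "id" (both are assumptions).
    """
    params = {"literal.id": doc_id}
    if commit:
        # Ask Solr to commit so the document becomes searchable.
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


def post_document(url, path, content_type="application/octet-stream"):
    """POST the raw bytes of a file (PDF, Word, HTML, ...) to Solr.

    Tika identifies the format on the server side; the content type
    given here is a generic default and may be replaced by a more
    specific one such as application/pdf.
    """
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(url, data=body, headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical host, document id, and file name.
    url = build_extract_url("http://localhost:8983/solr", "doc1")
    print(url)
    # post_document(url, "report.pdf")  # requires a running Solr instance
```

Keeping URL construction separate from the network call makes the request parameters easy to inspect without a running server.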

DataImportHandler Enhancements

DataImportHandler knows how to index data pulled from relational databases or XML files. The details of what is indexed and how it happens are configured in solrconfig.xml. Solr 1.4 contains some extremely useful upgrades to DataImportHandler.

The first is the ability to push data into DataImportHandler. In Solr 1.3, DataImportHandler was pull-only. This meant that the only possible way to push data to Solr was to use the update XML or CSV format, so you couldn’t take advantage of any of DataImportHandler’s capabilities. In Solr 1.4, a new component called ContentStreamDataSource allows you to use DataImportHandler’s features when indexing pushed content.

Another powerful enhancement in Solr 1.4 is the ability to listen for import events. All you need to do is provide an implementation of the EventListener interface and let Solr

³ The name is based on the acronym Content Extraction Library (CEL). This feature is also known by its more technical name, ExtractingRequestHandler.

What’s New in Solr 1.4 A Lucid Imagination Technical White Paper • October 2009
