SOLR DataImportHandler performance tuning

Some hopefully useful tips on speeding up data imports via the SOLR DataImportHandler.

Performance tips

  1. Analyze your SQL queries and optimize them!
  2. Don’t use sub-entities. Create one big SELECT in query that returns all the required fields at once; this avoids running multiple sub-selects per imported document (see the sketch after this list).
  3. Split multiValued fields with the built-in RegexTransformer, or with the somewhat faster SQLArrayTransformer from my solr-utils, which transforms native SQL array types into multiValued fields (also sketched after this list).
  4. For deltaQuery use some kind of dummy query, e.g.
    SELECT 1 AS id

    or

    SELECT 1 AS id FROM dual
  5. Then return all the documents for the delta import in deltaImportQuery (the same way as in query).
  6. Now compare the original per-document delta import with the new one. If you import only a few documents per delta run (i.e. fewer than 20-30), the per-document approach might be faster; for bigger batches it will probably be much slower. To see the difference: on one schema I’m using, the per-document delta import handles 4 docs per second, compared to 300 docs per second with the multi-document delta import query, although that query always runs for more than 30 seconds. The SQL schema and the quality of your queries decide what is really better.
  7. Consider creating views for all the import queries instead of writing them into dataimport.xml. This way you can easily test their correctness, and you can also restrict the Solr import user to these views only (a sketch follows this list).
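
For points 2 and 3, a minimal sketch of a flattened query, assuming PostgreSQL; the doc and doc_tag tables and their columns are made up for illustration:

-- One flat query instead of a sub-entity: the child rows are aggregated
-- into a single delimited column (hypothetical tables doc and doc_tag).
SELECT d.id,
       d.title,
       string_agg(t.name, ',') AS tag  -- split later inside Solr
FROM doc d
LEFT JOIN doc_tag t ON t.doc_id = d.id
GROUP BY d.id, d.title;

The aggregated column can then be split back into a multiValued field with the built-in RegexTransformer (again only a sketch, reusing the hypothetical names from the query above):

<entity name="doc" transformer="RegexTransformer"
        query="SELECT d.id, d.title, string_agg(t.name, ',') AS tag
               FROM doc d LEFT JOIN doc_tag t ON t.doc_id = d.id
               GROUP BY d.id, d.title">
  <field column="tag" splitBy=","/>
</entity>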
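
For point 7, a sketch of what such a view and a restricted import user could look like (again assuming PostgreSQL; the columns and the solr_import role are made up, only the export.doc_active name matches the example below):

-- A view hiding the full import query from dataimport.xml
-- (hypothetical tables and columns)
CREATE VIEW export.doc_active AS
SELECT d.id,
       d.title,
       string_agg(t.name, ',') AS tag
FROM doc d
LEFT JOIN doc_tag t ON t.doc_id = d.id
GROUP BY d.id, d.title;

-- Restrict the Solr import user to the export views only
GRANT USAGE ON SCHEMA export TO solr_import;
GRANT SELECT ON export.doc_active TO solr_import;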

Example

<document name="docs">
  <entity name="doc" pk="id" rootEntity="true" onError="skip"
          transformer="sk.dob.search.solr.dih.SQLArrayTransformer"
          query="SELECT * FROM export.doc_active"
          deltaQuery="SELECT 1 AS id"
          deltaImportQuery="SELECT j.* FROM export.doc_active_change_time jact JOIN export.doc_active j ON (j.id = jact.id) WHERE jact.create_time >= '${dataimporter.last_index_time} CEST' AT TIME ZONE 'GMT'"
          deletedPkQuery="SELECT id FROM doc.get_inactive('${dataimporter.last_index_time} CEST' AT TIME ZONE 'GMT') AS j(id)">

    <!-- fields for SQLArray split -->
    <field column="advert_alias"/>
    <field column="tag"/>
    <field column="tag_location"/>
    <field column="tag_industry"/>
  </entity>
</document>

For a database with about 40k documents I was able to reduce the import time from over 30 minutes (my initial config with multiple entities) to 5-7 minutes. Recently, after an upgrade to Solr 6 and with the use of SQLArrayTransformer, it dropped further to only 1-2 minutes (but that is a different SOLR version and the schema also changed, so the numbers can’t really be compared).