log in | about 

Turns out that sometimes fields in Solr (or Lucene) are to be renamed. There is a long-standing request to implement a standard field-renaming utility in Lucene. Some hacky solutions were proposed, but these solutions are not guaranteed to work in all cases. For details see a discussion between John Wang and Michael McCandless.

Essentially, re-indexing (or re-importing) seems to be inevitable and the question is how to do it in the easiest way. Turns out that in the latest Solr versions, you can simply define a DataImportHandler that would read records from an original Solr instance and save them to a new one! In doing so, the DataImportHandler would rename fields if necessary. Mikhail Khludnev pointed out that this solution would work only for stored fields. Yet, it may still be useful as many users prefer to store the values of indexed fields.

Creating a new index via DataImportHandler is a conceptually simple solution, which is somewhat hard to implement. This use case (copying data from one Solr instance to another) is not covered well. I tried to search for good examples on the Web, but I could only find an outdated one. This is why I decided to write this small HOWTO for Solr 4.x.

First of all, one needs to create a second Solr instance that has an almost identical configuration, except some fields would be named differently. I assume that the reader already knows the basic of Solr configuration and this step needs no further explanation. Then, one needs to add a description of the import handler to the solrconfig.xml file of the new instance.

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">solr-data-config.xml</str>

This description simply delegates most of the configuration to the file solr-data-config.xml. The format of this configuration file is sketched on the Apache web site.

Two key elements need to be defined. The first element is a dataSource. Let us use the URLDataSource. For this data source, we need to specify only the type, the encoding (optional), and (also optionally) timeout values.

The second element is an entity processor. We need SolrEntityProcessor. To indicate which fields to rename, we should use the element field. The attribute column would refer to the source field name, while the attribute name would denote the field name in the new instance. The field element defines renaming rules.

Here is an example of the configuration file solr-data-config.xml:

  <dataSource type="URLDataSource"  encoding="UTF-8" connectionTimeout="5000" readTimeout="10000" />
    <entity name="rename-fields" processor="SolrEntityProcessor" query="*:*" url="http://localhost:8984/solr/Wiki" 
         rows="100" fl="id,text,annotation">
      <field column="id" name="Id" />
      <field column="text" name="Text4Annotation" />
      <field column="annotation" name="Annotation" />

Next, note that this is very important, we need to copy a jar solr-dataimporthandler-4.x.jar (x stands for the Solr version) to the lib folder inside the instance directory. This jar-file comes with the standard Solr distribution, but it is not enabled by default!

Why do we need to copy it to the lib folder inside the instance directory, is there a way to specify an arbitrary location? This is should be possible in principle, but the feature appears to be broken (at least in Solr 4.6). I submitted a bug report, but it was neither confirmed nor rejected.

Finally, you can restart the instance of Solr and open the Solr Admin UI in your favorite browser.

Select the target instance and click on the dataimport menu item. Then, select the command (e.g., full-import), the entity (in our case it's rename-fields) and check the box "Auto-Refresh" status. You will also need to set the start row and the number of rows to import. When all is done, click Execute

I hope this was helpful and the import would succeed. If not (e.g., the configuration is broken and a target instance cannot be loaded), please check the Solr log.