
How to rename stored fields in Solr

It turns out that fields in Solr (or Lucene) sometimes need to be renamed. There is a long-standing request to implement a standard field-renaming utility in Lucene. Some hacky solutions were proposed, but they are not guaranteed to work in all cases. For details, see a discussion between John Wang and Michael McCandless.

Essentially, re-indexing (or re-importing) seems to be inevitable, and the question is how to do it in the easiest way. It turns out that in recent Solr versions you can simply define a DataImportHandler that reads records from the original Solr instance and saves them to a new one, renaming fields along the way. Mikhail Khludnev pointed out that this solution works only for stored fields. Yet, it may still be useful, as many users prefer to store the values of indexed fields.

Creating a new index via the DataImportHandler is a conceptually simple solution, which is, however, somewhat tricky to implement. This use case (copying data from one Solr instance to another) is not well documented. I tried to search for good examples on the Web, but could only find an outdated one. This is why I decided to write this small HOWTO for Solr 4.x.

First of all, one needs to create a second Solr instance that has an almost identical configuration, except that some fields are named differently. I assume that the reader already knows the basics of Solr configuration, so this step needs no further explanation. Then, one needs to add a description of the import handler to the solrconfig.xml file of the new instance.

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">solr-data-config.xml</str>
    </lst>
  </requestHandler>

This description simply delegates most of the configuration to the file solr-data-config.xml. The format of this configuration file is sketched on the Apache web site.

Two key elements need to be defined. The first element is a dataSource. Let us use the URLDataSource. For this data source, we need to specify only the type, the encoding (optional), and (also optionally) timeout values.

The second element is an entity processor; we need the SolrEntityProcessor. Renaming rules are defined via field elements: the attribute column refers to the source field name, while the attribute name denotes the field name in the new instance.

Here is an example of the configuration file solr-data-config.xml:

  <dataConfig>
    <dataSource type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000" />
    <document>
      <entity name="rename-fields" processor="SolrEntityProcessor" query="*:*"
           url="http://localhost:8984/solr/Wiki" rows="100" fl="id,text,annotation">
        <field column="id" name="Id" />
        <field column="text" name="Text4Annotation" />
        <field column="annotation" name="Annotation" />
      </entity>
    </document>
  </dataConfig>

Next (and this is very important), we need to copy the jar solr-dataimporthandler-4.x.jar (x stands for the Solr minor version) to the lib folder inside the instance directory. This jar file comes with the standard Solr distribution, but it is not enabled by default!

Why do we need to copy it to the lib folder inside the instance directory; is there a way to specify an arbitrary location? This should be possible in principle, but the feature appears to be broken (at least in Solr 4.6). I submitted a bug report, but it was neither confirmed nor rejected.

Finally, you can restart the instance of Solr and open the Solr Admin UI in your favorite browser.

Select the target instance and click on the dataimport menu item. Then, select the command (e.g., full-import), the entity (in our case it's rename-fields), and check the "Auto-Refresh Status" box. You will also need to set the start row and the number of rows to import. When all is done, click Execute.
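The same import can also be launched without the Admin UI by hitting the dataimport endpoint over HTTP. The sketch below only builds and prints the request URL; the host, port, and core name (NewWiki) are assumptions that you would need to adjust to your own setup:

```python
from urllib.parse import urlencode

# Host, port, and core name are hypothetical; adjust to your setup.
base = "http://localhost:8983/solr/NewWiki/dataimport"
params = urlencode({
    "command": "full-import",   # other commands: delta-import, status, abort
    "entity": "rename-fields",  # the entity name from solr-data-config.xml
    "commit": "true",           # commit automatically when the import finishes
})
url = base + "?" + params
print(url)

# To actually launch the import (the target instance must be running):
# import urllib.request
# print(urllib.request.urlopen(url).read().decode())
```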

I hope this was helpful and the import would succeed. If not (e.g., the configuration is broken and a target instance cannot be loaded), please check the Solr log.

selectCovered is a substantially better version of the UIMA subiterator

As I recently wrote, annotations are a popular formalism in the world of Natural Language Processing (NLP). They are markers used to highlight parts of speech (POS), syntax structures, as well as other constructs arising from text processing. One frequently used operation consists in retrieving all annotations under a given covering annotation. For example, sentences can be marked with annotations of a special type. Given a sentence annotation, you may need to retrieve all POS-tag annotations within this sentence annotation.

In the UIMA framework, retrieval of covered annotations can be done using the subiterator function. This function is tricky, however. When covering and covered annotations have equal spans, UIMA uses complex rules to figure out whether one annotation should be considered covered by another. These rules are defined by the so-called type priorities. Simply speaking, one annotation can be truly covered by another one, but UIMA may consider this not to be the case (which is really annoying).

Fortunately, as I learned recently, there is an easy way to avoid this type-priority pain in the neck. There is a special library called UIMAfit that works on top of UIMA. This library implements a neat replacement for the subiterator, namely, the function selectCovered. This function relies on the same approach (i.e., it also uses an annotation index), but it completely ignores UIMA type priorities.

There is more than one version of selectCovered. One version accepts a covering annotation; another explicitly accepts a covering range. Be careful with the second one! It is claimed to be rather inefficient. And, of course, I wanted to measure this inefficiency. To this end, I took my old code and added two tests, one for each version of selectCovered.

As previously, with the brute-force iteration approach, finding the covered annotations takes a fraction of a millisecond. For the subiterator function, the time varied in the range of 1-6 microseconds, which is two orders of magnitude faster. The efficient variant of selectCovered was another 2-4 times faster than subiterator. However, the inefficient variant, which explicitly accepts the covering range, is as slow as the brute-force approach.

Conclusions? The UIMAfit function selectCovered is much better than the native UIMA subiterator. However, one should be careful and use the efficient variant that accepts (as an argument) the covering annotation rather than the explicit covering range!

Does arXiv really have a high citation index?

In a recent post, Daniel Lemire says that "... though unrefereed, arXiv has a better h-index than most journals". In particular, arXiv is included in Google's list of the most cited venues, where it consistently beats most other journals and conferences. Take a look, e.g., at the section Databases & Information Systems. Daniel concludes by advising readers to subscribe to the arXiv Twitter stream.

Well, obviously, arXiv is a great collection of open-access, high-quality publications (at least a subset is great), but what implications does it have for a young researcher? Does she have to stop publishing at good journals and conferences? Likely not, because the high ranking of arXiv seems to be misleading.

Why is that? Simply because arXiv is not an independent venue: it mirrors papers published elsewhere. Consider, e.g., the top three papers in the Databases & Information Systems section:

  1. Low, Yucheng, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment.
  2. Hay, Michael, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. "Boosting the accuracy of differentially private histograms through consistency." Proceedings of the VLDB Endowment.
  3. Xiao, Xiaokui, Guozhang Wang, and Johannes Gehrke. "Differential privacy via wavelet transforms." IEEE Transactions on Knowledge and Data Engineering.

All of them appeared elsewhere, two in the prestigious VLDB venue. Perhaps this is just sample bias, but of the top 10 papers in this section, all 10 were published elsewhere, mostly in VLDB proceedings.

However, Daniel argues that only a small fraction of VLDB papers appears on arXiv, thus apparently implying that the high ranking of arXiv cannot be explained away by the fact that arXiv is not independent:

One could argue that the good ranking can be explained by the fact that arXiv includes everything. However, it is far from true. There are typically less than 30 new database papers every month on arXiv whereas big conferences often have more than 100 articles (150 at SIGMOD 2013 and over 200 at VLDB 2013).

But it absolutely can! Note that venues are ranked using the h5-index, which is equal to the largest number h such that h articles published in 2009-2013 have at least h citations each. For a high h5-index, it is sufficient to have just a few dozen highly cited papers. And these papers could come from VLDB and other prestigious venues.
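To see why a relatively small number of highly cited papers suffices, here is a minimal sketch of the h-index computation (the citation counts below are, of course, made up):

```python
def h_index(citations):
    # h is the largest number such that h articles have at least h citations each
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# A venue hosting just 50 highly cited papers gets h = 50, no matter
# how many weakly cited papers it also hosts:
print(h_index([100] * 50 + [1] * 1000))  # → 50
```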

I have to disclaim that, aside from verifying the top 10 papers in the Databases & Information Systems section of arXiv, I did not collect solid statistics on the co-publishing of top arXiv papers. If anyone has such statistics, and they show a low co-publishing rate, I will be happy to retract my arguments. However, so far, the statement "arXiv has a high citation index" looks like an outcome of a regression that misses an important covariate.

The arguments in support of arXiv are in line with Daniel's other posts. Check, for example, his recent essay, where he argues that a great paper need not be published in VLDB or SIGIR. While I absolutely agree that obsessing over top-tier conferences is outright harmful, I think that publishing some of one's work there makes a lot of sense, and here is why.

If you are a renowned computer scientist with a popular blog, dissemination of your work is an easy-peasy business. You can inscribe your findings on the Great Wall of China and your colleagues will rush to buy airline tickets to see it. You can send an e-mail, or you can publish a paper on arXiv. Regardless of the delivery method, your paper will still get a lot of attention (as long as the content is good). For less known individuals, things are much more complicated. In particular, a young scientist has to play a close-to-zero-sum game and compete for the attention of readers. If she approaches her professor or employer and says, "I have done good work recently and published 10 papers on arXiv", the effect is almost certainly guaranteed to be merely comical. She will be sneered at and taught a lesson about promoting her work better.

People are busy and nobody wants to waste time on reading potentially uninteresting papers. One good time-saving strategy is to make other people read them first. Does this screening strategy have false positives and/or false negatives? It absolutely does, but, on average, it works well. At least, this is a common belief. In particular, Daniel himself will not read any P=NP proofs.

To conclude, Knuth and other luminaries may not care about prestigious conferences and journals, but for other people they mean a lot. I am pretty sure that co-publishing your paper online and promoting it in blogs is a great supplementary strategy (I do recommend doing this, if you care about my lowly opinion), but it is likely not a replacement for traditional publishing approaches. In addition, I am not yet convinced that arXiv could have a high citation index on its own, without being a co-publishing venue.

A catch for "Min Number Should Match" in Solr's ExtendedDisMax parser

One great feature of Solr is that you can employ different query parsers, even in the same query. There is a standard Solr/Lucene parser, and there are a number of extensions. One useful extension is the ExtendedDisMax parser. In this parser, it is possible to specify a percentage of query words (or blocks) that should appear in a document. This enables a kind of fuzzy matching.

Consider an example of a two-word query "telegraph invent". To retrieve documents using an 80% threshold for the number of matching words, one can specify the following search request:

_query_: "{!edismax mm=80%} telegraph invent "

There is, however, a catch. One may expect that an 80% threshold on a two-word query means that retrieved documents must contain both query words. However, this appears not to be the case. Somewhat counter-intuitively, the minimum required number of matching keywords is computed by rounding down rather than by the more standard rounding half up: 80% of two words is 1.6, which is truncated to 1.
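The effective threshold can be reproduced with a one-line floor computation (a sketch of the rounding rule as I understand it, not the actual Solr code):

```python
import math

def min_should_match(percent, num_query_words):
    # edismax truncates the fractional part (rounds down),
    # rather than rounding half up
    return math.floor(percent / 100.0 * num_query_words)

print(min_should_match(80, 2))  # → 1: only one of the two words must match
print(min_should_match(80, 5))  # → 4
```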

What if you want to enforce the minimum number of words appearing in a document in a more transparent way? Turns out that you can still do this. To this end, one needs to specify the minimum number of words explicitly, rather than via a percentage. The above example would need to be rewritten as follows:

_query_: "{!edismax mm=2} telegraph invent "

It should apparently be possible to specify even more complex restricting conditions where, e.g., percentages and absolute thresholds are combined. More details can be found here. However, combining conditions did not work out for me (I got a syntax error).

On the Size of the Deep Web

The World Wide Web (or simply the Web) started as a tiny collection of several dozen web sites about 20 years ago. Since then, the number of Web pages has grown tremendously, and the Web has become quite segregated. There is a well-lit part of it, the surface Web, which is indexed by search engines, and there is the so-called deep Web, which is studied only slightly better than deep outer space.

How many pages are on the surface? According to some measurements, there are several dozen billion pages indexed. Were all of these pages created by humans manually? It is possible, but I doubt it. There are about 100 million books written by humans. Let us assume that a book has 100 pages, each of which is published as a separate HTML page. This would give us only 10 billion pages. I think that during the 20 years of the existence of the Web, the number of manually created pages could have hardly surpassed this threshold. Consequently, it is not unreasonable to assume that most Web pages were automatically generated, e.g., for spamming purposes (two common generation approaches are scraping/mirroring content from other web sites and generating gibberish text algorithmically).
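The back-of-the-envelope estimate above is easy to restate in code (both inputs are rough assumptions, not measurements):

```python
books_written = 100_000_000  # rough estimate of all books ever written
pages_per_book = 100         # assume each page becomes a separate HTML page
book_pages = books_written * pages_per_book
print(book_pages)            # 10 billion pages, well below the indexed total
```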

Ok, but what is the size of the deep Web? Six years ago, Google announced it knew about a trillion Web pages. Assuming that the Web doubles each year, the size of the deep Web should be in the dozens of trillions of pages right now. This is supported by a more recent Google announcement: there are at least 60 trillion pages lurking in the depths of the Web!

What constitutes this massive dataset? There are allegations that the deep Web is used for all kinds of illegal activities. Well, there is definitely some illegal activity going on there, but I seriously doubt that humans could have manually created even a tiny fraction of the deep Web directly. To make this possible, every person on the planet would have to create about 10 thousand Web pages. This would be a tremendous enterprise even if each Web page were just a short status update on Facebook or Twitter. Anyway, most people write status updates perhaps once a year, and not everybody is connected to the Web either.
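A similar sanity check for the 60-trillion figure (the population size is a round assumption):

```python
deep_web_pages = 60_000_000_000_000  # Google's "at least 60 trillion" figure
world_population = 7_000_000_000     # round estimate; not everyone is online
pages_per_person = deep_web_pages / world_population
print(round(pages_per_person))       # roughly 10 thousand pages per person
```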

Therefore, I conclude that the deep Web should be mostly trash generated by (presumably) spamming software. Any other thoughts regarding the origin of so many Web pages?