
Undocumented invalidation of UIMA iterators

As mentioned previously, in the world of natural language processing (at least, on some of its continents), everything is an annotation. In the Apache UIMA framework, there are capabilities to efficiently iterate over these annotations. For example, it is possible to retrieve a POS tag of every document word, or words belonging to (or covered by) a single sentence (annotation).

The iteration functionality is supported via the class FSIterator<T extends FeatureStructure>. These iterators can be "invalidated" in a rather interesting fashion, which does not seem to be documented properly.

Specifically, if one iterates over annotations and deletes them on the fly:

  FSIterator<Annotation> it = ...
  while (it.isValid()) {
    Annotation an = it.get();
    an.removeFromIndexes(); // modifies the very index we are iterating over
    it.moveToNext();        // the iterator may already be invalid at this point
  }

the behavior seems to be undefined. Sometimes you get a ConcurrentModificationException, and sometimes you silently fail to retrieve all indexed annotations. This issue is not limited to deletion. If you iterate over annotations, create new ones, and add them to the index (using the function addToIndexes), you are also likely to trigger a ConcurrentModificationException. This happens even if you iterate over annotations of one type and create annotations of another type.

This is rather expected behavior because, normally, you cannot iterate over an index and modify it at the same time (though some fancy container implementations do support this). However, many UIMA users (including me) have managed to fall into this trap. The UIMA docs seem to be silent about this issue. The only confirmation of the described effect that I could find was in this obscure mailing list. Yet, I think an appropriate warning should be printed in a large red font.
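A common way around this trap is to split the work into two phases: first copy the items into a private list, then modify the index while looping over the copy. The sketch below shows the pattern with a plain java.util list standing in for the annotation index; the class and method names are made up for illustration, and none of this is UIMA code. With UIMA, one would presumably drain the FSIterator into a list in the first phase and call removeFromIndexes in the second.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// Stand-in for an annotation index: a plain java.util list, NOT a UIMA class.
class TwoPhaseRemoval {

    // Phase 1: drain the iterator into a private snapshot without touching the index.
    static <T> List<T> snapshot(Iterator<T> it) {
        List<T> copy = new ArrayList<>();
        while (it.hasNext()) {
            copy.add(it.next());
        }
        return copy;
    }

    // Phase 2: modify the index while iterating over the snapshot only.
    static <T> void removeAll(List<T> index) {
        for (T item : snapshot(index.iterator())) {
            index.remove(item);
        }
    }

    // The pattern from the post: modifying the index while iterating over it.
    // Returns true if the JDK detected the concurrent modification.
    static <T> boolean naiveRemovalFails(List<T> index) {
        try {
            for (T item : index) {
                index.remove(item);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

Even this stand-in shows both symptoms described above: with a three-element ArrayList the naive loop throws, while with a two-element ArrayList it exits silently after removing only one element.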

This is the sort of English up with which I will not put!

There is a common belief that English sentences should not end with prepositions. I have heard that Californian teachers are especially vigorous in beating this nonsense into students' heads. There is a famous anecdote about the Nobel Prize winner Winston Churchill, who was offended by an editor clumsily rearranging one of his sentences that ended with a preposition. Being proud of his style, Winston Churchill wrote in reply (note that there are several variants of this phrase circulating): "This is the sort of English up with which I will not put."

This joke is not as good as it may seem at first glance, because, in this sentence, up is a verb particle, not a preposition! Simply speaking, the verb is the whole phrase put up. Verb particles can be moved, e.g.: both "switch off the lights" and "switch the lights off" are grammatical. However, I suspect that it is ungrammatical to move particles the way Winston Churchill did in his humorous reply to the editor.

Anyways, "stranded" prepositions are perfectly fine in English. Yet, I have been wondering why this is considered ungrammatical by so many people. Turns out that Romance languages, in general, and Latin in particular, do not have preposition stranding. Teachers believed that constructs impossible in Latin should not be allowed in English. As a result, for hundreds of years, they have been telling us that "nobody to play with" is ungrammatical.

Disclaimer: I know that there are some good arguments against the veracity of Churchill's story.

Credits: This post resulted from observations of El Nico Fauceglia and remarks by a linguist who wanted to remain anonymous. Anna Belova told me the Churchill anecdote.

Not everything warrants an efficient implementation

"Not everything warrants an efficient implementation" is an old maxim. Yet, the explosion in computing power never ceases to amaze me. For example, recently I needed to deal with a text file where sentences were divided into several classes (say, 200). The file itself contained 2-3K lines. I had a loop over class identifiers. In each iteration, I got a class id x and had to retrieve the sentences belonging to this class x from the file.

I had a technical problem that prevented me from parsing the file once and storing results in, say, a hash map. Instead, in each iteration I had to read the whole file, parse it, and keep only the sentences related to the current class id x. This was a horrible solution with a potentially quadratic runtime, right?

It was horrible, and in the beginning I was a bit worried about the efficiency of this approach. One wouldn't do it on an old i386 (or worse) machine. However, when I tested this solution on a modern Core i7 laptop, it turned out that re-reading and re-parsing the file took only 0.02 sec in Java. Other components were much slower, and I could in principle have afforded to deal with a 10x larger file that had 10x as many unique classes (compared to the current 2.5K-line file with fewer than 200 classes).
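For concreteness, the rescan-per-class approach could look roughly like the sketch below. The tab-separated line format (class id, then the sentence) and all names are assumptions made for illustration; the actual file format is not described here, and the lines are passed in as a list so the example does not depend on a real file.

```java
import java.util.ArrayList;
import java.util.List;

class RescanPerClass {

    // One full pass over all lines, keeping only sentences of the requested class.
    // Assumed (hypothetical) line format: classId<TAB>sentence
    static List<String> sentencesForClass(List<String> lines, String classId) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip malformed lines
            }
            if (line.substring(0, tab).equals(classId)) {
                result.add(line.substring(tab + 1));
            }
        }
        return result;
    }

    // The outer loop: every class id triggers another full scan, so the total
    // work is (number of classes) x (number of lines).
    static int totalSentences(List<String> lines, List<String> classIds) {
        int total = 0;
        for (String classId : classIds) {
            total += sentencesForClass(lines, classId).size();
        }
        return total;
    }
}
```

With roughly 200 classes and 2.5K lines, this amounts to about 500K line checks in total, which is a small amount of work for current hardware.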

How to rename stored fields in Solr

Turns out that sometimes fields in Solr (or Lucene) are to be renamed. There is a long-standing request to implement a standard field-renaming utility in Lucene. Some hacky solutions were proposed, but these solutions are not guaranteed to work in all cases. For details see a discussion between John Wang and Michael McCandless.

Essentially, re-indexing (or re-importing) seems to be inevitable and the question is how to do it in the easiest way. Turns out that in the latest Solr versions, you can simply define a DataImportHandler that would read records from an original Solr instance and save them to a new one! In doing so, the DataImportHandler would rename fields if necessary. Mikhail Khludnev pointed out that this solution would work only for stored fields. Yet, it may still be useful as many users prefer to store the values of indexed fields.

Creating a new index via DataImportHandler is a conceptually simple solution, which is somewhat hard to implement. This use case (copying data from one Solr instance to another) is not covered well. I tried to search for good examples on the Web, but I could only find an outdated one. This is why I decided to write this small HOWTO for Solr 4.x.

First of all, one needs to create a second Solr instance that has an almost identical configuration, except that some fields are named differently. I assume that the reader already knows the basics of Solr configuration and that this step needs no further explanation. Then, one needs to add a description of the import handler to the solrconfig.xml file of the new instance.

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">solr-data-config.xml</str>
    </lst>
  </requestHandler>

This description simply delegates most of the configuration to the file solr-data-config.xml. The format of this configuration file is sketched on the Apache web site.

Two key elements need to be defined. The first element is a dataSource. Let us use the URLDataSource. For this data source, we need to specify only the type, the encoding (optional), and (also optionally) timeout values.

The second element is an entity processor. We need SolrEntityProcessor. To indicate which fields to rename, we should use the element field. The attribute column would refer to the source field name, while the attribute name would denote the field name in the new instance. The field element defines renaming rules.

Here is an example of the configuration file solr-data-config.xml:

  <dataConfig>
    <dataSource type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000" />
    <document>
      <entity name="rename-fields" processor="SolrEntityProcessor" query="*:*" url="http://localhost:8984/solr/Wiki"
              rows="100" fl="id,text,annotation">
        <field column="id" name="Id" />
        <field column="text" name="Text4Annotation" />
        <field column="annotation" name="Annotation" />
      </entity>
    </document>
  </dataConfig>

Next, and this is very important, we need to copy the jar solr-dataimporthandler-4.x.jar (x stands for the Solr version) to the lib folder inside the instance directory. This jar file comes with the standard Solr distribution, but it is not enabled by default!

Why do we need to copy it to the lib folder inside the instance directory? Is there a way to specify an arbitrary location? This should be possible in principle, but the feature appears to be broken (at least in Solr 4.6). I submitted a bug report, but it was neither confirmed nor rejected.

Finally, you can restart the instance of Solr and open the Solr Admin UI in your favorite browser.

Select the target instance and click on the dataimport menu item. Then, select the command (e.g., full-import) and the entity (in our case, rename-fields), and check the "Auto-Refresh Status" box. You will also need to set the start row and the number of rows to import. When all is done, click Execute.

I hope this was helpful and that the import will succeed. If not (e.g., the configuration is broken and the target instance cannot be loaded), please check the Solr log.
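The same import can also be started without the Admin UI, by sending an HTTP request to the /dataimport handler registered in solrconfig.xml. The sketch below only builds the request URLs; the target core URL (http://localhost:8983/solr/WikiRenamed) is a hypothetical placeholder, and the class and method names are made up for illustration. The resulting URLs can be opened in a browser or passed to any HTTP client.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DataImportUrls {

    // URL-encodes a single query parameter value.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds a full-import request for one entity of the TARGET core.
    // clean=true wipes the target index before importing; commit=true
    // commits the imported documents when the import finishes.
    static String fullImport(String targetCoreUrl, String entity) {
        return targetCoreUrl + "/dataimport?command=full-import"
                + "&entity=" + enc(entity)
                + "&clean=true&commit=true";
    }

    // Builds a request that reports the progress of a running import.
    static String status(String targetCoreUrl) {
        return targetCoreUrl + "/dataimport?command=status";
    }

    public static void main(String[] args) {
        // Hypothetical target core; replace with the URL of your new instance.
        String target = "http://localhost:8983/solr/WikiRenamed";
        System.out.println(fullImport(target, "rename-fields"));
        System.out.println(status(target));
    }
}
```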

selectCovered is a substantially better version of the UIMA subiterator

As I recently wrote, annotations are a popular formalism in the world of Natural Language Processing (NLP). They are markers used to highlight parts of speech (POS), syntax structures, as well as other constructs arising from text processing. One frequently used operation consists in retrieving all annotations under a given covering annotation. For example, sentences can be marked with annotations of a special type. Given a sentence annotation, you may need to retrieve all POS-tag annotations within this sentence annotation.
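To make the operation concrete, here is a simplified model of "retrieve everything covered by a given range". It uses a plain list of spans sorted by begin offset; the Span class and the method name are invented for this sketch, and it is not meant to describe how UIMA or UIMAfit implement the lookup internally.

```java
import java.util.ArrayList;
import java.util.List;

class CoveredSpans {

    // A minimal stand-in for an annotation: character offsets plus a label.
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    // Returns all spans lying entirely inside [coverBegin, coverEnd].
    // The input list must be sorted by begin offset.
    static List<Span> covered(List<Span> sorted, int coverBegin, int coverEnd) {
        // Binary search for the first span whose begin is >= coverBegin.
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).begin < coverBegin) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Scan forward only while spans can still start inside the range.
        List<Span> result = new ArrayList<>();
        for (int i = lo; i < sorted.size() && sorted.get(i).begin <= coverEnd; i++) {
            if (sorted.get(i).end <= coverEnd) {
                result.add(sorted.get(i));
            }
        }
        return result;
    }
}
```

For example, given POS-tag spans at offsets 0-3, 4-9, 10-15, and 16-20 and a sentence spanning 10-20, the call covered(spans, 10, 20) returns the last two spans without examining the first two.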

In the UIMA framework, retrieval of covered annotations can be done using the subiterator function. This function is tricky, however. When a covering and a covered annotation have equal spans, UIMA applies complex rules to figure out whether one annotation should be considered covered by the other. These rules are defined by the so-called type priorities. Simply speaking, one annotation can be truly covered by another one, but UIMA will consider this not to be the case (which is really annoying).

Fortunately, as I learned recently, there is an easy way to avoid this type-priority-in-the-neck issue. There is a special library called UIMAfit that works on top of UIMA, and this library implements a neat replacement for the subiterator, namely, the function selectCovered. This function relies on the same approach (i.e., it also uses an annotation index), but it completely ignores the UIMA type system priorities.

There is more than one version of selectCovered. One version accepts a covering annotation. Another one explicitly accepts a covering range. Be careful when using the second one! It is claimed to be rather inefficient. And, of course, I wanted to measure this inefficiency. To this end, I took my old code and added two tests for the two versions of the function selectCovered.

As before, in the brute-force iteration approach, finding the covered annotations takes a fraction of a millisecond. For the subiterator function, the time varied in the range of 1-6 microseconds, which is two orders of magnitude faster. The efficient variant of selectCovered was even 2-4 times faster than the function subiterator. However, the inefficient one, which explicitly accepts the covering range, is as slow as the brute-force approach.

Conclusions? The UIMAfit function selectCovered is much better than the native UIMA subiterator. However, one should be careful and use the efficient variant that accepts (as an argument) the covering annotation rather than the explicit covering range!

