Undocumented invalidation of UIMA iterators

Submitted by srchvrs on Tue, 10/07/2014 - 23:42

As mentioned previously, in the world of natural language processing (at least, on some of its continents), everything is an annotation. The Apache UIMA framework lets you iterate over these annotations efficiently. For example, you can retrieve the POS tag of every word in a document, or only the words belonging to (that is, covered by) a single sentence annotation.

The iteration functionality is supported via the class FSIterator<T extends FeatureStructure>. These iterators can be "invalidated" in a rather interesting fashion, which does not seem to be documented properly.

Specifically, if one iterates over annotations and deletes them on the fly:

FSIterator<Annotation> it = ...
while (it.isValid()) {
  Annotation an = it.get();
  an.removeFromIndexes();  // modifies the index that "it" is traversing
  it.moveToNext();         // advance the iterator, not the annotation
}

the behavior seems to be undefined. Sometimes you get a ConcurrentModificationException, and sometimes you simply do not retrieve all indexed annotations. The issue is not limited to deletion. If you iterate over annotations, create new ones, and add them to the index (using the function addToIndexes), you are also likely to get a ConcurrentModificationException. This happens even if you iterate over annotations of one type and create annotations of another type.

This behavior is rather expected because, normally, you cannot iterate over an index and modify it at the same time (though some fancy container implementations do support this). However, many UIMA users (including me) have managed to fall into this trap. The UIMA docs seem to be silent about this issue. The only confirmation of the described effect that I could find was in this obscure mailing list. Yet, I think an appropriate warning should be printed in a large red font.
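The trap is not specific to UIMA, and it can be reproduced with the standard Java collections. The sketch below is only an analogy: an ArrayList stands in for the annotation index, the class and method names are hypothetical, and no UIMA API is involved. It shows both failure modes described above (an exception in one case, silently skipped elements in another) and one common workaround, which is to take a snapshot of the elements first and modify the index only while walking the snapshot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Plain-Java analogy; the ArrayList plays the role of the annotation index.
class SnapshotIterationDemo {

    // Unsafe: removes elements from the list while a for-each loop iterates over it.
    // Depending on the list size this either throws ConcurrentModificationException
    // or ends the loop early and leaves elements behind.
    static void removeWhileIterating(List<String> index) {
        for (String item : index) {
            index.remove(item); // structural change behind the iterator's back
        }
    }

    // Safer: iterate over a copy, modify the original. Returns the number removed.
    static int removeViaSnapshot(List<String> index) {
        List<String> snapshot = new ArrayList<>(index); // frozen view of the elements
        int removed = 0;
        for (String item : snapshot) {
            if (index.remove(item)) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<String> three = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            removeWhileIterating(three);
        } catch (ConcurrentModificationException e) {
            System.out.println("three elements: exception, left = " + three);
        }

        List<String> two = new ArrayList<>(Arrays.asList("a", "b"));
        removeWhileIterating(two); // no exception, but "b" is never visited
        System.out.println("two elements: no exception, left = " + two);

        List<String> safe = new ArrayList<>(Arrays.asList("a", "b", "c"));
        System.out.println("snapshot: removed " + removeViaSnapshot(safe) + ", left = " + safe);
    }
}
```

The same idea should carry over to UIMA: first copy the annotations returned by the iterator into a list, and only then call removeFromIndexes or addToIndexes while walking that list. The cost is one extra pass and a temporary list, which is usually negligible next to the time lost debugging missing annotations.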
