selectCovered is a substantially better version of the UIMA subiterator

Blog

Directory

Submitted by srchvrs on Wed, 09/10/2014 - 00:19

As I recently wrote, annotations are a popular formalism in the world of Natural Language Processing (NLP). They are markers used to highlight parts of speech (POS), syntax structures, as well as other constructs arising from text processing. One frequently used operation consists in retrieving all annotations under a given covering annotation. For example, sentences can be marked with annotations of a special type. Given a sentence annotation, you may need to retrieve all POS-tag annotations within this sentence annotation.

In the UIMA framework, retrieval of covered annotations can be done using the subiterator function. This function is tricky, however. When, a covering and covered annotations have equal spans, UIMA has complex rules to figure if one annotation should be considered to be covered by another. These rules are defined by the so-called type priorities. Simply speaking, one annotation can be truly covered by another one, but UIMA will consider this not to be the case (which is really annoying).

Fortunately, as I learned recently, there is an easy way to avoid this type-priority-in-the-neck issue. There is a special library called UIMAfit that works on top of UIMA. And this library implements a neat replacement for the subiterator, namely, the function selectCovered. This function relies on the same approach (i.e, it also uses an annotation index), but it completely ignores the UIMA type system priorities.

There is more than one version of selectCovered. The one version accepts a covering annotation. Another one explicitly accepts a covering range. Be careful in using the second one! It is claimed to be rather inefficient. And, of course, I wanted to measure this inefficiency. To this end, I took my old code and added two additional tests for two versions of the function selectCovered.

As previously, in the bruteforce iteration approach, finding the covered annotation takes a fraction of a millisecond. For the subiterator function, time varied in the range of 1-6 microseconds, which is two orders of magnitude faster. The efficient variant of selectCovered was even 2-4 times faster than the function subiterator. However, the inefficient one, which explicitly accepts the covering range, is as slow as the bruteforce approach.

Conclusions? The UIMAfit function selectCovered is much better than the native UIMA subiterator. However, one should be careful and use the efficient variant that accepts (as an argument) the covering annotation rather than the explicit covering range!

srchvrs's blog

You are here