Assume that you need to identify part of speech tags (POS-tags) in a sentence: "My dog likes meat". You run a POS-tagger and get the following result (pronouns are green, nouns are blue, and verbs are red):
My dog likes meat
One can view such process as highlighting words using pens of different colors. We do not change the original text, just add additional information. Computerized processing of natural languages adopted this paper-and-pen model by introducing a concept of electronic annotations. An annotation is simply a record that "highlights" a span of characters. The annotation has a start, an end, a type, and a number of attributes. In our example, all annotations are of the same type, but their color attribute allows us to distinguish among different POS tags. In an NLP world everything is annotation!
Annotations are typically created by different annotation tools independently. Yet, we frequently need to know which annotations overlap. For example, sentence boundaries may be denoted by annotation of one type. Tokens and POS tags could be annotations of a different type. Figuring out which POS tags are assigned to tokens in a given sentence requires us to identify POS tags and tokens that overlap with the sentence (or rather tokens/tags that are contained in it) .
Probably, you do not want to invent a wheel and implement annotation processing using some off-the-shelf software. In particular, our group employs UIMA ECD, a framework on top of Apache UIMA. (For more details on UIMA-ECD, please, see a tutorial. BTW, UIMA-ECD is much easier beast to tame than pure UIMA). An overlap of annotations is computed using the function subiterator. (see also my recent comment on type priorities.)
How efficient is subiterator? In theory, UIMA uses indexes, but I could not find any comments on the efficiency of this operation in the official documentation. I tried to look at the source code, but it was hard for me to get through the maze of classes and abstract interfaces quickly. Anyways, even if UIMA uses some form of an index, how efficient is such an index? Being a bit paranoid, I decided to benchmark.
To this end, I created a simple pipeline. In this pipeline, I removed HTML from Wikipedia documents. Then, documents were annotated using the SENNA parser and OpenNLP. OpenNLP creates annotations for sentence boundaries, while the SENNA parser identifies POS tags (recall that sentence boundaries are denoted by annotations).
Finally, I iterated over sentences and for each sentence I retrieved related POS tags using two approaches. The first approach employed subiterator and was expected to be efficient (due to relying on indexes). In the second approach, I simply iterated over all POS tags in a document. This one would be slow, because a Wikipedia document has thousands of POS tags and hundreds of sentences. Thus, a nested loop (first over sentences, then over POS tags) could be expensive. My code is freely available.
Depending on a document, an average time to retrieve a POS tag using an index varied from 1 to 5 microseconds. In the bruteforce-iteration approach, which does not rely on index, a time to retrieve a POS tag varied from 0.1 to 0.5 millisecond, a two orders of magnitude difference.
Conclusions? A UIMA subiterator function is not terribly fast (1-15K CPU cycles per operation), but it is rather efficient. I would say it should be good enough for most tasks.
UPDATE1: This example will always work if a sentence span is strictly larger than the span of a contained annotation. If the spans can be of equal size, one needs to properly define type priorities. In that, the Sentence will need to have a higher priority.
UPDATE2: See also a follow-up post. There is a more efficient UIMAfit implementation of the subiterator function that also does not care about type priorities.