What's wrong with _mm_extract_ps and _mm_extract_pd?

If you are programming in C/C++, you can use Single-Instruction Multiple Data(SIMD) commands. The beauty of these commands is that they operate on small vectors rather than singular scalar values. For example, we can multiply or subtract 4 pairs of floating point numbers at the same time. SIMD commands are available on many CPUs, but my post is Intel-specific.

The current standard set of instructions on Intel SSE 4 supports vector operations on 128-bit data elements. The vector size in bits is always 128, but the size of data elements is different. You can operate on sixteen 8-bit values, eight 16-bit values, four 32-bit values, and two 64-bit values. The type of data elements can also be different. A 32-bit value, e.g., can be an integer or a single-precision floating point number. A 64-bit value can be an integer, or a double-precision floating value. Longer, e.g., 256-bit vectors are also supported by newer CPUs, but I do not consider them here.

And it is kinda nightmarish, because you cannot use your regular '+', '-', '*' any more. You have a separate function for each data type (and, even worse, a bunch of conversion functions). For instance, addition of two 4-element floating point vectors is _mm_add_ps, addition of two 2-element double-precision floating point vectors is _mm_add_pd, and addition of two 4-element integer vectors is _mm_add_epi32

It is bad, but not terribly bad, because there is a naming convention that helps you navigate through this jungle. As you might have noticed, all operations start with the same prefix _mm_, then there is a part indicating the type of the operation, and, finally, a type-specific suffix. These suffixes are as follows:

epi8 for 8-bit integers;
epi16 for 16-bit integers;
epi32 for 32-bit integers;
ps for single-precision floating point numbers;
pd for double-precision floating point numbers.

To operate on 128-bit vectors, the CPU uses special 128-bit registers. If you need to extract specific vector elements and store them in regular 32-bit or 64-bit registers, you have to use a special CPU command. Of course, you can always copy vector values to the memory and read back only a necessary portion, but this is rather slow. This is why there are commands that can copy specific elements of a 128-bit vector to a 32-bit or 64-bit CPU register. BTW, store and load operations also follow a convention. The store command for four-element single-precision floating point vectors is _mm_storeu_ps (u in storeu denotes an unaligned write).

The command _mm_exctract_epi8 treats a 128-bit register as a 16-element integer vector. It allows one to extract any of the sixteen integer vector elements (each has a size of 8 bit). _mm_extract_epi16 gives you one of the eight 16-bit vector elements_mm_extract_epi32 extracts one of the four 32-bit integer values. Ok, what does _mm_extract_ps do? Extracts one of the four single-precision floating point numbers, right? Wrong, it also extracts one of the four 32-bit integers. Furthermore, there is no function _mm_extract_pd!

To efficiently extract floating point numbers you need to use functions _mm_cvtss_f32 and _mm_cvtsd_f64. They extract only the first floating point number from the vector. Yet, there is a command to move an arbitrary element of the four-element vector to the first position. This command is called a shuffle instruction. Thus, you can first shuffle an arbitrary element to the first position, and then extract the first element. The name of the shuffle command is a bit of misnomer itself, because shuffle usually means rearranging. Yet, shuffling on Intel is IMHO multiplexing.

It does not bother me much that the actual floating-point extraction functions are missing. Yet, I cannot understand why there is a function _mm_extract_ps with a misleading name and redundant functionality? Anyway, after reading some material on the Web, I have created two simple macros: one for extraction of single-precision and and another for extraction of double-precision floating point numbers. My code is freely available.

How fast is UIMA subiterator function?

Assume that you need to identify part of speech tags (POS-tags) in a sentence: "My dog likes meat". You run a POS-tagger and get the following result (pronouns are green, nouns are blue, and verbs are red):

My dog likes meat

One can view such process as highlighting words using pens of different colors. We do not change the original text, just add additional information. Computerized processing of natural languages adopted this paper-and-pen model by introducing a concept of electronic annotations. An annotation is simply a record that "highlights" a span of characters. The annotation has a start, an end, a type, and a number of attributes. In our example, all annotations are of the same type, but their color attribute allows us to distinguish among different POS tags. In an NLP world everything is annotation!

Annotations are typically created by different annotation tools independently. Yet, we frequently need to know which annotations overlap. For example, sentence boundaries may be denoted by annotation of one type. Tokens and POS tags could be annotations of a different type. Figuring out which POS tags are assigned to tokens in a given sentence requires us to identify POS tags and tokens that overlap with the sentence (or rather tokens/tags that are contained in it) .

Probably, you do not want to invent a wheel and implement annotation processing using some off-the-shelf software. In particular, our group employs UIMA ECD, a framework on top of Apache UIMA. (For more details on UIMA-ECD, please, see a tutorial. BTW, UIMA-ECD is much easier beast to tame than pure UIMA). An overlap of annotations is computed using the function subiterator. (see also my recent comment on type priorities.)

How efficient is subiterator? In theory, UIMA uses indexes, but I could not find any comments on the efficiency of this operation in the official documentation. I tried to look at the source code, but it was hard for me to get through the maze of classes and abstract interfaces quickly. Anyways, even if UIMA uses some form of an index, how efficient is such an index? Being a bit paranoid, I decided to benchmark.

To this end, I created a simple pipeline. In this pipeline, I removed HTML from Wikipedia documents. Then, documents were annotated using the SENNA parser and OpenNLP. OpenNLP creates annotations for sentence boundaries, while the SENNA parser identifies POS tags (recall that sentence boundaries are denoted by annotations).

Finally, I iterated over sentences and for each sentence I retrieved related POS tags using two approaches. The first approach employed subiterator and was expected to be efficient (due to relying on indexes). In the second approach, I simply iterated over all POS tags in a document. This one would be slow, because a Wikipedia document has thousands of POS tags and hundreds of sentences. Thus, a nested loop (first over sentences, then over POS tags) could be expensive. My code is freely available.

Depending on a document, an average time to retrieve a POS tag using an index varied from 1 to 5 microseconds. In the bruteforce-iteration approach, which does not rely on index, a time to retrieve a POS tag varied from 0.1 to 0.5 millisecond, a two orders of magnitude difference.

Conclusions? A UIMA subiterator function is not terribly fast (1-15K CPU cycles per operation), but it is rather efficient. I would say it should be good enough for most tasks.

UPDATE1: This example will always work if a sentence span is strictly larger than the span of a contained annotation. If the spans can be of equal size, one needs to properly define type priorities. In that, the Sentence will need to have a higher priority.

UPDATE2: See also a follow-up post. There is a more efficient UIMAfit implementation of the subiterator function that also does not care about type priorities.

Unix pipes have small capacity

There is a catch in a previously described solution to wrapping NLP tools. Namely, a Unix pipe has a small buffer (I believe it is in the order of kilobytes). This is why you need to send input data in small chunks and read all the output after each chunk is processed. Otherwise, a deadlock can happen.

If you try to push a large chunk, a write call (that writes data to the output pipe) will be waiting till your application reads data from the other side of the output pipe. At the same time, your application will be trying to push input data through the input pipe. Because the pipe has a limited capacity, your application will exhaust this capacity and "freeze" in the write call. It will be waiting for the NLP tool to read the data from the other side of the input pipe. Yet, the NLP tool, in turn, will be waiting for your application!

As Di Wang pointed out, you can avoid this situation by using a temporary file and a pipe in a clever way. As in the naive and unsafe solution described in the beginning of the post, you will write the input data to a temporary file. Then, you will write the name of this temporary file to the input pipe. Because the name of the file is very small, you will not exceed the pipe capacity.

Plumbing with named pipes: a simple approach to wrap standalone NLP tools

A modus operandi of many natural language processing (NLP) tools is as follows: start a tool, read raw data from the standard input, write processed data (e.g., POS tags) to the standard output and finish. That is, a lot of NLP tools come without any server mode. At the same time the cost of starting a process can be substantial. Thus, if you have to supply raw data in small chunks, the processing can be very slow.

What is the easiest way to fix this, if we do not care about multi-threading? Writing a fully-fledged network daemon is possible, but is rather hard. Some folk default to modifying the NLP tool so that it reads input from a file as the input appears. A modus operandi is, then, as follows. Sit and wait till the input file is created. When it happens, read raw data and output processed data to some other file. It is an easy solution: synchronization can be problematic though: We may end up with one process reading from the file while another one is writing to it.

A much better way is to use a named pipe! Unix named pipes are almost identical to regular temporary pipes that we use to direct an output of one process to an input of another process (using the symbol |). The only difference is that named pipes are permanent. From a perspective of a Unix process, a named pipe is just a regular pipe from which you can read data (or write data to). A benefit of using a pipe is that the operating system takes care of synchronization: one process can safely read from the pipe while another process is writing to it.

To begin the plumbing, we need to create two named pipes, one for input, another for output:
 mkfifo input_pipe mkfifo output_pipe

Then, we need to modify our NLP tool.

1) We make the tool process data in an infinite loop (rather than exiting after processing all the input). In the beginning of the loop, we open the the output pipe for writing (and we open the input pipe only once, when the tool starts). After all data is processed, we close the output pipe. Note that closing and re-opening the output pipe is important. Otherwise, a receiving process will not read the EOF marker and consequently, will wait for ever.
2) We replace all operators that read raw data from the standard input with operators that read data from the input named pipe.
3) We replace all operators that write processed data to the standard output with operators that write data to the output named pipe.

In C/C++, this is straightforward (in other languages this is not hard either). For instance, we replace

fgets(sentence, MAX_SENTENCE_SIZE, stdin)

with
fgets(sentence, max_sent_size + 1, senna_input)

That is, pretty much it. As a working example, I publish the modified main C-file of the SENNA parser version 3.0.

Happy plumbing and let your pipes be functioning properly!

PS1: There is one catch: in a previously described solution to wrapping NLP tools. Namely, a Unix pipe has a small buffer (I believe it is in the order of kilobytes). This is why you need to send input data in small chunks and read all the output after each chunk is processed. Otherwise, a deadlock can happen.

If you try to push a large chunk, a write call (that writes data to the output pipe) will be waiting till your application reads data from the other side of the output pipe. At the same time, your application will be trying to push input data through the input pipe. Because the pipe has a limited capacity, your application will exhaust this capacity and "freeze" in the write call. It will be waiting for the NLP tool to read the data from the other side of the input pipe. Yet, the NLP tool, in turn, will be waiting for your application!

As Di Wang pointed out, you can avoid this situation by using a temporary file and a pipe in a clever way. As in the naive and unsafe solution described in the beginning of the post, you will write the input data to a temporary file. Then, you will write the name of this temporary file to the input pipe. Because the name of the file is very small, you will not exceed the pipe capacity.

PS2: In principle, even a simpler approach might be possible. And this approach would not require modification of the NLP tool. We can create a wrapper utility that would simulate the following Unix command:
 printing_binary | ./nlp_tool_binary | recipient_binary
A trick is to implement this pipeline in such a way that printing_binary is the same as the recipient_binary. In C/C++ and Unix, it is possible, e.g., through a system call popen. One big issue here, though, is that we need to feed several input batches to nlp_tool_binary. However, the output from nlp_tool_binary can be a single stream of characters/bytes without separators that indicate batch boundaries.

Clearly, we need to pause after feeding every input batch until we retrieve all the respective output data. However, if we keep reading the output of nlp_tool_binary using a blocking read operation, we will eventually end up waiting forever. We can read using a non-blocking system call, but how long should we repeat this system call before giving up and declaring that nlp_tool_binary has processed all the data? It might be possible to use some knowledge about the output format of nlp_tool_binary, or simply stop trying after a certain time elapses. Yet, none of these solutions is sufficiently reliable or generic, so if anybody knows a better approach, please, let me know.