 

A modus operandi of many natural language processing (NLP) tools is as follows: start, read raw data from the standard input, write processed data (e.g., POS tags) to the standard output, and exit. That is, a lot of NLP tools come without any server mode. At the same time, the cost of starting a process can be substantial. Thus, if you have to supply raw data in small chunks, processing can be very slow.

What is the easiest way to fix this, if we do not care about multi-threading? Writing a fully-fledged network daemon is possible, but rather hard. Some folks resort to modifying the NLP tool so that it reads input from a file as the input appears. The modus operandi is then as follows: sit and wait until the input file is created; when this happens, read the raw data and write the processed data to some other file. This solution is easy, but synchronization becomes a problem: we may end up with one process reading from the file while another one is still writing to it.

A much better way is to use a named pipe! Unix named pipes are almost identical to the regular, unnamed pipes that we use to direct the output of one process to the input of another (using the symbol |). The only difference is that named pipes are permanent: they live in the file system. From the perspective of a Unix process, a named pipe is just a regular file that you can read data from (or write data to). A benefit of using a pipe is that the operating system takes care of synchronization: one process can safely read from the pipe while another process is writing to it.

To begin the plumbing, we need to create two named pipes, one for input, another for output:

mkfifo input_pipe
mkfifo output_pipe
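
If you prefer, the pipes can also be created from within the tool itself: the mkfifo library call does the same job as the shell command. A minimal sketch (the pipe names simply repeat the ones above):

  #include <sys/stat.h>
  #include <errno.h>
  #include <stdio.h>

  /* Create both pipes programmatically; EEXIST only means they
     were already created, e.g., by the shell commands above. */
  int create_pipes(void) {
      if (mkfifo("input_pipe", 0600) == -1 && errno != EEXIST) {
          perror("mkfifo input_pipe");
          return -1;
      }
      if (mkfifo("output_pipe", 0600) == -1 && errno != EEXIST) {
          perror("mkfifo output_pipe");
          return -1;
      }
      return 0;
  }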

Then, we need to modify our NLP tool.

1) We make the tool process data in an infinite loop (rather than exiting after processing all the input). At the beginning of each iteration, we open the output pipe for writing (the input pipe is opened only once, when the tool starts). After all data in a batch is processed, we close the output pipe. Note that closing and re-opening the output pipe is important: otherwise, the receiving process will never see the EOF marker and, consequently, will wait forever. (A sketch of this loop is given after the list.)
2) We replace all calls that read raw data from the standard input with calls that read from the input named pipe.
3) We replace all calls that write processed data to the standard output with calls that write to the output named pipe.
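
Put together, the modified main loop might look roughly as follows. This is only a sketch, not SENNA's actual code: the pipe names repeat the ones above, and the blank-line batch delimiter is my own assumption:

  #include <stdio.h>

  #define MAX_SENTENCE_SIZE 4096

  int main(void) {
      char sentence[MAX_SENTENCE_SIZE];

      /* Open the input pipe once, when the tool starts. */
      FILE *in = fopen("input_pipe", "r");
      if (in == NULL) { perror("fopen input_pipe"); return 1; }

      for (;;) {
          /* Open the output pipe at the start of each iteration; this
             blocks until a client opens the other end for reading. */
          FILE *out = fopen("output_pipe", "w");
          if (out == NULL) { perror("fopen output_pipe"); return 1; }

          /* Read one batch; a blank line marks its end (the delimiter
             is an assumption of this sketch, not part of SENNA). */
          while (fgets(sentence, MAX_SENTENCE_SIZE, in) != NULL) {
              if (sentence[0] == '\n') break;
              /* ... the actual NLP processing would happen here ... */
              fputs(sentence, out);
          }
          clearerr(in);  /* reset EOF so reads resume when a writer returns */

          /* Close the output pipe so the client sees EOF. */
          fclose(out);
      }
  }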

In C/C++, this is straightforward (in other languages this is not hard either). For instance, we replace

  fgets(sentence, MAX_SENTENCE_SIZE, stdin)

with
  fgets(sentence, max_sent_size + 1, senna_input)

That is pretty much it. As a working example, I have published the modified main C-file of the SENNA parser, version 3.0.

Happy plumbing, and may your pipes function properly!

PS1: There is one catch in the solution described above. Namely, a Unix pipe has a small buffer (I believe it is on the order of kilobytes). This is why you need to send input data in small chunks and read all the output after each chunk is processed. Otherwise, a deadlock can happen.

If you try to push a large chunk, the tool's write call (the one that writes data to the output pipe) will block until your application reads data from the other side of the output pipe. At the same time, your application will be trying to push input data through the input pipe. Because the pipe has a limited capacity, your application will exhaust this capacity and "freeze" in its own write call, waiting for the NLP tool to read data from the other side of the input pipe. Yet the NLP tool, in turn, will be waiting for your application!
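In code, the safe chunk-by-chunk protocol could look like the following sketch, which sends one small batch and drains the output pipe completely before sending the next one (the blank-line delimiter matches the assumption in the earlier sketch; error checks are omitted for brevity):

  #include <stdio.h>

  /* Send one small, newline-terminated batch and read back all of its
     output before sending the next one, so that neither pipe fills up. */
  void process_batch(const char *batch) {
      char line[4096];

      FILE *to_tool = fopen("input_pipe", "w");
      fputs(batch, to_tool);
      fputs("\n", to_tool);          /* blank line marks the end of the batch */
      fclose(to_tool);

      FILE *from_tool = fopen("output_pipe", "r");
      while (fgets(line, sizeof(line), from_tool) != NULL)
          fputs(line, stdout);       /* drain all the tool's output */
      fclose(from_tool);
  }

Note that each batch still has to fit into the pipe's buffer, because in this scheme the tool does not start reading before the whole batch has been written.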

As Di Wang pointed out, you can avoid this situation by combining a temporary file with a pipe in a clever way. As in the naive and unsafe solution described at the beginning of the post, you write the input data to a temporary file. Then, you write the name of this temporary file to the input pipe. Because the file name is very small, you will not exceed the pipe capacity.
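
A sketch of the client side of this trick (the temporary-file name is just a placeholder):

  #include <stdio.h>

  /* Write a (possibly large) batch to a temporary file and pass only
     the file name through the pipe; the name is tiny, so it cannot
     exhaust the pipe's capacity. */
  void send_batch_via_file(const char *batch) {
      const char *tmp_name = "/tmp/nlp_batch.txt"; /* placeholder name */

      FILE *tmp = fopen(tmp_name, "w");
      fputs(batch, tmp);
      fclose(tmp);                        /* the data is fully on disk now */

      FILE *to_tool = fopen("input_pipe", "w");
      fprintf(to_tool, "%s\n", tmp_name); /* send just the name */
      fclose(to_tool);
  }

On its side, the NLP tool reads a line from the input pipe, interprets it as a file name, and processes the contents of that file.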

PS2: In principle, an even simpler approach might be possible, one that would not require modifying the NLP tool at all. We can create a wrapper utility that simulates the following Unix command:

printing_binary | ./nlp_tool_binary | recipient_binary

The trick is to implement this pipeline in such a way that printing_binary is the same process as recipient_binary. In C/C++ on Unix this is possible, e.g., via the popen call. One big issue here, though, is that we need to feed several input batches to nlp_tool_binary, whereas the output of nlp_tool_binary can be a single stream of characters/bytes without separators that indicate batch boundaries.
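
Note that popen connects only one direction, so a wrapper that plays both roles would typically be assembled from pipe, fork, and dup2 directly. A minimal sketch (the function name is mine; error handling is kept to a minimum):

  #include <sys/types.h>
  #include <unistd.h>

  /* Spawn cmd with both its stdin and stdout connected to the calling
     process, so that printing_binary and recipient_binary are the same
     process. On success, *to_child and *from_child receive the two
     file descriptors. */
  int spawn_bidirectional(const char *cmd, int *to_child, int *from_child) {
      int in_pipe[2], out_pipe[2];   /* parent -> child, child -> parent */
      if (pipe(in_pipe) == -1 || pipe(out_pipe) == -1) return -1;

      pid_t pid = fork();
      if (pid == -1) return -1;

      if (pid == 0) {                /* child: becomes nlp_tool_binary */
          dup2(in_pipe[0], STDIN_FILENO);
          dup2(out_pipe[1], STDOUT_FILENO);
          close(in_pipe[0]);  close(in_pipe[1]);
          close(out_pipe[0]); close(out_pipe[1]);
          execlp(cmd, cmd, (char *)NULL);
          _exit(127);                /* exec failed */
      }

      /* parent: keep the ends it needs, close the rest */
      close(in_pipe[0]);
      close(out_pipe[1]);
      *to_child   = in_pipe[1];
      *from_child = out_pipe[0];
      return 0;
  }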

Clearly, we need to pause after feeding each input batch until we have retrieved all of the corresponding output. However, if we keep reading the output of nlp_tool_binary using a blocking read operation, we will eventually end up waiting forever. We can read using a non-blocking call instead, but how long should we keep retrying before giving up and declaring that nlp_tool_binary has processed all the data? It might be possible to exploit some knowledge about the output format of nlp_tool_binary, or to simply stop trying after a certain time elapses. Yet, none of these solutions is sufficiently reliable or generic, so if anybody knows a better approach, please let me know.
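
For illustration, the "stop after a certain time elapses" variant could be built on poll; the caller has to guess a suitable timeout, which is precisely why the approach is unreliable:

  #include <poll.h>
  #include <unistd.h>

  /* Read from fd until no new data arrives for timeout_ms milliseconds.
     Returns the number of bytes read (at most buf_size). */
  ssize_t read_with_timeout(int fd, char *buf, size_t buf_size, int timeout_ms) {
      size_t total = 0;
      while (total < buf_size) {
          struct pollfd pfd = { .fd = fd, .events = POLLIN };
          int rc = poll(&pfd, 1, timeout_ms);
          if (rc <= 0) break;              /* timeout or error: give up */
          ssize_t n = read(fd, buf + total, buf_size - total);
          if (n <= 0) break;               /* EOF or read error */
          total += (size_t)n;
      }
      return (ssize_t)total;
  }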