## NLP approaches/tools to automatically rewrite sentences

I was asked on Quora about NLP approaches or tools to automatically rewrite sentences. Here is my brief answer (feel free to vote on Quora). Also note that I have updated the answer to include seq2seq models.

I am not really a specialist in sentence rewriting, but I think at least the following approaches can be used:

1) Manual rules. These can be either regexp-based or tree-based. Stanford folks have a tool to create such patterns. One interesting option is to implement rules based on POS tags. For example, I reimplemented some of the query-rewriting algorithms from the Aranea QA system. This is not easy, and the quality is sometimes not ideal.

2) Learning from data. One common approach is, perhaps surprisingly, machine translation. This requires a rather large monolingual corpus of paired sentences, where one sentence in each pair is the source and the other is the target. You can train all sorts of translation models on such a corpus, starting from the simplistic IBM Model 1 and ending with some phrase-based model (or even a context-free grammar based one, see the link below).

3) It is also possible to obtain paraphrasing information by pivoting on a foreign language. One resource is PPDB: The Paraphrase Database. You may want to read the papers authored by the people who created PPDB.

4) I suspect that even better translation results will be obtained by using synchronous context-free grammars. There is in fact an open-source package that apparently supports all of this: cdec.

5) A more recent approach relies on neural sequence-to-sequence (seq2seq) models. One recent paper on this subject is: Neural Paraphrase Generation with Stacked Residual LSTM Networks, Prakash et al., 2016.
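
To make the regexp-based option from item 1 more concrete, here is a minimal sketch of a single rewriting rule. The rule and the function name `rewrite_what_is` are hypothetical (they are not taken from Aranea): a "What is X?" question is turned into the declarative prefix "X is", which is a plausible beginning of an answer sentence.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical rule (not Aranea's actual rule set): rewrite "What is X?"
// into the declarative prefix "X is".
std::string rewrite_what_is(const std::string& question) {
    // Case-insensitive; the trailing question mark is optional.
    static const std::regex pattern(R"(^\s*what\s+is\s+(.+?)\s*\?*\s*$)",
                                    std::regex::icase);
    std::smatch match;
    if (std::regex_match(question, match, pattern)) {
        return match[1].str() + " is";
    }
    return question;  // no rule fired: leave the input unchanged
}

// Small self-check that can be called from main().
void check_rewrite_what_is() {
    assert(rewrite_what_is("What is the capital of France?") ==
           "the capital of France is");
    assert(rewrite_what_is("Who wrote Hamlet?") == "Who wrote Hamlet?");
}
```

A real system would need many such rules, which is one reason why the manual approach is laborious.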

## How to declare a constant reference in C++ (not really)

As we may remember, in C++ there are two types of constant pointers. The pointer of the first type (the most common one) can be changed, but not the memory it points to:

    const int * const_mem = ... ;
    *const_mem = 3; // compile error

The constant pointer of the second type is basically a reference: it cannot be changed, but you can still change the memory it points to:

    int * const const_ptr = ... ;
    *const_ptr = 3; // fine!
    const_ptr++; // compile error

Of course, you can define a constant pointer to constant memory as well:

    const int * const const_ptr_mem = ... ;
    *const_ptr_mem = 3; // compile error!
    const_ptr_mem++; // compile error
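
The three pointer flavors can also be checked at compile time. The following is a small sketch using `<type_traits>`; the alias names are made up for illustration. If the file compiles, all the claims in the comments hold.

```cpp
#include <type_traits>

using PtrToConst      = const int*;        // memory is const, pointer is not
using ConstPtr        = int* const;        // pointer is const, memory is not
using ConstPtrToConst = const int* const;  // both are const

// std::is_const looks only at the top-level qualifier, i.e., the pointer itself.
static_assert(!std::is_const<PtrToConst>::value, "pointer can be changed");
static_assert(std::is_const<ConstPtr>::value, "pointer cannot be changed");
static_assert(std::is_const<ConstPtrToConst>::value, "pointer cannot be changed");

// Removing the pointer exposes the constness of the memory pointed to.
static_assert(std::is_const<std::remove_pointer<PtrToConst>::type>::value,
              "memory is const");
static_assert(!std::is_const<std::remove_pointer<ConstPtr>::type>::value,
              "memory is mutable");
static_assert(std::is_const<std::remove_pointer<ConstPtrToConst>::type>::value,
              "memory is const");
```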

References, however, are constant by design. You can bind a reference only once, when it is initialized, and you cannot rebind it afterwards! Thus, references are basically constant non-null pointers. Turns out that you can still define a constant reference in C++:

    int const & const_ref = 3;

Well, why would such a nonsensical thing be possible? The answer is that it is not. C and C++ have an extremely quirky way of declaring complex types, with complicated rules that are applied basically in an inside-out, right-to-left fashion. Thus, in the previous declaration, const still applies to int rather than to int&. In other words, that declaration is equivalent to:

    const int & const_mem_ref = 3;

To declare a true constant reference, which is unsurprisingly illegal, you need to place the modifier const between the '&' and the variable name:

    int & const const_ref = 3; // compile error
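
The same point can be verified with a few compile-time checks. This is only an illustrative sketch, and the alias `IntRef` is made up for the example. It also shows a related quirk: when const reaches a reference type indirectly, through an alias, the compiler silently drops the qualifier instead of reporting an error.

```cpp
#include <type_traits>

// "int const &" and "const int &" name the same type: a reference to const int.
static_assert(std::is_same<int const&, const int&>::value, "same type");

// The reference type itself is never const-qualified...
static_assert(!std::is_const<const int&>::value, "reference is not const");
// ...the qualifier belongs to the referred-to type.
static_assert(std::is_const<std::remove_reference<const int&>::type>::value,
              "referred-to int is const");

// const applied to a reference through an alias is ignored.
using IntRef = int&;
static_assert(std::is_same<const IntRef, int&>::value, "const is dropped");
```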

Bottom line? Hopefully, reading this short note will help one reduce confusion in the future. As usual, simple illustration code is available.

## GCC disables isnan and isinf when compiling with -ffast-math flag

This short note is just a reminder that GCC can effectively ignore the functions isinf and isnan when you compile your code with the -ffast-math option: this option implies -ffinite-math-only, which lets the compiler assume that NaNs and infinities never occur and optimize the checks away. The demo code can be found here. One should also be aware that -ffast-math is enabled by the commonly used option -Ofast, but not by the -ON options, where N is a number. I also wrote custom checking functions that do not have this problem.
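
One common way to write such custom checks (a sketch under my own assumptions, not necessarily the code linked above) is to inspect the IEEE 754 bit pattern with integer operations, which are not subject to the finite-math assumption. The function names are hypothetical, and the code assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: copy the bits of a float into an integer.
inline std::uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// NaN: all exponent bits are set and the mantissa is non-zero.
inline bool bitwise_isnan(float x) {
    const std::uint32_t bits = float_bits(x);
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
}

// Infinity: all exponent bits are set and the mantissa is zero (sign ignored).
inline bool bitwise_isinf(float x) {
    return (float_bits(x) & 0x7FFFFFFFu) == 0x7F800000u;
}

// Small self-check that can be called from main().
void check_bitwise_classification() {
    const float inf = std::numeric_limits<float>::infinity();
    assert(bitwise_isnan(std::numeric_limits<float>::quiet_NaN()));
    assert(bitwise_isinf(inf) && bitwise_isinf(-inf));
    assert(!bitwise_isnan(1.0f) && !bitwise_isinf(1.0f));
}
```

The double-precision versions are analogous, with a 64-bit integer and the corresponding masks.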

## A surprising novel stopword that appears if you use Stanford NLP tokenizer

I recently learned a new stopword that seems to be missing from most of the standard lists of stopwords (for example, it is not on the list of the Lemur/Indri toolkit), which likely means it is pretty novel to the IR community. This stopword is a simple three-character combination: n't. How does it arise? Well, it is a result of tokenization of contractions such as can't or aren't. But don't blindly trust my words, check the tokenization results yourself, e.g., using the following sentences (as a reminder, this can be done using an online Stanford tool):

I ain't interested in this.
I can't attend this conference.
Aren't you hungry?
Don't trust me, verify!

Well, this may be correct linguistically, but this is not something the IR community is fully aware of. In particular, a stopword list may contain full contractions such as can't, ain't, or don't, but not the suffix n't! If you work with text where contractions are frequent, there you go, you have a new stopword! Including stopwords in a query may not necessarily have an effect on accuracy, but it will certainly hurt efficiency. BTW, other contractions may also produce interesting tokens, e.g., gonna is tokenized into gon and na. Yet, tokens like na seem to be far less frequent.
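
A simple remedy is to add the token to the stopword list used at query time. The sketch below is purely illustrative: the function name, the tiny stopword list, and the assumption that tokens are already lower-cased and split in the Penn Treebank style (so that can't arrives as ca and n't) are all mine.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Keep only the tokens that are not on the stopword list.
std::vector<std::string> remove_stopwords(
        const std::vector<std::string>& tokens,
        const std::unordered_set<std::string>& stopwords) {
    std::vector<std::string> kept;
    for (const std::string& token : tokens) {
        if (stopwords.count(token) == 0) {
            kept.push_back(token);
        }
    }
    return kept;
}

// Small self-check that can be called from main().
void check_remove_stopwords() {
    // Toy list: a real one is much longer; "n't" is the new entry.
    const std::unordered_set<std::string> stopwords = {"i", "this", "n't"};
    // Assumed tokenization of "I can't attend this conference".
    const std::vector<std::string> tokens =
        {"i", "ca", "n't", "attend", "this", "conference"};
    const std::vector<std::string> expected = {"ca", "attend", "conference"};
    assert(remove_stopwords(tokens, stopwords) == expected);
}
```

Note that the stem ca survives the filtering, so one may want to treat such leftovers as well.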