I recently learned a new stopword that seems to be missing from most of the standard lists of stopwords (for example, it is not on the list of the Lemur/Indri toolkit), which likely means it is pretty novel to the IR community. This stop word is a simple three letter combination: n't. How does it arise? Well, it is a result of tokenization of contractions such as can't or aren't. But don't blindly trust my words, check the tokenization results yourself, e.g., using the following sentences (as a reminder this can be done using an online Stanford tool):
I ain't interested in this.
I can't attend this conference.
Aren't you hungry?
Don't trust me, verify!
Well, this may be correct linguistically, but this is not something the IR community is fully aware of. In particular, a stopword list may contain full contractions such as can't, ain't or don't, but not the suffix n't! If you work with a text where contractions are often, there you go, you have a new stop word! Inclusion of stop words into a query may not necessarily have effect on accuracy, but it will certainly hurt efficiency. BTW, other contractions may also produce interesting tokens, e.g., gonna is tokenized into gon and na. Yet, tokens like na seem to be far less frequent.