A surprising novel stopword that appears if you use the Stanford NLP tokenizer

Submitted by srchvrs on Wed, 12/07/2016 - 13:07

I recently learned a new stopword that seems to be missing from most standard stopword lists (for example, it is not on the list of the Lemur/Indri toolkit), which likely means it is fairly novel to the IR community. This stopword is a simple three-character combination: n't. How does it arise? It is a result of tokenizing contractions such as can't or aren't. But don't blindly trust my words: check the tokenization results yourself, e.g., using the following sentences (as a reminder, this can be done using an online Stanford tool):

I ain't interested in this.
I can't attend this conference.
Aren't you hungry?
Don't trust me, verify!

Well, this may be correct linguistically, but it is not something the IR community is fully aware of. In particular, a stopword list may contain full contractions such as can't, ain't, or don't, but not the suffix n't! If you work with text where contractions are frequent, there you go: you have a new stopword! Including stopwords in a query does not necessarily affect accuracy, but it will certainly hurt efficiency. BTW, other contractions may also produce interesting tokens, e.g., gonna is tokenized into gon and na. Yet tokens like na seem to be far less frequent.
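To make the effect concrete, here is a minimal Python sketch. It does not call the Stanford tokenizer; it only approximates the contraction splitting described above with a couple of regular expressions. The names `ptb_like_tokenize` and `remove_stopwords`, as well as the tiny stopword set, are made up for illustration and are not taken from any real toolkit.

```python
import re


def ptb_like_tokenize(text):
    """Roughly imitate the contraction splitting described above.

    This is an illustrative approximation, NOT the Stanford tokenizer.
    """
    # "gonna" -> "gon na"
    text = re.sub(r"\b(gon)(na)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # "can't" -> "ca n't", "Aren't" -> "Are n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Emit n't as its own token, then plain words, then single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text, flags=re.IGNORECASE)


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercased form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# Hypothetical stopword list: it has the full contractions but not the suffix.
stopwords = {"i", "this", "in", "you", "me", "can't", "ain't", "don't", "aren't"}

tokens = ptb_like_tokenize("I can't attend this conference.")
print(tokens)

# "can't" is in the list, but the tokenizer never produces it, so n't survives.
print(remove_stopwords(tokens, stopwords))

# Adding the suffix explicitly removes it.
print(remove_stopwords(tokens, stopwords | {"n't"}))
```

Under this approximation, a list that contains only full contractions never matches anything, because the full forms no longer exist after tokenization; only adding n't itself filters the token out.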

Comments

Submitted by Maxim Zakharov (not verified) on Wed, 12/07/2016 - 17:51

Stopwords are a characteristic of the search domain. E.g., if you're building a search over a web site, let's say that of the Metropolitan Museum of Art, then the words Metropolitan, Museum, and Art would be in your stopword list, as they would be found on almost every page you put into the index.
So any precompiled stopword list that comes with a search library is a starting point for making your own, specific to the domain you're indexing.

Submitted by srchvrs on Wed, 12/07/2016 - 22:09

This is a fair point, but "n't" would occur in many domains, while "Metropolitan Museum of Art" is very domain-specific.
