Dictionary-less spellchecking and subword language models

Blog

Directory

Submitted by srchvrs on Mon, 05/25/2015 - 22:49

It was asked on Quora about dictionary-less spellchecking (this may require subword language models). Here is my brief answer (feel free to vote on Quora).

In the early days, memory was limited, therefore people tried to avoid storing comprehensive dictionaries. Errors could have been detected by finding substrings that looked "unusual" for a given language. For example, if you compile a list of common trigrams, you could flag words containing trigrams that are off the list (see, e.g., Spellchecking by Computer by R. Mitton).

English has little morphology. In morphologically productive languages, it was often possible to compile a list of word generation rules. For example, given a root form of a verb, it would be possible to obtain all possible inflections for singular and plural forms in the simple present and other tenses. This trick provides some compaction in the case of English, however, many more verb forms can be generated in, e.g., Russian (i.e., a more compact representation is possible compared to the comprehensive dictionary that includes all word forms.).

More sophisticated checking can be done in the case of Agglutinative languages (e.g., Turkish, Finish), where a single word can be equivalent to a full-fledged sentence in English. Agglutinative langauges are typically regular: there are rules to add affixes to roots. Regular does not necessarily mean simple, however. For example, in Turkish there are all kind of suffix and root deformations that occur when word parts are assembled (see this paper for details: Design and implementation of a spelling checker for Turkish by A Solak, K Oflazer).

Perhaps, most relevant to the question are sub-word level language models (including, of course, statistical character-level models). Previously, modelling was done using n-gram language models. As pointed out recently by Yoav Goldberg, these models are quite powerful: The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool).

However, Recurrent Neural Networks (RNNs) seem to be a more effective tool. A couple more relevant references on RNNs for language modelling:

What these models do: they learn how likely is a combination of certain characters or n-grams in a real word. Errors are, therefore, detected by finding character sequences that looked "unusual" for a given language (unusual means the character sequence didn't match training data well).

The links posted by Wenhao Jiang may also relevant, however, they seem to address a different problem: given an already existing word-level language model find the most probable separation of character (or phoneme) sequences into words. Word segmentation IMHO requires even more data than a dictionary: it requires normally a word-level language model that encompasses the dictionary as a special case. In the case of an out-of-vocabulary (OOV) word, the n-gram model also backs off to using a character-level (or subword-level) language model.

srchvrs's blog

You are here