Word segmentation/tokenization

Some of the references below were taken from this Stack Overflow page.


ChaSen - word segmentation, POS tagging, and morphological analysis for Japanese.
fudannlp  
ictclas  
ik-analyzer  
mmseg4j  
NlpBamboo  
PanGu segment  
SentencePiece - an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained.
smallseg  
Stanford Word Segmenter  
WordSegment - Apache 2.0-licensed, pure-Python module for English word segmentation, based on a trillion-word corpus.
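
For readers new to the task, here is a rough sketch of what corpus-based English word segmentation does: given text with the spaces removed, choose the split whose words are most probable under a unigram model. This is not the API or the data of any tool listed above. The counts in TOY_COUNTS are made up for illustration, and the names segment and word_logprob are hypothetical. A real tool would use counts from a large corpus and usually a richer model.

```python
import math

# Hypothetical toy unigram counts, invented for this illustration only.
TOY_COUNTS = {
    "this": 50, "is": 80, "a": 100, "test": 20,
    "word": 30, "segment": 5, "segmentation": 10,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in TOY_COUNTS)


def word_logprob(word):
    """Log-probability of a candidate word under the toy unigram model."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    # Unknown strings get a penalty that grows with their length,
    # so long out-of-vocabulary chunks are strongly discouraged.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Split unspaced text into the most probable sequence of words."""
    n = len(text)
    # best[i] = (score of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("thisisatest"))
print(segment("wordsegmentation"))
```

The dynamic program considers every candidate last word ending at each position, which keeps the search quadratic at worst instead of enumerating all possible splits.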