Not all date extractors are born equal: on using the right extractor in the Stanford NLP toolkit

Submitted by srchvrs on Tue, 08/05/2014 - 14:18

If you use the Stanford NLP toolkit, how do you extract dates? One may be tempted to directly use the statistical named entity recognizer (NER) included in the toolkit. A demo of this NER is provided online. One immediate catch is that there are several pre-trained statistical models. The demo code published online uses a 3-class model, which doesn't include dates! To get dates, one has to use the model english.muc.7class.distsim.crf.ser.gz.

The 7-class MUC-trained model works OK, but there are a couple of issues. First of all, it often fails to detect a complete date. Go to the Stanford NER demo page, select the model english.muc.7class.distsim.crf.ser.gz, and enter the text "Barack Hussein Obama was born on 4 August 1961.". The output looks like this:

Barack Hussein Obama was born on 4 August 1961.

Potential tags:
  LOCATION
  TIME
  PERSON
  ORGANIZATION
  MONEY
  PERCENT
  DATE

As you can see, the month and the year were tagged, but not the day of the month. BTW, not all of Barack Obama's name was tagged either. Admittedly, I used a somewhat non-standard date format, but this format occurs frequently on the Web. Another issue is that the statistical tagger does not support date standardization. For example, given the dates August 1961 and 4 August 1961, the statistical NER cannot provide standardized date representations such as 1961-08 and 1961-08-04, which are easy to process and compare.
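
To make the value of standardized dates concrete, here is a minimal Python sketch. It is not SUTime and not part of the Stanford toolkit: it is a toy rule that only handles the two formats mentioned above ("August 1961" and "4 August 1961"), and the names normalize_dates, MONTHS, and DATE_RE are made up for this illustration. It does not check that the day is valid for the month.

```python
import re

# Month names mapped to two-digit month numbers (hypothetical helper table).
MONTHS = {name: f"{i:02d}" for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# An optional day of the month, then a word, then a four-digit year.
DATE_RE = re.compile(r"\b(?:(\d{1,2})\s+)?([A-Za-z]+)\s+(\d{4})\b")

def normalize_dates(text):
    """Return YYYY-MM or YYYY-MM-DD strings for the dates found in text."""
    results = []
    for day, month, year in DATE_RE.findall(text):
        mm = MONTHS.get(month.lower())
        if mm is None:
            continue  # the word before the year is not a month name
        iso = f"{year}-{mm}"
        if day:
            iso += f"-{int(day):02d}"  # zero-pad the day: 4 becomes 04
        results.append(iso)
    return results

if __name__ == "__main__":
    print(normalize_dates("Barack Hussein Obama was born on 4 August 1961."))
    print(normalize_dates("August 1961"))
```

With this sketch, the sentence from the demo yields ['1961-08-04'] and "August 1961" yields ['1961-08']. Once dates are in this form, plain string comparison orders them chronologically, and a prefix check such as "1961-08-04".startswith("1961-08") tells you that a full date falls within a given month.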

How big a deal is it? My evidence is mostly anecdotal, as I do not have a large enough sample to obtain reliable results. Yet, in one of our custom question answering pipelines, I gained about 20% in accuracy by using the rule-based Stanford Temporal Tagger (SUTime) instead of the statistical NER.

Interestingly, SUTime is enabled automatically in the StanfordCoreNLP pipeline when you include the NER annotator. The catch, again, is that it is not included when you use the statistical NER directly. Not only does SUTime have better recall and precision, but it also returns dates in a normalized form. An example of using SUTime is provided by the Stanford folks.

Comments

Submitted by Shriphani Palakodety (not verified) on Tue, 08/05/2014 - 14:44

I have personally found CoreNLP to be super slow for this task. The tool I personally use for timestamp extraction is Natty (https://github.com/joestelmach/natty/). It has reasonably broad coverage (works really well on web-scale corpora like ClueWeb and Common Crawl) and it is a lot faster than the CoreNLP NER pipeline.

Submitted by srchvrs on Tue, 08/05/2014 - 14:56

Thank you! I will check your link. Speed is not a priority right now, but accuracy is.
