If you use the Stanford NLP toolkit, how do you extract dates? One may be tempted to directly use the statistical named entity recognizer (NER) included in the toolkit. A demo of this NER is provided online. One immediate catch is that there are several pre-trained statistical models. The demo code published online uses a 3-class model, which does not include dates! To get dates tagged at all, one has to use the model english.muc.7class.distsim.crf.ser.gz.
The 7-class MUC-trained model works OK, but there are a couple of issues. First of all, it often fails to detect a complete date. Go to the Stanford NER demo page, select the model english.muc.7class.distsim.crf.ser.gz, and enter the text "Barack Hussein Obama was born on 4 August 1961." The output looks like this:
Barack Hussein Obama was born on 4 August 1961.
Potential tags:
LOCATION
TIME
PERSON
ORGANIZATION
MONEY
PERCENT
DATE
As you can see, the month and the year were tagged, but not the day of the month. BTW, not all of Barack Obama's name was tagged either. Admittedly, I used a slightly non-standard date format, but this format occurs frequently on the Web. Another issue is that the statistical tagger does not support date standardization. For example, given the dates August 1961 and 4 August 1961, the statistical NER cannot provide standardized date representations such as 1961-08 and 1961-08-04, which are easy to process and compare.
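To make the idea of standardized dates concrete, here is a minimal Python sketch of a rule-based normalizer. It is not SUTime and not part of the Stanford toolkit: the names MONTHS, DATE_RE, and normalize_dates are my own, and the single regular expression only handles the two formats mentioned above ("August 1961" and "4 August 1961") without validating the day number.

```python
import re

# Month names mapped to their numbers (1-12).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Optional day of the month, then a word, then a four-digit year,
# e.g. "4 August 1961" or "August 1961". The word is checked against
# MONTHS afterwards, so non-month words are discarded.
DATE_RE = re.compile(r"\b(?:(\d{1,2})\s+)?([A-Za-z]+)\s+(\d{4})\b")


def normalize_dates(text):
    """Return ISO-style strings (YYYY-MM or YYYY-MM-DD) for dates found in text."""
    results = []
    for day, month, year in DATE_RE.findall(text):
        number = MONTHS.get(month.lower())
        if number is None:
            continue  # the word before the year is not a month name
        iso = f"{year}-{number:02d}"
        if day:
            iso += f"-{int(day):02d}"
        results.append(iso)
    return results


dates = normalize_dates("given the dates August 1961 and 4 August 1961")
month_only, full_date = dates
# The normalized full date starts with the normalized month, so checking
# that a date falls within a month becomes a simple prefix test.
same_month = full_date.startswith(month_only)
```

Once both mentions are in this form, comparing them is plain string handling, which is exactly what the raw tagger output does not give you.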
How big a deal is this? My evidence is mostly anecdotal, as I do not have a large enough sample to obtain reliable results. Yet, in one of our custom question answering pipelines, I gained about 20% in accuracy by using the rule-based Stanford Temporal Tagger (SUTime) instead of the statistical NER.
Interestingly, SUTime is enabled automatically in the StanfordCoreNLP pipeline when you include the NER annotator. The catch, again, is that it is not included when you use the statistical NER directly. Not only does SUTime have better recall and precision, but it also returns dates in normalized form. An example of using SUTime is provided by the Stanford folks.
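If you would rather not call the Java API directly, one option is to talk to a CoreNLP server over HTTP. The following Python sketch rests on several assumptions that are not part of this post: a CoreNLP server is already running at http://localhost:9000, it accepts a "properties" query parameter with "annotators" and "outputFormat", and its JSON output carries "ner" and "normalizedNER" fields on each token. The function names annotate and normalized_dates are my own.

```python
import json
import urllib.parse
import urllib.request


def annotate(text, server_url="http://localhost:9000"):
    """Send text to a running CoreNLP server and return the parsed JSON.

    Assumes a server is listening at server_url; nothing is started here.
    """
    # Including "ner" in the annotators is what brings SUTime into play.
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner",
             "outputFormat": "json"}
    url = server_url + "/?properties=" + urllib.parse.quote(json.dumps(props))
    request = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def normalized_dates(annotation):
    """Collect the distinct normalized values of DATE tokens, in order."""
    found = []
    for sentence in annotation.get("sentences", []):
        for token in sentence.get("tokens", []):
            if token.get("ner") == "DATE":
                value = token.get("normalizedNER")
                if value and value not in found:
                    found.append(value)
    return found
```

With a server running, normalized_dates(annotate("Barack Hussein Obama was born on 4 August 1961.")) would return the normalized date values that the server reports for that sentence.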
Comments
I have personally found CoreNLP to be super slow for this task. The tool I personally use for timestamp extraction is Natty (https://github.com/joestelmach/natty/). It has reasonably broad coverage (works really well on web-scale corpora like ClueWeb and Common Crawl) and it is a lot faster than the CoreNLP NER pipeline.
Thank you! I will check your link. The speed is not a priority now, but the accuracy is.