dkpro-jwpl - DKPro Java Wikipedia Library (JWPL) is a free, Java-based application programming interface that facilitates access to all information in Wikipedia.
grobid - Machine learning software for extracting information from scholarly documents (highly recommended).
Multivalent - Multivalent works on numerous document types, including PDF, HTML, DVI, UNIX man pages and more. It is especially useful for PDF because it tries to paste word fragments together into whole words. Multivalent also integrates with Lucene.
PDFtoHTML - PDFtoHTML is a utility which converts PDF files into HTML and XML formats. It is based on XPDF.