A tool for extracting plain text from Wikipedia dumps - seems to be one of the best text extractors.
Alternative parsers for Wikipedia dumps
Apache PDFBox - a Java PDF library
Boilerpipe - removes boilerplate HTML code
dkpro-jwpl (DKPro Java Wikipedia Library) - a free, Java-based application programming interface that facilitates access to all information in Wikipedia.
grobid - machine learning software for extracting information from scholarly documents (highly recommended).
HtmlCleaner
HTML Tidy
JSoup - Java HTML parser
Multivalent - works on numerous document types, including PDF, HTML, DVI, UNIX man pages, and more. It is especially useful for PDF because it tries to paste word fragments back together into whole words. Multivalent also integrates with Lucene.
PDFtoHTML - a utility that converts PDF files into HTML and XML formats; it is based on Xpdf.
python-ftfy - ftfy ("fixes text for you") repairs mojibake and other broken Unicode in text.
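ftfy's core trick - detecting text that was decoded with the wrong encoding and reversing the damage - can be sketched with the standard library alone. fix_mojibake below is a hypothetical toy covering one failure mode; the real ftfy.fix_text handles many more cases and decides automatically when a repair is safe.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8-bytes-read-as-Latin-1 damage,
    a toy version of what ftfy detects and repairs automatically."""
    try:
        # Recover the original byte sequence, then decode it correctly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave unchanged

print(fix_mojibake("Ã©"))          # the UTF-8 encoding of "é" read as Latin-1 → é
print(fix_mojibake("plain ASCII"))  # unaffected text passes through → plain ASCII
```

Plain ASCII survives a round trip unchanged, so the toy only rewrites strings that actually decode as valid UTF-8 after re-encoding.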
science-parse - AI2 PDF parser for scientific articles.
scrapinghub - a collection of utilities for web scraping, parsing, and cleaning.
Tika - The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
wikixmlj - easy access to Wikipedia XML dumps.
Xpdf - PDF utilities, including pdftotext.
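The HTML-oriented tools above (HtmlCleaner, HTML Tidy, JSoup, Boilerpipe) all go well beyond naive tag stripping, but the basic task they share - pulling readable text out of markup - can be sketched with Python's standard library. TextExtractor here is a hypothetical helper, not part of any of these tools.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<html><head><style>p{}</style></head>"
                   "<body><p>Hello, <b>world</b>!</p></body></html>"))
# → Hello, world !
```

The real libraries additionally repair malformed markup (HTML Tidy, HtmlCleaner), expose DOM-style queries (JSoup), or strip navigation and ads around the main text (Boilerpipe); this sketch only shows the tag-removal core.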