Document Parsers & Cleaners | searchivarius.org

Parsers for PDF, RTF, DOC, HTML, and other document formats.

A tool for extracting plain text from Wikipedia dumps   - seems to be one of the best text extractors.
Alternative parsers for Wikipedia dumps  
Apache PDFBox   - Java PDF Library
Boilerpipe: remove boilerplate HTML code  
dkpro-jwpl   - (DKPro Java Wikipedia Library) is a free, Java-based application programming interface that facilitates access to all information in Wikipedia.
grobid   - Machine learning software for extracting information from scholarly documents (highly recommended).
Html cleaner  
Html Tidy  
JSoup: Java HTML Parser  
Multivalent   - Multivalent works on numerous document types, including PDF, HTML, DVI, UNIX man pages and more. It is especially useful for PDF, because it tries to paste together word fragments into whole words. Multivalent also integrates with Lucene.
PDFtoHTML   - PDFtoHTML is a utility which converts PDF files into HTML and XML formats. It is based on XPDF.
python-ftfy   - ftfy: fixes broken Unicode text (mojibake)
science-parse   - AI2 PDF parser for scientific articles.
scrapinghub   - A collection of utilities for web scraping, parsing, and cleaning.
Tika   - The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
wikixmlj   - easy access to Wikipedia XML dumps.
Xpdf: PDF utilities, including pdftotext
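
To illustrate the kind of damage the python-ftfy entry above is about: a common case is UTF-8 text that was mistakenly decoded as Latin-1. The sketch below undoes one round of that mistake using only the Python standard library. The helper name `fix_mojibake` is hypothetical and this is not how ftfy is implemented; ftfy's own `fix_text` function handles many more cases and should be preferred in practice.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 text that was mis-decoded as Latin-1.

    Hypothetical helper for illustration only; not part of ftfy.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look like Latin-1 mojibake; leave it unchanged.
        return text


print(fix_mojibake("schÃ¶n"))  # the two garbled characters collapse back into one
print(fix_mojibake("café"))    # already correct text is returned as-is
```

The try/except is the important design choice here: correctly encoded text usually fails the round trip, so it passes through untouched instead of being corrupted.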